
Regression Modeling Strategies

Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Nashville TN 37232 USA
[email protected]
hbiostat.org/rms
hbiostat.org/doc/rms/4day.html

Questions on current topic during class: chat, raise your hand, or turn on video
Written Q&A/discussions during class and offline: navigate from datamethods.org/rms
General questions: stats.stackexchange.com, tag regression-strategies
Course notes: hbiostat.org/doc/rms.pdf (full), hbiostat.org/doc/rms1.pdf (1-day)
Supplemental material: hbiostat.org/bbr (Biostatistics for Biomedical Research)
Blog: fharrell.com   Twitter: @f2harrell   #rmscourse #bbrcourse #StatThink

Drew Griffin Levy PhD, Moderator and Co-Instructor


[email protected]
DoGoodScience.com Twitter: @DrewLevy

Virtual Course 10–13 May 2022

Copyright 1995-2022 FE Harrell All Rights Reserved


Updated May 7, 2022
Contents

1 Introduction  1-1
  1.1 Hypothesis Testing, Estimation, and Prediction  1-1
  1.2 Examples of Uses of Predictive Multivariable Modeling  1-3
  1.3 Misunderstandings about Prediction vs. Classification  1-5
  1.4 Planning for Modeling  1-10
  1.5 Choice of the Model  1-13
  1.6 Model Uncertainty / Data-driven Model Specification  1-14

2 General Aspects of Fitting Regression Models  2-1
  2.1 Notation for Multivariable Regression Models  2-5
  2.2 Model Formulations  2-6
  2.3 Interpreting Model Parameters  2-7
    2.3.1 Nominal Predictors  2-7
    2.3.2 Interactions  2-8
    2.3.3 Example: Inference for a Simple Model  2-9
    2.3.4 Review of Composite (Chunk) Tests  2-11
  2.4 Relaxing Linearity Assumption for Continuous Predictors  2-13
    2.4.1 Avoiding Categorization  2-13
    2.4.2 Simple Nonlinear Terms  2-18
    2.4.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations  2-18
    2.4.4 Cubic Spline Functions  2-21
    2.4.5 Restricted Cubic Splines  2-22
    2.4.6 Choosing Number and Position of Knots  2-28
    2.4.7 Nonparametric Regression  2-30
    2.4.8 Advantages of Regression Splines over Other Methods  2-32
  2.5 Recursive Partitioning: Tree-Based Models  2-34
    2.5.1 New Directions in Predictive Modeling  2-35
    2.5.2 Choosing Between Machine Learning and Statistical Modeling  2-38
  2.6 Multiple Degree of Freedom Tests of Association  2-41
  2.7 Assessment of Model Fit  2-43
    2.7.1 Regression Assumptions  2-43
    2.7.2 Modeling and Testing Complex Interactions  2-47
    2.7.3 Fitting Ordinal Predictors  2-52
    2.7.4 Distributional Assumptions  2-53
  2.8 Complex Curve Fitting Example  2-54

3 Missing Data  3-1
  3.1 Types of Missing Data  3-1
  3.2 Prelude to Modeling  3-2
  3.3 Missing Values for Different Types of Response Variables  3-3
  3.4 Problems With Simple Alternatives to Imputation  3-4
  3.5 Strategies for Developing an Imputation Model  3-7
    3.5.1 Interactions  3-10
  3.6 Single Conditional Mean Imputation  3-11
  3.7 Predictive Mean Matching  3-12
  3.8 Multiple Imputation  3-12
  3.9 Diagnostics  3-17
  3.10 Summary and Rough Guidelines  3-19
    3.10.1 Effective Sample Size  3-20
  3.11 Bayesian Methods for Missing Data  3-22

4 Multivariable Modeling Strategies  4-1
  4.1 Prespecification of Predictor Complexity Without Later Simplification  4-3
    4.1.1 Learning From a Saturated Model  4-6
    4.1.2 Using Marginal Generalized Rank Correlations  4-7
  4.2 Checking Assumptions of Multiple Predictors Simultaneously  4-9
  4.3 Variable Selection  4-10
    4.3.1 Maxwell's Demon as an Analogy to Variable Selection  4-17
  4.4 Overfitting and Limits on Number of Predictors  4-19
  4.5 Shrinkage  4-21
  4.6 Collinearity  4-24
  4.7 Data Reduction  4-26
    4.7.1 Redundancy Analysis  4-27
    4.7.2 Variable Clustering  4-28
    4.7.3 Transformation and Scaling Variables Without Using Y  4-29
    4.7.4 Simultaneous Transformation and Imputation  4-31
    4.7.5 Simple Scoring of Variable Clusters  4-35
    4.7.6 Simplifying Cluster Scores  4-36
    4.7.7 How Much Data Reduction Is Necessary?  4-36
  4.8 Other Approaches to Predictive Modeling  4-39
  4.9 Overly Influential Observations  4-39
  4.10 Comparing Two Models  4-42
  4.11 Improving the Practice of Multivariable Prediction  4-44
  4.12 Summary: Possible Modeling Strategies  4-47
    4.12.1 Developing Predictive Models  4-47
    4.12.2 Developing Models for Effect Estimation  4-49
    4.12.3 Developing Models for Hypothesis Testing  4-50

5 Describing, Resampling, Validating, and Simplifying the Model  5-1
  5.1 Describing the Fitted Model  5-1
    5.1.1 Interpreting Effects  5-1
    5.1.2 Indexes of Model Performance  5-2
  5.2 The Bootstrap  5-6
  5.3 Model Validation  5-11
    5.3.1 Introduction  5-11
    5.3.2 Which Quantities Should Be Used in Validation?  5-14
    5.3.3 Data-Splitting  5-16
    5.3.4 Improvements on Data-Splitting: Resampling  5-17
    5.3.5 Validation Using the Bootstrap  5-18
  5.4 Bootstrapping Ranks of Predictors  5-23
  5.5 Simplifying the Final Model by Approximating It  5-25
    5.5.1 Difficulties Using Full Models  5-25
    5.5.2 Approximating the Full Model  5-25
  5.6 How Do We Break Bad Habits?  5-27

6 R Software  6-1
  6.1 The R Modeling Language  6-2
  6.2 User-Contributed Functions  6-3
  6.3 The rms Package  6-5
  6.4 Other Functions  6-12

7 Modeling Longitudinal Responses using Generalized Least Squares  7-1
  7.1 Notation  7-1
  7.2 Model Specification for Effects on E(Y)  7-3
    7.2.1 Common Basis Functions  7-3
    7.2.2 Model for Mean Profile  7-3
    7.2.3 Model Specification for Treatment Comparisons  7-4
  7.3 Modeling Within-Subject Dependence  7-6
  7.4 Parameter Estimation Procedure  7-11
  7.5 Common Correlation Structures  7-13
  7.6 Checking Model Fit  7-15
  7.7 R Software  7-16
  7.8 Case Study  7-17
    7.8.1 Graphical Exploration of Data  7-17
    7.8.2 Using Generalized Least Squares  7-21
    7.8.3 Bayesian Proportional Odds Random Effects Model  7-30
    7.8.4 Bayesian Markov Semiparametric Model  7-38

8 Binary Logistic Regression  8-1
  8.1 Model  8-2
    8.1.1 Model Assumptions and Interpretation of Parameters  8-3
    8.1.2 Odds Ratio, Risk Ratio, and Risk Difference  8-4
    8.1.3 Detailed Example  8-5
    8.1.4 Design Formulations  8-11
  8.2 Estimation  8-13
    8.2.1 Maximum Likelihood Estimates  8-13
    8.2.2 Estimation of Odds Ratios and Probabilities  8-13
    8.2.3 Minimum Sample Size Requirement  8-13
  8.3 Test Statistics  8-16
  8.4 Residuals  8-17
  8.5 Assessment of Model Fit  8-18
  8.6 Collinearity  8-38
  8.7 Overly Influential Observations  8-38
  8.8 Quantifying Predictive Ability  8-38
  8.9 Validating the Fitted Model  8-40
  8.10 Describing the Fitted Model  8-45
  8.11 Bayesian Logistic Model Example  8-52

9 Logistic Model Case Study: Survival of Titanic Passengers  9-1
  9.1 Descriptive Statistics  9-2
  9.2 Exploring Trends with Nonparametric Regression  9-5
  9.3 Binary Logistic Model with Casewise Deletion of Missing Values  9-8
  9.4 Examining Missing Data Patterns  9-14
  9.5 Single Conditional Mean Imputation  9-17
  9.6 Multiple Imputation  9-21
  9.7 Summarizing the Fitted Model  9-25
  9.8 Bayesian Analysis  9-28

10 Ordinal Logistic Regression  10-1
  10.1 Background  10-1
  10.2 Ordinality Assumption  10-3
  10.3 Proportional Odds Model  10-4
    10.3.1 Model  10-4
    10.3.2 Assumptions and Interpretation of Parameters  10-5
    10.3.3 Estimation  10-5
    10.3.4 Residuals  10-5
    10.3.5 Assessment of Model Fit  10-6
    10.3.6 Quantifying Predictive Ability  10-15
    10.3.7 Describing the Model  10-15
    10.3.8 Validating the Fitted Model  10-16
    10.3.9 R Functions  10-16
  10.4 Continuation Ratio Model  10-18
    10.4.1 Model  10-18
    10.4.2 Assumptions and Interpretation of Parameters  10-19
    10.4.3 Estimation  10-19
    10.4.4 Residuals  10-19
    10.4.5 Assessment of Model Fit  10-20
    10.4.6 Extended CR Model  10-20
    10.4.7 Role of Penalization in Extended CR Model  10-20
    10.4.8 Validating the Fitted Model  10-20
    10.4.9 R Functions  10-20

11 Regression Models for Continuous Y and Case Study in Ordinal Regression  11-1
  11.1 Dataset and Descriptive Statistics  11-3
  11.2 The Linear Model  11-7
    11.2.1 Checking Assumptions of OLS and Other Models  11-7
  11.3 Quantile Regression  11-11
  11.4 Ordinal Regression Models for Continuous Y  11-13
  11.5 Ordinal Regression Applied to HbA1c  11-19
    11.5.1 Checking Fit for Various Models Using Age  11-19
    11.5.2 Examination of BMI  11-24

12 Case Study in Parametric Survival Modeling and Model Approximation  12-1
  12.1 Descriptive Statistics  12-2
  12.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model  12-7
  12.3 Summarizing the Fitted Model  12-14
  12.4 Internal Validation of the Fitted Model Using the Bootstrap  12-18
  12.5 Approximating the Full Model  12-20

13 Case Study in Cox Regression  13-1
  13.1 Choosing the Number of Parameters and Fitting the Model  13-1
  13.2 Checking Proportional Hazards  13-7
  13.3 Testing Interactions  13-9
  13.4 Describing Predictor Effects  13-10
  13.5 Validating the Model  13-11
  13.6 Presenting the Model  13-13

14 Semiparametric Ordinal Longitudinal Models  14-1
  14.1 Longitudinal Ordinal Models as Unifying Concepts  14-1
    14.1.1 General Outcome Attributes  14-1
    14.1.2 What is a Fundamental Outcome Assessment?  14-2
    14.1.3 Examples of Longitudinal Ordinal Outcomes  14-3
    14.1.4 Statistical Model  14-4
  14.2 Case Studies  14-6
  14.3 Case Study For 4-Level Ordinal Longitudinal Outcome  14-8
    14.3.1 Descriptives  14-11
    14.3.2 Model Fitting  14-14
    14.3.3 Covariate Effects  14-19
    14.3.4 Correlation Structure  14-20
    14.3.5 Computing Derived Quantities  14-27
    14.3.6 Bootstrap Confidence Interval for Difference in Mean Time Unwell  14-30
    14.3.7 Notes on Inference  14-33

Bibliography  15-1

An icon in the right margin indicates a hyperlink to a YouTube video related to the subject.

An icon in the right margin may instead be a hyperlink to an audio file elaborating on the notes. Red letters and numbers in the right margin are cues referred to within the audio recordings.

Rotated boxed blue text in the right margin at the start of a section (e.g., howto) is the mnemonic key for linking to archival discussions about that section in the vbiostatcourse.slack.com channel #rms.

blog in the right margin is a link to a blog entry that further discusses the topic.

Course Philosophy

• Modeling is the endeavor to transform data into information, and information into either prediction or evidence about the data generating mechanism^a
• Models are usually the best descriptive statistics
  – adjust for one variable while displaying the association with Y and another variable
  – descriptive statistics do not work in higher dimensions
• Satisfaction of model assumptions improves precision and increases statistical power
  – Be aware of assumptions, especially those mattering the most
• It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong
  – Model diagnostics are often not actionable
  – Changing the model in reaction to observed patterns ↑ actual uncertainty but shows up as an apparent ↓ in uncertainty
• Graphical methods should be married to formal inference
• Overfitting occurs frequently, so data reduction and model validation are important
• Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly
• Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one
  – E.g. small N and overfitting vs. a carefully formulated right-hand side of the model
• Methods which work for all types of regression models are the most valuable
• In most research projects the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model
  – A $100 analysis can make a $1,000,000 study worthless
• The bootstrap is a breakthrough for statistical modeling and model validation
• Bayesian modeling is ready for prime time
  – Can incorporate non-data knowledge
  – Provides full exact inferential tools even when penalizing β
  – Rational way to account for model uncertainty
  – Direct inference: evidence for all possible values of β
  – More accurate way of dealing with missing data
• Using the data to guide the data analysis is almost as dangerous as not doing so
• A good overall strategy is to decide how many degrees of freedom (i.e., number of regression parameters) can be "spent", where they should be spent, and to spend them with no regrets

See the excellent text Clinical Prediction Models by Steyerberg [179].

^a Thanks to Drew Levy for ideas that greatly improved this section.
Chapter 1

Introduction

1.1 Hypothesis Testing, Estimation, and Prediction

intro-hep

Even when only testing H0, a model-based approach has advantages: A

• Permutation and rank tests are not as useful for estimation
• They cannot readily be extended to cluster sampling or repeated measurements
• Models generalize tests
  – 2-sample t-test, ANOVA → multiple linear regression
  – Wilcoxon, Kruskal-Wallis, Spearman → proportional odds ordinal logistic model
  – log-rank → Cox

• Models not only allow for multiplicity adjustment but also for shrinkage of estimates
  – Statisticians are comfortable with P-value adjustment but fail to recognize that the estimate of the difference between the most different treatments is badly biased

Statistical estimation is usually model-based B

• Relative effect of increasing cholesterol from 200 to 250 mg/dl on the hazard of death, holding other risk factors constant
• Adjustment depends on how other risk factors relate to the hazard
• Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects

1.2 Examples of Uses of Predictive Multivariable Modeling

intro-ex

• Financial performance, consumer purchasing, loan pay-back C
• Ecology
• Product life
• Employment discrimination
• Medicine, epidemiology, health services research
• Probability of diagnosis, time course of a disease
• Checking that a previously developed summary index (e.g., BMI) adequately summarizes its component variables
• Developing new summary indexes by how variables predict an outcome
• Comparing non-randomized treatments
• Getting the correct estimate of relative effects in randomized studies requires covariable adjustment if the model is nonlinear
  – Crude odds ratios are biased towards 1.0 if the sample is heterogeneous
• Estimating absolute treatment effect (e.g., risk difference)
  – Use e.g. the difference in two predicted probabilities
• Cost-effectiveness ratios
  – incremental cost / incremental ABSOLUTE benefit
  – most studies use avg. cost difference / avg. benefit, which may apply to no one

1.3 Misunderstandings about Prediction vs. Classification

intro-classify

• Many analysts desire to develop "classifiers" instead of predictions

Blog: Classification vs. Prediction

• Outside of, for example, visual or sound pattern recognition, classification represents a premature decision
• See this blog for details
• Suppose that D
  1. the response variable is binary
  2. the two levels represent a sharp dichotomy with no gray zone (e.g., complete success vs. total failure with no possibility of a partial success)
  3. one is forced to assign (classify) future observations to only these two choices
  4. the cost of misclassification is the same for every future observation, and the ratio of the cost of a false positive to the cost of a false negative equals the (often hidden) ratio implied by the analyst's classification rule
• Then classification is still suboptimal for driving the development of a predictive instrument as well as for hypothesis testing and estimation

• Classification and its associated classification accuracy measure (the proportion classified "correctly") are very sensitive to the relative frequencies of the outcome variable. If a classifier is applied to another dataset with a different outcome prevalence, the classifier may no longer apply.
• Far better is to use the full information in the data to develop a probability model, then develop classification rules on the basis of estimated probabilities
  – ↑ power, ↑ precision, ↑ decision making
• Classification is more problematic if the response variable is ordinal or continuous or the groups are not truly distinct (e.g., disease or no disease when severity of disease is on a continuum); dichotomizing it up front for the analysis is not appropriate
  – the minimum loss of information (when dichotomization is at the median) is large
  – may require the sample size to increase many-fold to compensate for the loss of information [69]
• Two-group classification represents an artificial forced choice
  – the best option may be "no choice, get more data"
• Unlike prediction (e.g., of absolute risk), classification implicitly uses utility (loss; cost of false positive or false negative) functions
• Hidden problems:
  – The utility function depends on variables not collected (subjects' preferences) that are available only at the decision point
  – Assumes every subject has the same utility function
  – Assumes this function coincides with the analyst's
• Formal decision analysis uses
  – optimum predictions using all available data
  – subject-specific utilities, which are often based on variables not predictive of the outcome
• ROC analysis is misleading except for the special case of mass one-time group decision making with unknowable utilities^a

See [201, 28, 73, 24, 66, 77].

^a To make an optimal decision you need to know all relevant data about an individual (used to estimate the probability of an outcome), and the utility (cost, loss function) of making each decision. Sensitivity and specificity do not provide this information. For example, if one estimated that the probability of a disease given age, sex, and symptoms is 0.1 and the "cost" of a false positive equaled the "cost" of a false negative, one would act as if the person does not have the disease. Given other utilities, one would make different decisions. If the utilities are unknown, one gives the best estimate of the probability of the outcome to the decision maker and lets her incorporate her own unspoken utilities in making an optimum decision for her.
Besides the fact that cutoffs do not apply to individuals, only to groups, individual decision making does not utilize sensitivity and specificity. For an individual we can compute Prob(Y = 1|X = x); we don't care about Prob(Y = 1|X > c), and an individual having X = x would be quite puzzled if she were given Prob(X > c|future unknown Y) when she already knows X = x, so X is no longer a random variable.
Even when group decision making is needed, sensitivity and specificity can be bypassed. For mass marketing, for example, one can rank-order individuals by the estimated probability of buying the product, to create a lift curve. This is then used to target the k most likely buyers, where k is chosen to meet total program cost constraints.

Accuracy score used to drive model building should be a con-


tinuous score that utilizes all of the information in the data.
In summary: E

ˆ Classification is a forced choice — a decision.

ˆ Decisions require knowledge of the cost or utility of making


an incorrect decision.

ˆ Predictions are made without knowledge of utilities.

ˆ A prediction can lead to better decisions than classification.


For example suppose that one has an estimate of the risk
of an event, P̂ . One might make a decision if P̂ < 0.10
or P̂ > 0.90 in some situations, even without knowledge of
utilities. If on the other hand P̂ = 0.6 or the confidence
interval for P is wide, one might
– make no decision and instead opt to collect more data

– make a tentative decision that is revisited later

– make a decision using other considerations such as the


infusion of new resources that allow targeting a larger
number of potential customers in a marketing campaign
F

The Dichotomizing Motorist

ˆ The speed limit is 60.


CHAPTER 1. INTRODUCTION 1-9

ˆ I am going faster than the speed limit.

ˆ Will I be caught?

An answer by a dichotomizer: G

ˆ Are you going faster than 70?

An answer from a better dichotomizer: H

ˆ If you are among other cars, are you going faster than 73?

ˆ If you are exposed are your going faster than 67?

Better: I

ˆ How fast are you going and are you exposed?

Analogy to most medical diagnosis research in which +/- di-


agnosis is a false dichotomy of an underlying disease severity:
J

ˆ The speed limit is moderately high.

ˆ I am going fairly fast.

ˆ Will I be caught?
CHAPTER 1. INTRODUCTION 1-10

1.4 Planning for Modeling

intro-plan

• Chance that the predictive model will be used [159] K
• Response definition, follow-up
• Variable definitions
• Observer variability
• Missing data
• Preference for continuous variables
• Subjects
• Sites

What can keep a sample of data from being appropriate for modeling: L

1. Most important predictor or response variables not collected
2. Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
3. Data collection sites do not represent the population of sites
4. Key variables missing in large numbers of subjects
5. Data not missing at random
6. No operational definitions for key variables and/or measurement errors severe
7. No observer variability studies done

What else can go wrong in modeling? M

1. The process generating the data is not stable.
2. The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
3. The model is misspecified in terms of the transformation of the response variable or the model's distributional assumptions.
4. The model contains discontinuities (e.g., by categorizing continuous predictors or fitting regression shapes with sudden changes) that can be gamed by users.
5. Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
6. The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
7. The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
8. Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.

Iezzoni [102] lists these dimensions to capture, for patient outcome studies: N

1. age
2. sex
3. acute clinical stability
4. principal diagnosis
5. severity of principal diagnosis
6. extent and severity of comorbidities
7. physical functional status
8. psychological, cognitive, and psychosocial functioning
9. cultural, ethnic, and socioeconomic attributes and behaviors
10. health status and quality of life
11. patient attitudes and preferences for outcomes

General aspects to capture in the predictors: O

1. baseline measurement of the response variable
2. current status
3. trajectory as of time zero, or past levels of a key variable
4. variables explaining much of the variation in the response
5. more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational study

1.5 Choice of the Model

intro-choice

• In biostatistics and epidemiology and most other areas we usually choose the model empirically P
• The model must use data efficiently
• Should model overall structure (e.g., acute vs. chronic)
• Robust models are better
• Should have correct mathematical structure (e.g., constraints on probabilities)

1.6 Model Uncertainty / Data-driven Model Specification

intro-uncertainty

• Standard errors, C.L., P-values, and R² are wrong if computed as if the model were pre-specified Q
• Stepwise variable selection is widely used and abused
• The bootstrap can be used to repeat all analysis steps to properly penalize variances, etc.
• Ye [221]: "generalized degrees of freedom" (GDF) for any "data mining" or model selection procedure based on least squares
  – Example: 20 candidate predictors, n = 22, forward stepwise, best 5-variable model: GDF = 14.1
  – Example: CART, 10 candidate predictors, n = 100, 19 nodes: GDF = 76
• See [131] for an approach involving adding noise to Y to improve variable selection
• Another example: the t-test to compare two means (see the simulation sketch at the end of this section)
  – The basic test assumes equal variances and normally distributed data
  – Typically one examines the two sample distributions to decide whether to transform Y or switch to a different test
  – One examines the two SDs to decide whether to use the standard test or switch to a Welch t-test
  – The final confidence interval for the mean difference is conditional on the final choices being correct
  – Ignores model uncertainty
  – The confidence interval will not have the claimed coverage
  – Get proper coverage by adding parameters for what you don't know
    * Bayesian t-test: parameters for the variance ratio and for the d.f. of a t-distribution for the raw data (allows heavy tails)
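For concreteness, here is a minimal simulation sketch of the t-test example above. Everything in it (sample sizes, SD ratio, the 1.5 pretest threshold) is an arbitrary assumption for illustration, not from the notes. It estimates the actual coverage of a nominal 0.95 confidence interval when the analyst first compares the two sample SDs and then chooses between the pooled-variance and Welch tests:

sim <- function(n1 = 40, n2 = 10, sd2 = 2, reps = 10000) {
  covered <- logical(reps)
  for (i in 1:reps) {
    x <- rnorm(n1); y <- rnorm(n2, sd = sd2)   # true difference in means is 0
    # data-driven rule (arbitrary): pool unless the SDs look "different enough"
    eq <- max(sd(x), sd(y)) / min(sd(x), sd(y)) < 1.5
    ci <- t.test(x, y, var.equal = eq)$conf.int
    covered[i] <- ci[1] < 0 && ci[2] > 0
  }
  mean(covered)   # actual coverage of the nominal 0.95 interval
}
set.seed(1)
sim()   # in this configuration, coverage falls below the claimed 0.95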
Chapter 2

General Aspects of Fitting Regression Models

Regression modeling meets many analytic needs:

genreg-intro

• Prediction, capitalizing on efficient estimation methods such as maximum likelihood and the predominant additivity in a variety of problems
  – E.g.: effects of age, smoking, and air quality add to predict lung capacity
  – When effects are predominantly additive, or when there aren't too many interactions and one knows the likely interacting variables in advance, regression can beat machine learning techniques that assume interaction effects are likely to be as strong as main effects
• Separating the effects of variables (especially exposure and treatment)
• Hypothesis testing
• Deep understanding of the uncertainties associated with all model components
  – Simplest example: confidence interval for the slope of a predictor
  – Confidence intervals for predicted values; simultaneous confidence intervals for a series of predicted values
    * E.g.: confidence band for Y over a series of values of X

Alternative: Stratification

• Cross-classify subjects on the basis of the Xs, estimate a property of Y for each stratum
• Only handles a small number of Xs
• Does not handle continuous X

Alternative: Single Trees (recursive partitioning/CART)

• Interpretable because they are over-simplified and usually wrong
• Cannot separate effects
• Find spurious interactions
• Require huge sample sizes
• Do not handle continuous X effectively; this results in very heterogeneous nodes because of incomplete conditioning
• Tree structure is unstable, so insights are fragile

Alternative: Machine Learning

• E.g. random forests, bagging, boosting, support vector machines, neural networks, deep learning
• Allows for high-order interactions and does not require prespecification of interaction terms
• Almost automatic; can save analyst time and do the analysis in one step (long computing time)
• Uninterpretable black box
• Effects of individual predictors are not separable
• Interaction effects (e.g., differential treatment effect = precision medicine = personalized medicine) not available
• Because of not using prior information about the dominance of additivity, can require 200 events per candidate predictor when Y is binary [195]
  – Logistic regression may require 20 events per candidate predictor
  – Can create a demand for "big data" where additive statistical models can work on moderate-size data
  – See this article in Harvard Business Review for more about regression vs. complex methods

2.1 Notation for Multivariable Regression Models

genreg-notation

• Weighted sum of a set of independent or predictor variables
• Interpret parameters and state assumptions by linearizing the model with respect to the regression coefficients A
• Analysis of variance setups, interaction effects, nonlinear effects
• Examining the 2 regression assumptions

Y            response (dependent) variable
X            X1, X2, ..., Xp – list of predictors
β            β0, β1, ..., βp – regression coefficients
β0           intercept parameter (optional)
β1, ..., βp  weights or regression coefficients
Xβ           β0 + β1X1 + ... + βpXp, with X0 = 1

Model: connection between X and Y B

C(Y|X): a property of the distribution of Y given X, e.g. C(Y|X) = E(Y|X) or Prob{Y = 1|X}.

2.2 Model Formulations

genreg-modform

General regression model:

C(Y|X) = g(X).

General linear regression model:

C(Y|X) = g(Xβ).

Examples C

C(Y|X) = E(Y|X) = Xβ, with Y|X ∼ N(Xβ, σ²)
C(Y|X) = Prob{Y = 1|X} = (1 + exp(−Xβ))⁻¹

Linearize: h(C(Y|X)) = Xβ, where h(u) = g⁻¹(u)

Example: D

C(Y|X) = Prob{Y = 1|X} = (1 + exp(−Xβ))⁻¹
h(u) = logit(u) = log(u/(1 − u))
h(C(Y|X)) = C′(Y|X)  (link)

General linear regression model: C′(Y|X) = Xβ.
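As a quick numeric check of the logistic link pair (this uses base R's built-in plogis/qlogis; nothing here is specific to the course notes):

p  <- 0.8
xb <- qlogis(p)            # h(u) = logit(u) = log(u/(1 - u))
c(xb, log(p / (1 - p)))    # identical values
plogis(xb)                 # g(Xb) = (1 + exp(-Xb))^-1 recovers p = 0.8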
2.3 Interpreting Model Parameters

genreg-interp

Suppose that Xj is linear and doesn't interact with other X's^a. E

C′(Y|X) = Xβ = β0 + β1X1 + ... + βpXp
βj = C′(Y|X1, X2, ..., Xj + 1, ..., Xp) − C′(Y|X1, X2, ..., Xj, ..., Xp)

Hereafter drop the ′ from C′ and assume C(Y|X) is the property of Y that is linearly related to the weighted sum of the X's.

^a Note that it is not necessary to "hold constant" all other variables to be able to interpret the effect of one predictor. It is sufficient to hold constant the weighted sum of all the variables other than Xj. And in many cases it is not physically possible to hold other variables constant while varying one, e.g., when a model contains X and X² (David Hoaglin, personal communication).

2.3.1 Nominal Predictors

A nominal (polytomous) factor with k levels is coded as k − 1 dummy variables. E.g. T = J, K, L, M: F

C(Y|T = J) = β0
C(Y|T = K) = β0 + β1
C(Y|T = L) = β0 + β2
C(Y|T = M) = β0 + β3

so that C(Y|T) = Xβ = β0 + β1X1 + β2X2 + β3X3, where

X1 = 1 if T = K, 0 otherwise
X2 = 1 if T = L, 0 otherwise
X3 = 1 if T = M, 0 otherwise.

The test for any differences in the property C(Y) between treatments is H0: β1 = β2 = β3 = 0.
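A sketch of this coding in R (simulated data; the number of observations per level and the effect sizes are arbitrary assumptions for illustration):

set.seed(1)
trt <- factor(rep(c('J', 'K', 'L', 'M'), each = 40))   # J = reference cell
y   <- c(J = 0, K = 1, L = 2, M = 3)[as.character(trt)] + rnorm(160)
head(model.matrix(~ trt))   # dummies X1 = [T=K], X2 = [T=L], X3 = [T=M]
f <- lm(y ~ trt)
coef(f)    # intercept = C(Y|T=J); trtK, trtL, trtM estimate beta1..beta3
anova(f)   # 3 d.f. test of H0: beta1 = beta2 = beta3 = 0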

2.3.2 Interactions G

genreg-interp-interact

Suppose the effect of X1 on Y depends on the level of X2. One way to describe this interaction is to add X3 = X1X2 to the model:

C(Y|X) = β0 + β1X1 + β2X2 + β3X1X2.

Then the effect of a one-unit increase in X1 is

C(Y|X1 + 1, X2) − C(Y|X1, X2)
  = β0 + β1(X1 + 1) + β2X2 + β3(X1 + 1)X2 − [β0 + β1X1 + β2X2 + β3X1X2]
  = β1 + β3X2.

Similarly, the effect of a one-unit increase in X2 on C(Y|X) is β2 + β3X1.

Worse interactions: if X1 is binary, the interaction may take the form of a difference in shape (and/or distribution) of X2 vs. C(Y) depending on whether X1 = 0 or X1 = 1 (e.g. logarithm vs. square root). This paper describes how interaction effects can be misleading.

2.3.3 Example: Inference for a Simple Model

Postulate the model C(Y|age, sex) = β0 + β1 age + β2(sex = f) + β3 age(sex = f), where sex = f is a dummy indicator variable for sex = female, i.e., the reference cell is sex = male^b.

The model assumes H

1. age is linearly related to C(Y) for males,
2. age is linearly related to C(Y) for females,
3. the interaction between age and sex is simple, and
4. whatever distribution, variance, and independence assumptions are appropriate for the model being considered.

Interpretations of parameters: I

Parameter  Meaning
β0         C(Y|age = 0, sex = m)
β1         C(Y|age = x + 1, sex = m) − C(Y|age = x, sex = m)
β2         C(Y|age = 0, sex = f) − C(Y|age = 0, sex = m)
β3         C(Y|age = x + 1, sex = f) − C(Y|age = x, sex = f) − [C(Y|age = x + 1, sex = m) − C(Y|age = x, sex = m)]

β3 is the difference in slopes (female − male).

When a high-order effect such as an interaction effect is in the model, be sure to interpret low-order effects by finding out what makes the interaction effect ignorable. In our example, the interaction effect is zero when age = 0 or sex is male.

^b You can also think of the last part of the model as β3X3, where X3 = age × I[sex = f].

Hypotheses that are usually inappropriate: J

1. H0: β1 = 0. This tests whether age is associated with Y for males.
2. H0: β2 = 0. This tests whether sex is associated with Y for zero-year-olds.

More useful hypotheses follow. For any hypothesis one needs to: K

• Write what is being tested
• Translate to the parameters tested
• List the alternative hypothesis
• Not forget what the test is powered to detect
  – A test against a nonzero slope has maximum power when linearity holds
  – If the true relationship is monotonic, a test for non-flatness will have some, but not optimal, power
  – A test against a quadratic (parabolic) shape will have some power to detect a logarithmic shape but not a sine wave over many cycles
• It is useful to write e.g. "Ha: age is associated with C(Y), powered to detect a linear relationship"

Most Useful Tests for the Linear age × sex Model

Null or Alternative Hypothesis                                   Mathematical Statement
Effect of age is independent of sex, or
  effect of sex is independent of age, or
  age and sex are additive (age effects are parallel)            H0: β3 = 0
age interacts with sex; age modifies the effect of sex;
  sex modifies the effect of age;
  sex and age are non-additive (synergistic)                     Ha: β3 ≠ 0
age is not associated with Y                                     H0: β1 = β3 = 0
age is associated with Y; age is associated with Y
  for either females or males                                    Ha: β1 ≠ 0 or β3 ≠ 0
sex is not associated with Y                                     H0: β2 = β3 = 0
sex is associated with Y; sex is associated with Y
  for some value of age                                          Ha: β2 ≠ 0 or β3 ≠ 0
Neither age nor sex is associated with Y                         H0: β1 = β2 = β3 = 0
Either age or sex is associated with Y                           Ha: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0

L

Note: The last test is called the global test of no association. If an interaction effect is present, there is both an age and a sex effect. There can also be age or sex effects when the lines are parallel. The global test of association (test of total association) has 3 d.f. instead of 2 (age + sex) because it allows for unequal slopes.
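A sketch of how these tests can be run with the rms package (simulated data; the variable names and effect sizes are assumptions for illustration):

require(rms)
set.seed(2)
n   <- 200
age <- runif(n, 20, 80)
sex <- factor(sample(c('m', 'f'), n, TRUE), levels = c('m', 'f'))  # male = reference
y   <- 0.05 * age + 1 * (sex == 'f') + 0.03 * age * (sex == 'f') + rnorm(n)
dd  <- datadist(age, sex); options(datadist = 'dd')
f   <- ols(y ~ age * sex)
anova(f)   # age (2 d.f.), sex (2 d.f.), age x sex (1 d.f.), TOTAL (3 d.f.)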
2.3.4 Review of Composite (Chunk) Tests

genreg-chunk

• In the model

  y ~ age + sex + weight + waist + tricep

  we may want to jointly test the association between all body measurements and the response, holding age and sex constant. M
• This 3 d.f. test may be obtained two ways (see the sketch below):
  – Remove the 3 variables and compute the change in SSR or SSE
  – Test H0: β3 = β4 = β5 = 0 using matrix algebra (e.g., anova(fit, weight, waist, tricep) if fit is a fit object created by the R rms package)
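A sketch of both routes to the same 3 d.f. chunk test (simulated data; all variable names and effect sizes are assumptions for illustration):

require(rms)
set.seed(3)
n <- 100
d <- data.frame(age    = rnorm(n, 50, 10),
                sex    = factor(sample(c('m', 'f'), n, TRUE)),
                weight = rnorm(n, 80, 15),
                waist  = rnorm(n, 90, 10),
                tricep = rnorm(n, 20, 5))
d$y <- with(d, 0.1 * age + 0.05 * weight + rnorm(n))
f <- ols(y ~ age + sex + weight + waist + tricep, data = d)
anova(f, weight, waist, tricep)   # matrix-algebra route: 3 d.f. chunk test
# Change-in-SSE route gives the same F statistic:
full <- lm(y ~ age + sex + weight + waist + tricep, data = d)
sub  <- lm(y ~ age + sex, data = d)
anova(sub, full)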

2.4 Relaxing Linearity Assumption for Continuous Predictors

genreg-nonlinear

2.4.1 Avoiding Categorization

genreg-cat

"Natura non facit saltus" (Nature does not make jumps) – Gottfried Wilhelm Leibniz
(Illustration: Lucy D'Agostino McGowan) N

• Relationships are seldom linear, except when predicting one variable from itself measured earlier
• Categorizing continuous predictors into intervals is a disaster; see references [164, 2, 97, 117, 4, 13, 67, 158, 185, 30, 135, 169, 5, 99, 139, 206, 69, 75, 44, 17] and Biostatistics for Biomedical Research, Chapter 18
• Some problems caused by this approach:

1. Estimated values have reduced precision, and associated tests have reduced power
2. Categorization assumes the relationship between predictor and response is flat within intervals; this is far less reasonable than a linearity assumption in most cases
3. To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are required
4. Because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual confounding
5. Categorization assumes that there is a discontinuity in response as interval boundaries are crossed. Other than the effect of time (e.g., an instant stock price drop after bad news), there are very few examples in which such discontinuities have been shown to exist.
6. Categorization only seems to yield interpretable estimates. E.g., the odds ratio for stroke for persons with systolic blood pressure > 160 mmHg compared to persons with blood pressure ≤ 160 mmHg → the interpretation of the OR depends on the distribution of blood pressures in the sample (the proportion of subjects > 170, > 180, etc.). If blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
7. Categorization does not condition on full information. When, for example, the risk of stroke is being assessed for a new subject with a known blood pressure (say 162 mmHg), the subject does not report to her physician "my blood pressure exceeds 160" but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
8. If cutpoints are determined in a way that is not blinded to the response variable, calculation of P-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. E.g.: cutpoints chosen by trial and error utilizing Y, even informally → P-values too small and CLs not accurate^c.
9. Categorization not blinded to Y → biased effect estimates [4, 169]
10. "Optimal" cutpoints do not replicate over studies. Hollander et al. [99] state that "... the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the 'optimal' cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology." Giannoni et al. [75] demonstrated that many claimed "optimal cutpoints" are just the observed median values in the sample, which happen to optimize statistical power for detecting a separation in outcomes.
11. Disagreements in cutpoints (which are bound to happen whenever one searches for things that do not exist) cause severe interpretation problems. One study may provide an odds ratio for comparing body mass index (BMI) > 30 with BMI ≤ 30, another for comparing BMI > 28 with BMI ≤ 28. Neither of these has a good definition and the two estimates are not comparable.
12. Cutpoints are arbitrary and manipulatable; cutpoints can be found that result in both positive and negative associations [206].
13. If a confounder is adjusted for by categorization, there will be residual confounding that can be explained away by inclusion of the continuous form of the predictor in the model in addition to the categories.

^c If a cutpoint is chosen that minimizes the P-value and the resulting P-value is 0.05, the true type I error can easily be above 0.5 [99].

• To summarize: the use of a (single) cutpoint c makes many assumptions, including: O
  1. The relationship between X and Y is discontinuous at X = c and only at X = c
  2. c is correctly found as the cutpoint
  3. X vs. Y is flat to the left of c
  4. X vs. Y is flat to the right of c
  5. The choice of c does not depend on the values of other predictors

Interactive demonstration of the power loss of categorization vs. straight line and quadratic fits in OLS, with varying degree of nonlinearity and noise added to X (must run in RStudio):

require(Hmisc)
getRs('catgNoise.r')

Example^d of misleading results from creating intervals (here, deciles) of a continuous predictor. The final interval is extremely heterogeneous and is greatly influenced by very large glycohemoglobin values, creating the false impression of an inflection point at 5.9.

See this for excellent graphical examples of the harm of categorizing predictors, especially when using quantile groups.

^d From NHANES III; Diabetes Care 32:1327-34; 2009, adapted from Diabetes Care 20:1183-1197; 1997.

2.4.2 Simple Nonlinear Terms

genreg-poly

C(Y|X1) = β0 + β1X1 + β2X1².  P

• H0: model is linear in X1 vs. Ha: model is quadratic in X1 ≡ H0: β2 = 0
• The test of linearity may be powerful if the true model is not extremely non-parabolic
• Predictions are not accurate in general, as many phenomena are non-quadratic
• Can get more flexible fits by adding powers higher than 2
• But polynomials do not adequately fit logarithmic functions or "threshold" effects, and have unwanted peaks and valleys
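A sketch of the quadratic fit and the 1 d.f. linearity test (simulated data with a logarithmic truth; rms's pol() is one convenient way to add the squared term):

require(rms)
set.seed(4)
x <- runif(200, 0, 10)
y <- log1p(x) + rnorm(200, sd = 0.3)   # truth is logarithmic, not quadratic
dd <- datadist(x); options(datadist = 'dd')
f <- ols(y ~ pol(x, 2))                # beta0 + beta1*x + beta2*x^2
anova(f)                               # "Nonlinear" row tests H0: beta2 = 0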

2.4.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

genreg-spline

Draftsman's spline: a flexible strip of metal or rubber used to trace curves. Q

Spline function: a piecewise polynomial.

Linear spline function: a piecewise linear function.

• Bilinear regression: the model is β0 + β1X if X ≤ a, and β2 + β3X if X > a
• Problem with this notation: the two lines are not constrained to join
• To force simple continuity: β0 + β1X + β2(X − a) × I[X > a] = β0 + β1X1 + β2X2, where X2 = (X1 − a) × I[X1 > a]
• The slope is β1 for X ≤ a and β1 + β2 for X > a
• β2 is the slope increment as you pass a

More generally, the X-axis is divided into intervals with endpoints a, b, c (knots):

f(X) = β0 + β1X + β2(X − a)+ + β3(X − b)+ + β4(X − c)+,

where (u)+ = u if u > 0, and 0 if u ≤ 0. Thus

f(X) = β0 + β1X,                                        X ≤ a
     = β0 + β1X + β2(X − a),                            a < X ≤ b
     = β0 + β1X + β2(X − a) + β3(X − b),                b < X ≤ c
     = β0 + β1X + β2(X − a) + β3(X − b) + β4(X − c),    c < X.
Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.

Written as a regression model,

C(Y|X) = f(X) = Xβ,

where Xβ = β0 + β1X1 + β2X2 + β3X3 + β4X4, and

X1 = X            X2 = (X − a)+
X3 = (X − b)+     X4 = (X − c)+.

Overall linearity in X can be tested by testing H0: β2 = β3 = β4 = 0.
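A sketch of fitting this linear spline and running the 3 d.f. linearity test in R (simulated data; knots at the a = 1, b = 3, c = 5 of Figure 2.1; rms's lsp() builds the (X − t)+ basis):

require(rms)
set.seed(5)
X <- runif(300, 0, 6)
Y <- pmin(X, 3) + rnorm(300, sd = 0.5)   # truth: rises, then flattens at X = 3
f <- ols(Y ~ lsp(X, c(1, 3, 5)))         # knots a = 1, b = 3, c = 5
anova(f)   # "Nonlinear" row: 3 d.f. test of H0: beta2 = beta3 = beta4 = 0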

2.4.4 Cubic Spline Functions

genreg-cubic

Cubic splines are smooth at the knots (the function and its first and second derivatives agree there), so one cannot see the joins. S

f(X) = β0 + β1X + β2X² + β3X³ + β4(X − a)³+ + β5(X − b)³+ + β6(X − c)³+ = Xβ

X1 = X           X2 = X²
X3 = X³          X4 = (X − a)³+
X5 = (X − b)³+   X6 = (X − c)³+.

k knots → k + 3 coefficients excluding the intercept.

The X² and X³ terms must be included to allow nonlinearity when X < a.

stats.stackexchange.com/questions/421964 has some useful descriptions of what happens at the knots, e.g.: knots are where different cubic polynomials are joined, and cubic splines force there to be three levels of continuity (the function, its slope, and its acceleration or second derivative (slope of the slope) do not change) at these points. At the knots the jolt (third derivative, or rate of change of acceleration) is allowed to change suddenly, meaning the jolt is allowed to be discontinuous at the knots. Between knots, the jolt is constant.

The following graphs show the function and its first three derivatives (all further derivatives are zero) for the function f(x) = x + x² + 2x³ + 10(x − 0.25)³+ − 50(x − 0.5)³+ − 100(x − 0.75)³+ for x going from 0 to 1, with three knots, at x = 0.25, 0.5, 0.75.
x  <- seq(0, 1, length = 500)
x1 <- pmax(x - .25, 0)
x2 <- pmax(x - .50, 0)
x3 <- pmax(x - .75, 0)
b1 <- 1; b2 <- 1; b3 <- 2; b4 <- 10; b5 <- -50; b6 <- -100
y  <- b1*x + b2*x^2 + b3*x^3 + b4*x1^3 + b5*x2^3 + b6*x3^3
y1 <- b1 + 2*b2*x + 3*b3*x^2 + 3*b4*x1^2 + 3*b5*x2^2 + 3*b6*x3^2   # first derivative
y2 <- 2*b2 + 6*b3*x + 6*b4*x1 + 6*b5*x2 + 6*b6*x3                  # second derivative
y3 <- 6*b3 + 6*b4*(x1 > 0) + 6*b5*(x2 > 0) + 6*b6*(x3 > 0)         # third derivative

g <- function() abline(v = (1:3)/4, col = gray(.85))   # mark the knots

plot(x, y, type = 'l', ylab = ''); g()
text(0, 1.5, 'Function', adj = 0)

plot(x, y1, type = 'l', ylab = ''); g()
text(0, -15, 'First Derivative: Slope\nRate of Change of Function', adj = 0)

plot(x, y2, type = 'l', ylab = ''); g()
text(0, -125, 'Second Derivative: Acceleration\nRate of Change of Slope', adj = 0)

plot(x, y3, type = 'l', ylab = ''); g()
text(0, -400, 'Third Derivative: Jolt\nRate of Change of Acceleration', adj = 0)

2.4.5 Restricted Cubic Splines

genreg-rcs

Stone and Koo [184]: cubic splines can be poorly behaved in the tails. The remedy is to constrain the function to be linear in the tails. T

This reduces k + 3 parameters to k − 1 parameters [56].

To force linearity when X < a, the X² and X³ terms must be omitted.

Figure 2.2: A regular cubic spline function with three levels of continuity that prevent the human eye from detecting the knots, along with its first three derivatives. Knots are located at x = 0.25, 0.5, 0.75. For x beyond the outer knots, the function is not restricted to be linear; linearity would imply an acceleration of zero. Vertical lines are drawn at the knots.

To force linearity when X > the last knot, the last two βs are redundant, i.e., they are just combinations of the other βs.

The restricted spline function with k knots t1, ..., tk is given by [56]

f(X) = β0 + β1X1 + β2X2 + ... + βk−1Xk−1,

where X1 = X and, for j = 1, ..., k − 2, U

Xj+1 = (X − tj)³+ − (X − tk−1)³+ (tk − tj)/(tk − tk−1) + (X − tk)³+ (tk−1 − tj)/(tk − tk−1).

Xj+1 is linear in X for X ≥ tk.

For numerical behavior and to put all basis functions for X on the same scale, the R Hmisc and rms package functions by default divide the terms above by τ = (tk − t1)².

require(Hmisc)
x  <- rcspline.eval(seq(0, 1, .01),
                    knots = seq(.05, .95, length = 5), inclx = TRUE)
xm <- x
xm[xm > .0106] <- NA
matplot(x[, 1], xm, type = 'l', ylim = c(0, .01),
        xlab = expression(X), ylab = '', lty = 1)
matplot(x[, 1], x, type = 'l',
        xlab = expression(X), ylab = '', lty = 1)

Figure 2.3: Restricted cubic spline component variables for k = 5 and knots at X = .05, .275, .5, .725, and .95. Nonlinear basis functions are scaled by τ. The left panel is a y-magnification of the right panel. Fitted functions such as those in Figure 2.4 will be linear combinations of these basis functions as long as knots are at the same locations used here.

x <- seq(0, 1, length = 300)
for (nk in 3:6) {
  set.seed(nk)
  knots <- seq(.05, .95, length = nk)
  xx <- rcspline.eval(x, knots = knots, inclx = TRUE)
  for (i in 1 : (nk - 1))
    xx[, i] <- (xx[, i] - min(xx[, i])) /
               (max(xx[, i]) - min(xx[, i]))
  for (i in 1 : 20) {
    beta  <- 2 * runif(nk - 1) - 1
    xbeta <- xx %*% beta + 2 * runif(1) - 1
    xbeta <- (xbeta - min(xbeta)) /
             (max(xbeta) - min(xbeta))
    if (i == 1) {
      plot(x, xbeta, type = 'l', lty = 1,
           xlab = expression(X), ylab = '', bty = 'l')
      title(sub = paste(nk, 'knots'), adj = 0, cex = .75)
      for (j in 1 : nk)
        arrows(knots[j], .04, knots[j], -.03,
               angle = 20, length = .07, lwd = 1.5)
    }
    else lines(x, xbeta, col = i)
  }
}

An interactive demonstration of linear and cubic spline fitting, plus an ordinary 4th order polynomial, can be run with RStudio or in an ordinary R session:

require(Hmisc)
getRs('demoSpline.r')                 # if using RStudio
getRs('demoSpline.r', put='source')   # if not

Paul Lambert's excellent self-contained interactive demonstrations of continuity restrictions, cubic polynomial, linear spline, cubic spline, and restricted cubic spline fitting are at pclambert.net/interactivegraphs. Jordan Gauthier has another nice interactive demonstration at drjgauthier.shinyapps.io/spliny.
[Figure 2.4: Some typical restricted cubic spline functions for k = 3, 4, 5, 6. The y–axis is Xβ. Arrows indicate knots. These curves were derived by randomly choosing values of β subject to standard deviations of fitted functions being normalized.]

Once β0, . . . , βk−1 are estimated, the restricted cubic spline can be restated in the form

f(X) = β0 + β1X + β2(X − t1)³₊ + β3(X − t2)³₊ + . . . + βk+1(X − tk)³₊

by dividing β2, . . . , βk−1 by τ and computing

βk   = [β2(t1 − tk) + β3(t2 − tk) + . . . + βk−1(tk−2 − tk)]/(tk − tk−1)
βk+1 = [β2(t1 − tk−1) + β3(t2 − tk−1) + . . . + βk−1(tk−2 − tk−1)]/(tk−1 − tk).
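A small sketch of this conversion using the Hmisc function rcspline.restate, which computes the coefficients of the unrestricted form from the k − 1 restricted ones; the knots and coefficients below are made-up illustrative values, not from any fitted model.

require(Hmisc)
kn <- c(.05, .275, .5, .725, .95)   # 5 knots
b  <- c(1.2, -0.6, 0.9, -0.4)       # beta_1 ... beta_{k-1} (no intercept)
rcspline.restate(kn, b)             # coefficients of X, (X-t1)^3+, ..., (X-tk)^3+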

A test of linearity in X can be obtained by testing

H0 : β2 = β3 = . . . = βk−1 = 0.

Example: [170]

2.4.6 Choosing Number and Position of Knots

ˆ Knots are specified in advance in regression splines

ˆ Locations not important in most situations [183, 60]

ˆ Place knots where data exist — fixed quantiles of the predictor's marginal distribution

ˆ Fit depends more on the choice of k

k Quantiles
3 .10 .5 .90
4 .05 .35 .65 .95
5 .05 .275 .5 .725 .95
6 .05 .23 .41 .59 .77 .95
7 .025 .1833 .3417 .5 .6583 .8167 .975

For n < 100, replace the outer quantiles with the 5th smallest and 5th largest X [184].

Choice of k:

ˆ Flexibility of fit vs. n and variance

ˆ Usually k = 3, 4, 5. Often k = 4

ˆ Large n (e.g. n ≥ 100) – k = 5

ˆ Small n (< 30, say) – k = 3

ˆ Can use Akaike's information criterion (AIC) [7, 197] to choose k (see the sketch after this list)

ˆ This chooses k to maximize the model likelihood ratio χ² − 2k.

See [79] for a comparison of restricted cubic splines, fractional polynomials, and penalized splines.
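A minimal sketch of the AIC approach: fit one model per candidate number of knots and keep the k minimizing AIC (equivalently, maximizing LR χ² − 2k). Variable and dataset names (y, x, mydata) are illustrative.

require(rms)
for (k in 3:7) {
  f <- lrm(y ~ rcs(x, k), data=mydata)   # or ols(), cph(), ...
  cat(k, 'knots  AIC =', AIC(f), '\n')
}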

2.4.7 Nonparametric Regression [genreg-nonpar]

ˆ Estimate the tendency (mean or median) of Y as a function of X

ˆ Few assumptions

ˆ Especially handy when there is a single X

ˆ Plotted trend line may be the final result of the analysis

ˆ Simplest smoother: moving average

    X: 1   2   3   5    8
    Y: 2.1 3.8 5.7 11.1 17.2

    Ê(Y | X = 2) = (2.1 + 3.8 + 5.7)/3
    Ê(Y | X = (2 + 3 + 5)/3) = (3.8 + 5.7 + 11.1)/3

  – overlap OK

  – problem in estimating E(Y) at outer X-values

  – estimates very sensitive to bin width

ˆ Moving linear regression far superior to moving avg. (moving flat line)

ˆ Cleveland's [41] moving linear regression smoother loess (locally weighted least squares) is the most popular smoother. To estimate the central tendency of Y at X = x:

  – take all the data having X values within a suitable interval about x (default is 2/3 of the data)

  – fit a weighted least squares linear regression within this neighborhood

  – points near x are given the most weight^e

  – points near the extremes of the interval receive almost no weight

  – loess works much better at the extremes of X than the moving avg.

  – provides an estimate at each observed X; other estimates obtained by linear interpolation

  – outlier rejection algorithm built-in

ˆ loess works for binary Y — just turn off outlier detection

ˆ Other popular smoother: Friedman's "super smoother"

ˆ For loess or supsmu the amount of smoothing can be controlled by the analyst

^e Weight here means something different than a regression coefficient. It means how much a point is emphasized in developing the regression coefficients.
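A minimal sketch of these smoothers in base R; x and y are illustrative vectors.

plot(x, y)
lines(lowess(x, y), col='blue')          # loess-type smoother with outlier rejection
lines(lowess(x, y, iter=0), col='red')   # iter=0 turns off outlier rejection
                                         # (use this form when y is binary)
lines(supsmu(x, y), col='green')         # Friedman's "super smoother"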

ˆ Another alternative: smoothing splines^f

ˆ Smoothers are very useful for estimating trends in residual plots

2.4.8 Advantages of Regression Splines over Other Methods

Regression splines have several advantages [89]:

ˆ Parametric splines can be fitted using any existing regression program

ˆ Regression coefficients are estimated using standard techniques (ML or least squares); formal tests of no overall association, linearity, and additivity, and confidence limits for the estimated regression function, are derived by standard theory.

ˆ The fitted function directly estimates the transformation a predictor should receive to yield linearity in C(Y |X).

ˆ Even when a simple transformation is obvious, the spline function can be used to represent the predictor in the final model (and the d.f. will be correct). Nonparametric methods do not yield a prediction equation.

ˆ Extension to non-additive models. Multi-dimensional nonparametric estimators often require burdensome computations.

^f These place knots at all the observed data points but penalize coefficient estimates towards smoothness.

2.5 Recursive Partitioning: Tree-Based Models [genreg-rpart]

Breiman, Friedman, Olshen, and Stone [27]: CART (Classification and Regression Trees) — essentially model-free

Method:

ˆ Find the predictor so that the best possible binary split has the maximum value of some statistic for comparing 2 groups

ˆ Within previously formed subsets, find the best predictor and split maximizing the criterion in the subset

ˆ Proceed in like fashion until < k obs. remain to split

ˆ Summarize Y for the terminal node (e.g., mean, modal category)

ˆ Prune the tree backward until it cross-validates as well as its "apparent" accuracy, or use shrinkage
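A rough sketch of this method using the rpart package (one R implementation of CART-style recursive partitioning); variable and dataset names are illustrative, and the complexity parameter below is an arbitrary example value.

require(rpart)
f <- rpart(y ~ x1 + x2 + x3, data=mydata)
plotcp(f)                       # cross-validated error vs. complexity parameter
f.pruned <- prune(f, cp=0.02)   # prune back conservatively (cp chosen from plotcp)
plot(f.pruned); text(f.pruned)  # draw the pruned tree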

Advantages/disadvantages of recursive partitioning:

ˆ Does not require a functional form for predictors

ˆ Does not assume additivity — can identify complex interactions

ˆ Can deal with missing data flexibly

ˆ Interactions detected are frequently spurious

ˆ Does not use continuous predictors effectively

ˆ Penalty for overfitting in 3 directions

ˆ Often the tree doesn't cross-validate optimally unless pruned back very conservatively

ˆ Very useful in messy situations or those in which overfitting is not as problematic (confounder adjustment using propensity scores [45]; missing value imputation)

See [10].

2.5.1 New Directions in Predictive Modeling [genreg-newdir]

The approaches recommended in this course are

ˆ fitting fully pre-specified models without deletion of "insignificant" predictors

ˆ using data reduction methods (masked to Y) to reduce the dimensionality of the predictors and then fitting the number of parameters the data's information content can support

ˆ using shrinkage (penalized estimation) to fit a large model without worrying about the sample size.

The data reduction approach can yield very interpretable, stable models, but there are many decisions to be made when using a two-stage (reduction/model fitting) approach. Newer approaches are evolving, including the following. These newer approaches handle continuous predictors well, unlike recursive partitioning.

ˆ lasso (shrinkage using an L1 norm favoring zero regression coefficients) [189, 182]

ˆ elastic net (combination of L1 and L2 norms that handles the p > n case better than the lasso) [227]

ˆ adaptive lasso [224, 208]

ˆ more flexible lasso to differentially penalize for variable selection and for regression coefficient estimation [157]

ˆ group lasso to force selection of all or none of a group of related variables (e.g., dummy variables representing a polytomous predictor)

ˆ group lasso-like procedures that also allow for variables within a group to be removed [209]

ˆ sparse-group lasso using L1 and L2 norms to achieve sparseness on groups and within groups of variables [173]

ˆ adaptive group lasso (Wang & Leng)

ˆ Breiman's nonnegative garrote [220]

ˆ "preconditioning", i.e., model simplification after developing a "black box" predictive model [144, 143]

ˆ sparse principal components analysis to achieve parsimony in data reduction [217, 226, 123, 122]

ˆ bagging, boosting, and random forests [94]

One problem prevents most of these methods from being ready for everyday use: they require scaling predictors before fitting the model. When a predictor is represented by nonlinear basis functions, the scaling recommendations in the literature are not sensible. There are also computational issues and difficulties obtaining hypothesis tests and confidence intervals.

When data reduction is not required, generalized additive models [95, 218] should also be considered.

2.5.2 Choosing Between Machine Learning and Statistical Modeling

ˆ Statistical models allow for complexity (nonlinearity, interaction)

ˆ Easy to allow every predictor to have a nonlinear effect

ˆ Easy to handle unlimited numbers of candidate predictors if one assumes additivity (e.g., using ridge regression, lasso, elastic net)

ˆ Interactions should be pre-specified

ˆ Machine learning is gaining attention but is oversold in some settings (blog)

ˆ Researchers are under the mistaken impression that machine learning can be used on small samples (blog)

Considerations in Choosing One Approach over Another

A statistical model may be the better choice if

ˆ Uncertainty is inherent and the signal:noise ratio is not large — even with identical twins, one twin may get colon cancer and the other not; model tendencies instead of doing classification

ˆ One could never have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error

ˆ One wants to isolate effects of a small number of variables

ˆ Uncertainty in an overall prediction or the effect of a predictor is sought

ˆ Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be pre-specified

ˆ The sample size isn't huge

ˆ One wants to isolate (with a predominantly additive effect) the effects of "special" variables such as treatment or a risk factor

ˆ One wants the entire model to be interpretable

Machine learning may be the better choice if

ˆ The signal:noise ratio is large and the outcome being predicted doesn't have a strong component of randomness; e.g., in visual pattern recognition an object must be an "E" or not an "E"

ˆ The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)

ˆ Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)

ˆ One is not very interested in estimating uncertainty in forecasts or in effects of select predictors

ˆ Non-additivity is expected to be strong and can't be isolated to a few pre-specified variables (e.g., in visual pattern recognition the letter "L" must have both a dominating vertical component and a dominating horizontal component)

ˆ The sample size is huge [195]

ˆ One does not need to isolate the effect of a special variable such as treatment

ˆ One does not care that the model is a "black box"


2.6 Multiple Degree of Freedom Tests of Association [genreg-multidf]

C(Y |X) = β0 + β1X1 + β2X2 + β3X2²

H0: β2 = β3 = 0 is a 2 d.f. test to assess the association between X2 and outcome.

In the 5-knot restricted cubic spline model

C(Y |X) = β0 + β1X + β2X′ + β3X″ + β4X‴,

H0: β1 = . . . = β4 = 0

ˆ Test of association: 4 d.f.

ˆ Insignificant → dangerous to interpret plot

ˆ What to do if the 4 d.f. test is insignificant, the 3 d.f. test for linearity is insignificant, and the 1 d.f. test is significant after deleting the nonlinear terms?
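One way to obtain these multiple d.f. tests in practice is rms's anova(), which prints the pooled test of association for a spline-expanded predictor along with the test of its nonlinear terms; names below are illustrative.

require(rms)
f <- lrm(y ~ x1 + rcs(x2, 5), data=mydata)
anova(f)   # rows for x2: association (4 d.f.) and Nonlinear (3 d.f.)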

Grambsch and O'Brien [80] elegantly described the hazards of pretesting:

ˆ Studied quadratic regression

ˆ Showed the 2 d.f. test of association is nearly optimal even when the regression is linear, if nonlinearity was entertained

ˆ Considered the ordinary regression model E(Y |X) = β0 + β1X + β2X²

ˆ Two ways to test association between X and Y

ˆ Fit the quadratic model and test for linearity (H0: β2 = 0)

ˆ F-test for linearity significant at the α = 0.05 level → report as the final test of association the 2 d.f. F test of H0: β1 = β2 = 0

ˆ If the test of linearity is insignificant, refit without the quadratic term; the final test of association is the 1 d.f. test, H0: β1 = 0 | β2 = 0

ˆ Showed that the type I error > α

ˆ A fairly accurate P-value is obtained by instead testing against F with 2 d.f. even at the second stage

ˆ Cause: we are retaining the most significant part of F

ˆ BUT if we test against 2 d.f. we can only lose power when compared with the original F for testing both βs

ˆ SSR from the quadratic model > SSR from the linear model

2.7 Assessment of Model Fit [genreg-gof]

2.7.1 Regression Assumptions

The general linear regression model is

C(Y |X) = Xβ = β0 + β1X1 + β2X2 + . . . + βkXk.

Verify linearity and additivity. Special case:

C(Y |X) = β0 + β1X1 + β2X2,

where X1 is binary and X2 is continuous.

[Figure 2.5: Regression assumptions for one binary and one continuous predictor — two parallel curves of C(Y) vs. X2, one for X1 = 1 and one for X1 = 0]

Methods for checking fit:

1. Fit a simple linear additive model and examine residual plots for patterns

   ˆ For OLS: box plots of e stratified by X1, scatterplots of e vs. X2 and Ŷ, with trend curves (want flat central tendency, constant variability)

   ˆ For normality, qqnorm plots of overall and stratified residuals

   Advantage: Simplicity
   Disadvantages:
   ˆ Can only compute standard residuals for uncensored continuous response
   ˆ Subjective judgment of non-randomness
   ˆ Hard to handle interaction
   ˆ Hard to see patterns with large n (trend lines help)
   ˆ Seeing patterns does not lead to corrective action

2. Scatterplot of Y vs. X2 using different symbols according to values of X1
   Advantages: Simplicity, can see interaction
   Disadvantages:
   ˆ Scatterplots cannot be drawn for binary, categorical, or censored Y
   ˆ Patterns difficult to see if relationships are weak or n large

3. Stratify the sample by X1 and quantile groups (e.g. deciles) of X2; estimate C(Y |X1, X2) for each stratum
   Advantages: Simplicity, can see interactions, handles censored Y (if you are careful)
   Disadvantages:
   ˆ Requires large n
   ˆ Does not use continuous var. effectively (no interpolation)
   ˆ Subgroup estimates have low precision
   ˆ Dependent on binning method

4. Separately for levels of X1 fit a nonparametric smoother relating X2 to Y
   Advantages: All regression aspects of the model can be summarized efficiently with minimal assumptions
   Disadvantages:
   ˆ Does not apply to censored Y
   ˆ Hard to deal with multiple predictors

5. Fit a flexible nonlinear parametric model
   Advantages:
   ˆ One framework for examining the model assumptions, fitting the model, and drawing formal inference
   ˆ d.f. defined and all aspects of statistical inference "work as advertised"
   Disadvantages:
   ˆ Complexity
   ˆ Generally difficult to allow for interactions when assessing patterns of effects

Confidence limits and formal inference can be problematic for methods 1-4.

The restricted cubic spline works well for method 5.


Ĉ(Y |X) = β̂0 + β̂1X1 + β̂2X2 + β̂3X2′ + β̂4X2″
        = β̂0 + β̂1X1 + f̂(X2),

where f̂(X2) = β̂2X2 + β̂3X2′ + β̂4X2″ is the spline-estimated transformation of X2.

ˆ Plot f̂(X2) vs. X2

ˆ n large → can fit separate functions by X1

ˆ Test of linearity: H0: β3 = β4 = 0

ˆ Few good reasons to do the test other than to demonstrate that linearity is not a good default assumption

ˆ Nonlinear → use the transformation suggested by the spline fit or keep the spline terms

ˆ Tentative transformation g(X2) → check adequacy by expanding g(X2) in a spline function and testing linearity

ˆ Can find transformations by plotting g(X2) vs. f̂(X2) for a variety of g

ˆ Multiple continuous predictors → expand each using a spline

ˆ Example: assess linearity of X2, X3:

C(Y |X) = β0 + β1X1 + β2X2 + β3X2′ + β4X2″ + β5X3 + β6X3′ + β7X3″

Overall test of linearity: H0: β3 = β4 = β6 = β7 = 0, with 4 d.f.
2.7.2 Modeling and Testing Complex Interactions [genreg-interact]

Note: Interactions will be misleading if main effects are not properly modeled [225].

Suppose X1 is binary or linear and X2 is continuous:

C(Y |X) = β0 + β1X1 + β2X2 + β3X2′ + β4X2″ + β5X1X2 + β6X1X2′ + β7X1X2″

Simultaneous test of linearity and additivity: H0: β3 = . . . = β7 = 0.

ˆ 2 continuous variables: could transform separately and form a simple product

ˆ But transformations depend on whether interaction terms are adjusted for, so it is usually not possible to estimate transformations and interaction effects other than simultaneously

ˆ Compromise: Fit interactions of the form X1 f(X2) and X2 g(X1) (see the rms sketch after this list):

C(Y |X) = β0 + β1X1 + β2X1′ + β3X1″
        + β4X2 + β5X2′ + β6X2″
        + β7X1X2 + β8X1X2′ + β9X1X2″
        + β10X2X1′ + β11X2X1″

ˆ Test of additivity is H0: β7 = β8 = . . . = β11 = 0 with 5 d.f.

ˆ Test of lack of fit for the simple product interaction with X2 is H0: β8 = β9 = 0

ˆ Test of lack of fit for the simple product interaction with X1 is H0: β10 = β11 = 0
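In rms this compromise is available via the restricted interaction operator %ia%, which omits the doubly nonlinear product terms; variable names below are illustrative.

require(rms)
f <- lrm(y ~ rcs(x1, 4) + rcs(x2, 4) + rcs(x1, 4) %ia% rcs(x2, 4),
         data=mydata)
anova(f)   # includes the pooled test of additivity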

General spline surface:

ˆ Cover the X1 × X2 plane with a grid and fit a patch-wise cubic polynomial in two variables

ˆ Restrict the surface to be of the form aX1 + bX2 + cX1X2 in the corners

ˆ Uses all (k − 1)² cross-products of restricted cubic spline terms

ˆ See Gray [81, 82, Section 3.2] for penalized splines allowing control of effective degrees of freedom. See Berhane et al. [18] for a good discussion of tensor splines.

[Figure 2.6: Logistic regression estimate of probability of a hemorrhagic stroke for patients in the GUSTO-I trial given t-PA, using a tensor spline of two restricted cubic splines (SYSBP vs. DIABP; 4 knots, penalty=21; adjusted to tx=tpa, age=61.52) and penalization (shrinkage). Dark (cold color) regions are low risk, and bright (hot) regions are higher risk.]

Figure 2.6 is particularly interesting because the literature had suggested (based on approximately 24 strokes) that pulse pressure was the main cause of hemorrhagic stroke, whereas this flexible modeling approach (based on approximately 230 strokes) suggests that mean arterial blood pressure (roughly a 45° line) is what is most important over a broad range of blood pressures. At the far right one can see that pulse pressure (the axis perpendicular to the 45° line) may have an impact, although a non-monotonic one.

Other issues:

ˆ Y non-censored (especially continuous) → multi-dimensional scatterplot smoother [35]

ˆ Interactions of order > 2: more trouble

ˆ 2-way interactions among p predictors: pooled tests

ˆ p tests each with p − 1 d.f.

Some types of interactions to pre-specify in clinical studies:

ˆ Treatment × severity of disease being treated

ˆ Age × risk factors

ˆ Age × type of disease

ˆ Measurement × state of a subject during measurement

ˆ Race × disease

ˆ Calendar time × treatment

ˆ Quality × quantity of a symptom

ˆ Measurement × amount of deterioration of the measurement

The section between the two horizontal blue lines was inserted after the audio narration was recorded.

The last example is worth expanding as an example in model formulation. Consider the following study.

ˆ A sample of patients seen over several years have a blood sample taken at the time of hospitalization

ˆ Blood samples are frozen

ˆ Long after the last patient was sampled, the blood samples are thawed all in the same week and a blood analysis is done

ˆ It is known that the quality of the blood analysis deteriorates roughly logarithmically with the age of the sample; blood measurements made on old samples are assumed to be less predictive of outcome

ˆ This is reflected in an interaction between a function of sample age and the blood measurement B^g

ˆ Patients were followed for an event, and the outcome variable of interest is the time from hospitalization to that event

ˆ To not assume a perfect logarithmic relationship for the effect of sample age on the effect of the blood measurement, a restricted cubic spline model with 3 default knots will be fitted for log sample age

ˆ Sample age is assumed to not modify the effects of the non-blood predictors patient age and sex

ˆ The model may be specified the following way using the R rms package to fit a Cox proportional hazards model

ˆ The test for nonlinearity of sampleAge tests the adequacy of assuming a plain logarithmic trend in sample age

f <- cph(Surv(etime, event) ~ rcs(log(sampleAge), 3) * rcs(B, 4) +
         rcs(age, 5) * sex, data=mydata)

The B × sampleAge interaction effects have 6 d.f. and test whether the sample deterioration affects the effect of B. By not assuming that B has the same effect for old samples as for young samples, the investigator will be able to estimate the effect of B on outcome when the blood analysis is ideal, by inserting sampleAge = 1 day when requesting predicted values as a function of B.

^g For continuous Y one might need to model the residual variance of Y as increasing with sample age, in addition to modeling the mean function.

2.7.3 Fitting Ordinal Predictors

ˆ Small no. of categories (3-4) → polytomous factor, dummy variables

ˆ Design matrix for an easy test of adequacy of initial codes → k original codes + k − 2 dummies

ˆ More categories → score using data-driven trend. Later tests use k − 1 d.f. instead of 1 d.f.

ˆ E.g., compute logit(mortality) vs. category

ˆ Much better: use penalized maximum likelihood estimation (R ordSmooth package) or Bayesian shrinkage (R brms package).

2.7.4 Distributional Assumptions

ˆ Some models (e.g., logistic): all assumptions are in C(Y |X) = Xβ (implicitly assuming no omitted variables!)

ˆ Linear regression: Y = Xβ + ε, ε ∼ n(0, σ²)

ˆ Examine distribution of residuals

ˆ Some models (Weibull, Cox [49]): C(Y |X) = C(Y = y|X) = d(y) + Xβ, with C = log hazard

ˆ Check the form of d(y)

ˆ Show d(y) does not interact with X

2.8 Complex Curve Fitting Example

ˆ Restricted cubic spline

ˆ Discontinuity (interrupted time series analysis)

ˆ Cyclic trend (seasonality)

ˆ Data from academic.oup.com/ije/article/46/1/348/2622842 by Bernal, Cummins, Gasparrini [19]

ˆ Rates of hospital admissions for acute coronary events in Sicily before and after a smoking ban on 2005-01

ˆ Poisson regression on case counts, adjusted for population size as an offset variable (analyzes event rate) (see stats.stackexchan...)

ˆ Classic interrupted time series puts a discontinuity at the intervention point and assesses statistical evidence for a nonzero jump height

ˆ We will do that and also fit a continuous cubic spline but with multiple knots near the intervention point

ˆ All in the context of long-term and seasonal trends

ˆ Uses the rms package gTrans function documented at hbiostat.org/R/rms/gtrans.html

ˆ Can do this without gTrans but it is harder to plot predicted values, get contrasts, and get chunk tests

ˆ Time variable is months after 2001-12-01


require(rms)              # engages rms which also engages Hmisc which provides getHdata
options(prType='latex')   # applies to printing model fits
getHdata(sicily)          # fetch dataset from hbiostat.org/data
d  <- sicily
dd <- datadist(d); options(datadist='dd')

Start with a standard restricted cubic spline fit, 6 knots at default quantile locations. From the fitted Poisson model we estimate the number of cases per a constant population size of 100,000.

g   <- function(x) exp(x) * 100000
off <- list(stdpop=mean(d$stdpop))   # offset for prediction (383464.4)
w   <- geom_point(aes(x=time, y=rate), data=d)
v   <- geom_vline(aes(xintercept=37, col=I('red')))
yl  <- ylab('Acute Coronary Cases Per 100,000')
f   <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6), data=d, family=poisson)
f$aic

[1] 721.5237

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

[Plot: predicted acute coronary cases per 100,000 vs. time since start of study (months), with observed rates and a vertical line at the intervention (month 37)]

ˆ To add seasonality to the model we can add sine and/or cosine terms

ˆ See pmean.com/07/CyclicalTrends.html by Steve Simon

ˆ If you knew the month at which incidence is a minimum, you could just add a sine term to the model

ˆ Adding both sine and cosine terms effectively allows for a model parameter that estimates the time origin

ˆ Assume the period (cycle length) is 12m

# Save knot locations
k <- attr(rcs(d$time, 6), 'parms')
k

[1]  5.00 14.34 24.78 35.22 45.66 55.00

kn <- k
# rcspline.eval is the rcs workhorse
h <- function(x) cbind(rcspline.eval(x, kn),
                       sin=sin(2*pi*x/12), cos=cos(2*pi*x/12))
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family=poisson)
f$aic

[1] 674.112

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl


[Plot: fit with seasonal (sine/cosine) terms added]

Next add more knots near the intervention to allow for sudden change:

kn <- sort(c(k, c(36, 37, 38)))
f  <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
          data=d, family=poisson)
f$aic

[1] 661.7904

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

[Plot: fit with additional knots near the intervention point]

Now make the slow trend simpler (6 knots) but add a discontinuity at the intervention. More finely control the times at which predictions are requested, to handle the discontinuity.

h <- function(x) cbind(rcspline.eval(x, k),
                       sin=sin(2*pi*x/12), cos=cos(2*pi*x/12),
                       jump=x >= 37)
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family=poisson)
f$aic

[1] 659.6044

times <- sort(c(seq(0, 60, length=200), 36.999, 37, 37.001))
ggplot(Predict(f, time=times, fun=g, offset=off)) + w + v + yl

[Plot: fit with a discontinuity (jump) at month 37]

Look at the fit statistics, especially the evidence for the jump:

f

General Linear Model

Glm(formula = aces ~ offset(log(stdpop)) + gTrans(time, h),
    family = poisson, data = d)

                        Model Likelihood Ratio Test
Obs               59    LR χ²       169.64
Residual d.f.     51    d.f.             7
g         0.07973896    Pr(> χ²)   <0.0001

           β̂        S.E.    Wald Z   Pr(> |Z|)
Intercept  -6.2118   0.0095  -656.01  <0.0001
time        0.0635   0.0113     5.63  <0.0001
time'      -0.1912   0.0433    -4.41  <0.0001
time''      0.2653   0.0760     3.49   0.0005
time'''    -0.2409   0.0925    -2.61   0.0092
sin         0.0343   0.0067     5.11  <0.0001
cos         0.0380   0.0065     5.86  <0.0001
jump       -0.1268   0.0313    -4.06  <0.0001

ˆ Evidence for an intervention effect (jump)

ˆ Evidence for seasonality

ˆ Could have analyzed rates using a semiparametric model


Chapter 3

Missing Data

3.1 Types of Missing Data [missing-type]

ˆ Missing completely at random (MCAR)

ˆ Missing at random (MAR)^a

ˆ Informative missing (non-ignorable non-response)

See [33, 168, 58, 87, 1, 214, 31] for an introduction to missing data and imputation concepts.

^a "Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models" [87].

3.2 Prelude to Modeling [missing-prelude]

ˆ Quantify extent of missing data

ˆ Characterize types of subjects with missing data

ˆ Find sets of variables missing on the same subjects

3.3 Missing Values for Different Types of Response Variables [missing-y]

ˆ Serial data with subjects dropping out (not covered in this course)^b

ˆ Y = time to event, follow-up curtailed: covered under survival analysis^c

ˆ Often discard observations with completely missing Y, but this is sometimes wasteful^d

ˆ Characterize missings in Y before dropping obs.

^b Twist et al. [191] found instability in using multiple imputation of longitudinal data, and advantages of using instead full likelihood models.
^c White and Royston [213] provide a method for multiply imputing missing covariate values using censored survival time data.
^d Y is so valuable that if one is only missing a Y value, imputation is not worthwhile, and imputation of Y is not advised if MCAR or MAR.

3.4 Problems With Simple Alternatives to Imputation [missing-alt]

Deletion of records—

ˆ Badly biases parameter estimates when the probability of a case being incomplete is related to Y and not just X [128].

ˆ Deletion because of a subset of X being missing always results in inefficient estimates

ˆ Deletion of records with missing Y can result in biases [50] but is the preferred approach under MCAR^e

ˆ However von Hippel [204] found advantages to a "use all variables to impute all variables then drop observations with missing Y" approach (but see [186])

ˆ Lee and Carlin [121] suggest that observations missing on both Y and on a predictor of major interest are not helpful

ˆ Only discard obs. when

  – MCAR can be justified

  – A rarely missing predictor of overriding importance can't be imputed from other data

  – The fraction of obs. with missings is small and n is large

ˆ No advantage of deletion except savings of analyst time

ˆ Making up missing data is better than throwing away real data

ˆ See [110]

^e Multiple imputation of Y in that case does not improve the analysis and assumes the imputation model is correct.

Adding extra categories of categorical predictors—

ˆ Including missing data but adding a category 'missing' causes serious biases [1, 104, 192]

ˆ Problem acute when values are missing because the subject was too sick

ˆ Difficult to interpret

ˆ Fails even under MCAR [104, 1, 58, 194, 110]

ˆ May be OK if values are "missing" because of "not applicable"^f

Likewise, serious problems are caused by setting missing continuous predictors to a constant (e.g., zero) and adding an indicator variable to try to estimate the effect of missing values. Two examples from Donders et al. [58] using binary logistic regression, N = 500.

^f E.g. you have a measure of marital happiness, dichotomized as high or low, but your sample contains some unmarried people. OK to have a 3-category variable with values high, low, and unmarried—Paul Allison, IMPUTE list, 4Jul09.

Results of 1000 Simulations With β1 = 1.0 with MAR and Two Types of Imputation

Imputation Method   β̂1     S.E.   Coverage of 0.90 C.I.
Single              0.989   0.09   0.64
Multiple            0.989   0.14   0.90

Now consider a simulation with β1 = 1, β2 = 0, X2 correlated with X1 (r = 0.75) but redundant in predicting Y; use a missingness indicator when X1 is MCAR in 0.4 of 500 subjects. This is also compared with grand mean fill-in imputation.

Results of 1000 Simulations Adding a Third Predictor Indicating Missing for X1

Imputation Method   β̂1    β̂2
Indicator           0.55   0.51
Overall mean        0.55

In the incomplete observations the constant X1 is uncorrelated with X2.

3.5 Strategies for Developing an Imputation Model [missing-impmodel]

The goal of imputation is to preserve the information and meaning of the non-missing data.

There is a full Bayesian modeling alternative to all the methods presented below. The Bayesian approach requires more effort but has several advantages [65].

Exactly how are missing values estimated?

ˆ Could ignore all other information — random or grand mean fill-in

ˆ Can use external info not used in the response model (e.g., zip code for income)

ˆ Need to utilize the reason for non-response if possible

ˆ Use a statistical model with the sometimes-missing X as the response variable

ˆ The model to estimate the missing values should include all variables that are either

  1. related to the missing data mechanism;
  2. have distributions that differ between subjects that have the target variable missing and those that have it measured;
  3. associated with the sometimes-missing variable when it is not missing; or
  4. included in the final response model [12, 87]

ˆ Ignoring imputation results in biased V̂(β̂)

ˆ transcan function in the Hmisc library: "optimal" transformations of all variables to make residuals more stable and to allow non-monotonic transformations

ˆ aregImpute function in Hmisc: good approximation to the full Bayesian multiple imputation procedure using the bootstrap

ˆ transcan and aregImpute use the following for fitting imputation models:

  1. initialize NAs to the median (mode for categoricals)
  2. expand all categorical predictors using dummy variables
  3. expand all continuous predictors using restricted cubic splines
  4. optionally optimally transform the variable being predicted by expanding it with restricted cubic splines and using the first canonical variate (multivariate regression) as the optimum transformation (maximizing R²)
  5. one-dimensional scoring of categorical variables being predicted using canonical variates on dummy variables representing the categories (Fisher's optimum scoring algorithm); when imputing categories, solve for which category yields a score that is closest to the predicted score

ˆ aregImpute and transcan work with fit.mult.impute to make the final analysis of the response variable relatively easy

ˆ Predictive mean matching [128]: replace a missing value with the observed value of the subject having the closest predicted value to the predicted value of the subject with the NA. Key considerations are how to

  1. model the target when it is not NA
  2. match donors on predicted values
  3. avoid overuse of "good" donors to disallow excessive ties in imputed data
  4. account for all uncertainties

ˆ The predictive model for each target uses any outcomes, all predictors in the final model other than the target, plus auxiliary variables not in the outcome model

ˆ No distributional assumptions; nicely handles target variables with strange distributions [202]

ˆ Predicted values need only be monotonically related to real predictive values

  – PMM can result in some donor observations being used repeatedly

  – Causes a lumpy distribution of imputed values

  – Address by sampling from a multinomial distribution, with probabilities = scaled distances of all predicted values to the predicted value (y*) of the observation needing imputing

  – Tukey's tricube function is a good weighting function (used in loess); a small computational sketch follows:

    wi = (1 − min(di/s, 1)³)³,  where di = |ŷi − y*|,

    s = 0.2 × mean|ŷi − y*| is a good default scale factor, and the wi are scaled so that Σ wi = 1
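A minimal sketch of the tricube donor-weighting scheme just described; yhat holds predicted values for potential donors and ystar the predicted value for the observation needing imputation (names are illustrative).

tricube.wt <- function(yhat, ystar) {
  d <- abs(yhat - ystar)          # distances to the target prediction
  s <- 0.2 * mean(d)              # default scale factor
  w <- (1 - pmin(d / s, 1)^3)^3   # Tukey's tricube
  w / sum(w)                      # scale so the weights sum to 1
}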

ˆ Recursive partitioning with surrogate splits — handles the case where a predictor of a variable needing imputation is itself missing. But there are problems [151] even with completely random missingness.

ˆ [214] discusses an alternative method based on choosing a donor observation at random from the q closest matches (q = 3, for example)

3.5.1 Interactions

ˆ When interactions are in the outcome model, oddly enough it may be better to treat interaction terms as "just another variable" and do unconstrained imputation of them [108]

3.6 Single Conditional Mean Imputation [missing-single]

ˆ Can fill in using the unconditional mean or median if the number of missings is low and X is unrelated to the other Xs

ˆ Otherwise, a first approximation to good imputation uses the other Xs to predict a missing X

ˆ This is a single "best guess" conditional mean

ˆ X̂j = Zθ̂, where Z = Xj̄ plus possibly auxiliary variables that precede Xj in the causal chain and that are not intended to be in the outcome model. Cannot include Y in Z without adding random errors to imputed values as done with multiple imputation (that would steal info from Y)

ˆ Recursive partitioning can sometimes be helpful for nonparametrically estimating conditional means

3.7 Predictive Mean Matching [missing-pmm]

3.8 Multiple Imputation [missing-mi]

ˆ Single imputation could use a random draw from the conditional distribution for an individual: X̂j = Zθ̂ + ε̂, where Z = [Xj̄, Y] plus auxiliary variables, and ε̂ = n(0, σ̂) or a random draw from the calculated residuals

  – bootstrap

  – approximate Bayesian bootstrap [165, 87]: sample with replacement from a sample with replacement of the residuals

ˆ Multiple imputations (M) with random draws

  – Draw a sample of M residuals for each missing value to be imputed

  – Average the M β̂

  – In general can provide the least biased estimates of β

  – Simple formula for imputation-corrected var(β̂): a function of the average "apparent" variances and the between-imputation variances of β̂

  – Even when the χ² distribution is a good approximation when data have no missing values, the t or F distributions are needed to have accurate P-values and confidence limits when there are missings [127, 160]

  – BUT full multiple imputation needs to account for uncertainty in the imputation models by refitting these models for each of the M draws

  – transcan does not do that; aregImpute does

ˆ Note that multiple imputation can and should use the response variable for imputing predictors [138]

ˆ aregImpute algorithm [138]

  – Takes all aspects of uncertainty into account using the bootstrap

  – Different bootstrap resamples are used for each imputation by fitting a flexible additive model on a sample with replacement from the original data

  – This model is used to predict all of the original missing and non-missing values for the target variable for the current imputation

  – Uses flexible parametric additive regression models to impute

  – There is an option to allow target variables to be optimally transformed, even non-monotonically (but this can overfit)

  – By default uses predictive mean matching for imputation; no residuals required (can also do more parametric regression imputation)

  – By default uses weighted PMM; many other matching options

  – Uses by default van Buuren's "Type 1" matching [31, Section 3.4.2] to capture the right amount of uncertainty by computing predicted values for missing values using a regression fit on the bootstrap sample, and finding donor observations by matching those predictions to predictions from potential donors using the regression fit from the original sample of complete observations

  – When a predictor of the target variable is missing, it is first imputed from its last imputation when it was a target variable

  – First 3 iterations of the process are ignored ("burn-in")

  – Compares favorably to the R MICE approach

  – Example:

a <- aregImpute(~ age + sex + bp + death + heart.attack.before.death,
                data=mydata, n.impute=5)
f <- fit.mult.impute(death ~ rcs(age,3) + sex + rcs(bp,5),
                     lrm, a, data=mydata)

See Barzi and Woodward [12] for a nice review of multiple imputation with a detailed comparison of results (point estimates and confidence limits for the effect of the sometimes-missing predictor) for various imputation methods. Barnes et al. [11] have a good overview of imputation methods and a comparison of bias and confidence interval coverage for the methods when applied to longitudinal data with a small number of subjects. Horton and Kleinman [100] have a good review of several software packages for dealing with missing data, and a comparison of them with aregImpute. Harel and Zhou [87] provide a nice overview of multiple imputation and discuss some of the available software. White and Carlin [212] studied bias of multiple imputation vs. complete-case analysis. White et al. [214] provide much practical guidance.

Caution: Methods can generate imputations having very reasonable distributions but still not having the property that final response model regression coefficients have nominal confidence interval coverage. It is worth checking that imputations generate the correct collinearities among covariates.

ˆ With MICE and aregImpute we are using the chained equations approach [214]

ˆ Chained equations handle a wide variety of target variables to be imputed and allow for multiple variables to be missing on the same subject

ˆ An iterative process cycles through all target variables to impute all missing values [196]

ˆ Does not attempt to use the full Bayesian multivariate model for all target variables, making it more flexible and easy to use

ˆ Possible to create improper imputations, e.g., imputing conflicting values for different target variables

ˆ However, simulation studies [196] demonstrate very good performance of imputation based on chained equations

3.9 Diagnostics [missing-dx]

ˆ MCAR can be partially assessed by comparing the distribution of non-missing Y for those subjects with complete X vs. those subjects having incomplete X [128]

ˆ Yucel and Zaslavsky [223] (see also [96])

ˆ Interested in the reasonableness of imputed values for a sometimes-missing predictor Xj

ˆ Duplicate the entire dataset

ˆ In the duplicated observations set all non-missing values of Xj to missing; let w denote this set of observations set to missing

ˆ Develop imputed values for the missing values of Xj

ˆ In the observations in w compare the distribution of imputed Xj to the original values of Xj

ˆ Bondarenko and Raghunathan [22] present a variety of useful diagnostics on the reasonableness of imputed values.

[Diagram: the dataset is duplicated; in the duplicate all values of Xj are set to missing and then imputed, so imputations can be compared with the originally non-missing values]
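A rough sketch of this duplication diagnostic, assuming a data frame d with a sometimes-missing predictor xj and other variables x2 and x3; all object names are illustrative, and the handling of aregImpute's imputed-value matrix may need adjusting.

require(Hmisc)
n  <- nrow(d)
d2 <- d; d2$xj <- NA                   # duplicate with xj set entirely to missing
both <- rbind(d, d2)
rownames(both) <- 1 : (2 * n)
a   <- aregImpute(~ xj + x2 + x3, data=both, n.impute=5)
imp <- a$imputed$xj                    # rows = observations with xj missing
dup <- as.integer(rownames(imp)) > n   # imputations belonging to the duplicate
plot(ecdf(d$xj), main='')              # distribution of originally observed xj
lines(ecdf(imp[dup, 1]), col='blue')   # vs. the first set of imputed values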

3.10 Summary and Rough Guidelines [missing-summary]

Table 3.1: Summary of Methods for Dealing with Missing Values

Method                         Deletion   Single   Multiple
Allows non-random missing                 x        x
Reduces sample size            x
Apparent S.E. of β̂ too low               x
Increases real S.E. of β̂     x
β̂ biased if not MCAR         x

The following contains crude guidelines. Simulation studies are needed to refine the recommendations. Here f refers to the proportion of observations having any variables missing.

f < 0.03: It doesn't matter very much how you impute missings or whether you adjust the variance of regression coefficient estimates for having imputed data in this case. For continuous variables imputing missings with the median non-missing value is adequate; for categorical predictors the most frequent category can be used. Complete case analysis is also an option here. Multiple imputation may be needed to check that the simple approach "worked."

f ≥ 0.03: Use multiple imputation with the number of imputations^g equal to max(5, 100f). Fewer imputations may be possible with very large sample sizes. See statisticalhorizons.com/how-many-imputations. Type 1 predictive mean matching is usually preferred, with weighted selection of donors. Account for imputation in estimating the covariance matrix for final parameter estimates. Use the t distribution instead of the Gaussian distribution for tests and confidence intervals, if possible, using the estimated d.f. for the parameter estimates.

Multiple predictors frequently missing: More imputations may be required. Perform a "sensitivity to order" analysis by creating multiple imputations using different orderings of sometimes-missing variables. It may be beneficial to initially sort variables so that the one with the most NAs will be imputed first.

The reason for missings is more important than the number of missing values.

An extreme amount of missing data does not prevent one from using multiple imputation, because the alternatives are worse [103, 132].

^g White et al. [214] recommend choosing M so that the key inferential statistics are very reproducible should the imputation analysis be repeated. They suggest the use of 100f imputations. See also [31, Section 2.7]. von Hippel [205] finds that the number of imputations should be quadratically increasing with the fraction of missing information.

3.10.1 Effective Sample Size [missing-n]

It is useful to look at examples of effective sample sizes in the presence of missing data. If a sample of 1000 subjects contains various amounts and patterns of missings, what size nc of a complete sample would have equivalent information for the intended purpose of the analysis?

1. A new marker was collected on a random sample of 200 of the subjects and one wants to estimate the added predictive value due to the marker: nc = 200

2. Height is missing on 100 subjects but we want to study the association between BMI and outcome. Weight, sex, and waist circumference are available on all subjects: nc = 980

3. Each of 10 predictors is randomly missing on 1/10 of the subjects, and the predictors are uncorrelated with each other and are each weakly related to the outcome: nc = 500

4. Same as previous but the predictors can somewhat be predicted from non-missing predictors: nc = 750

5. The outcome variable was not assessed on a random 1/5 of the subjects: nc = 800

6. The outcome represents sensitive information, is missing on 1/2 of the subjects, and we don't know what made subjects respond to the question: nc = 0 (serious selection bias)

7. One of the baseline variables was collected prospectively 1/2 of the time and for the other subjects it was retrospectively estimated only for subjects ultimately suffering a stroke, and we don't know which subjects had a stroke: nc = 0 (study not worth doing)

8. The outcome variable was assessed by emailing the 1000 subjects, of which 800 responded, and we don't know what made subjects respond: nc = 0 (model will possibly be very biased—at least the intercept)

3.11 Bayesian Methods for Missing Data

ˆ Multiple imputation was developed as an approximation to a full Bayesian model

ˆ A full Bayesian model treats missings as unknown parameters and provides exact inference and correct measures of uncertainty

ˆ See this case study for an example

ˆ The case study also shows how to do "posterior stacking" if you want to avoid having to specify a full model for missings, and instead use usual multiple imputations as described in this chapter

  – Run a multiple imputation algorithm

  – For each completed dataset run the Bayesian analysis and draw thousands of samples from the posterior distribution of the parameters

  – Pool all these posterior draws over all the multiple imputations and do posterior inference as usual with no special correction required

  – Made easy by the Hmisc package aregImpute function and the rms stackMI function as demonstrated in the Titanic case study later in the notes.
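A minimal sketch of posterior stacking with stackMI, assuming the rmsb package's blrm Bayesian fitter; variable and dataset names are illustrative and details should be checked against the Titanic case study.

require(rms); require(rmsb)
a <- aregImpute(~ age + sex + bp + death, data=mydata, n.impute=20)
# Fit the Bayesian model on each completed dataset and stack posterior draws
f <- stackMI(death ~ rcs(age, 3) + sex + rcs(bp, 5),
             blrm, a, data=mydata)
# Posterior inference then proceeds as usual, e.g. summary(f)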


Chapter 4

Multivariable Modeling Strategies

ˆ "Spending d.f.": examining or fitting parameters in models, or examining tables or graphs that utilize Y to tell you how to model variables

ˆ If we wish to preserve statistical properties, we can't retrieve d.f. once they are "spent" (see Grambsch & O'Brien)

ˆ If a scatterplot suggests linearity and you fit a linear model, how many d.f. did you actually spend (i.e., the d.f. that when put into a formula results in accurate confidence limits or P-values)?

ˆ Decide the number of d.f. that can be spent

ˆ Decide where to spend them

ˆ Spend them

ˆ General references: [142, 181, 92, 76]

There are many choices to be made when deciding upon a global modeling strategy, including the choice between

ˆ parametric and nonparametric procedures

ˆ parsimony and complexity

ˆ parsimony and good discrimination ability

ˆ interpretable models and black boxes.

4.1 Prespecification of Predictor Complexity Without Later Simplification [strategy-complexity]

ˆ Rarely expect linearity

ˆ Can't always use graphs or other devices to choose the transformation

ˆ If we select from among many transformations, results are biased

ˆ Need to allow flexible nonlinearity for potentially strong predictors not known to predict linearly

ˆ Once we decide a predictor is "in" we can choose the no. of parameters to devote to it using a general association index with Y

ˆ Need a measure of "potential predictive punch"

ˆ The measure needs to mask the analyst to the true form of the regression to preserve statistical properties
Motivating examples:

# Overfitting a flat relationship
require(rms)
set.seed(1)
x  <- runif(1000)
y  <- runif(1000, -0.5, 0.5)
dd <- datadist(x, y); options(datadist='dd')
par(mfrow=c(2,2), mar=c(2, 2, 3, 0.5))
pp <- function(actual) {
  yhat  <- predict(f, data.frame(x=xs))
  yreal <- actual(xs)
  plot(0, 0, xlim=c(0, 1),
       ylim=range(c(quantile(y, c(0.1, 0.9)), yhat, yreal)),
       type='n', axes=FALSE)
  axis(1, labels=FALSE); axis(2, labels=FALSE)
  lines(xs, yreal)
  lines(xs, yhat, col='blue')
}
f  <- ols(y ~ rcs(x, 5))
xs <- seq(0, 1, length=150)
pp(function(x) 0 * x)
title('Mild Error:\nOverfitting a Flat Relationship', cex=0.5)
y <- x + runif(1000, -0.5, 0.5)
f <- ols(y ~ rcs(x, 5))
pp(function(x) x)
title('Mild Error:\nOverfitting a Linear Relationship', cex=0.5)
y <- x^4 + runif(1000, -1, 1)
f <- ols(y ~ x)
pp(function(x) x^4)
title('Serious Error:\nUnderfitting a Steep Relationship', cex=0.5)
y <- -(x - 0.5)^2 + runif(1000, -0.2, 0.2)
f <- ols(y ~ x)
pp(function(x) -(x - 0.5)^2)
title('Tragic Error:\nMonotonic Fit to\nNon-Monotonic Relationship', cex=0.5)

[Four-panel figure: Mild Error: Overfitting a Flat Relationship; Mild Error: Overfitting a Linear Relationship; Serious Error: Underfitting a Steep Relationship; Tragic Error: Monotonic Fit to a Non-Monotonic Relationship]

Table 4.1: Examples of Reducing the Number of Parameters

Categorical predictor with k levels:                 collapse less frequent categories into "other"
Continuous predictor represented as k-knot r.c. spline:  reduce k to a number as low as 3, or 0 (linear)

4.1.1 Learning From a Saturated Model

When the effective sample size available is sufficiently large so that a saturated main effects model may be fitted, a good approach to gauging predictive potential is the following.

ˆ Let all continuous predictors be represented as restricted cubic splines with k knots, where k is the maximum number of knots the analyst entertains for the current problem.

ˆ Let all categorical predictors retain their original categories except for pooling of very low prevalence categories (e.g., ones containing < 6 observations).

ˆ Fit this general main effects model.

ˆ Compute the partial χ² statistic for testing the association of each predictor with the response, adjusted for all other predictors. In the case of ordinary regression convert partial F statistics to χ² statistics or partial R² values.

ˆ Make corrections for chance associations to "level the playing field" for predictors having greatly varying d.f., e.g., subtract the d.f. from the partial χ² (the expected value of χ²_p is p under H0).

ˆ Make certain that tests of nonlinearity are not revealed as this would bias the analyst.

ˆ Sort the partial association statistics in descending order.

Commands in the rms package can be used to plot only what is needed. Here is an example for a logistic model.

f <- lrm(y ~ sex + race + rcs(age,5) + rcs(weight,5) +
         rcs(height,5) + rcs(blood.pressure,5))
plot(anova(f))

4.1.2 Using Marginal Generalized Rank Correlations

When collinearities or confounding are not problematic, a quicker approach based on pairwise measures of association can be useful. This approach will not have numerical problems (e.g., singular covariance matrix) and is based on:

ˆ A 2 d.f. generalization of Spearman ρ — the R² based on rank(X) and rank(X)² vs. rank(Y)

ˆ ρ² can detect U-shaped relationships

ˆ For categorical X, ρ² is the R² from dummy variables regressed against rank(Y); this is tightly related to the Wilcoxon–Mann–Whitney–Kruskal–Wallis rank test for group differences^a

ˆ Sort variables by descending order of ρ²

ˆ Specify the number of knots for continuous X and combine infrequent categories of categorical X based on ρ²

^a This test statistic does not inform the analyst of which groups are different from one another.
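The Hmisc function spearman2 computes this generalization; with p=2 it uses rank(X) and rank(X)². A short sketch with illustrative variable names:

require(Hmisc)
s <- spearman2(y ~ age + sex + weight + height, data=mydata, p=2)
plot(s)   # dot chart of adjusted rho^2, sorted for comparison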

Allocating d.f. based on partial tests of association or sorted ρ² is a fair procedure because

ˆ We already decided to keep the variable in the model no matter what ρ² or χ² values are seen

ˆ ρ² and χ² do not reveal the degree of nonlinearity; a high value may be due solely to a strong linear effect

ˆ A low ρ² or χ² for a categorical variable might lead to collapsing the most disparate categories

Initial simulations show the procedure to be conservative. Note that one can move from simpler to more complex models but not the other way round.

4.2 Checking Assumptions of Multiple Predictors Simultaneously [strategy-simult]

ˆ Sometimes failure to adjust for other variables gives the wrong transformation of an X, or the wrong significance of interactions

ˆ Sometimes it is unwieldy to deal simultaneously with all predictors at each stage → assess regression assumptions separately for each predictor

4.3

Variable Selection

strategy-selection
ˆ Series of potential predictors with no prior knowledge

ˆ ↑ exploration → ↑ shrinkage (overfitting)


J

ˆ Summary of problem: E(β̂|β̂ “significant” ) 6= β [37]

ˆ Biased R2, β̂, standard errors, P -values too small

ˆ F and χ2 statistics do not have the claimed distributionb [80]

ˆ Will result in residual confounding if use variable selection


to find confounders [84]

ˆ Derksen and Keselman [55] found that in stepwise analyses K

the final model represented noise 0.20-0.74 of time, final


model usually contained < 12 actual number of authentic
predictors. Also:
1. “The degree of correlation between the predictor vari-
ables affected the frequency with which authentic
predictor variables found their way into the final
model.
2. The number of candidate predictor variables affected
the number of noise variables that gained entry to
the model.
b Lockhart et al. [130] provide an example with n = 100 and 10 orthogonal predictors where all true βs are zero. The test statistic for the

first variable to enter has type I error of 0.39 when the nominal α is set to 0.05.

3. The size of the sample was of little practical impor-


tance in determining the number of authentic vari-
ables contained in the final model.
4. The population multiple coefficient of determination
could be faithfully estimated by adopting a statistic
that is adjusted by the total number of candidate
predictor variables rather than the number of vari-
ables in the final model”.

ˆ Global test with p d.f. insignificant → stop

Simulation experiment, true σ² = 6.25, 8 candidate variables, 4
of them related to Y in the population. Select the best model using
all possible subsets regression to maximize R²adj (not usually
recommended but gives variable selection more of a chance to
work in this context).
Note: The audio was made using stepAIC with collinearities
in predictors. The code below allows for several options. Here
we use all possible subsets of predictors and force predictors to
be uncorrelated, which is the easiest case for variable selection.
require(MASS)
require(leaps)

# Note: sim() uses the global flag 'uncorrelated' (set below) to choose
# between independent and collinear predictors
sim ← function(n, sigma=2.5, method=c('stepaic', 'leaps'),
               pr=FALSE, prcor=FALSE, dataonly=FALSE) {
  method ← match.arg(method)
  if(uncorrelated) {
    x1 ← rnorm(n)
    x2 ← rnorm(n)
    x3 ← rnorm(n)
    x4 ← rnorm(n)
    x5 ← rnorm(n)
    x6 ← rnorm(n)
    x7 ← rnorm(n)
    x8 ← rnorm(n)
  }
  else {
    x1 ← rnorm(n)
    x2 ← x1 + 2.0 * rnorm(n)          # was + 0.5 * rnorm(n)
    x3 ← rnorm(n)
    x4 ← x3 + 1.5 * rnorm(n)
    x5 ← x1 + rnorm(n) / 1.3
    x6 ← x2 + 2.25 * rnorm(n)         # was rnorm(n) / 1.3
    x7 ← x3 + x4 + 2.5 * rnorm(n)     # was + rnorm(n)
    x8 ← x7 + 4.0 * rnorm(n)          # was + 0.5 * rnorm(n)
  }
  z ← cbind(x1, x2, x3, x4, x5, x6, x7, x8)
  if(prcor) return(round(cor(z), 2))
  lp ← x1 + x2 + .5 * x3 + .4 * x7
  y ← lp + sigma * rnorm(n)
  if(dataonly) return(list(x=z, y=y))
  if(method == 'leaps') {
    s ← summary(regsubsets(z, y))
    best ← which.max(s$adjr2)
    xvars ← s$which[best, -1]   # remove intercept
    ssr ← s$rss[best]
    p ← sum(xvars)
    xs ← if(p == 0) 'none' else paste((1:8)[xvars], collapse='')
    if(pr) print(xs)
    ssesw ← (n - 1) * var(y) - ssr
    s2s ← ssesw / (n - p - 1)
    yhat ← if(p == 0) mean(y) else fitted(lm(y ∼ z[, xvars]))
  }
  f ← lm(y ∼ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
  if(method == 'stepaic') {
    g ← stepAIC(f, trace=0)
    p ← g$rank - 1
    xs ← if(p == 0) 'none' else
      gsub('[\\+ x]', '', as.character(formula(g))[3])
    if(pr) print(formula(g), showEnv=FALSE)
    ssesw ← sum(resid(g)^2)
    s2s ← ssesw / g$df.residual
    yhat ← fitted(g)
  }
  # Set SSEsw / (n - gdf - 1) = true sigma^2
  gdf ← n - 1 - ssesw / (sigma^2)
  # Compute root mean squared error against true linear predictor
  rmse.full ← sqrt(mean((fitted(f) - lp)^2))
  rmse.step ← sqrt(mean((yhat - lp)^2))
  list(stats=c(n=n, vratio=s2s / (sigma^2), gdf=gdf, apparentdf=p,
               rmse.full=rmse.full, rmse.step=rmse.step),
       xselected=xs)
}

rsim ← function(B, n, method=c('stepaic', 'leaps')) {
  method ← match.arg(method)
  xs ← character(B)
  r ← matrix(NA, nrow=B, ncol=6)
  for(i in 1:B) {
    w ← sim(n, method=method)
    r[i,] ← w$stats
    xs[i] ← w$xselected
  }
  colnames(r) ← names(w$stats)
  s ← apply(r, 2, median)
  p ← r[, 'apparentdf']
  s['apparentdf'] ← mean(p)
  print(round(s, 2))
  print(table(p))
  cat('Prob[correct model]=', round(sum(xs == '1237') / B, 2), '\n')
}

Show the correlation matrix being assumed for the Xs:

uncorrelated ← TRUE
sim(50000, prcor=TRUE)

x1 x2 x3 x4 x5 x6 x7 x8
x1 1.00 0.00 -0.01 0.00 0.01 0.01 0.00 0.01
x2 0.00 1.00 0.00 0.00 0.01 0.00 0.00 0.00
x3 -0.01 0.00 1.00 0.00 0.00 0.00 0.00 -0.01
x4 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.01
x5 0.01 0.01 0.00 0.01 1.00 0.01 0.00 0.00
x6 0.01 0.00 0.00 0.00 0.01 1.00 0.00 0.00
x7 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
x8 0.01 0.00 -0.01 0.01 0.00 0.00 0.01 1.00

Simulate to find the distribution of the number of variables
selected, the proportion of simulations in which the true model
(X1, X2, X3, X7) was found, the median value of σ̂²/σ², the
median effective d.f., and the mean number of apparent d.f.,
for varying sample sizes.
set.seed(11)
m ← 'leaps'   # all possible regressions stopping on R2adj
rsim(100, 20, method=m)   # actual model found twice out of 100

n vratio gdf apparentdf rmse.full rmse.step
20.00 0.94 5.32 4.10 1.62 1.58
p
1 2 3 4 5 6 7 8
3 14 18 22 27 11 4 1
Prob[correct model]= 0.02

rsim(100, 40, method=m)

n vratio gdf apparentdf rmse.full rmse.step
40.00 0.61 17.89 4.38 1.21 1.24
p
2 3 4 5 6 7
9 18 24 29 15 5
Prob[correct model]= 0.04

rsim(100, 150, method=m)

n vratio gdf apparentdf rmse.full rmse.step
150.00 0.44 85.99 5.01 0.59 0.57
p
2 3 4 5 6 7 8
1 5 27 35 24 7 1
Prob[correct model]= 0.2

rsim(100, 300, method=m)

n vratio gdf apparentdf rmse.full rmse.step
300.00 0.42 177.01 5.16 0.43 0.40
p
4 5 6 7 8
27 42 20 10 1
Prob[correct model]= 0.26

rsim(100, 2000)

n vratio gdf apparentdf rmse.full rmse.step
2000.00 1.00 6.43 4.58 0.17 0.15
p
4 5 6 7
53 37 9 1
Prob[correct model]= 0.53

As n ↑ the mean number of variables selected increased. The
proportion of simulations in which the correct model was found
increased from 0.02 to 0.53. σ² is underestimated when n = 300
by a factor of 0.42, resulting in the d.f. needed to de-bias σ̂²
being greater than n when the apparent d.f. was only 5.16 on
the average. Variable selection slightly increased closeness to
the true Xβ.
If the simulations are re-run allowing for collinearities (uncorre-
lated=FALSE) one can expect variable selection to be even more
problematic.
Variable selection methods [88]:

ˆ Forward selection, backward elimination

ˆ Stopping rule: “residual χ²” with d.f. = no. candidates re-
maining at current step

ˆ Test for significance or use Akaike’s information criterion
(AIC [7]), here χ² − 2 × d.f.

ˆ Better to use subject matter knowledge!

ˆ No currently available stopping rule was developed for step-


wise, only for comparing a limited number of pre-specified
models [26, Section 1.3]

ˆ Roecker [163] studied forward selection (FS), all possible sub-

sets selection (APS), full fits

ˆ APS more likely to select smaller, less accurate models than


FS

ˆ Neither as accurate as full model fit unless more than half


of candidate variables redundant or unnecessary

ˆ Step-down is usually better than forward [133] and can be


used efficiently with maximum likelihood estimation [118]

ˆ Fruitless to try different stepwise methods to look for agree-


ment [216]
ˆ Bootstrap can help decide between full and reduced model

ˆ Full model fits gives meaningful confidence intervals with


standard formulas, C.I. after stepwise does not [3, 101, 26]

ˆ Data reduction (grouping variables) can help

ˆ Using the bootstrap to select important variables for inclu-


sion in the final model [167] is problematic [8]

ˆ It is not logical that a population regression coefficient would


be exactly zero just because its estimate was “insignificant”

See also these articles:

ˆ Step away from stepwise by Gary Smith

ˆ Five myths about variable selection by Georg Heinze and


Daniela Dunkler

ˆ Variable selection - A review and recommendations for the


practicing statistician by Georg Heinze, Christine Wallisch,
Daniela Dunkler

ˆ Stopping stepwise by Peter Flom



4.3.1 Maxwell’s Demon as an Analogy to Variable Selection

Some of the information in the data is spent on variable selec-


tion instead of using all information for estimation.
Model specification is preferred to model selection.
Information content of the data usually insufficient for reliable
variable selection.

James Clerk Maxwell


Maxwell imagines one container divided into two parts, A and B. Both parts are filled
with the same gas at equal temperatures and placed next to each other. Observing the
molecules on both sides, an imaginary demon guards a trapdoor between the two parts.
When a faster-than-average molecule from A flies towards the trapdoor, the demon
opens it, and the molecule will fly from A to B. Likewise, when a slower-than-
average molecule from B flies towards the trapdoor, the demon will let it pass from B to
A. The average speed of the molecules in B will have increased while in A they will have
slowed down on average. Since average molecular speed corresponds to temperature,
the temperature decreases in A and increases in B, contrary to the second law of
thermodynamics.

Szilárd pointed out that a real-life Maxwell’s demon would need to have some means of
measuring molecular speed, and that the act of acquiring information would require an

expenditure of energy. Since the demon and the gas are interacting, we must consider
the total entropy of the gas and the demon combined. The expenditure of energy by
the demon will cause an increase in the entropy of the demon, which will be larger than
the lowering of the entropy of the gas.

Source: commons.wikimedia.org/wiki/File:YoungJamesClerkMaxwell.jpg
en.wikipedia.org/wiki/Maxwell’s_demon

Peter Ellis’ blog article contains excellent examples of issues


discussed here but applied to time series modeling.

4.4 Overfitting and Limits on Number of Predictors

ˆ Concerned with avoiding overfitting

ˆ Assume typical problem in medicine, epidemiology, and the

social sciences in which the signal:noise ratio is small (higher


ratios allow for more aggressive modeling)

ˆ p should be < m/15 [90, 91, 175, 146, 145, 203, 195]

ˆ p = number of parameters in full model or number of can-


didate parameters in a stepwise analysis

ˆ Derived from simulations to find minimum sample size so


that apparent discrimination = validated discrimination

ˆ Applies to typical signal:noise ratios found outside of tightly


controlled experiments

ˆ If true R2 is high, many parameters can be estimated from


smaller samples

ˆ Ignores sample size needed just to estimate the intercept or,


in semiparametric models, the underlying distribution func-
tionc
c The sample size needed for these is model-dependent

ˆ Riley et al. [161, 162] have refined sample size estimation for
continuous, binary, and time-to-event models to account for
all of these issues

ˆ To just estimate σ in a linear model with a multiplicative


margin of error of 1.2 with 0.95 confidence requires n = 70

Table 4.2: Limiting Sample Sizes for Various Response Variables

Type of Response Variable    Limiting Sample Size m
Continuous                   n (total sample size)
Binary                       min(n1, n2) a
Ordinal (k categories)       n − (1/n²) Σ ni³ (over i = 1, . . . , k) b
Failure (survival) time      number of failures c

a If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the
proportional odds assumption holds, the effective sample size for a binary response is 3n1 n2 /n ≈ 3 min(n1 , n2 ) if n1 /n is near 0 or 1 [215, Eq. 10,
15]. Here n1 and n2 are the marginal frequencies of the two response levels [145].
b Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are n1 , . . . , nk , compared
with all cell sizes equal to unity (response is continuous) [215, Eq. 3]. If all cell sizes are equal, the relative efficiency of having k response categories
compared to a continuous response is 1 − 1/k² [215, Eq. 14], e.g., a 5-level response is almost as efficient as a continuous one if proportional odds
holds across category cutoffs.
c This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-
proportional hazards methods such as Wilcoxon-type tests [16].
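
As a quick arithmetic check on the 15:1 rule and the formulas in
Table 4.2, here is a minimal sketch using hypothetical marginal
frequencies:

n1 ← 45; n2 ← 255              # binary response: 45 events, 255 non-events
min(n1, n2) / 15               # about 3 candidate parameters supportable

ni ← c(100, 80, 60, 40, 20)    # ordinal response with k = 5 categories
n ← sum(ni)
(n - sum(ni^3) / n^2) / 15     # limiting m for ordinal Y, divided by 15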

ˆ Narrowly distributed predictor → even higher n

ˆ p includes all variables screened for association with re-


sponse, including interactions

ˆ Univariable screening (graphs, crosstabs, etc.) in no way


reduces multiple comparison problems of model building [187]

4.5 Shrinkage

ˆ Slope of calibration plot; regression to the mean

ˆ Statistical estimation procedure — “pre-shrunk” models
ˆ Aren’t regression coefficients OK because they’re unbiased?

ˆ Problem is in how we use coefficient estimates

ˆ Consider 20 samples of size n = 50 from U (0, 1)

ˆ Compute group means and plot in ascending order

ˆ Equivalent to fitting an intercept and 19 dummies using least


squares

ˆ Result generalizes to general problems in plotting Y vs. X β̂


set.seed(123)
n ← 50
y ← runif(20 * n)
group ← rep(1:20, each=n)
ybar ← tapply(y, group, mean)
ybar ← sort(ybar)
plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3, .7),
     xlab='Group', ylab='Group Mean')
lines(1:20, ybar)
points(1:20, ybar, pch=20, cex=.5)
axis(2)
axis(1, at=1:20, labels=FALSE)
for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])
abline(h=.5, col=gray(.85))

ˆ Prevent shrinkage by using pre-shrinkage

Figure 4.1: Sorted means from 20 samples of size 50 from a uniform [0, 1] distribution. The reference line at 0.5 depicts the true
population value of all of the means.

ˆ Spiegelhalter [178]: var. selection arbitrary, better prediction


usually results from fitting all candidate variables and using
shrinkage

ˆ Shrinkage closer to that expected from full model fit than


based on number of significant variables [48]

ˆ Ridge regression [119, 197]

ˆ Penalized MLE [200, 81, 93]

ˆ Heuristic shrinkage parameter of van Houwelingen and le
Cessie [197, Eq. 77]

    γ̂ = (model χ² − p) / model χ²

ˆ OLSd: γ̂ = [(n − p − 1)/(n − 1)] R²adj / R², where
R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1)

ˆ p close to no. candidate variables

ˆ Copas [48, Eq. 8.5] adds 2 to numerator

d An excellent discussion about such indexes may be found here.
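
A minimal sketch of computing γ̂ from a fitted rms model, assuming
f is an lrm fit already in hand:

lr ← f$stats['Model L.R.']   # model likelihood ratio chi-square
p  ← f$stats['d.f.']
gamma.hat ← (lr - p) / lr    # values well below 0.9 suggest overfitting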



4.6 Collinearity

ˆ When at least 1 predictor can be predicted well from others

ˆ Can be a blessing (data reduction, transformations)

ˆ ↑ s.e. of β̂, ↓ power

ˆ This is appropriate → asking too much of the data [38,


Chap. 9]

ˆ Variables compete in variable selection, chosen one arbitrary

ˆ Does not affect joint influence of a set of highly correlated


variables (use multiple d.f. tests)

ˆ Does not at all affect predictions on model construction sam-
ple

ˆ Does not affect predictions on new data [140, pp. 379-381] if


1. Extreme extrapolation not attempted
2. New data have same type of collinearities as original data

ˆ Example: LDL and total cholesterol – problem only if more


inconsistent in new data

ˆ Example: age and age² – no problem

ˆ One way to quantify for each predictor: variance inflation
factors (VIF)

ˆ General approach (maximum likelihood) — transform in-
formation matrix to correlation form; VIF = diagonal of
inverse [53, 210] (see the sketch below)

ˆ See Belsley [15, pp. 28-30] for problems with VIF
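
A minimal sketch using the rms vif function, assuming f is an
existing fit from ols, lrm, or cph:

require(rms)
vif(f)   # VIFs from the correlation-form information matrix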

ˆ Easy approach: SAS VARCLUS procedure [166], R varclus
function, other clustering techniques: group highly corre-
lated variables

ˆ Can score each group (e.g., first principal component, P C1 [52]);


summary scores not collinear
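
A sketch of variable clustering with the Hmisc varclus function,
using hypothetical predictors in a data frame d:

require(Hmisc)
v ← varclus(∼ age + weight + height + blood.pressure, data=d)
plot(v)   # dendrogram based on squared Spearman correlations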

4.7 Data Reduction

ˆ Unless n >> p, model unlikely to validate

ˆ Data reduction: ↓ p

ˆ Use the literature to eliminate unimportant variables.

ˆ Eliminate variables whose distributions are too narrow.

ˆ Eliminate candidate predictors that are missing in a large


number of subjects, especially if those same predictors are
likely to be missing for future applications of the model.

ˆ Use a statistical data reduction method such as incomplete


principal components regression, nonlinear generalizations
of principal components such as principal surfaces, sliced in-
verse regression, variable clustering, or ordinary cluster anal-
ysis on a measure of similarity between variables.

ˆ Data reduction is completely masked to Y , which is


precisely why it does not distort estimates, standard errors,
P -values, or confidence limits

ˆ Data reduction = unsupervised learning

ˆ Example: dataset with 40 events and 60 candidate predictors



– Use variable clustering to group variables by correlation


structure

– Use clinical knowledge to refine the clusters

– Keep age and severity of disease as separate predictors


because of their strength

– For others create clusters: socioeconomic, risk factors/his-


tory, and physiologic function

– Summarize each cluster with its first principal component


P C1, i.e., the linear combination of characteristics that
maximizes variance of the score across subjects subject
to an overall constraint on the coefficients

– Fit outcome model with 5 predictors

4.7.1 Redundancy Analysis

ˆ Remove variables that have poor distributions


– E.g., categorical variables with fewer than 2 categories
having at least 20 observations

ˆ Use flexible additive parametric additive models to determine


how well each variable can be predicted from the remaining
variables

ˆ Variables dropped in stepwise fashion, removing the most


predictable variable at each step

ˆ Remaining variables used to predict

ˆ Process continues until no variable still in the list of pre-
dictors can be predicted with an R² or adjusted R² greater
than a specified threshold, or until dropping the variable with
the highest R² (adjusted or ordinary) would cause a variable
that was dropped earlier to no longer be predicted at the
threshold from the now smaller list of predictors

ˆ R function redun in Hmisc package

ˆ Related to principal variables [136] but faster
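
A minimal redundancy-analysis sketch with redun, again using
hypothetical variables in a data frame d:

require(Hmisc)
r ← redun(∼ age + weight + height + blood.pressure, data=d, r2=0.9)
r   # shows variables dropped stepwise and how well each was predicted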

4.7.2 Variable Clustering

ˆ Goal: Separate variables into groups


– variables within group correlated with each other

– variables not correlated with non-group members

ˆ Score each dimension, stop trying to separate effects of fac-


tors measuring same phenomenon

ˆ Variable clustering [166, 52] (oblique-rotation PC analysis)
→ separate variables so that first PC is representative of
group

ˆ Can also do hierarchical cluster analysis on similarity matrix


based on squared Spearman or Pearson correlations, or more
generally, Hoeffding’s D [98].

ˆ See [85] for a method related to variable clustering and


sparse principal components.

ˆ [39] implement many more variable clustering methods

Example: Figure 11.6


4.7.3 Transformation and Scaling Variables Without Using Y

ˆ Reduce p by estimating transformations using associations
with other predictors
ˆ Purely categorical predictors – correspondence analysis [120,
51, 40, 83, 137]

ˆ Mixture of qualitative and continuous variables: qualitative


principal components

ˆ Maximum total variance (MTV) of Young, Takane, de Leeuw [222,


137]

1. Compute P C1 of variables using correlation matrix


2. Use regression (with splines, dummies, etc.) to predict
P C1 from each X — expand each Xj and regress it
separately on P C1 to get working transformations
3. Recompute P C1 on transformed Xs
4. Repeat 3-4 times until variation explained by P C1 plateaus
and transformations stabilize

ˆ Maximum generalized variance (MGV) method of Sarle [115,


pp. 1267-1268]
1. Predict each variable from (current transformations of)
all other variables
2. For each variable, expand it into linear and nonlinear
terms or dummies, compute first canonical variate
3. For example, if there are only two variables X1 and X2
   represented as quadratic polynomials, solve for a, b, c, d
   such that aX1 + bX1² has maximum correlation with
   cX2 + dX2².
4. Goal is to transform each var. so that it is most similar
to predictions from other transformed variables
5. Does not rely on PCs or variable clustering

ˆ MTV (PC-based instead of canonical var.) and MGV im-


plemented in SAS PROC PRINQUAL [115]
1. Allows flexible transformations including monotonic splines
2. Does not allow restricted cubic splines, so may be unsta-
ble unless monotonicity assumed

3. Allows simultaneous imputation but often yields wild es-


timates
4.7.4 Simultaneous Transformation and Imputation

R transcan Function for Data Reduction & Imputation

ˆ Initialize missings to medians (or most frequent category)

ˆ Initialize transformations to original variables

ˆ Take each variable in turn as Y

ˆ Exclude obs. missing on Y

ˆ Expand Y (spline or dummy variables)

ˆ Score (transform Y ) using first canonical variate

ˆ Missing Y → predict canonical variate from Xs

ˆ The imputed values can optionally be shrunk to avoid over-


fitting for small n or large p

ˆ Constrain imputed values to be in range of non-imputed


ones

ˆ Imputations on original scale



1. Continuous → back-solve with linear interpolation


2. Categorical → classification tree (most freq. cat.) or
match to category whose canonical score is closest to
one predicted

ˆ Multiple imputation — bootstrap or approx. Bayesian boot.


1. Sample residuals multiple times (default M = 5)
2. Are on “optimally” transformed scale
3. Back-transform
4. fit.mult.impute works with aregImpute and transcan out-
put to easily get imputation-corrected variances and avg.
β̂

ˆ Option to insert constants as imputed values (ignored during


transformation estimation); helpful when a lab value may be
missing because the patient returned to normal

ˆ Imputations and transformed values may be easily obtained


for new data

ˆ An R function Function will create a series of R functions


that transform each predictor

ˆ Example: n = 415 acutely ill patients


1. Relate heart rate to mean arterial blood pressure
2. Two blood pressures missing
3. Heart rate not monotonically related to blood pressure

4. See Figures 4.2 and 4.3


require(Hmisc)
getHdata(support)   # Get data frame from web site
heart.rate ← support$hrt
blood.pressure ← support$meanbp
blood.pressure[400:401]

Mean Arterial Blood Pressure Day 3
[1] 151 136

blood.pressure[400:401] ← NA   # Create two missings
d ← data.frame(heart.rate, blood.pressure)
par(pch=46)   # Figure 4.2
w ← transcan(∼ heart.rate + blood.pressure, transformed=TRUE,
             imputed=TRUE, show.na=TRUE, data=d)

Convergence criterion: 2.901 0.035 0.007
Convergence in 4 iterations
R2 achieved in predicting each variable:

heart.rate blood.pressure
     0.259          0.259

Adjusted R2:

heart.rate blood.pressure
     0.254          0.253

w$imputed$blood.pressure

400 401
132.4057 109.7741

t ← w$transformed
spe ← round(c(spearman(heart.rate, blood.pressure),
              spearman(t[, 'heart.rate'],
                       t[, 'blood.pressure'])), 2)

plot(heart.rate, blood.pressure)   # Figure 4.3
plot(t[, 'heart.rate'], t[, 'blood.pressure'],
     xlab='Transformed hr', ylab='Transformed bp')

ACE (Alternating Conditional Expectation) of Breiman and
Friedman [25]

1. Uses nonparametric “super smoother” [72]

2. Allows monotonicity constraints, categorical vars.

Figure 4.2: Transformations fitted using transcan. Tick marks indicate the two imputed values for blood pressure.

Figure 4.3: The lower left plot contains raw data (Spearman ρ = −0.02); the lower right is a scatterplot of the corresponding
transformed values (ρ = −0.13). Data courtesy of the SUPPORT study [109].
3. Does not handle missing data

ˆ These methods find marginal transformations

ˆ Check adequacy of transformations using Y


1. Graphical
2. Nonparametric smoothers (X vs. Y )
3. Expand original variable using spline, test additional pre-
dictive information over original transformation

4.7.5 Simple Scoring of Variable Clusters

ˆ Try to score groups of transformed variables with P C1

ˆ Reduces d.f. by pre-transforming var. and by combining
multiple var.

ˆ Later may want to break group apart, but delete all variables
in groups whose summary scores do not add significant in-
formation

ˆ Sometimes simplify a cluster score by finding a subset of its
constituent variables which predicts it with high R² (see the
sketch below)
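
A minimal sketch of scoring one cluster by its first principal
component, assuming hypothetical constituent variables x1, x2, x3:

pc1 ← prcomp(cbind(x1, x2, x3), scale.=TRUE)$x[, 1]   # P C1 cluster score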

Series of dichotomous variables:

ˆ Construct X1 = 0-1 according to whether any variables pos-
itive

ˆ Construct X2 = number of positives

ˆ Test whether original variables add to X1 or X2

4.7.6 Simplifying Cluster Scores

4.7.7 How Much Data Reduction Is Necessary?

Using Expected Shrinkage to Guide Data Reduction

ˆ Fit full model with all candidates, p d.f., LR = likelihood ratio χ²

ˆ Compute γ̂

ˆ If < 0.9, consider shrunken estimator from whole model, or


data reduction (again not using Y )

ˆ q regression d.f. for reduced model

ˆ Assume best case: discarded dimensions had no association


with Y

ˆ Expected loss in LR is p − q

ˆ New shrinkage [LR − (p − q) − q]/[LR − (p − q)]

ˆ Solve for q → q ≤ (LR − p)/9

ˆ Under these assumptions, no hope unless original LR > p + 9

ˆ No χ² lost by dimension reduction → q ≤ LR/10

Example:

ˆ Binary logistic model, 45 events on 150 subjects

ˆ 10:1 rule → analyze 4.5 d.f. total

ˆ Analyst wishes to include age, sex, 10 others

ˆ Not known if age linear or if age and sex additive

ˆ 4 knots → 3+1+1 d.f. for age and sex if restrict interaction


to be linear

ˆ Full model with 15 d.f. has LR=50

ˆ Expected shrinkage factor (50 − 15)/50 = 0.7

ˆ LR > 15 + 9 = 24 → reduction may help

ˆ Reduction to q = (50 − 15)/9 ≈ 4 d.f. necessary

ˆ Have to assume age linear, reduce other 10 to 1 d.f.


ˆ Separate hypothesis tests intended → use full model, adjust
for multiple comparisons
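
The arithmetic of this example is easy to verify directly; a sketch:

LR ← 50; p ← 15
(LR - p) / LR   # expected shrinkage factor: 0.7
(LR - p) / 9    # approximate d.f. budget q: about 4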
Summary of Some Data Reduction Methods

Goal: Group predictors so that each group represents a single
dimension that can be summarized with a single score
  Reasons: ↓ d.f. arising from multiple predictors; make P C1 a
  more reasonable summary
  Methods: subject matter knowledge; group predictors to maximize
  proportion of variance explained by P C1 of each group;
  hierarchical clustering using a matrix of similarity measures
  between predictors

Goal: Transform predictors
  Reasons: ↓ d.f. due to nonlinear and dummy variable components;
  allows predictors to be optimally combined; make P C1 a more
  reasonable summary; use in customized model for imputing
  missing values on each predictor
  Methods: maximum total variance on a group of related predictors;
  canonical variates on the total set of predictors

Goal: Score a group of predictors
  Reasons: ↓ d.f. for group to unity
  Methods: P C1; simple point scores

Goal: Multiple dimensional scoring of all predictors
  Reasons: ↓ d.f. for all predictors combined
  Methods: principal components 1, 2, . . . , k, k < p computed from
  all transformed predictors

4.8 Other Approaches to Predictive Modeling

4.9 Overly Influential Observations

ˆ Every observation should influence fit

ˆ Major results should not rest on 1 or 2 obs.

ˆ Overly infl. obs. → ↑ variance of predictions

ˆ Also affects variable selection

Reasons for influence:

ˆ Too few observations for complexity of model (see Sections


4.7, 4.3)

ˆ Data transcription or entry errors

ˆ Extreme values of a predictor


1. Sometimes subject so atypical should remove from dataset
2. Sometimes truncate measurements where data density
ends
3. Example: n = 4000, 2000 deaths, white blood count
range 500-100,000, .05,.95 quantiles=2755, 26700

4. Linear spline function fit
5. Sensitive to WBC > 60000 (n = 16)
6. Predictions stable if truncate WBC to 40000 (n = 46
   above 40000)

ˆ Disagreements between predictors and response. Ignore un-


less extreme values or another explanation

ˆ Example: n = 8000, one extreme predictor value not on
straight line relationship with other (X, Y ) → χ² = 36 for
H0 : linearity

Statistical Measures:

ˆ Leverage: capacity to be influential (not necessarily infl.)
Diagonals of “hat matrix” H = X(X′X)⁻¹X′ — measures
how an obs. predicts its own response [14]

ˆ hii > 2(p + 1)/n may signal a high leverage point [14]

ˆ DFBETAS: change in β̂ upon deletion of each obs, scaled


by s.e.

ˆ DFFIT: change in X β̂ upon deletion of each obs

ˆ DFFITS: DFFIT standardized by s.e. of β̂

ˆ Some classify obs as overly influential when
|DFFITS| > 2√[(p + 1)/(n − p − 1)] [14]

ˆ Others examine entire distribution for “outliers”

ˆ No substitute for careful examination of data [36, 177]

ˆ Maximum likelihood estimation requires 1-step approxima-


tions
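
A sketch of locating influential observations with rms, assuming a
hypothetical model fitted with x=TRUE, y=TRUE:

require(rms)
f ← ols(y ∼ x1 + x2, data=d, x=TRUE, y=TRUE)
which.influence(f, cutoff=0.2)   # observations with large scaled dfbetas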

4.10 Comparing Two Models

ˆ Level playing field (independent datasets, same no. candi-
date d.f., careful bootstrapping)
ˆ Criteria:
1. calibration
2. discrimination
3. face validity
4. measurement errors in required predictors
5. use of continuous predictors (which are usually better de-
fined than categorical ones)
6. omission of “insignificant” variables that nonetheless make
sense as risk factors
7. simplicity (though this is less important with the avail-
ability of computers)
8. lack of fit for specific types of subjects

ˆ Goal is to rank-order: ignore calibration

ˆ Otherwise, dismiss a model having poor calibration

ˆ Good calibration → compare discrimination (e.g., R² [141],
model χ², Somers’ Dxy , Spearman’s ρ, area under ROC
curve)

ˆ Worthwhile to compare models on a measure not used to
optimize either model, e.g., mean absolute error or median
absolute error if using OLS

ˆ Rank measures may not give enough credit to extreme pre-
dictions → model χ², R², examine extremes of distribution
of Ŷ

ˆ Examine differences in predicted values from the two models

ˆ See [147, 150, 149, 148] for discussions and examples of low
power for testing differences in ROC areas, and for other
approaches.

4.11 Improving the Practice of Multivariable Prediction

See also Section 5.6.
Greenland [84] discusses many important points:

ˆ Stepwise variable selection on confounders leaves important


confounders uncontrolled

ˆ Shrinkage is far superior to variable selection

ˆ Variable selection does more damage to confidence interval


widths than to point estimates

ˆ Claims about unbiasedness of ordinary MLEs are misleading


because they assume the model is correct and is the only
model entertained

ˆ “models need to be complex to capture uncertainty about


the relations . . . an honest uncertainty assessment requires
parameters for all effects that we know may be present.
This advice is implicit in an antiparsimony principle often
attributed to L. J. Savage ‘All models should be as big as
an elephant’ (see Draper, 1995)”

Greenland’s example of inadequate adjustment for confounders
as a result of using a bad modeling strategy:

ˆ Case-control study of diet, food constituents, breast cancer

ˆ 140 cases, 222 controls

ˆ 35 food constituent intakes and 5 confounders

ˆ Food intakes are correlated

ˆ Traditional stepwise analysis not adjusting simultaneously


for all foods consumed → 11 foods had P < 0.05

ˆ Full model with all 35 foods competing → 2 had P < 0.05

ˆ Rigorous simultaneous analysis (hierarchical random slopes


model) penalizing estimates for the number of associations
examined → no foods associated with breast cancer
Global Strategies

ˆ Use a method known not to work well (e.g., stepwise vari-


able selection without penalization; recursive partitioning),
document how poorly the model performs (e.g. using the
bootstrap), and use the model anyway

ˆ Develop a black box model that performs poorly and is dif-


ficult to interpret (e.g., does not incorporate penalization)

ˆ Develop a black box model that performs well and is difficult


to interpret

ˆ Develop interpretable approximations to the black box

ˆ Develop an interpretable model (e.g. give priority to additive


effects) that performs well and is likely to perform equally
well on future data from the same stream
Preferred Strategy in a Nutshell

ˆ Decide how many d.f. can be spent

ˆ Decide where to spend them

ˆ Spend them

ˆ Don’t reconsider, especially if inference needed


4.12 Summary: Possible Modeling Strategies

4.12.1 Developing Predictive Models
1. Assemble accurate, pertinent data and lots of it, with wide
distributions for X.
2. Formulate good hypotheses — specify relevant candidate
   predictors and possible interactions. Don’t use Y to decide
   which X’s to include.
3. Characterize subjects with missing Y . Delete such subjects
in rare circumstances [50]. For certain models it is effective
to multiply impute Y .
4. Characterize and impute missing X. In most cases use mul-
tiple imputation based on X and Y
5. For each predictor specify complexity or degree of nonlinear-
ity that should be allowed (more for important predictors or
for large n) (Section 4.1)
6. Do data reduction if needed (pre-transformations, combina-
tions), or use penalized estimation [93]
7. Use the entire sample in model development
8. Can do highly structured testing to simplify “initial” model
(a) Test entire group of predictors with a single P -value

(b) Make each continuous predictor have same number of


knots, and select the number that optimizes AIC
(c) Test the combined effects of all nonlinear terms with a
single P -value
9. Make tests of linearity of effects in the model only to demon-
strate to others that such effects are often statistically sig-
nificant. Don’t remove individual insignificant effects from
the model.
10. Check additivity assumptions by testing pre-specified inter-
action terms. Use a global test and either keep all or delete
all interactions.
11. Check to see if there are overly-influential observations.
12. Check distributional assumptions and choose a different model
if needed.
13. Do limited backwards step-down variable selection if parsi-
    mony is more important than accuracy [178]. But confidence
    limits, etc., must account for variable selection (e.g., boot-
    strap).
14. This is the “final” model.
15. Interpret the model graphically and by computing predicted
values and appropriate test statistics. Compute pooled tests
of association for collinear predictors.
16. Validate this model for calibration and discrimination ability,
preferably using bootstrapping.

17. Shrink parameter estimates if there is overfitting but no fur-


ther data reduction is desired (unless shrinkage built-in to
estimation)
18. When missing values were imputed, adjust final variance-
covariance matrix for imputation. Do this as early as possi-
ble because it will affect other findings.
19. When all steps of the modeling strategy can be automated,
consider using Faraway’s method [68] to penalize for the ran-
domness inherent in the multiple steps.
20. Develop simplifications to the final model as needed.

4.12.2 Developing Models for Effect Estimation

1. Less need for parsimony; even less need to remove insignifi-


cant variables from model (otherwise CLs too narrow)
2. Careful consideration of interactions; inclusion forces esti-
mates to be conditional and raises variances
3. If variable of interest is mostly the one that is missing, mul-
tiple imputation less valuable
4. Complexity of main variable specified by prior beliefs, com-
promise between variance and bias
5. Don’t penalize terms for variable of interest
6. Model validation less necessary

4.12.3 Developing Models for Hypothesis Testing

1. Virtually same as previous strategy


2. Interactions require tests of effect by varying values of an-
other variable, or“main effect + interaction”joint tests (e.g.,
is treatment effective for either sex, allowing effects to be
different)
3. Validation may help quantify overadjustment
Chapter 5

Describing, Resampling, Validating, and Simplifying the Model

5.1 Describing the Fitted Model

5.1.1 Interpreting Effects

ˆ Regression coefficients if 1 d.f. per factor, no interaction

ˆ Not standardized regression coefficients

ˆ Many programs print meaningless estimates such as the effect
of increasing age² by one unit, holding age constant

ˆ Need to account for nonlinearity, interaction, and use mean-


ingful ranges

ˆ For monotonic relationships, estimate X β̂ at quartiles of
continuous variables, separately for various levels of inter-
acting factors

ˆ Subtract estimates, anti-log, e.g., to get inter-quartile-range
odds or hazards ratios. Base C.L. on s.e. of difference. See
Figure 12.10.
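
In rms the summary method automates this; a sketch for an existing
fit f, assuming datadist has been set up for the data:

dd ← datadist(d); options(datadist='dd')   # needed for default ranges
s ← summary(f)   # inter-quartile-range effects, antilogged where relevant
plot(s)          # cf. Figure 12.10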

ˆ Partial effect plot: Plot effect of each predictor on Xβ or


some transformation. See Figure 12.8. See also [106].

ˆ Nomogram. See Figure 12.12.

ˆ Use regression tree to approximate the full model

5.1.2 Indexes of Model Performance

Error Measures

ˆ Central tendency of prediction errors

– Mean absolute prediction error: mean |Y − Ŷ |

– Mean squared prediction error

* Binary Y : Brier score (quadratic proper scoring rule)

– Logarithmic proper scoring rule (avg. log-likelihood)

ˆ Discrimination measures

– Pure discrimination: rank correlation of (Ŷ , Y )

* Spearman ρ, Kendall τ , Somers’ Dxy

* Y binary → Dxy = 2 × (C − 1/2), where
  C = concordance probability = area under receiver op-
  erating characteristic curve ∝ Wilcoxon-Mann-Whitney
  statistic

– Mostly discrimination: R²

* R²adj — overfitting corrected if model pre-specified

– Brier score can be decomposed into discrimination and
  calibration components

– Discrimination measures based on variation in Ŷ

* regression sum of squares

* g–index

ˆ Calibration measures

– calibration–in–the–large: average Ŷ vs. average Y

– high-resolution calibration curve (calibration–in–the–small).
  See Figure 9.7.

– calibration slope and intercept

– maximum absolute calibration error

– mean absolute calibration error

– 0.9 quantile of calibration error

See Van Calster et al. [193] for a nice discussion of different levels
of calibration stringency and their relationship to likelihood of
errors in decision making.
g–Index

ˆ Based on Gini’s mean difference

– mean over all possible i ≠ j of |Zi − Zj |

– interpretable, robust, highly efficient measure of variation

ˆ g = Gini’s mean difference of Xi β̂ = Ŷ

ˆ Example: Y = systolic blood pressure; g = 11mmHg is


typical difference in Ŷ

ˆ Independent of censoring etc.

ˆ For models in which the anti-log of a difference in Ŷ represents
a meaningful ratio (odds ratios, hazard ratios, ratios of medi-
ans):
gr = exp(g)

ˆ For models in which Ŷ can be turned into a probability
estimate (e.g., logistic regression):
gp = Gini’s mean difference of P̂

ˆ These g–indexes represent e.g. “typical” odds ratios, “typical”


risk differences

ˆ Can define partial g
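
Gini's mean difference is available as the Hmisc GiniMd function; a
sketch for an existing fit f:

require(Hmisc)
g ← GiniMd(predict(f))   # g-index on the linear predictor scale
exp(g)                   # gr, when antilogs represent meaningful ratios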


5.2 The Bootstrap

ˆ If know population model, use simulation or analytic deriva-
tions to study behavior of statistical estimator

ˆ Suppose Y has a cumulative dist. fctn. F (y) = Prob{Y ≤ y}

ˆ We have a sample of size n from F (y):
  Y1 , Y2 , . . . , Yn

ˆ Steps:
1. Repeatedly simulate sample of size n from F
2. Compute statistic of interest
3. Study behavior over B repetitions

ˆ Example: 1000 samples, 1000 sample medians, compute
their sample variance

ˆ F unknown → estimate by the empirical dist. fctn.

    Fn (y) = (1/n) Σ [Yi ≤ y], summing over i = 1, . . . , n
ˆ Example: sample of size n = 30 from a normal distribution
with mean 100 and SD 10
set.seed(6)
x ← rnorm(30, 100, 20)
xs ← seq(50, 150, length=150)
cdf ← pnorm(xs, 100, 20)
plot(xs, cdf, type='l', ylim=c(0, 1),
     xlab=expression(x),
     ylab=expression(paste("Prob[", X ≤ x, "]")))
lines(ecdf(x), cex=.5)

Figure 5.1: Empirical and population cumulative distribution function

ˆ Fn corresponds to a density function placing probability 1/n at
each observed data point (k/n if a point is duplicated k times)

ˆ Pretend that F ≡ Fn

ˆ Sampling from Fn ≡ sampling with replacement from ob-


served data Y1, . . . , Yn

ˆ Large n → selects 1 − e⁻¹ ≈ 0.632 of the original data points
in each bootstrap sample at least once

ˆ Some observations not selected, others selected more than


once

ˆ Efron’s bootstrap → general-purpose technique for estimat-
ing properties of estimators without assuming or knowing the
distribution of the data F

ˆ Take B samples of size n with replacement, choosing B so
that the summary measure over individual statistics ≈ the summary if
B = ∞

ˆ Bootstrap based on distribution of observed differences be-


tween a resampled parameter estimate and the original es-
timate telling us about the distribution of unobservable dif-
ferences between the original estimate and the unknown pa-
rameter

Example: Data (1, 5, 6, 7, 8, 9), obtain 0.80 confidence interval
for the population median, and an estimate of the population expected
value of the sample median (only to estimate the bias in the original
estimate of the median).
options(digits=3)
y ← c(2,5,6,7,8,9,10,11,12,13,14,19,20,21)
y ← c(1,5,6,7,8,9)   # the smaller dataset overrides and is the one used
set.seed(17)
n ← length(y)
n2 ← n / 2
n21 ← n2 + 1
B ← 400
M ← double(B)
plot(0, 0, xlim=c(0, B), ylim=c(3, 9),
     xlab="Bootstrap Samples Used",
     ylab="Mean and 0.1, 0.9 Quantiles", type="n")
for(i in 1:B) {
  s ← sample(1:n, n, replace=TRUE)
  x ← sort(y[s])
  m ← .5 * (x[n2] + x[n21])
  M[i] ← m
  if(i ≤ 20) {
    w ← as.character(x)
    cat(w, "&&", sprintf('%.1f', m),
        if(i < 20) "\\\\\n" else "\\\\ \\hline\n",
        file='~/doc/rms/validate/tab.tex', append=i > 1)
  }
  points(i, mean(M[1:i]), pch=46)
  if(i ≥ 10) {
    q ← quantile(M[1:i], c(.1, .9))
    points(i, q[1], pch=46, col='blue')
    points(i, q[2], pch=46, col='blue')
  }
}
table(M)

M
1 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9
2 7 6 2 1 30 45 59 72 70 45 48 8 5

hist(M, nclass=length(unique(M)), xlab="", main="")

Figure 5.2: Estimating properties of sample median using the bootstrap

First 20 samples:

Bootstrap Sample    Sample Median
1 5 5 7 8 9         6.0
1 1 5 7 9 9         6.0
6 7 7 8 9 9         7.5
1 1 5 6 8 9         5.5
1 6 7 7 8 8         7.0
1 5 6 8 8 9         7.0
1 6 8 8 9 9         8.0
5 5 6 7 8 9         6.5
1 5 6 7 7 8         6.5
1 5 6 8 9 9         7.0
1 5 7 7 8 9         7.0
1 5 6 6 7 8         6.0
1 6 6 7 8 9         6.5
5 6 7 7 8 9         7.0
1 5 6 8 8 8         7.0
1 1 6 6 7 8         6.0
5 5 5 8 8 9         6.5
5 6 6 6 7 7         6.0
1 5 7 9 9 9         8.0
1 1 5 5 5 7         5.0

ˆ Histogram tells us whether we can assume normality for the


bootstrap medians or need to use quantiles of medians to
construct C.L.

ˆ Need high B for quantiles, low for variance (but see [23])

ˆ See [62] for useful information about bootstrap confidence


intervals and the latest R functions

5.3 Model Validation

5.3.1 Introduction

ˆ External validation (best: another country at another time);


also validates sampling, measurementsa

ˆ Internal
– apparent (evaluate fit on same data used to create fit)

– data splitting

– cross-validation

– bootstrap: get overfitting-corrected accuracy index

ˆ Best way to make model fit data well is to discard much of
the data

ˆ Predictions on another dataset will be inaccurate

ˆ Need unbiased assessment of predictive accuracy

Working definition of external validation: Validation of


a prediction tool on a sample that was not available at publi-
a But in many cases it is better to combine data and include country or calendar time as a predictor.

cation time. Alternate: Validation of a prediction tool by an


independent research team.
One suggested hierarchy of the quality of various validation
methods is as follows, ordered from worst to best.

1. Attempting several validations (internal or external) and re-


porting only the one that “worked”
2. Reporting apparent performance on the training dataset (no
validation)
3. Reporting predictive accuracy on an undersized independent
test sample
4. Internal validation using data splitting where at least one of
the training and test samples is not huge and the investigator
is not aware of the arbitrariness of variable selection done
on a single sample
5. Strong internal validation using 100 repeats of 10-fold cross-
   validation or several hundred bootstrap resamples, repeating


all analysis steps involving Y afresh at each re-sample and
the arbitrariness of selected “important variables” is reported
(if variable selection is used)
6. External validation on a large test sample, done by the orig-
inal research team
7. Re-analysis by an independent research team using strong
internal validation of the original dataset

8. External validation using new test data, done by an inde-


pendent research team
9. External validation using new test data generated using dif-
ferent instruments/technology, done by an independent re-
search team

Some points to consider:

ˆ Unless both sample sizes are huge, external validation can


be low precision

ˆ External validation can be costly and slow and may result in


disappointment that would have been revealed earlier with
rigorous internal validation

ˆ External validation is sometimes gamed; researchers disap-


pointed in the validation sometimes ask for a “do over”; re-
sampling validation is harder to game as long as all analytical
steps using Y are repeated each time.

ˆ Instead of external validation to determine model applica-
bility at a different time or place, and being disappointed if

the model does not work in that setting, consider building a


unified model containing time and place as predictors

ˆ When the model was fully pre-specified, external validation


tests the model

ˆ But when the model was fitted using machine learning, fea-
ture screening, variable selection, or model selection, the
model developed using training data is usually only an ex-
ample of a model, and the test sample validation could be
called an example validation

ˆ When resampling is used to repeat all modeling steps for


each resample, rigorous internal validation tests the process
used to develop the model and happens to also provide a
high-precision estimate of the likely future performance of
the “final” model developed using that process, properly pe-
nalizing for model uncertainty.

ˆ Resampling also reveals the volatility of the model selection


process

→ See BBR 10.11


Collins et al. [43] estimate that a typical sample size needed for
externally validating a time-to-event model is 200 events.

5.3.2 Which Quantities Should Be Used in Validation?

ˆ OLS: R² is one good measure for quantifying drop-off in
predictive ability

ˆ Example: n = 10, p = 9, apparent R² = 1 but R² will be
close to zero on new subjects

ˆ Example: n = 20, p = 10, apparent R² = 0.9, R² on new
data 0.7, R²adj = 0.79

ˆ Adjusted R² solves much of the bias problem assuming p in
its formula is the largest number of parameters ever exam-
ined against Y

ˆ Few other adjusted indexes exist

ˆ Also need to validate models with phantom d.f.

ˆ Cross-validation or bootstrap can provide unbiased estimate
of any index; bootstrap has higher precision

ˆ Two main types of quantities to validate

1. Calibration or reliability: ability to make unbiased esti-
   mates of response (Ŷ vs. Y )
2. Discrimination: ability to separate responses
   OLS: R²; g–index; binary logistic model: ROC area,
   equivalent to rank correlation between predicted prob-
   ability of event and 0/1 event

ˆ Unbiased validation nearly always necessary, to detect over-


fitting

5.3.3 Data-Splitting

ˆ Split data into training and test sets

ˆ Interesting to compare index of accuracy in training and test

ˆ Freeze parameters from training

ˆ Make sure you allow R² = 1 − SSE/SST for the test sample
to be < 0

ˆ Don’t compute ordinary R² on X β̂ vs. Y ; this allows for
linear recalibration aX β̂ + b vs. Y

ˆ Test sample must be large enough to obtain very accurate
assessment of accuracy

ˆ Training sample is what’s left

ˆ Example: overall sample n = 300, training sample n = 200,
develop model, freeze β̂, predict on test sample (n = 100):

    R² = 1 − Σ(Yi − Xi β̂)² / Σ(Yi − Ȳ )²
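
A minimal data-splitting sketch along these lines, assuming a
hypothetical data frame d:

set.seed(1)
i ← sample(nrow(d), 200)                # training rows
f ← ols(y ∼ x1 + x2, data=d, subset=i)
yhat ← predict(f, d[-i, ])
ytest ← d$y[-i]
1 - sum((ytest - yhat)^2) / sum((ytest - mean(ytest))^2)   # may be < 0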

ˆ Disadvantages of data splitting:

1. Costly in ↓ n [163, 26]


2. Requires decision to split at beginning of analysis
3. Requires larger sample held out than cross-validation

4. Results vary if split again


5. Does not validate the final model (from recombined data)
6. Not helpful in getting CL corrected for var. selection
7. Nice summary of disadvantages: [180]

5.3.4 Improvements on Data-Splitting: Resampling

ˆ No sacrifice in sample size

ˆ Work when modeling process automated

ˆ Bootstrap excellent for studying arbitrariness of variable se-


lection [167]. See P. 8-43.

ˆ Cross-validation solves many problems of data splitting [197,
172, 219, 61]

ˆ Example of ×-validation:

1. Split data at random into 10 tenths
2. Leave out 1/10 of data at a time
3. Develop model on 9/10, including any variable selection,
   pre-testing, etc.
4. Freeze coefficients, evaluate on the left-out 1/10
5. Average R² over 10 reps
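
rms automates this through validate; a sketch in which B gives the
number of folds for method='crossvalidation':

f ← ols(y ∼ x1 + x2, data=d, x=TRUE, y=TRUE)   # hypothetical fit
validate(f, method='crossvalidation', B=10)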

ˆ Drawbacks:

1. Choice of number of groups and repetitions


2. Doesn’t show full variability of var. selection
3. Does not validate full model
4. Lower precision than bootstrap
5. Need to do 50 repeats of 10-fold cross-validation to en-
sure adequate precision

ˆ Randomization method

1. Randomly permute Y
2. Optimism = performance of fitted model compared to
what expect by chance

5.3.5 Validation Using the Bootstrap

ˆ Estimate optimism of final whole sample fit without holding


out data

ˆ From original X and Y select sample of size n with replace-


ment

ˆ Derive model from bootstrap sample

ˆ Apply to original sample

ˆ Simple bootstrap uses average of indexes computed on orig-


inal sample

ˆ Estimated optimism = difference in indexes

ˆ Repeat about B = 100 times, get average expected opti-


mism

ˆ Subtract average optimism from apparent index in final model

ˆ Example: n = 1000, have developed a final model that is
hopefully ready to publish. Call estimates from this final
model β̂.

– final model has apparent R² (R²app) = 0.4

– how inflated is R²app?

– get resamples of size 1000 with replacement from the original
  1000

– for each resample compute R²boot = apparent R² in the boot-
  strap sample

– freeze these coefficients (call them β̂boot), apply to the original
  (whole) sample (Xorig , Yorig ) to get R²orig = R²(Xorig β̂boot , Yorig )

– optimism = R²boot − R²orig

– average over B = 100 optimisms to get the mean optimism

– R²overfitting corrected = R²app − mean optimism

ˆ Example: See P. 8-41
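
The rms validate function automates this whole loop for a model
fitted with x=TRUE, y=TRUE; a sketch:

f ← lrm(y ∼ x1 + x2, data=d, x=TRUE, y=TRUE)   # hypothetical fit
validate(f, B=100)   # apparent index, optimism, and corrected index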

ˆ Is estimating the unconditional (not conditional on X) distribu-
tion of R², etc. [68, p. 217]

ˆ Conditional estimates would require assuming the model one


is trying to validate

ˆ Efron’s “.632” method may perform better (reduce bias fur-


ther) for small n [61], [63, p. 253], [64]

Bootstrap useful for assessing calibration in addition to discrim-
ination:

ˆ Fit C(Y |X) = Xβ on bootstrap sample

ˆ Re-fit C(Y |X) = γ0 + γ1X β̂ on same data

ˆ γ̂0 = 0, γ̂1 = 1

ˆ Test data (original dataset): re-estimate γ0, γ1

ˆ γ̂1 < 1 if overfit, γ̂0 > 0 to compensate

ˆ γ̂1 quantifies overfitting and useful for improving calibra-


tion [178]

ˆ Use Efron’s method to estimate optimism in (0, 1), estimate


(γ0, γ1) by subtracting optimism from (0, 1)
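
The companion rms calibrate function estimates the overfitting-
corrected calibration curve this way; a sketch:

cal ← calibrate(f, B=100)   # f fitted with x=TRUE, y=TRUE
plot(cal)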

ˆ See also Copas [47] and van Houwelingen and le Cessie [197,
p. 1318]

See [71] for warnings about the bootstrap, and [61] for variations
on the bootstrap to reduce bias.
Use bootstrap to choose between full and reduced models:

ˆ Bootstrap estimate of accuracy for full model

ˆ Repeat, using chosen stopping rule for each re-sample

ˆ Full fit usually outperforms reduced model [178]

ˆ Stepwise modeling often reduces optimism but this is not


offset by loss of information from deleting marginal var.
Method           Apparent Rank Correlation   Optimism   Bias-Corrected
                 of Predicted vs. Observed              Correlation
Full Model       0.50                        0.06       0.44
Stepwise Model   0.47                        0.05       0.42

In this example, stepwise modeling lost a possible 0.50 − 0.47 =
0.03 of predictive discrimination. The full model fit will especially
be an improvement when

1. The stepwise selection deleted several variables which were


almost significant.
2. These marginal variables have some real predictive value,
even if it’s slight.

3. There is no small set of extremely dominant variables that


would be easily found by stepwise selection.

Other issues:

ˆ See [197] for many interesting ideas

ˆ Faraway [68] shows how bootstrap is used to penalize for


choosing transformations for Y , outlier and influence check-
ing, variable selection, etc. simultaneously

ˆ Brownstone [29, p. 74] feels that“theoretical statisticians have


been unable to analyze the sampling properties of [usual
multi-step modeling strategies] under realistic conditions”
and concludes that the modeling strategy must be com-
pletely specified and then bootstrapped to get consistent
estimates of variances and other sampling properties

ˆ See Blettner and Sauerbrei [21] and Chatfield [37] for more
interesting examples of problems resulting from data-driven
analyses.

5.4 Bootstrapping Ranks of Predictors

• Order of importance of predictors not pre-specified
• Researcher interested in determining "winners" and "losers"
• Bootstrap useful in documenting the difficulty of this task
• Get confidence limits of the rank of each predictor on the scale of partial χ² − d.f.
• Example using OLS:


# Use the plot method for anova, with pl=FALSE to suppress actual
# plotting of chi-square - d.f. for each bootstrap repetition.
# Rank the negative of the adjusted chi-squares so that a rank of
# 1 is assigned to the highest. It is important to tell
# plot.anova.rms not to sort the results, or every bootstrap
# replication would have ranks of 1,2,3,... for the stats.
require ( rms )

n ← 300
set.seed (1)
d ← data.frame ( x1 = runif ( n ) , x2 = runif ( n ) , x3 = runif ( n ) , x4 = runif ( n ) ,
x5 = runif ( n ) , x6 = runif ( n ) , x7 = runif ( n ) , x8 = runif ( n ) ,
x9 = runif ( n ) , x10 = runif ( n ) , x11 = runif ( n ) , x12 = runif ( n ) )
d $ y ← with (d , 1 * x1 + 2 * x2 + 3 * x3 + 4 * x4 + 5 * x5 + 6 * x6 + 7 * x7 +
8 * x8 + 9 * x9 + 10 * x10 + 11 * x11 + 12 * x12 + 9 * rnorm ( n ) )

f ← ols ( y ∼ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 , data = d )


B ← 1000
ranks ← matrix ( NA , nrow =B , ncol =12)
rankvars ← function ( fit )
rank ( plot ( anova ( fit ) , sort = ’ none ’ , pl = FALSE ) )
Rank ← rankvars ( f )
for ( i in 1: B ) {
j ← sample (1: n , n , TRUE )
bootfit ← update (f , data =d , subset = j )
ranks [i ,] ← rankvars ( bootfit )
}
lim ← t ( apply ( ranks , 2 , quantile , probs = c ( .025 , .975 ) ) )
predictor ← factor ( names ( Rank ) , names ( Rank ) )
w ← data.frame ( predictor , Rank , lower = lim [ ,1] , upper = lim [ ,2])
require ( ggplot2 )

ggplot (w , aes ( x = predictor , y = Rank ) ) + geom_point () + coord_flip () +


scale_y_continuous ( breaks =1:12) +
geom_errorbar ( aes ( ymin = lim [ ,1] , ymax = lim [ ,2]) , width =0)

Figure 5.3: Bootstrap percentile 0.95 confidence limits for ranks of predictors x1–x12 in an OLS model. Ranking is on the basis of partial χ² minus d.f. Point estimates are original ranks.

5.5 Simplifying the Final Model by Approximating It

5.5.1 Difficulties Using Full Models

• Predictions are conditional on all variables; standard errors ↑ when predicting for a low-frequency category
• Collinearity
• Can average predictions over categories to marginalize, ↓ s.e.

5.5.2 Approximating the Full Model

• Full model is the gold standard
• Approximate it to any desired degree of accuracy
• If approximating with a tree, the best cross-validated tree will have 1 obs./node
• Can use least squares to approximate the model by predicting Ŷ = Xβ̂
• When the original model was also fit using least squares, coefficients of the approximate model fitted against Ŷ ≡ coefficients of the subset of variables fitted against Y (as in stepwise)

• Model approximation still has some advantages:
  1. Uses unbiased estimate of σ from full fit
  2. Stopping rule less arbitrary
  3. Inheritance of shrinkage
• If estimates from the full model are β̂ and the approximate model is based on a subset T of the predictors X, the coefficients of the approximate model are Wβ̂, where W = (T′T)⁻¹T′X
• Variance matrix of the reduced coefficients: WVW′ (a sketch in code follows)
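A small numeric sketch of these two formulas, assuming an rms fit stored with x=TRUE so the design matrix is available (simulated data; names hypothetical):

require(rms)
set.seed(3)
n ← 200
d ← data.frame(x1=rnorm(n), x2=rnorm(n), x3=rnorm(n))
d$y ← with(d, x1 + .3*x2 + .1*x3 + rnorm(n))
f  ← ols(y ∼ x1 + x2 + x3, data=d, x=TRUE)
X  ← cbind(Intercept=1, f$x)            # full design matrix including intercept
Tm ← X[, c('Intercept', 'x1', 'x2')]    # subset T of the predictors
W  ← solve(t(Tm) %*% Tm, t(Tm) %*% X)   # W = (T'T)^{-1} T'X
W %*% coef(f)                           # coefficients of the approximate model
W %*% vcov(f) %*% t(W)                  # variance matrix W V W'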



5.6 How Do We Break Bad Habits?

• Insist on validation of predictive models and discoveries
• Show collaborators that split-sample validation is not appropriate unless the number of subjects is huge
  – Split more than once and see volatile results
  – Calculate a confidence interval for the predictive accuracy in the test dataset and show that it is very wide
• Run a simulation study with no real associations and show that associations are easy to find
• Analyze the collaborator's data after randomly permuting the Y vector and show some positive findings (a sketch in code follows this list)
• Show that alternative explanations are easy to posit
  – Importance of a risk factor may disappear if 5 "unimportant" risk factors are added back to the model
  – Omitted main effects can explain apparent interactions
  – Uniqueness analysis: attempt to predict the predicted values from a model derived by data torture from all of the features not used in the model
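As one illustration of the permutation point above, a minimal simulation (hypothetical setup): with a response unrelated to all 20 candidate predictors, stepwise selection still tends to retain some "significant" variables.

require(rms)
set.seed(4)
n ← 100
d ← as.data.frame(matrix(rnorm(n * 20), ncol=20))  # 20 pure-noise predictors V1..V20
d$y ← rnorm(n)                                     # response independent of them all
form ← as.formula(paste('y ∼', paste(names(d)[1:20], collapse=' + ')))
f ← ols(form, data=d)
fastbw(f, rule='p', sls=0.05)   # a few noise variables usually survive the stopping rule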
Chapter 6

R Software

R allows interaction spline functions, a wide variety of predictor parameterizations, a wide variety of models, a unifying model formula language, and model validation by resampling.
R is comprehensive:

• Easy to write R functions for new models → wide variety of modern regression models implemented (trees, nonparametric, ACE, AVAS, survival models for multiple events)
• Designs can be generated for any model → all handle "class" variables, interactions, nonlinear expansions
• Single R objects (e.g., fit object) can be self-documenting → automatic hypothesis tests, predictions for new data
• Superior graphics
• Classes and generic functions

6.1 The R Modeling Language

R statistical modeling language:

response ∼ terms

y ∼ age + sex             # age + sex main effects
y ∼ age + sex + age:sex   # add second-order interaction
y ∼ age * sex             # second-order interaction +
                          # all main effects
y ∼ (age + sex + pressure)^2
                          # age + sex + pressure + age:sex + age:pressure...
y ∼ (age + sex + pressure)^2 - sex:pressure
                          # all main effects and all 2nd order
                          # interactions except sex:pressure
y ∼ (age + race) * sex    # age + race + sex + age:sex + race:sex
y ∼ treatment * (age * race + age * sex)  # no interact. with race,sex
sqrt(y) ∼ sex * sqrt(age) + race
                          # functions, with dummy variables generated if
                          # race is an R factor (classification) variable
y ∼ sex + poly(age, 2)    # poly generates orthogonal polynomials
race.sex ← interaction(race, sex)
y ∼ age + race.sex        # for when you want dummy variables for
                          # all combinations of the factors

The formula for a regression model is given to a modeling function, e.g.

lrm(y ∼ rcs(x,4))

is read "use a logistic regression model to model y as a function of x, representing x by a restricted cubic spline with 4 default knots"a.

The update function re-fits a model with changes in terms or data:
f ← lrm(y ∼ rcs(x,4) + x2 + x3)
f2 ← update(f, subset=sex=="male")
f3 ← update(f, .∼. - x2)           # remove x2 from model
f4 ← update(f, .∼. + rcs(x5,5))    # add rcs(x5,5) to model
f5 ← update(f, y2 ∼ .)             # same terms, new response var.

a lrm and rcs are in the rms package.



6.2 User-Contributed Functions

• R is a high-level object-oriented language.
• R (UNIX, Linux, Mac, Windows)
• Multitude of user-contributed functions freely available
• International community of users

Some R functions:

• See Venables and Ripley
• Hierarchical clustering: hclust
• Principal components: princomp, prcomp
• Canonical correlation: cancor
• Nonparametric transform-both-sides additive models: ace, avas
• Parametric transform-both-sides additive models: areg, areg.boot (Hmisc package in R)
• Rank correlation methods: rcorr, hoeffd, spearman2 (Hmisc)

• Variable clustering: varclus (Hmisc)
• Single imputation: transcan (Hmisc)
• Multiple imputation: aregImpute (Hmisc)
• Restricted cubic splines: rcspline.eval (Hmisc)
• Re-state restricted spline in simpler form: rcspline.restate (Hmisc); a brief usage sketch follows
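A brief usage sketch for two of the Hmisc functions above, on a small simulated data frame (names hypothetical):

require(Hmisc)
set.seed(5)
d ← data.frame(x1=rnorm(50), x2=rnorm(50), x3=rnorm(50))
d$y ← d$x1 + rnorm(50)
spearman2(y ∼ x1 + x2 + x3, data=d)   # generalized Spearman rank correlations with y
v ← varclus(∼ x1 + x2 + x3, data=d)   # cluster predictors by pairwise similarity
plot(v)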

6.3 The rms Package

• datadist function to compute predictor distribution summaries

y ∼ sex + lsp(age, c(20,30,40,50,60)) +
    sex %ia% lsp(age, c(20,30,40,50,60))

E.g. restrict the age × cholesterol interaction to be of the form AF(B) + BG(A):

y ∼ lsp(age,30) + rcs(cholesterol,4) +
    lsp(age,30) %ia% rcs(cholesterol,4)

Special fitting functions by Harrell simplify procedures described in these notes:
Table 6.1: rms Fitting Functions

Function   Purpose                                         Related R Functions
ols Ordinary least squares linear model lm
lrm Binary and ordinal logistic regression model glm
Has options for penalized MLE
orm Ordinal semi-parametric regression model for polr,lrm
continuous Y and several link functions
psm Accelerated failure time parametric survival survreg
models
cph Cox proportional hazards regression coxph
bj Buckley-James censored least squares model survreg,lm
Glm rms version of glm glm
Gls rms version of gls gls (nlme package)
Rq rms version of rq rq (quantreg package)


Table 6.2: rms Transformation Functions

Function   Purpose                                         Related R Functions
asis No post-transformation (seldom used explicitly) I
rcs Restricted cubic splines ns
pol Polynomial using standard notation poly
lsp Linear spline
catg Categorical predictor (seldom) factor
scored Ordinal categorical variables ordered
matrx Keep variables as group for anova and fastbw matrix
strat Non-modeled stratification factors strata
(used for cph only)

Below, notice that there are three graphics models implemented for depicting the effects of predictors in the fitted model: lattice graphics, a ggplot method using the ggplot2 package (which has an option to convert the result to plotly), and a direct plotly method. plotly is used to create somewhat interactive graphics with drill-down capability, and the rms package takes advantage of this capability. plotly graphics are best used with RStudio Rmarkdown html output.

Function Purpose Related Functions


print Print parameters and statistics of fit
coef Fitted regression coefficients
formula Formula used in the fit
specs Detailed specifications of fit
vcov Fetch covariance matrix
logLik Fetch maximized log-likelihood
AIC Fetch AIC with option to put on chi-square basis
lrtest Likelihood ratio test for two nested models
univarLR Compute all univariable LR χ2
robcov Robust covariance matrix estimates
bootcov Bootstrap covariance matrix estimates
and bootstrap distributions of estimates
pentrace Find optimum penalty factors by tracing
effective AIC for a grid of penalties
effective.df Print effective d.f. for each type of variable
in model, for penalized fit or pentrace result
summary Summary of effects of predictors
plot.summary Plot continuously shaded confidence bars
for results of summary
anova Wald tests of most meaningful hypotheses
plot.anova Graphical depiction of anova
contrast General contrasts, C.L., tests
gendata Easily generate predictor combinations
predict Obtain predicted values or design matrix
Predict Obtain predicted values and confidence limits easily
varying a subset of predictors and others set at
default values
plot.Predict Plot the result of Predict using lattice
ggplot.Predict Plot the result of Predict using ggplot2
plotp.Predict Plot the result of Predict using plotly
fastbw Fast backward step-down variable selection step
residuals (or resid) Residuals, influence stats from fit
sensuc Sensitivity analysis for unmeasured confounder
which.influence Which observations are overly influential residuals
latex LATEX representation of fitted model Function
Function R function analytic representation of X β̂ latex
from a fitted regression model
Hazard R function analytic representation of a fitted
hazard function (for psm)
Survival R function analytic representation of fitted
survival function (for psm, cph)
Quantile R function analytic representation of fitted
function for quantiles of survival time
(for psm, cph)
Mean R function analytic representation of fitted
function for mean survival time or for ordinal logistic
nomogram Draws a nomogram for the fitted model latex, plot
survest Estimate survival probabilities (psm, cph) survfit
survplot Plot survival curves (psm, cph) plot.survfit
survplotp Plot survival curves with plotly features survplot
validate Validate indexes of model fit using resampling
val.prob External validation of a probability model lrm
val.surv External validation of a survival model calibrate
calibrate Estimate calibration curve using resampling val.prob
vif Variance inflation factors for fitted model
naresid Bring elements corresponding to missing data
back into predictions and residuals
naprint Print summary of missing values
impute Impute missing values aregImpute

rmsb: Bayesian regression modeling strategies package, focusing on semiparametric univariate and longitudinal models.
Function Purpose
blrm Bayesian binary and ordinal logistic model
stackMI Bayesian posterior stacking for multiple imputation
stanDx Stan diagnostics on fit
stanDxplot Trace plots to check posterior sampling convergence
PostF Creates R function for computing posterior probabilities
plot.rmsb Plot posterior densities, intervals, point summaries
compareBmods Compare two models using LOO-cv
HPDint Compute highest posterior density interval
distSym Compute measure of symmetry of posterior distribution

An extensive overview of Bayesian capabilities of the rmsb package may be found at hbiostat.org/R/rmsb/blrm.html.
Global options prType and grType control printed and some graphical output, respectively, as shown in example code below. The default is plain output and static graphics. If using plotly interactive graphics through ggplot or plotp or with the anova or summary functions, it is best to do so with RStudio html output or html notebooks. If using html output you must be producing an html document or notebook. When setting these options to use LaTeX or html it is highly recommended that you use the knitr package.
Example:

• treat: categorical variable with levels "a","b","c"
• num.diseases: ordinal variable, 0-4
• age: continuous; restricted cubic spline
• cholesterol: continuous (3 missings; use median); log(cholesterol+10)
• Allow treat × cholesterol interaction
• Program to fit logistic model, test all effects in design, estimate effects (e.g. inter-quartile range odds ratios), plot estimated transformations
require(rms)                     # make new functions available
options(prType='latex')          # print, summary, anova LaTeX output
                                 # others: 'html', 'plain'
options(grType='plotly')         # plotly graphics for ggplot, anova, summary
                                 # default is 'base' for static graphics
ddist ← datadist(cholesterol, treat, num.diseases, age)
# Could have used ddist ← datadist(data.frame.name)
options(datadist="ddist")        # defines data dist. to rms
cholesterol ← impute(cholesterol)
fit ← lrm(y ∼ treat + scored(num.diseases) + rcs(age) +
          log(cholesterol+10) + treat:log(cholesterol+10))

fit                              # outputs plain, LaTeX, or html markup
describe(y ∼ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in fit
# describe function (in Hmisc) gets simple statistics on variables
# fit ← robcov(fit)              # Would make all statistics that follow
                                 # use a robust covariance matrix
                                 # would need x=T, y=T in lrm()
specs(fit)                       # Describe the design characteristics
anova(fit)                       # plain, LaTeX, or html
anova(fit, treat, cholesterol)   # Test these 2 by themselves
plot(anova(fit))                 # Summarize anova graphically
summary(fit)                     # Estimate effects using default ranges
                                 # prints plain, LaTeX, or html
plot(summary(fit))               # Graphical display of effects with C.I.
summary(fit, treat="b", age=60)  # Specify reference cell and adjustment val
summary(fit, age=c(50,70))       # Estimate effect of increasing age from
                                 # 50 to 70
summary(fit, age=c(50,60,70))    # Increase age from 50 to 70, adjust to
                                 # 60 when estimating effects of other
                                 # factors
# If had not defined datadist, would have to define ranges for all var.

# Estimate and test treatment (b-a) effect averaged over 3 cholesterols
contrast(fit, list(treat='b', cholesterol=c(150,200,250)),
              list(treat='a', cholesterol=c(150,200,250)),
         type='average')
# See the help file for contrast.rms for several examples of
# how to obtain joint tests of multiple contrasts and how to get
# double differences (interaction contrasts)

p ← Predict(fit, age=seq(20,80,length=100), treat, conf.int=FALSE)
plot(p)                          # Plot relationship between age and log
# or ggplot(p), plotp(p)         # odds, separate curve for each treat,
                                 # no C.I.
plot(p, ∼ age | treat)           # Same but 2 panels
ggplot(p, groups=FALSE)
bplot(Predict(fit, age, cholesterol, np=50))
                                 # 3-dimensional perspective plot for age,
                                 # cholesterol, and log odds using default
                                 # ranges for both variables
plot(Predict(fit, num.diseases, fun=function(x) 1/(1+exp(-x)), conf.int=.9),
     ylab="Prob")                # Plot estimated probabilities instead of
                                 # log odds (or use ggplot())
                                 # can also use plotp() for plotly
# Again, if no datadist were defined, would have to tell plot all limits
logit ← predict(fit, expand.grid(treat="b", num.dis=1:3, age=c(20,40,60),
                cholesterol=seq(100,300,length=10)))
# Could also obtain list of predictor settings interactively
logit ← predict(fit, gendata(fit, nobs=12))

# Since age doesn't interact with anything, we can quickly and
# interactively try various transformations of age, taking the spline
# function of age as the gold standard. We are seeking a linearizing
# transformation.

ag ← 10:80
logit ← predict(fit, expand.grid(treat="a", num.dis=0, age=ag,
                cholesterol=median(cholesterol)), type="terms")[,"age"]
# Note: if age interacted with anything, this would be the age
# "main effect" ignoring interaction terms
# Could also use
# logit ← Predict(f, age=ag, ...)$yhat,
# which allows evaluation of the shape for any level of interacting
# factors. When age does not interact with anything, the result from
# predict(f, ..., type="terms") would equal the result from
# Predict if all other terms were ignored

# Could also specify
# logit ← predict(fit, gendata(fit, age=ag, cholesterol=...))
# Un-mentioned variables set to reference values

plot(ag^.5, logit)    # try square root vs. spline transform.
plot(ag^1.5, logit)   # try 1.5 power

latex(fit)            # invokes latex.lrm, creates fit.tex

# Draw a nomogram for the model fit
plot(nomogram(fit))

# Compose R function to evaluate linear predictors analytically
g ← Function(fit)
g(treat='b', cholesterol=260, age=50)
# Letting num.diseases default to reference value

To examine interactions in a simpler way, you may want to group age into tertiles:

age.tertile ← cut2(age, g=3)
# For automatic ranges later, add age.tertile to datadist input
fit ← lrm(y ∼ age.tertile * rcs(cholesterol))

6.4 Other Functions

• supsmu: Friedman's "super smoother"
• lowess: Cleveland's scatterplot smoother
• glm: generalized linear models (see Glm)
• gam: generalized additive models
• rpart: like original CART with surrogate splits for missings, censored data extension (Atkinson & Therneau)
• validate.rpart: in rms; validates recursive partitioning with respect to certain accuracy indexes
• loess: multi-dimensional scatterplot smoother

f ← loess(y ∼ age * pressure)
plot(f)                          # cross-sectional plots
ages      ← seq(20,70,length=40)
pressures ← seq(80,200,length=40)
pred ← predict(f, expand.grid(age=ages, pressure=pressures))
persp(ages, pressures, pred)     # 3-d plot
Chapter 7

Modeling Longitudinal Responses using Generalized Least Squares

7.1 Notation

• N subjects
• Subject i (i = 1, 2, . . . , N) has ni responses measured at times ti1, ti2, . . . , tini
• Response at time t for subject i: Yit
• Subject i has baseline covariates Xi
• Generally the response measured at time ti1 = 0 is a covariate in Xi instead of being the first measured response Yi0
• Time trend in response is modeled with k parameters so that the time "main effect" has k d.f.
• Let the basis functions modeling the time effect be g1(t), g2(t), . . . , gk(t)

7.2 Model Specification for Effects on E(Y)

7.2.1 Common Basis Functions

• k dummy variables for k + 1 unique times (assumes no functional form for time but may spend many d.f.)
• k = 1 for linear time trend, g1(t) = t
• k-order polynomial in t
• k + 1-knot restricted cubic spline (one linear term, k − 1 nonlinear terms); R formula sketches for these bases follow this list
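In R model formulas these bases look roughly like the following sketch (week and treat are hypothetical variable names; pol and rcs are rms functions):

y ∼ treat + factor(week)   # k dummy variables for k+1 unique times
y ∼ treat + week           # k = 1: linear time trend
y ∼ treat + pol(week, 3)   # k-order polynomial in time (here k = 3)
y ∼ treat + rcs(week, 4)   # k+1 = 4-knot restricted cubic spline (k = 3 d.f.)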

7.2.2 Model for Mean Profile

• A model for the mean time-response profile without interactions between time and any X:
  E[Yit|Xi] = Xiβ + γ1g1(t) + γ2g2(t) + . . . + γkgk(t)
• Model with interactions between time and some X's: add product terms for desired interaction effects
• Example: to allow the mean time trend for subjects in group 1 (reference group) to be arbitrarily different from the time trend for subjects in group 2, have a dummy variable for group 2, a time "main effect" curve with k d.f., and all k products of these time components with the dummy variable for group 2 (see the sketch after this list)
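For example, with a 3-knot spline for time (k = 2), the group-specific time trends just described could be sketched as follows (group, time, id, and d are hypothetical names):

require(rms)
require(nlme)
f ← Gls(y ∼ group * rcs(time, 3), data=d,
        correlation=corCAR1(form = ∼ time | id))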
7.2.3 Model Specification for Treatment Comparisons

• In studies comparing two or more treatments, a response is often measured at baseline (pre-randomization)
• Analyst has the option to use this measurement as Yi0 or as part of Xi
• Jim Rochon (Rho, Inc., Chapel Hill NC) has the following comments about this:
For RCTs, I draw a sharp line at the point when the intervention begins. The LHS [left hand side of the model
equation] is reserved for something that is a response to treatment. Anything before this point can potentially be
included as a covariate in the regression model. This includes the "baseline" value of the outcome variable. Indeed,
the best predictor of the outcome at the end of the study is typically where the patient began at the beginning. It
drinks up a lot of variability in the outcome; and, the effect of other covariates is typically mediated through this
variable.
I treat anything after the intervention begins as an outcome. In the western scientific method, an “effect” must
follow the “cause” even if by a split second.
Note that an RCT is different than a cohort study. In a cohort study, “Time 0” is not terribly meaningful. If we
want to model, say, the trend over time, it would be legitimate, in my view, to include the “baseline” value on the
LHS of that regression model.
Now, even if the intervention, e.g., surgery, has an immediate effect, I would still reserve the LHS for
anything that might legitimately be considered as the response to the intervention. So, if we cleared a blocked
artery and then measured the MABP, then that would still be included on the LHS.
Now, it could well be that most of the therapeutic effect occurred by the time that the first repeated measure was
taken, and then levels off. Then, a plot of the means would essentially be two parallel lines and the treatment
effect is the distance between the lines, i.e., the difference in the intercepts.
If the linear trend from baseline to Time 1 continues beyond Time 1, then the lines will have a common intercept
but the slopes will diverge. Then, the treatment effect will be the difference in slopes.
One point to remember is that the estimated intercept is the value at time 0 that we predict from the set of
repeated measures post randomization. In the first case above, the model will predict different intercepts even
though randomization would suggest that they would start from the same place. This is because we were asleep
at the switch and didn’t record the “action” from baseline to time 1. In the second case, the model will predict the
same intercept values because the linear trend from baseline to time 1 was continued thereafter.
More importantly, there are considerable benefits to including it as a covariate on the RHS. The baseline value
tends to be the best predictor of the outcome post-randomization, and this maneuver increases the precision of

the estimated treatment effect. Additionally, any other prognostic factors correlated with the outcome variable will
also be correlated with the baseline value of that outcome, and this has two important consequences. First, this
greatly reduces the need to enter a large number of prognostic factors as covariates in the linear models. Their
effect is already mediated through the baseline value of the outcome variable. Secondly, any imbalances across the
treatment arms in important prognostic factors will induce an imbalance across the treatment arms in the baseline
value of the outcome. Including the baseline value thereby reduces the need to enter these variables as covariates
in the linear models.

Stephen Senn [171] states that temporally and logically, a "baseline cannot be a response to treatment", so baseline and response cannot be modeled in an integrated framework.

. . . one should focus clearly on 'outcomes' as being the only values that can be influenced by treatment and examine critically any schemes that assume that these are linked in some rigid and deterministic view to 'baseline' values. An alternative tradition sees a baseline as being merely one of a number of measurements capable of improving predictions of outcomes and models it in this way.

The final reason that baseline cannot be modeled as the response at time zero is that many studies have inclusion/exclusion criteria that include cutoffs on the baseline variable. In other words, the baseline measurement comes from a truncated distribution. In general it is not appropriate to model the baseline with the same distributional shape as the follow-up measurements. Thus the approaches recommended by Liang and Zeger [125] and Liu et al. [129] are problematic.a

a In addition to this, one of the paper's conclusions, that analysis of covariance is not appropriate if the population means of the baseline variable are not identical in the treatment groups, is not correct [171]. See [107] for a rebuke of [129].

7.3 Modeling Within-Subject Dependence

• Random effects and mixed effects models have become very popular
• Disadvantages:
  – Induced correlation structure for Y may be unrealistic
  – Numerically demanding
  – Require complex approximations for distributions of test statistics
• Conditional random effects vs. (subject-) marginal models:
  – Random effects are subject-conditional
  – Random effects models are needed to estimate responses for individual subjects
  – Models without random effects are marginalized with respect to subject-specific effects
  – They are natural when the interest is on group-level parameters (e.g., overall treatment effect)
  – Random effects are natural when there is clustering at more than the subject level (multi-level models)

• Extended linear model (marginal; with no random effects) is a logical extension of the univariate model (e.g., few statisticians use subject random effects for univariate Y)
• This was known as growth curve models and generalized least squares [155, 78] and was developed long before mixed effect models became popular
• Pinheiro and Bates (Section 5.1.2) state that "in some applications, one may wish to avoid incorporating random effects in the model to account for dependence among observations, choosing to use the within-group component Λi to directly model variance-covariance structure of the response."
• We will assume that Yit|Xi has a multivariate normal distribution with mean given above and with variance-covariance matrix Vi, an ni × ni matrix that is a function of ti1, . . . , tini
• We further assume that the diagonals of Vi are all equal
• Procedure can be generalized to allow for heteroscedasticity over time or with respect to X (e.g., males may be allowed to have a different variance than females)
• This extended linear model has the following assumptions:
  – all the assumptions of OLS at a single time point including correct modeling of predictor effects and univariate normality of responses conditional on X
  – the distribution of two responses at two different times for the same subject, conditional on X, is bivariate normal with a specified correlation coefficient
  – the joint distribution of all ni responses for the ith subject is multivariate normal with the given correlation pattern (which implies the previous two distributional assumptions)
  – responses from any times for any two different subjects are uncorrelated

What Methods To Use for Repeated Measurements / Serial Data?a,b

Columns: Repeated Measures ANOVA; GEE; Mixed Effects Model; GLS; Markov; LOCF; Summary Statisticc
Assumes normality × × ×
Assumes independence of ×d ×e
measurements within subject
Assumes a correlation structuref × ×g × × ×
Requires same measurement × ?
times for all subjects
Does not allow smooth modeling ×
of time to save d.f.
Does not allow adjustment for ×
baseline covariates
Does not easily extend to × ×
non-continuous Y
Loses information by not using ×h ×
intermediate measurements
Does not allow widely varying # × ×i × ×j
of observations per subject
Does not allow for subjects × × × × ×
to have distinct trajectoriesk
Assumes subject-specific effects ×
are Gaussian
Badly biased if non-random ? × ×
dropouts
Biased in general ×
l
Harder to get tests & CLs × ×m
Requires large # subjects/clusters ×
SEs are wrong ×n ×
Assumptions are not verifiable × N/A × × ×
in small samples
Does not extend to complex × × × × ?
settings such as time-dependent
covariates and dynamico models
a Thanks to Charles Berry, Brian Cade, Peter Flom, Bert Gunter, and Leena Choi for valuable input.
b GEE: generalized estimating equations; GLS: generalized least squares; LOCF: last observation carried forward.
c E.g., compute within-subject slope, mean, or area under the curve over time. Assumes that the summary measure is an adequate summary

of the time profile and assesses the relevant treatment effect.


d Unless one uses the Huynh-Feldt or Greenhouse-Geisser correction
e For full efficiency, if using the working independence model
f Or requires the user to specify one
g For full efficiency of regression coefficient estimates
h Unless the last observation is missing
i The cluster sandwich variance estimator used to estimate SEs in GEE does not perform well in this situation, and neither does the working

independence model because it does not weight subjects properly.


j Unless one knows how to properly do a weighted analysis
k Or uses population averages
l Unlike GLS, does not use standard maximum likelihood methods yielding simple likelihood ratio χ2 statistics. Requires high-dimensional

integration to marginalize random effects, using complex approximations, and if using SAS, unintuitive d.f. for the various tests.
m Because there is no correct formula for SE of effects; ordinary SEs are not penalized for imputation and are too small
n If correction not applied
o E.g., a model with a predictor that is a lagged value of the response variable

• Markov models use ordinary univariate software and are very flexible
• They apply the same way to binary, ordinal, nominal, and continuous Y
• They require post-fitting calculations to get probabilities, means, and quantiles that are not conditional on the previous Y value

Gardiner et al. [74] compared several longitudinal data models, especially with regard to assumptions and how regression coefficients are estimated. Peters et al. [152] have an empirical study confirming that the "use all available data" approach of likelihood-based longitudinal models makes imputation of follow-up measurements unnecessary.

7.4 Parameter Estimation Procedure

• Generalized least squares
• Like weighted least squares but uses a covariance matrix that is not diagonal
• Each subject can have her own shape of Vi due to each subject being measured at a different set of times
• Maximum likelihood
• Newton-Raphson or other trial-and-error methods used for estimating parameters
• For a small number of subjects, advantages in using REML (restricted maximum likelihood) instead of ordinary MLE [57, Section 5.3], [154, Chapter 5], [78] (esp. to get a more unbiased estimate of the covariance matrix)
• When imbalances are not severe, OLS fitted ignoring subject identifiers may be efficient
  – But OLS standard errors will be too small as they don't take intra-cluster correlation into account
  – May be rectified by substituting a covariance matrix estimated from the Huber-White cluster sandwich estimator or from the cluster bootstrap

• When imbalances are severe and intra-subject correlations are strong, OLS is not expected to be efficient because it gives equal weight to each observation
  – a subject contributing two distant observations receives 1/5 the weight of a subject having 10 tightly-spaced observations

7.5 Common Correlation Structures

• Usually restrict ourselves to isotropic correlation structures — correlation between responses within subject at two times depends only on a measure of distance between the two times, not the individual times
• We simplify further and assume h depends only on |t1 − t2|
• Can speak interchangeably of correlations of residuals within subjects or correlations between responses measured at different times on the same subject, conditional on covariates X
• Assume that the correlation coefficient for Yit1 vs. Yit2 conditional on baseline covariates Xi for subject i is h(|t1 − t2|, ρ), where ρ is a vector (usually a scalar) set of fundamental correlation parameters
• Some commonly used structures when times are continuous and are not equally spaced [154, Section 5.3.3] (nlme correlation function names are at the right if the structure is implemented in nlme):

  Compound symmetry: h = ρ if t1 ≠ t2, 1 if t1 = t2                       corCompSymm
    (essentially what two-way ANOVA assumes)
  Autoregressive-moving average lag 1: h = ρ^|t1−t2| = ρ^s,               corCAR1
    where s = |t1 − t2|
  Exponential: h = exp(−s/ρ)                                              corExp
  Gaussian: h = exp[−(s/ρ)²]                                              corGaus
  Linear: h = (1 − s/ρ)[s < ρ]                                            corLin
  Rational quadratic: h = 1 − (s/ρ)²/[1 + (s/ρ)²]                         corRatio
  Spherical: h = [1 − 1.5(s/ρ) + 0.5(s/ρ)³][s < ρ]                        corSpher
  Linear exponent AR(1): h = ρ^(dmin + δ(s − dmin)/(dmax − dmin)), 1 if t1 = t2   [174]

• The structures 3–7 use ρ as a scaling parameter, not as something restricted to be in [0, 1]

7.6 Checking Model Fit

• Constant variance assumption: usual residual plots
• Normality assumption: usual q-q residual plots
• Correlation pattern: variogram
  – Estimate correlations of all possible pairs of residuals at different time points
  – Pool all estimates at the same absolute time difference s
  – Variogram is a plot with y = 1 − ĥ(s, ρ) vs. s on the x-axis
  – Superimpose the theoretical variogram assumed by the model

7.7 R Software

• Nonlinear mixed effects model package of Pinheiro & Bates
• For linear models, fitting functions are
  – lme for mixed effects models
  – gls for generalized least squares without random effects
• For gls, the rms package has the wrapper Gls so that many features of rms can be used:
  – anova: all partial Wald tests, test of linearity, pooled tests
  – summary: effect estimates (differences in Ŷ) and confidence limits, can be plotted
  – plot, ggplot, plotp: continuous effect plots
  – nomogram: nomogram
  – Function: generate R function code for fitted model
  – latex: LaTeX representation of fitted model
  In addition, Gls has a bootstrap option (hence you do not use rms's bootcov for Gls fits); see the sketch below.
  To get regular gls functions named anova (for likelihood ratio tests, AIC, etc.) or summary, use anova.gls or summary.gls
• nlme package has many graphics and fit-checking functions
• Several functions will be demonstrated in the case study
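A sketch of requesting the bootstrap directly in Gls (hypothetical data; B is the number of bootstrap resamples):

require(rms)
require(nlme)
f ← Gls(y ∼ rcs(time, 3) * treat, data=d,
        correlation=corCAR1(form = ∼ time | id), B=100)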



7.8 Case Study

Consider the dataset in Table 6.9 of Davis [54, pp. 161-163] from a multicenter, randomized controlled trial of botulinum toxin type B (BotB) in patients with cervical dystonia from nine U.S. sites.

• Randomized to placebo (N = 36), 5000 units of BotB (N = 36), 10,000 units of BotB (N = 37)
• Response variable: total score on Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and disability of cervical dystonia (high scores mean more impairment)
• TWSTRS measured at baseline (week 0) and weeks 2, 4, 8, 12, 16 after treatment began
• Dataset cdystonia from web site

7.8.1 Graphical Exploration of Data


require ( rms )

options ( prType = ’ latex ’) # for model print , summary , anova


getHdata ( cdystonia )
attach ( cdystonia )

# Construct unique subject ID


uid ← with ( cdystonia , factor ( paste ( site , id ) ) )

# Tabulate patterns of subjects ’ time points


table ( tapply ( week , uid ,
function ( w ) paste ( sort ( unique ( w ) ) , collapse = ’ ’) ) )

0 0 2 4 0 2 4 12 16 0 2 4 8 0 2 4 8 12
1 1 3 1 1
0 2 4 8 12 16 0 2 4 8 16 0 2 8 12 16 0 4 8 12 16 0 4 8 16
94 1 2 4 1

# Plot raw data , superposing subjects


xl ← xlab ( ’ Week ’) ; yl ← ylab ( ’ TWSTRS-total score ’)
ggplot ( cdystonia , aes ( x = week , y = twstrs , color = factor ( id ) ) ) +
geom_line () + xl + yl + facet_grid ( treat ∼ site ) +
guides(color=FALSE)   # Fig. 7.1

Figure 7.1: Time profiles for individual subjects, stratified by study site and dose
# Show quartiles
require ( data.table )

cdystonia ← data.table ( cdystonia )


cdys ← cdystonia [ , j = as.list ( quantile ( twstrs , (1 : 3) / 4) ) ,
by = list ( treat , week ) ]
cdys ← upData ( cdys , rename = c ( ’ 25% ’= ’ Q1 ’ , ’ 50% ’= ’ Q2 ’ , ’ 75% ’= ’ Q3 ’) , print = FALSE )
ggplot ( cdys , aes ( x = week , y = Q2 ) ) + xl + yl + ylim (0 , 70) +

geom_line () + facet_wrap (∼ treat , nrow =2) +


geom_ribbon(aes(ymin=Q1, ymax=Q3), alpha=0.2)   # Fig. 7.2


Figure 7.2: Quartiles of TWSTRS stratified by dose


# Show means with bootstrap nonparametric CLs
cdys ← cdystonia [ , j = as.list ( smean.cl.boot ( twstrs ) ) ,
by = list ( treat , week ) ]
ggplot ( cdys , aes ( x = week , y = Mean ) ) + xl + yl + ylim (0 , 70) +
geom_line () + facet_wrap (∼ treat , nrow =2) +
geom_ribbon(aes(x=week, ymin=Lower, ymax=Upper), alpha=0.2)   # Fig. 7.3


Figure 7.3: Mean responses and nonparametric bootstrap 0.95 confidence limits for population means, stratified by dose

Model with Yi0 as Baseline Covariate


baseline ← subset ( data.frame ( cdystonia , uid ) , week == 0 ,
-week )
baseline ← upData ( baseline , rename = c ( twstrs = ’ twstrs0 ’) ,
print = FALSE )
followup ← subset ( data.frame ( cdystonia , uid ) , week > 0 ,
c ( uid , week , twstrs ) )
rm ( uid )
both ← merge ( baseline , followup , by = ’ uid ’)

dd ← datadist ( both )
options ( datadist = ’ dd ’)

7.8.2 Using Generalized Least Squares

We stay with baseline adjustment and use a variety of correlation structures, with constant variance. Time is modeled as a restricted cubic spline with 3 knots, because there are only 3 unique interior values of week.
require ( nlme )

cp ← list ( corCAR1 , corExp , corCompSymm , corLin , corGaus , corSpher )


z ← vector ( ’ list ’ , length ( cp ) )
for ( k in 1: length ( cp ) ) {
z [[ k ]] ← gls ( twstrs ∼ treat * rcs ( week , 3) +
rcs ( twstrs0 , 3) + rcs ( age , 4) * sex , data = both ,
correlation = cp [[ k ]]( form = ∼week | uid ) )
}

anova ( z [[1]] , z [[2]] , z [[3]] , z [[4]] , z [[5]] , z [[6]])

Model df AIC BIC logLik


z [[1]] 1 20 3553.906 3638.357 -1756.953
z [[2]] 2 20 3553.906 3638.357 -1756.953
z [[3]] 3 20 3587.974 3672.426 -1773.987
z [[4]] 4 20 3575.079 3659.531 -1767.540
z [[5]] 5 20 3621.081 3705.532 -1790.540
z [[6]] 6 20 3570.958 3655.409 -1765.479

AIC computed above is set up so that smaller values are best. From this, the continuous-time AR1 and exponential structures are tied for the best. For the remainder of the analysis use corCAR1, using Gls.
a ← Gls ( twstrs ∼ treat * rcs ( week , 3) + rcs ( twstrs0 , 3) +
rcs ( age , 4) * sex , data = both ,
correlation = corCAR1 ( form =∼week | uid ) )

Generalized Least Squares Fit by REML

Gls(model = twstrs ~ treat * rcs(week, 3) + rcs(twstrs0, 3) +


rcs(age, 4) * sex, data = both, correlation = corCAR1(form = ~week |
uid))

Obs 522 Log-restricted-likelihood -1756.95


Clusters 108 Model d.f. 17
g 11.334 σ 8.5917
d.f. 504

β̂ S.E. t Pr(> |t|)


Intercept -0.3093 11.8804 -0.03 0.9792
treat=5000U 0.4344 2.5962 0.17 0.8672
treat=Placebo 7.1433 2.6133 2.73 0.0065
week 0.2879 0.2973 0.97 0.3334
week’ 0.7313 0.3078 2.38 0.0179
twstrs0 0.8071 0.1449 5.57 <0.0001
twstrs0’ 0.2129 0.1795 1.19 0.2360
age -0.1178 0.2346 -0.50 0.6158
age’ 0.6968 0.6484 1.07 0.2830
age” -3.4018 2.5599 -1.33 0.1845
sex=M 24.2802 18.6208 1.30 0.1929
treat=5000U × week 0.0745 0.4221 0.18 0.8599
treat=Placebo × week -0.1256 0.4243 -0.30 0.7674
treat=5000U × week’ -0.4389 0.4363 -1.01 0.3149
treat=Placebo × week’ -0.6459 0.4381 -1.47 0.1411
age × sex=M -0.5846 0.4447 -1.31 0.1892
age’ × sex=M 1.4652 1.2388 1.18 0.2375
age” × sex=M -4.0338 4.8123 -0.84 0.4023

Correlation Structure: Continuous AR(1)


Formula: ~week | uid
Parameter estimate(s):
Phi
0.8666689

ρ̂ = 0.8672, the estimate of the correlation between two measurements taken one week apart on the same subject. The estimated correlation for measurements 10 weeks apart is 0.8672^10 = 0.24.
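Checking that arithmetic:

phi ← 0.8672   # estimated one-week correlation
phi ^ 10       # implied correlation 10 weeks apart: about 0.24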
v ← Variogram (a , form =∼ week | uid )
plot ( v ) # F i g u r e 7.4


Figure 7.4: Variogram, with assumed correlation pattern superimposed

Check constant variance and normality assumptions:


both $ resid ← r ← resid ( a ) ; both $ fitted ← fitted ( a )
yl ← ylab ( ’ Residuals ’)
p1 ← ggplot ( both , aes ( x = fitted , y = resid ) ) + geom_point () +
facet_grid (∼ treat ) + yl
p2 ← ggplot ( both , aes ( x = twstrs0 , y = resid ) ) + geom_point () + yl
p3 ← ggplot ( both , aes ( x = week , y = resid ) ) + yl + ylim ( -20 ,20) +
stat_summary ( fun.data = " mean_sdl " , geom = ’ smooth ’)
p4 ← ggplot ( both , aes ( sample = resid ) ) + stat_qq () +
geom_abline ( intercept = mean ( r ) , slope = sd ( r ) ) + yl
gridExtra :: grid.arrange ( p1 , p2 , p3 , p4 , ncol =2) # F i g u r e 7.5

Now get hypothesis tests, estimates, and graphically interpret the model.
anova ( a )


Figure 7.5: Three residual plots to check for absence of trends in central tendency and in variability. Upper right panel shows the
baseline score on the x-axis. Bottom left panel shows the mean ±2×SD. Bottom right panel is the QQ plot for checking normality
of residuals from the GLS fit.

Wald Statistics for twstrs


χ2 d.f. P
treat (Factor+Higher Order Factors) 22.11 6 0.0012
All Interactions 14.94 4 0.0048
week (Factor+Higher Order Factors) 77.27 6 <0.0001
All Interactions 14.94 4 0.0048
Nonlinear (Factor+Higher Order Factors) 6.61 3 0.0852
twstrs0 233.83 2 <0.0001
Nonlinear 1.41 1 0.2354
age (Factor+Higher Order Factors) 9.68 6 0.1388
All Interactions 4.86 3 0.1826
Nonlinear (Factor+Higher Order Factors) 7.59 4 0.1077
sex (Factor+Higher Order Factors) 5.67 4 0.2252
All Interactions 4.86 3 0.1826
treat × week (Factor+Higher Order Factors) 14.94 4 0.0048
Nonlinear 2.27 2 0.3208
Nonlinear Interaction : f(A,B) vs. AB 2.27 2 0.3208
age × sex (Factor+Higher Order Factors) 4.86 3 0.1826
Nonlinear 3.76 2 0.1526
Nonlinear Interaction : f(A,B) vs. AB 3.76 2 0.1526
TOTAL NONLINEAR 15.03 8 0.0586
TOTAL INTERACTION 19.75 7 0.0061
TOTAL NONLINEAR + INTERACTION 28.54 11 0.0027
TOTAL 322.98 17 <0.0001
plot ( anova ( a ) ) # Figure 7.6

ylm ← ylim (25 , 60)


p1 ← ggplot ( Predict (a , week , treat , conf.int = FALSE ) ,
adj.subtitle = FALSE , legend.position = ’ top ’) + ylm
p2 ← ggplot ( Predict (a , twstrs0 ) , adj.subtitle = FALSE ) + ylm
p3 ← ggplot ( Predict (a , age , sex ) , adj.subtitle = FALSE ,
legend.position = ’ top ’) + ylm
gridExtra :: grid.arrange ( p1 , p2 , p3 , ncol =2) # F i g u r e 7.7

summary ( a ) # Shows for week 8

Low High ∆ Effect S.E. Lower 0.95 Upper 0.95


week 4 12 8 6.69100 1.10570 4.5238 8.8582
twstrs0 39 53 14 13.55100 0.88618 11.8140 15.2880
age 46 65 19 2.50270 2.05140 -1.5179 6.5234
treat — 5000U:10000U 1 2 0.59167 1.99830 -3.3249 4.5083
treat — Placebo:10000U 1 3 5.49300 2.00430 1.5647 9.4212
sex — M:F 1 2 -1.08500 1.77860 -4.5711 2.4011

Figure 7.6: Results of anova.rms from generalized least squares fit with continuous time AR1 correlation structure


Figure 7.7: Estimated effects of time, baseline TWSTRS, age, and sex

# To get results for week 8 for a different reference group
# for treatment, use e.g. summary(a, week=4, treat='Placebo')

# Compare low dose with placebo , separately at each time


k1 ← contrast (a , list ( week = c (2 ,4 ,8 ,12 ,16) , treat = ’ 5000 U ’) ,
list ( week = c (2 ,4 ,8 ,12 ,16) , treat = ’ Placebo ’) )
options ( width =80)
print ( k1 , digits =3)

week twstrs0 age sex Contrast S . E . Lower Upper Z Pr ( >| z |)


1 2 46 56 F -6.31 2.10 -10.43 -2.186 -3.00 0.0027
2 4 46 56 F -5.91 1.82 -9.47 -2.349 -3.25 0.0011
3 8 46 56 F -4.90 2.01 -8.85 -0.953 -2.43 0.0150
4* 12 46 56 F -3.07 1.75 -6.49 0.361 -1.75 0.0795
5* 16 46 56 F -1.02 2.10 -5.14 3.092 -0.49 0.6260

Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals

# Compare high dose with placebo


k2 ← contrast (a , list ( week = c (2 ,4 ,8 ,12 ,16) , treat = ’ 10000 U ’) ,
list ( week = c (2 ,4 ,8 ,12 ,16) , treat = ’ Placebo ’) )
print ( k2 , digits =3)

week twstrs0 age sex Contrast S . E . Lower Upper Z Pr ( >| z |)


1 2 46 56 F -6.89 2.07 -10.96 -2.83 -3.32 0.0009
2 4 46 56 F -6.64 1.79 -10.15 -3.13 -3.70 0.0002
3 8 46 56 F -5.49 2.00 -9.42 -1.56 -2.74 0.0061
4* 12 46 56 F -1.76 1.74 -5.17 1.65 -1.01 0.3109
5* 16 46 56 F 2.62 2.09 -1.47 6.71 1.25 0.2099

Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals

k1 ← as.data.frame ( k1 [ c ( ’ week ’ , ’ Contrast ’ , ’ Lower ’ , ’ Upper ’) ])


p1 ← ggplot ( k1 , aes ( x = week , y = Contrast ) ) + geom_point () +
geom_line () + ylab ( ’ Low Dose - Placebo ’) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0)
k2 ← as.data.frame ( k2 [ c ( ’ week ’ , ’ Contrast ’ , ’ Lower ’ , ’ Upper ’) ])
p2 ← ggplot ( k2 , aes ( x = week , y = Contrast ) ) + geom_point () +
geom_line () + ylab ( ’ High Dose - Placebo ’) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0)
gridExtra :: grid.arrange ( p1 , p2 , ncol =2) # F i g u r e 7.8

Although multiple d.f. tests such as total treatment effects or treatment × time interaction tests are comprehensive, their increased degrees of freedom can dilute power. In a treatment comparison, treatment contrasts at the last time point (single d.f. tests) are often of major interest. Such contrasts are informed by all the measurements made by all subjects (up until dropout times) when a smooth time trend is assumed.


Figure 7.8: Contrasts and 0.95 confidence limits from GLS fit

n ← nomogram (a , age = c ( seq (20 , 80 , by =10) , 85) )
plot(n, cex.axis=.55, cex.var=.8, lmgp=.25)   # Figure 7.9


Figure 7.9: Nomogram from GLS fit. Second axis is the baseline score.

7.8.3 Bayesian Proportional Odds Random Effects Model

• Develop a y-transformation invariant longitudinal model
• Proportional odds model with no grouping of TWSTRS scores
• Bayesian random effects model
• Random effects Gaussian with exponential prior distribution for its SD, with mean 1.0
• Compound symmetry correlation structure
• Demonstrates a large amount of patient-to-patient intercept variability
require ( rmsb )

stanSet()   # in ~/.Rprofile - sets options(mc.cores=)

bpo ← blrm ( twstrs ∼ treat * rcs ( week , 3) + rcs ( twstrs0 , 3) +


rcs ( age , 4) * sex + cluster ( uid ) , data = both , file = ’ bpo.rds ’)

# file= means that after the first time the model is run, it will not
# be re-run unless the data, fitting options, or underlying Stan code change
stanDx ( bpo )

Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved

For each parameter , n_eff is a crude measure of effective sample size


and Rhat is the potential scale reduction factor on split chains
( at convergence , Rhat =1)

n_eff Rhat
y >=7 1760 1.003
y >=9 1569 1.004
y >=10 1182 1.007
y >=11 1005 1.009

y >=13 925 1.009


y >=14 832 1.011
y >=15 824 1.011
y >=16 757 1.013
y >=17 657 1.016
y >=18 601 1.017
y >=19 593 1.017
y >=20 577 1.017
y >=21 550 1.019
y >=22 493 1.020
y >=23 452 1.021
y >=24 431 1.024
y >=25 408 1.026
y >=26 396 1.028
y >=27 395 1.026
y >=28 384 1.026
y >=29 371 1.029
y >=30 355 1.029
y >=31 355 1.027
y >=32 347 1.028
y >=33 343 1.025
y >=34 347 1.024
y >=35 357 1.022
y >=36 353 1.024
y >=37 357 1.024
y >=38 360 1.023
y >=39 364 1.022
y >=40 370 1.022
y >=41 372 1.023
y >=42 384 1.022
y >=43 399 1.022
y >=44 424 1.020
y >=45 440 1.020
y >=46 474 1.018
y >=47 521 1.016
y >=48 541 1.017
y >=49 569 1.015
y >=50 652 1.013
y >=51 733 1.011
y >=52 802 1.011
y >=53 870 1.010
y >=54 956 1.009
y >=55 1065 1.006
y >=56 1121 1.006
y >=57 1253 1.005
y >=58 1370 1.003
y >=59 1458 1.002
y >=60 1849 1.002
y >=61 2014 1.002
y >=62 2093 1.001
y >=63 2181 1.001
y >=64 2130 1.001
y >=65 2291 1.000
y >=66 2471 1.000
y >=67 2524 1.000
y >=68 2546 1.000
y >=71 2621 1.001

treat =5000 U 826 1.003


treat = Placebo 562 1.009
week 2123 1.001
week ’ 3704 1.000
twstrs0 817 1.009
twstrs0 ’ 902 1.001
age 847 1.003
age ’ 974 1.000
age ’ ’ 786 1.006
sex = M 730 1.001
treat =5000 U * week 4148 1.000
treat = Placebo * week 3849 1.000
treat =5000 U * week ’ 4384 1.002
treat = Placebo * week ’ 3967 1.000
age * sex = M 902 1.004
age ’ * sex = M 986 1.004
age ’ ’ * sex = M 1060 1.005
sigmag 794 1.009

print ( bpo , intercepts = TRUE )

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.044 for Intercepts

blrm(formula = twstrs ~ treat * rcs(week, 3) + rcs(twstrs0, 3) +


rcs(age, 4) * sex + cluster(uid), data = both, file = "bpo.rds")

Mixed Calibration/ Discrimination Rank Discrim.


Discrimination Indexes Indexes Indexes
Obs 522 LOO log L -1745.98±23.68 g 3.828 [3.286, 4.402] C 0.793 [0.786, 0.799]
Draws 4000 LOO IC 3491.97±47.36 gp 0.435 [0.416, 0.448] Dxy 0.586 [0.571, 0.598]
Chains 4 Effective p 178.4±7.83 EV 0.592 [0.541, 0.645]
p 17 B 0.149 [0.138, 0.159] v 11.419 [8.304, 15.017]
Cluster on uid vp 0.148 [0.134, 0.16]
Clusters 109
σγ 1.8908 [1.5399, 2.2828]

Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


y≥7 -2.1766 -2.2287 4.1157 -10.1514 5.9119 0.2992 1.03
y≥9 -3.1869 -3.2609 4.0086 -10.5348 5.1551 0.2130 1.04
y≥10 -4.3653 -4.3480 3.9282 -12.0578 3.4376 0.1302 1.02
y≥11 -4.8021 -4.7881 3.9131 -12.7389 2.5489 0.1082 1.04
y≥13 -5.0057 -5.0119 3.9117 -12.8507 2.3876 0.1020 1.03
y≥14 -5.3621 -5.3583 3.9134 -13.0756 2.1929 0.0880 1.02
y≥15 -5.6728 -5.6477 3.9120 -13.6068 1.7705 0.0762 1.01
y≥16 -6.0588 -6.0563 3.9166 -13.9082 1.4526 0.0628 1.01
y≥17 -6.8684 -6.8575 3.9195 -14.4646 0.9032 0.0430 1.02
y≥18 -7.1228 -7.1011 3.9143 -14.5805 0.7661 0.0367 1.02
y≥19 -7.4118 -7.3881 3.9149 -14.9704 0.4064 0.0310 1.01
y≥20 -7.6074 -7.5594 3.9137 -15.2362 0.0963 0.0275 1.02
y≥21 -7.7874 -7.7255 3.9137 -15.1384 0.1638 0.0248 1.02
y≥22 -8.2022 -8.1798 3.9109 -15.6933 -0.3445 0.0180 1.03
y≥23 -8.4727 -8.4533 3.9120 -15.7727 -0.4267 0.0150 1.02
y≥24 -8.7590 -8.7572 3.9149 -16.1055 -0.6939 0.0127 1.02
y≥25 -9.0187 -8.9946 3.9154 -16.5415 -1.1603 0.0107 1.02
y≥26 -9.4148 -9.3872 3.9145 -16.7518 -1.3684 0.0085 1.04
y≥27 -9.7067 -9.7024 3.9150 -17.1118 -1.7369 0.0065 1.02
y≥28 -9.9486 -9.9211 3.9155 -17.4259 -2.0012 0.0055 1.03
y≥29 -10.1819 -10.1522 3.9153 -18.1714 -2.7561 0.0043 1.03
y≥30 -10.4853 -10.4359 3.9170 -17.8904 -2.4764 0.0032 1.03
y≥31 -10.7738 -10.7329 3.9178 -18.1240 -2.7320 0.0022 1.03
y≥32 -10.8900 -10.8624 3.9178 -18.2656 -2.8608 0.0022 1.03
y≥33 -11.2566 -11.2313 3.9207 -18.6004 -3.2128 0.0022 1.02
y≥34 -11.5653 -11.5192 3.9224 -18.9888 -3.6146 0.0018 1.02
y≥35 -11.7877 -11.7622 3.9236 -19.1365 -3.7604 0.0013 1.03
y≥36 -12.0338 -12.0038 3.9259 -19.3754 -3.9827 0.0010 1.03
y≥37 -12.3111 -12.2657 3.9272 -19.7920 -4.3857 0.0010 1.03
y≥38 -12.5376 -12.5120 3.9281 -20.4359 -5.0062 0.0010 1.02
y≥39 -12.7840 -12.7388 3.9314 -20.3261 -4.9190 0.0010 1.02
y≥40 -12.9686 -12.9408 3.9309 -20.8457 -5.4301 0.0010 1.01
y≥41 -13.1525 -13.1185 3.9327 -20.6746 -5.2604 0.0010 1.02
y≥42 -13.4765 -13.4597 3.9345 -20.9059 -5.4771 0.0010 1.02
y≥43 -13.7044 -13.6640 3.9359 -21.2206 -5.7800 0.0010 1.03
y≥44 -14.0447 -13.9972 3.9384 -21.6970 -6.2459 0.0008 1.03
y≥45 -14.3653 -14.3071 3.9386 -21.9644 -6.5455 0.0000 1.02
y≥46 -14.6657 -14.6244 3.9412 -22.3029 -6.8496 0.0000 1.02
y≥47 -15.0812 -15.0211 3.9443 -22.7833 -7.3329 0.0000 1.02
y≥48 -15.3706 -15.3139 3.9455 -23.0209 -7.5663 0.0000 1.01
y≥49 -15.7425 -15.7054 3.9440 -23.0510 -7.5864 0.0000 1.00
y≥50 -16.0590 -16.0391 3.9457 -23.4106 -7.9624 0.0000 1.01
y≥51 -16.5886 -16.5532 3.9498 -24.1610 -8.6563 0.0000 1.02
y≥52 -16.9515 -16.9157 3.9544 -24.4683 -8.9205 0.0000 1.02
y≥53 -17.3971 -17.3601 3.9594 -24.9188 -9.3942 0.0000 1.02
y≥54 -17.8985 -17.8403 3.9613 -25.5213 -10.0391 0.0000 1.02
y≥55 -18.3121 -18.2699 3.9623 -25.9552 -10.4022 0.0000 1.01
y≥56 -18.5627 -18.5089 3.9624 -26.1697 -10.6027 0.0000 1.01
y≥57 -19.0339 -18.9663 3.9636 -26.7458 -11.1864 0.0000 1.02
y≥58 -19.5957 -19.5552 3.9664 -27.1818 -11.5891 0.0000 1.00
y≥59 -19.9498 -19.9167 3.9652 -27.7410 -12.1558 0.0000 1.01
y≥60 -20.2804 -20.2283 3.9670 -27.9099 -12.2528 0.0000 1.00
y≥61 -20.9709 -20.9290 3.9702 -29.1018 -13.5106 0.0000 0.99
y≥62 -21.3387 -21.3046 3.9752 -29.4950 -13.8587 0.0000 1.00
y≥63 -21.7490 -21.6908 3.9832 -29.7935 -14.0874 0.0000 1.00
y≥64 -21.8908 -21.8465 3.9858 -29.6261 -13.9519 0.0000 0.99
y≥65 -22.6046 -22.5912 3.9899 -30.3561 -14.6233 0.0000 1.00
y≥66 -22.9854 -22.9552 3.9964 -30.7077 -14.8947 0.0000 0.98
y≥67 -23.4008 -23.3744 4.0028 -31.4212 -15.6900 0.0000 0.99
y≥68 -24.1800 -24.1512 4.0185 -32.3262 -16.5090 0.0000 0.98
y≥71 -25.0531 -24.9831 4.0342 -33.0324 -17.1918 0.0000 0.97
treat=5000U 0.1101 0.1112 0.7193 -1.3354 1.5228 0.5645 1.00
treat=Placebo 2.3857 2.3733 0.7459 0.8984 3.8546 1.0000 1.05
week 0.1210 0.1205 0.0805 -0.0320 0.2777 0.9342 1.03
week’ 0.1927 0.1946 0.0875 0.0277 0.3635 0.9855 0.98
twstrs0 0.2300 0.2300 0.0498 0.1318 0.3265 1.0000 1.02
twstrs0’ 0.1285 0.1273 0.0632 0.0063 0.2566 0.9805 1.03
age -0.0084 -0.0080 0.0765 -0.1647 0.1385 0.4548 0.97
age’ 0.1761 0.1773 0.2088 -0.2520 0.5797 0.8035 1.04
age” -1.0116 -1.0240 0.8230 -2.6465 0.5932 0.1055 0.98
sex=M 5.3610 5.2011 6.1481 -6.2735 17.4074 0.8130 1.03
treat=5000U × week 0.0508 0.0513 0.1124 -0.1638 0.2751 0.6800 0.97
treat=Placebo × week -0.0558 -0.0582 0.1126 -0.2725 0.1641 0.3167 0.97
treat=5000U × week’ -0.1626 -0.1624 0.1222 -0.3945 0.0751 0.0897 1.02
treat=Placebo × week’ -0.1379 -0.1360 0.1231 -0.3733 0.1050 0.1328 1.01
age × sex=M -0.1182 -0.1144 0.1471 -0.4063 0.1598 0.2040 0.98
age’ × sex=M 0.1734 0.1707 0.4070 -0.5971 0.9924 0.6688 1.02
age” × sex=M -0.0363 -0.0379 1.5762 -3.0687 3.0526 0.4898 0.99

a ← anova ( bpo )
a

Relative Explained Variation for twstrs. Approximate total model Wald χ2 used in denominators of REV: 252.8 [208.1, 328.8].

REV Lower Upper d.f.


treat (Factor+Higher Order Factors) 0.123 0.065 0.205 6
All Interactions 0.088 0.033 0.160 4
week (Factor+Higher Order Factors) 0.594 0.466 0.701 6
All Interactions 0.088 0.033 0.160 4
Nonlinear (Factor+Higher Order Factors) 0.021 0.001 0.068 3
twstrs0 0.632 0.487 0.704 2
Nonlinear 0.016 0.000 0.048 1
age (Factor+Higher Order Factors) 0.025 0.007 0.085 6
All Interactions 0.016 0.001 0.055 3
Nonlinear (Factor+Higher Order Factors) 0.022 0.003 0.072 4
sex (Factor+Higher Order Factors) 0.020 0.001 0.068 4
All Interactions 0.016 0.001 0.055 3
treat × week (Factor+Higher Order Factors) 0.088 0.033 0.160 4
Nonlinear 0.008 0.000 0.042 2
Nonlinear Interaction : f(A,B) vs. AB 0.008 0.000 0.042 2
age × sex (Factor+Higher Order Factors) 0.016 0.001 0.055 3
Nonlinear 0.015 0.000 0.049 2
Nonlinear Interaction : f(A,B) vs. AB 0.015 0.000 0.049 2
TOTAL NONLINEAR 0.057 0.030 0.143 8
TOTAL INTERACTION 0.101 0.057 0.190 7
TOTAL NONLINEAR + INTERACTION 0.133 0.082 0.241 11
TOTAL 1.000 1.000 1.000 17
plot ( a )

[Figure: dot chart of relative explained variation (REV) with 0.95 intervals for twstrs0, week, treat, treat × week, age, sex, and age × sex]

ˆ Show the final graphic (high dose:placebo contrast as function of time)

ˆ Intervals are 0.95 highest posterior density intervals

ˆ y-axis: log-odds ratio



wks ← c (2 ,4 ,8 ,12 ,16)


k ← contrast ( bpo , list ( week = wks , treat = ’ 10000 U ’) ,
list ( week = wks , treat = ’ Placebo ’) ,
cnames = paste ( ’ Week ’ , wks ) )
k

week Contrast S.E. Lower Upper Pr ( Contrast >0)


1 Week 2 2 -2.2741167 0.6154921 -3.414998 -0.9800939 0.0000
2 Week 4 4 -2.1625817 0.5502386 -3.181820 -1.0091711 0.0000
3 Week 8 8 -1.8016186 0.6029870 -2.951032 -0.6042740 0.0013
4* Week 12 12 -0.8890838 0.5428456 -1.924157 0.2050322 0.0510
5* Week 16 16 0.1613439 0.6175396 -1.098165 1.3472247 0.6075

Redundant contrasts are denoted by *

Intervals are 0.95 highest posterior density intervals


Contrast is the posterior mean

plot ( k )

[Figure: posterior densities of the high dose − placebo contrasts at Weeks 2, 4, 8, 12, and 16, with 0.95 HPDI, mean, and median]

k ← as.data.frame ( k [ c ( ’ week ’ , ’ Contrast ’ , ’ Lower ’ , ’ Upper ’) ])


ggplot (k , aes ( x = week , y = Contrast ) ) + geom_point () +
geom_line () + ylab ( ’ High Dose - Placebo ’) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0)

[Figure: posterior mean high dose − placebo contrast vs. week, with 0.95 HPD error bars]

For each posterior draw compute the difference in means and get exact (to within simulation error) 0.95 highest posterior density intervals for these differences.
M ← Mean ( bpo ) # create R function that computes mean Y from X* beta
k ← contrast ( bpo , list ( week = wks , treat = ’ 10000 U ’) ,
list ( week = wks , treat = ’ Placebo ’) ,
fun =M , cnames = paste ( ’ Week ’ , wks ) )
plot (k , which = ’ diff ’) + theme ( legend.position = ’ bottom ’)

[Figure: posterior distributions of the high dose − placebo differences in mean TWSTRS at Weeks 2, 4, 8, 12, and 16, with 0.95 HPDI, mean, and median]

f ← function ( x ) {
hpd ← HPDint (x , prob =0 .95 ) # HPDint is in rmsb
r ← c ( mean ( x ) , median ( x ) , hpd )


names ( r ) ← c ( ’ Mean ’ , ’ Median ’ , ’ Lower ’ , ’ Upper ’)
r
}
w ← as.data.frame ( t ( apply ( k $ esta - k $ estb , 2 , f ) ) )
week ← as.numeric ( sub ( ’ Week ’ , ’ ’ , rownames ( w ) ) )
ggplot (w , aes ( x = week , y = Mean ) ) + geom_point () +
geom_line () + ylab ( ’ High Dose - Placebo ’) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0) +
scale_y_continuous ( breaks = c ( -8 , -4 , 0 , 4) )

[Figure: exact posterior mean difference in mean TWSTRS (high dose − placebo) vs. week, with 0.95 HPD error bars]

7.8.4 Bayesian Markov Semiparametric Model

ˆ First-order Markov model

ˆ Serial correlation induced by Markov model is similar to AR(1) which we already know fits these data

ˆ Markov model is more likely to fit the data than the random effects model, which induces a compound symmetry correlation structure

ˆ Models state transitions



ˆ PO model at each visit, with Y from previous visit conditioned upon just like any covariate

ˆ Need to uncondition (marginalize) on previous Y to get the time-response profile we usually need

ˆ Semiparametric model is especially attractive because one can easily “uncondition” a discrete Y model, and the distribution of Y for control subjects can be any shape

ˆ Let measurement times be t1, t2, . . . , tm, and the measurement for a subject at time t be denoted Y (t)

ˆ First-order Markov model:

Pr(Y (ti) ≥ y|X, Y (ti−1)) = expit(αy + Xβ + g(Y (ti−1), ti, ti − ti−1))

ˆ g involves any number of regression coefficients for a main effect of t, the main effect of time gap ti − ti−1 if this is not collinear with absolute time, a main effect of the previous state, and interactions between these

ˆ Examples of how the previous state may be modeled in g (a code sketch follows this list):

– linear in numeric codes for Y

– spline function in same

– discontinuous bi-linear relationship where there is a slope for in-hospital outcome severity, a separate slope for outpatient outcome severity, and an intercept jump at the transition from inpatient to outpatient (or vice versa)

ˆ Markov model is quite flexible in handling time trends and serial correlation patterns

ˆ Can allow for irregular measurement times: hbiostat.org/stat/irreg.html
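A minimal sketch (not from this case study; the data frame w and variables yprev, inpt, y, time, and treat are hypothetical) of the discontinuous bi-linear coding of the previous state:

w ← transform ( w ,
  yprev.in  = ifelse ( inpt == 1 , yprev , 0) ,  # slope for in-hospital severity
  yprev.out = ifelse ( inpt == 0 , yprev , 0) ,  # separate outpatient slope
  jump      = 1 * ( inpt == 0) )                 # intercept jump at the transition
f ← blrm ( y ∼ yprev.in + yprev.out + jump + rcs ( time , 4) + treat , data = w )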

Fit the model and run standard Stan diagnostics.


# Create a new variable to hold previous value of Y for the subject
# For week 2 , previous value is the baseline value
setDT ( both , key = c ( ’ uid ’ , ’ week ’) )
both [ , ptwstrs := shift ( twstrs ) , by = uid ]
both [ week == 2 , ptwstrs := twstrs0 ]
dd ← datadist ( both )
bmark ← blrm ( twstrs ∼ treat * rcs ( week , 3) + rcs ( ptwstrs , 4) +
rcs ( age , 4) * sex ,
data = both , file = ’ bmark.rds ’)

# When adding partial PO terms for week and ptwstrs , z = -1.8 , 5 .04
stanDx ( bmark )

Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved

For each parameter , n_eff is a crude measure of effective sample size


and Rhat is the potential scale reduction factor on split chains
( at convergence , Rhat =1)

n_eff Rhat
y >=7 3933 1.000
y >=9 5783 1.000
y >=10 5257 1.000
y >=11 5262 1.000
y >=13 4668 0.999
y >=14 4432 0.999
y >=15 4721 0.999
y >=16 4555 1.000
y >=17 4002 0.999
y >=18 3652 0.999
y >=19 3663 0.999
y >=20 3400 0.999
y >=21 3521 0.999
y >=22 3777 0.999
y >=23 3692 0.999

y >=24 3734 1.000


y >=25 3945 0.999
y >=26 3756 0.999
y >=27 3622 1.000
y >=28 3690 1.000
y >=29 3551 1.000
y >=30 3444 1.000
y >=31 3259 1.000
y >=32 3613 0.999
y >=33 4222 1.000
y >=34 4621 1.000
y >=35 4725 1.000
y >=36 5220 1.000
y >=37 5384 1.000
y >=38 5732 1.000
y >=39 5991 1.000
y >=40 6102 1.000
y >=41 6332 1.000
y >=42 6378 1.000
y >=43 6655 1.000
y >=44 7240 1.000
y >=45 7492 1.000
y >=46 6781 1.000
y >=47 6324 1.000
y >=48 6321 1.000
y >=49 5792 1.000
y >=50 5606 1.000
y >=51 5400 1.000
y >=52 5530 1.000
y >=53 5254 1.000
y >=54 5268 1.000
y >=55 5061 1.000
y >=56 4783 1.000
y >=57 4759 1.000
y >=58 5064 1.000
y >=59 5103 1.001
y >=60 5137 1.000
y >=61 5471 1.000
y >=62 5419 1.000
y >=63 5187 1.000
y >=64 5392 1.000
y >=65 5339 1.001
y >=66 5346 1.002
y >=67 4516 1.001
y >=68 4670 1.001
y >=71 4841 1.000
treat =5000 U 9579 1.000
treat = Placebo 8212 1.000
week 6164 0.999
week ’ 8498 0.999
ptwstrs 3316 1.000
ptwstrs ’ 7429 1.000
ptwstrs ’ ’ 7112 1.000
age 6991 1.000
age ’ 7712 1.000
age ’ ’ 8219 1.000
sex = M 8158 1.000

treat =5000 U * week 8469 1.000


treat = Placebo * week 8037 1.000
treat =5000 U * week ’ 8265 1.000
treat = Placebo * week ’ 8015 0.999
age * sex = M 8998 0.999
age ’ * sex = M 9445 1.000
age ’ ’ * sex = M 8538 0.999

stanDxplot ( bmark )

[Figure: Stan trace plots of posterior parameter values vs. post burn-in iteration, 4 chains overlaid, for all regression parameters]

Note that posterior sampling is much more efficient without random effects.
bmark

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.044 for Intercepts

blrm(formula = twstrs ~ treat * rcs(week, 3) + rcs(ptwstrs, 4) +
     rcs(age, 4) * sex, data = both, file = "bmark.rds")

Frequencies of Missing Values Due to Each Variable

twstrs treat week ptwstrs age sex


0 0 0 5 0 0

Mixed Calibration/ Discrimination Rank Discrim.


Discrimination Indexes Indexes Indexes
Obs 517 LOO log L -1785.43±22.21 g 3.262 [2.963, 3.535] C 0.828 [0.826, 0.83]
Draws 4000 LOO IC 3570.86±44.42 gp 0.415 [0.401, 0.429] Dxy 0.656 [0.651, 0.661]
Chains 4 Effective p 89.27±4.67 EV 0.532 [0.491, 0.571]
p 18 B 0.117 [0.113, 0.121] v 8.38 [6.902, 9.796]
vp 0.133 [0.123, 0.143]

Mode β̂ Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


treat=5000U 0.2195 0.2140 0.2139 0.5458 -0.8897 1.2605 0.6542 0.98
treat=Placebo 1.8281 1.8320 1.8344 0.5801 0.6984 2.9785 0.9990 1.02
week 0.4858 0.4852 0.4850 0.0826 0.3310 0.6491 1.0000 1.03
week’ -0.2874 -0.2861 -0.2855 0.0875 -0.4574 -0.1167 0.0010 1.00
ptwstrs 0.1997 0.2008 0.2010 0.0270 0.1482 0.2539 1.0000 1.02
ptwstrs’ -0.0625 -0.0655 -0.0649 0.0639 -0.1886 0.0669 0.1498 0.98
ptwstrs” 0.5338 0.5489 0.5451 0.2558 0.0074 1.0221 0.9848 1.06
age -0.0295 -0.0287 -0.0285 0.0303 -0.0878 0.0301 0.1735 0.98
age’ 0.1237 0.1219 0.1215 0.0849 -0.0454 0.2868 0.9240 0.99
age” -0.5072 -0.5021 -0.5016 0.3396 -1.1745 0.1535 0.0703 0.98
sex=M -0.4593 -0.4728 -0.4209 2.4680 -5.2648 4.4731 0.4182 0.99
treat=5000U × week -0.0338 -0.0313 -0.0304 0.1070 -0.2431 0.1710 0.3870 1.02
treat=Placebo × week -0.2715 -0.2719 -0.2724 0.1125 -0.4868 -0.0553 0.0088 1.03
treat=5000U × week’ -0.0342 -0.0374 -0.0385 0.1166 -0.2692 0.1862 0.3815 0.98
treat=Placebo × week’ 0.1195 0.1194 0.1203 0.1210 -0.1208 0.3486 0.8368 0.97
age × sex=M 0.0111 0.0116 0.0111 0.0594 -0.1059 0.1275 0.5785 1.00
age’ × sex=M -0.0510 -0.0531 -0.0522 0.1682 -0.3886 0.2759 0.3777 1.04
age” × sex=M 0.2618 0.2712 0.2610 0.6523 -1.0108 1.5629 0.6577 0.96

a ← anova ( bpo )
a

Relative Explained Variation for twstrs. Approximate total model Wald χ2 used in denominators of REV: 252.8 [208, 325.2].

REV Lower Upper d.f.


treat (Factor+Higher Order Factors) 0.123 0.065 0.214 6
All Interactions 0.088 0.041 0.173 4
week (Factor+Higher Order Factors) 0.594 0.462 0.685 6
All Interactions 0.088 0.041 0.173 4
Nonlinear (Factor+Higher Order Factors) 0.021 0.001 0.067 3
twstrs0 0.632 0.487 0.704 2
Nonlinear 0.016 0.000 0.047 1
age (Factor+Higher Order Factors) 0.025 0.007 0.088 6
All Interactions 0.016 0.001 0.056 3
Nonlinear (Factor+Higher Order Factors) 0.022 0.004 0.077 4
sex (Factor+Higher Order Factors) 0.020 0.004 0.070 4
All Interactions 0.016 0.001 0.056 3
treat × week (Factor+Higher Order Factors) 0.088 0.041 0.173 4
Nonlinear 0.008 0.000 0.038 2
Nonlinear Interaction : f(A,B) vs. AB 0.008 0.000 0.038 2
age × sex (Factor+Higher Order Factors) 0.016 0.001 0.056 3
Nonlinear 0.015 0.000 0.049 2
Nonlinear Interaction : f(A,B) vs. AB 0.015 0.000 0.049 2
TOTAL NONLINEAR 0.057 0.032 0.143 8
TOTAL INTERACTION 0.101 0.053 0.194 7
TOTAL NONLINEAR + INTERACTION 0.133 0.081 0.243 11
TOTAL 1.000 1.000 1.000 17
plot ( a )

[Figure: dot chart of relative explained variation (REV) with 0.95 intervals for twstrs0, week, treat, treat × week, age, sex, and age × sex]

Let’s add subject-level random effects to the model. Smallness of the standard deviation of the random effects provides support for the assumption of conditional independence that we like to make for Markov models and allows us to simplify the model by omitting random effects.
bmarkre ← blrm ( twstrs ∼ treat * rcs ( week , 3) + rcs ( ptwstrs , 4) +
rcs ( age , 4) * sex + cluster ( uid ) ,
data = both , file = ’ bmarkre.rds ’)

stanDx ( bmarkre )

Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved

For each parameter , n_eff is a crude measure of effective sample size


and Rhat is the potential scale reduction factor on split chains
( at convergence , Rhat =1)

n_eff Rhat
y >=7 2429 1.000
y >=9 3542 1.000
y >=10 2851 1.000
y >=11 2709 1.000
y >=13 2617 1.000
y >=14 2451 1.000
y >=15 2421 1.001
y >=16 2377 1.001
y >=17 2105 1.001
y >=18 1998 1.001
y >=19 1822 1.001
y >=20 1757 1.001
y >=21 1688 1.001
y >=22 1676 1.001
y >=23 1675 1.001
y >=24 1689 1.001
y >=25 1694 1.001
y >=26 1659 1.000
y >=27 1629 1.000
y >=28 1571 1.000
y >=29 1580 1.000
y >=30 1612 1.001
y >=31 1619 1.001
y >=32 1669 1.000
y >=33 1836 1.001
y >=34 1938 1.001
y >=35 2038 1.000
y >=36 2245 1.000
y >=37 2409 1.000
y >=38 2477 1.000
y >=39 2517 1.000
y >=40 2639 1.000
y >=41 2831 1.000
y >=42 2943 1.000
y >=43 3300 1.000
y >=44 3718 1.000
y >=45 3832 1.000
y >=46 3835 1.000
y >=47 3884 1.000
y >=48 3960 1.000
y >=49 3844 1.000
y >=50 3728 1.000
y >=51 3460 1.000
y >=52 3286 1.000
y >=53 2899 1.000

y >=54 2900 1.000


y >=55 3163 1.000
y >=56 3036 1.000
y >=57 2940 1.000
y >=58 2975 1.000
y >=59 3109 1.000
y >=60 3162 1.000
y >=61 3139 1.000
y >=62 3041 1.001
y >=63 2885 1.001
y >=64 2936 1.001
y >=65 3284 1.000
y >=66 3016 1.001
y >=67 2933 1.001
y >=68 3097 1.001
y >=71 3348 1.001
treat =5000 U 5397 1.000
treat = Placebo 4535 1.000
week 3273 1.000
week ’ 5366 0.999
ptwstrs 1626 1.001
ptwstrs ’ 3363 1.000
ptwstrs ’ ’ 4682 1.000
age 4341 1.000
age ’ 4768 1.000
age ’ ’ 4236 1.000
sex = M 4157 1.000
treat =5000 U * week 5433 1.000
treat = Placebo * week 5088 0.999
treat =5000 U * week ’ 5185 0.999
treat = Placebo * week ’ 5109 0.999
age * sex = M 4192 1.000
age ’ * sex = M 4566 1.001
age ’ ’ * sex = M 5666 1.000
sigmag 1140 1.001

bmarkre

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.044 for Intercepts

blrm(formula = twstrs ~ treat * rcs(week, 3) + rcs(ptwstrs, 4) +
     rcs(age, 4) * sex + cluster(uid), data = both, file = "bmarkre.rds")

Frequencies of Missing Values Due to Each Variable

twstrs treat week ptwstrs age sex


0 0 0 5 0 0
cluster(uid)
0

Mixed Calibration/ Discrimination Rank Discrim.


Discrimination Indexes Indexes Indexes
Obs 517 LOO log L -1786.29±22.29 g 3.237 [2.946, 3.531] C 0.828 [0.825, 0.83]
Draws 4000 LOO IC 3572.59±44.58 gp 0.414 [0.4, 0.427] Dxy 0.655 [0.649, 0.66]
Chains 4 Effective p 93.13±4.78 EV 0.529 [0.486, 0.565]
p 18 B 0.117 [0.114, 0.121] v 8.253 [6.835, 9.815]
Cluster on uid vp 0.132 [0.123, 0.142]
Clusters 109
σγ 0.1132 [2e-04, 0.3423]

Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


treat=5000U 0.2235 0.2215 0.5608 -0.8493 1.3540 0.6610 1.01
treat=Placebo 1.8359 1.8362 0.5811 0.7254 2.9836 0.9985 1.02
week 0.4859 0.4858 0.0824 0.3266 0.6454 1.0000 1.00
week’ -0.2855 -0.2845 0.0873 -0.4577 -0.1189 0.0003 0.99
ptwstrs 0.1992 0.1995 0.0269 0.1480 0.2511 1.0000 1.00
ptwstrs’ -0.0645 -0.0654 0.0629 -0.1845 0.0611 0.1545 1.04
ptwstrs” 0.5475 0.5518 0.2472 0.0529 1.0253 0.9870 0.98
age -0.0287 -0.0281 0.0314 -0.0903 0.0324 0.1830 1.04
age’ 0.1213 0.1203 0.0879 -0.0397 0.3000 0.9115 1.02
age” -0.4997 -0.5013 0.3474 -1.1813 0.1563 0.0760 0.97
sex=M -0.3927 -0.4450 2.4026 -4.9408 4.5589 0.4275 1.01
treat=5000U × week -0.0334 -0.0345 0.1078 -0.2363 0.1755 0.3828 1.02
treat=Placebo × week -0.2712 -0.2714 0.1120 -0.4941 -0.0523 0.0092 1.01
treat=5000U × week’ -0.0354 -0.0348 0.1158 -0.2591 0.1854 0.3820 0.99
treat=Placebo × week’ 0.1182 0.1175 0.1216 -0.1151 0.3580 0.8352 1.00
age × sex=M 0.0093 0.0105 0.0580 -0.1094 0.1206 0.5678 0.99
age’ × sex=M -0.0453 -0.0458 0.1661 -0.3660 0.2768 0.3885 1.01
age” × sex=M 0.2428 0.2452 0.6502 -0.9813 1.5374 0.6418 1.00

The random effects SD is only 0.11 on the logit scale. Also, the standard deviations of all the regression parameter posterior distributions are virtually unchanged with the addition of random effects:
plot ( sqrt ( diag ( vcov ( bmark ) ) ) , sqrt ( diag ( vcov ( bmarkre ) ) ) ,
xlab = ’ Posterior SDs in Conditional Independence Markov Model ’ ,
ylab = ’ Posterior SDs in Random Effects Markov Model ’)
abline ( a =0 , b =1 , col = gray (0 .85 ) )
[Figure: posterior SDs in the random effects Markov model plotted against posterior SDs in the conditional independence Markov model; points fall along the line of identity]

So we will use the model omitting random effects. Show the partial effects of all the predictors, including the effect of the previous measurement of TWSTRS. Also compute high dose:placebo treatment contrasts on these conditional estimates.
ggplot ( Predict ( bmark ) )
ggplot ( Predict ( bmark , week , treat ) )
[Figure: partial effects (log odds) of age, TWSTRS-total score (ptwstrs), sex, treatment, and week; week effects shown by treatment. Adjusted to: ptwstrs=42 age=56 sex=F]
k ← contrast ( bmark , list ( week = wks , treat = ’ 10000 U ’) ,
list ( week = wks , treat = ’ Placebo ’) ,
cnames = paste ( ’ Week ’ , wks ) )
k

week Contrast S.E. Lower Upper Pr ( Contrast >0)


1 Week 2 2 -1.2881567 0.3904306 -2.0283753 -0.5065498 0.0003
2 Week 4 4 -0.7443446 0.2636565 -1.2614970 -0.2410560 0.0018
3 Week 8 8 0.2239069 0.3507381 -0.4624913 0.9211815 0.7382
4* Week 12 12 0.7146670 0.2647678 0.2136860 1.2530821 0.9952
5* Week 16 16 1.0860543 0.4058400 0.3473830 1.8813078 0.9955

Redundant contrasts are denoted by *

Intervals are 0.95 highest posterior density intervals


Contrast is the posterior mean

plot ( k )
k ← as.data.frame ( k [ c ( ’ week ’ , ’ Contrast ’ , ’ Lower ’ , ’ Upper ’) ])
ggplot (k , aes ( x = week , y = Contrast ) ) + geom_point () +
geom_line () + ylab ( ’ High Dose - Placebo ’) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0)
[Figure: posterior densities of the weekly high dose − placebo contrasts, and posterior mean contrast vs. week with 0.95 HPD error bars]

Using posterior means for parameter values, compute the probability that at a given week twstrs will be ≥ 40 when at the previous visit it was 40. Also show the conditional mean twstrs when it was 40 at the previous visit.
ex ← ExProb ( bmark )
ex40 ← function ( lp , ... ) ex ( lp , y =40 , ... )
ggplot ( Predict ( bmark , week , treat , ptwstrs =40 , fun = ex40 ) )

[Figure: Pr(TWSTRS ≥ 40) vs. week by treatment, given ptwstrs=40. Adjusted to: age=56 sex=F]

ggplot ( Predict ( bmark , week , treat , ptwstrs =40 , fun = Mean ( bmark ) ) )
[Figure: conditional mean TWSTRS vs. week by treatment, given ptwstrs=40. Adjusted to: age=56 sex=F]

ˆ Semiparametric models provide not only estimates of tendencies of Y but also estimate the whole distribution of Y

ˆ Estimate the entire conditional distribution of Y at week 12 for high-dose patients having TWSTRS=42 at week 8

ˆ Other covariates set to median/mode

ˆ Use posterior mean of all the cell probabilities

ˆ Also show pointwise 0.95 highest posterior density intervals

ˆ To roughly approximate simultaneous confidence bands make the pointwise limits sum to 1 like the posterior means do
# Get median / mode for covariates including ptwstrs ( TWSTRS in previous visit )
d ← gendata ( bmark )
d

treat week ptwstrs age sex


1 10000 U 8 42 56 F

d $ week ← 12
p ← predict ( bmark , d , type = ’ fitted.ind ’) # defaults to posterior means
yvals ← as.numeric ( sub ( ’ twstrs = ’ , ’ ’ , p $ y ) )
lo ← p $ Lower / sum ( p $ Lower )
hi ← p $ Upper / sum ( p $ Upper )
plot ( yvals , p $ Mean , type = ’l ’ , xlab = ’ TWSTRS ’ , ylab = ’ ’ ,
ylim = range ( c ( lo , hi ) ) )
lines ( yvals , lo , col = gray (0 .8 ) )
lines ( yvals , hi , col = gray (0 .8 ) )

[Figure: posterior mean cell probabilities for TWSTRS at week 12 (solid curve) with pointwise 0.95 HPD limits (gray)]

ˆ Repeat this showing the variation over 5 posterior draws
p ← predict ( bmark , d , type = ’ fitted.ind ’ , posterior.summary = ’ all ’)
cols ← adjustcolor (1 : 10 , 0 .7 )
for ( i in 1 : 5) {
  if ( i == 1) plot ( yvals , p [i , 1 , ] , type = ’l ’ , col = cols [1] ,
                      xlab = ’ TWSTRS ’ , ylab = ’ ’)
  else lines ( yvals , p [i , 1 , ] , col = cols [ i ])
}
[Figure: five posterior draws of the week 12 cell probability distribution for TWSTRS]

ˆ Turn to marginalized (unconditional on previous twstrs) quantities

ˆ Capitalize on PO model being a multinomial model, just with PO restrictions

ˆ Manipulations of conditional probabilities to get the unconditional probability that twstrs=y doesn’t need to know about PO

ˆ Compute all cell probabilities and use the law of total probability recursively (a minimal code sketch of the recursion follows this list):

Pr(Yt = y|X) = Σ_{j=1}^{k} Pr(Yt = y|X, Yt−1 = j) Pr(Yt−1 = j|X)

ˆ predict.blrm method with type=’fitted.ind’ computes the needed conditional cell probabilities, optionally for all posterior draws at once

ˆ Easy to get highest posterior density intervals for derived parameters such as unconditional probabilities or unconditional means

ˆ Hmisc package soprobMarkovOrdm function (in version 4.6) computes an array of all the state occupancy probabilities for all the posterior draws
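A minimal sketch of the recursion, assuming a hypothetical function cp(t) that returns the k × k matrix of conditional cell probabilities for one covariate setting and one posterior draw (row j is the distribution of Y(t) given previous state j; in the real analysis these come from predict.blrm):

occupancy ← function ( cp , p0 , times ) {
  p ← p0                                         # k-vector : distribution of initial state
  P ← matrix ( NA , length ( times ) , length ( p0 ) )
  for ( i in seq_along ( times ) ) {
    M ← cp ( times [ i ])                        # conditional cell probabilities at this time
    p ← as.vector ( p %*% M )                    # law of total probability
    P [i , ] ← p                                 # state occupancy probabilities
  }
  P
}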
# Baseline twstrs set to 42 in d
# For each dose , get all the posterior draws for all state occupancy
# probabilities for all visits
ylev ← sort ( unique ( both $ twstrs ) )
tlev ← c ( ’ Placebo ’ , ’ 10000 U ’)
R ← list ()
for ( trt in tlev ) { # separately by treatment
d $ treat ← trt
u ← soprobMarkovOrdm ( bmark , d , wks , ylev ,
tvarname = ’ week ’ , pvarname = ’ ptwstrs ’)
R [[ trt ]] ← u
}
dim ( R [[1]]) # posterior draws x times x distinct twstrs values

[1] 4000 5 62

# For each posterior draw , treatment , and week compute the mean TWSTRS
# Then compute posterior mean of means , and HPD interval
Rmean ← Rmeans ← list ()
for ( trt in tlev ) {
r ← R [[ trt ]]
# Mean Y at each week and posterior draw ( mean from a discrete distribution )
m ← apply (r , 1:2 , function ( x ) sum ( ylev * x ) )
Rmeans [[ trt ]] ← m
# Posterior mean and median and HPD interval over draws
u ← apply (m , 2 , f ) # f defined above
u ← rbind ( week = as.numeric ( colnames ( u ) ) , u )
Rmean [[ trt ]] ← u
}
r ← lapply ( Rmean , function ( x ) as.data.frame ( t ( x ) ) )
for ( trt in tlev ) r [[ trt ]] $ treat ← trt
r ← do.call ( rbind , r )
ggplot (r , aes ( x = week , y = Mean , color = treat ) ) + geom_line () +
geom_ribbon ( aes ( ymin = Lower , ymax = Upper ) , alpha =0 .2 , linetype =0)
[Figure: posterior mean of the marginal mean TWSTRS vs. week by treatment, with 0.95 HPD bands]

ˆ Use the same posterior draws of unconditional probabilities of all values of TWSTRS to get the posterior distribution of differences in mean TWSTRS between high and low dose
Dif ← Rmeans $ ‘10000 U ‘ - Rmeans $ Placebo
dif ← as.data.frame ( t ( apply ( Dif , 2 , f ) ) )
dif $ week ← as.numeric ( rownames ( dif ) )
ggplot ( dif , aes ( x = week , y = Mean ) ) + geom_line () +
geom_ribbon ( aes ( ymin = Lower , ymax = Upper ) , alpha =0 .2 , linetype =0) +
ylab ( ’ High Dose - Placebo TWSTRS ’)
[Figure: marginal high dose − placebo difference in mean TWSTRS vs. week, with 0.95 HPD band]

ˆ Get posterior mean of all cell probability estimates at week 12

ˆ Distribution of TWSTRS conditional on high dose, median age, mode sex

ˆ Not conditional on week 8 value

p ← R $ ‘10000 U ‘[ , ’ 12 ’ , ] # 4000 x 62
pmean ← apply (p , 2 , mean )
yvals ← as.numeric ( names ( pmean ) )
plot ( yvals , pmean , type = ’l ’ , xlab = ’ TWSTRS ’ , ylab = ’ ’)

[Figure: posterior mean unconditional distribution of TWSTRS at week 12 for high dose]
Chapter 8

Binary Logistic Regression


ˆ Y = 0, 1

ˆ Time of event not important

ˆ In C(Y |X), C is Prob{Y = 1}

ˆ g(u) is 1/(1 + e−u)

8.1 Model

Prob{Y = 1|X} = 1/[1 + exp(−Xβ)]

[Figure: plot of the logistic function P = 1/[1 + exp(−x)]]
Figure 8.1: Logistic function
ˆ O = P/(1 − P)

ˆ P = O/(1 + O)

ˆ Xβ = log[P/(1 − P)]

ˆ eXβ = O
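These conversions are built into base R; a quick check of the identities:

p ← 0.8
o ← p / (1 - p)          # odds O = 4
o / (1 + o)              # back to P = 0.8
qlogis ( p )             # log odds = log (4) ≈ 1.386
plogis ( qlogis ( p ) )  # inverse logit recovers 0.8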

8.1.1 Model Assumptions and Interpretation of Parameters

logit{Y = 1|X} = logit(P) = log[P/(1 − P)] = Xβ

ˆ Increase Xj by d → increase odds Y = 1 by exp(βj d), increase log odds by βj d.

ˆ If there is only one predictor X and that predictor is binary, the model can be written
logit{Y = 1|X = 0} = β0
logit{Y = 1|X = 1} = β0 + β1.

ˆ One continuous predictor:
logit{Y = 1|X} = β0 + β1X

ˆ Two treatments (indicated by X1 = 0 or 1) and one continuous covariable (X2):
logit{Y = 1|X} = β0 + β1X1 + β2X2, so that
logit{Y = 1|X1 = 0, X2} = β0 + β2X2
logit{Y = 1|X1 = 1, X2} = β0 + β1 + β2X2.
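A one-line check of the first bullet, assuming a hypothetical lrm fit f with a continuous predictor named x:

d ← 10
exp ( coef ( f ) [ ’x ’] * d )   # odds ratio for a d-unit increase in x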

8.1.2 Odds Ratio, Risk Ratio, and Risk Difference

ˆ Odds ratio capable of being constant

ˆ Ex: risk factor doubles odds of disease


Without Risk Factor With Risk Factor
Probability Odds Odds Probability
.2 .25 .5 .33
.5 1 2 .67
.8 4 8 .89
.9 9 18 .95
.98 49 98 .99
plot (0 , 0 , type = " n " , xlab = " Risk for Subject Without Risk Factor " ,
ylab = " Increase in Risk " ,
xlim = c (0 ,1) , ylim = c (0 , .6 ) ) # Figure 8.2
i ← 0
or ← c (1 .1 ,1 .25 ,1 .5 ,1 .75 ,2 ,3 ,4 ,5 ,10)
for ( h in or ) {
i ← i + 1
p ← seq ( .0001 , .9999 , length =200)
logit ← log ( p / (1 - p ) ) # same as qlogis ( p )
logit ← logit + log ( h ) # modify by odds ratio
p2 ← 1 / (1 + exp ( -logit ) ) # same as plogis ( logit )
d ← p2 - p
lines (p , d , lty = i )
maxd ← max ( d )
smax ← p [ d == maxd ]
text ( smax , maxd + .02 , format ( h ) , cex = .6 )
}

Let X1 be a binary risk factor and let A = {X2, . . . , Xp} be the other factors. Then the estimate of Prob{Y = 1|X1 = 1, A} − Prob{Y = 1|X1 = 0, A} is

1/(1 + exp(−[β̂0 + β̂1 + β̂2X2 + . . . + β̂pXp])) − 1/(1 + exp(−[β̂0 + β̂2X2 + . . . + β̂pXp]))

Figure 8.2: Absolute benefit as a function of risk of the event in a control subject and the relative effect (odds ratio) of the risk
factor. The odds ratios are given for each curve.

= 1/(1 + ((1 − R̂)/R̂) exp(−β̂1)) − R̂,

where R̂ = Prob[Y = 1|X1 = 0, A].

ˆ Risk ratio is (1 + e−X2β)/(1 + e−X1β)

ˆ Does not simplify like odds ratio, which is eX1β/eX2β = e(X1−X2)β
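A small R check of these relationships, reproducing the first row of the table above (baseline risk .2, odds ratio 2):

risk.with ← function ( R , or ) plogis ( qlogis ( R ) + log ( or ) )
risk.with (0.2 , 2)          # 0.33 , as in the table
risk.with (0.2 , 2) - 0.2    # absolute increase in risk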

8.1.3 Detailed Example
Females Age: 37 39 39 42 47 48 48 52 53 55 56 57 58 58 60 64 65 68 68 70
Response: 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 1
Males Age: 34 38 40 40 41 43 43 43 44 46 47 48 48 50 50 52 55 60 61 61
Response: 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1
require ( rms )
getHdata ( sex.age.response )
d ← sex.age.response
dd ← datadist ( d ) ; options ( datadist = ’ dd ’)
f ← lrm ( response ∼ sex + age , data = d )
fasr ← f # Save for later
w ← function ( ... )
with (d , {
m ← sex == ’ male ’
f ← sex == ’ female ’
lpoints ( age [ f ] , response [ f ] , pch =1)
lpoints ( age [ m ] , response [ m ] , pch =2)
af ← cut2 ( age , c (45 ,55) , levels.mean = TRUE )
prop ← tapply ( response , list ( af , sex ) , mean ,
na.rm = TRUE )
agem ← as.numeric ( row.names ( prop ) )
lpoints ( agem , prop [ , ’ female ’] ,
pch =4 , cex =1 .3 , col = ’ green ’)
lpoints ( agem , prop [ , ’ male ’] ,
pch =5 , cex =1 .3 , col = ’ green ’)
x ← rep (62 , 4) ; y ← seq ( .25 , .1 , length =4)
lpoints (x , y , pch = c (1 , 2 , 4 , 5) ,
col = rep ( c ( ’ blue ’ , ’ green ’) , each =2) )
ltext ( x +5 , y ,
c ( ’F Observed ’ , ’M Observed ’ ,
’F Proportion ’ , ’M Proportion ’) , cex = .8 )
} ) # Figure 8.3

plot ( Predict (f , age = seq (34 , 70 , length =200) , sex , fun = plogis ) ,
ylab = ’ Pr [ response ] ’ , ylim = c ( -.02 , 1 .02 ) , addpanel = w )
# Hmisc :: latex prevents quantreg :: latex from conflicting
ltx ← function ( fit ) Hmisc :: latex ( fit , inline = TRUE , columns =54 ,
file = ’ ’ , after = ’$ . ’ , digits =3 ,
size = ’ Ssize ’ , before = ’$ X \\ hat {\\ beta }= ’)
ltx ( f )

X β̂ = −9.84 + 3.49[male] + 0.158 age.

sex response
Frequency
Row Pct 0 1 Total Odds/Log

F 14 6 20 6/14=.429
70.00 30.00 -.847

M 6 14 20 14/6=2.33
30.00 70.00 .847

Total 20 20 40

M:F odds ratio = (14/6)/(6/14) = 5.44, log=1.695

sex × response

Statistic DF Value Prob

Chi Square 1 6.400 0.011


Likelihood Ratio Chi-Square 1 6.583 0.010


Figure 8.3: Data, subgroup proportions, and fitted logistic model, with 0.95 pointwise confidence bands

Parameter Estimate Std Err Wald χ2 P

β0 -0.847 0.488 3.015


β1 1.695 0.690 6.030 0.014

Log likelihood (β1 = 0) : -27.727


Log likelihood (max) : -24.435
LR χ2(H0 : β1 = 0) : -2(-27.727- -24.435) = 6.584

Next, consider the relationship between age and response, ignoring sex.
age response
Frequency
Row Pct 0 1 Total Odds/Log

<45 8 5 13 5/8=.625
61.5 38.4 -.47

45-54 6 6 12 6/6=1
50.0 50.0 0

55+ 6 9 15 9/6=1.5
40.0 60.0 .405

Total 20 20 40

55+ : <45 odds ratio = (9/6)/(5/8) = 2.4, log=.875

Parameter Estimate Std Err Wald χ2 P

β0 -2.734 1.838 2.213


β1 0.054 0.036 2.276 0.131

The estimate of β1 is in rough agreement with that obtained from the frequency table. The 55+:<45 log odds ratio is .875, and since the respective mean ages in the 55+ and <45 age groups are 61.1 and 40.2, an estimate of the log odds ratio increase per year is .875/(61.1 − 40.2) = .875/20.9 = .042.

The likelihood ratio test for H0: no association between age and response is obtained as follows:

Log likelihood (β1 = 0) : -27.727


Log likelihood (max) : -26.511
LR χ2(H0 : β1 = 0) : -2(-27.727- -26.511) = 2.432

(Compare 2.432 with the Wald statistic 2.28.)

Next we consider the simultaneous association of age and sex with response.

sex=F

age response
Frequency
Row Pct 0 1 Total

<45 4 0 4
100.0 0.0

45-54 4 1 5
80.0 20.0

55+ 6 5 11
54.6 45.4

Total 14 6 20

sex=M

age response
Frequency
Row Pct 0 1 Total

<45 4 5 9
44.4 55.6

45-54 2 5 7
28.6 71.4

55+ 0 4 4
0.0 100.0

Total 6 14 20

A logistic model for relating sex and age simultaneously to response is given below.

Parameter Estimate Std Err Wald χ2 P

β0 -9.843 3.676 7.171


β1 (sex) 3.490 1.199 8.469 0.004
β2 (age) 0.158 0.062 6.576 0.010

Likelihood ratio tests are obtained from the information below.

Log likelihood (β1 = 0, β2 = 0) : -27.727


Log likelihood (max) : -19.458
Log likelihood (β1 = 0) : -26.511
Log likelihood (β2 = 0) : -24.435
LR χ2 (H0 : β1 = β2 = 0) : -2(-27.727- -19.458)= 16.538
LR χ2 (H0 : β1 = 0) sex|age : -2(-26.511- -19.458) = 14.106
LR χ2 (H0 : β2 = 0) age|sex : -2(-24.435- -19.458) = 9.954

The 14.1 should be compared with the Wald statistic of 8.47, and 9.954 should be compared with 6.58. The fitted logistic model is plotted separately for females and males in Figure 8.3. The fitted model is

logit{Response = 1|sex, age} = −9.84 + 3.49 × sex + .158 × age,

where as before sex=0 for females, 1 for males. For example, for a 40 year old female, the predicted logit is −9.84 + .158(40) = −3.52. The predicted probability of a response is 1/[1 + exp(3.52)] = .029. For a 40 year old male, the predicted logit is −9.84 + 3.49 + .158(40) = −.03, with a probability of .492.
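These hand calculations are easily verified with plogis:

b ← c ( -9.84 , 3.49 , 0.158 )          # intercept , male , age
plogis ( b [1] + b [3] * 40 )           # 40 year old female : 0.029
plogis ( b [1] + b [2] + b [3] * 40 )   # 40 year old male : 0.492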
8.1.4 Design Formulations

ˆ Can do ANOVA using k − 1 dummies for a k-level predictor

ˆ Can get same χ2 statistics as from a contingency table

ˆ Can go farther: covariable adjustment



ˆ Simultaneous comparison of multiple variables between two groups: Turn problem backwards to predict group from all the dependent variables

ˆ This is more robust than a parametric multivariate test

ˆ Propensity scores for adjusting for nonrandom treatment selection: Predict treatment from all baseline variables

ˆ Adjusting for the predicted probability of getting a treatment adjusts adequately for confounding from all of the variables

ˆ In a randomized study, using a logistic model to adjust for covariables, even with perfect balance, will improve the treatment effect estimate
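A minimal sketch of the propensity score idea (hypothetical variable names, not from these data):

ps ← lrm ( treatment ∼ rcs ( age , 4) + sex + rcs ( sbp , 4) , data = d )
d $ prob.treat ← predict ( ps , type = ’ fitted ’)   # estimated Pr ( treatment )
# Then adjust for the propensity in the outcome model , e.g. rcs ( qlogis ( prob.treat ) , 4)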

8.2 Estimation

8.2.1 Maximum Likelihood Estimates

Like the binomial case but the Pi vary; β̂ is computed by trial and error using an iterative maximization technique

8.2.2 Estimation of Odds Ratios and Probabilities

P̂i = 1/[1 + exp(−Xiβ̂)], with confidence limits 1/{1 + exp[−(Xiβ̂ ± zs)]}.
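In rms these quantities come from Predict, which forms Xiβ̂ ± zs on the logit scale and then transforms with fun; a sketch using the fit fasr saved in the detailed example above:

Predict ( fasr , age =40 , sex = ’ female ’ , fun = plogis )   # estimate and 0.95 limits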

8.2.3 Minimum Sample Size Requirement

ˆ Simplest case: no covariates, only an intercept

ˆ Consider margin of error of 0.1 in estimating θ = Prob[Y = 1] with 0.95 confidence

ˆ Worst case: θ = 1/2

ˆ Requires n = 96 observationsᵃ

ᵃ The general formula for the sample size required to achieve a margin of error of δ in estimating a true probability of θ at the 0.95 confidence level is n = (1.96/δ)² × θ(1 − θ). Set θ = 1/2 for the worst case.

ˆ Single binary predictor with prevalence 1/2: need n = 96 for each value of X

ˆ For margin of error of ±0.05, n = 384 is required (if true probabilities near 0.5 are possible); n = 246 required if true probabilities are only known not to be in [0.2, 0.8]. A quick check of these numbers appears after this list.

ˆ Single continuous predictor X having a normal distribution with mean zero and standard deviation σ, with true P = 1/(1 + exp(−X)) so that the expected number of events is n/2. Compute the mean of max over X ∈ [−1.5, 1.5] of |P − P̂| over 1000 simulations for varying n and σᵇ

ᵇ An average absolute error of 0.05 corresponds roughly to a 0.95 confidence interval margin of error of 0.1.
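The footnote formula computed directly reproduces the sample sizes above:

moe.n ← function ( delta , theta =0.5 ) round ((1.96 / delta ) ^ 2 * theta * (1 - theta ) )
moe.n (0.10 )        # 96
moe.n (0.05 )        # 384
moe.n (0.05 , 0.2 )  # 246 when true probabilities are outside [0.2 , 0.8]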
sigmas ← c ( .5 , .75 , 1 , 1 .25 , 1 .5 , 1 .75 , 2 , 2 .5 , 3 , 4)
ns ← seq (25 , 300 , by =25)
nsim ← 1000
xs ← seq ( -1.5 , 1 .5 , length =200)
pactual ← plogis ( xs )

dn ← list ( sigma = format ( sigmas ) , n = format ( ns ) )


maxerr ← N1 ← array ( NA , c ( length ( sigmas ) , length ( ns ) ) , dn )
require ( rms )

i ← 0
for ( s in sigmas ) {
i ← i + 1
j ← 0
for ( n in ns ) {
j ← j + 1
n1 ← maxe ← 0
for ( k in 1: nsim ) {
x ← rnorm (n , 0 , s )
P ← plogis ( x )
y ← ifelse ( runif ( n ) ≤ P , 1 , 0)
n1 ← n1 + sum ( y )
beta ← lrm.fit (x , y ) $ coefficients
phat ← plogis ( beta [1] + beta [2] * xs )
maxe ← maxe + max ( abs ( phat - pactual ) )
}
n1 ← n1 / nsim
maxe ← maxe / nsim
maxerr [i , j ] ← maxe
N1 [i , j ] ← n1
}
}
xrange ← range ( xs )
simerr ← llist ( N1 , maxerr , sigmas , ns , nsim , xrange )

maxe ← reShape ( maxerr )


# Figure 8.4
xYplot ( maxerr ∼ n , groups = sigma , data = maxe ,
ylab = expression ( paste ( ’ Average Maximum ’ ,
abs ( hat ( P ) - P ) ) ) ,
type = ’l ’ , lty = rep (1:2 , 5) , label.curve = FALSE ,
abline = list ( h = c ( .15 , .1 , .05 ) , col = gray ( .85 ) ) )
Key ( .8 , .68 , other = list ( cex = .7 ,
title = expression (∼∼∼∼∼∼∼∼∼∼∼sigma ) ) )


Figure 8.4: Simulated expected maximum error in estimating probabilities for x ∈ [−1.5, 1.5] with a single normally distributed X
with mean zero

8.3 Test Statistics

ˆ Likelihood ratio test best

ˆ Score test second best (score χ2 ≡ Pearson χ2)

ˆ Wald test may misbehave but is quick



8.4 Residuals

Partial residuals (to check predictor transformations):

rij = β̂j Xij + (Yi − P̂i)/[P̂i(1 − P̂i)]
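A sketch of their use, with hypothetical predictor names x1 and x2 (x=TRUE, y=TRUE are required by residuals.lrm):

f ← lrm ( y ∼ x1 + x2 , data = d , x = TRUE , y = TRUE )
r ← resid ( f , ’ partial ’)             # matrix , one column per predictor
plsmo ( f $ x [ , ’ x1 ’] , r [ , ’ x1 ’] ,
        ylab = ’ Partial residual ’)     # loess smooth suggests a transformation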

8.5 Assessment of Model Fit

logit{Y = 1|X} = β0 + β1X1 + β2X2
Figure 8.5: Logistic regression assumptions for one binary and one continuous predictor

getHdata ( acath )
acath $ sex ← factor ( acath $ sex , 0:1 , c ( ’ male ’ , ’ female ’) )
dd ← datadist ( acath ) ; options ( datadist = ’ dd ’)
f ← lrm ( sigdz ∼ rcs ( age , 4) * sex , data = acath )

w ← function ( ... )
with ( acath , {
plsmo ( age , sigdz , group = sex , fun = qlogis , lty = ’ dotted ’ ,
add = TRUE , grid = TRUE )
af ← cut2 ( age , g =10 , levels.mean = TRUE )
prop ← qlogis ( tapply ( sigdz , list ( af , sex ) , mean ,
na.rm = TRUE ) )
agem ← as.numeric ( row.names ( prop ) )
lpoints ( agem , prop [ , ’ female ’] , pch =4 , col = ’ green ’)
lpoints ( agem , prop [ , ’ male ’] , pch =2 , col = ’ green ’)
} ) # Figure 8.6
plot ( Predict (f , age , sex ) , ylim = c ( -2 ,4) , addpanel =w ,
label.curve = list ( offset = unit (0 .5 , ’ cm ’) ) )
ˆ Can verify by plotting stratified proportions




Figure 8.6: Logit proportions of significant coronary artery disease by sex and deciles of age for n=3504 patients, with spline fits
(smooth curves). Spline fits are for k = 4 knots at age= 36, 48, 56, and 68 years, and interaction between age and sex is allowed.
Shaded bands are pointwise 0.95 confidence limits for predicted log odds. Smooth nonparametric estimates are shown as dotted
curves. Data courtesy of the Duke Cardiovascular Disease Databank.

ˆ P̂ = number of events divided by stratum size

ˆ Ô = P̂/(1 − P̂)

ˆ Plot log Ô (scale on which linearity is assumed)

ˆ Stratified estimates are noisy

ˆ 1 or 2 Xs → nonparametric smoother

ˆ plsmo function makes it easy to use loess to compute logits of nonparametric estimates (fun=qlogis)

ˆ General: restricted cubic spline expansion of one or more predictors

logit{Y = 1|X} = β̂0 + β̂1X1 + β̂2X2 + β̂3X2′ + β̂4X2″ = β̂0 + β̂1X1 + f(X2),

logit{Y = 1|X} = β0 + β1X1 + β2X2 + β3X2′ + β4X2″ + β5X1X2 + β6X1X2′ + β7X1X2″
lr ← function ( formula )
{
f ← lrm ( formula , data = acath )
stats ← f $ stats [ c ( ’ Model L.R. ’ , ’ d.f. ’) ]
cat ( ’ L.R. Chi-square : ’ , round ( stats [1] ,1) ,
’ d.f. : ’ , stats [2] , ’\ n ’)
f
}
a ← lr ( sigdz ∼ sex + age )

L . R . Chi - square : 766 d . f .: 2

b ← lr ( sigdz ∼ sex * age )

L . R . Chi - square : 768.2 d . f .: 3

c ← lr ( sigdz ∼ sex + rcs ( age ,4) )

L . R . Chi - square : 769.4 d . f .: 4

d ← lr ( sigdz ∼ sex * rcs ( age ,4) )

L . R . Chi - square : 782.5 d . f .: 7

lrtest (a , b )

Model 1: sigdz ∼ sex + age


Model 2: sigdz ∼ sex * age

L . R . Chisq d.f. P
2.1964146 1.0000000 0.1383322

lrtest (a , c )

Model 1: sigdz ∼ sex + age


Model 2: sigdz ∼ sex + rcs ( age , 4)

L . R . Chisq d.f. P
3.4502500 2.0000000 0.1781508

lrtest (a , d )

Model 1: sigdz ∼ sex + age


Model 2: sigdz ∼ sex * rcs ( age , 4)

L . R . Chisq d.f. P
16.547036344 5.000000000 0.005444012

lrtest (b , d )

Model 1: sigdz ∼ sex * age


Model 2: sigdz ∼ sex * rcs ( age , 4)

L . R . Chisq d.f. P
14.350621767 4.000000000 0.006256138

lrtest (c , d )

Model 1: sigdz ∼ sex + rcs ( age , 4)


Model 2: sigdz ∼ sex * rcs ( age , 4)

L . R . Chisq d.f. P
13.096786352 3.000000000 0.004431906

Model / Hypothesis                                      LR χ2   d.f.  P      Formula
a: sex, age (linear, no interaction)                    766.0    2
b: sex, age, age × sex                                  768.2    3
c: sex, spline in age                                   769.4    4
d: sex, spline in age, interaction                      782.5    7
H0: no age × sex interaction given linearity              2.2    1    .14    (b − a)
H0: age linear | no interaction                           3.4    2    .18    (c − a)
H0: age linear, no interaction                           16.6    5    .005   (d − a)
H0: age linear, product form interaction                 14.4    4    .006   (d − b)
H0: no interaction, allowing for nonlinearity in age     13.1    3    .004   (d − c)

ˆ Example of finding transformation of a single continuous predictor

ˆ Duration of symptoms vs. odds of severe coronary disease

ˆ Look at AIC to find best # knots for the money

k Model χ2 AIC
0 99.23 97.23
3 112.69 108.69
4 121.30 115.30
5 123.51 115.51
6 124.41 114.41
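The AIC column is the model χ2 penalized by twice its degrees of freedom (k − 1 for a k-knot restricted cubic spline; 1 for the linear fit shown as k = 0):

k ← c (0 , 3 , 4 , 5 , 6)
chisq ← c (99.23 , 112.69 , 121.30 , 123.51 , 124.41 )
dof ← ifelse ( k == 0 , 1 , k - 1)
chisq - 2 * dof   # 97.23 108.69 115.30 115.51 114.41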
dz ← subset ( acath , sigdz ==1)
dd ← datadist ( dz )

f ← lrm ( tvdlm ∼ rcs ( cad.dur , 5) , data = dz )


w ← function ( ... )
with ( dz , {
plsmo ( cad.dur , tvdlm , fun = qlogis , add = TRUE ,
grid = TRUE , lty = ’ dotted ’)
x ← cut2 ( cad.dur , g =15 , levels.mean = TRUE )
prop ← qlogis ( tapply ( tvdlm , x , mean , na.rm = TRUE ) )
xm ← as.numeric ( names ( prop ) )
lpoints ( xm , prop , pch =2 , col = ’ green ’)
} ) # Figure 8.7
plot ( Predict (f , cad.dur ) , addpanel = w )


Figure 8.7: Estimated relationship between duration of symptoms and the log odds of severe coronary artery disease for k = 5.
Knots are marked with arrows. Solid line is spline fit; dotted line is a nonparametric loess estimate.
f ← lrm ( tvdlm ∼ log10 ( cad.dur + 1) , data = dz )
w ← function ( ... )
with ( dz , {
x ← cut2 ( cad.dur , m =150 , levels.mean = TRUE )
prop ← tapply ( tvdlm , x , mean , na.rm = TRUE )
CHAPTER 8. BINARY LOGISTIC REGRESSION 8-23

xm ← as.numeric ( names ( prop ) )


lpoints ( xm , prop , pch =2 , col = ’ green ’)
} )
# Figure 8.8
plot ( Predict (f , cad.dur , fun = plogis ) , ylab = ’P ’ ,
ylim = c ( .2 , .8 ) , addpanel = w )


Figure 8.8: Fitted linear logistic model in log10 (duration+1), with subgroup estimates using groups of 150 patients. Fitted equation
is logit(tvdlm) = −.9809 + .7122 log10 (months + 1).

Modeling Interaction Surfaces

ˆ Sample of 2258 pts (many patients had missing cholesterol)

ˆ Predict significant coronary disease

ˆ For now stratify age into tertiles to examine interactions simply

ˆ Model has 2 dummies for age, sex, age × sex, 4-knot restricted cubic spline in cholesterol, age tertile × cholesterol
acath ← transform ( acath ,
cholesterol = choleste ,
age.tertile = cut2 ( age , g =3) ,
sx = as.integer ( acath $ sex ) - 1)
# sx for loess , need to code as numeric
dd ← datadist ( acath ) ; options ( datadist = ’ dd ’)

# First model stratifies age into tertiles to get more
# empirical estimates of age x cholesterol interaction

f ← lrm ( sigdz ∼ age.tertile * ( sex + rcs ( cholesterol ,4) ) ,


data = acath )
f

Logistic Regression Model


lrm(formula = sigdz ~ age.tertile * (sex + rcs(cholesterol, 4)),
data = acath)

Frequencies of Missing Values Due to Each Variable

sigdz age.tertile sex cholesterol


0 0 0 1246

Model Likelihood Ratio Test: Obs 2258 (sigdz=0: 768, sigdz=1: 1490), LR χ2 533.52, d.f. 14, Pr(>χ2) <0.0001, max|∂log L/∂β| 2×10−8
Discrimination Indexes: R2 0.291, R2(14,2258) 0.206, R2(14,1520.4) 0.289, Brier 0.173
Rank Discrim. Indexes: C 0.780, Dxy 0.560, γ 0.560, τa 0.251

β̂ S.E. Wald Z Pr(> |Z|)


Intercept -0.4155 1.0987 -0.38 0.7053
age.tertile=[49,58) 0.8781 1.7337 0.51 0.6125
age.tertile=[58,82] 4.7861 1.8143 2.64 0.0083
sex=female -1.6123 0.1751 -9.21 <0.0001
cholesterol 0.0029 0.0060 0.48 0.6347
cholesterol’ 0.0384 0.0242 1.59 0.1126
cholesterol” -0.1148 0.0768 -1.49 0.1350
age.tertile=[49,58) × sex=female -0.7900 0.2537 -3.11 0.0018
age.tertile=[58,82] × sex=female -0.4530 0.2978 -1.52 0.1283
age.tertile=[49,58) × cholesterol 0.0011 0.0095 0.11 0.9093
age.tertile=[58,82] × cholesterol -0.0158 0.0099 -1.59 0.1111
age.tertile=[49,58) × cholesterol’ -0.0183 0.0365 -0.50 0.6162
age.tertile=[58,82] × cholesterol’ 0.0127 0.0406 0.31 0.7550
age.tertile=[49,58) × cholesterol” 0.0582 0.1140 0.51 0.6095
age.tertile=[58,82] × cholesterol” -0.0092 0.1301 -0.07 0.9436

ltx ( f )

X β̂ = −0.415 + 0.878[age.tertile ∈ [49, 58)] + 4.79[age.tertile ∈ [58, 82]] − 1.61[female] +


0.00287cholesterol + 1.52×10−6 (cholesterol − 160)3+ − 4.53×10−6 (cholesterol − 208)3+ + 3.44×
10−6 (cholesterol−243)3+ −4.28×10−7 (cholesterol−319)3+ +[female][−0.79[age.tertile ∈ [49, 58)]−
0.453[age.tertile ∈ [58, 82]]]+[age.tertile ∈ [49, 58)][0.00108cholesterol−7.23×10−7 (cholesterol−
160)3+ +2.3×10−6 (cholesterol−208)3+ −1.84×10−6 (cholesterol−243)3+ +2.69×10−7 (cholesterol−
319)3+ ] + [age.tertile ∈ [58, 82]][−0.0158cholesterol + 5 × 10−7 (cholesterol − 160)3+ − 3.64 ×
10−7 (cholesterol − 208)3+ − 5.15×10−7 (cholesterol − 243)3+ + 3.78×10−7 (cholesterol − 319)3+ ].
print ( anova ( f ) , caption = ’ Crudely categorizing age into tertiles ’ ,
size = ’ smaller ’)

Crudely categorizing age into tertiles

χ2 d.f. P
age.tertile (Factor+Higher Order Factors) 120.74 10 <0.0001
All Interactions 21.87 8 0.0052
sex (Factor+Higher Order Factors) 329.54 3 <0.0001
All Interactions 9.78 2 0.0075
cholesterol (Factor+Higher Order Factors) 93.75 9 <0.0001
All Interactions 10.03 6 0.1235
Nonlinear (Factor+Higher Order Factors) 9.96 6 0.1263
age.tertile × sex (Factor+Higher Order Factors) 9.78 2 0.0075
age.tertile × cholesterol (Factor+Higher Order Factors) 10.03 6 0.1235
Nonlinear 2.62 4 0.6237
Nonlinear Interaction : f(A,B) vs. AB 2.62 4 0.6237
TOTAL NONLINEAR 9.96 6 0.1263
TOTAL INTERACTION 21.87 8 0.0052
TOTAL NONLINEAR + INTERACTION 29.67 10 0.0010
TOTAL 410.75 14 <0.0001
yl ← c ( -1 ,5)
plot ( Predict (f , cholesterol , age.tertile ) ,
adj.subtitle = FALSE , ylim = yl ) # F i g u r e 8.9

Figure 8.9: Log odds of significant coronary artery disease modeling age with two dummy variables

ˆ Now model age as continuous predictor

ˆ Start with nonparametric surface using Y = 0/1


# Re-do model with continuous age
f ← loess ( sigdz ∼ age * ( sx + cholesterol ) , data = acath ,
parametric = " sx " , drop.square = " sx " )
ages ← seq (25 , 75 , length =40)
chols ← seq (100 , 400 , length =40)
g ← expand.grid ( cholesterol = chols , age = ages , sx =0)
# drop sex dimension of grid since held to 1 value
p ← drop ( predict (f , g ) )
p [ p < 0 .001 ] ← 0 .001
p [ p > 0 .999 ] ← 0 .999
zl ← c ( -3 , 6) # Figure 8.10
wireframe ( qlogis ( p ) ∼ cholesterol * age ,
xlab = list ( rot =30) , ylab = list ( rot = -40 ) ,
zlab = list ( label = ’ log odds ’ , rot =90) , zlim = zl ,
scales = list ( arrows = FALSE ) , data = g )

ˆ Next try parametric fit using linear spline in age, chol. (3 knots each), all product terms. For all the remaining 3-d plots we limit plotting to points that are supported by at least 5 subjects beyond those cholesterol/age combinations
f ← lrm ( sigdz ∼ lsp ( age , c (46 ,52 ,59) ) *
( sex + lsp ( cholesterol , c (196 ,224 ,259) ) ) ,
data = acath )
ltx ( f )

Figure 8.10: Local regression fit for the logit of the probability of significant coronary disease vs. age and cholesterol for males,
based on the loess function.

X β̂ = −1.83 + 0.0232 age + 0.0759(age − 46)+ − 0.0025(age − 52)+ + 2.27(age − 59)+ +


3.02[female] − 0.0177 cholesterol + 0.114(cholesterol − 196)+ − 0.131(cholesterol − 224)+ +
0.0651(cholesterol − 259)+ + [female][−0.112 age + 0.0852 (age − 46)+ − 0.0302 (age − 52)+ +
0.176(age−59)+ ]+age[0.000577cholesterol−0.00286(cholesterol−196)+ +0.00382(cholesterol−
224)+ −0.00205(cholesterol−259)+ ]+(age−46)+ [−0.000936cholesterol+0.00643(cholesterol−
196)+ −0.0115(cholesterol−224)+ +0.00756(cholesterol−259)+ ]+(age−52)+ [0.000433cholesterol−
0.0037 (cholesterol − 196)+ + 0.00815 (cholesterol − 224)+ − 0.00715 (cholesterol − 259)+ ] +
(age − 59)+ [−0.0124 cholesterol + 0.015 (cholesterol − 196)+ − 0.0067 (cholesterol − 224)+ +
0.00752 (cholesterol − 259)+ ].
print ( anova ( f ) , caption = ’ Linear spline surface ’ ,
size = ’ smaller ’)

Linear spline surface

χ2 d.f. P
age (Factor+Higher Order Factors) 164.17 24 <0.0001
All Interactions 42.28 20 0.0025
Nonlinear (Factor+Higher Order Factors) 25.21 18 0.1192
sex (Factor+Higher Order Factors) 343.80 5 <0.0001
All Interactions 23.90 4 <0.0001
cholesterol (Factor+Higher Order Factors) 100.13 20 <0.0001
All Interactions 16.27 16 0.4341
Nonlinear (Factor+Higher Order Factors) 16.35 15 0.3595
age × sex (Factor+Higher Order Factors) 23.90 4 <0.0001
Nonlinear 12.97 3 0.0047
Nonlinear Interaction : f(A,B) vs. AB 12.97 3 0.0047
age × cholesterol (Factor+Higher Order Factors) 16.27 16 0.4341
Nonlinear 11.45 15 0.7204
Nonlinear Interaction : f(A,B) vs. AB 11.45 15 0.7204
f(A,B) vs. Af(B) + Bg(A) 9.38 9 0.4033
Nonlinear Interaction in age vs. Af(B) 9.99 12 0.6167
Nonlinear Interaction in cholesterol vs. Bg(A) 10.75 12 0.5503
TOTAL NONLINEAR 33.22 24 0.0995
TOTAL INTERACTION 42.28 20 0.0025
TOTAL NONLINEAR + INTERACTION 49.03 26 0.0041
TOTAL 449.26 29 <0.0001
perim ← with ( acath ,
perimeter ( cholesterol , age , xinc =20 , n =5) )
zl ← c ( -2 , 4) # Figure 8.11
bplot ( Predict (f , cholesterol , age , np =40) , perim = perim ,
lfun = wireframe , zlim = zl , adj.subtitle = FALSE )

ˆ Next try smooth spline surface, include all cross-products


f ← lrm ( sigdz ∼ rcs ( age ,4) * ( sex + rcs ( cholesterol ,4) ) ,
data = acath , tol =1 e-11 )
ltx ( f )

X β̂ = −6.41 + 0.166age − 0.00067(age − 36)3+ + 0.00543(age − 48)3+ − 0.00727(age − 56)3+ +


0.00251(age − 68)3+ + 2.87[female] + 0.00979cholesterol + 1.96 × 10−6 (cholesterol − 160)3+ −
7.16 × 10−6 (cholesterol − 208)3+ + 6.35 × 10−6 (cholesterol − 243)3+ − 1.16 × 10−6 (cholesterol −
319)3+ + [female][−0.109age + 7.52 × 10−5 (age − 36)3+ + 0.00015(age − 48)3+ − 0.00045(age −
56)3+ + 0.000225(age − 68)3+ ] + age[−0.00028cholesterol + 2.68×10−9 (cholesterol − 160)3+ + 3.03×10−8 (cholesterol − 208)3+ − 4.99×10−8 (cholesterol − 243)3+ + 1.69×10−8 (cholesterol − 319)3+ ] + age′[0.00341cholesterol − 4.02×10−7 (cholesterol − 160)3+ + 9.71×10−7 (cholesterol − 208)3+ − 5.79×10−7 (cholesterol − 243)3+ + 8.79×10−9 (cholesterol − 319)3+ ] + age″[−0.029cholesterol + 3.04×10−6 (cholesterol − 160)3+ − 7.34×10−6 (cholesterol − 208)3+ + 4.36×10−6 (cholesterol − 243)3+ − 5.82×10−8 (cholesterol − 319)3+ ].

Figure 8.11: Linear spline surface for males, with knots for age at 46, 52, 59 and knots for cholesterol at 196, 224, and 259 (quartiles).
print ( anova ( f ) , caption = ’ Cubic spline surface ’ ,
size = ’ smaller ’)

Cubic spline surface

χ2 d.f. P
age (Factor+Higher Order Factors) 165.23 15 <0.0001
All Interactions 37.32 12 0.0002
Nonlinear (Factor+Higher Order Factors) 21.01 10 0.0210
sex (Factor+Higher Order Factors) 343.67 4 <0.0001
All Interactions 23.31 3 <0.0001
cholesterol (Factor+Higher Order Factors) 97.50 12 <0.0001
All Interactions 12.95 9 0.1649
Nonlinear (Factor+Higher Order Factors) 13.62 8 0.0923
age × sex (Factor+Higher Order Factors) 23.31 3 <0.0001
Nonlinear 13.37 2 0.0013
Nonlinear Interaction : f(A,B) vs. AB 13.37 2 0.0013
age × cholesterol (Factor+Higher Order Factors) 12.95 9 0.1649
Nonlinear 7.27 8 0.5078
Nonlinear Interaction : f(A,B) vs. AB 7.27 8 0.5078
f(A,B) vs. Af(B) + Bg(A) 5.41 4 0.2480
Nonlinear Interaction in age vs. Af(B) 6.44 6 0.3753
Nonlinear Interaction in cholesterol vs. Bg(A) 6.27 6 0.3931
TOTAL NONLINEAR 29.22 14 0.0097
TOTAL INTERACTION 37.32 12 0.0002
TOTAL NONLINEAR + INTERACTION 45.41 16 0.0001
TOTAL 450.88 19 <0.0001
# Figure 8.12:
bplot ( Predict (f , cholesterol , age , np =40) , perim = perim ,
lfun = wireframe , zlim = zl , adj.subtitle = FALSE )

ˆ Now restrict surface by excluding doubly nonlinear terms A


f ← lrm ( sigdz ∼ sex * rcs ( age ,4) + rcs ( cholesterol ,4) +
rcs ( age ,4) %ia% rcs ( cholesterol ,4) , data = acath )
print ( anova ( f ) , size = ’ smaller ’ ,
caption = ’ Singly nonlinear cubic spline surface ’)
[Figure plot: 3-D wireframe surface of log odds (−2 to 2) vs. age (years) and cholesterol (mg %)]
Figure 8.12: Restricted cubic spline surface in two variables, each with k = 4 knots

Singly nonlinear cubic spline surface

χ2 d.f. P
sex (Factor+Higher Order Factors) 343.42 4 <0.0001
All Interactions 24.05 3 <0.0001
age (Factor+Higher Order Factors) 169.35 11 <0.0001
All Interactions 34.80 8 <0.0001
Nonlinear (Factor+Higher Order Factors) 16.55 6 0.0111
cholesterol (Factor+Higher Order Factors) 93.62 8 <0.0001
All Interactions 10.83 5 0.0548
Nonlinear (Factor+Higher Order Factors) 10.87 4 0.0281
age × cholesterol (Factor+Higher Order Factors) 10.83 5 0.0548
Nonlinear 3.12 4 0.5372
Nonlinear Interaction : f(A,B) vs. AB 3.12 4 0.5372
Nonlinear Interaction in age vs. Af(B) 1.60 2 0.4496
Nonlinear Interaction in cholesterol vs. Bg(A) 1.64 2 0.4400
sex × age (Factor+Higher Order Factors) 24.05 3 <0.0001
Nonlinear 13.58 2 0.0011
Nonlinear Interaction : f(A,B) vs. AB 13.58 2 0.0011
TOTAL NONLINEAR 27.89 10 0.0019
TOTAL INTERACTION 34.80 8 <0.0001
TOTAL NONLINEAR + INTERACTION 45.45 12 <0.0001
TOTAL 453.10 15 <0.0001
# Figure 8.13:
bplot ( Predict (f , cholesterol , age , np =40) , perim = perim ,
lfun = wireframe , zlim = zl , adj.subtitle = FALSE )
ltx ( f )

X β̂ = −7.2 + 2.96[female] + 0.164age + 7.23×10−5 (age − 36)3+ − 0.000106(age − 48)3+ − 1.63×


10−5 (age − 56)3+ + 4.99×10−5 (age − 68)3+ + 0.0148cholesterol + 1.21×10−6 (cholesterol − 160)3+ −
5.5×10−6 (cholesterol − 208)3+ + 5.5×10−6 (cholesterol − 243)3+ − 1.21×10−6 (cholesterol − 319)3+ +
age[−0.00029cholesterol+9.28×10−9 (cholesterol−160)3+ +1.7×10−8 (cholesterol−208)3+ −4.43×
10−8 (cholesterol − 243)3+ + 1.79×10−8 (cholesterol − 319)3+ ] + cholesterol[2.3×10−7 (age − 36)3+ +
4.21×10−7 (age − 48)3+ − 1.31×10−6 (age − 56)3+ + 6.64×10−7 (age − 68)3+ ] + [female][−0.111age +
8.03×10−5 (age − 36)3+ + 0.000135(age − 48)3+ − 0.00044(age − 56)3+ + 0.000224(age − 68)3+ ].

ˆ Finally restrict the interaction to be a simple product B


f ← lrm ( sigdz ∼ rcs ( age ,4) * sex + rcs ( cholesterol ,4) +
age %ia% cholesterol , data = acath )
print ( anova ( f ) , caption = ’ Linear interaction surface ’ ,
size = ’ smaller ’)
[Figure plot: 3-D wireframe surface of log odds (−2 to 2) vs. age (years) and cholesterol (mg %)]
Figure 8.13: Restricted cubic spline fit with age × spline(cholesterol) and cholesterol × spline(age)

Linear interaction surface

χ2 d.f. P
age (Factor+Higher Order Factors) 167.83 7 <0.0001
All Interactions 31.03 4 <0.0001
Nonlinear (Factor+Higher Order Factors) 14.58 4 0.0057
sex (Factor+Higher Order Factors) 345.88 4 <0.0001
All Interactions 22.30 3 <0.0001
cholesterol (Factor+Higher Order Factors) 89.37 4 <0.0001
All Interactions 7.99 1 0.0047
Nonlinear 10.65 2 0.0049
age × cholesterol (Factor+Higher Order Factors) 7.99 1 0.0047
age × sex (Factor+Higher Order Factors) 22.30 3 <0.0001
Nonlinear 12.06 2 0.0024
Nonlinear Interaction : f(A,B) vs. AB 12.06 2 0.0024
TOTAL NONLINEAR 25.72 6 0.0003
TOTAL INTERACTION 31.03 4 <0.0001
TOTAL NONLINEAR + INTERACTION 43.59 8 <0.0001
TOTAL 452.75 11 <0.0001
# Figure 8.14:
bplot ( Predict (f , cholesterol , age , np =40) , perim = perim ,
lfun = wireframe , zlim = zl , adj.subtitle = FALSE )
f.linia ← f # save linear interaction fit for later
ltx ( f )

X β̂ = −7.36+0.182age−5.18×10−5 (age−36)3+ +8.45×10−5 (age−48)3+ −2.91×10−6 (age−56)3+ −


2.99×10−5 (age − 68)3+ + 2.8[female] + 0.0139cholesterol + 1.76×10−6 (cholesterol − 160)3+ − 4.88×
10−6 (cholesterol − 208)3+ + 3.45×10−6 (cholesterol − 243)3+ − 3.26×10−7 (cholesterol − 319)3+ −
0.00034 age × cholesterol + [female][−0.107age + 7.71×10−5 (age − 36)3+ + 0.000115(age − 48)3+ −
0.000398(age − 56)3+ + 0.000205(age − 68)3+ ].

The Wald test for age × cholesterol interaction yields χ² = 7.99 with 1 d.f., p = 0.005.
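As a cross-check on this Wald test, the product term could also be assessed with a likelihood ratio test by refitting without the interaction. A minimal sketch, where f.noia is a hypothetical refit that is not part of the original analysis:

f.noia ← lrm ( sigdz ∼ rcs ( age ,4) * sex + rcs ( cholesterol ,4) ,
               data = acath ) # hypothetical nested fit omitting the product term
lrtest ( f.noia , f ) # rms lrtest: LR chi-square for the 1 d.f. age x cholesterol term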

ˆ See how well this simple interaction model compares with


initial model using 2 dummies for age C

ˆ Request predictions to be made at mean age within tertiles


# Make estimates of cholesterol effects for mean age in
# tertiles corresponding to initial analysis
mean.age ←
with ( acath ,
as.vector ( tapply ( age , age.tertile , mean , na.rm = TRUE ) ) )
plot ( Predict (f , cholesterol , age = round ( mean.age ,2) ,
sex = " male " ) ,
adj.subtitle = FALSE , ylim = yl ) # 3 curves, Figure 8.15
[Figure plot: 3-D wireframe surface of log odds (−2 to 2) vs. age (years) and cholesterol (mg %)]
Figure 8.14: Spline fit with nonlinear effects of cholesterol and age and a simple product interaction

[Figure plot: log odds vs. cholesterol (100–400 mg %); three curves labeled with mean ages 41.74, 53.06, and 63.73]

Figure 8.15: Predictions from linear interaction model with mean age in tertiles indicated.

ˆ Using residuals for “duration of symptoms” example


f ← lrm ( tvdlm ∼ cad.dur , data = dz , x = TRUE , y = TRUE )
resid (f , " partial " , pl = " loess " , xlim = c (0 ,250) , ylim = c ( -3 ,3) )
scat1d ( dz $ cad.dur )
log.cad.dur ← log10 ( dz $ cad.dur + 1)
f ← lrm ( tvdlm ∼ log.cad.dur , data = dz , x = TRUE , y = TRUE )
resid (f , " partial " , pl = " loess " , ylim = c ( -3 ,3) )
scat1d ( log.cad.dur ) # Figure 8.16

[Figure plot: two panels of partial residuals with loess curves, vs. cad.dur (0–250) and log.cad.dur (0–2)]
Figure 8.16: Partial residuals for duration and log10 (duration+1). Data density shown at top of each plot.

ˆ Relative merits of strat., nonparametric, splines for checking


fit D

Method                             Choice      Assumes          Uses Ordering   Low                  Good Resolution
                                   Required    Additivity       of X            Variance             on X
Stratification                     Intervals
Smoother on X1,                    Bandwidth   x (not on X2)    x               x (if min. strat.)   x (X1)
  stratifying on X2
Smooth partial residual plot       Bandwidth   x                x               x                    x
Spline model for all Xs            Knots       x                x               x                    x

ˆ Hosmer-Lemeshow test is a commonly used test of goodness-of-fit of a binary logistic model   E

  Compares proportion of events with mean predicted probability within deciles of P̂

  – Arbitrary (number of groups, how to form groups)

  – Low power (too many d.f.)

  – Does not reveal the culprits

ˆ A new omnibus test based on SSE has more power and requires no grouping; still does not lead to corrective action.

ˆ Any omnibus test lacks power against specific alternatives such as nonlinearity or interaction
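A minimal sketch of the SSE-based omnibus test (the le Cessie–van Houwelingen unweighted sum of squares test), as implemented in rms for fits that store the design matrix and response (x=TRUE, y=TRUE):

f ← lrm ( tvdlm ∼ log.cad.dur , data = dz , x = TRUE , y = TRUE )
resid (f , ' gof ') # global goodness-of-fit test based on SSE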

8.6

Collinearity

8.7

Overly Influential Observations

8.8

Quantifying Predictive Ability


F

ˆ Generalized Nagelkerke R² : equals ordinary R² in the normal case:

      R²_N = [1 − exp(−LR/n)] / [1 − exp(−L⁰/n)]

ˆ 4 versions of Maddala–Cox–Snell R² : hbiostat.org/bib/r2.html

  – Perhaps best: R²_{p,m} = 1 − exp(−(LR − p)/m)

  – m is the effective sample size, based on the approximate variance of a log odds ratio in a proportional odds ordinal logistic model

  – If Y has k distinct values with proportions p₁, p₂, …, p_k, then m = n × (1 − Σ_{i=1}^{k} p_i³)

  – For binary Y with proportion of Y = 1 equal to p, m = n × 3p(1 − p)

ˆ With perfect prediction in the case where there are 50 X = 0 and 50 X = 1 with Y = X: R²_N = 1, R²_{n,0} = 0.75, R²_{m,0} = 0.84

ˆ Brier score (calibration + discrimination):

      B = (1/n) Σ_{i=1}^{n} (P̂_i − Y_i)²

ˆ c = “concordance probability” = ROC area

  – Related to the Wilcoxon–Mann–Whitney statistic and Somers’ Dxy : Dxy = 2(c − 0.5)

  – Good pure index of predictive discrimination for a single model

  – Not useful for comparing two models [46, 150]ᵈ

ˆ “Coefficient of discrimination” [190]: average P̂ when Y = 1 minus average P̂ when Y = 0   G

  – Has many advantages. Tjur shows how it ties in with sum of squares–based R² measures.

ˆ “Percent classified correctly” has lots of problems   H

  – improper scoring rule; optimizing it will lead to an incorrect model

  – arbitrary, insensitive, uses a strange loss (utility) function

ᵈ But see [148].
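A minimal sketch computing several of these indexes directly from a fitted binary logistic model; it assumes an lrm fit f made with y=TRUE so that the 0/1 response is stored in the fit:

phat ← predict (f , type = ' fitted ') # predicted Prob(Y=1)
y ← f $ y # requires y=TRUE in the lrm call
mean (( phat - y )^2) # Brier score B
somers2 ( phat , y ) # Hmisc somers2: C (ROC area) and Dxy
mean ( phat [ y == 1]) - mean ( phat [ y == 0]) # Tjur's coefficient of discrimination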

8.9

Validating the Fitted Model


I

ˆ Possible indexes [9]

  – Accuracy of P̂ : calibration
    Plot 1 / (1 + e^{−X_new β̂_old}) against estimated prob. that Y = 1 on new data

  – Discrimination: C or Dxy

  – R² or B

ˆ Use bootstrap to estimate calibration equation   J

      P_c = Prob{Y = 1 | X β̂} = [1 + exp(−(γ₀ + γ₁ X β̂))]⁻¹

      E_max(a, b) = max over a ≤ P̂ ≤ b of |P̂ − P̂_c|
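rms implements this bootstrap estimation of the calibration curve in calibrate; a minimal sketch, assuming a binary logistic fit stored with x=TRUE, y=TRUE:

f ← lrm ( response ∼ sex + age , data = sex.age.response , x = TRUE , y = TRUE )
cal ← calibrate (f , B =150) # 150 bootstrap repetitions
plot ( cal ) # apparent and overfitting-corrected calibration curves

The validate output below reports the corresponding calibration intercept and slope.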

ˆ Bootstrap validation of age-sex-response data, 150 samples

ˆ 2 predictors forced into every model


d ← sex.age.response
dd ← datadist ( d ) ; options ( datadist = ’ dd ’)
f ← lrm ( response ∼ sex + age , data =d , x = TRUE , y = TRUE )
set.seed (3) # for reproducibility
v1 ← validate (f , B =150)

latex ( v1 ,
caption = ’ Bootstrap Validation , 2 Predictors Without Stepdown ’ ,
insert.bottom = ’ \\ label { pg : l r m - s e x - a g e - r e s p o n s e - b o o t } ’ ,
digits =2 , size = ’ Ssize ’ , file = ’ ’)

Bootstrap Validation, 2 Predictors Without Stepdown


Index Original Training Test Optimism Corrected n
Sample Sample Sample Index
Dxy 0.70 0.70 0.67 0.03 0.66 150
2
R 0.45 0.48 0.43 0.05 0.40 150
Intercept 0.00 0.00 0.04 −0.04 0.04 150
Slope 1.00 1.00 0.92 0.08 0.92 150
Emax 0.00 0.00 0.03 0.03 0.03 150
D 0.39 0.43 0.36 0.07 0.32 150
U −0.05 −0.05 0.02 −0.07 0.02 150
Q 0.44 0.48 0.34 0.14 0.30 150
B 0.16 0.15 0.18 −0.03 0.19 150
g 2.10 2.38 1.97 0.41 1.70 150
gp 0.35 0.35 0.34 0.01 0.34 150

ˆ Allow for step-down at each re-sample K

ˆ Use individual tests at α = 0.10


ˆ Both age and sex selected in 137 of 150, neither in 3 samples
v2 ← validate (f , B =150 , bw = TRUE ,
rule = ’p ’ , sls = .1 , type = ’ individual ’)

latex ( v2 ,
caption = ’ Bootstrap Validation , 2 Predictors with Stepdown ’ ,
digits =2 , B =15 , file = ’ ’ , size = ’ Ssize ’)

Bootstrap Validation, 2 Predictors with Stepdown


Index Original Training Test Optimism Corrected n
Sample Sample Sample Index
Dxy 0.70 0.71 0.65 0.07 0.63 150
2
R 0.45 0.50 0.41 0.09 0.36 150
Intercept 0.00 0.00 0.01 −0.01 0.01 150
Slope 1.00 1.00 0.83 0.17 0.83 150
Emax 0.00 0.00 0.04 0.04 0.04 150
D 0.39 0.46 0.35 0.11 0.28 150
U −0.05 −0.05 0.05 −0.10 0.05 150
Q 0.44 0.51 0.29 0.21 0.22 150
B 0.16 0.14 0.18 −0.04 0.20 150
g 2.10 2.60 1.90 0.70 1.40 150
gp 0.35 0.35 0.33 0.02 0.33 150

Factors Retained in Backwards Elimination


First 15 Resamples
sex age
• •
• •
• •

• •
• •
• •
• •
• •
• •
• •
• •
• •
• •

Frequencies of Numbers of Factors Retained


0 1 2
1 11 138

ˆ Try adding 5 noise candidate variables L


set.seed (133)
n ← nrow ( d )
x1 ← runif ( n )
x2 ← runif ( n )
x3 ← runif ( n )
x4 ← runif ( n )
x5 ← runif ( n )
f ← lrm ( response ∼ age + sex + x1 + x2 + x3 + x4 + x5 ,
data =d , x = TRUE , y = TRUE )
v3 ← validate (f , B =150 , bw = TRUE ,
rule = ’p ’ , sls = .1 , type = ’ individual ’)

k ← attr ( v3 , ’ kept ’)
# Compute number of x1-x5 selected
nx ← apply ( k [ ,3:7] , 1 , sum )
# Get selections of age and sex
v ← colnames ( k )
as ← apply ( k [ ,1:2] , 1 ,
function ( x ) paste ( v [1:2][ x ] , collapse = ’ , ’) )
table ( paste ( as , ’ ’ , nx , ’ Xs ’) )

        0 Xs          1 Xs   age,sex 0 Xs   age,sex 1 Xs   age,sex 2 Xs
          50             4             30             26              8
age,sex 3 Xs      sex 0 Xs       sex 1 Xs       sex 2 Xs
           5             9              3              1

latex ( v3 , #
caption = ’ Bootstrap Validation with 5 Noise Variables and Stepdown ’ ,
digits =2 , B =15 , size = ’ Ssize ’ , file = ’ ’)

Bootstrap Validation with 5 Noise Variables and Stepdown


Index Original Training Test Optimism Corrected n
Sample Sample Sample Index
Dxy 0.70 0.47 0.38 0.09 0.61 136
2
R 0.45 0.34 0.23 0.11 0.34 136
Intercept 0.00 0.00 0.04 −0.04 0.04 136
Slope 1.00 1.00 0.77 0.23 0.77 136
Emax 0.00 0.00 0.07 0.07 0.07 136
D 0.39 0.31 0.18 0.13 0.26 136
U −0.05 −0.05 0.06 −0.11 0.06 136
Q 0.44 0.36 0.12 0.24 0.20 136
B 0.16 0.18 0.21 −0.04 0.20 136
g 2.10 1.81 1.06 0.75 1.35 136
gp 0.35 0.24 0.19 0.04 0.31 136

Factors Retained in Backwards Elimination


First 15 Resamples
age sex x1 x2 x3 x4 x5
• •
• •


• • • •
• • •
• •
• • • • •

• • •

• • •

Frequencies of Numbers of Factors Retained


0 1 2 3 4 5
50 13 33 27 8 5

ˆ Repeat but force age and sex to be in all models M


v4 ← validate (f , B =150 , bw = TRUE , rule = ’p ’ , sls = .1 ,
type = ’ individual ’ , force =1:2)

ap4 ← round ( v4 [ , ’ index.orig ’] , 2)


bc4 ← round ( v4 [ , ’ index.corrected ’] , 2)

latex ( v4 ,
caption = ’ Bootstrap Validation with 5 Noise Variables and Stepdown ,
Forced Inclusion of age and sex ’ ,
digits =2 , B =15 , size = ’ Ssize ’)

Bootstrap Validation with 5 Noise Variables and Stepdown, Forced Inclusion of age and sex
Index Original Training Test Optimism Corrected n
Sample Sample Sample Index
Dxy 0.70 0.76 0.66 0.10 0.60 130
R2 0.45 0.54 0.41 0.12 0.33 130
Intercept 0.00 0.00 0.06 −0.06 0.06 130
Slope 1.00 1.00 0.76 0.24 0.76 130
Emax 0.00 0.00 0.07 0.07 0.07 130
D 0.39 0.50 0.35 0.15 0.24 130
U −0.05 −0.05 0.08 −0.13 0.08 130
Q 0.44 0.55 0.27 0.28 0.16 130
B 0.16 0.14 0.18 −0.04 0.21 130
g 2.10 2.75 1.89 0.86 1.25 130
gp 0.35 0.37 0.33 0.04 0.31 130

Factors Retained in Backwards Elimination


First 15 Resamples
age sex x1 x2 x3 x4 x5
• •
• •
• • •
• •
• • •
• •
• •
• •
• •
• •
• •
• •
• • •
• • •
• •

Frequencies of Numbers of Factors Retained


2 3 4 5 6
88 29 10 1 2

8.10

Describing the Fitted Model


s ← summary ( f.linia )
print (s , size = ’ Ssize ’)

Low High ∆ Effect S.E. Lower 0.95 Upper 0.95


age 46 59 13 0.90629 0.18381 0.546030 1.26650
Odds Ratio 46 59 13 2.47510 1.726400 3.54860
cholesterol 196 259 63 0.75479 0.13642 0.487410 1.02220
Odds Ratio 196 259 63 2.12720 1.628100 2.77920
sex — female:male 1 2 -2.42970 0.14839 -2.720600 -2.13890
Odds Ratio 1 2 0.08806 0.065837 0.11778
plot ( s ) # Figure 8.17

[Figure plot: odds ratios on a 0.10–4.00 scale for age 59:46, cholesterol 259:196, and sex female:male; adjusted to age=52, sex=male, cholesterol=224.5]

Figure 8.17: Odds ratios and confidence bars, using quartiles of age and cholesterol for assessing their effects on the odds of coronary
disease.

[Figure plot: probability of ABM vs. AVM (0–1) as a function of age in years, with observed proportions plotted as points]

Figure 8.18: Linear spline fit for probability of bacterial vs. viral meningitis as a function of age at onset [176]. Points are simple
proportions by age quantile groups.

Figure 8.19: (A) Relationship between myocardium at risk and ventricular fibrillation, based on the individual best fit equations
for animals anesthetized with pentobarbital and α-chloralose. The amount of myocardium at risk at which 0.5 of the animals are
expected to fibrillate (MAR50 ) is shown for each anesthetic group. (B) Relationship between myocardium at risk and ventricular
fibrillation, based on equations derived from the single slope estimate. Note that the MAR50 describes the overall relationship
between myocardium at risk and outcome when either the individual best fit slope or the single slope estimate is used. The shift of
the curve to the right during α-chloralose anesthesia is well described by the shift in MAR50 . Test for interaction had P=0.10 [211].
Reprinted by permission, NRC Research Press.

Figure 8.20: A nomogram for estimating the likelihood of significant coronary artery disease (CAD) in women. ECG = electrocar-
diographic; MI = myocardial infarction [156]. Reprinted from American Journal of Medicine, Vol 75, Pryor DB et al., “Estimating
the likelihood of significant coronary artery disease”, p. 778, Copyright 1983, with permission from Excerpta Medica, Inc.

[Figure plot: nomogram with scales for age, month of presentation, glucose ratio, and total PMN count, reading lines A and B, and a central probability-of-ABM-vs-AVM scale]

Figure 8.21: Nomogram for estimating probability of bacterial (ABM) vs. viral (AVM) meningitis. Step 1, place ruler on reading
lines for patient’s age and month of presentation and mark intersection with line A; step 2, place ruler on values for glucose ratio
and total polymorphonuclear leukocyte (PMN) count in cerebrospinal fluid and mark intersection with line B; step 3, use ruler to join
marks on lines A and B, then read off the probability of ABM vs. AVM [176].

# Draw a nomogram that shows examples of confidence intervals


nom ← nomogram ( f.linia , cholesterol = seq (150 , 400 , by =50) ,
                 interact = list ( age = seq (30 , 70 , by =10) ) ,
                 lp.at = seq ( -2 , 3.5 , by = .5 ) ,
                 conf.int = TRUE , conf.lp = " all " ,
                 fun = function ( x ) 1 / (1 + exp ( -x ) ) , # or plogis
                 funlabel = " Probability of CAD " ,
                 fun.at = c ( seq ( .1 , .9 , by = .1 ) , .95 , .99 )
                 ) # Figure 8.22
plot ( nom , col.grid = gray ( c (0.8 , 0.95) ) ,
       varname.label = FALSE , ia.space =1 , xfrac = .46 , lmgp = .2 )

[Figure plot: nomogram with point scales (0–100) for cholesterol at each combination of age (30–70) and sex, plus total points, linear predictor (−2 to 3.5), and probability-of-CAD axes]

Figure 8.22: Nomogram relating age, sex, and cholesterol to the log odds and to the probability of significant coronary artery disease.
Select one axis corresponding to sex and to age ∈ {30, 40, 50, 60, 70}. There was linear interaction between age and sex and between
age and cholesterol. 0.70 and 0.90 confidence intervals are shown (0.90 in gray). Note that for the “Linear Predictor” scale there
are various lengths of confidence intervals near the same value of X β̂, demonstrating that the standard error of X β̂ depends on the
individual X values. Also note that confidence intervals corresponding to smaller patient groups (e.g., females) are wider.

8.11

Bayesian Logistic Model Example

Re-analyze data in Section 8.1.3 using the R rmsb package. See


hbiostat.org/doc/rms/lrm-brms.pdf for a parallel analysis using
the brms package.
The rmsb package relies on the Stan Bayesian modeling sys-
tem [188, 32].
The frequentist model was fitted using lrm and the Bayesian
model is fitted using the rmsb blrm function. For the Bayesian
model, the intercept prior is non-informative (iprior=1). For
blrm a complication arises in specifying priors for regression co-
efficients. This is due to the QR decomposition being used on
the design matrix to remove MCMC sampling problems with
collinearities. By default, priors correspond to the re-mixed,
scaled, and centered covariates. To keep predictors in raw form
(and risk some posterior sampling problems), use the keepsep
argument as done below on both predictors. Then Gaussian pri-
ors are used. The age and sex parameters were given mean zero
priors with standard deviations computed to achieve specified
tail prior probabilities. Four MCMC chains with 5000 iterations
were used with a warm-up of 2500 iterations each, resulting in
10000 retained draws from the posterior distribution.
Before fitting the Bayesian model with two skeptical priors, fit
the model with a flat prior for the intercept and Gaussian priors

with SD=100 for the two slopes. See below that the posterior
modes from this fit are in close agreement with the maximum
likelihood estimates from the frequentist model fit.
dd ← datadist ( sex.age.response )
options ( datadist = ’ dd ’)
require ( rmsb )

# Frequentist model
flrm ← lrm ( response ∼ sex + age , data = sex.age.response )

# Bayesian model
# Distribute chains across all available cpu cores :
options ( mc.cores = parallel :: detectCores () )

# Fit a model with all flat priors


set.seed (8)
ff ← blrm ( response ∼ sex + age , data = sex.age.response , iter =5000)
# Elapsed time 2 .2s
latexVerbatim ( round ( rbind ( MLE = coef ( flrm ) , Mode = coef ( ff , ’ mode ’) ,
Mean = coef ( ff ) , Median = coef ( ff , ’ median ’) ) , 3) ,
file = ’ ’)

Intercept sex=male age


MLE -9.843 3.490 0.158
Mode -9.827 3.485 0.158
Mean -11.345 3.985 0.182
Median -10.976 3.876 0.177
# Set priors
# Solve for SD such that sex effect has only a 0.025 chance of
# being above 5 ( or being below -5 )

s1 ← 5 / qnorm (0.975)

# Solve for SD such that 10-year age effect has only 0.025 chance
# of being above 20

s2 ← (20 / qnorm (0.975) ) / 10 # divide by 10 since ratio is on the 10-year scale

# Full model
set.seed (11)
f ← blrm ( response ∼ sex + age , data = sex.age.response ,
priorsd = c ( s1 , s2 ) , iprior =1 , keepsep = ’ age | sex ’ , iter =5000)
# Elapsed time 1 .7s
f

Bayesian Logistic Model

Non-informative Priors for Intercepts

blrm(formula = response ~ sex + age, keepsep = "age|sex", data = sex.age.response,



priorsd = c(s1, s2), iprior = 1, iter = 5000)

             Mixed Calibration/          Discrimination             Rank Discrim.
             Discrimination Indexes      Indexes                    Indexes
Obs    40    LOO log L -22.55±3.38       g 2.03 [0.922, 3.197]      C 0.833 [0.809, 0.851]
0      20    LOO IC 45.1±6.76            gp 0.331 [0.227, 0.419]    Dxy 0.666 [0.618, 0.702]
1      20    Effective p 2.92±0.63       EV 0.34 [0.152, 0.534]
Draws 10000  B 0.175 [0.162, 0.197]      v 3.481 [0.599, 7.799]
Chains 4                                 vp 0.084 [0.039, 0.135]
p      2

Mode β̂ Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


Intercept -8.4221 -9.4184 -9.1476 3.3220 -16.1667 -3.3363 0.0001 0.83
sex=male 2.9162 3.2156 3.1519 1.0323 1.2500 5.2814 0.9999 1.14
age 0.1362 0.1527 0.1489 0.0569 0.0406 0.2616 0.9994 1.20

The following parameters remained separate (where not orthogonalized) during model fitting so that
prior distributions could be focused explicitly on them: sex=male, age

MCMC sampling diagnostics are below. No apparent problems.


# MCMC sampling diagnostics

stanDx ( f )

Iterations : 5000 on each of 4 chains , with 10000 posterior distribution samples saved

For each parameter , n_eff is a crude measure of effective sample size


and Rhat is the potential scale reduction factor on split chains
( at convergence , Rhat =1)

n_eff Rhat
Intercept 9224 1
sex = male 6575 1
age 6686 1

stanDxplot ( f )

[Figure plot: trace plots of post burn-in parameter values for age and sex=male across the 4 MCMC chains]

The model summaries for the frequentist and Bayesian models


are shown below, with posterior means computed as Bayesian
“point estimates.” The parameter estimates are similar for the
two approaches. The frequentist 0.95 confidence interval for
the age parameter is 0.037 - 0.279 while the Bayesian 0.95
credible interval is 0.047 - 0.263. Similarly, the 0.95 confidence
interval for sex is 1.14 - 5.84 and the corresponding Bayesian
0.95 credible interval is 1.23 - 5.19. The results made sense
in view of the use of skeptical priors when the sample size is

small.
# Bayesian model output

summary (f , age =20:21) # posterior means

Low High ∆ Effect S.E. Lower 0.95 Upper 0.95


age 20 21 1 0.15266 0.056885 0.040614 0.26156
Odds Ratio 20 21 1 1.16490 1.041400 1.29900
sex — male:female 1 2 3.21560 1.032300 1.250000 5.28140
Odds Ratio 1 2 24.91800 3.490400 196.65000
summary (f , age =20:21 , posterior.summary = ’ median ’) # post. medians

Low High ∆ Effect S.E. Lower 0.95 Upper 0.95


age 20 21 1 0.14893 0.056885 0.040614 0.26156
Odds Ratio 20 21 1 1.16060 1.041400 1.29900
sex — male:female 1 2 3.15190 1.032300 1.250000 5.28140
Odds Ratio 1 2 23.38100 3.490400 196.65000
# Note that mean vs median doesn ’ t affect HPD intervals , only pt estimates

The figure shows the posterior draws for the age and sex pa-
rameters as well as the trace of the 4 MCMC chains for each
parameter and the bivariate posterior distribution. The poste-
rior distributions of each parameter are roughly round shaped
and the overlap between chains in the trace plots indicates good
convergence. The bivariate density plot indicates moderate cor-
relation between the age and sex parameters.
Create a 0.95 bivariate credible interval for the joint distribution
of age and sex. Any number of intervals could be drawn, as
any region that covers 0.95 of the posterior density can
accurately be called a 0.95 credible interval. Commonly used:
maximum a-posteriori probability (MAP) interval, which seeks
to find the region that holds 0.95 of the density, while also
having the smallest area. In a 1-dimensional setting, this would

translate into having the shortest interval length, and therefore


the most precise estimate. The figure below shows the point
estimate as well as the corresponding MAP interval.
# display posterior densities for age and sex parameters
plot ( f )

[Figure plot: posterior densities for the sex=male and age coefficients, with 0.95 HPD intervals and mean/median/mode marked]

plot (f , bivar = TRUE ) # MAP region


[Figure plot: bivariate MAP (highest posterior density) region for the sex=male and age coefficients; Spearman ρ = 0.66]

plot (f , bivar = TRUE , bivarmethod = ’ kernel ’)

[Figure plot: kernel density contours (probabilities 0.01–0.95) for the joint posterior of the sex=male and age coefficients; Spearman ρ = 0.66]

In the above figure, the point estimate does not appear quite at
the point of highest density. This is because blrm estimates (by

default) the posterior mean, rather than the posterior mode.


You have the full posterior density, so you can calculate whatever you’d like if you don’t want the mean.
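For example, alternative point estimates and interval summaries can be computed directly from the stored draws; a minimal sketch using the matrix of posterior draws in the blrm fit f:

draws ← f $ draws # 10000 draws x 3 parameters
apply ( draws , 2 , median ) # posterior medians
apply ( draws , 2 , quantile , probs = c ( .025 , .975 ) ) # equal-tail 0.95 intervals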
A plot of the partial effects on the probability scale from the
Bayesian model reveals the same pattern as Figure 8.3.
# Partial effects plot

ggplot ( Predict (f , age , sex , fun = plogis , funint = FALSE ) , ylab = ’P ( Y =1) ’)

[Figure plot: partial effect of age on P(Y=1) by sex from the Bayesian fit]

# Frequentist
# variance-covariance for sex and age parameters
v ← vcov ( flrm ) [2:3 ,2:3]

# Sampling based parameter estimate correlation coefficient


f_cc ← v [1 ,2] / sqrt ( v [1 ,1] * v [2 ,2])

# Bayesian
# Linear correlation between params from posterior
draws ← f $ draws [ , c ( ’ sex = male ’ , ’ age ’) ]
b_cc ← cor ( draws ) [1 ,2]

Using the code in the block above, we find that the frequentist sampling-based parameter estimate correlation coefficient is 0.75, while the linear correlation between the posterior draws for the age and sex parameters is 0.67. Both models indicate

a comparable amount of correlation between the parameters, though in different senses (sampling data vs. sampling the posterior distribution of parameters).
P ← PostF (f , pr = TRUE )

Original Name Short Name


Intercept a1
sex = male b1
age b2

( p1 ← P ( b1 > 0) ) # post prob ( sex has positive association with Y)

[1] 0.9999

( p2 ← P ( b2 > 0) )

[1] 0.9994

( p3 ← P ( b1 > 0 & b2 > 0) )

[1] 0.9993

( p4 ← P ( b1 > 0 | b2 > 0) )

[1] 1

The posterior probability that sex has a positive relationship with hospital death is estimated as Prob(βsex > 0) = 0.9999, while the posterior probability that age has a positive relationship with hospital death is Prob(βage > 0) = 0.9994, and the probability of both events is Prob(βsex > 0 ∩ βage > 0) = 0.9993. Even using somewhat skeptical priors centered around 0, male gender and increasing age are highly likely to be associated with the response.
As seen above, the MCMC algorithm used by blrm provides
us with samples from the joint posterior distribution of βage
and βsex. Unlike frequentist intervals which require the log-
likelihood to be approximately quadratic in form, there are no

such restrictions placed on the posterior distribution, as it will


always be proportional to the product of the likelihood density
and the prior, regardless of the likelihood function that is used.
In this specific example, we notice that the bivariate density is
somewhat skewed – a characteristic that would likely lead to
unequal tail coverage probabilities if a symmetric confidence
interval is used.
ggplot ( as.data.frame ( draws ) , aes ( x = ‘ sex = male ‘ , y = age ) ) +
geom_hex () +
theme ( legend.position = " none " )

[Figure plot: hexagonal-binning plot of the joint posterior draws, age coefficient vs. sex=male coefficient]
Chapter 9

Logistic Model Case Study: Survival of Titanic Passengers

Data source: The Titanic Passenger List edited by Michael A. Findlay, originally
published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens
Ltd, and expanded with the help of the Internet community. The original html files were
obtained from Philip Hind (1999) (http://atschool.eduweb.co.uk/phind). The dataset
was compiled and interpreted by Thomas Cason. It is available in R and spreadsheet
formats from hbiostat.org/data under the name titanic3.


9.1

Descriptive Statistics
require ( rms )

options ( prType = ’ latex ’) # for print , summary , anova


getHdata ( titanic3 ) # get dataset from web site
# List of names of variables to analyze
v ← c ( ’ pclass ’ , ’ survived ’ , ’ age ’ , ’ sex ’ , ’ sibsp ’ , ’ parch ’)
t3 ← titanic3 [ , v ]
units ( t3 $ age ) ← ’ years ’
latex ( describe ( t3 ) , file = ’ ’)

t3
6 Variables 1309 Observations
pclass
n missing distinct
1309 0 3

Value 1st 2nd 3rd


Frequency 323 277 709
Proportion 0.247 0.212 0.542
survived : Survived
n missing distinct Info Sum Mean Gmd
1309 0 2 0.708 500 0.382 0.4725

age : Age [years]


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
1046 263 98 0.999 29.88 16.06 5 14 21 28 39 50 57

lowest : 0.1667 0.3333 0.4167 0.6667 0.7500, highest: 70.5000 71.0000 74.0000 76.0000 80.0000
sex
n missing distinct
1309 0 2

Value female male


Frequency 466 843
Proportion 0.356 0.644
sibsp : Number of Siblings/Spouses Aboard
n missing distinct Info Mean Gmd
1309 0 7 0.67 0.4989 0.777

lowest : 0 1 2 3 4, highest: 2 3 4 5 8
Value 0 1 2 3 4 5 8
Frequency 891 319 42 20 22 6 9
Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007
parch : Number of Parents/Children Aboard
n missing distinct Info Mean Gmd
1309 0 8 0.549 0.385 0.6375

lowest : 0 1 2 3 4, highest: 3 4 5 6 9
Value 0 1 2 3 4 5 6 9
Frequency 1002 170 113 8 6 6 2 2
Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002
dd ← datadist ( t3 )
# describe distributions of variables to rms
options ( datadist = ’ dd ’)
s ← summary ( survived ∼ age + sex + pclass +
cut2 ( sibsp ,0:3) + cut2 ( parch ,0:3) , data = t3 )
plot (s , main = ’ ’ , subtitles = FALSE ) # F i g u r e 9.1

[Figure plot: dot chart of proportion surviving (0.2–0.7) by categories of age, sex, pclass, sibsp, and parch, with group sizes N; overall N=1309]

Figure 9.1: Univariable summaries of Titanic survival

Show 4-way relationships after collapsing levels. Suppress esti-


mates based on < 25 passengers. A
tn ← transform ( t3 ,
agec = ifelse ( age < 21 , ’ child ’ , ’ adult ’) ,
sibsp = ifelse ( sibsp == 0 , ’ no sib / sp ’ , ’ sib / sp ’) ,
parch = ifelse ( parch == 0 , ’ no par / child ’ , ’ par / child ’) )

g ← function ( y ) if ( length ( y ) < 25) NA else mean ( y )


s ← with ( tn , summarize ( survived ,
llist ( agec , sex , pclass , sibsp , parch ) , g ) )
# llist , summarize in Hmisc package
# Figure 9.2 :
ggplot ( subset (s , agec ! = ’ NA ’) ,
aes ( x = survived , y = pclass , shape = sex ) ) +
geom_point () + facet_grid ( agec ∼ sibsp * parch ) +
xlab ( ’ Proportion Surviving ’) + ylab ( ’ Passenger Class ’) +
scale_x_continuous ( breaks = c (0 , .5 , 1) )

[Figure plot: proportion surviving by passenger class and sex, paneled by adult/child and by sibsp and parch groupings]

Figure 9.2: Multi-way summary of Titanic survival



9.2

Exploring Trends with Nonparametric Regression
# Figure 9.3
b ← scale_size_discrete ( range = c ( .1 , .85 ) )

yl ← ylab ( NULL )
p1 ← ggplot ( t3 , aes ( x = age , y = survived ) ) +
histSpikeg ( survived ∼ age , lowess = TRUE , data = t3 ) +
ylim (0 ,1) + yl
p2 ← ggplot ( t3 , aes ( x = age , y = survived , color = sex ) ) +
histSpikeg ( survived ∼ age + sex , lowess = TRUE ,
data = t3 ) + ylim (0 ,1) + yl
p3 ← ggplot ( t3 , aes ( x = age , y = survived , size = pclass ) ) +
histSpikeg ( survived ∼ age + pclass , lowess = TRUE ,
data = t3 ) + b + ylim (0 ,1) + yl
p4 ← ggplot ( t3 , aes ( x = age , y = survived , color = sex ,
size = pclass ) ) +
histSpikeg ( survived ∼ age + sex + pclass ,
lowess = TRUE , data = t3 ) +
b + ylim (0 ,1) + yl
gridExtra :: grid.arrange ( p1 , p2 , p3 , p4 , ncol =2) # combine 4

# F i g u r e 9.4
top ← theme ( legend.position = ’ top ’)
p1 ← ggplot ( t3 , aes ( x = age , y = survived , color = cut2 ( sibsp ,
0:2) ) ) + stat_plsmo () + b + ylim (0 ,1) + yl + top +
scale_color_discrete ( name = ' siblings / spouses ')
p2 ← ggplot ( t3 , aes ( x = age , y = survived , color = cut2 ( parch ,
0:2) ) ) + stat_plsmo () + b + ylim (0 ,1) + yl + top +
scale_color_discrete ( name = ' parents / children ')
gridExtra :: grid.arrange ( p1 , p2 , ncol =2)

[Figure plot: four panels of loess estimates of P(survived) vs. age — unstratified, by sex, by pclass, and by sex and pclass]

Figure 9.3: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic,
with tick marks depicting the age distribution. The top left panel shows unstratified estimates of the probability of survival. Other
panels show nonparametric estimates by various stratifications.

[Figure plot: loess estimates of P(survived) vs. age, stratified by number of siblings/spouses (left) and parents/children (right)]

Figure 9.4: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the
number of parents or children of the passenger on board (right panel).

9.3

Binary Logistic Model with Casewise Deletion of Missing Values

First fit a model that is saturated with respect to age, sex,


pclass. Insufficient variation in sibsp, parch to fit complex
interactions or nonlinearities.
f1 ← lrm ( survived ∼ sex * pclass * rcs ( age ,5) +
rcs ( age ,5) * ( sibsp + parch ) , data = t3 ) # Table 9.1
print ( f1 , coefs = FALSE )

Logistic Regression Model

lrm(formula = survived ~ sex * pclass * rcs(age, 5) + rcs(age,


5) * (sibsp + parch), data = t3)

Frequencies of Missing Values Due to Each Variable

survived sex pclass age sibsp parch


0 0 0 263 0 0

            Model Likelihood            Discrimination           Rank Discrim.
            Ratio Test                  Indexes                  Indexes
Obs  1046   LR χ2 573.41                R2 0.569                 C 0.883
0     619   d.f. 39                     R2_{39,1046} 0.400       Dxy 0.766
1     427   Pr(> χ2) <0.0001            R2_{39,758.1} 0.506      γ 0.767
            max |∂ log L/∂β| 0.004      Brier 0.127              τa 0.370
print ( anova ( f1 ) , table.env = TRUE , label = ’ titanic-anova3 ’ , size = ’ small ’)

3-way interactions, parch clearly insignificant, so drop


f ← lrm ( survived ∼ ( sex + pclass + rcs ( age ,5) ) ^2 +
rcs ( age ,5) * sibsp , data = t3 )
print ( f )

Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age, 5))^2 + rcs(age,


5) * sibsp, data = t3)

Table 9.1: Wald Statistics for survived

χ2 d.f. P
sex (Factor+Higher Order Factors) 187.15 15 <0.0001
All Interactions 59.74 14 <0.0001
pclass (Factor+Higher Order Factors) 100.10 20 <0.0001
All Interactions 46.51 18 0.0003
age (Factor+Higher Order Factors) 56.20 32 0.0052
All Interactions 34.57 28 0.1826
Nonlinear (Factor+Higher Order Factors) 28.66 24 0.2331
sibsp (Factor+Higher Order Factors) 19.67 5 0.0014
All Interactions 12.13 4 0.0164
parch (Factor+Higher Order Factors) 3.51 5 0.6217
All Interactions 3.51 4 0.4761
sex × pclass (Factor+Higher Order Factors) 42.43 10 <0.0001
sex × age (Factor+Higher Order Factors) 15.89 12 0.1962
Nonlinear (Factor+Higher Order Factors) 14.47 9 0.1066
Nonlinear Interaction : f(A,B) vs. AB 4.17 3 0.2441
pclass × age (Factor+Higher Order Factors) 13.47 16 0.6385
Nonlinear (Factor+Higher Order Factors) 12.92 12 0.3749
Nonlinear Interaction : f(A,B) vs. AB 6.88 6 0.3324
age × sibsp (Factor+Higher Order Factors) 12.13 4 0.0164
Nonlinear 1.76 3 0.6235
Nonlinear Interaction : f(A,B) vs. AB 1.76 3 0.6235
age × parch (Factor+Higher Order Factors) 3.51 4 0.4761
Nonlinear 1.80 3 0.6147
Nonlinear Interaction : f(A,B) vs. AB 1.80 3 0.6147
sex × pclass × age (Factor+Higher Order Factors) 8.34 8 0.4006
Nonlinear 7.74 6 0.2581
TOTAL NONLINEAR 28.66 24 0.2331
TOTAL INTERACTION 75.61 30 <0.0001
TOTAL NONLINEAR + INTERACTION 79.49 33 <0.0001
TOTAL 241.93 39 <0.0001

Frequencies of Missing Values Due to Each Variable

survived sex pclass age sibsp


0 0 0 263 0

            Model Likelihood            Discrimination           Rank Discrim.
            Ratio Test                  Indexes                  Indexes
Obs  1046   LR χ2 553.87                R2 0.555                 C 0.878
0     619   d.f. 26                     R2_{26,1046} 0.396       Dxy 0.756
1     427   Pr(> χ2) <0.0001            R2_{26,758.1} 0.502      γ 0.758
            max |∂ log L/∂β| 6×10−6     Brier 0.130              τa 0.366

β̂ S.E. Wald Z Pr(> |Z|)


Intercept 3.3075 1.8427 1.79 0.0727
sex=male -1.1478 1.0878 -1.06 0.2914
pclass=2nd 6.7309 3.9617 1.70 0.0893
pclass=3rd -1.6437 1.8299 -0.90 0.3691
age 0.0886 0.1346 0.66 0.5102
age’ -0.7410 0.6513 -1.14 0.2552
age” 4.9264 4.0047 1.23 0.2186
age”’ -6.6129 5.4100 -1.22 0.2216
sibsp -1.0446 0.3441 -3.04 0.0024
sex=male × pclass=2nd -0.7682 0.7083 -1.08 0.2781
sex=male × pclass=3rd 2.1520 0.6214 3.46 0.0005
sex=male × age -0.2191 0.0722 -3.04 0.0024
sex=male × age’ 1.0842 0.3886 2.79 0.0053
sex=male × age” -6.5578 2.6511 -2.47 0.0134
sex=male × age”’ 8.3716 3.8532 2.17 0.0298
pclass=2nd × age -0.5446 0.2653 -2.05 0.0401
pclass=3rd × age -0.1634 0.1308 -1.25 0.2118
pclass=2nd × age’ 1.9156 1.0189 1.88 0.0601
pclass=3rd × age’ 0.8205 0.6091 1.35 0.1780
pclass=2nd × age” -8.9545 5.5027 -1.63 0.1037
pclass=3rd × age” -5.4276 3.6475 -1.49 0.1367
pclass=2nd × age”’ 9.3926 6.9559 1.35 0.1769
pclass=3rd × age”’ 7.5403 4.8519 1.55 0.1202
age × sibsp 0.0357 0.0340 1.05 0.2933
age’ × sibsp -0.0467 0.2213 -0.21 0.8330
age” × sibsp 0.5574 1.6680 0.33 0.7382
age”’ × sibsp -1.1937 2.5711 -0.46 0.6425

Note that the adjusted Maddala-Cox-Snell R2 using an effective


sample size of 758.1 is only 0.004 smaller than the full model.
print ( anova ( f ) , table.env = TRUE , label = ’ titanic-anova2 ’ , size = ’ small ’) # 9.2

Table 9.2: Wald Statistics for survived

χ2 d.f. P
sex (Factor+Higher Order Factors) 199.42 7 <0.0001
All Interactions 56.14 6 <0.0001
pclass (Factor+Higher Order Factors) 108.73 12 <0.0001
All Interactions 42.83 10 <0.0001
age (Factor+Higher Order Factors) 47.04 20 0.0006
All Interactions 24.51 16 0.0789
Nonlinear (Factor+Higher Order Factors) 22.72 15 0.0902
sibsp (Factor+Higher Order Factors) 19.95 5 0.0013
All Interactions 10.99 4 0.0267
sex × pclass (Factor+Higher Order Factors) 35.40 2 <0.0001
sex × age (Factor+Higher Order Factors) 10.08 4 0.0391
Nonlinear 8.17 3 0.0426
Nonlinear Interaction : f(A,B) vs. AB 8.17 3 0.0426
pclass × age (Factor+Higher Order Factors) 6.86 8 0.5516
Nonlinear 6.11 6 0.4113
Nonlinear Interaction : f(A,B) vs. AB 6.11 6 0.4113
age × sibsp (Factor+Higher Order Factors) 10.99 4 0.0267
Nonlinear 1.81 3 0.6134
Nonlinear Interaction : f(A,B) vs. AB 1.81 3 0.6134
TOTAL NONLINEAR 22.72 15 0.0902
TOTAL INTERACTION 67.58 18 <0.0001
TOTAL NONLINEAR + INTERACTION 70.68 21 <0.0001
TOTAL 253.18 26 <0.0001

Show the many effects of predictors. B


p ← Predict (f , age , sex , pclass , sibsp =0 , fun = plogis )
ggplot ( p ) # F i g . 9.5

ggplot ( Predict (f , sibsp , age = c (10 ,15 ,20 ,50) , conf.int = FALSE ) )
# Figure 9.6

Note that children having many siblings apparently had lower


survival. Married adults had slightly higher survival than unmarried ones.
Validate the model using the bootstrap to check overfitting.
Ignoring two very insignificant pooled tests. C
f ← update (f , x = TRUE , y = TRUE )
# x = TRUE , y= TRUE adds raw data to fit object so can bootstrap
set.seed (131) # so can replicate re-samples
latex ( validate (f , B =200) , digits =2 , size = ’ Ssize ’)

[Figure plot: predicted probability of survival vs. age by sex, paneled by passenger class (1st, 2nd, 3rd)]

Figure 9.5: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses

[Figure plot: log odds vs. number of siblings/spouses aboard (0–8), with curves for ages 10, 15, 20, and 50; adjusted to sex=male, pclass=3rd]

Figure 9.6: Effect of number of siblings and spouses on the log odds of surviving, for third class males

Index Original Training Test Optimism Corrected n


Sample Sample Sample Index
Dxy 0.76 0.77 0.74 0.03 0.72 200
2
R 0.55 0.58 0.53 0.05 0.50 200
Intercept 0.00 0.00 −0.08 0.08 −0.08 200
Slope 1.00 1.00 0.86 0.14 0.86 200
Emax 0.00 0.00 0.05 0.05 0.05 200
D 0.53 0.56 0.49 0.06 0.46 200
U 0.00 0.00 0.01 −0.01 0.01 200
Q 0.53 0.56 0.49 0.07 0.46 200
B 0.13 0.13 0.13 −0.01 0.14 200
g 2.43 2.78 2.40 0.38 2.04 200
gp 0.37 0.37 0.35 0.02 0.35 200
cal ← calibrate (f , B =200) # Figure 9.7
plot ( cal , subtitles = FALSE )

n=1046   Mean absolute error=0.011   Mean squared error=0.00016
0.9 Quantile of absolute error=0.018

[Figure plot: calibration curve of actual probability vs. predicted Pr{survived=1}, with apparent, bias-corrected, and ideal curves]

Figure 9.7: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model

But moderate problem with missing data



9.4

Examining Missing Data Patterns


na.patterns ← naclus ( titanic3 )
require ( rpart ) # Recursive partitioning package

who.na ← rpart ( is.na ( age ) ∼ sex + pclass + survived +


sibsp + parch , data = titanic3 , minbucket =15)
naplot ( na.patterns , ’ na per var ’)
plot ( who.na , margin = .1 ) ; text ( who.na ) # Figure 9.8
plot ( na.patterns )

[Figure plot: fraction of NAs for each variable; rpart tree for is.na(age) splitting on pclass and parch; dendrogram of missingness combinations with age, body, and home.dest most often missing]

Figure 9.8: Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Lower panel
depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the Y -axis is the fraction of
observations for which both variables are missing. Right panel shows the result of recursive partitioning for predicting is.na(age).
The rpart function found only strong patterns according to passenger class.
plot ( summary ( is.na ( age ) ∼ sex + pclass + survived +
sibsp + parch , data = t3 ) ) # Figure 9.9

m ← lrm ( is.na ( age ) ∼ sex * pclass + survived + sibsp + parch ,


data = t3 )
print (m , needspace = ’3 .5in ’)

[Figure plot: proportion of passengers with missing age (0–1) by levels of sex, pclass, survival status, sibsp, and parch; overall N=1309]
Figure 9.9: Univariable descriptions of proportion of passengers with missing age

Logistic Regression Model

lrm(formula = is.na(age) ~ sex * pclass + survived + sibsp +


parch, data = t3)

              Model Likelihood            Discrimination          Rank Discrim.
              Ratio Test                  Indexes                 Indexes
Obs    1309   LR χ2 114.99                R2 0.133                C 0.703
FALSE  1046   d.f. 8                      R2_{8,1309} 0.078       Dxy 0.406
TRUE    263   Pr(> χ2) <0.0001            R2_{8,630.5} 0.156      γ 0.451
              max |∂ log L/∂β| 5×10−6     Brier 0.148             τa 0.131

Table 9.3: Wald Statistics for is.na(age)

χ2 d.f. P
sex (Factor+Higher Order Factors) 5.61 3 0.1324
All Interactions 5.58 2 0.0614
pclass (Factor+Higher Order Factors) 68.43 4 <0.0001
All Interactions 5.58 2 0.0614
survived 0.98 1 0.3232
sibsp 0.35 1 0.5548
parch 7.92 1 0.0049
sex × pclass (Factor+Higher Order Factors) 5.58 2 0.0614
TOTAL 82.90 8 <0.0001

β̂ S.E. Wald Z Pr(> |Z|)


Intercept -2.2030 0.3641 -6.05 <0.0001
sex=male 0.6440 0.3953 1.63 0.1033
pclass=2nd -1.0079 0.6658 -1.51 0.1300
pclass=3rd 1.6124 0.3596 4.48 <0.0001
survived -0.1806 0.1828 -0.99 0.3232
sibsp 0.0435 0.0737 0.59 0.5548
parch -0.3526 0.1253 -2.81 0.0049
sex=male × pclass=2nd 0.1347 0.7545 0.18 0.8583
sex=male × pclass=3rd -0.8563 0.4214 -2.03 0.0422

print ( anova ( m ) , table.env = TRUE , label = ’ titanic-anova.na ’) # T a b l e 9.3

pclass and parch are the important predictors of missing age.



9.5

Single Conditional Mean Imputation


D

First try: conditional mean imputation


Default spline transformation for age caused the distribution of imputed values to be much different from that of non-imputed ones; constrain to linear
xtrans ← transcan (∼ I ( age ) + sex + pclass + sibsp + parch ,
imputed = TRUE , pl = FALSE , pr = FALSE , data = t3 )

summary ( xtrans )

transcan ( x = ∼I ( age ) + sex + pclass + sibsp + parch , imputed = TRUE ,


pr = FALSE , pl = FALSE , data = t3 )

Iterations : 5

R2 achieved in predicting each variable :

age sex pclass sibsp parch


0.264 0.076 0.242 0.249 0.291

Adjusted R2 :

age sex pclass sibsp parch


0.260 0.073 0.239 0.245 0.288

Coefficients of canonical variates for predicting each ( row ) variable

age sex pclass sibsp parch


age 0.92 6.05 -2.02 -2.65
sex 0.03 -0.56 -0.01 -0.75
pclass 0.08 -0.26 0.03 0.28
sibsp -0.02 0.00 0.03 0.86
parch -0.03 -0.30 0.23 0.75

Summary of imputed values

age
n missing distinct Info Mean Gmd .05 .10
263 0 24 0.91 28.53 6.925 17.34 21.77
.25 .50 .75 .90 .95
26.17 28.10 28.10 42.77 42.77

lowest : 9.82894 11.75710 13.22440 15.15250 17.28300


highest : 33.24650 34.73840 38.63790 40.83950 42.76770

Starting estimates for imputed values :



age sex pclass sibsp parch


28 2 3 0 0

# Look at mean imputed values by sex, pclass and observed means
# age.i is age, filled in with conditional mean estimates
age.i ← with ( t3 , impute ( xtrans , age , data = t3 ) )
i ← is.imputed ( age.i )
with ( t3 , tapply ( age.i [ i ] , list ( sex [ i ] , pclass [ i ]) , mean ) )

1 st 2 nd 3 rd
female 39.08396 31.31831 23.10548
male 42.76765 33.24650 26.87451

with ( t3 , tapply ( age , list ( sex , pclass ) , mean , na.rm = TRUE ) )

1 st 2 nd 3 rd
female 37.03759 27.49919 22.18531
male 41.02925 30.81540 25.96227

dd ← datadist ( dd , age.i )
f.si ← lrm ( survived ∼ ( sex + pclass + rcs ( age.i ,5) ) ^2 +
rcs ( age.i ,5) * sibsp , data = t3 )
print ( f.si , coefs = FALSE )

Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age.i, 5))^2 + rcs(age.i,


5) * sibsp, data = t3)

            Model Likelihood            Discrimination           Rank Discrim.
            Ratio Test                  Indexes                  Indexes
Obs  1309   LR χ2 640.85                R2 0.526                 C 0.861
0     809   d.f. 26                     R2_{26,1309} 0.375       Dxy 0.722
1     500   Pr(> χ2) <0.0001            R2_{26,927} 0.485        γ 0.726
            max |∂ log L/∂β| 0.0004     Brier 0.133              τa 0.341
p1 ← Predict (f , age , pclass , sex , sibsp =0 , fun = plogis )
p2 ← Predict ( f.si , age.i , pclass , sex , sibsp =0 , fun = plogis )
p ← rbind ( ’ Casewise Deletion ’= p1 , ’ Single Imputation ’= p2 ,
rename = c ( age.i = ’ age ’) ) # creates .set. variable
ggplot (p , groups = ’ sex ’ , ylab = ’ Probability of Surviving ’)
# Figure 9.10

print ( anova ( f.si ) , table.env = TRUE , label = ’ titanic-anova.si ’) # Table 9.4



[Figure plot: predicted probability of survival vs. age by sex, paneled by class, comparing casewise deletion and single imputation fits]

Figure 9.10: Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean
imputation (top). sibsp is set to zero for these predicted values.

Table 9.4: Wald Statistics for survived

χ2 d.f. P
sex (Factor+Higher Order Factors) 245.39 7 <0.0001
All Interactions 52.85 6 <0.0001
pclass (Factor+Higher Order Factors) 112.07 12 <0.0001
All Interactions 36.79 10 <0.0001
age.i (Factor+Higher Order Factors) 49.32 20 0.0003
All Interactions 25.62 16 0.0595
Nonlinear (Factor+Higher Order Factors) 19.71 15 0.1835
sibsp (Factor+Higher Order Factors) 22.02 5 0.0005
All Interactions 12.28 4 0.0154
sex × pclass (Factor+Higher Order Factors) 30.29 2 <0.0001
sex × age.i (Factor+Higher Order Factors) 8.91 4 0.0633
Nonlinear 5.62 3 0.1319
Nonlinear Interaction : f(A,B) vs. AB 5.62 3 0.1319
pclass × age.i (Factor+Higher Order Factors) 6.05 8 0.6421
Nonlinear 5.44 6 0.4888
Nonlinear Interaction : f(A,B) vs. AB 5.44 6 0.4888
age.i × sibsp (Factor+Higher Order Factors) 12.28 4 0.0154
Nonlinear 2.05 3 0.5614
Nonlinear Interaction : f(A,B) vs. AB 2.05 3 0.5614
TOTAL NONLINEAR 19.71 15 0.1835
TOTAL INTERACTION 67.00 18 <0.0001
TOTAL NONLINEAR + INTERACTION 69.53 21 <0.0001
TOTAL 305.74 26 <0.0001

9.6

Multiple Imputation

The following uses aregImpute with predictive mean matching.


By default, aregImpute does not transform age when it is be-
ing predicted from the other variables. Four knots are used to
transform age when used to impute other variables (not needed
here as no other missings were present). Since the fraction of observations with missing age is 263/1309 = 0.2, we use 20 imputations.
set.seed (17) # so can reproduce random aspects
mi ← aregImpute (∼ age + sex + pclass +
sibsp + parch + survived ,
data = t3 , n.impute =20 , nk =4 , pr = FALSE )

mi

Multiple Imputation using Bootstrap and PMM

aregImpute ( formula = ∼age + sex + pclass + sibsp + parch + survived ,


data = t3 , n . impute = 20 , nk = 4 , pr = FALSE )

n : 1309 p: 6 Imputations : 20 nk : 4

Number of NAs :
age sex pclass sibsp parch survived
263 0 0 0 0 0

type d . f .
age s 1
sex c 1
pclass c 2
sibsp s 3
parch s 3
survived l 1

Transformation of Target Variables Forced to be Linear

R - squares for Predicting Non - Missing Values for Each Variable


Using Last Imputations of Predictors
age
0.373

# Print the first 10 imputations for the first 10 passengers



# having missing age


mi $ imputed $ age [1:10 , 1:10]

[ ,1] [ ,2] [ ,3] [ ,4] [ ,5] [ ,6] [ ,7] [ ,8] [ ,9] [ ,10]
16 41 47.0 24 44 60.0 47 28.0 29 49 17
38 53 44.0 76 59 35.0 39 16.0 54 19 29
41 45 46.0 28 40 50.0 61 19.0 63 18 61
47 31 28.5 33 35 61.0 55 45.5 38 41 47
60 35 40.0 49 41 27.0 36 51.0 2 33 27
70 30 30.0 16 53 56.0 70 17.0 38 45 51
71 55 36.0 36 42 42.0 33 65.0 46 39 57
75 24 36.0 47 49 45.5 47 47.0 38 55 56
81 60 45.0 46 28 55.0 42 45.0 61 33 45
107 46 29.0 40 58 71.0 58 47.0 63 61 56

Show the distribution of imputed (black) and actual ages (gray).


E
plot ( mi )
Ecdf ( t3 $ age , add = TRUE , col = ’ gray ’ , lwd =2 ,
subtitles = FALSE ) # Fig. 9.11

[Figure plot: empirical CDFs of imputed and actual ages (0–80 years)]
Figure 9.11: Distributions of imputed and actual ages for the Titanic dataset. Imputed values are in black and actual ages in gray.

Fit logistic models for 20 completed datasets and print the ratio
of imputation-corrected variances to average ordinary variances F
f.mi ← fit.mult.impute (
survived ∼ ( sex + pclass + rcs ( age ,5) ) ^2 +
rcs ( age ,5) * sibsp ,
lrm , mi , data = t3 , pr = FALSE )
print ( anova ( f.mi ) , table.env = TRUE , label = ’ titanic-anova.mi ’ ,
size = ’ small ’) # T a b l e 9.5

The Wald χ2 for age is reduced by accounting for imputation



Table 9.5: Wald Statistics for survived

χ2 d.f. P
sex (Factor+Higher Order Factors) 237.81 7 <0.0001
All Interactions 53.44 6 <0.0001
pclass (Factor+Higher Order Factors) 113.77 12 <0.0001
All Interactions 38.60 10 <0.0001
age (Factor+Higher Order Factors) 49.97 20 0.0002
All Interactions 26.00 16 0.0540
Nonlinear (Factor+Higher Order Factors) 23.03 15 0.0835
sibsp (Factor+Higher Order Factors) 25.08 5 0.0001
All Interactions 13.42 4 0.0094
sex × pclass (Factor+Higher Order Factors) 32.70 2 <0.0001
sex × age (Factor+Higher Order Factors) 10.54 4 0.0322
Nonlinear 8.40 3 0.0384
Nonlinear Interaction : f(A,B) vs. AB 8.40 3 0.0384
pclass × age (Factor+Higher Order Factors) 5.53 8 0.6996
Nonlinear 4.67 6 0.5870
Nonlinear Interaction : f(A,B) vs. AB 4.67 6 0.5870
age × sibsp (Factor+Higher Order Factors) 13.42 4 0.0094
Nonlinear 2.11 3 0.5492
Nonlinear Interaction : f(A,B) vs. AB 2.11 3 0.5492
TOTAL NONLINEAR 23.03 15 0.0835
TOTAL INTERACTION 66.42 18 <0.0001
TOTAL NONLINEAR + INTERACTION 69.10 21 <0.0001
TOTAL 294.26 26 <0.0001

but is increased by using patterns of association with survival G

status to impute missing age.


Show estimated effects of age by classes.
p1 ← Predict ( f.si , age.i , pclass , sex , sibsp =0 , fun = plogis )
p2 ← Predict ( f.mi , age , pclass , sex , sibsp =0 , fun = plogis )
p ← rbind ( ’ Single Imputation ’= p1 , ’ Multiple Imputation ’= p2 ,
rename = c ( age.i = ’ age ’) )
ggplot (p , groups = ’ sex ’ , ylab = ’ Probability of Surviving ’)
# Figure 9.12

[Figure plot: predicted probability of survival vs. age by sex, paneled by class, comparing single and multiple imputation fits]

Figure 9.12: Predicted probability of survival for males from fit using single conditional mean imputation again (top) and multiple
random draw imputation (bottom). Both sets of predictions are for sibsp=0.
CHAPTER 9. LOGISTIC MODEL CASE STUDY: SURVIVAL OF TITANIC PASSENGERS 9-25

9.7

Summarizing the Fitted Model

Show odds ratios for changes in predictor values


s ← summary(f.mi, age=c(1,30), sibsp=0:1)
# override default ranges for two variables
plot(s, log=TRUE, main='')   # Figure 9.13

[Figure: odds ratio chart on a log scale (0.10–3.00) for age 30:1, sibsp 1:0, sex female:male, pclass 1st:3rd, and pclass 2nd:3rd; adjusted to sex=male, pclass=3rd, age=28, sibsp=0]
Figure 9.13: Odds ratios for some predictor settings


# Get predicted values for certain types of passengers
phat ← predict(f.mi,
               combos ←
                 expand.grid(age=c(2,21,50), sex=levels(t3$sex),
                             pclass=levels(t3$pclass),
                             sibsp=0), type='fitted')
# Can also use Predict(f.mi, age=c(2,21,50), sex, pclass,
#                      sibsp=0, fun=plogis)$yhat
options(digits=1)
data.frame(combos, phat)

age sex pclass sibsp phat


1 2 female 1 st 0 0.97
2 21 female 1 st 0 0.98
3 50 female 1 st 0 0.97
4 2 male 1 st 0 0.89
5 21 male 1 st 0 0.47
6 50 male 1 st 0 0.26
7 2 female 2 nd 0 1.00
8 21 female 2 nd 0 0.90
9 50 female 2 nd 0 0.81
10 2 male 2 nd 0 1.00
11 21 male 2 nd 0 0.08
12 50 male 2 nd 0 0.03

13 2 female 3 rd 0 0.85
14 21 female 3 rd 0 0.57
15 50 female 3 rd 0 0.35
16 2 male 3 rd 0 0.91
17 21 male 3 rd 0 0.13
18 50 male 3 rd 0 0.05

options ( digits =5)

We can also get predicted values by creating an S function that
will evaluate the model on demand.
pred.logit ← Function(f.mi)
# Note: if don't define sibsp to pred.logit, defaults to 0
# normally just type the function name to see its body
latex(pred.logit, file='', type='Sinput', size='small',
      width.cutoff=49)

pred.logit ← function(sex = "male", pclass = "3rd", age = 28,
                      sibsp = 0)
{
  3.373079 - 1.0484795 * (sex == "male") + 5.8078168 *
    (pclass == "2nd") - 1.4370771 * (pclass == "3rd") +
    0.078347318 * age - 0.00027150053 * pmax(age - 6, 0)^3 +
    0.0017093284 * pmax(age - 21, 0)^3 - 0.0023751505 *
    pmax(age - 28, 0)^3 + 0.0010126373 * pmax(age - 36, 0)^3 -
    7.5314668e-05 * pmax(age - 56, 0)^3 - 1.1799235 * sibsp +
    (sex == "male") * (-0.47754081 * (pclass == "2nd") +
      2.0665924 * (pclass == "3rd")) +
    (sex == "male") * (-0.21884197 * age + 0.00042463444 *
      pmax(age - 6, 0)^3 - 0.0023860246 * pmax(age - 21, 0)^3 +
      0.0030996682 * pmax(age - 28, 0)^3 - 0.0012255784 *
      pmax(age - 36, 0)^3 + 8.7300463e-05 * pmax(age - 56, 0)^3) +
    (pclass == "2nd") * (-0.47647131 * age + 0.00068483 *
      pmax(age - 6, 0)^3 - 0.0029990417 * pmax(age - 21, 0)^3 +
      0.0031221255 * pmax(age - 28, 0)^3 - 0.00083472782 *
      pmax(age - 36, 0)^3 + 2.6813959e-05 * pmax(age - 56, 0)^3) +
    (pclass == "3rd") * (-0.16335774 * age + 0.00030986546 *
      pmax(age - 6, 0)^3 - 0.0018174716 * pmax(age - 21, 0)^3 +
      0.002491657 * pmax(age - 28, 0)^3 - 0.0010824082 *
      pmax(age - 36, 0)^3 + 9.8357307e-05 * pmax(age - 56, 0)^3) +
    sibsp * (0.042368037 * age - 2.0590588e-05 *
      pmax(age - 6, 0)^3 + 0.00017884536 * pmax(age - 21, 0)^3 -
      0.00039201911 * pmax(age - 28, 0)^3 + 0.00028732385 *
      pmax(age - 36, 0)^3 - 5.3559508e-05 * pmax(age - 56, 0)^3)
}

# Run the newly created function


plogis ( pred.logit ( age = c (2 ,21 ,50) , sex = ’ male ’ , pclass = ’3 rd ’) )

[1] 0.912648 0.134219 0.050343



A nomogram could be used to obtain predicted values manually,
but this is not feasible when so many interaction terms are
present.

9.8

Bayesian Analysis

ˆ Repeat the multiple imputation-based approach but using a


Bayesian binary logistic model

ˆ Using default blrm function normal priors on regression
  coefficients with zero mean and large SD, making the priors
  almost flat

ˆ blrm uses the rstan package that provides the full power of
Stan to R

ˆ Could use smaller SDs to get penalized estimates

ˆ Using 4 independent Markov chain Hamiltonian posterior
  sampling procedures, each with 1000 burn-in iterations that
  are discarded and 1000 “real” iterations, for a total of 4000
  posterior sample draws

ˆ Use the first 10 multiple imputations already developed above


(object mi), running the Bayesian procedure separately for
10 completed datasets

ˆ Merely have to stack the posterior draws into one giant
  sample to account for imputation and get the correct posterior
  distribution; a toy sketch of why this works follows
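A toy numerical sketch (simulated draws, not the rmsb internals) of why stacking suffices: each completed dataset's posterior is conditional on that imputation, so pooling the draws mixes over the imputation distribution, and the pooled interval is appropriately wider than any single within-imputation interval.

# Pretend 3 imputations each produced 1000 posterior draws of one
# coefficient, centered differently because the completed datasets differ
set.seed(2)
draws.per.imp ← lapply(c(-0.1, 0, 0.1), function(shift)
                         rnorm(1000, mean=1 + shift, sd=0.2))
stacked ← do.call(c, draws.per.imp)   # one giant sample of 3000 draws
# Stacked SD exceeds the average within-imputation SD:
c(within=mean(sapply(draws.per.imp, sd)), stacked=sd(stacked))
quantile(stacked, c(.025, .975))      # interval accounts for imputation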
# Use all available CPU cores.  Each chain will be run on its own core.
# fitIf function only re-runs the model if bt.rds file doesn't exist
require(rmsb)
options(mc.cores=parallel::detectCores())
getRs('fitIf.r', put='source')
# 10 Bayesian analyses took 5m on 4 cores
set.seed(1)
fitIf(bt ← stackMI(survived ∼ (sex + pclass + rcs(age, 5))^2 +
                     rcs(age, 5) * sibsp,
                   blrm, mi, data=t3, n.impute=10, refresh=25))
bt
bt

Bayesian Logistic Model

Dirichlet Priors With Concentration Parameter 0.541 for Intercepts

stackMI(formula = survived ~ (sex + pclass + rcs(age, 5))^2 +


rcs(age, 5) * sibsp, fitter = blrm, xtrans = mi, data = t3,
n.impute = 10, refresh = 25)

Obs 1309 (0: 809, 1: 500)   Draws 40000   Chains 4   Imputations 10   p 26

Mixed Calibration/Discrimination Indexes:  B 0.134 [0.132, 0.136]
Discrimination Indexes:  g 2.543 [2.214, 2.98]    gp 0.358 [0.341, 0.373]
                         EV 0.463 [0.424, 0.506]  v 5.928 [3.76, 8.404]
                         vp 0.109 [0.101, 0.119]
Rank Discrim. Indexes:   C 0.867 [0.86, 0.872]    Dxy 0.734 [0.72, 0.743]

Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


Intercept 4.2251 4.0271 2.1028 0.3645 8.5187 0.9918 1.31
sex=male -1.1483 -1.1348 1.0752 -3.2582 0.9551 0.1412 0.97
pclass=2nd 6.8178 6.3852 4.2035 -0.6757 15.5954 0.9704 1.33
pclass=3rd -2.1311 -1.9283 2.0807 -6.3349 1.7351 0.1411 0.78
age 0.0380 0.0474 0.1467 -0.2514 0.3196 0.6282 0.83
age’ -0.5689 -0.5807 0.7146 -1.9333 0.8583 0.2098 1.05
age” 3.8450 3.8366 4.1029 -4.0133 12.0105 0.8270 1.00
age”’ -5.4485 -5.4036 5.6846 -16.4766 5.8044 0.1676 0.99
sibsp -1.2766 -1.2608 0.3520 -1.9766 -0.5998 0.0000 0.89
sex=male × pclass=2nd -0.5159 -0.5305 0.7291 -1.9297 0.9332 0.2349 1.07
sex=male × pclass=3rd 2.2193 2.1910 0.6360 0.9948 3.4821 0.9999 1.14
sex=male × age -0.2224 -0.2216 0.0682 -0.3558 -0.0874 0.0006 1.00
sex=male × age’ 1.0582 1.0563 0.3918 0.2956 1.8277 0.9967 1.02
sex=male × age” -5.7933 -5.7831 2.5904 -10.9011 -0.7983 0.0115 0.99
sex=male × age”’ 7.2944 7.2764 3.9646 -0.4396 15.0405 0.9678 1.01
pclass=2nd × age -0.5464 -0.5230 0.2736 -1.0998 -0.0413 0.0108 0.78
pclass=3rd × age -0.1309 -0.1399 0.1437 -0.4029 0.1554 0.1737 1.21

Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


pclass=2nd × age’ 1.9695 1.9213 1.0944 -0.1787 4.1078 0.9727 1.15
pclass=3rd × age’ 0.6807 0.6922 0.6813 -0.6632 1.9990 0.8439 0.93
pclass=2nd × age” -8.5936 -8.4640 5.6835 -20.0139 2.2876 0.0604 0.94
pclass=3rd × age” -4.1719 -4.1757 3.8312 -11.7624 3.2033 0.1367 1.03
pclass=2nd × age”’ 8.8410 8.7521 7.4719 -5.7498 23.6237 0.8835 1.03
pclass=3rd × age”’ 5.7731 5.7292 5.2248 -4.4802 15.9941 0.8668 1.00
age × sibsp 0.0438 0.0435 0.0324 -0.0192 0.1078 0.9127 1.01
age’ × sibsp -0.0211 -0.0252 0.2265 -0.4583 0.4255 0.4550 1.05
age” × sibsp 0.2008 0.2107 1.6449 -3.0195 3.4105 0.5523 0.99
age”’ × sibsp -0.6196 -0.6189 2.6613 -5.8611 4.5647 0.4080 1.00

ˆ Note that fit indexes have HPD uncertainty intervals

ˆ Everything above accounts for imputation

ˆ Look at diagnostics
stanDx ( bt )

Diagnostics for each of 10 imputations

Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved

For each parameter , n_eff is a crude measure of effective sample size


and Rhat is the potential scale reduction factor on split chains
( at convergence , Rhat =1)

n_eff Rhat
Imputation 1: Intercept 2213 1.001
Imputation 1: sex = male 2519 1.001
Imputation 1: pclass =2 nd 1623 1.001
Imputation 1: pclass =3 rd 3038 1.000
Imputation 1: age 1490 1.002
Imputation 1: age ’ 1563 1.000
Imputation 1: age ’ ’ 1394 1.001
Imputation 1: age ’ ’ ’ 2913 1.000
Imputation 1: sibsp 2981 1.000
Imputation 1: sex = male * pclass =2 nd 2944 1.001
Imputation 1: sex = male * pclass =3 rd 2745 1.000
Imputation 1: sex = male * age 3500 1.000
Imputation 1: sex = male * age ’ 3211 1.000
Imputation 1: sex = male * age ’ ’ 4320 1.000
Imputation 1: sex = male * age ’ ’ ’ 3855 1.000
Imputation 1: pclass =2 nd * age 1217 1.001
Imputation 1: pclass =3 rd * age 2247 1.000
Imputation 1: pclass =2 nd * age ’ 1243 1.002
Imputation 1: pclass =3 rd * age ’ 3108 1.000

Imputation 1: pclass =2 nd * age ’ ’ 1277 1.001


Imputation 1: pclass =3 rd * age ’ ’ 4487 1.000
Imputation 1: pclass =2 nd * age ’ ’ ’ 2308 1.001
Imputation 1: pclass =3 rd * age ’ ’ ’ 4550 0.999
Imputation 1: age * sibsp 4935 1.000
Imputation 1: age ’ * sibsp 5548 1.000
Imputation 1: age ’ ’ * sibsp 4588 1.000
Imputation 1: age ’ ’ ’ * sibsp 4865 1.000
Imputation 2: Intercept 2170 1.001
Imputation 2: sex = male 2872 1.000
Imputation 2: pclass =2 nd 1094 1.005
Imputation 2: pclass =3 rd 2969 0.999
Imputation 2: age 1107 1.005
Imputation 2: age ’ 982 1.004
Imputation 2: age ’ ’ 1060 1.004
Imputation 2: age ’ ’ ’ 2261 1.001
Imputation 2: sibsp 2292 1.001
Imputation 2: sex = male * pclass =2 nd 2625 1.001
Imputation 2: sex = male * pclass =3 rd 2911 1.000
Imputation 2: sex = male * age 3312 1.001
Imputation 2: sex = male * age ’ 3058 1.000
Imputation 2: sex = male * age ’ ’ 3855 0.999
Imputation 2: sex = male * age ’ ’ ’ 3757 1.000
Imputation 2: pclass =2 nd * age 936 1.006
Imputation 2: pclass =3 rd * age 1665 1.003
Imputation 2: pclass =2 nd * age ’ 803 1.006
Imputation 2: pclass =3 rd * age ’ 2710 0.999
Imputation 2: pclass =2 nd * age ’ ’ 873 1.006
Imputation 2: pclass =3 rd * age ’ ’ 3663 1.001
Imputation 2: pclass =2 nd * age ’ ’ ’ 1844 1.002
Imputation 2: pclass =3 rd * age ’ ’ ’ 4283 1.000
Imputation 2: age * sibsp 3800 1.000
Imputation 2: age ’ * sibsp 4230 1.000
Imputation 2: age ’ ’ * sibsp 3498 1.000
Imputation 2: age ’ ’ ’ * sibsp 4123 1.001
Imputation 3: Intercept 2027 1.000
Imputation 3: sex = male 2420 1.001
Imputation 3: pclass =2 nd 1623 1.002
Imputation 3: pclass =3 rd 2632 1.001
Imputation 3: age 1461 1.002
Imputation 3: age ’ 1369 1.002
Imputation 3: age ’ ’ 1363 1.004
Imputation 3: age ’ ’ ’ 2674 1.001
Imputation 3: sibsp 3061 1.001
Imputation 3: sex = male * pclass =2 nd 2689 1.000
Imputation 3: sex = male * pclass =3 rd 2655 1.000
Imputation 3: sex = male * age 3937 0.999
Imputation 3: sex = male * age ’ 3925 1.000
Imputation 3: sex = male * age ’ ’ 4361 1.000
Imputation 3: sex = male * age ’ ’ ’ 4415 1.000
Imputation 3: pclass =2 nd * age 1232 1.002
Imputation 3: pclass =3 rd * age 2022 1.002
Imputation 3: pclass =2 nd * age ’ 1177 1.002
Imputation 3: pclass =3 rd * age ’ 3244 1.001
Imputation 3: pclass =2 nd * age ’ ’ 1241 1.002
Imputation 3: pclass =3 rd * age ’ ’ 3758 1.000
Imputation 3: pclass =2 nd * age ’ ’ ’ 2459 1.000

Imputation 3: pclass =3 rd * age ’ ’ ’ 4596 1.000


Imputation 3: age * sibsp 4393 1.000
Imputation 3: age ’ * sibsp 4592 1.000
Imputation 3: age ’ ’ * sibsp 4516 1.001
Imputation 3: age ’ ’ ’ * sibsp 3958 1.001
Imputation 4: Intercept 2560 1.000
Imputation 4: sex = male 2852 1.000
Imputation 4: pclass =2 nd 1907 1.000
Imputation 4: pclass =3 rd 2984 1.001
Imputation 4: age 1707 1.000
Imputation 4: age ’ 1577 1.001
Imputation 4: age ’ ’ 1745 1.000
Imputation 4: age ’ ’ ’ 2850 1.000
Imputation 4: sibsp 2530 1.000
Imputation 4: sex = male * pclass =2 nd 3067 1.001
Imputation 4: sex = male * pclass =3 rd 2926 1.001
Imputation 4: sex = male * age 4335 1.002
Imputation 4: sex = male * age ’ 4144 0.999
Imputation 4: sex = male * age ’ ’ 4150 0.999
Imputation 4: sex = male * age ’ ’ ’ 3714 1.000
Imputation 4: pclass =2 nd * age 1331 1.001
Imputation 4: pclass =3 rd * age 2872 1.001
Imputation 4: pclass =2 nd * age ’ 1315 1.001
Imputation 4: pclass =3 rd * age ’ 2910 1.000
Imputation 4: pclass =2 nd * age ’ ’ 1448 1.001
Imputation 4: pclass =3 rd * age ’ ’ 3777 1.001
Imputation 4: pclass =2 nd * age ’ ’ ’ 2425 1.000
Imputation 4: pclass =3 rd * age ’ ’ ’ 4556 1.000
Imputation 4: age * sibsp 4148 1.000
Imputation 4: age ’ * sibsp 4568 1.001
Imputation 4: age ’ ’ * sibsp 3716 0.999
Imputation 4: age ’ ’ ’ * sibsp 4133 0.999
Imputation 5: Intercept 2064 1.000
Imputation 5: sex = male 2836 1.001
Imputation 5: pclass =2 nd 1699 1.002
Imputation 5: pclass =3 rd 3059 1.002
Imputation 5: age 1397 1.001
Imputation 5: age ’ 1466 1.001
Imputation 5: age ’ ’ 1446 1.001
Imputation 5: age ’ ’ ’ 2722 1.000
Imputation 5: sibsp 2651 1.000
Imputation 5: sex = male * pclass =2 nd 3073 1.000
Imputation 5: sex = male * pclass =3 rd 2734 1.001
Imputation 5: sex = male * age 3571 1.001
Imputation 5: sex = male * age ’ 3261 1.000
Imputation 5: sex = male * age ’ ’ 4032 1.000
Imputation 5: sex = male * age ’ ’ ’ 3770 1.001
Imputation 5: pclass =2 nd * age 1099 1.001
Imputation 5: pclass =3 rd * age 2226 1.000
Imputation 5: pclass =2 nd * age ’ 1212 1.001
Imputation 5: pclass =3 rd * age ’ 2706 1.000
Imputation 5: pclass =2 nd * age ’ ’ 1287 1.001
Imputation 5: pclass =3 rd * age ’ ’ 3603 1.000
Imputation 5: pclass =2 nd * age ’ ’ ’ 2022 1.000
Imputation 5: pclass =3 rd * age ’ ’ ’ 4262 1.000
Imputation 5: age * sibsp 4096 1.001
Imputation 5: age ’ * sibsp 4242 0.999

Imputation 5: age ’ ’ * sibsp 4035 0.999


Imputation 5: age ’ ’ ’ * sibsp 4017 1.000
Imputation 6: Intercept 1996 1.000
Imputation 6: sex = male 2782 1.001
Imputation 6: pclass =2 nd 1433 1.001
Imputation 6: pclass =3 rd 3017 1.000
Imputation 6: age 1258 1.003
Imputation 6: age ’ 1142 1.001
Imputation 6: age ’ ’ 1284 1.002
Imputation 6: age ’ ’ ’ 2214 1.001
Imputation 6: sibsp 2086 1.001
Imputation 6: sex = male * pclass =2 nd 3138 1.001
Imputation 6: sex = male * pclass =3 rd 2913 1.000
Imputation 6: sex = male * age 3825 1.000
Imputation 6: sex = male * age ’ 3781 1.000
Imputation 6: sex = male * age ’ ’ 3165 1.001
Imputation 6: sex = male * age ’ ’ ’ 3738 1.001
Imputation 6: pclass =2 nd * age 991 1.003
Imputation 6: pclass =3 rd * age 1987 1.000
Imputation 6: pclass =2 nd * age ’ 987 1.003
Imputation 6: pclass =3 rd * age ’ 2994 1.001
Imputation 6: pclass =2 nd * age ’ ’ 1054 1.002
Imputation 6: pclass =3 rd * age ’ ’ 3967 1.000
Imputation 6: pclass =2 nd * age ’ ’ ’ 1731 1.002
Imputation 6: pclass =3 rd * age ’ ’ ’ 4709 1.000
Imputation 6: age * sibsp 5053 1.000
Imputation 6: age ’ * sibsp 4852 1.000
Imputation 6: age ’ ’ * sibsp 4735 1.000
Imputation 6: age ’ ’ ’ * sibsp 4711 1.000
Imputation 7: Intercept 1968 1.001
Imputation 7: sex = male 2724 1.000
Imputation 7: pclass =2 nd 1497 1.004
Imputation 7: pclass =3 rd 2662 1.000
Imputation 7: age 1355 1.006
Imputation 7: age ’ 1278 1.006
Imputation 7: age ’ ’ 1425 1.002
Imputation 7: age ’ ’ ’ 2861 1.004
Imputation 7: sibsp 2699 1.001
Imputation 7: sex = male * pclass =2 nd 3133 1.000
Imputation 7: sex = male * pclass =3 rd 2521 1.000
Imputation 7: sex = male * age 3915 1.000
Imputation 7: sex = male * age ’ 3167 1.001
Imputation 7: sex = male * age ’ ’ 3304 1.000
Imputation 7: sex = male * age ’ ’ ’ 3024 1.002
Imputation 7: pclass =2 nd * age 1188 1.006
Imputation 7: pclass =3 rd * age 2365 1.001
Imputation 7: pclass =2 nd * age ’ 1035 1.007
Imputation 7: pclass =3 rd * age ’ 2710 1.000
Imputation 7: pclass =2 nd * age ’ ’ 1198 1.004
Imputation 7: pclass =3 rd * age ’ ’ 3147 1.001
Imputation 7: pclass =2 nd * age ’ ’ ’ 2180 1.003
Imputation 7: pclass =3 rd * age ’ ’ ’ 3449 1.000
Imputation 7: age * sibsp 4234 1.000
Imputation 7: age ’ * sibsp 3661 1.000
Imputation 7: age ’ ’ * sibsp 3362 1.000
Imputation 7: age ’ ’ ’ * sibsp 3659 0.999
Imputation 8: Intercept 2460 1.000

Imputation 8: sex = male 2775 1.000


Imputation 8: pclass =2 nd 1678 1.002
Imputation 8: pclass =3 rd 2777 1.000
Imputation 8: age 1384 1.001
Imputation 8: age ’ 1284 1.002
Imputation 8: age ’ ’ 1608 1.000
Imputation 8: age ’ ’ ’ 2648 1.001
Imputation 8: sibsp 3020 1.000
Imputation 8: sex = male * pclass =2 nd 3135 1.000
Imputation 8: sex = male * pclass =3 rd 2573 1.000
Imputation 8: sex = male * age 3565 1.000
Imputation 8: sex = male * age ’ 3626 0.999
Imputation 8: sex = male * age ’ ’ 3767 1.000
Imputation 8: sex = male * age ’ ’ ’ 3888 1.000
Imputation 8: pclass =2 nd * age 1095 1.002
Imputation 8: pclass =3 rd * age 2777 1.002
Imputation 8: pclass =2 nd * age ’ 1261 1.003
Imputation 8: pclass =3 rd * age ’ 3043 1.000
Imputation 8: pclass =2 nd * age ’ ’ 1234 1.002
Imputation 8: pclass =3 rd * age ’ ’ 3881 1.000
Imputation 8: pclass =2 nd * age ’ ’ ’ 2436 1.001
Imputation 8: pclass =3 rd * age ’ ’ ’ 4068 1.000
Imputation 8: age * sibsp 4150 1.000
Imputation 8: age ’ * sibsp 4543 1.000
Imputation 8: age ’ ’ * sibsp 3692 1.000
Imputation 8: age ’ ’ ’ * sibsp 4209 1.000
Imputation 9: Intercept 2329 1.000
Imputation 9: sex = male 3018 1.000
Imputation 9: pclass =2 nd 1411 1.003
Imputation 9: pclass =3 rd 2836 1.001
Imputation 9: age 1216 1.004
Imputation 9: age ’ 1178 1.003
Imputation 9: age ’ ’ 1280 1.004
Imputation 9: age ’ ’ ’ 2887 1.000
Imputation 9: sibsp 2491 1.001
Imputation 9: sex = male * pclass =2 nd 3036 1.001
Imputation 9: sex = male * pclass =3 rd 2505 1.001
Imputation 9: sex = male * age 3263 1.000
Imputation 9: sex = male * age ’ 3763 1.000
Imputation 9: sex = male * age ’ ’ 4284 0.999
Imputation 9: sex = male * age ’ ’ ’ 3377 1.000
Imputation 9: pclass =2 nd * age 1006 1.004
Imputation 9: pclass =3 rd * age 2261 1.000
Imputation 9: pclass =2 nd * age ’ 959 1.005
Imputation 9: pclass =3 rd * age ’ 3096 1.001
Imputation 9: pclass =2 nd * age ’ ’ 1064 1.004
Imputation 9: pclass =3 rd * age ’ ’ 4543 1.000
Imputation 9: pclass =2 nd * age ’ ’ ’ 1898 1.002
Imputation 9: pclass =3 rd * age ’ ’ ’ 4256 1.000
Imputation 9: age * sibsp 4510 1.000
Imputation 9: age ’ * sibsp 4832 1.000
Imputation 9: age ’ ’ * sibsp 4267 0.999
Imputation 9: age ’ ’ ’ * sibsp 4354 1.000
Imputation 10: Intercept 2948 0.999
Imputation 10: sex = male 2730 1.000
Imputation 10: pclass =2 nd 3332 1.000
Imputation 10: pclass =3 rd 2646 1.000

Imputation 10: age 2859 1.000


Imputation 10: age ’ 2881 1.000
Imputation 10: age ’ ’ 3492 0.999
Imputation 10: age ’ ’ ’ 4258 1.000
Imputation 10: sibsp 3399 1.000
Imputation 10: sex = male * pclass =2 nd 3436 1.000
Imputation 10: sex = male * pclass =3 rd 2566 1.000
Imputation 10: sex = male * age 2772 1.000
Imputation 10: sex = male * age ’ 2633 1.001
Imputation 10: sex = male * age ’ ’ 3435 1.000
Imputation 10: sex = male * age ’ ’ ’ 3772 1.000
Imputation 10: pclass =2 nd * age 2373 1.001
Imputation 10: pclass =3 rd * age 2457 1.000
Imputation 10: pclass =2 nd * age ’ 2602 1.000
Imputation 10: pclass =3 rd * age ’ 2282 1.000
Imputation 10: pclass =2 nd * age ’ ’ 3161 1.001
Imputation 10: pclass =3 rd * age ’ ’ 2826 1.002
Imputation 10: pclass =2 nd * age ’ ’ ’ 4128 0.999
Imputation 10: pclass =3 rd * age ’ ’ ’ 3824 1.000
Imputation 10: age * sibsp 4195 0.999
Imputation 10: age ’ * sibsp 3988 0.999
Imputation 10: age ’ ’ * sibsp 3830 1.000
Imputation 10: age ’ ’ ’ * sibsp 3636 1.000

# Look at convergence for three of the parameters
stanDxplot(bt, c('sex=male', 'pclass=3rd', 'age'), rev=TRUE)
[Figure: stanDxplot trace plots of post burn-in posterior samples for the age, pclass=3rd, and sex=male coefficients; one panel per parameter, one trace per chain (10 imputations × 4 chains); x-axis: Post Burn-in Iteration, y-axis: Parameter Value]

ˆ Difficult to see, but there are 40 traces (10 imputations × 4
  chains)

ˆ Diagnostics look good; posterior samples can be trusted

ˆ Plot posterior densities for select parameters



ˆ Also shows the 10 densities before stacking


plot ( bt , c ( ’ sex = male ’ , ’ pclass =3 rd ’ , ’ age ’) , nrow =2)

[Figure: posterior densities for the sex=male, pclass=3rd, and age coefficients, showing the 10 per-imputation densities, the stacked density with its mean and median, and the 0.95 HPDI]

ˆ Plot partial effect plots with 0.95 highest posterior density
  intervals
p ← Predict(bt, age, sex, pclass, sibsp=0, fun=plogis, funint=FALSE)
ggplot ( p )

[Figure: predicted probability of surviving vs. age (years), grouped by sex, one panel per passenger class (1st, 2nd, 3rd), with 0.95 HPD bands]

ˆ Compute approximate measure of explained outcome variation
  for predictors
plot ( anova ( bt ) )

[Figure: relative explained variation with 0.95 HPD intervals; predictors ordered sex, pclass, age, sex × pclass, sibsp, age × sibsp, sex × age, pclass × age]

ˆ Contrast second class males and females, both at 5 years


and 30 years of age, all other things being equal

ˆ Compute 0.95 HPD interval for the contrast and a joint


uncertainty region

ˆ Compute P(both contrasts < 0), both < −2, and P(either
one < 0)
k ← contrast ( bt , list ( sex = ’ male ’ , age = c (5 , 30) , pclass = ’2 nd ’) ,
list ( sex = ’ female ’ , age = c (5 , 30) , pclass = ’2 nd ’) ,
cnames = c ( ’ age 5 M-F ’ , ’ age 30 M-F ’) )
k

age Contrast S.E. Lower Upper Pr ( Contrast >0)


1 age 5 M - F 5 -2.7761 0.7820 -4.3300 -1.2508 4e -04
2 age 30 M - F 30 -4.1501 0.5168 -5.1807 -3.1610 0 e +00

Intervals are 0.95 highest posterior density intervals


Contrast is the posterior mean

plot ( k )

[Figure: posterior densities of the two contrasts (age 5 M−F and age 30 M−F) with 0.95 HPDI, mean, and median]

plot(k, bivar=TRUE)                        # assumes an ellipse
plot(k, bivar=TRUE, bivarmethod='kernel')  # doesn't
P ← PostF(k, pr=TRUE)

Contrast names : age 5 M -F , age 30 M - F

P ( ‘ age 5 M-F ‘ < 0 & ‘ age 30 M-F ‘ < 0) # note backticks

[1] 0.99962

P ( ‘ age 5 M-F ‘ < -2 & ‘ age 30 M-F ‘ < -2 )

[1] 0.8414

P ( ‘ age 5 M-F ‘ < 0 | ‘ age 30 M-F ‘ < 0)

[1] 1

[Figure: joint posterior density of the two contrasts (age 5 M−F vs. age 30 M−F), as elliptical and kernel-based probability contours from 0.01 to 0.95; Spearman ρ = 0.49]

R Software Used
Package Purpose Functions
Hmisc Miscellaneous functions summary,plsmo,naclus,llist,latex
summarize,Dotplot,describe
Hmisc Imputation transcan,impute,fit.mult.impute,aregImpute,stackMI
rms Modeling datadist,lrm,blrm,rcs
Model presentation plot,summary,nomogram,Function,anova
Estimation Predict,summary,contrast
Model validation validate,calibrate
Misc. Bayesian stanDx,stanDxplot,plot
rparta Recursive partitioning rpart
a Written by Atkinson & Therneau
Chapter 10

Ordinal Logistic Regression

10.1

Background

ˆ Levels of Y are ordered; no spacing assumed

ˆ If no model assumed, one can still assess association between


X and Y

ˆ Example: Y = 0, 1, 2 corresponds to no event, heart attack,


death. Test of association between race (3 levels) and out-
come (3 levels) can be obtained from a 2 × 2 d.f. χ2 test
for a contingency table

ˆ If willing to assume an ordering of Y and a model, can
  test for association using 2 × 1 d.f.

ˆ Proportional odds model: generalization of Wilcoxon–Mann–
  Whitney–Kruskal–Wallis–Spearman (see the sketch after this
  list)


ˆ Can have n categories for n observations!

ˆ Continuation ratio model: discrete proportional hazards model
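A simulated sketch of the Wilcoxon connection (hypothetical data; not from the text): with a single two-group predictor, the PO model's likelihood ratio test closely agrees with the Wilcoxon–Mann–Whitney p-value, and the model uses one intercept per distinct value of Y less one.

require(rms)
set.seed(3)
group ← factor(rep(c('a','b'), each=60))
y     ← round(rnorm(120, mean=as.numeric(group)), 1)  # ordinal Y with ties
f     ← lrm(y ∼ group)                 # PO model, many intercepts
f$stats[c('Model L.R.', 'P', 'C')]     # LR test and concordance
wilcox.test(y ∼ group)                 # p-value in close agreement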



10.2

Ordinality Assumption

ˆ Assume X is linearly related to some appropriate log odds

ˆ Estimate mean X|Y with and without assuming the model
  holds

ˆ For simplicity assume X discrete

ˆ Let Pjx = Pr(Y = j|X = x, model)

      Pr(X = x|Y = j) = Pr(Y = j|X = x) Pr(X = x) / Pr(Y = j)

      E(X|Y = j) = Σx x Pjx Pr(X = x) / Pr(Y = j),

and the expectation can be estimated by

      Ê(X|Y = j) = Σx x P̂jx fx / gj,

where P̂jx = estimate of Pjx from the 1-predictor model, fx =
frequency of X = x, and gj = frequency of Y = j, so that in
terms of the n individual observations

      Ê(X|Y = j) = Σi xi P̂jxi / gj,   i = 1, . . . , n.
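A hand computation of Ê(X|Y = j) on simulated data, mirroring what the rms plot.xmean.ordinaly function (Section 10.3.9) automates; all names below are illustrative:

require(rms)
set.seed(4)
x ← rnorm(200)
y ← as.integer(cut2(x + rnorm(200), g=3))   # ordinal Y = 1, 2, 3
f ← lrm(y ∼ x)
P ← predict(f, type='fitted.ind')   # n x 3 matrix of Prob(Y = j | x)
g ← table(y)                        # frequencies g_j
Ehat ← sapply(1:3, function(j) sum(x * P[, j]) / g[j])
rbind(model=Ehat, stratified=tapply(x, y, mean))  # compare with/without model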

10.3

Proportional Odds Model

10.3.1

Model

ˆ Walker & Duncan [207] — most popular ordinal response
  model

ˆ For convenience Y = 0, 1, 2, . . . , k

      Pr[Y ≥ j|X] = 1 / (1 + exp[−(αj + Xβ)]),

  where j = 1, 2, . . . , k.

ˆ αj is the logit of Prob[Y ≥ j] when all Xs are zero

ˆ Odds[Y ≥ j|X] = exp(αj + Xβ)

ˆ Odds[Y ≥ j|Xm = a + 1] / Odds[Y ≥ j|Xm = a] = eβm

ˆ Same odds ratio eβm for any j = 1, 2, . . . , k


ˆ Odds[Y ≥ j|X] / Odds[Y ≥ v|X] = e^(αj + Xβ) / e^(αv + Xβ) = e^(αj − αv)

ˆ Odds[Y ≥ j|X] = constant× Odds[Y ≥ v|X]

ˆ Assumes OR for 1 unit increase in age is the same when


considering the probability of death as when considering the

probability of death or heart attack

ˆ PO model only uses ranks of Y; same β̂s if Y is transformed;
  robust to outliers. A numerical check of the constant odds
  ratio property is sketched below.
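A quick numerical check of the constant odds ratio property, with arbitrary intercepts and slope:

alpha ← c(1.2, 0, -1.5)    # intercepts for Y >= 1, 2, 3
beta  ← 0.7
odds  ← function(x) exp(alpha + x * beta)   # Odds[Y >= j | x], j = 1,2,3
odds(x=1) / odds(x=0)      # all three equal exp(0.7), regardless of alpha_j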
10.3.2

Assumptions and Interpretation of Parameters

10.3.3

Estimation

10.3.4

Residuals

ˆ Construct binary events Y ≥ j, j = 1, 2, . . . , k and use
  corresponding predicted probabilities

      P̂ij = 1 / (1 + exp[−(α̂j + Xi β̂)]),

ˆ Score residual for subject i predictor m:


Uim = Xim([Yi ≥ j] − P̂ij ),

ˆ For each column of U plot mean Ū·m and C.L. against Y

ˆ Partial residuals are more useful as they can also estimate


covariable transformations [116, 42]:

      rim = β̂m Xim + (Yi − P̂i) / (P̂i (1 − P̂i)),

  where

      P̂i = 1 / (1 + exp[−(α̂ + Xi β̂)]).

ˆ Smooth rim vs. Xim to estimate how Xm relates to the log
  relative odds that Y = 1|Xm

ˆ For ordinal Y compute binary model partial residuals for all
  cutoffs j (a computational sketch follows below):

      rim = β̂m Xim + ([Yi ≥ j] − P̂ij) / (P̂ij (1 − P̂ij))

Li and Shepherd[124] have a residual for ordinal models that


serves for the entire range of Y without the need to consider
cutoffs. Their residual is useful for checking functional form of
predictors but not the proportional odds assumption.
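A sketch of these residual checks using rms on simulated data; for residuals the fit must be run with x=TRUE, y=TRUE:

require(rms)
set.seed(5)
x1 ← rnorm(300); x2 ← rnorm(300)
y  ← as.integer(cut2(x1 + .5*x2 + rnorm(300), g=4))   # Y = 1..4
f  ← lrm(y ∼ x1 + x2, x=TRUE, y=TRUE)
resid(f, 'partial', pl=TRUE)        # smoothed partial residuals, all cutoffs;
                                    # parallel curves support PO
resid(f, 'score.binary', pl=TRUE)   # mean binary score residuals by cutoff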

10.3.5

Assessment of Model Fit



ˆ Section 10.2

ˆ Stratified proportions Y ≥ j, j = 1, 2, . . . , k, since logit(Y ≥


j|X) − logit(Y ≥ i|X) = αj − αi, for any constant X
require(Hmisc)
getHdata(support)
sfdm ← as.integer(support$sfdm2) - 1
sf ← function(y)
  c('Y>=1'=qlogis(mean(y >= 1)), 'Y>=2'=qlogis(mean(y >= 2)),
    'Y>=3'=qlogis(mean(y >= 3)))
s ← summary(sfdm ∼ adlsc + sex + age + meanbp, fun=sf, data=support)
plot(s, which=1:3, pch=1:3, xlab='logit', vnames='names', main='',
     width.factor=1.5)

[Figure: dot chart of logits of stratified proportions Y ≥ 1, 2, 3 within groups of adlsc, sex, age, and meanbp; overall N=841, N missing=159]
Figure 10.1: Checking PO assumption separately for a series of predictors. The circle, triangle, and plus sign correspond to Y ≥ 1, 2, 3, respectively. PO is checked by examining the vertical constancy of distances between any two of these three symbols. Response variable is the severe functional disability scale sfdm2 from the 1000-patient SUPPORT dataset, with the last two categories combined because of low frequency of coma/intubation.

Note that computing ORs for various cutoffs and seeing dis-
agreements among them can cause reviewers to confuse lack
of fit with sampling variation (random chance). For a 4-level Y
having a given vector of probabilities in a control group, let’s
assume PO with a true OR of 3 and simulate 10 experiments
to show variation of observed ORs over all cutoffs of Y. First
do it for a sample size of n=10,000 then for n=200.
# Until a new Hmisc is on CRAN
# source('https://raw.githubusercontent.com/harrelfe/Hmisc/master/R/popower.s')
p ← c(.1, .2, .3, .4)
set.seed(7)
simPOcuts(10000, odds.ratio=3, p=p)

y >=2 y >=3 y >=4


Simulation 1 2.822446 2.996466 2.868826
Simulation 2 2.869895 2.990059 3.046847
Simulation 3 3.116256 2.883418 3.259568
Simulation 4 3.124129 3.169854 3.088819
Simulation 5 2.918214 3.075153 3.019572
Simulation 6 2.927433 2.990027 2.818097
Simulation 7 3.336213 3.221263 3.006214
Simulation 8 2.772421 3.110541 3.075125
Simulation 9 3.083166 3.226093 2.958005
Simulation 10 3.666162 3.248635 2.995330

simPOcuts ( 200 , odds.ratio =3 , p = p )

y >=2 y >=3 y >=4


Simulation 1 1.879121 2.203704 2.793400
Simulation 2 2.666667 2.666667 2.720430
Simulation 3 Inf 5.664179 3.304527
Simulation 4 4.260870 2.068376 3.672414
Simulation 5 3.272727 3.006689 4.515625
Simulation 6 5.705882 8.327586 3.618243
Simulation 7 1.642055 2.545894 2.071429
Simulation 8 9.791209 2.047619 3.073733
Simulation 9 2.811594 2.041667 2.666667
Simulation 10 2.966292 3.328767 2.470588

A better approach for discrete Y is to show the impact of making
the PO assumption:

ˆ Select a set of covariate settings over which to evaluate


accuracy of predictions

ˆ Vary at least one of the predictors, i.e., the one for which
you want to assess the impact of the PO assumption

ˆ Fit a PO model the usual way

ˆ Fit other models that relax the PO assumption


– to relax the PO assumption for all predictors fit a multi-
nomial logistic model

– to relax the PO assumption for a subset of predictors


fit a partial PO model[153] (here using the R VGAM vglm
function)

ˆ For all the covariate combinations evaluate predicted proba-


bilities for all levels of Y using the PO model and the relaxed
assumption models

ˆ Use the bootstrap to compute confidence intervals for the
  difference in predicted probabilities between a PO and a
  relaxed model. This guards against over-emphasis of dif-
  ferences when the sample size does not support estima-
  tion, especially for the relaxed model with more parameters.
  Note that the same problem occurs when comparing pre-
  dicted unadjusted probabilities to observed proportions, as
  observed proportions can be noisy.

Example: re-do the assessment above


require ( rms )

# Need source() until impactPO is in an rms update on CRAN
# source('https://raw.githubusercontent.com/harrelfe/rms/master/R/impactPO.r')
# One headache: since using a non-rms fitting function need to hard
# code knots in splines
kq   ← seq(0.05, 0.95, length=4)
kage ← quantile(support$age,    kq, na.rm=TRUE)
kbp  ← quantile(support$meanbp, kq, na.rm=TRUE)

d ← expand.grid ( adlsc =0:6 , sex = ’ male ’ , age =65 , meanbp =78)

# Because of very low frequency (7) of sfdm=3, combine categories 3, 4
support$sfdm3 ← pmin(sfdm, 3)

done.impact ← FALSE
if(done.impact) w ← readRDS('impactPO.rds') else {
  set.seed(1)
  w ← impactPO(sfdm3 ∼ pol(adlsc, 2) + sex + rcs(age, kage) +
                 rcs(meanbp, kbp), nonpo = ∼ pol(adlsc, 2),
               newdata=d, B=300, data=support)
  saveRDS(w, 'impactPO.rds')
}

PO PPO Multinomial
Deviance 1871.70 1824.79 1795.93
d.f. 12 16 30
AIC 1895.70 1856.79 1855.93
p 9 13 27
LR chi ∧ 2 124.11 171.02 199.89
LR - p 115.11 158.02 172.89
LR chi ∧ 2 test for PO 46.91 75.77
d.f. 4 18
Pr ( > chi ∧ 2) <0.0001 <0.0001
MCS R2 0.137 0.184 0.212
MCS R2 adj 0.128 0.171 0.186
McFadden R2 0.062 0.086 0.100
McFadden R2 adj 0.053 0.073 0.073
Mean | difference | from PO 0.038 0.036

Covariate combination - specific mean | difference | in predicted probabilities

method adlsc sex age meanbp Mean | difference |


1 PPO 0 male 65 78 0.013
2 PPO 1 male 65 78 0.027
3 PPO 2 male 65 78 0.031
4 PPO 3 male 65 78 0.018
5 PPO 4 male 65 78 0.023
6 PPO 5 male 65 78 0.051
7 PPO 6 male 65 78 0.101
11 Multinomial 0 male 65 78 0.020
21 Multinomial 1 male 65 78 0.030
31 Multinomial 2 male 65 78 0.035
41 Multinomial 3 male 65 78 0.030
51 Multinomial 4 male 65 78 0.021
61 Multinomial 5 male 65 78 0.032
71 Multinomial 6 male 65 78 0.081

Bootstrap 0.95 confidence intervals for differences in model predicted


probabilities based on 253 bootstraps

adlsc sex age meanbp


1 0 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower -0.020 -0.013 -0.001 -0.034
Upper 0.007 0.036 0.030 -0.001

PO - Multinomial probability estimates

0 1 2 3
Lower -0.051 -0.038 -0.010 -0.041
Upper 0.031 0.061 0.035 -0.004

adlsc sex age meanbp


2 1 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower -0.038 0.033 -0.015 -0.027
Upper -0.014 0.074 -0.004 -0.008

PO - Multinomial probability estimates

0 1 2 3
Lower -0.083 0.018 -0.027 -0.033
Upper -0.009 0.091 0.025 0.003

adlsc sex age meanbp


3 2 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower -0.045 0.037 -0.043 -0.024
Upper -0.006 0.090 -0.013 0.001

PO - Multinomial probability estimates

0 1 2 3
Lower -0.085 0.028 -0.061 -0.032
Upper -0.009 0.100 0.018 0.014

adlsc sex age meanbp


4 3 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower -0.025 0.009 -0.055 -0.016
Upper 0.017 0.064 -0.013 0.015

PO - Multinomial probability estimates

0 1 2 3
Lower -0.060 0.012 -0.079 -0.035
Upper 0.014 0.093 0.018 0.020

adlsc sex age meanbp


5 4 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower 0.006 -0.052 -0.049 -0.005
Upper 0.057 0.015 -0.003 0.028

PO - Multinomial probability estimates



0 1 2 3
Lower -0.023 -0.025 -0.070 -0.055
Upper 0.046 0.074 0.024 0.017

adlsc sex age meanbp


6 5 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower 0.044 -0.144 -0.025 0.009
Upper 0.112 -0.056 0.016 0.045

PO - Multinomial probability estimates

0 1 2 3
Lower 0.014 -0.099 -0.049 -0.078
Upper 0.110 0.030 0.042 0.018

adlsc sex age meanbp


7 6 male 65 78

PO - PPO probability estimates

0 1 2 3
Lower 0.081 -0.275 0.002 0.019
Upper 0.184 -0.127 0.046 0.070

PO - Multinomial probability estimates

0 1 2 3
Lower 0.068 -0.263 -0.018 -0.074
Upper 0.189 -0.042 0.065 0.054

# Reverse levels of y so stacked bars have higher y located higher


revo ← function ( z ) {
z ← as.factor ( z )
factor (z , levels = rev ( levels ( as.factor ( z ) ) ) )
}
ggplot ( w $ estimates , aes ( x = method , y = Probability , fill = revo ( y ) ) ) +
facet_wrap (∼ adlsc ) + geom_col () +
xlab ( ’ ’) + guides ( fill = guide_legend ( title = ’ ’) ) +
theme ( legend.position = ’ bottom ’)

AIC indicates that a model assuming PO nowhere is better


than one that assumes PO everywhere. The PPO model with
far fewer parameters is just as good, and is also better than the
PO model, indicating non-PO with respect to adlsc. The fit of
the PO model is such that cell probabilities become more inac-
curate for higher level outcomes. This can also be seen by the

[Figure: stacked bar charts of predicted probabilities for outcome levels 0–3 under the PO, PPO, and multinomial fits, one panel per adlsc value 0–6]
Figure 10.2: Checking the impact of the PO assumption by comparing predicted probabilities of all outcome categories from a PO model with a multinomial logistic model that assumes PO for no variables and a partial proportional odds model that does not assume PO for a key variable of interest

increasing mean absolute differences with probability estimates


from the PO model. Bootstrap nonparametric percentile con-
fidence intervals (300 resamples, not all of which converged)
for differences in predicted cell probabilities between the PO
model and a relaxed model are also found above. Some of
these intervals exclude 0, in line with the other evidence for
non-PO.
See fharrell.com/post/impactpo for a similar example but with
proportional odds clearly violated.
When Y is continuous or almost continuous and X is discrete,
the PO model assumes that the logit of the cumulative distri-
bution function of Y is parallel across categories of X. The
corresponding, more rigid, assumptions of the ordinary linear
model (here, parametric ANOVA) are parallelism and linearity
of the normal inverse cumulative distribution function across
categories of X. As an example consider the web site’s dia-
betes dataset, where we consider the distribution of log glyco-
hemoglobin across subjects’ body frames.
getHdata(diabetes)
a ← Ecdf(∼ log(glyhb), group=frame, fun=qnorm, xlab='log(HbA1c)',
         label.curves=FALSE, data=diabetes,
         ylab=expression(paste(Phi^-1, (F[n](x)))))   # Figure 10.3
b ← Ecdf(∼ log(glyhb), group=frame, fun=qlogis, xlab='log(HbA1c)',
         label.curves=list(keys='lines'), data=diabetes,
         ylab=expression(logit(F[n](x))))
print(a, more=TRUE, split=c(1,1,2,1))
print(b, split=c(2,1,2,1))

[Figure: Φ⁻¹- and logit-transformed ECDFs of log(HbA1c) by body frame (small, medium, large)]
Figure 10.3: Transformed empirical cumulative distribution functions stratified by body frame in the diabetes dataset. Left panel: checking all assumptions of the parametric ANOVA. Right panel: checking all assumptions of the PO model (here, Kruskal–Wallis test).

10.3.6

Quantifying Predictive Ability

10.3.7

Describing the Model

For PO models there are four and sometimes five types of relevant
predictions (an rms sketch follows the list):

1. logit[Y ≥ j|X], i.e., the linear predictor


2. Prob[Y ≥ j|X]
3. Prob[Y = j|X]
4. Quantiles of Y |X (e.g., the mediana)
5. E(Y |X) if Y is interval scaled.
a If Y does not have very many levels, the median will be a discontinuous function of X and may not be satisfactory.
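A minimal sketch of these predictions with rms on a small simulated fit (orm is used so that quantiles are available; all data and values below are illustrative):

require(rms)
set.seed(6)
x ← runif(150)
y ← round(5*x + rnorm(150))                          # ordinal Y, ~10 levels
f ← orm(y ∼ x)
lp ← predict(f, data.frame(x=c(.2, .8)))             # 1. linear predictor
predict(f, data.frame(x=.5), type='fitted')          # 2. Prob[Y >= j | X]
predict(f, data.frame(x=.5), type='fitted.ind')      # 3. Prob[Y  = j | X]
qu ← Quantile(f); qu(.5, lp)                         # 4. median of Y | X
M  ← Mean(f);     M(lp)                              # 5. E(Y | X)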

Graphics:

1. Partial effect plot (prob. scale or mean)


2. Odds ratio chart
3. Nomogram (possibly including the mean)

10.3.8

Validating the Fitted Model

10.3.9

R Functions

The rms package’s lrm and orm functions fit the PO model di-
rectly, assuming that the levels of the response variable (e.g.,
the levels of a factor variable) are listed in the proper order.
predict computes all types of estimates except for quantiles.
orm allows for more link functions than the logistic and is in-
tended to efficiently handle hundreds of intercepts as happens
when Y is continuous.
The R functions popower and posamsize (in the Hmisc package)
compute power and sample size estimates for ordinal responses
using the proportional odds model.
The function plot.xmean.ordinaly in rms computes and graphs
the quantities described in Section 10.2. It plots simple Y -
stratified means overlaid with Ê(X|Y = j), with j on the
x-axis. The Ês are computed for both PO and continuation

ratio ordinal logistic models.


The Hmisc package’s summary.formula function is also useful for
assessing the PO assumption.
Generic rms functions such as validate, calibrate, and nomo-
gram work with PO model fits from lrm as long as the analyst
specifies which intercept(s) to use.
rms has a special function generator Mean for constructing an
easy-to-use function for getting the predicted mean Y from a
PO model. This is handy with plot and nomogram. If the fit
has been run through bootcov, it is easy to use the Predict
function to estimate bootstrap confidence limits for predicted
means.
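A sketch of the Mean generator in use, on simulated data (datadist setup as rms requires; names are illustrative):

require(rms)
set.seed(7)
x1 ← rnorm(200); x2 ← rnorm(200)
y  ← as.integer(cut2(x1 + x2 + rnorm(200), g=5))
dd ← datadist(x1, x2); options(datadist='dd')
f  ← lrm(y ∼ x1 + x2)
M  ← Mean(f)                        # function mapping linear predictor to E(Y)
plot(nomogram(f, fun=M, funlabel='Predicted Mean Y'))
ggplot(Predict(f, x1, fun=M))       # partial effect on the mean scale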

10.4

Continuation Ratio Model

10.4.1

Model

Unlike the PO model, which is based on cumulative probabilities,
the continuation ratio (CR) model is based on conditional
probabilities. The (forward) CR model [70, 6, 20] is stated as
follows for Y = 0, . . . , k:

      Pr(Y = j|Y ≥ j, X) = 1 / (1 + exp[−(θj + Xγ)])

      logit(Y = 0|Y ≥ 0, X) = logit(Y = 0|X) = θ0 + Xγ
      logit(Y = 1|Y ≥ 1, X) = θ1 + Xγ
      ...
      logit(Y = k − 1|Y ≥ k − 1, X) = θk−1 + Xγ.

The CR model has been said to be likely to fit ordinal responses
when subjects have to “pass through” one category to get to
the next. The CR model is a discrete version of the Cox propor-
tional hazards model. The discrete hazard function is defined
as Pr(Y = j|Y ≥ j).
Advantage of CR model: easy to allow unequal slopes across
Y for selected X.
10.4.2

Assumptions and Interpretation of Parameters

10.4.3

Estimation

10.4.4

Residuals

To check CR model assumptions, binary logistic model partial


residuals are again valuable. We separately fit a sequence of
binary logistic models using a series of binary events and the
corresponding applicable (increasingly small) subsets of sub-
jects, and plot smoothed partial residuals against X for all of
the binary events. Parallelism in these plots indicates that the
CR model’s constant γ assumptions are satisfied.

10.4.5

Assessment of Model Fit

10.4.6

Extended CR Model

10.4.7

Role of Penalization in Extended CR Model

10.4.8

Validating the Fitted Model

10.4.9

R Functions

The cr.setup function in rms returns a list of vectors useful in


constructing a dataset used to trick a binary logistic function
such as lrm into fitting CR models.
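A sketch of that trick on simulated data, using the components of the cr.setup result (y, cohort, subs) as documented:

require(rms)
set.seed(8)
x ← rnorm(200)
y ← as.integer(cut2(x + rnorm(200), g=3)) - 1   # Y = 0, 1, 2
u ← cr.setup(y)              # expanded records, one per conditional event
Y      ← u$y                 # binary: Y = j given Y >= j
cohort ← u$cohort            # which conditioning event each record is for
X      ← x[u$subs]           # predictor repeated for the applicable subsets
f ← lrm(Y ∼ cohort + X)      # constant-gamma (forward) CR model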
Chapter 11

Regression Models for Continuous Y and


Case Study in Ordinal Regression

This chapter concerns univariate continuous Y. There are
many multivariable models for predicting such response
variables.

ˆ linear models with assumed normal residuals, fitted with or-


dinary least squares

ˆ generalized linear models and other parametric models based


on special distributions such as the gamma

ˆ generalized additive models (GAMs)

ˆ generalization of GAMs to also nonparametrically transform


Y

ˆ quantile regression (see Section 11.3)

ˆ other robust regression models that, like quantile regression,
  use an objective different from minimizing the sum of
  squared errors [198]

ˆ semiparametric models based on the ranks of Y , such as the


Cox proportional hazards model and the proportional odds
ordinal logistic model

ˆ cumulative probability models (often called cumulative link


models) which are semiparametric models from a wider class
of families than the logistic

Semiparametric models that treat Y as ordinal but not interval-
scaled have many advantages including robustness and freedom
from distributional assumptions for Y conditional on any given
set of predictors.
Advantages are demonstrated in a case study of a cumulative
probability ordinal model. Some of the results are compared to
quantile regression and OLS. Many of the methods used in the
case study also apply to ordinary linear models.

11.1

Dataset and Descriptive Statistics



ˆ Diabetes Mellitus (DM) type II (adult onset diabetes) is


strongly associated with obesity

ˆ Primary laboratory test for diabetes: glycosylated hemoglobin
  (HbA1c), also called glycated hemoglobin, glycohemoglobin, or
  hemoglobin A1c

ˆ HbA1c reflects average blood glucose for the preceding 60


to 90 days

ˆ HbA1c > 7.0 usually taken as a positive diagnosis of diabetes

ˆ Goal of analysis:
– better understand effects of body size measurements on
risk of DM

– enhance screening for DM

ˆ Best way to develop a model for DM screening is not to
  fit a binary logistic model with HbA1c > 7 as the response
  variable
  – All cutpoints are arbitrary; no justification for any putative cut

  – Dichotomization treats an HbA1c of 2 the same as 6.9, and 7.1 the same as 10

– Larger standard errors of β̂, lower power, wider confidence


bands

– Better: predict continuous HbA1c using continuous re-


sponse model, then convert to probability HbA1c exceeds
any cutoff or estimate 0.9 quantile of HbA1c

ˆ Data: U.S. National Health and Nutrition Examination
  Survey (NHANES) from the National Center for Health
  Statistics/CDC: http://www.cdc.gov/nchs/nhanes.htm [34]

ˆ age ≥ 80 coded as 80 by CDC

ˆ Subset with age ≥ 21, neither diagnosed nor treated for DM


require(rms)
options(prType='latex')   # for print, summary, anova
getHdata(nhgh)
w ← subset(nhgh, age >= 21 & dx == 0 & tx == 0, select=-c(dx, tx))
latex(describe(w), file='')

w
18 Variables 4629 Observations
seqn : Respondent sequence number
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 4629 1 56902 3501 52136 52633 54284 56930 59495 61079 61641

lowest : 51624 51629 51630 51645 51647, highest: 62152 62153 62155 62157 62158
sex
n missing distinct
4629 0 2

Value male female


Frequency 2259 2370
Proportion 0.488 0.512
age : Age [years]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 703 1 48.57 19.85 23.33 26.08 33.92 46.83 61.83 74.83 80.00

lowest : 21.00000 21.08333 21.16667 21.25000 21.33333, highest: 79.66667 79.75000 79.83333 79.91667 80.00000

re : Race/Ethnicity
n missing distinct
4629 0 5

Values and proportions: Mexican American (832, 0.180), Other Hispanic (474, 0.102),
Non-Hispanic White (2318, 0.501), Non-Hispanic Black (756, 0.163),
Other Race Including Multi-Racial (249, 0.054)
income : Family Income
n missing distinct
4389 240 14

lowest : [0,5000) [5000,10000) [10000,15000) [15000,20000) [20000,25000)


highest: [65000,75000) > 20000 < 20000 [75000,100000) >= 100000
[0,5000) (162, 0.037), [5000,10000) (216, 0.049), [10000,15000) (371, 0.085), [15000,20000)
(300, 0.068), [20000,25000) (374, 0.085), [25000,35000) (535, 0.122), [35000,45000) (421,
0.096), [45000,55000) (346, 0.079), [55000,65000) (257, 0.059), [65000,75000) (188, 0.043), >
20000 (149, 0.034), < 20000 (52, 0.012), [75000,100000) (399, 0.091), >= 100000 (619, 0.141)
wt : Weight [kg]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 890 1 80.49 22.34 52.44 57.18 66.10 77.70 91.40 106.52 118.00

lowest : 33.2 36.1 37.9 38.5 38.7, highest: 184.3 186.9 195.3 196.6 203.0

ht : Standing Height [cm]


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 512 1 167.5 11.71 151.1 154.4 160.1 167.2 175.0 181.0 184.8

lowest : 123.3 135.4 137.5 139.4 139.8, highest: 199.2 199.3 199.6 201.7 202.7
bmi : Body Mass Index [kg/m2 ]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 1994 1 28.59 6.965 20.02 21.35 24.12 27.60 31.88 36.75 40.68

lowest : 13.18 14.59 15.02 15.40 15.49, highest: 61.20 62.81 65.62 71.30 84.87
leg : Upper Leg Length [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4474 155 216 1 38.39 4.301 32.0 33.5 36.0 38.4 41.0 43.3 44.6

lowest : 20.4 24.9 25.0 25.1 26.4, highest: 49.0 49.5 49.8 50.0 50.3
arml : Upper Arm Length [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4502 127 156 1 37.01 3.116 32.6 33.5 35.0 37.0 39.0 40.6 41.7

lowest : 24.8 27.0 27.5 29.2 29.5, highest: 45.2 45.5 45.6 46.0 47.0
armc : Arm Circumference [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4499 130 290 1 32.87 5.475 25.4 26.9 29.5 32.5 35.8 39.1 41.4

lowest : 17.9 19.0 19.3 19.5 19.9, highest: 54.2 54.9 55.3 56.0 61.0
waist : Waist Circumference [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4465 164 716 1 97.62 17.18 74.8 78.6 86.9 96.3 107.0 117.8 125.0

lowest : 59.7 60.0 61.5 62.0 62.4, highest: 160.0 160.6 162.2 162.7 168.7
tri : Triceps Skinfold [mm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4295 334 342 1 18.94 9.463 7.2 8.8 12.0 18.0 25.2 31.0 33.8

lowest : 2.6 3.1 3.2 3.3 3.4, highest: 39.6 39.8 40.0 40.2 40.6
sub : Subscapular Skinfold [mm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
3974 655 329 1 20.8 9.124 8.60 10.30 14.40 20.30 26.58 32.00 35.00

lowest : 3.8 4.2 4.6 4.8 4.9, highest: 40.0 40.1 40.2 40.3 40.4

gh : Glycohemoglobin [%]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 63 0.994 5.533 0.5411 4.8 5.0 5.2 5.5 5.8 6.0 6.3

lowest : 4.0 4.1 4.2 4.3 4.4, highest: 11.9 12.0 12.1 12.3 14.5
albumin : Albumin [g/dL]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4576 53 26 0.99 4.261 0.3528 3.7 3.9 4.1 4.3 4.5 4.7 4.8

lowest : 2.6 2.7 3.0 3.1 3.2, highest: 4.9 5.0 5.1 5.2 5.3
bun : Blood urea nitrogen [mg/dL]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4576 53 50 0.995 13.03 5.309 7 8 10 12 15 19 22

lowest : 1 2 3 4 5, highest: 49 53 55 56 63
SCr : Creatinine [mg/dL]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4576 53 167 1 0.8887 0.2697 0.58 0.62 0.72 0.84 0.99 1.14 1.25

lowest : 0.34 0.38 0.39 0.40 0.41, highest: 5.98 6.34 9.13 10.98 15.66

dd ← datadist ( w ) ; options ( datadist = ’ dd ’)



11.2

The Linear Model

The most popular multivariable model for analyzing a univariate
continuous Y is the linear model

      E(Y|X) = Xβ,

where β is estimated using ordinary least squares, that is, by
solving for β̂ to minimize Σ(Yi − Xi β̂)².

ˆ To compute P -values and confidence limits using parametric


methods (and for least squares estimates to coincide with
maximum likelihood estimates) we would have to assume
that Y |X is normal with mean Xβ and constant variance
σ2 a
11.2.1

Checking Assumptions of OLS and Other Models



ˆ First see if gh would make a Gaussian residuals model fit

ˆ Use ordinary regression on 4 key variables to collapse into


one variable (predicted mean from OLS model)

ˆ Stratify predicted mean into 6 quantile groups

ˆ Apply the normal inverse ECDF of gh to these strata and


a The latter assumption may be dispensed with if we use a robust Huber–White or bootstrap covariance matrix estimate. Normality may

sometimes be dispensed with by using bootstrap confidence intervals, but this would not fix inefficiency problems with OLS when residuals are
non-normal.

check for normality and constant σ 2

ˆ ECDF is for Prob[Y ≤ y|X] but for ordinal modeling we


want to state models in terms of Prob[Y ≥ y|X] so take 1
- ECDF before inverse transforming
f ← ols(gh ∼ rcs(age,5) + sex + re + rcs(bmi,3), data=w)
pgh ← fitted(f)

p ← function(fun, row, col) {
  f ← substitute(fun); g ← function(F) eval(f)
  z ← Ecdf(∼ gh, groups=cut2(pgh, g=6),
           fun=function(F) g(1 - F),
           ylab=as.expression(f), xlim=c(4.5, 7.75), data=w,
           label.curve=FALSE)
  print(z, split=c(col, row, 2, 2), more=row < 2 | col < 2)
}
p(log(F/(1 - F)), 1, 1)
p(qnorm(F),       1, 2)
p(-log(-log(F)),  2, 1)
p(log(-log(1 - F)), 2, 2)
# Get slopes of pgh for some cutoffs of Y
# Use glm complementary log-log link on Prob(Y < cutoff) to
# get log-log link on Prob(Y >= cutoff)
r ← NULL
for(link in c('logit', 'probit', 'cloglog'))
  for(k in c(5, 5.5, 6)) {
    co ← coef(glm(gh < k ∼ pgh, data=w, family=binomial(link)))
    r ← rbind(r, data.frame(link=link, cutoff=k,
                            slope=round(co[2], 2)))
  }
print(r, row.names=FALSE)

link cutoff slope


logit 5.0 -3.39
logit 5.5 -4.33
logit 6.0 -5.62
probit 5.0 -1.69
probit 5.5 -2.61
probit 6.0 -3.07
cloglog 5.0 -3.18
cloglog 5.5 -2.97
cloglog 6.0 -2.51

ˆ Upper right curves are not linear, implying that a normal
  conditional distribution cannot work for ghᵇ

b They are not parallel either.
[Figure: four inverse-link-transformed ECDFs of glycohemoglobin (%) by sextiles of the OLS predicted mean: logit, probit, −log(−log), and complementary log−log]
Figure 11.1: Examination of normality and constant variance assumption, and assumptions for various ordinal models

ˆ There is non-parallelism for the logit model

ˆ Other graphs will be used to guide selection of an ordinal


model below

11.3

Quantile Regression

ˆ Ruled out OLS and semiparametric proportional odds model

ˆ Quantile regression [112, 111] is a different approach to mod-


eling Y

ˆ No distributional assumptions other than continuity of Y

ˆ All the usual right hand side assumptions

ˆ When there is a single predictor that is categorical, quantile


regression coincides with ordinary sample quantiles stratified
by that predictor

ˆ Is transformation invariant - pre-transforming Y not impor-


tant

Let ρτ(y) = y(τ − [y < 0]). The τth sample quantile is the
minimizer q of Σi ρτ(yi − q), summing over i = 1, . . . , n. For a
conditional τth quantile of Y|X the corresponding quantile
regression estimator β̂τ minimizes Σi ρτ(Yi − Xi β).
Quantile regression is not as efficient at estimating quantiles as
is ordinary least squares at estimating the mean, if the latter's
assumptions hold.
Koenker's quantreg package in R [113] implements quantile
regression, and the rms package's Rq function provides a front-end
that gives rise to various graphics and inference tools.
If we model the median gh as a function of covariates, only
the Xβ structure need be correct. Other quantiles (e.g., 90th
percentile) can be directly modeled but standard errors will be
much larger as it is more difficult to precisely estimate outer
quantiles.
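As a sketch, assuming the nhgh subset w from Section 11.1 and the covariates used in the OLS fit above, median and 90th-percentile fits with Rq might look like:

require(rms)
f.med ← Rq(gh ∼ rcs(age,5) + sex + re + rcs(bmi,3), tau=.5, data=w)
# 90th percentile directly, at the cost of larger standard errors:
f.q90 ← Rq(gh ∼ rcs(age,5) + sex + re + rcs(bmi,3), tau=.9, data=w)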

11.4

Ordinal Regression Models for Continuous Y



ˆ Advantages of semiparametric models (e.g., quantile regression
  and cumulative probability ordinal models):

ˆ For ordinal cumulative probability models, there is no distri-


butional assumption for Y given a setting of X

ˆ Assume only a connection between distributions of Y for


different X

ˆ Applying an increasing 1–1 transformation to Y results in


no change to regression coefficient estimatesc

ˆ Regression coefficient estimates are completely robust to ex-


treme Y valuesd

ˆ Estimates of quantiles of Y are


exactly transformation-preserving, e.g., estimate of median
of log Y is exactly the log of the estimate of median Y

ˆ Manuguerra [134] devloped an ordinal model for continu-


ous Y which they incorrectly labeled semi-parametric and is
actually a lower-dimensional flexible parametric model that
instead of having intercepts has a spline function of y.
c For symmetric distributions applying a decreasing transformation will negate the coefficients. For asymmetric distributions (e.g., Gumbel), reversing the order of Y will do more than change signs.
d Only an estimate of mean Y from these β̂s is non-robust.
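A minimal sketch of the transformation-invariance property (assuming the data frame w used throughout this case study): orm slope estimates are identical whether we model gh or log(gh), because the fit depends on Y only through its ranks.

f1 ← orm ( gh ∼ rcs ( age ,4) , data = w )
f2 ← orm ( log ( gh ) ∼ rcs ( age ,4) , data = w )
b1 ← coef ( f1 ) [ - (1 : num.intercepts ( f1 ) ) ]   # drop intercepts
b2 ← coef ( f2 ) [ - (1 : num.intercepts ( f2 ) ) ]
max ( abs ( b1 - b2 ) )                               # effectively zero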

For a general continuous distribution function F(y), an ordinal regression model based on cumulative probabilities may be stated as follows^e. Let the ordered unique values of Y be denoted by y1, y2, . . . , yk and let the intercepts associated with y1, . . . , yk be α1, α2, . . . , αk, where α1 = ∞ because Prob[Y ≥ y1] = 1. Let αy = αi, i : yi = y. Then

Prob[Y ≥ yi|X] = F(αi + Xβ) = F(αyi + Xβ)

For the OLS fully parametric case, the model may be restated

Prob[Y ≥ y|X] = Prob[(Y − Xβ)/σ ≥ (y − Xβ)/σ]
             = 1 − Φ((y − Xβ)/σ) = Φ(−y/σ + Xβ/σ)

so that to within an additive constant^f, αy = −y/σ (intercepts α are linear in y whereas they are arbitrarily descending in the ordinal model), and σ is absorbed in β to put the OLS model into the new notation.
The general ordinal regression model assumes that for fixed X1, X2,

F⁻¹(Prob[Y ≥ y|X2]) − F⁻¹(Prob[Y ≥ y|X1]) = (X2 − X1)β

independent of the αs (parallelism assumption). If F = [1 + exp(−y)]⁻¹, this is the proportional odds assumption.
e It is more traditional to state the model in terms of Prob[Y ≤ y|X] but we use Prob[Y ≥ y|X] so that higher predicted values are associated with higher Y.
f α̂y are unchanged if a constant is added to all y.

Table 11.1: Distribution families used in ordinal cumulative probability models. Φ denotes the Gaussian cumulative distribution function. For the Connection column, P1 = Prob[Y ≥ y|X1], P2 = Prob[Y ≥ y|X2], ∆ = (X2 − X1)β. The connection specifies the only distributional assumption if the model is fitted semiparametrically, i.e., contains an intercept for every unique Y value less one. For parametric models, P1 must be specified absolutely instead of just requiring a relationship between P1 and P2. For example, the traditional Gaussian parametric model specifies that Prob[Y ≥ y|X] = 1 − Φ((y − Xβ)/σ) = Φ((−y + Xβ)/σ).

Distribution           F                     Inverse (Link Function)   Link Name               Connection
Logistic               [1 + exp(−y)]⁻¹       log(y/(1 − y))            logit                   P2/(1 − P2) = [P1/(1 − P1)] exp(∆)
Gaussian               Φ(y)                  Φ⁻¹(y)                    probit                  P2 = Φ(Φ⁻¹(P1) + ∆)
Gumbel maximum value   exp(−exp(−y))         log(−log(y))              log−log                 P2 = P1^exp(∆)
Gumbel minimum value   1 − exp(−exp(y))      log(−log(1 − y))          complementary log−log   1 − P2 = (1 − P1)^exp(∆)
Cauchy                 (1/π)tan⁻¹(y) + 1/2   tan[π(y − 1/2)]           cauchit

Common choices of F, implemented in the rms orm function, are shown in Table 11.1.

The Gumbel maximum value distribution is also called the extreme value type I distribution. This distribution (log−log link) also represents a continuous time proportional hazards model. The hazard ratio when X changes from X1 to X2 is exp(−(X2 − X1)β).
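A hedged sketch of this correspondence on the case-study data (fits of this form appear later in the chapter; up to sign and estimation details, log−log orm coefficients track Cox model coefficients):

f.orm ← orm ( gh ∼ log ( ht ) + log ( wt ) , family = loglog , data = w )
f.cph ← cph ( Surv ( gh ) ∼ log ( ht ) + log ( wt ) , data = w )
# slope estimates agree closely except for opposite signs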
The mean of Y|X is easily estimated by computing

∑ᵢ₌₁ᵏ yᵢ P̂rob[Y = yᵢ|X]

and the qth quantile of Y|X is y such that F⁻¹(1 − q) − Xβ̂ = α̂y.^g
The orm function in the rms package takes advantage of the information matrix being of a sparse tri-band diagonal form for the intercept parameters. This makes the computations efficient even for hundreds of intercepts (i.e., unique values of Y). orm is made to handle continuous Y.

g The intercepts have to be shifted to the left one position in solving this equation because the quantile is such that Prob[Y ≤ y] = q whereas the model is stated in terms of Prob[Y ≥ y].
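A minimal sketch of how these quantities are obtained in practice (assuming the gh data frame w; Mean and Quantile are the rms generator functions used later in this chapter):

f  ← orm ( gh ∼ rcs ( age ,5) , family = loglog , data = w )
M  ← Mean ( f )                        # returns a function of the linear predictor
qu ← Quantile ( f )
lp ← predict (f , data.frame ( age =50) )
M ( lp )                               # estimated mean gh at age 50
qu ( .5 , lp )                         # estimated median gh at age 50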
Ordinal regression has nice properties in addition to those listed above, allowing for

ˆ estimation of quantiles as efficiently as quantile regression if the parallel slopes assumptions hold

ˆ efficient estimation of mean Y

ˆ direct estimation of Prob[Y ≥ y|X] (see the sketch below)

ˆ arbitrary clumping of values of Y, while still estimating β and mean Y efficiently^h

ˆ solutions for β̂ using ordinary Newton-Raphson or other popular optimization techniques

ˆ being based on a standard likelihood function, so that penalized estimation can be straightforward

ˆ Wald, score, and likelihood ratio χ2 tests that are more powerful than tests from quantile regression
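For example, a sketch of direct exceedance probability estimation with the rms ExProb generator (fit f as in the previous sketch):

ex ← ExProb ( f )                                   # exceedance prob. generator
ex ( predict (f , data.frame ( age =50) ) , y =6.5 )   # Prob(gh ≥ 6.5 | age=50)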

To summarize how assumptions of parametric models compare to assumptions of semiparametric models, consider the ordinary linear model or its special case the equal variance two-sample t-test, vs. the probit or logit (proportional odds) ordinal model or their special cases the Van der Waerden (normal-scores) two-sample test or the Wilcoxon test. All the assumptions of the linear model other than independence of residuals are captured in the following (written in traditional Y ≤ y form):

F(y|X) = Prob[Y ≤ y|X] = Φ((y − Xβ)/σ)
Φ⁻¹(F(y|X)) = (y − Xβ)/σ

h But it is not sensible to estimate quantiles of Y when there are heavy ties in Y in the area containing the quantile.
On the other hand, ordinal models assume the following:

[Figure 11.2: Assumptions of the linear model (left panel: Φ⁻¹(F(y|X)) = (y − Xβ)/σ) and semiparametric ordinal probit or logit (proportional odds) models (right panel). Ordinal models do not assume any shape for the distribution of Y for a given X; they only assume parallelism.]

Prob[Y ≤ y|X] = F (g(y) − Xβ),


where g is unknown and may be discontinuous.
From this point we revert back to Y ≥ y notation so that Y
increases as Xβ increases.

Global Modeling Implications

ˆ Ordinal regression is invariant to the choice of transformation of Y

ˆ Y needs to be ordinal

ˆ A difference in two ordinal variables is not necessarily ordinal

ˆ → Never analyze differences in regression

ˆ Y = final value; adjust for baseline values as covariates (see the sketch below)
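A hedged sketch of the last point (all object names hypothetical): model the final measurement and covariate-adjust for baseline rather than modeling the change score.

# d, y.final, y.baseline, and treat are hypothetical names for a pre/post design
f ← orm ( y.final ∼ rcs ( y.baseline ,4) + treat , data = d )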

11.5 Ordinal Regression Applied to HbA1c

ˆ In Figure 11.1, logit inverse curves are not parallel so the proportional odds assumption does not hold

ˆ log-log link yields highest degree of parallelism and most constant regression coefficients across cutoffs of gh, so use this link in an ordinal regression model (linearity of the curves is not required)

11.5.1 Checking Fit for Various Models Using Age

Another way to examine model fit is to flexibly fit the single most important predictor (age) using a variety of methods, and to compare predictions to sample quantiles and means based on overlapping subsets on age, each subset being subjects having age < 5 years away from the point being predicted by the models. Here we predict the 0.5, 0.75, and 0.9 quantiles and the mean. For quantiles we can compare to quantile regression (discussed below) and for means we compare to OLS.
ag ← 25:75
lag ← length ( ag )
q2 ← q3 ← p90 ← means ← numeric ( lag )
for ( i in 1: lag ) {
s ← which ( abs ( w $ age - ag [ i ]) < 5)
y ← w $ gh [ s ]
a ← quantile (y , probs = c ( .5 , .75 , .9 ) )
q2 [ i ] ← a [1]
q3 [ i ] ← a [2]
p90 [ i ] ← a [3]
means [ i ] ← mean ( y )

}
fams ← c ( ’ logistic ’ , ’ probit ’ , ’ loglog ’ , ’ cloglog ’)
fe ← function ( pred , target ) mean ( abs ( pred $ yhat - target ) )
mod ← gh ∼ rcs ( age ,6)
P ← Er ← list ()
for ( est in c ( ’ q2 ’ , ’ q3 ’ , ’ p90 ’ , ’ mean ’) ) {
meth ← if ( est == ’ mean ’) ’ ols ’ else ’ QR ’
p ← list ()
er ← rep ( NA , 5)
names ( er ) ← c ( fams , meth )
for ( family in fams ) {
h ← orm ( mod , family = family , data = w )
fun ← if ( est == ’ mean ’) Mean ( h )
else {
qu ← Quantile ( h )
switch ( est , q2 = function ( x ) qu ( .5 , x ) ,
q3 = function ( x ) qu ( .75 , x ) ,
p90 = function ( x ) qu ( .9 , x ) )
}
p [[ family ]] ← z ← Predict (h , age = ag , fun = fun , conf.int = FALSE )
er [ family ] ← fe (z , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) )
}
h ← switch ( est ,
mean = ols ( mod , data = w ) ,
q2 = Rq ( mod , data = w ) ,
q3 = Rq ( mod , tau =0 .75 , data = w ) ,
p90 = Rq ( mod , tau =0 .90 , data = w ) )
p [[ meth ]] ← z ← Predict (h , age = ag , conf.int = FALSE )
er [ meth ] ← fe (z , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) )

Er [[ est ]] ← er
pr ← do.call ( ’ rbind ’ , p )
pr $ est ← est
P ← rbind.data.frame (P , pr )
}

xyplot ( yhat ∼ age | est , groups = .set. , data =P , type = ’l ’ ,   # Figure 11.3


auto.key = list ( x = .75 , y = .2 , points = FALSE , lines = TRUE ) ,
panel = function ( ... , subscripts ) {
panel.xyplot ( ... , subscripts = subscripts )
est ← P $ est [ subscripts [1]]
lpoints ( ag , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) ,
col = gray ( .7 ) )
er ← format ( round ( Er [[ est ]] ,3) , nsmall =3)
ltext (26 , 6 .15 , paste ( names ( er ) , collapse = ’\ n ’) ,
cex = .7 , adj =0)
ltext (40 , 6 .15 , paste ( er , collapse = ’\ n ’) ,
cex = .7 , adj =1) })

It can be seen in Figure 11.3 that models dedicated to a specific task (quantile regression for quantiles and OLS for means) were best for those tasks.

[Figure 11.3: Three estimated quantiles and estimated mean using 6 methods, compared against caliper-matched sample quantiles/means (circles). Numbers are mean absolute differences between predicted and sample quantities using overlapping intervals of age and caliper matching. QR: quantile regression. Mean absolute differences by panel:

          q2     q3     mean   p90
logistic  0.023  0.036  0.021  0.046
probit    0.028  0.042  0.025  0.053
loglog    0.044  0.050  0.026  0.074
cloglog   0.053  0.075  0.033  0.111
QR/ols    0.024  0.027  0.013  0.030 ]

Although the log-log ordinal cumulative probability model did not estimate the median as accurately as some other methods, it does well for the 0.75 and 0.9 quantiles and is the best compromise overall because of its ability to also directly predict the mean as well as quantities such as Prob[HbA1c > 7|X].

From here on we focus on the log-log ordinal model.
Going back to the bottom left of Figure 11.1, let's look at quantile groups of predicted HbA1c by OLS and plot predicted distributions of actual HbA1c against empirical distributions.
w $ pghg ← cut2 ( pgh , g =6)
f ← orm ( gh ∼ pghg , family = loglog , data = w )
lp ← predict (f , newdata = data.frame ( pghg = levels ( w $ pghg ) ) )
ep ← ExProb ( f )        # Exceedance prob. function generator in rms
z ← ep ( lp )
j ← order ( w $ pghg )   # puts in order of lp (levels of pghg)
plot (z , xlim = c (4 , 7.5 ) , data = w [j , c ( ’ pghg ’ , ’ gh ’) ])   # Fig. 11.4

[Figure 11.4: Observed (dashed lines, open circles) and predicted (solid lines, closed circles) exceedance probability distributions from a model using 6-tiles of OLS-predicted HbA1c. Key shows quantile group intervals of predicted mean HbA1c.]

Agreement between predicted and observed exceedance probability distributions is excellent in Figure 11.4.
To return to the initial look at a linear model with assumed Gaussian residuals, fit a probit ordinal model and compare the estimated intercepts to the linear relationship with gh that is assumed by the normal distribution.
f ← orm ( gh ∼ rcs ( age ,6) , family = probit , data = w )
g ← ols ( gh ∼ rcs ( age ,6) , data = w )
s ← g $ stats [ ’ Sigma ’]
yu ← f $ yunique [ -1 ]
r ← quantile ( w $ gh , c ( .005 , .995 ) )
alphas ← coef ( f ) [1: num.intercepts ( f ) ]
plot ( -yu / s , alphas , type = ’l ’ , xlim = rev ( - r / s ) ,   # Fig. 11.5
xlab = expression ( -y / hat ( sigma ) ) , ylab = expression ( alpha [ y ]) )

[Figure 11.5: Estimated intercepts αy from the probit model, plotted against −y/σ̂]

Figure 11.5 depicts a significant departure from that implied by Gaussian residuals.

11.5.2 Examination of BMI

Using the log-log model, we first check the adequacy of BMI as a summary of height and weight for estimating median gh.

ˆ Adjust for age (without assuming linearity) in every case

ˆ Look at ratio of coefficients of log height and log weight

ˆ Use AIC to judge whether BMI is an adequate summary of height and weight
f ← orm ( gh ∼ rcs ( age ,5) + log ( ht ) + log ( wt ) ,
family = loglog , data = w )
f

-log-log Ordinal Regression Model

orm(formula = gh ~ rcs(age, 5) + log(ht) + log(wt), data = w,


family = loglog)

                          Model Likelihood       Discrimination         Rank Discrim.
                          Ratio Test             Indexes                Indexes
Obs              4629     LR χ2       1126.94    R2            0.217    ρ                      0.486
Distinct Y         63     d.f.              6    R2(6,4629)    0.215    |Pr(Y ≥ Y0.5) − 1/2|   0.153
Y0.5              5.5     Pr(>χ2)     <0.0001    R2(6,4602.2)  0.216
max|∂log L/∂β|  1×10⁻⁶    Score χ2    1262.81
                          Pr(>χ2)     <0.0001
β̂ S.E. Wald Z Pr(> |Z|)


age 0.0398 0.0055 7.29 <0.0001
age’ -0.0158 0.0275 -0.57 0.5657
age” -0.0072 0.0866 -0.08 0.9333
age”’ 0.0309 0.1135 0.27 0.7853
ht -3.0680 0.2789 -11.00 <0.0001
wt 1.2748 0.0704 18.10 <0.0001

aic ← NULL

for ( mod in list ( gh ∼ rcs ( age ,5) + rcs ( log ( bmi ) ,5) ,
gh ∼ rcs ( age ,5) + rcs ( log ( ht ) ,5) + rcs ( log ( wt ) ,5) ,
gh ∼ rcs ( age ,5) + rcs ( log ( ht ) ,4) * rcs ( log ( wt ) ,4) ) )
aic ← c ( aic , AIC ( orm ( mod , family = loglog , data = w ) ) )
print ( aic )

[1] 25910.77 25910.17 25906.03

The ratio of the coefficient of log height to the coefficient of log weight is −2.4, which is between what BMI uses (−2, since BMI = weight/height²) and the more dimensionally reasonable weight/height³ (−3). By AIC, a spline interaction surface between height and weight does slightly better than BMI in predicting HbA1c, but a nonlinear function of BMI is barely worse. It will require other body size measures to displace BMI as a predictor.
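The quoted ratio can be computed directly from the fit f above (a minimal sketch; coefficient names as printed in the table):

co ← coef ( f )
unname ( co [ ’ ht ’ ] / co [ ’ wt ’ ] )   # ≈ -2.4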
As an aside, compare this model fit to that from the Cox proportional hazards model. The Cox model uses a conditioning argument to obtain a partial likelihood free of the intercepts α (and requires a second step to estimate these log discrete hazard components) whereas we are using a full marginal likelihood of the ranks of Y [105].
print ( cph ( Surv ( gh ) ∼ rcs ( age ,5) + log ( ht ) + log ( wt ) , data = w ) )

Cox Proportional Hazards Model

cph(formula = Surv(gh) ~ rcs(age, 5) + log(ht) + log(wt), data = w)

                      Model Tests           Discrimination Indexes
Obs          4629     LR χ2      1120.20    R2           0.215
Events       4629     d.f.             6    R2(6,4629)   0.214
Center     8.3792     Pr(>χ2)     0.0000    Dxy          0.359
                      Score χ2   1258.07
                      Pr(>χ2)     0.0000

β̂ S.E. Wald Z Pr(> |Z|)


age -0.0392 0.0054 -7.24 <0.0001
age’ 0.0148 0.0274 0.54 0.5888
age” 0.0093 0.0862 0.11 0.9144
age”’ -0.0321 0.1131 -0.28 0.7767
ht 3.0477 0.2779 10.97 <0.0001
wt -1.2653 0.0701 -18.04 <0.0001

Back up and look at all body size measures, and examine their redundancies.
v ← varclus (∼ wt + ht + bmi + leg + arml + armc + waist +
tri + sub + age + sex + re , data = w )
plot ( v )   # Figure 11.6
# Omit wt so it won ’ t be removed before bmi
redun (∼ ht + bmi + leg + arml + armc + waist + tri + sub ,
data =w , r2 = .75 )

Redundancy Analysis

redun ( formula = ∼ht + bmi + leg + arml + armc + waist + tri +


sub , data = w , r2 = 0.75)

n : 3853 p: 8 nk : 3

Number of NAs : 776


Frequencies of Missing Values Due to Each Variable
ht bmi leg arml armc waist tri sub
0 0 155 127 130 164 334 655

Transformation of target variables forced to be linear

R2 cutoff : 0.75 Type : ordinary

R2 with which each variable can be predicted from all other variables :

ht bmi leg arml armc waist tri sub


0.829 0.924 0.682 0.748 0.843 0.864 0.531 0.594

Rendundant variables :

bmi ht

Predicted from variables :

leg arml armc waist tri sub

Variable Deleted R2 R2 after later deletions


1 bmi 0.924 0.909
2 ht 0.792

Six size measures adequately capture the entire set. Height and BMI are removed.

[Figure 11.6: Variable clustering for all potential predictors]

An advantage of removing height is that it is age-dependent in the elderly:
f ← orm ( ht ∼ rcs ( age ,4) * sex , data = w )   # Proportional odds model
qu ← Quantile ( f ) ; med ← function ( x ) qu ( .5 , x )
ggplot ( Predict (f , age , sex , fun = med , conf.int = FALSE ) ,
ylab = ’ Predicted Median Height , cm ’)

[Figure 11.7: Estimated median height as a smooth function of age, allowing age to interact with sex, from a proportional odds model]

But also see a change in leg length:


f ← orm ( leg ∼ rcs ( age ,4) * sex , data = w )
qu ← Quantile ( f ) ; med ← function ( x ) qu ( .5 , x )
ggplot ( Predict (f , age , sex , fun = med , conf.int = FALSE ) ,
ylab = ’ Predicted Median Upper Leg Length , cm ’)

Next allocate d.f. according to generalized Spearman ρ2.^i
s ← spearman2 ( gh ∼ age + sex + re + wt + leg + arml + armc +
waist + tri + sub , data =w , p =2)
plot ( s )

Parameters will be allocated in descending order of ρ2. But note that subscapular skinfold has a large number of NAs and other predictors also have NAs. Suboptimal casewise deletion will be used until the final model is fitted.

i Competition between collinear size measures hurts interpretation of partial tests of association in a saturated additive model.

[Figure 11.8: Estimated median upper leg length as a smooth function of age, allowing age to interact with sex, from a proportional odds model]

[Figure 11.9: Generalized squared rank correlations with gh. In descending order of adjusted ρ2: age, waist, leg, sub, armc, wt, re, tri, arml, sex.]


Because there are many competing body measures, we use backwards stepdown to arrive at a set of predictors. The bootstrap will be used to penalize predictive ability for variable selection. First the full model is fit using casewise deletion, then we do a composite test to assess whether any of the frequently-missing predictors is important.
f ← orm ( gh ∼ rcs ( age ,5) + sex + re + rcs ( wt ,3) + rcs ( leg ,3) + arml +
rcs ( armc ,3) + rcs ( waist ,4) + tri + rcs ( sub ,3) ,
family = loglog , data =w , x = TRUE , y = TRUE )
print (f , coefs = FALSE )

-log-log Ordinal Regression Model

orm(formula = gh ~ rcs(age, 5) + sex + re + rcs(wt, 3) + rcs(leg,


3) + arml + rcs(armc, 3) + rcs(waist, 4) + tri + rcs(sub,
3), data = w, x = TRUE, y = TRUE, family = loglog)

Frequencies of Missing Values Due to Each Variable: sub 655, tri 334, waist 164, leg 155, armc 130, arml 127; gh, age, sex, re, and wt have none.

                          Model Likelihood       Discrimination          Rank Discrim.
                          Ratio Test             Indexes                 Indexes
Obs              3853     LR χ2       1180.13    R2             0.265    ρ                      0.520
Distinct Y         60     d.f.             22    R2(22,3853)    0.260    |Pr(Y ≥ Y0.5) − 1/2|   0.172
Y0.5              5.5     Pr(>χ2)     <0.0001    R2(22,3829.2)  0.261
max|∂log L/∂β|  3×10⁻⁵    Score χ2    1298.88
                          Pr(>χ2)     <0.0001
## Composite test :
anova (f , leg , arml , armc , waist , tri , sub )

Wald Statistics for gh

χ2 d.f. P
leg 8.30 2 0.0158
Nonlinear 3.32 1 0.0685
arml 0.16 1 0.6924
armc 6.66 2 0.0358
Nonlinear 3.29 1 0.0695
waist 29.40 3 <0.0001
Nonlinear 4.29 2 0.1171
tri 16.62 1 <0.0001
sub 40.75 2 <0.0001
Nonlinear 4.50 1 0.0340
TOTAL NONLINEAR 14.95 5 0.0106
TOTAL 128.29 11 <0.0001

The model yields Spearman ρ = 0.52, the rank correlation between predicted and observed HbA1c.
Show predicted mean and median HbA1c as a function of age, adjusting other variables to median/mode. Compare the estimate of the median with that from quantile regression (discussed below).
M ← Mean ( f )
qu ← Quantile ( f )
med ← function ( x ) qu ( .5 , x )
p90 ← function ( x ) qu ( .9 , x )
fq ← Rq ( formula ( f ) , data = w )
fq90 ← Rq ( formula ( f ) , data =w , tau = .9 )

pmean ← Predict (f , age , fun =M , conf.int = FALSE )


pmed ← Predict (f , age , fun = med , conf.int = FALSE )
p90 ← Predict (f , age , fun = p90 , conf.int = FALSE )
pmedqr ← Predict ( fq , age , conf.int = FALSE )
p90qr ← Predict ( fq90 , age , conf.int = FALSE )
z ← rbind ( ’ orm mean ’= pmean , ’ orm median ’= pmed , ’ orm P90 ’= p90 ,
’ QR median ’= pmedqr , ’ QR P90 ’= p90qr )
ggplot (z , groups = ’ .set. ’ ,
adj.subtitle = FALSE , legend.label = FALSE )

Next do fast backward step-down in an attempt to get a model without so much competition among variables. The stepwise selection will be penalized for in the model validation.

[Figure 11.10: Estimated mean and 0.5 and 0.9 quantiles from the log-log ordinal model using casewise deletion, along with predictions of 0.5 and 0.9 quantiles from quantile regression (QR). Age is varied and other predictors are held constant to medians/modes.]

print ( fastbw (f , rule = ’p ’) , estimates = FALSE )

Deleted Chi - Sq d.f. P Residual d.f. P AIC


arml 0.16 1 0.6924 0.16 1 0.6924 -1.84
sex 0.45 1 0.5019 0.61 2 0.7381 -3.39
wt 5.72 2 0.0572 6.33 4 0.1759 -1.67
armc 3.32 2 0.1897 9.65 6 0.1400 -2.35

Factors in Final Model

[1] age re leg waist tri sub

Validate the model, properly penalizing for variable selection.
set.seed (13)   # so can reproduce results
v ← validate (f , B =100 , bw = TRUE , estimates = FALSE , rule = ’p ’)

Backwards Step - down - Original Model

Deleted Chi - Sq d.f. P Residual d.f. P AIC


arml 0.16 1 0.6924 0.16 1 0.6924 -1.84
sex 0.45 1 0.5019 0.61 2 0.7381 -3.39
wt 5.72 2 0.0572 6.33 4 0.1759 -1.67
armc 3.32 2 0.1897 9.65 6 0.1400 -2.35

Factors in Final Model

[1] age re leg waist tri sub

# Show number of variables selected in first 30 boots


latex (v , B =30 , file = ’ ’ , size = ’ small ’)

Index Original Training Test Optimism Corrected n


Sample Sample Sample Index
ρ 0.5225 0.5279 0.5204 0.0076 0.5149 100
R2 0.2712 0.2778 0.2689 0.0089 0.2623 100
Slope 1.0000 1.0000 0.9790 0.0210 0.9790 100
g 1.2276 1.2483 1.2196 0.0287 1.1989 100
|Pr(Y ≥ Y0.5 ) − 12 | 0.2007 0.2058 0.1988 0.0070 0.1937 100

Factors Retained in Backwards Elimination


First 30 Resamples
age sex re wt leg arml armc waist tri sub
• • • • • • • •
• • • • • • • •
• • • • • • •
• • • • • • • •
• • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • •
• • • • • • • •
• • • • • • •
• • • • •
• • • • • •
• • • • • • • •
• • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • •
• • • • • • • •
• • • • • • • •
• • • • • • •
• • • • • • •
• • • • • •
• • • • • • •
• • • • • •
• • • • • • •
• • • • • •

Frequencies of Numbers of Factors Retained


5 6 7 8 9
2 20 30 45 3

Next fit the reduced model, using multiple imputation to impute missing predictors. Then do an ANOVA for the reduced model, taking imputation into account.
a ← aregImpute (∼ gh + wt + ht + bmi + leg + arml + armc + waist +
tri + sub + age + re , data =w , n.impute =5 , pr = FALSE )
g ← fit.mult.impute ( gh ∼ rcs ( age ,5) + re + rcs ( leg ,3) +
rcs ( waist ,4) + tri + rcs ( sub ,4) ,
orm , a , family = loglog , data =w , pr = FALSE )
print (g , needspace = ’1 .5in ’)

-log-log Ordinal Regression Model

fit.mult.impute(formula = gh ~ rcs(age, 5) + re + rcs(leg, 3) +


rcs(waist, 4) + tri + rcs(sub, 4), fitter = orm, xtrans = a,
data = w, pr = FALSE, family = loglog)

                          Model Likelihood       Discrimination          Rank Discrim.
                          Ratio Test             Indexes                 Indexes
Obs              4629     LR χ2       1445.23    R2             0.269    ρ                      0.512
Distinct Y         63     d.f.             17    R2(17,4629)    0.265    |Pr(Y ≥ Y0.5) − 1/2|   0.173
Y0.5              5.5     Pr(>χ2)     <0.0001    R2(17,4602.2)  0.267
max|∂log L/∂β|  1×10⁻⁵    Score χ2    1566.55
                          Pr(>χ2)     <0.0001

β̂ S.E. Wald Z Pr(> |Z|)


age 0.0405 0.0055 7.36 <0.0001
age’ -0.0228 0.0277 -0.82 0.4094
age” 0.0123 0.0871 0.14 0.8880
age”’ 0.0428 0.1143 0.37 0.7082
re=Other Hispanic -0.0795 0.0592 -1.34 0.1794
re=Non-Hispanic White -0.4119 0.0451 -9.14 <0.0001
re=Non-Hispanic Black 0.0662 0.0563 1.18 0.2396
re=Other Race Including Multi-Racial -0.0509 0.0749 -0.68 0.4964
leg -0.0344 0.0092 -3.75 0.0002
leg’ 0.0160 0.0106 1.51 0.1298
waist 0.0071 0.0051 1.40 0.1618
waist’ 0.0318 0.0160 1.99 0.0469
waist” -0.0950 0.0512 -1.86 0.0634
tri -0.0160 0.0027 -5.86 <0.0001
sub -0.0023 0.0103 -0.22 0.8220
sub’ 0.0655 0.0314 2.08 0.0372
sub” -0.1838 0.1038 -1.77 0.0766

an ← anova ( g )
print ( an , caption = ’ ANOVA for reduced model after multiple imputation , with
addition of a combined effect for four size variables ’)

ANOVA for reduced model after multiple imputation, with addition of a combined effect for
four size variables
χ2 d.f. P
age 698.03 4 <0.0001
Nonlinear 29.54 3 <0.0001
re 163.54 4 <0.0001
leg 24.19 2 <0.0001
Nonlinear 2.29 1 0.1298
waist 128.33 3 <0.0001
Nonlinear 4.23 2 0.1208
tri 34.29 1 <0.0001
sub 41.27 3 <0.0001
Nonlinear 6.37 2 0.0414
TOTAL NONLINEAR 46.91 8 <0.0001
TOTAL 1457.15 17 <0.0001
b ← anova (g , leg , waist , tri , sub )
# Add new lines to the plot with combined effect of 4 size var.
s ← rbind ( an , size = b [ ’ TOTAL ’ , ])
class ( s ) ← ’ anova.rms ’
plot ( s )

[Plot of χ2 − d.f. from the ANOVA, with a combined effect added for the four size variables: age 698.0, size 412.3, re 163.5, waist 128.3, sub 41.3, tri 34.3, leg 24.2 (all P < 0.0001)]
ggplot ( Predict ( g ) , abbrev = TRUE , ylab = NULL )            # Figure 11.11
M ← Mean ( g )
ggplot ( Predict (g , fun = M ) , abbrev = TRUE , ylab = NULL )   # Figure 11.12

[Figure 11.11: Partial effects (log hazard or log-log cumulative probability scale) of all predictors in reduced model, after multiple imputation]

[Figure 11.12: Partial effects (mean scale) of all predictors in reduced model, after multiple imputation]

Compare the estimated age partial effects and confidence intervals with those from a model using casewise deletion, and with bootstrap nonparametric confidence intervals (also with casewise deletion).
gc ← orm ( gh ∼ rcs ( age ,5) + re + rcs ( leg ,3) +
rcs ( waist ,4) + tri + rcs ( sub ,4) ,
family = loglog , data =w , x = TRUE , y = TRUE )
gb ← bootcov ( gc , B =300)

bootclb ← Predict ( gb , age , boot.type = ’ basic ’)


bootclp ← Predict ( gb , age , boot.type = ’ percentile ’)
multimp ← Predict (g , age )
plot ( Predict ( gc , age ) , addpanel = function ( ... ) {
with ( bootclb , { llines ( age , lower , col = ’ blue ’)
llines ( age , upper , col = ’ blue ’) })
with ( bootclp , { llines ( age , lower , col = ’ blue ’ , lty =2)
llines ( age , upper , col = ’ blue ’ , lty =2) })
with ( multimp , { llines ( age , lower , col = ’ red ’)
llines ( age , upper , col = ’ red ’)
llines ( age , yhat , col = ’ red ’) } ) } ,
col.fill = gray ( .9 ) , adj.subtitle = FALSE )   # Figure 11.13

[Figure 11.13: Partial effect for age from multiple imputation (center red line) and casewise deletion (center blue line) with symmetric Wald 0.95 confidence bands using casewise deletion (gray shaded area), basic bootstrap confidence bands using casewise deletion (blue lines), percentile bootstrap confidence bands using casewise deletion (dashed blue lines), and symmetric Wald confidence bands accounting for multiple imputation (red lines).]

In OLS the mean equals the median and both are linearly related to any other quantiles. Semiparametric models are not this restrictive:
M ← Mean ( g )
qu ← Quantile ( g )
med ← function ( lp ) qu ( .5 , lp )
q90 ← function ( lp ) qu ( .9 , lp )
lp ← predict ( g )
lpr ← quantile ( predict ( g ) , c ( .002 , .998 ) , na.rm = TRUE )
lps ← seq ( lpr [1] , lpr [2] , length =200)
pmn ← M ( lps )
pme ← med ( lps )
p90 ← q90 ( lps )
plot ( pmn , pme ,   # Figure 11.14
xlab = expression ( paste ( ’ Predicted Mean ’ , HbA [ " 1 c " ]) ) ,
ylab = ’ Median and 0 .9 Quantile ’ , type = ’l ’ ,
xlim = c (4 .75 , 8 .0 ) , ylim = c (4 .75 , 8 .0 ) , bty = ’n ’)
box ( col = gray ( .8 ) )
lines ( pmn , p90 , col = ’ blue ’)
abline ( a =0 , b =1 , col = gray ( .8 ) )
text (6 .5 , 5 .5 , ’ Median ’)
text (5 .5 , 6 .3 , ’0 .9 ’ , col = ’ blue ’)
nint ← 350
scat1d ( M ( lp ) , nint = nint )
scat1d ( med ( lp ) , side =2 , nint = nint )
scat1d ( q90 ( lp ) , side =4 , col = ’ blue ’ , nint = nint )

[Figure 11.14: Predicted mean HbA1c vs. predicted median and 0.9 quantile along with their marginal distributions]

Draw a nomogram to compute 7 different predicted values for each subject.
g ← Newlevels (g , list ( re = abbreviate ( levels ( w $ re ) ) ) )
exprob ← ExProb ( g )
nom ←
nomogram (g , fun = list ( Mean =M ,
’ Median Glycohemoglobin ’ = med ,
’0 .9 Quantile ’ = q90 ,
’ Prob ( HbA1c ≥ 6 .5 ) ’=
function ( x ) exprob (x , y =6 .5 ) ,
’ Prob ( HbA1c ≥ 7 .0 ) ’=
function ( x ) exprob (x , y =7) ,
’ Prob ( HbA1c ≥ 7 .5 ) ’=
function ( x ) exprob (x , y =7 .5 ) ) ,
fun.at = list ( seq (5 , 8 , by = .5 ) ,
c (5 ,5 .25 ,5 .5 ,5 .75 ,6 ,6 .25 ) ,
c (5 .5 ,6 ,6 .5 ,7 ,8 ,10 ,12 ,14) ,
c ( .01 , .05 , .1 , .2 , .3 , .4 ) ,
c ( .01 , .05 , .1 , .2 , .3 , .4 ) ,
c ( .01 , .05 , .1 , .2 , .3 , .4 ) ) )
plot ( nom , lmgp = .28 )   # Figure 11.15
[Figure 11.15: Nomogram for predicting median, mean, and 0.9 quantile of glycohemoglobin, along with the estimated probability that HbA1c ≥ 6.5, 7, or 7.5, all from the log-log ordinal model]
Chapter 12

Case Study in Parametric Survival Modeling and Model Approximation

Data source: Random sample of 1000 patients from Phases I & II of SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment, funded by the Robert Wood Johnson Foundation). See [109]. The dataset is available from https://hbiostat.org/data.

ˆ Analyze acute disease subset of SUPPORT (acute respiratory failure, multiple organ system failure, coma); the shape of the survival curves is different between acute and chronic disease categories

ˆ Patients had to survive until day 3 of the study to qualify

ˆ Baseline physiologic variables measured during day 3

12.1 Descriptive Statistics

Create a variable acute to flag categories of interest; print univariable descriptive statistics.
require ( rms )

options ( prType = ’ latex ’) # for print , summary , anova


getHdata ( support ) # Get data frame from web site
acute ← support $ dzclass % in % c ( ’ ARF / MOSF ’ , ’ Coma ’)
latex ( describe ( support [ acute ,]) , file = ’ ’)

support[acute, ]
35 Variables 537 Observations
age : Age
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 529 1 60.7 19.98 28.49 35.22 47.93 63.67 74.49 81.54 85.56

lowest : 18.04199 18.41499 19.76500 20.29599 20.31200, highest: 91.61896 91.81696 91.93396 92.73895 95.50995
death : Death at any time up to NDI date:31DEC94
n missing distinct Info Sum Mean Gmd
537 0 2 0.67 356 0.6629 0.4477
sex
n missing distinct
537 0 2

Value female male


Frequency 251 286
Proportion 0.467 0.533
hospdead : Death in Hospital
n missing distinct Info Sum Mean Gmd
537 0 2 0.703 201 0.3743 0.4693

slos : Days from Study Entry to Discharge


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 85 0.999 23.44 22.24 4.0 5.0 9.0 15.0 27.0 47.4 68.2

lowest : 3 4 5 6 7, highest: 145 164 202 236 241


d.time : Days of Follow-Up
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 340 1 446.1 566.1 4 6 16 182 724 1421 1742

lowest : 3 4 5 6 7, highest: 1977 1979 1982 2011 2022


dzgroup
n missing distinct
537 0 3

Value ARF/MOSF w/Sepsis Coma MOSF w/Malig


Frequency 391 60 86
Proportion 0.728 0.112 0.160
dzclass
n missing distinct
537 0 2

Value ARF/MOSF Coma


Frequency 477 60
Proportion 0.888 0.112

num.co : number of comorbidities


n missing distinct Info Mean Gmd
537 0 7 0.926 1.525 1.346

lowest : 0 1 2 3 4, highest: 2 3 4 5 6
Value 0 1 2 3 4 5 6
Frequency 111 196 133 51 31 10 5
Proportion 0.207 0.365 0.248 0.095 0.058 0.019 0.009
edu : Years of Education
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
411 126 22 0.957 12.03 3.581 7 8 10 12 14 16 17

lowest : 0 1 2 3 4, highest: 17 18 19 20 22
income
n missing distinct
335 202 4

Value under $11k $11-$25k $25-$50k >$50k


Frequency 158 79 63 35
Proportion 0.472 0.236 0.188 0.104
scoma : SUPPORT Coma Score based on Glasgow D3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 11 0.822 19.24 27.87 0 0 0 0 37 55 100

lowest : 0 9 26 37 41, highest: 55 61 89 94 100


Value 0 9 26 37 41 44 55 61 89 94 100
Frequency 301 50 44 19 17 43 11 6 8 6 32
Proportion 0.561 0.093 0.082 0.035 0.032 0.080 0.020 0.011 0.015 0.011 0.060
charges : Hospital Charges
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
517 20 516 1 86652 90079 11075 15180 27389 51079 100904 205562 283411

lowest : 3448.0 4432.0 4574.0 5555.0 5849.0, highest: 504659.5 538323.0 543761.0 706577.0 740010.0

totcst : Total RCC cost


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
471 66 471 1 46360 46195 6359 8449 15412 29308 57028 108927 141569

lowest : 0.000 2071.109 2522.451 3190.625 3325.350


highest: 269057.000 269131.250 338955.000 357918.750 390460.500

totmcst : Total micro-cost


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
331 206 328 1 39022 36200 6131 8283 14415 26323 54102 87495 111920

lowest : 0.000 1561.619 2477.510 2626.270 3421.068


highest: 144234.000 154709.000 198047.000 234875.875 271467.250

avtisst : Average TISS, Days 3-25


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
536 1 205 1 29.83 14.19 12.46 14.50 19.62 28.00 39.00 47.17 50.37

lowest : 4.000000 5.666664 8.000000 9.000000 9.500000


highest: 58.500000 59.000000 60.000000 61.000000 64.000000
race
n missing distinct
535 2 5

lowest : white black asian other hispanic, highest: white black asian other hispanic
Value white black asian other hispanic
Frequency 417 84 4 8 22
Proportion 0.779 0.157 0.007 0.015 0.041
meanbp : Mean Arterial Blood Pressure Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 109 1 83.28 35 41.8 49.0 59.0 73.0 111.0 124.4 135.0

lowest : 0 20 27 30 32, highest: 155 158 161 162 180


wblc : White Blood Cell Count Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
532 5 241 1 14.1 9.984 0.8999 4.5000 7.9749 12.3984 18.1992 25.1891 30.1873

lowest : 0.04999542 0.06999207 0.09999084 0.14999390 0.19998169


highest: 51.39843750 58.19531250 61.19531250 79.39062500 100.00000000

hrt : Heart Rate Day 3


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 111 0.999 105 38.59 51 60 75 111 126 140 155

lowest : 0 11 30 36 40, highest: 189 193 199 232 300


resp : Respiration Rate Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 45 0.997 23.72 12.65 8 10 12 24 32 39 40

lowest : 0 4 6 7 8, highest: 48 49 52 60 64
temp : Temperature (celcius) Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 61 0.999 37.52 1.505 35.50 35.80 36.40 37.80 38.50 39.09 39.50

lowest : 32.50000 34.00000 34.09375 34.89844 35.00000, highest: 40.19531 40.59375 40.89844 41.00000 41.19531
pafi : PaO2/(.01*FiO2) Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
500 37 357 1 227.2 125 86.99 105.08 137.88 202.56 290.00 390.49 433.31

lowest : 45.00000 48.00000 53.32812 54.00000 55.00000


highest: 574.00000 595.12500 640.00000 680.00000 869.37500

alb : Serum Albumin Day 3


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
346 191 34 0.997 2.668 0.7219 1.700 1.900 2.225 2.600 3.100 3.400 3.800

lowest : 1.099854 1.199951 1.299805 1.399902 1.500000, highest: 4.099609 4.199219 4.500000 4.699219 4.799805

bili : Bilirubin Day 3

n missing distinct Info Mean Gmd .05 .10 .25 .50


386 151 88 0.997 2.678 3.507 0.3000 0.4000 0.6000 0.8999
.75 .90 .95
2.0000 6.5996 13.1743
lowest : 0.09999084 0.19998169 0.29998779 0.39996338 0.50000000
highest: 22.59765620 30.00000000 31.50000000 35.00000000 39.29687500
crea : Serum creatinine Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 84 0.998 2.232 1.997 0.6000 0.7000 0.8999 1.3999 2.5996 5.2395 7.3197

lowest : 0.2999878 0.3999634 0.5000000 0.5999756 0.6999512


highest: 10.3984375 10.5996094 11.1992188 11.5996094 11.7988281

sod : Serum sodium Day 3


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 38 0.997 138.1 7.471 129 131 134 137 142 147 150

lowest : 118 120 121 126 127, highest: 156 157 158 168 175
ph : Serum pH (arterial) Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
500 37 49 0.998 7.416 0.08775 7.270 7.319 7.380 7.420 7.470 7.510 7.529

lowest : 6.959961 6.989258 7.069336 7.119141 7.129883, highest: 7.559570 7.569336 7.589844 7.599609 7.659180

glucose : Glucose Day 3


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
297 240 179 1 167.7 92.13 76.0 89.0 106.0 141.0 200.0 292.4 347.2

lowest : 30 42 52 55 68, highest: 446 468 492 576 598


bun : BUN Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
304 233 100 1 38.91 31.12 8.00 11.00 16.75 30.00 56.00 79.70 100.70

lowest : 1 3 4 5 6, highest: 123 124 125 128 146

urine : Urine Output Day 3


n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
303 234 262 1 2095 1579 20.3 364.0 1156.5 1870.0 2795.0 4008.6 4817.5

lowest : 0 5 8 15 20, highest: 6865 6920 7360 7560 7750



adlp : ADL Patient Day 3


n missing distinct Info Mean Gmd
104 433 8 0.875 1.577 2.152

lowest : 0 1 2 3 4, highest: 3 4 5 6 7
Value 0 1 2 3 4 5 6 7
Frequency 51 19 7 6 4 7 8 2
Proportion 0.490 0.183 0.067 0.058 0.038 0.067 0.077 0.019
adls : ADL Surrogate Day 3
n missing distinct Info Mean Gmd
392 145 8 0.888 1.86 2.466

lowest : 0 1 2 3 4, highest: 3 4 5 6 7
Value 0 1 2 3 4 5 6 7
Frequency 185 68 22 18 17 20 39 23
Proportion 0.472 0.173 0.056 0.046 0.043 0.051 0.099 0.059
sfdm2
n missing distinct
468 69 5

lowest : no(M2 and SIP pres) adl>=4 (>=5 if sur) SIP>=30 Coma or Intub <2 mo. follow-up
highest: no(M2 and SIP pres) adl>=4 (>=5 if sur) SIP>=30 Coma or Intub <2 mo. follow-up
Value no(M2 and SIP pres) adl>=4 (>=5 if sur) SIP>=30 Coma or Intub
Frequency 134 78 30 5
Proportion 0.286 0.167 0.064 0.011
Value <2 mo. follow-up
Frequency 221
Proportion 0.472
adlsc : Imputed ADL Calibrated to Surrogate
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 144 0.956 2.119 2.386 0.000 0.000 0.000 1.839 3.375 6.000 6.000

lowest : 0.0000000 0.4947510 0.4947999 1.0000000 1.1667481


highest: 5.7832031 6.0000000 6.3398438 6.4658203 7.0000000
# Show patterns of missing data
plot ( naclus ( support [ acute ,]) ) # Figure 12.1

Show associations between predictors using a general non-monotonic measure of dependence (Hoeffding D).
ac ← support [ acute ,]
ac $ dzgroup ← ac $ dzgroup [ drop = TRUE ] # Remove unused levels
attach ( ac )
vc ← varclus (∼ age + sex + dzgroup + num.co + edu + income + scoma + race +
meanbp + wblc + hrt + resp + temp + pafi + alb + bili + crea + sod +
ph + glucose + bun + urine + adlsc , sim = ’ hoeffding ’)
plot ( vc )   # Figure 12.2
[Figure 12.1: Cluster analysis showing which predictors tend to be missing on the same patients]

[Figure 12.2: Hierarchical clustering of potential predictors using Hoeffding D as a similarity measure. Categorical predictors are automatically expanded into dummy variables.]

12.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
dd ← datadist ( ac )
# describe distributions of variables to rms
options ( datadist = ’ dd ’)

# Generate right-censored survival time variable


years ← d.time / 365 .25
units ( years ) ← ’ Year ’
S ← Surv ( years , death )

# Show normal inverse Kaplan-Meier estimates


# stratified by dzgroup
survplot ( npsurv ( S ∼ dzgroup ) , conf = ’ none ’ ,
fun = qnorm , logt = TRUE )   # Figure 12.3

[Figure 12.3: Φ⁻¹(SKM(t)) stratified by dzgroup. Linearity and semi-parallelism indicate a reasonable fit to the log-normal accelerated failure time model with respect to one predictor.]

More stringent assessment of log-normal assumptions: check distribution of residuals from an adjusted model:
f ← psm ( S ∼ dzgroup + rcs ( age ,5) + rcs ( meanbp ,5) ,
dist = ’ lognormal ’ , y = TRUE )   # dist=’gaussian’ for S+
r ← resid ( f )

survplot (r , dzgroup , label.curve = FALSE )


survplot (r , age , label.curve = FALSE )
survplot (r , meanbp , label.curve = FALSE )
random.number ← runif ( length ( age ) )

survplot (r , random.number , label.curve = FALSE ) # Figure 12.4

[Figure 12.4: Kaplan-Meier estimates of distributions of normalized, right-censored residuals from the fitted log-normal survival model. Residuals are stratified by important variables in the model (by quartiles of continuous variables), plus a random variable to depict the natural variability (in the lower right plot). Theoretical standard Gaussian distributions of residuals are shown with a thick solid line. The upper left plot is with respect to disease group.]

The fit for dzgroup is not great but overall fit is good.

Remove from consideration predictors that are missing in > 0.2 of the patients. Many of these were only collected for the second phase of SUPPORT.

Of those variables to be included in the model, find which ones have enough potential predictive power to justify allowing for nonlinear relationships or multiple categories, which spend more d.f. For each variable compute Spearman ρ2 based on multiple linear regression of rank(x), rank(x)², and the survival time, truncating survival time at the shortest follow-up for survivors (356 days). This rids the data of censoring but creates many ties at 356 days.
shortest.follow.up ← min ( d.time [ death ==0] , na.rm = TRUE )
d.timet ← pmin ( d.time , shortest.follow.up )

w ← spearman2 ( d.timet ∼ age + num.co + scoma + meanbp +


hrt + resp + temp + crea + sod + adlsc +
wblc + pafi + ph + dzgroup + race , p =2)
plot (w , main = ’ ’)   # Figure 12.5

[Figure 12.5: Generalized Spearman ρ2 rank correlation between predictors and truncated survival time]

A better approach is to use the complete information in the failure and censoring times by computing Somers' Dxy rank correlation allowing for censoring.


w ← rcorrcens ( S ∼ age + num.co + scoma + meanbp + hrt + resp +
temp + crea + sod + adlsc + wblc + pafi + ph +
dzgroup + race )
plot (w , main = ’ ’)   # Figure 12.6

# Compute number of missing values per variable


sapply ( llist ( age , num.co , scoma , meanbp , hrt , resp , temp , crea , sod , adlsc ,
wblc , pafi , ph ) , function ( x ) sum ( is.na ( x ) ) )

age num . co scoma meanbp hrt resp temp crea sod adlsc wblc
0 0 0 0 0 0 0 0 0 0 5
pafi ph
37 37

[Figure 12.6: Somers' Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second most frequent category.]

# Can also do naplot(naclus(support[acute,]))
# Can also use the Hmisc naclus and naplot functions to do this
# Impute missing values with normal or modal values
wblc.i ← impute ( wblc , 9)
pafi.i ← impute ( pafi , 333 .3 )
ph.i ← impute ( ph , 7 .4 )
race2 ← race
levels ( race2 ) ← list ( white = ’ white ’ , other = levels ( race ) [ -1 ])
race2 [ is.na ( race2 ) ] ← ’ white ’
dd ← datadist ( dd , wblc.i , pafi.i , ph.i , race2 )

Do a formal redundancy analysis using more than pairwise associations, and allow for non-monotonic transformations in predicting each predictor from all other predictors. This analysis requires missing values to be imputed so as to not greatly reduce the sample size.
redun (∼ crea + age + sex + dzgroup + num.co + scoma + adlsc + race2 +
meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i , nk =4)

Redundancy Analysis

redun ( formula = ∼crea + age + sex + dzgroup + num . co + scoma +


adlsc + race2 + meanbp + hrt + resp + temp + sod + wblc . i +
pafi . i + ph .i , nk = 4)

n : 537 p : 16 nk : 4

Number of NAs : 0

Transformation of target variables forced to be linear

R2 cutoff : 0.9 Type : ordinary

R2 with which each variable can be predicted from all other variables :

crea age sex dzgroup num . co scoma adlsc race2 meanbp hrt
0.133 0.246 0.132 0.451 0.147 0.418 0.153 0.151 0.178 0.258
resp temp sod wblc . i pafi . i ph . i
0.131 0.197 0.135 0.093 0.143 0.171

No redundant variables

Better approach to gauging predictive potential and allocating d.f.:

ˆ Allow all continuous variables to have the maximum number of knots entertained, in a log-normal survival model

ˆ Must use imputation to avoid losing data

ˆ Fit a “saturated” main effects model

ˆ Makes full use of censored data

ˆ Had to limit to 4 knots, force scoma to be linear, and omit ph.i to avoid singularity
k ← 4
f ← psm ( S ∼ rcs ( age , k ) + sex + dzgroup + pol ( num.co ,2) + scoma +
pol ( adlsc ,2) + race + rcs ( meanbp , k ) + rcs ( hrt , k ) + rcs ( resp , k ) +
rcs ( temp , k ) + rcs ( crea , k ) + rcs ( sod , k ) + rcs ( wblc.i , k ) +
rcs ( pafi.i , k ) , dist = ’ lognormal ’)
plot ( anova ( f ) )   # Figure 12.7
ˆ Figure 12.7 properly blinds the analyst to the form of effects (tests of linearity).

[Figure 12.7: Partial χ2 statistics for association of each predictor with response from saturated main effects model, penalized for d.f.]

ˆ Fit a log-normal survival model with number of parameters corresponding to nonlinear effects determined from Figure 12.7. For the most promising predictors, five knots can be allocated, as there are fewer singularity problems once less promising predictors are simplified.

Note: Since the audio was recorded, a bug in psm was fixed on 2017-03-12. Discrimination indexes shown in the table below are correct but the audio is incorrect for g and gr.

f ← psm ( S ∼ rcs ( age ,5) + sex + dzgroup + num.co +


scoma + pol ( adlsc ,2) + race2 + rcs ( meanbp ,5) +
rcs ( hrt ,3) + rcs ( resp ,3) + temp +
rcs ( crea ,4) + sod + rcs ( wblc.i ,3) + rcs ( pafi.i ,4) ,
dist = ’ lognormal ’)   # ’gaussian’ for S+
print ( f )

Parametric Survival Model: Log Normal Distribution

psm(formula = S ~ rcs(age, 5) + sex + dzgroup + num.co + scoma +


pol(adlsc, 2) + race2 + rcs(meanbp, 5) + rcs(hrt, 3) + rcs(resp,
3) + temp + rcs(crea, 4) + sod + rcs(wblc.i, 3) + rcs(pafi.i,
4), dist = "lognormal")

                   Model Likelihood      Discrimination
                   Ratio Test            Indexes
Obs        537     LR χ2      236.83     R2           0.594
Events     356     d.f.           30     R2(30,537)   0.320
σ     2.230782     Pr(>χ2)   <0.0001     R2(30,356)   0.441
                                         Dxy          0.485

β̂ S.E. Wald Z Pr(> |Z|)


(Intercept) -5.6883 3.7851 -1.50 0.1329
age -0.0148 0.0309 -0.48 0.6322
age’ -0.0412 0.1078 -0.38 0.7024
age” 0.1670 0.5594 0.30 0.7653
age”’ -0.2099 1.3707 -0.15 0.8783
sex=male -0.0737 0.2181 -0.34 0.7354
dzgroup=Coma -2.0676 0.4062 -5.09 <0.0001
dzgroup=MOSF w/Malig -1.4664 0.3112 -4.71 <0.0001
num.co -0.1917 0.0858 -2.23 0.0255
scoma -0.0142 0.0044 -3.25 0.0011
adlsc -0.3735 0.1520 -2.46 0.0140
adlsc2 0.0442 0.0243 1.82 0.0691
race2=other 0.2979 0.2658 1.12 0.2624
meanbp 0.0702 0.0210 3.34 0.0008
meanbp’ -0.3080 0.2261 -1.36 0.1732
meanbp” 0.8438 0.8556 0.99 0.3241
meanbp”’ -0.5715 0.7707 -0.74 0.4584
hrt -0.0171 0.0069 -2.46 0.0140
hrt’ 0.0064 0.0063 1.02 0.3090
resp 0.0454 0.0230 1.97 0.0483
resp’ -0.0851 0.0291 -2.93 0.0034
temp 0.0523 0.0834 0.63 0.5308
crea -0.4585 0.6727 -0.68 0.4955
crea’ -11.5176 19.0027 -0.61 0.5444
crea” 21.9840 31.0113 0.71 0.4784
sod 0.0044 0.0157 0.28 0.7792
wblc.i 0.0746 0.0331 2.25 0.0242
wblc.i’ -0.0880 0.0377 -2.34 0.0195
pafi.i 0.0169 0.0055 3.07 0.0021
pafi.i’ -0.0569 0.0239 -2.38 0.0173
pafi.i” 0.1088 0.0482 2.26 0.0239
Log(scale) 0.8024 0.0401 19.99 <0.0001

a ← anova ( f )

12.3 Summarizing the Fitted Model

ˆ Plot the shape of the effect of each predictor on log survival time.

ˆ All effects centered: can be placed on common scale

ˆ Wald χ2 statistics, penalized for d.f., plotted in descending order

ggplot ( Predict (f , ref.zero = TRUE ) , vnames = ’ names ’ ,
        sepdiscrete = ’ vertical ’ , anova = a )   # Figure 12.8
print (a , size = ’ tsz ’)

[Figure 12.8: Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor reference values are zero. Pointwise 0.95 confidence bands are also shown. As all Y-axes have the same scale, it is easy to see which predictors are strongest.]

Wald Statistics for S

χ2 d.f. P
age 15.99 4 0.0030
Nonlinear 0.23 3 0.9722
sex 0.11 1 0.7354
dzgroup 45.69 2 <0.0001
num.co 4.99 1 0.0255
scoma 10.58 1 0.0011
adlsc 8.28 2 0.0159
Nonlinear 3.31 1 0.0691
race2 1.26 1 0.2624
meanbp 27.62 4 <0.0001
Nonlinear 10.51 3 0.0147
hrt 11.83 2 0.0027
Nonlinear 1.04 1 0.3090
resp 11.10 2 0.0039
Nonlinear 8.56 1 0.0034
temp 0.39 1 0.5308
crea 33.63 3 <0.0001
Nonlinear 21.27 2 <0.0001
sod 0.08 1 0.7792
wblc.i 5.47 2 0.0649
Nonlinear 5.46 1 0.0195
pafi.i 15.37 3 0.0015
Nonlinear 6.97 2 0.0307
TOTAL NONLINEAR 60.48 14 <0.0001
TOTAL 261.47 30 <0.0001
plot ( a )   # Figure 12.9
options ( digits =3)
plot ( summary ( f ) , log = TRUE , main = ’ ’)   # Figure 12.10
CHAPTER 12. PARAMETRIC SURVIVAL MODELING AND MODEL APPROXIMATION 12-17

[Figure 12.9: dot chart of χ² − d.f. (x-axis 0–40) for each predictor, in ascending order: sod, sex, temp, race2, wblc.i, num.co, adlsc, resp, scoma, hrt, age, pafi.i, meanbp, crea, dzgroup.]
Figure 12.9: Contribution of variables in predicting survival time in log-normal model

[Figure 12.10: survival time ratio chart (x-axis 0.10–4.00, log scale) with comparisons age 74.5:47.9; num.co 2:1; scoma 37:0; adlsc 3.38:0; meanbp 111:59; hrt 126:75; resp 32:12; temp 38.5:36.4; crea 2.6:0.9; sod 142:134; wblc.i 18.2:8.1; pafi.i 323:142; sex female:male; dzgroup Coma and MOSF w/Malig vs. ARF/MOSF w/Sepsis; race2 other:white.]

Figure 12.10: Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile
to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate
different confidence levels (0.9, 0.95, 0.99).

12.4 Internal Validation of the Fitted Model Using the Bootstrap

Validate indexes describing the fitted model.


# First add data to model fit so bootstrap can re-sample
# from the data
g ← update (f , x = TRUE , y = TRUE )
set.seed (717)
latex ( validate (g , B =120 , dxy = TRUE ) , digits =2 , size = ’ Ssize ’)

Index      Original  Training  Test    Optimism  Corrected  n
           Sample    Sample    Sample            Index
Dxy 0.49 0.51 0.46 0.05 0.44 120
R2 0.59 0.66 0.54 0.12 0.47 120
Intercept 0.00 0.00 −0.03 0.03 −0.03 120
Slope 1.00 1.00 0.90 0.10 0.90 120
D 0.48 0.55 0.42 0.13 0.35 120
U 0.00 0.00 −0.01 0.01 −0.01 120
Q 0.48 0.55 0.43 0.12 0.36 120
g 1.96 2.04 1.86 0.18 1.78 120

ˆ From Dxy and R² there is a moderate amount of overfitting.

ˆ Slope shrinkage factor (0.90) is not troublesome.

ˆ An almost unbiased estimate of future predictive discrimination on similar patients is the corrected Dxy of 0.44.
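
To make the optimism correction concrete: each corrected index is the apparent (original-sample) index minus the estimated optimism (training minus test). For Dxy:

0.49 - (0.51 - 0.46)   # = 0.44, matching the Corrected Index column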

Validate predicted 1-year survival probabilities. Use a smooth approach that does not require binning [114], and also use less precise Kaplan-Meier estimates obtained by stratifying patients by the predicted probability, with at least 60 patients per group.
set.seed (717)
cal ← calibrate (g , u =1 , B =120)

plot ( cal , subtitles = FALSE )


cal ← calibrate (g , cmethod = ’ KM ’ , u =1 , m =60 , B =120 , pr = FALSE )
plot ( cal , add = TRUE )   # Figure 12.11

[Figure 12.11: calibration plot of Fraction Surviving 1 Year (y, 0.0–0.8) vs. Predicted 1-Year Survival (x, 0.0–0.8).]
Figure 12.11: Bootstrap validation of calibration curve. Dots represent apparent calibration accuracy; × are bootstrap estimates corrected for overfitting, based on binning predicted survival probabilities and computing Kaplan-Meier estimates. Black curve is the estimated observed relationship using hare and the blue curve is the overfitting-corrected hare estimate. The gray-scale line depicts the ideal relationship.

12.5 Approximating the Full Model

The fitted log-normal model is perhaps too complex for routine use and for routine data collection. Let us develop a simplified model that can predict the full model’s predicted values with high accuracy (R² = 0.96). The simplification is done using a fast backward stepdown against the full model predicted values.
Z ← predict ( f )   # X β̂
a ← ols ( Z ∼ rcs ( age ,5) + sex + dzgroup + num.co +
         scoma + pol ( adlsc ,2) + race2 +
         rcs ( meanbp ,5) + rcs ( hrt ,3) + rcs ( resp ,3) +
         temp + rcs ( crea ,4) + sod + rcs ( wblc.i ,3) +
         rcs ( pafi.i ,4) , sigma =1)
# sigma=1 is used to prevent sigma hat from being zero when
# R2=1.0, since we start out by approximating Z with all
# component variables
fastbw (a , aics =10000)   # fast backward stepdown

Deleted   Chi-Sq  d.f.  P        Residual  d.f.  P       AIC      R2
sod         0.43   1    0.512        0.43   1    0.5117    -1.57  1.000
sex         0.57   1    0.451        1.00   2    0.6073    -3.00  0.999
temp        2.20   1    0.138        3.20   3    0.3621    -2.80  0.998
race2       6.81   1    0.009       10.01   4    0.0402     2.01  0.994
wblc.i     29.52   2    0.000       39.53   6    0.0000    27.53  0.976
num.co     30.84   1    0.000       70.36   7    0.0000    56.36  0.957
resp       54.18   2    0.000      124.55   9    0.0000   106.55  0.924
adlsc      52.46   2    0.000      177.00  11    0.0000   155.00  0.892
pafi.i     66.78   3    0.000      243.79  14    0.0000   215.79  0.851
scoma      78.07   1    0.000      321.86  15    0.0000   291.86  0.803
hrt        83.17   2    0.000      405.02  17    0.0000   371.02  0.752
age        68.08   4    0.000      473.10  21    0.0000   431.10  0.710
crea      314.47   3    0.000      787.57  24    0.0000   739.57  0.517
meanbp    403.04   4    0.000     1190.61  28    0.0000  1134.61  0.270
dzgroup   441.28   2    0.000     1631.89  30    0.0000  1571.89  0.000

Approximate Estimates after Deleting Factors

        Coef     S.E.    Wald Z  P
[1,] -0.5928  0.04315  -13.74    0

Factors in Final Model

None

f.approx ← ols ( Z ∼ dzgroup + rcs ( meanbp ,5) + rcs ( crea ,4) + rcs ( age ,5) +
rcs ( hrt ,3) + scoma + rcs ( pafi.i ,4) + pol ( adlsc ,2) +
rcs ( resp ,3) , x = TRUE )
f.approx $ stats

      n  Model L.R.   d.f.    R2      g  Sigma
537.000    1688.225 23.000 0.957  1.915  0.370

ˆ Estimate the variance–covariance matrix of the coefficients of the reduced model

ˆ This covariance matrix does not include the scale parameter
V ← vcov (f , regcoef.only = TRUE ) # var ( full model )
X ← cbind ( Intercept =1 , g $ x ) # full model design
x ← cbind ( Intercept =1 , f.approx $ x ) # approx. model design
w ← solve ( t ( x ) % * % x , t ( x ) ) % * % X # contrast matrix
v ← w %*% V %*% t(w)
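
In matrix form (spelling out what the code above computes): with X_f and X_a the full and approximate design matrices (each including an intercept column), the approximate coefficients are a fixed linear map of the full-model coefficients, β̂_a = W β̂_f with W = (X_aᵀX_a)⁻¹X_aᵀX_f, so Var(β̂_a) = W V Wᵀ, which is the matrix v.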

Compare variance estimates (diagonals of v) with variance estimates from a reduced model that is fitted against the actual outcomes.
f.sub ← psm ( S ∼ dzgroup + rcs ( meanbp ,5) + rcs ( crea ,4) + rcs ( age ,5) +
              rcs ( hrt ,3) + scoma + rcs ( pafi.i ,4) + pol ( adlsc ,2) +
              rcs ( resp ,3) , dist = ’ lognormal ’)   # ’gaussian’ for S+

r ← diag ( v ) / diag ( vcov ( f.sub , regcoef.only = TRUE ) )


r [ c ( which.min ( r ) , which.max ( r ) ) ]

 hrt’    age
0.976  0.982
f.approx $ var ← v
print ( anova ( f.approx , test = ’ Chisq ’ , ss = FALSE ) , size = ’ tsz ’)

Wald Statistics for Z


χ2 d.f. P
dzgroup 55.94 2 <0.0001
meanbp 29.87 4 <0.0001
Nonlinear 9.84 3 0.0200
crea 39.04 3 <0.0001
Nonlinear 24.37 2 <0.0001
age 18.12 4 0.0012
Nonlinear 0.34 3 0.9517
hrt 9.87 2 0.0072
Nonlinear 0.40 1 0.5289
scoma 9.85 1 0.0017
pafi.i 14.01 3 0.0029
Nonlinear 6.66 2 0.0357
adlsc 9.71 2 0.0078
Nonlinear 2.87 1 0.0904
resp 9.65 2 0.0080
Nonlinear 7.13 1 0.0076
TOTAL NONLINEAR 58.08 13 <0.0001
TOTAL 252.32 23 <0.0001

Equation for simplified model:


# Typeset mathematical form of approximate model
latex ( f.approx )

E(Z) = Xβ, where

X β̂ =
−2.51
−1.94[Coma] − 1.75[MOSF w/Malig]
+0.068meanbp − 3.08×10−5 (meanbp − 41.8)3+ + 7.9×10−5 (meanbp − 61)3+
−4.91×10−5 (meanbp − 73)3+ + 2.61×10−6 (meanbp − 109)3+ − 1.7×10−6 (meanbp − 135)3+
−0.553crea − 0.229(crea − 0.6)3+ + 0.45(crea − 1.1)3+ − 0.233(crea − 1.94)3+
+0.0131(crea − 7.32)3+
−0.0165age − 1.13×10−5 (age − 28.5)3+ + 4.05×10−5 (age − 49.5)3+
−2.15×10−5 (age − 63.7)3+ − 2.68×10−5 (age − 72.7)3+ + 1.9×10−5 (age − 85.6)3+
−0.0136hrt + 6.09×10−7 (hrt − 60)3+ − 1.68×10−6 (hrt − 111)3+ + 1.07×10−6 (hrt − 140)3+
−0.0135 scoma

+0.0161pafi.i − 4.77×10−7 (pafi.i − 88)3+ + 9.11×10−7 (pafi.i − 167)3+


−5.02×10−7 (pafi.i − 276)3+ + 6.76×10−8 (pafi.i − 426)3+ − 0.369 adlsc + 0.0409 adlsc2
+0.0394resp − 9.11×10−5 (resp − 10)3+ + 0.000176(resp − 24)3+ − 8.5×10−5 (resp − 39)3+

and [c] = 1 if subject is in group c, 0 otherwise; (x)+ = x if x > 0, 0 otherwise

Nomogram for predicting median and mean survival time, based on the approximate model:
# Derive S functions that express mean and quantiles
# of survival time for specific linear predictors
# analytically
expected.surv ← Mean ( f )
quantile.surv ← Quantile ( f )
latex ( expected.surv , file = ’ ’ , type = ’ Sinput ’)

expected.surv ← function ( lp = NULL , parms = 0.802352037606488 )
{
  names ( parms ) ← NULL
  exp ( lp + exp (2 * parms ) / 2)
}

latex ( quantile.surv , file = ’ ’ , type = ’ Sinput ’)

quantile.surv ← function ( q = 0.5 , lp = NULL , parms = 0.802352037606488 )
{
  names ( parms ) ← NULL
  f ← function ( lp , q , parms ) lp + exp ( parms ) * qnorm ( q )
  names ( q ) ← format ( q )
  drop ( exp ( outer ( lp , q , FUN = f , parms = parms ) ) )
}

median.surv ← function ( x ) quantile.surv ( lp = x )
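
As a quick check of these derived functions, evaluate them at a hypothetical linear predictor of 1.5 (any value will do):

median.surv (1.5)     # exp(1.5): the median of a log-normal is exp(lp)
expected.surv (1.5)   # exp(1.5 + exp(2*0.8024)/2), much larger due to skewness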

# Improve variable labels for the nomogram


f.approx ← Newlabels ( f.approx , c ( ’ Disease Group ’ , ’ Mean Arterial BP ’ ,
’ Creatinine ’ , ’ Age ’ , ’ Heart Rate ’ , ’ SUPPORT Coma Score ’ ,
’ PaO2 / ( .01 * FiO2 ) ’ , ’ ADL ’ , ’ Resp. Rate ’) )
nom ←
nomogram ( f.approx ,
pafi.i = c (0 , 50 , 100 , 200 , 300 , 500 , 600 , 700 , 800 , 900) ,
fun = list ( ’ Median Survival Time ’= median.surv ,
’ Mean Survival Time ’ = expected.surv ) ,
fun.at = c ( .1 , .25 , .5 ,1 ,2 ,5 ,10 ,20 ,40) )
plot ( nom , cex.var =1 , cex.axis = .75 , lmgp = .25 )   # Figure 12.12

[Figure 12.12: nomogram with axes for Points; Disease Group; Mean Arterial BP; Creatinine; Age; Heart Rate; SUPPORT Coma Score; PaO2/(.01*FiO2); ADL; Resp. Rate; Total Points; Linear Predictor; Median Survival Time; Mean Survival Time.]
Figure 12.12: Nomogram for predicting median and mean survival time, based on approximation of full model

S Packages and Functions Used

Package  Purpose             Functions
Hmisc    Miscellaneous       describe, ecdf, naclus, varclus,
         functions           llist, spearman2, impute, latex
rms      Modeling            datadist, psm, rcs, ols, fastbw
         Model presentation  survplot, Newlabels, Function,
                             Mean, Quantile, nomogram
         Model validation    validate, calibrate

Note: All packages are available from CRAN


Chapter 13

Case Study in Cox Regression

13.1 Choosing the Number of Parameters and Fitting the Model

ˆ Clinical trial of estrogen for prostate cancer

ˆ Response is time to death, all causes

ˆ Base analysis on Cox proportional hazards model [49]

ˆ S(t|X) = probability of surviving at least to time t given set of predictor values X

ˆ S(t|X) = S0(t)^exp(Xβ)

ˆ Censor time to death at time of last follow-up for patients still alive at end of study (treat survival time for a patient censored at 24m as 24m+)


ˆ Use simple, partial approaches to data reduction

ˆ Use transcan for single imputation

ˆ Again combine last 2 categories for ekg, pf

ˆ See if we can use a full additive model (4 knots for continuous X)

Predictor                              Name  d.f.  Original Levels
Dose of estrogen                       rx     3    placebo, 0.2, 1.0, 5.0 mg estrogen
Age in years                           age    3
Weight index: wt(kg)-ht(cm)+200        wt     3
Performance rating                     pf     2    normal, in bed < 50% of time,
                                                   in bed > 50%, in bed always
History of cardiovascular disease      hx     1    present/absent
Systolic blood pressure/10             sbp    3
Diastolic blood pressure/10            dbp    3
Electrocardiogram code                 ekg    5    normal, benign, rhythm disturb.,
                                                   block, strain, old myocardial
                                                   infarction, new MI
Serum hemoglobin (g/100ml)             hg     3
Tumor size (cm²)                       sz     3
Stage/histologic grade combination     sg     3
Serum prostatic acid phosphatase       ap     3
Bone metastasis                        bm     1    present/absent

ˆ Total of 36 candidate d.f.

ˆ Impute missings and estimate shrinkage


require ( rms )

options ( prType = ’ latex ’) # for print , summary , anova


getHdata ( prostate )
levels ( prostate $ ekg ) [ levels ( prostate $ ekg ) % in %
c ( ’ old MI ’ , ’ recent MI ’) ] ← ’ MI ’
# combines last 2 levels and uses a new name , MI

prostate $ pf.coded ← as.integer ( prostate $ pf )


# save original pf , re-code to 1-4
levels ( prostate $ pf ) ← c ( levels ( prostate $ pf ) [1:3] ,
levels ( prostate $ pf ) [3])
# combine last 2 levels

w ← transcan (∼ sz + sg + ap + sbp + dbp + age +


wt + hg + ekg + pf + bm + hx ,
imputed = TRUE , data = prostate , pl = FALSE , pr = FALSE )

attach ( prostate )
sz ← impute (w , sz , data = prostate )
sg ← impute (w , sg , data = prostate )
age ← impute (w , age , data = prostate )
wt ← impute (w , wt , data = prostate )
ekg ← impute (w , ekg , data = prostate )

dd ← datadist ( prostate )
options ( datadist = ’ dd ’)

units ( dtime ) ← ’ Month ’


S ← Surv ( dtime , status != ’ alive ’)

f ← cph ( S ∼ rx + rcs ( age ,4) + rcs ( wt ,4) + pf + hx +


rcs ( sbp ,4) + rcs ( dbp ,4) + ekg + rcs ( hg ,4) +
rcs ( sg ,4) + rcs ( sz ,4) + rcs ( log ( ap ) ,4) + bm )

print (f , coefs = FALSE )

Cox Proportional Hazards Model

cph(formula = S ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx + rcs(sbp,


4) + rcs(dbp, 4) + ekg + rcs(hg, 4) + rcs(sg, 4) + rcs(sz,
4) + rcs(log(ap), 4) + bm)

                 Model Tests           Discrimination Indexes
Obs     502      LR χ² 136.22          R² 0.238
Events  354      d.f. 36               R²(36,502) 0.181
Center -2.9933   Pr(> χ²) 0.0000       R²(36,354) 0.247
                 Score χ² 143.62       Dxy 0.333
                 Pr(> χ²) 0.0000

ˆ Global LR χ² is 136 and very significant → modeling warranted

ˆ AIC on χ² scale = 136.2 − 2 × 36 = 64.2

ˆ Rough shrinkage: (136.2 − 36)/136.2 = 0.74

ˆ Informal data reduction (increase d.f. for ap)

Variables  Reductions                                           d.f. Saved
wt         Assume variable not important enough for 4 knots;        1
           use 3 knots
pf         Assume linearity                                         1
hx,ekg     Make new 0,1,2 variable and assume linearity:            5
           2 = hx and ekg not normal or benign, 1 = either,
           0 = none
sbp,dbp    Combine into mean arterial bp and use 3 knots:           4
           map = (2/3) dbp + (1/3) sbp
sg         Use 3 knots                                              1
sz         Use 3 knots                                              1
ap         Look at shape of effect of ap in detail, and take       -1
           log before expanding as spline to achieve numerical
           stability: add 1 knot

heart ← hx + ekg %nin% c ( ’ normal ’ , ’ benign ’)


label ( heart ) ← ’ Heart Disease Code ’
map ← (2 * dbp + sbp ) / 3
label ( map ) ← ’ Mean Arterial Pressure / 10 ’
dd ← datadist ( dd , heart , map )

f ← cph ( S ∼ rx + rcs ( age ,4) + rcs ( wt ,3) + pf.coded +


heart + rcs ( map ,3) + rcs ( hg ,4) +
rcs ( sg ,3) + rcs ( sz ,3) + rcs ( log ( ap ) ,5) + bm ,
x = TRUE , y = TRUE , surv = TRUE , time.inc =5 * 12)
print (f , coefs = FALSE )

Cox Proportional Hazards Model

cph(formula = S ~ rx + rcs(age, 4) + rcs(wt, 3) + pf.coded +


heart + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3) +
rcs(log(ap), 5) + bm, x = TRUE, y = TRUE, surv = TRUE, time.inc = 5 *
12)

                 Model Tests           Discrimination Indexes
Obs     502      LR χ² 118.37          R² 0.210
Events  354      d.f. 24               R²(24,502) 0.171
Center -2.4307   Pr(> χ²) 0.0000       R²(24,354) 0.234
                 Score χ² 125.58       Dxy 0.321
                 Pr(> χ²) 0.0000
# x , y for predict , validate , calibrate ;
# surv , time.inc for calibrate
anova ( f )

Wald Statistics for S



χ2 d.f. P
rx 8.01 3 0.0459
age 13.84 3 0.0031
Nonlinear 9.06 2 0.0108
wt 8.21 2 0.0165
Nonlinear 2.54 1 0.1110
pf.coded 3.79 1 0.0517
heart 23.51 1 <0.0001
map 0.04 2 0.9779
Nonlinear 0.04 1 0.8345
hg 12.52 3 0.0058
Nonlinear 8.25 2 0.0162
sg 1.64 2 0.4406
Nonlinear 0.05 1 0.8304
sz 12.73 2 0.0017
Nonlinear 0.06 1 0.7990
ap 6.51 4 0.1639
Nonlinear 6.22 3 0.1012
bm 0.03 1 0.8670
TOTAL NONLINEAR 23.81 11 0.0136
TOTAL 119.09 24 <0.0001

ˆ Savings of 12 d.f.

ˆ AIC = 70, shrinkage 0.80
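
A quick sketch of the arithmetic behind these two heuristics, using the LR χ² and d.f. from the two fits:

lr ← c ( full =136.22 , reduced =118.37 )
df ← c ( full =36 , reduced =24 )
rbind ( AIC = lr - 2 * df , shrinkage = ( lr - df ) / lr )   # 64.2, 70.4; 0.74, 0.80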



13.2 Checking Proportional Hazards

ˆ This is our tentative model

ˆ Examine distributional assumptions using scaled Schoenfeld residuals

ˆ Complication arising from predictors using multiple d.f.

ˆ Transform to 1 d.f. empirically using X β̂

ˆ cox.zph does this automatically

ˆ Following analysis is approximate since internal coefficients are estimated
z ← predict (f , type = ’ terms ’)
# required x=T above to store design matrix
f.short ← cph ( S ∼ z , x = TRUE , y = TRUE )
# store raw x, y so can get residuals

ˆ Fit f.short has the same LR χ² of 118 as the fit f, but with falsely low d.f.

ˆ All β = 1
phtest ← cox.zph (f , transform = ’ identity ’)
phtest

                    chisq df    p
rx               4.07e+00  3 0.25
rcs(age, 4)      4.27e+00  3 0.23
rcs(wt, 3)       2.22e-01  2 0.89
pf.coded         5.34e-02  1 0.82
heart            4.95e-01  1 0.48
rcs(map, 3)      3.20e+00  2 0.20
rcs(hg, 4)       5.26e+00  3 0.15
rcs(sg, 3)       1.01e+00  2 0.60
rcs(sz, 3)       3.07e-01  2 0.86
rcs(log(ap), 5)  3.59e+00  4 0.47
bm               2.11e-06  1 1.00
GLOBAL           2.30e+01 24 0.52

plot ( phtest [1]) # plot only the first variable

[Figure 13.1: scaled Schoenfeld residual plot; y-axis Beta(t) for rx (−15 to 10), x-axis Time (0–60).]
Figure 13.1: Raw and spline-smoothed scaled Schoenfeld residuals for dose of estrogen, nonlinearly coded from the Cox model fit,
with ± 2 standard errors.

ˆ None of the effects significantly change over time

ˆ Global test of PH P = 0.52



13.3 Testing Interactions

ˆ Will ignore non-PH for dose even though it makes sense

ˆ More accurate predictions could be obtained using stratification or time-dep. cov.

ˆ Test all interactions with dose; reduce to 1 d.f. as before
z.dose ← z [ , " rx " ]   # same as saying z[,1] - get first column
z.other ← z [ , -1 ]      # all but the first column of z
f.ia ← cph ( S ∼ z.dose * z.other )
print ( anova ( f.ia ) , size = ’ tsz ’)

Wald Statistics for S


χ2 d.f. P
z.dose (Factor+Higher Order Factors) 18.74 11 0.0660
All Interactions 12.17 10 0.2738
z.other (Factor+Higher Order Factors) 125.89 20 <0.0001
All Interactions 12.17 10 0.2738
z.dose × z.other (Factor+Higher Order Factors) 12.17 10 0.2738
TOTAL 129.10 21 <0.0001

13.4 Describing Predictor Effects

ˆ Plot relationship between each predictor and log λ
ggplot ( Predict ( f ) , sepdiscrete = ’ vertical ’ , nlevels =4 ,
        vnames = ’ names ’)   # Figure 13.2

[Figure 13.2: panels for age, ap, hg, map, sg, sz, wt (continuous) and bm, heart, pf.coded, rx (discrete); y-axis log Relative Hazard.]

Figure 13.2: Shape of each predictor on log hazard of death. Y -axis shows X β̂, but the predictors not plotted are set to reference
values. Note the highly non-monotonic relationship with ap, and the increased slope after age 70 which has been found in outcome
models for various diseases.

13.5 Validating the Model

ˆ Validate for Dxy and slope shrinkage

set.seed (1)   # so can reproduce results
v ← validate (f , B =300)
latex (v , file = ’ ’)

Index      Original  Training  Test    Optimism  Corrected  n
           Sample    Sample    Sample            Index
Dxy 0.3208 0.3494 0.2953 0.0541 0.2667 300
R2 0.2101 0.2481 0.1756 0.0724 0.1377 300
Slope 1.0000 1.0000 0.7863 0.2137 0.7863 300
D 0.0292 0.0354 0.0239 0.0116 0.0176 300
U −0.0005 −0.0005 0.0024 −0.0029 0.0024 300
Q 0.0297 0.0359 0.0215 0.0144 0.0153 300
g 0.7174 0.7999 0.6290 0.1708 0.5466 300

ˆ Shrinkage surprisingly close to heuristic estimate of 0.79

ˆ Now validate 5-year survival probability estimates
cal ← calibrate (f , B =300 , u =5 * 12 , maxdim =3)

Using Cox survival estimates at 60 Months

plot ( cal )

[Figure 13.3: calibration plot; y-axis Fraction Surviving 60 Months (0.0–0.7), x-axis Predicted 60 Month Survival (0.0–0.6); B=300; mean |error| = 0.035, 0.9 quantile = 0.057. Black: observed; gray: ideal; blue: optimism-corrected.]
Figure 13.3: Bootstrap estimate of calibration accuracy for 5-year estimates from the final Cox model, using adaptive linear spline
hazard regression. Line nearer the ideal line corresponds to apparent predictive accuracy. The blue curve corresponds to bootstrap-
corrected estimates.

13.6 Presenting the Model

ˆ Display hazard ratios, overriding the default range for ap
plot ( summary (f , ap = c (1 ,20) ) , log = TRUE , main = ’ ’)

[Figure 13.4: hazard ratio chart (x-axis 0.50–3.50, log scale) with comparisons age 76:70; wt 107:90; pf.coded 4:1; heart 2:0; map 11:9.33; hg 14.7:12.3; sg 11:9; sz 21:5; ap 20:1; bm 1:0; rx 0.2, 1.0, and 5.0 mg estrogen vs. placebo.]

Figure 13.4: Hazard ratios and multi-level confidence bars for effects of predictors in model, using default ranges except for ap

ˆ Draw nomogram, with predictions stated 4 ways
surv ← Survival ( f )
surv3 ← function ( x ) surv (3 * 12 , lp = x )
surv5 ← function ( x ) surv (5 * 12 , lp = x )
quan ← Quantile ( f )
med ← function ( x ) quan ( lp = x ) / 12
ss ← c ( .05 , .1 , .2 , .3 , .4 , .5 , .6 , .7 , .8 , .9 , .95 )
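
These functions can be evaluated directly; e.g., for a hypothetical linear predictor of 0:

c ( surv3 (0) , surv5 (0) , med (0) )   # 3y and 5y survival probabilities and median (years)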

nom ← nomogram (f , ap = c ( .1 , .5 ,1 ,2 ,3 ,4 ,5 ,10 ,20 ,30 ,40) ,


fun = list ( surv3 , surv5 , med ) ,
funlabel = c ( ’3 -year Survival ’ , ’5 -year Survival ’ ,
’ Median Survival Time ( years ) ’) ,
fun.at = list ( ss , ss , c ( .5 ,1:6) ) )
plot ( nom , xfrac = .65 , lmgp = .35 )

[Figure 13.5: nomogram with axes for Points; rx; Age in Years; Weight Index = wt(kg)−ht(cm)+200; pf.coded; Heart Disease Code; Mean Arterial Pressure/10; Serum Hemoglobin (g/100ml); Combined Index of Stage and Hist. Grade; Size of Primary Tumor (cm²); Serum Prostatic Acid Phosphatase; Bone Metastases; Total Points; Linear Predictor; 3-year Survival; 5-year Survival; Median Survival Time (years).]

Figure 13.5: Nomogram for predicting death in prostate cancer trial


Chapter 14

Semiparametric Ordinal Longitudinal Models

14.1 Longitudinal Ordinal Models as Unifying Concepts

The material in this section is taken from hbiostat.org/talks/rcteff.html. See also hbiostat.org/proj/covid19/ordmarkov.html.

14.1.1 General Outcome Attributes

ˆ Timing and severity of outcomes

ˆ Handle
  – terminal events (death)
  – non-terminal events (MI, stroke)
  – recurrent events (hospitalization)

ˆ Break the ties; the more levels of Y the better: fharrell.com/post/ordinal-info
  – Maximum power when there is only one patient at each level (continuous Y)

14.1.2 What is a Fundamental Outcome Assessment?

ˆ In a given week or day, what is the severity of the worst thing that happened to the patient?

ˆ Expert clinician consensus of outcome ranks

ˆ Spacing of outcome categories irrelevant

ˆ Avoids defining additive weights for multiple events in the same week

ˆ Events can be graded & can code common co-occurring events as the worse event

ˆ Can translate an ordinal longitudinal model to obtain a variety of estimates
  – time until a condition
  – expected time in state


ˆ Bayesian partial proportional odds model can compute the probability that the treatment affects mortality differently than it affects nonfatal outcomes

ˆ Model also elegantly handles partial information: at each day/week the ordinal Y can be left, right, or interval censored when a range of the scale was not measured

14.1.3 Examples of Longitudinal Ordinal Outcomes

ˆ 0=alive 1=dead
  – censored at 3w: 000
  – death at 2w: 01
  – longitudinal binary logistic model OR ≈ HR

ˆ 0=at home 1=hospitalized 2=MI 3=dead
  – hospitalized at 3w, rehosp at 7w, MI at 8w & stays in hosp, f/u ends at 10w: 0010001211

ˆ 0-6 QOL excellent–poor, 7=MI 8=stroke 9=dead
  – QOL varies, not assessed in 3w but pt event free, stroke at 8w, death 9w: 12[0-6]334589
    * MI status unknown at 7w: 12[0-6]334[5-7]89 [a]

[a] Better: treat the outcome as being in one of two non-contiguous values 5,7 instead of [5-7], but no software is currently available for this

  – Can make the first 200 levels be a continuous response variable and the remaining values represent clinical event overrides

14.1.4 Statistical Model

ˆ Proportional odds ordinal logistic model with covariate adjustment

ˆ Patient random effects (intercepts) handle intra-patient correlation

ˆ Better fitting: Markov model
  – handles absorbing states, extremely high day-to-day correlations within subject
  – faster, flexible, uses standard software
  – state transition probabilities
  – after fit, translate to unconditional state occupancy probabilities
  – use these to estimate expected time in a set of states (e.g., on ventilator or dead); restricted mean survival time without assuming PH

ˆ Extension of binary logistic model



ˆ Generalization of Wilcoxon-Mann-Whitney Two-Sample Test

ˆ No assumption about Y distribution for a given patient type

ˆ Does not use the numeric Y codes



14.2 Case Studies

ˆ Random effects model for continuous Y: Section 7.8.3

ˆ Markov model for continuous Y: Section 7.8.4

ˆ Multiple detailed case studies for discrete ordinal Y: hbiostat.org/proj/covid19
  – ORCHID: hydroxychloroquine for treatment of COVID-19 with patient assessment on select days
  – VIOLET: vitamin D for serious respiratory illness with assessment on 28 consecutive days
    * Large power gain demonstrated over time to recovery or ordinal status at a given day
    * Loosely speaking, serial assessments for each 5-day period had the same statistical information as a new patient assessed once
  – ACTT-1: NIH-NIAID Remdesivir study for treatment of COVID-19 with daily assessment while in hospital, select days after that with interval censoring
    * Assesses time-varying effect of remdesivir
    * Handles death explicitly, unlike per-patient time to recovery
  – Other: Bayesian and frequentist power simulation, exploration of unequal time gaps, etc.

14.3 Case Study For 4-Level Ordinal Longitudinal Outcome

ˆ VIOLET: randomized clinical trial of seriously ill adults in ICUs to study the effectiveness of vitamin D vs. placebo

ˆ Daily ordinal outcomes assessed for 28 days with very little missing data

ˆ Original paper: DOI:10.1056/NEJMoa1911124 focused on 1078 patients with confirmed baseline vitamin D deficiency

ˆ Focus on 1352 of the original 1360 randomized patients

ˆ Extensive re-analyses:
  – hbiostat.org/proj/covid19/violet2.html
  – hbiostat.org/proj/covid19/orchid.html
  – hbiostat.org/R/Hmisc/markov/sim.html

ˆ Fitted a frequentist first-order Markov partial proportional odds (PO) model to 1352 VIOLET patients using the R VGAM package to simulate 250,000 patient longitudinal records with daily assessments up to 28d: hbiostat.org/data/repo/simlongord.html

ˆ Simulation inserted an odds ratio of 0.75 for tx=1 : tx=0 (log OR = -0.288)

ˆ Case study uses the first 500 simulated patients
  – 13203 records
  – average of 26.4 records per patient out of a maximum of 28, due to deaths
  – full 250,000- and 500-patient datasets available at hbiostat.org/data

ˆ 4-level outcomes:
  – patient at home
  – in hospital or other health facility
  – on ventilator or diagnosed with acute respiratory distress syndrome (ARDS)
  – dead

ˆ Death is an absorbing state
  – only possible previous states are the first 3
  – at baseline no one was at home
  – a patient who dies has Dead as the status on their final record, with no “deaths carried forward”
  – later we will carry deaths forward just to be able to look at empirical state occupancy probabilities (SOPs) vs. model estimates

ˆ Frequentist modeling using the VGAM package allows us to use the unconstrained partial PO (PPO) model with regard to time, but does not allow us to compute uncertainty intervals for derived parameters (e.g., SOPs and mean time in states)

ˆ Can use the bootstrap to obtain approximate confidence limits (as below)

ˆ Bayesian analysis using the rmsb package provides exact uncertainty intervals for derived parameters, but at present rmsb only implements the constrained PPO model when getting predicted values

ˆ PPO for time allows the mix of outcomes to change over time (which occurred in the real data)

ˆ Model specification (a numeric sketch follows this list):
  – For day t let Y(t) denote the ordinal outcome for a patient:
    Pr(Y ≥ y | X, Y(t−1)) = expit(α_y + Xβ + g(Y(t−1), t))
  – g contains regression coefficients for the previous state Y(t−1) effect, the absolute time t effect, and any y-dependency of the t effect (non-PO for t)
  – Baseline covariates: age, SOFA score (a measure of organ function), treatment (tx)
  – Time-dependent covariate: previous state (yprev, 3 levels)
  – Time trend: linear spline with knot at day 2 (handles exception at day 1 when almost no one was sent home)
  – Changing mix of outcomes over time
    * effect of time on transition ORs for different cutoffs of Y
    * 2 time components (one slope change) × 3 Y cutoffs = 6 parameters related to day

ˆ Reverse coding of Y so that higher levels are worse
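
A minimal numeric sketch of one transition probability calculation under this model (all numbers are hypothetical, chosen to loosely echo the fit below; expit is plogis in R, and the day terms are omitted for brevity):

alpha ← c ( -5.7 , -14.0 , -21.6 )      # intercepts for Y>=2, Y>=3, Y>=4
lp    ← 9.0 + 0.011 * 60 + 0.060 * 5    # previous-state, age, and SOFA contributions
plogis ( alpha + lp )                   # P(Y>=2), P(Y>=3), P(Y>=4) for one day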

14.3.1 Descriptives

ˆ all state transitions from one day to the next

ˆ SOPs estimated by proportions (need to carry death forward)

require ( rms )

require ( data.table )

require ( VGAM )

knitrSet ( ’ markov ’)
getHdata ( simlongord500 )
d ← simlongord500
setDT (d , key = ’ id ’)
d[, y := factor (y , levels = rev ( levels ( y )))]
d [ , yprev := factor ( yprev , levels = rev ( levels ( yprev ) ) ) ]
setnames (d , ’ time ’ , ’ day ’)
# Show descriptive statistics for baseline data
latex ( describe ( d [ day == 1 , . ( yprev , age , sofa , tx ) ] , ’ Baseline Variables ’) , file = ’ ’)

Baseline Variables
4 Variables 500 Observations
yprev
n missing distinct
500 0 2

Value In Hospital/Facility Vent/ARDS


Frequency 340 160
Proportion 0.68 0.32
age
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
500 0 73 0.999 56.08 17.49 28.95 36.00 46.00 58.00 67.00 75.10 79.05

lowest : 19 20 21 22 23, highest: 89 90 91 92 94


sofa
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
500 0 18 0.992 5.274 3.901 0.00 1.00 3.00 5.00 7.25 10.00 11.00

lowest : 0 1 2 3 4, highest: 13 14 15 17 18
Value 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Frequency 38 38 41 51 59 53 58 37 24 32 32 17 8 4
Proportion 0.076 0.076 0.082 0.102 0.118 0.106 0.116 0.074 0.048 0.064 0.064 0.034 0.016 0.008
Value 14 15 17 18
Frequency 4 2 1 1
Proportion 0.008 0.004 0.002 0.002
tx
n missing distinct Info Sum Mean Gmd
500 0 2 0.75 256 0.512 0.5007

# Check that death can only occur on the last day


d [ , . ( ddif = if ( any ( y == ’ Dead ’) ) min ( day [ y == ’ Dead ’ ]) -
max ( day ) else NA_integer_ ) ,
by = id ][ , table ( ddif ) ]

ddif
  0
 43
propsTrans ( y ∼ day + id , data =d , maxsize =4 , arrow = ’- > ’) +
theme ( axis.text.x = element_text ( angle =90 , hjust =1) )

[Figure: transition proportions from Previous State (x) to Current State (y) among Home, In Hospital/Facility, Vent/ARDS, and Dead, one panel per consecutive-day pair (day 1 → 2 through day 27 → 28); symbol size proportional to the transition proportion.]

Show state occupancy proportions by creating a data table with death carried forward.
w ← d [ day < 28 & y == ’ Dead ’ , ]
w [ , if ( .N > 1) stop ( ’ Error : more than one death record ’) , by = id ]

Empty data . table (0 rows and 1 cols ): id

w ← w [ , . ( day = ( day + 1) : 28 , y = y , tx = tx ) , by = id ]
u ← rbind (d , w , fill = TRUE )
setkey (u , id )
u [ , Tx := paste0 ( ’ tx = ’ , tx ) ]

propsPO ( y ∼ day + Tx , data = u ) +


guides ( fill = guide_legend ( title = ’ Status ’) ) +
theme ( legend.position = ’ bottom ’ , axis.text.x = element_text ( angle =90 , hjust =1) )

[Figure: stacked bars of state occupancy proportions by day (1–28), one panel per treatment (tx=0, tx=1); Status legend: Dead, Vent/ARDS, In Hospital/Facility, Home.]

14.3.2 Model Fitting

ˆ Fit the PPO first-order Markov model without assuming PO for the time effect

ˆ Also fit a model that has linear splines with more knots to add flexibility in how time and baseline covariates are transformed

ˆ Disregard the terrible statistical practice of using asterisks to denote “significant” results
f ← vglm ( ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa + tx ,
cumulative ( reverse = TRUE , parallel = FALSE ∼ lsp ( day , 2) ) , data = d )

summary ( f )

Call :
vglm ( formula = ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa +
tx , family = cumulative ( reverse = TRUE , parallel = FALSE ∼
lsp ( day , 2)) , data = d )

Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ):1 -5.685047 0.586907 -9.686 < 2e -16 ***
( Intercept ):2 -13.982588 0.558102 -25.054 < 2e -16 ***
( Intercept ):3 -21.626076 1.495433 -14.461 < 2e -16 ***
yprevIn Hospital / Facility 9.010046 0.288818 31.196 < 2e -16 ***
yprevVent / ARDS 15.201826 0.336292 45.204 < 2e -16 ***
lsp ( day , 2) day :1 -1.119279 0.254831 -4.392 1.12 e -05 ***
lsp ( day , 2) day :2 -0.268585 0.234568 -1.145 0.252
lsp ( day , 2) day :3 1.150635 0.750875 1.532 0.125
lsp ( day , 2) day ’:1 1.170655 0.256770 4.559 5.14 e -06 ***
lsp ( day , 2) day ’:2 0.294863 0.238611 1.236 0.217
lsp ( day , 2) day ’:3 -1.141568 0.756102 -1.510 0.131
age 0.010996 0.002814 3.907 9.34 e -05 ***
sofa 0.060413 0.012562 4.809 1.52 e -06 ***
tx -0.350262 0.084709 -4.135 3.55 e -05 ***
---
Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1

Names of linear predictors : logitlink ( P [Y >=2]) , logitlink ( P [Y >=3]) ,


logitlink ( P [Y >=4])

Residual deviance : 4534.299 on 39595 degrees of freedom

Log - likelihood : -2267.15 on 39595 degrees of freedom

Number of Fisher scoring iterations : 9

Warning : Hauck - Donner effect detected in the following estimate ( s ):


’( Intercept ):3 ’

Exponentiated coefficients :
yprevIn Hospital / Facility yprevVent / ARDS lsp ( day , 2) day :1
8.184899 e +03 4.000083 e +06 3.265152 e -01
lsp ( day , 2) day :2 lsp ( day , 2) day :3 lsp ( day , 2) day ’:1
7.644601 e -01 3.160199 e +00 3.224103 e +00
lsp ( day , 2) day ’:2 lsp ( day , 2) day ’:3 age
1.342943 e +00 3.193179 e -01 1.011057 e +00
sofa tx
1.062275 e +00 7.045038 e -01

# Note : vglm will handle rcs () but not in getting predictions since
# it doesn ’ t know where to find the computed knot locations
# Linear splines have knots explicitly stated at all times
g ← vglm ( ordered ( y ) ∼ yprev + lsp ( day , c (2 , 4 , 8 , 15) ) +
lsp ( age , c (35 , 60 , 75) ) + lsp ( sofa , c (2 , 6 , 10) ) + tx ,
cumulative ( reverse = TRUE , parallel = FALSE ∼ lsp ( day , c (2 , 4 , 8 , 15) ) ) ,
data = d )

summary ( g )

Call :
vglm ( formula = ordered ( y ) ∼ yprev + lsp ( day , c (2 , 4 , 8 , 15)) +
    lsp ( age , c (35 , 60 , 75)) + lsp ( sofa , c (2 , 6 , 10)) + tx ,
    family = cumulative ( reverse = TRUE , parallel = FALSE ∼ lsp ( day , c (2 , 4 , 8 , 15))) ,
    data = d )

Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ):1 -6.999641 0.853756 -8.199 2.43 e -16 ***
( Intercept ):2 -14.999370 0.831770 -18.033 < 2e -16 ***
( Intercept ):3 -23.264014 1.634479 -14.233 < 2e -16 ***
yprevIn Hospital / Facility 9.050121 0.295104 30.668 < 2e -16 ***
yprevVent / ARDS 15.272417 0.342349 44.611 < 2e -16 ***
lsp ( day , c (2 , 4 , 8 , 15)) day :1 -1.032077 0.282930 -3.648 0.000264 ***
lsp ( day , c (2 , 4 , 8 , 15)) day :2 -0.484565 0.277909 -1.744 0.081227 .
lsp ( day , c (2 , 4 , 8 , 15)) day :3 1.549806 0.793898 1.952 0.050921 .
lsp ( day , c (2 , 4 , 8 , 15)) day ’:1 1.064104 0.339577 3.134 0.001727 **
lsp ( day , c (2 , 4 , 8 , 15)) day ’:2 0.714003 0.374521 1.906 0.056593 .
lsp ( day , c (2 , 4 , 8 , 15)) day ’:3 -1.713117 0.925886 -1.850 0.064278 .
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:1 -0.020867 0.141055 -0.148 0.882394
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:2 -0.281249 0.206644 -1.361 0.173504
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:3 -0.032936 0.424279 -0.078 0.938123
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:1 0.055257 0.075085 0.736 0.461776
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:2 0.127460 0.116354 1.095 0.273320
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:3 0.323703 0.273592 1.183 0.236746
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:1 -0.006121 0.052947 -0.116 0.907968
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:2 -0.092095 0.078118 -1.179 0.238428
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:3 -0.106570 0.160922 -0.662 0.507815
lsp ( age , c (35 , 60 , 75)) age 0.042333 0.018820 2.249 0.024493 *
lsp ( age , c (35 , 60 , 75)) age ’ -0.038028 0.023118 -1.645 0.099985 .
lsp ( age , c (35 , 60 , 75)) age ’ ’ 0.010658 0.015805 0.674 0.500104
lsp ( age , c (35 , 60 , 75)) age ’ ’ ’ -0.017427 0.026048 -0.669 0.503472
lsp ( sofa , c (2 , 6 , 10)) sofa 0.199993 0.096401 2.075 0.038025 *
lsp ( sofa , c (2 , 6 , 10)) sofa ’ -0.132951 0.122966 -1.081 0.279605
lsp ( sofa , c (2 , 6 , 10)) sofa ’ ’ -0.045706 0.069159 -0.661 0.508686
lsp ( sofa , c (2 , 6 , 10)) sofa ’ ’ ’ 0.043445 0.081920 0.530 0.595880
tx -0.357512 0.085314 -4.191 2.78 e -05 ***
---
Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1

Names of linear predictors : logitlink ( P [Y >=2]) , logitlink ( P [Y >=3]) ,


logitlink ( P [Y >=4])

Residual deviance : 4518.99 on 39580 degrees of freedom

Log - likelihood : -2259.495 on 39580 degrees of freedom

Number of Fisher scoring iterations : 9

Warning : Hauck - Donner effect detected in the following estimate ( s ):


’( Intercept ):3 ’

Exponentiated coefficients :
yprevIn Hospital / Facility yprevVent / ARDS

8.519570 e +03 4.292659 e +06


lsp ( day , c (2 , 4 , 8 , 15)) day :1 lsp ( day , c (2 , 4 , 8 , 15)) day :2
3.562661 e -01 6.159648 e -01
lsp ( day , c (2 , 4 , 8 , 15)) day :3 lsp ( day , c (2 , 4 , 8 , 15)) day ’:1
4.710557 e +00 2.898242 e +00
lsp ( day , c (2 , 4 , 8 , 15)) day ’:2 lsp ( day , c (2 , 4 , 8 , 15)) day ’:3
2.042150 e +00 1.803029 e -01
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:1 lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:2
9.793494 e -01 7.548407 e -01
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:3 lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:1
9.676000 e -01 1.056812 e +00
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:2 lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:3
1.135939 e +00 1.382237 e +00
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:1 lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:2
9.938979 e -01 9.120181 e -01
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:3 lsp ( age , c (35 , 60 , 75)) age
8.989124 e -01 1.043242 e +00
lsp ( age , c (35 , 60 , 75)) age ’ lsp ( age , c (35 , 60 , 75)) age ’ ’
9.626862 e -01 1.010715 e +00
lsp ( age , c (35 , 60 , 75)) age ’ ’ ’ lsp ( sofa , c (2 , 6 , 10)) sofa
9.827238 e -01 1.221394 e +00
lsp ( sofa , c (2 , 6 , 10)) sofa ’ lsp ( sofa , c (2 , 6 , 10)) sofa ’ ’
8.755077 e -01 9.553224 e -01
lsp ( sofa , c (2 , 6 , 10)) sofa ’ ’ ’ tx
1.044402 e +00 6.994145 e -01

lrtest (g , f )

Likelihood ratio test

Model 1: ordered ( y ) ∼ yprev + lsp ( day , c (2 , 4 , 8 , 15)) + lsp ( age , c (35 ,


60 , 75)) + lsp ( sofa , c (2 , 6 , 10)) + tx
Model 2: ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa + tx
# Df LogLik Df Chisq Pr ( > Chisq )
1 39580 -2259.5
2 39595 -2267.2 15 15.309 0.4294

AIC ( f ) ; AIC ( g )

[1] 4562.299

[1] 4576.99
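
As a check, vglm’s AIC here is the residual deviance plus twice the number of estimated coefficients:

4534.299 + 2 * 14   # fit f has 14 coefficients -> 4562.299
4518.990 + 2 * 29   # fit g has 29 coefficients -> 4576.99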

We will use the simpler model, which has the better (smaller) AIC. Check the PO assumption on time by comparing the simpler model’s AIC to the AIC from a fully PO model.
h ← vglm ( ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa + tx ,
cumulative ( reverse = TRUE , parallel = TRUE ) , data = d )
lrtest (f , h )

Likelihood ratio test

Model 1: ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa + tx


Model 2: ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa + tx
# Df LogLik Df Chisq Pr ( > Chisq )
1 39595 -2267.2
2 39599 -2275.2 4 16.147 0.002828 **
---
Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1

AIC ( f ) ; AIC ( h )

[1] 4562.299

[1] 4570.447

The model allowing for non-PO in time is better. Now show Wald tests on the parameters.
wald ← function ( f ) {
se ← sqrt ( diag ( vcov ( f ) ) )
s ← round ( cbind ( beta = coef ( f ) , SE = se , Z = coef ( f ) / se ) , 3)
a ← c ( ’ ≥ in hospital / facility ’ , ’ ≥ vent / ARDS ’ , ’ dead ’ ,
’ previous state in hospital / facility ’ ,
’ previous state vent / ARDS ’ ,
’ initial slope for day , ≥ hospital / facility ’ ,
’ initial slope for day , ≥ vent / ARDS ’ ,
’ initial slope for day , dead ’ ,
’ slope increment , ≥ hospital / facility ’ ,
’ slope increment , ≥ vent / ARDS ’ ,
’ slope increment , dead ’ ,
’ baseline age linear effect ’ ,
’ baseline SOFA score linear effect ’ ,
’ treatment log OR ’)
rownames ( s ) ← a
s
}
wald ( f )

beta SE Z
>= in hospital / facility -5.685 0.587 -9.686
>= vent / ARDS -13.983 0.558 -25.054
dead -21.626 1.495 -14.461
previous state in hospital / facility 9.010 0.289 31.196
previous state vent / ARDS 15.202 0.336 45.204
initial slope for day , >= hospital / facility -1.119 0.255 -4.392
initial slope for day , >= vent / ARDS -0.269 0.235 -1.145
initial slope for day , dead 1.151 0.751 1.532
slope increment , >= hospital / facility 1.171 0.257 4.559
slope increment , >= vent / ARDS 0.295 0.239 1.236
slope increment , dead -1.142 0.756 -1.510
baseline age linear effect 0.011 0.003 3.907
baseline SOFA score linear effect 0.060 0.013 4.809
treatment log OR -0.350 0.085 -4.135

We see evidence for a benefit of treatment. Compute the treatment transition OR and an approximate 0.95 confidence interval.
lor ← coef ( f ) [ ’ tx ’]
se ← sqrt ( vcov ( f ) [ ’ tx ’ , ’ tx ’ ])
b ← exp ( lor + qnorm (0 .975 ) * se * c (0 , -1 , 1) )
names ( b ) ← c ( ’ OR ’ , ’ Lower ’ , ’ Upper ’)
round (b , 3)

OR Lower Upper
0.705 0.597 0.832

The maximum likelihood estimate of the OR compares favorably with the true OR of 0.75 on which the simulations were based.
14.3.3 Covariate Effects

ˆ Most interesting covariate effect is the effect of time since randomization

ˆ Show estimated time trends in relative log odds of transition probabilities

ˆ Other variables set to median/mode, and using tx=0
w ← d [ day == 1]
dat ← expand.grid ( yprev = ’ In Hospital / Facility ’ , age = median ( w $ age ) ,
sofa = median ( w $ sofa ) , tx =0 , day =1:28)
ltrans ← function ( fit , mod ) {
  p ← predict ( fit , dat )
  u ← data.frame ( day = as.vector ( row ( p ) ) , y = as.vector ( col ( p ) ) ,
                   logit = as.vector ( p ) )
  u $ y ← factor ( u $ y , 1:3 , paste ( ’ ≥ ’ , levels ( d $ y ) [ -1 ]) )
  u $ mod ← mod
  u
}
u ← rbind ( ltrans (f , ’ model with few knots ’) ,
ltrans (g , ’ model with more knots ’) )
ggplot (u , aes ( x = day , y = logit , color = y ) ) + geom_line () +

  facet_wrap (∼ mod , ncol =2) +
  xlab ( ’ Day ’) + ylab ( ’ Log Odds ’) +
  labs ( caption = ’ Relative log odds of transitioning from in hospital / facility to indicated status ’)

[Figure: two panels (model with few knots; model with more knots) of Log Odds vs. Day for y ≥ In Hospital/Facility, ≥ Vent/ARDS, ≥ Dead. Caption: Relative log odds of transitioning from in hospital/facility to indicated status.]

14.3.4 Correlation Structure

ˆ The data were simulated under a first-order Markov process, so it doesn’t make sense to check correlation pattern assumptions for our model

ˆ When the simulated data were created, the within-patient correlation pattern was checked against the pattern from the fitted model by simulating a large trial from the model fit and comparing correlations in the simulated data to those in the real data

ˆ It showed excellent agreement

ˆ Let’s compute the Spearman ρ correlation matrix on the 500-patient dataset and show the matrix from the real data next to it

ˆ Delete day 28 from the new correlation matrix to conform with the correlation matrix computed on the real data

ˆ Also show the correlation matrix from a 10,000 patient sample

ˆ Heights of bars are proportional to Spearman ρ

# Tall and thin -> short and wide data table
w ← dcast ( d [ , . ( id , day , y = as.numeric ( y ) ) ] ,
           id ∼ day , value.var = ’y ’)
r ← cor ( as.matrix ( w [ , -1 ]) , use = ’ pairwise.complete.obs ’ ,
         method = ’ spearman ’) [ -28 , -28 ]
p ← plotCorrM (r , xangle =90)
p [[1]] + theme ( legend.position = ’ none ’) +
  labs ( caption = ’ Spearman correlation matrix from 500 patient dataset ’)

[Correlation matrix plot, days 1–27: max |r| = 0.993, min |r| = 0.196.]
Spearman correlation matrix from 500 patient dataset

vcorr ← readRDS ( ’∼/ R / examples / simMarkovOrd / vcorr.rds ’)


ra ← vcorr $ r.actual
plotCorrM ( ra , xangle =90) [[1]] + theme ( legend.position = ’ none ’) +
labs ( caption = ’ Spearman correlation matrix from actual data ’)
[Correlation matrix plot, days 1–27: max |r| = 0.992, min |r| = 0.307.]
Spearman correlation matrix from actual data

rc ← vcorr $ r.simulated
plotCorrM ( rc , xangle =90) [[1]] + theme ( legend.position = ’ none ’) +
labs ( caption = ’ Spearman correlation matrix from 10 ,000 simulated patients ’)

[Correlation matrix plot, days 1–27: max |r| = 0.988, min |r| = 0.404.]

Spearman correlation matrix from 10,000 simulated patients

ˆ Estimating the whole correlation matrix from 500 patients is noisy

ˆ Compute the mean absolute difference between two correlation matrices (the first on 10,000 simulated patients assuming a first-order Markov process and the second from the real data)

ˆ Compute means of mean absolute differences stratified by the day involved
ad ← abs ( rc [ -28 , -28 ] - ra )
round ( apply ( ad , 1 , mean ) , 2)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0.10 0.06 0.05 0.05 0.04 0.04 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02
17 18 19 20 21 22 23 24 25 26 27
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02

Actual and simulated within-patient correlations agree well except when day 1 is involved.

Look at a variogram-like graph, especially to see the intra-patient correlations in the raw data as a function of days between two measurements (the x-axis).
p [[2]]

[Variogram-like plot: Correlation (0.25–1.00) vs. Absolute Time Difference (0–27 days).]

ˆ Usual serial correlation declining pattern; outcome status values become less correlated within patient as the time gap increases

ˆ Also see non-isotropic pattern: correlations depend also on absolute time, not just the gap

Formal Goodness of Fit Assessments for Correlation Structure

ˆ Data simulation model → we already know that the first-order Markov process has to fit

ˆ Do two formal assessments to demonstrate how this can be done in general. Both make the correlation structure more versatile.
  – Add patient-specific intercepts to see if a compound symmetry structure adds anything to the first-order Markov structure
  – Add a dependency on the state before last, in addition to our model’s dependency on the last state, to see if a second-order Markov process fits better

Add Random Effects

ˆ Bayesian models handle random effects more naturally than frequentist models → use a Bayesian partial PO first-order Markov model (R rmsb package)
require ( rmsb )
options ( prType = ’ latex ’)
stanSet ()   # sets to use all but one core

seed ← 2     # the following took 15m using 4 cores
b ← blrm ( y ∼ yprev + lsp ( day , 2) + age + sofa + tx + cluster ( id ) ,
           ∼ lsp ( day , 2) , data =d , file = ’ bppo.rds ’)
stanDx ( b )

Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved

For each parameter , n_eff is a crude measure of effective sample size


and Rhat is the potential scale reduction factor on split chains
( at convergence , Rhat =1)

n_eff Rhat
y >= In Hospital / Facility 2283 1.000
y >= Vent / ARDS 2162 1.002
y >= Dead 1885 1.002
yprev = In Hospital / Facility 2248 1.001
yprev = Vent / ARDS 2715 1.001
day 1027 1.003
day ’ 2434 1.000
age 3728 1.000
sofa 1677 0.999
tx 3281 1.002
day :y >= Vent / ARDS 328 1.010
day ’: y >= Vent / ARDS 3066 1.000
day :y >= Dead 2602 1.000
day ’: y >= Dead 2228 1.001
sigmag 2324 1.002

Bayesian Partial Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.455 for Intercepts

blrm(formula = y ~ yprev + lsp(day, 2) + age + sofa + tx + cluster(id),


ppo = ~lsp(day, 2), data = d, file = "bppo.rds")

Frequencies of Responses

Home In Hospital/Facility Vent/ARDS


8231 3995 934
Dead
43

        Mixed Calibration/        Discrimination    Rank Discrim.
        Discrimination Indexes    Indexes           Indexes
Obs 13203 B 0.028 [0.028, 0.028] g 5.337 [5.071, 5.61] C 0.983 [0.983, 0.983]
Draws 4000 gp 0.455 [0.452, 0.458] Dxy 0.966 [0.965, 0.966]
Chains 4 EV 0.878 [0.868, 0.889]
p 7 v 26.27 [23.574, 28.598]
Cluster on id vp 0.206 [0.203, 0.209]
Clusters 500
σγ 0.2176 [0.0011, 0.4764]

Mean β̂ Median β̂ S.E. Lower Upper Pr(β > 0) Symmetry


y≥In Hospital/Facility -5.3659 -5.3796 0.5901 -6.3794 -4.0985 0.0000 1.09
y≥Vent/ARDS -13.6426 -13.6384 0.5567 -14.7277 -12.5501 0.0000 0.94
y≥Dead -21.7289 -21.6025 1.5831 -24.9182 -18.8302 0.0000 0.73
yprev=In Hospital/Facility 8.7224 8.7131 0.2699 8.1969 9.2485 1.0000 1.15
yprev=Vent/ARDS 14.9042 14.8973 0.3249 14.2721 15.5421 1.0000 1.09
day -1.1320 -1.1266 0.2582 -1.6088 -0.6265 0.0000 0.93
day’ 1.1775 1.1740 0.2600 0.6470 1.6424 1.0000 1.07
age 0.0112 0.0112 0.0030 0.0054 0.0172 0.9998 0.99
sofa 0.0638 0.0634 0.0141 0.0358 0.0915 1.0000 1.11
tx -0.3588 -0.3578 0.0884 -0.5225 -0.1802 0.0000 0.98
day:y≥Vent/ARDS 0.8217 0.8180 0.3532 0.1661 1.5277 0.9928 1.05
day’:y≥Vent/ARDS -0.8457 -0.8411 0.3568 -1.5655 -0.1914 0.0058 0.96
day:y≥Dead 2.4244 2.3405 0.8416 0.8711 4.0675 1.0000 1.33
day’:y≥Dead -2.4635 -2.3856 0.8470 -4.1387 -0.9151 0.0000 0.76

ˆ Note that blrm parameterizes the partial PO parameters differently than vglm.

ˆ Posterior median of the standard deviation of the random effects σγ is 0.22

ˆ This is quite small on the logit scale, in which most of the action takes place in [−4, 4]

ˆ Random intercepts add an inconsequential improvement in the fit, justifying the Markov process’ conditional (on prior state) independence assumption
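
To see how small this is on the probability scale: perturbing a logit of 0 by ±1 SD of the random effects moves a probability of 0.5 by only about 0.055:

plogis ( c ( -0.22 , 0.22 ) )   # about 0.445 and 0.555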

Second-order Markov Process

ˆ On follow-up days 2-28

ˆ Frequentist partial PO model

# Derive time-before-last states ( lag-1 yprev )
h ← d [ , yprev2 := shift ( yprev ) , by = id ]
h ← h [ day > 1 , ]

# Fit first-order model ignoring day 1 so can compare to second-order
# We have to make time linear since no day 1 data
f1 ← vglm ( ordered ( y ) ∼ yprev + day + age + sofa + tx ,
cumulative ( reverse = TRUE , parallel = FALSE ∼ day ) , data = h )
f2 ← vglm ( ordered ( y ) ∼ yprev + yprev2 + day + age + sofa + tx ,
cumulative ( reverse = TRUE , parallel = FALSE ∼ day ) , data = h )
lrtest ( f2 , f1 )

Likelihood ratio test

Model 1: ordered ( y ) ∼ yprev + yprev2 + day + age + sofa + tx


Model 2: ordered ( y ) ∼ yprev + day + age + sofa + tx
# Df LogLik Df Chisq Pr ( > Chisq )
1 38096 -2098.2
2 38098 -2098.5 2 0.572 0.7512

AIC ( f1 ) ; AIC ( f2 )

[1] 4218.93

[1] 4222.358

ˆ First-order model has better fit “for the money” by AIC

ˆ Formal chunk test of second-order terms not impressive

14.3.5 Computing Derived Quantities

From the fitted Markov state transition model, compute for one covariate setting and two treatments:

ˆ state occupancy probabilities

ˆ mean time in state

ˆ differences between treatments in mean time in state

To specify the covariate setting:

ˆ most common initial state is In Hospital/Facility, so use that

ˆ within that category, look at the relationship between the two covariates

ˆ they have no correlation, so use the individual medians
istate ← ’ In Hospital / Facility ’
w ← d [ day == 1 & yprev == istate , ]
w [ , cor ( age , sofa , method = ’ spearman ’) ]

[1] 0.01651803

x ← w [ , lapply ( .SD , median ) , .SDcols = Cs ( age , sofa ) ]


adjto ← x [ , paste0 ( ’ age = ’ , x [ , age ] , ’ sofa = ’ , x [ , sofa ] ,
’ initial state = ’ , istate ) ]
# Expand to cover both treatments and initial state
x ← cbind ( tx =0:1 , yprev = istate , x )
x

tx yprev age sofa


1: 0 In Hospital / Facility 56 5
2: 1 In Hospital / Facility 56 5

Compute all SOPs for each treatment. soprobMarkovOrdm is in Hmisc.
S ← z ← NULL
for ( Tx in 0:1) {
s ← soprobMarkovOrdm (f , x [ tx == Tx , ] , times =1:28 , ylevels = levels ( d $ y ) ,
absorb = ’ Dead ’ , tvarname = ’ day ’)
S ← rbind (S , cbind ( tx = Tx , s ) )
u ← data.frame ( day = as.vector ( row ( s ) ) , y = as.vector ( col ( s ) ) , p = as.vector ( s ) )
u $ tx ← Tx
z ← rbind (z , u )
}
z $ y ← factor ( z $y , 1:4 , levels ( d $ y ) )
revo ← function ( z ) {
z ← as.factor ( z )
factor (z , levels = rev ( levels ( as.factor ( z ) ) ) )
}
ggplot (z , aes ( x = factor ( day ) , y =p , fill = revo ( y ) ) ) +
facet_wrap (∼ paste ( ’ Treatment ’ , tx ) , nrow =1) + geom_col () +
xlab ( ’ Day ’) + ylab ( ’ Probability ’) +
guides ( fill = guide_legend ( title = ’ Status ’) ) +
labs ( caption = paste0 ( ’ Estimated state occupancy probabilities for \ n ’ ,
adjto ) ) +

theme ( legend.position = ’ bottom ’ ,


axis.text.x = element_text ( angle =90 , hjust =1) )

[Figure: stacked bars of estimated state occupancy probabilities by day (1–28), panels Treatment 0 and Treatment 1; Status: Dead, Vent/ARDS, In Hospital/Facility, Home. Caption: Estimated state occupancy probabilities for age=56, sofa=5, initial state=In Hospital/Facility.]

Compute by treatment the mean time unwell (expected number of days not at home). Expected days in a state is simply the sum over days of the daily probabilities of being in that state.
mtu ← tapply (1 . - S [ , ’ Home ’] , S [ , ’ tx ’] , sum )
dmtu ← diff ( mtu )
w ← c ( mtu , dmtu )
names ( w ) ← c ( ’ tx =0 ’ , ’ tx =1 ’ , ’ Days Difference ’)
w

tx =0 tx =1 Days Difference
10.742602 7.768658 -2.973943

We estimate that patients on treatment 1 have about 3 fewer days unwell than those on treatment 0 for the given covariate settings. Do a similar calculation for the expected number of days alive out of 28 days (similar to restricted mean survival time).
mta ← tapply(1. - S[, 'Dead'], S[, 'tx'], sum)
w ← c(mta, diff(mta))
names(w) ← c('tx=0', 'tx=1', 'Days Difference')
w

     tx=0      tx=1 Days Difference
27.585079 27.844274        0.259195

14.3.6 Bootstrap Confidence Interval for Difference in Mean Time Unwell

• Need to sample with replacement from patients, not records

  – code taken from the rms package's bootcov function

  – sampling patients entails including some patients multiple times and
    omitting others

  – save all the record numbers, group them by patient ID, sample from these
    IDs, then use all the original records whose record numbers correspond to
    the sampled IDs

• Use the basic bootstrap to get 0.95 confidence intervals

• Speed up the model fit by having each bootstrap fit use as starting
  parameter estimates the values from the original data fit
B ← 500                       # number of bootstrap resamples
recno ← split(1 : nrow(d), d$id)
npt ← length(recno)           # 500
startbeta ← coef(f)
seed ← 3
if(file.exists('boot.rds')) {
  z ← readRDS('boot.rds')
  betas ← z$betas
  diffmean ← z$diffmean
} else {
  betas ← diffmean ← numeric(B)
  ylev ← levels(d$y)
  for(i in 1 : B) {
    j ← unlist(recno[sample(npt, npt, replace=TRUE)])
    g ← vglm(ordered(y) ∼ yprev + lsp(day, 2) + age + sofa + tx,
             cumulative(reverse=TRUE, parallel=FALSE ∼ lsp(day, 2)),
             coefstart=startbeta, data=d[j, ])
    betas[i] ← coef(g)['tx']
    s0 ← soprobMarkovOrdm(g, x[tx == 0, ], times=1:28, ylevels=ylev,
                          absorb='Dead', tvarname='day')
    s1 ← soprobMarkovOrdm(g, x[tx == 1, ], times=1:28, ylevels=ylev,
                          absorb='Dead', tvarname='day')
    # P(not at home) = 1 - P(home); sum these probs to get E[days]
    mtud ← sum(1. - s1[, 'Home']) - sum(1. - s0[, 'Home'])
    diffmean[i] ← mtud
  }
  saveRDS(list(betas=betas, diffmean=diffmean), 'boot.rds', compress='xz')
}

See how bootstrap treatment log ORs relate to differences in days unwell.

ggfreqScatter(betas, diffmean,
              xlab='Log OR', ylab='Difference in Mean Days Unwell')

[Figure: frequency scatterplot of bootstrap Log OR (x-axis, roughly −0.60 to
−0.10) vs. Difference in Mean Days Unwell (y-axis, roughly −5.0 to −1.0)]

Compute basic bootstrap 0.95 confidence intervals for the OR and the
difference in mean time.
# bootBCa is in the rms package and uses the boot package
clb ← exp(bootBCa(coef(f)['tx'], betas, seed=seed, n=npt, type='basic'))
clm ← bootBCa(dmtu, diffmean, seed=seed, n=npt, type='basic')
a ← round(c(clb, clm), 3)[c(1,3,2,4)]
data.frame(Quantity=c('OR', 'Difference in mean days unwell'),
           Lower=a[1:2], Upper=a[3:4])

                        Quantity  Lower  Upper
1                             OR  0.584  0.853
2 Difference in mean days unwell -4.609 -1.286
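
As a sketch of what type='basic' computes (not the exact internals of
bootBCa, which also manages the random number seed): the basic bootstrap
interval reflects the bootstrap percentile interval about the original point
estimate, i.e., (2θ̂ - upper quantile, 2θ̂ - lower quantile).

basicCI ← function(theta, est, conf=0.95) {
  a ← (1 - conf) / 2
  q ← quantile(est, c(1 - a, a), na.rm=TRUE)  # (upper, lower) bootstrap quantiles
  unname(2 * theta - q)                       # reflected (Lower, Upper) limits
}
basicCI(dmtu, diffmean)               # difference in mean days unwell
exp(basicCI(coef(f)['tx'], betas))    # treatment OR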

14.3.7 Notes on Inference

• Differences between treatments in mean time in state(s) are zero if and
  only if the treatment OR=1

  – note agreement in bootstrap estimates (see the sketch after this list)

  – will not necessarily be true if PO is relaxed for treatment

  – inference about any treatment effect is the same for all covariate
    settings that do not interact with treatment

  – → p-values are the same for the two metrics, and Bayesian posterior
    probabilities are also identical

• Bayesian posterior probabilities for mean time in state > ε, for ε > 0,
  will vary with covariate settings (sicker patients at baseline have more
  room to move)
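
If the treatment log OR and the difference in mean time unwell carry the same
information about the treatment effect, their bootstrap replicates should be
almost monotonically related. A minimal check (a sketch using the betas and
diffmean vectors computed above; not part of the original analysis):

# Rank correlation between bootstrap log ORs and bootstrap differences in
# mean days unwell; a strong positive rank correlation is expected if the
# two effect metrics agree
cor(betas, diffmean, method='spearman')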
Annotated Bibliography

[1] Paul D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences,
07-136. Thousand Oaks CA: Sage, 2001 (cit. on pp. 3-1, 3-5).
[2] D. G. Altman. “Categorising Continuous Covariates (Letter to the Editor)”. In: Brit J Cancer 64 (1991), p. 975
(cit. on p. 2-13).
[3] D. G. Altman and P. K. Andersen. “Bootstrap Investigation of the Stability of a Cox Regression Model”. In: Stat
Med 8 (1989), pp. 771–783 (cit. on p. 4-16).
[4] D. G. Altman et al. “Dangers of Using ‘optimal’ Cutpoints in the Evaluation of Prognostic Factors”. In: J Nat
Cancer Inst 86 (1994), pp. 829–835 (cit. on pp. 2-13, 2-15).
[5] Douglas G. Altman. “Suboptimal Analysis Using ‘optimal’ Cutpoints”. In: Brit J Cancer 78 (1998), pp. 556–557
(cit. on p. 2-13).
[6] B. G. Armstrong and M. Sloan. “Ordinal Regression Models for Epidemiologic Data”. In: Am J Epi 129 (1989),
pp. 191–204 (cit. on p. 10-18).
See letter to editor by Peterson.
[7] A. C. Atkinson.“A Note on the Generalized Information Criterion for Choice of a Model”. In: Biometrika 67 (1980),
pp. 413–418 (cit. on pp. 2-29, 4-15).
[8] Peter C. Austin. “Bootstrap Model Selection Had Similar Performance for Selecting Authentic and Noise Variables
Compared to Backward Variable Elimination: A Simulation Study”. In: J Clin Epi 61 (2008). ”in general, a bootstrap
model selection method had comparable performance to conventional backward variable elimination for identifying
the true regression model. In most settings, both methods performed poorly at correctly identifying the correct
regression model.”, pp. 1009–1017 (cit. on p. 4-16).
[9] Peter C. Austin and Ewout W. Steyerberg.“The Integrated Calibration Index (ICI) and Related Metrics for Quanti-
fying the Calibration of Logistic Regression Models”. In: Statistics in Medicine 38.21 (2019), pp. 4051–4065. issn:
1097-0258. doi: 10.1002/sim.8281. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.8281
(visited on 08/10/2019) (cit. on p. 8-40).
[10] Peter C. Austin, Jack V. Tu, and Douglas S. Lee. “Logistic Regression Had Superior Performance Compared with
Regression Trees for Predicting In-Hospital Mortality in Patients Hospitalized with Heart Failure”. In: J Clin Epi 63
(2010). ROC areas for logistic models varied from 0.747 to 0.775 whereas they varied from 0.620-0.651 for recursive
partitioning;repeated data simulation showed large variation in tree structure, pp. 1145–1155 (cit. on p. 2-35).
[11] Sunni A. Barnes, Stacy R. Lindborg, and John W. Seaman.“Multiple Imputation Techniques in Small Sample Clinical
Trials”. In: Stat Med 25 (2006). bad performance of LOCF including high bias and poor confidence interval cov-
erage;simulation setup;longitudinal data;serial data;RCT;dropout;assumed missing at random (MAR);approximate
Bayesian bootstrap;Bayesian least squares;missing data;nice background summary;new completion score method
based on fitting a Poisson model for the number of completed clinic visits and using donors and approximate
Bayesian bootstrap, pp. 233–245 (cit. on p. 3-15).


[12] Federica Barzi and Mark Woodward.“Imputations of Missing Values in Practice: Results from Imputations of Serum
Cholesterol in 28 Cohort Studies”. In: Am J Epi 160 (2004). excellent review article for multiple imputation;list of
variables to include in imputation model;”Imputation models should ideally include all covariates that are related
to the missing data mechanism, have distributions that differ between the respondents and nonrespondents, are
associated with cholesterol, and will be included in the analyses of the final complete data sets”;detailed comparison
of results (cholesterol effect and confidence limits) for various imputation methods, pp. 34–45 (cit. on pp. 3-8,
3-15).
[13] Heiko Belcher. “The Concept of Residual Confounding in Regression Models and Some Applications”. In: Stat Med
11 (1992), pp. 1747–1758 (cit. on p. 2-13).
[14] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley, 1980 (cit. on p. 4-40).
[15] David A. Belsley. Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley, 1991
(cit. on p. 4-25).
[16] Jacqueline K. Benedetti et al.“Effective Sample Size for Tests of Censored Survival Data”. In: Biometrika 69 (1982),
pp. 343–349 (cit. on p. 4-20).
[17] Caroline Bennette and Andrew Vickers.“Against Quantiles: Categorization of Continuous Variables in Epidemiologic
Research, and Its Discontents”. In: BMC Med Res Methodol 12.1 (Feb. 2012). terrific graphical examples; nice
display of outcome heterogeneity within quantile groups of PSA, pp. 21+. issn: 1471-2288. doi: 10.1186/1471-
2288-12-21. pmid: 22375553. url: http://dx.doi.org/10.1186/1471-2288-12-21 (cit. on p. 2-13).
[18] Kiros Berhane, Michael Hauptmann, and Bryan Langholz. “Using Tensor Product Splines in Modeling Exposure–
Time–Response Relationships: Application to the Colorado Plateau Uranium Miners Cohort”. In: Stat Med 27
(2008). discusses taking product of all univariate spline basis functions, pp. 5484–5496 (cit. on p. 2-49).
[19] James Lopez Bernal, Steven Cummins, and Antonio Gasparrini. “Interrupted Time Series Regression for the Eval-
uation of Public Health Interventions: A Tutorial”. In: International Journal of Epidemiology 46.1 (Feb. 1, 2017),
pp. 348–355. issn: 0300-5771. doi: 10.1093/ije/dyw098. url: https://doi.org/10.1093/ije/dyw098 (visited
on 04/18/2021) (cit. on p. 2-54).
[20] D. M. Berridge and J. Whitehead.“Analysis of Failure Time Data with Ordinal Categories of Response”. In: Stat Med
10 (1991), pp. 1703–1710. doi: 10.1002/sim.4780101108. url: http://dx.doi.org/10.1002/sim.4780101108
(cit. on p. 10-18).
[21] Maria Blettner and Willi Sauerbrei.“Influence of Model-Building Strategies on the Results of a Case-Control Study”.
In: Stat Med 12 (1993), pp. 1325–1338 (cit. on p. 5-22).
[22] Irina Bondarenko and Trivellore Raghunathan. “Graphical and Numerical Diagnostic Tools to Assess Suitability of
Multiple Imputations and Imputation Models”. In: Stat Med 35.17 (July 2016), pp. 3007–3020. issn: 02776715.
doi: 10.1002/sim.6926. url: http://dx.doi.org/10.1002/sim.6926 (cit. on p. 3-17).
[23] James G. Booth and Somnath Sarkar.“Monte Carlo Approximation of Bootstrap Variances”. In: Am Statistician 52
(1998). number of resamples required to estimate variances, quantiles; 800 resamples may be required to guarantee
with 0.95 confidence that the relative error of a variance estimate is 0.1;Efron’s original suggestions for as low as
25 resamples were based on comparing stability of bootstrap estimates to sampling error, but small relative effects
can significantly change P-values;number of bootstrap resamples, pp. 354–357 (cit. on p. 5-10).
[24] Robert Bordley. “Statistical Decisionmaking without Math”. In: Chance 20.3 (2007), pp. 39–44 (cit. on p. 1-7).
[25] L. Breiman and J. H. Friedman.“Estimating Optimal Transformations for Multiple Regression and Correlation (with
Discussion)”. In: J Am Stat Assoc 80 (1985), pp. 580–619 (cit. on p. 4-33).
[26] Leo Breiman.“The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-fixed Predic-
tion Error”. In: J Am Stat Assoc 87 (1992), pp. 738–754 (cit. on pp. 4-15, 4-16, 5-16).
[27] Leo Breiman et al. Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1984
(cit. on p. 2-34).

[28] William M. Briggs and Russell Zaretzki.“The Skill Plot: A Graphical Technique for Evaluating Continuous Diagnostic
Tests (with Discussion)”. In: Biometrics 64 (2008). ”statistics such as the AUC are not especially relevant to someone
who must make a decision about a particular x_c. ... ROC curves lack or obscure several quantities that are necessary
for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were first used to check how radio
<i>receivers</i> (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves
are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based
on some x_c, and is not especially interested in how well he would have done had he used some different cutoff.”; in
the discussion David Hand states ”when integrating to yield the overall AUC measure, it is necessary to decide what
weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically
from the data. This is nonsensical. The relative importance of misclassifying a case as a noncase, compared to the
reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches
to the different kinds of misclassifications.”; see Lin, Kvam, Lu Stat in Med 28:798-813;2009, pp. 250–261 (cit. on
p. 1-7).
[29] David Brownstone. “Regression Strategies”. In: Proceedings of the 20th Symposium on the Interface between
Computer Science and Statistics. Washington, DC: American Statistical Association, 1988, pp. 74–79 (cit. on
p. 5-22).
[30] Petra Buettner, Claus Garbe, and Irene Guggenmoos-Holzmann.“Problems in Defining Cutoff Points of Continuous
Prognostic Factors: Example of Tumor Thickness in Primary Cutaneous Melanoma”. In: J Clin Epi 50 (1997).
choice of cut point depends on marginal distribution of predictor, pp. 1201–1210 (cit. on p. 2-13).
[31] Stef Buuren. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC, 2012. doi: 10.1201/
b11826. url: http://dx.doi.org/10.1201/b11826 (cit. on pp. 3-1, 3-14, 3-19).
[32] Bob Carpenter et al. “Stan: A Probabilistic Programming Language”. In: J Stat Software 76.1 (2017), pp. 1–32.
doi: 10.18637/jss.v076.i01. url: https://www.jstatsoft.org/v076/i01 (cit. on p. 8-52).
[33] James R. Carpenter and Melanie Smuk.“Missing Data: A Statistical Framework for Practice”. In: Biometrical Journal
63.5 (2021), pp. 915–947. issn: 1521-4036. doi: 10.1002/bimj.202000196. url: https://onlinelibrary.
wiley.com/doi/abs/10.1002/bimj.202000196 (visited on 10/21/2021) (cit. on p. 3-1).
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202000196.
[34] Centers for Disease Control and Prevention CDC. National Center for Health Statistics NCHS. “National Health
and Nutrition Examination Survey”. In: (2010). url: http://www.cdc.gov/nchs/nhanes/nhanes2009-2010/
nhanes09_10.htm (cit. on p. 11-4).
[35] John M. Chambers and Trevor J. Hastie, eds. Statistical Models in S. Pacific Grove, CA: Wadsworth and Brook-
s/Cole, 1992 (cit. on p. 2-50).
[36] C. Chatfield. “Avoiding Statistical Pitfalls (with Discussion)”. In: Stat Sci 6 (1991), pp. 240–268 (cit. on p. 4-41).
[37] C. Chatfield. “Model Uncertainty, Data Mining and Statistical Inference (with Discussion)”. In: J Roy Stat Soc A
158 (1995). bias by selecting model because it fits the data well; bias in standard errors;P. 420: ... need for a
better balance in the literature and in statistical teaching between techniques and problem solving strategies. P.
421: It is ‘well known’ to be ‘logically unsound and practically misleading’ (Zhang, 1992) to make inferences as if
a model is known to be true when it has, in fact, been selected from the same data to be used for estimation
purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a ‘quiet scandal’), they
(we) continue to ignore the difficulties because it is not clear what else could or should be done. P. 421: Estimation
errors for regression coefficients are usually smaller than errors from failing to take into account model specification.
P. 422: Statisticians must stop pretending that model uncertainty does not exist and begin to find ways of coping
with it. P. 426: It is indeed strange that we often admit model uncertainty by searching for a best model but then
ignore this uncertainty by making inferences and predictions as if certain that the best fitting model is actually
true. P. 427: The analyst needs to assess the model selection process and not just the best fitting model. P. 432:
The use of subset selection methods is well known to introduce alarming biases. P. 433: ... the AIC can be highly
biased in data-driven model selection situations. P. 434: Prediction intervals will generally be too narrow. In the
discussion, Jamal R. M. Ameen states that a model should be (a) satisfactory in performance relative to the stated
objective, (b) logically sound, (c) representative, (d) questionable and subject to on-line interrogation, (e) able to
accommodate external or expert information and (f) able to convey information., pp. 419–466 (cit. on pp. 4-10,
5-22).
[38] Samprit Chatterjee and Ali S. Hadi. Regression Analysis by Example. Fifth. New York: Wiley, 2012. isbn: 0-470-
90584-0 (cit. on p. 4-24).

[39] Marie Chavent et al. “ClustOfVar: An R Package for the Clustering of Variables”. In: J Stat Software 50.13 (Sept.
2012), pp. 1–16 (cit. on p. 4-29).
[40] A. Ciampi et al.“Stratification by Stepwise Regression, Correspondence Analysis and Recursive Partition”. In: Comp
Stat Data Analysis 1986 (1986), pp. 185–204 (cit. on p. 4-29).
[41] W. S. Cleveland. “Robust Locally Weighted Regression and Smoothing Scatterplots”. In: J Am Stat Assoc 74
(1979), pp. 829–836 (cit. on p. 2-31).
[42] D. Collett. Modelling Binary Data. Second. London: Chapman and Hall, 2002. isbn: 1-58488-324-3 (cit. on p. 10-6).
[43] Gary S. Collins, Emmanuel O. Ogundimu, and Douglas G. Altman. “Sample Size Considerations for the External
Validation of a Multivariable Prognostic Model: A Resampling Study”. In: Stat Med 35.2 (Jan. 2016), pp. 214–226.
issn: 02776715. doi: 10.1002/sim.6787. url: http://dx.doi.org/10.1002/sim.6787 (cit. on p. 5-14).
[44] Gary S. Collins et al. “Quantifying the Impact of Different Approaches for Handling Continuous Predictors on
the Performance of a Prognostic Model”. In: Stat Med 35.23 (Oct. 2016). used rms package hazard regression
method (hare) for survival model calibration, pp. 4124–4135. issn: 02776715. doi: 10 . 1002 / sim . 6986. url:
http://dx.doi.org/10.1002/sim.6986 (cit. on p. 2-13).
[45] E. Francis Cook and Lee Goldman. “Asymmetric Stratification: An Outline for an Efficient Method for Controlling
Confounding in Cohort Studies”. In: Am J Epi 127 (1988), pp. 626–639 (cit. on p. 2-35).
[46] Nancy R. Cook. “Use and Misues of the Receiver Operating Characteristic Curve in Risk Prediction”. In: Circ 115
(2007). example of large change in predicted risk in cardiovascular disease with tiny change in ROC area;possible
limits to c index when calibration is perfect;importance of calibration accuracy and changes in predicted risk when
new variables are added, pp. 928–935 (cit. on p. 8-39).
[47] J. B. Copas. “Cross-Validation Shrinkage of Regression Predictors”. In: J Roy Stat Soc B 49 (1987), pp. 175–183
(cit. on p. 5-21).
[48] J. B. Copas.“Regression, Prediction and Shrinkage (with Discussion)”. In: J Roy Stat Soc B 45 (1983), pp. 311–354
(cit. on pp. 4-22, 4-23).
[49] David R. Cox.“Regression Models and Life-Tables (with Discussion)”. In: J Roy Stat Soc B 34 (1972), pp. 187–220
(cit. on pp. 2-53, 13-1).
[50] Sybil L. Crawford, Sharon L. Tennstedt, and John B. McKinlay. “A Comparison of Analytic Methods for Non-
Random Missingness of Outcome Data”. In: J Clin Epi 48 (1995), pp. 209–219 (cit. on pp. 3-4, 4-47).
[51] N. J. Crichton and J. P. Hinde.“Correspondence Analysis as a Screening Method for Indicants for Clinical Diagnosis”.
In: Stat Med 8 (1989), pp. 1351–1362 (cit. on p. 4-29).
[52] Ralph B. D’Agostino et al.“Development of Health Risk Appraisal Functions in the Presence of Multiple Indicators:
The Framingham Study Nursing Home Institutionalization Model”. In: Stat Med 14 (1995), pp. 1757–1770 (cit. on
pp. 4-25, 4-28).
[53] C. E. Davis et al. “An Example of Dependencies among Variables in a Conditional Logistic Regression”. In: Modern
Statistical Methods in Chronic Disease Epidemiology. Ed. by S. H. Moolgavkar and R. L. Prentice. New York:
Wiley, 1986, pp. 140–147 (cit. on p. 4-25).
[54] Charles S. Davis. Statistical Methods for the Analysis of Repeated Measurements. New York: Springer, 2002 (cit. on
p. 7-17).
[55] S. Derksen and H. J. Keselman. “Backward, Forward and Stepwise Automated Subset Selection Algorithms: Fre-
quency of Obtaining Authentic and Noise Variables”. In: British J Math Stat Psych 45 (1992), pp. 265–282 (cit. on
p. 4-10).
[56] T. F. Devlin and B. J. Weeks. “Spline Functions for Logistic Regression Modeling”. In: Proceedings of the Eleventh
Annual SAS Users Group International Conference. Cary, NC: SAS Institute, Inc., 1986, pp. 646–651 (cit. on
pp. 2-22, 2-24).
[57] Peter J. Diggle et al. Analysis of Longitudinal Data. second. Oxford UK: Oxford University Press, 2002 (cit. on
p. 7-11).
[58] Donders et al. “Review: A Gentle Introduction to Imputation of Missing Values”. In: J Clin Epi 59 (2006). simple
demonstration of failure of the add new category method (indicator variable), pp. 1087–1091 (cit. on pp. 3-1, 3-5).
[59] William D. Dupont. Statistical Modeling for Biomedical Researchers. second. Cambridge, UK: Cambridge University
Press, 2008 (cit. on p. 15-15).

[60] S. Durrleman and R. Simon. “Flexible Regression Models with Cubic Splines”. In: Stat Med 8 (1989), pp. 551–561
(cit. on p. 2-28).
[61] B. Efron. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation”. In: J Am Stat Assoc
78 (1983). suggested need at least 200 models to get an average that is adequate, i.e., 20 repeats of 10-fold cv,
pp. 316–331 (cit. on pp. 5-17, 5-20, 5-21).
[62] Bradley Efron and Balasubramanian Narasimhan.“The Automatic Construction of Bootstrap Confidence Intervals”.
In: Journal of Computational and Graphical Statistics 0.0 (Jan. 14, 2020), pp. 1–12. issn: 1061-8600. doi: 10.1080/
10618600.2020.1714633. url: https://doi.org/10.1080/10618600.2020.1714633 (visited on 03/13/2020)
(cit. on p. 5-10).
eprint: https://doi.org/10.1080/10618600.2020.1714633.
[63] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993 (cit. on
p. 5-20).
[64] Bradley Efron and Robert Tibshirani.“Improvements on Cross-Validation: The .632+ Bootstrap Method”. In: J Am
Stat Assoc 92 (1997), pp. 548–560 (cit. on p. 5-20).
[65] Nicole S. Erler et al. “Dealing with Missing Covariates in Epidemiologic Studies: A Comparison between Multiple
Imputation and a Full Bayesian Approach”. In: Stat Med 35.17 (July 2016), pp. 2955–2974. issn: 02776715. doi:
10.1002/sim.6944. url: http://dx.doi.org/10.1002/sim.6944 (cit. on p. 3-7).
[66] Juanjuan Fan and Richard A. Levine.“To Amnio or Not to Amnio: That Is the Decision for Bayes”. In: Chance 20.3
(2007), pp. 26–32 (cit. on p. 1-7).
[67] David Faraggi and Richard Simon. “A Simulation Study of Cross-Validation for Selecting an Optimal Cutpoint in
Univariate Survival Analysis”. In: Stat Med 15 (1996). bias in point estimate of effect from selecting cutpoints
based on P-value; loss of information from dichotomizing continuous predictors, pp. 2203–2213 (cit. on p. 2-13).
[68] J. J. Faraway. “The Cost of Data Analysis”. In: J Comp Graph Stat 1 (1992), pp. 213–229 (cit. on pp. 4-49, 5-20,
5-22).
[69] Valerii Fedorov, Frank Mannino, and Rongmei Zhang.“Consequences of Dichotomization”. In: Pharm Stat 8 (2009).
optimal cutpoint depends on unknown parameters;should only entertain dichotomization when ”estimating a value
of the cumulative distribution and when the assumed model is very different from the true model”;nice graphics,
pp. 50–61. doi: 10.1002/pst.331. url: http://dx.doi.org/10.1002/pst.331 (cit. on pp. 1-6, 2-13).
[70] Steven E. Fienberg. The Analysis of Cross-Classified Categorical Data. Second. New York: Springer, 2007. isbn:
0-387-72824-4 (cit. on p. 10-18).
[71] D. Freedman, W. Navidi, and S. Peters. “On the Impact of Variable Selection in Fitting Regression Equations”.
In: Lecture Notes in Economics and Mathematical Systems. New York: Springer-Verlag, 1988, pp. 1–16 (cit. on
p. 5-21).
[72] J. H. Friedman. A Variable Span Smoother. Technical Report 5. Laboratory for Computational Statistics, Depart-
ment of Statistics, Stanford University, 1984 (cit. on p. 4-33).
[73] Mitchell H. Gail and Ruth M. Pfeiffer. “On Criteria for Evaluating Models of Absolute Risk”. In: Biostatistics 6.2
(2005), pp. 227–239 (cit. on p. 1-7).
[74] Joseph C. Gardiner, Zhehui Luo, and Lee A. Roman. “Fixed Effects, Random Effects and GEE: What Are the
Differences?” In: Stat Med 28 (2009). nice comparison of models; econometrics; different use of the term ”fixed
effects model”, pp. 221–239 (cit. on p. 7-10).
[75] A. Giannoni et al. “Do Optimal Prognostic Thresholds in Continuous Physiological Variables Really Exist? Analysis
of Origin of Apparent Thresholds, with Systematic Review for Peak Oxygen Consumption, Ejection Fraction and
BNP”. In: PLoS ONE 9.1 (2014). doi: 10.1371/journal.pone.0081699. url: http://dx.doi.org/10.1371/
journal.pone.0081699 (cit. on pp. 2-13, 2-16).
[76] John H. Giudice, John R. Fieberg, and Mark S. Lenarz. “Spending Degrees of Freedom in a Poor Economy: A Case
Study of Building a Sightability Model for Moose in Northeastern Minnesota”. In: J Wildlife Manage (2011). doi:
10.1002/jwmg.213. url: http://dx.doi.org/10.1002/jwmg.213 (cit. on p. 4-1).
[77] Tilmann Gneiting and Adrian E. Raftery. “Strictly Proper Scoring Rules, Prediction, and Estimation”. In: J Am
Stat Assoc 102 (2007). wonderful review article except missing references from Scandinavian and German medical
decision making literature, pp. 359–378 (cit. on p. 1-7).

[78] Harvey Goldstein.“Restricted Unbiased Iterative Generalized Least-Squares Estimation”. In: Biometrika 76.3 (1989).
derivation of REML, pp. 622–623 (cit. on pp. 7-7, 7-11).
[79] Usha S. Govindarajulu et al. “Comparing Smoothing Techniques in Cox Models for Exposure-Response Relation-
ships”. In: Stat Med 26 (2007). authors wrote a SAS macro for restricted cubic splines even though such a macro
as existed since 1984; would have gotten more useful results had simulation been used so would know the true
regression shape;measure of agreement of two estimated curves by computing the area between them, standardized
by average of areas under the two;penalized spline and rcs were closer to each other than to fractional polynomials,
pp. 3735–3752 (cit. on p. 2-29).
[80] P. M. Grambsch and P. C. O’Brien. “The Effects of Transformations and Preliminary Tests for Non-Linearity in
Regression”. In: Stat Med 10 (1991), pp. 697–709 (cit. on pp. 2-41, 4-10).
[81] Robert J. Gray. “Flexible Methods for Analyzing Survival Data Using Splines, with Applications to Breast Cancer
Prognosis”. In: J Am Stat Assoc 87 (1992), pp. 942–951 (cit. on pp. 2-49, 4-22).
[82] Robert J. Gray. “Spline-Based Tests in Survival Analysis”. In: Biometrics 50 (1994), pp. 640–652 (cit. on p. 2-49).
[83] Michael J. Greenacre. “Correspondence Analysis of Multivariate Categorical Data by Weighted Least-Squares”. In:
Biometrika 75 (1988), pp. 457–467 (cit. on p. 4-29).
[84] Sander Greenland. “When Should Epidemiologic Regressions Use Random Coefficients?” In: Biometrics 56 (2000).
use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important con-
founders uncontrolled;composition matrix;example with far too many significant predictors with many regression co-
efficients absurdly inflated when overfit;lack of evidence for dietary effects mediated through constituents;shrinkage
instead of variable selection;larger effect on confidence interval width than on point estimates with variable se-
lection;uncertainty about variance of random effects is just uncertainty about prior opinion;estimation of vari-
ance is pointless;instead the analysis should be repeated using different values;”if one feels compelled to estimate
$\tau^2$, I would recommend giving it a proper prior concentrated around contextually reasonable values”;claim
about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model
entertained;shrinkage towards compositional model;”models need to be complex to capture uncertainty about the
relations...an honest uncertainty assessment requires parameters for all effects that we know may be present. This
advice is implicit in an antiparsimony principle often attributed to L. J. Savage ’All models should be as big as an
elephant (see Draper, 1995)’”. See also gus06per., pp. 915–921. doi: 10.1111/j.0006-341X.2000.00915.x. url:
http://dx.doi.org/10.1111/j.0006-341X.2000.00915.x (cit. on pp. 4-10, 4-44).
[85] Jian Guo et al. “Principal Component Analysis with Sparse Fused Loadings”. In: J Comp Graph Stat 19.4 (2011).
incorporates blocking structure in the variables;selects different variables for different components;encourages load-
ings of highly correlated variables to have same magnitude, which aids in interpretation, pp. 930–946 (cit. on
p. 4-29).
[86] D. Hand and M. Crowder. Practical Longitudinal Data Analysis. London: Chapman & Hall, 1996.
[87] Ofer Harel and Xiao-Hua Zhou. “Multiple Imputation: Review of Theory, Implementation and Software”. In: Stat
Med 26 (2007). failed to review aregImpute;excellent overview;ugly S code;nice description of different statistical
tests including combining likelihood ratio tests (which appears to be complex, requiring an out-of-sample log
likelihood computation);congeniality of imputation and analysis models;Bayesian approximation or approximate
Bayesian bootstrap overview;”Although missing at random (MAR) is a non-testable assumption, it has been pointed
out in the literature that we can get very close to MAR if we include enough variables in the imputation models
... it would be preferred if the missing data modelling was done by the data constructors and not by the users...
MI yields valid inferences not only in congenial settings, but also in certain uncongenial ones as well—where the
imputer’s model (1) is more general (i.e. makes fewer assumptions) than the complete-data estimation method, or
when the imputer’s model makes additional assumptions that are well-founded.”, pp. 3057–3077 (cit. on pp. 3-1,
3-8, 3-12, 3-15).
[88] F. E. Harrell. “The LOGIST Procedure”. In: SUGI Supplemental Library Users Guide. Version 5. Cary, NC: SAS
Institute, Inc., 1986, pp. 269–293 (cit. on p. 4-14).
[89] F. E. Harrell, K. L. Lee, and B. G. Pollock. “Regression Models in Clinical Studies: Determining Relationships
between Predictors and Response”. In: J Nat Cancer Inst 80 (1988), pp. 1198–1202 (cit. on p. 2-32).
[90] F. E. Harrell et al. “Regression Modeling Strategies for Improved Prognostic Prediction”. In: Stat Med 3 (1984),
pp. 143–152 (cit. on p. 4-19).
[91] F. E. Harrell et al.“Regression Models for Prognostic Prediction: Advantages, Problems, and Suggested Solutions”.
In: Ca Trt Rep 69 (1985), pp. 1071–1077 (cit. on p. 4-19).

[92] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark.“Multivariable Prognostic Models: Issues in Developing Models,
Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors”. In: Stat Med 15 (1996), pp. 361–387
(cit. on p. 4-1).
[93] Frank E. Harrell et al. “Development of a Clinical Prediction Model for an Ordinal Outcome: The World Health
Organization ARI Multicentre Study of Clinical Signs and Etiologic Agents of Pneumonia, Sepsis, and Meningitis in
Young Infants”. In: Stat Med 17 (1998), pp. 909–944. url: http://onlinelibrary.wiley.com/doi/10.1002/
(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract (cit. on pp. 4-22, 4-47).
[94] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. second. New
York: Springer, 2008 (cit. on p. 2-37).
ISBN-10: 0387848576; ISBN-13: 978-0387848570.
[95] Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. Boca Raton, FL: Chapman & Hall/CRC,
1990 (cit. on p. 2-38).
ISBN 9780412343902.
[96] Yulei He and Alan M. Zaslavsky.“Diagnosing Imputation Models by Applying Target Analyses to Posterior Replicates
of Completed Data”. In: Stat Med 31.1 (2012), pp. 1–18. doi: 10.1002/sim.4413. url: http://dx.doi.org/
10.1002/sim.4413 (cit. on p. 3-17).
[97] S. G. Hilsenbeck and G. M. Clark. “Practical P-Value Adjustment for Optimally Selected Cutpoints”. In: Stat Med
15 (1996), pp. 103–112 (cit. on p. 2-13).
[98] W. Hoeffding. “A Non-Parametric Test of Independence”. In: Ann Math Stat 19 (1948), pp. 546–557 (cit. on
p. 4-29).
[99] Norbert Holländer, Willi Sauerbrei, and Martin Schumacher. “Confidence Intervals for the Effect of a Prognostic
Factor after Selection of an ‘optimal’ Cutpoint”. In: Stat Med 23 (2004). true type I error can be much greater
than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value method;CART;recursive
partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;”It should
be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every
study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely
difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the
S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the
literature; some of them were solely used because they emerged as the ‘optimal’ cutpoint in a specific data set. In
a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast
cancer patients, 12 studies were in included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor
the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update
of the American Society of Clinical Oncology.”; dichotomization; categorizing continuous variables; refs alt94dan,
sch94out, alt98sub, pp. 1701–1713. doi: 10.1002/sim.1611. url: http://dx.doi.org/10.1002/sim.1611
(cit. on pp. 2-13, 2-15).
[100] Nicholas J. Horton and Ken P. Kleinman. “Much Ado about Nothing: A Comparison of Missing Data Methods and
Software to Fit Incomplete Data Regression Models”. In: Am Statistician 61.1 (2007), pp. 79–90 (cit. on p. 3-15).
[101] C. M. Hurvich and C. L. Tsai.“The Impact of Model Selection on Inference in Linear Regression”. In: Am Statistician
44 (1990), pp. 214–217 (cit. on p. 4-16).
[102] Lisa I. Iezzoni. “Dimensions of Risk”. In: Risk Adjustment for Measuring Health Outcomes. Ed. by Lisa I. Iezzoni.
dimensions of risk factors to include in models. Ann Arbor, MI: Foundation of the American College of Healthcare
Executives, 1994. Chap. 2, pp. 29–118 (cit. on p. 1-12).
[103] K. J. Janssen et al. “Missing Covariate Data in Medical Research: To Impute Is Better than to Ignore”. In: J Clin
Epi 63 (2010), pp. 721–727 (cit. on p. 3-20).
[104] Michael P. Jones. “Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Re-
gression”. In: J Am Stat Assoc 91 (1996), pp. 222–230 (cit. on p. 3-5).
[105] J. D. Kalbfleisch and R. L. Prentice.“Marginal Likelihood Based on Cox’s Regression and Life Model”. In: Biometrika
60 (1973), pp. 267–278 (cit. on p. 11-25).
[106] Juha Karvanen and Frank E. Harrell. “Visualizing Covariates in Proportional Hazards Model”. In: Stat Med 28
(2009), pp. 1957–1966 (cit. on p. 5-2).

[107] Michael G. Kenward, Ian R. White, and James R. Carpener.“Should Baseline Be a Covariate or Dependent Variable
in Analyses of Change from Baseline in Clinical Trials? (Letter to the Editor)”. In: Stat Med 29 (2010). sharp rebuke
of liu09sho, pp. 1455–1456 (cit. on p. 7-5).
[108] Soeun Kim, Catherine A. Sugar, and Thomas R. Belin. “Evaluating Model-Based Imputation Methods for Missing
Covariates in Regression Models with Interactions”. In: Stat Med 34.11 (May 2015), pp. 1876–1888. issn: 02776715.
doi: 10.1002/sim.6435. url: http://dx.doi.org/10.1002/sim.6435 (cit. on p. 3-10).
[109] W. A. Knaus et al.“The SUPPORT Prognostic Model: Objective Estimates of Survival for Seriously Ill Hospitalized
Adults”. In: Ann Int Med 122 (1995), pp. 191–203. doi: 10.7326/0003-4819-122-3-199502010-00007. url:
http://dx.doi.org/10.7326/0003-4819-122-3-199502010-00007 (cit. on pp. 4-34, 12-1).
[110] Mirjam J. Knol et al. “Unpredictable Bias When Using the Missing Indicator Method or Complete Case Analysis
for Missing Confounder Values: An Empirical Example”. In: J Clin Epi 63 (2010), pp. 728–736 (cit. on p. 3-5).
[111] R. Koenker and G. Bassett. “Regression Quantiles”. In: Econometrica 46 (1978), pp. 33–50 (cit. on p. 11-11).
[112] Roger Koenker. Quantile Regression. New York: Cambridge University Press, 2005 (cit. on p. 11-11).
ISBN-10: 0-521-60827-9; ISBN-13: 978-0-521-60827-5.
[113] Roger Koenker. Quantreg: Quantile Regression. 2009. url: http://CRAN.R-project.org/package=quantreg
(cit. on p. 11-11).
R package version 4.38.
[114] Charles Kooperberg, Charles J. Stone, and Young K. Truong.“Hazard Regression”. In: J Am Stat Assoc 90 (1995),
pp. 78–94 (cit. on p. 12-18).
[115] Warren F. Kuhfeld. “The PRINQUAL Procedure”. In: SAS/STAT 9.2 User’s Guide. Second. Cary, NC: SAS Pub-
lishing, 2009. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on p. 4-30).
[116] J. M. Landwehr, D. Pregibon, and A. C. Shoemaker. “Graphical Methods for Assessing Logistic Regression Models
(with Discussion)”. In: J Am Stat Assoc 79 (1984), pp. 61–83 (cit. on p. 10-6).
[117] B. Lausen and M. Schumacher. “Evaluating the Effect of Optimized Cutoff Values in the Assessment of Prognostic
Factors”. In: Comp Stat Data Analysis 21.3 (1996), pp. 307–326. doi: 10.1016/0167- 9473(95)00016- X. url:
http://dx.doi.org/10.1016/0167-9473(95)00016-X (cit. on p. 2-13).
[118] J. F. Lawless and K. Singhal. “Efficient Screening of Nonnormal Regression Models”. In: Biometrics 34 (1978),
pp. 318–327 (cit. on p. 4-15).
[119] S. le Cessie and J. C. van Houwelingen.“Ridge Estimators in Logistic Regression”. In: Appl Stat 41 (1992), pp. 191–
201 (cit. on p. 4-22).
[120] A. Leclerc et al. “Correspondence Analysis and Logistic Modelling: Complementary Use in the Analysis of a Health
Survey among Nurses”. In: Stat Med 7 (1988), pp. 983–995 (cit. on p. 4-29).
[121] Katherine J. Lee and John B. Carlin. “Recovery of Information from Multiple Imputation: A Simulation Study”. In:
Emerg Themes Epi 9.1 (June 2012). Not sure that the authors satisfactorily dealt with nonlinear predictor effects;
in the absence of strong auxiliary information, there is little to gain from multiple imputation with missing data in the
exposure-of-interest. In fact, the authors went further to say that multiple imputation can introduce bias not present
in a complete case analysis if a poorly fitting imputation model is used [from Yong Hao Pua], pp. 3+. issn: 1742-
7622. doi: 10.1186/1742-7622-9-3. pmid: 22695083. url: http://dx.doi.org/10.1186/1742-7622-9-3
(cit. on p. 3-4).
[122] Seokho Lee, Jianhua Z. Huang, and Jianhua Hu. “Sparse Logistic Principal Components Analysis for Binary Data”.
In: Ann Appl Stat 4.3 (2010), pp. 1579–1601 (cit. on p. 2-37).
[123] Chenlei Leng and Hansheng Wang.“On General Adaptive Sparse Principal Component Analysis”. In: J Comp Graph
Stat 18.1 (2009), pp. 201–215 (cit. on p. 2-37).
[124] Chun Li and Bryan E. Shepherd.“A New Residual for Ordinal Outcomes”. In: Biometrika 99.2 (2012), pp. 473–480.
doi: 10.1093/biomet/asr073. eprint: http://biomet.oxfordjournals.org/content/99/2/473.full.pdf+
html. url: http://biomet.oxfordjournals.org/content/99/2/473.abstract (cit. on p. 10-6).
[125] Kung-Yee Liang and Scott L. Zeger.“Longitudinal Data Analysis of Continuous and Discrete Responses for Pre-Post
Designs”. In: Sankhyā 62 (2000). makes an error in assuming the baseline variable will have the same univariate
distribution as the response except for a shift;baseline may have for example a truncated distribution based on a
trial’s inclusion criteria;if correlation between baseline and response is zero, ANCOVA will be twice as efficient as
simple analysis of change scores;if correlation is one they may be equally efficient, pp. 134–148 (cit. on p. 7-5).

[126] James K. Lindsey. Models for Repeated Measurements. Clarendon Press, 1997.
[127] Stuart Lipsitz, Michael Parzen, and Lue P. Zhao. “A Degrees-Of-Freedom Approximation in Multiple Imputation”.
In: J Stat Comp Sim 72.4 (Jan. 2002), pp. 309–318. doi: 10.1080/00949650212848. url: http://dx.doi.org/
10.1080/00949650212848 (cit. on p. 3-13).
[128] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. second. New York: Wiley, 2002
(cit. on pp. 3-4, 3-9, 3-17).
[129] Guanghan F. Liu et al.“Should Baseline Be a Covariate or Dependent Variable in Analyses of Change from Baseline
in Clinical Trials?” In: Stat Med 28 (2009). seems to miss several important points, such as the fact that the
baseline variable is often part of the inclusion/exclusion criteria and so has a truncated distribution that is different
from that of the follow-up measurements;sharp rebuke in ken10sho, pp. 2509–2530 (cit. on p. 7-5).
[130] Richard Lockhart et al. A Significance Test for the Lasso. arXiv, 2013. arXiv: 1301.7161. url: http://arxiv.
org/abs/1301.7161 (cit. on p. 4-10).
[131] Xiaohui Luo, Leonard A. Stefanski, and Dennis D. Boos. “Tuning Variable Selection Procedures by Adding Noise”.
In: Technometrics 48 (2006). adding a known amount of noise to the response and studying ² to tune the stopping
rule to avoid overfitting or underfitting;simulation setup, pp. 165–175 (cit. on p. 1-14).
[132] Paul Madley-Dowd et al. “The Proportion of Missing Data Should Not Be Used to Guide Decisions on Multiple
Imputation”. In: Journal of Clinical Epidemiology 110 (June 1, 2019), pp. 63–73. issn: 0895-4356. doi: 10.1016/j.
jclinepi.2019.02.016. url: http://www.sciencedirect.com/science/article/pii/S0895435618308710
(visited on 09/14/2019) (cit. on p. 3-20).
[133] Nathan Mantel. “Why Stepdown Procedures in Variable Selection”. In: Technometrics 12 (1970), pp. 621–625
(cit. on p. 4-15).
[134] Maurizio Manuguerra and Gillian Z. Heller.“Ordinal Regression Models for Continuous Scales”. In: Int J Biostat 6.1
(Jan. 2010). mislabeled a flexible parametric model as semi-parametric; does not cover semi-parametric approach
with lots of intercepts. issn: 1557-4679. doi: 10.2202/1557-4679.1230. url: http://dx.doi.org/10.2202/
1557-4679.1230 (cit. on p. 11-13).
[135] S. E. Maxwell and H. D. Delaney. “Bivariate Median Splits and Spurious Statistical Significance”. In: Psych Bull
113 (1993), pp. 181–190. doi: 10.1037//0033-2909.113.1.181. url: http://dx.doi.org/10.1037//0033-
2909.113.1.181 (cit. on p. 2-13).
[136] George P. McCabe. “Principal Variables”. In: Technometrics 26 (1984), pp. 137–144 (cit. on p. 4-28).
[137] George Michailidis and Jan de Leeuw.“The Gifi System of Descriptive Multivariate Analysis”. In: Stat Sci 13 (1998),
pp. 307–336 (cit. on p. 4-29).
[138] Karel G. M. Moons et al. “Using the Outcome for Imputation of Missing Predictor Values Was Preferred”. In: J
Clin Epi 59 (2006). use of outcome variable; excellent graphical summaries of simulations, pp. 1092–1101. doi:
10.1016/j.jclinepi.2006.01.009. url: http://dx.doi.org/10.1016/j.jclinepi.2006.01.009 (cit. on
p. 3-13).
[139] Barry K. Moser and Laura P. Coombs. “Odds Ratios for a Continuous Outcome Variable without Dichotomizing”.
In: Stat Med 23 (2004). large loss of efficiency and power;embeds in a logistic distribution, similar to proportional
odds model;categorization;dichotomization of a continuous response in order to obtain odds ratios often results in
an inflation of the needed sample size by a factor greater than 1.5, pp. 1843–1860 (cit. on p. 2-13).
[140] Raymond H. Myers. Classical and Modern Regression with Applications. Boston: PWS-Kent, 1990 (cit. on p. 4-24).
[141] N. J. D. Nagelkerke.“A Note on a General Definition of the Coefficient of Determination”. In: Biometrika 78 (1991),
pp. 691–692 (cit. on p. 4-42).
[142] Todd G. Nick and J. Michael Hardin. “Regression Modeling Strategies: An Illustrative Case Study from Medical
Rehabilitation Outcomes Research”. In: Am J Occ Ther 53 (1999), pp. 459–470 (cit. on p. 4-1).
[143] David J. Nott and Chenlei Leng. “Bayesian Projection Approaches to Variable Selection in Generalized Linear
Models”. In: Computational Statistics & Data Analysis 54.12 (Dec. 2010), pp. 3227–3241. issn: 01679473. doi:
10.1016/j.csda.2010.01.036. url: http://dx.doi.org/10.1016/j.csda.2010.01.036 (cit. on p. 2-37).

[144] Debashis Paul et al. ““Preconditioning” for Feature Selection and Regression in High-Dimensional Problems”. In:
Ann Stat 36.4 (2008). develop consistent Y using a latent variable structure, using for example supervised principal
components. Then run stepwise regression or lasso predicting Y (lasso worked better). Can run into problems when
a predictor has importance in an adjusted sense but has no marginal correlation with Y;model approximation;model
simplification, pp. 1595–1619. doi: 10 . 1214 / 009053607000000578. url: http : / / dx . doi . org / 10 . 1214 /
009053607000000578 (cit. on p. 2-37).
[145] Peter Peduzzi et al. “A Simulation Study of the Number of Events per Variable in Logistic Regression Analysis”. In:
J Clin Epi 49 (1996), pp. 1373–1379 (cit. on pp. 4-19, 4-20).
[146] Peter Peduzzi et al. “Importance of Events per Independent Variable in Proportional Hazards Regression Analysis.
II. Accuracy and Precision of Regression Estimates”. In: J Clin Epi 48 (1995), pp. 1503–1510 (cit. on p. 4-19).
[147] N. Peek et al. “External Validation of Prognostic Models for Critically Ill Patients Required Substantial Sample
Sizes”. In: J Clin Epi 60 (2007). large sample sizes need to obtain reliable external validations;inadequate power of
DeLong, DeLong, and Clarke-Pearson test for differences in correlated ROC areas (p. 498);problem with tests of
calibration accuracy having too much power for large sample sizes, pp. 491–501 (cit. on p. 4-43).
[148] Michael J. Pencina, Ralph B. D’Agostino, and Olga V. Demler. “Novel Metrics for Evaluating Improvement in
Discrimination: Net Reclassification and Integrated Discrimination Improvement for Normal Variables and Nested
Models”. In: Stat Med 31.2 (2012), pp. 101–113. doi: 10.1002/sim.4348. url: http://dx.doi.org/10.1002/
sim.4348 (cit. on pp. 4-43, 8-39).
[149] Michael J. Pencina, Ralph B. D’Agostino, and Ewout W. Steyerberg. “Extensions of Net Reclassification Improve-
ment Calculations to Measure Usefulness of New Biomarkers”. In: Stat Med 30 (2011). lack of need for NRI to
be category-based;arbitrariness of categories;”category-less or continuous NRI is the most objective and versatile
measure of improvement in risk prediction”;authors misunderstood the inadequacy of three categories if categories
are used;comparison of NRI to change in C index;example of continuous plot of risk for old model vs. risk for new
model, pp. 11–21. doi: 10.1002/sim.4085. url: http://dx.doi.org/10.1002/sim.4085 (cit. on p. 4-43).
[150] Michael J. Pencina et al. “Evaluating the Added Predictive Ability of a New Marker: From Area under the ROC
Curve to Reclassification and Beyond”. In: Stat Med 27 (2008). small differences in ROC area can still be very mean-
ingful;example of insignificant test for difference in ROC areas with very significant results from new method;Yates’
discrimination slope;reclassification table;limiting version of this based on whether and amount by which probabil-
ities rise for events and lower for non-events when compare new model to old;comparing two models;see letter to
the editor by Van Calster and Van Huffel, Stat in Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med
31:93-97, 2012, pp. 157–172 (cit. on pp. 4-43, 8-39).
[151] de Vries Bas B.L. Penning, Smeden Maarten van, and Rolf H.H. Groenwold. “Propensity Score Estimation Using
Classification and Regression Trees in the Presence of Missing Covariate Data”. In: Epidemiologic Methods 7.1
(2018). doi: 10.1515/em-2017-0020. url: https://www.degruyter.com/view/j/em.2018.7.issue-1/em-
2017-0020/em-2017-0020.xml (visited on 09/02/2019) (cit. on p. 3-10).
[152] Sanne A. Peters et al. “Multiple Imputation of Missing Repeated Outcome Measurements Did Not Add to Linear
Mixed-Effects Models.” In: J Clin Epi 65.6 (2012), pp. 686–695. doi: 10.1016/j.jclinepi.2011.11.012. url:
http://dx.doi.org/10.1016/j.jclinepi.2011.11.012 (cit. on p. 7-10).
[153] Bercedis Peterson and Frank E. Harrell. “Partial Proportional Odds Models for Ordinal Response Variables”. In:
Appl Stat 39 (1990), pp. 205–217 (cit. on p. 10-9).
[154] José C. Pinheiro and Douglas M. Bates. Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000 (cit. on
pp. 7-11, 7-13).
[155] Richard F. Potthoff and S. N. Roy. “A Generalized Multivariate Analysis of Variance Model Useful Especially for
Growth Curve Problems”. In: Biometrika 51 (1964). included an AR1 example, pp. 313–326 (cit. on p. 7-7).
[156] David B. Pryor et al. “Estimating the Likelihood of Significant Coronary Artery Disease”. In: Am J Med 75 (1983),
pp. 771–780 (cit. on p. 8-48).
[157] Peter Radchenko and Gareth M. James.“Variable Inclusion and Shrinkage Algorithms”. In: J Am Stat Assoc 103.483
(2008). solves problem caused by lasso using the same penalty parameter for variable selection and shrinkage which
causes lasso to have to keep too many variables in the model to avoid overshrinking the remaining predictors;does
not handle scaling issue well, pp. 1304–1315 (cit. on p. 2-36).
[158] D. R. Ragland. “Dichotomizing Continuous Outcome Variables: Dependence of the Magnitude of Association and
Statistical Power on the Cutpoint”. In: Epi 3 (1992), pp. 434–440. doi: 10.1097/00001648-199209000-00009.
url: http://dx.doi.org/10.1097/00001648-199209000-00009 (cit. on p. 2-13).

See letters to editor May 1993, P. 274-, Vol 4 No. 3.
[159] Brendan M. Reilly and Arthur T. Evans. “Translating Clinical Research into Clinical Practice: Impact of Using
Prediction Rules to Make Decisions”. In: Ann Int Med 144 (2006). impact analysis;example of decision aid being
ignored or overruled making MD decisions worse;assumed utilities are constant across subjects by concluding that
directives have more impact than predictions;Goldman-Cook clinical prediction rule in AMI, pp. 201–209 (cit. on
p. 1-10).
[160] J. P. Reiter. “Small-Sample Degrees of Freedom for Multi-Component Significance Tests with Multiple Imputation
for Missing Data”. In: Biometrika 94.2 (Feb. 2007), pp. 502–508. issn: 0006-3444. doi: 10.1093/biomet/asm028.
url: http://dx.doi.org/10.1093/biomet/asm028 (cit. on p. 3-13).
[161] Richard D. Riley et al.“Minimum Sample Size for Developing a Multivariable Prediction Model: Part I – Continuous
Outcomes”. In: Statistics in Medicine 38.7 (Mar. 30, 2019), pp. 1276–1296. issn: 1097-0258. doi: 10.1002/sim.
7993. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7993 (visited on 01/19/2019) (cit. on
p. 4-20).
[162] Richard D. Riley et al. “Minimum Sample Size for Developing a Multivariable Prediction Model: PART II - Binary
and Time-to-Event Outcomes”. In: Statistics in Medicine 38.7 (Mar. 30, 2019), pp. 1276–1296. issn: 1097-0258.
doi: 10.1002/sim.7992. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7992 (visited on
01/19/2019) (cit. on p. 4-20).
[163] Ellen B. Roecker. “Prediction Error and Its Estimation for Subset-Selected Models”. In: Technometrics 33 (1991),
pp. 459–468 (cit. on pp. 4-15, 5-16).
[164] Patrick Royston, Douglas G. Altman, and Willi Sauerbrei. “Dichotomizing Continuous Predictors in Multiple Re-
gression: A Bad Idea”. In: Stat Med 25 (2006). destruction of statistical inference when cutpoints are chosen
using the response variable; varying effect estimates when change cutpoints;difficult to interpret effects when di-
chotomize;nice plot showing effect of categorization; PBC data, pp. 127–141. doi: 10 . 1002 / sim . 2331. url:
http://dx.doi.org/10.1002/sim.2331 (cit. on p. 2-13).
[165] D. Rubin and N. Schenker.“Multiple Imputation in Health-Care Data Bases: An Overview and Some Applications”.
In: Stat Med 10 (1991), pp. 585–598 (cit. on p. 3-12).
[166] Warren Sarle. “The VARCLUS Procedure”. In: SAS/STAT User’s Guide. fourth. Vol. 2. Cary, NC: SAS Institute,
Inc., 1990. Chap. 43, pp. 1641–1659. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on
pp. 4-25, 4-28).
[167] Willi Sauerbrei and Martin Schumacher. “A Bootstrap Resampling Procedure for Model Building: Application to
the Cox Regression Model”. In: Stat Med 11 (1992), pp. 2093–2109 (cit. on pp. 4-16, 5-17).
[168] Joseph L. Schafer and John W. Graham.“Missing Data: Our View of the State of the Art”. In: Psych Meth 7 (2002).
excellent review and overview of missing data and imputation;problems with MICE;less technical description of 3
types of missing data, pp. 147–177 (cit. on p. 3-1).
[169] G. Schulgen et al. “Outcome-Oriented Cutpoints in Quantitative Exposure”. In: Am J Epi 120 (1994), pp. 172–184
(cit. on pp. 2-13, 2-15).
[170] E. Selvin et al. “Glycated Hemoglobin, Diabetes, and Cardiovascular Risk in Nondiabetic Adults”. In: NEJM 362.9
(Mar. 2010), pp. 800–811. doi: 10.1056/NEJMoa0908359. url: http://dx.doi.org/10.1056/NEJMoa0908359
(cit. on p. 2-27).
[171] Stephen Senn. “Change from Baseline and Analysis of Covariance Revisited”. In: Stat Med 25 (2006). shows
that claims that in a 2-arm study it is not true that ANCOVA requires the population means at baseline to be
identical;refutes some claims of lia00lon;problems with counterfactuals;temporal additivity (”amounts to supposing
that despite the fact that groups are difference at baseline they would show the same evolution over time”);causal
additivity;is difficult to design trials for which simple analysis of change scores is unbiased, ANCOVA is biased,
and a causal interpretation can be given;temporally and logically, a ”baseline cannot be a <i>response</i> to
treatment”, so baseline and response cannot be modeled in an integrated framework as Laird and Ware’s model
has been used;”one should focus clearly on ‘outcomes’ as being the only values that can be influenced by treatment
and examine critically any schemes that assume that these are linked in some rigid and deterministic view to
‘baseline’ values. An alternative tradition sees a baseline as being merely one of a number of measurements capable
of improving predictions of outcomes and models it in this way.”;”You cannot establish necessary conditions for an
estimator to be valid by nominating a model and seeing what the model implies unless the model is universally
agreed to be impeccable. On the contrary it is appropriate to start with the estimator and see what assumptions
are implied by valid conclusions.”;this is in distinction to lia00lon, pp. 4334–4344 (cit. on p. 7-5).

[172] Jun Shao. “Linear Model Selection by Cross-Validation”. In: J Am Stat Assoc 88 (1993), pp. 486–494 (cit. on
p. 5-17).
[173] Noah Simon et al.“A Sparse-Group Lasso”. In: J Comp Graph Stat 22.2 (2013). sparse effects both on a group and
within group levels;can also be considered special case of group lasso allowing overlap between groups, pp. 231–245.
doi: 10.1080/10618600.2012.681250. eprint: http://www.tandfonline.com/doi/pdf/10.1080/10618600.
2012 . 681250. url: http : / / www . tandfonline . com / doi / abs / 10 . 1080 / 10618600 . 2012 . 681250 (cit. on
p. 2-37).
[174] Sean L. Simpson et al. “A Linear Exponent AR(1) Family of Correlation Structures”. In: Stat Med 29 (2010),
pp. 1825–1838 (cit. on p. 7-14).
[175] L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. “Problems and Potentials in Modeling Survival”. In: Medical
Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056. Ed. by Mary L. Grady and
Harvey A. Schwartz. Rockville, MD: US Dept. of Health and Human Services, Agency for Health Care Policy and
Research, 1992, pp. 151–159. url: https://hbiostat.org/bib/papers/smi92pro.pdf (cit. on p. 4-19).
[176] Alan Spanos, Frank E. Harrell, and David T. Durack. “Differential Diagnosis of Acute Meningitis: An Analysis of the
Predictive Value of Initial Observations”. In: JAMA 262 (1989), pp. 2700–2707. doi: 10.1001/jama.262.19.2700.
url: http://dx.doi.org/10.1001/jama.262.19.2700 (cit. on pp. 8-46, 8-49).
[177] Ian Spence and Robert F. Garrison. “A Remarkable Scatterplot”. In: Am Statistician 47 (1993), pp. 12–19 (cit. on
p. 4-41).
[178] D. J. Spiegelhalter. “Probabilistic Prediction in Patient Management and Clinical Trials”. In: Stat Med 5 (1986).
z-test for calibration inaccuracy (implemented in Stata and in the R Hmisc package’s val.prob function), pp. 421–433. doi:
10.1002/sim.4780050506. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780050506
(cit. on pp. 4-22, 4-48, 5-20, 5-21).
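Because several chapters cite this calibration z-test, a minimal sketch of obtaining it from the Hmisc val.prob function follows; the data are simulated and all object names are hypothetical.

require(Hmisc)
set.seed(1)
p <- runif(200)                  # hypothetical predicted probabilities
y <- rbinom(200, 1, p)           # outcomes simulated to be perfectly calibrated
v <- val.prob(p, y, pl=FALSE)    # pl=FALSE suppresses the calibration plot
v[c('S:z', 'S:p')]               # Spiegelhalter z statistic and two-sided p-value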
[179] Ewout W. Steyerberg. Clinical Prediction Models. New York: Springer, 2009 (cit. on pp. xv, 15-15).
[180] Ewout W. Steyerberg. “Validation in Prediction Research: The Waste by Data-Splitting”. In: Journal of Clinical
Epidemiology 0.0 (July 28, 2018). issn: 0895-4356, 1878-5921. doi: 10.1016/j.jclinepi.2018.07.010. url:
https://www.jclinepi.com/article/S0895-4356(18)30485-2/abstract (visited on 07/30/2018) (cit. on
p. 5-17).
[181] Ewout W. Steyerberg et al. “Prognostic Modeling with Logistic Regression Analysis: In Search of a Sensible Strategy
in Small Data Sets”. In: Med Decis Mak 21 (2001), pp. 45–56 (cit. on p. 4-1).
[182] Ewout W. Steyerberg et al. “Prognostic Modelling with Logistic Regression Analysis: A Comparison of Selection
and Estimation Methods in Small Data Sets”. In: Stat Med 19 (2000), pp. 1059–1079 (cit. on p. 2-36).
[183] C. J. Stone. “Comment: Generalized Additive Models”. In: Stat Sci 1 (1986), pp. 312–314 (cit. on p. 2-28).
[184] C. J. Stone and C. Y. Koo. “Additive Splines in Statistics”. In: Proceedings of the Statistical Computing Section
ASA. Washington, DC, 1985, pp. 45–48 (cit. on pp. 2-22, 2-29).
[185] Samy Suissa and Lucie Blais. “Binary Regression with Continuous Outcomes”. In: Stat Med 14 (1995), pp. 247–255.
doi: 10.1002/sim.4780140303. url: http://dx.doi.org/10.1002/sim.4780140303 (cit. on p. 2-13).
[186] Thomas R. Sullivan et al. “Bias and Precision of the “Multiple Imputation, Then Deletion” Method for Dealing
With Missing Outcome Data”. In: American Journal of Epidemiology 182.6 (Sept. 15, 2015). Disagrees with von
Hippel approach of “impute then delete” for Y, pp. 528–534. issn: 0002-9262. doi: 10.1093/aje/kwv100. url:
https://doi.org/10.1093/aje/kwv100 (visited on 01/05/2021) (cit. on p. 3-4).
[187] Guo-Wen Sun, Thomas L. Shook, and Gregory L. Kay. “Inappropriate Use of Bivariable Analysis to Screen Risk
Factors for Use in Multivariable Analysis”. In: J Clin Epi 49 (1996), pp. 907–916 (cit. on p. 4-20).
[188] Stan Development Team. “Stan: A C++ Library for Probability and Sampling”. In: (2020). url: https://cran.r-
project.org/package=rstan (cit. on p. 8-52).
[189] Robert Tibshirani. “Regression Shrinkage and Selection via the Lasso”. In: J Roy Stat Soc B 58 (1996), pp. 267–288
(cit. on p. 2-36).
[190] Tue Tjur. “Coefficients of Determination in Logistic Regression Models—A New Proposal: The Coefficient of
Discrimination”. In: Am Statistician 63.4 (2009), pp. 366–372 (cit. on p. 8-39).
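Tjur's coefficient of discrimination reduces to a one-line computation. A minimal sketch, where p and y are hypothetical vectors of predicted probabilities and 0/1 outcomes:

# Tjur's coefficient of discrimination: mean predicted probability
# among events minus mean predicted probability among non-events
D <- mean(p[y == 1]) - mean(p[y == 0])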
[191] Jos Twisk et al. “Multiple Imputation of Missing Values Was Not Necessary before Performing a Longitudinal
Mixed-Model Analysis”. In: J Clin Epi 66.9 (2013), pp. 1022–1028. doi: 10.1016/j.jclinepi.2013.03.017. url:
http://dx.doi.org/10.1016/j.jclinepi.2013.03.017 (cit. on p. 3-3).
[192] Werner Vach and Maria Blettner. “Missing Data in Epidemiologic Studies”. In: Ency of Biostatistics. New York:
Wiley, 1998, pp. 2641–2654 (cit. on p. 3-5).
[193] Ben Van Calster et al. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data”.
In: J Clin Epi 74 (June 2016), pp. 167–176. issn: 08954356. doi: 10.1016/j.jclinepi.2015.12.005. url:
http://dx.doi.org/10.1016/j.jclinepi.2015.12.005 (cit. on p. 5-4).
[194] Geert J. M. G. van der Heijden et al. “Imputation of Missing Values Is Superior to Complete Case Analysis and
the Missing-Indicator Method in Multivariable Diagnostic Research: A Clinical Example”. In: J Clin Epi 59 (2006).
Invalidity of adding a new category or an indicator variable for missing values even with MCAR, pp. 1102–1109.
doi: 10.1016/j.jclinepi.2006.01.015. url: http://dx.doi.org/10.1016/j.jclinepi.2006.01.015
(cit. on p. 3-5).
[195] Tjeerd van der Ploeg, Peter C. Austin, and Ewout W. Steyerberg. “Modern Modelling Techniques Are Data Hungry:
A Simulation Study for Predicting Dichotomous Endpoints.” In: BMC medical research methodology 14.1 (Dec.
2014). Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed
to final discrimination measure. But much good practical information. Recursive partitioning fared poorly., pp. 137+.
issn: 1471-2288. doi: 10.1186/1471-2288-14-137. pmid: 25532820. url: http://dx.doi.org/10.1186/1471-
2288-14-137 (cit. on pp. 2-4, 2-40, 4-19).
[196] S. van Buuren et al. “Fully Conditional Specification in Multivariate Imputation”. In: J Stat Computation Sim 76.12
(2006). justification for chained equations alternative to full multivariate modeling, pp. 1049–1064 (cit. on pp. 3-15,
3-16).
[197] J. C. van Houwelingen and S. le Cessie. “Predictive Value of Statistical Models”. In: Stat Med 9 (1990), pp. 1303–
1325 (cit. on pp. 2-29, 4-22, 5-17, 5-21, 5-22).
[198] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth. New York: Springer-Verlag,
2003. isbn: 0-387-95457-0 (cit. on p. 11-2).
[199] Geert Verbeke and Geert Molenberghs. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000.
[200] Pierre J. Verweij and Hans C. van Houwelingen. “Penalized Likelihood in Cox Regression”. In: Stat Med 13 (1994),
pp. 2427–2436 (cit. on p. 4-22).
[201] Andrew J. Vickers. “Decision Analysis for the Evaluation of Diagnostic Tests, Prediction Models, and Molecular
Markers”. In: Am Statistician 62.4 (2008). limitations of accuracy metrics;incorporating clinical consequences;nice
example of calculation of expected outcome;drawbacks of conventional decision analysis, especially because of the
difficulty of eliciting the expected harm of a missed diagnosis;use of a threshold on the probability of disease for
taking some action;decision curve;has other good references to decision analysis, pp. 314–320 (cit. on p. 1-7).
[202] Gerko Vink et al. “Predictive Mean Matching Imputation of Semicontinuous Variables”. In: Statistica Neerlandica
68.1 (Feb. 2014), pp. 61–90. issn: 00390402. doi: 10.1111/stan.12023. url: http://dx.doi.org/10.1111/
stan.12023 (cit. on p. 3-9).
[203] Eric Vittinghoff and Charles E. McCulloch. “Relaxing the Rule of Ten Events per Variable in Logistic and Cox
Regression”. In: Am J Epi 165 (2006). the authors may not have been quite stringent enough in their assessment
of adequacy of predictions;letter to the editor submitted, pp. 710–718 (cit. on p. 4-19).
[204] Paul T. von Hippel. “Regression with Missing Ys: An Improved Strategy for Analyzing Multiple Imputed Data”. In:
Soc Meth 37.1 (2007), pp. 83–117 (cit. on p. 3-4).
[205] Paul T. von Hippel. “The Number of Imputations Should Increase Quadratically with the Fraction of Missing
Information”. Aug. 2016. arXiv: 1608.05406. url: http://arxiv.org/abs/1608.05406 (cit. on p. 3-19).
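The quadratic rule in this reference is often stated in the form sketched below; the constants here are an assumption and should be checked against the paper, and gamma and cv are hypothetical inputs:

# von Hippel's quadratic rule (form assumed; verify against the reference):
# the number of imputations grows with the square of the fraction of
# missing information gamma, for a target coefficient of variation cv
# of the standard error estimate
gamma <- 0.3                       # hypothetical fraction of missing information
cv    <- 0.05                      # hypothetical acceptable relative SE error
m     <- ceiling(1 + 0.5 * (gamma / cv)^2)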
[206] Howard Wainer. “Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect”.
In: Chance 19.1 (2006). can find bins that yield either positive or negative association;especially pertinent when
effects are small;“With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk.” - John von
Neumann, pp. 49–56 (cit. on pp. 2-13, 2-16).
[207] S. H. Walker and D. B. Duncan. “Estimation of the Probability of an Event as a Function of Several Independent
Variables”. In: Biometrika 54 (1967), pp. 167–178 (cit. on p. 10-4).
[208] Hansheng Wang and Chenlei Leng. “Unified LASSO Estimation by Least Squares Approximation”. In: J Am Stat
Assoc 102 (2007), pp. 1039–1048. doi: 10.1198/016214507000000509. url: http://dx.doi.org/10.1198/
016214507000000509 (cit. on p. 2-36).
[209] S. Wang et al. “Hierarchically Penalized Cox Regression with Grouped Variables”. In: Biometrika 96.2 (2009),
pp. 307–322 (cit. on p. 2-37).
[210] Yohanan Wax. “Collinearity Diagnosis for a Relative Risk Regression Analysis: An Application to Assessment of
Diet-Cancer Relationship in Epidemiological Studies”. In: Stat Med 11 (1992), pp. 1273–1287 (cit. on p. 4-25).
[211] T. L. Wenger et al. “Ventricular Fibrillation Following Canine Coronary Reperfusion: Different Outcomes with
Pentobarbital and α-Chloralose”. In: Can J Phys Pharm 62 (1984), pp. 224–228 (cit. on p. 8-47).
[212] Ian R. White and John B. Carlin. “Bias and Efficiency of Multiple Imputation Compared with Complete-Case
Analysis for Missing Covariate Values”. In: Stat Med 29 (2010), pp. 2920–2931 (cit. on p. 3-15).
[213] Ian R. White and Patrick Royston. “Imputing Missing Covariate Values for the Cox Model”. In: Stat Med 28 (2009).
approach to using event time and censoring indicator as predictors in the imputation model for missing baseline
covariates;recommended an approximation using the event indicator and the cumulative hazard transformation of
time, without their interaction, pp. 1982–1998 (cit. on p. 3-3).
[214] Ian R. White, Patrick Royston, and Angela M. Wood. “Multiple Imputation Using Chained Equations: Issues and
Guidance for Practice”. In: Stat Med 30.4 (2011). practical guidance for the use of multiple imputation using
chained equations;MICE;imputation models for different types of target variables;PMM choosing at random from
among a few closest matches;choosing number of multiple imputations by a reproducibility argument, suggesting
100f imputations when f is the fraction of cases that are incomplete, pp. 377–399 (cit. on pp. 3-1, 3-10, 3-15,
3-19).
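The 100f guideline in this reference translates directly into code; a minimal sketch, where d is a hypothetical data frame with missing values:

# Rule of thumb from White, Royston, and Wood (2011): use about 100*f
# imputations, where f is the fraction of incomplete cases
f <- mean(!complete.cases(d))    # fraction of rows with any missing value
m <- ceiling(100 * f)            # suggested number of multiple imputations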
[215] John Whitehead.“Sample Size Calculations for Ordered Categorical Data”. In: Stat Med 12 (1993), pp. 2257–2271
(cit. on p. 4-20).
See letter to editor SM 15:1065-6 for binary case;see errata in SM 13:871 1994;see kol95com, jul96sam.
[216] Ryan E. Wiegand. “Performance of Using Multiple Stepwise Algorithms for Variable Selection”. In: Stat Med 29
(2010). fruitless to try different stepwise methods and look for agreement;the methods will agree on the wrong
model, pp. 1647–1659 (cit. on p. 4-15).
[217] Daniela M. Witten and Robert Tibshirani. “Testing Significance of Features by Lassoed Principal Components”. In:
Ann Appl Stat 2.3 (2008). reduction in false discovery rates over using a vector of t-statistics;borrowing strength
across genes;“one would not expect a single gene to be associated with the outcome, since, in practice, many genes
work together to effect a particular phenotype. LPC effectively down-weights individual genes that are associated
with the outcome but that do not share an expression pattern with a larger group of genes, and instead favors
large groups of genes that appear to be differentially-expressed.”;regress principal components on outcome;sparse
principal components, pp. 986–1012 (cit. on p. 2-37).
[218] S. N. Wood. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman & Hall/CRC, 2006.
isbn: 9781584884743 (cit. on p. 2-38).
[219] C. F. J. Wu. “Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis”. In: Ann Stat 14.4
(1986), pp. 1261–1350 (cit. on p. 5-17).
[220] Shifeng Xiong. “Some Notes on the Nonnegative Garrote”. In: Technometrics 52.3 (2010). “... to select tuning
parameters, it may be unnecessary to optimize a model selection criterion repeatedly”;natural selection of penalty
function, pp. 349–361 (cit. on p. 2-37).
[221] Jianming Ye. “On Measuring and Correcting the Effects of Data Mining and Model Selection”. In: J Am Stat Assoc
93 (1998), pp. 120–131 (cit. on p. 1-14).
[222] F. W. Young, Y. Takane, and J. de Leeuw. “The Principal Components of Mixed Measurement Level Multivariate
Data: An Alternating Least Squares Method with Optimal Scaling Features”. In: Psychometrika 43 (1978), pp. 279–
281 (cit. on p. 4-29).
[223] Recai M. Yucel and Alan M. Zaslavsky. “Using Calibration to Improve Rounding in Imputation”. In: Am Statistician
62.2 (2008). using rounding to impute binary variables using techniques for continuous data;uses the method to
solve for the cutpoint for a continuous estimate to be converted into a binary value;method should be useful in
more general situations;idea is to duplicate the entire dataset and in the second half of the new dataset to set all
non-missing values of the target variable to missing;multiply impute these now-missing values and compare them
to the actual values, pp. 125–129 (cit. on p. 3-17).
[224] Hao H. Zhang and Wenbin Lu. “Adaptive Lasso for Cox’s Proportional Hazards Model”. In: Biometrika 94 (2007).
penalty function has ratios against original MLE;scale-free lasso, pp. 691–703 (cit. on p. 2-36).
[225] Min Zhang et al. “Interaction Analysis under Misspecification of Main Effects: Some Common Mistakes and Simple
Solutions”. In: Statistics in Medicine n/a.n/a (2020). issn: 1097-0258. doi: 10.1002/sim.8505. eprint:
https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.8505. url:
https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.8505 (visited on 02/27/2020) (cit. on p. 2-47).
[226] Hui Zou, Trevor Hastie, and Robert Tibshirani. “Sparse Principal Component Analysis”. In: J Comp Graph Stat
15 (2006). principal components analysis that shrinks some loadings to zero, pp. 265–286 (cit. on p. 2-37).
[227] Hui Zou and Trevor Hastie. “Regularization and Variable Selection via the Elastic Net”. In: J Roy Stat Soc B 67.2
(2005), pp. 301–320 (cit. on p. 2-36).
R packages written by FE Harrell are freely available from CRAN, and are managed at github.com/harrelfe.
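A minimal sketch of installing two of these packages from CRAN (the choice of Hmisc and rms here is illustrative):

install.packages(c('Hmisc', 'rms'))   # core FE Harrell packages used in this course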
To obtain a book with detailed examples and case studies and notes on the theory and applications of survival analysis,
logistic regression, ordinal regression, linear models, and longitudinal models, order Regression Modeling Strategies with
Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, 2nd Edition by FE Harrell from
Springer NY (2015). Steyerberg [179] and Dupont [59] are excellent texts for accompanying the book.
To obtain a glossary of statistical terms and other handouts related to diagnostic and prognostic modeling,
see hbiostat.org/doc/glossary.pdf