Regression Modeling Strategies
Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Nashville TN 37232 USA
[email protected]
hbiostat.org/rms
hbiostat.org/doc/rms/4day.html
An icon in the right margin indicates a hyperlink to a YouTube video related to the subject. Another icon in the right margin is a hyperlink to an audio file elaborating on the notes. Red letters and numbers in the right margin are cues referred to within the audio recordings. Rotated boxed blue text in the right margin at the start of a section (e.g., howto) is the mnemonic key for linking to archival discussions about that section in the vbiostatcourse.slack.com channel #rms.
blog in the right margin is a link to a blog entry that further discusses the topic.
Course Philosophy
Introduction
1.1
intro-hep
Even when only testing H0, a model-based approach has advantages:
– log-rank → Cox
1.2
intro-ex
Financial performance, consumer purchasing, loan pay-back
Ecology
Product life
Employment discrimination
Cost-effectiveness ratios
– incremental cost / incremental ABSOLUTE benefit
1.3
intro-classify
Many analysts desire to develop “classifiers” instead of predictions
Suppose that D
tive) functions
Hidden problems:
– Utility function depends on variables not collected (subjects’ preferences) that are available only at the decision point
the utility (cost, loss function) of making each decision. Sensitivity and specificity do not provide this information. For example, if one estimated that the probability of a disease given age, sex, and symptoms is 0.1 and the “cost” of a false positive equaled the “cost” of a false negative, one would act as if the person does not have the disease. Given other utilities, one would make different decisions. If the utilities are unknown, one gives the best estimate of the probability of the outcome to the decision maker and lets her incorporate her own unspoken utilities in making an optimum decision for her.
Besides the fact that cutoffs do not apply to individuals, only to groups, individual decision making does not utilize sensitivity and specificity. For an individual we can compute Prob(Y = 1|X = x); we don’t care about Prob(Y = 1|X > c), and an individual having X = x would be quite puzzled if she were given Prob(X > c|future unknown Y) when she already knows X = x, so X is no longer a random variable.
Even when group decision making is needed, sensitivity and specificity can be bypassed. For mass marketing, for example, one can rank-order individuals by the estimated probability of buying the product, to create a lift curve. This is then used to target the k most likely buyers, where k is chosen to meet total program cost constraints.
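A minimal R sketch of this ranking idea, using hypothetical fitted probabilities (phat, k, and the values are illustrative only):

phat ← c(0.02, 0.30, 0.11, 0.85, 0.47, 0.64)   # hypothetical Prob(buy) per individual
k    ← 2                                        # budget allows contacting k individuals
target ← order(phat, decreasing = TRUE)[1 : k]
target                                          # indices of the k most likely buyers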
Will I be caught?
An answer by a dichotomizer: If you are among other cars, are you going faster than 73?
Better:
Will I be caught?
1.4
intro-plan
Chance that predictive model will be used [159]
Response definition, follow-up
Variable definitions
Observer variability
Missing data
Subjects
Sites
1. age
2. sex
3. acute clinical stability
4. principal diagnosis
5. severity of principal diagnosis
6. extent and severity of comorbidities
7. physical functional status
8. psychological, cognitive, and psychosocial functioning
9. cultural, ethnic, and socioeconomic attributes and behaviors
10. health status and quality of life
11. patient attitudes and preferences for outcomes
1.5
intro-choice
In biostatistics and epidemiology and most other areas we usually choose the model empirically
1.6
intro-uncertainty
Standard errors, C.L., P-values, R² wrong if computed as if the model were pre-specified
Stepwise variable selection is widely used and abused
genreg-intro
Prediction, capitalizing on efficient estimation methods such as maximum likelihood and the predominant additivity in a variety of problems
– E.g.: effects of age, smoking, and air quality add to predict lung capacity
Hypothesis testing
Alternative: Stratification
2.1
genreg-notation
Weighted sum of a set of independent or predictor variables
2.2
Model Formulations
genreg-modform
General regression model
C(Y |X) = g(X).
General linear regression model
C(Y |X) = g(Xβ).
Examples
2.3
genreg-interp
Suppose that Xj is linear and doesn’t interact with other X’s.ᵃ
C′(Y|X) = Xβ = β0 + β1X1 + . . . + βpXp
βj = C′(Y|X1, X2, . . . , Xj + 1, . . . , Xp) − C′(Y|X1, X2, . . . , Xj, . . . , Xp)
Drop the ′ from C′ and assume C(Y|X) is the property of Y that is linearly related to the weighted sum of the X’s.
2.3.1
Nominal Predictors
C(Y|T = J) = β0
C(Y|T = K) = β0 + β1
C(Y|T = L) = β0 + β2
C(Y|T = M) = β0 + β3
C(Y|T) = Xβ = β0 + β1X1 + β2X2 + β3X3,
where
X1 = 1 if T = K, 0 otherwise
ᵃ Note that it is not necessary to “hold constant” all other variables to be able to interpret the effect of one predictor. It is sufficient to hold constant the weighted sum of all the variables other than Xj. And in many cases it is not physically possible to hold other variables constant while varying one, e.g., when a model contains X and X² (David Hoaglin, personal communication).
X2 = 1 if T = L, 0 otherwise
X3 = 1 if T = M, 0 otherwise.
The test for any differences in the property C(Y) between treatments is H0 : β1 = β2 = β3 = 0.
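As a small sketch (hypothetical data), R constructs these dummy variables automatically when T is a factor with reference level J:

T ← factor(c('J', 'K', 'L', 'M', 'J', 'K'))
model.matrix(∼ T)   # columns TK, TL, TM play the roles of X1, X2, X3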
2.3.2
Interactions
genreg-interp-interact
With X1 and X2, the effect of X1 on Y depends on the level of X2. One way to describe interaction is to add X3 = X1X2 to the model:
C(Y|X) = β0 + β1X1 + β2X2 + β3X1X2.
2.3.3
Parameter  Meaning
β0   C(Y|age = 0, sex = m)
β1   C(Y|age = x + 1, sex = m) − C(Y|age = x, sex = m)
β2   C(Y|age = 0, sex = f) − C(Y|age = 0, sex = m)
β3   C(Y|age = x + 1, sex = f) − C(Y|age = x, sex = f) − [C(Y|age = x + 1, sex = m) − C(Y|age = x, sex = m)]
In the model
y ∼ age + sex + weight + waist + tricep
2.4
genreg-nonlinear
2.4.1
Avoiding Categorization
genreg-cat
“Nature does not make jumps” — Lucy D’Agostino McGowan
2.4.2
genreg-poly
C(Y|X1) = β0 + β1X1 + β2X1².
2.4.3
[Figure 2.1 plots the spline function against X ∈ [0, 6].]
Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.
2.4.4
genreg-cubic
Cubic splines are smooth at knots (function, first and second derivatives agree) — can’t see joins.
X1 = X    X2 = X²
X3 = X³   X4 = (X − a)³₊
X5 = (X − b)³₊   X6 = (X − c)³₊
k knots → k + 3 coefficients excluding intercept.
X² and X³ terms must be included to allow nonlinearity when X < a.
stats.stackexchange.com/questions/421964 has some useful descriptions of what happens at the knots, e.g.:
Knots are where different cubic polynomials are joined, and cubic splines force there to be three levels of continuity (the function, its slope, and its acceleration or second derivative (slope of the slope) do not change) at these points. At the knots the jolt (third derivative or rate of change of acceleration) is allowed to change suddenly, meaning the jolt is allowed to be discontinuous.
2.4.5
Stone and Koo [184]: cubic splines are poorly behaved in the tails. Constrain the function to be linear in the tails.
k + 3 → k − 1 parameters [56].
To force linearity when X < a: the X² and X³ terms must be omitted.
[Figure 2.2 panels: the function; its first derivative (slope, rate of change of the function); second derivative (acceleration, rate of change of slope); and third derivative (jolt, rate of change of acceleration), each plotted against x ∈ [0, 1].]
Figure 2.2: A regular cubic spline function with three levels of continuity that prevent the human eye from detecting the knots. Also shown are the function’s first three derivatives. Knots are located at x = 0.25, 0.5, 0.75. For x beyond the outer knots, the function is not restricted to be linear. Linearity would imply an acceleration of zero. Vertical lines are drawn at the knots.
To force linearity when X > the last knot: the last two βs are redundant, i.e., are just combinations of the other βs.
The restricted cubic spline function with k knots t1, . . . , tk is given by [56]
f(X) = β0 + β1X1 + β2X2 + . . . + β_{k−1}X_{k−1},
where X1 = X and for j = 1, . . . , k − 2,
X_{j+1} = (X − t_j)³₊ − (X − t_{k−1})³₊ (t_k − t_j)/(t_k − t_{k−1}) + (X − t_k)³₊ (t_{k−1} − t_j)/(t_k − t_{k−1}).
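A small sketch of computing these basis functions with Hmisc::rcspline.eval (knot locations chosen here purely for illustration):

require(Hmisc)
x ← seq(0, 1, length = 100)
X ← rcspline.eval(x, knots = c(.05, .275, .5, .725, .95), inclx = TRUE)
dim(X)   # 100 x 4: X itself plus k - 2 = 3 nonlinear basis functions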
[Figure 2.3 plots the basis functions against X in two panels.]
Figure 2.3: Restricted cubic spline component variables for k = 5 and knots at X = .05, .275, .5, .725, and .95. Nonlinear basis functions are scaled by τ. The left panel is a y-magnification of the right panel. Fitted functions such as those in Figure 2.4 will be linear combinations of these basis functions as long as knots are at the same locations used here.
    for (j in 1 : nk)
      arrows(knots[j], .04, knots[j], -.03,
             angle = 20, length = .07, lwd = 1.5)
  }
  else lines(x, xbeta, col = i)
}
}
[Figure 2.4: four panels for 3, 4, 5, and 6 knots, each plotting Xβ against X on [0, 1].]
Figure 2.4: Some typical restricted cubic spline functions for k = 3, 4, 5, 6. The y-axis is Xβ. Arrows indicate knots. These curves were derived by randomly choosing values of β subject to standard deviations of fitted functions being normalized.
Once β0, . . . , β_{k−1} are estimated, the restricted cubic spline can be restated in unrestricted form. The test of linearity in X is
H0 : β2 = β3 = . . . = β_{k−1} = 0.
Example: [170]
CHAPTER 2. GENERAL ASPECTS OF FITTING REGRESSION MODELS 2-28
2.4.6
k Quantiles
3 .10 .5 .90
4 .05 .35 .65 .95
5 .05 .275 .5 .725 .95
6 .05 .23 .41 .59 .77 .95
7 .025 .1833 .3417 .5 .6583 .8167 .975
n < 100 – replace outer quantiles with 5th smallest and 5th
largest X [184].
Choice of k:
Usually k = 3, 4, 5. Often k = 4
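A sketch showing where the default quantile-based knots land for a sample (simulated data; rcs() in model formulas applies the same defaults):

require(Hmisc)
set.seed(1)
x ← rnorm(200)
rcspline.eval(x, nk = 5, knots.only = TRUE)   # 5 default knot locations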
2.4.7
Nonparametric Regression
genreg-nonpar
Estimate the tendency (mean or median) of Y as a function of X
Few assumptions
X: 1 2 3 5 8
Y : 2.1 3.8 5.7 11.1 17.2
2.5
genreg-rpart
Breiman, Friedman, Olshen, and Stone [27]: CART (Classifica-
tion and Regression Trees) — essentially model-free
Method:
See [10].
2.5.1
2.5.2
Researchers are under the mistaken impression that machine learning can be used on small samples
Even with identical twins, one twin may get colon cancer and the other not; model tendencies instead of doing classification
One could never have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error
2.6
genreg-multidf
C(Y|X) = β0 + β1X1 + β2X2 + β3X2²,
H0 : β2 = β3 = 0 with 2 d.f. to assess association between X2 and outcome.
In the 5-knot restricted cubic spline model
C(Y|X) = β0 + β1X + β2X′ + β3X′′ + β4X′′′,
H0 : β1 = . . . = β4 = 0
BUT if one tests against 2 d.f. one can only lose power when compared with the original F for testing both βs
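A sketch of the pooled multiple d.f. test via anova() in rms (simulated data; the x2 row pools linear and nonlinear spline terms into one 4 d.f. association test):

require(rms)
set.seed(1)
x1 ← rnorm(100); x2 ← runif(100)
y  ← x1 + (x2 - 0.5)^2 + rnorm(100)
f  ← ols(y ∼ x1 + rcs(x2, 5))
anova(f)   # x2: 4 d.f. test of association; Nonlinear: 3 d.f. test of linearity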
2.7
genreg-gof
2.7.1
Regression Assumptions
[Figure 2.5 plots C(Y) against X2, with separate curves for X1 = 0 and X1 = 1.]
Figure 2.5: Regression assumptions for one binary and one continuous predictor
Disadvantages:
Complexity
Test of linearity: H0 : β3 = β4 = 0
genreg-interact
Note: Interactions will be misleading if main effects are not properly modeled [225].
See Gray [81, 82, Section 3.2] for penalized splines allowing control of effective degrees of freedom. See Berhane et al. [18] for a good discussion of tensor splines.
Figure 2.6: Logistic regression estimate of probability of a hemorrhagic stroke for patients in the GUSTO-I trial given t-PA, using
a tensor spline of two restricted cubic splines and penalization (shrinkage). Dark (cold color) regions are low risk, and bright (hot)
regions are higher risk.
Figure 2.6 is particularly interesting because the literature had suggested (based on approximately 24 strokes) that pulse pressure was the main cause of hemorrhagic stroke, whereas this flexible modeling approach (based on approximately 230 strokes) suggests that mean arterial blood pressure (roughly a 45° line) is what is most important over a broad range of blood pressures. At the far right one can see that pulse pressure (axis perpendicular to the 45° line) may have an impact, although a non-monotonic one.
Other issues:
Race × disease
The section between the two horizontal blue lines was inserted after the audio narration was recorded.
Long after the last patient was sampled, the blood samples are thawed all in the same week and a blood analysis is done
2.7.3
2.7.4
Distributional Assumptions
2.8
[1] 721.5237
[Plot: Acute Coronary Cases Per 100,000 vs. Time Since Start of Study, months]
kn ← k
# rcspline.eval is the rcs workhorse
h ← function(x) cbind(rcspline.eval(x, kn),
                      sin = sin(2 * pi * x / 12), cos = cos(2 * pi * x / 12))
f ← Glm(aces ∼ offset(log(stdpop)) + gTrans(time, h),
        data = d, family = poisson)
f$aic
[1] 674.112
[Plot: fitted Acute Coronary Cases Per 100,000 vs. Time Since Start of Study, months]
[1] 661.7904
[Plot: fitted Acute Coronary Cases Per 100,000 vs. Time Since Start of Study, months]
Now make the slow trend simpler (6 knots) but add a discontinuity at the intervention. More finely control times at which …
[1] 659.6044
[Plot: fitted Acute Coronary Cases Per 100,000 vs. Time Since Start of Study, months]
Model statistics: Obs 59, Residual d.f. 51, g 0.0797; Likelihood Ratio Test: LR χ² 169.64, d.f. 7, Pr(>χ²) < 0.0001
Missing Data
3.1
missing-type
Missing completely at random (MCAR)
Missing at random (MAR)ᵃ
Informative missing (non-ignorable non-response)
See [33, 168, 58, 87, 1, 214, 31] for an introduction to missing data
and imputation concepts.
ᵃ “Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models.”
3.2
Prelude to Modeling
missing-prelude
Quantify extent of missing data
3.3
missing-y
Serial data with subjects dropping out (not covered in this course)ᵇ
ᵇ Twist et al. [191] found instability in using multiple imputation of longitudinal data, and advantages of using instead full likelihood models.
ᶜ White and Royston [213] provide a method for multiply imputing missing covariate values using censored survival time data.
ᵈ Y is so valuable that if one is only missing a Y value, imputation is not worthwhile, and imputation of Y is not advised if MCAR or MAR.
3.4
missing-alt
Deletion of records —
Badly biases parameter estimates when the probability of a
case being incomplete is related to Y and not just X [128].
See [110]
Difficult to interpret
…a 3-category variable with values high, low, and unmarried — Paul Allison, IMPUTE list, 4Jul09.
Results of 1000 simulations with β1 = 1.0 with MAR and two types of imputation
3.5
missing-impmodel
The goal of imputation is to preserve the information
and meaning of the non-missing data.
Can use external info not used in response model (e.g., zip
code for income)
…measured;
3. associated with the sometimes-missing variable when it is not missing; or
4. included in the final response model [12, 87]
Predictive model for each target uses any outcomes, all predictors in the final model other than the target, plus auxiliary variables not in the outcome model
3.5.1
Interactions
3.6
missing-single
Can fill in using the unconditional mean or median if the number of missings is low and X is unrelated to other Xs
3.7
missing-pmm
3.8
Multiple Imputation
missing-mi
Single imputation could use a random draw from the conditional distribution for an individual
– Average the M β̂
Note that multiple imputation can and should use the response variable for imputing predictors [138]
– Example:
a ← aregImpute(∼ age + sex + bp + death + heart.attack.before.death,
               data = mydata, n.impute = 5)
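Continuing the sketch above, fit.mult.impute fits the outcome model on each completed dataset and combines the M coefficient estimates and variances; the outcome model shown is hypothetical:

f ← fit.mult.impute(death ∼ age + sex + bp, lrm, a, data = mydata)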
See Barzi and Woodward [12] for a nice review of multiple imputation with detailed comparison of results (point estimates and confidence limits for the effect of the sometimes-missing predictor) for various imputation methods. Barnes et al. [11] have a good overview of imputation methods and a comparison of bias and confidence interval coverage for the methods when applied to longitudinal data with a small number of subjects. Horton and Kleinman [100] have a good review of several software packages for dealing with missing data, and a comparison of them with aregImpute. Harel and Zhou [87] provide a nice overview of multiple imputation and discuss some of the available software. White and Carlin [212] studied bias of multiple imputation vs. complete-case analysis. White et al. [214] provide much practical guidance.
3.9
Diagnostics
missing-dx
MCAR can be partially assessed by comparing the distribution of non-missing Y for those subjects with complete X vs. those subjects having incomplete X [128]
3.10
missing-summary
Method                        Deletion  Single  Multiple
Allows non-random missing                  x        x
Reduces sample size              x
Apparent S.E. of β̂ too low                x
Increases real S.E. of β̂        x
β̂ biased if not MCAR            x
f < 0.03: It doesn’t matter very much how you impute missings or whether you adjust variance of regression coefficient estimates for having imputed data in this case. For continuous variables imputing missings with the median non-missing value is adequate; for categorical predictors the most frequent category can be used. Complete case analysis is also an option here. Multiple imputation may be needed to check that the simple approach “worked.”
f ≥ 0.03: Use multiple imputation with number of imputationsᵍ equal to max(5, 100f). Fewer imputations may be possible with very large sample sizes. See statisticalhorizons.com/how-many-imputations. Type 1 predictive mean matching is usually preferred, with weighted selection of donors. Account …
ᵍ White et al. [214] recommend choosing M so that the key inferential statistics are very reproducible should the imputation analysis be repeated. They suggest the use of 100f imputations. See also [31, Section 2.7]. von Hippel [205] finds that the number of imputations should be quadratically increasing with the fraction of missing information.
3.10.1
3.11
– Pool all these posterior draws over all the multiple imputations and do posterior inference as usual, with no special correction required
Spend them
4.1
strategy-complexity
Rarely expect linearity
Motivating examples:
# Overfitting a flat relationship
require(rms)
set.seed(1)
x ← runif(1000)
y ← runif(1000, -0.5, 0.5)
dd ← datadist(x, y); options(datadist = 'dd')
par(mfrow = c(2, 2), mar = c(2, 2, 3, 0.5))
pp ← function(actual) {
  yhat  ← predict(f, data.frame(x = xs))
  yreal ← actual(xs)
  plot(0, 0, xlim = c(0, 1),
       ylim = range(c(quantile(y, c(0.1, 0.9)), yhat, yreal)),
       type = 'n', axes = FALSE)
  axis(1, labels = FALSE); axis(2, labels = FALSE)
  lines(xs, yreal)
  lines(xs, yhat, col = 'blue')
}
f  ← ols(y ∼ rcs(x, 5))
xs ← seq(0, 1, length = 150)
pp(function(x) 0 * x)
title('Mild Error:\nOverfitting a Flat Relationship', cex = 0.5)
y ← x + runif(1000, -0.5, 0.5)
f ← ols(y ∼ rcs(x, 5))
pp(function(x) x)
title('Mild Error:\nOverfitting a Linear Relationship', cex = 0.5)
y ← x^4 + runif(1000, -1, 1)
f ← ols(y ∼ x)
pp(function(x) x^4)
title('Serious Error:\nUnderfitting a Steep Relationship', cex = 0.5)
y ← -(x - 0.5)^2 + runif(1000, -0.2, 0.2)
f ← ols(y ∼ x)
pp(function(x) -(x - 0.5)^2)
title('Tragic Error:\nMonotonic Fit to\nNon-Monotonic Relationship', cex = 0.5)
[Four-panel figure from the code above: Mild Error: Overfitting a Flat Relationship; Mild Error: Overfitting a Linear Relationship; Serious Error: Underfitting a Steep Relationship; Tragic Error: Monotonic Fit to Non-Monotonic Relationship]
4.1.1
4.1.2
4.2
strategy-simult
Sometimes failure to adjust for other variables gives the wrong transformation of an X, or wrong significance of interactions
4.3
Variable Selection
strategy-selection
Series of potential predictors with no prior knowledge
first variable to enter has type I error of 0.39 when the nominal α is set to 0.05.
require(leaps)
  x8 ← rnorm(n)
}
else {
  x1 ← rnorm(n)
  x2 ← x1 + 2.0 * rnorm(n)        # was + 0.5 * rnorm(n)
  x3 ← rnorm(n)
  x4 ← x3 + 1.5 * rnorm(n)
  x5 ← x1 + rnorm(n) / 1.3
  x6 ← x2 + 2.25 * rnorm(n)       # was rnorm(n)/1.3
  x7 ← x3 + x4 + 2.5 * rnorm(n)   # was + rnorm(n)
  x8 ← x7 + 4.0 * rnorm(n)        # was + 0.5 * rnorm(n)
}
z  ← cbind(x1, x2, x3, x4, x5, x6, x7, x8)
if (prcor) return(round(cor(z), 2))
lp ← x1 + x2 + .5 * x3 + .4 * x7
y  ← lp + sigma * rnorm(n)
if (dataonly) return(list(x = z, y = y))
if (method == 'leaps') {
  s     ← summary(regsubsets(z, y))
  best  ← which.max(s$adjr2)
  xvars ← s$which[best, -1]   # remove intercept
  ssr   ← s$rss[best]
  p     ← sum(xvars)
  xs    ← if (p == 0) 'none' else paste((1 : 8)[xvars], collapse = '')
  if (pr) print(xs)
  ssesw ← (n - 1) * var(y) - ssr
  s2s   ← ssesw / (n - p - 1)
  yhat  ← if (p == 0) mean(y) else fitted(lm(y ∼ z[, xvars]))
}
f ← lm(y ∼ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
if (method == 'stepaic') {
  g  ← stepAIC(f, trace = 0)
  p  ← g$rank - 1
  xs ← if (p == 0) 'none' else
         gsub('[ \\+x]', '', as.character(formula(g))[3])
  if (pr) print(formula(g), showEnv = FALSE)
  ssesw ← sum(resid(g)^2)
  s2s   ← ssesw / g$df.residual
  yhat  ← fitted(g)
}
# Set SSEsw / (n - gdf - 1) = true sigma^2
gdf ← n - 1 - ssesw / (sigma^2)
# Compute root mean squared error against true linear predictor
rmse.full ← sqrt(mean((fitted(f) - lp)^2))
rmse.step ← sqrt(mean((yhat - lp)^2))
list(stats = c(n = n, vratio = s2s / (sigma^2),
               gdf = gdf, apparentdf = p,
               rmse.full = rmse.full, rmse.step = rmse.step),
     xselected = xs)
}
  xs[i] ← w$xselected
}
colnames(r) ← names(w$stats)
s ← apply(r, 2, median)
p ← r[, 'apparentdf']
s['apparentdf'] ← mean(p)
print(round(s, 2))
print(table(p))
cat('Prob[correct model]=', round(sum(xs == '1237') / B, 2), '\n')
}
x1 x2 x3 x4 x5 x6 x7 x8
x1 1.00 0.00 -0.01 0.00 0.01 0.01 0.00 0.01
x2 0.00 1.00 0.00 0.00 0.01 0.00 0.00 0.00
x3 -0.01 0.00 1.00 0.00 0.00 0.00 0.00 -0.01
x4 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.01
x5 0.01 0.01 0.00 0.01 1.00 0.01 0.00 0.00
x6 0.01 0.00 0.00 0.00 0.01 1.00 0.00 0.00
x7 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.01
x8 0.01 0.00 -0.01 0.01 0.00 0.00 0.01 1.00
Shown below, for varying sample sizes, are how often the exact correct model (X1, X2, X3, X7) was found, the median value of σ̂²/σ², the median effective d.f., and the mean number of apparent d.f.
set.seed(11)
m ← 'leaps'   # all possible regressions stopping on R2adj
rsim(100, 20, method = m)   # actual model found twice out of 100
9 18 24 29 15 5
Prob[correct model]= 0.04
4.3.1
Szilárd pointed out that a real-life Maxwell’s demon would need to have some means of measuring molecular speed, and that the act of acquiring information would require an expenditure of energy. Since the demon and the gas are interacting, we must consider the total entropy of the gas and the demon combined. The expenditure of energy by the demon will cause an increase in the entropy of the demon, which will be larger than the lowering of the entropy of the gas.
Source: commons.wikimedia.org/wiki/File:YoungJamesClerkMaxwell.jpg
en.wikipedia.org/wiki/Maxwell’s_demon
4.4
strategy-limits
Concerned with avoiding overfitting
p should be < m/15 [90, 91, 175, 146, 145, 203, 195]
Riley et al. [161, 162] have refined sample size estimation for continuous, binary, and time-to-event models to account for all of these issues
ᵃ …If the proportional odds assumption holds, the effective sample size for a binary response is 3n₁n₂/n ≈ 3 min(n₁, n₂) if n₁/n is near 0 or 1 [215, Eq. 10, 15]. Here n₁ and n₂ are the marginal frequencies of the two response levels [145].
ᵇ Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are n₁, . . . , n_k, compared with all cell sizes equal to unity (response is continuous) [215, Eq. 3]. If all cell sizes are equal, the relative efficiency of having k response categories compared to a continuous response is 1 − 1/k² [215, Eq. 14], e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.
ᶜ This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-…
4.5
Shrinkage
strategy-shrinkage
Statistical estimation procedure — “pre-shrunk” models
Aren’t regression coefficients OK because they’re unbiased?
[Figure 4.1 plots the 20 sorted group means against group number.]
Figure 4.1: Sorted means from 20 samples of size 50 from a uniform [0, 1] distribution. The reference line at 0.5 depicts the true population value of all of the means.
γ̂ = (model χ² − p) / model χ²
OLSᵈ: γ̂ = [(n − p − 1)/(n − 1)] · R²_adj / R²
R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1)
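A sketch of the heuristic shrinkage estimate for a binary logistic fit, using pure-noise simulated data so that heavy shrinkage is expected:

require(rms)
set.seed(1)
x ← replicate(10, rnorm(80))
y ← ifelse(runif(80) < 0.5, 1, 0)   # y unrelated to x
f ← lrm(y ∼ x)
g ← (f$stats['Model L.R.'] - f$stats['d.f.']) / f$stats['Model L.R.']
g   # estimated shrinkage factor; near 0 (or negative) here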
4.6
Collinearity
strategy-collinearity
When at least 1 predictor can be predicted well from others
4.7
Data Reduction
strategy-reduction
Unless n ≫ p, the model is unlikely to validate
Data reduction: ↓ p
4.7.1
Redundancy Analysis
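A sketch of redundancy analysis with Hmisc::redun, using simulated predictors in which x2 is nearly a function of x1 (the r2 cutoff is chosen here only for illustration):

require(Hmisc)
set.seed(1)
x1 ← rnorm(200)
x2 ← x1 + rnorm(200, sd = 0.2)   # nearly redundant with x1
x3 ← runif(200)
d  ← data.frame(x1, x2, x3)
redun(∼ x1 + x2 + x3, data = d, r2 = 0.8)   # flags variables predictable from the others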
4.7.2
Variable Clustering
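A sketch of hierarchical variable clustering with Hmisc::varclus (simulated data; similarity is squared Spearman ρ by default):

require(Hmisc)
set.seed(1)
d ← data.frame(matrix(rnorm(200 * 6), ncol = 6,
                      dimnames = list(NULL, paste0('x', 1 : 6))))
d$x2 ← d$x1 + rnorm(200, sd = 0.3)   # force x1, x2 into one cluster
v ← varclus(∼ x1 + x2 + x3 + x4 + x5 + x6, data = d)
plot(v)   # dendrogram; summarize each chosen cluster with its PC1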
strategy-transcan
Reduce p by estimating transformations using associations with other predictors
Purely categorical predictors – correspondence analysis [120,
51, 40, 83, 137]
blood.pressure[400:401] ← NA   # Create two missings
d ← data.frame(heart.rate, blood.pressure)
par(pch = 46)   # Figure 4.2
w ← transcan(∼ heart.rate + blood.pressure, transformed = TRUE,
             imputed = TRUE, show.na = TRUE, data = d)
0.007
Convergence in 4 iterations
R2 achieved in predicting each variable :
Adjusted R2 :
w$imputed$blood.pressure
     400      401
132.4057 109.7741
t ← w$transformed
spe ← round(c(spearman(heart.rate, blood.pressure),
              spearman(t[, 'heart.rate'], t[, 'blood.pressure'])), 2)
[Plots: transformed heart.rate and transformed blood.pressure]
Figure 4.2: Transformations fitted using transcan. Tick marks indicate the two imputed values for blood pressure.
[Figure 4.3 panels: raw heart.rate vs. blood.pressure and their transformed values]
Figure 4.3: The lower left plot contains raw data (Spearman ρ = −0.02); the lower right is a scatterplot of the corresponding transformed values (ρ = −0.13). Data courtesy of the SUPPORT study [109].
4.7.5
strategy-scoring
Try to score groups of transformed variables with PC1
Later may want to break the group apart, but delete all variables in groups whose summary scores do not add significant information
4.7.6
4.7.7
strategy-howmuch
Using Expected Shrinkage to Guide Data Reduction
Compute γ̂
Expected loss in LR is p − q
Example:
Variable clustering
  Goals: group predictors so that each group represents a single dimension that can be summarized with a single score; ↓ d.f. arising from multiple predictors; make PC1 a more reasonable summary.
  Methods: subject matter knowledge; group predictors to maximize the proportion of variance explained by PC1 of each group; hierarchical clustering using a matrix of similarity measures between predictors.
Principal components
  Goals: multiple dimensional scoring of all predictors; ↓ d.f. for all predictors combined.
  Methods: PCs 1, 2, . . . , k, k < p computed from all transformed predictors.
4.8
4.9
strategy-influence
Every observation should influence the fit
Statistical measures:
hᵢᵢ > 2(p + 1)/n may signal a high leverage point [14]
Some classify an observation as overly influential when |DFFITS| > 2√[(p + 1)/(n − p − 1)] [14]
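A sketch of such checks for an OLS fit, with simulated data and one planted influential point (cutoffs follow the rules above):

require(rms)
set.seed(1)
x ← rnorm(100); y ← x + rnorm(100)
x[100] ← 8; y[100] ← -8                # plant an influential observation
f ← ols(y ∼ x, x = TRUE, y = TRUE)
which.influence(f, cutoff = 0.3)       # observations with large DFBETAS
h ← hat(as.matrix(x))                  # leverages h_ii (intercept added by default)
which(h > 2 * (1 + 1) / 100)           # the 2(p + 1)/n rule with p = 1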
4.10
strategy-compare
Level playing field (independent datasets, same number of candidate d.f., careful bootstrapping)
Criteria:
1. calibration
2. discrimination
3. face validity
4. measurement errors in required predictors
5. use of continuous predictors (which are usually better defined than categorical ones)
6. omission of “insignificant” variables that nonetheless make sense as risk factors
7. simplicity (though this is less important with the availability of computers)
8. lack of fit for specific types of subjects
See [147, 150, 149, 148] for discussions and examples of low power for testing differences in ROC areas, and for other approaches.
4.11
strategy-improve
See also Section 5.6.
Greenland [84] discusses many important points:
Global Strategies
Spend them
4.12
4.12.1
strategy-summary
1. Assemble accurate, pertinent data and lots of it, with wide distributions for X.
2. Formulate good hypotheses — specify relevant candidate predictors
4.12.2
4.12.3
5.1
val-describe
5.1.1
Interpreting Effects
…interacting factors; see Figure 12.10.
5.1.2
Error Measures
Discrimination measures
* Y binary → Dxy = 2 × (C − ½)
* C = concordance probability = area under the receiver operating characteristic curve ∝ Wilcoxon–Mann–Whitney statistic
– Mostly discrimination: R²
* R²_adj — overfitting-corrected if model pre-specified
* g-index
Calibration measures
See Van Calster et al. [193] for a nice discussion of different levels
of calibration stringency and their relationship to likelihood of
errors in decision making.
g-Index
5.2
The Bootstrap
val-boot
If we know the population model, we can use simulation or analytic derivations to study the behavior of a statistical estimator
Steps:
1. Repeatedly simulate sample of size n from F
2. Compute statistic of interest
3. Study behavior over B repetitions
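A minimal sketch of these steps for the sample median (simulated data standing in for the observed sample):

set.seed(1)
x ← rexp(50, rate = 1 / 80)     # observed sample; true F unknown in practice
B ← 1000
meds ← replicate(B, median(sample(x, replace = TRUE)))
sd(meds)                        # bootstrap standard error of the median
quantile(meds, c(.025, .975))   # percentile confidence limits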
[Figure 5.1 plots Prob[X ≤ x] against x for the empirical and population CDFs.]
Figure 5.1: Empirical and population cumulative distribution function
Pretend that F ≡ Fn
M
M:    3  3.5  4  4.5  5   5.5  6   6.5  7   7.5  8   8.5  9
Freq: 7  6    2  1    30  45   59  72   70  45   48  8    5
hist(M, nclass = length(unique(M)), xlab = '', main = '')
[Histogram of the bootstrap medians M, and a plot of the mean and 0.1, 0.9 quantiles]
First 20 samples:
Need high B for quantiles, low for variance (but see [23])
5.3
Model Validation
val-how
5.3.1
Introduction
Internal
– apparent (evaluate fit on same data used to create fit)
– data splitting
– cross-validation
But when the model was fitted using machine learning, feature screening, variable selection, or model selection, the model developed using training data is usually only an example of a model, and the test-sample validation could be called an example validation
5.3.2
predictive ability
5.3.3
Data-Splitting
5.3.4
Example of cross-validation:
Drawbacks:
Randomization method
1. Randomly permute Y
2. Optimism = performance of fitted model compared to what is expected by chance
5.3.5
model β̂.
– final model has apparent R² (R²_app) = 0.4
– how inflated is R²_app?
γ̂0 = 0, γ̂1 = 1
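A sketch of the bootstrap optimism correction with rms::validate (simulated data; the index.corrected column gives the overfitting-corrected R²):

require(rms)
set.seed(1)
x ← replicate(10, rnorm(100))
y ← x[, 1] + rnorm(100)
f ← ols(y ∼ x, x = TRUE, y = TRUE)
validate(f, B = 200)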
See also Copas [47] and van Houwelingen and le Cessie [197,
p. 1318]
See [71] for warnings about the bootstrap, and [61] for variations
on the bootstrap to reduce bias.
Use bootstrap to choose between full and reduced models:
G
Other issues: I
See Blettner and Sauerbrei [21] and Chatfield [37] for more
interesting examples of problems resulting from data-driven
analyses.
5.4
val-ranks
Order of importance of predictors not pre-specified
n ← 300
set.seed(1)
d ← data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n),
               x5 = runif(n), x6 = runif(n), x7 = runif(n), x8 = runif(n),
               x9 = runif(n), x10 = runif(n), x11 = runif(n), x12 = runif(n))
d$y ← with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 + 7*x7 +
              8*x8 + 9*x9 + 10*x10 + 11*x11 + 12*x12 + 9*rnorm(n))
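One way to obtain the bootstrap rank intervals plotted in Figure 5.3 is sketched below, ranking predictors by partial χ² minus d.f. in each resample (plot(anova(fit), pl = FALSE) returns those statistics):

require(rms)
f ← ols(y ∼ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12,
        data = d)
rankvars ← function(fit) rank(plot(anova(fit), sort = 'none', pl = FALSE))
B ← 1000
ranks ← matrix(NA, nrow = B, ncol = 12)
for (i in 1 : B) {
  j ← sample(1 : n, n, replace = TRUE)
  ranks[i, ] ← rankvars(update(f, subset = j))
}
lim ← apply(ranks, 2, quantile, probs = c(.025, .975))   # 0.95 limits on ranks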
[Figure 5.3 shows interval estimates of ranks for predictors x1–x12.]
Figure 5.3: Bootstrap percentile 0.95 confidence limits for ranks of predictors in an OLS model. Ranking is on the basis of partial χ² minus d.f. Point estimates are original ranks.
5.5
val-approx
5.5.1
Collinearity
5.6
val-habits
Insist on validation of predictive models and discoveries
Show collaborators that split-sample validation is not appropriate unless the number of subjects is huge
– Split more than once and see volatile results
R Software
rrms
R allows interaction spline functions, wide variety of predic-
tor parameterizations, wide variety of models, unifying model
formula language, model validation by resampling.
R is comprehensive:
Superior graphics
6.1
6.2
User-Contributed Functions
Some R functions:
6.3
age: continuous — restricted cubic spline
cholesterol: continuous (3 missings; use median) — log(cholesterol + 10)
p ← Predict(fit, age = seq(20, 80, length = 100), treat, conf.int = FALSE)
plot(p)                  # Plot relationship between age and log odds,
                         # separate curve for each treat, no C.I.
                         # or ggplot(p), plotp(p)
plot(p, ∼ age | treat)   # Same but 2 panels
ggplot(p, groups = FALSE)
bplot(Predict(fit, age, cholesterol, np = 50))
# 3-dimensional perspective plot for age, cholesterol, and log odds
# using default ranges for both variables
plot(Predict(fit, num.diseases, fun = function(x) 1 / (1 + exp(-x)),
             conf.int = .9),
     ylab = 'Prob')   # Plot estimated probabilities instead of log odds
                      # (or use ggplot()); can also use plotp() for plotly
# Again, if no datadist were defined, would have to tell plot all limits
logit ← predict(fit, expand.grid(treat = 'b', num.dis = 1:3, age = c(20, 40, 60),
                                 cholesterol = seq(100, 300, length = 10)))
# Could also obtain list of predictor settings interactively
logit ← predict(fit, gendata(fit, nobs = 12))
ag ← 10:80
6.4
Other Functions
7.1
Notation
N subjects
Let the basis functions modeling the time effect be g1(t), g2(t), . . . , g_k(t)
7.2
7.2.1
k–order polynomial in t
7.2.2
Jim Rochon (Rho, Inc., Chapel Hill NC) has the following
comments about this:
For RCTs, I draw a sharp line at the point when the intervention begins. The LHS [left hand side of the model equation] is reserved for something that is a response to treatment. Anything before this point can potentially be included as a covariate in the regression model. This includes the “baseline” value of the outcome variable. Indeed, the best predictor of the outcome at the end of the study is typically where the patient began at the beginning. It drinks up a lot of variability in the outcome; and, the effect of other covariates is typically mediated through this variable.
I treat anything after the intervention begins as an outcome. In the western scientific method, an “effect” must
follow the “cause” even if by a split second.
Note that an RCT is different than a cohort study. In a cohort study, “Time 0” is not terribly meaningful. If we
want to model, say, the trend over time, it would be legitimate, in my view, to include the “baseline” value on the
LHS of that regression model.
Now, even if the intervention, e.g., surgery, has an immediate effect, I would still reserve the LHS for anything that might legitimately be considered as the response to the intervention. So, if we cleared a blocked artery and then measured the MABP, then that would still be included on the LHS.
Now, it could well be that most of the therapeutic effect occurred by the time that the first repeated measure was
taken, and then levels off. Then, a plot of the means would essentially be two parallel lines and the treatment
effect is the distance between the lines, i.e., the difference in the intercepts.
If the linear trend from baseline to Time 1 continues beyond Time 1, then the lines will have a common intercept but the slopes will diverge. Then, the treatment effect will be the difference in slopes.
One point to remember is that the estimated intercept is the value at time 0 that we predict from the set of
repeated measures post randomization. In the first case above, the model will predict different intercepts even
though randomization would suggest that they would start from the same place. This is because we were asleep
at the switch and didn’t record the “action” from baseline to time 1. In the second case, the model will predict the
same intercept values because the linear trend from baseline to time 1 was continued thereafter.
More importantly, there are considerable benefits to including it as a covariate on the RHS. The baseline value tends to be the best predictor of the outcome post-randomization, and this maneuver increases the precision of the estimated treatment effect. Additionally, any other prognostic factors correlated with the outcome variable will also be correlated with the baseline value of that outcome, and this has two important consequences. First, this greatly reduces the need to enter a large number of prognostic factors as covariates in the linear models. Their effect is already mediated through the baseline value of the outcome variable. Secondly, any imbalances across the treatment arms in important prognostic factors will induce an imbalance across the treatment arms in the baseline value of the outcome. Including the baseline value thereby reduces the need to enter these variables as covariates in the linear models.
a In addition to this, one of the paper’s conclusions that analysis of covariance is not appropriate if the population means of the baseline
variable are not identical in the treatment groups is not correct [171]. See [107] for a rebuke of [129].
7.3
Disadvantages:
– Induced correlation structure for Y may be unrealistic
– Numerically demanding
Pinheiro and Bates (Section 5.1.2) state that “in some appli-
cations, one may wish to avoid incorporating random effects
in the model to account for dependence among observations,
choosing to use the within-group component Λi to directly
model variance-covariance structure of the response.”
What Methods To Use for Repeated Measurements / Serial Data?ᵃᵇ
(Columns: Repeated Measures ANOVA, GEE, Mixed Effects Model, GLS, Markov, LOCF, Summary Statisticᶜ)
Assumes normality × × ×
Assumes independence of ×d ×e
measurements within subject
Assumes a correlation structuref × ×g × × ×
Requires same measurement × ?
times for all subjects
Does not allow smooth modeling ×
of time to save d.f.
Does not allow adjustment for ×
baseline covariates
Does not easily extend to × ×
non-continuous Y
Loses information by not using ×h ×
intermediate measurements
Does not allow widely varying # × ×i × ×j
of observations per subject
Does not allow for subjects × × × × ×
to have distinct trajectoriesk
Assumes subject-specific effects ×
are Gaussian
Badly biased if non-random ? × ×
dropouts
Biased in general ×
l
Harder to get tests & CLs × ×m
Requires large # subjects/clusters ×
SEs are wrong ×n ×
Assumptions are not verifiable × N/A × × ×
in small samples
Does not extend to complex × × × × ?
settings such as time-dependent
covariates and dynamicᵒ models
a Thanks to Charles Berry, Brian Cade, Peter Flom, Bert Gunter, and Leena Choi for valuable input.
b GEE: generalized estimating equations; GLS: generalized least squares; LOCF: last observation carried forward.
ᶜ E.g., compute within-subject slope, mean, or area under the curve over time. Assumes that the summary measure is an adequate summary…
ˡ …integration to marginalize random effects, using complex approximations, and if using SAS, unintuitive d.f. for the various tests
m Because there is no correct formula for SE of effects; ordinary SEs are not penalized for imputation and are too small
n If correction not applied
o E.g., a model with a predictor that is a lagged value of the response variable
7.4
Maximum likelihood
7.5
Assume that the correlation coefficient for Y_{it1} vs. Y_{it2} conditional on baseline covariates Xi for subject i is h(|t1 − t2|, ρ), where ρ is a vector (usually a scalar) set of fundamental correlation parameters
Some commonly used structures when times are continuous and are not equally spaced [154, Section 5.3.3] (nlme correlation function names in parentheses):
7.6
7.7
R Software
For this version the rms package has Gls so that many fea-
tures of rms can be used:
anova : all partial Wald tests, test of linearity, pooled tests
summary : effect estimates (differences in Ŷ ) and confidence
limits, can be plotted
plot, ggplot, plotp : continuous effect plots
nomogram : nomogram
Function : generate R function code for fitted model
latex : LATEX representation of fitted model
In addition, Gls has a bootstrap option (hence you do not
use rms’s bootcov for Gls fits).
To get regular gls functions named anova (for likelihood ra-
tio tests, AIC, etc.) or summary use anova.gls or summary.gls
7.8
Case Study
Consider the dataset in Table 6.9 of Davis [54, pp. 161–163] from a multicenter, randomized controlled trial of botulinum toxin type B (BotB) in patients with cervical dystonia from nine U.S. sites.
7.8.1
[Figure 7.1 shows TWSTRS-total score vs. week (0–16), one panel per study site (1–9), separate curves for individual subjects, rows for 10000U, 5000U, and Placebo.]
Figure 7.1: Time profiles for individual subjects, stratified by study site and dose
# Show quartiles
require ( data.table )
[Plots of TWSTRS-total score vs. week by dose (10000U, 5000U, Placebo): quartiles, then means with bootstrap confidence limits]
Figure 7.3: Mean responses and nonparametric bootstrap 0.95 confidence limits for population means, stratified by dose
dd ← datadist(both)
options(datadist = 'dd')
7.8.2
are tied for the best. For the remainder of the analysis use corCAR1, using Gls.
a ← Gls(twstrs ∼ treat * rcs(week, 3) + rcs(twstrs0, 3) +
          rcs(age, 4) * sex, data = both,
        correlation = corCAR1(form = ∼ week | uid))
[Plot: semivariogram of GLS residuals vs. distance between measurement times]
[Figure 7.5 panels: residuals vs. fitted values; residuals vs. twstrs0; residuals vs. week; QQ plot]
Figure 7.5: Three residual plots to check for absence of trends in central tendency and in variability. Upper right panel shows the baseline score on the x-axis. Bottom left panel shows the mean ± 2×SD. Bottom right panel is the QQ plot for checking normality of residuals from the GLS fit.
Factor          χ²     P
sex             5.7    0.2252
age × sex       4.9    0.1826
age             9.7    0.1388
treat × week    14.9   0.0048
treat           22.1   0.0012
week            77.3   0.0000
twstrs0         233.8  0.0000
[Dot chart of χ² − d.f. for each factor]
Figure 7.6: Results of anova.rms from generalized least squares fit with continuous time AR1 correlation structure
[Figure 7.7 panels: X̂β vs. week; X̂β vs. baseline TWSTRS-total score; X̂β vs. age in years, separate curves for sex F and M]
Figure 7.7: Estimated effects of time, baseline TWSTRS, age, and sex
[Plots: contrasts vs. week]
Figure 7.8: Contrasts and 0.95 confidence limits from GLS fit
…d.f. tests) are often of major interest. Such contrasts are informed by all the measurements made by all subjects (up until dropout times) when a smooth time trend is assumed.
n ← nomogram(a, age = c(seq(20, 80, by = 10), 85))
plot(n, cex.axis = .55, cex.var = .8, lmgp = .25)   # Figure 7.9
CHAPTER 7. MODELING LONGITUDINAL RESPONSES USING GENERALIZED LEAST SQUARES 7-29
[Nomogram axes: Points; TWSTRS-total score; age (sex=F); age (sex=M); treat at weeks 2, 4, 8, 12, 16; Total Points; Linear Predictor]
Figure 7.9: Nomogram from GLS fit. Second axis is the baseline score.
7.8.3
stanSet()   # in ~/.Rprofile - sets options(mc.cores=)
Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved
n_eff Rhat
y >=7 1760 1.003
y >=9 1569 1.004
y >=10 1182 1.007
y >=11 1005 1.009
a ← anova ( bpo )
a
Relative explained variation for twstrs. Approximate total model Wald χ² used in denominators of REV: 252.8 [208.1, 328.8].
[Dot chart: REV with 0.95 intervals for twstrs0, week, treat, treat × week, age, sex, age × sex]
plot ( k )
[Plots: contrasts vs. week; posterior density panels]
f ← function(x) {
  hpd ← HPDint(x, prob = 0.95)   # HPDint is in rmsb
[Plot: High Dose − Placebo contrasts vs. week]
7.8.4
Markov model is more likely to fit the data than the random effects model, which induces a compound symmetry correlation structure
# When adding partial PO terms for week and ptwstrs, z = -1.8, 5.04
stanDx ( bmark )
Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved
n_eff Rhat
y >=7 3933 1.000
y >=9 5783 1.000
y >=10 5257 1.000
y >=11 5262 1.000
y >=13 4668 0.999
y >=14 4432 0.999
y >=15 4721 0.999
y >=16 4555 1.000
y >=17 4002 0.999
y >=18 3652 0.999
y >=19 3663 0.999
y >=20 3400 0.999
y >=21 3521 0.999
y >=22 3777 0.999
y >=23 3692 0.999
stanDxplot ( bmark )
[Trace plots: posterior samples vs. post burn-in iteration (0–1000) for all model parameters — age, sex, treat, week, ptwstrs main effects, spline terms, and interactions]
a ← anova ( bpo )
a
Relative explained variation for twstrs. Approximate total model Wald χ² used in denominators of REV: 252.8 [208, 325.2].
[Dot chart: REV with 0.95 intervals for twstrs0, week, treat, treat × week, age, sex, age × sex]
stanDx ( bmarkre )
Iterations : 2000 on each of 4 chains , with 4000 posterior distribution samples saved
n_eff Rhat
y >=7 2429 1.000
y >=9 3542 1.000
y >=10 2851 1.000
y >=11 2709 1.000
y >=13 2617 1.000
y >=14 2451 1.000
y >=15 2421 1.001
y >=16 2377 1.001
y >=17 2105 1.001
y >=18 1998 1.001
y >=19 1822 1.001
y >=20 1757 1.001
y >=21 1688 1.001
y >=22 1676 1.001
y >=23 1675 1.001
y >=24 1689 1.001
y >=25 1694 1.001
y >=26 1659 1.000
y >=27 1629 1.000
y >=28 1571 1.000
y >=29 1580 1.000
y >=30 1612 1.001
y >=31 1619 1.001
y >=32 1669 1.000
y >=33 1836 1.001
y >=34 1938 1.001
y >=35 2038 1.000
y >=36 2245 1.000
y >=37 2409 1.000
y >=38 2477 1.000
y >=39 2517 1.000
y >=40 2639 1.000
y >=41 2831 1.000
y >=42 2943 1.000
y >=43 3300 1.000
y >=44 3718 1.000
y >=45 3832 1.000
y >=46 3835 1.000
y >=47 3884 1.000
y >=48 3960 1.000
y >=49 3844 1.000
y >=50 3728 1.000
y >=51 3460 1.000
y >=52 3286 1.000
y >=53 2899 1.000
bmarkre
[Partial effect plots: log odds vs. age (by sex), previous TWSTRS-total score, and week (by treatment: 10000U, 5000U, Placebo); adjusted to ptwstrs=42, age=56, sex=F]
k ← contrast(bmark, list(week = wks, treat = '10000U'),
             list(week = wks, treat = 'Placebo'),
             cnames = paste('Week', wks))
k
plot(k)
k ← as.data.frame(k[c('week', 'Contrast', 'Lower', 'Upper')])
ggplot(k, aes(x = week, y = Contrast)) + geom_point() +
  geom_line() + ylab('High Dose - Placebo') +
  geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0)
[Plot panels at Weeks 2, 4, 8 and curves vs. week, by treatment (10000U, 5000U, Placebo); adjusted to age=56 sex=F]
ggplot(Predict(bmark, week, treat, ptwstrs = 40, fun = Mean(bmark)))
[Plot: posterior mean TWSTRS vs. week by treatment (10000U, 5000U, Placebo); adjusted to age=56 sex=F]
d$week ← 12
p ← predict(bmark, d, type = 'fitted.ind')   # defaults to posterior means
yvals ← as.numeric(sub('twstrs=', '', p$y))
lo ← p$Lower / sum(p$Lower)
hi ← p$Upper / sum(p$Upper)
plot(yvals, p$Mean, type = 'l', xlab = 'TWSTRS', ylab = '',
     ylim = range(c(lo, hi)))
lines(yvals, lo, col = gray(0.8))
lines(yvals, hi, col = gray(0.8))
[Plots A and B: estimated distribution of TWSTRS (10–70) at week 12, with uncertainty bands]
Compute all cell probabilities and use the law of total probability recursively:
Pr(Y_t = y | X) = Σ_{j=1}^{k} Pr(Y_t = y | X, Y_{t−1} = j) Pr(Y_{t−1} = j | X)
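A sketch of this recursion with a hypothetical 3-state transition matrix (not the fitted model; P[j, y] stands for Pr(Y_t = y | Y_{t-1} = j)):

P ← matrix(c(0.7, 0.2, 0.1,
             0.3, 0.5, 0.2,
             0.0, 0.1, 0.9), nrow = 3, byrow = TRUE)
occ ← c(1, 0, 0)                   # everyone starts in state 1
for (t in 1 : 4) occ ← occ %*% P   # law of total probability, applied recursively
occ                                # state occupancy probabilities at t = 4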
[1] 4000 5 62
[Plot C: posterior mean TWSTRS vs. week by treat (10000U, Placebo)]
[Plot D: contrast vs. week]
[Plot: estimated distribution of TWSTRS (10–70) at week 12]
Chapter 8
Y = 0, 1
g(u) = 1/(1 + e^{−u})
8.1
Model
[Plot of P vs. X]
Figure 8.1: Logistic function
O = P/(1 − P)
P = O/(1 + O)
Xβ = log[P/(1 − P)]
e^{Xβ} = O
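These identities in R (plogis and qlogis are the logistic CDF and its inverse):

p ← 0.8
o ← p / (1 - p)    # odds O = 4
qlogis(p)          # X beta = log[P/(1 - P)] = log(o)
plogis(log(o))     # back to P = O/(1 + O) = 0.8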
8.1.1
8.1.2
[Figure 8.2 plots increase in risk vs. risk for the subject without the risk factor, one curve per odds ratio (1.1, 1.25, 1.5, 1.75, 2, 3, 4, 5, 10).]
Figure 8.2: Absolute benefit as a function of risk of the event in a control subject and the relative effect (odds ratio) of the risk factor. The odds ratios are given for each curve.
increase in risk = 1/{1 + [(1 − R̂)/R̂] exp(−β̂1)} − R̂,
where R̂ = Prob[Y = 1 | X1 = 0, A].
Risk ratio is (1 + e^{−X2β}) / (1 + e^{−X1β})
Does not simplify like the odds ratio, which is e^{X1β}/e^{X2β} = e^{(X1−X2)β}
8.1.3
Detailed Example
Females Age: 37 39 39 42 47 48 48 52 53 55 56 57 58 58 60 64 65 68 68 70
Response: 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 1
Males Age: 34 38 40 40 41 43 43 43 44 46 47 48 48 50 50 52 55 60 61 61
Response: 1 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1
require(rms)
getHdata(sex.age.response)
d ← sex.age.response
dd ← datadist(d); options(datadist = 'dd')
f ← lrm(response ∼ sex + age, data = d)
fasr ← f   # Save for later
w ← function(...)
  with(d, {
    m ← sex == 'male'
    f ← sex == 'female'
    lpoints(age[f], response[f], pch = 1)
    lpoints(age[m], response[m], pch = 2)
    af ← cut2(age, c(45, 55), levels.mean = TRUE)
    prop ← tapply(response, list(af, sex), mean, na.rm = TRUE)
    agem ← as.numeric(row.names(prop))
    lpoints(agem, prop[, 'female'], pch = 4, cex = 1.3, col = 'green')
    lpoints(agem, prop[, 'male'],   pch = 5, cex = 1.3, col = 'green')
    x ← rep(62, 4); y ← seq(.25, .1, length = 4)
    lpoints(x, y, pch = c(1, 2, 4, 5),
            col = rep(c('blue', 'green'), each = 2))
    ltext(x + 5, y,
          c('F Observed', 'M Observed',
            'F Proportion', 'M Proportion'), cex = .8)
  })   # Figure 8.3
plot(Predict(f, age = seq(34, 70, length = 200), sex, fun = plogis),
     ylab = 'Pr[response]', ylim = c(-.02, 1.02), addpanel = w)
# Hmisc::latex prevents quantreg::latex from conflicting
ltx ← function(fit) Hmisc::latex(fit, inline = TRUE, columns = 54,
                                 file = '', after = '$.', digits = 3,
                                 size = 'Ssize', before = '$X\\hat{\\beta}=')
ltx(f)
sex × response (frequency, row %):
sex    response=0   response=1   Total   Odds / Log odds
F      14 (70.00%)  6 (30.00%)   20      6/14 = .429, −.847
M      6 (30.00%)   14 (70.00%)  20      14/6 = 2.33, .847
Total  20           20           40
[Figure 8.3 plots Pr[response] vs. age, curves for female and male, with observed responses and subgroup proportions marked.]
Figure 8.3: Data, subgroup proportions, and fitted logistic model, with 0.95 pointwise confidence bands
age    response=0   response=1   Total   Odds / Log odds
<45    8 (61.5%)    5 (38.4%)    13      5/8 = .625, −.47
45–54  6 (50.0%)    6 (50.0%)    12      6/6 = 1, 0
55+    6 (40.0%)    9 (60.0%)    15      9/6 = 1.5, .405
Total  20           20           40
and since the respective mean ages in the 55+ and <45 age groups are 61.1 and 40.2, an estimate of the log odds ratio increase per year is .875/(61.1 − 40.2) = .875/20.9 = .042.
The likelihood ratio test for H0: no association between age and response is obtained as follows:
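A sketch of how such a test can be computed with rms::lrtest, continuing the fit above (here age is tested adjusting for sex; the original notes may have computed it differently):

f1 ← lrm(response ∼ sex, data = d)
f2 ← lrm(response ∼ sex + age, data = d)
lrtest(f1, f2)   # 1 d.f. LR chi-square for age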
sex = F
age    response=0   response=1   Total
<45    4 (100.0%)   0 (0.0%)     4
45–54  4 (80.0%)    1 (20.0%)    5
55+    6 (54.6%)    5 (45.4%)    11
Total  14           6            20
sex = M
age    response=0   response=1   Total
<45    4 (44.4%)    5 (55.6%)    9
45–54  2 (28.6%)    5 (71.4%)    7
55+    0 (0.0%)     4 (100.0%)   4
Total  6            14           20
Design Formulations
8.2
Estimation
8.2.1
8.2.2
P̂i = [1 + exp(−Xiβ̂)]−1.
{1 + exp[−(Xiβ̂ ± zs)]}−1.
8.2.3
Worst case: θ = 1/2
Requires n = 96 observationsᵃ
ᵃ The general formula for the sample size required to achieve a margin of error of δ in estimating a true probability of θ at the 0.95 confidence level is n = (1.96/δ)² × θ(1 − θ); set θ = 1/2 for the worst case.
i ← 0
for (s in sigmas) {
  i ← i + 1
  j ← 0
  for (n in ns) {
    j ← j + 1
    n1 ← maxe ← 0
    for (k in 1 : nsim) {
      x ← rnorm(n, 0, s)
      P ← plogis(x)
      y ← ifelse(runif(n) ≤ P, 1, 0)
      n1 ← n1 + sum(y)
      beta ← lrm.fit(x, y)$coefficients
      phat ← plogis(beta[1] + beta[2] * xs)
      maxe ← maxe + max(abs(phat - pactual))
    }
    n1 ← n1 / nsim
    maxe ← maxe / nsim
    maxerr[i, j] ← maxe
ᵇ An average absolute error of 0.05 corresponds roughly to a 0.95 confidence interval margin of error of 0.1.
    N1[i, j] ← n1
  }
}
xrange ← range(xs)
simerr ← llist(N1, maxerr, sigmas, ns, nsim, xrange)
[Figure 8.4 plots average maximum |P̂ − P| vs. n, one curve per σ (0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4).]
Figure 8.4: Simulated expected maximum error in estimating probabilities for x ∈ [−1.5, 1.5] with a single normally distributed X with mean zero
8.3
Test Statistics
8.4
Residuals
8.5
[Figure 8.5 plots logit{Y=1} against X2, with separate curves for X1 = 0 and X1 = 1.]
Figure 8.5: Logistic regression assumptions for one binary and one continuous predictor
getHdata(acath)
acath$sex ← factor(acath$sex, 0:1, c('male', 'female'))
dd ← datadist(acath);  options(datadist='dd')
f ← lrm(sigdz ∼ rcs(age, 4) * sex, data=acath)
w ← function(...)
  with(acath, {
    plsmo(age, sigdz, group=sex, fun=qlogis, lty='dotted',
          add=TRUE, grid=TRUE)
    af ← cut2(age, g=10, levels.mean=TRUE)
    prop ← qlogis(tapply(sigdz, list(af, sex), mean, na.rm=TRUE))
    agem ← as.numeric(row.names(prop))
    lpoints(agem, prop[, 'female'], pch=4, col='green')
    lpoints(agem, prop[, 'male'],   pch=2, col='green')
  })   # Figure 8.6
plot(Predict(f, age, sex), ylim=c(-2, 4), addpanel=w,
     label.curve=list(offset=unit(0.5, 'cm')))
T

[Figure: log odds (−1 to 2) vs. Age, Year (30–80), curves for male and female.]
Figure 8.6: Logit proportions of significant coronary artery disease by sex and deciles of age for n=3504 patients, with spline fits
(smooth curves). Spline fits are for k = 4 knots at age= 36, 48, 56, and 68 years, and interaction between age and sex is allowed.
Shaded bands are pointwise 0.95 confidence limits for predicted log odds. Smooth nonparametric estimates are shown as dotted
curves. Data courtesy of the Duke Cardiovascular Disease Databank.
Ô = P̂ / (1 − P̂)
1 or 2 Xs → nonparametric smoother
lrtest(a, b)

  L.R. Chisq       d.f.          P
   2.1964146  1.0000000  0.1383322

lrtest(a, c)

  L.R. Chisq       d.f.          P
   3.4502500  2.0000000  0.1781508

lrtest(a, d)

   L.R. Chisq        d.f.           P
 16.547036344 5.000000000 0.005444012

lrtest(b, d)

   L.R. Chisq        d.f.           P
 14.350621767 4.000000000 0.006256138

lrtest(c, d)

   L.R. Chisq        d.f.           P
 13.096786352 3.000000000 0.004431906
k Model χ2 AIC
0 99.23 97.23
3 112.69 108.69
4 121.30 115.30
5 123.51 115.51
6 124.41 114.41
dz ← subset(acath, sigdz == 1)
dd ← datadist(dz)
[Figure: log odds (−1 to 2) vs. duration of symptoms.]
Figure 8.7: Estimated relationship between duration of symptoms and the log odds of severe coronary artery disease for k = 5.
Knots are marked with arrows. Solid line is spline fit; dotted line is a nonparametric loess estimate.
f ← lrm(tvdlm ∼ log10(cad.dur + 1), data=dz)
w ← function(...)
  with(dz, {
    x ← cut2(cad.dur, m=150, levels.mean=TRUE)
    prop ← tapply(tvdlm, x, mean, na.rm=TRUE)
[Figure: fitted probability P (0.3–0.7) vs. duration, with subgroup proportions overlaid.]
Figure 8.8: Fitted linear logistic model in log10 (duration+1), with subgroup estimates using groups of 150 patients. Fitted equation
is logit(tvdlm) = −.9809 + .7122 log10 (months + 1).
ltx(f)
χ2 d.f. P
age.tertile (Factor+Higher Order Factors) 120.74 10 <0.0001
All Interactions 21.87 8 0.0052
sex (Factor+Higher Order Factors) 329.54 3 <0.0001
All Interactions 9.78 2 0.0075
cholesterol (Factor+Higher Order Factors) 93.75 9 <0.0001
All Interactions 10.03 6 0.1235
Nonlinear (Factor+Higher Order Factors) 9.96 6 0.1263
age.tertile × sex (Factor+Higher Order Factors) 9.78 2 0.0075
age.tertile × cholesterol (Factor+Higher Order Factors) 10.03 6 0.1235
Nonlinear 2.62 4 0.6237
Nonlinear Interaction : f(A,B) vs. AB 2.62 4 0.6237
TOTAL NONLINEAR 9.96 6 0.1263
TOTAL INTERACTION 21.87 8 0.0052
TOTAL NONLINEAR + INTERACTION 29.67 10 0.0010
TOTAL 410.75 14 <0.0001
yl ← c(-1, 5)
plot(Predict(f, cholesterol, age.tertile),
     adj.subtitle=FALSE, ylim=yl)   # Figure 8.9
[Figure: log odds (0–4) vs. Cholesterol, mg %, one curve per age tertile: [17,49), [49,58), [58,82].]
Figure 8.9: Log odds of significant coronary artery disease modeling age with two dummy variables
[Figure: loess wireframe of log odds (−2 to 4) vs. age (30–70) and cholesterol (100–400).]
Figure 8.10: Local regression fit for the logit of the probability of significant coronary disease vs. age and cholesterol for males,
based on the loess function.
χ2 d.f. P
age (Factor+Higher Order Factors) 164.17 24 <0.0001
All Interactions 42.28 20 0.0025
Nonlinear (Factor+Higher Order Factors) 25.21 18 0.1192
sex (Factor+Higher Order Factors) 343.80 5 <0.0001
All Interactions 23.90 4 <0.0001
cholesterol (Factor+Higher Order Factors) 100.13 20 <0.0001
All Interactions 16.27 16 0.4341
Nonlinear (Factor+Higher Order Factors) 16.35 15 0.3595
age × sex (Factor+Higher Order Factors) 23.90 4 <0.0001
Nonlinear 12.97 3 0.0047
Nonlinear Interaction : f(A,B) vs. AB 12.97 3 0.0047
age × cholesterol (Factor+Higher Order Factors) 16.27 16 0.4341
Nonlinear 11.45 15 0.7204
Nonlinear Interaction : f(A,B) vs. AB 11.45 15 0.7204
f(A,B) vs. Af(B) + Bg(A) 9.38 9 0.4033
Nonlinear Interaction in age vs. Af(B) 9.99 12 0.6167
Nonlinear Interaction in cholesterol vs. Bg(A) 10.75 12 0.5503
TOTAL NONLINEAR 33.22 24 0.0995
TOTAL INTERACTION 42.28 20 0.0025
TOTAL NONLINEAR + INTERACTION 49.03 26 0.0041
TOTAL 449.26 29 <0.0001
perim ← with(acath, perimeter(cholesterol, age, xinc=20, n=5))
zl ← c(-2, 4)   # Figure 8.11
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
[Figure: wireframe of log odds (−2 to 2) vs. Age, Year (30–70) and Cholesterol, mg % (100–400).]
Figure 8.11: Linear spline surface for males, with knots for age at 46, 52, 59 and knots for cholesterol at 196, 224, and 259 (quartiles).
χ2 d.f. P
age (Factor+Higher Order Factors) 165.23 15 <0.0001
All Interactions 37.32 12 0.0002
Nonlinear (Factor+Higher Order Factors) 21.01 10 0.0210
sex (Factor+Higher Order Factors) 343.67 4 <0.0001
All Interactions 23.31 3 <0.0001
cholesterol (Factor+Higher Order Factors) 97.50 12 <0.0001
All Interactions 12.95 9 0.1649
Nonlinear (Factor+Higher Order Factors) 13.62 8 0.0923
age × sex (Factor+Higher Order Factors) 23.31 3 <0.0001
Nonlinear 13.37 2 0.0013
Nonlinear Interaction : f(A,B) vs. AB 13.37 2 0.0013
age × cholesterol (Factor+Higher Order Factors) 12.95 9 0.1649
Nonlinear 7.27 8 0.5078
Nonlinear Interaction : f(A,B) vs. AB 7.27 8 0.5078
f(A,B) vs. Af(B) + Bg(A) 5.41 4 0.2480
Nonlinear Interaction in age vs. Af(B) 6.44 6 0.3753
Nonlinear Interaction in cholesterol vs. Bg(A) 6.27 6 0.3931
TOTAL NONLINEAR 29.22 14 0.0097
TOTAL INTERACTION 37.32 12 0.0002
TOTAL NONLINEAR + INTERACTION 45.41 16 0.0001
TOTAL 450.88 19 <0.0001
# Figure 8.12:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
[Figure: wireframe of log odds (−2 to 2) vs. Age, Year (30–70) and Cholesterol, mg % (100–400).]
Figure 8.12: Restricted cubic spline surface in two variables, each with k = 4 knots
χ2 d.f. P
sex (Factor+Higher Order Factors) 343.42 4 <0.0001
All Interactions 24.05 3 <0.0001
age (Factor+Higher Order Factors) 169.35 11 <0.0001
All Interactions 34.80 8 <0.0001
Nonlinear (Factor+Higher Order Factors) 16.55 6 0.0111
cholesterol (Factor+Higher Order Factors) 93.62 8 <0.0001
All Interactions 10.83 5 0.0548
Nonlinear (Factor+Higher Order Factors) 10.87 4 0.0281
age × cholesterol (Factor+Higher Order Factors) 10.83 5 0.0548
Nonlinear 3.12 4 0.5372
Nonlinear Interaction : f(A,B) vs. AB 3.12 4 0.5372
Nonlinear Interaction in age vs. Af(B) 1.60 2 0.4496
Nonlinear Interaction in cholesterol vs. Bg(A) 1.64 2 0.4400
sex × age (Factor+Higher Order Factors) 24.05 3 <0.0001
Nonlinear 13.58 2 0.0011
Nonlinear Interaction : f(A,B) vs. AB 13.58 2 0.0011
TOTAL NONLINEAR 27.89 10 0.0019
TOTAL INTERACTION 34.80 8 <0.0001
TOTAL NONLINEAR + INTERACTION 45.45 12 <0.0001
TOTAL 453.10 15 <0.0001
# Figure 8.13:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
ltx(f)
[Figure: wireframe of log odds (−2 to 2) vs. Age, Year (30–70) and Cholesterol, mg % (100–400).]
Figure 8.13: Restricted cubic spline fit with age × spline(cholesterol) and cholesterol × spline(age)
χ2 d.f. P
age (Factor+Higher Order Factors) 167.83 7 <0.0001
All Interactions 31.03 4 <0.0001
Nonlinear (Factor+Higher Order Factors) 14.58 4 0.0057
sex (Factor+Higher Order Factors) 345.88 4 <0.0001
All Interactions 22.30 3 <0.0001
cholesterol (Factor+Higher Order Factors) 89.37 4 <0.0001
All Interactions 7.99 1 0.0047
Nonlinear 10.65 2 0.0049
age × cholesterol (Factor+Higher Order Factors) 7.99 1 0.0047
age × sex (Factor+Higher Order Factors) 22.30 3 <0.0001
Nonlinear 12.06 2 0.0024
Nonlinear Interaction : f(A,B) vs. AB 12.06 2 0.0024
TOTAL NONLINEAR 25.72 6 0.0003
TOTAL INTERACTION 31.03 4 <0.0001
TOTAL NONLINEAR + INTERACTION 43.59 8 <0.0001
TOTAL 452.75 11 <0.0001
# Figure 8.14:
bplot(Predict(f, cholesterol, age, np=40), perim=perim,
      lfun=wireframe, zlim=zl, adj.subtitle=FALSE)
f.linia ← f   # save linear interaction fit for later
ltx(f)
[Figure: wireframe of log odds (−2 to 2) vs. Age, Year (30–70) and Cholesterol, mg % (100–400).]
Figure 8.14: Spline fit with nonlinear effects of cholesterol and age and a simple product interaction
[Figure: log odds (0–3) vs. Cholesterol, mg %, one curve per age tertile, labeled with mean ages 41.74, 53.06, and 63.73.]
Figure 8.15: Predictions from linear interaction model with mean age in tertiles indicated.
[Figure: partial residuals ri (−3 to 3) vs. cad.dur (0–250) and vs. log.cad.dur (0–2).]
Figure 8.16: Partial residuals for duration and log10 (duration+1). Data density shown at top of each plot.
A new omnibus test based on SSE has more power and requires no grouping; it still does not point to corrective action.
8.6 Collinearity

8.7

8.8

case:

R²N = [1 − exp(−LR/n)] / [1 − exp(−L⁰/n)],

where LR is the model likelihood ratio χ² and L⁰ is −2 × the log likelihood of the null (intercept-only) model.
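A minimal sketch of computing R²N directly from an lrm fit f (lrm already reports this value as R2 among its statistics):

L0 ← f$deviance[1]                   # -2 log likelihood, intercept-only model
LR ← f$deviance[1] - f$deviance[2]   # likelihood ratio chi-square
n  ← f$stats['Obs']
(1 - exp(-LR/n)) / (1 - exp(-L0/n))  # Nagelkerke R2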
8.9
– Discrimination: C or Dxy
– R2 or B
latex(v1,
      caption='Bootstrap Validation, 2 Predictors Without Stepdown',
      insert.bottom='\\label{pg:lrm-sex-age-response-boot}',
      digits=2, size='Ssize', file='')
latex(v2,
      caption='Bootstrap Validation, 2 Predictors with Stepdown',
      digits=2, B=15, file='', size='Ssize')

k ← attr(v3, 'kept')
# Compute number of x1-x5 selected
nx ← apply(k[, 3:7], 1, sum)
# Get selections of age and sex
v ← colnames(k)
as ← apply(k[, 1:2], 1,
           function(x) paste(v[1:2][x], collapse=', '))
table(paste(as, ' ', nx, 'Xs'))

latex(v3,
      caption='Bootstrap Validation with 5 Noise Variables and Stepdown',
      digits=2, B=15, size='Ssize', file='')
[Matrix of • symbols showing which variables were selected in each of the 15 displayed bootstrap resamples.]
latex(v4,
      caption='Bootstrap Validation with 5 Noise Variables and Stepdown,
               Forced Inclusion of age and sex',
      digits=2, B=15, size='Ssize')
Bootstrap Validation with 5 Noise Variables and Stepdown, Forced Inclusion of age and sex
Index   Original Sample   Training Sample   Test Sample   Optimism   Corrected Index   n
Dxy 0.70 0.76 0.66 0.10 0.60 130
R2 0.45 0.54 0.41 0.12 0.33 130
Intercept 0.00 0.00 0.06 −0.06 0.06 130
Slope 1.00 1.00 0.76 0.24 0.76 130
Emax 0.00 0.00 0.07 0.07 0.07 130
D 0.39 0.50 0.35 0.15 0.24 130
U −0.05 −0.05 0.08 −0.13 0.08 130
Q 0.44 0.55 0.27 0.28 0.16 130
B 0.16 0.14 0.18 −0.04 0.21 130
g 2.10 2.75 1.89 0.86 1.25 130
gp 0.35 0.37 0.33 0.04 0.31 130
8.10

[Figure: odds ratio chart (scale 0.10–4.00) with confidence bars for age 59:46, cholesterol 259:196, and sex female:male.]
Figure 8.17: Odds ratios and confidence bars, using quartiles of age and cholesterol for assessing their effects on the odds of coronary
disease.
[Figure: Probability ABM vs AVM (0.00–1.00) against Age in Years (0–60); linear spline fit with points at simple proportions.]
Figure 8.18: Linear spline fit for probability of bacterial vs. viral meningitis as a function of age at onset [176]. Points are simple
proportions by age quantile groups.
Figure 8.19: (A) Relationship between myocardium at risk and ventricular fibrillation, based on the individual best fit equations
for animals anesthetized with pentobarbital and α-chloralose. The amount of myocardium at risk at which 0.5 of the animals are
expected to fibrillate (MAR50 ) is shown for each anesthetic group. (B) Relationship between myocardium at risk and ventricular
fibrillation, based on equations derived from the single slope estimate. Note that the MAR50 describes the overall relationship
between myocardium at risk and outcome when either the individual best fit slope or the single slope estimate is used. The shift of
the curve to the right during α-chloralose anesthesia is well described by the shift in MAR50 . Test for interaction had P=0.10 [211].
Reprinted by permission, NRC Research Press.
Figure 8.20: A nomogram for estimating the likelihood of significant coronary artery disease (CAD) in women. ECG = electrocar-
diographic; MI = myocardial infarction [156]. Reprinted from American Journal of Medicine, Vol 75, Pryor DB et al., “Estimating
the likelihood of significant coronary artery disease”, p. 778, Copyright 1983, with permission from Excerpta Medica, Inc.
[Figure: nomogram with reading lines A and B relating age, month of presentation, glucose ratio, and total PMN count to the probability of ABM vs. AVM.]
Figure 8.21: Nomogram for estimating probability of bacterial (ABM) vs. viral (AVM) meningitis. Step 1, place ruler on reading
lines for patient’s age and month of presentation and mark intersection with line A; step 2, place ruler on values for glucose ratio
and total polymorphonuclear leukocyte (PMN) count in cerebrospinal fluid and mark intersection with line B; step 3, use ruler to join
marks on lines A and B, then read off the probability of ABM vs. AVM [176].
[Figure: nomogram with a Points scale (0–100); one cholesterol axis (150–400) for each combination of sex ∈ {male, female} and age ∈ {30, 40, 50, 60, 70}; Total Points (0–100); Linear Predictor (−2 to 3.5); Probability of CAD (0.2–0.95).]
Figure 8.22: Nomogram relating age, sex, and cholesterol to the log odds and to the probability of significant coronary artery disease.
Select one axis corresponding to sex and to age ∈ {30, 40, 50, 60, 70}. There was linear interaction between age and sex and between
age and cholesterol. 0.70 and 0.90 confidence intervals are shown (0.90 in gray). Note that for the “Linear Predictor” scale there
are various lengths of confidence intervals near the same value of X β̂, demonstrating that the standard error of X β̂ depends on the
individual X values. Also note that confidence intervals corresponding to smaller patient groups (e.g., females) are wider.
8.11
with SD=100 for the two slopes. See below that the posterior
modes from this fit are in close agreement with the maximum
likelihood estimates from the frequentist model fit.
dd ← datadist(sex.age.response)
options(datadist='dd')
require(rmsb)
# Frequentist model
flrm ← lrm(response ∼ sex + age, data=sex.age.response)
# Bayesian model
# Distribute chains across all available cpu cores:
options(mc.cores=parallel::detectCores())
s1 ← 5 / qnorm(0.975)
# Solve for SD such that 10-year age effect has only 0.025 chance
# of being above 20
# Full model
set.seed(11)
f ← blrm(response ∼ sex + age, data=sex.age.response,
         priorsd=c(s1, s2), iprior=1, keepsep='age|sex', iter=5000)
# Elapsed time 1.7s
f
f
The following parameters remained separate (where not orthogonalized) during model fitting so that
prior distributions could be focused explicitly on them: sex=male, age
stanDx(f)

Iterations: 5000 on each of 4 chains, with 10000 posterior distribution samples saved

          n_eff Rhat
Intercept  9224    1
sex=male   6575    1
age        6686    1
stanDxplot ( f )
[Trace plots: Parameter Value vs. iteration for the age and sex=male parameters, overlaid by chain.]
small.
# Bayesian model output
The figure shows the posterior draws for the age and sex parameters, the trace of the 4 MCMC chains for each parameter, and the bivariate posterior distribution. The posterior distribution of each parameter is smooth and roughly symmetric, and the overlap among chains in the trace plots indicates good convergence. The bivariate density plot indicates moderate correlation between the age and sex parameters.
Create a 0.95 bivariate credible interval for the joint distribution of age and sex. Any number of intervals could be drawn, as any region that covers 0.95 of the posterior density could accurately be called a 0.95 credible interval. Commonly used is the highest posterior density (HPD) interval, which seeks the region that holds 0.95 of the density while also having the smallest area. In a 1-dimensional setting, this would
[Figure: marginal posterior densities for sex=male and age with 0.95 HPDI, mean, median, and mode markers, plus the bivariate posterior density and probability contours (0.01–0.95); Spearman ρ = 0.66.]
In the above figure, the point estimate does not appear quite at
the point of highest density. This is because blrm estimates (by
ggplot(Predict(f, age, sex, fun=plogis, funint=FALSE), ylab='P(Y=1)')

[Figure: P(Y=1) (0–1) vs. age (40–60), curves for female and male.]
# Frequentist
# variance-covariance for sex and age parameters
v ← vcov(flrm)[2:3, 2:3]
# Bayesian
# Linear correlation between params from posterior
draws ← f$draws[, c('sex=male', 'age')]
b_cc ← cor(draws)[1, 2]
[1] 0.9999

(p2 ← P(b2 > 0))

[1] 0.9994

[1] 0.9993

(p4 ← P(b1 > 0 | b2 > 0))

[1] 1
[Figure: bivariate posterior density of age (0.0–0.4) vs. sex=male (0–6).]
Chapter 9
Data source: The Titanic Passenger List edited by Michael A. Findlay, originally
published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens
Ltd, and expanded with the help of the Internet community. The original html files were
obtained from Philip Hind (1999) (http://atschool.eduweb.co.uk/phind). The dataset
was compiled and interpreted by Thomas Cason. It is available in R and spreadsheet
formats from hbiostat.org/data under the name titanic3.
9.1 Descriptive Statistics
require(rms)
t3

6 Variables   1309 Observations
pclass
 n missing distinct
 1309 0 3

age : Age [years]
lowest : 0.1667 0.3333 0.4167 0.6667 0.7500, highest: 70.5000 71.0000 74.0000 76.0000 80.0000

sex
 n missing distinct
 1309 0 2

sibsp : Number of Siblings/Spouses Aboard
lowest : 0 1 2 3 4, highest: 2 3 4 5 8

Value          0     1     2     3     4     5     8
Frequency    891   319    42    20    22     6     9
Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007
parch : Number of Parents/Children Aboard
n missing distinct Info Mean Gmd
1309 0 8 0.549 0.385 0.6375
lowest : 0 1 2 3 4, highest: 3 4 5 6 9
Value 0 1 2 3 4 5 6 9
Frequency 1002 170 113 8 6 6 2 2
Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002
dd ← datadist(t3)
# describe distributions of variables to rms
options(datadist='dd')
s ← summary(survived ∼ age + sex + pclass +
            cut2(sibsp, 0:3) + cut2(parch, 0:3), data=t3)
plot(s, main='', subtitles=FALSE)   # Figure 9.1
[Figure 9.1: proportion surviving by levels of each predictor; group sizes: age quartiles N=290/246/265/245 (263 missing), sex female 466 / male 843, pclass 1st 323 / 2nd 277 / 3rd 709, overall N=1309.]
[Figure 9.2: proportion surviving by passenger class, sex, and adult vs. child.]
9.2
yl ← ylab(NULL)
p1 ← ggplot(t3, aes(x=age, y=survived)) +
  histSpikeg(survived ∼ age, lowess=TRUE, data=t3) +
  ylim(0, 1) + yl
p2 ← ggplot(t3, aes(x=age, y=survived, color=sex)) +
  histSpikeg(survived ∼ age + sex, lowess=TRUE,
             data=t3) + ylim(0, 1) + yl
p3 ← ggplot(t3, aes(x=age, y=survived, size=pclass)) +
  histSpikeg(survived ∼ age + pclass, lowess=TRUE,
             data=t3) + b + ylim(0, 1) + yl
p4 ← ggplot(t3, aes(x=age, y=survived, color=sex,
                    size=pclass)) +
  histSpikeg(survived ∼ age + sex + pclass,
             lowess=TRUE, data=t3) +
  b + ylim(0, 1) + yl
gridExtra::grid.arrange(p1, p2, p3, p4, ncol=2)   # combine 4
# Figure 9.4
top ← theme(legend.position='top')
p1 ← ggplot(t3, aes(x=age, y=survived, color=cut2(sibsp,
       0:2))) + stat_plsmo() + b + ylim(0, 1) + yl + top +
  scale_color_discrete(name='siblings/spouses')
p2 ← ggplot(t3, aes(x=age, y=survived, color=cut2(parch,
       0:2))) + stat_plsmo() + b + ylim(0, 1) + yl + top +
  scale_color_discrete(name='parents/children')
gridExtra::grid.arrange(p1, p2, ncol=2)
[Figure: four panels of survival probability (0–1) vs. age (0–80): unstratified; by sex; by pclass; by sex and pclass.]
Figure 9.3: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic,
with tick marks depicting the age distribution. The top left panel shows unstratified estimates of the probability of survival. Other
panels show nonparametric estimates by various stratifications.
[Figure: survival probability (0–1) vs. age, stratified by siblings/spouses (0, 1, [2,8]) and by parents/children (0, 1, [2,9]).]
Figure 9.4: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the
number of parents or children of the passenger on board (right panel).
9.3
χ2 d.f. P
sex (Factor+Higher Order Factors) 187.15 15 <0.0001
All Interactions 59.74 14 <0.0001
pclass (Factor+Higher Order Factors) 100.10 20 <0.0001
All Interactions 46.51 18 0.0003
age (Factor+Higher Order Factors) 56.20 32 0.0052
All Interactions 34.57 28 0.1826
Nonlinear (Factor+Higher Order Factors) 28.66 24 0.2331
sibsp (Factor+Higher Order Factors) 19.67 5 0.0014
All Interactions 12.13 4 0.0164
parch (Factor+Higher Order Factors) 3.51 5 0.6217
All Interactions 3.51 4 0.4761
sex × pclass (Factor+Higher Order Factors) 42.43 10 <0.0001
sex × age (Factor+Higher Order Factors) 15.89 12 0.1962
Nonlinear (Factor+Higher Order Factors) 14.47 9 0.1066
Nonlinear Interaction : f(A,B) vs. AB 4.17 3 0.2441
pclass × age (Factor+Higher Order Factors) 13.47 16 0.6385
Nonlinear (Factor+Higher Order Factors) 12.92 12 0.3749
Nonlinear Interaction : f(A,B) vs. AB 6.88 6 0.3324
age × sibsp (Factor+Higher Order Factors) 12.13 4 0.0164
Nonlinear 1.76 3 0.6235
Nonlinear Interaction : f(A,B) vs. AB 1.76 3 0.6235
age × parch (Factor+Higher Order Factors) 3.51 4 0.4761
Nonlinear 1.80 3 0.6147
Nonlinear Interaction : f(A,B) vs. AB 1.80 3 0.6147
sex × pclass × age (Factor+Higher Order Factors) 8.34 8 0.4006
Nonlinear 7.74 6 0.2581
TOTAL NONLINEAR 28.66 24 0.2331
TOTAL INTERACTION 75.61 30 <0.0001
TOTAL NONLINEAR + INTERACTION 79.49 33 <0.0001
TOTAL 241.93 39 <0.0001
χ2 d.f. P
sex (Factor+Higher Order Factors) 199.42 7 <0.0001
All Interactions 56.14 6 <0.0001
pclass (Factor+Higher Order Factors) 108.73 12 <0.0001
All Interactions 42.83 10 <0.0001
age (Factor+Higher Order Factors) 47.04 20 0.0006
All Interactions 24.51 16 0.0789
Nonlinear (Factor+Higher Order Factors) 22.72 15 0.0902
sibsp (Factor+Higher Order Factors) 19.95 5 0.0013
All Interactions 10.99 4 0.0267
sex × pclass (Factor+Higher Order Factors) 35.40 2 <0.0001
sex × age (Factor+Higher Order Factors) 10.08 4 0.0391
Nonlinear 8.17 3 0.0426
Nonlinear Interaction : f(A,B) vs. AB 8.17 3 0.0426
pclass × age (Factor+Higher Order Factors) 6.86 8 0.5516
Nonlinear 6.11 6 0.4113
Nonlinear Interaction : f(A,B) vs. AB 6.11 6 0.4113
age × sibsp (Factor+Higher Order Factors) 10.99 4 0.0267
Nonlinear 1.81 3 0.6134
Nonlinear Interaction : f(A,B) vs. AB 1.81 3 0.6134
TOTAL NONLINEAR 22.72 15 0.0902
TOTAL INTERACTION 67.58 18 <0.0001
TOTAL NONLINEAR + INTERACTION 70.68 21 <0.0001
TOTAL 253.18 26 <0.0001
ggplot(Predict(f, sibsp, age=c(10, 15, 20, 50), conf.int=FALSE))
# Figure 9.6
[Figure: survival probability (0–0.75) vs. Age, years, three panels, curves for female and male.]
Figure 9.5: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses
[Figure: log odds (−6 to −2) vs. Number of Siblings/Spouses Aboard (0–8), curves for ages 10, 15, 20, 50. Adjusted to: sex=male, pclass=3rd.]
Figure 9.6: Effect of number of siblings and spouses on the log odds of surviving, for third class males
[Figure: Actual Probability vs. Predicted Pr{survived=1} (each 0–1), with apparent, bias-corrected, and ideal calibration curves.]
Figure 9.7: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model
9.4
[Figure panels: fraction of observations missing on each variable (boat, embarked, cabin, fare, ticket, parch, sibsp, sex, name, pclass, survived near 0; age ≈ 0.2; body and home.dest higher), with a clustering of missingness patterns.]
Figure 9.8: Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Lower panel
depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the Y -axis is the fraction of
observations for which both variables are missing. Right panel shows the result of recursive partitioning for predicting is.na(age).
The rpart function found only strong patterns according to passenger class.
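A minimal sketch of the recursive-partitioning check described in the caption (default rpart settings assumed):

require(rpart)
r ← rpart(is.na(age) ∼ sex + pclass + survived + sibsp + parch, data=t3)
plot(r); text(r)   # per the caption, splits are driven by passenger class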
plot(summary(is.na(age) ∼ sex + pclass + survived +
             sibsp + parch, data=t3))   # Figure 9.9
[Figure: proportion of passengers with missing age (is.na(age), N=1309) by sex (female 466, male 843), pclass (1st 323, 2nd 277, 3rd 709), survival (No 809, Yes 500), number of siblings/spouses aboard, and number of parents/children aboard.]
Figure 9.9: Univariable descriptions of proportion of passengers with missing age
χ2 d.f. P
sex (Factor+Higher Order Factors) 5.61 3 0.1324
All Interactions 5.58 2 0.0614
pclass (Factor+Higher Order Factors) 68.43 4 <0.0001
All Interactions 5.58 2 0.0614
survived 0.98 1 0.3232
sibsp 0.35 1 0.5548
parch 7.92 1 0.0049
sex × pclass (Factor+Higher Order Factors) 5.58 2 0.0614
TOTAL 82.90 8 <0.0001
9.5
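The transcan call that created xtrans is not shown on this page; the following is a plausible minimal sketch (options are assumptions, not taken from the source):

# customized conditional-mean imputation model for age
xtrans ← transcan(∼ I(age) + sex + pclass + sibsp + parch,
                  imputed=TRUE, pl=FALSE, pr=FALSE, data=t3)
age.i  ← with(t3, impute(xtrans, age, data=t3))   # imputed age vector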
summary(xtrans)

Iterations: 5

Adjusted R2:
age
n missing distinct Info Mean Gmd .05 .10
263 0 24 0.91 28.53 6.925 17.34 21.77
.25 .50 .75 .90 .95
26.17 28.10 28.10 42.77 42.77
            1st      2nd      3rd
female 39.08396 31.31831 23.10548
male   42.76765 33.24650 26.87451

            1st      2nd      3rd
female 37.03759 27.49919 22.18531
male   41.02925 30.81540 25.96227
dd ← datadist(dd, age.i)
f.si ← lrm(survived ∼ (sex + pclass + rcs(age.i, 5))^2 +
           rcs(age.i, 5) * sibsp, data=t3)
print(f.si, coefs=FALSE)
[Figure: Probability of Surviving (0–1) vs. Age, years (0–60), panels for pclass 1st/2nd/3rd and the two fits, curves by sex.]
Figure 9.10: Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean
imputation (top). sibsp is set to zero for these predicted values.
χ2 d.f. P
sex (Factor+Higher Order Factors) 245.39 7 <0.0001
All Interactions 52.85 6 <0.0001
pclass (Factor+Higher Order Factors) 112.07 12 <0.0001
All Interactions 36.79 10 <0.0001
age.i (Factor+Higher Order Factors) 49.32 20 0.0003
All Interactions 25.62 16 0.0595
Nonlinear (Factor+Higher Order Factors) 19.71 15 0.1835
sibsp (Factor+Higher Order Factors) 22.02 5 0.0005
All Interactions 12.28 4 0.0154
sex × pclass (Factor+Higher Order Factors) 30.29 2 <0.0001
sex × age.i (Factor+Higher Order Factors) 8.91 4 0.0633
Nonlinear 5.62 3 0.1319
Nonlinear Interaction : f(A,B) vs. AB 5.62 3 0.1319
pclass × age.i (Factor+Higher Order Factors) 6.05 8 0.6421
Nonlinear 5.44 6 0.4888
Nonlinear Interaction : f(A,B) vs. AB 5.44 6 0.4888
age.i × sibsp (Factor+Higher Order Factors) 12.28 4 0.0154
Nonlinear 2.05 3 0.5614
Nonlinear Interaction : f(A,B) vs. AB 2.05 3 0.5614
TOTAL NONLINEAR 19.71 15 0.1835
TOTAL INTERACTION 67.00 18 <0.0001
TOTAL NONLINEAR + INTERACTION 69.53 21 <0.0001
TOTAL 305.74 26 <0.0001
9.6 Multiple Imputation
mi

n: 1309   p: 6   Imputations: 20   nk: 4

Number of NAs:
   age    sex pclass  sibsp  parch survived
   263      0      0      0      0        0

         type d.f.
age        s    1
sex        c    1
pclass     c    2
sibsp      s    3
parch      s    3
survived   l    1
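The aregImpute call that created mi is not shown; a sketch consistent with the printed output above (20 imputations, nk=4; the remaining options and the seed are assumptions):

set.seed(17)   # hypothetical seed
mi ← aregImpute(∼ age + sex + pclass + sibsp + parch + survived,
                data=t3, n.impute=20, nk=4, pr=FALSE)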
[ ,1] [ ,2] [ ,3] [ ,4] [ ,5] [ ,6] [ ,7] [ ,8] [ ,9] [ ,10]
16 41 47.0 24 44 60.0 47 28.0 29 49 17
38 53 44.0 76 59 35.0 39 16.0 54 19 29
41 45 46.0 28 40 50.0 61 19.0 63 18 61
47 31 28.5 33 35 61.0 55 45.5 38 41 47
60 35 40.0 49 41 27.0 36 51.0 2 33 27
70 30 30.0 16 53 56.0 70 17.0 38 45 51
71 55 36.0 36 42 42.0 33 65.0 46 39 57
75 24 36.0 47 49 45.5 47 47.0 38 55 56
81 60 45.0 46 28 55.0 42 45.0 61 33 45
107 46 29.0 40 58 71.0 58 47.0 63 61 56
[Figure: empirical cumulative distributions (Proportion ≤ x) of imputed and actual ages, 0–80 years.]
Figure 9.11: Distributions of imputed and actual ages for the Titanic dataset. Imputed values are in black and actual ages in gray.
Fit logistic models for 20 completed datasets and print the ratio
of imputation-corrected variances to average ordinary variances F
f.mi ← fit.mult.impute(
  survived ∼ (sex + pclass + rcs(age, 5))^2 +
             rcs(age, 5) * sibsp,
  lrm, mi, data=t3, pr=FALSE)
print(anova(f.mi), table.env=TRUE, label='titanic-anova.mi',
      size='small')   # Table 9.5
χ2 d.f. P
sex (Factor+Higher Order Factors) 237.81 7 <0.0001
All Interactions 53.44 6 <0.0001
pclass (Factor+Higher Order Factors) 113.77 12 <0.0001
All Interactions 38.60 10 <0.0001
age (Factor+Higher Order Factors) 49.97 20 0.0002
All Interactions 26.00 16 0.0540
Nonlinear (Factor+Higher Order Factors) 23.03 15 0.0835
sibsp (Factor+Higher Order Factors) 25.08 5 0.0001
All Interactions 13.42 4 0.0094
sex × pclass (Factor+Higher Order Factors) 32.70 2 <0.0001
sex × age (Factor+Higher Order Factors) 10.54 4 0.0322
Nonlinear 8.40 3 0.0384
Nonlinear Interaction : f(A,B) vs. AB 8.40 3 0.0384
pclass × age (Factor+Higher Order Factors) 5.53 8 0.6996
Nonlinear 4.67 6 0.5870
Nonlinear Interaction : f(A,B) vs. AB 4.67 6 0.5870
age × sibsp (Factor+Higher Order Factors) 13.42 4 0.0094
Nonlinear 2.11 3 0.5492
Nonlinear Interaction : f(A,B) vs. AB 2.11 3 0.5492
TOTAL NONLINEAR 23.03 15 0.0835
TOTAL INTERACTION 66.42 18 <0.0001
TOTAL NONLINEAR + INTERACTION 69.10 21 <0.0001
TOTAL 294.26 26 <0.0001
[Figure: Probability of Surviving (0–1) vs. Age, years, panels for pclass 1st/2nd/3rd and the two imputation approaches, curves by sex.]
Figure 9.12: Predicted probability of survival for males from fit using single conditional mean imputation again (top) and multiple
random draw imputation (bottom). Both sets of predictions are for sibsp=0.
9.7
[Figure: odds ratios with intervals for sibsp 1:0, sex female:male, pclass 1st:3rd, and pclass 2nd:3rd.]
13  2 female 3rd 0 0.85
14 21 female 3rd 0 0.57
15 50 female 3rd 0 0.35
16  2 male   3rd 0 0.91
17 21 male   3rd 0 0.13
18 50 male   3rd 0 0.05
pred.logit ← function(sex = "male", pclass = "3rd", age = 28, sibsp = 0)
{
  3.373079 - 1.0484795 * (sex == "male") +
    5.8078168 * (pclass == "2nd") - 1.4370771 * (pclass == "3rd") +
    0.078347318 * age -
    0.00027150053 * pmax(age - 6,  0)^3 +
    0.0017093284  * pmax(age - 21, 0)^3 -
    0.0023751505  * pmax(age - 28, 0)^3 +
    0.0010126373  * pmax(age - 36, 0)^3 -
    7.5314668e-05 * pmax(age - 56, 0)^3 -
    1.1799235 * sibsp +
    (sex == "male") * (-0.47754081 * (pclass == "2nd") +
                        2.0665924  * (pclass == "3rd")) +
    (sex == "male") * (-0.21884197 * age +
      0.00042463444 * pmax(age - 6,  0)^3 -
      0.0023860246  * pmax(age - 21, 0)^3 +
      0.0030996682  * pmax(age - 28, 0)^3 -
      0.0012255784  * pmax(age - 36, 0)^3 +
      8.7300463e-05 * pmax(age - 56, 0)^3) +
    (pclass == "2nd") * (-0.47647131 * age +
      0.00068483    * pmax(age - 6,  0)^3 -
      0.0029990417  * pmax(age - 21, 0)^3 +
      0.0031221255  * pmax(age - 28, 0)^3 -
      0.00083472782 * pmax(age - 36, 0)^3 +
      2.6813959e-05 * pmax(age - 56, 0)^3) +
    (pclass == "3rd") * (-0.16335774 * age +
      0.00030986546 * pmax(age - 6,  0)^3 -
      0.0018174716  * pmax(age - 21, 0)^3 +
      0.002491657   * pmax(age - 28, 0)^3 -
      0.0010824082  * pmax(age - 36, 0)^3 +
      9.8357307e-05 * pmax(age - 56, 0)^3) +
    sibsp * (0.042368037 * age -
      2.0590588e-05 * pmax(age - 6,  0)^3 +
      0.00017884536 * pmax(age - 21, 0)^3 -
      0.00039201911 * pmax(age - 28, 0)^3 +
      0.00028732385 * pmax(age - 36, 0)^3 -
      5.3559508e-05 * pmax(age - 56, 0)^3)
}
present.
9.8 Bayesian Analysis

K
blrm uses the rstan package that provides the full power of
Stan to R
One merely has to stack the posterior draws from all of the imputation-specific model fits into one tall sample to account for imputation and obtain the correct posterior distribution; a sketch of this pattern follows.
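A minimal sketch of the stacking idea (the Hmisc stackMI function automates this pattern; the model call and options below are assumptions, not the original code):

draws ← NULL
for (i in 1:20) {
  comp ← impute.transcan(mi, imputation=i, data=t3,
                         list.out=TRUE, pr=FALSE, check=FALSE)
  d ← t3;  d[names(comp)] ← comp      # i-th completed dataset
  g ← blrm(survived ∼ (sex + pclass + rcs(age, 5))^2 +
             rcs(age, 5) * sibsp, data=d)
  draws ← rbind(draws, g$draws)       # stacked posterior sample
}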
# Use all available CPU cores. Each chain will be run on its own core.
# fitIf function only re-runs the model if bt.rds file doesn't exist
require(rmsb)
Look at diagnostics:

stanDx(bt)
Iterations: 2000 on each of 4 chains, with 4000 posterior distribution samples saved

                                    n_eff  Rhat
Imputation 1: Intercept              2213 1.001
Imputation 1: sex=male               2519 1.001
Imputation 1: pclass=2nd             1623 1.001
Imputation 1: pclass=3rd             3038 1.000
Imputation 1: age                    1490 1.002
Imputation 1: age'                   1563 1.000
Imputation 1: age''                  1394 1.001
Imputation 1: age'''                 2913 1.000
Imputation 1: sibsp                  2981 1.000
Imputation 1: sex=male * pclass=2nd  2944 1.001
Imputation 1: sex=male * pclass=3rd  2745 1.000
Imputation 1: sex=male * age         3500 1.000
Imputation 1: sex=male * age'        3211 1.000
Imputation 1: sex=male * age''       4320 1.000
Imputation 1: sex=male * age'''      3855 1.000
Imputation 1: pclass=2nd * age       1217 1.001
Imputation 1: pclass=3rd * age       2247 1.000
Imputation 1: pclass=2nd * age'      1243 1.002
Imputation 1: pclass=3rd * age'      3108 1.000
[Trace plots: Parameter Value vs. Post Burn-in Iteration (0–1000) for all model parameters, 4 chains in each of the stacked imputation fits.]
M
[Figure: posterior densities for the sex=male, pclass=3rd, and age parameters; one curve per imputation (1–10 shown) plus the stacked posterior, with 0.95 HPDI, mean, and median markers.]
[Figure: posterior predicted probability of survival (0–1) vs. Age, years, three panels, curves for female and male.]
O
[Figure: interval summaries for the sex, pclass, age, sex × pclass, sibsp, age × sibsp, sex × age, and pclass × age effects.]
Compute P(both contrasts < 0), both < −2, and P(either
one < 0)
k ← contrast(bt, list(sex='male',   age=c(5, 30), pclass='2nd'),
                 list(sex='female', age=c(5, 30), pclass='2nd'),
             cnames=c('age 5 M-F', 'age 30 M-F'))
k
plot(k)
[Figure: posterior densities of the two contrasts with 0.95 HPDI, mean, and median markers.]
[1] 0.99962
[1] 0.8414
[1] 1
[Figure: joint posterior density and probability contours (0.01–0.9) of age 30 M−F vs. age 5 M−F; Spearman ρ = 0.49.]
R Software Used

Package   Purpose                  Functions
Hmisc     Miscellaneous functions  summary, plsmo, naclus, llist, latex,
                                   summarize, Dotplot, describe
Hmisc     Imputation               transcan, impute, fit.mult.impute, aregImpute, stackMI
rms       Modeling                 datadist, lrm, blrm, rcs
          Model presentation       plot, summary, nomogram, Function, anova
          Estimation               Predict, summary, contrast
          Model validation         validate, calibrate
          Misc. Bayesian           stanDx, stanDxplot, plot
rpartᵃ    Recursive partitioning   rpart

ᵃ Written by Atkinson & Therneau
Chapter 10

10.1 Background

A
10.2 Ordinality Assumption

B
and the expectation can be estimated by

Ê(X|Y = j) = Σₓ x P̂ⱼₓ fₓ / gⱼ ,
10.3 Proportional Odds Model

10.3.1 Model

C

For convenience Y = 0, 1, 2, . . . , k

Pr[Y ≥ j|X] = 1 / (1 + exp[−(αⱼ + Xβ)]) ,

where j = 1, 2, . . . , k.
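A sketch of recovering these probabilities by hand from a fitted PO model (f an lrm or orm fit on an ordinal y; the covariate value is hypothetical):

k     ← f$non.slopes             # number of intercepts alpha_j
alpha ← coef(f)[1:k]
beta  ← coef(f)[-(1:k)]
xb    ← sum(c(x1=1.2) * beta)    # X beta for one subject
plogis(alpha + xb)               # Pr(Y >= j | X) for j = 1..k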
10.3.3 Estimation

10.3.4 Residuals

D

For ordinal Y compute binary-model partial residuals for all cutoffs j:

rᵢₘ = β̂ₘXᵢₘ + ([Yᵢ ≥ j] − P̂ᵢⱼ) / (P̂ᵢⱼ(1 − P̂ᵢⱼ)) ,
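A sketch for one cutoff j (x=TRUE, y=TRUE are needed so the design matrix is stored; names are hypothetical):

fj ← lrm(y ≥ j ∼ x1 + x2, x=TRUE, y=TRUE, data=d)
rj ← resid(fj, 'partial')   # one column of partial residuals per predictor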
10.3.5
Section 10.2
getHdata(support)
sfdm ← as.integer(support$sfdm2) - 1
sf ← function(y)
  c('Y≥1' = qlogis(mean(y ≥ 1)),
    'Y≥2' = qlogis(mean(y ≥ 2)),
    'Y≥3' = qlogis(mean(y ≥ 3)))
s ← summary(sfdm ∼ adlsc + sex + age + meanbp, fun=sf, data=support)
plot(s, which=1:3, pch=1:3, xlab='logit', vnames='names', main='',
     width.factor=1.5)
[Figure: logit of Y ≥ 1, 2, 3 plotted by levels of adlsc (N=282/150/199/210), sex (female 377, male 464), age quartiles, and meanbp quartiles; N=841, 159 missing.]
Figure 10.1: Checking PO assumption separately for a series of predictors. The circle, triangle, and plus sign correspond to
Y ≥ 1, 2, 3, respectively. PO is checked by examining the vertical constancy of distances between any two of these three symbols.
Response variable is the severe functional disability scale sfdm2 from the 1000-patient SUPPORT dataset, with the last two categories
combined because of low frequency of coma/intubation.
Note that computing ORs for various cutoffs and seeing dis-
agreements among them can cause reviewers to confuse lack
of fit with sampling variation (random chance). For a 4-level Y
having a given vector of probabilities in a control group, let’s
assume PO with a true OR of 3 and simulate 10 experiments
to show variation of observed ORs over all cutoffs of Y. First
do it for a sample size of n=10,000 then for n=200.
# Until a new Hmisc is on CRAN
# source('https://raw.githubusercontent.com/harrelfe/Hmisc/master/R/popower.s')
p ← c(.1, .2, .3, .4)
set.seed(7)
simPOcuts(10000, odds.ratio=3, p=p)
Vary at least one of the predictors, i.e., the one for which
you want to assess the impact of the PO assumption
done.impact ← FALSE
if (done.impact) w ← readRDS('impactPO.rds') else {
  set.seed(1)
  w ← impactPO(sfdm3 ∼ pol(adlsc, 2) + sex + rcs(age, kage) +
                 rcs(meanbp, kbp), nonpo = ∼ pol(adlsc, 2),
               newdata=d, B=300, data=support)
  saveRDS(w, 'impactPO.rds')
}
                               PO      PPO  Multinomial
Deviance                  1871.70  1824.79      1795.93
d.f.                           12       16           30
AIC                       1895.70  1856.79      1855.93
p                               9       13           27
LR chi^2                   124.11   171.02       199.89
LR - p                     115.11   158.02       172.89
LR chi^2 test for PO                 46.91        75.77
d.f.                                     4           18
Pr(>chi^2)                         <0.0001      <0.0001
MCS R2                      0.137    0.184        0.212
MCS R2 adj                  0.128    0.171        0.186
McFadden R2                 0.062    0.086        0.100
McFadden R2 adj             0.053    0.073        0.073
Mean |difference| from PO            0.038        0.036
0 1 2 3
Lower -0.020 -0.013 -0.001 -0.034
Upper 0.007 0.036 0.030 -0.001
0 1 2 3
Lower -0.051 -0.038 -0.010 -0.041
Upper 0.031 0.061 0.035 -0.004
CHAPTER 10. ORDINAL LOGISTIC REGRESSION 10-11
0 1 2 3
Lower -0.038 0.033 -0.015 -0.027
Upper -0.014 0.074 -0.004 -0.008
0 1 2 3
Lower -0.083 0.018 -0.027 -0.033
Upper -0.009 0.091 0.025 0.003
0 1 2 3
Lower -0.045 0.037 -0.043 -0.024
Upper -0.006 0.090 -0.013 0.001
0 1 2 3
Lower -0.085 0.028 -0.061 -0.032
Upper -0.009 0.100 0.018 0.014
0 1 2 3
Lower -0.025 0.009 -0.055 -0.016
Upper 0.017 0.064 -0.013 0.015
0 1 2 3
Lower -0.060 0.012 -0.079 -0.035
Upper 0.014 0.093 0.018 0.020
0 1 2 3
Lower 0.006 -0.052 -0.049 -0.005
Upper 0.057 0.015 -0.003 0.028
0 1 2 3
Lower -0.023 -0.025 -0.070 -0.055
Upper 0.046 0.074 0.024 0.017
0 1 2 3
Lower 0.044 -0.144 -0.025 0.009
Upper 0.112 -0.056 0.016 0.045
0 1 2 3
Lower 0.014 -0.099 -0.049 -0.078
Upper 0.110 0.030 0.042 0.018
0 1 2 3
Lower 0.081 -0.275 0.002 0.019
Upper 0.184 -0.127 0.046 0.070
0 1 2 3
Lower 0.068 -0.263 -0.018 -0.074
Upper 0.189 -0.042 0.065 0.054
[Figure: stacked probabilities of outcome categories 3, 2, 1, 0 under the PO, PPO, and multinomial fits, one panel per covariate setting (0–6).]
Figure 10.2: Checking the impact of the PO assumption by comparing predicted probabilities of all outcome categories from a PO
model with a multinomial logistic model that assumes PO for no variables and a partial proportional odds model that does not
assume PO for a key variable of interest
[Figure: logit(Fn(x)) (left) and Φ⁻¹(Fn(x)) (right) vs. log(HbA1c), stratified by body frame (small, medium, large).]
Figure 10.3: Transformed empirical cumulative distribution functions stratified by body frame in the diabetes dataset. Left panel:
checking all assumptions of the parametric ANOVA. Right panel: checking all assumptions of the PO model (here, Kruskal–Wallis
test).
10.3.6

10.3.7

For PO models there are four and sometimes five types of relevant predictions:

G

Graphics:

H

10.3.8

10.3.9 R Functions
The rms package’s lrm and orm functions fit the PO model di-
rectly, assuming that the levels of the response variable (e.g.,
the levels of a factor variable) are listed in the proper order.
predict computes all types of estimates except for quantiles.
orm allows for more link functions than the logistic and is in-
tended to efficiently handle hundreds of intercepts as happens
when Y is continuous.
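A minimal usage sketch (response and predictor names hypothetical):

f   ← orm(y ∼ x, family='loglog')   # semiparametric ordinal fit
qu  ← Quantile(f)                    # quantile estimator for Y|X
med ← function(lp) qu(.5, lp)        # median as a function of the linear predictor
M   ← Mean(f)                        # estimated mean function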
The R functions popower and posamsize (in the Hmisc package)
compute power and sample size estimates for ordinal responses
using the proportional odds model.
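A usage sketch with hypothetical cell probabilities:

p ← c(.4, .3, .2, .1)              # marginal distribution of the ordinal outcome
popower(p, odds.ratio=1.5, n=400)  # power of a two-group PO comparison
posamsize(p, odds.ratio=1.5)       # sample size for the default power of 0.8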
The function plot.xmean.ordinaly in rms computes and graphs
the quantities described in Section 10.2. It plots simple Y -
stratified means overlaid with Ê(X|Y = j), with j on the
x-axis. The Ês are computed for both PO and continuation
10.4 Continuation Ratio Model

10.4.1 Model

Pr(Y = j|Y ≥ j, X) = 1 / (1 + exp[−(θⱼ + Xγ)])

logit(Y = 0|Y ≥ 0, X) = logit(Y = 0|X) = θ₀ + Xγ
logit(Y = 1|Y ≥ 1, X) = θ₁ + Xγ
...
logit(Y = k − 1|Y ≥ k − 1, X) = θₖ₋₁ + Xγ.
Y for selected X.
10.4.2

10.4.3 Estimation

10.4.4 Residuals

10.4.5

10.4.6 Extended CR Model

10.4.7

10.4.8

10.4.9 R Functions
11.1

Goal of analysis:

– better understand effects of body size measurements on risk of DM

variable

– All cutpoints are arbitrary; no justification for any putative cut
s/CDC: http://www.cdc.gov/nchs/nhanes.htm[34]
w
18 Variables 4629 Observations
seqn : Respondent sequence number
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 4629 1 56902 3501 52136 52633 54284 56930 59495 61079 61641
lowest : 51624 51629 51630 51645 51647, highest: 62152 62153 62155 62157 62158
sex
 n missing distinct
 4629 0 2

age : Age [years]
lowest : 21.00000 21.08333 21.16667 21.25000 21.33333, highest: 79.66667 79.75000 79.83333 79.91667 80.00000
re : Race/Ethnicity
 n missing distinct
 4629 0 5

wt : Weight [kg]
lowest : 33.2 36.1 37.9 38.5 38.7, highest: 184.3 186.9 195.3 196.6 203.0

ht : Standing Height [cm]
lowest : 123.3 135.4 137.5 139.4 139.8, highest: 199.2 199.3 199.6 201.7 202.7
bmi : Body Mass Index [kg/m2 ]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 1994 1 28.59 6.965 20.02 21.35 24.12 27.60 31.88 36.75 40.68
lowest : 13.18 14.59 15.02 15.40 15.49, highest: 61.20 62.81 65.62 71.30 84.87
leg : Upper Leg Length [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4474 155 216 1 38.39 4.301 32.0 33.5 36.0 38.4 41.0 43.3 44.6
lowest : 20.4 24.9 25.0 25.1 26.4, highest: 49.0 49.5 49.8 50.0 50.3
arml : Upper Arm Length [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4502 127 156 1 37.01 3.116 32.6 33.5 35.0 37.0 39.0 40.6 41.7
lowest : 24.8 27.0 27.5 29.2 29.5, highest: 45.2 45.5 45.6 46.0 47.0
armc : Arm Circumference [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4499 130 290 1 32.87 5.475 25.4 26.9 29.5 32.5 35.8 39.1 41.4
lowest : 17.9 19.0 19.3 19.5 19.9, highest: 54.2 54.9 55.3 56.0 61.0
waist : Waist Circumference [cm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4465 164 716 1 97.62 17.18 74.8 78.6 86.9 96.3 107.0 117.8 125.0
lowest : 59.7 60.0 61.5 62.0 62.4, highest: 160.0 160.6 162.2 162.7 168.7
tri : Triceps Skinfold [mm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4295 334 342 1 18.94 9.463 7.2 8.8 12.0 18.0 25.2 31.0 33.8
lowest : 2.6 3.1 3.2 3.3 3.4, highest: 39.6 39.8 40.0 40.2 40.6
sub : Subscapular Skinfold [mm]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
3974 655 329 1 20.8 9.124 8.60 10.30 14.40 20.30 26.58 32.00 35.00
lowest : 3.8 4.2 4.6 4.8 4.9, highest: 40.0 40.1 40.2 40.3 40.4
gh : Glycohemoglobin [%]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4629 0 63 0.994 5.533 0.5411 4.8 5.0 5.2 5.5 5.8 6.0 6.3
lowest : 4.0 4.1 4.2 4.3 4.4, highest: 11.9 12.0 12.1 12.3 14.5
albumin : Albumin [g/dL]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4576 53 26 0.99 4.261 0.3528 3.7 3.9 4.1 4.3 4.5 4.7 4.8
lowest : 2.6 2.7 3.0 3.1 3.2, highest: 4.9 5.0 5.1 5.2 5.3
bun : Blood urea nitrogen [mg/dL]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4576 53 50 0.995 13.03 5.309 7 8 10 12 15 19 22
lowest : 1 2 3 4 5, highest: 49 53 55 56 63
SCr : Creatinine [mg/dL]
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
4576 53 167 1 0.8887 0.2697 0.58 0.62 0.72 0.84 0.99 1.14 1.25
lowest : 0.34 0.38 0.39 0.40 0.41, highest: 5.98 6.34 9.13 10.98 15.66
11.2
sometimes be dispensed with by using bootstrap confidence intervals, but this would not fix inefficiency problems with OLS when residuals are
non-normal.
[Figure: four transformations of the ECDF of glycohemoglobin (5.0–7.5%): log(F/(1 − F)) (logit), qnorm(F), log(−log(1 − F)), and −log(−log(F)).]
Figure 11.1: Examination of normality and constant variance assumption, and assumptions for various ordinal models
11.3 Quantile Regression

I
11.4
For the OLS fully parametric case, the model may be restated

Prob[Y ≥ y|X] = Prob[(Y − Xβ)/σ ≥ (y − Xβ)/σ]
             = 1 − Φ((y − Xβ)/σ) = Φ(−y/σ + Xβ/σ)

so that to within an additive constantᶠ αy = −y/σ (intercepts α are linear in y whereas they are arbitrarily descending in the ordinal model), and σ is absorbed in β to put the OLS model into the new notation.
The general ordinal regression model assumes that for fixed X₁, X₂,

F⁻¹(Prob[Y ≥ y|X₂]) − F⁻¹(Prob[Y ≥ y|X₁]) = (X₂ − X₁)β

independent of the αs (parallelism assumption). If F = [1 + exp(−y)]⁻¹, this is the proportional odds assumption.
ᵉ It is more traditional to state the model in terms of Prob[Y ≤ y|X] but we use Prob[Y ≥ y|X] so that higher predicted values are associated with higher values of Y.
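A quick numeric illustration of the logistic-link connection (values hypothetical):

P1 ← 0.3;  Delta ← log(2)          # odds ratio of 2
P2 ← plogis(qlogis(P1) + Delta)    # 0.4615
(P2/(1 - P2)) / (P1/(1 - P1))      # = 2 for every y: parallelism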
Table 11.1: Distribution families used in ordinal cumulative probability models. Φ denotes the Gaussian cumulative distribution
function. For the Connection column, P₁ = Prob[Y ≥ y|X₁], P₂ = Prob[Y ≥ y|X₂], ∆ = (X₂ − X₁)β. The connection specifies
the only distributional assumption if the model is fitted semiparametrically, i.e., contains an intercept for every unique Y value less
one. For parametric models, P₁ must be specified absolutely instead of just requiring a relationship between P₁ and P₂. For example,
the traditional Gaussian parametric model specifies that Prob[Y ≥ y|X] = 1 − Φ((y − Xβ)/σ) = Φ((−y + Xβ)/σ).
Σᵢ₌₁ᵏ yᵢ P̂rob[Y = yᵢ|X]
[Figure: Φ⁻¹(F(y|X)) vs. y for the linear model (left) and Φ⁻¹(F(y|X)) or logit(F(y|X)) vs. y for semiparametric ordinal models (right), with curves for two covariate values separated by ∆Xβ (left panel: ∆Xβ/σ).]
Figure 11.2: Assumptions of the linear model (left panel) and semiparametric ordinal probit or logit (proportional odds) models
(right panel). Ordinal models do not assume any shape for the distribution of Y for a given X; they only assume parallelism.
Y needs to be ordinal
11.5
11.5.1
}
fams ← c('logistic', 'probit', 'loglog', 'cloglog')
fe   ← function(pred, target) mean(abs(pred$yhat - target))
mod  ← gh ∼ rcs(age, 6)
P ← Er ← list()
for (est in c('q2', 'q3', 'p90', 'mean')) {
  meth ← if (est == 'mean') 'ols' else 'QR'
  p  ← list()
  er ← rep(NA, 5)
  names(er) ← c(fams, meth)
  for (family in fams) {
    h ← orm(mod, family=family, data=w)
    fun ← if (est == 'mean') Mean(h)
    else {
      qu ← Quantile(h)
      switch(est, q2  = function(x) qu(.5,  x),
                  q3  = function(x) qu(.75, x),
                  p90 = function(x) qu(.9,  x))
    }
    p[[family]] ← z ← Predict(h, age=ag, fun=fun, conf.int=FALSE)
    er[family] ← fe(z, switch(est, mean=means, q2=q2, q3=q3, p90=p90))
  }
  h ← switch(est,
             mean = ols(mod, data=w),
             q2   = Rq(mod, data=w),
             q3   = Rq(mod, tau=0.75, data=w),
             p90  = Rq(mod, tau=0.90, data=w))
  p[[meth]] ← z ← Predict(h, age=ag, conf.int=FALSE)
  er[meth] ← fe(z, switch(est, mean=means, q2=q2, q3=q3, p90=p90))
  Er[[est]] ← er
  pr ← do.call('rbind', p)
  pr$est ← est
  P ← rbind.data.frame(P, pr)
}
task (quantile regression for quantiles and OLS for means) were
best for those tasks. Although the log-log ordinal cumulative
[Figure: estimated yhat (5.2–6.4) vs. age (30–70), four panels, with mean absolute errors:
  q2:   logistic 0.023, probit 0.028, loglog 0.044, cloglog 0.053, QR 0.024
  q3:   logistic 0.036, probit 0.042, loglog 0.050, cloglog 0.075, QR 0.027
  mean: logistic 0.021, probit 0.025, loglog 0.026, cloglog 0.033, ols 0.013
  p90:  logistic 0.046, probit 0.053, loglog 0.074, cloglog 0.111, QR 0.030]
Figure 11.3: Three estimated quantiles and estimated mean using 6 methods, compared against caliper-matched sample quan-
tiles/means (circles). Numbers are mean absolute differences between predicted and sample quantities using overlapping intervals
of age and caliper matching. QR: quantile regression.
[Figure: Prob(Y ≥ y) (0–1) vs. gh (4.0–7.5), one pair of observed/predicted curves per interval of predicted mean HbA1c: [4.88,5.29), [5.29,5.44), [5.44,5.56), [5.56,5.66), [5.66,5.76), [5.76,6.48].]
Figure 11.4: Observed (dashed lines, open circles) and predicted (solid lines, closed circles) exceedance probability distributions from
a model using 6-tiles of OLS-predicted HbA1c . Key shows quantile group intervals of predicted mean HbA1c .
[Figure: estimated intercepts αy (−5 to 2) plotted against y.]
11.5.2 Examination of BMI
aic ← NULL
for (mod in list(gh ∼ rcs(age, 5) + rcs(log(bmi), 5),
                 gh ∼ rcs(age, 5) + rcs(log(ht), 5) + rcs(log(wt), 5),
                 gh ∼ rcs(age, 5) + rcs(log(ht), 4) * rcs(log(wt), 4)))
  aic ← c(aic, AIC(orm(mod, family=loglog, data=w)))
print(aic)
weight is −2.4, which is between what BMI uses and the more dimensionally reasonable weight/height³. By AIC, a spline interaction surface between height and weight does slightly better than BMI in predicting HbA1c, but a nonlinear function of BMI is barely worse. It will require other body size measures to displace BMI as a predictor.
As an aside, compare this model fit to that from the Cox pro-
portional hazards model. The Cox model uses a conditioning
argument to obtain a partial likelihood free of the intercepts α
(and requires a second step to estimate these log discrete haz-
ard components) whereas we are using a full marginal likelihood
of the ranks of Y [105].
print(cph(Surv(gh) ∼ rcs(age, 5) + log(ht) + log(wt), data=w))
Back up and look at all body size measures, and examine their
redundancies. Y
v ← varclus(∼ wt + ht + bmi + leg + arml + armc + waist +
              tri + sub + age + sex + re, data=w)
plot(v)   # Figure 11.6
# Omit wt so it won't be removed before bmi
redun(∼ ht + bmi + leg + arml + armc + waist + tri + sub,
      data=w, r2=.75)
Redundancy Analysis

n: 3853   p: 8   nk: 3

R2 with which each variable can be predicted from all other variables:

Redundant variables:

bmi ht
Six size measures adequately capture the entire set. Height and
CHAPTER 11. REGRESSION MODELS FOR CONTINUOUS Y AND CASE STUDY IN ORDINAL REGRESSION11-27
[Figure 11.6: variable clustering dendrogram (similarity: Spearman ρ²) of age, race/ethnicity indicators, sex, tri, sub, leg, ht, arml, wt, armc, bmi, and waist.]
the elderly:
f ← orm(ht ∼ rcs(age, 4) * sex, data=w)   # prop. odds model
qu ← Quantile(f);  med ← function(x) qu(.5, x)
ggplot(Predict(f, age, sex, fun=med, conf.int=FALSE),
       ylab='Predicted Median Height, cm')
[Figure: Predicted Median Height, cm (155–180) vs. Age, years (20–80), curves for male and female.]
Figure 11.7: Estimated median height as a smooth function of age, allowing age to interact with sex, from a proportional odds
model
[Figure: predicted median upper leg length (34–40 cm) vs. Age, years (20–80), curves for male and female.]
Figure 11.8: Estimated median upper leg length as a smooth function of age, allowing age to interact with sex, from a proportional
odds model
[Figure: adjusted Spearman ρ² (0.00–0.20) between each predictor and gh, ordered age, waist, leg, sub, armc, wt, re, tri, arml, sex, with N (3974–4629) and df (1–4) shown for each.]
χ2 d.f. P
leg 8.30 2 0.0158
Nonlinear 3.32 1 0.0685
arml 0.16 1 0.6924
armc 6.66 2 0.0358
Nonlinear 3.29 1 0.0695
waist 29.40 3 <0.0001
Nonlinear 4.29 2 0.1171
tri 16.62 1 <0.0001
sub 40.75 2 <0.0001
Nonlinear 4.50 1 0.0340
TOTAL NONLINEAR 14.95 5 0.0106
TOTAL 128.29 11 <0.0001
[Figure: estimated gh (5.00–6.25) vs. Age, years (20–80).]
Figure 11.10: Estimated mean and 0.5 and 0.9 quantiles from the log-log ordinal model using casewise deletion, along with predictions
of 0.5 and 0.9 quantiles from quantile regression (QR). Age is varied and other predictors are held constant to medians/modes.
an ← anova(g)
print(an, caption='ANOVA for reduced model after multiple imputation, with
 addition of a combined effect for four size variables')
ANOVA for reduced model after multiple imputation, with addition of a combined effect for
four size variables
χ2 d.f. P
age 698.03 4 <0.0001
Nonlinear 29.54 3 <0.0001
re 163.54 4 <0.0001
leg 24.19 2 <0.0001
Nonlinear 2.29 1 0.1298
waist 128.33 3 <0.0001
Nonlinear 4.23 2 0.1208
tri 34.29 1 <0.0001
sub 41.27 3 <0.0001
Nonlinear 6.37 2 0.0414
TOTAL NONLINEAR 46.91 8 <0.0001
TOTAL 1457.15 17 <0.0001
b ← anova(g, leg, waist, tri, sub)
# Add new lines to the plot with combined effect of 4 size var.
s ← rbind(an, size=b['TOTAL', ])
class(s) ← 'anova.rms'
plot(s)
χ2 P
leg 24.2 0.0000
tri 34.3 0.0000
sub 41.3 0.0000
waist 128.3 0.0000
re 163.5 0.0000
size 412.3 0.0000
age 698.0 0.0000
Figure 11.13: Partial effect for age from multiple imputation (center red line) and casewise deletion (center blue line) with symmetric
Wald 0.95 confidence bands using casewise deletion (gray shaded area), basic bootstrap confidence bands using casewise deletion
(blue lines), percentile bootstrap confidence bands using casewise deletion (dashed blue lines), and symmetric Wald confidence bands
accounting for multiple imputation (red lines).
In OLS the mean equals the median and both are linearly related K
[Figure: predicted median and 0.9 quantile of HbA1c from the ordinal model, computed for each subject]
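The nomogram code below uses M, med, and q90, which are not defined in this excerpt; presumably they are built from the rms Mean and Quantile methods for the ordinal fit g, as in this sketch:

M   ← Mean(g)                 # mean HbA1c as a function of the linear predictor
qu  ← Quantile(g)
med ← function(x) qu(.5, x)   # median
q90 ← function(x) qu(.9, x)   # 0.9 quantile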
g ← Newlevels(g, list(re=abbreviate(levels(w$re))))
exprob ← ExProb(g)
nom ←
  nomogram(g, fun=list(Mean=M,
                       'Median Glycohemoglobin'=med,
                       '0.9 Quantile'=q90,
                       'Prob(HbA1c ≥ 6.5)'=function(x) exprob(x, y=6.5),
                       'Prob(HbA1c ≥ 7.0)'=function(x) exprob(x, y=7),
                       'Prob(HbA1c ≥ 7.5)'=function(x) exprob(x, y=7.5)),
           fun.at=list(seq(5, 8, by=.5),
                       c(5,5.25,5.5,5.75,6,6.25),
                       c(5.5,6,6.5,7,8,10,12,14),
                       c(.01,.05,.1,.2,.3,.4),
                       c(.01,.05,.1,.2,.3,.4),
                       c(.01,.05,.1,.2,.3,.4)))
plot(nom, lmgp=.28)   # Figure 11.15
Figure 11.15: Nomogram for predicting median, mean, and 0.9 quantile of glycohemoglobin, along with the estimated probability
that HbA1c ≥ 6.5, 7, or 7.5, all from the log-log ordinal model
Chapter 12
Parametric Survival Modeling and Model Approximation
12.1 Descriptive Statistics
support[acute, ]
35 Variables 537 Observations
age : Age
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 529 1 60.7 19.98 28.49 35.22 47.93 63.67 74.49 81.54 85.56
lowest : 18.04199 18.41499 19.76500 20.29599 20.31200, highest: 91.61896 91.81696 91.93396 92.73895 95.50995
death : Death at any time up to NDI date:31DEC94
n missing distinct Info Sum Mean Gmd
537 0 2 0.67 356 0.6629 0.4477
sex
n missing distinct
537 0 2
num.co
lowest : 0 1 2 3 4, highest: 2 3 4 5 6
Value 0 1 2 3 4 5 6
Frequency 111 196 133 51 31 10 5
Proportion 0.207 0.365 0.248 0.095 0.058 0.019 0.009
edu : Years of Education
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
411 126 22 0.957 12.03 3.581 7 8 10 12 14 16 17
lowest : 0 1 2 3 4, highest: 17 18 19 20 22
income
n missing distinct
335 202 4
lowest : 3448.0 4432.0 4574.0 5555.0 5849.0, highest: 504659.5 538323.0 543761.0 706577.0 740010.0
race
lowest : white black asian other hispanic, highest: white black asian other hispanic
Value white black asian other hispanic
Frequency 417 84 4 8 22
Proportion 0.779 0.157 0.007 0.015 0.041
meanbp : Mean Arterial Blood Pressure Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 109 1 83.28 35 41.8 49.0 59.0 73.0 111.0 124.4 135.0
lowest : 0 4 6 7 8, highest: 48 49 52 60 64
temp : Temperature (celcius) Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 61 0.999 37.52 1.505 35.50 35.80 36.40 37.80 38.50 39.09 39.50
lowest : 32.50000 34.00000 34.09375 34.89844 35.00000, highest: 40.19531 40.59375 40.89844 41.00000 41.19531
pafi : PaO2/(.01*FiO2) Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
500 37 357 1 227.2 125 86.99 105.08 137.88 202.56 290.00 390.49 433.31
lowest : 1.099854 1.199951 1.299805 1.399902 1.500000, highest: 4.099609 4.199219 4.500000 4.699219 4.799805
sod : Serum Sodium Day 3
lowest : 118 120 121 126 127, highest: 156 157 158 168 175
ph : Serum pH (arterial) Day 3
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
500 37 49 0.998 7.416 0.08775 7.270 7.319 7.380 7.420 7.470 7.510 7.529
lowest : 6.959961 6.989258 7.069336 7.119141 7.129883, highest: 7.559570 7.569336 7.589844 7.599609 7.659180
lowest : 0 1 2 3 4, highest: 3 4 5 6 7
Value 0 1 2 3 4 5 6 7
Frequency 51 19 7 6 4 7 8 2
Proportion 0.490 0.183 0.067 0.058 0.038 0.067 0.077 0.019
adls : ADL Surrogate Day 3
n missing distinct Info Mean Gmd
392 145 8 0.888 1.86 2.466
lowest : 0 1 2 3 4, highest: 3 4 5 6 7
Value 0 1 2 3 4 5 6 7
Frequency 185 68 22 18 17 20 39 23
Proportion 0.472 0.173 0.056 0.046 0.043 0.051 0.099 0.059
sfdm2
n missing distinct
468 69 5
lowest : no(M2 and SIP pres) adl>=4 (>=5 if sur) SIP>=30 Coma or Intub <2 mo. follow-up
highest: no(M2 and SIP pres) adl>=4 (>=5 if sur) SIP>=30 Coma or Intub <2 mo. follow-up
Value no(M2 and SIP pres) adl>=4 (>=5 if sur) SIP>=30 Coma or Intub
Frequency 134 78 30 5
Proportion 0.286 0.167 0.064 0.011
Value <2 mo. follow-up
Frequency 221
Proportion 0.472
adlsc : Imputed ADL Calibrated to Surrogate
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
537 0 144 0.956 2.119 2.386 0.000 0.000 0.000 1.839 3.375 6.000 6.000
Figure 12.1: Cluster analysis showing which predictors tend to be missing on the same patients
Figure 12.2: Hierarchical clustering of potential predictors using Hoeffding D as a similarity measure. Categorical predictors are automatically expanded into dummy variables.
12.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
Figure 12.3: Φ−1 (SKM (t)) stratified by dzgroup. Linearity and semi-parallelism indicate a reasonable fit to the log-normal accelerated
failure time model with respect to one predictor.
Figure 12.4: Kaplan-Meier estimates of distributions of normalized, right-censored residuals from the fitted log-normal survival model.
Residuals are stratified by important variables in the model (by quartiles of continuous variables), plus a random variable to depict
the natural variability (in the lower right plot). Theoretical standard Gaussian distributions of residuals are shown with a thick solid
line. The upper left plot is with respect to disease group.
The fit for dzgroup is not great but overall fit is good.
Remove from consideration predictors that are missing in > 0.2
of the patients. Many of these were only collected for the
second phase of SUPPORT.
Of those variables to be included in the model, find which ones
have enough potential predictive power to justify allowing for
nonlinear relationships or multiple categories, which spend more
d.f. For each variable compute Spearman ρ² based on multiple d.f.
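A sketch of the computation behind Figure 12.5, assuming the acute SUPPORT subset (here called acute) and truncation of d.time at the shortest follow-up among censored patients (the truncation rule is an assumption):

acute$t.trunc ← with(acute, pmin(d.time, min(d.time[death == 0])))  # assumed truncation
s ← spearman2(t.trunc ∼ age + num.co + scoma + meanbp + hrt + resp + temp +
                crea + sod + adlsc + wblc + pafi + ph + dzgroup + race,
              data=acute, p=2)   # 2 d.f. per predictor allows nonmonotonicity
plot(s)   # Figure 12.5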
[Figure 12.5 plots the adjusted ρ² values; N and d.f. per predictor:]

Variable    N  df
scoma     537   2
meanbp    537   2
dzgroup   537   2
crea      537   2
pafi      500   2
ph        500   2
sod       537   2
hrt       537   2
adlsc     537   2
temp      537   2
wblc      532   2
num.co    537   2
age       537   2
resp      537   2
race      535   4
Figure 12.5: Generalized Spearman ρ2 rank correlation between predictors and truncated survival time
age num . co scoma meanbp hrt resp temp crea sod adlsc wblc
0 0 0 0 0 0 0 0 0 0 5
pafi ph
37 37
[Figure 12.6 plots |Dxy|; N per predictor:]

Variable    N
meanbp    537
crea      537
dzgroup   537
scoma     537
pafi      500
ph        500
adlsc     537
age       537
num.co    537
hrt       537
resp      537
race      535
sod       537
wblc      532
temp      537
Figure 12.6: Somers’ Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second most frequent category.
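Figure 12.6 presumably uses the Hmisc rcorrcens function, which computes Somers' Dxy between each predictor and the censored response; a sketch:

r ← rcorrcens(Surv(d.time, death) ∼ age + num.co + scoma + meanbp + hrt + resp +
                temp + crea + sod + adlsc + wblc + pafi + ph + dzgroup + race,
              data=acute)
plot(r)   # Figure 12.6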
Redundancy Analysis
n : 537 p : 16 nk : 4
Number of NAs : 0
R2 with which each variable can be predicted from all other variables :
crea age sex dzgroup num . co scoma adlsc race2 meanbp hrt
0.133 0.246 0.132 0.451 0.147 0.418 0.153 0.151 0.178 0.258
resp temp sod wblc . i pafi . i ph . i
0.131 0.197 0.135 0.093 0.143 0.171
No redundant variables
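A sketch of a redun call consistent with the output above (16 predictors, nk=4; the singly imputed wblc.i, pafi.i, ph.i and the recoded race2 are assumed to exist):

redun(∼ crea + age + sex + dzgroup + num.co + scoma + adlsc + race2 + meanbp +
        hrt + resp + temp + sod + wblc.i + pafi.i + ph.i,
      data=acute, nk=4)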
χ2 P
sex 0.2 0.6799
temp 2.6 0.4558
race 4.1 0.3887
sod 4.5 0.2109
num.co 4.4 0.1099
hrt 8.4 0.0379
wblc.i 8.9 0.0302
adlsc 8.3 0.0157
resp 9.7 0.0217
scoma 8.9 0.0028
pafi.i 16.4 0.0009
age 19.5 0.0002
meanbp 20.1 0.0002
crea 25.6 0.0000
dzgroup 39.7 0.0000
Figure 12.7: Partial χ2 statistics for association of each predictor with response from saturated main effects model, penalized for
d.f.
a ← anova(f)
12.3 Summarizing the Fitted Model
Figure 12.8: Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor
reference values are zero. Pointwise 0.95 confidence bands are also shown. As all Y -axes have the same scale, it is easy to see which
predictors are strongest.
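Figure 12.8 can presumably be drawn with Predict using ref.zero=TRUE, which subtracts the prediction at each predictor's reference value (hence the zero-centered curves described in the caption); a sketch assuming f is the full log-normal fit:

ggplot(Predict(f, ref.zero=TRUE), sepdiscrete='vertical', ylab='log(T)')   # Figure 12.8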
χ2 d.f. P
age 15.99 4 0.0030
Nonlinear 0.23 3 0.9722
sex 0.11 1 0.7354
dzgroup 45.69 2 <0.0001
num.co 4.99 1 0.0255
scoma 10.58 1 0.0011
adlsc 8.28 2 0.0159
Nonlinear 3.31 1 0.0691
race2 1.26 1 0.2624
meanbp 27.62 4 <0.0001
Nonlinear 10.51 3 0.0147
hrt 11.83 2 0.0027
Nonlinear 1.04 1 0.3090
resp 11.10 2 0.0039
Nonlinear 8.56 1 0.0034
temp 0.39 1 0.5308
crea 33.63 3 <0.0001
Nonlinear 21.27 2 <0.0001
sod 0.08 1 0.7792
wblc.i 5.47 2 0.0649
Nonlinear 5.46 1 0.0195
pafi.i 15.37 3 0.0015
Nonlinear 6.97 2 0.0307
TOTAL NONLINEAR 60.48 14 <0.0001
TOTAL 261.47 30 <0.0001
plot(a)   # Figure 12.9
L
options(digits=3)
plot(summary(f), log=TRUE, main='')   # Figure 12.10
χ2 P
sod 0.1 0.7792
sex 0.1 0.7354
temp 0.4 0.5308
race2 1.3 0.2624
wblc.i 5.5 0.0649
num.co 5.0 0.0255
adlsc 8.3 0.0159
resp 11.1 0.0039
scoma 10.6 0.0011
hrt 11.8 0.0027
age 16.0 0.0030
pafi.i 15.4 0.0015
meanbp 27.6 0.0000
crea 33.6 0.0000
dzgroup 45.7 0.0000
Figure 12.9: Contribution of variables in predicting survival time in log-normal model
Figure 12.10: Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile
to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate
different confidence levels (0.9, 0.95, 0.99).
12.4 Internal Validation of the Fitted Model Using the Bootstrap
Figure 12.11: Bootstrap validation of calibration curve. Dots represent apparent calibration accuracy; × are bootstrap estimates corrected for overfitting, based on binning predicted survival probabilities and computing Kaplan-Meier estimates. Black curve is the estimated observed relationship using hare and the blue curve is the overfitting-corrected hare estimate. The gray-scale line depicts the ideal relationship.
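A sketch of the validation behind Figure 12.11, assuming f is the full psm fit created with x=TRUE, y=TRUE and that time is in days (u=365 for 1-year survival; B=300 resamples is an assumption):

set.seed(717)                      # make the bootstrap reproducible
cal ← calibrate(f, B=300, u=365)   # overfitting-corrected 1-year calibration
plot(cal)                          # Figure 12.11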
12.5 Approximating the Full Model
Coef S . E . Wald Z P
[1 ,] -0.5928 0.04315 -13.74 0
None
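Z in the approximation below is presumably the linear predictor from the full model, so that ols approximates X β̂ with a smaller set of predictors; a sketch:

Z ← predict(f)   # linear predictor X beta-hat from the full log-normal model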
f.approx ← ols(Z ∼ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
               rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
               rcs(resp,3), x=TRUE)
f.approx$stats
      n  Model L.R.  d.f.    R2      g  Sigma
537.000    1688.225    23 0.957  1.915  0.370
O
hrt ’ age
0.976 0.982
P
f.approx$var ← v
print(anova(f.approx, test='Chisq', ss=FALSE), size='tsz')
$$\begin{aligned}
X\hat\beta = \; &-2.51\\
&-1.94[\text{Coma}] - 1.75[\text{MOSF w/Malig}]\\
&+0.068\,\text{meanbp} - 3.08\times 10^{-5}(\text{meanbp}-41.8)_+^3 + 7.9\times 10^{-5}(\text{meanbp}-61)_+^3\\
&-4.91\times 10^{-5}(\text{meanbp}-73)_+^3 + 2.61\times 10^{-6}(\text{meanbp}-109)_+^3 - 1.7\times 10^{-6}(\text{meanbp}-135)_+^3\\
&-0.553\,\text{crea} - 0.229(\text{crea}-0.6)_+^3 + 0.45(\text{crea}-1.1)_+^3 - 0.233(\text{crea}-1.94)_+^3\\
&+0.0131(\text{crea}-7.32)_+^3\\
&-0.0165\,\text{age} - 1.13\times 10^{-5}(\text{age}-28.5)_+^3 + 4.05\times 10^{-5}(\text{age}-49.5)_+^3\\
&-2.15\times 10^{-5}(\text{age}-63.7)_+^3 - 2.68\times 10^{-5}(\text{age}-72.7)_+^3 + 1.9\times 10^{-5}(\text{age}-85.6)_+^3\\
&-0.0136\,\text{hrt} + 6.09\times 10^{-7}(\text{hrt}-60)_+^3 - 1.68\times 10^{-6}(\text{hrt}-111)_+^3 + 1.07\times 10^{-6}(\text{hrt}-140)_+^3\\
&-0.0135\,\text{scoma}
\end{aligned}$$
Figure 12.12: Nomogram for predicting median and mean survival time, based on approximation of full model
Chapter 13
Case Study in Cox Regression
13.1 Choosing the Number of Parameters and Fitting the Model
$S(t|X) = S_0(t)^{e^{X\beta}}$
attach(prostate)
sz  ← impute(w, sz,  data=prostate)
sg  ← impute(w, sg,  data=prostate)
age ← impute(w, age, data=prostate)
wt  ← impute(w, wt,  data=prostate)
ekg ← impute(w, ekg, data=prostate)
dd ← datadist(prostate)
options(datadist='dd')
χ2 d.f. P
rx 8.01 3 0.0459
age 13.84 3 0.0031
Nonlinear 9.06 2 0.0108
wt 8.21 2 0.0165
Nonlinear 2.54 1 0.1110
pf.coded 3.79 1 0.0517
heart 23.51 1 <0.0001
map 0.04 2 0.9779
Nonlinear 0.04 1 0.8345
hg 12.52 3 0.0058
Nonlinear 8.25 2 0.0162
sg 1.64 2 0.4406
Nonlinear 0.05 1 0.8304
sz 12.73 2 0.0017
Nonlinear 0.06 1 0.7990
ap 6.51 4 0.1639
Nonlinear 6.22 3 0.1012
bm 0.03 1 0.8670
TOTAL NONLINEAR 23.81 11 0.0136
TOTAL 119.09 24 <0.0001
Savings of 12 d.f. F
13.2 Checking Proportional Hazards
All β = 1
phtest ← cox.zph(f, transform='identity')
phtest
                chisq df    p
rx           4.07e+00  3 0.25
rcs(age, 4)  4.27e+00  3 0.23
rcs(wt, 3)   2.22e-01  2 0.89
pf.coded     5.34e-02  1 0.82
heart        4.95e-01  1 0.48
rcs(map, 3)  3.20e+00  2 0.20
Figure 13.1: Raw and spline-smoothed scaled Schoenfeld residuals for dose of estrogen, nonlinearly coded from the Cox model fit,
with ± 2 standard errors.
13.3 Testing Interactions
13.4 Describing Predictor Effects
H
Figure 13.2: Shape of each predictor on log hazard of death. Y -axis shows X β̂, but the predictors not plotted are set to reference
values. Note the highly non-monotonic relationship with ap, and the increased slope after age 70 which has been found in outcome
models for various diseases.
13.5 Validating the Model
I
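cal is not created in this excerpt; presumably it comes from rms calibrate on the final Cox fit (fitted with x=TRUE, y=TRUE, surv=TRUE), with u = 60 months matching the 5-year estimates and B = 300 matching the figure legend:

set.seed(1)                        # reproducible resamples
cal ← calibrate(f, B=300, u=5*12)  # 5-year (60-month) calibration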
plot(cal)
[Figure: calibration of predicted vs. observed 60-month survival; B=300 resamples; black: observed relationship based on observed − predicted, gray: ideal, blue: optimism-corrected; mean |error| = 0.035, 0.9 quantile = 0.057]
Figure 13.3: Bootstrap estimate of calibration accuracy for 5-year estimates from the final Cox model, using adaptive linear spline
hazard regression. Line nearer the ideal line corresponds to apparent predictive accuracy. The blue curve corresponds to bootstrap-
corrected estimates.
13.6 Presenting the Model
Figure 13.4: Hazard ratios and multi-level confidence bars for effects of predictors in model, using default ranges except for ap
[Nomogram for predicting 3-year and 5-year survival from the simplified Cox model: points are assigned for rx, age in years, weight index (wt(kg) − ht(cm) + 200), pf.coded, heart disease code, mean arterial pressure/10, serum hemoglobin (g/100ml), and combined index of stage and histologic grade; total points map to the linear predictor and to 3- and 5-year survival probabilities]
Chapter 14
Semiparametric Ordinal Longitudinal Models
14.1
14.1.1
Handle
– terminal events (death)
14.1.2
14.1.3
0=alive 1=dead
– censored at 3w: 000
– death at 2w: 01
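A toy sketch of this record layout, assuming one row per subject-week with death an absorbing state (names id, week, y are illustrative):

d ← data.frame(id   = c(1,1,1, 2,2),
               week = c(1,2,3, 1,2),
               y    = c(0,0,0, 0,1))
# id 1: alive at weeks 1-3, then censored -> 0 0 0
# id 2: alive at week 1, death at week 2  -> 0 1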
14.2 Case Studies
14.3
Extensive re-analyses:
– hbiostat.org/proj/covid19/violet2.html
– hbiostat.org/proj/covid19/orchid.html
– hbiostat.org/R/Hmisc/markov/sim.html
4-level outcomes:
– patient at home
– in hospital or other facility
– on ventilator or diagnosed with ARDS
– dead
Model specification:
– For day t let Y (t) denote the ordinal outcome for a pa-
tient
$\Pr(Y(t) \ge y \mid X, Y(t-1)) = \operatorname{expit}(\alpha_y + X\beta + g(Y(t-1), t))$
14.3.1 Descriptives
require(data.table)
require(VGAM)
knitrSet('markov')
getHdata(simlongord500)
d ← simlongord500
setDT(d, key='id')
d[, y     := factor(y,     levels=rev(levels(y)))]
d[, yprev := factor(yprev, levels=rev(levels(yprev)))]
setnames(d, 'time', 'day')
# Show descriptive statistics for baseline data
latex(describe(d[day == 1, .(yprev, age, sofa, tx)], 'Baseline Variables'),
      file='')
Baseline Variables
4 Variables 500 Observations
yprev
n missing distinct
500 0 2
sofa
lowest : 0 1 2 3 4, highest: 13 14 15 17 18
Value 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Frequency 38 38 41 51 59 53 58 37 24 32 32 17 8 4
Proportion 0.076 0.076 0.082 0.102 0.118 0.106 0.116 0.074 0.048 0.064 0.064 0.034 0.016 0.008
Value 14 15 17 18
Frequency 4 2 1 1
Proportion 0.008 0.004 0.002 0.002
tx
n missing distinct Info Sum Mean Gmd
500 0 2 0.75 256 0.512 0.5007
propsTrans(y ∼ day + id, data=d, maxsize=4, arrow='->') +
  theme(axis.text.x=element_text(angle=90, hjust=1))
[Figure: observed proportions of day-to-day state transitions (previous state × current state: Home, In Hospital/Facility, Vent/ARDS, Dead) for day 1 → 2 through day 27 → 28]
w ← w[, .(day=(day + 1) : 28, y=y, tx=tx), by=id]
u ← rbind(d, w, fill=TRUE)
setkey(u, id)
u[, Tx := paste0('tx=', tx)]
[Figure: daily distribution of current states over days 1–28, separately for tx=0 and tx=1]
14.3.2 Model Fitting
Also fit a model that has linear splines with more knots
to add flexibility in how time and baseline covariates are
transformed
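The fits f and g compared below are not shown being created in this excerpt; a sketch reconstructed from their printed Call fields:

f ← vglm(ordered(y) ∼ yprev + lsp(day, 2) + age + sofa + tx,
         cumulative(reverse=TRUE, parallel=FALSE ∼ lsp(day, 2)), data=d)
g ← vglm(ordered(y) ∼ yprev + lsp(day, c(2,4,8,15)) + lsp(age, c(35,60,75)) +
           lsp(sofa, c(2,6,10)) + tx,
         cumulative(reverse=TRUE, parallel=FALSE ∼ lsp(day, c(2,4,8,15))),
         data=d)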
summary(f)
Call :
vglm ( formula = ordered ( y ) ∼ yprev + lsp ( day , 2) + age + sofa +
tx , family = cumulative ( reverse = TRUE , parallel = FALSE ∼
lsp ( day , 2)) , data = d )
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ):1 -5.685047 0.586907 -9.686 < 2e -16 ***
( Intercept ):2 -13.982588 0.558102 -25.054 < 2e -16 ***
( Intercept ):3 -21.626076 1.495433 -14.461 < 2e -16 ***
yprevIn Hospital / Facility 9.010046 0.288818 31.196 < 2e -16 ***
yprevVent / ARDS 15.201826 0.336292 45.204 < 2e -16 ***
lsp ( day , 2) day :1 -1.119279 0.254831 -4.392 1.12 e -05 ***
lsp ( day , 2) day :2 -0.268585 0.234568 -1.145 0.252
lsp ( day , 2) day :3 1.150635 0.750875 1.532 0.125
lsp ( day , 2) day ’:1 1.170655 0.256770 4.559 5.14 e -06 ***
lsp ( day , 2) day ’:2 0.294863 0.238611 1.236 0.217
lsp ( day , 2) day ’:3 -1.141568 0.756102 -1.510 0.131
age 0.010996 0.002814 3.907 9.34 e -05 ***
sofa 0.060413 0.012562 4.809 1.52 e -06 ***
tx -0.350262 0.084709 -4.135 3.55 e -05 ***
---
Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1
Exponentiated coefficients :
yprevIn Hospital / Facility yprevVent / ARDS lsp ( day , 2) day :1
8.184899 e +03 4.000083 e +06 3.265152 e -01
lsp ( day , 2) day :2 lsp ( day , 2) day :3 lsp ( day , 2) day ’:1
7.644601 e -01 3.160199 e +00 3.224103 e +00
lsp ( day , 2) day ’:2 lsp ( day , 2) day ’:3 age
1.342943 e +00 3.193179 e -01 1.011057 e +00
sofa tx
1.062275 e +00 7.045038 e -01
summary(g)
Call :
vglm(formula = ordered(y) ∼ yprev + lsp(day, c(2, 4, 8, 15)) +
    lsp(age, c(35, 60, 75)) + lsp(sofa, c(2, 6, 10)) + tx,
    family = cumulative(reverse = TRUE,
        parallel = FALSE ∼ lsp(day, c(2, 4, 8, 15))), data = d)
Coefficients :
Estimate Std . Error z value Pr ( >| z |)
( Intercept ):1 -6.999641 0.853756 -8.199 2.43 e -16 ***
( Intercept ):2 -14.999370 0.831770 -18.033 < 2e -16 ***
( Intercept ):3 -23.264014 1.634479 -14.233 < 2e -16 ***
yprevIn Hospital / Facility 9.050121 0.295104 30.668 < 2e -16 ***
yprevVent / ARDS 15.272417 0.342349 44.611 < 2e -16 ***
lsp ( day , c (2 , 4 , 8 , 15)) day :1 -1.032077 0.282930 -3.648 0.000264 ***
lsp ( day , c (2 , 4 , 8 , 15)) day :2 -0.484565 0.277909 -1.744 0.081227 .
lsp ( day , c (2 , 4 , 8 , 15)) day :3 1.549806 0.793898 1.952 0.050921 .
lsp ( day , c (2 , 4 , 8 , 15)) day ’:1 1.064104 0.339577 3.134 0.001727 **
lsp ( day , c (2 , 4 , 8 , 15)) day ’:2 0.714003 0.374521 1.906 0.056593 .
lsp ( day , c (2 , 4 , 8 , 15)) day ’:3 -1.713117 0.925886 -1.850 0.064278 .
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:1 -0.020867 0.141055 -0.148 0.882394
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:2 -0.281249 0.206644 -1.361 0.173504
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’:3 -0.032936 0.424279 -0.078 0.938123
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:1 0.055257 0.075085 0.736 0.461776
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:2 0.127460 0.116354 1.095 0.273320
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’:3 0.323703 0.273592 1.183 0.236746
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:1 -0.006121 0.052947 -0.116 0.907968
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:2 -0.092095 0.078118 -1.179 0.238428
lsp ( day , c (2 , 4 , 8 , 15)) day ’ ’ ’ ’:3 -0.106570 0.160922 -0.662 0.507815
lsp ( age , c (35 , 60 , 75)) age 0.042333 0.018820 2.249 0.024493 *
lsp ( age , c (35 , 60 , 75)) age ’ -0.038028 0.023118 -1.645 0.099985 .
lsp ( age , c (35 , 60 , 75)) age ’ ’ 0.010658 0.015805 0.674 0.500104
lsp ( age , c (35 , 60 , 75)) age ’ ’ ’ -0.017427 0.026048 -0.669 0.503472
lsp ( sofa , c (2 , 6 , 10)) sofa 0.199993 0.096401 2.075 0.038025 *
lsp ( sofa , c (2 , 6 , 10)) sofa ’ -0.132951 0.122966 -1.081 0.279605
lsp ( sofa , c (2 , 6 , 10)) sofa ’ ’ -0.045706 0.069159 -0.661 0.508686
lsp ( sofa , c (2 , 6 , 10)) sofa ’ ’ ’ 0.043445 0.081920 0.530 0.595880
tx -0.357512 0.085314 -4.191 2.78 e -05 ***
---
Signif . codes : 0 ’*** ’ 0.001 ’** ’ 0.01 ’* ’ 0.05 ’. ’ 0.1 ’ ’ 1
Exponentiated coefficients :
yprevIn Hospital / Facility yprevVent / ARDS
lrtest(g, f)
AIC(f); AIC(g)
[1] 4562.299
[1] 4576.99
We will use the simpler model, which has the better (smaller) AIC. Check the PO assumption on time by comparing the simpler model's AIC to the AIC from a fully PO model.
h ← vglm(ordered(y) ∼ yprev + lsp(day, 2) + age + sofa + tx,
         cumulative(reverse=TRUE, parallel=TRUE), data=d)
lrtest(f, h)
AIC(f); AIC(h)
[1] 4562.299
[1] 4570.447
beta SE Z
>= in hospital / facility -5.685 0.587 -9.686
>= vent / ARDS -13.983 0.558 -25.054
dead -21.626 1.495 -14.461
previous state in hospital / facility 9.010 0.289 31.196
previous state vent / ARDS 15.202 0.336 45.204
initial slope for day , >= hospital / facility -1.119 0.255 -4.392
initial slope for day , >= vent / ARDS -0.269 0.235 -1.145
initial slope for day , dead 1.151 0.751 1.532
slope increment , >= hospital / facility 1.171 0.257 4.559
slope increment , >= vent / ARDS 0.295 0.239 1.236
slope increment , dead -1.142 0.756 -1.510
baseline age linear effect 0.011 0.003 3.907
baseline SOFA score linear effect 0.060 0.013 4.809
treatment log OR -0.350 0.085 -4.135
OR Lower Upper
0.705 0.597 0.832
Relative log odds of transitioning from in hospital/facility to indicated status
14.3.4 Correlation Structure
Spearman correlation matrix from actual data
rc ← vcorr$r.simulated
plotCorrM(rc, xangle=90)[[1]] + theme(legend.position='none') +
  labs(caption='Spearman correlation matrix from 10,000 simulated patients')
[Figure: Spearman correlation matrix from 10,000 simulated patients, in the same layout as the matrix from the real data]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0.10 0.06 0.05 0.05 0.04 0.04 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.02
17 18 19 20 21 22 23 24 25 26 27
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
[Figure: Spearman correlation as a function of absolute time difference between measurement days]
Iterations: 2000 on each of 4 chains, with 4000 posterior distribution samples saved
n_eff Rhat
y >= In Hospital / Facility 2283 1.000
y >= Vent / ARDS 2162 1.002
y >= Dead 1885 1.002
yprev = In Hospital / Facility 2248 1.001
yprev = Vent / ARDS 2715 1.001
day 1027 1.003
day ’ 2434 1.000
age 3728 1.000
sofa 1677 0.999
tx 3281 1.002
day :y >= Vent / ARDS 328 1.010
day ’: y >= Vent / ARDS 3066 1.000
day :y >= Dead 2602 1.000
day ’: y >= Dead 2228 1.001
sigmag 2324 1.002
Frequencies of Responses
AIC(f1); AIC(f2)
[1] 4218.93
[1] 4222.358
14.3.5
From the fitted Markov state transition model, compute for one
covariate setting and two treatments:
[1] 0.01651803
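A sketch of the state occupancy computation via Hmisc soprobMarkovOrdm; the covariate settings here (age 60, SOFA 5, starting in hospital) are assumptions, not necessarily the settings used:

ylev ← levels(d$y)
x0 ← data.frame(yprev='In Hospital/Facility', age=60, sofa=5, tx=0)
x1 ← transform(x0, tx=1)   # same covariates, treated
s0 ← soprobMarkovOrdm(f, x0, times=1:28, ylevels=ylev, absorb='Dead', tvarname='day')
s1 ← soprobMarkovOrdm(f, x1, times=1:28, ylevels=ylev, absorb='Dead', tvarname='day')
sum(1 - s0[, 'Home']); sum(1 - s1[, 'Home'])   # expected days not at home, by treatment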
[Figure: estimated state occupancy probabilities for days 1–28, for Treatment 0 and Treatment 1]
tx =0 tx =1 Days Difference
10.742602 7.768658 -2.973943
tx =0 tx =1 Days Difference
27.585079 27.844274 0.259195
14.3.6
} else {
  betas ← diffmean ← numeric(B)
  ylev  ← levels(d$y)
  for(i in 1 : B) {
    j ← unlist(recno[sample(npt, npt, replace=TRUE)])
    g ← vglm(ordered(y) ∼ yprev + lsp(day, 2) + age + sofa + tx,
             cumulative(reverse=TRUE, parallel=FALSE ∼ lsp(day, 2)),
             coefstart=startbeta, data=d[j, ])
    betas[i] ← coef(g)['tx']
    s0 ← soprobMarkovOrdm(g, x[tx == 0, ], times=1:28, ylevels=ylev,
                          absorb='Dead', tvarname='day')
    s1 ← soprobMarkovOrdm(g, x[tx == 1, ], times=1:28, ylevels=ylev,
                          absorb='Dead', tvarname='day')
    # P(not at home) = 1 - P(home); sum these probs to get E[days]
    mtud ← sum(1 - s1[, 'Home']) - sum(1 - s0[, 'Home'])
    diffmean[i] ← mtud
  }
  saveRDS(list(betas=betas, diffmean=diffmean), 'boot.rds', compress='xz')
}
[Figure: joint bootstrap distribution of the treatment log OR (x-axis, −0.60 to −0.10) and the difference in mean days unwell (y-axis, −1.0 to −5.0); numbers/shading indicate frequency]
14.3.7 Notes on Inference
– → p-values are the same for the two metrics, and Bayesian
posterior probabilities are also identical
Annotated Bibliography

[1] Paul D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-136. Thousand Oaks CA: Sage, 2001 (cit. on pp. 3-1, 3-5).
[2] D. G. Altman. “Categorising Continuous Covariates (Letter to the Editor)”. In: Brit J Cancer 64 (1991), p. 975
(cit. on p. 2-13).
[3] D. G. Altman and P. K. Andersen. “Bootstrap Investigation of the Stability of a Cox Regression Model”. In: Stat
Med 8 (1989), pp. 771–783 (cit. on p. 4-16).
[4] D. G. Altman et al. “Dangers of Using ‘optimal’ Cutpoints in the Evaluation of Prognostic Factors”. In: J Nat
Cancer Inst 86 (1994), pp. 829–835 (cit. on pp. 2-13, 2-15).
[5] Douglas G. Altman. “Suboptimal Analysis Using ‘optimal’ Cutpoints”. In: Brit J Cancer 78 (1998), pp. 556–557
(cit. on p. 2-13).
[6] B. G. Armstrong and M. Sloan. “Ordinal Regression Models for Epidemiologic Data”. In: Am J Epi 129 (1989),
pp. 191–204 (cit. on p. 10-18).
See letter to editor by Peterson.
[7] A. C. Atkinson.“A Note on the Generalized Information Criterion for Choice of a Model”. In: Biometrika 67 (1980),
pp. 413–418 (cit. on pp. 2-29, 4-15).
[8] Peter C. Austin. “Bootstrap Model Selection Had Similar Performance for Selecting Authentic and Noise Variables
Compared to Backward Variable Elimination: A Simulation Study”. In: J Clin Epi 61 (2008). ”in general, a bootstrap
model selection method had comparable performance to conventional backward variable elimination for identifying
the true regression model. In most settings, both methods performed poorly at correctly identifying the correct
regression model.”, pp. 1009–1017 (cit. on p. 4-16).
[9] Peter C. Austin and Ewout W. Steyerberg.“The Integrated Calibration Index (ICI) and Related Metrics for Quanti-
fying the Calibration of Logistic Regression Models”. In: Statistics in Medicine 38.21 (2019), pp. 4051–4065. issn:
1097-0258. doi: 10.1002/sim.8281. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.8281
(visited on 08/10/2019) (cit. on p. 8-40).
[10] Peter C. Austin, Jack V. Tu, and Douglas S. Lee. “Logistic Regression Had Superior Performance Compared with
Regression Trees for Predicting In-Hospital Mortality in Patients Hospitalized with Heart Failure”. In: J Clin Epi 63
(2010). ROC areas for logistic models varied from 0.747 to 0.775 whereas they varied from 0.620-0.651 for recursive
partitioning;repeated data simulation showed large variation in tree structure, pp. 1145–1155 (cit. on p. 2-35).
[11] Sunni A. Barnes, Stacy R. Lindborg, and John W. Seaman.“Multiple Imputation Techniques in Small Sample Clinical
Trials”. In: Stat Med 25 (2006). bad performance of LOCF including high bias and poor confidence interval cov-
erage;simulation setup;longitudinal data;serial data;RCT;dropout;assumed missing at random (MAR);approximate
Bayesian bootstrap;Bayesian least squares;missing data;nice background summary;new completion score method
based on fitting a Poisson model for the number of completed clinic visits and using donors and approximate
Bayesian bootstrap, pp. 233–245 (cit. on p. 3-15).
[12] Federica Barzi and Mark Woodward.“Imputations of Missing Values in Practice: Results from Imputations of Serum
Cholesterol in 28 Cohort Studies”. In: Am J Epi 160 (2004). excellent review article for multiple imputation;list of
variables to include in imputation model;”Imputation models should ideally include all covariates that are related
to the missing data mechanism, have distributions that differ between the respondents and nonrespondents, are
associated with cholesterol, and will be included in the analyses of the final complete data sets”;detailed comparison
of results (cholesterol effect and confidence limits) for various imputation methods, pp. 34–45 (cit. on pp. 3-8,
3-15).
[13] Heiko Belcher. “The Concept of Residual Confounding in Regression Models and Some Applications”. In: Stat Med
11 (1992), pp. 1747–1758 (cit. on p. 2-13).
[14] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley, 1980 (cit. on p. 4-40).
[15] David A. Belsley. Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley, 1991
(cit. on p. 4-25).
[16] Jacqueline K. Benedetti et al.“Effective Sample Size for Tests of Censored Survival Data”. In: Biometrika 69 (1982),
pp. 343–349 (cit. on p. 4-20).
[17] Caroline Bennette and Andrew Vickers.“Against Quantiles: Categorization of Continuous Variables in Epidemiologic
Research, and Its Discontents”. In: BMC Med Res Methodol 12.1 (Feb. 2012). terrific graphical examples; nice
display of outcome heterogeneity within quantile groups of PSA, pp. 21+. issn: 1471-2288. doi: 10.1186/1471-
2288-12-21. pmid: 22375553. url: http://dx.doi.org/10.1186/1471-2288-12-21 (cit. on p. 2-13).
[18] Kiros Berhane, Michael Hauptmann, and Bryan Langholz. “Using Tensor Product Splines in Modeling Exposure–
Time–Response Relationships: Application to the Colorado Plateau Uranium Miners Cohort”. In: Stat Med 27
(2008). discusses taking product of all univariate spline basis functions, pp. 5484–5496 (cit. on p. 2-49).
[19] James Lopez Bernal, Steven Cummins, and Antonio Gasparrini. “Interrupted Time Series Regression for the Eval-
uation of Public Health Interventions: A Tutorial”. In: International Journal of Epidemiology 46.1 (Feb. 1, 2017),
pp. 348–355. issn: 0300-5771. doi: 10.1093/ije/dyw098. url: https://doi.org/10.1093/ije/dyw098 (visited
on 04/18/2021) (cit. on p. 2-54).
[20] D. M. Berridge and J. Whitehead.“Analysis of Failure Time Data with Ordinal Categories of Response”. In: Stat Med
10 (1991), pp. 1703–1710. doi: 10.1002/sim.4780101108. url: http://dx.doi.org/10.1002/sim.4780101108
(cit. on p. 10-18).
[21] Maria Blettner and Willi Sauerbrei.“Influence of Model-Building Strategies on the Results of a Case-Control Study”.
In: Stat Med 12 (1993), pp. 1325–1338 (cit. on p. 5-22).
[22] Irina Bondarenko and Trivellore Raghunathan. “Graphical and Numerical Diagnostic Tools to Assess Suitability of
Multiple Imputations and Imputation Models”. In: Stat Med 35.17 (July 2016), pp. 3007–3020. issn: 02776715.
doi: 10.1002/sim.6926. url: http://dx.doi.org/10.1002/sim.6926 (cit. on p. 3-17).
[23] James G. Booth and Somnath Sarkar.“Monte Carlo Approximation of Bootstrap Variances”. In: Am Statistician 52
(1998). number of resamples required to estimate variances, quantiles; 800 resamples may be required to guarantee
with 0.95 confidence that the relative error of a variance estimate is 0.1;Efron’s original suggestions for as low as
25 resamples were based on comparing stability of bootstrap estimates to sampling error, but small relative effects
can significantly change P-values;number of bootstrap resamples, pp. 354–357 (cit. on p. 5-10).
[24] Robert Bordley. “Statistical Decisionmaking without Math”. In: Chance 20.3 (2007), pp. 39–44 (cit. on p. 1-7).
[25] L. Breiman and J. H. Friedman.“Estimating Optimal Transformations for Multiple Regression and Correlation (with
Discussion)”. In: J Am Stat Assoc 80 (1985), pp. 580–619 (cit. on p. 4-33).
[26] Leo Breiman.“The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-fixed Predic-
tion Error”. In: J Am Stat Assoc 87 (1992), pp. 738–754 (cit. on pp. 4-15, 4-16, 5-16).
[27] Leo Breiman et al. Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1984
(cit. on p. 2-34).
[28] William M. Briggs and Russell Zaretzki.“The Skill Plot: A Graphical Technique for Evaluating Continuous Diagnostic
Tests (with Discussion)”. In: Biometrics 64 (2008). ”statistics such as the AUC are not especially relevant to someone
who must make a decision about a particular x c. ... ROC curves lack or obscure several quantities that are necessary
for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were first used to check how radio
<i>receivers</i> (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves
are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based
on some x c, and is not especially interested in how well he would have done had he used some different cutoff.”; in
the discussion David Hand states ”when integrating to yield the overall AUC measure, it is necessary to decide what
weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically
from the data. This is nonsensical. The relative importance of misclassifying a case as a noncase, compared to the
reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches
to the different kinds of misclassifications.”; see Lin, Kvam, Lu Stat in Med 28:798-813;2009, pp. 250–261 (cit. on
p. 1-7).
[29] David Brownstone. “Regression Strategies”. In: Proceedings of the 20th Symposium on the Interface between
Computer Science and Statistics. Washington, DC: American Statistical Association, 1988, pp. 74–79 (cit. on
p. 5-22).
[30] Petra Buettner, Claus Garbe, and Irene Guggenmoos-Holzmann.“Problems in Defining Cutoff Points of Continuous
Prognostic Factors: Example of Tumor Thickness in Primary Cutaneous Melanoma”. In: J Clin Epi 50 (1997).
choice of cut point depends on marginal distribution of predictor, pp. 1201–1210 (cit. on p. 2-13).
[31] Stef Buuren. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC, 2012. doi: 10.1201/
b11826. url: http://dx.doi.org/10.1201/b11826 (cit. on pp. 3-1, 3-14, 3-19).
[32] Bob Carpenter et al. “Stan: A Probabilistic Programming Language”. In: J Stat Software 76.1 (2017), pp. 1–32.
doi: 10.18637/jss.v076.i01. url: https://www.jstatsoft.org/v076/i01 (cit. on p. 8-52).
[33] James R. Carpenter and Melanie Smuk.“Missing Data: A Statistical Framework for Practice”. In: Biometrical Journal
63.5 (2021), pp. 915–947. issn: 1521-4036. doi: 10.1002/bimj.202000196. url: https://onlinelibrary.
wiley.com/doi/abs/10.1002/bimj.202000196 (visited on 10/21/2021) (cit. on p. 3-1).
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202000196.
[34] Centers for Disease Control and Prevention CDC. National Center for Health Statistics NCHS. “National Health
and Nutrition Examination Survey”. In: (2010). url: http://www.cdc.gov/nchs/nhanes/nhanes2009-2010/
nhanes09_10.htm (cit. on p. 11-4).
[35] John M. Chambers and Trevor J. Hastie, eds. Statistical Models in S. Pacific Grove, CA: Wadsworth and Brook-
s/Cole, 1992 (cit. on p. 2-50).
[36] C. Chatfield. “Avoiding Statistical Pitfalls (with Discussion)”. In: Stat Sci 6 (1991), pp. 240–268 (cit. on p. 4-41).
[37] C. Chatfield. “Model Uncertainty, Data Mining and Statistical Inference (with Discussion)”. In: J Roy Stat Soc A
158 (1995). bias by selecting model because it fits the data well; bias in standard errors;P. 420: ... need for a
better balance in the literature and in statistical teaching between techniques and problem solving strategies. P.
421: It is ‘well known’ to be ‘logically unsound and practically misleading’ (Zhang, 1992) to make inferences as if
a model is known to be true when it has, in fact, been selected from the same data to be used for estimation
purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a ‘quiet scandal’), they
(we) continue to ignore the difficulties because it is not clear what else could or should be done. P. 421: Estimation
errors for regression coefficients are usually smaller than errors from failing to take into account model specification.
P. 422: Statisticians must stop pretending that model uncertainty does not exist and begin to find ways of coping
with it. P. 426: It is indeed strange that we often admit model uncertainty by searching for a best model but then
ignore this uncertainty by making inferences and predictions as if certain that the best fitting model is actually
true. P. 427: The analyst needs to assess the model selection process and not just the best fitting model. P. 432:
The use of subset selection methods is well known to introduce alarming biases. P. 433: ... the AIC can be highly
biased in data-driven model selection situations. P. 434: Prediction intervals will generally be too narrow. In the
discussion, Jamal R. M. Ameen states that a model should be (a) satisfactory in performance relative to the stated
objective, (b) logically sound, (c) representative, (d) questionable and subject to on-line interrogation, (e) able to
accommodate external or expert information and (f) able to convey information., pp. 419–466 (cit. on pp. 4-10,
5-22).
[38] Samprit Chatterjee and Ali S. Hadi. Regression Analysis by Example. Fifth. New York: Wiley, 2012. isbn: 0-470-
90584-0 (cit. on p. 4-24).
[39] Marie Chavent et al. “ClustOfVar: An R Package for the Clustering of Variables”. In: J Stat Software 50.13 (Sept.
2012), pp. 1–16 (cit. on p. 4-29).
[40] A. Ciampi et al.“Stratification by Stepwise Regression, Correspondence Analysis and Recursive Partition”. In: Comp
Stat Data Analysis 1986 (1986), pp. 185–204 (cit. on p. 4-29).
[41] W. S. Cleveland. “Robust Locally Weighted Regression and Smoothing Scatterplots”. In: J Am Stat Assoc 74
(1979), pp. 829–836 (cit. on p. 2-31).
[42] D. Collett. Modelling Binary Data. Second. London: Chapman and Hall, 2002. isbn: 1-58488-324-3 (cit. on p. 10-6).
[43] Gary S. Collins, Emmanuel O. Ogundimu, and Douglas G. Altman. “Sample Size Considerations for the External
Validation of a Multivariable Prognostic Model: A Resampling Study”. In: Stat Med 35.2 (Jan. 2016), pp. 214–226.
issn: 02776715. doi: 10.1002/sim.6787. url: http://dx.doi.org/10.1002/sim.6787 (cit. on p. 5-14).
[44] Gary S. Collins et al. “Quantifying the Impact of Different Approaches for Handling Continuous Predictors on
the Performance of a Prognostic Model”. In: Stat Med 35.23 (Oct. 2016). used rms package hazard regression
method (hare) for survival model calibration, pp. 4124–4135. issn: 02776715. doi: 10 . 1002 / sim . 6986. url:
http://dx.doi.org/10.1002/sim.6986 (cit. on p. 2-13).
[45] E. Francis Cook and Lee Goldman. “Asymmetric Stratification: An Outline for an Efficient Method for Controlling
Confounding in Cohort Studies”. In: Am J Epi 127 (1988), pp. 626–639 (cit. on p. 2-35).
[46] Nancy R. Cook. “Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction”. In: Circ 115
(2007). example of large change in predicted risk in cardiovascular disease with tiny change in ROC area;possible
limits to c index when calibration is perfect;importance of calibration accuracy and changes in predicted risk when
new variables are added, pp. 928–935 (cit. on p. 8-39).
[47] J. B. Copas. “Cross-Validation Shrinkage of Regression Predictors”. In: J Roy Stat Soc B 49 (1987), pp. 175–183
(cit. on p. 5-21).
[48] J. B. Copas.“Regression, Prediction and Shrinkage (with Discussion)”. In: J Roy Stat Soc B 45 (1983), pp. 311–354
(cit. on pp. 4-22, 4-23).
[49] David R. Cox.“Regression Models and Life-Tables (with Discussion)”. In: J Roy Stat Soc B 34 (1972), pp. 187–220
(cit. on pp. 2-53, 13-1).
[50] Sybil L. Crawford, Sharon L. Tennstedt, and John B. McKinlay. “A Comparison of Analytic Methods for Non-
Random Missingness of Outcome Data”. In: J Clin Epi 48 (1995), pp. 209–219 (cit. on pp. 3-4, 4-47).
[51] N. J. Crichton and J. P. Hinde.“Correspondence Analysis as a Screening Method for Indicants for Clinical Diagnosis”.
In: Stat Med 8 (1989), pp. 1351–1362 (cit. on p. 4-29).
[52] Ralph B. D’Agostino et al.“Development of Health Risk Appraisal Functions in the Presence of Multiple Indicators:
The Framingham Study Nursing Home Institutionalization Model”. In: Stat Med 14 (1995), pp. 1757–1770 (cit. on
pp. 4-25, 4-28).
[53] C. E. Davis et al. “An Example of Dependencies among Variables in a Conditional Logistic Regression”. In: Modern
Statistical Methods in Chronic Disease Epidemiology. Ed. by S. H. Moolgavkar and R. L. Prentice. New York:
Wiley, 1986, pp. 140–147 (cit. on p. 4-25).
[54] Charles S. Davis. Statistical Methods for the Analysis of Repeated Measurements. New York: Springer, 2002 (cit. on
p. 7-17).
[55] S. Derksen and H. J. Keselman. “Backward, Forward and Stepwise Automated Subset Selection Algorithms: Fre-
quency of Obtaining Authentic and Noise Variables”. In: British J Math Stat Psych 45 (1992), pp. 265–282 (cit. on
p. 4-10).
[56] T. F. Devlin and B. J. Weeks. “Spline Functions for Logistic Regression Modeling”. In: Proceedings of the Eleventh
Annual SAS Users Group International Conference. Cary, NC: SAS Institute, Inc., 1986, pp. 646–651 (cit. on
pp. 2-22, 2-24).
[57] Peter J. Diggle et al. Analysis of Longitudinal Data. second. Oxford UK: Oxford University Press, 2002 (cit. on
p. 7-11).
[58] Donders et al. “Review: A Gentle Introduction to Imputation of Missing Values”. In: J Clin Epi 59 (2006). simple
demonstration of failure of the add new category method (indicator variable), pp. 1087–1091 (cit. on pp. 3-1, 3-5).
[59] William D. Dupont. Statistical Modeling for Biomedical Researchers. second. Cambridge, UK: Cambridge University
Press, 2008 (cit. on p. 15-15).
[60] S. Durrleman and R. Simon. “Flexible Regression Models with Cubic Splines”. In: Stat Med 8 (1989), pp. 551–561
(cit. on p. 2-28).
[61] B. Efron. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation”. In: J Am Stat Assoc
78 (1983). suggested need at least 200 models to get an average that is adequate, i.e., 20 repeats of 10-fold cv,
pp. 316–331 (cit. on pp. 5-17, 5-20, 5-21).
[62] Bradley Efron and Balasubramanian Narasimhan.“The Automatic Construction of Bootstrap Confidence Intervals”.
In: Journal of Computational and Graphical Statistics 0.0 (Jan. 14, 2020), pp. 1–12. issn: 1061-8600. doi: 10.1080/
10618600.2020.1714633. url: https://doi.org/10.1080/10618600.2020.1714633 (visited on 03/13/2020)
(cit. on p. 5-10).
eprint: https://doi.org/10.1080/10618600.2020.1714633.
[63] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993 (cit. on
p. 5-20).
[64] Bradley Efron and Robert Tibshirani.“Improvements on Cross-Validation: The .632+ Bootstrap Method”. In: J Am
Stat Assoc 92 (1997), pp. 548–560 (cit. on p. 5-20).
[65] Nicole S. Erler et al. “Dealing with Missing Covariates in Epidemiologic Studies: A Comparison between Multiple
Imputation and a Full Bayesian Approach”. In: Stat Med 35.17 (July 2016), pp. 2955–2974. issn: 02776715. doi:
10.1002/sim.6944. url: http://dx.doi.org/10.1002/sim.6944 (cit. on p. 3-7).
[66] Juanjuan Fan and Richard A. Levine.“To Amnio or Not to Amnio: That Is the Decision for Bayes”. In: Chance 20.3
(2007), pp. 26–32 (cit. on p. 1-7).
[67] David Faraggi and Richard Simon. “A Simulation Study of Cross-Validation for Selecting an Optimal Cutpoint in
Univariate Survival Analysis”. In: Stat Med 15 (1996). bias in point estimate of effect from selecting cutpoints
based on P-value; loss of information from dichotomizing continuous predictors, pp. 2203–2213 (cit. on p. 2-13).
[68] J. J. Faraway. “The Cost of Data Analysis”. In: J Comp Graph Stat 1 (1992), pp. 213–229 (cit. on pp. 4-49, 5-20,
5-22).
[69] Valerii Fedorov, Frank Mannino, and Rongmei Zhang.“Consequences of Dichotomization”. In: Pharm Stat 8 (2009).
optimal cutpoint depends on unknown parameters;should only entertain dichotomization when ”estimating a value
of the cumulative distribution and when the assumed model is very different from the true model”;nice graphics,
pp. 50–61. doi: 10.1002/pst.331. url: http://dx.doi.org/10.1002/pst.331 (cit. on pp. 1-6, 2-13).
[70] Steven E. Fienberg. The Analysis of Cross-Classified Categorical Data. Second. New York: Springer, 2007. isbn:
0-387-72824-4 (cit. on p. 10-18).
[71] D. Freedman, W. Navidi, and S. Peters. “On the Impact of Variable Selection in Fitting Regression Equations”.
In: Lecture Notes in Economics and Mathematical Systems. New York: Springer-Verlag, 1988, pp. 1–16 (cit. on
p. 5-21).
[72] J. H. Friedman. A Variable Span Smoother. Technical Report 5. Laboratory for Computational Statistics, Depart-
ment of Statistics, Stanford University, 1984 (cit. on p. 4-33).
[73] Mitchell H. Gail and Ruth M. Pfeiffer. “On Criteria for Evaluating Models of Absolute Risk”. In: Biostatistics 6.2
(2005), pp. 227–239 (cit. on p. 1-7).
[74] Joseph C. Gardiner, Zhehui Luo, and Lee A. Roman. “Fixed Effects, Random Effects and GEE: What Are the
Differences?” In: Stat Med 28 (2009). nice comparison of models; econometrics; different use of the term ”fixed
effects model”, pp. 221–239 (cit. on p. 7-10).
[75] A. Giannoni et al. “Do Optimal Prognostic Thresholds in Continuous Physiological Variables Really Exist? Analysis
of Origin of Apparent Thresholds, with Systematic Review for Peak Oxygen Consumption, Ejection Fraction and
BNP”. In: PLoS ONE 9.1 (2014). doi: 10.1371/journal.pone.0081699. url: http://dx.doi.org/10.1371/
journal.pone.0081699 (cit. on pp. 2-13, 2-16).
[76] John H. Giudice, John R. Fieberg, and Mark S. Lenarz. “Spending Degrees of Freedom in a Poor Economy: A Case
Study of Building a Sightability Model for Moose in Northeastern Minnesota”. In: J Wildlife Manage (2011). doi:
10.1002/jwmg.213. url: http://dx.doi.org/10.1002/jwmg.213 (cit. on p. 4-1).
[77] Tilmann Gneiting and Adrian E. Raftery. “Strictly Proper Scoring Rules, Prediction, and Estimation”. In: J Am
Stat Assoc 102 (2007). wonderful review article except missing references from Scandinavian and German medical
decision making literature, pp. 359–378 (cit. on p. 1-7).
[78] Harvey Goldstein.“Restricted Unbiased Iterative Generalized Least-Squares Estimation”. In: Biometrika 76.3 (1989).
derivation of REML, pp. 622–623 (cit. on pp. 7-7, 7-11).
[79] Usha S. Govindarajulu et al. “Comparing Smoothing Techniques in Cox Models for Exposure-Response Relation-
ships”. In: Stat Med 26 (2007). authors wrote a SAS macro for restricted cubic splines even though such a macro
as existed since 1984; would have gotten more useful results had simulation been used so would know the true
regression shape;measure of agreement of two estimated curves by computing the area between them, standardized
by average of areas under the two;penalized spline and rcs were closer to each other than to fractional polynomials,
pp. 3735–3752 (cit. on p. 2-29).
[80] P. M. Grambsch and P. C. O’Brien. “The Effects of Transformations and Preliminary Tests for Non-Linearity in
Regression”. In: Stat Med 10 (1991), pp. 697–709 (cit. on pp. 2-41, 4-10).
[81] Robert J. Gray. “Flexible Methods for Analyzing Survival Data Using Splines, with Applications to Breast Cancer
Prognosis”. In: J Am Stat Assoc 87 (1992), pp. 942–951 (cit. on pp. 2-49, 4-22).
[82] Robert J. Gray. “Spline-Based Tests in Survival Analysis”. In: Biometrics 50 (1994), pp. 640–652 (cit. on p. 2-49).
[83] Michael J. Greenacre. “Correspondence Analysis of Multivariate Categorical Data by Weighted Least-Squares”. In:
Biometrika 75 (1988), pp. 457–467 (cit. on p. 4-29).
[84] Sander Greenland. “When Should Epidemiologic Regressions Use Random Coefficients?” In: Biometrics 56 (2000).
use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important con-
founders uncontrolled;composition matrix;example with far too many significant predictors with many regression co-
efficients absurdly inflated when overfit;lack of evidence for dietary effects mediated through constituents;shrinkage
instead of variable selection;larger effect on confidence interval width than on point estimates with variable se-
lection;uncertainty about variance of random effects is just uncertainty about prior opinion;estimation of vari-
ance is pointless;instead the analysis should be repeated using different values;”if one feels compelled to estimate
$\tauˆ {2}$, I would recommend giving it a proper prior concentrated amount contextually reasonable values”;claim
about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model
entertained;shrinkage towards compositional model;”models need to be complex to capture uncertainty about the
relations...an honest uncertainty assessment requires parameters for all effects that we know may be present. This
advice is implicit in an antiparsimony principle often attributed to L. J. Savage ’All models should be as big as an
elephant (see Draper, 1995)’”. See also gus06per., pp. 915–921. doi: 10.1111/j.0006-341X.2000.00915.x. url:
http://dx.doi.org/10.1111/j.0006-341X.2000.00915.x (cit. on pp. 4-10, 4-44).
[85] Jian Guo et al. “Principal Component Analysis with Sparse Fused Loadings”. In: J Comp Graph Stat 19.4 (2011).
incorporates blocking structure in the variables;selects different variables for different components;encourages load-
ings of highly correlated variables to have same magnitude, which aids in interpretation, pp. 930–946 (cit. on
p. 4-29).
[86] D. Hand and M. Crowder. Practical Longitudinal Data Analysis. London: Chapman & Hall, 1996.
[87] Ofer Harel and Xiao-Hua Zhou. “Multiple Imputation: Review of Theory, Implementation and Software”. In: Stat
Med 26 (2007). failed to review aregImpute;excellent overview;ugly S code;nice description of different statistical
tests including combining likelihood ratio tests (which appears to be complex, requiring an out-of-sample log
likelihood computation);congeniality of imputation and analysis models;Bayesian approximation or approximate
Bayesian bootstrap overview;”Although missing at random (MAR) is a non-testable assumption, it has been pointed
out in the literature that we can get very close to MAR if we include enough variables in the imputation models
... it would be preferred if the missing data modelling was done by the data constructors and not by the users...
MI yields valid inferences not only in congenial settings, but also in certain uncongenial ones as well—where the
imputer’s model (1) is more general (i.e. makes fewer assumptions) than the complete-data estimation method, or
when the imputer’s model makes additional assumptions that are well-founded.”, pp. 3057–3077 (cit. on pp. 3-1,
3-8, 3-12, 3-15).
[88] F. E. Harrell. “The LOGIST Procedure”. In: SUGI Supplemental Library Users Guide. Version 5. Cary, NC: SAS
Institute, Inc., 1986, pp. 269–293 (cit. on p. 4-14).
[89] F. E. Harrell, K. L. Lee, and B. G. Pollock. “Regression Models in Clinical Studies: Determining Relationships
between Predictors and Response”. In: J Nat Cancer Inst 80 (1988), pp. 1198–1202 (cit. on p. 2-32).
[90] F. E. Harrell et al. “Regression Modeling Strategies for Improved Prognostic Prediction”. In: Stat Med 3 (1984),
pp. 143–152 (cit. on p. 4-19).
[91] F. E. Harrell et al.“Regression Models for Prognostic Prediction: Advantages, Problems, and Suggested Solutions”.
In: Ca Trt Rep 69 (1985), pp. 1071–1077 (cit. on p. 4-19).
[92] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark. “Multivariable Prognostic Models: Issues in Developing Models,
Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors”. In: Stat Med 15 (1996), pp. 361–387
(cit. on p. 4-1).
[93] Frank E. Harrell et al. “Development of a Clinical Prediction Model for an Ordinal Outcome: The World Health
Organization ARI Multicentre Study of Clinical Signs and Etiologic Agents of Pneumonia, Sepsis, and Meningitis in
Young Infants”. In: Stat Med 17 (1998), pp. 909–944. url: http://onlinelibrary.wiley.com/doi/10.1002/
(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract (cit. on pp. 4-22, 4-47).
[94] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Second. New
York: Springer, 2008 (cit. on p. 2-37).
ISBN-10: 0387848576; ISBN-13: 978-0387848570.
[95] Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. Boca Raton, FL: Chapman & Hall/CRC,
1990 (cit. on p. 2-38).
ISBN 9780412343902.
[96] Yulei He and Alan M. Zaslavsky. “Diagnosing Imputation Models by Applying Target Analyses to Posterior Replicates
of Completed Data”. In: Stat Med 31.1 (2012), pp. 1–18. doi: 10.1002/sim.4413. url: http://dx.doi.org/
10.1002/sim.4413 (cit. on p. 3-17).
[97] S. G. Hilsenbeck and G. M. Clark. “Practical P-Value Adjustment for Optimally Selected Cutpoints”. In: Stat Med
15 (1996), pp. 103–112 (cit. on p. 2-13).
[98] W. Hoeffding. “A Non-Parametric Test of Independence”. In: Ann Math Stat 19 (1948), pp. 546–557 (cit. on
p. 4-29).
[99] Norbert Holländer, Willi Sauerbrei, and Martin Schumacher. “Confidence Intervals for the Effect of a Prognostic
Factor after Selection of an ‘optimal’ Cutpoint”. In: Stat Med 23 (2004). true type I error can be much greater
than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value method;CART;recursive
partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;“It should
be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every
study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely
difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the
S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the
literature; some of them were solely used because they emerged as the ‘optimal’ cutpoint in a specific data set. In
a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast
cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor
the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update
of the American Society of Clinical Oncology.”; dichotomization; categorizing continuous variables; refs alt94dan,
sch94out, alt98sub, pp. 1701–1713. doi: 10.1002/sim.1611. url: http://dx.doi.org/10.1002/sim.1611
(cit. on pp. 2-13, 2-15).
[100] Nicholas J. Horton and Ken P. Kleinman. “Much Ado about Nothing: A Comparison of Missing Data Methods and
Software to Fit Incomplete Data Regression Models”. In: Am Statistician 61.1 (2007), pp. 79–90 (cit. on p. 3-15).
[101] C. M. Hurvich and C. L. Tsai. “The Impact of Model Selection on Inference in Linear Regression”. In: Am Statistician
44 (1990), pp. 214–217 (cit. on p. 4-16).
[102] Lisa I. Iezzoni. “Dimensions of Risk”. In: Risk Adjustment for Measuring Health Outcomes. Ed. by Lisa I. Iezzoni.
dimensions of risk factors to include in models. Ann Arbor, MI: Foundation of the American College of Healthcare
Executives, 1994. Chap. 2, pp. 29–118 (cit. on p. 1-12).
[103] K. J. Janssen et al. “Missing Covariate Data in Medical Research: To Impute Is Better than to Ignore”. In: J Clin
Epi 63 (2010), pp. 721–727 (cit. on p. 3-20).
[104] Michael P. Jones. “Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Re-
gression”. In: J Am Stat Assoc 91 (1996), pp. 222–230 (cit. on p. 3-5).
[105] J. D. Kalbfleisch and R. L. Prentice. “Marginal Likelihood Based on Cox’s Regression and Life Model”. In: Biometrika
60 (1973), pp. 267–278 (cit. on p. 11-25).
[106] Juha Karvanen and Frank E. Harrell. “Visualizing Covariates in Proportional Hazards Model”. In: Stat Med 28
(2009), pp. 1957–1966 (cit. on p. 5-2).
[107] Michael G. Kenward, Ian R. White, and James R. Carpenter. “Should Baseline Be a Covariate or Dependent Variable
in Analyses of Change from Baseline in Clinical Trials? (Letter to the Editor)”. In: Stat Med 29 (2010). sharp rebuke
of liu09sho, pp. 1455–1456 (cit. on p. 7-5).
[108] Soeun Kim, Catherine A. Sugar, and Thomas R. Belin. “Evaluating Model-Based Imputation Methods for Missing
Covariates in Regression Models with Interactions”. In: Stat Med 34.11 (May 2015), pp. 1876–1888. issn: 02776715.
doi: 10.1002/sim.6435. url: http://dx.doi.org/10.1002/sim.6435 (cit. on p. 3-10).
[109] W. A. Knaus et al. “The SUPPORT Prognostic Model: Objective Estimates of Survival for Seriously Ill Hospitalized
Adults”. In: Ann Int Med 122 (1995), pp. 191–203. doi: 10.7326/0003-4819-122-3-199502010-00007. url:
http://dx.doi.org/10.7326/0003-4819-122-3-199502010-00007 (cit. on pp. 4-34, 12-1).
[110] Mirjam J. Knol et al. “Unpredictable Bias When Using the Missing Indicator Method or Complete Case Analysis
for Missing Confounder Values: An Empirical Example”. In: J Clin Epi 63 (2010), pp. 728–736 (cit. on p. 3-5).
[111] R. Koenker and G. Bassett. “Regression Quantiles”. In: Econometrica 46 (1978), pp. 33–50 (cit. on p. 11-11).
[112] Roger Koenker. Quantile Regression. New York: Cambridge University Press, 2005 (cit. on p. 11-11).
ISBN-10: 0-521-60827-9; ISBN-13: 978-0-521-60827-5.
[113] Roger Koenker. Quantreg: Quantile Regression. 2009. url: http://CRAN.R-project.org/package=quantreg
(cit. on p. 11-11).
R package version 4.38.
[114] Charles Kooperberg, Charles J. Stone, and Young K. Truong. “Hazard Regression”. In: J Am Stat Assoc 90 (1995),
pp. 78–94 (cit. on p. 12-18).
[115] Warren F. Kuhfeld. “The PRINQUAL Procedure”. In: SAS/STAT 9.2 User’s Guide. Second. Cary, NC: SAS Pub-
lishing, 2009. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on p. 4-30).
[116] J. M. Landwehr, D. Pregibon, and A. C. Shoemaker. “Graphical Methods for Assessing Logistic Regression Models
(with Discussion)”. In: J Am Stat Assoc 79 (1984), pp. 61–83 (cit. on p. 10-6).
[117] B. Lausen and M. Schumacher. “Evaluating the Effect of Optimized Cutoff Values in the Assessment of Prognostic
Factors”. In: Comp Stat Data Analysis 21.3 (1996), pp. 307–326. doi: 10.1016/0167-9473(95)00016-X. url:
http://dx.doi.org/10.1016/0167-9473(95)00016-X (cit. on p. 2-13).
[118] J. F. Lawless and K. Singhal. “Efficient Screening of Nonnormal Regression Models”. In: Biometrics 34 (1978),
pp. 318–327 (cit. on p. 4-15).
[119] S. le Cessie and J. C. van Houwelingen. “Ridge Estimators in Logistic Regression”. In: Appl Stat 41 (1992), pp. 191–
201 (cit. on p. 4-22).
[120] A. Leclerc et al. “Correspondence Analysis and Logistic Modelling: Complementary Use in the Analysis of a Health
Survey among Nurses”. In: Stat Med 7 (1988), pp. 983–995 (cit. on p. 4-29).
[121] Katherine J. Lee and John B. Carlin. “Recovery of Information from Multiple Imputation: A Simulation Study”. In:
Emerg Themes Epi 9.1 (June 2012). Not sure that the authors satisfactorily dealt with nonlinear predictor effects;in the absence of strong auxiliary information, there is little to gain from multiple imputation with missing data in the
exposure-of-interest. In fact, the authors went further to say that multiple imputation can introduce bias not present
in a complete case analysis if a poorly fitting imputation model is used [from Yong Hao Pua], pp. 3+. issn: 1742-
7622. doi: 10.1186/1742-7622-9-3. pmid: 22695083. url: http://dx.doi.org/10.1186/1742-7622-9-3
(cit. on p. 3-4).
[122] Seokho Lee, Jianhua Z. Huang, and Jianhua Hu. “Sparse Logistic Principal Components Analysis for Binary Data”.
In: Ann Appl Stat 4.3 (2010), pp. 1579–1601 (cit. on p. 2-37).
[123] Chenlei Leng and Hansheng Wang. “On General Adaptive Sparse Principal Component Analysis”. In: J Comp Graph
Stat 18.1 (2009), pp. 201–215 (cit. on p. 2-37).
[124] Chun Li and Bryan E. Shepherd. “A New Residual for Ordinal Outcomes”. In: Biometrika 99.2 (2012), pp. 473–480.
doi: 10.1093/biomet/asr073. eprint: http://biomet.oxfordjournals.org/content/99/2/473.full.pdf+
html. url: http://biomet.oxfordjournals.org/content/99/2/473.abstract (cit. on p. 10-6).
[125] Kung-Yee Liang and Scott L. Zeger. “Longitudinal Data Analysis of Continuous and Discrete Responses for Pre-Post
Designs”. In: Sankhyā 62 (2000). makes an error in assuming the baseline variable will have the same univariate
distribution as the response except for a shift;baseline may have for example a truncated distribution based on a
trial’s inclusion criteria;if correlation between baseline and response is zero, ANCOVA will be twice as efficient as
simple analysis of change scores;if correlation is one they may be equally efficient, pp. 134–148 (cit. on p. 7-5).
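The efficiency claims in this annotation follow from the standard variance comparison (a sketch, with $\rho$ the baseline–response correlation and a common variance assumed): the change-score estimator has variance proportional to $2(1-\rho)$ while the ANCOVA estimator's is proportional to $1-\rho^{2}$, so the relative efficiency is

$$\frac{2(1-\rho)}{(1-\rho)(1+\rho)} = \frac{2}{1+\rho},$$

which equals 2 at $\rho = 0$ and 1 at $\rho = 1$, matching the note above.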
[126] James K. Lindsey. Models for Repeated Measurements. Clarendon Press, 1997.
[127] Stuart Lipsitz, Michael Parzen, and Lue P. Zhao. “A Degrees-Of-Freedom Approximation in Multiple Imputation”.
In: J Stat Comp Sim 72.4 (Jan. 2002), pp. 309–318. doi: 10.1080/00949650212848. url: http://dx.doi.org/
10.1080/00949650212848 (cit. on p. 3-13).
[128] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Second. New York: Wiley, 2002
(cit. on pp. 3-4, 3-9, 3-17).
[129] Guanghan F. Liu et al. “Should Baseline Be a Covariate or Dependent Variable in Analyses of Change from Baseline
in Clinical Trials?” In: Stat Med 28 (2009). seems to miss several important points, such as the fact that the
baseline variable is often part of the inclusion/exclusion criteria and so has a truncated distribution that is different
from that of the follow-up measurements;sharp rebuke in ken10sho, pp. 2509–2530 (cit. on p. 7-5).
[130] Richard Lockhart et al. A Significance Test for the Lasso. arXiv, 2013. arXiv: 1301.7161. url: http://arxiv.
org/abs/1301.7161 (cit. on p. 4-10).
[131] Xiaohui Luo, Leonard A. Stefanski, and Dennis D. Boos. “Tuning Variable Selection Procedures by Adding Noise”. In: Technometrics 48 (2006). adding a known amount of noise to the response and studying $\hat{\sigma}^{2}$ to tune the stopping
rule to avoid overfitting or underfitting;simulation setup, pp. 165–175 (cit. on p. 1-14).
[132] Paul Madley-Dowd et al. “The Proportion of Missing Data Should Not Be Used to Guide Decisions on Multiple
Imputation”. In: Journal of Clinical Epidemiology 110 (June 1, 2019), pp. 63–73. issn: 0895-4356. doi: 10.1016/j.
jclinepi.2019.02.016. url: http://www.sciencedirect.com/science/article/pii/S0895435618308710
(visited on 09/14/2019) (cit. on p. 3-20).
[133] Nathan Mantel. “Why Stepdown Procedures in Variable Selection”. In: Technometrics 12 (1970), pp. 621–625
(cit. on p. 4-15).
[134] Maurizio Manuguerra and Gillian Z. Heller. “Ordinal Regression Models for Continuous Scales”. In: Int J Biostat 6.1
(Jan. 2010). mislabeled a flexible parametric model as semi-parametric; does not cover semi-parametric approach
with lots of intercepts. issn: 1557-4679. doi: 10.2202/1557-4679.1230. url: http://dx.doi.org/10.2202/
1557-4679.1230 (cit. on p. 11-13).
[135] S. E. Maxwell and H. D. Delaney. “Bivariate Median Splits and Spurious Statistical Significance”. In: Psych Bull
113 (1993), pp. 181–190. doi: 10.1037//0033-2909.113.1.181. url: http://dx.doi.org/10.1037//0033-
2909.113.1.181 (cit. on p. 2-13).
[136] George P. McCabe. “Principal Variables”. In: Technometrics 26 (1984), pp. 137–144 (cit. on p. 4-28).
[137] George Michailidis and Jan de Leeuw. “The Gifi System of Descriptive Multivariate Analysis”. In: Stat Sci 13 (1998),
pp. 307–336 (cit. on p. 4-29).
[138] Karel G. M. Moons et al. “Using the Outcome for Imputation of Missing Predictor Values Was Preferred”. In: J
Clin Epi 59 (2006). use of outcome variable; excellent graphical summaries of simulations, pp. 1092–1101. doi:
10.1016/j.jclinepi.2006.01.009. url: http://dx.doi.org/10.1016/j.jclinepi.2006.01.009 (cit. on
p. 3-13).
[139] Barry K. Moser and Laura P. Coombs. “Odds Ratios for a Continuous Outcome Variable without Dichotomizing”.
In: Stat Med 23 (2004). large loss of efficiency and power;embeds in a logistic distribution, similar to proportional
odds model;categorization;dichotomization of a continuous response in order to obtain odds ratios often results in
an inflation of the needed sample size by a factor greater than 1.5, pp. 1843–1860 (cit. on p. 2-13).
[140] Raymond H. Myers. Classical and Modern Regression with Applications. Boston: PWS-Kent, 1990 (cit. on p. 4-24).
[141] N. J. D. Nagelkerke. “A Note on a General Definition of the Coefficient of Determination”. In: Biometrika 78 (1991),
pp. 691–692 (cit. on p. 4-42).
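For reference, the coefficient proposed in this note can be written as follows (a sketch of the standard definition, with $L_0$ the null-model likelihood, $L_1$ the fitted-model likelihood, and $n$ the sample size):

$$R^{2}_{N} = \frac{1 - (L_0/L_1)^{2/n}}{1 - L_0^{2/n}}$$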
[142] Todd G. Nick and J. Michael Hardin. “Regression Modeling Strategies: An Illustrative Case Study from Medical
Rehabilitation Outcomes Research”. In: Am J Occ Ther 53 (1999), pp. 459–470 (cit. on p. 4-1).
[143] David J. Nott and Chenlei Leng. “Bayesian Projection Approaches to Variable Selection in Generalized Linear
Models”. In: Computational Statistics & Data Analysis 54.12 (Dec. 2010), pp. 3227–3241. issn: 01679473. doi:
10.1016/j.csda.2010.01.036. url: http://dx.doi.org/10.1016/j.csda.2010.01.036 (cit. on p. 2-37).
[144] Debashis Paul et al. “‘Preconditioning’ for Feature Selection and Regression in High-Dimensional Problems”. In: Ann Stat 36.4 (2008). develop a consistent $\hat{Y}$ using a latent variable structure, using for example supervised principal components. Then run stepwise regression or lasso predicting $\hat{Y}$ (lasso worked better). Can run into problems when a predictor has importance in an adjusted sense but has no marginal correlation with Y;model approximation;model simplification, pp. 1595–1619. doi: 10.1214/009053607000000578. url: http://dx.doi.org/10.1214/009053607000000578 (cit. on p. 2-37).
[145] Peter Peduzzi et al. “A Simulation Study of the Number of Events per Variable in Logistic Regression Analysis”. In:
J Clin Epi 49 (1996), pp. 1373–1379 (cit. on pp. 4-19, 4-20).
[146] Peter Peduzzi et al. “Importance of Events per Independent Variable in Proportional Hazards Regression Analysis.
II. Accuracy and Precision of Regression Estimates”. In: J Clin Epi 48 (1995), pp. 1503–1510 (cit. on p. 4-19).
[147] N. Peek et al. “External Validation of Prognostic Models for Critically Ill Patients Required Substantial Sample
Sizes”. In: J Clin Epi 60 (2007). large sample sizes needed to obtain reliable external validations;inadequate power of
DeLong, DeLong, and Clarke-Pearson test for differences in correlated ROC areas (p. 498);problem with tests of
calibration accuracy having too much power for large sample sizes, pp. 491–501 (cit. on p. 4-43).
[148] Michael J. Pencina, Ralph B. D’Agostino, and Olga V. Demler. “Novel Metrics for Evaluating Improvement in
Discrimination: Net Reclassification and Integrated Discrimination Improvement for Normal Variables and Nested
Models”. In: Stat Med 31.2 (2012), pp. 101–113. doi: 10.1002/sim.4348. url: http://dx.doi.org/10.1002/
sim.4348 (cit. on pp. 4-43, 8-39).
[149] Michael J. Pencina, Ralph B. D’Agostino, and Ewout W. Steyerberg. “Extensions of Net Reclassification Improve-
ment Calculations to Measure Usefulness of New Biomarkers”. In: Stat Med 30 (2011). lack of need for NRI to
be category-based;arbitrariness of categories;“category-less or continuous NRI is the most objective and versatile measure of improvement in risk prediction”;authors misunderstood the inadequacy of three categories if categories
are used;comparison of NRI to change in C index;example of continuous plot of risk for old model vs. risk for new
model, pp. 11–21. doi: 10.1002/sim.4085. url: http://dx.doi.org/10.1002/sim.4085 (cit. on p. 4-43).
[150] Michael J. Pencina et al. “Evaluating the Added Predictive Ability of a New Marker: From Area under the ROC
Curve to Reclassification and Beyond”. In: Stat Med 27 (2008). small differences in ROC area can still be very mean-
ingful;example of insignificant test for difference in ROC areas with very significant results from new method;Yates’
discrimination slope;reclassification table;limiting version of this based on whether and amount by which probabil-
ities rise for events and lower for non-events when comparing the new model to the old;comparing two models;see letter to
the editor by Van Calster and Van Huffel, Stat in Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med
31:93-97, 2012, pp. 157–172 (cit. on pp. 4-43, 8-39).
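For orientation, the category-based net reclassification improvement discussed in these entries is conventionally written as follows (a standard formulation, not quoted from the papers), where “up” and “down” denote movement of the predicted risk category when changing from the old model to the new one:

$$\mathrm{NRI} = \left[P(\mathrm{up} \mid \mathrm{event}) - P(\mathrm{down} \mid \mathrm{event})\right] + \left[P(\mathrm{down} \mid \mathrm{nonevent}) - P(\mathrm{up} \mid \mathrm{nonevent})\right]$$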
[151] Bas B. L. Penning de Vries, Maarten van Smeden, and Rolf H. H. Groenwold. “Propensity Score Estimation Using
Classification and Regression Trees in the Presence of Missing Covariate Data”. In: Epidemiologic Methods 7.1
(2018). doi: 10.1515/em-2017-0020. url: https://www.degruyter.com/view/j/em.2018.7.issue-1/em-
2017-0020/em-2017-0020.xml (visited on 09/02/2019) (cit. on p. 3-10).
[152] Sanne A. Peters et al. “Multiple Imputation of Missing Repeated Outcome Measurements Did Not Add to Linear
Mixed-Effects Models.” In: J Clin Epi 65.6 (2012), pp. 686–695. doi: 10.1016/j.jclinepi.2011.11.012. url:
http://dx.doi.org/10.1016/j.jclinepi.2011.11.012 (cit. on p. 7-10).
[153] Bercedis Peterson and Frank E. Harrell. “Partial Proportional Odds Models for Ordinal Response Variables”. In:
Appl Stat 39 (1990), pp. 205–217 (cit. on p. 10-9).
[154] José C. Pinheiro and Douglas M. Bates. Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000 (cit. on
pp. 7-11, 7-13).
[155] Richard F. Potthoff and S. N. Roy. “A Generalized Multivariate Analysis of Variance Model Useful Especially for
Growth Curve Problems”. In: Biometrika 51 (1964). included an AR1 example, pp. 313–326 (cit. on p. 7-7).
[156] David B. Pryor et al. “Estimating the Likelihood of Significant Coronary Artery Disease”. In: Am J Med 75 (1983),
pp. 771–780 (cit. on p. 8-48).
[157] Peter Radchenko and Gareth M. James. “Variable Inclusion and Shrinkage Algorithms”. In: J Am Stat Assoc 103.483
(2008). solves problem caused by lasso using the same penalty parameter for variable selection and shrinkage which
causes lasso to have to keep too many variables in the model to avoid overshrinking the remaining predictors;does
not handle scaling issue well, pp. 1304–1315 (cit. on p. 2-36).
[158] D. R. Ragland. “Dichotomizing Continuous Outcome Variables: Dependence of the Magnitude of Association and
Statistical Power on the Cutpoint”. In: Epi 3 (1992), pp. 434–440. doi: 10.1097/00001648-199209000-00009.
url: http://dx.doi.org/10.1097/00001648-199209000-00009 (cit. on p. 2-13).
[172] Jun Shao. “Linear Model Selection by Cross-Validation”. In: J Am Stat Assoc 88 (1993), pp. 486–494 (cit. on
p. 5-17).
[173] Noah Simon et al. “A Sparse-Group Lasso”. In: J Comp Graph Stat 22.2 (2013). sparse effects at both the group and within-group levels;can also be considered a special case of the group lasso allowing overlap between groups, pp. 231–245. doi: 10.1080/10618600.2012.681250. eprint: http://www.tandfonline.com/doi/pdf/10.1080/10618600.2012.681250. url: http://www.tandfonline.com/doi/abs/10.1080/10618600.2012.681250 (cit. on p. 2-37).
[174] Sean L. Simpson et al. “A Linear Exponent AR(1) Family of Correlation Structures”. In: Stat Med 29 (2010),
pp. 1825–1838 (cit. on p. 7-14).
[175] L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. “Problems and Potentials in Modeling Survival”. In: Medical
Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056. Ed. by Mary L. Grady and
Harvey A. Schwartz. Rockville, MD: US Dept. of Health and Human Services, Agency for Health Care Policy and
Research, 1992, pp. 151–159. url: https://hbiostat.org/bib/papers/smi92pro.pdf (cit. on p. 4-19).
[176] Alan Spanos, Frank E. Harrell, and David T. Durack. “Differential Diagnosis of Acute Meningitis: An Analysis of the
Predictive Value of Initial Observations”. In: JAMA 262 (1989), pp. 2700–2707. doi: 10.1001/jama.262.19.2700.
url: http://dx.doi.org/10.1001/jama.262.19.2700 (cit. on pp. 8-46, 8-49).
[177] Ian Spence and Robert F. Garrison. “A Remarkable Scatterplot”. In: Am Statistician 47 (1993), pp. 12–19 (cit. on
p. 4-41).
[178] D. J. Spiegelhalter. “Probabilistic Prediction in Patient Management and Clinical Trials”. In: Stat Med 5 (1986). z-test for calibration inaccuracy (implemented in Stata and in the R Hmisc package’s val.prob function), pp. 421–433. doi:
10.1002/sim.4780050506. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780050506
(cit. on pp. 4-22, 4-48, 5-20, 5-21).
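A minimal R sketch of the z-test referenced here, written from the usual statement of the statistic rather than taken from val.prob's source (p is the vector of predicted probabilities, y the observed 0/1 outcomes):

# Spiegelhalter's z-test for calibration inaccuracy (sketch)
spiegelhalter.z <- function(p, y) {
  num <- sum((y - p) * (1 - 2 * p))                # has expectation 0 under perfect calibration
  den <- sqrt(sum((1 - 2 * p) ^ 2 * p * (1 - p)))  # its standard deviation under H0
  z   <- num / den
  c(z = z, P = 2 * pnorm(-abs(z)))                 # two-sided P-value
}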
[179] Ewout W. Steyerberg. Clinical Prediction Models. New York: Springer, 2009 (cit. on pp. xv, 15-15).
[180] Ewout W. Steyerberg. “Validation in Prediction Research: The Waste by Data-Splitting”. In: Journal of Clinical
Epidemiology 0.0 (July 28, 2018). issn: 0895-4356, 1878-5921. doi: 10.1016/j.jclinepi.2018.07.010. url:
https://www.jclinepi.com/article/S0895-4356(18)30485-2/abstract (visited on 07/30/2018) (cit. on
p. 5-17).
[181] Ewout W. Steyerberg et al. “Prognostic Modeling with Logistic Regression Analysis: In Search of a Sensible Strategy
in Small Data Sets”. In: Med Decis Mak 21 (2001), pp. 45–56 (cit. on p. 4-1).
[182] Ewout W. Steyerberg et al. “Prognostic Modelling with Logistic Regression Analysis: A Comparison of Selection
and Estimation Methods in Small Data Sets”. In: Stat Med 19 (2000), pp. 1059–1079 (cit. on p. 2-36).
[183] C. J. Stone. “Comment: Generalized Additive Models”. In: Stat Sci 1 (1986), pp. 312–314 (cit. on p. 2-28).
[184] C. J. Stone and C. Y. Koo. “Additive Splines in Statistics”. In: Proceedings of the Statistical Computing Section
ASA. Washington, DC, 1985, pp. 45–48 (cit. on pp. 2-22, 2-29).
[185] Samy Suissa and Lucie Blais. “Binary Regression with Continuous Outcomes”. In: Stat Med 14 (1995), pp. 247–255.
doi: 10.1002/sim.4780140303. url: http://dx.doi.org/10.1002/sim.4780140303 (cit. on p. 2-13).
[186] Thomas R. Sullivan et al. “Bias and Precision of the ‘Multiple Imputation, Then Deletion’ Method for Dealing
With Missing Outcome Data”. In: American Journal of Epidemiology 182.6 (Sept. 15, 2015). Disagrees with von
Hippel approach of “impute then delete” for Y, pp. 528–534. issn: 0002-9262. doi: 10.1093/aje/kwv100. url:
https://doi.org/10.1093/aje/kwv100 (visited on 01/05/2021) (cit. on p. 3-4).
[187] Guo-Wen Sun, Thomas L. Shook, and Gregory L. Kay. “Inappropriate Use of Bivariable Analysis to Screen Risk
Factors for Use in Multivariable Analysis”. In: J Clin Epi 49 (1996), pp. 907–916 (cit. on p. 4-20).
[188] Stan Development Team. “Stan: A C++ Library for Probability and Sampling”. In: (2020). url: https://cran.r-
project.org/package=rstan (cit. on p. 8-52).
[189] Robert Tibshirani. “Regression Shrinkage and Selection via the Lasso”. In: J Roy Stat Soc B 58 (1996), pp. 267–288
(cit. on p. 2-36).
[190] Tue Tjur. “Coefficients of Determination in Logistic Regression Models—A New Proposal: The Coefficient of
Discrimination”. In: Am Statistician 63.4 (2009), pp. 366–372 (cit. on p. 8-39).
[191] Jos Twisk et al. “Multiple Imputation of Missing Values Was Not Necessary before Performing a Longitudinal
Mixed-Model Analysis”. In: J Clin Epi 66.9 (2013), pp. 1022–1028. doi: 10.1016/j.jclinepi.2013.03.017. url:
http://dx.doi.org/10.1016/j.jclinepi.2013.03.017 (cit. on p. 3-3).
[192] Werner Vach and Maria Blettner. “Missing Data in Epidemiologic Studies”. In: Ency of Biostatistics. New York:
Wiley, 1998, pp. 2641–2654 (cit. on p. 3-5).
[193] Ben Van Calster et al. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data”.
In: J Clin Epi 74 (June 2016), pp. 167–176. issn: 08954356. doi: 10.1016/j.jclinepi.2015.12.005. url:
http://dx.doi.org/10.1016/j.jclinepi.2015.12.005 (cit. on p. 5-4).
[194] Geert J. M. G. van der Heijden et al. “Imputation of Missing Values Is Superior to Complete Case Analysis and
the Missing-Indicator Method in Multivariable Diagnostic Research: A Clinical Example”. In: J Clin Epi 59 (2006).
Invalidity of adding a new category or an indicator variable for missing values even with MCAR, pp. 1102–1109.
doi: 10.1016/j.jclinepi.2006.01.015. url: http://dx.doi.org/10.1016/j.jclinepi.2006.01.015
(cit. on p. 3-5).
[195] Tjeerd van der Ploeg, Peter C. Austin, and Ewout W. Steyerberg. “Modern Modelling Techniques Are Data Hungry: A Simulation Study for Predicting Dichotomous Endpoints”. In: BMC Medical Research Methodology 14.1 (Dec.
2014). Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed
to final discrimination measure. But much good practical information. Recursive partitioning fared poorly., pp. 137+.
issn: 1471-2288. doi: 10.1186/1471-2288-14-137. pmid: 25532820. url: http://dx.doi.org/10.1186/1471-
2288-14-137 (cit. on pp. 2-4, 2-40, 4-19).
[196] S. van Buuren et al. “Fully Conditional Specification in Multivariate Imputation”. In: J Stat Computation Sim 76.12
(2006). justification for chained equations alternative to full multivariate modeling, pp. 1049–1064 (cit. on pp. 3-15,
3-16).
[197] J. C. van Houwelingen and S. le Cessie. “Predictive Value of Statistical Models”. In: Stat Med 9 (1990), pp. 1303–
1325 (cit. on pp. 2-29, 4-22, 5-17, 5-21, 5-22).
[198] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth. New York: Springer-Verlag,
2003. isbn: 0-387-95457-0 (cit. on p. 11-2).
[199] Geert Verbeke and Geert Molenberghs. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000.
[200] Pierre J. Verweij and Hans C. van Houwelingen. “Penalized Likelihood in Cox Regression”. In: Stat Med 13 (1994),
pp. 2427–2436 (cit. on p. 4-22).
[201] Andrew J. Vickers. “Decision Analysis for the Evaluation of Diagnostic Tests, Prediction Models, and Molecular
Markers”. In: Am Statistician 62.4 (2008). limitations of accuracy metrics;incorporating clinical consequences;nice
example of calculation of expected outcome;drawbacks of conventional decision analysis, especially because of the
difficulty of eliciting the expected harm of a missed diagnosis;use of a threshold on the probability of disease for
taking some action;decision curve;has other good references to decision analysis, pp. 314–320 (cit. on p. 1-7).
[202] Gerko Vink et al. “Predictive Mean Matching Imputation of Semicontinuous Variables”. In: Statistica Neerlandica
68.1 (Feb. 2014), pp. 61–90. issn: 00390402. doi: 10.1111/stan.12023. url: http://dx.doi.org/10.1111/
stan.12023 (cit. on p. 3-9).
[203] Eric Vittinghoff and Charles E. McCulloch. “Relaxing the Rule of Ten Events per Variable in Logistic and Cox
Regression”. In: Am J Epi 165 (2006). the authors may not have been quite stringent enough in their assessment
of adequacy of predictions;letter to the editor submitted, pp. 710–718 (cit. on p. 4-19).
[204] Paul T. von Hippel. “Regression with Missing Ys: An Improved Strategy for Analyzing Multiple Imputed Data”. In:
Soc Meth 37.1 (2007), pp. 83–117 (cit. on p. 3-4).
[205] Paul T. von Hippel. “The Number of Imputations Should Increase Quadratically with the Fraction of Missing
Information”. Aug. 2016. arXiv: 1608.05406. url: http://arxiv.org/abs/1608.05406 (cit. on p. 3-19).
[206] Howard Wainer. “Finding What Is Not There through the Unfortunate Binning of Results: The Mendel Effect”.
In: Chance 19.1 (2006). can find bins that yield either positive or negative association;especially pertinent when
effects are small;“With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk.” – John von
Neumann, pp. 49–56 (cit. on pp. 2-13, 2-16).
[207] S. H. Walker and D. B. Duncan. “Estimation of the Probability of an Event as a Function of Several Independent
Variables”. In: Biometrika 54 (1967), pp. 167–178 (cit. on p. 10-4).
[208] Hansheng Wang and Chenlei Leng. “Unified LASSO Estimation by Least Squares Approximation”. In: J Am Stat
Assoc 102 (2007), pp. 1039–1048. doi: 10.1198/016214507000000509. url: http://dx.doi.org/10.1198/
016214507000000509 (cit. on p. 2-36).
[209] S. Wang et al. “Hierarchically Penalized Cox Regression with Grouped Variables”. In: Biometrika 96.2 (2009),
pp. 307–322 (cit. on p. 2-37).
[210] Yohanan Wax. “Collinearity Diagnosis for a Relative Risk Regression Analysis: An Application to Assessment of
Diet-Cancer Relationship in Epidemiological Studies”. In: Stat Med 11 (1992), pp. 1273–1287 (cit. on p. 4-25).
[211] T. L. Wenger et al. “Ventricular Fibrillation Following Canine Coronary Reperfusion: Different Outcomes with
Pentobarbital and α-Chloralose”. In: Can J Phys Pharm 62 (1984), pp. 224–228 (cit. on p. 8-47).
[212] Ian R. White and John B. Carlin. “Bias and Efficiency of Multiple Imputation Compared with Complete-Case
Analysis for Missing Covariate Values”. In: Stat Med 29 (2010), pp. 2920–2931 (cit. on p. 3-15).
[213] Ian R. White and Patrick Royston. “Imputing Missing Covariate Values for the Cox Model”. In: Stat Med 28 (2009).
approach to using event time and censoring indicator as predictors in the imputation model for missing baseline
covariates;recommended an approximation using the event indicator and the cumulative hazard transformation of
time, without their interaction, pp. 1982–1998 (cit. on p. 3-3).
[214] Ian R. White, Patrick Royston, and Angela M. Wood. “Multiple Imputation Using Chained Equations: Issues and
Guidance for Practice”. In: Stat Med 30.4 (2011). practical guidance for the use of multiple imputation using
chained equations;MICE;imputation models for different types of target variables;PMM choosing at random from
among a few closest matches;choosing number of multiple imputations by a reproducibility argument, suggesting
100f imputations when f is the fraction of cases that are incomplete, pp. 377–399 (cit. on pp. 3-1, 3-10, 3-15,
3-19).
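A minimal sketch of acting on the 100f suggestion with chained equations via Hmisc's aregImpute (d, y, x1, x2 are illustrative names; the floor of 5 imputations is an assumption, not from the paper):

require(Hmisc); require(rms)
f <- mean(! complete.cases(d))        # fraction of cases that are incomplete
m <- max(5, ceiling(100 * f))         # 100f rule for the number of imputations
a <- aregImpute(~ y + x1 + x2, data = d, n.impute = m)  # PMM is the default
fit <- fit.mult.impute(y ~ x1 + x2, lrm, a, data = d)   # combines fits over imputations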
[215] John Whitehead. “Sample Size Calculations for Ordered Categorical Data”. In: Stat Med 12 (1993), pp. 2257–2271
(cit. on p. 4-20).
See letter to editor SM 15:1065-6 for binary case;see errata in SM 13:871, 1994;see kol95com, jul96sam.
[216] Ryan E. Wiegand. “Performance of Using Multiple Stepwise Algorithms for Variable Selection”. In: Stat Med 29
(2010). fruitless to try different stepwise methods and look for agreement;the methods will agree on the wrong
model, pp. 1647–1659 (cit. on p. 4-15).
[217] Daniela M. Witten and Robert Tibshirani. “Testing Significance of Features by Lassoed Principal Components”. In:
Ann Appl Stat 2.3 (2008). reduction in false discovery rates over using a vector of t-statistics;borrowing strength
across genes;“one would not expect a single gene to be associated with the outcome, since, in practice, many genes
work together to effect a particular phenotype. LPC effectively down-weights individual genes that are associated
with the outcome but that do not share an expression pattern with a larger group of genes, and instead favors
large groups of genes that appear to be differentially-expressed.”;regress principal components on outcome;sparse
principal components, pp. 986–1012 (cit. on p. 2-37).
[218] S. N. Wood. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman & Hall/CRC, 2006
(cit. on p. 2-38).
ISBN 9781584884743.
[219] C. F. J. Wu. “Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis”. In: Ann Stat 14.4
(1986), pp. 1261–1350 (cit. on p. 5-17).
[220] Shifeng Xiong. “Some Notes on the Nonnegative Garrote”. In: Technometrics 52.3 (2010). “... to select tuning parameters, it may be unnecessary to optimize a model selection criterion repeatedly”;natural selection of penalty
function, pp. 349–361 (cit. on p. 2-37).
[221] Jianming Ye. “On Measuring and Correcting the Effects of Data Mining and Model Selection”. In: J Am Stat Assoc
93 (1998), pp. 120–131 (cit. on p. 1-14).
[222] F. W. Young, Y. Takane, and J. de Leeuw. “The Principal Components of Mixed Measurement Level Multivariate
Data: An Alternating Least Squares Method with Optimal Scaling Features”. In: Psychometrika 43 (1978), pp. 279–
281 (cit. on p. 4-29).
[223] Recai M. Yucel and Alan M. Zaslavsky. “Using Calibration to Improve Rounding in Imputation”. In: Am Statistician
62.2 (2008). using rounding to impute binary variables using techniques for continuous data;uses the method to
solve for the cutpoint for a continuous estimate to be converted into a binary value;method should be useful in
more general situations;idea is to duplicate the entire dataset and in the second half of the new datasets to set all
non-missing values of the target variable to missing;multiply impute these now-missing values and compare them
to the actual values, pp. 125–129 (cit. on p. 3-17).
[224] Hao H. Zhang and Wenbin Lu. “Adaptive Lasso for Cox’s Proportional Hazards Model”. In: Biometrika 94 (2007).
penalty function has ratios against original MLE;scale-free lasso, pp. 691–703 (cit. on p. 2-36).
[225] Min Zhang et al. “Interaction Analysis under Misspecification of Main Effects: Some Common Mistakes and Simple Solutions”. In: Statistics in Medicine n/a.n/a (2020). issn: 1097-0258. doi: 10.1002/sim.8505. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.8505 (visited on 02/27/2020) (cit. on p. 2-47).
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.8505
[226] Hui Zou, Trevor Hastie, and Robert Tibshirani. “Sparse Principal Component Analysis”. In: J Comp Graph Stat
15 (2006). principal components analysis that shrinks some loadings to zero, pp. 265–286 (cit. on p. 2-37).
[227] Hui Zou and Trevor Hastie. “Regularization and Variable Selection via the Elastic Net”. In: J Roy Stat Soc B 67.2
(2005), pp. 301–320 (cit. on p. 2-36).
R packages written by FE Harrell are freely available from CRAN, and are managed at github.com/harrelfe.
To obtain a glossary of statistical terms and other handouts related to diagnostic and prognostic modeling,
see hbiostat.org/doc/glossary.pdf