Oxford Three Notes
(Non-)Proportional Hazards
As we’ve said from the outset, the exponential, Weibull, and Cox models are all proportional
hazards (PH) models. That is,
• They assume that the effect of a covariate is to shift the hazard proportionally to the
baseline.
• So, for two individuals A and B, their relative hazards will be:

hA(t) / hB(t) = h0(t)exp(XAβ) / [h0(t)exp(XBβ)] = exp[(XA − XB)β]    (1)

Note that the relationship in (1) is assumed to be true for all time points, and irrespective of the shape of the baseline hazard; this will become very important in just a bit.
• When hazards are “flat” (e.g., in an exponential model), proportional hazards corre-
spond to parallel horizontal lines.
For lots of reasons, it may be the case that hazards aren’t strictly proportional. Consider a
few examples:
• In medical studies: Resistance to the therapy/drug yields hazards which are converging
for treatment vs. control groups over time.
• We can see similar effects with learning phenomena: hazards between groups may
converge.
• Conversely, the effects of a variable may grow more pronounced over time, leading to
hazards that diverge as time passes.
• In some cases, hazards for two groups may actually cross. One example of this is
common in oncology: The decision between surgery and radiation treatments.
We can think of a more generalized PH model, one which relaxes the strict proportionality of the standard model h(t|Xi) = h0(t)exp(Xiβ), as something like:

h(t|Xi) = h0(t)exp[Xiβ + Xiγ g(t)]    (2)

where g(t) is some function of time; when γ = 0, this reduces to the standard PH model.
Tests
In general, there are three kinds of tests for nonproportionality:
1. Tests for changes in parameter values for coefficients estimated on a subsample of the
data defined by t,
2. Tests based on plots of survival estimates and regression residuals against time, and
3. Tests based on interactions between the covariates and (some function of) time.
Stratified/Piecewise Regressions
If the influences of our covariates on the hazard of the event of interest vary over time, the
simplest way they could do so is as a step function. In this vein, consider the function g(t)
in (2) as:

g(t) = 0 ∀ t ≤ τ
     = 1 ∀ t > τ
This amounts to estimating interactions of each of the covariates with a dummy variable
which is “turned on” for t > τ . A few things about this approach:
• It allows the effect of the variables to be different earlier than later in the process.
• One can choose τ on the basis of events/structure in the data (e.g. Congressional
redistrictings, etc.) or just use the median point (better than the mean).
• More generally, one can consider more than one “step,” if data are plentiful.
In general, stepwise models of this sort are a good place to start, but are probably not
adequate solutions to the more general issue of nonproportionality.
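As a rough sketch (not from the handout) of how one might fit such a piecewise model in R, the survival package's survSplit() can do the episode-splitting; the data frame scotus, the duration service, the event indicator retire, and the cutpoint tau below are all hypothetical names:

library(survival)

tau <- 10                                       # the chosen break point
split.df <- survSplit(Surv(service, retire) ~ age + pension + pagree,
                      data = scotus, cut = tau,
                      start = "tstart", end = "tstop", event = "retire")
split.df$post <- as.numeric(split.df$tstart >= tau)    # g(t): 0 before tau, 1 after

# Interact each covariate with the post-tau dummy; the dummy's own main effect is
# common to every unit at risk at a given time, so it is absorbed into h0(t).
fit <- coxph(Surv(tstart, tstop, retire) ~ age + pension + pagree +
               age:post + pension:post + pagree:post, data = split.df)
summary(fit)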
Kalbfleisch and Prentice (1980) were the first to suggest that one could make use of the
predicted survival plots for subgroups of the data to assess the PH assumption. The idea is
based on the fact that, in the Cox model,
S(t|Xi) = exp[ −exp(Xiβ) ∫_0^t h0(u) du ]

so that ln[−ln S(t|Xi)] = Xiβ + ln[∫_0^t h0(u) du]. Under proportional hazards, then, plots of ln[−ln Ŝ(t)] against (log) time for subgroups of the data defined by a covariate should be roughly parallel.
• Lines which are diverging, converging or crossing suggest time-varying effects of the
covariate in question.
Log-log survivor plots (as they are called) have some advantages: they are simple to construct, and departures from parallelism are easy to spot visually.
They also, however, are not especially reliable; they’ll often signal the presence of nonpropor-
tionality when in fact there is none, while at other times they can miss nonproportionality
completely. Accordingly, they’re a good (graphical) place to start, but other tests are better.
Martingale Residuals
Recall that the counting process formulation of the Cox model gave us something that looked
like a residual:

M̂i(t) = Ci(t) − Ĥi(t)

where Ci(t) indicates whether unit i has (yet) experienced the event by time t and Ĥi(t) = exp(Xiβ̂)Ĥ0(t) is its estimated cumulative hazard. Martingale residuals are thus akin to the difference between observed and expected values at t; these "residuals" have the property that:
• E(Mi ) = 0 and
• Cov(Mi , Mj ) = 0 asymptotically.
• If you have one record per unit, this is just the residual, but
4
• More than one record per subject yields the “partial” martingale residual (Stata’s
term). The latter change over t (because Ĥi (t) is also changing over t); we can sum
across all t for each observation i to get the “total” martingale residual.
One can make modifications to the martingale residuals to correct for their inherent skewness
(Therneau et al. 1990). The usefulness of martingale residuals is largely in general model
checking, particularly in detecting either influential observations or departures from linearity
in the effects of covariates.
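A quick sketch of this sort of model checking in R (assuming a fitted Cox model along the lines of the Supreme Court example; the object and variable names here are hypothetical):

library(survival)

fit <- coxph(Surv(service, retire) ~ age + pension + pagree, data = scotus)
mg  <- residuals(fit, type = "martingale")   # one per record; collapse = id sums within subjects

# Nonlinearity in a covariate's effect shows up as curvature in this plot:
scatter.smooth(scotus$age, mg, xlab = "Age", ylab = "Martingale residual")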
Schoenfeld Residuals

Schoenfeld residuals come out of the score of the Cox partial likelihood. For covariate k,

∂lnL(β)/∂βk = Σ_{i=1}^{N} Ci [ Xik − Σ_{j∈R(t)} Xjk exp(Xjβ) / Σ_{j∈R(t)} exp(Xjβ) ]
            = Σ_{i=1}^{N} Ci (Xik − X̄wi k).    (5)
We can think of X̄wi k as a weighted mean of covariate Xk over the risk set at time t, with
weights corresponding to exp(Xβ). By substituting the estimated β̂ into (5), we can get the
estimated Schoenfeld residual for the ith observation on the kth covariate as r̂ik:

r̂ik = Ci [ Xik − Σ_{j∈R(t)} Xjk exp(Xjβ̂) / Σ_{j∈R(t)} exp(Xjβ̂) ]    (6)
The (rough) intuition is that, at a particular event time t, this quantity is the covariate-specific difference between the observed value of Xk for the observation experiencing the event and its expected value, given the risk set at that time.
Schoenfeld residuals:
• are defined only at event times (note the presence of Ci in equation (6)),
• are asymmetrical (Therneau introduces a scaling procedure for these as well), and
• most important for our purposes: if the proportional hazards assumption holds, they should be unrelated to survival time – that is, they should look like a "random walk" vis-à-vis survival time. Conversely, if covariates are nonproportional, then the residuals will vary systematically with survival time.
To get the intuition of why this ought to be the case, remember that Schoenfeld residuals
are something like unit-specific marginal covariate effects. Now, think of the case
where rising hazards are converging:
• The PH model assumes that they are in fact proportional to one another.
• The result will be residuals which “underpredict” the marginal effect of Xk at earlier
time points, and “overpredict” it at later ones.
• So plotting the residuals against time (or log-time) will yield a negatively-sloped
“cloud.”
The reverse is true for hazards which are diverging: a PH model will “overpredict” early,
and “underpredict” later, with the result that a plot of the residuals against time will have
a positive slope.
This suggests (at least) two things we can do with these residuals:
1. Plot them against (some function of) time, to see if there’s a pattern.
2. Test for a correlation between some function of time and the residuals (this is the Grambsch-Therneau test).
In Stata, estat phtest implements this test; here it is for the Supreme Court retirement example:
. estat phtest, detail
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
age | 0.34444 6.64 1 0.0100
pension | -0.06250 0.20 1 0.6553
pagree | -0.09512 0.51 1 0.4770
------------+---------------------------------------------------
global test | 7.02 3 0.0712
----------------------------------------------------------------
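The same test is available in R via cox.zph() in the survival package. A minimal sketch, again assuming a fitted Cox model for these data (object and variable names hypothetical):

library(survival)

fit <- coxph(Surv(service, retire) ~ age + pension + pagree, data = scotus)
zph <- cox.zph(fit, transform = "identity")   # correlate scaled Schoenfeld residuals with t
zph                                           # per-covariate and global chi-squared tests
plot(zph[1])                                  # smoothed beta_age(t) plotted against time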
A third approach is to include interactions between the covariates and (some function of) time directly in the model:
• Note that this can be done with any duration model (Cox, Weibull, etc.).
• In practice, people usually use ln(T ), rather than just T , for the interactions; this is
because in nearly every case the covariates enter the model as exp(Xi β).
• Interpretation is standard for interaction terms: the marginal effect of the covariate in
question is then dependent on the time at which the change in that covariate occurs.
Time-by-covariate interactions can serve both as a test for nonproportionality and as the remedy for it; they can also yield very different, interesting results. For example, consider the international conflict example in B-S&Z (2001):
• Growth decreases the likelihood of conflict, but does so more at later points than in
earlier ones, while
• We observe the opposite effect for alliances: Their pacifying influence wanes over time.
In our Supreme Court example, we would likely want to include a time interaction with the
age variable, as described in the handout:
. gen lnT=ln(service)
. gen agexlnT=age*lnT
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0023907 .0589481 0.04 0.968 -.1131454 .1179269
pension | 2.039256 .5487846 3.72 0.000 .9636582 3.114854
pagree | .0849074 .2968391 0.29 0.775 -.4968865 .6667012
agexlnT | .0250956 .019777 1.27 0.204 -.0136665 .0638578
------------------------------------------------------------------------------
Here, the effect of a one–unit difference in age is effectively zero at ln(T ) = 0 (that is, at T =
1), but increases over time. This is what we would have expected, given our residual–based
tests, above. Note that the Grambsch-Therneau test for nonproportionality now reflects that
there is no systematic variation in the covariate effects over time:
. estat phtest, detail
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
age | -0.06453 0.21 1 0.6481
pension | -0.03437 0.06 1 0.8055
pagree | -0.10123 0.59 1 0.4434
agexlnT | 0.24114 2.13 1 0.1446
------------+---------------------------------------------------
global test | 6.11 4 0.1913
----------------------------------------------------------------
Interestingly, the substance of the changing effect of age over time is what we also found in the Weibull model above. That is, one can think of nonproportionality in the
Cox model as similar to systematic (variable-specific) duration dependence in a parametric
context. In both cases, the result is to make the marginal effect of that covariate on the
hazard change as time passes.
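As an aside, the analogous time-by-covariate interaction can be fit in R using coxph()'s tt() mechanism, which evaluates the interaction at each event time. This is only a sketch, with hypothetical data and variable names:

library(survival)

fit.tt <- coxph(Surv(service, retire) ~ age + pension + pagree + tt(age),
                data = scotus,
                tt = function(x, t, ...) x * log(t))
summary(fit.tt)    # the tt(age) coefficient plays the role of agexlnT above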
Duration Dependence
In all of our discussions so far, we’ve talked about duration dependence as if it were an
intrinsic characteristic of the hazard. But, stop and think for a second: Why might we have
duration dependence? In fact, there are two broad types of reasons: state dependence and
unobserved heterogeneity.
State Dependence
State dependence is the situation in which the value of the hazard in some way depends
directly on its own previous values, and/or the amount of time that has passed in a state.
Consider a few substantive examples:
1. Institutionalization: Institutions often become “sticky” over time, causing the hazard
of their termination to drop – think of government agencies.
2. Degradation: Think of “wear and tear;” the longer (say) a machine runs, the more
likely it is to fail, as parts become worn out, friction builds up, etc.
3. Similarly, consider the old “coalition of minorities” argument from presidential politics:
presidents/parliaments slowly anger subgroups of the group that got them elected; over
time, those groups defect, yielding higher hazards of (say) no-confidence votes.
Each of these is an example of state dependence: the (conditional) hazard of "failure" depends on how long you've been in the state. The relationship between state dependence and duration dependence, however, is a negative (inverse) one: state dependence that makes units more likely to remain in the state (e.g., institutionalization) produces falling hazards – negative duration dependence – while state dependence that makes units more likely to leave (e.g., degradation) produces rising hazards, and thus positive duration dependence.
Substantively, we’re often interested in these sorts of things. Social scientists often talk
about duration dependence in substantive terms (e.g., Scott Bennett's work on alliances),
and it is thus the dominant way of thinking about duration dependence in empirical work.
Unobserved Heterogeneity
There’s another source of duration dependence, though, one that receives relatively little
attention among social scientists: unobserved heterogeneity. This amounts to nothing more
than observations being conditionally different in their individual hazards – say, due to
omitted variable bias. Remember:
• Models assume that observations which are the same on all the covariates are otherwise
identical.
• If we don’t include X in the model, then over time the observations with the higher
hazards (i.e., those in group X = 0) will experience the event (and exit) at a higher
rate.
• This means that, over time, the sample will increasingly be composed of observations
with X = 1 (and lower hazards).
• Thus, the estimated hazard will appear to decline over time, even though there is no
true state dependence at all (because the hazards are “flat”).
The point of this is to note that unobserved heterogeneity yields hazards which will appear
to be declining over time; and, more generally, hazards that are more downward–sloping
than they truly are. This is referred to as “spurious duration dependence,” and will occur
even if the omitted covariate is independent of all the others, or of time. As a result of this,
• truly flat hazards will appear to be declining, and
• "rising" hazards may appear flat, or even nonmonotonic (Omori and Johnson 1991).
This raises the issue that just talking about duration dependence in purely substantive /
state–dependence terms confounds things. In fact, negative duration dependence (for exam-
ple, a Weibull model in which p̂ < 1.0) may or may not indicate positive state dependence.
The substantive interpretation of those p̂s, then, is really potentially problematic.
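A small simulation (mine, not the handout's) makes the point: two groups with different but flat hazards, with the group indicator omitted from the model, yield an estimated Weibull shape below one.

library(survival)
set.seed(1234)

n <- 2000
x <- rbinom(n, 1, 0.5)                  # the omitted covariate
rate <- ifelse(x == 0, 0.5, 0.1)        # both groups have constant (exponential) hazards
t <- rexp(n, rate = rate)
d <- as.numeric(t < 10)                 # administrative censoring at t = 10
t <- pmin(t, 10)

fit <- survreg(Surv(t, d) ~ 1, dist = "weibull")   # note: x is *not* in the model
1 / fit$scale     # the estimated p; typically < 1, i.e. spurious negative dependence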
Negative Duration Dependence Due To Unobserved Heterogeneity
There are (at least) a couple of ways of dealing with this:
1. Model specification.
• The better your model is, the less heterogeneity there will be, and the more
confident you can be in substantively interpreting what you do observe/estimate.
• For example, Lancaster (1979) noted that, in a model of strike duration, negative
duration dependence decreased as variables were added (see also Bennett 1999).
2. Explicitly modeling the unit-level heterogeneity.
• These are called "frailty" models, and are akin to random effects models in a
duration context.
• We’ll talk more about these a little later today; for now, note that if we can
deal with the heterogeneity by (effectively) allowing each unit to have its own
“intercept” (e.g., baseline hazard), then the problems associated with spurious
duration dependence go away.
3. Modeling the duration dependence itself as a function of covariates.
• For example, in a model of the duration of interstate wars, the "democratic peace"
literature might lead us to think that democracies should have greater (positive)
duration dependence than autocracies (since their electorates are more responsive
to losing military efforts).
• To do this, one might allow the p parameter in the Weibull model to vary as a
function of some covariates:
◦ Formally, we’d likely specify p = exp(Zi γ), (or, equivalently, lnp = Zi γ) since
p needs to be strictly greater than zero.
◦ We can then replace p with this expression in the usual Weibull likelihood, and estimate β̂ and γ̂ jointly (a rough sketch of doing so appears at the end of this subsection).
◦ Interpretation is straightforward: variables which increase (decrease) p cause
the hazard to rise (fall) more quickly, or drop (rise) more slowly.
• We can also combine this with the model with unit-level frailties; e.g., the paper
on alliance duration:
◦ Theory: Bigger alliances will last longer.
◦ Also: Larger alliances will be more institutionalized / “sticky,” and so will
also have lower (more negative) duration dependence.
◦ This is in fact what happens: The effect of alliance size is to decrease both λ
and p.
• There is (some) software to do this:
◦ Stata allows you to introduce covariates into the "ancillary" parameters of the various parametric models, though not while also estimating frailty terms.
◦ A package called TDA (Blossfeld and Rohwer 1995) will also do this.
The handout (pp. 8-9) contains an example of this approach, using the Supreme Court
retirement data.
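For those who prefer to see the machinery, here is a rough sketch of the joint estimation with the likelihood coded by hand. This is not the handout's code; the objects t, d, X, and Z are hypothetical (each of X and Z including a constant column), and the parameterization used is lambda_i * t^p_i with lambda_i = exp(Xiβ) and p_i = exp(Ziγ):

# Weibull log-likelihood with covariates in both lambda and p
weib.loglik <- function(theta, t, d, X, Z) {
  beta   <- theta[1:ncol(X)]
  gamma  <- theta[(ncol(X) + 1):(ncol(X) + ncol(Z))]
  lambda <- exp(X %*% beta)
  p      <- exp(Z %*% gamma)
  # events contribute log h(t) + log S(t); censored cases contribute log S(t) only
  ll <- d * (log(lambda) + log(p) + (p - 1) * log(t)) - lambda * t^p
  -sum(ll)     # negative log-likelihood, for minimization
}

fit <- optim(rep(0, ncol(X) + ncol(Z)), weib.loglik, t = t, d = d, X = X, Z = Z,
             method = "BFGS", hessian = TRUE)

Variables in Z with positive estimated γ increase p, making the hazard rise more quickly (or fall more slowly), as described above.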
Cure Models
“Cure” models are a means to relax yet another assumption we’ve so far been making about
our event process. In particular, cure models are used for data in which not all units are,
in fact, “at risk” for the event that we’re studying. These have also been called “split-
population” models in criminology and economics; I generally prefer the former term, as it
more directly conveys what the models are about.
There are three general classes of cure models: parametric mixture, parametric non-mixture,
and semi-parametric. We’ll also discuss how one can implement cure models in a discrete-
time approach as well. And we’ll do (yet another) example using the familiar 1950-1985
MID data.
Mixture Models
Think for a moment about a standard controlled clinical trial. In certain circumstances, it
may be the case that the drug in question cures the patient – that is, that for some fraction
of the subjects receiving the drug, they will never experience the event that defines the end
of the duration studied. Note several things about this:
• The treatment may be neither necessary nor sufficient for a cure; some subjects may be "cured" on their own, even without receiving the drug.
• In either case, though, we have a problem: standard survival models assume that ∫_0^∞ f(t) dt = 1 ∀ i – that is, that all observations eventually have the event of interest. If there is a "cured" group in the data, this assumption is violated.
Now, think about this problem in the context of a standard parametric model for continuous-
time duration data. Assume:
• The durations Ti follow some parametric density f(Ti|Xi, β).
• The corresponding CDF is then F(Ti|Xi, β) = Pr(Ti ≤ ti|Xi, β), ti > 0, where ti represents the duration defined by the end of the "follow-up" period for observation i.
Also, let Yi = 1 if observation i will eventually experience the event of interest and Yi = 0 if it never will (the "cured" group), with δi ≡ Pr(Yi = 1); and let Ri = 1 if we actually observe i's event during the follow-up period, with Ri = 0 otherwise.
The associated survival function is then equal to 1 − F (Ti |Xi , β). We can then write the
hazard in standard fashion as:
h(Ti|Xi, β) = f(Ti|Xi, β) / S(Ti|Xi, β)
For those observations that experience the event of interest during the observation period, we
observe both Ri = 1 and their duration Ti . Since these observations also necessarily belong
to the group in which Yi = 1, we can write the unconditional density for these observations as:

δi g(Ti|Xi, β)

where g(·) denotes the density of Ti conditional on Yi = 1 (and G(·) its corresponding CDF).
In contrast, for those observations in which we do not observe an event (that is, where
Ri = 0), this fact may be due to two possible conditions:
1. It is possible that the observation in question is among those that will never experience
the event defining the duration (that is, Yi = 0).
2. It may also be the case, however, that the observation will experience the event, but
simply did not do so during the observation period (that is, Yi = 1 but Ti > ti ).
¹ Because Yi = 0 implies that the observation will never experience the event of interest (and thus the duration will never be observed), the probabilities f(Ti|Yi = 0) and F(Ti|Yi = 0) cannot be defined.
If, as is routinely the case, we assume that censoring is uninformative, the contribution to the likelihood for observations with Ri = 0 is therefore:

(1 − δi) + δi[1 − G(Ti|Xi, β)]

– that is, either the observation is "cured" (with probability 1 − δi), or it will eventually experience the event but has not yet done so by ti (with probability δi[1 − G(Ti|Xi, β)]).
Combining these values for each of the respective sets of observations, and assuming inde-
pendence across observations, the resulting likelihood function is:
L = ∏_{i=1}^{N} [δi g(Ti|Xi, β)]^Ri {(1 − δi) + δi[1 − G(Ti|Xi, β)]}^(1−Ri)    (9)

lnL = Σ_{i=1}^{N} ( Ri {ln(δi) + ln[g(Ti|Xi, β)]} + (1 − Ri) ln{(1 − δi) + δi[1 − G(Ti|Xi, β)]} )    (10)
• The hazard function for those who do experience the event may be any of the commonly–
used parametric distributions (e.g., exponential, Weibull, log-logistic, etc.). Scholars
are actively working on semi- and nonparametric approaches for estimating the latency;
more on this below.
• The probability of eventually experiencing the event, δi, is typically modeled as a function of covariates Zi via a logit:

δi = exp(Ziγ) / [1 + exp(Ziγ)]    (11)
although other specifications (e.g., probit, complementary log-log, etc.) are also possible. Interpretation of these estimates is standard...
• Note that this model is identified even when the variables in δi are identical to those
in the model of survival time. This means that one can test for the effects of the same
set of variables on both the incidence of failure and the duration associated with it
(Schmidt and Witte 1989).
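To make equations (9)-(11) concrete, here is a minimal sketch (again, not the handout's code) of the mixture-cure log-likelihood with a Weibull latency and a logit incidence model; the objects time, event, X, and Z are hypothetical, each including a constant column:

mixture.cure.loglik <- function(theta, time, event, X, Z) {
  kX <- ncol(X); kZ <- ncol(Z)
  beta  <- theta[1:kX]                    # latency (duration) coefficients
  gamma <- theta[(kX + 1):(kX + kZ)]      # incidence (cured-fraction) coefficients
  p     <- exp(theta[kX + kZ + 1])        # Weibull shape
  lambda <- exp(X %*% beta)
  delta  <- plogis(Z %*% gamma)           # delta_i = Pr(Y_i = 1), as in (11)
  St <- exp(-lambda * time^p)             # latency survival, 1 - G(t)
  ft <- lambda * p * time^(p - 1) * St    # latency density, g(t)
  ll <- event * (log(delta) + log(ft)) +
        (1 - event) * log((1 - delta) + delta * St)
  -sum(ll)                                # negative of (10), for optim()
}

fit <- optim(rep(0, ncol(X) + ncol(Z) + 1), mixture.cure.loglik, method = "BFGS",
             time = time, event = event, X = X, Z = Z, hessian = TRUE)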
Non-Mixture Models
An intuitive motivation for the non-mixture cure model arises in oncological studies of tumor
recurrence following chemotherapy.² Prior to the treatment, we assume that the number of
cancerous cell clusters is large; the effectiveness of the treatment, however, means that the
probability of survival of any single cluster is vanishingly small (Yakovlev 1994, 1996).
• Call the number of cell clusters remaining after the treatment Ni; under these assumptions, Ni can be thought of as (approximately) Poisson-distributed.
• Now call the time for each of the remaining Ni clusters to develop into a detectable tumor Zij, j = {1, 2, ...Ni}, each of which has distribution function F(t).
Under these assumptions, the overall survival function for the time until detection of the first post-therapy tumor is:

S(t) = π^F(t)

where π is the probability that no tumor is ever detected (the "cured" fraction). If we then parameterize

πi = exp[−exp(Xiβ)]

the overall survival function conforms to that of the Cox model, where F(t) is a transform of the "baseline" hazard (see Sposto 2002 for a derivation).
• In most instances, there is very little difference – if the chosen density function f (t)
and its associated quantities are the same, then the two models will generally yield
very similar results (Figure 7).
² This section owes a great deal to Sposto (2002); other good references for non-mixture cure models include Chen (2002); Tsodikov (2003); Yin (2005); and Tournard and Ecochard (2006).
Figure 7: Mixture and Non-Mixture Cured-Fraction Survival Functions (Exponential Hazards with λ = 0.1 and π = 0.5)
• In terms of interpretation, the non-mixture model cannot and should not be interpreted
as a model for a mixture of cured and non-cured individuals in the population. Instead,
it is a multiplicative distribution of two otherwise independent probabilities.
• Some (e.g. Lambert et al.) have reported that non-mixture models have better model
convergence properties than mixture models.
the same fashion that a standard Poisson GLM with such dummies can replicate a Cox
survival model.
Specifically, call:
Practical Considerations
The use of cure models such as that outlined above should be considered whenever all obser-
vations cannot reasonably be assumed to “fail” at some point in the future (Chung, Schmidt,
and Witte 1991). A particularly useful property of cure models is that they allow for sep-
arate estimation of the influence of covariates on the probability of experiencing the event
from their effect on the time until the event of interest occurs for those observations that
do experience the event. That is, covariates can have an independently positive or negative
influence, or no effect at all, on both the incidence and the latency of an event. This fact
makes cure models more flexible than other duration models; one may find that a particular
covariate affects incidence but not latency, or vice versa. Such interpretations are not avail-
able with other duration models.
Analysts should thus consider using cure models whenever there is a theoretical reason to
suspect that not all observations will eventually “fail.” Such an assessment is relatively
straightforward and more common than the political science literature currently reflects.
Political scientists have simply not been routinely asking whether all observations are ex-
pected to eventually fail, but they should.
In addition to theoretical considerations, one can empirically look at the data to get a gen-
eral sense of the need for relaxing the assumption made by all other duration models, i.e.,
that eventually all observations experience the event. By plotting a Kaplan–Meier (KM) estimate of the survivor function versus time (Kaplan and Meier 1958), the analyst will gain a
sense of whether observations in the data exist that will not experience the event of interest.
Price (1999) and Sy and Taylor (2000) illustrate the use of a KM survival curve to empir-
ically assess the need for a split population model. If it “shows a long and stable plateau
with heavy censoring at the tail,” there is strong reason to suspect that there is a subpopu-
lation that will not experience the event (Sy and Taylor 2000, 227; see also Peng et al. 2001).
Several issues arise in the estimation and interpretation of cure models. Note, for example,
that when δi = 1 ∀ i (that is, when all observations will eventually fail), the likelihood re-
duces to that for a standard duration model with censoring. However, testing for δi = 1 is
a case of a boundary condition³ and thus standard asymptotic theory does not apply (Price
1999). Maller and Zhou (1996) offer a corrected likelihood–ratio test for the proposition that
all observations will eventually experience the event of interest. Issues of goodness-of-fit for
split population models are an important, but currently ongoing, area of research (e.g., Sy
and Taylor 2000). Finally, one can also test H0: δ = 1 (i.e., whether the assumption that all observations will eventually fail is true) statistically. If that null holds, the model reduces to the standard duration model with censoring; that is, a (e.g.) Weibull model is a special case of
the Weibull cure model.
³ Note that this does not correspond to the case of Ziγ = 0 (which yields δi = 0.5).
Estimating Cure Models in Stata
In Stata, there are (at least) four options for estimating survival models with a cured fraction.
Each has its pluses and minuses...
lncure
• Estimates a log-normal cure model only.
• Does not allow covariates to determine the cured fraction (that is, one can estimate
only δ̂, not δ̂i ).
spsurv
• Will estimate discrete-time cure models (a la using cloglog in a discrete-time context).
cureregr
• Fits a very flexible generalized parametric cure model, as in (10) above; the duration portion can follow any of several distributions:
◦ Exponential
◦ Weibull
◦ Log-Normal
◦ Logistic
◦ Gamma
• Allows for covariates in the “scale” parameter (that is, the duration part of the model),
the “cured fraction”/mixture parameter, and even the “shape” parameter (cf. today’s
discussion of modeling duration dependence).
• Allows different “links” for the covariates to the cured fraction parameter:
◦ Logistic: δi = 1 / [1 + exp(−Ziγ)].
◦ Complementary log-log: δi = exp[−exp(Ziγ)].
◦ Linear: δi = Ziγ.
zip / zinb
• These are the commands for “zero-inflated” Poisson and negative binomial models,
respectively.
• This is a discrete-time setup; temporal issues have to be dealt with explicitly on the
right-hand side of the model.
The handout's example includes:
• a model estimated using spsurv (i.e., a discrete-time model with a constant cured fraction),
• two parametric (here, Weibull) models – one mixture, one non-mixture – estimated using cureregr, and
• a set of comparison plots for the values of interest in these two models.
Heterogeneity and “Frailty” Models
Cure models represent a particular form of heterogeneity in survival data – some observations
are essentially not “at risk” for the event of interest at all. We can extend this logic a bit to
think of situations where each individual has some (individual–specific, latent) propensity
toward the event of interest. This is the intuition behind frailty models in survival analysis: formally, each unit's hazard is scaled by an unobserved, unit-specific multiplicative term, e.g. h(t|Xi) = νi h0(t)exp(Xiβ), with νi > 0.
Implications
Q: What happens if such heterogeneity is present, but unaccounted for? That is, what happens if you ignore these νi s?
1. The parameter estimates from a model ignoring the effects are inconsistent (Lancaster
1985). In essence, because you’ve misspecified the model.
2. Ignoring the unit effects will also lead you to tend to underestimate the hazard (Omori
and Johnson 1993). The intuition of this is straightforward:
• There is more variability in the actual hazard than your model is picking up, so
• Over time, this will cause observations to “select out” of the data (like we discussed
earlier regarding duration dependence):
◦ Low-frailty cases will stay in
◦ High-frailty ones will drop out
• The result is an underestimated hazard.
• Think of the cure model as one example – in the end, you’ll be left with only the
“cured” folks, and so
• You’ll correspondingly overestimate the survival times.
3. Relatedly – as we discussed above – you'll also tend to find spurious negative duration dependence: the estimated hazard will appear to decline more steeply than it really does.
4. If the νi s are correlated with the Xs, you'll get lousy estimates of the βs, too.
How To Deal
OK, it's bad. So, what do we do?
• One possibility is fixed effects – estimating a separate αi for each unit – but there's the problem of incidental parameters (Lancaster 2000), which leads to inconsistency.
• Use random effects instead: Assume a distribution for the αi s, condition them out, and estimate the parameters of that distribution.
A random-effects approach (called "frailty" in the survival literature) is much more common...
• E.g., Lancaster (1979) in economics, Vaupel et al. (1979, 1981) in demography, etc.
• As with random-effects models, frailty models involve making an assumption about
the distribution of the (random) νi s, and then conditioning them out of the resulting
likelihood.
Not surprisingly, then, the most common frailty models are parametric ones; and the most often-used distribution for the frailties is the gamma distribution. So, for example, the Weibull distribution with a frailty term has a conditional survival function that looks like:

S(t|νi) = exp[−νi (λt)^p]

that is, the standard Weibull survivor function exp[−(λt)^p] raised to the power νi.
Now, if we specify that νi ∼ g(1, θ), where the gamma density g is

g(ν, θ) = [1 / (θ^(1/θ) Γ(1/θ))] ν^(1/θ−1) exp(−ν/θ)
and Γ(·) is the gamma function. Making this assumption about the distribution of the νs means that the marginal survivor function is S(t) = [1 + θ(λt)^p]^(−1/θ), and the corresponding marginal hazard is:

h(t) = λp(λt)^(p−1) / [1 + θ(λt)^p]
     = λp(λt)^(p−1) [S(t)]^θ.    (20)
• This looks more-or-less like a standard Weibull, but with an added “weighting” that
is a function of S(t) and θ, and
• If/when θ = 0, the distribution reverts to the standard Weibull (that is, in the case
where there is no unit-level variability).
It is also possible to derive a frailty model in the Cox context; in both parametric and semi-
parametric cases, the idea is to integrate out the frailties to get a conditional hazard/survival
function from which β̂ can then be estimated.
Estimation
How do we estimate frailty models? There are several options...
1. Fit a standard (e.g., Cox) model, and retain the estimate of the baseline hazard
Ĥ0 (ti ) for each observation.
2. Choose a set of possible values for θ (let’s call them θ̃, e.g. θ̃ ∈ {0, .1, .2, ...4, 4.5, 5}).
3. For each value of θ̃, generate an estimated “predicted frailty” ν̃ˆi for each observa-
tion. This is the “E” step.
4. Fit a second survival model, this time including the estimated ν̃ˆi s as an additional covariate with a fixed coefficient of 1.0 (that is, as an offset). This is the "M" step.
• You can then evaluate the “profile log-partial-likelihood” for each of the values of θ̃ to
figure out what the MLE is.
Direct Estimation
It’s also possible to directly estimate β and θ.
• This has been implemented for the Cox model with gamma, gaussian, and t frailties in
R (via the survival package), and for the Cox model with gamma frailties in Stata’s
-stcox- command.
Another Alternative
• Remember that, in event count models: Poisson event arrivals + gamma heterogeneity
= negative binomial.
• In theory, we can port this idea over to the survival/counting process world as well.
• References are Lawless (1987), Thall (1988), Abu-Libdeh et al. (1990), and Turnbull et al. (1997). In this case:
◦ The “dispersion” parameter is equal to the estimate of the frailty variance. But...
◦ I’ve personally never been able to get this to work...
• Frailty models have the same strong requirements for consistency as do random–effects
models in a panel context; in particular, that Cov(Xi , νi ) = 0. In addition, they also
require that the frailties be independent of the censoring mechanism (that is, that
Cov(Ci , νi ) = 0 as well).
• Interpretation of the resulting estimates is standard, with the caveat that all interpre-
tations are necessarily conditional on some level of frailty ν̂i . In most instances, people
set ν̂i = 1, which is the natural (mean) frailty level.
• Frailty models are probably best used when the researcher suspects that one or more
important unit-level covariates may have been omitted from the model. (See the ex-
ample below for a bit more on this).
• These models can be a bit computationally challenging, particularly on large datasets.
• Note that Stata will also allow the analyst to generate ν̂i s for each of the observations
in the data. These can be useful, in that they can allow the analyst to see what
observations are more or less “frail” vis-à-vis the event of interest.
Finally, it should be noted that frailty models have also been used to deal with a particular
kind of heterogeneity: that due to repeated events. More on this below...
Frailty Models in R
R is also a strong package for estimating frailty models, particularly those rooted in the Cox
model. R will estimate Cox models with frailties that are:
• Gamma-distributed,
• Gaussian, or
• t-distributed.⁴

⁴ That is, the αi s are either Normal or t on the scale of the linear predictor; the frailties νi are log-normal and log-t, respectively.
Options include the method of estimation (EM, penalized partial-likelihood) as well as con-
trols over the parameters of those estimation subcommands. The result is a coxph object:
> summary(GFrail)
Call:
coxph(formula = Surv(start, duration, dispute, type = "counting") ~
contig + capratio + allies + growth + democ + trade + frailty.gamma(dyadid,
method = c("em")))
n= 20448
coef se(coef) se2 Chisq DF p
contig 1.199 0.1673 0.1310 51.41 1 7.5e-13
capratio -0.199 0.0547 0.0495 13.29 1 2.7e-04
allies -0.370 0.1685 0.1252 4.82 1 2.8e-02
growth -3.685 1.3457 1.2991 7.50 1 6.2e-03
democ -0.365 0.1309 0.1108 7.78 1 5.3e-03
trade -3.039 12.0152 10.3084 0.06 1 8.0e-01
frailty.gamma(dyadid, met 708.95 394 0.0e+00
R will also estimate parametric frailty models with the survreg command, in exactly the
same fashion:
> print(W.GFrail)
Call:
survreg(formula = Surv(duration, dispute) ~ contig + capratio +
allies + growth + democ + trade + frailty.gamma(dyadid, method = c("em")))
Scale= 0.541
Extensions
Like other models with random effects (and random-parameter models in general), frailties
turn out to be valuable in a host of different survival data contexts. Some of the readings
give you a flavor for these:
• Repeated Events. As we’ll see in a few minutes, one could use frailties to account
for dependence due to observations having repeated events.
• Spatial Frailties. An article by Banerjee et al. (2003) uses frailty terms in a Bayesian
context, to fit a model with spatially-referenced frailty terms to data on infant mortality
in Minnesota. For reasons that are readily apparent if you think about it, spatial
survival models are another big growth area in these sorts of statistics.
I encourage you to explore any of these that you think might be useful in your own research.
An Example
The handout contains the results of estimating Cox and Weibull models with gamma frailties
on the 1950-1985 international dispute data. Note several things:
• The results for the variables are generally the same as for the standard Cox model; they
can be interpreted similarly, though it is important to note (as the output indicates)
that the results – including the standard error estimates – are conditional on the
random effects ν̂i .
• R reports a Wald test for the null hypothesis that θ = 0 – that is, that there are no
unit-level random effects / frailties. Here, we can confidently reject that null.
• To get predictions for the unit-specific frailties, we’d use predict in R. Alternatively,
if we were using Stata, we’d add the effects(newvar ) option to the stcox command;
this creates a new variable called newvar that contains the estimated log-frailties.
Competing Risks
The idea of competing risks addresses the potential for multiple failure types. For example,
• Members of Congress can retire, be defeated, run for higher office, or die in office.
Motivation
Assume that observation i is at risk for R different kinds of events, and that each event type
has a corresponding duration Ti1 , ..., TiR associated with it, each of which has a corresponding
density fr (t), hazard function hr (t) and a survivor function Sr (t), r ∈ {1, 2, ...R}.
• We observe only the shortest of these durations, Ti = min(Ti1, ..., TiR), along with an indicator Di of which of the R events actually occurred (or censoring).
• Finally, we typically assume there can't be exact equality across the Tir s (though this is not a big deal), and that, given long enough, the observation would have experienced each of the events in question eventually.
Estimation
If the risks for the various types of events are conditionally independent (that is, independent
once the influence of the covariates X are taken into account – more on this in a bit), then
estimation is easy. The contribution of each uncensored observation to the likelihood is:
Li = fDi(Ti|XiDi, βDi) ∏_{r≠Di} Sr(Ti|Xir, βr)    (21)
That is, the contribution of a given observation with failure due to risk r to the likelihood
function is identical to its contribution in a model where only failures due to risk r are
observed and all other cases are treated as censored. The overall likelihood is then:
L = ∏_{i=1}^{N} { fDi(Ti|XiDi, βDi) ∏_{r≠Di} Sr(Ti|Xir, βr) }    (22)
which – because we observe only one of the r events and its corresponding survival time –
can be rewritten as:
L = ∏_{r=1}^{R} ∏_{i=1}^{Nr} { fr(Ti|Xir, βr) Sr(Ti|Xir, βr) }    (23)
where the inner product in (23) is taken over Nr, the set of observations experiencing event r. If we modify our old familiar censoring indicator, such that Cir = 1 indicates that observation i experienced event r, and Cir = 0 otherwise, we can further rewrite (23) as:

L = ∏_{r=1}^{R} ∏_{i=1}^{N} [fr(Ti|Xir, βr)]^Cir [Sr(Ti|Xir, βr)]^(1−Cir).    (24)
and the log-likelihood is then just the sums (over r and i) of the logs of the terms on the
right-hand side of (24).
• The proofs for this are in Cox and Oakes 1984; David and Moeschberger 1978; Dier-
meier and Stevenson 1999; and a number of other places.
• The intuition is easy: To the extent that the (marginal) risks are independent, their
covariances are zero, and drop out of the likelihood functions, leaving us with easy
products.
• So, if two risks j and k are conditionally independent of one another, we may analyze
durations resulting from failure due to risk j by treating those failures due to k as
censored, in the sense that they have not (yet) reached their theoretical time to failure
from risk j; the same may then be done for risk k by treating failures due to risk j as
censored.
◦ That is, you can just run the model "both ways" (a rough sketch of this appears below).
◦ Interpretation is then standard for each of the two models.
◦ Also, there is no identification problem with having similar (or even identical) sets
of covariates in the models for the various failure events, if that is what theory
suggests is the right thing to do.
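A minimal sketch of the "both ways" strategy in R (this is not the notes' code; the data frame court, its exit-type variable exittype, and the duration service are hypothetical names, while the covariates mirror the Supreme Court example):

library(survival)

# Risk 1: retirement; deaths in office are treated as censored
retire.fit <- coxph(Surv(service, exittype == "retire") ~ age + pension + pagree,
                    data = court)

# Risk 2: death in office; retirements are treated as censored
death.fit  <- coxph(Surv(service, exittype == "death") ~ age + pension + pagree,
                    data = court)

summary(retire.fit)   # each model is interpreted exactly like a single-risk Cox model
summary(death.fit)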
Diermeier and Stevenson (1999) give a nice example of this in the context of the cabinet
failures literature, where the competing risks are cabinet failures due to elections and re-
placements.
Independence of Risks
At first thought, it may seem that assuming conditionally independent risks is a pretty strong
and/or unjustifiable thing to do. But, remember:
• The risks only need be independent conditional on the effects of the covariates.
• This means that, if a particular covariate affects the hazard of more than one event,
and it is in the model, then its effect is “controlled for.”
• If you’ve got a good model, then, this may not be such a strong assumption.
Unfortunately, there are no great “tests” for the presence of conditional dependence in
competing risks. (And, for a long time, there was even a big debate over whether a
dependent-risks model is even identified, or identifiable.) As a general rule, people tend
to use independent-risks models. But, if you really, really think your risks are conditionally
dependent, there are several options:
1. One can model dependent risks using frailties/random effects (e.g. Oakes 1989, Gordon
2001), as we discussed above. Sandy Gordon’s 2001 AJPS paper is a nice introduction
to this approach. The intuition is that one assumes that the correlation between the
risks is due to individual-specific factors, which are then captured in the frailty term
and integrated out of the likelihood.
2. One can also take advantage of the Poisson/duration equivalence we talked about last
time...
An example, again using the Supreme Court data:
• Competing risks: Justices can leave the Court through retirement or through, ahem,
mortality (cf. Zorn and Van Winkle 2000).
• Covariates:
◦ Justice’s party agreement (coded one if the party of the sitting president is the
same as that of the president that appointed the justice, and zero otherwise).
Models for Repeated Events
A large number of the kinds of things that social scientists study are capable of repetition:
• International wars (that is, between dyads of countries),
• Marriages,
• Policy changes / shifts,
• Cabinet failures within countries, etc.
The possibility of repeated events leads to the potential for dependencies across events for
the same unit. This, in turn, makes the usual PL- or ML-based inferences suspect:
• Treating such events as independent implies we have more information than we do.
• This leads to a tendency to (usually, but not always) underestimate s.e.s and/or over-
estimate the precision of our estimates.
Moreover, at times we may have theory about – or be otherwise interested in – the fact
that second, third, etc. events are somehow different from “first events.” For example, the
“security dilemma” and the resulting spiral models of war suggest that, well, war will lead
to additional war; on the other hand, informational models of war (a la Fearon, Gartzke,
etc.) suggest that the occurrence of war decreases the hazard of future wars. Which is it?
Absent better methods than we’ve introduced so far, we can’t tell.
There are two general approaches to repeated events: frailty models and variance-correction models.
Frailty Models
We discussed these just a bit ago. In the context of repeated events, we can think of capturing
the possible dependence across events within a subject with the frailty νi .
• This makes some substantive sense, if the frailty is fixed over time.
• Practically speaking, one just estimates a model where the shared frailty identifier
denotes the unit that is experiencing the multiple events.
• Intuitively, this addresses the possibility of dependence if we are willing to assume that
such dependence is largely a function of unit-level influences.
• So, for example, if some cabinets (say, Italy’s) are more likely on average to collapse
than others (say, Great Britain’s), this would be a reasonable approach.
In a nutshell, the use of frailty terms can be – and often is – thought of as a means of dealing
with data where events repeat.
[Figure 11: Schematic of Approaches to Repeated Events in Duration Models]
“Variance-Correction” Models
“Variance-Correction” (or “marginal”) models do exactly what the name implies: they esti-
mate a model as if the events were not repeated/dependent, and then “fix up” the variance-
covariances after the fact. This is analogous to GEE-type models for panel data. As per the
recent literature on the subject, there are four variance-correction models that have been
widely used:
1. Andersen/Gill (AG),
2. Prentice/Williams/Peterson – Elapsed Time (PWP-ET),
3. Prentice/Williams/Peterson – Interevent ("gap") Time (PWP-IT), and
4. Wei/Lin/Weissfeld (WLW).
Figures 11 and 12 present a schematic of the relationship among these various types. There are three key things that define the different variance-corrected models:
1. How the risk sets are defined (that is, when an observation becomes "at risk" for its second, third, etc. events),
2. The time scale used (elapsed time vs. interevent/"gap" time), and
3. Whether baseline hazards are constant across events, or allowed to be different (that is, the presence or absence of stratification).
[Figure 12: A Comparison of Key Characteristics of Variance–Correction Models. Recoverable rows: "Robust standard errors?" – Yes / Yes / Yes; "Stratification by event?" – No / Yes / Yes.]
Risk Sets
• This tells us whether events develop sequentially or simultaneously – that is, can an
observation be “at risk” for the second event before they experience the first event?
Time Scale
• The "counting process" approach generally assumes the answer is no, and measures time from the unit's entry into the data; this is sometimes known as "elapsed time."
• Alternatively, the "clock" can be reset after each event, so that time is measured since the previous event; this is interevent or "gap" time.
Stratification
• Some models assume a common baseline hazard h0 (t) for the first, second, etc. events.
• Other approaches allow for stratified analysis by events, where each event has its own
baseline hazard.
• The latter is generally more flexible/general.
Various combinations of these different characteristics can be combined into the models we
see, as follows.
AG (Andersen/Gill 1982)
This approach adopts the counting process formulation to the Cox model. It assumes:
• Independent events, so...
• A single baseline hazard
As a practical matter, this amounts to nothing more than a Cox model with robust / clustered
standard errors. It is, in every respect, the simplest and most restrictive alternative.
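As a sketch of what that looks like in practice (assumed names, not the notes' own code: a counting-process data frame disputes with start, duration, dispute, and dyadid, and covariates as in the R output above):

library(survival)

ag.fit <- coxph(Surv(start, duration, dispute) ~ contig + capratio + allies +
                  cluster(dyadid), data = disputes)   # robust SEs, clustered on the dyad
summary(ag.fit)

# A PWP-style variant would also stratify the baseline hazard by event number
# (a hypothetical variable eventnum), and/or restart the clock after each event:
# pwp.fit <- coxph(Surv(start, duration, dispute) ~ contig + capratio + allies +
#                    cluster(dyadid) + strata(eventnum), data = disputes)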
PWP - Interevent Time
This model is like PWP-ET, except that the "clock starts over" after every event. In
my opinion, this model probably best corresponds to most of the data we analyze.
• The effect on the hazard of the covariate in question is a smooth (read: linear, or at
least monotonic) function of the “event count.” In that case, a standard multiplicative
interaction will capture what is going on nicely.
• If the latter is the case for a large number of covariates, you may simply be better off
estimating separate models for each event count (as we did in the JOP paper).
Practical Advice
As a practical matter, estimating these models is simply a function of:
• Setting up the data correctly (so as to define the right risk sets),
• choosing the appropriate time scale (elapsed vs. interevent time), and
• deciding whether or not to stratify the baseline hazard by event number.
This is all outlined in our paper, or in Cleves (1999) and Kelly and Lim (2000).