6 Modeling Survival Data With Cox Regression Models: 6.1 The Proportional Hazards Model
$$\lambda(t \mid z) = \lambda_0(t)\, e^{z_1\beta_1 + \cdots + z_p\beta_p} = \lambda_0(t)\, e^{z^T\beta}, \tag{6.1}$$
where z is a p × 1 vector of covariates such as treatment indicators, prognostic factors, etc., and
Obviously,
λ(t|z = 0) = λ0 (t).
So λ0 (t) is often called the baseline hazard function. It can be interpreted as the hazard function
The baseline hazard function λ0 (t) in model (6.1) can take any shape as a function of t. The
only requirement is that λ0 (t) > 0. This is the nonparametric part of the model and z T β is the
parametric part of the model. So Cox’s proportional hazards model is a semiparametric model.
$$S(t \mid z) = [S_0(t)]^{\exp(z^T\beta)},$$
where S(t|z) is the survival function of the subpopulation with covariate z and S0 (t) is the
PAGE 120
CHAPTER 6 ST 745, Daowen Zhang
which is a constant over time (hence the name proportional hazards model). Equivalently,
$$\log\left[\frac{\lambda(t \mid z_1)}{\lambda(t \mid z_0)}\right] = (z_1 - z_0)^T\beta, \quad \text{for all } t \ge 0.$$
3. With a one-unit increase in zk while the other covariate values are held fixed,
$$\log\left[\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)}\right] = \log(\lambda(t \mid z_k + 1)) - \log(\lambda(t \mid z_k)) = \beta_k.$$
Therefore, βk is the increase in log hazard (i.e., log hazard-ratio) at any time with unit
$$\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)} = e^{\beta_k}, \quad \text{for all } t \ge 0.$$
So exp(βk ) is the hazard ratio associated with one unit increase in zk . Furthermore, since
$$\frac{P[t \le T < t + \Delta t \mid T \ge t,\ z_k + 1]}{P[t \le T < t + \Delta t \mid T \ge t,\ z_k]} \approx e^{\beta_k}, \quad \text{for all } t \ge 0.$$
so exp(βk ) can be loosely interpreted as the ratio of two conditional probabilities of dying
$$\frac{\lambda(t \mid z_k + 1) - \lambda(t \mid z_k)}{\lambda(t \mid z_k)} = e^{\beta_k} - 1.$$
Inferential Problems
From the interpretation of the model, it is obvious that β characterizes the “effect” of z. So
β should be the focus of our inference while λ0 (t) is a nuisance “parameter”. Given a sample of
3. Diagnostics.
Estimation
Since the baseline hazard λ0 (t) is left completely unspecified (infinite dimensional), ordinary
likelihood methods can’t be used to estimate β. Cox conceived of the idea of a partial likelihood
to remove the nuisance parameter λ0 (t) from the proposed estimating equation.
Historical Note: Cox described the proportional hazards model in JRSSB (1972), in what is
now one of the most quoted statistical papers in history. He also outlined in this paper the method for
estimation which he referred to as using conditional likelihood. It was pointed out to him in the
literature that what he proposed was not a conditional likelihood and that there may be some
flaws in his logic. Cox (1975) was able to recast his method of estimation through what he called
“partial likelihood” and published this in Biometrika. This approach seemed to be based on
sound inferential principles. Rigorous proofs showing the consistency and asymptotic normality
were not published until 1981 when Tsiatis (Annals of Statistics) demonstrated these large sample
properties. In 1982, Andersen and Gill (Annals of Statistics) simplified and generalized these
Xi = min(Ti , Ci ).
∆i = I(Ti ≤ Ci ).
$$\lambda(t \mid z_i) = \lambda_0(t)\, e^{z_i^T\beta},$$
where
$$\lambda(t \mid z_i) = \lim_{h \to 0^+} \left\{ \frac{P[t \le T_i < t + h \mid T_i \ge t,\ z_i]}{h} \right\}.$$
Assume that Ci and Ti are conditionally independent given zi . Then the cause-specific hazard
can be used to represent the hazard of interest. That is (in terms of conditional probabilities)
Similar to the case of log rank test, we need to define some notation. Let us break the time
axis (patient time) into a grid of points. Assume the survival time is continuous. We hence can
take the grid points dense enough so that at most one death can occur within any interval.
Let dNi (u) denote the indicator for the ith individual being observed to die in [u, u + ∆u).
Namely,
Let Yi (u) denote the indicator for whether or not the ith individual is at risk at time u.
Namely,
Let $dN(u) = \sum_{i=1}^n dN_i(u)$ denote the number of deaths for the whole sample occurring in
[u, u + ∆u). Since we are assuming ∆u is sufficiently small, dN (u) is either 1 or 0 at any time u.
Let $Y(u) = \sum_{i=1}^n Y_i(u)$ be the total number from the entire sample who are at risk at time u.
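The counting notation above can be made concrete with a short sketch. It assumes a small hypothetical data set of observed times and death indicators, evaluates the indicators only at the observed times rather than on a fine grid, and treats a subject as at risk at u whenever the observed time is at least u; the function names are illustrative.

```python
# Hypothetical toy data: observed time x_i = min(T_i, C_i) and death indicator delta_i.
x = [2, 2, 3, 4]
delta = [1, 0, 1, 1]

def at_risk(i, u):
    """Y_i(u): 1 if subject i is still under observation at time u, else 0."""
    return 1 if x[i] >= u else 0

def death_indicator(i, u):
    """dN_i(u): 1 if subject i is observed to die at time u, else 0."""
    return 1 if (x[i] == u and delta[i] == 1) else 0

times = sorted(set(x))
# Y(u) and dN(u): sums of the individual indicators over the sample.
Y = {u: sum(at_risk(i, u) for i in range(len(x))) for u in times}
dN = {u: sum(death_indicator(i, u) for i in range(len(x))) for u in times}

for u in times:
    print(f"u = {u}: Y(u) = {Y[u]}, dN(u) = {dN[u]}")
```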
Let F(x) denote the information up to time x (one of the grid points)
F(x) = {(dNi (u), Yi (u), zi ), i = 1, · · · , n; for grid points u < x and dN (x)}.
Note: Conditional on F(x), we know who has died or was censored prior to x, when they
died or were censored, together with their covariate values. We know the individuals at risk at
time x and their corresponding covariate value. In addition, we also know if a death occurs at
What we don’t know is the individual who was observed to die among those at risk at time
x if dN (x) = 1.
Let I(x) denote the individual in the sample who died at time x if someone died. If no one
For example, if I(x) = j, then this means that the jth individual in the sample with covariate
If we let u1 < u2 < · · · denote the value of the grid points along the time axis, then the data
Denote the observed values of the above random variables by lower cases. Then the likelihood
×···
That is, the full likelihood can be written as the product of a series of conditional likelihoods.
The partial likelihood (as defined by D.R. Cox) consists of the product of every other condi-
$$PL = \prod_{\{\text{all grid pt } u\}} P[I(u) = i(u) \mid \mathcal{F}(u) = f(u);\ \lambda_0(\cdot), \beta].$$
Suppose we have the following small data set. We will try to find the partial likelihood for it:
Patient ID x δ z
1 2 1 2
2 2 0 2
3 3 1 1
4 4 1 3
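$$PL(\beta) = \frac{e^{2\beta}}{e^{2\beta} + e^{2\beta} + e^{\beta} + e^{3\beta}} \times \frac{e^{\beta}}{e^{\beta} + e^{3\beta}} \times \frac{e^{3\beta}}{e^{3\beta}}. \tag{6.2}$$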
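As a check on (6.2), the partial likelihood for this data set can be computed directly from the risk sets. This is a minimal sketch for a single covariate with no tied death times; the function names are illustrative, and a subject is taken to be at risk at a death time u when x ≥ u, so the patient censored at time 2 still counts in the first risk set.

```python
import math

# (x, delta, z) for the four patients in the small data set above.
data = [(2, 1, 2.0), (2, 0, 2.0), (3, 1, 1.0), (4, 1, 3.0)]

def partial_likelihood(beta, data):
    """Product over observed deaths of exp(z_i*beta) / sum over the risk set of exp(z_l*beta)."""
    pl = 1.0
    for x_i, d_i, z_i in data:
        if d_i == 1:
            denom = sum(math.exp(z_l * beta) for x_l, _, z_l in data if x_l >= x_i)
            pl *= math.exp(z_i * beta) / denom
    return pl

def pl_closed_form(beta):
    """Equation (6.2) written out term by term."""
    e = math.exp
    return (e(2 * beta) / (e(2 * beta) + e(2 * beta) + e(beta) + e(3 * beta))
            * e(beta) / (e(beta) + e(3 * beta))
            * e(3 * beta) / e(3 * beta))

for beta in (-1.0, 0.0, 0.5):
    print(beta, partial_likelihood(beta, data), pl_closed_form(beta))
```

At β = 0 every subject in a risk set is equally likely to be the one who dies, so the three factors are 1/4, 1/2 and 1.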
In general, we have to consider two cases in calculating the above partial likelihood.
Case 1: Suppose conditional on F(u) we have dN (u) = 0. That is, no death is observed at
Therefore, the partial likelihood is not affected at any point u such that dN (u) = 0.
Case 2: dN (u) = 1. Conditional on F(u), if we know that one individual dies at time u,
then it must be one of the individuals still at risk (alive and not censored) at time u; i.e.,
{i : Yi (u) = 1}.
Also conditional on F(u), we know the covariate vector zi associated to each individual i
death happened to the ith subject (who is actually observed to die at u) rather
these subjects are not equally likely, but rather, they are proportional to their cause-
Let Ai = the event that subject i is going to die in [u, u + ∆u) given that he/she is still
alive at u. If a patient is not at risk at u (i.e., Yi (u) = 0), then Ai = φ. Since we chose ∆u
Because of the independence of survival times and censoring times, those Y (u) patients
who are at risk at u (not censored and still alive at u) make up a random sample of the
subpopulation consisting of the patients who will survive up to u (and with the same
3 that the cause-specific hazard is the same as the hazard of interest; i.e.,
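where the last equation is due to the assumption of the Cox model. Therefore
$$\begin{aligned}
{} &= P[A_{i(u)} \mid A_1 \cup \cdots \cup A_n] \\
&= \frac{P[A_{i(u)}]}{\sum_{l=1}^n P[A_l]} \\
&\approx \frac{\lambda_0(u)\exp(z_{i(u)}^T\beta)\,\Delta u}{\sum_{l=1}^n \lambda_0(u)\exp(z_l^T\beta)\,Y_l(u)\,\Delta u} \\
&= \frac{\exp(z_{i(u)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)\,Y_l(u)}.
\end{aligned}$$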
Here Yi(u) (u) = 1 since we know this patient had to be at risk at u (since we know that
Remark: To be formal, we need to define z0 even though it is never used. We can, for
example, take z0 = 0.
Other equivalent ways of writing the partial likelihood include: Let t1 , · · · , td define the
Remark: Stare at these different representations for a while and you will convince yourself that
The importance of using the partial likelihood is that this function depends only on β,
the parameter of interest, and is free of the baseline hazard λ0 (t), which is an infinite-dimensional
nuisance function.
Cox suggested treating P L as a regular likelihood function and making inference on β accordingly.
For example, we maximize the partial likelihood to get the estimate of β, often called
MPLE (maximum partial likelihood estimate), and use the minus of the second derivative of the
For ease of presentation, let us focus on the one-covariate case. The extension is straightforward.
Define
$$\bar{z}(u, \beta) = \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} = \sum_{l=1}^n z_l w_l,$$
where
$$w_l = \frac{\exp(z_l\beta)\,Y_l(u)}{\sum_{k=1}^n \exp(z_k\beta)\,Y_k(u)}$$
is the weight that is proportional to the hazard of the individual failing. So z̄(u, β) can be
interpreted as the weighted average of the covariate z among those individuals still at risk at
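Define
$$\begin{aligned}
V_z(u, \beta) &= \frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} - \left( \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} \right)^2 \\
&= \left[ \frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} \right] - (\bar{z}(u, \beta))^2 \\
&= \sum_{l=1}^n z_l^2 w_l - (\bar{z}(u, \beta))^2.
\end{aligned}$$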
This last representation says that Vz (u, β) can be interpreted as the weighted variance of the
covariates among those individuals still at risk at u and hence Vz (u, β) > 0. Consequently,
$$\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = -\sum_u dN(u)\,V_z(u, \beta) < 0.$$
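The weighted mean, the weighted variance and the curvature formula can be checked numerically. The sketch below assumes the four-patient data set (x, δ, z) from the partial likelihood example, a single covariate and no tied death times; all function names are illustrative. It compares the analytic second derivative with a central finite difference of the log partial likelihood.

```python
import math

# (x, delta, z): the four-patient toy data set used for equation (6.2).
data = [(2, 1, 2.0), (2, 0, 2.0), (3, 1, 1.0), (4, 1, 3.0)]

def risk_weights(u, beta, data):
    """Return [(z_l, w_l)] for the subjects with Y_l(u) = 1."""
    z_at_risk = [z for x, _, z in data if x >= u]
    total = sum(math.exp(z * beta) for z in z_at_risk)
    return [(z, math.exp(z * beta) / total) for z in z_at_risk]

def z_bar(u, beta, data):
    """Weighted average of z among those at risk at u."""
    return sum(z * w for z, w in risk_weights(u, beta, data))

def v_z(u, beta, data):
    """Weighted variance of z among those at risk at u."""
    m = z_bar(u, beta, data)
    return sum(z * z * w for z, w in risk_weights(u, beta, data)) - m * m

def log_pl(beta, data):
    """Log partial likelihood."""
    total = 0.0
    for x_i, d_i, z_i in data:
        if d_i == 1:
            denom = sum(math.exp(z * beta) for x, _, z in data if x >= x_i)
            total += z_i * beta - math.log(denom)
    return total

def second_derivative(beta, data):
    """Analytic curvature: minus the sum of V_z over the observed death times."""
    death_times = [x for x, d, _ in data if d == 1]
    return -sum(v_z(u, beta, data) for u in death_times)

def numeric_second_derivative(beta, data, h=1e-4):
    """Central finite-difference approximation to the second derivative of log_pl."""
    return (log_pl(beta + h, data) - 2.0 * log_pl(beta, data) + log_pl(beta - h, data)) / h ** 2

for beta in (-1.0, 0.0, 1.0):
    print(beta, second_derivative(beta, data), numeric_second_derivative(beta, data))
```

In this data set the last death time has a single subject at risk, so that term contributes zero variance; the curvature is still negative because the earlier risk sets contain different covariate values.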
The above property can also be displayed graphically. For example, the partial likelihood
[Figure: the partial likelihood plotted against beta, for beta between −4 and 4.]
Therefore ℓ(β) has a unique maximizer, which can be obtained by solving the following
$$-\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = \sum_u dN(u)\,V_z(u, \beta)$$
Ultimately, we want to show that the MPLE β̂ has nice statistical properties. These include:
• Consistency: That is, β̂ will converge to the true value of β which generated the data as
variance which can be estimated from the data. This approximation will be better as the
sample size gets larger. This result is useful in making inference for the true β.
• Efficiency: Among all other competing estimators for β, the MPLE has the smallest vari-
In order to show the properties for β̂, we expand U (β̂) at the true value β0 using Taylor
expansion:
$$0 = U(\hat\beta) \approx U(\beta_0) + \frac{\partial U(\beta_0)}{\partial \beta}\,(\hat\beta - \beta_0).$$
Since
$$\frac{\partial U(\beta_0)}{\partial \beta} = \frac{\partial^2 \ell(\beta_0)}{\partial \beta^2} = -J(\beta_0),$$
therefore
$$\hat\beta - \beta_0 \approx J^{-1}(\beta_0)\,U(\beta_0).$$
This expression indicates that we need to investigate the properties of the score function U (β0 )
$$U(\beta_0) = \sum_u dN(u)\left[ z_{I(u)} - \bar{z}(u, \beta_0) \right].$$
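Since
$$\begin{aligned}
E[U(\beta_0)] &= E\left[ \sum_u dN(u)\left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \right] \\
&= \sum_u E\left[ dN(u)\left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \right],
\end{aligned}$$
and
$$E\left[ dN(u)\left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \right] = E\left[ E\left[ dN(u)\left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \,\middle|\, \mathcal{F}(u) \right] \right]$$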
Conditional on F(u), dN (u) and z̄(u, β0 ) are both known. Consequently the inner expecta-
$$dN(u)\left[ E[z_{I(u)} \mid \mathcal{F}(u)] - \bar{z}(u, \beta_0) \right].$$
Remember that I(u) is the patient identifier for the individual that dies at time u and is set
to zero if no one dies at u. If no one dies at u, then dN (u) = 0, and hence the above quantity is
zero. If someone dies at u, then dN (u) = 1, and conditional on F(u), we know it has to be one
of the Y (u) people at risk at time u; i.e., I(u) must be one of the values {i : Yi (u) = 1}.
The conditional distribution of zI(u) given F(u) can be derived through the conditional
Therefore
$$E[z_{I(u)} \mid \mathcal{F}(u)] = \sum_{l=1}^n z_l w_l = \frac{\sum_{l=1}^n z_l \exp(z_l\beta_0)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta_0)\,Y_l(u)} = \bar{z}(u, \beta_0).$$
E[U (β0 )] = 0.
2    $z_2$    $\exp(z_2\beta_0)\,Y_2(u) / \sum_{l=1}^n \exp(z_l\beta_0)\,Y_l(u) = w_2$
⋮    ⋮    ⋮
n    $z_n$    $\exp(z_n\beta_0)\,Y_n(u) / \sum_{l=1}^n \exp(z_l\beta_0)\,Y_l(u) = w_n$
Note: From the conditional distribution of zI(u) given F(u), it is easy to see the conditional
variance of zI(u)
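$$\begin{aligned}
\mathrm{Var}[z_{I(u)} \mid \mathcal{F}(u)] &= \sum_{l=1}^n \left( z_l - E[z_{I(u)} \mid \mathcal{F}(u)] \right)^2 w_l \\
&= \frac{\sum_{l=1}^n (z_l - \bar{z}(u, \beta_0))^2 \exp(z_l\beta_0)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta_0)\,Y_l(u)} \\
&= V_z(u, \beta_0).
\end{aligned}$$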
As usual, we will take an arbitrary cross-product and show it has zero expectation. Assume
$$A(u) = dN(u)\left[ z_{I(u)} - \bar{z}(u, \beta_0) \right], \qquad A(u') = dN(u')\left[ z_{I(u')} - \bar{z}(u', \beta_0) \right].$$
$$E[A(u)A(u')]$$
Therefore
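$$\begin{aligned}
\mathrm{Var}[U(\beta_0)] &= E\left[ \sum_u A^2(u) \right] \\
&= \sum_u E\left[ A^2(u) \right] \\
&= \sum_u E\left[ E\left[ A^2(u) \mid \mathcal{F}(u) \right] \right]
\end{aligned}$$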
u
$$E\left[ A^2(u) \mid \mathcal{F}(u) \right] = E\left[ \left\{ dN(u)\left[ z_{I(u)} - \bar{z}(u, \beta_0) \right] \right\}^2 \,\middle|\, \mathcal{F}(u) \right].$$
Since we pick the grid points in our partition of time fine enough so that dN (u) is either 0
$$E\left[ A^2(u) \mid \mathcal{F}(u) \right] = E\left[ dN(u)\left[ z_{I(u)} - \bar{z}(u, \beta_0) \right]^2 \,\middle|\, \mathcal{F}(u) \right].$$
Conditional on F(u), dN (u) is known, z̄(u, β0 ) is also known and from Table 6.1
Therefore
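$$\begin{aligned}
E\left[ A^2(u) \mid \mathcal{F}(u) \right] &= dN(u)\, E\left[ \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right]^2 \,\middle|\, \mathcal{F}(u) \right] \\
&= dN(u)\, \mathrm{Var}[z_{I(u)} \mid \mathcal{F}(u)] \\
&= dN(u)\, V_z(u, \beta_0).
\end{aligned}$$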
Consequently,
$$\begin{aligned}
\mathrm{Var}[U(\beta_0)] &= \sum_u E\left[ dN(u)\,V_z(u, \beta_0) \right] \\
&= E\left[ \sum_u dN(u)\,V_z(u, \beta_0) \right].
\end{aligned}$$
Note that the quantity $\sum_u dN(u)V_z(u, \beta_0)$ is a statistic (can be calculated from the observed
data), so $\sum_u dN(u)V_z(u, \beta_0)$ is an unbiased estimate of Var [U (β0 )]. In fact, $\sum_u dN(u)V_z(u, \beta_0)$
Conclusion
The score $U(\beta_0) = \sum_u A(u)$ is a sum of conditionally uncorrelated mean zero random vari-
$$J(\beta_0) = \sum_u dN(u)\,V_z(u, \beta_0).$$
$$U(\beta_0) \overset{a}{\sim} N(0, J(\beta_0)).$$
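The two moment results for the score, mean zero and variance equal to the expected information, can be checked by simulation. The following is a rough Monte Carlo sketch under assumptions that are not from the notes: a constant baseline hazard of 1 (so survival times are exponential with rate exp(zβ0)), a binary covariate, no censoring, and arbitrary choices of n, β0, the seed and the number of replications.

```python
import math
import random

def score_and_information(times, z, beta):
    """U(beta) and J(beta) for uncensored data with distinct times and one covariate."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    U = J = 0.0
    for pos, i in enumerate(order):
        risk = order[pos:]                         # subjects still at risk at times[i]
        w = [math.exp(z[l] * beta) for l in risk]
        tot = sum(w)
        zbar = sum(z[l] * wl for l, wl in zip(risk, w)) / tot
        v = sum(z[l] ** 2 * wl for l, wl in zip(risk, w)) / tot - zbar ** 2
        U += z[i] - zbar                           # dN(u) * [z_I(u) - zbar(u, beta)]
        J += v                                     # dN(u) * V_z(u, beta)
    return U, J

random.seed(0)
beta0, n, reps = 0.5, 20, 2000                     # assumed simulation settings
z = [i % 2 for i in range(n)]                      # half the sample has z = 1

scores, infos = [], []
for _ in range(reps):
    # Baseline hazard 1, so T_i is exponential with rate exp(z_i * beta0).
    t = [random.expovariate(math.exp(zi * beta0)) for zi in z]
    U, J = score_and_information(t, z, beta0)
    scores.append(U)
    infos.append(J)

mean_U = sum(scores) / reps
var_U = sum((u - mean_U) ** 2 for u in scores) / (reps - 1)
mean_J = sum(infos) / reps
print(f"mean of U = {mean_U:.3f}, variance of U = {var_U:.3f}, mean of J = {mean_J:.3f}")
```

The average score should be close to zero and the empirical variance of the score close to the average of J(β0), up to Monte Carlo error.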
Of course, in practice, β0 is unknown. But we can substitute β̂ for β0 and use J −1 (β̂) as the
estimated variance of β̂. That is, we use the following approximate distribution for (β̂ − β0 )
where
$$J(\hat\beta) = \sum_u dN(u)\left[ V_z(u, \hat\beta) \right],$$
$$\lambda(t) = \lambda_0(t)\, e^{z\beta}.$$
After we get our data (xi , δi , zi ), we can obtain the MPLE β̂ by solving the partial likelihood
$$\hat\beta \overset{a}{\sim} N(\beta_0, J^{-1}(\hat\beta)).$$
We can use this fact to construct a confidence interval for β and test the hypothesis H0 : β = β0 ,
$$\hat\beta \pm z_{\alpha/2}\,[J^{-1}(\hat\beta)]^{1/2}.$$
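The estimation steps can be sketched with a Newton-Raphson iteration on the score. This is a minimal illustration rather than the algorithm of any particular software: it assumes the four-patient data set (x, δ, z) from the partial likelihood example, one covariate and no tied death times, and it uses 1.96 as the normal quantile for a 95% interval. With only four patients the normal approximation is not trustworthy, so the interval is shown only to illustrate the formula.

```python
import math

# (x, delta, z): the four-patient toy data set used for equation (6.2).
data = [(2, 1, 2.0), (2, 0, 2.0), (3, 1, 1.0), (4, 1, 3.0)]

def score_info(beta, data):
    """Return the score U(beta) and the information J(beta)."""
    U = J = 0.0
    for x_i, d_i, z_i in data:
        if d_i != 1:
            continue
        risk_z = [z for x, _, z in data if x >= x_i]   # covariates in the risk set
        w = [math.exp(z * beta) for z in risk_z]
        tot = sum(w)
        zbar = sum(z * wl for z, wl in zip(risk_z, w)) / tot
        v = sum(z * z * wl for z, wl in zip(risk_z, w)) / tot - zbar ** 2
        U += z_i - zbar
        J += v
    return U, J

def fit_mple(data, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson: beta <- beta + U(beta) / J(beta) until the step is tiny."""
    for _ in range(max_iter):
        U, J = score_info(beta, data)
        step = U / J
        beta += step
        if abs(step) < tol:
            break
    return beta

beta_hat = fit_mple(data)
_, J_hat = score_info(beta_hat, data)
se = math.sqrt(1.0 / J_hat)                            # [J^{-1}(beta_hat)]^{1/2}
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
print(f"beta_hat = {beta_hat:.4f}, se = {se:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

For this data set the score equation reduces to a cubic in e^β, namely 1 − x − x² − 3x³ = 0 with x = e^β, which gives an independent check on the iteration.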
Myelomatosis data revisited: We analyzed myelomatosis data and did not find a statistically
significant difference between treatments 1 and 2. We want to quantify the difference by assuming
the hazards of these two treatments are proportional to each other. Define a treatment indicator
trt1 which takes value 0 for treatment 1 and takes value 1 for treatment 2. Then we can use
Total    Event    Censored    Percent Censored
25       17       8           32.00
So β̂ = 0.5728 with standard error 0.5096. This means that compared to treatment 1,
treatment 2 will increase the hazard of dying at any time by 77% (exp(β̂) − 1). A 95% CI of β is
Note: The output also gives three tests for H0 : β = 0: likelihood ratio, score and Wald tests.
$$\lambda(t) = \lambda_0(t)\, e^{z\beta}.$$
Score test: Under H0 : β = 0, the score U (0) (evaluated under H0 ) has the distribution
$$U(0) \overset{a}{\sim} N(0, J(0)).$$
Or equivalently,
$$\left[ \frac{U(0)}{J^{1/2}(0)} \right]^2 \overset{a}{\sim} \chi^2_1.$$
$$U(0) = \sum_u dN(u)\left[ z_{I(u)} - \bar{z}(u, 0) \right].$$
Then
1. If a death occurs at time u, then dN (u) = 1, in which case there will be a contribution to
2. Since z = 1 for treatment 1 and z = 0 for treatment 0, zI(u) will then be the number of deaths
which is the proportion of individuals in group 1 among those at risk at time u. Since
we only assume one death at time u, this proportion is the expected number of deaths
for treatment 1 among those at risk at time u, under the null hypothesis of no treatment
difference.
4. Therefore, U (0) is the sum over the death times of the observed number of deaths from
treatment 1 minus the expected number of deaths under the null hypothesis. This was the
where dN 1 (u) = # of observed deaths from treatment 1, Y1 (u) = # at risk at time u from
where
$$V_z(u, 0) = \frac{\sum_l [z_l - \bar{z}(u, 0)]^2\, Y_l(u)}{\sum_l Y_l(u)}.$$
Note: Among the Y (u) individuals at risk at time u, there are Y1 (u) individuals whose zl
$$\bar{z}(u, 0) = \frac{Y_1(u)}{Y(u)}.$$
Therefore,
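$$\begin{aligned}
V_z(u, 0) &= \frac{\sum_l [z_l - \bar{z}(u, 0)]^2\, Y_l(u)}{\sum_l Y_l(u)} \\
&= \frac{\left[ 1 - \frac{Y_1(u)}{Y(u)} \right]^2 Y_1(u) + \left[ 0 - \frac{Y_1(u)}{Y(u)} \right]^2 Y_0(u)}{Y(u)} \qquad (z_l \text{ takes 1 or 0}) \\
&= \frac{\frac{Y_0^2(u)Y_1(u)}{Y^2(u)} + \frac{Y_1^2(u)Y_0(u)}{Y^2(u)}}{Y(u)} \qquad (Y_1(u) + Y_0(u) = Y(u)) \\
&= \frac{Y_0(u)Y_1(u)}{Y^2(u)}.
\end{aligned}$$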
Therefore,
$$J(0) = \sum_u dN(u)\, \frac{Y_0(u)Y_1(u)}{Y^2(u)}.$$
Let us contrast this with the variance used to compute the logrank test statistic:
$$\sum_u \left[ \frac{Y_1(u)Y_0(u)\,dN(u)\,[Y(u) - dN(u)]}{Y^2(u)\,[Y(u) - 1]} \right].$$
Note: In the special case where dN (u) can only be one or zero, the above expression reduces to
$$\sum_u \left[ \frac{Y_1(u)Y_0(u)\,dN(u)\,[Y(u) - 1]}{Y^2(u)\,[Y(u) - 1]} \right] = \sum_u \left[ \frac{Y_1(u)Y_0(u)\,dN(u)}{Y^2(u)} \right],$$
which is the same as J(0).
Therefore, we have demonstrated that, with continuous survival time data with no ties, the score
test of the hypothesis H0 : β = 0 in the proportional hazards model is exactly the same as the logrank test.
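The agreement between the score test and the logrank test can be verified numerically. The sketch below uses a hypothetical two-group data set with no tied times; the function names are illustrative. It computes U(0) and J(0) from the Cox score formulas and, separately, the observed minus expected deaths and the variance from the logrank formulas.

```python
import math

# Hypothetical two-group data with distinct times: (time, death indicator, z).
data = [(1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 1, 1),
        (5, 1, 0), (6, 0, 0), (7, 1, 1), (8, 1, 0)]

def cox_score_test(data):
    """U(0) and J(0): at beta = 0 every subject at risk has equal weight."""
    U = J = 0.0
    for x_i, d_i, z_i in data:
        if d_i != 1:
            continue
        risk_z = [z for x, _, z in data if x >= x_i]
        zbar = sum(risk_z) / len(risk_z)
        v = sum((z - zbar) ** 2 for z in risk_z) / len(risk_z)
        U += z_i - zbar
        J += v
    return U, J

def logrank(data):
    """Observed minus expected deaths in group z = 1, and the logrank variance."""
    o_minus_e = var = 0.0
    for u in sorted({x for x, d, _ in data if d == 1}):
        Y = sum(1 for x, _, _ in data if x >= u)
        Y1 = sum(1 for x, _, z in data if x >= u and z == 1)
        Y0 = Y - Y1
        d = sum(1 for x, dd, _ in data if x == u and dd == 1)
        d1 = sum(1 for x, dd, z in data if x == u and dd == 1 and z == 1)
        o_minus_e += d1 - d * Y1 / Y
        if Y > 1:                                  # a risk set of size 1 contributes zero variance
            var += Y1 * Y0 * d * (Y - d) / (Y ** 2 * (Y - 1))
    return o_minus_e, var

U0, J0 = cox_score_test(data)
OE, V = logrank(data)
print(f"score: U(0) = {U0:.4f}, J(0) = {J0:.4f}, chi-square = {U0 ** 2 / J0:.4f}")
print(f"logrank: O - E = {OE:.4f}, variance = {V:.4f}, chi-square = {OE ** 2 / V:.4f}")
```

With tied death times the two variances differ by the factor [Y(u) − dN(u)]/[Y(u) − 1], so the equality shown here relies on the no-ties assumption.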
$$\lambda(t \mid z) = \lambda_0(t)\, e^{z\beta}$$
for any covariate value z, whether z is discrete or continuous. The null hypothesis H0 : β =
0 implies that the hazard rate at any time t is unaffected by the covariate z. This also implies
that the survival distribution does not depend on z. The alternative hypothesis HA : β ≠ 0
implies that the hazard rate increases or decreases (depending on the sign of β) as z increases
throughout all time. Therefore, belief in this alternative hypothesis would mean that individuals
with a higher value of z would have a stochastically larger (or smaller, depending on the sign of
β) survival distribution than those individuals with a smaller value of z. The test command
in Proc Lifetest computes the score test of the hypothesis H0 : β = 0 for the proportional
hazards model. Consequently, when using the test command, the covariate z is not limited to
For example, we can test the treatment difference between treatments 1 and 2 for myelo-
Variable TRT
TRT 4.21151
As in the ordinary likelihood theory, the (partial) likelihood ratio test can also be used to
H0 : β = β0 .
Recall that ℓ(β) is the log partial likelihood. Intuitively, if H0 is true, then β̂, the MPLE
of β, should be close to β0 . Hence ℓ(β̂) should be close to ℓ(β0 ). Since ℓ(β̂) − ℓ(β0 ) is always
Therefore,
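$$2\left[ \ell(\hat\beta) - \ell(\beta_0) \right] \approx J(\hat\beta)(\hat\beta - \beta_0)^2 = \left[ \frac{\hat\beta - \beta_0}{J^{-1/2}(\hat\beta)} \right]^2 \overset{a}{\sim} \chi^2_1 \quad \text{under } H_0: \beta = \beta_0.$$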
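The likelihood ratio, score and Wald statistics for H0 : β = 0 can be computed side by side. This is a sketch under the same assumptions as the earlier toy example (the four-patient data set (x, δ, z), one covariate, no tied death times), with illustrative function names. With so few observations the chi-square approximation is not reliable, so the numbers only illustrate how the three statistics are formed.

```python
import math

# (x, delta, z): the four-patient toy data set used for equation (6.2).
data = [(2, 1, 2.0), (2, 0, 2.0), (3, 1, 1.0), (4, 1, 3.0)]

def log_pl(beta, data):
    """Log partial likelihood."""
    total = 0.0
    for x_i, d_i, z_i in data:
        if d_i == 1:
            denom = sum(math.exp(z * beta) for x, _, z in data if x >= x_i)
            total += z_i * beta - math.log(denom)
    return total

def score_info(beta, data):
    """Return the score U(beta) and the information J(beta)."""
    U = J = 0.0
    for x_i, d_i, z_i in data:
        if d_i != 1:
            continue
        risk_z = [z for x, _, z in data if x >= x_i]
        w = [math.exp(z * beta) for z in risk_z]
        tot = sum(w)
        zbar = sum(z * wl for z, wl in zip(risk_z, w)) / tot
        U += z_i - zbar
        J += sum(z * z * wl for z, wl in zip(risk_z, w)) / tot - zbar ** 2
    return U, J

def fit_mple(data, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration on the score equation."""
    for _ in range(max_iter):
        U, J = score_info(beta, data)
        step = U / J
        beta += step
        if abs(step) < tol:
            break
    return beta

beta_hat = fit_mple(data)
U0, J0 = score_info(0.0, data)
_, J_hat = score_info(beta_hat, data)

lr_stat = 2.0 * (log_pl(beta_hat, data) - log_pl(0.0, data))   # 2[l(beta_hat) - l(0)]
score_stat = U0 ** 2 / J0                                      # [U(0) / J^{1/2}(0)]^2
wald_stat = beta_hat ** 2 * J_hat                              # J(beta_hat) * beta_hat^2
print(f"LR = {lr_stat:.4f}, score = {score_stat:.4f}, Wald = {wald_stat:.4f}")
```

Each statistic would be compared with a chi-square distribution with one degree of freedom; the three values differ in small samples and agree more closely as the sample size grows.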
Note: The SAS procedure Phreg can ONLY handle right censored data.