CHAPTER 6 ST 745, Daowen Zhang
6 Modeling Survival Data with Cox Regression Models
6.1 The Proportional Hazards Model
A proportional hazards model proposed by D.R. Cox (1972) assumes that
$$\lambda(t \mid z) = \lambda_0(t)\, e^{z_1\beta_1 + \cdots + z_p\beta_p} = \lambda_0(t)\, e^{z^T\beta}, \tag{6.1}$$
where z is a p × 1 vector of covariates such as treatment indicators, prognostic factors, etc., and
β is a p × 1 vector of regression coefficients. Note that there is no intercept β0 in model (6.1).
Obviously,
λ(t|z = 0) = λ0 (t).
So λ0 (t) is often called the baseline hazard function. It can be interpreted as the hazard function
for the population of subjects with z = 0.
The baseline hazard function λ0 (t) in model (6.1) can take any shape as a function of t. The
only requirement is that λ0 (t) > 0. This is the nonparametric part of the model and $z^T\beta$ is the
parametric part of the model. So Cox’s proportional hazards model is a semiparametric model.
Interpretation of a proportional hazards model
1. It is easy to show that under model (6.1)
$$S(t \mid z) = [S_0(t)]^{\exp(z^T\beta)},$$
where S(t|z) is the survival function of the subpopulation with covariate z and S0 (t) is the
survival function of the baseline population (z = 0). That is,
$$S_0(t) = e^{-\int_0^t \lambda_0(u)\,du}.$$
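2. For any two sets of covariates z0 and z1 ,
$$\frac{\lambda(t \mid z_1)}{\lambda(t \mid z_0)} = \frac{\lambda_0(t)\, e^{z_1^T\beta}}{\lambda_0(t)\, e^{z_0^T\beta}} = e^{(z_1 - z_0)^T\beta}, \quad \text{for all } t \ge 0,$$
which is constant over time (hence the name proportional hazards model). Equivalently,
$$\log\left[\frac{\lambda(t \mid z_1)}{\lambda(t \mid z_0)}\right] = (z_1 - z_0)^T\beta, \quad \text{for all } t \ge 0.$$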
3. With a one-unit increase in $z_k$ while the other covariate values are held fixed,
$$\log\left[\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)}\right] = \log \lambda(t \mid z_k + 1) - \log \lambda(t \mid z_k) = \beta_k.$$
Therefore, $\beta_k$ is the increase in log hazard (i.e., the log hazard ratio) at any time with a unit
increase in the kth covariate $z_k$. Equivalently,
$$\frac{\lambda(t \mid z_k + 1)}{\lambda(t \mid z_k)} = e^{\beta_k}, \quad \text{for all } t \ge 0.$$
So $\exp(\beta_k)$ is the hazard ratio associated with a one-unit increase in $z_k$. Furthermore, since
$P[t \le T < t + \Delta t \mid T \ge t, z] \approx \lambda(t \mid z)\Delta t$, we have
$$\frac{P[t \le T < t + \Delta t \mid T \ge t, z_k + 1]}{P[t \le T < t + \Delta t \mid T \ge t, z_k]} \approx e^{\beta_k}, \quad \text{for all } t \ge 0,$$
so $\exp(\beta_k)$ can be loosely interpreted as the ratio of two conditional probabilities of dying
in the near future given that a subject is alive at time t. Moreover,
$$\frac{\lambda(t \mid z_k + 1) - \lambda(t \mid z_k)}{\lambda(t \mid z_k)} = e^{\beta_k} - 1,$$
so $e^{\beta_k} - 1$ can be interpreted as the relative change (increase or decrease) in hazard, i.e., the
percentage change when multiplied by 100, with a one-unit increase in $z_k$ while adjusting for other covariates.
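The two interpretations above can be checked numerically. The following is a minimal sketch, not part of the original notes: the baseline hazard λ0(t) = 2t, the coefficient vector and the covariate values are all hypothetical choices made only for illustration. It verifies that the hazard ratio does not depend on t and that S(t|z) obtained by integrating the hazard agrees with [S0(t)]^exp(z^T β).

```python
import math

def baseline_hazard(t):
    """Hypothetical baseline hazard lambda_0(t) = 2t (illustrative choice only)."""
    return 2.0 * t

def linear_predictor(z, beta):
    """z^T beta for covariate list z and coefficient list beta."""
    return sum(zk * bk for zk, bk in zip(z, beta))

def hazard(t, z, beta):
    """Hazard under model (6.1): lambda_0(t) * exp(z^T beta)."""
    return baseline_hazard(t) * math.exp(linear_predictor(z, beta))

def survival(t, z, beta, steps=10000):
    """S(t|z) = exp(-integral_0^t lambda(u|z) du), using the trapezoidal rule."""
    h = t / steps
    total = 0.5 * (hazard(0.0, z, beta) + hazard(t, z, beta))
    total += sum(hazard(i * h, z, beta) for i in range(1, steps))
    return math.exp(-h * total)

beta = [0.7, -0.3]                  # hypothetical regression coefficients
z0, z1 = [0.0, 1.0], [1.0, 1.0]     # z1 has one more unit of the first covariate

# Hazard ratio at several times: should equal exp(beta_1) at every t.
hazard_ratios = [hazard(t, z1, beta) / hazard(t, z0, beta) for t in (0.5, 1.0, 2.0)]

# Survival computed directly versus the power-of-baseline formula.
s_direct = survival(1.0, z1, beta)
s_power = survival(1.0, [0.0, 0.0], beta) ** math.exp(linear_predictor(z1, beta))
```

Any other positive baseline hazard could be substituted for `baseline_hazard`; the hazard ratio would remain exp(0.7) because λ0(t) cancels.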
Inferential Problems
From the interpretation of the model, it is obvious that β characterizes the “effect” of z. So
β should be the focus of our inference while λ0 (t) is a nuisance “parameter”. Given a sample of
censored survival data, our inferential problems include:
1. Estimate β; derive its statistical properties.
2. Test the hypothesis H0 : β = 0, or a hypothesis about part of β.
3. Diagnostics.
Estimation
Since the baseline hazard λ0 (t) is left completely unspecified (infinite dimensional), ordinary
likelihood methods can’t be used to estimate β. Cox conceived of the idea of a partial likelihood
to remove the nuisance parameter λ0 (t) from the proposed estimating equation.
Historical Note: Cox described the proportional hazards model in JRSSB (1972), in what is
now one of the most quoted statistical papers in history. He also outlined in this paper the method for
estimation which he referred to as using conditional likelihood. It was pointed out to him in the
literature that what he proposed was not a conditional likelihood and that there may be some
flaws in his logic. Cox (1975) was able to recast his method of estimation through what he called
“partial likelihood” and published this in Biometrika. This approach seemed to be based on
sound inferential principles. Rigorous proofs showing the consistency and asymptotic normality
were not published until 1981 when Tsiatis (Annals of Statistics) demonstrated these large sample
properties. In 1982, Andersen and Gill (Annals of Statistics) simplified and generalized these
results through the use of counting processes.
6.2 Estimation Using Partial Likelihood
Data and Model
1. Data: (Xi , ∆i , zi ), i = 1, · · · , n, where for the ith individual
Xi = min(Ti , Ci ).
∆i = I(Ti ≤ Ci ).
zi = (zi1 , zi2 , · · · , zip )T is a vector of covariates.
2. Model: Proportional hazards model
$$\lambda(t \mid z_i) = \lambda_0(t)\, e^{z_i^T\beta},$$
where
$$\lambda(t \mid z_i) = \lim_{h \to 0^+} \left\{ \frac{P[t \le T_i < t + h \mid T_i \ge t, z_i]}{h} \right\}.$$
Assume that Ci and Ti are conditionally independent given zi . Then the cause-specific hazard
can be used to represent the hazard of interest. That is (in terms of conditional probabilities)
P [x ≤ Xi < x + ∆x, ∆i = 1|Xi ≥ x, zi ] = P [x ≤ Ti < x + ∆x|Ti ≥ x, zi ] ≈ λTi (x|zi )∆x.
Similar to the case of log rank test, we need to define some notation. Let us break the time
axis (patient time) into a grid of points. Assume the survival time is continuous. We hence can
take the grid points dense enough so that at most one death can occur within any interval.
Let dNi (u) denote the indicator for the ith individual being observed to die in [u, u + ∆u).
Namely,
dNi (u) = I(Xi ∈ [u, u + ∆u), ∆i = 1).
Let Yi (u) denote the indicator for whether or not the ith individual is at risk at time u.
Namely,
Yi (u) = I(Xi ≥ u).
Let $dN(u) = \sum_{i=1}^n dN_i(u)$ denote the number of deaths for the whole sample occurring in
[u, u + ∆u). Since we are assuming ∆u is sufficiently small, dN (u) is either 1 or 0 at any time u.
Let $Y(u) = \sum_{i=1}^n Y_i(u)$ be the total number from the entire sample who are at risk at time u.
Let F(x) denote the information up to time x (one of the grid points)
F(x) = {(dNi (u), Yi (u), zi ), i = 1, · · · , n; for grid points u < x and dN (x)}.
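To make the notation concrete, here is a small sketch that evaluates Yi(u), dNi(u), Y(u) and dN(u) on a grid. The sample, the grid width `du` and the function names are hypothetical and chosen only for illustration; the grid is fine enough that each interval contains at most one death.

```python
def at_risk(x_i, u):
    """Y_i(u) = I(X_i >= u)."""
    return 1 if x_i >= u else 0

def death_indicator(x_i, delta_i, u, du):
    """dN_i(u) = I(X_i in [u, u + du), Delta_i = 1)."""
    return 1 if (u <= x_i < u + du and delta_i == 1) else 0

# Hypothetical sample: observed times X_i and failure indicators Delta_i.
x = [1.3, 2.1, 2.8, 4.0]
delta = [1, 0, 1, 1]

du = 0.5
grid = [k * du for k in range(10)]   # grid points 0, 0.5, ..., 4.5

# Y(u): number at risk at each grid point; dN(u): deaths in [u, u + du).
Y = [sum(at_risk(xi, u) for xi in x) for u in grid]
dN = [sum(death_indicator(xi, di, u, du) for xi, di in zip(x, delta)) for u in grid]
```

Note that the subject censored at 2.1 leaves the risk set after that time but never contributes to dN(u).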
Note: Conditional on F(x), we know who has died or was censored prior to x, when they
died or were censored, together with their covariate values. We know the individuals at risk at
time x and their corresponding covariate value. In addition, we also know if a death occurs at
interval [x, x + ∆x).
What we don’t know is the individual who was observed to die among those at risk at time
x if dN (x) = 1.
Let I(x) denote the individual in the sample who died at time x if someone died. If no one
dies at time x, then I(x) = 0.
For example, if I(x) = j, then this means that the jth individual in the sample with covariate
vector zj died in [x, x + ∆x).
Let F(∞) denote all the data in the sample. Namely
F(∞) = {(Xi , ∆i , zi ), i = 1, · · · , n}.
If we let u1 < u2 < · · · denote the value of the grid points along the time axis, then the data
(with redundancy) can be expressed as
(F(u1 ), I(u1 ), F(u2 ), I(u2 ), · · · , F(∞)).
Denote the observed values of the above random variables by lower cases. Then the likelihood
of the parameter λ0 (t) and β can be written as
P [F(u1 ) = f (u1 ); λ0 (·), β] × P [I(u1 ) = i(u1 )|F(u1 ) = f (u1 ); λ0 (·), β]
×P [F(u2 ) = f (u2 )|F(u1 ) = f (u1 ), I(u1 ) = i(u1 ); λ0 (·), β]
×P [I(u2 ) = i(u2 )|F(u1 ) = f (u1 ), I(u1 ) = i(u1 ), F(u2 ) = f (u2 ); λ0 (·), β]
×···
and the last term can be simplified as
P [I(u2 ) = i(u2 )|F(u1 ) = f (u1 ), I(u1 ) = i(u1 ), F(u2 ) = f (u2 ); λ0 (·), β]
= P [I(u2 ) = i(u2 )|F(u2 ) = f (u2 ); λ0 (·), β],
because F(u2 ) already records everything contained in F(u1 ) and I(u1 ).
That is, the full likelihood can be written as the product of a series of conditional likelihoods.
The partial likelihood (as defined by D.R. Cox) consists of the product of every other conditional
probability in the above representation. That is,
$$PL = \prod_{\{\text{all grid pt } u\}} P[I(u) = i(u) \mid F(u) = f(u); \lambda_0(\cdot), \beta].$$
Suppose we have the following small data set; we will work out its partial likelihood:
Patient ID x δ z
1 2 1 2
2 2 0 2
3 3 1 1
4 4 1 3
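It turns out that the partial likelihood is
$$PL(\beta) = \frac{e^{2\beta}}{e^{2\beta} + e^{2\beta} + e^{\beta} + e^{3\beta}} \times \frac{e^{\beta}}{e^{\beta} + e^{3\beta}} \times \frac{e^{3\beta}}{e^{3\beta}}. \tag{6.2}$$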
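As a quick check of (6.2), the sketch below evaluates the three factors one death time at a time. The function name is hypothetical. Patient 2, censored at x = 2, still appears in the first denominator because the at-risk indicator is Yi(u) = I(Xi ≥ u).

```python
import math

def partial_likelihood_62(beta):
    """Evaluate (6.2) factor by factor for the four-patient data set."""
    e = math.exp
    # Death of patient 1 (z = 2) at x = 2; all four patients are at risk.
    term_x2 = e(2 * beta) / (e(2 * beta) + e(2 * beta) + e(beta) + e(3 * beta))
    # Death of patient 3 (z = 1) at x = 3; patients 3 and 4 are at risk.
    term_x3 = e(beta) / (e(beta) + e(3 * beta))
    # Death of patient 4 (z = 3) at x = 4; only patient 4 is at risk.
    term_x4 = e(3 * beta) / e(3 * beta)
    return term_x2 * term_x3 * term_x4

pl_at_zero = partial_likelihood_62(0.0)   # (1/4) * (1/2) * 1
```

At β = 0 every subject at risk is equally likely to be the one who dies, which is why the factors reduce to 1/4, 1/2 and 1.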
In general, we have to consider two cases in calculating the above partial likelihood.
Case 1: Suppose conditional on F(u) we have dN (u) = 0. That is, no death is observed at
time u. In such a case, I(u) = 0 with probability 1.
Hence for any grid point u where dN (u) = 0, we have
P [I(u) = 0|F(u) = f (u)] = 1.
Therefore, the partial likelihood is not affected at any point u such that dN (u) = 0.
Case 2: dN (u) = 1. Conditional on F(u), if we know that one individual dies at time u,
then it must be one of the individuals still at risk (alive and not censored) at time u; i.e.,
among the following individuals
{i : Yi (u) = 1}.
Also conditional on F(u), we know the covariate vector zi associated to each individual i
such that Yi (u) = 1. Therefore, we ask the following question:

Among the $Y(u) = \sum_{i=1}^n Y_i(u)$ individuals at risk, what is the probability that the observed
death happened to the ith subject (who is actually observed to die at u) rather
than to the other patients?

Unlike the null hypothesis case for the two-sample problem, the probabilities of choosing
these subjects are not equally likely, but rather, they are proportional to their cause-
specific hazard of dying at time u, which can be derived as follows:
Let Ai = the event that subject i is going to die in [u, u + ∆u) given that he/she is still
alive at u. If a patient is not at risk at u (i.e., Yi (u) = 0), then Ai = ∅ (the empty event). Since we chose ∆u
to be so small that there is at most one death in [u, u + ∆u), we know that
A1 , A2 , · · · , An are mutually exclusive.
Because of the independence of survival times and censoring times, those Y (u) patients
who are at risk at u (not censored and still alive at u) make up a random sample of the
subpopulation consisting of the patients who will survive up to u (and with the same
covariate value). Under independent censoring assumption, we already showed in Chapter
3 that the cause-specific hazard is the same as the hazard of interest; i.e.,
$$\lambda(u, \delta_i = 1 \mid z_i) = \lambda(u \mid z_i).$$
Since ∆u is chosen to be very small,
$$P[A_i] \approx Y_i(u)\,\lambda(u, \delta_i = 1 \mid z_i)\,\Delta u = Y_i(u)\,\lambda(u \mid z_i)\,\Delta u = Y_i(u)\,\lambda_0(u)\exp(z_i^T\beta)\,\Delta u,$$
where the last equality is due to the assumption of the Cox model. Therefore
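$$\begin{aligned}
P[I(u) = i(u) \mid F(u) = f(u); \lambda_0(\cdot), \beta]
&= P[A_{i(u)} \mid A_1 \cup \cdots \cup A_n] \\
&= \frac{P[A_{i(u)}]}{\sum_{l=1}^n P[A_l]} \\
&\approx \frac{\lambda_0(u)\exp(z_{i(u)}^T\beta)\,\Delta u}{\sum_{l=1}^n \lambda_0(u)\exp(z_l^T\beta)\,Y_l(u)\,\Delta u} \\
&= \frac{\exp(z_{i(u)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)\,Y_l(u)}.
\end{aligned}$$
Here $Y_{i(u)}(u) = 1$ since we know this patient had to be at risk at u (since we know that
this patient died in [u, u + ∆u)).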
Combining these cases, the partial likelihood can be written as
$$PL(\beta) = \prod_{\{\text{all grid pt } u\}} \left[ \frac{\exp(z_{i(u)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)\,Y_l(u)} \right]^{dN(u)}.$$
Remark: To be formal, we need to define z0 even though it is never used. We can, for
example, take z0 = 0.
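Other equivalent ways of writing the partial likelihood include the following. Let t1 , · · · , td denote the
distinct death times; then
$$PL(\beta) = \prod_{j=1}^{d} \frac{\exp(z_{i(t_j)}^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)\,Y_l(t_j)};$$
$$PL(\beta) = \prod_{i=1}^{n} \prod_{\{\text{all grid pt } u\}} \left[ \frac{\exp(z_i^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)\,Y_l(u)} \right]^{dN_i(u)};$$
$$PL(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp(z_i^T\beta)}{\sum_{l=1}^n \exp(z_l^T\beta)\,Y_l(x_i)} \right]^{\delta_i}.$$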
Remark: Stare at these different representations for a while and you will convince yourself that
they are all equivalent.
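The equivalence can also be confirmed numerically. The sketch below implements two of the representations for a single covariate, the product over subjects with exponent δi and the product over distinct death times, and compares them on the four-patient data set given earlier. The function names are hypothetical, and the death-time version assumes no two deaths share the same time.

```python
import math

def pl_delta_form(x, delta, z, beta):
    """prod_i [exp(z_i b) / sum_l exp(z_l b) Y_l(x_i)]^{delta_i}, one covariate."""
    value = 1.0
    for xi, di, zi in zip(x, delta, z):
        if di == 1:  # censored subjects contribute a factor of 1
            denom = sum(math.exp(zl * beta) for xl, zl in zip(x, z) if xl >= xi)
            value *= math.exp(zi * beta) / denom
    return value

def pl_death_time_form(x, delta, z, beta):
    """Product over distinct death times t_1 < ... < t_d (assumes no tied deaths)."""
    death_times = sorted({xi for xi, di in zip(x, delta) if di == 1})
    value = 1.0
    for t in death_times:
        # Index i(t) of the single subject observed to die at time t.
        i_t = next(i for i, (xi, di) in enumerate(zip(x, delta)) if xi == t and di == 1)
        denom = sum(math.exp(zl * beta) for xl, zl in zip(x, z) if xl >= t)
        value *= math.exp(z[i_t] * beta) / denom
    return value

# The four-patient data set from the text: (x, delta, z).
x, delta, z = [2, 2, 3, 4], [1, 0, 1, 1], [2, 2, 1, 3]

pairs = [(pl_delta_form(x, delta, z, b), pl_death_time_form(x, delta, z, b))
         for b in (-1.0, 0.0, 0.5)]
```

Both functions return 1/8 at β = 0, in agreement with (6.2).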
The importance of using the partial likelihood is that this function depends only on β,
the parameter of interest, and is free of the baseline hazard λ0 (t), which is an infinite-dimensional
nuisance function.
Cox suggested treating P L as a regular likelihood function and making inference on β accordingly.
For example, we maximize the partial likelihood to get the estimate of β, often called the
MPLE (maximum partial likelihood estimate), and use the negative of the second derivative of the
log partial likelihood as the information matrix, etc.
Properties of the score of the partial likelihood
For ease of presentation, let us focus on one covariate case. The extension is straightforward.
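Obviously, the log partial likelihood function of β is
$$\ell(\beta) = \sum_{\{\text{all grid pts } u\}} dN(u) \left[ z_{I(u)}\beta - \log\left( \sum_{l=1}^n \exp(z_l\beta)\,Y_l(u) \right) \right].$$
The score function is
$$U(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{\{\text{all grid pts } u\}} dN(u) \left[ z_{I(u)} - \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} \right],$$
and the second derivative is
$$\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = -\sum_{u} dN(u) \left[ \frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} - \left( \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} \right)^2 \right].$$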
Define
$$\bar{z}(u, \beta) = \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} = \sum_{l=1}^n z_l w_l,$$
where
$$w_l = \frac{\exp(z_l\beta)\,Y_l(u)}{\sum_{k=1}^n \exp(z_k\beta)\,Y_k(u)}$$
is the weight that is proportional to the hazard of the individual failing. So $\bar{z}(u, \beta)$ can be
interpreted as the weighted average of the covariate z among those individuals still at risk at
time u with weights $w_l$.
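Define
$$\begin{aligned}
V_z(u, \beta) &= \frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} - \left( \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} \right)^2 \\
&= \frac{\sum_{l=1}^n z_l^2 \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} - (\bar{z}(u, \beta))^2 \\
&= \sum_{l=1}^n z_l^2 w_l - (\bar{z}(u, \beta))^2.
\end{aligned}$$
This can be shown to be equal to
$$V_z(u, \beta) = \sum_{l=1}^n \frac{(z_l - \bar{z}(u, \beta))^2 \exp(z_l\beta)\,Y_l(u)}{\sum_{k=1}^n \exp(z_k\beta)\,Y_k(u)} = \sum_{l=1}^n (z_l - \bar{z}(u, \beta))^2 w_l.$$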
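The weights, the weighted mean and both forms of the weighted variance can be computed directly at a single time point. The sketch below is illustrative only (the function name is hypothetical); it uses the covariates of the four-patient data set at u = 2, where all four patients are at risk.

```python
import math

def risk_set_moments(z, at_risk, beta):
    """Return (w, zbar, v_raw, v_centered) at one time point, single covariate.

    z       : covariate values z_l
    at_risk : indicators Y_l(u)
    """
    raw = [math.exp(zl * beta) * yl for zl, yl in zip(z, at_risk)]
    total = sum(raw)
    w = [r / total for r in raw]                          # weights w_l
    zbar = sum(zl * wl for zl, wl in zip(z, w))           # weighted mean
    v_raw = sum(zl ** 2 * wl for zl, wl in zip(z, w)) - zbar ** 2
    v_centered = sum((zl - zbar) ** 2 * wl for zl, wl in zip(z, w))
    return w, zbar, v_raw, v_centered

z = [2, 2, 1, 3]           # covariates of the four patients
Y_at_2 = [1, 1, 1, 1]      # everyone is at risk at u = 2

w0, zbar0, v_raw0, v_cen0 = risk_set_moments(z, Y_at_2, beta=0.0)
w1, zbar1, v_raw1, v_cen1 = risk_set_moments(z, Y_at_2, beta=0.3)
```

With β = 0 the weights are all 1/4, so the weighted mean is the plain average 2 and the weighted variance is 0.5. A positive β shifts weight toward larger z.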
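This last representation says that $V_z(u, \beta)$ can be interpreted as the weighted variance of the
covariates among those individuals still at risk at u, and hence $V_z(u, \beta) \ge 0$, with strict inequality
whenever the individuals at risk at u do not all share the same covariate value. Consequently,
provided this holds at one or more death times,
$$\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = -\sum_{u} dN(u)\,V_z(u, \beta) < 0.$$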
The above property can also be displayed graphically. For example, the partial likelihood
function (6.2) looks like:
Figure 6.1: The partial likelihood (6.2) as a function of beta (horizontal axis: beta from −4 to 4; vertical axis: partial likelihood from 0.00 to 0.15).
Therefore ℓ(β) is strictly concave, so its maximizer, when it exists, is unique and can be obtained
by solving the following partial likelihood equation:
$$U(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} = \sum_{\{\text{all grid pts } u\}} dN(u) \left[ z_{I(u)} - \frac{\sum_{l=1}^n z_l \exp(z_l\beta)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta)\,Y_l(u)} \right] = 0.$$
This maximizer β̂ defines the MPLE of β.
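One common way to solve the partial likelihood equation is Newton-Raphson iteration, which updates β by U(β) divided by the negative second derivative. The notes do not prescribe a particular algorithm, so the following is only a sketch under that assumption, for a single covariate with no tied deaths. The function names are hypothetical, and the data are the four patients from the small example.

```python
import math

def score_and_information(x, delta, z, beta):
    """Return U(beta) and -l''(beta) for one covariate (no tied deaths assumed)."""
    U, J = 0.0, 0.0
    for xi, di, zi in zip(x, delta, z):
        if di != 1:
            continue  # only observed deaths contribute
        # Risk set at time x_i: subjects with X_l >= x_i.
        risk = [(zl, math.exp(zl * beta)) for xl, zl in zip(x, z) if xl >= xi]
        s0 = sum(e for _, e in risk)
        s1 = sum(zl * e for zl, e in risk)
        s2 = sum(zl * zl * e for zl, e in risk)
        zbar = s1 / s0
        U += zi - zbar                # z_I(u) - zbar(u, beta)
        J += s2 / s0 - zbar ** 2      # V_z(u, beta)
    return U, J

def mple(x, delta, z, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson: beta <- beta + U(beta) / J(beta) until the step is tiny."""
    for _ in range(max_iter):
        U, J = score_and_information(x, delta, z, beta)
        step = U / J
        beta += step
        if abs(step) < tol:
            break
    return beta

x, delta, z = [2, 2, 3, 4], [1, 0, 1, 1], [2, 2, 1, 3]
beta_hat = mple(x, delta, z)
U_hat, J_hat = score_and_information(x, delta, z, beta_hat)
```

For these data U(0) = −1 and the iteration settles at a negative value of β, consistent with the peak of the curve in Figure 6.1 lying to the left of zero.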
Terminology: The quantity
$$-\frac{\partial^2 \ell(\beta)}{\partial \beta^2} = \sum_{u} dN(u)\,V_z(u, \beta)$$
is defined as the partial likelihood observed information and is denoted by J(β).
Ultimately, we want to show that the MPLE β̂ has nice statistical properties. These include:
• Consistency: That is, β̂ will converge to the true value of β which generated the data as
the sample size gets larger. We call this true value β0 .
• Asymptotic Normality: β̂ will be approximately normally distributed with mean β0 and a
variance which can be estimated from the data. This approximation will be better as the
sample size gets larger. This result is useful in making inference for the true β.
• Efficiency: Among competing estimators for β, the MPLE has the smallest variance,
at least when the sample size is large.
In order to show these properties for β̂, we expand U (β̂) at the true value β0 using a Taylor
expansion:
$$0 = U(\hat\beta) \approx U(\beta_0) + \frac{\partial U(\beta_0)}{\partial \beta}(\hat\beta - \beta_0).$$
Since
$$\frac{\partial U(\beta_0)}{\partial \beta} = \frac{\partial^2 \ell(\beta_0)}{\partial \beta^2} = -J(\beta_0),$$
we have
$$(\hat\beta - \beta_0) \approx [J(\beta_0)]^{-1} U(\beta_0).$$
This expression indicates that we need to investigate the properties of the score function U (β0 ):
$$U(\beta_0) = \sum_{u} dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right].$$
Properties of the score:
(1) E[U (β0 )] = 0.
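By the linearity of expectation,
$$E[U(\beta_0)] = E\left[ \sum_{u} dN(u) \left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \right] = \sum_{u} E\left[ dN(u) \left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \right],$$
and by the law of iterated expectations,
$$E\left[ dN(u) \left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \right] = E\left[ E\left[ dN(u) \left( z_{I(u)} - \bar{z}(u, \beta_0) \right) \,\middle|\, F(u) \right] \right].$$
Conditional on F(u), dN (u) and z̄(u, β0 ) are both known. Consequently the inner expectation
can be written as
$$dN(u) \left[ E[z_{I(u)} \mid F(u)] - \bar{z}(u, \beta_0) \right].$$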
Remember that I(u) is the patient identifier for the individual that dies at time u and is set
to zero if no one dies at u. If no one dies at u, then dN (u) = 0, and hence the above quantity is
zero. If someone dies at u, then dN (u) = 1, and conditional on F(u), we know it has to be one
of the Y (u) people at risk at time u; i.e., I(u) must be one of the values {i : Yi (u) = 1}.
The conditional distribution of zI(u) given F(u) can be derived through the conditional
distribution of I(u) given F(u) as shown in Table 6.1.
Therefore
$$E[z_{I(u)} \mid F(u)] = \sum_{l=1}^n z_l w_l = \frac{\sum_{l=1}^n z_l \exp(z_l\beta_0)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta_0)\,Y_l(u)} = \bar{z}(u, \beta_0).$$
From this, we immediately get
E[U (β0 )] = 0.
Table 6.1: Conditional distribution of z_{I(u)} given F(u)

Values of I(u)    Values of z_{I(u)}    Probability
1                 z_1                   exp(z_1 β0) Y_1(u) / Σ_{l=1}^n exp(z_l β0) Y_l(u) = w_1
2                 z_2                   exp(z_2 β0) Y_2(u) / Σ_{l=1}^n exp(z_l β0) Y_l(u) = w_2
...               ...                   ...
n                 z_n                   exp(z_n β0) Y_n(u) / Σ_{l=1}^n exp(z_l β0) Y_l(u) = w_n
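Note: From the conditional distribution of $z_{I(u)}$ given F(u), it is easy to see that the conditional
variance of $z_{I(u)}$ is
$$\begin{aligned}
\mathrm{Var}[z_{I(u)} \mid F(u)] &= \sum_{l=1}^n \left( z_l - E[z_{I(u)} \mid F(u)] \right)^2 w_l \\
&= \frac{\sum_{l=1}^n (z_l - \bar{z}(u, \beta_0))^2 \exp(z_l\beta_0)\,Y_l(u)}{\sum_{l=1}^n \exp(z_l\beta_0)\,Y_l(u)} \\
&= V_z(u, \beta_0).
\end{aligned}$$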
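(2) Finding an unbiased estimate for the variance of U (β0 )
Since E[U (β0 )] = 0,
$$\begin{aligned}
\mathrm{Var}[U(\beta_0)] &= E\left[ U(\beta_0)^2 \right] \\
&= E\left[ \left( \sum_{u} dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right] \right)^2 \right] \\
&= E\left[ \sum_{u} \left\{ dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right] \right\}^2 \right] \\
&\quad + E\left[ \sum_{u \ne u'} \left\{ dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right] \right\} \left\{ dN(u') \left[ z_{I(u')} - \bar{z}(u', \beta_0) \right] \right\} \right].
\end{aligned}$$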
As usual, we will take an arbitrary cross-product and show it has zero expectation. Assume
u' > u and denote
$$A(u) = dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right], \qquad A(u') = dN(u') \left[ z_{I(u')} - \bar{z}(u', \beta_0) \right].$$
Then the expectation of the cross-product is
$$E[A(u)A(u')] = E\left[ E[A(u)A(u') \mid F(u')] \right].$$
Since u' > u, conditional on F(u'), A(u) is known. So, using the result of property (1) that
E[A(u') | F(u')] = 0,
$$E[A(u)A(u') \mid F(u')] = A(u)\,E[A(u') \mid F(u')] = 0.$$
Therefore
$$\mathrm{Var}[U(\beta_0)] = E\left[ \sum_{u} A^2(u) \right] = \sum_{u} E\left[ A^2(u) \right] = \sum_{u} E\left[ E\left[ A^2(u) \mid F(u) \right] \right].$$
The inner conditional expectation is
$$E\left[ A^2(u) \mid F(u) \right] = E\left[ \left\{ dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right] \right\}^2 \,\middle|\, F(u) \right].$$
Since we pick the grid points in our partition of time fine enough so that dN (u) is either 0
or 1, we have $dN^2(u) = dN(u)$. Hence
$$E\left[ A^2(u) \mid F(u) \right] = E\left[ dN(u) \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right]^2 \,\middle|\, F(u) \right].$$
Conditional on F(u), dN (u) is known, z̄(u, β0 ) is also known, and from Table 6.1
$$\bar{z}(u, \beta_0) = E[z_{I(u)} \mid F(u)].$$
Therefore
$$\begin{aligned}
E\left[ A^2(u) \mid F(u) \right] &= dN(u)\, E\left[ \left[ z_{I(u)} - \bar{z}(u, \beta_0) \right]^2 \,\middle|\, F(u) \right] \\
&= dN(u)\, \mathrm{Var}[z_{I(u)} \mid F(u)] \\
&= dN(u)\, V_z(u, \beta_0).
\end{aligned}$$
Consequently,
$$\mathrm{Var}[U(\beta_0)] = \sum_{u} E\left[ dN(u)\, V_z(u, \beta_0) \right] = E\left[ \sum_{u} dN(u)\, V_z(u, \beta_0) \right].$$
Note that the quantity $\sum_u dN(u)V_z(u, \beta_0)$ is a statistic (it can be calculated from the observed
data), so $\sum_u dN(u)V_z(u, \beta_0)$ is an unbiased estimate of Var [U (β0 )]. In fact, $\sum_u dN(u)V_z(u, \beta_0)$
is the partial likelihood observed information J(β0 ) we defined before.
Conclusion
The score $U(\beta_0) = \sum_u A(u)$ is a sum of conditionally uncorrelated mean zero random variables
and its variance can be unbiasedly estimated by
$$J(\beta_0) = \sum_{u} dN(u)\, V_z(u, \beta_0).$$
By the martingale CLT, we have:
$$U(\beta_0) \overset{a}{\sim} N(0, J(\beta_0)).$$
Previously, we have shown that
$$(\hat\beta - \beta_0) \approx [J(\beta_0)]^{-1} U(\beta_0).$$
Treating J(β0 ) as a constant, we get the approximate distribution of (β̂ − β0 ):
$$(\hat\beta - \beta_0) \overset{a}{\sim} N(0, J^{-1}(\beta_0)).$$
Of course, in practice, β0 is unknown. But we can substitute β̂ for β0 and use $J^{-1}(\hat\beta)$ as the
estimated variance of β̂. That is, we use the following approximate distribution for (β̂ − β0 ):
$$(\hat\beta - \beta_0) \overset{a}{\sim} N(0, J^{-1}(\hat\beta)),$$
where
$$J(\hat\beta) = \sum_{u} dN(u)\, V_z(u, \hat\beta),$$
and β̂ is the MPLE of β solving the following equation:
$$U(\hat\beta) = \sum_{u} dN(u) \left[ z_{I(u)} - \bar{z}(u, \hat\beta) \right] = 0.$$
Inference with a Single Covariate
Assume a proportional hazards model with a single covariate z:
$$\lambda(t \mid z) = \lambda_0(t)\, e^{z\beta}.$$
After we get our data (xi , δi , zi ), we can obtain the MPLE β̂ by solving the partial likelihood
equation; i.e., setting the partial score to zero. Then asymptotically,
$$\hat\beta \overset{a}{\sim} N(\beta_0, J^{-1}(\hat\beta)).$$
We can use this fact to construct a confidence interval for β and test the hypothesis H0 : β = β0 ,
etc. For example, a (1 − α) CI for β is
$$\hat\beta \pm z_{\alpha/2}\, [J^{-1}(\hat\beta)]^{1/2}.$$
Myelomatosis data revisited: We analyzed the myelomatosis data and did not find a statistically
significant difference between treatments 1 and 2. We want to quantify the difference by assuming
the hazards of these two treatments are proportional to each other. Define a treatment indicator
trt1 which takes value 0 for treatment 1 and takes value 1 for treatment 2. Then we can use
Proc Phreg for this purpose.
proc phreg data=myel;
   /* dur is the follow-up time; status = 0 marks a censored observation */
   model dur*status(0)=trt1;
run;
Part of the output is given as follows:
16:43 Thursday, March 2, 2000

The PHREG Procedure

Data Set: WORK.MYEL
Dependent Variable: DUR
Censoring Variable: STATUS
Censoring Value(s): 0
Ties Handling: BRESLOW

Summary of the Number of Event and Censored Values

Total    Event    Censored    Percent Censored
25       17       8           32.00

Testing Global Null Hypothesis: BETA=0

Criterion    Without Covariates    With Covariates    Model Chi-Square
-2 LOG L     94.084                92.765             1.319 with 1 DF (p=0.2508)
Score        .                     .                  1.297 with 1 DF (p=0.2547)
Wald         .                     .                  1.263 with 1 DF (p=0.2610)

Analysis of Maximum Likelihood Estimates

Variable    DF    Parameter Estimate    Standard Error    Wald Chi-Square    Pr > Chi-Square    Risk Ratio
TRT1        1     0.572807              0.50960           1.26344            0.2610             1.773
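So β̂ = 0.5728 with standard error 0.5096. This means that compared to treatment 1,
treatment 2 is estimated to increase the hazard of dying at any time by 77% (exp(β̂) − 1 = 0.773).
A 95% CI for β is
$$\hat\beta \pm 1.96 \times \mathrm{se}[\hat\beta] = 0.5728 \pm 1.96 \times 0.5096 = [-0.426, 1.572],$$
and a 95% CI for the hazard ratio exp(β) is
$$[e^{-0.426}, e^{1.572}] = [0.653, 4.816].$$
Both intervals include the null value (0 for β, 1 for the hazard ratio), in line with the
non-significant tests in the output.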
Note: The output also gives three tests for H0 : β = 0: likelihood ratio, score and Wald tests.
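The interval arithmetic can be reproduced from the two numbers reported by PHREG. The sketch below is illustrative: the variable names are hypothetical, and 1.96 is used as the approximate 97.5th percentile of the standard normal distribution, as in the text.

```python
import math

beta_hat = 0.572807   # parameter estimate for TRT1 from the PHREG output
se = 0.50960          # its standard error from the same output
z_crit = 1.96         # approximate 97.5th percentile of N(0, 1)

# 95% Wald confidence interval for beta.
lower = beta_hat - z_crit * se
upper = beta_hat + z_crit * se

# Hazard ratio and its 95% CI, obtained by exponentiating the endpoints.
hazard_ratio = math.exp(beta_hat)
hr_ci = (math.exp(lower), math.exp(upper))

# Wald chi-square statistic for H0: beta = 0.
wald_chi_square = (beta_hat / se) ** 2
```

Exponentiating the unrounded upper endpoint gives a value slightly below the 4.816 quoted in the text, which exponentiates the rounded endpoint 1.572.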
Comparison of score test and two-sample log rank test
Assume z is the dichotomous indicator for treatment; i.e.,
$$z = \begin{cases} 1 & \text{for treatment 1} \\ 0 & \text{for treatment 0} \end{cases},$$
and the proportional hazards model:
$$\lambda(t \mid z) = \lambda_0(t)\, e^{z\beta}.$$
Score test: Under H0 : β = 0, the score U (0) (evaluated under H0 ) has the distribution
$$U(0) \overset{a}{\sim} N(0, J(0)).$$
Or equivalently,
$$\left[ \frac{U(0)}{J^{1/2}(0)} \right]^2 \overset{a}{\sim} \chi^2_1.$$
The score U (0) has the expression
$$U(0) = \sum_{u} dN(u) \left[ z_{I(u)} - \bar{z}(u, 0) \right].$$
Then:
1. If a death occurs at time u, then dN (u) = 1, in which case there will be a contribution of
$[z_{I(u)} - \bar{z}(u, 0)]$ to U (0). Otherwise there is no contribution.
2. Since z = 1 for treatment 1 and z = 0 for treatment 0, $z_{I(u)}$ is then the number of deaths
at time u from treatment 1.
3. Under H0 : β = 0, z̄(u, 0) simplifies to
$$\bar{z}(u, 0) = \frac{\sum_{l=1}^n z_l Y_l(u)}{\sum_{l=1}^n Y_l(u)},$$
which is the proportion of individuals in group 1 among those at risk at time u. Since
we only assume one death at time u, this proportion is the expected number of deaths
for treatment 1 among those at risk at time u, under the null hypothesis of no treatment
difference.
4. Therefore, U (0) is the sum over the death times of the observed number of deaths from
treatment 1 minus the expected number of deaths under the null hypothesis. This was the
numerator of the two-sample log rank test:
$$\sum_{u} \left[ dN_1(u) - \frac{Y_1(u)}{Y(u)}\, dN(u) \right],$$
where $dN_1(u)$ = # of observed deaths from treatment 1, $Y_1(u)$ = # at risk at time u from
treatment 1, Y (u) = total # at risk at time u from the 2 treatments, dN (u) = total # of
deaths from the 2 treatments.
5. The denominator of the score test was computed as
$$J^{1/2}(0) = \left[ \sum_{u} dN(u)\, V_z(u, 0) \right]^{1/2},$$
where
$$V_z(u, 0) = \frac{\sum_{l} [z_l - \bar{z}(u, 0)]^2\, Y_l(u)}{\sum_{l} Y_l(u)}.$$
Note: Among the Y (u) individuals at risk at time u, there are $Y_1(u)$ individuals with
$z_l = 1$ and $Y_0(u)$ individuals with $z_l = 0$. We already argued that
$$\bar{z}(u, 0) = \frac{Y_1(u)}{Y(u)}.$$
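Therefore,
$$\begin{aligned}
V_z(u, 0) &= \frac{\sum_{l} [z_l - \bar{z}(u, 0)]^2\, Y_l(u)}{\sum_{l} Y_l(u)} \\
&= \frac{\left[ 1 - \frac{Y_1(u)}{Y(u)} \right]^2 Y_1(u) + \left[ 0 - \frac{Y_1(u)}{Y(u)} \right]^2 Y_0(u)}{Y(u)} \qquad (z_l \text{ takes 1 or 0}) \\
&= \frac{\frac{Y_0^2(u)\,Y_1(u)}{Y^2(u)} + \frac{Y_1^2(u)\,Y_0(u)}{Y^2(u)}}{Y(u)} \qquad (Y_1(u) + Y_0(u) = Y(u)) \\
&= \frac{Y_0(u)\,Y_1(u)\,Y(u)}{Y^3(u)} \\
&= \frac{Y_0(u)\,Y_1(u)}{Y^2(u)}.
\end{aligned}$$
Therefore,
$$J(0) = \sum_{u} dN(u)\, \frac{Y_0(u)\,Y_1(u)}{Y^2(u)}.$$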
Let us contrast this with the variance used to compute the logrank test statistic:
$$\sum_{u} \left[ \frac{Y_1(u)\,Y_0(u)\,dN(u)\,[Y(u) - dN(u)]}{Y^2(u)\,[Y(u) - 1]} \right].$$
Note: In the special case where dN (u) can only be one or zero, the above expression reduces to
$$\sum_{u} \left[ \frac{Y_1(u)\,Y_0(u)\,dN(u)\,[Y(u) - 1]}{Y^2(u)\,[Y(u) - 1]} \right] = \sum_{u} \left[ \frac{Y_1(u)\,Y_0(u)\,dN(u)}{Y^2(u)} \right],$$
which is exactly equal to J(0).
Therefore, we have demonstrated that, with continuous survival time data with no ties, the score
test of the hypothesis H0 : β = 0 in the proportional hazards model is exactly the same as the
logrank test for a dichotomous covariate z.
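The identity between the score test and the log rank test can be checked on a small sample. In the sketch below the data are hypothetical (eight subjects, two groups, no tied times) and the function names are illustrative. One function follows the partial likelihood formulas for U(0) and J(0); the other follows the log rank formulas for observed minus expected deaths and the variance.

```python
def score_test_pieces(x, delta, z):
    """U(0) and J(0) from the partial likelihood formulas (no tied deaths assumed)."""
    U, J = 0.0, 0.0
    for xi, di, zi in zip(x, delta, z):
        if di != 1:
            continue
        risk_z = [zl for xl, zl in zip(x, z) if xl >= xi]   # covariates in the risk set
        zbar = sum(risk_z) / len(risk_z)                    # zbar(u, 0)
        U += zi - zbar
        J += sum((zl - zbar) ** 2 for zl in risk_z) / len(risk_z)   # V_z(u, 0)
    return U, J

def logrank_pieces(x, delta, z):
    """Observed minus expected deaths in group 1, and the log rank variance."""
    num, var = 0.0, 0.0
    for t in sorted({xi for xi, di in zip(x, delta) if di == 1}):
        Y1 = sum(1 for xl, zl in zip(x, z) if xl >= t and zl == 1)
        Y0 = sum(1 for xl, zl in zip(x, z) if xl >= t and zl == 0)
        Y = Y1 + Y0
        dN1 = sum(1 for xl, dl, zl in zip(x, delta, z) if xl == t and dl == 1 and zl == 1)
        dN = sum(1 for xl, dl in zip(x, delta) if xl == t and dl == 1)
        num += dN1 - dN * Y1 / Y
        if Y > 1:   # with a single subject at risk the variance term is zero
            var += Y1 * Y0 * dN * (Y - dN) / (Y ** 2 * (Y - 1))
    return num, var

# Hypothetical two-group sample with distinct observed times.
x = [1.0, 2.5, 3.0, 4.2, 5.1, 6.0, 7.3, 8.8]
delta = [1, 1, 0, 1, 1, 0, 1, 1]
z = [1, 0, 1, 0, 1, 1, 0, 0]

U0, J0 = score_test_pieces(x, delta, z)
num, var = logrank_pieces(x, delta, z)
chi_square = U0 ** 2 / J0
```

The guard `Y > 1` only avoids a 0/0 expression at a death time where one subject remains at risk; the corresponding term of J(0) is zero as well.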
The score test
$$\left[ \frac{U(0)}{J^{1/2}(0)} \right]^2$$
can be used to test the hypothesis H0 : β = 0 for the model
$$\lambda(t \mid z) = \lambda_0(t)\, e^{z\beta}$$
for any covariate z, whether z is discrete or continuous. The null hypothesis H0 : β = 0
implies that the hazard rate at any time t is unaffected by the covariate z. This also implies
that the survival distribution does not depend on z. The alternative hypothesis HA : β ≠ 0
implies that the hazard rate increases or decreases (depending on the sign of β) as z increases
throughout all time. Therefore, belief in this alternative hypothesis would mean that individuals
with a higher value of z would have a stochastically larger (or smaller, depending on the sign of
β) survival distribution than individuals with a smaller value of z. The test statement
in Proc Lifetest computes the score test of the hypothesis H0 : β = 0 for the proportional
hazards model. Consequently, when using the test statement, the covariate z is not limited to
being dichotomous, nor to being discrete.
For example, we can test the treatment difference between treatments 1 and 2 for the
myelomatosis data using the following SAS commands:
proc lifetest data=myel;
   /* dur is the follow-up time; status = 0 marks a censored observation */
   time dur*status(0);
   test trt;
run;
and part of the output is presented in the following:

Univariate Chi-Squares for the LOG RANK Test

Variable    Test Statistic    Standard Deviation    Chi-Square    Pr > Chi-Square
TRT         -2.3376           2.0522                1.2975        0.2547

Covariance Matrix for the LOG RANK Statistics

Variable    TRT
TRT         4.21151

Forward Stepwise Sequence of Chi-Squares for the LOG RANK Test

Variable    DF    Chi-Square    Pr > Chi-Square    Chi-Square Increment    Pr > Increment
TRT         1     1.2975        0.2547             1.2975                  0.2547
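The chi-square in this output is the squared test statistic divided by its variance, which mirrors the form [U(0)]^2 / J(0) of the score test. A minimal sketch using only the numbers printed above follows; the variable names are hypothetical, and the p-value uses the identity P[χ²₁ > x] = erfc(√(x/2)) for a chi-square variable with one degree of freedom.

```python
import math

statistic = -2.3376    # log rank test statistic for TRT from the output
variance = 4.21151     # its variance from the covariance matrix in the output

chi_square = statistic ** 2 / variance

# Upper tail probability of a chi-square with 1 DF: P[chi2_1 > x] = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2.0))
```

The result agrees with the score chi-square of 1.297 (p = 0.2547) reported by Proc Phreg for the same data.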
Likelihood Ratio Test
As in the ordinary likelihood theory, the (partial) likelihood ratio test can also be used to
test the null hypothesis:
H0 : β = β0 .
Recall that ℓ(β) is the log partial likelihood. Intuitively, if H0 is true, then β̂, the MPLE
of β, should be close to β0 . Hence ℓ(β̂) should be close to ℓ(β0 ). Since ℓ(β̂) − ℓ(β0 ) is always
non-negative, we should reject H0 when this difference is large.
The likelihood ratio test uses the fact that
$$2\left[ \ell(\hat\beta) - \ell(\beta_0) \right] \overset{a}{\sim} \chi^2_1$$
under H0 : β = β0 .
Therefore, for a given level of significance α, we reject H0 : β = β0 if
$$2\left[ \ell(\hat\beta) - \ell(\beta_0) \right] \ge \chi^2_{1,\alpha},$$
where $\chi^2_{1,\alpha}$ is the value such that $P[\chi^2_1 > \chi^2_{1,\alpha}] = \alpha$.
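Expanding ℓ(β0 ) at the MPLE β̂, we get
$$\ell(\beta_0) \approx \ell(\hat\beta) + \frac{d\ell(\hat\beta)}{d\beta}(\beta_0 - \hat\beta) + \frac{1}{2!}\frac{d^2\ell(\hat\beta)}{d\beta^2}(\beta_0 - \hat\beta)^2.$$
Since the MPLE β̂ maximizes ℓ(β), i.e.,
$$U(\hat\beta) = \frac{d\ell(\hat\beta)}{d\beta} = 0,$$
and
$$\frac{d^2\ell(\hat\beta)}{d\beta^2} = -J(\hat\beta),$$
we have
$$2\left[ \ell(\hat\beta) - \ell(\beta_0) \right] \approx J(\hat\beta)(\hat\beta - \beta_0)^2.$$
We already derived that
$$(\hat\beta - \beta_0) \overset{a}{\sim} N(0, J^{-1}(\hat\beta)).$$
Therefore,
$$2\left[ \ell(\hat\beta) - \ell(\beta_0) \right] \approx J(\hat\beta)(\hat\beta - \beta_0)^2 = \left[ \frac{\hat\beta - \beta_0}{J^{-1/2}(\hat\beta)} \right]^2 \overset{a}{\sim} \chi^2_1 \quad \text{under } H_0 : \beta = \beta_0.$$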
Note: The SAS procedure Phreg can ONLY handle right censored data.
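The likelihood ratio statistic for the myelomatosis data can be recovered from the two −2 LOG L values in the PHREG output and compared with the Wald statistic. The sketch below is illustrative: the function and variable names are hypothetical, and the p-values use the identity P[χ²₁ > x] = erfc(√(x/2)), which holds for one degree of freedom.

```python
import math

def chi2_1_upper_tail(x):
    """P[chi-square with 1 DF > x], via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))

# -2 log partial likelihood without and with the covariate (PHREG output).
neg2logl_null = 94.084
neg2logl_fitted = 92.765

# Likelihood ratio statistic: 2 * [l(beta_hat) - l(0)].
lr_statistic = neg2logl_null - neg2logl_fitted
p_lr = chi2_1_upper_tail(lr_statistic)

# Wald statistic from the estimate and its standard error (PHREG output).
wald_statistic = (0.572807 / 0.50960) ** 2
p_wald = chi2_1_upper_tail(wald_statistic)
```

The two statistics are close, as the quadratic approximation in the derivation above suggests, and neither rejects H0 : β = 0 at the usual 5% level.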