STAT 425: Introduction to Nonparametric Statistics Winter 2018

Lecture 5: Survival Analysis


Instructor: Yen-Chi Chen

Note: in this lecture, we use $T_1, \dots, T_n$ to denote the response variables, all of which are positive random variables. They are called event times or death times. They typically record when something happens to each individual, e.g., the time at which the individual dies or contracts a disease.

5.1 Survival Function

We assume that our data consist of IID random variables $T_1, \dots, T_n \sim F$. The survival function S(t) of this population is defined as
\[ S(t) = P(T_1 > t) = 1 - F(t). \]
Namely, it is just one minus the corresponding CDF. Although this definition is simple and follows trivially from the CDF, we will see later that it is an elegant tool for modeling and interpreting the data.
In medical research, the quantity $T_i$ often refers to some time characteristic of individual i. For instance, $T_i$ may be the age at which individual i passes away. Then the survival function S(t) can be interpreted as the chance that an individual is still alive after age t. If S(60) = 0.8, then 80% of the individuals in the population will still be alive at age 60. Namely, S(t) is the probability that an individual survives past time t.
Here are some basic properties about S(t):

• S(0) = 1 and S(∞) = 0.

• S(t) is a non-increasing function.

A quantity that is often used along with the survival function is the hazard function. The hazard function is
\[ h(t) = \lim_{\Delta t \to 0} \frac{P(t < T_1 \le t + \Delta t \mid T_1 > t)}{\Delta t} = \frac{p(t)}{S(t)}, \]
where $p(t) = \frac{d}{dt} F(t)$ is the PDF of the random variable $T_1$. Note that you can also write the hazard function as
\[ h(t) = -\frac{\partial \log S(t)}{\partial t}. \]

How can we interpret the hazard function? The hazard function describes the ‘intensity of death’ at time t given that the individual has survived up to time t.
There is another quantity that is also common in survival analysis, the cumulative hazard function. The cumulative hazard function is
\[ H(t) = \int_0^t h(s)\,ds. \]


You can interpret H(t) as the cumulative amount of hazard up to time t. The cumulative hazard function and the survival function are linked as follows:
\[ H(t) = -\log S(t), \qquad S(t) = e^{-H(t)} = e^{-\int_0^t h(s)\,ds}. \]

Example 1. What are the survival function and hazard function of an exponential R.V.? Let $T_1 \sim \mathrm{Exp}(\lambda)$. Then
\[ p(t) = \lambda e^{-\lambda t}, \qquad F(t) = 1 - e^{-\lambda t} \quad \text{for } t \ge 0. \]
Thus,
\[ S(t) = e^{-\lambda t} \]
and
\[ h(t) = \lambda, \qquad H(t) = \lambda t. \]
Namely, for an exponential distribution, the hazard function is constant and the cumulative hazard is just a linear function of time.
Example 2 (Weibull distribution). The Weibull distribution has two parameters, λ and k, and is a distribution for positive random variables. Its PDF is
\[ p(t) = \lambda k (\lambda t)^{k-1} e^{-(\lambda t)^k}, \quad t \ge 0. \]
When k = 1, it reduces to the exponential distribution. Its CDF and survival function are
\[ F(t) = 1 - e^{-(\lambda t)^k}, \qquad S(t) = e^{-(\lambda t)^k}. \]
The hazard function and cumulative hazard function are
\[ h(t) = \frac{p(t)}{S(t)} = \lambda k (\lambda t)^{k-1}, \qquad H(t) = (\lambda t)^k. \]
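
The Weibull quantities in Example 2 can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names and the parameter values are hypothetical, and the hazard is taken to be $p(t)/S(t) = \lambda k (\lambda t)^{k-1}$, which follows from the PDF and survival function above.

```python
import math

def weibull_pdf(t, lam, k):
    """PDF p(t) = lam * k * (lam*t)^(k-1) * exp(-(lam*t)^k), for t > 0."""
    return lam * k * (lam * t) ** (k - 1) * math.exp(-((lam * t) ** k))

def weibull_survival(t, lam, k):
    """Survival function S(t) = exp(-(lam*t)^k)."""
    return math.exp(-((lam * t) ** k))

def weibull_hazard(t, lam, k):
    """Hazard h(t) = p(t) / S(t) = lam * k * (lam*t)^(k-1)."""
    return lam * k * (lam * t) ** (k - 1)

def weibull_cum_hazard(t, lam, k):
    """Cumulative hazard H(t) = (lam*t)^k."""
    return (lam * t) ** k

# Hypothetical parameter values, chosen only for illustration.
lam, k, t = 0.5, 2.0, 3.0

# h(t) should equal p(t) / S(t).
print(weibull_pdf(t, lam, k) / weibull_survival(t, lam, k), weibull_hazard(t, lam, k))

# H(t) should equal -log S(t).
print(-math.log(weibull_survival(t, lam, k)), weibull_cum_hazard(t, lam, k))

# With k = 1 the hazard is the constant lam, as for the exponential distribution.
print(weibull_hazard(7.0, lam, 1.0))
```

Each printed pair should agree up to floating-point error, which is a quick way to catch a wrong exponent in a hand-derived hazard.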

5.1.1 Estimating the Survival Function: Simple Method

How do we estimate the survival function? There are three methods. The first is a parametric approach: we assume a parametric model (e.g., an exponential distribution) for the data, estimate the parameter, and then plug it in to form the estimator of the survival function. The second approach is to compute the EDF first and then convert it into an estimator of the survival function. The last approach is a powerful nonparametric method called the Kaplan-Meier estimator, which we discuss in the next section.
Parametric Approach. Assume that we model the distribution as an exponential distribution with unknown parameter λ. An estimator of λ is (you can check HW01 to see why this is an estimator)
\[ \hat{\lambda} = \frac{1}{\bar{T}_n} = \frac{n}{\sum_{i=1}^n T_i}. \]
Since the exponential survival function is $S(t) = e^{-\lambda t}$, we then estimate the survival function using
\[ \hat{S}_1(t) = e^{-\hat{\lambda} t} = e^{-t/\bar{T}_n}, \quad t \ge 0. \]

EDF Approach. Recall that the EDF $\hat{F}(t)$ is
\[ \hat{F}(t) = \frac{1}{n} \sum_{i=1}^n I(T_i \le t). \]

Then the survival function can be estimated by
\[ \hat{S}_2(t) = 1 - \hat{F}(t) = \frac{1}{n} \sum_{i=1}^n I(T_i > t). \]
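
The two simple estimators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the data are hypothetical, the parametric fit assumes an exponential model with survival function $e^{-\lambda t}$ (Example 1), and every event time is assumed to be fully observed.

```python
import math

def survival_exponential(times):
    """Parametric estimate: lambda-hat = 1 / sample mean, then S(t) = exp(-lambda-hat * t)."""
    lam_hat = len(times) / sum(times)
    return lambda t: math.exp(-lam_hat * t)

def survival_edf(times):
    """EDF-based estimate: fraction of observations strictly greater than t."""
    n = len(times)
    return lambda t: sum(1 for ti in times if ti > t) / n

# Hypothetical event times with sample mean 4, so lambda-hat = 0.25.
times = [2.0, 3.0, 3.0, 5.0, 7.0]
s1 = survival_exponential(times)
s2 = survival_edf(times)

for t in (0.0, 4.0, 8.0):
    print(t, s1(t), s2(t))
```

The parametric curve is smooth and never reaches zero, while the EDF-based curve is a step function that drops to zero at the largest observation.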

5.1.2 Kaplan-Meier estimator

Let $t_1 < t_2 < \dots < t_m$ be the distinct time points at which the observations $T_1, \dots, T_n$ actually take values.
To see how the estimator is constructed, we do the following analysis. We partition the time axis into disjoint segments:
\[ B_0 = [0, t_1), \quad B_1 = [t_1, t_2), \quad \dots, \quad B_{m-1} = [t_{m-1}, t_m), \quad B_m = [t_m, \infty). \]
Then, with the convention $t_0 = 0$, we define
\[ N_\ell = \text{number of individuals alive (event has not yet happened) at the beginning of } B_\ell = \sum_{i=1}^n I(T_i \ge t_\ell) \]
and
\[ D_\ell = \text{number of individuals who die (event happens) in } B_\ell = \sum_{i=1}^n I(T_i \in B_\ell). \]

Now we have converted $T_1, \dots, T_n$ to $(N_0, D_0), \dots, (N_m, D_m)$. Formally, $N_\ell$ should be defined as the number of individuals at risk at the beginning of $B_\ell$. Later we will explain what ‘at risk’ means.
The Kaplan-Meier (KM) estimator estimates S(t) using
\[ \hat{S}_{KM}(t) = \prod_{\ell: t_\ell \le t} \left(1 - \frac{D_\ell}{N_\ell}\right). \]

What is the intuition behind the KM estimator? We now consider t in different time segments and see if we can gain some intuition. Recall that the survival function is
\[ S(t) = P(T > t) = \text{probability of surviving past time } t. \]
For $t \in B_0 = [0, t_1)$, no event happens within this interval, so $\hat{S}_{KM}(t) = 1$.
For $t \in B_1 = [t_1, t_2)$, the survival function is
\[ S(t) = P(T > t) = P(\text{survives past time } t) = P(\text{survives in } [0, t_1) \text{ and in } [t_1, t)) = P(\text{survives in } B_0 \text{ and in } B_1). \]
Now recall that for two events A and B, P(A and B) = P(A)P(B|A). Thus,
\[ S(t) = P(\text{survives in } B_0 \text{ and in } B_1) = P(\text{survives in } B_0)\, P(\text{survives in } B_1 \mid \text{survives in } B_0). \]
The probability $P(\text{survives in } B_1 \mid \text{survives in } B_0)$ can be estimated using
\[ \hat{P}(\text{survives in } B_1 \mid \text{survives in } B_0) = \frac{N_1 - D_1}{N_1} = 1 - \frac{D_1}{N_1}, \]
and because no event occurs in $B_0$, $P(\text{survives in } B_0) = 1$. Thus,
\[ \hat{S}_{KM}(t) = 1 \times \left(1 - \frac{D_1}{N_1}\right). \]

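Now for the next time segment $B_2$, we apply the same intuition. Namely, for $t \in B_2$,
\[ S(t) = P(\text{survives in } B_0)\, P(\text{survives in } B_1 \mid \text{survives in } B_0)\, P(\text{survives in } B_2 \mid \text{survives in } B_1), \]
where we can estimate $P(\text{survives in } B_2 \mid \text{survives in } B_1)$ via
\[ \hat{P}(\text{survives in } B_2 \mid \text{survives in } B_1) = 1 - \frac{D_2}{N_2}, \]
which leads to
\[ \hat{S}_{KM}(t) = 1 \times \left(1 - \frac{D_1}{N_1}\right) \times \left(1 - \frac{D_2}{N_2}\right). \]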

For the other segments, we can apply the same procedure to obtain the estimator. This gives you the intuition of how the KM estimator is constructed. This derivation can also be seen in http://pages.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/8_-_Survival_Analysis/8.2_-_Kaplan-Meier_Formula.pdf.
Note that when we observe every individual’s event time (namely, there is no censoring, a mechanism we will discuss later), the KM estimator and the EDF approach are the same.
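
The product construction can be sketched directly from the counts $N_\ell$ and $D_\ell$. The code below is a minimal illustration that assumes no censoring; the function names and the data are hypothetical. It also compares the result with the EDF-based estimate to illustrate that the two coincide in this setting.

```python
def kaplan_meier(times):
    """Return [(t_l, S_KM(t_l))] at each distinct event time, assuming no censoring."""
    curve = []
    surv = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)  # N_l: event has not happened before t_l
        deaths = sum(1 for ti in times if ti == t)   # D_l: events at t_l
        surv *= 1.0 - deaths / at_risk               # multiply by the factor (1 - D_l / N_l)
        curve.append((t, surv))
    return curve

def edf_survival(times, t):
    """EDF-based estimate: fraction of observations strictly greater than t."""
    return sum(1 for ti in times if ti > t) / len(times)

# Hypothetical event times (one tie at t = 3).
times = [2.0, 3.0, 3.0, 5.0, 7.0]
for t, s_km in kaplan_meier(times):
    print(t, round(s_km, 6), round(edf_survival(times, t), 6))
```

On this data the KM column and the EDF column agree at every event time, because each factor $1 - D_\ell/N_\ell$ equals the ratio of survivors after $t_\ell$ to survivors before it, so the product telescopes.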

5.1.3 Nelson-Aalen estimator

The Nelson-Aalen (NA) estimator is another powerful estimator of the survival function. It not only estimates the survival function but also provides an estimate of the cumulative hazard. In fact, the NA estimator first estimates the cumulative hazard function and then converts it into an estimate of the survival function using the relation $S(t) = e^{-H(t)}$. Here is the intuition behind how this estimator is constructed.
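Recall that the KM estimator uses
\[ \hat{S}_{KM}(t) = \prod_{\ell: t_\ell \le t} \left(1 - \frac{D_\ell}{N_\ell}\right) \]
as an estimate of S(t). When $D_\ell$ is much smaller than $N_\ell$, we have
\[ e^{-D_\ell/N_\ell} \approx 1 - \frac{D_\ell}{N_\ell}. \]
Therefore,
\[
\begin{aligned}
\hat{H}_{KM}(t) &= -\log \hat{S}_{KM}(t) \\
&= -\log \prod_{\ell: t_\ell \le t} \left(1 - \frac{D_\ell}{N_\ell}\right) \\
&= -\sum_{\ell: t_\ell \le t} \log\left(1 - \frac{D_\ell}{N_\ell}\right) \\
&\approx -\sum_{\ell: t_\ell \le t} \log e^{-D_\ell/N_\ell} \\
&= \sum_{\ell: t_\ell \le t} \frac{D_\ell}{N_\ell}.
\end{aligned}
\]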

Using the above derivation, the NA estimator estimates the cumulative hazard function by
\[ \hat{H}_{NA}(t) = \sum_{\ell: t_\ell \le t} \frac{D_\ell}{N_\ell} \]

and then estimates the survival function as
\[ \hat{S}_{NA}(t) = e^{-\hat{H}_{NA}(t)} = \exp\left(-\sum_{\ell: t_\ell \le t} \frac{D_\ell}{N_\ell}\right). \]
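
The NA construction (accumulate $D_\ell / N_\ell$, then exponentiate) can be sketched as follows. This is an illustrative sketch that assumes no censoring; the function name and the data are hypothetical.

```python
import math

def nelson_aalen(times):
    """Return [(t_l, H_NA(t_l), S_NA(t_l))] at each distinct event time (no censoring)."""
    curve = []
    cum_hazard = 0.0
    for t in sorted(set(times)):
        at_risk = sum(1 for ti in times if ti >= t)  # N_l
        deaths = sum(1 for ti in times if ti == t)   # D_l
        cum_hazard += deaths / at_risk               # add the increment D_l / N_l
        curve.append((t, cum_hazard, math.exp(-cum_hazard)))
    return curve

# Hypothetical event times (one tie at t = 3).
times = [2.0, 3.0, 3.0, 5.0, 7.0]
for t, h_na, s_na in nelson_aalen(times):
    print(t, round(h_na, 4), round(s_na, 4))
```

Because $e^{-x} \ge 1 - x$ for every x, each NA factor $e^{-D_\ell/N_\ell}$ is at least as large as the corresponding KM factor, so the NA survival estimate never falls below the KM estimate. In particular it stays strictly positive even after the last observed event, where the uncensored KM estimate is zero.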

The theoretical analysis of the KM and NA estimators (such as their expectation and variance) involves some non-trivial algebra. If you are interested, I would recommend the following lecture notes: http://www4.stat.ncsu.edu/~dzhang2/st745/chap2.pdf.

5.2 Censoring

However, in reality, our data may not be so nice. We may not be able to observe the actual event time $T_i$ because of many complications. For instance, in a medical study, individuals may leave the study (called dropout), so we only observe the time at which they leave instead of the actual death time. The phenomenon that we sometimes cannot observe the actual event time but only a ‘censoring time’ is called censoring in statistics.
To model this process, we introduce two other variables, Y and C. Here T is the actual event time of interest, C is the censoring time that competes with T, and Y is the time we actually observe. In most cases, we will consider the right-censoring problem, where the three variables are related by

\[ Y = \min\{T, C\}. \]

We will assume that T and C are independent. Note that if what we observe is $Y = \max\{T, C\}$, the problem is called a left-censoring problem. Moreover, we not only observe Y, we also know whether Y comes from the event time or the censoring time. Namely, we have one extra variable δ such that δ = I(T < C).
When we only observe $(Y_1, \delta_1), \dots, (Y_n, \delta_n)$ instead of $T_1, \dots, T_n$, how can we infer the survival function of $T_1$? This is a central question in much biostatistical research.
Because we have several R.V.s now, we add subscripts to denote the functions associated with each random variable. Namely, $F_T, S_T, h_T, H_T$ are the CDF, survival function, hazard function, and cumulative hazard function of the random variable T; $F_C, S_C, h_C, H_C$ are those of C; and $F_Y, S_Y, h_Y, H_Y$ are those of Y.
Here are some relations among these functions.

• $S_Y(t) = P(Y > t) = P(\min\{T, C\} > t) = P(T > t, C > t) = P(T > t)P(C > t) = S_T(t)S_C(t)$, where the factorization uses the independence of T and C.
Namely, the survival function of Y is the product of the other two survival functions.
• $F_Y(t) = 1 - (1 - F_T(t))(1 - F_C(t)) = F_T(t) + F_C(t) - F_T(t)F_C(t)$.
• $p_Y(t) = p_T(t) + p_C(t) - p_T(t)F_C(t) - p_C(t)F_T(t) = p_T(t)S_C(t) + p_C(t)S_T(t)$.
The PDF of Y is a weighted sum of the other two PDFs, where the weights are the survival functions.
• $h_Y(t) = h_T(t) + h_C(t)$.
Namely, the hazard function of Y is the sum of the other two.
• $H_Y(t) = H_T(t) + H_C(t)$.
Similarly, the cumulative hazard is also the sum of the other two.

Note that δ is just a Bernoulli random variable with success probability P(δ = 1) = P(T < C).
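
These relations are easy to check by simulation in a special case. The sketch below assumes, purely for illustration, that T and C are independent exponentials with hypothetical rates a = 1.0 and b = 0.5. Since hazards add, Y = min{T, C} then has constant hazard a + b, so its mean is 1/(a + b), and P(δ = 1) = P(T < C) = a/(a + b). The function name and seed are also hypothetical.

```python
import random

def simulate_right_censoring(n, rate_t, rate_c, seed=0):
    """Draw n pairs (Y, delta) with T ~ Exp(rate_t) and C ~ Exp(rate_c) independent."""
    rng = random.Random(seed)
    ys, deltas = [], []
    for _ in range(n):
        t = rng.expovariate(rate_t)      # event time T
        c = rng.expovariate(rate_c)      # censoring time C
        ys.append(min(t, c))             # observed time Y = min{T, C}
        deltas.append(1 if t < c else 0) # delta = I(T < C)
    return ys, deltas

ys, deltas = simulate_right_censoring(20000, rate_t=1.0, rate_c=0.5)
mean_y = sum(ys) / len(ys)
frac_events = sum(deltas) / len(deltas)

print(mean_y)       # should be close to 1 / (1.0 + 0.5)
print(frac_events)  # should be close to 1.0 / (1.0 + 0.5)
```

Both printed values should be near 2/3, up to simulation error.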

5.2.1 Estimating the Survival Function in Censoring

When there is censoring, the EDF approach no longer works. However, the KM and NA estimators are still valid. Essentially, the estimators are the same, but we need to slightly modify $N_\ell$ and $D_\ell$. As we have mentioned, formally, $N_\ell$ should be defined as

\[ N_\ell = \text{number of individuals at risk at the beginning of } B_\ell. \]

What does the phrase ‘at risk’ mean? It refers to being alive and not censored, so $N_\ell$ can be modified by replacing $T_i$ with $Y_i$. Thus,
\[ N_\ell = \sum_{i=1}^n I(Y_i \ge t_\ell). \]
The quantity $D_\ell$ is still the number of events in the interval $B_\ell$, but we can only count the events that are actually observed in the interval. Therefore,
\[ D_\ell = \sum_{i=1}^n I(Y_i \in B_\ell, \delta_i = 1). \]

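Using these two modifications, the KM estimator and NA estimator are
\[ \hat{S}_{KM}(t) = \prod_{\ell: t_\ell \le t} \left(1 - \frac{D_\ell}{N_\ell}\right), \qquad \hat{S}_{NA}(t) = \exp\left(-\sum_{\ell: t_\ell \le t} \frac{D_\ell}{N_\ell}\right). \]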
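
With censored data, the only change in a computation is how $N_\ell$ and $D_\ell$ are counted. The following minimal sketch takes the observed pairs $(Y_i, \delta_i)$ as input; the function name and the small data set are hypothetical. It loops over all distinct observed times, so a time at which only censoring occurs contributes $D_\ell = 0$ and leaves both estimates unchanged.

```python
import math

def km_na_censored(ys, deltas):
    """Return [(t_l, S_KM, H_NA, S_NA)] at each distinct observed time under right censoring."""
    curve = []
    s_km, h_na = 1.0, 0.0
    for t in sorted(set(ys)):
        # N_l: individuals still at risk (neither dead nor censored before t_l).
        at_risk = sum(1 for y in ys if y >= t)
        # D_l: observed events at t_l (censored observations are not counted).
        events = sum(1 for y, d in zip(ys, deltas) if y == t and d == 1)
        s_km *= 1.0 - events / at_risk
        h_na += events / at_risk
        curve.append((t, s_km, h_na, math.exp(-h_na)))
    return curve

# Hypothetical data: delta = 1 marks an observed event, delta = 0 a censored time.
ys = [1.0, 2.0, 2.0, 3.0, 4.0]
deltas = [1, 0, 1, 1, 0]
curve = km_na_censored(ys, deltas)
for row in curve:
    print(tuple(round(v, 4) for v in row))
```

In this example the individual censored at time 2 still counts in $N_\ell$ at t = 2 but drops out of the risk set afterwards, and the final censored observation at t = 4 leaves the KM estimate at its previous value instead of forcing it to zero.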

Note that parametric models may still be applicable in the censoring case, and estimation is often done using a maximum likelihood approach, which is beyond the scope of this course, so we will not cover it here. Here is a lecture note about this topic: http://www4.stat.ncsu.edu/~dzhang2/st745/chap3.pdf

5.3 Cox Model

In reality, we often not only observe the event time of an individual but also have access to other covariates of this individual. We are often interested in understanding how these covariates affect the survival function of the event.
For instance, in a cancer study, we may have each individual’s age when they got cancer (the event time T) along with this individual’s gender, BMI, smoking habit, and education level. These other variables are the covariates in this study. Health scientists are often interested in how these covariates change the survival function. Let X denote the covariates. A parameter of interest is the survival function of T given X. Namely, it is the conditional survival function
\[ S(t \mid x) = P(T > t \mid X = x). \]
For instance, we may be interested in

\[ S(\text{Age} = t \mid (\text{gender}, \text{BMI}, \text{smoking habit}, \text{education level}) = (\text{male}, 20, \text{never smoke}, \text{college})). \]

We can then define the conditional hazard function and conditional cumulative hazard function as
\[ h(t \mid x) = -\frac{\partial \log S(t \mid x)}{\partial t}, \qquad H(t \mid x) = -\log S(t \mid x). \]

The Cox (proportional hazard) model is one of the most popular models combining the covariates and the survival function. It starts with modeling the hazard function h(t|X = x):
\[ h(t \mid X = x) = h_0(t) \exp(x^T \beta), \]
where β is the vector of coefficients of the covariates. The function $h_0(t)$ is called the baseline hazard function. Namely, the Cox model assumes that the covariates have a multiplicative effect on the hazard function (linear in x on the log scale) and that this effect stays the same across time.
This implies that the conditional cumulative hazard function is
\[ H(t \mid x) = \exp(x^T \beta) \int_0^t h_0(s)\,ds = \exp(x^T \beta) H_0(t), \]
where $H_0(t)$ is the baseline cumulative hazard function. This further yields the conditional survival function
\[ S(t \mid x) = \exp(-H(t \mid x)) = \exp\left(-\exp(x^T \beta) H_0(t)\right) = \left[\exp(-H_0(t))\right]^{\exp(x^T \beta)} = S_0(t)^{\exp(x^T \beta)}, \]
where $S_0(t) = \exp(-H_0(t))$ is called the baseline survival function.


Why is it called a proportional hazard model? Here is the intuition. Consider two individuals with different covariates: one has $X = x_1$ and the other has $X = x_2$. The ratio of their hazard functions,
\[ \frac{h(t \mid x_1)}{h(t \mid x_2)} = \frac{h_0(t) \exp(x_1^T \beta)}{h_0(t) \exp(x_2^T \beta)} = \frac{\exp(x_1^T \beta)}{\exp(x_2^T \beta)} = \exp((x_1 - x_2)^T \beta), \]
is constant over time. Namely,
\[ h(t \mid x_1) = \exp((x_1 - x_2)^T \beta) \times h(t \mid x_2) \propto h(t \mid x_2) \quad \forall t \ge 0. \]
Thus, their hazards are always proportional to each other regardless of the value of time t.
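
The Cox model formulas can be illustrated numerically. The sketch below is not from the original notes and rests on hypothetical choices: a Weibull-type baseline with λ = 0.1 and k = 2 (so $h_0(t) = 0.02t$ and $H_0(t) = 0.01t^2$), a made-up coefficient vector, and two made-up covariate vectors. It checks that the hazard ratio does not depend on t and evaluates $S(t \mid x) = S_0(t)^{\exp(x^T \beta)}$.

```python
import math

def linear_predictor(x, beta):
    """Inner product x^T beta."""
    return sum(xi * bi for xi, bi in zip(x, beta))

def cox_hazard(t, x, beta, baseline_hazard):
    """h(t | x) = h_0(t) * exp(x^T beta)."""
    return baseline_hazard(t) * math.exp(linear_predictor(x, beta))

def cox_survival(t, x, beta, baseline_cum_hazard):
    """S(t | x) = S_0(t) ** exp(x^T beta), with S_0(t) = exp(-H_0(t))."""
    s0 = math.exp(-baseline_cum_hazard(t))
    return s0 ** math.exp(linear_predictor(x, beta))

# Hypothetical baseline: Weibull with lambda = 0.1 and k = 2.
h0 = lambda t: 0.02 * t
H0 = lambda t: 0.01 * t * t

beta = [0.5, -0.2]                # hypothetical coefficients
x1, x2 = [1.0, 3.0], [0.0, 1.0]   # hypothetical covariate vectors

# The hazard ratio should equal exp((x1 - x2)^T beta) at every time point.
ratios = [cox_hazard(t, x1, beta, h0) / cox_hazard(t, x2, beta, h0) for t in (1.0, 5.0, 10.0)]
expected_ratio = math.exp(linear_predictor([a - b for a, b in zip(x1, x2)], beta))
print(ratios, expected_ratio)

# Conditional survival at t = 5 for the first individual.
print(cox_survival(5.0, x1, beta, H0))
```

All three ratios should match the expected value up to floating-point error, even though the baseline hazard itself grows with t.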
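Estimation of the parameter β is often done by maximizing the partial likelihood function:
\[ \hat{L}_n(\beta) = \prod_{i=1}^n L_i(\beta), \]
where
\[ L_i(\beta) = \frac{h(T_i \mid X_i)}{\sum_{j: T_j \ge T_i} h(T_i \mid X_j)} = \frac{\exp(X_i^T \beta)}{\sum_{j: T_j \ge T_i} \exp(X_j^T \beta)}. \]
The second equality holds because every hazard is evaluated at the same time $T_i$, so the baseline hazard $h_0(T_i)$ cancels. Our estimator is
\[ \hat{\beta}_n = \mathrm{argmax}_{\beta}\, \hat{L}_n(\beta). \]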
This estimator turns out to be asymptotically unbiased, has variance shrinking at rate $O(n^{-1})$, and is asymptotically normal under suitable conditions. An interesting fact is that we do not need to know the baseline hazard function $h_0(t)$ to estimate β! (Estimating $h_0(t)$ is not easy and the convergence rate is often slow; we will discuss a similar pattern in density estimation.) The property that we can estimate the parameter of interest without estimating the entire model is related to the topic of semi-parametric models (see footnote 1).
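
To make the partial likelihood concrete, here is a minimal sketch for a single scalar covariate. It assumes no censoring and no tied event times, and it maximizes the log partial likelihood with a simple ternary search, which works because the log partial likelihood is concave in β. The ternary search is chosen for brevity only and is not meant as a production optimizer. The function names and the data are hypothetical.

```python
import math

def log_partial_likelihood(beta, times, xs):
    """Sum over i of [beta * x_i - log(sum over the risk set {j: T_j >= T_i} of exp(beta * x_j))]."""
    total = 0.0
    for t_i, x_i in zip(times, xs):
        risk_sum = sum(math.exp(beta * x_j) for t_j, x_j in zip(times, xs) if t_j >= t_i)
        total += beta * x_i - math.log(risk_sum)
    return total

def maximize_concave(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function f on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1   # the maximizer lies to the right of m1
        else:
            hi = m2   # the maximizer lies to the left of m2
    return (lo + hi) / 2.0

# Hypothetical data: event times and a binary covariate.
times = [5.0, 3.0, 8.0, 1.0, 6.0, 2.0]
xs = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]

objective = lambda b: log_partial_likelihood(b, times, xs)
beta_hat = maximize_concave(objective, -10.0, 10.0)
print(beta_hat)
```

Note that the baseline hazard never appears in the code: only the ordering of the event times and the covariates enter the objective. A positive estimate here reflects that individuals with x = 1 tend to have earlier event times in this made-up data set.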
Note that the detailed analysis and derivation are beyond the scope of this course (you may learn them in a course called ‘survival analysis’). If you want to learn more, I would recommend the following two lecture notes:

• http://www4.stat.ncsu.edu/~dzhang2/st745/chap6.pdf
• http://www.public.iastate.edu/~kkoehler/stat565/coxph.4page.pdf

1 https://en.wikipedia.org/wiki/Semiparametric_model
