The Kaplan-Meier Estimate of The Survival Function

This document provides an overview of the Kaplan-Meier method for estimating survival functions from lifetime data that may be censored. It introduces the Kaplan-Meier estimate and how it generalizes the empirical survival function to account for censoring. The document also derives an estimate of the variance of the Kaplan-Meier estimate using the delta method and shows how this can be used to construct approximate confidence intervals for the survival function.

Uploaded by

rodicasept1967
Copyright
© Attribution Non-Commercial (BY-NC)
Chapter 2

The Kaplan-Meier estimate of the survival function

2.1 Introduction to the Kaplan-Meier method
2.1.1 Introduction
The Kaplan-Meier estimate of the survival function is an empirical, or non-parametric, method of estimating S(t) from uncensored or right-censored data. It is extremely popular as it requires only very weak assumptions and yet utilises the information content of both fully observed and right-censored data. It comes as standard in most statistical packages (such as R) and can also be calculated by hand (e.g. in exams...).
2.1.2 Who were Kaplan and Meier?
Both were students of the famous John Tukey. In 1952, Paul Meier started working on the duration of cancer while at Johns Hopkins University, Baltimore, USA. Edward Kaplan later started working on the lifetime of vacuum tubes in the repeaters of sub-oceanic telephone cables while at Bell Labs. They independently submitted their research on survival times to the Journal of the American Statistical Association, whose editor encouraged them to submit a joint paper, which they did in 1958: Kaplan, E. L. and P. Meier (1958). Nonparametric Estimation from Incomplete Observations. J. Am. Stat. Assoc., 53:457-481. Google Scholar has 20 000 citations for this paper.
2.1.3 Motivating example: Leukaemia data
We consider remission times for two groups of leukaemia patients. Freireich et al. (1963, Blood, 21:699-716) applied 6-Mercaptopurine (6-MP) and a placebo to 42 youths (≤ 20 years) with leukaemia. The times of interest are the durations of remission in weeks. These are:
6-MP: 6, 6, 6, 7, 10, 13, 16, 22, 23, 6+, 9+, 10+, 11+, 17+, 19+, 20+, 25+, 32+, 32+, 34+, 35+
Placebo: 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23
using the notation t+ to indicate a right-censored observation at time t.
Let us rewrite the data using the following notation: d(t) is the number of deaths or failures (recorded) at time t, q(t) is the number of right-censorings at time t, and $n(t^-)$ is the total number of individuals at risk an instant before time t.
t    d(t)    q(t)    $n(t^-)$
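The table can also be filled in by machine. The following is a minimal sketch in base R for the 6-MP group, assuming a status coding of 1 for a failure and 0 for a right-censoring; the helper name risk_table is hypothetical and not part of any package.

```r
# 6-MP remission times (weeks) from the data above;
# status: 1 = failure observed, 0 = right-censored (the "+" observations)
time   <- c(6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)
status <- c(rep(1, 9), rep(0, 12))

# risk_table is a hypothetical helper, written only for this illustration
risk_table <- function(time, status) {
  times <- sort(unique(time))
  data.frame(
    t = times,
    d = sapply(times, function(u) sum(time == u & status == 1)),  # d(t)
    q = sapply(times, function(u) sum(time == u & status == 0)),  # q(t)
    n = sapply(times, function(u) sum(time >= u))                 # n(t^-)
  )
}

tab_6mp <- risk_table(time, status)
print(tab_6mp)
```

Individuals censored at time t are counted as still at risk an instant before t, which is why n uses time >= u.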
2.1.4 The estimate in the absence of censoring
Question: What proportion of the sample given a placebo survived to:
time 0.0?
time 0.9?
time 1.0?
time 1.1?
time 2.0?
In the absence of censoring, the empirical survival function $\hat{S}(t)$ is a step function with heights equal to the proportion of the starting population surviving to the instant after t, i.e.
$$\hat{S}(t) = n(t^+)/n(0).$$
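As a quick check of the answers to the question above, here is a small base R sketch of the empirical survival function for the (uncensored) placebo group. The function name emp_surv is an assumption made for this illustration.

```r
# Placebo remission times (weeks); none are censored
placebo <- c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)

# Proportion of the sample surviving to the instant after t: n(t^+)/n(0)
emp_surv <- function(t, x) sapply(t, function(u) mean(x > u))

query <- c(0.0, 0.9, 1.0, 1.1, 2.0)
s_hat <- emp_surv(query, placebo)
print(data.frame(t = query, surv = s_hat))
```

The strict inequality x > u is what makes the step drop at the failure time itself, so the values at 1.0 and 1.1 coincide (19/21) while the value at 0.9 is still 1.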
2.1.5 Kaplan-Meier's method in the presence of censoring
What do we do when there is censoring? The above formula requires generalising to the case when there are changing numbers of active participants in the study. The generalisation accounting for q(t) is called the Kaplan-Meier estimate.
The Kaplan-Meier estimate of S(t) is
$$\hat{S}(t) = \hat{S}(t^-)\,\hat{p}(T > t \mid T \ge t).$$
If no failures occur at time t, $\hat{p}(T > t \mid T \ge t) = 1$.
If one or more failures occur at time t,
$$\hat{p}(T > t \mid T \ge t) = \frac{n(t^-) - d(t)}{n(t^-)}. \tag{2.1}$$
Note that the Kaplan-Meier estimate does not change between events, nor at times when only censorings occur. It drops only at times when a failure has been observed. If we write $t_{(i)}$ as the ith ordered event time, and $d_{(i)}$, $q_{(i)}$ and $n_{(i^-)}$ accordingly, the Kaplan-Meier formula can be rewritten
$$\hat{S}(t) = \prod_{t_{(i)} \le t} \frac{n_{(i^-)} - d_{(i)}}{n_{(i^-)}} \tag{2.2}$$
with $\hat{S}(t) = 1$ for $t < t_{(1)}$.
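The product in (2.2) can be evaluated directly. The following base R sketch does so for the 6-MP group of the leukaemia data, again assuming status is coded 1 for failure and 0 for censoring; km_estimate is a hypothetical helper name, not a function from the survival package.

```r
# 6-MP remission times (weeks); status: 1 = failure, 0 = right-censored
time   <- c(6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)
status <- c(rep(1, 9), rep(0, 12))

# km_estimate is a hypothetical helper implementing formula (2.2)
km_estimate <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))                       # t_(i)
  n <- sapply(event_times, function(u) sum(time >= u))                 # n_(i^-)
  d <- sapply(event_times, function(u) sum(time == u & status == 1))   # d_(i)
  data.frame(t = event_times, n = n, d = d,
             surv = cumprod((n - d) / n))   # running product of (n - d)/n
}

km_6mp <- km_estimate(time, status)
print(round(km_6mp, 3))
```

Only failure times contribute a factor; censorings enter solely by reducing the number at risk at later failure times.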
What about $\hat{S}(t)$ for t greater than the maximum observed event time, $t_{\max}$, say? Various methods have been proposed: Efron (1967) suggested setting $\hat{S}(t) = 0$ for $t > t_{\max}$, Gill (1980) suggested $\hat{S}(t) = \hat{S}(t_{\max})$, and Brown et al. (1974) suggested $\hat{S}(t) = \exp\{\log(\hat{S}(t_{\max}))\,t/t_{\max}\}$. In truth, the best policy is not to attempt to estimate it, as the validity of the estimate cannot be assessed. It is better to stop the graph at the last observed event time.
2.2 Variance of the KM estimate
It is natural for us statisticians to want to know how certain our estimate of S(t) is. Within the frequentist framework, this is found via
$$V\{\hat{S}(t)\} = V\Bigl\{\prod_{t_{(i)} \le t} \frac{n_{(i^-)} - d_{(i)}}{n_{(i^-)}}\Bigr\} = V\Bigl\{\prod_{t_{(i)} \le t} \hat{p}_{(i)}\Bigr\}. \tag{2.3}$$
The variance of a sum of independent random variables is easy (the sum of the variances); the variance of a product is hard. So it is easier to work with the log survival function and then to convert back later:
$$V\{\log \hat{S}(t)\} = V\Bigl\{\sum_{t_{(i)} \le t} \log \hat{p}_{(i)}\Bigr\} = \sum_{t_{(i)} \le t} V\{\log \hat{p}_{(i)}\} \tag{2.4}$$
under the assumption that failures arise independently among the population (usually a safe assumption, but not for contagious diseases, for instance).
2.2.1 The delta method
Reminder: Obtaining the variance of a function of a random variable is a problem commonly faced in statistics. It is often approximated via a Taylor expansion around the mean, also known as the delta method.
Taylor's expansion gives $f(x) = f(a) + (x - a)f'(a) + (x - a)^2 f''(a)/2! + \dots$ The delta method uses $a = \mu = E(X)$ and considers just the first-order terms above. For example, $\log X \approx \log\mu + (X - \mu)\mu^{-1}$. Thus
$$\begin{aligned}
E(\log X) &\approx E(\log\mu) + E\{(X - \mu)/\mu\} && (2.5)\\
&= \log\mu && (2.6)\\
V(\log X) &\approx V(\log\mu) + V\{(X - \mu)/\mu\} && (2.7)\\
&= 0 + V(X/\mu) + V(\mu/\mu) && (2.8)\\
&= \frac{V(X)}{\mu^2}. && (2.9)
\end{aligned}$$
Thus
$$V\{\log \hat{p}_{(i)}\} \approx \frac{V\{\hat{p}_{(i)}\}}{E\{\hat{p}_{(i)}\}^2}. \tag{2.10}$$
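Recall that
$$\hat{p}_{(i)} = \frac{n_{(i^-)} - d_{(i)}}{n_{(i^-)}}. \tag{2.11}$$
Under the assumption that each individual at risk survives past $t_{(i)}$ independently with probability $p_{(i)}$ (i.e. fails with probability $1 - p_{(i)}$), $d_{(i)} \sim \mathrm{Bin}(n_{(i^-)}, 1 - p_{(i)})$ and so
$$\begin{aligned}
E\{\hat{p}_{(i)}\} &= \frac{n_{(i^-)} - n_{(i^-)}(1 - p_{(i)})}{n_{(i^-)}} && (2.12)\\
&= p_{(i)} && (2.13)\\
V\{\hat{p}_{(i)}\} &= V\Bigl\{\frac{d_{(i)}}{n_{(i^-)}}\Bigr\} && (2.14)\\
&= \frac{1}{n_{(i^-)}^2}\, n_{(i^-)}\, p_{(i)} (1 - p_{(i)}) && (2.15)\\
&= \frac{p_{(i)} (1 - p_{(i)})}{n_{(i^-)}}. && (2.16)
\end{aligned}$$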
Therefore
$$V\{\log \hat{p}_{(i)}\} \approx \frac{1 - p_{(i)}}{p_{(i)}\, n_{(i^-)}}. \tag{2.17}$$
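How good is the first-order approximation (2.17)? One way to get a feel for it is a small Monte Carlo experiment. The sketch below uses illustrative values n = 200 and p = 0.9, which are assumptions chosen for the example and do not come from any data set in this chapter.

```r
set.seed(1)      # for reproducibility
n <- 200         # number at risk, n_(i^-)  (illustrative value)
p <- 0.9         # conditional survival probability p_(i)  (illustrative value)

# Simulate the number of failures d ~ Bin(n, 1 - p) many times
d     <- rbinom(20000, size = n, prob = 1 - p)
p_hat <- (n - d) / n

sim_var   <- var(log(p_hat))       # Monte Carlo variance of log(p-hat)
delta_var <- (1 - p) / (p * n)     # delta-method approximation (2.17)
print(c(simulated = sim_var, delta = delta_var))
```

The two numbers should agree closely here; the approximation can be expected to deteriorate when n is small or p is close to 0, where the neglected higher-order terms matter more.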
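But we don't know $p_{(i)}$! We therefore use our estimate of it to get the following estimated, approximated variance:
$$\begin{aligned}
\hat{V}\{\log \hat{p}_{(i)}\} &= \frac{1 - \hat{p}_{(i)}}{\hat{p}_{(i)}\, n_{(i^-)}} && (2.18)\\
&= \frac{n_{(i^-)} - n_{(i^-)} + d_{(i)}}{n_{(i^-)}\,(n_{(i^-)} - d_{(i)})} && (2.19)\\
&= \frac{d_{(i)}}{n_{(i^-)}\,(n_{(i^-)} - d_{(i)})}. && (2.20)
\end{aligned}$$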
We therefore have
$$V\{\log \hat{S}(t)\} \approx \sum_{t_{(i)} \le t} \frac{d_{(i)}}{n_{(i^-)}\,(n_{(i^-)} - d_{(i)})}. \tag{2.21}$$
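But we really want $V\{\hat{S}(t)\} = V\{e^{\log \hat{S}(t)}\}$! Again, we use the delta method, this time using $e^X \approx e^\mu + (X - \mu)e^\mu$, so $V(e^X) \approx (e^\mu)^2 V(X)$.
Thus,
$$\begin{aligned}
V\{\hat{S}(t)\} &= V\{e^{\log \hat{S}(t)}\} && (2.22)\\
&\approx e^{2E\{\log \hat{S}(t)\}}\, V\{\log \hat{S}(t)\} && (2.23)\\
&\approx e^{2E\{\log \hat{S}(t)\}} \sum_{t_{(i)} \le t} \frac{d_{(i)}}{n_{(i^-)}\,(n_{(i^-)} - d_{(i)})} && (2.24)
\end{aligned}$$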
and so we have
$$\hat{V}\{\hat{S}(t)\} = \hat{S}(t)^2 \sum_{t_{(i)} \le t} \frac{d_{(i)}}{n_{(i^-)}\,(n_{(i^-)} - d_{(i)})}. \tag{2.25}$$
This is Greenwood's estimator of the variance of the survival function (Greenwood, 1926).
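Greenwood's formula (2.25) is straightforward to evaluate numerically. As a sketch, the base R code below applies it to the radio-tagged quail data of Pollock et al. (1989) used in Section 2.2.2 (18 birds; status 1 = death, 0 = censored is an assumed coding) and forms the plain interval $\hat{S}(t) \pm 1.96$ standard errors. The variable names are illustrative only.

```r
# Quail data: 3, 3, 6, 8, 8+, 9, 9+, 9+, 10, 10+, 12+ and seven birds at 13+
time   <- c(3, 3, 6, 8, 8, 9, 9, 9, 10, 10, 12, rep(13, 7))
status <- c(1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, rep(0, 7))

event_times <- sort(unique(time[status == 1]))
n <- sapply(event_times, function(u) sum(time >= u))                 # n_(i^-)
d <- sapply(event_times, function(u) sum(time == u & status == 1))   # d_(i)

surv <- cumprod((n - d) / n)                    # Kaplan-Meier estimate (2.2)
se   <- surv * sqrt(cumsum(d / (n * (n - d))))  # square root of Greenwood (2.25)
# Note: the sum is undefined if everyone at risk fails at once (n == d),
# which does not happen in these data.

greenwood <- data.frame(t = event_times, surv = surv, se = se,
                        lower = surv - 1.96 * se, upper = surv + 1.96 * se)
print(round(greenwood, 2))
```

The first upper limit exceeds 1, which illustrates why intervals built on the original scale can look nonsensical.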
2.2.2 Confidence interval for S(t)
We could try using Greenwood's estimate to construct asymptotic confidence intervals for S(t).
Example: Pollock et al. (1989) radio-tagged 18 quail (Colinus virginianus L.) and followed their survival. The following are death or censoring times in weeks, using the same notation as before:
3, 3, 6, 8, 8+, 9, 9+, 9+, 10, 10+, 12+, 13+, 13+, 13+, 13+, 13+, 13+, 13+.
The Kaplan-Meier estimate of the survival function, the variance of this estimate, and a 95% confidence interval constructed in the usual way, are as below.
t     $\hat{S}(t)$   $\hat{V}\{\hat{S}(t)\}$   95% CI
1     1.00           0.00^2                    (1.00, 1.00)
2     1.00           0.00^2                    (1.00, 1.00)
3     0.89           0.07^2                    (0.74, 1.03)
4     0.89           0.07^2                    (0.74, 1.03)
5     0.89           0.07^2                    (0.74, 1.03)
6     0.83           0.09^2                    (0.66, 1.01)
7     0.83           0.09^2                    (0.66, 1.01)
8     0.78           0.10^2                    (0.59, 0.97)
9     0.72           0.11^2                    (0.51, 0.93)
10    0.65           0.12^2                    (0.41, 0.88)
11    0.65           0.12^2                    (0.41, 0.88)
12    0.65           0.12^2                    (0.41, 0.88)
13    0.65           0.12^2                    (0.41, 0.88)
This approach may lead to CIs that exceed the range of S(t), which look nonsensical.
An alternative was proposed by Kalbfleisch and Prentice (2002) that gets around this problem, namely to use
$$\hat{V}\{\log(-\log \hat{S}(t))\} = \frac{1}{(\log \hat{S}(t))^2} \sum_{t_{(i)} \le t} \frac{d_{(i)}}{n_{(i^-)}\,(n_{(i^-)} - d_{(i)})} \tag{2.26}$$
(again using the delta method) to get a confidence interval for $\log(-\log \hat{S}(t))$, and then to transform that back into the original scale. The motivation is the fact that if $a = \log(-\log b)$ then $a \in (-\infty, \infty) \Leftrightarrow b \in (0, 1)$.
Precisely, they propose letting
$$c_1 = \log(-\log \hat{S}(t)) + z_{1-\alpha/2} \sqrt{\hat{V}\{\log(-\log \hat{S}(t))\}} \tag{2.27}$$
$$c_2 = \log(-\log \hat{S}(t)) - z_{1-\alpha/2} \sqrt{\hat{V}\{\log(-\log \hat{S}(t))\}} \tag{2.28}$$
so that the $1 - \alpha$ confidence interval for S(t) is $(\exp\{-e^{c_1}\}, \exp\{-e^{c_2}\})$. The larger limit $c_1$ gives the lower bound because $\exp\{-e^{c}\}$ is decreasing in c.
Note that this doesn't work for $\hat{S}(t) = 0$ or 1. In such cases, I suggest using (0, 0) or (1, 1) as the confidence interval if needed graphically, and otherwise not to offer a confidence interval for those points.
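To make the back-transformation concrete, here is a base R sketch of the log-log interval for the quail data. The function name loglog_ci and its arguments are hypothetical, chosen for this illustration; the status coding (1 = death, 0 = censored) is likewise an assumption.

```r
# Quail data of Pollock et al. (1989); status: 1 = death, 0 = censored
time   <- c(3, 3, 6, 8, 8, 9, 9, 9, 10, 10, 12, rep(13, 7))
status <- c(1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, rep(0, 7))

event_times <- sort(unique(time[status == 1]))
n <- sapply(event_times, function(u) sum(time >= u))
d <- sapply(event_times, function(u) sum(time == u & status == 1))
surv         <- cumprod((n - d) / n)          # Kaplan-Meier estimate
var_log_surv <- cumsum(d / (n * (n - d)))     # estimate of V{log S-hat(t)}, (2.21)

# loglog_ci is a hypothetical helper implementing (2.26)-(2.28);
# it requires 0 < surv < 1
loglog_ci <- function(surv, var_log_surv, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(var_log_surv) / abs(log(surv))   # square root of (2.26)
  c1 <- log(-log(surv)) + z * se
  c2 <- log(-log(surv)) - z * se
  cbind(lower = exp(-exp(c1)), upper = exp(-exp(c2)))
}

ci <- loglog_ci(surv, var_log_surv)
print(round(cbind(t = event_times, surv = surv, ci), 3))
```

Unlike the plain intervals, every limit produced this way lies strictly between 0 and 1, and the interval is no longer symmetric about the estimate.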
2.2.3 Simultaneous confidence interval for S(t)
If you connect up the confidence intervals for all times, you get a pointwise confidence band for S(t). This has the interpretation that the true S(t) for any particular t will be within the band for 95% (or, generally, $1 - \alpha$) of experiments you conduct. If you wish the coverage to be 95% for all times jointly, the confidence band must be expanded in size to obtain a simultaneous confidence band. Hall and Wellner (1980) proposed a method for doing so, but it is fairly complex and not readily available in computer packages, so we shall not cover it here.
2.3 Testing differences in survival curves
2.3.1 Introduction
As in most statistics, a key objective is to test whether subpopulations behave in the same way.
[Figure: Kaplan-Meier estimates $\hat{S}(t)$ (vertical axis, 0.0 to 1.0) plotted against t (horizontal axis, 0 to 120).]
For example, a group of patients has been allocated to treatment with omega 3 oils or a placebo by Stoll et al. (1999) and the duration without an attack recorded and plotted (orange for omega 3, blue for placebo). It appears that the two subpopulations do differ, with the survival curve of the patients on the drug lying above that of the placebo patients. Is this a genuine finding, or can it be explained by a small sample size giving the spurious impression of a difference?
Various tests have been proposed for testing for differences in survival between categorical covariates; we present three named ones. They all have a very similar structure, but have different power depending on what the exact nature of the difference between the survival curves is.
2.3.2 Notation
Let there be K categories.
Let $d_{k,(i)}$ be the number of failures in group k at ordered time $t_{(i)}$, where the ordering is over all categories.
Let $d_{(i)} = \sum_{k=1}^{K} d_{k,(i)}$ be the total number of failures at time $t_{(i)}$.
Let $n_{k,(i^-)}$ be the number of members of group k at risk an instant before $t_{(i)}$ and, as before, $n_{(i^-)}$ the total number at risk an instant before $t_{(i)}$.
2.3.3 Test with two categories
Initially we limit attention to two categories.
Hypotheses
$H_0: S_1(t) = S_2(t)$
$H_1: S_1(t) \ne S_2(t)$
Test-statistic
If $H_0$ is true, the expected number of deaths in group k at time $t_{(i)}$ is
$$e_{k,(i)} = \frac{n_{k,(i^-)}\, d_{(i)}}{n_{(i^-)}}. \tag{2.29}$$
The $n_{k,(i^-)}$ term is the number at risk in category k. The ratio $d_{(i)}/n_{(i^-)}$ is the overall proportion in both populations failing at time $t_{(i)}$. The variance in $d_{k,(i)}$ is given by the variance of the hypergeometric distribution:
$$\hat{V}_{1,(i)} = \hat{V}_{2,(i)} = \frac{n_{1,(i^-)}\, n_{2,(i^-)}\, d_{(i)}\, (n_{(i^-)} - d_{(i)})}{n_{(i^-)}^2\, (n_{(i^-)} - 1)}. \tag{2.30}$$
The test-statistic is then
$$q = \frac{\bigl\{\sum_{i=1}^{m} w_i\, (d_{1,(i)} - e_{1,(i)})\bigr\}^2}{\sum_{i=1}^{m} w_i^2\, \hat{V}_{1,(i)}} \tag{2.31}$$
for some weights $w_i$ (see later) where m is the number of distinct death/failure times. Note that the statistic is based on one subpopulation's sample moments only, as the other is deterministic conditional on the first.
If $H_0$ is true, $q \sim \chi^2_1$ asymptotically.
Assumptions:
Censoring is independent of group.
$\sum_{i=1}^{m} d_{(i)}$ is large.
$\sum_{i=1}^{m} e_{k,(i)}$ is large.
p-value
The p-value, $p = p(Q > q \mid H_0 \text{ true})$, follows from the CDF of the $\chi^2_1$ distribution.
Weights
The various tests differ in terms of the weights used, and hence the kind of discrepancies between the survival functions that the test is best able to pick up.
The log-rank test uses $w_i = 1$. It puts emphasis on larger values of time.
The (generalised) Wilcoxon test uses $w_i = n_{(i^-)}$. It puts emphasis on smaller values of time.
The Tarone-Ware test uses $w_i = \sqrt{n_{(i^-)}}$. It puts emphasis on intermediate values of time.
The three tests can simply be effected in R by setting rho=0, 1 or 0.5, respectively. See next lecture.
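Before turning to the built-in routines, it may help to see (2.29) to (2.31) evaluated by hand. The following base R sketch does this for the leukaemia data of Section 2.1.3, with group 1 taken as 6-MP and group 2 as placebo; the function name weighted_test and its weight argument are hypothetical and are not part of the survival package.

```r
# Leukaemia remission times (weeks); status: 1 = relapse, 0 = censored
t_mp <- c(6, 6, 6, 7, 10, 13, 16, 22, 23, 6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)
s_mp <- c(rep(1, 9), rep(0, 12))
t_pl <- c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)
s_pl <- rep(1, 21)

time   <- c(t_mp, t_pl)
status <- c(s_mp, s_pl)
group  <- c(rep(1, 21), rep(2, 21))   # 1 = 6-MP, 2 = placebo

# weighted_test is a hypothetical helper; 'weight' maps n_(i^-) to w_i
weighted_test <- function(time, status, group,
                          weight = function(n) rep(1, length(n))) {
  event_times <- sort(unique(time[status == 1]))
  n  <- sapply(event_times, function(u) sum(time >= u))
  n1 <- sapply(event_times, function(u) sum(time >= u & group == 1))
  d  <- sapply(event_times, function(u) sum(time == u & status == 1))
  d1 <- sapply(event_times, function(u) sum(time == u & status == 1 & group == 1))
  e1 <- n1 * d / n                                               # (2.29)
  v1 <- ifelse(n > 1,
               n1 * (n - n1) * d * (n - d) / (n^2 * (n - 1)), 0) # (2.30)
  w  <- weight(n)
  q  <- sum(w * (d1 - e1))^2 / sum(w^2 * v1)                     # (2.31)
  c(q = q, p = pchisq(q, df = 1, lower.tail = FALSE))
}

res_logrank  <- weighted_test(time, status, group)                 # w_i = 1
res_wilcoxon <- weighted_test(time, status, group, function(n) n)  # w_i = n
res_tarone   <- weighted_test(time, status, group, sqrt)           # w_i = sqrt(n)
print(rbind(logrank = res_logrank, wilcoxon = res_wilcoxon, tarone_ware = res_tarone))
```

All three versions share the same observed and expected counts; only the weighting of early versus late event times changes.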
2.3.4 Tests with multiple categories
If there are $K \ge 3$ subpopulations of interest, similar tests can be constructed by generalising notation to use matrix algebra.
Notation
Let $d_{(i)}^T = (d_{1,(i)}, d_{2,(i)}, \dots, d_{K-1,(i)})$.
Let $e_{(i)}^T = (e_{1,(i)}, e_{2,(i)}, \dots, e_{K-1,(i)})$.
Note that both are of length $K - 1$ for the same reason as we used only one category in the two-category case.
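The $(K - 1) \times (K - 1)$ covariance matrix $\hat{V}_{(i)}$ has diagonal elements
$$v_{k,k,(i)} = \frac{n_{k,(i^-)}\, (n_{(i^-)} - n_{k,(i^-)})\, d_{(i)}\, (n_{(i^-)} - d_{(i)})}{n_{(i^-)}^2\, (n_{(i^-)} - 1)} \tag{2.32}$$
and off-diagonal elements
$$v_{k,j,(i)} = -\frac{n_{k,(i^-)}\, n_{j,(i^-)}\, d_{(i)}\, (n_{(i^-)} - d_{(i)})}{n_{(i^-)}^2\, (n_{(i^-)} - 1)}. \tag{2.33}$$
(The off-diagonal elements are negative: more failures than expected in one group leave fewer for the others.)
The weight matrix is $W_i = w_i I_{K-1}$ where $I_{K-1}$ is the $(K - 1) \times (K - 1)$ identity matrix.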
Hypotheses
$H_0: S_1(t) = S_2(t) = \dots = S_K(t)$.
$H_1$: There is at least one pair of categories k and j such that $S_j(t) \ne S_k(t)$.
Test-statistic
If $H_0$ is true, using the same logic as in the two-category case, the test-statistic is then
$$q = \Bigl\{\sum_{i=1}^{m} W_i\, (d_{(i)} - e_{(i)})\Bigr\}^T \Bigl\{\sum_{i=1}^{m} W_i \hat{V}_{(i)} W_i\Bigr\}^{-1} \Bigl\{\sum_{i=1}^{m} W_i\, (d_{(i)} - e_{(i)})\Bigr\}. \tag{2.34}$$
If $H_0$ is true, under the same assumptions as for the 2 subpopulation case, $q \sim \chi^2_{K-1}$, and the p-value can be found in a similar fashion.
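The matrix form (2.34) can be sketched in base R as follows. The function ksample_test is a hypothetical helper, and the three-group data set is a made-up toy example used purely to exercise the code; it is far too small for the chi-squared approximation to be trusted.

```r
# ksample_test is a hypothetical helper implementing (2.32)-(2.34);
# 'weight' maps the total number at risk n_(i^-) to the scalar weight w_i
ksample_test <- function(time, status, group, weight = function(n) 1) {
  g <- sort(unique(group)); K <- length(g)
  keep <- seq_len(K - 1)                 # drop the last group, as in the text
  U <- rep(0, K - 1)                     # running sum of W_i (d_(i) - e_(i))
  V <- matrix(0, K - 1, K - 1)           # running sum of W_i V_(i) W_i
  for (u in sort(unique(time[status == 1]))) {
    nk <- sapply(g, function(k) sum(time >= u & group == k))
    dk <- sapply(g, function(k) sum(time == u & status == 1 & group == k))
    n <- sum(nk); d <- sum(dk)
    if (n < 2) next                      # (2.32) undefined when only one at risk
    w <- weight(n)
    cfac <- d * (n - d) / (n^2 * (n - 1))
    Vi <- -outer(nk, nk) * cfac          # off-diagonal elements (2.33)
    diag(Vi) <- nk * (n - nk) * cfac     # diagonal elements (2.32)
    U <- U + w * (dk - nk * d / n)[keep]
    V <- V + w^2 * Vi[keep, keep, drop = FALSE]
  }
  q <- drop(t(U) %*% solve(V) %*% U)     # quadratic form (2.34)
  c(q = q, df = K - 1, p = pchisq(q, df = K - 1, lower.tail = FALSE))
}

# Toy data (invented for illustration only): three groups of four individuals
toy_time   <- c(2, 4, 6, 8,   1, 3, 5, 7,   5, 9, 10, 12)
toy_status <- c(1, 1, 0, 1,   1, 1, 1, 0,   0, 1, 1, 0)
toy_group  <- rep(c("A", "B", "C"), each = 4)

res_toy <- ksample_test(toy_time, toy_status, toy_group)
print(res_toy)
```

With K = 2 the vectors and matrices collapse to scalars and the statistic reduces to (2.31), which gives a convenient way to check an implementation such as this one.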
2.3.5 Discussion
These tests are very useful in assessing whether a covariate affects survival. However, they do not allow us to say how survival is affected. Ideally, we'd like to be able to say how much more at risk one group is than another. We'd also like to be able to incorporate non-categorical covariates, such as age. An ad hoc approach is to categorise the covariate arbitrarily. A more satisfying approach is to do some semi-parametric modelling to investigate the functional relationship between the covariates and survival. This can be done using Cox's proportional hazards model, which we will discuss in the next part of the course, chapter 3.
2.4 Finding the Kaplan-Meier estimate in R
In the earlier part of this chapter, we discussed how to calculate the Kaplan-Meier estimate of S(t) by hand and to do tests of the hypothesis that different categories of individuals have the same survival functions. Luckily these routines have been incorporated in many standard statistical packages such as R, Splus, SAS, SPSS, etc. In this course, we will use R to analyse data, but you are welcome to use your preferred statistics package as long as you do not seek any support from me. The advantages of R over other packages are: it is open-source, free, uses easily-replicated command lines rather than opaque and forgettable combinations of menu clicks, and it has a vast collection of add-on packages contributed by the user community.
To start R in a GUI like Windows, either double click on its icon if it is on the desktop or search for the program in the start menu. In a unix-like environment, open a command line terminal and type R at the prompt.
There are a variety of optional packages relating to survival analysis. To obtain a list of all such packages, type
> help.search("survival")
We shall use the survival package. To load this, type
> library(survival)
Complete documentation for this package can be obtained at
http://cran.r-project.org/web/packages/survival/index.html.
Documentation on any particular function can be obtained with the ?functionname command, e.g.
> ?survfit
2.4.1 Loading data
The first thing we want to do is to load data into R. There are three scenarios:
Use data R knows about already. For pedagogic and illustrative purposes, some data sets are available to R automatically. One such data set is the aml data in the survival package. This details survival in patients with acute myelogenous leukaemia, as described in Miller (1997) Survival Analysis, Wiley and Sons. Since we've already loaded the survival package, we can see these data by typing
> aml
The data are in a structure with three sub-elements: time (to death or censoring), status (indicating censoring if status = 0) and x (a factor indicating if maintenance chemotherapy was given). To see survival and censoring times, type
> aml$time
You can do a plot by typing
> hist(aml$time)
for example.
Load your own data from a file. More often, you will wish to load in your own data from a text file. This can be done with the read.table or scan commands. If you are manually creating the data set file, use your favourite text editor and save your file as a .txt or .dat file. Good practice is to put it in its own directory with a suitable name and an associated readme file explaining its contents to posterity.
If the file is in the current working directory, you can just give R its name in the read.table command and all will be well. To find the current working directory, type
> getwd()
Otherwise, you must specify the file's location in either relative or absolute terms. Here are some examples (not for real files!):
#reads the file dataset.dat from the working directory:
> read.table("dataset.dat")
#reads the file dataset.dat from my web page:
> read.table("http://courses.nus.edu.sg/course/stacar/internet/dataset.dat")
#reads the file dataset.dat from the directory mydata
# on your computer's c drive:
> read.table("C:/mydata/dataset.dat")
Let us read the data bipolar from my webpage into a variable called s_data:
#header=TRUE assumes the first line of the file holds the column names:
> s_data =
read.table("http://courses.nus.edu.sg/course/stacar/internet/bipolar.dat",header=TRUE)
These data are the times of no attack of bipolar disorder in a group of patients given omega 3 or a placebo. See Stoll et al. (1999, Arch. Gen. Psychiatry, 56:407-12). There are three columns, t, d and o. The first gives the event or censoring time in days, the second indicates if the time is an observed event (d=1) or a censoring (d=0), and the third indicates if the patient received omega 3 oil (o=1) or the placebo (o=0).
Form your own data structure from disparate elements. Suppose that you have manipulated some vectors of data and wish to combine them
together into a single structure. This could be simulated data (as follows) or could result from having each part of the data stored in its own file. Let us create a simulated data set with three elements: time, censor and an imaginary covariate x:
> N = 100 #the sample size
> covariate = rnorm(N,0,1) #100 draws from a N(0,1) distribution
#give them an exponential lifetime affected by the covariate:
> lifetime = rexp(N,covariate^2)
> censored = (lifetime>10) #ie TRUE (=1) if lifetime>10 and FALSE (=0) else
#censored individuals are only observed up to time 10:
> time = pmin(lifetime,10)
#put the data into a structure called sim:
> sim = list(x=covariate,time=time,censor=censored)
> sim #look at the resulting data
The first thing you should do once the data are loaded is to plot them! For example:
> hist(aml$time,freq=F)
> lines(density(subset(aml$time,aml$x=="Maintained"),from=0),col=2)
> lines(density(subset(aml$time,aml$x=="Nonmaintained"),from=0),col=3)
2.4.2 Kaplan-Meier Plots in R
The main function for the Kaplan-Meier estimator in R is called survfit. It takes as its primary argument a formula object. The formula object is of form:
a survival object ~ covariate terms
If you do not wish to use any covariates, the covariate terms can be replaced by 1. Survival objects are created by the Surv (note capital S) function. This takes arguments time (i.e. event time) and event (0 for right censored, 1 for failure). For example, for the aml data we may use
Surv(aml$time,aml$status)
The covariate terms in the formula object are specified via a symbolic model formula. This might resemble x1 + x2 + x1*x2 for two covariates x1 and x2 including an interaction term. For example, the aml data have one binary categorical covariate x. Two Kaplan-Meier curves can be created by specifying aml$x. Thus we can use the Kaplan-Meier routine on the aml data by entering:
survfit(Surv(aml$time,aml$status)~aml$x)
An alternative that avoids all the aml$s is
survfit(Surv(time,status)~x,data=aml)
Let's store the fit in an eponymous variable:
fit = survfit(Surv(aml$time,aml$status)~aml$x)
to be manipulated later. The function survfit has various options, including
#just fit to elements 1--11 in the data set:
, subset=1:11
#just fit to elements that have an x equal to the string "Maintained":
, subset=(x=="Maintained")
#do a log-log transformation to create CIs:
, conf.type="log-log"
#create CIs on the original scale:
, conf.type="plain"
#constructs a 90%CI instead of the default of 95%:
, conf.int=0.9
The plot function is preconfigured for survfit output. To plot the output (which we've stored in fit), type:
plot(fit)
[Figure: output of plot(fit) with unlabelled axes; horizontal axis 0 to 150, vertical axis 0.0 to 1.0.]
Note that the axes are not labelled, a terrible sin. Correct it via
plot(fit,xlab="t",ylab=expression(hat(S)(t)))
[Figure: the same plot with labelled axes; horizontal axis t (0 to 150), vertical axis $\hat{S}(t)$ (0.0 to 1.0).]
Syntax for the expression function can be obtained in Murrell & Ihaka (2000, J. Comp. Graph. Stat. 9:582-599). By default, plot plots just the estimators $\hat{S}_i(t)$ if there is more than one category i. With just one category it plots in addition a confidence interval. There are various additional options to plot that change the defaults:
#switch off the marks indicating censoring events:
, mark.time=F
#use a vertical line rather than a + to represent censoring events:
, mark="|"
#force CIs to be on:
, conf.int=T
#force CIs to be off:
, conf.int=F
#use NUS colours to distinguish the lines:
, col=c("orange","blue")
#use red and blue lines:
, col=c(2,4)
#put a legend with labels A and B on the plot
#(though informative labels would be far better);
#recent versions of the survival package may no longer accept the two
#legend arguments below, in which case call legend() after plot instead:
, legend.text=c("A","B")
#put the legend in the top right corner of the plot area:
, legend.pos=1
, log=T #plots log S(t) versus t
, fun=log #ditto
, fun=sqrt #plots sqrt S(t) versus t
, fun="cumhaz" #plots cumulative hazard
#plot a log log plot, changing the x axis too somehow:
, fun="cloglog"
I personally do not like these default settings. Here is my preferred version for a binary explanatory variable:
> fit$label=c(rep(1,fit$strata[1]),rep(2,fit$strata[2]))
> t1=c(0,subset(fit$time,fit$label==1));t2=c(0,subset(fit$time,fit$label==2))
> St1=c(1,subset(fit$surv,fit$label==1));St2=c(1,subset(fit$surv,fit$label==2))
> uSt1=c(1,subset(fit$upper,fit$label==1));uSt2=c(1,subset(fit$upper,fit$label==2))
> lSt1=c(1,subset(fit$lower,fit$label==1));lSt2=c(1,subset(fit$lower,fit$label==2))
> plot(0,0,pch=" ",ylim=0:1,xlim=range(t1,t2),xlab="t",ylab=expression(hat(S)(t)))
> lines(t1,uSt1,lty=2,type="s",col="orange");lines(t1,lSt1,lty=2,type="s",col="orange")
> lines(t2,uSt2,lty=2,type="s",col="blue");lines(t2,lSt2,lty=2,type="s",col="blue")
> lines(t1,St1,type="s",col="orange");lines(t2,St2,type="s",col="blue")
[Figure: the preferred version of the plot; horizontal axis t (0 to 150), vertical axis $\hat{S}(t)$ (0.0 to 1.0).]
You can output your plots to a file to be included in reports, etc. To create a pdf file, do the following:
> cm=1/2.54;pdf("myplot.pdf",height=10*cm,width=10*cm)
#R uses inches by default.
#Change dimensions of the plot to suit your requirements.
> #put all your plotting commands here
> dev.off()
Again, the defaults are not very good: there is too much white space in the wrong place. Try
> cm=1/2.54;pdf("myplot.pdf",height=10*cm,width=10*cm)
#change margins and marginal spacing respectively
#(See ?par for details):
> par(mai=c(2,2,0.5,0.5)*cm,mgp=c(2,0.75,0))
#change size of text etc:
> par(cex=1.25)
> #put all your plotting commands here
> dev.off()
[Figure: the same plot with adjusted margins and text size; horizontal axis t (0 to 150), vertical axis $\hat{S}(t)$ (0.0 to 1.0).]
Alternatives to pdf include
> postscript("myplot.ps")
> jpeg("myplot.jpeg")
> png("myplot.png")
Note that postscript and pdf formats are vector-based and hence lossless: try zooming in on a pdf and it will still look good, but zoom in on a jpeg and it will look blocky. Note however that if you put any graphics into a Microsoft Office document, the chances are that no matter how good the original the final graphic will look terrible.
As well as plotting the output, try summarising it:
> summary(fit)
This prints out a table with rows corresponding to failure times, and columns giving these times, the number at risk, the number of failures, the KM estimate of the survival function, Greenwood's estimate of the standard error of this estimate, along with lower and upper bounds on a confidence interval. Note that the confidence interval can be inaccurate: for instance, it sets equal to 1 any upper bounds that go above 1.
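The printed summary can also be stored and its columns used in further calculations. The sketch below assumes the survival package is installed and evaluates the fitted curves at a few chosen times via the times argument of summary; the object names s and km_table, and the choice of times 10, 20 and 30, are illustrative only.

```r
library(survival)   # provides Surv, survfit and the aml data

fit <- survfit(Surv(time, status) ~ x, data = aml)

# Evaluate both curves at chosen times rather than at every failure time
s <- summary(fit, times = c(10, 20, 30))

# Collect the components of the summary into an ordinary data frame
km_table <- data.frame(group  = s$strata,
                       t      = s$time,
                       n.risk = s$n.risk,
                       surv   = s$surv,
                       se     = s$std.err,
                       lower  = s$lower,
                       upper  = s$upper)
print(km_table)
```

Working with the components directly makes it easy to check the values against a hand calculation or to export them to a report.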
2.4.3 Tests in R
Tests of the hypothesis that subpopulations have the same S(t) can be done easily using the survdiff function. This takes a formula expression as its primary argument. It allows the following further arguments:
> subset= #as survfit
> rho=0
The rho option puts weights $\hat{S}(t)^\rho$ on the summands in the test statistic. If $\rho = 0$ we have the log-rank test as all event times have the same weight. If $\rho = 1$ we have the generalised Wilcoxon test (in the Peto and Peto form, since the weights are $\hat{S}(t)$ rather than $n_{(i^-)}$). If $\rho = 1/2$ we have the analogue of the Tarone-Ware test. The notation is due to Harrington and Fleming (1982) Biometrika 69:553-66.
For example, let us test whether maintenance chemotherapy has any effect on survival times due to acute myelogenous leukaemia.
> survdiff(Surv(time,status)~x,data=aml,rho=1)
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 1)
                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11     3.85     6.14      0.86      2.78
x=Nonmaintained 12     7.18     4.88      1.08      2.78
Chisq= 2.8 on 1 degrees of freedom, p= 0.0955
> survdiff(Surv(time,status)~x,data=aml,rho=0)
Call:
survdiff(formula = Surv(time, status) ~ x, data = aml, rho = 0)
                 N Observed Expected (O-E)^2/E (O-E)^2/V
x=Maintained    11        7    10.69      1.27      3.40
x=Nonmaintained 12       11     7.31      1.86      3.40
Chisq= 3.4 on 1 degrees of freedom, p= 0.0653
The second of these, the log-rank test, puts more emphasis on larger values of time, which is where the main difference appears to be (from the graphs), giving it a smaller p-value. Note that it's a bit unprincipled to keep doing tests until you get a p-value you want, so you should really decide on the weights before analysing the data.
