
Basic Statistics For Doctors
Singapore Med J 2004 Vol 45(6) : 249

Biostatistics 203.
Survival analysis
Y H Chan

Clinical Trials and Epidemiology Research Unit
226 Outram Road, Blk B #02-02
Singapore 169039
Y H Chan, PhD, Head of Biostatistics
Correspondence to: Dr Y H Chan
Tel: (65) 6325 7070
Fax: (65) 6324 2700
Email: chanyh@cteru.com.sg

Table I. Summary of the common univariate/multivariate biostatistical techniques to analyse quantitative and qualitative data types.

Quantitative data(1)
Normality/homogeneity of variance assumptions satisfied?

YES: Parametric tests        NO: Non-parametric tests
1 Sample T                   Sign test
Paired T                     Wilcoxon Signed Rank
2 Sample T                   Wilcoxon Rank Sum/Mann Whitney U
ANOVA                        Kruskal Wallis

Qualitative data(2)

Independent sample           Matched case-control
Chi Square/Fisher Exact test McNemar

Multivariate tests

Quantitative data: Multiple linear regression(3)
Qualitative data: Logistic regression(4) (independent sample), Conditional logistic regression (matched case-control)

In this article, we shall discuss the use of survival analysis on a quantitative type of data corresponding to the time from a well-defined time origin until the occurrence of some particular event of interest or end-point.

Medical examples are:
• Duration – time from randomisation to relapse
• Pressure sore – time to development
• Survival – time from randomisation until death

Non-medical examples are:
• Banking – time from making a loan to full repayment
• Economy – time from graduation to getting the 1st job
• Social – time from being single to getting married

Since survival time is a quantitative variable, why can't we just use the usual techniques from Table I? Before we explain the main reason why we use survival analysis, let us consider a simple example on the survival times (in months) for 25 lung cancer patients who all died; the timings are: 1, 5, 6, 6, 9, 10, 10, 10, 12, 12, 12, 12, 12, 13, 15, 16, 20, 24, 24, 27, 32, 34, 36, 36, 44 months. Performing a simple descriptive analysis, we have n = 25, mean (sd) = 17.52 (11.48) months and median = 12 months.

Fig. 1 The distribution of the survival times.
[Histogram: frequency against time (in months); mean = 17.52, std. dev. = 11.482, N = 25.]

It is obvious that the distribution is not normal (Fig. 1), as expected from survival-time data.
Kaplan Meier is the usual technique performed to analyse survival-time data. Table II shows the Kaplan Meier analysis for the above 25 subjects (all died of lung cancer):

Table II. Kaplan Meier analysis (no censoring).

Kaplan Meier technique (All subjects died)

          Survival time   Standard error   95% CI
Mean      17.52           2.30             13.02, 22.02
Median    12.00           1.25             9.55, 14.45

What do we observe? The Kaplan Meier results of Table II are exactly the same as the descriptive results above. So why do we need to do a survival analysis? To quote a Chinese saying, we have used "a bull knife to kill a chicken": an "overkill in analysis"! The reason here is: since all the subjects died (presumably of lung cancer), we have no extra information to require us to perform a survival analysis – no censored data.
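The descriptive figures quoted above can be checked with a few lines of Python. This is a minimal sketch rather than the SPSS procedure itself: the normal-approximation confidence interval (mean ± 1.96 × SE) is an assumption on our part, although it does appear to agree with the "Mean" row of Table II.

```python
import math
import statistics

# Survival times (months) for the 25 patients, copied from the text.
times = [1, 5, 6, 6, 9, 10, 10, 10, 12, 12, 12, 12, 12,
         13, 15, 16, 20, 24, 24, 27, 32, 34, 36, 36, 44]

n = len(times)
mean = statistics.mean(times)
sd = statistics.stdev(times)        # sample SD (n - 1 in the denominator)
median = statistics.median(times)

# Assumption: normal-approximation 95% CI for the mean.
se_mean = sd / math.sqrt(n)
ci_mean = (mean - 1.96 * se_mean, mean + 1.96 * se_mean)

print(f"n = {n}, mean (sd) = {mean:.2f} ({sd:.2f}), median = {median}")
print(f"SE of mean = {se_mean:.2f}, "
      f"95% CI = {ci_mean[0]:.2f}, {ci_mean[1]:.2f}")
```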

What are censored observations? Censored observations arise in cases for which
• the critical event has not yet occurred
• the subject is lost to follow-up
• other interventions are offered
• the event occurred but from an unrelated cause

Let us consider the situation where we have more information (censored cases) for our 25 lung cancer patients: 1#, 5#, 6, 6, 9#, 10, 10, 10#, 12, 12, 12, 12, 12#, 13#, 15#, 16#, 20#, 24, 24#, 27#, 32, 34#, 36#, 36#, 44# months (where # denotes censored observations).
The subject with 44# is definitely a surviving person at the point of analysis (we cannot "ask" the patient to die – not ethical!). The 1# could be one who enrolled into the study recently and is still surviving. Perhaps the 5# could be one who (after five months) decided to seek other help and did not return to the study; his survival status is unknown. Lastly, the 13# could be one who died but not because of lung cancer. In all, 10 of the 25 subjects died from lung cancer.
How do we present this data in SPSS? Table III shows the 1st six cases, as an example.

Table III. Survival analysis dataset in SPSS.

Subject number   Survival time   Status
1                1               0
2                5               0
3                6               1
4                6               1
5                9               0
6                10              1
etc

The last variable "Status" tells SPSS which case is censored (denoted by 0) and which case is an event (dying of lung cancer, denoted by 1).
To perform a Kaplan Meier analysis in SPSS, go to Analyze, Survival, Kaplan Meier to get Template I.

Template I. Kaplan Meier analysis.

Put the variables "time" and "status" at their appropriate options, then click on the "Define Event" button to get Template II.

Template II. Defining the event.

Put a 1 as the event, as defined accordingly. Click "Continue". In Template I, click on the "Options" folder and check the boxes as shown in Template III.

Template III. Kaplan Meier options.

Ticking the "Mean and median survival" option gives Table IV.

Table IV. Kaplan Meier analysis (with censoring).

Kaplan Meier technique

          Survival time   Standard error   95% CI
Mean      28.51           3.54             21.58, 35.44
Median    32.00           14.43            3.71, 60.29

Table IV shows the Kaplan Meier analysis with censored data information taken into account. We observe that the median survival time has increased from 12 months (without censoring) to 32 months.
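To see where the figures in Tables II and IV come from, the Kaplan Meier (product-limit) estimate can be computed by hand. The sketch below is written from the textbook definition, not taken from SPSS; the function names are our own, and restricting the mean to the largest observed time (44 months) is an assumption.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit)."""
    data = list(zip(times, events))
    curve, surv = [], 1.0
    for t in sorted({u for u, e in data if e == 1}):
        at_risk = sum(1 for u, _ in data if u >= t)
        deaths = sum(1 for u, e in data if u == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


def restricted_mean(curve, t_max):
    """Area under the step function S(t) from 0 to t_max."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (t_max - prev_t)


# Times and status (1 = died of lung cancer, 0 = censored) from the text.
times = [1, 5, 6, 6, 9, 10, 10, 10, 12, 12, 12, 12, 12,
         13, 15, 16, 20, 24, 24, 27, 32, 34, 36, 36, 44]
status = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0,
          0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]

no_censoring = kaplan_meier(times, [1] * len(times))
with_censoring = kaplan_meier(times, status)

for label, curve in [("no censoring", no_censoring),
                     ("with censoring", with_censoring)]:
    print(label,
          "median =", median_survival(curve),
          "mean = %.2f" % restricted_mean(curve, max(times)))
```

With every subject treated as an event, the estimate collapses to the ordinary mean and median; with the censoring indicator supplied, subjects leave the risk set without pulling the curve down, which is why the median moves from 12 to 32 months.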

This means that by factoring in the "extra" information, we are being "realistic" about the survival time of, in this case, lung cancer, or being "fair" to the treatment under study with the intent of extending the survival time of these subjects. Fig. 2 shows the survival plots for both the censored and no-censoring scenarios.

Fig. 2 Survival plots – lung cancer example.
[Two panels, "No censoring" and "With censoring": cumulative survival against time (in months), showing the survival function and censored observations.]

COMPARING TWO SURVIVAL CURVES
Kaplan Meier can be used to compare two treatment groups on their survival times. Put the variable "group" in the "Factor" option, see Template IV.

Template IV. Defining the factor for comparison.

Click on "Compare Factor" on the left-hand corner of Template IV to invoke the log-rank test to compare the two groups (Template V).

Template V. The log-rank test.

Table V shows the mean/median survival times for the control and active groups, with log-rank test p = 0.1835: no statistically significant difference between the active and control groups in time to event. The survival plot is given in Fig. 3. One common misconception of survival analysis is that some researchers interpret the result as one group being more likely to have deaths (this should be given by logistic regression!). It is the time to event which is the primary response here.

Table V. Kaplan Meier analysis for comparison between two groups.

Survival analysis for time

Factor group = control

                  Survival time   Standard error   95% confidence interval
Mean              21              5                (12, 30)
(Limited to 36)
Median            12              2                (7, 17)

Factor group = active

                  Survival time   Standard error   95% confidence interval
Mean              31              4                (23, 39)
(Limited to 44)
Median            32              8                (17, 47)

                  Total   Number of events   Number censored   Percent censored
Group control     12      5                  7                 58.33
Group active      13      5                  8                 61.54
Overall           25      10                 15                60.00

Test statistics for equality of survival distributions for group

                  Statistic   df   Significance
Log rank          1.77        1    .1835

Fig. 3 Survival plot for comparison of two groups.
[Survival functions: cumulative survival against time (in months) for the active and control groups, with censored observations marked.]

The Kaplan Meier technique is the univariate version of survival analysis. To take confounders into account in the analysis, we have to use Cox regression.
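For readers who want to see what the log-rank test computes, here is a minimal sketch of the standard two-group statistic. The article does not list which patients belong to which group, so the two small groups below are hypothetical illustration data; the function names are also our own, and SPSS may differ in implementation details. The last line only checks that a chi-square statistic of 1.77 on 1 df corresponds to the significance reported in Table V.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with df = 1."""
    pooled = ([(t, e, 0) for t, e in zip(times_a, events_a)] +
              [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in pooled if e == 1}):
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)
        n_b = sum(1 for u, _, g in pooled if u >= t and g == 1)
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n          # expected deaths in group A
        if n > 1:                          # hypergeometric variance term
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    return chi2, chi2_sf_1df(chi2)


# Hypothetical data (NOT the article's patients): months, 1 = event.
group_a_times, group_a_events = [6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1]
group_b_times, group_b_events = [4, 5, 8, 9, 12, 13], [1, 1, 1, 0, 1, 1]

chi2, p = logrank_test(group_a_times, group_a_events,
                       group_b_times, group_b_events)
print(f"log-rank chi-square = {chi2:.3f}, p = {p:.4f}")

# Consistency check against Table V: statistic 1.77 on 1 df.
print(f"p for chi-square 1.77 = {chi2_sf_1df(1.77):.4f}")
```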

COX REGRESSION
For the above lung cancer example, we have collected information on race, age and gender, and want to look at a confounder model to determine whether the two groups differ after adjusting for demographics.
To perform a Cox regression, go to Analyze, Survival, Cox regression to get Template VI.

Template VI. Cox regression: lung cancer example.

The declaration for the categorical variables is similar to that discussed in the logistic regression article(4): click on the "Categorical" folder and put group, race and sex as the categorical covariates (Template VII).

Template VII. Declaration of categorical variables.

In Template VI, click on "Options" to invoke the 95% CI for the hazard ratio (HR), given by the expression exp(B), which is also the expression for odds ratios in logistic regression. This is another common mistake: researchers at times refer to odds ratios in survival analysis (misled by the same symbol). The interpretation of the hazard ratio is similar to that of the odds ratio. A value of one means there is no difference between the two groups in having a "shorter time to event". An HR > 1 means that the group of interest, compared with the reference group (to be observed from the categorical declaration), is likely to have a shorter time to event. An HR < 1 means that the group of interest is less likely to have a shorter time to event.

Template VIII. Invoking the 95% CI for the hazard ratio.

From Template VI, ask for plots to get Template IX: click on "Survival" and Separate Lines for "group".

Template IX. Survival plot for Cox regression.

The following Tables VIa – e show the results for the Cox regression.

Table VIa. Categorical definition.

Categorical variable codings

                        Frequency   (1)   (2)   (3)
Group   1.00=control    12          1
        2.00=active     13          0
Race    1=chinese       15          1     0     0
        2=indian        5           0     1     0
        3=malay         2           0     0     1
        4=other         3           0     0     0
Sex     1=male          17          1
        2=female        8           0

The reference category for group is active, for race is "other race" and for sex is female.
Table VIb gives the p-values (Sig) and the hazard ratios (Exp(B)) of the variables. Firstly, we have to check for multicollinearity by observing whether the SEs of all the variables are small (see logistic regression(4) for a detailed discussion on this checking).

Table VIb. Estimates of variables in Cox regression.

Variables in the equation

95.0% CI for Exp(B)


B SE Wald df Sig. Exp(B) Lower Upper

Group 1.841 .911 4.086 1 .043 6.302 1.058 37.550


Sex 3.670 1.435 6.542 1 .011 39.263 2.358 653.769
Age .115 .043 7.137 1 .008 1.122 1.031 1.220
Race 2.066 3 .559
Race(1) -.307 1.181 .068 1 .795 .735 .073 7.448
Race(2) .983 1.299 .573 1 .449 2.672 .210 34.060
Race(3) .907 1.469 .381 1 .537 2.476 .139 44.085
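The Wald, Sig., Exp(B) and confidence-interval columns of Table VIb can be recovered from B and SE alone. The sketch below assumes the usual Wald construction (hazard ratio exp(B), 95% CI exp(B ± 1.96 × SE), Wald statistic (B/SE)² on 1 df); the helper name is our own. Small differences from the printed table are expected because B and SE are shown rounded to three decimals.

```python
import math


def hazard_ratio_summary(b, se, z=1.96):
    """Return (HR, lower, upper, Wald, p) from a coefficient and its SE."""
    hr = math.exp(b)
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    wald = (b / se) ** 2
    p = math.erfc(math.sqrt(wald / 2))   # chi-square upper tail, 1 df
    return hr, lower, upper, wald, p


# B and SE copied from Table VIb.
rows = {"Group": (1.841, 0.911), "Sex": (3.670, 1.435), "Age": (0.115, 0.043)}

for name, (b, se) in rows.items():
    hr, lower, upper, wald, p = hazard_ratio_summary(b, se)
    print(f"{name:5s} HR = {hr:7.3f}  95% CI {lower:.3f} - {upper:.3f}  "
          f"Wald = {wald:.3f}  p = {p:.3f}")
```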

Since this is an adjusting-for-confounder model, our interest is only in the variable group. 'Thankfully' the p-value is 0.043 (statistically significant!) compared to the Kaplan Meier analysis (well, we do not always get this happy ending). The HR is 6.302 (95% CI 1.058 - 37.55), comparing the control with the active (obtained from the categorical definition Table VIa): the control group is likely to have a shorter time to event and, in this example, the event is death.
What is going on here? Why is there now a statistical difference? Table VIb also showed that there are statistical differences for gender and also age: the men and older people were doing worse. Performing a cross-tabulation shows that there are more men and fewer women in the control group (p = 0.673) and mean age is higher in the active group. See Tables VIc and VId.

Table VIc. Cross-tabulation between group and gender.

The sex of the patient * group cross-tabulation

                                            Group
                                     Control    Active    Total
Sex of patient   Male     Count            9         8        17
                          % within group   75.0%     61.5%    68.0%
                 Female   Count            3         5        8
                          % within group   25.0%     38.5%    32.0%
Total                     Count            12        13       25
                          % within group   100.0%    100.0%   100.0%

Table VId. Age differences between group (p=0.737).

Group statistics

       Group     N    Mean      Std. deviation   Std. error mean
Age    active    13   31.6923   16.16263         4.48271
       control   12   29.5833   14.73683         4.25416

Thus, taking this information into account, a treatment difference is found, as observed from the survival plot in Fig. 4.

Fig. 4 Survival plot for the lung cancer example.
[Survival functions for patterns 1 - 2: cumulative survival against time (in months) for the active and control groups.]

The above exercise showed that it is not enough to stop at the univariate analysis; we should always perform a multivariate analysis to present the realistic situation!
Since we found a difference between treatment groups, do you want to stop here? How about interaction between gender and group, or age and group? The question of interest would be: is there a particular group (female on active, for example) performing better? Note that we will start to ask these questions only when the "main effects" model showed significant differences in the variables of interest.
How do we put in the interaction term? In Template VI, highlight group 1st, hold the ctrl key and highlight age. Observe that the button >a*b> becomes "visible"; click on this button (see Template X).

Table VIe. Result with interaction terms.

Variables in the equation

95.0% CI for Exp(B)


B SE Wald df Sig. Exp(B) Lower Upper

Group -5.524 4.891 1.276 1 .259 .004 .000 58.121


Sex 1.687 1.716 .966 1 .326 5.401 .187 156.115
Age .082 .055 2.186 1 .139 1.085 .974 1.200
Race 3.171 3 .366
Race(1) -.869 1.341 .420 1 .517 .419 .030 5.804
Race(2) 1.112 1.261 .777 1 .378 3.041 .257 36.039
Race(3) 1.018 1.570 .421 1 .517 2.769 .128 60.107
Age*group .121 .089 1.823 1 .177 1.128 .947 1.344
Group*sex 5.584 3.261 2.933 1 .087 266.224 .447 158709.101

Template X. Preparing to put an interaction term group*age.

Click on the >a*b> button to activate age*group(Cat), see Template XI. Do the same for gender*group.

Template XI. Activating an interaction term.

Table VIe shows that none of the interaction terms are significant. This implies that regardless of age or gender, the active group is performing better (from Table VIb).
Let us discuss another example on the use of interaction terms, using the breast cancer survival dataset from SPSS. Variables collected were age and the categorical histology grade, oestrogen receptor status, progesterone receptor status, pathological tumour size and lymph node status. The interest is to determine the predictors for a shorter survival time to death.

Table VIIa. Categorical definition – breast cancer example.

Categorical variable codings

                          Frequency   (1)   (2)
histgrad   1=1            56          0     0
           2=2            352         1     0
           3=3            252         0     1
er         0=negative     262         0
           1=positive     398         1
pr         0=negative     299         0
           1=positive     361         1
pathscat   1=<=2cm        457         0     0
           2=2-5cm        196         1     0
           3=>5cm         7           0     1
ln_yesno   0=no           485         0
           1=yes          175         1

The reference group for histology grade is grade 1; for er, pr and lymph node it is negative; and for tumour size it is ≤2cm.

Table VIIb. Main effects model – breast cancer example.

Variables in the equation

95.0% CI for Exp(B)


B SE Wald df Sig. Exp(B) Lower Upper

Age -.021 .014 2.200 1 .138 .980 .953 1.007


histgrad .872 2 .647
histgrad(1) .778 1.036 .564 1 .453 2.177 .286 16.587
histgrad(2) .942 1.056 .796 1 .372 2.564 .324 20.300
er -.022 .432 .003 1 .959 .978 .419 2.281
pr -.455 .422 1.159 1 .282 .635 .277 1.452
pathscat 6.005 2 .050
pathscat(1) .638 .336 3.614 1 .057 1.893 .980 3.657
pathscat(2) 1.484 .776 3.658 1 .056 4.412 .964 20.200
ln_yesno .724 .337 4.605 1 .032 2.063 1.065 3.997

Table VIIc. Interaction terms – breast cancer example.

Variables in the equation

95.0% CI for Exp(B)


B SE Wald df Sig. Exp(B) Lower Upper

Age -.023 .014 2.845 1 .092 .977 .951 1.004


histgrad 1.165 2 .559
histgrad(1) 1.047 1.067 .962 1 .327 2.848 .352 23.068
histgrad(2) 1.161 1.081 1.153 1 .283 3.192 .384 26.563
er -.063 .424 .022 1 .881 .939 .409 2.156
pr -.516 .413 1.556 1 .212 .597 .266 1.342
pathscat 8.520 2 .014
pathscat(1) -.179 .501 .128 1 .721 .836 .313 2.233
pathscat(2) 3.100 1.102 7.904 1 .005 22.189 2.557 192.566
ln_yesno .006 .505 .000 1 .990 1.006 .374 2.706
ln_yesno*pathscat 8.564 2 .014
ln_yesno*pathscat(1) 1.670 .707 5.574 1 .018 5.312 1.328 21.248
ln_yesno*pathscat(2) -1.847 1.547 1.425 1 .233 .158 .008 3.274
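When a model contains an interaction, the coefficients in Table VIIc combine additively on the log-hazard scale before exponentiating. The sketch below assumes indicator (0/1) coding as shown in Table VIIa, so that each interaction term is the product of the two dummy variables, and it holds the other covariates fixed; the dictionary and function names are our own.

```python
import math

# Coefficients (B) copied from Table VIIc; reference categories have B = 0.
b_size = {"<=2cm": 0.0, "2-5cm": -0.179, ">5cm": 3.100}    # pathscat
b_node = 0.006                                             # ln_yesno
b_inter = {"<=2cm": 0.0, "2-5cm": 1.670, ">5cm": -1.847}   # ln_yesno*pathscat


def hr_vs_reference(size, node_positive):
    """HR relative to a node-negative subject with tumour size <=2cm."""
    log_hazard = b_size[size]
    if node_positive:
        log_hazard += b_node + b_inter[size]
    return math.exp(log_hazard)


for size in b_size:
    for node_positive in (False, True):
        label = "positive" if node_positive else "negative"
        print(f"size {size:6s} node {label:8s} "
              f"HR = {hr_vs_reference(size, node_positive):.2f}")
```

Read this way, each printed Exp(B) in the table is the hazard ratio for one term with the others held at their reference level, while the ratio between two rows of this output gives the effect of lymph node status within a given tumour size category.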

Those with a positive lymph node are more likely to have a shorter time to death (HR = 2.06, 95% CI 1.07 - 4.0, p = 0.032). Tumour size is "just off statistical significance". Should we conclude that only women with a positive lymph node are at a higher risk? Chotto matte (wait a minute): what happens if we include a lymph node * tumour size interaction (see Table VIIc)?
Here we can see that lymph node status is no longer statistically significant but tumour size and their interaction are! The results are telling us that regardless of the lymph node status, subjects with tumour size >5cm are at risk (HR=22.19, 95% CI 2.56 - 192.57, p=0.005) and for subjects with tumour size 2 - 5cm, they are at a higher risk if they have a positive lymph node (HR=5.31, 95% CI 1.33 - 21.25, p=0.018).
One last assumption to check: the proportional hazards model. From the lung cancer example, in Template IX, click on the "log-minus-log" plot option to get Fig. 5; we do not want the lines to cross each other. When the proportional hazards assumption is not satisfied, we will have to use Cox regression with time-dependent covariates to analyse the data.
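The quantity drawn in a log-minus-log plot is log(−log S(t)) for each group; under proportional hazards the curves for the groups are separated by a roughly constant vertical distance and so should not cross. The sketch below is an approximation of the idea only: it uses Kaplan Meier curves per group rather than the Cox-model-based curves that SPSS plots, the two groups are hypothetical illustration data, and the function names are our own.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit)."""
    data = list(zip(times, events))
    curve, surv = [], 1.0
    for t in sorted({u for u, e in data if e == 1}):
        at_risk = sum(1 for u, _ in data if u >= t)
        deaths = sum(1 for u, e in data if u == t and e == 1)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def log_minus_log(curve):
    """log(-log S(t)) at each step; defined only where 0 < S(t) < 1."""
    return [(t, math.log(-math.log(s))) for t, s in curve if 0.0 < s < 1.0]


# Hypothetical data (NOT the article's patients): months, 1 = event.
groups = {
    "active": ([6, 7, 10, 15, 19, 25, 30], [1, 0, 1, 1, 0, 1, 0]),
    "control": ([4, 5, 8, 9, 12, 13, 20], [1, 1, 1, 0, 1, 1, 0]),
}

lml = {name: log_minus_log(kaplan_meier(t, e))
       for name, (t, e) in groups.items()}

for name, points in lml.items():
    print(name, [(t, round(v, 2)) for t, v in points])
```

Plotting the two lists against time (or log time) and checking by eye that the lines stay apart is the informal check described above.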

Fig. 5 Log-minus-log plot for proportional hazard checking.
[LML function for patterns 1 - 2: log minus log against time (in months) for the active and control groups.]

Our next article will be "Biostatistics 301. Repeated measurement analysis".

REFERENCES
1. Chan YH. Biostatistics 102. Quantitative data – parametric and non-parametric tests. Singapore Med J 2003; 44:391-6.
2. Chan YH. Biostatistics 103. Qualitative data – tests of independence. Singapore Med J 2003; 44:498-503.
3. Chan YH. Biostatistics 201. Linear regression analysis. Singapore Med J 2004; 45:55-61.
4. Chan YH. Biostatistics 202. Logistic regression analysis. Singapore Med J 2004; 45:149-53.
