Survival Analysis Practical
Practical 1:
Compute and plot the estimated survival function, the probability density function, and the hazard
function from the data given below. Also, check whether the survival times follow an exponential
(or Weibull) distribution.
Solution:
For uncensored data, the survival function is
$$S(t) = \frac{\text{Number of patients surviving longer than } t}{\text{Total number of patients}}$$
R-code:
year <- 0:9
num_alive <- c(1100, 860, 680, 496, 358, 240, 180, 128, 84, 52)  # alive at start of each interval
num_dying <- c(240, 180, 184, 138, 118, 60, 52, 44, 32, 28)      # deaths within each interval
survival <- num_alive/1100             # proportion surviving longer than t
pdf <- num_dying / (1100*1)            # deaths / (total patients * interval width of 1 year)
hazard <- num_dying / (num_alive*1)    # deaths / (number alive at start * interval width)
data<-data.frame(year,num_alive,num_dying,survival,pdf,hazard)
data
par(mfrow=c(2,2))
plot(year, survival, type = "s", ylim = c(0, 1), xlab = "Year of follow-up",
     ylab = "Survival Probability", main = "Survival Function")
plot(year, pdf, type = "l", ylim = c(0, max(pdf)), xlab = "Year of follow-up",
     ylab = "PDF", main = "PDF")
plot(year, hazard, type = "l", ylim = c(0, max(hazard)), xlab = "Year of follow-up",
     ylab = "Hazard Function", main = "Hazard Function")
dev.off()  # close the graphics device, which also resets the 2 x 2 layout
Output:
Table: Estimates of survival function, PDF, and hazard function
     Year   Number alive at the         Number dying in   Survival      PDF          Hazard
            beginning of the interval   the interval      probability                function
1    0      1100                        240               1.00000000    0.21818182   0.2181818
2    1       860                        180               0.78181818    0.16363636   0.2093023
3    2       680                        184               0.61818182    0.16727273   0.2705882
4    3       496                        138               0.45090909    0.12545455   0.2782258
5    4       358                        118               0.32545455    0.10727273   0.3296089
6    5       240                         60               0.21818182    0.05454545   0.2500000
7    6       180                         52               0.16363636    0.04727273   0.2888889
8    7       128                         44               0.11636364    0.04000000   0.3437500
9    8        84                         32               0.07636364    0.02909091   0.3809524
10   9        52                         28               0.04727273    0.02545455   0.5384615
Figure: Graph of survival function, PDF, and hazard function
Output:
Figure: Graph to check the fitting of exponential and Weibull distribution to the given data
The first panel of the above figure plots the log of the estimated survival function against the
survival time and shows a straight line with negative slope, which indicates that an exponential
distribution fits the given data. The second panel plots the log of the negative log of the
estimated survival function against the log of survival time and shows a positive linear trend,
which supports the fit of a Weibull distribution to the data.
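The R code behind this diagnostic figure is not listed. A minimal sketch, assuming the yearly counts from the R-code of Practical 1; the least-squares lines from `lm` are added here only as a visual aid and are not part of the original solution. The year-0 row is dropped because log(0) is undefined.

```r
# Yearly counts from Practical 1
year <- 0:9
num_alive <- c(1100, 860, 680, 496, 358, 240, 180, 128, 84, 52)
survival <- num_alive / 1100

# Drop year 0: log(0) and log(-log(1)) are both -Inf
keep <- year > 0
t_pos <- year[keep]
s_pos <- survival[keep]

par(mfrow = c(1, 2))

# Exponential check: log S(t) = -lambda * t is linear in t with negative slope
plot(t_pos, log(s_pos), xlab = "t", ylab = "log S(t)", main = "Exponential check")
exp_fit <- lm(log(s_pos) ~ t_pos)
abline(exp_fit, lty = 2)

# Weibull check: log(-log S(t)) = beta * log(lambda) + beta * log(t) is linear in log(t)
plot(log(t_pos), log(-log(s_pos)), xlab = "log t", ylab = "log(-log S(t))",
     main = "Weibull check")
wei_fit <- lm(log(-log(s_pos)) ~ log(t_pos))
abline(wei_fit, lty = 2)

par(mfrow = c(1, 1))

exp_slope <- unname(coef(exp_fit)[2])  # rough estimate of -lambda
wei_slope <- unname(coef(wei_fit)[2])  # rough estimate of the shape beta
c(exp_slope = exp_slope, wei_slope = wei_slope)
```

A Weibull slope close to 1 in the second panel would correspond to the exponential special case.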
Practical 2:
Given that the pdf of the Weibull distribution is $f(t) = \lambda\beta(\lambda t)^{\beta-1}\exp\{-(\lambda t)^{\beta}\}$,
use your roll number as the random seed to simulate a random sample of size 1000 from the
distribution with shape parameter 2 and scale parameter 5. Do the following: (i) estimate the
parameters of this distribution; (ii) find the score function and the observed information matrix;
(iii) find 95% CIs for $\beta$ and $\log(\beta)$, with interpretation; (iv) test the two hypotheses
$H_0: \theta = (1, 2)$ vs $H_1: \theta \neq (1, 2)$, where $\theta = (\beta, \lambda)$, and
$H_0: \beta = 1$ vs $H_1: \beta \neq 1$.
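Note that the score function and the information matrix for the Weibull distribution are
$$U(\beta, \lambda) = \begin{pmatrix} \dfrac{n}{\beta} + \sum\left[\log(\lambda t) - (\lambda t)^{\beta}\log(\lambda t)\right] \\[2ex] \dfrac{n\beta}{\lambda} - \sum \dfrac{\beta(\lambda t)^{\beta}}{\lambda} \end{pmatrix}$$
and
$$I(\beta, \lambda) = \begin{pmatrix} \dfrac{n}{\beta^{2}} + \sum\left[(\lambda t)^{\beta}(\log(\lambda t))^{2}\right] & -\dfrac{n}{\lambda} + \sum \dfrac{(\lambda t)^{\beta}\left[1 + \beta\log(\lambda t)\right]}{\lambda} \\[2ex] -\dfrac{n}{\lambda} + \sum \dfrac{(\lambda t)^{\beta}\left[1 + \beta\log(\lambda t)\right]}{\lambda} & \dfrac{n\beta}{\lambda^{2}} + \sum \dfrac{\beta(\beta-1)(\lambda t)^{\beta}}{\lambda^{2}} \end{pmatrix},$$
respectively.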
Solution:
(i) Estimation of the parameters:
Finding the log-likelihood function:
𝐿 = ∏ 𝑓(𝑡)
𝑙 = ∑ log (𝑓(𝑡))
R-code:
set.seed(07)
t <- rweibull(1000, 2, 5) # Generating data from the Weibull distribution
loglik <- function(theta, t)
{
beta = theta[1]
lambda = theta[2]
-sum(log(dweibull(t, shape = beta, scale = 1/lambda)))
}
# optim() minimises its objective, so the function returns the negative
# log-likelihood; minimising it maximises the log-likelihood.
# scale = 1/lambda because R's Weibull functions use a scale parameter that
# is the inverse of lambda in the pdf above.
Output:
The obtained estimates are $\hat{\beta} = 2.03$ and $\hat{\lambda} = 0.203$, which corresponds to an
estimated scale of $1/\hat{\lambda} = 1/0.203 = 4.92$.
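The call that produces these estimates is not listed. A minimal sketch of how the negative log-likelihood could be minimised, assuming starting values `c(1, 1)` and the default Nelder-Mead method of `optim` (both are assumptions, not taken from the original solution). The name `negloglik` and the guard against non-positive parameters are illustrative additions.

```r
set.seed(7)
t <- rweibull(1000, shape = 2, scale = 5)  # simulated sample as in the practical

# Negative log-likelihood in the (beta, lambda) parametrisation, scale = 1/lambda
negloglik <- function(theta, t) {
  beta <- theta[1]
  lambda <- theta[2]
  if (beta <= 0 || lambda <= 0) return(Inf)  # keep the search inside the parameter space
  -sum(dweibull(t, shape = beta, scale = 1 / lambda, log = TRUE))
}

# Assumed starting values c(1, 1); hessian = TRUE also returns a numerical
# Hessian of the negative log-likelihood at the optimum
fit <- optim(par = c(1, 1), fn = negloglik, t = t, hessian = TRUE)
hat.theta <- fit$par   # (beta_hat, lambda_hat)
hat.theta
1 / hat.theta[2]       # estimated scale
```

The exact digits depend on the seed, the starting values, and the optimiser's stopping rule, so small differences from the reported estimates are to be expected.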
Output:
The estimated score function, evaluated at the estimates, is
$$\hat{U}(\beta, \lambda) = \begin{pmatrix} 0.022 \\ 1.406 \end{pmatrix}.$$
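The R code for the score function is not listed. A minimal sketch that codes the two components of $U(\beta, \lambda)$ from the formula in the problem statement and checks them against a central finite-difference gradient of the log-likelihood. The function name `score`, the test point `theta0`, and the step size `h` are illustrative assumptions.

```r
# Score vector U(beta, lambda) for the Weibull model with pdf
# f(t) = lambda * beta * (lambda * t)^(beta - 1) * exp(-(lambda * t)^beta)
score <- function(theta, t) {
  beta <- theta[1]
  lambda <- theta[2]
  n <- length(t)
  lt <- lambda * t
  c(n / beta + sum(log(lt) - lt^beta * log(lt)),       # d loglik / d beta
    n * beta / lambda - sum(beta * lt^beta / lambda))  # d loglik / d lambda
}

# Log-likelihood used only for the numerical check
loglik_check <- function(theta, t) {
  sum(dweibull(t, shape = theta[1], scale = 1 / theta[2], log = TRUE))
}

set.seed(7)
t <- rweibull(1000, shape = 2, scale = 5)

theta0 <- c(1.8, 0.25)  # arbitrary test point, deliberately not the MLE
h <- 1e-6               # finite-difference step
num_grad <- c(
  (loglik_check(theta0 + c(h, 0), t) - loglik_check(theta0 - c(h, 0), t)) / (2 * h),
  (loglik_check(theta0 + c(0, h), t) - loglik_check(theta0 - c(0, h), t)) / (2 * h)
)
analytic <- score(theta0, t)
rbind(analytic, num_grad)
```

Evaluating `score` at the fitted parameter vector should give values close to zero; how close depends on how tightly the optimiser converged.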
obs.inf <- function(theta, t)
{
  n = length(t)  # sample size
  beta = theta[1]
  lambda = theta[2]
  output = matrix(0, 2, 2)
  output[1, 1] = n / beta^2 + sum((lambda * t)^(beta) * (log(lambda * t))^2)
  output[1, 2] = sum((lambda * t)^(beta) * (1 + beta * log(lambda * t))) / lambda - n / lambda
  output[2, 1] = output[1, 2]  # the information matrix is symmetric
  output[2, 2] = n * beta / lambda^2 + sum(beta * (beta - 1) * (lambda * t)^(beta)) / lambda^2
  output
}
obs.inf(hat.theta, t)
Output:
The estimated observed information matrix is:
$$\hat{I}(\beta, \lambda) = \begin{pmatrix} 434.476 & 2061.097 \\ 2061.098 & 100397.093 \end{pmatrix}.$$
Output:
95% CI for 𝛽: (1.936, 2.134)
The 95% confidence interval for 𝛽, (1.936, 2.134), suggests that we are 95% confident that the true
value of the shape parameter 𝛽 lies between 1.936 and 2.134. In other words, if we were to
repeatedly sample from the same population and compute a confidence interval from each sample,
approximately 95% of those intervals would include the true value of 𝛽.
Output:
95% CI for log (𝛽): (0.662, 0.759)
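The code for the two intervals is not listed. A minimal sketch of Wald-type intervals, assuming the information matrix reported above and the rounded estimate $\hat{\beta} = 2.03$; the interval for $\log(\beta)$ uses the delta method, $SE(\log\hat{\beta}) = SE(\hat{\beta})/\hat{\beta}$. Because the rounded estimate is used, the endpoints can differ from the reported ones in the third decimal place.

```r
# Reported observed information matrix, ordered as (beta, lambda); filled column-wise
info <- matrix(c(434.476, 2061.098,
                 2061.097, 100397.093), nrow = 2)
beta_hat <- 2.03  # rounded estimate from the text

inv_info <- solve(info)           # estimated covariance matrix of (beta_hat, lambda_hat)
se_beta <- sqrt(inv_info[1, 1])   # standard error of beta_hat
z <- qnorm(0.975)

# Wald interval on the original scale
ci_beta <- beta_hat + c(-1, 1) * z * se_beta

# Delta method: SE(log beta_hat) = SE(beta_hat) / beta_hat
se_log_beta <- se_beta / beta_hat
ci_log_beta <- log(beta_hat) + c(-1, 1) * z * se_log_beta

ci_beta
ci_log_beta
exp(ci_log_beta)  # back-transformed interval for beta, always positive
```

Back-transforming the interval for $\log(\beta)$ gives an interval for $\beta$ that respects the constraint $\beta > 0$.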
Output:
Wald statistic: 𝑊𝑛 = 316972.1
p-value < 0.001
Likelihood ratio statistic: 𝑅𝑛 = 11705.03
p-value < 0.001
Since the p-value is less than 0.05, we may reject the null hypothesis at the 5% level of
significance and conclude that the true value of 𝛳 = (𝛽, 𝜆) is different from (1, 2).
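The code for this joint test is not listed. A minimal sketch of the Wald and likelihood ratio statistics for $H_0: \theta = (1, 2)$, each compared with a chi-square distribution on 2 degrees of freedom. It assumes the same simulated sample, starting values `c(1, 1)`, and the numerical Hessian returned by `optim` as the observed information (the original may have used the analytic matrix instead).

```r
set.seed(7)
t <- rweibull(1000, shape = 2, scale = 5)

# Negative log-likelihood in the (beta, lambda) parametrisation
negloglik <- function(theta, t) {
  if (any(theta <= 0)) return(Inf)
  -sum(dweibull(t, shape = theta[1], scale = 1 / theta[2], log = TRUE))
}

fit <- optim(c(1, 1), negloglik, t = t, hessian = TRUE)
hat.theta <- fit$par
theta0 <- c(1, 2)  # hypothesised (beta, lambda)

# Wald statistic: (theta_hat - theta0)' I(theta_hat) (theta_hat - theta0)
d <- hat.theta - theta0
wn <- drop(crossprod(d, fit$hessian %*% d))
p_wald <- pchisq(wn, df = 2, lower.tail = FALSE)

# Likelihood ratio statistic: 2 * {loglik(theta_hat) - loglik(theta0)}
rn <- 2 * (negloglik(theta0, t) - fit$value)
p_lrt <- pchisq(rn, df = 2, lower.tail = FALSE)

c(wn = wn, p_wald = p_wald, rn = rn, p_lrt = p_lrt)
```

Both statistics have 2 degrees of freedom here because the null hypothesis fixes both parameters.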
Testing 𝑯𝟎 : 𝛽 = 1 𝑣𝑠 𝑯𝟏 : 𝛽 ≠ 1
R-code:
# Wald-type test
wn1 <- (hat.theta[1]-1)^2/inv.obs.inf[1,1]
1-pchisq(wn1,1)
# Likelihood ratio test; under H0 (beta = 1) the MLE of lambda is 1/mean(t)
rn1 <- 2*((-loglik(hat.theta,t))-(-loglik(c(1,1/mean(t)),t)))
1-pchisq(rn1,1)
Output:
𝑊𝑛 = 420.0485
p-value < 0.001
𝑅𝑛 = 587.7729
p-value < 0.001
Since the p-value is less than 0.05, we may reject the null hypothesis at the 5% level of
significance and conclude that the true value of 𝛽 is different from 1.
Practical 3:
The lifetimes (in months) of 10 pieces of equipment are given below:
Item Number 1 2 3 4 5 6 7 8 9 10
𝑇𝑖 2 - 51 - 33 22 19 24 4 -
𝐶𝑖 81 72 70 60 41 31 31 30 29 21
Assume that the lifetimes follow an exponential distribution. Under type I censoring, find the
95% CI for the parameter of the exponential distribution using the Sprott and Cox methods.
Solution:
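In the Sprott method,
$$T = \sum_{i=1}^{r} T_i + (n - r)\,T_r, \quad \hat{\theta} = \frac{T}{r}, \quad \hat{\phi} = \hat{\theta}^{-1/3}, \quad \hat{Q} = \sum\left(1 - e^{-C_i/\hat{\theta}}\right),$$
$$SE(\hat{\phi}) = \left(\frac{\hat{\phi}^{2}}{9\hat{Q}}\right)^{1/2}, \quad \hat{\theta} = \hat{\phi}^{-3}.$$
In the Cox method,
$$2r\,\frac{\hat{\theta}}{\theta} \sim \chi^{2}_{(2r+1)}, \quad \text{or} \quad 14\,\frac{\hat{\theta}}{\theta} \sim \chi^{2}_{(15)}.$$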
R-code:
n <- 10
r <- 7
Ti <- c(2, 51, 33, 22, 19, 24, 4)                 # observed (uncensored) lifetimes
Ci <- c(81, 72, 70, 60, 41, 31, 31, 30, 29, 21)   # censoring times
# Sprott method
T_total <- sum(Ti) + (n - r) * max(Ti)            # T in the formula; renamed to avoid masking R's T (TRUE)
theta_hat <- T_total / r
phi_hat <- theta_hat^(-1/3)
Q_hat <- sum(1 - exp(-Ci / theta_hat))
std.er.phi_hat <- ((phi_hat^2) / (9 * Q_hat))^(1/2)
phi_ci <- c(phi_hat - 1.96 * std.er.phi_hat, phi_hat + 1.96 * std.er.phi_hat)
theta_ci <- rev(phi_ci^-3)   # theta = phi^(-3) decreases in phi, so reverse to get (lower, upper)
theta_ci
# Cox method
l <- qchisq(0.025,15)
u <- qchisq(0.975,15)
ci <- c(14*theta_hat/u, 14*theta_hat/l)
ci
Output:
Under the Sprott method, the 95% CI for 𝜃 is (21.828, 110.007), while under the Cox method the
interval is (22.40, 98.37).
Practical 4:
(i) Survival times of patients in a study on Myeloma are given below:
13, 52+, 6, 40, 10, 7+, 66, 10+, 10, 14, 16, 4, 65, 5, 11+, 10, 15+, 5, 76+, 56+, 88, 24, 51, 4, 40+,
8, 18, 5, 16, 50, 40, 1, 36, 5, 10, 91, 18+, 1, 18+, 6, 1, 23, 15, 18, 12+, 12, 17, 3+.
(Total 48 observations among which 12 observations are censored)
Find and plot the life table estimate of the survival function.
Solution:
Here,
𝑑𝑗 = Number of deaths in the jth interval
𝑐𝑗 = Number of censored survival times in the interval
𝑛𝑗 = Number of individuals who are alive at the start of the interval
$$n_j' = n_j - \frac{c_j}{2}, \qquad S^{*}(t) = \prod_{j} \frac{n_j' - d_j}{n_j'}.$$
R-code:
time_period <- seq(from = 0, to = 60, by=12)   # lower limits of the 12-month intervals
dj <- c(16, 10, 1, 3, 2, 4)                    # deaths in each interval
cj <- c(4, 4, 0, 1, 2, 1)                      # censored times in each interval
nj <- c(48, 28, 14, 13, 9, 5)                  # alive at the start of each interval
nj_prime <- nj-cj/2
s <- (nj_prime-dj)/nj_prime
st <- cumprod(s)                               # running product gives the life table estimate
data <- data.frame(time_period, dj, cj, nj, nj_prime, s, st)
data
plot(time_period, st, type = "s", ylim = c(0, 1), xlab = "Survival time",
     ylab = "Survival probability", lwd=2)
# Extend the step function to the largest observed time (91 months)
time_period = c(time_period,91)
st = c(st,st[length(st)])
plot(time_period, st, type = "s", ylim = c(0, 1), xlab = "Survival time",
     ylab = "Survival probability", lwd=2)
Output:
Table: Life table estimate of the survival function from Myeloma data
Time-period   𝑑𝑗    𝑐𝑗    𝑛𝑗    𝑛𝑗′     (𝑛𝑗′ − 𝑑𝑗)/𝑛𝑗′   𝑆∗(𝑡)
0-            16    4     48    46      0.6522           0.6522
12-           10    4     28    26      0.6154           0.4013 (0.6522 × 0.6154)
24-           1     0     14    14      0.9286           0.3727 (0.4013 × 0.9286)
36-           3     1     13    12.5    0.7600           0.2833
48-           2     2     9     8       0.7500           0.2124
60-           4     1     5     4.5     0.1111           0.0236
Figure: Plot of life table estimate of the survival function from Myeloma data
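The interval counts 𝑑𝑗, 𝑐𝑗 and 𝑛𝑗 are typed in by hand in the R-code above. A minimal sketch of how they could be derived from the raw Myeloma data, assuming 12-month intervals closed on the left (so a time of exactly 12 falls in the 12- interval) and coding censored times (marked +) as status 0:

```r
# Raw Myeloma survival times and status (1 = death, 0 = censored, i.e. marked "+")
time <- c(13, 52, 6, 40, 10, 7, 66, 10, 10, 14, 16, 4, 65, 5, 11, 10, 15, 5, 76, 56, 88, 24, 51, 4,
          40, 8, 18, 5, 16, 50, 40, 1, 36, 5, 10, 91, 18, 1, 18, 6, 1, 23, 15, 18, 12, 12, 17, 3)
status <- c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
            0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0)

# Intervals [0,12), [12,24), ..., [48,60), [60,Inf)
breaks <- c(seq(0, 60, by = 12), Inf)
interval <- cut(time, breaks = breaks, right = FALSE)

dj <- as.vector(table(interval[status == 1]))         # deaths per interval
cj <- as.vector(table(interval[status == 0]))         # censored times per interval
nj <- length(time) - head(c(0, cumsum(dj + cj)), -1)  # alive at the start of each interval

nj_prime <- nj - cj / 2                               # adjusted number at risk
st <- cumprod((nj_prime - dj) / nj_prime)             # life table estimate

data.frame(interval = levels(interval), dj, cj, nj, nj_prime, st = round(st, 4))
```

Changing `breaks` gives the life table for a different interval width without recounting by hand.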
R-code:
time <- c(0,10,19,30,36,59,75,93,97,107)   # time 0 and the distinct event times
nj <- c(18,18,15,13,12,8,7,6,5,3)          # number at risk just before each time
dj <- c(0,rep(1,9))                        # number of events at each time
k_s <- (nj-dj)/nj
k_st <- cumprod(k_s)                       # Kaplan-Meier estimate
k_data <- data.frame(time, nj, dj, k_s, k_st)
k_data
plot(time, k_st, type = "s", ylim = c(0, 1), xlab = "Survival time", ylab = "Survival probability",
lwd=2)
Output:
Table: Kaplan-Meier estimate of the survival function from IUD data
Time   𝑑𝑗   𝑛𝑗   (𝑛𝑗 − 𝑑𝑗)/𝑛𝑗   𝑆̂(𝑡)
0 0 18 1 1
10 1 18 0.944 0.944
19 1 15 0.933 0.8807 (0.944×0.933)
30 1 13 0.923 0.8129
36 1 12 0.917 0.7454
59 1 8 0.875 0.6522
75 1 7 0.857 0.5590
93 1 6 0.833 0.4656
97 1 5 0.800 0.3725
107 1 3 0.667 0.2484
Figure: Plot of Kaplan-Meier estimate of the survival function from IUD data
The Nelson-Aalen estimate of the survival function is given by:
$$\tilde{S}(t) = \prod_{j} \exp\left(-\frac{d_j}{n_j}\right).$$
R-code:
time <- c(0,10,19,30,36,59,75,93,97,107)
nj <- c(18,18,15,13,12,8,7,6,5,3)
dj <- c(0,rep(1,9))
ex <- exp(-dj/nj)
n_st <- sapply(1:length(ex), function(n) prod(ex[1:n]))
n_data <- data.frame(time, nj, dj, ex, n_st)
n_data
plot(time, n_st, type = "s", ylim = c(0, 1), xlab = "Survival time", ylab = "Survival probability",
lwd=2)
Output:
Table: Nelson-Aalen estimate of the survival function from IUD data
Time   𝑑𝑗   𝑛𝑗   exp(−𝑑𝑗/𝑛𝑗)   𝑆̃(𝑡)
0 0 18 1 1
10 1 18 0.945 0.945
19 1 15 0.935 0.884
30 1 13 0.925 0.819
36 1 12 0.920 0.754
59 1 8 0.882 0.665
75 1 7 0.867 0.576
93 1 6 0.846 0.488
97 1 5 0.818 0.399
107 1 3 0.716 0.286
Figure: Plot of Nelson-Aalen estimate of the survival function from IUD data
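The two estimates from the IUD data can be compared side by side. A minimal sketch, assuming the event times, numbers at risk, and event counts used in the R-code above (without the time-0 row). Since exp(−x) ≥ 1 − x for every x, the Nelson-Aalen estimate is never below the Kaplan-Meier estimate.

```r
# IUD data: distinct event times, numbers at risk, and events
time <- c(10, 19, 30, 36, 59, 75, 93, 97, 107)
nj <- c(18, 15, 13, 12, 8, 7, 6, 5, 3)
dj <- rep(1, 9)

km_est <- cumprod((nj - dj) / nj)   # Kaplan-Meier: product of (1 - dj/nj)
na_est <- exp(-cumsum(dj / nj))     # Nelson-Aalen: exp of minus the cumulative hazard

comparison <- data.frame(time,
                         km = round(km_est, 4),
                         nelson_aalen = round(na_est, 4),
                         difference = round(na_est - km_est, 4))
comparison
```

The gap between the two estimates widens at later times, where the numbers at risk are small.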
R-code:
x1<-(dj/(nj*(nj-dj)))
std.er <- k_st*sapply(1:length(x1), function(n) sqrt((sum(x1[1:n]))))
lower <- k_st-1.96*std.er
upper <- k_st+1.96*std.er
data.b1 <- data.frame(time, nj, k_st, std.er, lower, upper)
data.b1
Output:
Table: 95% CI for Kaplan-Meier estimate of the survival function using Greenwood standard error
from IUD data
     Time   𝑛𝑗   𝑆̂(𝑡)   SE   CI lower bound   CI upper bound
1 0 18 1.0000000 0.00000000 1.00000000 1.0000000
2 10 18 0.9444444 0.05399030 0.83862347 1.0502654
3 19 15 0.8814815 0.07898919 0.72666266 1.0363003
4 30 13 0.8136752 0.09777700 0.62203229 1.0053181
5 36 12 0.7458689 0.11067019 0.52895536 0.9627825
6 59 8 0.6526353 0.13031975 0.39720862 0.9080620
7 75 7 0.5594017 0.14116728 0.28271384 0.8360896
8 93 6 0.4661681 0.14519912 0.18157782 0.7507584
9 97 5 0.3729345 0.14299297 0.09266826 0.6532007
10 107 3 0.2486230 0.13924720 -0.02430152 0.5215475
R-code:
se <- (k_st*(sqrt(1-k_st)))/sqrt(nj)
lower <- k_st-1.96*se
upper <- k_st+1.96*se
data.b2 <- data.frame(time, nj, k_st, se, lower, upper)
data.b2
Output:
Table: 95% CI for Kaplan-Meier estimate of the survival function using Peto et al standard error
from IUD data
Time   𝑛𝑗   𝑆̂(𝑡)   SE   CI lower bound   CI upper bound
95% CI for 𝑺(𝒕) using plain method, log method, and log-log method:
R-code:
tj <- c(10,13,18,19,23,30,36,38,54,56,59,75,93,97,104,107,107,107)
event_status <- c(1,0,0,1,0,1,1,0,0,0,1,1,1,1,0,1,0,0)
library(survival)
KM_plain <- survfit(Surv(time=tj,event=event_status)~1,conf.type="plain")
summary(KM_plain)
KM_log <- survfit(Surv(time=tj,event=event_status)~1,conf.type="log")
summary(KM_log)
KM_loglog <- survfit(Surv(time=tj,event=event_status)~1,conf.type="log-log")
summary(KM_loglog)
Output:
Table: 95% CI for Kaplan-Meier estimate of the survival function using plain method from IUD data
Time   𝑛𝑗   𝑑𝑗   𝑆̂(𝑡)   SE   CI lower bound   CI upper bound
10 18 1 0.944 0.0540 0.8386 1.000
19 15 1 0.881 0.0790 0.7267 1.000
30 13 1 0.814 0.0978 0.6220 1.000
36 12 1 0.746 0.1107 0.5290 0.963
59 8 1 0.653 0.1303 0.3972 0.908
75 7 1 0.559 0.1412 0.2827 0.836
93 6 1 0.466 0.1452 0.1816 0.751
97 5 1 0.373 0.1430 0.0927 0.653
107 3 1 0.249 0.1392 0.0000 0.522
Table: 95% CI for Kaplan-Meier estimate of the survival function using log method from IUD data
Time   𝑛𝑗   𝑑𝑗   𝑆̂(𝑡)   SE   CI lower bound   CI upper bound
10 18 1 0.944 0.0540 0.8443 1.000
19 15 1 0.881 0.0790 0.7395 1.000
30 13 1 0.814 0.0978 0.6429 1.000
36 12 1 0.746 0.1107 0.5577 0.998
59 8 1 0.653 0.1303 0.4413 0.965
75 7 1 0.559 0.1412 0.3411 0.917
93 6 1 0.466 0.1452 0.2532 0.858
97 5 1 0.373 0.1430 0.1759 0.791
107 3 1 0.249 0.1392 0.0829 0.745
Table: 95% CI for Kaplan-Meier estimate of the survival function using log-log method from IUD
data
Time   𝑛𝑗   𝑑𝑗   𝑆̂(𝑡)   SE   CI lower bound   CI upper bound
10 18 1 0.944 0.0540 0.6664 0.992
19 15 1 0.881 0.0790 0.6019 0.969
30 13 1 0.814 0.0978 0.5241 0.936
36 12 1 0.746 0.1107 0.4536 0.897
59 8 1 0.653 0.1303 0.3438 0.843
75 7 1 0.559 0.1412 0.2564 0.780
93 6 1 0.466 0.1452 0.1830 0.710
97 5 1 0.373 0.1430 0.1209 0.631
107 3 1 0.249 0.1392 0.0468 0.531
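The three interval types in the tables above can be reproduced by hand from the Kaplan-Meier estimate and its Greenwood standard error. A minimal sketch in base R, assuming the usual definitions: the plain interval is $\hat{S} \pm z\,SE$, the log interval is $\hat{S}\exp(\pm z\,SE/\hat{S})$, and the log-log interval is $\hat{S}^{\exp(\pm z\,SE/(\hat{S}|\log\hat{S}|))}$, with bounds truncated to [0, 1] where needed. The variable names are illustrative.

```r
# IUD data: distinct event times, numbers at risk, and events
time <- c(10, 19, 30, 36, 59, 75, 93, 97, 107)
nj <- c(18, 15, 13, 12, 8, 7, 6, 5, 3)
dj <- rep(1, 9)

S <- cumprod((nj - dj) / nj)                    # Kaplan-Meier estimate
se <- S * sqrt(cumsum(dj / (nj * (nj - dj))))   # Greenwood standard error
z <- qnorm(0.975)

# Plain: symmetric on the S scale, truncated to [0, 1]
plain_lower <- pmax(S - z * se, 0)
plain_upper <- pmin(S + z * se, 1)

# Log: symmetric on the log(S) scale, where SE(log S) = se / S
log_lower <- S * exp(-z * se / S)
log_upper <- pmin(S * exp(z * se / S), 1)

# Log-log: symmetric on the log(-log S) scale, where SE = se / (S * |log S|)
se_loglog <- se / (S * abs(log(S)))
loglog_lower <- S^exp(z * se_loglog)
loglog_upper <- S^exp(-z * se_loglog)

round(data.frame(time, S, se, plain_lower, plain_upper,
                 log_lower, log_upper, loglog_lower, loglog_upper), 4)
```

Only the log-log interval stays inside (0, 1) without truncation, which is why its bounds at time 10 look so different from the other two.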
c. Kaplan-Meier type hazard function and its standard error
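The K-M type hazard function can be estimated as
$$\hat{h}(t) = \frac{d_j}{n_j \tau_j},$$
where $\tau_j$ is the length of the interval from the $j$th event time to the next one, with standard error
$$SE\{\hat{h}(t)\} = \hat{h}(t)\left\{\frac{n_j - d_j}{n_j d_j}\right\}^{1/2}.$$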
R-code:
tao_j <- sapply(1:length(time), function(n) time[n+1]-time[n])
k_ht <- dj/(nj*tao_j)
h_se <- k_ht*sqrt((nj-dj)/(nj*dj))
data.c <- data.frame(time, tao_j, nj, dj, k_ht, h_se)
data.c
Output:
Table: Kaplan-Meier type estimate of the hazard function with standard error (SE) from IUD data
Time 𝜏𝑗 𝑛𝑗 𝑑𝑗 ℎ̂(𝑡) SE
1 0 10 18 0 0.000000000 NaN
2 10 9 18 1 0.006172840 0.005998922
3 19 11 15 1 0.006060606 0.005855102
4 30 6 13 1 0.012820513 0.012317550
5 36 23 12 1 0.003623188 0.003468939
6 59 16 8 1 0.007812500 0.007307925
7 75 18 7 1 0.007936508 0.007347779
8 93 4 6 1 0.041666667 0.038036289
9 97 10 5 1 0.020000000 0.017888544
10 107 NA 3 1 NA NA
Practical 5:
Survival times of women with tumors that were negatively stained or positively stained with HPA
are as follows:
Negative staining: 23, 47, 69, 70+, 71+, 100+, 101+, 148, 181, 198+, 208+, 212+, 224+
Positive staining: 5, 8, 10, 13, 18, 24, 26, 26, 31, 35, 40, 41, 48, 50, 59, 61, 68, 71, 76+, 105+,
107+, 109+, 113, 116+, 118, 143, 154+, 162+, 188+, 212+, 217+, 225+
Analyze whether or not there is a difference in the survival experience of the two groups of women.
Solution:
library(survival)
time=c(23,47,69,70,71,100,101,148,181,198,208,212,224,5,8,10,13,18,24,26,26,31,35,40,41,48,
50,59,61,68,71,76,105,107,109,113,116,118,143,154,162,188,212,217,225)
event <- c(rep(1,3),rep(0,4),rep(1,2), rep(0,4), rep(1,18), rep(0,4),1,0,1,1,rep(0,6))
group <- c(rep(1,13), rep(2,32))
data <- data.frame(group, time, event)
KME <- survfit(Surv(time,event)~group)
summary(KME)
plot(KME,xlab="Survival time",ylab="Estimated survival probability",col=c("black","red"),lty=1)
legend("bottomleft", legend=c("Negative staining", "Positive staining"),
col=c("black","red"),lty=c(1,1))
log_rank_test <- survdiff(Surv(time,event)~group)
log_rank_test
# Additional check: rho = 1 gives the Peto-Peto modification of the Gehan-Wilcoxon test
wilcoxon_test <- survdiff(Surv(time,event)~group,data=data, rho=1)
wilcoxon_test
Output:
The above figure shows that the estimated survival probabilities for women with negatively
stained tumors are always greater than those for women with positively stained tumors. This
difference can be tested in two ways: with the log-rank test or with the Wilcoxon test. If the two
estimated survivor functions do not cross each other, the assumption of proportional hazards may
be justified and the log-rank test is appropriate; otherwise the Wilcoxon test is preferred. In the
above plot the survival curves do not cross, which suggests using the log-rank test for these data.
The p-value obtained from the log-rank test is 0.06, so the difference is not significant at the
conventional 5% level but would be at the 6% level. We therefore conclude that the data provide
some evidence, though not strong evidence, that the survival experience of the two groups of
women is different.
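The log-rank statistic can also be computed by hand, which shows what the test compares. A minimal sketch in base R, assuming the standard construction: at each distinct death time, the observed deaths in the negative-staining group are compared with the number expected under equal hazards, and the variance comes from the hypergeometric distribution. The variable names `O1`, `E1` and `V` are illustrative.

```r
# HPA staining data (group 1 = negative staining, group 2 = positive staining)
time <- c(23, 47, 69, 70, 71, 100, 101, 148, 181, 198, 208, 212, 224,
          5, 8, 10, 13, 18, 24, 26, 26, 31, 35, 40, 41, 48, 50, 59, 61, 68, 71,
          76, 105, 107, 109, 113, 116, 118, 143, 154, 162, 188, 212, 217, 225)
event <- c(rep(1, 3), rep(0, 4), rep(1, 2), rep(0, 4),
           rep(1, 18), rep(0, 4), 1, 0, 1, 1, rep(0, 6))
group <- c(rep(1, 13), rep(2, 32))

O1 <- 0  # observed deaths in group 1
E1 <- 0  # expected deaths in group 1 under H0
V <- 0   # variance of O1 - E1 under H0

for (tj in sort(unique(time[event == 1]))) {
  at_risk <- time >= tj
  n <- sum(at_risk)                                # at risk in both groups
  n1 <- sum(at_risk & group == 1)                  # at risk in group 1
  d <- sum(time == tj & event == 1)                # deaths at tj in both groups
  d1 <- sum(time == tj & event == 1 & group == 1)  # deaths at tj in group 1
  O1 <- O1 + d1
  E1 <- E1 + d * n1 / n
  if (n > 1) V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
}

chisq <- (O1 - E1)^2 / V
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(observed = O1, expected = E1, chisq = chisq, p_value = p_value)
```

Fewer observed than expected deaths in the negative-staining group corresponds to its survival curve lying above that of the positive-staining group.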
The End