Chapter 2 Empirical Models Censored Data

The document discusses empirical models for analyzing censored data, particularly in the context of survival analysis. It explains right-censoring, provides real-world examples across various fields, and introduces specialized methods like the Kaplan-Meier and Nelson-Aalen estimators for handling such data. The document also includes practical examples and R code for estimating survival functions using the Kaplan-Meier method.

Uploaded by

Rowa Alharami
Empirical Models (Censored Data)

Dhafer Malouche
Outline

Introduction

Kaplan-Meier Estimator

The Nelson-Aalen Estimator


Introduction
What is Censored Data?

Right-censoring occurs when we lose track of subjects before observing their event time. We only know that their true event time exceeds some duration.

▶ Common in survival analysis and duration studies


▶ Arises when study period ends before all events occur
▶ Occurs when subjects drop out of longitudinal studies

[Figure: timelines for Subject A (censored) and Subject B (observed event), with the censoring event marked; horizontal axis: study duration in years, 0 to 8.]
Real-World Examples

Medical Research:
▶ Cancer survival studies
▶ Vaccine efficacy trials
▶ Chronic disease progression

Engineering:
▶ Product reliability testing
▶ Component failure analysis
▶ Maintenance scheduling

Insurance:
▶ Policy lapse studies
▶ Claim development patterns
▶ Policyholder longevity

Social Sciences:
▶ Unemployment duration
▶ Marriage duration studies
▶ Recidivism analysis

Analytical Challenges

Standard ECDF Fails:
▶ Treats censored observations as complete
▶ Underestimates survival probabilities
▶ Biases parameter estimates

Specialized Methods Needed:
▶ Kaplan-Meier estimator
▶ Nelson-Aalen estimator
▶ Cox proportional hazards models

[Figure: survival probability against time, comparing the true survival curve with the naive estimate.]

Definition

Definition: An observation is censored from above (also called right censored) at u if, when it is at or above u, it is recorded as being equal to u, but when it is below u, it is recorded at its observed value.

Example: Lightbulb Lifespan Study

Imagine a study that tracks the lifespan of a particular type of lightbulb. The study is designed to last for 5,000 hours. In this scenario, the value u (the censoring point) is 5,000 hours. Here’s how the data might look:

Lightbulb   Observed Lifespan (hours)   Censored
A           4,800                       No
B           5,000                       Yes
C           3,500                       No
D           5,000                       No
E           5,000                       Yes
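
A minimal sketch in R of how such records could arise. The `true_life` values are hypothetical and invented for illustration only; they are not data from the study. Each bulb is observed until it fails or until u = 5,000 hours, whichever comes first. Bulb D is assumed to fail exactly at hour 5,000, which is one possible reading of its "No" entry in the table.

```r
# Hypothetical true lifetimes in hours (illustrative values, not study data)
true_life <- c(A = 4800, B = 6200, C = 3500, D = 5000, E = 7100)
u <- 5000                        # censoring point: planned study length

observed <- pmin(true_life, u)   # recorded lifespan is capped at u
censored <- true_life > u        # TRUE when the bulb outlived the study

bulbs <- data.frame(bulb = names(true_life),
                    observed = observed,
                    censored = censored)
print(bulbs)
```

For a censored bulb the recorded value is only a lower bound on its lifetime, which is why the `censored` flag has to be stored next to `observed`.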

Example of Right-Censored Data in Insurance
Consider an insurance company tracking claims related to a specific type of
insurance policy. The data collection period ends on a certain date, and any claims
still open (not fully settled) at that time are right-censored. This means that the
final claim amount is not yet known, but it will be at least as much as the amount
already recorded.

Claim ID   Start Date    Amount Paid as of Cutoff Date   Final Claim Amount (Censored)
001        01-Jan-2023   $5,000                          > $5,000 (Open)
002        15-Feb-2023   $8,000                          > $8,000 (Open)
003        10-Mar-2023   $3,500                          $3,500 (Closed)
004        20-Apr-2023   $7,250                          > $7,250 (Open)
005        05-May-2023   $4,000                          $4,000 (Closed)

▶ Claims 001, 002, and 004 are right-censored. As of the cutoff date, these
claims are still open.
▶ Claims 003 and 005 are not censored; these claims have been closed with final
amounts paid.
Right-Censored Data in a Mortality Study
Assume a mortality study begins on January 1, 2020, and ends on December 31,
2025. The study tracks a cohort of individuals to observe their age at death.
However, by the end of the study, not all individuals in the cohort have died. For
those still alive, their data is right-censored.

ID    Birth Date    Status (2025)   Age at Study End   Age at Death
001   01-Jan-1950   Alive           75 years           > 75 years (Alive)
002   15-Feb-1955   Deceased        -                  68 years (Died in 2023)
003   10-Mar-1960   Alive           65 years           > 65 years (Alive)
004   20-Apr-1948   Exited          -                  Unknown (Exited)
005   05-May-1958   Deceased        -                  64 years (Died in 2022)

▶ Right censoring occurs for participants who are still alive at study end or who
exit the study early.
▶ This affects the accuracy of mortality estimates and must be accounted for in
actuarial analyses.

Example: Pension Liabilities
▶ Pension funds often monitor the time until retirement or death of plan
participants. These times can be partially observed (censored) if participants
exit the plan, switch employers, or the observation period ends before an event
occurs.
▶ Context: Employers set aside funds based on expected payout timelines.
▶ Right-Censoring: If a participant remains alive or employed at the end of the
observation window, their “event time” (e.g., retirement date) is not fully
known.
▶ Implication: Actuaries adjust liability and funding estimates to account for
censored individuals.

Participant   Hire Date   Years Observed   Status (Censored?)
P01           2005        18               Retired (No)
P02           2008        15               Still Active (Yes)
P03           2010        13               Still Active (Yes)
P04           1998        25               Retired (No)
P05           2012        11               Exited Plan (Yes)

Kaplan-Meier Estimator
Estimating the Survival function

▶ Example (n = 20 observations):

1 2 3∗ 4 4 4∗ 4∗ 5 7∗ 8 8 8 9 9 9 9 10∗ 12 12 15∗

∗ denotes a right-censored observation
▶ Calculate the following statistics, where the $y_i$ are the distinct uncensored values and $s_i$ is the number of events observed at $y_i$:

i     y_i   s_i   b_i   r_i
0     0     -     -     -
1     1     1     0     20
2     2     1     1     20 − 1 − 0 = 19
3     4     2     2     19 − 1 − 1 = 17
4     5     1     1     17 − 2 − 2 = 13
5     8     3     0     13 − 1 − 1 = 11
6     9     4     1     11 − 3 − 0 = 8
7     12    2     1     8 − 4 − 1 = 3
max   15    -     -     3 − 2 − 1 = 0

▶ $b_i$ is the number of censored observations in $[y_i, y_{i+1})$
▶ $r_1 = n$, and for $i = 2, 3, \ldots, k + 1$: $r_i = r_{i-1} - s_{i-1} - b_{i-1}$
▶ The $r_i$ are called the numbers at risk
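
The counts in this table can be reproduced with a few lines of base R. This is a minimal sketch, assuming the censoring indicator is coded 1 = event and 0 = censored; the names `event_times`, `upper`, and `risk_table` are illustrative choices and do not come from the slides. The last censoring interval is taken as everything at or after the final event time, so that the censored observation at 15 is counted.

```r
y    <- c(1, 2, 3, 4, 4, 4, 4, 5, 7, 8, 8, 8, 9, 9, 9, 9, 10, 12, 12, 15)
cens <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0)

# Distinct uncensored values y_i and number of events s_i at each of them
event_times <- sort(unique(y[cens == 1]))
s <- sapply(event_times, function(t) sum(y == t & cens == 1))

# b_i: censored observations falling in [y_i, y_{i+1})
upper <- c(event_times[-1], Inf)
b <- mapply(function(lo, hi) sum(y >= lo & y < hi & cens == 0),
            event_times, upper)

# r_i from the recursion r_1 = n, r_i = r_{i-1} - s_{i-1} - b_{i-1}
r <- numeric(length(event_times))
r[1] <- length(y)
for (i in seq_along(r)[-1]) r[i] <- r[i - 1] - s[i - 1] - b[i - 1]

risk_table <- data.frame(i = seq_along(event_times),
                         y = event_times, s = s, b = b, r = r)
print(risk_table)
```

As a cross-check, each `r[i]` should also equal the number of observations with `y >= event_times[i]`, since nothing is censored before the first event time in this sample.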
Estimating the Survival function

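▶ $F_n$ is estimated through the survival function:

$$S_n(y) = 1 - F_n(y)$$

▶ where the Kaplan-Meier estimate of $S(y)$ is

$$
S_n(y) =
\begin{cases}
1, & y < y_1, \\[4pt]
\displaystyle\prod_{i=1}^{j} \bigl(1 - \hat\lambda_i\bigr) = \prod_{i=1}^{j} \Bigl(1 - \frac{s_i}{r_i}\Bigr), & y_j \le y < y_{j+1},\quad j = 1, 2, \ldots, k-1, \\[4pt]
\displaystyle\prod_{i=1}^{k} \bigl(1 - \hat\lambda_i\bigr) = \prod_{i=1}^{k} \Bigl(1 - \frac{s_i}{r_i}\Bigr), & y_k \le y < y_{\max}.
\end{cases}
$$
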
Example

i   y_i   s_i   r_i   Ŝ_n(y_i)
1   1     1     20    1 − 1/20 = 0.950
2   2     1     19    0.950(1 − 1/19) = 0.900
3   4     2     17    0.900(1 − 2/17) = 0.794
4   5     1     13    0.794(1 − 1/13) = 0.733
5   8     3     11    0.733(1 − 3/11) = 0.533
6   9     4     8     0.533(1 − 4/8) = 0.267
7   12    2     3     0.267(1 − 2/3) = 0.089

$$
S_{20}(y) =
\begin{cases}
1, & y < 1, \\
0.950, & 1 \le y < 2, \\
0.900, & 2 \le y < 4, \\
0.794, & 4 \le y < 5, \\
0.733, & 5 \le y < 8, \\
0.533, & 8 \le y < 9, \\
0.267, & 9 \le y < 12, \\
0.089, & 12 \le y < 15.
\end{cases}
$$
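
The product-limit computation can be sketched in base R without any package. This is a minimal illustration under the same coding assumption as the R slides that follow (1 = event, 0 = censored); the object names `S_km` and `km_step` are illustrative and not from the slides.

```r
y    <- c(1, 2, 3, 4, 4, 4, 4, 5, 7, 8, 8, 8, 9, 9, 9, 9, 10, 12, 12, 15)
cens <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0)

event_times <- sort(unique(y[cens == 1]))                      # y_i
s <- sapply(event_times, function(t) sum(y == t & cens == 1))  # events at y_i
r <- sapply(event_times, function(t) sum(y >= t))              # at risk just before y_i

# Kaplan-Meier estimate: running product of (1 - s_i / r_i)
S_km <- cumprod(1 - s / r)
print(data.frame(time = event_times, n.risk = r, n.event = s,
                 survival = round(S_km, 3)))

# Right-continuous step function: equals 1 before the first event time
km_step <- stepfun(event_times, c(1, S_km))
km_step(c(0.5, 4.5, 11))
```

`km_step` evaluates the estimate at arbitrary times, for example at 4.5 it returns the value attained at the event time 4.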

Using R

> # Loading the required libraries


> library(survival)
> library(KMsurv)
> # Defining the data
> y <- c(1, 2, 3, 4, 4, 4, 4, 5, 7, 8, 8, 8, 9, 9, 9, 9, 10, 12, 12, 15)
> length(y)
[1] 20
> # Defining the censoring indicator (1 = event observed, 0 = right-censored)
> cens <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0)

Using R

> ## Kaplan-Meier estimation of the survival function
> km_fit <- survfit(Surv(y, cens) ~ 1)
> summary(km_fit)
Call: survfit(formula = Surv(y, cens) ~ 1)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     20       1   0.9500  0.0487       0.8591        1.000
    2     19       1   0.9000  0.0671       0.7777        1.000
    4     17       2   0.7941  0.0919       0.6329        0.996
    5     13       1   0.7330  0.1032       0.5563        0.966
    8     11       3   0.5331  0.1238       0.3382        0.840
    9      8       4   0.2666  0.1127       0.1163        0.611
   12      3       2   0.0889  0.0817       0.0147        0.539

where n.risk are the $r_i$, n.event are the $s_i$, and survival are the $\hat S_n(y_i)$.

How Is the KM Output Computed?

▶ n.risk (ri ): The number of individuals still at risk just before time
yi .
▶ n.event (si ): The number of events (failures/deaths) observed
exactly at yi .

▶ survival Ŝn (yi ) : Kaplan-Meier estimate of survival at time yi .
▶ std.err: Standard error of the survival estimate, often computed
via Greenwood’s formula.
▶ 95% CI: Confidence interval for the survival estimate.

Formulas in Practice:

▶ $\hat S_n(y_i) = \hat S_n(y_{i-1}) \times \left(1 - \dfrac{s_i}{r_i}\right)$. For the first event time $y_1$, we set $\hat S_n(y_0) = 1$.
▶ $\mathrm{std.err}\bigl(\hat S_n(y_i)\bigr) \approx \hat S_n(y_i) \sqrt{\displaystyle\sum_{j=1}^{i} \frac{s_j}{r_j (r_j - s_j)}}$ (Greenwood’s formula)
▶ n.risk ($r_i$) updates as $r_i = r_{i-1} - s_{i-1} - b_{i-1}$, where $b_{i-1}$ is the number censored between times $y_{i-1}$ and $y_i$.
▶ Confidence intervals often use the log(−log) transformation:

$$\ln\bigl(-\ln \hat S_n(y_i)\bigr) \pm z_{\alpha/2}\, \mathrm{StdErr}\bigl[\ln\bigl(-\ln \hat S_n(y_i)\bigr)\bigr]$$

Example at Times = 1, 2, and 3

In the R output, n.risk, n.event, survival, and std.err are derived using these steps:
▶ Kaplan-Meier survival: $\hat S(y_i) = \hat S(y_{i-1}) \times \left(1 - \dfrac{s_i}{r_i}\right)$, where $\hat S(y_0) := 1$.
▶ Greenwood’s formula for the variance:

$$\widehat{\mathrm{Var}}\bigl(\hat S(y_i)\bigr) = \hat S(y_i)^2 \sum_{j=1}^{i} \frac{s_j}{r_j (r_j - s_j)}$$

▶ Standard error: $\mathrm{se}\bigl(\hat S(y_i)\bigr) = \sqrt{\widehat{\mathrm{Var}}\bigl(\hat S(y_i)\bigr)}$.
▶ If no event occurs at some time t, the survival estimate and its variance remain the same as at the last event time.

Example at Times = 1, 2, and 3
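▶ Time = 1: $r_1 = 20$, $s_1 = 1$

$$\hat S(1) = 1 \times \left(1 - \frac{1}{20}\right) = 0.95.$$

$$\widehat{\mathrm{Var}}\bigl(\hat S(1)\bigr) = (0.95)^2 \times \frac{1}{20 \cdot 19} = 0.00238 \;\Rightarrow\; \mathrm{se}\bigl(\hat S(1)\bigr) = \sqrt{0.00238} \approx 0.0487.$$

▶ Time = 2: $r_2 = 19$, $s_2 = 1$

$$\hat S(2) = \hat S(1) \times \left(1 - \frac{1}{19}\right) = 0.95 \times \frac{18}{19} = 0.90.$$

For the variance, we add the second term to Greenwood’s sum:

$$\widehat{\mathrm{Var}}\bigl(\hat S(2)\bigr) = \bigl(\hat S(2)\bigr)^2 \left[\frac{1}{20 \cdot 19} + \frac{1}{19 \cdot 18}\right].$$

Numerically, this leads to $\mathrm{se}\bigl(\hat S(2)\bigr) \approx 0.0671$.

▶ Time = 3: No new event at $y = 3$ (only a censored observation), so

$$\hat S(3) = \hat S(2) = 0.90 \quad\text{and}\quad \mathrm{se}\bigl(\hat S(3)\bigr) = \mathrm{se}\bigl(\hat S(2)\bigr) \approx 0.0671.$$
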
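
Greenwood’s formula can be checked by hand in base R. The sketch below takes the $y_i$, $s_i$, and $r_i$ columns of the Kaplan-Meier example as given; the object names `greenwood` and `se_km` are illustrative and not from the slides.

```r
# Event times, event counts and numbers at risk from the Kaplan-Meier example
event_times <- c(1, 2, 4, 5, 8, 9, 12)
s <- c(1, 1, 2, 1, 3, 4, 2)
r <- c(20, 19, 17, 13, 11, 8, 3)

S_km      <- cumprod(1 - s / r)            # Kaplan-Meier estimate
greenwood <- cumsum(s / (r * (r - s)))     # running Greenwood sum
se_km     <- S_km * sqrt(greenwood)        # standard error of S_km

print(data.frame(time = event_times,
                 survival = round(S_km, 4),
                 std.err  = round(se_km, 4)))
```

Rounded to four decimals, `se_km` should agree with the std.err column of the `summary(km_fit)` output.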
The Nelson-Aalen Estimator

An alternative to the Kaplan-Meier estimator, called the Nelson-Aalen estimator, is sometimes used.
▶ To motivate the estimator, note that if $S(y)$ is the survival function of a continuous distribution with failure rate $h(y)$, then $-\ln S(y) = H(y) = \displaystyle\int_0^y h(t)\,dt$ is called the cumulative hazard rate function.
▶ The discrete analog in the present context is given by $\displaystyle\sum_{i \mid y_i \le y} \lambda_i$, which can intuitively be estimated by replacing $\lambda_i$ with its estimate $\hat\lambda_i = \dfrac{s_i}{r_i}$.
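
The identity $-\ln S(y) = H(y)$ can be verified in a few lines. The derivation below is a sketch that assumes $S$ is differentiable with density $f = -S'$ and $S(0) = 1$, which are standard assumptions not spelled out on the slide. The last line shows why the discrete sum is a natural analog when the $\lambda_i$ are small.

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\,\ln S(t), \\
H(y) &= \int_0^y h(t)\,dt = -\bigl[\ln S(y) - \ln S(0)\bigr] = -\ln S(y), \\
S(y) &= \exp\bigl(-H(y)\bigr), \\
-\ln \prod_{i \mid y_i \le y} (1 - \lambda_i) &= -\sum_{i \mid y_i \le y} \ln(1 - \lambda_i) \;\approx\; \sum_{i \mid y_i \le y} \lambda_i
\quad \text{(using } \ln(1 - x) \approx -x \text{ for small } x\text{)}.
\end{aligned}
```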
What is the Hazard Function? (Intuitive Explanation)
Definition (Simple Terms):
▶ The hazard function at time t tells you the rate at which events happen
exactly at time t, given the subject has survived (not had the event) up to t.
▶ Another name for the hazard is instantaneous risk of the event.

Easy Example: Lightbulbs


▶ Suppose you have many identical lightbulbs.
▶ You test how long they last (their lifetimes).
▶ The hazard function at time t is the chance a bulb will burn out at that exact
moment (t hours), knowing it has worked fine until hour t.
Why It Matters:
▶ It helps us see if the chance of failure (or event) increases, decreases, or stays
the same as time goes on.
▶ If the hazard is higher at later times, it means bulbs are more likely to fail
when they get older.

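Nelson-Aalen Estimator

The Nelson-Aalen estimator of $H(y)$ is defined for $y < y_{\max}$ as follows:

$$
\hat H(y) =
\begin{cases}
0, & y < y_1, \\[4pt]
\displaystyle\sum_{i=1}^{j} \hat\lambda_i = \sum_{i=1}^{j} \frac{s_i}{r_i}, & y_j \le y < y_{j+1},\quad j = 1, 2, \ldots, k-1, \\[4pt]
\displaystyle\sum_{i=1}^{k} \hat\lambda_i = \sum_{i=1}^{k} \frac{s_i}{r_i}, & y_k \le y < y_{\max}.
\end{cases}
$$

▶ Here, $\hat H(y) = \displaystyle\sum_{i \mid y_i \le y} \hat\lambda_i$ for $y < y_{\max}$.
▶ The Nelson-Aalen estimator of the survival function is $\hat S(y) = \exp\bigl(-\hat H(y)\bigr)$.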
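Variance Formulas and Confidence Intervals (Nelson-Aalen)
Variance of the Cumulative Hazard:

$$\widehat{\mathrm{Var}}\bigl(\hat H(t)\bigr) = \sum_{i:\, y_i \le t} \frac{s_i}{r_i^2}.$$

Survival Function:

$$\hat S(t) = \exp\bigl(-\hat H(t)\bigr).$$

Variance of the Survival Estimate (Delta Method):

$$\widehat{\mathrm{Var}}\bigl(\hat S(t)\bigr) \approx \hat S(t)^2 \times \widehat{\mathrm{Var}}\bigl(\hat H(t)\bigr) = \hat S(t)^2 \sum_{i:\, y_i \le t} \frac{s_i}{r_i^2}.$$

Confidence Interval:
▶ Plain (Linear) CI: $\hat S(t) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}\bigl(\hat S(t)\bigr)}$
▶ Log(-log) CI: built on the scale of $\ln\bigl(-\ln \hat S(t)\bigr) = \ln \hat H(t)$, whose delta-method standard error is $\sqrt{\widehat{\mathrm{Var}}(\hat H(t))} \,/\, \hat H(t)$:

$$\ln\bigl(-\ln \hat S(t)\bigr) \pm z_{\alpha/2}\, \frac{\sqrt{\widehat{\mathrm{Var}}\bigl(\hat H(t)\bigr)}}{\hat H(t)}.$$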
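
Both interval types can be sketched in base R for the running example. This is an illustration under stated assumptions: the standard error of $\ln \hat H(t)$ is taken from the delta method as $\sqrt{\widehat{\mathrm{Var}}(\hat H)}/\hat H$, the plain interval is truncated to $[0, 1]$ (a choice made here, not stated on the slide), and all object names are illustrative.

```r
# Event counts and numbers at risk from the running example
event_times <- c(1, 2, 4, 5, 8, 9, 12)
s <- c(1, 1, 2, 1, 3, 4, 2)
r <- c(20, 19, 17, 13, 11, 8, 3)

H     <- cumsum(s / r)        # Nelson-Aalen cumulative hazard
var_H <- cumsum(s / r^2)      # its estimated variance
S     <- exp(-H)              # implied survival estimate
z     <- qnorm(0.975)         # two-sided 95% normal quantile

# Plain (linear) interval on the survival scale, truncated to [0, 1]
se_S     <- S * sqrt(var_H)
plain_lo <- pmax(S - z * se_S, 0)
plain_hi <- pmin(S + z * se_S, 1)

# log(-log) interval: interval for ln(H), then mapped back through exp(-exp(.))
se_lnH    <- sqrt(var_H) / H
loglog_lo <- exp(-exp(log(H) + z * se_lnH))
loglog_hi <- exp(-exp(log(H) - z * se_lnH))

print(round(data.frame(time = event_times, S, plain_lo, plain_hi,
                       loglog_lo, loglog_hi), 3))
```

The log(-log) limits always stay inside (0, 1) without truncation, which is the usual reason for preferring a transformed scale.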

Why Use the Nelson-Aalen Method?

Alternative Approach to KM:


▶ Kaplan-Meier (KM) directly estimates the survival function by multiplying
stepwise survival probabilities.
▶ Nelson-Aalen (NA) focuses on estimating the cumulative hazard function, then converts it to a survival function via $\hat S(t) = e^{-\hat H(t)}$.
Reasons to Consider Nelson-Aalen:
▶ Easier Hazard Interpretation: NA’s core is $\hat H(t) = \displaystyle\sum \frac{\text{events}}{\text{at risk}}$, providing a more direct handle on the hazard and its variance.
▶ Robustness in Small Samples: In smaller datasets, the NA estimator can
sometimes be more stable and less biased, especially for the hazard function.
▶ Different Variance Structure: NA’s variance derivation differs from KM,
which can matter if you focus on hazard-based inference.

Differences between KM and Nelson-Aalen

▶ Core Quantity:
▶ KM: Estimates Ŝ(t) directly.
▶ NA: Estimates $\hat H(t)$, then $\hat S(t) = e^{-\hat H(t)}$.
▶ Interpretation:
▶ KM: Step function for survival probabilities.
▶ NA: Step function for cumulative hazard; survival is a smooth
transform.
▶ Asymptotic Equivalence:
▶ Both converge to the true survival function with large samples.
▶ Practical Usage:
▶ KM is often reported for survival curves.
▶ NA is used to analyze or model the hazard function more directly.
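
The two estimators can be compared numerically on the running example. The sketch below assumes the event counts and numbers at risk from the worked example in this chapter; the object names are illustrative. Because $e^{-x} \ge 1 - x$ for every $x$, each Nelson-Aalen factor $e^{-s_i/r_i}$ is at least as large as the Kaplan-Meier factor $1 - s_i/r_i$, so the NA survival curve can never fall below the KM curve.

```r
# Event counts and numbers at risk from the running example
event_times <- c(1, 2, 4, 5, 8, 9, 12)
s <- c(1, 1, 2, 1, 3, 4, 2)
r <- c(20, 19, 17, 13, 11, 8, 3)

S_km <- cumprod(1 - s / r)       # Kaplan-Meier: product of (1 - s_i / r_i)
S_na <- exp(-cumsum(s / r))      # Nelson-Aalen: exp(-sum of s_i / r_i)

comparison <- data.frame(time = event_times,
                         KM   = round(S_km, 3),
                         NA_surv = round(S_na, 3),
                         gap  = round(S_na - S_km, 3))
print(comparison)
```

The gap is small while the numbers at risk are large and widens in the tail, where each $s_i / r_i$ is no longer small.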

Estimating the Survival

▶ Example:

1 2 3∗ 4 4 4∗ 4∗ 5 7∗ 8 8 8 9 9 9 9 10∗ 12 12 15∗

∗ denotes a right-censored observation
▶ Nelson-Aalen estimator:

i   y_i   s_i   r_i   Ĥ_n(y_i)
1   1     1     20    1/20 = 0.050
2   2     1     19    0.050 + 1/19 = 0.103
3   4     2     17    0.103 + 2/17 = 0.220
4   5     1     13    0.220 + 1/13 = 0.297
5   8     3     11    0.297 + 3/11 = 0.570
6   9     4     8     0.570 + 4/8 = 1.070
7   12    2     3     1.070 + 2/3 = 1.737

$$
\hat S_{20}(y) =
\begin{cases}
1, & y < 1, \\
e^{-0.050} = 0.951, & 1 \le y < 2, \\
e^{-0.103} = 0.902, & 2 \le y < 4, \\
e^{-0.220} = 0.802, & 4 \le y < 5, \\
e^{-0.297} = 0.743, & 5 \le y < 8, \\
e^{-0.570} = 0.566, & 8 \le y < 9, \\
e^{-1.070} = 0.343, & 9 \le y < 12, \\
e^{-1.737} = 0.176, & 12 \le y < 15.
\end{cases}
$$

Survival values are computed from the unrounded $\hat H_n(y_i)$; the exponents are shown to three decimals.

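
The Nelson-Aalen table can be reproduced in base R with a cumulative sum. This is a minimal sketch that takes the $s_i$ and $r_i$ columns as given; the names `lambda_hat`, `H_na`, and `S_na` are illustrative and not from the slides.

```r
# Event counts and numbers at risk from the table
event_times <- c(1, 2, 4, 5, 8, 9, 12)
s <- c(1, 1, 2, 1, 3, 4, 2)
r <- c(20, 19, 17, 13, 11, 8, 3)

lambda_hat <- s / r              # estimated hazard increment at each y_i
H_na <- cumsum(lambda_hat)       # Nelson-Aalen cumulative hazard
S_na <- exp(-H_na)               # implied survival estimate

print(data.frame(time = event_times,
                 H = round(H_na, 3),
                 survival = round(S_na, 3)))
```

Working from the unrounded `H_na` avoids small rounding differences in the third decimal of the survival estimates.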
How to use R to estimate the survival function

survfit(Surv(y, cens) ~ 1, type = "fleming-harrington")

By default, survfit returns CIs using the log method (conf.type = "log"); the log(-log) method is requested with conf.type = "log-log".

▶ For the default (log-transformed) CI, use:
survfit(Surv(y, cens) ~ 1, type = "fleming-harrington",
        conf.type = "log")

▶ For the plain CI, specify:
survfit(Surv(y, cens) ~ 1, type = "fleming-harrington",
        conf.type = "plain")

Using R

> ## Nelson-Aalen estimation of the survival function
> na_fit <- survfit(Surv(y, cens) ~ 1, type = "fleming-harrington")
> summary(na_fit)
Call: survfit(formula = Surv(y, cens) ~ 1, type = "fleming-harrington")

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     20       1    0.951  0.0476       0.8624        1.000
    2     19       1    0.902  0.0655       0.7828        1.000
    4     17       2    0.802  0.0886       0.6462        0.996
    5     13       1    0.743  0.1000       0.5707        0.967
    8     11       3    0.566  0.1171       0.3769        0.849
    9      8       4    0.343  0.1114       0.1815        0.648
   12      3       2    0.176  0.1008       0.0574        0.541

Calculating NA Estimates at Time = 1
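▶ $r_1 = 20$, $s_1 = 1$
▶ Hazard increment: $\Delta H(1) = \dfrac{1}{20} = 0.05$
▶ Cumulative hazard: $H(1) = 0.05$
▶ Survival estimate: $\hat S(1) = e^{-H(1)} = e^{-0.05} \approx 0.951$.
▶ Variance of $H(1)$: $\mathrm{Var}(H(1)) = \dfrac{1}{20^2} = 0.0025$.
▶ Using the delta method,

$$\mathrm{Var}\bigl(\hat S(1)\bigr) \approx \hat S(1)^2\, \mathrm{Var}(H(1)) \approx (0.951)^2 \times 0.0025 \approx 0.00226,$$

so the standard error is

$$\mathrm{se}\bigl(\hat S(1)\bigr) \approx \sqrt{0.00226} \approx 0.0476.$$
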
Calculating NA Estimates at Time = 2
▶ $r_2 = 19$, $s_2 = 1$
▶ Hazard increment: $\Delta H(2) = \dfrac{1}{19} \approx 0.05263$
▶ Cumulative hazard:

$$H(2) = H(1) + \Delta H(2) \approx 0.05 + 0.05263 = 0.10263.$$

▶ Survival estimate:

$$\hat S(2) = e^{-H(2)} \approx e^{-0.10263} \approx 0.902.$$

▶ Variance of $H(2)$:

$$\mathrm{Var}(H(2)) = \frac{1}{20^2} + \frac{1}{19^2} \approx 0.0025 + 0.00277 = 0.00527.$$

▶ Variance of survival:

$$\mathrm{Var}\bigl(\hat S(2)\bigr) \approx (0.902)^2 \times 0.00527 \approx 0.00429,$$

so the standard error is

$$\mathrm{se}\bigl(\hat S(2)\bigr) \approx \sqrt{0.00429} \approx 0.0655.$$

Calculating NA Estimates at Time = 4
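▶ $r_3 = 17$, $s_3 = 2$
▶ Hazard increment: $\Delta H(4) = \dfrac{2}{17} \approx 0.11765$
▶ Cumulative hazard:

$$H(4) = H(2) + \Delta H(4) \approx 0.10263 + 0.11765 = 0.22028.$$

▶ Survival estimate:

$$\hat S(4) = e^{-H(4)} \approx e^{-0.22028} \approx 0.802.$$

▶ Variance of $H(4)$:

$$\mathrm{Var}(H(4)) = \mathrm{Var}(H(2)) + \frac{2}{17^2} \approx 0.00527 + \frac{2}{289} \approx 0.00527 + 0.00692 = 0.01219.$$

▶ Variance of survival:

$$\mathrm{Var}\bigl(\hat S(4)\bigr) \approx (0.802)^2 \times 0.01219 \approx 0.00784,$$

so the standard error is

$$\mathrm{se}\bigl(\hat S(4)\bigr) \approx \sqrt{0.00784} \approx 0.0886.$$
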
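
The same delta-method steps can be carried out for every event time at once. The sketch below is a base R illustration that takes $s_i$ and $r_i$ from the worked example; the names `var_H` and `se_na` are illustrative and not from the slides.

```r
# Event counts and numbers at risk from the worked example
event_times <- c(1, 2, 4, 5, 8, 9, 12)
s <- c(1, 1, 2, 1, 3, 4, 2)
r <- c(20, 19, 17, 13, 11, 8, 3)

H_na  <- cumsum(s / r)            # cumulative hazard H(y_i)
var_H <- cumsum(s / r^2)          # Var(H(y_i)) = sum of s_j / r_j^2
S_na  <- exp(-H_na)               # survival estimate exp(-H)
se_na <- S_na * sqrt(var_H)       # delta method: se(S) ~ S * sqrt(Var(H))

print(data.frame(time = event_times,
                 survival = round(S_na, 3),
                 std.err  = round(se_na, 4)))
```

Rounded to four decimals, `se_na` should agree with the std.err column of the `summary(na_fit)` output.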
Concluding Remarks

▶ The Kaplan-Meier (KM) and Nelson-Aalen (NA) estimators are fundamental in survival analysis for estimating survival probabilities.
▶ The KM estimator, a non-parametric method, is particularly effective for right-censored survival data and is commonly used to compare survival across groups.
▶ It calculates the survival probability at each event time from the number of at-risk individuals and the number of events.
▶ The NA estimator, on the other hand, focuses on the cumulative hazard function, which represents the integral of the hazard rate over time.

Concluding Remarks

▶ The hazard function denotes the instantaneous event rate at a given time, conditional on survival until that time.
▶ The NA estimator is preferred when estimating the hazard rate and the cumulative hazard rate of a population over time.
▶ The choice between the KM and NA estimators hinges on the research objectives and data characteristics.
▶ While KM is suited for estimating survival probabilities, NA is apt for hazard rates and cumulative hazards.
▶ In practice, combining both methods offers a comprehensive understanding of survival data.