Chapter 2: Empirical Models for Censored Data
Dhafer Malouche
Outline
Introduction
Kaplan-Meier Estimator
[Figure: timelines for Subject A (censored) and Subject B (observed event), with the censoring event marked; horizontal axis: study duration in years, from 0 to 8]
Real-World Examples
Analytical Challenges
Definition
Example: Lightbulb Lifespan Study
Example of Right-Censored Data in Insurance
Consider an insurance company tracking claims related to a specific type of
insurance policy. The data collection period ends on a certain date, and any claims
still open (not fully settled) at that time are right-censored. This means that the
final claim amount is not yet known, but it will be at least as much as the amount
already recorded.

Claim ID   Claim Start Date   Amount Paid as of Cutoff Date   Final Claim Amount (Censored)
001        01-Jan-2023        $5,000                          > $5,000 (Open)
002        15-Feb-2023        $8,000                          > $8,000 (Open)
003        10-Mar-2023        $3,500                          $3,500 (Closed)
004        20-Apr-2023        $7,250                          > $7,250 (Open)
005        05-May-2023        $4,000                          $4,000 (Closed)
▶ Claims 001, 002, and 004 are right-censored. As of the cutoff date, these
claims are still open.
▶ Claims 003 and 005 are not censored; these claims have been closed with final
amounts paid.
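The claims table can be encoded as records carrying a censoring flag. The following is a minimal Python sketch for illustration; the class name `Claim` and its field names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    paid_to_date: float  # amount paid as of the cutoff date
    is_open: bool        # True means the final amount is right-censored


claims = [
    Claim("001", 5000.0, True),
    Claim("002", 8000.0, True),
    Claim("003", 3500.0, False),
    Claim("004", 7250.0, True),
    Claim("005", 4000.0, False),
]

# For an open claim, paid_to_date is only a lower bound on the final amount.
censored_ids = [c.claim_id for c in claims if c.is_open]
observed_ids = [c.claim_id for c in claims if not c.is_open]
print("right-censored:", censored_ids)
print("fully observed:", observed_ids)
```

Storing the flag next to the amount keeps the lower-bound information available to any estimator applied later.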
Right-Censored Data in a Mortality Study
Assume a mortality study begins on January 1, 2020, and ends on December 31,
2025. The study tracks a cohort of individuals to observe their age at death.
However, by the end of the study, not all individuals in the cohort have died. For
those still alive, their data is right-censored.
▶ Right censoring occurs for participants who are still alive at study end or who
exit the study early.
▶ This affects the accuracy of mortality estimates and must be accounted for in
actuarial analyses.
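The censoring rule for this study can be sketched in Python as follows. The function `observe` and the three participants are hypothetical and serve only to illustrate how an observed time and an event indicator might be derived from the study dates given above.

```python
from datetime import date

STUDY_START = date(2020, 1, 1)
STUDY_END = date(2025, 12, 31)


def observe(death_date=None, exit_date=None):
    """Return (days_followed, is_event) for one participant."""
    # Follow-up stops at early exit or at study end, whichever comes first.
    censor_at = min(exit_date, STUDY_END) if exit_date else STUDY_END
    if death_date is not None and death_date <= censor_at:
        return (death_date - STUDY_START).days, True
    # Still alive when follow-up stops: the time is right-censored.
    return (censor_at - STUDY_START).days, False


# Hypothetical participants, for illustration only.
died_in_study = observe(death_date=date(2020, 1, 11))
alive_at_end = observe()
left_early = observe(exit_date=date(2021, 1, 1))
print(died_in_study, alive_at_end, left_early)
```

A death that occurs after a participant has left the study is not observed, so that participant is still recorded as censored at the exit date.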
Example: Pension Liabilities
▶ Pension funds often monitor the time until retirement or death of plan
participants. These times can be partially observed (censored) if participants
exit the plan, switch employers, or the observation period ends before an event
occurs.
▶ Context: Employers set aside funds based on expected payout timelines.
▶ Right-Censoring: If a participant remains alive or employed at the end of the
observation window, their “event time” (e.g., retirement date) is not fully
known.
▶ Implication: Actuaries adjust liability and funding estimates to account for
censored individuals.
▶ Example (∗ marks a right-censored observation):
1 2 3∗ 4 4 4∗ 4∗ 5 7∗ 8 8 8 9 9 9 9 10∗ 12 12 15∗
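▶ Calculate the following statistic, the estimate of the survival function $S_n(y) = 1 - F_n(y)$, where $s_i$ is the number of events at $y_i$ and $r_i$ is the number at risk just before $y_i$:
\[
S_n(y) =
\begin{cases}
1, & y < y_1, \\
\prod_{i=1}^{j} \left(1 - \hat{\lambda}_i\right) = \prod_{i=1}^{j} \left(1 - \dfrac{s_i}{r_i}\right), & y_j \le y < y_{j+1}, \quad j = 1, 2, \dots, k-1, \\
\prod_{i=1}^{k} \left(1 - \hat{\lambda}_i\right) = \prod_{i=1}^{k} \left(1 - \dfrac{s_i}{r_i}\right), & y_k \le y < y_{\max}.
\end{cases}
\]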
Example
 i   y_i   s_i   r_i   Ŝn(y_i)
 1     1     1    20   1 − 1/20 = 0.950
 2     2     1    19   0.950(1 − 1/19) = 0.900
 3     4     2    17   0.900(1 − 2/17) = 0.794
 4     5     1    13   0.794(1 − 1/13) = 0.733
 5     8     3    11   0.733(1 − 3/11) = 0.533
 6     9     4     8   0.533(1 − 4/8) = 0.267
 7    12     2     3   0.267(1 − 2/3) = 0.089

\[
S_{20}(y) =
\begin{cases}
1, & y < 1, \\
0.950, & 1 \le y < 2, \\
0.900, & 2 \le y < 4, \\
0.794, & 4 \le y < 5, \\
0.733, & 5 \le y < 8, \\
0.533, & 8 \le y < 9, \\
0.267, & 9 \le y < 12, \\
0.089, & 12 \le y < 15.
\end{cases}
\]
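The table above can be reproduced with a short script. The slides use R for this step; the sketch below is plain Python, chosen only for illustration, and the function name `kaplan_meier` is hypothetical. The observation at 10 is entered as censored, which is what the risk-set count r = 3 at y = 12 implies.

```python
from collections import Counter

# (time, is_event) pairs; is_event=False marks a right-censored observation.
data = [
    (1, True), (2, True), (3, False), (4, True), (4, True),
    (4, False), (4, False), (5, True), (7, False), (8, True),
    (8, True), (8, True), (9, True), (9, True), (9, True),
    (9, True), (10, False), (12, True), (12, True), (15, False),
]


def kaplan_meier(data):
    """Return rows (y_i, s_i, r_i, S(y_i)) at each distinct event time."""
    events = Counter(t for t, is_event in data if is_event)
    rows = []
    surv = 1.0
    for y in sorted(events):
        r = sum(1 for t, _ in data if t >= y)  # at risk just before y
        s = events[y]                          # events exactly at y
        surv *= 1 - s / r
        rows.append((y, s, r, surv))
    return rows


km_rows = kaplan_meier(data)
for y, s, r, surv in km_rows:
    print(f"y={y:>2}  s={s}  r={r:>2}  S={surv:.3f}")
```

Censored observations never trigger a step in the estimate; they only shrink the risk set for later event times.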
Using R
How Is the KM Output Computed?
▶ n.risk (ri): the number of individuals still at risk just before time yi.
▶ n.event (si): the number of events (failures/deaths) observed exactly at yi.
▶ survival Ŝn(yi): the Kaplan-Meier estimate of survival at time yi.
▶ std.err: the standard error of the survival estimate, often computed via Greenwood's formula.
▶ 95% CI: the confidence interval for the survival estimate.
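Formulas in Practice:
▶ Survival estimate:
\[
\hat{S}_n(y_i) = \hat{S}_n(y_{i-1}) \times \left(1 - \frac{s_i}{r_i}\right).
\]
For the first event time $y_1$, we set $\hat{S}_n(y_0) = 1$.
▶ Standard error (Greenwood's formula):
\[
\text{std.err}\left(\hat{S}_n(y_i)\right) \approx \hat{S}_n(y_i) \sqrt{\sum_{j=1}^{i} \frac{s_j}{r_j (r_j - s_j)}}.
\]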
Example at Times = 1, 2, and 3
where Ŝ(y0 ) := 1.
▶ Standard error: $\mathrm{se}\left(\hat{S}(y_i)\right) = \sqrt{\widehat{\mathrm{Var}}\left(\hat{S}(y_i)\right)}$.
▶ If no event occurs at some time t, the survival estimate and its
variance remain the same as at the last event time.
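▶ Time = 1: $r_1 = 20$, $s_1 = 1$
\[
\hat{S}(1) = 1 \times \left(1 - \frac{1}{20}\right) = 0.95.
\]
\[
\widehat{\mathrm{Var}}\left(\hat{S}(1)\right) = (0.95)^2 \times \frac{1}{20 \cdot 19} = 0.00238
\;\Rightarrow\; \mathrm{se}\left(\hat{S}(1)\right) = \sqrt{0.00238} \approx 0.0487.
\]
▶ Time = 2: $r_2 = 19$, $s_2 = 1$
\[
\hat{S}(2) = \hat{S}(1) \times \left(1 - \frac{1}{19}\right) = 0.95 \times \frac{18}{19} = 0.90.
\]
For the variance, we add the second term to Greenwood's sum:
\[
\widehat{\mathrm{Var}}\left(\hat{S}(2)\right) = \left(\hat{S}(2)\right)^2 \left[\frac{1}{20 \cdot 19} + \frac{1}{19 \cdot 18}\right].
\]
Numerically, this leads to
\[
\mathrm{se}\left(\hat{S}(2)\right) \approx 0.0671.
\]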
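The two standard errors above can be checked numerically. The following Python sketch applies Greenwood's formula to the (y_i, s_i, r_i) values from the earlier example table; the function name `km_with_greenwood` is hypothetical.

```python
import math

# (y_i, s_i, r_i) taken from the Kaplan-Meier example table.
event_table = [(1, 1, 20), (2, 1, 19), (4, 2, 17), (5, 1, 13),
               (8, 3, 11), (9, 4, 8), (12, 2, 3)]


def km_with_greenwood(event_table):
    """Return rows (y_i, S(y_i), se(S(y_i))) using Greenwood's formula."""
    rows = []
    surv = 1.0
    greenwood_sum = 0.0
    for y, s, r in event_table:
        surv *= 1 - s / r
        # This term is undefined when r == s (everyone at risk fails).
        greenwood_sum += s / (r * (r - s))
        rows.append((y, surv, surv * math.sqrt(greenwood_sum)))
    return rows


greenwood_rows = km_with_greenwood(event_table)
for y, surv, se in greenwood_rows[:2]:
    print(f"y={y}: S={surv:.3f}, se={se:.4f}")
```

Because the Greenwood sum only grows at event times, the estimate and its standard error stay constant between events, in line with the remark about times without events.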
The Nelson-Aalen Estimator
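\[
\hat{H}(y) =
\begin{cases}
0, & y < y_1, \\
\sum_{i=1}^{j} \hat{\lambda}_i = \sum_{i=1}^{j} \dfrac{s_i}{r_i}, & y_j \le y < y_{j+1}, \quad j = 1, 2, \dots, k-1, \\
\sum_{i=1}^{k} \hat{\lambda}_i = \sum_{i=1}^{k} \dfrac{s_i}{r_i}, & y_k \le y < y_{\max}.
\end{cases}
\]
▶ Here, $\hat{H}(y) = \sum_{i \mid y_i \le y} \hat{\lambda}_i$ for $y < y_{\max}$.
▶ The Nelson-Aalen estimator of the survival function is
\[
\hat{S}(y) = \exp\left(-\hat{H}(y)\right).
\]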
Variance Formulas and Confidence Intervals (Nelson-Aalen)
Variance of the Cumulative Hazard:
\[
\widehat{\mathrm{Var}}\left(\hat{H}(t)\right) = \sum_{i : y_i \le t} \frac{s_i}{r_i^2}.
\]
Survival Function:
\[
\hat{S}(t) = \exp\left(-\hat{H}(t)\right).
\]
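Confidence Intervals:
▶ Plain (Linear) CI:
\[
\hat{S}(t) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}\left(\hat{S}(t)\right)}.
\]
▶ Log(-log) CI: since $\ln\left(-\ln \hat{S}(t)\right) = \ln \hat{H}(t)$, the delta method gives a standard error of $\sqrt{\widehat{\mathrm{Var}}(\hat{H}(t))} / \hat{H}(t)$ on this scale, so the interval is
\[
\ln\left(-\ln \hat{S}(t)\right) \pm z_{\alpha/2} \frac{\sqrt{\widehat{\mathrm{Var}}\left(\hat{H}(t)\right)}}{\hat{H}(t)},
\]
which is then transformed back to the survival scale.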
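Both interval types can be sketched in Python as below. This is an illustration under stated assumptions, not the output of any particular package: `z = 1.96` is assumed for a 95% level, the variance of the survival estimate is taken from the delta method as Ŝ(t)² · Var(Ĥ(t)), the log(-log) interval uses the delta-method standard error sqrt(Var(Ĥ(t))) / Ĥ(t) on the ln(−ln) scale, and the function name `na_intervals` is hypothetical.

```python
import math

# (y_i, s_i, r_i) from the running 20-observation example.
event_table = [(1, 1, 20), (2, 1, 19), (4, 2, 17), (5, 1, 13),
               (8, 3, 11), (9, 4, 8), (12, 2, 3)]
Z_95 = 1.96  # assumed normal quantile for a 95% interval


def na_intervals(event_table, z=Z_95):
    """Return rows (y_i, S, plain_ci, loglog_ci) for the NA survival estimate."""
    rows = []
    cum_hazard = 0.0
    var_hazard = 0.0
    for y, s, r in event_table:
        cum_hazard += s / r
        var_hazard += s / r**2
        surv = math.exp(-cum_hazard)
        # Delta method: Var(S) is approximately S^2 * Var(H).
        se_surv = surv * math.sqrt(var_hazard)
        plain = (surv - z * se_surv, surv + z * se_surv)
        # Interval for ln(H), mapped back through S = exp(-H).
        half_width = z * math.sqrt(var_hazard) / cum_hazard
        hazard_lo = cum_hazard * math.exp(-half_width)
        hazard_hi = cum_hazard * math.exp(half_width)
        loglog = (math.exp(-hazard_hi), math.exp(-hazard_lo))
        rows.append((y, surv, plain, loglog))
    return rows


ci_rows = na_intervals(event_table)
for y, surv, plain, loglog in ci_rows:
    print(f"y={y:>2}  S={surv:.3f}  plain=({plain[0]:.3f}, {plain[1]:.3f})  "
          f"loglog=({loglog[0]:.3f}, {loglog[1]:.3f})")
```

One reason to prefer the transformed interval is that it always stays inside (0, 1), whereas the plain interval can exceed 1 at early times when the estimate is close to 1.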
Why Use the Nelson-Aalen Method?
Differences between KM and Nelson-Aalen
▶ Core Quantity:
   ▶ KM: Estimates Ŝ(t) directly.
   ▶ NA: Estimates Ĥ(t), then sets Ŝ(t) = exp(−Ĥ(t)).
▶ Interpretation:
   ▶ KM: Step function for survival probabilities.
   ▶ NA: Step function for cumulative hazard; survival is a smooth transform.
▶ Asymptotic Equivalence:
   ▶ Both converge to the true survival function with large samples.
▶ Practical Usage:
   ▶ KM is often reported for survival curves.
   ▶ NA is used to analyze or model the hazard function more directly.
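The difference between the two estimators can be seen numerically on the running example. This Python sketch, with the hypothetical function name `compare_km_na`, computes both survival estimates from the same (y_i, s_i, r_i) values.

```python
import math

# (y_i, s_i, r_i) from the running 20-observation example.
event_table = [(1, 1, 20), (2, 1, 19), (4, 2, 17), (5, 1, 13),
               (8, 3, 11), (9, 4, 8), (12, 2, 3)]


def compare_km_na(event_table):
    """Return rows (y_i, km_survival, na_survival)."""
    rows = []
    km_surv = 1.0
    cum_hazard = 0.0
    for y, s, r in event_table:
        km_surv *= 1 - s / r   # KM multiplies by (1 - s_i / r_i)
        cum_hazard += s / r    # NA accumulates s_i / r_i
        rows.append((y, km_surv, math.exp(-cum_hazard)))
    return rows


comparison = compare_km_na(event_table)
for y, km_surv, na_surv in comparison:
    print(f"y={y:>2}  KM={km_surv:.3f}  NA={na_surv:.3f}  gap={na_surv - km_surv:.3f}")
```

Since exp(−x) > 1 − x for every x > 0, the NA-based survival estimate lies above the KM estimate at each event time. The gap is small while the risk set is large and widens as it shrinks.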
Estimating the Survival
▶ Example (∗ marks a right-censored observation):
1 2 3∗ 4 4 4∗ 4∗ 5 7∗ 8 8 8 9 9 9 9 10∗ 12 12 15∗
▶ Nelson-Aalen Estimator

 i   y_i   s_i   r_i   Ĥn(y_i)
 1     1     1    20   1/20 = 0.050
 2     2     1    19   0.050 + 1/19 = 0.103
 3     4     2    17   0.103 + 2/17 = 0.220
 4     5     1    13   0.220 + 1/13 = 0.297
 5     8     3    11   0.297 + 3/11 = 0.570
 6     9     4     8   0.570 + 4/8 = 1.070
 7    12     2     3   1.070 + 2/3 = 1.737

\[
\hat{S}_{20}(y) =
\begin{cases}
1, & y < 1, \\
e^{-0.050} = 0.951, & 1 \le y < 2, \\
e^{-0.103} = 0.902, & 2 \le y < 4, \\
e^{-0.220} = 0.803, & 4 \le y < 5, \\
e^{-0.297} = 0.743, & 5 \le y < 8, \\
e^{-0.570} = 0.566, & 8 \le y < 9, \\
e^{-1.070} = 0.343, & 9 \le y < 12, \\
e^{-1.737} = 0.176, & 12 \le y < 15.
\end{cases}
\]
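The Nelson-Aalen table can be reproduced from the raw observations. The slides use R for this step; the sketch below is plain Python for illustration, with the hypothetical function name `nelson_aalen` and the observation at 10 entered as censored, consistent with r = 3 at y = 12. Survival values may differ from the table in the third decimal, because the table exponentiates the rounded cumulative hazard.

```python
import math
from collections import Counter

# (time, is_event) pairs; is_event=False marks a right-censored observation.
data = [
    (1, True), (2, True), (3, False), (4, True), (4, True),
    (4, False), (4, False), (5, True), (7, False), (8, True),
    (8, True), (8, True), (9, True), (9, True), (9, True),
    (9, True), (10, False), (12, True), (12, True), (15, False),
]


def nelson_aalen(data):
    """Return rows (y_i, s_i, r_i, H(y_i), exp(-H(y_i)))."""
    events = Counter(t for t, is_event in data if is_event)
    rows = []
    cum_hazard = 0.0
    for y in sorted(events):
        r = sum(1 for t, _ in data if t >= y)  # at risk just before y
        s = events[y]                          # events exactly at y
        cum_hazard += s / r
        rows.append((y, s, r, cum_hazard, math.exp(-cum_hazard)))
    return rows


na_rows = nelson_aalen(data)
for y, s, r, cum_hazard, surv in na_rows:
    print(f"y={y:>2}  s={s}  r={r:>2}  H={cum_hazard:.3f}  S={surv:.3f}")
```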
How to use R to estimate the survival function
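Calculating NA Estimates at Time = 1
▶ $r_1 = 20$, $s_1 = 1$
▶ Hazard increment: $\Delta H(1) = \frac{1}{20} = 0.05$
▶ Cumulative hazard: $H(1) = 0.05$
▶ Survival estimate: $\hat{S}(1) = e^{-H(1)} = e^{-0.05} \approx 0.951$.
▶ Variance of H(1): $\mathrm{Var}(H(1)) = \frac{1}{20^2} = 0.0025$.
▶ Using the delta method, $\mathrm{Var}(\hat{S}(1)) \approx \hat{S}(1)^2 \, \mathrm{Var}(H(1)) \approx (0.951)^2 \times 0.0025 \approx 0.0023$.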
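Calculating NA Estimates at Time = 2
▶ $r_2 = 19$, $s_2 = 1$
▶ Hazard increment: $\Delta H(2) = \frac{1}{19} \approx 0.05263$
▶ Cumulative hazard: $H(2) = 0.05 + 0.05263 \approx 0.10263$
▶ Variance of H(2): $\mathrm{Var}(H(2)) = \frac{1}{20^2} + \frac{1}{19^2} \approx 0.0025 + 0.00277 = 0.00527$.
▶ Variance of survival (delta method): $\mathrm{Var}(\hat{S}(2)) \approx \hat{S}(2)^2 \, \mathrm{Var}(H(2)) \approx (0.902)^2 \times 0.00527 \approx 0.0043$.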
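Calculating NA Estimates at Time = 4
▶ $r_3 = 17$, $s_3 = 2$
▶ Hazard increment: $\Delta H(4) = \frac{2}{17} \approx 0.11765$
▶ Cumulative hazard: $H(4) = 0.10263 + 0.11765 \approx 0.22028$
▶ Variance of H(4): $\mathrm{Var}(H(4)) = \mathrm{Var}(H(2)) + \frac{2}{17^2} \approx 0.00527 + \frac{2}{289} \approx 0.00527 + 0.00692 = 0.01219$.
▶ Variance of survival (delta method): $\mathrm{Var}(\hat{S}(4)) \approx \hat{S}(4)^2 \, \mathrm{Var}(H(4)) \approx 0.0078$, with $\hat{S}(4) = e^{-H(4)}$.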
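The step-by-step calculations at times 1, 2, and 4 can be verified with a short Python sketch. The variance of the survival estimate is computed with the delta-method approximation Var(Ŝ) ≈ Ŝ² · Var(H), which is an assumption made here for illustration; the function name `na_steps` is hypothetical.

```python
import math

# (time, s_i, r_i) for the first three event times of the example.
first_events = [(1, 1, 20), (2, 1, 19), (4, 2, 17)]


def na_steps(first_events):
    """Return one dict per event time with the NA quantities."""
    steps = []
    cum_hazard = 0.0
    var_hazard = 0.0
    for y, s, r in first_events:
        increment = s / r
        cum_hazard += increment
        var_hazard += s / r**2
        surv = math.exp(-cum_hazard)
        steps.append({
            "time": y,
            "increment": increment,
            "cum_hazard": cum_hazard,
            "var_hazard": var_hazard,
            "survival": surv,
            # Delta-method approximation (assumed, see the lead-in).
            "var_survival": surv**2 * var_hazard,
        })
    return steps


steps = na_steps(first_events)
for step in steps:
    print({key: round(value, 5) for key, value in step.items()})
```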
Concluding Remarks
▶ The hazard function denotes the instantaneous event rate at a given time,
conditional on survival until that time.
▶ The NA estimator is preferred when estimating the hazard rate and the
cumulative hazard rate of a population over time.