Meth 2024 Part1 Censored
Incomplete data methods
Valentin Patilea
Alternative title:
Missing or modified data methods
‘Big Data’ can diminish the variance¹, but cannot remove the bias²

1 Here, ‘Big Data’ is used in the sense of a huge number of observations of some random vector of interest with fixed dimension
2 However, sometimes ‘Big Data’ may help to remove it
Agenda
Introduction
Organization
Overview
Notation and conventions
Another look at expectations
A few references
Censored data
4 / 80
Module organization
▶ There are 5 lectures, 2 tutorial sessions (TD) and 1 computer lab (TP)
▶ Evaluation
▶ Quiz on November 19, at 17:30 (20–30 minutes)
▶ Final exam, closed book (a formula sheet, ‘formulaire’, is allowed)
▶ Contact:
▶ [email protected]
▶ office hours: by appointment only, office 270, usually after
5:00pm
5 / 80
Materials and bibliography
▶ The course is designed to be self-contained
▶ BUT... prerequisites from 1A are required!
▶ Probability (expectation, conditional expectation, density...)
▶ Statistics (MLE, MoM, bias...)
▶ There are no previous exams, this is the first time the course
is given
6 / 80
Types of incomplete (not perfect) data to be considered
▶ Censoring
▶ Missing
▶ Biased
8 / 80
A general modeling paradigm
9 / 80
Censoring for ‘time-to-event’ data
▶ Let Y ≥ 0 be a random variable of interest, typically a time to
some event
▶ time to ‘failure’ (medicine, engineering, economics, finance,
demography...)
▶ The statistical field studying time-to-event data is Survival Analysis
11 / 80
Biased data
▶ Biased data can be generated by a selection bias, usually due to the method used for collecting the data
▶ A well-known bias is the so-called survivorship bias
▶ Schools may ‘forget’ the students who drop out during the training to report better performance on the job market
▶ Abraham Wald during WW2 (see the aircraft picture)
14 / 80
Starting point in this course
▶ Understand how to compute unbiased (or asymptotically
unbiased) estimators for expectations
▶ Expectations are required everywhere
▶ Mean, variance, quantiles etc.
▶ Estimation and inference (distribution function, MLE, MoM,
regression/predictive models or algorithms...)
Ê[ϕ(incomplete data)]   (e.g., Ê[·] is the empirical mean)
▶ The steps
▶ Find an appropriate weighting for the expectation of interest. Appropriate weighting means E[weighting × ϕ(incomplete data)] = E[ϕ(complete data)]
▶ Build estimators
Ê[weighting × ϕ(incomplete data)]
18 / 80
Summarizing the road map of the lectures (and the tutorials/lab)
▶ Type of statistical analysis
▶ focused on the marginal distribution
▶ mean, higher order moments, quantiles,...
▶ density
▶ focused on the conditional distribution (predictive analysis)
▶ mean regression, quantile regression,...
▶ By definition 0/0=0
▶ The random variables are denoted by capital letters (X, Y, Z,...), the random vectors are column matrices denoted by bold capital letters (X, Y, Z,...)
▶ The components of a random vector X ∈ R^p are denoted X^(j), 1 ≤ j ≤ p
▶ All the random elements are defined on (Ω, F, P)
▶ The stochastic independence between random variables or vectors is
denoted by ⊥ (we will write X ⊥ Y , X ⊥ {Y , Z }, X ⊥ Y,
X ⊥ {Y, Z}...)
▶ Here, a continuous random variable or vector admits a density with
respect to the Lebesgue measure
▶ For x, y ∈ R^p, x ≤ y denotes the componentwise (coordinatewise) partial order between the vectors x and y, that is, x^(j) ≤ y^(j), ∀ 1 ≤ j ≤ p
▶ The set C[0, 1] is the set of real-valued continuous functions defined
on [0, 1]
21 / 80
▶ For a random variable X (resp. vector X), the distribution function
(abbreviated df) is denoted by F or FX (resp. FX )
▶ The function Φ : R → (0, 1) is the standard normal
distribution function
22 / 80
▶ Distinguish between E(Y | X) and E(Y | X = x)
▶ E(Y | X) is a random variable, a measurable function of X
▶ E(Y | X = x) is a non random function defined on the support
of X
▶ When an equality involving E(Y | X) is written, it means that the two random variables are equal almost surely
Statistical analysis
▶ Recall that a major purpose of the statistical analysis is to provide a
description of the data generating process (DGP)
P = {Pθ : θ ∈ Θ},
27 / 80
Examples (3/9) – marginal distribution aspects
▶ Assume that Y admits a density fY and the model P is defined by densities, that is, P = {fθ : θ ∈ Θ}
▶ Assume that the density of Y belongs to P and a unique θ0 ∈ Θ exists such that fθ0 = fY
▶ Then
E[ log( fθ(Y) / fY(Y) ) ] < 0,   ∀ θ ̸= θ0
▶ Consequently,
θ0 = arg max_{θ∈Θ} E[ log fθ(Y) ]
▶ Orthonormal means
∫_0^1 φj(y) φk(y) dy = 1{j = k}
10 The proof is omitted and not required
11 The result is not limited to densities.
30 / 80
Examples (6/9) – marginal distribution aspects
▶ Let Y1, . . . , Yn be an IID sample from Y
where
θ̂j = (1/n) Σ_{i=1}^{n} φj(Yi),   1 ≤ j ≤ J
32 / 80
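The display above can be turned into a working projection density estimator. A minimal sketch (the cosine basis φ0 = 1, φj(y) = √2 cos(jπy) on [0, 1] matches the orthonormality condition of slide 30; the Beta(2,2) target, the sample size and the truncation level J are my illustrative choices):

```python
import numpy as np

def phi(j, y):
    # Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_j(y) = sqrt(2) * cos(j * pi * y)
    y = np.asarray(y, dtype=float)
    return np.ones_like(y) if j == 0 else np.sqrt(2.0) * np.cos(j * np.pi * y)

def density_estimate(sample, J):
    # theta_hat_j = (1/n) * sum_i phi_j(Y_i), then f_hat = sum_j theta_hat_j * phi_j
    theta = [phi(j, sample).mean() for j in range(J + 1)]
    return lambda y: sum(t * phi(j, y) for j, t in enumerate(theta))

rng = np.random.default_rng(0)
Y = rng.beta(2, 2, size=5000)            # true density: f(y) = 6 y (1 - y)
f_hat = density_estimate(Y, J=8)

grid = np.linspace(0.05, 0.95, 10)
err = np.max(np.abs(f_hat(grid) - 6 * grid * (1 - grid)))
```

The estimate integrates to one by construction (θ̂0 = 1 and ∫φj = 0 for j ≥ 1); the truncation level J trades bias against variance.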
Examples (8/9) – conditional distribution aspects
▶ When ℓ(y, c) = (y − c)[τ − 1{y − c < 0}], the solution is the τ-th order conditional quantile function qτ(x) given X = x
▶ the median regression is obtained with τ = 1/2
▶ the most common is the linear quantile regression
33 / 80
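A quick numerical sanity check (not in the slides): minimizing the empirical check loss ℓ(y, c) = (y − c)[τ − 1{y − c < 0}] over constants c recovers the empirical τ-quantile; with τ = 1/2 one gets the sample median:

```python
import numpy as np

def check_loss(y, c, tau):
    # check (pinball) loss averaged over the sample
    r = y - c
    return np.mean(r * (tau - (r < 0)))

rng = np.random.default_rng(1)
y = rng.normal(size=201)                 # odd n, so the sample median is unique
cands = np.sort(y)                       # a minimizer is attained at a sample point
best = cands[np.argmin([check_loss(y, c, 0.5) for c in cands])]
```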
Examples (9/9) – conditional distribution aspects
▶ Using the cosine basis decomposition (slides 29, 30), consider a
model free mean regression estimation with one predictor X ∈ [0, 1]
admitting a density fX :
Y = m(X ) + ε, E(ε | X ) = 0, Var(ε | X ) ≤ C
▶ We need to understand
▶ how to build (asymptotically) unbiased estimators for the
expectations with ‘incomplete’ data... if possible!
▶ how to modify the usual approaches for statistical analysis to
account for the ’data incompleteness’
35 / 80
Illustrations of the bias induced by the incompleteness
36 / 80
Censoring
▶ Klein, J.P., van Houwelingen, H.C., Ibrahim, J.G. & Scheike, T.H. (2014). Handbook of Survival Analysis, CRC Press.
38 / 80
Missingness
39 / 80
Biased data
▶ Gill, R.D., Vardi, Y. & Wellner, J.A. (1988). Large Sample Theory
of Empirical Distributions in Biased Sampling Models. Ann.
Statist., vol. 16(3), 1069–1112.
https://doi.org/10.1214/aos/1176350948
40 / 80
▶ Refresh the notions from the 1A courses
▶ See also the textbook¹⁴ Foata & Fuchs (2012). Calcul des probabilités (3ème édition), Dunod.
14 A PDF copy of the second edition of this book can be downloaded here: https://mathksar.weebly.com/uploads/1/4/4/0/14403348/calcul_des_probabilits_dunod_second_edition.pdf
42 / 80
▶ See, e.g., Foata & Fuchs (2012), Ch. 12.
44 / 80
▶ Conditional distribution for discrete random vectors
▶ Conditional independence
X ⊥ {Y, Z} | W =⇒ X ⊥ Y | W
46 / 80
▶ Parametric models have some appealing features, but can be
wrong (misspecified)!
▶ Nonparametric (model-free) approaches are sometimes preferable: they provide a safeguard against misspecification
▶ There are several very common, and very simple, nonparametric estimators in the case without covariates
▶ Empirical distribution function (df)
▶ Empirical survival function
▶ Empirical mean, empirical variance,... More generally, empirical moments, quantiles,...
▶ all of them nothing but functionals of the empirical df
▶ ...
▶ ...
48 / 80
Agenda
Introduction
Censored data
Types of censoring
IPCW and the Kaplan-Meier estimator
Regression models under right-censoring
Take away
49 / 80
Random right-censoring
▶ Let Y ∈ R+ be a random variable of interest
One of
Ti = Yi,   Ti = Li   or   Ti = Ri
is observed.
18 Herein, a ∨ b = max(a, b).
51 / 80
Ignoring the censoring induces biases (1/3)
▶ Consider the case of a random right-censored Y
▶ The purpose is to estimate θ = E(Y ) under the assumption19
Y ⊥C
E(δ | Y ) = SC (Y −) and δT = δY ,
we get
h i
E θb2 = E [δT ] = E [δY ]
= E [E (δ | Y ) Y ] = E [SC (Y −)Y ]
≤ E [Y ] ,
55 / 80
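The downward bias above is easy to reproduce by simulation (the exponential laws are my illustrative choice; with Y ∼ Exp(1) and C ∼ Exp(1/2) independent, E[T] = 2/3 and E[δT] = E[SC(Y−)Y] = 4/9, both strictly below E[Y] = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
Y = rng.exponential(1.0, n)          # variable of interest, E[Y] = 1
C = rng.exponential(2.0, n)          # independent right-censoring, E[C] = 2
T = np.minimum(Y, C)                 # observed time T = Y ^ C
delta = (Y <= C).astype(float)       # censoring indicator delta = 1{Y <= C}

theta1 = T.mean()                    # naive mean of the observed times
theta2 = (delta * T).mean()          # mean of delta * T (zeroes out censored obs.)
# both underestimate E[Y] = 1: here E[T] = 2/3 and E[delta * T] = 4/9
```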
Proposition 1
Let Y, C ∈ R+ and assume Y ⊥ C. Let T = Y ∧ C and δ = 1{Y ≤ C}. Then, for any integrable function ϕ(Y) such that²¹ ϕ(·) vanishes outside the support of T,
E[ ω(δ, T) ϕ(T) ] = E[ϕ(Y)]   with   ω(δ, T) = δ / SC(T−).
Corollary 1
Consider the IID sample (Ti, δi) of (T, δ). Under the conditions of Proposition 1, the functions
F̃Y(y) = (1/n) Σ_{i=1}^{n} [δi / SC(Ti−)] 1{Ti ≤ y}   and   S̃Y(y) = 1 − F̃Y(y),   y ≥ 0,
with ω(δi, Ti) = δi / SC(Ti−), are unbiased estimators of FY(y) and SY(y), respectively.
21 Alternatively, impose sup{y : SC(y−) > 0} ≥ sup{y : SY(y−) > 0}.
57 / 80
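Proposition 1 and Corollary 1 can be checked by simulation when SC is known (the exponential laws are my illustrative choice; since C is continuous here, SC(t−) = SC(t)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
Y = rng.exponential(1.0, n)              # E[Y] = 1, S_Y(y) = exp(-y)
C = rng.exponential(2.0, n)              # S_C(t) = exp(-t/2), treated as known
T = np.minimum(Y, C)
delta = (Y <= C).astype(float)

w = delta / np.exp(-T / 2.0)             # IPCW weights omega(delta_i, T_i) = delta_i / S_C(T_i-)
theta_ipcw = np.mean(w * T)              # estimates E[Y] = 1, without bias
S_tilde_1 = np.mean(w * (T > 1.0))       # estimates S_Y(1) = exp(-1)
```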
Expectation and empirical mean as integrals
▶ Using the integral notation, we write
E[ϕ(Y)] = ∫ ϕ(y) dFY(y)
58 / 80
IPCW and unbiased expectation estimators
▶ If SC is given, the expectation estimator obtained by IPCW
(Inverse Probability of Censoring Weighting) is unbiased and
consistent, under random right-censoring!
Corollary 2
Consider the IID sample (Ti , δi ) of (T , δ). Under the conditions of
Proposition 1:
▶ E[ ∫ ϕ(y) dF̃Y(y) ] = ∫ ϕ(y) dFY(y);
λY (yk ) = P(Y = yk | Y ≥ yk )
▶ The hazard rate is thus the probability that the event occurs
at time yk given that it did not occur previously
62 / 80
Empirical distribution and the hazard function (2/4)
▶ Exercise: show that for each k ≥ 1,
λY(yk) = 1 − SY(yk)/SY(yk−1) = 1 − SY(yk)/SY(yk−),
and
SY(yk) = ∏_{j=1}^{k} {1 − λY(yj)}
Proposition 2
The hazard function characterizes the distribution of Y .
63 / 80
Empirical distribution and the hazard function (3/4)
▶ Since²²
λY(yk) = pk / SY(yk−)   (by definition SY(y1−) = 1)
and
SY(yk) = ∏_{j=1}^{k} {1 − λY(yj)},
22 Recall, the support points {y1, y2, . . .} are ordered!
64 / 80
Empirical distribution and the hazard function (4/4)
▶ Define
λY,n(yk) = Σ_{j=1}^{n} 1{Yj = yk} / Σ_{j=1}^{n} 1{Yj ≥ yk}
(By convention, the product with an empty set of indices is equal to 1.)
▶ In the case with no ties in the sample of size n, one can write
SY,n(y) = 1 − FY,n(y) = ∏_{i: Y(i) ≤ y} ( 1 − 1/(n − i + 1) ) = ∏_{i: Y(i) ≤ y} (n − i)/(n − i + 1),
where Y(1), Y(2), . . . , Y(n) are the order statistics, that is, the values of the sample (increasingly) ordered
65 / 80
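The product formula holds exactly at the sample level: building the survival from the empirical hazard λY,n reproduces the plug-in empirical survival (a quick numerical verification, with an arbitrary discrete sample of my choice):

```python
import numpy as np

rng = np.random.default_rng(4)
Ysamp = rng.integers(1, 6, size=50).astype(float)   # discrete support {1, ..., 5}
support = np.unique(Ysamp)

# empirical hazard: #{Y_j = y_k} / #{Y_j >= y_k}
lam = np.array([(Ysamp == yk).sum() / (Ysamp >= yk).sum() for yk in support])

S_prod = np.cumprod(1.0 - lam)                              # prod_{j <= k} (1 - lambda_n(y_j))
S_emp = np.array([(Ysamp > yk).mean() for yk in support])   # (1/n) * #{Y_i > y_k}
```

The identity is exact because 1 − λY,n(yk) = #{Yj ≥ y_{k+1}} / #{Yj ≥ yk}, so the product telescopes.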
Kaplan-Meier estimator of the survival function SY
▶ The sample is (Ti , δi ), 1 ≤ i ≤ n
▶ recall: Ti = Yi ∧ Ci and δi = 1{Yi ≤ Ci }
▶ Let t1, t2, . . . be the ordered distinct values in the sample T1, . . . , Tn. Ties (ex aequo) are allowed
▶ Let
λ̂(tk) = Σ_{j=1}^{n} δj 1{Tj = tk} / Σ_{j=1}^{n} 1{Tj ≥ tk} = ♯{failures at time tk} / ♯{individuals ‘at risk’ at time tk},   k = 1, 2, . . .
δϕ(T) = δϕ(Y)
24 Here, the sign ≈ in the display means that the difference between the left and right sides tends to zero in probability. See also slide 54.
67 / 80
▶ The Kaplan-Meier estimator has jumps (puts positive mass)
only at the non-censored observations
▶ When there are no ties, the Kaplan-Meier estimator can be rewritten, ∀y,
Ŝ(y) = 1 − F̂(y) = ∏_{k: Tk ≤ y} ( 1 − 1/Σ_{j=1}^{n} 1{Tj ≥ Tk} )^{δk}   (2)
68 / 80
Kaplan-Meier estimator properties
▶ Kaplan-Meier (KM) estimator is an appropriate estimator of
SY (·) under random right-censoring
▶ The KM estimator reduces to the empirical df if no observation is censored
▶ When the last observation is uncensored, the KM estimator is
a survivor function (it vanishes from the last observation to
the right)
▶ When the last observation is censored, the KM estimator does
not vanish, it is not a proper survival function
▶ A common convention is to say that it has a mass at infinity
70 / 80
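A minimal implementation of the estimator defined by the hazard formula λ̂(tk) and the product form (2); a sketch only, with a simulated exponential setting of my choice:

```python
import numpy as np

def kaplan_meier(T, delta):
    # KM estimate of S_Y at the distinct observed times t_k:
    #   S_hat(t_k) = prod_{t_j <= t_k} (1 - d_j / r_j),
    # with d_j = #failures at t_j and r_j = #at risk at t_j
    T = np.asarray(T, dtype=float)
    delta = np.asarray(delta, dtype=int)
    tk = np.unique(T)
    d = np.array([delta[T == t].sum() for t in tk])
    r = np.array([(T >= t).sum() for t in tk])
    return tk, np.cumprod(1.0 - d / r)

rng = np.random.default_rng(5)
Y = rng.exponential(1.0, 2000)
C = rng.exponential(2.0, 2000)
T, delta = np.minimum(Y, C), (Y <= C).astype(int)
tk, S = kaplan_meier(T, delta)    # should track S_Y(t) = exp(-t) despite the censoring
```

With delta identically one (no censoring) the function returns the empirical survival, as stated above.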
Exercise
25 This guarantees that no observed value Ti can be at the same time censored and uncensored.
26 This empirical estimator does not take into account the presence of censoring and thus estimates the survivor function of T.
71 / 80
The KM is a feasible IPCW
Back to IPCW...
Proposition 3 (*²⁷)
Assume that Y and C are continuous random variables, and the observations (Ti, δi) are random copies of T = Y ∧ C and δ = 1{Y ≤ C}. Then, the KM estimator ŜY of the survival function SY admits the IPCW representation
ŜY(y) = (1/n) Σ_{i=1}^{n} [δi / ŜC(Ti−)] 1{Ti > y},
where the factor δi/ŜC(Ti−) is the estimated weight ω̂(δi, Ti) and ŜC is the KM estimator of SC.
The IPCW in predictive models (1/5)
▶ The IPCW principle extends to predictive (regression) models
▶ The complete data are an IID sample from Y ∈ R+ , X ∈ Rp
▶ The variable Y is right-censored by C , so the IID observations are
(Ti , δi , Xi ) where Ti = Yi ∧ Ci , δi = 1{Yi ≤ Ci }
Proposition 4
For any ϕ(·, ·) such that ϕ(Y, X)/SC|X(Y− | X) is well defined²⁹ and E[ϕ(Y, X)] exists, if (3) holds, then
E[ δ/SC|X(T− | X) × ϕ(T, X) ] = E[ϕ(Y, X)],
the factor δ/SC|X(T− | X) being the weight ω(δ, T, X).
28 For example, ϕ(Y, X) = ℓ(Y, m(X)), with ℓ(·, ·) a loss function and m(·) in the predictive/regression model M
29 The convention 0/0 = 0 applies
75 / 80
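A sketch of IPCW mean regression based on Proposition 4, under the extra simplifying assumption C ⊥ (Y, X), so that SC|X = SC can be estimated by the KM of the censoring times (the linear model, the laws and the sample size are my illustrative choices):

```python
import numpy as np

def km_surv_left(T, events):
    # KM survival of the censoring time, returned as a left-limit function t -> S_hat(t-)
    tk = np.unique(T)
    d = np.array([events[T == t].sum() for t in tk])
    r = np.array([(T >= t).sum() for t in tk])
    S = np.cumprod(1.0 - d / r)
    def left(t):
        j = np.searchsorted(tk, t)
        return S[j - 1] if j > 0 else 1.0
    return left

rng = np.random.default_rng(7)
n = 5000
X = rng.uniform(0.0, 1.0, n)
Y = 1.0 + 2.0 * X + rng.normal(0.0, 0.25, n)     # true regression line: 1 + 2x
C = rng.exponential(6.0, n)                      # censoring independent of (Y, X)
T, delta = np.minimum(Y, C), (Y <= C).astype(float)

Sc_left = km_surv_left(T, 1.0 - delta)
w = np.array([d / Sc_left(t) if d else 0.0 for d, t in zip(delta, T)])

# weighted least squares: min_{a,b} sum_i w_i * (T_i - a - b * X_i)^2
# (on uncensored points T_i = Y_i, and w_i = 0 on censored ones)
A = np.column_stack([np.ones(n), X])
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * T, rcond=None)
```

The recovered intercept and slope should be close to (1, 2); dropping the weights (ordinary least squares on the uncensored points only) would in general give biased coefficients.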
The IPCW in predictive models (3/5)
▶ Let us consider the quadratic loss ℓ(y, c) = (y − c)², which is used for the mean regression, and assume a linear model
31 Recall: qτ(x) = inf{y : FY|X(y|x) ≥ τ}, and FY|X(y|x) = P(Y ≤ y | X = x).
77 / 80
The IPCW in predictive models (5/5)
▶ Predictive models where the prediction function (mean regression, quantile regression,...) is defined by the minimization of an expected loss can be applied with censored responses Y by simply applying the suitable weight!
▶ However, the weighting requires knowledge of the conditional survival function of the censoring variable, SC(· | x)
▶ this is a nuisance quantity, which is preferably left unrestricted (model-free)
Conclusions
▶ In many applications, the observation of a variable of interest
Y (time to some event) is right-censored, i.e., we only know
that its value is larger than what is observed (called T above)
▶ Usual estimators based on the censored observations are
biased (df, mean, variance, quantiles, MLE, LSE,...)
▶ The most important estimator in the presence of random
right-censoring is the Kaplan-Meier estimator of the survival
function of the variable of interest
▶ Estimators for predictive models (regressions) taking into
account the random right-censoring have been introduced
▶ All the estimators presented above can be interpreted as
IPCW versions of standard estimators
▶ The methods presented can be adapted to random left-censoring, but not necessarily to more complex censoring mechanisms (because the weighting function may not be explicit)
80 / 80