
Incomplete Data methods

Valentin Patilea

Ensai 2A, Oct-Dec 2024


This version: October 17, 2024
Méthodes statistiques pour des données incomplètes (Statistical methods for incomplete data)

Alternative title:
Missing or modified data methods
‘Big Data’ can diminish the variance1 ,
but cannot remove the bias2

1
Here, ‘Big Data’ is used in the sense of a huge number of observations of
some random vector of interest with fixed dimension
2
However, sometimes ‘Big Data’ may help to remove it
Agenda

Introduction
Organization
Overview
Notation and conventions
Another look at expectations
A few references

Recalling a few basic probability & statistics notions

Censored data

4 / 80
Module organization
▶ There are 5 lectures, 2 tutorial sessions (TD) and 1 lab session (TP)

▶ The provisional plan


▶ Lecture 1: Introduction, general presentation, refresh
proba/stat notions
▶ Lecture 2: Censoring
▶ Lecture 3, 4, 5: TBA

▶ Evaluation
▶ Quiz on November 19, at 17:30 (20-30 minutes)
▶ Final exam, no documents (formula sheet allowed)

▶ Contact:
[email protected]
▶ office hours: by appointment only, office 270, usually after
5:00pm

5 / 80
Materials and bibliography
▶ The course is designed to be self-contained
▶ BUT... prerequisites from 1A are required!
▶ Probability (expectation, conditional expectation, density...)
▶ Statistics (MLE, MoM, bias...)

▶ All the course materials (slides, supplementary materials, R


codes) are available on Moodle
▶ Studying alternative materials is always encouraged and can
be beneficial, preferably after reading the materials on
Moodle
▶ A list of references is proposed

▶ There are no previous exams; this is the first time the course
is given

6 / 80
Agenda

Introduction
Organization
Overview
Notation and conventions
Another look at expectations
A few references

Recalling a few basic probability & statistics notions

Censored data

7 / 80
Types of incomplete (not perfect) data to be considered

▶ Censoring

▶ Missing

▶ Biased

▶ Other: errors-in-variables... depending on the available time

Types of incomplete data not considered here

▶ Latent models... and many others

8 / 80
A general modeling paradigm

▶ Usually it is considered that there exists a complete sample of


data

▶ Due to some ‘incompleteness’ mechanism (missing, censoring


etc) the complete sample is not available for the statistical
analysis

▶ Only the ‘incomplete’ sample is available, and the statistical


analysis can only rely on that data!

9 / 80
Censoring for ‘time-to-event’ data
▶ Let Y ≥ 0 be a random variable of interest, typically a time to
some event
▶ time to ‘failure’ (medicine, engineering, economics, finance,
demography...)
▶ The statistical field studying time-to-event data is Survival
Analysis

▶ For some individuals, the value of Y is observed, for others we


only know that Y is
▶ larger3 than some observed value (right-censoring)
▶ smaller than some observed value (left-censoring)
▶ or between some observed values (interval-censoring)

▶ With this type of incomplete (modified) data, the sample size


is usually known4
3
The situation we mainly consider in the course
4
Survival analysis also studies the so-called truncated data where the total
number of individuals is unknown. This is beyond our scope.
10 / 80
Missing data

▶ Let X ∈ Rp be a random vector of interest, and let XO be a


sub-vector of dimension q with 1 ≤ q ≤ p
▶ For some individuals, only the realization of XO is available,
the other components of X are not observed and are usually
recorded as "NA"
▶ Missing data occur in practically any type of application

▶ All the statistical procedures can be affected by missing data


(estimation, inference, regression and predictive models,...)
▶ With missing data, the sample size is known.

11 / 80
Biased data
▶ Biased data can be generated by a selection bias, usually due
to the method for collecting the data5
▶ A well known bias is the so-called survivorship bias
▶ Schools may ‘forget’ the students who drop out during the
training to report better performance on the job market
▶ Abraham Wald during WW2 (see the aircraft picture)

▶ Another well known bias is the so-called length bias


▶ occurs when the sampling scheme is such that the probability
of observing a positive-valued random variable is proportional
to the value of the variable
▶ Oscar winners live a few years longer than unsuccessful
actors and actresses
▶ Bus waiting time paradox: the average rider at a bus stop
observes more delays than expected looking at a bus schedule
5
With biased data, there could be unobserved individuals, and their number
might not be known!
12 / 80
[Slide 13: the aircraft picture referred to above, illustrating survivorship bias]
13 / 80
Our mission

▶ Correctly identify the nature of incompleteness

▶ Design an appropriate remedy


▶ the remedy may not always be the same!
▶ the remedy may depend on the purpose of the analysis!!
▶ it may require additional conditions, such as identifiability
assumptions! Such conditions may be untestable!!

▶ Apply the remedy and provide unbiased statistical analysis


using the available data

My mission: help to improve your statistical thinking

14 / 80
Starting point in this course
▶ Understand how to compute unbiased (or asymptotically
unbiased) estimators for expectations
▶ Expectations are required everywhere
▶ Mean, variance, quantiles etc.
▶ Estimation and inference (distribution function, MLE, MoM,
regression/predictive models or algorithms...)

▶ In general, for any integrable function ϕ(·),

E[ϕ(incomplete data)] ≠ E[ϕ(complete data)],

where the right-hand side is the expectation of interest,

and therefore, in general,

Ê[ϕ(incomplete data)] (e.g., Ê[·] is the empirical mean)

will be a biased estimator of the expectation of interest (even
asymptotically!)
15 / 80
A remedy
▶ In the course, the focus will be on weighting methods
▶ Weighting is a general approach which applies with many
incompleteness mechanisms

▶ The steps
▶ Find an appropriate weighting for the expectation of interest.
Appropriate weighting means

E[weighting × ϕ(incomplete data)] = E[ϕ(complete data)],

where the right-hand side is the expectation of interest

▶ Build estimators

Ê[weighting × ϕ(incomplete data)]

▶ The weighting has to be a function of the available data!

▶ the weighting is usually an unknown function, which has to be


estimated
16 / 80
The framework considered in the course
▶ We will revisit statistical (data analysis) methods for marginal
and conditional distributions designed and learned6 for
complete data

▶ Assume the complete data are IID (independent, identically


distributed)
▶ although many ideas discussed below can be extended to
stationary time series, we here consider only the independent
case

▶ The incompleteness mechanism affects each individual


(observation unit, datum) in the same way, independently of
the other individuals

▶ Provide a few ideas for adapting the data analysis methods


when facing ‘incomplete’ data
6
... from other courses and/or your own study
17 / 80
Marginal and conditional distribution

▶ Marginal distribution: the statistical analysis can be


considered for random variables or random vectors
▶ study the mean, variance, quantiles, distribution function,
density etc

▶ Conditional distribution: study random variables or random


vectors given other random variables/vectors, called
covariates or predictors
▶ conditional mean (usually called regression), conditional
quantiles, conditional density etc – predictive models

▶ Our purpose: understand how to use weighting in both types


of situations

18 / 80
Summarizing the road map of the lectures (and the tutorials/lab)
▶ Type of statistical analysis
▶ focused on the marginal distribution
▶ mean, higher order moments, quantiles,...
▶ density
▶ focused on the conditional distribution (predictive analysis)
▶ mean regression, quantile regression,...

▶ Type of models and estimation/inference approaches (for


density and regressions)
▶ parametric
▶ for density: Gaussian, Poisson etc, and MLE,...
▶ for regressions: linear model and least squares,...
▶ model free (nonparametric)
▶ for density and mean regression: using basis of functions

▶ Type of ‘incompleteness’ mechanisms


▶ Censoring
▶ Missingness
▶ Biased
19 / 80
Agenda

Introduction
Organization
Overview
Notation and conventions
Another look at expectations
A few references

Recalling a few basic probability & statistics notions

Censored data

20 / 80
▶ By definition 0/0=0
▶ The random variables are denoted by capital letter (X , Y , Z ...), the
random vectors are column matrices denoted by bold capital letters
(X, Y, Z...)
▶ The components of a random vector X ∈ Rp are denoted X (j) ,
1≤j ≤p
▶ All the random elements are defined on (Ω, F, P)
▶ The stochastic independence between random variables or vectors is
denoted by ⊥ (we will write X ⊥ Y , X ⊥ {Y , Z }, X ⊥ Y,
X ⊥ {Y, Z}...)
▶ Here, a continuous random variable or vector admits a density with
respect to the Lebesgue measure
▶ For x, y ∈ Rp , x ≤ y denotes the componentwise (coordinatewise)
partial order between the vectors x and y, that is x (j) ≤ y (j) ,
∀1 ≤ j ≤ p
▶ The set C[0, 1] is the set of real-valued continuous functions defined
on [0, 1]
21 / 80
▶ For a random variable X (resp. vector X), the distribution function
(abbreviated df) is denoted by F or FX (resp. FX )
▶ The function Φ : R → (0, 1) is the standard normal
distribution function

▶ Given a sample X1 , . . . , Xn ∈ Rp , p ≥ 1, the empirical distribution


function (edf) is defined as
Fn(x) = FX,n(x) = (1/n) Σ_{i=1}^n 1{Xi ≤ x}, x ∈ Rp

▶ The function S = 1 − F is called the survival function.

▶ The survival function, and the df, are càdlàg
▶ The notation S(y−) stands for the limit from the left:

S(y−) = lim_{t↑y} S(t)

▶ If S is a continuous function, then S(y ) = S(y −), ∀y

22 / 80
▶ Distinguish between E(Y | X) and E(Y | X = x)
▶ E(Y | X) is a random variable, a measurable function of X
▶ E(Y | X = x) is a non random function defined on the support
of X

▶ When writing the equality of two conditional expectations, say,

E(Y1 | X) = E(Y2 | X),

it means that the two random variables are equal almost surely

▶ By definition

E(Y1 | X) = E(Y2 | X) iff E(Y1 | X = x) = E(Y2 | X = x), ∀x

▶ For any matrix A, A⊤ denotes its transpose

▶ If A, B are symmetric matrices, we say that A is smaller than B,
and write A ≪ B, if B − A is positive semi-definite. We also write
A ≫ B if A − B is positive semi-definite.
23 / 80
Agenda

Introduction
Organization
Overview
Notation and conventions
Another look at expectations
A few references

Recalling a few basic probability & statistics notions

Censored data

24 / 80
Statistical analysis
▶ Recall that a major purpose of the statistical analysis is to provide a
description of the data generating process (DGP)

▶ Marginal distribution point of view: the usual quantities of interest


▶ characteristics of the distribution (mean, higher order
moments, quantiles etc)
▶ the distribution itself, that can be determined by the df, the
density, the characteristic function etc

▶ Conditional distribution point of view: basically, the interest focuses


on the conditional version of the same quantities
▶ conditional mean, conditional variance,..., given the covariates7
value
▶ conditional distribution given the covariates value

▶ The notion of expectation plays a central role!


7
Covariates are sometimes called predictors, explanatory variables,...
25 / 80
Examples (1/9) – marginal distribution aspects
▶ For illustration, let Y ∈ R be the random variable under study
▶ Many of the characteristics of interest of the distribution of Y are
solution of a problem (involving an expectation) like
min_{c∈R} E[ℓ(Y, c)],

where ℓ(y , c) ≥ 0 is a loss function (risk, regret, contrast...)


▶ For instance
▶ ℓ(y , c) = (y − c)2 for the mean – quadratic loss
▶ ℓ(y , c) = |y − c| for the median – absolute deviation
▶ ℓ(y , c) = (y − c)[τ − 1{y − c < 0}] for the quantile of order
τ ∈ (0, 1) – check function, or pinball loss function in ML
▶ Using
▶ the (complete) data Yi , 1 ≤ i ≤ n,
▶ and a simple estimator of E[ℓ(Y , c)] (the sample mean),
a simple estimator of the characteristic associated with the loss ℓ is
arg min_{c∈R} (1/n) Σ_{i=1}^n ℓ(Yi, c)   [(1/n) Σ_{i=1}^n ℓ(Yi, c) is the so-called empirical risk]
26 / 80
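For illustration, a minimal R sketch (not taken from the Moodle codes; the exponential sample and the 0.75 quantile are arbitrary choices): the mean, the median and a quantile recovered numerically as minimizers of the empirical risk.

# Minimal R sketch (illustrative): characteristics of the distribution obtained
# as minimizers of the empirical risk, one loss function per characteristic.
set.seed(1)
y <- rexp(200, rate = 0.5)                       # toy complete sample
emp_risk <- function(cc, loss) mean(loss(y, cc)) # empirical risk at candidate cc

loss_mean   <- function(y, cc) (y - cc)^2                        # quadratic loss
loss_median <- function(y, cc) abs(y - cc)                       # absolute deviation
loss_q75    <- function(y, cc) (y - cc) * (0.75 - (y - cc < 0))  # check / pinball loss

optimize(emp_risk, range(y), loss = loss_mean)$minimum    # ~ mean(y)
optimize(emp_risk, range(y), loss = loss_median)$minimum  # ~ median(y)
optimize(emp_risk, range(y), loss = loss_q75)$minimum     # ~ quantile(y, 0.75)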
Examples (2/9) – marginal distribution aspects

▶ It is also common to assume that the DGP of Y belongs to some


family of probability distributions

P = {Pθ : θ ∈ Θ},

(usually called model) depending on parameter(s)


▶ Poisson, exponential, Gaussian...

▶ Then, the mission is to estimate θ, the model parameter(s)

▶ Two very common estimation approaches are


▶ maximum likelihood estimator (MLE)
▶ method of moments (MoM)

▶ The justification of both methods is based on expectations

27 / 80
Examples (3/9) – marginal distribution aspects
▶ Assume that Y admits a density fY and the model P is defined by
densities8 , that is P = {fθ : θ ∈ Θ}.
▶ Assume that the density of Y belongs9 to P and a unique θ0 ∈ Θ
exists such that fθ0 = fY
▶ Then

E[log( fθ(Y) / fY(Y) )] < 0, ∀θ ≠ θ0

▶ Consequently,

θ0 = arg max_{θ∈Θ} E[log fθ(Y)]

▶ With complete data, the MLE is based on the simplest unbiased
estimate of E[log fθ(Y)]:

θ̂ = arg max_{θ∈Θ} (1/n) Σ_{i=1}^n log fθ(Yi)
8
Similar arguments apply with discrete distributions
9
The arguments can be adapted to the case of misspecified models
28 / 80
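A minimal R sketch of the last display (illustrative; the exponential model and the use of optimize() are choices of the example, not of the slides): the MLE obtained by maximizing the sample analogue of E[log fθ(Y)].

# Minimal R sketch (illustrative): MLE of an exponential rate obtained by
# maximizing the empirical counterpart of E[log f_theta(Y)].
set.seed(1)
y <- rexp(500, rate = 2)
crit <- function(theta) mean(dexp(y, rate = theta, log = TRUE))  # (1/n) sum log f_theta(Y_i)
optimize(crit, interval = c(0.01, 20), maximum = TRUE)$maximum   # close to 1/mean(y), the MLE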
Examples (4/9) – marginal distribution aspects
▶ Consider the model free density estimation based on an
orthonormal series approximation

▶ Consider the cosine orthonormal basis of functions on [0, 1],



φ0(y) = 1 and φj(y) = √2 cos(jπy), y ∈ [0, 1], for j = 1, 2, ...

▶ Orthonormal means

∫_0^1 φj(y) φk(y) dy = 1{j = k}

▶ For any f ∈ C[0, 1], the Fourier basis expansion holds:

f(y) = Σ_{j≥0} θj φj(y), where θj = ∫_0^1 f(y) φj(y) dy

are the Fourier coefficients

29 / 80
Examples (5/9) – marginal distribution aspects
▶ Let Y ∈ [0, 1] be a random variable with a continuous density fY , to be
estimated without a specific model
▶ The Fourier coefficients become the expectations

θj = ∫_0^1 fY(y) φj(y) dy = E[φj(Y)], j = 0, 1, 2, ...

▶ We can thus define an approximation of fY as

fY,J(y) = Σ_{j=0}^J θj φj(y) = Σ_{j=0}^J E[φj(Y)] φj(y)

▶ Theorem:(*10) If fY ∈ C[0, 1], then11

sup_{y∈[0,1]} |fY,J(y) − fY(y)| → 0, as J → ∞.

10
The proof is omitted and not required
11
The result is not limited to densities.
30 / 80
Examples (6/9) – marginal distribution aspects
▶ Let Y1 , . . . , Yn be an IID sample from Y

▶ A natural model free estimator12 of the density is

f̂Y,J(y) = Σ_{j=0}^J θ̂j φj(y)

where

θ̂j = (1/n) Σ_{i=1}^n φj(Yi), 0 ≤ j ≤ J

▶ This simple estimator is not a density, but can easily be modified into a
density, e.g.,

f̃Y,J(y) = max{0, f̂Y,J(y) − c},

where the constant c ∈ R is such that ∫_0^1 f̃Y,J(y) dy = 1
12
In Statistics such estimators are also called nonparametric estimators. To
achieve consistency, J has to tend to infinity.
31 / 80
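A minimal R sketch of the estimator of slides 29-31 (illustrative; the Beta(2, 5) sample and the cut-off J = 8 are arbitrary choices):

# Minimal R sketch (illustrative): cosine-series (model free) density estimator.
set.seed(1)
y <- rbeta(1000, 2, 5)                      # toy sample on [0, 1]
J <- 8
phi <- function(j, u) if (j == 0) rep(1, length(u)) else sqrt(2) * cos(j * pi * u)
theta_hat <- sapply(0:J, function(j) mean(phi(j, y)))   # empirical Fourier coefficients
f_hat <- function(u) rowSums(sapply(0:J, function(j) theta_hat[j + 1] * phi(j, u)))
curve(f_hat(u), 0, 1, xname = "u", ylab = "density")            # series estimate
curve(dbeta(u, 2, 5), 0, 1, xname = "u", add = TRUE, lty = 2)   # true density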
Examples (7/9) – conditional distribution aspects
▶ Reconsider slide 26 when predictors X ∈ Rp are available

▶ Let Y ∈ R, and consider M a set of functions of the predictors


(e.g., the set of linear combinations β ⊤ X)

▶ Very often, the characteristic of interest of the conditional


distribution of Y given the predictors’ value minimizes an expected
loss, i.e., the characteristic is solution of

min_{m∈M} E[ℓ(Y , m(X))],   (1)

where ℓ(·, ·) ≥ 0 is a loss function

▶ With (complete) data (Yi , Xi ), 1 ≤ i ≤ n, using the simplest
unbiased estimator of the expectation E[ℓ(Y , m(X))], the estimated function is

arg min_{m∈M} (1/n) Σ_{i=1}^n ℓ(Yi , m(Xi ))   (the argmin of the empirical risk)

32 / 80
Examples (8/9) – conditional distribution aspects

▶ When ℓ(y , c) = (y − c)2 , the solution in (1) is the conditional


expectation, i.e.,
m(x) = E(Y | X = x).
The set M defines the regression model, the set where the ‘true’
(mean) regression is supposed to be
▶ Linear model, GLM,...

▶ When ℓ(y , c) = (y − c)[τ − 1{y − c < 0}], the solution is the τ −th
order conditional quantile function qτ (x) given X = x
▶ the median regression is obtained with τ = 1/2
▶ the most common is the linear quantile regression

▶ Definition: for τ ∈ (0, 1), the τ −th order conditional quantile


function qτ (x) given X = x is

qτ (x) = inf{y : FY |X (y | x) ≥ τ }, with FY |X (y | x) = P(Y ≤ y | X = x)

33 / 80
Examples (9/9) – conditional distribution aspects
▶ Using the cosine basis decomposition (slides 29, 30), consider a
model free mean regression estimation with one predictor X ∈ [0, 1]
admitting a density fX :
Y = m(X ) + ε, E(ε | X ) = 0, Var(ε | X ) ≤ C

▶ If m(x) = E(Y | X = x), x ∈ [0, 1], is continuous, then

m(x) = Σ_{j≥0} γj φj(x) = Σ_{j≥0} E[Y φj(X)/fX(X)] φj(x)

▶ Consider the approximation of m(x) defined for some J ≥ 1

mJ(x) = Σ_{j=0}^J γj φj(x), with γj = E[Y φj(X)/fX(X)]

▶ Given fX (·), a model free estimation of m(·) is easily obtained with


the γj estimated by the empirical means13
13
In general fX (·) needs to be estimated, and this can be done using the
same basis decomposition approach, see the slides 29, 30.
34 / 80
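A minimal R sketch of this construction (illustrative), in the simplest case X ~ Uniform(0, 1), so that fX ≡ 1 and the γj are plain empirical means; the regression function and the cut-off J are arbitrary choices.

# Minimal R sketch (illustrative): model free series regression, slide 34,
# with X ~ Uniform(0, 1) so that f_X = 1 and gamma_j = E[Y phi_j(X)].
set.seed(1)
n <- 1000
x_obs <- runif(n)
y_obs <- sin(2 * pi * x_obs) + rnorm(n, sd = 0.3)   # toy m(x) = sin(2*pi*x)
J <- 6
phi <- function(j, u) if (j == 0) rep(1, length(u)) else sqrt(2) * cos(j * pi * u)
gamma_hat <- sapply(0:J, function(j) mean(y_obs * phi(j, x_obs)))  # empirical means
m_hat <- function(u) rowSums(sapply(0:J, function(j) gamma_hat[j + 1] * phi(j, u)))
curve(m_hat(u), 0, 1, xname = "u", ylab = "m(u)")                  # series estimate
curve(sin(2 * pi * u), 0, 1, xname = "u", add = TRUE, lty = 2)     # true regression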
Expectation: the backbone of the statistical analysis

▶ Having unbiased (or at least asymptotically unbiased)


estimators of an expectation is very important!

▶ Most of the statistical and machine learning methods require


(simple) unbiased estimators for expectations

▶ With censored, missing or biased data, the simple empirical mean is


no longer an unbiased estimator of the expectation, not even
asymptotically!

▶ We need to understand
▶ how to build (asymptotically) unbiased estimators for the
expectations with ‘incomplete’ data... if possible!
▶ how to modify the usual approaches for statistical analysis to
account for the ’data incompleteness’

35 / 80
Illustrations of the bias induced by the incompleteness

See the R codes on Moodle

36 / 80
Agenda

Introduction
Organization
Overview
Notation and conventions
Another look at expectations
A few references

Recalling a few basic probability & statistics notions

Censored data

37 / 80
Censoring

▶ Klein, J.P., van Houwelingen, H.C., Ibrahim, J.G. & Scheike, T.H.
(2014). Handbook of Survival Analysis, CRC Press.

▶ Kleinbaum, D.G. & Klein, J.P. (2005). Survival Analysis. A


Self-Learning Text, Springer.

▶ Lecture Notes in French available online


▶ Saint Pierre, P. (2021). Intro à l’analyse des durées de survie.
https://perso.math.univ-toulouse.fr/psaintpi/files/2021/04/Cours_Survie_1.pdf

38 / 80
Missingness

▶ Molenberghs, G., Fitzmaurice, G., Kenward, M. G., Tsiatis, A.&


Verbeke, G. (eds.) (2014). Handbook of missing data methodology.
CRC Press.

▶ Little, R.J. (2021). Missing data assumptions. Annu. Rev. Stat.


Appl., vol. 8, 89–107.
(https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-040720-031104)

▶ Little, R.J., Rubin D.B. (2019). Statistical Analysis with Missing


Data. New York: Wiley. 3rd ed.

▶ van Buuren, S. (2012). Flexible Imputation of Missing Data.


Chapman & Hall/CRC Press.
(https://stefvanbuuren.name/fimd/)

39 / 80
Biased data

▶ Efromovich, S. (2018). Missing and Modified Data in


Nonparametric Estimation: With R Examples. Chapman &
Hall/CRC, CRC Press.

▶ Gill, R.D., Vardi, Y. & Wellner, J.A. (1988). Large Sample Theory
of Empirical Distributions in Biased Sampling Models. Ann.
Statist., vol. 16(3), 1069–1112.
https://doi.org/10.1214/aos/1176350948

40 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions


Distribution function, density, independence
Conditional expectation
Conditional distribution, conditional independence
Common nonparametric (model free) estimator

Censored data

41 / 80
▶ Refresh the notions from the 1A courses
▶ See also the textbook14 Foata & Fuchs (2012). Calcul des
probabilités (3ème édition), Dunod.

▶ Distribution function (df)

▶ Density and continuous variables and vectors

▶ The joint density of a continuous vector and a binary variable

▶ Stochastic independence (denoted by the symbol ⊥)

14
A PDF copy of the second edition of this book can be downloaded here
https://mathksar.weebly.com/uploads/1/4/4/0/14403348/calcul_des_probabilits_dunod_second_edition.pdf
42 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions


Distribution function, density, independence
Conditional expectation
Conditional distribution, conditional independence
Common nonparametric (model free) estimator

Censored data

43 / 80
▶ See, e.g., Foata & Fuchs (2012), Ch. 12.

▶ Conditional expectation for discrete random vectors

▶ Conditional expectation for continuous random vectors


▶ the case of a conditioning including also a binary variable

▶ Properties of the conditional expectation

44 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions


Distribution function, density, independence
Conditional expectation
Conditional distribution, conditional independence
Common nonparametric (model free) estimator

Censored data

45 / 80
▶ Conditional distribution for discrete random vectors

▶ Conditional distribution for continuous random vectors


▶ the case of a conditioning including also a binary variable

▶ Conditional independence

▶ Characterization and properties of the conditional independence

Exercise: Show that

X ⊥ {Y, Z} | W ⇐⇒ X ⊥ Y | {Z, W} and X ⊥ Z | W

Exercise: Show that

X ⊥ Y | {Z, W} =⇒ X⊥Y|W

46 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions


Distribution function, density, independence
Conditional expectation
Conditional distribution, conditional independence
Common nonparametric (model free) estimator

Censored data

47 / 80
▶ Parametric models have some appealing features, but can be
wrong (misspecified)!
▶ Nonparametric (model free) approaches are sometimes
preferable, as they provide protection against misspecification
▶ There are several very common, and very simple,
nonparametric estimators in the case without covariates
▶ Empirical distribution function (df)
▶ Empirical survival function
▶ Empirical mean, empirical variance,... More generally,
empirical moments, quantiles,...
▶ all of them nothing but functionals of the empirical df

▶ ...

48 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions

Censored data
Types of censoring
IPCW and the Kaplan-Meier estimator
Regression models under right-censoring
Take away

49 / 80
Random right-censoring
▶ Let Y ∈ R+ be a random variable15 of interest

▶ Let Yi , 1 ≤ i ≤ n, be an IID sample of Y

▶ The most common censoring scheme is the so-called random


right-censoring16

▶ There exists a random variable C (the right censoring)

▶ Let Ci , 1 ≤ i ≤ n, be a random sample of C

▶ The data consist of the pairs of random variables (T , δ)


where17

Ti = Yi ∧ Ci and δi = 1{Yi ≤ Ci }, 1≤i ≤n


15
Sometimes called lifetime, duration, time-to-event.
16
If no possible confusion, we simply call it right-censoring or censoring.
17
Herein, a ∧ b = min(a, b).
50 / 80
Other random censoring
▶ Left-censoring: Y is left-censored when it is only known to be
smaller than some observed value :
▶ there exists a random variable L (the left censoring) and the
data consist of the pairs (Ti , δi ) where18

Ti = Yi ∨ Li and δi = 1{Yi ≥ Li }, 1≤i ≤n

▶ Left and right-censoring: Y is left and right-censored if variables L


and R exist, such that P(L ≤ R) = 1, and the data consist of the
pairs (Ti , ∆i ) where

Ti = max(min(Yi , Ri ), Li ) = min(max(Yi , Li ), Ri ), 1≤i ≤n

and ∆i indicates which of the situations

Ti = Yi , Ti = Li or Ti = Ri ,

is observed.
18
Herein, a ∨ b = max(a, b).
51 / 80
Ignoring the censoring induces biases (1/3)
▶ Consider the case of a random right-censored Y
▶ The purpose is to estimate θ = E(Y ) under the assumption19

Y ⊥C

▶ Bad idea 1: consider

θ̂1 = (1/n) Σ_{i=1}^n Ti

Then, since by the independence assumption ST = SY SC ,

E[θ̂1] = E[T] = ∫_0^∞ SY(t) SC(t) dt ≤ ∫_0^∞ SY(t) dt = E[Y]

and in general the inequality is strict


19
This is an untestable assumption!!
52 / 80
Ignoring the censoring induces biases (2/3)
▶ Bad idea 2: consider

θ̂2 = (1/n) Σ_{i=1}^n δi Ti

▶ Then, since Y ⊥ C and δ = 1{Y ≤ C },

E(δ | Y ) = SC (Y −) and δT = δY ,

we get

E[θ̂2] = E[δT] = E[δY] = E[E(δ | Y) Y] = E[SC(Y−) Y] ≤ E[Y],

and in general the inequality is strict


53 / 80
Ignoring the censoring induces biases (3/3)
▶ Bad idea 2bis: consider alternatively

θ̂2bis = [Σ_{i=1}^n δi]⁻¹ Σ_{i=1}^n δi Ti

▶ By similar calculations using Y ⊥ C and δ = 1{Y ≤ C }, and
technical conditions (including P(δ = 1) > 0), we get20

θ̂2bis = (E[δT] + oP(1)) / (E(δ) + oP(1)) = (E[δY] + oP(1)) / (E(δ) + oP(1)) → E[δY]/E(δ) in probability,

and E[δY]/E(δ) = E[E(δ | Y) Y]/E(δ) = E[SC(Y−) Y]/E[SC(Y−)] ≠ E[Y]

▶ This idea corresponds to replacing the censored observations by the


average of the uncensored ones, and is still a bad idea!
20
Below, oP (1) stands for a quantity tending to zero in probability, as the sample size increases to infinity. The
oP (1) terms in the display are guaranteed by the Law of Large Numbers
54 / 80
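A minimal simulation sketch, in the spirit of the R codes on Moodle (the exponential distributions are arbitrary illustrative choices), showing that the three 'bad ideas' miss E[Y] = 1:

# Minimal R sketch (illustrative): bias of the three 'bad ideas' under random
# right-censoring, with Y ~ Exp(1) so that E[Y] = 1.
set.seed(1)
n  <- 1e5
y  <- rexp(n, rate = 1)          # variable of interest
cc <- rexp(n, rate = 0.5)        # independent censoring variable
t  <- pmin(y, cc)                # observed T = Y ^ C
d  <- as.numeric(y <= cc)        # censoring indicator delta
mean(t)              # bad idea 1: clearly below 1
mean(d * t)          # bad idea 2: even smaller
sum(d * t) / sum(d)  # bad idea 2bis: still biased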
A good idea
▶ Find a weighting ω(T , δ), function of the observations (T , δ)
such that
E [ω(T , δ)T ] = E [Y ] ,
and then use the empirical mean of ω(T , δ)T
▶ The choices

ω(T , δ) = 1 and ω(T , δ) = δ,

do not work... see the previous slide

▶ Any suggestion for the weighting?


▶ Is this suggestion feasible?
▶ Would it automatically work for other types of censoring?

55 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions

Censored data
Types of censoring
IPCW and the Kaplan-Meier estimator
Regression models under right-censoring
Take away

56 / 80
Proposition 1
Let Y , C ∈ R+ and assume Y ⊥ C ;
Let T = Y ∧ C and δ = 1{Y ≤ C }. Then, for any integrable function
ϕ(Y ), such that21 ϕ(·) vanishes outside the support of T ,

E[ω(δ, T )ϕ(T )] = E[ϕ(Y )], with ω(δ, T ) = δ / SC(T −).

Corollary 1
Consider the IID sample (Ti , δi ) of (T , δ). Under the conditions of
Proposition 1, the functions

F̃Y(y) = (1/n) Σ_{i=1}^n ω(δi, Ti) 1{Ti ≤ y} and S̃Y(y) = 1 − F̃Y(y), y ≥ 0,

with ω(δi, Ti) = δi / SC(Ti−),

are unbiased estimators of FY (y ) and SY (y ) = 1 − FY (y ), respectively.

21
Alternatively, impose sup{y : SC (y −) > 0} ≥ sup{y : SY (y −) > 0}.
57 / 80
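A minimal numerical illustration of Proposition 1 and Corollary 1 when SC is known (here C ~ Exp(0.5), so SC(t) = exp(−0.5 t); the distributions are arbitrary choices):

# Minimal R sketch (illustrative): the infeasible IPCW estimator with the true
# S_C plugged in; Y ~ Exp(1), C ~ Exp(0.5) independent of Y.
set.seed(2)
n  <- 1e5
y  <- rexp(n, 1); cc <- rexp(n, 0.5)
t  <- pmin(y, cc); d <- as.numeric(y <= cc)
w  <- d / exp(-0.5 * t)          # omega(delta, T) = delta / S_C(T-)  (C continuous)
mean(w * t)                      # ~ E[Y] = 1
mean(w * (t <= 1))               # ~ F_Y(1) = 1 - exp(-1)

Replacing the true SC by an estimator leads to the feasible IPCW discussed on the following slides.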
Expectation and empirical mean as integrals
▶ Using the integral notation, we write

E[ϕ(Y )] = ∫ ϕ(y) dFY(y)

▶ For unifying the notation, we write an empirical mean as

(1/n) Σ_{i=1}^n ϕ(Yi) = ∫ ϕ(y) dFY,n(y)

with dFY,n the measure given by the empirical df

FY,n(y) = (1/n) Σ_{i=1}^n 1{Yi ≤ y}, that is dFY,n(y) = (1/n) Σ_{i=1}^n dδ_Yi(y),

where δ_Yi is the Dirac measure at Yi .
▶ Now, we can write

(1/n) Σ_{i=1}^n [δi / SC(Ti−)] ϕ(Ti) = ∫ ϕ(y) dF̃Y(y)

58 / 80
IPCW and unbiased expectation estimators
▶ If SC is given, the expectation estimator obtained by IPCW
(Inverse Probability of Censoring Weighting) is unbiased and
consistent, under random right-censoring!
Corollary 2
Consider the IID sample (Ti , δi ) of (T , δ). Under the conditions of
Proposition 1:
▶ E[ ∫ ϕ(y) dF̃Y(y) ] = ∫ ϕ(y) dFY(y);

▶ if sup{y : SC (y −) > 0} > sup{y : SY (y −) > 0}, then

∫ ϕ(y) dF̃Y(y) → ∫ ϕ(y) dFY(y), in probability.

Remark: the estimator F̃Y(y) and the corresponding integrals


are infeasible in most situations, as they require the knowledge
of the survival function of the censoring variable, that is SC .
59 / 80
Feasible IPCW
▶ A feasible IPCW empirical mean would be

(1/n) Σ_{i=1}^n [δi / ŜC(Ti−)] ϕ(Ti), with weights ω̂(δi, Ti) = δi / ŜC(Ti−),

for some estimator ŜC(·) of the survival function SC (·) of C

▶ In general, the censoring variable is of secondary interest, and SC (·)
is a nuisance quantity
▶ The distribution function and the empirical mean do not
require any model
▶ Thus, one should avoid introducing modeling assumptions on C ,
like assuming SC (·) belongs to a parametric family
▶ Would it be possible to consider a model free estimator of SC (·)?

▶ It is worth noting that C is left-censored by Y !


60 / 80
A glimpse at a classical approach
▶ Let us forget the IPCW for the next few slides...

▶ First, note that

SY(y) = P(Y > y) = ∫ 1{t > y} dFY(t), ∀y,

that is, the survival function is an integral as above for a particular
ϕ(·)

▶ We now present the usual definition of the most common
nonparametric (model free) estimator of the survival function SY
▶ it is the counterpart of the empirical survival function to be
applied under random right-censoring
▶ when no observation is right-censored, the estimator below
coincides with the empirical survival function!

▶ The nonparametric (model free) estimator of the survival function SY presented
below (slide 66) is the so-called Kaplan-Meier (KM) estimator
61 / 80
Empirical distribution and the hazard function (1/4)
▶ Let Y ∈ {y1 , y2 , . . .} be a discrete random variable with
0 ≤ y1 < y2 < · · ·
▶ Let pk = P(Y = yk ) > 0, k ≥ 1

▶ The hazard function (also called hazard rate, or failure rate) is


defined as
λY(yk) = P(Y = yk) / P(Y ≥ yk) = pk / Σ_{j≥k} pj = pk / SY(y_{k−1}) = pk / SY(yk−), k ≥ 1

▶ We can also write the hazard function as

λY (yk ) = P(Y = yk | Y ≥ yk )

▶ The hazard rate is thus the probability that the event occurs
at time yk given that it did not occur previously
62 / 80
Empirical distribution and the hazard function (2/4)
▶ Exercise: show that for each k ≥ 1,

λY(yk) = 1 − SY(yk)/SY(y_{k−1}) = 1 − SY(yk)/SY(yk−),

and

SY(yk) = Π_{j=1}^k {1 − λY(yj)}

(by definition SY (y1 −) = 1).

Proposition 2
The hazard function characterizes the distribution of Y .

▶ Exercise: Prove the Proposition.

63 / 80
Empirical distribution and the hazard function (3/4)

▶ Since22

λY(yk) = pk / SY(yk−)   (by definition SY(y1−) = 1)

and

SY(yk) = Π_{j=1}^k {1 − λY(yj)},

the hazard function can be estimated easily using the


observed frequencies
▶ We use this idea to rewrite the empirical distribution function
when the observations are complete (there is no censoring)

22
Recall, the support points {y1 , y2 , . . . , } are ordered!
64 / 80
Empirical distribution and the hazard function (4/4)
▶ Define

λY,n(yk) = Σ_{j=1}^n 1{Yj = yk} / Σ_{j=1}^n 1{Yj ≥ yk}

▶ Then, for any y , the empirical survival function can be written as

SY,n(y) = 1 − FY,n(y) = Π_{k: yk ≤ y} {1 − λY,n(yk)} = Π_{k: yk ≤ y} { 1 − Σ_{j=1}^n 1{Yj = yk} / Σ_{j=1}^n 1{Yj ≥ yk} }

(By convention, the product with an empty set of indices is equal to 1.)
▶ In the case with no ties in the sample of size n, one can write

SY,n(y) = 1 − FY,n(y) = Π_{i: Y(i) ≤ y} {1 − 1/(n − i + 1)} = Π_{i: Y(i) ≤ y} (n − i)/(n − i + 1)

where Y(1) , Y(2) , . . . , Y(n) is the order statistics, that is the values of
the sample (increasingly) ordered
65 / 80
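A quick numerical check (illustrative) that, with complete and possibly tied data, the product of the (1 − empirical hazards) reproduces the empirical survival function:

# Minimal R check (illustrative): the product formula above coincides with the
# empirical survival function 1 - F_{Y,n} for complete data.
set.seed(1)
y  <- rpois(50, 3)                                        # discrete sample, ties allowed
yk <- sort(unique(y))                                     # ordered support points
lam_n  <- sapply(yk, function(u) sum(y == u) / sum(y >= u))
S_prod <- cumprod(1 - lam_n)                              # product of (1 - empirical hazards)
S_edf  <- 1 - ecdf(y)(yk)                                 # empirical survival at the yk
all.equal(S_prod, S_edf)                                  # TRUE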
Kaplan-Meier estimator of the survival function SY
▶ The sample is (Ti , δi ), 1 ≤ i ≤ n
▶ recall: Ti = Yi ∧ Ci and δi = 1{Yi ≤ Ci }
▶ Let t1 , t2 , . . . be the ordered distinct values in the sample
T1 , . . . , Tn . Ties (ex aequo) are allowed
▶ Let

λ̂(tk) = Σ_{j=1}^n δj 1{Tj = tk} / Σ_{j=1}^n 1{Tj ≥ tk} = ♯{failures at time tk} / ♯{individuals ‘at risk’ at time tk}, k = 1, 2, . . .

▶ Then, ∀y , the estimator of SY (y ), the survival function at y , is

Ŝ(y) = 1 − F̂(y) = Π_{k: tk ≤ y} {1 − λ̂(tk)} = Π_{k: tk ≤ y} { 1 − Σ_{j=1}^n δj 1{Tj = tk} / Σ_{j=1}^n 1{Tj ≥ tk} }

The function y ↦ Ŝ(y) is the Kaplan-Meier estimator23 of SY(·)
23
For simplicity, when no danger of confusion, we drop the subscript Y and
write Ŝ(y) and λ̂(tk) instead of ŜY(y) and λ̂Y(tk), respectively.
66 / 80
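A minimal R sketch (illustrative) computing the KM estimator from the definition above and comparing it with survival::survfit, assuming the 'survival' package is available; the simulated data are arbitrary choices.

# Minimal R sketch (illustrative): Kaplan-Meier from its definition, compared
# with survival::survfit on simulated right-censored data.
set.seed(1)
n  <- 200
y  <- rexp(n, 1); cc <- rexp(n, 0.5)
t  <- pmin(y, cc); d <- as.numeric(y <= cc)
tk <- sort(unique(t))
lam_hat <- sapply(tk, function(u) sum(d * (t == u)) / sum(t >= u))  # hat lambda(t_k)
S_km    <- cumprod(1 - lam_hat)                                     # hat S_Y at each t_k

library(survival)                                  # assumes the package is installed
fit <- survfit(Surv(t, d) ~ 1)
all.equal(S_km, summary(fit, times = tk)$surv)     # the two computations agree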
Some insight into the definition of the KM
▶ Let us try to understand the definition of λ̂(tk)

▶ Using the law of large numbers, for any tk , we can write24

λ̂(tk) = [n⁻¹ Σ_{j=1}^n δj 1{Tj = tk}] / [n⁻¹ Σ_{j=1}^n 1{Tj ≥ tk}] ≈ E(δ 1{T = tk}) / E(1{T ≥ tk})
      = E(δ 1{Y = tk}) / [SY(tk−) SC(tk−)]
      = E[E(δ | Y) 1{Y = tk}] / [SY(tk−) SC(tk−)]
      = E[SC(Y−) 1{Y = tk}] / [SY(tk−) SC(tk−)]
      = P(Y = tk) / P(Y ≥ tk) =: λY(tk)

▶ We used the independence between C and Y and the following


identity: for any real-valued function ϕ(·),

δϕ(T ) = δϕ(Y )
24
Here, the sign ≈ in the display means that the difference between left and
right side tends to zero in probability. See also slide 54.
67 / 80
▶ The Kaplan-Meier estimator has jumps (puts positive mass)
only at the non-censored observations
▶ When there are no ties, the Kaplan-Meier estimator could be
rewritten, ∀y
Ŝ(y) = 1 − F̂(y) = Π_{k: Tk ≤ y} { 1 − 1 / Σ_{j=1}^n 1{Tj ≥ Tk} }^{δk}   (2)

▶ If, in addition, the sample (Ti , δi ), 1 ≤ i ≤ n, is given such
that the Ti are ordered increasingly, then the Kaplan-Meier
estimator could be rewritten, ∀y

Ŝ(y) = 1 − F̂(y) = Π_{k: Tk ≤ y} { 1 − 1/(n − k + 1) }^{δk}

68 / 80
Kaplan-Meier estimator properties
▶ Kaplan-Meier (KM) estimator is an appropriate estimator of
SY (·) under random right-censoring
▶ The KM estimator becomes the empirical df if no observation
is censored
▶ When the last observation is uncensored, the KM estimator is
a survivor function (it vanishes from the last observation to
the right)
▶ When the last observation is censored, the KM estimator does
not vanish, it is not a proper survival function
▶ A common convention is to say that it has a mass at infinity

▶ One could use the KM estimator for estimating the survival


function of the censoring times Ci
▶ In the case where P(Y = C ) = 0, one has only to replace δk
by 1 − δk in formula (2) above
▶ The case with P(Y = C ) > 0 may be slightly more elaborate
69 / 80
Kaplan-Meier estimator asymptotic properties

▶ Estimating SC in the weighting introduces bias for the KM

▶ However, the KM estimator has practically the same
asymptotic properties as the empirical survival function (esf)

▶ KM is √n-uniformly consistent

▶ √n-asymptotically normal, with a larger variance than the esf
▶ allows the construction of confidence intervals and confidence bands

▶ See the Supplementary Materials on Moodle

70 / 80
Exercise

▶ Assume25 that P(Y = C ) = 0

▶ Consider a sample (Ti , δi ), 1 ≤ i ≤ n

▶ Let ŜY and ŜC be the KM estimators of the survival functions
SY and SC , respectively
▶ Show that

ŜY(y) ŜC(y) = ŜT(y), ∀y,

where ŜT is the empirical survival function of the sample26
Ti , 1 ≤ i ≤ n

25
This guarantees that no observed value Ti can be at the same time
censored and uncensored.
26
This empirical estimator does not take into account the presence of
censoring and thus estimates the survival function of T .
71 / 80
The KM is a feasible IPCW
Back to IPCW...

Proposition 3 (*27 )
Assume that Y and C are continuous random variables, and the
observations (Ti , δi ) are random copies of T = Y ∧ C and
δ = 1{Y ≤ C }. Then, the KM estimator ŜY of the survival function SY
admits the IPCW representation

ŜY(y) = (1/n) Σ_{i=1}^n ω̂(δi, Ti) 1{Ti > y}, with ω̂(δi, Ti) = δi / ŜC(Ti−),

where ŜC is the KM estimator of SC .

Another IPCW type estimator of SY can be derived using a different estimator


of SC (based on the exponential of the cumulative hazard) – to be discussed
during the lectures!!
27
The proof is omitted and not required.
72 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions

Censored data
Types of censoring
IPCW and the Kaplan-Meier estimator
Regression models under right-censoring
Take away

73 / 80
The IPCW in predictive models (1/5)
▶ The IPCW principle extends to predictive (regression) models
▶ The complete data are an IID sample from Y ∈ R+ , X ∈ Rp
▶ The variable Y is right-censored by C , so the IID observations are
(Ti , δi , Xi ) where Ti = Yi ∧ Ci , δi = 1{Yi ≤ Ci }

▶ Identification assumption: we have to impose the untestable


condition
Y ⊥ C | X   (3)

▶ On slides 32, 33, the regression/predictive models were
introduced in a unified way, using the expected loss function:
▶ in general, the regression/predictive function is the solution of

min_{m∈M} E[ℓ(Y , m(X))],   (4)

where M is the model (i.e., the set of all possible regression/


predictive functions)
▶ How to build (asymptotically) unbiased estimators for E[ℓ(Y , m(X))]
using the (Ti , δi , Xi )'s?
74 / 80
The IPCW in predictive models (2/5)
▶ Idea: find a suitable weighting function ω(δ, T , X) such that,
for a function of interest28 ϕ(T , X), it holds
E [ω(δ, T , X)ϕ(T , X)] = E[ϕ(Y , X)]
▶ Let SC |X (y | x) denote the conditional survival function of C
given X = x

Proposition 4
For any ϕ(·, ·) such that ϕ(Y , X)/SC |X (Y − | X) is well defined29
and E[ϕ(Y , X)] exists, if (3) holds, then
" #
δ
E ϕ(T , X) = E[ϕ(Y , X)]
SC |X (T − | X)
| {z }
ω(δ,T ,X)

28
For example, ϕ(Y , X) = ℓ(Y , m(X)), with ℓ(·, ·) a loss function and m(·)
in the predictive/regression model M
29
The convention 0/0 = 0 applies
75 / 80
The IPCW in predictive models (3/5)
▶ Let us consider the quadratic loss ℓ(y , c), which is used for the
mean regression, and assume a linear model

Y = X⊤ β + ε, E(ε | X) = 0, Var(ε | X) = σ 2 (5)

▶ Using an IID sample of T , δ and X, consider the IPCW version of


the least squares estimator (LSE) for β
β̂ = β̂_IPCW-LSE = arg min_β (1/n) Σ_{i=1}^n ω(δi, Ti, Xi) {Ti − Xi⊤ β}², with ω(δi, Ti, Xi) = δi / SC(Ti− | Xi)

▶ Exercise: Derive30 the expression of β̂_IPCW-LSE and propose an
estimator of the variance σ².
30
Remark: the theoretical study of β̂_IPCW-LSE is more involved and beyond
our scope. It can be shown that, under suitable conditions, β̂_IPCW-LSE is
consistent and √n-asymptotically normal. When SC (· | x) is not known and
has to be estimated, the study becomes even more complex.
76 / 80
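A minimal R sketch (illustrative) of the IPCW-LSE under the simplifying assumption that C is independent of (Y, X) with SC known (here C ~ Exp(0.2)); the data-generating choices are arbitrary.

# Minimal R sketch (illustrative): IPCW least squares with a known S_C,
# C independent of (Y, X); compare with the naive LSE on the censored data.
set.seed(1)
n  <- 2000
x1 <- runif(n)
y  <- 1 + 2 * x1 + runif(n, -0.5, 0.5)     # positive 'complete' response, E(eps | X) = 0
cc <- rexp(n, rate = 0.2)                  # censoring variable
t  <- pmin(y, cc); d <- as.numeric(y <= cc)
w  <- d / exp(-0.2 * t)                    # omega = delta / S_C(T-), S_C known here
coef(lm(t ~ x1, weights = w))              # IPCW-LSE: close to (1, 2)
coef(lm(t ~ x1))                           # naive LSE on (T, X): biased

The weights argument of lm() minimizes exactly the weighted sum of squares defining the IPCW-LSE above.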
The IPCW in predictive models (4/5)
▶ Exercise: Consider the model (5) with randomly right censored
response Y , under the condition (3). Using an IID sample of T , δ
and X, consider the usual LSE but with the response Y replaced by
δT /SC (T − | X) (assuming SC (· | x) given).
▶ Write the expression of this LSE and calculate its bias
▶ Calculate the variance of this LSE and compare it with the
variance of the infeasible standard LSE using the responses Y

▶ Exercise: Consider a linear quantile regression model with randomly


right-censored response Y , covariate vector X, that is31

qτ (x) = x⊤ β τ , τ ∈ (0, 1).

Assume the condition (3) holds true. Using an IID sample of T , δ


and X, propose an estimator for β τ , for a given τ (and assuming
SC (· | x) given). Use (4) and the appropriate loss ℓ(·, ·).

31
Recall : qτ (x) = inf{y : FY |X (y |x) ≥ τ }, and FY |X (y |x) = P(Y ≤ y |X = x).
77 / 80
The IPCW in predictive models (5/5)
▶ Predictive models where the prediction function (mean regression,
quantile regression,...) is defined by the minimization of an
expected loss, can be applied with censored responses Y by simply
applying the suitable weight!
▶ However, the weighting requires the knowledge of the conditional
survival function of the censoring variable SC (· | x)
▶ this is a nuisance quantity, preferably left without restrictions

▶ Three main solutions have been considered so far :


▶ Assume that C ⊥ {Y , X}, which is more restrictive than (3)
▶ then, SC (· | x) = SC (·) (the conditional and the marginal
survivors of C coincide), and SC (·) is estimated by KM
▶ Estimate SC (· | x) using a convenient predictive model
▶ typically, Cox’s Proportional Hazard model
▶ Estimate SC (· | x) nonparametrically, by a conditional version
of the KM estimator
The first solution is the simplest and the most used
78 / 80
Agenda

Introduction

Recalling a few basic probability & statistics notions

Censored data
Types of censoring
IPCW and the Kaplan-Meier estimator
Regression models under right-censoring
Take away

79 / 80
Conclusions
▶ In many applications, the observation of a variable of interest
Y (time to some event) is right-censored, i.e., we only know
that its value is larger than what is observed (called T above)
▶ Usual estimators based on the censored observations are
biased (df, mean, variance, quantiles, MLE, LSE,...)
▶ The most important estimator in the presence of random
right-censoring is the Kaplan-Meier estimator of the survival
function of the variable of interest
▶ Estimators for predictive models (regressions) taking into
account the random right-censoring have been introduced
▶ All the estimators presented above can be interpreted as
IPCW versions of standard estimators
▶ The methods presented can be adapted to the random
left-censoring, but not necessarily to more complex censoring
mechanisms (because the weighting function may not be explicit)
80 / 80
