Machine Learning For Survival Analysis

This document provides an outline for a tutorial on machine learning for survival analysis. It begins with an introduction to basic concepts and statistical methods in survival analysis. It then discusses various machine learning methods that can be used, including parametric, semi-parametric and non-parametric approaches. Finally, it lists some related topics like early prediction, data transformation, and competing risks analysis. Examples of applications in healthcare, education, crowdfunding, and other domains are provided.


Machine Learning for Survival Analysis
Chandan K. Reddy, Dept. of Computer Science, Virginia Tech (http://www.cs.vt.edu/~reddy)
Yan Li, Dept. of Computational Medicine and Bioinformatics, Univ. of Michigan, Ann Arbor

Tutorial Outline
Basic Concepts

Statistical Methods

Machine Learning Methods

Related Topics

Healthcare
Features:
Demographics   Comorbidities   Laboratory    Procedures        Medications
Age            Hypertension    Hemoglobin    Hemodialysis      ACE inhibitor
Gender         Diabetes        Blood count   Contrast dye      Dopamine
Race           CKD             Glucose       Catheterization   Milrinone

Features -> Event Prediction Model -> Impact: lower healthcare costs; improve quality of life

Event of interest: rehospitalization; disease recurrence; cancer survival
Outcome: likelihood of hospitalization within t days of discharge

Mining Events in Longitudinal Data
[Figure: timelines of 10 subjects over time units 1 to 12; markers denote death, dropout/censoring, and other events.]
Classification problem:
• 3 +ve and 7 -ve
• Cannot predict the time of event
• Need to re-train for each time
Regression problem:
• Can predict the time of event
• Only 3 samples (not 10): loss of data
Ping Wang, Yan Li, Chandan K. Reddy, “Machine Learning for Survival Analysis: A Survey”. ACM Computing Surveys (under revision), 2017.

Problem Statement
For a given instance $i$, represented by a triplet $(X_i, y_i, \delta_i)$:
• $X_i$ is the feature vector;
• $\delta_i$ is the binary event indicator, i.e., $\delta_i = 1$ for an uncensored instance and $\delta_i = 0$ for a censored instance;
• $y_i$ denotes the observed time and is equal to the survival time $T_i$ for an uncensored instance and to the censoring time $C_i$ for a censored instance, i.e.,
$$y_i = \begin{cases} T_i & \text{if } \delta_i = 1 \\ C_i & \text{if } \delta_i = 0 \end{cases}$$
Note for $T_i$:
• The value of $T_i$ will be both non-negative and continuous.
• $T_i$ is latent for censored instances.
Goal of survival analysis: to estimate the time to the event of interest $T_j$ for a new instance $j$ with feature predictors denoted by $X_j$.

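As a small illustration of the problem statement, the sketch below builds the observed time and event indicator from a latent survival time and a censoring time. The function name `observe` and the numbers are hypothetical and only serve the example; in real data the survival time of a censored instance is never available.

```python
def observe(survival_time, censoring_time):
    """Return (y, delta) for one right-censored instance."""
    if survival_time <= censoring_time:
        return survival_time, 1   # event observed: y = T, delta = 1
    return censoring_time, 0      # censored: y = C, delta = 0

# Hypothetical latent survival times T and censoring times C
T = [5.0, 8.0, 3.0, 12.0]
C = [10.0, 6.0, 7.0, 9.0]

observed = [observe(t, c) for t, c in zip(T, C)]
print(observed)
```

The second and fourth instances are censored, so only a lower bound on their survival time is recorded.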
Education
Features:
Demographics     Financial      Pre-enrollment    Enrollment         Semester
Age              Cash amount    High school GPA   Transfer credits   Semester GPA
Gender           Income         ACT scores        College            % passed
Race/Ethnicity   Scholarships   Graduation age    Major              % dropped

Features -> Event Prediction Model -> Impact: educated society; better future

Event of interest: student dropout
Outcome: likelihood of a student dropping out within t days
S. Ameri, M. J. Fard, R. B. Chinnam and C. K. Reddy, "Survival Analysis based Framework for Early Prediction of Student Dropouts", CIKM 2016.
Crowdfunding
Features:
Projects      Creators       Twitter        Temporal
Duration      Past success   # Promotions   # Backers
Goal amount   Location       Backings       Funding
Category      # projects     Communities    # retweets

Features -> Event Prediction Model -> Impact: improve local economy; successful businesses

Event of interest: project success
Outcome: likelihood of a project being successful within t days
Y. Li, V. Rakesh, and C. K. Reddy, "Project Success Prediction in Crowdfunding Environments", WSDM 2016.
Other Applications
Reliability: Device Failure Modeling in Engineering
Goal: Estimate when a device will fail
Features: Product and manufacturer details, user reviews
Duration Modeling: Unemployment Duration in Economics
Goal: Estimate the time people spend without a job (for getting a new job)
Features: User demographics and experience, Job details and economics
Click Through Rate: Computational Advertising on the Web
Goal: Estimate when a web user will click the link of the ad.
Features: User and Ad information, website statistics
Customer Lifetime Value: Targeted Marketing
Goal: Estimate the frequent purchase pattern for customers.
Features: Customer and store/product information.
[Figure: given the history information, how long until the event of interest?]
Taxonomy of Survival Analysis Methods
Survival Analysis Methods
• Statistical Methods
  • Non-Parametric: Kaplan-Meier; Nelson-Aalen; Life-Table
  • Semi-Parametric: Cox Regression
    • Basic Cox-PH
    • Penalized Cox: Lasso-Cox; Ridge-Cox; EN-Cox; OSCAR-Cox
    • Time-Dependent Cox
    • CoxBoost
  • Parametric
    • Linear Regression: Tobit; Buckley-James; Penalized Regression (Weighted Regression, Structured Regularization)
    • Accelerated Failure Time
• Machine Learning Methods
  • Survival Trees
  • Bayesian Methods: Naïve Bayes; Bayesian Network
  • Neural Network
  • Support Vector Machine
  • Ensemble: Random Survival Forests; Bagging Survival Trees
  • Advanced Machine Learning: Active Learning; Transfer Learning; Multi-Task Learning
• Related Topics
  • Early Prediction
  • Data Transformation: Uncensoring; Calibration
  • Complex Events: Competing Risks; Recurrent Events
Basics of Survival Analysis
The main focus is on time-to-event data. Typically, survival data are not fully observed, but rather are censored.
Several important functions:
• Survival function, indicating the probability that the instance can survive for longer than a certain time $t$:
  $S(t) = \Pr(T > t)$
• Cumulative distribution function, representing the probability that the event of interest occurs earlier than $t$:
  $F(t) = 1 - S(t)$
• Death density function:
  $f(t) = \frac{d}{dt}F(t) = -\frac{d}{dt}S(t)$
• Hazard function, representing the instantaneous rate at which the event of interest occurs at time $t$, given survival to time $t$:
  $h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\ln S(t)$
• Cumulative hazard function: $H(t) = \int_0^t h(u)\,du = -\ln S(t)$, so that $S(t) = \exp(-H(t))$.

Chandan K. Reddy and Yan Li, "A Review of Clinical Prediction Models", in Healthcare Data Analytics, Chandan K. Reddy and Charu C. Aggarwal (eds.), Chapman and Hall/CRC Press, 2015.
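The relationships between the survival, density, hazard and cumulative hazard functions can be checked numerically. The sketch below assumes an exponentially distributed survival time with a hypothetical rate `lam`; that distribution is chosen only because its hazard is constant, which makes the identities easy to verify.

```python
import math

lam = 0.5  # hypothetical rate of an exponential survival time

def S(t):
    return math.exp(-lam * t)        # survival function

def F(t):
    return 1.0 - S(t)                # cumulative distribution function

def f(t):
    return lam * math.exp(-lam * t)  # death density function

def h(t):
    return f(t) / S(t)               # hazard function

def H(t):
    return -math.log(S(t))           # cumulative hazard function

t = 2.0
print(S(t), F(t), f(t), h(t), H(t))
```

For this distribution the hazard equals `lam` at every time point and the cumulative hazard grows linearly as `lam * t`.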
Evaluation Metrics
Due to the presence of censoring in survival data, the standard evaluation metrics for regression, such as root mean squared error and $R^2$, are not suitable for measuring performance in survival analysis.
Three specialized evaluation metrics for survival analysis:
• Concordance index (C-index)
• Brier score
• Mean absolute error

Concordance Index (C-Index)
It is a rank order statistic for predictions against true outcomes and is defined as the ratio of the concordant pairs to the total comparable pairs.
Given the comparable instance pair $(i, j)$, where $t_i$ and $t_j$ are the actual observed times and $S(t_i)$ and $S(t_j)$ are the predicted survival times:
• The pair $(i, j)$ is concordant if $t_i > t_j$ and $S(t_i) > S(t_j)$.
• The pair $(i, j)$ is discordant if $t_i > t_j$ and $S(t_i) < S(t_j)$.
Then, the concordance probability
$$c = \Pr\left(S(t_i) > S(t_j) \mid t_i > t_j\right)$$
measures the concordance between the rankings of actual values and predicted values.
For a binary outcome, the C-index is identical to the area under the ROC curve (AUC).
H. Uno, et al. "On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data." Statistics in Medicine, 2011.
Comparable Pairs
The survival times of two instances can be compared if:
• Both of them are uncensored; or
• The observed event time of the uncensored instance is smaller than the censoring time of the censored instance.

[Figure, two panels. Without censoring: a total of 5C2 comparable pairs. With censoring: events are comparable only with other events and with instances censored after them.]

H. Steck, B. Krishnapuram, C. Dehing-oberije, P. Lambin, and V. C. Raykar, “On ranking in survival analysis: Bounds on the concordance index”, NIPS 2008.
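C-index
When the output of the model is the prediction of survival time:
$$\hat{c} = \frac{1}{num}\sum_{i:\,\delta_i = 1}\;\sum_{j:\,y_i < y_j} I\left[S(\hat{y}_j \mid X_j) > S(\hat{y}_i \mid X_i)\right]$$
where $S(\cdot \mid X)$ is the predicted survival probability and $num$ denotes the total number of comparable pairs.

When the output of the model is the hazard ratio (Cox model):
$$\hat{c} = \frac{1}{num}\sum_{i:\,\delta_i = 1}\;\sum_{j:\,y_i < y_j} I\left[X_i\hat{\beta} > X_j\hat{\beta}\right]$$
where $I[\cdot]$ is the indicator function and $\hat{\beta}$ is the estimated parameter vector from the Cox-based models. (The patient who has a longer survival time should have a smaller hazard ratio.)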
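The pairwise counting behind the C-index can be sketched in a few lines. The version below follows the hazard-ratio form: a pair is comparable when the instance with the shorter observed time is uncensored, and it is concordant when that instance has the higher risk score. Counting tied scores as half-concordant is an assumption of this sketch (a common convention), and the function name and data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """C-index for risk scores: a higher score should mean a shorter survival time."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # the earlier instance of a comparable pair must be uncensored
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # assumption: ties count as half
    return concordant / comparable

# Hypothetical data: observed times, event indicators, predicted risk scores
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.1, 0.2, 0.3]
print(concordance_index(times, events, risk))
```

Here four pairs are comparable and three of them are ordered correctly, so the sketch returns 0.75. The censored instance at time 4 never starts a pair, but it still appears as the later member of the pair that begins at time 2.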
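C-index during a Time Period
The area under the ROC curve (AUC) is
$$AUC = \Pr\left(\hat{y}_i < \hat{y}_j \mid y_i = 0,\; y_j = 1\right)$$
For a possible survival time $t \in T_s$, where $T_s$ is the set of all possible survival times, the time-specific AUC is defined as
$$AUC(t) = \Pr\left(\hat{y}_i < \hat{y}_j \mid y_i < t,\; y_j > t\right) = \frac{1}{num(t)}\sum_{i:\,y_i < t}\;\sum_{j:\,y_j > t} I\left(\hat{y}_i < \hat{y}_j\right)$$
where $num(t)$ denotes the number of comparable pairs at time $t$.
Then the C-index during a time period $[0, t^*]$ can be calculated as:
$$c_{t^*} = \frac{\sum_{t \in T_s} AUC(t)\cdot num(t)}{\sum_{t \in T_s} num(t)}$$
with $T_s$ restricted to the survival times in $[0, t^*]$.
The C-index is a weighted average of the areas under the time-specific ROC curves (time-dependent AUC).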
Brier Score
The Brier score is used to evaluate prediction models where the outcome to be predicted is either binary or categorical in nature.
The individual contributions to the empirical Brier score are reweighted based on the censoring information:
$$BS(t) = \frac{1}{N}\sum_{i=1}^{N} w_i(t)\left[\hat{y}_i(t) - y_i(t)\right]^2$$
where $w_i(t)$ denotes the weight for the $i$-th instance, and $\hat{y}_i(t)$ and $y_i(t)$ are the predicted and the observed outcome at time $t$.
The weights can be estimated by considering the Kaplan-Meier estimator $G$ of the censoring distribution on the dataset:
$$w_i(t) = \begin{cases} \delta_i / G(y_i) & \text{if } y_i \le t \\ 1 / G(t) & \text{if } y_i > t \end{cases}$$
The weights for the instances that are censored before $t$ will be 0.
The weights for the instances that are uncensored at $t$ are greater than 1.

E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher, “Assessment and comparison of prognostic classification schemes for survival data”, Statistics in Medicine, 1999.
Mean Absolute Error
For survival analysis problems, the mean absolute error (MAE) can be defined as an average of the differences between the predicted time values and the actual observation time values:
$$MAE = \frac{1}{N}\sum_{i=1}^{N} \delta_i\,\left|y_i - \hat{y}_i\right|$$
where
• $y_i$ -- the actual observation times.
• $\hat{y}_i$ -- the predicted times.
Only the samples for which the event occurs are considered in this metric.
Condition: MAE can only be used to evaluate survival models that provide the event time as the predicted target value.
Summary of Statistical Methods
Non-parametric
• Advantages: more efficient when no suitable theoretical distributions are known.
• Disadvantages: difficult to interpret; yields inaccurate estimates.
• Specific methods: Kaplan-Meier; Nelson-Aalen; Life-Table.
Semi-parametric
• Advantages: knowledge of the underlying distribution of survival times is not required.
• Disadvantages: the distribution of the outcome is unknown; not easy to interpret.
• Specific methods: Cox model; Regularized Cox; CoxBoost; Time-Dependent Cox.
Parametric
• Advantages: easy to interpret; more efficient and accurate when the survival times follow a particular distribution.
• Disadvantages: when the distribution assumption is violated, it may be inconsistent and can give sub-optimal results.
• Specific methods: Tobit; Buckley-James; Penalized regression; Accelerated Failure Time.

Kaplan-Meier Analysis
Kaplan-Meier (KM) analysis is a nonparametric approach to survival outcomes. The survival function is:
$$\hat{S}(t) = \prod_{j:\,T_j \le t}\left(1 - \frac{d_j}{r_j}\right)$$
where
• $T_1 < T_2 < \dots < T_K$ -- a set of distinct event times observed in the sample.
• $d_j$ -- number of events at $T_j$.
• $c_j$ -- number of censored observations between $T_j$ and $T_{j+1}$.
• $r_j$ -- number of individuals “at risk” right before the $j$-th death.

B. Efron. "Logistic regression, survival analysis, and the Kaplan-Meier curve." JASA 1988.
Survival Outcomes
Status: 1 = Death; 2 = Lost to follow-up; 3 = Withdrawn alive

Patient  Days  Status | Patient  Days  Status | Patient  Days  Status
   1      21     1    |   15     256     2    |   29     398     1
   2      39     1    |   16     260     1    |   30     414     1
   3      77     1    |   17     261     1    |   31     420     1
   4     133     1    |   18     266     1    |   32     468     2
   5     141     2    |   19     269     1    |   33     483     1
   6     152     1    |   20     287     3    |   34     489     1
   7     153     1    |   21     295     1    |   35     505     1
   8     161     1    |   22     308     1    |   36     539     1
   9     179     1    |   23     311     1    |   37     565     3
  10     184     1    |   24     321     2    |   38     618     1
  11     197     1    |   25     326     1    |   39     793     1
  12     199     1    |   26     355     1    |   40     794     1
  13     214     1    |   27     361     1    |
  14     228     1    |   28     374     1    |

Kaplan-Meier Analysis
 j  Time  Status  d_j  c_j  r_j  S(t)
 1    21     1     1    0    40  0.975
 2    39     1     1    0    39  0.950
 3    77     1     1    0    38  0.925
 4   133     1     1    0    37  0.900
 5   141     2     0    1    36  .
 6   152     1     1    0    35  0.874
 7   153     1     1    0    34  0.849

KM estimator:
$$\hat{S}(t) = \prod_{j:\,T_j \le t}\left(1 - \frac{d_j}{r_j}\right)$$

Kaplan-Meier Analysis
KM estimator: $\hat{S}(t) = \prod_{j:\,T_j \le t}\left(1 - d_j / r_j\right)$
Columns: Estimate = KM estimate of S(t); SE = standard error; Events = cumulative number of events; At risk = number at risk. A "." marks a censored row, for which no estimate is reported.

 #  Time  Status  Estimate  SE     Events  At risk |  #  Time  Status  Estimate  SE     Events  At risk
 1    21    1     0.975     0.025     1      40    | 21   287    3     .         .        18      20
 2    39    1     0.950     0.034     2      39    | 22   295    1     0.508     0.081    19      19
 3    77    1     0.925     0.042     3      38    | 23   308    1     0.479     0.081    20      18
 4   133    1     0.900     0.047     4      37    | 24   311    1     0.451     0.081    21      17
 5   141    2     .         .         4      36    | 25   321    2     .         .        21      16
 6   152    1     0.874     0.053     5      35    | 26   326    1     0.421     0.081    22      15
 7   153    1     0.849     0.057     6      34    | 27   355    1     0.391     0.081    23      14
 8   161    1     0.823     0.061     7      33    | 28   361    1     0.361     0.080    24      13
 9   179    1     0.797     0.064     8      32    | 29   374    1     0.331     0.079    25      12
10   184    1     0.771     0.067     9      31    | 30   398    1     0.301     0.077    26      11
11   193    1     0.746     0.070    10      30    | 31   414    1     0.271     0.075    27      10
12   197    1     0.720     0.072    11      29    | 32   420    1     0.241     0.072    28       9
13   199    1     0.694     0.074    12      28    | 33   468    2     .         .        28       8
14   214    1     0.669     0.075    13      27    | 34   483    1     0.206     0.070    29       7
15   228    1     0.643     0.077    14      26    | 35   489    1     0.172     0.066    30       6
16   256    2     .         .        14      25    | 36   505    1     0.137     0.061    31       5
17   260    1     0.616     0.078    15      24    | 37   539    1     0.103     0.055    32       4
18   261    1     0.589     0.079    16      23    | 38   565    3     .         .        32       3
19   266    1     0.563     0.080    17      22    | 39   618    1     0.052     0.046    33       2
20   269    1     0.536     0.080    18      21    | 40   794    1     0.000     0.000    34       1

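The KM estimates in the table can be reproduced with a short product-limit loop. This is a minimal sketch, not a replacement for a statistics package: the function name `kaplan_meier` is hypothetical, the times and status codes are copied from the 40-row table above, and status codes 2 and 3 are both treated as censoring.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = event, 0 = censored)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk  # factor (1 - d_j / r_j)
            curve.append((t, surv))
        at_risk -= removed                  # events and censored both leave the risk set
    return curve

times = [21, 39, 77, 133, 141, 152, 153, 161, 179, 184, 193, 197, 199, 214,
         228, 256, 260, 261, 266, 269, 287, 295, 308, 311, 321, 326, 355, 361,
         374, 398, 414, 420, 468, 483, 489, 505, 539, 565, 618, 794]
status = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
          3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1]
events = [1 if s == 1 else 0 for s in status]  # only status 1 (death) is an event

curve = kaplan_meier(times, events)
for t, s in curve[:6]:
    print(t, round(s, 3))
```

The censored observation at day 141 produces no estimate of its own, but it lowers the number at risk for the next event at day 152.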
Nelson-Aalen Estimator
The Nelson-Aalen (NA) estimator is a non-parametric estimator of the cumulative hazard function (CHF) for censored data.
Instead of estimating the survival probability as done in the KM estimator, the NA estimator directly estimates the cumulative hazard.
The Nelson-Aalen estimator of the cumulative hazard function:
$$\hat{H}(t) = \sum_{t_i \le t}\frac{d_i}{r_i}$$
• $d_i$ -- the number of deaths at time $t_i$
• $r_i$ -- the number of individuals at risk at $t_i$
The cumulative hazard rate function can be used to estimate the survival function and its variance:
$$\hat{S}(t) = \exp\left(-\hat{H}(t)\right)$$
The NA and KM estimators are asymptotically equivalent.

W. Nelson. “Theory and applications of hazard plotting for censored failure data.” Technometrics, 1972.
O. Aalen. “Nonparametric inference for a family of counting processes.” The Annals of Statistics, 1978.
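A minimal sketch of the Nelson-Aalen estimator is given below: it accumulates the ratio of deaths to individuals at risk at each event time and converts the cumulative hazard into a survival estimate. The function name and the four-instance data set are hypothetical.

```python
import math

def nelson_aalen(times, events):
    """Return [(t, H(t), exp(-H(t)))] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    cum_hazard = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            cum_hazard += deaths / at_risk  # increment d_i / r_i
            curve.append((t, cum_hazard, math.exp(-cum_hazard)))
        at_risk -= removed
    return curve

# Hypothetical data: the instance observed at time 2 is censored
na_curve = nelson_aalen([1, 2, 3, 4], [1, 0, 1, 1])
for t, H, S in na_curve:
    print(t, H, round(S, 3))
```

With these inputs the cumulative hazard takes the values 0.25, 0.75 and 1.75 at times 1, 3 and 4.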
Clinical Life Tables
Clinical life tables apply to grouped survival data from studies of patients with specific diseases; they focus more on the conditional probability of dying within an interval.
The life table works on time intervals $[t_{j-1}, t_j)$, whereas KM works on $T_1 < T_2 < \dots < T_K$, a set of distinct death times.
The (nonparametric) survival function is:
$$\hat{S}(t_j) = \prod_{l \le j}\left(1 - \frac{d_l}{r'_l}\right)$$
Assumption about when censoring occurs:
• at the beginning of each interval: $r'_j = r_j - c_j$
• at the end of each interval: $r'_j = r_j$
• on average halfway through the interval: $r'_j = r_j - c_j/2$

KM analysis suits small data sets and gives a more accurate analysis; the clinical life table suits large data sets and gives a relatively approximate result.

Cox, David R. "Regression models and life-tables", Journal of the Royal Statistical Society. Series B (Methodological), 1972.
Clinical Life Tables
NOTE: the length of each interval is half a year (183 days).

Interval  Start time  End time  r_j  c_j  r'_j  d_j  S(t)   Std. error of S(t)
   1          0         182     40    1   39.5    8  0.797  0.06
   2        183         365     31    3   29.5   15  0.392  0.08
   3        366         548     13    1   12.5    8  0.141  0.06
   4        549         731      4    1    3.5    1  0.101  0.05
   5        732         915      2    0    2.0    2  0.000  0.00

Censoring is assumed to occur on average halfway through the interval: $r'_j = r_j - c_j/2$.

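The survival column of the clinical life table can be recomputed from the counts in each interval. The sketch below assumes the "halfway through the interval" censoring convention, so the effective number at risk is the number entering minus half the number censored; the function name `life_table` is hypothetical and the counts are taken from the table above.

```python
def life_table(rows):
    """rows: (number entering, number censored, number of deaths) per interval."""
    surv = 1.0
    estimates = []
    for entering, censored, deaths in rows:
        effective = entering - censored / 2.0  # r'_j = r_j - c_j / 2
        surv *= 1.0 - deaths / effective       # conditional survival in the interval
        estimates.append(surv)
    return estimates

rows = [(40, 1, 8), (31, 3, 15), (13, 1, 8), (4, 1, 1), (2, 0, 2)]
estimates = life_table(rows)
print([round(s, 3) for s in estimates])
```

The rounded output matches the survival estimates listed in the table (0.797, 0.392, 0.141, 0.101, 0).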
Cox Proportional Hazards Model
The Cox proportional hazards model is the most commonly used model in survival analysis.
The hazard function $h(t)$, sometimes called an instantaneous failure rate, shows the event rate at time $t$ conditional on survival until time $t$ or later.
$$h(t, X_i) = h_0(t)\exp(X_i\beta) \;\Rightarrow\; \log\frac{h(t, X_i)}{h_0(t)} = X_i\beta$$
This is a linear model for the log of the hazard ratio, where
• $X_i = (x_{i1}, x_{i2}, \dots, x_{iP})$ is the covariate vector.
• $h_0(t)$ is the baseline hazard function, which can be an arbitrary non-negative function of time.
The Cox model is a semi-parametric algorithm since the baseline hazard function is unspecified.
D. R. Cox, “Regression models and life tables”. Journal of the Royal Statistical Society, 1972.
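Cox Proportional Hazards Model
The proportional hazards assumption means that the hazard ratio of two instances $X_1$ and $X_2$ is constant over time (independent of time):
$$\frac{h(t, X_1)}{h(t, X_2)} = \frac{h_0(t)\exp(X_1\beta)}{h_0(t)\exp(X_2\beta)} = \exp\left[(X_1 - X_2)\beta\right]$$
The survival function in the Cox model can be computed as follows:
$$S(t) = \exp\left(-H_0(t)\exp(X\beta)\right) = S_0(t)^{\exp(X\beta)}$$
• $H_0(t)$ is the cumulative baseline hazard function;
• $S_0(t) = \exp(-H_0(t))$ represents the baseline survival function.
Breslow’s estimator is the most widely used method to estimate $H_0(t)$, which is given by:
$$\hat{H}_0(t) = \sum_{t_i \le t}\hat{h}_0(t_i), \qquad \hat{h}_0(t_i) = \frac{1}{\sum_{j \in R_i}\exp(X_j\beta)} \text{ if } t_i \text{ is an event time, otherwise } \hat{h}_0(t_i) = 0.$$
$R_i$ represents the set of subjects who are at risk at time $t_i$.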
Optimization of Cox Model
It is not possible to fit the model using the standard likelihood function.
• Reason: the baseline hazard function is not specified.
The Cox model uses the partial likelihood function.
• Advantage: it depends only on the parameter of interest and is free of the nuisance parameters (baseline hazard).
Conditional on the fact that the event occurs at $T_j$, the individual probability corresponding to covariate $X_j$ can be formulated as:
$$\frac{h(T_j, X_j)}{\sum_{i \in R_j} h(T_j, X_i)}$$
• $J$ -- the total number of events of interest that occurred during the observation period for $N$ instances.
• $T_1 < T_2 < \dots < T_J$ -- the distinct ordered times to the event of interest.
• $X_j$ -- the covariate vector for the subject who has the event at $T_j$.
• $R_j$ -- the set of subjects at risk at $T_j$.
Partial Likelihood Function
The partial likelihood function of the Cox model will be:
$$L(\beta) = \prod_{j=1}^{N}\left[\frac{\exp(X_j\beta)}{\sum_{i \in R_j}\exp(X_i\beta)}\right]^{\delta_j}$$
If $\delta_j = 1$, the term in the product is the conditional probability; if $\delta_j = 0$, the corresponding term is 1, which means that the term will not have any effect on the final product.
The coefficient vector is estimated by minimizing the negative log-partial likelihood:
$$LL(\beta) = -\sum_{j=1}^{N}\delta_j\left\{X_j\beta - \log\left[\sum_{i \in R_j}\exp(X_i\beta)\right]\right\}$$
The maximum partial likelihood estimator (MPLE) can be used along with the numerical Newton-Raphson method to iteratively find an estimator $\hat{\beta}$ which minimizes $LL(\beta)$.
D. R. Cox, Regression models and life tables, Journal of the Royal Statistical Society, 1972.
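The negative log-partial likelihood can be evaluated directly from its definition. The sketch below is a plain-Python illustration that assumes no tied event times and takes the risk set of an instance to be every instance whose observed time is at least as large; the function name and the toy data are hypothetical, and no optimizer is included.

```python
import math

def neg_log_partial_likelihood(beta, X, times, events):
    """Negative log-partial likelihood of a Cox model (assumes no tied event times)."""
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]  # X_i * beta
    total = 0.0
    for j in range(len(times)):
        if events[j] != 1:
            continue  # censored instances contribute only through risk sets
        risk_sum = sum(math.exp(scores[i])
                       for i in range(len(times)) if times[i] >= times[j])
        total -= scores[j] - math.log(risk_sum)
    return total

# Hypothetical data: one covariate, three uncensored instances
X = [[2.0], [1.0], [0.0]]
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]

print(neg_log_partial_likelihood([0.0], X, times, events))  # beta = 0
print(neg_log_partial_likelihood([1.0], X, times, events))  # beta = 1
```

With a zero coefficient every instance in a risk set is equally likely to fail, so the value is log(3) + log(2) + log(1) = log 6. A positive coefficient, which assigns a higher risk to the instances that fail earlier, gives a smaller value on this data.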
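Regularized Cox Models
Regularized Cox regression methods:
$$\hat{\beta} = \arg\min_{\beta}\; LL(\beta) + \lambda \cdot P(\beta)$$
$P(\beta)$ is a sparsity-inducing norm and $\lambda$ is the regularization parameter.
Method, penalty term, and formulation:
• LASSO: $\lambda\sum_{p=1}^{P}|\beta_p|$. Promotes sparsity.
• Ridge: $\frac{\lambda}{2}\sum_{p=1}^{P}\beta_p^2$. Handles correlation.
• Elastic Net (EN): $\lambda\left[\alpha\sum_{p=1}^{P}|\beta_p| + \frac{1}{2}(1-\alpha)\sum_{p=1}^{P}\beta_p^2\right]$. Sparsity + correlation.
• Adaptive LASSO (AL): $\lambda\sum_{p=1}^{P}\tau_p|\beta_p|$. Adaptive variants are slightly more effective.
• Adaptive Elastic Net (AEN): $\lambda\left[\alpha\sum_{p=1}^{P}\tau_p|\beta_p| + \frac{1}{2}(1-\alpha)\sum_{p=1}^{P}\beta_p^2\right]$. Adaptive variants are slightly more effective.
• OSCAR: $\lambda_1\|\beta\|_1 + \lambda_2\|T\beta\|_1$. Sparsity + feature correlation graph.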
Lasso-Cox and Ridge-Cox
Lasso performs feature selection and estimates the regression coefficients simultaneously using an $\ell_1$-norm regularizer.
The Lasso-Cox model incorporates the $\ell_1$-norm into the log-partial likelihood and inherits the properties of Lasso.
Extensions of the Lasso-Cox method:
• Adaptive Lasso-Cox: adaptively weighted $\ell_1$-penalties on the regression coefficients.
• Fused Lasso-Cox: the coefficients and their successive differences are penalized.
• Graphical Lasso-Cox: an $\ell_1$-penalty on the inverse covariance matrix is applied to estimate sparse graphs.
Ridge-Cox is a Cox regression model regularized by an $\ell_2$-norm.
• It incorporates an $\ell_2$-norm regularizer to select the correlated features.
• It shrinks their values towards each other.
N. Simon et al., “Regularization paths for Cox's proportional hazards model via coordinate descent”, JSS 2011.
EN-Cox and OSCAR-Cox
The EN-Cox method incorporates the Elastic Net penalty term (combining the $\ell_1$ and squared $\ell_2$ penalties) into the log-partial likelihood function.
• It performs feature selection and handles correlation between the features.
The Kernel Elastic Net Cox (KEN-Cox) method builds a kernel similarity matrix for the feature space to incorporate the pairwise feature similarity into the Cox model.
OSCAR-Cox uses the Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) regularizer within the Cox framework:
$$P(\beta) = \lambda_1\|\beta\|_1 + \lambda_2\|T\beta\|_1$$
• $T$ is the sparse symmetric edge set matrix from a graph constructed from the features.
• It performs variable selection for highly correlated features in regression.
• It obtains equal coefficients for the features which relate to the outcome in similar ways.
B. Vinzamuri and C. K. Reddy, "Cox Regression with Correlation based Regularization for Electronic Health Records", ICDM 2013.
CoxBoost
The CoxBoost method can be applied to fit sparse survival models on high-dimensional data while explicitly including some mandatory covariates in the model.

CoxBoost vs. regular gradient boosting approach (RGBA)
Similar goal: estimate the coefficients in the Cox model.
Differences:
• RGBA: either updates a single coefficient per step (component-wise boosting) or fits the gradient using all covariates in each step.
• CoxBoost: considers a flexible set of candidate variables for updating in each boosting step.

H. Binder and M. Schumacher, “Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models”, BMC Bioinformatics, 2008.
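CoxBoost
How to update $\hat{\beta}$ in each iteration of CoxBoost?
Assume that $\hat{\beta}_{k-1} = (\hat{\beta}_{k-1,1}, \dots, \hat{\beta}_{k-1,P})$ is the actual estimate of the overall parameter vector after step $k-1$ of the algorithm, and that there are $q$ predefined candidate sets of features in step $k$, $I_{kl} \subset \{1, \dots, P\}$, $l = 1, \dots, q$.
1. Update all parameters in each set $I_{kl}$ simultaneously (MLE).
2. Determine the best set $I_{kl^*}$, i.e., the one which improves the overall fit most.
3. Update $\hat{\beta}_k$: the coefficients with $p \in I_{kl^*}$ take their newly estimated values, and the coefficients with $p \notin I_{kl^*}$ keep their previous values $\hat{\beta}_{k-1,p}$.
Special case:
Component-wise CoxBoost: the candidate sets are $\{1\}, \dots, \{P\}$ in each step $k$.
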
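TD-Cox Model
The Cox regression model is also effectively adapted to the time-dependent Cox (TD-Cox) model to handle time-dependent covariates.
Given a survival analysis problem which involves both time-dependent and time-independent features, the variables at time $t$ can be denoted as:
$$X(t) = \left(X_{\cdot 1}(t), X_{\cdot 2}(t), \dots, X_{\cdot P_1}(t), X_{\cdot 1}, X_{\cdot 2}, \dots, X_{\cdot P_2}\right)$$
where the first $P_1$ entries are time-dependent and the last $P_2$ entries are time-independent.
The TD-Cox model can be formulated as:
$$h(t, X(t)) = h_0(t)\exp\left[\sum_{j=1}^{P_1}\delta_j X_{\cdot j}(t) + \sum_{i=1}^{P_2}\beta_i X_{\cdot i}\right]$$
where the first sum is the time-dependent part and the second sum is the time-independent part.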
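TD-Cox Model
For the two sets of predictors at time $t$:
$$X(t) = \left(X_{\cdot 1}(t), \dots, X_{\cdot P_1}(t), X_{\cdot 1}, \dots, X_{\cdot P_2}\right)$$
$$X^*(t) = \left(X^*_{\cdot 1}(t), \dots, X^*_{\cdot P_1}(t), X^*_{\cdot 1}, \dots, X^*_{\cdot P_2}\right)$$
The hazard ratio for the TD-Cox model can be computed as follows:
$$\widehat{HR}(t) = \frac{\hat{h}(t, X^*(t))}{\hat{h}(t, X(t))} = \exp\left[\sum_{j=1}^{P_1}\delta_j\left(X^*_{\cdot j}(t) - X_{\cdot j}(t)\right) + \sum_{i=1}^{P_2}\beta_i\left(X^*_{\cdot i} - X_{\cdot i}\right)\right]$$
Since the first component in the exponent is time-dependent, we can consider the hazard ratio in the TD-Cox model as a function of time $t$. This means that it does not satisfy the PH assumption mentioned in the standard Cox model.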
Counting Process Example
ID  Gender (0/1)  Weight (lb)  Smoke (0/1)  Start time (days)  Stop time (days)  Status
1   1 (F)         125          0            0                  20                1
2   0 (M)         171          1            0                  20                0
2   0             180          0            20                 30                1
3   0             165          1            0                  20                0
3   0             160          0            20                 30                0
3   0             168          0            30                 50                0
4   1             130          0            0                  20                0
4   1             125          1            20                 30                0
4   1             120          1            30                 80                1

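Parametric Censored Regression
[Figure: density $f(t)$ against time, with the survival probability $S(t)$ shown as the area to the right of an observed time $y_i$.]
Event density function $f(t)$: rate of events per unit time.
• $\prod_{\delta_i = 1} f(y_i, \theta)$: the joint probability of uncensored instances.
Survival function $S(t) = \Pr(T > t)$: the probability that the event did not happen up to time $t$.
• $\prod_{\delta_i = 0} S(y_i, \theta)$: the joint probability of censored instances.
Likelihood function:
$$L(\theta) = \prod_{\delta_i = 1} f(y_i, \theta)\prod_{\delta_i = 0} S(y_i, \theta)$$
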
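The censored likelihood can be made concrete with the simplest parametric choice. The sketch below assumes an exponential distribution with rate `lam`, for which the density is lam * exp(-lam * y) and the survival function is exp(-lam * y); under that assumption the maximum likelihood estimate has the closed form (number of events) / (total observed time). The data and function names are hypothetical.

```python
import math

def neg_log_likelihood(lam, y, delta):
    """Censored negative log-likelihood for an exponential model with rate lam."""
    # Uncensored instances contribute log f(y) = log(lam) - lam * y;
    # censored instances contribute log S(y) = -lam * y.
    return -sum(d * math.log(lam) - lam * t for t, d in zip(y, delta))

def exponential_mle(y, delta):
    """Closed-form maximizer of the censored exponential likelihood."""
    return sum(delta) / sum(y)

# Hypothetical observed times and event indicators
y = [2.0, 3.0, 5.0, 7.0]
delta = [1, 0, 1, 1]

lam_hat = exponential_mle(y, delta)
print(lam_hat, neg_log_likelihood(lam_hat, y, delta))
```

The censored instance adds its observed time to the denominator without adding an event to the numerator, which pulls the estimated rate down compared with discarding it.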
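Parametric Censored Regression
Generalized linear model:
$$y = X\beta + \sigma\varepsilon, \quad \text{where } y = \log(T)$$
With the standardized residual $z_i = (y_i - X_i\beta)/\sigma$, the negative log-likelihood is
$$-\log L(\beta, \sigma) = m\log\sigma - \sum_{\delta_i = 1}\log f(z_i) - \sum_{\delta_i = 0}\log\left(1 - F(z_i)\right)$$
where $m$ is the number of uncensored instances. The terms $m\log\sigma$ and $\log f(z_i)$ come from the uncensored instances, and the term $\log(1 - F(z_i))$ comes from the censored instances.
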
Optimization
Use a second-order Taylor expansion to formulate the log-likelihood as a reweighted least squares problem.
The first-order derivative, the second-order derivative, and the other components of the optimization share the same formulation with respect to $f(\cdot)$, $f'(\cdot)$, $f''(\cdot)$, and $F(\cdot)$.
In addition, a regularization term can be added to encode prior assumptions.

Y. Li, K. S. Xu, C. K. Reddy, “Regularized Parametric Regression for High-dimensional Survival Analysis“, SDM 2016.
Pros and Cons
Advantages:
• Easy to interpret.
• Unlike the Cox model, it can directly predict the survival (event) time.
• More efficient and accurate when the time to the event of interest follows a particular distribution.

Disadvantages:
• The model performance strongly relies on the choice of distribution, and in practice it is very difficult to choose a suitable distribution for a given problem.

Li, Yan, Vineeth Rakesh, and Chandan K. Reddy. "Project success prediction in crowdfunding environments." Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 2016.
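Commonly Used Distributions
For each distribution: PDF $f(t)$, survival $S(t)$, hazard $h(t)$.
• Exponential: $f(t) = \lambda\exp(-\lambda t)$; $S(t) = \exp(-\lambda t)$; $h(t) = \lambda$
• Weibull: $f(t) = \lambda k t^{k-1}\exp(-\lambda t^k)$; $S(t) = \exp(-\lambda t^k)$; $h(t) = \lambda k t^{k-1}$
• Logistic: $f(t) = \frac{e^{-(t-\mu)/\sigma}}{\sigma\left(1 + e^{-(t-\mu)/\sigma}\right)^2}$; $S(t) = \frac{e^{-(t-\mu)/\sigma}}{1 + e^{-(t-\mu)/\sigma}}$; $h(t) = \frac{1}{\sigma\left(1 + e^{-(t-\mu)/\sigma}\right)}$
• Log-logistic: $f(t) = \frac{\lambda k t^{k-1}}{(1 + \lambda t^k)^2}$; $S(t) = \frac{1}{1 + \lambda t^k}$; $h(t) = \frac{\lambda k t^{k-1}}{1 + \lambda t^k}$
• Normal: $f(t) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$; $S(t) = 1 - \Phi\left(\frac{t-\mu}{\sigma}\right)$; $h(t) = \frac{1}{\sqrt{2\pi}\sigma\left(1 - \Phi\left(\frac{t-\mu}{\sigma}\right)\right)}\exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$
• Log-normal: $f(t) = \frac{1}{\sqrt{2\pi}\sigma t}\exp\left(-\frac{(\log t-\mu)^2}{2\sigma^2}\right)$; $S(t) = 1 - \Phi\left(\frac{\log t-\mu}{\sigma}\right)$; $h(t) = \frac{1}{\sqrt{2\pi}\sigma t\left(1 - \Phi\left(\frac{\log t-\mu}{\sigma}\right)\right)}\exp\left(-\frac{(\log t-\mu)^2}{2\sigma^2}\right)$
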
Tobit Model
The Tobit model is one of the earliest attempts to extend linear regression with the Gaussian distribution to data analysis with censored observations.
In the Tobit model, a latent variable $y^*$ is introduced and it is assumed to depend linearly on $X$ as:
$$y^* = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$
where $\varepsilon$ is a normally distributed error term.
For the $i$-th instance, the observable variable $y_i$ will be $y_i^*$ if $y_i^* > 0$; otherwise it will be 0. This means that if the latent variable is above zero, the observed variable equals the latent variable, and it is zero otherwise.
The parameters in the model can be estimated with the maximum likelihood estimation (MLE) method.
J. Tobin, Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, 1958.
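Buckley-James Regression Method
The Buckley-James (BJ) regression is an AFT model.
The estimated target value:
$$\log T_i^* = \begin{cases} \log T_i & \text{if } \delta_i = 1 \\ E\left(\log T_i \mid \log T_i > \log C_i, X_i\right) & \text{if } \delta_i = 0 \end{cases}$$
The key point is to calculate $E(\log T_i \mid \log T_i > \log C_i, X_i)$:
$$E\left(\log T_i \mid \log T_i > \log C_i, X_i\right) = X_i\beta + \int_{\log C_i - X_i\beta}^{\infty}\varepsilon\cdot\frac{dF(\varepsilon)}{1 - F(\log C_i - X_i\beta)}$$
Rather than a selected closed-form theoretical distribution, the Kaplan-Meier (KM) estimation method is used to approximate $F(\cdot)$.

J. Buckley and I. James, Linear regression with censored data. Biometrika, 1979.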
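Buckley-James Regression Method
Least squares is used as the empirical loss function:
$$\min_{\beta}\;\frac{1}{2}\sum_{i=1}^{N}\left(\log T_i^* - X_i\beta\right)^2$$
where
$$\log T_i^* = \delta_i\log T_i + (1 - \delta_i)\left[X_i\beta + \int_{\log C_i - X_i\beta}^{\infty}\varepsilon\cdot\frac{dF(\varepsilon)}{1 - F(\log C_i - X_i\beta)}\right]$$
The Elastic-Net regularizer has also been used to penalize the BJ regression (EN-BJ) to handle high-dimensional survival data:
$$\min_{\beta}\;\frac{1}{2}\sum_{i=1}^{N}\left(\log T_i^* - X_i\beta\right)^2 + \lambda\left[\alpha\|\beta\|_1 + \frac{1}{2}(1-\alpha)\|\beta\|_2^2\right]$$
To estimate $\beta$ in the BJ and EN-BJ models, we just need to calculate $\log T_i^*$ based on the $\beta$ of the previous iteration and then minimize the least squares or penalized least squares objective via standard algorithms.

Wang, Sijian, et al. “Doubly Penalized Buckley-James Method for Survival Data with High-Dimensional Covariates.” Biometrics, 2008.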
Regularized Weighted Linear Regression

Impose a larger penalty in case 1 and a smaller penalty in case 2.

Y. Li, B. Vinzamuri, and C. K. Reddy, “Regularized Weighted Linear Regression for High-dimensional Censored Data“, SDM 2016.
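Weighted Residual Sum-of-Squares
• More weight to the censored instances whose estimated survival time is less than the censored time.
• Less weight to the censored instances whose estimated survival time is greater than the censored time.

Weighted residual sum-of-squares:
$$\frac{1}{2}\sum_{i=1}^{N} w_i\left(y_i - X_i\beta\right)^2$$
where the weight $w_i$ is defined as follows:
$$w_i = \begin{cases} 1 & \text{if } \delta_i = 1 \\ 1 & \text{if } \delta_i = 0 \text{ and } y_i > X_i\beta \\ 0 & \text{if } \delta_i = 0 \text{ and } y_i \le X_i\beta \end{cases}$$
[Figure: a demonstration of a linear regression model for a dataset with right-censored observations.]
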
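The weighting rule can be sketched as follows. Uncensored instances always contribute their squared residual, while a censored instance contributes only when the prediction falls below its censoring time, since the true survival time is known to be larger. The function name and the numbers are hypothetical, and the predictions are passed in directly instead of being computed from a fitted linear model.

```python
def weighted_rss(y, delta, predictions):
    """Weighted residual sum-of-squares for right-censored data."""
    total = 0.0
    for y_i, d_i, pred_i in zip(y, delta, predictions):
        if d_i == 1 or y_i > pred_i:
            weight = 1.0  # observed event, or censored with prediction below censoring time
        else:
            weight = 0.0  # censored and prediction already at or beyond the censoring time
        total += weight * (y_i - pred_i) ** 2
    return 0.5 * total

# Hypothetical observed times, event indicators, and model predictions
y = [5.0, 6.0, 9.0]
delta = [1, 0, 0]
predictions = [4.0, 8.0, 7.0]
print(weighted_rss(y, delta, predictions))
```

In this example the second instance is censored at 6 and predicted at 8, which is consistent with the censoring information, so it adds nothing to the loss; the result is 0.5 * (1 + 4) = 2.5.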
Self-Training Framework
Self-training: training the model by using its own predictions.
1. Train a base model.
2. Estimate the survival time.
3. If the estimated survival time is larger than the censored time, use it to approximate the survival time of the censored instance.
4. Update the training set and repeat from step 1.
Stop when the training dataset no longer changes.

Bayesian Survival Analysis
Penalized regression encodes assumptions via a regularization term, while the Bayesian approach encodes assumptions via a prior distribution.
Bayesian paradigm:
• Based on the observed data $D$, one can build a likelihood function $L(\theta \mid D)$ (likelihood estimator).
• Suppose $\theta$ is random and has a prior distribution denoted by $\pi(\theta)$.
• Inference concerning $\theta$ is based on the posterior distribution
$$\pi(\theta \mid D) = \frac{L(\theta \mid D)\,\pi(\theta)}{\int L(\theta \mid D)\,\pi(\theta)\,d\theta}$$
• $\pi(\theta \mid D)$ usually does not have an analytic closed form, so it requires methods like MCMC to sample from $\pi(\theta \mid D)$ and to estimate it.
• Posterior predictive distribution of a future observation vector $x$ given $D$:
$$\pi(x \mid D) = \int f(x \mid \theta)\,\pi(\theta \mid D)\,d\theta$$
where $f(x \mid \theta)$ denotes the sampling density function of $x$.

Ibrahim, Joseph G., Ming-Hui Chen, and Debajyoti Sinha. Bayesian survival analysis. John Wiley & Sons, 2005.
Bayesian Survival Analysis
Under the Bayesian framework, the lasso estimate can be viewed as a Bayesian posterior mode estimate under independent Laplace priors for the regression parameters.
Lee, Kyu Ha, Sounak Chakraborty, and Jianguo Sun. "Bayesian variable selection in semiparametric proportional hazards model for high dimensional survival data." The International Journal of Biostatistics 7.1 (2011): 1-32.

Similarly, based on the mixture representation of the Laplace distribution, the fused lasso prior and the group lasso prior can also be encoded with a similar scheme.
Lee, Kyu Ha, Sounak Chakraborty, and Jianguo Sun. "Survival prediction and variable selection with simultaneous shrinkage and grouping priors." Statistical Analysis and Data Mining: The ASA Data Science Journal 8.2 (2015): 114-127.

A similar approach can also be applied in the parametric AFT model.
Komarek, Arnost. Accelerated failure time models for multivariate interval-censored data with flexible distributional assumptions. PhD thesis, Katholieke Universiteit Leuven, Faculteit Wetenschappen, 2006.
Deep Survival Analysis
Deep Survival Analysis is a hierarchical generative approach to
survival analysis in the context of electronic health records (EHR).
Deep survival analysis models covariates and survival time in a
Bayesian framework.
It can easily handle missing covariates while modeling the survival time.
Deep exponential families (DEF) are a class of multi-layer probability
models built from exponential families. Therefore, they are capable
of modeling complex relationships and latent structure to build a joint
model for both the covariates and the survival times.
The output of the DEF network can be used to generate the
observed covariates and the time to failure.

R. Ranganath, A. Perotte, N. Elhadad, and D. Blei. "Deep survival analysis." Machine Learning for Healthcare, 2016.
Deep Survival Analysis

$x$ is the feature vector, which is assumed to be generated from a prior
distribution.
The Weibull distribution is used to model the survival time.
$a$ and $b$ are drawn from normal distributions; they are parameters related to
the survival time.
Given a feature vector $x$, the model makes predictions via the posterior
predictive distribution.
Machine Learning Methods
Basic ML Models
Survival Trees
Bagging Survival Trees
Random Survival Forest
Support Vector Regression
Deep Learning
Rank based Methods
Advanced ML Models
Active Learning
Multi-task Learning
Transfer Learning

Survival Tree
A survival tree is similar to a decision tree: it is built by recursively
splitting tree nodes. A node of a survival tree is considered
“pure” if all the patients in the node survive for an identical span of
time.

The logrank test is the most commonly used dissimilarity measure for
estimating the survival difference between two groups. For each
node, examine every possible split on each feature, and then
select the best split, i.e., the one that maximizes the survival difference
between the two child nodes.

LeBlanc, M. and Crowley, J. (1993). Survival Trees by Goodness of Split. Journal of the American Statistical
Association 88, 457–467.
Logrank Test
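The logrank test is obtained by constructing a (2 × 2) table at each distinct
death time and comparing the death rates between the two groups,
conditional on the number at risk in the groups. Let $t_1, \ldots, t_k$ represent the
ordered, distinct death times. At the $j$-th death time, let $d_{0j}$ and $d_{1j}$ be the deaths
among the $r_{0j}$ and $r_{1j}$ subjects at risk in the two groups, with $d_j = d_{0j} + d_{1j}$ and
$r_j = r_{0j} + r_{1j}$. We have the following:

$$\chi^2_{logrank} = \frac{\left[\sum_{j=1}^{k} \left(d_{0j} - r_{0j}\, d_j / r_j\right)\right]^2}{\sum_{j=1}^{k} \dfrac{r_{1j}\, r_{0j}\, d_j\, (r_j - d_j)}{r_j^2\, (r_j - 1)}}$$

- The numerator is the squared sum of deviations between the observed
and expected values. The denominator is the variance of the observed
number of deaths under the hypergeometric distribution (Patnaik, 1948).
- The test statistic, $\chi^2_{logrank}$, gets bigger as the differences between the
observed and expected values get larger, or as the variance gets smaller.
- It follows a $\chi^2$ distribution asymptotically under the null hypothesis.
Segal, Mark Robert. "Regression trees for censored data." Biometrics (1988): 35-47.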
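The logrank statistic described above can be sketched in a few lines. This is a minimal illustration, assuming two groups coded 0 and 1 and the standard hypergeometric variance; the function and variable names are hypothetical and not taken from the tutorial.

```python
import numpy as np


def logrank_statistic(time, event, group):
    """Two-group logrank chi-square statistic.

    time  : observed times (event or censoring time)
    event : 1 if the death was observed, 0 if censored
    group : 0 or 1, the group label of each subject
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    group = np.asarray(group, dtype=int)

    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(time[event == 1]):      # ordered distinct death times
        at_risk = time >= t
        r = at_risk.sum()                      # number at risk overall
        r0 = (at_risk & (group == 0)).sum()    # number at risk in group 0
        r1 = r - r0
        deaths = (time == t) & (event == 1)
        d = deaths.sum()                       # deaths at t overall
        d0 = (deaths & (group == 0)).sum()     # deaths at t in group 0
        observed_minus_expected += d0 - r0 * d / r
        if r > 1:                              # variance term is 0 when r == 1
            variance += r0 * r1 * d * (r - d) / (r ** 2 * (r - 1))
    return observed_minus_expected ** 2 / variance


# Group 0 dies early, group 1 dies late.
print(logrank_statistic([1, 2, 3, 4], [1, 1, 1, 1], [0, 0, 1, 1]))
```

In a survival tree, a statistic of this kind would be evaluated for every candidate split, and the split with the largest value kept.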
Bagging Survival Trees
[Diagram: Survival Tree + Bagging → Bagging Survival Trees]
- Draw B bootstrap samples from the original data.
- Grow a survival tree for each bootstrap sample based on all features,
recursively splitting each node using the feature that maximizes the survival
difference between daughter nodes.
- Compute the bootstrap aggregated survival function for a new observation
$x_{new}$.
[Diagram: the samples in the leaf node selected for the new observation in each tree (1st tree, ..., B-th tree) are used to build a K-M curve, which gives an aggregated estimator of $S(t \mid x_{new})$.]

Hothorn, Torsten, et al. "Bagging survival trees." Statistics in medicine 23.1 (2004): 77-91.
Random Survival Forests
[Diagram: Random Forests + Survival Tree → RSF]

1. Draw B bootstrap samples from the original data (on average, 63% of the data is in the bag and
37% is out-of-bag (OOB) data).
2. Grow a survival tree for each bootstrap sample. At each node, randomly
select candidate features and split the node using the candidate feature
that maximizes the survival difference between
daughter nodes.
3. Grow the tree to full size; each terminal node should have no fewer than
$d_0 > 0$ unique deaths.
4. Calculate a cumulative hazard function (CHF) for each tree. Average
to obtain the bootstrap ensemble CHF.
5. Using OOB data, calculate the prediction error for the OOB ensemble CHF.

H. Ishwaran, U. B. Kogalur, E. H. Blackstone and M. S. Lauer, “Random Survival Forests”. Annals of
Applied Statistics, 2008.
Random Survival Forests
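The cumulative hazard function (CHF) in random survival forests is
estimated via the Nelson-Aalen estimator:
$$\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}$$

where $t_{l,h}$ is the $l$-th distinct event time of the samples in leaf $h$, $d_{l,h}$ is the
number of events at $t_{l,h}$, and $Y_{l,h}$ is the number of individuals at risk at $t_{l,h}$.

OOB ensemble CHF ($H_e^{**}$) and bootstrap ensemble CHF ($H_e^{*}$):
$$H_e^{**}(t \mid x_i) = \frac{\sum_{b=1}^{B} I_{i,b}\, H_b^{*}(t \mid x_i)}{\sum_{b=1}^{B} I_{i,b}}, \qquad H_e^{*}(t \mid x_i) = \frac{1}{B} \sum_{b=1}^{B} H_b^{*}(t \mid x_i)$$

where $H_b^{*}(t \mid x_i)$ is the CHF of the node in the $b$-th bootstrap tree to which $x_i$ belongs.
$I_{i,b} = 1$ if $i$ is an OOB case for $b$; otherwise, $I_{i,b} = 0$. Therefore, the OOB
ensemble CHF is the average over the bootstrap samples for which $i$ is OOB, and
the bootstrap ensemble CHF is the average over all $B$ bootstrap samples.

O. O. Aalen, “Nonparametric inference for a family of counting processes”, Annals of Statistics 1978.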
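The two ingredients above, the Nelson-Aalen estimate inside a leaf and the averaging over trees, can be sketched as follows. This is a minimal illustration that assumes the per-tree CHF values have already been computed at a fixed time point; the function names and array layout are hypothetical.

```python
import numpy as np


def nelson_aalen(time, event, t):
    """Nelson-Aalen CHF estimate at time t from the samples of one leaf."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    chf = 0.0
    for tl in np.unique(time[(event == 1) & (time <= t)]):
        d = np.sum((time == tl) & (event == 1))   # events at tl
        y = np.sum(time >= tl)                    # individuals at risk at tl
        chf += d / y
    return chf


def ensemble_chf(tree_chf, oob_mask=None):
    """Average per-tree CHF values.

    tree_chf : array of shape (B, n), CHF of sample i in tree b at a fixed time
    oob_mask : optional array of shape (B, n), 1 where sample i is OOB for tree b.
               Without it, the bootstrap ensemble CHF (mean over all B trees) is
               returned; with it, the OOB ensemble CHF. Every sample is assumed
               to be OOB for at least one tree.
    """
    tree_chf = np.asarray(tree_chf, dtype=float)
    if oob_mask is None:
        return tree_chf.mean(axis=0)
    oob_mask = np.asarray(oob_mask, dtype=float)
    return (oob_mask * tree_chf).sum(axis=0) / oob_mask.sum(axis=0)


print(nelson_aalen([1, 2, 3, 4], [1, 0, 1, 1], t=3))
print(ensemble_chf([[0.2, 0.4], [0.6, 0.8]], oob_mask=[[1, 0], [1, 1]]))
```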
Support Vector Regression (SVR)
Once a model has been learned, it can be applied to a new instance $x$
through the prediction function $f(x)$.

$K(\cdot, \cdot)$ is a kernel, and the SVR algorithm can abstractly be
considered as a linear algorithm.

$\epsilon$: margin of error
C: regularization parameter
$\xi$: slack variables
Support Vector Approach for Censored Data
Interval targets: These are samples for which we have both an upper and a
lower bound on the target: the tuple $(x_i, l_i, u_i)$ with $l_i < u_i$.
As long as the output $f(x_i)$ is between $l_i$ and $u_i$, there is no empirical error.
A right-censored sample is written as $(x_i, l_i, +\infty)$: its survival time is
greater than $l_i \in \mathbb{R}$, but the upper bound is unknown.

[Figure: graphical representation of the loss functions $c(f(x_i), l_i, u_i)$ plotted against $f(x_i)$, in three panels: SVR loss, SVRC loss in general, and SVRC loss for right-censored samples.]

P. K. Shivaswamy, W. Chu, and M. Jansche. "A support vector approach to censored targets", ICDM 2007.
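The interval-target idea can be sketched as a loss that is zero inside the interval and grows linearly outside it. This is an illustrative simplification rather than the exact loss from the paper; the function name is hypothetical, and a right-censored sample is represented by an infinite upper bound.

```python
import math


def interval_loss(pred, lower, upper=math.inf):
    """Loss for an interval target [lower, upper].

    No empirical error while lower <= pred <= upper; otherwise the loss is
    the distance to the nearest bound. For a right-censored sample, leave
    upper at infinity so only predictions below the censoring time are
    penalized.
    """
    return max(0.0, lower - pred, pred - upper)


print(interval_loss(5.0, 3.0, 8.0))   # inside the interval
print(interval_loss(10.0, 3.0, 8.0))  # above the upper bound
print(interval_loss(100.0, 3.0))      # right-censored at 3, predicted later
```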
Support Vector Regression for Censored Data 
A graphical representation of the SVRc parameters for events.
- Smaller acceptable margin when the predicted
value is greater than the event time.
- Greater penalty rate when the predicted value is
greater than the event time.
- Predicting that a high-risk patient will survive longer is
more dangerous than predicting that a low-risk patient
will survive for a shorter time.
A graphical representation of the SVRc parameters for censored data.
- Greater acceptable margin when the predicted
value is greater than the censored time.
- Lower penalty rate when the predicted value is
greater than the censored time.
- The possible survival time of censored instances
should be greater than or equal to the corresponding
censored time.
F. M. Khan and V. B. Zubek. "Support vector regression for censored data (SVRc): a novel tool for survival
analysis." ICDM 2008
Neural Network Model
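[Figure: network with an input layer, one hidden layer, and an output layer that feeds the Cox proportional hazards model; softmax function.]

The hidden layer takes the softmax function as its activation function.

$$L(\theta) = \prod_{i:\,\delta_i = 1} \frac{\exp\left(g(x_i, \theta)\right)}{\sum_{j:\,T_j \ge T_i} \exp\left(g(x_j, \theta)\right)}$$

$g(x, \theta)$ is no longer a linear function.

D. Faraggi and R. Simon. "A neural network model for survival data." Statistics in medicine, 1995.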
Deep Survival: A Deep Cox Proportional Hazards Network
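[Figure: network with an input layer, several hidden layers, and an output layer that feeds the Cox proportional hazards model.]

Uses modern deep learning techniques such as the Rectified
Linear Unit (ReLU) activation function, batch normalization, and dropout.

$$L(\theta) = \prod_{i:\,\delta_i = 1} \frac{\exp\left(g(x_i, \theta)\right)}{\sum_{j:\,T_j \ge T_i} \exp\left(g(x_j, \theta)\right)}$$

$g(x, \theta)$ is no longer a linear function.

Katzman, Jared, et al. "Deep Survival: A Deep Cox Proportional Hazards Network." arXiv, 2016.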
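Networks of this kind are typically trained by minimizing the negative log of the Cox partial likelihood, with the linear predictor replaced by the network output. The sketch below is not the authors' implementation: it assumes the network outputs (risk scores) are already available as an array, ignores tie corrections, and uses a hypothetical function name.

```python
import numpy as np


def neg_log_partial_likelihood(risk, time, event):
    """Negative log Cox partial likelihood for given risk scores.

    risk  : network output (or linear predictor) for each subject
    time  : observed times
    event : 1 if the event was observed, 0 if censored
    """
    risk = np.asarray(risk, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)

    total = 0.0
    for i in np.where(event == 1)[0]:
        in_risk_set = time >= time[i]          # subjects still at risk at time[i]
        total += risk[i] - np.log(np.sum(np.exp(risk[in_risk_set])))
    return -total


# With equal risk scores the loss depends only on the risk-set sizes.
print(neg_log_partial_likelihood([0.0, 0.0, 0.0], [1, 2, 3], [1, 1, 1]))
```

A production version would compute the log-sum-exp in a numerically stable way and differentiate it with an automatic differentiation library.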
Deep Convolutional Neural Network

$$L(f) = \prod_{i:\,\delta_i = 1} \frac{\exp\left(f(x_i)\right)}{\sum_{j:\,T_j \ge T_i} \exp\left(f(x_j)\right)}$$

$x_i$: image patch from the $i$-th patient
$f$: the deep model, no longer a linear function

Pros: Directly builds a deep model for survival analysis from image input

X. Zhu, J. Yao, and J. Huang. "Deep convolutional neural network for survival analysis with pathological images", BIBM 2016.
Ranking based Models
The C-index is a pairwise ranking-based evaluation metric. Boosting the
concordance index (BoostCI) is an approach that aims at directly optimizing
the C-index.

The definition of the C-index used here relies on the Kaplan-Meier estimator, and because it
contains the indicator function $I(\cdot)$, it is non-smooth and non-convex, which is hard to optimize.

In BoostCI, a sigmoid function is used to provide a smooth approximation
of the indicator function.

Therefore, we obtain a smoothed version of the C-index, in which each pair of instances carries a weight.

A. Mayr and M. Schmid, “Boosting the concordance index for survival data–a unified framework to derive and evaluate
biomarker combinations”, PloS one, 2014.
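The smoothing idea can be sketched by replacing the indicator on each comparable pair with a sigmoid. This is a simplified illustration, not the BoostCI estimator itself: it assumes equal pair weights instead of the Kaplan-Meier based weights, assumes that a larger marker value means shorter survival, and uses a hypothetical function name and smoothing parameter `sigma`.

```python
import numpy as np


def smoothed_c_index(marker, time, event, sigma=0.1):
    """Sigmoid-smoothed concordance index with equal pair weights.

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]. The indicator I(marker[i] > marker[j]) is replaced by
    sigmoid((marker[i] - marker[j]) / sigma).
    """
    marker = np.asarray(marker, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)

    numerator = 0.0
    n_pairs = 0
    for i in range(len(time)):
        if event[i] != 1:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:
                diff = (marker[i] - marker[j]) / sigma
                numerator += 1.0 / (1.0 + np.exp(-diff))
                n_pairs += 1
    return numerator / n_pairs


print(smoothed_c_index([3.0, 2.0, 1.0], [1, 2, 3], [1, 1, 1], sigma=0.01))
```

Smaller values of `sigma` bring the smoothed value closer to the ordinary C-index, at the cost of steeper gradients.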
BoostCI Algorithm
The component-wise gradient boosting algorithm is used to
optimize the smoothed C-index.
Learning Step:
1. Initialize the estimate of the marker combination $\hat{\eta}^{[0]}$ with offset values,
set the maximum number of iterations ($m_{stop}$), and set $m = 1$.
2. Compute the negative gradient vector of the smoothed C-index.
3. Fit the negative gradient vector separately to each of the components of
$X$ via the base-learners $b_j(x_{:,j})$.
4. Select the component that best fits the negative gradient vector; the
index of the selected base-learner is denoted as $j^*$.
5. Update the marker combination for this component:
$\hat{\eta}^{[m]} \leftarrow \hat{\eta}^{[m-1]} + \nu \cdot b_{j^*}(x_{:,j^*})$.
6. Stop if $m = m_{stop}$. Otherwise, increase $m$ by one and go back to step 2.
Active Learning for Survival Data
Objective: Identify the representative samples in the data.
Outcome: Allow the model to select the instances
to be included. This can minimize the training
cost and complexity of the model and obtain
good generalization performance for censored
data.

The sampling method chooses the
instance that maximizes the following
criterion:
$$X^* = \arg\max_{X \in pool} \sum_{k=1}^{K} h(T_k \mid X) \left\| \frac{\partial L_X(\beta)}{\partial \beta} \right\|$$

Active learning based framework for survival regression using a novel
model-discriminative, gradient-based sampling procedure.

Helps clinicians to understand more about the most representative patients.

B. Vinzamuri, Y. Li, C. Reddy, "Active Learning Based Survival Regression for Censored Data", CIKM 2014.
Active Learning with Censored Data
[Flowchart: EHR features (X), censored status (δ), and time to event (T) form the training data → train a Cox model using the partial log-likelihood L(β), a column-wise kernel matrix (Ke), and elastic net regularization → compute the gradient δL(β)/δβ → gradient-based discriminative sampling from the unlabelled pool (Pool) → labelling request for the selected instance to the domain expert (oracle) → update the training data. At the end of the active learning rounds, output the survival AUC and RMSE.]
Multi‐task Learning Formulation
Advantage: The model is general; it makes no assumption on either the survival
time or the survival function.

Patient   Y:  1  2  3  4  5  6  7  8  9 10 11 12   (month)
1             1  1  1  1  1  1  1  1  1  0  0  0
2             1  1  1  1  1  ?  ?  ?  ?  ?  ?  ?
3             1  1  1  1  1  1  1  1  1  1  ?  ?
4             1  1  1  0  0  0  0  0  0  0  0  0
1: Alive   0: Death   ?: Unknown

- Similar tasks: All the binary classifiers aim at predicting the life status
of each patient.
- Temporal smoothness: For each patient, the life statuses of adjacent
time intervals are mostly the same.
- Not reversible: Once a patient is dead, the patient cannot become alive
again.
Multi‐task Learning Formulation
Y    1  2  3  4  5  6  7  8  9 10 11 12
D1   1  1  1  1  1  1  1  1  1  0  0  0
D2   1  1  1  1  1  ?  ?  ?  ?  ?  ?  ?
D3   1  1  1  1  1  1  1  1  1  1  ?  ?
D4   1  1  1  0  0  0  0  0  0  0  0  0

W    1  2  3  4  5  6  7  8  9 10 11 12
D1   1  1  1  1  1  1  1  1  1  1  1  1
D2   1  1  1  1  1  0  0  0  0  0  0  0
D3   1  1  1  1  1  1  1  1  1  1  0  0
D4   1  1  1  1  1  1  1  1  1  1  1  1

How to deal with the “?” in Y
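The proposed objective function:
$$\min_{XB \in \mathcal{P}} \; \frac{1}{2} \left\| \Pi_W(Y - XB) \right\|_F^2 + \frac{\mu}{2} \|B\|_F^2 + \lambda \|B\|_{2,1}$$
where $\Pi_W$ handles the censored entries:
$$\Pi_W(U)_{ij} = \begin{cases} U_{ij} & \text{if } W_{ij} = 1 \\ 0 & \text{if } W_{ij} = 0 \end{cases}$$

Similar tasks: select some common features across all the tasks via the $\ell_{2,1}$-norm.

Temporal smoothness & irreversibility:
$Y$ and $XB$ should follow a non-negative non-increasing list structure
$$\mathcal{P} = \{ Y \ge 0,\; Y_{ij} \ge Y_{il} \mid j \le l,\; \forall j = 1, \ldots, k,\; \forall l = 1, \ldots, k \}$$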
Yan Li, Jie Wang, Jieping Ye and Chandan K. Reddy “A Multi-Task Learning Formulation for Survival Analysis". KDD 2016
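The target matrix Y and the indicator matrix W shown above can be built directly from each patient's observed time and censoring status. The sketch below is an illustration under assumptions inferred from the four-patient example: an event in month t means the patient counts as dead from interval t onward, a patient censored at month c is known to be alive through interval c, and unknown entries of Y are stored as 0 because W masks them. The function name is hypothetical.

```python
import numpy as np


def build_targets(time, status, n_intervals):
    """Build the multi-task target matrix Y and the indicator matrix W.

    time        : observed month (event or censoring) for each patient
    status      : 1 if the event (death) was observed, 0 if censored
    n_intervals : number of time intervals (tasks)
    """
    n = len(time)
    Y = np.zeros((n, n_intervals), dtype=int)
    W = np.ones((n, n_intervals), dtype=int)
    for i, (t, s) in enumerate(zip(time, status)):
        if s == 1:
            Y[i, :t - 1] = 1      # alive before the event interval, dead afterwards
        else:
            Y[i, :t] = 1          # alive up to and including the censoring interval
            W[i, t:] = 0          # status unknown after censoring
    return Y, W


# The four patients D1..D4 from the example.
Y, W = build_targets(time=[10, 5, 10, 4], status=[1, 0, 0, 1], n_intervals=12)
print(Y)
print(W)
```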
Multi‐task Learning Formulation
1
min Π ,
∈ 2 2
Subject to:
ADMM:
min Π

Solving the non‐negative non‐increasing list structure by max‐heap projection
min ,
∈ 2 2
Solving the $\ell_{2,1}$-norm problem by using the FISTA algorithm

An adaptive variant model


With too many time intervals, the non-negative non-increasing list constraint becomes so
strong that it will overfit the model. Relaxation of the above model:
1
min Π ,
∈ 2 2
Multi‐Task Logistic Regression
Model the survival distribution via a sequence of dependent regressions.
Consider a simpler classification task of predicting whether an individual
will survive for more than $t$ months.

Consider a series of time points $(t_1, t_2, t_3, \ldots, t_m)$; we can get a series of
logistic regression models.

The model should enforce the dependency of the outputs by predicting the
survival status of a patient at each of the time snapshots. Let
$y = (y_1, y_2, y_3, \ldots, y_m)$, where $y_i = 0$ (no death event yet) and $y_i = 1$ (death).

C. Yu et al. "Learning patient-specific cancer survival distributions as a sequence of dependent regressors." NIPS 2011.
Multi‐Task Logistic Regression

A very similar idea as cox model:


exp ∑ :, exp ∑ :, with 1 ∀
1, … , .
The exponent is the score of the sequence with the event occurring in the interval
$[t_k, t_{k+1})$. Unlike the Cox model, however, the coefficients differ across
time intervals, so there is no proportional hazards assumption.
For censored instances:

The numerator is the score of the sequences in which death happens after the censoring time.

A regularization term on the differences between the coefficients of adjacent
time intervals is added to the model to achieve temporal smoothness.
Knowledge Transfer 
Transfer learning models aim at using auxiliary data to
augment learning when there is an insufficient number of training
samples in the target dataset.
[Figure: in traditional machine learning, a separate learning system is trained on each set of training items. In transfer learning, knowledge from the learning system of a similar but not identical task is passed to the learning system of the target task.]
Transfer Learning for Survival Analysis
[Figure: history information is used to predict how long until the event of interest. Source data (e.g., TCGA) define the source task and target data define the target task (labels: X, B).]
Labeling the time-to-event data is very time consuming!
• Both source and target tasks are survival analysis problems.
• There exist some features that are important across all correlated diseases.
Yan Li, Lu Wang, Jie Wang, Jieping Ye and Chandan K. Reddy "Transfer Learning for Survival Analysis via Efficient L2,1-norm
Regularized Cox Regression". ICDM 2016.
Transfer‐Cox Model
The Proposed objective function:
1
min ,
, 2

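where $\beta_s$, $\beta_t$, $l(\beta_s)$, and $l(\beta_t)$ denote the coefficient
vectors and negative partial log-likelihoods,
$$l(\beta) = -\sum_{i:\,\delta_i = 1} \left[ x_i \beta - \log \sum_{j:\,T_j \ge T_i} \exp(x_j \beta) \right],$$
of the source task and target task, respectively, and
$B = (\beta_s, \beta_t)$.
• The $\ell_{2,1}$ norm encourages group sparsity; therefore, it
selects some common features across all the tasks.
• We propose a FISTA-based algorithm to solve the
problem with linear scalability.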
Using Strong Rule in Learning Process
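Theorem: Given a sequence of parameter values
$\lambda_{max} = \lambda_0 > \lambda_1 > \cdots > \lambda_m$, suppose the
optimal solution $\hat{B}(k-1)$ at $\lambda_{k-1}$ is
known. Then for any $k = 1, 2, \ldots, m$,
the $j$-th feature will be discarded if
$$\left\| g_j(\hat{B}(k-1)) \right\|_2 < 2\lambda_k - \lambda_{k-1}$$
and the corresponding coefficient $\hat{B}_j(k)$
will be set to 0.

Learning process:
1. Let B = 0 and calculate $\lambda_{max} = \lambda_0$.
2. Let k = k + 1 and calculate $\lambda_k$.
3. Discard inactive features based on the theorem.
4. Update the result using the FISTA algorithm.
5. Check the KKT condition. If it is violated, update the selected active features and return to step 4.
6. When all selected features obey the KKT condition, record the optimal solution and return to step 2.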
Summary of Machine Learning Methods
Basic ML Models
Survival Trees
Bagging Survival Trees
Random Survival Forest
Support Vector Regression
Deep Learning
Rank based Methods
Advanced ML Models
Active Learning
Multi-Task Learning
Transfer Learning
Taxonomy of Survival Analysis Methods
Survival Analysis Methods
- Statistical Methods
  - Non-Parametric
    - Kaplan-Meier
    - Nelson-Aalen
    - Life-Table
  - Semi-Parametric
    - Cox Regression
      - Basic Cox-PH
      - Penalized Cox: Lasso-Cox, Ridge-Cox, EN-Cox, OSCAR-Cox
      - Time-Dependent Cox
      - Cox Boost
  - Parametric
    - Linear Regression
      - Tobit
      - Buckley James
      - Penalized Regression: Weighted Regression, Structured Regularization
    - Accelerated Failure Time
- Machine Learning
  - Survival Trees
  - Bayesian Methods: Naïve Bayes, Bayesian Network
  - Neural Network
  - Support Vector Machine
  - Ensemble: Random Survival Forests, Bagging Survival Trees
  - Advanced Machine Learning: Active Learning, Transfer Learning, Multi-Task Learning
- Related Topics
  - Early Prediction
  - Data Transformation: Uncensoring, Calibration
  - Complex Events: Competing Risks, Recurrent Events
Related Topics
Early Prediction

Data Transformation

Uncensoring

Calibration

Complex Events

Competing Risks

Recurrent Events

Early Stage Event Prediction
Collecting data for survival analysis is very “time” consuming.

[Figure: event timelines of subjects S1 to S6 along a time axis marked with $t_c$ and $t_f$.]

Any existing survival model can predict only until $t_c$.
Develop a Bayesian approach for early stage prediction.
M. J Fard, P. Wang, S. Chawla, and C. K. Reddy, “A Bayesian perspective on early stage event prediction in longitudinal data”,
TKDE 2016.
Bayesian Approach

Naïve Bayes (NB): $\prod_{j=1}^{m} P(x_j \mid y(t_c) = 1)$
Tree-Augmented Naïve Bayes (TAN): $\prod_{j=1}^{m} P(x_j \mid y(t_c) = 1, x_{p(j)})$
Bayesian Networks (BN): $\prod_{j=1}^{m} P(x_j \mid y(t_c) = 1, Pa(x_j))$

Probability of event occurrence:
$$P(y(t_f) = 1 \mid x, t \le t_f) = \frac{\text{Prior} \times \text{Likelihood}}{P(x, t \le t_f)}$$

Extrapolation of prior:
Weibull: $F(t_c) = 1 - e^{-(t_c / b)^a}$
Log-logistic: $F(t_c) = \dfrac{1}{1 + (t_c / b)^{-a}}$
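The prior extrapolation can be sketched with the two distribution functions above: fit the parameters on the data observed up to $t_c$, then evaluate the same function at $t_f$. The sketch assumes the parameterization in which $a$ acts as a shape and $b$ as a scale parameter; the function names and the parameter values in the example are hypothetical, and the fitting step is not shown.

```python
import math


def weibull_cdf(t, a, b):
    """Weibull distribution function F(t) = 1 - exp(-(t / b) ** a)."""
    return 1.0 - math.exp(-((t / b) ** a))


def log_logistic_cdf(t, a, b):
    """Log-logistic distribution function F(t) = 1 / (1 + (t / b) ** (-a))."""
    return 1.0 / (1.0 + (t / b) ** (-a))


# Hypothetical parameters fitted on events observed until t_c = 2,
# then extrapolated to the end of the study t_f = 6.
a, b = 1.5, 4.0
for name, cdf in (("Weibull", weibull_cdf), ("Log-logistic", log_logistic_cdf)):
    print(name, cdf(2.0, a, b), cdf(6.0, a, b))
```

The extrapolated value at $t_f$ would then play the role of the prior probability of event occurrence in the Bayes rule above.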
Early Stage Prediction
[Figure: accuracy (y-axis, 0.7 to 0.9) versus percentage of available event occurrence information (x-axis, 20% to 100%) for Cox, LR, RF, NB, TAN, BN, ESP_NB, ESP_TAN, and ESP_BN.]
Data Transformation
Two data transformation techniques are useful
for data pre-processing in survival analysis:
Uncensoring approach
Calibration
They transform the data into a more conducive form so that
other survival-based algorithms (or sometimes even standard
algorithms) can be applied effectively.
Uncensoring Approach
The censored instances carry partially informative
label information, which provides the possible range of
the corresponding true response (survival time).
Such censored data have to be handled with special
care within any machine learning method in order to
make good predictions.
Two naive ways of handling such censored data:
Deleting the censored instances.
Treating censoring as event-free.
Uncensoring Approach I
For each censored instance, estimate the probability of event and the probability
of being censored (considering censoring as a new event) using the Kaplan-Meier
estimator. Assign a new class label based on these probability values.

[Diagram: the Kaplan-Meier product over the observed times gives the probability of survival and, with censoring treated as the event, the probability of un-censoring. One minus each of these gives the probability of event and the probability of censoring. The two probabilities are compared: Yes → Event; No → Event-free.]

M. J Fard, P. Wang, S. Chawla, and C. K. Reddy, “A bayesian perspective on early stage event prediction in longitudinal data”,
TKDE 2016.
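A minimal sketch of this relabeling step is given below. The decision rule is an assumption, since the exact comparison is not spelled out in the text: a censored instance is labeled as an event when its Kaplan-Meier probability of event at the censoring time exceeds its probability of censoring. The function names are hypothetical.

```python
def km_survival(time, event, t):
    """Kaplan-Meier estimate of the survival probability at time t."""
    survival = 1.0
    event_times = sorted({ti for ti, e in zip(time, event) if e == 1 and ti <= t})
    for tj in event_times:
        d = sum(1 for ti, e in zip(time, event) if ti == tj and e == 1)
        n = sum(1 for ti in time if ti >= tj)      # number at risk at tj
        survival *= 1.0 - d / n
    return survival


def uncensor_labels(time, event):
    """Assign a binary label (1 = event, 0 = event-free) to every instance."""
    censoring = [1 - e for e in event]             # censoring treated as the event
    labels = []
    for t, e in zip(time, event):
        if e == 1:
            labels.append(1)
            continue
        p_event = 1.0 - km_survival(time, event, t)
        p_censoring = 1.0 - km_survival(time, censoring, t)
        # Assumed rule: label as event when the event is the more probable outcome.
        labels.append(1 if p_event > p_censoring else 0)
    return labels


print(uncensor_labels([1, 2, 3, 4, 5], [1, 1, 1, 0, 1]))
print(uncensor_labels([1, 2, 3], [0, 1, 1]))
```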
Uncensoring Approach II
Group the instances in the given data into three categories:
(i) Instances which experience the event of interest during the
observation are labeled as event.
(ii) Instances whose censored time is later than a predefined time
point are labeled as event-free.
(iii) Instances whose censored time is earlier than a predefined
time point:
A copy of these instances is labeled as event.
Another copy of the same instances is labeled as event-free.
These instances are weighted by a marginal probability of event
occurrence estimated by the Kaplan-Meier method.

B. Zupan, J. Demsar, M. W. Kattan, R. J. Beck, and I. Bratko, “Machine learning for survival analysis: a case study on recurrence
of prostate cancer”, Artificial intelligence in medicine, 2000.
Calibration
Motivation
Inappropriately labeled censored instances in survival data cannot
provide much information to the survival algorithm.
Censoring that depends on the covariates may lead to some bias
in standard survival estimators.
Approach - Regularized inverse covariance based imputed censoring
By imputing an appropriate label value for each censored instance, a
new representation of the original survival data can be learned
effectively.
It has the ability to capture correlations between censored
instances and correlations between similar features.
It estimates the calibrated time-to-event values by exploiting row-wise
and column-wise correlations among censored instances for
imputing them.
B. Vinzamuri, Y. Li, and C. K Reddy, “Pre-processing censored survival data using inverse covariance matrix based
calibration”, TKDE 2017.
Complex Events
Until now, the discussion has been primarily focused on
survival problems in which each instance can experience only
a single event of interest.
However, in many real-world domains, each instance may
experience different types of events and each event may
occur more than once during the observation time period.

Since this scenario is more complex than the survival


problems discussed so far, we consider them to be complex
events.
Competing risks
Recurrent events

Stratified Cox Model
The stratified Cox model is a modification of the regular Cox model
which allows for control by stratification of the predictors which do
not satisfy the PH assumption in the Cox model.
Variables $Z_1, Z_2, \ldots, Z_k$ do not satisfy the PH assumption.
Variables $X_1, X_2, \ldots, X_p$ satisfy the PH assumption.

Create a single new variable $Z^*$:
(1) categorize each $Z_i$; (2) form all the possible combinations of categories;
(3) the strata are the categories of $Z^*$.

The general stratified Cox model will be:
$$h_g(t, X) = h_{0g}(t) \exp\left[\beta_1 X_1 + \cdots + \beta_p X_p\right]$$
The baseline hazard $h_{0g}(t)$ can be different for each stratum; the coefficients $\beta_1, \ldots, \beta_p$ are the same for each stratum.

where $g = 1, 2, \cdots, k^*$ indexes the strata defined from $Z^*$.
The coefficients are estimated by maximizing the partial likelihood
function obtained by multiplying the likelihood functions for each stratum.
Competing Risks
Competing risks exist only in survival problems with
more than one possible event of interest, where only one event
will occur at any given time.
[Diagram: Alive → Death, through Kidney Failure, Heart Disease, Stroke, or Other Diseases]

In this case, competing risks are the events that prevent an
event of interest from occurring, which is different from
censoring.
In the case of censoring, the event of interest still occurs at
a later time, while with a competing risk the event of interest is impeded.
Methods: Cumulative Incidence Curve (CIC) and Lunn-McNeil (LM)
Cumulative Incidence Curve (CIC)
The cumulative incidence curve is one of the main approaches
for competing risks; it estimates the marginal probability of
each event $c$. The CIC is defined as

$$CIC_c(t) = \sum_{j:\,t_j \le t} \hat{S}(t_{j-1})\, \hat{h}_c(t_j) = \sum_{j:\,t_j \le t} \hat{S}(t_{j-1})\, \frac{d_{cj}}{n_j}$$

where
$\hat{h}_c(t_j)$ represents the estimated hazard at time $t_j$ for event $c$.
$d_{cj}$ is the number of events for the event $c$ at $t_j$.
$n_j$ denotes the number of instances who are at risk of
experiencing events at $t_j$.
$\hat{S}(t_{j-1})$ denotes the survival probability at the last time point $t_{j-1}$.

H. Putter, M. Fiocco, and R. B. Geskus, “Tutorial in biostatistics: competing risks and multi-state models”, Statistics in
medicine, 2007.
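The cumulative incidence estimate can be sketched as a running sum in which the overall Kaplan-Meier survival just before each event time multiplies the cause-specific hazard. This is a minimal illustration; the function name and the coding of `cause` (0 for censored, 1..C for the event types) are assumptions made for the example.

```python
def cumulative_incidence(time, cause, t, event_of_interest):
    """Cumulative incidence of one event type at time t.

    time  : observed times
    cause : 0 if censored, otherwise the type (1..C) of the observed event
    """
    cic = 0.0
    survival = 1.0    # overall survival just before the current event time
    event_times = sorted({ti for ti, c in zip(time, cause) if c != 0 and ti <= t})
    for tj in event_times:
        n = sum(1 for ti in time if ti >= tj)                         # at risk
        d_all = sum(1 for ti, c in zip(time, cause) if ti == tj and c != 0)
        d_c = sum(1 for ti, c in zip(time, cause)
                  if ti == tj and c == event_of_interest)
        cic += survival * d_c / n          # S(t_{j-1}) times cause-specific hazard
        survival *= 1.0 - d_all / n        # update overall survival to S(t_j)
    return cic


time, cause = [1, 2, 3, 4], [1, 2, 1, 0]
print(cumulative_incidence(time, cause, 4, event_of_interest=1))
print(cumulative_incidence(time, cause, 4, event_of_interest=2))
```

At any time point, the cumulative incidences of all event types and the overall survival probability add up to one.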
Lunn‐McNeil (LM)
Lunn-McNeil fits a single Cox PH model which considers all the events
($E_1, E_2, \ldots, E_C$) in competing risks rather than separate models for each
event.
The LM approach is implemented using augmented data, in which a dummy
variable is created for each event to distinguish different competing risks.

The augmented data for the $i$-th subject at time $t_i$:

ID   Time   Status   D_1  D_2  …  D_C   X_1   …  X_p
i    t_i    e_1      1    0    …  0     x_i1  …  x_ip
i    t_i    e_2      0    1    …  0     x_i1  …  x_ip
…    …      …        …    …    …  …     …     …  …
i    t_i    e_C      0    0    …  1     x_i1  …  x_ip

Status: only one of $e_1, \ldots, e_C$ equals 1. $D_1, \ldots, D_C$: dummy variables. $X_1, \ldots, X_p$: features.

M. Lunn and D. McNeil, “Applying Cox regression to competing risks”, Biometrics, 1995.
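The data augmentation step can be sketched as follows: each subject is replicated once per event type, the status is set to 1 only in the row of the event that actually occurred, and a one-hot dummy vector marks the event type of the row. The function name, the dictionary layout, and the coding of `cause` (0 for censored, 1..C for the event types) are assumptions made for the example.

```python
def lunn_mcneil_augment(ids, time, cause, features, n_event_types):
    """Build Lunn-McNeil style augmented rows for competing risks."""
    rows = []
    for sid, t, c, x in zip(ids, time, cause, features):
        for k in range(1, n_event_types + 1):
            rows.append({
                "id": sid,
                "time": t,
                "status": 1 if c == k else 0,   # 1 only for the event that occurred
                "dummies": [1 if m == k else 0 for m in range(1, n_event_types + 1)],
                "features": list(x),
            })
    return rows


# One subject with event type 2 and one censored subject, three event types.
for row in lunn_mcneil_augment(["a", "b"], [5.0, 7.0], [2, 0],
                               [[0.1, 1.0], [0.3, 0.0]], n_event_types=3):
    print(row)
```

A single Cox model would then be fitted on the augmented rows, typically with interactions between the dummy variables and the features.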
Recurrent Events
In many application domains, the event of interest can occur
more than once for each instance during the observation time
period.

In survival analysis, we refer to such events as recurrent events,
which are different from competing risks.
If all the recurring events for each instance are of the same
type:
- Method: counting process (CP) algorithm.
If there are different types of events, or the order of the
events is the main goal:
- Method: stratified Cox (SC) approaches,
including stratified CP, Gap Time, and Marginal approaches.
Software Resources
Algorithm                               Software  Language  Link
Kaplan-Meier, Nelson-Aalen, Life-Table  survival  R         https://cran.r-project.org/web/packages/survival/index.html
Basic Cox, TD-Cox                       survival  R         https://cran.r-project.org/web/packages/survival/index.html
Lasso-Cox, Ridge-Cox, EN-Cox            fastcox   R         https://cran.r-project.org/web/packages/fastcox/index.html
Oscar-Cox                               RegCox    R         https://github.com/MLSurvival/RegCox
CoxBoost                                CoxBoost  R         https://cran.r-project.org/web/packages/CoxBoost/
Tobit                                   survival  R         https://cran.r-project.org/web/packages/survival/index.html
BJ                                      bujar     R         https://cran.r-project.org/web/packages/bujar/index.html
AFT                                     survival  R         https://cran.r-project.org/web/packages/survival/index.html
Software Resources
Algorithm                     Software         Language  Link
Bayesian Methods              BMA              R         https://cran.r-project.org/web/packages/BMA/index.html
RSF                           randomForestSRC  R         https://cran.r-project.org/web/packages/randomForestSRC/
BST                           ipred            R         https://cran.r-project.org/web/packages/ipred/index.html
Boosting                      mboost           R         https://cran.r-project.org/web/packages/mboost/
Active Learning               RegCox           R         https://github.com/MLSurvival/RegCox
Transfer Learning             TransferCox      C++       https://github.com/MLSurvival/TransferCox
Multi-Task Learning           MTLSA            Matlab    https://github.com/MLSurvival/MTLSA
Early Prediction, Uncensoring ESP              R         https://github.com/MLSurvival/ESP
Calibration                   survutils        R         https://github.com/MLSurvival/survutils
Competing Risks               survival         R         https://cran.r-project.org/web/packages/survival/index.html
Recurrent Events              survrec          R         https://cran.r-project.org/web/packages/survrec/
Acknowledgements
Graduate Students
Ping Wang, Bhanu Vinzamuri, Mahtab Fard, Vineeth Rakesh

Collaborators
Jieping Ye (Univ. of Michigan), Sanjay Chawla (Univ. of Sydney), Charu Aggarwal (IBM Research), Naren Ramakrishnan (Virginia Tech)

Funding Agencies
Thank You
Questions and Comments

Feel free to email questions or suggestions to

[email protected]

http://www.cs.vt.edu/~reddy/