Machine Learning For Survival Analysis
Chandan K. Reddy
Dept. of Computer Science, Virginia Tech
http://www.cs.vt.edu/~reddy
Yan Li
Dept. of Computational Medicine and Bioinformatics, Univ. of Michigan, Ann Arbor
Tutorial Outline
Basic Concepts
Statistical Methods
Related Topics
Healthcare
Demographics: Age, Gender, Race
Comorbidities: Hypertension, Diabetes, CKD
Laboratory: Hemoglobin, Blood count, Glucose
Procedures: Hemodialysis, Contrast dye, Catheterization
Medications: ACE inhibitor, Dopamine, Milrinone
Event Prediction Model
IMPACT: Lower healthcare costs; Improve quality of life
Mining Events in Longitudinal Data
Classification Problem:
Regression Problem:
Can predict the time of event
– loss of data
[Figure: subject timelines over time points 1 to 12; legend: Death, Dropout/Censored, Other Events]
Ping Wang, Yan Li, Chandan K. Reddy, “Machine Learning for Survival Analysis: A Survey”. ACM Computing Surveys (under revision), 2017.
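Problem Statement
For a given instance $i$, represented by a triplet $(X_i, \delta_i, y_i)$:
$X_i$ is the feature vector;
$\delta_i$ is the binary event indicator, i.e., $\delta_i = 1$ for an uncensored instance and $\delta_i = 0$ for a censored instance;
$y_i$ denotes the observed time and is equal to the survival time $T_i$ for an uncensored instance and the censoring time $C_i$ for a censored instance, i.e.,
$$y_i = \begin{cases} T_i & \text{if } \delta_i = 1 \\ C_i & \text{if } \delta_i = 0 \end{cases}$$
Note for $T_i$:
The value of $T_i$ will be both non-negative and continuous.
$T_i$ is latent for censored instances.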
Education
Demographics: Age, Gender, Race/Ethnicity
Financial: Cash amount, Income, Scholarships
Pre-enrollment: High school GPA, ACT scores, Graduation age
Enrollment: Transfer credits, College, Major
Semester: Semester GPA, % passed, % dropped
Event Prediction Model
IMPACT: Educated Society; Better Future
Event Prediction Model
IMPACT: Improve local economy; Successful businesses
Y. Li, V. Rakesh, and C. K. Reddy, "Project Success Prediction in Crowdfunding Environments", WSDM 2016.
Other Applications
Reliability: Device Failure Modeling in Engineering
Goal: Estimate when a device will fail
Features: Product and manufacturer details, user reviews
Duration Modeling: Unemployment Duration in Economics
Goal: Estimate the time people spend without a job (for getting a new job)
Features: User demographics and experience, Job details and economics
Click Through Rate: Computational Advertising on the Web
Goal: Estimate when a web user will click the link of the ad.
Features: User and Ad information, website statistics
Customer Lifetime Value: Targeted Marketing
Goal: Estimate the frequent purchase pattern for customers.
Features: Customer and store/product information.
[Figure: given history information, how long until the event of interest?]
Taxonomy of Survival Analysis Methods
Survival Analysis Methods
  Statistical Methods
    Non-Parametric: Kaplan-Meier, Nelson-Aalen, Life-Table
    Semi-Parametric:
      Cox Regression: Basic Cox-PH, Penalized Cox (Lasso-Cox, Ridge-Cox, EN-Cox, OSCAR-Cox), Time-Dependent Cox
      CoxBoost
  Machine Learning
    Survival Trees
    Bayesian Methods: Naïve Bayes, Bayesian Network
    Neural Network
    Support Vector Machine
    Random Survival Forests
    Bagging Survival Trees
  Related Topics
    Early Prediction
    Data Transformation: Uncensoring, Calibration
    Complex Events: Competing Risks, Recurrent Events
Basics of Survival Analysis
The main focus is on time-to-event data. Typically, survival data are not fully observed, but rather are censored.
Several important functions:
Survival function, indicating the probability that the instance can survive for longer than a certain time t:
$$S(t) = \Pr(T > t)$$
Cumulative distribution function, representing the probability that the event of interest occurs earlier than t:
$$F(t) = 1 - S(t)$$
Death density function:
$$f(t) = \frac{d}{dt} F(t) = -\frac{d}{dt} S(t)$$
Hazard function, representing the instantaneous rate at which the “event” of interest occurs at time t, given survival to time t:
$$h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \ln S(t)$$
Cumulative hazard function:
$$H(t) = -\ln S(t), \quad \text{so that} \quad S(t) = \exp(-H(t))$$
Chandan K. Reddy and Yan Li, "A Review of Clinical Prediction Models", in Healthcare Data Analytics, Chandan K. Reddy and Charu C. Aggarwal (eds.), Chapman and Hall/CRC Press, 2015.
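The relationships among the survival, density, hazard, and cumulative hazard functions can be checked numerically. The sketch below assumes a Weibull-distributed survival time purely for illustration; the function name `weibull_functions` and the shape and scale values are hypothetical and do not come from the tutorial.

```python
import math

def weibull_functions(t, shape=1.5, scale=2.0):
    """Return S, F, f, h and H at time t for a Weibull(shape, scale) survival time."""
    z = (t / scale) ** shape
    S = math.exp(-z)                                       # survival function
    F = 1.0 - S                                            # cumulative distribution function
    f = (shape / scale) * (t / scale) ** (shape - 1) * S   # death density function
    h = f / S                                              # hazard function
    H = -math.log(S)                                       # cumulative hazard function
    return {"S": S, "F": F, "f": f, "h": h, "H": H}

vals = weibull_functions(1.0)

# Central-difference estimate of dS/dt at t = 1; the density should equal -dS/dt.
eps = 1e-6
dS_dt = (weibull_functions(1.0 + eps)["S"] - weibull_functions(1.0 - eps)["S"]) / (2 * eps)
```

Any other continuous distribution could be substituted; only the identities between the five functions matter here.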
Evaluation Metrics
Due to the presence of censoring in survival data, the standard evaluation metrics for regression, such as root mean squared error (RMSE) and $R^2$, are not suitable for measuring performance in survival analysis.
Three specialized evaluation metrics for survival analysis:
Concordance index (C-index)
Brier score
Mean absolute error
Concordance Index (C-Index)
It is a rank order statistic for predictions against true outcomes and is defined as the ratio of the concordant pairs to the total comparable pairs.
Given a comparable instance pair $(i, j)$, where $t_i$ and $t_j$ are the actual observed times and $S(t_i)$ and $S(t_j)$ are the predicted survival times:
The pair $(i, j)$ is concordant if $t_i > t_j$ and $S(t_i) > S(t_j)$.
The pair $(i, j)$ is discordant if $t_i > t_j$ and $S(t_i) < S(t_j)$.
H. Steck, B. Krishnapuram, C. Dehing-oberije, P. Lambin, and V. C. Raykar, “On ranking in survival analysis: Bounds on the concordance index”, NIPS 2008.
C‐index
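When the output of the model is the prediction of survival time:
$$\hat{c} = \frac{1}{num} \sum_{i: \delta_i = 1} \sum_{j: y_i < y_j} I\left[ S(\hat{y}_j \mid X_j) > S(\hat{y}_i \mid X_i) \right]$$
When the output of the model is the hazard ratio (Cox model):
$$\hat{c} = \frac{1}{num} \sum_{i: \delta_i = 1} \sum_{j: y_i < y_j} I\left[ X_i \hat{\beta} > X_j \hat{\beta} \right]$$
where $num$ denotes the number of comparable pairs and $I[\cdot]$ is the indicator function.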
∑ ∈ ∑ ∑ ∑ ∈ ·
∑∈
E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher, “Assessment and comparison of prognostic classification schemes for survival data”, Statistics in medicine, 1999.
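The C-index for a model that outputs a risk score (such as the Cox linear predictor) can be sketched as a double loop over comparable pairs. This is a minimal, quadratic-time illustration; the function name `concordance_index`, the half credit for tied scores, and the example values are assumptions made here, not part of the tutorial.

```python
def concordance_index(times, events, risk_scores):
    """C-index for risk scores, where a higher score should mean a shorter survival time.

    A pair (i, j) is comparable when instance i has an observed event
    (events[i] == 1) and its time is strictly earlier than that of j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # the earlier member of a comparable pair must be uncensored
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # assumption: tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Made-up example: four instances, the second one censored.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
scores = [0.9, 0.3, 0.05, 0.1]
c_index = concordance_index(times, events, scores)
```

In this example three of the four comparable pairs are ordered correctly; the pair formed by the instances at times 15 and 20 is not.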
Mean Absolute Error
For survival analysis problems, the mean absolute error (MAE) can be defined as an average of the differences between the predicted time values and the actual observation time values.
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \delta_i \, | y_i - \hat{y}_i |$$
where
$y_i$ -- the actual observation times.
$\hat{y}_i$ -- the predicted times.
Only the samples for which the event occurs are considered in this metric.
Condition: MAE can only be used for the evaluation of survival models which can provide the event time as the predicted target value.
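A small sketch of the MAE described above, in which censored instances contribute nothing to the sum. It assumes the normalization $\frac{1}{N}\sum_i \delta_i |y_i - \hat{y}_i|$ over all $N$ instances; the function name `survival_mae` and the data are hypothetical.

```python
def survival_mae(observed_times, predicted_times, events):
    """Mean absolute error over instances, counting only uncensored ones (events[i] == 1)."""
    n = len(observed_times)
    total = sum(
        d * abs(y - y_hat)
        for y, y_hat, d in zip(observed_times, predicted_times, events)
    )
    return total / n  # assumption: divide by all N instances, not only the uncensored ones

# Made-up example: the second instance is censored and is ignored in the sum.
mae = survival_mae([10.0, 20.0, 30.0], [12.0, 25.0, 27.0], [1, 0, 1])
```

Dividing by the number of uncensored instances instead is a common variant and only changes the scale of the score.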
Summary of Statistical Methods
Type: Semi-parametric
Advantages: The knowledge of the underlying distribution of survival times is not required.
Disadvantages: The distribution of the outcome is unknown; not easy to interpret.
Specific methods: Cox model, Regularized Cox, CoxBoost, Time-Dependent Cox
Kaplan‐Meier Analysis
Kaplan-Meier (KM) analysis is a nonparametric approach to survival outcomes. The survival function is:
$$\hat{S}(t) = \prod_{j: T_j \le t} \left( 1 - \frac{d_j}{r_j} \right)$$
where
• $T_1 < T_2 < \ldots < T_K$ -- a set of distinct event times observed in the sample.
• $d_j$ -- number of events at $T_j$.
• $c_j$ -- number of censored observations between $T_j$ and $T_{j+1}$.
• $r_j$ -- number of individuals “at risk” right before the $j$-th death: $r_j = r_{j-1} - d_{j-1} - c_{j-1}$.
B. Efron. "Logistic regression, survival analysis, and the Kaplan-Meier curve." JASA 1988.
Survival Outcomes
Status: 1: Death; 2: Lost to follow up; 3: Withdrawn Alive
Patient Days Status | Patient Days Status | Patient Days Status
1 21 1 | 15 256 2 | 29 398 1
2 39 1 | 16 260 1 | 30 414 1
3 77 1 | 17 261 1 | 31 420 1
4 133 1 | 18 266 1 | 32 468 2
13 214 1 | 27 361 1 |
14 228 1 | 28 374 1 |
Kaplan‐Meier Analysis
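j | Time | Status | $d_j$ | $c_j$ | $r_j$ | $\hat{S}(t)$
1 | 21 | 1 | 1 | 0 | 40 | 0.975
2 | 39 | 1 | 1 | 0 | 39 | 0.95
3 | 77 | 1 | 1 | 0 | 38 | 0.925
4 | 133 | 1 | 1 | 0 | 37 | 0.9
5 | 141 | 2 | 0 | 1 | 36 | .
6 | 152 | 1 | 1 | 0 | 35 | 0.874
7 | 153 | 1 | 1 | 0 | 34 | 0.849
KM Estimator:
$$\hat{S}(t) = \prod_{j: T_j \le t} \left( 1 - \frac{d_j}{r_j} \right)$$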
Kaplan‐Meier Analysis
KM Estimator: $\hat{S}(t) = \prod_{j: T_j \le t} \left( 1 - \frac{d_j}{r_j} \right)$
Columns (repeated twice per row): j, Time, Status, Estimate, Std. Error, $\sum d_j$, $r_j$
1 21 1 0.975 0.025 1 40 21 287 3 . . 18 20
2 39 1 0.95 0.034 2 39 22 295 1 0.508 0.081 19 19
3 77 1 0.925 0.042 3 38 23 308 1 0.479 0.081 20 18
4 133 1 0.9 0.047 4 37 24 311 1 0.451 0.081 21 17
5 141 2 . . 4 36 25 321 2 . . 21 16
6 152 1 0.874 0.053 5 35 26 326 1 0.421 0.081 22 15
7 153 1 0.849 0.057 6 34 27 355 1 0.391 0.081 23 14
8 161 1 0.823 0.061 7 33 28 361 1 0.361 0.08 24 13
9 179 1 0.797 0.064 8 32 29 374 1 0.331 0.079 25 12
10 184 1 0.771 0.067 9 31 30 398 1 0.301 0.077 26 11
11 193 1 0.746 0.07 10 30 31 414 1 0.271 0.075 27 10
12 197 1 0.72 0.072 11 29 32 420 1 0.241 0.072 28 9
13 199 1 0.694 0.074 12 28 33 468 2 . . 28 8
14 214 1 0.669 0.075 13 27 34 483 1 0.206 0.07 29 7
15 228 1 0.643 0.077 14 26 35 489 1 0.172 0.066 30 6
16 256 2 . . 14 25 36 505 1 0.137 0.061 31 5
17 260 1 0.616 0.078 15 24 37 539 1 0.103 0.055 32 4
18 261 1 0.589 0.079 16 23 38 565 3 . . 32 3
19 266 1 0.563 0.08 17 22 39 618 1 0.052 0.046 33 2
20 269 1 0.536 0.08 18 21 40 794 1 0 0 34 1
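The estimates in the 40-patient table can be reproduced with a short Kaplan-Meier routine. The sketch below is a minimal illustration in which the function name `kaplan_meier` is hypothetical; the times and status codes are copied from the table, with status 1 treated as an event and statuses 2 and 3 as censored.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 for an observed event and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every observation that shares this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

days = [21, 39, 77, 133, 141, 152, 153, 161, 179, 184,
        193, 197, 199, 214, 228, 256, 260, 261, 266, 269,
        287, 295, 308, 311, 321, 326, 355, 361, 374, 398,
        414, 420, 468, 483, 489, 505, 539, 565, 618, 794]
status = [1, 1, 1, 1, 2, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
          3, 1, 1, 1, 2, 1, 1, 1, 1, 1,
          1, 1, 2, 1, 1, 1, 1, 3, 1, 1]
events = [1 if s == 1 else 0 for s in status]  # 2 and 3 are treated as censored

curve = kaplan_meier(days, events)
km = dict(curve)
```

Looking up `km[153]` or `km[295]` gives values that agree with the Estimate column after rounding to three decimals.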
Nelson‐Aalen Estimator
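The Nelson-Aalen (NA) estimator is a non-parametric estimator of the cumulative hazard function (CHF) for censored data.
Instead of estimating the survival probability as done in the KM estimator, the NA estimator directly estimates the cumulative hazard.
The Nelson-Aalen estimator of the cumulative hazard function:
$$\hat{H}(t) = \sum_{j: T_j \le t} \frac{d_j}{r_j}$$
where $d_j$ is the number of events at $T_j$ and $r_j$ is the number of individuals at risk at $T_j$.
Assumption on when censoring occurs within an interval, giving the adjusted number at risk $r'_j$:
• at the beginning of each interval: $r'_j = r_j - c_j$
• at the end of each interval: $r'_j = r_j$
• on average halfway through the interval: $r'_j = r_j - c_j/2$
Cox, David R. "Regression models and life-tables", Journal of the Royal Statistical Society. Series B (Methodological), 1972.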
Clinical Life Tables
NOTE: The length of each interval is half a year (183 days).
Interval | Interval Start Time | Interval End Time | $r_j$ | $c_j$ | $r'_j$ | $d_j$ | $\hat{S}$ | Std. Error of $\hat{S}$
1 | 0 | 182 | 40 | 1 | 39.5 | 8 | 0.797 | 0.06
2 | 183 | 365 | 31 | 3 | 29.5 | 15 | 0.392 | 0.08
3 | 366 | 548 | 13 | 1 | 12.5 | 8 | 0.141 | 0.06
4 | 549 | 731 | 4 | 1 | 3.5 | 1 | 0.101 | 0.05
5 | 732 | 915 | 2 | 0 | 2 | 2 | 0 | 0
Cox Proportional Hazards Model
The Cox proportional hazards model is the most commonly used model in survival analysis.
Hazard function $h(t)$, sometimes called an instantaneous failure rate, shows the event rate at time $t$ conditional on survival until time $t$ or later.
$$h(t, X_i) = h_0(t) \exp(X_i \beta) \;\Rightarrow\; \log \frac{h(t, X_i)}{h_0(t)} = X_i \beta$$
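The proportional hazards property, namely that the hazard ratio between two instances does not depend on time, can be illustrated directly from $h(t, X) = h_0(t) \exp(X\beta)$. The baseline hazard `h0`, the coefficient values, and the function name `cox_hazard` below are all hypothetical choices made only for this sketch.

```python
import math

def cox_hazard(t, x, beta, baseline_hazard):
    """Hazard h(t, x) = h0(t) * exp(x . beta) under the Cox model."""
    linear_predictor = sum(xj * bj for xj, bj in zip(x, beta))
    return baseline_hazard(t) * math.exp(linear_predictor)

def h0(t):
    # Hypothetical baseline hazard, chosen only for illustration.
    return 0.01 * t

beta = [0.5, -1.0]   # made-up coefficients
x_a = [1.0, 0.0]
x_b = [0.0, 1.0]

# The ratio h(t, x_a) / h(t, x_b) equals exp((x_a - x_b) . beta) at every t,
# because the baseline hazard cancels out.
ratios = [
    cox_hazard(t, x_a, beta, h0) / cox_hazard(t, x_b, beta, h0)
    for t in (1.0, 5.0, 10.0)
]
```

Because the baseline cancels in the ratio, the coefficients can be estimated without specifying $h_0(t)$.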
H. Binder and M. Schumacher, “Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models”, BMC bioinformatics, 2008.
CoxBoost
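How is $\beta$ updated in each iteration of CoxBoost?
Determine the best candidate set of covariates, i.e., the one that improves the overall fit the most.
Update the coefficients in the selected set; coefficients outside the selected set are left unchanged.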
Special case:
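$X(t) = (X_{\cdot 1}(t), X_{\cdot 2}(t), \ldots, X_{\cdot P_1}(t), X_{\cdot 1}, X_{\cdot 2}, \ldots, X_{\cdot P_2})$
The first $P_1$ covariates are time-dependent; the last $P_2$ covariates are time-independent.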
TD‐Cox Model
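For the two sets of predictors at time $t$:
$X(t) = (X_{\cdot 1}(t), \ldots, X_{\cdot P_1}(t), X_{\cdot 1}, \ldots, X_{\cdot P_2})$
$X^*(t) = (X^*_{\cdot 1}(t), \ldots, X^*_{\cdot P_1}(t), X^*_{\cdot 1}, \ldots, X^*_{\cdot P_2})$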
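Parametric Censored Regression
[Figure: density $f(t)$ and survival function $S(t)$ evaluated at the observed times $y_i$]
Likelihood function
$$L(\theta) = \prod_{\delta_i = 1} f(y_i, \theta) \prod_{\delta_i = 0} S(y_i, \theta)$$
where $\theta$ denotes the parameters of the assumed distribution.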
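To make the censored likelihood concrete, the sketch below works it out for the simplest case, an exponential distribution with rate $\lambda$, where uncensored instances contribute $f(y) = \lambda e^{-\lambda y}$ and censored instances contribute $S(y) = e^{-\lambda y}$. The exponential choice, the function names, and the data are assumptions made for illustration only.

```python
import math

def neg_log_likelihood(lam, times, events):
    """Negative log-likelihood of an exponential model with right censoring.

    Uncensored instances contribute log f(y) = log(lam) - lam * y;
    censored instances contribute log S(y) = -lam * y.
    """
    total = 0.0
    for y, d in zip(times, events):
        if d == 1:
            total += math.log(lam) - lam * y
        else:
            total += -lam * y
    return -total

def exponential_mle(times, events):
    """Closed-form maximizer: number of events divided by total observed time."""
    return sum(events) / sum(times)

times = [2.0, 3.0, 5.0, 8.0]   # made-up observed times
events = [1, 0, 1, 1]          # 1 = event, 0 = censored
lam_hat = exponential_mle(times, events)
```

Setting the derivative of $d \log\lambda - \lambda \sum_i y_i$ to zero, with $d$ the number of events, gives the closed form used in `exponential_mle`; other distributions generally require numerical optimization.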
Parametric Censored Regression
Generalized Linear Model
Where
log
/ 1
Negative log-likelihood
2
m log log log 1
,
Uncensored censored
Instances Instances
Optimization
Use a second-order Taylor expansion to formulate the log-likelihood as a reweighted least squares problem.
Y. Li, K. S. Xu, C. K. Reddy, “Regularized Parametric Regression for High-dimensional Survival Analysis”, SDM 2016.
Pros and Cons
Advantages:
Easy to interpret.
Unlike the Cox model, it can directly predict the survival (event) time.
More efficient and accurate when the time to the event of interest follows a particular distribution.
Disadvantages:
The model performance depends strongly on the choice of distribution, and in practice it is very difficult to choose a suitable distribution for a given problem.
Li, Yan, Vineeth Rakesh, and Chandan K. Reddy. "Project success prediction in crowdfunding environments." Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 2016.
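Commonly Used Distributions
Distribution | PDF $f(t)$ | Survival $S(t)$ | Hazard $h(t)$
Exponential | $\lambda \exp(-\lambda t)$ | $\exp(-\lambda t)$ | $\lambda$
Logistic | $\frac{e^{-(t-\mu)/\sigma}}{\sigma \left(1 + e^{-(t-\mu)/\sigma}\right)^2}$ | $\frac{e^{-(t-\mu)/\sigma}}{1 + e^{-(t-\mu)/\sigma}}$ | $\frac{1}{\sigma \left(1 + e^{-(t-\mu)/\sigma}\right)}$
Log-logistic | $\frac{\lambda k t^{k-1}}{(1 + \lambda t^k)^2}$ | $\frac{1}{1 + \lambda t^k}$ | $\frac{\lambda k t^{k-1}}{1 + \lambda t^k}$
Normal | $\frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$ | $1 - \Phi\left(\frac{t-\mu}{\sigma}\right)$ | $\frac{1}{\sqrt{2\pi}\sigma \left(1 - \Phi\left(\frac{t-\mu}{\sigma}\right)\right)} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)$
Log-normal | $\frac{1}{\sqrt{2\pi}\sigma t} \exp\left(-\frac{(\log t-\mu)^2}{2\sigma^2}\right)$ | $1 - \Phi\left(\frac{\log t-\mu}{\sigma}\right)$ | $\frac{1}{\sqrt{2\pi}\sigma t \left(1 - \Phi\left(\frac{\log t-\mu}{\sigma}\right)\right)} \exp\left(-\frac{(\log t-\mu)^2}{2\sigma^2}\right)$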
Tobit Model
The Tobit model is one of the earliest attempts to extend linear regression with the Gaussian distribution for data analysis with censored observations.
In the Tobit model, a latent variable $y^*$ is introduced and is assumed to depend linearly on $X$ as:
$$y^* = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
where $\epsilon$ is a normally distributed error term.
∗ log 1
log
log | log log , 0
J. Buckley and I. James, Linear regression with censored data. Biometrika, 1979.
Buckley‐James Regression Method
Least squares is used as the empirical loss function:
$$\min_{\beta} \frac{1}{2} \sum_{i=1}^{N} \left( \log T_i^* - X_i \beta \right)^2$$
Where log = log
1 ·
1 log
The Elastic-Net regularizer has also been used to penalize BJ regression (EN-BJ) to handle high-dimensional survival data.
$$\min_{\beta} \frac{1}{2} \sum_{i=1}^{N} \left( \log T_i^* - X_i \beta \right)^2 + \lambda \left[ \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right]$$
To estimate $\beta$ for the BJ and EN-BJ models, we just need to calculate $\log T^*$ based on the $\beta$ of the previous iteration and then minimize the least squares or penalized least squares objective via standard algorithms.
Wang, Sijian, et al. “Doubly Penalized Buckley–James Method for Survival Data with High‐Dimensional Covariates.” Biometrics, 2008.
Regularized Weighted Linear Regression
[Figure: two cases for a censored instance, one marked × and one marked ✓]
Impose a larger penalty on case 1 and a smaller penalty on case 2.
Y. Li, B. Vinzamuri, and C. K. Reddy, “Regularized Weighted Linear Regression for High-dimensional Censored Data”, SDM 2016.
Weighted Residual Sum‐of‐Squares
More weight is given to the censored instances whose estimated survival time is less than the censored time.
Less weight is given to the censored instances whose estimated survival time is greater than the censored time.
Self‐Training Framework
Self-training: training the model by using its own predictions.
[Flowchart: train a base model → approximate the survival time of censored instances → check whether the estimated survival time is larger than the censored time]
Bayesian Survival Analysis
Penalized regression encodes assumptions via a regularization term, while the Bayesian approach encodes assumptions via a prior distribution.
Bayesian Paradigm
Based on observed data $D$, one can build a likelihood function $L(\theta \mid D)$ (likelihood estimator).
Suppose $\theta$ is random and has a prior distribution denoted by $\pi(\theta)$.
Inference concerning $\theta$ is based on the posterior distribution $\pi(\theta \mid D) \propto L(\theta \mid D)\, \pi(\theta)$.
Komarek, Arnost. Accelerated failure time models for multivariate interval-censored data with flexible distributional assumptions. PhD thesis, Katholieke Universiteit Leuven, Faculteit Wetenschappen, 2006.
Deep Survival Analysis
Deep survival analysis is a hierarchical generative approach to survival analysis in the context of the EHR.
Deep survival analysis models covariates and survival time in a Bayesian framework.
It can easily handle missing covariates while modeling the survival time.
Deep exponential families (DEF) are a class of multi-layer probability models built from exponential families. Therefore, they are capable of modeling complex relationships and latent structure to build a joint model for both the covariates and the survival times.
R. Ranganath, A. Perotte, N. Elhadad, and D. Blei. "Deep survival analysis." Machine Learning for Healthcare, 2016.
Machine Learning Methods
Basic ML Models
Survival Trees
Bagging Survival Trees
Random Survival Forest
Support Vector Regression
Deep Learning
Rank based Methods
Advanced ML Models
Active Learning
Multi-task Learning
Transfer Learning
Survival Tree
A survival tree is similar to a decision tree and is built by recursively splitting tree nodes. A node of a survival tree is considered “pure” if all the patients in the node survive for an identical span of time.
LeBlanc, M. and Crowley, J. (1993). Survival Trees by Goodness of Split. Journal of the American Statistical Association 88, 457–467.
Logrank Test
The logrank test is obtained by constructing a (2 × 2) table at each distinct death time, and comparing the death rates between the two groups, conditional on the number at risk in the groups. Let $t_1, \ldots, t_K$ represent the ordered, distinct death times. At the $j$-th death time, we have the following:
∑ /
Hothorn, Torsten, et al. "Bagging survival trees." Statistics in medicine 23.1 (2004): 77-91.
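The two-group logrank comparison described above can be sketched as follows. This is a minimal illustration that uses the standard hypergeometric expectation and variance at each distinct death time; the function name `logrank_statistic`, the group coding (0 and 1), and the toy data are assumptions, not taken from the tutorial.

```python
def logrank_statistic(times, events, groups):
    """Chi-square logrank statistic for two groups labelled 0 and 1.

    At each distinct death time a 2 x 2 table is formed from the numbers
    at risk and the numbers of deaths in the two groups.
    """
    observed_minus_expected = 0.0
    variance = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        n = sum(1 for ti in times if ti >= t)                              # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)  # at risk, group 1
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)   # deaths, both groups
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)                         # deaths, group 1
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += n1 * (n - n1) * d * (n - d) / (n * n * (n - 1))
    if variance == 0.0:
        raise ValueError("variance is zero; the groups cannot be compared")
    return observed_minus_expected ** 2 / variance

# Toy data: group 1 dies at times 1 and 2, group 0 at times 3 and 4.
stat = logrank_statistic([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0])
```

A survival tree can use a statistic of this kind as its splitting criterion, preferring the split whose two daughter nodes give the largest value.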
Random Survival Forests
1. Draw B bootstrap samples from the original data (about 63% of the instances are in-bag data and 37% are out-of-bag (OOB) data).
2. Grow a survival tree for each bootstrap sample based on randomly selected candidate features, and split each node using the feature from the selected candidates that maximizes the survival difference between daughter nodes.
3. Grow the tree to full size; each terminal node should have no fewer than $d_0 > 0$ unique deaths.
4. Calculate a Cumulative Hazard Function (CHF) for each tree. Average to obtain the bootstrap ensemble CHF.
5. Using OOB data, calculate the prediction error for the OOB ensemble CHF.
The CHF of a terminal node (leaf) $h$ is
$$\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}$$
where $t_{l,h}$ is the $l$-th distinct event time of the samples in leaf $h$, $d_{l,h}$ is the number of events at $t_{l,h}$, and $Y_{l,h}$ is the number of individuals at risk at $t_{l,h}$.
OOB ensemble CHF ($H_e^{**}$) and bootstrap ensemble CHF ($H_e^{*}$):
$$H_e^{**}(t \mid x_i) = \frac{\sum_{b=1}^{B} I_{i,b}\, H_b^{*}(t \mid x_i)}{\sum_{b=1}^{B} I_{i,b}}, \qquad H_e^{*}(t \mid x_i) = \frac{1}{B} \sum_{b=1}^{B} H_b^{*}(t \mid x_i)$$
where $H_b^{*}(t \mid x_i)$ is the CHF of the node in the $b$-th bootstrap tree to which $x_i$ belongs. $I_{i,b} = 1$ if $i$ is an OOB case for $b$; otherwise, $I_{i,b} = 0$. Therefore the OOB ensemble CHF is the average over the bootstrap samples for which $i$ is OOB, and the bootstrap ensemble CHF is the average over all $B$ bootstrap samples.
O. O. Aalen, “Nonparametric inference for a family of counting processes”, Annals of Statistics 1978.
Support Vector Regression (SVR)
Once a model has been learned, it can be applied to a new instance $x$ through the learned function $f(x)$.
$\epsilon$: margin of error
C: regularization parameter
$\xi$: slack variables
Support Vector Approach for Censored Data
Interval Targets: These are samples for which we have both an upper and a lower bound on the target. The tuple is $(x_i, l_i, u_i)$ with $l_i < u_i$.
As long as the output $f(x_i)$ is between $l_i$ and $u_i$, there is no empirical error.
A right-censored sample is written as $(x_i, l_i, \infty)$: its survival time is greater than $l_i \in \mathbb{R}$, but the upper bound is unknown.
[Figure: loss as a function of $f(x_i)$ for an interval target $(l_i, u_i)$ and for a right-censored target with upper bound $\infty$]
P. K. Shivaswamy, W. Chu, and M. Jansche. "A support vector approach to censored targets”, ICDM 2007.
Support Vector Regression for Censored Data
[Figure: a graphical representation of the SVRc parameters for events.]
A smaller acceptable margin when the predicted value is greater than the event time.
A greater penalty rate when the predicted value is greater than the censored time.
Predicting that a high-risk patient will survive longer is more dangerous than predicting that a low-risk patient will survive for a shorter time.
[Figure: graphical representation of the SVRc parameters for censored data.]
[Figure: neural network whose output feeds a Cox proportional hazards model; softmax function]
D. Faraggi and R. Simon. "A neural network model for survival data." Statistics in medicine, 1995.
Deep Survival: A Deep Cox Proportional Hazards Network
[Figure: input layer → hidden layers → output layer → Cox proportional hazards model]
Katzman, Jared, et al. "Deep Survival: A Deep Cox Proportional Hazards Network." arXiv, 2016.
Deep Convolutional Neural Network
Pros: Directly builds a deep model for survival analysis from image input.
X. Zhu, J. Yao, and J. Huang. "Deep convolutional neural network for survival analysis with pathological images", BIBM 2016.
Ranking based Models
The C-index is a pairwise ranking based evaluation metric. Boosting concordance index (BoostCI) is an approach which aims to directly optimize the C-index.
A. Mayr and M. Schmid, “Boosting the concordance index for survival data–a unified framework to derive and evaluate biomarker combinations”, PloS one, 2014.
BoostCI Algorithm
The component-wise gradient boosting algorithm is used to optimize the smoothed C-index.
Learning Step:
1. Initialize the estimate of the marker combination $\hat{\eta}$ with offset values, set the maximum number ($m_{stop}$) of iterations, and set $m = 1$.
2. Compute the negative gradient vector of the smoothed C-index.
3. Fit the negative gradient vector separately to each of the components of $X$ via the base-learners $b_j(X_{:,j})$.
4. Select the component that best fits the negative gradient vector; the index of the selected base-learner is denoted as $j^*$.
5. Update the marker combination for this component: $\hat{\eta} \leftarrow \hat{\eta} + \nu \cdot b_{j^*}(X_{:,j^*})$, where $\nu$ is a step length.
Active Learning for Survival Data
Objective: Identify the representative samples in the data.
Outcome: Allow the model to select the instances to be included. This can minimize the training cost and complexity of the model and obtain good generalization performance for censored data.
An active learning based framework for survival regression using a novel model discriminative gradient based sampling procedure.
B. Vinzamuri, Y. Li, C. Reddy, "Active Learning Based Survival Regression for Censored Data", CIKM 2014.
Active Learning with Censored Data
[Flowchart: EHR features (X), censored status (δ) and time to event (T) form the training data → train Cox model with column-wise kernel matrix (Ke), partial log likelihood L(β) and Elastic Net regularization → compute gradient δL(β)/δβ → gradient based discriminative sampling from the unlabelled pool (Pool) → labelling request for instance to the domain expert (oracle) → update training data; at the end of the active learning rounds, output survival AUC and RMSE]
Multi‐task Learning Formulation
Advantage: The model is general; it makes no assumption on either the survival time or the survival function.
[Figure: life status of four patients over 12 months; 1: Alive, 0: Death, ?: Unknown]
Similar tasks: All the binary classifiers aim at predicting the life status of each patient.
Temporal smoothness: For each patient, the life statuses of adjacent time intervals are mostly the same.
Not reversible: Once a patient is dead, the patient cannot be alive again.
Multi‐task Learning Formulation
Y 1 2 3 4 5 6 7 8 9 10 11 12 W 1 2 3 4 5 6 7 8 9 10 11 12
D1 1 1 1 1 1 1 1 1 1 0 0 0 D1 1 1 1 1 1 1 1 1 1 1 1 1
D2 1 1 1 1 1 ? ? ? ? ? ? ? D2 1 1 1 1 1 0 0 0 0 0 0 0
D3 1 1 1 1 1 1 1 1 1 1 ? ? D3 1 1 1 1 1 1 1 1 1 1 0 0
D4 1 1 1 0 0 0 0 0 0 0 0 0 D4 1 1 1 1 1 1 1 1 1 1 1 1
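The target matrix Y and indicator matrix W shown above can be built from each patient's observed time and event indicator. The sketch below is one way to do this, assuming `time` is the last month in which the patient was observed alive; the function name `build_targets` and the use of `None` for "?" are hypothetical choices.

```python
def build_targets(time, event, horizon=12):
    """Return (y, w) rows for one patient over `horizon` monthly intervals.

    y[j] is 1 if the patient is alive in month j + 1, 0 if dead,
    and None if the status is unknown ("?") because of censoring.
    w[j] is 1 where y[j] is observed and 0 where it is unknown.
    """
    y, w = [], []
    for month in range(1, horizon + 1):
        if month <= time:
            y.append(1)      # observed alive
            w.append(1)
        elif event == 1:
            y.append(0)      # death observed earlier, so the patient stays dead
            w.append(1)
        else:
            y.append(None)   # censored: status unknown from here on
            w.append(0)
    return y, w

# Patients D1 (death after month 9) and D2 (censored after month 5) from the table.
y1, w1 = build_targets(time=9, event=1)
y2, w2 = build_targets(time=5, event=0)
```

Entries with `w == 0` are the ones masked out of the loss, which is how the "?" entries in Y are handled.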
How to deal with the “?” in Y
The Proposed objective function:
1
min Π ,
∈ 2 2
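where the operator $\Pi_W$ handles the censored entries:
$$\Pi_W(U)_{ij} = \begin{cases} U_{ij} & \text{if } W_{ij} = 1 \\ 0 & \text{if } W_{ij} = 0 \end{cases}$$
Similar tasks: select some common features across all the tasks via the $\ell_{2,1}$-norm.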
Solving the non‐negative non‐increasing list structure by max‐heap projection
min ,
∈ 2 2
Solving the $\ell_{2,1}$-norm by using the FISTA algorithm
The model should enforce the dependency of the outputs by predicting the survival status of a patient at each of the time snapshots; let $y = (y_1, y_2, \ldots, y_m)$, where $y_j = 0$ (no death event yet) and $y_j = 1$ (death).
C. Yu et al. "Learning patient-specific cancer survival distributions as a sequence of dependent regressors." NIPS 2011.
Multi‐Task Logistic Regression
[Figure: a learning system whose training items are similar but not the same]
Transfer Learning for Survival Analysis
[Figure: given history information, how long until the event of interest?]
Labeling the time-to-event data is very time consuming!
[Figure: source data (TCGA …) for the source task and target data for the target task; labels X and B]
• Both source and target tasks are survival analysis problems.
• There exist some features which are important across all correlated diseases.
Yan Li, Lu Wang, Jie Wang, Jieping Ye and Chandan K. Reddy "Transfer Learning for Survival Analysis via Efficient L2,1-norm Regularized Cox Regression". ICDM 2016.
Transfer‐Cox Model
The Proposed objective function:
1
min ,
, 2
Related Topics
Early Prediction
Data Transformation
Uncensoring
Calibration
Complex Events
Competing Risks
Recurrent Events
Early Stage Event Prediction
Collecting data for survival analysis is very “time” consuming.
[Figure: timelines of subjects S1 to S6 over time, with time points $t_c$ and $t_f$ marked]
Any existing survival model can predict only until $t_c$.
Develop a Bayesian approach for early stage prediction.
M. J Fard, P. Wang, S. Chawla, and C. K. Reddy, “A Bayesian perspective on early stage event prediction in longitudinal data”, TKDE 2016.
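Bayesian Approach
Likelihood factorizations:
Naive Bayes: $\prod_{j=1}^{m} P(x_j \mid y(t_c) = 1)$
TAN: $\prod_{j=1}^{m} P(x_j \mid y(t_c) = 1, x_{p(j)})$
Bayesian network: $\prod_{j=1}^{m} P(x_j \mid y(t_c) = 1, Pa(x_j))$
Probability of Event Occurrence:
$$P(y(t_f) = 1 \mid x, t \le t_f) = \frac{\text{Prior} \times \text{Likelihood}}{P(x, t \le t_f)}$$
Extrapolation of Prior
Weibull: $F(t_c) = 1 - e^{-(t_c / b)^a}$
Log-logistic: $F(t_c) = \frac{1}{1 + (t_c / b)^{-a}}$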
Early Stage Prediction
[Figure: accuracy (0.7 to 0.9) versus percentage of available event occurrence information (20% to 100%) for Cox, LR, RF, NB, TAN, BN, ESP_NB, ESP_TAN and ESP_BN]
Data Transformation
Two data transformation techniques that will be useful for data pre-processing in survival analysis:
Uncensoring approach
Calibration
Transform the data to a more conducive form so that other survival-based algorithms (or sometimes even the standard algorithms) can be applied effectively.
Uncensoring Approach
The censored instances actually carry partially informative label information, which provides the possible range of the corresponding true response (survival time).
Such censored data have to be handled with special care within any machine learning method in order to make good predictions.
Two naive ways of handling such censored data:
Delete the censored instances.
Treat censoring as event-free.
Uncensoring Approach I
For each censored instance, estimate the probability of event and the probability of being censored (considering censoring as a new event) using the Kaplan-Meier estimator. Give a new class label based on these probability values.
[Diagram: decision with outcomes Yes → Event and No → Event-free]
M. J Fard, P. Wang, S. Chawla, and C. K. Reddy, “A Bayesian perspective on early stage event prediction in longitudinal data”, TKDE 2016.
Uncensoring Approach II
Group the instances in the given data into three categories:
(i) Instances which experience the event of interest during the observation will be labeled as event.
(ii) Instances whose censored time is later than a predefined time point are labeled as event-free.
(iii) Instances whose censored time is earlier than a predefined time point:
A copy of these instances will be labeled as event.
Another copy of the same instances will be labeled as event-free.
These instances will be weighted by a marginal probability of event occurrence estimated by the Kaplan-Meier method.
B. Zupan, J. Demšar, M. W. Kattan, R. J. Beck, and I. Bratko, “Machine learning for survival analysis: a case study on recurrence of prostate cancer”, Artificial intelligence in medicine, 2000.
Calibration
Motivation
Inappropriately labeled censored instances in survival data cannot provide much information to the survival algorithm.
Censoring that depends on the covariates may lead to some bias in standard survival estimators.
Approach - Regularized inverse covariance based imputed censoring
By imputing an appropriate label value for each censored instance, a new representation of the original survival data can be learned effectively.
It has the ability to capture correlations between censored instances and correlations between similar features.
It estimates the calibrated time-to-event values by exploiting row-wise and column-wise correlations among censored instances for imputing them.
B. Vinzamuri, Y. Li, and C. K Reddy, “Pre-processing censored survival data using inverse covariance matrix based calibration”, TKDE 2017.
Complex Events
Until now, the discussion has been primarily focused on
survival problems in which each instance can experience only
a single event of interest.
However, in many real-world domains, each instance may
experience different types of events and each event may
occur more than once during the observation time period.
Stratified Cox Model
The stratified Cox model is a modification of the regular Cox model which allows for control by stratification of the predictors which do not satisfy the PH assumption in the Cox model.
Variables $Z_1, Z_2, \ldots, Z_k$ do not satisfy the PH assumption.
Variables $X_1, X_2, \ldots, X_p$ satisfy the PH assumption.
Create a single new variable $Z^*$: (1) categorize each $Z_i$; (2) form all the possible combinations of categories; (3) the strata are the categories of $Z^*$.
$$h_g(t, X) = h_{0g}(t) \exp\left[ \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p \right]$$
where $g = 1, 2, \cdots, k^*$, strata defined from $Z^*$.
The baseline hazard $h_{0g}(t)$ can be different for each stratum; the coefficients $\beta_1, \ldots, \beta_p$ are the same for each stratum.
The coefficients are estimated by maximizing the partial likelihood function obtained by multiplying the likelihood functions for each stratum.
Competing Risks
Competing risks exist only in survival problems with more than one possible event of interest, where only one event can occur at any given time.
[Figure: transitions from Alive to Death through competing causes: Kidney Failure, Heart Disease, Stroke, Other Diseases]
Cumulative incidence curve for event $e$:
$$CIC_e(t) = \sum_{j: t_j \le t} \hat{S}(t_{j-1})\, \hat{h}_e(t_j) = \sum_{j: t_j \le t} \hat{S}(t_{j-1})\, \frac{d_{ej}}{n_j}$$
where
$\hat{h}_e(t_j)$ represents the estimated hazard at time $t_j$ for event $e$.
$d_{ej}$ is the number of events for the event $e$ at $t_j$.
$n_j$ denotes the number of instances who are at the risk of experiencing events at $t_j$.
$\hat{S}(t_{j-1})$ denotes the survival probability at the last time point $t_{j-1}$.
H. Putter, M. Fiocco, and R. B. Geskus, “Tutorial in biostatistics: competing risks and multi-state models”, Statistics in medicine, 2007.
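The cumulative incidence for one competing event can be sketched by accumulating, at each event time, the overall survival probability just before that time multiplied by the cause-specific hazard estimate (events of that cause divided by the number at risk). The function name `cumulative_incidence`, the coding of causes (0 for censored, positive integers for event types), and the toy data are assumptions for illustration.

```python
def cumulative_incidence(times, causes, cause):
    """Return [(t, CIF(t))] for one event type.

    causes[i] is 0 for a censored instance, otherwise a positive integer
    identifying which competing event occurred.
    """
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    overall_surv = 1.0   # probability of being free of every event just before t
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_all = d_cause = removed = 0
        while i < len(data) and data[i][0] == t:
            c = data[i][1]
            d_all += 1 if c != 0 else 0
            d_cause += 1 if c == cause else 0
            removed += 1
            i += 1
        if d_all > 0:
            cif += overall_surv * d_cause / n_at_risk   # S(t_{j-1}) * d_ej / n_j
            overall_surv *= 1.0 - d_all / n_at_risk     # update overall survival
            curve.append((t, cif))
        n_at_risk -= removed
    return curve

# Toy data: events of type 1 at times 1 and 4, type 2 at time 2, censoring at time 3.
times = [1, 2, 3, 4]
causes = [1, 2, 0, 1]
cif_1 = cumulative_incidence(times, causes, cause=1)
cif_2 = cumulative_incidence(times, causes, cause=2)
```

In this toy example the two cumulative incidences add up to one minus the overall survival probability at the last event time.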
Lunn‐McNeil (LM)
Lunn-McNeil fits a single Cox PH model which considers all the events ($E_1, E_2, \ldots, E_K$) in competing risks rather than separate models for each event.
The LM approach is implemented using an augmented dataset, in which a dummy variable is created for each event to distinguish different competing risks.
ID | Time | Status | $D_1$ | $D_2$ | … | $D_K$ | Features
i | $T_i$ | $e_1$ | 1 | 0 | … | 0 | $X_i$
i | $T_i$ | $e_2$ | 0 | 1 | … | 0 | $X_i$
… | … | … | … | … | … | … | …
i | $T_i$ | $e_K$ | 0 | 0 | … | 1 | $X_i$
Status: only one of them equals 1. $D_1, \ldots, D_K$: dummy variables.
M. Lunn and D. McNeil, “Applying Cox regression to competing risks”, Biometrics, 1995.
Recurrent Events
In many application domains, the event of interest can occur for
each instance more than once during the observation time
period.
Software Resources
Algorithm | Software | Language | Link
Life-Table
Lasso-Cox
EN-Cox
BJ | bujar | R | https://cran.rproject.org/web/packages/bujar/index.html
Bayesian Methods | BMA | R | https://cran.rproject.org/web/packages/BMA/index.html
Multi-Task Learning | MTLSA | Matlab | https://github.com/MLSurvival/MTLSA
Early Prediction, Uncensoring | ESP | R | https://github.com/MLSurvival/ESP
Acknowledgements
Graduate Students
Funding Agencies
Thank You
Questions and Comments
http://www.cs.vt.edu/~reddy/