Proceedings of Machine Learning for Healthcare 2017 JMLR W&C Track Volume 68

Clinical Intervention Prediction and Understanding with Deep Neural Networks

Harini Suresh 1 HSURESH@MIT.EDU
Nathan Hunt 1 NHUNT@MIT.EDU
Alistair Johnson 2 AEWJ@MIT.EDU
Leo Anthony Celi 2 LCELI@MIT.EDU
Peter Szolovits 1 PSZ@MIT.EDU
Marzyeh Ghassemi 1 MGHASSEM@MIT.EDU

1 Computer Science and Artificial Intelligence Lab, MIT, Cambridge, MA
2 Laboratory for Computational Physiology, MIT, Cambridge, MA

Abstract
Real-time prediction of clinical interventions remains a challenge within intensive care units (ICUs).
This task is complicated by data sources that are sparse, noisy, heterogeneous and outcomes that are
imbalanced. In this work, we integrate data across many ICU sources — vitals, labs, notes, demo-
graphics — and focus on learning rich representations of this data to predict onset and weaning of
multiple invasive interventions. In particular, we compare both long short-term memory networks
(LSTM) and convolutional neural networks (CNN) for prediction of five intervention tasks: in-
vasive ventilation, non-invasive ventilation, vasopressors, colloid boluses, and crystalloid boluses.
Our predictions are done in a forward-facing manner after a six hour gap time to support clinically
actionable planning. We achieve state-of-the-art results on these predictive tasks using deep archi-
tectures. Further, we explore the use of feature occlusion to interpret LSTM models, and compare
this to the interpretability gained from examining inputs that maximally activate CNN outputs. We
show that our models are able to significantly outperform baselines for intervention prediction, and
provide insight into model learning.

1. Introduction

As Intensive Care Units (ICUs) play an increasing role in acute healthcare delivery (Vincent, 2013),
clinicians must anticipate patients’ care needs in a fast-paced, data-overloaded setting. The sec-
ondary analysis of healthcare data is a critical step toward improving modern healthcare, as it af-
fords the study of care in real care settings and patient populations. The widespread availability of
electronic healthcare data (Charles et al., 2013; Jamoom et al., 2016) allows new investigations
into evidence-based decision support, where we can learn when patients need a given intervention.

Continuous, forward-facing event prediction is particularly important in the ICU setting where we
want to account for evolving clinical needs.
In this work, we focus on predicting the onset and weaning of interventions. The efficacy of
clinical interventions can vary drastically among patients, and unnecessarily administering an inter-
vention can be harmful and expensive. We target interventions that span a wide severity of needs
in critical care: invasive ventilation, non-invasive (NI) ventilation, vasopressors, colloid boluses,
and crystalloid boluses. Mechanical ventilation is commonly used for breathing assistance, but has
many potential complications (Yang and Tobin) and small changes in ventilation settings can have
large impact in patient outcomes (Tobin, 2006). Vasopressors are a common ICU medication, but
there is no robust evidence of improved outcomes from their use (Müllner et al., 2004), and some
evidence they may be harmful (D’Aragon et al., 2015). Fluid boluses are used to improve cardio-
vascular function and organ perfusion. There are two bolus types: crystalloid and colloid. Both are
often considered as less aggressive alternatives to vasopressors, but there are no multi-center trials
studying whether fluid bolus therapy should be given to critically ill patients, only studies trying to
distinguish which type of fluid should be given (Malbrain et al., 2014).
Capturing complex relationships across disparate data types is key for predictive performance
in our tasks. To this end, we take advantage of the success of deep learning models in capturing
rich representations of data with little hand-engineering by domain experts. We use long short-term
memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), which have been shown to effec-
tively model complicated dependencies in timeseries data (Bengio et al., 1994) and have achieved
state-of-the-art results in many different applications: e.g. machine translation (Hermann et al.,
2015), dialogue systems (Chorowski et al., 2015) and image captioning (Xu et al., 2015). They are
well-suited to our modeling tasks because patient symptoms may exhibit important temporal depen-
dencies. We compare the LSTM models to a convolutional neural network (CNN) architecture that
has previously been explored for longitudinal laboratory data (Razavian et al., 2016). We train one
model per intervention which predicts all outcomes for that intervention given any patient record.
In doing so, we:
1. Achieve state-of-the-art prediction results in forward-facing, hourly prediction of clinical in-
terventions (onset, weaning, or continuity).
2. Demonstrate how feature occlusion can be used in the LSTM model to show which data
modalities and features are most important in different predictive tasks. This is an important
step in making models more interpretable by physicians.
3. Further aid in model interpretability by highlighting patient trajectories that lead to the most
and least confident predictions across outcomes and features in a CNN model.

2. Background and Related Work


Clinical decision-making often happens in settings of limited knowledge and high uncertainty; for
example, only 10 of the 72 ICU interventions evaluated in randomized controlled trials (RCTs) are
associated with improved outcomes (Ospina-Tascón et al., 2008). Secondary analysis of electronic
health records (EHR) aims to gain insight from healthcare data previously collected for the primary
purpose of facilitating patient care.
Recent studies have applied recurrent neural networks (RNNs) to modeling sequential EHR
data to tag ICU signals with billing code labels (Che et al., 2016; Lipton et al., 2015; Choi et al.,
2015), and to identify the impact of different drugs for diabetes (Krishnan et al., 2015). Razavian
et al. (2016) compared CNNs to LSTMs for longitudinal outcome prediction on billing codes using
lab tests. With regard to interpretability, Choi et al. (2016) used temporal attention to identify im-
portant features in early diagnostic prediction of chronic diseases from time-ordered billing codes.
Others have focused on using representations of clinical notes (Ghassemi et al., 2014) or patient
physiological signals to predict mortality (Ghassemi et al., 2015).
Previous work on interventions in ICU populations has often either focused on a single outcome
or used data from specialized cohorts. Such models with vasopressors as a predictive target
have achieved AUCs of 0.79 in patients receiving fluid resuscitation (Fialho et al., 2013), 0.85 in
septic shock patients (Salgado et al., 2016), and 0.88 for onset after a 4 hour gap and 0.71 for wean-
ing, only trained on patients who did receive a vasopressor (Wu et al., 2016). However, we train
our models on general ICU populations in order to make them more applicable. In the most recent
prior work on interventions, also on a general ICU population, the best AUC performances were
0.67 (ventilation) and 0.78 (vasopressor) for onset prediction after a 4 hour gap (Ghassemi
et al., 2017). These were lowered to 0.66 and 0.74 with a longer prediction gap time of 8 hours.

3. Data and Preprocessing


See Figure 1 for an overall description of data flow.

3.1 Data Source


We use data from the Medical Information Mart for Intensive Care (MIMIC-III v1.4)
database (Johnson et al., 2016). MIMIC-III is publicly available, and contains over 58,000 hospital
admissions from approximately 38,600 adults. We consider patients 15 and older who had ICU
stays of 12 – 240 hours and consider each patient’s first ICU stay only. This yields 34,148 unique
ICU stays.

3.2 Data Extraction and Preprocessing


For each patient, we extract 5 static variables (such as gender and age), 29 time-varying vitals and
labs (such as oxygen saturation and blood urea nitrogen), and all available de-identified clinical
notes as timeseries across the entire stay (see Table 3 in the Appendix for a complete listing of
variables).
Static variables are replicated across all timesteps for each patient. Vital and lab measurements
are given timestamps that are rounded to the nearest hour. If an hour has multiple measurements for
a signal, those measurements are averaged.
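The hourly bucketing described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function names, the tuple layout, and the example readings are hypothetical, and it assumes that a measurement exactly 30 minutes past the hour rounds up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def round_to_nearest_hour(ts):
    """Round a timestamp to the nearest hour (assumption: :30 rounds up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    if ts - floored >= timedelta(minutes=30):
        return floored + timedelta(hours=1)
    return floored


def hourly_means(measurements):
    """Average all values of a signal that land in the same rounded hour.

    `measurements` is an iterable of (timestamp, signal_name, value) tuples.
    Returns a dict mapping (hour, signal_name) to the mean value.
    """
    buckets = defaultdict(list)
    for ts, signal, value in measurements:
        buckets[(round_to_nearest_hour(ts), signal)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical readings for one patient.
readings = [
    (datetime(2017, 1, 1, 3, 10), "heart_rate", 87.0),
    (datetime(2017, 1, 1, 3, 25), "heart_rate", 89.0),
    (datetime(2017, 1, 1, 3, 40), "heart_rate", 95.0),  # rounds up to 04:00
]
hourly = hourly_means(readings)
```

Here the two readings that round to 03:00 are averaged into one value, while the 03:40 reading is assigned to the 04:00 bucket.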

3.3 Representation of Notes and Vitals


Clinical narrative notes were transformed to a 50-dimensional vector of topic proportions for each
note using Latent Dirichlet Allocation (Blei et al., 2003; Griffiths and Steyvers, 2004). These vectors
are replicated forward and aggregated through time (Ghassemi et al., 2014). For example, if a
patient had a note A recorded at hour 3 and a note B at hour 7, hours 3–6 would contain the topic
distribution from A, while hours 7 onward would contain the aggregated topic distribution from A
and B combined.
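The note A / note B example above can be sketched in a few lines. The text does not say how topic vectors are aggregated, so this sketch assumes a running mean over all notes seen so far, and it assumes an all-zero vector before the first note; both choices, and the function name, are illustrative only.

```python
def forward_fill_topics(notes, n_hours, n_topics):
    """Replicate note topic vectors forward in time, aggregating as notes arrive.

    `notes` maps an hour index to the topic-proportion vector of the note
    written in that hour. Returns one vector per hour.
    Assumptions: aggregation is a running mean; hours before the first note
    are filled with zeros.
    """
    running_sum = [0.0] * n_topics
    n_seen = 0
    per_hour = []
    for hour in range(n_hours):
        if hour in notes:
            running_sum = [s + v for s, v in zip(running_sum, notes[hour])]
            n_seen += 1
        if n_seen == 0:
            per_hour.append([0.0] * n_topics)
        else:
            per_hour.append([s / n_seen for s in running_sum])
    return per_hour


# Two toy topics instead of the paper's 50: note A at hour 3, note B at hour 7.
topics = forward_fill_topics({3: [1.0, 0.0], 7: [0.0, 1.0]}, n_hours=10, n_topics=2)
```

Hours 3 to 6 carry the distribution of note A unchanged, and from hour 7 onward every hour carries the combination of A and B.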
We compare raw physiological data to physiological words, where we categorize the vitals data
by first converting each value into a z-score based on the population mean and standard deviation
for that variable, and then rounding this score to the nearest integer and capping it to be between -4
and 4. Each z-score value then becomes its own column, which explicitly allows for a representation
of missingness (e.g., all columns for a particular variable zeroed) that does not require imputation
(Figure 7 in Appendix B) (Wu et al., 2016).
Figure 1: Data preprocessing and feature extraction with numerical measurements and lab values,
clinical notes and static demographics.
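As a rough sketch of the physiological-word encoding from Section 3.3, the function below turns one measurement into nine indicator columns (z-scores -4 to 4). The population mean and standard deviation used in the example are made up, and Python's built-in `round` resolves exact .5 ties to the nearest even integer, which may differ from the rounding the authors used.

```python
def physiological_word(value, mean, std, cap=4):
    """One-hot encode a measurement as a rounded, capped z-score.

    Returns 2 * cap + 1 columns, one per integer z-score from -cap to +cap.
    A missing value (None) yields all zeros, so no imputation is required.
    """
    columns = [0] * (2 * cap + 1)
    if value is None:
        return columns
    z = round((value - mean) / std)   # nearest integer z-score
    z = max(-cap, min(cap, z))        # cap to [-cap, cap]
    columns[z + cap] = 1
    return columns


# Hypothetical population statistics for heart rate.
HR_MEAN, HR_STD = 85.0, 15.0
word_87 = physiological_word(87, HR_MEAN, HR_STD)
word_89 = physiological_word(89, HR_MEAN, HR_STD)
word_high = physiological_word(160, HR_MEAN, HR_STD)   # z = 5, capped to 4
word_missing = physiological_word(None, HR_MEAN, HR_STD)
```

Under these assumed statistics, heart rates of 87 and 89 map to the same word, while a missing reading is distinguishable from every observed value.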

Figure 2: Given data from a fixed-length (6 hour) sliding window, models predict the status of
intervention in a prediction window (4 hours) after a gap time (6 hours). Windows slide along the
entire patient record, creating multiple examples from each record.

The physiological variables, topic distribution, and static variables for each patient are concate-
nated into a single feature vector per patient, per hour (Esteban et al., 2016). The intervention state
of each patient (a binary value indicating whether or not they are on the intervention of interest at
each timestep) and the time of day for each timestep (an integer from 0 to 23 representing the hour)
are also added to this feature vector. Using the time of day as a feature makes it easier for the model
to capture circadian rhythms that may be present in, e.g., the vitals data.

3.4 Prediction Task


We split each patient’s record into 6 hour chunks using a sliding window and make a prediction
for a window of 4 hours after a gap time of 6 hours (Figure 2). When predicting ventilation, non-
invasive ventilation, or vasopressors, the model classifies the prediction window as one of four
possible outcomes: 1) onset, 2) wean, 3) continuing on an intervention, or 4) continuing to stay off
an intervention. In this representation, a label of 0 indicates “off” an intervention and a label of 1
indicates “on”. Therefore, a prediction window is an onset if there is a transition from a label of 0
to 1 for the patient during that window; weaning is the opposite: a transition from 1 to 0. A window
is classified as “stay on” if the label for the entire window is 1 or “stay off” if 0. When predicting
colloid or crystalloid boluses, we classify the prediction window into one of two classes: 1) onset,
or 2) no onset, since these interventions are not administered for on-going durations of time. After
splitting the patient records into fixed-length chunks, there are 1,154,101 examples. Table 1 lists the
proportions of each class for each intervention.

Intervention               Onset   Weaning   Stay Off   Stay On
Ventilation                0.005   0.017     0.798      0.18
Vasopressor                0.008   0.016     0.862      0.114
Non-invasive Ventilation   0.024   0.035     0.695      0.246
Colloid Bolus              0.003   -         -          -
Crystalloid Bolus          0.022   -         -          -

Table 1: Proportion of each intervention class. Note that colloid and crystalloid boluses are not
administered for specific durations, and thus have only a single class (onset).
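The windowing and four-class labeling described in this section can be sketched as below. The sketch assumes the window slides one hour at a time and that, if a prediction window contains more than one transition, the first one determines the label; neither detail is stated in the text, and the function names are hypothetical.

```python
def label_window(states):
    """Classify a prediction window of binary intervention states (0 = off, 1 = on)."""
    for prev, curr in zip(states, states[1:]):
        if prev == 0 and curr == 1:
            return "onset"
        if prev == 1 and curr == 0:
            return "wean"
    return "stay on" if states[0] == 1 else "stay off"


def make_examples(features, states, input_len=6, gap=6, pred_len=4):
    """Slide over one patient record and return (input_window, label) pairs.

    `features` and `states` hold one entry per hour of the stay.
    Assumption: the window advances by one hour per example.
    """
    examples = []
    last_start = len(states) - (input_len + gap + pred_len)
    for start in range(last_start + 1):
        pred_start = start + input_len + gap
        window = features[start:start + input_len]
        label = label_window(states[pred_start:pred_start + pred_len])
        examples.append((window, label))
    return examples


# A toy 20-hour stay in which the intervention starts at hour 14.
toy_states = [0] * 14 + [1] * 6
toy_features = list(range(20))  # stand-in for per-hour feature vectors
toy_examples = make_examples(toy_features, toy_states)
toy_labels = [label for _, label in toy_examples]
```

A 20-hour stay yields 20 - 16 + 1 = 5 examples under these assumptions; the first two prediction windows contain the 0 to 1 transition and are labeled as onsets.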

4. Methods
4.1 Long Short-Term Memory Network (LSTM)
Having seen the input sequence $x_1 \dots x_t$ of a given example, we predict $\hat{y}_t$, a
probability distribution over the outcomes, with target outcome $y_t$:

$$h_1 \dots h_t = \mathrm{LSTM}(x_1 \dots x_t) \tag{1}$$
$$\hat{y}_t = \mathrm{softmax}(W_y h_t + b_y) \tag{2}$$

where $x_i \in \mathbb{R}^{V}$, $W_y \in \mathbb{R}^{N_C \times L_2}$, $h_t \in \mathbb{R}^{L_2}$, and
$b_y \in \mathbb{R}^{N_C}$. Here $V$ is the dimensionality of the input (number of variables), $N_C$ is
the number of classes we predict, and $L_2$ is the second hidden layer size. Figure 3a shows a model
schematic, and more model details are provided in Appendix C.
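To make the output layer in Eq. (2) concrete, here is a small sketch that applies a linear map and a softmax to a final hidden state. It covers only the output step, not the LSTM recurrence, and all numeric values are made up; the paper's hidden size is 512, while three units are used here for readability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(h_t, W_y, b_y):
    """Eq. (2): y_hat = softmax(W_y h_t + b_y).

    h_t has length L2, W_y has N_C rows of length L2, and b_y has length N_C.
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) + b for row, b in zip(W_y, b_y)]
    return softmax(logits)


# Toy values: L2 = 3 hidden units, N_C = 4 classes.
h_t = [0.5, -1.0, 2.0]
W_y = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, 0.2, 0.0], [-0.1, 0.0, 0.1]]
b_y = [0.0, 0.1, -0.1, 0.0]
y_hat = output_layer(h_t, W_y, b_y)
```

The result is a length-4 probability vector, one entry per outcome (onset, wean, stay on, stay off).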

4.2 Convolutional Neural Network (CNN)


We employ a similar CNN architecture to Razavian et al. (2016), except that we do not initially
convolve the features into an intermediate representation. We represent features as channels and
perform 1D temporal convolutions, rather than treating the input as a 2D image. Our architecture
consists of temporal convolutions at three different temporal granularities with 64 filters each. The
dimensions of the filters are 1 × i, where i ∈ {3, 4, 5}.
We pad the inputs such that the outputs from the convolutional layers are the same size, and we
use a stride of 1. Each convolution is followed by a max pooling layer with a pooling size of 3. The
outputs from all three temporal granularities are concatenated and flattened, and followed by 2 fully
connected layers with dropout in between and a softmax over the output (Figure 3b).
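The following sketch illustrates one temporal convolution with features as channels, followed by max pooling, and derives the size of the flattened vector. Several details are assumptions rather than facts from the text: the pooling stride is taken to equal the pooling size (3), pooling is unpadded, and any extra padding for even filter widths goes on the right. The all-ones kernels are placeholders for learned filters.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1D temporal convolution with stride 1 for a single filter.

    x: list of T timesteps, each a list of C channel values (features as channels).
    kernel: list of k taps, each a list of C weights.
    Returns T outputs, so the temporal length is preserved.
    """
    n_steps, k = len(x), len(kernel)
    n_channels = len(x[0])
    pad_left = (k - 1) // 2  # assumption: extra padding (even k) goes on the right
    out = []
    for t in range(n_steps):
        acc = 0.0
        for j in range(k):
            src = t - pad_left + j
            if 0 <= src < n_steps:  # positions outside the window count as zero
                acc += sum(x[src][c] * kernel[j][c] for c in range(n_channels))
        out.append(acc)
    return out


def max_pool1d(seq, pool_size, stride):
    """Unpadded max pooling over a 1D sequence."""
    return [max(seq[i:i + pool_size]) for i in range(0, len(seq) - pool_size + 1, stride)]


kernel_sizes = (3, 4, 5)   # temporal granularities from the text
n_filters = 64             # filters per granularity, from the text
x = [[float(t), 1.0] for t in range(6)]  # 6 hours, 2 toy channels

pooled_lengths = []
for k in kernel_sizes:
    kernel = [[1.0, 1.0] for _ in range(k)]  # placeholder filter
    pooled = max_pool1d(conv1d_same(x, kernel), pool_size=3, stride=3)
    pooled_lengths.append(len(pooled))

flattened_size = n_filters * sum(pooled_lengths)
```

Under these assumptions each branch reduces the 6-hour window to 2 pooled positions, so the concatenated, flattened input to the fully connected layers has 3 x 64 x 2 = 384 values. A different pooling stride would change this number.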

4.3 Experimental Settings


We use a train/validation/test split of 70/10/20 and stratify the splits based on outcome. For the
LSTM, we use dropout with a keep probability of 0.8 during training (only on stacked layers), and
L2 regularization with lambda = 0.0001. We use 2 hidden LSTM layers of 512 nodes each. For
the CNN, we use dropout between fully-connected layers with a keep probability of 0.5. We use
a weighted loss function during optimization to account for class imbalances. All parameters were
determined using cross-validation with the validation set. We implemented all models in TensorFlow
version 1.0.1 using the Adam optimizer on mini-batches of 128 examples. We determine when
to stop training with early stopping based on AUC (area under the receiver operating characteristic
curve) performance on the validation set.

(a) The LSTM consists of two hidden layers with 512 nodes each. We sequentially feed in each hour’s
data. At the end of the example window, we use the final hidden state to predict the output.
(b) The CNN architecture performs temporal convolutions at 3 different granularities (3, 4, and 5
hours), max-pools and combines the outputs, and runs this through 2 fully connected layers to arrive
at the prediction.

Figure 3: Schematics of the a) LSTM and b) CNN model architectures.
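The text mentions a weighted loss for class imbalance without giving the weights. One common choice, shown here purely as an assumption, is to weight each class by the inverse of its frequency. The proportions below are the ventilation row of Table 1; the function names are hypothetical.

```python
import math

# Class proportions for ventilation, taken from Table 1.
proportions = {"onset": 0.005, "wean": 0.017, "stay off": 0.798, "stay on": 0.18}


def inverse_frequency_weights(props):
    """Weight each class by 1 / proportion, rescaled so the weights average to 1.

    Assumption: the paper does not specify its weighting scheme.
    """
    raw = {c: 1.0 / p for c, p in props.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}


def weighted_cross_entropy(probs, target, weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    return -weights[target] * math.log(probs[target])


class_weights = inverse_frequency_weights(proportions)
uniform = {c: 0.25 for c in proportions}  # an uninformative prediction
loss_onset = weighted_cross_entropy(uniform, "onset", class_weights)
loss_stay_off = weighted_cross_entropy(uniform, "stay off", class_weights)
```

With this scheme, a mistake on a rare onset example costs far more than the same mistake on a common stay-off example, which discourages the model from ignoring the small classes.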

4.4 Evaluation
We evaluate our results based on per-class AUCs as well as aggregated macro AUCs. If there are
$K$ classes each with a per-class AUC of $AUC_k$, then the macro AUC is defined as the average of
the per-class AUCs, $AUC_{macro} = \frac{1}{K} \sum_{k} AUC_k$. We use the macro AUC as an aggregate
score because it weights the AUCs of all classes equally, regardless of class size (Manning et al., 2008).
This is important because of the large class imbalance present in the data.
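The macro AUC can be computed with a short sketch like the one below, which treats each class one-vs-rest and uses the rank interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The pairwise loop is quadratic and meant for illustration only; the toy predictions are made up.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(prob_rows, targets, n_classes):
    """Average the one-vs-rest AUC of every class with equal weight."""
    per_class = []
    for k in range(n_classes):
        scores = [row[k] for row in prob_rows]
        labels = [1 if t == k else 0 for t in targets]
        per_class.append(binary_auc(scores, labels))
    return sum(per_class) / n_classes, per_class


# Toy predictions over 3 classes for 5 examples.
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.3, 0.3, 0.4],
]
targets = [0, 1, 2, 1, 2]
macro, per_class = macro_auc(prob_rows, targets, n_classes=3)
```

In this toy case the per-class AUCs are 1, 5/6 and 1, so the macro AUC is 17/18 regardless of how many examples each class has.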
We use L2 regularized logistic regression (LR) as a baseline for comparison with the neural
networks (Pedregosa et al., 2011). The same input is used for the LR model as for the numerical
LSTM and CNN (imputed time windows of data) but the timesteps are concatenated into a single
input vector.

4.5 Interpretability
4.5.1 LSTM Feature-Level Occlusion
Because of the additional time dependencies of recurrent neural networks, getting feature-level
interpretability from LSTMs is notoriously difficult. To achieve this, we borrow an idea from image
recognition to help understand how the LSTM uses different features of the patients. Zeiler and
Fergus (2013) use occlusion to understand how models process images: they remove a region of the
image (by setting all values in that region to 0) and compare the model’s prediction of this occluded
image with the original prediction. A large shift in the prediction implies that the occluded region
contains important information for the correct prediction. With our LSTM model, we mask features
one by one from the patients (replacing the given feature with random noise drawn from the same
distribution by bootstrapping). We then compare the predictive ability of the model with and without
each feature; when this difference is large, then the model was relying heavily on that feature to
make the prediction.
Note that examining feature interactions would require a more complex analysis to occlude
all pairs, triples, etc., but would not necessarily demonstrate the direction or exact nature of the
interaction.
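A minimal sketch of single-feature occlusion is given below. It uses a toy scoring function in place of a trained LSTM and flat feature rows in place of six-hour windows, so it only illustrates the procedure: resample one feature with replacement from its own empirical distribution, re-score, and record the drop in AUC. All names and data are hypothetical.

```python
import random


def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def occlude_feature(rows, j, rng):
    """Replace column j with values bootstrapped from that same column."""
    column = [row[j] for row in rows]
    return [row[:j] + [rng.choice(column)] + row[j + 1:] for row in rows]


def occlusion_importance(model, rows, labels, rng):
    """Return the AUC drop caused by occluding each feature in turn."""
    base = binary_auc([model(r) for r in rows], labels)
    drops = []
    for j in range(len(rows[0])):
        occluded = occlude_feature(rows, j, rng)
        drops.append(base - binary_auc([model(r) for r in occluded], labels))
    return drops


def toy_model(row):
    """Stand-in for a trained model: it only ever looks at feature 0."""
    return row[0]


rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]
drops = occlusion_importance(toy_model, rows, labels, rng)
```

Occluding feature 0 destroys the toy model's ranking and produces a large AUC drop, while occluding feature 1, which the model ignores, leaves the AUC unchanged.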
4.5.2 CNN Filter/Activation Visualization
We get interpretability from the CNN models in two ways. First, in order to understand how the
CNN is using the patient data to predict certain tasks, we find and compare the top 10 real examples
that our model predicts are most and least likely to have a specific outcome. As our gap time is 6
hours, this means that the model predicts high probability of onset of the given task 6 hours after
the end of the identified trajectories.
Second, we generate “hallucinations” from the model which maximize the predicted probability
for a given task (Erhan et al., 2009). This is done by creating an objective function that maximizes
the activation of a specific output node, and backpropagating gradients back to the input image,
adjusting the image so that it maximally activates the output node.
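The idea of adjusting an input to maximally activate an output node can be sketched with gradient ascent on a toy differentiable model. A single logistic unit stands in for the CNN here because its input gradient has a closed form; the weights, step size and number of steps are made up and not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Toy model: probability of the target outcome for input x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def hallucinate(x0, w, b, step=0.5, n_steps=100):
    """Gradient ascent on the input to maximise the predicted probability."""
    x = list(x0)
    for _ in range(n_steps):
        p = predict(x, w, b)
        grad = [p * (1.0 - p) * wi for wi in w]  # d sigmoid(w.x + b) / dx
        x = [xi + step * gi for xi, gi in zip(x, grad)]
    return x


w, b = [1.5, -2.0, 0.0], -0.5  # made-up parameters
x0 = [0.0, 0.0, 0.0]
x_star = hallucinate(x0, w, b)
p_before, p_after = predict(x0, w, b), predict(x_star, w, b)
```

The input moves in the direction that raises the output: up along positively weighted features, down along negatively weighted ones, and not at all along features the model ignores. Nothing constrains the result to a realistic range of values.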

5. Results
We found deep architectures achieved state-of-the-art prediction results for our intervention tasks,
compared to both our baseline as well as other work predicting intervention onset and weaning
(Ghassemi et al., 2017; Wu et al., 2016). The AUCs for each of our five intervention types and 4
prediction tasks are shown for all models in Table 2. All models use 6 hour chunks of “raw” data
which have either been transformed to a 0-1 range (normalized and mean imputed), or discretized
into physiological words (described in section 3.3).

5.1 Physiological Words Improve Predictive Task Performance With High Class Imbalance
We observed a significantly increased AUC for some interventions when using physiological words
— specifically for ventilation onset (from 0.61 to 0.75) and colloid bolus onset (from 0.52 to 0.72),
which have the lowest proportion of onset examples (Table 1). This may be because physiological
words have a smoothing effect. Since we round the z-score for each value to the nearest integer, if
a patient has a heart rate of 87 at one hour and then 89 at the next, those values will probably be
represented as the same word. This effect may make the model invariant to small fluctuations in
the patient’s data and more resilient to overfitting small classes. In addition, the physiological word
representation has an explicit encoding for missing data. This is in contrast to the raw data that has
been forward-filled and mean-imputed, introducing noise and making it difficult for the model to
know how confident to be in the measurements it is given (Che et al., 2016).

5.2 Feature-Level Occlusions Identify Important Per-Class Features


We are able to interpret the LSTM’s predictions using feature occlusion (Section 4.5.1). We note
that vitals, labs, topics and static data are important for different interventions (Figure 4). Table 5 in
Appendix D has a complete listing of the most probable words for each topic mentioned.
For mechanical ventilation, the top five most important features – pH, sodium, lactate, hemoglobin,
and potassium – are consistent for weaning and onset. This is sensible, because all are important
lab values used to assess a patient’s physiological stability, and ventilation is an aggressive interven-
tion. However, ventilation onset additionally places importance on a patient’s Glasgow Coma Score
(GCS) and Topic 4 (assessing patient consciousness), likely because patient sedation is a critical
part of mechanical ventilation. We also note that the scale of AUC difference between ventilation
onset and weaning is the largest observed (up to 0.30 for weaning and 0.12 for onset).
In vasopressor onset prediction, physiological variables such as potassium and hematocrit are
consistently important, which agrees with clinical assessment of cardiovascular state (Bassi et al.,
2013). Similarly, Topic 3 (noting many physiological values) is also important for both onset and
weaning. Note that the overall difference in AUC for onset ranges up to 0.16, but there is no
significant decrease in AUC for weaning (< 0.02). This is consistent with previous work that
demonstrated weaning to be a more difficult task in general for vasopressors (Wu et al., 2016). We
also note that weaning prediction places importance on time of day. As noted by Wu et al. (2016),
this could be a side-effect of patients being left on interventions longer than necessary.

                              Intervention Type
Task           Model          VENT   NI-VENT   VASO   COL BOL   CRYS BOL
Onset AUC      Baseline       0.60   0.66      0.43   0.65      0.67
               LSTM Raw       0.61   0.75      0.77   0.52      0.70
               LSTM Words     0.75   0.76      0.76   0.72      0.71
               CNN            0.62   0.73      0.77   0.70      0.69
Wean AUC       Baseline       0.83   0.71      0.74   -         -
               LSTM Raw       0.90   0.80      0.91   -         -
               LSTM Words     0.90   0.81      0.91   -         -
               CNN            0.91   0.80      0.91   -         -
Stay On AUC    Baseline       0.50   0.79      0.55   -         -
               LSTM Raw       0.96   0.86      0.96   -         -
               LSTM Words     0.97   0.86      0.95   -         -
               CNN            0.96   0.86      0.96   -         -
Stay Off AUC   Baseline       0.94   0.71      0.93   -         -
               LSTM Raw       0.95   0.86      0.96   -         -
               LSTM Words     0.97   0.86      0.95   -         -
               CNN            0.95   0.86      0.96   -         -
Macro AUC      Baseline       0.72   0.72      0.66   -         -
               LSTM Raw       0.86   0.82      0.90   -         -
               LSTM Words     0.90   0.82      0.89   -         -
               CNN            0.86   0.81      0.90   -         -

Table 2: Comparison of model performance on five targeted interventions. Models that perform
best for a given (intervention, task) pair are bolded.

For non-invasive ventilation onset and weaning the learned topics are more important than
physiological variables. This may mean that the need for less severe interventions can only be detected
from clinical insights derived in notes. Similar to vasopressors, we note that onset AUCs vary more
than weaning AUCs (0.14 vs 0.01), and that time of day is important for weaning.
For crystalloid and colloid bolus onsets, topics are all but one of the five most important features
for detection. Colloid boluses in general have more AUC variance for the topic features (0.14 vs.
0.05), which is likely due to the larger class imbalance compared to crystalloids.

5.3 Convolutional Filters Target Short-term Trajectories


We are able to understand the CNN by examining maximally activating patient trajectories (Section
4.5.2). Figure 5 shows the mean with standard deviation error bars for four of the most differen-
tiated features of the 10 real patient trajectories that are the highest and lowest activating for each
task. The trends suggest that patients who will require ventilation in the future have higher diastolic
blood pressure, respiratory rate, and heart rate, and lower oxygen saturation, possibly corresponding
to patients who are hyperventilating. For vasopressor onsets, we see a decreased systolic blood
pressure, heart rate and oxygen saturation rate. These could either indicate altered peripheral
perfusion or stress hyperglycemia. Topic 3, which was important for vasopressor onset using occlusion
(Figure 4), is also increased.

Figure 4: We are able to make interpretable predictions using an LSTM and occluding specific
features. These figures display the eight features that cause the largest decrease in prediction AUC
for each intervention task. In general, physiological data were more important for the more invasive
interventions – mechanical ventilation (4a, 4b) and vasopressors (4c, 4d) – while clinical note topics
were more important for less invasive tasks – non-invasive ventilation (4e, 4f) and fluid boluses (4g,
4h). Note that all weaning tasks except for ventilation have significantly less AUC variance.

Non-invasive ventilation onset is associated with decreased creatinine, phosphate, oxygen
saturation and blood urea nitrogen, potentially indicating neuromuscular respiratory failure. For colloid
and crystalloid boluses, we note general indicators of physiological decline, as boluses are given for
a wide range of conditions.
Hallucinations for vasopressor and ventilation onset are shown in Figure 6. While the model
was not trained with any physiological priors, it identifies blood pressure drops as being maximally
activating for vasopressor onset, and respiratory rate decline for ventilation onset. This suggests that
it is able to learn physiologically-relevant factors that are important for intervention prediction. The
hallucinations give us more insight into underlying properties of the network and what it is looking
for. However, since these trajectories are made to maximize the output of the model, they do not
necessarily correspond to physiologically plausible trajectories.

Figure 5: Trajectories of the 10 maximally and minimally activating examples for onset of each of
the interventions. These are the six hour trajectories that occur before another six hour gap time
preceding the onset.

Figure 6: Trajectories generated by adjusting inputs to maximally activate a specific output node of
the CNN.

6. Conclusion
In this work, we targeted forward-facing prediction of ICU interventions covering multiple
physiological organ systems. To our knowledge, our model is the first to use deep neural networks to
predict both onset and weaning of interventions using all available modalities of ICU data. In these
tasks, deep learning methods beat state-of-the-art AUCs reported in prior work for intervention
prediction tasks. This is sensible given that prior works have focused on single targets with smaller
datasets (Wu et al., 2016) or unsupervised representations prior to supervised training (Ghassemi
et al., 2017). We also note that LSTM models over physiological word inputs significantly improved
performance on the two intervention tasks with the lowest incidence rate, possibly because this
representation encodes important information about what is “normal” for each physiological value,
or is more robust to missingness in the physiological data.
Importantly, we were able to demonstrate interpretability for both models. In the LSTMs, we
examined feature importance using occlusion, and found that physiological data were important in
more invasive tasks, while clinical note topics were more important for less invasive interventions.
This could indicate that there is more clinical discretion at play for less invasive tasks. We also
found that all weaning tasks save ventilation had less AUC variance, which could indicate that these
decisions are also made with a large amount of clinical judgment.
The temporal convolutions in our CNN filters over the multi-channel input learnt interesting
and clinically-relevant trends in real patient trajectories, and these were further mimicked in the
hallucinations generated by the network. As in prior work (Razavian et al., 2016), we found that
RNNs often have similar or improved performance as compared to CNNs.
Acknowledgements
This research was funded in part by the Intel Science and Technology Center for Big Data, the
National Library of Medicine Biomedical Informatics Research Training grant 2T15 LM007092-22,
NIH National Institute of Biomedical Imaging and Bioengineering (NIBIB) grant R01-EB017205,
and NIH National Human Genome Research Institute (NHGRI) grant U54-HG007963.

References
Estevão Bassi, Marcelo Park, and Luciano Cesar Pontes Azevedo. Therapeutic strategies for high-
dose vasopressor-dependent shock. Critical care research and practice, 2013, 2013.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient
descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.

D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. JMLR, 3(5):993–1022, 2003.

Dustin Charles, Meghan Gabriel, and Michael F Furukawa. Adoption of electronic health record
systems among us non-federal acute care hospitals: 2008-2012. ONC data brief, 9:1–9, 2013.

Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neu-
ral networks for multivariate time series with missing values. arXiv preprint arXiv:1606.01865,
2016.

Edward Choi, Mohammad Taha Bahadori, and Jimeng Sun. Doctor AI: predicting clinical events
via recurrent neural networks. CoRR, abs/1511.05942, 2015. URL http://arxiv.org/
abs/1511.05942.

Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter
Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention
mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.

Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. In Advances in Neural Information Processing
Systems, pages 577–585, 2015.

Frederick D’Aragon, Emilie P Belley-Cote, Maureen O Meade, François Lauzier, Neill KJ Ad-
hikari, Matthias Briel, Manoj Lalu, Salmaan Kanji, Pierre Asfar, Alexis F Turgeon, et al. Blood
pressure targets for vasopressor therapy: A systematic review. Shock, 43(6):530–539, 2015.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer
features of a deep network. Technical report, University of Montreal, 2009.

Cristóbal Esteban, Oliver Staeck, Stephan Baier, Yinchong Yang, and Volker Tresp. Predicting
clinical events by combining static and dynamic information using recurrent neural networks. In
Healthcare Informatics (ICHI), 2016 IEEE International Conference on, pages 93–101. IEEE,
2016.

AS Fialho, LA Celi, F Cismondi, SM Vieira, SR Reti, JM Sousa, SN Finkelstein, et al.
Disease-based modeling to predict fluid response in intensive care units. Methods Inf Med,
52(6):494–502, 2013.

Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna
Rumshisky, and Peter Szolovits. Unfolding physiological state: Mortality modelling in intensive
care units. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 75–84. ACM, 2014.

Marzyeh Ghassemi, Marco AF Pimentel, Tristan Naumann, Thomas Brennan, David A Clifton,
Peter Szolovits, and Mengling Feng. A multivariate timeseries modeling approach to severity
of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. In Proc.
Twenty-Ninth AAAI Conf. on Artificial Intelligence, 2015.

Marzyeh Ghassemi, Mike Wu, Michael Hughes, and Finale Doshi-Velez. Predicting intervention
onset in the ICU with switching state space models. In Proceedings of the AMIA Summit on
Clinical Research Informatics (CRI), volume 2017. American Medical Informatics Association,
2017.

T. Griffiths and M. Steyvers. Finding scientific topics. In PNAS, volume 101, pages 5228–5235,
2004.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in
Neural Information Processing Systems, pages 1693–1701, 2015.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.

Yang N Jamoom E and Hing E. Office-based physician electronic health record adoption. Office of
the National Coordinator for Health Information Technology, 2016.

Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad
Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III,
a freely accessible critical care database. Scientific data, 3, 2016.

Rahul G Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint
arXiv:1511.05121, 2015.

Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with
LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.

ML Malbrain, Paul E Marik, Ine Witters, Colin Cordemans, Andrew W Kirkpatrick, Derek J
Roberts, and Niels Van Regenmortel. Fluid overload, de-resuscitation, and outcomes in critically
ill or injured patients: a systematic review with suggestions for clinical practice. Anaesthesiol
Intensive Ther, 46(5):361–80, 2014.

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information
Retrieval. Cambridge University Press, 2008.

Marcus Müllner, Bernhard Urbanek, Christof Havel, Heidrun Losert, Gunnar Gamper, and Harald
Herkner. Vasopressors for shock. The Cochrane Library, 2004.

Gustavo A Ospina-Tascón, Gustavo Luiz Büchele, and Jean-Louis Vincent. Multicenter,
randomized, controlled trials evaluating mortality in intensive care: Doomed to fail? Critical care
medicine, 36(4):1311–1322, 2008.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12:2825–2830, 2011.

Narges Razavian, Jake Marcus, and David Sontag. Multi-task prediction of disease onsets from
longitudinal lab tests. In JMLR (Journal of Machine Learning Research): MLHC Conference
Proceedings, 2016.

Cátia M Salgado, Susana M Vieira, Luís F Mendonça, Stan Finkelstein, and João MC Sousa.
Ensemble fuzzy models in personalized medicine: Application to vasopressors administration.
Engineering Applications of Artificial Intelligence, 49:141–148, 2016.

Martin J Tobin. Principles and practice of mechanical ventilation, 2006.

Jean-Louis Vincent. Critical care-where have we been and where are we going? Critical Care, 17
(Suppl 1):S2, 2013.

Mike Wu, Marzyeh Ghassemi, Mengling Feng, Leo A Celi, Peter Szolovits, and Finale Doshi-Velez.
Understanding vasopressor intervention and weaning: Risk prediction in a public heterogeneous
clinical time series database. Journal of the American Medical Informatics Association, page
ocw138, 2016.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov,
Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation
with visual attention. In ICML, volume 14, pages 77–81, 2015.

Karl L Yang and Martin J Tobin. A prospective study of indexes predicting the outcome of trials of
weaning from mechanical ventilation. New England Journal of Medicine, 324.

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR,
abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.

Appendix
Appendix A. Dataset Statistics

Table 3: Variables

Static Variables
  Gender                         Age                         Ethnicity
  ICU                            Admission Type

Vitals and Labs
  Anion gap                      Bicarbonate                 Blood pH
  Blood urea nitrogen            Chloride                    Creatinine
  Diastolic blood pressure       Fraction inspired oxygen    Glasgow coma scale total
  Glucose                        Heart rate                  Hematocrit
  Hemoglobin                     INR*                        Lactate
  Magnesium                      Mean blood pressure         Oxygen saturation
  Partial thromboplastin time    Phosphate                   Platelets
  Potassium                      Prothrombin time            Respiratory rate
  Sodium                         Systolic blood pressure     Temperature
  Weight                         White blood cell count

* International normalized ratio of the prothrombin time

Table 4: Dataset Statistics

                                       Train       Test      Total
Patients                              27,318      6,830     34,148
Notes                                564,652    140,089    703,877
Elective Admission                     4,536      1,158      5,694
Urgent Admission                         746        188        934
Emergency Admission                   22,036      5,484     27,520
Mean Age                                63.9       64.1       63.9
Black/African American                 1,921        512      2,433
Hispanic/Latino                          702        166        868
White                                 19,424      4,786     24,210
CCU (coronary care unit)               4,156        993      5,149
CSRU (cardiac surgery recovery)        5,625      1,408      7,033
MICU (medical ICU)                     9,580      2,494     12,074
SICU (surgical ICU)                    4,384      1,074      5,458
TSICU (trauma SICU)                    3,573        861      4,434
Female                                11,918      2,924     14,842
Male                                  15,400      3,906     19,306
ICU Mortalities                        1,741        439      2,180
In-hospital Mortalities                2,569        642      3,211
30 Day Mortalities                     2,605        656      3,216
90 Day Mortalities                     2,835        722      3,557
Vasopressor Usage                      8,347      2,069     10,416
Ventilator Usage                      11,096      2,732     13,828

Appendix B. Physiological Word Generation
See Figure 7.

Figure 7: Converting data from continuous timeseries format to discrete “physiological words.” The
numeric values are first z-scored and rounded, and then each z-score is made into its own category.
On the right, glucose -2 indicates the presence of a glucose value that was 2 standard deviations
below the mean. A row containing all zeros for a given variable indicates that the value for that
variable was missing at the timestep.
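
The conversion described in the Figure 7 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `physiological_words`, the use of a population mean and standard deviation passed in by the caller, and the clipping of z-scores to the range [-4, 4] are all assumptions made here so that the set of categories is finite.

```python
import math


def physiological_words(values, mean, std, max_abs_z=4):
    """Turn one variable's time series into one-hot "physiological words".

    values    : list of floats, with None or NaN marking a missing measurement
    mean, std : statistics used for z-scoring (assumed to be given)
    max_abs_z : assumed clipping bound so the category set is finite
    """
    # One category per integer z-score, e.g. -4, -3, ..., 3, 4.
    categories = list(range(-max_abs_z, max_abs_z + 1))
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v is not None and not math.isnan(v):
            z = round((v - mean) / std)                # z-score, then round
            z = max(-max_abs_z, min(max_abs_z, z))     # assumed clipping
            row[z + max_abs_z] = 1                     # one-hot category
        # A missing value leaves the row as all zeros.
        rows.append(row)
    return categories, rows


# Example: a glucose series with one missing timestep.
categories, rows = physiological_words(
    [60.0, float("nan"), 100.0], mean=100.0, std=20.0
)
```

In this example the first timestep falls 2 standard deviations below the mean and is therefore encoded as the category -2, matching the "glucose -2" word in the figure, while the missing second timestep is encoded as a row of zeros.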

Appendix C. LSTM Model Details


LSTM performs the following update equations for a single layer, given its previous hidden state
and the new input:

\begin{align}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{3} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{4} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{5} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{6} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{7} \\
h_t &= o_t \odot \tanh(c_t) \tag{8}
\end{align}

where $W_f, W_i, W_c, W_o \in \mathbb{R}^{L_1 \times (L_1 + V)}$ and $b_f, b_i, b_c, b_o \in \mathbb{R}^{L_1}$ are learned parameters, and
$f_t, i_t, \tilde{c}_t, c_t, o_t, h_t \in \mathbb{R}^{L_1}$. In these equations, $\sigma$ stands for an element-wise application of the sigmoid
(logistic) function, and $\odot$ is an element-wise product. This is generalized to multiple layers by providing
$h_t$ from the previous layer in place of the input.
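
As an illustration, equations (3) through (8) can be written as a single NumPy step. This is a sketch under stated assumptions rather than the implementation used in the paper: the function names `lstm_step` and `init_params`, the dictionary layout of the parameters, and the random initialization are all hypothetical.

```python
import numpy as np


def sigmoid(a):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, params):
    """One single-layer LSTM update following equations (3)-(8).

    x_t    : input at time t, shape (V,)
    h_prev : previous hidden state h_{t-1}, shape (L1,)
    c_prev : previous cell state c_{t-1}, shape (L1,)
    params : dict with W_* of shape (L1, L1 + V) and b_* of shape (L1,)
    """
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # (3) forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # (4) input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # (5) candidate
    c_t = f_t * c_prev + i_t * c_tilde                          # (6) cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # (7) output gate
    h_t = o_t * np.tanh(c_t)                                    # (8) hidden state
    return h_t, c_t


def init_params(L1, V, seed=0):
    """Hypothetical initialization: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(L1, L1 + V))
        params[f"b_{gate}"] = np.zeros(L1)
    return params


# Example: run one step with L1 = 4 hidden units and V = 3 input features.
params = init_params(L1=4, V=3)
h_t, c_t = lstm_step(np.ones(3), np.zeros(4), np.zeros(4), params)
```

A second layer would call the same step with the first layer's $h_t$ in place of $x_t$, in which case its weight matrices would have $L_1$ input columns for that part instead of $V$.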
We calculate classification loss using categorical cross-entropy, which sets the loss for
predictions for $N$ examples over $M$ classes as:

\[
L(\hat{y}_1 \ldots \hat{y}_N) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat{y}_{ij}
\]

where $\hat{y}_{ij}$ is the probability our model predicts for example $i$ being in class $j$, and $y_{ij}$ is the true
value.
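
A direct transcription of this loss into Python might look like the sketch below. The function name `categorical_cross_entropy` and the small constant `eps`, which guards against taking the logarithm of zero, are assumptions added for illustration and are not part of the definition above.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over N examples and M classes.

    y_true : N rows of M true values y_ij (typically one-hot)
    y_pred : N rows of M predicted probabilities y-hat_ij
    eps    : assumed lower bound on probabilities to avoid log(0)
    """
    n = len(y_true)
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y_ij, p_ij in zip(true_row, pred_row):
            if y_ij:  # terms with y_ij = 0 contribute nothing to the sum
                total += y_ij * math.log(max(p_ij, eps))
    return -total / n


# Example: two examples, two classes.
loss = categorical_cross_entropy(
    y_true=[[1, 0], [0, 1]],
    y_pred=[[0.5, 0.5], [0.25, 0.75]],
)
```

With one-hot targets, only the predicted probability of the true class enters the sum, so here the loss is the average of $-\log 0.5$ and $-\log 0.75$.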
Appendix D. Generated Topics

Table 5: Most probable words in the topics most important for intervention predictions.

Topic 1
Possible topic: Respiratory failure/infection
Top words: pt care resp vent respiratory secretions remains intubated abg plan psv bs support
settings cont placed changes note wean rsbi coarse cpap continue peep suctioned clear
extubated rr mask weaned

Topic 2
Possible topic: Discussion of end-of-life care
Top words: family pt ni care patient dnr stitle dr home daughter support team meeting wife son
comfort note social doctor sw dni known time status hospital contact pt’s work plan lastname

Topic 3
Possible topic: Multiple physiological changes
Top words: hr resp gi pt cont gu neuro bs cv id note abd soft bp today stool social noted
progress clear remains nursing skin urine sats foley npn yellow stable ls

Topic 4
Possible topic: Assessments of patient responsiveness
Top words: pain pt assessment response action plan control continue given dilaudid monitor
chronic acute morphine iv po prn patient pca hr meds bp drain cont nausea ordered relief sbp
pericardial assess

Topic 10
Possible topic: Continued need for ventilation
Top words: pt intubated vent propofol sedation sedated fentanyl peep tube versed secretions
abg wean remains continue ett suctioned plan ps increased extubation settings ac sounds min
cpap sputum respiratory hr ogt

Topic 38
Possible topic: Many labs tested
Top words: ml dl mg pm meq assessed icu ul total medications systems review pulse labs
balance comments code hour rr min respiratory rhythm prophylaxis admission allergies blood
urine mmhg status dose

Topic 48
Possible topic: Emergency admission/transfer patient
Top words: ed pt patient transferred hospital pain admitted denies admission days nausea
received ago presented micu showed vomiting past reports history given blood bp old year
arrival known osh diarrhea unit
