CS598 Report
Table 5: A comparison of mean absolute error values on the test set for different batch sizes and learning rates after Winsorization

Figure 6: Histogram showing episodes over length of stay
4. RESULTS
4.1 Evaluation
The initial results of our benchmark models closely align with the
findings reported in the benchmark study [12], with our results
showing slight improvement. For instance, the original paper
reported a mean absolute error (MAE) of 94.7 for a basic LSTM
model, whereas our model achieved an MAE of 79.3. Notably, we trained using a regression output rather than the custom classification bins employed in the original study. Our attempts at training with custom bins (e.g., 1 day, 2 days, 3 days, and so on, up to 2 weeks) did not yield satisfactory results.
Please refer to Table 7 for a comprehensive listing of results.
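The MAE metric used throughout these comparisons can be computed with a short helper; the target and prediction values below are illustrative placeholders, not actual model outputs.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between true and predicted length of stay (hours)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).mean())

# Hypothetical remaining-LOS targets and regression outputs, in hours.
y_true = [72.0, 48.0, 120.0, 36.0]
y_pred = [60.0, 50.0, 100.0, 40.0]
print(mean_absolute_error(y_true, y_pred))  # 9.5
```

The same helper applies unchanged to any model's predictions, which is what makes the regression formulation directly comparable across the benchmark models.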
When evaluating a forecasting model, it is essential to understand two key aspects: a) the amount of historical data required to make accurate predictions and b) how closely the model's predictions align with the current state [28]. To assess the model's performance at different intervals, we initially applied Winsorization across the length of stay for each episode, removing data points below the 3rd and above the 97th percentiles. Additionally, we filtered out episodes lasting less than 60 hours to ensure consistent comparison across different time periods. The choice of 60 hours was made because it is close to one standard deviation of the average length of stay (66 hours). Figure 6 illustrates the distribution of episodes over the length of stay after applying Winsorization.

Considering the test set comprises 1,555 episodes, we observe a consistent number of data points at each period up to 60 hours, as depicted in Figure 7. By plotting the mean squared error at these different time periods, we gain insight into how the models perform as they access more information over time. The results are presented in Figure 8.

Figure 7: Distribution of data points across different lengths of stay

As anticipated, the error decreases over progressive time periods for all models, except for linear regression, whose error decreases until 50 hours but then rises again. The other models also exhibit an inflection point at 60 hours, except for NeuralLOS with full data and LSTM.

When comparing NeuralLOS using only physiological (tabular) data with a model augmented with Bio-ClinicalBERT embeddings [16], we included a version of NeuralLOS trained on the same dataset as the model with notes. Note processing consumes a significant amount of time, preventing training on the full dataset. To facilitate comparison, we trained NeuralLOS on the same smaller dataset to discern differences. The model with notes appears to perform better than the tabular-only model overall, but not for stays longer than 60 hours.
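The episode filtering used for these evaluations (percentile-based trimming plus a minimum-stay cutoff) can be sketched as follows; the array contents, variable names, and exact percentile parameters are illustrative assumptions, not the study's code.

```python
import numpy as np

def filter_episodes(los_hours, lower_pct=3, upper_pct=97, min_stay=60):
    """Drop extreme and short episodes before per-interval evaluation.

    los_hours: per-episode lengths of stay, in hours.
    Returns only the values kept for comparison across time periods.
    """
    los = np.asarray(los_hours, dtype=float)
    lo, hi = np.percentile(los, [lower_pct, upper_pct])
    keep = (los >= lo) & (los <= hi) & (los >= min_stay)
    return los[keep]

los = np.array([12.0, 55.0, 61.0, 66.0, 70.0, 90.0, 400.0])
print(filter_episodes(los))  # [61. 66. 70. 90.]
```

The 60-hour cutoff guarantees every surviving episode contributes a data point at each evaluation interval up to 60 hours, which is why the per-period counts in Figure 7 stay consistent.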
lr \ batch size      32          128         256
0.01                 1.008e+4    1.528e+3    3.337e+3
0.001                1.636       1.845       1.765
0.0001               1.662       1.706       1.737
0.00001              2.349       2.839       2.596

We also investigated whether predictions become more accurate as the patient approaches the end of their stay. Using the same episodes, we categorized bins from 2 weeks (336 hours) down to 12 hours. The number of data points in each bin is illustrated in Figure 9.
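The bin categorization over remaining stay can be sketched with `np.digitize`; only the 336-hour and 12-hour endpoints come from the text above, so the intermediate bin edges here are assumptions for illustration.

```python
import numpy as np

# Hypothetical bin edges over remaining hours of stay,
# spanning 12 hours up to 2 weeks (336 hours).
edges = np.array([12, 24, 48, 96, 168, 336], dtype=float)

def bin_remaining_hours(hours_remaining):
    """Return the bin index for each remaining-stay value."""
    return np.digitize(np.asarray(hours_remaining, dtype=float), edges)

print(bin_remaining_hours([10, 30, 200, 400]))  # [0 2 5 6]
```

Counting the data points per bin index then reproduces the kind of histogram shown in Figure 9.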
Figure 12: LSTM deviation distribution
Figure 13: NeuralLOS deviation distribution
Figure 14: NeuralLOS with notes deviation distribution
Figure 15: NeuralLOS with notes deviation distribution
Figure 16: LSTM deviation distribution
Figure 17: NeuralLOS deviation distribution
Figure 18: NeuralLOS with notes deviation distribution

4.2 Infrastructure

We trained and evaluated our models on the AWS Cloud Platform. The machine configuration is listed below:

Machine Type: n1-standard [16 vCPUs]
CPU Platform: Intel Broadwell
Memory: 110 GB
GPU: NVIDIA Tesla 4
Storage: 400 GB SSD

We used widely adopted Python libraries, including but not limited to PyTorch, Keras, TensorFlow, scikit-learn, Matplotlib, and pickle. Our code is accessible through a GitHub repository. It is important to note that the benchmark code is constrained by data-preprocessing-intensive tasks. Initially, the benchmark code lacked the capability to run in parallel and to utilize GPU resources effectively. Consequently, we dedicated significant effort to implementing multi-threaded data preprocessing, aiming to maximize GPU utilization.

5. TEAM CONTRIBUTIONS

This was a collaborative effort, with all of us involved in various aspects of the planning, experimentation, and training phases of model development.

Vibhor spearheaded the setup of AWS environments for training the LSTM and linear regression models. He also led the effort to adapt and upgrade the benchmark code to ensure compatibility with newer versions of libraries such as TensorFlow and Keras. Addressing speed challenges, Alan implemented multiprocessing capabilities in the preprocessing routines used for creating training tensors. Additionally, Alan authored the Data and Evaluation sections of the report and developed the program responsible for aggregating results from all models.

Priyank played a key role in designing the NeuralLOS model architecture and implementing the dataset windowing techniques. He actively participated in model training and metric generation.

We all contributed significantly to generating the BioSentVec and BioClinicalBERT embeddings for the notes and to producing preprocessed data with the benchmark code.

6. CONCLUSIONS

Forecasting the length of a patient's stay is a critical challenge in healthcare. An estimate of the remaining length of stay helps hospitals allocate resources for healthcare services more effectively. Additionally, it provides valuable insight for insurance companies to estimate expenses accurately. Leveraging NeuralLOS, we achieved strong results in predicting length of stay. By comparing our model with various benchmark models and presenting results from different perspectives, we demonstrated its effectiveness. Although further refinement is required to enhance the model's performance, even in its current implementation NeuralLOS yields superior results.

7. LIMITATIONS

One of the primary challenges we encountered was the scarcity of computational resources required to process the entire dataset. The embeddings of the notes consume significant memory, and due to memory constraints we were unable to hold the entire working set in memory. Additionally, since NeuralLOS involves computing a large number of parameters, utilizing GPUs was imperative to expedite training. Despite encountering some hurdles, we managed to secure access to a GPU in GCP with limited capacity. Consequently, we trained our EpisodeNet on a subset of the data. An intriguing observation was that the prediction accuracy of NeuralLOS improves for patients with longer stays. This improvement can be attributed to the accumulation of more information over time, enabling the model to make more accurate predictions.

8. RESOURCES

GitHub: https://github.com/vibhor-github/lenght-of-stay.git

9. ACKNOWLEDGEMENTS

We express our gratitude to Professor Jimeng Sun and all the teaching assistants for their invaluable guidance and support throughout this work.

10. REFERENCES

[1] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. B. A. McDermott. Publicly available clinical BERT embeddings, 2019.
[2] H. Baek, M. Cho, S. Kim, H. Hwang, M. Song, and S. Yoo. Analysis of length of hospital stay using electronic health records: A statistical and data mining approach. PLoS ONE, 13(4):e0195901, 2018.
[3] K. Canese and S. Weis. PubMed: the bibliographic database. In The NCBI Handbook [Internet], 2nd edition. National Center for Biotechnology Information (US), 2013.
[4] Q. Chen, Y. Peng, and Z. Lu. BioSentVec: creating sentence embeddings for biomedical texts. 2019 IEEE International Conference on Healthcare Informatics (ICHI), Jun 2019.
[5] D. E. Clark and L. M. Ryan. Concurrent prediction of hospital mortality and length of stay from risk factors on admission. Health Services Research, 37(3):631–645, 2002.
[6] S. Cropley. The relationship-based care model: evaluation of the impact on patient satisfaction, length of stay, and readmission rates. JONA: The Journal of Nursing Administration, 42(6):333–339, 2012.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] G. DH. Length of stay: Prediction and explanation. Health Services Research, 3(1):12–34, 1968.
[9] J. Fang, J. Zhu, and X. Zhang. Prediction of length of stay on the intensive care unit based on Bayesian neural network. In Journal of Physics: Conference Series, volume 1631, page 012089. IOP Publishing, 2020.
[10] R. Figueroa, J. Harman, and J. Engberg. Use of claims data to examine the impact of length of inpatient psychiatric stay on readmission rate. Psychiatric Services, 55(5):560–565, 2004.
[11] T. Gentimis, A. J. Alnaser, A. Durante, K. Cook, and R. Steele. Predicting hospital length of stay using neural networks on MIMIC-III data. In 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 1194–1201, 2017.
[12] H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1), Jun 2019.
[19] A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
[22] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–345. Springer, 1999.
[23] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, Sep 2019.
[24] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695, 2018.
[25] K. J. Ottenbacher, P. M. Smith, S. B. Illig, R. T. Linn, G. V. Ostir, and C. V. Granger. Trends in length of stay, living setting, functional outcome, and mortality following medical rehabilitation. JAMA, 292(14):1687–1695, 2004.