Article
DOI: https://doi.org/10.21203/rs.3.rs-3044914/v2
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Additional Declarations: There is a potential competing interest. Prof. Eran Halperin has an affiliation
with Optum.
SLIViT: a general AI framework for clinical-feature diagnosis from limited
3D biomedical-imaging data
Oren Avram*,1,2,3, Berkin Durmus*,2, Nadav Rakocz2, Giulia Corradetti4,5, Ulzee An1,2,
Muneeswar G. Nittala4,5, Akos Rudas1, Yu Wakatsuki4, Kazutaka Hirabayashi4, Swetha
Velaga4, Liran Tiosano4, Federico Corvi4, Aditya Verma4,6, Ayesha Karamat4, Sophiana
Lindenberg4, Deniz Oncel4, Louay Almidani4, Victoria Hull4, Sohaib Fasih-Ahmad4, Houri
Esmaeilkhanian4, Charles C. Wykoff7, Elior Rahmani1, Corey W. Arnold8,9,10, Bolei Zhou2,
Noah Zaitlen11,12, Ilan Gronau13, Sriram Sankararaman1,2,12, Jeffrey N. Chiang1, Srinivas
R. Sadda+,4,5, Eran Halperin+,1,2,3,12
*Equal contribution
+Joint supervision
Abstract
We present SLIViT, a deep-learning framework that accurately measures
disease-related risk factors in volumetric biomedical imaging, such as magnetic
resonance imaging (MRI) scans, optical coherence tomography (OCT) scans, and
ultrasound videos. To evaluate SLIViT, we applied it to five different datasets of these
three different data modalities tackling seven learning tasks (including both classification
and regression) and found that it consistently and significantly outperforms
domain-specific state-of-the-art models, typically improving performance (ROC AUC or
correlation) by 0.1-0.4. Notably, compared to existing approaches, SLIViT can be
applied even when only a small number of annotated training samples is available,
which is often a constraint in medical applications. When trained on less than 700
annotated volumes, SLIViT obtained accuracy comparable to that of trained clinical
specialists while reducing annotation time by a factor of 5,000, demonstrating its utility
for automating and expediting ongoing research as well as other practical clinical scenarios.
Main
Biomedical imaging analysis is a critical component of clinical care with widespread use
across multiple domains. For example, analyzing optical coherence tomography (OCT)
images of the retina allows ophthalmologists to diagnose and follow up on ocular
diseases, such as age-related macular degeneration (AMD), and tailor appropriate and
personalized interventions to delay the progression of retinal atrophy and irreversible
vision loss1,2. Another example is the analysis of heart function using cardiac imaging,
such as heart computed tomography (CT) and ultrasound. Monitoring heart function can
help cardiologists assess potential cardiac issues, prescribe medications to improve a
medical condition, e.g., reduced heart ejection fraction, and guide treatment decisions3,4.
Lastly, radiologists’ analysis and regular monitoring of breast imaging, such as
mammography and magnetic resonance imaging (MRI), help detect early breast
cancers, initiate timely interventional therapy, and determine the effectiveness of
such therapies5,6. These medical insights and actionable information are obtained
following an expert’s time-intensive manual analysis. The automation of these analyses
using artificial intelligence may further improve healthcare as it reduces costs and
treatment burden7.
Deep vision models, such as Convolutional Neural Networks (CNNs) and their
derivatives, are considered state-of-the-art methods to tackle computer vision tasks in
general8,9 and medical-related vision tasks in particular10. In order to train a deep vision
model to accurately learn and predict a target variable in a general vision task
(excluding segmentation tasks) from scratch, a very large number of training samples is
needed. Transfer learning addresses this challenge by pre-training a vision model for a
general learning task on a very large data set, and then using this general model as a
starting point for training a specialized model on a much smaller dataset11. The key
advantage of transfer learning is that the pre-training can be done on a large dataset in
another domain, where data are abundant, and then the fine-tuning can be done using a
small dataset in the domain of interest. Using a transfer learning approach, a plethora of
previously developed deep vision methods analyzing 2D biomedical-imaging12–15, were
first pre-trained on over a million labeled natural images (in a supervised fashion) taken
from ImageNet16, and later on, fine-tuned to a specific medical-learning task on a much
smaller number of labeled biomedical images (typically fewer than 10,000). Some
methods used self-supervised-based transfer-learning techniques relying mainly on
unlabeled medical data17–19, and others combined both natural and medical images7,20.
Overall, the understanding that pre-trained weights can be leveraged as ‘prior
knowledge’ for fine-tuning downstream learning tasks was a major factor in the
fruitfulness of the majority of these 2D biomedical-imaging deep vision models.
Many diagnoses rely, however, on volumetric biomedical imaging (e.g., 3D OCT and
MRI scans, or ultrasound videos) and transfer learning is not directly applicable, since in
contrast to the 2D domain, there is no large annotated ‘ImageNet-like’ dataset of
structured 3D scans. Moreover, annotating 3D biomedical images is far more
labor-intensive than annotating 2D images. For example, a 3D OCT scan composed of 97
2D frames (usually referred to as B-scans) typically requires a 5-10 minute inspection by
a highly trained clinical retina specialist in order to detect retinal-disease biomarkers,
such as the volume of a drusen lesion21. Therefore, considering the resources typically
devoted to such a task, it is practically infeasible to annotate 100,000 (or more)
volumes to eliminate the necessity of supervised transfer learning. In fact, even merely
compiling such large volumetric datasets (without labels), as required for
self-supervised learning22, could be cost-, processing-, and storage-prohibitive
when only standard resources are available23. These gaps are acute because state-of-the-art
supervised models for 3D image analysis, such as 3D ResNet24 and 3D Vision
Transformer25 (ViT), involve the optimization of a very large number of parameters, thus
requiring large datasets for training26.
Results
In order to cope with volumetric data, we treat each volume as a set of slices. A similar
technique was shown to be effective for volumetric data modalities42. Essentially, each
original slice of the volume is embedded into a single feature map. However, SLIViT
reduces memory overhead and accelerates processing by tiling the 2D images
into a single elongated 2D image (rather than a set of separate images), such that it
conforms with the input dimensions expected by the 2D-based feature-map extractor.
Once the feature maps are extracted, they are paired with (trainable) positional
embeddings and comprehensively aggregated using a downstream ViT module40.
SLIViT’s ViT module, together with the (trainable) positional embeddings, makes it possible to
preserve long-range dependencies across the depth dimension if needed29,43. Similar
divide-and-conquer schemes were shown to be fruitful in other studies as well44,45. Of
note, the ViT’s attention mechanism implicitly eliminates the necessity for image
registration preprocessing.
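To make the tiling step concrete, below is a minimal PyTorch sketch (not the released implementation; the tensor shapes, the 256-pixel target size, and the channel handling are illustrative assumptions) of how the N slices of a volume can be resized and vertically tiled into one elongated 2D image that a standard 2D backbone can consume in a single pass.

```python
import torch
import torch.nn.functional as F

def tile_volume(volume: torch.Tensor, size: int = 256) -> torch.Tensor:
    """Resize each of the N slices to size x size and stack them vertically.

    volume: tensor of shape (N, H, W) -- one 3D scan treated as N 2D slices.
    Returns a tensor of shape (1, 3, N*size, size), ready for a 2D CNN backbone.
    """
    n = volume.shape[0]
    # Resize every slice to the backbone's expected spatial resolution.
    slices = F.interpolate(volume.unsqueeze(1), size=(size, size),
                           mode="bilinear", align_corners=False)  # (N, 1, size, size)
    elongated = slices.reshape(1, 1, n * size, size)               # vertical tiling
    return elongated.repeat(1, 3, 1, 1)                            # replicate to 3 channels

# Example: a 97-slice OCT volume becomes one (1, 3, 97*256, 256) image.
oct_volume = torch.rand(97, 496, 512)
print(tile_volume(oct_volume).shape)  # torch.Size([1, 3, 24832, 256])
```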
We tested SLIViT on five datasets of three different data modalities (OCT, ultrasound,
and MRI) with a limited number of annotated samples, tackling a variety of
clinical-feature learning tasks (including both classification and regression). In the OCT
experiment, we evaluated the performance of diagnosing ocular-disease high-risk
factors27 and measured it by both the receiver operating characteristic (ROC) area
under the curve (AUC) and precision-recall (PR) AUC. In the ultrasound and MRI
experiments, we compared the R² of the models’ predictions vs. ground truth in
cardiac function analysis and in hepatic fat level imputation, respectively. In each data
modality, we compared SLIViT with a diverse set of up to six strong baselines, including
domain-specific24,27–29 and generic (fully-supervised-24,25 and self-supervised-based7,19)
state-of-the-art methods. SLIViT manifested consistent and significant performance
superiority across domains (Fig. 2). In the following sections we present these and
additional results in detail.
This result, together with the exceptional magnitude of this public annotated dataset,
further motivated us to examine the dynamics of the training set size and SLIViT’s
performance in predicting the EF of a given echocardiogram (Fig. 4, lower panel). We
randomly sampled size-decreasing subsets from the original training set and trained a
SLIViT model per subset. Compared to other examined methods trained on the original
training set (n=7,465), when SLIViT used the 25% subset (n=1,866), its performance
(R²=0.487; CI [0.466, 0.507]) was significantly better than R3D, R(2+1)D, and 3D ViT
(paired t-test p-value<0.001); on par with EchoNet (paired t-test p-value>0.579); and
significantly lower than UniMiSS (paired t-test p-value<0.001). When SLIViT used the
50% subset, it significantly outperformed all other benchmarked methods (R²=0.614; CI
[0.594, 0.634]; paired t-test p-value<0.001). These observations substantiate SLIViT’s
ability to appropriately learn spatiotemporal features using a sparsely-labeled dataset.
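As an illustration of this subsampling protocol, here is a short sketch that draws size-decreasing random subsets of the training indices; whether the paper's subsets were nested and which seed was used are not stated, so both are assumptions here.

```python
import numpy as np

def sample_subsets(n_train: int, fractions=(0.25, 0.5, 1.0), seed=42):
    """Draw nested, size-decreasing random subsets of the training indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_train)  # one shuffle, then nested prefixes
    return {f: order[: int(f * n_train)] for f in fractions}

subsets = sample_subsets(7465)  # the echocardiogram training set size used above
print({f: len(idx) for f, idx in subsets.items()})  # {0.25: 1866, 0.5: 3732, 1.0: 7465}
```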
We also wished to assess the benefit of using supervised learning for pre-training, as
opposed to self-supervised learning. The latter was demonstrated as a powerful
approach in different visual tasks66, specifically, in the medical-imaging domain where
procuring annotations is laborious and expensive7,17,19,20. We thus sought to explore the
utility of a self-supervised pre-training approach for SLIViT using an unlabeled
version of the 2D OCT B-scans dataset (Figures S5 and S6). To this end, we took the
REMEDIS approach7 that was originally shown to obtain remarkable performance when
pre-trained even on much smaller (unlabeled) datasets than our 2D OCT B-scans
dataset. Yet, initializing SLIViT with the fully supervised pre-trained weights significantly
outperformed the self-supervised initialization in all downstream learning tasks (paired
t-test p-value<0.001).
We first demonstrated SLIViT’s superiority when trained on less than 700 volumes in four
independent binary classification learning tasks of retinal-disease risk factors with two
independent 3D OCT datasets. Then we showed SLIViT’s superiority in two heart-function
analysis tasks, both performed on an echocardiogram dataset. We next tested
SLIViT on an MRI dataset of 3D liver scans labeled with a corresponding hepatic fat
content measurement and, again, observed a significant improvement over the
state of the art. We also showed that SLIViT obtained performance on par with
clinical specialists’ assessments while being almost four orders of magnitude faster
than the net annotation time required by the specialists. Lastly, we
explored the robustness of SLIViT’s learning to randomly permuted volumes. We
showed that a shuffled-volumes dataset, a recurring situation in the very
limited number of publicly available volumetric datasets, has little to no effect on
SLIViT’s performance, meaning that SLIViT is potentially agnostic to the imaging protocol.
To facilitate reproducibility, generalizability, and the likelihood that other researchers will
be able to successfully apply SLIViT to their datasets, we intentionally avoided complex
hyperparameter tuning and the usage of specialized hardware for training as required
by other methods (e.g., 19). The sizes of the different architectures we used were set
according to our available (standard) computational resources, and other
hyperparameters were set to default values. This suggests that there is room for further
improvement in task-specific performance. Yet, in its current form, SLIViT can serve as
a reliable baseline model for any study of volumetric biomedical imaging. We believe
that SLIViT’s simplicity is one of its major strengths.
The utility of self-supervised pre-training has been validated in numerous medical
imaging learning tasks7,19,20,67,68; however, its general translatability across domains
remains unclear22. According to our study, where a large-enough 2D labeled dataset is
accessible and limited labeled volumes are available, the supervised pre-training
approach is superior. This finding was supported by our experiments for fine-tuning both
in the same domain and across domains. That being said, as demonstrated, SLIViT’s
pre-training strategy is very flexible and can thus harness the utility of self-supervised
approaches, such as REMEDIS. If one has access to another unlabeled dataset of
relevant medical images (whether 2D or 3D), then self-supervised pre-training of SLIViT,
either as an alternative to or combined with the supervised 2D OCT B-scan
pre-training, may further improve the model’s performance. Notably, the end-to-end
fine-tuning approach SLIViT takes (see Methods) was shown to attain typically better
performance for self-supervised-based medical-imaging learning tasks22. That is, SLIViT
already employs an optimized fine-tuning approach for a potential
self-supervised-based avenue.
SLIViT was tested on 3D OCT scans, echocardiograms, and MRI volumes and can
potentially be leveraged to analyze other types of data modalities, such as 3D CT scans
and 3D X-ray imaging. Such volumetric biomedical imaging data are inherently structured
in the sense that they involve a limited assortment of objects and movements (typically
shrinkage, dilation, and shivering). SLIViT is specifically tailored to be adept at
analyzing a series of biomedical frames created in a structured biomedical-imaging
process and is not designed to be proficient at learning problems involving natural videos,
such as action recognition tasks. Natural videos are inherently more complex, as the
background may change, and objects may flip, change color (due to shading), and even
disappear (due to occlusion), let alone when considering a multi-scene video. In
addition, there is a plethora of gigantic natural video datasets that allow standard
3D-based vision models to be adequately tuned for natural video learning tasks. We
thus do not expect SLIViT to outperform (as is) standard 3D-based vision models in
natural-videos-learning tasks (such as action recognition). That being said, SLIViT could
potentially be tweaked to perform well on natural videos as well, e.g., using a different
feature-map extractor, however, this direction requires further research.
Importantly, there are multiple additional steps that are required in order to deploy
SLIViT in a clinical setting. Notably, the point of operation (the tradeoff between precision
and recall) is application-specific, and further optimization may be required to obtain
optimal results at that point of operation. We note that the point of operation also varies
across clinicians (see Figures 5 and S3). Moreover, additional evaluations of the models
are required to ensure no systematic biases exist that would lead to increasing health
disparities69.
Overall, this study highlights an important step toward fully automating
volumetric-biomedical-imaging annotation. The major leap occurs under ‘real-life’
settings with small training datasets. SLIViT thrives given just hundreds of training
samples for some tasks, giving it a substantial advantage over other 3D-based methods
in almost every practical case related to 3D biomedical-imaging annotation. Even
under the unrealistic assumption of unlimited financial resources, the hurdle of a
limited-size training dataset is inevitable in ongoing research.
Once a previously unknown disease-related risk factor is found and characterized, it
could take months to train a specialist to accurately annotate this
recently discovered risk factor in biomedical images at scale. However, using a
relatively small training dataset (that can be annotated within only a few working days by
a single trained clinician), SLIViT could dramatically expedite the annotation of
many other non-annotated volumes with performance on par with that of a clinical
specialist.
Methods
Model specifications
The SLIViT framework contains a preprocessing step, a 2D ConvNeXt that serves as a
feature-map extractor, and a vision transformer (ViT) that serves as a feature-map
integrator (see Fig. 1). The ConvNeXt architecture comes in several variants of different
complexity39. Here we used the backbone of the tiny variant (ConvNeXt-T) with a 256×256
image size as SLIViT’s feature-map extractor. The ViT-based feature-map integrator
underwent a few adjustments with respect to the original architecture40, including using
GELU as the activation function73 and initializing the positional embeddings with the index
of the original slice. Notably, we intentionally avoided complex hyperparameter tuning and the usage
of specialized hardware as required by other methods19. The ConvNeXt’s variant (T)
and the ViT’s depth (# layers = 5) were set according to our available (standard)
computational resources to facilitate reproducibility, generalizability, and the likelihood
that other researchers will be able to successfully apply it to their datasets. The ViT’s
width is governed by the number of 2D frames of the input volume.
Let N be the number of 2D frames of size H×W in an input volume. Given an input
W×H×N volume, its N frames are resized (according to the ConvNeXt-T variant) and tiled into an
image of size (N·256)×256 (see Step (1) in Fig. 1). The manipulated image is then fed
into the feature-map extractor, which generates an (N·8)×8 feature map with
F = 768 filters. This feature map is then reshaped into N separate 8×8×768
feature maps (see Step (3) in Fig. 1), each corresponding to a slice in the original
volume. Each feature map is flattened into a 1D vector of length 8·8·768 and
tokenized into a vector of size 768 using a fully connected (FC) layer. The bias term of
the FC layer is initialized as the feature-map number (which essentially corresponds to an
original slice number), and the projected feature tokens are then fed into the ViT
(along with a class token of the same size). The ViT outputs N encoded values and a
class token. The class token is then fed into another FC layer to generate the final output.
Using the 2D ViT as a feature-map integrator corresponds with the Factorised Encoder
with ‘late fusion of depth information’ of the previously devised 3D ViT named ViViT25,
yet, is far less complex than the 3D ViT.
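The forward pass described above can be summarized in the following schematic PyTorch sketch. It is a simplification rather than the released implementation: the backbone is any 2D feature extractor returning a (B, 768, N·8, 8) map, the positional information is approximated with trainable per-slice embeddings initialized to slice indices, and the head, number of attention heads, and output size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SLIViTSketch(nn.Module):
    """Schematic forward pass: tiled 2D backbone -> per-slice tokens -> small ViT."""

    def __init__(self, backbone: nn.Module, n_slices: int,
                 feat_hw: int = 8, feat_dim: int = 768, depth: int = 5):
        super().__init__()
        self.backbone = backbone                       # 2D ConvNeXt-T-like feature extractor
        self.n_slices, self.feat_hw, self.feat_dim = n_slices, feat_hw, feat_dim
        self.tokenize = nn.Linear(feat_hw * feat_hw * feat_dim, feat_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        # Trainable positional embeddings, one per slice token plus the class token,
        # initialized with the slice index (a rough reading of the paper's description).
        self.pos_emb = nn.Parameter(
            torch.arange(n_slices + 1, dtype=torch.float)[None, :, None]
            .expand(1, n_slices + 1, feat_dim).clone())
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           activation="gelu", batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(feat_dim, 1)             # task-specific output

    def forward(self, elongated: torch.Tensor) -> torch.Tensor:
        # elongated: (B, 3, N*256, 256) -> backbone feature map (B, 768, N*8, 8)
        feats = self.backbone(elongated)
        b = feats.shape[0]
        # Split the elongated feature map back into N per-slice maps and flatten each one.
        feats = feats.reshape(b, self.feat_dim, self.n_slices, self.feat_hw, self.feat_hw)
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, self.n_slices, -1)
        tokens = self.tokenize(feats)                                  # (B, N, 768)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        encoded = self.vit(tokens + self.pos_emb)
        return self.head(encoded[:, 0])                                # class-token readout
```

A ConvNeXt-T backbone with its classification head removed would play the role of `backbone`, and the elongated input would be produced by a tiling step such as the one sketched in the Results section.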
Pre-training
We borrowed an ImageNet-1K pre-trained SLIViT-like feature-map extractor
architecture, i.e., a ConvNeXt-T backbone, from
https://huggingface.co/facebook/convnext-tiny-224, and appended to it a subsequent
FC layer to fit a four-category classification task. We then trained this
SLIViT-backbone-like module on the publicly available labeled Kermany Dataset41,74.
Training the feature-map extractor on the Kermany Dataset took less than 12 hours
using a single NVIDIA Tesla V100 Volta GPU Accelerator 32GB Graphics Card. Several
sets of pre-trained weights were examined in this study (see The utility of 2D B-scan
OCT in pre-training section). The pre-trained backbone weights obtained from
combining ImageNet initialization with additional pre-training on the Kermany Dataset
(henceforth “combined weights”), which typically led to the best performance, are
available at the project’s GitHub repository (see the Code availability section).
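The pre-training step can be approximated with the HuggingFace transformers library as in the following sketch; the optimizer, learning rate, and training loop are assumptions, and only the checkpoint name and the four-category head follow the text.

```python
import torch
from transformers import ConvNextForImageClassification

# Load an ImageNet-1K pre-trained ConvNeXt-T and replace its head with a
# four-category classifier (the Kermany Dataset has four diagnostic classes).
model = ConvNextForImageClassification.from_pretrained(
    "facebook/convnext-tiny-224",
    num_labels=4,
    ignore_mismatched_sizes=True,  # the original head has 1,000 ImageNet classes
)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised pre-training step on a batch of 2D OCT B-scans."""
    optimizer.zero_grad()
    logits = model(pixel_values=pixel_values).logits
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```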
Per-task fine-tuning
Each of the SLIViT models used in the different experiments reported here was
initialized with the combined weights. The fine-tuning was done in an end-to-end
fashion22. Namely, rather than merely training the downstream feature-map integrator,
while keeping the feature-map extractor frozen, all the model’s parameters were set as
trainable, and were then fine-tuned (according to the dataset and task in question).
Notably, we intentionally avoided complex hyperparameter tuning as required by some
other methods (e.g., 19) to facilitate reproducibility and generalizability. Frames were
resized into 256×256 pixels to fit SLIViT’s backbone architecture and then, standard
preprocessing transformations were applied (including contrast stretching, random
horizontal flipping, and random resize cropping) using PyTorch’s default values. Binary
cross entropy and 𝐿1 norm were used as loss functions for the classification and
regression tasks, respectively. In each experiment, excluding the ultrasound (in which
the split was given), a random validation set was used for determining the convergence
of the training process with the same loss function metric used for the test set
evaluation. The model was optimized using the default fast.ai optimizer with the default
parameters. The starting learning rate in each training procedure was chosen by
fast.ai’s learning rate finder and the model was fitted using the fit-one-cycle approach
for faster convergence75,76. All models were trained with four samples per batch and
early stopping was set to five epochs, meaning that the training process continued until
no improvement was observed in the validation loss for five consecutive passes on the
whole training set. The model weights that achieved the lowest loss on the validation set
during training were used for the test set evaluation. Weights & Biases77 was used for
experiment tracking and visualizations of the training procedures.
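A minimal fast.ai sketch of this fine-tuning recipe is given below; `dls` and `model` are placeholders the reader would construct (with a batch size of 4, per the text), and the maximum epoch count is an assumption since training is halted by early stopping.

```python
from fastai.vision.all import Learner, EarlyStoppingCallback, SaveModelCallback
import torch.nn as nn

def finetune(dls, model, regression: bool = False, max_epochs: int = 100):
    """End-to-end fine-tuning mirroring the text: one-cycle schedule, default
    fast.ai optimizer, early stopping after 5 epochs without validation-loss
    improvement, and the best-validation weights kept for test evaluation."""
    learn = Learner(
        dls, model,
        loss_func=nn.L1Loss() if regression else nn.BCEWithLogitsLoss(),
        cbs=[EarlyStoppingCallback(monitor="valid_loss", patience=5),
             SaveModelCallback(monitor="valid_loss")],  # keep lowest-validation-loss weights
    )
    lr = learn.lr_find().valley          # fast.ai's learning-rate finder
    learn.fit_one_cycle(max_epochs, lr)  # fit-one-cycle for faster convergence
    return learn
```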
Statistical analysis
The performance of each trained model was evaluated (on the corresponding test set)
using an appropriate metric score. The binary classification tasks were evaluated using
the area under the ROC and PR curves. The regression tasks were evaluated using the R²
metric. The test set predictions were calculated and a 90% confidence interval (CI) was
computed for each evaluated score using a standard bootstrapping procedure with
1,000 iterations as done in other studies17,78. Briefly, let n denote the test set size; for
each bootstrap iteration, n samples were randomly drawn (with replacement), and a single
score was computed based on the predictions of the sampled set. From the resulting
distribution of 1,000 sampled-set scores, the 50th and 950th ranked scores were selected to
obtain the 90% CI. In order to compute the significance of the difference between
two given score distributions (induced by two different models), a paired t-test was
computed on the distribution of differences between the corresponding sampled-set
scores (H_A: µ ≠ 0). SLIViT's performance improvement was considered to be
significant if the paired t-test produced a p-value lower than 1e-3 subject to Bonferroni
correction for multiple hypothesis testing.
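The bootstrap and significance procedure can be sketched as follows; the ROC AUC metric is used for illustration, the inputs are assumed to be NumPy arrays, and each resample is assumed to contain both classes.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def bootstrap_scores(y_true, preds, n_boot=1000, seed=0):
    """Score n_boot bootstrap resamples of the test set (drawn with replacement).
    The same resampled indices are used for every model so the scores stay paired."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = {name: [] for name in preds}
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)               # n draws with replacement
        for name, p in preds.items():             # assumes both classes appear in idx
            scores[name].append(roc_auc_score(y_true[idx], p[idx]))
    return {name: np.array(s) for name, s in scores.items()}

def ci90(scores):
    """90% CI from the 50th and 950th ranked scores of 1,000 bootstrap iterations."""
    ranked = np.sort(scores)
    return ranked[49], ranked[949]

def paired_pvalue(scores_a, scores_b):
    """Paired t-test on the per-resample score differences (H_A: mean difference != 0)."""
    return stats.ttest_rel(scores_a, scores_b).pvalue
```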
Datasets
The positive-label frequencies in this dataset were 3.37%, 7.87%, 2.0%, and 2.67%, for
DV, IHRF, SDD, and hDC, respectively. Although the annotations for this dataset
included the eye laterality, the scans themselves lacked laterality information, obscuring
the link between a scan and its annotation when both of a patient's eyes were scanned. To
address this gap, we considered the middle slice of each volume to determine the laterality
and trained a standard CNN on the Houston Dataset (which had the eye laterality
recorded). Using the trained network (97% accuracy on an external test set; not shown),
we inferred the laterality for the SLIVER-net dataset scans when needed, that is, when
both eyes of the same patient were scanned.
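A minimal sketch of such a laterality classifier is shown below; ResNet-18 and the label convention are illustrative assumptions, as the text only specifies a standard CNN applied to the middle slice.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Binary laterality classifier (left eye vs. right eye) trained on the middle
# B-scan of each volume; ResNet-18 is an illustrative choice of "standard CNN".
model = resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)
model.eval()  # inference mode (after training on the Houston Dataset)

def middle_slice(volume: torch.Tensor) -> torch.Tensor:
    """Pick the central B-scan of an (N, H, W) volume and replicate it to 3 channels."""
    mid = volume[volume.shape[0] // 2]            # (H, W)
    return mid.unsqueeze(0).repeat(3, 1, 1)       # (3, H, W)

def predict_laterality(volume: torch.Tensor) -> int:
    """0 = left eye, 1 = right eye (label convention is illustrative)."""
    with torch.no_grad():
        logits = model(middle_slice(volume).unsqueeze(0))
    return int(logits.argmax(dim=1))
```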
Code availability
The code of SLIViT is available at the project’s GitHub repository:
https://github.com/berkindurmus/SLIViT.
Data availability
The Kermany dataset was downloaded from
https://www.kaggle.com/datasets/paultimothymooney/kermany2018. The 3D OCT
B-scan data are not publicly available due to institutional data use policy and concerns
about patient privacy. However, they are available from the authors upon reasonable
request and with permission of the institutional review board. The echocardiogram
dataset was downloaded from https://echonet.github.io/dynamic/index.html#dataset.
The MRI dataset was downloaded from https://www.ukbiobank.ac.uk under application
number 33127.
Acknowledgments
This work was supported by NIH/NEI grants RO1EY023164 and 1R01EY030614 and
an Unrestricted Grant from Research to Prevent Blindness, Inc. This research was
conducted using the UK Biobank Resource under application #33127.
Ethics declarations
E.H. has an affiliation with Optum.
Figures
Figure 1 | The proposed SLIViT framework
The input of SLIViT is a 3D volume of N frames of size HxW. (1) The frames of the
volume are resized and vertically tiled into an “elongated image”. (2) The elongated
image is fed into a ConvNeXt-based Feature Extractor that was pre-trained on both
natural and medical 2D labeled images. (3) N feature maps are extracted (each
corresponding to an original frame). (4) Feature maps are (tokenized and) fed into a
ViT-based Feature Integrator followed by a fully-connected layer that outputs the
prediction for the task in question.
Figure 2 | SLIViT’s outperformance overview
Shown are the performance scores in one classification task (with two different metrics)
of eye disease biomarker diagnosis in volumetric-OCT scans and two regression tasks
of (1) heart function analysis in ultrasound videos and (2) liver fat levels imputation in
volumetric MRI scans. Domain-specific methods (hatched) used are SLIVER-net,
EchoNet, and 3D ResNet, for OCT, ultrasound, and MRI, respectively. The general
cross-modality benchmarks used are 3D ResNet (green) and UniMiSS (brown), which
are (fully) supervised and self-supervised-based, respectively (see the relevant
experiment's section for additional benchmarking). Box plot whiskers represent a 90%
CI.
Figure 3 | ROC AUC performance comparison of five models in four independent
AMD-biomarker classification tasks when trained on less than 700 OCT volumes
Shown are the ROC AUC scores of SLIViT (blue), SLIVER-net (orange), 3D ResNet
(green), 3D ViT (red), and UniMiSS (brown) on four single-task classification problems
of AMD high-risk factors in two independent volumetric-OCT datasets. The expected
performance of a naive classifier is 0.5. The left panel shows the performance when
trained and tested on the Houston Dataset. The right panel shows the performance
when trained on the Houston Dataset and tested on the SLIVER-net Dataset (see Table
S1A). Box plot whiskers represent a 90% CI.
Figure 4 | Performance comparison on cardiac function prediction tasks using echocardiograms
Figure 5 | SLIViT’s ROC performance compared to junior clinical retina specialists’ assessment
Shown are the ROC curves (blue) of SLIViT trained to predict four AMD high-risk
biomarkers (DV, IHRF, SDD, and hDC; see main text) using less than 700 OCT volumes
(Houston Dataset) and tested on an independent dataset (Pasadena Dataset). The
light-blue shaded area represents a 90% CI for SLIViT’s performance. The red dot
represents the specialists’ average performance. The green asterisks correspond to the
retina specialists’ assessments. Two of the clinical specialists obtained the exact same
performance score for IHRF classification.
Supplementary Material
Figure S1 | PR-AUC performance comparison of five models in four independent
AMD-biomarker classification tasks when trained on less than 700 OCT volumes
Shown are the PR AUC scores as an alternative scoring metric for the experiment
shown in Figure 3. The dashed lines represent the corresponding biomarker’s
positive-label prevalence, which is the expected PR AUC score of a naive classifier. The
left panel shows the performance when trained and tested on the Houston Dataset. The
right panel shows the performance when trained on the Houston Dataset and tested on
the SLIVER-net Dataset (see Table S1B). Box plot whiskers represent a 90% CI.
Figure S2 | Performance comparison of a cardiomyopathy binary classification task on
echocardiograms
Shown are the PR curves yielded by modeling SLIViT (blue) and 3D ResNet (green) to
classify cardiomyopathy. The shaded areas represent a 90% CI.
Figure S3 | SLIViT’s PR performance compared to junior clinical retina specialists’
assessment
Shown are the PR curves (blue) of SLIViT trained to predict four AMD high-risk
biomarkers (DV, IHRF, SDD, and hDC; see main text) using less than 700 OCT volumes
(Houston Dataset) and tested on an independent dataset (Pasadena Dataset). The
light-blue shaded area represents a 90% CI for SLIViT’s performance. The red dot
represents the specialists’ average performance. The green asterisks correspond to the
retina specialists’ assessments. Two of the clinical specialists obtained the exact same
performance score for IHRF classification.
Figure S4 | SLIViT’s performance in a volumetric-OCT frame-permutation experiment
Shown is the ROC AUC scores distribution of 100 shuffled models (light blue) trained on
100 different (shuffled) copies of a volumetric-OCT dataset. The expected performance
of a naive classifier is 0.5. Box plot whiskers extend to the 5th and the 95th percentiles
of the 100 shuffled models’ performance distribution. The dashed blue line represents
the performance of a SLIViT model trained on the volumetric-OCT dataset using the
original order of each volume. The performance ranks of this latter model compared to
the former models’ distribution were 22, 34, 56, and 47 for DV, IHRF, SDD, and hDC,
respectively.
Figure S5 | Pre-training ablation study for (volumetric) OCT-related downstream learning
tasks
Shown are the ROC (left) and PR (right) AUC scores across different fine-tuned models
for volumetric-OCT classification tasks initialized with five different sets of pre-trained
weights. The expected ROC AUC score of a naive classifier is 0.5. Combined, the
proposed SLIViT’s initialization, is ImageNet weights initialization followed by supervised
pre-training on the Kermany Dataset. ssCombined is an ImageNet weights initialization
followed by self-supervised pre-training on an unlabeled version of the Kermany
Dataset. The dashed lines represent the corresponding biomarker’s positive-label
prevalence, which is the expected PR AUC score of a naive classifier. Box plot whiskers
represent a 90% CI.
Figure S6 | Pre-training ablation study for (volumetric) non-OCT-related downstream
learning tasks
Shown are the R² scores for the volumetric ultrasound and MRI regression tasks
initialized with five different sets of pre-trained weights. Combined, the proposed
SLIViT’s initialization, is ImageNet weights initialization followed by supervised
pre-training on the Kermany Dataset. ssCombined is an ImageNet weights initialization
followed by self-supervised pre-training on an unlabeled version of the Kermany
dataset. Box plot whiskers represent a 90% CI.
Table S1 | Average classification performance scores of SLIViT, SLIVER-net, 3D
ResNet, 3D ViT, and UniMiSS trained on less than 700 OCT volumes
Shown are the performance raw numbers underlying Fig. 3 (ROC AUC) and Fig. S1
(PR AUC) of the AMD high-risk biomarker prediction experiments. The numbers in the
square brackets represent the corresponding 90% CI.
Epithelial Pigment and Outer Retinal Atrophy Using Machine Learning. Ophthalmol
2. Wong, T. Y., Liew, G. & Mitchell, P. Clinical update: new treatments for age-related
4. Bloom, M. W. et al. Heart failure with reduced ejection fraction. Nat Rev Dis Primers
3, 17058 (2017).
6. Mann, R. M., Kuhl, C. K. & Moy, L. Contrast-enhanced MRI for breast cancer
[cs.NE] (2015).
10. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29
(2019).
11. Zhuang, F. et al. A Comprehensive Survey on Transfer Learning. Proc. IEEE 109,
43–76 (2021).
ambulatory electrocardiograms using a deep neural network. Nat. Med. 25, 65–69
(2019).
14. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: A retrospective
e1002686 (2018).
15. Gulshan, V. et al. Development and Validation of a Deep Learning Algorithm for
2402–2410 (2016).
16. Deng, J. et al. ImageNet: A large-scale hierarchical image database. in 2009 IEEE
17. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray
18. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D. & Langlotz, C. P. Contrastive
Learning of Medical Visual Representations from Paired Images and Text. arXiv
[cs.CV] (2020).
19. Xie, Y., Zhang, J., Xia, Y. & Wu, Q. UniMiSS: Universal Medical Self-Supervised
20. Azizi, S. et al. Big Self-Supervised Models Advance Medical Image Classification.
arXiv [eess.IV] (2021).
21. Wu, Z. et al. OCT Signs of Early Atrophy in Age-Related Macular Degeneration:
22. Huang, S.-C. et al. Self-supervised learning for medical image classification: a
23. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic
25. Arnab, A. et al. ViViT: A Video Vision Transformer. arXiv [cs.CV] (2021).
26. Zhu, H., Chen, B. & Yang, C. Understanding Why ViT Trains Badly on Small
28. Ghorbani, A. et al. Deep learning interpretation of echocardiograms. NPJ Digit Med
3, 10 (2020).
29. Gupta, U. et al. Transferring Models Trained on Natural Images to 3D MRI via
30. Witowski, J. et al. Improving breast cancer diagnostics with deep learning for MRI.
31. Yang, M., Huang, X., Huang, L. & Cai, G. Diagnosis of Parkinson’s disease based
on 3D ResNet: The frontal lobe is crucial. Biomed. Signal Process. Control 85,
104904 (2023).
33. Turnbull, R. Using a 3D ResNet for Detecting the Presence and Severity of
35. Zhou, H.-Y., Lu, C., Yang, S., Han, X. & Yu, Y. Preservational Learning improves
36. Xie, Y., Zhang, J., Liao, Z., Xia, Y. & Shen, C. PGL: Prior-Guided Local
(2020).
37. Chen, X., Fan, H., Girshick, R. & He, K. Improved Baselines with Momentum
38. Chen, X., Xie, S. & He, K. An Empirical Study of Training Self-Supervised Vision
39. Liu, Z. et al. A ConvNet for the 2020s. arXiv [cs.CV] (2022).
40. Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image
42. Gupta, U., Lam, P. K., Ver Steeg, G. & Thompson, P. M. Improved Brain Age
43. Schlemper, J. et al. Attention gated networks: Learning to leverage salient regions
44. Bertasius, G., Wang, H. & Torresani, L. Is Space-Time Attention All You Need for
45. Neimark, D., Bar, O., Zohar, M. & Asselmann, D. Video Transformer Network. arXiv
[cs.CV] (2021).
disease burden projection for 2020 and 2040: a systematic review and
47. Hirabayashi, K. et al. OCT Risk Factors for Development of Atrophy in Eyes with
(2023).
48. Ouyang, D. et al. EchoNet-dynamic: A large new cardiac motion video data
https://echonet.github.io/dynamic/NeuroIPS_2019_ML4H%20Workshop_Paper.pdf.
49. Ziaeian, B. & Fonarow, G. C. Epidemiology and aetiology of heart failure. Nat. Rev.
50. Klapholz, M. et al. Hospitalization for heart failure in the presence of a normal left
ventricular ejection fraction: results of the New York Heart Failure Registry. J. Am.
52. Idilman, I. S. et al. Hepatic steatosis: quantification by proton density fat fraction
Parameter for Liver Fat Assessment Using MRI Proton Density Fat Fraction as the
75–82 (2022).
56. Kühn, J.-P. et al. Pancreatic Steatosis Demonstrated at MR Imaging in the General
Nonalcoholic Fatty Liver Disease and Normal Controls. Gastroenterol. Res. Pract.
58. Trout, A. T. et al. Relationship between abdominal fat stores and liver fat,
non-alcoholic fatty liver disease. Abdom Radiol (NY) 44, 3107–3114 (2019).
59. Covarrubias, Y. et al. Pilot study on longitudinal change in pancreatic proton density
fat fraction during a weight-loss surgery program in adults with obesity. J. Magn.
60. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural
61. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language
foundation model for pathology image analysis using medical Twitter. Nat. Med.
(2023) doi:10.1038/s41591-023-02504-3.
62. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat.
63. Guan, H., Wang, L., Yao, D., Bozoki, A. & Liu, M. Learning Transferable 3D-CNN
2021).
64. Mustafa, B. et al. Supervised Transfer Learning at Scale for Medical Imaging. arXiv
[cs.CV] (2021).
65. Raghu, M., Zhang, C., Kleinberg, J. & Bengio, S. Transfusion: Understanding
66. Newell, A. & Deng, J. How Useful is Self-Supervised Pretraining for Visual Tasks?
67. Taleb, A. et al. 3D Self-Supervised Methods for Medical Imaging. arXiv [cs.CV]
(2020).
71. Howard, J. & Gugger, S. fastai: A Layered API for Deep Learning. arXiv [cs.LG]
(2020).
72. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. arXiv [cs.LG] (2012).
73. Hendrycks, D. & Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv [cs.LG]
(2016).
75. Smith, L. N. Cyclical Learning Rates for Training Neural Networks. arXiv [cs.CV]
(2015).
78. Rajkomar, A. et al. Scalable and accurate deep learning with electronic health
79. Ferris, F. L., 3rd et al. Clinical classification of age-related macular degeneration.
80. Nassisi, M. et al. OCT Risk Factors for Development of Late Age-Related Macular
81. Lei, J., Balasubramanian, S., Abdelfattah, N. S., Nittala, M. G. & Sadda, S. R.
Proposal of a simple optical coherence tomography-based scoring system for
82. Nittala, M. G. et al. AMISH EYE STUDY: Baseline Spectral Domain Optical