DeepAAA: Clinically Applicable and Generalizable Detection of Abdominal Aortic Aneurysm Using Deep Learning
Tenenholtz1[0000−0003−1250−3716]
1 MGH and BWH Center for Clinical Data Science
2 Massachusetts General Hospital (MGH)
3 Brigham and Women’s Hospital (BWH)
4 Nuance Communications Inc.
5 work done while affiliated with 1
6 work done while affiliated with 4
7 corresponding author [email protected]
1 Introduction
The primary dataset was used for the training and initial validation of the model
and contained 321 studies (223 unique patients). These were selected based on
a keyword search of study reports ensuring a mixture of positive and negative
DeepAAA (Arxiv version, accepted MICCAI 2019) 3
cases of AAA. The query was biased to largely include studies captured between
2005 and 2007. Of the studies selected, there were 217 (67.6 %) males and 104
(32.4 %) females with a mean age of 70.3 years; 153 (47.7 %) CT scans with
contrast and 168 (52.3 %) without; 247 (76.9 %) studies with AAA present and
74 (23.1 %) without AAA. For each study, the axial series was used for aorta
segmentation and AAA detection. Slice thickness of the images ranged from 2
to 10 mm, while the number of images for each series varied from 40 to 384.
To generate a ground-truth aortic segmentation, the abdominal aorta was
manually contoured on the axial scans slice-by-slice until the aortic bifurcation.
Each study was annotated by 1 to 4 CT technologists under supervision of
2 radiologists. Based on the clinical definition [2], the presence of AAA was
determined by applying a 3.0 cm threshold to the maximum aortic diameter as
defined by the manual segmentations.
As many exams were annotated by multiple annotators, a partial assess-
ment of inter-rater variability was possible. Of the 153 contrast studies, 124
were annotated by at least 2 independent technologists, leading to 517 pairwise
comparisons. The non-contrast data, however, contained only 10 studies where
more than one segmentation was performed, resulting in only 16 pairwise com-
parisons. The average inter-rater Dice on contrast series was 0.95 ± 0.03, while
on non-contrast series, it was 0.90 ± 0.08. Given the small number of samples, the
inter-rater variability on non-contrast data should not be considered definitive
but suggests roughly similar levels of agreement. For the subsequent analysis,
one reference segmentation per dataset was selected randomly as ground truth.
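The pairwise inter-rater comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `dice` and `pairwise_dice` are hypothetical, the masks are toy arrays, and treating two empty masks as perfect agreement is an assumption.

```python
from itertools import combinations

import numpy as np


def dice(mask_a, mask_b):
    """Dice overlap between two binary masks of identical shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total


def pairwise_dice(masks):
    """Dice score for every unordered pair of annotations of one study."""
    return [dice(a, b) for a, b in combinations(masks, 2)]


# Toy study with three annotators (made-up masks, 4 slices of 8 x 8 pixels).
m1 = np.zeros((4, 8, 8), dtype=bool)
m1[:, 2:6, 2:6] = True
m2 = np.zeros((4, 8, 8), dtype=bool)
m2[:, 2:6, 3:7] = True  # same size, shifted by one pixel
m3 = m1.copy()
scores = pairwise_dice([m1, m2, m3])
```

A study with k annotators contributes k(k − 1)/2 comparisons, which is why the number of pairwise comparisons can exceed the number of studies.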
tient demographics. All of these factors may vary significantly over time at a
single site, and thus, we selected 57 studies (57 unique patients) predominantly
captured between 2012 and 2016 for this dataset. The studies were selected to in-
clude a mixture of positive and negative cases of AAA through keyword search of
study reports. All negative studies were manually verified to not contain an AAA.
To assess the model against radiologist-reported ground truth and validate the
post-processing stages which generate the AAA measurement, the maximum aortic
diameter and presence of AAA were sourced from radiology reporting rather than
being derived from manual segmentations (as was done for the primary dataset).
3 Methods
We achieve AAA detection via two sequential steps: (1) aorta segmentation and (2)
aorta contour fitting for the estimation of the largest cross-sectional diameter.
For abdominal aortic segmentation, we developed a variant of a 3D U-Net [1]
which accepts series with varying numbers of images. As discussed in Section 2,
our dataset contained a wide distribution of image counts and slice thicknesses
as abdominal studies may also cover other regions of the body, including the
pelvis or thorax. It is thus essential to develop an algorithm that adapts to
variability along the axial dimension. The 3D U-Net architecture we used contained
4 down/upsampling modules (plus the bottleneck layer), 2 convolutional layers
per module, and 32 initial features in the network. The convolutional kernel size
was 3 × 3 × 3 in both the downsampling and upsampling path, while the 3D
pooling kernels were 2 × 2 × 1 to preserve image count. Batch normalization was
applied before each ReLU activation, and dropout regularization was utilized at
the bottleneck layer with a dropout rate of 0.2. A 1 × 1 × 1 convolutional layer
with softmax activation over two classes (background and aorta) was applied at
the output layer and thresholded at 0.5 to generate the binary aorta mask.
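To illustrate why 2 × 2 × 1 pooling preserves the image count, the following sketch traces the feature-map shape through the encoder. It is only a shape calculation, not the network itself. The function name `unet_level_shapes` is hypothetical, the 512 × 512 in-plane size is an assumed typical CT matrix, and doubling the feature count at each level is the conventional U-Net choice, which the text does not state explicitly.

```python
def unet_level_shapes(in_plane, n_images, levels=4, base_features=32):
    """Return (height, width, image_count, features) at each encoder level.

    Pooling is 2 x 2 x 1: the in-plane size halves at every level while the
    number of images along the axial dimension is left untouched.
    Assumption: features double per level (conventional U-Net).
    """
    shapes = []
    size, features = in_plane, base_features
    for _ in range(levels + 1):  # 4 downsampling modules plus the bottleneck
        shapes.append((size, size, n_images, features))
        size //= 2
        features *= 2
    return shapes


# A 40-image series and a 384-image series keep their image counts throughout.
short_series = unet_level_shapes(512, 40)
long_series = unet_level_shapes(512, 384)
```

Because the axial dimension is never pooled, no constraint such as divisibility by 16 is placed on the number of images in a series.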
The model was trained with the RMSprop optimizer using a learning rate
of 0.0001. Weights selected for evaluation were those that minimized the loss on
the validation set, which were not in general the last epoch weights. The loss
function was a smoothed negative Dice coefficient:
\[
D = -\frac{2\sum_{i=1}^{N} p_i g_i + 1}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + 1} \tag{1}
\]
similar, but not identical, to that used in [8]. The summation is over all N
voxels in a scan, p_i is the predicted aorta probability and g_i is the ground truth
classification for voxel i. The additional ones in the numerator and denominator
avoid division by zero and yield a perfect score for a correct, empty segmentation.
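As a worked illustration of the loss in Eq. (1), the sketch below evaluates it with NumPy on toy arrays. The function name `smoothed_dice_loss` and the `smooth` parameter are hypothetical; the actual training code used Keras and is not shown in the paper.

```python
import numpy as np


def smoothed_dice_loss(p, g, smooth=1.0):
    """Negative smoothed Dice coefficient of Eq. (1).

    p: predicted aorta probabilities, g: ground-truth labels (0 or 1).
    Both are flattened so the sums run over all voxels of the scan.
    """
    p = np.asarray(p, dtype=float).ravel()
    g = np.asarray(g, dtype=float).ravel()
    return -(2.0 * np.sum(p * g) + smooth) / (np.sum(p) + np.sum(g) + smooth)


empty = smoothed_dice_loss(np.zeros(8), np.zeros(8))        # correct, empty segmentation
perfect = smoothed_dice_loss([1, 1, 0, 0], [1, 1, 0, 0])    # exact match
miss = smoothed_dice_loss([1, 1, 0, 0], [0, 0, 1, 1])       # no overlap
```

The first case shows the role of the added ones: with no aorta in either mask the loss is −1, the same value as an exact match, rather than an undefined 0/0.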
In order to build a general AAA detector that worked with both contrast and
non-contrast CT scans, we mixed both types of CT images for model training.
All the experiments were implemented utilizing the Keras deep learning library
with the Tensorflow backend on NVIDIA DGX-1 Volta.
After aorta segmentation, we applied ellipse fitting [4] image-by-image to the
contours of the aorta. The largest aortic diameters (d) were thus assigned by the
long axis of the ellipses. For the regions where the aorta was not parallel to the
axial CT scans, angle correction was applied to retrieve the true aorta diameter,
i.e. d cos θ, where θ was the angle between the secant plane of the aorta and the
axial scan. Based on the definition of AAA, predicted positives were the studies
where the largest diameter of the aorta segment was greater than 3 cm. We then
compared the predicted results with the ground truth annotations.
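The post-processing described above can be sketched as follows, under several assumptions. The ellipse fit uses the numerically stable Halir-Flusser formulation of the direct least-squares method in [4], which may differ from the implementation actually used. Contour coordinates are assumed to be already in millimetres, and the function names `fit_ellipse_axes`, `angle_corrected_diameter`, and `is_aaa` are hypothetical.

```python
import numpy as np


def fit_ellipse_axes(x, y):
    """Least-squares ellipse fit; returns (major, minor) full axis lengths."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre the points for conditioning; axis lengths are translation invariant.
    x = x - x.mean()
    y = y - y.mean()
    d1 = np.column_stack([x * x, x * y, y * y])    # quadratic design matrix
    d2 = np.column_stack([x, y, np.ones_like(x)])  # linear design matrix
    s1, s2, s3 = d1.T @ d1, d1.T @ d2, d2.T @ d2
    t = -np.linalg.solve(s3, s2.T)
    m = s1 + s2 @ t
    m = np.array([m[2] / 2.0, -m[1], m[0] / 2.0])  # inverse constraint matrix applied
    _, vecs = np.linalg.eig(m)
    vecs = np.real(vecs)
    # The ellipse solution is the eigenvector with 4AC - B^2 > 0.
    cond = 4.0 * vecs[0] * vecs[2] - vecs[1] ** 2
    a1 = vecs[:, np.argmax(cond)]
    A, B, C = a1
    D, E, F = t @ a1
    # Convert conic coefficients to semi-axis lengths.
    disc = B * B - 4.0 * A * C
    num = 2.0 * (A * E * E + C * D * D - B * D * E + disc * F)
    root = np.hypot(A - C, B)
    semi = np.sqrt(num * np.array([A + C + root, A + C - root])) / -disc
    return 2.0 * semi.max(), 2.0 * semi.min()


def angle_corrected_diameter(d, theta):
    """d cos(theta), with theta in radians between the aortic cross-section and the axial plane."""
    return d * np.cos(theta)


def is_aaa(max_diameter_mm, threshold_mm=30.0):
    """Positive when the largest diameter is greater than 3 cm."""
    return max_diameter_mm > threshold_mm


# Toy contour: ellipse with semi-axes 20 mm and 15 mm, rotated and shifted.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
cx = 100 + 20 * np.cos(angles) * np.cos(0.3) - 15 * np.sin(angles) * np.sin(0.3)
cy = 50 + 20 * np.cos(angles) * np.sin(0.3) + 15 * np.sin(angles) * np.cos(0.3)
major, minor = fit_ellipse_axes(cx, cy)
```

In a full pipeline the fit would be repeated for every image of the segmentation, and the largest angle-corrected long axis would be passed to the threshold test.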
4 Results
4.1 Training and Cross-Validation on Primary Data Set
To assess model validity and repeatability, the primary dataset was divided into
5 folds such that no patient was repeated between folds. Cross validation was
performed by selecting folds {n, n + 1, n + 2} mod 5 as training, n + 3 mod 5 as
validation and the remaining fold as test for n ∈ {0, ..., 4}. For each combination,
the weights with the best validation score after 100 epochs were selected.
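The fold rotation can be sketched as below. The rotation follows the description above; the patient-to-fold assignment is only one possible scheme (round-robin over sorted patient identifiers) and is an assumption, as are the function names `assign_patient_folds` and `cv_rotations`.

```python
def assign_patient_folds(study_patient_ids, n_folds=5):
    """Give each study a fold index so all studies of a patient share one fold.

    Assumption: round-robin over sorted patient IDs; the paper does not
    describe how patients were distributed.
    """
    patients = sorted(set(study_patient_ids))
    fold_of = {patient: i % n_folds for i, patient in enumerate(patients)}
    return [fold_of[patient] for patient in study_patient_ids]


def cv_rotations():
    """Train/validation/test fold indices for n = 0, ..., 4 with 5 folds."""
    rotations = []
    for n in range(5):
        train = [(n + k) % 5 for k in range(3)]
        val = (n + 3) % 5
        test = next(f for f in range(5) if f not in train and f != val)
        rotations.append({"train": train, "val": val, "test": test})
    return rotations


# Made-up patient identifiers; patients "a" and "b" each have two studies.
folds = assign_patient_folds(["a", "b", "a", "c", "d", "e", "f", "b"])
rotations = cv_rotations()
```

Each fold serves as the test fold exactly once, so every study receives one out-of-sample prediction.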
Inference on each test study was evaluated in terms of Dice score relative
to the reference segmentation and in terms of the maximum diameter of the
aorta evaluated on the inferred segmentation versus the same calculation on the
reference segmentation. The detailed results of this cross validation are presented
in Table 2. Over the 5 folds, the average Dice score ranged from 0.883 to 0.894,
with an overall average Dice score of 0.887 ± 0.111. The mean error in the
estimated diameter is consistently within one standard deviation of zero. There
may be a slight bias towards smaller diameters, as 4 of the 5 folds had negative
mean errors, but this bias is small, with an overall mean of -1.3 ± 7.3 mm.
For a final set of weights, the complete primary dataset was randomly split
into training (80%), validation (10%), and test sets (10%). Training was per-
formed for 300 epochs and the weights with lowest validation loss were selected.
As shown in Fig. 1, DeepAAA successfully segments the aorta on both con-
trast and non-contrast CT images, and works well with more challenging cases
where blood clots are present or the aortic boundary is unclear in the images.
8 Total is not 321, as two datasets were excluded due to truncated images. They were
retained in the generation of the full model.
6 J-T. Lu et al. (Arxiv version, accepted MICCAI 2019)
Fig. 1. DeepAAA aorta segmentation (red overlay) and the largest aortic diameter
estimation (yellow crosses, the long axis of ellipse fitting [green curves] of the aorta
segment): (a-c) Aneurysm with thrombus on contrast CT. (d-f) Large aneurysm on
non-contrast CT where aortic boundary is hard to segment. (g-i) normal aorta.
For each study, the model’s outputs were compared to the study labels and
the model’s overall performance was measured in terms of sensitivity/specificity
for detecting AAA and mean error in the maximum diameter. Table 3, last row,
summarizes these results, along with a comparison to the model’s performance
on the held-out test set for the same metrics. During the process we noted that
some studies in this additional validation set extended into thoracic anatomy,
and model inference of this region was removed manually in post-processing.
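The study-level metrics named above can be computed as in the following sketch. The function names are hypothetical and the inputs in the example are made-up values, not results from the paper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity from per-study AAA labels (True = AAA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def mean_diameter_error(predicted_mm, reported_mm):
    """Mean signed error (predicted minus reported) of the maximum diameter."""
    errors = [p - r for p, r in zip(predicted_mm, reported_mm)]
    return sum(errors) / len(errors)


# Made-up example: three positive and two negative studies.
sens, spec = sensitivity_specificity(
    [True, True, True, False, False],
    [True, True, False, False, False],
)
bias = mean_diameter_error([31.0, 45.0], [33.0, 44.0])
```

A negative mean error indicates that the model tends to underestimate the reported diameter.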
5 Discussion
While AAAs are rarely missed when they are the leading indication for a study, the
rate of detection decreases significantly when the AAA is an incidental finding.
DeepAAA aims to provide a “second set of eyes” and reduce the rate of missed
incidental findings. Therefore, to properly contextualize model performance, it is
important to quantify this rate of misdiagnosis. Claridge et al., in a retrospective
analysis of 3246 abdominal CT scans and their reports, found that only 65%
of AAAs were detected by radiologists [2]. DeepAAA exceeds the sensitivity
they found (Table 4) while achieving a high specificity (Table 3) and localizes
the suspected AAA for radiologist confirmation. Thus, a parallel read from our
algorithm could potentially provide a significant reduction in missed AAAs and
offer significant clinical value, enabling early detection and treatment of AAA.
Many observers have noted that machine learning models applied to radiol-
ogy may not generalize well [10]. Changing the equipment used to capture input
images and changing the demographics of the underlying patient cohorts tend
to reduce model performance. This lack of generalizability would significantly
hamper a model’s clinical utility because deployment at sites other than where
the model was trained may result in surprising under-performance. To test Deep-
AAA’s ability to generalize, we simulated a significant change in input data by
creating a second cohort of validation data (Section 2.2) acquired from different
patients using different equipment more than five years after the original train-
ing data were acquired. The model showed higher specificity (100%) and reduced
mean error in diameter prediction with only slightly lower sensitivity (85%),
essentially demonstrating that the model is robust and has not over-fit to any
cohort- or equipment-related idiosyncrasies of the original training data.
Future work would involve extending the DeepAAA model beyond the ab-
dominal region to include segmentation of the thoracic aorta. Thoracic aortic
aneurysms (TAA), although not nearly as prevalent as AAA, are still a signifi-
cant source of mortality and generally affect a younger population. In addition,
models to predict AAA growth or rupture would be of significant clinical value
in guiding more targeted surveillance programs and therapy.
In the main paper, the cross validation results presented in Table 2 were slightly
inconsistent with the remainder of the paper as two datasets were omitted due
to differences in processing techniques. In this supplement, we present the cross
validation results on the full primary dataset to avoid any confusion related to
this issue. While some small numerical changes did occur, the overall conclusions
remain the same.
The primary dataset was divided into 5 folds such that no patient was re-
peated between folds. Note that the folds in this supplement are not the same
folds as those in Table 2 in the main paper; the difference was necessary to
maintain a balanced number of datasets per fold while also not allowing any
patient to be present in more than one fold. Cross validation was performed by
selecting folds {n, n + 1, n + 2} mod 5 as training, n + 3 mod 5 as validation and
the remaining fold as test for n ∈ {0, ..., 4}. For each combination, the weights with
the best validation score after 100 epochs were selected.
Inference on each test study was evaluated in terms of Dice score relative
to the reference segmentation and in terms of the maximum diameter of the
aorta evaluated on the inferred segmentation versus the same calculation on the
reference segmentation. The detailed results of this cross validation are presented
in Table 5. Over the 5 folds, the average Dice score ranged from 0.848 to 0.901,
with an overall average Dice score of 0.873 ± 0.129. The mean error in the
estimated diameter is consistently within one standard deviation of zero. There
may be a slight bias towards smaller diameters, as 4 of the 5 folds had negative
mean errors, but this bias is small, with an overall mean of -1.7 ± 8.7 mm.
References
1. Çiçek, O., Abdulkadir, A., Lienkamp, S., Brox, T., Ronneberger, O.: 3D U-Net:
Learning Dense Volumetric Segmentation from Sparse Annotation. In: Ourselin,
S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Proc. 19th Conf.
Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp.
424–32. Springer (2016)
2. Claridge, R., Arnold, S., Morrison, N., van Rij, A.M.: Measuring abdominal aor-
tic diameters in routine abdominal computed tomography scans and implications
for abdominal aortic aneurysm screening. J. Vasc. Surg. 65(6), 1637–1642 (2017).
https://doi.org/10.1016/j.jvs.2016.11.044
3. de Bruijne, M., van Ginneken, B., Viergever, M., Niessen, W.: Interactive segmen-
tation of abdominal aortic aneurysms in CTA images. Med. Image Anal. 8(2),
127–38 (2004). https://doi.org/10.1016/j.media.2004.01.001
4. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least squares fitting of ellipses.
vol. 1, pp. 253–257 (1996). https://doi.org/10.1109/ICPR.1996.546029
5. Lindholt, J., Juul, S., Fasting, H., Henneberg, E.: Screening for abdominal aortic
aneurysms: Single centre randomised controlled trial. BMJ 330(7494), 750 (2005).
https://doi.org/10.1136/bmj.38369.620162.82
6. López-Linares, K., Aranjuelo, N., Kabongo, L., Maclair, G., Lete, N., Ceresa, M.,
Garcı́a-Familiar, A., Macı́a, I., Ballester, M.A.G.: Fully automatic detection and
segmentation of abdominal aortic thrombus in post-operative CTA images using
deep convolutional neural networks. Med. Image Anal. 46, 202–214 (May 2018).
https://doi.org/10.1016/j.media.2018.03.010
7. Mell, M.W., Hlatky, M.A., Shreibati, J.B., Dalman, R.L., Baker, L.C.: Late diag-
nosis of abdominal aortic aneurysms substantiates underutilization of abdominal
aortic aneurysm screening for Medicare beneficiaries. J. Vasc. Surg. 57(6), 1519–23,
1523.e1 (2013). https://doi.org/10.1016/j.jvs.2012.12.034
8. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully Convolutional Neural Net-
works for Volumetric Medical Image Segmentation. In: 4th Int. Conf. 3D Vision
(3DV). pp. 565–71 (2016)
9. Siriapisith, T., Kusakunniran, W., Haddawy, P.: Outer Wall Segmentation of Ab-
dominal Aortic Aneurysm by Variable Neighborhood Search Through Intensity
and Gradient Spaces. Journal of Digital Imaging 31(4), 490–504 (Aug 2018).
https://doi.org/10.1007/s10278-018-0049-z
10. Zech, J.R., Badgeley, M.A., Liu, M., Costa, A.B., Titano, J.J., Oermann, E.K.:
Variable generalization performance of a deep learning model to detect pneumonia
in chest radiographs: A cross-sectional study. PLOS Medicine 15(11), 1–17 (2018).
https://doi.org/10.1371/journal.pmed.1002683
11. Zhuge, F., Rubin, G.D., Sun, S., Napel, S.: An abdominal aortic aneurysm segmen-
tation method: Level set with region and statistical information. Medical Physics
33(5), 1440–53 (2006). https://doi.org/10.1118/1.2193247