
DeepAAA: clinically applicable and generalizable detection of abdominal aortic aneurysm using deep learning

Jen-Tang Lu⁵, Rupert Brooks⁴ [0000-0002-5642-7770], Stefan Hahn⁴, Jin Chen⁶, Varun Buch¹,⁷ [0000-0003-1704-4811], Gopal Kotecha⁵, Katherine P. Andriole¹,³, Brian Ghoshhajra², Joel Pinto⁴, Paul Vozila⁴, Mark Michalski¹, and Neil A. Tenenholtz¹ [0000-0003-1250-3716]

¹ MGH and BWH Center for Clinical Data Science
² Massachusetts General Hospital (MGH)
³ Brigham and Women’s Hospital (BWH)
⁴ Nuance Communications Inc.
⁵ Work done while affiliated with 1
⁶ Work done while affiliated with 4
⁷ Corresponding author [email protected]

arXiv:1907.02567v1 [eess.IV] 4 Jul 2019

Abstract. We propose a deep learning-based technique for the detection and quantification of abdominal aortic aneurysms (AAAs). The condition, which leads to more than 10,000 deaths per year in the United States, is asymptomatic, often detected incidentally, and often missed by radiologists. Our model architecture is a modified 3D U-Net combined with ellipse fitting that performs aorta segmentation and AAA detection. The study uses 321 abdominal-pelvic CT examinations performed by the Massachusetts General Hospital Department of Radiology for training and validation. The model is then tested for generalizability on a separate set of 57 examinations whose patient demographics and acquisition characteristics differ from those of the original dataset. DeepAAA achieves high performance on both sets of data (sensitivity/specificity of 0.91/0.95 and 0.85/1.0, respectively), on both contrast and non-contrast CT scans, and works with image volumes containing varying numbers of images. We find that DeepAAA exceeds the literature-reported performance of radiologists on incidental AAA detection. We expect that the model can serve as an effective background detector in routine CT examinations to prevent incidental AAAs from being missed.

Keywords: Segmentation · aorta · aneurysm · deep learning · U-Net

1 Introduction

Abdominal aortic aneurysms (AAAs), an enlargement or widening of the abdominal aorta, commonly occur in males older than 65 years, with a prevalence of 4 to 8 percent [5]. Untreated aneurysms tend to grow and may eventually rupture, with mortality rates exceeding 90%. As most AAAs are asymptomatic until critical bleeding, incidental detection of AAAs is critical. However, on routine abdominal computed tomography (CT) exams, only 65% of AAAs are incidentally identified [2]. This low reporting rate makes it difficult to provide timely intervention for patients. Indeed, AAAs are commonly first diagnosed at a point where the patient is already at risk of rupture [7]. Furthermore, in routine clinical practice the size of an AAA is determined by manual measurement of the maximal aortic diameter, which is time-consuming and prone to high inter-reader variability.

2 J-T. Lu et al. (Arxiv version, accepted MICCAI 2019)
Consequently, a variety of computer-aided diagnosis techniques have been proposed over the past decade for automated aorta segmentation. Many of these earlier aids used classical computer vision techniques that required prior knowledge, such as external seed points for initialization [3]. Driven by the ever-increasing capability of deep learning, neural networks have recently been used for aorta segmentation on CT angiography [6]. However, these deep learning algorithms focused only on contrast-enhanced CT exams, while incidental identification of AAAs on non-contrast scans is equally important but more challenging. Additionally, most previous work concentrated on automated aortic segmentation [11,9,6], and very few studies have investigated the more applied task of AAA detection, which has much greater clinical relevance than segmentation alone.
In this paper, we demonstrate a deep-learning solution (DeepAAA) for automated aorta segmentation and AAA detection on both contrast and non-contrast CT series. Specifically, we develop a variant of a 3D U-Net [1] for aorta segmentation on abdominal CT scans. The proposed method handles series with varying numbers of images. We then apply ellipse fitting to the segmented aortic contours and estimate the largest aortic diameter. DeepAAA is a general solution, achieving a high detection rate for AAAs on both contrast and non-contrast CT scans and working with variable image resolutions and slice thicknesses. Furthermore, our solution demonstrates strong generalizability and performance relative to literature-reported values for radiologist sensitivity at AAA detection.

2 Cohort and annotation

Image data consisted of contrast and non-contrast CT examinations of the abdomen and pelvis performed between January 2005 and April 2017 by the Massachusetts General Hospital Department of Radiology. The investigators obtained local Institutional Review Board approval for the project and selected two datasets from the database. The two datasets differ in their capture dates and the imaging equipment used, as characterized in Table 1.

2.1 Primary Data Set

The primary dataset was used for training and initial validation of the model and contained 321 studies (223 unique patients). These were selected through a keyword search of study reports, ensuring a mixture of positive and negative cases of AAA. The query was biased toward studies captured between 2005 and 2007. Of the selected studies, 217 (67.6%) were of male and 104 (32.4%) of female patients, with a mean age of 70.3 years; 153 (47.7%) CT scans were acquired with contrast and 168 (52.3%) without; 247 (76.9%) studies had an AAA present and 74 (23.1%) did not. For each study, the axial series was used for aorta segmentation and AAA detection. Slice thickness ranged from 2 to 10 mm, and the number of images per series varied from 40 to 384.

Table 1. Comparison between primary and additional validation data sets

Characteristic | Primary Data Set | Additional Validation Set
Number of studies | 321 | 57
Dates captured | 2005-2007 (90%) | 2012-2016 (85%)
Imaging equipment manufacturer | A 61%, B 26%, C 8%, D 5% | A 96%, B 4%
Contrast | 48% | 51%
Presence of AAA | 77% | 51%
Mean age (by study) | 70 years | 72 years
Gender (by study) | 68% male, 32% female | 68% male, 32% female
Data labelling method: max. aortic diameter | Manual segmentation | Sourced from reports
Data labelling method: presence/absence of AAA | 3.0 cm threshold applied to segmentation | Sourced from reports
To generate a ground-truth aortic segmentation, the abdominal aorta was manually contoured slice by slice on the axial images down to the aortic bifurcation. Each study was annotated by 1 to 4 CT technologists under the supervision of 2 radiologists. Based on the clinical definition [2], the presence of AAA was determined by applying a 3.0 cm threshold to the maximum aortic diameter as defined by the manual segmentations.
As many exams were annotated by multiple annotators, a partial assessment of inter-rater variability was possible. Of the 153 contrast studies, 124 were annotated by at least 2 independent technologists, yielding 517 pairwise comparisons. The non-contrast data, however, contained only 10 studies with more than one segmentation, resulting in only 16 pairwise comparisons. The average inter-rater Dice was 0.95 ± 0.03 on contrast series and 0.90 ± 0.08 on non-contrast series. Given the small number of non-contrast samples, that estimate should not be considered definitive, but it suggests roughly similar levels of agreement. For the subsequent analysis, one reference segmentation per study was selected at random as ground truth.
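The inter-rater figures above are averages over pairwise comparisons between annotators. A minimal sketch of that bookkeeping follows, assuming each annotation is stored as a set of foreground voxel indices and that the reported average is a plain mean (with sample standard deviation) over all pairs pooled across studies. The function names and toy masks are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations
from statistics import mean, stdev


def dice(a, b):
    """Dice overlap of two binary masks given as sets of voxel indices."""
    if not a and not b:
        return 1.0  # two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def inter_rater_dice(studies):
    """Pool pairwise Dice over every study with at least two annotations.

    studies: list of studies, each a list of masks (one per annotator).
    Returns (number of pairwise comparisons, mean Dice, sample SD).
    """
    scores = []
    for annotations in studies:
        # combinations() yields no pairs for a study with a single annotator
        for a, b in combinations(annotations, 2):
            scores.append(dice(a, b))
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return len(scores), mean(scores), spread


# Toy example: one study with three annotators, one with a single annotator.
full = {(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)}
partial = {(0, 0, 0), (0, 1, 0), (1, 0, 0)}
n_pairs, avg, sd = inter_rater_dice([[full, partial, set(full)], [full]])
print(n_pairs, round(avg, 3))  # prints: 3 0.905
```

Pooling every pair, rather than first averaging within each study, weights studies with more annotators more heavily; which convention the reported 0.95 and 0.90 use is not stated.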

2.2 Additional Validation Set


An additional validation set was used to test the robustness of the model to changes in imaging equipment, departmental capture protocols, and patient demographics. All of these factors may vary significantly over time at a single site; we therefore selected 57 studies (57 unique patients), captured predominantly between 2012 and 2016, for this dataset. The studies were selected through a keyword search of study reports to include a mixture of positive and negative cases of AAA, and all negative studies were manually verified not to contain a AAA. To assess the model against radiologist-reported ground truth and to validate the post-processing stages that generate the AAA measurement, the maximum aortic diameter and the presence of AAA were sourced from radiology reports rather than derived from manual segmentations (as was done for the primary data set).

3 Methods
We achieve AAA detection in two sequential steps: (1) aorta segmentation and (2) fitting of the aortic contours to estimate the largest cross-sectional diameter. For abdominal aortic segmentation, we developed a variant of a 3D U-Net [1] that accepts series with varying numbers of images. As discussed in Section 2, our dataset contained a wide distribution of image counts and slice thicknesses, since abdominal studies may also cover other regions of the body, including the pelvis or thorax. It is thus essential that the algorithm adapt to variability along the axial dimension. The 3D U-Net architecture we used contained 4 down/upsampling modules (plus the bottleneck layer), 2 convolutional layers per module, and 32 initial features. The convolutional kernel size was 3 × 3 × 3 in both the downsampling and upsampling paths, while the 3D pooling kernels were 2 × 2 × 1 to preserve the image count. Batch normalization was applied before each ReLU activation, and dropout with a rate of 0.2 was applied at the bottleneck layer. A 1 × 1 × 1 convolutional layer with softmax activation over two classes (background and aorta) was applied at the output, and its output was thresholded at 0.5 to generate the binary aorta mask.
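Because the pooling kernels are 2 × 2 × 1, each level halves only the in-plane axes, so the slice axis passes through the network unchanged and a series of any length can be processed. The sketch below traces feature-map shapes through such a network to make this concrete. It assumes 'same'-padded 3 × 3 × 3 convolutions, a channels-last (H, W, D, C) layout, a 512 × 512 in-plane size, and the common U-Net convention of doubling the feature count at each level; the paper states only the 32 initial features, so the doubling is an assumption, and the helper name `unet_shapes` is hypothetical.

```python
def unet_shapes(height, width, n_slices, levels=4, base_features=32, n_classes=2):
    """Trace (H, W, D, C) feature-map shapes through a 3D U-Net encoder.

    Assumes 'same'-padded 3x3x3 convolutions (spatial size unchanged) and
    that the feature count doubles after each 2x2x1 pooling step.
    """
    factor = 2 ** levels
    if height % factor or width % factor:
        # otherwise upsampled maps would not line up with their skip connections
        raise ValueError(f"in-plane size must be divisible by {factor}")
    encoder = []
    h, w, c = height, width, base_features
    for _ in range(levels):
        encoder.append((h, w, n_slices, c))  # after the two 3x3x3 convolutions
        h, w, c = h // 2, w // 2, c * 2      # 2x2x1 pooling: slice axis untouched
    bottleneck = (h, w, n_slices, c)          # dropout (rate 0.2) is applied here
    # The decoder mirrors the encoder back to full resolution; the final 1x1x1
    # convolution gives per-voxel softmax scores for background and aorta.
    output = (height, width, n_slices, n_classes)
    return encoder, bottleneck, output


for n_slices in (40, 384):  # shortest and longest series in the primary dataset
    enc, bott, out = unet_shapes(512, 512, n_slices)
    print(n_slices, bott, out)
```

Whatever the series length, the depth D of every feature map equals the number of input images, which is what lets one set of weights serve both short and long acquisitions.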
The model was trained with the RMSprop optimizer at a learning rate of 0.0001. The weights selected for evaluation were those that minimized the loss on the validation set; these were generally not the weights from the last epoch. The loss function was a smoothed negative Dice coefficient:
$$D = -\frac{2\sum_{i=1}^{N} p_i g_i + 1}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + 1} \qquad (1)$$

similar, but not identical, to that used in [8]. The summation is over all N
voxels in a scan, pi is the predicted aorta probability and gi is the ground truth
classification for voxel i. The additional ones in the numerator and denominator
avoid division by zero and yield a perfect score for a correct, empty segmentation.
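Eq. (1) can be sketched in a few lines. The version below operates on flat Python sequences of per-voxel aorta probabilities p_i and binary labels g_i, an illustrative simplification: in a Keras training loop the same expression would be written with tensor operations. The function name `smoothed_negative_dice` is hypothetical. The example calls show the two properties described in the text: a correct, empty segmentation scores -1 (the best possible value), and the added ones keep the denominator non-zero.

```python
def smoothed_negative_dice(pred, truth, smooth=1.0):
    """Eq. (1): D = -(2 * sum(p_i * g_i) + 1) / (sum(p_i) + sum(g_i) + 1).

    pred:  per-voxel predicted aorta probabilities p_i in [0, 1]
    truth: per-voxel ground-truth labels g_i (0 or 1)
    Lower is better; -1 is a perfect score.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must cover the same voxels")
    overlap = sum(p * g for p, g in zip(pred, truth))
    return -(2.0 * overlap + smooth) / (sum(pred) + sum(truth) + smooth)


print(smoothed_negative_dice([0.0, 0.0, 0.0], [0, 0, 0]))  # empty and correct: -1.0
print(smoothed_negative_dice([1.0, 1.0, 0.0], [1, 1, 0]))  # perfect overlap: -1.0
print(smoothed_negative_dice([1.0, 0.0], [0, 1]))          # disjoint: -1/3
```

Because the loss uses the soft probabilities p_i rather than the thresholded mask, it stays differentiable; the 0.5 threshold is applied only at inference.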
To build a general AAA detector that works with both contrast and non-contrast CT scans, we mixed both types of CT images during training. All experiments were implemented with the Keras deep learning library and the TensorFlow backend on an NVIDIA DGX-1 (Volta).
Table 2. Results of 5-fold cross-validation. Delta is predicted minus reference largest diameter. Standard deviations combined using pooled variance.

Fold | N | Mean Dice | Mean Delta (mm)
0 | 64 | 0.887 ± 0.121 | -0.4 ± 8.5
1 | 64 | 0.893 ± 0.107 | -0.7 ± 5.4
2 | 64 | 0.894 ± 0.060 | -3.2 ± 6.0
3 | 64 | 0.883 ± 0.126 | -2.7 ± 6.3
4 | 63 | 0.877 ± 0.127 | 0.8 ± 9.5
All⁸ | 319 | 0.887 ± 0.111 | -1.3 ± 7.3

After aorta segmentation, we applied ellipse fitting [4] image by image to the aortic contours, and the aortic diameter d on each image was taken as the long axis of the fitted ellipse. Where the aorta was tilted so that its cross-section was not parallel to the axial CT image, angle correction was applied to recover the true aortic diameter, d cos θ, where θ is the angle between the secant plane of the aorta and the axial plane. Following the definition of AAA, predicted positives were the studies in which the largest diameter of the aortic segment exceeded 3.0 cm. We then compared the predicted results with the ground-truth annotations.
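The post-processing chain described above (per-image ellipse, long axis, angle correction, 3.0 cm threshold) can be sketched as follows. This is an illustrative substitute, not the authors' implementation: instead of the direct least-squares ellipse fit of [4], it uses the moment-equivalent ellipse of each slice's segmented pixels (long axis = 4·√λ_max of the pixel-coordinate covariance), and it estimates the tilt θ from the in-plane shift of the aortic centroid between neighbouring slices, which is an assumption because the paper does not say how θ was obtained. All function names, and the pixel spacing and slice thickness in the example, are hypothetical.

```python
import math


def disk(radius_px, center=(0, 0)):
    """Pixels of a filled circle; used only to build the toy example below."""
    r0, c0 = center
    return [(r0 + r, c0 + c)
            for r in range(-radius_px, radius_px + 1)
            for c in range(-radius_px, radius_px + 1)
            if r * r + c * c <= radius_px * radius_px]


def centroid(pixels):
    n = len(pixels)
    return sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n


def long_axis_mm(pixels, spacing_mm):
    """Long axis of the moment-equivalent ellipse of one axial aorta region.

    For a filled ellipse, the variance of pixel positions along the major axis
    is a**2 / 4, so the full long axis 2a equals 4 * sqrt(largest eigenvalue
    of the 2x2 coordinate covariance matrix).
    """
    mr, mc = centroid(pixels)
    n = len(pixels)
    srr = sum((r - mr) ** 2 for r, _ in pixels) / n
    scc = sum((c - mc) ** 2 for _, c in pixels) / n
    src = sum((r - mr) * (c - mc) for r, c in pixels) / n
    lam_max = (srr + scc) / 2 + math.hypot((srr - scc) / 2, src)
    return 4.0 * math.sqrt(lam_max) * spacing_mm


def tilt_angle(pixels_a, pixels_b, spacing_mm, dz_mm):
    """Hypothetical tilt estimate: angle of the aortic centreline from the z axis.

    This equals the angle between the aortic cross-section and the axial plane.
    """
    (ra, ca), (rb, cb) = centroid(pixels_a), centroid(pixels_b)
    shift_mm = math.hypot(rb - ra, cb - ca) * spacing_mm
    return math.atan2(shift_mm, dz_mm)


def max_diameter_mm(slices, spacing_mm, slice_mm):
    """Largest angle-corrected diameter, d * cos(theta), over a series."""
    best = 0.0
    for i, pixels in enumerate(slices):
        if not pixels:
            continue  # no aorta segmented on this slice
        lo = i - 1 if i > 0 and slices[i - 1] else i
        hi = i + 1 if i + 1 < len(slices) and slices[i + 1] else i
        theta = 0.0
        if hi > lo:
            theta = tilt_angle(slices[lo], slices[hi], spacing_mm, (hi - lo) * slice_mm)
        best = max(best, long_axis_mm(pixels, spacing_mm) * math.cos(theta))
    return best


def is_aaa(max_diameter, threshold_mm=30.0):
    """Predicted positive when the largest diameter exceeds 3.0 cm."""
    return max_diameter > threshold_mm


series = [disk(20)] * 10  # toy straight aorta, 40 px wide
d = max_diameter_mm(series, spacing_mm=0.8, slice_mm=5.0)
print(round(d, 1), is_aaa(d))  # prints: 32.0 True
```

For a tilted aorta, the axial cross-section is an ellipse whose long axis is stretched by 1/cos θ, which is why multiplying by cos θ recovers the true diameter.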

4 Results
4.1 Training and Cross-Validation on Primary Data Set
To assess model validity and repeatability, the primary dataset was divided into 5 folds such that no patient appeared in more than one fold. Cross-validation was performed by selecting folds {n, n + 1, n + 2} mod 5 for training, fold (n + 3) mod 5 for validation, and the remaining fold for testing, for n ∈ {0, …, 4}. For each combination, the weights with the best validation score over 100 epochs of training were selected.
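The fold rotation can be written out explicitly. The sketch below builds patient-grouped folds and enumerates the five train/validation/test assignments. The greedy rule that places each patient's studies into the currently smallest fold is a hypothetical choice for illustration (the paper states only that no patient appears in more than one fold), as are the function names.

```python
from collections import defaultdict


def patient_grouped_folds(patient_ids, n_folds=5):
    """Split study indices into folds so that no patient spans two folds.

    patient_ids[i] is the patient of study i. Patients with the most studies
    are placed first, each into the fold currently holding the fewest studies.
    """
    by_patient = defaultdict(list)
    for study, patient in enumerate(patient_ids):
        by_patient[patient].append(study)
    folds = [[] for _ in range(n_folds)]
    for patient in sorted(by_patient, key=lambda p: (-len(by_patient[p]), str(p))):
        target = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[target].extend(by_patient[patient])
    return folds


def rotation(n, n_folds=5):
    """Folds {n, n+1, n+2} mod 5 train, (n+3) mod 5 validates, (n+4) mod 5 tests."""
    train = [(n + k) % n_folds for k in range(3)]
    return train, (n + 3) % n_folds, (n + 4) % n_folds


for n in range(5):
    print(n, rotation(n))
```

Over n = 0, …, 4 every fold serves exactly once as the test fold, so each study receives exactly one out-of-sample prediction.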
Inference on each test study was evaluated in terms of the Dice score relative to the reference segmentation and in terms of the maximum aortic diameter computed on the inferred segmentation versus the same calculation on the reference segmentation. The detailed results of this cross-validation are presented in Table 2. Across the 5 folds, the mean Dice score ranged from 0.877 to 0.894, with an overall mean of 0.887 ± 0.111. In every fold, the mean diameter error lies within one standard deviation of zero. There may be a slight bias toward underestimating the diameter, as 4 of the 5 folds had negative means, but this bias is small, with an overall mean of -1.3 ± 7.3 mm.
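The caption of Table 2 says the standard deviations were combined using pooled variance. One common definition is sketched below (study-weighted mean of the fold means, pooled within-fold variance Σ(n_k − 1)s_k² / (N − K)). Applied to the rounded per-fold values in Table 2, it gives pooled standard deviations of about 0.111 for Dice and 7.3 mm for the diameter error, consistent with the "All" row; this suggests, but does not confirm, that it is the definition used. The function name `pool_folds` is hypothetical.

```python
import math


def pool_folds(folds):
    """Combine per-fold (n, mean, sd) summaries into one overall summary.

    Overall mean: study-weighted mean of the fold means.
    Overall SD:   pooled within-fold SD, sqrt(sum((n_k - 1) * s_k**2) / (N - K)).
    """
    total = sum(n for n, _, _ in folds)
    mean = sum(n * m for n, m, _ in folds) / total
    var = sum((n - 1) * s ** 2 for n, _, s in folds) / (total - len(folds))
    return total, mean, math.sqrt(var)


# Per-fold Dice and diameter error (mm) copied from Table 2.
dice_folds = [(64, 0.887, 0.121), (64, 0.893, 0.107), (64, 0.894, 0.060),
              (64, 0.883, 0.126), (63, 0.877, 0.127)]
delta_folds = [(64, -0.4, 8.5), (64, -0.7, 5.4), (64, -3.2, 6.0),
               (64, -2.7, 6.3), (63, 0.8, 9.5)]
n, m, sd = pool_folds(dice_folds)
print(n, round(m, 3), round(sd, 3))           # prints: 319 0.887 0.111
print(round(pool_folds(delta_folds)[2], 1))   # prints: 7.3
```

Note that the pooled within-fold SD ignores the spread between fold means; including it would give the SD of all studies taken together, which is slightly larger when the fold means differ.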
For a final set of weights, the complete primary dataset was randomly split into training (80%), validation (10%), and test (10%) sets. Training was performed for 300 epochs, and the weights with the lowest validation loss were selected.
As shown in Fig. 1, DeepAAA successfully segments the aorta on both contrast and non-contrast CT images and also handles more challenging cases in which thrombus is present or the aortic boundary is unclear.
⁸ Total is not 321 because two datasets were excluded due to truncated images. They were retained when generating the full model.

Fig. 1. DeepAAA aorta segmentation (red overlay) and largest aortic diameter estimation (yellow crosses: the long axis of the ellipse [green curves] fitted to the aorta segment): (a-c) aneurysm with thrombus on contrast CT; (d-f) large aneurysm on non-contrast CT where the aortic boundary is hard to segment; (g-i) normal aorta.

We achieve high performance on aortic segmentation, with an average Dice coefficient of 0.91, which yields high sensitivity (0.91) and high specificity (0.95) for AAA detection (Table 3). We further examined the error in the largest aortic diameter measurement (d_pred - d_true). The algorithm tends to underestimate aortic size, but the average discrepancy of 2.02 mm is well within the 10 mm gradations on which clinical decisions are generally based.
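Study-level sensitivity, specificity and diameter error follow directly from predicted and reference maximum diameters. A minimal sketch, assuming both are available in millimetres and that a study counts as positive when its maximum diameter exceeds 30 mm (the 3.0 cm rule applied to predictions and to the primary-set labels); for the additional validation set, whose labels come from reports, the reference side would be a list of report labels instead. The function name and the diameters in the example are made up for illustration and are not study data.

```python
from statistics import mean, stdev


def detection_metrics(pred_mm, ref_mm, threshold_mm=30.0):
    """Study-level AAA detection metrics and diameter error (d_pred - d_true)."""
    tp = fn = fp = tn = 0
    for p, t in zip(pred_mm, ref_mm):
        pred_pos, true_pos = p > threshold_mm, t > threshold_mm
        if true_pos and pred_pos:
            tp += 1
        elif true_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    deltas = [p - t for p, t in zip(pred_mm, ref_mm)]
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "mean_delta_mm": mean(deltas),
        "sd_delta_mm": stdev(deltas) if len(deltas) > 1 else 0.0,
    }


# Made-up diameters (mm) for five hypothetical studies, not data from the paper.
pred = [45.0, 28.0, 33.0, 22.0, 31.0]
ref = [47.0, 32.0, 35.0, 21.0, 29.0]
print(detection_metrics(pred, ref))
```

A negative mean delta, as in the example and in Table 3, indicates systematic underestimation; near the 30 mm threshold that bias converts some true aneurysms into false negatives.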

4.2 Testing Model Robustness on the Additional Validation Set


Using the final model trained in Section 4.1, we performed inference on studies from the additional validation set described in Section 2.2. Each study was labelled for the presence of a AAA via the radiology report, and for studies with positive findings, the maximum aortic diameter was also extracted.

Table 3. Performance of DeepAAA on segmentation and detection

Dataset | CT type | Segmentation: Dice | Segmentation: Mean Delta (mm) | Detection of AAA: Sensitivity | Detection of AAA: Specificity
Primary | Contrast | 0.89 ± 0.05 | -2.67 ± 2.62 | 0.89 | 0.94
Primary | Non-contrast | 0.90 ± 0.05 | -1.36 ± 4.30 | 0.92 | 0.95
Primary | Overall | 0.90 ± 0.05 | -2.02 ± 3.62 | 0.91 | 0.95
Additional Validation Set | all | n/a | -0.6 ± 3.0 | 0.85 | 1.00

For each study, the model’s outputs were compared with the study labels, and the model’s overall performance was measured in terms of sensitivity and specificity for detecting AAA and the mean error in the maximum diameter. The last row of Table 3 summarizes these results alongside the model’s performance on the held-out test set for the same metrics. We noted that some studies in this additional validation set extended into thoracic anatomy; model inference in this region was removed manually in post-processing.

5 Discussion

While AAAs are rarely missed when they are the leading indication for a study, the detection rate drops significantly when the AAA is an incidental finding. DeepAAA aims to provide a “second set of eyes” and reduce the rate of missed incidental findings. To properly contextualize model performance, it is therefore important to quantify this rate of missed diagnosis. Claridge et al., in a retrospective analysis of 3246 abdominal CT scans and their reports, found that only 65% of AAAs were detected by radiologists [2]. DeepAAA exceeds the sensitivity they found (Table 4) while achieving high specificity (Table 3), and it localizes the suspected AAA for radiologist confirmation. A parallel read by our algorithm could therefore substantially reduce the number of missed AAAs and offer significant clinical value by enabling earlier detection and treatment of AAA.
Many observers have noted that machine learning models applied to radiology may not generalize well [10]. Changes in the equipment used to capture input images, or in the demographics of the underlying patient cohort, tend to reduce model performance. Such a lack of generalizability would significantly hamper a model’s clinical utility, because deployment at sites other than where the model was trained could result in surprising under-performance. To test DeepAAA’s ability to generalize, we simulated a significant change in input data by creating a second validation cohort (Section 2.2) acquired from different patients, on different equipment, more than five years after the original training data were acquired. On this cohort the model showed higher specificity (100%) and a lower mean error in diameter prediction, with only slightly lower sensitivity (85%). This indicates that the model is robust and has not over-fit to cohort- or equipment-related idiosyncrasies of the original training data.

Table 4. Comparison between DeepAAA and literature-reported performance of radiologists on AAA reporting for routine abdominal CT, according to aneurysm size

Method | 30-39 mm | 40-49 mm | ≥50 mm
DeepAAA sensitivity | 0.68 | 1.00 | 1.00
Radiologists’ sensitivity [2] | 0.52 | 0.87 | 1.00
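The size-stratified comparison in Table 4 amounts to computing sensitivity separately within bands of reference diameter. A minimal sketch, assuming a per-study reference diameter (mm) and a binary model detection are available for each true AAA; the band edges follow the table, while the function name and toy inputs are hypothetical.

```python
def sensitivity_by_size(ref_mm, detected,
                        bands=((30, 40, "30-39 mm"), (40, 50, "40-49 mm"),
                               (50, float("inf"), ">=50 mm"))):
    """Fraction of reference AAAs detected, per reference-diameter band.

    ref_mm:   reference maximum diameters (mm) of the true AAAs
    detected: matching booleans, True if the model flagged the study
    Returns {band label: sensitivity, or None if the band is empty}.
    """
    result = {}
    for lo, hi, label in bands:
        hits = [d for t, d in zip(ref_mm, detected) if lo <= t < hi]
        result[label] = sum(hits) / len(hits) if hits else None
    return result


# Toy example with six hypothetical aneurysms.
print(sensitivity_by_size([32, 36, 38, 44, 47, 55],
                          [True, False, True, True, True, True]))
```

Small aneurysms near the 30 mm threshold dominate the misses, for radiologists as well as for the model, because a few millimetres of underestimation is enough to fall below the cut-off.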

Future work would involve extending the DeepAAA model beyond the abdominal region to include segmentation of the thoracic aorta. Thoracic aortic aneurysms (TAAs), although not nearly as prevalent as AAAs, are still a significant source of mortality and generally affect a younger population. In addition, models to predict AAA growth or rupture would be of significant clinical value in guiding more targeted surveillance programs and therapy.

6 Supplemental Material: Revised Cross-Validation Results

In the main paper, the cross validation results presented in Table 2 were slightly
inconsistent with the remainder of the paper as two datasets were omitted due
to differences in processing techniques. In this supplement, we present the cross
validation results on the full primary dataset to avoid any confusion related to
this issue. While some small numerical changes did occur, the overall conclusions
remain the same.

Table 5. Results of 5-fold cross-validation. Delta is predicted minus reference largest diameter. Standard deviations combined using pooled variance.

Fold | N | Mean Dice | Mean Delta (mm)
0 | 65 | 0.869 ± 0.143 | 0.7 ± 10.0
1 | 64 | 0.848 ± 0.170 | -2.1 ± 8.9
2 | 64 | 0.864 ± 0.155 | -2.7 ± 10.4
3 | 64 | 0.901 ± 0.059 | -2.4 ± 5.2
4 | 64 | 0.882 ± 0.078 | -2.0 ± 8.3
All | 321 | 0.873 ± 0.129 | -1.7 ± 8.7

The primary dataset was divided into 5 folds such that no patient appeared in more than one fold. Note that the folds in this supplement are not the same as those in Table 2 of the main paper; the difference was necessary to maintain a balanced number of datasets per fold while not allowing any patient to be present in more than one fold. Cross-validation was performed by selecting folds {n, n + 1, n + 2} mod 5 for training, fold (n + 3) mod 5 for validation, and the remaining fold for testing, for n ∈ {0, …, 4}. For each combination, the weights with the best validation score over 100 epochs of training were selected.
Inference on each test study was evaluated in terms of the Dice score relative to the reference segmentation and in terms of the maximum aortic diameter computed on the inferred segmentation versus the same calculation on the reference segmentation. The detailed results of this cross-validation are presented in Table 5. Across the 5 folds, the mean Dice score ranged from 0.848 to 0.901, with an overall mean of 0.873 ± 0.129. In every fold, the mean diameter error lies within one standard deviation of zero. There may be a slight bias toward underestimating the diameter, as 4 of the 5 folds had negative means, but this bias is small, with an overall mean of -1.7 ± 8.7 mm.

References

1. Çiçek, O., Abdulkadir, A., Lienkamp, S., Brox, T., Ronneberger, O.: 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Proc. 19th Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 424–32. Springer (2016)
2. Claridge, R., Arnold, S., Morrison, N., van Rij, A.M.: Measuring abdominal aortic diameters in routine abdominal computed tomography scans and implications for abdominal aortic aneurysm screening. J. Vasc. Surg. 65(6), 1637–1642 (2017). https://doi.org/10.1016/j.jvs.2016.11.044
3. de Bruijne, M., van Ginneken, B., Viergever, M., Niessen, W.: Interactive segmentation of abdominal aortic aneurysms in CTA images. Med. Image Anal. 8(2), 127–38 (2004). https://doi.org/10.1016/j.media.2004.01.001
4. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least squares fitting of ellipses. In: Proc. 13th Int. Conf. Pattern Recognition (ICPR). vol. 1, pp. 253–257 (1996). https://doi.org/10.1109/ICPR.1996.546029
5. Lindholt, J., Juul, S., Fasting, H., Henneberg, E.: Screening for abdominal aortic aneurysms: Single centre randomised controlled trial. BMJ 330(7494), 750 (2005). https://doi.org/10.1136/bmj.38369.620162.82
6. López-Linares, K., Aranjuelo, N., Kabongo, L., Maclair, G., Lete, N., Ceresa, M., García-Familiar, A., Macía, I., Ballester, M.A.G.: Fully automatic detection and segmentation of abdominal aortic thrombus in post-operative CTA images using deep convolutional neural networks. Med. Image Anal. 46, 202–214 (May 2018). https://doi.org/10.1016/j.media.2018.03.010
7. Mell, M.W., Hlatky, M.A., Shreibati, J.B., Dalman, R.L., Baker, L.C.: Late diagnosis of abdominal aortic aneurysms substantiates underutilization of abdominal aortic aneurysm screening for Medicare beneficiaries. J. Vasc. Surg. 57(6), 1519–23, 1523.e1 (2013). https://doi.org/10.1016/j.jvs.2012.12.034
8. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In: 4th Int. Conf. 3D Vision (3DV). pp. 565–71 (2016)
9. Siriapisith, T., Kusakunniran, W., Haddawy, P.: Outer Wall Segmentation of Abdominal Aortic Aneurysm by Variable Neighborhood Search Through Intensity and Gradient Spaces. Journal of Digital Imaging 31(4), 490–504 (Aug 2018). https://doi.org/10.1007/s10278-018-0049-z
10. Zech, J.R., Badgeley, M.A., Liu, M., Costa, A.B., Titano, J.J., Oermann, E.K.: Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 15(11), 1–17 (2018). https://doi.org/10.1371/journal.pmed.1002683
11. Zhuge, F., Rubin, G.D., Sun, S., Napel, S.: An abdominal aortic aneurysm segmentation method: Level set with region and statistical information. Medical Physics 33(5), 1440–53 (2006). https://doi.org/10.1118/1.2193247
