0% found this document useful (0 votes)

51 views16 pages

Comparing Different Deep Learning Architectures For Classification of Chest Radiographs

The authors compared 15 deep learning architectures on their ability to classify chest radiographs from the CheXpert dataset. Smaller networks like ResNet-34, AlexNet and VGG-16 achieved similar classification performance as larger networks like DenseNet-201 and ResNet-151 while requiring fewer resources. The study aims to identify the most suitable neural network architectures for chest radiograph classification tasks.

Uploaded by

Govardhan Jain

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

51 views16 pages

Comparing Different Deep Learning Architectures For Classification of Chest Radiographs

Uploaded by

Govardhan Jain

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/339445459

Comparing Different Deep Learning Architectures for Classiﬁcation of Chest

Radiographs

Preprint · February 2020

CITATIONS READS

0 65

6 authors, including:

Keno Bressem Lisa Adams

Charité Universitätsmedizin Berlin Charité Universitätsmedizin Berlin
27 PUBLICATIONS 42 CITATIONS 37 PUBLICATIONS 57 CITATIONS

SEE PROFILE SEE PROFILE

Stefan M. Niehues
Charité Universitätsmedizin Berlin
148 PUBLICATIONS 872 CITATIONS

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Thermal ablation facts View project

All content following this page was uploaded by Keno Bressem on 27 February 2020.

The user has requested enhancement of the downloaded file.

Comparing Different Deep Learning Architectures for
Classification of Chest Radiographs
Keno K. Bressem, Lisa Adams, Christoph Erxleben, Bernd Hamm, Stefan Niehues, Janis Vahldiek

Department of Radiology, Charité Universitätsmedizin Berlin

keno-kyrill.bressem(at)charite.de
arXiv:2002.08991v1 [cs.LG] 20 Feb 2020

Abstract
Chest radiographs are among the most frequently acquired images in radiology and are often the
subject of computer vision research. However, most of the models used to classify chest radiographs are
derived from openly available deep neural networks, trained on large image-datasets. These datasets
routinely differ from chest radiographs in that they are mostly color images and contain several possible
image classes, while radiographs are greyscale images and often only contain fewer image classes.
Therefore, very deep neural networks, which can represent more complex relationships in image-features,
might not be required for the comparatively simpler task of classifying grayscale chest radiographs.
We compared fifteen different architectures of artificial neural networks regarding training-time and
performance on the openly available CheXpert dataset to identify the most suitable models for deep
learning tasks on chest radiographs. We could show, that smaller networks such as ResNet-34, AlexNet
or VGG-16 have the potential to classify chest radiographs as precisely as deeper neural networks such as
DenseNet-201 or ResNet-151, while being less computationally demanding.

1
Introduction and might even outperform deeper networks while re-
quiring lesser resources. Therefore, we systematically
Chest radiographs are among the most frequently examine the performance of fifteen openly available
used imaging procedures in radiology. They have artificial neural network architectures in order to iden-
been widely employed in the field of computer vision, tify the most suitable ones for the basic classification
as chest radiographs are a standardized technique of chest radiographs.
and, if compared to other radiological examinations
such as computed tomography or magnetic resonance
imaging, contain a smaller group of relevant patholo- Methods
gies. Although many artificial neural networks for
the classification of chest radiographs have been de-
Data preparation
veloped, it is still the subject of intensive research.
Only a few groups design their own networks from
The free available CheXpert dataset consists of
scratch, but rather use already established architec-
224,316 chest radiographs from 65,240 patients.
tures, such as ResNet-50 or DenseNet-121 (with 50
Fourteen findings have been annotated for each image:
and 121 representing the number of layers within the
enlarged cardiomediastinum, cardiomegaly, lung
respective neural network) [3][5][7][2][14][11]. These
opacity, lung lesion, edema, consolidation, pneumonia,
neural networks have often been trained on large,
atelectasis, pneumothorax, pleural effusion, pleural
openly available datasets, such as ImageNet, and are
other, fracture and support devices. Hereby the
therefore already able to recognize numerous image
findings can be annotated as present (1), absent
features. When training a model for a new task, such
(NA) or uncertain (-1). Similar to previous work on
as the classification of chest radiographs, the use of
the classification of the CheXpert dataset [7][16],
pre-trained networks may improve the training speed
we trained these networks on a subset of labels:
and accuracy of the new model, since important im-
cardiomegaly, edema, consolidation, atelectasis
age features that have already been learned can be
and pleural effusion. As we only aim at network
transferred to the new task and do not have to be
comparison and not on maximal precision of a
learned again. However, the feature space of freely
neural network, for this analysis, each image with
available data sets such as ImageNet differs from chest
an uncertainty label was excluded, other approaches
radiographs as they contain color images and more
such as zero imputation or self-training were also not
categories. The ImageNet Challenge includes 1000
adopted. Furthermore, only frontal radiographs were
possible categories per image, while CheXpert, a large
used, leaving 135,494 images from 53,388 patients for
freely available data set of chest radiographs, only
training. CheXpert offers additional dataset with
distinguishes between 14 categories (or classes)[13].
235 images (201 images after excluding uncertainty
Although the ImageNet challenge showed a trend to-
labels and lateral radiographs), annotated by two
wards higher accuracies for deeper networks, this may
independent radiologists, which is intended as an
not be fully transferrable to radiology. In radiology,
evaluations dataset and was therefore used for this
sometimes only limited features of an image can be
purpose.
decisive for the diagnosis. Therefore, images cannot
be scaled down as much as desired, as the required
information would otherwise be lost. But, the more Data augmentation
complex a neural network architecture is, the more
resources are required for training and deployment of For the first and second training session, the images
such an algorithm. As up-scaling the input-images were scaled to 320 x 320 pixels, using bilinear inter-
resolution exponentially increases memory usage dur- polation, and pixel values were normalized. During
ing training for large neural networks, that evalu- training, multiple image-transformations were applied:
ate many parameters, the size of a mini batch needs flipping of the images alongside the horizontal and
to be reduced earlier and more strongly, potentially vertical axis, rotation of up to 10°, zooming of up
affecting optimizers such as stochastic gradient de- to 110%, adding of random lightning or symmetric
scent. Therefore, it is currently not clear, which of wrapping.
the available artificial neural networks designed for
and trained on the ImageNet dataset will perform the
best for the classification of chest radiographs. The
hypothesis of this work is, that shallow networks are
already sufficient for the classification of radiographs

2
Model training The dataset presents class imbalances (% positives for
each finding: cardiomegaly 33%, edema 21%, consol-
14 different convolutional neural networks (CNN) of idation 16%, atelectasis 37%, pleural effusion 32%),
five different architectures (ResNet, DenseNet, VGG, so that the AUPRC as well as the AUROC can be
SqueezeNet and AlexNet) were trained on the CheX- considered equally important measurements for the
pert dataset [3][5][17][6][9]. All training was done performance of the network. The performance of the
using the Python programming language (https:// tested networks is compared to the AUROC reported
www.python.org, version 3.8) with the PyTorch (https: by Irvin et al.[7].However, only values for AUROC,
//pytorch.org) and FastAI (https://fast.ai) libraries but not for AUPRC, are provided there. In most cases,
on a workstation running on Ubuntu 18.04 with two the best results were achieved with a batch size of 32,
Nvidia GeForce RTX 2080ti graphic cards (11 GB of so all the information provided below refers to models
RAM each)[4][10]. In the first training session, batch trained with this batch size. Results achieved with
size was held constant at 16 for all models, while it smaller batch sizes of 16 will be explicitly mentioned.
was increased to 32 for all networks in the second
session. In the first two sessions, each model was
trained for eight epochs, whereas during the first five Area under the Receiver Operating
epochs only the classification-head of each network Characteristic Curve
was trained. Thereafter, the model was unfrozen and
trained as whole for three additional epochs. Before Deeper artificial neural networks generally achieved
training and after the first five epochs, the optimal higher AUROC values than shallow networks (Table
learning rate was determined [19], which was between 1 and Figures 1-3). Regarding the pooled AUROC
1e-1 and 1e-2 for the first five epochs and between 1e-5 for the detection of the five pathologies, ResNet-152
and 1e-6 for the rest of the training. We trained one (0.882), DenseNet-161 (0.881) and ResNet-50 (0.881)
multilabel classification head for each model. Since performed best (Irvin et al. CheXpert baseline
the performance of a neural network can be subject to 0.889)[7]. Broken down for individual findings, the
minor random fluctuations, the training was repeated most accurate detection of atelectasis was achieved by
for a total of five times. The predictions on the valida- ResNet-18 (0.816, batch size 16), ResNet-101 (0.813,
tion data set were then exported as comma separated batch size 16), VGG-19 (0.813, batch size 16) and
values (CSV) for evaluation. ResNet-50 (0.811). For detection of cardiomegaly, the
best four models surpassed the CheXpert baseline of
0.828 (ResNet-34 0.840, ResNet-152 0.836, DenseNet-
Evaluation 161 0.834, ResNet-50 0.832). For congestion, the
highest AUROC was achieved using ResNet-152
Evaluation was performed using the “R” statistical (0.917), ResNet-50 (0.916) and DenseNet-161 (0.913).
environment including the “tidyverse” and “ROCR” Pulmonary edema was most accurately detected using
libraries [12][20][18].Predictions on the validation DenseNet-161 (0.923), DenseNet-169 (0.922) and
dataset of the five models for each network archi- DenseNet-201 (0.922). For pleural effusion, the four
tecture were pooled so that the models could be best models were ResNet-152 (0.937), ResNet-101
evaluated as a consortium. For each individual (0.936), ResNet-50 (0.934) and DenseNet-169 (0.934),
prediction as well as the pooled predictions, receiver all of which performed superior to the CheXpert
operation characteristic (ROC) curves and precision baseline of 0.928.
recall curves (PRC) were plotted and the areas under
each curve were calculated (AUROC and AUPRC).
AUROC and AUPRC were chosen as they enable Area under the Precision Recall Curve
a comparison of different models, independent of a
chosen threshold for the classification. For AUPRC, shallower artificial neural networks
could achieve higher values than deeper network-
architectures (Table 2 and Figures 4-6). The highest
Results pooled values for the AUPRC were achieved by
training VGG-16 (0.709), AlexNet (0.701) and
ResNet-34 (0.688). For atelectasis, CGG-16 and
The CheXpert validation dataset consists out of 234
AlexNet both achieved the highest AUPRC of 0.732,
studies of 200 patients, not used for training with
followed by Resnet-35 with 0.652. Cardiomegaly was
no uncertainty-labels. After excluding lateral radio-
most accurately detected by SqueezeNet 1.0 (0.565),
graphs (n = 32), 202 images of 200 patients remained.

3
Alexnet-152 (0.565) and Vgg-13 (0.563). SqueezNet Instead, an accurate classification of chest radiographs
1.0 also achieved the highest AUPRC values for may be achieved with comparably shallow networks,
consolidation (0.815) followed by ResNet-152 (0.810) such as AlexNet (8 layers), ResNet-34 or VGG-16,
and ResNet-50 (0.809). The best classifications of which surpass even complex deep networks such as
pulmonary edema were achieved by DenseNet-169, ResNet-150 or DenseNet-201.
DenseNet-161 (both 0.743) and DenseNet-201 The use of smaller neural networks has the advantage,
(0.742). Finally, for pleural effusion ResNet-101 and that hardware requirements and training time are
ResNet-152 achieved the highest AUPRC of 0.591, lower compared to deeper networks. Shorter training
followed by ResNet-50 (0.590). times allow to test more hyperparameters, simpli-
fying the overall training process. Lower hardware
requirements also enable the use of increased image
Overall best Performance resolutions. This could be of relevance for the evalu-
ation of chest radiographs with a generic resolution
Considering both AUROC and AUPRC, the best per- of 2048 x 2048 px to 4280 x 4280 px, where specific
formance was achieved by VGG-16 (AUROC: 0.856, findings, such as small pneumothorax, require larger
AUPRC: 0.709), ResNet-34 (AUROC: 0.872, AUPRC: resolutions of input-images, because otherwise the cru-
0.688) and AlexNet (AUROC: 0.839, AUPRC: 0.701), cial information regarding their presence could be lost
all with a batch size of 32. due to a downscaling. Furthermore, shorter training
times might simplify the integration of improvement
methods into the training data, such as the implemen-
Training time tation of ‘human in the loop’ annotations. ‘Human
in the loop’ implies that the training of a network is
Fourteen different network-architectures were trained supervised by a human expert, who may intervene and
10 times each with a multilabel-classification head (five correct the network at critical steps. For example, the
times each for batch size of 16 or 32 and an input- human expert can check the misclassifications with
image resolution of 320 x 320 pixels) and once with a the highest loss for incorrect labels, thus effectively re-
binary classification head for each finding, resulting ducing label noise. With shorter training times, such
in 210 individual training runs. Overall, training took feedback loops can be executed faster. In the CheX-
340 hours. As to be expected, the training of deeper pert dataset, which was used as a groundwork for the
networks required more time than the training of present analysis, labels for the images were generated
shallower networks. For an image resolution of 320 x using a specifically developed natural language pro-
320 pixels, the training of AlexNet required the least cessing tool, which did not produce perfect labels. For
amount of time with a time per epoch of 2:29 to 2:50 example, the F1 scores for the mention and subsequent
minutes and a total duration of 20 minutes for a batch negation of cardiomegaly were 0.973 and 0.909, and
size of 32. Using a smaller batch size of 16, the time the F1 score for an uncertainty label was 0.727. There-
per epoch raised to 2:59 - 3:06 minutes and a total fore, it can be assumed, that there is a certain amount
duration of 24 minutes. In contrast, using a batch of noise in the training data, which might affect the
size of 16, training of a DenseNet-201 took the longest accuracy of the models trained on it. Implementing a
with 5:11 hours and epochs requiring 41 minutes. For human-in-the loop approach for partially correcting
a batch size of 32, training a DenseNet-169 required the label noise could further improve performance of
the largest amount of time with 3:06 hours (epochs networks trained on the CheXpert dataset [8]. Our
between 21 and 27 minutes). Increasing the batch findings differ from applied techniques used in pre-
size from 16 to 32 lead to an average acceleration of vious literature, where deeper network architectures,
training by 29.9% ± 9.34%. Table 3 gives an overview mainly a DenseNet-121, were used instead of small
of training times. networks to classify the CheXpert data set [11][1][15].
The authors of the CheXpert dataset achieved an aver-
age overall AUROC of 0.889 [7], using a DenseNet-121
Discussion which was not surpassed by any of the models used
in our analysis, although differences between the best
In the present work, different architectures of artificial performing networks and the CheXpert baseline were
neural networks are analyzed with respect to their smaller than 0.01.. It should be noted, however, that
performance for the classification of chest radiographs. in our analysis the hyperparameters for the models
We could show that more complex neural networks do were probably not selected as precise as in the original
not necessarily perform better than shallow networks. CheXpert paper by Irvin et al., since the focus of this

4
work was more on comparing the architectures and not
on the complete optimization of one specific network.
Still, we identified model, which achieved higher AU-
ROC values in two of the five findings (cardiomegaly
and effusion). Pham et al. also used a DenseNet-121
as the basis for their model and proposed the most
accurate model of the CheXpert dataset with a mean
AUROC of 0.940 for the five selected findings [11].
The good results are probably due to the hierarchical
structure of the classification framework, which takes
into account correlations between different labels, and
the application of a label-smoothing technique, which
also allows the use of uncertainty labels (which were
excluded in our present work). Allaouzi et al. simi-
larly used a DenseNet-121 and created three different
models for the classification of the CheXpert and
ChestX-ray14, yielding an AUC of 0.72 for atelectasis,
0.87-0.88 for cardiomegaly, 0.74-0.77 for consolidation,
0.86-0.87 for edema and 0.90 for effusion [1]. Ex-
cept for cardiomegaly, we achieved better values with
several models (e.g. ResNet-34, ResNet-50, AlexNet,
VGG-16). We would interpret this as proof that com-
plex deep networks are not necessarily superior to
more shallow networks for chest x-ray classification.
At least for the CheXpert dataset it seems that meth-
ods optimizing the handling of uncertainty labels and
hierarchical structures of the data are important to
improve model performance. Sabottke et al. trained a
ResNet-32 for classification of chest radiographs and
therefore are one of the few groups using a smaller
network [15]. With an AUROC of 0.809 for atelecta-
sis, 0.925 for cardiomegaly, 0.888 for edema and 0.859
for effusion, their network performed not as good as
some of our tested networks. Raghu et al. employed
a ResNet-50, an Inception-v3 as well as a custom-
designed small network. Similar to our findings, they
observed, that smaller networks showed a comparable
performance to deeper networks [13].

Conclusion

In the present work, we could show that smaller arti-

ficial neural networks for the classification of chest ra-
diographs can perform similar, or even surpass deeper
and very deep neural networks. In contrast to many
previous studies, which mostly used a DenseNet-121,
we achieved the best results with up to 95% smaller
networks. Using smaller networks therefore has the
advantage that that they have lower hardware require-
ments, as they require less GPU RAM and can be
trained faster without loss of performance.

5
Tables

Table 1 Area under the Receiver Operating Characteristic Curve

Network Batchsize Atelectasis Cardiomegaly Consolidation Edema Effusion Pooled

CheXpert baseline 16 0.818 0.828 0.938 0.934 0.928 0.889
ResNet-18 16 0.816 0.797 0.905 0.868 0.899 0.857
ResNet-34 16 0.799 0.798 0.902 0.891 0.905 0.859
ResNet-50 16 0.798 0.799 0.890 0.880 0.913 0.856
ResNet-101 16 0.813 0.810 0.905 0.889 0.907 0.865
ResNet-152 16 0.801 0.809 0.908 0.896 0.916 0.866
DenseNet-121 16 0.809 0.794 0.895 0.883 0.906 0.857
DenseNet-161 16 0.800 0.817 0.885 0.900 0.923 0.865
DenseNet-169 16 0.805 0.795 0.898 0.891 0.909 0.860
DenseNet-201 16 0.805 0.812 0.891 0.886 0.916 0.862
AlexNet 16 0.790 0.755 0.857 0.894 0.881 0.835
SqueezeNet-1.0 16 0.761 0.755 0.833 0.907 0.885 0.828
SqueezeNet-1.1 16 0.767 0.764 0.880 0.903 0.879 0.839
VGG-13 16 0.798 0.752 0.886 0.867 0.872 0.835
VGG-16 16 0.809 0.766 0.892 0.879 0.883 0.846
VGG-19 16 0.811 0.786 0.901 0.890 0.884 0.854
ResNet-18 32 0.796 0.822 0.908 0.903 0.911 0.868
ResNet-34 32 0.797 0.840 0.903 0.902 0.919 0.872
ResNet-50 32 0.811 0.832 0.916 0.913 0.934 0.881
ResNet-101 32 0.797 0.823 0.911 0.915 0.936 0.876
ResNet-152 32 0.802 0.836 0.917 0.920 0.937 0.882
DenseNet-121 32 0.808 0.828 0.879 0.904 0.926 0.869
DenseNet-161 32 0.809 0.834 0.913 0.923 0.928 0.881
DenseNet-169 32 0.809 0.816 0.900 0.922 0.934 0.876
DenseNet-201 32 0.795 0.820 0.904 0.922 0.931 0.874
AlexNet 32 0.791 0.768 0.856 0.894 0.886 0.839
SqueezeNet-1.0 32 0.773 0.769 0.880 0.913 0.895 0.846
SqueezeNet-1.1 32 0.785 0.789 0.895 0.904 0.898 0.854
VGG-13 32 0.800 0.762 0.883 0.896 0.907 0.850
VGG-16 32 0.798 0.776 0.890 0.911 0.906 0.856
VGG-19 32 0.787 0.790 0.879 0.911 0.916 0.857

Table 1 shows the different areas under the receiver operating characteristic curve (AUROC) for each of
the network architectures and individual finding as well as the pooled AUROC per model. According to
the pooled AUROC, ResNet-152, ResNet-50 und DenseNet-161 were the best models, while SqueezeNet
and AlexNet showed the poorest performance. For cardiomegaly, ResNet-34, ResNet-50, ResNet-152 and
DenseNet-161 could surpass the CheXpert baseline provided by Irvin et al. ResnEt-50, ResNet-101, ResNet-152
and DenseNet-169 could also surpass the CheXpert baseline for pleural effusion. A batch size of 32 often lead
to better results compared to a batch size of 16.

6
Table 2 Area under the Precision Recall Curve

Network Batchsize Atelectasis Cardiomegaly Consolidation Edema Effusion Pooled

ResNet-18 16 0.500 0.559 0.806 0.727 0.580 0.634
ResNet-34 16 0.506 0.560 0.804 0.735 0.580 0.637
ResNet-50 16 0.501 0.557 0.802 0.733 0.585 0.636
ResNet-101 16 0.499 0.558 0.765 0.735 0.582 0.628
ResNet-152 16 0.503 0.559 0.808 0.737 0.584 0.638
DenseNet-121 16 0.503 0.554 0.802 0.733 0.580 0.634
DenseNet-161 16 0.501 0.557 0.799 0.736 0.587 0.636
DenseNet-169 16 0.500 0.560 0.805 0.733 0.582 0.636
DenseNet-201 16 0.320 0.555 0.445 0.734 0.582 0.527
AlexNet 16 0.543 0.565 0.490 0.733 0.577 0.582
SqueezeNet-1.0 16 0.509 0.565 0.425 0.736 0.576 0.562
SqueezeNet-1.1 16 0.505 0.563 0.400 0.733 0.575 0.555
VGG-13 16 0.502 0.563 0.761 0.726 0.574 0.625
VGG-16 16 0.501 0.559 0.797 0.733 0.577 0.633
VGG-19 16 0.500 0.558 0.808 0.731 0.577 0.635
ResNet-18 32 0.502 0.557 0.805 0.736 0.582 0.636
ResNet-34 32 0.652 0.556 0.806 0.737 0.585 0.667
ResNet-50 32 0.497 0.555 0.809 0.740 0.590 0.638
ResNet-101 32 0.500 0.558 0.808 0.740 0.591 0.639
ResNet-152 32 0.502 0.559 0.810 0.741 0.591 0.641
DenseNet-121 32 0.500 0.558 0.793 0.736 0.587 0.635
DenseNet-161 32 0.499 0.556 0.808 0.743 0.589 0.639
DenseNet-169 32 0.499 0.556 0.805 0.743 0.588 0.638
DenseNet-201 32 0.502 0.555 0.808 0.742 0.589 0.639
AlexNet 32 0.720 0.562 0.789 0.731 0.578 0.676
SqueezeNet-1.0 32 0.354 0.562 0.815 0.738 0.580 0.610
SqueezeNet-1.1 32 0.506 0.563 0.804 0.731 0.577 0.636
VGG-13 32 0.501 0.560 0.799 0.735 0.578 0.635
VGG-16 32 0.732 0.561 0.804 0.739 0.582 0.684
VGG-19 32 0.501 0.562 0.800 0.740 0.585 0.638

Table 2 shows the area under the precision recall curve (AUPRC) for all networks and findings. In contrast
to the AUROC, where deeper models achieved higher values, shallower networks yielded the best results for
AUPRC (ResNet-24, AlexNet, VGG-16). DenseNet-201 and Squeezenet showed the lowest AUPRC values.
Again, a batch size of 32 appeared to deliver better results compared to a batch size of 16.

7
Table 3 Duration of Training

Table 3 provides an overview of training time per epoch (duration/epoch) and an overall training-time
(duration/training) for each neural network. The times given are the average of five training runs rounded to
the nearest minute.

8
Figures

Receiver Operating Characteristic Curves

Figures 1, 2 and 3 display the ROC-curves for all models. The colored lines represent a single training, black
lines represent the pooled performance over five trainings.

Figure 1