Comparing Different Deep Learning Architectures For Classification of Chest Radiographs
Abstract

Chest radiographs are among the most frequently acquired images in radiology and are often the subject of computer vision research. However, most of the models used to classify chest radiographs are derived from openly available deep neural networks trained on large image datasets. These datasets routinely differ from chest radiographs in that they mostly consist of color images and cover a large number of image classes, while radiographs are grayscale images with far fewer relevant classes. Very deep neural networks, which can represent more complex relationships between image features, might therefore not be required for the comparatively simple task of classifying grayscale chest radiographs. We compared fifteen different artificial neural network architectures regarding training time and performance on the openly available CheXpert dataset in order to identify the most suitable models for deep learning tasks on chest radiographs. We show that smaller networks such as ResNet-34, AlexNet or VGG-16 have the potential to classify chest radiographs as precisely as deeper neural networks such as DenseNet-201 or ResNet-152, while being less computationally demanding.
Introduction

Chest radiographs are among the most frequently used imaging procedures in radiology. They have been widely employed in the field of computer vision, as chest radiography is a standardized technique and, compared to other radiological examinations such as computed tomography or magnetic resonance imaging, covers a smaller group of relevant pathologies. Although many artificial neural networks for the classification of chest radiographs have been developed, the task remains the subject of intensive research. Only a few groups design their own networks from scratch; most use already established architectures such as ResNet-50 or DenseNet-121 (with 50 and 121 representing the number of layers within the respective neural network) [3][5][7][2][14][11]. These neural networks have often been trained on large, openly available datasets such as ImageNet and are therefore already able to recognize numerous image features. When training a model for a new task, such as the classification of chest radiographs, the use of pre-trained networks may improve the training speed and accuracy of the new model, since important image features that have already been learned can be transferred to the new task and do not have to be learned again. However, the feature space of freely available datasets such as ImageNet differs from that of chest radiographs, as they contain color images and more categories: the ImageNet challenge includes 1,000 possible categories per image, while CheXpert, a large freely available dataset of chest radiographs, only distinguishes between 14 categories (or classes) [13]. Although the ImageNet challenge showed a trend towards higher accuracies for deeper networks, this may not be fully transferable to radiology. In radiology, sometimes only small parts of an image are decisive for the diagnosis, so images cannot be scaled down arbitrarily, as the required information would otherwise be lost. At the same time, the more complex a neural network architecture is, the more resources are required for training and deployment of such an algorithm. As up-scaling the input-image resolution sharply increases memory usage during training for large neural networks that evaluate many parameters, the mini-batch size needs to be reduced earlier and more strongly, potentially affecting optimizers such as stochastic gradient descent. It is therefore currently not clear which of the available artificial neural networks designed for and trained on the ImageNet dataset perform best for the classification of chest radiographs. The hypothesis of this work is that shallow networks are already sufficient for the classification of radiographs and might even outperform deeper networks while requiring fewer resources. We therefore systematically examine the performance of fifteen openly available artificial neural network architectures in order to identify the most suitable ones for the basic classification of chest radiographs.

Methods

Data preparation

The freely available CheXpert dataset consists of 224,316 chest radiographs from 65,240 patients. Fourteen findings have been annotated for each image: enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture and support devices. Each finding can be annotated as present (1), absent (NA) or uncertain (-1). Similar to previous work on the classification of the CheXpert dataset [7][16], we trained the networks on a subset of labels: cardiomegaly, edema, consolidation, atelectasis and pleural effusion. As we aim at a comparison of networks rather than at the maximal precision of a single neural network, every image with an uncertainty label was excluded from this analysis; other approaches, such as zero imputation or self-training, were not adopted. Furthermore, only frontal radiographs were used, leaving 135,494 images from 53,388 patients for training. CheXpert offers an additional dataset of 235 images (201 images after excluding uncertainty labels and lateral radiographs), annotated by two independent radiologists, which is intended as an evaluation dataset and was used as such here.
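As an illustration, this filtering could be sketched with pandas as follows (the column names follow the released CheXpert CSV; the file path and the zero-filling of unmentioned findings are assumptions):

import pandas as pd

# The five target findings; column names follow the CheXpert CSV.
FINDINGS = ["Cardiomegaly", "Edema", "Consolidation",
            "Atelectasis", "Pleural Effusion"]

df = pd.read_csv("CheXpert-v1.0/train.csv")     # path is an assumption
df = df[df["Frontal/Lateral"] == "Frontal"]     # keep frontal radiographs only
# drop every image that carries an uncertainty label (-1) in any target finding
df = df[(df[FINDINGS] != -1.0).all(axis=1)]
labels = df[FINDINGS].fillna(0).astype(int)     # unmentioned/absent -> 0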
Data augmentation

For the first and second training session, the images were scaled to 320 x 320 pixels using bilinear interpolation, and pixel values were normalized. During training, multiple image transformations were applied: flipping of the images along the horizontal and vertical axis, rotation of up to 10°, zooming of up to 110%, and the addition of random lighting changes or symmetric warping.
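A minimal sketch of these augmentations with the transform API of fastai v1, the library used for training below; the lighting and warping magnitudes are not reported in the text, so the values here are assumptions:

from fastai.vision import get_transforms

# flips along both axes, rotation up to 10 degrees, zoom up to 110%,
# random lighting changes and symmetric warping
tfms = get_transforms(do_flip=True, flip_vert=True,
                      max_rotate=10.0, max_zoom=1.1,
                      max_lighting=0.2,   # assumed magnitude
                      max_warp=0.2)       # assumed magnitude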
Model training

14 different convolutional neural networks (CNNs) of five different architectures (ResNet, DenseNet, VGG, SqueezeNet and AlexNet) were trained on the CheXpert dataset [3][5][17][6][9]. All training was done using the Python programming language (https://www.python.org, version 3.8) with the PyTorch (https://pytorch.org) and fastai (https://fast.ai) libraries on a workstation running Ubuntu 18.04 with two Nvidia GeForce RTX 2080 Ti graphics cards (11 GB of RAM each) [4][10]. In the first training session, the batch size was held constant at 16 for all models; it was increased to 32 for all networks in the second session. In both sessions, each model was trained for eight epochs: during the first five epochs only the classification head of each network was trained; thereafter, the model was unfrozen and trained as a whole for three additional epochs. Before training and after the first five epochs, the optimal learning rate was determined [19]; it lay between 1e-1 and 1e-2 for the first five epochs and between 1e-5 and 1e-6 for the rest of the training. We trained one multilabel classification head for each model. Since the performance of a neural network can be subject to minor random fluctuations, training was repeated for a total of five times. The predictions on the validation dataset were then exported as comma-separated values (CSV) for evaluation.
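A condensed sketch of this schedule in fastai v1; the `data` ImageDataBunch and the concrete learning rates are assumptions (the paper picks rates with the learning-rate finder per run):

from fastai.vision import cnn_learner, models

# `data` is assumed to be an ImageDataBunch with the five CheXpert labels,
# the augmentations above and 320x320 resizing with normalization.
learn = cnn_learner(data, models.resnet34, pretrained=True)

learn.lr_find()                 # pick a learning rate (here ~1e-1 to 1e-2)
learn.fit_one_cycle(5, 1e-1)    # epochs 1-5: train the classification head only
learn.unfreeze()                # unfreeze the full network
learn.lr_find()                 # re-determine the rate (~1e-5 to 1e-6)
learn.fit_one_cycle(3, 1e-5)    # epochs 6-8: fine-tune all layers

preds, targets = learn.get_preds()  # validation predictions, exported as CSV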
Evaluation 161 0.834, ResNet-50 0.832). For congestion, the
highest AUROC was achieved using ResNet-152
Evaluation was performed using the “R” statistical (0.917), ResNet-50 (0.916) and DenseNet-161 (0.913).
environment including the “tidyverse” and “ROCR” Pulmonary edema was most accurately detected using
libraries [12][20][18].Predictions on the validation DenseNet-161 (0.923), DenseNet-169 (0.922) and
dataset of the five models for each network archi- DenseNet-201 (0.922). For pleural effusion, the four
tecture were pooled so that the models could be best models were ResNet-152 (0.937), ResNet-101
evaluated as a consortium. For each individual (0.936), ResNet-50 (0.934) and DenseNet-169 (0.934),
prediction as well as the pooled predictions, receiver all of which performed superior to the CheXpert
operation characteristic (ROC) curves and precision baseline of 0.928.
recall curves (PRC) were plotted and the areas under
each curve were calculated (AUROC and AUPRC).
AUROC and AUPRC were chosen as they enable Area under the Precision Recall Curve
a comparison of different models, independent of a
chosen threshold for the classification. For AUPRC, shallower artificial neural networks
could achieve higher values than deeper network-
architectures (Table 2 and Figures 4-6). The highest
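The evaluation itself was done in R with ROCR; an equivalent computation in Python with scikit-learn might look as follows. How the five runs were pooled is described only as "as a consortium", so averaging the predicted probabilities is an assumption here:

import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

def auroc_auprc(y_true, y_score):
    # threshold-independent metrics for one finding
    auroc = roc_auc_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auroc, auc(recall, precision)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=201)   # placeholder binary labels for one finding
run_preds = rng.random((5, 201))   # placeholder predictions of five runs
pooled = run_preds.mean(axis=0)    # pooling read as averaging (assumption)
print(auroc_auprc(y, pooled))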
Results

The CheXpert validation dataset consists of 234 studies of 200 patients that were not used for training and contain no uncertainty labels. After excluding lateral radiographs (n = 32), 202 images of 200 patients remained. The dataset presents class imbalances (% positives per finding: cardiomegaly 33%, edema 21%, consolidation 16%, atelectasis 37%, pleural effusion 32%), so that the AUPRC and the AUROC can be considered equally important measures of network performance. The performance of the tested networks is compared with the AUROC reported by Irvin et al. [7]; however, only values for the AUROC, not for the AUPRC, are provided there. In most cases, the best results were achieved with a batch size of 32, so all values reported below refer to models trained with this batch size; results achieved with the smaller batch size of 16 are explicitly marked as such.

Area under the Receiver Operating Characteristic Curve

Deeper artificial neural networks generally achieved higher AUROC values than shallow networks (Table 1 and Figures 1-3). Regarding the pooled AUROC for the detection of the five pathologies, ResNet-152 (0.882), DenseNet-161 (0.881) and ResNet-50 (0.881) performed best (Irvin et al. CheXpert baseline: 0.889) [7]. Broken down by individual findings, the most accurate detection of atelectasis was achieved by ResNet-18 (0.816, batch size 16), ResNet-101 (0.813, batch size 16), VGG-19 (0.813, batch size 16) and ResNet-50 (0.811). For the detection of cardiomegaly, the best four models surpassed the CheXpert baseline of 0.828 (ResNet-34 0.840, ResNet-152 0.836, DenseNet-161 0.834, ResNet-50 0.832). For consolidation, the highest AUROC was achieved using ResNet-152 (0.917), ResNet-50 (0.916) and DenseNet-161 (0.913). Pulmonary edema was most accurately detected using DenseNet-161 (0.923), DenseNet-169 (0.922) and DenseNet-201 (0.922). For pleural effusion, the four best models were ResNet-152 (0.937), ResNet-101 (0.936), ResNet-50 (0.934) and DenseNet-169 (0.934), all of which performed superior to the CheXpert baseline of 0.928.

Area under the Precision Recall Curve

For the AUPRC, shallower artificial neural networks achieved higher values than deeper network architectures (Table 2 and Figures 4-6). The highest pooled AUPRC values were achieved by VGG-16 (0.709), AlexNet (0.701) and ResNet-34 (0.688). For atelectasis, VGG-16 and AlexNet both achieved the highest AUPRC of 0.732, followed by ResNet-34 with 0.652. Cardiomegaly was most accurately detected by SqueezeNet 1.0 (0.565), ResNet-152 (0.565) and VGG-13 (0.563). SqueezeNet 1.0 also achieved the highest AUPRC value for consolidation (0.815), followed by ResNet-152 (0.810) and ResNet-50 (0.809). The best classifications of pulmonary edema were achieved by DenseNet-169 and DenseNet-161 (both 0.743), followed by DenseNet-201 (0.742). Finally, for pleural effusion, ResNet-101 and ResNet-152 achieved the highest AUPRC of 0.591, followed by ResNet-50 (0.590).
Overall best Performance

Considering both AUROC and AUPRC, the best overall performance was achieved by VGG-16 (AUROC: 0.856, AUPRC: 0.709), ResNet-34 (AUROC: 0.872, AUPRC: 0.688) and AlexNet (AUROC: 0.839, AUPRC: 0.701), all with a batch size of 32.
Training time

Fourteen different network architectures were trained ten times each with a multilabel classification head (five times each for batch sizes of 16 and 32 at an input-image resolution of 320 x 320 pixels) and once with a binary classification head for each finding, resulting in 210 individual training runs. Overall, training took 340 hours. As was to be expected, training deeper networks required more time than training shallower networks. For an image resolution of 320 x 320 pixels, the training of AlexNet required the least amount of time, with 2:29 to 2:50 minutes per epoch and a total duration of 20 minutes at a batch size of 32. Using the smaller batch size of 16, the time per epoch rose to 2:59-3:06 minutes and the total duration to 24 minutes. In contrast, at a batch size of 16, training of a DenseNet-201 took the longest, with 5:11 hours and epochs requiring 41 minutes. At a batch size of 32, training a DenseNet-169 required the most time, with 3:06 hours (epochs between 21 and 27 minutes). Increasing the batch size from 16 to 32 led to an average acceleration of training by 29.9% ± 9.34%. Table 3 gives an overview of training times.
Discussion

In the present work, different architectures of artificial neural networks were analyzed with respect to their performance in the classification of chest radiographs. We could show that more complex neural networks do not necessarily perform better than shallow networks. Instead, an accurate classification of chest radiographs may be achieved with comparably shallow networks such as AlexNet (8 layers), ResNet-34 or VGG-16, which surpass even complex deep networks such as ResNet-152 or DenseNet-201.

The use of smaller neural networks has the advantage that hardware requirements and training times are lower than for deeper networks. Shorter training times allow more hyperparameters to be tested, simplifying the overall training process. Lower hardware requirements also enable the use of higher image resolutions. This could be relevant for the evaluation of chest radiographs, which have a native resolution of 2048 x 2048 px to 4280 x 4280 px: specific findings, such as a small pneumothorax, require larger input-image resolutions, because otherwise the crucial information regarding their presence could be lost through downscaling. Furthermore, shorter training times might simplify the integration of improvement methods into the training data, such as the implementation of 'human in the loop' annotations. 'Human in the loop' implies that the training of a network is supervised by a human expert, who may intervene and correct the network at critical steps. For example, the human expert can check the misclassifications with the highest loss for incorrect labels, thus effectively reducing label noise. With shorter training times, such feedback loops can be executed faster.
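A minimal sketch of such a feedback step, assuming multilabel validation probabilities and 0/1 labels (e.g. from learn.get_preds(); the shapes below are placeholders): rank the images by their loss so that a reader reviews the most confidently wrong labels first.

import torch
import torch.nn.functional as F

# preds / targets stand in for the validation outputs of a trained model:
# probabilities and 0/1 labels for the five findings.
preds = torch.rand(201, 5)                       # placeholder probabilities
targets = torch.randint(0, 2, (201, 5)).float()  # placeholder labels

# per-image loss, summed over the five findings
loss = F.binary_cross_entropy(preds, targets, reduction="none").sum(dim=1)
review_order = torch.argsort(loss, descending=True)
print(review_order[:20])  # indices a human reader would check first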
In the CheXpert dataset, which was used as the groundwork for the present analysis, the image labels were generated using a specifically developed natural language processing tool, which did not produce perfect labels. For example, the F1 scores for the mention and subsequent negation of cardiomegaly were 0.973 and 0.909, and the F1 score for an uncertainty label was 0.727. It can therefore be assumed that there is a certain amount of noise in the training data, which might affect the accuracy of the models trained on it. Implementing a human-in-the-loop approach for partially correcting the label noise could further improve the performance of networks trained on the CheXpert dataset [8].

Our findings differ from the techniques applied in previous literature, where deeper network architectures, mainly a DenseNet-121, were used instead of small networks to classify the CheXpert dataset [11][1][15]. The authors of the CheXpert dataset achieved an average overall AUROC of 0.889 [7] using a DenseNet-121, which was not surpassed by any of the models in our analysis, although the differences between the best-performing networks and the CheXpert baseline were smaller than 0.01. It should be noted, however, that in our analysis the hyperparameters of the models were probably not selected as precisely as in the original CheXpert paper by Irvin et al., since the focus of this work was on comparing architectures rather than on the complete optimization of one specific network. Still, we identified models which achieved higher AUROC values than the baseline in two of the five findings (cardiomegaly and pleural effusion). Pham et al. also used a DenseNet-121 as the basis for their model and proposed the most accurate model on the CheXpert dataset, with a mean AUROC of 0.940 for the five selected findings [11]. The good results are probably due to the hierarchical structure of their classification framework, which takes correlations between different labels into account, and the application of a label-smoothing technique, which also allows the use of uncertainty labels (which were excluded in our present work). Allaouzi et al. similarly used a DenseNet-121 and created three different models for the classification of CheXpert and ChestX-ray14, yielding an AUC of 0.72 for atelectasis, 0.87-0.88 for cardiomegaly, 0.74-0.77 for consolidation, 0.86-0.87 for edema and 0.90 for effusion [1]. Except for cardiomegaly, we achieved better values with several models (e.g. ResNet-34, ResNet-50, AlexNet, VGG-16). We would interpret this as evidence that complex deep networks are not necessarily superior to shallower networks for chest X-ray classification. At least for the CheXpert dataset, it seems that methods optimizing the handling of uncertainty labels and the hierarchical structure of the data are important to improve model performance. Sabottke et al. trained a ResNet-32 for the classification of chest radiographs and are therefore one of the few groups using a smaller network [15]. With an AUROC of 0.809 for atelectasis, 0.925 for cardiomegaly, 0.888 for edema and 0.859 for effusion, their network did not perform as well as some of our tested networks. Raghu et al. employed a ResNet-50, an Inception-v3 and a custom-designed small network. Similar to our findings, they observed that smaller networks showed a performance comparable to deeper networks [13].
Conclusion
Tables
Table 1 Area under the Receiver Operating Characteristic Curve
Table 1 shows the areas under the receiver operating characteristic curve (AUROC) for each network architecture and individual finding, as well as the pooled AUROC per model. According to the pooled AUROC, ResNet-152, ResNet-50 and DenseNet-161 were the best models, while SqueezeNet and AlexNet showed the poorest performance. For cardiomegaly, ResNet-34, ResNet-50, ResNet-152 and DenseNet-161 surpassed the CheXpert baseline provided by Irvin et al. ResNet-50, ResNet-101, ResNet-152 and DenseNet-169 also surpassed the CheXpert baseline for pleural effusion. A batch size of 32 often led to better results than a batch size of 16.
Table 2 Area under the Precision Recall Curve
Table 2 shows the areas under the precision recall curve (AUPRC) for all networks and findings. In contrast to the AUROC, where deeper models achieved higher values, shallower networks (ResNet-34, AlexNet, VGG-16) yielded the best AUPRC results. DenseNet-201 and SqueezeNet showed the lowest AUPRC values. Again, a batch size of 32 appeared to deliver better results than a batch size of 16.
Table 3 Duration of Training
Table 3 provides an overview of the training time per epoch (duration/epoch) and the overall training time (duration/training) for each neural network. The times given are the average of five training runs, rounded to the nearest minute.
Figures
Figures 1, 2 and 3 display the ROC curves for all models. Colored lines represent individual training runs; black lines represent the pooled performance over five trainings.
Figure 1
[ROC curves; panel columns show the five findings; axes: false positive rate (x) vs. true positive rate (y)]
Figure 2
[ROC curves; panel columns show the five findings; axes: false positive rate (x) vs. true positive rate (y)]
Figure 3
[ROC curves for SqueezeNet-1.1 and SqueezeNet-1.0; panel columns: atelectasis, cardiomegaly, consolidation, edema, pleural effusion; axes: false positive rate (x) vs. true positive rate (y)]
Precision Recall Curves
Figures 4, 5 and 6 display the precision-recall curves for all models. Colored lines represent individual training runs; black lines represent the pooled performance over five trainings.
Figure 4
[Precision-recall curves; panel columns show the five findings; axes: precision (x) vs. recall (y)]
Figure 5
[Precision-recall curves; panel columns show the five findings; axes: precision (x) vs. recall (y)]
Figure 6
[Precision-recall curves for SqueezeNet-1.1 and SqueezeNet-1.0; panel columns: atelectasis, cardiomegaly, consolidation, edema, pleural effusion; axes: precision (x) vs. recall (y)]
References

[1] Imane Allaouzi and Mohamed Ben Ahmed. "A novel approach for multi-label chest X-ray classification of common thorax diseases". In: IEEE Access 7 (2019), pp. 64279-64288.

[2] Aurelia Bustos et al. "Padchest: A large chest x-ray image dataset with multi-label annotated reports". In: arXiv preprint arXiv:1901.07441 (2019).

[3] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770-778.

[4] Jeremy Howard et al. fastai. https://github.com/fastai/fastai. 2018.

[5] Gao Huang et al. "Densely connected convolutional networks". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 4700-4708.

[6] Forrest N Iandola et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size". In: arXiv preprint arXiv:1602.07360 (2016).

[7] Jeremy Irvin et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 590-597.

[8] Davood Karimi et al. "Deep learning with noisy labels: exploring techniques and remedies in medical image analysis". In: arXiv preprint arXiv:1912.02911 (2019).

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097-1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[10] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 8024-8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

[11] Hieu H Pham et al. "Interpreting chest X-rays via CNNs that exploit disease dependencies and uncertainty labels". In: arXiv preprint arXiv:1911.06475 (2019).

[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2019. url: https://www.R-project.org/.

[13] Maithra Raghu et al. "Transfusion: Understanding transfer learning for medical imaging". In: Advances in Neural Information Processing Systems. 2019, pp. 3342-3352.

[14] Pranav Rajpurkar et al. "Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning". In: arXiv preprint arXiv:1711.05225 (2017).

[15] Carl F. Sabottke and Bradley M. Spieler. "The Effect of Image Resolution on Deep Learning in Radiography". In: Radiology: Artificial Intelligence 2.1 (2020), e190015. doi: 10.1148/ryai.2019190015.

[16] Carl F Sabottke and Bradley M Spieler. "The effect of image resolution on deep learning in radiography". In: Radiology: Artificial Intelligence 2.1 (2020), e190015.

[17] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[18] T. Sing et al. "ROCR: visualizing classifier performance in R". In: Bioinformatics 21.20 (2005), p. 7881. url: http://rocr.bioinf.mpi-sb.mpg.de.

[19] Leslie N Smith. "Cyclical learning rates for training neural networks". In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE. 2017, pp. 464-472.

[20] Hadley Wickham. tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1. 2017. url: https://CRAN.R-project.org/package=tidyverse.