Unsupervised Anomaly Detection For X-Ray Images
{d.davletshina,v.melnychuk,viet.tran,hitansh.singla}@campus.lmu.de
2 Lehrstuhl für Datenbanksysteme und Data Mining, Ludwig-Maximilians-Universität München, Munich, Germany
{berrendorf,faerman,fromm,schubert}@dbs.ifi.lmu.de
1 Introduction
Fig. 1. A few examples from the subset of the MURA dataset used in this work, containing X-ray images of hands and demonstrating the large variety in image quality.
2.1 Preprocessing
[Figure 2 diagram: Min-Max Normalization → Padding & Centering (offline) → Augmentation (online) → Model]
Fig. 2. The full image preprocessing pipeline. Steps highlighted in green are performed
once and the result is stored to disk. Steps highlighted in orange are done on-the-fly.
Cropping The first step in our pipeline is to detect the X-ray image carrier
in the image. To this end, we apply OpenCV’s contour detection using Otsu
binarization [14], and retrieve the minimum size bounding box, which does not
need to be axis-aligned. This works sufficiently well as long as the majority of the
image carrier is within the image (cf. Figure 3). However, the approach might fail
for heavily tilted images or those where larger parts of the image carrier reach
beyond the image border.
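The following is a minimal sketch of this step with OpenCV (the function name and the largest-contour heuristic are our assumptions; the findContours signature below requires OpenCV ≥ 4):

```python
import cv2
import numpy as np

def detect_carrier_box(image: np.ndarray) -> np.ndarray:
    """Detect the X-ray image carrier and return the corner points of the
    minimum-area (possibly rotated) bounding box."""
    # Otsu's method [14] selects the binarization threshold automatically.
    _, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Contour detection on the binarized image (OpenCV >= 4 signature).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Heuristic: treat the largest contour as the image carrier.
    largest = max(contours, key=cv2.contourArea)
    rect = cv2.minAreaRect(largest)  # need not be axis-aligned
    return cv2.boxPoints(rect)       # four corner points as float32
```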
Fig. 3. Result of image carrier detection with OpenCV (left side). The first image shows the original image with a detected rectangle. Next to it is the extracted image. The right image shows the result of running object detection on an image containing two hands. We extract each hand separately such that our preprocessed dataset does not contain images with more than one hand.
Data Augmentation Due to GPU memory constraints, the images for BiGAN and α-GAN are resized to 128 pixels on the longer image side while maintaining the aspect ratio before applying the augmentation. For the auto-encoder models, this is not necessary. Afterwards, standard data augmentation methods (horizontal/vertical flipping, channel-wise multiplication, rotation, scaling) using the imgaug library are applied before finally padding the images to 512 × 512 (AE + DCGAN) or 128 × 128 (BiGAN + α-GAN) pixels.
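A short sketch of such an augmentation pipeline with imgaug (the parameter ranges are illustrative assumptions, not the exact settings used here):

```python
import numpy as np
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                 # horizontal flipping
    iaa.Flipud(0.5),                                 # vertical flipping
    iaa.Multiply((0.9, 1.1)),                        # channel-wise multiplication
    iaa.Affine(rotate=(-15, 15), scale=(0.9, 1.1)),  # rotation and scaling
])

batch = np.zeros((4, 128, 128, 1), dtype=np.uint8)   # dummy image batch (N, H, W, C)
augmented = augmenter(images=batch)
```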
2.2 Models
2.3 Autoencoders
For simplicity, we describe the general loss formulation using a vector input $x \in \mathbb{R}^n$ instead of a two-dimensional pixel matrix. In its simplest form, the reconstruction loss is given as the mean over pixel-wise squared differences. Let $x, \hat{x} = D(E(x)) \in \mathbb{R}^n$; then

$$L(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2 = \frac{1}{n} \lVert x - \hat{x} \rVert_2^2$$

Given a binary mask $m \in \{0, 1\}^n$, a masked variant restricts the loss to the selected pixels and normalizes by their number:

$$L_M(x, \hat{x}, m) = \frac{1}{\lVert m \rVert_1} \lVert m \odot (x - \hat{x}) \rVert_2^2$$
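A short PyTorch sketch of the masked loss (our own translation of the formula above, with flattened tensors):

```python
import torch

def masked_mse(x: torch.Tensor, x_hat: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Masked reconstruction loss L_M: squared error over the pixels selected
    by the binary mask m, normalized by the number of selected pixels."""
    diff = m * (x - x_hat)              # element-wise product with the mask
    return diff.pow(2).sum() / m.sum()  # divide by ||m||_1
```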
We aim to use auto-encoder architectures that are strong enough to successfully reconstruct normal hands, without risking learning identity mappings through overly wide bottlenecks. While the architecture should generalize over all normal hands, too strong a generalization might cause anomalies to be reconstructed sufficiently well, too.
2.4 GAN
BiGAN [4] / ALI [5] extends DCGAN by an encoder E, which encodes the real image into the latent space. The discriminator is now provided with both the real and the fake image as well as their latent codes, i.e., D((G(z), z), (x, E(x))).
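To make the pairing explicit, here is a schematic sketch of such a joint discriminator (the sub-networks are toy placeholders, not the architecture used in this work):

```python
import torch
from torch import nn

class JointDiscriminator(nn.Module):
    """Scores (image, latent code) pairs: (G(z), z) for fakes, (x, E(x)) for reals."""

    def __init__(self, image_dim: int = 64, code_dim: int = 100):
        super().__init__()
        self.image_net = nn.Linear(image_dim, 32)  # image feature extractor
        self.code_net = nn.Linear(code_dim, 32)    # latent-code feature extractor
        self.head = nn.Linear(64, 1)               # joint score over both branches

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        h = torch.cat([self.image_net(x), self.code_net(z)], dim=1)
        return self.head(h)
```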
Anomaly Detection Scores For the GAN models, we generally use the discriminator's output as the anomaly score. Once converged, the discriminator should be able to distinguish between images belonging to the data manifold, i.e., images of hands without any anomalies, and those lying outside of it, such as images containing anomalous regions. For the αGAN, we use the mean of the code-discriminator and discriminator probabilities.
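A sketch of this scoring; the sign convention, treating a low "real" probability as more anomalous, is our assumption:

```python
from typing import Optional

def anomaly_score(d_prob: float, c_prob: Optional[float] = None) -> float:
    """Anomaly score from the discriminator probability d_prob; for the
    αGAN variant, average with the code-discriminator probability c_prob."""
    p = d_prob if c_prob is None else 0.5 * (d_prob + c_prob)
    return 1.0 - p  # assumed convention: lower "real" probability => more anomalous
```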
3 Related Work
With the rapid advancement of deep learning methods, they have also found their way into medical imaging, cf. e.g. [11,18]. Despite the limited availability of labels in medical contexts, supervised methods make up the vast majority. Very likely, this is due to the easier trainability, but possibly also because the interpretability of the results has so far often been secondary. Sato et al. [21] use a 3D CAE for pathology detection in CT scans of brains. The CAE is trained solely on normal images, and at test time, the MSE between the image and its reconstruction is taken as the anomaly score. Uzunova et al. [24] use a VAE for medical 2D and 3D CT images. Similarly, they use the MSE reconstruction loss as the anomaly score. Besides the KL-divergence in latent space, they use an L1 reconstruction loss for training, which produces less smooth outputs. GANomaly [1] and its extension with skip connections use an AE and map the reconstructed input back to the latent space. The anomaly score is computed in latent space between the original and the reconstructed input. They apply their method to X-ray security imagery to detect anomalous items in baggage. Recently, there have been many publications using the currently popular GANs. For example, [13] uses a semi-supervised approach for anomaly detection in chest X-ray images. They replace the standard discriminator classification into real and fake with a three-way classification into real-normal, real-abnormal, and fake. While this allows training with fewer labels, it still requires them for training. Schlegl et al. [22] train a DCGAN on slices of OCT scans, where the original volume is cut along the x-z axis, and the slices are further randomly cropped. At test time, they use gradient descent to iteratively solve the inverse problem of obtaining a latent vector that produces the image. Stopping after a few iterations, the L1 distance between the generated image and the input image is considered as the residual loss. To summarize, the focus of recent work on anomaly detection lies either in applying existing methods to a new type of data or in adapting unsupervised methods for anomaly detection. Instead, we provide an extensive evaluation of state-of-the-art unsupervised learning approaches that can be directly used for anomaly detection. Furthermore, we evaluate the importance of different preprocessing steps and compare methods with regard to explainability.
4 Experiments
Fig. 4. Visualization of the applied data split scheme. "n" denotes patients who do not have an abnormal study ("negative"), "p" the contrary ("positive"). Notice that the training part of the split does not contain any images of anomalies, i.e., we do not use anomalous images for training.
We trained all models on a machine with one NVIDIA Tesla V100 GPU with 16 GiB of VRAM, 20 cores, and 360 GB of RAM. Following [16], we train our models from scratch and do not use transfer learning from large image classification datasets. We performed a manual hyper-parameter search on the validation set and selected the best-performing models per type with respect to the area under the receiver operating characteristic curve (ROC-AUC). We report the ROC-AUC on the test set.
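Model selection thus reduces to comparing scalar anomaly scores against the binary study labels; a minimal sketch with scikit-learn and dummy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-image anomaly scores and binary labels (1 = abnormal).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.20, 0.40, 0.35, 0.80])

print(roc_auc_score(y_true, scores))  # ROC-AUC; 0.5 corresponds to chance level
```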
Table 1. Quantitative results for all models. We report ROC-AUC on the test set for the best configuration regarding validation set ROC-AUC. All numbers are mean and standard deviation across four separate trainings with different random seeds. Columns correspond to the six preprocessing settings evaluated in this section: raw, cropped, and fully preprocessed data, each without and with histogram equalization (HE). For each model, we report results for various anomaly scores: Mean Squared Error (MSE), L1, Kullback–Leibler Divergence (KLD), Discriminator Probability (D). Top-200 denotes the case when only the 200 pixels with the highest error are taken into consideration.
Score             Raw           Raw + HE      Cropped       Cropped + HE  Preproc.      Preproc. + HE

CAE
MSE               .460 ± .033   .504 ± .034   .466 ± .022   .510 ± .021   .501 ± .013   .570 ± .019
MSE (top-200)     .466 ± .013   .448 ± .025   .486 ± .015   .473 ± .018   .506 ± .039   .553 ± .023

VAE
KLD               .488 ± .031   .491 ± .013   .470 ± .046   .496 ± .045   .520 ± .026   .533 ± .014
L1                .432 ± .033   .446 ± .016   .438 ± .033   .438 ± .016   .435 ± .014   .483 ± .009
L1 + KLD          .432 ± .033   .446 ± .016   .438 ± .034   .437 ± .016   .438 ± .011   .488 ± .011
L1 (top-200)      .438 ± .017   .472 ± .010   .440 ± .025   .471 ± .013   .428 ± .013   .481 ± .010
MSE               .432 ± .033   .446 ± .016   .438 ± .033   .438 ± .016   .435 ± .014   .483 ± .009
MSE + KLD         .432 ± .033   .446 ± .016   .438 ± .033   .438 ± .016   .436 ± .013   .486 ± .010
MSE (top-200)     .438 ± .017   .472 ± .010   .440 ± .025   .471 ± .013   .428 ± .013   .481 ± .010

DCGAN
Disc. (D)         .497 ± .018   .491 ± .041   .493 ± .015   .493 ± .025   .530 ± .027   .527 ± .022

BiGAN
MSE               .471 ± .021   -             .438 ± .039   -             .491 ± .042   .522 ± .017
MSE (top-200)     .471 ± .011   -             .459 ± .030   -             .475 ± .033   .508 ± .026
Disc. (D)         .508 ± .007   -             .534 ± .016   -             .549 ± .006   .522 ± .019

αGAN
Code-Disc. (C)    .500 ± .000   -             .500 ± .001   -             .500 ± .000   .500 ± .000
MSE               .476 ± .029   -             .466 ± .022   -             .442 ± .013   .528 ± .018
MSE (top-200)     .465 ± .031   -             .446 ± .018   -             .422 ± .016   .533 ± .013
Disc. (D)         .503 ± .022   -             .534 ± .022   -             .607 ± .016   .584 ± .012
C + D             .503 ± .022   -             .534 ± .022   -             .607 ± .016   .584 ± .012
Apart from the performance of single models, we also evaluate the importance of the preprocessing steps. To this end, we evaluate the models on the raw data, the data after cropping the hand regions, as well as on the fully preprocessed data. We also vary whether histogram equalization is applied before the augmentation or not. We summarize the quantitative results in Table 1, showing the mean and standard deviation across four runs. There is a clear trend in preprocessing: All models have their best runs in the fully preprocessed setting, emphasizing the importance of our preprocessing pipeline for noisy datasets. Interestingly, without foreground segmentation, i.e., only by cropping the single hands, the results appear to be worse than on the raw data. While histogram equalization is a contrast enhancement method particularly useful for improving human perception
5 Conclusion
Fig. 5. Example heatmaps of the reconstruction error of the CAE. The left image pair shows a hand from a study labeled as normal. Here, we can see that the reconstruction error is relatively widespread. The right image pair shows an abnormal hand, where the abnormality is clearly highlighted.
closely match the anomalous regions. As future work, we envision the extension to broader datasets such as the full MURA dataset, as well as obtaining pixel-level anomaly scores for the GAN-based models. To this end, methods from the field of explainable AI, such as Grad-CAM [23] or LRP [2], can be applied to the discriminator to obtain heatmaps similar to those of the AE models. Moreover, we see potential for different model architectures tailored more closely to the specific problem and data type, as well as the possibility of building an ensemble model using the different ways of extracting anomaly scores from single models, or even across different model types.
Acknowledgement
We would like to thank Franz Pfister and Rami Eisaway from deepc (www.deepc.ai) for access to the data and support in understanding the use case. Part of this work has been conducted during a practical course at Ludwig-Maximilians-Universität München funded by Z.DB. This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
References
1. S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon. Ganomaly: Semi-supervised
anomaly detection via adversarial training. In C. V. Jawahar, H. Li, G. Mori, and
K. Schindler, editors, Computer Vision – ACCV 2018, pages 622–637, Cham, 2019.
Springer International Publishing.
2. S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek.
On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise
Relevance Propagation. PLOS ONE, 10(7):1–46, 07 2015.
3. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
4. J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial Feature Learning. In 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France,
April 24-26, 2017, Conference Track Proceedings, 2017.
5. V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and
A. C. Courville. Adversarially Learned Inference. In 5th International Confer-
ence on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings, 2017.
6. X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. In
Proceedings of the Fourteenth International Conference on Artificial Intelligence
and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pages
315–323, 2011.
7. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative Adversarial Nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
8. S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift. In Proceedings of the 32nd International
Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages
448–456, 2015.
Supplementary Material
Fig. 8. Schematic overview of DCGAN and example comparison of real (above) and
fake (below) images. The generator G(z) generates an image from input noise z. The
discriminator distinguishes between real images and generated ones. More details in
the text.
Fig. 9. Schematic overview of BiGAN / ALI and examples of real (above) and fake (below) images. The generator G(z) generates an image from input noise z. The encoder E(x) encodes the real image. The discriminator distinguishes between real and generated images, additionally provided with the noise vector z and the encoding E(x).
.2 Architecture Details
CAE

encoder
kernel size   output shape
(3, 3)        (512, 512, 16)
(4, 4)        (256, 256, 32)
(3, 3)        (256, 256, 32)
(4, 4)        (128, 128, 64)
(3, 3)        (128, 128, 64)
(4, 4)        (64, 64, 128)
(3, 3)        (64, 64, 128)
(4, 4)        (32, 32, 256)
(3, 3)        (32, 32, 256)
(4, 4)        (16, 16, 512)

decoder
kernel size   output shape
(4, 4)        (32, 32, 256)
(4, 4)        (64, 64, 128)
(4, 4)        (128, 128, 64)
(4, 4)        (256, 256, 32)
(4, 4)        (512, 512, 16)
(3, 3)        (512, 512, 1)

VAE

encoder
kernel size   output shape
(4, 4)        (255, 255, 8)
(4, 4)        (126, 126, 16)
(4, 4)        (62, 62, 32)
(4, 4)        (30, 30, 64)
(4, 4)        (14, 14, 128)
(4, 4)        (6, 6, 256)
(4, 4)        (2, 2, 512)

bottleneck
reshape: z                       (2 · 2 · 512)
µ = FC(z)                        (1024,)
σ = FC(z)                        (1024,)
z′ = µ + σ ⊙ ε, ε ∼ N(0, I)      (1024,)
z″ = FC(z′)                      (2 · 2 · 512)
reshape                          (2, 2, 512)

decoder
kernel size   output shape
(4, 4)        (6, 6, 256)
(4, 4)        (14, 14, 128)
(4, 4)        (30, 30, 64)
(4, 4)        (62, 62, 32)
(4, 4)        (126, 126, 16)
(4, 4)        (254, 254, 8)
(6, 6)        (512, 512, 1)
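The bottleneck's sampling step z′ = µ + σ ⊙ ε corresponds to the standard VAE reparameterization trick; a minimal PyTorch sketch:

```python
import torch

def reparameterize(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Sample z' = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable w.r.t. mu and sigma."""
    eps = torch.randn_like(sigma)
    return mu + sigma * eps
```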
.3 Data Augmentation
We use two different augmentation strategies, named default (used for GANs
and VAE) and advanced (used for CAE and BAE).
Default
Advanced
.4 Training Details
– For all models, we train four variants with the random seeds 42, 4242, 424242, and 42424242.
CAE
– Batch Size: 32
– Image Resolution: 512 × 512
– 1,000 epochs
– Batch Normalization
– Learning rate: 0.0001
– Adam optimizer
BAE
– Batch Size: 32
– Image Resolution: 512 × 512
– 500 epochs
– Batch Normalization
– Learning rate: 0.0001
– Adam optimizer
VAE
– Batch Size: 32
– Image Resolution: 512 × 512
– 500 epochs
– Batch Normalization
– zdim = 2, 048, hdim = 18, 432
– Learning rate: 0.0001
– Adam optimizer
DCGAN
– Batch Size: 80
– Image Resolution: 512 × 512
– 500 epochs
– No Batch Normalization
– Spectral Normalization
– Soft Labels
– Generator Learning Rate: 0.001
– Discriminator Learning Rate: 0.00001
– Soft Delta: 0.01 (see the sketch after this list)
– zdim = 2, 048
– As we observed mode collapse, we added minibatch discrimination [20]
– Adam optimizer
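A sketch of this optimizer and label setup in PyTorch; the networks are toy placeholders, and our reading of "Soft Delta" as a label-smoothing offset is an assumption:

```python
import torch
from torch import nn

# Toy placeholder networks; the real DCGAN architecture is not reproduced here.
generator = nn.Linear(2048, 64)       # z_dim = 2,048 (output dim is arbitrary here)
discriminator = nn.Linear(64, 1)

# Separate Adam optimizers with the asymmetric learning rates listed above.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)      # generator: 0.001
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-5)  # discriminator: 0.00001

# Soft labels: shift the hard 1/0 targets by delta = 0.01
# (assumed interpretation of the "Soft Delta" hyper-parameter).
delta = 0.01
real_targets = torch.full((80, 1), 1.0 - delta)  # batch size 80
fake_targets = torch.full((80, 1), delta)
```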
BiGAN
– Batch Size: 16
– Image Resolution: 128 × 128
– 500 epochs
– Generator & Encoder Learning Rate: 0.001
– Discriminator Learning Rate: 0.000005
– Adversarial Loss: Hinge Loss
– zdim = 100
– Adam optimizer
α-GAN
– Batch Size: 16
– Image Resolution: 128 × 128
– 500 epochs
– Generator & Encoder Learning Rate: 0.001
– Discriminator & Code-Discriminator Learning Rate: 0.000005
– Adversarial Loss: Hinge Loss
– zdim = 100
– As we observed mode collapse, we added minibatch discrimination [20]
– Adam optimizer