Exploring Normalizing Flow for Anomaly Detection

by

Chinmay Pathak (C. A. Pathak)

to obtain the degree of Master of Science
at the Delft University of Technology,
to be defended publicly on Thursday, August 29, 2019 at 12:00 PM.
Acknowledgements
First of all, I would like to thank Dr. Jan van Gemert for providing me the opportunity to work on this
thesis. I am grateful for his continuous support and guidance throughout this thesis. His inputs were
invaluable for my work. I truly enjoyed working with him.
I would like to thank Miriam Huijser from Aiir Innovations for providing all the necessary data and guidance. Your inputs throughout the work and feedback on the report helped in a big way.
I would like to thank my parents and my sister for always believing in me and supporting me. I would also like to thank my friends Maneesh, Palash, Yashavi, Rajesh, Devendra and Sneha for making my time off work fun and relaxing, which helped me to focus on my work better. Special thanks to Shreyas Nikte for providing a place to work from for the whole duration of this work.
Last but not least, I would like to thank all my friends at the crow's nest for providing me company during lunch and coffee breaks. Time flew by much faster because of you guys.
C. A. Pathak
Delft, August 2019
Contents
1 Scientific Paper 1
2 Introduction 19
2.1 Introduction 19
2.1.1 Motivation 20
2.1.2 Research Question 20
3 General Background on Deep Learning 21
3.1 Neural Networks 21
3.2 Activation Functions 22
3.3 Training a Neural Network 22
3.4 Convolutional Neural Networks 23
3.4.1 Convolutional Layer 23
3.4.2 Pooling Layer 24
4 Unsupervised Deep Learning Models 27
4.1 Autoencoders 27
4.2 Generative Models 28
4.2.1 Variational Autoencoders (VAE) 28
4.2.2 Generative Adversarial Networks 29
4.2.3 Normalizing Flows 29
Bibliography 31
1
Scientific Paper
We show that the black pixels adversely affect the likelihood values of the flow-based model. 2) We show that the convolutional autoencoder is unable to detect small anomalies in the images, and thus establish the need for an approach different from the reconstruction-based approach for anomaly detection. 3) We show that the frequency content of the image does play a role in the case of the normalizing flow-based model and the convolutional autoencoder.

2. Related Work

Anomaly detection (AD) has long been a topic of interest in various fields [9]. The importance of understanding the normal behaviour of a system has always been a point of interest. Traditional machine learning techniques such as Support Vector Machines (SVM) [50] and k-nearest neighbors (KNN) have shown great success at outlier detection tasks [7, 40]. However, as datasets have grown larger and more complex, these traditional methods are no longer the state of the art when it comes to the task of anomaly detection, as they do not scale well [8], especially for high-dimensional complex data such as images. With the rise of deep learning in the last decade, many computer vision problems have seen a huge boost in performance and accuracy [5]. Tasks such as object detection, recognition, and segmentation are solved with the help of convolutional neural networks (CNNs) with state-of-the-art results compared to traditional machine learning, because of the automatic feature extraction capacity of neural networks [8, 38].

Traditional methods for anomaly detection

Traditional machine learning approaches for anomaly detection include methods such as one-class classification using SVM [41], where a radial basis function (RBF) kernel is used to learn a region that contains the data instances. At test time, anomalous samples are identified only if they fall outside the learned region. A variation of this method, Support Vector Data Description (SVDD) [48], defines the smallest hypersphere in the latent space describing the training samples, and anomalies are identified if they fall outside the defined hypersphere.

KNN is an unsupervised approach which is traditionally used for anomaly detection [4, 6, 18, 54]. In this approach, an anomaly score is determined using the sum of the distances from the test instance to its k nearest neighbors. In other work, V. Škvára et al. [47] show the applicability of KNN for anomaly detection tasks in comparison to deep generative models such as Generative Adversarial Networks (GANs) [21], Variational Autoencoders (VAEs) [26] and Autoencoders (AEs) [23]. However, this comparison was not done on image datasets, while in our problem setting we use image data only, where deep learning methods have a clear upper hand [8]. Along with that, these traditional methods also require explicit feature engineering [39], which is not required when working with deep neural networks, and for an approach involving SVM, the support vectors need to be stored for class prediction in classification tasks, which introduces memory constraints [43]. All this makes deep neural nets a better choice for larger and more complex datasets.

Deep methods for anomaly detection

There are various approaches based on deep learning for anomaly detection. Based on the nature of the input data and the availability of labels, these are classified as supervised, semi-supervised, or unsupervised. In supervised deep anomaly detection, the problem is predominantly posed as a binary classification problem where all the anomaly samples are combined into one class and the normal samples into another [11]. In a different setting, a "none of the above" category can also be appended to the classification model. Methods such as DefectNet [3], anomaly detection via image resynthesis [29], and detection of manufacturing defects using CNNs and transfer learning [19] are some of the examples which use a supervised approach to detect anomalies. In [19] the authors make use of transfer learning with the ImageNet [13] and COCO [28] datasets, and 4 CNN modules, to overcome the problem of few samples and reach a mean average precision of 0.957 on the GDXray [33] dataset of only 2800 samples, at the cost of high memory use and longer training time. The biggest drawback of this approach is that the model requires the distribution of anomalies to be known before training. This poses a big problem, as one of the main challenges while working on anomaly detection is that the anomalies are rare and not known a priori [8].
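The kNN anomaly score mentioned above (the sum of distances from a test instance to its k nearest training neighbors) can be sketched in a few lines. This is a generic illustration on synthetic 2-D data, not the setup of any of the cited works; the function name and the choice k = 5 are arbitrary.

```python
import numpy as np

def knn_anomaly_scores(train_x, test_x, k=5):
    """Anomaly score = sum of distances to the k nearest training points."""
    # pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=-1)
    # k smallest distances per test point, summed into a single score
    return np.sort(d, axis=1)[:, :k].sum(axis=1)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))    # training data: one "normal" cluster
inliers = rng.normal(0.0, 1.0, size=(10, 2))    # test points from the same cluster
outliers = rng.normal(8.0, 1.0, size=(10, 2))   # test points far from the cluster

scores = knn_anomaly_scores(normal, np.vstack([inliers, outliers]))
print(scores[:10].mean(), scores[10:].mean())   # outliers get much larger scores
```

Note that every score is computed against the entire training set, which is exactly the kind of memory cost that makes such methods awkward at scale.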
The semi-supervised AD methods assume that all the training samples have only normal class labels, and work on the assumption that points which are close to each other both in input and latent space share the same label. This method can be implemented using any of the models such as autoencoders, generative adversarial networks, or CNNs [8]. GAN-based methods such as GANomaly [1] and Skip-GANomaly [2] produce good results with a semi-supervised approach. However, Lu et al. [30] put forward a fundamental limitation: unless the said assumption about the relation between the labelled and unlabeled data distributions holds, semi-supervised methods cannot provide any significant benefits over supervised learning. This applies to deep neural networks as well [8].

The main challenge while working with anomaly detection is the lack of samples from the anomalous class, because of which supervised and semi-supervised methods struggle. Unsupervised methods are thus the most widely applicable approach for AD, as they detect anomalies based solely on the intrinsic properties of the data [8]. There are different unsupervised methods used in AD, such as deep autoencoders, GANs, VAEs, auto-regressive models [36, 37, 44, 49] and normalizing flow-based models [14, 15, 24, 25].

One use case of unsupervised methods is density estimation. A generative model pθ(x) trained on some data distribution p(x) ideally should assign a high likelihood to samples from the same distribution, as the model learns the joint probability distribution of the given data. Thus, when any out-of-distribution (OOD) sample y from q(y) is fed to a density estimation model, the model assigns it a low likelihood. This property of generative models is very useful for a problem setting such as anomaly detection.

Deep autoencoders [23] are the most common and fundamental unsupervised deep models used for AD, consisting of an encoder-decoder architecture [56, 55, 32, 22]. This model works as a dimensionality reduction method which learns the common variations from the training data. VAEs, on the other hand, optimize the variational lower bound on the likelihood of the data. The samples are encoded in such a way that the data may be generated from a simple Gaussian prior, using a similar encoder-decoder architecture. The difference between the input and the output generated by the decoder is called the reconstruction error. For normal samples, this error ideally is zero, as the model has learned the representation of the normal class and can thus reconstruct it, while for anomalous samples it should be higher, as the model does not know the distribution of anomaly samples and as a result has difficulty reconstructing them. Based on this, when a model is not able to reconstruct the input properly for a test sample, that sample is identified as an anomaly.

The same reconstruction-based approach is also used with GANs in methods such as AnoGAN [46] and efficient anomaly detection with GANs [53]. GANs have shown a greater ability to generate images with better fidelity to the training distribution, hence there is a lot of interest in developing reconstruction-based approaches using GANs. In AnoGAN [46], first a GAN is trained to create samples similar to training (normal) instances. At test time, the best possible generated image matching the test sample is found iteratively. For a normal test instance, this should result in a lower anomaly score, which is a simple L2 distance between the test input sample and the reconstructed sample, while for an anomalous sample the anomaly score will be high. This has the disadvantage that, for every test sample, the best matching image needs to be generated, which makes it slow.

The main difficulty regarding the reconstruction-based encoder-decoder models is the degree of compression, which works as another hyperparameter that needs to be manually tuned to get the best results, because of the unsupervised nature of the problem [43]. Along with that, the approximation in the inference in VAEs limits their ability to learn high-dimensional deep representations. In GANs, sidestepping the likelihood objective in training altogether makes them difficult to train, and there is also no encoder in GANs that maps the input to the latent variables directly. None of these models provide a way to tractably calculate the exact log-likelihood of a data point.

There is a recent development in invertible generative models [15, 24, 25, 44, 49] which enables the calculation of the exact log-likelihood of the input distribution using the change of variables formula. These models can be classified into autoregressive flows and normalizing flows. Autoregressive flows such as PixelCNN and PixelCNN++ [44, 49] have the advantage of
Figure 3: Negative log-likelihoods for three cases (0.10%, 20% and 86% anomalies) from Table 1. Lines represent the trend of the absolute negative log-likelihood of the test samples. The lower three plots show the negative log-likelihoods (NLL) of the test samples from the normal class (digit 1). The upper three plots show the NLL of the test samples from the anomaly class (digit 8), with fractions of anomalies of 0.10%, 20% and 86%. As the fraction of anomalies in the training data increases with each case, the trend lines show that the NLL of the anomaly-class (digit 8) test samples improved, while the normal-class test samples' NLL remained in the same range, as the number of samples for that class remained constant. We can say that the model is able to focus on the normal data and does not get affected by "noise" in the training set.
Figure 4: Negative log-likelihood plot when the normal and anomalous classes are reversed. Red represents the negative log-likelihood of the anomaly class (digit 1) and blue represents the negative log-likelihood of the normal class (digit 8).

Table 2: Different cases used to investigate the problem that causes the model to produce better log-likelihoods for the anomalous samples. The table shows pairs of object classes from Fashion-MNIST used as normal and anomaly class. The model is trained using the samples from the normal class only. Columns 3 and 5 show the average number of black pixels for the image samples from that class.

the one which is used for training has a lower number of black pixels than the test anomaly class. Thus, we can assume that the problem arises due to the presence of black pixels. As they are easy to learn, the model will produce a better likelihood if there is a higher number of black pixels.
To validate that the problem is indeed caused by the black pixels, another small experiment was carried out. We cropped the image samples of the Fashion-MNIST dataset for the classes shirts and coats by keeping a tight boundary around the image sample and cropping out the black pixels' area as much as possible. We resized the cropped images back to the shape of 28×28. We performed the same test as in case 4 from Table 2 and found out
be reconstructed better or will get better negative log-likelihood values compared to the high-frequency images.
This means that the generative models used in this experiment, i.e., 1. the fully connected autoencoder, 2. the convolutional autoencoder (CAE), and 3. GLOW, will be able to reconstruct the images with low frequencies with a lower reconstruction error in case of the AE and CAE, or will get better negative log-likelihood values in case of GLOW, compared to the high-frequency images. To test this hypothesis, we created a small dataset explained as follows.

Toy Frequency dataset

Figure 9 shows the samples from both classes, low and high. The low-frequency class has frequencies ranging from one to five, whereas frequencies in the range 11 to 15 belong to the high-frequency class. To make sure that there are enough samples in the dataset, variations in the intensity are used as a form of data augmentation, as well as a 90-degree rotation. As a result, both the low- and high-frequency classes have 600 images each, with 120 images for each frequency.

The fully connected autoencoder uses sigmoid activations. The convolutional autoencoder and GLOW architectures are the same as in experiment 2. Figure 10 shows the distribution of the error values, i.e., the reconstruction error for the fully connected and convolutional autoencoders and the negative log-likelihood for GLOW, in a scatter plot, for all the frequencies separately. From Figure 10(a) we see that frequency has no effect on the fully connected autoencoder, as all the frequencies were reconstructed in a similar range of reconstruction error values. However, from Figure 11 we can observe that the intensity of the pixels played a part in the ability of the model to reconstruct the test samples. For every frequency, we feed the model the samples in ascending order of intensity, and we observe that with the increase in intensity, irrespective of the frequency, the reconstruction error for the test samples became higher.

In Figure 10(b) we see that frequencies 14 and 15 affected the reconstruction ability of the CAE significantly. Similarly, in case of Figure 10(c), the frequency 14 and 15 test samples produced negative log-likelihoods significantly higher than the rest. However, in both Figure 10(b) and (c) we can see an upward trend of the error values, indicating that the frequencies do play a role in the ability of the models to either reconstruct or predict likelihood.
Thus, we can conclude that the fully connected autoencoder was not susceptible at all to the higher frequencies but clearly gets affected by the intensities of the pixels in the image, whereas the CAE and GLOW did not show any such effect from the intensities but showed a bias towards the low frequencies in the images.
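As a rough illustration of how such a toy frequency dataset can be constructed, the sketch below generates sinusoidal gratings for the two classes (frequencies 1–5 and 11–15), with intensity variation and a 90-degree rotation as augmentation, giving 120 images per frequency and 600 per class. The exact image-generation procedure is not specified in this section, so the grating formula and intensity range here are assumptions.

```python
import numpy as np

def make_grating(freq, intensity, size=28, vertical=False):
    """One sinusoidal grating with `freq` cycles across a size x size image."""
    t = np.linspace(0.0, 2.0 * np.pi * freq, size)
    row = intensity * (0.5 + 0.5 * np.sin(t))   # one scanline, values in [0, intensity]
    img = np.tile(row, (size, 1))               # repeat the scanline -> striped image
    return img.T if vertical else img           # optional 90-degree rotation

def make_class(freqs, per_freq=120, size=28, seed=0):
    rng = np.random.default_rng(seed)
    images = []
    for f in freqs:
        for i in range(per_freq):
            intensity = rng.uniform(0.2, 1.0)   # intensity variation as augmentation
            images.append(make_grating(f, intensity, size, vertical=(i % 2 == 0)))
    return np.stack(images)

low = make_class(range(1, 6))      # low-frequency class: frequencies 1..5
high = make_class(range(11, 16))   # high-frequency class: frequencies 11..15
print(low.shape, high.shape)       # both (600, 28, 28)
```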
ing easy to learn cause a problem when predicting the likelihood of the data sample. Sometimes the background of a blade is a black shadow, and that gets exposed because of the anomaly in the blade. As a result, what the model sees is the black shadow and not the actual blade, and thus the model does not identify that as an anomaly.

Figure 14: Left column: test input samples containing anomalies. Right column: corresponding generated heatmap. Parts of the blades having anomalies result in a darker red patch in the corresponding heatmap; the darker the red patch, the higher the negative log-likelihood. This figure shows some of the failed cases. Reflections from the shiny surface of the blades, the area of the anomalous region under the patch, black pixels, and the texture of the blade played a role in the ability of the model to identify the correct anomalies.

One important thing to note here is that because the images are investigated in patches and there is no information sharing between nearby patches while calculating the likelihood, bigger anomalies might go undetected or only partially detected, as the model does not have a global context understanding. Other than that, the model does perform well in detecting small anomalies because of the patch-based approach. The middle image in Figure 14 and the first image in Figure 13 are perfect examples of that.

6. Conclusion

This work explores the problem of anomaly detection with the normalizing flow-based model. In this work we first verified the approach on toy datasets with the help of small experiments. We were able to show that the flow-based model is able to detect the acute anomalies present in the images, which the convolutional autoencoder did not. This establishes the need for an approach other than the reconstruction-based methods. We believe through this work that the ability of the normalizing flow-based model to do exact log-likelihood evaluation could provide that alternative approach. We showed that black pixels cause the flow-based models to give a better negative log-likelihood value. We also showed that the convolutional autoencoder and GLOW show a bias towards the low frequencies in the image, whereas the fully connected autoencoder does not. However, the fully connected autoencoder is affected by the intensities of the pixels in the image.

In case of the more complex BoroscopeV1 dataset, we showed promising results, where the normalizing flow-based model was able to detect the anomalies in the image precisely. However, in some cases it failed to differentiate between the normal and anomalous samples; we speculate it could be either because of the reflections from the blades or the black pixels. Also, because of the patch-based approach, some anomalies which landed on the border of the patches could not be detected.

Recommendations

Anomaly detection is not as easy a problem as it seems at first glance. The unavailability of anomalous data samples is the biggest challenge while working on a problem like this. In this work we explored
[16] H. Dutta, C. Giannella, K. Borne, and H. Kargupta. Distributed top-k outlier detection from astronomy catalogs using the DEMAC system. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 473–478. SIAM, 2007.
[17] E. Eskin. Anomaly detection over noisy data using learned probability distributions. 2000.
[18] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, pages 77–101. Springer, 2002.
[19] M. K. Ferguson, A. Ronay, Y.-T. T. Lee, and K. H. Law. Detection and segmentation of manufacturing defects with convolutional neural networks and transfer learning. Smart and Sustainable Manufacturing Systems, 2, 2018.
[20] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[22] S. Hawkins, H. He, G. Williams, and R. Baxter. Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, pages 170–180. Springer, 2002.
[23] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[24] E. Hoogeboom, J. W. Peters, R. v. d. Berg, and M. Welling. Integer discrete flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
[25] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
[26] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55, 2014.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[29] K. Lis, K. Nakka, M. Salzmann, and P. Fua. Detecting the unexpected via image resynthesis. arXiv preprint arXiv:1904.07595, 2019.
[30] T. T. Lu. Fundamental limitations of semi-supervised learning. Master's thesis, University of Waterloo, 2009.
[31] Y. Lu and B. Huang. Structured output learning with conditional generative flows. arXiv, abs/1905.13288, 2019.
[32] R. Mehrotra, A. H. Awadallah, M. Shokouhi, E. Yilmaz, I. Zitouni, A. El Kholy, and M. Khabsa. Deep sequential models for task satisfaction prediction. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, pages 737–746. ACM, 2017.
[33] D. Mery, V. Riffo, U. Zscherpel, G. Mondragón, I. Lillo, I. Zuccar, H. Lobel, and M. Carrasco. GDXray: The database of X-ray images for nondestructive testing. Journal of Nondestructive Evaluation, 34(4):42, 2015.
[34] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.
[35] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[36] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[37] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[38] P. Oza and V. M. Patel. One-class convolutional neural network. IEEE Signal Processing Letters, 26(2):277–281, 2018.
[39] M. Pal and G. M. Foody. Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5):2297–2307, 2010.
[40] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In ACM SIGMOD Record, volume 29, pages 427–438. ACM, 2000.
[41] G. Rätsch, S. Mika, B. Schölkopf, and K.-R. Müller. Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis & Machine Intelligence, (9):1184–1199, 2002.
[42] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. arXiv preprint arXiv:1906.02845, 2019.
[43] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In International Conference on Machine Learning, pages 4393–4402, 2018.
[44] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[45] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
[46] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pages 146–157. Springer, 2017.
[47] V. Škvára, T. Pevný, and V. Šmídl. Are generative deep models for novelty detection truly better? arXiv preprint arXiv:1807.05027, 2018.
[48] D. M. Tax and R. P. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
[49] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[50] V. Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[51] L. Weng. Use of deep learning features in log-linear models. Log-Linear Models, Extensions, and Applications, 2018.
[52] M. Yamaguchi, Y. Koizumi, and N. Harada. AdaFlow: Domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3647–3651. IEEE, 2019.
[53] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar. Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222, 2018.
[54] J. Zhang and H. Wang. Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowledge and Information Systems, 10(3):333–355, 2006.
[55] D. Zhao, B. Guo, J. Wu, W. Ning, and Y. Yan. Robust feature learning by improved auto-encoder from non-Gaussian noised images. In 2015 IEEE International Conference on Imaging Systems and Techniques (IST), pages 1–5. IEEE, 2015.
[56] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. 2018.
2
Introduction
2.1. Introduction
Determining which instances stand out as dissimilar compared to others when analyzing real-world datasets has become a common need. Such instances are known as anomalies. The domain of this project is to detect anomalies in jet engine blade images. These anomalies occur mainly due to different types of wear and tear on the blades, and range in size from small scratches or dents to a large deformation of the blade. A new dataset called BoroscopeV1, consisting of images of jet engine blades (Figure 2.1), was created for this purpose.
Figure 2.1: Sample images showing variations in the anomalies from the BoroscopeV1 dataset
Jet engine blades sometimes get scratched or deformed during operation. A malfunctioning blade can cause an engine failure, which could prove life-threatening to the passengers travelling in the aircraft, so detecting such damage during inspection is important. Thus, accurate detection of these anomalies, occurring in the form of wear and tear, in the image of a jet engine blade is extremely important. As the whole operation is done on images, this can be modelled as a computer vision problem.
Over the past decade, deep learning techniques have seen a significant rise in real-world applications, specifically in the field of computer vision, owing to rising computational power, increased storage and extensive methods of data collection. Convolutional neural networks have shown great applicability when it comes to tasks such as image classification, object detection and segmentation, among others. Detecting features and classifying an image based on them is a computer vision problem. Feature representation of an image can be done in many ways. One could use human annotations to describe each image and then classify based on that, but this is not a practical and feasible approach. Traditional machine learning approaches expect the features to be fed directly to the classifier instead of learning them from the given data. A neural network, by contrast, can generate its own hidden features, which can be effectively used for detection and classification tasks.
In supervised learning, all the labelled training examples are fed into the model, from which the model learns the representation of all the classes of the data. This is certainly a desirable setup, as one can optimize for the patterns and structures based on the labels. The BoroscopeV1 dataset, however, has a very uneven distribution of data samples, with a high percentage of images with no defects, referred to as 'healthy' or 'normal' samples, and a very small percentage of images with defects, referred to as 'unhealthy' or 'anomalous' samples. As the data distribution is so uneven, a supervised model will not see enough images from the unhealthy class during training to learn its representation, and consequently will not be able to detect the anomalies accurately, as a discriminative model such as a supervised classifier tends to bias its predictions towards the class with more samples. Thus, instead of learning to predict an output from the input data, an approach which learns the inherent structure of the input data will be much more useful in this kind of problem setting. Unsupervised learning does exactly that.
An uneven distribution of the training data with a high number of normal samples is a classic setting for an anomaly detection problem. Unsupervised representation learning has become very dominant in the task of anomaly detection in recent years, with the rise of variational autoencoders, generative adversarial networks, autoregressive and normalizing flow-based models. Anomaly detection, a well-known sub-domain of unsupervised learning, is a challenging task because of the high-dimensional structure of images. This work presents a normalizing flow-based approach for detecting the anomalies.
2.1.1. Motivation
Reconstruction with a convolutional autoencoder is the most widely used approach for anomaly detection on images. However, convolutional autoencoders are at a disadvantage when detecting small defects in an image, as they tend to generalize over such small regions, which results in lower reconstruction error values. In addition, choosing the right degree of compression is a cause of problems: it acts as a hyperparameter that must be tuned manually, and choosing the right value is hard due to the unsupervised nature of the problem setting. It is therefore important to look for an approach that learns the underlying structure of the data accurately enough that we can predict the likelihood of test samples. For this reason, a normalizing flow-based approach is used in this work.
Flow-based generative models allow exact latent variable inference and exact log-likelihood evaluation, whereas in variational autoencoders it is only possible to approximate the value of the latent variables corresponding to a datapoint, and GANs have no encoder at all with which to infer the latents.
2.1.2. Research Question
• What is the effect of low- and high-frequency images on the ability of the model to predict the likelihood?
3
General Background on Deep Learning
This chapter provides the general theoretical background on neural networks needed for clear insight into this work. We start with a basic overview of what neural networks are and how they work, followed by a detailed explanation of how convolutional neural networks (CNNs) are used in classification. We also look at the different types of convolution operations used in the unsupervised models discussed in the next chapter and in the related work section of the scientific paper in Chapter 1.
𝑢 = ∑ᵢ 𝑤ᵢ𝑥ᵢ + 𝑏 (3.1)
𝑦 = 𝑓(𝑢) (3.2)
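Equations (3.1) and (3.2) can be written out directly; the input, weight and bias values below are arbitrary illustrative numbers:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    # u = sum_i w_i * x_i + b   (Eq. 3.1)
    u = np.dot(w, x) + b
    # y = f(u)                  (Eq. 3.2)
    return f(u)

x = np.array([1.0, 2.0])      # inputs x_1, x_2
w = np.array([0.5, -0.25])    # weights w_1, w_2
y = neuron(x, w, b=0.1)
print(y)                      # tanh(0.5*1 - 0.25*2 + 0.1) = tanh(0.1)
```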
Deep learning is a subset of neural networks in which multiple layers are stacked on top of each other to create a hierarchy between the input and the output layer.
Figure 3.1: Mathematical model of a single neuron in an artificial neural network. A neuron consists of inputs 𝑥₁, ..., 𝑥ₙ and weights 𝑤₁, ..., 𝑤ₙ, a bias 𝑏 and an activation function 𝑓 [6]
Any model that uses more than two layers is referred to as a deep model. Figure 3.2 illustrates a simple three-layer neural network. Each layer consists of several neurons, and every connection between neurons exchanges information with the help of weights and an activation function. These weights are trained
with the help of backpropagation, using different objective functions depending on the task. The neural network in Figure 3.2 is a fully connected neural network, where every neuron in the previous layer is connected to every neuron in the next layer. The neurons respond to different combinations of inputs from the previous layer, and neurons within a layer share no connections.
Figure 3.2: A simple three-layer fully connected network with two hidden layers and an output layer
Figure 3.3: Sigmoid (a) squashes real numbers to the range [0,1], whereas tanh (b) maps real numbers to the range [-1,1]. ReLU (c) is zero when x < 0 and linear with slope 1 when x > 0. [6]
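The three activation functions in Figure 3.3 can be evaluated directly to verify their ranges:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # zero for x < 0, identity for x > 0

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # values between 0 and 1
print(np.tanh(x))   # values between -1 and 1
print(relu(x))      # [0. 0. 2.]
```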
A neural network is trained using the traditional optimization method of gradient descent, optimizing the parameters based on their gradients. In deep learning, due to the high number of parameters and the complexity of the models, it is computationally inefficient to calculate all the gradients directly; backpropagation [9] is used to solve this problem. There are different techniques that facilitate the training of deep, complex neural networks with the help of backpropagation: Stochastic Gradient Descent (SGD) [1], Adagrad [4] and Adam [7] are a few frequently used examples.
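A single gradient descent update follows the rule w ← w − η · ∂L/∂w. A minimal sketch on the toy loss L(w) = (w − 3)², whose gradient is 2(w − 3):

```python
# Minimize L(w) = (w - 3)^2 with plain gradient descent.
w = 0.0          # initial parameter
lr = 0.1         # learning rate (eta)
for _ in range(100):
    grad = 2.0 * (w - 3.0)   # dL/dw at the current w
    w = w - lr * grad        # gradient descent update step
print(w)  # converges towards the minimum at w = 3
```

SGD differs from this sketch only in that each gradient is computed on a small random mini-batch of training samples rather than on the full loss.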
Figure 3.4: A typical block diagram of a CNN, consisting of multiple convolutional layers with ReLU activation functions, pooling layers and a fully connected layer at the end with the softmax predictions. [13]
It is important to discuss two special types of convolutions that are used in the models discussed in Chapter 1.
Transposed Convolution
Transposed convolution is also known as deconvolution. As can be seen from Figure 3.4, a convolution operation shrinks the input volume. However, in many cases, such as generating high-resolution images, semantic segmentation, or the decoder part of an autoencoder, we need to perform up-sampling. Traditionally, up-sampling is done with different interpolation schemes; modern neural network architectures, however, tend to learn this transformation.
A transposed convolution, however, does not exactly invert the previous convolution operation performed on the image. Suppose a 5x5 input undergoes a convolution operation that produces a 2x2 feature map. A transposed convolution on that 2x2 feature map carries out a regular convolution operation, but reverses its spatial transformation, so that the output feature map has the same size as what we started with (5x5 in this case). In other words, it reconstructs the
Figure 3.6: Left shows a normal convolution operation on a 5x5 feature map resulting in a 2x2 output. Right shows the transposed convolution on a 2x2 feature map generating an output of 5x5
spatial resolution from before and performs a convolution with the help of some padding. This is not the mathematical inverse of the convolution process, but for encoder-decoder architectures it is very useful.
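The shape relationship in Figure 3.6 can be checked with a minimal NumPy sketch. Stride 1 and a 4x4 kernel are assumed here so that a 5x5 input maps to 2x2 and back; as stated above, only the spatial size is restored, not the original values:

```python
import numpy as np

def conv2d(x, k):
    # Valid cross-correlation: slide the kernel over the input.
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv_transpose2d(y, k):
    # Each input value scatters a scaled copy of the kernel into the output,
    # reversing the spatial transformation of conv2d.
    kh, kw = k.shape
    out = np.zeros((y.shape[0] + kh - 1, y.shape[1] + kw - 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[i:i + kh, j:j + kw] += y[i, j] * k
    return out

x = np.arange(25.0).reshape(5, 5)
k = np.ones((4, 4))
y = conv2d(x, k)                # 5x5 -> 2x2
x_hat = conv_transpose2d(y, k)  # 2x2 -> back to 5x5 (spatial size only)
print(y.shape, x_hat.shape)     # (2, 2) (5, 5)
```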
1x1 Convolution
Figure 3.7 illustrates a 1x1 convolution, where the input tensor has dimensions WxHxD and the filter size is 1x1xD. After the convolution, the output tensor has size WxHx1; if N such convolutions are applied, we end up with an output of dimension WxHxN. 1x1 convolutions facilitate dimensionality reduction for efficient computation and provide a feature pooling capacity. One more advantage, as described in [14], is that we can apply a non-linearity again after the convolution, which helps the model learn more complex functions.
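Because a 1x1 convolution is just a per-pixel linear map across the depth dimension, it can be sketched as a matrix product; the sizes W = H = 8, D = 64, N = 16 are arbitrary:

```python
import numpy as np

W, H, D, N = 8, 8, 64, 16
rng = np.random.default_rng(0)

x = rng.standard_normal((W, H, D))     # input tensor of size WxHxD
filters = rng.standard_normal((D, N))  # N filters, each of size 1x1xD

# Each spatial position is mapped independently across depth:
# (W, H, D) @ (D, N) -> (W, H, N)
y = x @ filters
print(y.shape)  # (8, 8, 16)
```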
Figure 3.8: The pooling layer down-samples the volume spatially. The left image shows the reduction of the spatial dimensions, except depth, using pooling. The right image shows how MaxPooling works [6]
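The 2x2 MaxPooling operation in Figure 3.8 can be sketched as follows; the input values are arbitrary:

```python
import numpy as np

def maxpool_2x2(x):
    # Split the feature map into non-overlapping 2x2 blocks and
    # keep the maximum of each block.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1.0, 3.0, 2.0, 1.0],
              [4.0, 6.0, 5.0, 0.0],
              [7.0, 2.0, 9.0, 8.0],
              [1.0, 0.0, 3.0, 4.0]])
print(maxpool_2x2(x))  # [[6. 5.]
                       #  [7. 9.]]
```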
4
Unsupervised Deep Learning Models
In the previous chapter we discussed the basics of deep learning and the working of a simple convolutional neural network. In this chapter we look at models that work when there are no labelled training examples: the unsupervised models used in deep learning. This chapter gives a brief explanation of the working principle of a few such models, namely autoencoders, variational autoencoders (VAEs) [8] and generative adversarial networks (GANs) [5], which are referred to throughout this report.
4.1. Autoencoders
Autoencoders are the most widely used unsupervised models. An autoencoder applies backpropagation after setting the target output values equal to the inputs. Figure 4.1 shows a simple fully connected autoencoder with one hidden layer. Autoencoders have many use cases, such as anomaly detection and image denoising, to name a few.
An autoencoder is, by design, a dimensionality reduction algorithm that learns to model the common variation in the data samples and to ignore the noise. It consists of four parts: ‘encoder’, ‘bottleneck’, ‘decoder’ and ‘reconstruction loss’. The encoder is the part where the model learns to reduce the input dimensions and map the compressed input data onto the bottleneck layer. The bottleneck layer holds the compressed representation of the input data produced by the encoder. The decoder is more often than not a mirror image of the encoder; it learns to reconstruct the data from the compressed representation to be as close to the original input as possible. The reconstruction loss is the loss function used to measure the reconstruction quality of the decoder.
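The four parts can be sketched as a single forward pass through a linear autoencoder. The layer sizes (784-dimensional input, 32-dimensional bottleneck) and the random weights are illustrative only; in practice the weights are trained by backpropagation on the reconstruction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(784)                   # input sample

W_enc = rng.standard_normal((32, 784)) * 0.01  # encoder weights
W_dec = rng.standard_normal((784, 32)) * 0.01  # decoder weights

z = np.tanh(W_enc @ x)            # encoder -> bottleneck code
x_hat = W_dec @ z                 # decoder reconstruction
loss = np.mean((x - x_hat) ** 2)  # reconstruction loss (MSE)
print(z.shape, x_hat.shape)       # (32,) (784,)
```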
autoencoder, except for a few important differences: vanilla autoencoders use only pixel reconstructions without any f-divergence to measure the loss, and the VAE uses the reparameterization trick to allow the flow of gradients. To sample data, a VAE needs only the decoder, with 𝑧 ∼ 𝑄(𝑍|𝑋) where 𝑧 = 𝜇(𝑥) + 𝜎(𝑥) ⋅ 𝜖 and 𝜖 ∼ 𝑁(0, 𝐼), compared to the autoencoder, which requires the full encoder-decoder to generate an output.
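The reparameterization trick above can be sketched in a few lines; the 8-dimensional latent and the zero-mean, unit-variance encoder outputs are placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.zeros(8)        # encoder output: mean of Q(Z|X)
log_var = np.zeros(8)   # encoder output: log variance of Q(Z|X)

eps = rng.standard_normal(8)          # eps ~ N(0, I)
z = mu + np.exp(0.5 * log_var) * eps  # z = mu(x) + sigma(x) * eps
print(z.shape)  # (8,)
```

Because the randomness is isolated in eps, gradients can flow through mu and log_var back into the encoder.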
Change of variables
In a normalizing flow, we map a simple distribution to a complex one with the help of invertible transformations. To do that, we make use of the change of variables theorem. Given a random variable 𝑧 with known probability density function 𝑧 ∼ 𝜋(𝑧), we transform it into a different random variable with the one-to-one invertible mapping function 𝑥 = 𝑓(𝑧). Since 𝑓 is invertible, 𝑧 = 𝑓⁻¹(𝑥). The problem this creates is how to infer the unknown probability density function of the new variable, 𝑝(𝑥).
log 𝜋ᵢ(𝑧ᵢ) = log 𝜋ᵢ₋₁(𝑧ᵢ₋₁) − log ∣det 𝑑𝑓ᵢ/𝑑𝑧ᵢ₋₁∣ (4.6)
Thus we can step by step back-trace to the initial distribution by expanding the equation for the output 𝑥 from Figure 4.5 [15], and write log 𝑝(𝑥) as:
log 𝑝(𝑥) = log 𝜋₀(𝑧₀) − ∑ᵢ₌₁ᴷ log ∣det 𝑑𝑓ᵢ/𝑑𝑧ᵢ₋₁∣ (4.7)
The path traversed by the random variables 𝑧ᵢ = 𝑓ᵢ(𝑧ᵢ₋₁) is the flow, and the full chain formed by the successive distributions 𝜋ᵢ is called a normalizing flow. The transformation function 𝑓 must satisfy specific structural requirements: the input and output dimensions must be the same, the transformation must be invertible, and computing the determinant of its Jacobian needs to be efficient [15].
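Equation (4.7) can be verified numerically for a single affine flow step 𝑥 = 𝑓(𝑧) = 𝑎𝑧 + 𝑏 with 𝑧 ∼ 𝑁(0, 1); the values of 𝑎, 𝑏 and 𝑧 below are arbitrary. Here the Jacobian determinant is simply 𝑎, and the transformed density is that of 𝑁(𝑏, 𝑎²):

```python
import numpy as np

def log_standard_normal(z):
    # log density of N(0, 1)
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

a, b = 2.0, 1.0   # one affine flow step x = a*z + b, so df/dz = a
z = 0.3
x = a * z + b

# Change of variables (Eq. 4.7 with K = 1):
log_px = log_standard_normal(z) - np.log(abs(a))

# Direct evaluation of the density of N(b, a^2) at x:
log_px_direct = log_standard_normal((x - b) / a) - np.log(abs(a))
print(np.isclose(log_px, log_px_direct))  # True
```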
Bibliography
[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of
COMPSTAT’2010, pages 177–186. Springer, 2010.
[2] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for
lvcsr using rectified linear units and dropout. In 2013 IEEE international conference on acoustics,
speech and signal processing, pages 8609–8613. IEEE, 2013.
[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural infor-
mation processing systems, pages 2672–2680, 2014.
[6] A. Karpathy. CS231n: Convolutional neural networks for visual recognition, Jun 2017. URL http://cs231n.github.io/neural-networks-1/.
[7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[8] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[9] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne
Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition.
Neural computation, 1(4):541–551, 1989.
[10] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In
Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814,
2010.
[11] C. Nicholson. A beginner’s guide to neural networks and deep learning, 2019. URL https:
//skymind.com/wiki/neural-network.
[12] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall. Activation
functions: Comparison of trends in practice and research for deep learning. arXiv preprint
arXiv:1811.03378, 2018.
[13] S. Patel and J. Pingel. Introduction to deep learning: What are convolutional neural networks?, 2017. URL https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html.
[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9,
2015.
[15] L. Weng. Flow based deep generative models, 2018. URL https://lilianweng.github.
io/lil-log/2018/10/13/flow-based-deep-generative-models.html.