
Variational Autoencoder for Anomaly Detection: A Comparative Study

Huy Hoang Nguyen∗, Cuong Nhat Nguyen∗, Xuan Tung Dao∗, Quoc Trung Duong∗, Dzung Pham Thi Kim∗, Minh-Tan Pham†
∗FPT University, Swinburne Vietnam, Hanoi, Vietnam
†IRISA, Université Bretagne Sud, UMR 6074, 56000 Vannes, France

Abstract—This paper conducts a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection, elucidating their performance and behavioral characteristics within this specific task. The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a vision transformer (ViT-VAE). The findings reveal that ViT-VAE exhibits exemplary performance across various scenarios, whereas VAE-GRF may necessitate more intricate hyperparameter tuning to attain its optimal performance state. Additionally, to mitigate the propensity for over-reliance on results derived from the widely used MVTec dataset, this paper leverages the recently published MiAD dataset for benchmarking. This deliberate inclusion seeks to enhance result competitiveness by alleviating the impact of domain-specific models tailored exclusively for MVTec, thereby contributing to a more robust evaluation framework. Code is available at this link.

Index Terms—VAE, VAE-GRF, ViT, computer vision, deep learning, anomaly detection
I. INTRODUCTION

Anomaly detection (AD) represents a machine learning process designed to discern abnormal patterns within a given set of input data. This process finds application across diverse fields, including but not limited to fraud detection in the banking industry, intrusion detection in security systems, anomaly detection in manufacturing, and cancer detection in healthcare [1]. There are also various ways to implement an anomaly detection model, ranging from supervised anomaly detection [2], to semi-supervised anomaly detection [3], and unsupervised anomaly detection [4]. In this paper, we focus on unsupervised anomaly detection.

Reconstruction-based AD constitutes a specific branch of anomaly detection that identifies abnormal patterns through the reconstruction capacity of deep neural networks. This variant aims to reconstruct input data using generative deep models and subsequently compares the reconstructed data with the original input to identify and localize anomalies. While there exist numerous reconstruction-based networks suitable for the AD task, our focus in this work is on the Variational Autoencoder (VAE) and its various adaptations.

In recent years, generative architectures such as Generative Adversarial Networks (GANs), diffusion models, and VAEs have garnered significant attention due to their extensive applications across various facets of human society. Despite being the oldest among these architectures, the VAE has continued to evolve and maintain its prominence in the realm of anomaly detection. Gangloff et al. [5] previously introduced a VAE architecture with Gaussian Random Field priors (VAE-GRF) and showed that this model exhibited competitive results compared to the standard VAE when applied to textural images. The same authors have investigated the discrete latent space of Vector-Quantized VAEs and introduced a novel alignment map to measure anomaly scores for AD from images [6], [7].

From the literature, there have been several research studies showing interest in integrating attention mechanisms [8] into the encoder-decoder AE architecture. In [9]–[11], the implementation of attention in these systems is based on the Vision Transformer (ViT) [12] idea for feature extraction, followed by the use of convolutional layers to reconstruct the image directly from image features instead of the traditional ViT class tokens. Additional research works have explored Masked Autoencoders for anomaly detection [13], [14] by using the masking trick of the Transformer on ViT's image features, which involves masking part of the image patches for self-supervised learning. However, all the aforementioned research has used the Transformer in the AE framework and not the VAE. For instance, [15] has been found to be among the limited studies in which ViT is implemented in a VAE architecture. Nevertheless, instead of using image features and convolutional layers to reconstruct the image, it employs class tokens and linear layers, respectively, with multiple pre-processing steps to enhance the input information. Thus, one of the goals of this work is to implement a VAE using ViT's image features to perform the AD task for comparative purposes.

In summary, the main contributions of this paper are:
• First, we investigate and implement a ViT-VAE framework for anomaly detection and elucidate its effectiveness in this context. Inspired by [4], [5], we train ViT-VAE in an unsupervised manner and measure anomaly scores from both the image space and the latent space.
• Second, we conduct extensive experiments to evaluate and benchmark the performance of contemporary VAE architectures, including the standard VAE, VAE-GRF and ViT-VAE, on two public datasets: MVTec AD [16] and MiAD [17].
II. METHODOLOGY

A. VAE Architectures

Anomaly detection identifies unusual patterns in data, which is challenging for traditional methods [18]. As one of the most popular deep learning-based generative models, VAEs address this task by learning a latent space representation that captures both the mean and the variability of the data distribution, making them sensitive to differences even when normal and anomalous data have similar mean values. The reconstruction probability in VAEs accounts for dissimilarity and variability, allowing nuanced sensitivity to reconstruction based on variable variance [19]. In this section, we provide more comprehensive details about the VAE architectures on which this work mainly focuses, including the traditional VAE (with convolutional layers), the VAE with a Gaussian Random Field prior (VAE-GRF) [4], and our implemented ViT-VAE.
1) Traditional Variational Autoencoder: A VAE is a generative model specifically designed for likelihood estimation, comprising an encoder and a decoder to capture intricate data distributions [20]. The encoder, denoted as qϕ(z|x), transforms input data x into a latent representation z, characterized by parameters µϕ(x) and σϕ(x) following a Gaussian distribution. The reparameterization trick is employed to efficiently sample the latent representation, aiding in both training and the generation of diverse samples.

The decoder, represented as pθ(x̂|z), utilizes the sampled latent representation z to reconstruct the input data x̂. The model is trained to minimize the reconstruction error, measured by the negative log-likelihood of the generated data given the latent representation.

The VAE incorporates variational inference to estimate the posterior distribution qϕ(z|x). This is achieved by minimizing the Kullback-Leibler (KL) divergence between the learned distribution and a predefined distribution, usually a standard normal distribution p(z) = N(0, I). The KL divergence term encourages the model to learn a latent space adhering to the desired distribution. The overall objective function includes a reconstruction term, measuring the negative log-likelihood of generated data, and a KL divergence term penalizing the deviation from the predefined distribution.

From the literature, the VAE has demonstrated strengths in achieving meaningful latent space representations. These representations facilitate the generation of diverse and realistic samples during the inference phase. The model's ability to efficiently capture complex data distributions and its success in likelihood estimation contribute to its significance in various applications. The training process involves minimizing the total loss, which is the average of individual losses over the training dataset, enabling the VAE to effectively learn and generalize from the data. For deeper insights into the formulation of VAEs, the reparameterization trick, and the objective function, please refer to the original VAE paper [20].

Remark: A variational autoencoder (VAE) represents an improved version of an autoencoder, integrating regularization methods to counter overfitting and establish favorable characteristics in the latent space for efficient generative capabilities. Operating as a generative model, VAEs pursue a goal comparable to that of generative adversarial networks.
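To make the objective above concrete, the following is a minimal sketch of a convolutional VAE and its loss in PyTorch, the framework used in our experiments. The layer sizes, the 64x64 input resolution, and the use of mean squared error as a Gaussian negative log-likelihood surrogate are illustrative assumptions, not the exact configuration of the compared models; only the reconstruction-plus-KL structure of the objective is meant to be faithful.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal convolutional VAE sketch; sizes are illustrative only."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)      # mu_phi(x)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)  # log sigma_phi(x)^2
        self.fc_dec = nn.Linear(latent_dim, 64 * 16 * 16)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(self.fc_dec(z).view(-1, 64, 16, 16))
        return x_hat, mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    # Reconstruction term (MSE stands in for the negative log-likelihood).
    rec = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    # KL( N(mu, sigma^2) || N(0, I) ), closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return rec + beta * kl

x = torch.rand(8, 3, 64, 64)   # dummy batch of 64x64 images
model = TinyVAE()
loss = vae_loss(x, *model(x))
loss.backward()
```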
2) Variational Autoencoder with a Gaussian Random Field prior: In the VAE-GRF model [5], notable strengths have been identified without delving into specific formula details. The VAE-GRF distinguishes itself by maintaining an equivalent number of parameters in the encoder, latent space, and decoder as the traditional VAE, eliminating the need for supplementary modules. Importantly, this model introduces only two additional scalar parameters, the range and the variance of the prior, to enhance outcomes with minimal computational expense. The refinement in prior modeling is reported to contribute to improved results, showcasing the model's efficiency and effectiveness, in particular for textured image data.

The VAE-GRF model features a stochastic encoding network that maps input data to a convolutional latent space associated with realizations of a random variable. The introduction of a zero-mean stationary and toroidal Gaussian Random Field as the latent space prior is a key innovation. This GRF prior, with factorized dimensions, brings about enhanced efficiency and computational parallelizability. The model extends the classical VAE by incorporating GRFs, presenting a strict generalization that has been reported to improve performance in various applications. Figure 1 illustrates the main difference between the standard VAE and VAE-GRF, which lies in their feature distributions within the latent space.

Remark: VAE-GRF represents a robust model with distinct improvements and strengths. Its refined modeling approach, relevance to AD tasks for textured images, and efficiency without added computational costs collectively position VAE-GRF as a promising alternative to the traditional VAE baseline. The model's versatility further underscores its potential for widespread adoption across various domains and applications.
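The key object here is the stationary, toroidal GRF prior parameterized by just a range and a variance. The snippet below is a schematic illustration, not the implementation of [5], of how such a prior can be handled on a latent grid: because the covariance of a periodic stationary field is circulant, it is diagonalized by the 2D FFT, and sampling reduces to spectral filtering of white noise. The exponential correlation function and the grid size are our assumptions; [5] also considers other correlation families, such as the Matérn correlation mentioned in the discussion below.

```python
import torch

def toroidal_grf_sample(grid=32, rng=4.0, var=1.0, batch=8):
    """Sample a zero-mean stationary GRF on a (grid x grid) torus.

    The prior has only two scalar parameters: a correlation range `rng`
    and a variance `var`. We assume an exponential correlation
    c(d) = var * exp(-d / rng) with toroidal (wrap-around) distances,
    so the covariance matrix is circulant and diagonalized by the FFT.
    """
    idx = torch.arange(grid, dtype=torch.float32)
    d = torch.minimum(idx, grid - idx)                  # toroidal 1D distances
    dist = (d[:, None] ** 2 + d[None, :] ** 2).sqrt()   # toroidal 2D distances
    cov = var * torch.exp(-dist / rng)   # base row of the circulant covariance

    # Eigenvalues of the circulant covariance = FFT of its base row
    # (clamped for numerical safety).
    spec = torch.fft.fft2(cov).real.clamp(min=0.0)

    # Spectral sampling: filter white noise by the square root of the spectrum.
    eps = torch.randn(batch, grid, grid)
    z = torch.fft.ifft2(torch.sqrt(spec) * torch.fft.fft2(eps)).real
    return z

z = toroidal_grf_sample()
print(z.shape)  # torch.Size([8, 32, 32]) -- spatially correlated latent samples
```

Since the whole operation is a pair of FFTs, the prior adds essentially no computational cost over the standard VAE, which is consistent with the efficiency claims above.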
Fig. 1: Illustration of the differences between the VAE and VAE-GRF architectures, as well as their anomaly map differences [5].

Fig. 2: Architecture of the ViT-VAE model [11].

3) Vision-Transformer Variational Autoencoder: The Vision Transformer (ViT) [12] is a novel architecture for image classification tasks that departs from traditional convolutional neural network (CNN) designs by leveraging self-attention mechanisms. In this paper, we explore the ViT to extract the latent representation of the image. This involves extracting the output from an intermediate layer of the ViT model. The latent representation can be considered as the feature representation of the image captured by the model before passing through the decoder for reconstruction.

The fundamental idea behind ViT is to treat an image as a sequence of fixed-size non-overlapping patches, which are then linearly embedded into vectors [12]. These patches serve as the input tokens for a Transformer architecture, originally designed for natural language processing tasks. In ViT, the image is divided into a grid of patches, and each patch is linearly embedded into a vector. These patch embeddings are flattened and serve as the input to the Transformer encoder. The architecture of the ViT-VAE model is visualized in Figure 2.

As observed from the figure, the resulting sequence of embedded vectors is fed into the Transformer encoder, which consists of multiple layers of self-attention and feed-forward sub-layers. The self-attention mechanism allows the model to capture relationships between different patches in the image. The output of the Transformer encoder serves as a powerful representation of the original image and is employed for downstream tasks. Notably, previous research has highlighted ViT's exceptional performance in basic image processing problems, showcasing its versatility and effectiveness in tasks such as image classification. The ViT architecture's departure from traditional CNN designs, treating images as sequences of patches and leveraging self-attention mechanisms, has led to breakthroughs in image understanding.

Remark: In summary, the ViT-VAE model's strengths lie in its utilization of the ViT architecture, demonstrating success in basic image processing problems. The model efficiently processes images by dividing them into patches, leveraging self-attention mechanisms, and showcasing the versatility of the Vision Transformer in extracting meaningful features for downstream tasks.
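As a concrete picture of the patch-embedding step described above, the sketch below turns an image batch into a sequence of linearly embedded patch tokens. The usual trick, a convolution whose kernel size and stride both equal the patch size, is shown here; the patch size of 16 and the embedding width of 384 echo the latent dimension and the 14x14 latent grid used in our setup below, but the snippet is an illustration rather than our exact encoder.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed them linearly.

    A Conv2d whose kernel size and stride both equal the patch size is
    equivalent to cutting the image into patches and applying one shared
    linear projection to each flattened patch.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

tokens = PatchEmbed()(torch.rand(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 384]) -- a 14x14 grid of patch tokens

# These tokens (plus positional embeddings) feed a Transformer encoder,
# e.g. nn.TransformerEncoder, whose intermediate output serves as the
# image feature from which the VAE latent parameters can be derived.
```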
B. Evaluation Metrics

We leverage the AD metric proposed in [5], which is the combination of two metrics: the SSM (structure-based similarity measure computed from the reconstructed images) and the MAD (alignment map computed from the latent space). Readers are referred to [5] for more information. We note that such metrics can also be used to mark anomalies at the pixel level. We evaluate the models using the ROCAUC of these values.

III. EXPERIMENTS

A. Setup

Our experiments were conducted on a machine with 8 GB of RAM and an NVIDIA GeForce RTX 3050 4GB GPU. The implementation was done in Python with the PyTorch 2.0 framework. For the hyperparameter setting, we used 100 epochs, set the batch size to 8, and set the learning rate to 10⁻⁴. We reduced the batch size and the number of training epochs compared to [5] due to hardware limitations.

For the VAE, we used a ResNet18 backbone and set the z dimension of the model to 256, with a feature map size of 32 and a β coefficient of 1. The same setup was used for VAE-GRF, but with additional model-specific hyperparameters, including the correlation type of the model (we use the identity correlation for this experiment), as done in [5]. However, it should be noted that the settings were chosen differently for ViT-VAE, as ViT does not work in the same way as the CNN in the standard VAE. ViT requires an appropriate patch size to capture the image features, which also directly affects the latent dimension setting of the VAE. In particular, we set the latent dimension of ViT-VAE to 384 and the latent image size to 14, based on the prior implementation of [11].
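To illustrate how the pixel-level evaluation above proceeds, the sketch below combines an image-space map with a latent-space map by elementwise product and scores the result with ROCAUC. The squared reconstruction error and the random latent map are stand-ins for the actual SSM and MAD definitions, which are given in [5]; only the combine-then-ROCAUC pipeline is meant to be faithful.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def anomaly_map(x, x_hat, latent_map):
    """Combine an image-space map with a latent-space map.

    `latent_map` is a per-location score on the latent grid (a stand-in
    for the MAD alignment map of [5]); the squared reconstruction error
    stands in for the SSM term. Both are combined by elementwise product,
    mirroring the MAD * SSM metric reported in the tables.
    """
    img_map = ((x - x_hat) ** 2).mean(dim=1, keepdim=True)   # (B, 1, H, W)
    lat_map = F.interpolate(latent_map, size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
    return (img_map * lat_map).squeeze(1)                    # (B, H, W)

# Dummy example: 2 images of size 64x64 with a 16x16 latent grid.
x, x_hat = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
latent_map = torch.rand(2, 1, 16, 16)
scores = anomaly_map(x, x_hat, latent_map)

# Pixel-level ROCAUC against a binary ground-truth defect mask.
gt_mask = (torch.rand(2, 64, 64) > 0.95).long()
auc = roc_auc_score(gt_mask.flatten().numpy(),
                    scores.flatten().detach().numpy())
print(f"pixel ROCAUC: {auc:.3f}")
```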

B. Datasets
To perform our comparative study, two public datasets were
used for benchmarking, including the MVTec AD [16] and the
MiAD dataset [17]. Each of them is now described.
1) The MVTec dataset [16]: The MVTec AD dataset has gained widespread recognition as a benchmark for anomaly detection in industrial scenarios. The dataset comprises 5354 high-resolution images, with 3629 allocated for training and 1725 for testing purposes. The 15 distinct classes within the dataset encompass various industrial objects and scenes [16]. The training set exclusively includes images without any defects, establishing a baseline for normalcy in an industrial context. In contrast, the test set encompasses both defect-free images and those containing various types of defects. These anomalies range from surface defects on objects to structural anomalies, such as distorted object parts, and defects manifested by the absence of specific object components. Notably, the defects present in the test images are manually generated and labeled with pixel-precise masks outlining the defective areas.

Nevertheless, this dataset exhibits two weaknesses. The first issue pertains to the limited number of images per class, with the class having the highest number of training images totaling only 381, a quantity considered insufficient. Another problem arises from the overuse of the MVTec dataset, resulting in current research being overly concentrated on MVTec. This underscores the necessity for a more recent dataset specifically designed for anomaly detection, which we now describe.

2) The MiAD dataset [17]: This is a novel dataset for unsupervised anomaly detection in various maintenance inspections [17]. The dataset contains more than 10,000 high-resolution color images in seven outdoor industrial scenarios, such as photovoltaic modules, wind turbine blades, and metal welding. The images are generated by 3D graphics software and cover both surface and logical anomalies with pixel-precise ground truth.

Similar to the MVTec dataset, the MiAD dataset is organized into training and test sets. The training set comprises anomaly-free images, while the test set includes both anomalous and anomaly-free images. In contrast to MVTec, MiAD boasts a significantly larger number of training images, specifically 10,000 images for each class, and a minimum of 1,200 testing images. This notable augmentation stems from the synthetic nature of the MiAD dataset. Notably, MiAD places emphasis on outdoor perspectives captured in uncontrolled environments, incorporating diverse camera viewpoints, complex backgrounds, and object surface degradation. This distinctive focus introduces a new and challenging image domain, contrasting with models predominantly tailored for the MVTec dataset.

Fig. 3: Examples of good and anomalous samples from the MVTec and the MiAD datasets.

Despite MiAD encompassing seven classes, our study restricts its focus to four surface anomaly classes: Electrical Insulator, Metal Welding, Photovoltaic Module, and Wind Turbine. The remaining three classes pertain to logical anomaly detection, a facet beyond the scope of the present investigation. In Figure 3, we show some good and anomalous image samples of texture and non-texture materials from the two datasets. We observe that they contain anomaly regions that differ considerably in number, size, and form.
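For readers reproducing the benchmark, the sketch below shows one way to enumerate an MVTec-style category folder, assuming the conventional train/good, test/<defect>, and ground_truth/<defect> layout (MiAD follows the same anomaly-free-training protocol). The paths and the helper function are illustrative, not our released code.

```python
from pathlib import Path

def list_mvtec_split(root: str, category: str):
    """Enumerate one MVTec-AD category using its conventional layout:
    <category>/train/good/*.png            (defect-free training images)
    <category>/test/<defect_or_good>/*.png (mixed test images)
    <category>/ground_truth/<defect>/*.png (pixel-precise defect masks)
    """
    cat = Path(root) / category
    train = sorted((cat / "train" / "good").glob("*.png"))
    test, masks = [], []
    for defect_dir in sorted((cat / "test").iterdir()):
        if not defect_dir.is_dir():
            continue
        for img in sorted(defect_dir.glob("*.png")):
            test.append((img, defect_dir.name != "good"))  # (path, is_anomalous)
            mask = (cat / "ground_truth" / defect_dir.name
                    / f"{img.stem}_mask.png")
            masks.append(mask if mask.exists() else None)  # None for good images
    return train, test, masks

train, test, masks = list_mvtec_split("./mvtec_ad", "hazelnut")
print(len(train), "training images,", len(test), "test images")
```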
C. Comparative results

In this section, our primary focus revolves around a comprehensive performance comparison of VAE, VAE-GRF and ViT-VAE. It is crucial to acknowledge that the GRF prior is specifically tailored for texture data. Therefore, we separate the dataset into two sub-classes, namely texture and non-texture data.

Table I highlights the differences in performance of the three models across the 15 classes of the MVTec dataset. Six classes are classified as texture, namely carpet, grid, leather, tile, wood, and hazelnut, and nine classes as non-texture, namely bottle, cable, capsule, metal nut, pill, screw, toothbrush, transistor, and zipper. Overall, the ViT-VAE model shows significant dominance in performance on non-texture data, while the other models may be competitive for some texture images. It is noted that VAE-GRF does not provide good performance in comparison with the other models on texture data, where it is expected to yield good results according to [5].

Table II shows the results of the experiments on the MiAD dataset. In our experiment, the three logical classes of the dataset were not included, since our focus was on surface (structure) anomalous images. The retained classes are Electrical Insulator, Metal Welding, Photovoltaic Module and Wind Turbine. The results on this dataset are more competitive across the three models, with a slightly better performance yielded by VAE-GRF. More discussion is provided in the next section.

                         MAD * SSM
  Category       VAE           VAE-GRF       ViT-VAE
  ---------------------------------------------------
  Texture
  carpet         0.88 ± 0.13   0.86 ± 0.11   0.92 ± 0.10
  grid           0.89 ± 0.10   0.80 ± 0.12   0.94 ± 0.05
  leather        0.84 ± 0.16   0.90 ± 0.12   0.72 ± 0.24
  tile           0.84 ± 0.10   0.68 ± 0.17   0.80 ± 0.14
  wood           0.80 ± 0.20   0.73 ± 0.20   0.68 ± 0.19
  hazelnut       0.97 ± 0.02   0.97 ± 0.02   0.98 ± 0.01
  Average        0.87 ± 0.11   0.82 ± 0.12   0.88 ± 0.12
  Non-Texture
  bottle         0.80 ± 0.11   0.81 ± 0.11   0.85 ± 0.09
  cable          0.63 ± 0.19   0.60 ± 0.24   0.67 ± 0.17
  capsule        0.84 ± 0.17   0.87 ± 0.15   0.92 ± 0.13
  metal nut      0.76 ± 0.09   0.80 ± 0.09   0.82 ± 0.07
  pill           0.84 ± 0.16   0.87 ± 0.12   0.87 ± 0.17
  screw          0.92 ± 0.06   0.88 ± 0.10   0.94 ± 0.04
  toothbrush     0.84 ± 0.10   0.84 ± 0.07   0.87 ± 0.06
  transistor     0.69 ± 0.15   0.67 ± 0.14   0.72 ± 0.17
  zipper         0.86 ± 0.10   0.82 ± 0.11   0.89 ± 0.11
  Average        0.79 ± 0.12   0.79 ± 0.12   0.84 ± 0.11

TABLE I: Comparative results (MAD * SSM) yielded by the three models on the MVTec dataset.
                                  MAD * SSM
  Anomaly     Category               VAE    VAE-GRF   ViT-VAE
  -----------------------------------------------------------
  Structure   Electrical Insulator   0.87   0.84      0.86
              Metal Welding          0.69   0.82      0.71
              Photovoltaic Module    0.98   0.93      0.97
              Wind Turbine           0.92   0.93      0.93
              Average                0.87   0.88      0.87

TABLE II: Comparative results (MAD * SSM) yielded by the three models on the MiAD dataset.
D. Discussion

In this section, we go in depth into the experimental results and underline the important aspects of the three comparative models.

1) MVTec experiments: The aim of our experiments on this dataset is to verify the performance of our implemented ViT-VAE compared to the VAE and VAE-GRF, whose results were reported in [5] (but not for all 15 classes). From the experimental results shown in Table I, two important remarks can be drawn:
• First, the performance of VAE-GRF on the MVTec dataset is significantly lower than that reported in the original paper [5]; we suspect that this is due to the gap in the number of training epochs and the difference in hyperparameter settings. In [5], it is stated that VAE-GRF requires detailed hypotheses and settings to maintain a tractable model. Indeed, to achieve the performance of the original paper on wood and tile, VAE-GRF requires extra hyperparameter tuning of its Matérn correlation and of the β in the β-ELBO. However, to ensure a fair comparison, we kept the hyperparameter settings of VAE-GRF the same for every class in this paper, thus leading to the low performance of VAE-GRF. This suggests that, to use VAE-GRF, information about the data characteristics should be taken into consideration to find an optimal hyperparameter setting.
• Secondly, ViT-VAE shows excellent performance in most of the MVTec classes. By leveraging the ViT architecture, ViT-VAE exhibits an enhanced ability to capture long-range dependencies with reduced inductive bias, albeit at the expense of necessitating a larger volume of data for optimal performance, as asserted by Thisanke et al. [21]. Remarkably, ViT-VAE achieves commendable results, particularly excelling in the hazelnut, screw, and carpet categories, the top three classes in terms of dataset volume. Conversely, the instances where ViT-VAE exhibits comparatively lower performance coincide with classes characterized by a limited volume of data.

2) MiAD experiments: The MiAD experiment focuses on the robustness of VAE-based anomaly detection methods. To achieve this, we follow the approach of the MiAD dataset, which involves incorporating variations in viewpoint and background into both the training and testing data [17]. Upon examining the results of the experiment (shown in Table II), two interesting points emerge:
• First, in the conducted experiments (some illustrations are shown in Figure 4), both VAE and ViT-VAE encounter challenges in handling the random backgrounds of the Wind Turbine class. Conversely, VAE-GRF does not encounter this issue but still mislabels the anomaly positions. This discrepancy can be elucidated by the fundamental principle of the VAE-GRF prior, wherein a more stationary configuration is related to improved model performance, and vice versa. Specifically, the differences in viewpoints of the training data have a significant effect on the reconstruction output of VAE-GRF. In conclusion, none of the VAE models in this study shows an advantage on the MiAD dataset, indicating that MiAD is a challenging dataset for future research.
• Secondly, the MiAD dataset can also be used as a robustness test for AD frameworks, as demonstrated in our research. By not training the model on the full training dataset, the inconsistency between training and testing data helps researchers clearly identify a model that can still function despite domain shifts. This demonstrates that the MiAD dataset holds immense potential and should be utilized more in the future, along with the classical MVTec dataset.

IV. CONCLUSION

Despite its status as the earliest of these generative algorithms, the Variational Autoencoder (VAE) continues to maintain prominence within the niche of anomaly detection. In pursuit of a comprehensive understanding, we conducted experiments to assess the performance of VAE models in this specific task and to clarify their crucial characteristics. Notably, VAE-GRF necessitates meticulous hyperparameter tuning to yield favorable outcomes, whereas the Vision Transformer-based VAE (ViT-VAE) demonstrates robust performance even with a limited number of training epochs.
Fig. 4: Illustrations of anomaly maps provided by the three models for some MVTec and MiAD samples.

Furthermore, our observations indicate that the MiAD dataset presents itself as a promising resource. Distinguished by variations in both image domain and anomalies, MiAD holds significant potential for future research endeavors and warrants more extensive use in anomaly detection studies.

REFERENCES

[1] A. B. Nassif, M. A. Talib, Q. Nasir, and F. M. Dakalbab, "Machine learning for anomaly detection: A systematic review," IEEE Access, vol. 9, pp. 78658–78700, 2021.
[2] H. Li, J. Hu, B. Li, H. Chen, Y. Zheng, and C. Shen, "Target before shooting: Accurate anomaly detection and localization under one millisecond via cascade patch retrieval," 2023.
[3] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon, "GANomaly: Semi-supervised anomaly detection via adversarial training," 2018.
[4] H. Gangloff, M.-T. Pham, L. Courtrai, and S. Lefèvre, "Variational autoencoder with Gaussian random field prior: Application to unsupervised animal detection in aerial images," in 2023 IEEE International Conference on Image Processing (ICIP), pp. 1620–1624, IEEE, 2023.
[5] H. Gangloff, M.-T. Pham, L. Courtrai, and S. Lefèvre, "Unsupervised anomaly detection using variational autoencoder with Gaussian random field prior," in 2023 IEEE International Conference on Image Processing (ICIP), pp. 1620–1624, IEEE, 2023.
[6] H. Gangloff, M.-T. Pham, L. Courtrai, and S. Lefèvre, "Leveraging vector-quantized variational autoencoder inner metrics for anomaly detection," in 2022 26th International Conference on Pattern Recognition (ICPR), pp. 435–441, IEEE, 2022.
[7] M.-T. Pham, H. Gangloff, and S. Lefèvre, "Weakly supervised marine animal detection from remote sensing images using vector-quantized variational autoencoder," in IGARSS 2023 – 2023 IEEE International Geoscience and Remote Sensing Symposium, pp. 5559–5562, IEEE, 2023.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[9] J. Zhang, X. Chen, Y. Wang, C. Wang, Y. Liu, X. Li, M.-H. Yang, and D. Tao, "Exploring plain ViT reconstruction for multi-class unsupervised anomaly detection," arXiv preprint arXiv:2312.07495, 2023.
[10] B. Choi and J. Jeong, "ViV-Ano: Anomaly detection and localization combining vision transformer and variational autoencoder in the manufacturing process," Electronics, vol. 11, no. 15, p. 2306, 2022.
[11] Y. Lee and P. Kang, "AnoViT: Unsupervised anomaly detection and localization with vision transformer-based encoder-decoder," IEEE Access, vol. 10, pp. 46717–46724, 2022.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[13] C. Huang, Q. Xu, Y. Wang, Y. Wang, and Y. Zhang, "Self-supervised masking for unsupervised anomaly detection and localization," IEEE Transactions on Multimedia, 2022.
[14] M.-I. Georgescu, "Masked autoencoders for unsupervised anomaly detection in medical images," Procedia Computer Science, vol. 225, pp. 969–978, 2023.
[15] T. Chen, B. Li, and J. Zeng, "Learning traces by yourself: Blind image forgery localization via anomaly detection with ViT-VAE," IEEE Signal Processing Letters, vol. 30, pp. 150–154, 2023.
[16] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, "MVTec AD – A comprehensive real-world dataset for unsupervised anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9592–9600, 2019.
[17] T. Bao, J. Chen, W. Li, X. Wang, J. Fei, L. Wu, R. Zhao, and Y. Zheng, "MIAD: A maintenance inspection dataset for unsupervised anomaly detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 993–1002, 2023.
[18] M. Munir, M. A. Chattha, A. Dengel, and S. Ahmed, "A comparative analysis of traditional and deep learning-based anomaly detection methods for streaming data," in 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 561–566, IEEE, 2019.
[19] J. An and S. Cho, "Variational autoencoder based anomaly detection using reconstruction probability," Special Lecture on IE, vol. 2, pp. 1–18, 2015.
[20] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[21] H. Thisanke, C. Deshan, K. Chamith, S. Seneviratne, R. Vidanaarachchi, and D. Herath, "Semantic segmentation using vision transformers: A survey," Engineering Applications of Artificial Intelligence, vol. 126, p. 106669, 2023.
