Article
ViV-Ano: Anomaly Detection and Localization Combining
Vision Transformer and Variational Autoencoder in the
Manufacturing Process
Byeonggeun Choi and Jongpil Jeong *
Abstract: The goal of image anomaly detection is to determine whether an image contains an abnormality. Image anomaly detection is currently used in various fields such as medicine, intelligent information, the military, and manufacturing. The encoder–decoder structure, which learns normal periodic patterns and judges anomaly scores through the reconstruction error between the reconstructed image and the input image, is widely used in the field of anomaly detection. Existing image anomaly detection methods extract normal information from local features of the image; in contrast, a vision-transformer-based model can learn the global relationships between image patches and generate a probability distribution, extracting normal information while also localizing anomalies.
We propose Vision Transformer and VAE for Anomaly Detection (ViV-Ano), an anomaly detection
model that combines a model variational autoencoder (VAE) with Vision Transformer (ViT). The
proposed ViV-Ano model showed similar or better performance when compared to the existing
model on benchmark datasets. In addition, on the MVTec anomaly detection dataset (MVTecAD), a dataset for industrial anomaly detection, it showed similar or improved performance compared to existing models.

Keywords: anomaly detection; computer vision; vision transformer; variational autoencoder; unsupervised learning; manufacturing; Industry 4.0; smart factory
Convolutional neural networks (CNNs), algorithms that mimic the structure of biological neurons, have shown useful performance in recognizing patterns in images. A CNN receives data in matrix form to preserve the shape of the image, using convolution layers that extract features and pooling layers that downsample the extracted features, which prevents the information loss incurred when flattening an image into the one-dimensional input of a conventional feedforward neural network (FNN) [6]. However, a CNN cannot extract features from the image as a whole unless the architecture is sufficiently deep. Moreover, although weight sharing reduces the number of parameters, the cost of performing the convolution operations remains computationally large [7]. In other words, CNNs are at a disadvantage in image learning because the model learns only local information and not the global context. ViT, by contrast, can learn more image features than a CNN, which only learns local information, because its attention operations over image patches additionally capture the global context.
Research on deep learning for image anomaly detection has progressed steadily. Encoder–decoder methods and generation-based methods are often used for image anomaly detection and anomaly localization with deep learning. An encoder–decoder method trains the model by compressing the original image and reconstructing the compressed representation to resemble the original; anomaly detection is then achieved mainly through the differences in this reconstruction. Generation-based methods instead learn the distribution of normal data through a generator that produces images with a shape similar to the original. Recently, the Vision Transformer (ViT) has shown a powerful ability to classify and judge images in computer vision. ViT's self-attention layer consists of several heads, and for each head it calculates the self-attention distance, i.e., the distance between the position of the query patch and the positions the patch attends to [8]. Even in the lowest layer, a ViT head mixes local and global information, so information is smoothly transferred from the lower layers to the higher layers. Because ViT lacks the inductive biases of convolution, it requires a large amount of pretraining data; when the dataset is small, CNNs are still preferred, and the preferred structure for each situation is an active area of study [9]. Variational autoencoders (VAEs) are generative models that can improve performance by learning stochastic factors [10]. In addition, since a VAE is a reconstruction-based model, it can yield better results for anomaly detection.
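To make the reconstruction-based scoring concrete, the following is a minimal sketch, assuming a trained encoder–decoder `model` and a `threshold` fit on normal data (both are placeholders, not the paper's implementation):

```python
import torch

def anomaly_score(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Score images by reconstruction error (hypothetical encoder-decoder `model`).

    x: batch of images of shape (B, C, H, W), values in [0, 1].
    Returns one scalar score per image; higher means more anomalous.
    """
    model.eval()
    with torch.no_grad():
        x_hat = model(x)                   # reconstruction of the input
        err = (x - x_hat) ** 2             # per-pixel squared error
        return err.flatten(1).mean(dim=1)  # mean error per image

# Usage: flag images whose score exceeds a threshold fit on normal data.
# scores = anomaly_score(model, batch)
# is_anomalous = scores > threshold
```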
In this paper, we propose a model for detecting anomalies during the manufacturing process:
• ViT is applied to the image region to be analyzed. ViT treats image patches like words and uses an attention-based approach, rather than the recurrent seq2seq architecture that attention was originally built on. By allocating weights according to where attention is applied, the model can help interpret the image.
• Because the transformer backbone lacks inductive bias, large datasets are required for learning. We compensate for ViT's marked need for large amounts of training data, and improve anomaly detection performance and accuracy, by generating new data from the probability distributions of a VAE. Combining ViT with a VAE in this way, we propose the ViV-Ano model for anomaly detection.
The remainder of the paper is organized as follows. Section 2 introduces the anomaly
detection and approach models used in manufacturing. Section 3 provides an explanation of
the components of the proposed model and the main ideas in this paper. Section 4 describes
the structure of the dataset, the experimental environment, and the results. Finally, Section 5
summarizes and concludes the study, and presents directions for future research.
2. Related Work
2.1. Anomaly Detection in Manufacturing
The development of artificial intelligence in the manufacturing industry has been
steadily studied since the beginning of Industry 4.0. Many manufacturers and countries
have aimed to make the most of cyber-physical systems, including information and communication technologies.
ViT (Figure 1) [8] has been introduced and has achieved excellent performance in image classification, semantic segmentation, and image detection.
Figure 2 illustrates the main idea of this study. Data are collected through devices capable of capturing the images used in various industries. The image data are classified into normal and abnormal data according to each class, so that they can be learned by, and used in, the model. Each image is resized to a fixed size so that it can be divided into patches by the model. The image is then split according to the designated patch size, and a linear transformation is applied. After the patches pass through patch embedding into a sequence, a learnable class token is prepended, and a position embedding tensor is added to encode the location of each element of the sequence. The embedded data are then passed to the encoder, which extracts features and creates an integrated reconstruction vector, and the data are reconstructed through the decoder. The embedding and encoder processes are described in detail in the sections below. The results of anomaly detection can be reviewed by process experts, making it possible to prepare for problems that may arise from the abnormality.
The patch embeddings form a sequence, and a position embedding tensor is added to encode the location of each element of the sequence, as in the transformer model.
The encoder processes the input in the same way as the original ViT, except for the final output part, and its output is provided as input to the model. After embedding (Equation (1)), the patch sequence is delivered to the multi-headed self-attention (MSA) block (Equation (2)) and the multilayer perceptron (MLP) block (Equation (3)). In Equation (1), $x_{class}$ is the class token, each $E$ is the linear embedding of an image patch, and $E_{pos}$ is the position embedding. LN, used in both Equations (2) and (3), denotes layer normalization. This process is given by the following formulas and illustrated in Figure 3.
Figure 3. ViT embedding: for an input image with height and width H = W = 224 and 3 channels, with a patch size of 16, patch embedding yields a sequence of dimension 196 × 768. Adding a class token gives a sequence of 197 × 768, which is combined with a position embedding tensor of the same 197 × 768 dimension and then fed into the encoder.
$$z_0 = \left[ x_{class};\, x_p^1 E;\, x_p^2 E;\, \cdots;\, x_p^N E \right] + E_{pos}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \tag{1}$$

$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \ldots, L \tag{2}$$

$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \ldots, L \tag{3}$$

$$[q, k, v] = z\, U_{qkv}, \qquad U_{qkv} \in \mathbb{R}^{D \times 3D_h} \tag{4}$$

$$A = \mathrm{softmax}\!\left( \frac{q k^{\top}}{\sqrt{D_h}} \right), \qquad A \in \mathbb{R}^{N \times N} \tag{5}$$

$$\mathrm{SA}(z) = A v$$

$$\mathrm{MSA}(z) = \left[ \mathrm{SA}_1(z);\, \mathrm{SA}_2(z);\, \cdots;\, \mathrm{SA}_k(z) \right] U_{msa}, \qquad U_{msa} \in \mathbb{R}^{k \cdot D_h \times D} \tag{6}$$
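As an illustration of the shapes in Figure 3 and of Equations (1) and (4)–(6), the sketch below walks a 224 × 224 × 3 image through patch embedding and one multi-headed self-attention operation. The dimensions follow the figure and the tensor names mirror the equations; the random weights and single forward pass are illustrative assumptions, not the authors' implementation:

```python
import torch

B, H, W, C = 1, 224, 224, 3      # batch, image size, channels (Figure 3)
P, D, heads = 16, 768, 12        # patch size, embedding dim, attention heads
N = (H // P) * (W // P)          # 196 patches
Dh = D // heads                  # per-head dimension

x = torch.randn(B, C, H, W)

# Equation (1): split into P x P patches, flatten, and embed linearly.
patches = x.unfold(2, P, P).unfold(3, P, P)             # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)
E = torch.nn.Linear(C * P * P, D)                        # E in R^{(P^2*C) x D}
tokens = E(patches)                                      # (B, 196, 768)
x_class = torch.zeros(B, 1, D)                           # learnable class token
E_pos = torch.zeros(B, N + 1, D)                         # position embedding
z0 = torch.cat([x_class, tokens], dim=1) + E_pos         # (B, 197, 768)

# Equations (4)-(6): multi-headed self-attention over the sequence.
U_qkv = torch.nn.Linear(D, 3 * Dh * heads, bias=False)   # query/key/value projection
q, k, v = U_qkv(z0).chunk(3, dim=-1)                     # each (B, 197, 768)

def split_heads(t):                                      # -> (B, heads, 197, Dh)
    return t.view(B, -1, heads, Dh).transpose(1, 2)

q, k, v = split_heads(q), split_heads(k), split_heads(v)
A = torch.softmax(q @ k.transpose(-2, -1) / Dh ** 0.5, dim=-1)  # Eq. (5)
SA = A @ v                                               # SA(z) = Av, per head
U_msa = torch.nn.Linear(heads * Dh, D)                   # Eq. (6) output projection
msa = U_msa(SA.transpose(1, 2).reshape(B, -1, heads * Dh))
print(z0.shape, msa.shape)  # torch.Size([1, 197, 768]) for both
```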
The structure of the proposed model is shown in Figure 4.
The encoder of a typical ViT uses the MLP to obtain information about the image through this process. In this paper, we subsequently apply the VAE to model the latent space distribution. The VAE encoder encodes the input data into a mean vector and a standard deviation vector, and sampling is performed within the distribution defined by these two vectors. Using the Kullback–Leibler divergence (KLD) as a loss term, the latent distribution is trained to be close to the standard normal distribution. Therefore, the latent space of the VAE, sampled from a distribution close to the standard normal, is symmetric and continuous with respect to the input data. In the VAE framework, the marginal likelihood $p_\theta(x)$ of the data is obtained by integrating the conditional likelihood $p_\theta(x \mid z)$, which generates the output data from a latent vector $z$, over the prior $p_\theta(z)$. Assuming $z$ follows a Gaussian distribution, the parameters $\theta$ that maximize the likelihood of the output data (maximum likelihood estimation, MLE) can be expressed via Equation (7):

$$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz \tag{7}$$
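A minimal sketch of the sampling and KL term described above, assuming hypothetical `encoder` and `decoder` modules and a KL weight `beta` that the paper does not specify; the log-variance parameterization is the usual convention rather than a detail taken from the paper:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kld_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims."""
    return -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp(), dim=1).mean()

# Hypothetical training step: `encoder` outputs (mu, logvar), `decoder`
# reconstructs x; beta weights the KL term (an assumption, not a paper value).
# mu, logvar = encoder(x)
# z = reparameterize(mu, logvar)
# x_hat = decoder(z)
# loss = ((x - x_hat) ** 2).mean() + beta * kld_to_standard_normal(mu, logvar)
```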
In a fully connected layer, normalization is applied per feature vector, whereas in a convolutional layer it is applied per channel; that is, the mean and variance are computed over the batch, height, and width dimensions. The corresponding expression is given in Equation (9), where BN denotes batch normalization and B, W, and H denote batch, width, and height, respectively:

$$\mathrm{BN}(X) = \gamma\, \frac{X - \mu_{batch}}{\sigma_{batch}} + \beta, \qquad
\mu_{batch} = \frac{1}{BHW} \sum_i \sum_j \sum_k x_{i,j,k}, \qquad
\sigma_{batch}^2 = \frac{1}{BHW} \sum_i \sum_j \sum_k \left( x_{i,j,k} - \mu_{batch} \right)^2 \tag{9}$$
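The following sketch computes Equation (9) directly for a feature map of shape (B, C, H, W); the small `eps` inside the square root is the usual numerical-stability term, an addition not shown in Equation (9):

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Equation (9) for a conv feature map x of shape (B, C, H, W):
    mean and variance are taken over batch, height, and width, per channel."""
    mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # mu_batch, shape (1, C, 1, 1)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)  # sigma^2_batch
    x_hat = (x - mu) / torch.sqrt(var + eps)                  # eps avoids divide-by-zero
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 3, 32, 32)
out = batch_norm_2d(x, gamma=torch.ones(3), beta=torch.zeros(3))
# Matches torch.nn.BatchNorm2d(3) in training mode, up to eps handling.
```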
Here, $w_k(x; \theta)$ is the mixing weight, $\mu_k(x; \theta)$ the mean, and $\sigma_k^2(x; \theta)$ the variance of the $k$-th Gaussian. The GMM parameters are estimated using neural networks with parameters $\theta$ and input $x$. The mixing weights must satisfy the constraints $\sum_{k=1}^{K} w_k(x; \theta) = 1$ and $w_k(x; \theta) \geq 0 \;\forall k$, so weight estimation is performed with the softmax function:

$$w_k(x) = \frac{\exp\!\left(a_k^w(x)\right)}{\sum_{k'=1}^{K} \exp\!\left(a_{k'}^w(x)\right)} \tag{11}$$
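A sketch of how such weights might be produced in practice, per Equation (11); the layer sizes and component count K are illustrative assumptions, not the paper's architecture:

```python
import torch

class MixtureWeights(torch.nn.Module):
    """Predict GMM mixing weights w_k(x; theta) from input x (Equation (11)).

    The softmax guarantees w_k >= 0 and sum_k w_k = 1."""
    def __init__(self, in_dim: int = 64, K: int = 5):
        super().__init__()
        self.logits = torch.nn.Linear(in_dim, K)  # a^w_k(x), one logit per component

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.logits(x), dim=-1)

w = MixtureWeights()(torch.randn(2, 64))
print(w.sum(dim=-1))  # tensor([1., 1.]) up to floating point
```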
4.2. Datasets
The MNIST and CIFAR10 datasets are composed of image data, each with ten classes. Both follow the standard experimental setting for one-class classification: the images of one class are taken as normal and used as training data, and the remaining classes are defined as abnormal. The test dataset consists of images from both normal and abnormal classes. Images from both datasets were resized to H = 512, W = 512, and C = 3 and used for training and evaluation.
The MNIST dataset consists of numbers between 0 and 9. There are about 6000 images
for each numeric class in the training data. The test data consist of normal and abnormal
classes, with a total of 10,000 images. The CIFAR10 dataset contains image data for ten
classes, and the training dataset has approximately 5000 images per class. The model was
trained with 4500 images and the performance was verified with the remaining 500 images.
The test data consisted of 10,000 images in both normal and abnormal classes.
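A sketch of this one-class protocol for MNIST, assuming torchvision for data loading; the choice of digit 0 as the normal class is illustrative, and the resize target follows the H = W = 512, C = 3 setting above:

```python
import numpy as np
from torchvision import datasets, transforms

# One-class protocol: train only on the "normal" class; at test time,
# every other class counts as anomalous.
normal_class = 0  # illustrative choice
tfm = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.Grayscale(num_output_channels=3),  # MNIST is 1-channel
    transforms.ToTensor(),
])
train = datasets.MNIST("data", train=True, download=True, transform=tfm)
test = datasets.MNIST("data", train=False, download=True, transform=tfm)

train_idx = np.where(np.array(train.targets) == normal_class)[0]   # normal only
test_labels = (np.array(test.targets) != normal_class).astype(int) # 1 = anomaly
```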
We used the MVTec Anomaly Detection (MVTecAD) [26] dataset, which is designed
to test anomaly detection algorithms for industrial quality control. MVTecAD is classified
into 15 categories and consists of 3629 images in total for training and verification and
1725 images for testing. The original image resolution ranges from 700 × 700 to 1024 × 1024.
The training set has only defect-free image data. The test set consists of various defective
image data and defect-free image data. The test set has different types of abnormalities
from class to class, but there are abnormal defects in the form of cracks, deformation,
discoloration, and scratches. As can be seen from the Capsule class in Figure 6, the objects
are centered and aligned in a uniform manner throughout the dataset. Since the abnormal
phenomena are different in size, shape, and structure, the method can be applied in various
situations for industrial defect detection.
Figure 6. Normal and abnormal images for the capsule and carpet classes included in the MVTecAD dataset. For the capsule and carpet, (a,e) are normal images, and (b–d,f–h) are defective images with their abnormal mask images, respectively. The defects in the abnormal capsule images are: (b) crack; (c) faulty imprint; (d) poke. The defects in the abnormal carpet images are: (f) cut; (g) hole; (h) thread.
4.4. Results
To evaluate the performance of ViV-Ano, with its ViT- and VAE-based structure, we compared its anomaly detection results on the MNIST and CIFAR10 datasets against the results reported in [31,32], and compared results on the MVTecAD dataset against the experiments in this paper. On all three datasets (MNIST, CIFAR10, MVTecAD), anomaly detection achieved similar or better performance than existing models.
The model was tested using MNIST and CIFAR10, which are reference datasets for anomaly detection. In these datasets, anomalies are defined at the global level and are not limited to specific, possibly small, image patches as in datasets such as MVTecAD. Therefore, anomaly detection was performed using only the reconstruction loss, without determining the anomaly location. The results in Tables 1 and 2 confirm that anomaly detection was achieved with similar or better performance than existing methods. In addition, anomaly detection and localization were performed using the MVTecAD dataset. The results in Table 3 show that the proposed model outperformed the other models in several categories, and that anomaly detection is performed well on both texture and object product images.
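Image-level evaluation then reduces to ranking test images by anomaly score and measuring how well normal and abnormal labels separate, for which AUROC is the standard metric; a sketch using scikit-learn with illustrative toy values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative only: 0 = normal, 1 = abnormal; scores from anomaly_score().
labels = np.array([0, 0, 0, 1, 1])
scores = np.array([0.11, 0.09, 0.14, 0.52, 0.47])
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")  # 1.000 for this toy case
```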
Table 3. Anomaly detection performance per category on the MVTecAD dataset (higher is better).

| Type | Category | 1-NN | OCSVM | KMeans | VAE | AE-SSIM | AnoGAN | CNN | UniStud | ViV-Ano |
|------|----------|------|-------|--------|-----|---------|--------|-----|---------|---------|
| Texture | Carpet | 0.512 | 0.355 | 0.253 | 0.501 | 0.647 | 0.204 | 0.469 | 0.695 | 0.782 |
| | Grid | 0.228 | 0.125 | 0.107 | 0.224 | 0.849 | 0.226 | 0.183 | 0.819 | 0.789 |
| | Leather | 0.446 | 0.306 | 0.308 | 0.635 | 0.561 | 0.378 | 0.641 | 0.819 | 0.786 |
| | Tile | 0.822 | 0.722 | 0.779 | 0.870 | 0.175 | 0.177 | 0.797 | 0.912 | 0.881 |
| | Wood | 0.502 | 0.336 | 0.411 | 0.628 | 0.605 | 0.386 | 0.621 | 0.725 | 0.864 |
| Object | Bottle | 0.898 | 0.850 | 0.495 | 0.897 | 0.834 | 0.620 | 0.742 | 0.918 | 0.935 |
| | Cable | 0.806 | 0.431 | 0.513 | 0.654 | 0.478 | 0.383 | 0.558 | 0.865 | 0.881 |
| | Capsule | 0.631 | 0.554 | 0.387 | 0.526 | 0.860 | 0.306 | 0.306 | 0.916 | 0.869 |
| | Hazelnut | 0.861 | 0.616 | 0.698 | 0.878 | 0.916 | 0.698 | 0.844 | 0.937 | 0.884 |
| | Metal Nut | 0.705 | 0.319 | 0.351 | 0.576 | 0.603 | 0.320 | 0.358 | 0.895 | 0.914 |
| | Pill | 0.725 | 0.544 | 0.514 | 0.769 | 0.830 | 0.776 | 0.460 | 0.935 | 0.895 |
| | Screw | 0.604 | 0.644 | 0.550 | 0.559 | 0.887 | 0.466 | 0.277 | 0.928 | 0.878 |
| | Toothbrush | 0.538 | 0.538 | 0.337 | 0.693 | 0.784 | 0.749 | 0.151 | 0.863 | 0.928 |
| | Transistor | 0.496 | 0.496 | 0.399 | 0.626 | 0.725 | 0.549 | 0.628 | 0.701 | 0.876 |
| | Zipper | 0.512 | 0.355 | 0.253 | 0.549 | 0.665 | 0.467 | 0.703 | 0.933 | 0.901 |
| Mean | – | 0.640 | 0.479 | 0.423 | 0.639 | 0.694 | 0.443 | 0.515 | 0.857 | 0.871 |
In addition, Table 4 shows the results of evaluating the proposed model's anomaly localization performance on the MVTecAD dataset. We compared the CNN-based autoencoder method frequently used in conventional anomaly localization, with results taken from [26], against the method proposed in this paper. Compared to the other models, localization performance improved for 11 out of 15 categories, and regardless of product type, the average anomaly localization AUROC improved by 1.85%.
Experiments using ViT instead of a CNN as the encoder model, with data preprocessing and training conducted under the same conditions, showed increased anomaly detection and localization performance. This result can be attributed to the global and local information added to the position embedding through the repeated self-attention operations in ViT.
5. Conclusions
In the manufacturing sector, anomaly detection and localization are used to detect abnormalities such as defects in image data. Image anomaly detection and localization play an important role in making accurate decisions and improving the work efficiency of experts. A ViT-based encoder–decoder and Gaussian approximation methods were used to detect and localize anomalies. Anomaly localization was achieved through reconstruction-based anomaly detection, and results equivalent to or better than those of existing techniques were obtained.
In future studies, we expect ViT-based models to become even more efficient than existing CNN models at detecting and localizing anomalies if detection and localization performance can be further improved with encoders such as DeiT [33], CrossViT [34], or Swin Transformer [35].
Author Contributions: Conceptualization, B.C. and J.J.; methodology, B.C.; software, B.C.; vali-
dation, B.C. and J.J.; formal analysis, B.C.; investigation, B.C.; resources, J.J.; data curation, B.C.;
writing—original draft preparation, B.C.; writing—review and editing, J.J.; visualization, B.C.; super-
vision, J.J.; project administration, J.J.; funding acquisition, J.J. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the
ITRC (Information Technology Research Center) support program (IITP-2022-2018-0-01417) super-
vised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
Also, this work was supported by the National Research Foundation of Korea (NRF) grant funded by
the Korea government (MSIT) (No. 2021R1F1A1060054).
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Kim, D.; Cha, J.; Oh, S.; Jeong, J. AnoGAN-Based Anomaly Filtering for Intelligent Edge Device in Smart Factory. In Proceedings of the 15th International Conference on Ubiquitous Information Management and Communication (IMCOM), Seoul, Korea, 4–6 January 2021; pp. 1–6.
2. Cha, J.; Park, J.; Jeong, J. A Novel Defect Classification Scheme Based on Convolutional Autoencoder with Skip Connection in
Semiconductor Manufacturing. In Proceedings of the 24th International Conference on Advanced Communication Technology
(ICACT), PyeongChang, Korea, 13–16 February 2022; pp. 347–352.
3. Meneganti, M.; Saviello, F.S.; Tagliaferri, R. Fuzzy neural networks for classification and detection of anomalies. IEEE Trans.
Neural Netw. 1998, 9, 848–861. [CrossRef] [PubMed]
4. Zeng, Q.; Wu, S. A fuzzy clustering approach for intrusion detection. In Proceedings of the International Conference on Web
Information Systems and Mining, Shanghai, China, 7–8 November 2009; pp. 728–732.
5. Lee, T.; Lee, K.B.; Kim, C.O. Performance of machine learning algorithms for class-imbalanced process fault detection problems.
IEEE Trans. Semicond. Manuf. 2016, 29, 436–445. [CrossRef]
6. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten
zip code recognition. Neural Comput. 1989, 1, 541–551. [CrossRef]
7. Naseer, S.; Saleem, Y.; Khalid, S.; Bashir, M.K.; Han, J.; Iqbal, M.M.; Han, K. Enhanced network anomaly detection based on deep
neural networks. IEEE Access 2018, 6, 48231–48246. [CrossRef]
8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
9. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural
networks? Adv. Neural Inf. Process. Syst. 2021, 34, 12116–12128.
10. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
11. Cui, Y.; Liu, Z.; Lian, S. A Survey on Unsupervised Industrial Anomaly Detection Algorithms. arXiv 2022, arXiv:2204.11161.
12. Bozcan, I.; Korndorfer, C.; Madsen, M.W.; Kayacan, E. Score-Based Anomaly Detection for Smart Manufacturing Systems.
IEEE/ASME Trans. Mechatron. 2022. [CrossRef]
13. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. (CSUR) 2009, 41, 1–58. [CrossRef]
14. Abdulhammed, R.; Faezipour, M.; Abuzneid, A.; AbuMallouh, A. Deep and machine learning approaches for anomaly-based
intrusion detection of imbalanced network traffic. IEEE Sensors Lett. 2018, 3, 1–4. [CrossRef]
15. Li, Y.; Peng, X.; Zhang, J.; Li, Z.; Wen, M. DCT-GAN: Dilated Convolutional Transformer-based GAN for Time Series Anomaly
Detection. IEEE Trans. Knowl. Data Eng. 2021. [CrossRef]
16. Zhang, H.; Xia, Y.; Yan, T.; Liu, G. Unsupervised Anomaly Detection in Multivariate Time Series through Transformer-based
Variational Autoencoder. In Proceedings of the 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24
May 2021; pp. 281–286.
17. Han, X.; Chen, K.; Zhou, Y.; Qiu, M.; Fan, C.; Liu, Y.; Zhang, T. A Unified Anomaly Detection Methodology for Lane-
Following of Autonomous Driving Systems. In Proceedings of the Intl Conf on Parallel & Distributed Processing with
Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking
(ISPA/BDCloud/SocialCom/SustainCom), New York, NY, USA, 30 September–3 October 2021; pp. 836–844.
18. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer
learning with a unified text-to-text transformer. arXiv 2019, arXiv:1910.10683.
19. Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-lm: Training multi-billion parameter language
models using model parallelism. arXiv 2019, arXiv:1909.08053.
20. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
21. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You
Need. Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
(accessed on 12 June 2022).
23. Zhou, T.; Li, L.; Li, X.; Feng, C.M.; Li, J.; Shao, L. Group-wise learning for weakly supervised semantic segmentation. IEEE Trans.
Image Process. 2021, 31, 799–811. [CrossRef]
24. Zhou, T.; Qi, S.; Wang, W.; Shen, J.; Zhu, S.C. Cascaded parsing of human-object interaction recognition. IEEE Trans. Pattern Anal.
Mach. Intell. 2021, 44, 2827–2840. [CrossRef]
25. Park, D.; Hoshi, Y.; Kemp, C.C. A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational
autoencoder. IEEE Robot. Autom. Lett. 2018, 3, 1544–1551. [CrossRef]
26. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600.
27. Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative
latent embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
14–19 June 2020; pp. 4183–4192.
28. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection
and localization. In Proceedings of the International Conference on Pattern Recognition, Virtual Event, 10–15 January 2021;
pp. 475–489.
29. Kim, P.Y.; Iftekharuddin, K.M.; Davey, P.G.; Tóth, M.; Garas, A.; Holló, G.; Essock, E.A. Novel fractal feature-based multiclass
glaucoma detection and progression prediction. IEEE J. Biomed. Health Inform. 2013, 17, 269–276. [CrossRef]
30. Nightingale, K.R.; Rouze, N.C.; Rosenzweig, S.J.; Wang, M.H.; Abdelmalek, M.F.; Guy, C.D.; Palmeri, M.L. Derivation and
analysis of viscoelastic properties in human liver: Impact of frequency on fibrosis and steatosis staging. IEEE Trans. Ultrason.
Ferroelectr. Freq. Control. 2015, 62, 165–175. [CrossRef] [PubMed]
31. Abati, D.; Porrello, A.; Calderara, S.; Cucchiara, R. Latent space autoregression for novelty detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 481–490.
32. Wu, Y.; Balaji, Y.; Vinzamuri, B.; Feizi, S. Mirrored autoencoders with simplex interpolation for unsupervised anomaly detection.
arXiv 2020, arXiv:2003.10713.
33. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation
through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357.
34. Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings
of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 357–366.
35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October
2021; pp. 10012–10022.