
DIAD

The document presents DiAD, a novel diffusion-based framework for multi-class anomaly detection that addresses challenges in preserving image categories and structural integrity during reconstruction. It incorporates a Semantic-Guided network and a Spatial-aware Feature Fusion block to enhance reconstruction accuracy and maintain semantic information. Experimental results demonstrate that DiAD outperforms state-of-the-art methods on benchmark datasets, achieving significant improvements in anomaly localization and detection metrics.


DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection

Haoyang He1*, Jiangning Zhang2*, Hongxu Chen1, Xuhai Chen1, Zhishan Li1, Xu Chen2, Yabiao Wang2, Chengjie Wang2, Lei Xie1†
1 Zhejiang University, 2 Youtu Lab, Tencent
arXiv:2312.06607v1 [cs.CV] 11 Dec 2023

Abstract
Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. First, the SG network is proposed for reconstructing anomalous regions while preserving the original image's semantic information. Second, we introduce a Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Third, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on the MVTec-AD and VisA datasets demonstrate the effectiveness of our approach, which surpasses the state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on the multi-class MVTec-AD dataset. Code will be available at https://lewandofskee.github.io/projects/diad.

* Equal contribution.
† Corresponding author.

Figure 1: An analysis of different diffusion models for multi-class anomaly detection. The image above shows various denoising network architectures, while the images below demonstrate the results reconstructed by different methods for the same input image. a) DDPM suffers from categorical errors. b) LDM exhibits semantic errors. c) Our approach effectively reconstructs the anomalous regions while preserving the semantic information of the original image.

Introduction

Anomaly detection is a crucial task in computer vision and industrial applications (Tao et al. 2022; Salehi et al. 2022; Liu et al. 2023). The goal of visual anomaly detection is to determine anomalous images and locate the regions of anomaly accurately. Existing anomaly detection models (Liznerski et al. 2021; Yi and Yoon 2020; Yu et al. 2021) mostly correspond to one class, which requires a large amount of storage space and training time as the number of classes increases. Therefore, there is an urgent need for an unsupervised multi-class anomaly detection model that is robust and stable.

The current mainstream unsupervised anomaly detection methods can be divided into three categories: synthesizing-based (Zavrtanik, Kristan, and Skočaj 2021a; Li et al. 2021), embedding-based (Defard et al. 2021; Roth et al. 2022; Xie et al. 2023) and reconstruction-based (Liu et al. 2022; Liang et al. 2023) methods. The central concept of the reconstruction-based method is that during the training phase, the model only learns from normal images. During the testing phase, the trained model reconstructs abnormal images into normal ones. Therefore, by comparing the reconstructed image with the input image, we can determine the location of anomalies. Traditional reconstruction-based methods, including AEs (Zavrtanik, Kristan, and Skočaj 2021b), VAEs (Kingma and Welling 2022), and GANs (Liang et al. 2023; Yan et al. 2021), can learn the distribution of normal samples and reconstruct abnormal regions during the testing phase. However, these models have limited reconstruction capabilities and cannot reconstruct complicated textures and objects well, especially large-scale defects or disappearances, as shown in Figure 1. Hence, models with stronger reconstruction capability are required to effectively tackle multi-class anomaly detection.

Recently, the diffusion models (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Zhang and Agrawala 2023) have demonstrated their powerful image-generation capability. However, directly using current mainstream diffusion
models cannot effectively address multi-class anomaly detection problems. 1) For the Denoising Diffusion Probabilistic Model (DDPM) (Ho, Jain, and Abbeel 2020) in Fig. 1-(a), in the multi-class setting this method may generate images of the wrong category. The reason is that after adding T timesteps of noise to the input image, the image has lost its original class information. During inference, denoising is performed based on this Gaussian noise-like distribution, which may generate samples belonging to different categories. 2) The Latent Diffusion Model (LDM) (Rombach et al. 2022) has an embedder as a class condition, as shown in Fig. 1-(b), and therefore does not have the misclassification problem found in DDPM. However, LDM still cannot address the issue of semantic loss in generated images: it is unable to simultaneously preserve the semantic information of the input image while reconstructing the anomalous regions. For example, it may fail to maintain direction consistency with the input image for objects like screws and hazelnuts, and may exhibit substantial differences from the original image for texture-class images.

To address the aforementioned problems, we propose a diffusion-based framework, DiAD, for multi-class anomaly detection and localization, illustrated in Fig. 2, which comprises three components: a pixel-space autoencoder, a latent-space denoising network and a feature-space ImageNet pre-trained model. To effectively maintain consistent semantic information with the original image while reconstructing the location of anomalous regions, we propose the Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network in LDM. To further enhance the capability of preserving fine details in the original image and reconstructing large defects, we propose the Spatial-aware Feature Fusion (SFF) block to integrate features at different scales. Finally, the reconstructed and input images are passed through a pre-trained model to extract features at different scales and compute anomaly scores. We summarize our contributions as follows:
• We propose a novel diffusion-based framework DiAD for multi-class anomaly detection, which is the first to tackle the problem of existing denoising networks of diffusion-based methods failing to correctly reconstruct anomalies.
• We construct an SG network connecting to the SD denoising network to maintain consistent semantic information and reconstruct the anomalies.
• We propose an SFF block to integrate features from different scales to further improve the anomaly reconstruction ability.
• Abundant experiments demonstrate the superiority of DiAD over SOTA methods, e.g., we surpass the multi-class anomaly detection diffusion-based method by 20.6↑/11.7↑ in pixel/image AUROC and the non-diffusion method by 9.2↑ in pixel-AP and 0.7↑ in image-AUROC on the MVTec-AD dataset.

Related work

Diffusion model. The diffusion model has gained widespread attention and research interest because of its remarkable reconstruction ability. It has demonstrated excellent performance in various applications such as image generation (Zhang and Agrawala 2023), video generation (Ho et al. 2022), object detection (Chen et al. 2022), image segmentation (Amit et al. 2022), etc. LDM (Rombach et al. 2022) introduces conditions through cross-attention to control generation. However, it fails to accurately reconstruct images that contain the original semantic information.

Anomaly detection. AD contains a variety of different settings, e.g., open-set (Ding, Pang, and Shen 2022), noisy learning (Tan et al. 2021; Yoon et al. 2022), zero-/few-shot (Huang et al. 2022; Jeong et al. 2023; Cao et al. 2023; Chen, Han, and Zhang 2023; Chen et al. 2023b; Zhang et al. 2023b), 3D AD (Wang et al. 2023; Chen et al. 2023a), etc. This paper studies general unsupervised anomaly detection, which can primarily be categorized into three major methodologies:

1) Synthesizing-based methods synthesize anomalies on normal image samples. During the training phase, both normal images and synthetically generated abnormal images are input into the network for training, which aids in anomaly detection and localization. DRAEM (Zavrtanik, Kristan, and Skočaj 2021a) consists of an end-to-end network composed of a reconstruction network and a discriminative sub-network, which synthesizes and generates just-out-of-distribution phenomena. However, due to the diversity and unpredictability of anomalies in real-world scenarios, it is impossible to synthesize all types of anomalies.

2) Embedding-based methods encode the original image's three-dimensional information into a multidimensional feature space (Roth et al. 2022; Cao et al. 2022; Gu et al. 2023). Most methods employ networks (He et al. 2016; Tan and Le 2019; Zhang et al. 2022, 2023c; Wu et al. 2023) pre-trained on ImageNet (Deng et al. 2009) for feature extraction. RD4AD (Deng and Li 2022) utilizes a WideResNet50 (Zagoruyko and Komodakis 2016) as the teacher model for feature extraction and employs a structurally identical network in reverse as the student model, computing the cosine similarity of corresponding features as anomaly scores. However, due to significant differences between industrial images and the data distribution in ImageNet, the extracted features might not be suitable for industrial anomaly detection purposes.

3) Reconstruction-based methods aim to train a model on a dataset without anomalies. The model learns to identify patterns and characteristics in the normal data. OCR-GAN (Liang et al. 2023) decouples images into different frequencies and uses a GAN for reconstruction. EdgRec (Liu et al. 2022) achieves good reconstruction results by first synthesizing anomalies and then extracting grayscale edge information from images, which is ultimately input into a reconstruction network. However, there are certain limitations in the reconstruction of large-area anomalies. Moreover, the accuracy of anomaly localization is also not sufficient.

Recently, some studies have applied diffusion models to anomaly detection. AnoDDPM (Wyatt et al. 2022) is the first approach to employ a diffusion model for medical anomaly detection. DiffusionAD (Zhang et al. 2023a) utilizes an
[Figure 2 diagram. Panels: Feature Space, Pixel Space, Latent Space. Labels: Input Image, Diffusion Forward Process, Semantic-Guided Network, Denoising Network (×T), Reconstruction Image. Legend: Train/Test, Test only, Frozen, Add.]

Figure 2: Framework of the proposed DiAD, which contains three parts: 1) a pixel-space autoencoder $\{E, D\}$; 2) a latent-space Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network; and 3) a feature-space pre-trained feature extractor $\Psi$. During training, the input $x_0$ and the latent variable $z_T$ are fed into the SG network and the SD denoising network, respectively. The MSE loss between the output noise and the input noise is calculated and used for gradient optimization. During testing, $x_0$ and the reconstructed image $\hat{x}_0$ are fed into the same pre-trained feature extraction network to obtain feature maps $\{f_1, f_2, f_3\}$ of different scales, and their anomaly scores $S$ are calculated.

anomaly synthetic strategy to generate anomalous samples and labels, along with two sub-networks dedicated to the tasks of denoising and segmentation. DDAD (Mousakhan, Brox, and Tayyub 2023) employs a score-based pre-trained diffusion model to generate normal samples while fine-tuning the pre-trained feature extractor to achieve domain transfer. However, these approaches only add limited steps of noise and perform few denoising steps, which makes them unable to reconstruct large-scale defects.

To overcome the aforementioned problems, we propose a diffusion-based framework DiAD for multi-class anomaly detection, which is the first to tackle the problem of existing diffusion-based methods failing to correctly reconstruct anomalies.

Preliminaries

Denoising Diffusion Probabilistic Model. The Denoising Diffusion Probabilistic Model (DDPM) consists of two processes: the forward diffusion process and the reverse denoising process. During the forward process, a noisy sample $x_t$ is generated using a Markov chain that incrementally adds Gaussian-distributed noise to an initial data sample $x_0$. The forward diffusion process can be characterized as follows:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I), \tag{1}$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i = \prod_{i=1}^{t} (1 - \beta_i)$ and $\beta_i$ represents the noise schedule used to regulate the quantity of noise added at each timestep.

In the reverse denoising process, $x_T$ is first sampled from Eq. 1 and $x_{t-1}$ is reconstructed from $x_t$ and the model prediction $\epsilon_\theta(x_t, t)$ with the formulation:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \tag{2}$$

where $z \sim \mathcal{N}(0, I)$, $\sigma_t$ is a fixed constant related to the variance schedule, $\epsilon_\theta(x_t, t)$ is a U-Net (Ronneberger, Fischer, and Brox 2015) network that predicts the noise, and $\theta$ is the learnable parameter, which can be optimized as:

$$\min_\theta \; \mathbb{E}_{x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I),\, t} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|_2^2. \tag{3}$$

Latent Diffusion Model. The Latent Diffusion Model (LDM) focuses on the low-dimensional latent space with conditioning mechanisms. LDM consists of a pre-trained autoencoder model and a denoising U-Net-like attention-based network. The network compresses images using an encoder, conducts diffusion and denoising operations in the latent representation space, and subsequently reconstructs the images back to the original pixel space using a decoder. The training optimization objective is:

$$L_{LDM} = \mathbb{E}_{z_0,\, t,\, c,\, \epsilon \sim \mathcal{N}(0, 1)} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2 \right], \tag{4}$$

where $c$ represents the conditioning mechanisms, which can consist of multimodal types such as text or image and are connected to the model through a cross-attention mechanism, and $z_t$ represents the latent space variable.

Method

The proposed pipeline DiAD is shown in Fig. 2. First, the pre-trained encoder downsamples the input image into a latent-space representation. Then, noise is added to the latent representation, followed by the denoising process using an SD denoising network with a connection to the SG network. The denoising process is repeated for the same number of timesteps as the diffusion process. Finally, the reconstructed latent representation is restored to the original image level using the pre-trained decoder. For anomaly detection and localization, the input and reconstructed images are fed into the same pre-trained model to extract features at different scales and calculate the differences between these features.
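The closed-form noising step used in Eq. 1 (and reused in latent space by the method) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the linear $\beta$ schedule and its endpoints are assumptions borrowed from common DDPM practice (the paper does not state a schedule), and the names `linear_beta_schedule`, `cumulative_alphas` and `q_sample` are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced betas. Assumed schedule; the paper does not specify one."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i), returned for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(z0, t, alpha_bars, rng=random):
    """Draw z_t from z_0 in one shot: sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps.

    z0 is a flat list of floats standing in for a latent; t is 1-indexed.
    Returns the noisy sample and the noise that was added.
    """
    a_bar = alpha_bars[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
          for z, e in zip(z0, noise)]
    return zt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
z0 = [0.5, -1.0, 2.0]
zt, eps = q_sample(z0, t=1000, alpha_bars=alpha_bars, rng=random.Random(0))
```

Under this assumed schedule the signal coefficient $\sqrt{\bar{\alpha}_T}$ is close to zero at $T = 1000$, so the fully noised sample carries almost nothing of the input, which is consistent with the loss of class information described for DDPM.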
Semantic-Guided Network

As discussed earlier, DDPM and LDM each have specific problems when addressing multi-class anomaly detection tasks. In response to these issues and the multi-class task itself, we propose an SG network to address LDM's inability to effectively reconstruct anomalies while preserving the semantic information of the input image.

Given an input image $x_0 \in \mathbb{R}^{3 \times H \times W}$ in pixel space, the pre-trained encoder $E$ encodes $x_0$ into a latent-space representation $z = E(x_0)$, where $z \in \mathbb{R}^{c \times h \times w}$. Similar to Eq. 1, with the pixel-space $x$ replaced by the latent representation $z$, the forward diffusion process can now be characterized as follows:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I). \tag{5}$$

The perturbed representation $z_T$ and the input $x_0$ are simultaneously fed into the SD denoising network and the SG network, respectively. After $T$ steps of the reverse denoising process, the final variable $\hat{z}$ is restored to the reconstructed image $\hat{x}_0$ by the pre-trained decoder $D$, giving $\hat{x}_0 = D(\hat{z})$. The training objective of DiAD is:

$$L_{DiAD} = \mathbb{E}_{z_0,\, t,\, c_i,\, \epsilon \sim \mathcal{N}(0, 1)} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c_i) \right\|_2^2 \right]. \tag{6}$$

The denoising network consists of a pre-trained SD denoising network and an SG network that replicates the SD parameters for initialization, as shown in Fig. 2. The pre-trained SD denoising network comprises four encoder blocks, one middle block and four decoder blocks. Here, 'block' means a frequently utilized unit in the construction of the neural network layer, e.g., 'resnet' block, transformer block, multi-head cross attention block, etc.

The input image $x_0 \in \mathbb{R}^{3 \times H \times W}$ is transformed into $x \in \mathbb{R}^{d \times h \times w}$ by a set of 'conv-silu' layers $C$ in the SG network, in order to keep the same dimension as the latent representations in SD Encoder Block 1, $E_{SD1}$. Then, the sum of $x$ and $z$ is input into the SG Encoder Blocks (SGEBs). After continuous downsampling by the encoder $E_{SG}$, the results pass through the SG middle block $M_{SG}$ and are then added to the output of the SD middle block $M_{SD}$. Additionally, to address multi-class tasks of different scenarios and categories, the results of the SG Decoder Blocks (SGDBs) $D_{SG}$ are also added to the results of the SD decoder $D_{SD}$, combined with an SFF block that is explained in the next section. The output $G$ of the denoising network is characterized as:

$$\begin{aligned} G = {} & D_{SD}\big(M_{SD}(E_{SD}(z_t)) + M_{SG}(E_{SD}(z + C(x_0)))\big) \\ & + D_{SG_j}\big(M_{SG}(E_{SD}(z + C(x_0)))\big), \end{aligned} \tag{7}$$

where $z$ represents the noise-perturbed latent representation, $x_0$ represents the input image, $C(\cdot)$ represents a set of 'conv-silu' layers in the SG network, $E_{SD}(\cdot)$ represents all the SD encoder blocks (SDEBs), $E_{SG}(\cdot)$ represents all the SGEBs, $M_{SG}(\cdot)$ and $M_{SD}(\cdot)$ represent the SG and SD middle blocks respectively, $D_{SD}(\cdot)$ represents all the SDDBs and $D_{SG_j}(\cdot)$ represents the SGDBs for the $j$-th blocks.

Spatial-aware Feature Fusion Block

When adding several layers of decoder blocks from SGEBs to SDDBs during the experiment, as shown in Table 7, we found it challenging to solve multi-class anomaly detection. This is because the dataset contains various types, such as objects and textures. For texture-related cases, the anomalies are generally smaller, so it is necessary to preserve their original textures. On the other hand, for object-related cases the defects often cover larger areas, requiring stronger reconstruction capabilities. Therefore, it is extremely challenging to simultaneously preserve the normal information of the original samples and reconstruct the abnormal locations in different scenarios.

[Figure 3 diagram. Conv Block = Conv2d 3×3, Normalization, Activation. Layers shown: SGEB 3_1 to 3_3, SGEB 4_1 to 4_3, SGDB 4_1 to 4_3, combined by Add.]

Figure 3: Schematic diagram of SFF block. Each layer in SGDB4 is obtained by adding the corresponding SGEB4 to every SGEB3 with Conv Block performed.

Hence, we propose a Spatial-aware Feature Fusion (SFF) block with the aim of integrating high-scale semantic information into the low scale. This ultimately enables the model to both preserve the information of the original normal samples and reconstruct large-scale abnormal regions. The structure of the SFF block is shown in Fig. 3. Each SGEB consists of three sub-layers. Therefore, the SFF block integrates the features of each layer in SGEB3 into each layer in SGEB4 and adds the fused features to the original features. The final output of each layer of SGDB4 is:

$$Q_i = P_i + \sum_{j=1}^{J} F(H_j), \tag{8}$$

where $P_i$ represents the low-scale output features of the $i$-th layer of SGEB4, $Q_i$ represents the final low-scale output features of the $i$-th layer of SGDB4, $H_j$ represents the high-scale output features of the $j$-th layer of SGEB3, $J = 3$ indicates the three layers of SGEB3 used in the experiment and $F(\cdot)$ represents a basic convolutional block, which consists of a 3×3 convolution layer followed by a normalization layer and an activation layer.
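The fusion rule of Eq. 8 can be illustrated with a small, dependency-free sketch. It is not the authors' implementation and simplifies heavily: feature maps are single-channel 2D lists, the kernels are fixed rather than learned, and every $H_j$ is assumed to already have the same spatial size as $P_i$ (the paper does not describe how resolutions are matched). Instance normalization and SiLU are used for the normalization and activation layers, following the choice the paper reports; all function names are hypothetical.

```python
import math


def conv3x3(fmap, kernel):
    """3x3 cross-correlation with zero padding on a 2D list of floats."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * fmap[rr][cc]
            out[r][c] = acc
    return out


def instance_norm(fmap, eps=1e-5):
    """Normalize one feature map with its own mean and variance."""
    values = [v for row in fmap for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in fmap]


def silu(fmap):
    """SiLU activation, x * sigmoid(x), applied element-wise."""
    return [[v / (1.0 + math.exp(-v)) for v in row] for row in fmap]


def conv_block(fmap, kernel):
    """F(.) in Eq. 8: 3x3 convolution, then normalization, then activation."""
    return silu(instance_norm(conv3x3(fmap, kernel)))


def sff_fuse(p_i, high_scale_maps, kernels):
    """Q_i = P_i + sum_j F(H_j), with one kernel per H_j."""
    q = [row[:] for row in p_i]
    for h_j, kernel in zip(high_scale_maps, kernels):
        fused = conv_block(h_j, kernel)
        for r in range(len(q)):
            for c in range(len(q[0])):
                q[r][c] += fused[r][c]
    return q


identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
p_i = [[1.0, 2.0], [3.0, 4.0]]
h_maps = [[[1.0, 2.0], [3.0, 4.0]]] * 3  # J = 3 layers of SGEB3
q_i = sff_fuse(p_i, h_maps, [identity] * 3)
```

Because the fused term is added residually, $P_i$ passes through unchanged wherever $F(H_j)$ contributes nothing, which matches the stated aim of preserving the original features while adding information from the other scale.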
| Type | Category | PaDiM | MKD | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Objects | Bottle | 97.9/- | 98.7/- | 97.5/99.2/96.1 | 99.6/99.9/98.4 | 99.7/100./100. | 63.6/71.8/86.3 | 93.8/98.7/93.7 | 99.7/96.5/91.8 |
| Objects | Cable | 70.9/- | 78.2/- | 57.8/74.0/76.3 | 84.1/89.5/82.5 | 95.2/95.9/88.0 | 55.6/69.7/76.0 | 55.7/74.8/77.7 | 94.8/98.8/95.2 |
| Objects | Capsule | 73.4/- | 68.3/- | 65.3/92.5/90.4 | 94.1/96.9/96.9 | 86.9/97.8/94.4 | 52.9/82.0/90.5 | 60.5/81.4/90.5 | 89.0/97.5/95.5 |
| Objects | Hazelnut | 85.5/- | 97.1/- | 93.7/97.5/92.3 | 60.8/69.8/86.4 | 99.8/100./99.3 | 87.0/90.4/88.1 | 93.0/95.8/89.8 | 99.5/99.7/97.3 |
| Objects | Metal Nut | 88.0/- | 64.9/- | 72.8/95.0/92.0 | 100./100./99.5 | 99.2/99.9/99.5 | 60.0/74.4/89.4 | 53.0/80.1/89.4 | 99.1/96.0/91.6 |
| Objects | Pill | 68.8/- | 79.7/- | 82.2/94.9/92.4 | 97.5/99.6/96.8 | 93.7/98.7/95.7 | 55.8/84.0/91.6 | 62.1/93.1/91.6 | 95.7/98.5/94.5 |
| Objects | Screw | 56.9/- | 75.6/- | 92.0/95.7/89.9 | 97.7/99.3/95.8 | 87.5/96.5/89.0 | 53.6/71.9/85.9 | 58.7/81.9/85.6 | 90.7/99.7/97.9 |
| Objects | Toothbrush | 95.3/- | 75.3/- | 90.6/96.8/90.0 | 97.2/99.0/94.7 | 94.2/97.4/95.2 | 57.5/68.0/83.3 | 78.6/83.9/83.3 | 99.7/99.9/99.2 |
| Objects | Transistor | 86.6/- | 73.4/- | 74.8/77.4/71.1 | 94.2/95.2/90.0 | 99.8/98.0/93.8 | 57.8/44.6/57.1 | 61.0/57.8/59.1 | 99.8/99.6/97.4 |
| Objects | Zipper | 79.7/- | 87.4/- | 98.8/99.9/99.2 | 99.5/99.9/99.2 | 95.8/99.5/97.1 | 64.9/77.4/88.1 | 73.6/89.5/90.6 | 95.1/99.1/94.4 |
| Textures | Carpet | 93.8/- | 69.8/- | 98.0/99.1/96.7 | 98.5/99.6/97.2 | 99.8/99.9/99.4 | 95.5/98.7/91.0 | 99.4/99.8/99.4 | 99.4/99.9/98.3 |
| Textures | Grid | 73.9/- | 83.8/- | 99.3/99.7/98.2 | 98.0/99.4/96.5 | 98.2/99.5/97.3 | 83.5/93.9/86.9 | 67.3/82.6/84.4 | 98.5/99.8/97.7 |
| Textures | Leather | 99.9/- | 93.6/- | 98.7/99.3/95.0 | 100./100./100. | 100./100./100. | 98.4/99.5/96.3 | 97.4/99.0/96.3 | 99.8/99.7/97.6 |
| Textures | Tile | 93.3/- | 89.5/- | 99.8/100./100. | 98.3/99.3/96.4 | 99.3/99.8/98.2 | 93.6/97.5/92.0 | 97.1/98.7/94.1 | 96.8/99.9/98.4 |
| Textures | Wood | 98.4/- | 93.4/- | 99.8/100./100. | 99.2/99.8/98.3 | 98.6/99.6/96.6 | 98.6/99.6/97.5 | 97.8/99.4/95.9 | 99.7/100./100. |
| | Mean | 84.2/- | 81.9/- | 88.1/94.7/92.0 | 94.6/96.5/95.2 | 96.5/98.8/96.2 | 71.9/81.6/86.6 | 76.6/87.8/88.1 | 97.2/99.0/96.5 |

Table 1: Comparison with SOTA methods on MVTec-AD dataset for multi-class anomaly detection with AUROC_cls/AP_cls/F1max_cls metrics. PaDiM, MKD, DRAEM, RD4AD and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based methods.
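For readers less familiar with the metrics reported in the tables, the sketch below shows one straightforward way to compute AUROC and F1-max from anomaly scores and binary labels. It is an illustrative reference only: the paper does not say which implementation was used, the function names are hypothetical, and the quadratic-time loops are chosen for clarity rather than speed. The tables report these values multiplied by 100.

```python
def auroc(scores, labels):
    """Probability that a random anomalous sample scores above a random
    normal one, counting ties as half. labels: 1 = anomalous, 0 = normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_max(scores, labels):
    """Best F1 over all thresholds, predicting 'anomalous' when score >= thr."""
    best = 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: two normal and two anomalous samples.
scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
print(auroc(scores, labels), f1_max(scores, labels))
```

The same functions apply at pixel level (the seg metrics) by flattening each anomaly map and its ground-truth mask into one long list of scores and labels.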

As Batch Normalization (BN) (Ioffe and Szegedy 2015) considers the normalization statistics of all images within a batch, it leads to a loss of unique details in each sample. BN is suitable for a relatively large mini-batch scenario with similar data distributions. However, in multi-class anomaly detection there are significant differences in data distributions among different categories, so normalizing over the entire batch is not suitable. Since the results generated by using SD mainly depend on the input image instance, using Instance Normalization (IN) (Ulyanov, Vedaldi, and Lempitsky 2017) can not only accelerate model convergence but also maintain the independence between each image instance. In addition, for the activation function, we use SiLU (Elfwing, Uchibe, and Doya 2018) instead of the commonly used ReLU (Hahnloser et al. 2000), which can preserve more input information. Experimental results in Table 7 show that the performance is improved by using IN and SiLU simultaneously instead of the combination of BN and ReLU.

Anomaly localization and detection

During the inference stage, the reconstructed image is obtained through the diffusion and denoising process in the latent space. For anomaly localization and detection, we use the same ImageNet pre-trained feature extractor $\Psi$ to extract features from both the input image $x_0$ and the reconstructed image $\hat{x}_0$ and calculate the anomaly map $M_n$ on feature maps of different scales using cosine similarity:

$$M_n(x_0, \hat{x}_0) = 1 - \frac{\Psi_n(x_0)^{T} \cdot \Psi_n(\hat{x}_0)}{\left\| \Psi_n(x_0) \right\| \left\| \Psi_n(\hat{x}_0) \right\|}, \tag{9}$$

where $n$ represents the $n$-th feature layer $f_n$. The anomaly score $S$ for an input pair for anomaly localization is:

$$S = \sum_{n \in N} \sigma_n M_n(x_0, \hat{x}_0), \tag{10}$$

where $\sigma_n$ indicates the upsampling factor used to keep the same dimension as the pixel-space image and $N$ indicates the number of feature layers used during inference.

| Metrics | DRAEM | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|
| AUROC_cls | 79.1 | 85.5 | 54.5 | 56.7 | 86.8 |
| AP_cls | 81.9 | 85.5 | 57.9 | 61.4 | 88.3 |
| F1max_cls | 78.9 | 84.4 | 72.3 | 73.1 | 85.1 |
| AUROC_seg | 91.3 | 95.9 | 79.7 | 86.6 | 96.0 |
| AP_seg | 23.5 | 21.0 | 2.2 | 6.0 | 26.1 |
| F1max_seg | 29.5 | 27.0 | 4.5 | 9.9 | 33.0 |
| PRO | 58.8 | 75.6 | 46.8 | 55.0 | 75.2 |

Table 2: Quantitative comparisons on VisA dataset. DRAEM and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based methods.

Experiment

Datasets and evaluation metrics

MVTec-AD dataset. The MVTec-AD (Bergmann et al. 2019) dataset simulates real-world industrial production scenarios, filling the gap in unsupervised anomaly detection. It consists of 5 types of textures and 10 types of objects, in 5,354 high-resolution images from different domains. The training set contains 3,629 images with only anomaly-free samples. The test set consists of 1,725 images, including both normal and abnormal samples. Pixel-level annotations are provided for the anomaly localization evaluation.

VisA dataset. The VisA (Zou et al. 2022) dataset consists of a total of 10,821 high-resolution images, including 9,621 normal images and 1,200 anomaly images with 78 types of anomalies. The VisA dataset comprises 12 subsets, each corresponding to a distinct object. The 12 objects can be categorized into three different object types: Complex structure, Multiple instances, and Single instance.

MVTec-3D dataset. The MVTec-3D (Bergmann et al. 2022) dataset comprises 4,147 scans obtained using a high-resolution industrial 3D sensor. It consists of 10 categories
| Type | Category | PaDiM | MKD | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Objects | Bottle | 96.1/- | 91.8/- | 87.6/62.5/56.9 | 97.8/68.2/67.6 | 98.1/66.0/69.2 | 59.9/4.9/11.7 | 86.9/49.1/50.0 | 98.4/52.2/54.8 |
| Objects | Cable | 81.0/- | 89.3/- | 71.3/14.7/17.8 | 85.1/26.3/33.6 | 97.3/39.9/45.2 | 66.5/6.7/10.6 | 89.3/18.5/26.2 | 96.8/50.1/57.8 |
| Objects | Capsule | 96.9/- | 88.3/- | 50.5/6.0/10.0 | 98.8/43.4/50.0 | 98.5/42.7/46.5 | 63.1/6.2/9.7 | 90.0/7.9/27.3 | 97.1/42.0/45.3 |
| Objects | Hazelnut | 96.3/- | 91.2/- | 96.9/70.0/60.5 | 97.9/36.2/51.6 | 98.1/55.2/56.8 | 91.2/24.1/28.3 | 95.1/51.2/53.5 | 98.3/79.2/80.4 |
| Objects | Metal Nut | 84.8/- | 64.2/- | 62.2/31.1/21.0 | 93.8/62.3/65.4 | 94.8/55.5/66.4 | 62.7/14.6/29.2 | 70.5/19.3/30.7 | 97.3/30.0/38.3 |
| Objects | Pill | 87.7/- | 69.7/- | 94.4/59.1/44.1 | 97.5/63.4/65.2 | 95.0/44.0/53.9 | 55.3/4.0/8.4 | 74.9/10.2/15.0 | 95.7/46.0/51.4 |
| Objects | Screw | 94.1/- | 92.1/- | 95.5/33.8/40.6 | 99.4/40.2/44.6 | 98.3/28.7/37.6 | 91.1/1.8/3.8 | 91.7/2.2/4.6 | 97.9/60.6/59.6 |
| Objects | Toothbrush | 95.6/- | 88.9/- | 97.7/55.2/55.8 | 99.0/53.6/58.8 | 98.4/34.9/45.7 | 76.9/4.0/7.7 | 93.7/20.4/9.8 | 99.0/78.7/72.8 |
| Objects | Transistor | 92.3/- | 71.7/- | 64.5/23.6/15.1 | 85.9/42.3/45.2 | 97.9/59.5/64.6 | 53.2/5.8/11.4 | 85.5/25.0/30.7 | 95.1/15.6/31.7 |
| Objects | Zipper | 94.8/- | 86.1/- | 98.3/74.3/69.3 | 98.5/53.9/60.3 | 96.8/40.1/49.9 | 67.4/3.5/7.6 | 66.9/5.3/7.4 | 96.2/60.7/60.0 |
| Textures | Carpet | 97.6/- | 95.5/- | 98.6/78.7/73.1 | 99.0/58.5/60.4 | 98.5/49.9/51.1 | 89.2/18.8/44.3 | 99.1/70.6/66.0 | 98.6/42.2/46.4 |
| Textures | Grid | 71.0/- | 82.3/- | 98.7/44.5/46.2 | 99.2/46.0/47.4 | 96.5/23.0/28.4 | 63.1/0.7/1.9 | 52.4/1.1/1.9 | 96.6/66.0/64.1 |
| Textures | Leather | 84.8/- | 96.7/- | 97.3/60.3/57.4 | 99.3/38.0/45.1 | 98.8/32.9/34.4 | 97.3/38.9/43.2 | 99.0/45.9/44.0 | 98.8/56.1/62.3 |
| Textures | Tile | 80.5/- | 85.3/- | 98.0/93.6/86.0 | 95.3/48.5/60.5 | 91.8/42.1/50.6 | 87.0/35.2/36.6 | 90.1/43.9/51.6 | 92.4/65.7/64.1 |
| Textures | Wood | 89.1/- | 80.5/- | 96.0/81.4/74.6 | 95.3/47.8/51.0 | 93.2/37.2/41.5 | 84.7/30.9/37.3 | 92.3/44.1/46.6 | 93.3/43.3/43.5 |
| | Mean | 89.5/- | 84.9/- | 87.2/52.5/48.6 | 96.1/48.6/53.8 | 96.8/43.4/49.5 | 75.6/13.3/19.5 | 85.1/27.6/31.0 | 96.8/52.6/55.5 |

Table 3: Comparison with SOTA methods on MVTec-AD dataset for multi-class anomaly localization with AUROC_seg/AP_seg/F1max_seg metrics. PaDiM, MKD, DRAEM, RD4AD and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based methods.

| Method | DRAEM | UniAD | DDPM | LDM | Ours |
|---|---|---|---|---|---|
| PRO | 71.1 | 90.4 | 49.0 | 66.3 | 90.7 |

Table 4: Multi-class anomaly localization results with PRO metric on MVTec-AD datasets. DRAEM and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based methods.

with both RGB images and 3D point clouds. The training set contains 2,656 images with only anomaly-free samples. The test set consists of 1,197 images, including both normal and abnormal samples. Only RGB images are used in this experiment.

Medical dataset. We also merge three medical datasets, BraTS2021 (Baid et al. 2021), BTCV (Landman et al. 2015) and LiTS (Bilic et al. 2023), into one Medical dataset for multi-class anomaly detection. The training set contains 9,042 slices and the test set consists of 5,208 slices.

Evaluation Metrics. Following prior works, Area Under the Receiver Operating Characteristic Curve (AUROC), Average Precision (AP) and F1-score-max (F1max) are used in both anomaly detection and anomaly localization, where cls represents image-level anomaly detection and seg represents pixel-level anomaly localization. Per-Region-Overlap (PRO) is also used in anomaly localization. The DICE score is commonly used in the medical field.

Implementation Details

All images in MVTec-AD and VisA are resized to 256 × 256. For the denoising network, we adopt the 4-th block of SGDB for connection to SDDB. In this experiment, we adopt ResNet50 as the feature extraction network and choose $n \in \{2, 3, 4\}$ as the feature layers used in calculating the anomaly localization. We use the KL method as the autoencoder and fine-tune the model before training the denoising network. We train for 1000 epochs on a single NVIDIA Tesla V100 32GB with a batch size of 12. The Adam optimiser (Loshchilov and Hutter 2019) is used with a learning rate of 1e−5. A Gaussian filter with σ = 5 is used to smooth the anomaly localization score. For anomaly detection, the anomaly score of the image is the maximum value of the average-pooled anomaly localization score, which undergoes 8 rounds of global average pooling operations with a size of 8 × 8. During inference, the initial denoising timestep T is set to 1,000. We use DDIM (Song, Meng, and Ermon 2021) as the sampler with 10 steps by default.

Comparison with SOTAs

We conduct and analyze a range of qualitative and quantitative comparison experiments on the MVTec-AD, VisA, MVTec-3D and Medical datasets. We choose a synthesizing-based method DRAEM (Zavrtanik, Kristan, and Skočaj 2021a), three embedding-based methods MKD (Salehi et al. 2021), PaDiM (Defard et al. 2021) and RD4AD (Deng and Li 2022), a reconstruction-based method EdgRec (Liu et al. 2022), a unified SOTA method UniAD (You et al. 2022) and the diffusion-based DDPM and LDM methods. Specifically, we categorize the aforementioned methods into two types: non-diffusion and diffusion-based methods. For the experiments on the Medical dataset, we follow the BMAD (Bao et al. 2023) benchmark and add two methods, STFPM (Yamada and Hotta 2021) and CFLOW (Gudovskiy, Ishizaka, and Kozuka 2022), for comparison.

Qualitative Results. We conducted substantial qualitative experiments on the MVTec-AD and VisA datasets to visually demonstrate the superiority of our method in image reconstruction and the accuracy of anomaly localization. As shown in Figure 4, our method exhibits better reconstruction capabilities for anomalous regions compared to EdgRec on the MVTec-AD dataset. In comparison to UniAD, shown in Figure 5, our method exhibits more accurate anomaly localization abilities on the VisA dataset. More qualitative results will be presented in the Appendix.

Quantitative Results. As shown in Table 1 and in Ta-
Non-Diffusion Diffusion-based SD MSG SGEB3 SGEB4 BN+ReLU IN+SiLU cls seg
Metrics
DRAEM UniAD DDPM LDM Ours ✓ 79.3 89.5
✓ ✓ 95.1 91.1
AU ROCcls 63.2 78.9 66.3 68.5 84.6
✓ ✓ ✓ 95.3 89.1
APcls 86.1 93.4 78.0 90.6 94.8
✓ ✓ ✓ ✓ 93.8 91.2
F 1maxcls 89.2 91.4 86.6 91.6 95.5 ✓ ✓ ✓ ✓ 96.7 96.7
AU ROCseg 93.2 96.5 90.7 92.2 96.4 ✓ ✓ ✓ ✓ 97.2 96.8
APseg 16.8 21.2 6.0 9.3 25.3
F 1maxseg 20.2 28.0 10.7 13.5 32.2
P RO 55.0 88.1 69.7 73.8 87.8
Table 7: Ablation studies on the design of DiAD with AU-
ROC metrics.
Table 5: Quantitative comparisons on MVTec-3D dataset.
Ablation Studies
The architecture design of DiAD. We investigate the im-
portance of each module in DiAD as shown in Table 7. SD
indicates only the diffusion model without connecting to the
SG network which is the LDM’s architecture. MSG indi-
cates only the middle block of the SG network adding to the
middle of SD. SGEB3 and SGEB4 indicate directly skip-
connecting to the corresponding SDDB. When connecting
SGDB3 and SGDB4 at the same time, more details of the
original images are preserved in terms of texture, but the re-
construction ability for large anomaly areas decreases. Us-
ing the combination of IN+SiLU in the SFF block yields
better results compared to using BN+ReLU.
Effect of pre-trained feature extractors. Table 8 shows the
quantitative comparison of using different pre-trained back-
bones as feature extraction networks. ResNet50 achieved the
Figure 4: Qualitative illustration on MVTec-AD dataset. best performance in anomaly classification metrics, while
WideResNet101 excelled in anomaly segmentation.
Backbone AU ROCcls APcls F 1maxcls AU ROCseg APseg F 1maxseg P RO
ble 3, our method achieves SOTA AUROC/AP/F1max met- VGG
16 91.8 97.2 93.9 92.1 47.2 50.5 80.1
19 91.3 96.9 93.7 92.3 47.5 50.6 80.4
rics of 97.2/99.0/96.5 and 96.8/52.6/55.5 for image-wise and 18 94.7 98.1 96.0 96.0 49.9 53.3 89.1
pixel-wise respectively for multi-class setting on MVTec- ResNet
34
50
95.2
97.2
98.3
99
95.7
96.5
96.2
96.8
51.2
52.6
54.5
55.5
89.6
90.7
AD dataset. For the diffusion-based methods, our approach 101 96.2 98.4 96.5 96.9 52.9 56.4 91.2

significantly outperforms existing DDPM and LDM meth- WideResNet


50
101
95.9
95.6
98.6
98.3
96.5
95.8
96.4
96.9
51.8
54.6
55.1
56.5
89.3
91.4
ods in terms of 11.7↑ in AUROC and 25↑ in AP for anomaly b0 93.5 97.7 94.7 94.0 50.0 52.4 84.0
b2 94.2 98.0 95.1 94.1 48.6 52.1 84.2
localization. For non-diffusion methods, our approach sur- EfficientNet
b4 92.8 97.5 94.8 93.6 47.2 50.7 83.5
passes existing methods in both metrics, especially at the
pixel level, where our method exceeds UniAD by 9.2↑/6.0↑ Table 8: Ablation studies on different feature extractors.
in AP/F1max. Our method has also demonstrated its superi-
ority on VisA dataset, as shown in Table 2. Our approach Effect of feature layers used in anomaly score calculat-
exhibits significant improvements compared to diffusion- ing. After extracting feature maps of 5 different scales using
based methods of 30.1↑/9.4↑ than the LDM method in a pre-trained backbone, the anomaly scores are calculated by
image/pixel AUROC. It also performs well compared to computing the cosine similarity between feature maps from
UniAD by 4.9↑/6.0↑ in pixel AP/F1max metrics. Detailed different layers. The experimental results, as shown in Ta-
experiments for each category are provided in Appendix. We ble 9, indicate that using feature maps from layers f2 , f3 ,
have extended the method to 3D datasets and medical do- and f4 (with corresponding sizes of 64 × 64, 32 × 32, and
main datasets. Table 5 and Table 6 show the effectiveness 16 × 16) yields the best performance.
and scalability of our method on MVtec-3D and Medical
datasets, with results surpassing the state of the art (SOTA). f1 f2 f3 f4 f5 AU ROCcls APcls F 1maxcls AU ROCseg APseg F 1maxseg
✓ ✓ ✓ ✓ ✓ 93.8 97.8 95.0 94.0 42.0 45.9
✓ ✓ ✓ ✓ 96.7 98.7 96.1 96.7 52.5 55.2
✓ ✓ ✓ ✓ 93.4 97.1 93.6 95.2 48.5 51.3
Metrics MKD CFLOW RD4AD PaDiM PatchCore STFPM UniAD Ours ✓ ✓ ✓ 97.1 99.0 96.8 96.4 49.4 53.1
AU ROCcls 70.9 62.0 74.7 64.6 76.0 72.2 76.4 77.2 ✓ ✓ ✓ 97.2 99.0 96.5 96.8 52.6 55.5
AU ROCseg 92.8 93.2 96.2 93.0 96.8 93.4 96.7 96.9 ✓ ✓ 94 97.4 94.2 95.3 48.5 51.7
P RO 79.3 79.0 88.0 79.2 86.6 86.0 87.4 87.7 ✓ ✓ 97.1 99.0 96.8 96.4 49.4 53.1
DICE 21.9 13.5 19.5 15.2 21.7 17.1 28.7 32.3

Table 9: Ablation studies on the feature layers used in calcu-


Table 6: Quantitative comparisons on Medical dataset. lating the anomaly localization score based on ResNet50.
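The feature-layer ablation can be illustrated with a minimal sketch of a multi-scale anomaly map. It assumes that features of the input and of its reconstruction are compared layer by layer with cosine distance, upsampled by nearest-neighbour repetition, and summed; the function names, the upsampling choice and the summation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def cosine_distance_map(feat_a, feat_b, eps=1e-8):
    """Per-location cosine distance between two (C, H, W) feature maps."""
    dot = (feat_a * feat_b).sum(axis=0)
    norms = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0)
    return 1.0 - dot / (norms + eps)


def upsample_nearest(score_map, out_size):
    """Nearest-neighbour upsampling of a square map (assumed choice)."""
    factor = out_size // score_map.shape[0]
    return score_map.repeat(factor, axis=0).repeat(factor, axis=1)


def anomaly_map(feats_input, feats_recon, out_size=256):
    """Sum the upsampled cosine-distance maps over the selected layers."""
    total = np.zeros((out_size, out_size))
    for f_in, f_rec in zip(feats_input, feats_recon):
        total += upsample_nearest(cosine_distance_map(f_in, f_rec), out_size)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for f2, f3, f4 at 64x64, 32x32 and 16x16 (channel count is arbitrary).
    shapes = [(8, 64, 64), (8, 32, 32), (8, 16, 16)]
    feats = [rng.normal(size=s) for s in shapes]
    recon = [f.copy() for f in feats]
    recon[1][:, 3, 5] = -feats[1][:, 3, 5]  # simulate one badly reconstructed location
    print(anomaly_map(feats, recon).max())
```

Restricting the two lists to a subset of layers reproduces the kind of comparison reported in Table 9, with each layer contributing at its own spatial resolution.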
Figure 5: Qualitative results on VisA dataset. Panels: Input, Ours Rec., GT, UniAD Loc., Ours Loc.

Figure 6: Ablation studies on different diffusion timesteps.

Effect of forward diffusion timesteps. Increasing the number of diffusion steps in the forward process impacts the performance of image reconstruction. The experimental results, depicted in Figure 6, indicate that with an increasing number of forward diffusion steps, the image approaches pure Gaussian noise, while the anomaly reconstruction ability improves as well. Nevertheless, when the number of forward diffusion steps is less than 600, a significant decline in performance occurs because the number of steps is insufficient for anomaly reconstruction.

Conclusion
This paper proposes a diffusion-based DiAD framework to address the issue of category and semantic loss in the stable diffusion model for multi-class anomaly detection. We propose the Semantic-Guided network and Spatial-aware Feature Fusion block to better reconstruct the abnormal regions while maintaining the same semantic information as the input image. Our approach achieves state-of-the-art performance on MVTec-AD and VisA datasets, significantly outperforming the non-diffusion and diffusion-based methods.
Limitation. Although our method has demonstrated exceptional performance in reconstructing anomalies, it can be susceptible to the influence of background impurities, resulting in errors in localization and classification. In the future, we will further explore diffusion models and enhance the background's anti-interference capability for multi-class anomaly detection. Additionally, we will incorporate multi-modal assistance in our anomaly detection. Lastly, we will utilize larger models to enhance reconstruction performance.

References
Amit, T.; Shaharbany, T.; Nachmani, E.; and Wolf, L. 2022. SegDiff: Image Segmentation with Diffusion Probabilistic Models. arXiv:2112.00390.
Baid, U.; Ghodasara, S.; Mohan, S.; Bilello, M.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F. C.; Pati, S.; et al. 2021. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314.
Bao, J.; Sun, H.; Deng, H.; He, Y.; Zhang, Z.; and Li, X. 2023. BMAD: Benchmarks for Medical Anomaly Detection. arXiv preprint arXiv:2306.11876.
Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In CVPR, 9592–9600.
Bergmann, P.; Jin, X.; Sattlegger, D.; and Steger, C. 2022. The MVTec 3D-AD Dataset for Unsupervised 3D Anomaly Detection and Localization. In VISGRAPP. SCITEPRESS - Science and Technology Publications.
Bilic, P.; Christ, P.; Li, H. B.; Vorontsov, E.; Ben-Cohen, A.; Kaissis, G.; Szeskin, A.; Jacobs, C.; Mamani, G. E. H.; Chartrand, G.; et al. 2023. The liver tumor segmentation benchmark (lits). Medical Image Analysis, 84: 102680.
Cao, Y.; Wan, Q.; Shen, W.; and Gao, L. 2022. Informative knowledge distillation for image anomaly segmentation. Knowledge-Based Systems, 248: 108846.
Cao, Y.; Xu, X.; Sun, C.; Cheng, Y.; Du, Z.; Gao, L.; and Shen, W. 2023. Segment Any Anomaly without Training via Hybrid Prompt Regularization. arXiv preprint arXiv:2305.10724.
Chen, R.; Xie, G.; Liu, J.; Wang, J.; Luo, Z.; Wang, J.; and Zheng, F. 2023a. Easynet: An easy network for 3d industrial anomaly detection. In ACM MM, 7038–7046.
Chen, S.; Sun, P.; Song, Y.; and Luo, P. 2022. DiffusionDet: Diffusion Model for Object Detection. arXiv:2211.09788.
Chen, X.; Han, Y.; and Zhang, J. 2023. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD. arXiv preprint arXiv:2305.17382.
Chen, X.; Zhang, J.; Tian, G.; He, H.; Zhang, W.; Wang, Y.; Wang, C.; Wu, Y.; and Liu, Y. 2023b. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.00453.
Defard, T.; Setkov, A.; Loesch, A.; and Audigier, R. 2021. Padim: a patch distribution modeling framework for anomaly detection and localization. In ICPR, 475–489. Springer.
Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In CVPR, 9737–9746.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
Ding, C.; Pang, G.; and Shen, C. 2022. Catching both gray and black swans: Open-set supervised anomaly detection. In CVPR, 7388–7398.
Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107: 3–11.
Gu, Z.; Liu, L.; Chen, X.; Yi, R.; Zhang, J.; Wang, Y.; Wang, C.; Shu, A.; Jiang, G.; and Ma, L. 2023. Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection. In ICCV, 16401–16409.
Gudovskiy, D.; Ishizaka, S.; and Kozuka, K. 2022. Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In WACV, 98–107.
Hahnloser, R. H.; Sarpeshkar, R.; Mahowald, M. A.; Douglas, R. J.; and Seung, H. S. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. nature, 405(6789): 947–951.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D. P.; Poole, B.; Norouzi, M.; Fleet, D. J.; and Salimans, T. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In NeurIPS, volume 33, 6840–6851.
Huang, C.; Guan, H.; Jiang, A.; Zhang, Y.; Spratling, M.; and Wang, Y.-F. 2022. Registration based few-shot anomaly detection. In ECCV, 303–319. Springer.
Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Bach, F. R.; and Blei, D. M., eds., ICML, volume 37 of JMLR Workshop and Conference Proceedings, 448–456. JMLR.org.
Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; and Dabeer, O. 2023. Winclip: Zero-/few-shot anomaly classification and segmentation. In CVPR, 19606–19616.
Kingma, D. P.; and Welling, M. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114.
Landman, B.; Xu, Z.; Igelsias, J.; Styner, M.; Langerak, T.; and Klein, A. 2015. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault - Workshop Challenge, volume 5, 12.
Li, C.-L.; Sohn, K.; Yoon, J.; and Pfister, T. 2021. Cutpaste: Self-supervised learning for anomaly detection and localization. In CVPR, 9664–9674.
Liang, Y.; Zhang, J.; Zhao, S.; Wu, R.; Liu, Y.; and Pan, S. 2023. Omni-frequency channel-selection representations for unsupervised anomaly detection. IEEE Transactions on Image Processing.
Liu, J.; Xie, G.; Wang, J.; Li, S.; Wang, C.; Zheng, F.; and Jin, Y. 2023. Deep Industrial Image Anomaly Detection: A Survey. arXiv preprint arXiv:2301.11514, 2.
Liu, T.; Li, B.; Zhao, Z.; Du, X.; Jiang, B.; and Geng, L. 2022. Reconstruction from edge image combined with color and gradient difference for industrial surface anomaly detection. arXiv:2210.14485.
Liznerski, P.; Ruff, L.; Vandermeulen, R. A.; Franks, B. J.; Kloft, M.; and Müller, K. 2021. Explainable Deep One-Class Classification. In ICLR.
Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101.
Mousakhan, A.; Brox, T.; and Tayyub, J. 2023. Anomaly Detection with Conditioned Denoising Diffusion Models. arXiv:2305.15956.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 234–241. Springer.
Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In CVPR, 14318–14328.
Salehi, M.; Mirzaei, H.; Hendrycks, D.; Li, Y.; Rohban, M. H.; and Sabokrou, M. 2022. A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges. arXiv:2110.14051.
Salehi, M.; Sadjadi, N.; Baselizadeh, S.; Rohban, M. H.; and Rabiee, H. R. 2021. Multiresolution knowledge distillation for anomaly detection. In CVPR, 14902–14912.
Song, J.; Meng, C.; and Ermon, S. 2021. Denoising Diffusion Implicit Models. In ICLR. OpenReview.net.
Tan, D. S.; Chen, Y.-C.; Chen, T. P.-C.; and Chen, W.-C. 2021. Trustmae: A noise-resilient defect classification framework using memory-augmented auto-encoders with trust regions. In WACV, 276–285.
Tan, M.; and Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 6105–6114. PMLR.
Tao, X.; Gong, X.; Zhang, X.; Yan, S.; and Adak, C. 2022. Deep Learning for Unsupervised Anomaly Localization in Industrial Images: A Survey. IEEE Transactions on Instrumentation and Measurement, 71: 1–21.
Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2017. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022.
Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; and Wang, C. 2023. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In CVPR, 8032–8041.
Wu, J.; Li, J.; Zhang, J.; Zhang, B.; Chi, M.; Wang, Y.; and Wang, C. 2023. PVG: Progressive Vision Graph for Vision Recognition. arXiv preprint arXiv:2308.00574.
Wyatt, J.; Leach, A.; Schmon, S. M.; and Willcocks, C. G. 2022. AnoDDPM: Anomaly Detection with Denoising Diffusion Probabilistic Models using Simplex Noise. In CVPR Workshops 2022, New Orleans, LA, USA, June 19-20, 2022, 649–655. IEEE.
Xie, G.; Wang, J.; Liu, J.; Jin, Y.; and Zheng, F. 2023. Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore. In ICLR.
Yamada, S.; and Hotta, K. 2021. Reconstruction student with attention for student-teacher pyramid matching. arXiv preprint arXiv:2111.15376.
Yan, X.; Zhang, H.; Xu, X.; Hu, X.; and Heng, P. 2021. Learning Semantic Context from Normal Samples for Unsupervised Anomaly Detection. In AAAI, 3110–3118.
Yi, J.; and Yoon, S. 2020. Patch SVDD: Patch-level SVDD for Anomaly Detection and Segmentation. In ACCV.
Yoon, J.; Sohn, K.; Li, C.-L.; Arik, S. O.; Lee, C.-Y.; and Pfister, T. 2022. Self-supervise, Refine, Repeat: Improving Unsupervised Anomaly Detection. Transactions on Machine Learning Research.
You, Z.; Cui, L.; Shen, Y.; Yang, K.; Lu, X.; Zheng, Y.; and Le, X. 2022. A Unified Model for Multi-class Anomaly Detection. In NeurIPS, volume 35, 4571–4584.
Yu, J.; Zheng, Y.; Wang, X.; Li, W.; Wu, Y.; Zhao, R.; and Wu, L. 2021. FastFlow: Unsupervised Anomaly Detection and Localization via 2D Normalizing Flows. arXiv:2111.07677.
Zagoruyko, S.; and Komodakis, N. 2016. Wide Residual Networks. In BMVC. BMVA Press.
Zavrtanik, V.; Kristan, M.; and Skočaj, D. 2021a. Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In ICCV, 8330–8339.
Zavrtanik, V.; Kristan, M.; and Skočaj, D. 2021b. Reconstruction by inpainting for visual anomaly detection. Pattern Recognition, 112: 107706.
Zhang, H.; Wang, Z.; Wu, Z.; and Jiang, Y.-G. 2023a. DiffusionAD: Denoising Diffusion for Anomaly Detection. arXiv:2303.08730.
Zhang, J.; Chen, X.; Xue, Z.; Wang, Y.; Wang, C.; and Liu, Y. 2023b. Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection. arXiv preprint arXiv:2311.02612.
Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; and Wang, C. 2023c. Rethinking Mobile Block for Efficient Attention-based Models. In ICCV, 1389–1400.
Zhang, J.; Li, X.; Wang, Y.; Wang, C.; Yang, Y.; Liu, Y.; and Tao, D. 2022. Eatformer: Improving vision transformer inspired by evolutionary algorithm. arXiv preprint arXiv:2206.09325.
Zhang, L.; and Agrawala, M. 2023. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv:2302.05543.
Zou, Y.; Jeong, J.; Pemula, L.; Zhang, D.; and Dabeer, O. 2022. Spot-the-difference self-supervised pre-training for anomaly detection and segmentation. In ECCV, 392–408. Springer.

Appendices
Effect of DDIM sampler steps
In order to accelerate the sampling speed in the denoising process, DiAD adopts the DDIM sampling strategy. We investigated the impact of different DDIM sampler steps on the results, as shown in Table 10. The results indicate that increasing the number of sampling steps beyond 5 does not significantly affect the results. Therefore, using a 10-step sampling process can achieve the best performance while greatly accelerating the sampling speed.

Steps   1      5      10     20     50     100    200
seg     72.5   96.5   96.8   96.8   96.7   96.7   96.8
cls     66.1   96.4   97.2   97.1   97.0   96.8   96.9

Table 10: Ablation studies on DDIM sampler steps.

Effect of Global average pooling
Global average pooling is used to reduce the potential occurrence of false positives. For m-n in the table below, m represents the iterations and n represents the kernel size. Through quantitative analysis, the most effective approach is employing an 8 × 8 size global average pooling with 8 iterations. Also, the best-performing combinations exhibit the same feature map size.

Global Average Pooling   1-16   4-16   5-12   6-10   8-8    10-7   15-5   20-4
AUROC-cls                96.0   96.7   96.9   97.1   97.2   97.2   97.0   96.8

Limitations of the datasets
We found that the image-level anomaly detection results of several categories, such as capsules and screws, are significantly lower than others. As shown in Figure 7, we discovered some false positives in input good images during the test. Our method performs well in reconstructing the main bodies of the objects, but the background region of the original image contains impurities, causing the pre-trained feature extraction network to extract features that perceive the background impurities as anomalies. As anomaly detection is expected to identify anomalies within the object rather than the background region, there are certain deficiencies in the MVTec-AD as well as the VisA datasets that lead to false positives. In response to this issue, we increase the number of global average pooling operations to alleviate the problem of high anomaly scores caused by impurities in the background.

Hyperparameters of DiAD
We provide a comprehensive set of hyperparameters for the three models in DiAD as shown in Table 11.
Figure 7: Visualization of false positive classifications and localizations. Panels: Input Good Image, Anomaly Map, Reconstruction.
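One way to read the repeated pooling used against such background false positives is sketched below. It assumes stride-1 average pooling without padding, applied m times with an n × n kernel, followed by taking the maximum of the pooled map as the image-level score; the padding, the stride, the max reduction and the function names are all assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def average_pool(score_map, kernel):
    """Stride-1 average pooling without padding (assumed variant)."""
    windows = sliding_window_view(score_map, (kernel, kernel))
    return windows.mean(axis=(-2, -1))


def image_score(score_map, rounds=8, kernel=8):
    """Pool the pixel-level map `rounds` times, then take the maximum."""
    pooled = np.asarray(score_map, dtype=float)
    for _ in range(rounds):
        pooled = average_pool(pooled, kernel)
    return float(pooled.max())


if __name__ == "__main__":
    speck = np.zeros((256, 256))
    speck[10, 10] = 1.0            # a single-pixel background impurity
    defect = np.zeros((256, 256))
    defect[100:160, 100:160] = 1.0  # a large, coherent anomalous region
    print(image_score(speck), image_score(defect))
```

Under these assumptions an isolated high-scoring pixel is strongly attenuated, while a region larger than the combined pooling window keeps its full score, which matches the stated purpose of suppressing small background impurities.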

Parameters Name   SD Denoising Network   SG Network   Autoencoder
z shape 32 × 32 × 4
|z| 4096
Diffusion steps T 1000
DDIM sampling steps T 10
Noise Schedule linear
Model input shape 32 × 32 × 4 256 × 256 × 3 256 × 256 × 3
N params 859M 471M 83.7M
Embed dim - - 4
Channels 320 320 128
Num res blocks 2 2 2
Channel Multiplier 1,2,4,4 1,2,4,4 1,2,4,4
Attention resolutions 4,2,1 4,2,1 -
Num Heads 8 8 -
Batch Size 12
Accumulate grad batches 4
Epochs 1000
Learning Rate 1.0e-5

Table 11: Hyperparameters for the DiAD. All models trained on a single NVIDIA Tesla V100 32GB.
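The diffusion settings in Table 11 (1000 diffusion steps, 10 DDIM sampling steps, linear noise schedule, 32 × 32 × 4 latent) can be made concrete with a small sketch. The beta range of 1e-4 to 2e-2 and the evenly spaced DDIM timestep subset are assumptions chosen for illustration, not values taken from the table, and the function names are hypothetical.

```python
import numpy as np


def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    beta_start and beta_end are assumed values, not taken from Table 11.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def ddim_timesteps(num_steps=1000, sampling_steps=10):
    """Evenly spaced timestep subset, in the descending order used for denoising."""
    return np.arange(0, num_steps, num_steps // sampling_steps)[::-1]


def forward_noise(z0, t, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps


if __name__ == "__main__":
    alpha_bar = linear_alpha_bar()
    rng = np.random.default_rng(0)
    z0 = rng.standard_normal((4, 32, 32))  # latent with the shape listed in Table 11
    zt = forward_noise(z0, 999, alpha_bar, rng)
    print(ddim_timesteps(), zt.shape)
```

With this schedule the signal coefficient at the final step is close to zero, so the latent is nearly pure Gaussian noise, and the sampler then visits only 10 of the 1000 timesteps.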
Method type: Non-Diffusion (DRAEM, UniAD); Diffusion-based (DDPM, LDM, Ours)
Category DRAEM UniAD DDPM LDM Ours
pcb1 71.9/72.2/70.0 92.8/92.7/87.8 54.1/47.7/67.1 51.2/46.9/66.8 88.1/88.7/80.7
pcb2 78.4/78.2/76.2 87.8/87.7/83.1 50.8/48.5/66.6 57.0/63.4/67.5 91.4/91.4/84.7
pcb3 76.6/77.4/74.7 78.6/78.6/76.1 53.4/51.2/66.8 62.7/69.6/72.0 86.2/87.6/77.6
pcb4 97.3/97.5/93.5 98.8/98.8/94.3 56.0/48.4/66.4 54.4/47.1/66.8 99.6/99.5/97.0
macaroni1 69.8/68.5/70.9 79.9/79.8/72.7 50.9/55.1/68.0 56.2/49.6/68.4 85.7/85.2/78.8
macaroni2 59.4/60.7/68.0 71.6/71.6/69.9 54.4/51.8/67.1 56.8/52.7/66.6 62.5/57.4/69.6
capsules 83.4/91.1/82.1 55.6/55.6/76.9 58.9/62.7/78.2 57.7/71.4/77.3 58.2/69.0/78.5
candle 69.3/73.9/68.0 94.1/94.0/86.1 52.7/48.3/66.6 50.4/52.2/68.2 92.8/92.0/87.6
cashew 81.7/89.7/87.3 92.8/92.8/91.4 63.5/78.9/80.6 61.1/71.0/80.0 91.5/95.7/89.7
chewinggum 93.7/97.1/91.0 96.3/96.2/95.2 50.9/65.6/80.0 53.9/65.8/81.3 99.1/99.5/95.9
fryum 89.1/95.0/86.6 83.0/83.0/85.0 51.0/62.4/80.0 63.5/71.6/81.6 89.8/95.0/87.2
pipe fryum 82.8/91.2/83.9 94.7/94.7/93.9 56.9/74.9/80.0 56.1/75.5/80.3 96.2/98.1/93.7
Mean 79.1/81.9/78.9 85.5/85.5/84.4 54.5/57.9/72.3 56.7/61.4/73.1 86.8/88.3/85.1

Table 12: Comparison with SOTA methods on VisA dataset for multi-class anomaly detection with AUROCcls/APcls/F1maxcls metrics.
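For reference, F1max as reported in these tables can be computed by sweeping every distinct score threshold and keeping the best F1. The sketch below is a generic NumPy version with a hypothetical function name, not the evaluation code behind the reported numbers; it assumes binary labels with at least one positive sample.

```python
import numpy as np


def f1_max(labels, scores):
    """Best F1 over all distinct thresholds (predict positive if score >= threshold)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    sorted_labels = labels[order]
    sorted_scores = scores[order]
    tp = np.cumsum(sorted_labels)
    fp = np.cumsum(~sorted_labels)
    # Keep only the last index of each run of tied scores, i.e. one entry per threshold.
    distinct = np.r_[sorted_scores[1:] != sorted_scores[:-1], True]
    tp, fp = tp[distinct], fp[distinct]
    precision = tp / (tp + fp)
    recall = tp / labels.sum()
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(f1.max())


if __name__ == "__main__":
    # Image-level example: two normal and two anomalous samples.
    print(f1_max([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The same function applies to the pixel-level metric if the ground-truth masks and anomaly maps are flattened into one-dimensional arrays first.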

Method type: Non-Diffusion (DRAEM, UniAD); Diffusion-based (DDPM, LDM, Ours)
Category DRAEM UniAD DDPM LDM Ours
pcb1 94.6/31.8/37.2/52.8 93.3/ 3.9/ 8.3/64.1 75.7/ 1.1/ 2.8/36.1 84.5/ 2.1/ 4.9/54.3 98.7/49.6/52.8/80.2
pcb2 92.3/10.0/18.6/66.2 93.9/ 4.2/ 9.2/66.9 76.2/ 0.7/ 1.6/30.8 89.5/ 2.5/ 6.7/52.7 95.2/ 7.5/16.7/67.0
pcb3 90.8/14.1/24.4/42.9 97.3/13.8/21.9/70.6 83.3/ 1.0/ 2.5/56.1 94.4/ 9.2/17.4/67.8 96.7/ 8.0/18.8/68.9
pcb4 94.4/31.0/37.6/75.7 94.9/14.7/22.9/72.3 73.0/ 1.4/ 3.5/29.9 80.4/ 2.1/ 4.2/40.3 97.0/17.6/27.2/85.0
macaroni1 95.0/19.1/24.1/67.0 97.4/ 3.7/ 9.7/84.0 87.4/ 0.4/ 1.0/61.2 81.6/ 0.3/ 1.3/47.3 94.1/10.2/16.7/68.5
macaroni2 94.6/ 3.9/12.4/65.2 95.2/ 0.9/ 4.3/76.6 84.8/ 0.2/ 0.6/54.1 87.2/ 0.3/ 0.6/57.2 93.6/ 0.9/ 2.8/73.1
capsules 97.1/27.8/33.7/62.8 88.7/ 3.0/ 7.4/43.7 77.1/ 1.1/ 2.8/34.6 75.5/ 1.1/ 2.7/34.8 97.3/10.0/21.0/77.9
candle 82.2/10.1/19.0/65.6 98.5/17.6/27.9/91.6 76.4/ 0.4/ 1.4/34.1 85.3/ 0.9/ 1.9/46.8 97.3/12.8/22.8/89.4
cashew 80.7/ 9.9/15.7/38.5 98.6/51.7/58.3/87.9 74.5/ 2.7/ 5.2/58.7 90.5/ 5.1/10.1/68.3 90.9/53.1/60.9/61.8
chewinggum 91.0/62.3/63.3/40.9 98.8/54.9/56.1/81.3 74.7/ 1.4/ 2.8/37.9 84.1/ 3.1/ 6.9/52.9 94.7/11.9/25.8/59.5
fryum 92.4/38.8/38.5/69.5 95.9/34.0/40.6/76.2 85.7/ 9.4/17.2/58.4 89.9/14.8/24.8/60.1 97.6/58.6/60.1/81.3
pipe fryum 91.1/38.1/39.6/61.8 98.9/50.2/57.7/91.5 87.0/ 6.9/12.9/69.6 96.4/31.0/37.2/77.6 99.4/72.7/69.9/89.9
Mean 91.3/23.5/29.5/58.8 95.9/21.0/27.0/75.6 79.7/ 2.2/ 4.5/46.8 86.6/ 6.0/ 9.9/55.0 96.0/26.1/33.0/75.2

Table 13: Comparison with SOTA methods on VisA dataset for multi-class anomaly localization with AUROCseg/APseg/F1maxseg/PRO metrics.
Figure 8: Qualitative comparison results for anomaly localization on MVTec-AD dataset.
Figure 9: Qualitative comparison results for anomaly localization on MVTec-AD dataset.
Figure 10: Qualitative comparison results for anomaly localization on VisA dataset.
