DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection
Haoyang He1*, Jiangning Zhang2*, Hongxu Chen1, Xuhai Chen1, Zhishan Li1, Xu Chen2, Yabiao Wang2, Chengjie Wang2, Lei Xie1†
1 Zhejiang University   2 Youtu Lab, Tencent
* Equal contribution. † Corresponding author.
Abstract

Reconstruction-based approaches have achieved remarkable outcomes in anomaly detection. The exceptional image reconstruction capabilities of recently popular diffusion models have sparked research efforts to utilize them for enhanced reconstruction of anomalous images. Nonetheless, these methods might face challenges related to the preservation of image categories and pixel-wise structural integrity in the more practical multi-class setting. To solve the above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection, which consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. Firstly, the SG network is proposed for reconstructing anomalous regions while preserving the original image's semantic information. Secondly, we introduce a Spatial-aware Feature Fusion (SFF) block to maximize reconstruction accuracy when dealing with extensively reconstructed areas. Thirdly, the input and reconstructed images are processed by a pre-trained feature extractor to generate anomaly maps based on features extracted at different scales. Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach, which surpasses state-of-the-art methods, e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and detection respectively on the multi-class MVTec-AD dataset. Code will be available at https://lewandofskee.github.io/projects/diad.

Figure 1: An analysis of different diffusion models for multi-class anomaly detection. The image above shows various denoising network architectures, while the images below demonstrate the results reconstructed by different methods for the same input image. a) DDPM suffers from categorical errors. b) LDM exhibits semantic errors. c) Our approach effectively reconstructs the anomalous regions while preserving the semantic information of the original image.

Introduction

Anomaly detection is a crucial task in computer vision and industrial applications (Tao et al. 2022; Salehi et al. 2022; Liu et al. 2023), where the goal of visual anomaly detection is to determine anomalous images and locate the anomalous regions accurately. Existing anomaly detection models (Liznerski et al. 2021; Yi and Yoon 2020; Yu et al. 2021) mostly correspond to one class, which requires a large amount of storage space and training time as the number of classes increases. Therefore, there is an urgent need for an unsupervised multi-class anomaly detection model that is robust and stable.

The current mainstream unsupervised anomaly detection methods can be divided into three categories: synthesizing-based (Zavrtanik, Kristan, and Skočaj 2021a; Li et al. 2021), embedding-based (Defard et al. 2021; Roth et al. 2022; Xie et al. 2023) and reconstruction-based (Liu et al. 2022; Liang et al. 2023) methods. The central concept of the reconstruction-based method is that during the training phase, the model only learns from normal images. During the testing phase, the model reconstructs abnormal images into normal ones using the trained model. Therefore, by comparing the reconstructed image with the input image, we can determine the location of anomalies. Traditional reconstruction-based methods, including AEs (Zavrtanik, Kristan, and Skočaj 2021b), VAEs (Kingma and Welling 2022), and GANs (Liang et al. 2023; Yan et al. 2021), can learn the distribution of normal samples and reconstruct abnormal regions during the testing phase. However, these models have limited reconstruction capabilities and cannot reconstruct complicated textures and objects well, especially for large-scale defects or disappearances, as shown in Figure 1. Hence, models with stronger reconstruction capability are required to effectively tackle multi-class anomaly detection.

Recently, diffusion models (Ho, Jain, and Abbeel 2020; Rombach et al. 2022; Zhang and Agrawala 2023) have demonstrated their powerful image-generation capability. However, directly using current mainstream diffusion models cannot effectively address multi-class anomaly detection problems. 1) For the Denoising Diffusion Probabilistic Model (DDPM) (Ho, Jain, and Abbeel 2020) in Fig. 1-(a), when performing in the multi-class setting, this method may encounter issues with misclassifying generated image categories. The reason is that after adding T timesteps of noise to the input image, the image has lost its original class information. During inference, denoising is performed based on this Gaussian noise-like distribution, which may generate samples belonging to different categories. 2) The Latent Diffusion Model (LDM) (Rombach et al. 2022) has an embedder as a class condition, as shown in Fig. 1-(b), which avoids the misclassification problem found in DDPM. However, LDM still cannot address the issue of semantic loss in generated images. LDM is unable to simultaneously preserve the semantic information of the input image while reconstructing the anomalous regions. For example, it may fail to maintain directional consistency with the input image for objects like screws and hazelnuts, and may exhibit substantial differences from the original image for texture-class images.

To address the aforementioned problems, we propose a diffusion-based framework, DiAD, for multi-class anomaly detection and localization, illustrated in Fig. 2, which comprises three components: a pixel-space autoencoder, a latent-space denoising network and a feature-space ImageNet pre-trained model. To effectively maintain consistent semantic information with the original image while reconstructing the location of anomalous regions, we propose the Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network in LDM. To further enhance the capability of preserving fine details in the original image and reconstructing large defects, we propose the Spatial-aware Feature Fusion (SFF) block to integrate features at different scales. Finally, the reconstructed and input images are passed through a pre-trained model to extract features at different scales and compute anomaly scores. We summarize our contributions as follows:

• We propose a novel diffusion-based framework, DiAD, for multi-class anomaly detection, which is the first to tackle the problem of existing denoising networks of diffusion-based methods failing to correctly reconstruct anomalies.

• We construct an SG network connected to the SD denoising network to maintain consistent semantic information and reconstruct the anomalies.

• We propose an SFF block to integrate features from different scales to further improve the anomaly reconstruction ability.

• Abundant experiments demonstrate the superiority of DiAD over SOTA methods, e.g., we surpass the multi-class anomaly detection diffusion-based method by 20.6↑/11.7↑ in pixel/image AUROC and the non-diffusion method by 9.2↑ in pixel-AP and 0.7↑ in image-AUROC on the MVTec-AD dataset.

Related work

Diffusion model. The diffusion model has gained widespread attention and research interest owing to its remarkable reconstruction ability. It has demonstrated excellent performance in various applications such as image generation (Zhang and Agrawala 2023), video generation (Ho et al. 2022), object detection (Chen et al. 2022), image segmentation (Amit et al. 2022), etc. LDM (Rombach et al. 2022) introduces conditions through cross-attention to control generation. However, it fails to accurately reconstruct images that contain the original semantic information.

Anomaly detection. AD covers a variety of different settings, e.g., open-set (Ding, Pang, and Shen 2022), noisy learning (Tan et al. 2021; Yoon et al. 2022), zero-/few-shot (Huang et al. 2022; Jeong et al. 2023; Cao et al. 2023; Chen, Han, and Zhang 2023; Chen et al. 2023b; Zhang et al. 2023b), 3D AD (Wang et al. 2023; Chen et al. 2023a), etc. This paper studies general unsupervised anomaly detection, which can primarily be categorized into three major methodologies:

1) Synthesizing-based methods synthesize anomalies on normal image samples. During the training phase, both normal images and synthetically generated abnormal images are input into the network for training, which aids in anomaly detection and localization. DRAEM (Zavrtanik, Kristan, and Skočaj 2021a) consists of an end-to-end network composed of a reconstruction network and a discriminative sub-network, which synthesizes just-out-of-distribution appearances. However, due to the diversity and unpredictability of anomalies in real-world scenarios, it is impossible to synthesize all types of anomalies.

2) Embedding-based methods encode the original image's three-dimensional information into a multidimensional feature space (Roth et al. 2022; Cao et al. 2022; Gu et al. 2023). Most methods employ networks (He et al. 2016; Tan and Le 2019; Zhang et al. 2022, 2023c; Wu et al. 2023) pre-trained on ImageNet (Deng et al. 2009) for feature extraction. RD4AD (Deng and Li 2022) utilizes a WideResNet50 (Zagoruyko and Komodakis 2016) as the teacher model for feature extraction and employs a structurally identical network in reverse as the student model, computing the cosine similarity of corresponding features as anomaly scores. However, due to significant differences between industrial images and the data distribution in ImageNet, the extracted features might not be suitable for industrial anomaly detection purposes.

3) Reconstruction-based methods aim to train a model on a dataset without anomalies. The model learns to identify patterns and characteristics in the normal data. OCR-GAN (Liang et al. 2023) decouples images into different frequencies and uses a GAN for reconstruction. EdgRec (Liu et al. 2022) achieves good reconstruction results by first synthesizing anomalies and then extracting grayscale edge information from images, which is ultimately input into a reconstruction network. However, there are certain limitations in the reconstruction of large-area anomalies. Moreover, the accuracy of anomaly localization is also not sufficient.

Recently, some studies have applied diffusion models to anomaly detection. AnoDDPM (Wyatt et al. 2022) is the first approach to employ a diffusion model for medical anomaly detection. DiffusionAD (Zhang et al. 2023a) utilizes an anomaly synthetic strategy to generate anomalous samples and labels, along with two sub-networks dedicated to the tasks of denoising and segmentation. DDAD (Mousakhan, Brox, and Tayyub 2023) employs a score-based pre-trained diffusion model to generate normal samples while fine-tuning the pre-trained feature extractor to achieve domain transfer. However, these approaches only add limited steps of noise and perform few denoising steps, which makes them unable to reconstruct large-scale defects.

To overcome the aforementioned problems, we propose the diffusion-based framework DiAD for multi-class anomaly detection, which is the first to tackle the problem of existing diffusion-based methods failing to correctly reconstruct anomalies.
Figure 2: Framework of the proposed DiAD, which contains three parts: 1) a pixel-space autoencoder {E, D}; 2) a latent-space Semantic-Guided (SG) network with a connection to the Stable Diffusion (SD) denoising network; and 3) a feature-space pre-trained feature extractor Ψ. During training, the input x_0 and the latent variable z_T are fed into the SG network and the SD denoising network, respectively, and the MSE loss between the output noise and the input noise is computed for gradient optimization. During testing, x_0 and the reconstructed image x̂_0 are fed into the same pre-trained feature extraction network to obtain feature maps {f_1, f_2, f_3} of different scales, from which the anomaly score S is calculated.
Preliminaries

Denoising Diffusion Probabilistic Model. Denoising Diffusion Probabilistic Model (DDPM) consists of two processes: the forward diffusion process and the reverse denoising process. During the forward process, a noisy sample $x_t$ is generated using a Markov chain that incrementally adds Gaussian-distributed noise to an initial data sample $x_0$. The forward diffusion process can be characterized as follows:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I), \tag{1}$$

where $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i = \prod_{i=1}^{t}(1-\beta_i)$, and $\beta_i$ represents the noise schedule used to regulate the quantity of noise added at each timestep.

In the reverse denoising process, $x_T$ is first sampled from Eq. 1 and $x_{t-1}$ is reconstructed from $x_t$ and the model prediction $\epsilon_\theta(x_t, t)$ with the formulation:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \tag{2}$$

where $z \sim \mathcal{N}(0, I)$, $\sigma_t$ is a fixed constant related to the variance schedule, $\epsilon_\theta(x_t, t)$ is a U-Net (Ronneberger, Fischer, and Brox 2015) network that predicts the noise, and $\theta$ denotes the learnable parameters, which are optimized as:

$$\min_\theta\, \mathbb{E}_{x_0\sim q(x_0),\,\epsilon\sim\mathcal{N}(0,I),\,t}\,\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2. \tag{3}$$

Latent Diffusion Model. Latent Diffusion Model (LDM) focuses on the low-dimensional latent space with conditioning mechanisms. LDM consists of a pre-trained autoencoder model and a denoising U-Net-like attention-based network. The network compresses images using an encoder, conducts diffusion and denoising operations in the latent representation space, and subsequently reconstructs the images back to the original pixel space using a decoder. The training optimization objective is:

$$\mathcal{L}_{LDM} = \mathbb{E}_{z_0,t,c,\epsilon\sim\mathcal{N}(0,1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\right], \tag{4}$$

where $c$ represents the conditioning mechanism, which can consist of multimodal types such as text or image and is connected to the model through a cross-attention mechanism, and $z_t$ represents the latent-space variable.
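To make the preliminaries concrete, the following is a minimal PyTorch sketch of the forward process (Eq. 1) and the noise-prediction objective (Eq. 3); in latent space the same loss is applied to $z_t$, as in Eq. 4. The `eps_model` callable and the linear-schedule endpoints are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

# Linear noise schedule; the endpoints are common DDPM defaults and are
# assumed here, not stated in the paper.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # \bar{\alpha}_t in Eq. 1

def q_sample(x0, t, eps):
    """Forward diffusion (Eq. 1): x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    ab = alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

def noise_prediction_loss(eps_model, x0):
    """Objective of Eq. 3: MSE between the added and the predicted noise."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return F.mse_loss(eps_model(x_t, t), eps)
```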
Method

The proposed pipeline DiAD is shown in Fig. 2. First, the pre-trained encoder downsamples the input image into a latent-space representation. Then, noise is added to the latent representation, followed by the denoising process using an SD denoising network with a connection to the SG network. The denoising process is repeated for the same number of timesteps as the diffusion process. Finally, the reconstructed latent representation is restored to the original image level using the pre-trained decoder. For anomaly detection and localization, the input and reconstructed images are fed into the same pre-trained model to extract features at different scales and calculate the differences between these features.

Semantic-Guided Network

As discussed earlier, DDPM and LDM each have specific problems when addressing multi-class anomaly detection tasks. In response to these issues and the multi-class task itself, we propose an SG network to address the problem of LDM's inability to effectively reconstruct anomalies and preserve the semantic information of the input image.

Figure 3: Schematic diagram of the SFF block. Each layer in SGDB4 is obtained by adding the corresponding SGEB4 to every SGEB3 with a Conv Block applied.

Given an input image $x_0 \in \mathbb{R}^{3\times H\times W}$ in pixel space, the pre-trained encoder $\mathcal{E}$ encodes $x_0$ into a latent-space representation $z = \mathcal{E}(x_0)$, where $z \in \mathbb{R}^{c\times h\times w}$. Similar to Eq. 1, with the original pixel-space $x$ replaced by the latent representation $z$, the forward diffusion process can now be characterized as follows:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I). \tag{5}$$
The perturbed representation $z_T$ and the input $x_0$ are simultaneously fed into the SD denoising network and the SG network, respectively. After $T$ steps of the reverse denoising process, the final variable $\hat{z}$ is restored to the reconstructed image $\hat{x}_0$ by the pre-trained decoder $\mathcal{D}$, giving $\hat{x}_0 = \mathcal{D}(\hat{z})$. The training objective of DiAD is:

$$\mathcal{L}_{DiAD} = \mathbb{E}_{z_0,t,c_i,\epsilon\sim\mathcal{N}(0,1)}\left[\|\epsilon - \epsilon_\theta(z_t, t, c_i)\|_2^2\right]. \tag{6}$$

The denoising network consists of a pre-trained SD denoising network and an SG network that replicates the SD parameters for initialization, as shown in Fig. 2. The pre-trained SD denoising network comprises four encoder blocks, one middle block and four decoder blocks. Here, 'block' means a frequently utilized unit in the construction of neural network layers, e.g., a 'resnet' block, transformer block, multi-head cross-attention block, etc.

The input image $x_0 \in \mathbb{R}^{3\times H\times W}$ is transformed into $x \in \mathbb{R}^{d\times h\times w}$ by a set of 'conv-silu' layers $\mathcal{C}$ in the SG network in order to keep the same dimension as the latent representations in SD Encoder Block 1, $\mathcal{E}_{SD1}$. Then, the summation of $x$ and $z$ is input into the SG Encoder Blocks (SGEBs). After continuous downsampling by the encoder $\mathcal{E}_{SG}$, the results are finally added to the output of the SD middle block $\mathcal{M}_{SD}$ after passing through the SG middle block $\mathcal{M}_{SG}$. Additionally, to address multi-class tasks of different scenarios and categories, the results of the SG Decoder Blocks (SGDBs) $\mathcal{D}_{SG}$ are also added to the results of the SD decoder $\mathcal{D}_{SD}$, combined with an SFF block, which will be explained in the next section. The output $G$ of the denoising network is characterized as:

$$G = \mathcal{D}_{SD}\big(\mathcal{M}_{SD}(\mathcal{E}_{SD}(z_t)) + \mathcal{M}_{SG}(\mathcal{E}_{SG}(z + \mathcal{C}(x_0)))\big) + \mathcal{D}_{SG_j}\big(\mathcal{M}_{SG}(\mathcal{E}_{SG}(z + \mathcal{C}(x_0)))\big), \tag{7}$$

where $z$ represents the latent representation with noise perturbation, $x_0$ represents the input image, $\mathcal{C}(\cdot)$ represents the set of 'conv-silu' layers in the SG network, $\mathcal{E}_{SD}(\cdot)$ represents all the SD encoder blocks (SDEBs), $\mathcal{E}_{SG}(\cdot)$ represents all the SGEBs, $\mathcal{M}_{SG}(\cdot)$ and $\mathcal{M}_{SD}(\cdot)$ represent the SG and SD middle blocks respectively, $\mathcal{D}_{SD}(\cdot)$ represents all the SDDBs, and $\mathcal{D}_{SG_j}(\cdot)$ represents the SGDB for the $j$-th block.
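As a rough sketch, the composition in Eq. 7 can be written as below. All module arguments are hypothetical stand-ins for the SD/SG encoder, middle and decoder blocks, and the per-block skip connections inside the U-Net are elided for brevity.

```python
import torch.nn as nn

class SGConnectedDenoiser(nn.Module):
    """Sketch of Eq. 7: the SG branch encodes z + C(x0) and injects its
    middle-block and decoder features into the SD denoising network."""
    def __init__(self, sd_enc, sd_mid, sd_dec, sg_cond, sg_enc, sg_mid, sg_dec):
        super().__init__()
        self.sd_enc, self.sd_mid, self.sd_dec = sd_enc, sd_mid, sd_dec
        self.sg_cond = sg_cond  # the 'conv-silu' stack C(.)
        self.sg_enc, self.sg_mid, self.sg_dec = sg_enc, sg_mid, sg_dec

    def forward(self, z_t, z, x0):
        # SG branch: M_SG(E_SG(z + C(x0))).
        sg_feat = self.sg_mid(self.sg_enc(z + self.sg_cond(x0)))
        # SD middle-block output plus the SG middle feature.
        h = self.sd_mid(self.sd_enc(z_t)) + sg_feat
        # D_SD(...) plus the SGDB contribution D_SGj(...), as in Eq. 7.
        return self.sd_dec(h) + self.sg_dec(sg_feat)
```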
Spatial-aware Feature Fusion Block

When adding several layers of decoder blocks from the SGEBs to the SDDBs during the experiment, as shown in Table 7, we found it challenging to solve multi-class anomaly detection. This is because the dataset contains various types, such as objects and textures. For texture-related cases, the anomalies are generally smaller, so it is necessary to preserve their original textures. On the other hand, for object-related cases, the defects often cover larger areas, requiring stronger reconstruction capabilities. Therefore, it is extremely challenging to simultaneously preserve the normal information of the original samples and reconstruct the abnormal locations in different scenarios.

Hence, we propose a Spatial-aware Feature Fusion (SFF) block with the aim of integrating high-scale semantic information into the low-scale features. This ultimately enables the model to both preserve the information of the original normal samples and reconstruct large-scale abnormal regions. The structure of the SFF block is shown in Fig. 3. Each SGEB consists of three sub-layers. Therefore, the SFF block integrates the features of each layer in SGEB3 into each layer in SGEB4 and adds the fused features to the original features. The final output of each layer of SGEB4 is:

$$Q_i = P_i + \sum_{j=1}^{J} \mathcal{F}(H_j), \tag{8}$$

where $P_i$ represents the low-scale output features of the $i$-th layer of SGEB4, $Q_i$ represents the final low-scale output features of the $i$-th layer of SGDB4, $H_j$ represents the high-scale output features of the $j$-th layer of SGEB3, $J = 3$ indicates the three layers of SGEB3 used in the experiment, and $\mathcal{F}(\cdot)$ represents a basic convolutional block consisting of a 3×3 convolution layer followed by a normalization layer and an activation layer.
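A minimal PyTorch sketch of Eq. 8 follows, assuming $\mathcal{F}(\cdot)$ is Conv3×3 → InstanceNorm → SiLU (the combination Table 7 reports as best); how the high-scale SGEB3 features are spatially matched to the SGEB4 features is an assumption here.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """F(.) in Eq. 8: a 3x3 convolution followed by normalization and
    activation; IN + SiLU is used, per the ablation in Table 7."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.norm = nn.InstanceNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class SFFBlock(nn.Module):
    """Eq. 8: Q_i = P_i + sum_{j=1..J} F(H_j), fusing the J = 3 SGEB3
    layer outputs into one SGEB4 layer output."""
    def __init__(self, high_ch, low_ch, num_high=3):
        super().__init__()
        self.fuse = nn.ModuleList(ConvBlock(high_ch, low_ch)
                                  for _ in range(num_high))

    def forward(self, p_i, highs):
        # highs: list of the J high-scale feature maps from SGEB3.
        out = p_i
        for f, h in zip(self.fuse, highs):
            h = f(h)
            # Resizing to P_i's spatial size is an assumption; the text
            # does not spell out how the scales are aligned.
            h = F.interpolate(h, size=p_i.shape[-2:], mode="nearest")
            out = out + h
        return out
```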
Category | PaDiM | MKD | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours
Objects:
Bottle | 97.9/- | 98.7/- | 97.5/99.2/96.1 | 99.6/99.9/98.4 | 99.7/100./100. | 63.6/71.8/86.3 | 93.8/98.7/93.7 | 99.7/96.5/91.8
Cable | 70.9/- | 78.2/- | 57.8/74.0/76.3 | 84.1/89.5/82.5 | 95.2/95.9/88.0 | 55.6/69.7/76.0 | 55.7/74.8/77.7 | 94.8/98.8/95.2
Capsule | 73.4/- | 68.3/- | 65.3/92.5/90.4 | 94.1/96.9/96.9 | 86.9/97.8/94.4 | 52.9/82.0/90.5 | 60.5/81.4/90.5 | 89.0/97.5/95.5
Hazelnut | 85.5/- | 97.1/- | 93.7/97.5/92.3 | 60.8/69.8/86.4 | 99.8/100./99.3 | 87.0/90.4/88.1 | 93.0/95.8/89.8 | 99.5/99.7/97.3
Metal Nut | 88.0/- | 64.9/- | 72.8/95.0/92.0 | 100./100./99.5 | 99.2/99.9/99.5 | 60.0/74.4/89.4 | 53.0/80.1/89.4 | 99.1/96.0/91.6
Pill | 68.8/- | 79.7/- | 82.2/94.9/92.4 | 97.5/99.6/96.8 | 93.7/98.7/95.7 | 55.8/84.0/91.6 | 62.1/93.1/91.6 | 95.7/98.5/94.5
Screw | 56.9/- | 75.6/- | 92.0/95.7/89.9 | 97.7/99.3/95.8 | 87.5/96.5/89.0 | 53.6/71.9/85.9 | 58.7/81.9/85.6 | 90.7/99.7/97.9
Toothbrush | 95.3/- | 75.3/- | 90.6/96.8/90.0 | 97.2/99.0/94.7 | 94.2/97.4/95.2 | 57.5/68.0/83.3 | 78.6/83.9/83.3 | 99.7/99.9/99.2
Transistor | 86.6/- | 73.4/- | 74.8/77.4/71.1 | 94.2/95.2/90.0 | 99.8/98.0/93.8 | 57.8/44.6/57.1 | 61.0/57.8/59.1 | 99.8/99.6/97.4
Zipper | 79.7/- | 87.4/- | 98.8/99.9/99.2 | 99.5/99.9/99.2 | 95.8/99.5/97.1 | 64.9/77.4/88.1 | 73.6/89.5/90.6 | 95.1/99.1/94.4
Textures:
Carpet | 93.8/- | 69.8/- | 98.0/99.1/96.7 | 98.5/99.6/97.2 | 99.8/99.9/99.4 | 95.5/98.7/91.0 | 99.4/99.8/99.4 | 99.4/99.9/98.3
Grid | 73.9/- | 83.8/- | 99.3/99.7/98.2 | 98.0/99.4/96.5 | 98.2/99.5/97.3 | 83.5/93.9/86.9 | 67.3/82.6/84.4 | 98.5/99.8/97.7

Table 1: Comparison with SOTA methods on MVTec-AD dataset for multi-class anomaly detection with AUROC_cls/AP_cls/F1max_cls metrics. PaDiM, MKD, DRAEM, RD4AD and UniAD are non-diffusion methods; DDPM, LDM and Ours are diffusion-based.
Category | PaDiM | MKD | DRAEM | RD4AD | UniAD | DDPM | LDM | Ours
Objects:
Metal Nut | 84.8/- | 64.2/- | 62.2/31.1/21.0 | 93.8/62.3/65.4 | 94.8/55.5/66.4 | 62.7/14.6/29.2 | 70.5/19.3/30.7 | 97.3/30.0/38.3
Pill | 87.7/- | 69.7/- | 94.4/59.1/44.1 | 97.5/63.4/65.2 | 95.0/44.0/53.9 | 55.3/4.0/8.4 | 74.9/10.2/15.0 | 95.7/46.0/51.4
Screw | 94.1/- | 92.1/- | 95.5/33.8/40.6 | 99.4/40.2/44.6 | 98.3/28.7/37.6 | 91.1/1.8/3.8 | 91.7/2.2/4.6 | 97.9/60.6/59.6
Toothbrush | 95.6/- | 88.9/- | 97.7/55.2/55.8 | 99.0/53.6/58.8 | 98.4/34.9/45.7 | 76.9/4.0/7.7 | 93.7/20.4/9.8 | 99.0/78.7/72.8
Transistor | 92.3/- | 71.7/- | 64.5/23.6/15.1 | 85.9/42.3/45.2 | 97.9/59.5/64.6 | 53.2/5.8/11.4 | 85.5/25.0/30.7 | 95.1/15.6/31.7
Zipper | 94.8/- | 86.1/- | 98.3/74.3/69.3 | 98.5/53.9/60.3 | 96.8/40.1/49.9 | 67.4/3.5/7.6 | 66.9/5.3/7.4 | 96.2/60.7/60.0
Textures:
Carpet | 97.6/- | 95.5/- | 98.6/78.7/73.1 | 99.0/58.5/60.4 | 98.5/49.9/51.1 | 89.2/18.8/44.3 | 99.1/70.6/66.0 | 98.6/42.2/46.4
Grid | 71.0/- | 82.3/- | 98.7/44.5/46.2 | 99.2/46.0/47.4 | 96.5/23.0/28.4 | 63.1/0.7/1.9 | 52.4/1.1/1.9 | 96.6/66.0/64.1

Table 3: Comparison with SOTA methods on MVTec-AD dataset for multi-class anomaly localization with AUROC_seg/AP_seg/F1max_seg metrics.
         Non-Diffusion      Diffusion-based
Method   DRAEM    UniAD     DDPM   LDM    Ours
PRO      71.1     90.4      49.0   66.3   90.7

Table 4: Multi-class anomaly localization results with the PRO metric on MVTec-AD dataset.
MVTec-3D dataset. MVTec-3D provides each sample with both RGB images and 3D point clouds. The training set contains 2,656 images with only anomaly-free samples. The test set consists of 1,197 images, including both normal and abnormal samples. Only RGB images are used in this experiment.

Medical dataset. We also merge three types of medical datasets, BraTS2021 (Baid et al. 2021), BTCV (Landman et al. 2015) and LiTs (Bilic et al. 2023), into one Medical dataset for multi-class anomaly detection. The training set contains 9,042 slices and the test set consists of 5,208 slices.

Evaluation Metrics. Following prior works, Area Under the Receiver Operating Characteristic Curve (AUROC), Average Precision (AP) and F1-score-max (F1max) are used in both anomaly detection and anomaly localization, where cls denotes image-level anomaly detection and seg denotes pixel-level anomaly localization. Also, Per-Region-Overlap (PRO) is used in anomaly localization. The DICE score is commonly used in the medical field.

Implementation Details

All images in MVTec-AD and VisA are resized to 256 × 256. For the denoising network, we adopt the 4-th block of the SGDB for connection to the SDDB. In this experiment, we adopt ResNet50 as the feature extraction network and choose n ∈ {2, 3, 4} as the feature layers used in calculating the anomaly localization. We utilize the KL method as the autoencoder and fine-tune the model before training the denoising network. We train for 1000 epochs on a single NVIDIA Tesla V100 32GB with a batch size of 12. The Adam optimiser (Loshchilov and Hutter 2019) with a learning rate of 1e−5 is used. A Gaussian filter with σ = 5 is used to smooth the anomaly localization score. For anomaly detection, the anomaly score of the image is the maximum value of the average-pooled anomaly localization score, which undergoes 8 rounds of average pooling operations with a size of 8 × 8. During inference, the initial denoising timestep T is set to 1,000. We use DDIM (Song, Meng, and Ermon 2021) as the sampler with 10 steps by default.
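A minimal sketch of this scoring pipeline is given below. The per-scale cosine distance between the feature maps of the input and the reconstruction is an assumption consistent with feature-difference scoring; the Gaussian kernel size and the stride of the 8×8 average pooling are likewise assumed values.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def anomaly_map(feats_in, feats_rec, out_size=256):
    """Sums per-scale feature distances between input and reconstruction,
    upsampled to image resolution (feature layers n in {2, 3, 4})."""
    amap = 0.0
    for fi, fr in zip(feats_in, feats_rec):
        d = 1.0 - F.cosine_similarity(fi, fr, dim=1)       # (B, h, w)
        amap = amap + F.interpolate(d.unsqueeze(1), size=out_size,
                                    mode="bilinear", align_corners=False)
    return amap                                             # (B, 1, H, W)

def image_score(amap):
    """Image-level score: Gaussian smoothing (sigma = 5), 8 rounds of 8x8
    average pooling, then the maximum value."""
    amap = GaussianBlur(kernel_size=33, sigma=5.0)(amap)    # kernel size assumed
    for _ in range(8):
        amap = F.avg_pool2d(amap, kernel_size=8, stride=1, padding=4)
    return amap.flatten(1).max(dim=1).values
```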
Comparison with SOTAs

We conduct and analyze a range of qualitative and quantitative comparison experiments on the MVTec-AD, VisA, MVTec-3D and Medical datasets. We choose a synthesizing-based method, DRAEM (Zavrtanik, Kristan, and Skočaj 2021a); three embedding-based methods, MKD (Salehi et al. 2021), PaDiM (Defard et al. 2021) and RD4AD (Deng and Li 2022); a reconstruction-based method, EdgRec (Liu et al. 2022); the unified SOTA method UniAD (You et al. 2022); and the diffusion-based DDPM and LDM methods. Specifically, we categorize the aforementioned methods into two types: non-diffusion and diffusion-based methods. For the experiments on the Medical dataset, we follow the BMAD (Bao et al. 2023) benchmark and add two methods, STFPM (Yamada and Hotta 2021) and CFLOW (Gudovskiy, Ishizaka, and Kozuka 2022), for comparison.

Qualitative Results. We conducted substantial qualitative experiments on the MVTec-AD and VisA datasets to visually demonstrate the superiority of our method in image reconstruction and the accuracy of anomaly localization. As shown in Figure 4, our method exhibits better reconstruction capabilities for anomalous regions compared to EdgRec on the MVTec-AD dataset. In comparison to UniAD, shown in Figure 5, our method exhibits more accurate anomaly localization abilities on the VisA dataset. More qualitative results are presented in the Appendix.

Quantitative Results. As shown in Table 1 and Table 3, our method achieves SOTA AUROC/AP/F1max metrics of 97.2/99.0/96.5 and 96.8/52.6/55.5 for image-wise and pixel-wise performance respectively in the multi-class setting on the MVTec-AD dataset. For the diffusion-based methods, our approach [...]
            Non-Diffusion      Diffusion-based
Metrics     DRAEM    UniAD     DDPM   LDM    Ours
AUROC_cls   63.2     78.9      66.3   68.5   84.6
AP_cls      86.1     93.4      78.0   90.6   94.8
F1max_cls   89.2     91.4      86.6   91.6   95.5
AUROC_seg   93.2     96.5      90.7   92.2   96.4
AP_seg      16.8     21.2      6.0    9.3    25.3
F1max_seg   20.2     28.0      10.7   13.5   32.2
PRO         55.0     88.1      69.7   73.8   87.8

Table 5: Quantitative comparisons on MVTec-3D dataset.

SD  MSG  SGEB3  SGEB4  BN+ReLU  IN+SiLU | cls   seg
✓                                       | 79.3  89.5
✓   ✓                                   | 95.1  91.1
✓   ✓    ✓                              | 95.3  89.1
✓   ✓    ✓      ✓                       | 93.8  91.2
✓   ✓           ✓      ✓                | 96.7  96.7
✓   ✓           ✓               ✓       | 97.2  96.8

Table 7: Ablation studies on the design of DiAD with AUROC metrics.
Figure 4: Qualitative illustration on MVTec-AD dataset.

Ablation Studies

The architecture design of DiAD. We investigate the importance of each module in DiAD, as shown in Table 7. SD indicates only the diffusion model without connecting to the SG network, which is the LDM architecture. MSG indicates only the middle block of the SG network being added to the middle of SD. SGEB3 and SGEB4 indicate directly skip-connecting to the corresponding SDDB. When connecting SGDB3 and SGDB4 at the same time, more details of the original images are preserved in terms of texture, but the reconstruction ability for large anomaly areas decreases. Using the combination of IN+SiLU in the SFF block yields better results compared to using BN+ReLU.

Effect of pre-trained feature extractors. Table 8 shows a quantitative comparison of different pre-trained backbones used as feature extraction networks. ResNet50 achieves the best performance on the anomaly classification metrics, while WideResNet101 excels in anomaly segmentation.
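For reference, the multi-scale feature extraction can be sketched with torchvision's feature-extraction utility; the layer names follow torchvision's ResNet implementation, and the cut points mirror the n ∈ {2, 3, 4} setting above.

```python
import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

# ResNet50 backbone cut at layers 2-4, matching n in {2, 3, 4}.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "f1", "layer3": "f2", "layer4": "f3"})
extractor.eval()

@torch.no_grad()
def multi_scale_features(img):
    """Returns the feature maps {f1, f2, f3} compared for anomaly scoring."""
    out = extractor(img)
    return [out["f1"], out["f2"], out["f3"]]
```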
Backbone    AUROC_cls  AP_cls  F1max_cls  AUROC_seg  AP_seg  F1max_seg  PRO
VGG16       91.8       97.2    93.9       92.1       47.2    50.5       80.1
VGG19       91.3       96.9    93.7       92.3       47.5    50.6       80.4
ResNet18    94.7       98.1    96.0       96.0       49.9    53.3       89.1
ResNet34    95.2       98.3    95.7       96.2       51.2    54.5       89.6
ResNet50    97.2       99.0    96.5       96.8       52.6    55.5       90.7
ResNet101   96.2       98.4    96.5       96.9       52.9    56.4       91.2

Table 8: Comparison of different pre-trained feature extractors on MVTec-AD dataset.
References

Amit, T.; Shaharbany, T.; Nachmani, E.; and Wolf, L. 2022. SegDiff: Image Segmentation with Diffusion Probabilistic Models. arXiv:2112.00390.

Baid, U.; Ghodasara, S.; Mohan, S.; Bilello, M.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F. C.; et al. 2021. The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification. arXiv:2107.02314.
Parameter                 SD Denoising Network   SG Network      Autoencoder
z shape                   32 × 32 × 4
|z|                       4096
Diffusion steps T         1000
DDIM sampling steps       10
Noise schedule            linear
Model input shape         32 × 32 × 4            256 × 256 × 3   256 × 256 × 3
Num params                859M                   471M            83.7M
Embed dim                 -                      -               4
Channels                  320                    320             128
Num res blocks            2                      2               2
Channel multiplier        1,2,4,4                1,2,4,4         1,2,4,4
Attention resolutions     4,2,1                  4,2,1           -
Num heads                 8                      8               -
Batch size                12
Accumulate grad batches   4
Epochs                    1000
Learning rate             1.0e-5

Table 11: Hyperparameters for DiAD. All models trained on a single NVIDIA Tesla V100 32GB.
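As a sketch of the sampler configuration in Table 11, the standard deterministic DDIM update (η = 0) over a 10-step stride of the 1000 training timesteps looks as follows; `eps_model` stands in for the SG-connected denoising network.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, z_T, alpha_bars, num_steps=10):
    """Deterministic DDIM sampling (eta = 0) over a strided subset of the
    1000 training timesteps (10 steps by default, as in Table 11)."""
    timesteps = torch.linspace(alpha_bars.numel() - 1, 0, num_steps).long()
    z = z_T
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps \
            else torch.ones((), device=z.device)
        eps = eps_model(z, t.to(z.device).expand(z.size(0)))
        z0_pred = (z - (1.0 - ab_t).sqrt() * eps) / ab_t.sqrt()  # predicted z_0
        z = ab_prev.sqrt() * z0_pred + (1.0 - ab_prev).sqrt() * eps
    return z
```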
            Non-Diffusion Method              Diffusion-based Method
Category    DRAEM            UniAD            DDPM            LDM             Ours
pcb1 71.9/72.2/70.0 92.8/92.7/87.8 54.1/47.7/67.1 51.2/46.9/66.8 88.1/88.7/80.7
pcb2 78.4/78.2/76.2 87.8/87.7/83.1 50.8/48.5/66.6 57.0/63.4/67.5 91.4/91.4/84.7
pcb3 76.6/77.4/74.7 78.6/78.6/76.1 53.4/51.2/66.8 62.7/69.6/72.0 86.2/87.6/77.6
pcb4 97.3/97.5/93.5 98.8/98.8/94.3 56.0/48.4/66.4 54.4/47.1/66.8 99.6/99.5/97.0
macaroni1 69.8/68.5/70.9 79.9/79.8/72.7 50.9/55.1/68.0 56.2/49.6/68.4 85.7/85.2/78.8
macaroni2 59.4/60.7/68.0 71.6/71.6/69.9 54.4/51.8/67.1 56.8/52.7/66.6 62.5/57.4/69.6
capsules 83.4/91.1/82.1 55.6/55.6/76.9 58.9/62.7/78.2 57.7/71.4/77.3 58.2/69.0/78.5
candle 69.3/73.9/68.0 94.1/94.0/86.1 52.7/48.3/66.6 50.4/52.2/68.2 92.8/92.0/87.6
cashew 81.7/89.7/87.3 92.8/92.8/91.4 63.5/78.9/80.6 61.1/71.0/80.0 91.5/95.7/89.7
chewinggum 93.7/97.1/91.0 96.3/96.2/95.2 50.9/65.6/80.0 53.9/65.8/81.3 99.1/99.5/95.9
fryum 89.1/95.0/86.6 83.0/83.0/85.0 51.0/62.4/80.0 63.5/71.6/81.6 89.8/95.0/87.2
pipe fryum 82.8/91.2/83.9 94.7/94.7/93.9 56.9/74.9/80.0 56.1/75.5/80.3 96.2/98.1/93.7
Mean 79.1/81.9/78.9 85.5/85.5/84.4 54.5/57.9/72.3 56.7/61.4/73.1 86.8/88.3/85.1
Table 12: Comparison with SOTA methods on VisA dataset for multi-class anomaly detection with AUROC_cls/AP_cls/F1max_cls metrics.
Table 13: Comparison with SOTA methods on VisA dataset for multi-class anomaly localization with AUROC_seg/AP_seg/F1max_seg/PRO metrics.
Figure 8: Qualitative comparison results for anomaly localization on MVTec-AD dataset.
Figure 9: Qualitative comparison results for anomaly localization on MVTec-AD dataset.
Figure 10: Qualitative comparison results for anomaly localization on VisA dataset.