
1 Zhejiang University    2 Nanyang Technological University
 
https://zhouzheyuan.github.io/r3d-ad

R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection

Zheyuan Zhou 1,*    Le Wang 1,*    Naiyu Fang 1,2    Zili Wang 1    Lemiao Qiu 1    Shuyou Zhang 1
* Equal contribution.
Abstract

3D anomaly detection plays a crucial role in monitoring parts for localized inherent defects in precision manufacturing. Embedding-based and reconstruction-based approaches are among the most popular and successful methods. However, two major challenges hinder the practical application of current approaches: 1) embedded models suffer from prohibitive computational and storage costs due to the memory bank structure; 2) reconstructive models based on the MAE mechanism fail to detect anomalies in the unmasked regions. In this paper, we propose R3D-AD, which reconstructs anomalous point clouds with a diffusion model for precise 3D anomaly detection. Our approach capitalizes on the data distribution conversion of the diffusion process to entirely obscure the input's anomalous geometry. It learns, step by step, a strict point-level displacement behavior that methodically corrects the aberrant points. To increase the generalization of the model, we further present a novel 3D anomaly simulation strategy named Patch-Gen to generate realistic and diverse defect shapes, which narrows the domain gap between training and testing. Our R3D-AD ensures a uniform spatial transformation, which allows anomaly results to be generated straightforwardly by distance comparison. Extensive experiments show that our R3D-AD outperforms previous state-of-the-art methods, achieving 73.4% Image-level AUROC on the Real3D-AD dataset and 74.9% Image-level AUROC on the Anomaly-ShapeNet dataset with exceptional efficiency.

Keywords:
3D anomaly detection, industrial applications, 3D reconstruction, self-supervised learning

1 Introduction

Anomaly detection aims to identify instances containing anomalies and to precisely locate the specific positions of defects. This task is extensively applied across multiple fields and plays a crucial role in quality control within industrial production [29]. 3D anomaly detection [19] has emerged because the 3D modality is intrinsically better at avoiding blind spots in advanced processing and precision manufacturing. However, the discrete and disordered data form of point clouds makes it more difficult to acquire features than from images. Because anomalies are scarce, 3D anomaly detection also faces the problem of domain shift, since only normal data are available during training. These issues underscore the need for an efficient framework for the 3D anomaly detection task.

Figure 1: Comparison of architectures. (a) Embedded model encodes the input $\mathcal{X}$ into features and stores them in the memory bank during training. The anomaly map $\mathcal{M}$ is obtained by comparing the test features with all the features in the memory bank. (b) Reconstructive model is trained by minimizing the loss between its input $\mathcal{X}$ and the output $\widehat{\mathcal{X}}$. The anomaly map $\mathcal{M}$ is obtained by comparing the test phase input with its corresponding reconstruction target.

Similar to traditional 2D anomaly detection [29, 43], current 3D anomaly detection methods can be primarily categorized into embedding-based and reconstruction-based methods, as illustrated in Fig. 1. The embedding-based methods map features extracted with a pre-trained encoder onto a normal distribution for learning. Features that fall outside this distribution are classified as anomalies. Most existing 3D anomaly detection methods are based on a memory bank mechanism [11, 37, 19, 3], which stores some representative features during the training phase to implicitly construct a feature distribution. In the testing phase, the presence of anomalies is determined by calculating the Euclidean distance between the input test object and all template point clouds stored in memory. The reconstruction-based methods train a network capable of accurately reconstructing normal point clouds, under the presumption that anomalous point clouds will not be effectively reconstructed since they are not included during training. The anomaly map is produced by comparing the input point cloud with its reconstruction. IMRNet [18] employs PointMAE [26] to reconstruct the input over several iterations, obtaining the final anomaly map from the explicit spatial coordinate differences and the implicit deep feature differences of the point cloud.
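The memory-bank scoring described above can be illustrated with a minimal NumPy sketch. It assumes features are plain vectors and that the anomaly score is the Euclidean distance to the nearest stored feature; the function name `memory_bank_score`, the feature dimension, and the bank size are hypothetical and are not taken from any of the cited methods.

```python
import numpy as np

def memory_bank_score(test_feats: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Score each test feature by its Euclidean distance to the nearest bank entry."""
    # Pairwise squared distances via ||a||^2 - 2 a.b + ||b||^2, shape (N, M).
    d2 = (
        np.sum(test_feats**2, axis=1, keepdims=True)
        - 2.0 * test_feats @ bank.T
        + np.sum(bank**2, axis=1)
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(d2, 0.0).min(axis=1))

rng = np.random.default_rng(0)
bank = rng.normal(size=(5000, 32))              # stand-in for stored "normal" features
test_feats = np.vstack([bank[:3], bank[:1] + 10.0])  # three seen features, one far outlier
scores = memory_bank_score(test_feats, bank)
```

Every test feature is compared with every bank entry, so both the stored bank and the distance matrix grow with the amount of training data. This is the cost that the following paragraph refers to.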

However, existing methods face two key issues: high resource cost and irreparable reconstruction. Firstly, methods based on the memory bank [11, 37, 19, 3] store all features from the training phase, and each test point cloud needs to be compared with all samples in the memory bank, which significantly increases memory overhead and inference time. This inefficiency makes such methods almost inapplicable in real industrial production lines. Secondly, the masked autoencoder (MAE) mechanism [9, 39, 26] only reconstructs the masked portions of the input, so defects within unmasked portions may be preserved. This contradicts the fundamental assumption of detecting anomalies by comparing the original defect-containing point cloud with a reconstructed anomaly-free version. These methods inevitably lead to incorrect reconstructions, undermining their effectiveness in accurately localizing defects.

We propose R3D-AD, a novel 3D anomaly detection method that suffers neither from the storage and time costs of memory-based embedded models nor from the risk of leaving anomalies unmasked in MAE-based reconstructive models. In contrast to PointMAE, one of our key insights is to perform undifferentiated masking for 3D objects via the noise diffusion mechanism, which maximizes the preservation of anomaly-free shapes and reconstructs abnormal regions. In the reparameterized diffusion process, a one-step full mask and reconstruction are achieved by converting the point cloud distribution, instead of using the multiple-iteration method [18]. We hypothesize that anomaly detection verifies the gap between the reconstructed shapes and the positive samples by learning point movement. Specifically, for input models with arbitrary anomalies, we encode them as latent shape embeddings that serve as decoding conditions and explicitly control the point cloud reconstruction process by step-wise displacement (SWD) decoding. The shape embedding carries abundant global features and makes it easier to train the network without dwelling on local anomaly details. Another key to our approach is a controllable method of point-wise displacement during the diffusion process that refines the point cloud deformation iteratively. We propose to inject the latent shape embedding into each step of the inverse denoising process, which drives the anomalous regions to converge to a smooth surface. We further adopt a 3D anomaly simulation strategy, Patch-Gen, to address the limitations of the dataset. It generates abundant defective samples by producing spatial irregularities that are faithful to real scenes, including bulges, sinks, etc. This point cloud data augmentation encourages the self-supervised model to reconstruct more realistic anomaly-free shapes when facing actual anomalies.

To the best of our knowledge, this is the very first attempt at exploring diffusion in reconstruction-based 3D anomaly detection. Our main contributions are summarized as follows: (i) We introduce a novel framework, termed R3D-AD, which performs a one-step full mask and anomaly-free reconstruction for fast and accurate 3D anomaly detection. (ii) We propose to learn the step-wise displacement in the reverse diffusion process to explicitly control the reconstruction of anomalous shapes. (iii) We introduce a 3D anomaly simulation strategy named Patch-Gen to address the limitation of the data anomaly patterns and improve the reconstruction performance in a supervised setting. (iv) Extensive experiments demonstrate that our R3D-AD has achieved state-of-the-art performance on both Real3D-AD and Anomaly-ShapeNet datasets.

2 Related work

2.1 2D Anomaly Detection

Anomaly detection has received increasing attention from researchers in recent years, and many new methods have been proposed to address the problem. Flow-based methods [30, 8, 36, 40] use learned distributions and the bijective property of flows to spot defects, while memory-based approaches [29, 14, 1] gauge anomaly scores by contrasting test sample features with the normal features stored in a memory bank. Reconstruction-based models [2, 43, 42] flag anomalies by comparing inputs to their online reconstructions. Recent works [16, 33, 12, 44] augment anomaly detection datasets with generated synthetic anomalies to compensate for the scarcity of negative examples.

2.2 3D Anomaly Detection

This field lags behind the development of 2D anomaly detection since 3D data are harder to obtain, while point cloud data are sparser and contain more noise than image data. BTF [11] integrates handcrafted 3D descriptors with the classic 2D method PatchCore [29], constructing a basic framework for 3D anomaly detection. M3DM [37] advances the field by separately analyzing features from point clouds and RGB images, then merging these for improved decision-making. CPMF [3] converts point clouds into two-dimensional images from multiple angles, extracts additional features from these images with a pre-trained network, and enhances detection capabilities through information fusion. Reg3D-AD [19] develops a registration-based method in which the RANSAC algorithm is used to align each sample before comparing it to the stored template during the test phase. IMRNet [18] trains a PointMAE [26] to reconstruct anomaly-free samples and identifies anomalies by comparing the reconstructed point cloud against the initial input. Many of these methods use memory banks to store the features of the training samples or require multiple iterations to restore points. Unlike previous methods, our approach requires only one step of reconstruction and has significant advantages in both time and space efficiency.

2.3 Diffusion Models

Diffusion models have proven their effectiveness in several generative tasks, such as image generation [32], speech generation [15], and video generation [10]. Denoising Diffusion Probabilistic Models (DDPMs) [13, 35, 34] employ a forward noising mechanism that incrementally adds Gaussian noise to images, alongside a reverse process trained to undo the forward mechanism. Denoise AD [22] applies a DDPM to reconstruction within the feature space, generating images that contain less noise. In recent years, many studies [25, 20, 6, 17] have attempted to use diffusion models for the 3D reconstruction task. DPM [23] incorporates a shape latent variable to encapsulate the geometric intricacies of 3D shapes and models this variable's distribution with Normalizing Flows [28, 7]. PVD [45] utilizes PVCNNs [21] for the point-voxel representation of 3D shapes and integrates structured locality into point clouds. This approach leverages the strengths of both point and voxel representations, improving the model's ability to capture the spatial hierarchies and local geometries within 3D objects. Since diffusion-based reconstruction recovers the target shape from complete noise, the dilemma of reconstructing only the masked region in the MAE [9] mechanism does not exist.

3 Method

3.1 Overview

We model the anomaly detection problem as mapping an anomalous point cloud $\mathcal{P}_{\text{a}}\in\mathbb{R}^{N\times 3}$ to a positive shape with which it is aligned. The framework of R3D-AD is shown in Fig. 2, where the simulated anomalous shapes are reconstructed in a self-supervised setting in the training phase and then compared with the original input to detect anomalies. The reconstructed anomaly-free model is aligned with the input, thus allowing direct computation of anomaly scores and segmentation of anomalous regions by conditioned distance functions. Simultaneously, the anomaly simulation strategy faithfully generates realistic defects and randomly synthesizes diverse anomaly shapes on normal samples, improving the generalization ability of the network in the case of limited anomaly samples.
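As a rough illustration of scoring by distance comparison, the sketch below computes a per-point anomaly map between an input cloud and an aligned reconstruction. The exact conditioned distance function is not specified in this section, so the nearest-neighbor distance and the max-pooled object score used here are assumptions, and the names `anomaly_map`, `p_in`, and `p_rec` are hypothetical.

```python
import numpy as np

def anomaly_map(p_in: np.ndarray, p_rec: np.ndarray) -> np.ndarray:
    """Per-point score: distance from each input point to its nearest reconstructed point."""
    diff = p_in[:, None, :] - p_rec[None, :, :]   # (N, M, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)          # (N, M) pairwise distances
    return dist.min(axis=1)

rng = np.random.default_rng(0)
# Toy "reconstruction": points on the plane z = 0.
p_rec = rng.uniform(-1.0, 1.0, size=(512, 3))
p_rec[:, 2] = 0.0
# Toy "input": the same surface with a small bulge on the first 20 points.
p_in = p_rec.copy()
p_in[:20, 2] += 0.3

point_scores = anomaly_map(p_in, p_rec)   # point-level anomaly map
object_score = point_scores.max()         # assumed object-level score
```

Because the two clouds share the same pose, no registration step is needed before the comparison, which is the property the overview relies on.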

3.2 Preliminary of denoising diffusion probabilistic models

A DDPM, inspired by the thermal diffusion process in an evolving thermodynamic system, consists of a forward diffusion process and a reverse process.

The forward Markovian process gradually adds Gaussian noise to a clean sample $\bm{x}^{(0)}$ from a data distribution $q(\bm{x}^{(0)})$ and turns it into a Gaussian noise $\bm{x}^{(T)}$, which is defined as

$$q(\bm{x}^{(0)},\dots,\bm{x}^{(T)})=\prod_{t=1}^{T}q(\bm{x}^{(t)}\mid\bm{x}^{(t-1)}), \tag{1}$$

where $q(\bm{x}^{(t)}\mid\bm{x}^{(t-1)})=\mathcal{N}(\bm{x}^{(t)};\sqrt{1-\beta_t}\,\bm{x}^{(t-1)},\beta_t\bm{I})$ is the Markov diffusion kernel, $t=1,\dots,T$, $T$ is the number of diffusion steps, and $\beta_t$ is a variance schedule. We have $q(\bm{x}^{(t)}\mid\bm{x}^{(0)})=\mathcal{N}(\bm{x}^{(t)};\sqrt{\overline{\alpha}_t}\,\bm{x}^{(0)},(1-\overline{\alpha}_t)\bm{I})$ by reparameterization with $\alpha_t=1-\beta_t$ and $\overline{\alpha}_t=\prod_{s=1}^{t}\alpha_s$. $\bm{x}^{(t)}$ can be sampled by

$$\bm{x}^{(t)}=\sqrt{\overline{\alpha}_t}\,\bm{x}^{(0)}+\bm{\epsilon}\sqrt{1-\overline{\alpha}_t}, \tag{2}$$

where $\bm{\epsilon}\sim\mathcal{N}(0,\bm{I})$ is standard Gaussian noise. When $T$ is large enough, $\bm{x}^{(T)}$ will eventually become Gaussian noise.
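The closed-form sampling in Eq. 2 can be sketched in a few lines of NumPy. The linear schedule and its endpoints ($10^{-4}$ to $0.02$ over $T=1000$ steps) are common DDPM defaults assumed here for illustration, not values stated in this section; `make_schedule` and `q_sample` are hypothetical helper names.

```python
import numpy as np

def make_schedule(T: int, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear variance schedule (assumed) with its cumulative products."""
    betas = np.linspace(beta_1, beta_T, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)   # alpha_bar_t = product of alpha_s up to step t
    return betas, alphas, alpha_bars

def q_sample(x0: np.ndarray, t: int, alpha_bars: np.ndarray, rng: np.random.Generator):
    """Draw x^(t) from q(x^(t) | x^(0)) using Eq. 2; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + eps * np.sqrt(1.0 - a_bar)
    return x_t, eps

rng = np.random.default_rng(0)
T = 1000
betas, alphas, alpha_bars = make_schedule(T)
x0 = rng.uniform(-1.0, 1.0, size=(2048, 3))   # toy point cloud
x_T, eps = q_sample(x0, T, alpha_bars, rng)
```

With this assumed schedule, `alpha_bars[-1]` is close to zero, so `x_T` is dominated by the noise term and retains almost nothing of `x0`.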

The reverse process is also a Markovian process that denoises over a series of steps to generate meaningful data from the target distribution $q(\bm{x}^{(0)})$. The inverse process denoises the noise $\bm{x}^{(T)}$ from a distribution $p(\bm{x}^{(T)})$, which is defined as

$$p_{\theta}(\bm{x}^{(0)},\dots,\bm{x}^{(T-1)}\mid\bm{x}^{(T)},\bm{c})=\prod_{t=1}^{T}p_{\theta}(\bm{x}^{(t-1)}\mid\bm{x}^{(t)},\bm{c}), \tag{3}$$

where $p_{\theta}(\bm{x}^{(t-1)}\mid\bm{x}^{(t)},\bm{c})=\mathcal{N}(\bm{x}^{(t-1)};\bm{\mu}_{\theta}(\bm{x}^{(t)},t,\bm{c}),\sigma_t^2\bm{I})$, the mean $\bm{\mu}_{\theta}(\bm{x}^{(t)},t,\bm{c})$ is estimated by a neural network parameterized by $\bm{\theta}$, $\bm{c}$ is the latent condition encoding, and $\sigma_t^2$ is a step-dependent variance. $\bm{\mu}_{\theta}$ can be reparameterized as

$$\bm{\mu}_{\theta}(\bm{x}^{(t)},t,\bm{c})=\frac{1}{\sqrt{\alpha_t}}\left(\bm{x}^{(t)}-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\bm{\epsilon}_{\bm{\theta}}(\bm{x}^{(t)},t,\bm{c})\right), \tag{4}$$

where $\bm{\epsilon}_{\bm{\theta}}(\bm{x}^{(t)},t,\bm{c})$ is a neural network utilized to denoise the Gaussian noise from $\bm{x}^{(T)}$.

The network $\bm{\epsilon}_{\bm{\theta}}(\bm{x}^{(t)},t,\bm{c})$ is trained to approximate $\bm{\epsilon}$. The training objective is defined as

$$\mathcal{L}=\mathbb{E}_{t\sim[1:T],\,\bm{x}^{(0)}\sim q(\bm{x}^{(0)}),\,\bm{\epsilon}\sim\mathcal{N}(0,\bm{I})}\left\|\bm{\epsilon}-\bm{\epsilon}_{\bm{\theta}}\left(\sqrt{\overline{\alpha}_t}\,\bm{x}^{(0)}+\sqrt{1-\overline{\alpha}_t}\,\bm{\epsilon},t,\bm{c}\right)\right\|, \tag{5}$$

where $t$ is sampled from the uniform distribution over $1,2,\dots,T$, $q(\bm{x}^{(0)})$ is the distribution of $\bm{x}^{(0)}$, and $\bm{\epsilon}$ is the Gaussian noise.
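To make the objective in Eq. 5 concrete, the sketch below evaluates a single Monte Carlo sample of the loss. The noise predictor is passed in as a plain callable, and the two stand-ins used here (`zero_predictor` and `oracle_predictor`) are hypothetical placeholders for a trained network; the linear schedule is again an assumed DDPM default.

```python
import numpy as np

def ddpm_loss_sample(x0, c, eps_theta, alpha_bars, rng):
    """One Monte Carlo sample of Eq. 5: || eps - eps_theta(x_t, t, c) ||."""
    T = len(alpha_bars)
    t = int(rng.integers(1, T + 1))            # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)        # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return float(np.linalg.norm(eps - eps_theta(x_t, t, c)))

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(256, 3))    # toy clean point cloud
c = np.zeros(256)                             # placeholder condition vector

def zero_predictor(x_t, t, c):
    """Untrained stand-in: always predicts zero noise."""
    return np.zeros_like(x_t)

def oracle_predictor(x_t, t, c):
    """Cheating stand-in that knows x0, so it recovers eps exactly."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * x0) / np.sqrt(1.0 - a_bar)

loss_untrained = ddpm_loss_sample(x0, c, zero_predictor, alpha_bars, rng)
loss_oracle = ddpm_loss_sample(x0, c, oracle_predictor, alpha_bars, rng)
```

The oracle shows what a perfect predictor achieves (a loss of essentially zero), while the zero predictor gives the norm of the sampled noise. In practice the expectation is approximated by averaging such samples over a mini-batch.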

Figure 2: Overall architecture of R3D-AD for shape reconstruction and anomaly detection of point cloud objects. Reconstruction training phase: The simulated anomalous $\mathcal{P}_{\text{a}}^{(0)}$ is generated from the input point cloud by Patch-Gen. It is then fully masked as $\mathcal{P}_{\text{a}}^{(T)}$ and also encoded into the latent shape embedding. The SWD decoder then explicitly reconstructs the anomaly-free object $\mathcal{P}_{\text{r}}^{(0)}$ with a consistent spatial transform by conditionally generating point-level displacements $\mathrm{\Delta}^{(t)}$ at each step of the inverse process. Detection testing phase: The test point cloud $\mathcal{P}_{\text{a}}^{(0)}$ is reconstructed to $\mathcal{P}_{\text{r}}^{(0)}$ with normal shape, and the two are compared at the distance level to detect the anomalous region.

3.3 Diffusion-based 3D anomaly reconstruction

We formulate the point cloud reconstruction of the anomaly-free model as conditional generation, which decodes the explicit displacement with the target distribution $q(\mathcal{P}_{\text{r}}\mid\mathcal{P}_{\text{a}}^{(T)},\bm{c})$, where $\bm{c}$ is the decoding condition. The essential question of anomaly detection in this paper is how to conditionally reconstruct anomaly-free shapes with reference to input point clouds under different spatial transformations. Since abnormal and normal samples share highly similar global features during self-supervised reconstruction, the most immediate approach is to extract an efficient global feature from the input to serve as an auxiliary conditional embedding for the denoising function $\bm{\epsilon}_{\bm{\theta}}$. We encode the latent shape embedding $\bm{c}$ as a conditional input to guide reconstruction in the reverse diffusion process.

3.3.1 Latent shape embedding

The feature encoder aims to encode the point cloud into the latent shape embedding $\bm{c}$ with high-level features for the conditional generation process. Different from other global-local extracting methods [38, 41], we focus more on extracting global features, which characterize the semantic information of the shape and pose of most anomaly-free regions in the point cloud. The feature encoder mainly consists of cascaded multi-layer perceptrons (MLPs) based on PointNet [5]. It applies max-pooling after mapping $\mathcal{P}_{\text{a}}^{(0)}$ to different dimensions and then compresses the results to extract the global shape embedding.
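A PointNet-style global encoder of the kind described above can be sketched as a shared per-point MLP followed by max-pooling. This is a simplified illustration with random, untrained weights: the layer widths `[3, 64, 128, 256]` are assumptions, and `init_mlp` and `encode_shape` are hypothetical names rather than the authors' implementation.

```python
import numpy as np

def init_mlp(dims, rng):
    """Random (untrained) weights for a shared per-point MLP."""
    return [
        (rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def encode_shape(points: np.ndarray, params) -> np.ndarray:
    """Map an (N, 3) point cloud to a single global embedding vector."""
    h = points
    for w, b in params:
        h = np.maximum(h @ w + b, 0.0)   # same linear layer + ReLU applied to every point
    return h.max(axis=0)                 # max-pool over points, independent of point order

rng = np.random.default_rng(0)
params = init_mlp([3, 64, 128, 256], rng)
cloud = rng.uniform(-1.0, 1.0, size=(1024, 3))
c = encode_shape(cloud, params)
c_shuffled = encode_shape(cloud[rng.permutation(len(cloud))], params)
```

Because max-pooling is symmetric, `c` and `c_shuffled` agree, which is why this design suits unordered point clouds. A small local defect changes only a few per-point features, so the pooled vector mostly reflects the global shape.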

3.3.2 Step-wise displacement decoding

To achieve point cloud reconstruction with transformation consistency while preserving the structure of non-anomalous regions, our method injects the latent shape embedding $\bm{c}$ into the decoder at each step of the reverse diffusion process, as shown in Fig. 2. In principle, in the training phase, the decoder $\bm{\epsilon}_{\bm{\theta}}$ learns the Gaussian noise added in the forward diffusion process in order to model the conditional probability distribution. Conditionally generating target shapes from $N\times 3$ Gaussian noise is a straightforward approach, but it struggles to reconstruct point cloud details and to keep the transformation consistent. Learning the relative deformation of points for anomalous objects is more efficient. Considering the mapping degradation of the vanilla autoencoder in the reconstruction training phase [22], we use the Gaussian noise of the forward process (Eq. 2) to fully mask the point cloud object directly without blind spots, preventing the decoding process from receiving negative state shapes. The masked points $\mathcal{P}_{\text{a}}^{(T)}$ and the latent shape embedding $\bm{c}\in\mathbb{R}^{256}$ are the inputs of the SWD decoder. The point-wise displacement vector $\mathrm{\Delta}^{(t)}$ is generated at each step of the iterative process, thus disentangling the predicted noise and the desired anomaly-free shape. The reverse process can be defined according to Eq. 3, and the displacement vector $\mathrm{\Delta}^{(t)}$ can be represented by

$$\mathrm{\Delta}^{(t-1)}=\frac{1}{\sqrt{\alpha_t}}\left(\mathrm{\Delta}^{(t)}-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\bm{\epsilon}_{\bm{\theta}}(\mathrm{\Delta}^{(t)},\beta_t,\bm{c})\right)+\sigma\bm{\epsilon}, \tag{6}$$

where $\sigma$ is the standard deviation of the injected noise. A PointwiseNet is adopted for $\bm{\epsilon}_{\bm{\theta}}$ to decode $\mathrm{\Delta}^{(t-1)}$ from the previous step and $\bm{c}$. $\beta_t$ is used to generate the trigonometric position embedding $\bm{e}_{\text{p}}=(\beta_t,\sin(\beta_t),\cos(\beta_t))$. $\bm{e}_{\text{p}}$ is concatenated with $\bm{c}$ and then fed into the concatenate-squash linear module of PointwiseNet with a residual function. The output reconstructed point cloud at step $t$ is $\mathcal{P}_{\text{r}}^{(t)}=\mathcal{P}_{\text{r}}^{(t+1)}+\mathrm{\Delta}^{(t)}$. Because the original and reconstructed objects are registered, anomalous shapes are distinguished by anomaly scores based on the conditioned distance function.
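A single reverse update of Eq. 6, together with the position embedding $\bm{e}_{\text{p}}$, can be sketched as follows. The noise predictor is a placeholder callable standing in for PointwiseNet, whose internals are not reproduced here; `time_embedding`, `swd_reverse_step`, and the schedule values are assumptions made for illustration.

```python
import numpy as np

def time_embedding(beta_t: float) -> np.ndarray:
    """Position embedding e_p = (beta_t, sin(beta_t), cos(beta_t))."""
    return np.array([beta_t, np.sin(beta_t), np.cos(beta_t)])

def swd_reverse_step(delta_t, t, c, eps_theta, betas, alpha_bars, sigma, rng):
    """One update of Eq. 6: Delta^(t) -> Delta^(t-1); t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar = alpha_bars[t - 1]
    cond = np.concatenate([time_embedding(beta_t), c])   # e_p concatenated with c
    eps_hat = eps_theta(delta_t, cond)                   # predicted noise, same shape as delta_t
    mean = (delta_t - beta_t / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(alpha_t)
    return mean + sigma * rng.standard_normal(delta_t.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)        # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
c = rng.normal(size=256)                    # latent shape embedding (256-d, as in the text)
delta = rng.standard_normal((1024, 3))      # displacement state at the last step

def zero_predictor(delta_t, cond):
    """Untrained stand-in for PointwiseNet: predicts zero noise."""
    return np.zeros_like(delta_t)

t = len(betas)
delta_prev = swd_reverse_step(delta, t, c, zero_predictor, betas, alpha_bars, 0.0, rng)
```

Iterating this step from $t=T$ down to $t=1$ with a trained predictor would yield the sequence of displacements. Conditioning every step on the same `c` is what ties the trajectory to the global shape of the input.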

3.4 3D anomaly simulation strategy

Because a small number of normal samples is not conducive to learning diverse and essential features, we propose the Patch-Gen strategy, which simulates defects on anomaly-free shapes to augment the training data. Patch-Gen encourages the reconstruction model to learn to detect irregularity: the anomaly-free point clouds and their diverse simulated anomaly patterns are combined into training pairs, from which the model learns features that discriminate between normal and anomalous surfaces. The intuition is that the diversity of simulated negative samples forces our network to learn how to reconstruct anomaly-free shapes instead of memorizing their complete appearance.

Figure 3: Illustration of Patch-Gen, the 3D anomaly simulation strategy. The input normal point cloud is first randomly rotated. On the surface of the normalized cube, we randomly select viewpoints to find the nearest patch of points. The selected points are then transformed into irregular defects according to the specific deformation solution.

As shown in Fig. 3, the input normal point cloud is first randomly rotated. The random spatial rotation is designed to improve the generalization capability for test samples with very different spatial transformations, as defined by:

$$\mathcal{P}_{\text{a}}=\mathcal{P}\cdot\mathcal{R}, \tag{7}$$

where $\mathcal{P}$ is the normal sample and $\mathcal{R}\in\mathbb{R}^{3\times 3}$ is obtained by randomly selecting rotation angles for all three axes. Beyond the global shape awareness provided by the random rotation, we further simulate anomalies at a fine granularity. We randomly take a viewpoint $\mathcal{P}_{v}$ from the surface of the normalized cube. The patch $\mathcal{P}_{n}$ of the $N$ points of $\mathcal{P}_{\text{a}}$ nearest to $\mathcal{P}_{v}$ is then determined. The shape augmentation scheme Patch-Gen is defined as follows:

$$\mathcal{P}_{n}=\mathcal{P}_{n}+S\cdot\textit{normalize}(\mathcal{P}_{n}-\mathcal{P}_{v})\odot\mathcal{T}, \tag{8}$$

where $\textit{normalize}$ represents a normalization operation on a vector, $S$ is a predefined hyper-parameter that controls the scaling of the patch points, and $\mathcal{T}$ is the translation matrix sampled from a Gaussian distribution. $\mathcal{P}_{\text{a}}$ is finally obtained by updating only the patch region $\mathcal{P}_{n}$ while keeping the remaining points unchanged.

With the proposed Patch-Gen, we can simulate the generation of multiple anomalies, which is mainly done by controlling 𝒯𝒯\mathcal{T}caligraphic_T. Bulge or sink can be generated by sorting 𝒯𝒯\mathcal{T}caligraphic_T after sampling from the distribution, while damage can be generated by direct overlaying without manipulation. Fig. 7 further illustrates the contrast between the generated anomalies and actual ones, affirming that our approach can remarkably emulate real-world scenarios with a high degree of fidelity.
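The Patch-Gen steps in Eqs. (7) and (8) can be sketched in NumPy as follows. This is an illustrative reading rather than the released implementation: the Euler-angle rotation, the way a viewpoint is drawn on the cube $[-1,1]^{3}$, and in particular the column-wise sort used for the bulge/sink variant (`sort_t=True`) are assumptions, since the text only states that $\mathcal{T}$ is sorted.

```python
import numpy as np


def random_rotation(rng):
    """Rotation matrix from three random axis angles (assumed Euler composition)."""
    ax, ay, az = rng.uniform(0.0, 2.0 * np.pi, size=3)
    rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return rx @ ry @ rz


def random_viewpoint(rng):
    """Random point on the surface of the cube [-1, 1]^3 (assumed sampling scheme)."""
    v = rng.uniform(-1.0, 1.0, size=3)
    v[rng.integers(3)] = rng.choice([-1.0, 1.0])  # pin one coordinate to a face
    return v


def patch_gen(points, n_patch, scale, rng, sort_t=False):
    """Return the augmented cloud P_a and the indices of the deformed patch."""
    p_a = points @ random_rotation(rng)                      # Eq. (7)
    p_v = random_viewpoint(rng)
    idx = np.argsort(np.linalg.norm(p_a - p_v, axis=1))[:n_patch]
    direction = p_a[idx] - p_v
    direction /= np.linalg.norm(direction, axis=1, keepdims=True) + 1e-12
    t = rng.standard_normal((n_patch, 3))                    # Gaussian translation matrix T
    if sort_t:
        t = np.sort(t, axis=0)  # assumption: one possible "sorting" for bulge/sink
    p_a[idx] = p_a[idx] + scale * direction * t              # Eq. (8), patch points only
    return p_a, idx


rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(4096, 3))
augmented, patch_idx = patch_gen(points, n_patch=4096 // 32, scale=0.1, rng=rng)
```

The call uses a patch of 1/32 of the points and $S=0.1$, matching the settings reported as best in the ablation; points outside `patch_idx` are only rotated.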

3.5 Training objective

In the reconstruction task of an object with $N$ points, the network learns a diffusion model with an $\mathbb{R}^{N\times 3}\rightarrow\mathbb{R}^{N\times 3}$ mapping. Iterative denoising under the semantic condition of the point embedding realizes the prediction of point offsets. Concretely, the network is trained to learn the noise that needs to be eliminated to recover the anomaly-free shape, using the $L_{2}$ distance between the ground truth and the denoised reconstructed points. We use the mean squared error (MSE) as the primary reconstruction loss, which averages the squared point-wise distances between $\mathcal{P}_{\text{a}}^{(0)}$ and $\mathcal{P}_{\text{r}}^{(0)}$. The MSE training loss is formulated as:

$$\mathcal{L}_{\mathcal{P}_{\text{a}},\mathcal{P}_{\text{r}}}=\frac{1}{N}\sum_{\substack{i=1\\ p_{\text{a}}\in\mathcal{P}_{\text{a}},\,p_{\text{r}}\in\mathcal{P}_{\text{r}}}}^{N}\left\|p_{\text{a}}-p_{\text{r}}\right\|^{2}. \tag{9}$$
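For concreteness, Eq. (9) can be evaluated as below. The sketch assumes that the two clouds are stored as `(N, 3)` arrays with row `i` of one corresponding to row `i` of the other, which is what a displacement-based reconstruction provides; the function name is illustrative.

```python
import numpy as np


def reconstruction_mse(p_a, p_r):
    """Mean over points of the squared Euclidean distance, as in Eq. (9)."""
    return float(np.mean(np.sum((p_a - p_r) ** 2, axis=1)))


p_a = np.zeros((2, 3))
p_r = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(reconstruction_mse(p_a, p_r))  # 2.5, i.e. (1 + 4) / 2
```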

4 Experiments

4.1 Datasets

| Method (Feat.) | BTF [11] (Raw) | BTF [11] (FPFH) | M3DM [37] (PointMAE) | PatchCore [29] (FPFH) | PatchCore [29] (PointMAE) | CPMF [3] (ResNet) | Reg3D-AD [19] (PointMAE) | IMRNet [18] (PointMAE) | Ours (Raw) |
|---|---|---|---|---|---|---|---|---|---|
| Airplane | 0.730 | 0.520 | 0.434 | 0.882 | 0.726 | 0.701 | 0.716 | 0.762 | 0.772 |
| Candybar | 0.539 | 0.630 | 0.552 | 0.541 | 0.663 | 0.552 | 0.685 | 0.755 | 0.696 |
| Car | 0.647 | 0.560 | 0.541 | 0.590 | 0.498 | 0.551 | 0.697 | 0.711 | 0.713 |
| Chicken | 0.789 | 0.432 | 0.683 | 0.837 | 0.827 | 0.504 | 0.852 | 0.780 | 0.714 |
| Diamond | 0.707 | 0.545 | 0.602 | 0.574 | 0.783 | 0.523 | 0.900 | 0.905 | 0.685 |
| Duck | 0.691 | 0.784 | 0.433 | 0.546 | 0.489 | 0.582 | 0.584 | 0.517 | 0.909 |
| Fish | 0.602 | 0.549 | 0.540 | 0.675 | 0.630 | 0.558 | 0.915 | 0.880 | 0.692 |
| Gemstone | 0.686 | 0.648 | 0.644 | 0.370 | 0.374 | 0.589 | 0.417 | 0.674 | 0.665 |
| Seahorse | 0.596 | 0.779 | 0.495 | 0.505 | 0.539 | 0.729 | 0.762 | 0.604 | 0.720 |
| Shell | 0.396 | 0.754 | 0.694 | 0.589 | 0.501 | 0.653 | 0.583 | 0.665 | 0.840 |
| Starfish | 0.530 | 0.575 | 0.551 | 0.441 | 0.519 | 0.700 | 0.506 | 0.674 | 0.701 |
| Toffees | 0.703 | 0.462 | 0.450 | 0.565 | 0.585 | 0.390 | 0.827 | 0.774 | 0.703 |
| Average | 0.635 | 0.603 | 0.552 | 0.593 | 0.595 | 0.586 | 0.704 | 0.725 | 0.734 |
Table 1: Image-level anomaly detection AUROC on Real3D-AD dataset. We highlight the best result in bold and the second best result with an underline.

4.1.1 Real3D-AD

Real3D-AD [19] is a 3D anomaly detection dataset based on real samples, exhibiting a higher point precision and spatial distance per point cloud. Each category contains 4 training samples and 100 test samples. The training set contains 360° complete surface point clouds of the objects, which are obtained by manually calibrating and stitching scans of multiple sides of the objects. The test samples are scans of only one side and therefore differ greatly from the training set. The distribution of the point clouds also varies among the 12 categories, further increasing the detection difficulty compared to 2D scenes.

4.1.2 Anomaly-ShapeNet

Anomaly-ShapeNet [18] is a 3D anomaly detection dataset crafted through modifications to the synthetic samples found in ShapeNetCorev2 [4]. It contains 40 diverse categories, featuring over 1600 samples of complete surface point clouds. Each category’s training set contains merely 4 samples, while the test sets are designed to assess the model’s performance across both normal samples and a spectrum of abnormal samples. It substantially broadens the anomaly types while keeping the number of points the same as in previous studies, which places higher demands on the robustness and generality of the proposed algorithms.

4.2 Evaluation metrics

For image-level anomaly detection, the Area Under the Receiver Operating Characteristic Curve (AUROC) is used in line with established practice. For pixel-level anomaly evaluation, the AUROC metric is similarly applied to point segmentation accuracy. An AUROC of 0.5 denotes no discriminative capability (equivalent to random guessing), whereas a score of 1.0 indicates perfect discrimination between positive and negative classes.
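To make the metric concrete, AUROC can be computed as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The sketch below uses this pairwise definition in NumPy; it is meant as an illustration of the metric, not as the evaluation code used in the experiments, and the function name is illustrative.

```python
import numpy as np


def auroc(scores, labels):
    """Pairwise AUROC: P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True marks an anomalous sample
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]       # all positive/negative score pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
print(auroc([0.3, 0.3, 0.3, 0.3], [0, 0, 1, 1]))   # 0.5, no discrimination
```

The same function applies at image level (one score per object) and at pixel level (one score per point).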

4.3 Implementation details

Our method is implemented in PyTorch [27] with end-to-end training across the network. Optimization is performed using the Adam optimizer with an initial learning rate of 0.001. Training uses a total batch size of 128 for 40,000 iterations. All input point clouds are randomly downsampled to a fixed size of 4096 and 2048 points on Real3D-AD and Anomaly-ShapeNet, respectively. Additionally, we normalize these point clouds by setting their center of gravity as the origin of coordinates and scaling their dimensions to fall within the range of -1 to 1, which suits the diffusion process.
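The preprocessing described above can be sketched as follows. Two details are assumptions rather than statements from the text: the cloud is scaled uniformly by its largest absolute coordinate after centering, and sampling falls back to drawing with replacement when a cloud has fewer points than requested. The function name is illustrative.

```python
import numpy as np


def preprocess(points, n_points, rng):
    """Randomly downsample to n_points, center at the centroid, scale into [-1, 1]."""
    replace = len(points) < n_points          # assumed fallback for small clouds
    idx = rng.choice(len(points), size=n_points, replace=replace)
    sampled = points[idx]
    centered = sampled - sampled.mean(axis=0)  # center of gravity becomes the origin
    return centered / np.abs(centered).max()   # assumed uniform scaling


rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(10000, 3))
cloud = preprocess(raw, n_points=4096, rng=rng)  # 4096 points as used for Real3D-AD
```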

| Method (Feat.) | BTF [11] (Raw) | BTF [11] (FPFH) | M3DM [37] (PointMAE) | PatchCore [29] (FPFH) | PatchCore [29] (PointMAE) | CPMF [3] (ResNet) | Reg3D-AD [19] (PointMAE) | IMRNet [18] (PointMAE) | Ours (Raw) |
|---|---|---|---|---|---|---|---|---|---|
| Ashtray | 0.578 | 0.420 | 0.577 | 0.587 | 0.591 | 0.353 | 0.597 | 0.671 | 0.833 |
| Bag | 0.410 | 0.546 | 0.537 | 0.571 | 0.601 | 0.643 | 0.706 | 0.660 | 0.719 |
| Bottle | 0.558 | 0.404 | 0.584 | 0.614 | 0.588 | 0.469 | 0.569 | 0.631 | 0.750 |
| Bowl | 0.470 | 0.581 | 0.579 | 0.558 | 0.547 | 0.679 | 0.548 | 0.676 | 0.751 |
| Bucket | 0.469 | 0.517 | 0.405 | 0.510 | 0.577 | 0.542 | 0.681 | 0.676 | 0.719 |
| Cap | 0.509 | 0.562 | 0.599 | 0.645 | 0.583 | 0.601 | 0.632 | 0.704 | 0.726 |
| Cup | 0.462 | 0.598 | 0.548 | 0.593 | 0.583 | 0.498 | 0.524 | 0.700 | 0.767 |
| Eraser | 0.525 | 0.719 | 0.627 | 0.657 | 0.677 | 0.689 | 0.343 | 0.548 | 0.890 |
| Headset | 0.447 | 0.505 | 0.597 | 0.610 | 0.609 | 0.551 | 0.574 | 0.698 | 0.767 |
| Helmet | 0.508 | 0.569 | 0.488 | 0.465 | 0.495 | 0.532 | 0.491 | 0.603 | 0.704 |
| Jar | 0.420 | 0.424 | 0.441 | 0.472 | 0.483 | 0.610 | 0.592 | 0.780 | 0.838 |
| Microphone | 0.563 | 0.671 | 0.357 | 0.388 | 0.488 | 0.509 | 0.414 | 0.755 | 0.762 |
| Shelf | 0.164 | 0.609 | 0.564 | 0.494 | 0.523 | 0.685 | 0.688 | 0.603 | 0.696 |
| Tap | 0.549 | 0.553 | 0.747 | 0.760 | 0.498 | 0.528 | 0.659 | 0.686 | 0.818 |
| Vase | 0.517 | 0.464 | 0.534 | 0.554 | 0.582 | 0.514 | 0.576 | 0.629 | 0.734 |
| Average | 0.493 | 0.528 | 0.552 | 0.568 | 0.562 | 0.559 | 0.572 | 0.659 | 0.749 |
Table 2: Image-level anomaly detection AUROC on Anomaly-ShapeNet dataset. We highlight the best result in bold and the second best result with an underline.

4.4 Main results

We conduct experiments on Real3D-AD [19] based on real sampling and Anomaly-ShapeNet [18] based on simulation.

As shown in Table 1, we first compare the image-level AUROC metric with current cutting-edge 3D anomaly detection models on Real3D-AD. It shows that our method achieves the best performance using only raw point cloud data, while most of the existing methods use Fast Point Feature Histograms (FPFH) operator [31] or ShapeNet [4] pre-trained PointMAE [26] as feature extractor. Due to significant disparities in quantity, size, and distribution among different categories of point clouds in Real3D-AD, scoring variations across categories are more pronounced with other methods. For instance, numerous methods perform under 0.5 in certain categories, indicating their inadequacy in extracting meaningful features while facing challenging samples. In contrast, our method not only exhibits superior performance in 3D anomaly detection across the majority of categories but also achieves the best overall average across all categories. This demonstrates the strong generalizability and robustness of our approach.

We further evaluate our method on Anomaly-ShapeNet in Table 2, which encompasses a broader array of categories and a greater diversity of defect types. Compared to Real3D-AD, the increased variety of defect types in Anomaly-ShapeNet further raises the complexity of the detection task. Our method achieves the best result in every evaluated category, with an average AUROC of 0.749, an improvement of 9.0 points over the previous best average of 0.659 (IMRNet).

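| Model | Diffusion | Condition | Relative | Patch-Gen | I-AUROC | P-AUROC |
|---|---|---|---|---|---|---|
| A | ✓ | | | | 0.586 | 0.524 |
| B | ✓ | ✓ | | | 0.667 | 0.513 |
| C | ✓ | ✓ | ✓ | | 0.712 | 0.573 |
| D | ✓ | ✓ | ✓ | ✓ | 0.734 | 0.592 |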
Table 3: Ablation studies for 3D adaptation components on Real3D-AD dataset.

4.5 Ablation study

To delve into the effect of individual components, we conduct ablation experiments on the Real3D-AD dataset. To fully demonstrate and compare the performance of the models, we report both image-level and pixel-level results with I-AUROC and P-AUROC, respectively.

4.5.1 Main component

Table 3 compares the performance of different variants of R3D-AD, covering the influence of the denoising condition embedding, the displacement-based reconstruction, and the Patch-Gen data augmentation strategy. Model A is our baseline, a vanilla DDPM model for point cloud reconstruction. Introducing a condition into the DDPM (Model B) significantly boosts performance, particularly in terms of I-AUROC, which rises from 0.586 to 0.667 (a 13.8% relative increase). Model C predicts point displacements with the conditional DDPM, which preserves detailed structural information while accommodating the relative displacement of points, and yields a notable gain of 6.0 points in P-AUROC over Model B (0.513 to 0.573). Model D is additionally trained with the Patch-Gen strategy under the shape-embedding condition. Since the defective portion covers only a small part of the original point cloud, reconstructing the relative displacement preserves as much detail as possible, which is effective for both 3D anomaly detection and segmentation.

4.5.2 Patch-Gen

Table 4 analyzes the influence of two key parameters in Patch-Gen: the selection points ratio and the scaling points factor.

The selection points ratio in Table 4(a) determines the proportion of points in the point cloud that are selected for transformation. Our findings suggest that a selection ratio of 1/32 achieves the best performance. This ratio appears to provide a balanced trade-off between maintaining sufficient structure for anomaly detection and introducing enough variation to simulate anomalies effectively. Notably, as the ratio increases beyond 1/16, both I-AUROC and P-AUROC drop sharply. Since real defects account for only a small portion of the overall point cloud, selecting too many points not only destroys the structure of the original point cloud but also makes the distributions of the training and test sets inconsistent.

The scaling points factor is the intensity of the random transformation applied to the selected points, as detailed in Table 4(b). The optimal performance is observed at a scaling factor of 0.1, which implies that minor transformations are more effective for simulating anomalies without significantly altering the original data distribution. Larger scaling factors lead to a consistent decline in performance, underscoring the importance of subtle transformations for preserving the utility of the simulated anomalies for detection tasks.

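| ratio | I-AUROC | P-AUROC |
|---|---|---|
| 1/64 | 0.716 | 0.584 |
| 1/32 | 0.734 | 0.592 |
| 1/16 | 0.727 | 0.579 |
| 1/8 | 0.683 | 0.528 |

(a)

| factor | I-AUROC | P-AUROC |
|---|---|---|
| 0.1 | 0.734 | 0.592 |
| 0.2 | 0.727 | 0.572 |
| 0.4 | 0.715 | 0.554 |
| 0.8 | 0.661 | 0.517 |

(b)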
Table 4: Ablation studies for Patch-Gen implementation on Real3D-AD dataset. Default settings are marked in gray.
Figure 4: Memory and time cost during inference on Real3D-AD dataset. (a) Memory usage comparison between different models. (b) 3D anomaly detection performance vs. frames per second on an NVIDIA RTX 3090 GPU. Our R3D-AD outperforms all previous methods on both accuracy and efficiency by a significant margin.

4.5.3 Memory and time cost

As depicted in Figure 4, we evaluate the disparity in both storage consumption and inference time of our model under identical experimental conditions, compared to existing methods. Regarding memory usage, our approach demonstrates a marked superiority by employing raw coordinate features instead of FPFH or PointMAE features, significantly reducing the memory footprint. Since no memory bank exists, our method is also more space-efficient compared to BTF which also uses raw features. Moreover, our method eliminates the necessity to compare all the features in memory, substantially increasing operational efficiency. The implementation of Patch-Gen inherently bestows our model with exceptional robustness, enabling precise reconstruction of point clouds from various angles without the need for the time-intensive RANSAC alignment process required by Reg3D-AD.

Figure 5: Qualitative analysis on Real3D-AD dataset and Anomaly-ShapeNet dataset. The anomaly map is obtained directly by calculating the differences between the input and reconstructed point clouds, where deeper colors represent more confidence.

4.6 Qualitative results

Figure 5 presents some qualitative outcomes, with varying shades of color indicating different levels of anomaly scores. We select several representative defective samples to demonstrate the robustness of our algorithm. The left four columns display samples from Real3D-AD, while the right four columns display samples from Anomaly-ShapeNet. The illustration shows that R3D-AD precisely reconstructs the defective portions of the point cloud across various samples: the deep sink in the Seahorse sample, the concavity in the Bag sample, and the bulge in the Jar sample. From the accurately reconstructed point clouds, the final point cloud segmentation maps are also produced, further evidencing the efficacy of our approach.
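Because the reconstruction moves each input point rather than resampling the shape, input and output points stay in correspondence, and an anomaly map can be read off as a per-point distance. The sketch below illustrates this idea in NumPy; taking the maximum point score as the object-level score is an assumption made for illustration, as the exact aggregation is not specified in this section, and the function name is illustrative.

```python
import numpy as np


def anomaly_scores(p_in, p_rec):
    """Per-point distance between input and reconstruction, plus an object-level score."""
    point_scores = np.linalg.norm(p_in - p_rec, axis=1)  # one score per point
    object_score = float(point_scores.max())             # assumed aggregation
    return point_scores, object_score


rng = np.random.default_rng(0)
p_rec = rng.uniform(-1.0, 1.0, size=(100, 3))  # stand-in for a reconstructed cloud
p_in = p_rec.copy()
p_in[:5] += np.array([0.0, 0.0, 0.2])          # synthetic bulge on five points
point_scores, object_score = anomaly_scores(p_in, p_rec)
```

Here only the five displaced points receive a non-zero score, which is the behaviour the anomaly maps in Figure 5 visualize with deeper colors.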

5 Conclusion

In this work, we presented R3D-AD, a novel reconstructive 3D anomaly detection model based on conditional diffusion. Our goal is to overcome the limitations faced by current 3D anomaly detection methods, such as the inefficiencies due to the memory bank module and the low performance caused by incorrect rebuilds with MAE. To address these challenges, we leverage the diffusion process for full reconstruction, followed by a direct comparison between the input and the reconstructed point cloud to obtain the final anomaly score. The embedded latent variable spans the decoding process, generating point-level displacements step by step from the noise to the target anomaly-free sample. We also propose Patch-Gen, a data augmentation strategy tailored for point cloud anomaly simulation. Extensive experiments conducted on 3D anomaly benchmarks validate the superiority of our R3D-AD in comparison to state-of-the-art alternatives in terms of both accuracy and versatility.

Acknowledgements

This work was supported in part by the Pioneer and Leading Goose R&D Program of Zhejiang (Grant No. 2022C01051), in part by the National Natural Science Foundation of China (Grant No. 52375271, 52275274), and in part by the Natural Science Foundation of Zhejiang Province (Grant No. LY23E050011).

References

  • [1] Bae, J., Lee, J.H., Kim, S.: Pni: Industrial anomaly detection using position and neighborhood information. In: ICCV (2023)
  • [2] Bergmann, P., Löwe, S., Fauser, M., Sattlegger, D., Steger, C.: Improving unsupervised defect segmentation by applying structural similarity to autoencoders. In: VISIGRAPP (2019)
  • [3] Cao, Y., Xu, X., Shen, W.: Complementary pseudo multimodal feature for point cloud anomaly detection. arXiv preprint (2023)
  • [4] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint (2015)
  • [5] Charles, R.Q., Su, H., Kaichun, M., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017)
  • [6] Chu, R., Xie, E., Mo, S., Li, Z., Nießner, M., Fu, C.W., Jia, J.: Diffcomplete: Diffusion-based generative 3d shape completion. In: NeurIPS (2023)
  • [7] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real nvp. In: ICLR (2017)
  • [8] Gudovskiy, D., Ishizaka, S., Kozuka, K.: Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In: WACV (2022)
  • [9] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR (2022)
  • [10] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., Salimans, T.: Imagen video: High definition video generation with diffusion models (2022)
  • [11] Horwitz, E., Hoshen, Y.: Back to the feature: Classical 3d features are (almost) all you need for 3d anomaly detection. In: CVPRW (2023)
  • [12] Hu, T., Zhang, J., Yi, R., Du, Y., Chen, X., Liu, L., Wang, Y., Wang, C.: Anomalydiffusion: Few-shot anomaly image generation with diffusion model. In: AAAI (2024)
  • [13] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
  • [14] Kim, D., Park, C., Cho, S., Lee, S.: Fapm: Fast adaptive patch memory for real-time industrial anomaly detection. In: ICASSP (2023)
  • [15] Kong, Z., Ping, W., Huang, J., Zhao, K., Catanzaro, B.: Diffwave: A versatile diffusion model for audio synthesis. In: ICLR (2021)
  • [16] Li, C.L., Sohn, K., Yoon, J., Pfister, T.: Cutpaste: Self-supervised learning for anomaly detection and localization. In: CVPR (2021)
  • [17] Li, M., Duan, Y., Zhou, J., Lu, J.: Diffusion-sdf: Text-to-shape via voxelized diffusion. In: CVPR (2023)
  • [18] Li, W., Xu, X., Gu, Y., Zheng, B., Gao, S., Wu, Y.: Towards scalable 3d anomaly detection and localization: A benchmark via 3d anomaly synthesis and a self-supervised learning network. arXiv preprint (2023)
  • [19] Liu, J., Xie, G., Li, X., Wang, J., Liu, Y., Wang, C., Zheng, F., et al.: Real3d-ad: A dataset of point cloud anomaly detection. In: NeurIPS (2023)
  • [20] Liu, Z., Feng, Y., Black, M.J., Nowrouzezahrai, D., Paull, L., Liu, W.: Meshdiffusion: Score-based generative 3d mesh modeling. In: ICLR (2023)
  • [21] Liu, Z., Tang, H., Lin, Y., Han, S.: Point-voxel cnn for efficient 3d deep learning. In: NeurIPS (2019)
  • [22] Lu, F., Yao, X., Fu, C., Jia, J.: Removing anomalies as noises for industrial defect localization. In: ICCV (2023)
  • [23] Luo, S., Hu, W.: Diffusion probabilistic models for 3d point cloud generation. In: CVPR (2021)
  • [24] Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research (2008)
  • [25] Mo, S., Xie, E., Chu, R., Hong, L., Nießner, M., Li, Z.: Dit-3d: Exploring plain diffusion transformers for 3d shape generation. In: NeurIPS (2023)
  • [26] Pang, Y., Wang, W., Tay, F.E., Liu, W., Tian, Y., Yuan, L.: Masked autoencoders for point cloud self-supervised learning. In: ECCV (2022)
  • [27] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. NeurIPS (2019)
  • [28] Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: ICML (2015)
  • [29] Roth, K., Pemula, L., Zepeda, J., Schölkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. In: CVPR (2022)
  • [30] Rudolph, M., Wandt, B., Rosenhahn, B.: Same same but differnet: Semi-supervised defect detection with normalizing flows. In: WACV (2021)
  • [31] Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: ICRA (2009)
  • [32] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M.: Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint (2022)
  • [33] Schlüter, H.M., Tan, J., Hou, B., Kainz, B.: Natural synthetic anomalies for self-supervised anomaly detection and localization. In: ECCV (2022)
  • [34] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint (2020)
  • [35] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: ICLR (2021)
  • [36] Tailanian, M., Pardo, Á., Musé, P.: U-flow: A u-shaped normalizing flow for anomaly detection with unsupervised threshold. arXiv preprint (2022)
  • [37] Wang, Y., Peng, J., Zhang, J., Yi, R., Wang, Y., Wang, C.: Multimodal industrial anomaly detection via hybrid fusion. In: CVPR (2023)
  • [38] Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. TOG (2019)
  • [39] Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: Simmim: A simple framework for masked image modeling. In: CVPR (2022)
  • [40] Yu, J., Zheng, Y., Wang, X., Li, W., Wu, Y., Zhao, R., Wu, L.: Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv preprint (2021)
  • [41] Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In: CVPR (2022)
  • [42] Zavrtanik, V., Kristan, M., Skočaj, D.: Draem-a discriminatively trained reconstruction embedding for surface anomaly detection. In: ICCV (2021)
  • [43] Zavrtanik, V., Kristan, M., Skočaj, D.: Reconstruction by inpainting for visual anomaly detection. Pattern Recognition (2021)
  • [44] Zhang, X., Xu, M., Zhou, X.: Realnet: A feature selection network with realistic synthetic anomaly for anomaly detection. In: CVPR (2024)
  • [45] Zhou, L., Du, Y., Wu, J.: 3d shape generation and completion through point-voxel diffusion. In: ICCV (2021)

Appendix 0.A Appendix

0.A.1 Additional implementation details

0.A.1.1 Patch-Gen pseudocode

We formulate the proposed 3D anomaly simulation strategy Patch-Gen in Algorithm 1. The procedure takes an initial point cloud $\mathcal{P}$ as input and produces an augmented point cloud $\mathcal{P}_{a}$ that contains an added anomaly. The rotation matrix $\mathcal{R}$ is obtained by applying arbitrary rotation angles to all the rotation axes. The translation matrix $\mathcal{T}$ is sampled from a Gaussian distribution, and after normalization and scaling, it dictates the displacement of the nearest points towards the viewpoint, while the rest of the point cloud remains unchanged. Anomalous point clouds of the damage type can be obtained by directly applying the randomly sampled matrix $\mathcal{T}$. By sorting the matrix $\mathcal{T}$, we can further simulate defects such as bulge and sink. The viewpoint can be likened to a gravitational anchor that exerts influence on the patch points within its domain.

Algorithm 1 Patch-Gen
Input: $\mathcal{P}$: input point cloud
Input: $N$: number of points to select
Input: $S$: scaling factor for transformation
Output: $\mathcal{P}_{a}$: augmented point cloud
$\mathcal{R}\leftarrow$ random rotation matrix $\triangleright$ $\mathbb{R}^{3\times 3}$
$\mathcal{P}_{a}=\mathcal{P}\cdot\mathcal{R}$ $\triangleright$ apply rotation
$\mathcal{P}_{v}\leftarrow$ random viewpoint $\triangleright$ $\mathbb{R}^{1\times 3}$
$\mathcal{P}_{n}=\textit{NN}(\mathcal{P}_{a},\mathcal{P}_{v},N)$ $\triangleright$ select $N$ nearest neighbor points to the viewpoint
$\mathcal{T}\leftarrow$ random translation matrix $\triangleright$ $\mathbb{R}^{N\times 3}$
$\mathcal{P}_{n}=\mathcal{P}_{n}+S\cdot\textit{normalize}(\mathcal{P}_{n}-\mathcal{P}_{v})\odot\mathcal{T}$ $\triangleright$ update selected points only

0.A.1.2 R3D-AD pseudocode

To further clarify the overall architecture of the proposed network R3D-AD, we provide the training and testing iteration procedures more compactly in Algorithm 2 and Algorithm 3, respectively.

During training, anomalies are simulated by Patch-Gen, and noise is artificially added following a Gaussian distribution. The model predicts this noise and calculates a displacement to correct for it. The reconstruction loss is measured by comparing the original and corrected point clouds.

Algorithm 2 R3D-AD training iteration
$\mathcal{P}$: input point cloud
$\mathcal{L}$: reconstruction loss
$\mathcal{P}^{\prime} \sim \mathrm{Uniform}(\mathrm{normalize}(\mathcal{P}))$ $\triangleright$ normalize and downsample the input point cloud
$\mathcal{P}_{a}^{(0)} = \textit{Patch-Gen}(\mathcal{P}^{\prime})$ $\triangleright$ 3D anomaly simulation strategy (Algorithm 1)
$c = \textit{PointNet}(\mathcal{P}_{a}^{(0)})$ $\triangleright$ feature extraction
$\mathcal{P}_{a}^{(0)} \sim q(\mathcal{P}_{a}^{(0)})$ $\triangleright$ point distribution
$t \sim \mathrm{Uniform}(\{1, \dotsc, T\})$ $\triangleright$ step distribution
$\bm{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ $\triangleright$ noise distribution
$\bm{\mu} = \bm{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{t}}\,\mathcal{P}_{a}^{(0)} + \sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon},\ c,\ t\right)$ $\triangleright$ noise prediction
$\Delta = \frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\sqrt{\bar{\alpha}_{t}}\,\mathcal{P}_{a}^{(0)} + \sqrt{1-\bar{\alpha}_{t}}\,(\bm{\epsilon} - \bm{\mu})\right)$ $\triangleright$ displacement prediction
$\mathcal{L} = \left\| \mathcal{P}^{\prime} - (\mathcal{P}_{a}^{(0)} + \Delta) \right\|^{2}$ $\triangleright$ relative reconstruction loss
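The steps of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: point clouds are lists of `[x, y, z]` coordinates, `eps_model` is a hypothetical stand-in for the noise predictor $\bm{\epsilon}_{\theta}$ (assumed to already hold the PointNet feature $c$), and `betas` is an assumed noise schedule with $\alpha_t = 1 - \beta_t$. The displacement is transcribed term by term from the algorithm.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t (t is 1-indexed)."""
    return math.prod(1.0 - b for b in betas[:t])


def training_step(clean, anomalous, eps_model, betas, rng):
    """One iteration of Algorithm 2 on lists of [x, y, z] points.

    clean     -- P', the normalized and downsampled input cloud
    anomalous -- P_a^(0), the Patch-Gen output with the same point order
    eps_model -- hypothetical stand-in for the conditioned noise predictor
    """
    t = rng.randint(1, len(betas))                       # step distribution
    a_bar = alpha_bar(betas, t)
    s_sig, s_noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    eps = [[rng.gauss(0.0, 1.0) for _ in p] for p in anomalous]
    noisy = [[s_sig * x + s_noise * e for x, e in zip(p, ep)]
             for p, ep in zip(anomalous, eps)]           # forward diffusion
    mu = eps_model(noisy, t)                             # noise prediction
    # Displacement exactly as written in Algorithm 2.
    delta = [[(s_sig * x + s_noise * (e - m)) / s_sig
              for x, e, m in zip(p, ep, mp)]
             for p, ep, mp in zip(anomalous, eps, mu)]
    # Squared error between P' and the corrected cloud P_a^(0) + Delta.
    loss = sum((c - (x + d)) ** 2
               for pc, pa, pd in zip(clean, anomalous, delta)
               for c, x, d in zip(pc, pa, pd))
    return {"t": t, "eps": eps, "mu": mu, "delta": delta, "loss": loss}
```

In a real training loop the returned loss would be backpropagated through the network behind `eps_model`; the sketch only evaluates it.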

During testing, noise is progressively removed from a simulated noisy version of the cloud, aiming to reconstruct its anomaly-free counterpart. The anomaly score is obtained by comparing the KNN point clusters of the original and reconstructed point clouds.

Algorithm 3 R3D-AD testing iteration
$\mathcal{P}$: input point cloud
$\mathcal{A}$: anomaly score
$\mathcal{P}^{\prime} \sim \mathrm{Uniform}(\mathrm{normalize}(\mathcal{P}))$ $\triangleright$ normalize and downsample the input point cloud
$c = \textit{PointNet}(\mathcal{P}^{\prime})$ $\triangleright$ feature extraction
$\Delta^{(T)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
for $t = T, \dotsc, 1$ do
     $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$
     $\Delta^{(t-1)} = \frac{1}{\sqrt{\alpha_{t}}}\left(\Delta^{(t)} - \frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\bm{\epsilon}_{\theta}(\Delta^{(t)}, c, t)\right) + \sigma_{t}\mathbf{z}$
end for
$\widehat{\mathcal{P}} = \mathcal{P}^{\prime} + \Delta^{(0)}$ $\triangleright$ reconstructed point cloud
$\widehat{cluster} = \textit{KNN}(\widehat{\mathcal{P}}, k)$ $\triangleright$ reconstructed point-cluster
$cluster = \textit{KNN}(\mathcal{P}^{\prime}, k)$ $\triangleright$ input point-cluster
$\mathcal{A} = \left\| cluster - \widehat{cluster} \right\|^{2}$ $\triangleright$ Euclidean distance for point-cluster
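A plain-Python sketch of Algorithm 3 is given below as an illustration only. Several details are assumptions rather than statements from the paper: `eps_model` is a hypothetical stand-in for the conditioned noise predictor, $\sigma_t^2 = \beta_t$ is an assumed (commonly used) variance choice, and each KNN cluster is summarized by its centroid so that the cluster distance becomes a per-point score.

```python
import math
import random


def reverse_displacement(eps_model, betas, n_points, rng):
    """Reverse loop of Algorithm 3; returns Delta^(0) for n_points points."""
    alphas = [1.0 - b for b in betas]
    delta = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n_points)]
    for t in range(len(betas), 0, -1):
        a_t = alphas[t - 1]
        a_bar = math.prod(alphas[:t])
        sigma = math.sqrt(betas[t - 1])       # assumption: sigma_t^2 = beta_t
        pred = eps_model(delta, t)
        coef = (1.0 - a_t) / math.sqrt(1.0 - a_bar)
        # z is Gaussian for t > 1 and zero at the last step.
        delta = [[(d - coef * e) / math.sqrt(a_t)
                  + (sigma * rng.gauss(0.0, 1.0) if t > 1 else 0.0)
                  for d, e in zip(pd, pe)]
                 for pd, pe in zip(delta, pred)]
    return delta


def knn_indices(points, k):
    """Indices of the k nearest neighbours (self included) of every point."""
    return [sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))[:k]
            for p in points]


def centroid(points, idx):
    """Mean position of the points selected by idx."""
    return [sum(points[j][a] for j in idx) / len(idx) for a in range(3)]


def anomaly_scores(inp, recon, k):
    """Squared distance between input and reconstructed cluster centroids.

    Assumption: clusters are compared through their centroids, and recon[i]
    is the reconstruction of inp[i] (same point order).
    """
    nn_in, nn_rec = knn_indices(inp, k), knn_indices(recon, k)
    return [math.dist(centroid(inp, a), centroid(recon, b)) ** 2
            for a, b in zip(nn_in, nn_rec)]
```

The reconstruction is then `inp[i] + delta[i]` per point, and `anomaly_scores(inp, recon, k)` gives one score per point. The brute-force neighbour search is quadratic in the number of points and is only meant for small examples.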
Figure 6: t-SNE visualization on Real3D-AD.
Anomaly type Bulge Sink Oracle
Airplane 1.31 1.35 1.58
Candybar 2.43 2.30 2.54
Car 1.15 1.23 1.37
Chicken 3.50 2.92 4.02
Diamond 0.84 0.83 0.97
Duck 1.53 1.29 1.67
Fish 1.42 1.45 1.57
Gemstone 2.58 5.23 5.26
Seahorse 2.37 2.35 2.45
Shell 1.30 1.29 1.40
Starfish 2.47 2.46 2.64
Toffees 1.73 1.71 1.79
Table 5: PSNR of the anomalies generated by Patch-Gen on Real3D-AD.

A.2 Additional experiments

A.2.1 Quality of the generated anomalies

The proposed 3D anomaly simulation strategy Patch-Gen is designed to address the lack of 3D anomalous samples in the training phase.

T-distributed Stochastic Neighbor Embedding (t-SNE) [24] is particularly effective at visualizing high-dimensional samples by giving each data point a corresponding location in a low-dimensional map, allowing complex data to be understood at a glance. We follow [16] and use t-SNE to validate the quality and effectiveness of our generated anomaly samples. As shown in Fig. 6, the generated anomalies are clearly distinguished from normal samples and overlap with real anomalous samples, which strengthens our model's ability to reconstruct unseen anomalies well.

Peak Signal-to-Noise Ratio (PSNR) is an engineering term that quantifies the quality of the reconstruction of a signal. PSNR is typically measured in decibels (dB) and calculated based on the mean squared error between the original and the reconstruction. The higher the PSNR value, the better the quality of the reconstruction. In Table 5, the PSNR is computed by comparing the generated samples with real anomalies. To obtain an upper bound on the PSNR for each category, we randomly select two normal samples, calculate their PSNR, and average the values obtained over multiple random selections. The Oracle PSNR serves as a reference for the generation quality.
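For reference, the standard PSNR definition, $10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$, can be sketched as below. How the paper pairs points between two clouds and which peak value it uses are not stated here, so both are assumptions in this sketch: the clouds are taken to have the same size and point order, and `peak` defaults to 1.0 as for coordinates normalized to the unit range.

```python
import math


def psnr(reference, generated, peak=1.0):
    """PSNR in dB between two point clouds with matching point order.

    Assumptions: equal-sized clouds with corresponding points, and `peak`
    as the largest possible coordinate magnitude (1.0 after normalization).
    """
    sq = [(a - b) ** 2
          for p, q in zip(reference, generated)
          for a, b in zip(p, q)]
    mse = sum(sq) / len(sq)
    if mse == 0.0:
        return math.inf          # identical clouds: no reconstruction error
    return 10.0 * math.log10(peak ** 2 / mse)
```

Halving every coordinate error divides the MSE by four and therefore raises the PSNR by about 6.02 dB, which gives a feel for the differences between the columns of Table 5.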

Training Training Testing Testing I-AUROC CD Oracle
Dataset Category Dataset Category
Real3D-AD Airplane ShapeNetCore.v2 Airplane - 0.032 0.001
Real3D-AD Car ShapeNetCore.v2 Car - 0.077 0.004
ShapeNetCore.v2 Airplane Real3D-AD Airplane 0.614 - 0.772
ShapeNetCore.v2 Car Real3D-AD Car 0.601 - 0.713
Anomaly-ShapeNet {bowl0..3} Anomaly-ShapeNet bowl4 0.715 - 0.744
Table 6: Generalization capability of R3D-AD for unseen data.
Method BTF[11] BTF[11] M3DM[37] PatchCore[29] PatchCore[29] CPMF[3] Reg3D-AD[19] IMRNet[18] Ours
Feat. Raw FPFH PointMAE FPFH PointMAE ResNet PointMAE PointMAE Raw
ashtray0 0.578 0.420 0.577 0.587 0.591 0.353 0.597 0.671 0.833
bag0 0.410 0.546 0.537 0.571 0.601 0.643 0.706 0.660 0.720
bottle0 0.597 0.344 0.574 0.604 0.513 0.520 0.486 0.552 0.733
bottle1 0.510 0.546 0.637 0.667 0.601 0.482 0.695 0.700 0.737
bottle3 0.568 0.322 0.541 0.572 0.650 0.405 0.525 0.640 0.781
bowl0 0.564 0.509 0.634 0.504 0.523 0.783 0.671 0.681 0.819
bowl1 0.264 0.668 0.663 0.639 0.629 0.639 0.525 0.702 0.778
bowl2 0.525 0.510 0.684 0.615 0.458 0.625 0.490 0.685 0.741
bowl3 0.385 0.490 0.617 0.537 0.579 0.658 0.348 0.599 0.767
bowl4 0.664 0.609 0.464 0.494 0.501 0.683 0.663 0.676 0.744
bowl5 0.417 0.699 0.409 0.558 0.593 0.685 0.593 0.710 0.656
bucket0 0.617 0.401 0.309 0.469 0.593 0.482 0.610 0.580 0.683
bucket1 0.321 0.633 0.501 0.551 0.561 0.601 0.752 0.771 0.756
cap0 0.668 0.618 0.557 0.580 0.589 0.601 0.693 0.737 0.822
cap3 0.527 0.522 0.423 0.453 0.476 0.551 0.725 0.775 0.730
cap4 0.468 0.520 0.777 0.757 0.727 0.553 0.643 0.652 0.681
cap5 0.373 0.586 0.639 0.790 0.538 0.697 0.467 0.652 0.670
cup0 0.403 0.586 0.539 0.600 0.610 0.497 0.510 0.643 0.776
cup1 0.521 0.610 0.556 0.586 0.556 0.499 0.538 0.757 0.757
eraser0 0.525 0.719 0.627 0.657 0.677 0.689 0.343 0.548 0.890
headset0 0.378 0.520 0.577 0.583 0.591 0.643 0.537 0.720 0.738
headset1 0.515 0.490 0.617 0.637 0.627 0.458 0.610 0.676 0.795
helmet0 0.553 0.571 0.526 0.546 0.556 0.555 0.600 0.597 0.757
helmet2 0.602 0.542 0.623 0.425 0.447 0.462 0.614 0.641 0.633
helmet3 0.526 0.444 0.374 0.404 0.424 0.520 0.367 0.573 0.707
helmet4 0.349 0.719 0.427 0.484 0.552 0.589 0.381 0.600 0.720
jar0 0.420 0.424 0.441 0.472 0.483 0.610 0.592 0.780 0.838
microphone0 0.563 0.671 0.357 0.388 0.488 0.509 0.414 0.755 0.762
shelf0 0.164 0.609 0.564 0.494 0.523 0.685 0.688 0.603 0.696
tap0 0.525 0.560 0.754 0.753 0.458 0.359 0.676 0.676 0.736
tap1 0.573 0.546 0.739 0.766 0.538 0.697 0.641 0.696 0.900
vase0 0.531 0.342 0.423 0.455 0.447 0.451 0.533 0.533 0.788
vase1 0.549 0.219 0.427 0.423 0.552 0.345 0.702 0.757 0.729
vase2 0.410 0.546 0.737 0.721 0.741 0.582 0.605 0.614 0.752
vase3 0.717 0.699 0.439 0.449 0.460 0.582 0.650 0.700 0.742
vase4 0.425 0.510 0.476 0.506 0.516 0.514 0.500 0.524 0.630
vase5 0.585 0.409 0.317 0.417 0.579 0.618 0.520 0.676 0.757
vase7 0.448 0.518 0.657 0.693 0.650 0.397 0.462 0.635 0.771
vase8 0.424 0.668 0.663 0.662 0.663 0.529 0.620 0.630 0.721
vase9 0.564 0.268 0.663 0.660 0.629 0.609 0.594 0.594 0.718
Average 0.493 0.528 0.552 0.568 0.562 0.559 0.572 0.659 0.749
Table 7: Complete image-level anomaly detection AUROC on Anomaly-ShapeNet dataset. We highlight the best result in bold.

A.2.2 Generalization on unseen data

To assess the robustness and generalization capabilities of our proposed model, we conduct a series of experiments on different categories from diverse datasets, as outlined in Table 6. The oracle result represents the performance ceiling of our model, obtained by training on the same category as the one used for testing.

For known categories, we focus on the well-regarded ShapeNetCore.v2 dataset [4], which includes categories such as Airplane and Car that are also featured in the Real3D-AD dataset [19]. Note that ShapeNetCore.v2 is not an anomaly detection dataset and contains no anomalous samples. Therefore, the AUROC metric cannot be computed for the first and second rows of Table 6. Instead, we evaluate how well models trained on a Real3D-AD category generalize to the same category of ShapeNetCore.v2 using the Chamfer Distance (CD) metric. The marked decline in performance observed when transferring from ShapeNetCore.v2 to Real3D-AD, and vice versa, illustrates the hurdles presented by inconsistencies between datasets. This highlights the importance of our reconstruction approach, which effectively learns inductive biases, allowing for better generalization across different data distributions.
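The Chamfer Distance used for the first two rows of Table 6 can be sketched as follows. Conventions differ across papers (squared or unsquared distances, sums or means), and the variant used for Table 6 is not specified in this section, so the symmetric mean of squared nearest-neighbour distances below is an assumption.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point clouds.

    Assumed variant: mean squared nearest-neighbour distance from a to b,
    plus the same quantity from b to a.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)
```

Unlike the loss in Algorithm 2, this metric needs no point-to-point correspondence, which is why it can compare a reconstruction with a cloud from a different dataset. The brute-force search is quadratic in the number of points.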

For unknown categories, we utilize the Anomaly-ShapeNet dataset [18], as shown in the last row of Table 6. The model is trained on a subset of the bowl objects (bowl0 to bowl3) and tested on bowl4, a category it never encountered during training. Despite this lack of prior exposure, our model achieves an image-level AUROC of 0.715. This surpasses all other methods trained and tested exclusively on “bowl4”, demonstrating the strong generalization capability of our method.
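As a reminder of what the image-level AUROC measures, the sketch below computes it from one anomaly score per test sample using the rank-based (Mann-Whitney) formulation. How a single score is derived from the per-point scores is not described in this section, so the function simply assumes one score per sample and labels of 1 for anomalous and 0 for normal.

```python
def auroc(scores, labels):
    """Probability that a random anomalous sample outscores a random normal one.

    scores -- one anomaly score per test sample (assumed given)
    labels -- 1 for anomalous, 0 for normal
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one normal and one anomalous sample")
    # Count correctly ordered pairs; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` evaluates to 0.75. The `ValueError` branch mirrors the situation of the first two rows of Table 6, where the test set contains no anomalous samples and the metric is undefined.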

These results not only validate the effectiveness of our approach in handling both known and unknown categories but also underscore its potential for real-world applications where data diversity and unseen scenarios are commonplace.

A.3 Additional main results

Anomaly-ShapeNet [18] contains a total of 40 categories. In Table 2 of the main text, due to space limitations, we treat objects of the same kind but with differing appearances as one category (e.g., bottle0, bottle1, and bottle3 are categorized as Bottle). Here, we report the per-category image-level AUROC in Table 7.

(a) Visualization on Real3D-AD dataset.
(b) Visualization on Anomaly-ShapeNet dataset.
Figure 7: Visualization on Real3D-AD dataset and Anomaly-ShapeNet dataset. The red region indicates the real abnormal area of the anomaly point cloud in the testing set, while the yellow region indicates the simulated abnormal area generated by Patch-Gen based on the normal point cloud in the training set.

A.4 Additional qualitative results

To further demonstrate and compare the effect of our proposed 3D anomaly simulation strategy Patch-Gen, we conduct additional qualitative analysis on the Real3D-AD dataset and the Anomaly-ShapeNet dataset.

The first row shows the anomaly samples in the testing split, the second row shows the normal samples in the training split, and the third row shows the anomaly samples simulated by Patch-Gen. As Fig. 7 shows, our method simulates defects that vary across classes, indicating that it can compensate well for the domain gap caused by using only positive samples for training in 3D anomaly detection.