Surface Normal Estimation with Transformers

Barry Shichen Hu

{}^{1}

{}^{*}

Siyun Liang

{}^{1}

{}^{*}

Johannes Paetzold

{}^{1}

Huy H. Nguyen

{}^{2}\quad

Isao Echizen

{}^{2,3}

Jiapeng Tang

{}^{1}

{}^{1}

Technical University of Munich

{}^{2}

National Institute of Informatics, Japan

{}^{3}

University of Tokyo, Japan

Abstract

We propose the use of a Transformer to accurately predict normals from point clouds with noise and density variations. Previous learning-based methods utilize PointNet variants to explicitly extract multi-scale features at different input scales, then focus on a surface fitting method by which local point cloud neighborhoods are fitted to a geometric surface approximated by either a polynomial function or a multi-layer perceptron (MLP). However, fitting surfaces to fixed-order polynomial functions can suffer from overfitting or underfitting, and learning MLP-represented hyper-surfaces requires pre-generated per-point weights. To avoid these limitations, we first unify the design choices in previous works and then propose a simplified Transformer-based model to extract richer and more robust geometric features for the surface normal estimation task. Through extensive experiments, we demonstrate that our Transformer-based method achieves state-of-the-art performance on both the synthetic shape dataset PCPNet, and the real-world indoor scene dataset SceneNN, exhibiting more noise-resilient behavior and significantly faster inference. Most importantly, we demonstrate that the sophisticated hand-designed modules in existing works are not necessary to excel at the task of surface normal estimation. The code, data, and pre-trained models are publicly available in https://anonymous.4open.science/r/E34CYRW-17E7.

Figure 1: We unify and simplify existing learning-based methods for surface normal estimation by proposing a straightforward Transformer-based model that directly predicts normals without relying on surface fitting. Our greatly simplified method not only achieves state-of-the-art performance but also exhibits significantly faster inference speed than previous works. In the figure, we present the simplified pipelines of existing works for comparison, and visualize the prediction error using a heat map. Inference times are recorded as well.

^*^*footnotetext: Equal contribution.

1 Introduction

Estimating surface normals of point clouds is a fundamental problem in 3D computer vision that has a wide variety of downstream applications, such as point cloud denoising [27, 2, 38, 28], rendering [5, 13, 32], and reconstruction [21, 12]. While a significant amount of research has been dedicated to this topic, the accurate prediction of point cloud normals amid various types of noise, missing structures, and density variations remains a persistent challenge.

Existing methods address the surface normal estimation problem through either traditional surface fitting methods or more recent learning-based approaches. Traditional methods involve fitting planes or polynomials to a local neighborhood and then computing the normal from the estimated surface [17, 1, 19, 22, 37]. However, explicit surface fitting is sensitive to noise and outliers. Furthermore, it heavily relies on hand-tuned parameters, such as the order of the polynomial function, which can lead to underfitting or overfitting [30, 31, 6, 14]. In contrast, earlier learning-based methods like [7, 15, 4, 47, 46] apply neural networks to directly regress the surface normal, thus bypassing explicit surface fitting and its associated challenges.

However, recent learning-based methods [3, 49, 23, 24, 43] have renewed interest in the traditional surface fitting paradigm, demonstrating that the integration of a neural network into this conventional approach led to superior performance compared to direct regression. These methods initially use a neural network, such as the PointNet family [33, 34], to learn point-wise weights of a neighborhood and then apply a classic geometric surface fitting algorithm, like n-jet fitting, to compute normals [8]. Following the idea of surface fitting, [25] innovatively proposes hyper-surface fitting by learning a set of MLP layers whose parameters interpret the geometric structures of a hyper-surface. While avoiding the model fitting problem associated with surface fitting methods, [25] relies on a set of handcrafted per-point weights that may not accurately reflect the true contribution of points to the surface fitting. To address these issues, [26] learns an angular field that points toward the ground truth normal, instead of directly predicting the surface normal. This method, however, requires extensive sampling and time-consuming optimization during testing. To mitigate the pitfalls of existing methods, we take a step back and ask the challenging question:

Can rich geometric features be extracted directly from raw point clouds for normal estimation without relying on any handcrafted features or hand-designed modules?

To address this question, we first analyze current learning-based methods for surface normal estimation and discover that, despite variations in network design, the fundamental design choices in existing works are centered around Graph Convolution, which preserves locality, and multi-scale feature fusion, which aggregates geometric features from larger to smaller scales. Therefore, in this work, we continue to use Graph Convolution for local neighborhood aggregation and explore the optimal features for the Graph operation. Additionally, we propose using a Transformer as an alternative for multi-scale feature extraction, contending that the Transformer can extract richer multi-scale features due to its superior capacity for modeling relationships and its expansive receptive field.

As a result, we propose SNEtransformer, a simplified and unified Transformer-based backbone that learns directly from point clouds for normal estimation. Experiments on synthetic and RGB-D scan datasets demonstrate that our backbone not only achieves state-of-the-art performance but also proves to be faster in inference and more resilient to noise compared to existing methods. In summary, our main contributions are:

•

We unify previous learning-based methods and propose the first Transformer-based model for end-to-end normal estimation without additional surface fitting steps.
•

We demonstrate that our method achieves state-of-the-art accuracy and inference speed, showing greater resilience to noise in both synthetic and real-world scan datasets.
•

Through comprehensive ablation studies, we identify the best design decisions that lead to increased accuracy.

2 Related Work

Learning-based Direct Regression Methods.

Initial methods have been proposed to directly regress normal vectors from raw point clouds using neural networks. PCPNet [15] applies the PointNet architecture [33] in multi-scale neighborhoods to extract geometric features based on which normals and curvatures of point clouds are estimated. Nesti-Net [4] follows a structure similar to PCPNet but proposes training multiple backbones on neighborhoods of different sizes, then uses a mixture-of-experts architecture [20] to select the optimal backbone to predict the normal. Refine-Net [46] follows a two-stage design where it first computes an initial normal estimate and then deploys a deep neural network for refinement. Despite the simplicity of existing direct regression methods, they have demonstrated weaker performance in normal estimation tasks.

Learning-based Surface Fitting Methods.

Recent methods combine traditional surface fitting techniques with deep learning to achieve higher accuracy. DeepFit [3] and AdaFit [49] both use PointNet-based models [33] to predict point-wise weights in a local patch and then apply n-jet fitting to estimate the normal [8]. However, they suffer from underfitting or overfitting due to the fixed order of the polynomial function. Hsurf [25] explores geometric priors in high-dimensional space but requires training with hand-crafted per-point weights. NeAF [26] learns an angular field and applies extensive sampling and test time optimization to obtain surface normal vectors. Each of the aforementioned methods features hand-designed modules and comes with its own limitations. Instead, we propose a simpler yet more effective architecture that directly predicts surface normals from raw point clouds.

Transformers in 3D Applications.

Transformers have gained increasing popularity since their introduction [39]. Though they were originally introduced as a language model, they have proven to be effective in computer vision tasks [11, 35, 10, 45, 50]. Furthermore, Transformers serve as robust backbones for 3D applications as well. Zhao et al. [44] proposed the use of a Transformer for point cloud classification and segmentation tasks. Misra et al. [29] applied a Transformer to 3D object detection without relying on predetermined query points. Yu et al. [41] achieved state-of-the-art performance in 3D point cloud completion, while Shit et al. [36] utilized the Transformer’s relational modeling power in 3D graph generation. Although Transformers have been shown to be effective for 3D tasks, no previous work has applied Transformers to surface normal estimation tasks. Therefore, we decided to explore this area by experimenting with the straightforward use of Transformers for point cloud normal estimation tasks.

3 Method

3.1 Preliminaries

Surface Normal Estimation.

Given a local point set centered at a query point $\mathbf{p}$ as $\mathcal{P}=\{\mathbf{p}_{i}\mid i=1,\ldots,m\}$ , the learning objective is to estimate the unoriented normal $\mathbf{n_{p}}$ of the point $\mathbf{p}$ .

Graph Convolution.

Graph Convolution learns local geometric structures of a point cloud by first constructing a local neighborhood graph through the $k$ -nearest neighbor algorithm centered at a query point [40]. We represent the resulting point cloud patch as coordinates $\mathcal{X}=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{k}\}\subseteq\mathbb{R}^{3}$ , their features $\mathcal{F}=\{\mathbf{f}_{1},\ldots,\mathbf{f}_{k}\}\subseteq\mathbb{R}^{F}$ , and the graph as $\mathcal{G}=(\mathcal{V},\mathcal{E})$ , where $\mathcal{V}=\{1,\ldots,k\}$ and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ are the nodes and edges, respectively.

Then, one convolution step calculates and aggregates the edge features, as graphically illustrated in Figure 2, and the mathematical formula for the convolution operation is:

\mathbf{f}_{i}^{\prime}=\square_{j:(i,j)\in\mathcal{E}}h_{\Theta}(\mathbf{f}_{% i},\mathbf{f}_{j})

(1)

where $\mathbf{f}_{i}$ is the feature vector of the point $\mathbf{x}_{i}$ , and $\{\mathbf{f}_{j}:(i,j)\in\mathcal{E}\}$ are features of the nearest neighbors of $\mathbf{x}_{i}$ . The function $h_{\Theta}:\mathbb{R}^{F}\times\mathbb{R}^{F}\rightarrow\mathbb{R}^{F^{\prime}}$ is a learnable function parameterized by $\Theta$ that extracts edge features, and $\square$ is a symmetric aggregation function.

Cascaded Scale Aggregation.

To explicitly extract multi-scale features from a point cloud patch, Zhu et al. [49] proposed the Cascaded Scale Aggregation (CSA), which was later adopted by subsequent studies [24, 25]. Essentially, CSA utilizes features from a larger scale to assist feature extraction at a smaller scale. For a point $\mathbf{p}$ , the scale $s$ is defined as the size of the nearest neighbor point set $\mathcal{N}_{s}(\mathbf{p})$ . A CSA layer considers two scales, $s_{k}$ and $s_{k+1}$ , where $s_{k+1}<s_{k}$ and $\mathcal{N}_{s_{k+1}}\subseteq\mathcal{N}_{s_{k}}$ . For a point $\mathbf{p}$ , we integrate its feature $\mathbf{f}_{k}$ at scale $s_{k}$ into its feature aggregation at scale $s_{k+1}$ , as follows:

\mathbf{f}_{k+1}=\phi_{k}\left(\varphi_{k}\left(\operatorname{MaxPool}\left\{% \mathbf{f}_{k,j}\mid\mathbf{p}_{j}\in\mathcal{N}_{s_{k}}\right\}\right),% \mathbf{f}_{k}\right)

(2)

where both $\phi_{k}$ and $\varphi_{k}$ are Multi-layer Perceptrons (MLPs). The motivation is that the larger scale provides information about broader surfaces, while the smaller scale includes more detailed local features for surface fitting [49]. In subsequent sections, we demonstrate that the Transformer is a superior method for multi-scale feature extraction compared to CSA.

Attention Mechanism.

Attention models the relations between inputs. The inputs consist of queries, keys, and values, where the queries and keys are of the same dimension $d_{k}$ , and values are of dimension $d_{v}$ . To compute the attention scores, [39] computes the dot products of all queries with all keys, divided by $\sqrt{d_{k}}$ , and then applies a softmax function. The resulting mathematical formulation is:

\operatorname{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\operatorname{% softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\right)\mathbf{V}

(3)

where $\mathbf{Q}$ , $\mathbf{K}$ , and $\mathbf{V}$ are the query, key and value matrices [39]. Self-attention simply means that the $\mathbf{Q}$ , $\mathbf{K}$ , and $\mathbf{V}$ are derived from the same input features, and the Transformer Encoder architecture utilizes the self-attention mechanism to extract the relations between the inputs.

We apply Transformer Encoder layers to extract multi-scale geometry for two reasons. First, the attention mechanism ‘attends’ to points in both smaller and larger neighborhoods, thus naturally extracting multi-scale features. Second, attention models the contribution of each input point to the calculation of the normal vector, and such contributions are modeled by the attention scores. This serves as a denoising mechanism, particularly useful in the case of noisy input where the attention learns to weigh the importance of input points instead of indiscriminately favoring smaller neighborhoods, as CSA does. Therefore, we hypothesize that Transformers lead to more noise-agnostic behavior. This hypothesis is verified in Section 4, and Figure 3 visualizes the weights predicted by CSA and Transformer.

Refer to caption — Figure 2: Graph Convolution preserves locality, while the Transformer Encoder extracts multi-scale features. The global attention map assigns larger weights to ‘more reliable’ points and smaller weights to ‘unreliable’ ones, thereby functioning as a denoising mechanism.

3.2 Analysis of Alternative Methods

We analyze recent state-of-the-art models to identify design patterns that lead to improved surface normal estimation performance. The architectures of HSurf-Net [25]and GraphFit [24] are documented in Figure 1.

AdaFit.

The backbone of AdaFit is based on the Cascaded Scale Aggregation (CSA) module. It employs a series of CSA and MLP layers to aggregate geometric features from broader neighborhoods down to the smallest neighborhood [49]. Then, it deploys two MLP heads to predict the point-wise offsets and weights. The point-wise offsets are used to denoise the input point cloud, and then the weights are applied for n-jet fitting on the denoised point patch to predict the normal vector at the query point [49].

GraphFit.

GraphFit consists of a series of CSA-based Graph Convolution and adaptive layers [24]. Given the feature vector set $\mathcal{F}=\left\{\mathbf{f}_{i}\mid i=1,2,\ldots,m\right\}$ of a local point set $\mathcal{P}$ , the Graph Convolution is formulated as follows:

\mathbf{f}_{i}^{\prime}=\max_{j\in\mathcal{N}(i)}\phi\left(\left[\mathbf{f}_{j% }-\mathbf{f}_{i},\mathbf{f}_{i}\right]\right)

(4)

where $\mathbf{f}_{i}$ is the feature at point $\mathbf{p}_{i}$ , and $\{\mathbf{f}_{j}\mid j\in\mathcal{N}(i)\}$ is the set of features of point $\mathbf{p}_{i}$ ’s neighbours. $[\cdot,\cdot]$ is the concatenation operation and $\phi$ represents an MLP [24]. Then, CSA is used to include features obtained from current Graph Convolution into the next level. Following one block of Graph Convolutions and CSA, which outputs the aggregated feature set $\mathcal{F}^{\prime}=\left\{\mathbf{f}_{i}^{\prime}\mid i=1,2,\ldots,m\right\}$ , GraphFit adaptively updates the per-point feature as follows:

\displaystyle\overline{\mathcal{F}}

\displaystyle=\mathbf{s}_{\mathcal{F}}\odot\mathcal{F}+\left(1-\mathbf{s}_{% \mathcal{F}}\right)\odot\mathcal{F}^{\prime}

(5)

where $\mathbf{s}_{\mathcal{F}}$ represents element-wise weights predicted from $\mathcal{F}$ and $\mathcal{F}^{\prime}$ by an MLP. Similar to AdaFit, extracted features are used to predict the per-point offsets and weights for n-jet surface fitting [24].

HSurf-Net.

Like [24, 49], HSurf-Net utilizes Graph Convolution based operation to extract local features and CSA variant to fuse features from larger to smaller scales [25]. However, to prevent overfitting or underfitting due to the fixed order of the polynomial function, HSurf-Net proposes the use of an MLP to represent a hypersurface. To conduct hypersurface fitting, HSurf-Net predicts a set of per-point weights and element-wise multiplies these weights with the point features extracted. The resulting feature set is then fed into a block of MLP and pooling layers to predict the normal. To guide the model in predicting the correct weight for each point, HSurf-Net uses pre-generated target weights following the method in [42].

Summary on Common Design Patterns.

Despite differences in surface fitting and specific hand-crafted network modules, the common designs of existing methods include Graph Convolution and multi-scale feature fusion with CSA. Thus, we propose a simple Transformer-based backbone that unifies existing works.

3.3 Proposed Architecture

Our backbone consists of multiple layers of enhanced Graph Convolution [40] and Transformer Encoder [39], as illustrated in Fig. 2. At each layer, we first update each point’s feature through a Graph Convolution operation. Then, the point features are directly fed as input to a Transformer Encoder layer.

3.3.1 Enhanced Graph Convolution

Inspired by the relative position encoding discussed in [25] and the graph convolution presented in [24], we propose an enhanced convolutional approach for feature aggregation within a local neighborhood. Consider a local point cloud obtained by executing the $k$ -nearest neighbors algorithm centered at a point with coordinates $\mathbf{x}_{c}$ , resulting in a graph-structured point set represented by Cartesian coordinates $\left\{\mathbf{x}_{i}\mid i=1,2,\ldots,k\right\}\in\mathbb{R}^{k\times 3}$ and their corresponding features $\left\{\mathbf{f}_{i}\mid i=1,2,\ldots,k\right\}\in\mathbb{R}^{k\times F}$ . To aggregate the features from the local neighbourhood to $\mathbf{x}_{c}$ , we first construct the edge features between $\mathbf{x}_{c}$ and $\mathbf{x}_{j}$ as follows:

\mathbf{e}_{cj}=\phi\left([\mathbf{x}_{j}-\mathbf{x}_{c},\mathbf{x}_{c},% \mathbf{x}_{j},\mathbf{f}_{j},\mathbf{f}_{j}-\mathbf{f}_{c}]\right),j\in% \mathcal{N}(c)

(6)

where $[\cdot,\cdot]$ represents the concatenation operation, and $\phi$ is implemented as an MLP. We then output the local neighborhood information for $\mathbf{x}_{c}$ :

\mathbf{f}_{c}^{\prime}=\max_{j\in\mathcal{N}(c)}\mathbf{e}_{cj}

(7)

Our graph convolution operation not only aggregates local features to preserve locality, but also encodes positional information and edge features for use in the subsequent Transformer Encoder layer.

3.3.2 Transformer Layer

We utilize the Transformer Encoder Layer, as proposed in [39], to extract multi-scale geometric features. The architecture is depicted in Figure LABEL:fig:transformer_arch in the supplementary material. Specifically, features extracted from the graph convolution are directly fed into a Transformer Encoder layer. Instead of limiting the attention operation to a local neighborhood of a point, global attention is employed among all points in the input. The experimental results, detailed in Section 4.6, demonstrate that this global attention mechanism leads to improved outcomes.

3.3.3 Loss Function

Our goal is to predict the unoriented normal vector; hence, we apply the sin loss between the predicted normal vectors of the point cloud patch, $\hat{\mathbf{n}}_{\mathbf{p}}$ , and the ground truth, $\mathbf{n}_{\mathbf{p}}$ :

L=\left\|\hat{\mathbf{n}}_{\mathbf{p}}\times\mathbf{n}_{\mathbf{p}}\right\|

(8)

3.3.4 Comparison to Alternative Backbones

Our model architecture is greatly simplified compared to alternative methods. First, we do not rely on a surface fitting scheme like those described in [25, 49, 24, 3]; instead, we directly predict the normal vectors for the entire input point cloud patch. Second, we do not explicitly extract multi-scale features by operating the backbone at different scales of the point cloud; rather, we leverage the Transformer’s ability to model relationships and implicitly extract multi-scale features. Third, instead of using carefully hand-designed modules to extract geometric features from the input, we apply simple Graph Convolution layers and Transformer encoder layers. We demonstrate that a straightforward combination of Graph Convolution with a Transformer is sufficient to accurately predict normal vectors.

4 Results

We first explain the experimental setup, then demonstrate the quality of normal estimation on the widely-used synthetic dataset PCPNet [15], the real-world scan dataset SceneNN and Semantic3D [18, 16]. Finally, we use ablation studies to identify the design choices that enable our method to be more accurate. For additional qualitative visualizations, please refer to the supplementary material.

4.1 Experimental Setup

Data Preprocessing.

Similar to [25, 24, 3], SNEtransformer takes in a local patch of 700 points obtained by the $k$ -nearest neighbors algorithm at a query point. Following [25, 24, 15], to remove unnecessary degrees of freedom, we normalize each point’s coordinates by the patch radius and rotate the points into a coordinate system defined by Principal Component Analysis [25]. Given a point cloud patch, instead of only predicting the normal vector of a query point as in [25, 24, 15], SNEtransformer predicts the normal vectors of the entire query point neighborhood.

Evaluation Metrics.

We adopt the angular Root Mean Squared Error (RMSE) between the predicted normal and the ground truth to evaluate the estimation results [15]. Suppose a point cloud $\mathcal{P}$ ’s predicted normal set is $\hat{\mathcal{N}}(\mathcal{P})=\left\{\hat{\mathbf{n}}_{i}\in\mathbb{R}^{3}% \right\}_{i=1}^{m}$ , and ground truth normal set is $\mathcal{N}(\mathcal{P})=\left\{\mathbf{n}_{i}\in\mathbb{R}^{3}\right\}_{i=1}^% {m}$ . The RMSE error is calculated as:

\operatorname{RMSE}(\hat{\mathcal{N}}(\mathcal{P}))=\sqrt{\frac{1}{m}\sum_{i=1% }^{m}\arccos^{2}\left(\hat{\mathbf{n}}_{i},\mathbf{n}_{i}\right)}

(9)

Following [25, 49, 15], we also use the metric of the percentage of good points PGP( $\alpha$ ) to analyze the error distribution of the predicted normal:

\operatorname{PGP}(\alpha)=\frac{1}{m}\sum_{i=1}^{m}I\left(\arccos\left(\hat{% \mathbf{n}}_{i},\mathbf{n}_{i}\right)<\alpha\right),\alpha\in\left[0^{\circ},3% 0^{\circ}\right]

(10)

where $I$ is the indicator function. PGP( $\alpha$ ) measures the percentage of normal predictions with errors that fall below various angle thresholds denoted by $\alpha$ .

Implementation Details.

For training, we use an Adam optimizer with a learning rate of $2\times 10^{-4}$ and a batch size of 32. The learning rate is decreased by a factor of 0.005 every epoch. Our method is trained for 250 epochs, during which we randomly sample 100,000 point patches from the training set in each epoch. Experiments are conducted on a cluster of NVIDIA A100 GPUs.

Category	Year	Approach	PCPNet Dataset							SceneNN Dataset
			Noise $\sigma$				Density		Average	Orig-	Extra	Average
			None	0.12%	0.6%	1.2%	Stripes	Gradient	Average	inal	Noise	Average
PCA [17]	1992	Classical surface fitting	12.29	12.87	18.38	27.52	13.66	12.81	16.25	15.93	16.32	16.12
Jet [9]	2005	Classical surface fitting	12.35	12.84	18.33	27.68	13.39	13.13	16.29	15.17	15.59	15.38
HoughCNN [7]	2016	Direct regression	10.23	11.62	22.66	33.39	11.02	12.47	16.90	-	-	-
PCPNet [15]	2018	Direct regression	9.64	11.51	18.27	22.84	11.73	13.46	14.58	20.86	21.40	21.13
Nesti-Net [4]	2019	Direct regression	7.06	10.24	17.77	22.31	8.64	8.95	12.49	13.01	15.19	14.10
Lenssen et al. [23]	2020	Learning-based surface fitting	6.72	9.95	17.18	21.96	7.73	7.51	11.84	10.24	13.00	11.62
DeepFit [3]	2020	Learning-based surface fitting	6.51	9.21	16.73	23.12	7.92	7.31	11.80	10.33	13.07	11.70
Refine-Net [46]	2022	Direct regression	5.92	9.04	16.52	22.19	7.70	7.20	11.43	18.09	19.73	18.91
Zhang et al. [43]	2022	Learning-based surface fitting	5.65	9.19	16.78	22.93	6.68	6.29	11.25	9.31	13.11	11.21
Zhou et al. [48]	2021	Learning-based surface fitting	5.90	9.10	16.50	22.08	6.79	6.40	11.13	-	-	-
AdaFit [49]	2021	Learning-based surface fitting	5.19	9.05	16.44	21.94	6.01	5.90	10.76	8.39	12.85	10.62
GraphFit [24]	2022	Learning-based surface fitting	4.45	8.74	16.05	21.64	5.22	5.48	10.26	7.99	12.17	10.08
NeAF [26]	2023	Angular field	4.20	9.25	16.35	21.74	4.89	4.88	10.22	-	-	-
HSurf-Net [25]	2022	Learning-based surface fitting	4.17	8.78	16.25	21.61	4.98	4.86	10.11	7.55	12.23	9.89
SNEtransformer (Ours)	2023	Direct regression	3.99	8.97	15.85	20.98	4.81	4.67	9.88	7.44	12.14	9.79
SNEdiffusion (Ours)	2023	Diffusion	4.00	8.88	16.25	21.37	4.96	4.89	10.05	-	-	-
SNEdiffusion as regression model(Ours)	2023	Direct regression	3.93	8.90	16.27	21.40	4.82	4.64	10.00	-	-	-

Table 1: Normal angle RMSE results on the PCPNet and SceneNN dataset, sorted by the values (lowers are better) on the PCPNet dataset. As a direct regression method, SNEtransformer outperforms existing learning-based surface fitting methods significantly in noisy scenarios.

4.2 Results on PCPNet

PCPNet is a point cloud normal estimation dataset comprising synthetic shapes and 3D scanned objects. The training set contains eight point clouds, and the test set consists of nineteen. Following [25, 24, 49], SNEtransformer is trained on point clouds with various levels of Gaussian noise—none, low, medium, and high—and is evaluated against point clouds with different Gaussian noise levels, as well as two additional settings where the point density is inconsistent. The quantitative evaluation results, presented in Table 1, demonstrate that SNEtransformer outperforms existing methods in almost all scenarios. Figure 5 displays the PGP curves under all noise conditions, and a visual comparison of the normal prediction error output by SNEtransformer and existing methods is shown in Figure 3. It is evident that our method produces more accurate normal estimations in various testing scenarios.

4.3 Results on SceneNN

SceneNN is an RGB-D scan dataset captured in various indoor settings. Following [25], we first train the SNEtransformer on the PCPNet dataset, then evaluate the trained model on SceneNN without fine-tuning to explore the model’s scalability. Due to sensor errors, the data naturally contains noise, presenting a good opportunity to test the model’s noise agnosticism. We use the same evaluation settings as [25] and report the numerical results in Table 1 and the visual results in Figure 3. Table 1 shows that SNEtransformer generalizes well to real-world data and outperforms previous methods on both original and extra-noise settings. Figure 5 presents the PGP curves under original and extra-noise conditions, demonstrating that our method produces high-quality estimations for indoor scanning.

4.4 Results on Semantic3D

We visualize the normal estimation results on the outdoor scanning dataset Semantic3D in Figure 4, despite the absence of ground truth data for normals. It is apparent that our method preserves finer details such as carved patterns on doors, grooves between bricks, and letters on buildings—details that other methods tend to oversmooth. This suggests that our method also provides higher-quality normal estimation in outdoor scanning scenarios.

4.5 Application to Surface Reconstruction Task

We apply Poisson reconstruction [17] on PCPNet point clouds with the per-point normal predicted by SNEtransformer. Figure 6 shows that SNEtransformer helps recover finer details in areas with complex local geometry like the hand of the Statue of Liberty. Notably, our method enhances geometry reconstruction when the input is noisy.

4.6 Ablation Studies

Graph Convolution and Transformer.

To validate the effectiveness of Graph Convolution and the Transformer, we conduct ablation studies on them and report the results in Table 2. We observed that the performance of normal estimation degrades when either Graph Convolution or the Transformer is removed from the network. However, the accuracy of normal estimation degrades even further when the Transformer architecture is removed, which demonstrates its significant effectiveness.

Global Attention or Local Attention.

To demonstrate the effectiveness of global attention, we compare its performance with that of local attention. To implement local attention, we first run the $k$ -nearest neighbors algorithm at each point in the point cloud patch and then apply attention only within the set of nearest neighbor points. The results in Table 2 show that applying attention on a global scale improves the results and leads to noise-agnostic behavior. This validates our assumption that global attention allows the network to attend to any points it deems helpful for the estimation tasks, thereby increasing resilience to noise.

Noise level	PCPNet Dataset
	Ours	Ours w/o	Ours w/o	Ours with
	Ours	Transformer	GC	local attention
None	3.99	5.95 (+1.95)	5.05 (+1.05)	4.61 (+0.61)
$\sigma$ =0.12%	8.97	9.77 (+0.80)	9.53 (+0.56)	9.30 (+0.33)
$\sigma$ =0.6%	15.85	17.98 (+2.13)	17.01 (+1.16)	17.20 (+1.35)
$\sigma$ =1.2%	20.98	22.56 (+1.57)	22.28 (+1.29)	22.14 (+1.15)
Density (stripes)	4.81	6.67 (+1.85)	6.08 (+1.26)	5.52 (+0.70)
Density (gradients)	4.67	6.14 (+1.47)	5.74 (+1.07)	5.24 (+0.57)
Average	9.88	11.51 (+1.63)	10.94 (+1.06)	10.66 (+0.78)

Table 2: Ablation experiments reveal the effectiveness of the Transformer, Graph Convolution, and global attention.

	PCPNet Dataset
	Noise $\sigma$				Density
	None	0.12%	0.6%	1.2%	Stripes	Gradient
$xyz$ + $\Delta_{xyz}$ + $\mathbf{f}$ + $\Delta_{\mathbf{f}}$	3.99	8.97	15.85	20.98	4.81	4.67
$xyz$ + $\Delta_{xyz}$ + $\mathbf{f}$	4.74	9.23	16.24	21.76	5.85	5.45
$xyz$ + $\mathbf{f}$ + $\Delta_{\mathbf{f}}$	4.90	9.34	16.55	22.43	5.63	5.40

Table 3: Ablation study on input features for Graph Convolution.

\Delta_{xyz}

represents the difference in coordinates between a neighbor and the query point, while

\Delta_{\mathbf{f}}

indicates the difference in features between a neighbor and the query point.

Ablation on Features for Graph Convolution.

We have explored various features for inclusion in Graph Convolution, and the results are listed in Table 3. In summary, there are four main features: the 3D coordinates of the query point and its neighboring points (denoted as $xyz$ ), the difference in 3D coordinates between the query point and its neighbors (denoted as $\Delta_{xyz}$ ), the features of the query point and its neighboring points (denoted as $\mathbf{f}$ ), and the difference in features between the query point and the neighbors (denoted as $\Delta_{\mathbf{f}}$ ). We conclude that both $\Delta_{xyz}$ and $\Delta_{\mathbf{f}}$ contribute to better estimation results.

5 Conclusion

In this paper, we introduce the SNEtransformer, a Transformer-based model that accurately predicts surface normals. We demonstrate that a straightforward combination of Graph Convolution with a Transformer is sufficient to achieve state-of-the-art performance, without the need for hand-designed modules. Our model unifies existing approaches and is proven to be noise-agnostic, as evidenced by extensive experiments on both indoor and outdoor datasets. Lastly, we showcase the potential of our method in various downstream applications.

Acknowledgements

This work was partially supported by JSPS KAKENHI Grant JP21H04907, and by JST CREST Grants JPMJCR18A6 and JPMJCR20D3, Japan.

References

Alexa et al. [2001] Marc Alexa, Johannes Behr, Daniel Cohen-Or, Shachar Fleishman, David Levin, and Claudio T Silva. Point set surfaces. In Proceedings Visualization, 2001. VIS’01., pages 21–29. IEEE, 2001.
Avron et al. [2010] Haim Avron, Andrei Sharf, Chen Greif, and Daniel Cohen-Or. L1-sparse reconstruction of sharp point set surfaces. ACM Transactions on Graphics, 29(5):1–12, 2010.
Ben-Shabat and Gould [2020] Yizhak Ben-Shabat and Stephen Gould. DeepFit: 3D surface fitting via neural network weighted least squares. In European Conference on Computer Vision, pages 20–34. Springer, 2020.
Ben-Shabat et al. [2019] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. Nesti-Net: Normal estimation for unstructured 3D point clouds using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10112–10120, 2019.
Blinn [1978] James F Blinn. Simulation of wrinkled surfaces. ACM SIGGRAPH Computer Graphics, 12(3):286–292, 1978.
Boulch and Marlet [2012] Alexandre Boulch and Renaud Marlet. Fast and robust normal estimation for point clouds with sharp features. In Computer Graphics Forum, pages 1765–1774. Wiley Online Library, 2012.
Boulch and Marlet [2016] Alexandre Boulch and Renaud Marlet. Deep learning for robust normal estimation in unstructured point clouds. In Computer Graphics Forum, pages 281–290. Wiley Online Library, 2016.
Cazals and Pouget [2005a] F. Cazals and M. Pouget. Estimating differential quantities using polynomial fitting of osculating jets. Computer Aided Geometric Design, 22(2):121–146, 2005a.
Cazals and Pouget [2005b] Frédéric Cazals and Marc Pouget. Estimating differential quantities using polynomial fitting of osculating jets. Computer Aided Geometric Design, 22(2):121–146, 2005b.
Chen et al. [2022] Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, and Angel X. Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding, 2022.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
Fleishman et al. [2005] Shachar Fleishman, Daniel Cohen-Or, and Cláudio T. Silva. Robust moving least-squares fitting with sharp features. 24(3), 2005.
Gouraud [1971] Henri Gouraud. Continuous shading of curved surfaces. IEEE Transactions on Computers, 100(6):623–629, 1971.
Guennebaud and Gross [2007] Gaël Guennebaud and Markus Gross. Algebraic point set surfaces. In ACM SIGGRAPH 2007 papers. 2007.
Guerrero et al. [2018] Paul Guerrero, Yanir Kleiman, Maks Ovsjanikov, and Niloy J Mitra. PCPNet: learning local shape properties from raw point clouds. In Computer Graphics Forum, pages 75–85. Wiley Online Library, 2018.
Hackel et al. [2017] Timo Hackel, N. Savinov, L. Ladicky, Jan D. Wegner, K. Schindler, and M. Pollefeys. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pages 91–98, 2017.
Hoppe et al. [1992] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, pages 71–78, 1992.
Hua et al. [2016] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), 2016.
Huang et al. [2009] Hui Huang, Dan Li, Hao Zhang, Uri Ascher, and Daniel Cohen-Or. Consolidation of unorganized point clouds for surface reconstruction. ACM Transactions on Graphics, 28(5):1–7, 2009.
Jacobs et al. [1991] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
Kazhdan et al. [2006] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics Symposium on Geometry Processing, 2006.
Lange and Polthier [2005] Carsten Lange and Konrad Polthier. Anisotropic smoothing of point sets. Computer Aided Geometric Design, 22(7):680–692, 2005.
Lenssen et al. [2020] Jan Eric Lenssen, Christian Osendorfer, and Jonathan Masci. Deep iterative surface normal estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11247–11256, 2020.
Li et al. [2022a] Keqiang Li, Mingyang Zhao, Huaiyu Wu, Dong-Ming Yan, Zhen Shen, Fei-Yue Wang, and Gang Xiong. Graphfit: Learning multi-scale graph-convolutional representation for point cloud normal estimation. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, page 651–667, Berlin, Heidelberg, 2022a. Springer-Verlag.
Li et al. [2022b] Qing Li, Yu-Shen Liu, Jin-San Cheng, Cheng Wang, Yi Fang, and Zhizhong Han. Hsurf-net: Normal estimation for 3d point clouds by learning hyper surfaces. In Advances in Neural Information Processing Systems, pages 4218–4230. Curran Associates, Inc., 2022b.
Li et al. [2023] Shujuan Li, Junsheng Zhou, Baorui Ma, Yu-Shen Liu, and Zhizhong Han. Neaf: Learning neural angle fields for point normal estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):1396–1404, 2023.
Lu et al. [2020a] Dening Lu, Xuequan Lu, Yangxing Sun, and Jun Wang. Deep feature-preserving normal estimation for point cloud filtering. Computer-Aided Design, 125:102860, 2020a.
Lu et al. [2020b] Xuequan Lu, Scott Schaefer, Jun Luo, Lizhuang Ma, and Ying He. Low rank matrix approximation for 3D geometry filtering. IEEE Transactions on Visualization and Computer Graphics, 2020b.
Misra et al. [2021] Ishan Misra, Rohit Girdhar, and Armand Joulin. An End-to-End Transformer Model for 3D Object Detection. In ICCV, 2021.
Mitra and Nguyen [2003] Niloy J Mitra and An Nguyen. Estimating surface normals in noisy point cloud data. In Proceedings of the Nineteenth Annual Symposium on Computational Geometry, pages 322–328, 2003.
Pauly et al. [2002] Mark Pauly, Markus Gross, and Leif P Kobbelt. Efficient simplification of point-sampled surfaces. In IEEE Visualization, 2002. VIS 2002., pages 163–170. IEEE, 2002.
Phong [1975] Bui Tuong Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):311–317, 1975.
Qi et al. [2017a] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017a.
Qi et al. [2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30:5099–5108, 2017b.
Redmon et al. [2015] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection, 2015. cite arxiv:1506.02640.
Shit et al. [2022] Suprosanna Shit, Rajat Koner, Bastian Wittmann, Johannes Paetzold, Ivan Ezhov, Hongwei Li, Jiazhen Pan, Sahand Sharifzadeh, Georgios Kaissis, Volker Tresp, and Bjoern Menze. Relationformer: A unified framework for image-to-graph generation, 2022.
Stewart [1993] Gilbert W Stewart. On the early history of the singular value decomposition. SIAM Review, 35(4):551–566, 1993.
Sun et al. [2015] Yujing Sun, Scott Schaefer, and Wenping Wang. Denoising point sets via l0 minimization. Computer Aided Geometric Design, 35:2–15, 2015.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Wang et al. [2018] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. CoRR, abs/1801.07829, 2018.
Yu et al. [2021] Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, and Jie Zhou. Pointr: Diverse point cloud completion with geometry-aware transformers, 2021.
Zhang et al. [2022a] Jie Zhang, Jun-Jie Cao, Hai-Rui Zhu, Dong-Ming Yan, and Xiu-Ping Liu. Geometry guided deep surface normal estimation. Computer-Aided Design, 142:103119, 2022a.
Zhang et al. [2022b] Jie Zhang, Jun-Jie Cao, Hai-Rui Zhu, Dong-Ming Yan, and Xiu-Ping Liu. Geometry guided deep surface normal estimation. Computer-Aided Design, 142:103119, 2022b.
Zhao et al. [2021a] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021a.
Zhao et al. [2021b] Lichen Zhao, Daigang Cai, Lu Sheng, and Dong Xu. 3DVG-Transformer: Relation modeling for visual grounding on point clouds. In ICCV, pages 2928–2937, 2021b.
Zhou et al. [2022] Haoran Zhou, Honghua Chen, Yingkui Zhang, Mingqiang Wei, Haoran Xie, Jun Wang, Tong Lu, Jing Qin, and Xiao-Ping Zhang. Refine-Net: Normal refinement neural network for noisy point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Zhou et al. [2020] Jun Zhou, Hua Huang, Bin Liu, and Xiuping Liu. Normal estimation for 3D point clouds via local plane constraint and multi-scale selection. Computer-Aided Design, 129:102916, 2020.
Zhou et al. [2021] Jun Zhou, Wei Jin, Mingjie Wang, Xiuping Liu, Zhiyang Li, and Zhaobin Liu. Improvement of normal estimation for point clouds via simplifying surface fitting. arXiv preprint arXiv:2104.10369, 2021.
Zhu et al. [2021a] Runsong Zhu, Yuan Liu, Zhen Dong, Yuan Wang, Tengping Jiang, Wenping Wang, and Bisheng Yang. AdaFit: Rethinking learning-based normal estimation on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6118–6127, 2021a.
Zhu et al. [2021b] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection, 2021b.