License: arXiv.org perpetual non-exclusive license
arXiv:2401.05745v1 [cs.CV] 11 Jan 2024

Surface Normal Estimation with Transformers

Barry Shichen Hu$^{1*}$ Siyun Liang$^{1*}$ Johannes Paetzold$^{1}$
Huy H. Nguyen$^{2}$ Isao Echizen$^{2,3}$ Jiapeng Tang$^{1}$
$^{1}$Technical University of Munich
$^{2}$National Institute of Informatics, Japan $^{3}$University of Tokyo, Japan
Abstract

We propose the use of a Transformer to accurately predict normals from point clouds with noise and density variations. Previous learning-based methods utilize PointNet variants to explicitly extract multi-scale features at different input scales, then focus on a surface fitting method by which local point cloud neighborhoods are fitted to a geometric surface approximated by either a polynomial function or a multi-layer perceptron (MLP). However, fitting surfaces to fixed-order polynomial functions can suffer from overfitting or underfitting, and learning MLP-represented hyper-surfaces requires pre-generated per-point weights. To avoid these limitations, we first unify the design choices in previous works and then propose a simplified Transformer-based model to extract richer and more robust geometric features for the surface normal estimation task. Through extensive experiments, we demonstrate that our Transformer-based method achieves state-of-the-art performance on both the synthetic shape dataset PCPNet and the real-world indoor scene dataset SceneNN, exhibiting more noise-resilient behavior and significantly faster inference. Most importantly, we demonstrate that the sophisticated hand-designed modules in existing works are not necessary to excel at the task of surface normal estimation. The code, data, and pre-trained models are publicly available at https://anonymous.4open.science/r/E34CYRW-17E7.

Figure 1: We unify and simplify existing learning-based methods for surface normal estimation by proposing a straightforward Transformer-based model that directly predicts normals without relying on surface fitting. Our greatly simplified method not only achieves state-of-the-art performance but also exhibits significantly faster inference speed than previous works. In the figure, we present the simplified pipelines of existing works for comparison, and visualize the prediction error using a heat map. Inference times are recorded as well.
$^{*}$Equal contribution.

1 Introduction

Estimating surface normals of point clouds is a fundamental problem in 3D computer vision that has a wide variety of downstream applications, such as point cloud denoising [27, 2, 38, 28], rendering [5, 13, 32], and reconstruction [21, 12]. While a significant amount of research has been dedicated to this topic, the accurate prediction of point cloud normals amid various types of noise, missing structures, and density variations remains a persistent challenge.

Existing methods address the surface normal estimation problem through either traditional surface fitting methods or more recent learning-based approaches. Traditional methods involve fitting planes or polynomials to a local neighborhood and then computing the normal from the estimated surface [17, 1, 19, 22, 37]. However, explicit surface fitting is sensitive to noise and outliers. Furthermore, it heavily relies on hand-tuned parameters, such as the order of the polynomial function, which can lead to underfitting or overfitting [30, 31, 6, 14]. In contrast, earlier learning-based methods such as [7, 15, 4, 47, 46] apply neural networks to directly regress the surface normal, thus bypassing explicit surface fitting and its associated challenges.

However, recent learning-based methods [3, 49, 23, 24, 43] have renewed interest in the traditional surface fitting paradigm, demonstrating that integrating a neural network into this conventional approach leads to superior performance compared to direct regression. These methods first use a neural network, such as the PointNet family [33, 34], to learn point-wise weights of a neighborhood and then apply a classic geometric surface fitting algorithm, like n-jet fitting, to compute normals [8]. Following the idea of surface fitting, [25] proposes hyper-surface fitting by learning a set of MLP layers whose parameters interpret the geometric structures of a hyper-surface. While avoiding the model fitting problem associated with surface fitting methods, [25] relies on a set of handcrafted per-point weights that may not accurately reflect the true contribution of points to the surface fitting. To address these issues, [26] learns an angular field that points toward the ground truth normal, instead of directly predicting the surface normal. This method, however, requires extensive sampling and time-consuming optimization during testing. To mitigate the pitfalls of existing methods, we take a step back and ask the challenging question:

Can rich geometric features be extracted directly from raw point clouds for normal estimation without relying on any handcrafted features or hand-designed modules?

To address this question, we first analyze current learning-based methods for surface normal estimation and discover that, despite variations in network design, the fundamental design choices in existing works are centered around Graph Convolution, which preserves locality, and multi-scale feature fusion, which aggregates geometric features from larger to smaller scales. Therefore, in this work, we continue to use Graph Convolution for local neighborhood aggregation and explore the optimal features for the Graph operation. Additionally, we propose using a Transformer as an alternative for multi-scale feature extraction, contending that the Transformer can extract richer multi-scale features due to its superior capacity for modeling relationships and its expansive receptive field.

As a result, we propose SNEtransformer, a simplified and unified Transformer-based backbone that learns directly from point clouds for normal estimation. Experiments on synthetic and RGB-D scan datasets demonstrate that our backbone not only achieves state-of-the-art performance but also proves to be faster in inference and more resilient to noise compared to existing methods. In summary, our main contributions are:

  • We unify previous learning-based methods and propose the first Transformer-based model for end-to-end normal estimation without additional surface fitting steps.

  • We demonstrate that our method achieves state-of-the-art accuracy and inference speed, showing greater resilience to noise in both synthetic and real-world scan datasets.

  • Through comprehensive ablation studies, we identify the best design decisions that lead to increased accuracy.

2 Related Work

Learning-based Direct Regression Methods.

Initial methods have been proposed to directly regress normal vectors from raw point clouds using neural networks. PCPNet [15] applies the PointNet architecture [33] in multi-scale neighborhoods to extract geometric features based on which normals and curvatures of point clouds are estimated. Nesti-Net [4] follows a structure similar to PCPNet but proposes training multiple backbones on neighborhoods of different sizes, then uses a mixture-of-experts architecture [20] to select the optimal backbone to predict the normal. Refine-Net [46] follows a two-stage design where it first computes an initial normal estimate and then deploys a deep neural network for refinement. Despite the simplicity of existing direct regression methods, they have demonstrated weaker performance in normal estimation tasks.

Learning-based Surface Fitting Methods.

Recent methods combine traditional surface fitting techniques with deep learning to achieve higher accuracy. DeepFit [3] and AdaFit [49] both use PointNet-based models [33] to predict point-wise weights in a local patch and then apply n-jet fitting to estimate the normal [8]. However, they suffer from underfitting or overfitting due to the fixed order of the polynomial function. Hsurf [25] explores geometric priors in high-dimensional space but requires training with hand-crafted per-point weights. NeAF [26] learns an angular field and applies extensive sampling and test time optimization to obtain surface normal vectors. Each of the aforementioned methods features hand-designed modules and comes with its own limitations. Instead, we propose a simpler yet more effective architecture that directly predicts surface normals from raw point clouds.

Transformers in 3D Applications.

Transformers have gained increasing popularity since their introduction [39]. Though they were originally introduced as a language model, they have proven to be effective in computer vision tasks [11, 35, 10, 45, 50]. Furthermore, Transformers serve as robust backbones for 3D applications as well. Zhao et al. [44] proposed the use of a Transformer for point cloud classification and segmentation tasks. Misra et al. [29] applied a Transformer to 3D object detection without relying on predetermined query points. Yu et al. [41] achieved state-of-the-art performance in 3D point cloud completion, while Shit et al. [36] utilized the Transformer’s relational modeling power in 3D graph generation. Although Transformers have been shown to be effective for 3D tasks, no previous work has applied Transformers to surface normal estimation tasks. Therefore, we decided to explore this area by experimenting with the straightforward use of Transformers for point cloud normal estimation tasks.

3 Method

3.1 Preliminaries

Surface Normal Estimation.

Given a local point set centered at a query point $\mathbf{p}$ as $\mathcal{P}=\{\mathbf{p}_{i}\mid i=1,\ldots,m\}$, the learning objective is to estimate the unoriented normal $\mathbf{n}_{\mathbf{p}}$ of the point $\mathbf{p}$.

Graph Convolution.

Graph Convolution learns local geometric structures of a point cloud by first constructing a local neighborhood graph through the $k$-nearest neighbor algorithm centered at a query point [40]. We represent the resulting point cloud patch as coordinates $\mathcal{X}=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{k}\}\subseteq\mathbb{R}^{3}$, their features $\mathcal{F}=\{\mathbf{f}_{1},\ldots,\mathbf{f}_{k}\}\subseteq\mathbb{R}^{F}$, and the graph as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,\ldots,k\}$ and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ are the nodes and edges, respectively.

Then, one convolution step calculates and aggregates the edge features, as illustrated in Figure 2. The convolution operation is:

$$\mathbf{f}_{i}^{\prime}=\square_{j:(i,j)\in\mathcal{E}}\, h_{\Theta}(\mathbf{f}_{i},\mathbf{f}_{j}) \tag{1}$$

where $\mathbf{f}_{i}$ is the feature vector of the point $\mathbf{x}_{i}$, and $\{\mathbf{f}_{j}:(i,j)\in\mathcal{E}\}$ are features of the nearest neighbors of $\mathbf{x}_{i}$. The function $h_{\Theta}:\mathbb{R}^{F}\times\mathbb{R}^{F}\rightarrow\mathbb{R}^{F^{\prime}}$ is a learnable function parameterized by $\Theta$ that extracts edge features, and $\square$ is a symmetric aggregation function.
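As a rough illustration of Eq. (1), the following NumPy sketch builds a $k$-nearest-neighbor graph and applies one convolution step. It is not the authors' implementation: the helper names `knn_indices` and `edge_conv` are hypothetical, $h_{\Theta}$ is assumed to be a single linear layer with ReLU applied to the concatenation $[\mathbf{f}_i, \mathbf{f}_j]$, and the symmetric aggregation $\square$ is assumed to be a channel-wise maximum.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (the point itself excluded)."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff * diff, axis=-1)          # (n, n) squared distances
    np.fill_diagonal(dist2, np.inf)               # never pick a point as its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]

def edge_conv(features, neighbors, weight, bias):
    """One step of Eq. (1); h_Theta is assumed to be ReLU(W [f_i, f_j] + b)."""
    k = neighbors.shape[1]
    f_i = np.repeat(features[:, None, :], k, axis=1)       # (n, k, F)
    f_j = features[neighbors]                              # (n, k, F)
    edge_in = np.concatenate([f_i, f_j], axis=-1)          # (n, k, 2F)
    edge_feat = np.maximum(edge_in @ weight + bias, 0.0)   # (n, k, F')
    return edge_feat.max(axis=1)                           # symmetric aggregation over edges

rng = np.random.default_rng(0)
points = rng.normal(size=(32, 3))
features = rng.normal(size=(32, 8))
neighbors = knn_indices(points, k=4)
weight = rng.normal(size=(16, 12)) * 0.1   # illustrative random weights, not trained
bias = np.zeros(12)
new_features = edge_conv(features, neighbors, weight, bias)
```

Because the maximum is symmetric, the result does not depend on the order in which neighbors are listed, which is the property Eq. (1) requires of $\square$.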

Cascaded Scale Aggregation.

To explicitly extract multi-scale features from a point cloud patch, Zhu et al. [49] proposed the Cascaded Scale Aggregation (CSA), which was later adopted by subsequent studies [24, 25]. Essentially, CSA utilizes features from a larger scale to assist feature extraction at a smaller scale. For a point $\mathbf{p}$, the scale $s$ is defined as the size of the nearest neighbor point set $\mathcal{N}_{s}(\mathbf{p})$. A CSA layer considers two scales, $s_{k}$ and $s_{k+1}$, where $s_{k+1}<s_{k}$ and $\mathcal{N}_{s_{k+1}}\subseteq\mathcal{N}_{s_{k}}$. For a point $\mathbf{p}$, we integrate its feature $\mathbf{f}_{k}$ at scale $s_{k}$ into its feature aggregation at scale $s_{k+1}$, as follows:

$$\mathbf{f}_{k+1}=\phi_{k}\left(\varphi_{k}\left(\operatorname{MaxPool}\left\{\mathbf{f}_{k,j}\mid\mathbf{p}_{j}\in\mathcal{N}_{s_{k}}\right\}\right),\mathbf{f}_{k}\right) \tag{2}$$

where both $\phi_{k}$ and $\varphi_{k}$ are Multi-layer Perceptrons (MLPs). The motivation is that the larger scale provides information about broader surfaces, while the smaller scale includes more detailed local features for surface fitting [49]. In subsequent sections, we demonstrate that the Transformer is a superior method for multi-scale feature extraction compared to CSA.
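To make the data flow of Eq. (2) concrete, here is a minimal NumPy sketch of a single CSA step. It rests on several assumptions that are not stated in the text: the patch features are sorted by distance to the query point so that the first $s_{k+1}$ rows form the smaller neighborhood, each of $\phi_k$ and $\varphi_k$ is reduced to one linear layer with ReLU, and the two arguments of $\phi_k$ are combined by concatenation. The names `csa_layer` and `linear_relu` are hypothetical.

```python
import numpy as np

def linear_relu(x, w, b):
    """Stand-in for an MLP: a single linear layer followed by ReLU."""
    return np.maximum(x @ w + b, 0.0)

def csa_layer(feats_k, s_next, w_inner, b_inner, w_outer, b_outer):
    """Fuse the pooled feature of the larger scale s_k into the smaller scale s_{k+1}.

    feats_k: (s_k, F) per-point features, assumed sorted by distance to the query point.
    """
    pooled = feats_k.max(axis=0)                         # MaxPool over N_{s_k}, shape (F,)
    context = linear_relu(pooled, w_inner, b_inner)      # inner MLP (varphi_k), shape (G,)
    local = feats_k[:s_next]                             # features of N_{s_{k+1}}, (s_next, F)
    tiled = np.broadcast_to(context, (s_next, context.shape[0]))
    fused = np.concatenate([tiled, local], axis=-1)      # (s_next, G + F)
    return linear_relu(fused, w_outer, b_outer)          # outer MLP (phi_k)

rng = np.random.default_rng(0)
F, G, F_out = 8, 6, 10
feats_k = rng.normal(size=(64, F))
out = csa_layer(
    feats_k, 32,
    rng.normal(size=(F, G)), np.zeros(G),
    rng.normal(size=(G + F, F_out)), np.zeros(F_out),
)
```

The output keeps only the points of the smaller neighborhood, each enriched with one shared descriptor of the larger one.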

Attention Mechanism.

Attention models the relations between inputs. The inputs consist of queries, keys, and values, where the queries and keys are of the same dimension $d_{k}$, and values are of dimension $d_{v}$. To compute the attention scores, [39] computes the dot products of all queries with all keys, divided by $\sqrt{d_{k}}$, and then applies a softmax function. The resulting mathematical formulation is:

$$\operatorname{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\right)\mathbf{V} \tag{3}$$

where $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are the query, key, and value matrices [39]. Self-attention simply means that $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are derived from the same input features, and the Transformer Encoder architecture utilizes the self-attention mechanism to extract the relations between the inputs.
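Eq. (3) can be written in a few lines of NumPy. The sketch below is single-head and unbatched; the projection matrices `w_q`, `w_k`, and `w_v` are illustrative random values introduced here to show how self-attention derives all three inputs from the same features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Eq. (3): softmax(Q K^T / sqrt(d_k)) V, plus the attention weights."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # (n_queries, n_keys)
    return weights @ v, weights

def self_attention(x, w_q, w_k, w_v):
    """Self-attention: Q, K, and V are all projections of the same input x."""
    return scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                              # 5 points, 16-dim features
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to one, so every output feature is a convex combination of the value vectors of all input points.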

We apply Transformer Encoder layers to extract multi-scale geometry for two reasons. First, the attention mechanism ‘attends’ to points in both smaller and larger neighborhoods, thus naturally extracting multi-scale features. Second, attention models the contribution of each input point to the calculation of the normal vector, and such contributions are modeled by the attention scores. This serves as a denoising mechanism, particularly useful in the case of noisy input where the attention learns to weigh the importance of input points instead of indiscriminately favoring smaller neighborhoods, as CSA does. Therefore, we hypothesize that Transformers lead to more noise-agnostic behavior. This hypothesis is verified in Section 4, and Figure 3 visualizes the weights predicted by CSA and Transformer.

Figure 2: Graph Convolution preserves locality, while the Transformer Encoder extracts multi-scale features. The global attention map assigns larger weights to ‘more reliable’ points and smaller weights to ‘unreliable’ ones, thereby functioning as a denoising mechanism.

3.2 Analysis of Alternative Methods

We analyze recent state-of-the-art models to identify design patterns that lead to improved surface normal estimation performance. The architectures of HSurf-Net [25] and GraphFit [24] are shown in Figure 1.

AdaFit.

The backbone of AdaFit is based on the Cascaded Scale Aggregation (CSA) module. It employs a series of CSA and MLP layers to aggregate geometric features from broader neighborhoods down to the smallest neighborhood [49]. Then, it deploys two MLP heads to predict the point-wise offsets and weights. The point-wise offsets are used to denoise the input point cloud, and then the weights are applied for n-jet fitting on the denoised point patch to predict the normal vector at the query point [49].

GraphFit.

GraphFit consists of a series of CSA-based Graph Convolution and adaptive layers [24]. Given the feature vector set $\mathcal{F}=\left\{\mathbf{f}_{i}\mid i=1,2,\ldots,m\right\}$ of a local point set $\mathcal{P}$, the Graph Convolution is formulated as follows:

$$\mathbf{f}_{i}^{\prime}=\max_{j\in\mathcal{N}(i)}\phi\left(\left[\mathbf{f}_{j}-\mathbf{f}_{i},\mathbf{f}_{i}\right]\right) \tag{4}$$

where $\mathbf{f}_{i}$ is the feature at point $\mathbf{p}_{i}$, and $\{\mathbf{f}_{j}\mid j\in\mathcal{N}(i)\}$ is the set of features of point $\mathbf{p}_{i}$'s neighbors. $[\cdot,\cdot]$ is the concatenation operation and $\phi$ represents an MLP [24]. Then, CSA is used to include features obtained from the current Graph Convolution into the next level. Following one block of Graph Convolutions and CSA, which outputs the aggregated feature set $\mathcal{F}^{\prime}=\left\{\mathbf{f}_{i}^{\prime}\mid i=1,2,\ldots,m\right\}$, GraphFit adaptively updates the per-point feature as follows:

$$\overline{\mathcal{F}}=\mathbf{s}_{\mathcal{F}}\odot\mathcal{F}+\left(1-\mathbf{s}_{\mathcal{F}}\right)\odot\mathcal{F}^{\prime} \tag{5}$$

where $\mathbf{s}_{\mathcal{F}}$ represents element-wise weights predicted from $\mathcal{F}$ and $\mathcal{F}^{\prime}$ by an MLP. Similar to AdaFit, extracted features are used to predict the per-point offsets and weights for n-jet surface fitting [24].
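The adaptive update of Eq. (5) is an element-wise gate between the old and the newly aggregated features. The NumPy sketch below shows one way it could look; the text does not specify how the MLP that predicts the weights is built, so a single linear layer over the concatenated features followed by a sigmoid (which keeps the weights in $(0, 1)$) is assumed here, and `adaptive_update` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_update(feats, feats_new, w_gate, b_gate):
    """Eq. (5): blend old and aggregated features with predicted element-wise weights."""
    gate_in = np.concatenate([feats, feats_new], axis=-1)   # (m, 2F)
    s = sigmoid(gate_in @ w_gate + b_gate)                  # assumed gate, values in (0, 1)
    return s * feats + (1.0 - s) * feats_new, s

rng = np.random.default_rng(0)
m, F = 20, 8
feats = rng.normal(size=(m, F))         # F in the text
feats_new = rng.normal(size=(m, F))     # F' in the text
w_gate = rng.normal(size=(2 * F, F)) * 0.1
blended, s = adaptive_update(feats, feats_new, w_gate, np.zeros(F))
```

With weights in $(0, 1)$, every blended entry lies between the corresponding entries of the two inputs.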

HSurf-Net.

Like [24, 49], HSurf-Net utilizes a Graph Convolution-based operation to extract local features and a CSA variant to fuse features from larger to smaller scales [25]. However, to prevent overfitting or underfitting due to the fixed order of the polynomial function, HSurf-Net proposes the use of an MLP to represent a hypersurface. To conduct hypersurface fitting, HSurf-Net predicts a set of per-point weights and multiplies these weights element-wise with the extracted point features. The resulting feature set is then fed into a block of MLP and pooling layers to predict the normal. To guide the model in predicting the correct weight for each point, HSurf-Net uses pre-generated target weights following the method in [42].

Summary on Common Design Patterns.

Despite differences in surface fitting and specific hand-crafted network modules, the common designs of existing methods include Graph Convolution and multi-scale feature fusion with CSA. Thus, we propose a simple Transformer-based backbone that unifies existing works.

3.3 Proposed Architecture

Our backbone consists of multiple layers of enhanced Graph Convolution [40] and Transformer Encoder [39], as illustrated in Fig. 2. At each layer, we first update each point’s feature through a Graph Convolution operation. Then, the point features are directly fed as input to a Transformer Encoder layer.

3.3.1 Enhanced Graph Convolution

Inspired by the relative position encoding discussed in [25] and the graph convolution presented in [24], we propose an enhanced convolutional approach for feature aggregation within a local neighborhood. Consider a local point cloud obtained by executing the $k$-nearest neighbors algorithm centered at a point with coordinates $\mathbf{x}_{c}$, resulting in a graph-structured point set represented by Cartesian coordinates $\left\{\mathbf{x}_{i}\mid i=1,2,\ldots,k\right\}\in\mathbb{R}^{k\times 3}$ and their corresponding features $\left\{\mathbf{f}_{i}\mid i=1,2,\ldots,k\right\}\in\mathbb{R}^{k\times F}$. To aggregate the features from the local neighborhood to $\mathbf{x}_{c}$, we first construct the edge features between $\mathbf{x}_{c}$ and $\mathbf{x}_{j}$ as follows:

$$\mathbf{e}_{cj}=\phi\left([\mathbf{x}_{j}-\mathbf{x}_{c},\mathbf{x}_{c},\mathbf{x}_{j},\mathbf{f}_{j},\mathbf{f}_{j}-\mathbf{f}_{c}]\right),\quad j\in\mathcal{N}(c) \tag{6}$$

where $[\cdot,\cdot]$ represents the concatenation operation, and $\phi$ is implemented as an MLP. We then output the local neighborhood information for $\mathbf{x}_{c}$:

$$\mathbf{f}_{c}^{\prime}=\max_{j\in\mathcal{N}(c)}\mathbf{e}_{cj} \tag{7}$$

Our graph convolution operation not only aggregates local features to preserve locality, but also encodes positional information and edge features for use in the subsequent Transformer Encoder layer.
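The edge construction of Eq. (6) and the max aggregation of Eq. (7) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the released code: $\phi$ is reduced to one linear layer with ReLU, the neighborhood $\mathcal{N}(c)$ is assumed to include the point $c$ itself, and the function name and weight shapes are hypothetical.

```python
import numpy as np

def enhanced_graph_conv(coords, feats, k, weight, bias):
    """Eqs. (6)-(7) for every point of a patch; phi is assumed to be linear + ReLU."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = np.sum(diff * diff, axis=-1)
    neighbors = np.argsort(dist2, axis=1)[:, :k]       # assumption: includes the point itself
    x_c = np.repeat(coords[:, None, :], k, axis=1)     # (n, k, 3)
    x_j = coords[neighbors]                            # (n, k, 3)
    f_c = feats[:, None, :]                            # (n, 1, F)
    f_j = feats[neighbors]                             # (n, k, F)
    # Edge input [x_j - x_c, x_c, x_j, f_j, f_j - f_c] has 9 + 2F channels.
    edge_in = np.concatenate([x_j - x_c, x_c, x_j, f_j, f_j - f_c], axis=-1)
    edges = np.maximum(edge_in @ weight + bias, 0.0)   # e_cj, shape (n, k, F')
    return edges.max(axis=1)                           # f_c', shape (n, F')

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 3))
feats = rng.normal(size=(50, 8))
weight = rng.normal(size=(9 + 2 * 8, 32)) * 0.1        # illustrative random weights
out = enhanced_graph_conv(coords, feats, 16, weight, np.zeros(32))
```

Reordering the input points only reorders the rows of the output, so the operation is permutation-equivariant, which makes it a natural front end for a Transformer Encoder layer that treats points as an unordered set.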

3.3.2 Transformer Layer

We utilize the Transformer Encoder Layer, as proposed in [39], to extract multi-scale geometric features. The architecture is depicted in the supplementary material. Specifically, features extracted from the graph convolution are directly fed into a Transformer Encoder layer. Instead of limiting the attention operation to a local neighborhood of a point, global attention is employed among all points in the input. The experimental results, detailed in Section 4.6, demonstrate that this global attention mechanism leads to improved outcomes.

3.3.3 Loss Function

Our goal is to predict the unoriented normal vector; hence, we apply the sin loss between the predicted normal vectors of the point cloud patch, $\hat{\mathbf{n}}_{\mathbf{p}}$, and the ground truth, $\mathbf{n}_{\mathbf{p}}$:

$$L=\left\|\hat{\mathbf{n}}_{\mathbf{p}}\times\mathbf{n}_{\mathbf{p}}\right\| \tag{8}$$
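For unit vectors, the norm of the cross product equals $|\sin\theta|$, where $\theta$ is the angle between prediction and ground truth, so the loss is zero both when the two vectors agree and when they point in opposite directions. The NumPy sketch below illustrates this; normalizing the inputs and averaging over the points of a patch are assumptions made here, not details given in the text.

```python
import numpy as np

def sin_loss(pred, target, eps=1e-12):
    """Eq. (8) averaged over points: ||n_hat x n|| for (re)normalized vectors."""
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.linalg.norm(np.cross(pred, target), axis=-1).mean())

n = np.array([[0.0, 0.0, 1.0]])
tilted = np.array([[np.sin(np.pi / 6), 0.0, np.cos(np.pi / 6)]])  # 30 degrees off
loss_same = sin_loss(n, n)
loss_flipped = sin_loss(-n, n)       # opposite orientation is not penalized
loss_30deg = sin_loss(tilted, n)     # approximately sin(30 deg)
```

This sign invariance is what makes the loss suitable for unoriented normals.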

3.3.4 Comparison to Alternative Backbones

Our model architecture is greatly simplified compared to alternative methods. First, we do not rely on a surface fitting scheme like those described in [25, 49, 24, 3]; instead, we directly predict the normal vectors for the entire input point cloud patch. Second, we do not explicitly extract multi-scale features by operating the backbone at different scales of the point cloud; rather, we leverage the Transformer’s ability to model relationships and implicitly extract multi-scale features. Third, instead of using carefully hand-designed modules to extract geometric features from the input, we apply simple Graph Convolution layers and Transformer encoder layers. We demonstrate that a straightforward combination of Graph Convolution with a Transformer is sufficient to accurately predict normal vectors.

4 Results

We first explain the experimental setup, then demonstrate the quality of normal estimation on the widely used synthetic dataset PCPNet [15] and the real-world scan datasets SceneNN and Semantic3D [18, 16]. Finally, we use ablation studies to identify the design choices that enable our method to be more accurate. For additional qualitative visualizations, please refer to the supplementary material.

4.1 Experimental Setup

Data Preprocessing.

Similar to [25, 24, 3], SNEtransformer takes in a local patch of 700 points obtained by the $k$-nearest neighbors algorithm at a query point. Following [25, 24, 15], to remove unnecessary degrees of freedom, we normalize each point's coordinates by the patch radius and rotate the points into a coordinate system defined by Principal Component Analysis [25]. Given a point cloud patch, instead of only predicting the normal vector of a query point as in [25, 24, 15], SNEtransformer predicts the normal vectors of the entire query point neighborhood.
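The patch normalization described above can be sketched as follows. The details are assumptions for illustration only: the patch is centered on the query point, the radius is taken as the largest distance to the query point, and the PCA basis is obtained from the covariance of the scaled points with axes ordered by decreasing variance. The function name `normalize_patch` is hypothetical.

```python
import numpy as np

def normalize_patch(patch, query):
    """Center on the query point, scale by the patch radius, rotate into a PCA frame."""
    centered = patch - query
    radius = np.linalg.norm(centered, axis=1).max()
    scaled = centered / radius
    cov = np.cov(scaled, rowvar=False)            # (3, 3) covariance about the mean
    _, eigvecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    basis = eigvecs[:, ::-1].copy()               # columns ordered by decreasing variance
    if np.linalg.det(basis) < 0:                  # keep a proper rotation, not a reflection
        basis[:, -1] *= -1.0
    return scaled @ basis, basis, radius

rng = np.random.default_rng(0)
patch = rng.normal(size=(700, 3)) * np.array([3.0, 1.0, 0.1])   # synthetic, roughly planar
local, basis, radius = normalize_patch(patch, patch[0])
```

Since `basis` is orthogonal, a normal predicted in the local frame maps back to the original frame as `n_local @ basis.T`; no rescaling is needed because the scaling is uniform.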

Evaluation Metrics.

We adopt the angular Root Mean Squared Error (RMSE) between the predicted normal and the ground truth to evaluate the estimation results [15]. Suppose a point cloud 𝒫𝒫\mathcal{P}caligraphic_P ’s predicted normal set is 𝒩^(𝒫)={𝐧^i3}i=1m^𝒩𝒫superscriptsubscriptsubscript^𝐧𝑖superscript3𝑖1𝑚\hat{\mathcal{N}}(\mathcal{P})=\left\{\hat{\mathbf{n}}_{i}\in\mathbb{R}^{3}% \right\}_{i=1}^{m}over^ start_ARG caligraphic_N end_ARG ( caligraphic_P ) = { over^ start_ARG bold_n end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, and ground truth normal set is 𝒩(𝒫)={𝐧i3}i=1m𝒩𝒫superscriptsubscriptsubscript𝐧𝑖superscript3𝑖1𝑚\mathcal{N}(\mathcal{P})=\left\{\mathbf{n}_{i}\in\mathbb{R}^{3}\right\}_{i=1}^% {m}caligraphic_N ( caligraphic_P ) = { bold_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. The RMSE error is calculated as:

$$\operatorname{RMSE}(\hat{\mathcal{N}}(\mathcal{P}))=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\arccos^{2}\left(\hat{\mathbf{n}}_{i},\mathbf{n}_{i}\right)} \tag{9}$$

Following [25, 49, 15], we also use the percentage of good points, PGP($\alpha$), to analyze the error distribution of the predicted normals:

$$\operatorname{PGP}(\alpha)=\frac{1}{m}\sum_{i=1}^{m}I\left(\arccos\left(\hat{\mathbf{n}}_{i},\mathbf{n}_{i}\right)<\alpha\right),\quad\alpha\in\left[0^{\circ},30^{\circ}\right] \tag{10}$$

where $I$ is the indicator function. PGP($\alpha$) measures the percentage of normal predictions with errors that fall below various angle thresholds denoted by $\alpha$.
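
For concreteness, Equations (9) and (10) can be sketched in plain Python as follows. This is an illustrative sketch rather than the evaluation code used in the experiments. Here $\arccos(\hat{\mathbf{n}}_{i},\mathbf{n}_{i})$ is read as the angle between the two vectors, reported in degrees; the function names and the optional `unoriented` flag (which ignores the sign of the normal) are assumptions introduced for this example.

```python
import math

def angular_errors(pred, gt, unoriented=False):
    """Angle in degrees between each predicted and ground-truth normal."""
    errors = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = dot / norm
        if unoriented:
            # ASSUMPTION: optionally treat n and -n as the same normal.
            cos = abs(cos)
        cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
        errors.append(math.degrees(math.acos(cos)))
    return errors

def rmse(errors):
    """Equation (9): root mean squared angular error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def pgp(errors, alpha):
    """Equation (10): fraction of points whose angular error is below alpha degrees."""
    return sum(1 for e in errors if e < alpha) / len(errors)
```

A PGP curve such as the one in Figure 5 is obtained by evaluating `pgp(errors, alpha)` for a range of thresholds between 0 and 30 degrees.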

Implementation Details.

For training, we use an Adam optimizer with a learning rate of $2\times 10^{-4}$ and a batch size of 32. The learning rate is decreased by a factor of 0.005 every epoch. Our method is trained for 250 epochs, during which we randomly sample 100,000 point patches from the training set in each epoch. Experiments are conducted on a cluster of NVIDIA A100 GPUs.
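
The training budget implied by these settings can be worked out with a short sketch. The constants come from the paragraph above. The decay rule is an assumption: the text does not say how the factor of 0.005 is applied, so this example takes one possible reading, namely multiplying the learning rate by (1 - 0.005) after every epoch. The function names are hypothetical.

```python
import math

BASE_LR = 2e-4
BATCH_SIZE = 32
EPOCHS = 250
PATCHES_PER_EPOCH = 100_000
DECAY = 0.005  # ASSUMPTION: interpreted as a 0.5% multiplicative decrease per epoch

def iterations_per_epoch(patches=PATCHES_PER_EPOCH, batch_size=BATCH_SIZE):
    """Number of optimizer steps needed to see all sampled patches once."""
    return math.ceil(patches / batch_size)

def learning_rate(epoch, base_lr=BASE_LR, decay=DECAY):
    """Learning rate at the start of a given epoch under the assumed decay rule."""
    return base_lr * (1.0 - decay) ** epoch

if __name__ == "__main__":
    print("iterations per epoch:", iterations_per_epoch())
    print("patches seen in total:", EPOCHS * PATCHES_PER_EPOCH)
    print("learning rate at the final epoch:", learning_rate(EPOCHS - 1))
```

With a batch size of 32, the 100,000 patches sampled per epoch correspond to 3,125 iterations per epoch.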

Figure 3: a) Qualitative results on the PCPNet dataset. The point cloud heatmap reflects the normal estimation error. b) Visualization of the per-point weights. CSA (AdaFit) favors smaller neighborhoods indiscriminately, while HSurf-Net is trained with weights that prioritize ‘on surface’ points. Meanwhile, the Transformer acquires optimal global attention weights through training on raw point cloud data.
| Method | Year | Approach | PCPNet: no noise | PCPNet: noise $\sigma$=0.12% | PCPNet: noise $\sigma$=0.6% | PCPNet: noise $\sigma$=1.2% | PCPNet: density (stripes) | PCPNet: density (gradient) | PCPNet: average | SceneNN: original | SceneNN: extra noise | SceneNN: average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA [17] | 1992 | Classical surface fitting | 12.29 | 12.87 | 18.38 | 27.52 | 13.66 | 12.81 | 16.25 | 15.93 | 16.32 | 16.12 |
| Jet [9] | 2005 | Classical surface fitting | 12.35 | 12.84 | 18.33 | 27.68 | 13.39 | 13.13 | 16.29 | 15.17 | 15.59 | 15.38 |
| HoughCNN [7] | 2016 | Direct regression | 10.23 | 11.62 | 22.66 | 33.39 | 11.02 | 12.47 | 16.90 | - | - | - |
| PCPNet [15] | 2018 | Direct regression | 9.64 | 11.51 | 18.27 | 22.84 | 11.73 | 13.46 | 14.58 | 20.86 | 21.40 | 21.13 |
| Nesti-Net [4] | 2019 | Direct regression | 7.06 | 10.24 | 17.77 | 22.31 | 8.64 | 8.95 | 12.49 | 13.01 | 15.19 | 14.10 |
| Lenssen et al. [23] | 2020 | Learning-based surface fitting | 6.72 | 9.95 | 17.18 | 21.96 | 7.73 | 7.51 | 11.84 | 10.24 | 13.00 | 11.62 |
| DeepFit [3] | 2020 | Learning-based surface fitting | 6.51 | 9.21 | 16.73 | 23.12 | 7.92 | 7.31 | 11.80 | 10.33 | 13.07 | 11.70 |
| Refine-Net [46] | 2022 | Direct regression | 5.92 | 9.04 | 16.52 | 22.19 | 7.70 | 7.20 | 11.43 | 18.09 | 19.73 | 18.91 |
| Zhang et al. [43] | 2022 | Learning-based surface fitting | 5.65 | 9.19 | 16.78 | 22.93 | 6.68 | 6.29 | 11.25 | 9.31 | 13.11 | 11.21 |
| Zhou et al. [48] | 2021 | Learning-based surface fitting | 5.90 | 9.10 | 16.50 | 22.08 | 6.79 | 6.40 | 11.13 | - | - | - |
| AdaFit [49] | 2021 | Learning-based surface fitting | 5.19 | 9.05 | 16.44 | 21.94 | 6.01 | 5.90 | 10.76 | 8.39 | 12.85 | 10.62 |
| GraphFit [24] | 2022 | Learning-based surface fitting | 4.45 | 8.74 | 16.05 | 21.64 | 5.22 | 5.48 | 10.26 | 7.99 | 12.17 | 10.08 |
| NeAF [26] | 2023 | Angular field | 4.20 | 9.25 | 16.35 | 21.74 | 4.89 | 4.88 | 10.22 | - | - | - |
| HSurf-Net [25] | 2022 | Learning-based surface fitting | 4.17 | 8.78 | 16.25 | 21.61 | 4.98 | 4.86 | 10.11 | 7.55 | 12.23 | 9.89 |
| SNEtransformer (Ours) | 2023 | Direct regression | 3.99 | 8.97 | 15.85 | 20.98 | 4.81 | 4.67 | 9.88 | 7.44 | 12.14 | 9.79 |
| SNEdiffusion (Ours) | 2023 | Diffusion | 4.00 | 8.88 | 16.25 | 21.37 | 4.96 | 4.89 | 10.05 | - | - | - |
| SNEdiffusion as regression model (Ours) | 2023 | Direct regression | 3.93 | 8.90 | 16.27 | 21.40 | 4.82 | 4.64 | 10.00 | - | - | - |
Table 1: Normal angle RMSE results on the PCPNet and SceneNN datasets, sorted by the values (lower is better) on the PCPNet dataset. As a direct regression method, SNEtransformer outperforms existing learning-based surface fitting methods at the medium ($\sigma$=0.6%) and high ($\sigma$=1.2%) noise levels.
Figure 4: a) Visualization of predicted normals on the Semantic3D dataset. Our method preserves sharper geometric details, as highlighted by the red and green border regions. b) Error visualization of noisy point clouds in the SceneNN dataset. Point colors correspond to the angular error mapped onto a heatmap. SNEtransformer predicts more accurate normals than baselines when the input is affected by noise.
Figure 5: Percentage of Good Points (PGP) graphs for the PCPNet and SceneNN datasets. The area under the blue color is enlarged and displayed in a black pane. Our method produces high-quality estimations in noisy settings.

4.2 Results on PCPNet

PCPNet is a point cloud normal estimation dataset comprising synthetic shapes and 3D scanned objects. The training set contains eight point clouds, and the test set consists of nineteen. Following [25, 24, 49], SNEtransformer is trained on point clouds with various levels of Gaussian noise (none, low, medium, and high) and is evaluated against point clouds with different Gaussian noise levels, as well as two additional settings where the point density is inconsistent. The quantitative evaluation results, presented in Table 1, show that SNEtransformer achieves the lowest average RMSE (9.88) and outperforms existing methods in every setting except low noise ($\sigma$=0.12%), where GraphFit and HSurf-Net obtain lower errors. Figure 5 displays the PGP curves under all noise conditions, and Figure 3 gives a visual comparison of the normal prediction errors of SNEtransformer and existing methods. Across these testing scenarios, our method produces more accurate normal estimates.

4.3 Results on SceneNN

SceneNN is an RGB-D scan dataset captured in various indoor settings. Following [25], we first train SNEtransformer on the PCPNet dataset, then evaluate the trained model on SceneNN without fine-tuning to test how well it generalizes. Due to sensor errors, the data naturally contains noise, presenting a good opportunity to test the model’s noise agnosticism. We use the same evaluation settings as [25] and report the numerical results in Table 1 and the visual results in Figure 4. Table 1 shows that SNEtransformer generalizes well to real-world data and outperforms previous methods in both the original and extra-noise settings. Figure 5 presents the PGP curves under original and extra-noise conditions, demonstrating that our method produces high-quality estimations for indoor scanning.

4.4 Results on Semantic3D

We visualize the normal estimation results on the outdoor scanning dataset Semantic3D in Figure 4, despite the absence of ground truth data for normals. It is apparent that our method preserves finer details such as carved patterns on doors, grooves between bricks, and letters on buildings—details that other methods tend to oversmooth. This suggests that our method also provides higher-quality normal estimation in outdoor scanning scenarios.

Figure 6: Poisson surface reconstruction on point clouds from the PCPNet dataset. Normals predicted by SNEtransformer help recover finer details in point clouds with varying density (top) and Gaussian noise (bottom), as highlighted in the windows below each shape.

4.5 Application to Surface Reconstruction Task

We apply Poisson reconstruction [21] on PCPNet point clouds with the per-point normals predicted by SNEtransformer. Figure 6 shows that SNEtransformer helps recover finer details in areas with complex local geometry, such as the hand of the Statue of Liberty. Notably, our method enhances geometry reconstruction when the input is noisy.

4.6 Ablation Studies

Graph Convolution and Transformer.

To validate the effectiveness of Graph Convolution and the Transformer, we ablate each component and report the results in Table 2. Normal estimation accuracy degrades when either Graph Convolution or the Transformer is removed from the network. The degradation is larger when the Transformer is removed (the average RMSE increases by 1.63, compared with 1.06 without Graph Convolution), which demonstrates its effectiveness.

Global Attention or Local Attention.

To demonstrate the effectiveness of global attention, we compare its performance with that of local attention. To implement local attention, we first run the $k$-nearest neighbors algorithm at each point in the point cloud patch and then apply attention only within the set of nearest neighbor points. The results in Table 2 show that applying attention on a global scale improves the results and leads to noise-agnostic behavior. This validates our assumption that global attention allows the network to attend to any points it deems helpful for the estimation tasks, thereby increasing resilience to noise.
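
The difference between the two variants can be illustrated with a small NumPy sketch. It is a simplified, single-head version written for this comparison only: the learned query, key, and value projections of a real Transformer layer are omitted, and the helper names `knn_mask` and `attention` are hypothetical. Local attention is obtained by masking every score outside a point's $k$ nearest neighbors before the softmax, while global attention uses no mask.

```python
import numpy as np

def knn_mask(xyz, k):
    """Boolean (n, n) mask; row i is True at the k nearest neighbors of point i (itself included)."""
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    mask = np.zeros(d.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask=None gives global attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Local attention: scores outside the neighborhood get zero weight after softmax.
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

With `mask=None`, every point in the patch can contribute to every other point's feature; with `knn_mask(xyz, k)`, each row of the attention matrix has at most `k` non-zero entries.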

| Noise level (PCPNet Dataset) | Ours | Ours w/o Transformer | Ours w/o GC | Ours with local attention |
|---|---|---|---|---|
| None | 3.99 | 5.95 (+1.95) | 5.05 (+1.05) | 4.61 (+0.61) |
| $\sigma$=0.12% | 8.97 | 9.77 (+0.80) | 9.53 (+0.56) | 9.30 (+0.33) |
| $\sigma$=0.6% | 15.85 | 17.98 (+2.13) | 17.01 (+1.16) | 17.20 (+1.35) |
| $\sigma$=1.2% | 20.98 | 22.56 (+1.57) | 22.28 (+1.29) | 22.14 (+1.15) |
| Density (stripes) | 4.81 | 6.67 (+1.85) | 6.08 (+1.26) | 5.52 (+0.70) |
| Density (gradients) | 4.67 | 6.14 (+1.47) | 5.74 (+1.07) | 5.24 (+0.57) |
| Average | 9.88 | 11.51 (+1.63) | 10.94 (+1.06) | 10.66 (+0.78) |
Table 2: Ablation experiments reveal the effectiveness of the Transformer, Graph Convolution, and global attention.
| Input features (PCPNet Dataset) | No noise | Noise $\sigma$=0.12% | Noise $\sigma$=0.6% | Noise $\sigma$=1.2% | Density (stripes) | Density (gradient) |
|---|---|---|---|---|---|---|
| $xyz$+$\Delta_{xyz}$+$\mathbf{f}$+$\Delta_{\mathbf{f}}$ | 3.99 | 8.97 | 15.85 | 20.98 | 4.81 | 4.67 |
| $xyz$+$\Delta_{xyz}$+$\mathbf{f}$ | 4.74 | 9.23 | 16.24 | 21.76 | 5.85 | 5.45 |
| $xyz$+$\mathbf{f}$+$\Delta_{\mathbf{f}}$ | 4.90 | 9.34 | 16.55 | 22.43 | 5.63 | 5.40 |
Table 3: Ablation study on input features for Graph Convolution. $\Delta_{xyz}$ represents the difference in coordinates between a neighbor and the query point, while $\Delta_{\mathbf{f}}$ indicates the difference in features between a neighbor and the query point.
Ablation on Features for Graph Convolution.

We have explored various features for inclusion in Graph Convolution, and the results are listed in Table 3. In summary, there are four main features: the 3D coordinates of the query point and its neighboring points (denoted as $xyz$), the difference in 3D coordinates between the query point and its neighbors (denoted as $\Delta_{xyz}$), the features of the query point and its neighboring points (denoted as $\mathbf{f}$), and the difference in features between the query point and the neighbors (denoted as $\Delta_{\mathbf{f}}$). We conclude that both $\Delta_{xyz}$ and $\Delta_{\mathbf{f}}$ contribute to better estimation results.
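
To make the four feature groups concrete, the following NumPy sketch assembles per-edge inputs for a graph convolution. It is an assumption-laden illustration, not the network's actual layer: the exact concatenation layout is not specified above, so this example follows an EdgeConv-style layout [40] in which each edge carries the center point's coordinates and features together with the neighbor-minus-center differences. The function name `edge_features` and the two flags, which mirror the rows of Table 3, are hypothetical.

```python
import numpy as np

def edge_features(xyz, feat, k, use_dxyz=True, use_df=True):
    """Build (n, k, c) edge features from point coordinates and per-point features.

    xyz: (n, 3) distinct points, feat: (n, d) features, k < n neighbors per point.
    """
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    # Column 0 of the sorted indices is the point itself, so skip it.
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    center_xyz = np.repeat(xyz[:, None, :], k, axis=1)   # (n, k, 3)
    center_f = np.repeat(feat[:, None, :], k, axis=1)    # (n, k, d)
    parts = [center_xyz]                                  # xyz
    if use_dxyz:
        parts.append(xyz[idx] - center_xyz)               # Delta_xyz: neighbor minus center
    parts.append(center_f)                                # f
    if use_df:
        parts.append(feat[idx] - center_f)                # Delta_f: neighbor minus center
    return np.concatenate(parts, axis=-1)
```

Setting `use_df=False` or `use_dxyz=False` corresponds to the second and third rows of Table 3, respectively. A graph convolution would then apply a shared MLP to each edge feature and pool over the `k` neighbors.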

5 Conclusion

In this paper, we introduce SNEtransformer, a Transformer-based model that accurately predicts surface normals. We demonstrate that a straightforward combination of Graph Convolution with a Transformer is sufficient to achieve state-of-the-art performance, without the need for hand-designed modules. Our model unifies existing approaches and is shown empirically to be noise-agnostic, as evidenced by extensive experiments on both indoor and outdoor datasets. Lastly, we showcase the potential of our method in downstream applications such as surface reconstruction.

Acknowledgements

This work was partially supported by JSPS KAKENHI Grant JP21H04907, and by JST CREST Grants JPMJCR18A6 and JPMJCR20D3, Japan.

References

  • Alexa et al. [2001] Marc Alexa, Johannes Behr, Daniel Cohen-Or, Shachar Fleishman, David Levin, and Claudio T Silva. Point set surfaces. In Proceedings Visualization, 2001. VIS’01., pages 21–29. IEEE, 2001.
  • Avron et al. [2010] Haim Avron, Andrei Sharf, Chen Greif, and Daniel Cohen-Or. L1-sparse reconstruction of sharp point set surfaces. ACM Transactions on Graphics, 29(5):1–12, 2010.
  • Ben-Shabat and Gould [2020] Yizhak Ben-Shabat and Stephen Gould. DeepFit: 3D surface fitting via neural network weighted least squares. In European Conference on Computer Vision, pages 20–34. Springer, 2020.
  • Ben-Shabat et al. [2019] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. Nesti-Net: Normal estimation for unstructured 3D point clouds using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10112–10120, 2019.
  • Blinn [1978] James F Blinn. Simulation of wrinkled surfaces. ACM SIGGRAPH Computer Graphics, 12(3):286–292, 1978.
  • Boulch and Marlet [2012] Alexandre Boulch and Renaud Marlet. Fast and robust normal estimation for point clouds with sharp features. In Computer Graphics Forum, pages 1765–1774. Wiley Online Library, 2012.
  • Boulch and Marlet [2016] Alexandre Boulch and Renaud Marlet. Deep learning for robust normal estimation in unstructured point clouds. In Computer Graphics Forum, pages 281–290. Wiley Online Library, 2016.
  • Cazals and Pouget [2005a] F. Cazals and M. Pouget. Estimating differential quantities using polynomial fitting of osculating jets. Computer Aided Geometric Design, 22(2):121–146, 2005a.
  • Cazals and Pouget [2005b] Frédéric Cazals and Marc Pouget. Estimating differential quantities using polynomial fitting of osculating jets. Computer Aided Geometric Design, 22(2):121–146, 2005b.
  • Chen et al. [2022] Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, and Angel X. Chang. Unit3d: A unified transformer for 3d dense captioning and visual grounding, 2022.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Fleishman et al. [2005] Shachar Fleishman, Daniel Cohen-Or, and Cláudio T. Silva. Robust moving least-squares fitting with sharp features. 24(3), 2005.
  • Gouraud [1971] Henri Gouraud. Continuous shading of curved surfaces. IEEE Transactions on Computers, 100(6):623–629, 1971.
  • Guennebaud and Gross [2007] Gaël Guennebaud and Markus Gross. Algebraic point set surfaces. In ACM SIGGRAPH 2007 papers. 2007.
  • Guerrero et al. [2018] Paul Guerrero, Yanir Kleiman, Maks Ovsjanikov, and Niloy J Mitra. PCPNet: learning local shape properties from raw point clouds. In Computer Graphics Forum, pages 75–85. Wiley Online Library, 2018.
  • Hackel et al. [2017] Timo Hackel, N. Savinov, L. Ladicky, Jan D. Wegner, K. Schindler, and M. Pollefeys. SEMANTIC3D.NET: A new large-scale point cloud classification benchmark. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pages 91–98, 2017.
  • Hoppe et al. [1992] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, pages 71–78, 1992.
  • Hua et al. [2016] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), 2016.
  • Huang et al. [2009] Hui Huang, Dan Li, Hao Zhang, Uri Ascher, and Daniel Cohen-Or. Consolidation of unorganized point clouds for surface reconstruction. ACM Transactions on Graphics, 28(5):1–7, 2009.
  • Jacobs et al. [1991] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
  • Kazhdan et al. [2006] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics Symposium on Geometry Processing, 2006.
  • Lange and Polthier [2005] Carsten Lange and Konrad Polthier. Anisotropic smoothing of point sets. Computer Aided Geometric Design, 22(7):680–692, 2005.
  • Lenssen et al. [2020] Jan Eric Lenssen, Christian Osendorfer, and Jonathan Masci. Deep iterative surface normal estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11247–11256, 2020.
  • Li et al. [2022a] Keqiang Li, Mingyang Zhao, Huaiyu Wu, Dong-Ming Yan, Zhen Shen, Fei-Yue Wang, and Gang Xiong. Graphfit: Learning multi-scale graph-convolutional representation for point cloud normal estimation. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXII, pages 651–667, Berlin, Heidelberg, 2022a. Springer-Verlag.
  • Li et al. [2022b] Qing Li, Yu-Shen Liu, Jin-San Cheng, Cheng Wang, Yi Fang, and Zhizhong Han. Hsurf-net: Normal estimation for 3d point clouds by learning hyper surfaces. In Advances in Neural Information Processing Systems, pages 4218–4230. Curran Associates, Inc., 2022b.
  • Li et al. [2023] Shujuan Li, Junsheng Zhou, Baorui Ma, Yu-Shen Liu, and Zhizhong Han. Neaf: Learning neural angle fields for point normal estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1):1396–1404, 2023.
  • Lu et al. [2020a] Dening Lu, Xuequan Lu, Yangxing Sun, and Jun Wang. Deep feature-preserving normal estimation for point cloud filtering. Computer-Aided Design, 125:102860, 2020a.
  • Lu et al. [2020b] Xuequan Lu, Scott Schaefer, Jun Luo, Lizhuang Ma, and Ying He. Low rank matrix approximation for 3D geometry filtering. IEEE Transactions on Visualization and Computer Graphics, 2020b.
  • Misra et al. [2021] Ishan Misra, Rohit Girdhar, and Armand Joulin. An End-to-End Transformer Model for 3D Object Detection. In ICCV, 2021.
  • Mitra and Nguyen [2003] Niloy J Mitra and An Nguyen. Estimating surface normals in noisy point cloud data. In Proceedings of the Nineteenth Annual Symposium on Computational Geometry, pages 322–328, 2003.
  • Pauly et al. [2002] Mark Pauly, Markus Gross, and Leif P Kobbelt. Efficient simplification of point-sampled surfaces. In IEEE Visualization, 2002. VIS 2002., pages 163–170. IEEE, 2002.
  • Phong [1975] Bui Tuong Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):311–317, 1975.
  • Qi et al. [2017a] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017a.
  • Qi et al. [2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30:5099–5108, 2017b.
  • Redmon et al. [2015] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
  • Shit et al. [2022] Suprosanna Shit, Rajat Koner, Bastian Wittmann, Johannes Paetzold, Ivan Ezhov, Hongwei Li, Jiazhen Pan, Sahand Sharifzadeh, Georgios Kaissis, Volker Tresp, and Bjoern Menze. Relationformer: A unified framework for image-to-graph generation, 2022.
  • Stewart [1993] Gilbert W Stewart. On the early history of the singular value decomposition. SIAM Review, 35(4):551–566, 1993.
  • Sun et al. [2015] Yujing Sun, Scott Schaefer, and Wenping Wang. Denoising point sets via l0 minimization. Computer Aided Geometric Design, 35:2–15, 2015.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang et al. [2018] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. CoRR, abs/1801.07829, 2018.
  • Yu et al. [2021] Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, and Jie Zhou. Pointr: Diverse point cloud completion with geometry-aware transformers, 2021.
  • Zhang et al. [2022a] Jie Zhang, Jun-Jie Cao, Hai-Rui Zhu, Dong-Ming Yan, and Xiu-Ping Liu. Geometry guided deep surface normal estimation. Computer-Aided Design, 142:103119, 2022a.
  • Zhang et al. [2022b] Jie Zhang, Jun-Jie Cao, Hai-Rui Zhu, Dong-Ming Yan, and Xiu-Ping Liu. Geometry guided deep surface normal estimation. Computer-Aided Design, 142:103119, 2022b.
  • Zhao et al. [2021a] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021a.
  • Zhao et al. [2021b] Lichen Zhao, Daigang Cai, Lu Sheng, and Dong Xu. 3DVG-Transformer: Relation modeling for visual grounding on point clouds. In ICCV, pages 2928–2937, 2021b.
  • Zhou et al. [2022] Haoran Zhou, Honghua Chen, Yingkui Zhang, Mingqiang Wei, Haoran Xie, Jun Wang, Tong Lu, Jing Qin, and Xiao-Ping Zhang. Refine-Net: Normal refinement neural network for noisy point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Zhou et al. [2020] Jun Zhou, Hua Huang, Bin Liu, and Xiuping Liu. Normal estimation for 3D point clouds via local plane constraint and multi-scale selection. Computer-Aided Design, 129:102916, 2020.
  • Zhou et al. [2021] Jun Zhou, Wei Jin, Mingjie Wang, Xiuping Liu, Zhiyang Li, and Zhaobin Liu. Improvement of normal estimation for point clouds via simplifying surface fitting. arXiv preprint arXiv:2104.10369, 2021.
  • Zhu et al. [2021a] Runsong Zhu, Yuan Liu, Zhen Dong, Yuan Wang, Tengping Jiang, Wenping Wang, and Bisheng Yang. AdaFit: Rethinking learning-based normal estimation on point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6118–6127, 2021a.
  • Zhu et al. [2021b] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection, 2021b.