3D Learnable Supertoken Transformer for LiDAR Point Cloud Scene Segmentation

Lu, Dening; Zhou, Jun; Gao, Kyle; Xu, Linlin; Li, Jonathan

arXiv:2405.15826 (cs)

[Submitted on 23 May 2024 (v1), last revised 26 Dec 2024 (this version, v2)]

Abstract: 3D Transformers have achieved great success in point cloud understanding and representation. However, there is still considerable scope for further development in effective and efficient Transformers for large-scale LiDAR point cloud scene segmentation. This paper proposes a novel 3D Transformer framework, named 3D Learnable Supertoken Transformer (3DLST). The key contributions are summarized as follows. Firstly, we introduce the first Dynamic Supertoken Optimization (DSO) block for efficient token clustering and aggregation, where the learnable supertoken definition avoids the time-consuming pre-processing of traditional superpoint generation. Since the learnable supertokens can be dynamically optimized by multi-level deep features during network learning, they are tailored to semantic homogeneity-aware token clustering. Secondly, an efficient Cross-Attention-guided Upsampling (CAU) block is proposed for token reconstruction from optimized supertokens. Thirdly, 3DLST is equipped with a novel W-net architecture instead of the common U-net design, which is more suitable for Transformer-based feature learning. State-of-the-art performance on three challenging LiDAR datasets (airborne MultiSpectral LiDAR (MS-LiDAR): 89.3% average F1 score; DALES: 80.2% mIoU; Toronto-3D: 80.4% mIoU) demonstrates the superiority of 3DLST and its strong adaptability to various LiDAR point cloud data (airborne MS-LiDAR, aerial LiDAR, and vehicle-mounted LiDAR data). Furthermore, 3DLST is also efficient, running up to 5x faster than previous best-performing methods.
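
The abstract describes two attention-style steps: aggregating tokens into a small set of supertokens, and reconstructing per-token features from those supertokens by cross-attention. The following is a minimal NumPy sketch of that general idea only, not the authors' DSO or CAU blocks. The function names, the scaled dot-product scoring, the absence of learned projections, and the sizes (1024 tokens, 16 supertokens, 32 channels) are all assumptions made for illustration; the random `supertokens` array merely stands in for parameters that would be learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_to_supertokens(tokens, supertokens):
    # Hypothetical aggregation step: each supertoken attends over all tokens
    # and becomes a weighted average of the token features.
    d = tokens.shape[1]
    scores = supertokens @ tokens.T / np.sqrt(d)   # (M, N)
    weights = softmax(scores, axis=1)              # rows sum to 1 over tokens
    return weights @ tokens                        # (M, d)

def cross_attention_upsample(tokens, supertokens):
    # Hypothetical upsampling step: each token queries the supertokens
    # and is rebuilt as a weighted average of supertoken features.
    d = tokens.shape[1]
    scores = tokens @ supertokens.T / np.sqrt(d)   # (N, M)
    weights = softmax(scores, axis=1)              # rows sum to 1 over supertokens
    return weights @ supertokens                   # (N, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 32))       # N point tokens with d channels (assumed sizes)
supertokens = rng.normal(size=(16, 32))    # stand-in for learnable supertokens

updated = aggregate_to_supertokens(tokens, supertokens)
restored = cross_attention_upsample(tokens, updated)
print(updated.shape, restored.shape)       # (16, 32) (1024, 32)
```

In this sketch, attention over 1024 tokens is replaced by two products against only 16 supertokens, which illustrates why a supertoken bottleneck can reduce cost relative to full token-to-token attention.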

Comments:	13 pages, 10 figures, 7 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.15826 [cs.CV]
	(or arXiv:2405.15826v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.15826

Submission history

From: Dening Lu
[v1] Thu, 23 May 2024 20:41:15 UTC (7,712 KB)
[v2] Thu, 26 Dec 2024 17:06:08 UTC (9,213 KB)
