Deep Learning Based 3D Segmentation: A Survey
Yong He^a, Hongshan Yu^a,*, Xiaoyan Liu^a, Zhengeng Yang^a, Wei Sun^a and Ajmal Mian^b
a Hunan University, Lushan South Rd., Yuelu Dist., Changsha, 410082, Hunan, China
b University of Western Australia, 35 Stirling Hwy, Perth, 6009, WA, Australia
[...] number of methods in the literature that have been evaluated on different benchmark datasets. Whereas deep learning survey papers on RGB-D and point cloud segmentation exist, there is a lack of an in-depth and recent survey that covers all 3D data modalities and application domains. This paper fills the gap and provides a comprehensive survey of the recent progress made in deep learning based 3D segmentation. It covers over 180 works, analyzes their strengths and limitations, and discusses their competitive results on benchmark datasets. The survey provides a summary of the most commonly used pipelines and finally highlights promising research directions for the future.
1. Introduction
Segmentation of 3D scenes is a fundamental and challenging problem in computer vision as well as computer graphics. [...] decompose instances further into their different components, such as the armrests, legs and backrest of the same chair.
Compared to conventional single-view 2D segmentation, 3D segmentation gives a more comprehensive understanding of a scene, since 3D data (e.g. RGB-D, point cloud, voxel, mesh, 3D video) contain richer geometric, shape and scale information with less background noise. Moreover, the representation of 3D data, for example in the form of projected images, has more semantic information.
Recently, deep learning techniques have dominated many research areas including computer vision and natural language processing. Motivated by its success in learning powerful features, deep learning for 3D segmentation has also attracted growing interest from the research community over the past decade. However, 3D deep learning methods still face many unsolved challenges. For example, the irregularity of point clouds makes it difficult to exploit local features, and converting them to high-resolution voxels comes with a huge computational burden.
Fig. 1: The five main types of 3D data: (a) RGB-D image, (b) projected images, (c) voxels, (d) mesh, and (e) points. Types of 3D segmentation: (f) 3D semantic segmentation, (g) 3D instance segmentation, and (h) 3D part segmentation.
⋆ This work was partially supported by the National Natural Science Foundation of China (Grant U2013203, 61973106). Professor Ajmal Mian is the recipient of an Australian Research Council Future Fellowship Award (project number FT210100268) funded by the Australian Government.
∗ Corresponding author
[email protected] (Y. He); [email protected] (H. Yu); [email protected] (X. Liu); [email protected] (Z. Yang); [email protected] (W. Sun); [email protected] (A. Mian)
ORCID(s): 0000-0003-2916-3068 (Y. He); 0000-0003-1973-6766 (H. Yu)
This paper provides a comprehensive survey of recent progress in deep learning methods for 3D segmentation. It focuses on analyzing commonly used building blocks, convolution kernels and complete architectures, pointing out the pros and cons in each case. The survey covers over 180 representative papers published in the last five years. Although some notable 3D segmentation surveys have been released, including RGB-D semantic segmentation Fooladgar and Kasaei (2020), remote sensing imagery segmentation Yuan, Shi and Gu (2021), and point cloud segmentation Xie, Jiaojiao and Zhu (2020a), Guo, Wang, Hu, Liu, Liu and Bennamoun (2020), Liu, Sun, Li, Hu and Wang (2019a), Bello, Yu, Wang, Adam and Li (2020), Naseer, Khan and Porikli (2018), Ioannidou, Chatzilari, Nikolopoulos and Kompatsiaris (2017), these surveys do not comprehensively cover all 3D data types and typical application domains. Most importantly, these surveys do not focus on 3D segmentation but give a general survey of deep learning from point clouds Guo et al. (2020), Liu et al. (2019a), Bello et al. (2020), Naseer et al. (2018), Ioannidou et al. (2017). Given the importance of the three segmentation tasks, this paper focuses exclusively on deep learning techniques for 3D segmentation. The contributions of this paper are summarized as follows:
• To the best of our knowledge, this is the first survey paper to comprehensively cover deep learning methods for 3D segmentation on all 3D data representations, including RGB-D, projected images, voxels, point clouds, meshes, and 3D videos.
• This survey provides an in-depth analysis of the relative advantages and disadvantages of different types of 3D data segmentation methods.
• Unlike existing reviews, this survey focuses on deep learning methods designed specifically for 3D segmentation and also discusses typical segmentation pipelines as well as application domains.
• Finally, this survey provides comprehensive comparisons of existing methods on several public benchmark 3D datasets, draws interesting conclusions and identifies promising future research directions.
Figure 2 shows a snapshot of how this survey is organized. Section 2 introduces some terminology and background concepts, including popular 3D datasets and evaluation metrics for 3D segmentation. Section 3 reviews methods for 3D semantic segmentation whereas Section 4 reviews methods for 3D instance segmentation. Section 5 provides a survey of existing methods for 3D part segmentation. Section 6 reviews the 3D segmentation methods used in some common application areas including 3D video segmentation and 3D semantic maps. Section 7 presents a performance comparison between 3D segmentation methods on several popular datasets and gives the corresponding data analysis. Finally, Section 8 identifies promising future research directions and concludes the paper.
Fig. 2: Organization of this survey: 3D segmentation datasets and evaluation metrics (Section 2); RGB-D, projected image, voxel and point based 3D semantic segmentation (Section 3); proposal based and proposal free 3D instance segmentation (Section 4); 3D part segmentation on regular and irregular data (Section 5); 3D video segmentation and 3D semantic maps (Section 6); benchmark experiments and data analysis (Section 7); and future research directions (Section 8).
2. Terminology and Background Concepts
This section introduces some terminologies and background concepts, including 3D data representations, popular 3D segmentation datasets and evaluation metrics, to help the reader easily navigate through the field of 3D segmentation.
2.1. 3D Segmentation Datasets
Datasets are critical to train and test 3D segmentation algorithms using deep learning. However, it is cumbersome and expensive to privately gather and annotate datasets as it needs domain expertise, high quality sensors and processing equipment. Thus, building on public datasets is an ideal way to reduce the cost, and it has the added advantage of providing the community with a fair comparison between algorithms. Table 1 summarizes some of the most popular and typical datasets with respect to the sensor type, data size and format, scene class and annotation method.
Fig. 3: Example annotated scenes from the S3DIS, Semantic3D and ScanNet datasets.
These datasets are acquired for 3D semantic segmentation by different types of sensors, including RGB-D cameras Silberman and Fergus (2011), Silberman, Hoiem, Kohli and Fergus (2012), Song, Lichtenberg and Xiao (2015), Hua, Pham, Nguyen, Tran, Yu and Yeung (2016), Dai, Chang, Savva, Halber, Funkhouser and Nießner (2017), mobile laser scanners Roynard, Deschaud and Goulette (2018), Behley, Garbade, Milioto, Quenzel, Behnke, Stachniss and Gall (2019), static terrestrial scanners Hackel, Savinov, Ladicky, Wegner, Schindler and Pollefeys (2017), unreal engines Brodeur, Perez, Anand, Golemo, Celotti, Strub, Rouat, Larochelle and Courville (2017), Wu, Wu, Gkioxari and Tian (2018b), and other 3D scanners Armeni, Sener, Zamir, Jiang, Brilakis, Fischer and Savarese (2016), Chang, Dai, Funkhouser, Halber, Niebner, Savva, Song, Zeng and Zhang (2017). Among these, the ones obtained from unreal engines are synthetic datasets Brodeur et al. (2017), Wu et al. (2018b) that do not require expensive equipment or annotation time. They are also rich in categories and quantities of objects. Synthetic datasets have complete 360 degree 3D objects with no occlusion effects or noise, in contrast to the real-world datasets which are noisy and contain occlusions Silberman and Fergus (2011), Silberman et al. (2012), Song et al. (2015), Hua et al. (2016), Dai et al. (2017), Roynard et al. (2018), Behley et al. (2019), Armeni et al. (2016), Hackel et al. (2017), Chang et al. (2017). For 3D instance segmentation, there are limited 3D datasets, such as ScanNet Dai et al. (2017) and S3DIS Armeni et al. (2016). These two datasets contain scans of real-world indoor scenes obtained by RGB-D cameras and the Matterport camera, respectively. For 3D part segmentation, the Princeton Segmentation Benchmark (PSB) Chen, Golovinskiy and Funkhouser (2009), COSEG Wang, Asafi, Van Kaick, Zhang, Cohen-Or and Chen (2012) and ShapeNet Yi, Kim, Ceylan, Shen, Yan, Su, Lu, Huang, Sheffer and Guibas (2016) are three of the most popular datasets. Below, we introduce five famous segmentation datasets in detail, namely S3DIS Armeni et al. (2016), ScanNet Dai et al. (2017), Semantic3D Hackel et al. (2017), SemanticKITTI Behley et al. (2019) and ShapeNet Yi et al. (2016). Some examples with annotations from these datasets are shown in Figure 3.
S3DIS: In this dataset, the complete point clouds are obtained without any manual intervention using the Matterport scanner. The dataset consists of 271 rooms belonging to 6 large-scale indoor scenes from 3 different buildings (a total of 6020 square meters). These areas mainly include offices, educational and exhibition spaces, and conference rooms.
Semantic3D comprises a total of around 4 billion 3D points acquired with static terrestrial laser scanners, covering up to 160×240×30 meters in real-world 3D space. The point clouds belong to 8 classes (e.g. urban and rural) and contain 3D coordinates, RGB information and intensity. Unlike 2D annotation strategies, 3D data labeling is easily amenable to over-segmentation where each point is individually assigned to a class label.
SemanticKITTI is a large outdoor dataset containing detailed point-wise annotations for 28 classes. Building on the KITTI vision benchmark Geiger, Lenz and Urtasun (2012), SemanticKITTI contains annotations of all 22 sequences of this benchmark, consisting of 43K scans. Moreover, the dataset contains labels for the complete horizontal 360° field-of-view of the rotating laser sensor.
ScanNet is particularly valuable for research in scene understanding as its annotations contain estimated calibration parameters, camera poses, 3D surface reconstructions, textured meshes, dense object-level semantic segmentation, and CAD models. The dataset comprises annotated RGB-D scans of real-world environments. There are 2.5M RGB-D images in 1513 scans acquired in 707 distinct places. After RGB-D image processing, human intelligence annotation tasks were performed using Amazon Mechanical Turk.
Table 1
Summary of popular datasets for 3D segmentation including the sensor, type, size, scene class, number of classes (shown in brackets), and annotation method. S←synthetic environment. R←real-world environment. f←frames. Kf←thousand frames. s←scans. Mp←million points. The symbol '–' means the information is unavailable.
Dataset Sensors Type Size Scene class (number) Annotation method
Datasets for 3D semantic segmentation
NYUv1 Silberman and Fergus (2011) Microsoft Kinect v1 R 2347f bedroom, cafe, kitchen, etc. (7) Conditional Random Field-based model
NYUv2 Silberman et al. (2012) Microsoft Kinect v1 R 1449f bedroom, cafe, kitchen, etc. (26) 2D annotation from AMT
SUN RGB-D Song et al. (2015) RealSense, Xtion LIVE PRO, MK v1/2 R 10355f objects, room layouts, etc. (47) 2D/3D polygons + 3D bounding boxes
SceneNN Hua et al. (2016) Asus Xtion PRO, MK v2 R 100s bedroom, office, apartment, etc. (–) 3D labels projected to 2D frames
RueMonge2014 Riemenschneider, Bódis-Szomorú, Weissenberg and Van Gool (2014) – R 428s window, wall, balcony, door, etc. (7) Multi-view semantic labelling + CRF
ScanNet Dai et al. (2017) Occipital structure sensor R 2.5Mf office, apartment, bathroom, etc. (19) 3D labels projected to 2D frames
S3DIS Armeni et al. (2016) Matterport camera R 70496f conference rooms, offices, etc. (11) Hierarchical labeling
Semantic3D Hackel et al. (2017) Terrestrial laser scanner R 1660Mp farms, town hall, sport fields, etc. (8) Three baseline methods
NPM3D Roynard et al. (2018) Velodyne HDL-32E LiDAR R 143.1Mp ground, vehicle, human, etc. (50) Human labeling
SemanticKITTI Behley et al. (2019) Velodyne HDL-64E R 43Ks ground, vehicle, human, etc. (28) Multi-scan semantic labelling
Matterport3D Chang et al. (2017) Matterport camera R 194.4Kf various rooms (90) Hierarchical labeling
HoME Brodeur et al. (2017) Planner5D platform S 45622f rooms, objects, etc. (84) SSCNet + a short text description
House3D Wu et al. (2018b) Planner5D platform S 45622f rooms, objects, etc. (84) SSCNet + 3 ways
Datasets for 3D instance segmentation
ScanNet Dai et al. (2017) Occipital structure sensor R 2.5Mf office, apartment, bathroom, etc(19) 3D labels project to 2D frames
S3DIS Armeni et al. (2016) Matterport camera R 70496f conference rooms, offices, etc(11) Active learning method
Datasets for 3D part segmentation
ShapeNet Yi et al. (2016) – S 31963s transportation, tool, etc. (16) Propagating human labels to shapes
PSB Chen et al. (2009) Amazon's Mechanical Turk S 380s human, cup, glasses, airplane, etc. (19) Interactive segmentation tool
COSEG Wang et al. (2012) – S 1090s vase, lamp, guitar, etc. (11) Semi-supervised learning method
ShapeNet was annotated with a novel scalable method for efficient and accurate geometric annotation of massive 3D shape collections. The technical innovations explicitly model and lessen the human cost of the annotation effort. The researchers created detailed point-wise labeling of 31963 models in the shape categories of ShapeNetCore, and combined feature-based classifiers, point-to-point correspondences, and shape-to-shape similarities into a single CRF optimization over the network of shapes.
2.2. Evaluation Metrics
Different evaluation metrics can assert the validity and superiority of segmentation methods, including execution time, memory footprint and accuracy. However, few authors provide detailed information about the execution time and memory footprint of their method, so this paper mainly introduces the accuracy metrics.
For 3D semantic segmentation, Overall Accuracy (OAcc), mean class Accuracy (mAcc) and mean class Intersection over Union (mIoU) are the most frequently used metrics to measure the accuracy of segmentation methods. For the sake of explanation, we assume that there are a total of $K+1$ classes, and $p_{ij}$ is the number of minimum units (e.g. pixels, voxels, meshes, points) of class $i$ predicted to belong to class $j$. In other words, $p_{ii}$ represents true positives, while $p_{ij}$ and $p_{ji}$ represent false positives and false negatives respectively.
Overall Accuracy is a simple metric that computes the ratio between the number of truly classified samples and the total number of samples:
$$OAcc = \frac{\sum_{i=0}^{K} p_{ii}}{\sum_{i=0}^{K}\sum_{j=0}^{K} p_{ij}}$$
Mean Accuracy is an extension of OAcc, computing the accuracy in a per-class manner and then averaging over the total number of classes $K$:
$$mAcc = \frac{1}{K+1}\sum_{i=0}^{K}\frac{p_{ii}}{\sum_{j=0}^{K} p_{ij}}$$
Mean Intersection over Union is a standard metric for semantic segmentation. It computes the intersection ratio between ground truth and predicted values, averaged over the total number of classes $K$:
$$mIoU = \frac{1}{K+1}\sum_{i=0}^{K}\frac{p_{ii}}{\sum_{j=0}^{K} p_{ij} + \sum_{j=0}^{K} p_{ji} - p_{ii}}$$
For 3D instance segmentation, Average Precision (AP) and mean class Average Precision (mAP) are also frequently used. Assume there are $L_I$, $I \in [0, K]$ instances in every class, and $c_{ij}$ is the number of points of instance $i$ predicted to belong to instance $j$ ($i = j$ represents correct and $i \neq j$ represents incorrect segmentations).
Average Precision is another simple metric for segmentation that computes the ratio between true positives and the total number of positive samples:
$$AP = \sum_{I=0}^{K}\sum_{i=0}^{L_I}\frac{c_{ii}}{c_{ii} + \sum_{j=0}^{L_I} c_{ij}}$$
Mean Average Precision is an extension of AP which computes per-class AP and then averages over the total number of classes $K$:
$$mAP = \frac{1}{K+1}\sum_{I=0}^{K}\sum_{i=0}^{L_I}\frac{c_{ii}}{c_{ii} + \sum_{j=0}^{L_I} c_{ij}}$$
For 3D part segmentation, the overall average category Intersection over Union ($mIoU_{cat}$) and the overall average instance Intersection over Union ($mIoU_{ins}$) are most frequently used. For the sake of explanation, we assume there are $M_J$, $J \in [0, L_I]$ parts in every instance, and $q_{ij}$ is the total number of points in part $i$ predicted to belong to part $j$. Hence, $q_{ii}$ represents the number of true positives, while $q_{ij}$ and $q_{ji}$ are false positives and false negatives respectively.
Overall average category Intersection over Union is an evaluation metric for part segmentation that measures the mean IoU averaged across the $K$ classes:
$$mIoU_{cat} = \frac{1}{K+1}\sum_{I=0}^{K}\sum_{J=0}^{L_I}\sum_{i=0}^{M_J}\frac{q_{ii}}{\sum_{j=0}^{M_J} q_{ij} + \sum_{j=0}^{M_J} q_{ji} - q_{ii}}$$
Overall average instance Intersection over Union, for part segmentation, measures the mean IoU across all instances:
$$mIoU_{ins} = \frac{1}{\sum_{I=0}^{K} L_I + 1}\sum_{I=0}^{K}\sum_{J=0}^{L_I}\sum_{i=0}^{M_J}\frac{q_{ii}}{\sum_{j=0}^{M_J} q_{ij} + \sum_{j=0}^{M_J} q_{ji} - q_{ii}}$$
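All of the semantic segmentation metrics above can be derived from a single confusion matrix over the $K+1$ classes. The following minimal NumPy sketch (the function and variable names are ours, not from any cited work) computes OAcc, mAcc and mIoU from predicted and ground-truth label arrays.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """p[i, j] = number of units of ground-truth class i predicted as class j."""
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def semantic_metrics(gt, pred, num_classes):
    p = confusion_matrix(gt, pred, num_classes).astype(np.float64)
    tp = np.diag(p)                          # p_ii
    oacc = tp.sum() / p.sum()                # overall accuracy
    macc = np.nanmean(tp / p.sum(axis=1))    # mean class accuracy
    union = p.sum(axis=1) + p.sum(axis=0) - tp
    miou = np.nanmean(tp / union)            # mean intersection over union
    return oacc, macc, miou

# toy usage on per-point labels
gt = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
print(semantic_metrics(gt, pred, num_classes=3))
```

Classes that never appear in the ground truth produce empty rows and are ignored by the nan-aware averages, which mirrors how most benchmarks report per-class scores.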
3. 3D Semantic Segmentation
Many deep learning methods for 3D semantic segmentation have been proposed in the literature. These methods can be divided into five categories according to the data representation used, namely RGB-D image based, projected image based, voxel based, point based, and 3D video and other representation based methods. Point based methods can be further categorized, based on the network architecture, into Multi Layer Perceptron (MLP) based, point convolution based, graph convolution based and point Transformer based methods. Figure 4 shows the milestones of deep learning based 3D semantic segmentation in recent years.
3.1. RGB-D Based
The depth map in an RGB-D image contains geometric information about the real world which is useful to distinguish foreground objects from the background, hence providing opportunities to improve segmentation accuracy. In this category, a classical two-channel network is generally used to extract features from the RGB and depth images separately. However, this simple framework is not powerful enough to extract rich and refined features. To this end, researchers have integrated several additional modules into the simple two-channel framework to improve performance by learning the rich context and geometric information that is crucial for semantic segmentation. These modules can be roughly divided into six categories: multi-task learning, depth encoding, multi-scale networks, novel neural network architectures, data/feature/score level fusion, and post-processing (see Figure 5). RGB-D image based semantic segmentation methods are summarized in Table 2.
Multi-task learning: Depth estimation and semantic segmentation are two fundamental and challenging tasks in computer vision. The two tasks are also somewhat related, as depth variation within an object is small compared to depth variation between different objects. Hence, many researchers choose to unite the depth estimation and semantic segmentation tasks. Based on the relationship between the two tasks, there are two main types of multi-task learning framework: the cascade framework and the parallel framework.
In the cascade framework, the depth estimation task provides depth images for the semantic segmentation task. For example, Cao et al. Cao, Shen and Shen (2016) used the deep convolutional neural fields (DCNF) introduced by Liu et al. Liu, Shen, Lin and Reid (2015) for depth estimation. The estimated depth images and RGB images are fed into a two-channel FCN for semantic segmentation. Similarly, Guo et al. Guo and Chen (2018) adopted the deep network proposed by Ivanecky Ivaneckỳ (2016) to automatically generate depth images from single RGB images, and then proposed a two-channel FCN model on the pair of RGB image and predicted depth map for pixel labeling.
The cascade framework performs depth estimation and semantic segmentation separately and therefore cannot train the two tasks end-to-end at the same time. Consequently, the depth estimation task does not get any benefit from the semantic segmentation task. In contrast, the parallel framework performs the two tasks in a unified network, which allows them to benefit from each other. For instance, Wang et al. Wang, Shen, Lin, Cohen, Price and Yuille (2015) used a Joint Global CNN to exploit pixel-wise depth values and semantic labels from RGB images to provide accurate global scale and semantic guidance, together with a Joint Region CNN that extracts region-wise depth values and semantic maps from RGB to learn detailed depth and semantic boundaries. Mousavian et al. Mousavian, Pirsiavash and Košecká (2016) presented a multi-scale FCN comprising five streams that simultaneously explore depth and semantic features at different scales, where the two tasks share the underlying feature representation. Liu et al. Liu, Wang, Li, Fu, Li and Lu (2018b) proposed a collaborative deconvolutional neural network (C-DCNN) to jointly model the two tasks. However, the quality of depth maps estimated from RGB images is not as good as those acquired directly from depth sensors, and this multi-task learning pipeline has gradually been abandoned in RGB-D semantic segmentation.
Fig. 4: Milestones of deep learning based 3D semantic segmentation methods from 2017 to 2023, covering RGB-D based, voxel based and point based (MLP, convolution, graph and Transformer based) methods.
Depth Encoding: Conventional 2D CNNs are unable to exploit rich geometric features from raw depth images. An alternative is to encode raw depth images into other representations that are suitable for 2D CNNs. Höft et al. Höft, Schulz and Behnke (2014) used a simplified version of the histogram of oriented gradients (HOG) to represent the depth channel of RGB-D scenes. Gupta et al. Gupta, Girshick, Arbeláez and Malik (2014) and Lin et al. Lin, Chen, Cohen-Or, Heng and Huang (2017) calculated three new channels, namely horizontal disparity, height above ground and angle with gravity (HHA), from the raw depth images. Liu et al. Liu, Wu, Wang and Qian (2018a) point out a limitation of HHA, namely that some scenes may not contain enough horizontal and vertical planes, and propose a novel gravity direction detection method with fitted vertical lines to learn a better representation. Hazirbas et al. Hazirbas, Ma, Domokos and Cremers (2016) also argue that the HHA representation has a high computational cost and contains less information than the raw depth images. They propose an architecture called FuseNet that consists of two encoder-decoder branches, a depth branch and an RGB branch, which directly encodes depth information with a lower computational load.
Multi-scale Networks: The context information learned by multi-scale networks is useful for small objects and detailed region segmentation. Couprie et al. Couprie, Farabet, Najman and LeCun (2013) applied a multi-scale convolutional network to learn features directly from the RGB and depth images. Raj et al. Raj, Maturana and Scherer (2015) proposed a multi-scale deep ConvNet for segmentation, where the coarse predictions of a VGG16-FC net are upsampled in a Scale-2 module and then concatenated with the low-level predictions of a VGG-M net in a Scale-1 module to obtain both high and low level features. However, this method is sensitive to clutter in the scene, resulting in output errors. Lin et al. Lin et al. (2017) exploit the fact that lower scene-resolution regions have higher depth and higher scene-resolution regions have lower depth. They use depth maps to split the corresponding color images into multiple scene-resolution regions, and introduce a context-aware receptive field (CaRF) which focuses on semantic segmentation of certain scene-resolution regions, making their pipeline a multi-scale network.
Novel Neural Networks: Given the fixed grid computation of CNNs, their ability to process and exploit geometric information is limited. Therefore, researchers have proposed other novel neural network architectures to better exploit geometric features and the relationships between RGB and depth images. These architectures can be divided into five main categories.
Improved 2D Convolutional Neural Networks (2D CNNs): Inspired by cascaded feature networks Lin et al. (2017), Jiang et al. Jiang, Zhang, Huang and Zheng (2017) proposed a novel Dense-Sensitive Fully Convolutional Neural Network (DFCN) which incorporates depth information into the early layers of the network using feature fusion tactics, followed by several dilated convolutional layers for context information exploitation. Similarly, Wang et al. Wang and Neumann (2018) proposed a depth-aware 2D CNN by introducing two novel layers, a depth-aware convolution layer and a depth-aware pooling layer, which are based on the prior that pixels with the same semantic label and similar depth should have more impact on one another.
Deconvolutional Neural Networks (DeconvNets) are a simple yet effective and efficient solution for the refinement of segmentation maps. Liu et al. Liu et al. (2018b) and Wang et al. Wang, Wang, Tao, See and Wang (2016) both adopt DeconvNets for RGB-D semantic segmentation because of their good performance. However, the potential of DeconvNets is limited since the high-level prediction map aggregates large context for dense prediction. To this end, Cheng et al. Cheng, Cai, Li, Zhao and Huang (2017) proposed a locality-sensitive DeconvNet (LS-DeconvNet) to refine the boundary segmentation over depth and color images. LS-DeconvNet incorporates local visual and geometric cues from the raw RGB-D data into each DeconvNet, which is able to upsample the coarse convolutional maps with large context while recovering sharp object boundaries.
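To illustrate the depth-similarity prior behind the depth-aware convolution and pooling layers mentioned above, the sketch below is our own simplification (not the authors' code): each neighbour in a pooling window is weighted by exp(-alpha * |depth difference|), so pixels at a depth similar to the window centre contribute more. The window size and alpha are arbitrary assumptions.

```python
import numpy as np

def depth_aware_pool(feat, depth, win=3, alpha=5.0):
    """Average pooling where each neighbour is weighted by its depth
    similarity to the window centre (a simplified sketch)."""
    h, w = feat.shape
    r = win // 2
    out = np.zeros_like(feat)
    for y in range(r, h - r):
        for x in range(r, w - r):
            f = feat[y - r:y + r + 1, x - r:x + r + 1]
            d = depth[y - r:y + r + 1, x - r:x + r + 1]
            wgt = np.exp(-alpha * np.abs(d - depth[y, x]))  # depth similarity weights
            out[y, x] = (wgt * f).sum() / wgt.sum()
    return out

feat = np.random.rand(8, 8)
depth = np.random.rand(8, 8)
print(depth_aware_pool(feat, depth).shape)
```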
Fig. 5: Typical two-channel framework with six improvement modules, including (a) multi-tasks learning, (b) depth encoding, (c) multi-
scale network, (d) novel neural network architecture, (e) feature/score level fusion, and (f) post-processing.
Recurrent Neural Networks (RNNs) can capture long-range dependencies between pixels but are mainly suited to a single data channel (e.g. RGB). Fan et al. Fan, Mei, Prokhorov and Ling (2017) extended single-modal RNNs to multimodal RNNs (MM-RNNs) for RGB-D scene labeling. The MM-RNNs allow 'memory' sharing across the depth and color channels, so each channel not only possesses its own features but also has the attributes of the other channel, making the learned features more discriminative for semantic segmentation. Li et al. Li, Gan, Liang, Yu, Cheng and Lin (2016) proposed a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model to capture and fuse contextual information from multiple channels of RGB and depth images.
Graph Neural Networks (GNNs) were first used for RGB-D semantic segmentation by Qi et al. Qi, Liao, Jia, Fidler and Urtasun (2017c), who cast the 2D RGB pixels into 3D space based on the depth information and associated the 3D points with semantic information. Next, they built a k-nearest neighbor graph from the 3D points and applied a 3D graph neural network (3DGNN) to perform pixel-wise predictions.
Transformers have gained popularity in RGB image segmentation and have also been extended to RGB-D segmentation. One notable work by Ying et al. Ying and Chuah (2022) introduces the concept of Uncertainty-Aware Self-Attention, which explicitly manages the information flow from unreliable depth pixels to confident depth pixels during feature extraction, addressing the challenges posed by noisy or uncertain depth information. Another study by Wu et al. Wu, Zhou, Allibert, Stolz, Demonceaux and Ma (2022c) adopts the Swin-Transformer directly to exploit both the RGB and depth features; by leveraging the self-attention mechanism, this approach captures long-range dependencies and enables effective fusion of RGB and depth information. Inspired by the success of the Swin-Transformer, Yang et al. Yang, Xu, Zhang, Xu and Huang (2022) propose a hierarchical Swin-RGBD Transformer. This model incorporates depth information to complement and enhance the ambiguous and obscured features in RGB images, and its hierarchical architecture allows multi-scale feature learning and more effective integration of RGB and depth information.
Data/Feature/Score Fusion: Optimal fusion of the texture (RGB channels) and geometric (depth channel) information is important for accurate semantic segmentation. There are three fusion tactics, data level, feature level and score level, referring to early, middle and late fusion respectively. A simple data level fusion strategy is to concatenate the RGB and depth images into four channels for direct input to a CNN model, e.g. as performed by Couprie et al. Couprie et al. (2013). However, such data level fusion does not exploit the strong correlations between the depth and photometric channels. Feature level fusion, on the other hand, captures these correlations. For example, Li et al. Li et al. (2016) proposed a memorized fusion layer to adaptively fuse vertical depth and RGB contexts in a data-driven manner; their method performs bidirectional propagation along the horizontal direction to hold true 2D global contexts. Similarly, Wang et al. Wang et al. (2016) proposed a feature transformation network that correlates the depth and color channels and bridges the convolutional and deconvolutional networks in a single channel. The feature transformation network can discover specific features within a single channel as well as common features between the two channels, allowing the two branches to share features and improve the representation power of the shared information. These complex feature level fusion models are inserted at one specific layer between the RGB and depth channels, which is difficult to train and ignores fusion at the other corresponding layers. To this end, Hazirbas et al. Hazirbas et al. (2016) and Jiang et al. Jiang et al. (2017) carry out fusion as an element-wise summation of features at multiple corresponding layers of the two channels. Wu et al. Wu et al. (2022c) propose a novel Transformer-based fusion scheme, named TransD-Fusion, to better model long-range contextual information.
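The three fusion levels can be contrasted with a small sketch of our own (not taken from any of the cited works): data-level fusion stacks depth as an extra input channel, feature-level fusion merges intermediate feature maps of the two branches, and score-level fusion combines per-class predictions, optionally with weights that could also be learned.

```python
import numpy as np

def data_level_fusion(rgb, depth):
    """Early fusion: stack depth as a fourth input channel, giving (H, W, 4)."""
    return np.concatenate([rgb, depth[..., None]], axis=-1)

def feature_level_fusion(feat_rgb, feat_depth):
    """Middle fusion: element-wise summation of same-layer feature maps."""
    return feat_rgb + feat_depth

def score_level_fusion(score_rgb, score_depth, w_rgb=0.6, w_depth=0.4):
    """Late fusion: weighted sum of per-class score maps; the weights can
    also be produced by an extra learned layer (gated fusion)."""
    return w_rgb * score_rgb + w_depth * score_depth

rgb = np.random.rand(4, 4, 3)
depth = np.random.rand(4, 4)
print(data_level_fusion(rgb, depth).shape)   # (4, 4, 4)
```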
Table 2
Summary of RGB-D based methods with deep learning. Est.←depth estimation. Enc.←depth encoding. Mul.←multi-scale networks.
Nov.←novel neural networks. Fus.←data/feature/score fusion. Pos.←post-processing.
Methods Est. Enc. Mul. Nov. Fus. Pos. Architecture(2-stream) Contribution
Cao et al. (2016) ✓ ✓ × × ✓ × FCNs Estimating depth images+a unified network for two tasks
Guo and Chen (2018) ✓ × × × ✓ × FCNs Incorporating depth & gradient for depth estim.
Wang et al. (2015) ✓ × × × × ✓ Region./Global CNN HCRF for fusion and refining + two tasks by a network
Mousavian et al. (2016) ✓ × ✓ × ✓ ✓ FCN FC-CRF for refining + Mutual improvement for two tasks
Liu et al. (2018b) ✓ × × ✓ × ✓ S/D-DCNN PBL for two feature maps integration + FC-CRF
Höft et al. (2014) × ✓ × × × × CNNs An embedding for depth images
Gupta et al. (2014) × ✓ × × × × CNNs HHA for depth images
Liu et al. (2018a) × ✓ × × ✓ ✓ DCNNs New depth encoding+ FC-CRF for refining
Hazirbas et al. (2016) × ✓ × × ✓ × Encoder-decoder Semantic and depth feature fusion at each layer
Couprie et al. (2013) × × ✓ × ✓ × ConvNets RGB laplacian pyramid for multi-scale features
Raj et al. (2015) × ✓ ✓ × ✓ × VGG-M New multi-scale deep CNN
Lin et al. (2017) × × ✓ ✓ ✓ × CFN CaRF for multi-resolution features
Jiang et al. (2017) × × × ✓ ✓ ✓ RGB-FCN Semantic & depth feature fusion at each layer + DCRF
Wang and Neumann (2018) × × × ✓ × × Depth-aware CNN Depth-aware Conv. and depth aware average pooling
Cheng et al. (2017) × ✓ × ✓ ✓ × FCN + Deconv LS-DeconvNet + novel gated fusion
Fan et al. (2017) × × × ✓ ✓ × MM-RNNs Multimodal RNN
Li et al. (2016) × ✓ × ✓ ✓ × LSTM-CF LSTM-CF for capturing and fusing contextual inf.
Qi et al. (2017c) × × × ✓ × × 3DGNN GNN for RGB-D semantic segmentation
Wang et al. (2016) × × × ✓ ✓ × ConvNet-DeconvNet MK-MMD for assessing the similarity between common features
Ying and Chuah (2022) × × × ✓ ✓ × Swin-Transformer Effective and scalable fusion module based on cross-attention
Wu et al. (2022c) × × × ✓ ✓ × Swin-Transformers Transformer-based fusion module
Yang et al. (2022) × × × ✓ × × Swin-Transformer+ResNet Swin-RGB-D Transformer
Score level fusion is commonly performed using a simple averaging strategy. However, the contributions of the RGB model and the depth model to semantic segmentation are different. Liu et al. Liu et al. (2018a) proposed a score level fusion layer with weighted summation that uses a convolution layer to learn the weights of the two channels. Similarly, Cheng et al. Cheng et al. (2017) proposed a gated fusion layer to learn the varying performance of the RGB and depth channels for recognizing different classes in different scenes. Both techniques improved the results over the simple averaging strategy at the cost of additional learnable parameters.
Post-Processing: The results of the CNNs or DCNNs used for RGB-D semantic segmentation are generally very coarse, resulting in rough boundaries and the vanishing of small objects. A common method to address this problem is to couple the CNN with a Conditional Random Field (CRF). Wang et al. Wang et al. (2015) further boost the mutual interactions between the two channels by the joint inference of a Hierarchical CRF (HCRF). It enforces synergy between global and local predictions, where the global layouts guide the local predictions and reduce local ambiguities, while the local results provide detailed regional structures and boundaries. Mousavian et al. Mousavian et al. (2016), Liu et al. Liu et al. (2018b) and Liu et al. Liu et al. (2018a) adopt a Fully Connected CRF (FC-CRF) for post-processing, where the pixel-wise label prediction jointly considers geometric constraints, such as pixel-wise normal information, pixel position, intensity and depth, to promote the consistency of pixel-wise labeling. Similarly, Jiang et al. Jiang et al. (2017) proposed a Dense-sensitive CRF (DCRF) that integrates the depth information with an FC-CRF.
3.2. Projected Images Based Segmentation
The core idea of projected image based semantic segmentation is to use 2D CNNs to exploit features from projected images of 3D scenes or shapes and then fuse these features for label prediction. This pipeline not only exploits more semantic information from large-scale scenes compared to a single-view image, but also reduces the data size of a 3D scene compared to a point cloud. The projected images mainly include multi-view images and spherical images. Multi-view image projection is usually employed on RGB-D datasets Dai et al. (2017) and static terrestrial scanning datasets Hackel et al. (2017), whereas spherical image projection is usually employed on self-driving mobile laser scanning datasets Behley et al. (2019). Projected image based semantic segmentation methods are summarized in Table 3.
3.2.1. Multi-View Images Based Segmentation
MVCNN Su, Maji, Kalogerakis and Learned-Miller (2015) uses a unified network to combine features from multiple views of a 3D shape, formed by a virtual camera, into a single and compact shape descriptor to obtain improved classification performance. This inspired researchers to bring the same idea into 3D semantic segmentation (see Figure 6). For example, Lawin et al. Lawin, Danelljan, Tosteberg, Bhat, Khan and Felsberg (2017) project point clouds into multi-view synthetic images, including RGB, depth and surface normal images. The prediction scores of all multi-view images are fused into a single representation and back-projected onto each point. However, the snapshot can erroneously capture points behind the observed structure if the density of the point cloud is low, which makes the deep network misinterpret the multiple views. To this end,
SnapNet Boulch, Le Saux and Audebert (2017), Boulch, Guerry, Le Saux and Audebert (2018) preprocesses the point clouds to compute point features (like normals or local noise) and to generate a mesh, which is similar to point cloud densification. From the mesh and point clouds, they generate RGB and depth images by suitable snapshots. They then perform pixel-wise labeling of the 2D snapshots using an FCN and fast back-project these labels onto the 3D points by efficient buffering. The above methods need to obtain the whole point cloud of a 3D scene in advance to provide a complete spatial structure for back-projection, whereas multi-view images directly obtained from a real-world scene would lose much spatial information. Some works therefore attempt to unite 3D scene reconstruction with semantic segmentation, where scene reconstruction can make up for the missing spatial information. For example, Guerry et al. Guerry, Boulch, Le Saux, Moras, Plyer and Filliat (2017) reconstruct the 3D scene with global multi-view RGB and gray stereo images, and the labels of the 2D snapshots are then back-projected onto the reconstructed scene. However, simple back-projection cannot optimally fuse semantic and spatial geometric features. Along this line, Pham et al. Pham, Hua, Nguyen and Yeung (2019a) proposed a novel higher-order CRF, applied after back-projection, to further refine the initial segmentation.
Fig. 6: Illustration of basic frameworks for projected image based segmentation methods. Top: multi-view image based framework. Bottom: spherical image based framework.
3.2.2. Spherical Images Based Segmentation
Selecting snapshots from a 3D scene is not straightforward. Snapshots must be taken after giving due consideration to the number of viewpoints, the viewing distance and the angles of the virtual cameras to get an optimal representation of the complete scene. To avoid these complexities, researchers project the complete point cloud onto a sphere (see Figure 6, bottom). For example, Wu et al. Wu, Wan, Yue and Keutzer (2018a) proposed an end-to-end pipeline called SqueezeSeg, inspired by SqueezeNet Iandola, Han, Moskewicz, Ashraf, Dally and Keutzer (2016), to learn features from spherical images which are then refined by a CRF implemented as a recurrent layer. Similarly, PointSeg Wang, Shi, Yun, Tai and Liu (2018e) extends SqueezeNet by integrating feature-wise and channel-wise attention to learn robust representations. SqueezeSegV2 Wu, Zhou, Zhao, Yue and Keutzer (2019a) improves the structure of SqueezeSeg with a Context Aggregation Module (CAM), adding the LiDAR mask as a channel to increase robustness to noise. RangeNet++ Milioto, Vizzo, Behley and Stachniss (2019) transfers the semantic labels back to the 3D point cloud, avoiding discarding points regardless of the level of discretization used in the CNN. Despite the likeness between regular RGB and LiDAR images, the feature distribution of LiDAR images changes at different locations. SqueezeSegV3 Xu, Wu, Wang, Zhan, Vajda, Keutzer and Tomizuka (2020) therefore uses a spatially-adaptive and context-aware convolution, termed Spatially-Adaptive Convolution (SAC), to adopt different filters for different locations. Inspired by the success of 2D vision Transformers, RangeViT Ando, Gidaris, Bursuc, Puy, Boulch and Marlet (2023) leverages ViTs pre-trained on large natural image datasets by adding down- and up-sampling modules at the top and bottom of the ViT, and achieves good performance compared to other projection based methods. Similarly, to make the long projection image suit the ViT, RangeFormer Kong, Liu, Chen, Ma, Zhu, Li, Hou, Qiao and Liu (2023) adopts a scalable training strategy that splits the whole projection image into several sub-images and feeds them to the ViT for training. After training, the predictions are merged sequentially to form the complete scene.
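A minimal sketch of the spherical projection used by SqueezeSeg-style methods follows; it is our own simplification, and the image size and field-of-view limits are assumed values rather than any particular sensor's specification. Each 3D LiDAR point is mapped to a pixel of a range image via its azimuth and elevation angles.

```python
import numpy as np

def spherical_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an H x W range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                      # elevation
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W             # column index
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H  # row index
    u = np.clip(np.floor(u), 0, W - 1).astype(int)
    v = np.clip(np.floor(v), 0, H - 1).astype(int)
    img = np.zeros((H, W), dtype=np.float32)
    img[v, u] = r                                 # store range; x, y, z, intensity can be added as channels
    return img

pts = np.random.randn(1000, 3) * 10
print(spherical_projection(pts).shape)            # (64, 2048)
```

Per-pixel predictions on this image are then back-projected to the points via the same (v, u) indices, which is the step RangeNet++-style post-processing refines.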
3.3. Voxel Based Segmentation
Similar to pixels, voxels divide the 3D space into many volumetric grids with a specific size and discrete coordinates, and they contain more geometric information about the scene than projected images. 3D ShapeNets Wu, Song, Khosla, Yu, Zhang, Tang and Xiao (2015) and VoxNet Maturana and Scherer (2015) take a volumetric occupancy grid representation as input to a 3D convolutional neural network for object recognition, which guided voxel based 3D semantic segmentation. Voxel based semantic segmentation methods are summarized in Table 3.
The 3D CNN is a common architecture used to process uniform voxels for label prediction. Huang et al. Huang and You (2016) presented a 3D FCN for coarse voxel-level predictions. Their method is limited by spatial inconsistency between predictions and provides a coarse labeling. Tchapmi et al. Tchapmi et al. (2017) introduce a novel network, SEGCloud, to produce fine-grained predictions. It upsamples the coarse voxel-wise predictions obtained from a 3D FCN to the original 3D point space resolution by trilinear interpolation.
With fixed resolution voxels, the computational complexity grows linearly with the scale of the scene. Large voxels can lower the computational cost of large-scale scene parsing. Liu et al. Liu et al. (2017) introduced a novel network called 3D CNN-DQN-RNN. Like the sliding window in 2D semantic segmentation, this network proposes an eye window that traverses the whole data for fast localization and segmentation of class objects under the control of the 3D CNN and a deep Q-Network (DQN). The 3D CNN and a residual RNN further refine the features in the eye window. The pipeline efficiently learns key features of interesting regions to enhance the accuracy of large-scale scene parsing with less computational cost. Rethage et al. Rethage et al. (2018) present a novel fully convolutional point network (FCPN), sensitive to multi-scale input, to parse large-scale scenes without pre- or post-processing steps.
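A minimal occupancy-grid voxelization sketch is given below (our own; the voxel size and scene extent are assumptions). It shows how a point cloud is discretized before a 3D CNN is applied, and why most cells end up empty, which is exactly the sparsity that octree and submanifold sparse convolution methods exploit.

```python
import numpy as np

def voxelize(points, voxel_size=0.1):
    """Quantize an (N, 3) point cloud into a Boolean occupancy grid."""
    mins = points.min(axis=0)
    idx = np.floor((points - mins) / voxel_size).astype(int)
    dims = idx.max(axis=0) + 1
    grid = np.zeros(dims, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

pts = np.random.rand(5000, 3) * 4.0               # a toy 4 m x 4 m x 4 m scene
grid = voxelize(pts, voxel_size=0.05)
print(grid.shape, round(float(grid.mean()), 4))    # the occupancy ratio is typically very low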
Table 3
Summary of projected images/voxel/other representation based methods with deep learning. M←multi-view image. S←spherical image.
V←voxel. T←tangent images. L←lattice. P←point clouds.
Type Methods Input Architecture Feature extractor Contribution
Lawin et al. Lawin et al. (2017) M multi-stream VGG-16 Investigate the impact of different input modalities
Boulch et al. Boulch et al. (2017) Boulch et al. (2018) M SegNet/U-Net VGG-16 New and efficient framework SnapNet
Guerry et al. Guerry et al. (2017) M SegNet/U-Net VGG-16 Improved MVCNN+3D consistent data augment.
Pham et al. Pham et al. (2019a) M Two-stream 2DConv High-order CRF+ real-time reconstruction pipeline
Wu et al. Wu et al. (2018a) S AlexNet Firemodules End-to-end pipeline SqueezeSeg + real time
projection
Wang et al. Wang et al. (2018e) S AlexNet Firemodules Quite light-weight framework PointSeg + real time
Wu et al. Wu et al. (2019a) S AlexNet Firemodules Robust framework SqueezeSegV2
Milioto et al. Milioto et al. (2019) S DarkNet Residual block GPU-accelerated post-processing + RangNet++
Xu et al. Xu et al. (2020) S RangeNet SAC Adopting different filters for different locations
Ando et al. Ando et al. (2023) S U-Net ViTs Decreasing the gaps between image and point domain.
Kong et al. Kong et al. (2023) S U-Net ViTs Introducing a scalable training from range view strategy
Huang et al. Huang and You (2016) V 3D CNN 3DConv Efficiently handling large data
Tchapmi et al. Tchapmi, Choy, Armeni, Gwak and Savarese (2017) V 3D FCNN 3DConv Combining 3D FCNN with fine-grained representation
Meng et al. Meng, Gao, Lai and Manocha (2019) V VAE RBF A novel voxel-based representation + RBF
Liu et al. Liu, Li, Zhang, Zhou, Ye, Wang and Lu (2017) V 3D CNN/DQN/RNN 3DConv Integrating three vision tasks into one frame.
Rethage et al. Rethage, Wald, Sturm, Navab and Tombari (2018)
voxel
3DMV Dai and Nießner (2018) M+V Cascade frame. ENet+3DConv Inferring 3D semantics from both 3D and 2D input
Hung et al. Chiang, Lin, Liu and Hsu (2019) V+M+P Parallel frame. SSCNet/DeepLab/PN Leveraging 2D and 3D features
PVCNN Liu, Tang, Lin and Han (2019b) V+P PointNet PVConv Both memory and computation efficient
MVPNet Jaritz, Gu and Su (2019) M+P Cascade frame. U-Net+PointNet++ Leveraging 2D and 3D features
LaserNet++ Meyer, Charland, Hegde, Laddha and Vallespi-Gonzalez (2019) M+P Cascade frame. ResNet+LNet Unified network for two tasks
BPNet Hu, Zhao, Jiang, Jia and Wong (2021) M+P Cascade frame. 2/3DUNet Bidirection projection module
In particular, FCPN is able to learn memory-efficient representations that scale well to larger volumes. Similarly, Dai et al. Dai et al. (2018) design a novel 3D CNN that trains on scene subvolumes but deploys on arbitrarily large scenes at test time, as it is able to handle large scenes with varying spatial extent. Additionally, their network adopts a coarse-to-fine tactic to predict scenes at multiple resolutions to handle the growth in data size as the scene increases in size. Traditionally, the voxel representation only comprises Boolean occupancy information, which loses much geometric information. Meng et al. Meng et al. (2019) develop a novel information-rich voxel representation by using a variational auto-encoder (VAE) with radial basis functions (RBF) to capture the distribution of points within each voxel. Further, they proposed a group equivariant convolution to exploit features.
In fixed scale scenes, the computational complexity grows cubically as the voxel resolution increases. However, the volumetric representation is naturally sparse, resulting in unnecessary computations when applying 3D dense convolution to the sparse data. To alleviate this problem, OctNet Riegler et al. (2017) divides the space hierarchically into non-uniform voxels using a series of unbalanced octrees. The tree structure allows memory allocation and computation to focus on the relevant dense voxels without sacrificing resolution. However, empty space still imposes a computational and memory burden in OctNet. In contrast, Graham et al. Graham et al. (2018) proposed a novel submanifold sparse convolution (SSC) that does not perform computations in empty regions, making up for this drawback of OctNet.
3.4. Point Based Segmentation
Point clouds are scattered irregularly in 3D space, lacking any canonical order and translation invariance, which restricts the use of conventional 2D/3D convolutional neural networks. Recently, a series of point based semantic segmentation networks have been proposed. These methods can be roughly subdivided into four categories: MLP based, point convolution based, graph convolution based and Transformer based. These methods are summarized in Table 4.
3.4.1. MLP Based
These methods apply a Multi Layer Perceptron directly to the points to learn features. PointNet Qi, Su, Mo and Guibas (2017a) is a pioneering work that directly processes point clouds. It uses a shared MLP to exploit point-wise features and adopts a symmetric function such as max-pooling to collect these features into a global feature representation.
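The core of this pipeline can be written in a few lines: a shared MLP applied independently to every point, followed by a symmetric max-pooling that yields an order-invariant global feature. The sketch below uses untrained random weights and NumPy only; it is illustrative and not the authors' implementation, and the layer widths are assumptions.

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same per-point MLP (with ReLU) to every point: (N, d_in) -> (N, d_out)."""
    feat = points
    for w, b in zip(weights, biases):
        feat = np.maximum(feat @ w + b, 0.0)
    return feat

def pointnet_global_feature(points, dims=(64, 128, 1024)):
    rng = np.random.default_rng(0)
    weights, biases, d_in = [], [], points.shape[1]
    for d_out in dims:                       # untrained weights, for illustration only
        weights.append(rng.standard_normal((d_in, d_out)) * 0.1)
        biases.append(np.zeros(d_out))
        d_in = d_out
    feat = shared_mlp(points, weights, biases)
    return feat.max(axis=0)                  # symmetric function: max over points

pts = np.random.rand(2048, 3)
print(pointnet_global_feature(pts).shape)    # (1024,)
```

Because the aggregation is a permutation-invariant max, shuffling the input points leaves the global feature unchanged, which is what makes the design suitable for unordered point clouds.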
Because the max-pooling layer only captures the maximum activation across the global points, PointNet cannot learn to exploit local features. Building on PointNet, PointNet++ Qi, Yi, Su and Guibas (2017b) defines a hierarchical learning architecture. It hierarchically samples points using farthest point sampling (FPS) and groups local regions using k-nearest-neighbor search as well as ball search. Progressively, a simplified PointNet exploits features in local regions at multiple scales or multiple resolutions. Similarly, Engelmann et al. Engelmann, Kontogianni, Schult and Leibe (2018) define local regions by KNN clustering and K-means clustering and use a simplified PointNet to extract local features.
To learn short- and long-range dependencies, some works introduce Recurrent Neural Networks (RNNs) into MLP based methods. For example, ESC Engelmann, Kontogianni, Hermans and Leibe (2017) divides the global points into multi-scale/grid blocks. The concatenated (local) block features are appended to the point-wise features and passed through Recurrent Consolidation Units (RCUs) to further learn global context features. Similarly, HRNN Ye, Li, Huang, Du and Zhang (2018) uses Pointwise Pyramid Pooling (3P) to extract local features on multi-size local regions. Point-wise features and local features are concatenated, and a two-direction hierarchical RNN explores context features on these concatenated features. However, the learned local features are not sufficient because the deeper layer features do not cover a larger spatial extent.
As another line of work, some methods integrate hand-crafted point representations into the PointNet or PointNet++ network to enhance the point representation ability with fewer learnable network parameters. Inspired by the SIFT representation Lowe (2004), PointSIFT Jiang, Wu, Zhao, Zhao and Lu (2018) inserts a PointSIFT module layer to learn local shape information. This module transforms each point into a new shape representation by encoding information from different orientations. PointWeb Zhao, Jiang, Fu and Jia (2019a) proposes an adaptive feature adjustment (AFA) module to learn the interactive information between local points to enhance the point representation. Similarly, RepSurf Ran, Liu and Wang (2022) introduces two novel point representations, namely triangular and umbrella representative surfaces, to establish connections and enhance the representation capability of the learned point-wise features; it effectively improves feature representation with fewer learnable network parameters and has drawn significant attention from the research community. In contrast to the aforementioned methods, PointNeXt Qian, Li, Peng, Mai, Hammoud, Elhoseiny and Ghanem (2022) revisits the classical PointNet++ architecture through a systematic study of model training and scaling strategies. It proposes a set of improved training strategies that lead to a significant performance boost for PointNet++, and introduces an inverted residual bottleneck design and separable MLPs to enable efficient and effective model scaling.
3.4.2. Point Convolution Based
Point convolution based methods perform convolution operations directly on the points. Different from 2D convolution, the weight function of a point convolution needs to be learned adaptively from the point geometric information. Early convolution networks focus on the design of the convolution weight function. For example, RSNet Huang, Wang and Neumann (2018) exploits point-wise features using 1×1 convolutions and then passes them through a local dependency module (LDM) to exploit local context features; however, it does not define a neighborhood for each point in order to learn local features. On the other hand, PointwiseCNN Hua, Tran and Yeung (2018) sorts points in a specific order, e.g. by XYZ coordinates or the Morton curve Morton (1966), queries nearest neighbors dynamically and bins them into 3×3×3 kernel cells before convolving with the same kernel weights.
Gradually, some point convolution works approximated the convolution weight function with an MLP that learns weights from point coordinates. PCCN Wang, Suo, Ma, Pokrovsky and Urtasun (2018c) performs a parametric CNN, where the kernel is estimated by an MLP, on a KD-tree neighborhood to learn local features. PointCNN Li, Bu, Sun, Wu, Di and Chen (2018b) coarsens the input points with farthest point sampling. Its convolution layer learns a 𝜒-transformation from local points by an MLP to simultaneously weight and permute the features, and subsequently applies a standard convolution on these transformed features.
Some works associate a coefficient (derived from point coordinates) with the weight function to adjust the learned convolutional weights. An extension of Monte Carlo approximation for convolution called PointConv Wu, Qi and Fuxin (2019b) takes the point density into account. It uses an MLP to approximate the weight function of the convolution kernel, and applies an inverse density scale to reweight the learned weight function. Similarly, MCC Hermosilla, Ritschel, Vázquez, Vinacua and Ropinski (2018) phrases convolution as a Monte Carlo integration problem by relying on the point probability density function (PDF), where the convolution kernel is also represented by an MLP. Moreover, it introduces Poisson Disk Sampling (PDS) Wei (2008) instead of FPS to construct the point hierarchy, which provides an opportunity to get the maximal number of samples in a receptive field.
Another line of work uses functions other than MLPs to approximate the convolution weight function. Flex-Convolution Groh, Wieschollek and Lensch (2018) uses a linear function with fewer parameters to model a convolution kernel and adopts inverse density importance sub-sampling (IDISS) to coarsen the points. KPConv Thomas, Qi, Deschaud, Marcotegui, Goulette and Guibas (2019) and KCNet Shen, Feng, Yang and Tian (2018) fix the convolution kernel for robustness to varying point density. These networks predefine kernel points in a local region and learn convolutional weights on the kernel points from their geometric connections to the local points using linear and Gaussian correlation functions, respectively. Here, the number and position of the kernel points need to be optimized for different datasets.
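To make the kernel-point idea concrete, the following sketch (our own, with assumed kernel positions, sizes and a linear correlation in the spirit of KPConv, not a reproduction of it) weights each neighbour's features by its proximity to a set of predefined kernel points before applying the per-kernel-point weight matrices.

```python
import numpy as np

def kernel_point_conv(center, neighbors, feats, kernel_pts, weights, sigma=0.3):
    """neighbors: (M, 3), feats: (M, Cin), kernel_pts: (K, 3), weights: (K, Cin, Cout)."""
    rel = neighbors - center                                            # neighbour offsets
    dist = np.linalg.norm(rel[:, None, :] - kernel_pts[None], axis=-1)  # (M, K) distances
    corr = np.maximum(0.0, 1.0 - dist / sigma)                          # linear correlation per kernel point
    out = np.zeros(weights.shape[-1])
    for k in range(kernel_pts.shape[0]):
        out += (corr[:, k:k + 1] * feats).sum(axis=0) @ weights[k]
    return out

rng = np.random.default_rng(1)
nbrs = rng.normal(scale=0.2, size=(16, 3))
feats = rng.normal(size=(16, 8))
kpts = rng.normal(scale=0.2, size=(5, 3))
w = rng.normal(size=(5, 8, 16)) * 0.1
print(kernel_point_conv(np.zeros(3), nbrs, feats, kpts, w).shape)       # (16,)
```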
sian correlation functions, respectively. Here, the number
and position of kernel points need be optimized for different network adopts PointNet to embed these points and refine
datasets. the embedding by Gated Recurrent Unit (GRU). Based on
Point convolution on limited local receptive field could the basic architecture of PoinNet++, Li et al. Li, Ma,
not exploit long-range features. Therefore, some works in- Zhong, Cao and Li (2019b) proposed Geometric Graph
troduce the dilated mechanism into point convolution. Di- Convolution (TGCov), its filters defined as products of local
lated point convolution(DPC) Engelmann, Kontogianni and point-wise features with local geometric connection features
Leibe (2020b) adapts standard point convolution on neigh- expressed by Gaussian weighted Taylor kernels. Feng et
borhood points of each point where the neighborhood points al. Feng, Zhang, Lin, Gilani and Mian (2020) constructed
are determined though a dilated KNN search. similarly, A- a local graph on neighborhood points searched along multi-
CNN Komarichev, Zhong and Hua (2019) defines a new directions and explore local features by a local attention-
local ring-shaped region by dilated KNN, and projects points edge convolution (LAE-Conv). These features are imported
on a tangent plane to further order neighbor points in lo- into a point-wise spatial attention module to capture accurate
cal regions. Then, the standard point convolutions are per- and robust local geometric details. Lei et al. designs a
formed on these ordered neighbors represented as a closed fuzzy coefficient to times weight function, to enable the
loop array. convolution weights robust.
In the large-scale point clouds semantic segmentation Continuous graph convolution also incurs a high compu-
area, RandLA-Net Hu, Yang, Xie, Rosa, Guo, Wang, tational cost and generally suffer from the vanishing gradient
Trigoni and Markham (2020) uses random point sampling problem. Inspired by the separable convolution strategy in
instead of the more complex point selection approach. Xception Chollet (2017) that significantly reduces parame-
It introduces a novel local feature aggregation module ters and computation burden, HDGCN Liang, Yang, Deng,
(LFAM) to progressively increase the receptive field and Wang and Wang (2019a) designed a DGConv that composes
effectively preserve geometric details. Another technology, depth-wise graph convolution followed by a point-wise con-
PolarNet Zhang, Zhou, David, Yue, Xi, Gong and Foroosh volution, and add DGConv into the hierarchical structure to
(2020) first partitions a large point cloud into smaller grids extract local and global features. DeepGCNs Li, Muller,
(local regions) along their polar bird’s-eye-view (BEV) Thabet and Ghanem (2019a) borrows some concepts from
coordinates. It then abstracts local region points into a fixed- 2D CNN such as residual connections between different
length representation by a simplified PointNet and these layers (ResNet) to alleviate the vanishing gradient problem,
representations are passed through a standard convolution. and dilation mechanism to allow the GCN to go deeper.
Lei et al. Lei, Akhtar and Mian (2020) propose a discrete
3.4.3. Graph Convolution Based spherical convolution kernel (SPH3D kernel) that consists of
The graph convolution based methods perform convo- the spherical convolution learning depth-wise features and
lution on points connected with a graph structure, where point-wise convolution learning point-wise features.
the graph help the feature aggregation exploit the structure Tree structures such as KD-tree and Octree can be
information between points. the graphs can be divided into viewed as a special type of graph, allowing to share con-
spectral graph and spatial graph. In the spectral graph, LS- volution layers depending on the tree splitting orientation.
GCN Wang, Samari and Siddiqi (2018a) adopts the basic 3DContextNet Zeng and Gevers (2018) adopts a KD-
architecture of PointNet++, replaces MLPs with a spectral tree structure to hierarchically represent points where the
graph convolution using standard unparametrized Fourier nodes of different tree layers represent local regions at
kernels, as well as a novel recursive spectral cluster pooling different scales, and employs a simplified PointNet with a
substitute for max-pooling. However, transformation from gating function on nodes to explore local features. However,
spatial to spectral domain incurs a high computational cost. their performance depends heavily on the randomization
Besides that, spectral graph networks are usually defined on of the tree construction. Lei et al. Lei, Akhtar and Mian
a fixed graph structure and are thus unable to directly process (2019) built an Octree based hierarchical structure on global
data with varying graph structures. points to guide the spherical convolution computation in
In the spatial graph category, ECC Simonovsky and per layer of the network. The spherical convolution kernel
Komodakis (2017) is among of the pioneer methods to systematically partitions a 3D spherical region into multiple
apply spatial graph network to extract features from point bins that specifies learnable parameters to weight the points
clouds. It dynamically generates edge-conditioned filters to falling within the corresponding bin.
learn edge features that describe the relationships between
a point and its neighbors. Based on PointNet architecture, 3.4.4. Transformer Based
DGCNN Wang, Sun, Liu, Sarma, Bronstein and Solomon Attention mechanism has recently become popular for
(2019b) implements dynamic edge convolution called Edge- improving point cloud segmentation accuracy. Compared to
Conv on the neighborhood of each point. The convolution point convolution, Transformer introduces the point features
is approximated by a simplified PointNet. SPG Landrieu into the weight learning. For example, Ma et al. Ma, Guo,
and Simonovsky (2018) parts the point clouds into a num- Liu, Lei and Wen (2020) use the channel self-attention
ber of simple geometrical shapes (termed super-points) and mechanism to learn independence between any two point-
builts super graph on global super-points. Furthermore, this wise feature channels, and further define a Channel Graph
where the channel maps are represented as nodes and the interdependencies are represented as graph edges. AGCN Xie, Chen and Peng (2020b) integrates the attention mechanism with a GCN to analyze the relationships between local features of points and introduces a global point graph to compensate for the relative information of individual points. PointASNL Yan, Zheng, Li, Wang and Cui (2020) uses the general self-attention mechanism for group feature updating, and proposes an adaptive sampling (AS) module to overcome the issues of FPS.

The Transformer model, which employs self-attention as a fundamental component, includes position encoding to capture the sequential order of input tokens. Position encoding is crucial to ensure that the model understands the relative positions of tokens within a sequence. Point Transformer Zhao, Jiang, Jia, Torr and Koltun (2021) introduces MLP-based position encoding into vector attention, and uses a KNN-based downsampling module to decrease the point resolution. The follow-up work Point Transformer v2 Wu, Lao, Jiang, Liu and Zhao (2022a) strengthens the position encoding mechanism by applying an additional encoding multiplier to the relation vector, and designs a partition-based pooling strategy to align the geometric information.

Point Transformers are generally computationally expensive because the original self-attention module needs to generate a huge attention map. To address this problem, PatchFormer Zhang, Wan, Shen and Wu (2022) calculates the attention map via a low-rank approximation. Similarly, FastPointTransformer Park, Jeong, Cho and Park (2022) introduces a lightweight local self-attention module that learns continuous positional information while reducing the space complexity. Inspired by the success of window-based Transformers in the 2D domain, Stratified Transformer Lai, Liu, Jiang, Wang, Zhao, Liu, Qi and Jia (2022) designs a cubic window and samples distant points as keys, but in a sparser way, to expand the receptive field. Similarly, SphereFormer Lai, Chen, Lu, Liu and Jia (2023) designs radial window self-attention that partitions the space into several non-overlapping narrow and long windows for exploiting long-range dependencies.

Lattice based methods explore global context features by embedding point features into a sparse lattice that allows the application of standard 2D convolutions.

Although the above methods have achieved significant progress in 3D semantic segmentation, each has its own drawbacks. For instance, multi-view images carry more spectral information, such as color/intensity, but less geometric information of the scene. On the other hand, voxels carry more geometric information but less spectral information. To get the best of both worlds, some methods adopt hybrid representations as input to learn comprehensive features of a scene. Dai et al. Dai and Nießner (2018) map 2D semantic features obtained by multi-view networks into 3D grids of the scene. This pipeline attaches rich 2D semantic as well as 3D geometric information to the 3D grids so that the scene can be better segmented by a 3D CNN. Similarly, Chiang et al. Chiang et al. (2019) back-project 2D multi-view image features onto the 3D point cloud space and use a unified network to extract local details and global context from sub-volumes and the global scene, respectively. Liu et al. Liu et al. (2019b) argue that voxel-based and point-based networks are computationally inefficient at high resolution and in data structuring, respectively. To overcome these challenges, they propose Point-Voxel CNN (PVCNN), which represents the 3D input data as point clouds to take advantage of the sparsity and lower the memory footprint, and leverages voxel-based convolution to obtain a contiguous memory access pattern. Jaritz et al. Jaritz et al. (2019) proposed MVPNet, which collects 2D multi-view dense image features into 3D sparse point clouds and then uses a unified network to fuse the semantic and geometric features. Also, Meyer et al. Meyer et al. (2019) fuse 2D images and point clouds to address 3D object detection and semantic segmentation with a unifying network. BPNet Hu et al. (2021) consists of 2D and 3D sub-networks with symmetric architectures, connected through a bidirectional projection module (BPM). This allows the interaction of complementary information from both visual domains at multiple architectural levels, leading to improved scene recognition by leveraging the advantages of both 2D and 3D information. The other representation based semantic segmentation methods are summarized in Table 3.
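To make the 2D-3D fusion step of these hybrid pipelines concrete, the following is a minimal sketch of back-projecting a single view's 2D feature map onto a point cloud under a pinhole camera model. It is an illustrative example only, not the implementation of any specific method above; the function name and tensor conventions are our own assumptions.

```python
import torch

def backproject_image_features(points, feats_2d, K, T_world_to_cam):
    """Gather a per-point 2D feature by projecting 3D points into one view.

    points:          (N, 3) world-space xyz coordinates
    feats_2d:        (C, H, W) 2D feature map of that view
    K:               (3, 3) camera intrinsics
    T_world_to_cam:  (4, 4) extrinsics (world -> camera)
    Returns an (N, C) tensor; points behind the camera or projecting
    outside the image receive zero features.
    """
    N = points.shape[0]
    C, H, W = feats_2d.shape
    # Transform the points into the camera frame.
    homo = torch.cat([points, torch.ones(N, 1, dtype=points.dtype)], dim=1)  # (N, 4)
    cam = (T_world_to_cam @ homo.T).T[:, :3]                                 # (N, 3)
    valid = cam[:, 2] > 1e-6                                                 # in front of the camera
    # Pinhole projection to pixel coordinates (nearest-neighbour lookup).
    uvw = (K @ cam.T).T                                                      # (N, 3)
    u = (uvw[:, 0] / uvw[:, 2].clamp(min=1e-6)).round().long()
    v = (uvw[:, 1] / uvw[:, 2].clamp(min=1e-6)).round().long()
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = torch.zeros(N, C, dtype=feats_2d.dtype)
    out[valid] = feats_2d[:, v[valid], u[valid]].T
    return out
```

Features gathered this way from several views can then be averaged or concatenated with per-point geometric features before being fed to a 3D network, which is the general pattern shared by the multi-view fusion methods discussed above.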
Table 4
Summary of point based semantic segmentation methods with deep learning.
Type Methods Neighb. search Feature abstraction Coarsening Contribution
PointNet Qi et al. (2017a) None MLP None Pioneering processing points directly
G+RCU Engelmann et al. (2018) None MLP None Two local definition+local/global pathway
ESC Engelmann et al. (2017) None MLP None MC/Grid Block for local defini.+RCUs for context exploit.
HRNN Ye et al. (2018) None MLP None 3P for local feature exploit. + HRNN for local context exploit.
MLP
PointNet++ Qi et al. (2017b) Ball/KNN PointNet FPS Proposing hierarchical learning framework
PointSIFT Jiang et al. (2018) KNN PointNet FPS PointSIFT module for local shape information
PointWeb Zhao et al. (2019a) KNN PointNet FPS AFA for interactive feature exploitation
RepSurf Ran et al. (2022) KNN PointNet FPS Local triangular orientation + local umbrella orientation
PointNeXt Qian et al. (2022) KNN InvResMLP FPS Improved training strategies and model scaling for PointNet++
RSNet Huang et al. (2018) None 1x1 Conv None LDM for local context exploitation
DPC Engelmann et al. (2020b) DKNN PointConv None Dilated KNN for expanding the receptive field
PointWiseCNN Hua et al. (2018) Grid PWConv. None Novel point convolution
PCCN Wang et al. (2018c) KD index PCConv. None KD-tree index for neigh. search+novel point Conv.
Point Convolution
KPConv Thomas et al. (2019) Ball KPConv. Grid sampling Novel point convolution
FlexConv Groh et al. (2018) KD index flexConv. IDISS Novel point Conv.+flex-maxpooling without subsampling
PointCNN Li et al. (2018b) DKNN 𝜒-Conv FPS Novel point convolution
MCC Hermosilla et al. (2018) Ball MCConv. PDS Novel coarsening layer+point convolution
PointConv Wu et al. (2019b) KNN PointConv FPS Novel point convolution considering point density
A-CNN Komarichev et al. (2019) DKNN AConv FPS Novel neighborhood search+point convolution
RandLA-Net Hu et al. (2020) KNN LocSE RPS LFAM with large receptive field and keeping geometric details
PolarNet Zhang et al. (2020) None PointNet PolarGrid Novel local regions definition + RingConv
DGCNN Wang et al. (2019b) KNN EdgeConv None Novel graph convolution + updating graph
SPG Landrieu and Simonovsky (2018) partition PointNet None Superpoint graph + parsing large-scale scene
DeepGCNs Li et al. (2019a) DKNN DGConv RPS Adapting residual connections between layers
Graph Convolution
SPH3D-GCN Lei et al. (2020) Ball SPH3D-GConv FPS Novel graph convolution + pooling + uppooling
LS-GCN Wang et al. (2018a) KNN Spec.Conv. FPS Local spectral graph + Novel graph convolution
PAN Feng et al. (2020) Multi-direct. LAE-Conv FPS Point-wise spatial attention + local graph Conv.
TGNet Li et al. (2019b) Ball TGConv FPS Novel graph Conv. + multi-scale features explo.
HDGCN Liang et al. (2019a) KNN DGConv FPS Depthwise graph Conv. + Pointwise Conv.
3DCon.Net Zeng and Gevers (2018) KNN PointNet Tree layer KD tree structure
𝜓-CNN Lei et al. (2019) Octree neig. 𝜓-Conv Tree layer Octree structure+ Novel graph convolution
PGCRNet Ma et al. (2020) None Conv1D None PointGCR to model context dependencies
AGCN Xie et al. (2020b) KNN MLP None Point attention layer for aggregating local features
PointASNL Yan et al. (2020) KNN Local-nonlocal module AS Local-nonlocal module + adaptive sampling
Point Transformer
Point Transformer Zhao et al. (2021) KNN Point Transformer Pooling MLP-based relative position encoding + vector attention
Point Transformer v2 Wu et al. (2022a) Grid partition PointTransformerv2 pooling Novel position encoding + grid Pooling
PatchFormer Zhang et al. (2022) Boxes partition Patch Transformer DWConv First linear attention + Lightweight multi-scale Transformer
Fast Point Transformer Park et al. (2022) Voxel partition Fast point Transformer Voxel-based sampl. Lightweight local self-attention + novel position encoding
Stratified Transformer Lai et al. (2022) Voxel partition Stratified Transformer FPS Contextual relative position encoding
SphereFormer Lai et al. (2023) Voxel partition Sphereformer + cubicformer Maxpooling Novel spherical window for LIDAR points
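As referenced in the dilated point convolution discussion above, the following is a minimal sketch of a dilated KNN neighbor search: the k·d nearest neighbors are retrieved and every d-th one is kept, which enlarges the receptive field without increasing the number of aggregated neighbors. This brute-force version is for illustration only (practical implementations use spatial indexing structures), and the function name is our own.

```python
import torch

def dilated_knn(points, k=16, dilation=4):
    """Dilated KNN: search k * dilation nearest neighbours and keep every
    dilation-th one. points: (N, 3) tensor of xyz coordinates.
    Returns (N, k) neighbour indices (the query point itself is returned
    as its own first neighbour).
    """
    dists = torch.cdist(points, points)                       # (N, N) pairwise distances
    idx = dists.topk(k * dilation, largest=False).indices     # (N, k * dilation)
    return idx[:, ::dilation]                                  # keep every dilation-th neighbour

# Example: neighbourhood indices for a random cloud of 1024 points.
pts = torch.rand(1024, 3)
neighbors = dilated_knn(pts, k=16, dilation=4)                 # (1024, 16)
```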
Detection-free methods include SGPN Wang, Yu, Huang and Neumann (2018d), which assumes that points belonging to the same object instance should have very similar features. Hence, it learns a similarity matrix to predict proposals. The proposals are pruned by the confidence scores of the points to generate highly credible instance proposals. However, this simple distance similarity metric learning is not informative and is unable to segment adjacent objects of the same class. To this end, 3D-MPA Engelmann, Bokeloh, Fathi, Leibe and Nießner (2020a) learns object proposals from sampled and grouped point features that vote for the same object center, and then consolidates the proposal features using a graph convolutional network, enabling higher-level interactions between proposals that result in refined proposal features. AS-Net Jiang, Yan, Cai, Zheng and Xiao (2020a) uses an assignment module to assign proposal candidates and then eliminates redundant candidates with a suppression network. SoftGroup Vu, Kim, Luu, Nguyen and Yoo (2022) proposes top-down refinement to refine the instance proposals. SSTNet Liang, Li, Xu, Tan and Jia (2021) proposes an end-to-end solution, the Semantic Superpoint Tree Network (SSTNet), to generate object instance proposals from scene points. A key contribution in SSTNet is an intermediate semantic superpoint tree (SST), which is constructed based on the learned semantic features of superpoints. The tree is traversed and split at intermediate nodes to generate proposals of object instances.

4.2. Proposal Free
Proposal-free methods learn a feature embedding for each point and then apply clustering to obtain definitive 3D instance labels (see Figure 7), breaking the task down into two main challenges. From the embedding learning point of view, these methods can be roughly subdivided into four categories: 2D embedding based, multi-task learning, clustering based, and dynamic convolution based.

2D embedding based: An example of these methods is 3D-BEVIS Elich, Engelmann, Kontogianni and Leibe (2019), which learns a 2D global instance embedding on a bird's-eye-view of the full scene. It then propagates the learned embedding onto point clouds with DGCNN Wang et al. (2019b). Another example is PanopticFusion Narita, Seno, Ishikawa and Kaji (2019), which predicts pixel-wise instance labels with the 2D instance segmentation network Mask R-CNN He, Gkioxari, Dollár and Girshick (2017a) on RGB frames and integrates the learned labels into 3D volumes.

Multi-task learning: 3D semantic segmentation and 3D instance segmentation can influence each other. For example, objects with different classes must be different instances, and objects with the same instance label must belong to the same class. Based on this, ASIS Wang, Liu, Shen, Shen and Jia (2019a) designs an encoder-decoder network, termed ASIS, to learn semantic-aware instance embeddings for boosting the performance of the two tasks. Similarly, JSIS3D Pham, Nguyen, Hua, Roig and Yeung (2019b) uses a unified network, namely MT-PNet, to predict the semantic labels of points and embed the points into high-dimensional feature vectors, and further proposes an MV-CRF to jointly optimize object classes and instance labels. Similarly, Liu et al. Liu and Furukawa (2019) and 3D-GEL Liang, Yang and Wang (2019b) adopt SSCN to generate semantic predictions and instance embeddings simultaneously, and then use two GCNs to refine the instance labels. OccuSeg Han, Zheng, Xu and Fang (2020) uses a multi-task learning network to produce both an occupancy signal and a spatial embedding, where the occupancy signal represents the number of voxels occupied by the instance to which each voxel belongs.

Clustering based: Methods like MASC Liu and Furukawa (2019) rely on the high performance of SSCN Graham et al. (2018) to predict the similarity embedding between neighboring points at multiple scales as well as the semantic topology. A simple yet effective clustering Liu, Yang, Li, Zhou, Xu, Li and Lu (2018c) is adopted to segment points into instances based on the two types of learned embeddings. MTML Lahoud, Ghanem, Pollefeys and Oswald (2019) learns two sets of feature embeddings, including a feature embedding unique to every instance and a direction embedding that orients the instance center, which provides a stronger grouping force. Similarly, PointGroup Jiang, Zhao, Shi, Liu, Fu and Jia (2020b) groups points into different clusters based on the original coordinate embedding space and the shifted coordinate embedding space (a minimal sketch of this offset-and-cluster scheme is given after Table 5). In addition, the proposed ScoreNet guides the proper cluster selection. The above methods usually group points according to point-level embeddings, without instance-level corrections. HAIS Chen, Fang, Zhang, Liu and Wang (2021) introduces set aggregation and intra-instance prediction to refine the instances at the object level.

Dynamic convolution based: These methods overcome the limitations of clustering based methods by generating kernels and then using them to convolve with the point features to generate instance masks. Dyco3D He, Shen and Van Den Hengel (2021) adopts a clustering algorithm to generate the kernels for convolution. Similarly, PointInst3D He, Yin, Shen and van den Hengel (2022) uses FPS to generate kernels. DKNet Wu, Shi, Du, Lu, Cao and Zhong (2022b) introduces candidate mining and candidate aggregation to generate more instance kernels. Moreover, ISBNet Ngo, Hua and Nguyen (2023) proposes a new instance encoder combining instance-aware FPS with a point aggregation layer to generate kernels, replacing the clustering in DyCo3D. 3D instance segmentation methods are summarized in Table 5.

5. 3D Part Segmentation
3D part segmentation is the next finer level, after instance segmentation, where the aim is to label the different parts of an instance. The pipeline of part segmentation is quite similar to that of semantic segmentation, except that the labels are now for individual parts. Therefore, some existing 3D semantic segmentation networks Meng et al. (2019), Graham et al. (2018), Qi et al. (2017a), Qi et al. (2017b), Zeng and Gevers (2018), Huang et al. (2018), Thomas et al. (2019), Hua et al. (2018), Hermosilla et al. (2018), Wu
Table 5
Summary of 3D instance segmentation methods with deep learning. M←multi-view image; Me←mesh;V←voxel; P←point clouds.
Type Methods Input Propo./Embed. Prediction Refining/Grouping Contribution
3D-BoNet Yang et al. (2019) P Bounding box regression Point mask prediction Directly regressing 3D bounding box
SGPN Wang et al. (2018d) P SM + SCM + PN Non-Maximum suppression New group proposal
3D-MPA Engelmann et al. (2020a) p SSCNet Graph ConvNet Multi proposal aggregation strategy
AS-Net Jiang et al. (2020a) p Four branches with MLPs Candidate proposal suppression Novel Algorithm mapping labels to candidates
SoftGroup Vu et al. (2022) P Soft-grouping module top-down refinement Novel clustering algorithm based on dual coordinate sets
SSTNet Liang et al. (2021) p Tree traversal + splitting CliqueNet Constructing the superpoint tree for instance segmentation
3D-BEVIS Elich et al. (2019) M U-Net/FCN + 3D prop. Mean-shift clustering Joint 2D-3D feature
PanopticFus Narita et al. (2019) M PSPNet/Mask R-CNN FC-CRF Cooperating with semantic mapping
ASIS Wang et al. (2019a) P 1 encoder+ 2 decoders ASIS module Simultaneously performing sem./ins. segmentation tasks
JSIS3D Pham et al. (2019b) P MT-PNet MV-CRF Simultaneously performing sem./ins. segmentation tasks
3D-GEL Liang et al. (2019b) P SSCNet GCN Structure-aware loss function + attention-based GCN
OccuSeg Han et al. (2020) P 3D-UNet Graph-based clustering Proposing a novel occupancy signal
proposal free
MASC Liu and Furukawa (2019) Me U-Net with SSConv Clustering algorithm Novel clustering based on affinity and mesh topology
MTML Lahoud et al. (2019) V SSCNet Mean-shift clustering Multi-task learning
PointGroup Jiang et al. (2020b) P U-Net with SSConv Point clustering + ScoreNet Novel clustering algorithm based on dual coordinate sets
HAIS Chen et al. (2021) P 3D U-Net Set aggregation Hierarchical aggregation for fine-grained predictions
Dyco3D He et al. (2021) P 3D U-Net Dynamic conv. Generating kernel by clustering for convolution
PointInst3D He et al. (2022) P 3D U-Net MLP Generating kernel by FPS
DKNet Wu et al. (2022b) P 3D U-Net MLP Generating kernel by candidate mining and aggregation
ISBNet Ngo et al. (2023) P 3D U-Net Box-aware dynamic conv Generating kernel by instance aware FPS and point aggrega.
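As referenced in the clustering based discussion above, the sketch below illustrates the generic grouping idea behind such proposal-free methods: points of the same semantic class that lie within a small radius of each other are greedily merged into instances, and running the procedure on both the original and the offset-shifted coordinates yields the dual cluster sets. It is a simplified NumPy illustration with assumed parameter names, not the optimized implementation of PointGroup or any other specific method.

```python
import numpy as np

def cluster_points(coords, semantic_labels, radius=0.03, min_points=50):
    """Greedy radius-based grouping of same-class points into instances.
    coords: (N, 3) point coordinates (original or offset-shifted);
    semantic_labels: (N,) predicted class per point.
    Returns (N,) instance ids, with -1 for unassigned/noise points.
    """
    n = coords.shape[0]
    instance = -np.ones(n, dtype=np.int64)
    next_id = 0
    for seed in range(n):
        if instance[seed] >= 0:
            continue
        queue, instance[seed] = [seed], next_id
        while queue:                                   # flood-fill one cluster
            cur = queue.pop()
            same = semantic_labels == semantic_labels[cur]
            near = np.linalg.norm(coords - coords[cur], axis=1) < radius
            for j in np.where(same & near & (instance < 0))[0]:
                instance[j] = next_id
                queue.append(j)
        if (instance == next_id).sum() < min_points:
            instance[instance == next_id] = -1         # discard tiny clusters
        else:
            next_id += 1
    return instance
```

In practice the clusters produced on the two coordinate sets are scored (e.g. by a ScoreNet-style module) and merged, and the quadratic neighbour search above is replaced by voxel hashing or ball queries for efficiency.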
et al. (2019b), Li et al. (2018b), Wang et al. (2019b), Lei et al. (2020), Xie et al. (2020b), Wang et al. (2018c), Groh et al. (2018), Lei et al. (2019), Su et al. (2018), Rosu et al. (2019) can also be trained for part segmentation. However, these networks cannot entirely tackle the difficulties of part segmentation. For example, various parts with the same semantic label might have diverse shapes, and the number of parts of instances with the same semantic label may differ. We subdivide 3D part segmentation methods into two categories, regular data based and irregular data based, as follows.

5.1. Regular Data Based
Regular data usually includes projected images Kalogerakis, Averkiou, Maji and Chaudhuri (2017) and voxels Wang and Lu (2019), Le and Duan (2018), Song, Chen, Li and Zhao (2017). As for projected images, Kalogerakis et al. Kalogerakis et al. (2017) obtain a set of images from multiple views that optimally cover the object surface, and then use multi-view Fully Convolutional Networks (FCNs) and surface-based Conditional Random Fields (CRFs) to predict and refine the part labels, respectively. Voxels are a useful representation of geometric data. However, fine-grained tasks like part segmentation require high resolution voxels with more detailed structure information, which leads to a high computation cost. Wang et al. Wang and Lu (2019) proposed VoxSegNet to exploit more detailed information from voxels with limited resolution. They use spatial dense extraction to preserve the spatial resolution during the sub-sampling process and an attention feature aggregation (AFA) module to adaptively select scale features. Le et al. Le and Duan (2018) introduced a novel 3D CNN called PointGrid, which incorporates a constant number of points within each cell, allowing the network to learn better local geometry shape details. Furthermore, multiple model fusion can enhance the segmentation performance. Combining the advantages of images and voxels, Song et al. Song et al. (2017) proposed a two-stream FCN, termed AppNet and GeoNet, to explore 2D appearance and 3D geometric features from 2D images. In particular, their VolNet extracts 3D geometric features from 3D volumes, guiding GeoNet to extract features from a single image.

5.2. Irregular Data Based
Irregular data representations usually include meshes Xu et al. (2017), Hanocka et al. (2019) and point clouds Li et al. (2018a), Shen et al. (2018), Yi et al. (2017), Verma et al. (2018), Wang et al. (2018b), Yu et al. (2019), Zhao et al. (2019b), Yue, Wang, Tang and Chen (2022). A mesh provides an efficient approximation to a 3D shape because it captures the flat, sharp and intricate details of the surface shape and topology. Xu et al. Xu et al. (2017) use the face normal and face distance histogram as the input of a two-stream framework and use a CRF to optimize the final labels. Inspired by traditional CNNs, Hanocka et al. Hanocka et al. (2019) design novel mesh convolution and pooling operations on the mesh edges.

As for point clouds, graph convolution is the most commonly used pipeline. In the spectral graph domain, SyncSpecCNN Yi et al. (2017) introduces a Synchronized Spectral CNN to process irregular data. Specifically, multi-channel convolution and parametrized dilated convolution kernels are proposed to address multi-scale analysis and information sharing across shapes, respectively. In the spatial graph domain, in analogy to a convolution kernel for images, KCNet Shen et al. (2018) presents a point-set kernel and a nearest-neighbor graph to improve PointNet with an efficient local feature exploitation structure. Similarly, Wang et al. Wang
Table 6
Summary of 3D part segmentation methods. M←multi-view image; Me←mesh;V←voxel;P←point clouds; reg.←regular data; ir-
reg.←irregular data.
Type Methods Input Architecture Feature extractor Contribution
ShapePFCN Kalogerakis et al. (2017) M Multi-stream FCN 2DConv Per-label confidence maps + surface-based CRF
regular
VoxSegNet Wang and Lu (2019) V 3DU-Net AtrousConv SDE for preserving the spatial resolution AFA for feature selecting
Pointgrid Le and Duan (2018) V Conv-deconv 3DConv Learning higher order local geometry shape.
SubvolumeSup Song et al. (2017) M+V 2-stream FCN 2D/3DConv GeoNet/AppNet for 3/2D features exploi. + DCT for aligning.
DCN Xu, Dong and Zhong (2017) Me 2-tream DCN & NN DirectionalConv DCN/NN for local feature and global feature.
MeshCNN Hanocka, Hertz, Fish, Giryes, Fleishman and D. (2019) Me 2D CNN MeshConv Novel mesh convolution and pooling
PartNet Yu, Liu, Zhang, Zhu and Xu (2019) P RNN PN Part feature learning scheme for context and geometry feature exploitation
SSCNN Yi, Su, Guo and Guibas (2017) P FCN SpectralConv STN for allowing weight sharing + spectral multi-scale kernel
KCNet Shen et al. (2018) P PN MLP KNN graph on points + kernel correlation for measuring geometric affinity
irregular
to operate over a sliding window over the RGB-D video frames. Specifically, the convolutional gated recurrent unit preserves the spatial information and reduces the parameters. Similarly, Yurdakul et al. Emre Yurdakul and Yemez (2017) combine fully convolutional and recurrent neural networks to investigate the contributions of depth and temporal information separately on synthetic RGB-D videos.

Spatio-temporal convolution based: Nearby video frames provide diverse viewpoints and additional context of objects and scenes. STD2P He, Chiu, Keuper and Fritz (2017b) uses a novel spatio-temporal pooling layer to aggregate region correspondences computed by optical flow and image boundary-based super-pixels. Choy et al. Choy, Gwak and Savarese (2019) proposed a 4D Spatio-Temporal ConvNet to directly process 3D point cloud videos. To overcome the challenges of the high-dimensional 4D space (3D space and time), they introduced the 4D spatio-temporal convolution, a generalized sparse convolution, and a trilateral-stationary conditional random field that keeps spatio-temporal consistency. Similarly, based on 3D sparse convolution, Shi et al. Shi, Lin, Wang, Hung and Wang (2020) proposed SpSequenceNet, which contains two novel modules, a cross-frame global attention module and a cross-frame local interpolation module, to exploit spatial and temporal features in 4D point clouds. PointMotionNet Wang, Li, Sullivan, Abbott and Chen (2022) proposes a spatio-temporal convolution that exploits a time-invariant spatial neighboring space and extracts spatio-temporal features to distinguish moving and static objects.

Spatio-temporal Transformer based: To capture the dynamics in point cloud videos, point tracking is usually employed. In contrast, P4Transformer Fan, Yang and Kankanhalli (2021) proposes a 4D convolution to embed the spatio-temporal local structures of a point cloud video and further introduces a Transformer to leverage the motion information across the entire video by performing self-attention on these embedded local features. Similarly, PST2 Wei, Liu, Xie, Ke and Guo (2022) performs spatio-temporal self-attention across adjacent frames to capture the spatio-temporal context, and proposes a resolution embedding module to enhance the resolution of feature maps by aggregating features.

6.1.2. 3D semantic map construction
Unmanned systems do not just need to avoid obstacles but also need to establish a deeper understanding of the scene, such as object parsing and self-localization. To facilitate such tasks, unmanned systems build a 3D semantic map of the scene, which involves two key problems: geometric reconstruction and semantic segmentation. 3D scene reconstruction has conventionally relied on simultaneous localization and mapping (SLAM) systems to obtain a 3D map without semantic information. This is followed by 2D semantic segmentation with a 2D CNN, and then the 2D labels are transferred to the 3D map following an optimization (e.g. a conditional random field) to obtain a 3D semantic map Yang, Huang and Scherer (2017). This common pipeline does not guarantee high performance of 3D semantic maps in complex, large-scale, and dynamic scenes. Efforts have been made to enhance the robustness using association information exploitation from multiple frames, multi-model fusion and novel post-processing operations. These efforts are explained below.

Association information exploitation: This mainly depends on the SLAM trajectory, recurrent neural networks or scene flow. Ma et al. Ma, Stückler, Kerl and Cremers (2017) enforce consistency by warping CNN feature maps from multiple views into a common reference view using the SLAM trajectory to supervise training at multiple scales. SemanticFusion McCormac, Handa, Davison and Leutenegger (2017) incorporates deconvolutional neural networks with a state-of-the-art dense SLAM system, ElasticFusion, which provides long-term correspondences between the frames of a video. These correspondences allow label predictions from multiple views to be probabilistically fused into a map. Similarly, using the connection information between frames provided by a recurrent unit on RGB-D videos, Xiang et al. Xiang and Fox (2017) proposed a data associated recurrent neural network (DA-RNN) and integrated the output of the DA-RNN with KinectFusion, which provides a consistent semantic labeling of the 3D scene. Cheng et al. Cheng, Sun and Meng (2020) use a CRF-RNN-based semantic segmentation to generate the corresponding labels. Specifically, the authors proposed an optical flow-based method to deal with dynamic factors for accurate localization. Kochanov et al. Kochanov, Ošep, Stückler and Leibe (2016) also use scene flow to propagate dynamic objects within the 3D semantic maps.

Multiple model fusion: Jeong et al. Jeong, Yoon and Park (2018) build a 3D map by estimating odometry based on GPS and IMU, and use a 2D CNN for semantic segmentation. They integrate the 3D map with the semantic labels using a coordinate transformation and a Bayes' update scheme. Zhao et al. Zhao, Sun, Purkait, Duckett and Stolkin (2018) use PixelNet and VoxelNet to exploit global context information and local shape information separately, and then fuse the score maps with a softmax weighted fusion that adaptively learns the contribution of the different data streams. The final dense 3D semantic maps are generated with visual odometry and recursive Bayesian updates.

7. Experimental Results
Below we summarize the quantitative results of the segmentation methods discussed in Sections 3, 4 and 5 on some typical public datasets, as well as analyze these results qualitatively.

7.1. Results for 3D Semantic Segmentation
We report the results of RGB-D based semantic segmentation methods on the SUN-RGB-D Song et al. (2015) and NYUDv2 Silberman et al. (2012) datasets using mAcc (mean Accuracy) and mIoU (mean Intersection over Union) as the evaluation metrics. The results of the various methods are taken from the original papers and are shown in Table 7.
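For reference, the two metrics used throughout this section can be computed from a confusion matrix as in the following minimal sketch (illustrative code only; classes absent from the ground truth are usually excluded from the averages, which is omitted here for brevity).

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute mAcc (mean per-class accuracy) and mIoU (mean per-class
    intersection-over-union) from 1-D integer label arrays."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)                    # confusion matrix: rows = gt, cols = pred
    tp = np.diag(conf).astype(np.float64)
    per_class_acc = tp / np.maximum(conf.sum(axis=1), 1)
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    per_class_iou = tp / np.maximum(union, 1)
    return per_class_acc.mean(), per_class_iou.mean()

# Example with three classes.
gt   = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])
macc, miou = segmentation_metrics(pred, gt, num_classes=3)
```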
Table 7
Evaluation of RGB-D semantic segmentation methods on the SUN-RGB-D and NYUDv2 datasets. Note that the '%' after the value is omitted and the symbol '–' means the results are unavailable.
Methods NYUDv2 (mAcc mIoU) SUN-RGB-D (mAcc mIoU)
Guo and Chen (2018) 46.3 34.8 45.7 33.7
Wang et al. (2015) – 44.2 – –
Mousavian et al. (2016) 52.3 39.2 – –
Liu et al. (2018b) 50.8 39.8 50.0 39.4
Gupta et al. (2014) 35.1 28.6 – –
Liu et al. (2018a) 51.7 41.2 – –
Hazirbas et al. (2016) – – 48.3 37.3
Lin et al. (2017) – 47.7 – 48.1
Jiang et al. (2017) – – 50.6 39.3
Wang and Neumann (2018) 47.3 – – –
Cheng et al. (2017) 60.7 45.9 58.0 –
Fan et al. (2017) 50.2 – – –
Li et al. (2016) 49.4 – 48.1 –
Qi et al. (2017c) 55.7 43.1 57.0 45.9
Wang et al. (2016) 60.6 38.3 50.1 33.5

We report the results of projected image, voxel, point cloud and other representation based semantic segmentation methods on S3DIS Armeni et al. (2016) (both Area 5 and 6-fold cross validation), ScanNet Dai et al. (2017) (test set), Semantic3D Hackel et al. (2017) (reduced-8 subset) and SemanticKITTI Behley et al. (2019) (only xyz, without RGB). We use mAcc, oAcc (overall accuracy) and mIoU as the evaluation metrics. The results of the various methods are taken from the original papers; Table 8 reports the results.

The architectures of point cloud semantic segmentation typically focus on five main components: basic framework, neighborhood search, feature abstraction, coarsening, and pre-processing. Below, we provide a more detailed discussion of each component.

Basic framework: Basic networks are one of the main driving forces behind the development of 3D segmentation. Generally, there are two main basic frameworks, PointNet and PointNet++. The PointNet framework utilizes shared Multi-Layer Perceptrons (MLPs) to capture point-wise features and employs max-pooling to aggregate these features into a global representation. However, it lacks the ability to learn local features due to the absence of a defined local neighborhood. Additionally, the fixed resolution of the feature map makes it challenging to adapt to deep architectures. In contrast, the PointNet++ framework introduces a novel hierarchical learning architecture. It defines local regions in a hierarchical manner to progressively extract features from these regions. This approach enables the network to capture both local and global information, leading to improved performance. As a result, many current networks adopt the PointNet++ framework or similar variations (such as 3D U-Net). This framework significantly reduces the computational and memory complexities, particularly in high-level tasks like semantic segmentation, instance segmentation, and detection.

Neighborhood search: To exploit the local features of point clouds, neighborhood point search is introduced into networks, including K nearest neighbors (KNN) Zhao et al. (2021), Ran et al. (2022), Qian et al. (2022), ball search Hermosilla et al. (2018), Thomas et al. (2019), Lei et al. (2020), grid-based search Hua et al. (2018), Wu et al. (2022a) and tree based search Lei et al. (2019). KNN search retrieves the K closest neighbors to a query point based on a distance metric, and hence lacks robustness to point clouds with varying densities. Some works integrate the dilation mechanism with the neighbor search to expand the receptive field Komarichev et al. (2019), Li et al. (2018b), Li et al. (2019a). Ball search involves finding all points within a specified radius (ball) around a query point. Similarly, grid-based search divides the point cloud space into a regular grid structure. Ball search and grid-based search are both useful for effectively capturing local structures and neighborhoods of varying densities.

Feature abstraction: In feature abstraction, commonly used methods include MLP-based, convolution-based, and Transformer-based approaches. MLPs are often used to extract features from individual points in point cloud data. By passing the feature vector of each point through multiple fully connected layers, an MLP learns nonlinear point-level feature representations. MLPs offer flexibility and scalability in point cloud processing. Convolution operations on point clouds typically involve aggregating (low-level) information from local points to capture local structures and contextual information. In contrast, Transformer based methods establish correlations between high-level point information through the attention mechanism, which is more helpful for high-level tasks such as point cloud segmentation.

The essence of MLP based, convolution based, and Transformer based methods is to learn the relationships between points and thereby obtain robust weights. In the context of a similar baseline architecture, the more comprehensively the point cloud relationships are learned in the feature abstraction process, the stronger the robustness of the model becomes. Recently, MLP-based methods (e.g. RepSurf Ran et al. (2022) and PointNeXt Qian et al. (2022)) have exhibited better accuracy and efficiency, encouraging researchers to re-examine and further explore the potential of MLP-based approaches.

Coarsening: Coarsening, also known as downsampling or subsampling, involves reducing the number of points in the point cloud while preserving the essential structures and features. Coarsening techniques include random sampling Hu et al. (2020), farthest point sampling Qi et al. (2017a,b), tree-based methods Lei et al. (2019) and mesh-based decimation Lei, Akhtar, Shah and Mian (2023). This step helps to reduce the computational complexity and improve efficiency in subsequent stages of the segmentation process. Random sampling is simple and computationally efficient but may not select the most suitable points for maintaining local and global structure. This can potentially lead to information loss in feature rich regions. Farthest point sampling is widely used in networks as it ensures a more even spatial distribution of the selected points and can help preserve global structures. However, local structures can still get destroyed with farthest point sampling. Tree-based methods leverage hierarchical tree structures, such as an octree, to partition the point cloud
Table 8
Evaluation of projected image, voxel, point cloud and other representation based semantic segmentation methods on S3DIS, ScanNet, Semantic3D and SemanticKITTI. Note: the '%' after the value is omitted, the symbol '–' means the results are unavailable, and the dotted line indicates the subdivision of methods according to the type of architecture.
Method S3DIS-Area5 (mAcc mIoU) S3DIS-6fold (mIoU) ScanNet-test (oAcc mIoU) Semantic3D-reduced8 (oAcc mIoU) SemanticKITTI-xyz (mAcc mIoU)
Lawin et al. Lawin et al. (2017) – – – – – 88.9 58.5 – –
Boulch et al. Boulch et al. (2017) – – – – – 91.0 67.4 – –
projection
Wu et al. Wu et al. (2018a) – – – – – – – – 37.2
Wang et al. Wang et al. (2018e) – – – – – – – – 39.8
Wu et al. Wu et al. (2019a) – – – – – – – – 44.9
Milioto et al. Milioto et al. (2019) – – – – – – – – 52.2
Xu et al. Xu et al. (2020) – – – – – – – – 55.9
RangeViT Ando et al. (2023) – – – – – – – – 55.9
RangeFormer Kong et al. (2023) – – – – – – – – 64.0
Tchapmi et al. Tchapmi et al. (2017) 57.35 48.92 48.92 – – 88.1 61.30 – –
voxel
and perform coarsening. Mesh based methods must first convert the point cloud to a mesh before it can be decimated. This adds a computational overhead to the already expensive mesh decimation process. Moreover, creating a mesh from the complex and sparse point clouds obtained from LiDAR sensors is not always possible Lei et al. (2023).

The above methods are hand-crafted or engineered techniques that do not involve learning parameters directly from the data; they determine the sub-sampling pattern based on predefined rules or heuristics, without explicitly optimizing for the task at hand. Therefore, some works propose learnable coarsening methods that integrate a learnable layer into the coarsening module, such as pooling Groh et al.
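To make the coarsening discussion above concrete, the following is a minimal sketch of farthest point sampling, the most widely used hand-crafted coarsening strategy. It is illustrative only and uses a brute-force O(N·m) formulation.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from the already selected set.
    points: (N, 3) array; m: number of samples to keep.
    Returns the indices of the m selected points."""
    n = points.shape[0]
    selected = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = 0                                  # arbitrary first seed
    for i in range(1, m):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)                   # distance to the nearest selected point
        selected[i] = dist.argmax()
    return selected

# Keep 256 of 4096 points while preserving the global shape.
pts = np.random.rand(4096, 3)
coarse = pts[farthest_point_sampling(pts, 256)]
```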
Table 9
Evaluation of 3D instance segmentation methods on ScanNet. Note: the '%' after the value is omitted.
Methods mAP bath. bed book. cabi. chair count. curt. desk door other pict. refr. shower. sink sofa table toilet wind.
GSPN Yi et al. (2019) 30.6 50.0 40.5 31.1 34.8 58.9 5.4 6.8 12.6 28.3 29.0 2.8 21.9 21.4 33.1 39.6 27.5 82.1 24.5
3D-SIS Hou et al. (2019) 38.2 100 43.2 24.5 19.0 57.7 1.3 26.3 3.3 32.0 24.0 7.5 42.2 85.7 11.7 69.9 27.1 88.3 23.5
3D-BoNet Yang et al. (2019) 48.8 100 67.2 59.0 30.1 48.4 9.8 62.0 30.6 34.1 25.9 12.5 43.4 79.6 40.2 49.9 51.3 90.9 43.9
SGPN Wang et al. (2018d) 14.3 20.8 39.0 16.9 6.5 27.5 2.9 6.9 0 8.7 4.3 1.4 2.7 0 11.2 35.1 16.8 43.8 13.8
3D-MPA Engelmann et al. (2020a) 61.1 100 83.3 76.5 52.6 75.6 13.6 58.8 47.0 43.8 43.2 35.8 65.0 85.7 42.9 76.5 55.7 100 43.0
SoftGroup Vu et al. (2022) 76.1 100 80.8 84.5 71.6 86.2 24.3 82.4 65.5 62.0 73.4 69.9 79.1 98.1 71.6 84.4 76.9 100 59.4
SSTNet Liang et al. (2021) 69.8 100 69.7 88.8 55.6 80.3 38.7 62.6 41.7 55.6 58.5 70.2 60.0 100 82.4 72.0 69.2 100 50.9
3D-BEVIS Elich et al. (2019) 24.8 66.7 56.6 7.6 3.5 39.4 2.7 3.5 9.8 9.8 3.0 2.5 9.8 37.5 12.6 60.4 18.1 85.4 17.1
PanopticFus. Narita et al. (2019) 47.8 66.7 71.2 59.5 25.9 55.0 0 61.3 17.5 25.0 43.4 43.7 41.1 85.7 48.5 59.1 26.7 94.4 35.9
OccuSeg Han et al. (2020) 67.2 100 75.8 68.2 57.6 84.2 47.7 50.4 52.4 56.7 58.5 45.1 55.7 100 75.1 79.7 56.3 100 46.7
MTML Lahoud et al. (2019) 54.9 100 80.7 58.8 32.7 64.7 0.4 81.5 18.0 41.8 36.4 18.2 44.5 100 44.2 68.8 57.1 100 39.6
PointGroup Jiang et al. (2020b) 63.6 100 76.5 62.4 50.5 79.7 11.6 69.6 38.4 44.1 55.9 47.6 59.6 100 66.6 75.6 55.6 99.7 51.3
HAIS Chen et al. (2021) 69.9 100 84.9 82.0 67.5 80.8 27.9 75.7 46.5 51.7 59.6 55.9 60.0 100 65.4 76.7 67.6 99.4 56.0
Dyco3D He et al. (2021) 64.1 100 84.1 89.3 53.1 80.2 11.5 58.8 44.8 43.8 53.7 43.0 55.0 85.7 53.4 76.4 65.7 98.7 56.8
DKNet Wu et al. (2022b) 71.8 100 81.4 78.2 61.9 87.2 22.4 75.1 56.9 67.7 58.5 72.4 63.3 98.1 51.5 81.9 73.6 100 61.7
ISBNet Ngo et al. (2023) 76.3 100 87.3 71.7 66.6 85.8 50.8 66.7 76.4 64.3 67.6 68.8 82.5 100 77.3 74.1 77.7 100 55.6
Table 10
Evaluation of 3D part segmentation methods on ShapeNet. Note: the '%' after the value is omitted and the symbol '–' means the results are unavailable.
Methods Ins. mIoU Methods Ins. mIoU
VV-Net Meng et al. (2019) 87.4 LatticeNet Su et al. (2018) 83.9
SSCNet Graham et al. (2018) 86.0 SGPN Wang et al. (2018d) 85.8
PointNet Qi et al. (2017a) 83.7 ShapePFCN Kalogerakis et al. (2017) 88.4
PointNet++ Qi et al. (2017b) 85.1 VoxSegNet Wang and Lu (2019) 87.5
3DContextNet Zeng and Gevers (2018) 84.3 Pointgrid Le and Duan (2018) 86.4
RSNet Huang et al. (2018) 84.9 KPConv Thomas et al. (2019) 86.4
MCC Hermosilla et al. (2018) 85.9 SO-Net Li et al. (2018a) 84.9
PointConv Wu et al. (2019b) 85.7 PartNet Yu et al. (2019) 87.4
DGCNN Wang et al. (2019b) 85.1 SyncSpecCNN Yi et al. (2017) 84.7
SPH3D-GCN Lei et al. (2020) 86.8 KCNet Shen et al. (2018) 84.7
AGCN Xie et al. (2020b) 85.4 PointCNN Li et al. (2018b) 86.1
PCCN Wang et al. (2018c) 85.9 SpiderCNN Xu et al. (2018) 85.3
Flex-Conv Groh et al. (2018) 85.0 FeaStNet Verma et al. (2018) 81.5
𝜓-CNN Lei et al. (2019) 86.8 Kd-Net Klokov and Lempitsky (2017) 82.3
SPLATNet Su et al. (2018) 84.6 O-CNN Wang et al. (2017) 85.9
DRGCNN Yue et al. (2022) 86.2
proposal-free methods can achieve better results by taking into account the overall context and characteristics of the point cloud.

7.3. Results for 3D Part Segmentation
We report the results of 3D part segmentation methods on the ShapeNet Yi et al. (2016) dataset and use Ins. mIoU as the evaluation metric. The results of the various methods are taken from the original papers and are shown in Table 10. We find that the part segmentation performance of all methods is quite similar. One underlying reason is that objects in the ShapeNet dataset are synthetic, normalized in scale, aligned in pose, and lack scene context, which makes it difficult for part segmentation networks to extract rich context features. Another reason is that synthetic point clouds without background noise are simpler and cleaner than those captured in real scenes, so their geometric features are easier to exploit. As a result, the accuracy of the various part segmentation networks is difficult to distinguish effectively.

8. Discussion and Conclusion
3D segmentation using deep learning techniques has made significant progress during recent years. However, this is just the beginning and significant developments lie ahead of us. Below, we present some outstanding issues and identify potential research directions.

• Synthetic datasets with richer information for multiple tasks: Synthetic datasets gradually play an important role in semantic segmentation due to the low cost and diverse scenes that can be generated Brodeur et al. (2017), Wu et al. (2018b) compared to real datasets Dai et al. (2017), Armeni et al. (2016), Hackel et al. (2017). It is well known that the information contained in the training data determines the upper limit of the scene parsing accuracy. Existing datasets lack important semantic information, such as material and texture information, which is crucial for segmenting objects with similar color or geometric information. Besides, most existing datasets are generally designed for a single task. Currently, only a few semantic segmentation datasets also contain labels for instances Dai et al. (2017) and scene layout Song et al. (2015) to meet the multi-task objective.

• Unified network for multiple tasks: It is expensive and impractical for a system to accomplish different computer vision tasks with separate deep learning networks. In terms of fundamental feature exploitation of a scene, semantic segmentation has strong consistency with some tasks, such as depth estimation Meyer et al. (2019), Liu et al. (2015), Guo and Chen (2018), Liu et al. (2018b), scene completion Dai et al. (2018), Xia, Liu, Li, Zhu, Ma, Li, Hou and Qiao (2023), Zhang, Han, Dong, Li, Yin and Yang (2023), instance segmentation Liang et al. (2019b), Pham et al. (2019b), Han et al. (2020), and object detection Meyer et al. (2019), Lian, Li and Chen (2022). These tasks could cooperate with each other to improve performance in a unified network because they exhibit certain correlations and shared feature representations.

• Multiple modalities for segmentation: Semantic segmentation using multiple representations, such as projected images, voxels, and point clouds, has the potential to achieve higher accuracy. A single representation limits segmentation accuracy due to the limited scene information in some practical scenarios. For instance, LiDAR measurements become sparser as the distance increases, and incorporating high-resolution image data can improve performance on distant objects. Therefore, utilizing multiple representations, also known as multiple modalities, can be an alternative way to enhance segmentation performance Dai and Nießner (2018), Chiang et al. (2019), Liu et al. (2019b), Hu et al. (2021). Moreover, segmenting point clouds with large image models (such as SAM Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo et al. (2023)) and natural language models like ChatGPT can become popular approaches. The advanced capabilities of large models enable them to capture intricate patterns and semantic relationships, leading to improved performance and accuracy in segmentation tasks.

• Interpretable and sparse feature abstraction: Various feature abstraction methods, including MLPs, convolution and Transformers, have undergone significant development. Feature abstraction modules may prioritize generating interpretable feature representations, enabling them to provide explanations for model decisions, visualizations of points of interest, and other interpretability functions. Moreover, in scenarios involving large-scale data and limited resources, the feature abstraction module should also improve computational efficiency.

• Weakly-supervised and unsupervised segmentation: Deep learning has gained significant success in 3D segmentation, but heavily hinges on large-scale labelled training samples. Weakly-supervised learning refers to a training approach where the model is trained with limited or incomplete supervision, while unsupervised learning only uses unlabelled training samples. Weakly-supervised Su, Xu and Jia (2023), Shi, Wei, Li, Liu and Lin (2022) and unsupervised Xiao, Huang, Guan, Zhang, Lu and Shao (2023) paradigms are considered as alternatives to relax the impractical requirement of large-scale labelled datasets.

• Real-time and incremental segmentation: Real-time 3D scene parsing is crucial for applications such as autonomous driving and mobile robots. However, most existing 3D semantic segmentation methods mainly focus on improving segmentation accuracy and rarely on real-time performance. A few lightweight 3D semantic segmentation networks achieve real-time operation by pre-processing point clouds into other representations such as projected images Wu et al. (2018a), Wu et al. (2019a), Park, Kim, Kim and Jo (2023), Ando et al. (2023). Additionally, incremental segmentation will become an important research direction, allowing models to incrementally update and adapt in dynamic scenes.

• 3D video semantic segmentation: Like 2D video semantic segmentation, a handful of works try to exploit 4D spatio-temporal features on 3D videos (also called 4D point clouds) Wei et al. (2022), Fan et al. (2021). From these works, it can be seen that spatio-temporal features can help improve the robustness of 3D video or dynamic 3D scene semantic segmentation.

We provided a comprehensive survey of the recent development in 3D segmentation using deep learning techniques, including 3D semantic segmentation, 3D instance segmentation and 3D part segmentation. We presented a comprehensive performance comparison and the merits of various methods in each category, with potential research directions being listed.

References
Ando, A., Gidaris, S., Bursuc, A., Puy, G., Boulch, A., Marlet, R., 2023. Rangevit: Towards vision transformers for 3d semantic segmentation in autonomous driving, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5240–5250.
Armeni, I., Sener, O., Zamir, A., Jiang, H., Brilakis, I., Fischer, M., Savarese, S., 2016. 3d semantic parsing of large-scale indoor spaces, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1534–1543.
Behley, J., Garbade, M., Milioto, A., Quenzel, J., Behnke, S., Stachniss, C., Gall, J., 2019. Semantickitti: A dataset for semantic scene understanding of lidar sequences, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, IEEE. pp. 9297–9307.
Bello, S., Yu, S., Wang, C., Adam, J., Li, J., 2020. Deep learning on 3d point clouds. Remote Sensing 12, 1729.
Boulch, A., Guerry, J., Le Saux, B., Audebert, N., 2018. Snapnet: 3d point cloud semantic labeling with 2d deep segmentation networks. Computers & Graphics 71, 189–198.
Boulch, A., Le Saux, B., Audebert, N., 2017. Unstructured point cloud semantic labeling using deep segmentation networks. 3DOR@Eurographics 2, 7.
Brodeur, S., Perez, E., Anand, A., Golemo, F., Celotti, L., Strub, F., Rouat, J., Larochelle, H., Courville, A., 2017. Home: A household multimodal environment. arXiv preprint arXiv:1711.11017.
Cao, Y., Shen, C., Shen, H., 2016. Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Transactions on Image Processing 26, 836–846.
Chang, A., Dai, A., Funkhouser, T., Halber, M., Niebner, M., Savva, M., Song, S., Zeng, A., Zhang, Y., 2017. Matterport3d: Learning from rgb-d data in indoor environments, in: 2017 International Conference on 3D Vision (3DV), IEEE. pp. 667–676.
Chen, S., Fang, J., Zhang, Q., Liu, W., Wang, X., 2021. Hierarchical aggregation for 3d instance segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15467–15476.
Chen, X., Golovinskiy, A., Funkhouser, T., 2009. A benchmark for 3d mesh segmentation. ACM Transactions on Graphics 28, 1–12.
Cheng, J., Sun, Y., Meng, M.Q.H., 2020. Robust semantic mapping in challenging environments. Robotica 38, 256–270.
Cheng, Y., Cai, R., Li, Z., Zhao, X., Huang, K., 2017. Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3029–3037.
Chiang, H.Y., Lin, Y.L., Liu, Y.C., Hsu, W.H., 2019. A unified point-based framework for 3d segmentation, in: Proceedings of the International Conference on 3D Vision, IEEE. pp. 155–163.
Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
Choy, C., Gwak, J., Savarese, S., 2019. 4d spatio-temporal convnets: Minkowski convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3075–3084.
Couprie, C., Farabet, C., Najman, L., LeCun, Y., 2013. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572.
Dai, A., Chang, A., Savva, M., Halber, M., Funkhouser, T., Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
Dai, A., Nießner, M., 2018. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation, in: Proceedings of the European Conference on Computer Vision, pp. 452–468.
Couprie, C., Farabet, C., Najman, L., LeCun, Y., 2013. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572.
Dai, A., Chang, A., Savva, M., Halber, M., Funkhouser, T., Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
Dai, A., Nießner, M., 2018. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation, in: Proceedings of the European Conference on Computer Vision, pp. 452–468.
Dai, A., Ritchie, D., Bokeloh, M., Reed, S., Sturm, J., Nießner, M., 2018. Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4578–4587.
Elich, C., Engelmann, F., Kontogianni, T., Leibe, B., 2019. 3d bird's-eye-view instance segmentation, in: German Conference on Pattern Recognition, Springer. pp. 48–61.
Emre Yurdakul, E., Yemez, Y., 2017. Semantic segmentation of rgbd videos with recurrent fully convolutional neural networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 367–374.
Engelmann, F., Bokeloh, M., Fathi, A., Leibe, B., Nießner, M., 2020a. 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9031–9040.
Engelmann, F., Kontogianni, T., Hermans, A., Leibe, B., 2017. Exploring spatial context for 3d semantic segmentation of point clouds, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 716–724.
Engelmann, F., Kontogianni, T., Leibe, B., 2020b. Dilated point convolutions: On the receptive field size of point convolutions on 3d point clouds, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE. pp. 9463–9469.
Engelmann, F., Kontogianni, T., Schult, J., Leibe, B., 2018. Know what your neighbors do: 3d semantic segmentation of point clouds, in: Proceedings of the European Conference on Computer Vision, pp. 0–0.
Fan, H., Mei, X., Prokhorov, D., Ling, H., 2017. Rgb-d scene labeling with multimodal recurrent neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–17.
Fan, H., Yang, Y., Kankanhalli, M., 2021. Point 4d transformer networks for spatio-temporal modeling in point cloud videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14204–14213.
Feng, M., Zhang, L., Lin, X., Gilani, S.Z., Mian, A., 2020. Point attention network for semantic segmentation of 3d point clouds. Pattern Recognition, 107446.
Fooladgar, F., Kasaei, S., 2020. A survey on indoor rgb-d semantic segmentation: from hand-crafted features to deep convolutional neural networks. Multimedia Tools and Applications 79, 4499–4524.
Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for autonomous driving? the kitti vision benchmark suite, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE. pp. 3354–3361.
Graham, B., Engelcke, M., Van Der Maaten, L., 2018. 3d semantic segmentation with submanifold sparse convolutional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9224–9232.
Groh, F., Wieschollek, P., Lensch, H.P., 2018. Flex-convolution, in: Asian Conference on Computer Vision, Springer. pp. 105–122.
Guerry, J., Boulch, A., Le Saux, B., Moras, J., Plyer, A., Filliat, D., 2017. Snapnet-r: Consistent 3d multi-view semantic labeling for robotics, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 669–678.
Guo, Y., Chen, T., 2018. Semantic segmentation of rgbd images based on deep depth regression. Pattern Recognition Letters 109, 55–64.
Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., Bennamoun, M., 2020. Deep learning for 3d point clouds: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Gupta, S., Girshick, R., Arbeláez, P., Malik, J., 2014. Learning rich features from rgb-d images for object detection and segmentation, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 345–360.
Hackel, T., Savinov, N., Ladicky, L., Wegner, J., Schindler, K., Pollefeys, M., 2017. Semantic3d.net: A new large-scale point cloud classification benchmark. arXiv preprint arXiv:1704.03847.
Han, L., Zheng, T., Xu, L., Fang, L., 2020. Occuseg: Occupancy-aware 3d instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2940–2949.
Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleishman, S., Cohen-Or, D., 2019. Meshcnn: a network with an edge. ACM Transactions on Graphics 38, 1–12.
Hazirbas, C., Ma, L., Domokos, C., Cremers, D., 2016. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture, in: Asian Conference on Computer Vision, Springer. pp. 213–228.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017a. Mask r-cnn, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2961–2969.
He, T., Shen, C., Van Den Hengel, A., 2021. Dyco3d: Robust instance segmentation of 3d point clouds through dynamic convolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 354–363.
He, T., Yin, W., Shen, C., van den Hengel, A., 2022. Pointinst3d: Segmenting 3d instances by points, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 286–302.
He, Y., Chiu, W.C., Keuper, M., Fritz, M., 2017b. Std2p: Rgbd semantic segmentation using spatio-temporal data-driven pooling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4837–4846.
Hermosilla, P., Ritschel, T., Vázquez, P.P., Vinacua, À., Ropinski, T., 2018. Monte carlo convolution for learning on non-uniformly sampled point clouds. ACM Transactions on Graphics 37, 1–12.
Höft, N., Schulz, H., Behnke, S., 2014. Fast semantic segmentation of rgb-d scenes with gpu-accelerated deep neural networks, in: Joint German/Austrian Conference on Artificial Intelligence, Springer. pp. 80–85.
Hou, J., Dai, A., Nießner, M., 2019. 3d-sis: 3d semantic instance segmentation of rgb-d scans, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4421–4430.
Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., Markham, A., 2020. Randla-net: Efficient semantic segmentation of large-scale point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11108–11117.
Hu, W., Zhao, H., Jiang, L., Jia, J., Wong, T.T., 2021. Bidirectional projection network for cross dimension scene understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14373–14382.
Hua, B., Pham, Q., Nguyen, D., Tran, M., Yu, L., Yeung, S., 2016. Scenenn: A scene meshes dataset with annotations, in: Proceedings of the International Conference on 3D Vision, IEEE. pp. 92–101.
Hua, B.S., Tran, M.K., Yeung, S.K., 2018. Pointwise convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 984–993.
Huang, J., You, S., 2016. Point cloud labeling using 3d convolutional neural network, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE. pp. 2670–2675.
Huang, Q., Wang, W., Neumann, U., 2018. Recurrent slice networks for 3d segmentation of point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2626–2635.
Iandola, F., Han, S., Moskewicz, M., Ashraf, K., Dally, W., Keutzer, K., 2016. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360.
Ioannidou, A., Chatzilari, E., Nikolopoulos, S., Kompatsiaris, I., 2017. Deep learning advances in computer vision with 3d data: A survey. ACM Computing Surveys 50, 1–38.
Ivanecký, B.J., 2016. Depth estimation by convolutional neural networks. Master thesis, Brno University of Technology.
Jampani, V., Kiefel, M., Gehler, P.V., 2016. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4452–4461.
Jaritz, M., Gu, J., Su, H., 2019. Multi-view pointnet for 3d scene understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0.
Jeong, J., Yoon, T.S., Park, J.B., 2018. Multimodal sensor-based semantic 3d mapping for a large-scale environment. Expert Systems with Applications 105, 1–10.
Jiang, H., Yan, F., Cai, J., Zheng, J., Xiao, J., 2020a. End-to-end 3d point cloud instance segmentation without detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12796–12805.
Jiang, J., Zhang, Z., Huang, Y., Zheng, L., 2017. Incorporating depth into both cnn and crf for indoor semantic segmentation, in: Proceedings of the IEEE International Conference on Software Engineering and Service Science, IEEE. pp. 525–530.
Jiang, L., Zhao, H., Shi, S., Liu, S., Fu, C., Jia, J., 2020b. Pointgroup: Dual-set point grouping for 3d instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4867–4876.
Jiang, M., Wu, Y., Zhao, T., Zhao, Z., Lu, C., 2018. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. arXiv preprint arXiv:1807.00652.
Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S., 2017. 3d shape segmentation with projective convolutional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3779–3788.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al., 2023. Segment anything. arXiv preprint arXiv:2304.02643.
Klokov, R., Lempitsky, V., 2017. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 863–872.
Kochanov, D., Ošep, A., Stückler, J., Leibe, B., 2016. Scene flow propagation for semantic mapping and object discovery in dynamic street scenes, in: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, IEEE. pp. 1785–1792.
Komarichev, A., Zhong, Z., Hua, J., 2019. A-cnn: Annularly convolutional neural networks on point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430.
Kong, L., Liu, Y., Chen, R., Ma, Y., Zhu, X., Li, Y., Hou, Y., Qiao, Y., Liu, Z., 2023. Rethinking range view representation for lidar segmentation. arXiv preprint arXiv:2303.05367.
Lahoud, J., Ghanem, B., Pollefeys, M., Oswald, M., 2019. 3d instance segmentation via multi-task metric learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9256–9266.
Lai, X., Chen, Y., Lu, F., Liu, J., Jia, J., 2023. Spherical transformer for lidar-based 3d recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17545–17555.
Lai, X., Liu, J., Jiang, L., Wang, L., Zhao, H., Liu, S., Qi, X., Jia, J., 2022. Stratified transformer for 3d point cloud segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8500–8509.
Landrieu, L., Simonovsky, M., 2018. Large-scale point cloud semantic segmentation with superpoint graphs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4558–4567.
Lawin, F., Danelljan, M., Tosteberg, P., Bhat, G., Khan, F., Felsberg, M., 2017. Deep projective 3d semantic segmentation, in: Computer Analysis of Images and Patterns, Springer. pp. 95–107.
Le, T., Duan, Y., 2018. Pointgrid: A deep network for 3d shape understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9204–9214.
Lei, H., Akhtar, N., Mian, A., 2019. Octree guided cnn with spherical kernels for 3d point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9631–9640.
Lei, H., Akhtar, N., Mian, A., 2020. Spherical kernel for efficient graph convolution on 3d point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Lei, H., Akhtar, N., Shah, M., Mian, A., 2023. Mesh convolution with continuous filters for 3-d surface parsing. IEEE Transactions on Neural Networks and Learning Systems.
Li, G., Muller, M., Thabet, A., Ghanem, B., 2019a. Deepgcns: Can gcns go as deep as cnns?, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276.
Li, J., Chen, B.M., Hee Lee, G., 2018a. So-net: Self-organizing network for point cloud analysis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9397–9406.
Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B., 2018b. Pointcnn: Convolution on x-transformed points. Advances in Neural Information Processing Systems 31, 820–830.
Li, Y., Ma, L., Zhong, Z., Cao, D., Li, J., 2019b. Tgnet: Geometric graph cnn on 3-d point cloud segmentation. IEEE Transactions on Geoscience and Remote Sensing 58, 3588–3600.
Li, Z., Gan, Y., Liang, X., Yu, Y., Cheng, H., Lin, L., 2016. Lstm-cf: Unifying context modeling and fusion with lstms for rgb-d scene labeling, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 541–557.
Lian, Q., Li, P., Chen, X., 2022. Monojsg: Joint semantic and geometric cost volume for monocular 3d object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1070–1079.
Liang, Z., Li, Z., Xu, S., Tan, M., Jia, K., 2021. Instance segmentation in 3d scenes using semantic superpoint tree networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2783–2792.
Liang, Z., Yang, M., Deng, L., Wang, C., Wang, B., 2019a. Hierarchical depthwise graph convolutional neural network for 3d semantic segmentation of point clouds, in: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE. pp. 8152–8158.
Liang, Z., Yang, M., Wang, C., 2019b. 3d graph embedding learning with a structure-aware loss function for point cloud semantic instance segmentation. arXiv preprint arXiv:1902.05247.
Lin, D., Chen, G., Cohen-Or, D., Heng, P., Huang, H., 2017. Cascaded feature network for semantic segmentation of rgb-d images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1311–1319.
Liu, C., Furukawa, Y., 2019. Masc: multi-scale affinity with sparse convolution for 3d instance segmentation. arXiv preprint arXiv:1902.04478.
Liu, F., Li, S., Zhang, L., Zhou, C., Ye, R., Wang, Y., Lu, J., 2017. 3dcnn-dqn-rnn: A deep reinforcement learning framework for semantic parsing of large-scale 3d point clouds, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5678–5687.
Liu, F., Shen, C., Lin, G., Reid, I., 2015. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2024–2039.
Liu, H., Wu, W., Wang, X., Qian, Y., 2018a. Rgb-d joint modelling with scene geometric information for indoor semantic segmentation. Multimedia Tools and Applications 77, 22475–22488.
Liu, J., Wang, Y., Li, Y., Fu, J., Li, J., Lu, H., 2018b. Collaborative deconvolutional neural networks for joint depth estimation and semantic segmentation. IEEE Transactions on Neural Networks and Learning Systems 29, 5655–5666.
Liu, W., Sun, J., Li, W., Hu, T., Wang, P., 2019a. Deep learning on point clouds and its application: A survey. Sensors 19, 4188.
Liu, Y., Yang, S., Li, B., Zhou, W., Xu, J., Li, H., Lu, Y., 2018c. Affinity derivation and graph merge for instance segmentation, in: Proceedings of the European Conference on Computer Vision, pp. 686–703.
Liu, Z., Tang, H., Lin, Y., Han, S., 2019b. Point-voxel cnn for efficient 3d deep learning, in: Advances in Neural Information Processing Systems, pp. 965–975.
Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110.
Ma, L., Stückler, J., Kerl, C., Cremers, D., 2017. Multi-view deep learning for consistent semantic mapping with rgb-d cameras, in: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, IEEE. pp. 598–605.
Ma, Y., Guo, Y., Liu, H., Lei, Y., Wen, G., 2020. Global context reasoning for semantic segmentation of 3d point clouds, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2931–2940.
Maturana, D., Scherer, S., 2015. Voxnet: A 3d convolutional neural network for real-time object recognition, in: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, IEEE. pp. 922–928.
McCormac, J., Handa, A., Davison, A., Leutenegger, S., 2017. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks, in: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE. pp. 4628–4635.
Meng, H., Gao, L., Lai, Y., Manocha, D., 2019. Vv-net: Voxel vae net with group convolutions for point cloud segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8500–8508.
Meyer, G.P., Charland, J., Hegde, D., Laddha, A., Vallespi-Gonzalez, C., 2019. Sensor fusion for joint 3d object detection and semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.
Milioto, A., Vizzo, I., Behley, J., Stachniss, C., 2019. Rangenet++: Fast and accurate lidar semantic segmentation, in: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, IEEE. pp. 4213–4220.
Morton, G.M., 1966. A computer oriented geodetic data base and a new technique in file sequencing.
Mousavian, A., Pirsiavash, H., Košecká, J., 2016. Joint semantic segmentation and depth estimation with deep convolutional networks, in: Proceedings of the International Conference on 3D Vision, IEEE. pp. 611–619.
Narita, G., Seno, T., Ishikawa, T., Kaji, Y., 2019. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. arXiv preprint arXiv:1903.01177.
Naseer, M., Khan, S., Porikli, F., 2018. Indoor scene understanding in 2.5/3d for autonomous agents: A survey. IEEE Access 7, 1859–1887.
Ngo, T.D., Hua, B.S., Nguyen, K., 2023. Isbnet: a 3d point cloud instance segmentation network with instance-aware sampling and box-aware dynamic convolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13550–13559.
Park, C., Jeong, Y., Cho, M., Park, J., 2022. Fast point transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16949–16958.
Park, J., Kim, C., Kim, S., Jo, K., 2023. Pcscnet: Fast 3d semantic segmentation of lidar point cloud for autonomous car using point convolution and sparse convolution network. Expert Systems with Applications 212, 118815.
Pham, Q., Hua, B., Nguyen, T., Yeung, S., 2019a. Real-time progressive 3d semantic segmentation for indoor scenes, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, IEEE. pp. 1089–1098.
Pham, Q.H., Nguyen, T., Hua, B.S., Roig, G., Yeung, S.K., 2019b. Jsis3d: joint semantic-instance segmentation of 3d point clouds with multi-task pointwise networks and multi-value conditional random fields, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8827–8836.
Qi, C.R., Su, H., Mo, K., Guibas, L.J., 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 652–660.
Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30, 5099–5108.
Qi, X., Liao, R., Jia, J., Fidler, S., Urtasun, R., 2017c. 3d graph neural networks for rgbd semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5199–5208.
Qian, G., Li, Y., Peng, H., Mai, J., Hammoud, H.A.A.K., Elhoseiny, M., Ghanem, B., 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. arXiv preprint arXiv:2206.04670.
Raj, A., Maturana, D., Scherer, S., 2015. Multi-scale convolutional architecture for semantic segmentation. Robotics Institute, Carnegie Mellon University, Tech. Rep. CMU-RI-TR-15-21.
Ran, H., Liu, J., Wang, C., 2022. Surface representation for point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18942–18952.
Rethage, D., Wald, J., Sturm, J., Navab, N., Tombari, F., 2018. Fully-convolutional point networks for large-scale point clouds, in: Proceedings of the European Conference on Computer Vision, pp. 596–611.
Riegler, G., Osman Ulusoy, A., Geiger, A., 2017. Octnet: Learning deep 3d representations at high resolutions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3577–3586.
Riemenschneider, H., Bódis-Szomorú, A., Weissenberg, J., Van Gool, L., 2014. Learning where to classify in multi-view semantic segmentation, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 516–532.
Rosu, R.A., Schütt, P., Quenzel, J., Behnke, S., 2019. Latticenet: Fast point cloud segmentation using permutohedral lattices. arXiv preprint arXiv:1912.05905.
Roynard, X., Deschaud, J., Goulette, F., 2018. Paris-lille-3d: A large and high-quality ground-truth urban point cloud dataset for automatic segmentation and classification. The International Journal of Robotics Research 37, 545–557.
Shen, Y., Feng, C., Yang, Y., Tian, D., 2018. Mining point cloud local structures by kernel correlation and graph pooling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4548–4557.
Shi, H., Lin, G., Wang, H., Hung, T.Y., Wang, Z., 2020. Spsequencenet: Semantic segmentation network on 4d point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4574–4583.
Shi, H., Wei, J., Li, R., Liu, F., Lin, G., 2022. Weakly supervised segmentation on outdoor 4d point clouds with temporal matching and spatial graph propagation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11840–11849.
Silberman, N., Fergus, R., 2011. Indoor scene segmentation using a structured light sensor, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, IEEE. pp. 601–608.
Silberman, N., Hoiem, D., Kohli, P., Fergus, R., 2012. Indoor segmentation and support inference from rgbd images, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 746–760.
Simonovsky, M., Komodakis, N., 2017. Dynamic edge-conditioned filters in convolutional neural networks on graphs, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
Song, S., Lichtenberg, S.P., Xiao, J., 2015. Sun rgb-d: A rgb-d scene understanding benchmark suite, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 567–576.
Song, Y., Chen, X., Li, J., Zhao, Q., 2017. Embedding 3d geometric features for rigid object part segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 580–588.
Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M.H., Kautz, J., 2018. Splatnet: Sparse lattice networks for point cloud processing, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2530–2539.
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E., 2015. Multi-view convolutional neural networks for 3d shape recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 945–953.
Su, Y., Xu, X., Jia, K., 2023. Weakly supervised 3d point cloud segmentation via multi-prototype learning. IEEE Transactions on Circuits and Systems for Video Technology.
Tatarchenko, M., Park, J., Koltun, V., Zhou, Q.Y., 2018. Tangent convolutions for dense prediction in 3d, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3887–3896.
Tchapmi, L., Choy, C., Armeni, I., Gwak, J., Savarese, S., 2017. Segcloud: Semantic segmentation of 3d point clouds, in: Proceedings of the International Conference on 3D Vision, IEEE. pp. 537–547.
Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J., 2019. Kpconv: Flexible and deformable convolution for point clouds, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6411–6420.
Valipour, S., Siam, M., Jagersand, M., Ray, N., 2017. Recurrent fully convolutional networks for video segmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, IEEE. pp. 29–36.
Verma, N., Boyer, E., Verbeek, J., 2018. Feastnet: Feature-steered graph convolutions for 3d shape analysis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2598–2606.
Vu, T., Kim, K., Luu, T.M., Nguyen, T., Yoo, C.D., 2022. Softgroup for 3d instance segmentation on point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2708–2717.
Wang, C., Samari, B., Siddiqi, K., 2018a. Local spectral graph convolution for point set feature learning, in: Proceedings of the European Conference on Computer Vision, pp. 52–66.
Wang, J., Li, X., Sullivan, A., Abbott, L., Chen, S., 2022. Pointmotionnet: Point-wise motion learning for large-scale lidar point clouds sequences, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4419–4428.
Wang, J., Wang, Z., Tao, D., See, S., Wang, G., 2016. Learning common and specific features for rgb-d semantic segmentation with deconvolutional networks, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 664–679.
Wang, P., Gan, Y., Shui, P., Yu, F., Zhang, Y., Chen, S., Sun, Z., 2018b. 3d shape segmentation via shape fully convolutional networks. Computers & Graphics 70, 128–139.
Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A., 2015. Towards unified depth and semantic prediction from a single image, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2800–2809.
Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X., 2017. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics 36, 1–11.
Wang, S., Suo, S., Ma, W.C., Pokrovsky, A., Urtasun, R., 2018c. Deep parametric continuous convolutional neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2589–2597.
Wang, W., Neumann, U., 2018. Depth-aware cnn for rgb-d segmentation, in: Proceedings of the European Conference on Computer Vision, pp. 135–150.
Wang, W., Yu, R., Huang, Q., Neumann, U., 2018d. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2569–2578.
Wang, X., Liu, S., Shen, X., Shen, C., Jia, J., 2019a. Associatively segmenting instances and semantics in point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4096–4105.
Wang, Y., Asafi, S., Van Kaick, O., Zhang, H., Cohen-Or, D., Chen, B., 2012. Active co-analysis of a set of shapes. ACM Transactions on Graphics 31, 1–10.
Wang, Y., Shi, T., Yun, P., Tai, L., Liu, M., 2018e. Pointseg: Real-time semantic segmentation based on 3d lidar point cloud. arXiv preprint arXiv:1807.06288.
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M., 2019b. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics 38, 1–12.
Wang, Z., Lu, F., 2019. Voxsegnet: Volumetric cnns for semantic part segmentation of 3d shapes. IEEE Transactions on Visualization and Computer Graphics.
Wei, L.Y., 2008. Parallel poisson disk sampling. ACM Transactions on Graphics 27, 1–9.
Wei, Y., Liu, H., Xie, T., Ke, Q., Guo, Y., 2022. Spatial-temporal transformer for 3d point cloud sequences, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1171–1180.
Wu, B., Wan, A., Yue, X., Keutzer, K., 2018a. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud, in: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE. pp. 1887–1893.
Wu, B., Zhou, X., Zhao, S., Yue, X., Keutzer, K., 2019a. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud, in: Proceedings of the IEEE International Conference on Robotics and Automation, IEEE. pp. 4376–4382.
Wu, W., Qi, Z., Fuxin, L., 2019b. Pointconv: Deep convolutional networks on 3d point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630.
Wu, X., Lao, Y., Jiang, L., Liu, X., Zhao, H., 2022a. Point transformer v2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems 35, 33330–33342.
Wu, Y., Shi, M., Du, S., Lu, H., Cao, Z., Zhong, W., 2022b. 3d instances as 1d kernels, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 235–252.
Wu, Y., Wu, Y., Gkioxari, G., Tian, Y., 2018b. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J., 2015. 3d shapenets: A deep representation for volumetric shapes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
Wu, Z., Zhou, Z., Allibert, G., Stolz, C., Demonceaux, C., Ma, C., 2022c. Transformer fusion for indoor rgb-d semantic segmentation. Available at SSRN 4251286.
Xia, Z., Liu, Y., Li, X., Zhu, X., Ma, Y., Li, Y., Hou, Y., Qiao, Y., 2023. Scpnet: Semantic scene completion on point cloud, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17642–17651.
Xiang, Y., Fox, D., 2017. Da-rnn: Semantic mapping with data associated recurrent neural networks. arXiv preprint arXiv:1703.03098.
Xiao, A., Huang, J., Guan, D., Zhang, X., Lu, S., Shao, L., 2023. Unsupervised point cloud representation learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Xie, Y., Jiaojiao, T., Zhu, X., 2020a. Linking points with labels in 3d: A review of point cloud semantic segmentation. IEEE Geoscience and Remote Sensing Magazine.
Xie, Z., Chen, J., Peng, B., 2020b. Point clouds learning with attention-based graph convolution networks. Neurocomputing.
Xu, C., Wu, B., Wang, Z., Zhan, W., Vajda, P., Keutzer, K., Tomizuka, M., 2020. Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation. arXiv preprint arXiv:2004.01803.
Xu, H., Dong, M., Zhong, Z., 2017. Directionally convolutional networks for 3d shape segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2698–2707.
Xu, M., Ding, R., Zhao, H., Qi, X., 2021. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3173–3182.
Xu, Y., Fan, T., Xu, M., Zeng, L., Qiao, Y., 2018. Spidercnn: Deep learning on point sets with parameterized convolutional filters, in: Proceedings of the European Conference on Computer Vision, pp. 87–102.
Yan, X., Zheng, C., Li, Z., Wang, S., Cui, S., 2020. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5589–5598.
Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., Trigoni, N., 2019. Learning object bounding boxes for 3d instance segmentation on point clouds. Advances in Neural Information Processing Systems 32, 6740–6749.
Yang, S., Huang, Y., Scherer, S., 2017. Semantic 3d occupancy mapping through efficient high order crfs, in: Proceedings of the IEEE International Conference on Intelligent Robots and Systems, IEEE. pp. 590–597.
Yang, Y., Xu, Y., Zhang, C., Xu, Z., Huang, J., 2022. Hierarchical vision transformer with channel attention for rgb-d image segmentation, in: Proceedings of the 4th International Symposium on Signal Processing Systems, pp. 68–73.
Ye, X., Li, J., Huang, H., Du, L., Zhang, X., 2018. 3d recurrent neural networks with context fusion for point cloud semantic segmentation, in: Proceedings of the European Conference on Computer Vision, pp. 403–417.
Yi, L., Kim, V., Ceylan, D., Shen, I., Yan, M., Su, H., Lu, C., Huang, Q., Sheffer, A., Guibas, L., 2016. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics 35, 1–12.
Yi, L., Su, H., Guo, X., Guibas, L.J., 2017. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2282–2290.
Yi, L., Zhao, W., Wang, H., Sung, M., Guibas, L.J., 2019. Gspn: Generative shape proposal network for 3d instance segmentation in point cloud, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3947–3956.
Ying, X., Chuah, M.C., 2022. Uctnet: Uncertainty-aware cross-modal transformer network for indoor rgb-d semantic segmentation, in: Proceedings of the European Conference on Computer Vision, Springer. pp. 20–37.
Yu, F., Liu, K., Zhang, Y., Zhu, C., Xu, K., 2019. Partnet: A recursive part decomposition network for fine-grained and hierarchical shape segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9491–9500.
Yuan, X., Shi, J., Gu, L., 2021. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Systems with Applications 169, 114417.
Yue, C., Wang, Y., Tang, X., Chen, Q., 2022. Drgcnn: Dynamic region graph convolutional neural network for point clouds. Expert Systems with Applications 205, 117663.
Zeng, W., Gevers, T., 2018. 3dcontextnet: Kd tree guided hierarchical learning of point clouds using local and global contextual cues, in: Proceedings of the European Conference on Computer Vision, pp. 0–0.
Zhang, C., Wan, H., Shen, X., Wu, Z., 2022. Patchformer: An efficient point transformer with patch attention, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11799–11808.
Zhang, Y., Zhou, Z., David, P., Yue, X., Xi, Z., Gong, B., Foroosh, H., 2020. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9601–9610.
Zhang, Z., Han, X., Dong, B., Li, T., Yin, B., Yang, X., 2023. Point cloud scene completion with joint color and semantic estimation from single rgb-d image. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Zhao, C., Sun, L., Purkait, P., Duckett, T., Stolkin, R., 2018. Dense rgb-d semantic mapping with pixel-voxel neural network. Sensors 18, 3099.
Zhao, H., Jiang, L., Fu, C.W., Jia, J., 2019a. Pointweb: Enhancing local neighborhood features for point cloud processing, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5565–5573.
Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V., 2021. Point transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268.
Zhao, Y., Birdal, T., Deng, H., Tombari, F., 2019b. 3d point capsule networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1009–1018.

Yong He received the M.S. degree from China University of Mining and Technology, Xuzhou, Jiangsu, China, in 2018. He is currently pursuing the Ph.D. degree with Hunan University, Changsha, China, and is also a Visiting Scholar with the University of Western Australia, Perth, Australia. His research interests include computer vision, point cloud analysis, and deep learning.

Hongshan Yu received the B.S., M.S. and Ph.D. degrees in Control Science and Technology from the College of Electrical and Information Engineering, Hunan University, Changsha, China, in 2001, 2004 and 2007, respectively. From 2011 to 2012, he worked as a postdoctoral researcher in the Laboratory for Computational Neuroscience at the University of Pittsburgh, USA. He is currently a professor at Hunan University and associate dean of the National Engineering Laboratory for Robot Visual Perception and Control. His research interests include autonomous mobile robots and machine vision.

Xiaoyan Liu received her Ph.D. degree in Process and System Engineering in 2005 from Otto-von-Guericke University Magdeburg, Germany. She is currently a professor at Hunan University. Her research interests include machine vision and pattern recognition.

Zhengeng Yang received the B.S. and M.S. degrees from Central South University, Changsha, China, in 2009 and 2012, respectively, and the Ph.D. degree from Hunan University, Changsha, China, in 2020. He is currently a postdoctoral researcher at Hunan University, Changsha. He was a Visiting Scholar with the University of Pittsburgh, Pittsburgh, PA, during 2018-2020. His research interests include computer vision, image analysis, and machine learning.

Wei Sun received the M.S. and Ph.D. degrees in Control Science and Technology from Hunan University, Changsha, China, in 1999 and 2002, respectively. He is currently a Professor at Hunan University. His research interests include artificial intelligence, robot control, complex mechanical and electrical control systems, and automotive electronics.

Ajmal Mian is a Professor of computer science with The University of Western Australia. His research interests include 3D computer vision, machine learning, point cloud analysis, human action recognition, and video description. He is a Fellow of the International Association for Pattern Recognition and has received several awards, including the HBF Mid-Career Scientist of the Year Award, the West Australian Early Career Scientist of the Year Award, the Aspire Professional Development Award, the Vice-Chancellor's Mid-Career Research Award, the Outstanding Young Investigator Award, the IAPR Best Scientific Paper Award, the EH Thompson Award, and the Excellence in Research Supervision Award. He has received three prestigious fellowships and several major research grants from the Australian Research Council, the National Health and Medical Research Council of Australia and the US Dept of Defense DARPA, with total funding of over $40 Million. He serves as a Senior Editor for the IEEE Transactions on Neural Networks and Learning Systems, and an Associate Editor for the IEEE Transactions on Image Processing and the Pattern Recognition Journal.