Mask3D: Mask Transformer For 3D Semantic Instance Segmentation
Jonas Schult¹, Francis Engelmann²,³, Alexander Hermans¹, Or Litany⁴, Siyu Tang², Bastian Leibe¹
Fig. 3: Number of queries and decoder layers.

queries. Parametric refers to learned positions and features [2], while non-parametric refers to point positions sampled with furthest point sampling (FPS) [44]. When selecting query positions with FPS, we can either initialize the queries to zero (②, as in 3DETR [37]) or use the point features at the sampled position (③). Tab. IV (left) shows the effects of using parametric or non-parametric queries on ScanNet validation (5 cm). In line with [37], we see that non-parametric queries ② outperform parametric queries ①. Interestingly, ③ results in degraded performance compared to both parametric ① and position-only non-parametric queries ②.

Number of Queries and Decoders. We analyze the effect of varying the number of queries K during inference on models trained with K = 100 and K = 200 non-parametric queries sampled with FPS. By increasing K from 100 to 200 during training, we observe a slight increase in performance (Fig. 3, left) at the cost of additional memory. When evaluating with fewer queries than used in training, we observe reduced performance but faster runtime. When evaluating with more queries than used in training, we observe slightly improved performance, typically by less than 1 mAP. Our final model uses K = 100 due to memory constraints when using 2 cm voxels in the feature backbone. In this study, we report scores using 5 cm voxels on ScanNet validation. We also analyse the mask quality that we obtain after each Transformer decoder layer in our trained model (Fig. 3, right). We see a rapid increase up to 4 layers, after which the quality increases more slowly.

Mask Loss. The mask module (Fig. 2, ◻ ∎) generates instance heatmaps for every instance query. After Hungarian matching, the corresponding ground truth mask is used to compute the mask loss L_mask. The binary cross entropy loss L_BCE is the obvious choice for binary segmentation tasks. However, it does not perform well under large class imbalance (few foreground mask points, many background points). The Dice loss L_dice is specifically designed to address such data imbalance. Tab. IV (right) shows scores on ScanNet validation for combinations of both losses. While L_dice improves over L_BCE, we observe an additional improvement by training our model with a weighted sum of both losses (Eq. 5).

C. Qualitative Results and Limitations

Fig. 4 shows several representative examples of Mask3D instance segmentation results on ScanNet. The scenes are quite diverse and present a number of challenges, including clutter, scanning artifacts and numerous similar objects. Still, our model produces robust results. It does have limitations, though. A systematic mistake that we observed is the merging of instances that are far apart (see Fig. 4, bottom left). As the attention mechanism can attend to the full point cloud, two objects with similar semantics and geometry can exhibit similar learned point features and are therefore combined into one instance even if they are far apart in the scene. This is less likely to happen with methods that explicitly encode geometric priors.

V. CONCLUSION

In this work, we have introduced Mask3D for 3D semantic instance segmentation. Mask3D is based on Transformer decoders and learns instance queries that, combined with learned point features, directly predict semantic instance masks without the need for hand-selected voting schemes or hand-crafted grouping mechanisms. We think that Mask3D is an attractive alternative to current voting-based approaches and expect to see follow-up work along this line of research.

Acknowledgments: This work is supported by the ERC Consolidator Grant DeeViSe (ERC-2017-CoG-773161), SNF Grant 200021 204840, compute resources from RWTH Aachen University (rwth1238) and the ETH AI Center post-doctoral fellowship. We additionally thank Alexey Nekrasov, Ali Athar and István Sárándi for helpful discussions and feedback.
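To illustrate the Mask Loss discussion, the following is a minimal pure-Python sketch of a weighted sum of binary cross entropy and Dice loss for a single matched mask. It is not the authors' implementation: the function names, the smoothing constant `smooth`, and the default weights `w_bce` and `w_dice` are placeholders chosen for illustration, and the inputs are assumed to be per-point foreground probabilities and binary ground-truth labels.

```python
import math


def bce_loss(probs, target, eps=1e-7):
    """Mean binary cross entropy over all points of one mask."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, target, smooth=1e-6):
    """Soft Dice loss: 1 - 2*|P & G| / (|P| + |G|).

    `smooth` is a small assumed constant that keeps the ratio defined
    when both the prediction and the ground truth are empty.
    """
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def mask_loss(probs, target, w_bce=1.0, w_dice=1.0):
    """Weighted sum of both losses; the weights are hypothetical defaults."""
    return w_bce * bce_loss(probs, target) + w_dice * dice_loss(probs, target)
```

For a mask with 2 foreground and 98 background points, predicting a foreground probability of 0.01 everywhere yields a BCE of roughly 0.1 but a Dice loss of about 0.99. The averaged BCE is dominated by the many easy background points, whereas the Dice term is normalized by the mask sizes and therefore still penalizes the missed instance.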
REFERENCES

[1] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In European Conference on Computer Vision, 2020.
[3] Meida Chen, Qingyong Hu, Thomas Hugues, Andrew Feng, Yu Hou, Kyle McCullough, and Lucio Soibelman. STPLS3D: A Large-Scale Synthetic and Real Aerial Photogrammetry 3D Point Cloud Dataset. arXiv:2203.09065, 2022.
[4] Shaoyu Chen, Jiemin Fang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Hierarchical Aggregation for 3D Instance Segmentation. In International Conference on Computer Vision, 2021.
[5] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention Mask Transformer for Universal Image Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[6] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. In Advances in Neural Information Processing Systems, 2021.
[7] Julian Chibane, Francis Engelmann, Tuan Anh Tran, and Gerard Pons-Moll. Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes. In European Conference on Computer Vision, 2022.
[8] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[10] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to Predict Crisp Boundaries. In European Conference on Computer Vision, 2018.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.
[12] Cathrin Elich, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. 3D Bird’s-eye-view Instance Segmentation. In German Conference on Pattern Recognition, 2019.
[13] Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, and Matthias Nießner. 3D-MPA: Multi-Proposal Aggregation for 3D Semantic Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[14] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996.
[15] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[16] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[17] Benjamin Graham and Laurens van der Maaten. Submanifold Sparse Convolutional Networks. arXiv:1706.01307, 2017.
[18] Lei Han, Tian Zheng, Lan Xu, and Lu Fang. OccuSeg: Occupancy-aware 3D Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In International Conference on Computer Vision, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[21] Tong He, Chunhua Shen, and Anton van den Hengel. DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[22] Ji Hou, Angela Dai, and Matthias Nießner. 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[23] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[24] Paul VC Hough. Machine Analysis of Bubble Chamber Pictures. In International Conference on High Energy Accelerators and Instrumentation, 1959.
[25] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic Filter Networks. In Neural Information Processing Systems, 2016.
[26] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[27] Harold W Kuhn. The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
[28] Jean Lahoud, Bernard Ghanem, Marc Pollefeys, and Martin R Oswald. 3D Instance Segmentation via Multi-Task Metric Learning. In International Conference on Computer Vision, 2019.
[29] Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified Transformer for 3D Point Cloud Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[30] Bastian Leibe, Aleš Leonardis, and Bernt Schiele. Robust Object Detection with Interleaved Categorization and Segmentation. International Journal of Computer Vision, 2008.
[31] Zhihao Liang, Zhihao Li, Songcen Xu, Mingkui Tan, and Kui Jia. Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks. In International Conference on Computer Vision, 2021.
[32] Chen Liu and Yasutaka Furukawa. MASC: Multi-Scale Affinity with Sparse Convolution for 3D Instance Segmentation. arXiv:1902.04478, 2019.
[33] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In International Conference on Computer Vision, 2021.
[34] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-Free 3D Object Detection via Transformers. In International Conference on Computer Vision, 2021.
[35] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
[36] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-time Object Recognition. In International Conference on Intelligent Robots and Systems, 2015.
[37] Ishan Misra, Rohit Girdhar, and Armand Joulin. An End-to-End Transformer Model for 3D Object Detection. In International Conference on Computer Vision, 2021.
[38] Alexey Nekrasov, Jonas Schult, Or Litany, Bastian Leibe, and Francis Engelmann. Mix3D: Out-of-Context Data Augmentation for 3D Scenes. In International Conference on 3D Vision, 2021.
[39] Xuran Pan, Zhuofan Xia, Shiji Song, Li Erran Li, and Gao Huang. 3D Object Detection With Pointformer. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[40] Chunghyun Park, Yoonwoo Jeong, Minsu Cho, and Jaesik Park. Fast Point Transformer. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[41] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep Hough Voting for 3D Object Detection in Point Clouds. In International Conference on Computer Vision, 2019.
[42] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep Learning on Point Sets for 3D Classification and Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[43] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and Multi-View CNNs for Object Classification on 3D Data. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[44] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems, 2017.
[45] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Neural Information Processing Systems, 2015.
[46] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[47] David Rozenberszki, Or Litany, and Angela Dai. Language-Grounded Indoor 3D Semantic Segmentation in the Wild. In European Conference on Computer Vision, 2022.
[48] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection. arXiv:2112.00322, 2021.
[49] Leslie N Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. In Artificial intelligence and machine learning for multi-domain operations applications, 2019.
[50] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[51] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[52] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In Advances in Neural Information Processing Systems, 2020.
[53] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. KPConv: Flexible and Deformable Convolution for Point Clouds. In International Conference on Computer Vision, 2019.
[54] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional Convolutions for Instance Segmentation. In European Conference on Computer Vision, 2020.
[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems, 2017.
[56] Thang Vu, Kookhoi Kim, Tung M Luu, Xuan Thanh Nguyen, and Chang D Yoo. SoftGroup for 3D Instance Segmentation on Point Clouds. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[57] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics, 2017.
[58] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[59] Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen, and Jiaya Jia. Associatively Segmenting Instances and Semantics in Point Clouds. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[60] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[61] Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, and Niki Trigoni. Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds. In Advances in Neural Information Processing Systems, 2019.
[62] Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas J Guibas. GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[63] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point Transformer. In International Conference on Computer Vision, 2021.
[64] Min Zhong, Xinghao Chen, Xiaokang Chen, Gang Zeng, and Yunhe Wang. MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation. arXiv:2203.14662, 2022.
Mask3D: Mask Transformer for 3D Semantic Instance Segmentation
Supplementary Material
Fig. 7: Qualitative Analysis of DBSCAN Postprocessing. Mask3D occasionally predicts masks containing two instances of the same
class. In (b), two windows are merged into a single instance since their underlying point cloud features result in a high response when
convolved with the instance query (cf. heatmap in (c)). In (d), we apply DBSCAN as a postprocessing routine to split erroneously merged
instances based on spatial contiguity. We do not see this effect for voting-based methods, as they explicitly encode geometric priors (e)-(f).
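
The splitting step described in the caption can be sketched as follows. This is a minimal, pure-Python illustration of DBSCAN [14] applied to the points of one predicted mask, not the authors' postprocessing code: the helper names (`region_query`, `dbscan`, `split_mask`) and the default values of `eps` and `min_pts` are assumptions made for this example, and the brute-force neighbor search is only practical for small point sets.

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (brute force)."""
    px, py, pz = points[i]
    eps2 = eps * eps
    return [j for j, (x, y, z) in enumerate(points)
            if (x - px) ** 2 + (y - py) ** 2 + (z - pz) ** 2 <= eps2]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def split_mask(points, mask, eps=0.5, min_pts=2):
    """Split one boolean instance mask into spatially contiguous masks.

    eps and min_pts are placeholder values, not the paper's settings.
    Points labeled as noise are dropped from all returned masks.
    """
    idx = [i for i, m in enumerate(mask) if m]
    labels = dbscan([points[i] for i in idx], eps, min_pts)
    n_clusters = max(labels, default=-1) + 1
    masks = [[False] * len(points) for _ in range(n_clusters)]
    for i, lab in zip(idx, labels):
        if lab >= 0:
            masks[lab][i] = True
    return masks
```

Applied to a mask that covers two well-separated groups of points, such as the two windows in (b), `split_mask` returns one mask per group. In practice, a library implementation with a spatial index would be preferable for full scenes.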