
Mask3D: Mask Transformer for 3D Semantic Instance Segmentation

Jonas Schult1 , Francis Engelmann2, 3 , Alexander Hermans1 , Or Litany4 , Siyu Tang2 , Bastian Leibe1

arXiv:2210.03105v2 [cs.CV] 12 Apr 2023

Abstract— Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the successes of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that we can leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. In our model – called Mask3D – each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches, since it neither relies on (1) voting schemes, which require hand-selected geometric properties (such as centers), nor on (2) geometric grouping mechanisms requiring manually-tuned hyper-parameters (e.g. radii), and (3) it enables a loss that directly optimizes instance masks. Mask3D sets a new state-of-the-art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP) and ScanNet200 test (+12.4 mAP).

Fig. 1: Mask3D. We train an end-to-end model for 3D semantic instance segmentation on point clouds. Given an input 3D point cloud (left), our Transformer-based model uses an attention mechanism to produce instance heatmaps across all points (center) and directly predicts all semantic object instances in parallel (right). [Panels: Input 3D Scene, Instance Heatmaps, 3D Semantic Instances.]

1 Computer Vision Group, RWTH Aachen University, Germany.
2 Computer Vision and Learning Group, ETH Zürich, Switzerland.
3 ETH AI Center, Zürich, Switzerland.
4 NVIDIA, Santa Clara, USA.

I. INTRODUCTION

This work addresses the task of semantic instance segmentation of 3D scenes. That is, given a 3D point cloud, the desired output is a set of object instances represented as binary foreground masks (over all input points) with their corresponding semantic labels (e.g. 'chair', 'table', 'window'). Instance segmentation resides at the intersection of two problems: semantic segmentation and object detection. Therefore, methods have opted to either first learn semantic point features and group them into separate instances (bottom-up), or detect object instances and then refine their semantic mask (top-down). Bottom-up approaches (ASIS [59], SGPN [58], 3D-BEVIS [12]) employ contrastive learning, mapping points to a high-dimensional feature space where features of the same instance are close together, and far apart otherwise. Top-down methods (3D-SIS [22], 3D-BoNet [61]) use an approach akin to Mask R-CNN [19]: first detect instances as bounding boxes and then perform mask segmentation on each box individually. While 3D-SIS [22] relies on predefined anchor boxes [19], 3D-BoNet [61] proposes an interesting variation that predicts bounding boxes from a global scene descriptor and optimizes an association loss based on bipartite matching [27]. A major step forward was sparked by powerful feature backbones [17, 53, 57] such as sparse convolutional networks [8, 16] that improve over existing PointNets [42, 44] and dense 3D CNNs [36, 43, 60]. Well-established 2D CNN architectures [20, 46] can now easily be adapted to sparse 3D data. These models can process large-scale 3D scenes in one pass, which is necessary to capture global scene context at multiple scales. As a result, bottom-up approaches which benefit from strong features (MTML [28], MASC [32]) experienced another performance boost. Soon after, inspired by Hough voting approaches [24, 30], VoteNet [41] proposed center-voting for 3D object detection. Instead of mapping points to an abstract high-dimensional feature space (as in bottom-up approaches), points now vote for their object center – votes from the same object are then closer to each other, which enables geometric grouping into instance masks. This idea quickly influenced the 3D instance segmentation field, and by now, the vast majority of current state-of-the-art 3D instance segmentation methods [4, 13, 18, 26, 56] make use of both object center-voting and sparse feature backbones.

Although 3D instance segmentation has made impressive progress, current approaches have several major problems: typical state-of-the-art models are based on manually-tuned components, such as voting mechanisms that predict hand-selected geometric properties (e.g., centers [26], bounding boxes [7], occupancy [18]), and heuristics for clustering the votes (e.g., dual-set grouping [26], proposal aggregation [13], set aggregation/filtering [4]). Another limitation of these models is that they are not designed to directly predict instance masks. Instead, masks are obtained by grouping votes, and the model is trained using proxy losses on the votes. A more elegant alternative consists of directly predicting and supervising instance masks, as in 3D-BoNet [61] or DyCo3D [21]. Recently, this idea gained popularity in 2D object detection (DETR [2]) and image segmentation (Mask2Former [5, 6]) but has so far received less attention in 3D [21, 37, 61]. At the same time, in 2D image processing, we observe a strong shift from ubiquitous CNN architectures [19, 20, 45, 46] towards Transformer-based models [6, 11, 33]. In 3D, the move towards Transformers is less pronounced, with only a few methods focusing on 3D object detection [34, 37, 39] or 3D semantic segmentation [29, 40, 63] and no methods for 3D instance segmentation. Overall, these approaches still lag behind current state-of-the-art methods [38, 48, 56, 57] in terms of performance.

In this work, we propose the first Transformer-based model for 3D semantic instance segmentation of large-scale scenes that sets new state-of-the-art scores over a wide range of datasets and addresses the aforementioned problems of hand-crafted model designs. The main challenge lies in directly predicting instance masks and their corresponding semantic labels. To this end, our model predicts instance queries that encode semantic and geometric information of each instance in the scene. Each instance query is then further decoded into a semantic class and an instance feature. The key idea (to directly generate masks) is to compute similarity scores between individual instance features and all point features in the point cloud [4, 6, 21]. This results in a heatmap over the point cloud, which (after normalization and thresholding) yields the final binary instance mask (c.f. Fig. 1). Our model, called Mask3D, builds on recent advances in both Transformers [5, 37] and 3D deep learning [8, 17, 57]: to compute strong point features, we leverage a sparse convolutional feature backbone [8] that efficiently processes full scenes and naturally provides multi-scale point features. To generate instance queries, we rely on stacked Transformer decoders [5, 6] that iteratively attend to learned point features in a coarse-to-fine fashion using non-parametric queries [37]. Unlike voting-based methods, directly predicting and supervising masks causes some challenges during training: before computing a mask loss, we first have to establish correspondences between predicted and annotated masks. A naïve solution would be to choose for each predicted mask the nearest ground truth mask [21]. However, this does not guarantee an optimal matching, and any unmatched annotated mask would not contribute to the loss. Instead, we perform bipartite graph matching to obtain optimal associations between ground truth and predicted masks [2, 61]. We evaluate our model on four challenging 3D instance segmentation datasets, ScanNet v2 [9], ScanNet200 [47], S3DIS [1] and STPLS3D [3], and significantly outperform prior art, even surpassing architectures that are highly tuned towards specific datasets. Our experimental study compares various query types and different mask losses, and evaluates the number of queries as well as Transformer decoder steps.

Our contributions are as follows: (1) We propose the first competitive Transformer-based model for 3D semantic instance segmentation. (2) Our model, named Mask3D, builds on domain-agnostic components, avoiding center voting, non-maximum suppression, or grouping heuristics, and overall requires less hand-tuning. (3) Mask3D achieves state-of-the-art performance on ScanNet, ScanNet200, S3DIS and STPLS3D. To reach that level of performance with a Transformer-based approach, it is key to predict instance queries that encode the semantics and geometry of the scene and objects.

II. RELATED WORK

3D Instance Segmentation. Numerous methods have been proposed for 3D semantic instance segmentation, including bottom-up approaches [12, 28, 32, 58, 59], top-down approaches [22, 61], and more recently, voting-based approaches [4, 13, 18, 26, 56]. MASC [32] uses a multi-scale hierarchical feature backbone, similar to ours; however, the multi-scale features are used to compute pairwise affinities followed by an offline clustering step. Such backbones are also successfully employed in other fields [5, 48]. Another influential work is DyCo3D [21], which is among the few approaches that directly predict instance masks without a subsequent clustering step. DyCo3D relies on dynamic convolutions [25, 54], which is similar in spirit to our mask prediction mechanism. However, it does not use optimal supervision assignment during training, resulting in subpar performance. Optimal assignment of the supervision signal was first implemented by 3D-BoNet [61] using Hungarian matching. Similar to ours, [61] directly predicts all instances in parallel. However, it uses only a single-scale scene descriptor, which cannot encode object masks of diverse sizes.

Transformers. Initially proposed by Vaswani et al. [55] for NLP, Transformers have recently revolutionized the field of computer vision with successful models such as ViT [11] for image classification, DETR [2] for 2D object detection, or Mask2Former [5, 6] for 2D segmentation tasks. The success of Transformers has been less prominent in the 3D point cloud domain, though, and recent Transformer-based methods focus on either 3D object detection [34, 37, 39] or 3D semantic segmentation [29, 40, 63]. Most of these rely on specific attention modifications to deal with the quadratic complexity of the attention [29, 39, 40, 63]. Liu et al. [34] use a vanilla Transformer decoder, but only to refine object proposals, whereas Misra et al. [37] are the first to show how to apply a vanilla Transformer to point clouds, still relying on an initial learned downsampling stage. DyCo3D [21] also uses a Transformer, however at the bottleneck of the feature backbone to increase the receptive field size, and is not related to our mechanism for 3D instance segmentation. In this work, we show how a vanilla Transformer decoder can be applied to the task of 3D semantic instance segmentation and achieve state-of-the-art performance.

III. METHOD

Fig. 2 illustrates our end-to-end 3D instance segmentation model Mask3D. As in Mask2Former [5], our model includes a feature backbone, and a Transformer decoder built from mask modules and Transformer decoder layers used for query refinement. At the core of the model are instance queries, each of which should represent one object instance in the scene and predict the corresponding point-level instance mask. To that end, the instance queries are iteratively refined by the Transformer decoder (Fig. 2), which allows the instance queries to cross-attend to point features extracted from the feature backbone and self-attend to the other instance queries. This process is repeated for multiple iterations and feature scales, yielding the final set of refined instance queries. A mask module consumes the refined instance queries together with the point features and returns (for each query) a semantic class and a binary instance mask based on the dot product between point features and instance queries. Next, we describe each of these components in more detail.

Fig. 2: Illustration of the Mask3D model. The feature backbone outputs multi-scale point features F, while the Transformer decoder iteratively refines the instance queries X. Given point features and instance queries, the mask module predicts for each query a semantic class and an instance heatmap, which (after thresholding) results in a binary instance mask B. τ applies a threshold of 0.5 and spatially rescales if required. ⋅ is the dot product. σ is the sigmoid function. We show a simplified model with fewer layers.
Sparse Feature Backbone. (Fig. 2) We use a sparse convolutional U-Net backbone with a symmetrical encoder and decoder, based on the MinkowskiEngine [8]. Given a colored input point cloud P ∈ R^{N×6} of size N, it is first quantized into M0 voxels V ∈ R^{M0×3}, where each voxel is assigned the average RGB color of the points within that voxel as its initial feature. Next to the full-resolution output feature map F0 ∈ R^{M0×D}, we also extract a multi-resolution hierarchy of features from the backbone decoder before upsampling to the next finer feature map. At each of these resolutions r ≥ 0 we can extract features for a set of Mr voxels, which we linearly project to a fixed and common dimension D, yielding feature matrices Fr ∈ R^{Mr×D}. We let the queries attend to features from coarser feature maps of the backbone decoder, i.e. r ≥ 1, and use the full-resolution feature map (r = 0) to compute the auxiliary and final per-voxel instance masks.
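The voxelization step can be illustrated without the MinkowskiEngine API. Below is a minimal PyTorch sketch that quantizes points to a voxel grid and averages the RGB colors of all points falling into the same voxel; the function name `voxelize` and the 2 cm voxel size are illustrative assumptions, not the authors' code.

```python
import torch

def voxelize(points_xyz: torch.Tensor, colors_rgb: torch.Tensor, voxel_size: float = 0.02):
    """Quantize a point cloud (N x 3) into voxels and average colors per voxel.

    Returns unique integer voxel coordinates (M0 x 3) and their mean RGB feature (M0 x 3).
    """
    voxel_coords = torch.floor(points_xyz / voxel_size).long()               # N x 3 integer grid coordinates
    unique_coords, inverse = torch.unique(voxel_coords, dim=0, return_inverse=True)
    num_voxels = unique_coords.shape[0]

    # Scatter-average the colors of all points that fall into the same voxel.
    feat_sum = torch.zeros(num_voxels, colors_rgb.shape[1]).index_add_(0, inverse, colors_rgb)
    counts = torch.zeros(num_voxels).index_add_(0, inverse, torch.ones(len(points_xyz)))
    mean_colors = feat_sum / counts.unsqueeze(1)
    return unique_coords, mean_colors

# Example: 10k random colored points, 2 cm voxels.
pts = torch.rand(10_000, 3) * 5.0
rgb = torch.rand(10_000, 3)
coords, feats = voxelize(pts, rgb)
```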
Mask Module. (Fig. 2) Given the set of K instance queries X ∈ R^{K×D}, we predict a binary mask for each instance and classify each of them as one of C classes or as being inactive. To create the binary mask, we map the instance queries through an MLP f_mask(⋅) to the same feature space as the backbone output features. We then compute the dot product between these instance features and the backbone features F0. The resulting similarity scores are fed through a sigmoid and thresholded at 0.5, yielding the final binary mask B ∈ {0, 1}^{M0×K}:

    B = { b_{i,j} = [ σ(F0 f_mask(X)^T)_{i,j} > 0.5 ] }.                              (1)

We apply the mask module to the refined queries X at each Transformer layer using the full-resolution feature map F0, to create auxiliary binary masks for the masked cross-attention of the following refinement step. When this mask is used as input for the masked cross-attention, we reduce the resolution according to the voxel feature resolution by average pooling. Next to the binary mask, we predict a single semantic class per instance. This step is done via a linear projection layer into C + 1 dimensions, followed by a softmax. While prior work [4, 13, 56] typically needs to obtain the semantic label of an instance via majority voting or grouping over per-point predicted semantics, this information is directly contained in the refined instance queries.
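Eq. (1) amounts to a dot product between projected instance queries and backbone features, followed by a sigmoid and a 0.5 threshold. A minimal sketch follows; the two-layer MLP for f_mask and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

D, K, M0, C = 128, 100, 50_000, 18          # feature dim, #queries, #voxels, #classes (illustrative)

f_mask = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))   # maps queries to the backbone feature space
f_class = nn.Linear(D, C + 1)                                          # C classes + one "inactive"/no-object class

X = torch.randn(K, D)     # refined instance queries
F0 = torch.randn(M0, D)   # full-resolution backbone features

heatmap = torch.sigmoid(F0 @ f_mask(X).T)    # M0 x K per-voxel instance heatmaps
B = heatmap > 0.5                             # binary instance masks, Eq. (1)
class_logits = f_class(X)                     # K x (C+1) semantic class predictions
```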
Query Refinement. (Fig. 2) The Transformer decoder starts with K instance queries and refines them through a stack of L Transformer decoder layers into a final set of accurate, scene-specific instance queries by cross-attending to scene features and reasoning at the instance level through self-attention. We discuss different types of instance queries in Sec. III-A. Each layer attends to one of the feature maps from the feature backbone using standard cross-attention:

    X = softmax(QK^T / √D) V.                                                        (2)

To do so, the voxel features Fr ∈ R^{Mr×D} are first linearly projected to a set of keys and values of fixed dimensionality, K, V ∈ R^{Mr×D}, and our K instance queries X are linearly projected to the queries Q ∈ R^{K×D}. This cross-attention thus allows the queries to extract information from the voxel features. The cross-attention is followed by a self-attention step between the queries, where the keys, values, and queries are all computed from linear projections of the instance queries. Without such inter-query communication, the model could not avoid multiple instance queries latching onto the same object, resulting in duplicate instance masks. Similar to most Transformer-based approaches, we use positional encodings for our keys and queries. We use Fourier positional encodings [52] based on voxel positions. We add the resulting positional encodings to their respective keys before computing the cross-attention. All instance queries are also assigned a fixed (and potentially learned) positional embedding that is not updated throughout the query refinement process. These positional encodings are added to the respective queries in the cross-attention, as well as to both the keys and queries in the self-attention. Instead of using the vanilla cross-attention (where each query attends to all voxel features in one resolution), we use a masked variant where each instance query only attends to the voxels within its corresponding intermediate instance mask B predicted by the previous layer. This is realized by adding −∞ to the attention matrix for all voxels for which the mask is 0. Eq. 2 then becomes:

    X = softmax(QK^T / √D + B′) V    with    B′_{ij} = −∞ ⋅ [B_{ij} = 0],            (3)

where [⋅] are Iverson brackets. In [5], masking out the context from the cross-attention improved segmentation. A likely reason is that the Transformer does not need to learn to focus on a specific instance instead of irrelevant context, but is forced to do so by design.
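A single masked cross-attention step (Eqs. (2)–(3)) can be written directly. The sketch below keeps one attention head, omits the positional encodings for brevity, and uses illustrative projection layers; the guard against fully-masked queries is an assumption rather than the authors' exact handling.

```python
import math
import torch
import torch.nn as nn

D = 128
to_q, to_k, to_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)

def masked_cross_attention(queries, voxel_feats, prev_mask):
    """queries: K x D instance queries, voxel_feats: Mr x D, prev_mask: Mr x K binary mask."""
    Q, Kmat, V = to_q(queries), to_k(voxel_feats), to_v(voxel_feats)
    attn = Q @ Kmat.T / math.sqrt(D)                                   # K x Mr attention logits, Eq. (2)
    attn = attn.masked_fill(~prev_mask.T.bool(), float('-inf'))        # Eq. (3): hide voxels outside the mask
    # Guard: a query with an empty mask would get a row of all -inf and produce NaNs after softmax.
    attn = torch.where(torch.isinf(attn).all(dim=1, keepdim=True), torch.zeros_like(attn), attn)
    return torch.softmax(attn, dim=-1) @ V                             # refined K x D queries

K_queries, Mr = 100, 2_000
X = torch.randn(K_queries, D)
Fr = torch.randn(Mr, D)
B_prev = torch.rand(Mr, K_queries) > 0.5
X_new = masked_cross_attention(X, Fr, B_prev)
```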
In practice, we attend to the 4 coarsest levels of the feature backbone, from coarse to fine, and do this a total of 3 times, resulting in L = 12 query refinement steps. The Transformer decoder layers share weights across all 3 iterations. Early experiments showed that this approach preserves the performance while keeping memory requirements in bound.

Sampled Cross-Attention. Point clouds in a training batch typically have different point counts. While the MinkowskiEngine can handle this, current Transformer implementations rely on a fixed number of points in each batch entry. In order to leverage well-tested Transformer implementations, in this work we propose to pad the voxel features and mask out the attention where needed. In case the number of voxels exceeds a certain threshold, we resort to sampling voxel features. To still allow instances to have access to all voxel features during cross-attention, we resample the voxels in each Transformer decoder layer, and use all voxels during inference. This can be seen as a form of dropout [50]. In practice, this procedure saves significant amounts of memory and is crucial for obtaining competitive performance. In particular, since the proposed sampled cross-attention requires less memory, it enables training on higher-resolution voxel grids, which is necessary for achieving competitive results on common benchmarks (e.g., 2 cm voxel side-length on ScanNet [9]).
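One way to batch point clouds of different sizes for a standard attention implementation is to pad the voxel features to a common length and pass a key-padding mask, subsampling scenes that exceed a budget. The following is a hedged sketch using `torch.nn.MultiheadAttention`; the sampling budget, function names and the per-layer resampling policy are assumptions.

```python
import torch
import torch.nn as nn

def pad_and_sample(voxel_feats_list, max_voxels=10_000):
    """voxel_feats_list: list of (Mi x D) tensors. Returns a padded batch (B x L x D) and a padding mask (B x L)."""
    sampled = []
    for f in voxel_feats_list:
        if f.shape[0] > max_voxels:                         # subsample over-long scenes (resampled every decoder layer)
            idx = torch.randperm(f.shape[0])[:max_voxels]
            f = f[idx]
        sampled.append(f)
    L = max(f.shape[0] for f in sampled)
    D = sampled[0].shape[1]
    batch = torch.zeros(len(sampled), L, D)
    pad_mask = torch.ones(len(sampled), L, dtype=torch.bool)  # True = padded position, ignored by attention
    for b, f in enumerate(sampled):
        batch[b, :f.shape[0]] = f
        pad_mask[b, :f.shape[0]] = False
    return batch, pad_mask

attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
feats = [torch.randn(3_000, 128), torch.randn(12_000, 128)]   # two scenes with different voxel counts
kv, pad_mask = pad_and_sample(feats)
queries = torch.randn(2, 100, 128)                            # B x K instance queries
out, _ = attn(queries, kv, kv, key_padding_mask=pad_mask)
```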
A. Training and Implementation Details

Correspondences. Given that there is no ordering to the set of instances in a scene and the set of predicted instances, we need to establish correspondences between the two sets during training. To that end, we use bipartite graph matching. While such a supervision approach is not new (e.g. [51, 61]), recently it has become more common in Transformer-based approaches [2, 5, 6]. We construct a cost matrix C ∈ R^{K×K̂}, where K̂ is the number of ground truth instances in a scene. The matching cost for a predicted instance k and a target instance k̂ is given by:

    C(k, k̂) = λ_dice L_dice(k, k̂) + λ_BCE L_BCE(k, k̂) + λ_cl L_CE(k, k̂).           (4)

We set the weights to λ_dice = λ_cl = 2.0 and λ_BCE = 5.0. The optimal solution to this cost assignment problem is efficiently found using the Hungarian method [27].
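The assignment defined by Eq. (4) can be sketched with `scipy.optimize.linear_sum_assignment`. The pairwise dice and BCE cost terms below follow the standard formulation and the weights from the text, but the exact cost implementation is an assumption rather than the authors' code.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_instances(pred_masks, pred_logits, gt_masks, gt_classes,
                    w_dice=2.0, w_bce=5.0, w_cl=2.0):
    """pred_masks: M x K heatmaps in [0,1], pred_logits: K x (C+1),
    gt_masks: M x K_hat binary, gt_classes: K_hat class ids. Returns matched (pred, gt) index pairs."""
    p = pred_masks.T.unsqueeze(1)                   # K x 1 x M
    g = gt_masks.T.unsqueeze(0)                     # 1 x K_hat x M
    inter = (p * g).sum(-1)
    dice = 1 - (2 * inter + 1) / (p.sum(-1) + g.sum(-1) + 1)                   # K x K_hat pairwise dice cost
    bce = -(g * torch.log(p + 1e-6) + (1 - g) * torch.log(1 - p + 1e-6)).mean(-1)
    cls_prob = pred_logits.softmax(-1)              # K x (C+1)
    cls_cost = -cls_prob[:, gt_classes]             # K x K_hat: negative probability of the ground-truth class
    C = w_dice * dice + w_bce * bce + w_cl * cls_cost
    rows, cols = linear_sum_assignment(C.detach().numpy())                     # Hungarian method [27]
    return list(zip(rows.tolist(), cols.tolist()))

matches = match_instances(torch.rand(5_000, 100), torch.randn(100, 19),
                          (torch.rand(5_000, 4) > 0.5).float(), torch.tensor([0, 3, 7, 2]))
```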
After establishing the correspondences, we can directly optimize each predicted mask as follows:

    L_mask = λ_BCE L_BCE + λ_dice L_dice,                                            (5)

where L_BCE is the binary cross-entropy loss (over the foreground and background of that mask) and L_dice is the Dice loss [10]. We use the default multi-class cross-entropy loss L_CE to supervise the classification. If a mask is left unassigned, we seek to maximize the associated no-object class, for which the L_CE loss is weighted by an additional factor λ_no-obj = 0.1. The overall loss over the auxiliary instance predictions after each of the L layers is defined as:

    L = Σ_{l=1}^{L} ( L_mask^l + λ_cl L_CE^l ).                                      (6)
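For a matched pair, the mask loss of Eq. (5) combines binary cross-entropy with the Dice loss. A compact sketch, with the weights from the text and the implementation details assumed:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    """pred: M probabilities in [0,1], target: M binary mask."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def mask_loss(pred_mask_logits, target_mask, w_bce=5.0, w_dice=2.0):
    """pred_mask_logits: M raw logits of one matched instance, target_mask: M in {0,1}. Eq. (5)."""
    bce = F.binary_cross_entropy_with_logits(pred_mask_logits, target_mask)
    dce = dice_loss(torch.sigmoid(pred_mask_logits), target_mask)
    return w_bce * bce + w_dice * dce

loss = mask_loss(torch.randn(5_000), (torch.rand(5_000) > 0.5).float())
```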
Prediction Confidence Score. We seek to assign a confidence to each predicted instance. While other existing methods require a dedicated ScoreNet [26], which is trained to estimate the intersection over union with the ground truth instances, we directly obtain the confidence scores from the refined query features and point features, as in Mask2Former [5]. We first select the queries with a dominant semantic class, for which we obtain the class confidence based on the softmax output c_cl ∈ [0, 1], which we additionally multiply with a mask-based confidence:

    c = c_cl ⋅ ( Σ_{i}^{M} m_i ⋅ [m_i > 0.5] ) / ( Σ_{i}^{M} [m_i > 0.5] ),          (7)

where m_i ∈ [0, 1] is the instance mask confidence for the i-th voxel given a single query. In essence, this is the mean mask confidence of all voxels falling inside of the binarized mask [5]. For an instance prediction to have a high confidence, it needs both a confident classification among the C classes and a mask that predominantly consists of high-confidence voxels.
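Eq. (7) multiplies the class confidence with the mean in-mask heatmap confidence; a direct sketch (per-query, names assumed):

```python
import torch

def prediction_confidence(class_logits, mask_probs):
    """class_logits: (C+1) logits of one query, mask_probs: M per-voxel heatmap in [0,1]. Eq. (7)."""
    probs = class_logits.softmax(-1)[:-1]             # drop the "no object" class
    c_cl = probs.max()                                 # confidence of the dominant semantic class
    inside = mask_probs > 0.5
    if inside.sum() == 0:
        return torch.tensor(0.0)
    c_mask = mask_probs[inside].mean()                 # mean confidence of voxels inside the binarized mask
    return c_cl * c_mask

score = prediction_confidence(torch.randn(19), torch.rand(5_000))
```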
Query Types. Methods like DETR [2] or Mask2Former [5, 6] use parametric queries: during training, both the instance query features and the corresponding positional encodings are learned. This means that the set of K instance queries has to be optimized during training in such a way that it can cover all instances present in a scene during inference. Misra et al. [37] instead propose to initialize queries with point coordinates sampled from the input point cloud by farthest point sampling. Since this initialization does not involve learned parameters, they are called non-parametric queries. Interestingly, the instance query features are initialized with zeros, and only the 3D position of the sampled points is used to set the corresponding positional encoding. We also experiment with a variant where we use the sampled point features as instance query features. Similar to [37], we observe improved performance when using non-parametric queries, although less pronounced. The key advantage of non-parametric queries is that, during inference, we can sample a different number of queries than during training. This provides a trade-off between inference speed and performance, without the need to retrain the model when using more instance queries.
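Non-parametric queries are seeded at farthest-point-sampled positions. The greedy FPS below is a simple O(K·N) reference implementation for illustration, not the authors' code.

```python
import torch

def farthest_point_sampling(points, k):
    """points: N x 3, returns indices of k points spread out over the cloud."""
    n = points.shape[0]
    selected = torch.zeros(k, dtype=torch.long)
    dists = torch.full((n,), float('inf'))
    selected[0] = torch.randint(n, (1,)).item()
    for i in range(1, k):
        # Keep, for every point, its distance to the nearest already-selected seed.
        dists = torch.minimum(dists, ((points - points[selected[i - 1]]) ** 2).sum(-1))
        selected[i] = torch.argmax(dists)
    return selected

pts = torch.rand(20_000, 3)
query_positions = pts[farthest_point_sampling(pts, 100)]   # seed positions for 100 non-parametric queries
```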
Training Details. The feature backbone is a Minkowski Res16UNet34C [8]. We train for 600 epochs using AdamW [35] and a one-cycle learning rate schedule [49] with a maximal learning rate of 10^-4. Longer training (1000 epochs) did not further improve results. One training run on 2 cm voxelization takes ∼78 hours on an NVIDIA A40 GPU. We perform standard data augmentation: horizontal flipping, random rotations around the z-axis, elastic distortion [46] and random scaling. Color augmentations include jittering, brightness and contrast augmentations. During training on ScanNet, we reduce memory consumption by computing the dot product between instance queries and point features aggregated within segments (obtained from a graph-based segmentation [15], similar to OccuSeg [18] or Mix3D [38]). Wrongly merged instances can be separated using connected components [14] (Sec. IV-C).

IV. EXPERIMENTS

In this section, we compare Mask3D with prior state-of-the-art on four publicly available 3D indoor and outdoor datasets (Sec. IV-A). Then, we provide analysis experiments on the proposed model, investigating query types and the impact of the number of query refinement steps as well as the number of queries during inference (Sec. IV-B). Finally, we show qualitative results and discuss limitations (Sec. IV-C).

A. Comparing with State-of-the-Art Methods

Datasets and Metrics. We evaluate Mask3D on four publicly available 3D instance segmentation datasets. ScanNet [9] is a richly-annotated dataset of 3D reconstructed indoor scenes. It contains hundreds of different rooms showing a large variety of room types such as hotels, libraries and offices. The provided splits contain 1202 training, 312 validation and 100 hidden test scenes. Each scene is annotated with semantic and instance segmentation labels covering 18 object categories. The benchmark evaluation metric is mean average precision (mAP). ScanNet200 [47] extends the original ScanNet scenes with an order of magnitude more classes. ScanNet200 allows testing an algorithm's performance under the natural imbalance of classes, particularly for challenging long-tail classes such as coffee-kettle and potted-plant. We keep the same train, validation and test splits as in the original ScanNet dataset. S3DIS [1] is a large-scale indoor dataset showing six different areas from three different campus buildings. It contains 272 scans and is also annotated with semantic instance masks over 13 different classes. We follow the common splits and evaluate on Area 5 and with 6-fold cross validation. We report scores using the mAP metric from ScanNet and mean precision/recall at 50% IoU threshold (mPrec50/mRec50), as initially introduced by ASIS [59]. Unlike mAP, this metric does not consider confidence scores; we therefore filter out instance masks with a prediction confidence score below 80% to avoid excessive false positives. STPLS3D [3] is a synthetic outdoor dataset closely mimicking the data generation process of aerial photogrammetry point clouds. 25 urban scenes totalling 6 km² are densely annotated with 14 instance classes. We follow the common splits [3, 56].

TABLE I: 3D Instance Segmentation Scores on ScanNet v2 [9]. We report mean average precision (mAP) with different IoU thresholds over 18 classes on the ScanNet validation and test sets. The inference speed is averaged over the validation set and computed on a TITAN X GPU (c.f. [56]), excluding postprocessing. Test scores accessed on 13 September 2022.

Method           | ScanNet Val: mAP  mAP50 | ScanNet Test: mAP  mAP50 | Runtime (ms)
SGPN [58]        |    –      –             |   4.9    14.3            |  158439
GSPN [62]        |  19.3    37.8           |    –     30.6            |   12702
3D-SIS [22]      |    –     18.7           |  16.1    38.2            |     –
MASC [32]        |    –      –             |  25.4    44.7            |     –
3D-BoNet [61]    |    –      –             |  25.3    48.8            |    9202
MTML [28]        |  20.3    40.2           |  28.2    54.9            |     –
3D-MPA [13]      |  35.5    59.1           |  35.5    61.1            |     –
DyCo3D [21]      |  35.4    57.6           |  39.5    64.1            |     –
PointGroup [26]  |  34.8    56.7           |  40.7    63.6            |     452
MaskGroup [64]   |  42.0    63.3           |  43.4    66.4            |     –
OccuSeg [18]     |  44.2    60.7           |  48.6    67.2            |    1904
SSTNet [31]      |  49.4    64.3           |  50.6    69.8            |     428
HAIS [4]         |  43.5    64.1           |  45.7    69.9            |     339
SoftGroup [56]   |  46.0    67.6           |  50.4    76.1            |     345
Mask3D (Ours)    |  55.2    73.7           |  56.6    78.0            |     339

TABLE II: 3D Instance Segmentation Scores on S3DIS [1]. We report mean average precision (mAP) with different IoU thresholds (as in [9]) as well as mean precision (mPrec) and mean recall (mRec) at 50% IoU threshold (as in [59]) over 13 classes on S3DIS Area 5 and 6-fold cross validation. The last three rows (shown in light gray in the original table) are pre-trained on ScanNet [9] and fine-tuned on S3DIS [1].

Method           | Area 5: AP  AP50  Prec50  Rec50 | 6-fold CV: AP  AP50  Prec50  Rec50
SGPN [58]        |   –     –    36.0   28.7        |   –     –    38.2   31.2
ASIS [59]        |   –     –    55.3   42.4        |   –     –    63.6   47.5
3D-BoNet [61]    |   –     –    57.5   40.2        |   –     –    65.6   47.6
OccuSeg [18]     |   –     –     –      –          |   –     –    72.8   60.3
3D-MPA [13]      |   –     –    63.1   58.0        |   –     –    66.7   64.1
PointGroup [26]  |   –    57.8  61.9   62.1        |   –    64.0  69.6   69.2
DyCo3D [21]      |   –     –    64.3   64.2        |   –     –     –      –
MaskGroup [64]   |   –    65.0  62.9   64.7        |   –    69.9  66.6   69.6
SSTNet [31]      |  42.7  59.3  65.5   64.2        |  54.1  67.8  73.5   73.4
Mask3D (Ours)    |  56.6  68.4  68.7   66.3        |  64.5  75.5  72.8   74.5
HAIS [4]         |   –     –    71.1   65.0        |   –     –    73.2   69.4
SoftGroup [56]   |  51.6  66.1  73.6   66.6        |  54.4  68.9  75.3   69.8
Mask3D (Ours)    |  57.8  71.9  74.3   63.7        |  61.8  74.3  76.5   66.2

TABLE III: 3D Instance Segmentation Scores on ScanNet200 [47] and STPLS3D [3]. We report mean average precision (mAP) with different IoU thresholds over 14 classes on the STPLS3D test set. Hidden test scores accessed on 13 September 2022.

ScanNet200: Method | head  com   tail      STPLS3D: Method  | mAP   mAP50
CSC [23]           | 22.3   8.2   4.6      PointGroup [26]  | 23.3   38.5
Mink34D [8]        | 24.6   8.3   4.3      HAIS [4]         | 35.1   46.7
LGround [47]       | 27.5  10.8   6.0      SoftGroup [56]   | 46.2   61.8
Mask3D (Ours)      | 38.3  26.3  16.8      Mask3D (Ours)    | 57.3   74.3

Results are summarized in Tab. I (ScanNet), Tab. II (S3DIS), Tab. III (left, ScanNet200) and Tab. III (right, STPLS3D). Mask3D outperforms prior work by a large margin on the most challenging metric, mAP: by at least 6.2 mAP on ScanNet, 6.2 mAP on S3DIS, 10.8 mAP on ScanNet200 and 11.2 mAP on STPLS3D. As in [4, 56], we also report scores for models pre-trained on ScanNet and fine-tuned on S3DIS. For Mask3D, pre-training improves performance by 1.2 mAP on Area 5. Mask3D's strong performance on indoor and outdoor datasets, as well as its ability to work under challenging class imbalance settings without inherent modifications to the architecture or the training regime, highlights its generality. Trained models are available at: https://github.com/JonasSchult/Mask3D
B. Analysis Experiments

Query Types. Mask3D iteratively refines instance queries by attending to voxel features (Fig. 2). We distinguish two types of query initialization prior to attending to voxel features: (1) parametric and (2)-(3) non-parametric initial queries. Parametric refers to learned positions and features [2], while non-parametric refers to point positions sampled with farthest point sampling (FPS) [44]. When selecting query positions with FPS, we can either initialize the queries to zero ((2), as in 3DETR [37]) or use the point features at the sampled position (3). Tab. IV (left) shows the effects of using parametric or non-parametric queries on ScanNet validation (5 cm). In line with [37], we see that non-parametric queries (2) outperform parametric queries (1). Interestingly, (3) results in degraded performance compared to both parametric (1) and position-only non-parametric queries (2).

TABLE IV: Ablations. a) We explore two variants for query positions and features. Parametric queries (1) are learned during training. Non-parametric queries consist of FPS point positions (2) and potentially their features (3), resembling scene-specific queries. b) We optimize the instance mask prediction using the binary cross-entropy loss L_BCE and the Dice loss L_dice. A weighted combination of Dice and cross-entropy loss results in the best performance.

a) Query Type                              b) Mask Loss
     Position   Features      mAP              L_dice   L_BCE     mAP
(1)  Param.     Param.      39.7±0.7             ✗        ✓     27.0±0.6
(2)  FPS        Zeros       40.6±0.3             ✓        ✗     38.0±0.3
(3)  FPS        Point Feat. 38.4±0.3             ✓        ✓     40.6±0.3

Number of Queries and Decoders. We analyze the effect of varying the number of queries K during inference on models trained with K = 100 and K = 200 non-parametric queries sampled with FPS. By increasing K from 100 to 200 during training, we observe a slight increase in performance (Fig. 3, left) at the cost of additional memory. When evaluating with fewer queries than trained with, we observe reduced performance but faster runtime. When evaluating with more queries than trained with, we observe slightly improved performance, typically less than 1 mAP. Our final model uses K = 100 due to memory constraints when using 2 cm voxels in the feature backbone. In this study, we report scores using 5 cm on ScanNet validation. We also analyse the mask quality that we obtain after each Transformer decoder layer in our trained model (Fig. 3, right). We see a rapid increase up to 4 layers, after which the quality increases more slowly.

Fig. 3: Number of queries and decoder layers.

Mask Loss. The mask module (Fig. 2) generates instance heatmaps for every instance query. After Hungarian matching, the corresponding ground truth mask is used to compute the mask loss L_mask. The binary cross-entropy loss L_BCE is the obvious choice for binary segmentation tasks. However, it does not perform well under large class imbalance (few foreground mask points, many background points). The Dice loss L_dice is specifically designed to address such data imbalance. Tab. IV (right) shows scores on ScanNet validation for combinations of both losses. While L_dice improves over L_BCE, we observe an additional improvement by training our model with a weighted sum of both losses (Eq. 5).

C. Qualitative Results and Limitations

Fig. 4 shows several representative examples of Mask3D instance segmentation results on ScanNet. The scenes are quite diverse and present a number of challenges, including clutter, scanning artifacts and numerous similar objects. Still, our model shows quite robust results. There are still limitations in our model, though. A systematic mistake that we observed are merged instances that are far apart (see Fig. 4, bottom left). As the attention mechanism can attend to the full point cloud, it can happen that two objects with similar semantics and geometry expose similar learned point features and are therefore combined into one instance, even if they are far apart in the scene. This is less likely to happen with methods that explicitly encode geometric priors.

Fig. 4: Qualitative Results on ScanNet. We show pairs of predicted instance masks and predicted semantic labels. On the bottom left, we show the heatmap of a failure case of two windows that are wrongly assigned to a single instance. The corresponding point features are visualized as RGB after projecting them to 3D using PCA.

V. CONCLUSION

In this work, we have introduced Mask3D for 3D semantic instance segmentation. Mask3D is based on Transformer decoders and learns instance queries that, combined with learned point features, directly predict semantic instance masks without the need for hand-selected voting schemes or hand-crafted grouping mechanisms. We think that Mask3D is an attractive alternative to current voting-based approaches and expect to see follow-up work along this line of research.

Acknowledgments: This work is supported by the ERC Consolidator Grant DeeViSe (ERC-2017-CoG-773161), SNF Grant 200021 204840, compute resources from RWTH Aachen University (rwth1238) and the ETH AI Center post-doctoral fellowship. We additionally thank Alexey Nekrasov, Ali Athar and István Sárándi for helpful discussions and feedback.
REFERENCES

[1] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In European Conference on Computer Vision, 2020.
[3] Meida Chen, Qingyong Hu, Thomas Hugues, Andrew Feng, Yu Hou, Kyle McCullough, and Lucio Soibelman. STPLS3D: A Large-Scale Synthetic and Real Aerial Photogrammetry 3D Point Cloud Dataset. arXiv:2203.09065, 2022.
[4] Shaoyu Chen, Jiemin Fang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Hierarchical Aggregation for 3D Instance Segmentation. In International Conference on Computer Vision, 2021.
[5] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention Mask Transformer for Universal Image Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[6] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-Pixel Classification is Not All You Need for Semantic Segmentation. In Advances in Neural Information Processing Systems, 2021.
[7] Julian Chibane, Francis Engelmann, Tuan Anh Tran, and Gerard Pons-Moll. Box2Mask: Weakly Supervised 3D Semantic Instance Segmentation Using Bounding Boxes. In European Conference on Computer Vision, 2022.
[8] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[10] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to Predict Crisp Boundaries. In European Conference on Computer Vision, 2018.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021.
[12] Cathrin Elich, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. 3D Bird's-eye-view Instance Segmentation. In German Conference on Pattern Recognition, 2019.
[13] Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, and Matthias Nießner. 3D-MPA: Multi-Proposal Aggregation for 3D Semantic Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[14] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1996.
[15] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient Graph-Based Image Segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[16] Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[17] Benjamin Graham and Laurens van der Maaten. Submanifold Sparse Convolutional Networks. arXiv:1706.01307, 2017.
[18] Lei Han, Tian Zheng, Lan Xu, and Lu Fang. OccuSeg: Occupancy-aware 3D Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In International Conference on Computer Vision, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[21] Tong He, Chunhua Shen, and Anton van den Hengel. DyCo3D: Robust Instance Segmentation of 3D Point Clouds through Dynamic Convolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[22] Ji Hou, Angela Dai, and Matthias Nießner. 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[23] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[24] Paul VC Hough. Machine Analysis of Bubble Chamber Pictures. In International Conference on High Energy Accelerators and Instrumentation, 1959.
[25] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic Filter Networks. In Neural Information Processing Systems, 2016.
[26] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[27] Harold W Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[28] Jean Lahoud, Bernard Ghanem, Marc Pollefeys, and Martin R Oswald. 3D Instance Segmentation via Multi-Task Metric Learning. In International Conference on Computer Vision, 2019.
[29] Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified Transformer for 3D Point Cloud Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[30] Bastian Leibe, Aleš Leonardis, and Bernt Schiele. Robust Object Detection with Interleaved Categorization and Segmentation. International Journal of Computer Vision, 2008.
[31] Zhihao Liang, Zhihao Li, Songcen Xu, Mingkui Tan, and Kui Jia. Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks. In International Conference on Computer Vision, 2021.
[32] Chen Liu and Yasutaka Furukawa. MASC: Multi-Scale Affinity with Sparse Convolution for 3D Instance Segmentation. arXiv:1902.04478, 2019.
[33] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In International Conference on Computer Vision, 2021.
[34] Ze Liu, Zheng Zhang, Yue Cao, Han Hu, and Xin Tong. Group-Free 3D Object Detection via Transformers. In International Conference on Computer Vision, 2021.
[35] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
[36] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D Convolutional Neural Network for Real-time Object Recognition. In International Conference on Intelligent Robots and Systems, 2015.
[37] Ishan Misra, Rohit Girdhar, and Armand Joulin. An End-to-End Transformer Model for 3D Object Detection. In International Conference on Computer Vision, 2021.
[38] Alexey Nekrasov, Jonas Schult, Or Litany, Bastian Leibe, and Francis Engelmann. Mix3D: Out-of-Context Data Augmentation for 3D Scenes. In International Conference on 3D Vision, 2021.
[39] Xuran Pan, Zhuofan Xia, Shiji Song, Li Erran Li, and Gao Huang. 3D Object Detection With Pointformer. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[40] Chunghyun Park, Yoonwoo Jeong, Minsu Cho, and Jaesik Park. Fast Point Transformer. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[41] Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep Hough Voting for 3D Object Detection in Point Clouds. In International Conference on Computer Vision, 2019.
[42] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[43] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and Multi-View CNNs for Object Classification on 3D Data. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[44] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems, 2017.
[45] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Neural Information Processing Systems, 2015.
[46] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[47] David Rozenberszki, Or Litany, and Angela Dai. Language-Grounded Indoor 3D Semantic Segmentation in the Wild. In European Conference on Computer Vision, 2022.
[48] Danila Rukhovich, Anna Vorontsova, and Anton Konushin. FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection. arXiv:2112.00322, 2021.
[49] Leslie N Smith and Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 2019.
[50] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[51] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-End People Detection in Crowded Scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[52] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In Advances in Neural Information Processing Systems, 2020.
[53] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. KPConv: Flexible and Deformable Convolution for Point Clouds. In International Conference on Computer Vision, 2019.
[54] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional Convolutions for Instance Segmentation. In European Conference on Computer Vision, 2020.
[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems, 2017.
[56] Thang Vu, Kookhoi Kim, Tung M Luu, Xuan Thanh Nguyen, and Chang D Yoo. SoftGroup for 3D Instance Segmentation on Point Clouds. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[57] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics, 2017.
[58] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[59] Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen, and Jiaya Jia. Associatively Segmenting Instances and Semantics in Point Clouds. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[60] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[61] Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, and Niki Trigoni. Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds. In Advances in Neural Information Processing Systems, 2019.
[62] Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas J Guibas. GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[63] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point Transformer. In International Conference on Computer Vision, 2021.
[64] Min Zhong, Xinghao Chen, Xiaokang Chen, Gang Zeng, and Yunhe Wang. MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation. arXiv:2203.14662, 2022.
Mask3D: Mask Transformer for 3D Semantic Instance Segmentation
Supplementary Material

I. IMPLEMENTATION DETAILS

S3DIS Specific Details. As S3DIS [1] contains a few very large spaces, e.g. lecture halls, and also provides a very high point density, scenes can exceed several million points. We therefore resort to training on 6 m × 6 m blocks randomly cropped from the ground plane to keep the memory requirements in bounds. As Mask3D thus effectively sees less data in each epoch, we train for 1000 epochs. However, during test, we disable cropping and infer full scenes.

STPLS3D Specific Details. As STPLS3D's evaluation protocol [3] evaluates on 50 m × 50 m blocks evenly cropped from the full city scene, instances are potentially separated into multiple blocks. We therefore feed slightly larger 54 m × 54 m blocks to our model but only keep the relevant predicted instances of the 50 m × 50 m block. This approach achieves significantly better results, usually roughly 1.2 mAP.
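The block cropping described above reduces to selecting points by their XY extent. The sketch below crops a block around a center and, for STPLS3D-style inference, keeps only predictions whose points fall inside the inner block; the function names and the exact selection rule are assumptions.

```python
import numpy as np

def crop_block(points, center_xy, size):
    """points: N x 3 array; returns a boolean mask of points inside a size x size XY block around center_xy."""
    half = size / 2.0
    return np.all(np.abs(points[:, :2] - center_xy) <= half, axis=1)

points = np.random.rand(100_000, 3) * 100.0           # synthetic 100 m x 100 m scene
center = np.array([50.0, 50.0])
outer = crop_block(points, center, 54.0)               # feed the larger 54 m block to the model
inner = crop_block(points, center, 50.0)               # keep predictions only inside the 50 m block
keep_predictions_for = inner[outer]                    # mask indexed in the cropped block's point order
```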
Model Details. Figure 5 shows our full model. Unlike the figure in the main paper, this shows the complete model, including all backbone feature levels and all query refinement steps in the Transformer decoder. We deploy a Minkowski Res16UNet34C [8] and obtain feature maps Fi from all of its 5 scales. The feature maps have (96, 96, 128, 256, 256) channels (sorted from fine to coarse). As the Transformer decoder expects a feature dimension of 128, we apply a non-shared linear projection after each Fi to map the features to the expected dimension. Furthermore, we employ a modified Transformer decoder by Mask2Former [5] (swapped cross- and self-attention) leveraging 8-headed attention and a feedforward network with 1024-dimensional features. For each intermediate feature map Fi with i > 0, we instantiate a dedicated decoder layer. We attend to the backbone features 3 times with Transformer decoders with shared weights. In all our experiments, we use 100 instance queries. Following Misra et al. [37], the query positions are calculated from Fourier positional encodings based on relative voxel positions scaled to [−1, 1]. We do not use Dropout.

Fig. 5: Illustration of the full Mask3D model. In the main paper, we showed a simplified version of our model with fewer hierarchical feature levels in the feature backbone (shown in green) and fewer query refinement layers (blue). The feature backbone outputs point features in 5 scales, while the Transformer decoder iteratively refines the instance queries. Given point features and instance queries, the mask module predicts for each query a semantic class and an instance heatmap, which (after thresholding) results in a binary instance mask.
Comparison Feature Backbones. As an additional candidate for non-convolutional backbones, we deploy the recent StratifiedFormer [29], a Transformer-based feature backbone. The resulting scores are reported in Tab. V. The experiment with the StratifiedFormer shows encouraging results but does not yet reach the performance of the sparse convolutional backbone. However, the experiment clearly shows that our model also runs on different types of feature backbones. We also report scores of another voxel-based feature backbone (Minkowski Res16UNet18B) that is significantly smaller than our original backbone (Minkowski Res16UNet34C) to show robustness towards model size on ScanNet validation. We find that the smaller feature backbone works comparably to the bigger Res16UNet34C. This shows that Mask3D does not overly rely on the specific voxel-based feature extractor.

TABLE V: Feature Backbones. We experimented with convolutional and Transformer-based feature backbones (c.f. Fig. 5).

Backbone Name            Backbone Param.    mAP    mAP50
StratifiedFormer [29]       18,798,662      31.1    54.6
Res16UNet18B [8]            17,204,660      40.0    63.7
Res16UNet34C [8]            37,856,052      40.9    64.4
Model sizes. Tab. VI shows the model size of Mask3D and two recent top-performing baselines, HAIS [4] and SoftGroup [56], obtained from the official code releases. Most parameters, by far (>90%), are due to the feature-learning backbones (Fig. 5). In comparison, the remaining number of parameters (including the Transformer decoder) is very small (<10%). In absolute numbers, the proposed Transformer decoder is larger than the corresponding parts of the baseline methods but still small compared to the size of the feature backbones. To verify that the improved performance of Mask3D does not originate from more model parameters, we ran an additional experiment with a smaller feature backbone (Res16UNet18B). The smaller feature backbone results in comparable segmentation performance (40.9 vs. 40.0 mAP) evaluated on ScanNet validation at 5 cm. Additional feature backbones are analyzed in Tab. V.

TABLE VI: Model sizes. We compare Mask3D's model size against recent top-performing methods. For all models, most parameters are in the feature backbone and only a small fraction is in the instance-segmentation-specific part of the models.

Model Name                All Params.    Backbone    Other
HAIS [4]                    30.856M       30.118M    0.738M
SoftGroup [56]              30.858M       30.118M    0.740M
Mask3D (Ours)               39.617M       37.856M    1.761M
Mask3D (Ours – small)       18.958M       17.205M    1.753M

A. Comparison to SoftGroup

In the following, we qualitatively compare Mask3D with SoftGroup [56], the currently best-performing voting-based 3D instance segmentation approach. We highlight two error cases for SoftGroup and show our Mask3D results for comparison in Fig. 6.

Density-Based Clustering. In Section 4.3 (main paper), we described one limitation of Mask3D. A few times, we observed that similarly looking objects are merged into a single instance even if they are far apart in the input point cloud (c.f. Fig. 7(b)–(c)). We trace this back to Mask3D's ability to attend to the full point cloud, combined with instances which show similar semantics and geometry. As a solution, we propose to apply DBSCAN [14] to the output instance masks produced by Mask3D. For each of the K instance masks individually, DBSCAN yields spatially contiguous clusters (c.f. Fig. 7(d)). We treat these dense clusters as new instance masks. We update the confidence score for each newly created instance by applying Equation (7, main paper). In our hyperparameter ablation study in Tab. VII, we achieved overall best results when applying DBSCAN with a minimal distance parameter ε of 0.9 for ScanNet, 0.6 for S3DIS and 14.0 for STPLS3D. Note that we do not consider noise points, i.e., we set the minimal size of a cluster to 1.

TABLE VII: Ablation on DBSCAN postprocessing. To split wrongly merged instances, we employ DBSCAN as an optional postprocessing routine. We report best scores around a minimal distance ε = 0.9 (ScanNet) and ε = 0.6 (S3DIS Area 5).

  ε    | ScanNet Validation (2 cm): AP  AP50  AP25 | S3DIS Area 5 (2 cm): AP  AP50  AP25
  –    |   54.3   73.0   83.4                      |   55.7   69.8   76.1
  0.5  |   54.1   72.1   82.1                      |   57.6   71.7   77.2
  0.6  |   54.4   72.4   82.4                      |   57.8   71.9   77.2
  0.7  |   54.9   73.2   83.1                      |   57.7   71.8   77.2
  0.8  |   55.0   73.3   83.2                      |   57.5   71.6   77.1
  0.9  |   55.1   73.7   83.6                      |   57.6   71.6   77.1
  1.0  |   55.0   73.5   83.5                      |   57.5   71.5   77.2
  1.1  |   55.0   73.6   83.6                      |   57.5   71.4   77.2
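The DBSCAN split can be reproduced with scikit-learn: for each predicted mask, spatially contiguous clusters become new instances and the confidence of Eq. (7) is recomputed per cluster. The eps values follow Tab. VII; the rest of the sketch is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_instance_mask(points, mask, heatmap, class_conf, eps=0.9):
    """points: N x 3, mask: N bool, heatmap: N in [0,1], class_conf: float.
    Returns a list of (new_mask, new_confidence) pairs, one per spatially contiguous cluster."""
    idx = np.where(mask)[0]
    if len(idx) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(points[idx])   # min cluster size of 1, no noise points
    instances = []
    for lab in np.unique(labels):
        new_mask = np.zeros(len(points), dtype=bool)
        new_mask[idx[labels == lab]] = True
        conf = class_conf * heatmap[new_mask].mean()                    # re-apply Eq. (7) per cluster
        instances.append((new_mask, conf))
    return instances

pts = np.random.rand(10_000, 3) * 10
pred_mask = np.random.rand(10_000) > 0.95
parts = split_instance_mask(pts, pred_mask, np.random.rand(10_000), class_conf=0.8)
```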
Fig. 6: Qualitative Comparison to SoftGroup [56]. We compare Mask3D with the current top-performing voting-based approach SoftGroup. The top example shows a scene containing a single large U-shaped table, see (e) in pink. SoftGroup is based on center-voting and tries to predict the instance center, shown in (b) in red. However, predicting centers of such very large non-convex shapes can be difficult for voting-based approaches. Indeed, SoftGroup fails to correctly segment the table and returns two partial instances (c). Our Mask3D, on the other hand, does not rely on hand-selected geometric properties such as centers and can handle arbitrarily shaped and sized objects. It correctly predicts the table's instance mask (e). In the bottom example, we see that SoftGroup has difficulties predicting precise centers for multiple chairs located next to each other (b). As a result, the manually tuned grouping mechanism aggregates them all into one big instance, which is later discarded by the refinement step. It therefore fails to segment all eight chairs (c). Mask3D does not rely on hand-crafted grouping mechanisms and can successfully segment most of the chairs.
(a) RGB Point Cloud   (b) Mask3D (ours) w/o DBSCAN   (c) Instance Heatmap (Windows)   (d) Mask3D (ours) with DBSCAN   (e) Center Votes (SoftGroup [56])   (f) Prediction (SoftGroup [56])

Fig. 7: Qualitative Analysis of DBSCAN Postprocessing. Mask3D occasionally predicts masks containing two instances of the same class. In (b), two windows are merged into a single instance since their underlying point cloud features result in a high response when convolved with the instance query (c.f. heatmap in (c)). In (d), we apply DBSCAN as a postprocessing routine to split erroneously merged instances based on spatial contiguity. We do not see this effect for voting-based methods as they explicitly encode geometric priors (e)–(f).
