
Masked-attention Mask Transformer for Universal Image Segmentation

Bowen Cheng1,2* Ishan Misra1 Alexander G. Schwing2 Alexander Kirillov1 Rohit Girdhar1
1 Facebook AI Research (FAIR)    2 University of Illinois at Urbana-Champaign (UIUC)
https://bowenc0221.github.io/mask2former
* Work done during an internship at Facebook AI Research.

Abstract

Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).

[Figure 1: bar chart comparing universal architectures (Mask2Former (ours), MaskFormer) with SOTA specialized architectures (Max-DeepLab, Swin-HTC++, BEiT) on panoptic (57.8 / 52.7 / 51.1 PQ), instance (50.1 / 40.1 / 49.5 AP) and semantic (57.7 / 55.6 / 57.0 mIoU) segmentation.]

Figure 1. State-of-the-art segmentation architectures are typically specialized for each image segmentation task. Although recent work has proposed universal architectures that attempt all tasks and are competitive on semantic and panoptic segmentation, they struggle with segmenting instances. We propose Mask2Former, which, for the first time, outperforms the best specialized architectures on three studied segmentation tasks on multiple datasets.

1. Introduction

Image segmentation studies the problem of grouping pixels. Different semantics for grouping pixels, e.g., category or instance membership, have led to different types of segmentation tasks, such as panoptic, instance or semantic segmentation. While these tasks differ only in semantics, current methods develop specialized architectures for each task. Per-pixel classification architectures based on Fully Convolutional Networks (FCNs) [37] are used for semantic segmentation, while mask classification architectures [5, 24] that predict a set of binary masks, each associated with a single category, dominate instance-level segmentation. Although such specialized architectures [6, 10, 24, 37] have advanced each individual task, they lack the flexibility to generalize to the other tasks. For example, FCN-based architectures struggle at instance segmentation, leading to the evolution of different architectures for instance segmentation compared to semantic segmentation. Thus, duplicate research and (hardware) optimization effort is spent on each specialized architecture for every task.

To address this fragmentation, recent work [14, 62] has attempted to design universal architectures that are capable of addressing all segmentation tasks with the same architecture (i.e., universal image segmentation). These architectures are typically based on an end-to-end set prediction objective (e.g., DETR [5]), and successfully tackle multiple tasks without modifying the architecture, loss, or the training procedure. Note that universal architectures are still trained separately for different tasks and datasets, albeit with the same architecture. In addition to being flexible, universal architectures have recently shown state-of-the-art results on semantic and panoptic segmentation [14]. However, recent work still focuses on advancing specialized architectures [20, 39, 45], which raises the question: why haven't universal architectures replaced specialized ones?
Although existing universal architectures are flexible enough to tackle any segmentation task, as shown in Figure 1, in practice their performance lags behind the best specialized architectures. For instance, the best reported performance of universal architectures [14, 62] is currently lower (> 9 AP) than the SOTA specialized architecture for instance segmentation [6]. Beyond the inferior performance, universal architectures are also harder to train. They typically require more advanced hardware and a much longer training schedule. For example, training MaskFormer [14] takes 300 epochs to reach 40.1 AP and it can only fit a single image in a GPU with 32G memory. In contrast, the specialized Swin-HTC++ [6] obtains better performance in only 72 epochs. Both the performance and training efficiency issues hamper the deployment of universal architectures.

In this work, we propose a universal image segmentation architecture named Masked-attention Mask Transformer (Mask2Former) that outperforms specialized architectures across different segmentation tasks, while still being easy to train on every task. We build upon a simple meta architecture [14] consisting of a backbone feature extractor [25, 36], a pixel decoder [33] and a Transformer decoder [51]. We propose key improvements that enable better results and efficient training. First, we use masked attention in the Transformer decoder, which restricts the attention to localized features centered around predicted segments, which can be either objects or regions depending on the specific semantic for grouping. Compared to the cross-attention used in a standard Transformer decoder, which attends to all locations in an image, our masked attention leads to faster convergence and improved performance. Second, we use multi-scale high-resolution features, which help the model segment small objects and regions. Third, we propose optimization improvements such as switching the order of self- and cross-attention, making query features learnable, and removing dropout, all of which improve performance without additional compute. Finally, we save 3x training memory without affecting the performance by calculating the mask loss on a few randomly sampled points. These improvements not only boost the model performance, but also make training significantly easier, making universal architectures more accessible to users with limited compute.

We evaluate Mask2Former on three image segmentation tasks (panoptic, instance and semantic segmentation) using four popular datasets (COCO [35], Cityscapes [16], ADE20K [65] and Mapillary Vistas [42]). For the first time, on all these benchmarks, our single architecture performs on par with or better than specialized architectures. Mask2Former sets a new state-of-the-art of 57.8 PQ on COCO panoptic segmentation [28], 50.1 AP on COCO instance segmentation [35] and 57.7 mIoU on ADE20K semantic segmentation [65] using the exact same architecture.

2. Related Work

Specialized semantic segmentation architectures typically treat the task as a per-pixel classification problem. FCN-based architectures [37] independently predict a category label for every pixel. Follow-up methods find context to play an important role for precise per-pixel classification and focus on designing customized context modules [7, 8, 63] or self-attention variants [21, 26, 45, 55, 61, 64].

Specialized instance segmentation architectures are typically based upon "mask classification": they predict a set of binary masks, each associated with a single class label. The pioneering work, Mask R-CNN [24], generates masks from detected bounding boxes. Follow-up methods either focus on detecting more precise bounding boxes [4, 6], or on finding new ways to generate a dynamic number of masks, e.g., using dynamic kernels [3, 49, 56] or clustering algorithms [11, 29]. Although performance has advanced on each task, these specialized innovations lack the flexibility to generalize from one task to the other, leading to duplicated research effort. For instance, although multiple approaches have been proposed for building feature pyramid representations [33], as we show in our experiments, BiFPN [47] performs better for instance segmentation while FaPN [39] performs better for semantic segmentation.

Panoptic segmentation has been proposed to unify the semantic and instance segmentation tasks [28]. Architectures for panoptic segmentation either combine the best of specialized semantic and instance segmentation architectures into a single framework [11, 27, 31, 60] or design novel objectives that treat semantic regions and instance objects equally [5, 52]. Despite those new architectures, researchers continue to develop specialized architectures for different image segmentation tasks [20, 45]. We find that panoptic architectures usually only report performance on the panoptic segmentation task [52], which does not guarantee good performance on other tasks (Figure 1). For example, panoptic segmentation does not measure an architecture's ability to rank predictions as instance segmentation does. Thus, we refrain from referring to architectures that are only evaluated on panoptic segmentation as universal architectures. Instead, here, we evaluate our Mask2Former on all studied tasks to guarantee generalizability.

Universal architectures emerged with DETR [5], which shows that mask classification architectures with an end-to-end set prediction objective are general enough for any image segmentation task. MaskFormer [14] shows that mask classification based on DETR not only performs well on panoptic segmentation but also achieves state-of-the-art results on semantic segmentation. K-Net [62] further extends set prediction to instance segmentation. Unfortunately, these architectures fail to replace specialized models, as their performance on particular tasks or datasets is still worse than the best specialized architecture (e.g., MaskFormer [14] cannot segment instances well). To our knowledge, Mask2Former is the first architecture that outperforms state-of-the-art specialized architectures on all considered tasks and datasets.
3. Masked-attention Mask Transformer

We now present Mask2Former. We first review a meta architecture for mask classification that Mask2Former is built upon. Then, we introduce our new Transformer decoder with masked attention, which is the key to better convergence and results. Lastly, we propose training improvements that make Mask2Former efficient and accessible.

[Figure 2: architecture diagram. A backbone feeds a pixel decoder; image features from the pixel decoder and query features enter the Transformer decoder, whose layers stack masked attention, self-attention and an FFN (each followed by add & norm) and predict a class and a mask per query.]

Figure 2. Mask2Former overview. Mask2Former adopts the same meta architecture as MaskFormer [14] with a backbone, a pixel decoder and a Transformer decoder. We propose a new Transformer decoder with masked attention instead of the standard cross-attention (Section 3.2.1). To deal with small objects, we propose an efficient way of utilizing high-resolution features from a pixel decoder by feeding one scale of the multi-scale feature to one Transformer decoder layer at a time (Section 3.2.2). In addition, we switch the order of self- and cross-attention (i.e., our masked attention), make query features learnable, and remove dropout to make computation more effective (Section 3.2.3). Note that positional embeddings and predictions from intermediate Transformer decoder layers are omitted in this figure for readability.

3.1. Mask classification preliminaries

Mask classification architectures group pixels into N segments by predicting N binary masks, along with N corresponding category labels. Mask classification is sufficiently general to address any segmentation task by assigning different semantics, e.g., categories or instances, to different segments. However, the challenge is to find good representations for each segment. For example, Mask R-CNN [24] uses bounding boxes as the representation, which limits its application to semantic segmentation. Inspired by DETR [5], each segment in an image can be represented as a C-dimensional feature vector ("object query") and can be processed by a Transformer decoder, trained with a set prediction objective. A simple meta architecture consists of three components: a backbone that extracts low-resolution features from an image; a pixel decoder that gradually upsamples low-resolution features from the output of the backbone to generate high-resolution per-pixel embeddings; and finally a Transformer decoder that operates on image features to process object queries. The final binary mask predictions are decoded from per-pixel embeddings with object queries. One successful instantiation of such a meta architecture is MaskFormer [14], and we refer readers to [14] for more details.

3.2. Transformer decoder with masked attention

Mask2Former adopts the aforementioned meta architecture, with our proposed Transformer decoder (Figure 2 right) replacing the standard one. The key components of our Transformer decoder include a masked attention operator, which extracts localized features by constraining cross-attention to within the foreground region of the predicted mask for each query, instead of attending to the full feature map. To handle small objects, we propose an efficient multi-scale strategy to utilize high-resolution features: it feeds successive feature maps from the pixel decoder's feature pyramid into successive Transformer decoder layers in a round robin fashion. Finally, we incorporate optimization improvements that boost model performance without introducing additional computation. We now discuss these improvements in detail.

3.2.1 Masked attention

Context features have been shown to be important for image segmentation [7, 8, 63]. However, recent studies [22, 46] suggest that the slow convergence of Transformer-based models is due to the global context in the cross-attention layer, as it takes many training epochs for cross-attention to learn to attend to localized object regions [46]. We hypothesize that local features are enough to update query features, and that context information can be gathered through self-attention. For this we propose masked attention, a variant of cross-attention that only attends within the foreground region of the predicted mask for each query.

Standard cross-attention (with a residual path) computes

    X_l = softmax(Q_l K_l^T) V_l + X_{l-1}.    (1)

Here, l is the layer index, X_l ∈ R^{N×C} refers to the N C-dimensional query features at the l-th layer, and Q_l = f_Q(X_{l-1}) ∈ R^{N×C}. X_0 denotes the input query features to the Transformer decoder. K_l, V_l ∈ R^{H_l W_l × C} are the image features under transformations f_K(·) and f_V(·) respectively, where H_l and W_l are the spatial resolution of the image features that we will introduce in Section 3.2.2. f_Q, f_K and f_V are linear transformations.
Our masked attention modulates the attention matrix via

    X_l = softmax(ℳ_{l-1} + Q_l K_l^T) V_l + X_{l-1}.    (2)

Moreover, the attention mask ℳ_{l-1} at feature location (x, y) is

    ℳ_{l-1}(x, y) = 0 if M_{l-1}(x, y) = 1, and −∞ otherwise.    (3)

Here, M_{l-1} ∈ {0, 1}^{N × H_l W_l} is the binarized output (thresholded at 0.5) of the resized mask prediction of the previous, (l-1)-th, Transformer decoder layer, resized to the same resolution as K_l. M_0 is the binary mask prediction obtained from X_0, i.e., before feeding query features into the Transformer decoder.
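A minimal PyTorch sketch of Eqs. (2)-(3), assuming a single attention head and omitting the attention-logit scaling and multi-head projections of a full implementation (a full implementation also has to handle the degenerate case where a query's predicted mask is empty, e.g., by falling back to unmasked attention):

    import torch
    import torch.nn.functional as F

    def masked_attention(X_prev, img_feat, mask_pred, f_q, f_k, f_v):
        """One masked-attention step (Eqs. 2-3), single head for clarity.
        X_prev:    (N, C)        query features X_{l-1}
        img_feat:  (C, H_l, W_l) image features at the current decoder resolution
        mask_pred: (N, H, W)     soft mask predictions from the previous layer
        f_q, f_k, f_v: linear projections, e.g. torch.nn.Linear(C, C)
        """
        C, H_l, W_l = img_feat.shape
        Q = f_q(X_prev)                                     # (N, C)
        K = f_k(img_feat.flatten(1).t())                    # (H_l*W_l, C)
        V = f_v(img_feat.flatten(1).t())                    # (H_l*W_l, C)

        # Resize the previous mask prediction to the attention resolution and binarize at 0.5;
        # attention is forbidden (-inf) wherever the mask is background.
        mask = F.interpolate(mask_pred[None], size=(H_l, W_l), mode="bilinear")[0]
        forbid = mask.flatten(1) < 0.5                      # (N, H_l*W_l), True = masked out

        logits = Q @ K.t()                                  # (N, H_l*W_l)
        logits = logits.masked_fill(forbid, float("-inf"))
        return torch.softmax(logits, dim=-1) @ V + X_prev   # residual path of Eq. (2)

For example, with f_q = f_k = f_v = torch.nn.Linear(256, 256), X_prev of shape (100, 256), img_feat of shape (256, 32, 32) and mask_pred of shape (100, 128, 128), the call returns updated query features of shape (100, 256).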
3.2.2 High-resolution features

High-resolution features improve model performance, especially for small objects [5]. However, they are computationally demanding. Thus, we propose an efficient multi-scale strategy to introduce high-resolution features while controlling the increase in computation. Instead of always using the high-resolution feature map, we utilize a feature pyramid which consists of both low- and high-resolution features, and feed one resolution of the multi-scale feature to one Transformer decoder layer at a time.

Specifically, we use the feature pyramid produced by the pixel decoder with resolutions 1/32, 1/16 and 1/8 of the original image. For each resolution, we add both a sinusoidal positional embedding e_pos ∈ R^{H_l W_l × C}, following [5], and a learnable scale-level embedding e_lvl ∈ R^{1×C}, following [66]. We use those, from lowest resolution to highest resolution, for the corresponding Transformer decoder layers as shown in Figure 2 left. We repeat this 3-layer Transformer decoder L times; our final Transformer decoder hence has 3L layers. More specifically, the first three layers receive feature maps of resolution H_1 = H/32, H_2 = H/16, H_3 = H/8 and W_1 = W/32, W_2 = W/16, W_3 = W/8, where H and W are the original image resolution. This pattern is repeated in a round robin fashion for all following layers.
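The round-robin feeding of the three pyramid scales can be sketched as follows; the placeholder DecoderLayer and the dummy feature tensors are illustrative assumptions only (the sinusoidal embedding e_pos and the mask prediction are omitted for brevity):

    import torch
    import torch.nn as nn

    class DecoderLayer(nn.Module):
        # Placeholder: a real layer performs masked attention, self-attention and an FFN.
        def __init__(self, dim):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        def forward(self, queries, memory):
            out, _ = self.attn(queries[None], memory[None], memory[None])
            return out[0] + queries

    C, num_queries, L = 256, 100, 3
    # Pixel-decoder pyramid at 1/32, 1/16 and 1/8 resolution (dummy tensors for a 1024x1024 image).
    features = [torch.randn(C, 32, 32), torch.randn(C, 64, 64), torch.randn(C, 128, 128)]
    level_embed = nn.Embedding(3, C)                  # learnable scale-level embedding e_lvl
    layers = nn.ModuleList(DecoderLayer(C) for _ in range(3 * L))   # 3L decoder layers in total

    queries = nn.Embedding(num_queries, C).weight     # learnable query features X_0
    for i, layer in enumerate(layers):
        lvl = i % 3                                   # lowest to highest resolution, round robin
        memory = features[lvl].flatten(1).t() + level_embed.weight[lvl]   # (H_l*W_l, C)
        queries = layer(queries, memory)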
3.2.3 Optimization improvements

A standard Transformer decoder layer [51] consists of three modules that process query features in the following order: a self-attention module, a cross-attention module and a feed-forward network (FFN). Moreover, the query features (X_0) are zero-initialized before being fed into the Transformer decoder and are associated with learnable positional embeddings. Furthermore, dropout is applied to both residual connections and attention maps.

To optimize the Transformer decoder design, we make the following three improvements. First, we switch the order of self- and cross-attention (our new "masked attention") to make computation more effective: query features fed to the first self-attention layer are image-independent and do not carry signals from the image, so applying self-attention there is unlikely to enrich information. Second, we make the query features (X_0) learnable as well (we still keep the learnable query positional embeddings); learnable query features are directly supervised before being used in the Transformer decoder to predict masks (M_0). We find that these learnable query features function like a region proposal network [43] and have the ability to generate mask proposals. Finally, we find that dropout is not necessary and usually decreases performance, so we completely remove dropout in our decoder.
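The resulting layer ordering can be summarized in the hedged sketch below; the dimensions, head count, FFN size and class name are assumptions, and the attention mask of Section 3.2.1 is passed in as attn_mask:

    import torch.nn as nn

    class Mask2FormerStyleLayer(nn.Module):
        """Sketch of the reordered decoder layer: masked (cross-)attention first,
        then self-attention, then the FFN, with no dropout anywhere."""
        def __init__(self, dim=256, heads=8, ffn_dim=2048):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
            self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, queries, memory, attn_mask=None):
            # 1) masked attention to image features, restricted by attn_mask (Eqs. 2-3)
            x, _ = self.cross_attn(queries, memory, memory, attn_mask=attn_mask)
            queries = self.norm1(queries + x)
            # 2) self-attention among queries gathers context between segments
            x, _ = self.self_attn(queries, queries, queries)
            queries = self.norm2(queries + x)
            # 3) feed-forward network
            return self.norm3(queries + self.ffn(queries))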
3.3. Improving training efficiency

One limitation of training universal architectures is their large memory consumption due to high-resolution mask prediction, which makes them less accessible than the more memory-friendly specialized architectures [6, 24]. For example, MaskFormer [14] can only fit a single image in a GPU with 32G memory. Motivated by PointRend [30] and Implicit PointRend [13], which show that a segmentation model can be trained with its mask loss calculated on K randomly sampled points instead of the whole mask, we calculate the mask loss with sampled points in both the matching and the final loss calculation. More specifically, in the matching loss that constructs the cost matrix for bipartite matching, we uniformly sample the same set of K points for all prediction and ground truth masks. In the final loss between predictions and their matched ground truths, we sample different sets of K points for different pairs of prediction and ground truth using importance sampling [30]. We set K = 12544, i.e., 112 × 112 points. This new training strategy effectively reduces training memory by 3x, from 18GB to 6GB per image, making Mask2Former more accessible to users with limited computational resources.
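A hedged sketch of the uniformly sampled point loss used for matching (the importance-sampling variant for the final loss is omitted); the function name is illustrative, and the loss weights are taken from Section 4.1:

    import torch
    import torch.nn.functional as F

    def point_sampled_mask_loss(pred_logits, gt_masks, K=12544):
        """BCE + dice computed on K uniformly sampled points (same points for every mask).
        pred_logits: (N, H, W) predicted mask logits; gt_masks: (N, H, W) binary ground truth.
        """
        N = pred_logits.shape[0]
        # K random point coordinates in [-1, 1] x [-1, 1], shared across all masks.
        coords = torch.rand(1, K, 1, 2) * 2 - 1
        coords = coords.expand(N, -1, -1, -1)

        # Sample prediction and ground truth at the same points instead of using full masks.
        pred = F.grid_sample(pred_logits[:, None], coords, align_corners=False)[:, 0, :, 0]   # (N, K)
        gt = F.grid_sample(gt_masks[:, None].float(), coords, align_corners=False)[:, 0, :, 0]

        bce = F.binary_cross_entropy_with_logits(pred, gt)
        prob = pred.sigmoid()
        dice = 1 - (2 * (prob * gt).sum(-1) + 1) / (prob.sum(-1) + gt.sum(-1) + 1)
        return 5.0 * bce + 5.0 * dice.mean()        # lambda_ce = lambda_dice = 5.0 (Section 4.1)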
method backbone query type epochs PQ PQTh PQSt APTh_pan mIoUpan #params. FLOPs fps
DETR [5] R50 100 queries 500+25 43.4 48.2 36.3 31.1 - - - -
MaskFormer [14] R50 100 queries 300 46.5 51.0 39.8 33.0 57.8 45M 181G 17.6
Mask2Former (ours) R50 100 queries 50 51.9 57.7 43.0 41.7 61.7 44M 226G 8.6
DETR [5] R101 100 queries 500+25 45.1 50.5 37.0 33.0 - - - -
MaskFormer [14] R101 100 queries 300 47.6 52.5 40.3 34.1 59.3 64M 248G 14.0
Mask2Former (ours) R101 100 queries 50 52.6 58.5 43.7 42.6 62.4 63M 293G 7.2
Max-DeepLab [52] Max-L 128 queries 216 51.1 57.0 42.2 - - 451M 3692G -
MaskFormer [14] Swin-L† 100 queries 300 52.7 58.5 44.0 40.1 64.8 212M 792G 5.2
K-Net [62] Swin-L† 100 queries 36 54.6 60.2 46.0 - - - - -
Mask2Former (ours) Swin-L† 200 queries 100 57.8 64.2 48.1 48.6 67.4 216M 868G 4.0

Table 1. Panoptic segmentation on COCO panoptic val2017 with 133 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones on all metrics. Our best model outperforms the prior state-of-the-art MaskFormer by 5.1 PQ and K-Net [62] by 3.2 PQ. Backbones pre-trained on ImageNet-22K are marked with †.

4. Experiments

We demonstrate that Mask2Former is an effective architecture for universal image segmentation through comparisons with specialized state-of-the-art architectures on standard benchmarks. We evaluate our proposed design decisions through ablations on all three tasks. Finally, we show that Mask2Former generalizes beyond the standard benchmarks, obtaining state-of-the-art results on four datasets.

Datasets. We study Mask2Former using four widely used image segmentation datasets that support semantic, instance and panoptic segmentation: COCO [35] (80 "things" and 53 "stuff" categories), ADE20K [65] (100 "things" and 50 "stuff" categories), Cityscapes [16] (8 "things" and 11 "stuff" categories) and Mapillary Vistas [42] (37 "things" and 28 "stuff" categories). Panoptic and semantic segmentation tasks are evaluated on the union of "things" and "stuff" categories, while instance segmentation is only evaluated on the "things" categories.

Evaluation metrics. For panoptic segmentation, we use the standard PQ (panoptic quality) metric [28]. We further report APTh_pan, which is the AP evaluated on the "thing" categories using instance segmentation annotations, and mIoUpan, which is the mIoU for semantic segmentation obtained by merging instance masks from the same category, of the same model trained only with panoptic segmentation annotations. For instance segmentation, we use the standard AP (average precision) metric [35]. For semantic segmentation, we use mIoU (mean Intersection-over-Union) [19].

4.1. Implementation details

We adopt settings from [14] with the following differences:

Pixel decoder. Mask2Former is compatible with any existing pixel decoder module. In MaskFormer [14], FPN [33] is chosen as the default for its simplicity. Since our goal is to demonstrate strong performance across different segmentation tasks, we use the more advanced multi-scale deformable attention Transformer (MSDeformAttn) [66] as our default pixel decoder. Specifically, we use 6 MSDeformAttn layers applied to the feature maps with resolutions 1/8, 1/16 and 1/32, and use a simple upsampling layer with a lateral connection on the final 1/8 feature map to generate the feature map of resolution 1/4 as the per-pixel embedding. In our ablation study, we show that this pixel decoder provides the best results across different segmentation tasks.

Transformer decoder. We use our Transformer decoder proposed in Section 3.2 with L = 3 (i.e., 9 layers in total) and 100 queries by default. An auxiliary loss is added to every intermediate Transformer decoder layer and to the learnable query features before the Transformer decoder.

Loss weights. We use the binary cross-entropy loss (instead of the focal loss [34] used in [14]) and the dice loss [41] for our mask loss: L_mask = λ_ce L_ce + λ_dice L_dice, with λ_ce = 5.0 and λ_dice = 5.0. The final loss is a combination of the mask loss and the classification loss, L_mask + λ_cls L_cls, and we set λ_cls = 2.0 for predictions matched with a ground truth and 0.1 for the "no object" class, i.e., for predictions that have not been matched with any ground truth.
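As a worked sketch of this weighting (an assumption-level illustration, not the exact matching pipeline), the classification term can be implemented by down-weighting the "no object" class inside a cross-entropy:

    import torch
    import torch.nn.functional as F

    LAMBDA_CE, LAMBDA_DICE, LAMBDA_CLS, NO_OBJECT_WEIGHT = 5.0, 5.0, 2.0, 0.1

    def total_loss(ce_mask_loss, dice_loss, class_logits, target_classes, num_classes):
        """class_logits: (N, num_classes + 1); target_classes: (N,) where index num_classes = 'no object'."""
        # Matched predictions get full weight; the "no object" class is down-weighted to 0.1.
        class_weight = torch.ones(num_classes + 1)
        class_weight[num_classes] = NO_OBJECT_WEIGHT
        cls_loss = F.cross_entropy(class_logits, target_classes, weight=class_weight)

        mask_loss = LAMBDA_CE * ce_mask_loss + LAMBDA_DICE * dice_loss
        return mask_loss + LAMBDA_CLS * cls_loss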
Post-processing. We use the exact same post-processing as [14] to acquire the expected output format for panoptic and semantic segmentation from the pairs of binary masks and class predictions. Instance segmentation requires an additional confidence score for each prediction; we multiply the class confidence and the mask confidence (i.e., the averaged foreground per-pixel binary mask probability) to obtain a final confidence.
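The scoring rule for instance segmentation can be sketched as below (the function name and the 0.5 foreground threshold are assumptions used only for illustration):

    import torch

    def instance_scores(class_logits, mask_logits):
        """Final per-instance score = class confidence x mean foreground mask probability.
        class_logits: (N, K + 1) with a trailing "no object" class; mask_logits: (N, H, W).
        """
        class_prob = class_logits.softmax(-1)[:, :-1]             # drop the "no object" column
        scores, labels = class_prob.max(-1)                        # (N,), (N,)

        mask_prob = mask_logits.sigmoid()                          # (N, H, W)
        fg = (mask_prob > 0.5).float()                             # foreground region of each mask
        mask_conf = (mask_prob * fg).flatten(1).sum(1) / (fg.flatten(1).sum(1) + 1e-6)
        return scores * mask_conf, labels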
method backbone query type epochs AP APS APM APL APboundary #params. FLOPs fps
MaskFormer [14] R50 100 queries 300 34.0 16.4 37.8 54.2 23.0 45M 181G 19.2
Mask R-CNN [24] R50 dense anchors 36 37.2 18.6 39.5 53.3 23.1 44M 201G 15.2
Mask R-CNN [18, 23, 24] R50 dense anchors 400 42.5 23.8 45.0 60.0 28.0 46M 358G 10.3
Mask2Former (ours) R50 100 queries 50 43.7 23.4 47.2 64.8 30.6 44M 226G 9.7
Mask R-CNN [24] R101 dense anchors 36 38.6 19.5 41.3 55.3 24.5 63M 266G 10.8
Mask R-CNN [18, 23, 24] R101 dense anchors 400 43.7 24.6 46.4 61.8 29.1 65M 423G 8.6
Mask2Former (ours) R101 100 queries 50 44.2 23.8 47.7 66.7 31.1 63M 293G 7.8
QueryInst [20] Swin-L† 300 queries 50 48.9 30.8 52.6 68.3 33.5 - - 3.3
Swin-HTC++ [6, 36] Swin-L† dense anchors 72 49.5 31.0 52.4 67.2 34.1 284M 1470G -
Mask2Former (ours) Swin-L† 200 queries 100 50.1 29.9 53.9 72.1 36.2 216M 868G 4.0

Table 2. Instance segmentation on COCO val2017 with 80 categories. Mask2Former outperforms strong Mask R-CNN [24] baselines on both the AP and APboundary [12] metrics while training with 8x fewer epochs. Our best model is also competitive with the state-of-the-art specialized instance segmentation model on COCO and has higher boundary quality. For a fair comparison, we only consider single-scale inference and models trained using only COCO train2017 data. Backbones pre-trained on ImageNet-22K are marked with †.

4.2. Training settings

Panoptic and instance segmentation. We use Detectron2 [57] and follow the updated Mask R-CNN [24] baseline settings¹ for the COCO dataset. More specifically, we use the AdamW [38] optimizer and the step learning rate schedule. We use an initial learning rate of 0.0001 and a weight decay of 0.05 for all backbones. A learning rate multiplier of 0.1 is applied to the backbone, and we decay the learning rate at 0.9 and 0.95 fractions of the total number of training steps by a factor of 10. If not stated otherwise, we train our models for 50 epochs with a batch size of 16. For data augmentation, we use the large-scale jittering (LSJ) augmentation [18, 23] with a random scale sampled from the range 0.1 to 2.0, followed by a fixed-size crop to 1024×1024. We use the standard Mask R-CNN inference setting, where we resize an image with the shorter side to 800 and the longer side up to 1333. We also report FLOPs and fps. FLOPs are averaged over 100 validation images (COCO images have varying sizes). Frames-per-second (fps) is measured on a V100 GPU with a batch size of 1 by taking the average runtime on the entire validation set including post-processing time.

¹ https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#new-baselines-using-large-scale-jitter-and-longer-training-schedule
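A minimal sketch of this optimizer and schedule in PyTorch, with toy modules standing in for the real backbone and heads (all module names and the step count are placeholders):

    import torch
    import torch.nn as nn

    backbone = nn.Conv2d(3, 64, kernel_size=3)        # stand-in for a ResNet/Swin backbone
    heads = nn.Linear(64, 256)                        # stand-in for pixel + Transformer decoders

    base_lr, backbone_multiplier = 1e-4, 0.1
    optimizer = torch.optim.AdamW(
        [{"params": backbone.parameters(), "lr": base_lr * backbone_multiplier},
         {"params": heads.parameters(), "lr": base_lr}],
        lr=base_lr, weight_decay=0.05)

    # Step schedule: decay by 10x at 90% and 95% of the total number of training steps.
    total_steps = 100_000                             # assumed value for illustration only
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[int(0.9 * total_steps), int(0.95 * total_steps)], gamma=0.1)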

Semantic segmentation. We follow the same settings as [14] to train our models, except that: 1) a learning rate multiplier of 0.1 is applied to both CNN and Transformer backbones, instead of only to CNN backbones as in [14]; and 2) both ResNet and Swin backbones use an initial learning rate of 0.0001 and a weight decay of 0.05, instead of using different learning rates as in [14].

4.3. Main results

Panoptic segmentation. We compare Mask2Former with state-of-the-art models for panoptic segmentation on the COCO panoptic [28] dataset in Table 1. Mask2Former consistently outperforms MaskFormer by more than 5 PQ across different backbones while converging 6x faster. With a Swin-L backbone, our Mask2Former sets a new state-of-the-art of 57.8 PQ, outperforming the existing state-of-the-art [14] by 5.1 PQ and the concurrent work K-Net [62] by 3.2 PQ. Mask2Former even outperforms the best ensemble models with extra training data in the COCO challenge (see Appendix A.1 for test set results).

Beyond the PQ metric, our Mask2Former also achieves higher performance on two other metrics compared to DETR [5] and MaskFormer: APTh_pan, which is the AP evaluated on the 80 "thing" categories using instance segmentation annotations, and mIoUpan, which is the mIoU evaluated on the 133 categories for semantic segmentation converted from panoptic segmentation annotations. This shows Mask2Former's universality: trained only with panoptic segmentation annotations, it can be used for instance and semantic segmentation.

Instance segmentation. We compare Mask2Former with state-of-the-art models on the COCO [35] dataset in Table 2. With a ResNet [25] backbone, Mask2Former outperforms a strong Mask R-CNN [24] baseline that uses large-scale jittering (LSJ) augmentation [18, 23], while requiring 8x fewer training iterations. With a Swin-L backbone, Mask2Former outperforms the state-of-the-art HTC++ [6]. Although we only observe a +0.6 AP improvement over HTC++, the Boundary AP [12] improves by 2.1, suggesting that our predictions have better boundary quality thanks to the high-resolution mask predictions. Note that for a fair comparison, we only consider single-scale inference and models trained with only COCO train2017 data.

With a ResNet-50 backbone, Mask2Former improves over MaskFormer on small objects by 7.0 APS, while overall the highest gains come from large objects (+10.6 APL). The performance on APS still lags behind other state-of-the-art models. Hence there still remains room for improvement on small objects, e.g., by using dilated backbones as in DETR [5], which we leave for future work.

method backbone crop size mIoU (s.s.) mIoU (m.s.)
MaskFormer [14] R50 512 44.5 46.7
Mask2Former (ours) R50 512 47.2 49.2
Swin-UperNet [36, 58] Swin-T 512 - 46.1
MaskFormer [14] Swin-T 512 46.7 48.8
Mask2Former (ours) Swin-T 512 47.7 49.6
MaskFormer [14] Swin-L† 640 54.1 55.6
FaPN-MaskFormer [14, 39] Swin-L-FaPN† 640 55.2 56.7
BEiT-UperNet [2, 58] BEiT-L† 640 - 57.0
Mask2Former (ours) Swin-L† 640 56.1 57.3
Mask2Former (ours) Swin-L-FaPN† 640 56.4 57.7

Table 3. Semantic segmentation on ADE20K val with 150 categories. Mask2Former consistently outperforms MaskFormer [14] by a large margin with different backbones (all Mask2Former models use MSDeformAttn [66] as the pixel decoder, except Swin-L-FaPN, which uses FaPN [39]). Our best model outperforms the best specialized model, BEiT [2]. We report both single-scale (s.s.) and multi-scale (m.s.) inference results. Backbones pre-trained on ImageNet-22K are marked with †.

Semantic segmentation. We compare Mask2Former with state-of-the-art models for semantic segmentation on the ADE20K [65] dataset in Table 3. Mask2Former outperforms MaskFormer [14] across different backbones, suggesting that the proposed improvements boost even semantic segmentation results, where [14] was already state-of-the-art. With Swin-L as the backbone and FaPN [39] as the pixel decoder, Mask2Former sets a new state-of-the-art of 57.7 mIoU. We also report test set results in Appendix A.3.

4.4. Ablation studies

We now analyze Mask2Former through a series of ablation studies using a ResNet-50 backbone [25]. To test the generality of the proposed components for universal image segmentation, all ablations are performed on three tasks.
 AP PQ mIoU FLOPs
Mask2Former (ours) 43.7 51.9 47.2 226G
- masked attention 37.8 (-5.9) 47.1 (-4.8) 45.5 (-1.7) 213G
- high-resolution features 41.5 (-2.2) 50.2 (-1.7) 46.1 (-1.1) 218G
(a) Masked attention and high-resolution features (from the efficient multi-scale strategy) lead to the most gains. More detailed ablations are in Table 4c and Table 4d. We remove one component at a time.

 AP PQ mIoU FLOPs
Mask2Former (ours) 43.7 51.9 47.2 226G
- learnable query features 42.9 (-0.8) 51.2 (-0.7) 45.4 (-1.8) 226G
- cross-attention first 43.2 (-0.5) 51.6 (-0.3) 46.3 (-0.9) 226G
- remove dropout 43.0 (-0.7) 51.3 (-0.6) 47.2 (-0.0) 226G
- all 3 components above 42.3 (-1.4) 50.8 (-1.1) 46.3 (-0.9) 226G
(b) Optimization improvements increase the performance without introducing extra compute. Following DETR [5], query features are zero-initialized when not learnable. We remove one component at a time.

 AP PQ mIoU FLOPs
cross-attention 37.8 47.1 45.5 213G
SMCA [22] 37.9 47.2 46.6 213G
mask pooling [62] 43.1 51.5 46.0 217G
masked attention 43.7 51.9 47.2 226G
(c) Masked attention. Our masked attention performs better than other variants of cross-attention across all tasks.

 AP PQ mIoU FLOPs
single scale (1/32) 41.5 50.2 46.1 218G
single scale (1/16) 43.0 51.5 46.5 222G
single scale (1/8) 44.0 51.8 47.4 239G
naïve m.s. (3 scales) 44.0 51.9 46.3 247G
efficient m.s. (3 scales) 43.7 51.9 47.2 226G
(d) Feature resolution. High-resolution features (single scale 1/8) are important. Our efficient multi-scale (efficient m.s.) strategy effectively reduces the FLOPs.

 AP PQ mIoU FLOPs
FPN [33] 41.5 50.7 45.6 195G
Semantic FPN [27] 42.1 51.2 46.2 258G
FaPN [39] 42.4 51.8 46.8 -
BiFPN [47] 43.5 51.8 45.6 204G
MSDeformAttn [66] 43.7 51.9 47.2 226G
(e) Pixel decoder. MSDeformAttn [66] consistently performs the best across all tasks.

Table 4. Mask2Former ablations. We perform ablations on three tasks: instance (AP on COCO val2017), panoptic (PQ on COCO panoptic val2017) and semantic (mIoU on ADE20K val) segmentation. FLOPs are measured on COCO instance segmentation.

Transformer decoder. We validate the importance of each component by removing them one at a time. As shown in Table 4a, masked attention leads to the biggest improvement across all tasks. The improvement is larger for instance and panoptic segmentation than for semantic segmentation. Moreover, using high-resolution features from the efficient multi-scale strategy is also important. Table 4b shows that the additional optimization improvements further improve the performance without extra computation.

Masked attention. Concurrent work has proposed other variants of cross-attention [22, 40] that aim to improve the convergence and performance of DETR [5] for object detection. Most recently, K-Net [62] replaced cross-attention with a mask pooling operation that averages features within mask regions. We validate the importance of our masked attention in Table 4c. While existing cross-attention variants may improve on a specific task, our masked attention performs the best on all three tasks.

Feature resolution. Table 4d shows that Mask2Former benefits from using high-resolution features (e.g., a single scale of 1/8) in the Transformer decoder. However, this introduces additional computation. Our efficient multi-scale (efficient m.s.) strategy effectively reduces the FLOPs without affecting the performance. Note that naively concatenating multi-scale features as input to every Transformer decoder layer (naïve m.s.) does not yield additional gains.

Pixel decoder. As shown in Table 4e, Mask2Former is compatible with any existing pixel decoder. However, we observe that different pixel decoders specialize in different tasks: while BiFPN [47] performs better on instance-level segmentation, FaPN [39] works better for semantic segmentation. Among all studied pixel decoders, MSDeformAttn [66] consistently performs the best across all tasks and is thus selected as our default. This set of ablations also suggests that designing a module like a pixel decoder for a specific task does not guarantee generalization across segmentation tasks. Mask2Former, as a universal model, could serve as a testbed for a generalizable module design.

matching loss training loss AP (COCO) PQ (COCO) mIoU (ADE20K) memory (COCO)
mask mask 41.0 50.3 45.9 18G
mask point 41.0 50.8 45.9 6G
point (ours) mask 43.1 51.4 47.3 18G
point (ours) point 43.7 51.9 47.2 6G

Table 5. Calculating loss with points vs. masks. Training with the point loss reduces training memory without influencing the performance. Matching with the point loss further improves performance.

Calculating loss with points vs. masks. In Table 5 we study the performance and memory implications of calculating the loss based on either whole masks or sampled points. Calculating the final training loss with sampled points reduces training memory by 3x without affecting the performance. Additionally, calculating the matching loss with sampled points improves performance across all three tasks.

Learnable queries as region proposals. Region proposals [1, 50], either in the form of boxes or masks, are regions that are likely to be "objects." With learnable queries being supervised by the mask loss, predictions from learnable queries can serve as mask proposals. In Figure 3 top, we visualize mask predictions of selected learnable queries before feeding them into the Transformer decoder (the proposal generation process is shown in Figure 3 bottom right). In Figure 3 bottom left, we further perform a quantitative analysis of the quality of these proposals by calculating the class-agnostic average recall with 100 predictions (AR@100) on COCO val2017. We find that these learnable queries already achieve good AR@100 compared to the
final predictions of Mask2Former after the Transformer decoder layers, i.e., layer 9, and AR@100 consistently improves with more decoder layers.

[Figure 3: mask predictions from selected learnable queries and from decoder layers 3, 6 and 9; class-agnostic AR@100 on COCO val2017: learnable queries 50.3, layer 3 56.8, layer 6 57.4, layer 9 57.7; plus an illustration of the proposal generation process (backbone, pixel decoder, learnable queries, mask).]

Figure 3. Learnable queries as "region proposals". Top: we visualize mask predictions of four selected learnable queries before feeding them into the Transformer decoder (using an R50 backbone). Bottom left: we calculate the class-agnostic average recall with 100 proposals (AR@100) and observe that these learnable queries provide good proposals compared to the final predictions of Mask2Former after the Transformer decoder layers (layer 9). Bottom right: illustration of the proposal generation process.

4.5. Generalization to other datasets

To show that our Mask2Former can generalize beyond the COCO dataset, we further perform experiments on other popular image segmentation datasets. In Table 6, we show results on Cityscapes [16]. Please see Appendix B for detailed training settings on each dataset as well as more results on ADE20K [65] and Mapillary Vistas [42].

We observe that our Mask2Former is competitive with state-of-the-art methods on these datasets as well. This suggests that Mask2Former can serve as a universal image segmentation model and that the results generalize across datasets.

method backbone | panoptic model: PQ APTh_pan mIoUpan | semantic model: mIoU (s.s.) mIoU (m.s.)
Panoptic FCN [31] Swin-L† | 65.9 - - | - -
Panoptic-DeepLab [11] SWideRNet [9] | 66.4 40.1 82.2 | - -
Panoptic-DeepLab [11] SWideRNet [9] | 67.5* 43.9* 82.9* | - -
SETR [64] ViT-L† [17] | - - - | - 82.2
SegFormer [59] MiT-B5 [59] | - - - | - 84.0
Mask2Former (ours) R50 | 62.1 37.3 77.5 | 79.4 82.2
Mask2Former (ours) Swin-B† | 66.1 42.8 82.7 | 83.3 84.5
Mask2Former (ours) Swin-L† | 66.6 43.6 82.9 | 83.3 84.3

Table 6. Cityscapes val. Mask2Former is competitive with specialized models on Cityscapes. Panoptic segmentation models use single-scale inference by default; multi-scale numbers are marked with *. For semantic segmentation, we report both single-scale (s.s.) and multi-scale (m.s.) inference results. Backbones pre-trained on ImageNet-22K are marked with †.

4.6. Limitations

Our ultimate goal is to train a single model for all image segmentation tasks. In Table 7, we find that Mask2Former trained on panoptic segmentation only performs slightly worse than the exact same model trained with the corresponding annotations for the instance and semantic segmentation tasks across three datasets. This suggests that even though Mask2Former can generalize to different tasks, it still needs to be trained for those specific tasks. In the future, we hope to develop a model that can be trained only once for multiple tasks and even for multiple datasets.

training annotations | (a) COCO: PQ AP mIoU | (b) ADE20K: PQ AP mIoU | (c) Cityscapes: PQ AP mIoU
panoptic | 51.9 41.7 61.7 | 39.7 26.5 46.1 | 62.1 37.3 77.5
instance | - 43.7 - | - 26.4 - | - 37.4 -
semantic | - - 61.5 | - - 47.2 | - - 79.4

Table 7. Limitations of Mask2Former. Although a single Mask2Former can address any segmentation task, we still need to train it on different tasks. Across three datasets, we find that Mask2Former trained with panoptic annotations performs slightly worse than the exact same model trained specifically for the instance and semantic segmentation tasks with the corresponding data.

Furthermore, as seen in Tables 2 and 4d, even though it improves over baselines, Mask2Former struggles with segmenting small objects and is unable to fully leverage multi-scale features. We believe that better utilization of the feature pyramid and designing losses for small objects are critical.

5. Conclusion

We present Mask2Former for universal image segmentation. Built upon a simple meta framework [14] with a new Transformer decoder using the proposed masked attention, Mask2Former obtains top results in all three major image segmentation tasks (panoptic, instance and semantic) on four popular datasets, outperforming even the best specialized models designed for each benchmark while remaining easy to train. Mask2Former saves 3x research effort compared to designing specialized models for each task, and it is accessible to users with limited computational resources. We hope to attract interest in universal model design.

Ethical considerations: While our technical innovations do not appear to have any inherent biases, the models trained with our approach on real-world datasets should undergo ethical review to ensure the predictions do not propagate problematic stereotypes, and the approach should not be used for applications including but not limited to illegal surveillance.

Acknowledgments: Thanks to Nicolas Carion and Xingyi Zhou for helpful feedback. BC and AS are supported in part by NSF #1718221, 2008387, 2045586, 2106825, MRI #1725729, NIFA 2020-67021-32799 and Cisco Systems Inc. (CG 1377144 - thanks for access to Arcetri).
References

[1] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[2] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv, 2021.
[3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT++: Better real-time instance segmentation, 2019.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[6] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 2018.
[8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
[9] Liang-Chieh Chen, Huiyu Wang, and Siyuan Qiao. Scaling wide residual networks for panoptic segmentation. arXiv:2011.11675, 2020.
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[11] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu, Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
[12] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In CVPR, 2021.
[13] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. arXiv, 2021.
[14] Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.
[15] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[16] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[18] Xianzhi Du, Barret Zoph, Wei-Chih Hung, and Tsung-Yi Lin. Simple training strategies and model scaling for object detection. arXiv preprint arXiv:2107.00057, 2021.
[19] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
[20] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and Wenyu Liu. Instances as queries. In ICCV, 2021.
[21] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
[22] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of DETR with spatially modulated co-attention. In ICCV, 2021.
[23] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[26] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. CCNet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
[27] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In CVPR, 2019.
[28] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[29] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
[30] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. In CVPR, 2020.
[31] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Yukang Chen, Lu Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation with point-based supervision. arXiv preprint arXiv:2108.07682, 2021.
[32] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Tong Lu, and Ping Luo. Panoptic SegFormer. arXiv preprint arXiv:2109.03814, 2021.
[33] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[34] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[36] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030, 2021.
[37] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[39] Shihua Huang, Zhichao Lu, Ran Cheng, and Cheng He. FaPN: Feature-aligned pyramid network for dense image prediction. arXiv, 2021.
[40] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional DETR for fast training convergence. In ICCV, 2021.
[41] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
[42] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In CVPR, 2017.
[43] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[44] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[45] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
[46] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris M Kitani. Rethinking transformer-based set prediction for object detection. In ICCV, 2021.
[47] Mingxing Tan, Ruoming Pang, and Quoc V Le. EfficientDet: Scalable and efficient object detection. In CVPR, 2020.
[48] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. arXiv:2005.10821, 2020.
[49] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
[50] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 2013.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[52] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
[53] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. PAMI, 2019.
[54] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVTv2: Improved baselines with pyramid vision transformer. arXiv preprint arXiv:2106.13797, 2021.
[55] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[56] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic and fast instance segmentation. NeurIPS, 2020.
[57] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[58] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
[59] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[60] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, 2019.
[61] Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. OCNet: Object context for semantic segmentation. IJCV, 2021.
[62] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-Net: Towards unified image segmentation. In NeurIPS, 2021.
[63] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
[64] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[65] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
[66] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.
