A Simple Single-Scale Vision Transformer For Object Detection and Instance Segmentation
Wuyang Chen^{1⋆}, Xianzhi Du^2, Fan Yang^2, Lucas Beyer^2, Xiaohua Zhai^2, Tsung-Yi Lin^2, Huizhong Chen^2, Jing Li^2, Xiaodan Song^2, Zhangyang Wang^1, and Denny Zhou^2
^1 University of Texas at Austin, Austin TX 78712, USA
{wuyang.chen,atlaswang}@utexas.edu
^2 Google
{xianzhi,fyangf,lbeyer,xzhai,tsungyi,huizhongc,jingli,xiaodansong,dennyzhou}@google.com
1 Introduction
Transformer [49], the de-facto standard architecture for natural language process-
ing (NLP), recently has shown promising results on computer vision tasks. Vision
Transformer (ViT) [20], an architecture consisting of a sequence of transformer
⋆ Work done during the first author's research internship with Google.
Fig. 1: Trade-off between mAP (COCO) and FLOPs (left) / number of parameters (right). We compare our UViT / UViT+ with Swin Transformer [36], ViT [55], and ResNet (18/50/101/152) [27], all adopting the same standard Cascade Mask R-CNN framework [6]. Our UViT is compact, strong, and simple, avoiding any hierarchical design ("SD": spatial downsampling, "MF": multi-scale features, "2×": doubled channels).
2 Related Works
CNNs are now mainstream and standard deep network models for dense predic-
tion tasks in computer vision, such as object detection and semantic segmentation.
Over decades of development, several high-level and fundamental design conventions have emerged: 1) deeper networks for more accurate function approximation [16,21,22,32]: ResNet [27], DenseNet [30]; 2) narrow widths in early layers to afford high feature resolutions, and wider widths in deeper layers for compressed features, which delivers a good performance-efficiency trade-off: VGG [43], ResNet [27]; 3) enlarged receptive fields for learning long-range correlations: dilated
convolution (Deeplab series [11]), deformable convolutions [18]; 4) hierarchical
feature pyramids for learning across a wide range of object scales: FPN [34],
ASPP [11], HRNet [50]. In short, the motivations behind these successful design solutions are twofold: 1) to support the semantic understanding of objects
with diverse sizes and scales; 2) to maintain a balanced computation cost under
large input sizes. These two motivations, or challenges, also exist in designing
our UViT architectures when we are facing dense prediction tasks, for which we
provide a comprehensive study in our work (Section 3.1).
The first ViT work [20] adopts a transformer encoder on coarse non-overlapping image patches for image classification and requires large-scale datasets (JFT [44], ImageNet-21K [19]) for pretraining. DeiT [47] further introduces strong augmentations at both the data and architecture levels to efficiently train ViT on ImageNet-1k [19]. Beyond image classification, a growing number of works design ViT backbones for dense prediction tasks. Early attempts directly learn high-resolution features extracted by a ViT backbone via extra interpolation
or convolution layers [4,58]. Some works also leverage self-attention operations
to replace partial or all convolution layers in CNNs [29,41,56]. More recent
trends [15,24,36,52,53,54] start following design conventions in CNNs discussed
above (Section 2.1) and customize ViT architectures to be CNN-like: tokens
are progressively merged to downsample the feature resolutions with reduced
computation cost, along with increased embedding sizes. Multi-scale feature maps
are also collected from the ViT backbone. These ViT works can successfully
achieve strong or state-of-the-art performance on object detection or semantic
segmentation, but the architecture is again highly customized for vision problems and loses the potential for multi-modal learning in the future. More importantly, those CNN-like design conventions are directly inherited by ViTs without a
clear understanding of each individual benefit, leading to empirical black-box
designs. In contrast, the simple and neat solution we will provide is motivated
by a complete study on ViT’s architecture preference on dense prediction tasks
(Section 3.1 and 3.2).
Since the architecture of vision transformers is still in its infancy, few works systematically study the principles behind ViT's model design and scaling rules. Early works leverage coarse tokenization, a constant feature resolution, and a constant hidden size [20,47], while recently fine-grained tokens, spatial downsampling, and doubled channels have also become popular in ViT
design [36,59]. They all achieve good performance, calling for an organized study
on the benefits of different fundamental designs. In addition, different learning
behaviors of self-attention (compared with CNNs) make the scaling law of ViTs highly unclear. Recent work [40] revealed that ViTs generate more uniform receptive fields across layers, enabling the aggregation of global information even in early layers. This contrasts with CNNs, which require deeper layers to learn global visual information [12]. Attention scores of ViTs are also found to gradually
become indistinguishable as the encoder goes deeper, leading to identical and
redundant feature maps, and plateaued performance [59]. These observations
all indicate that previously discovered design conventions and scaling laws for
CNNs [27,46] may not be suitable for ViTs, thus calling for comprehensive studies
on the new inductive bias of ViT’s architecture on dense prediction tasks.
3 Methods
Our work targets a simple ViT model for dense prediction tasks, avoiding hand-crafted architectural customization. We will first explain
our motivations with comprehensive ablation studies on individual design benefits
in Section 3.1, and then elaborate on the discovered principles of our UViT designs
in Section 3.2.
Observations
– Spatial Downsampling (“SD”) does not seem to be beneficial. Our hypothesis
is that, under the same FLOPs constraint, the self-attention layers already
provide global features, and do not need to downsample the features to enlarge
the receptive field.
– Multi-scale Features (“MF”) can mitigate the poor performance from down-
sampling by leveraging early high-resolution features (“SD+MF”). However,
the vanilla setting still outperforms this combination. We hypothesize that
high-resolution features are extracted too early in the encoder; in contrast,
tokens in vanilla ViTs are able to learn fine-grained details throughout the
encoder blocks.
– Doubled channels ("2×") plus multi-scale features ("MF") may seem competitive. However, ViT does not show a strong inductive bias toward "deeper compressed features with larger embedding dimensions". This observation is
also aligned with findings in [40] that ViTs have highly similar representations
throughout the model, indicating that we should not sacrifice embedding
dimensions of early layers to compensate for deeper layers.
In summary, we did not find strong benefits from adopting CNN-like design conventions. Instead, a simple architecture with a constant feature resolution and hidden size can serve as a strong ViT baseline.
Fig. 3: We keep the architecture of our UViT neat: image patches (plus position
embeddings) are processed by a stack of vanilla attention blocks with a constant
resolution and hidden size. Single-scale feature maps as outputs are fed into head modules
for detection or segmentation tasks. Constant (UViT, Section A.1) or progressive
(UViT+, Section 3.2.2) attention windows are introduced to reduce the computation
cost. We demonstrate that this simple architecture is strong, without introducing design
overhead from hierarchical spatial downsampling, doubled channels, and multi-scale
feature pyramids.
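To make this single-scale design concrete, below is a minimal PyTorch-style sketch of such an encoder. It follows the description above (8×8 patches, a constant token resolution and hidden size, a stack of vanilla attention blocks, default depth 18 / width 384 / 6 heads), but it is our own illustrative simplification rather than the released implementation; position embeddings and the attention windows of UViT/UViT+ are omitted for brevity.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """A vanilla pre-norm transformer block (self-attention + MLP), as in ViT."""

    def __init__(self, dim, heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class SingleScaleViT(nn.Module):
    """Illustrative single-scale encoder: 8x8 patches, then identical attention
    blocks with a constant token resolution and hidden size (no downsampling,
    no channel doubling, no multi-scale pyramid)."""

    def __init__(self, depth=18, dim=384, heads=6, patch=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])

    def forward(self, img):                           # img: (B, 3, H, W)
        x = self.patch_embed(img)                     # (B, dim, H/8, W/8)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)              # (B, H/8 * W/8, dim)
        # (learned position embeddings and attention windows omitted here)
        for blk in self.blocks:
            x = blk(x)                                # resolution / width never change
        return x.transpose(1, 2).reshape(b, c, h, w)  # single-scale map for the head
```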
Though simple, our design still has two core questions to be determined: (1) how to balance UViT's depth, width, and input size to achieve the best performance-efficiency trade-off (Section A.1); and (2) which attention window strategy can effectively save computation cost without sacrificing performance (Section 3.2.2).
Fig. 4: Input scaling rule for UViT on COCO object detection. Given a fixed depth,
an input size of 896 × 896 (thin solid line) leaves more room for model scaling (by
increasing the width) and is slightly better than 1024 × 1024 (thick solid line); and
640 × 640 (dashed line) or 768 × 768 (dotted line) are of worse performance-efficiency
trade-off. Black capital letters “T ”, “S ”, and “B ” annotate three final depth/width
configurations of UViT variants we will propose (Table 2). Different sizes of markers
represent the hidden sizes (widths).
Observations
In summary, based on our final compound scaling rule, we propose the basic version of UViT with 18 attention blocks under an 896 × 896 input size. See our supplement for more architecture details.
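As a rough sanity check on this depth/width/input-size trade-off, the per-block cost of a vanilla ViT block can be estimated from the token count N and hidden size D (projections roughly 4ND^2, the attention matrix 2N^2D, and the MLP 8ND^2 with expansion ratio 4). The sketch below counts backbone-only multiply-accumulates, assumes global attention, and ignores the detection head and attention windows, so its absolute numbers will not match the GFLOPs in our tables; it only illustrates how larger inputs shift cost toward the quadratic attention term.

```python
def vit_block_cost(num_tokens: int, dim: int, mlp_ratio: int = 4) -> float:
    """Rough per-block cost (multiply-accumulates) of a vanilla ViT block:
    Q/K/V/output projections + attention matrix + two-layer MLP."""
    proj = 4 * num_tokens * dim * dim               # QKV and output projections
    attn = 2 * num_tokens * num_tokens * dim        # QK^T and attention @ V
    mlp = 2 * num_tokens * dim * (mlp_ratio * dim)  # two linear layers
    return proj + attn + mlp


def backbone_cost(img_size: int, depth: int, dim: int, patch: int = 8) -> float:
    tokens = (img_size // patch) ** 2               # e.g. 896 // 8 = 112 -> 12544 tokens
    return depth * vit_block_cost(tokens, dim)


# Larger inputs shift cost toward the quadratic attention term, one reason
# 896 x 896 leaves more headroom for widening than 1024 x 1024.
# (Backbone only, global attention, no detection head: absolute numbers will
# not match the GFLOPs reported in the tables.)
for size in (640, 768, 896, 1024):
    print(size, f"{backbone_cost(size, depth=18, dim=384) / 1e9:.0f} GMACs")
```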
Fig. 6: Relative attention receptive field of an ImageNet-pretrained ViT-B/16 [20] along depth (indices of attention blocks), on the COCO dataset. Error bars are standard deviations across different attention heads.
We collect the averages and standard deviations across different attention heads. As shown in Figure 6, tokens in early attention layers have relatively small receptive fields, while deeper layers attend more globally.
2) Do deeper layers require global attention, or are local attentions sufficient? To compare with global attention (window scale of 1), we also try small attention windows (window scale of 1/2) in deeper layers.
To represent an attention window strategy that "progressively increases window scales from 1/4 to 1/2 to 1", we use a simple annotation "[4^{-1}] × 14 → [2^{-1}] × 2 → [1] × 2", indicating that there are 14 attention blocks assigned 4^{-1}-scale windows, then two attention blocks assigned 2^{-1}-scale windows, and finally two attention blocks assigned 1-scale (global) windows. When comparing different window strategies, we make sure all strategies have the same number of parameters and share a similar computation cost for fair comparison. We also include four more baselines with a constant attention window scale across all attention blocks: global attention, and windows of 1/4 ∼ 1/2 scale. We show our results in Table 1, and summarize observations below:
In conclusion, we set the window scale of our basic version (UViT, Section A.1) as a constant 2^{-1}, and propose an improved version of our model, dubbed "UViT+", with the attention window strategy "[4^{-1}] × 14 → [2^{-1}] × 2 → [1] × 2". For example, if the input sequence has (896/8) × (896/8) = 112 × 112 tokens, a window of scale 1/16 contains 7 × 7 = 49 tokens; similarly for scales 1/8 and 1/4.
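A small sketch of the bookkeeping behind this annotation (the function and the schedule variable are ours, for illustration only):

```python
def window_tokens(img_size: int = 896, patch: int = 8, scale_denom: int = 16) -> int:
    """Tokens inside one attention window for a given window scale. The window
    scale is relative to the token grid: a 1/16-scale window on the 112 x 112
    grid of an 896 x 896 input spans 7 x 7 = 49 tokens."""
    grid = img_size // patch          # 896 // 8 = 112
    side = grid // scale_denom        # e.g. 112 // 16 = 7
    return side * side


# The "[4^-1] x 14 -> [2^-1] x 2 -> [1] x 2" schedule of UViT+ in plain form:
schedule = [(4, 14), (2, 2), (1, 2)]  # (window scale denominator, #blocks)
for denom, blocks in schedule:
    print(f"{blocks} blocks with 1/{denom}-scale windows of "
          f"{window_tokens(scale_denom=denom)} tokens each")
```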
Table 1: Over-shrunk window sizes in early layers are harmful, and global attention windows in deep layers are vital to the final performance. Fractions in brackets indicate attention window scales (relative to the sequence feature size), and the multiplier indicates the number of attention blocks allocated to an attention window scale (18 blocks in total). Standard deviations of three random runs are shown in parentheses.
[window_scale] × #layers                                              GFLOPs   APval         Img/s
[1] × 18                                                              2961.9   52.4 (0.09)    3.5
[2^{-1}] × 18                                                         1298.7   52.3 (0.17)   10.5
[16^{-1}] × 4 → [8^{-1}] × 4 → [4^{-1}] × 4 → [2^{-1}] × 4 → [1] × 2  1154.3   52.0 (0.15)   11.5
[8^{-1}] × 9 → [4^{-1}] × 4 → [2^{-1}] × 3 → [1] × 2                  1131.2   52.2 (0.21)   12.7
[4^{-1}] × 14 → [2^{-1}] × 2 → [1] × 2                                1160.1   52.5 (0.11)   12.3
[4^{-1}] × 6 → [2^{-1}] × 12                                          1160.1   52.2 (0.12)   12.5
4 Final Results
We conduct our experiments on COCO [35] object detection and instance seg-
mentation to show our final performance.
4.1 Implementations
Table 2: Architecture variants of our UViT with ImageNet [19] Pretraining Performance.
Table 3: Two-stage object detection and instance segmentation results on COCO 2017. We compare different backbones with Cascade Mask R-CNN on a single model without test-time augmentation. UViT sets a constant window scale of 2^{-1}, and UViT+ adopts the attention window strategy "[4^{-1}] × 14 → [2^{-1}] × 2 → [1] × 2".
Settings. Object detection experiments are conducted on COCO 2017 [35], which contains 118K training and 5K validation images. We consider the popular Cascade Mask R-CNN detection framework [6,26], and leverage multi-scale training [45,8] (resizing the input to 896 × 896), the AdamW optimizer [37] with an initial learning rate of 3 × 10^{-3}, a weight decay of 1 × 10^{-4}, and a batch size of 256. As above, the throughput ("Img/s") measures the latency of UViTs with one COCO image per TPU core.
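For reference, these hyperparameters can be summarized as a plain configuration; the values are copied from the text above, and anything not stated there (learning-rate schedule, augmentation details, number of epochs) is intentionally left out.

```python
coco_training_config = dict(
    dataset="COCO 2017 (118K train / 5K val)",
    detector="Cascade Mask R-CNN",
    input_size=(896, 896),        # with multi-scale training
    optimizer="AdamW",
    base_learning_rate=3e-3,
    weight_decay=1e-4,
    batch_size=256,
)
```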
From Table 3 we can see that, across different levels of model variants, our UViTs are highly compact. Compared with both CNNs and other ViT works, our UViT achieves strong results with much better efficiency: with similar GFLOPs, UViT uses far fewer parameters (at least 44.9% parameter reduction compared with Swin [36]). To keep this comparison clean, we did not adopt any system-level techniques [36] to boost the performance^5. As we did not leverage any CNN-like hierarchical pyramid structures, the results of our simple and neat solution suggest that the original design philosophy of ViT [20] is a strong baseline without any hand-crafted architecture customization. We also show the mAP-efficiency trade-off curve in Figure 1. In addition, we adopt our UViT-B backbone with the Mask R-CNN [26] framework, achieving 50.5 APval with 1026.1 GFLOPs.
Additionally, we adopt self-training on top of our largest model (UViT-B) to evaluate the performance gain from leveraging unlabeled data, similar to [60]. We
use ImageNet-1K without labels as the unlabeled set, and a pretrained UViT-B
model as the teacher model to generate pseudo-labels. All predicted boxes with
confidence scores larger than 0.5 are kept, together with their corresponding
masks. For UViT-B with self-training, the student model is initialized with the same weights as the teacher model. The ratio of labeled data to pseudo-labeled
data is 1:1 in each batch. Apart from increasing training steps by 2× for each
epoch, all other hyperparameters remain unchanged. We can see from the last
row in Table 3 that self-training significantly improves box AP and mask AP by
1.4% and 1.3%, respectively.
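A minimal sketch of the pseudo-labeling step described above; the prediction format (a list of dicts with 'box', 'mask', and 'score' keys) and both helper names are hypothetical, not the actual pipeline.

```python
def filter_pseudo_labels(predictions, score_threshold=0.5):
    """Keep predicted boxes (and their corresponding masks) whose confidence
    exceeds the threshold. Prediction format here is hypothetical: a list of
    dicts with 'box', 'mask', and 'score' keys."""
    return [p for p in predictions if p["score"] > score_threshold]


def mix_batch(labeled, pseudo_labeled):
    """Assemble a batch with a 1:1 ratio of labeled to pseudo-labeled examples,
    as in the self-training setup described above (sketch only)."""
    half = min(len(labeled), len(pseudo_labeled))
    return labeled[:half] + pseudo_labeled[:half]
```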
5 Conclusion
We present a simple, single-scale vision transformer backbone that can serve as a strong baseline for object detection and semantic segmentation. Our novelty is not "to add" any special layers to ViT, but instead to choose "not to add" complex designs, with strong motivations and clear experimental support. ViT was proposed for image classification. To adapt ViT to dense vision tasks, recent works choose "to add" more CNN-like designs (multi-scale features, doubled channels, spatial reduction). But these add-ons mainly follow the success of CNNs, and their compatibility with attention layers has not been verified. However, our detailed study shows that CNN-like designs are not prerequisites for ViT, and a vanilla ViT architecture plus a better scaling rule (depth, width, input size) and a progressive attention window strategy can indeed achieve high detection performance. Our proposed UViT architectures achieve strong performance on both COCO object detection and instance segmentation. Our uniform design has the potential to support multi-modal/multi-task learning and vision-language problems. Most importantly, we hope our work brings to the community's attention that ViTs may require careful and dedicated architecture design for dense prediction tasks, instead of directly adopting CNN design conventions as a black box.
^5 As we adopt the popular Cascade Mask R-CNN detection framework [6,26], some previous detection works [14,45] may not be directly comparable.
References
1. Amirul Islam, M., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement
network for dense image labeling. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 3751–3759 (2017)
2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., Schmid, C.: Vivit: A
video vision transformer. arXiv preprint arXiv:2103.15691 (2021)
3. Artacho, B., Savakis, A.: Waterfall atrous spatial pooling architecture for efficient
semantic segmentation. Sensors 19(24), 5361 (2019)
4. Beal, J., Kim, E., Tzeng, E., Park, D.H., Zhai, A., Kislyuk, D.: Toward transformer-
based object detection. arXiv preprint arXiv:2012.09958 (2020)
5. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer.
arXiv preprint arXiv:2004.05150 (2020)
6. Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection.
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 6154–6162 (2018)
7. Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H.: Global context networks. IEEE Transactions
on Pattern Analysis and Machine Intelligence (2020)
8. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-
to-end object detection with transformers. In: European Conference on Computer
Vision. pp. 213–229. Springer (2020)
9. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image
segmentation with deep convolutional nets and fully connected crfs. arXiv preprint
arXiv:1412.7062 (2014)
10. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution, and
fully connected crfs. IEEE transactions on pattern analysis and machine intelligence
40(4), 834–848 (2017)
11. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution
for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
12. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with
atrous separable convolution for semantic image segmentation. In: Proceedings of
the European conference on computer vision (ECCV). pp. 801–818 (2018)
13. Chen, X., Hsieh, C.J., Gong, B.: When vision transformers outperform resnets
without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548
(2021)
14. Chen, Y., Zhang, Z., Cao, Y., Wang, L., Lin, S., Hu, H.: Reppoints v2: Verification
meets regression for object detection. Advances in Neural Information Processing
Systems 33, 5621–5631 (2020)
15. Chu, X., Zhang, B., Tian, Z., Wei, X., Xia, H.: Do we really need explicit position
encodings for vision transformers? arXiv e-prints pp. arXiv–2102 (2021)
16. Cohen, N., Sharir, O., Shashua, A.: On the expressive power of deep learning: A
tensor analysis. In: Conference on learning theory. pp. 698–728. PMLR (2016)
17. Crotts, A.P.S.: Vatt/columbia microlensing survey of m31 and the galaxy. arXiv:
Astrophysics (1996)
18. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolu-
tional networks. In: Proceedings of the IEEE international conference on computer
vision. pp. 764–773 (2017)
19. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale
hierarchical image database. In: 2009 IEEE conference on computer vision and
pattern recognition. pp. 248–255. IEEE (2009)
20. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T.,
Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16
words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
(2020)
21. Elbrächter, D., Perekrestenko, D., Grohs, P., Bölcskei, H.: Deep neural network
approximation theory. arXiv preprint arXiv:1901.02220 (2019)
22. Eldan, R., Shamir, O.: The power of depth for feedforward neural networks. In:
Conference on learning theory. pp. 907–940. PMLR (2016)
23. Ghiasi, G., Fowlkes, C.C.: Laplacian pyramid reconstruction and refinement for
semantic segmentation. In: European conference on computer vision. pp. 519–534.
Springer (2016)
24. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer.
arXiv preprint arXiv:2103.00112 (2021)
25. Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours
from inverse detectors. In: 2011 International Conference on Computer Vision. pp.
991–998. IEEE (2011)
26. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the
IEEE international conference on computer vision. pp. 2961–2969 (2017)
27. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
28. Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions
of vision transformers. arXiv preprint arXiv:2103.16302 (2021)
29. Hu, H., Zhang, Z., Xie, Z., Lin, S.: Local relation networks for image recognition.
In: Proceedings of the IEEE International Conference on Computer Vision. pp.
3464–3473 (2019)
30. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected
convolutional networks. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 4700–4708 (2017)
31. Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B.: Shuffle transformer:
Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650
(2021)
32. Liang, S., Srikant, R.: Why deep neural networks for function approximation? arXiv
preprint arXiv:1610.04161 (2016)
33. Lin, G., Milan, A., Shen, C., Reid, I.: Refinenet: Multi-path refinement networks
for high-resolution semantic segmentation. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp. 1925–1934 (2017)
34. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 2117–2125 (2017)
35. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference
on computer vision. pp. 740–755. Springer (2014)
36. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. arXiv preprint
arXiv:2103.14030 (2021)
37. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101 (2017)
38. Naseer, M., Ranasinghe, K., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Intriguing
properties of vision transformers. arXiv preprint arXiv:2105.10497 (2021)
39. Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve
semantic segmentation by global convolutional network. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp. 4353–4361 (2017)
40. Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision
transformers see like convolutional neural networks? arXiv preprint arXiv:2108.08810
(2021)
41. Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.:
Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909 (2019)
42. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical
image segmentation. In: International Conference on Medical image computing and
computer-assisted intervention. pp. 234–241. Springer (2015)
43. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
44. Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness
of data in deep learning era. In: Proceedings of the IEEE international conference
on computer vision. pp. 843–852 (2017)
45. Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L.,
Yuan, Z., Wang, C., et al.: Sparse r-cnn: End-to-end object detection with learnable
proposals. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 14454–14463 (2021)
46. Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural
networks. arXiv preprint arXiv:1905.11946 (2019)
47. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training
data-efficient image transformers & distillation through attention. arXiv preprint
arXiv:2012.12877 (2020)
48. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper
with image transformers. arXiv preprint arXiv:2103.17239 (2021)
49. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information
processing systems 30, 5998–6008 (2017)
50. Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y.,
Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual
recognition. IEEE transactions on pattern analysis and machine intelligence (2020)
51. Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., Cottrell, G.: Under-
standing convolution for semantic segmentation. In: 2018 IEEE winter conference
on applications of computer vision (WACV). pp. 1451–1460. IEEE (2018)
52. Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao,
L.: Pyramid vision transformer: A versatile backbone for dense prediction without
convolutions. arXiv preprint arXiv:2102.12122 (2021)
53. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer:
Simple and efficient design for semantic segmentation with transformers. arXiv
preprint arXiv:2105.15203 (2021)
54. Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Tay, F.E., Feng, J., Yan, S.: Tokens-
to-token vit: Training vision transformers from scratch on imagenet. arXiv preprint
arXiv:2101.11986 (2021)
55. Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 12104–12113 (2022)
56. Zhao, H., Jia, J., Koltun, V.: Exploring self-attention for image recognition. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 10076–10085 (2020)
57. Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 2881–2890 (2017)
58. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T.,
Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 6881–6890 (2021)
59. Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Jiang, Z., Hou, Q., Feng, J.: Deepvit:
Towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
60. Zoph, B., Ghiasi, G., Lin, T.Y., Cui, Y., Liu, H., Cubuk, E.D., Le, Q.V.: Rethinking
pre-training and self-training. arXiv preprint arXiv:2006.06882 (2020)
A.2 Architectures
We propose three variants of our UViT. The architecture configurations of our model variants are listed in Table 4, and are also annotated in Figure 7 ("T", "S", "B" in white). The number of heads is fixed at six, and the expansion ratio of each FFN (feed-forward network) layer is fixed at four in all experiments. We also scale up our UViT into a huge version following the design in [20,55], and denote it as "UViT-H".
^6 This scaling rule is studied before the attention window strategy in Section A.3. Thus, for all models in Figure 7, we adopt a window scale of 2^{-1} for fair comparison.
Fig. 7: Model scaling rule for UViT on Pascal VOC semantic segmentation (ImageNet
pretrained, before COCO pretraining). 32 attention blocks (blue) perform better
than shallower UViTs. Different sizes of markers represent the hidden sizes (widths).
Table 4: Architecture variants of our UViT for Pascal VOC semantic segmentation.

Name     Depth   Hidden Size   Params. (M)
UViT-T   32      192           18.1
UViT-S   32      240           26.7
UViT-B   32      342           50.7
UViT-H   32      1280          529.9
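For convenience, the Table 4 variants can also be written as plain configurations; the head count (6) and FFN expansion ratio (4) come from the text above for UViT-T/S/B, while the head count of UViT-H is not stated here and is therefore left unspecified.

```python
# Depth / width of the UViT variants in Table 4 (Pascal VOC segmentation).
# num_heads = 6 and mlp_ratio = 4 follow the text above for UViT-T/S/B;
# the head count for UViT-H is not stated here, so it is omitted.
UVIT_SEG_VARIANTS = {
    "UViT-T": dict(depth=32, hidden_size=192, num_heads=6, mlp_ratio=4),
    "UViT-S": dict(depth=32, hidden_size=240, num_heads=6, mlp_ratio=4),
    "UViT-B": dict(depth=32, hidden_size=342, num_heads=6, mlp_ratio=4),
    "UViT-H": dict(depth=32, hidden_size=1280, mlp_ratio=4),
}
```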
As shown in Table 5, progressively growing attention windows again achieve the best performance: global attention in deep layers is vital, and smaller windows in early layers improve efficiency. In conclusion, we set the window scale of our UViT as "[2^{-1}] × 28 → [1] × 4" for Pascal VOC semantic segmentation.
Table 5: Local attention windows in early layers can improve model efficiency, and global attention windows in deep layers are vital to the final performance on Pascal VOC. Model: UViT-B.

[window_scale] × #layers     GFLOPs   mIoU
[1] × 32                     596.7    81.1
[2^{-1}] × 32                315.2    80.6
[2^{-1}] × 28 → [1] × 4      350.4    81.2
Table 6: Segmentation results on Pascal VOC 2012. Our UViT leverages a plain
convolutional segmentation head, without any test-time augmentation.
Backbone Resolution GFLOPs Params. (M) mIoU
WASPnet-CRF [3] - - 47.5 80.4
DeepLabv3+ (ResNet-101) [12] 512×512 298 58.6 79.4
UViT-T (ours) 512×512 163 18.1 79.0
UViT-S (ours) 512×512 215 26.7 79.9
UViT-B (ours) 512×512 350 50.7 81.2
UViT-H (ours) 640×640 3846 529.9 88.1
We show all architectures studied in our compound scaling rule in Figure 4 and Figure 5 (main body). All models use 2^{-1}-scale attention windows for fair comparison.
Table 7: Model architectures in Figure 2 (main body), all studied under a 640×640 input size on MS-COCO. "SD": spatial downsampling. "MF": multi-scale features. "2×": doubled channels. Without any of these three techniques (first section in this table), the whole network has a constant feature resolution and hidden size; all other seven settings split the network into three stages, since they require either progressive feature downsampling or multi-scale features from each stage. The input scale is relative to the 2D shape of the input image H × W (e.g., 8^{-1} indicates that the 2D shape of the UViT's sequence feature is H/8 × W/8). The window scale is relative to the 2D shape of the sequence feature h × w (e.g., 8^{-1} indicates that the 2D shape of the attention window is h/8 × w/8). The feature maps fed into the FPN detection head are the last output of the backbone if no "MF" is applied, or the features from all three stages if "MF" is applied.
No SD / MF / 2× (single stage: input scale 8^{-1}, 18 layers, hidden size 384, output scale 8^{-1}, 72.1 M params):
  Window Scale   FLOPs (G)   mAP
  16^{-1}        534.1       44.5
  8^{-1}         540.9       48.2
  4^{-1}         567.9       50.1
  2^{-1}         676.2       50.7
  1              1109.1      50.8
SD only (stage input scales 8^{-1}/16^{-1}/32^{-1}, window scale 1, hidden size 384, output fed to the head from Stage 3 at 32^{-1}, 72.1 M params):
  #Layers (S1/S2/S3)   FLOPs (G)   mAP
  6/6/6                607.1       41.0
  8/5/5                688.28      42.0
  10/4/4               769.47      42.6
  12/3/3               850.68      43.0
  14/2/2               931.88      43.4

MF only (input scale 8^{-1}, 6 layers per stage, hidden size 384, outputs fed to the head at 8^{-1}/16^{-1}/32^{-1}, 72.1 M params):
  Window Scale   FLOPs (G)   mAP
  16^{-1}        534.3       44.3
  8^{-1}         541.03      47.6
  4^{-1}         568.09      49.4
  2^{-1}         676.33      50.3
  1              1109.3      50.2

2× only (input scale 8^{-1}, 6 layers per stage, hidden sizes 152/304/608, output fed to the head at 8^{-1}, 73.8 M params):
  Window Scale   FLOPs (G)   mAP
  16^{-1}        558.4       43.4
  8^{-1}         561.5       44.4
  4^{-1}         587.7       46.3
  2^{-1}         692.2       46.6
  1              1110.2      48.3

SD + MF (stage input scales 8^{-1}/16^{-1}/32^{-1}, window scale 1, hidden size 384, outputs fed to the head at 8^{-1}/16^{-1}/32^{-1}, 72.1 M params):
  #Layers (S1/S2/S3)   FLOPs (G)   mAP
  2/8/8                459.7       45.8
  4/7/7                540.9       47.5
  6/6/6                622.1       48.5
  8/5/5                703.3       48.0
  10/4/4               784.5       48.6
  12/3/3               865.7       50.2
  15/2/1               989.5       50.4

SD + 2× (stage input scales 8^{-1}/16^{-1}/32^{-1}, window scale 1, 16 layers in Stage 1 and 1 layer in Stage 2, output fed to the head from Stage 3 at 32^{-1}):
  Hidden Sizes (S1/S2/S3)   #Layers (S3)   Params. (M)   FLOPs (G)   mAP
  128/256/512               9              70.2          529.1       37.6
  160/320/640               5              69.3          581.7       38.9
  192/384/768               3              69.3          637.4       40.2
  224/448/896               2              71.4          696.6       41.7
  256/512/1024              1              69.2          756.5       42.5

MF + 2× (input scale 8^{-1}, 6 layers per stage, hidden sizes 152/304/608, outputs fed to the head at 8^{-1}/16^{-1}/32^{-1}, 73.8 M params):
  Window Scale   FLOPs (G)   mAP
  16^{-1}        566.3       45.7
  8^{-1}         569.5       46.4
  4^{-1}         595.6       48.1
  2^{-1}         700.1       49.0

SD + MF + 2× (stage input scales 8^{-1}/16^{-1}/32^{-1}, window scale 1, 1 layer in Stage 2, outputs fed to the head at 8^{-1}/16^{-1}/32^{-1}):
  #Layers (S1)   Hidden Sizes (S1/S2/S3)   #Layers (S3)   Params. (M)   FLOPs (G)   mAP
  16             128/256/512               9              73.3          552.1       44.3
  16             160/320/640               5              72.4          604.9       45.5
  16             192/384/768               3              72.4          660.7       47.6
  16             224/448/896               2              74.5          719.9       48.8
  16             256/512/1024              1              72.4          779.9       49.4
  28             224/448/896               1              72.1          992.3       49.5
Table 8: Model architectures in Figure 4 and Figure 5 (MS-COCO, main body). Configurations (depth, width) of UViT-T/S/B are annotated.

Input Size    Depth   Width          Params. (M)   FLOPs (G)   mAP
640 × 640     18      384            72.1          676.2       50.4
640 × 640     18      432            80.9          748.3       50.5
640 × 640     18      462            86.9          796.6       50.7
640 × 640     18      492            93.3          847.4       50.4
640 × 640     18      564            110.2         979.4       50.1
768 × 768     18      288            58.2          725.9       51.1
768 × 768     18      306            60.7          761.1       51.5
768 × 768     18      330            64.3          810.0       51.3
768 × 768     18      384            73.1          928.5       51.5
768 × 768     18      432            82.1          1043.5      51.6
768 × 768     18      462            88.2          1120.1      51.3
896 × 896     18      186            47.4          710.2       51.0
896 × 896     18      222 (UViT-T)   51.0          801.4       51.3
896 × 896     18      246            53.8          866.1       51.7
896 × 896     18      288 (UViT-S)   59.2          986.8       51.7
896 × 896     18      330            65.4          1117.1      52.1
896 × 896     18      384 (UViT-B)   74.4          1298.7      52.3
1024 × 1024   18      120            42.6          710.3       47.9
1024 × 1024   18      132            43.5          750.1       48.9
1024 × 1024   18      144            44.4          791.0       49.3
1024 × 1024   18      162            45.8          854.3       50.4
1024 × 1024   18      198            49.3          987.6       51.4
1024 × 1024   18      246            54.7          1179.7      51.7
1024 × 1024   18      288            60.3          1361.2      52.0
896 × 896     12      276            52.1          748.4       50.8
896 × 896     12      300            54.4          796.2       50.8
896 × 896     12      324            56.9          846.2       51.0
896 × 896     12      360            60.9          925.0       51.5
896 × 896     12      390            64.5          994.2       51.5
896 × 896     24      156            46.5          739.0       50.6
896 × 896     24      180            49.2          813.8       50.8
896 × 896     24      192            50.6          852.7       51.3
896 × 896     24      258            60.1          1085.4      51.8
896 × 896     24      294            66.3          1225.7      51.6
896 × 896     32      120            44.6          732.5       50.0
896 × 896     32      132            45.9          777.4       50.4
896 × 896     32      144            47.3          823.8       51.2
896 × 896     32      180            52.3          971.1       51.5
896 × 896     32      240            62.8          1244.4      52.0
896 × 896     40      96             43.2          723.2       48.5
896 × 896     40      102            43.8          749.3       49.1
896 × 896     40      114            45.2          802.9       50.1
896 × 896     40      126            46.8          858.2       50.7
896 × 896     40      150            50.3          974.0       51.2
896 × 896     40      156            51.2          1004.0      51.2