

A Survey on Vision Transformer


Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao,
Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Dacheng Tao, Fellow, IEEE

Abstract—Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the
self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to
computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of
networks such as convolutional and recurrent neural networks. Given its high performance and reduced need for vision-specific inductive
bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision
transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we
explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient
transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the
self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the
challenges and provide several further research directions for vision transformers.

Index Terms—Transformer, Self-attention, Computer Vision, High-level vision, Low-level vision, Video.

1 INTRODUCTION

DEEP neural networks (DNNs) have become the fundamental infrastructure in today's artificial intelligence (AI) systems. Different types of tasks have typically involved different types of networks. For example, the multi-layer perceptron (MLP) or fully connected (FC) network is the classical type of neural network, composed of multiple linear layers and nonlinear activations stacked together [1], [2]. Convolutional neural networks (CNNs) introduce convolutional layers and pooling layers for processing shift-invariant data such as images [3], [4]. Recurrent neural networks (RNNs) utilize recurrent cells to process sequential or time-series data [5], [6]. Transformer is a new type of neural network. It mainly utilizes the self-attention mechanism [7], [8] to extract intrinsic features [9] and shows great potential for extensive use in AI applications.

Transformer was first applied to natural language processing (NLP) tasks, where it achieved significant improvements [9], [10], [11]. For example, Vaswani et al. [9] first proposed transformer based on the attention mechanism for machine translation and English constituency parsing tasks. Devlin et al. [10] introduced a new language representation model called BERT (short for Bidirectional Encoder Representations from Transformers), which pre-trains a transformer on unlabeled text taking into account the context of each word, as it is bidirectional. When BERT was published, it obtained state-of-the-art performance on 11 NLP tasks. Brown et al. [11] pre-trained a massive transformer-based model called GPT-3 (short for Generative Pre-trained Transformer 3) on 45 TB of compressed plaintext data using 175 billion parameters. It achieved strong performance on different types of downstream natural language tasks without requiring any fine-tuning. These transformer-based models, with their strong representation capacity, have achieved significant breakthroughs in NLP.

Inspired by the major success of transformer architectures in the field of NLP, researchers have recently applied transformer to computer vision (CV) tasks. In vision applications, CNNs are considered the fundamental component [12], [13], but nowadays transformer is showing that it is a potential alternative to CNN. Chen et al. [14] trained a sequence transformer to auto-regressively predict pixels, achieving results comparable to CNNs on image classification tasks. Another vision transformer model is ViT, which applies a pure transformer directly to sequences of image patches to classify the full image. Recently proposed by Dosovitskiy et al. [15], it has achieved state-of-the-art performance on multiple image recognition benchmarks. In addition to image classification, transformer has been utilized to address a variety of other vision problems, including object detection [16], [17], semantic segmentation [18], image processing [19], and video understanding [20]. Thanks to its exceptional performance, more and more researchers are proposing transformer-based models for improving a wide range of visual tasks.

Due to the rapid increase in the number of transformer-based vision models, keeping pace with the rate of new progress is becoming increasingly difficult. As such, a survey of the existing works is urgent and would be beneficial for the community. In this paper, we focus on providing a comprehensive overview of the recent advances in vision transformers and discuss the potential directions for further improvement. To facilitate future research on different topics, we categorize the transformer models by their application scenarios, as listed in Table 1. The main categories include backbone network, high/mid-level vision, low-level vision, and video processing. High-level vision deals with the interpretation and use of what is seen in the image [21], whereas mid-level vision deals with how this information is organized into what we experience as objects and surfaces [22].

• Kai Han, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Yunhe Wang are with Huawei Noah's Ark Lab. E-mail: {kai.han, yunhe.wang}@huawei.com.
• Hanting Chen, Zhenhua Liu, Yehui Tang, and Zhaohui Yang are also with the School of EECS, Peking University.
• Dacheng Tao is with the School of Computer Science, in the Faculty of Engineering, at The University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia. E-mail: [email protected].
• Corresponding authors: Yunhe Wang and Dacheng Tao.
• All authors are listed in alphabetical order of last name (except the primary and corresponding authors).

[Figure 1 timeline: 2017.6, Transformer: solely based on the attention mechanism, it is proposed and shows great performance on NLP tasks; 2018.10, BERT: pre-trained transformer models begin to dominate the field of NLP; 2020.5, GPT-3: a huge transformer with 170B parameters takes a big step towards a general NLP model; 2020.5, DETR: a simple yet effective framework for high-level vision that views object detection as a direct set prediction problem; 2020.7, iGPT: the transformer model for NLP can also be used for image pre-training; 2020.10, ViT: pure transformer architectures work well for visual recognition; end of 2020, IPT/SETR/CLIP: transformers applied to low-level vision, segmentation and multimodality tasks, respectively; 2021, ViT variants: e.g., DeiT, PVT, TNT, and Swin.]
Fig. 1: Key milestones in the development of transformer. The vision transformer models are marked in red.
Given that the gap between high- and mid-level vision is becoming more obscure in DNN-based vision systems [23], [24], we treat them as a single category here. A few examples of transformer models that
address these high/mid-level vision tasks include DETR [16], de-
formable DETR [17] for object detection, and Max-DeepLab [25]
for segmentation. Low-level image processing mainly deals with
extracting descriptions from images (such descriptions are usually
represented as images themselves) [26]. Typical applications of
low-level image processing include super-resolution, image de-
noising, and style transfer. At present, only a few works [19], [27]
in low-level vision use transformers, creating the need for further
investigation. Another category is video processing, which is an
important part in both computer vision and image-based tasks. Due
to the sequential property of video, transformer is inherently well
suited for use on video tasks [20], [28], in which it is beginning
to perform on par with conventional CNNs and RNNs. Here, we
survey the works associated with transformer-based visual models
in order to track the progress in this field. Figure 1 shows the
development timeline of vision transformer — undoubtedly, there will be many more milestones in the future.

Fig. 2: Structure of the original transformer (image from [9]).
The rest of the paper is organized as follows. Section 2 discusses the formulation of the standard transformer and the self-attention mechanism. Section 3 is the main part of the paper, in which we summarize the vision transformer models on backbone, high/mid-level vision, low-level vision, and video tasks. We also briefly describe efficient transformer methods, as they are closely related to our main topic. In the final section, we give our conclusion and discuss several research directions and challenges. Due to the page limit, we describe the methods of transformer in NLP in the supplemental material, as the research experience may be beneficial for vision tasks. In the supplemental material, we also review the self-attention mechanism for CV as a supplement to the vision transformer models. In this survey, we mainly include the representative works (early, pioneering, novel, or inspiring works), since there are many preprinted works on arXiv and we cannot include them all in limited pages.

2 FORMULATION OF TRANSFORMER

Transformer [9] was first used in the field of natural language processing (NLP) on machine translation tasks. As shown in Figure 2, it consists of an encoder and a decoder with several transformer blocks of the same architecture. The encoder generates encodings of the inputs, while the decoder takes all the encodings and uses their incorporated contextual information to generate the output sequence. Each transformer block is composed of a multi-head attention layer, a feed-forward neural network, shortcut connection and layer normalization. In the following, we describe each component of the transformer in detail.

2.1 Self-Attention

In the self-attention layer, the input vector is first transformed into three different vectors: the query vector q, the key vector k and the value vector v with dimension d_q = d_k = d_v = d_model = 512. Vectors derived from different inputs are then packed together into three different matrices, namely, Q, K and V. Subsequently, the attention function between different input vectors is calculated as follows (and shown in Figure 3 left):

• Step 1: Compute scores between different input vectors with S = Q · K^T;
• Step 2: Normalize the scores for the stability of gradient with S_n = S / √d_k;
• Step 3: Translate the scores into probabilities with the softmax function P = softmax(S_n);
• Step 4: Obtain the weighted value matrix with Z = V · P.

The process can be unified into a single function:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V.   (1)

The logic behind Eq. 1 is simple. Step 1 computes scores between each pair of different vectors, and these scores determine the degree of attention that we give other words when encoding the word at the current position. Step 2 normalizes the scores to enhance gradient stability for improved training, and step 3 translates the scores into probabilities. Finally, each value vector is multiplied by the sum of the probabilities. Vectors with larger probabilities receive additional focus from the following layers.
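To make Eq. 1 concrete, a minimal PyTorch sketch of scaled dot-product attention is given below; the function name, the toy tensor shapes and the random projection matrices are our own illustrative choices rather than part of any specific model.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Eq. 1: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: tensors of shape (n, d_k) packing the query, key and value
    vectors of the n input tokens row by row.
    """
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # Steps 1-2: S_n
    probs = torch.softmax(scores, dim=-1)              # Step 3: P
    return probs @ V                                   # Step 4: weighted values

# toy usage: 4 tokens with d_model = d_k = 512
x = torch.randn(4, 512)
W_q, W_k, W_v = (torch.randn(512, 512) for _ in range(3))
Z = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)  # (4, 512)
```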

TABLE 1: Representative works of vision transformers.

Category | Sub-category | Method | Highlights | Publication
Backbone | Supervised pretraining | ViT [15] | Image patches, standard transformer | ICLR 2021
Backbone | Supervised pretraining | TNT [29] | Transformer in transformer, local attention | NeurIPS 2021
Backbone | Supervised pretraining | Swin [30] | Shifted window, window-based self-attention | ICCV 2021
Backbone | Self-supervised pretraining | iGPT [14] | Pixel prediction self-supervised learning, GPT model | ICML 2020
Backbone | Self-supervised pretraining | MoCo v3 [31] | Contrastive self-supervised learning, ViT | ICCV 2021
High/Mid-level vision | Object detection | DETR [16] | Set-based prediction, bipartite matching, transformer | ECCV 2020
High/Mid-level vision | Object detection | Deformable DETR [17] | DETR, deformable attention module | ICLR 2021
High/Mid-level vision | Object detection | UP-DETR [32] | Unsupervised pre-training, random query patch detection | CVPR 2021
High/Mid-level vision | Segmentation | Max-DeepLab [25] | PQ-style bipartite matching, dual-path transformer | CVPR 2021
High/Mid-level vision | Segmentation | VisTR [33] | Instance sequence matching and segmentation | CVPR 2021
High/Mid-level vision | Segmentation | SETR [18] | Sequence-to-sequence prediction, standard transformer | CVPR 2021
High/Mid-level vision | Pose Estimation | Hand-Transformer [34] | Non-autoregressive transformer, 3D point set | ECCV 2020
High/Mid-level vision | Pose Estimation | HOT-Net [35] | Structured-reference extractor | MM 2020
High/Mid-level vision | Pose Estimation | METRO [36] | Progressive dimensionality reduction | CVPR 2021
Low-level vision | Image generation | Image Transformer [27] | Pixel generation using transformer | ICML 2018
Low-level vision | Image generation | Taming transformer [37] | VQ-GAN, auto-regressive transformer | CVPR 2021
Low-level vision | Image generation | TransGAN [38] | GAN using pure transformer architecture | NeurIPS 2021
Low-level vision | Image enhancement | IPT [19] | Multi-task, ImageNet pre-training, transformer model | CVPR 2021
Low-level vision | Image enhancement | TTSR [39] | Texture transformer, RefSR | CVPR 2020
Video processing | Video inpainting | STTN [28] | Spatial-temporal adversarial loss | ECCV 2020
Video processing | Video captioning | Masked Transformer [20] | Masking network, event proposal | CVPR 2018
Multimodality | Classification | CLIP [40] | NLP supervision for images, zero-shot transfer | arXiv 2021
Multimodality | Image generation | DALL-E [41] | Zero-shot text-to-image generation | ICML 2021
Multimodality | Image generation | Cogview [42] | VQ-VAE, Chinese input | NeurIPS 2021
Multimodality | Multi-task | UniT [43] | Different NLP & CV tasks, shared model parameters | ICCV 2021
Efficient transformer | Decomposition | ASH [44] | Number of heads, importance estimation | NeurIPS 2019
Efficient transformer | Distillation | TinyBert [45] | Various losses for different modules | EMNLP Findings 2020
Efficient transformer | Quantization | FullyQT [46] | Fully quantized transformer | EMNLP Findings 2020
Efficient transformer | Architecture design | ConvBert [47] | Local dependence, dynamic convolution | NeurIPS 2020

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module with the following exceptions: the key matrix K and value matrix V are derived from the encoder module, and the query matrix Q is derived from the previous layer.

Note that the preceding process is invariant to the position of each word, meaning that the self-attention layer lacks the ability to capture the positional information of words in a sentence. However, the sequential nature of sentences in a language requires us to incorporate the positional information within our encoding. To address this issue and allow the final input vector of the word to be obtained, a positional encoding with dimension d_model is added to the original input embedding. Specifically, the position is encoded with the following equations:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model));   (2)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),   (3)

in which pos denotes the position of the word in a sentence, and i represents the current dimension of the positional encoding. In this way, each element of the positional encoding corresponds to a sinusoid, and it allows the transformer model to learn to attend by relative positions and to extrapolate to longer sequence lengths during inference. Apart from the fixed positional encoding in the vanilla transformer, learned positional encoding [48] and relative positional encoding [49] are also utilized in various models [10], [15].
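The fixed encoding of Eqs. 2 and 3 can be pre-computed as a lookup table; the short sketch below does so in PyTorch, where the function name and the default d_model = 512 are our own illustrative choices.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    """Return a (max_len, d_model) table with
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                   # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# the encoding is simply added to the input embeddings
embeddings = torch.randn(10, 512) + sinusoidal_positional_encoding(10)
```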
Multi-Head Attention. Multi-head attention is a mechanism that can be used to boost the performance of the vanilla self-attention layer. Note that for a given reference word, we often want to focus on several other words when going through the sentence. A single-head self-attention layer limits our ability to focus on one or more specific positions without influencing the attention on other equally important positions at the same time. This is achieved by giving the attention layers different representation subspaces. Specifically, different query, key and value matrices are used for different heads, and these matrices can project the input vectors into different representation subspaces after training, due to their random initialization.

Fig. 3: (Left) Self-attention process. (Right) Multi-head attention. The image is from [9].

To elaborate on this in greater detail, given an input vector and the number of heads h, the input vector is first transformed into three different groups of vectors: the query group, the key group and the value group. In each group, there are h vectors with dimension d_q' = d_k' = d_v' = d_model/h = 64. The vectors derived from different inputs are then packed together into three different groups of matrices: {Q_i}_{i=1..h}, {K_i}_{i=1..h} and {V_i}_{i=1..h}. The multi-head attention process is shown as follows:

MultiHead(Q', K', V') = Concat(head_1, ..., head_h) W^o,
where head_i = Attention(Q_i, K_i, V_i).   (4)

Here, Q' (and similarly K' and V') is the concatenation of {Q_i}_{i=1..h}, and W^o ∈ R^(d_model × d_model) is the projection weight.
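A compact sketch of Eq. 4 follows. Splitting d_model into h heads by reshaping is the usual implementation trick; the module name and the toy usage are our own, not taken from any particular codebase.

```python
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    """Multi-head self-attention as in Eq. 4 (h = 8 heads, d_model = 512)."""

    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h              # d_head = 64
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)              # projection W^o

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (n, d_model)
        n = x.shape[0]
        # project the inputs and split them into h heads of dimension d_head
        q, k, v = (w(x).view(n, self.h, self.d_head).transpose(0, 1)
                   for w in (self.w_q, self.w_k, self.w_v))  # each (h, n, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v            # (h, n, d_head)
        concat = heads.transpose(0, 1).reshape(n, -1)        # Concat(head_1..head_h)
        return self.w_o(concat)

out = MultiHeadAttention()(torch.randn(4, 512))              # (4, 512)
```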
2.2 Other Key Concepts in Transformer

Feed-Forward Network. A feed-forward network (FFN) is applied after the self-attention layers in each encoder and decoder. It

consists of two linear transformation layers with a nonlinear activation function between them, and can be denoted as the following function:

FFN(X) = W_2 σ(W_1 X),   (5)

where W_1 and W_2 are the two parameter matrices of the two linear transformation layers, and σ represents the nonlinear activation function, such as GELU [50]. The dimensionality of the hidden layer is d_h = 2048.

Residual Connection in the Encoder and Decoder. As shown in Figure 2, a residual connection is added to each sub-layer in the encoder and decoder. This strengthens the flow of information in order to achieve higher performance. A layer normalization [51] follows the residual connection. The output of these operations can be described as:

LayerNorm(X + Attention(X)).   (6)

Here, X is used as the input of the self-attention layer, and the query, key and value matrices Q, K and V are all derived from the same input matrix X. A variant, pre-layer normalization (Pre-LN) [52], [53], [15], is also widely used. Pre-LN inserts the layer normalization inside the residual connection, before the multi-head attention or FFN. For the normalization layer, there are several alternatives such as batch normalization [54]. Batch normalization usually performs worse when applied to transformer, as the feature values change acutely [55]. Some other normalization algorithms [56], [55], [57] have been proposed to improve the training of transformer.
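The sketch below combines Eqs. 5 and 6 into a single encoder block, with a flag that switches between the Post-LN form written in Eq. 6 and the Pre-LN variant described above. The attention module is passed in as an argument (any module with the same interface would do), and the class name and layer sizes are our own illustrative choices.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: self-attention and FFN (Eq. 5),
    each wrapped with a residual connection and LayerNorm (Eq. 6)."""

    def __init__(self, attention: nn.Module, d_model: int = 512,
                 d_hidden: int = 2048, pre_ln: bool = False):
        super().__init__()
        self.attention = attention
        self.ffn = nn.Sequential(                # FFN(X) = W2 * GELU(W1 X)
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.pre_ln = pre_ln

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_ln:                           # Pre-LN: normalize before each sub-layer
            x = x + self.attention(self.norm1(x))
            return x + self.ffn(self.norm2(x))
        x = self.norm1(x + self.attention(x))     # Post-LN, as written in Eq. 6
        return self.norm2(x + self.ffn(x))

# shape check with a placeholder attention module
y = EncoderBlock(nn.Identity())(torch.randn(4, 512))   # (4, 512)
```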
Final Layer in the Decoder. The final layer in the decoder is used to turn the stack of vectors back into a word. This is achieved by a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with d_word dimensions, in which d_word is the number of words in the vocabulary. The softmax layer is then used to transform the logits vector into probabilities.

When used for CV tasks, most transformers adopt the original transformer's encoder module. Such transformers can be treated as a new type of feature extractor. Compared with CNNs, which focus only on local characteristics, transformer can capture long-distance characteristics, meaning that it can easily derive global information. In contrast to RNNs, whose hidden state must be computed sequentially, transformer is more efficient because the output of the self-attention layer and the fully connected layers can be computed in parallel and easily accelerated. From this, we can conclude that further study into using transformer in computer vision as well as NLP would yield beneficial results.

3 VISION TRANSFORMER

In this section, we review the applications of transformer-based models in computer vision, including image classification, high/mid-level vision, low-level vision and video processing. We also briefly summarize the applications of the self-attention mechanism and model compression methods for efficient transformer.

[Figure 4: a taxonomy tree rooted at "Backbone for Representation Learning" with Convolution and Attention branches: CNN (AlexNet/ResNet/DenseNet), CNN + Transformer (BoTNet/CeiT), Transformer (ViT/PVT/TNT/Swin), and Non-Local (SENet/NLNet/GCNet).]
Fig. 4: A taxonomy of backbone using convolution and attention.

3.1 Backbone for Representation Learning

Inspired by the success that transformer has achieved in the field of NLP, some researchers have explored whether similar models can learn useful representations for images. Given that images involve more dimensions, noise and redundant modality compared to text, they are believed to be more difficult for generative modeling.

Other than CNNs, the transformer can be used as a backbone network for image classification. Wu et al. [58] adopted ResNet as a convenient baseline and used vision transformers to replace the last stage of convolutions. Specifically, they apply convolutional layers to extract low-level features that are then fed into the vision transformer. For the vision transformer, they use a tokenizer to group pixels into a small number of visual tokens, each representing a semantic concept in the image. These visual tokens are used directly for image classification, with the transformers being used to model the relationships between tokens. As shown in Figure 4, the works can be divided into purely using transformer for vision and combining CNN and transformer. We summarize the results of these models in Table 2 and Figure 6 to demonstrate the development of the backbones. In addition to supervised learning, self-supervised learning is also explored in vision transformer.

3.1.1 Pure Transformer

ViT. Vision Transformer (ViT) [15] is a pure transformer applied directly to sequences of image patches for the image classification task. It follows the transformer's original design as much as possible. Figure 5 shows the framework of ViT.

To handle 2D images, the image X ∈ R^(h×w×c) is reshaped into a sequence of flattened 2D patches X_p ∈ R^(n×(p²·c)), where c is the number of channels, (h, w) is the resolution of the original image, and (p, p) is the resolution of each image patch. The effective sequence length for the transformer is therefore n = hw/p². Because the transformer uses constant widths in all of its layers, a trainable linear projection maps each vectorized patch to the model dimension d, the output of which is referred to as the patch embeddings.
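The patch-embedding step just described can be written in a few lines; the sketch below follows the description in the text (p × p patches, a linear projection to dimension d, a [class] token and 1D position embeddings), while the class name, default sizes and the unfold-based reshaping are our own illustrative choices.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Split an image into p x p patches, linearly project each one,
    then prepend a learnable [class] token and add 1D position embeddings."""

    def __init__(self, h: int = 224, w: int = 224, c: int = 3,
                 p: int = 16, d: int = 768):
        super().__init__()
        self.p = p
        n = (h // p) * (w // p)                        # n = hw / p^2 patches
        self.proj = nn.Linear(p * p * c, d)            # trainable linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (b, c, h, w)
        b, c, h, w = x.shape
        p = self.p
        # (b, c, h, w) -> (b, n, c*p*p): flatten every p x p patch
        patches = (x.unfold(2, p, p).unfold(3, p, p)       # (b, c, h/p, w/p, p, p)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(b, -1, c * p * p))
        tokens = self.proj(patches)                        # patch embeddings (b, n, d)
        cls = self.cls_token.expand(b, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))     # (2, 197, 768)
```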
Similar to BERT's [class] token, a learnable embedding is applied to the sequence of embedded patches, and the state of this embedding serves as the image representation. During both the pre-training and fine-tuning stages, a classification head is attached to this embedding. In addition, 1D position embeddings are added to the patch embeddings in order to retain positional information. It is worth noting that ViT utilizes only the standard transformer's encoder (except for the place of the layer normalization), whose output precedes an MLP head. In most cases, ViT is pre-trained on large datasets and then fine-tuned for downstream tasks with smaller data.

ViT yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies a few percentage points below ResNets of comparable size. Because transformers lack some inductive biases inherent to CNNs, such as translation equivariance and locality, they do not generalize well when trained on insufficient amounts of data. However, the authors found that training the models on large datasets (14 million to 300 million images) surpasses the inductive bias. When pre-trained at sufficient scale, transformers achieve excellent results on tasks with fewer datapoints. For example, when pre-trained on the JFT-300M dataset, ViT approached or even exceeded state-of-the-art performance on multiple image recognition benchmarks.

Specifically, it reached an accuracy of 88.36% on ImageNet and 77.16% on the VTAB suite of 19 tasks.

Touvron et al. [59] proposed a competitive convolution-free transformer, called Data-efficient image transformer (DeiT), trained only on the ImageNet database. DeiT-B, the reference vision transformer, has the same architecture as ViT-B and employs 86 million parameters. With strong data augmentation, DeiT-B achieves a top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. In addition, the authors observe that using a CNN teacher gives better performance than using a transformer. Specifically, DeiT-B can achieve a top-1 accuracy of 84.40% with the help of token-based distillation.

Fig. 5: The framework of ViT (image from [15]).

Variants of ViT. Following the paradigm of ViT, a series of ViT variants have been proposed to improve the performance on vision tasks. The main approaches include enhancing locality, improving self-attention, and architecture design.

The original vision transformer is good at capturing long-range dependencies between patches but disregards local feature extraction, as each 2D patch is projected to a vector with a simple linear layer. Recently, researchers have begun to pay attention to improving the modeling capacity for local information [29], [60], [61]. TNT [29] further divides the patch into a number of sub-patches and introduces a novel transformer-in-transformer architecture which utilizes an inner transformer block to model the relationship between sub-patches and an outer transformer block for patch-level information exchange. Twins [62] and CAT [63] alternately perform local and global attention layer by layer. Swin Transformers [60], [64] perform local attention within a window and introduce a shifted window partitioning approach for cross-window connections. Shuffle Transformer [65], [66] further utilizes a spatial shuffle operation instead of shifted window partitioning to allow cross-window connections. RegionViT [61] generates regional tokens and local tokens from an image, and the local tokens receive global information via attention with the regional tokens. In addition to local attention, some other works propose to boost local information through local feature aggregation, e.g., T2T [67]. These works demonstrate the benefit of combining local information exchange and global information exchange in vision transformer.
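To illustrate the window-based local attention idea in this family of models, the sketch below partitions a feature map into non-overlapping windows, runs attention inside each window, and optionally shifts the windows with torch.roll. This is a simplified illustration in the spirit of Swin Transformer [60], not its full implementation: there are no query/key/value projections, no relative position bias and no attention masking for the shifted windows.

```python
import torch

def window_attention(x: torch.Tensor, window: int = 7, shift: int = 0) -> torch.Tensor:
    """x: (h, w, d) feature map with h and w divisible by `window`.
    Self-attention is computed independently inside each window x window patch."""
    h, w, d = x.shape
    if shift:                                    # shifted-window variant
        x = torch.roll(x, shifts=(-shift, -shift), dims=(0, 1))
    # partition into (num_windows, window*window, d)
    wins = (x.view(h // window, window, w // window, window, d)
              .permute(0, 2, 1, 3, 4)
              .reshape(-1, window * window, d))
    attn = torch.softmax(wins @ wins.transpose(-2, -1) / d ** 0.5, dim=-1)
    wins = attn @ wins                            # attention restricted to each window
    # reverse the partition (and the shift)
    x = (wins.view(h // window, w // window, window, window, d)
             .permute(0, 2, 1, 3, 4)
             .reshape(h, w, d))
    if shift:
        x = torch.roll(x, shifts=(shift, shift), dims=(0, 1))
    return x

y = window_attention(torch.randn(56, 56, 96), window=7, shift=3)
```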
As a key component of transformer, the self-attention layer provides the ability for global interaction between image patches. Improving the calculation of the self-attention layer has attracted many researchers. DeepViT [68] proposes to establish cross-head communication to re-generate the attention maps and increase their diversity at different layers. KVT [69] introduces k-NN attention to utilize the locality of image patches and ignore noisy tokens by computing attention only with the top-k similar tokens. Refiner [70] explores attention expansion in a higher-dimensional space and applies convolution to augment local patterns of the attention maps. XCiT [71] performs the self-attention calculation across feature channels rather than tokens, which allows efficient processing of high-resolution images. The computation complexity and attention precision of the self-attention mechanism are two key points for future optimization.

The network architecture is an important factor, as demonstrated in the field of CNNs. The original architecture of ViT is a simple stack of same-shape transformer blocks. New architecture design for vision transformer has been an interesting topic. The pyramid-like architecture is utilized by many vision transformer models [72], [60], [73], [74], [75], [76], including PVT [72], HVT [77], Swin Transformer [60] and PiT [78]. There are also other types of architectures, such as the two-stream architecture [79] and the U-net architecture [80], [30]. Neural architecture search (NAS) has also been investigated to search for better transformer architectures, e.g., Scaling-ViT [81], ViTAS [82], AutoFormer [83] and GLiT [84]. Currently, both network design and NAS for vision transformer mainly draw on the experience of CNNs. In the future, we expect specific and novel architectures to appear in the field of vision transformer.

In addition to the aforementioned approaches, there are some other directions to further improve vision transformer, e.g., positional encoding [85], [86], normalization strategy [87], shortcut connection [88] and removing attention [89], [90], [91], [92].

3.1.2 Transformer with Convolution

Although vision transformers have been successfully applied to various visual tasks due to their ability to capture long-range dependencies within the input, there are still gaps in performance between transformers and existing CNNs. One main reason can be the lack of ability to extract local information. Apart from the above-mentioned variants of ViT that enhance locality, combining the transformer with convolution can be a more straightforward way to introduce locality into the conventional transformer.

There are plenty of works trying to augment a conventional transformer block or self-attention layer with convolution. For example, CPVT [85] proposed a conditional positional encoding (CPE) scheme, which is conditioned on the local neighborhood of input tokens and adaptable to arbitrary input sizes, to leverage convolutions for fine-level feature encoding. CvT [96], CeiT [97], LocalViT [98] and CMT [94] analyzed the potential drawbacks of directly borrowing Transformer architectures from NLP and combined convolutions with transformers. Specifically, the feed-forward network (FFN) in each transformer block is combined with a convolutional layer that promotes the correlation among neighboring tokens. LeViT [99] revisited principles from the extensive literature on CNNs and applied them to transformers, proposing a hybrid neural network for fast-inference image classification. BoTNet [100] replaced the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and improved upon the baselines significantly on both instance segmentation and object detection tasks with minimal overhead in latency.

Besides, some researchers have demonstrated that transformer-based models can have more difficulty fitting data well [15], [101], [102]; in other words, they are sensitive

to the choice of optimizer, hyper-parameters, and the training schedule. Visformer [101] revealed the gap between transformers and CNNs with two different training settings. The first one is the standard setting for CNNs, i.e., the training schedule is shorter and the data augmentation only contains random cropping and horizontal flipping. The other one is the training setting used in [59], i.e., the training schedule is longer and the data augmentation is stronger. [102] changed the early visual processing of ViT by replacing its embedding stem with a standard convolutional stem, and found that this change allows ViT to converge faster and enables the use of either AdamW or SGD without a significant drop in accuracy. In addition to these two works, [99], [94] also choose to add a convolutional stem on top of the transformer.

TABLE 2: ImageNet result comparison of representative CNN and vision transformer models. Pure transformer means only using a few convolutions in the stem stage. CNN + Transformer means using convolutions in the intermediate layers. Following [59], [60], the throughput is measured on an NVIDIA V100 GPU with PyTorch and 224×224 input size.

Model | Params (M) | FLOPs (B) | Throughput (image/s) | Top-1 (%)
CNN
ResNet-50 [12], [67] | 25.6 | 4.1 | 1226 | 79.1
ResNet-101 [12], [67] | 44.7 | 7.9 | 753 | 79.9
ResNet-152 [12], [67] | 60.2 | 11.5 | 526 | 80.8
EfficientNet-B0 [93] | 5.3 | 0.39 | 2694 | 77.1
EfficientNet-B1 [93] | 7.8 | 0.70 | 1662 | 79.1
EfficientNet-B2 [93] | 9.2 | 1.0 | 1255 | 80.1
EfficientNet-B3 [93] | 12 | 1.8 | 732 | 81.6
EfficientNet-B4 [93] | 19 | 4.2 | 349 | 82.9
Pure Transformer
DeiT-Ti [15], [59] | 5 | 1.3 | 2536 | 72.2
DeiT-S [15], [59] | 22 | 4.6 | 940 | 79.8
DeiT-B [15], [59] | 86 | 17.6 | 292 | 81.8
T2T-ViT-14 [67] | 21.5 | 5.2 | 764 | 81.5
T2T-ViT-19 [67] | 39.2 | 8.9 | 464 | 81.9
T2T-ViT-24 [67] | 64.1 | 14.1 | 312 | 82.3
PVT-Small [72] | 24.5 | 3.8 | 820 | 79.8
PVT-Medium [72] | 44.2 | 6.7 | 526 | 81.2
PVT-Large [72] | 61.4 | 9.8 | 367 | 81.7
TNT-S [29] | 23.8 | 5.2 | 428 | 81.5
TNT-B [29] | 65.6 | 14.1 | 246 | 82.9
CPVT-S [85] | 23 | 4.6 | 930 | 80.5
CPVT-B [85] | 88 | 17.6 | 285 | 82.3
Swin-T [60] | 29 | 4.5 | 755 | 81.3
Swin-S [60] | 50 | 8.7 | 437 | 83.0
Swin-B [60] | 88 | 15.4 | 278 | 83.3
CNN + Transformer
Twins-SVT-S [62] | 24 | 2.9 | 1059 | 81.7
Twins-SVT-B [62] | 56 | 8.6 | 469 | 83.2
Twins-SVT-L [62] | 99.2 | 15.1 | 288 | 83.7
Shuffle-T [65] | 29 | 4.6 | 791 | 82.5
Shuffle-S [65] | 50 | 8.9 | 450 | 83.5
Shuffle-B [65] | 88 | 15.6 | 279 | 84.0
CMT-S [94] | 25.1 | 4.0 | 563 | 83.5
CMT-B [94] | 45.7 | 9.3 | 285 | 84.5
VOLO-D1 [95] | 27 | 6.8 | 481 | 84.2
VOLO-D2 [95] | 59 | 14.1 | 244 | 85.2
VOLO-D3 [95] | 86 | 20.6 | 168 | 85.4
VOLO-D4 [95] | 193 | 43.8 | 100 | 85.7
VOLO-D5 [95] | 296 | 69.0 | 64 | 86.1

[Figure 6 shows two scatter plots comparing ResNet, EfficientNet, DeiT, PVT, T2T, Swin, CMT and VOLO: (a) top-1 accuracy (%) versus FLOPs (B); (b) top-1 accuracy (%) versus throughput (image/s).]
Fig. 6: FLOPs and throughput comparison of representative CNN and vision transformer models.

3.1.3 Self-supervised Representation Learning

Generative Based Approach. Generative pre-training methods for images have existed for a long time [103], [104], [105], [106]. Chen et al. [14] re-examined this class of methods and combined it with self-supervised methods. After that, several works [107], [108] were proposed to extend generative-based self-supervised learning to vision transformer.

We briefly introduce iGPT [14] to demonstrate its mechanism. This approach consists of a pre-training stage followed by a fine-tuning stage. During the pre-training stage, auto-regressive and BERT objectives are explored. To implement pixel prediction, a sequence transformer architecture is adopted instead of language tokens (as used in NLP). Pre-training can be thought of as a favorable initialization or regularizer when used in combination with early stopping. During the fine-tuning stage, they add a small classification head to the model. This helps optimize a classification objective and adapts all weights.

The image pixels are transformed into sequential data by k-means clustering. Given an unlabeled dataset X consisting of high-dimensional data x = (x_1, ..., x_n), they train the model by minimizing the negative log-likelihood of the data:

L_AR = E_{x∼X} [−log p(x)],   (7)

where p(x) is the probability density of the image data, which can be modeled as:

p(x) = Π_{i=1}^{n} p(x_{π_i} | x_{π_1}, ..., x_{π_{i−1}}, θ).   (8)

Here, the identity permutation π_i = i is adopted for 1 ≤ i ≤ n, which is also known as raster order. Chen et al. also considered the BERT objective, which samples a sub-sequence M ⊂ [1, n] such that each index i independently has probability 0.15 of appearing in M. M is called the BERT mask, and the model is trained by minimizing the negative log-likelihood of the "masked" elements x_M conditioned on the "unmasked" ones x_{[1,n]\M}:

L_BERT = E_{x∼X} E_M [ Σ_{i∈M} −log p(x_i | x_{[1,n]\M}) ].   (9)

During the pre-training stage, they pick either L_AR or L_BERT and minimize the loss over the pre-training dataset.
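Both objectives reduce to cross-entropy losses over a sequence of discrete pixel tokens. The sketch below assumes a hypothetical `model` that maps a (b, n) batch of token ids to (b, n, vocab) logits (with causal masking handled internally for the AR case); the function names and the mask token id are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def igpt_ar_loss(model, x):
    """Eq. 7/8: auto-regressive NLL in raster order.
    x: (b, n) discrete pixel tokens; token i is predicted from tokens < i."""
    logits = model(x[:, :-1])                       # (b, n-1, vocab), assumed shape
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           x[:, 1:].reshape(-1))

def igpt_bert_loss(model, x, mask_prob=0.15, mask_id=0):
    """Eq. 9: predict the masked tokens given the unmasked ones.
    Each index lands in the BERT mask M independently with probability 0.15."""
    M = torch.rand_like(x, dtype=torch.float) < mask_prob   # (b, n) bool mask
    corrupted = x.masked_fill(M, mask_id)
    logits = model(corrupted)                       # (b, n, vocab), assumed shape
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           x.reshape(-1), reduction="none").view(x.shape)
    return loss[M].mean()                           # logits at unmasked positions ignored
```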
The GPT-2 [109] formulation of the transformer decoder block is used. To ensure proper conditioning when training the AR objective, Chen et al. apply the standard upper-triangular mask to the n × n matrix of attention logits. No attention logit masking is required when the BERT objective is used: Chen et al. zero out the positions after the content embeddings are applied to the input sequence. Following the final transformer layer, they apply a layer norm and learn a projection from the output to logits parameterizing the conditional distributions at each sequence element. When training BERT, they simply ignore the logits at unmasked positions.

During the fine-tuning stage, they average-pool the output of the final layer normalization layer across the sequence dimension to extract a d-dimensional vector of features per example. They learn a projection from the pooled features to class logits and use this projection to minimize a cross-entropy loss. Practical applications offer empirical evidence that the joint objective of cross-entropy loss and pretraining loss (L_AR or L_BERT) works even better.

iGPT and ViT are two pioneering works applying transformer to visual tasks. The difference between iGPT and ViT-like models mainly lies in three aspects: 1) the input of iGPT is a sequence of color palettes obtained by clustering pixels, while ViT uniformly divides the image into a number of local patches; 2) the architecture of iGPT is an encoder-decoder framework, while ViT only has a transformer encoder; 3) iGPT utilizes an auto-regressive self-supervised loss for training, while ViT is trained by a supervised image classification task.

Contrastive Learning Based Approach. Currently, contrastive learning is the most popular manner of self-supervised learning for computer vision. Contrastive learning has been applied on vision transformer for unsupervised pretraining [31], [110], [111].

Chen et al. [31] investigate the effects of several fundamental components for training self-supervised ViT. The authors observe that instability is a major issue that degrades accuracy; these results are indeed partial failures, and they can be improved when training is made more stable.

They introduce a "MoCo v3" framework, which is an incremental improvement of MoCo [112]. Specifically, the authors take two crops of each image under random data augmentation. They are encoded by two encoders, f_q and f_k, with output vectors q and k. Intuitively, q behaves like a "query" and the goal of learning is to retrieve the corresponding "key". This is formulated as minimizing a contrastive loss function, which can be written as:

L_q = −log [ exp(q · k⁺/τ) / ( exp(q · k⁺/τ) + Σ_{k⁻} exp(q · k⁻/τ) ) ].   (10)

Here k⁺ is f_k's output on the same image as q, known as q's positive sample. The set {k⁻} consists of f_k's outputs from other images, known as q's negative samples. τ is a temperature hyper-parameter for the l2-normalized q, k. MoCo v3 uses the keys that naturally co-exist in the same batch and abandons the memory queue, which they find has diminishing gain if the batch is sufficiently large (e.g., 4096). With this simplification, the contrastive loss can be implemented in a simple way. The encoder f_q consists of a backbone (e.g., ViT), a projection head and an extra prediction head, while the encoder f_k has the backbone and projection head but not the prediction head. f_k is updated by the moving average of f_q, excluding the prediction head.
patches Encoders Head box
of fq , excluding the prediction head. Positional
MoCo v3 shows that the instability is a major issue of training Encoding

the self-supervised ViT, thus they describe a simple trick that can (b) Transformer-based backbone for detection
improve the stability in various cases of the experiments. They Fig. 7: General framework of transformer-based object detection.
observe that it is not necessary to train the patch projection layer.
Transformer-based Set Prediction for Detection. As a pioneer
For the standard ViT patch size, the patch projection matrix is
for transformer-based detection method, the detection transformer
complete or over-complete. And in this case, random projection
(DETR) proposed by Carion et al. [16] redesigns the framework
should be sufficient to preserve the information of the original
of object detection. DETR, a simple and fully end-to-end object
patches. However, the trick alleviates the issue, but does not solve
detector, treats the object detection task as an intuitive set pre-
it. The model can still be unstable if the learning rate is too big and
diction problem, eliminating traditional hand-crafted components
the first layer is unlikely the essential reason for the instability.
such as anchor generation and non-maximum suppression (NMS)
post-processing. As shown in Fig. 8, DETR starts with a CNN
3.1.4 Discussions backbone to extract features from the input image. To supplement
All of the components of vision transformer including multi- the image features with position information, fixed positional en-
head self-attention, multi-layer perceptron, shortcut connection, codings are added to the flattened features before the features are
layer normalization, positional encoding and network topology, fed into the encoder-decoder transformer. The decoder consumes
play key roles in visual recognition. As stated above, a number the embeddings from the encoder along with N learned positional
of works have been proposed to improve the effectiveness and encodings (object queries), and produces N output embeddings.
efficiency of vision transformer. From the results in Figure 6, Here N is a predefined parameter and typically larger than the
we can see that combining CNN and transformer achieve the number of objects in an image. Simple feed-forward networks
better performance, indicating their complementation to each other (FFNs) are used to compute the final predictions, which include

the bounding box coordinates and class labels indicating the specific class of object (or indicating that no object exists). Unlike the original transformer, which computes predictions sequentially, DETR decodes the N objects in parallel. DETR employs a bipartite matching algorithm to assign the predicted and ground-truth objects. As shown in Eq. 11, the Hungarian loss is exploited to compute the loss function for all matched pairs of objects:

L_Hungarian(y, ŷ) = Σ_{i=1}^{N} [ −log p̂_{σ̂(i)}(c_i) + 1_{c_i≠∅} L_box(b_i, b̂_{σ̂(i)}) ],   (11)

where σ̂ is the optimal assignment, c_i and p̂_{σ̂(i)}(c_i) are the target class label and predicted label, respectively, b_i and b̂_{σ̂(i)} are the ground-truth and predicted bounding boxes, and y = {(c_i, b_i)} and ŷ are the ground truth and prediction of objects, respectively. DETR shows impressive performance on object detection, delivering comparable accuracy and speed to the popular and well-established Faster R-CNN [13] baseline on the COCO benchmark.
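The optimal assignment σ̂ behind Eq. 11 is typically computed with the Hungarian algorithm. The sketch below builds a toy cost matrix from class probabilities and L1 box distances and solves it with scipy's linear_sum_assignment; the function name and cost weights are our own simplification of DETR's full matching cost, which also includes a generalized IoU term.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one assignment between N predictions and M ground-truth objects.
    pred_logits: (N, num_classes), pred_boxes: (N, 4),
    gt_labels: (M,) long, gt_boxes: (M, 4)."""
    prob = pred_logits.softmax(-1)                      # (N, num_classes)
    cost_class = -prob[:, gt_labels]                    # (N, M): -p_hat(c_i)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M): L1 box cost
    cost = (cost_class + cost_box).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # optimal assignment sigma_hat
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

# toy example: 100 object queries matched against 3 ground-truth objects
matches = match_predictions(torch.randn(100, 92), torch.rand(100, 4),
                            torch.tensor([1, 7, 23]), torch.rand(3, 4))
```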
[Figure 8: the DETR pipeline: a CNN backbone produces a set of image features, which (with positional encoding) pass through a transformer encoder and a transformer decoder driven by object queries; shared FFN prediction heads output a class and box for each query, or "no object".]
Fig. 8: The overall architecture of DETR (image from [16]).

DETR is a new design for the object detection framework based on transformer and empowers the community to develop fully end-to-end detectors. However, the vanilla DETR poses several challenges, specifically a long training schedule and poor performance for small objects. To address these challenges, Zhu et al. [17] proposed Deformable DETR, which has become a popular method that significantly improves the detection performance. The deformable attention module attends to a small set of key positions around a reference point rather than looking at all spatial locations on the image feature maps, as performed by the original multi-head attention mechanism in transformer. This approach significantly reduces the computational complexity and brings benefits in terms of fast convergence. More importantly, the deformable attention module can be easily applied for fusing multi-scale features. Deformable DETR achieves better performance than DETR with 10× less training cost and 1.6× faster inference speed. By using an iterative bounding box refinement method and a two-stage scheme, Deformable DETR can further improve the detection performance.

There are also several methods to deal with the slow convergence problem of the original DETR. For example, Sun et al. [120] investigated why the DETR model has slow convergence and discovered that this is mainly due to the cross-attention module in the transformer decoder. To address this issue, an encoder-only version of DETR is proposed, achieving considerable improvement in terms of detection accuracy and training convergence. In addition, a new bipartite matching scheme is designed for greater training stability and faster convergence, and two transformer-based set prediction models, i.e., TSP-FCOS and TSP-RCNN, are proposed to improve encoder-only DETR with feature pyramids. These new models achieve better performance compared with the original DETR model. Gao et al. [123] proposed the Spatially Modulated Co-Attention (SMCA) mechanism to accelerate convergence by constraining co-attention responses to be high near initially estimated bounding box locations. By integrating the proposed SMCA module into DETR, a similar mAP can be obtained with about 10× fewer training epochs under comparable inference cost.

Given the high computation complexity associated with DETR, Zheng et al. [121] proposed an Adaptive Clustering Transformer (ACT) to reduce the computation cost of pre-trained DETR. ACT adaptively clusters the query features using a locality sensitivity hashing (LSH) method and broadcasts the attention output to the queries represented by the selected prototypes. ACT is used to replace the self-attention module of the pre-trained DETR model without requiring any re-training. This approach significantly reduces the computational cost while the accuracy drops only slightly. The performance drop can be further reduced by utilizing a multi-task knowledge distillation (MTKD) method, which exploits the original transformer to distill the ACT module with a few epochs of fine-tuning. Yao et al. [124] pointed out that the random initialization in DETR is the main reason for the requirement of multiple decoder layers and slow convergence. To this end, they proposed Efficient DETR, which incorporates the dense prior into the detection pipeline via an additional region proposal network. The better initialization enables them to use only one decoder layer instead of six to achieve competitive performance with a more compact network.

Transformer-based Backbone for Detection. Unlike DETR, which redesigns object detection as a set prediction task via transformer, Beal et al. [113] proposed to utilize transformer as a backbone for common detection frameworks such as Faster R-CNN [13]. The input image is divided into several patches and fed into a vision transformer, whose output embedding features are reorganized according to spatial information before passing through a detection head for the final results. A massive pre-trained transformer backbone could bring benefits to the proposed ViT-FRCNN. There are also quite a few methods that explore versatile vision transformer backbone designs [29], [72], [60], [62] and transfer these backbones to traditional detection frameworks like RetinaNet [127] and Cascade R-CNN [128]. For example, Swin Transformer [60] obtains about 4 box AP gains over a ResNet-50 backbone with similar FLOPs for various detection frameworks.

Pre-training for Transformer-based Object Detection. Inspired by the pre-training transformer scheme in NLP, several methods have been proposed to explore different pre-training schemes for transformer-based object detection [32], [126], [129]. Dai et al. [32] proposed unsupervised pre-training for object detection (UP-DETR). Specifically, a novel unsupervised pretext task named random query patch detection is proposed to pre-train the DETR model. With this unsupervised pre-training scheme, UP-DETR significantly improves the detection accuracy on a relatively small dataset (PASCAL VOC). On the COCO benchmark with sufficient training data, UP-DETR still outperforms DETR, demonstrating the effectiveness of the unsupervised pre-training scheme.

Fang et al. [126] explored how to transfer the pure ViT structure pre-trained on ImageNet to the more challenging object detection task and proposed the YOLOS detector. To cope with the object detection task, the proposed YOLOS first drops the classification tokens in ViT and appends learnable detection tokens. Besides, the bipartite matching loss is utilized to perform set prediction for objects. With this simple pre-training scheme on the ImageNet dataset, the proposed YOLOS shows competitive performance for object detection on the COCO benchmark.

TABLE 3: Comparison of different transformer-based object detectors on the COCO 2017 val set. Running speed (FPS) is evaluated on an NVIDIA Tesla V100 GPU as reported in [17]. † Estimated speed according to the reported number in the paper. ‡ ViT backbone is pre-trained on ImageNet-21k. ∗ ViT backbone is pre-trained on a private dataset with 1.3 billion images.

Method | Epochs | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L | #Params (M) | GFLOPs | FPS
CNN based
FCOS [125] | 36 | 41.0 | 59.8 | 44.1 | 26.2 | 44.6 | 52.2 | - | 177 | 23†
Faster R-CNN + FPN [13] | 109 | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 | 42 | 180 | 26
CNN Backbone + Transformer Head
DETR [16] | 500 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 41 | 86 | 28
DETR-DC5 [16] | 500 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 | 41 | 187 | 12
Deformable DETR [17] | 50 | 46.2 | 65.2 | 50.0 | 28.8 | 49.2 | 61.7 | 40 | 173 | 19
TSP-FCOS [120] | 36 | 43.1 | 62.3 | 47.0 | 26.6 | 46.8 | 55.9 | - | 189 | 20†
TSP-RCNN [120] | 96 | 45.0 | 64.5 | 49.6 | 29.7 | 47.7 | 58.0 | - | 188 | 15†
ACT+MKKD (L=32) [121] | - | 43.1 | - | - | 61.4 | 47.1 | 22.2 | - | 169 | 14†
SMCA [123] | 108 | 45.6 | 65.5 | 49.1 | 25.9 | 49.3 | 62.6 | - | - | -
Efficient DETR [124] | 36 | 45.1 | 63.1 | 49.1 | 28.3 | 48.4 | 59.0 | 35 | 210 | -
UP-DETR [32] | 150 | 40.5 | 60.8 | 42.6 | 19.0 | 44.4 | 60.0 | 41 | - | -
UP-DETR [32] | 300 | 42.8 | 63.0 | 45.3 | 20.8 | 47.1 | 61.7 | 41 | - | -
Transformer Backbone + CNN Head
ViT-B/16-FRCNN‡ [113] | 21 | 36.6 | 56.3 | 39.3 | 17.4 | 40.0 | 55.5 | - | - | -
ViT-B/16-FRCNN∗ [113] | 21 | 37.8 | 57.4 | 40.1 | 17.8 | 41.4 | 57.3 | - | - | -
PVT-Small+RetinaNet [72] | 12 | 40.4 | 61.3 | 43.0 | 25.0 | 42.9 | 55.7 | 34.2 | 118 | -
Twins-SVT-S+RetinaNet [62] | 12 | 43.0 | 64.2 | 46.3 | 28.0 | 46.4 | 57.5 | 34.3 | 104 | -
Swin-T+RetinaNet [60] | 12 | 41.5 | 62.1 | 44.2 | 25.1 | 44.9 | 55.5 | 38.5 | 118 | -
Swin-T+ATSS [60] | 36 | 47.2 | 66.5 | 51.3 | - | - | - | 36 | 215 | -
Pure Transformer based
PVT-Small+DETR [72] | 50 | 34.7 | 55.7 | 35.4 | 12.0 | 36.4 | 56.7 | 40 | - | -
TNT-S+DETR [29] | 50 | 38.2 | 58.9 | 39.4 | 15.5 | 41.1 | 58.8 | 39 | - | -
YOLOS-Ti [126] | 300 | 30.0 | - | - | - | - | - | 6.5 | 21 | -
YOLOS-S [126] | 150 | 37.6 | 57.6 | 39.2 | 15.9 | 40.2 | 57.3 | 28 | 179 | -
YOLOS-B [126] | 150 | 42.0 | 62.2 | 44.5 | 19.5 | 45.3 | 62.1 | 127 | 537 | -

3.2.2 Segmentation

Segmentation is an important topic in the computer vision community, broadly including panoptic segmentation, instance segmentation and semantic segmentation. Vision transformer has also shown impressive potential in the field of segmentation.

Transformer for Panoptic Segmentation. DETR [16] can be naturally extended to panoptic segmentation tasks and achieves competitive results by appending a mask head to the decoder. Wang et al. [25] proposed Max-DeepLab to directly predict panoptic segmentation results with a mask transformer, without involving surrogate sub-tasks such as box detection. Similar to DETR, Max-DeepLab streamlines the panoptic segmentation task in an end-to-end fashion and directly predicts a set of non-overlapping masks and corresponding labels. Model training is performed using a panoptic quality (PQ) style loss, but unlike prior methods that stack a transformer on top of a CNN backbone, Max-DeepLab adopts a dual-path framework that facilitates combining the CNN and transformer.

Transformer for Instance Segmentation. VisTR, a transformer-based video instance segmentation model, was proposed by Wang et al. [33] to produce instance prediction results from a sequence of input images. A strategy for matching instance sequences is proposed to assign the predictions to the ground truths. In order to obtain the mask sequence for each instance, VisTR utilizes an instance sequence segmentation module to accumulate the mask features from multiple frames and segments the mask sequence with a 3D CNN. Hu et al. [130] proposed an instance segmentation Transformer (ISTR) to predict low-dimensional mask embeddings and match them with the ground truth for the set loss. ISTR conducts detection and segmentation with a recurrent refinement strategy, which is different from the existing top-down and bottom-up frameworks. Yang et al. [131] investigated how to realize better and more efficient embedding learning to tackle semi-supervised video object segmentation under challenging multi-object scenarios. Some papers, such as [132], [133], also discuss using transformer to deal with the segmentation task.

Transformer for Semantic Segmentation. Zheng et al. [18] proposed a transformer-based semantic segmentation network (SETR). SETR utilizes an encoder similar to ViT [15] to extract features from an input image, and a multi-level feature aggregation module is adopted for performing pixel-wise segmentation. Strudel et al. [134] introduced Segmenter, which relies on the output embeddings corresponding to image patches and obtains class labels with a point-wise linear decoder or a mask transformer decoder. Xie et al. [135] proposed a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders, outputting multiscale features and avoiding complex decoders.

Transformer for Medical Image Segmentation. Cao et al. [30] proposed a Unet-like pure Transformer for medical image segmentation, feeding the tokenized image patches into a Transformer-based U-shaped encoder-decoder architecture with skip connections for local-global semantic feature learning. Valanarasu et al. [136] explored transformer-based solutions, studied the feasibility of using transformer-based network architectures for medical image segmentation tasks, and proposed a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. Cell-DETR [137], based on the DETR panoptic segmentation model, is an attempt to use transformer for cell instance segmentation. It adds skip connections that bridge features between the backbone CNN and the CNN decoder in the segmentation head in order to enhance feature fusion. Cell-DETR achieves state-of-the-art performance for cell instance segmentation from microscopy imagery.

3.2.3 Pose Estimation

Human pose and hand pose estimation are foundational topics that have attracted significant interest from the research community. Articulated pose estimation is akin to a structured prediction task,
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 10

aiming to predict the joint coordinates or mesh vertices from input 3.2.4 Other Tasks
RGB/D images. Here we discuss some methods [34], [35], [36], There are also quite a lot different high/mid-level vision tasks
[117] that explore how to utilize transformer for modeling the that have explored the usage of vision transformer for better
global structure information of human poses and hand poses. performance. We briefly review several tasks below.
Transformer for Hand Pose Estimation. Huang et al. [34] pro- Pedestrian Detection. Because the distribution of objects is very
posed a transformer based network for 3D hand pose estimation dense in occlusion and crowd scenes, additional analysis and
from point sets. The encoder first utilizes a PointNet [138] to adaptation are often required when common detection networks
extract point-wise features from input point clouds and then adopts are applied to pedestrian detection tasks. Lin et al. [145] revealed
standard multi-head self-attention module to produce embeddings. that sparse uniform queries and a weak attention field in the
In order to expose more global pose-related information to the decoder result in performance degradation when directly applying
decoder, a feature extractor such as PointNet++ [139] is used DETR or Deformable DETR to pedestrian detection tasks. To
to extract hand joint-wise features, which are then fed into the alleviate these drawbacks, the authors proposes Pedestrian End-
decoder as positional encodings. Similarly, Huang et al. [35] to-end Detector (PED), which employs a new decoder called
proposed HOT-Net (short for hand-object transformer network) Dense Queries and Rectified Attention field (DQRF) to support
for 3D hand-object pose estimation. Unlike the preceding method dense queries and alleviate the noisy or narrow attention field
which employs transformer to directly predict 3D hand pose from of the queries. They also proposed V-Match, which achieves
input point clouds, HOT-Net uses a ResNet to generate initial 2D additional performance improvements by fully leveraging visible
hand-object pose and then feeds it into a transformer to predict annotations.
Lane Detection. Based on PolyLaneNet [146], Liu et al. [116]
the 3D hand-object pose. A spectral graph convolution network
proposed a method called LSTR, which improves performance
is therefore used to extract input embeddings for the encoder.
of curve lane detection by learning the global context with
Hampali et al. [140] proposed to estimate the 3D poses of two
a transformer network. Similar to PolyLaneNet, LSTR regards
hands given a single color image. Specifically, appearance and
lane detection as a task of fitting lanes with polynomials and
spatial encodings of a set of potential 2D locations for the joints
uses neural networks to predict the parameters of polynomials.
of both hands were inputted to a transformer, and the attention
To capture slender structures for lanes and the global context,
mechanisms were used to sort out the correct configuration of the
LSTR introduces a transformer network into the architecture. This
joints and outputted the 3D poses of both hands.
enables processing of low-level features extracted by CNNs. In ad-
Transformer for Human Pose Estimation. Lin et al. [36] dition, LSTR uses Hungarian loss to optimize network parameters.
proposed a mesh transformer (METRO) for predicting 3D human As demonstrated in [116], LSTR outperforms PolyLaneNet, with
pose and mesh from a single RGB image. METRO extracts 2.82% higher accuracy and 3.65× higher FPS using 5-times fewer
image features via a CNN and then perform position encoding parameters. The combination of a transformer network, CNN and
by concatenating a template human mesh to the image feature. A Hungarian Loss culminates in a lane detection framework that
multi-layer transformer encoder with progressive dimensionality is precise, fast, and tiny. Considering that the entire lane line
reduction is proposed to gradually reduce the embedding dimen- generally has an elongated shape and long-range, Liu et al. [147]
sions and finally produce 3D coordinates of human joint and mesh utilized a transformer encoder structure for more efficient context
vertices. To encourage the learning of non-local relationships be- feature extraction. This transformer encoder structure improves
tween human joints, METRO randomly mask some input queries the detection of the proposal points a lot, which rely on contextual
during training. Yang et al. [117] constructed an explainable model features and global information, especially in the case where the
named TransPose based on Transformer architecture and low-level backbone network is a small model.
convolutional blocks. The attention layers built in Transformer can Scene Graph. Scene graph is a structured representation of a
capture long-range spatial relationships between keypoints and ex- scene that can clearly express the objects, attributes, and rela-
plain what dependencies the predicted keypoints locations highly tionships between objects in the scene [148]. To generate scene
rely on. Li et al. [141] proposed a novel approach based on Token graph, most of existing methods first extract image-based object
representation for human Pose estimation (TokenPose). Each key- representations and then do message propagation between them.
point was explicitly embedded as a token to simultaneously learn Graph R-CNN [149] utilizes self-attention to integrate contextual
constraint relationships and appearance cues from images. Mao et information from neighboring nodes in the graph. Recently, Shar-
al. [142] proposed a human pose estimation framework that solved ifzadeh et al. [150] employed transformers over the extracted
the task in the regression-based fashion. They formulated the pose object embedding. Sharifzadeh et al. [151] proposed a new
estimation task into a sequence prediction problem and solve it by pipeline called Texema and employed a pre-trained Text-to-Text
transformers, which bypass the drawbacks of the heatmap-based Transfer Transformer (T5) [152] to create structured graphs from
pose estimator. Jiang et al. [143] proposed a novel transformer textual input and utilized them to improve the relational reasoning
based network that can learn a distribution over both pose and module. The T5 model enables Texema to utilize the knowledge
motion in an unsupervised fashion rather than tracking body parts in texts.
and trying to temporally smooth them. The method overcame Tracking. Some researchers also explored to use transformer
inaccuracies in detection and corrected partial or entire skeleton encoder-decoder architecture in template-based discriminative
corruption. Hao et al. [144] proposed to personalize a human pose trackers, such as TMT [153], TrTr [154] and TransT [155]. All
estimator given a set of test images of a person without using these work use a Siamese-like tracking pipeline to do video
any manual annotations. The method adapted the pose estimator object tracking and utilize the encoder-decoder network to re-
during test time to exploit person-specific information, and used place explicit cross-correlation operation for global and rich
a Transformer model to build a transformation between the self- contextual inter-dependencies. Specifically, the transformer en-
supervised keypoints and the supervised keypoints. coder and decoder are assigned to the template branch and the
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 11

searching branch, respectively. In addition, Sun et al. proposed further research effort is conducted into exploring more powerful
TransTrack [156], which is an online joint-detection-and-tracking transformers for high-level vision.
pipeline. It utilizes the query-key mechanism to track pre-existing
objects and introduces a set of learned object queries into the 3.3 Low-level Vision
pipeline to detect new-coming objects. The proposed TransTrack
Few works apply transformers on low-level vision fields, such
achieves 74.5% and 64.5% MOTA on MOT17 and MOT20 bench-
as image super-resolution and generation. These tasks often take
mark.
images as outputs (e.g., high-resolution or denoised images),
Re-Identification. He et al. [157] proposed TransReID to inves-
which is more challenging than high-level vision tasks such as
tigate the application of pure transformers in the field of object
classification, segmentation, and detection, whose outputs are
re-identification (ReID). While introducing transformer network
labels or boxes.
into object ReID, TransReID slices with overlap to reserve local
neighboring structures around the patches and introduces 2D bilin- Training Only
ear interpolation to help handle any given input resolution. With CNN
Images
the transformer module and the loss function, a strong baseline Images Encoder
was proposed to achieve comparable performance with CNN-
based frameworks. Moreover, The jigsaw patch module (JPM)
patches
was designed to facilitate perturbation-invariant and robust feature Transformer Tokens
representation of objects and the side information embeddings Decoders
Transformer
(SIE) was introduced to encode side information. The final frame- Decoders
work TransReID achieves state-of-the-art performance on both
person and vehicle ReID benchmarks. Both Liu et al. [158] and Images CNN
Decoder
Zhang et al. [159] provided solutions for introducing transformer noise
network into video-based person Re-ID. And similarly, both of the (a) Image Generation (b) Image Generation
(GAN-based) (Transformer-based)
them utilized separated transformer networks to refine spatial and
temporal features, and then utilized a cross view transformer to Fig. 9: A generic framework for transformer in image generation.
aggregate multi-view features.
Point Cloud Learning. A number of other works exploring 3.3.1 Image Generation
transformer architecture for point cloud learning [160], [161], An simple yet effective to apply transformer model to the image
[162] have also emerged recently. For example, Guo et al. [161] generation task is to directly change the architectures from CNNs
proposed a novel framework that replaces the original self- to transformers, as shown in Figure 9 (a). Jiang et al. [38] proposed
attention module with a more suitable offset-attention module, TransGAN, which build GAN using the transformer architec-
which includes implicit Laplace operator and normalization refine- ture. Since the it is difficult to generate high-resolution images
ment. In addition, Zhao et al. [162] designed a novel transformer pixel-wise, a memory-friendly generator is utilized by gradually
architecture called Point Transformer. The proposed self-attention increasing the feature map resolution at different stages. Corre-
layer is invariant to the permutation of the point set, making it spondingly, a multi-scale discriminator is designed to handle the
suitable for point set processing tasks. Point Transformer shows varying size of inputs in different stages. Various training recipes
strong performance for semantic segmentation task from 3D point are introduced including grid self-attention, data augmentation,
clouds. relative position encoding and modified normalization to stabilize
the training and improve its performance. Experiments on various
benchmark datasets demonstrate the effectiveness and potential
3.2.5 Discussions
of the transformer-based GAN model in image generation tasks.
As discussed in the preceding sections, transformers have shown Kwonjoon Lee et al. [163] proposed ViTGAN, which introduce
strong performance on several high-level tasks, including detec- several technique to both generator and discriminator to stabilize
tion, segmentation and pose estimation. The key issues that need to the training procedure and convergence. Euclidean distance is
be resolved before transformer can be adopted for high-level tasks introduced for the self-attention module to enforce the Lips-
relate to input embedding, position encoding, and prediction loss. chitzness of transformer discriminator. Self-modulated layernorm
Some methods propose improving the self-attention module from and implicit neural representation are proposed to enhance the
different perspectives, for example, deformable attention [17], training for the generator. As a result, ViTGAN is the first work
adaptive clustering [121] and point transformer [162]. Neverthe- to demonstrate transformer-based GANs can achieve comparable
less, exploration into the use of transformers for high-level vision performance to state-of-the-art CNN-based GANs.
tasks is still in the preliminary stages and so further research Parmar et al. [27] proposed Image Transformer, taking the first
may prove beneficial. For example, is it necessary to use feature step toward generalizing the transformer model to formulate image
extraction modules such as CNN and PointNet before transformer translation and generation tasks in an auto-regressive manner.
for potential better performance? How can vision transformer be Image Transformer consists of two parts: an encoder for extracting
fully leveraged using large scale pre-training datasets as BERT image representation and a decoder to generate pixels. For each
and GPT-3 do in the NLP field? And is it possible to pre-train a pixel with value 0 − 255, a 256 × d dimensional embedding
single transformer model and fine-tune it for different downstream is learned for encoding each value into a d dimensional vector,
tasks with only a few epochs of fine-tuning? How to design more which is fed into the encoder as input. The encoder and decoder
powerful architecture by incorporating prior knowledge of the adopt the same architecture as that in [9]. Each output pixel
specific tasks? Several prior works have performed preliminary q 0 is generated by calculating self-attention between the input
discussions for the aforementioned topics and We hope more pixel q and previously generated pixels m1 , m2 , ... with position
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 12

embedding p1 , p2 , .... For image-conditioned generation, such as The most relevant reference patch is ti = vhi , where ti in
super-resolution and inpainting, an encoder-decoder architecture is T is the transferred features. A soft-attention module is then
used, where the encoder’s input is the low-resolution or corrupted used to transfer V to the low-resolution feature. The transferred
images. For unconditional and class-conditional generation (i.e., features from the high-resolution texture image and the low-
noise to image), only the decoder is used for inputting noise vec- resolution feature are used to generate the output features of
tors. Because the decoder’s input is the previously generated pixels the low-resolution image. By leveraging the transformer-based
(involving high computation cost when producing high-resolution architecture, TTSR can successfully transfer texture information
images), a local self-attention scheme is proposed. This scheme from high-resolution reference images to low-resolution images in
uses only the closest generated pixels as input for the decoder, super-resolution tasks.
enabling Image Transformer to achieve performance on par with
CNN-based models for image generation and translation tasks, Multi-head Flatten features Multi-tail

demonstrating the effectiveness of transformer-based models on Denoising Denoising


Tail
Head
low-level vision tasks. Transformer Encoder
Since it is difficult to directly generate high-resolution images Deraining Deraining
Head Tail
by transformer models, Esser et al. [37] proposed Taming Trans- Features

former. Taming Transformer consists of two parts: a VQGAN x2 Up


Task embedding
Features x2 Up
and a transformer. VQGAN is a variant of VQVAE [164], which Head Tail

uses a discriminator and perceptual loss to improve the visual



Transformer Decoder
quality. Through VQGAN, the image can be represented by a x4 Up x4 Up
Head Reshape Tail
series of context-rich discrete vectors and therefore these vectors
can be easily predicted by a transformer model through an auto-
regression way. The transformer model can learn the long-range Fig. 10: Diagram of IPT architecture (image from [19]).
interactions for generating high-resolution images. As a result, the Different from the preceding methods that use transformer
proposed Taming Transformer achieves state-of-the-art results on models on single tasks, Chen et al. [19] proposed Image Pro-
a wide variety of image synthesis tasks. cessing Transformer (IPT), which fully utilizes the advantages
Besides image generation, DALL·E [41] proposed the trans- of transformers by using large pre-training datasets. It achieves
former model for text-to-image generation, which synthesizes state-of-the-art performance in several image processing tasks,
images according to the given captions. The whole framework including super-resolution, denoising, and deraining. As shown
consists of two stages. In the first stage, a discrete VAE is in Figure 10, IPT consists of multiple heads, an encoder, a
utilized to learn the visual codebook. In the second stage, the decoder, and multiple tails. The multi-head, multi-tail structure
text is decoded by BPE-encode and the corresponding image and task embeddings are introduced for different image processing
is decoded by dVAE learned in the first stage. Then an auto- tasks. The features are divided into patches, which are fed into
regression transformer is used to learn the prior between the the encoder-decoder architecture. Following this, the outputs are
encoded text and image. During the inference procedure, tokens reshaped to features with the same size. Given the advantages
of images are predicted by the transformer and decoded by the of pre-training transformer models on large datasets, IPT uses
learned decoder. The CLIP model [40] is introduced to rank the ImageNet dataset for pre-training. Specifically, images from
generated samples. Experiments on text-to-image generation task this dataset are degraded by manually adding noise, rain streaks,
demonstrate the powerful ability of the proposed model. Note that or downsampling in order to generate corrupted images. The
our survey mainly focus on pure vision tasks, we do not include degraded images are used as inputs for IPT, while the original
the framework of DALL·E in Figure 9. images are used as the optimization goal of the outputs. A self-
supervised method is also introduced to enhance the generalization
3.3.2 Image Processing ability of the IPT model. Once the model is trained, it is fine-
tuned on each task by using the corresponding head, tail, and
A number of recent works eschew using each pixel as the input
task embedding. IPT largely achieves performance improvements
for transformer models and instead use patches (set of pixels) as
on image processing tasks (e.g., 2 dB in image denoising tasks),
input. For example, Yang et al. [39] proposed Texture Transformer
demonstrating the huge potential of applying transformer-based
Network for Image Super-Resolution (TTSR), using the trans-
models to the field of low-level vision.
former architecture in the reference-based image super-resolution
Besides single image generation, Wang et al. [165] proposed
problem. It aims to transfer relevant textures from reference
SceneFormer to utilize transformer in 3D indoor scene generation.
images to low-resolution images. Taking a low-resolution image
By treating a scene as a sequence of objects, the transformer
and reference image as the query Q and key K, respectively, the
decoder can be used to predict series of objects and their location,
relevance ri,j is calculated between each patch qi in Q and ki in
category, and size. This has enabled SceneFormer to outperform
K as: 
qi ki
 conventional CNN-based methods in user studies.
ri,j = , . (12) It should be noted that iGPT [14] is pre-trained on an
kqi k kki k
inpainting-like task. Since iGPT mainly focus on the fine-tuning
A hard-attention module is proposed to select high-resolution performance on image classification tasks, we treat this work more
features V according to the reference image, so that the low- like an attempt on image classification task using transformer than
resolution image can be matched by using the relevance. The low-level vision tasks.
hard-attention map is calculated as: In conclusion, different to classification and detection tasks,
the outputs of image generation and processing are images. Fig-
hi = arg max ri,j (13) ure 11 illustrates using transformers in low-level vision. In image
j
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 13

mining. The results of using this approach on benchmark datasets


Images
patches demonstrate its performance and speed advantages. In addition,
Gabeur et al. [175] presented a multi-modal transformer to learn
different cross-modal cues in order to represent videos.
Transformer Transformer Video Object Detection. To detect objects in a video, both
Encoders Decoders
global and local information is required. Chen et al. introduced
the memory enhanced global-local aggregation (MEGA) [176]
to capture more content. The representative features enhance the
Images
patches overall performance and address the ineffective and insufficient
problems. Furthermore, Yin et al. [177] proposed a spatiotem-
Fig. 11: A generic framework for transformer in image processing. poral transformer to aggregate spatial and temporal information.
processing tasks, the images are first encoded into a sequence of Together with another spatial feature encoding component, these
tokens or patches and the transformer encoder uses the sequence two components perform well on 3D video object detection tasks.
as input, allowing the transformer decoder to successfully produce Multi-task Learning. Untrimmed video usually contains many
desired images. In image generation tasks, the GAN-based models frames that are irrelevant to the target tasks. It is therefore
directly learn a decoder to generated patches to outputting images crucial to mine the relevant information and discard the redundant
through linear projection, while the transformer-based models information. To extract such information, Seong et al. proposed
train a auto-encoder to learn a codebook for images and use an the video multi-task transformer network [178], which handles
auto-regression transformer model to predict the encoded tokens. multi-task learning on untrimmed videos. For the CoVieW dataset,
A meaningful direction for future research would be designing a the tasks are scene recognition, action recognition and importance
suitable architecture for different image processing tasks. score prediction. Two pre-trained networks on ImageNet and
Places365 extract the scene features and object features. The
multi-task transformers are stacked to implement feature fusion,
3.4 Video Processing
leveraging the class conversion matrix (CCM).
Transformer performs surprisingly well on sequence-based tasks
and especially on NLP tasks. In computer vision (specifically, 3.4.2 Low-level Video Processing
video tasks), spatial and temporal dimension information is fa- Frame/Video Synthesis. Frame synthesis tasks involve synthe-
vored, giving rise to the application of transformer in a number sizing the frames between two consecutive frames or after a
of video tasks, such as frame synthesis [166], action recogni- frame sequence while video synthesis tasks involve synthesizing
tion [167], and video retrieval [168]. a video. Liu et al. proposed the ConvTransformer [166], which is
comprised of five components: feature embedding, position encod-
3.4.1 High-level Video Processing ing, encoder, query decoder, and the synthesis feed-forward net-
Video Action Recognition. Video human action tasks, as the work. Compared with LSTM based works, the ConvTransformer
name suggests, involves identifying and localizing human actions achieves superior results with a more parallelizable architecture.
in videos. Context (such as other people and objects) plays a Another transformer-based approach was proposed by Schatz et
critical role in recognizing human actions. Rohit et al. proposed al. [179], which uses a recurrent transformer network to synthetize
the action transformer [167] to model the underlying relationship human actions from novel views.
between the human of interest and the surrounding context. Video Inpainting. Video inpainting tasks involve completing
Specifically, the I3D [169] is used as the backbone to extract high- any missing regions within a frame. This is challenging, as it
level feature maps. The features extracted (using RoI pooling) requires information along the spatial and temporal dimensions
from intermediate feature maps are viewed as the query (Q), while to be merged. Zeng et al. proposed a spatial-temporal transformer
the key (K) and values (V) are calculated from the intermediate network [28], which uses all the input frames as input and fills
features. A self-attention mechanism is applied to the three compo- them in parallel. The spatial-temporal adversarial loss is used to
nents, and it outputs the classification and regressions predictions. optimize the transformer network.
Lohit et al. [170] proposed an interpretable differentiable module,
named temporal transformer network, to reduce the intra-class 3.4.3 Discussions
variance and increase the inter-class variance. In addition, Fayyaz Compared to image, video has an extra dimension to encode
and Gall proposed a temporal transformer [171] to perform action the temporal information. Exploiting both spatial and temporal
recognition tasks under weakly supervised settings. In addition information helps to have a better understanding of a video.
to human action recognition, transformer has been utilized for Thanks to the relationship modeling capability of transformer,
group activity recognition [172]. Gavrilyuk et al. proposed an video processing tasks have been improved by mining spatial
actor-transformer [173] architecture to learn the representation, and temporal information simultaneously. Nevertheless, due to
using the static and dynamic representations generated by the 2D the high complexity and much redundancy of video data, how
and 3D networks as input. The output of the transformer is the to efficiently and accurately modeling both spatial and temporal
predicted activity. relationships is still an open problem.
Video Retrieval. The key to content-based video retrieval is to
find the similarity between videos. Leveraging only the image-
level of video-level features to overcome the associated chal- 3.5 Multi-Modal Tasks
lenges, Shao et al. [174] suggested using the transformer to model Owing to the success of transformer across text-based NLP tasks,
the long-range semantic dependency. They also introduced the many researches are keen to exploit its potential for processing
supervised contrastive learning strategy to perform hard negative multi-modal tasks (e.g., video-text, image-text and audio-text).
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 14

One example of this is VideoBERT [180], which uses a CNN- or extend an existing image while still matching the description in
based module to pre-process videos in order to obtain representa- the text. Subsequently, Ding et al. proposes CogView [42], which
tion tokens. A transformer encoder is then trained on these tokens is a transformer with VQ-VAE tokenizer similar to DALL-E, but
to learn the video-text representations for downstream tasks, such supports Chinese text input. They claim CogView outperforms
as video caption. Some other examples include VisualBERT [181] DALL-E and previous GAN-bsed methods and also unlike DALL-
and VL-BERT [182], which adopt a single-stream unified trans- E, CogView does not need an additional CLIP model to rerank the
former to capture visual elements and image-text relationship for samples drawn from transformer, i.e. DALL-E.
downstream tasks such as visual question answering (VQA) and Recently, a Unified Transformer (UniT) [43] model is pro-
visual commonsense reasoning (VCR). In addition, several studies posed to cope with multi-modal multi-task learning, which can
such as SpeechBERT [183] explore the possibility of encoding simultaneously handle multiple tasks across different domains,
audio and text pairs with a transformer encoder to process auto- including object detection, natural language understanding and
text tasks such as speech question answering (SQA). vision-language reasoning. Specifically, UniT has two transformer
encoders to handle image and text inputs, respectively, and then
the transformer decoder takes the single or concatenated encoder
outputs according to the task modality. Finally, a task-specific
prediction head is applied to the decoder outputs for different
tasks. In the training stage, all tasks are jointly trained by randomly
selecting a specific task within an iteration. The experiments show
UniT achieves satisfactory performance on every task with a
compact set of model parameters.
In conclusion, current transformer-based mutil-modal models
demonstrates its architectural superiority for unifying data and
Fig. 12: The framework of the CLIP (image from [40]). tasks of various modalities, which demonstrates the potential of
transformer to build a general-purpose intelligence agents able to
Apart from the aforementioned pioneering multi-modal trans- cope with vast amount of applications. Future researches can be
formers, Contrastive Language-Image Pre-training (CLIP) [40] conducted in exploring the effective training or the extendability
takes natural language as supervision to learn more efficient image of multi-modal transformers.
representation. CLIP jointly trains a text encoder and an image
encoder to predict the corresponding training text-image pairs.
3.6 Efficient Transformer
The text encoder of CLIP is a standard transformer with masked
self-attention used to preserve the initialization ability of the pre- Although transformer models have achieved success in various
trained language models. For the image encoder, CLIP considers tasks, their high requirements for memory and computing re-
two types of architecture, ResNet and Vision Transformer. CLIP is sources block their implementation on resource-limited devices
trained on a new dataset containing 400 million (image, text) pairs such as mobile phones. In this section, we review the researches
collected from the Internet. More specifically, given a batch of N carried out into compressing and accelerating transformer models
(image, text) pairs, CLIP learns both text and image embeddings for efficient implementation. This includes including network
jointly to maximize the cosine similarity of those N matched pruning, low-rank decomposition, knowledge distillation, network
embeddings while minimize N 2 − N incorrectly matched embed- quantization, and compact architecture design. Table 4 lists some
dings. On Zero-Shot transfer, CLIP demonstrates astonishing zero- representative works for compressing transformer-based models.
shot classification performances, achieving 76.2% top-1 accuracy
on ImageNet-1K dataset without using any ImageNet training 3.6.1 Pruning and Decomposition
labels. Concretely, at inference, the text encoder of CLIP first In transformer based pre-trained models (e.g., BERT), multiple
computes the feature embeddings of all ImageNet Labels and the attention operations are performed in parallel to independently
image encoder then computes the embeddings of all images. By model the relationship between different tokens [9], [10]. How-
calculating the cosine similarity of text and image embeddings, ever, specific tasks do not require all heads to be used. For
the text-image pair with the highest score should be the image and example, Michel et al. [44] presented empirical evidence that a
its corresponding label. Further experiments on 30 various CV large percentage of attention heads can be removed at test time
benchmarks show the zero-shot transfer ability of CLIP and the without impacting performance significantly. The number of heads
feature diversity learned by CLIP. required varies across different layers — some layers may even
While CLIP maps images according to the description in text, require only one head. Considering the redundancy on attention
another work DALL-E [41] synthesizes new images of categories heads, importance scores are defined to estimate the influence of
described in an input text. Similar to GPT-3, DALL-E is a multi- each head on the final output in [44], and unimportant heads can be
modal transformer with 12 billion model parameters autoregres- removed for efficient deployment. Dalvi et al. [184] analyzed the
sively trained on a dataset of 3.3 million text-image pairs. More redundancy in pre-trained transformer models from two perspec-
specifically, to train DALL-E, a two-stage training procedure is tives: general redundancy and task-specific redundancy. Following
used, where in stage 1, a discrete variational autoencoder is used the lottery ticket hypothesis [185], Prasanna et al. [184] analyzed
to compress 256× 256 RGB images into 32×32 image tokens and the lotteries in BERT and showed that good sub-networks also
then in stage 2, an autoregressive transformer is trained to model exist in transformer-based models, reducing both the FFN layers
the joint distribution over the image and text tokens. Experimental and attention heads in order to achieve high compression rates.
results show that DALL-E can generate images of various styles For the vision transformer [15] which splits an image to multiple
from scratch, including photorealistic imagery, cartoons and emoji patches, Tang et al. [186] proposed to reduce patch calculation
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 15

to accelerate the inference, and the redundant patches can be


Pruning & Decomposition
automatically discovered by considering their contributions to the Redundant Compact Quantization Low-Bit
effective output features. Zhu et al. [187] extended the network Model Knowledge Distillation Model Model

slimming approach [188] to vision transformers for reducing


the dimensions of linear projections in both FFN and attention
modules. Expertise Architecture Design
or NAS
In addition to the width of transformer models, the depth
(i.e., the number of layers) can also be reduced to accelerate the
inference process [198], [199]. Differing from the concept that Fig. 13: Different methods for compressing transformers.
different attention heads in transformer models can be computed in 3.6.3 Quantization
parallel, different layers have to be calculated sequentially because
the input of the next layer depends on the output of previous Quantization aims to reduce the number of bits needed to represent
layers. Fan et al. [198] proposed a layer-wisely dropping strategy network weight or intermediate features [208], [209]. Quantization
to regularize the training of models, and then the whole layers are methods for general neural networks have been discussed at length
removed together at the test phase. and achieve performance on par with the original networks [210],
Beyond the pruning methods that directly discard modules in [211], [212]. Recently, there has been growing interest in how
transformer models, matrix decomposition aims to approximate to specially quantize transformer models [213], [214]. For ex-
the large matrices with multiple small matrices based on the low- ample, Shridhar et al. [215] suggested embedding the input into
rank assumption. For example, Wang et al. [200] decomposed the binary high-dimensional vectors, and then using the binary input
standard matrix multiplication in transformer models, improving representation to train the binary neural networks. Cheong et
the inference efficiency. al. [216] represented the weights in the transformer models by
low-bit (e.g., 4-bit) representation. Zhao et al. [217] empirically
3.6.2 Knowledge Distillation investigated various quantization methods and showed that k-
Knowledge distillation aims to train student networks by trans- means quantization has a huge development potential. Aimed
ferring knowledge from large teacher networks [201], [202], at machine translation tasks, Prato et al. [46] proposed a fully
[203]. Compared with teacher networks, student networks usually quantized transformer, which, as the paper claims, is the first 8-
have thinner and shallower architectures, which are easier to bit model not to suffer any loss in translation quality. Beside,
be deployed on resource-limited resources. Both the output and Liu et al. [218] explored a post-training quantization scheme to
intermediate features of neural networks can also be used to reduce the memory storage and computational costs of vision
transfer effective information from teachers to students. Focused transformers.
on transformer models, Mukherjee et al. [204] used the pre-trained
BERT [10] as a teacher to guide the training of small models, 3.6.4 Compact Architecture Design
leveraging large amounts of unlabeled data. Wang et al. [205] train Beyond compressing pre-defined transformer models into smaller
the student networks to mimic the output of self-attention layers ones, some works attempt to design compact models di-
in the pre-trained teacher models. The dot-product between values rectly [219], [47]. Jiang et al. [47] simplified the calculation
is introduced as a new form of knowledge for guiding students. A of self-attention by proposing a new module — called span-
teacher’s assistant [206] is also introduced in [205], reducing the based dynamic convolution — that combine the fully-connected
gap between large pre-trained transformer models and compact layers and the convolutional layers. Interesting “hamburger” layers
student networks, thereby facilitating the mimicking process. Due are proposed in [220], using matrix decomposition to substitute
to the various types of layers in the transformer model (i.e., self- the original self-attention layers.Compared with standard self-
attention layer, embedding layer, and prediction layers), Jiao et attention operations, matrix decomposition can be calculated more
al. [45] design different objective functions to transfer knowledge efficiently while clearly reflecting the dependence between differ-
from teachers to students. For example, the outputs of student ent tokens. The design of efficient transformer architectures can
models’ embedding layers imitate those of teachers via MSE also be automated with neural architecture search (NAS) [221],
losses. For the vision transformer, Jia et al. [207] proposed a fine- [222], which automatically searches how to combine different
grained manifold distillation method, which excavates effective components. For example, Su et al. [82] searched patch size and
knowledge through the relationship between images and the di- dimensions of linear projections and head number of attention
vided patches. modules to get an efficient vision transformer. Li et al. [223]
TABLE 4: List of representative compressed transformer- explored a self-supervised search strategy to get a hybrid architec-
based models. The data of the Table is from [197]. ture composing of both convolutional modules and self-attention
modules.
Models Compress Type #Layer Params Speed Up The self-attention operation in transformer models calculates
BERTBASE [10] Baseline 12 110M ×1 the dot product between representations from different input to-
ALBERT [189] Decomposition 12 12M ×5.6
kens in a given sequence (patches in image recognition tasks [15]),
BERT- Architecture
6 66M ×1.94 whose complexity is O(N ), where N is the length of the se-
of-Theseus [190] design
Q-BERT [191]
Quantization
12
- -
quence. Recently, there has been a targeted focus to reduce the
Q8BERT [192] 12 complexity to O(N ) in large methods so that transformer models
TinyBERT [45] 4 14.5M ×9.4
DistilBERT [193] 6 6.6m ×1.63
can scale to long sequences [224], [225], [226]. For example,
BERT-PKD [194] Distillation 3∼6 45.7∼67M ×3.73∼1.64 Katharopoulos et al. [224] approximated self-attention as a linear
MobileBERT [195] 24 25.3M ×4.0 dot-product of kernel feature maps and revealed the relationship
PD [196] 6 67.5M ×2.0 between tokens via RNNs. Zaheer et al. [226] considered each to-
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 16

ken as a vertex in a graph and defined the inner product calculation al. [15] claim that large-scale training can surpass inductive
between two tokens as an edge. Inspired by graph theories [227], bias. Position embeddings are added into image patches to retain
[228], various sparse graph are combined to approximate the dense positional information, which is important in computer vision
graph in transformer models, and can achieve O(N ) complexity. tasks. Inspired by the heavy parameter usage in transformers,
Discussion. The preceding methods take different approaches over-parameterization [238], [239] may be a potential point to the
in how they attempt to identify redundancy in transformer mod- interpretability of vision transformers.
els (see Figure 13). Pruning and decomposition methods usually Last but not least, developing efficient transformer models for
require pre-defined models with redundancy. Specifically, pruning CV remains an open problem. Transformer models are usually
focuses on reducing the number of components (e.g., layers, huge and computationally expensive. For example, the base ViT
heads) in transformer models while decomposition represents an model [15] requires 18 billion FLOPs to process an image. In
original matrix with multiple small matrices. Compact models contrast, the lightweight CNN model GhostNet [240], [241] can
also can be directly designed either manually (requiring sufficient achieve similar performance with only about 600 million FLOPs.
expertise) or automatically (e.g., via NAS). The obtained compact Although several methods have been proposed to compress trans-
models can be further represented with low-bits via quantization former, they remain highly complex. And these methods, which
methods for efficient deployment on resource-limited devices. were originally designed for NLP, may not be suitable for CV.
Consequently, efficient transformer models are urgently needed
4 C ONCLUSIONS AND D ISCUSSIONS so that vision transformer can be deployed on resource-limited
devices.
Transformer is becoming a hot topic in the field of computer
vision due to its competitive performance and tremendous poten-
tial compared with CNNs. To discover and utilize the power of
transformer, as summarized in this survey, a number of methods 4.2 Future Prospects
have been proposed in recent years. These methods show excellent
performance on a wide range of visual tasks, including backbone, In order to drive the development of vision transformers, we
high/mid-level vision, low-level vision, and video processing. provide several potential directions for future study.
Nevertheless, the potential of transformer for computer vision has One direction is the effectiveness and the efficiency of trans-
not yet been fully explored, meaning that several challenges still formers in computer vision. The goal is to develop highly ef-
need to be resolved. In this section, we discuss these challenges fective and efficient vision transformers; specifically, transformers
and provide insights on the future prospects. with high performance and low resource cost. The performance
determines whether the model can be applied on real-world
4.1 Challenges applications, while the resource cost influences the deployment
on devices [242], [243]. The effectiveness is usually correlated
Although researchers have proposed many transformer-based
with the efficiency, so determining how to achieve a better balance
models to tackle computer vision tasks, these works are only the
between them is a meaningful topic for future study.
first steps in this field and still have much room for improvement.
For example, the transformer architecture in ViT [15] follows Most of the existing vision transformer models are designed to
the standard transformer for NLP [9], but an improved version handle only a single task. Many NLP models such as GPT-3 [11]
specifically designed for CV remains to be explored. Moreover, it have demonstrated how transformer can deal with multiple tasks in
is necessary to apply transformer to more tasks other than those one model. IPT [19] in the CV field is also able to process multiple
mentioned earlier. low-level vision tasks, such as super-resolution, image denoising,
The generalization and robustness of transformers for com- and deraining. Perceiver [244] and Perceiver IO [245] are the
puter vision are also challenging. Compared with CNNs, pure pioneering models that can work on several domains including
transformers lack some inductive biases and rely heavily on images, audio, multimodal, point clouds. We believe that more
massive datasets for large-scale training [15]. Consequently, the tasks can be involved in only one model. Unifying all visual
quality of data has a significant influence on the generalization tasks and even other tasks in one transformer (i.e., a grand unified
and robustness of transformers. Although ViT shows exceptional model) is an exciting topic.
performance on downstream image classification tasks such as There have been various types of neural networks, such as
CIFAR [229] and VTAB [230], directly applying the ViT back- CNN, RNN, and transformer. In the CV field, CNNs used to
bone on object detection has failed to achieve better results than be the mainstream choice [12], [93], but now transformer is
CNNs [113]. There is still a long way to go in order to better becoming popular. CNNs can capture inductive biases such as
generalize pre-trained transformers on more generalized visual translation equivariance and locality, whereas ViT uses large-scale
tasks. Practitioners concern the robustness of transformer (e.g. training to surpass inductive bias [15]. From the evidence currently
the vulnerability issue [231]). Although the robustness has been available [15], CNNs perform well on small datasets, whereas
investigated in [232], [233], [234], it is still an open problem transformers perform better on large datasets. The question for the
waiting to be solved. future is whether to use CNN or transformer.
Although numerous works have explained the use of trans- By training with large datasets, transformers can achieve state-
formers in NLP [235], [236], it remains a challenging subject of-the-art performance on both NLP [11], [10] and CV bench-
to clearly explain why transformer works well on visual tasks. marks [15]. It is possible that neural networks need big data rather
The inductive biases, including translation equivariance and lo- than inductive bias. In closing, we leave you with a question:
cality, are attributed to CNN’s success, but transformer lacks any Can transformer obtains satisfactory results with a very simple
inductive bias. The current literature usually analyzes the effect computational paradigm (e.g., with only fully connected layers)
in an intuitive way [15], [237]. For example, Dosovitskiy et and massive data training?
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 17

ACKNOWLEDGEMENT Generally, the final output signal of the self-attention module


for computer vision will be wrapped as:
This research is partially supported by MindSpore (https://
mindspore.cn/) and CANN (Compute Architecture for Neural Z = YWo + X (19)
Networks).
where Y is generated through Eq. 17. If Wo is initialized as zero,
this self-attention module can be inserted into any existing model
A PPENDIX without breaking its initial behavior.
A1. General Formulation of Self-attention
The self-attention module [9] for machine translation computes the A2. Revisiting Transformers for NLP
responses at each position in a sequence by estimating attention Before transformer was developed, RNNs ( e.g., GRU [248]
scores to all positions and gathering the corresponding embed- and LSTM [6]) with added attention [7] empowered most of
dings based on the scores accordingly. This can be viewed as a the state-of-the-art language models. However, RNNs require the
form of non-local filtering operations [246], [247]. We follow the information flow to be processed sequentially from the previous
convention [246] to formulate the self-attention module. Given an hidden states to the next one. This rules out the possibility of using
input signal (e.g., image, sequence, video and feature) X ∈ Rn×d , acceleration and parallelization during training, and consequently
where n = h × w (indicating the number of pixels in feature) and hinders the potential of RNNs to process longer sequences or build
d is the number of channels, the output signal is generated as: larger models. In 2017, Vaswani et al. [9] proposed transformer,
a novel encoder-decoder architecture built solely on multi-head
1 X
yi = f (xi , xj )g(xj ), (14) self-attention mechanisms and feed-forward neural networks. Its
C(xi ) ∀j purpose was to solve seq-to-seq natural language tasks (e.g.,
machine translation) easily by acquiring global dependencies. The
where xi ∈ R1×d and yi ∈ R1×d indicate the ith position subsequent success of transformer demonstrates that leveraging
(e.g., space, time and spacetime) of the input signal X and output attention mechanisms alone can achieve performance comparable
signal Y , respectively. Subscript j is the index that enumerates all with attentive RNNs. Furthermore, the architecture of transformer
positions, and a pairwise function f (·) computes a representing lends itself to massively parallel computing, which enables train-
relationship (such as affinity) between i and all j . The function ing on larger datasets. This has given rise to the surge of large
g(·) computes a representation of the input signal at position j , pre-trained models (PTMs) for natural language processing.
and the response is normalized by a factor C(xi ). BERT [10] and its variants (e.g., SpanBERT [249],
Note that there are many choices for the pairwise function RoBERTa [250]) are a series of PTMs built on the multi-layer
f (·). For example, a simple extension of the Gaussian function transformer encoder architecture. Two tasks are conducted on
could be used to compute the similarity in an embedding space. BookCorpus [251] and English Wikipedia datasets at the pre-
As such, the function f (·) can be formulated as: training stage of BERT: 1) Masked language modeling (MLM),
T which involves first randomly masking out some tokens in the
f (xi , xj ) = eθ(xi )φ(xj ) (15)
input and then training the model to predict; 2) Next sentence pre-
where θ(·) and φ(·) can be any embedding layers. If we diction, which uses paired sentences as input and predicts whether
consider the θ(·), φ(·), g(·) in the form of linear embedding: the second sentence is the original one in the document. After
θ(X) = XWθ , φ(X) = XWφ , g(X) = XWg where pre-training, BERT can be fine-tuned by adding an output layer
Wθ ∈ Rd×dk , Wφ ∈ Rd×dk ,P Wg ∈ Rd×dv , and set the on a wide range of downstream tasks. More specifically, when
normalization factor as C(xi ) = ∀j f (xi , xj ), the Eq. 14 can performing sequence-level tasks (e.g., sentiment analysis), BERT
be rewritten as: uses the representation of the first token for classification; for
T T
token-level tasks (e.g., name entity recognition), all tokens are fed
exi wθ,i wφ,j xj into the softmax layer for classification. At the time of its release,
yi = P x w wT xT xj wg,j , (16)
BERT achieved the state-of-the-art performance on 11 NLP tasks,
je
i θ,i φ,j j

setting a milestone in pre-trained language models. Generative


where wθ,i ∈ Rd×1 is the ith row of the weight matrix Wθ . For a Pre-trained Transformer models (e.g., GPT [252], GPT-2 [109])
1 are another type of PTMs based on the transformer decoder
given index i, C(x i)
f (xi , xj ) becomes the softmax output along
the dimension j . The formulation can be further rewritten as: architecture, which uses masked self-attention mechanisms. The
main difference between the GPT series and BERT is the way
Y = softmax(XWθ WφT X)g(X), (17) in which pre-training is performed. Unlike BERT, GPT models
are unidirectional language models pre-trained using Left-to-Right
where Y ∈ Rn×c is the output signal of the same size as X.
(LTR) language modeling. Furthermore, BERT learns the sentence
Compared with the query, key and value representations Q =
separator ([SEP]) and classifier token ([CLS]) embeddings during
XWq , K = XWk , V = XWv from the translation module, pre-training, whereas these embeddings are involved in only the
once Wq = Wθ , Wk = Wφ , Wv = Wg , Eq. 17 can be
fine-tuning stage of GPT. Due to its unidirectional pre-training
formulated as:
strategy, GPT achieves superior performance in many natural
Y = softmax(QKT )V = Attention(Q, K, V), (18) language generation tasks. More recently, a massive transformer-
based model called GPT-3, which has an astonishing 175 billion
The self-attention module [9] proposed for machine translation parameters, was developed [11]. By pre-training on 45 TB of
is, to some extent, the same as the preceding non-local filtering compressed plaintext data, GPT-3 can directly process different
operations proposed for computer vision. types of downstream natural language tasks without fine-tuning.
A SUBMISSION TO IEEE TRANSACTION ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 18

As a result, it achieves strong performance on many NLP datasets, in computer vision. Such tasks include semantic segmentation,
including both natural language understanding and generation. instance segmentation, object detection, keypoint detection, and
Since the introduction of transformer, many other models have been proposed in addition to the transformer-based PTMs mentioned earlier. We list a few representative models in Table 5 for interested readers, but this is not the focus of our study.

TABLE 5: List of representative language models built on transformer. Transformer is the standard encoder-decoder architecture; Transformer Enc. and Dec. represent the encoder and decoder, respectively. The decoder uses masked self-attention to prevent attending to future tokens. The data in the table is from [197].

Models              Architecture                  # of Params   Fine-tuning
GPT [252]           Transformer Dec.              117M          Yes
GPT-2 [109]         Transformer Dec.              117M-1542M    No
GPT-3 [11]          Transformer Dec.              125M-175B     No
BERT [10]           Transformer Enc.              110M-340M     Yes
RoBERTa [250]       Transformer Enc.              355M          Yes
XLNet [253]         Two-Stream Transformer Enc.   ~ BERT        Yes
ELECTRA [254]       Transformer Enc.              335M          Yes
UniLM [255]         Transformer Enc.              340M          Yes
BART [256]          Transformer                   110% of BERT  Yes
T5 [152]            Transformer                   220M-11B      Yes
ERNIE (THU) [257]   Transformer Enc.              114M          Yes
KnowBERT [258]      Transformer Enc.              253M-523M     Yes

Apart from the PTMs trained on large corpora for general NLP tasks, transformer-based models have also been applied in many other NLP-related domains and to multi-modal tasks.

BioNLP Domain. Transformer-based models have outperformed many traditional biomedical methods. Some examples of such models include BioBERT [259], which uses a transformer architecture for biomedical text mining tasks, and SciBERT [260], which is developed by training transformer on 114M scientific articles (covering biomedical and computer science fields) with the aim of executing NLP tasks in the scientific domain more precisely. Another example is ClinicalBERT, proposed by Huang et al. [261]. It utilizes transformer to develop and evaluate continuous representations of clinical notes. One of the side effects of this is that the attention map of ClinicalBERT can be used to explain predictions, thereby allowing high-quality connections between different medical contents to be discovered.

The rapid development of transformer-based models on a variety of NLP-related tasks demonstrates the structural superiority and versatility of the transformer architecture, opening up the possibility that it will become a universal module applied in many AI fields other than just NLP. The following part of this survey focuses on the applications of transformer in a wide range of computer vision tasks that have emerged over the past two years.

A3. Self-attention for Computer Vision

The preceding sections reviewed methods that use a transformer architecture for vision tasks. We can conclude that self-attention plays a pivotal role in transformer. The self-attention module can also be considered a building block of CNN architectures, which scale poorly with respect to large receptive fields. This building block is widely used on top of the networks to capture long-range interactions and enhance high-level semantic features for vision tasks. In this section, we delve deeply into the models based on self-attention designed for challenging tasks in computer vision. Such tasks include semantic segmentation, instance segmentation, object detection, keypoint detection, and depth estimation. Here we briefly summarize the existing applications using self-attention for computer vision.

Image Classification. Trainable attention for classification consists of two main streams: hard attention [262], [263], [264] regarding the use of an image region, and soft attention [265], [266], [267], [268] generating non-rigid feature maps. Ba et al. [262] first proposed the term "visual attention" for image classification tasks, and used attention to select relevant regions and locations within the input image. This can also reduce the computational complexity of the proposed model regarding the size of the input image. For medical image classification, AG-CNN [269] was proposed to crop a sub-region from a global image by the attention heat map. Instead of using hard attention and recalibrating the crop of feature maps, SENet [270] was proposed to reweight the channel-wise responses of the convolutional features using soft self-attention. Jetley et al. [266] used attention maps generated by corresponding estimators to reweight intermediate features in DNNs. In addition, Han et al. [267] utilized attribute-aware attention to enhance the representation of CNNs.
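As a concrete instance of such soft attention, the sketch below (our own simplified rendering of the squeeze-and-excitation idea in [270]; the channel count and reduction ratio are illustrative, and biases and the batch dimension are omitted) reweights the channels of a convolutional feature map:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(feature_map, W1, W2):
    # feature_map: (C, H, W); W1: (C, C//r); W2: (C//r, C) with reduction ratio r
    squeeze = feature_map.mean(axis=(1, 2))             # global average pooling -> (C,)
    excite = sigmoid(np.maximum(squeeze @ W1, 0) @ W2)  # FC -> ReLU -> FC -> sigmoid -> (C,)
    return feature_map * excite[:, None, None]          # channel-wise reweighting

C, H, W, r = 64, 14, 14, 16
x = np.random.randn(C, H, W)
W1 = np.random.randn(C, C // r)
W2 = np.random.randn(C // r, C)
y = se_block(x, W1, W2)
assert y.shape == x.shape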
Semantic Segmentation. PSANet [271], OCNet [272], DANet [273] and CFNet [274] are the pioneering works that propose using the self-attention module in semantic segmentation tasks. These works consider and augment the relationship and similarity [275], [276], [277], [278], [279], [280] between the contextual pixels. DANet [273] simultaneously leverages the self-attention module on the spatial and channel dimensions, whereas A2-Net [281] groups the pixels into a set of regions, and then augments the pixel representations by aggregating the region representations with the generated attention weights. DGCNet [282] employs a dual graph CNN to model coordinate space similarity and feature space similarity in a single framework. To improve the efficiency of the self-attention module for semantic segmentation, several works [283], [284], [285], [286], [287] have been proposed, aiming to alleviate the huge computation cost brought by calculating pixel similarities. For example, CGNL [283] applies the Taylor series of the RBF kernel function to approximate the pixel similarities. CCNet [284] approximates the original self-attention scheme via two consecutive criss-cross attention modules. In addition, ISSA [285] factorizes the dense affinity matrix as the product of two sparse affinity matrices. There are other related works using attention-based graph reasoning modules [288], [289], [286] to enhance both the local and global representations.
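To illustrate why such factorizations pay off, the sketch below (our own simplified axial-style example in the spirit of the criss-cross and interlaced designs above, not the exact CCNet [284] or ISSA [285] formulation) replaces the dense (HW)x(HW) affinity matrix with per-row and per-column affinities:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense_attention(X):
    # X: (H*W, C); the affinity matrix is (H*W) x (H*W), quadratic in the number of pixels.
    return softmax(X @ X.T) @ X

def axial_attention(X, H, W):
    # Row pass: each pixel attends within its row (W x W affinities), then a column pass (H x H).
    X = X.reshape(H, W, -1)
    X = np.stack([softmax(row @ row.T) @ row for row in X])    # attention along the width
    X = X.transpose(1, 0, 2)
    X = np.stack([softmax(col @ col.T) @ col for col in X])    # attention along the height
    return X.transpose(1, 0, 2).reshape(H * W, -1)

H, W, C = 32, 32, 32
X = np.random.randn(H * W, C)
Y_dense = dense_attention(X)        # builds a 1024 x 1024 affinity matrix
Y_axial = axial_attention(X, H, W)  # builds only 32 x 32 affinities per row and per column

For a 32x32 map this already shrinks the largest affinity matrix from 1024x1024 to 32x32, and the gap widens quadratically with resolution.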
Object Detection. Ramachandran et al. [268] proposes an attention-based layer and swaps the conventional convolution layers to build a fully attentional detector that outperforms the typical RetinaNet [127] on the COCO benchmark [290]. GCNet [291] assumes that the global contexts modeled by non-local operations are almost the same for different query positions within an image, and unifies the simplified formulation and SENet [270] into a general framework for global context modeling [292], [293], [294], [295]. Vo et al. [296] designs a bidirectional operation to gather and distribute information from a query position to all possible positions. Zhang et al. [118] suggests that previous methods fail to interact with cross-scale features, and proposes Feature Pyramid Transformer, based on the self-attention module, to fully exploit interactions across both space and scales.

Conventional detection methods usually exploit a single visual representation (e.g., bounding box and corner point) for predicting the final results. Hu et al. [297] proposes a relation module based on self-attention to process a set of objects simultaneously through interaction between their appearance features. Cheng et al. [119] proposes RelationNet++ with the bridging visual representations (BVR) module to combine different heterogeneous representations into a single one similar to that in the self-attention module. Specifically, the master representation is treated as the query input and the auxiliary representations are regarded as the key input. The enhanced feature can therefore bridge the information from auxiliary representations and benefit final detection results.
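This query/key pattern can be written as standard dot-product cross-attention. The sketch below is our own toy illustration, not the RelationNet++ [119] implementation; the feature dimension and the numbers of master and auxiliary elements are arbitrary:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bridge(master, auxiliary, Wq, Wk, Wv):
    # master:    (N, C)  e.g., per-box features, used as queries
    # auxiliary: (M, C)  e.g., corner-point features, used as keys and values
    Q, K, V = master @ Wq, auxiliary @ Wk, auxiliary @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (N, M) cross-attention weights
    return master + A @ V                         # enhanced master representation

N, M, C = 100, 400, 256                           # 100 boxes, 400 auxiliary elements
master = np.random.randn(N, C)
auxiliary = np.random.randn(M, C)
Wq, Wk, Wv = (np.random.randn(C, C) / np.sqrt(C) for _ in range(3))
enhanced = bridge(master, auxiliary, Wq, Wk, Wv)
assert enhanced.shape == master.shape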

Other Vision Tasks. Zhang et al. [298] proposes a resolution-wise attention module to learn enhanced feature maps when training multi-resolution networks to obtain accurate human keypoint locations for the pose estimation task. Furthermore, Chang et al. [299] uses an attention-mechanism based feature fusion block to improve the accuracy of the human keypoint detection model.

To explore more generalized contextual information for improving the self-supervised monocular trained depth estimation, Johnston et al. [300] directly leverages the self-attention module. Chen et al. [301] also proposes an attention-based aggregation network to capture context information that differs in diverse scenes for depth estimation. Aich et al. [302] proposes bidirectional attention modules that utilize the forward and backward attention operations for better results of monocular depth estimation.

REFERENCES

[1] F. Rosenblatt. The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
[2] F. Rosenblatt. Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Technical report, 1961.
[3] Y. LeCun et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[4] A. Krizhevsky et al. Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105, 2012.
[5] D. E. Rumelhart et al. Learning internal representations by error propagation. Technical report, 1985.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[7] D. Bahdanau et al. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[8] A. Parikh et al. A decomposable attention model for natural language inference. In EMNLP, 2016.
[9] A. Vaswani et al. Attention is all you need. In NeurIPS, 2017.
[10] J. Devlin et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[11] T. B. Brown et al. Language models are few-shot learners. In NeurIPS, 2020.
[12] K. He et al. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
[13] S. Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[14] M. Chen et al. Generative pretraining from pixels. In ICML, 2020.
[15] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[16] N. Carion et al. End-to-end object detection with transformers. In ECCV, 2020.
[17] X. Zhu et al. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2021.
[18] S. Zheng et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
[19] H. Chen et al. Pre-trained image processing transformer. In CVPR, 2021.
[20] L. Zhou et al. End-to-end dense video captioning with masked transformer. In CVPR, pp. 8739–8748, 2018.
[21] S. Ullman et al. High-level vision: Object recognition and visual cognition, volume 2. MIT press Cambridge, MA, 1996.
[22] R. Kimchi et al. Perceptual organization in vision: Behavioral and neural perspectives. Psychology Press, 2003.
[23] J. Zhu et al. Top-down saliency detection via contextual pooling. Journal of Signal Processing Systems, 74(1):33–46, 2014.
[24] J. Long et al. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[25] H. Wang et al. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In CVPR, pp. 5463–5474, 2021.
[26] R. B. Fisher. Cvonline: The evolving, distributed, non-proprietary, on-line compendium of computer vision. Retrieved January 28, 2006 from http://homepages.inf.ed.ac.uk/rbf/CVonline, 2008.
[27] N. Parmar et al. Image transformer. In ICML, 2018.
[28] Y. Zeng et al. Learning joint spatial-temporal transformations for video inpainting. In ECCV, pp. 528–543. Springer, 2020.
[29] K. Han et al. Transformer in transformer. In NeurIPS, 2021.
[30] H. Cao et al. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv:2105.05537, 2021.
[31] X. Chen et al. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
[32] Z. Dai et al. UP-DETR: unsupervised pre-training for object detection with transformers. In CVPR, 2021.
[33] Y. Wang et al. End-to-end video instance segmentation with transformers. In CVPR, 2021.
[34] L. Huang et al. Hand-transformer: Non-autoregressive structured modeling for 3d hand pose estimation. In ECCV, pp. 17–33, 2020.
[35] L. Huang et al. Hot-net: Non-autoregressive transformer for 3d hand-object pose estimation. In ACM MM, pp. 3136–3145, 2020.
[36] K. Lin et al. End-to-end human pose and mesh reconstruction with transformers. In CVPR, 2021.
[37] P. Esser et al. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[38] Y. Jiang et al. Transgan: Two transformers can make one strong gan. In NeurIPS, 2021.
[39] F. Yang et al. Learning texture transformer network for image super-resolution. In CVPR, pp. 5791–5800, 2020.
[40] A. Radford et al. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.
[41] A. Ramesh et al. Zero-shot text-to-image generation. In ICML, 2021.
[42] M. Ding et al. Cogview: Mastering text-to-image generation via transformers. In NeurIPS, 2021.
[43] R. Hu and A. Singh. Unit: Multimodal multitask learning with a unified transformer. In ICCV, 2021.
[44] P. Michel et al. Are sixteen heads really better than one? In NeurIPS, pp. 14014–14024, 2019.
[45] X. Jiao et al. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, pp. 4163–4174, 2020.
[46] G. Prato et al. Fully quantized transformer for machine translation. In Findings of EMNLP, 2020.
[47] Z.-H. Jiang et al. Convbert: Improving bert with span-based dynamic convolution. NeurIPS, 33, 2020.
[48] J. Gehring et al. Convolutional sequence to sequence learning. In ICML, pp. 1243–1252. PMLR, 2017.
[49] P. Shaw et al. Self-attention with relative position representations. In NAACL, pp. 464–468, 2018.
[50] D. Hendrycks and K. Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016.
[51] J. L. Ba et al. Layer normalization. arXiv:1607.06450, 2016.
[52] A. Baevski and M. Auli. Adaptive input representations for neural language modeling. In ICLR, 2019.
[53] Q. Wang et al. Learning deep transformer models for machine translation. In ACL, pp. 1810–1822, 2019.
[54] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[55] S. Shen et al. Powernorm: Rethinking batch normalization in transformers. In ICML, 2020.
[56] J. Xu et al. Understanding and improving layer normalization. In NeurIPS, 2019.
[57] T. Bachlechner et al. Rezero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, pp. 1352–1361. PMLR, 2021.
[58] B. Wu et al. Visual transformers: Token-based image representation and processing for computer vision. arXiv:2006.03677, 2020.
[59] H. Touvron et al. Training data-efficient image transformers & distillation through attention. In ICML, 2020.
[60] Z. Liu et al. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[61] C.-F. Chen et al. Regionvit: Regional-to-local attention for vision transformers. arXiv:2106.02689, 2021.
[62] X. Chu et al. Twins: Revisiting the design of spatial attention in vision transformers. arXiv:2104.13840, 2021.
[63] H. Lin et al. Cat: Cross attention in vision transformer. arXiv, 2021.

[64] X. Dong et al. Cswin transformer: A general vision transformer [105] A. v. d. Oord et al. Conditional image generation with pixelcnn
backbone with cross-shaped windows. arXiv:2107.00652, 2021. decoders. arXiv preprint arXiv:1606.05328, 2016.
[65] Z. Huang et al. Shuffle transformer: Rethinking spatial shuffle for vision [106] D. Pathak et al. Context encoders: Feature learning by inpainting. In
transformer. arXiv:2106.03650, 2021. CVPR, pp. 2536–2544, 2016.
[66] J. Fang et al. Msg-transformer: Exchanging local spatial information by [107] Z. Li et al. Mst: Masked self-supervised transformer for visual
manipulating messenger tokens. arXiv:2105.15168, 2021. representation. In NeurIPS, 2021.
[67] L. Yuan et al. Tokens-to-token vit: Training vision transformers from [108] H. Bao et al. Beit: Bert pre-training of image transformers.
scratch on imagenet. In ICCV, 2021. arXiv:2106.08254, 2021.
[68] D. Zhou et al. Deepvit: Towards deeper vision transformer. arXiv, 2021. [109] A. Radford et al. Language models are unsupervised multitask learners.
[69] P. Wang et al. Kvt: k-nn attention for boosting vision transformers. OpenAI blog, 1(8):9, 2019.
arXiv:2106.00515, 2021. [110] Z. Xie et al. Self-supervised learning with swin transformers.
[70] D. Zhou et al. Refiner: Refining self-attention for vision transformers. arXiv:2105.04553, 2021.
arXiv:2106.03714, 2021. [111] C. Li et al. Efficient self-supervised vision transformers for representa-
[71] A. El-Nouby et al. Xcit: Cross-covariance image transformers. tion learning. arXiv:2106.09785, 2021.
arXiv:2106.09681, 2021. [112] K. He et al. Momentum contrast for unsupervised visual representation
[72] W. Wang et al. Pyramid vision transformer: A versatile backbone for learning. In CVPR, 2020.
dense prediction without convolutions. In ICCV, 2021. [113] J. Beal et al. Toward transformer-based object detection.
[73] S. Sun* et al. Visual parser: Representing part-whole hierarchies with arXiv:2012.09958, 2020.
transformers. arXiv:2107.05790, 2021. [114] Z. Yuan et al. Temporal-channel transformer for 3d lidar-based video
[74] H. Fan et al. Multiscale vision transformers. arXiv:2104.11227, 2021. object detection for autonomous driving. IEEE TCSVT, 2021.
[75] Z. Zhang et al. Nested hierarchical transformer: Towards accurate, data- [115] X. Pan et al. 3d object detection with pointformer. In CVPR, 2021.
efficient and interpretable visual understanding. In AAAI, 2022. [116] R. Liu et al. End-to-end lane shape prediction with transformers. In
[76] Z. Pan et al. Less is more: Pay less attention in vision transformers. In WACV, 2021.
AAAI, 2022. [117] S. Yang et al. Transpose: Keypoint localization via transformer. In
[77] Z. Pan et al. Scalable visual transformers with hierarchical pooling. In ICCV, 2021.
ICCV, 2021. [118] D. Zhang et al. Feature pyramid transformer. In ECCV, 2020.
[78] B. Heo et al. Rethinking spatial dimensions of vision transformers. In [119] C. Chi et al. Relationnet++: Bridging visual representations for object
ICCV, 2021. detection via transformer decoder. NeurIPS, 2020.
[79] C.-F. Chen et al. Crossvit: Cross-attention multi-scale vision trans- [120] Z. Sun et al. Rethinking transformer-based set prediction for object
former for image classification. In ICCV, 2021. detection. In ICCV, pp. 3611–3620, 2021.
[80] Z. Wang et al. Uformer: A general u-shaped transformer for image [121] M. Zheng et al. End-to-end object detection with adaptive clustering
restoration. arXiv:2106.03106, 2021. transformer. In BMVC, 2021.
[81] X. Zhai et al. Scaling vision transformers. arXiv:2106.04560, 2021.
[122] T. Ma et al. Oriented object detection with transformer.
[82] X. Su et al. Vision transformer architecture search. arXiv, 2021.
arXiv:2106.03146, 2021.
[83] M. Chen et al. Autoformer: Searching transformers for visual recogni-
[123] P. Gao et al. Fast convergence of detr with spatially modulated co-
tion. In ICCV, pp. 12270–12280, 2021.
attention. In ICCV, 2021.
[84] B. Chen et al. Glit: Neural architecture search for global and local
[124] Z. Yao et al. Efficient detr: Improving end-to-end object detector with
image transformer. In ICCV, pp. 12–21, 2021.
dense prior. arXiv:2104.01318, 2021.
[85] X. Chu et al. Conditional positional encodings for vision transformers.
[125] Z. Tian et al. Fcos: Fully convolutional one-stage object detection. In
arXiv:2102.10882, 2021.
ICCV, pp. 9627–9636, 2019.
[86] K. Wu et al. Rethinking and improving relative position encoding for
[126] Y. Fang et al. You only look at one sequence: Rethinking transformer
vision transformer. In ICCV, 2021.
in vision through object detection. In NeurIPS, 2021.
[87] H. Touvron et al. Going deeper with image transformers.
arXiv:2103.17239, 2021. [127] T.-Y. Lin et al. Focal loss for dense object detection. In ICCV, 2017.
[88] Y. Tang et al. Augmented shortcuts for vision transformers. In NeurIPS, [128] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality
2021. object detection. In CVPR, 2018.
[89] I. Tolstikhin et al. Mlp-mixer: An all-mlp architecture for vision. [129] A. Bar et al. Detreg: Unsupervised pretraining with region priors for
arXiv:2105.01601, 2021. object detection. arXiv:2106.04550, 2021.
[90] L. Melas-Kyriazi. Do you even need attention? a stack of feed-forward [130] J. Hu et al. Istr: End-to-end instance segmentation with transformers.
layers does surprisingly well on imagenet. arXiv:2105.02723, 2021. arXiv:2105.00637, 2021.
[91] M.-H. Guo et al. Beyond self-attention: External attention using two [131] Z. Yang et al. Associating objects with transformers for video object
linear layers for visual tasks. arXiv:2105.02358, 2021. segmentation. In NeurIPS, 2021.
[92] H. Touvron et al. Resmlp: Feedforward networks for image classifica- [132] S. Wu et al. Fully transformer networks for semantic image segmenta-
tion with data-efficient training. arXiv:2105.03404, 2021. tion. arXiv:2106.04108, 2021.
[93] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolu- [133] B. Dong et al. Solq: Segmenting objects by learning queries. In
tional neural networks. In ICML, 2019. NeurIPS, 2021.
[94] J. Guo et al. Cmt: Convolutional neural networks meet vision trans- [134] R. Strudel et al. Segmenter: Transformer for semantic segmentation. In
formers. arXiv:2107.06263, 2021. ICCV, 2021.
[95] L. Yuan et al. Volo: Vision outlooker for visual recognition. [135] E. Xie et al. Segformer: Simple and efficient design for semantic
arXiv:2106.13112, 2021. segmentation with transformers. In NeurIPS, 2021.
[96] H. Wu et al. Cvt: Introducing convolutions to vision transformers. [136] J. M. J. Valanarasu et al. Medical transformer: Gated axial-attention for
arXiv:2103.15808, 2021. medical image segmentation. In MICCAI, 2021.
[97] K. Yuan et al. Incorporating convolution designs into visual transform- [137] T. Prangemeier et al. Attention-based transformers for instance seg-
ers. arXiv:2103.11816, 2021. mentation of cells in microstructures. In International Conference on
[98] Y. Li et al. Localvit: Bringing locality to vision transformers. Bioinformatics and Biomedicine, pp. 700–707. IEEE, 2020.
arXiv:2104.05707, 2021. [138] C. R. Qi et al. Pointnet: Deep learning on point sets for 3d classification
[99] B. Graham et al. Levit: a vision transformer in convnet’s clothing for and segmentation. In CVPR, pp. 652–660, 2017.
faster inference. In ICCV, 2021. [139] C. R. Qi et al. Pointnet++: Deep hierarchical feature learning on point
[100] A. Srinivas et al. Bottleneck transformers for visual recognition. In sets in a metric space. NeurIPS, 30:5099–5108, 2017.
CVPR, 2021. [140] S. Hampali et al. Handsformer: Keypoint transformer for monocular 3d
[101] Z. Chen et al. Visformer: The vision-friendly transformer. arXiv, 2021. pose estimation ofhands and object in interaction. arXiv, 2021.
[102] T. Xiao et al. Early convolutions help transformers see better. In [141] Y. Li et al. Tokenpose: Learning keypoint tokens for human pose
NeurIPS, volume 34, 2021. estimation. In ICCV, 2021.
[103] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description [142] W. Mao et al. Tfpose: Direct human pose estimation with transformers.
length, and helmholtz free energy. NIPS, 6:3–10, 1994. arXiv:2103.15320, 2021.
[104] P. Vincent et al. Extracting and composing robust features with [143] T. Jiang et al. Skeletor: Skeletal transformers for robust body-pose
denoising autoencoders. In ICML, pp. 1096–1103, 2008. estimation. In CVPR, 2021.

[144] Y. Li et al. Test-time personalization with a transformer for human pose [184] S. Prasanna et al. When bert plays the lottery, all tickets are winning.
estimation. Advances in Neural Information Processing Systems, 34, In EMNLP, 2020.
2021. [185] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse,
[145] M. Lin et al. Detr for pedestrian detection. arXiv:2012.06785, 2020. trainable neural networks. In ICLR, 2018.
[146] L. Tabelini et al. Polylanenet: Lane estimation via deep polynomial re- [186] Y. Tang et al. Patch slimming for efficient vision transformers.
gression. In 2020 25th International Conference on Pattern Recognition arXiv:2106.02852, 2021.
(ICPR), pp. 6150–6156. IEEE, 2021. [187] M. Zhu et al. Vision transformer pruning. arXiv:2104.08500, 2021.
[147] L. Liu et al. Condlanenet: a top-to-down lane detection framework [188] Z. Liu et al. Learning efficient convolutional networks through network
based on conditional convolution. arXiv:2105.05003, 2021. slimming. In ICCV, 2017.
[148] P. Xu et al. A survey of scene graph: Generation and application. IEEE [189] Z. Lan et al. Albert: A lite bert for self-supervised learning of language
Trans. Neural Netw. Learn. Syst, 2020. representations. In ICLR, 2020.
[149] J. Yang et al. Graph r-cnn for scene graph generation. In ECCV, 2018. [190] C. Xu et al. Bert-of-theseus: Compressing bert by progressive module
[150] S. Sharifzadeh et al. Classification by attention: Scene graph classifica- replacing. In EMNLP, pp. 7859–7869, 2020.
tion with prior knowledge. In AAAI, 2021. [191] S. Shen et al. Q-bert: Hessian based ultra low precision quantization of
[151] S. Sharifzadeh et al. Improving Visual Reasoning by Exploiting The bert. In AAAI, pp. 8815–8821, 2020.
Knowledge in Texts. arXiv:2102.04760, 2021. [192] O. Zafrir et al. Q8bert: Quantized 8bit bert. arXiv:1910.06188, 2019.
[152] C. Raffel et al. Exploring the limits of transfer learning with a unified [193] V. Sanh et al. Distilbert, a distilled version of bert: smaller, faster,
text-to-text transformer. JMLR, 21(140):1–67, 2020. cheaper and lighter. arXiv:1910.01108, 2019.
[153] N. Wang et al. Transformer meets tracker: Exploiting temporal context [194] S. Sun et al. Patient knowledge distillation for bert model compression.
for robust visual tracking. In CVPR, 2021. In EMNLP-IJCNLP, pp. 4323–4332, 2019.
[154] M. Zhao et al. TrTr: Visual Tracking with Transformer. [195] Z. Sun et al. Mobilebert: a compact task-agnostic bert for resource-
arXiv:2105.03817 [cs], May 2021. arXiv: 2105.03817. limited devices. In ACL, pp. 2158–2170, 2020.
[155] X. Chen et al. Transformer tracking. In CVPR, 2021. [196] I. Turc et al. Well-read students learn better: The impact of student
[156] P. Sun et al. TransTrack: Multiple Object Tracking with Transformer. initialization on knowledge distillation. arXiv:1908.08962, 2019.
arXiv:2012.15460 [cs], May 2021. arXiv: 2012.15460. [197] X. Qiu et al. Pre-trained models for natural language processing: A
[157] S. He et al. TransReID: Transformer-based object re-identification. In survey. Science China Technological Sciences, pp. 1–26, 2020.
ICCV, 2021. [198] A. Fan et al. Reducing transformer depth on demand with structured
[158] X. Liu et al. A video is worth three views: Trigeminal transformers for dropout. In ICLR, 2020.
video-based person re-identification. arXiv:2104.01745, 2021. [199] L. Hou et al. Dynabert: Dynamic bert with adaptive width and depth.
[159] T. Zhang et al. Spatiotemporal transformer for video-based person re- NeurIPS, 33, 2020.
identification. arXiv:2103.16469, 2021. [200] Z. Wang et al. Structured pruning of large language models. In EMNLP,
[160] N. Engel et al. Point transformer. IEEE Access, 9:134826–134840, pp. 6151–6162, 2020.
2021. [201] G. Hinton et al. Distilling the knowledge in a neural network.
[161] M.-H. Guo et al. Pct: Point cloud transformer. Computational Visual arXiv:1503.02531, 2015.
Media, 7(2):187–199, 2021. [202] C. Buciluǎ et al. Model compression. In SIGKDD, pp. 535–541, 2006.
[162] H. Zhao et al. Point transformer. In ICCV, pp. 16259–16268, 2021. [203] J. Ba and R. Caruana. Do deep nets really need to be deep? NIPS, 2014.
[163] K. Lee et al. Vitgan: Training gans with vision transformers. arXiv [204] S. Mukherjee and A. H. Awadallah. Xtremedistil: Multi-stage distilla-
preprint arXiv:2107.04589, 2021. tion for massive multilingual models. In ACL, pp. 2221–2234, 2020.
[164] A. v. d. Oord et al. Neural discrete representation learning. arXiv, 2017. [205] W. Wang et al. Minilm: Deep self-attention distillation for task-agnostic
[165] X. Wang et al. Sceneformer: Indoor scene generation with transformers. compression of pre-trained transformers. arXiv:2002.10957, 2020.
In 3DV, pp. 106–115. IEEE, 2021. [206] S. I. Mirzadeh et al. Improved knowledge distillation via teacher
[166] Z. Liu et al. Convtransformer: A convolutional transformer network for assistant. In AAAI, 2020.
video frame synthesis. arXiv:2011.10185, 2020. [207] D. Jia et al. Efficient vision transformers via fine-grained manifold
[167] R. Girdhar et al. Video action transformer network. In CVPR, 2019. distillation. arXiv:2107.01378, 2021.
[168] H. Liu et al. Two-stream transformer networks for video-based face [208] V. Vanhoucke et al. Improving the speed of neural networks on cpus.
alignment. T-PAMI, 40(11):2546–2554, 2017. In NIPS Workshop, 2011.
[169] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new [209] Z. Yang et al. Searching for low-bit weights in quantized neural
model and the kinetics dataset. In CVPR, 2017. networks. In NeurIPS, 2020.
[170] S. Lohit et al. Temporal transformer networks: Joint learning of [210] E. Park and S. Yoo. Profit: A novel training method for sub-4-bit
invariant and discriminative time warping. In CVPR, 2019. mobilenet models. In ECCV, pp. 430–446. Springer, 2020.
[171] M. Fayyaz and J. Gall. Sct: Set constrained temporal transformer for [211] J. Fromm et al. Riptide: Fast end-to-end binarized neural networks.
set supervised action segmentation. In 2020 CVPR, pp. 501–510, 2020. Proceedings of Machine Learning and Systems, 2:379–389, 2020.
[172] W. Choi et al. What are they doing?: Collective activity classification [212] Y. Bai et al. Proxquant: Quantized neural networks via proximal
using spatio-temporal relationship among people. In ICCVW, 2009. operators. In ICLR, 2019.
[173] K. Gavrilyuk et al. Actor-transformers for group activity recognition. [213] A. Bhandare et al. Efficient 8-bit quantization of transformer neural
In CVPR, pp. 839–848, 2020. machine language translation model. arXiv:1906.00532, 2019.
[174] J. Shao et al. Temporal context aggregation for video retrieval with [214] C. Fan. Quantized transformer. Technical report, Stanford Univ., 2019.
contrastive learning. In WACV, 2021. [215] K. Shridhar et al. End to end binarized neural networks for text
[175] V. Gabeur et al. Multi-modal transformer for video retrieval. In ECCV, classification. In SustaiNLP, 2020.
pp. 214–229, 2020. [216] R. Cheong and R. Daniel. transformers. zip: Compressing transformers
[176] Y. Chen et al. Memory enhanced global-local aggregation for video with pruning and quantization. Technical report, 2019.
object detection. In CVPR, pp. 10337–10346, 2020. [217] Z. Zhao et al. An investigation on different underlying quantization
[177] J. Yin et al. Lidar-based online 3d video object detection with graph- schemes for pre-trained language models. In NLPCC, 2020.
based message passing and spatiotemporal transformer attention. In [218] Z. Liu et al. Post-training quantization for vision transformer. In
2020 CVPR, pp. 11495–11504, 2020. NeurIPS, 2021.
[178] H. Seong et al. Video multitask transformer network. In ICCVW, 2019. [219] Z. Wu et al. Lite transformer with long-short range attention. In ICLR,
[179] K. M. Schatz et al. A recurrent transformer network for novel view 2020.
action synthesis. In ECCV (27), pp. 410–426, 2020. [220] Z. Geng et al. Is attention better than matrix decomposition? In ICLR,
[180] C. Sun et al. Videobert: A joint model for video and language 2020.
representation learning. In ICCV, pp. 7464–7473, 2019. [221] Y. Guo et al. Nat: Neural architecture transformer for accurate and
[181] L. H. Li et al. Visualbert: A simple and performant baseline for vision compact architectures. In NeurIPS, pp. 737–748, 2019.
and language. arXiv:1908.03557, 2019. [222] D. So et al. The evolved transformer. In ICML, pp. 5877–5886, 2019.
[182] W. Su et al. Vl-bert: Pre-training of generic visual-linguistic represen- [223] C. Li et al. Bossnas: Exploring hybrid cnn-transformers with block-
tations. In ICLR, 2020. wisely self-supervised neural architecture search. In ICCV, 2021.
[183] Y.-S. Chuang et al. Speechbert: Cross-modal pre-trained language [224] A. Katharopoulos et al. Transformers are rnns: Fast autoregressive
model for end-to-end spoken question answering. In Interspeech, 2020. transformers with linear attention. In ICML, 2020.

[225] C. Yun et al. o(n) connections are expressive enough: Universal [262] J. Ba et al. Multiple object recognition with visual attention. In ICLR,
approximability of sparse transformers. In NeurIPS, 2020. 2014.
[226] M. Zaheer et al. Big bird: Transformers for longer sequences. In [263] V. Mnih et al. Recurrent models of visual attention. NeurIPS, pp.
NeurIPS, 2020. 2204–2212, 2014.
[227] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM [264] K. Xu et al. Show, attend and tell: Neural image caption generation
Journal on Computing, 40(4), 2011. with visual attention. In International conference on machine learning,
[228] F. Chung and L. Lu. The average distances in random graphs with given pp. 2048–2057, 2015.
expected degrees. PNAS, 99(25):15879–15882, 2002. [265] F. Wang et al. Residual attention network for image classification. In
[229] A. Krizhevsky and G. Hinton. Learning multiple layers of features from CVPR, pp. 3156–3164, 2017.
tiny images. Technical report, Citeseer, 2009. [266] S. Jetley et al. Learn to pay attention. In ICLR, 2018.
[230] X. Zhai et al. A large-scale study of representation learning with the [267] K. Han et al. Attribute-aware attention model for fine-grained represen-
visual task adaptation benchmark. arXiv:1910.04867, 2019. tation learning. In ACM MM, pp. 2040–2048, 2018.
[231] Y. Cheng et al. Robust neural machine translation with doubly adver- [268] P. Ramachandran et al. Stand-alone self-attention in vision models. In
sarial inputs. In ACL, 2019. NeurIPS, 2019.
[232] W. E. Zhang et al. Adversarial attacks on deep-learning models in [269] Q. Guan et al. Diagnose like a radiologist: Attention guided
natural language processing: A survey. ACM TIST, 11(3):1–41, 2020. convolutional neural network for thorax disease classification. In
[233] K. Mahmood et al. On the robustness of vision transformers to arXiv:1801.09927, 2018.
adversarial examples. arXiv:2104.02610, 2021. [270] J. Hu et al. Squeeze-and-excitation networks. In CVPR, pp. 7132–7141,
[234] X. Mao et al. Towards robust vision transformer. arXiv, 2021. 2018.
[235] S. Serrano and N. A. Smith. Is attention interpretable? In ACL, 2019. [271] H. Zhao et al. Psanet: Point-wise spatial attention network for scene
[236] S. Wiegreffe and Y. Pinter. Attention is not not explanation. In EMNLP- parsing. In ECCV, pp. 267–283, 2018.
IJCNLP, 2019. [272] Y. Yuan et al. Ocnet: Object context for semantic segmentation.
[237] H. Chefer et al. Transformer interpretability beyond attention visualiza- International Journal of Computer Vision, pp. 1–24, 2021.
tion. In CVPR, pp. 782–791, 2021. [273] J. Fu et al. Dual attention network for scene segmentation. In CVPR,
[238] R. Livni et al. On the computational efficiency of training neural pp. 3146–3154, 2019.
networks. In NeurIPS, 2014. [274] H. Zhang et al. Co-occurrent features in semantic segmentation. In
[239] B. Neyshabur et al. Towards understanding the role of over- CVPR, pp. 548–557, 2019.
parametrization in generalization of neural networks. In ICLR, 2019. [275] F. Zhang et al. Acfnet: Attentional class feature network for semantic
[240] K. Han et al. Ghostnet: More features from cheap operations. In CVPR, segmentation. In ICCV, pp. 6798–6807, 2019.
pp. 1580–1589, 2020. [276] X. Li et al. Expectation-maximization attention networks for semantic
[241] K. Han et al. Model rubik’s cube: Twisting resolution, depth and width segmentation. In ICCV, pp. 9167–9176, 2019.
for tinynets. NeurIPS, 33, 2020. [277] J. He et al. Adaptive pyramid context network for semantic segmenta-
[242] T. Chen et al. Diannao: a small-footprint high-throughput accelerator tion. In CVPR, pp. 7519–7528, 2019.
for ubiquitous machine-learning. In ASPLOS, pp. 269–284, 2014. [278] O. Oktay et al. Attention u-net: Learning where to look for the pancreas.
[243] H. Liao et al. Davinci: A scalable architecture for neural network 2018.
computing. In 2019 IEEE Hot Chips 31 Symposium (HCS), 2019. [279] Y. Wang et al. Self-supervised equivariant attention mechanism for
[244] A. Jaegle et al. Perceiver: General perception with iterative attention. weakly supervised semantic segmentation. In CVPR, pp. 12275–12284,
In ICML, volume 139, pp. 4651–4664. PMLR, 18–24 Jul 2021. 2020.
[245] A. Jaegle et al. Perceiver io: A general architecture for structured inputs [280] X. Li et al. Global aggregation then local distribution in fully convolu-
& outputs. arXiv preprint arXiv:2107.14795, 2021. tional networks. In BMVC, 2019.
[246] X. Wang et al. Non-local neural networks. In CVPR, pp. 7794–7803, [281] Y. Chen et al. Aˆ 2-nets: Double attention networks. NeurIPS, pp.
2018. 352–361, 2018.
[247] A. Buades et al. A non-local algorithm for image denoising. In CVPR, [282] L. Zhang et al. Dual graph convolutional network for semantic
pp. 60–65, 2005. segmentation. In BMVC, 2019.
[248] J. Chung et al. Empirical evaluation of gated recurrent neural networks [283] K. Yue et al. Compact generalized non-local network. In NeurIPS, pp.
on sequence modeling. arXiv:1412.3555, 2014. 6510–6519, 2018.
[249] M. Joshi et al. Spanbert: Improving pre-training by representing and [284] Z. Huang et al. Ccnet: Criss-cross attention for semantic segmentation.
predicting spans. Transactions of the Association for Computational In ICCV, pp. 603–612, 2019.
Linguistics, 8:64–77, 2020. [285] L. Huang et al. Interlaced sparse self-attention for semantic segmenta-
[250] Y. Liu et al. Roberta: A robustly optimized bert pretraining approach. tion. arXiv:1907.12273, 2019.
arXiv:1907.11692, 2019. [286] Y. Li and A. Gupta. Beyond grids: Learning graph representations for
[251] Y. Zhu et al. Aligning books and movies: Towards story-like visual visual recognition. NeurIPS, pp. 9225–9235, 2018.
explanations by watching movies and reading books. In ICCV, pp. [287] S. Kumaar et al. Cabinet: Efficient context aggregation network for
19–27, 2015. low-latency semantic segmentation. arXiv:2011.00993, 2020.
[252] A. Radford et al. Improving language understanding by generative pre- [288] X. Liang et al. Symbolic graph reasoning meets convolutions. NeurIPS,
training, 2018. pp. 1853–1863, 2018.
[253] Z. Yang et al. Xlnet: Generalized autoregressive pretraining for lan- [289] Y. Chen et al. Graph-based global reasoning networks. In CVPR, pp.
guage understanding. In NeurIPS, pp. 5753–5763, 2019. 433–442, 2019.
[254] K. Clark et al. Electra: Pre-training text encoders as discriminators [290] T.-Y. Lin et al. Microsoft coco: Common objects in context. In ECCV,
rather than generators. arXiv:2003.10555, 2020. pp. 740–755, 2014.
[255] L. Dong et al. Unified language model pre-training for natural language [291] Y. Cao et al. Gcnet: Non-local networks meet squeeze-excitation
understanding and generation. In NeurIPS, pp. 13063–13075, 2019. networks and beyond. In ICCV Workshops, 2019.
[256] M. Lewis et al. Bart: Denoising sequence-to-sequence pre-training [292] W. Li et al. Object detection based on an adaptive attention mechanism.
for natural language generation, translation, and comprehension. Scientific Reports, pp. 1–13, 2020.
arXiv:1910.13461, 2019. [293] T.-I. Hsieh et al. One-shot object detection with co-attention and co-
[257] Z. Zhang et al. Ernie: Enhanced language representation with informa- excitation. In NeurIPS, pp. 2725–2734, 2019.
tive entities. arXiv:1905.07129, 2019. [294] Q. Fan et al. Few-shot object detection with attention-rpn and multi-
[258] M. E. Peters et al. Knowledge enhanced contextual word representa- relation detector. In CVPR, pp. 4013–4022, 2020.
tions. arXiv:1909.04164, 2019. [295] H. Perreault et al. Spotnet: Self-attention multi-task network for object
[259] J. Lee et al. Biobert: a pre-trained biomedical language representation detection. In 2020 17th Conference on Computer and Robot Vision
model for biomedical text mining. Bioinformatics, 36(4):1234–1240, (CRV), pp. 230–237, 2020.
2020. [296] X.-T. Vo et al. Bidirectional non-local networks for object detection. In
[260] I. Beltagy et al. Scibert: A pretrained language model for scientific text. International Conference on Computational Collective Intelligence, pp.
arXiv:1903.10676, 2019. 491–501, 2020.
[261] K. Huang et al. Clinicalbert: Modeling clinical notes and predicting [297] H. Hu et al. Relation networks for object detection. In CVPR, pp.
hospital readmission. arXiv:1904.05342, 2019. 3588–3597, 2018.

[298] K. Zhang et al. Learning enhanced resolution-wise features for human


pose estimation. In 2020 IEEE International Conference on Image
Processing (ICIP), pp. 2256–2260, 2020.
[299] Y. Chang et al. The same size dilated attention network for keypoint
detection. In International Conference on Artificial Neural Networks,
pp. 471–483, 2019.
[300] A. Johnston and G. Carneiro. Self-supervised monocular trained depth
estimation using self-attention and discrete disparity volume. In CVPR,
pp. 4756–4765, 2020.
[301] Y. Chen et al. Attention-based context aggregation network for monoc-
ular depth estimation. International Journal of Machine Learning and
Cybernetics, pp. 1583–1596, 2021.
[302] S. Aich et al. Bidirectional attention network for monocular depth
estimation. In ICRA, 2021.
