Vision Transformer for Small-Size Datasets

Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song
Inha University, Incheon, South Korea
[email protected], [email protected], [email protected]
arXiv:2112.13492v1 [cs.CV] 27 Dec 2021

Abstract

Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of ViT results from pre-training on a large-size dataset such as JFT-300M, and its dependence on a large dataset is interpreted as a consequence of low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable ViT to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA are applied to ViTs, the performance improves by an average of 2.96% on Tiny-ImageNet, a representative small-size dataset. In particular, Swin Transformer achieves an overwhelming performance improvement of 4.08% thanks to the proposed SPT and LSA.

Figure 1. Effect of the proposed method on the overall performance when learning Tiny-ImageNet from scratch. Throughput refers to how many images can be processed per unit of time. The stars and dots indicate after and before the proposed method is applied, respectively.
1. INTRODUCTION
Convolutional neural networks (CNNs), which are effective in learning visual representations of image data, have been the mainstream in the field of computer vision (CV) [10, 14, 18, 30, 32, 37]. Meanwhile, in the field of Natural Language Processing (NLP), the so-called Transformer [35], based on the self-attention mechanism, has achieved tremendous success [5, 8, 20]. So, in the CV field, there have been attempts to combine the self-attention mechanism with CNNs [4, 13, 28, 36, 38, 44]. These studies succeeded in proving that the self-attention mechanism also works for the image domain. Recently, it was reported that the Vision Transformer (ViT) [9], which applied a standard Transformer composed entirely of self-attention to image data, showed better performance than ResNet [10] and EfficientNet [32] in the image classification task. This made the Transformer receive a lot of attention in the CV field.

ViT rarely uses convolutional filters, i.e., the core of CNNs; convolutional filters are usually used only for tokenization. Thus, ViT structurally has less locality inductive bias than CNNs and requires a very large amount of training data to obtain an acceptable visual representation [26]. For example, just to learn a small-size dataset, ViT had to be pre-trained on a large-size dataset such as JFT-300M [29]. In order to alleviate the burden of pre-training, several ViTs that can learn a mid-size dataset such as ImageNet from scratch have been proposed. Such data-efficient ViTs tried to increase the locality inductive bias in terms of network architecture. For example, some adopted a hierarchical structure like CNNs to leverage various receptive fields [12, 24, 39], and others tried to modify the self-attention mechanism itself [22, 24, 34, 39, 40]. However, learning from scratch on mid-size datasets still requires significant costs. Moreover, learning small-size datasets from scratch is very challenging considering the trade-off between dataset capacity and performance. Therefore, we need to study ViT that can learn small-size datasets by sufficiently increasing the locality inductive bias.

Through observations, we found two problems that decrease locality inductive bias and limit the performance of ViT. The first problem is poor tokenization. ViT divides a given image into non-overlapping patches of equal size and linearly projects each patch to a visual token. Here, the same linear projection is applied to each patch. So, the tokenization of ViT has the permutation-invariant property, which enables a good embedding of relations between patches [3]. On the other hand, non-overlapping patches allow visual tokens to have only a relatively small receptive field. Usually, tokenization based on non-overlapping patches has a smaller receptive field than tokenization based on overlapping patches with the same down-sampling ratio. Small receptive fields cause ViT to tokenize with too few pixels. As a result, the spatial relationship with adjacent pixels is not sufficiently embedded in each visual token. The second problem is the poor attention mechanism. The feature dimension of image data is far greater than that of natural language and audio signals, so the number of embedded tokens is inevitably large. Thus, the distribution of attention scores over tokens becomes smooth. In other words, we face the problem that ViTs cannot attend locally to important visual tokens. These two main problems cause highly redundant attention that cannot focus on a target class. This redundant attention makes it easy for ViT to concentrate on the background and fail to capture the shape of the target class well (see Fig. 5).

This paper presents two solutions to effectively improve the locality inductive bias of ViT for learning small-size datasets from scratch. First, we propose Shifted Patch Tokenization (SPT) to further utilize spatial relations between neighboring pixels in the tokenization process. The idea of SPT was derived from the Temporal Shift Module (TSM) [23]. TSM is an effective temporal modeling scheme that shifts some temporal channels of features. Inspired by this, we propose effective spatial modeling that tokenizes spatially shifted images together with the input image. SPT can give a wider receptive field to ViT than standard tokenization. This has the effect of increasing the locality inductive bias by embedding more spatial information in each visual token. Second, we propose Locality Self-Attention (LSA), which allows ViT to attend locally. LSA mitigates the smoothing phenomenon of the attention score distribution by excluding self-token relations and by applying a learnable temperature to the softmax function. LSA induces attention to work locally by forcing each token to focus more on the tokens with which it has a large relation. Note that the proposed SPT and LSA can be easily applied to various ViTs in the form of add-on modules without structural changes and can effectively improve performance (see Fig. 1 and Table 5).

Our experiments show that the proposed method improves the performance of various ViTs both qualitatively and quantitatively. First, Fig. 5 illustrates that when SPT and LSA are applied to ViTs, object shapes are better captured. From a quantitative aspect, SPT and LSA improve image classification performance. For example, in the experiment on Tiny-ImageNet, the classification accuracy is improved by an average of 2.96% and a maximum of 4.08% (see Table 2). Also, SPT and LSA improve the performance of ViTs by up to 1.06% on a mid-size dataset such as ImageNet (see Table 3). The main contributions of this paper are as follows:

• To sufficiently embed spatial information between neighboring pixels, we propose a new tokenization based on spatial feature shifting. The proposed tokenization can give a wider receptive field to visual tokens. This dramatically improves the performance of ViTs.

• We propose a locality attention mechanism to solve or attenuate the smoothing problem of the attention score distribution. This mechanism significantly improves the performance of ViTs with only a small parameter increase and the addition of simple operations.

2. RELATED WORK

Recently, several data-efficient ViTs have been proposed to alleviate the dependence of ViT on large-size datasets. These ViTs can learn mid-size datasets from scratch. For example, DeiT [33] improved the efficiency of ViTs by employing data augmentations and regularizations and realized knowledge distillation by introducing the distillation-token concept. T2T [41] used a tokenization method that flattens overlapping patches and applies a transformer, which makes it possible to learn local structure information around a token. PiT [12] produced various receptive fields through spatial dimension reduction based on the pooling structure of a convolutional layer. CvT [39] replaced both the linear projection and the multi-layer perceptron with convolutional layers; like PiT, CvT generated various receptive fields only with a convolutional layer. Swin Transformer [24] presented an efficient hierarchical transformer that gradually reduces the number of tokens through patch merging while using attention calculated in non-overlapping local windows. CaiT [34] employed LayerScale, which converges well even when training ViTs with a large depth. In addition, the transformer layer of CaiT is divided into a patch-attention layer and a class-attention layer, which is effective for class embedding.

However, a ViT for small-size datasets has not been reported yet. Therefore, this paper proposes tokenization that uses more spatial information and also a high-performance attention mechanism, which allows ViTs to effectively learn small-size datasets from scratch.
Figure 2. Architectures of the proposed SPT and LSA: (a) Shifted Patch Tokenization, (b) Locality Self-Attention.

3. PROPOSED METHOD

This section specifically describes two key ideas for increasing the locality inductive bias of ViTs: SPT and LSA. First, Fig. 2(a) depicts the concept of SPT. SPT spatially shifts an input image in several directions and concatenates the shifted images with the input image. Fig. 2(a) is an example of shifting in four diagonal directions. Next, patch partitioning is applied as in standard ViTs. Then, for embedding into visual tokens, three processes are performed sequentially: patch flattening, layer normalization [2], and linear projection. As a result, SPT can embed more spatial information into visual tokens and increase the locality inductive bias of ViTs.

Fig. 2(b) explains the second idea, LSA. In general, a softmax function can control the smoothness of the output distribution through temperature scaling [11]. LSA primarily sharpens the distribution of attention scores by learning the temperature parameter of the softmax function. Additionally, the self-token relation is removed by applying so-called diagonal masking, which forcibly suppresses the diagonal components of the similarity matrix computed from Query and Key. This masking relatively increases the attention scores between different tokens, making the distribution of attention scores sharper. As a result, LSA increases the locality inductive bias by making ViT's attention locally focused.

3.1. Preliminary

Before a detailed description of the proposed SPT and LSA, this section briefly reviews the tokenization and the formulation of the self-attention mechanism of standard ViT [9].

Let x ∈ R^{H×W×C} be an input image, where H, W, and C indicate the height, width, and channels of the image, respectively. First, ViT divides the input image into non-overlapping patches and flattens the patches to obtain a sequence of vectors. This process is formulated as Eq. 1:

P(x) = [x_p^1; x_p^2; ...; x_p^N]    (1)

where x_p^i ∈ R^{P^2·C} represents the i-th flattened vector, and P and N = HW/P^2 stand for the patch size and the number of patches, respectively.

Next, we obtain patch embeddings by linearly projecting each vector into the space of the hidden dimension of the transformer encoder. Each patch embedding corresponds to a visual token input to the transformer encoder, so this series of processes is called tokenization, i.e., T. This is defined by:

T(x) = P(x) E_t    (2)

where E_t ∈ R^{(P^2·C)×d} is the learnable linear projection for tokens, and d is the hidden dimension of the transformer encoder.

Note that the receptive fields of visual tokens in ViT are determined by tokenization. In the transformer encoder running after the tokenization step, the number of visual tokens does not change, so the receptive field cannot be adjusted there. Moreover, the tokenization of standard ViT (Eq. 2) is the same as the operation of a non-overlapping convolutional layer whose kernel size equals its stride. So, the receptive field size of visual tokens can be calculated by the following equation given in [1]:

r_token = r_trans · j + (k − j)    (3)

where r_token and r_trans stand for the receptive field sizes of the tokenization and the transformer encoder, respectively, and j and k are the stride and kernel size of the convolutional layer, respectively. As mentioned earlier, the receptive field is not adjusted in the transformer encoder, so r_trans = 1. Thus, r_token is the same as the kernel size, which here is the patch size of ViT.
At this time, let us investigate whether r_token is of sufficient size. For instance, we compare r_token with the receptive field size of the last feature of ResNet50 when training on the ImageNet dataset, which consists of 224 × 224 images. The patch size of standard ViT is 16, so r_token of the visual tokens is also 16. On the other hand, the receptive field size of the ResNet50 feature amounts to 483 [1]. As a result, the visual tokens of ViT have a receptive field size that is about 30 times smaller than that of the ResNet50 feature. We interpret this small receptive field of tokenization as a major factor in the lack of locality inductive bias. Therefore, Sec. 3.2 proposes SPT to leverage rich spatial information by increasing the receptive field of tokenization.

Meanwhile, the self-attention mechanism of general ViTs operates as follows. First, a learnable linear projection is applied to each token to obtain Query, Key, and Value. Next, the similarity matrix R ∈ R^{(N+1)×(N+1)}, indicating the semantic relation between tokens, is calculated through the dot product of Query and Key. The diagonal components of R represent self-token relations, and the off-diagonal components represent inter-token relations:

R(x) = x E_q (x E_k)^T    (4)

Here, E_q ∈ R^{d×d_q} and E_k ∈ R^{d×d_k} indicate the learnable linear projections for Query and Key, respectively, and d_q and d_k are the dimensions of Query and Key, respectively. Next, R is divided by the square root of the Key dimension, and then the softmax function is applied to obtain the attention score matrix. Finally, the self-attention is calculated as the dot product of the attention score matrix and Value, as in Eq. 5:

SA(x) = softmax(R / √d_k) x E_v    (5)

where E_v ∈ R^{d×d_v} is the learnable linear projection for Value, and d_v is the Value dimension.
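As a reference point for the modifications in Sec. 3.3, here is a minimal single-head sketch of the standard self-attention of Eqs. 4–5 (an illustrative reimplementation, not the authors' code; the hidden dimension and token count are example values).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardSelfAttention(nn.Module):
    """Single-head self-attention as in Eqs. 4-5: R = Q K^T with the fixed temperature sqrt(d_k)."""
    def __init__(self, dim=192):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)   # E_q
        self.to_k = nn.Linear(dim, dim, bias=False)   # E_k
        self.to_v = nn.Linear(dim, dim, bias=False)   # E_v

    def forward(self, x):                             # x: (B, N+1, d), tokens incl. class token
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        r = q @ k.transpose(-2, -1)                   # similarity matrix R (Eq. 4)
        attn = F.softmax(r / k.shape[-1] ** 0.5, dim=-1)   # fixed temperature sqrt(d_k)
        return attn @ v                               # SA(x) (Eq. 5)

tokens = torch.randn(2, 65, 192)
print(StandardSelfAttention()(tokens).shape)          # torch.Size([2, 65, 192])
```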
Eq. 5 was designed so that the attention of tokens with large relations becomes large. However, due to the following two causes, the attentions of standard ViT tend to be similar to each other regardless of relations. The first cause is as follows: since Query (x E_q) and Key (x E_k) are linearly projected from the same input tokens, token vectors belonging to Query and Key tend to have similar sizes. Eq. 4 shows that R is the dot product of Query and Key. So, self-token relations, which are dot products of similar vectors, are usually larger than inter-token relations. Therefore, the softmax function of Eq. 5 gives relatively high scores to self-token relations and small scores to inter-token relations. The second cause is as follows: the reason why R is divided by √d_k in Eq. 5 is to prevent the softmax function from having a small gradient. However, √d_k can rather act as a high temperature of the softmax function and cause smoothing of the attention score distribution [11]. Our experiments show that attention scores smoothed due to a high temperature degrade the performance of ViT. For example, take a look at Table 1, which shows the top-1 accuracy of standard ViT on the small-size datasets CIFAR100 and Tiny-ImageNet. Here, we observe the best performance when the temperature of the softmax is less than √d_k. Sec. 3.3 proposes the LSA for improving the performance of ViTs by solving the smoothing problem of the attention score distribution.
Table 1. Top-1 accuracy (%) according to temperatures.

TEMPERATURE    CIFAR100    T-ImageNet
(1/4)√d_k      73.70       57.62
(1/2)√d_k      74.54       57.65
√d_k           73.81       57.07
2√d_k          72.77       56.98
4√d_k          71.55       56.43

3.2. Shifted Patch Tokenization

This section first describes the overall formulation of SPT (Sec. 3.2.1) and then applies the proposed SPT to the patch embedding layer and the pooling layer, i.e., the two main tokenizations of ViTs (Sec. 3.2.2 and Sec. 3.2.3).

3.2.1 Formulation

First, each input image is spatially shifted by half the patch size in four diagonal directions, that is, left-up, right-up, left-down, and right-down. In this paper, this shifting strategy is named S for convenience, and the SPT of all experiments follows S. Of course, various shifting strategies other than S are available, and they are dealt with in the supplementary. Next, the shifted features are cropped to the same size as the input image and then concatenated with the input. Then, the concatenated features are divided into non-overlapping patches and the patches are flattened as in Eq. 1. Next, visual tokens are obtained through layer normalization (LN) and linear projection. The whole process is formulated as Eq. 6:

S(x) = LN(P([x  s_1  s_2  ...  s_{N_S}])) E_S    (6)

Here, s_i ∈ R^{H×W×C} represents the i-th shifted image according to S, and E_S ∈ R^{(P^2·C·(N_S+1))×d_S} indicates a learnable linear projection. Also, d_S represents the hidden dimension of the transformer encoder, and N_S represents the number of images shifted by S.
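The following is a minimal PyTorch sketch of the SPT of Eq. 6 for the strategy S (four diagonal shifts by half the patch size, i.e., a shift ratio of 0.5). It is an illustrative reimplementation, not the authors' released code; in particular, realizing the shift-and-crop via zero padding and the specific direction convention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenization(nn.Module):
    """Sketch of SPT (Eq. 6): concatenate four diagonally shifted-and-cropped copies
    with the input, split into non-overlapping patches, then LayerNorm and linear projection."""
    def __init__(self, patch_size=8, in_channels=3, hidden_dim=192, num_shifts=4):
        super().__init__()
        self.p = patch_size
        self.s = patch_size // 2                    # shift by half the patch size (ratio 0.5)
        in_dim = patch_size * patch_size * in_channels * (num_shifts + 1)
        self.norm = nn.LayerNorm(in_dim)            # LN in Eq. 6
        self.proj = nn.Linear(in_dim, hidden_dim)   # E_S in Eq. 6

    def _diagonal_shifts(self, x):
        # Zero-pad by s on every side, then take four diagonally offset crops of the original
        # size; this realizes "shift then crop to the input size" for the four diagonal directions.
        s = self.s
        h, w = x.shape[-2:]
        pad = F.pad(x, (s, s, s, s))
        return [pad[..., :h, :w], pad[..., :h, 2*s:], pad[..., 2*s:, :w], pad[..., 2*s:, 2*s:]]

    def forward(self, x):                           # x: (B, C, H, W)
        x = torch.cat([x] + self._diagonal_shifts(x), dim=1)   # (B, C*(N_S+1), H, W)
        patches = F.unfold(x, kernel_size=self.p, stride=self.p).transpose(1, 2)  # P(.) of Eq. 1
        return self.proj(self.norm(patches))        # (B, N, d_S)

tokens = ShiftedPatchTokenization()(torch.randn(2, 3, 64, 64))
print(tokens.shape)                                 # torch.Size([2, 64, 192])
```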
3.2.2 Patch Embedding Layer

This section describes how to use SPT as a patch embedding layer. We concatenate a class token to the visual tokens and then add a positional embedding. Here, the class token is the token carrying representation information of the entire image, and the positional embedding gives positional information to the visual tokens. If a class token is not used, only the positional embedding is added to the output of SPT. Applying SPT to the patch embedding layer is formulated as follows:

S_pe(x) = [x_cls; S(x)] + E_pos    if x_cls exists
S_pe(x) = S(x) + E_pos             otherwise    (7)

where x_cls ∈ R^{d_S} is the class token and E_pos ∈ R^{(N+1)×d_S} is the learnable positional embedding. Also, N is the number of tokens embedded in Eq. 6.

3.2.3 Pooling Layer

Tokenization is the process of embedding 3D-tensor features into 2D-matrix features. For example, it embeds x ∈ R^{H×W×C} into y = T(x) ∈ R^{N×d}. Since N = HW/P^2, the spatial size of the 3D feature is reduced by a factor of P^2 through the tokenization process. So, if tokenization is used as a pooling layer, the number of visual tokens can be reduced. Therefore, we propose to use SPT as a pooling layer as follows. First, the class token and the visual tokens are separated, and the visual tokens in the form of a 2D matrix are reshaped into a 3D tensor with spatial structure, i.e., R : R^{N×d} → R^{(H/P)×(W/P)×d}. Then, if the SPT of Eq. 6 is applied, new visual tokens with a reduced number of tokens are embedded. Finally, the linearly projected class token is concatenated with the embedded visual tokens. If there is no class token, only the reshaping R is applied before the SPT. The whole process is formulated as Eq. 8:

S_pool(y) = [x_cls E_cls; S(R(y))]    if x_cls exists
S_pool(y) = S(R(y))                    otherwise    (8)

where E_cls ∈ R^{d×d'_S} is a learnable linear projection and d'_S is the hidden dimension of the next stage. As a result, SPT embeds rich spatial information into visual tokens by increasing the receptive field of tokenization by as much as the images are spatially shifted.
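As an illustration of Eq. 8, the sketch below shows the token bookkeeping when SPT is used as a pooling layer: the class token is split off and linearly projected (E_cls), and the visual tokens are reshaped back to a square grid (the operator R) before being re-tokenized. The tokenizer is abstracted as any module mapping a feature map to tokens; the _DemoTokenizer used in the example is a hypothetical stand-in, not part of the paper.

```python
import torch
import torch.nn as nn

class _DemoTokenizer(nn.Module):
    """Hypothetical stand-in for an SPT module in this demo: a strided convolution that
    halves the token grid and maps dim_in channels to dim_out, then flattens to tokens."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv = nn.Conv2d(dim_in, dim_out, kernel_size=2, stride=2)

    def forward(self, grid):                        # grid: (B, dim_in, h, w)
        return self.conv(grid).flatten(2).transpose(1, 2)   # (B, (h/2)*(w/2), dim_out)

class SPTPooling(nn.Module):
    """Sketch of Eq. 8: separate the class token, reshape visual tokens to a square grid,
    re-tokenize the grid, and linearly project the class token (E_cls)."""
    def __init__(self, tokenizer: nn.Module, dim_in: int, dim_out: int):
        super().__init__()
        self.tokenizer = tokenizer                  # e.g. an SPT module; any (B,C,h,w)->(B,N',d') map
        self.cls_proj = nn.Linear(dim_in, dim_out)  # E_cls

    def forward(self, tokens):                      # tokens: (B, 1+N, dim_in), class token first
        cls_tok, vis = tokens[:, :1], tokens[:, 1:]
        b, n, d = vis.shape
        h = w = int(n ** 0.5)                       # assumes a square token grid
        grid = vis.transpose(1, 2).reshape(b, d, h, w)     # R: (N, d) -> (h, w, d)
        pooled = self.tokenizer(grid)               # (B, N', dim_out) with N' < N
        return torch.cat([self.cls_proj(cls_tok), pooled], dim=1)

pool = SPTPooling(_DemoTokenizer(192, 384), dim_in=192, dim_out=384)
print(pool(torch.randn(2, 1 + 64, 192)).shape)      # torch.Size([2, 17, 384])
```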
3.3. Locality Self-Attention Mechanism

This section describes the LSA. The core of LSA is diagonal masking (Sec. 3.3.1) and learnable temperature scaling (Sec. 3.3.2).

3.3.1 Diagonal Masking

Diagonal masking plays the role of giving larger scores to inter-token relations by fundamentally excluding self-token relations from the softmax operation. Specifically, diagonal masking forces −∞ on the diagonal components of the R of Eq. 4. This makes ViT's attention focus more on other tokens rather than on its own token. The proposed diagonal masking is defined by:

R^M_{i,j}(x) = R_{i,j}(x)    (i ≠ j)
R^M_{i,j}(x) = −∞            (i = j)    (9)

where R^M_{i,j} indicates each component of the masked similarity matrix.

3.3.2 Learnable Temperature Scaling

The second technique of LSA is learnable temperature scaling, which allows ViT to determine the softmax temperature by itself during the learning process. Fig. 3 shows the average learned temperature according to depth when the softmax temperature is used as a learnable parameter in Eq. 5. Note that the average learned temperature is lower than the constant temperature of standard ViT. In general, a low softmax temperature sharpens the score distribution. Therefore, learnable temperature scaling sharpens the distribution of attention scores.

Figure 3. The learned temperature according to depth. Here, the red dashed line indicates the temperature of standard ViT.

Based on Eq. 5, the LSA with both diagonal masking and learnable temperature scaling applied is defined by:

L(x) = softmax(R^M(x) / τ) x E_v    (10)

where τ is the learnable temperature.

In other words, LSA solves the smoothing problem of the attention score distribution. Fig. 4 shows the depth-wise averages of the total Kullback–Leibler divergence (D^total_KL) over all heads. Here, T and M mean that only learnable temperature scaling or only diagonal masking is applied to ViT, respectively, and L indicates that the entire LSA is applied. The lower the average of D^total_KL, the flatter the attention score distribution. We find that when LSA is fully applied, the average of D^total_KL is larger than that of standard ViT by about 0.5, so LSA attenuates the smoothing phenomenon of the attention score distribution.

Figure 4. Kullback–Leibler divergence (KLD) of attention score distributions. The average KLDs were measured on Tiny-ImageNet.
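A minimal single-head sketch combining the diagonal masking of Eq. 9 with the learnable temperature τ of Eq. 10 is given below. This is an illustrative reimplementation; in particular, initializing τ to √d_k (the fixed value of Eq. 5) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalitySelfAttention(nn.Module):
    """Single-head LSA: diagonal masking (Eq. 9) plus learnable temperature tau (Eq. 10)."""
    def __init__(self, dim=192):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Learnable temperature; initialized to sqrt(d_k) here, which is an assumption.
        self.tau = nn.Parameter(torch.tensor(dim ** 0.5))

    def forward(self, x):                             # x: (B, N, d)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        r = q @ k.transpose(-2, -1)                   # similarity matrix R (Eq. 4)
        n = r.shape[-1]
        # Diagonal masking: self-token relations are set to -inf before the softmax (Eq. 9).
        r = r.masked_fill(torch.eye(n, dtype=torch.bool, device=r.device), float('-inf'))
        attn = F.softmax(r / self.tau, dim=-1)        # learnable temperature instead of sqrt(d_k)
        return attn @ v                               # L(x) (Eq. 10)

print(LocalitySelfAttention()(torch.randn(2, 65, 192)).shape)   # torch.Size([2, 65, 192])
```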
Table 2. Top-1 accuracy comparison of different models on small-size datasets.

MODEL    THROUGHPUT (images/sec)    FLOPs (M)    PARAMS (M)    CIFAR10    CIFAR100    SVHN    T-ImageNet
ResNet 56 4295 506.2 0.9 95.70 76.36 97.73 58.77
ResNet 110 2143 1020.0 1.7 96.37 79.86 97.85 62.96
EfficientNet B0 4078 123.9 3.7 94.66 76.04 97.22 66.79
ViT 8593 189.8 2.8 93.58 73.81 97.82 57.07
SL-ViT 7697 199.2 2.9 94.53 76.92 97.79 61.07
T2T 3388 643.0 6.7 95.30 77.00 97.90 60.57
SL-T2T 2943 671.4 7.1 95.57 77.36 97.91 61.83
CaiT 3138 613.8 9.1 94.91 76.89 98.13 64.37
SL-CaiT 2967 623.3 9.2 95.81 80.32 98.28 67.18
PiT 7583 279.2 7.1 94.24 74.99 97.83 60.25
SL-PiT w/o Spool 6632 280.4 7.1 94.96 77.08 97.94 60.31
SL-PiT w/ Spool 5981 322.9 8.7 95.88 79.00 97.93 62.91
Swin 6804 242.3 7.1 94.46 76.87 97.72 60.87
SL-Swin w/o Spool 6384 247.0 7.1 95.30 78.13 97.88 62.70
SL-Swin w/ Spool 5711 284.9 10.2 95.93 79.99 97.92 64.95

4. EXPERIMENT

This section verifies that the proposed method improves the performance of various ViTs through several experiments. Sec. 4.1 describes the settings of the following experiments. Sec. 4.2 quantitatively shows that the proposed method effectively improves various ViTs and reduces the gap with CNNs. Finally, Sec. 4.3 demonstrates that the ViTs are qualitatively enhanced by visualizing the attention scores of the final class token.

4.1. SETTING

4.1.1 Environment and Dataset

The proposed method was implemented in PyTorch [27]. In the small-size dataset experiment (Table 2), the details of the throughput measurement are as follows: the inputs were Tiny-ImageNet images, the batch size was 128, and the GPU was an RTX 2080 Ti.

For the small-size dataset experiments, CIFAR-10, CIFAR-100 [17], Tiny-ImageNet [21], and SVHN [25] were employed, and ImageNet [19] was employed for the mid-size dataset experiment.

4.1.2 Model Configurations

In the small-size dataset experiment, in the case of ViT, the depth was set to 9, the hidden dimension to 192, and the number of heads to 12. This configuration was determined experimentally. In the ImageNet experiment, we used the ViT-Tiny suggested by DeiT [33]. In the case of PiT, T2T, Swin, and CaiT, the configurations of PiT-XS, T2T-14, Swin-T, and CaiT-XXS24 presented in the corresponding papers were adopted as they were. The performance of ViT improves as the number of tokens increases, but the computational cost increases quadratically. We experimentally observed that it was effective when both the number of visual tokens in ViT without pooling and the number of tokens in the
intermediate stage of ViT with pooling are 64, considering this trade-off. Accordingly, we modified the baseline models. In the small-size dataset experiments, the patch size of the patch embedding layer was set to 8, and the patch size of ViTs using pooling layers, such as Swin and PiT, was set to 16. In the ImageNet dataset experiment, the patch size was set to be the same as that used in each paper. Also, the hidden dimension of the MLP was set to twice that of the transformer in the small-size dataset experiment, and the configuration used in each paper was applied in the ImageNet experiment.

Table 3. Top-1 accuracy (%) of the proposed method on the ImageNet dataset.

MODEL      TOP-1 ACCURACY (%)
ViT        69.95
SL-ViT     71.55 (+1.60)
PiT        75.58
SL-PiT     77.02 (+1.44)
Swin       79.95
SL-Swin    81.01 (+1.06)

Table 4. Effect of each component of LSA on performance (top-1 accuracy, %).

MODEL    CIFAR100    T-ImageNet
ViT      73.81       57.07
T-ViT    74.35       57.95
M-ViT    74.34       58.29
L-ViT    74.87       58.50

Table 5. Effect of the proposed SPT (S) and LSA (L) on performance (top-1 accuracy, %).

MODEL     CIFAR100    T-ImageNet
ViT       73.81       57.07
L-ViT     74.87       58.50
S-ViT     76.29       60.67
SL-ViT    76.92       61.07

4.1.3 Training Regime

According to DeiT, various techniques are required to effectively train ViTs. Thus, we applied data augmentations such as CutMix [42], Mixup [43], AutoAugment [6], and Repeated Augment [7] to all models. In addition, regularization techniques such as label smoothing [31], stochastic depth [15], and random erasing [45] were employed. Meanwhile, AdamW [16] was used as the optimizer. The weight decay was set to 0.05, the batch size to 128 (256 for ImageNet), and the warm-up to 10 epochs (5 for ImageNet). All models were trained for 100 epochs, and cosine learning rate decay was used. In the small-size dataset experiments, the initial learning rate of ViT and the CNNs was set to 0.003, and that of the remaining models was set to 0.001. On the other hand, in the ImageNet experiment, the initial learning rate was set to 0.00025 for all models.
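The optimizer and schedule described above can be wired up with standard PyTorch components as in the sketch below (not the authors' training script). It omits the augmentations and regularizers, which typically come from a library such as timm, and the model, train_loader, and train_one_epoch names are hypothetical placeholders.

```python
import math
import torch

def build_optimizer_and_schedule(model, epochs=100, warmup_epochs=10, base_lr=1e-3):
    """AdamW with weight decay 0.05, linear warm-up, then cosine decay (per-epoch stepping)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                          # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay toward 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Example usage with the small-dataset learning rate of 0.001:
# optimizer, scheduler = build_optimizer_and_schedule(model, base_lr=0.001)
# for epoch in range(100):
#     train_one_epoch(model, train_loader, optimizer)      # hypothetical helper
#     scheduler.step()
```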
4.2. QUANTITATIVE RESULT

4.2.1 Image Classification

This section presents the experimental results for the small-size datasets and the ImageNet dataset. In the small-size dataset experiment, throughput, FLOPs, and the number of parameters were measured on Tiny-ImageNet.

First, Table 2 shows the performance improvement when the proposed method is applied to ViTs. Here, SL indicates that both SPT and LSA were applied, and S_pool means that SPT was applied to the pooling layer. In most cases, the proposed method effectively improved the performance of ViTs, especially on CIFAR100 and Tiny-ImageNet. For example, on CIFAR100, the performance of CaiT and PiT improved by +3.43% and +4.01%, respectively, and on Tiny-ImageNet, the performance of ViT and Swin improved by up to +4.00% and +4.08%, respectively. Also note that the performance was greatly improved with only an acceptable overhead in inference latency; in other words, the cost-effectiveness of the proposed method is remarkable. For example, for ViT, T2T, and CaiT, the proposed method causes a latency overhead of only 1.12%, 1.15%, and 1.06%, respectively. In the case of PiT and Swin, additional performance improvement can be obtained by replacing the pooling layer with S_pool. Therefore, we find that the spatial modeling provided by SPT is effective not only for the patch embedding layer but also for the pooling layer. Table 2 also shows that the proposed method effectively reduces the gap between ViT and CNNs on small-size datasets. For example, SL-CaiT achieves better performance than ResNet and EfficientNet on all datasets except CIFAR10. SL-Swin also offers better throughput while providing performance comparable to CNNs.

Table 3 shows the performance when training the mid-size ImageNet dataset from scratch. In ViT, SPT was applied only to the patch embedding layer, and in PiT and Swin, SPT was applied to both the patch embedding and pooling layers. We could observe that the proposed method is sufficiently
effective for ImageNet. For example, the performance was improved by the proposed method by as much as +1.60% for ViT, +1.44% for PiT, and +1.06% for Swin. As a result, we find that the proposed method noticeably improves ViTs even on mid-size datasets.

4.2.2 Ablation Study

This section describes the ablation study on the proposed method. ViT was used for this experiment.

Elements of LSA. Let us look at the effect of learnable temperature scaling and diagonal masking, the two key elements of LSA, on overall performance. Table 4 shows that learnable temperature scaling and diagonal masking effectively resolve the smoothing phenomenon of the attention score distribution (see Fig. 4). For example, on Tiny-ImageNet, learnable temperature scaling and diagonal masking improved performance by +0.88% and +1.22%, respectively. Considering that LSA with both techniques applied shows a performance improvement of +1.43%, we can claim that the contribution of each is sufficiently large and that the two techniques produce a synergy.

SPT and LSA. Table 5 shows that SPT and LSA can each dramatically improve performance by independently increasing the locality inductive bias of ViT. In particular, on Tiny-ImageNet, SPT and LSA improved performance by +3.60% and +1.43%, respectively. When both techniques were applied, the performance improvement was +4.00%. This proves the competitiveness and synergy of the two key element technologies.

Figure 5. Visualization of attention scores of final class tokens.

4.3. QUALITATIVE RESULT

Fig. 5 visualizes the attention scores of the final class token when SPT and LSA are applied to various ViTs. When the proposed method is applied, we can observe that the object shape is better captured as the attention, which was dispersed over the background, becomes concentrated on the target class. In particular, this phenomenon is evident in the CaiT of the first row, the T2T of the second row, the ViT of the third row, and the PiT of the last row. Therefore, we find that the proposed method effectively increases the locality inductive bias and improves the attention of ViTs.

5. CONCLUSION

To train ViT on small-size datasets, this paper presents two novel techniques to increase the locality inductive bias of ViT. First, SPT embeds rich spatial information into visual tokens through a specific transformation. Second, LSA induces ViT to attend locally through a softmax with learnable parameters. SPT and LSA can achieve significant performance improvement independently, and they are applicable to any ViT. Therefore, this study proves that ViT can learn small-size datasets from scratch and provides an opportunity for ViT to develop further.
References

[1] A. Araújo, W. Norris, and J. Sim. Computing receptive fields of convolutional neural networks. 2019.
[2] J. Ba, J. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
[3] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261, 2018.
[4] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le. Attention augmented convolutional networks. In ICCV, pages 3285–3294, 2019.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, et al. Language models are few-shot learners. arXiv:2005.14165, 2020.
[6] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation strategies from data. In CVPR, pages 113–123, 2019.
[7] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In CVPRW, pages 3008–3017, 2020.
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[11] Y.-L. He, X. Zhang, W. Ao, and J. Z. Huang. Determining the optimal temperature parameter for softmax function in reinforcement learning. Applied Soft Computing, 70:80–85, 2018.
[12] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh. Rethinking spatial dimensions of vision transformers. In ICCV, pages 11936–11945, 2021.
[13] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
[14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
[15] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2015.
[17] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60:84–90, 2012.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
[20] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942, 2020.
[21] Y. Le and X. Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015.
[22] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool. LocalViT: Bringing locality to vision transformers. arXiv:2104.05707, 2021.
[23] J. Lin, C. Gan, and S. Han. TSM: Temporal shift module for efficient video understanding. In ICCV, pages 7082–7092, 2019.
[24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
[25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[26] B. Neyshabur. Towards learning convolutions from scratch. arXiv:2007.13657, 2020.
[27] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[28] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani. Bottleneck transformers for visual recognition. In CVPR, pages 16519–16529, 2021.
[29] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843–852, 2017.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
[32] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv:1905.11946, 2019.
[33] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357. PMLR, 2021.
[34] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou. Going deeper with image transformers. In ICCV, pages 32–42, 2021.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.
[36] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, pages 6450–6458, 2017.
[37] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[38] X. Wang, R. B. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
[39] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang. CvT: Introducing convolutions to vision transformers. In ICCV, pages 22–31, 2021.
[40] W. Xu, Y. Xu, T. Chang, and Z. Tu. Co-scale conv-attentional image transformers. In ICCV, pages 9981–9990, 2021.
[41] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. H. Tay, J. Feng, and S. Yan. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, pages 558–567, 2021.
[42] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[43] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[44] H. Zhang, I. Goodfellow, D. N. Metaxas, and A. Odena. Self-attention generative adversarial networks. In ICML, 2019.
[45] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. In AAAI, 2020.
Supplementary

This section investigates the various shifting strategies that SPT can employ. Specifically, we explore the shift direction and the shift intensity (shift ratio), which have the most impact on performance.

We examined the following three sets of shift directions. The first is the 4 cardinal directions, consisting of the up, down, left, and right directions (Fig. 1(a)). The second is the 4 diagonal directions, including up-left, up-right, down-left, and down-right (Fig. 1(b)). The last is the 8 cardinal directions, including all the preceding directions (Fig. 1(c)). Table 1 shows the top-1 accuracy on small-size datasets such as CIFAR-10, CIFAR-100, SVHN, and Tiny-ImageNet for each set of shift directions. This experiment adopted a model applying SPT to standard ViT. The 4 cardinal directions showed the best performance on CIFAR-10 and SVHN. On the other hand, the 4 diagonal directions and the 8 cardinal directions provided the best performance on Tiny-ImageNet and CIFAR-100, respectively. This shows that the shift direction is somewhat dependent on the characteristics of the datasets. For example, in CIFAR-10 and CIFAR-100 the target class tends to be in the center of the image, whereas in the other datasets it does not. The location of the target class has some degree of correlation with the shift direction, and this correlation can affect performance. However, since the performance differences were experimentally marginal, the shift direction in the experiments of this paper was fixed to the 4 diagonal directions.

Table 1. Top-1 accuracy (%) of various shift directions.

DIRECTIONS    CIFAR10    CIFAR100    SVHN    T-ImageNet
4 Cardinal    94.44      76.29       97.87   60.35
4 Diagonal    94.33      76.29       97.86   60.67
8 Cardinal    94.41      76.40       97.81   60.57

Figure 1. Various shift directions.

Next, we look at various shift ratios. The degree of image shifting in SPT is defined as SHIFT = P × r_shift, where P represents the patch size and r_shift represents the shift ratio. Table 2 shows the performance according to the shift ratio for CIFAR-100, Tiny-ImageNet, and ImageNet. In this experiment, a model with SPT applied to standard ViT was used, and the 4 diagonal directions were adopted. On CIFAR-100 and ImageNet, a ratio of 0.5 was the best, and on Tiny-ImageNet, a ratio of 0.25 was the best. This experimental result shows that the optimal shift ratio also depends on the dataset. Since the relatively most reasonable shift ratio is 0.5 according to our experiments, all the experiments in this paper fixed the shift ratio to 0.5. Note that more varied shifting strategies are available in addition to the methods considered here. The exploration of the optimal shifting strategy according to the dataset remains as future work.

Table 2. Top-1 accuracy (%) of various shift ratios.

SHIFT RATIO    CIFAR100    T-ImageNet    ImageNet
0.125          -           60.63         -
0.25           76.24       61.01         70.65
0.5            76.29       60.78         70.83
0.75           75.73       60.18         70.57
1.00           74.63       59.35         -
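For illustration, the shift offsets implied by the three direction sets and a given shift ratio (SHIFT = P × r_shift) can be generated as below. This is a small sketch; the sign convention for the offsets is an assumption.

```python
def shift_offsets(patch_size, ratio=0.5, mode="4_diagonal"):
    """Return (dx, dy) pixel offsets, with SHIFT = patch_size * ratio as defined above."""
    s = int(patch_size * ratio)
    if mode == "4_cardinal":                  # up, down, left, right
        return [(0, -s), (0, s), (-s, 0), (s, 0)]
    if mode == "4_diagonal":                  # up-left, up-right, down-left, down-right
        return [(-s, -s), (s, -s), (-s, s), (s, s)]
    if mode == "8_cardinal":                  # union of the two sets above
        return shift_offsets(patch_size, ratio, "4_cardinal") + \
               shift_offsets(patch_size, ratio, "4_diagonal")
    raise ValueError(mode)

print(shift_offsets(8, 0.5, "4_diagonal"))    # [(-4, -4), (4, -4), (-4, 4), (4, 4)]
```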
