Vision Transformer For Small-Size Datasets
Abstract
... small-size datasets from scratch.

3. PROPOSED METHOD
This section specifically describes two key ideas for increasing the locality inductive bias of ViTs: SPT and LSA. First, Fig. 2(a) depicts the concept of SPT. SPT spatially shifts an input image in several directions and concatenates the shifted images with the input image. Fig. 2(a) is an example of shifting in four diagonal directions. Next, patch partitioning is applied as in standard ViTs. Then, for embedding into visual tokens, three processes are performed sequentially: patch flattening, layer normalization [2], and linear projection. As a result, SPT can embed more spatial information into visual tokens and increase the locality inductive bias of ViTs.

Fig. 2(b) explains the second idea, LSA. In general, a softmax function can control the smoothness of the output distribution through temperature scaling [11]. LSA primarily sharpens the distribution of attention scores by learning the temperature parameters of the softmax function. Additionally, the self-token relation is removed by applying so-called diagonal masking, which forcibly suppresses the diagonal components of the similarity matrix computed from Query and Key. This masking relatively increases the attention scores between different tokens, making the distribution of attention scores sharper. As a result, LSA increases the locality inductive bias by making ViT's attention locally focused.

Before a detailed description of the proposed SPT and LSA, this section briefly reviews the tokenization and the formulation of the self-attention mechanism of standard ViT [9].

Let x ∈ R^{H×W×C} be an input image. Here, H, W, and C indicate the height, width, and channel of the image, respectively. First, ViT divides the input image into non-overlapping patches and flattens the patches to obtain a sequence of vectors. This process is formulated as Eq. 1:

P(x) = [x_p^1; x_p^2; ...; x_p^N]    (1)

where x_p^i ∈ R^{P^2·C} represents the i-th flattened vector, and P and N = HW/P^2 stand for the patch size and the number of patches, respectively.

Next, we obtain patch embeddings by linearly projecting each vector into the space of the hidden dimension of the transformer encoder. Each patch embedding corresponds to a visual token input to the transformer encoder, so this series of processes is called tokenization, i.e., T. This is defined by:

T(x) = P(x) E_t    (2)

where E_t ∈ R^{(P^2·C)×d} is the learnable linear projection for tokens, and d is the hidden dimension of the transformer encoder.

Note that the receptive fields of visual tokens in ViT are determined by tokenization. In the transformer encoder running after the tokenization step, the number of visual tokens does not change, so the receptive field cannot be adjusted there. Also, the tokenization (Eq. 2) of standard ViT is equivalent to the operation of a non-overlapping convolutional layer whose kernel size equals its stride. So, the receptive field size of visual tokens can be calculated by the following equation given in [1]:

r_token = j · r_trans + (k − j)    (3)

where r_token and r_trans stand for the receptive field sizes of tokenization and the transformer encoder, respectively, and j and k are the stride and kernel size of the convolutional layer, respectively. As mentioned earlier, the receptive field is not adjusted in the transformer encoder, so r_trans = 1. Thus, r_token is the same as the kernel size. Here, the kernel size is the patch size of ViT.
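To make the tokenization of Eqs. (1) and (2) concrete, here is a minimal PyTorch sketch; the class name ViTTokenizer and the default values (patch size 16, hidden dimension 192) are illustrative choices, not taken verbatim from the paper. Because the projection touches each P × P patch independently, the receptive field of every visual token equals the patch size, as discussed above.

```python
import torch
import torch.nn as nn

class ViTTokenizer(nn.Module):
    """Minimal sketch of standard ViT tokenization (Eqs. 1-2):
    patch partitioning + flattening (Eq. 1), then a learnable
    linear projection E_t into the hidden dimension d (Eq. 2)."""
    def __init__(self, patch_size=16, in_chans=3, hidden_dim=192):
        super().__init__()
        self.p = patch_size
        # E_t: (P^2 * C) x d learnable projection
        self.proj = nn.Linear(patch_size * patch_size * in_chans, hidden_dim)

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Eq. 1: split into non-overlapping P x P patches and flatten each
        patches = x.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * self.p * self.p)
        # Eq. 2: project to N = HW/P^2 visual tokens of dimension d
        return self.proj(patches)                  # (B, N, d)

tokens = ViTTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 192]); receptive field per token = patch size
```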
At this time, let us investigate whether r_token is of sufficient size. For instance, we compare r_token with the receptive field size of the last feature of ResNet50 when training on the ImageNet dataset, which consists of 224 × 224 images. The patch size of standard ViT is 16, so r_token of the visual tokens is also 16. On the other hand, the receptive field size of the ResNet50 feature amounts to 483 [1]. As a result, the visual tokens of ViTs have a receptive field about 30 times smaller than that of the ResNet50 feature. We interpret this small receptive field of tokenization as a major factor in the lack of locality inductive bias. Therefore, Sec. 3.2 proposes SPT to leverage rich spatial information by increasing the receptive field of tokenization, as sketched below.
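Following the SPT recipe summarized at the beginning of this section (diagonal shifts, concatenation with the input, patch flattening, LayerNorm, linear projection), a hedged PyTorch sketch is given below. The class name, the zero-padding strategy, and the shift of half a patch are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedPatchTokenization(nn.Module):
    """Sketch of SPT: four diagonally shifted copies of the image are
    concatenated with the input, then patch-flattened, layer-normalized,
    and linearly projected into visual tokens."""
    def __init__(self, patch_size=16, in_chans=3, hidden_dim=192, shift=8):
        super().__init__()
        self.p, self.s = patch_size, shift                  # shift of half a patch (assumed)
        in_dim = patch_size * patch_size * in_chans * 5     # input + 4 shifted copies
        self.norm = nn.LayerNorm(in_dim)
        self.proj = nn.Linear(in_dim, hidden_dim)           # projection to hidden dim d

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        s = self.s
        padded = F.pad(x, (s, s, s, s))                     # zero-pad so shifted crops stay H x W
        shifted = [padded[:, :, :H, :W],                    # four diagonal shifts
                   padded[:, :, :H, 2 * s:2 * s + W],
                   padded[:, :, 2 * s:2 * s + H, :W],
                   padded[:, :, 2 * s:2 * s + H, 2 * s:2 * s + W]]
        cat = torch.cat([x] + shifted, dim=1)               # (B, 5C, H, W)
        # patch partitioning + flattening, as in standard tokenization
        patches = cat.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 5 * C * self.p * self.p)
        return self.proj(self.norm(patches))                # (B, N, d)
```

Concatenating the shifted copies widens the effective receptive field of each visual token beyond a single P × P patch, which is the stated goal of SPT.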
Figure 3. The learned temperature according to depth. Here, the red dashed line indicates the temperature of standard ViT.

Meanwhile, the self-attention mechanism of general ViTs operates as follows. First, a learnable linear projection is applied to each token to obtain Query, Key, and Value. Next, the similarity matrix R ∈ R^{(N+1)×(N+1)}, which indicates the semantic relation between tokens, is calculated through the dot product of Query and Key. The diagonal components of R represent self-token relations, and the off-diagonal components represent inter-token relations:

R(x) = x E_q (x E_k)^T    (4)

where E_q and E_k are the learnable projection matrices for Query and Key, respectively.
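As a minimal numerical illustration of Eq. (4), the snippet below builds R from random stand-ins for the learned projections E_q and E_k and separates its diagonal (self-token) and off-diagonal (inter-token) components; all values here are arbitrary.

```python
import torch

# Illustration of Eq. (4): similarity matrix from Query and Key projections.
# Shapes follow the text: N + 1 tokens (patch tokens + class token), hidden dim d.
N, d = 196, 192
x = torch.randn(N + 1, d)                          # visual tokens + class token
E_q, E_k = torch.randn(d, d), torch.randn(d, d)    # random stand-ins for learned projections
R = (x @ E_q) @ (x @ E_k).T                        # R in R^{(N+1) x (N+1)}
self_relations = torch.diagonal(R)                 # diagonal: self-token relations
inter_relations = R - torch.diag(self_relations)   # off-diagonal: inter-token relations
```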
The LSA with both diagonal masking and learnable temperature scaling applied is defined by:

L(x) = softmax(R^M(x)/τ) x E_v    (10)

where τ is the learnable temperature, R^M denotes the similarity matrix with its diagonal components masked, and E_v is the learnable projection for Value.

In other words, LSA solves the smoothing problem of the attention score distribution. Fig. 4 shows the depth-wise averages of the total Kullback-Leibler divergence (D_KL^total) for all heads. Here, T and M mean that only learnable temperature scaling and only diagonal masking are applied to ViTs, respectively, and L indicates that the entire LSA is applied to ViTs. The lower the average of D_KL^total, the flatter the attention score distribution. We can find that when LSA is fully applied, the average of D_KL^total is larger by about 0.5 than that of standard ViT, so LSA attenuates the smoothing phenomenon of the attention score distribution.
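As a rough illustration of Eq. (10), the sketch below applies diagonal masking to the per-head similarity matrices and divides by a learnable temperature before the softmax. The module name, the stacked QKV projection, and the initialization of τ to sqrt(d_head) are assumptions, not details confirmed by this excerpt.

```python
import torch
import torch.nn as nn

class LocalitySelfAttention(nn.Module):
    """Sketch of LSA (Eq. 10): learnable temperature + diagonal masking."""
    def __init__(self, dim=192, num_heads=12):
        super().__init__()
        self.h = num_heads
        self.d = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)     # E_q, E_k, E_v stacked
        self.out = nn.Linear(dim, dim)
        # learnable temperature tau, initialized to the usual sqrt(d_head) scaling (assumed)
        self.tau = nn.Parameter(torch.full((num_heads, 1, 1), self.d ** 0.5))

    def forward(self, x):                                   # x: (B, N+1, dim), incl. class token
        B, N, _ = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        R = q @ k.transpose(-2, -1)                         # similarity matrix (Eq. 4), per head
        # diagonal masking: suppress self-token relations before the softmax
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        R = R.masked_fill(mask, float('-inf'))              # R^M in Eq. 10
        attn = torch.softmax(R / self.tau, dim=-1)          # sharper scores via learnable tau
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

Masking the diagonal before normalization redistributes probability mass to inter-token relations, which is why the resulting attention distribution becomes sharper.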
4. EXPERIMENT

This section verifies that the proposed method improves the performance of various ViTs through several experiments. Sec. 4.1 describes the settings of the following experiments. Sec. 4.2 quantitatively shows that the proposed method effectively improves various ViTs and reduces the gap with CNNs. Finally, Sec. 4.3 demonstrates that the ViTs are qualitatively enhanced by visualizing the attention scores of the final class token.

4.1. SETTING

4.1.1 Environment and Dataset

The proposed method was implemented in PyTorch [27]. In the small-size dataset experiment (Table 2), the details of the throughput measurement are as follows: the inputs were Tiny-ImageNet, the batch size was 128, and the GPU was an RTX 2080 Ti.

For the small-size dataset experiments, CIFAR-10, CIFAR-100 [17], Tiny-ImageNet [21], and SVHN [25] were employed, and ImageNet [19] was employed for the mid-size dataset experiment.

4.1.2 Model Configurations

In the small-size dataset experiment, in the case of ViT, the depth was set to 9, the hidden dimension to 192, and the number of heads to 12. This configuration was determined experimentally. In the ImageNet experiment, we used the ViT-Tiny configuration suggested by DeiT [33]. In the case of PiT, T2T, Swin, and CaiT, the configurations of PiT-XS, T2T-14, Swin-T, and CaiT-XXS24 presented in the corresponding papers were adopted as they were, respectively. The performance of ViT improves as the number of tokens increases, but the computational cost increases quadratically. We were able to experimentally observe that it was effective when both the number of visual tokens in ViT without pooling and the number of tokens in the inter-
Table 3. Top-1 accuracy (%) of the proposed method on the ImageNet dataset.

Table 4. Effect of each component of LSA on performance.

... effective for ImageNet. For example, the performance was improved by the proposed method by as much as +1.60% for ViT, +1.44% for PiT, and +1.06% for Swin. As a result, we find that the proposed method noticeably improves the ViTs even on mid-size datasets.

This proves the competitiveness and synergy of the two key element technologies.

4.3. QUALITATIVE RESULT