SAPENet: Self-Attention Based Prototype Enhancement Network for Few-Shot Learning
Xi-Lang Huang, Seon Han Choi
Pattern Recognition 135 (2023) 109170, https://doi.org/10.1016/j.patcog.2022.109170

Article history: Received 23 January 2022; Revised 2 September 2022; Accepted 9 November 2022; Available online 13 November 2022.

Keywords: Few-shot learning; Multi-head self-attention mechanism; Image classification; k-Nearest neighbor.

Abstract
Few-shot learning considers the problem of learning unseen categories given only a few labeled samples. As one of the most popular few-shot learning approaches, Prototypical Networks have received considerable attention owing to their simplicity and efficiency. However, a class prototype is typically obtained by averaging a few labeled samples belonging to the same class, which treats the samples as equally important and is thus prone to learning redundant features. Herein, we propose a self-attention based prototype enhancement network (SAPENet) to obtain a more representative prototype for each class. SAPENet utilizes multi-head self-attention mechanisms to selectively augment discriminative features in each sample feature map, and generates channel attention maps between intra-class sample features to attentively retain informative channel features for that class. The augmented feature maps and attention maps are finally fused to obtain representative class prototypes. Thereafter, a local descriptor-based metric module is employed to fully exploit the channel information of the prototypes by searching k similar local descriptors of the prototype for each local descriptor in the unlabeled samples for classification. We performed experiments on multiple benchmark datasets: miniImageNet, tieredImageNet, and CUB-200-2011. The experimental results on these datasets show that SAPENet achieves a considerable improvement compared to Prototypical Networks and also outperforms related state-of-the-art methods.
our SAPENet falls under the metric-based category but is significantly different from [21] even though both works use self-attention. Intuitively, [21] directly applies self-attention to the mean prototypes, whereas SAPENet augments the local features of each sample feature map using the self-attention block and develops an intra-class attention block to merge the augmented intra-class sample features into the final prototypes. Thus, SAPENet determines not only the discriminative features in each sample feature map, but also the priority of important features in the class.

3. Background

3.1. Problem definition

In few-shot learning, a dataset is typically split into a meta-training set Dtrain, a meta-testing set Dtest, and a meta-validation set Dval. These sets possess disjoint label spaces, and each set is further divided into multiple small learning tasks consisting of a support set S and a query set Q. The support set S is used as the prior knowledge for the network's learning, whereas the query set Q is the classification target. To imitate low-data scenarios, few-shot learning adopts the episodic learning paradigm to repeatedly use the learning tasks for training. Specifically, a learning task is formed by randomly sampling N classes with M labeled samples and some unlabeled samples per class from Dtrain to construct a support set S = {(x_{1,1}, y_{1,1}), ..., (x_{N,M}, y_{N,M})} and a query set Q = {(x̃_{1,1}, ỹ_{1,1}), ..., (x̃_{N,M}, ỹ_{N,M})}, in which x_{i,j} denotes the jth labeled sample from class i, y_{i,j} ∈ {1, ..., N} is the corresponding label, and x̃_{i,j} denotes the jth unlabeled sample from class i. This setting of the support set is usually abbreviated as the N-way M-shot task.

3.2. Prototypical network

ProtoNet [14] is built on the assumption that feature vectors cluster around a single prototype representation in the embedding space. To make use of this assumption, ProtoNet utilizes a neural network to learn a non-linear mapping of the inputs into the embedding space and takes the mean of the intra-class sample features as the prototype representation of that class. The prediction of each query sample is determined by finding the nearest prototype to its feature vector among all the class prototypes.

Given M labeled samples from class n ∈ {1, ..., N} and a backbone network f_θ with learnable parameters θ, a prototype c_n for class n can be computed as

c_n = \frac{1}{M} \sum_{i=1}^{M} f_\theta(x_{n,i}).   (1)

After obtaining the class prototypes, ProtoNet uses the Euclidean distance function d(·, ·) on the query feature vector f_θ(x̃) and the class prototypes in the embedding space to calculate their distances. Subsequently, the softmax function is applied to the distances to compute the probability distribution of the query samples belonging to each class:

p_\theta(y = n \mid \tilde{x}) = \frac{\exp(-d(f_\theta(\tilde{x}), c_n))}{\sum_{n'=1}^{N} \exp(-d(f_\theta(\tilde{x}), c_{n'}))}.   (2)

Using the probability distribution over classes for a query sample, ProtoNet minimizes the negative log-probability J(θ) = −log p_θ(y = n | x̃) by adjusting the network parameters via an optimizer (e.g., Adam or SGD), such that the predicted probability of the ground-truth label is maximized. By iterating through the learning tasks of the base classes, ProtoNet learns to generate class representations from a few training samples and is thus able to generalize to new classes using only a few labeled samples. ProtoNet has achieved competitive performance under few-shot settings. However, treating each feature vector in the same class as equally important in such low-data scenarios may result in the degradation of informative features that are supposed to be emphasized. To deal with this issue, our SAPENet deploys two self-attention based blocks to selectively enhance the informative local features of each sample and adaptively preserve the important channel features in the class, thereby obtaining a more representative prototype for that class.

4. Methodology

This section first describes the overall structure of the proposed SAPENet and then introduces the details of the two attention blocks used to obtain representative prototypes for the classes. The classification strategy is presented at the end of this section.

4.1. Overview

The objective of SAPENet is to address the issue of learning redundant features when only a few training samples are available. To this end, SAPENet maximizes the use of intra-class information by emphasizing both the informative features of each sample feature map and the important features in the class. The overall structure of SAPENet is shown in Fig. 1; it consists of four parts: a backbone network, a self-attention block, an intra-class attention block, and a metric module.

A support set with two classes and three labeled samples per class is given based on the few-shot setting. Both the support and query samples are first fed into the backbone network f_θ. The backbone network learns to extract significant features of the input samples by mapping semantically similar inputs to positions that are close to each other in the embedding space. Then, a self-attention block is applied to each support feature map to obtain an augmented feature map with the important local features highlighted. Simultaneously, an intra-class attention block takes the intra-class support feature maps as input to compute the channel attention maps. These channel attention maps indicate the importance scores of the intra-class feature maps at the same spatial location, and are used to re-weight the channel features and emphasize the informative features in this class. To obtain the final prototype feature map, SAPENet first performs channel-wise multiplication between the augmented feature maps and the corresponding channel attention maps. This step preserves the important features of each feature map by using the extracted intra-class feature information. Subsequently, SAPENet performs element-wise addition on the support feature maps from the same class to obtain the final prototype for each class. In the classification phase, the metric module performs the classification by finding the prototype nearest to the query feature map.

4.2. Self-attention block

The self-attention mechanism has been shown to benefit from capturing long-term dependencies and is widely applied in many computer vision tasks [24–26]. In this study, motivated by [19] and [18], we utilize multi-head self-attention as our self-attention block. The key idea of multi-head self-attention is to use multiple self-attention operators to jointly capture features from different representation subspaces and merge them to enhance the contextual information of local features.

For the sake of simplicity, we illustrate our self-attention block with the number of attention heads set to 1, as depicted in Fig. 2(A). Suppose that a feature map obtained from the backbone network is denoted as s_{n,m} ∈ R^{C×H×W}, where m ∈ [1, ..., M], and C, H, and W represent the number of channels, height, and width of the feature maps, respectively.
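As a concrete reference for the ProtoNet baseline of Section 3.2, the following is a minimal PyTorch sketch of Eqs. (1) and (2), assuming the backbone output has already been flattened to a D-dimensional vector per image; the tensor names and the use of the squared Euclidean distance are conventions of this sketch, not code from the paper.

    import torch
    import torch.nn.functional as F

    def protonet_log_probs(support, query):
        # support: [N, M, D] embedded support samples (N classes, M shots each)
        # query:   [Q, D]    embedded query samples
        prototypes = support.mean(dim=1)              # Eq. (1): class mean in the embedding space
        dists = torch.cdist(query, prototypes) ** 2   # d(f(x~), c_n); squared Euclidean is an assumption
        return F.log_softmax(-dists, dim=1)           # Eq. (2): softmax over negative distances

    # Training minimizes J(theta) = -log p(y = n | x~) for the true class n, e.g.:
    # loss = F.nll_loss(protonet_log_probs(support, query), query_labels)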
Fig. 1. The overall structure of the proposed SAPENet for a 2-way 3-shot task. Given a support set and a query sample, SAPENet encodes the images into feature maps through a backbone network f_θ. Afterward, a self-attention block and an intra-class attention block are applied to the support feature maps to obtain the self-attention feature maps and channel attention maps, respectively. The prototype of each class is the element-wise addition of the product between the feature maps and the corresponding channel attention maps along the channel dimension. Classification is performed by a k-nearest neighbor-based metric module on the query feature map and prototypes. (Best viewed in color.)

Fig. 2. The details of the self-attention block and the intra-class attention block. (A) The self-attention block augments the important local features of each support feature map. (B) The intra-class attention block generates channel attention maps by extracting the relative channel information between intra-class support feature maps. (Best viewed in color.)
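Following the pipeline of Fig. 1 and Section 4.1, the sketch below shows how the two attention blocks are combined into class prototypes; self_attn and intra_class_attn are placeholders for the blocks detailed in the following subsections, and the exact broadcasting shape of the channel attention maps is an assumption of this sketch.

    import torch

    def build_prototypes(support_feats, self_attn, intra_class_attn):
        # support_feats:    [N, M, C, H, W] backbone feature maps of the support set
        # self_attn:        maps one [C, H, W] feature map to a refined map of the same shape
        # intra_class_attn: maps the [M, C, H, W] maps of a class to M channel attention maps
        prototypes = []
        for class_feats in support_feats:                    # [M, C, H, W] for one class
            refined = torch.stack([self_attn(f) for f in class_feats])
            attn = intra_class_attn(class_feats)             # channel weights per intra-class sample
            prototypes.append((refined * attn).sum(dim=0))   # weight each map, then add over the M shots
        return torch.stack(prototypes)                       # [N, C, H, W], one prototype per class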
The self-attention block uses three 1 × 1 convolution kernels with learnable parameters (φ, δ, and a third set of weights for the value branch) to linearly transform the input feature map. The transformation results in three new feature maps: Query s^φ_{n,m}, Key s^δ_{n,m}, and Value s_{n,m}. The block then reshapes them to R^{C×U}, where U = H × W is the number of pixels, and performs a matrix multiplication of the transposed s^φ_{n,m} and s^δ_{n,m}. This matrix multiplication computes the similarity between each channel feature in s^φ_{n,m} and all channel features in s^δ_{n,m} to build a global relationship among features. Subsequently, the softmax function is applied to the resulting matrix along the row dimension to obtain the attention map A ∈ R^{U×U}:

a_{i,j} = \frac{\exp(s^{\phi,T}_{n,m,i} \cdot s^{\delta}_{n,m,j} / \sqrt{\gamma})}{\sum_{j=1}^{U} \exp(s^{\phi,T}_{n,m,i} \cdot s^{\delta}_{n,m,j} / \sqrt{\gamma})},   (3)

where s^{\phi,T}_{n,m,i} and s^{\delta}_{n,m,j} denote the ith and jth positions of the transposed s^φ_{n,m} and of s^δ_{n,m}, respectively; a_{i,j} measures the correlation between s^{\phi,T}_{n,m,i} and s^{\delta}_{n,m,j} in the attention map A, and 1/\sqrt{\gamma} is the scaling factor.

After obtaining the attention map, the next step entails performing a matrix multiplication between the Value s_{n,m} and the transpose of the attention map A, and reshaping the result to the same size as the input, which is R^{C×H×W}. Finally, the self-attention block outputs the refined feature map by performing an element-wise addition between the reshaped result and the raw input:

\tilde{s}_{n,m} = s_{n,m} \cdot A^{T} + s_{n,m},   (4)

where \tilde{s}_{n,m} denotes the refined feature map. Intuitively, each position of the resulting s_{n,m} · A^T is a weighted sum of all local features, which provides a global view of the relationship between each local feature and all the local features on the feature map. By learning the discriminative features of each sample via convolutional kernels, the self-attention block can selectively aggregate similar semantic features that are useful for classification in the feature map, regardless of the distances between positions. Consequently, the block can output augmented feature maps with prominently important features.

Each feature map in P is the weighted V considering the correlations between the channel features of s^{\phi,T}_{n,m} and all channel features of each s^δ_{n,m} in K. Therefore, P contains not only the attentive values of its own but also the attentive values between the features of intra-class samples. To obtain the channel attention map for each input feature map, element-wise addition is performed on P along the first dimension to aggregate the global information of each position. The results are then averaged along the channel dimension and reshaped into feature maps with a channel dimension of 1. This procedure compresses the channel information to obtain a two-dimensional channel score map for each support feature map. Finally, all channel score maps are concatenated and the softmax layer is applied along the channel dimension to obtain the channel attention map ŝ_{n,m} for each support feature. These channel attention maps are used to assign channel weights to the corresponding refined feature maps from the self-attention block. In this manner, the informative channel features in the class can be emphasized by assigning larger weights based on the intra-class channel information between samples.
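A minimal single-head PyTorch sketch of the self-attention block of Section 4.2 (Eqs. (3)-(4)) is given below; the layer names, the batched tensor layout, and the choice γ = C for a single head (consistent with Table 5, where γ equals the per-head channel dimension) are assumptions of this sketch rather than the authors' code.

    import torch
    import torch.nn as nn

    class SelfAttentionBlock(nn.Module):
        # Single-head version of the self-attention block (Eqs. (3)-(4)).
        def __init__(self, channels):
            super().__init__()
            self.query = nn.Conv2d(channels, channels, kernel_size=1)   # phi
            self.key = nn.Conv2d(channels, channels, kernel_size=1)     # delta
            self.value = nn.Conv2d(channels, channels, kernel_size=1)   # value branch
            self.scale = channels ** 0.5                                # sqrt(gamma), gamma = C for one head

        def forward(self, s):                        # s: [B, C, H, W]
            b, c, h, w = s.shape
            q = self.query(s).view(b, c, h * w)      # [B, C, U], U = H * W
            k = self.key(s).view(b, c, h * w)
            v = self.value(s).view(b, c, h * w)
            # Eq. (3): A[i, j] = softmax_j(q_i^T k_j / sqrt(gamma)), A in R^{U x U}
            attn = torch.softmax(torch.bmm(q.transpose(1, 2), k) / self.scale, dim=-1)
            # Eq. (4): reshape(V A^T) plus the residual input
            out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
            return out + s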
4.4. Metric module
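The body of this subsection is not reproduced in this excerpt. Based on the description given elsewhere in the paper (the abstract's "searching k similar local descriptors of the prototype for each local descriptor in the unlabeled samples" and the k-NN with k = 5 per query descriptor mentioned in Section 5.6.5), a plausible sketch of such a descriptor-level k-NN score, in the spirit of the image-to-class measure of DN4 [27], is the following; the cosine similarity and the sum aggregation are assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def knn_descriptor_scores(query_feat, prototypes, k=5):
        # query_feat: [C, H, W]    feature map of one query image
        # prototypes: [N, C, H, W] prototype feature maps of the N classes
        # returns:    [N]          similarity score of the query to each class
        c = query_feat.shape[0]
        q = F.normalize(query_feat.view(c, -1), dim=0).t()    # [Uq, C] unit-norm local descriptors
        scores = []
        for proto in prototypes:
            p = F.normalize(proto.view(c, -1), dim=0).t()     # [Up, C] prototype descriptors
            sim = q @ p.t()                                   # cosine similarities, [Uq, Up]
            topk = sim.topk(k, dim=1).values                  # k most similar prototype descriptors
            scores.append(topk.sum())                         # aggregate over all query descriptors
        return torch.stack(scores)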
All experiments were conducted using PyTorch 1.6 on an eight-core AMD Ryzen 7 2700 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. Following the previous studies [9,22,33], we considered two standard backbone networks: a 4-layer convolutional network (Conv4-64) and a 12-layer ResNet (ResNet-12). Conv4-64 consists of four consecutive 64-channel convolution blocks, each of which includes a convolutional layer with 64 filters of size 3 × 3, a BatchNorm layer, a LeakyReLU (0.2) layer, and a 2 × 2 max-pooling layer. To

Based on the above settings, we first conducted experiments on the miniImageNet dataset, and the results are listed in Table 1. All results are obtained using the same settings as those of SAPENet for a fair comparison, and the best results are shown in bold. As indicated in Table 1, SAPENet achieves the best results in both 5-way 1-shot and 5-shot tasks using the Conv4-64 and ResNet-12 backbones. Specifically, under the 1-shot and 5-shot settings, the accuracies of SAPENet are respectively 3.83% and 1.41% higher
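As a concrete rendering of the Conv4-64 description above, a minimal PyTorch sketch follows; the input channel count of 3 and the use of padding 1 in each 3 × 3 convolution are assumptions of this sketch.

    import torch.nn as nn

    def conv4_64():
        # Four identical 64-channel blocks: 3x3 conv -> BatchNorm -> LeakyReLU(0.2) -> 2x2 max-pooling.
        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2),
                nn.MaxPool2d(2),
            )
        return nn.Sequential(block(3, 64), block(64, 64), block(64, 64), block(64, 64))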
Table 1
The 5-way 1-shot and 5-way 5-shot classification accuracies (%) on the miniImageNet dataset. The best results are shown in bold.

Method              Backbone     1-shot          5-shot
MatchingNet [12]    Conv4-64     44.81 ± 0.19    65.02 ± 0.17
MAML [7]            Conv4-64     44.26 ± 0.19    61.81 ± 0.18
RelationNet [13]    Conv4-64     49.06 ± 0.19    65.18 ± 0.16
Baseline+ [31]      Conv4-64     47.94 ± 0.18    67.37 ± 0.15
SAML [32]           Conv4-64     52.88 ± 0.20    68.17 ± 0.16
TADAM [15]          Conv4-64     50.50 ± 0.20    70.09 ± 0.16
MetaOptNet [9]      Conv4-64     51.28 ± 0.20    69.71 ± 0.16
DSN [33]            Conv4-64     51.69 ± 0.20    69.95 ± 0.16
ProtoNet [14]       Conv4-64     51.88 ± 0.20    70.40 ± 0.16
IPN [20]            Conv4-64     51.98 ± 0.20    70.18 ± 0.16
CAN [34]            Conv4-64     48.61 ± 0.20    69.96 ± 0.16
DN4 [27]            Conv4-64     51.65 ± 0.20    71.46 ± 0.16
FEAT [21]           Conv4-64     51.05 ± 0.19    67.53 ± 0.17
DeepEMD [16]        Conv4-64     53.81 ± 0.20    70.56 ± 0.16
LMPNet [22]         Conv4-64     54.01 ± 0.20    71.30 ± 0.16
DCAP [35]           Conv4-64     54.62 ± 0.20    69.02 ± 0.16
DMF [36]            Conv4-64     55.69 ± 0.20    71.52 ± 0.16
SetFeat [37]        Conv4-64     54.72 ± 0.20    70.32 ± 0.16
SAPENet (ours)      Conv4-64     55.71 ± 0.20    71.81 ± 0.16
MatchingNet [12]    ResNet-12    62.90 ± 0.20    78.93 ± 0.15
MAML [7]            ResNet-12    55.24 ± 0.21    68.75 ± 0.17
RelationNet [13]    ResNet-12    61.44 ± 0.21    75.27 ± 0.16
Baseline+ [31]      ResNet-12    54.46 ± 0.21    77.02 ± 0.15
SAML [32]           ResNet-12    62.69 ± 0.19    78.96 ± 0.14
TADAM [15]          ResNet-12    62.06 ± 0.20    79.24 ± 0.14
MetaOptNet [9]      ResNet-12    62.67 ± 0.20    80.52 ± 0.14
DSN [33]            ResNet-12    60.55 ± 0.20    77.63 ± 0.15
ProtoNet [14]       ResNet-12    61.71 ± 0.21    79.08 ± 0.16
IPN [20]            ResNet-12    61.51 ± 0.21    79.47 ± 0.15
CAN [34]            ResNet-12    61.60 ± 0.20    80.00 ± 0.15
DN4 [27]            ResNet-12    64.93 ± 0.20    80.87 ± 0.14
FEAT [21]           ResNet-12    66.33 ± 0.20    81.78 ± 0.16
DeepEMD [16]        ResNet-12    65.69 ± 0.19    81.96 ± 0.16
LMPNet [22]         ResNet-12    62.52 ± 0.19    81.05 ± 0.14
DCAP [35]           ResNet-12    63.19 ± 0.20    80.64 ± 0.14
DMF [36]            ResNet-12    65.00 ± 0.21    81.43 ± 0.15
SetFeat [37]        ResNet-12    65.95 ± 0.20    81.18 ± 0.14
SAPENet (ours)      ResNet-12    66.41 ± 0.20    82.76 ± 0.14

Table 2
The 5-way 1-shot and 5-way 5-shot classification accuracies (%) on the tieredImageNet dataset. The best results are shown in bold.

Method              Backbone     1-shot          5-shot
MatchingNet [12]    Conv4-64     49.48 ± 0.22    70.00 ± 0.18
MAML [7]            Conv4-64     49.97 ± 0.22    64.46 ± 0.20
RelationNet [13]    Conv4-64     50.75 ± 0.22    67.29 ± 0.18
Baseline+ [31]      Conv4-64     50.03 ± 0.19    70.17 ± 0.18
SAML [32]           Conv4-64     51.94 ± 0.22    70.12 ± 0.19
TADAM [15]          Conv4-64     51.05 ± 0.22    70.92 ± 0.19
MetaOptNet [9]      Conv4-64     49.97 ± 0.21    69.76 ± 0.18
DSN [33]            Conv4-64     51.52 ± 0.21    71.31 ± 0.18
ProtoNet [14]       Conv4-64     51.44 ± 0.22    72.50 ± 0.18
IPN [20]            Conv4-64     52.22 ± 0.22    72.07 ± 0.18
CAN [34]            Conv4-64     49.33 ± 0.22    66.09 ± 0.19
DN4 [27]            Conv4-64     49.35 ± 0.21    72.24 ± 0.18
FEAT [21]           Conv4-64     51.93 ± 0.21    69.58 ± 0.19
DeepEMD [16]        Conv4-64     55.96 ± 0.22    73.43 ± 0.18
LMPNet [22]         Conv4-64     51.80 ± 0.22    72.32 ± 0.18
DCAP [35]           Conv4-64     54.12 ± 0.22    69.29 ± 0.19
DMF [36]            Conv4-64     55.65 ± 0.21    69.41 ± 0.19
SetFeat [37]        Conv4-64     57.26 ± 0.22    74.74 ± 0.18
SAPENet (ours)      Conv4-64     57.61 ± 0.22    75.42 ± 0.18
MatchingNet [12]    ResNet-12    64.18 ± 0.22    80.09 ± 0.17
MAML [7]            ResNet-12    52.01 ± 0.23    70.31 ± 0.23
RelationNet [13]    ResNet-12    62.15 ± 0.23    75.47 ± 0.19
Baseline+ [31]      ResNet-12    57.71 ± 0.24    76.93 ± 0.18
SAML [32]           ResNet-12    65.43 ± 0.21    79.21 ± 0.18
TADAM [15]          ResNet-12    64.34 ± 0.24    82.34 ± 0.16
MetaOptNet [9]      ResNet-12    65.99 ± 0.72    81.56 ± 0.16
DSN [33]            ResNet-12    65.61 ± 0.20    79.22 ± 0.17
ProtoNet [14]       ResNet-12    64.63 ± 0.23    81.17 ± 0.17
IPN [20]            ResNet-12    63.23 ± 0.23    80.08 ± 0.16
CAN [34]            ResNet-12    62.52 ± 0.23    81.05 ± 0.16
DN4 [27]            ResNet-12    66.28 ± 0.22    82.24 ± 0.16
FEAT [21]           ResNet-12    67.23 ± 0.22    82.83 ± 0.18
DeepEMD [16]        ResNet-12    68.12 ± 0.22    84.69 ± 0.16
LMPNet [22]         ResNet-12    66.62 ± 0.23    80.12 ± 0.16
DCAP [35]           ResNet-12    64.31 ± 0.22    82.17 ± 0.16
DMF [36]            ResNet-12    66.80 ± 0.23    82.68 ± 0.16
SetFeat [37]        ResNet-12    67.48 ± 0.23    83.25 ± 0.16
SAPENet (ours)      ResNet-12    68.63 ± 0.23    84.30 ± 0.16
Table 3
The 5-way 1-shot and 5-way 5-shot classification accuracies (%) on the CUB-200-2011 dataset. The best results are shown in bold.

Method              Backbone    1-shot          5-shot
MatchingNet [12]    Conv4-64    51.45 ± 0.22    75.46 ± 0.18
MAML [7]            Conv4-64    47.85 ± 0.22    64.77 ± 0.20
RelationNet [13]    Conv4-64    58.81 ± 0.24    75.23 ± 0.18
Baseline+ [31]      Conv4-64    57.79 ± 0.22    74.03 ± 0.18
SAML [32]           Conv4-64    62.75 ± 0.23    78.24 ± 0.16
TADAM [15]          Conv4-64    56.64 ± 0.23    73.66 ± 0.17
MetaOptNet [9]      Conv4-64    49.52 ± 0.22    71.68 ± 0.18
DSN [33]            Conv4-64    54.49 ± 0.23    74.10 ± 0.17
ProtoNet [14]       Conv4-64    54.52 ± 0.23    73.30 ± 0.17
IPN [20]            Conv4-64    58.45 ± 0.24    76.61 ± 0.17
CAN [34]            Conv4-64    59.31 ± 0.24    72.72 ± 0.19
DN4 [27]            Conv4-64    63.15 ± 0.22    82.54 ± 0.14
FEAT [21]           Conv4-64    62.91 ± 0.24    79.82 ± 0.16
DeepEMD [16]        Conv4-64    62.09 ± 0.23    83.58 ± 0.17
LMPNet [22]         Conv4-64    61.66 ± 0.23    82.20 ± 0.14
DCAP [35]           Conv4-64    58.69 ± 0.25    69.40 ± 0.19
DMF [36]            Conv4-64    66.79 ± 0.24    81.40 ± 0.17
SetFeat [37]        Conv4-64    67.78 ± 0.23    82.87 ± 0.15
SAPENet (ours)      Conv4-64    70.38 ± 0.23    84.47 ± 0.14

Table 4
Classification accuracy (%) for key component analysis on the miniImageNet dataset with the Conv4-64 backbone under 5-way 1-shot and 5-shot settings. Results are averaged over 10,000 episodes and 95% confidence intervals are below 2e-3.

Metric module    Self-attention    Intra-class attention    1-shot    5-shot
Euclidean                                                   50.41     69.06
Euclidean        ✓                                          52.40     70.46
Euclidean                          ✓                        50.41     69.54
Euclidean        ✓                 ✓                        52.40     70.53
KNN                                                         53.34     70.44
KNN              ✓                                          55.71     70.83
KNN                                ✓                        53.34     71.52
KNN              ✓                 ✓                        55.71     71.81

Table 5
Classification accuracy (%) of different numbers of attention heads and scaling factors γ on the miniImageNet dataset. Results were obtained by averaging over 10,000 episodes.

Number of heads    Scaling factor √γ    ResNet-12 1-shot    ResNet-12 5-shot
1                  √640                 65.84               82.12
4                  √160                 66.17               82.83
8                  √80                  66.41               82.76
16                 √40                  66.36               82.55
32                 √20                  66.10               82.53

used the same settings as those of SAPENet to ensure a fair comparison. As displayed in Table 3, SAPENet achieves the best results and attains a significant improvement compared to ProtoNet and FEAT. In addition, SAPENet leads to 8.29, 0.89, 2.60, and 1.60% accuracy improvements over DeepEMD and SetFeat in the 1-shot and 5-shot settings, respectively. These results validate the effectiveness of SAPENet on the fine-grained dataset, and further show that intra-class attention blocks can effectively emphasize the discriminative features of each class in the presence of similar classes, thus enabling more accurate matching between the query and its related prototype during classification.

5.6. Ablation study

The intra-class attention block is not available when there is only one sample per class; thus, we filled those grids with the same number to indicate that they are equivalent.

As indicated in Table 4, the self-attention and intra-class attention blocks can achieve considerable improvement on both 1-shot and 5-shot tasks compared to the case without any attention blocks. In addition, the performance of using two attention blocks is similar to that of using only the self-attention block under the Euclidean distance. It can be inferred that flattening the feature maps into vectors for classification cannot maximize the use of the channel features refined by the intra-class attention block. In contrast, SAPENet keeps the augmented feature maps as prototypes and uses the descriptor-based metric module to fully exploit the refined channel features. These experimental results show that the proposed attention blocks can effectively help SAPENet obtain informative prototypes with a few training samples, and they are fully compatible with the metric module that matches descriptors between the query and prototypes.

5.6.2. Effect of attention head and scaling factor

In [18], the number of attention heads and the scaling factor are two decision variables. The former determines the number of subspaces from which to extract information, while the latter controls the distribution of attention values for the feature maps. To evaluate their influence on SAPENet, we varied the number of attention heads and the scaling factor to perform 5-way 1-shot and 5-shot tasks on the miniImageNet dataset with the ResNet-12 backbone. As presented in Table 5, increasing the number of attention heads within an appropriate range can lead to higher performance, whereas an extremely small (or excessive) number of attention heads can result in performance loss. We infer that this phenomenon is due to the small (or large) number of attention heads yielding insufficient (or excessive) subspaces to extract useful information, resulting in under-representation (or over-representation) of informative features. Thus, we chose the number of attention heads for Conv4-64 and ResNet-12 to be 4 and 8, respectively, to maintain a suitable number of attention heads to learn feature extraction.
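The scaling factors in Table 5 appear to follow directly from the per-head channel dimension of the ResNet-12 output (C = 640), i.e., γ = C / heads; the short check below reproduces the pairing and is only a reading of the table, not code from the paper.

    # Per-head dimension gamma = C / heads and scaling factor sqrt(gamma) as listed in Table 5.
    C = 640
    for heads in (1, 4, 8, 16, 32):
        gamma = C // heads
        print(f"{heads:2d} heads -> gamma = {gamma:3d}, scale = sqrt({gamma})")  # e.g. 8 heads -> sqrt(80)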
Fig. 3. Class activation mapping (CAM) visualization on a 5-shot task for SAPENet and ProtoNet. (Best viewed in color)
Table 6
Classification accuracy (%) of different k values on the miniImageNet dataset under 5-way 1-shot and 5-shot settings. Results were obtained by averaging over 10,000 episodes.

Number of k    Conv4-64 1-shot    Conv4-64 5-shot    ResNet-12 1-shot    ResNet-12 5-shot
1              55.71              71.81              66.05               81.34
3              54.79              71.04              66.84               82.18
5              54.61              70.27              66.41               82.76
7              54.24              69.31              65.85               81.88

Table 7
Computational cost analysis on miniImageNet with the ResNet-12 backbone during the meta-training phase. Time is measured over 10,000 episodes.

Method            Episode         Parameters size (MB)    Time (ms)
ProtoNet [14]     5-way 1-shot    12.42                   218
LMPNet [22]       5-way 1-shot    12.46                   228
SetFeat [37]      5-way 1-shot    20.72                   654
DeepEMD [16]      5-way 1-shot    12.42                   473
DMF [36]          5-way 1-shot    13.41                   519
FEAT [21]         5-way 1-shot    14.06                   228
SAPENet (ours)    5-way 1-shot    13.66                   230
ProtoNet [14]     5-way 5-shot    12.42                   266
LMPNet [22]       5-way 5-shot    12.46                   280
SetFeat [37]      5-way 5-shot    20.72                   811
DeepEMD [16]      5-way 5-shot    12.50                   11,300
DMF [36]          5-way 5-shot    14.41                   573
FEAT [21]         5-way 5-shot    14.06                   279
SAPENet (ours)    5-way 5-shot    13.66                   283

thus, a relatively larger k is beneficial in this case to aggregate features for classification.

5.6.4. Visualization

To visually confirm that SAPENet pays more attention to informative features, we generated and compared the class activation mapping [40] of SAPENet and ProtoNet using the ResNet-12 backbone on Dtest of the miniImageNet dataset. As depicted in Fig. 3, ProtoNet tends to contain the features of non-target objects owing to the use of mean prototypes as the learning criteria for the network. In contrast, SAPENet exploits the intra-class information to attentively select the informative features for that class, which allows it to focus on target features and ignore redundant ones.

5.6.5. Computational cost analysis

SAPENet uses two attention blocks to produce a more representative prototype for the class, which inevitably increases the learnable parameters in addition to the backbone. To analyze the computational speed of SAPENet, we compared it with other methods. As displayed in Table 7, ProtoNet has the advantage of high computational efficiency among all methods owing to its simple prototype computation and parameter-free classifier. SAPENet has an additional 1.24 MB of learnable parameters owing to the three 1 × 1 convolutional layers in the attention blocks. However, SAPENet is only 12 ms and 17 ms slower than ProtoNet in the 1-shot and 5-shot tasks per episode, respectively, yet achieves a considerably higher classification accuracy. Compared with FEAT, SAPENet has fewer learnable parameters because FEAT has an additional fully connected layer attached after the attention block. Moreover, as SAPENet uses 1 × 1 convolutional layers in the attention block, its learnable parameters only increase when the output channel dimension increases. Considering that ResNet-12 is almost the deepest backbone in few-shot learning (the output channel dimension is 640), the size of the additional learnable parameters of SAPENet will basically not exceed 1.24 MB. Note that the computation speed of FEAT is slightly faster than that of SAPENet, although FEAT has more parameters to learn. This is because SAPENet needs to conduct KNN (k = 5) on each descriptor of the query to find its k closest neighbors in the prototypes. Compared with the current state-of-the-art methods, such as DeepEMD, DMF, and SetFeat, SAPENet is much faster in both 1-shot and 5-shot tasks. Specifically, although DeepEMD possesses the same size of learnable parameters as ProtoNet, it requires more training time to solve the optimal matching problem between the support and query features. Furthermore, when the shot number is greater than 1, it learns an SFC layer by fine-tuning on the novel classes, which requires multiple forward and backward passes, thus slowing down the computational speed. DMF learns a specific region for each local feature via complex deformable convolution, which results in twice the computational cost of SAPENet. SetFeat attaches the self-attention block to each residual block, which greatly increases the number of learnable
parameters, thereby incurring a much higher computational cost in the training phase. Through the above comparisons, we can find that SAPENet achieves a higher classification accuracy while its computational cost is not significantly different from those of ProtoNet and FEAT. These results validate the efficiency of SAPENet in few-shot settings.

5.7. Conclusion

In this paper, we proposed a self-attention based prototype enhancement network (SAPENet) to address the issue that mean prototypes usually contain redundant information. To obtain a more representative prototype for each class, SAPENet first utilizes a self-attention block to selectively emphasize the important local features of each support feature map. Then, unlike ProtoNet, which uses the mean operation to aggregate the intra-class features, SAPENet develops an intra-class attention block to attentively exploit the intra-class information, which aims to retain the informative channel features for that class and avoid learning redundant ones. To maximize the use of the augmented prototypes obtained by SAPENet, a descriptor-based classifier was deployed as our metric module to compute the local descriptor similarities between the query feature map and the prototypes. We compared SAPENet with state-of-the-art methods through numerous experiments on three widely used few-shot learning datasets. The results validate the effectiveness of SAPENet, which largely outperforms ProtoNet and achieves a competitive performance compared with the state-of-the-art methods. Furthermore, ablation studies and visualization results demonstrated the effectiveness of the attention blocks in producing representative prototypes given only a few training samples. In terms of limitations, because the intra-class attention block is not available when the shot number is 1, the performance of SAPENet then relies primarily on the self-attention block. To deal with this limitation, in future work, we plan to extend the self-attention block to explore the differences between inter-class features, which can be beneficial for generating discriminative features for classification even when the shot number is 1. In addition, it is also promising to learn word embeddings or attributes of images, which can serve as auxiliary training information to help address the issue of limited training samples.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) funded by the Korea Government (Ministry of Science and ICT) under Grant 2022R1F1A1066267.

References

[1] A. Sellami, S. Tabbone, Deep neural networks-based relevant latent representation learning for hyperspectral image classification, Pattern Recognit. 121 (2022) 108224, doi:10.1016/j.patcog.2021.108224.
[2] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
[3] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O.R. Zaiane, M. Jagersand, U2-Net: going deeper with nested U-structure for salient object detection, Pattern Recognit. 106 (2020) 107404, doi:10.1016/j.patcog.2020.107404.
[4] Q. Sun, Y. Liu, T.-S. Chua, B. Schiele, Meta-transfer learning for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 403–412.
[5] Y. Wang, Q. Yao, J.T. Kwok, L.M. Ni, Generalizing from a few examples: a survey on few-shot learning, ACM Comput. Surv. 53 (3) (2020) 1–34, doi:10.1145/3386252.
[6] Y. Wang, R. Girshick, M. Hebert, B. Hariharan, Low-shot learning from imaginary data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7278–7286.
[7] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 1126–1135.
[8] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[9] K. Lee, S. Maji, A. Ravichandran, S. Soatto, Meta-learning with differentiable convex optimization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10657–10665.
[10] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, Meta-learning with memory-augmented neural networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 1842–1850.
[11] T. Munkhdalai, H. Yu, Meta networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 2554–2563.
[12] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2016, pp. 3637–3645.
[13] F. Sung, Y. Yang, L. Zhang, T. Xiang, P.H. Torr, T.M. Hospedales, Learning to compare: relation network for few-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1199–1208.
[14] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 4080–4090.
[15] B.N. Oreshkin, P. Rodriguez, A. Lacoste, TADAM: task dependent adaptive metric for improved few-shot learning, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2018, pp. 719–729.
[16] C. Zhang, Y. Cai, G. Lin, C. Shen, DeepEMD: few-shot image classification with differentiable earth mover's distance and structured classifiers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12203–12213.
[17] J. Liu, L. Song, Y. Qin, Prototype rectification for few-shot learning, in: Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 741–756.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 6000–6010.
[19] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.
[20] Z. Ji, X. Chai, Y. Yu, Y. Pang, Z. Zhang, Improved prototypical networks for few-shot learning, Pattern Recognit. Lett. 140 (2020) 81–87, doi:10.1016/j.patrec.2020.07.015.
[21] H. Ye, H. Hu, D. Zhan, F. Sha, Few-shot learning via embedding adaptation with set-to-set functions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8808–8817.
[22] H. Huang, Z. Wu, W. Li, J. Huo, Y. Gao, Local descriptor-based multi-prototype network for few-shot learning, Pattern Recognit. 116 (2021) 107935, doi:10.1016/j.patcog.2021.107935.
[23] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[24] Y. Zhang, Y. Gong, H. Zhu, X. Bai, W. Tang, Multi-head enhanced self-attention network for novelty detection, Pattern Recognit. 107 (2020) 107486, doi:10.1016/j.patcog.2020.107486.
[25] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3146–3154.
[26] H. Zhao, J. Jia, V. Koltun, Exploring self-attention for image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10076–10085.
[27] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, J. Luo, Revisiting local descriptor based image-to-class measure for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7260–7268.
[28] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[29] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J.B. Tenenbaum, H. Larochelle, R.S. Zemel, Meta-learning for semi-supervised few-shot classification, in: Proceedings of the International Conference on Machine Learning (ICML), 2018.
[30] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 dataset, Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[31] W. Chen, Y. Liu, Z. Kira, Y. Wang, J. Huang, A closer look at few-shot classification, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[32] F. Hao, F. He, J. Cheng, L. Wang, J. Cao, D. Tao, Collect and select: semantic alignment metric learning for few-shot learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8460–8469.
[33] C. Simon, P. Koniusz, R. Nock, M. Harandi, Adaptive subspaces for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4136–4145.
[34] R. Hou, H. Chang, B. Ma, S. Shan, X. Chen, Cross attention network for few-shot classification, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2019, pp. 4003–4014.
[35] J. He, R. Hong, X. Liu, M. Xu, Q. Sun, Revisiting local descriptor for improved few-shot classification, ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) (2022).
[36] C. Xu, Y. Fu, C. Liu, C. Wang, J. Li, F. Huang, L. Zhang, X. Xue, Learning dynamic alignment via meta-filter for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5182–5191.
[37] A. Afrasiyabi, H. Larochelle, J.-F. Lalonde, C. Gagné, Matching feature sets for few-shot image classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 9014–9024.
[38] Y. Tian, Y. Wang, D. Krishnan, J.B. Tenenbaum, P. Isola, Rethinking few-shot image classification: a good embedding is all you need?, in: Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 266–282.
[39] D. Wertheimer, L. Tang, B. Hariharan, Few-shot classification with feature map reconstruction networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8012–8021.
[40] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929.

Xi-Lang Huang received the M.S. degree in electrical engineering from Pusan National University, Busan, South Korea, in 2018. He is currently pursuing the Ph.D. degree in electrical engineering with Pukyong National University, Busan, South Korea. His current research interests include modeling and simulation of discrete-event systems, efficient simulation optimization, and computer vision.

Seon Han Choi received the B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2012, 2014, and 2018, respectively. In 2018, he was a Post-Doctoral Researcher with the Information and Electronics Research Institute, KAIST. From 2018 to 2019, he was a Senior Researcher with the Korea Institute of Industrial Technology, Ansan, South Korea. From 2019 to 2022, he was an Assistant Professor in the Department of IT Convergence and Application Engineering, Pukyong National University, Busan, South Korea. In 2022, he joined the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea, as an Assistant Professor. His current research interests include the modeling and simulation of discrete-event systems, efficient simulation optimization under stochastic noise, evolutionary computing, and machine learning.