
Pattern Recognition 135 (2023) 109170


SAPENet: Self-Attention based Prototype Enhancement Network for Few-shot Learning

Xilang Huang a, Seon Han Choi b,c,∗

a Department of Artificial Intelligent Convergence, Pukyong National University, Busan, 48513, South Korea
b Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, 03760, South Korea
c Graduate Program in Smart Factory, Ewha Womans University, Seoul, 03760, South Korea

Article history: Received 23 January 2022; Revised 2 September 2022; Accepted 9 November 2022; Available online 13 November 2022
Keywords: Few-shot learning; Multi-head self-attention mechanism; Image classification; k-Nearest neighbor

Abstract: Few-shot learning considers the problem of learning unseen categories given only a few labeled samples. As one of the most popular few-shot learning approaches, Prototypical Networks have received considerable attention owing to their simplicity and efficiency. However, a class prototype is typically obtained by averaging a few labeled samples belonging to the same class, which treats the samples as equally important and is thus prone to learning redundant features. Herein, we propose a self-attention based prototype enhancement network (SAPENet) to obtain a more representative prototype for each class. SAPENet utilizes multi-head self-attention mechanisms to selectively augment discriminative features in each sample feature map, and generates channel attention maps between intra-class sample features to attentively retain informative channel features for that class. The augmented feature maps and attention maps are finally fused to obtain representative class prototypes. Thereafter, a local descriptor-based metric module is employed to fully exploit the channel information of the prototypes by searching k similar local descriptors of the prototype for each local descriptor in the unlabeled samples for classification. We performed experiments on multiple benchmark datasets: miniImageNet, tieredImageNet, and CUB-200-2011. The experimental results on these datasets show that SAPENet achieves a considerable improvement compared to Prototypical Networks and also outperforms related state-of-the-art methods.

© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

Over recent years, remarkable progress has been achieved in deep learning methods for various computer vision tasks such as image classification [1], semantic segmentation [2], and object detection [3]. The success in these tasks relies mainly on the availability of a large number of training samples to progressively learn task-specific weight distributions for neural networks. However, it is typically difficult or even impossible to collect such a large number of labeled samples in practice, which, in turn, hampers the networks from sufficiently learning useful information regarding the images; further, this usually leads to overfitting issues.

Unlike deep learning methods, which are highly dependent on data capacity, humans possess the ability to quickly learn new concepts using only a few samples. Motivated by this phenomenon, certain studies have proposed few-shot learning [4–6] to imitate the generalization ability of humans to learn useful information using extremely limited training samples. In few-shot learning, a learning algorithm is generally composed of a neural network and a classifier. The network extracts discriminative features, whereas the classifier learns to make correct decisions based on the features. The learning algorithm is expected to learn prior knowledge from a number of base training tasks and to apply the learned knowledge to novel classes consisting of a few labeled samples. In the training phase, to maintain consistency with the setting in the test phase, the base classes are distributed into multiple N-way M-shot tasks. Each task includes a support (training) set consisting of N classes with M labeled samples per class and a query (testing) set consisting of N classes with some unlabeled samples.

Learning to generalize prior knowledge to novel concepts is a non-trivial task for neural networks, especially when the training samples are limited. To this end, various excellent few-shot learning methods have been proposed to tackle this learning issue from different perspectives.

∗ Corresponding author. E-mail addresses: [email protected] (X. Huang), [email protected] (S.H. Choi).

https://doi.org/10.1016/j.patcog.2022.109170
0031-3203/© 2022 Elsevier Ltd. All rights reserved.

These methods can be roughly divided into optimization-based [7–9], model-based [10,11], and metric-based methods [12–14]. Among them, metric-based methods have received considerable attention because of their strong generalization ability. Metric-based methods learn a metric function that can correctly match the features between the support and query samples and complete the classification using the nearest neighbor method. It has been shown that distance metrics such as the Euclidean distance [14], cosine distance [12,15], and Earth Mover's distance [16] can efficiently construct task-specific distance spaces from a few supervised samples and help train networks through the similarity information between sample features. As one of the representative metric-based methods, ProtoNet [14] is known for its conceptual simplicity and effectiveness under few-shot settings. ProtoNet assumes that sample features from the same class cluster around a single prototype representation in an embedding space generated by the network. Therefore, it adopts the mean of intra-class samples as the prototype for that class and classifies the query according to the prototypes. Despite its promising performance, ProtoNet treats the intra-class sample features as equally important; however, certain sample features may not be as representative as others, thereby easily resulting in information loss and learning biased prototypes with limited training samples [17].

In this paper, we propose a self-attention-based prototype enhancement network (SAPENet) to address the above-mentioned issue. Unlike ProtoNet, which adopts the mean vectors as prototypes without considering the important features, SAPENet obtains a more representative prototype for each class by considering the important local features of the feature map and the relative relationships among the intra-class features. To search for important local features, SAPENet uses a self-attention block inspired by [18] and [19], which includes three convolution kernels that learn to generate feature maps with emphasized distinguishable local features. Furthermore, SAPENet develops an intra-class attention block, which is an extension of the self-attention block designed to explore the inner relationships between intra-class features. The intra-class attention block learns to accurately measure the relationships between intra-class features based on linear transformations and highlights the important channel features in the class by generating a channel feature map. By merging the augmented feature maps and channel attention maps, SAPENet tends to preserve informative features, while reducing redundant ones in the feature maps, such that the prototype feature maps contain more task-specific information regarding the class. For classification, SAPENet applies the k-nearest neighbor (KNN) method to determine the k nearest local descriptors in the prototype feature map for each local descriptor in the target feature map. The similarity score between the target and the prototype feature map is then obtained by summing the similarity between all local descriptors of the target feature map and their nearest descriptors. In summary, the contributions of this paper are as follows:

• In contrast to ProtoNet, which uses mean vectors as prototypes, SAPENet develops an intra-class self-attention block to efficiently capture the informative channel features between sample features in the same class and preserve important features, while reducing redundant ones.
• We retain the fused feature maps as prototypes instead of flattening them into vectors because SAPENet aggregates channel-level information (i.e., local descriptors) of feature maps from the same class rather than pixel-level information. The resulting prototype maps can contain more compact intra-class features than sparse features.
• The experimental results on three popular benchmark datasets show that our method achieves superior performance over related state-of-the-art methods and outperforms ProtoNet by a large margin. This outcome demonstrates the effectiveness of SAPENet in capturing informative features given only a few labeled samples.

2. Related work

This section presents a review of previous few-shot learning methods, namely optimization-based, model-based, and metric-based methods. Notably, many studies have also focused on obtaining more representative prototypes and improving the performance of ProtoNet; we briefly introduce them among the metric-based methods.

2.1. Optimization-based methods

Optimization-based methods [7–9] strive to learn a well-initialized model from base classes and adapt the learned model to novel classes using a few optimization steps. Model-Agnostic Meta-Learning (MAML) [7] attempts to learn a good parameter initialization so that it can generalize well to a new task with only a few gradient steps. Ravi et al. [8] proposed a long short-term memory (LSTM)-based meta-learner to learn an exact optimization algorithm, aiming to train a classifier and achieve fast gradient-based adaptation. Lee et al. [9] advocated the use of the support vector machine to learn feature representations that can generalize well to novel classes.

2.2. Model-based methods

Model-based methods [10,11] aim to quickly update parameters with a few training samples through the design of the model structure. Santoro et al. [10] proposed memory-augmented neural networks that use an explicit storage buffer to rapidly learn novel classes by incorporating an LSTM-based meta-learner. Munkhdalai and Yu [11] explored a more sophisticated weight update scheme to generate fast and slow weights and integrated them into a network for rapid learning and generalization. Different from optimization-based methods, model-based methods use simpler computations to generate fast weights without second-order gradients.

2.3. Metric-based methods

Metric-based methods encourage networks to learn the similarity relationships between deep representations in the embedding space using a suitable metric function. MatchingNet [12] deploys a bidirectional LSTM on the entire support set and learns to extract important features based on their correlations with query features. RelationNet [13] parameterizes the distance metric as a learnable relation module to learn a non-linear metric function between deep representations. ProtoNet [14] uses the mean vector of intra-class features as a prototype representation for the class and applies the Euclidean distance to measure the similarity between prototypes and the target feature for classification.

Despite the promising performance of ProtoNet, several studies [20–22] have argued that using mean vectors as prototypes may not be efficient under few-shot settings. This is because each support feature may not be equally important to the final prototype. To address this problem, IPN [20] uses a weight-distribution module with a softmax function to generate weights for intra-class feature vectors and adopts a metric scaling method [15] in the objective function to learn a task-dependent metric space. Ye et al. [21] believed that backbone networks may accidentally select uncharacteristic features because the classification tasks are agnostic to the current model. Thus, they applied a self-attention mechanism to the mean prototypes to yield task-specific prototypes for solving the issue. Huang et al. [22] argued that a single prototype is not sufficient to represent the characteristics of the class and proposed learning multiple prototypes for each class using a learnable squeeze and spatial excitation module (sSE) [23].
Notably, our SAPENet falls under the metric-based methods but is significantly different from [21] even though both works use self-attention. Intuitively, [21] directly applied self-attention to the mean prototypes, whereas SAPENet augments the local features of each sample feature using the self-attention block and develops an intra-class self-attention block to merge the augmented intra-class sample features into the final prototypes. Thus, SAPENet determines not only the discriminative features in each sample feature, but also the priority of important features in the class.

3. Background

3.1. Problem definition

In few-shot learning, a dataset is typically split into a meta-training set Dtrain, a meta-testing set Dtest, and a meta-validation set Dval. These sets possess disjoint label spaces, and each set is further divided into multiple small learning tasks consisting of a support set S and a query set Q. The support set S is used as the prior knowledge for the network's learning, whereas the query set Q is the classification target. To imitate low-data scenarios, few-shot learning adopts the episodic learning paradigm to repeatedly use the learning tasks for training. Specifically, a learning task is formed by randomly sampling N classes with M labeled samples and some unlabeled samples per class from Dtrain to construct a support set S = {(x_{1,1}, y_{1,1}), …, (x_{N,M}, y_{N,M})} and a query set Q = {(x̃_{1,1}, ỹ_{1,1}), …, (x̃_{N,M}, ỹ_{N,M})}, in which x_{i,j} denotes the jth labeled sample from class i, y_{i,j} ∈ {1, …, N} is the corresponding label, and x̃_{i,j} denotes the jth unlabeled sample from class i. The setting of the support set is usually abbreviated as the N-way M-shot task.

3.2. Prototypical network

ProtoNet [14] is built on the assumption that feature vectors cluster around a single prototype representation in the embedding space. To make use of this assumption, ProtoNet utilizes a neural network to learn a non-linear mapping of the inputs into the embedding space and takes the mean of intra-class sample features as the prototype representation of that class. The prediction of each query sample is determined by finding the nearest prototype to its feature vector from all the class prototypes.

Given M labeled samples from class n ∈ {1, …, N} and a backbone network f_θ with learnable parameters θ, a prototype c_n for class n can be computed as:

c_n = (1/M) Σ_{i=1}^{M} f_θ(x_{n,i}).    (1)

After obtaining the class prototypes, ProtoNet uses the Euclidean distance function d(·,·) on the query feature vector x̃ and the class prototypes in the embedding space to calculate their distances. Subsequently, the softmax function is applied to the distances to compute the probability distribution of the query samples belonging to each class:

p_θ(y = n | x̃) = exp(−d(f_θ(x̃), c_n)) / Σ_{n′=1}^{N} exp(−d(f_θ(x̃), c_{n′})).    (2)

Using the probability distribution over classes for a query sample, ProtoNet minimizes the negative log-probability J(θ) = −log p_θ(y = n | x̃) by adjusting the network parameters via an optimizer (e.g., Adam or SGD), such that the predicted probability of the ground-truth label can be maximized. By iterating through the learning tasks of the base classes, ProtoNet learns to generate class representations from a few training samples and is thus able to generalize to new classes using only a few labeled samples. ProtoNet has achieved competitive performance under few-shot settings. However, it is evident that treating each feature vector in the same class as equally important in such low-data scenarios may result in the degradation of informative features that are supposed to be emphasized. To deal with this issue, our SAPENet deploys two self-attention based blocks to selectively enhance the informative local features of each sample and adaptively preserve the important channel features in the class, thereby obtaining a more representative prototype for that class.

4. Methodology

This section first describes the overall structure of the proposed SAPENet and then introduces the details regarding the two attention blocks used to obtain representative prototypes for the classes. A classification strategy is presented at the end of this section.

4.1. Overview

The objective of SAPENet is to address the issue of learning redundant features when using a few training samples. To this end, SAPENet maximizes the use of intra-class information by emphasizing both the informative features of each sample feature map and the important features in the class. The overall structure of SAPENet is shown in Fig. 1, which consists of four parts: a backbone network, a self-attention block, an intra-class attention block, and a metric module.

A support set with two classes and three labeled samples per class is given based on the few-shot setting. Both the support and query samples are first fed into the backbone network f_θ. The backbone network learns to extract significant features of the input samples by mapping semantically similar inputs to positions that are close to each other in the embedding space. Then, a self-attention block is applied to each support feature map to obtain the augmented feature map with the important local features highlighted. Simultaneously, an intra-class attention block takes the intra-class support feature maps as input to compute the channel attention maps. These channel attention maps indicate the importance scores of intra-class feature maps at the same spatial location, which are used to re-weight the channel features and emphasize the informative features in this class. To obtain the final prototype feature map, SAPENet first performs channel-wise multiplication between the augmented feature maps and the corresponding channel attention maps. This step preserves the important features of each feature map by using the extracted intra-class feature information. Subsequently, SAPENet performs element-wise addition on the support feature maps from the same class to obtain the final prototype for each class. In the classification phase, the metric module performs the classification by finding the prototype nearest to the query feature map.
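For illustration, the fusion step described above can be sketched in a few lines of PyTorch. The tensor shapes, the function name, and the assumption that the channel attention maps are already softmax-normalized over the support samples are ours; this is a sketch of the pipeline in Fig. 1, not the authors' released implementation (the two attention blocks themselves are detailed in Sections 4.2 and 4.3).

```python
import torch

def build_prototype(refined, channel_attn):
    """Fuse the augmented support features of one class into a prototype (Section 4.1).

    refined:      [M, C, H, W] feature maps output by the self-attention block
    channel_attn: [M, 1, H, W] channel attention maps from the intra-class attention
                  block, softmax-normalized over the M support samples
    """
    weighted = refined * channel_attn   # channel-wise multiplication (map broadcast over C)
    prototype = weighted.sum(dim=0)     # element-wise addition over intra-class samples
    return prototype                    # [C, H, W]
    # For comparison, ProtoNet's Eq. (1) would simply average the raw support features.
```

In the 1-shot case the channel attention map degenerates to a map of ones, so the fusion returns the single augmented support feature map, which is consistent with the observation made later in Section 5.6.1.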
Fig. 1. The overall structure of the proposed SAPENet for a 2-way 3-shot task. Given a support set and a query sample, SAPENet encodes the images into feature maps through a backbone network f_θ. Afterward, a self-attention block and an intra-class attention block are applied to the support feature maps to obtain the self-attention feature maps and channel attention maps, respectively. The prototype of each class is the element-wise addition of the product between the feature maps and the corresponding channel attention maps along the channel dimension. Classification is performed by a k-nearest neighbor-based metric module on the query feature map and prototypes. (Best viewed in color)

Fig. 2. The details of the self-attention block and the intra-class attention block. (A) The self-attention block augments the important local features of each support feature map. (B) The intra-class attention block generates channel attention maps by extracting the relative channel information between intra-class support feature maps. (Best viewed in color)

4.2. Self-attention block

The self-attention mechanism has been shown to benefit from capturing long-term dependencies and is widely applied in many computer vision tasks [24–26]. In this study, motivated by [19] and [18], we utilize multi-head self-attention as our self-attention block. The key idea of multi-head self-attention is to use multiple self-attention operators to jointly capture features from different representation subspaces and merge them to enhance the contextual information of local features.

For the sake of simplicity, we illustrate our self-attention block with the number of attention heads being 1, as depicted in Fig. 2(A). Suppose that a feature map obtained from the backbone network is denoted as s_{n,m} ∈ R^{C×H×W}, where m ∈ [1, …, M], and C, H, and W represent the number of channels, height, and width of the feature maps, respectively. The self-attention block uses three 1 × 1 convolution kernels with learnable parameters φ, δ, and ψ to linearly transform the input feature map. The transformation results in three new feature maps: Query s^φ_{n,m}, Key s^δ_{n,m}, and Value s^ψ_{n,m}. The block then reshapes them to R^{C×U}, where U = H × W is the number of pixels, and performs a matrix multiplication of the transposed s^φ_{n,m} and s^δ_{n,m}. The matrix multiplication computes the similarity between each channel feature in s^φ_{n,m} and all channel features in s^δ_{n,m} to build a global relationship among features. Subsequently, the softmax function is applied to the resulting matrix along the row dimension to obtain the attention map A ∈ R^{U×U}:

a_{i,j} = exp(s^{φ,T}_{n,m,i} · s^δ_{n,m,j} / √γ) / Σ_{j=1}^{U} exp(s^{φ,T}_{n,m,i} · s^δ_{n,m,j} / √γ),    (3)

where s^{φ,T}_{n,m,i} and s^δ_{n,m,j} denote the ith and jth positions of the transposed s^φ_{n,m} and s^δ_{n,m}, respectively, a_{i,j} measures the correlation between s^{φ,T}_{n,m,i} and s^δ_{n,m,j} in the attention map A, and 1/√γ is the scaling factor.

After obtaining the attention map, the next step entails performing matrix multiplication between the Value s^ψ_{n,m} and the transpose of attention map A, and reshaping the result to the same size as the input, which is R^{C×H×W}. Finally, the self-attention block outputs the refined feature map by performing an element-wise addition between the reshaped result and the raw input:

s̃_{n,m} = s^ψ_{n,m} · A^T + s_{n,m},    (4)

where s̃_{n,m} denotes the refined feature map. Intuitively, each position of the resulting s^ψ_{n,m} · A^T is a weighted sum of all local features, which provides a global view of the relationship between each local feature and all the local features on the feature map. By learning the discriminative features of each sample via convolutional kernels, the self-attention block can selectively aggregate similar semantic features that are useful for classification in the feature map, regardless of the distances between positions. Consequently, the block can output augmented feature maps with prominently important features.
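A single-head version of this block can be sketched as follows. The module and layer names are illustrative, and setting the scaling factor from the channel dimension follows the per-head-dimension rule reported in Section 5.3; this is a hedged sketch rather than the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Single-head sketch of the self-attention block (Eqs. (3)-(4))."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)   # phi
        self.key = nn.Conv2d(channels, channels, kernel_size=1)     # delta
        self.value = nn.Conv2d(channels, channels, kernel_size=1)   # psi
        self.scale = channels ** 0.5                                 # sqrt(gamma), assumed = sqrt(C)

    def forward(self, s):                       # s: [B, C, H, W]
        b, c, h, w = s.shape
        q = self.query(s).reshape(b, c, h * w)  # [B, C, U]
        k = self.key(s).reshape(b, c, h * w)
        v = self.value(s).reshape(b, c, h * w)
        # Eq. (3): similarity between every pair of local descriptors, softmax over rows
        attn = F.softmax(q.transpose(1, 2) @ k / self.scale, dim=-1)   # [B, U, U]
        # Eq. (4): weighted sum of the Value features plus a residual connection
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return out + s
```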
4.3. Intra-class attention block

To obtain the final prototypes, ProtoNet uses the mean vector of the intra-class vectors without considering the discriminative features in the class. Thus, it is prone to learning redundant features and results in learning under-represented prototypes in the few-shot settings. To overcome this problem, SAPENet develops an intra-class attention block to fully exploit the channel information of intra-class feature maps, as illustrated in Fig. 2(B). The intra-class attention block shares the parameters of the 1 × 1 convolutional operators with the self-attention block introduced above. Therefore, the intra-class attention block does not bring additional learnable parameters but provides training information for the convolutional kernels together with the self-attention block. The intra-class attention block takes as input the intra-class feature maps and performs linear transformations on the features via the convolutional kernels. Afterward, we concatenate the new feature maps obtained from the Key and Value convolution kernels, which can be represented as:

K = [s^δ_{n,1}, …, s^δ_{n,M}] ∈ R^{M×C×HW},  V = [s^ψ_{n,1}, …, s^ψ_{n,M}] ∈ R^{M×C×HW}.    (5)

Then, we broadcast each s^{φ,T}_{n,m} to have the same size as K and perform matrix multiplication between s^{φ,T}_{n,m} and K, aiming to capture the semantic similarity between every channel feature of one sample feature map and all channel features of the other sample feature maps (including itself). Subsequently, the softmax function is applied to the resulting matrix along the row dimension to normalize the similarity scores into attention values. Thus, the attention at the ith and jth positions of the mth sub-attention map of B can be formulated as:

b^m_{i,j} = exp(s^{φ,T}_{n,m,i} · k^m_j / √γ) / Σ_{j=1}^{U} exp(s^{φ,T}_{n,m,i} · k^m_j / √γ),    (6)

where k^m_j represents the jth column of the mth feature map of K. To obtain the attention score between one sample feature map and the others, matrix multiplication is performed on the Value feature maps V and the transpose of attention map B. This step helps to establish the channel relationship P between the mth sample feature and all sample features in the same class. The procedure can be formulated as follows:

P = V · B^T.    (7)

Each feature map in P is the weighted V considering the correlations between the channel features of s^{φ,T}_{n,m} and all channel features of each s^δ_{n,m} in K. Therefore, P contains not only the attentive values of its own but also the attentive values between features of intra-class samples. To obtain the channel attention map for each input feature map, element-wise addition is performed on P along the first dimension to aggregate the global information of each position. The results are then averaged along the channel dimension and reshaped into feature maps with a channel dimension of 1. This procedure aims to compress the channel information to obtain a two-dimensional channel score map for each support feature map. Finally, all channel score maps are concatenated and the softmax layer is applied along the channel dimension to obtain the channel attention map ŝ_{n,m} for each support feature. These channel attention maps are used to assign channel weights to the corresponding refined feature maps from the self-attention block. In this manner, the informative channel features in the class can be emphasized by giving larger weights based on the intra-class channel information between samples.

4.4. Metric module

To obtain the final prototypes, channel-wise multiplication is performed between the refined feature maps and the corresponding channel attention maps. The re-weighted feature maps are then aggregated to obtain the prototype feature map for the class through element-wise addition. Intuitively, the channel-wise multiplication uses the channel attention maps to selectively determine important channel features for the class, while the element-wise addition aggregates the spatial information of the enhanced intra-class features; thus, SAPENet can obtain a prototype that accurately represents the characteristics of the class. To fully exploit the enhanced channel features of the prototypes, we adopt the idea from [27] and use the KNN method to compute the similarity between the query feature map and the prototypes. Concretely, the query feature map can be regarded as a set of U (U = H × W) C-dimensional local descriptors:

f_θ(q) = [l_1, …, l_U] ∈ R^{C×U},    (8)

where l_i denotes the ith local descriptor of the feature map. For example, given an output feature map with a size of 64 × 5 × 5, we can obtain 25 (5 × 5) 64-dimensional descriptors in total.

For each local descriptor l_i in the query feature map, the metric module D computes the similarity between l_i and all descriptors l̃_j |_{j=1}^{U} in each class prototype c_n through cosine similarity to find its k nearest descriptors l̃_j |_{j=1}^{k}. Thereafter, the final similarity score between the query feature map and prototype c_n is computed by summing the total U·k similarities, which can be formulated as

D(f_θ(q), c_n) = Σ_{i=1}^{U} Σ_{j=1}^{k} cos(l_i, l̃_j).    (9)

The metric module D is non-parametric, similar to most metric-based methods [14,21,22]. It has the advantage of learning the mapping function without making any assumptions, which to a certain extent is suitable for few-shot settings where the distribution of training samples may be very complicated. In addition, each query feature map consists of multiple local descriptors that contain the characteristic information of the image. This enables the KNN method to search in a much larger feature space, thus providing more accurate feature matching between the prototype and the query. Furthermore, because each descriptor in the prototype aggregates the important information within the class via the attention blocks, the descriptors in one prototype are distinguishable from those of another prototype. This further enables efficient matching between the prototype and the query, thereby improving the classification accuracy of the network.
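A minimal sketch of this descriptor-matching step is given below, assuming the query feature map and the prototype are both C × H × W tensors; the function name and the final classification snippet are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def knn_similarity(query_map, prototype, k=1):
    """Similarity between a query feature map and a class prototype (Eqs. (8)-(9))."""
    c = query_map.shape[0]
    q = F.normalize(query_map.reshape(c, -1), dim=0)   # [C, U] unit-norm local descriptors
    p = F.normalize(prototype.reshape(c, -1), dim=0)
    cos = q.t() @ p                                    # [U, U] pairwise cosine similarities
    topk = cos.topk(k, dim=1).values                   # k most similar prototype descriptors per query descriptor
    return topk.sum()                                  # Eq. (9): sum over the U * k similarities

# Classification assigns the query to the prototype with the highest score, e.g.:
# scores = torch.stack([knn_similarity(query_map, c_n, k=5) for c_n in prototypes])
# prediction = scores.argmax()
```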
5. Experiments

In this section, we conduct few-shot classification tasks on three few-shot learning benchmark datasets to validate the effectiveness of the proposed SAPENet.

5.1. Datasets

Three few-shot learning benchmark datasets were used to evaluate the performance of the proposed SAPENet. The details of the datasets are as follows:

• miniImageNet [12]: This dataset is a widely used benchmark for few-shot image classification and contains 100 classes from ImageNet [28]. We followed the commonly used split method [8] to split the 100 classes into 64, 16, and 20 classes as Dtrain, Dval, and Dtest, respectively. Each class contains 600 images with a size of 84 × 84.
• tieredImageNet [29]: tieredImageNet is a larger subset constructed from ImageNet, which consists of 608 classes. These classes are further divided into 351, 97, and 160 classes as Dtrain, Dval, and Dtest, respectively. Each class contains at least 1200 images with an image size of 84 × 84.
• CUB-200-2011 [30]: CUB-200-2011 contains 11,788 images of 200 bird species. We followed previous studies [31] to split the classes into 100, 50, and 50 classes as Dtrain, Dval, and Dtest, respectively. Each image is resized to 84 × 84 to fit the input size of the backbone network.

5.2. Comparison methods

We compared our SAPENet to few-shot learning methods that focus on metric-based learning or attention mechanisms to validate its efficiency. Among the comparison methods, Baseline++ [31] uses a pre-trained backbone and a cosine distance-based classifier to perform fine-tuning on novel classes. SAML [32] employs an attention mechanism to select semantically relevant local descriptors between the support and query features. DSN [33] learns to generate high-order prototype-based subspaces as the classifier. DN4 [27] uses a descriptor-based classifier to build similarities between the concatenated intra-class support features and the query features for classification. TADAM [15] inserts learnable parameters into the distance metric to adaptively learn task-specific metrics. CAN [34] draws attention to important regions of query feature maps by capturing the semantic relevance between the class and query features. FEAT [21] applies self-attention to mean prototypes to make them task-specific and discriminative. DCAP [35] uses attentive pooling to re-weight the local features for efficient classification. LMPNet [22] learns multiple prototypes to adequately represent the features of a class. DMF [36] learns a dynamic meta-filter that highlights both query regions and channels according to different local support information. SetFeat [37] attaches self-attention to each convolutional block to adaptively align the features between the query and support. In addition to the methods above, we also compared SAPENet to representative few-shot learning methods from other categories.

5.3. Implementation details

All experiments were conducted using PyTorch 1.6 on an eight-core AMD Ryzen 7 2700 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. Following previous studies [9,22,33], we considered two standard backbone networks: a 4-layer convolutional network (Conv4-64) and a 12-layer ResNet (ResNet-12). Conv4-64 consists of four consecutive 64-channel convolution blocks, each of which includes a convolutional layer with 64 filters of size 3 × 3, a BatchNorm layer, a LeakyReLU (0.2) layer, and a 2 × 2 max-pooling layer. To retain a sufficient number of local descriptors to represent the images, we remove the max-pooling layer from the last two convolution blocks. ResNet-12 is a much deeper network composed of four residual blocks, each of which includes three convolutional layers, BatchNorm layers, LeakyReLU (0.1) layers, and a 2 × 2 max-pooling layer. We used the same setting as in previous studies [9,21,38], in which the residual blocks output feature maps with channel sizes of 64, 160, 320, and 640. Given an input image of size 84 × 84, Conv4-64 outputs a feature map with a size of 64 × 21 × 21, and the output size of ResNet-12 is 640 × 5 × 5. For the self-attention block in SAPENet, we set the number of attention heads to 4 and 8 for the Conv4-64 and ResNet-12 backbones, respectively. The scaling factor 1/√γ was set according to the per-head dimension of the Key (i.e., γ = 64/4 for Conv4-64 and γ = 640/8 for ResNet-12), following the relationship given in [18]. For the metric module, we set k to 1 and 5 for the Conv4-64 and ResNet-12 backbones, respectively. The reasons for choosing these hyper-parameters are discussed in the ablation study section.

During training, we applied standard data augmentation (i.e., random crop, horizontal flip, and color jitter) following previous studies [21,34,36]. For Conv4-64, we trained the backbone network from scratch for 200 epochs, with each epoch consisting of 1000 episodes. We set the batch size to 2 and used Adam as the optimizer with the default learning rate (i.e., 0.001). The query set contains 15 samples per class and is used to generate the classification losses. Thus, for a 5-way 5-shot task, there are 50 (2 × 5 × 5) images in the support set and 150 (2 × 5 × 15) images in the query set in each training episode. For ResNet-12, we adopted the pre-training scheme from [39] to relieve the burden of episodic training. Specifically, we used Dtrain to first pre-train the ResNet-12 backbone. We ran 350 epochs and 90 epochs with a batch size of 128 on the miniImageNet and tieredImageNet datasets, respectively. SGD was used as the optimizer with an initial learning rate of 0.1 and a Nesterov momentum of 0.9. For the miniImageNet dataset, the learning rate decays to half at epochs 200 and 300, whereas for the tieredImageNet dataset, the learning rate decays to half at epochs 40 and 60. We then discarded the classifier and used the pre-trained backbone to train the attention blocks in SAPENet on the same meta-training set Dtrain. We trained the attention blocks for 60 epochs under the few-shot settings, with each epoch consisting of 1200 episodes for both the miniImageNet and tieredImageNet datasets. The initial learning rate of SGD was set to 0.001, which decayed by half at epochs 30 and 50. The query set contained 15 samples per class and the batch size was set to 1. To ensure a fair comparison, we re-implemented the methods mentioned above using the settings of SAPENet (e.g., episode size, data augmentation, and data format). When using ResNet-12 as the backbone, we used the same pre-trained backbone for the methods without a pre-training process. On the other hand, for the methods that include a pre-training process, we used their own pre-trained backbone for training.

In the testing phase, we conducted 5-way 5-shot and 5-way 1-shot classification tasks for each dataset. The final classification accuracies were obtained by averaging over 10,000 episodes and were reported with 95% confidence intervals.
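For reference, a minimal PyTorch sketch of the Conv4-64 backbone described above is given below; the padding value and the block grouping are assumptions chosen so that an 84 × 84 input produces the reported 64 × 21 × 21 feature map.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=True):
    """One 64-channel convolution block as described in Section 5.3."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.LeakyReLU(0.2)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Conv4-64 with max-pooling removed from the last two blocks:
# 84 x 84 -> 42 x 42 -> 21 x 21 -> 21 x 21 -> 21 x 21, i.e., a 64 x 21 x 21 output.
conv4_64 = nn.Sequential(
    conv_block(3, 64),
    conv_block(64, 64),
    conv_block(64, 64, pool=False),
    conv_block(64, 64, pool=False),
)
```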
5.4. General few-shot classification

Based on the above settings, we first conducted experiments on the miniImageNet dataset, and the results are listed in Table 1. All results are obtained using the same settings as those of SAPENet for a fair comparison, and the best results are shown in bold.

Table 1. The 5-way 1-shot and 5-way 5-shot classification accuracies (%) on the miniImageNet dataset. The best results are shown in bold.

Method | Backbone | 1-shot | 5-shot
MatchingNet [12] | Conv4-64 | 44.81 ± 0.19 | 65.02 ± 0.17
MAML [7] | Conv4-64 | 44.26 ± 0.19 | 61.81 ± 0.18
RelationNet [13] | Conv4-64 | 49.06 ± 0.19 | 65.18 ± 0.16
Baseline++ [31] | Conv4-64 | 47.94 ± 0.18 | 67.37 ± 0.15
SAML [32] | Conv4-64 | 52.88 ± 0.20 | 68.17 ± 0.16
TADAM [15] | Conv4-64 | 50.50 ± 0.20 | 70.09 ± 0.16
MetaOptNet [9] | Conv4-64 | 51.28 ± 0.20 | 69.71 ± 0.16
DSN [33] | Conv4-64 | 51.69 ± 0.20 | 69.95 ± 0.16
ProtoNet [14] | Conv4-64 | 51.88 ± 0.20 | 70.40 ± 0.16
IPN [20] | Conv4-64 | 51.98 ± 0.20 | 70.18 ± 0.16
CAN [34] | Conv4-64 | 48.61 ± 0.20 | 69.96 ± 0.16
DN4 [27] | Conv4-64 | 51.65 ± 0.20 | 71.46 ± 0.16
FEAT [21] | Conv4-64 | 51.05 ± 0.19 | 67.53 ± 0.17
DeepEMD [16] | Conv4-64 | 53.81 ± 0.20 | 70.56 ± 0.16
LMPNet [22] | Conv4-64 | 54.01 ± 0.20 | 71.30 ± 0.16
DCAP [35] | Conv4-64 | 54.62 ± 0.20 | 69.02 ± 0.16
DMF [36] | Conv4-64 | 55.69 ± 0.20 | 71.52 ± 0.16
SetFeat [37] | Conv4-64 | 54.72 ± 0.20 | 70.32 ± 0.16
SAPENet (ours) | Conv4-64 | 55.71 ± 0.20 | 71.81 ± 0.16
MatchingNet [12] | ResNet-12 | 62.90 ± 0.20 | 78.93 ± 0.15
MAML [7] | ResNet-12 | 55.24 ± 0.21 | 68.75 ± 0.17
RelationNet [13] | ResNet-12 | 61.44 ± 0.21 | 75.27 ± 0.16
Baseline++ [31] | ResNet-12 | 54.46 ± 0.21 | 77.02 ± 0.15
SAML [32] | ResNet-12 | 62.69 ± 0.19 | 78.96 ± 0.14
TADAM [15] | ResNet-12 | 62.06 ± 0.20 | 79.24 ± 0.14
MetaOptNet [9] | ResNet-12 | 62.67 ± 0.20 | 80.52 ± 0.14
DSN [33] | ResNet-12 | 60.55 ± 0.20 | 77.63 ± 0.15
ProtoNet [14] | ResNet-12 | 61.71 ± 0.21 | 79.08 ± 0.16
IPN [20] | ResNet-12 | 61.51 ± 0.21 | 79.47 ± 0.15
CAN [34] | ResNet-12 | 61.60 ± 0.20 | 80.00 ± 0.15
DN4 [27] | ResNet-12 | 64.93 ± 0.20 | 80.87 ± 0.14
FEAT [21] | ResNet-12 | 66.33 ± 0.20 | 81.78 ± 0.16
DeepEMD [16] | ResNet-12 | 65.69 ± 0.19 | 81.96 ± 0.16
LMPNet [22] | ResNet-12 | 62.52 ± 0.19 | 81.05 ± 0.14
DCAP [35] | ResNet-12 | 63.19 ± 0.20 | 80.64 ± 0.14
DMF [36] | ResNet-12 | 65.00 ± 0.21 | 81.43 ± 0.15
SetFeat [37] | ResNet-12 | 65.95 ± 0.20 | 81.18 ± 0.14
SAPENet (ours) | ResNet-12 | 66.41 ± 0.20 | 82.76 ± 0.14

Table 2. The 5-way 1-shot and 5-way 5-shot classification accuracies (%) on the tieredImageNet dataset. The best results are shown in bold.

Method | Backbone | 1-shot | 5-shot
MatchingNet [12] | Conv4-64 | 49.48 ± 0.22 | 70.00 ± 0.18
MAML [7] | Conv4-64 | 49.97 ± 0.22 | 64.46 ± 0.20
RelationNet [13] | Conv4-64 | 50.75 ± 0.22 | 67.29 ± 0.18
Baseline++ [31] | Conv4-64 | 50.03 ± 0.19 | 70.17 ± 0.18
SAML [32] | Conv4-64 | 51.94 ± 0.22 | 70.12 ± 0.19
TADAM [15] | Conv4-64 | 51.05 ± 0.22 | 70.92 ± 0.19
MetaOptNet [9] | Conv4-64 | 49.97 ± 0.21 | 69.76 ± 0.18
DSN [33] | Conv4-64 | 51.52 ± 0.21 | 71.31 ± 0.18
ProtoNet [14] | Conv4-64 | 51.44 ± 0.22 | 72.50 ± 0.18
IPN [20] | Conv4-64 | 52.22 ± 0.22 | 72.07 ± 0.18
CAN [34] | Conv4-64 | 49.33 ± 0.22 | 66.09 ± 0.19
DN4 [27] | Conv4-64 | 49.35 ± 0.21 | 72.24 ± 0.18
FEAT [21] | Conv4-64 | 51.93 ± 0.21 | 69.58 ± 0.19
DeepEMD [16] | Conv4-64 | 55.96 ± 0.22 | 73.43 ± 0.18
LMPNet [22] | Conv4-64 | 51.80 ± 0.22 | 72.32 ± 0.18
DCAP [35] | Conv4-64 | 54.12 ± 0.22 | 69.29 ± 0.19
DMF [36] | Conv4-64 | 55.65 ± 0.21 | 69.41 ± 0.19
SetFeat [37] | Conv4-64 | 57.26 ± 0.22 | 74.74 ± 0.18
SAPENet (ours) | Conv4-64 | 57.61 ± 0.22 | 75.42 ± 0.18
MatchingNet [12] | ResNet-12 | 64.18 ± 0.22 | 80.09 ± 0.17
MAML [7] | ResNet-12 | 52.01 ± 0.23 | 70.31 ± 0.23
RelationNet [13] | ResNet-12 | 62.15 ± 0.23 | 75.47 ± 0.19
Baseline++ [31] | ResNet-12 | 57.71 ± 0.24 | 76.93 ± 0.18
SAML [32] | ResNet-12 | 65.43 ± 0.21 | 79.21 ± 0.18
TADAM [15] | ResNet-12 | 64.34 ± 0.24 | 82.34 ± 0.16
MetaOptNet [9] | ResNet-12 | 65.99 ± 0.72 | 81.56 ± 0.16
DSN [33] | ResNet-12 | 65.61 ± 0.20 | 79.22 ± 0.17
ProtoNet [14] | ResNet-12 | 64.63 ± 0.23 | 81.17 ± 0.17
IPN [20] | ResNet-12 | 63.23 ± 0.23 | 80.08 ± 0.16
CAN [34] | ResNet-12 | 62.52 ± 0.23 | 81.05 ± 0.16
DN4 [27] | ResNet-12 | 66.28 ± 0.22 | 82.24 ± 0.16
FEAT [21] | ResNet-12 | 67.23 ± 0.22 | 82.83 ± 0.18
DeepEMD [16] | ResNet-12 | 68.12 ± 0.22 | 84.69 ± 0.16
LMPNet [22] | ResNet-12 | 66.62 ± 0.23 | 80.12 ± 0.16
DCAP [35] | ResNet-12 | 64.31 ± 0.22 | 82.17 ± 0.16
DMF [36] | ResNet-12 | 66.80 ± 0.23 | 82.68 ± 0.16
SetFeat [37] | ResNet-12 | 67.48 ± 0.23 | 83.25 ± 0.16
SAPENet (ours) | ResNet-12 | 68.63 ± 0.23 | 84.30 ± 0.16

As indicated in Table 1, SAPENet achieves the best results in both 5-way 1-shot and 5-shot tasks using the Conv4-64 and ResNet-12 backbones. Specifically, under the 1-shot and 5-shot settings, the accuracies of SAPENet are respectively 3.83% and 1.41% higher than those of ProtoNet and 4.66% and 4.28% higher than those of FEAT when using Conv4-64. Surprisingly, ProtoNet achieves a higher accuracy than FEAT. This is because the original FEAT uses a pre-trained backbone to boost the performance, and we re-implemented it without the pre-trained backbone for a fair comparison. In the case of ResNet-12, the accuracies of SAPENet are 4.70% and 3.68% higher than those of ProtoNet and 1.41% and 1.33% higher than those of DMF. In addition, SAPENet also outperforms the methods (i.e., IPN, LMPNet, FEAT, and DCAP) that focus on improving ProtoNet, and performs better than CAN and SetFeat, which use attention to highlight the target object regions.

Table 2 presents the classification results for the tieredImageNet dataset. We can observe that SAPENet consistently achieves the best results when using the Conv4-64 backbone. Specifically, the accuracies of SAPENet are 4.30% and 2.73% higher than those of ProtoNet and 5.68% and 5.84% higher than those of FEAT in the 1-shot and 5-shot tasks, respectively. In the case of ResNet-12, SAPENet still obtains a competitive performance in the 5-shot task and achieves accuracy improvements of 4.00% and 3.13% compared with ProtoNet. Furthermore, SAPENet outperforms current methods such as SetFeat, DCAP, and DMF in both settings. DeepEMD obtains the best result in the 5-shot task, which can be attributed to its fine-tuning on the novel classes and its learning of a group of prototypes for each class via the structured fully connected (SFC) layer.

The effectiveness of SAPENet can be summarized in three aspects. First, under the 1-shot setting, SAPENet enhances the semantic compactness of each support feature map by aggregating semantically similar local features on the feature map; thus, features that are important and similar to the class can be highlighted through training. This leads to a large improvement in the 1-shot task compared to IPN and LMPNet. Second, the intra-class self-attention block considers the correlation of local features between intra-class samples, which makes it possible to redetermine the informative features in the feature maps and thus retain more valuable information for the prototypes. In contrast, FEAT only considers applying attention to the mean prototypes, which may be less efficient due to the redundant features generated by the mean operation. Third, since each descriptor in the prototype obtained by SAPENet integrates features that are important to the class through importance scores, the descriptors between classes are distinguishable from each other. The descriptor-based metric module maximizes the use of informative descriptors and globally finds similar descriptors in the prototype for each descriptor in the query between the feature maps.

5.5. Fine-grained few-shot classification

To evaluate whether SAPENet is also effective on a fine-grained dataset, we conducted experiments on the CUB dataset with the Conv4-64 backbone using the same settings as in the miniImageNet dataset.
Similarly, all the methods listed in Table 3 used the same settings as those of SAPENet to ensure a fair comparison. As displayed in Table 3, SAPENet achieves the best results and attains a significant improvement compared to ProtoNet and FEAT. In addition, SAPENet leads to 8.29%, 0.89%, 3.60%, and 1.60% accuracy improvements over DeepEMD and SetFeat in the 1-shot and 5-shot settings, respectively. These results validate the effectiveness of SAPENet on the fine-grained dataset, and further show that the intra-class attention block can effectively emphasize the discriminative features of each class in the presence of similar classes, thus enabling more accurate matching between the query and its related prototype during classification.

Table 3. The 5-way 1-shot and 5-way 5-shot classification accuracies (%) on the CUB-200-2011 dataset. The best results are shown in bold.

Method | Backbone | 1-shot | 5-shot
MatchingNet [12] | Conv4-64 | 51.45 ± 0.22 | 75.46 ± 0.18
MAML [7] | Conv4-64 | 47.85 ± 0.22 | 64.77 ± 0.20
RelationNet [13] | Conv4-64 | 58.81 ± 0.24 | 75.23 ± 0.18
Baseline++ [31] | Conv4-64 | 57.79 ± 0.22 | 74.03 ± 0.18
SAML [32] | Conv4-64 | 62.75 ± 0.23 | 78.24 ± 0.16
TADAM [15] | Conv4-64 | 56.64 ± 0.23 | 73.66 ± 0.17
MetaOptNet [9] | Conv4-64 | 49.52 ± 0.22 | 71.68 ± 0.18
DSN [33] | Conv4-64 | 54.49 ± 0.23 | 74.10 ± 0.17
ProtoNet [14] | Conv4-64 | 54.52 ± 0.23 | 73.30 ± 0.17
IPN [20] | Conv4-64 | 58.45 ± 0.24 | 76.61 ± 0.17
CAN [34] | Conv4-64 | 59.31 ± 0.24 | 72.72 ± 0.19
DN4 [27] | Conv4-64 | 63.15 ± 0.22 | 82.54 ± 0.14
FEAT [21] | Conv4-64 | 62.91 ± 0.24 | 79.82 ± 0.16
DeepEMD [16] | Conv4-64 | 62.09 ± 0.23 | 83.58 ± 0.17
LMPNet [22] | Conv4-64 | 61.66 ± 0.23 | 82.20 ± 0.14
DCAP [35] | Conv4-64 | 58.69 ± 0.25 | 69.40 ± 0.19
DMF [36] | Conv4-64 | 66.79 ± 0.24 | 81.40 ± 0.17
SetFeat [37] | Conv4-64 | 67.78 ± 0.23 | 82.87 ± 0.15
SAPENet (ours) | Conv4-64 | 70.38 ± 0.23 | 84.47 ± 0.14

5.6. Ablation study

5.6.1. Key component analysis

To verify how the self-attention block, intra-class attention block, and metric module affect the performance of SAPENet, we conducted 5-way 1-shot and 5-shot tasks on the miniImageNet dataset using the Conv4-64 backbone. Specifically, when using the Euclidean distance as the metric module, we followed ProtoNet to flatten the feature maps into vectors for classification, while keeping the feature maps as prototypes when using KNN. Moreover, we employed the mean operation to merge the intra-class vectors into the prototype when the intra-class attention block was not used. Note that the classification results are the same with or without intra-class attention in the 1-shot setting (e.g., the 1st row is equivalent to the 3rd row under the 1-shot setting in Table 4). This is because the intra-class attention block outputs a matrix of 1, and the mean operation outputs the original input when there is only one sample per class; thus, we filled those grids with the same number to indicate that they are equivalent.

Table 4. Classification accuracy (%) for key component analysis on the miniImageNet dataset with the Conv4-64 backbone under 5-way 1-shot and 5-shot settings. Results are averaged over 10,000 episodes and 95% confidence intervals are below 2e-3.

Metric module | Self-attention | Intra-class attention | 1-shot | 5-shot
Euclidean | | | 50.41 | 69.06
Euclidean | ✓ | | 52.40 | 70.46
Euclidean | | ✓ | 50.41 | 69.54
Euclidean | ✓ | ✓ | 52.40 | 70.53
KNN | | | 53.34 | 70.44
KNN | ✓ | | 55.71 | 70.83
KNN | | ✓ | 53.34 | 71.52
KNN | ✓ | ✓ | 55.71 | 71.81

As indicated in Table 4, the self-attention and intra-class attention blocks can achieve considerable improvement on both 1-shot and 5-shot tasks compared to the case without any attention blocks. In addition, the performance of using two attention blocks is similar to that of using only the self-attention block under the Euclidean distance. It can be inferred that flattening the feature maps into vectors for classification cannot maximize the use of the channel features refined by the intra-class attention block. In contrast, SAPENet keeps the augmented feature maps as prototypes and uses the descriptor-based metric module to fully exploit the refined channel features. These experimental results show that the proposed attention blocks can effectively help SAPENet obtain informative prototypes with a few training samples, and they are fully compatible with the metric module that matches descriptors between the query and prototypes.

5.6.2. Effect of attention head and scaling factor

In [18], the number of attention heads and the scaling factor are two decision variables. The former determines the number of subspaces used to extract information, while the latter controls the distribution of attention values for the feature maps. To evaluate their influence on SAPENet, we varied the number of attention heads and the scaling factor to perform 5-way 1-shot and 5-shot tasks on the miniImageNet dataset with the ResNet-12 backbone. As presented in Table 5, increasing the number of attention heads within an appropriate range can lead to higher performance, whereas an extremely small (or excessive) number of attention heads can result in performance loss. We infer that this phenomenon is due to the small (or large) number of attention heads yielding insufficient (or excessive) subspaces to extract useful information, resulting in under-representation (or over-representation) of informative features. Thus, we chose the number of attention heads for Conv4-64 and ResNet-12 to be 4 and 8, respectively, to maintain a suitable number of attention heads for learning feature extraction.

Table 5. Classification accuracy (%) of different attention head numbers and scaling factors γ on the miniImageNet dataset with the ResNet-12 backbone. Results were obtained by averaging over 10,000 episodes.

Number of heads | Scaling factor √γ | 1-shot | 5-shot
1 | √640 | 65.84 | 82.12
4 | √160 | 66.17 | 82.83
8 | √80 | 66.41 | 82.76
16 | √40 | 66.36 | 82.55
32 | √20 | 66.10 | 82.53

5.6.3. Neighbor k selection analysis

In the metric module, the hyper-parameter k determines the number of semantically related descriptors in the prototype that should be selected for each descriptor in the query feature map. Because different k values can have an impact on the final predictive accuracy of SAPENet, we chose k ∈ {1, 3, 5, 7} to perform 1-shot and 5-shot tasks on the miniImageNet dataset with the Conv4-64 and ResNet-12 backbones. As indicated in Table 6, Conv4-64 achieves the best results when k is 1, whereas a larger k (i.e., 5) is more favorable for ResNet-12. We suspect that this phenomenon arises because Conv4-64 is a shallow backbone whose output feature maps may contain coarser and noisier features, and increasing k may result in selecting more unrelated descriptors, thereby degrading the performance. In contrast, ResNet-12 outputs feature maps with finer and more informative features to represent the class; thus, a relatively larger k is beneficial in this case to aggregate features for classification.
Table 6. Classification accuracy (%) of different k values on the miniImageNet dataset under 5-way 1-shot and 5-way 5-shot settings. Results were obtained by averaging over 10,000 episodes.

Number of k | Conv4-64 1-shot | Conv4-64 5-shot | ResNet-12 1-shot | ResNet-12 5-shot
1 | 55.71 | 71.81 | 66.05 | 81.34
3 | 54.79 | 71.04 | 66.84 | 82.18
5 | 54.61 | 70.27 | 66.41 | 82.76
7 | 54.24 | 69.31 | 65.85 | 81.88

Fig. 3. Class activation mapping (CAM) visualization on a 5-shot task for SAPENet and ProtoNet. (Best viewed in color)

5.6.4. Visualization

To visually confirm that SAPENet pays more attention to informative features, we generated and compared the class activation mapping [40] of SAPENet and ProtoNet using the ResNet-12 backbone on Dtest of the miniImageNet dataset. As depicted in Fig. 3, ProtoNet tends to contain the features of non-target objects owing to the use of mean prototypes as the learning criteria for the network. In contrast, SAPENet exploits the intra-class information to attentively select the informative features for that class, which allows it to focus on target features and ignore redundant ones.

5.6.5. Computational cost analysis

Table 7. Computational cost analysis on miniImageNet with the ResNet-12 backbone during the meta-training phase. Time is measured over 10,000 episodes.

Method | Episode size | Parameters (MB) | Time (ms)
ProtoNet [14] | 5-way 1-shot | 12.42 | 218
LMPNet [22] | 5-way 1-shot | 12.46 | 228
SetFeat [37] | 5-way 1-shot | 20.72 | 654
DeepEMD [16] | 5-way 1-shot | 12.42 | 473
DMF [36] | 5-way 1-shot | 13.41 | 519
FEAT [21] | 5-way 1-shot | 14.06 | 228
SAPENet (ours) | 5-way 1-shot | 13.66 | 230
ProtoNet [14] | 5-way 5-shot | 12.42 | 266
LMPNet [22] | 5-way 5-shot | 12.46 | 280
SetFeat [37] | 5-way 5-shot | 20.72 | 811
DeepEMD [16] | 5-way 5-shot | 12.50 | 11,300
DMF [36] | 5-way 5-shot | 14.41 | 573
FEAT [21] | 5-way 5-shot | 14.06 | 279
SAPENet (ours) | 5-way 5-shot | 13.66 | 283

SAPENet uses two attention blocks to produce a more representative prototype for the class, which inevitably increases the learnable parameters in addition to the backbone. To analyze the computational speed of SAPENet, we compared it with other methods. As displayed in Table 7, ProtoNet has the advantage of high computational efficiency among all methods owing to its simple prototype computation and parameter-free classifier. SAPENet has an additional 1.24 MB of learnable parameters owing to the three 1 × 1 convolutional layers in the attention blocks. However, SAPENet is only 12 ms and 17 ms slower than ProtoNet in the 1-shot and 5-shot tasks per episode, respectively, while achieving a considerably higher classification accuracy. Compared with FEAT, SAPENet has fewer learnable parameters because FEAT has an additional fully connected layer attached after the attention block. Moreover, as SAPENet uses 1 × 1 convolutional layers in the attention blocks, its learnable parameters only increase when the output channel dimension increases. Considering that ResNet-12 is almost the deepest backbone in few-shot learning (the output channel dimension is 640), the size of the additional learnable parameters of SAPENet will basically not exceed 1.24 MB. Note that the computation speed of FEAT is slightly faster than that of SAPENet, although FEAT has more parameters to learn. This is because SAPENet needs to conduct KNN (k = 5) on each descriptor of the query to find its k closest neighbors in the prototypes. Compared with current state-of-the-art methods such as DeepEMD, DMF, and SetFeat, SAPENet is much faster in both 1-shot and 5-shot tasks. Specifically, although DeepEMD possesses the same size of learnable parameters as ProtoNet, it requires more training time to solve the optimal matching problem between the support and query features. Furthermore, when the shot number is greater than 1, it learns an SFC layer by fine-tuning on the novel classes, which requires multiple forward and backward passes, thereby slowing down the computational speed. DMF learns a specific region for each local feature via complex deformable convolution, which results in twice the computational cost of SAPENet.
9
X. Huang and S.H. Choi Pattern Recognition 135 (2023) 109170

Note that the computation speed of FEAT is slightly faster than that of SAPENet, although FEAT has more parameters to learn. This is because SAPENet needs to perform a k-nearest-neighbor search (k = 5) for each local descriptor of the query to find its k closest neighbors in the prototypes. Compared with current state-of-the-art methods such as DeepEMD, DMF, and SetFeat, SAPENet is much faster in both the 1-shot and 5-shot tasks. Specifically, although DeepEMD has almost the same number of learnable parameters as ProtoNet, it requires more time to solve the optimal matching problem between the support and query features. Furthermore, when the shot number is greater than 1, it learns an SFC layer by fine-tuning on the novel classes, which requires multiple forward and backward passes and thus slows down the computation. DMF learns a specific region for each local feature via complex deformable convolution, which results in twice the computational cost of SAPENet. SetFeat attaches a self-attention block to each residual block, which greatly increases the number of learnable parameters and thereby incurs a much higher computational cost in the training phase. Through the above comparisons, we find that SAPENet achieves higher classification accuracy while its computational cost is not significantly different from those of ProtoNet and FEAT. These results validate the efficiency of SAPENet in few-shot settings.
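To make the descriptor-level classification step concrete, the sketch below scores a query feature map against one enhanced prototype in the spirit of the image-to-class measure of [27]: every spatial position is treated as a local descriptor, and each query descriptor contributes the summed similarity of its k nearest prototype descriptors. This is a simplified sketch under our own assumptions (cosine similarity, the function and variable names), not the authors' exact metric module.

```python
import torch
import torch.nn.functional as F

def knn_descriptor_score(query_map: torch.Tensor,
                         prototype_map: torch.Tensor,
                         k: int = 5) -> torch.Tensor:
    """Score one query against one class prototype at the local-descriptor level.

    query_map and prototype_map are (C, H, W) feature maps; each of the H*W
    spatial positions is treated as a C-dimensional local descriptor."""
    c = query_map.shape[0]
    q = F.normalize(query_map.reshape(c, -1).t(), dim=1)      # (H*W, C) query descriptors
    p = F.normalize(prototype_map.reshape(c, -1).t(), dim=1)  # (H*W, C) prototype descriptors
    sim = q @ p.t()                                           # pairwise cosine similarities
    topk = sim.topk(k, dim=1).values                          # k closest prototype descriptors per query descriptor
    return topk.sum()

# Classification: assign the query to the class with the highest score (hypothetical shapes).
query = torch.randn(640, 5, 5)
prototypes = [torch.randn(640, 5, 5) for _ in range(5)]       # one enhanced prototype per class
scores = torch.stack([knn_descriptor_score(query, p) for p in prototypes])
prediction = scores.argmax().item()
```
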
5.7. Conclusion

In this paper, we proposed a self-attention based prototype enhancement network (SAPENet) to address the issue that mean prototypes usually contain redundant information. To obtain a more representative prototype for each class, SAPENet first utilizes a self-attention block to selectively emphasize the important local features of each support feature map. Then, unlike ProtoNet, which uses the mean operation to aggregate the intra-class features, SAPENet develops an intra-class attention block to attentively exploit the intra-class information, aiming to retain the informative channel features of each class and avoid learning redundant ones. To maximize the use of the augmented prototypes obtained by SAPENet, a descriptor-based classifier was deployed as our metric module to compute the local descriptor similarities between the query feature map and the prototypes. We compared SAPENet with state-of-the-art methods through numerous experiments on three widely used few-shot learning datasets. The results validate the effectiveness of SAPENet, which largely outperforms ProtoNet and achieves competitive performance compared with the state-of-the-art methods. Furthermore, ablation studies and visualization results demonstrated the effectiveness of the attention blocks in producing representative prototypes given only a few training samples. In terms of limitations, because the intra-class attention block is not available when the shot number is 1, the performance of SAPENet relies primarily on the self-attention block in that setting. To deal with this limitation, in future work we plan to extend the self-attention block to explore the differences between inter-class features, which can be beneficial for generating discriminative features for classification even when the shot number is 1. In addition, it is also promising to learn word embeddings or attributes of images, which can serve as auxiliary training information to help address the issue of limited training samples.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability

Data will be made available on request.
Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) funded by the Korea Government (Ministry of Science and ICT) under Grant 2022R1F1A1066267.
References

[1] A. Sellami, S. Tabbone, Deep neural networks-based relevant latent representation learning for hyperspectral image classification, Pattern Recognit. 121 (2022) 108224, doi:10.1016/j.patcog.2021.108224.
[2] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
[3] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O.R. Zaiane, M. Jagersand, U2-Net: going deeper with nested u-structure for salient object detection, Pattern Recognit. 106 (2020) 107404, doi:10.1016/j.patcog.2020.107404.
[4] Q. Sun, Y. Liu, T.-S. Chua, B. Schiele, Meta-transfer learning for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 403–412.
[5] Y. Wang, Q. Yao, J.T. Kwok, L.M. Ni, Generalizing from a few examples: a survey on few-shot learning, ACM Comput. Surv. 53 (3) (2020) 1–34, doi:10.1145/3386252.
[6] Y. Wang, R. Girshick, M. Hebert, B. Hariharan, Low-shot learning from imaginary data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7278–7286.
[7] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 1126–1135.
[8] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[9] K. Lee, S. Maji, A. Ravichandran, S. Soatto, Meta-learning with differentiable convex optimization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10657–10665.
[10] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, Meta-learning with memory-augmented neural networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 1842–1850.
[11] T. Munkhdalai, H. Yu, Meta networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 2554–2563.
[12] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2016, pp. 3637–3645.
[13] F. Sung, Y. Yang, L. Zhang, T. Xiang, P.H. Torr, T.M. Hospedales, Learning to compare: relation network for few-shot learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1199–1208.
[14] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 4080–4090.
[15] B.N. Oreshkin, P. Rodriguez, A. Lacoste, TADAM: task dependent adaptive metric for improved few-shot learning, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2018, pp. 719–729.
[16] C. Zhang, Y. Cai, G. Lin, C. Shen, DeepEMD: few-shot image classification with differentiable earth mover's distance and structured classifiers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 12203–12213.
[17] J. Liu, L. Song, Y. Qin, Prototype rectification for few-shot learning, in: Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 741–756.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 6000–6010.
[19] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7794–7803.
[20] Z. Ji, X. Chai, Y. Yu, Y. Pang, Z. Zhang, Improved prototypical networks for few-shot learning, Pattern Recognit. Lett. 140 (2020) 81–87, doi:10.1016/j.patrec.2020.07.015.
[21] H. Ye, H. Hu, D. Zhan, F. Sha, Few-shot learning via embedding adaptation with set-to-set functions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8808–8817.
[22] H. Huang, Z. Wu, W. Li, J. Huo, Y. Gao, Local descriptor-based multi-prototype network for few-shot learning, Pattern Recognit. 116 (2021) 107935, doi:10.1016/j.patcog.2021.107935.
[23] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[24] Y. Zhang, Y. Gong, H. Zhu, X. Bai, W. Tang, Multi-head enhanced self-attention network for novelty detection, Pattern Recognit. 107 (2020) 107486, doi:10.1016/j.patcog.2020.107486.
[25] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3146–3154.
[26] H. Zhao, J. Jia, V. Koltun, Exploring self-attention for image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10076–10085.
[27] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, J. Luo, Revisiting local descriptor based image-to-class measure for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7260–7268.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[29] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J.B. Tenenbaum, H. Larochelle, R.S. Zemel, Meta-learning for semi-supervised few-shot classification, in: Proceedings of the International Conference on Machine Learning (ICML), 2018.


[30] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 dataset, Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[31] W. Chen, Y. Liu, Z. Kira, Y. Wang, J. Huang, A closer look at few-shot classification, in: Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[32] F. Hao, F. He, J. Cheng, L. Wang, J. Cao, D. Tao, Collect and select: semantic alignment metric learning for few-shot learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8460–8469.
[33] C. Simon, P. Koniusz, R. Nock, M. Harandi, Adaptive subspaces for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 4136–4145.
[34] R. Hou, H. Chang, B. Ma, S. Shan, X. Chen, Cross attention network for few-shot classification, in: Proceedings of the International Conference on Neural Information Processing Systems (NIPS), 2019, pp. 4003–4014.
[35] J. He, R. Hong, X. Liu, M. Xu, Q. Sun, Revisiting local descriptor for improved few-shot classification, ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) (2022).
[36] C. Xu, Y. Fu, C. Liu, C. Wang, J. Li, F. Huang, L. Zhang, X. Xue, Learning dynamic alignment via meta-filter for few-shot learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 5182–5191.
[37] A. Afrasiyabi, H. Larochelle, J.-F. Lalonde, C. Gagné, Matching feature sets for few-shot image classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 9014–9024.
[38] Y. Tian, Y. Wang, D. Krishnan, J.B. Tenenbaum, P. Isola, Rethinking few-shot image classification: a good embedding is all you need?, in: Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 266–282.
[39] D. Wertheimer, L. Tang, B. Hariharan, Few-shot classification with feature map reconstruction networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8012–8021.
[40] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2921–2929.

Xi-Lang Huang received the M.S. degree in electrical engineering from Pusan National University, Busan, South Korea, in 2018. He is currently pursuing the Ph.D. degree in electrical engineering with Pukyong National University, Busan, South Korea. His current research interests include modeling and simulation of discrete-event systems, efficient simulation optimization, and computer vision.

Seon Han Choi received the B.S., M.S., and Ph.D. degrees in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2012, 2014, and 2018, respectively. In 2018, he was a Post-Doctoral Researcher with the Information and Electronics Research Institute, KAIST. From 2018 to 2019, he was a Senior Researcher with the Korea Institute of Industrial Technology, Ansan, South Korea. From 2019 to 2022, he was an Assistant Professor in the Department of IT Convergence and Application Engineering, Pukyong National University, Busan, South Korea. In 2022, he joined the Department of Electronic and Electrical Engineering, Ewha Womans University, Seoul, South Korea, as an Assistant Professor. His current research interests include the modeling and simulation of discrete-event systems, efficient simulation optimization under stochastic noise, evolutionary computing, and machine learning.
