
Few-Shot Object Detection with Fully Cross-Transformer

Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Shih-Fu Chang
Columbia University
{gh2561,jiawei.m,sh3813,cl3695,sc250}@columbia.edu

Abstract

Few-shot object detection (FSOD), with the aim to detect novel objects using very few training examples, has recently attracted great research interest in the community. Metric-learning based methods have been demonstrated to be effective for this task using a two-branch based siamese network, and calculate the similarity between image regions and few-shot examples for detection. However, in previous works, the interaction between the two branches is restricted to the detection head, while leaving the remaining hundreds of layers for separate feature extraction. Inspired by the recent work on vision transformers and vision-language transformers, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD by incorporating cross-transformer into both the feature backbone and detection head. The asymmetric-batched cross-attention is proposed to aggregate the key information from the two branches with different batch sizes. Our model can improve the few-shot similarity learning between the two branches by introducing multi-level interactions. Comprehensive experiments on both the PASCAL VOC and MSCOCO FSOD benchmarks demonstrate the effectiveness of our model.

1. Introduction

Few-shot object detection (FSOD) aims to detect objects from the query image using a few training examples. This is motivated by the human visual system, which can quickly learn novel concepts from very few instructions. The key point is how to quickly learn object detection models with strong generalization ability using a small amount of training data, such that the learned model can detect objects in unseen images. This is very challenging, especially for the current state-of-the-art deep-learning based methods [1, 28, 32, 33], which usually need thousands of training examples and are prone to overfitting under this data-scarce scenario.

Current methods for this task mainly follow a two-stage learning paradigm [45] to transfer the knowledge learned from the data-abundant base classes to assist in object detection for few-shot novel classes. The detailed model architectures vary in different works, and can be roughly divided into two categories: single-branch based methods [36, 45, 47, 51, 52] and two-branch based methods [8, 12, 13, 20, 23, 49]. (1) Single-branch based methods employ a typical object detection model, e.g., Faster R-CNN [33], and build a multi-class classifier for detection. They are prone to overfitting to the small training data, especially when we only have 1-shot training data per novel class. (2) Two-branch based methods apply the metric-learning idea [34, 37, 41] to FSOD and build a siamese network to process the query image and the few-shot support images in parallel. After extracting deep visual features from the two branches, previous works propose various methods (e.g., feature fusion [8, 48, 49], feature alignment [13], GCN [12], and non-local attention/transformer [2, 3, 6, 20, 44]) to calculate the similarity of the two branches. The two-branch based methods do not learn the multi-class classifier over novel classes, and usually have stronger generalization ability by learning to compare the query regions with the few-shot classes.

Figure 1. Comparison of the single-branch, two-branch based FSOD models and our proposed model.

Previous two-branch based methods have explored various interactions (e.g., alignment) between the query and support branch to improve the similarity learning. But the interactions are restricted to the detection head with high-level features, leaving the remaining hundreds of layers for separate feature extraction. In fact, the query and support images may have large visual differences and a domain gap in terms of object pose, scale, illumination, occlusion, background, etc. Simply aligning the two branches in the high-level feature space might not be optimal. If we could align the extracted features in all network layers, the network could have more capacity to focus on the common features in each layer, and improve the final similarity learning.

In this work, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD, which is a pure cross-transformer based detection model without deep convolutional networks. The ability to model long-range dependencies in the transformer [40] can capture not only the abundant context in one branch, but also the related context in the other branch, thus encouraging mutual alignment between the two branches. As shown in Figure 1, our model is based on the two-stage detection model Faster R-CNN. Instead of extracting deep visual features separately for the query and support inputs, we use the multi-layer deep cross-transformer to jointly extract features for the two branches. Inside the cross-transformer layer, we propose the asymmetric-batched cross-attention to aggregate the key information from the two branches with different batch sizes, and update the features of either branch using self-attention with the aggregated key information. Thus, we can align the features from the two branches in each of the cross-transformer layers. Then, after the joint feature extraction and proposal generation for the query image, we propose a cross-transformer based RoI feature extractor in the detection head to jointly extract RoI features for the query proposals and support images. Incorporating our cross-transformer in both the feature backbone and the RoI feature extractor can largely promote the multi-level interactions (alignment) between the query and support inputs, thus further improving the final FSOD performance.

We would like to emphasize the difference between a closely related work, ViLT [24], and ours, both using transformers for joint feature extraction of two branches. First, ViLT has language and the original image as input, and the highly abstracted language tokens interact with the visual tokens at each layer. However, visual tokens represent low-level concepts at the beginning, and evolve into high-level concepts in deep layers. Different from ViLT, we take two visual images as input, and explore multi-level interactions between the two visual branches, gradually from low-level to high-level features. Second, we focus on FSOD, a dense prediction task, instead of the classification and retrieval tasks in ViLT, and incorporate cross-transformer into both the feature backbone and detection head. Third, ViLT extracts visual tokens following ViT [7], and uses the same number of tokens throughout the model. We employ a pyramid structure [43] to extract multi-scale visual tokens, and propose the asymmetric-batched cross-attention across the branches with different batch sizes to reduce computational complexity.

Our contributions can be summarized as: (1) To the best of our knowledge, we are the first to explore and propose a vision transformer based few-shot object detection model. (2) A novel fully cross-transformer is proposed for both the feature backbone and detection head, to encourage multi-level interactions between the query and support. We also propose the asymmetric-batched cross-attention across the branches. (3) We comprehensively evaluate the proposed model on the two widely used FSOD benchmarks and achieve state-of-the-art performance.

2. Related Works

Object Detection. Object detection is one of the most fundamental tasks in computer vision. Recently, deep convolutional neural networks (DCNNs [19, 25]) have demonstrated their power to automatically learn features from large-scale training data, and are the dominant approach for object detection. Current methods using DCNNs can mainly be grouped into two categories: proposal-based methods and proposal-free methods. Proposal-based methods [11, 15, 17, 18, 33] divide object detection into two sequential stages by first generating a set of region proposals and then performing classification and bounding box regression for each proposal. Proposal-free methods [16, 28, 32, 39] directly predict the bounding boxes and the corresponding class labels on top of CNN features. Recently, transformer based object detection models [1, 53] show promising results, but still suffer from a slow convergence problem. Therefore, we choose to use one of the most representative proposal-based methods, Faster R-CNN [33], for FSOD, considering both detection accuracy and training efficiency.

Few-Shot Learning. Few-shot learning (FSL) aims to recognize novel classes using only a few examples. The key idea of FSL is to transfer knowledge from many-shot base classes to few-shot novel classes. Existing few-shot learning methods can be roughly divided into the following three categories: (1) Optimization based methods. For example, Model-Agnostic Meta-Learning (MAML [9]) learns a good initialization so that the learner can rapidly adapt to novel tasks within a few optimization steps. (2) Parameter generation based methods [10, 22]. For example, Gidaris et al. [10] propose an attention-based weight generator to generate the classifier weights for novel classes. (3) Metric-learning based methods [30, 34, 37, 41, 50]. These methods learn a generalizable similarity metric space from base classes. For example, Prototypical Networks [34] calculate a prototype for each novel class by averaging the features of the few samples, and then perform classification by a nearest neighbor search.

Few-Shot Object Detection. Few-shot object detection needs to not only recognize novel objects using a few training examples, but also localize objects in the image. Existing works can mainly be grouped into the following two categories according to the model architecture: (1) Single-branch based methods [36, 45, 47, 51, 52]. These methods attempt to learn object detection using the long-tailed training data from both the data-abundant base classes and the data-scarce novel classes. The final classification layer in the detection head is determined by the number of classes to detect. To deal with the unbalanced training set, re-sampling [45] and re-weighting [27] are the two main strategies. Wang et al. [45] show that a simple two-stage fine-tuning approach outperforms other complex meta-learning methods. Following works introduce multi-scale positive sample refinement [47], image hallucination [51], contrastive learning [36] and linguistic semantic knowledge [52] to assist in FSOD. (2) Two-branch based methods [8, 12-14, 20, 23, 49]. These methods are based on a siamese network to process the query and support in parallel, and calculate the similarity between image regions (usually proposals) and few-shot examples for detection. Kang et al. [23] first propose a feature reweighting module to aggregate the query and support features. Multiple feature fusion networks [8, 13, 48, 49] are then proposed for stronger feature aggregation. Han et al. [13] propose to perform feature alignment between the two inputs and to focus on foreground regions using attention. GCNs are employed in [12] to facilitate mutual adaptation between the two branches. Other works [2, 3, 6, 20] use more advanced non-local attention/transformer [40, 44] to improve the similarity learning of the two inputs. All these previous works show that the two-branch paradigm is a promising solution for FSOD. Our work also belongs to this category, and proposes a pure cross-transformer model to exploit the interaction between the two branches to the largest extent.

Transformer and Its Application in Computer Vision. The transformer was first introduced by Vaswani et al. [40] as a new attention-based building block for machine translation, and has become a prevalent architecture in NLP [5]. The success of the transformer can be attributed to its strong ability to model long-range dependencies using self-attention. Since then, the transformer has been extended to various vision-related tasks, e.g., vision-and-language pre-training [24, 35, 38], image classification [7, 29, 43], object detection [1, 53], etc. The pioneering work of the Vision Transformer (ViT [7]) splits an image into non-overlapping patches (similar to tokens in NLP), provides the sequence of linear embeddings of these patches as input to a transformer, and shows promising results for image classification compared with CNNs [19]. Following works, e.g., PVT [42, 43], Swin [29], and Twins [4], introduce a pyramid structure to generate multi-scale feature maps for dense prediction tasks. Spatial-reduction attention [42, 43] and shifted-window based self-attention [29] are proposed to reduce the computational complexity of the transformer. Kim et al. [24] propose a unified vision-language transformer model without convolution (ViLT [24]), to focus more on the modality interactions instead of using deep modality-specific embeddings. Our work is inspired by these previous works, and proposes a novel fully cross-transformer based FSOD model.

3. Our Approach

3.1. Task Definition

In few-shot object detection (FSOD), we have two sets of classes C = C_base ∪ C_novel with C_base ∩ C_novel = ∅, where the base classes C_base have plenty of training data per class, and the novel classes C_novel (a.k.a. support classes) only have very few training examples per class (a.k.a. support images). For K-shot (e.g., K = 1, 5, 10) object detection, we have exactly K bounding box annotations for each novel class c ∈ C_novel as the training data. The goal of FSOD is to leverage the data-abundant base classes to assist in detection for the few-shot novel classes.

3.2. Overview of Our Proposed Model (FCT)

We propose a novel Fully Cross-Transformer (FCT) based few-shot object detection model in this work. Our work belongs to the two-branch based few-shot object detection methods. The motivation is that although the traditional two-branch based methods [8, 12, 13, 20, 23, 49] show promising results, the interaction of the query and support branch is restricted to the detection head, leaving hundreds of layers for separate feature extraction in each branch before the cross-branch interaction. Our idea is to remove the separate deep feature encoders and fully exploit the cross-branch interaction to the largest extent.

An overview of our model is illustrated in Figure 2. Our model is based on the Faster R-CNN object detection framework. In Faster R-CNN, we have a feature backbone to extract deep visual features of the input. Proposals are then generated using the extracted features, and a detection head follows to extract the RoI features for each proposal and perform classification and bounding box (bbox) refinement. Inspired by the recent vision transformers and vision-language transformers, we propose a pure cross-transformer based few-shot object detection model without deep convolutional networks. Specifically, the cross-transformer is incorporated into both the feature backbone and the detection head. We show in Section 3.3 how we jointly extract features for both the query and support images using our cross-transformer feature backbone, and similarly in Section 3.4 we show the details of our cross-transformer detection head. The model training framework is introduced in Section 3.5.
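To make this two-branch pipeline concrete, the following is a minimal, shape-level PyTorch sketch of one detection episode (a single query image paired with the supports of one class). Every module and function name here is a hypothetical stand-in (plain convolutions and linear layers), not the released implementation; in the actual model they are replaced by the cross-transformer stages, an RPN, and the pairwise matching head described in Sections 3.3 and 3.4.

import torch
from torch import nn
from torchvision.ops import roi_align

class FCTSketch(nn.Module):
    # Placeholder pipeline mirroring Figure 2: joint backbone for both branches,
    # RoI extraction on query proposals, support averaging, pairwise matching.
    def __init__(self, dim=256):
        super().__init__()
        # Stages 1-3 stand-in: one shared 16x down-sampling layer for both branches.
        self.backbone = nn.Sequential(nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.ReLU())
        # Stage 4 stand-in: RoI feature extractor applied to proposals and support.
        self.roi_head = nn.Sequential(nn.Flatten(), nn.Linear(dim * 7 * 7, dim), nn.ReLU())
        # Pairwise matching head: binary matching score + box deltas per proposal.
        self.cls = nn.Linear(2 * dim, 1)
        self.reg = nn.Linear(2 * dim, 4)

    def forward(self, query, supports, proposals):
        # query: (1, 3, H, W); supports: (B_s, 3, H_s, W_s), all of one class;
        # proposals: (N, 4) boxes on the query image (normally produced by an RPN).
        f_q = self.backbone(query)                                  # (1, dim, H/16, W/16)
        f_s = self.backbone(supports).mean(dim=0, keepdim=True)     # average the supports
        rois = roi_align(f_q, [proposals], output_size=7, spatial_scale=1 / 16)
        r_q = self.roi_head(rois)                                   # (N, dim) proposal features
        r_s = self.roi_head(nn.functional.adaptive_avg_pool2d(f_s, 7)).expand_as(r_q)
        pair = torch.cat([r_q, r_s], dim=1)                         # query-support pairs
        return self.cls(pair).sigmoid(), self.reg(pair)             # matching score, box deltas

# One episode: a query image, 5 supports of a class, and 3 candidate boxes.
model = FCTSketch()
boxes = torch.tensor([[10., 10., 100., 120.], [50., 40., 200., 220.], [0., 0., 64., 64.]])
scores, deltas = model(torch.randn(1, 3, 320, 320), torch.randn(5, 3, 128, 128), boxes)
print(scores.shape, deltas.shape)   # torch.Size([3, 1]) torch.Size([3, 4])

Unlike this stub, the full model lets the two branches exchange information inside Stages 1-4 through the asymmetric-batched cross-attention detailed next, rather than processing them independently.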

Figure 2. The overall architecture of our proposed Fully Cross-Transformer based few-shot object detection model (FCT).

3.3. The Cross-Transformer Feature Backbone

We have three stages of cross-transformer modules in the feature backbone for joint feature extraction of the query and support inputs. In the first stage, we have a single query image I_q ∈ R^{1×H_{I_q}×W_{I_q}×3} and a batch of support images I_s ∈ R^{B_s×H_{I_s}×W_{I_s}×3} of the same class as inputs, where B_s ≥ 1. We first split the original RGB images into non-overlapping 4×4×3 patches. Then the flattened patches go through a linear patch embedding layer and are projected to C_1 dimensions. The embedded patch sequences X_q ∈ R^{N_1^q×C_1} (N_1^q = (H_{I_q}/4)×(W_{I_q}/4)) and X_s ∈ R^{N_1^s×C_1} (N_1^s = (H_{I_s}/4)×(W_{I_s}/4)) of the two branches are fed into several cross-transformer layers. The second and third stages share a similar architecture as the first stage, and generate feature maps with gradually decreasing sequence lengths and increasing channel dimensions.

Following the vanilla transformer [40], our cross-transformer layer consists of the proposed multi-head asymmetric-batched cross-attention and two feed-forward layers, with LayerNorm (LN), GELU non-linearity and residual connections in between.

Specifically, the position embeddings E^q_pos ∈ R^{N_1^q×C_1} and E^s_pos ∈ R^{N_1^s×C_1} and the branch embedding E_bra ∈ R^{2×C_1} are first added to the input patch sequences X_q and X_s to retain the position and branch information,

    X'_q = X_q + E^q_pos + E_bra[0],    X'_s = X_s + E^s_pos + E_bra[1]    (1)

In multi-head cross-attention, we map the input patch sequence X'_q to Q^i_q, K^i_q, V^i_q and X'_s to Q^i_s, K^i_s, V^i_s in head i (i = 1...h, where h is the number of heads), following the Q-K-V attention in the transformer [40]. In order to reduce the computational complexity of the attention, especially in the early layers, inspired by PVT [43], we use the spatial-reduction operation to sub-sample the feature maps for K and V. Another benefit is that we can summarize the key information using the sub-sampled K and V,

    Q^i_q = X'_q W^i_Q,         Q^i_s = X'_s W^i_Q         (2)
    K^i_q = SR(X'_q) W^i_K,     K^i_s = SR(X'_s) W^i_K     (3)
    V^i_q = SR(X'_q) W^i_V,     V^i_s = SR(X'_s) W^i_V     (4)

where W^i_Q ∈ R^{C_1×d_h}, W^i_K ∈ R^{C_1×d_h}, W^i_V ∈ R^{C_1×d_h} are the learnable weights of the linear projections, which are shared between the two branches. The dimension of the projected features is d_h = C_1/h, the same in each head. SR(·) is the spatial-reduction operation, and can be implemented by a strided convolution layer or a spatial pooling layer.

The Asymmetric-Batched Cross-Attention. The batch sizes of the query branch and the support branch are different. We perform detection for each query image separately, because different query images are irrelevant and their detections are independent from each other. For the support branch, the novel classes are also processed one-by-one, but the number of support images for one class could be arbitrary. The naive implementation of only forwarding a query image and a single support image each time, and repeating the process for each support image, can be extremely slow. Therefore, we propose the asymmetric-batched cross-attention to calculate the attention between the query image and all support images of the same class at one time.

As shown in Figure 3, the cross-attention layer aggregates the key information (K-V pairs) from the two branches for attention. To aggregate the K-V pairs from the support branch to the query branch, we first conduct average pooling over the multiple support images to match the batch size of the query branch, and then concatenate the K-V pairs of the two branches.
Figure 3. The proposed Asymmetric-Batched Cross-Attention in our cross-transformer feature backbone.

Similarly, to aggregate the K-V pairs from the query branch to the support branch, we first repeat the query image B_s times along the batch dimension, and then concatenate the K-V pairs of the two branches,

    K^i_{q_cat} = [K^i_q, (1/B_s) Σ_{B_s} K^i_s],    (5)
    V^i_{q_cat} = [V^i_q, (1/B_s) Σ_{B_s} V^i_s],    (6)
    K^i_{s_cat} = [REP(K^i_q, B_s), K^i_s],          (7)
    V^i_{s_cat} = [REP(V^i_q, B_s), V^i_s],          (8)

where [·, ·] denotes concatenation along the token dimension, and REP(A, b) repeats the tensor A b times along the batch dimension.

Thus, the multi-head asymmetric-batched cross-attention can be summarized as

    X''_q = Concat(head^1_q, ..., head^h_q) W_O,                          (9)
        where head^i_q = Attention(Q^i_q, K^i_{q_cat}, V^i_{q_cat}),      (10)
    X''_s = Concat(head^1_s, ..., head^h_s) W_O,                          (11)
        where head^i_s = Attention(Q^i_s, K^i_{s_cat}, V^i_{s_cat}),      (12)

where W_O ∈ R^{(h·d_h)×C_1} is the weight of the projection back to the original feature space, shared between the two branches.

Then the feed-forward network is applied to each patch to obtain stronger feature representations, following [40],

    X'''_q = MLP(LN(X''_q)) + X''_q,    (13)
    X'''_s = MLP(LN(X''_s)) + X''_s.    (14)

Remarks. We thoroughly study the multi-level interactions between the two visual branches in our proposed model. The three stages in our cross-transformer feature backbone enable efficient interactions of the two branches with low-level, mid-level and high-level visual features, gradually.

3.4. The Cross-Transformer Detection Head

In the detection head, we first follow the previous work [8] to generate class-specific proposals in the query image, and use RoIAlign [18] to extract the initial RoI features for each proposal, f_p ∈ R^{B_p×H'×W'×C_3}, and similarly for the support branch, f_s ∈ R^{B_s×H'×W'×C_3}. (B_p = 100 by default, and H' = W' = 14, the default spatial size after RoIAlign.) Then the RoI feature extractor, which is also Stage 4 of our cross-transformer, jointly extracts the RoI features for the proposals and support images before the final detection. In order to reduce the computational complexity, we take the average of all support images, f'_s = (1/B_s) Σ_{B_s} f_s, such that f'_s ∈ R^{1×H'×W'×C_3}. We use the proposed asymmetric-batched cross-attention to calculate the attention between the two branches f_p and f'_s, similarly to the feature backbone. The difference is that the batch size of the query proposals is B_p ≥ 1 and B_s = 1 for the support branch, which is the reverse of the arrangement in the backbone.

After the joint RoI feature extraction, we use the pairwise matching network in [8] for the final detection. Binary cross-entropy loss and bbox regression loss are employed for training, following [8].

Remarks. We follow the vanilla Faster R-CNN object detection framework, and do not use FPN [26] in our model. We find that using FPN does not improve the performance, especially for the two-branch based FSOD methods [3, 8, 12, 20, 48, 49]. The cross-transformer based RoI feature extractor in the detection head can encourage mutual alignment between the query proposals and support images, which is crucial for the final pairwise matching.

3.5. The Model Training Framework

We have three steps for model training.

Pretraining the single-branch based model over base classes. In the first step, we pretrain our model without using the cross-transformer. Specifically, we use the vanilla Faster R-CNN model with the vision transformer backbone [42, 43], and only train the model using the base-class dataset.

Training the two-branch based model over base classes. Then we train the proposed two-branch based model with the fully cross-transformer using the base-class dataset, initialized by the pretrained model from the first step. Our proposed FCT model can reuse most of the parameters of the model learned in the first step. The good initialization point from the first step can ease the training of our FCT model.
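To illustrate Eqs. (5)-(12) above, here is a rough sketch of the asymmetric-batched cross-attention written as plain tensor operations on per-head Q/K/V of shape (batch, heads, tokens, head_dim), e.g., the outputs of the projection sketch in Section 3.3. The softmax scaling, the omission of the output projection W_O and of the feed-forward network, and the example shapes are assumptions for illustration; this is one reading of the equations, not the authors' code.

import torch

def asymmetric_batched_cross_attention(q_q, k_q, v_q, q_s, k_s, v_s):
    # q_*/k_*/v_*: per-head tensors of shape (batch, heads, tokens, head_dim);
    # the query branch has batch 1 and the support branch batch B_s.
    b_s = k_s.shape[0]
    d = q_q.shape[-1]

    # Eqs. (5)-(6): average the support K-V over its batch to match the query
    # batch size, then concatenate along the token dimension.
    k_q_cat = torch.cat([k_q, k_s.mean(dim=0, keepdim=True)], dim=2)
    v_q_cat = torch.cat([v_q, v_s.mean(dim=0, keepdim=True)], dim=2)

    # Eqs. (7)-(8): repeat (REP) the query K-V B_s times along the batch
    # dimension, then concatenate with the support K-V along the token dimension.
    k_s_cat = torch.cat([k_q.expand(b_s, -1, -1, -1), k_s], dim=2)
    v_s_cat = torch.cat([v_q.expand(b_s, -1, -1, -1), v_s], dim=2)

    # Eqs. (9)-(12): standard scaled dot-product attention per head; the shared
    # output projection W_O and the FFN of Eqs. (13)-(14) are left out here.
    def attend(q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return attn @ v

    return attend(q_q, k_q_cat, v_q_cat), attend(q_s, k_s_cat, v_s_cat)

# Example: 1 query image (400 query tokens, 100 sub-sampled K/V tokens) and
# B_s = 5 support images (144 query tokens, 36 sub-sampled K/V tokens), 2 heads.
q_q = torch.randn(1, 2, 400, 32)
k_q, v_q = torch.randn(1, 2, 100, 32), torch.randn(1, 2, 100, 32)
q_s = torch.randn(5, 2, 144, 32)
k_s, v_s = torch.randn(5, 2, 36, 32), torch.randn(5, 2, 36, 32)
out_q, out_s = asymmetric_batched_cross_attention(q_q, k_q, v_q, q_s, k_s, v_s)
print(out_q.shape, out_s.shape)  # torch.Size([1, 2, 400, 32]) torch.Size([5, 2, 144, 32])

The same routine would cover the detection head of Section 3.4 with the roles reversed: there the "query" side is the batch of B_p proposal features and the support side is the single averaged support feature.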

Fine-tuning the two-branch based model over novel classes. Finally, we fine-tune our FCT model on a sub-sampled dataset of base and novel classes with K-shot samples per class, following the previous works [8, 45]. Fine-tuning could largely improve the adaptation of our model to novel classes by seeing a few examples during training.

4. Experimental Results

4.1. Datasets

We evaluate our model on two widely-used few-shot object detection benchmarks as follows.

PASCAL VOC. Following previous works [23, 45], we have three random partitions of base and novel categories. In each partition, the 20 PASCAL VOC categories are split into 15 base classes and 5 novel classes. We sample the few-shot images following [36, 45], and report AP50 results under shots 1, 2, 3, 5, and 10. We report both single-run results using the exact same few-shot images as [23, 45] and the average results of multiple runs.

MSCOCO. We use the 20 PASCAL VOC categories as novel classes and the remaining 60 categories as base classes. We sample the few-shot images following [36, 45], and report the detection accuracy AP under shots 1, 2, 3, 5, 10 and 30, following [12, 31, 45]. We report both single-run results using the exact same few-shot images as [23, 45] and the average results of multiple runs. We use the MSCOCO dataset under 2/10/30-shot for the ablation study in Section 4.3.

4.2. Implementation Details

We implement our model based on the improved Pyramid Vision Transformer PVTv2 [42], and follow most of the model designs and hyperparameters in PVTv2. The reason is that, first, PVTv2 is a pure transformer backbone and has shown strong performance on image classification, object detection, etc. Second, the spatial-reduction attention (SRA) was initially proposed to reduce the computation overhead in PVT [43] and PVTv2 [42]. We find that it is also an effective way to summarize the key information in high-resolution features. Inspired by this, we propose the asymmetric-batched cross-attention, which aggregates the key information from the two branches for attention using the sub-sampled features.

For experiments, we use the PVTv2 model variants PVTv2-B0, PVTv2-B1, PVTv2-B2 and PVTv2-B2-Li for implementation. We do not use PVTv2-B3 or larger models due to the GPU memory limit. Our model is initialized from the ImageNet pretrained model provided by [42]. We use PVTv2-B2-Li as the default model because it can largely reduce the training/testing time using the pooling based spatial-reduction attention, while maintaining high detection accuracy. The detailed training hyperparameters (e.g., epochs, learning rate) are included in the supplementary file.

4.3. Ablation Study

We perform the ablation study on the model architecture and training strategy in Tables 1, 2, and 3.

Single-branch baseline model versus two-branch baseline model. First, we compare the single-branch baseline model [45] and the two-branch baseline model [8] in Table 1 (a-d). We compare the performance of the two models using two feature backbones, ResNet-101 and PVTv2-B2-Li. Using the stronger transformer backbone, we achieve much higher FSOD accuracy. The two-branch based model outperforms the single-branch one with either of the two backbones, especially in extremely few-shot settings, e.g., 2/10-shot. The reason is that the single-branch based model is prone to overfitting to the few-shot training data, while the two-branch based model has stronger generalization ability by learning to compare the query regions with the few-shot classes.

How does each of the cross-transformer blocks help FSOD? We study the functions of the four cross-transformer stages in Table 1 (e-j). (1) We conduct experiments using only one cross-transformer stage, leaving the other three stages for separate processing, in Table 1 (e-h). The results show the effectiveness of all four cross-transformer stages due to the mutual alignment of the two branches and feature fusion. Among the four stages, Stage 4 in the detection head improves the most. This is because the objective of FSOD is to compare the proposal features with the support features, and Stage 4 unifies the RoI feature extraction of the two branches before the final comparison. (2) Using the first three stages results in our cross-transformer feature backbone (Table 1 (i)), which further improves the performance compared with using any of these stages alone. Finally, our fully cross-transformer (FCT) model (Table 1 (j)) achieves the best results with the cross-transformer feature backbone and detection head. (3) The visualization of the cross-attention masks in the four stages is shown in Figure 4. From Figure 4, we have the following observations: i) In the early stages (e.g., Stage 1), the attention masks spread out over regions with similar color and texture, which aligns the low-level feature spaces of the two branches. ii) In the later stages, the attention masks focus more on semantically related local regions, which aligns the two high-level feature spaces.

The comparison of model performance using different backbones. We conduct experiments using different PVTv2 variants as the backbone in Table 1 (j-m). The PVTv2-B2 based model outperforms the models based on PVTv2-B0 and PVTv2-B1 due to its larger model capacity. The PVTv2-B2-Li based model has very similar performance compared with PVTv2-B2, and is faster in training/testing speed. Therefore, we use PVTv2-B2-Li by default.

Table 1. Ablation study on each component in our model using various backbones, tested on the MSCOCO dataset. † The listed stages use our cross-transformer block in place of the original block. ‡ The baseline model has no cross-branch interaction in the feature backbone and RoI feature extractor.

      Backbone     | Cross-transformer stages†    | 2-shot AP / AP50 / AP75 | 10-shot AP / AP50 / AP75 | 30-shot AP / AP50 / AP75
(a)   ResNet101    | single-branch baseline [45]‡ | 4.6 / 8.3 / 4.8         | 10.0 / 19.1 / 9.3        | 13.7 / 24.9 / 13.4
(b)   ResNet101    | two-branch baseline [8]‡     | 5.6 / 14.0 / 3.9        | 9.6 / 20.7 / 7.7         | 13.5 / 28.5 / 11.7
(c)   PVTv2-B2-Li  | single-branch baseline [45]‡ | 5.3 / 9.5 / 5.2         | 14.5 / 26.5 / 13.9       | 19.7 / 33.6 / 19.9
(d)   PVTv2-B2-Li  | two-branch baseline [8]‡     | 7.0 / 12.8 / 6.7        | 15.3 / 27.3 / 15.3       | 19.5 / 32.7 / 19.8
(e)   PVTv2-B2-Li  | Stage 1 only                 | 7.1 / 13.0 / 6.8        | 15.7 / 28.3 / 15.4       | 20.2 / 33.6 / 20.5
(f)   PVTv2-B2-Li  | Stage 2 only                 | 7.3 / 13.1 / 7.0        | 16.2 / 28.5 / 16.0       | 20.4 / 33.9 / 20.8
(g)   PVTv2-B2-Li  | Stage 3 only                 | 7.4 / 13.3 / 7.3        | 16.1 / 28.5 / 15.8       | 20.5 / 33.8 / 20.9
(h)   PVTv2-B2-Li  | Stage 4 only                 | 7.7 / 13.5 / 7.7        | 16.4 / 28.9 / 16.3       | 20.7 / 34.1 / 21.5
(i)   PVTv2-B2-Li  | Stages 1-3                   | 7.6 / 13.7 / 7.6        | 16.5 / 29.6 / 16.2       | 20.8 / 34.9 / 21.2
(j)   PVTv2-B2-Li  | Stages 1-4 (full FCT)        | 7.9 / 14.2 / 7.9        | 17.1 / 30.2 / 17.0       | 21.4 / 35.5 / 22.1
(k)   PVTv2-B0     | Stages 1-4 (full FCT)        | 4.6 / 8.1 / 4.2         | 10.2 / 20.1 / 8.7        | 13.7 / 27.5 / 11.8
(l)   PVTv2-B1     | Stages 1-4 (full FCT)        | 5.3 / 9.5 / 5.0         | 12.1 / 23.9 / 10.2       | 17.3 / 33.4 / 15.6
(m)   PVTv2-B2     | Stages 1-4 (full FCT)        | 7.3 / 13.7 / 7.2        | 16.3 / 29.6 / 16.4       | 20.6 / 37.2 / 20.8

Table 2. Ablation study on the aggregation of the key-value pairs from the two branches.

Method            | 2-shot AP / AP50 / AP75 | 10-shot AP / AP50 / AP75
Addition          | 6.5 / 11.9 / 6.2        | 15.0 / 26.2 / 14.8
Multiplication    | 6.7 / 12.0 / 6.7        | 15.1 / 26.9 / 15.0
W/o branch embed  | 7.7 / 14.0 / 7.8        | 17.0 / 29.8 / 17.0
W/ branch embed   | 7.9 / 14.2 / 7.9        | 17.1 / 30.2 / 17.0

Table 3. Ablation study on the model training framework.

Single-branch pretraining | 2-shot AP / AP50 / AP75 | 10-shot AP / AP50 / AP75
(none)                    | 5.3 / 10.3 / 5.0        | 14.1 / 25.5 / 13.3
✓                         | 7.9 / 14.2 / 7.9        | 17.1 / 30.2 / 17.0

Figure 4. Visualization of the multi-level cross-attention in our model (RED means larger value). Using the white-box area (near the eye of the horse in the query) as Q, we show the corresponding cross-attention masks in both the query image and the 1-shot support image. We visualize the last cross-transformer layer in all four stages. The white boxes with different sizes in each stage are determined by the actual patch sizes in the input.

The ablation study on the information aggregation across branches. To perform cross-attention with the two branches, we need to aggregate the key information from both of them. Specifically, we use the concatenation operation with branch embedding to aggregate the K-V pairs from the two branches, without losing the original information. (1) We conduct experiments using element-wise addition and multiplication for aggregating the K-V pairs of the two branches. The results are much worse compared with using the concatenation, as shown in Table 2, due to the potential information loss. (2) The branch embedding can identify which branch a feature comes from, and slightly improves the performance in Table 2.

The importance of the three-step training framework. We have three steps for model training. The first and second steps are both pre-training, performed over the data-abundant base classes. We conduct experiments with and without the first-step pre-training in Table 3. Using the single-branch pre-training leads to a large improvement. This is because the single-branch method with a multi-class classifier is good at learning a stronger feature backbone over the large-scale base-class training data, while our two-branch based method is better suited to the few-shot scenario by learning how to compare. Therefore, we combine the strengths of the two methods in the first two steps of the training. The pre-trained model from the first step provides a good initialization, which can help ease the training in the second step.

4.4. Comparison with the State-of-the-arts (SOTAs)

We compare our proposed FCT with the recent state-of-the-arts on the PASCAL VOC and MSCOCO FSOD benchmarks in Tables 4 and 5. We report both the single-run and multiple-run results following [36, 45] on the two benchmarks. Compared with the existing two-branch based methods, we achieve the SOTAs across most of the shots under the two evaluation settings in the two benchmarks. Compared with the single-branch based methods, we achieve the second best results under the multiple-run setting.
DeFRCN [31], a highly-optimized single-branch based method, reports the best results with multiple runs. It proposes a Gradient Decoupled Layer to adjust the degree of decoupling among the backbone, RPN, and R-CNN through the gradients, and also a post-processing Prototypical Calibration Block. Different from that, we propose a novel two-branch based FSOD model, and achieve the best results on the most challenging MSCOCO 1-shot setting with multiple runs. This is because we do not learn the multi-class classifier over novel classes, and instead learn the class-agnostic comparison network between the query and support, which is shared among all classes. Thus, our method can mitigate the data scarcity problem under the 1-shot setting and improve the model generalization ability.

Table 4. Few-shot object detection results (AP50) on the PASCAL VOC dataset. We report both single run results and the average results of multiple runs. S: Single-branch based methods. T: Two-branch based methods.

Type | Method | Venue | Backbone | Novel Set 1 (1 / 2 / 3 / 5 / 10 shot) | Novel Set 2 (1 / 2 / 3 / 5 / 10 shot) | Novel Set 3 (1 / 2 / 3 / 5 / 10 shot)

Single run results, using the exact same few-shot samples as [45]:
S | MetaDet [46] | ICCV 2019 | VGG16 | 18.9 / 20.6 / 30.2 / 36.8 / 49.6 | 21.8 / 23.1 / 27.8 / 31.7 / 43.0 | 20.6 / 23.9 / 29.4 / 43.9 / 44.1
S | TFA w/ cos [45] | ICML 2020 | ResNet-101 | 39.8 / 36.1 / 44.7 / 55.7 / 56.0 | 23.5 / 26.9 / 34.1 / 35.1 / 39.1 | 30.8 / 34.8 / 42.8 / 49.5 / 49.8
S | MPSR [47] | ECCV 2020 | ResNet-101 | 41.7 / 42.5 / 51.4 / 55.2 / 61.8 | 24.4 / 29.3 / 39.2 / 39.9 / 47.8 | 35.6 / 41.8 / 42.3 / 48.0 / 49.7
S | SRR-FSD [52] | CVPR 2021 | ResNet-101 | 47.8 / 50.5 / 51.3 / 55.2 / 56.8 | 32.5 / 35.3 / 39.1 / 40.8 / 43.8 | 40.1 / 41.5 / 44.3 / 46.9 / 46.4
S | CoRPNs + Halluc [51] | CVPR 2021 | ResNet-101 | 47.0 / 44.9 / 46.5 / 54.7 / 54.7 | 26.3 / 31.8 / 37.4 / 37.4 / 41.2 | 40.4 / 42.1 / 43.3 / 51.4 / 49.6
S | FSCE [36] | CVPR 2021 | ResNet-101 | 44.2 / 43.8 / 51.4 / 61.9 / 63.4 | 27.3 / 29.5 / 43.5 / 44.2 / 50.2 | 37.2 / 41.9 / 47.5 / 54.6 / 58.5
T | FSRW [23] | ICCV 2019 | YOLOv2 | 14.8 / 15.5 / 26.7 / 33.9 / 47.2 | 15.7 / 15.3 / 22.7 / 30.1 / 40.5 | 21.3 / 25.6 / 28.4 / 42.8 / 45.9
T | Meta R-CNN [49] | ICCV 2019 | ResNet-101 | 19.9 / 25.5 / 35.0 / 45.7 / 51.5 | 10.4 / 19.4 / 29.6 / 34.8 / 45.4 | 14.3 / 18.2 / 27.5 / 41.2 / 48.1
T | Fan et al. [8] | CVPR 2020 | ResNet-101 | 37.8 / 43.6 / 51.6 / 56.5 / 58.6 | 22.5 / 30.6 / 40.7 / 43.1 / 47.6 | 31.0 / 37.9 / 43.7 / 51.3 / 49.8
T | QA-FewDet [12] | ICCV 2021 | ResNet-101 | 42.4 / 51.9 / 55.7 / 62.6 / 63.4 | 25.9 / 37.8 / 46.6 / 48.9 / 51.1 | 35.2 / 42.9 / 47.8 / 54.8 / 53.5
T | Meta Faster R-CNN [13] | AAAI 2022 | ResNet-101 | 43.0 / 54.5 / 60.6 / 66.1 / 65.4 | 27.7 / 35.5 / 46.1 / 47.8 / 51.4 | 40.6 / 46.4 / 53.4 / 59.9 / 58.6
T | FCT (Ours) | This work | PVTv2-B2-Li | 49.9 / 57.1 / 57.9 / 63.2 / 67.1 | 27.6 / 34.5 / 43.7 / 49.2 / 51.2 | 39.5 / 54.7 / 52.3 / 57.0 / 58.7

Average results of multiple runs, following [45]:
S | TFA w/ cos [45] | ICML 2020 | ResNet-101 | 25.3 / 36.4 / 42.1 / 47.9 / 52.8 | 18.3 / 27.5 / 30.9 / 34.1 / 39.5 | 17.9 / 27.2 / 34.3 / 40.8 / 45.6
S | FSCE [36] | CVPR 2021 | ResNet-101 | 32.9 / 44.0 / 46.8 / 52.9 / 59.7 | 23.7 / 30.6 / 38.4 / 43.0 / 48.5 | 22.6 / 33.4 / 39.5 / 47.3 / 54.0
S | DeFRCN [31] | ICCV 2021 | ResNet-101 | 40.2 / 53.6 / 58.2 / 63.6 / 66.5 | 29.5 / 39.7 / 43.4 / 48.1 / 52.8 | 35.0 / 38.3 / 52.9 / 57.7 / 60.8
T | Xiao et al. [48] | ECCV 2020 | ResNet-101 | 24.2 / 35.3 / 42.2 / 49.1 / 57.4 | 21.6 / 24.6 / 31.9 / 37.0 / 45.7 | 21.2 / 30.0 / 37.2 / 43.8 / 49.6
T | DCNet [21] | CVPR 2021 | ResNet-101 | 33.9 / 37.4 / 43.7 / 51.1 / 59.6 | 23.2 / 24.8 / 30.6 / 36.7 / 46.6 | 32.3 / 34.9 / 39.7 / 42.6 / 50.7
T | FCT (Ours) | This work | PVTv2-B2-Li | 38.5 / 49.6 / 53.5 / 59.8 / 64.3 | 25.9 / 34.2 / 40.1 / 44.9 / 47.4 | 34.7 / 43.9 / 49.3 / 53.1 / 56.3

Table 5. Few-shot object detection results (AP) on the MSCOCO dataset. S: Single-branch based methods. T: Two-branch based methods.

Type | Method | Shot: 1 / 2 / 3 / 5 / 10 / 30

Single run results, using the exact same few-shot samples as [45]:
S | MetaDet [46] | – / – / – / – / 7.1 / 11.3
S | TFA w/ cos [45] | 3.4 / 4.6 / 6.6 / 8.3 / 10.0 / 13.7
S | MPSR [47] | 2.3 / 3.5 / 5.2 / 6.7 / 9.8 / 14.1
S | SRR-FSD [52] | – / – / – / – / 11.3 / 14.7
S | TFA + Halluc [51] | 4.4 / 5.6 / 7.2 / – / – / –
S | FSCE [36] | – / – / – / – / 11.9 / 16.4
T | FSRW [23] | – / – / – / – / 5.6 / 9.1
T | Meta R-CNN [49] | – / – / – / – / 8.7 / 12.4
T | Fan et al. [8] | 4.2 / 5.6 / 6.6 / 8.0 / 9.6 / 13.5
T | QA-FewDet [12] | 4.9 / 7.6 / 8.4 / 9.7 / 11.6 / 16.5
T | Meta Faster R-CNN [13] | 5.1 / 7.6 / 9.8 / 10.8 / 12.7 / 16.6
T | FCT (Ours) | 5.6 / 7.9 / 11.1 / 14.0 / 17.1 / 21.4

Average results of multiple runs, following [45]:
S | TFA w/ cos [45] | 1.9 / 3.9 / 5.1 / 7.0 / 9.1 / 12.1
S | FSCE [36] | – / – / – / – / 11.1 / 15.3
S | DeFRCN [31] | 4.8 / 8.5 / 10.7 / 13.6 / 16.8 / 21.2
T | Xiao et al. [48] | 4.5 / 6.6 / 7.2 / 10.7 / 12.5 / 14.7
T | DCNet [21] | – / – / – / – / 12.8 / 18.6
T | FCT (Ours) | 5.1 / 7.2 / 9.8 / 12.0 / 15.3 / 20.2

5. Conclusion

We propose a novel fully cross-transformer based few-shot object detection model (FCT) in this work, by incorporating cross-transformer into both the feature backbone and detection head. The asymmetric-batched cross-attention is proposed to aggregate the K-V pairs from the query and support branches with different batch sizes. We show both quantitative results on the two widely used FSOD benchmarks and qualitative visualization of the multi-level cross-attention learned in our model. All this evidence demonstrates the effectiveness of the proposed multi-level interactions between the query and support branches. We hope our work can inspire future work on two-branch based FSOD methods.

Acknowledgements

This material is based on research sponsored by the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, DARPA or the U.S. Government.

References

[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[2] Ding-Jie Chen, He-Yen Hsieh, and Tyng-Luh Liu. Adaptive image transformer for one-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12247–12256, 2021.
[3] Tung-I Chen, Yueh-Cheng Liu, Hung-Ting Su, Yu-Cheng Chang, Yu-Hsiang Lin, Jia-Fong Yeh, Wen-Chin Chen, and Winston Hsu. Dual-awareness attention for few-shot object detection. IEEE Transactions on Multimedia, pages 1–1, 2021.
[4] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. In Advances in Neural Information Processing Systems, 2021.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Carl Doersch, Ankush Gupta, and Andrew Zisserman. Crosstransformers: spatially-aware few-shot transfer. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21981–21993, 2020.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[8] Qi Fan, Wei Zhuo, Chi-Keung Tang, and Yu-Wing Tai. Few-shot object detection with attention-rpn and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4013–4022, 2020.
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
[10] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
[11] Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] Guangxing Han, Yicheng He, Shiyuan Huang, Jiawei Ma, and Shih-Fu Chang. Query adaptive few-shot object detection with heterogeneous graph convolutional networks. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 3263–3272, October 2021.
[13] Guangxing Han, Shiyuan Huang, Jiawei Ma, Yicheng He, and Shih-Fu Chang. Meta faster r-cnn: Towards accurate few-shot object detection with attentive feature alignment. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2022.
[14] Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Rama Chellappa, and Shih-Fu Chang. Multimodal few-shot object detection with meta-learning based cross-modal prompting. 2022.
[15] Guangxing Han, Xuan Zhang, and Chongrong Li. Revisiting faster r-cnn: a deeper look at region proposal network. In International Conference on Neural Information Processing, pages 14–24. Springer, 2017.
[16] Guangxing Han, Xuan Zhang, and Chongrong Li. Single shot object detection with top-down refinement. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3360–3364. IEEE, 2017.
[17] Guangxing Han, Xuan Zhang, and Chongrong Li. Semi-supervised dff: Decoupling detection and feature flow for video object detectors. In 26th ACM International Conference on Multimedia, pages 1811–1819, 2018.
[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[20] Ting-I Hsieh, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu. One-shot object detection with co-attention and co-excitation. In Advances in Neural Information Processing Systems, pages 2725–2734, 2019.
[21] Hanzhe Hu, Shuai Bai, Aoxue Li, Jinshi Cui, and Liwei Wang. Dense relation distillation with context-aware aggregation for few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10185–10194, June 2021.
[22] Shiyuan Huang, Jiawei Ma, Guangxing Han, and Shih-Fu Chang. Task-adaptive negative class envision for few-shot open-set recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[23] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell. Few-shot object detection via feature reweighting. In IEEE International Conference on Computer Vision, pages 8420–8429, 2019.
[24] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In 38th International Conference on Machine Learning, volume 139, pages 5583–5594. PMLR, 18–24 Jul 2021.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[28] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In International Conference on Computer Vision, 2021.
[30] Jiawei Ma, Hanchen Xie, Guangxing Han, Shih-Fu Chang, Aram Galstyan, and Wael Abd-Almageed. Partner-assisted learning for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10573–10582, October 2021.
[31] Limeng Qiao, Yuxuan Zhao, Zhiyuan Li, Xi Qiu, Jianan Wu, and Chi Zhang. Defrcn: Decoupled faster r-cnn for few-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8681–8690, October 2021.
[32] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[34] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[35] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, 2020.
[36] Bo Sun, Banghuai Li, Shengcai Cai, Ye Yuan, and Chi Zhang. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7352–7362, June 2021.
[37] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
[38] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
[39] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9627–9636, 2019.
[40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[41] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
[42] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvtv2: Improved baselines with pyramid vision transformer. arXiv preprint arXiv:2106.13797, 2021.
[43] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 568–578, October 2021.
[44] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[45] Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. In International Conference on Machine Learning (ICML), July 2020.
[46] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Meta-learning to detect rare objects. In IEEE International Conference on Computer Vision, pages 9925–9934, 2019.
[47] Jiaxi Wu, Songtao Liu, Di Huang, and Yunhong Wang. Multi-scale positive sample refinement for few-shot object detection. In European Conference on Computer Vision, pages 456–472. Springer, 2020.
[48] Yang Xiao and Renaud Marlet. Few-shot object detection and viewpoint estimation for objects in the wild. In European Conference on Computer Vision, 2020.
[49] Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xiaodan Liang, and Liang Lin. Meta r-cnn: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 9577–9586, 2019.
[50] Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi, Nanne Van Noord, and Giorgos Tolias. The met dataset: Instance-level recognition for artworks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[51] Weilin Zhang and Yu-Xiong Wang. Hallucination improves few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13008–13017, June 2021.
[52] Chenchen Zhu, Fangyi Chen, Uzair Ahmed, Zhiqiang Shen, and Marios Savvides. Semantic relation reasoning for shot-stable few-shot object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8782–8791, June 2021.
[53] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

