
OBJECT COUNTING: YOU ONLY NEED TO LOOK AT ONE

Hui LIN, Xiaopeng HONG, Yabin WANG

School of Cyber Science and Engineering, Xi’an Jiaotong University, China


Emails: [email protected]; [email protected]; [email protected]

arXiv:2112.05993v1 [cs.CV] 11 Dec 2021

ABSTRACT

This paper aims to tackle the challenging task of one-shot object counting. Given an image containing objects of a novel, previously unseen category, the goal of the task is to count all instances of the desired category from only one supporting bounding box example. To this end, we propose a counting model by which you only need to Look At One instance (LaoNet). First, a feature correlation module combines Self-Attention and Correlative-Attention modules to learn both inner-relations and inter-relations. It makes the network robust to inconsistencies of rotation and size among different instances. Second, a Scale Aggregation mechanism is designed to help extract features with different scale information. Compared with existing few-shot counting methods, LaoNet achieves state-of-the-art results while learning with a high convergence speed. The code will be available soon.

Index Terms— Object Counting, One-Shot Learning, Attention Mechanism

1. INTRODUCTION

Object counting has become increasingly important due to its wide range of applications, such as crowd surveillance, traffic monitoring, wildlife conservation and inventory management. Most existing counting methods [1, 2, 3] focus on a particular, single category. However, when applied to new categories, their performance drops catastrophically. Meanwhile, it is extremely difficult and costly to collect and label all categories for training.

For humans, generalization allows us to learn and handle various vision tasks without much prior knowledge or experience. Motivated by this remarkable ability, in this work we focus on this learning paradigm and design a network that efficiently recognizes and counts new categories given only one example. We follow the few-shot setting in [4] and modify it to one-shot object counting. That is, the model takes as input an image containing unseen novel categories together with a supporting bounding box around one example instance of the desired category, and then predicts the object count in the image.

However, there are two main challenges. First, the object counting task covers many different categories, and several categories may even exist within the same image. Moreover, in the few-shot setting these categories do not overlap between training and inference. This means that the model needs a strong ability to distinguish between features of different categories and, at the same time, an effective ability to associate instances sharing the same category. Second, in one-shot counting, the model learns from only one supporting instance. Much of the difficulty results from the fact that the supporting sample may differ from other instances in, for example, size and pose. Hence, the model is required to be invariant to these variations without seeing the commonalities across different instances.

Therefore, in this paper, we propose an effective network named LaoNet for one-shot object counting. It consists of three main parts: feature extraction, feature correlation and the density regressor, as shown in Figure 1. The feature correlation module and the feature extraction module are elaborately designed to address the above two challenges.

Fig. 1. The overall architecture of the proposed LaoNet for one-shot object counting. Both the query image and the supporting box are fed into a CNN to extract features. Supporting features are aggregated among scales. The flattened features with unique position embeddings are then passed into the feature correlation module with Self-Attention and Correlative-Attention modules. Finally, a density regressor predicts the final density map.

We propose the feature correlation based on Self-Attention and Correlative-Attention modules to learn inner-relations and inter-relations respectively. The Self-Attention encourages the model to focus more on important features and their correlations, improving the efficiency of information refinement. Previous few-shot counting methods [4, 5] usually leverage a convolution operation to match the similarities between image features and supporting features. However, as the kernel is derived from the supporting features with a default size and rotation angle, the convolution operation depends heavily on the quality of the supporting features and on the consistency of physical properties among different instances. Instead, our feature correlation module benefits from two kinds of attention modules and addresses this problem by considering all correlations.

We further propose a Scale Aggregation mechanism in feature extraction to deal with scale variations among different categories and different instances. By learning features from multiple subspaces, the model aggregates various scale information while maintaining spatial consistency.

To summarize, our contribution is threefold.

• We design a novel network named LaoNet (a network by which you only need to Look At One instance) for one-shot object counting. By combining Self-Attention and Correlative-Attention modules, LaoNet exploits the correlations among novel-category objects with high accuracy and efficiency.

• We propose a Scale Aggregation mechanism to extract more comprehensive features and fuse multi-scale information from the supporting box.

• The experimental results show that our model achieves state-of-the-art results with significant improvements on the FSC-147 [4] and COCO [6] datasets under the one-shot setting, without fine-tuning.

2. RELATED WORKS
Object counting methods can be broadly divided into two types. Detection-based methods [7] count the number of objects by exhaustively detecting every target in the image, but they rely on complex labels such as bounding boxes. Regression-based methods [1, 2] learn to count by predicting a density map, in which each value represents the density of target objects at the corresponding location. The count prediction equals the total sum of the density map.

Nevertheless, most counting methods are category-specific, e.g., for human crowds [1, 2, 8, 9, 10, 11], cars [3, 12], plants [13] or cells [14, 15]. They focus on only one category and lose their original performance when transferred to other categories. Moreover, most traditional approaches rely on tens of thousands of instances to train a counting model [2, 8, 9, 11, 3, 12].

To considerably reduce the number of samples needed to train a counting model for a particular category, the few-shot counting task has recently been developed. The key lies in the generalization ability of the model to deal with novel categories from few labeled examples. The study [16] proposes a Generic Matching Network (GMN) for class-agnostic counting; however, it still needs several dozen to hundreds of examples of a novel category for adaptation and good performance. CFOCNet is introduced to match and exploit the similarity between objects of the same category [5]. The work [4] presents a Few-Shot Adaptation and Matching Network (FamNet) to learn feature correlations and few-shot adaptation, and also introduces a few-shot counting dataset named FSC-147.

When the number of labeled examples decreases to one, the task evolves into one-shot counting. In other visual tasks, researchers have developed methods for one-shot segmentation [17] and one-shot object detection [18, 19]. Compared to the few-shot setting, which usually uses at least three instances for each object [4], the one-shot setting, where only one instance is available, is clearly more challenging.

It is worth mentioning that detection-based approaches [20, 21, 22] are inferior for the tasks of few-shot and one-shot counting. One main reason is that they require extra and costly bounding-box annotations of all instances during training, whereas the one-shot counting approach we focus on depends only on dot annotations and one supporting box. To illustrate this point further, we perform experiments in Section 4.3 to compare with detection-based approaches and validate the proposed network for one-shot counting.

3. APPROACH

3.1. Problem Definition

One-shot object counting consists of a training set (I_t, s_t, y_t) ∈ T and a query set (I_q, s_q) ∈ Q, whose categories are mutually exclusive. Each input to the model contains an image I and a supporting bounding box s annotating one object of the desired category. In the training set, abundant point annotations y_t are available to supervise the model. At inference, we want the model to count the novel objects in I_q given a supporting category instance specified by s_q. A minimal sketch of this interface is given below.
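To make the setting concrete, the following Python sketch shows the inputs and the count read-out described above; the names (OneShotSample, predicted_count) are ours, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class OneShotSample:
    image: np.ndarray                       # query image I, shape (H, W, 3)
    support_box: Tuple[int, int, int, int]  # supporting box s: (x1, y1, x2, y2)
    points: Optional[np.ndarray] = None     # point annotations y, shape (N, 2); training only

def predicted_count(density_map: np.ndarray) -> float:
    # The predicted count is the integral (sum) of the estimated density map.
    return float(density_map.sum())
```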
3.2. Feature Correlation

As the model is required to learn to count from only one supporting object, seizing the correlations between features with high efficiency is crucial. Therefore, we build the feature correlation module of our one-shot network on Self-Attention and Correlative-Attention modules, which learn the inner-relations and inter-relations respectively.

As illustrated in Figure 1 (violet block), our Self-Attention module consists of a Multi-head Attention (MA) and a layer normalization (LN). We first recall the definition of attention [23], given the query Q, key K and value vectors V:
A(Q, K, V | W) = S\left( \frac{(Q W^Q)(K W^K)^T}{\sqrt{d}} + PE \right) (V W^V),   (1)

where S is the softmax function and 1/\sqrt{d} is a scaling factor based on the vector dimension d. W: W^Q, W^K, W^V \in R^{d \times d} are weight matrices for the projections and PE is the position embedding.

To leverage more representation subspaces, we adopt the extended form with multiple attention heads:

MA(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O, \quad \mathrm{head}_i = A(Q, K, V | W_i),   (2)

where the representation dimensions are divided across the parallel attention heads, with parameter matrices W_i: W_i^Q, W_i^K, W_i^V \in R^{d \times d/h} and W^O \in R^{d \times d}.
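The following PyTorch sketch is an illustrative re-implementation of Eqs. (1)-(2), not the authors' code. For simplicity it uses the standard per-head scaling by \sqrt{d/h} and omits the PE term inside the softmax of Eq. (1), assuming position embeddings have already been added to the input sequences.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0, "model width d must be divisible by the head count h"
        self.h, self.d_head = h, d // h
        # Learned projections W^Q, W^K, W^V and the output projection W^O.
        self.w_q, self.w_k, self.w_v = (nn.Linear(d, d) for _ in range(3))
        self.w_o = nn.Linear(d, d)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # q: (B, Nq, d); k, v: (B, Nk, d)
        B, Nq, _ = q.shape
        def heads(x, proj):  # project, then split into (B, h, N, d_head)
            return proj(x).view(B, -1, self.h, self.d_head).transpose(1, 2)
        Q, K, V = heads(q, self.w_q), heads(k, self.w_k), heads(v, self.w_v)
        # softmax(Q K^T / sqrt(d_head)) V, computed for all heads in parallel.
        attn = (Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)).softmax(dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, Nq, -1)  # concatenate heads
        return self.w_o(out)  # W^O of Eq. (2)
```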
One challenging problem in the counting task is the existence of many complex distractors. To weaken the negative influence of such irrelevant background, we apply Multi-head Self-Attention to the image features to learn inner-relations and encourage the model to focus on the repetitive objects that can be counted.

We denote the feature sequences of the query image and the supporting box region as X and S, with sizes X \in R^{HW \times C} and S \in R^{hw \times C}. The refined query feature is calculated by:

\tilde{X} = LN(MA(X_Q, X_K, X_V) + X).   (3)

A layer normalization (LN) is adopted to balance the value scales.

Meanwhile, as there is only one supporting object in the one-shot counting problem, refining the salient features within that object is necessary and helpful for counting efficiency and accuracy. We therefore apply another Self-Attention module to the supporting features to obtain the refined \tilde{S}.

Previous few-shot counting methods [4, 5] usually adopt a convolution operation in which the supporting features act as kernels to match similarities for the target category. However, the results then depend heavily on the quality of the supporting features and on the consistency of object properties, including rotations and scales.

To this end, we propose a Correlative-Attention module to learn inter-relations between query and supporting features and alleviate the constraints of irrelevant properties.

Specifically, we extend the MA by learning correlations between different feature sequences and add a feed-forward network (FFN) to fuse the features, i.e.,

X^* = \mathrm{Corr}(\tilde{X}, \tilde{S}) = G(MA(\tilde{X}_Q, \tilde{S}_K, \tilde{S}_V) + \tilde{X}),   (4)

where G consists of two LNs and an FFN in residual form (light blue block in Figure 1). Finally, X^* and \tilde{S} are fed back into the cycle as new feature sequences, where each cycle consists of two Self-Attention modules and one Correlative-Attention module.
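The sketch below shows how one such correlation cycle might be wired, assuming PyTorch's built-in nn.MultiheadAttention in place of the module above; the hidden width d and the FFN width are illustrative values, not taken from the paper.

```python
import torch
import torch.nn as nn

class CorrelationCycle(nn.Module):
    def __init__(self, d: int = 256, h: int = 4, ffn_dim: int = 1024):
        super().__init__()
        self.self_x = nn.MultiheadAttention(d, h, batch_first=True)
        self.self_s = nn.MultiheadAttention(d, h, batch_first=True)
        self.cross = nn.MultiheadAttention(d, h, batch_first=True)
        self.ln_x, self.ln_s, self.ln1, self.ln2 = (nn.LayerNorm(d) for _ in range(4))
        self.ffn = nn.Sequential(
            nn.Linear(d, ffn_dim), nn.ReLU(inplace=True), nn.Linear(ffn_dim, d)
        )

    def forward(self, x: torch.Tensor, s: torch.Tensor):
        # x: query sequence (B, HW, d); s: supporting sequence (B, hw, d).
        x = self.ln_x(self.self_x(x, x, x)[0] + x)  # Eq. (3): inner-relations of X
        s = self.ln_s(self.self_s(s, s, s)[0] + s)  # refine S with its own Self-Attention
        # Eq. (4): X~ queries S~ (keys and values); G = LN / residual FFN / LN.
        x = self.ln1(self.cross(x, s, s)[0] + x)
        x = self.ln2(self.ffn(x) + x)
        return x, s

# Usage sketch: the cycle repeats T times (T = 2 in the paper's experiments):
#   cycle = CorrelationCycle()
#   for _ in range(2):
#       x, s = cycle(x, s)
```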
3.3. Feature Extraction and Scale Aggregation

To extract feature sequences from images, we use VGG-19 as our backbone. For the query image, the output of the final level is directly flattened and passed into the Self-Attention module. For the supporting box, as there are uncontrollable scale variations among instances due to perspective, we propose a Scale Aggregation mechanism to fuse information from different scales.

Given l as the number of levels in the CNN, we aggregate the feature maps among different scales:

S = \mathrm{Concat}(F^l(s), F^{l-1}(s), \ldots, F^{l+1-\delta}(s)),   (5)

where F^i represents the feature map at the ith level and \delta \in [1, l] decides the number of levels taken for aggregation.
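A possible implementation of Eq. (5) on a VGG-19 backbone is sketched below. The 1x1 projections that map each level to a common width d are our assumption (the paper only specifies the concatenation), and the channel counts are hard-coded for δ ≤ 2, matching the paper's setting.

```python
import torch
import torch.nn as nn
import torchvision

class ScaleAggregation(nn.Module):
    def __init__(self, d: int = 256, delta: int = 2):
        super().__init__()
        backbone = torchvision.models.vgg19(weights=None).features
        # Split VGG-19 at its max-pool boundaries so each stage yields one F^i.
        self.stages = nn.ModuleList()
        stage = []
        for layer in backbone:
            stage.append(layer)
            if isinstance(layer, nn.MaxPool2d):
                self.stages.append(nn.Sequential(*stage))
                stage = []
        self.delta = delta
        # The last two VGG-19 stages both output 512 channels; project to width d.
        self.proj = nn.ModuleList(nn.Conv2d(512, d, 1) for _ in range(delta))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: support crop, shape (B, 3, H, W) with H, W >= 32.
        feats, x = [], s
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        tokens = [
            proj(f).flatten(2).transpose(1, 2)  # (B, h_i * w_i, d)
            for f, proj in zip(feats[-self.delta:], self.proj)
        ]
        return torch.cat(tokens, dim=1)  # Eq. (5), concatenated as one sequence
```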
Meanwhile, we leverage an identifying position embedding to help the model distinguish the integrated scale information in the attention module. By adopting the fixed sinusoidal absolute position embedding [23], feature sequences from different scales still maintain consistency between positions, i.e.,

PE_{(pos_j, 2i)} = \sin(pos_j / 10000^{2i/d}), \quad PE_{(pos_j, 2i+1)} = \cos(pos_j / 10000^{2i/d}),   (6)

where i is the dimension and pos_j is the position in the jth feature map.
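Eq. (6) is the standard fixed sinusoidal embedding from [23]; a compact implementation might look as follows (assuming an even model dimension d).

```python
import torch

def sinusoidal_position_embedding(num_positions: int, d: int) -> torch.Tensor:
    # Returns PE of shape (num_positions, d): sin on even dims, cos on odd dims.
    assert d % 2 == 0, "assumes an even model dimension d"
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # pos_j
    i2 = torch.arange(0, d, 2, dtype=torch.float32)                      # 2i
    angle = pos / (10000.0 ** (i2 / d))
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe
```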
3.4. Training Loss

We use the Euclidean distance to measure the difference between the estimated density map and the ground-truth density map, which is generated from the annotated points following [1]. The loss is defined as:

L_E = \| D^{gt} - D \|_2^2,   (7)

where D is the estimated density map and D^{gt} is the ground-truth density map. To improve local pattern consistency, we also adopt an SSIM loss following the calculation in [8]. Integrating the two loss functions, we have

L = L_E + \lambda L_{SSIM},   (8)

where \lambda is the balancing weight.
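A sketch of the combined objective of Eqs. (7)-(8) is given below; the ssim argument stands in for the SSIM computation of [8], which we do not reproduce here, so it is a hypothetical helper.

```python
import torch

def counting_loss(d_pred: torch.Tensor, d_gt: torch.Tensor, ssim, lam: float = 1e-4):
    # d_pred, d_gt: density maps of shape (B, 1, H, W); lam is lambda in Eq. (8).
    l_e = ((d_gt - d_pred) ** 2).sum()  # squared L2 distance, Eq. (7)
    l_ssim = 1.0 - ssim(d_pred, d_gt)   # local pattern consistency term, after [8]
    return l_e + lam * l_ssim
```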
4. EXPERIMENTS

4.1. Implementation Details and Evaluation Metrics

We build the density regressor from an upsampling layer and three convolution layers with ReLU activations; the kernel sizes of the first two layers are 3 × 3 and that of the last is 1 × 1. Random scaling and flipping are adopted for each training image. Adam [24] with a learning rate of 0.5 × 10^{-5} is used to optimize the parameters. We set the number of attention heads h to 4, the number of correlation cycles T to 2, the number of aggregated levels \delta to 2, and the loss balancing parameter \lambda to 10^{-4}.
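Under these specifications, the regressor head might look as follows; the channel widths and the upsampling factor are our assumptions, as the paper does not state them.

```python
import torch.nn as nn

def make_density_regressor(in_ch: int = 256, mid_ch: int = 128) -> nn.Sequential:
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, 1, kernel_size=1), nn.ReLU(inplace=True),  # non-negative density
    )
```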

Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are used to measure the performance of our method. They are defined by:

MAE = \frac{1}{M} \sum_{i=1}^{M} | N_i^{gt} - N_i |, \quad RMSE = \sqrt{ \frac{1}{M} \sum_{i=1}^{M} ( N_i^{gt} - N_i )^2 },   (9)

where M is the number of images and N_i^{gt} is the ground-truth count. The predicted count N_i is calculated by integrating the estimated density map D.
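These metrics are straightforward to compute from per-image counts; a reference implementation of Eq. (9):

```python
import numpy as np

def mae_rmse(pred_counts: np.ndarray, gt_counts: np.ndarray):
    # Per-image predicted counts are density-map sums; both arrays have shape (M,).
    err = gt_counts - pred_counts
    return float(np.abs(err).mean()), float(np.sqrt((err ** 2).mean()))
```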

4.2. Datasets

FSC-147 [4] contains a total of 6,135 images collected for the few-shot counting problem. In each image, three randomly selected object instances are annotated with bounding boxes, while all other instances are annotated with points. 89 object categories with 3,659 images form the training set; two disjoint sets of 29 categories each, with 1,286 and 1,190 images respectively, form the validation and test sets.

MS-COCO [6] is a large dataset widely used in object detection and instance segmentation. Its val2017 set contains 80 common object categories in 5,000 images of complex everyday scenes. We follow [17] to generate four train/test splits, each containing 60 training and 20 testing categories.

Fig. 2. Visualizations of one-shot counting inputs and corresponding predicted density maps (ground-truth counts 33, 14 and 35; predicted counts 35, 14 and 37). The model achieves high counting accuracy even though it has never seen strawberries, hot air balloons or cashews before.
4.3. Comparison with Few-Shot Approaches

We conduct experiments on the above two few-shot counting datasets to evaluate the proposed network. As there are few existing methods specifically designed for one-shot counting, for a comprehensive evaluation we modify FamNet [4] and CFOCNet [5] for this setting and also compare with other few-shot counting approaches [25, 26, 16, 27, 17].

First, quantitative results on FSC-147 are shown in Table 1. We list seven results of previous few-shot detection and counting methods in the 3-shot setting and two results of state-of-the-art counting methods in the 1-shot setting for comparison. The result of FamNet [4] uses the adaptation strategy during testing.
                      |      Val        |      Test
Methods               |  MAE     RMSE   |  MAE     RMSE
----------------------+-----------------+-----------------
3-shot                |                 |
Mean                  | 53.38   124.53  | 47.55   147.67
Median                | 48.68   129.70  | 47.73   152.46
FR detector [25]      | 45.45   112.53  | 41.64   141.04
FSOD detector [26]    | 36.36   115.00  | 32.53   140.65
GMN [16]              | 29.66    89.81  | 26.52   124.57
MAML [27]             | 25.54    79.44  | 24.90   112.68
FamNet [4]            | 23.75    69.07  | 22.08    99.54
1-shot                |                 |
CFOCNet [5]           | 27.82    71.99  | 28.60   123.96
FamNet [4]            | 26.55    77.01  | 26.76   110.95
LaoNet (Ours)         | 17.11    56.81  | 15.78    97.15

Table 1. Comparisons with previous state-of-the-art few-shot methods on FSC-147. The upper part of the table presents results in the 3-shot setting, the lower part in the 1-shot setting. FamNet [4] uses the adaptation strategy during testing. It is worth noticing that our one-shot LaoNet outperforms all previous methods, even those in the 3-shot setting, without any fine-tuning strategy.

It is worth noticing that our one-shot LaoNet outperforms all previous few-shot methods, even those in the 3-shot setting, without any fine-tuning strategy. We set new records by reducing the error of FamNet from 26.55 to 17.11 MAE and from 77.01 to 56.81 RMSE on the validation set, and from 26.76 to 15.78 MAE and from 110.95 to 97.15 RMSE on the test set.

Second, Table 2 shows the results on each of the four folds of COCO val2017. Methods with † in the upper part of the table follow the experiment setting in [5]: the supporting examples are chosen from all instances in the dataset during training and testing, which is laborious and costly since it requires all instances to be annotated with bounding boxes. As our setting allows only one fixed instance per image, we re-conduct the experiment of CFOCNet [5] under it. As the results show, our method maintains strong performance on the COCO dataset.

                |  Fold 0      |  Fold 1      |  Fold 2      |  Fold 3      |  Average
Methods         | MAE   RMSE   | MAE   RMSE   | MAE   RMSE   | MAE   RMSE   | MAE   RMSE
----------------+--------------+--------------+--------------+--------------+-------------
Segment [17]†   | 2.91  4.20   | 2.47  3.67   | 2.64  3.79   | 2.82  4.09   | 2.71  3.94
GMN [16]†       | 2.97  4.02   | 3.39  4.56   | 3.00  3.94   | 3.30  4.40   | 3.17  4.23
CFOCNet [5]†    | 2.24  3.50   | 1.78  2.90   | 2.66  3.82   | 2.16  3.27   | 2.21  3.37
FamNet [4]      | 2.34  3.78   | 1.41  2.85   | 2.40  2.75   | 2.27  3.66   | 2.11  3.26
CFOCNet [5]     | 2.23  4.04   | 1.62  2.72   | 1.83  3.02   | 2.13  3.03   | 1.95  3.20
LaoNet (Ours)   | 2.20  3.78   | 1.32  2.66   | 1.58  2.19   | 1.84  2.90   | 1.73  2.93

Table 2. Results on each of the four folds of COCO val2017. Methods with † follow the experiment setting in [5]. Our method achieves high accuracy without any fine-tuning on the testing categories.

4.4. Discussions

Contribution of Different Terms. We study the accuracy contributions of different terms on FSC-147. The results are shown in Table 3, where each row reports the results after removing one component or term from LaoNet. The Self-Attention modules for the two feature sequences, which learn inner-relations, increase the accuracy on the test set by 19.9% and 15.7% in MAE and by 9.5% and 13.1% in RMSE, respectively. Compared to the other two terms, the Self-Attention modules contribute the most to the performance of our model. The Scale Aggregation mechanism helps more on RMSE, demonstrating the robustness contributed by multi-scale aggregation. Finally, the SSIM loss further improves the counting accuracy, lowering both MAE and RMSE.

                        |      Val        |      Test
Methods                 |  MAE     RMSE   |  MAE     RMSE
------------------------+-----------------+-----------------
LaoNet                  | 17.11    56.81  | 15.78    97.15
− Self-Attention (X)    | 19.83    64.84  | 19.71   107.32
− Self-Attention (S)    | 19.67    63.79  | 18.71   111.83
− Scale Aggregation     | 18.82    63.74  | 17.16   106.40
− SSIM                  | 17.82    57.66  | 16.11   100.59

Table 3. Ablation study for different terms on FSC-147 val and test. X stands for the feature sequences of the query image and S for those of the supporting box region.

Convergence Speed. We run experiments to measure convergence speed and performance stability, picking FamNet [4] as the baseline for LaoNet, both with a pre-trained CNN backbone and an Adam optimizer. We train both models on FSC-147 and report the validation MAE for 100 epochs.

As shown in Figure 3, our model has faster convergence and better stability than FamNet. Within just 2 epochs, our method reaches a counting error that FamNet takes 40 epochs to reach. Meanwhile, the convergence of our method is smooth and stable, while that of FamNet is jagged, with multiple sharp peaks and a highest error of 70.

Fig. 3. Comparisons of validation MAE during training. The blue line represents our proposed LaoNet. With just one epoch, it reaches an accuracy that FamNet needs about 20 epochs of training to match.
Comparison with Object Detectors. Object detectors can be used for the counting task by taking the number of predicted detections as the count. However, even when these detectors work with the categories they were trained on, rather than in the one-shot setting, their counting performance is still limited. We select the images of the FSC147-COCO subset from the FSC-147 Val and Test sets, which share categories with the MS-COCO dataset, and conduct quantitative experiments.

                    | FSC147-COCO Val  | FSC147-COCO Test
Methods             |  MAE     RMSE    |  MAE     RMSE
--------------------+------------------+------------------
RetinaNet [20]      | 63.57   174.36   | 52.67    85.86
Faster R-CNN [21]   | 52.79   172.46   | 36.20    79.59
Mask R-CNN [22]     | 52.51   172.21   | 35.56    80.00
FamNet [4]          | 39.82   108.13   | 22.76    45.92
LaoNet (Ours)       | 31.12    97.15   | 12.89    26.64

Table 4. Comparisons with pre-trained object detectors on the FSC147-COCO splits of FSC-147, which contain images with COCO categories. Even pre-trained with thousands of annotated examples on the MS-COCO dataset, these object detectors still deliver unsatisfactory accuracy on the counting task.

As the results in Table 4 show, we compare LaoNet with several object detectors that are well pre-trained with thousands of annotated examples on MS-COCO. Nevertheless, our method, which counts unseen categories, still outperforms the detection-based methods, which have seen those categories in training, by a large margin.
5. CONCLUSION

This paper targets one-shot object counting, which requires the counting model to count objects of new categories after looking at only one instance. We propose an efficient network named LaoNet to address this challenge. LaoNet includes a feature correlation module to learn both inner-relations and inter-relations, and a scale aggregation module to extract multi-scale information for improved robustness. Without any fine-tuning at inference, our LaoNet outperforms previous state-of-the-art few-shot counting methods with a high convergence speed. In the future, we plan to apply our model to a wider range of one-shot vision tasks.

6. REFERENCES
[1] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma, "Single-image crowd counting via multi-column convolutional neural network," in CVPR, 2016.
[2] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong, "Bayesian loss for crowd count estimation with point supervision," in ICCV, 2019.
[3] Debojit Biswas, Hongbo Su, Chengyi Wang, Jason Blankenship, and Aleksandar Stevanovic, "An automatic car counting system using OverFeat framework," Sensors (Basel), 2017.
[4] Viresh Ranjan, Udbhav Sharma, Thu Nguyen, and Minh Hoai, "Learning to count everything," in CVPR, 2021.
[5] Shuo-Diao Yang, Hung-Ting Su, Winston H. Hsu, and Wen-Chin Chen, "Class-agnostic few-shot object counting," in WACV, 2021.
[6] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[7] Prithvijit Chattopadhyay, Ramakrishna Vedantam, Ramprasaath R. Selvaraju, Dhruv Batra, and Devi Parikh, "Counting everyday objects in everyday scenes," in CVPR, 2017, pp. 1135–1144.
[8] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su, "Scale aggregation network for accurate and efficient crowd counting," in ECCV, 2018.
[9] Weizhe Liu, Mathieu Salzmann, and Pascal Fua, "Context-aware crowd counting," in CVPR, 2019.
[10] Boyu Wang, Huidong Liu, Dimitris Samaras, and Minh Hoai Nguyen, "Distribution matching for crowd counting," Advances in Neural Information Processing Systems, vol. 33, 2020.
[11] Hui Lin, Xiaopeng Hong, Zhiheng Ma, Xing Wei, Yunfeng Qiu, Yaowei Wang, and Yihong Gong, "Direct measure matching for crowd counting," in IJCAI, 2021.
[12] Thomas Moranduzzo and Farid Melgani, "Automatic car counting method for unmanned aerial vehicle images," TGRS, 2013.
[13] Mélissande Machefer, François Lemarchand, Virginie Bonnefond, Alasdair Hitchins, and Panagiotis Sidiropoulos, "Mask R-CNN refitting strategy for plant counting and sizing in UAV imagery," Remote Sensing, 2020.
[14] Thorsten Falk, Dominic Mai, Robert Bensch, Özgün Çiçek, Ahmed Abdulkadir, Yassine Marrakchi, Anton Böhm, Jan Deubner, Zoe Jäckel, Katharina Seiwald, et al., "U-Net: deep learning for cell counting, detection, and morphometry," Nature Methods, 2019.
[15] Weidi Xie, J. Alison Noble, and Andrew Zisserman, "Microscopy cell counting and detection with fully convolutional regression networks," Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2018.
[16] Erika Lu, Weidi Xie, and Andrew Zisserman, "Class-agnostic counting," in ACCV, 2018.
[17] Claudio Michaelis, Ivan Ustyuzhaninov, Matthias Bethge, and Alexander S. Ecker, "One-shot instance segmentation," arXiv preprint, 2018.
[18] Ting-I Hsieh, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu, "One-shot object detection with co-attention and co-excitation," in NeurIPS, 2019.
[19] Xiang Li, Lin Zhang, Yau Pun Chen, Yu-Wing Tai, and Chi-Keung Tang, "One-shot object detection without fine-tuning," arXiv preprint, 2020.
[20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in ICCV, 2017.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NeurIPS, 2015.
[22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in ICCV, 2017.
[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[24] Diederik P. Kingma and Jimmy Lei Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[25] Bingyi Kang, Zhuang Liu, Xin Wang, Fisher Yu, Jiashi Feng, and Trevor Darrell, "Few-shot object detection via feature reweighting," in ICCV, 2019.
[26] Qi Fan, Wei Zhuo, Chi-Keung Tang, and Yu-Wing Tai, "Few-shot object detection with attention-RPN and multi-relation detector," in CVPR, 2020.
[27] Chelsea Finn, Pieter Abbeel, and Sergey Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in ICML, 2017.
