
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 60, 2022, 5601614

Few-Shot Object Detection on Remote Sensing Images

Xiang Li, Jingyu Deng, and Yi Fang, Member, IEEE

Abstract— In this article, we deal with the problem of object detection on remote sensing images. Previous researchers have developed numerous deep convolutional neural network (CNN)-based methods for object detection on remote sensing images, and they have reported remarkable achievements in detection performance and efficiency. However, current CNN-based methods often require a large number of annotated samples to train deep neural networks and tend to have limited generalization abilities for unseen object categories. In this article, we introduce a metalearning-based method for few-shot object detection on remote sensing images where only a few annotated samples are needed for the unseen object categories. More specifically, our model contains three main components: a metafeature extractor that learns to extract metafeature maps from input images, a feature reweighting module that learns class-specific reweighting vectors from the support images and uses them to recalibrate the metafeature maps, and a bounding box prediction module that carries out object detection on the reweighted feature maps. We build our few-shot object detection model upon the YOLOv3 architecture and develop a multiscale object detection framework. Experiments on two benchmark data sets demonstrate that with only a few annotated samples, our model can still achieve a satisfying detection performance on remote sensing images, and the performance of our model is significantly better than the well-established baseline models.

Index Terms— Few-shot learning, metalearning, object detection, remote sensing images, You-Only-Look-Once (YOLO).

Manuscript received November 6, 2020; revised December 14, 2020; accepted January 10, 2021. Date of publication February 24, 2021; date of current version December 6, 2021. This work was supported in part by the ADEK under Grant AARE-18150, and in part by the NYU Abu Dhabi Institute under Grant AD131. (Xiang Li and Jingyu Deng contributed equally to this work.) (Corresponding author: Yi Fang.)
Xiang Li and Yi Fang are with the NYU Multimedia and Visual Computing Laboratory, New York University Abu Dhabi, Abu Dhabi 129188, UAE, also with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi 129188, UAE, and also with the Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201 USA (e-mail: [email protected]).
Jingyu Deng is with the NYU Multimedia and Visual Computing Laboratory, New York University Abu Dhabi, Abu Dhabi 129188, UAE, and also with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi 129188, UAE.
Digital Object Identifier 10.1109/TGRS.2021.3051383

I. INTRODUCTION

OBJECT detection has been a long-standing problem in both the remote sensing and computer vision fields. It is generally defined as identifying the location of target objects in the input image and recognizing the object categories. Automatic object detection has been widely used in many real-world applications, such as hazard detection, environmental monitoring, change detection, and urban planning [1], [2].

In the past decades, object detection has been extensively studied, and a large number of methods have been developed for the detection of both artificial objects (e.g., vehicles, buildings, roads, and bridges) and natural objects (e.g., lakes, coasts, and forests) in remote sensing images. Existing object detection methods in remote sensing images (RSIs) can be roughly divided into four categories: 1) template matching-based methods; 2) knowledge-based methods; 3) object-based image analysis (OBIA)-based methods; and 4) machine learning-based methods [1]. Among these methods, the machine learning-based methods have powerful abilities for robust feature extraction and object classification, have been extensively studied by many recent approaches, and have achieved significant progress in solving this problem [3]–[6].

In recent years, among all machine learning-based methods for object detection, deep learning methods, especially convolutional neural networks (CNNs), have drawn immense research attention. Due to the powerful feature extraction abilities of CNN models, a huge number of CNN-based methods have been developed for object detection on both optical and remote sensing images. Notable methods include Faster R-CNN [7], You-Only-Look-Once (YOLO) [8], and the single-shot detector (SSD) [9]. In the remote sensing field, recent studies mostly build their methods using these prevalent deep learning-based architectures.

Despite the breakthroughs achieved by deep learning-based methods for object detection, these methods suffer from a common issue: a large-scale, diverse data set is required to train a deep neural network model. Any adjustments of the candidate identifiable classes are expensive for existing methods because collecting a new RSI data set with a large number of manual annotations is costly, and these methods need a lot of time to retrain their models on the newly collected data set. On the other hand, training a model with only a few samples from the new classes tends to cause the model to suffer from the overfitting problem, and its generalization abilities are severely reduced. Therefore, a special mechanism of learning robust detection from a few samples of the new classes is desired for object detection on RSIs.

In the past few years, few-shot learning has been extensively studied in the computer vision field to undertake the tasks of scene classification [10]–[12], image segmentation [13]–[15], object detection [16]–[19], and shape analysis [20], [21]. Few-shot learning aims at learning to learn transferable knowledge that can be well generalized to new classes and, therefore, performs image recognition (e.g., classification and segmentation) on new classes with only a few annotated examples.


Existing few-shot object detection methods are designed for common object detection (e.g., bicycles, cars, and chairs) in optical images using a single-scale detection network. These common objects usually have small size variations, while, in remote sensing images, objects can have very different sizes, and the spatial resolution of RSIs can be quite different, which makes the problem even more challenging when only a few annotated samples are provided.

In this article, we introduce a metalearning-based method for few-shot object detection on remote sensing images. Under the few-shot scenario, our model aims to learn a detection model from the data set of seen classes that can conduct accurate object detection for unseen (novel) classes with only a few annotated samples. Fig. 1 illustrates the basic idea of few-shot object detection on remote sensing images. We build our method based upon a recently published article [19], which is designed for common object detection in optical images. To address the scale variations inherently present in remote sensing images, we extend [19] to a multiscale feature extraction and object detection framework. Concretely, a metafeature extractor is designed to learn to extract metafeature maps from input images. A feature reweighting module is designed to learn class-specific reweighting vectors that carry the necessary information for detecting objects of the same classes and use them to recalibrate the metafeature maps. The feature reweighting module is implemented by a deep neural network. A bounding box prediction module carries out object detection on the reweighted feature maps. Our few-shot detection method includes two training stages: the metatraining stage and the meta-fine-tuning stage. In the metatraining stage, our model is trained on a large amount of data from the seen classes and learns metaknowledge for object detection. In the meta-fine-tuning stage, a few samples from the unseen classes (with no overlap with the seen classes) are used to fine-tune the model to make it adapted to the unseen classes while still maintaining the metaknowledge learned during the metatraining stage.

The main contributions of this article are summarized as follows.
1) In this article, we introduce the first metalearning-based method for few-shot object detection on remote sensing images. Our method is trained with large-scale data from some seen classes and can learn metaknowledge about object detection and, thus, generalize well to unseen classes with only a few labeled samples.
2) Our method contains three main components: a metafeature extraction network, a feature reweighting module, and a bounding box prediction module. All three modules are designed with a multiscale architecture to enable multiscale object detection.
3) Experiments on two public benchmark data sets demonstrate the effectiveness of the proposed method for few-shot object detection on remote sensing images.

Fig. 1. Illustration of few-shot detection on remote sensing images. Our model is trained with a large number of annotated samples from the seen classes and performs detection on unseen classes with only a few annotated samples.

II. RELATED WORK

A. Object Detection in Computer Vision

Object detection is a hot topic in the computer vision field with extensive studies, especially since the prosperity of deep learning methods. CNNs, as one of the most commonly used deep learning models, have achieved great success on various vision tasks, including image classification [22], [23], semantic segmentation [24], [25], object detection [7], [26], image registration [27], [28], and point cloud and shape analysis [29], [30]. R-CNN [31] is one of the earliest and most successful methods that adopt CNNs for object detection. In R-CNN, the authors replace the traditional handcrafted feature engineering with a CNN-based feature learning process, and the model experiences a significant performance boost. Following R-CNN, Fast R-CNN [32] performs feature extraction on the original input images and maps all region proposals onto the extracted feature map. A region-of-interest (RoI) pooling layer is proposed to transform the feature representation of each RoI into a fixed-length vector. To facilitate the neural network design, the SVM classifier is replaced with a softmax classifier, and the bounding box regression process is included within the model instead of being done afterward. Fast R-CNN improves detection efficiency by a large margin. Another important variant comes from Faster R-CNN [7]. To further overcome the computation burden from the region proposal generation process, Faster R-CNN introduces a region proposal network (RPN) to generate region proposals from the CNN network and enables weight sharing between the RPN network and the detection network. The following works, such as [33] and [34], mostly base their methods on the Faster R-CNN architecture. For example, Mask R-CNN [34] adopts the feature pyramid network (FPN) [33] as the backbone network to produce multiscale feature maps and adds a mask prediction branch to detect the precise boundary of each instance.

The abovementioned approaches generally divide the detection process into two stages: region proposal generation and object detection from the region proposals. These methods are, therefore, often called two-stage object detectors. Another family of methods removes the region proposal generation process and directly conducts object detection on the input images. These methods are, therefore, often called one-stage object detectors. One of the most successful one-stage object detectors is YOLO [8]. In the YOLO model, the input image is divided into grid cells, and each cell is responsible for


detecting a fixed number of objects. A deep CNN architecture is designed to learn high-level feature representations for each cell, and a succession of fully connected layers is used to predict the object categories and locations. YOLO is generally a lot faster than two-stage object detectors but with inferior detection performance. Subsequent variants, such as YOLOv2 [35] and YOLOv3 [36], improve the performance by using a more powerful backbone network and conducting object detection at multiple scales. More specifically, the YOLOv3 model adopts FPN [33] as the backbone network and, thus, enables more powerful feature extraction and detection at different scales. Follow-on research efforts mostly improve the performance by using deconvolutional layers [37], a multiscale detection pipeline [9], or focal loss [38].

B. Object Detection in RSIs

Existing methods for object detection on remote sensing images fall into four categories: template matching-based methods, knowledge-based methods, object-based image analysis (OBIA)-based methods, and machine learning-based methods [1]. The template matching-based methods use stored templates, which are generated through handcrafting or training, to find the best matches at each possible location in the source image. Typical template matching-based methods include rigid template matching [39]–[41] and deformable template matching [42]. Knowledge-based methods treat the object detection problem as a hypothesis testing process by using preestablished knowledge and rules. Two kinds of well-known knowledge are geometric knowledge [43]–[46] and context knowledge [43], [47], [48]. OBIA-based methods start with segmenting images into homogeneous regions that represent relatively homogeneous groups of pixels and then perform region classification using region-level features from handcrafted feature engineering. The last family of methods, machine learning-based object detectors, contains two fundamental processes: handcrafted feature extraction and classification using machine learning-based algorithms. Machine learning-based methods have shown more powerful generalization abilities compared to the other three families of methods [1].

Among all machine learning-based methods, deep learning-based methods have drawn enormous research attention and are widely used in recent RSI object detection works. Unlike traditional machine learning-based methods that use handcrafted features, deep learning-based methods use deep neural networks to automatically learn robust features from input images. Following this research trajectory, early efforts adopt the R-CNN architecture to detect geospatial objects on remote sensing images [49]–[56]. For example, Chen et al. [49] introduced a new rotation-invariant layer to the R-CNN architecture to enhance the performance for detection of objects with different orientations. Zhang et al. [57] introduced a hierarchical feature encoding network for robust object representation learning and demonstrated the effectiveness of their method for object detection on high-resolution remote sensing images. Following the great success of Faster R-CNN, numerous works have tried to extend the Faster R-CNN framework to the remote sensing community [58]–[61]. For example, Li et al. [53] developed a rotation-insensitive RPN by using multiangle anchors instead of the horizontal anchors used in conventional RPN networks. The proposed method can effectively detect geospatial objects of arbitrary orientations.

Following the great success of one-stage methods for object detection on natural images, researchers also developed various regression-based methods for object detection on remote sensing images [26], [51], [61], [62]. For example, [61] extends the SSD model to conduct real-time vehicle detection on remote sensing images. Reference [62] replaces the horizontal anchors with oriented anchors in the SSD [9] framework and, thus, enables the model to detect objects with orientation angles. Subsequent methods further enhance the performance of geospatial object detection on remote sensing images by using hard example mining [51], multifeature fusion [63], transfer learning [64], nonmaximum suppression [65], and so on.

C. Few-Shot Detection

Few-shot learning aims at learning to learn transferable knowledge that can be generalized to new classes and, therefore, performs image recognition (e.g., classification, detection, and segmentation) on new classes where only a few annotated samples are given. In recent years, few-shot detection has received increased attention in the computer vision field. Chen et al. [66] proposed to fine-tune a pretrained model, such as Faster R-CNN [7] and SSD [9], on a few given examples and transfer it into a few-shot object detector. In [67], the authors enrich the training examples with additional unannotated data in a semisupervised setting and obtain performance comparable to weakly supervised methods trained with a large amount of training data. Karlinsky et al. [17] introduced a metric learning subnet to replace the classification head of a standard detection architecture [33] and achieve satisfying detection performance with a few training samples. Kang et al. [19] introduced a reweighting module to produce a group of reweighting vectors from a few supporting samples, one for each class, to reweight the metafeatures extracted from the DarkNet-19 network. With the reweighted metafeatures, a bounding box prediction module is adapted to produce the detection results. However, DarkNet-19 only produces a single metafeature map for each input image, leading to poor performance when detecting objects with large size variations. In contrast to [19], which only conducts object detection on a single-scale feature map, our proposed method extracts hierarchical feature maps with different scales from an FPN-like structure and improves the performance by performing multiscale object detection in the few-shot scenario.

D. Few-Shot Learning and Transfer Learning

Few-shot learning is one of the applications of metalearning in the supervised domain. It shares a lot of similarities in terms of reusability with another commonly used technique called transfer learning. Metalearning, also called learning to learn, is generally defined as the machine learning theory in


Fig. 2. Illustration of a three-way-two-shot setting. For each task, the support set consists of two annotated images for each of the three object categories.

which some algorithms are designed to learn new concepts and skills fast with a few training examples. In contrast, transfer learning works by transferring the knowledge learned from one problem to a different but related problem. By comparison, metalearning is more about learning prior information from a branch of tasks that can be used to quickly learn new models for some new tasks, whereas transfer learning uses the model trained for a source task as the initialization for some new target tasks that are relatively similar to the source task. Thus, although they can both be used for task-to-task transferring, their focuses are quite different: one tries to learn prior knowledge and fast adaptation for new tasks, and the other simply reuses an already optimized model, or part of it.

III. METHOD

A. Method Overview

We first clarify the settings for the few-shot object detection problem. The problem of few-shot object detection aims at learning a detection model from the data set of seen classes that can conduct object detection on images from unseen classes with only a few annotated samples. There are adequate samples for model training for each seen class, while each unseen class has only a few annotated samples. A few-shot object detection model should be able to learn metaknowledge from the data set of seen classes and transfer it well to the unseen classes.

This few-shot object detection setting is very common in real-world scenarios. One may need to develop a new object detection model, while collecting a large-scale data set for the target classes is time-consuming. A good starting point would be deploying a detection model pretrained on some existing large-scale object detection data sets (e.g., DIOR [2]). However, these data sets only cover a limited number of object categories, while one may only focus on several specific object categories that may not happen to be included in these data sets. This calls for the need for few-shot-based object detection models in the remote sensing field.

To achieve few-shot learning, a common solution is to build the detection model using metalearning techniques. In a few-shot detection scenario, a metalearning algorithm learns to learn metaknowledge from a large number of detection tasks sampled from the seen classes and, thus, can generalize well to unseen classes. Each of the sampled tasks is called an episode. Each episode $E$ is constructed from a set of support images $S$ (with annotations) and a set of query images $Q$. For each episode, the support images can be regarded as the training samples and are used for learning how to solve this task, while the query images can be regarded as the test samples and are used for evaluating the performance on this task. Fig. 2 gives an illustration of the few-shot detection setting.

We follow [19] to construct episodes from the data set of both seen and unseen categories. Given an N-way-K-shot detection task, each support set $S$ consists of $K$ annotated images for each of the $N$ object categories. We denote the support set as $S = \{(I_c^k, M_c^k)\}$, where $I_c^k$ denotes the input image, $I_c^k \in \mathbb{R}^{h \times w \times 3}$, $c = 1, 2, \ldots, N$, $k = 1, 2, \ldots, K$, and $M_c^k$ denotes the corresponding bounding box annotations. The query set $Q$ contains $N_q$ images from the same set of classes $C$ as the support set. During metatraining, we randomly sample episodes from the data set of seen classes to train our model to learn metaknowledge of how to detect objects in the query images given the information lying in the support images. Each of the sampled episodes/tasks can be completely nonoverlapping. After metatraining, we fine-tune our model on the unseen classes with a few samples to make our model well adapted to the unseen classes.

Fig. 3 illustrates the pipeline of the proposed method. Our few-shot object detection model (FSODM) is designed to leverage the metaknowledge from the data set of seen classes. Specifically, a metafeature extractor module is first developed to learn metafeatures at three different scales from input query images. Then, a feature reweighting module takes as input N support images with labels, one for each class, and outputs three groups of N reweighting vectors, one for each scale. These reweighting vectors are used to recalibrate the metafeatures of the same scale through a channelwise multiplication. With the reweighting module, the metainformation from the support samples is extracted and used to amplify those metafeatures that are informative for detecting novel objects in the query images.


Fig. 3. Pipeline of the proposed method for few-shot object detection on remote sensing images. Our method consists of three main components: a metafeature extractor, a reweighting module, and a bounding box prediction module. The feature extractor network takes a query image as input and produces metafeature maps at three different scales. The reweighting module takes as input K support images with labels for each of the N classes and outputs three groups of N reweighting vectors. These reweighting vectors are used to recalibrate the metafeature maps of the same scale through a channelwise multiplication. The reweighted feature maps are then fed into three independent bounding box detection modules to predict the bounding box locations and sizes ($x_p$, $y_p$, $w_p$, and $h_p$), the objectness scores ($o_p$), and the classification scores ($r_p$) at three different scales.

The reweighted metafeatures are then fed into three independent bounding box detection modules to predict the objectness scores ($o$), the bounding box locations and sizes ($x$, $y$, $w$, $h$), and the class scores ($c$) at three different scales.

B. Metafeature Extractor

Our metafeature extractor network is designed to extract robust feature representations from input query images. Unlike [19], which only extracts single-scale metafeatures, objects in remote sensing images can have quite different sizes. Therefore, a multiscale feature extraction network is desired. In this article, our feature extractor network is designed based on DarkNet-53 [36] and FPN [33]. The detailed network architecture can be found in [36]. For each input query image, our metafeature extractor network produces metafeatures at three different scales. Let $I^q \in Q$ ($q \in \{1, 2, \ldots, N_q\}$) be the input query image; the generated metafeatures after the feature extractor network can be formulated as

$$F_i^q = \mathcal{F}_\theta(I^q) \in \mathbb{R}^{h_i \times w_i \times m_i} \tag{1}$$

where $\mathcal{F}_\theta$ denotes the feature encoding network with parameters $\theta$, $i$ denotes the scale level, $i \in \{1, 2, 3\}$, and $h_i$, $w_i$, and $m_i$ denote the sizes of the feature maps at scale $i$. In this article, we choose feature maps at the scales of 1/32×, 1/16×, and 1/8×, i.e., the output feature maps have sizes of $(h/32 \times w/32 \times 1024)$, $(h/16 \times w/16 \times 512)$, and $(h/8 \times w/8 \times 256)$.

C. Feature Reweighting Module

Our feature reweighting module is designed to extract metaknowledge from the support images and to guide object detection in query images. To achieve this goal, a lightweight CNN is formulated to map each support image to a set of reweighting vectors, one for each scale. These reweighting vectors will be used to adjust the contributions of the metafeatures and highlight the metafeatures significant for novel object detection.

Assuming the support samples are from $N$ object categories, our feature reweighting module receives as input $N \times K$ support images and their annotations. For each of the support classes $c$, where $c \in \{1, 2, \ldots, N\}$, $K$ support images $\{I_c^k\}_{k=1}^{K}$, along with their corresponding bounding box annotations $\{M_c^k\}_{k=1}^{K}$, will be randomly chosen from the support set. Our feature reweighting module first extracts per-object features using a feature encoding network $G_\phi$ (with parameters $\phi$) and then averages the feature vectors by object class, resulting in class-specific representations $V_i^c = \mathrm{average}\{G_\phi(I_c^k, M_c^k)\}_{k=1}^{K}$, where $V_i^c \in \mathbb{R}^{m_i}$, $i \in \{1, 2, 3\}$.
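For readers who want to see how the three output scales of (1) fit together, here is a minimal PyTorch sketch. The actual extractor is DarkNet-53 with an FPN-style neck [33], [36]; the toy backbone below only reproduces the strides (1/8, 1/16, 1/32) and channel widths (256, 512, 1024) stated above and is not the authors' architecture.

```python
# Toy multiscale metafeature extractor: three scales as in Eq. (1).
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.1),
    )

class ToyMetaFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = conv_block(3, 32, 1)
        self.to_s8 = nn.Sequential(conv_block(32, 64, 2),
                                   conv_block(64, 128, 2),
                                   conv_block(128, 256, 2))   # stride 8, 256 channels
        self.to_s16 = conv_block(256, 512, 2)                  # stride 16, 512 channels
        self.to_s32 = conv_block(512, 1024, 2)                 # stride 32, 1024 channels

    def forward(self, x):
        x = self.stem(x)
        f3 = self.to_s8(x)      # F_3: (h/8,  w/8,  256)
        f2 = self.to_s16(f3)    # F_2: (h/16, w/16, 512)
        f1 = self.to_s32(f2)    # F_1: (h/32, w/32, 1024)
        return f1, f2, f3

# f1, f2, f3 = ToyMetaFeatureExtractor()(torch.zeros(1, 3, 416, 416))
```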


Fig. 4. Anchor boxes on three feature maps at different scales. Green boxes represent ground-truth bounding boxes, and yellow boxes represent anchor boxes at different scales. The size of the input image is 800 × 800. (a)–(c) Anchor settings of its small (25 × 25), middle (50 × 50), and large (100 × 100) feature maps.

TABLE I
NETWORK ARCHITECTURE OF THE REWEIGHTING MODULE

The reweighting vector $V_i^c$ will be used to reweight the metafeatures and highlight the informative ones at scale $i$ and class $c$.

Table I shows the network architecture of our feature reweighting module $G$. "Convolutional" denotes a 2-D convolutional layer. "Filters" is the number of convolutional filters. "Size" indicates the spatial size and stride of a convolutional kernel in the form of "kernel height × kernel width/stride". "Max-pooling" denotes a max-pooling layer, and "GlobalMax" denotes a global max-pooling layer whose kernel size equals the input size. "Route" is a layer used to control the forward route: a "Route" layer takes the output of the layer listed in its "Filters" column as the input of the next layer. For example, "Route 8" in Layer 11 means taking the output of Layer 8 as the input of the next layer, i.e., Layer 12. The output reweighting vectors are taken from every global max-pooling layer (marked with an underline in Table I), and each reweighting vector has the same dimension as the corresponding metafeature.

After obtaining the metafeatures $F_i^q$ and the reweighting vectors $V_i^c$, we compute the class-specific reweighted feature maps $F_i^{qc}$ by

$$F_i^{qc} = F_i^q \otimes V_i^c, \quad i = 1, 2, 3 \; \text{and} \; c = 1, 2, \ldots, N \tag{2}$$

where $\otimes$ is the channelwise multiplication, which is realized through a 1 × 1 convolution with the reweighting vectors $V_i^c$ as the convolution kernels.

As one can see, after the channelwise multiplication, there are three groups of reweighted feature maps, one for each scale. In each group, our feature reweighting module produces $N$ reweighted feature maps. Each reweighted feature map is responsible for detecting objects of one of the $N$ classes.

D. Bounding Box Prediction

Our bounding box prediction module ($M_\psi$) takes as input the reweighted feature maps and produces the object categories and bounding box locations. Following the setting of YOLOv3 [36], at each cell of each reweighted feature map, we predict three bounding boxes. To achieve this goal, we generate a set of anchor boxes at each pixel location on the reweighted feature maps. Fig. 4 illustrates the anchor box settings at three different scales. For the first set of feature maps, with scale level $i$ equal to 1, the sizes of the anchor boxes are set to (116 × 90), (156 × 198), and (373 × 326). For the second set of feature maps, with scale level $i$ equal to 2, the sizes of the anchor boxes are set to (30 × 61), (62 × 45), and (59 × 119) for the middle feature map. For the third set of feature maps, with scale level $i$ equal to 3, the sizes of the anchor boxes are set to (10 × 13), (16 × 30), and (33 × 23). In our experiments, we set the aspect ratios of the different anchor boxes to around 1:2, 1:1, and 2:1 and search for the optimal base size in a predefined range.

For each anchor box in the reweighted feature maps, our bounding box prediction module produces a 6-D output, as displayed in Fig. 3. Among the outputs, the first four elements are used for object location prediction, and the remaining two elements are the objectness score $o_p$ and the classification score $r_p$. Fig. 5 shows the output representation of each bounding box. Assume the coordinates of a predicted bounding box are $b_x$, $b_y$, $b_w$, and $b_h$, where $b_x$ and $b_y$ are the coordinates of its center and $b_w$ and $b_h$ are the width and height of the bounding box.
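As an illustration of the channelwise multiplication in (2), a minimal PyTorch sketch follows; broadcasting an elementwise multiply over the channel dimension is equivalent to the 1 × 1 convolution with $V_i^c$ as the kernel that the text describes.

```python
# Minimal sketch of the class-specific reweighting in Eq. (2).
import torch

def reweight(feature_map, class_vectors):
    """feature_map: (B, C, H, W) query metafeatures at one scale.
    class_vectors: (N, C), one reweighting vector per support class.
    Returns (B, N, C, H, W): one reweighted map per class."""
    n, c = class_vectors.shape
    return feature_map.unsqueeze(1) * class_vectors.view(1, n, c, 1, 1)

# Example at the 1/32 scale: 5 support classes, 1024-channel metafeatures.
# maps = reweight(torch.randn(2, 1024, 13, 13), torch.randn(5, 1024))
```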

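The anchor configuration described in Section III-D can also be written out as plain data; sizes are (width, height) in pixels of the network input, and the pairing of sizes to scale levels follows the paragraph above.

```python
# Anchor box sizes per scale level (Section III-D); level 1 is the coarsest
# (1/32) feature map and level 3 the finest (1/8) feature map.
ANCHORS = {
    1: [(116, 90), (156, 198), (373, 326)],   # large objects
    2: [(30, 61), (62, 45), (59, 119)],       # medium objects
    3: [(10, 13), (16, 30), (33, 23)],        # small objects
}
```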

Fig. 5. Illustration of anchor box and predicted bounding box representations. The solid line grid represents the cells of the feature map, the dashed line rectangle represents an anchor box, and the blue line rectangle represents the predicted bounding box.

Instead of directly regressing the bounding box locations, our bounding box prediction module predicts four offset values $x_p$, $y_p$, $w_p$, and $h_p$, and the coordinates of the predicted box can be computed through

$$b_x = \alpha(\sigma(x_p) + c_x)$$
$$b_y = \alpha(\sigma(y_p) + c_y)$$
$$b_w = a_w e^{w_p}$$
$$b_h = a_h e^{h_p} \tag{3}$$

where $c_x$ and $c_y$ are the cell offsets from the top left corner to the cell that makes the prediction; $a_w$ and $a_h$ are the width and height of the corresponding anchor box; $\sigma(\cdot)$ is the sigmoid function; and $\alpha$ is a scale transformation coefficient equivalent to the ratio between the input image side length and the feature map side length.

The objectness score ($o_p$) implies the possibility of the existence of an object, which can be computed as $P_o = \sigma(o_p)$, where $P_o$ is the possibility and $\sigma(\cdot)$ is the sigmoid function. Considering that we have one set of reweighted feature maps for each class, each predicted bounding box only needs one score for class prediction instead of the total number of categories ($N$). The classification score $r_p$ indicates the possibility that the detected object belongs to each one of the $N$ classes. Taking the classification scores generated from the same anchor box locations with the same anchor sizes as a group, there are $N$ classification scores belonging to the same anchor boxes of the input image. Denoting these $N$ predicted scores as $\{r_p^c\}$ ($c = 1, 2, \ldots, N$), a softmax function is applied on the probability vector to normalize these probability values. The final classification score for each class $c$ can be formulated as

$$R_c = \frac{e^{r_p^c}}{\sum_{c=1}^{N} e^{r_p^c}}. \tag{4}$$

$R_c$ is the final classification possibility of class $c$, and $\sum_{c=1}^{N} R_c = 1$. The objectness possibility and the classification possibility together help to judge whether an object is detected and which class the object belongs to.

E. Loss Function

The loss function of our object detection model contains two parts: the object localization loss and the object classification loss. For object localization, we use the mean-square-error loss to penalize the misalignment between the predicted bounding boxes and the ground-truth ones. Given the predicted bounding box coordinates $\mathrm{coord}_p$ and the ground-truth bounding box coordinates $\mathrm{coord}_t$, the object localization loss is calculated as

$$L_{\mathrm{loc}} = \frac{1}{N_{\mathrm{pos}}} \sum_{\mathrm{pos}} \sum_{l} \left(\mathrm{coord}_t^l - \mathrm{coord}_p^l\right)^2 \tag{5}$$

where $l$ denotes a coordinate enumerator chosen from $\{x, y, w, h\}$, i.e., the four coordinate representations of a specific bounding box, and pos indicates all positive anchors. Only the losses of positive anchors are used in the localization loss calculation, and the localization losses of negative anchor boxes are ignored. We identify an anchor box as positive if the IoU between this anchor box and a certain ground-truth bounding box is larger than a given threshold (e.g., 0.7). Also, we identify an anchor box as negative if the IoUs between this anchor box and all ground-truth bounding boxes are less than a given threshold (e.g., 0.3). We also identify an anchor box as positive if the IoU between this anchor box and a certain ground-truth bounding box is greater than the IoUs between all other anchor boxes and this ground-truth bounding box.

The loss function for the objectness score $L_{\mathrm{obj}}$ is a binary cross-entropy loss, calculated as

$$L_o = \frac{1}{N_{\mathrm{pos}}} \sum_{\mathrm{pos}} -\left[P_t \cdot \log P_o + (1 - P_t) \cdot \log(1 - P_o)\right] = -\frac{1}{N_{\mathrm{pos}}} \sum_{\mathrm{pos}} \log P_o$$
$$L_{\mathrm{noobj}} = \frac{1}{N_{\mathrm{neg}}} \sum_{\mathrm{neg}} -\left[P_t \cdot \log P_o + (1 - P_t) \cdot \log(1 - P_o)\right] = -\frac{1}{N_{\mathrm{neg}}} \sum_{\mathrm{neg}} \log(1 - P_o)$$
$$L_{\mathrm{obj}} = w_{\mathrm{obj}} \cdot L_o + w_{\mathrm{noobj}} \cdot L_{\mathrm{noobj}} \tag{6}$$

where $P_o$ denotes the predicted objectness possibility mentioned above; $P_t$ denotes the true possibility, which is one for a positive box and zero for a negative box; and $w_{\mathrm{obj}}$ and $w_{\mathrm{noobj}}$ are the weights of the objectness loss and the none-objectness loss. Considering that there are usually more negative boxes than positive boxes, $w_{\mathrm{obj}}$ and $w_{\mathrm{noobj}}$ are used to balance these two loss terms.

For object classification, we use the cross-entropy loss to enforce the predicted classes to be aligned with the ground-truth ones, calculated as

$$L_{\mathrm{cls}} = \frac{1}{N_{\mathrm{pos}}} \sum_{\mathrm{pos}} \sum_{c} -y_c \log R_c \tag{7}$$

where $y_c$ is the ground-truth classification label. Because we already use an objectness score to decide whether the predicted box contains an object or not, the background class is ignored during the classification loss calculation.
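The decoding in (3) and the score computation in (4) amount to a few lines of tensor arithmetic; the sketch below is a hedged illustration rather than the released implementation.

```python
# Minimal sketch of bounding box decoding (3) and score computation (4).
import torch

def decode_box(x_p, y_p, w_p, h_p, cx, cy, aw, ah, alpha):
    """cx, cy: cell offsets; aw, ah: anchor width/height;
    alpha: ratio of input image side length to feature map side length."""
    bx = alpha * (torch.sigmoid(x_p) + cx)
    by = alpha * (torch.sigmoid(y_p) + cy)
    bw = aw * torch.exp(w_p)
    bh = ah * torch.exp(h_p)
    return bx, by, bw, bh

def scores(o_p, r_p):
    """o_p: objectness logit; r_p: (N,) logits, one from each class-specific map."""
    P_o = torch.sigmoid(o_p)           # possibility that an object exists
    R = torch.softmax(r_p, dim=-1)     # classification possibilities, sum to 1
    return P_o, R
```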

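Similarly, the losses in (5)–(7) can be sketched as follows, assuming the anchors have already been matched into positive and negative sets; the loss weights shown are placeholders, not values reported by the authors.

```python
# Minimal sketch of the localization (5), objectness (6), and classification (7)
# losses over pre-matched anchors.
import torch
import torch.nn.functional as F

def detection_losses(coord_p, coord_t, obj_pos, obj_neg, cls_logits, cls_labels,
                     w_obj=1.0, w_noobj=0.5):
    # (5): mean-square error over positive anchors and the four coordinates.
    l_loc = ((coord_t - coord_p) ** 2).sum(dim=1).mean()
    # (6): binary cross-entropy split into positive and negative anchor terms.
    l_o = -torch.log(torch.sigmoid(obj_pos)).mean()
    l_noobj = -torch.log(1.0 - torch.sigmoid(obj_neg)).mean()
    l_obj = w_obj * l_o + w_noobj * l_noobj
    # (7): cross-entropy over the N class scores of positive anchors.
    l_cls = F.cross_entropy(cls_logits, cls_labels)
    return l_loc, l_obj, l_cls
```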

The overall objective loss function is formulated as

$$L = L_{\mathrm{loc}} + L_{\mathrm{obj}} + L_{\mathrm{cls}}. \tag{8}$$

F. Training and Inference

In our few-shot detection model, the training and inference processes are conducted with episode data. To facilitate model training in the few-shot detection scenario, we reorganize the training data set into two sets: the query set ($Q$) and the support set ($S$). The support set is a training data set regrouped by object classes. As explained in Section III-C, each query image is associated with a group of support images from all classes. Therefore, we separate the training images into $N$ groups according to object categories. After regrouping, a bounding box mask is generated for each support image. The mask is generated by setting the pixel value to 1 when the pixel is located within the ground-truth bounding box and 0 otherwise. Therefore, a support set can be formulated as

$$S = \{(I_c^k, M_c^k)\}, \quad c = 1, 2, \ldots, N, \; k = 1, 2, \ldots, K. \tag{9}$$

Similarly, the query set contains a set of query images and their annotations

$$Q = \{(I^q, A^q)\}_{q=1}^{N_q}. \tag{10}$$

We, therefore, formulate an episode $T$ by

$$T = Q \cup S = \{(I^q, A^q)\} \cup \{(I_c^k, M_c^k)\}. \tag{11}$$

$I^q$ and $\{(I_c^k, M_c^k)\}$ are fed into the metafeature extractor network and the feature reweighting module, respectively, while $A^q$ is used as the ground-truth supervision during the training process.

Under the few-shot detection scenario, we need to leave some object classes in the data set for model evaluation. All classes in the data set are divided into seen classes and unseen classes. Seen classes require as many samples as possible to train a robust model, while unseen classes are viewed as a new detection task with only a few annotated samples.

The training process is divided into two steps. The first step is training on the seen classes to learn metaknowledge from the data set of seen classes. This process is also called metatraining. This step generally requires a large amount of training data, takes a relatively long time, and usually does not need to be repeated in later use. The second step is meta-fine-tuning on the unseen classes with a few samples to make our model well adapted to the unseen classes. This meta-fine-tuning process is fast and is carried out whenever a new class is added. Both the metatraining and meta-fine-tuning processes are conducted with episode data. The overall training and testing processes are illustrated in Algorithm 1. Our code will be released at https://github.com/lixiang-ucas/FSODM.

Algorithm 1 Training and Testing Process
1: Construct the training set $D_{\mathrm{train}}$ from the seen classes and the testing set $D_{\mathrm{test}}$ from the unseen classes.
2: Initialize the network parameters $\theta$, $\phi$, $\psi$ in the feature extractor network, the feature reweighting module, and the bounding box prediction module.
3: for each training episode $(S, Q) \in D_{\mathrm{train}}$ do
4:    Model metatraining.
5: end for
6: for each training episode $(S, Q) \in D_{\mathrm{test}}$ do
7:    Model meta-fine-tuning.
8: end for
9: for each testing episode $(S, Q) \in D_{\mathrm{test}}$ do
10:   Extract feature maps for the query images using the metafeature extractor.
11:   Generate class-specific reweighting vectors and compute the reweighted feature maps.
12:   Generate predicted bounding boxes using the bounding box prediction modules.
13: end for

IV. EXPERIMENTS AND RESULTS

In this section, we evaluate the performance of our model for few-shot object detection on two public benchmark RSI data sets and compare our method with both conventional object detectors and few-shot detectors to show the superiority of our model.

A. Data Set

NWPU VHR-10 is a very high resolution (VHR) remote sensing image data set released by [68]. This data set contains 800 RSIs collected from Google Earth and the ISPRS Vaihingen data set [69]. One hundred and fifty "negative samples" without target objects and 650 "positive samples" with at least one object are annotated manually. There are in total ten object categories in this data set: airplane, baseball diamond, basketball court, bridge, ground track field, harbor, ship, storage tank, tennis court, and vehicle.

DIOR is a large-scale benchmark data set for object detection on RSIs, released by [2]. Images in the DIOR data set are collected from Google Earth, with 23 463 images and 192 472 instances of 20 classes. The object classes include airplane, airport, baseball field, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, harbor, golf course, ground track field, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. All images are of the size 800 × 800 pixels, and the spatial resolutions range from 0.5 to 30 m. In the DIOR data set, the object sizes vary widely.

B. Experimental Settings

To evaluate the detection performance of our FSODM model under the few-shot scenario, we divide each data set into two parts: one is constructed from the seen classes, and the other is constructed from the unseen classes. For the NWPU VHR-10 data set, three classes (airplane, baseball diamond, and tennis court) are used as unseen classes and the others as seen classes.
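Looking back at Section III-F, the support-image mask is a simple rasterization of the ground-truth boxes; the sketch below assumes pixel-coordinate boxes. In [19], such a mask is concatenated with the support image as an extra input channel, and a similar use is a reasonable assumption here.

```python
# Minimal sketch of support mask generation (Section III-F): pixels inside any
# ground-truth box of the support class are set to 1, all others stay 0.
import numpy as np

def make_support_mask(height, width, boxes):
    """boxes: iterable of (x_min, y_min, x_max, y_max) in pixel coordinates."""
    mask = np.zeros((height, width), dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes:
        mask[int(y_min):int(y_max), int(x_min):int(x_max)] = 1.0
    return mask
```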

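The control flow of Algorithm 1 can likewise be summarized in a few lines; `model.meta_train_step` and `model.detect` are hypothetical method names standing in for the metatraining/meta-fine-tuning update and the three-stage inference described in the algorithm.

```python
# Minimal sketch of the training and testing flow of Algorithm 1.
def train_and_test(model, train_episodes, finetune_episodes, test_episodes):
    for support, query in train_episodes:      # lines 3-5: metatraining on seen classes
        model.meta_train_step(support, query)
    for support, query in finetune_episodes:   # lines 6-8: meta-fine-tuning on unseen classes
        model.meta_train_step(support, query)
    detections = []
    for support, query in test_episodes:       # lines 9-13: inference on unseen classes
        detections.append(model.detect(support, query))
    return detections
```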

TABLE II
FEW-SHOT DETECTION PERFORMANCE (mAP) ON THE UNSEEN CLASSES OF THE NWPU VHR-10 DATA SET. THE TOP PART SHOWS CONVENTIONAL OBJECT DETECTORS, AND THE BOTTOM PART SHOWS FEW-SHOT LEARNING-BASED METHODS

For the DIOR data set, five classes (airplane, baseball field, tennis court, train station, and windmill) are chosen as unseen classes and the others as seen classes. In our experiments, we randomly choose the seen/unseen classes. It should be noted that different seen/unseen splits can lead to different final detection performance, but, in this article, we aim to introduce the first few-shot learning-based method for object detection on remote sensing images rather than to achieve state-of-the-art performance. Note that one image can contain several object instances; the number of shots in our experiments means the number of object instances, not the number of images.

Moreover, we apply a multiscale training technique to enhance detection performance. The scale of the input images varies in (384, 416, 448, 480, 512, 544, 576, 608, and 640), and all input images are square. We note that, in the DIOR data set, the original images are much larger than the desired input scales. Therefore, those large images are cropped into a series of patches with 1024 × 1024 pixels and a stride of 512 pixels (for the DIOR data set, this step is ignored). For the objects that get truncated in this process, we ignore the truncated object instances that have an overlap of less than 70% with the original object instances.

C. Comparing Methods

We compare our FSODM model with the prevalent one-stage object detector YOLOv3 [36] and the two-stage object detector Faster R-CNN [7] (Faster R-CNN includes ResNet101 and VGG16 as two types of feature extraction networks). We train these conventional object detectors using transfer learning techniques, where the training process consists of two steps: pretraining and model fine-tuning. In the pretraining stage, we remove all objects belonging to the unseen classes from the training data and train a conventional none few-shot model; in the fine-tuning stage, we train the model with a few annotated samples from the unseen classes. Note that these conventional detectors use complicated data augmentation strategies to enhance their performance; in our experiments, we do not implement these strategies in order to have a fair comparison with our method. There are only a few methods focused on the problem of few-shot detection, so we only include two current state-of-the-art few-shot detectors: YOLO-Low-Shot [19] and RepMet [17]. For [19], the experimental settings are the same as for our method: training on the same set from the seen classes and fine-tuning on the same set from the unseen classes.

The mean average precision (mAP) is used to evaluate object detection performance. We follow the PASCAL VOC2007 benchmark [70] to calculate the mAP, which takes the average of 11 precision values as recall increases from 0 to 1 with a step of 0.1.

D. Results on NWPU VHR-10

Table II lists the few-shot object detection performance of our FSODM method and the comparison methods on the unseen classes of the NWPU VHR-10 data set. In this table, we show the performance under different numbers of shots (i.e., annotated samples in unseen classes). As shown in Table II, our proposed FSODM model achieves significantly better performance than all comparing methods. More specifically, compared to the current state-of-the-art few-shot object detector [19], our method obtains an mAP 166.6% higher in the three-shot setting, 120.8% higher in the five-shot setting, and 62.5% higher in the ten-shot setting. The conventional none few-shot-based methods (Faster RCNN and YOLOv3) obtain much worse performance than the two few-shot-based methods. Even in the ten-shot setting, Faster RCNN with ResNet101 only achieves an mAP of 0.24, which is worse than our FSODM model in the three-shot setting with an mAP of 0.32. Moreover, as shown in Table II, with increases in the number of annotated samples in the unseen classes, the detection performance of our FSODM model increases quickly.

From Table II, one can also see that both our FSODM model and the comparison methods obtain better performance on the "baseball diamond" category. This is because baseball diamonds have smaller size variations, which reduces the recognition challenges for a detection model.
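The 11-point interpolated AP of the VOC2007 protocol mentioned in Section IV-C is easy to restate in code; this is a generic sketch of that metric, not the authors' evaluation script.

```python
# Minimal sketch of 11-point interpolated average precision (PASCAL VOC2007).
import numpy as np

def voc07_ap(recall, precision):
    """recall, precision: 1-D arrays for one class, ordered by descending confidence."""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0   # interpolated precision at recall t
        ap += p / 11.0
    return ap   # mAP is the mean of this value over all classes
```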

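The patch cropping used in the experiments (1024 × 1024 patches, stride 512, and the 70% truncation rule) can be sketched as follows; box clipping to the patch border is omitted for brevity.

```python
# Minimal sketch of cropping large images into patches and filtering truncated
# ground-truth boxes (kept only if >= 70% of the box area lies in the patch).
def crop_patches(image_w, image_h, boxes, patch=1024, stride=512, keep_ratio=0.7):
    patches = []
    for top in range(0, max(image_h - patch, 0) + 1, stride):
        for left in range(0, max(image_w - patch, 0) + 1, stride):
            kept = []
            for x_min, y_min, x_max, y_max in boxes:
                ix = max(0, min(x_max, left + patch) - max(x_min, left))
                iy = max(0, min(y_max, top + patch) - max(y_min, top))
                area = (x_max - x_min) * (y_max - y_min)
                if area > 0 and (ix * iy) / area >= keep_ratio:
                    kept.append((x_min - left, y_min - top, x_max - left, y_max - top))
            patches.append(((left, top), kept))
    return patches
```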

TABLE III
FEW-SHOT DETECTION PERFORMANCE (mAP) ON THE UNSEEN CLASSES OF THE DIOR DATA SET. THE TOP PART SHOWS CONVENTIONAL OBJECT DETECTORS, AND THE BOTTOM PART SHOWS FEW-SHOT LEARNING-BASED METHODS

Fig. 6. Selected examples of our few-shot detection results. (Left) Detection results on the unseen classes of the NWPU VHR data set in a ten-shot setting. (Right) Detection results on the unseen classes of the DIOR data set in a 20-shot setting. Red, yellow, and blue boxes indicate true positive, false positive, and false negative detections, respectively.

E. Results on DIOR

Considering that the DIOR data set is a large-scale data set with large variations in object structures and sizes, a larger number of annotated samples is used for the unseen classes. Specifically, for the conventional object detectors (Faster RCNN and YOLOv3), we conduct experiments with 10, 20, and 30 annotated samples for each of the unseen classes. Table III shows the quantitative results of our method and the comparing methods on the unseen classes of the DIOR data set.

As shown in Table III, our FSODM model achieves better performance than the other two few-shot-based methods presented in [17] and [19]. All three few-shot-based methods achieve much better performance than the none few-shot-based methods (Faster R-CNN and YOLOv3), even with fewer samples. Moreover, with the increase in the number of annotated samples in the unseen classes, the detection performance improves consistently for all methods. Table III also shows that the "baseball field" and "tennis court" categories reach better detection performance. This is probably because these two object categories have smaller in-category variations.

Fig. 6 shows some examples of the few-shot detection results of our FSODM model on the NWPU VHR-10 data set and the DIOR data set. As shown in Fig. 6, our model can successfully detect most of the objects in all the unseen classes of the NWPU VHR-10 and the DIOR data sets. Most of the failure cases come from the missing or false detection of small objects. Moreover, with only a few annotated samples of the unseen classes, our model fails to accurately localize "train stations" with large size or appearance variations.


TABLE IV
DETECTION PERFORMANCE (mAP) ON THE SEEN CLASSES OF THE NWPU VHR-10 DATA SET

TABLE V
DETECTION PERFORMANCE (mAP) ON THE SEEN CLASSES OF THE DIOR DATA SET

V. DISCUSSION

A. Detection Performance on Seen Classes

A good few-shot object detection model should not only perform well on the unseen classes with few annotated samples but also should not sacrifice its performance on the seen classes, which means that it should perform on par with the conventional none few-shot-based models when data are abundant.

Tables IV and V show the performance of our FSODM model and the comparison methods on the seen classes of the NWPU VHR-10 and DIOR data sets. From Table IV, one can see that all three methods achieve similar performance on the seen classes, with slight differences in the mAP values. On the DIOR data set, our method performs better than the other few-shot-based method [19]. This demonstrates that our proposed method can better maintain the performance on the seen classes under the few-shot detection scenario. The performance of our few-shot-based method achieves the same mAP value as the conventional YOLOv3 detector when a large amount of data is provided.

Fig. 7. Detection performance with different shots on the NWPU VHR-10 data set. Horizontal dashed lines indicate the baseline performances generated using all the training samples in the data set (1025 samples for the airplane category, 519 samples for the baseball diamond category, and 643 samples for the tennis court category).

B. Number of Shots

We investigate the performance of our few-shot object detection model under different numbers of shots on the unseen classes. To show the advantage of our few-shot-based model, we conduct comparison experiments with all training samples from the novel categories of the NWPU VHR-10 data set and use YOLOv3 as the baseline model. For our FSODM model, we conduct experiments with a larger few-shot range (from five shots to 60 shots). As shown in Fig. 7, our model with only 60 (about 8%) training samples from the novel categories can achieve almost the same detection performance as the baseline model that uses all the training samples. We attribute this to the fact that our FSODM model can learn metaknowledge from the seen classes and effectively apply it for detection on the unseen classes. Moreover, in the baseball


Fig. 8. t-SNE [71] visualization of reweighting vectors. The reweighting vectors are generated from 400 support images randomly picked from the DIOR
data set (20 images of each category). (a)–(c) Visualizations of reweighting vectors with dimensions of 256, 512, and 1024, respectively.

TABLE VI elements a reweighting vector has, the more information it


D ETECTION S PEED OF THE C OMPARING M ETHODS carries. Therefore, reweighting vectors with higher dimensions
tends to be more capable of representing the object information
from support samples.

D. Detection Speed
In this section, we explore the detection speed of our
FSODM model. We compare the detection speed of our model
with Faster RCNN, YOLOv3, RepMet, and YOLO-Low-shot.
diamond class, our model with only 20 annotated samples
Experiments are carried out on the NWPU data set in a
achieves almost the same performance as the baseline model
three-way-ten-shot setting. We report the inference speed on
that uses all the training samples. This is probably because
a Tesla P100 GPU with the batch size set to 1. Results are
baseball diamonds have smaller in-category variations and
listed in Table VI. From Table VI, one can see that our
are easily identified by their structures from a few annotated
FSODM model obtains a detection speed comparable with
samples. In contrast, although the airplane class has almost
YOLOv3 and is a lot faster than the two Faster R-CNN
the same detection performance as the baseball diamond class
methods and another few-shot-based method RepMet. YOLO-
using the baseline model, the few-shot detection performance
Low-Shot is about 2× faster than our method because it
is significantly worse. This is because objects in the airplane
uses a single-scale detection framework, while our FSODM
category have larger structural and size variations, as shown
model adopts a multiscale detection pipeline and can get better
in Fig. 6. This is a challenge that impedes our model from
performance.
achieving a satisfying performance with only a few samples
(less than 60) even though our few-shot-based model can VI. C ONCLUSION
successfully obtain a comparable performance as the baseline This article introduces a new metalearning-based method
model when enough annotated samples (60 shots) are given. for few-shot object detection on remote sensing images
that are among the first method to challenge this area of
C. Reweighting Vectors research. We first formulate the few-shot object detection
In our approach, the reweighting vectors are extracted by the problem on remote sensing images. Then, we introduce our
reweighting module and significantly support the final detec- proposed method that includes three main components: a
tion performance. To explore the relationship between these metafeature extractor, a feature reweighting module, and a
reweighting vectors, we use t-Distributed Stochastic Neighbor bounding box prediction module. Each module is designed in
Embedding (t-SNE) [71] to reduce their dimensions and a multiscale architecture to enable multiscale object detection.
visualize them on the coordinate axis. T-SNE is a dimension- Our method is trained with large-scale data from some seen
ality reduction technique that can pass the inner relationship classes and can learn metaknowledge from seen classes and
between high dimension vectors to low-dimensional vectors. generalizes well to unseen classes with only a few samples.
It keeps close vectors in high-dimensional space close in Experiments on two public benchmark data sets demonstrate
low-dimensional space and remote vectors in high-dimensional the powerful ability of our method for detecting objects
space remote in low-dimensional space. from unseen classes through a few annotated samples. This
VI. CONCLUSION

This article introduces a new metalearning-based method for few-shot object detection on remote sensing images, which is among the first methods to address this problem. We first formulate the few-shot object detection problem on remote sensing images. Then, we introduce our proposed method, which includes three main components: a metafeature extractor, a feature reweighting module, and a bounding box prediction module. Each module is designed in a multiscale architecture to enable multiscale object detection. Our method is trained with large-scale data from a set of seen classes, learns metaknowledge from these classes, and generalizes well to unseen classes with only a few samples. Experiments on two public benchmark data sets demonstrate the strong ability of our method to detect objects from unseen classes given only a few annotated samples. This work is a first step toward few-shot detection in the remote sensing field, and we will continue to improve it and explore this direction in future work.

REFERENCES
[1] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.


[2] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS J. Photogramm. Remote Sens., vol. 159, pp. 296–307, Jan. 2020.
[3] X. Bai, H. Zhang, and J. Zhou, “VHR object detection based on structural feature extraction and query expansion,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6508–6520, Oct. 2014.
[4] F. Bi, B. Zhu, L. Gao, and M. Bian, “A visual search inspired computational model for ship detection in optical satellite images,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 4, pp. 749–753, Jul. 2012.
[5] X. Huang and L. Zhang, “Road centreline extraction from high-resolution imagery based on multiscale structural features and support vector machines,” Int. J. Remote Sens., vol. 30, no. 8, pp. 1977–1987, Apr. 2009.
[6] M. Volpi, F. D. Morsier, G. Camps-Valls, M. Kanevski, and D. Tuia, “Multi-sensor change detection based on nonlinear canonical correlations,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2013, pp. 1944–1947.
[7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[9] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3630–3638.
[11] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4077–4087.
[12] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4367–4375.
[13] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen, “CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5217–5226.
[14] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, “PANet: Few-shot image semantic segmentation with prototype alignment,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9197–9206.
[15] T. Hu, P. Yang, Z. Chiliang, G. Yu, Y. Mu, and C. Snoek, “Attention-based multi-context guiding for few-shot semantic segmentation,” in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 8441–8448.
[16] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos, “AGA: Attribute-guided augmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7455–7463.
[17] L. Karlinsky et al., “RepMet: Representative-based metric learning for classification and few-shot object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5197–5206.
[18] T. Wang, X. Zhang, L. Yuan, and J. Feng, “Few-shot adaptive faster R-CNN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7173–7182.
[19] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell, “Few-shot object detection via feature reweighting,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8420–8429.
[20] Z. Chen, K. Yin, M. Fisher, S. Chaudhuri, and H. Zhang, “BAE-NET: Branched autoencoder for shape co-segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8490–8499.
[21] L. Wang, X. Li, and Y. Fang, “Few-shot learning of part-specific probability space for 3D shape segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4504–4513.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[24] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[25] Q. Wang, J. Gao, and X. Li, “Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes,” IEEE Trans. Image Process., vol. 28, no. 9, pp. 4376–4386, Sep. 2019.
[26] Y. Hu, X. Li, N. Zhou, L. Yang, L. Peng, and S. Xiao, “A sample update-based convolutional neural network framework for object detection in large-area remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 6, pp. 947–951, Jun. 2019.
[27] I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural network architecture for geometric matching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6148–6157.
[28] J. Chen, L. Wang, X. Li, and Y. Fang, “Arbicon-Net: Arbitrary continuous geometric transformation networks for image registration,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 3410–3420.
[29] X. Li, C. Wen, L. Wang, and Y. Fang, “Topology-constrained shape correspondence,” IEEE Trans. Vis. Comput. Graphics, early access, May 11, 2020, doi: 10.1109/TVCG.2020.2994013.
[30] X. Li, L. Wang, M. Wang, C. Wen, and Y. Fang, “DANCE-NET: Density-aware convolution networks with context encoding for airborne LiDAR point cloud classification,” ISPRS J. Photogramm. Remote Sens., vol. 166, pp. 128–139, Aug. 2020.
[31] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[32] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[33] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[34] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2017, pp. 2961–2969.
[35] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[36] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[37] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single shot detector,” 2017, arXiv:1701.06659. [Online]. Available: http://arxiv.org/abs/1701.06659
[38] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[39] D. Chaudhuri, N. K. Kushwaha, and A. Samal, “Semi-automated road detection from high resolution satellite images by directional morphological enhancement and segmentation techniques,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 5, pp. 1538–1544, Oct. 2012.
[40] D. M. McKeown and J. L. Denlinger, “Cooperative methods for road tracking in aerial imagery,” in Proc. CVPR, Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1988, pp. 662–672.
[41] J. Zhou, W. F. Bischof, and T. Caelli, “Road tracking in aerial images based on human–computer interaction and Bayesian filtering,” ISPRS J. Photogramm. Remote Sens., vol. 61, no. 2, pp. 108–124, Nov. 2006.
[42] M. A. Fischler and R. A. Elschlager, “The representation and matching of pictorial structures,” IEEE Trans. Comput., vol. C-22, no. 1, pp. 67–92, Jan. 1973.
[43] A. Huertas and R. Nevatia, “Detecting buildings in aerial images,” Comput. Vis., Graph., Image Process., vol. 41, no. 2, pp. 131–152, Feb. 1988.
[44] J. C. McGlone and J. A. Shufelt, “Projective and object space geometry for monocular building extraction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 54–61.
[45] J. C. Trinder and Y. Wang, “Automatic road extraction from aerial images,” Digit. Signal Process., vol. 8, no. 4, pp. 215–224, Oct. 1998.
[46] U. Weidner and W. Förstner, “Towards automatic building extraction from high-resolution digital elevation models,” ISPRS J. Photogramm. Remote Sens., vol. 50, no. 4, pp. 38–49, Aug. 1995.
[47] H. G. Akcay and S. Aksoy, “Building detection using directional spatial constraints,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2010, pp. 1932–1935.
[48] R. B. Irvin and D. M. McKeown, “Methods for exploiting the relationship between buildings and their shadows in aerial imagery,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1564–1575, Nov./Dec. 1989.
[49] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.


[50] Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017.
[51] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, “Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining,” Sensors, vol. 17, no. 2, p. 336, Feb. 2017.
[52] Y. Yang, Y. Zhuang, F. Bi, H. Shi, and Y. Xie, “M-FCN: Effective fully convolutional network-based airplane detection framework,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 8, pp. 1293–1297, Aug. 2017.
[53] K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337–2348, Apr. 2018.
[54] Y. Zhong, X. Han, and L. Zhang, “Multi-class geospatial object detection based on a position-sensitive balancing framework for high spatial resolution remote sensing imagery,” ISPRS J. Photogramm. Remote Sens., vol. 138, pp. 281–294, Apr. 2018.
[55] W. Guo, W. Yang, H. Zhang, and G. Hua, “Geospatial object detection in high resolution satellite images based on multi-scale convolutional neural network,” Remote Sens., vol. 10, no. 1, p. 131, Jan. 2018.
[56] J. Yang, Y. Zhu, B. Jiang, L. Gao, L. Xiao, and Z. Zheng, “Aircraft detection in remote sensing images based on a deep residual network and super-vector coding,” Remote Sens. Lett., vol. 9, no. 3, pp. 228–236, Mar. 2018.
[57] Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5535–5548, Aug. 2019.
[58] Z. Zou and Z. Shi, “Ship detection in spaceborne optical image with SVD networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 5832–5845, Oct. 2016.
[59] H. Lin, Z. Shi, and Z. Zou, “Fully convolutional network with task partitioning for inshore ship detection in optical remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 10, pp. 1665–1669, Oct. 2017.
[60] W. Liu, L. Ma, and H. Chen, “Arbitrary-oriented ship detection framework in optical remote-sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 6, pp. 937–941, Jun. 2018.
[61] T. Tang, S. Zhou, Z. Deng, L. Lei, and H. Zou, “Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks,” Remote Sens., vol. 9, no. 11, p. 1170, Nov. 2017.
[62] L. Liu, Z. Pan, and B. Lei, “Learning a rotation invariant detector with rotatable bounding box,” 2017, arXiv:1711.09405. [Online]. Available: http://arxiv.org/abs/1711.09405
[63] J. Zhong, T. Lei, and G. Yao, “Robust vehicle detection in aerial images based on cascaded convolutional neural networks,” Sensors, vol. 17, no. 12, p. 2720, Nov. 2017.
[64] X. Han, Y. Zhong, and L. Zhang, “An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery,” Remote Sens., vol. 9, no. 7, p. 666, Jun. 2017.
[65] Z. Xu, X. Xu, L. Wang, R. Yang, and F. Pu, “Deformable ConvNet with aspect ratio constrained NMS for object detection in remote sensing imagery,” Remote Sens., vol. 9, no. 12, p. 1312, Dec. 2017.
[66] H. Chen, Y. Wang, G. Wang, and Y. Qiao, “LSTD: A low-shot transfer detector for object detection,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–8.
[67] X. Dong, L. Zheng, F. Ma, Y. Yang, and D. Meng, “Few-example object detection with model communication,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1641–1654, Jul. 2019.
[68] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.
[69] J. Niemeyer, F. Rottensteiner, and U. Soergel, “Contextual classification of lidar data and building object detection in urban areas,” ISPRS J. Photogramm. Remote Sens., vol. 87, pp. 152–165, Jan. 2014.
[70] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[71] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.

Xiang Li received the B.S. degree in remote sensing science and technology from Wuhan University, Wuhan, China, in 2014, and the Ph.D. degree from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China, in 2019. He is a Post-Doctoral Associate with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi, UAE. His research interests include deep learning, computer vision, and remote sensing image recognition.

Jingyu Deng received the B.S. degree in IoT engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2018, and the M.S. degree in computer engineering from the NYU Tandon School of Engineering, New York, NY, USA, in 2020. His research interests include deep learning and computer vision.

Yi Fang (Member, IEEE) received the B.S. and M.S. degrees in biomedical engineering from Xi’an Jiaotong University, Xi’an, China, in 2003 and 2006, respectively, and the Ph.D. degree in mechanical engineering from Purdue University, West Lafayette, IN, USA, in 2011. He is an Assistant Professor with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi, UAE. His research interests include 3-D computer vision and pattern recognition, large-scale visual computing, deep visual computing, deep cross-domain and cross-modality multimedia analysis, and computational structural biology.
