Few-Shot Object Detection on Remote Sensing Images

Abstract— In this article, we deal with the problem of object detection on remote sensing images. Previous researchers have developed numerous deep convolutional neural network (CNN)-based methods for object detection on remote sensing images, and they have reported remarkable achievements in detection performance and efficiency. However, current CNN-based methods often require a large number of annotated samples to train deep neural networks and tend to have limited generalization abilities for unseen object categories. In this article, we introduce a metalearning-based method for few-shot object detection on remote sensing images where only a few annotated samples are needed for the unseen object categories. More specifically, our model contains three main components: a metafeature extractor that learns to extract metafeature maps from input images, a feature reweighting module that learns class-specific reweighting vectors from the support images and uses them to recalibrate the metafeature maps, and a bounding box prediction module that carries out object detection on the reweighted feature maps. We build our few-shot object detection model upon the YOLOv3 architecture and develop a multiscale object detection framework. Experiments on two benchmark data sets demonstrate that, with only a few annotated samples, our model can still achieve a satisfying detection performance on remote sensing images, and the performance of our model is significantly better than the well-established baseline models.

Index Terms— Few-shot learning, metalearning, object detection, remote sensing images, You-Only-Look-Once (YOLO).

Manuscript received November 6, 2020; revised December 14, 2020; accepted January 10, 2021. Date of publication February 24, 2021; date of current version December 6, 2021. This work was supported in part by the ADEK under Grant AARE-18150 and in part by the NYU Abu Dhabi Institute under Grant AD131. (Xiang Li and Jingyu Deng contributed equally to this work.) (Corresponding author: Yi Fang.) Xiang Li and Yi Fang are with the NYU Multimedia and Visual Computing Laboratory, New York University Abu Dhabi, Abu Dhabi 129188, UAE, also with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi 129188, UAE, and also with the Department of Electrical and Computer Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201 USA (e-mail: [email protected]). Jingyu Deng is with the NYU Multimedia and Visual Computing Laboratory, New York University Abu Dhabi, Abu Dhabi 129188, UAE, and also with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi 129188, UAE. Digital Object Identifier 10.1109/TGRS.2021.3051383

I. INTRODUCTION

OBJECT detection has been a long-standing problem in both the remote sensing and computer vision fields. It is generally defined as identifying the location of target objects in the input image and recognizing the object categories. Automatic object detection has been widely used in many real-world applications, such as hazard detection, environmental monitoring, change detection, and urban planning [1], [2].

In the past decades, object detection has been extensively studied, and a large number of methods have been developed for the detection of both artificial objects (e.g., vehicles, buildings, roads, and bridges) and natural objects (e.g., lakes, coasts, and forests) in remote sensing images. Existing object detection methods in remote sensing images (RSIs) can be roughly divided into four categories: 1) template matching-based methods; 2) knowledge-based methods; 3) object-based image analysis (OBIA)-based methods; and 4) machine learning-based methods [1]. Among these methods, the machine learning-based methods have powerful abilities for robust feature extraction and object classification; they have been extensively studied by many recent approaches and have achieved significant progress in solving this problem [3]–[6].

In recent years, among all machine learning-based methods for object detection, deep learning methods, especially convolutional neural networks (CNNs), have drawn immense research attention. Due to the powerful feature extraction abilities of CNN models, a huge number of CNN-based methods have been developed for object detection on both optical and remote sensing images. Notable methods include Faster R-CNN [7], You-Only-Look-Once (YOLO) [8], and the single-shot detector (SSD) [9]. In the remote sensing field, recent studies mostly build their methods using these prevalent deep learning-based architectures.

Despite the breakthroughs achieved by deep learning-based methods for object detection, these methods suffer from a common issue: a large-scale, diverse data set is required to train a deep neural network model. Any adjustment of the candidate identifiable classes is expensive for existing methods because collecting a new RSI data set with a large number of manual annotations is costly, and these methods need a lot of time to re-train their models on the newly collected data set. On the other hand, training a model with only a few samples from the new classes tends to cause the model to suffer from the overfitting problem, and its generalization abilities are severely reduced. Therefore, a special mechanism of learning robust detection from a few samples of the new classes is desired for object detection on RSIs.

In the past few years, few-shot learning has been extensively studied in the computer vision field to undertake the tasks of scene classification [10]–[12], image segmentation [13]–[15], object detection [16]–[19], and shape analysis [20], [21]. Few-shot learning aims at learning to learn transferable knowledge that can be well generalized to new classes and, therefore, performs image recognition (e.g., classification and detection) on new classes where only a few annotated samples are given.
detecting a fixed number of objects. A deep CNN architecture is designed to learn high-level feature representations for each cell, and a succession of fully connected layers is used to predict the object categories and locations. YOLO is generally a lot faster than two-stage object detectors but with inferior detection performance. Subsequent variants, such as YOLOv2 [35] and YOLOv3 [36], improve the performance by using a more powerful backbone network and conducting object detection at multiple scales. More specifically, the YOLOv3 model adopts FPN [33] as the backbone network and, thus, enables more powerful feature extraction and detection at different scales. Follow-on research efforts mostly improve the performance by using deconvolutional layers [37], a multiscale detection pipeline [9], or the focal loss [38].

B. Object Detection in RSIs

Existing methods for object detection on remote sensing images fall into four categories: template matching-based methods, knowledge-based methods, object-based image analysis (OBIA)-based methods, and machine learning-based methods [1]. The template matching-based methods use stored templates, which are generated through handcrafting or training, to find the best matches at each possible location in the source image. Typical template matching-based methods include rigid template matching [39]–[41] and deformable template matching [42]. Knowledge-based methods treat the object detection problem as a hypothesis testing process by using preestablished knowledge and rules. Two kinds of well-known knowledge are geometric knowledge [43]–[46] and context knowledge [43], [47], [48]. OBIA-based methods start with segmenting images into homogeneous regions that represent a relatively homogeneous group of pixels and then perform region classification using region-level features from handcrafted feature engineering. The last family of methods, machine learning-based object detectors, contains two fundamental processes: handcrafted feature extraction and classification using machine learning-based algorithms. Machine learning-based methods have shown more powerful generalization abilities compared to the other three families of methods [1].

Among all machine learning-based methods, deep learning-based methods have drawn enormous research attention and are widely used in recent RSI object detection works. Unlike traditional machine learning-based methods that use handcrafted features, deep learning-based methods use deep neural networks to automatically learn robust features from input images. Following this research trajectory, early efforts adopted the R-CNN architecture to detect geospatial objects on remote sensing images [49]–[56]. For example, Chen et al. [49] introduced a new rotation-invariant layer to the R-CNN architecture to enhance the performance for detection of objects with different orientations. Zhang et al. [57] introduced a hierarchical feature encoding network for robust object representation learning and demonstrated the effectiveness of their method for object detection on high-resolution remote sensing images. Following the great success of Faster R-CNN, numerous works have tried to extend the Faster R-CNN framework to the remote sensing community [58]–[61]. For example, Li et al. [53] developed a rotation-insensitive RPN by using multiangle anchors instead of the horizontal anchors used in conventional RPN networks. The proposed method can effectively detect geospatial objects of arbitrary orientations.

Following the great success of one-stage methods for object detection on natural images, researchers also developed various regression-based methods for object detection on remote sensing images [26], [51], [61], [62]. For example, [61] extends the SSD model to conduct real-time vehicle detection on remote sensing images. Reference [62] replaces the horizontal anchors with oriented anchors in the SSD [9] framework and, thus, enables the model to detect objects with orientation angles. Subsequent methods further enhance the performance of geospatial object detection on remote sensing images by using hard example mining [51], multifeature fusion [63], transfer learning [64], nonmaximum suppression [65], and so on.

C. Few-Shot Detection

Few-shot learning aims at learning to learn transferable knowledge that can be generalized to new classes and, therefore, performs image recognition (e.g., classification, detection, and segmentation) on new classes where only a few annotated samples are given. In recent years, few-shot detection has received increased attention in the computer vision field. Chen et al. [66] proposed to fine-tune a pretrained model, such as Faster R-CNN [7] or SSD [9], on a few given examples and transfer it into a few-shot object detector. In [67], the authors enrich the training examples with additional unannotated data in a semisupervised setting and obtain performances comparable to weakly supervised methods with a large amount of training data. Karlinsky et al. [17] introduced a metric learning subnet to replace the classification head of a standard detection architecture [33] and achieved satisfying detection performance with a few training samples. Kang et al. [19] introduced a reweighting module to produce a group of reweighting vectors from a few supporting samples, one for each class, to reweight the metafeatures extracted from the DarkNet-19 network. With the reweighted metafeatures, a bounding box prediction module is adapted to produce the detection results. However, DarkNet-19 only produces a single metafeature map for each input image, leading to poor performance when detecting objects with large size variations. In contrast to [19], which only conducts object detection on a single-scale feature map, our proposed method extracts hierarchical feature maps at different scales from an FPN-like structure and improves the performance by performing multiscale object detection in the few-shot scenario.

D. Few-Shot Learning and Transfer Learning

Few-shot learning is one of the applications of metalearning in the supervised domain. It shares a lot of similarities in terms of reusability with another commonly used technique called transfer learning. Metalearning, also called learning to learn, is generally defined as the machine learning theory in
which some algorithms are designed to learn new concepts and skills fast with a few training examples. In contrast, transfer learning works by transferring the knowledge learned from one problem to a different but related problem. By comparison, metalearning is more about learning prior information from a branch of tasks that can be used to quickly learn new models for some new tasks, whereas transfer learning uses the model trained for a source task as the initialization for some new target tasks that are relatively similar to the source task. Thus, although they can both be used for task-to-task transferring, their focuses are quite different: one tries to learn prior knowledge and fast adaptation for new tasks, and the other simply reuses an already optimized model, or part of it.

III. METHOD

A. Method Overview

We first clarify the settings for the few-shot object detection problem. The problem of few-shot object detection aims at learning a detection model from the data set of seen classes that can conduct object detection on images from unseen classes with only a few annotated samples. There are adequate samples for model training for each seen class, while each unseen class has only a few annotated samples. A few-shot object detection model should be able to learn metaknowledge from the data set of seen classes and transfer it well to the unseen classes.

This few-shot object detection setting is very common in real-world scenarios. One may need to develop a new object detection model, while collecting a large-scale data set for the target classes is time-consuming. A good starting point would be deploying a detection model pretrained on some existing large-scale object detection data sets (e.g., DIOR [2]). However, these data sets only cover a limited number of object categories, while one may only focus on several specific object categories that may not happen to be included in these data sets. This calls for the need for few-shot-based object detection models in the remote sensing field.

To achieve few-shot learning, a common solution is to build the detection model using metalearning techniques. In a few-shot detection scenario, a metalearning algorithm learns to learn metaknowledge from a large number of detection tasks sampled from the seen classes and, thus, can generalize well to unseen classes. Each of the sampled tasks is called an episode. Each episode E is constructed from a set of support images S (with annotations) and a set of query images Q. For each episode, the support images can be regarded as the training samples and are used for learning how to solve this task, while the query images can be regarded as the test samples and are used for evaluating the performance on this task. Fig. 2 gives an illustration of the few-shot detection setting.

Fig. 2. Illustration of a three-way-two-shot setting. For each task, the support set consists of two annotated images for each of the three object categories.

We follow [19] to construct episodes from the data set of both seen and unseen categories. Given an N-way-K-shot detection task, each support set S consists of K annotated images for each of the N object categories. We denote the support set as $S = \{(I_c^k, M_c^k)\}$, where $I_c^k \in \mathbb{R}^{h \times w \times 3}$ denotes the input image, $M_c^k$ denotes the corresponding bounding box annotations, $c = 1, 2, \ldots, N$, and $k = 1, 2, \ldots, K$. The query set Q contains $N_q$ images from the same set of classes C as the support set. During metatraining, we randomly sample episodes from the data set of seen classes to train our model to learn metaknowledge of how to detect objects in the query images given the information in the support images. Each of the sampled episodes/tasks can be completely nonoverlapping. After metatraining, we fine-tune our model on the unseen classes with a few samples to make our model well adapted to the unseen classes.
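The episode construction described above can be made concrete with a short sketch. The following Python snippet is a minimal illustration rather than the authors' implementation: it assumes that annotations have already been organized per class, and the names `build_episode`, `annotations_by_class`, and `query_by_class` are ours, used for illustration only.

```python
import random

def build_episode(annotations_by_class, query_by_class,
                  n_way=3, k_shot=2, n_query=4, seed=None):
    """Assemble one N-way-K-shot episode (support set S, query set Q).

    annotations_by_class: dict mapping class name -> list of (image_path, boxes) pairs.
    query_by_class:       dict mapping class name -> list of image paths.
    Returns (support, query); the support set holds K annotated samples per class.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(annotations_by_class), n_way)   # the N "ways"

    # Support set: K annotated (image, boxes) pairs for each sampled class.
    support = {c: rng.sample(annotations_by_class[c], k_shot) for c in classes}

    # Query set: images drawn from the same classes; the model is evaluated on these.
    query_pool = [img for c in classes for img in query_by_class[c]]
    query = rng.sample(query_pool, min(n_query, len(query_pool)))
    return support, query

# Example: a three-way-two-shot episode, as illustrated in Fig. 2.
if __name__ == "__main__":
    ann = {c: [(f"{c}_{i}.png", [(10, 10, 50, 50)]) for i in range(20)]
           for c in ["airplane", "ship", "storage_tank", "harbor"]}
    qry = {c: [f"{c}_q{i}.png" for i in range(30)] for c in ann}
    S, Q = build_episode(ann, qry, n_way=3, k_shot=2, n_query=6, seed=0)
    print(sorted(S), len(Q))
```

For the three-way-two-shot setting of Fig. 2, N = 3 and K = 2, so each episode carries two annotated support images per sampled class.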
Fig. 3. Pipeline of the proposed method for few-shot object detection on remote sensing images. Our method consists of three main components, a metafeature extractor, a reweighting module, and a bounding box prediction module. The feature extractor network takes a query image as input and produces metafeature maps at three different scales. The reweighting module takes as input K support images with labels for each of the N classes and outputs three groups of N reweighting vectors. These reweighting vectors are used to recalibrate the metafeature maps of the same scale through a channelwise multiplication. The reweighted feature maps are then fed into three independent bounding box detection modules to predict the bounding box locations and sizes ($x_p$, $y_p$, $w_p$, and $h_p$), the objectness scores ($o_p$), and the classification scores ($r_p$) at three different scales.
Fig. 3 illustrates the pipeline of the proposed method. Our few-shot object detection model (FSODM) is designed to leverage the metaknowledge from the data set of seen classes. Specifically, a metafeature extractor module is first developed to learn metafeatures at three different scales from input query images. Then, a feature reweighting module takes as input N support images with labels, one for each class, and outputs three groups of N reweighting vectors, one group for each scale. These reweighting vectors are used to recalibrate the metafeatures of the same scale through a channelwise multiplication. With the reweighting module, the metainformation from the support samples is extracted and used to amplify those metafeatures that are informative for detecting novel objects in the query images. The reweighted metafeatures are then fed into three independent bounding box detection modules to predict the objectness scores (o), the bounding box locations and sizes (x, y, w, h), and the class scores (c) at three different scales.

B. Metafeature Extractor

Our metafeature extractor network is designed to extract robust feature representations from input query images. Unlike the setting in [19], which only extracts single-scale metafeatures, objects in remote sensing images can have quite different sizes. Therefore, a multiscale feature extraction network is desired. In this article, our feature extractor network is designed based on DarkNet-53 [36] and FPN [33]. The detailed network architecture can be found in [36]. For each input query image, our metafeature extractor network produces metafeatures at three different scales. Let $I^q \in Q$ ($q \in \{1, 2, \ldots, N_q\}$) be one input query image; the generated metafeatures after the feature extractor network can be formulated as

$$F_i = \mathcal{F}_\theta(I^q) \in \mathbb{R}^{h_i \times w_i \times m_i} \qquad (1)$$

where $\mathcal{F}_\theta$ denotes the feature encoding network with parameters $\theta$, $i$ denotes the scale level, $i \in \{1, 2, 3\}$, and $h_i$, $w_i$, and $m_i$ denote the sizes of the feature maps at scale $i$.

In this article, we choose feature maps at the scales of 1/32×, 1/16×, and 1/8×, i.e., the output feature maps will have sizes of (h/32 × w/32 × 1024), (h/16 × w/16 × 512), and (h/8 × w/8 × 256).

C. Feature Reweighting Module

Our feature reweighting module is designed to extract metaknowledge from the support images and to guide object detection in query images. To achieve this goal, a lightweight CNN is formulated to map each support image to a set of reweighting vectors, one for each scale. These reweighting vectors will be used to adjust the contribution of metafeatures and highlight metafeatures significant for novel object detection.

Assuming the support samples are from N object categories, our feature reweighting module receives inputs of N × K support images and their annotations. For each of the support classes c, where $c \in \{1, 2, \ldots, N\}$, K support images $\{I_c^k\}_{k=1}^{K}$, along with their corresponding bounding box annotations $\{M_c^k\}_{k=1}^{K}$, will be randomly chosen from the support set. Our feature reweighting module first extracts per-object features using a feature encoding network $\mathcal{G}_\phi$ (with parameters $\phi$) and then averages the feature vectors by object classes, resulting in class-specific representations $V_i^c = \mathrm{average}\{\mathcal{G}_\phi(I_c^k, M_c^k)\}_{k=1}^{K}$.
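A compact sketch of the channelwise recalibration is given below. It is written in PyTorch-style tensor code and only illustrates the broadcasting involved; the full reweighting module and feature extractor in the paper are CNNs, and the function and variable names here are illustrative assumptions. The channel counts follow Section III-B (1024, 512, and 256 channels at the three scales).

```python
import torch

def reweight_metafeatures(meta_feats, reweight_vecs):
    """Channelwise recalibration of multiscale metafeatures.

    meta_feats:    list of 3 query feature maps, shapes (B, C_i, H_i, W_i),
                   with C_i in {1024, 512, 256} as in Section III-B.
    reweight_vecs: list of 3 tensors, shapes (N, C_i), one class-specific
                   reweighting vector per support class at each scale.
    Returns a list of 3 tensors shaped (B, N, C_i, H_i, W_i): one reweighted
    feature map per (query image, support class) pair.
    """
    out = []
    for feat, w in zip(meta_feats, reweight_vecs):
        B, C, H, W = feat.shape
        N = w.shape[0]
        # Broadcast: (B, 1, C, H, W) * (1, N, C, 1, 1) -> (B, N, C, H, W)
        out.append(feat.unsqueeze(1) * w.view(1, N, C, 1, 1))
    return out

# Toy example: a batch of 2 query images and N = 3 support classes.
if __name__ == "__main__":
    feats = [torch.randn(2, c, s, s) for c, s in [(1024, 13), (512, 26), (256, 52)]]
    vecs = [torch.rand(3, c) for c in (1024, 512, 256)]
    for r in reweight_metafeatures(feats, vecs):
        print(tuple(r.shape))
```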
Fig. 4. Anchor boxes on three feature maps at different scales. Green boxes represent ground-truth bounding boxes, and yellow boxes represent anchor boxes
at different scales. The size of the input image is 800 × 800. (a)–(c) Anchor settings of its small (25 × 25), middle (50 × 50), and large (100 × 100) feature
maps.
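The grid geometry implied by Fig. 4 (an 800 × 800 input with 25 × 25, 50 × 50, and 100 × 100 feature maps, i.e., strides of 32, 16, and 8 pixels) can be illustrated with a small helper. This is only a sketch of how a ground-truth box maps to the grid cell that contains its center at each scale; the helper name and the exact assignment rule used in the training code are assumptions on our part.

```python
def responsible_cells(box_xyxy, img_size=800, grid_sizes=(25, 50, 100)):
    """For a ground-truth box, find the grid cell containing its center at each scale.

    box_xyxy:   (x_min, y_min, x_max, y_max) in pixels.
    grid_sizes: feature-map resolutions from Fig. 4 (small, middle, large).
    Returns {grid_size: (row, col)} of the cell responsible for the box.
    """
    x_min, y_min, x_max, y_max = box_xyxy
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    cells = {}
    for g in grid_sizes:
        stride = img_size / g                 # 32, 16, or 8 pixels per cell
        col = min(int(cx // stride), g - 1)
        row = min(int(cy // stride), g - 1)
        cells[g] = (row, col)
    return cells

# A 120x90 object centered near (400, 250) on an 800x800 image.
print(responsible_cells((340, 205, 460, 295)))
# -> {25: (7, 12), 50: (15, 25), 100: (31, 50)}
```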
E. Loss Function

The loss function of our object detection model contains two parts: object localization loss and object classification loss. For object localization, we use the mean-square-error loss to penalize the misalignment between the predicted bounding boxes and the ground-truth ones. Given the predicted bounding box coordinates $\mathrm{coord}_p$ and the ground-truth bounding box coordinates $\mathrm{coord}_t$, the object localization loss is calculated as

$$\mathcal{L}_{\mathrm{loc}} = \frac{1}{N_{\mathrm{pos}}} \sum_{l \in \mathrm{pos}} \left( \mathrm{coord}_t^l - \mathrm{coord}_p^l \right)^2 \qquad (5)$$
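A direct reading of (5) in code, assuming the predictions and ground truths are stacked per candidate location and positives are marked by a binary mask, is given below. It is a sketch, not the paper's training code.

```python
import torch

def localization_loss(coord_pred, coord_gt, pos_mask):
    """Mean-square-error localization loss over positive locations, as in (5).

    coord_pred, coord_gt: (B, A, 4) predicted / ground-truth box coordinates
                          (x, y, w, h) for every candidate location.
    pos_mask:             (B, A) mask with 1 for positive (object) locations, 0 otherwise.
    """
    n_pos = pos_mask.sum().clamp(min=1)          # N_pos, guarded against division by zero
    diff = (coord_gt - coord_pred) ** 2          # squared coordinate misalignment
    return (diff.sum(dim=-1) * pos_mask).sum() / n_pos

# Toy check: 1 image, 3 candidate locations, only the first two are positives.
pred = torch.tensor([[[0.5, 0.5, 1.0, 1.0], [0.2, 0.3, 0.5, 0.5], [0.0, 0.0, 0.1, 0.1]]])
gt   = torch.tensor([[[0.6, 0.5, 1.0, 1.2], [0.2, 0.4, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]]])
mask = torch.tensor([[1.0, 1.0, 0.0]])
print(localization_loss(pred, gt, mask))         # tensor(0.0300)
```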
ignored during the classification loss calculation. The overall objective loss function is formulated as

$$\mathcal{L} = \mathcal{L}_{\mathrm{loc}} + \mathcal{L}_{\mathrm{obj}} + \mathcal{L}_{\mathrm{cls}}. \qquad (8)$$

F. Training and Inference

In our few-shot detection model, the training and inference processes are conducted with episode data. To facilitate model training in the few-shot detection scenario, we reorganize the training data set into two sets: the query set (Q) and the support set (S). The support set is a training data set regrouped by object classes. As explained in III-C, each query image is associated with a group of support images from all classes. Therefore, we separate training images into N groups according to object categories. After regrouping, a bounding box mask is generated for each support image. The mask is generated by setting the pixel value to 1 when the pixel is located within the ground-truth bounding box and 0 otherwise. Therefore, a support set can be formulated as

Algorithm 1 Training and Testing Process
1: Construct training set Dtrain from seen classes and testing set Dtest from unseen classes.
2: Initialize the network parameters θ, φ, ψ in the feature extractor network, feature reweighting module, and bounding box prediction module.
3: for each training episode (S, Q) ∈ Dtrain do
4:     Model meta-training.
5: end for
6: for each training episode (S, Q) ∈ Dtest do
7:     Model meta-finetuning.
8: end for
9: for each testing episode (S, Q) ∈ Dtest do
10:    Extract feature maps for query images using the Meta Feature Extractor.
11:    Generate class-specific reweighting vectors and compute the reweighted feature maps.
12:    Generate predicted bounding boxes using the bounding box prediction modules.
13: end for
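The bounding box mask described above is straightforward to generate. The sketch below assumes boxes are given in pixel coordinates as (x_min, y_min, x_max, y_max); the function name is illustrative.

```python
import numpy as np

def support_mask(height, width, boxes_xyxy):
    """Binary mask for a support image: 1 inside ground-truth boxes, 0 elsewhere.

    boxes_xyxy: list of (x_min, y_min, x_max, y_max) boxes in pixel coordinates.
    The mask accompanies the support image as input to the reweighting module.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes_xyxy:
        x_min, y_min = max(int(x_min), 0), max(int(y_min), 0)
        x_max, y_max = min(int(x_max), width), min(int(y_max), height)
        mask[y_min:y_max, x_min:x_max] = 1.0
    return mask

# One 800x800 support image with a single annotated object.
m = support_mask(800, 800, [(120, 60, 300, 220)])
print(m.shape, m.sum())   # (800, 800) 28800.0
```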
TABLE II
Few-Shot Detection Performance (mAP) on the Unseen Classes of the NWPU VHR-10 Data Set. The Top Part Shows Conventional Object Detectors, and the Bottom Part Shows Few-Shot Learning-Based Methods
tennis court) are used as unseen classes and the others as seen classes. For the DIOR data set, five classes (airplane, baseball field, tennis court, train station, and windmill) are chosen as unseen classes and the others as seen classes. In our experiments, we randomly choose the seen/unseen classes. It should be noted that different seen/unseen splits can lead to different final detection performance, but, in this article, we aim to introduce the first few-shot learning-based method for object detection on remote sensing images rather than to achieve state-of-the-art performance. Note that one image can contain several object instances; the number of shots in our experiments means the number of object instances, not the number of images.

Moreover, we apply a multiscale training technique to enhance detection performance. The scale of the input images varies in (384, 416, 448, 480, 512, 544, 576, 608, and 640), and all input images are square. We note that, in the DIOR data set, the original images are much larger than the desired input scales. Therefore, those large images are cropped into a series of patches with 1024 × 1024 pixels and a stride of 512 pixels (for the DIOR data set, this step is ignored). For the objects that get truncated in this process, we ignore the truncated object instances that have an overlap of less than 70% with the original object instances.

C. Comparing Methods

We compare our FSODM model with the prevalent one-stage object detector YOLOv3 [36] and the two-stage object detector Faster R-CNN [7] (Faster R-CNN includes two types of feature extraction networks, ResNet101 and VGG16). We train these conventional object detectors using transfer learning techniques, where the training process of these conventional object detectors consists of two steps: pretraining and model fine-tuning. In the pretraining stage, we remove all objects belonging to the unseen classes from the training data and train a conventional non-few-shot model; in the fine-tuning stage, we train the model with a few annotated samples from the unseen classes. Note that these conventional detectors use complicated data augmentation strategies to enhance their performance; in our experiments, we do not implement these strategies in order to have a fair comparison with our method. There are only a few methods focused on the problem of few-shot detection, so we only include two current state-of-the-art few-shot detectors, YOLO-Low-Shot [19] and RepMet [17]. For [19], the experimental settings are the same as for our method: training on the same set from the seen classes and fine-tuning on the same set from the unseen classes.

The mean average precision (mAP) is used to evaluate object detection performance. We follow the PASCAL VOC2007 benchmark [70] to calculate the mAP, which takes the average of 11 precision values as recall increases from 0 to 1 with a step of 0.1.

D. Results on NWPU VHR-10

Table II lists the few-shot object detection performance of our FSODM method and the comparison methods on the unseen classes of the NWPU VHR-10 data set. In this table, we show the performance under different numbers of shots (i.e., annotated samples in unseen classes). As shown in Table II, our proposed FSODM model achieves significantly better performance than all comparing methods. More specifically, compared to the current state-of-the-art few-shot object detector [19], our method obtains an mAP 166.6% higher in the three-shot setting, 120.8% higher in the five-shot setting, and 62.5% higher in the ten-shot setting. The conventional non-few-shot-based methods (Faster RCNN and YOLOv3) obtain much worse performance than the two few-shot-based methods. Even in the ten-shot setting, Faster RCNN with ResNet101 only gets an mAP of 0.24, which is worse than our FSODM model in the three-shot setting with an mAP of 0.32. Moreover, as shown in Table II, with increases in the number of annotated samples in the unseen classes, the detection performance of our FSODM model increases quickly.

From Table II, one can also see that both our FSODM model and the comparison methods obtain better performance on the ‘baseball diamond’ category. This is because baseball diamonds have smaller size variations, which reduces the recognition challenges for a detection model.

TABLE III
Few-Shot Detection Performance (mAP) on the Unseen Classes of the DIOR Data Set. The Top Part Shows Conventional Object Detectors, and the Bottom Part Shows Few-Shot Learning-Based Methods

Fig. 6. Selected examples of our few-shot detection results. (Left) Detection results on the unseen classes of the NWPU VHR data set in a ten-shot setting. (Right) Detection results on the unseen classes of the DIOR data set in a 20-shot setting. Red, yellow, and blue boxes indicate true positive, false positive, and false negative detections, respectively.

E. Results on DIOR

Considering that the DIOR data set is a large-scale data set with large variations in object structures and sizes, a larger number of annotated samples are used for the unseen classes. Specifically, for the conventional object detectors (Faster RCNN and YOLOv3), we conduct experiments with 10, 20, and 30 annotated samples for each of the unseen classes. Table III shows the quantitative results of our method and the comparing methods on the unseen classes of the DIOR data set.

As shown in Table III, our FSODM model achieves better performance than the other two few-shot-based methods presented in [17] and [19]. All these three few-shot-based methods achieve much better performance than the non-few-shot-based methods (Faster R-CNN and YOLOv3), even with fewer samples. Moreover, with the increase in the number of annotated samples in unseen classes, the detection performance improves consistently for all methods. Table III also shows that the “baseball field” and “tennis court” categories reach better detection performance. This is probably because these two object categories have smaller in-category variations.

Fig. 6 shows some examples of the few-shot detection results of our FSODM model on the NWPU VHR-10 data set and the DIOR data set. As shown in Fig. 6, our model can successfully detect most of the objects in all the unseen classes of the NWPU VHR-10 and the DIOR data sets. Most of the failure cases come from the missing or false detection of small objects. Moreover, with only a few annotated samples of the unseen classes, our model fails to accurately localize “train stations” with large sizes or appearance variations.
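The 11-point interpolated AP used throughout the evaluation (the PASCAL VOC2007 protocol [70]) can be summarized in a few lines. This is a generic sketch that assumes a precision/recall curve has already been computed for one class; the mAP reported in Tables II and III is the mean of such per-class APs.

```python
import numpy as np

def voc2007_ap(recall, precision):
    """11-point interpolated average precision (PASCAL VOC2007 protocol [70]).

    recall, precision: arrays describing the precision/recall curve of one class,
    ordered by decreasing detection confidence.  For each of the 11 recall levels
    0.0, 0.1, ..., 1.0, the interpolated precision is the maximum precision
    achieved at any recall >= that level; the AP is their average.
    """
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for t in [i / 10.0 for i in range(11)]:
        above = recall >= t
        p = precision[above].max() if above.any() else 0.0
        ap += p / 11.0
    return ap

# Toy precision/recall curve for a single class.
r = [0.1, 0.2, 0.4, 0.4, 0.6, 0.8]
p = [1.0, 1.0, 0.75, 0.6, 0.6, 0.55]
print(round(voc2007_ap(r, p), 3))   # -> 0.618
```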
TABLE IV
Detection Performance (mAP) on the Seen Classes of the NWPU VHR-10 Data Set

TABLE V
Detection Performance (mAP) on the Seen Classes of the DIOR Data Set
V. DISCUSSION

A. Detection Performance on Seen Classes

A good few-shot object detection model should not only perform well on the unseen classes with few annotated samples but also should not sacrifice the performance on seen classes, which means that it should also perform on par with the conventional non-few-shot-based models when data are abundant.

Tables IV and V show the performances of our FSODM model and the comparison methods on the seen classes of the NWPU VHR-10 and DIOR data sets. From Table IV, one can see that all three methods achieve similar performances on the seen classes, with slight differences in the mAP values. On the DIOR data set, our method performs better than another few-shot-based method [19]. This demonstrates that our proposed method can better maintain the performance on the seen classes under the few-shot detection scenario. The performance of our few-shot-based method achieves the same mAP value as the conventional YOLOv3 detector when a large amount of data is provided.

Fig. 7. Detection performance with different shots on the NWPU VHR-10 data set. Horizontal dash lines indicate the baseline performances generated using all the training samples in the data set (1025 samples for the airplane category, 519 samples for the baseball diamond category, and 643 samples for the tennis court category).
Fig. 8. t-SNE [71] visualization of reweighting vectors. The reweighting vectors are generated from 400 support images randomly picked from the DIOR
data set (20 images of each category). (a)–(c) Visualizations of reweighting vectors with dimensions of 256, 512, and 1024, respectively.
diamond class, our model with only 20 annotated samples achieves almost the same performance as the baseline model that uses all the training samples. This is probably because baseball diamonds have smaller in-category variations and are easily identified by their structures from a few annotated samples. In contrast, although the airplane class has almost the same detection performance as the baseball diamond class using the baseline model, the few-shot detection performance is significantly worse. This is because objects in the airplane category have larger structural and size variations, as shown in Fig. 6. This is a challenge that impedes our model from achieving a satisfying performance with only a few samples (less than 60), even though our few-shot-based model can successfully obtain a performance comparable to the baseline model when enough annotated samples (60 shots) are given.

C. Reweighting Vectors

In our approach, the reweighting vectors are extracted by the reweighting module and significantly support the final detection performance. To explore the relationship between these reweighting vectors, we use t-Distributed Stochastic Neighbor Embedding (t-SNE) [71] to reduce their dimensions and visualize them on the coordinate axis. t-SNE is a dimensionality reduction technique that can pass the inner relationship between high-dimensional vectors to low-dimensional vectors. It keeps vectors that are close in the high-dimensional space close in the low-dimensional space and vectors that are remote in the high-dimensional space remote in the low-dimensional space.

Fig. 8 shows some examples of visualized reweighting vectors. In the figure, reweighting vectors from the same categories tend to aggregate together, which suggests that the learned reweighting vectors successfully characterize the object class information from the original support masks. In addition, the clustering results in Fig. 8(c) are obviously better than the results in Fig. 8(a) and (b). The reason is that the more

D. Detection Speed

In this section, we explore the detection speed of our FSODM model. We compare the detection speed of our model with Faster RCNN, YOLOv3, RepMet, and YOLO-Low-Shot. Experiments are carried out on the NWPU data set in a three-way-ten-shot setting. We report the inference speed on a Tesla P100 GPU with the batch size set to 1. Results are listed in Table VI. From Table VI, one can see that our FSODM model obtains a detection speed comparable with YOLOv3 and is much faster than the two Faster R-CNN methods and another few-shot-based method, RepMet. YOLO-Low-Shot is about 2× faster than our method because it uses a single-scale detection framework, while our FSODM model adopts a multiscale detection pipeline and can get better performance.

VI. CONCLUSION

This article introduces a new metalearning-based method for few-shot object detection on remote sensing images that is among the first methods to challenge this area of research. We first formulate the few-shot object detection problem on remote sensing images. Then, we introduce our proposed method that includes three main components: a metafeature extractor, a feature reweighting module, and a bounding box prediction module. Each module is designed in a multiscale architecture to enable multiscale object detection. Our method is trained with large-scale data from some seen classes and can learn metaknowledge from the seen classes and generalize well to unseen classes with only a few samples. Experiments on two public benchmark data sets demonstrate the powerful ability of our method for detecting objects from unseen classes through a few annotated samples. This work is the very first step in few-shot detection in the remote sensing field, and we will further improve it and keep exploring in this field.

REFERENCES

[1] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS J. Photogramm. Remote Sens., vol. 117, pp. 11–28, Jul. 2016.
[2] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS J. Photogramm. Remote Sens., vol. 159, pp. 296–307, Jan. 2020.
[3] X. Bai, H. Zhang, and J. Zhou, “VHR object detection based on structural feature extraction and query expansion,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6508–6520, Oct. 2014.
[4] F. Bi, B. Zhu, L. Gao, and M. Bian, “A visual search inspired computational model for ship detection in optical satellite images,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 4, pp. 749–753, Jul. 2012.
[5] X. Huang and L. Zhang, “Road centreline extraction from high-resolution imagery based on multiscale structural features and support vector machines,” Int. J. Remote Sens., vol. 30, no. 8, pp. 1977–1987, Apr. 2009.
[6] M. Volpi, F. D. Morsier, G. Camps-Valls, M. Kanevski, and D. Tuia, “Multi-sensor change detection based on nonlinear canonical correlations,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2013, pp. 1944–1947.
[7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[9] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3630–3638.
[11] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4077–4087.
[12] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4367–4375.
[13] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen, “CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5217–5226.
[14] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, “PANet: Few-shot image semantic segmentation with prototype alignment,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9197–9206.
[15] T. Hu, P. Yang, Z. Chiliang, G. Yu, Y. Mu, and C. Snoek, “Attention-based multi-context guiding for few-shot semantic segmentation,” in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 8441–8448.
[16] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos, “AGA: Attribute-guided augmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7455–7463.
[17] L. Karlinsky et al., “RepMet: Representative-based metric learning for classification and few-shot object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5197–5206.
[18] T. Wang, X. Zhang, L. Yuan, and J. Feng, “Few-shot adaptive faster R-CNN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7173–7182.
[19] B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell, “Few-shot object detection via feature reweighting,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8420–8429.
[20] Z. Chen, K. Yin, M. Fisher, S. Chaudhuri, and H. Zhang, “BAE-NET: Branched autoencoder for shape co-segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8490–8499.
[21] L. Wang, X. Li, and Y. Fang, “Few-shot learning of part-specific probability space for 3D shape segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4504–4513.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[24] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[25] Q. Wang, J. Gao, and X. Li, “Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes,” IEEE Trans. Image Process., vol. 28, no. 9, pp. 4376–4386, Sep. 2019.
[26] Y. Hu, X. Li, N. Zhou, L. Yang, L. Peng, and S. Xiao, “A sample update-based convolutional neural network framework for object detection in large-area remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 6, pp. 947–951, Jun. 2019.
[27] I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural network architecture for geometric matching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6148–6157.
[28] J. Chen, L. Wang, X. Li, and Y. Fang, “Arbicon-Net: Arbitrary continuous geometric transformation networks for image registration,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 3410–3420.
[29] X. Li, C. Wen, L. Wang, and Y. Fang, “Topology-constrained shape correspondence,” IEEE Trans. Vis. Comput. Graphics, early access, May 11, 2020, doi: 10.1109/TVCG.2020.2994013.
[30] X. Li, L. Wang, M. Wang, C. Wen, and Y. Fang, “DANCE-NET: Density-aware convolution networks with context encoding for airborne LiDAR point cloud classification,” ISPRS J. Photogramm. Remote Sens., vol. 166, pp. 128–139, Aug. 2020.
[31] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[32] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[33] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[34] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2017, pp. 2961–2969.
[35] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[36] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[37] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “DSSD: Deconvolutional single shot detector,” 2017, arXiv:1701.06659. [Online]. Available: http://arxiv.org/abs/1701.06659
[38] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[39] D. Chaudhuri, N. K. Kushwaha, and A. Samal, “Semi-automated road detection from high resolution satellite images by directional morphological enhancement and segmentation techniques,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 5, pp. 1538–1544, Oct. 2012.
[40] D. M. McKeown and J. L. Denlinger, “Cooperative methods for road tracking in aerial imagery,” in Proc. CVPR, Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1988, pp. 662–672.
[41] J. Zhou, W. F. Bischof, and T. Caelli, “Road tracking in aerial images based on human–computer interaction and Bayesian filtering,” ISPRS J. Photogramm. Remote Sens., vol. 61, no. 2, pp. 108–124, Nov. 2006.
[42] M. A. Fischler and R. A. Elschlager, “The representation and matching of pictorial structures,” IEEE Trans. Comput., vol. C-22, no. 1, pp. 67–92, Jan. 1973.
[43] A. Huertas and R. Nevatia, “Detecting buildings in aerial images,” Comput. Vis., Graph., Image Process., vol. 41, no. 2, pp. 131–152, Feb. 1988.
[44] J. C. McGlone and J. A. Shufelt, “Projective and object space geometry for monocular building extraction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 54–61.
[45] J. C. Trinder and Y. Wang, “Automatic road extraction from aerial images,” Digit. Signal Process., vol. 8, no. 4, pp. 215–224, Oct. 1998.
[46] U. Weidner and W. Förstner, “Towards automatic building extraction from high-resolution digital elevation models,” ISPRS J. Photogramm. Remote Sens., vol. 50, no. 4, pp. 38–49, Aug. 1995.
[47] H. G. Akcay and S. Aksoy, “Building detection using directional spatial constraints,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., Jul. 2010, pp. 1932–1935.
[48] R. B. Irvin and D. M. McKeown, “Methods for exploiting the relationship between buildings and their shadows in aerial imagery,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1564–1575, Nov./Dec. 1989.
[49] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[50] Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017.
[51] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, “Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining,” Sensors, vol. 17, no. 2, p. 336, Feb. 2017.
[52] Y. Yang, Y. Zhuang, F. Bi, H. Shi, and Y. Xie, “M-FCN: Effective fully convolutional network-based airplane detection framework,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 8, pp. 1293–1297, Aug. 2017.
[53] K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337–2348, Apr. 2018.
[54] Y. Zhong, X. Han, and L. Zhang, “Multi-class geospatial object detection based on a position-sensitive balancing framework for high spatial resolution remote sensing imagery,” ISPRS J. Photogramm. Remote Sens., vol. 138, pp. 281–294, Apr. 2018.
[55] W. Guo, W. Yang, H. Zhang, and G. Hua, “Geospatial object detection in high resolution satellite images based on multi-scale convolutional neural network,” Remote Sens., vol. 10, no. 1, p. 131, Jan. 2018.
[56] J. Yang, Y. Zhu, B. Jiang, L. Gao, L. Xiao, and Z. Zheng, “Aircraft detection in remote sensing images based on a deep residual network and super-vector coding,” Remote Sens. Lett., vol. 9, no. 3, pp. 228–236, Mar. 2018.
[57] Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5535–5548, Aug. 2019.
[58] Z. Zou and Z. Shi, “Ship detection in spaceborne optical image with SVD networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 5832–5845, Oct. 2016.
[59] H. Lin, Z. Shi, and Z. Zou, “Fully convolutional network with task partitioning for inshore ship detection in optical remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 10, pp. 1665–1669, Oct. 2017.
[60] W. Liu, L. Ma, and H. Chen, “Arbitrary-oriented ship detection framework in optical remote-sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 6, pp. 937–941, Jun. 2018.
[61] T. Tang, S. Zhou, Z. Deng, L. Lei, and H. Zou, “Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks,” Remote Sens., vol. 9, no. 11, p. 1170, Nov. 2017.
[62] L. Liu, Z. Pan, and B. Lei, “Learning a rotation invariant detector with rotatable bounding box,” 2017, arXiv:1711.09405. [Online]. Available: http://arxiv.org/abs/1711.09405
[63] J. Zhong, T. Lei, and G. Yao, “Robust vehicle detection in aerial images based on cascaded convolutional neural networks,” Sensors, vol. 17, no. 12, p. 2720, Nov. 2017.
[64] X. Han, Y. Zhong, and L. Zhang, “An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery,” Remote Sens., vol. 9, no. 7, p. 666, Jun. 2017.
[65] Z. Xu, X. Xu, L. Wang, R. Yang, and F. Pu, “Deformable ConvNet with aspect ratio constrained NMS for object detection in remote sensing imagery,” Remote Sens., vol. 9, no. 12, p. 1312, Dec. 2017.
[66] H. Chen, Y. Wang, G. Wang, and Y. Qiao, “LSTD: A low-shot transfer detector for object detection,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–8.
[67] X. Dong, L. Zheng, F. Ma, Y. Yang, and D. Meng, “Few-example object detection with model communication,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1641–1654, Jul. 2019.
[68] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.
[69] J. Niemeyer, F. Rottensteiner, and U. Soergel, “Contextual classification of lidar data and building object detection in urban areas,” ISPRS J. Photogramm. Remote Sens., vol. 87, pp. 152–165, Jan. 2014.
[70] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[71] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.

Xiang Li received the B.S. degree in remote sensing science and technology from Wuhan University, Wuhan, China, in 2014, and the Ph.D. degree from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, Beijing, China, in 2019. He is a Post-Doctoral Associate with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi, UAE. His research interests include deep learning, computer vision, and remote sensing image recognition.

Jingyu Deng received the B.S. degree in IoT engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2018, and the M.S. degree in computer engineering from the NYU Tandon School of Engineering, New York, NY, USA, in 2020. His research interests include deep learning and computer vision.

Yi Fang (Member, IEEE) received the B.S. and M.S. degrees in biomedical engineering from Xi’an Jiaotong University, Xi’an, China, in 2003 and 2006, respectively, and the Ph.D. degree in mechanical engineering from Purdue University, West Lafayette, IN, USA, in 2011. He is an Assistant Professor with the Department of Electrical and Computer Engineering, New York University Abu Dhabi, Abu Dhabi, UAE. His research interests include 3-D computer vision and pattern recognition, large-scale visual computing, deep visual computing, deep cross-domain and cross-modality multimedia analysis, and computational structural biology.