
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 61, 2023, 5605415

SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery

Jiaqing Zhang, Jie Lei, Member, IEEE, Weiying Xie, Member, IEEE, Zhenman Fang, Member, IEEE, Yunsong Li, Member, IEEE, and Qian Du, Fellow, IEEE

Abstract—Accurately and timely detecting multiscale small objects that contain tens of pixels from remote sensing images (RSI) remains challenging. Most of the existing solutions primarily design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this article, we propose an accurate yet fast object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high-resolution (HR) object detection on multiscale objects by utilizing assisted super resolution (SR) learning and considering both the detection accuracy and the computation cost. First, we utilize a symmetric compact multimodal fusion (MF) to extract supplementary information from various data for improving small object detection in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can discriminate small objects from vast backgrounds with low-resolution (LR) input, thus further improving the detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage, and the computation of the network model is reduced due to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 75.09% (in terms of mAP50), which is more than 10% higher than SOTA large models such as YOLOv5l, YOLOv5x, and the RS-designed YOLOrs. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18× and 3.8× less than those of YOLOv5x. Our proposed model shows a favorable accuracy–speed tradeoff compared to the state-of-the-art models. The code will be open-sourced at https://github.com/icey-zhang/SuperYOLO.

Index Terms—Feature fusion, multimodal remote sensing image, object detection, super resolution (SR).

Manuscript received 16 November 2022; revised 2 February 2023 and 26 February 2023; accepted 7 March 2023. Date of publication 17 March 2023; date of current version 31 March 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62071360. (Corresponding author: Jie Lei.)
Jiaqing Zhang, Jie Lei, Weiying Xie, and Yunsong Li are with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Zhenman Fang is with the School of Engineering Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada (e-mail: [email protected]).
Qian Du is with the Department of Electronic and Computer Engineering, Mississippi State University, Starkville, MS 39759 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TGRS.2023.3258666

I. INTRODUCTION

Object detection plays an important role in various fields involving computer-aided diagnosis or autonomous piloting. Over the past decades, numerous excellent deep neural network (DNN)-based object detection frameworks [1], [2], [3], [4], [5] have been proposed, updated, and optimized in computer vision. The remarkable accuracy enhancement of DNN-based object detection frameworks owes to the application of large-scale natural datasets with accurate annotations [6], [7], [8].

Compared with natural scenarios, there are several vital challenges for accurate object detection in remote sensing images (RSIs). First, the number of labeled samples is relatively small, which limits the training of DNNs to achieve high detection accuracy. Second, the size of objects in RSI is much smaller, accounting for merely tens of pixels in relation to the complicated and broad backgrounds [9], [10]. Moreover, the scale of those objects is diverse, with multiple categories [11]. As shown in Fig. 1(a), the object car is considerably small within a vast area. As shown in Fig. 1(b), the objects have large-scale variations, where the scale of a car is smaller than that of a camping vehicle.

Currently, most object detection techniques are solely designed and applied for a single modality, such as red-green-blue (RGB) and infrared (IR) [12], [13]. Consequently, the capability of object detection to recognize objects on the Earth's surface remains insufficient due to the deficiency of complementary information between different modalities [14]. As imaging technology flourishes, RSIs collected from multiple modalities become available and provide an opportunity to improve detection accuracy. For example, as shown in Fig. 1, the fusion of two different modalities (RGB and IR) can effectively enhance the detection accuracy in RSI. Sometimes, the resolution of one modality is low, which requires techniques that improve the resolution and enrich the information. Recently, super resolution (SR) technology has shown great potential in remote sensing fields [15], [16], [17], [18]. Benefiting from the vigorous development of the convolutional neural network (CNN), SR can recover rich texture information in remote sensing images for interpretation. However, due to the high computation cost of CNN-based SR networks, applying them to real-time practical tasks has become a hot topic in current research.

In this study, our motivation is to propose an on-board real-time object detection framework for multimodal RSIs to achieve high detection accuracy and high inference speed without introducing additional computation overhead.
Fig. 1. Visual comparison of RGB image, IR image, and ground truth (GT). The IR image provides vital complementary information for resolving the challenges in RGB detection. The object car in (a) is considerably small within a vast area. In (b), the objects have large-scale variation, where the scale of a car is smaller than that of a camping vehicle. The fusion of RGB and IR modalities effectively enhances detection performance.

Inspired by recent advances in real-time compact neural network models, we choose the small-size YOLOv5s [19] structure as our detection baseline. It can reduce deployment costs and facilitate rapid deployment of the model. Considering the high-resolution (HR) retention requirements for small objects, we remove the Focus module in the baseline YOLOv5s model, which not only benefits localizing small dense objects but also enhances the detection performance. Considering the complementary characteristics of different modalities, we propose a multimodal fusion (MF) scheme to improve the detection performance for RSI. We evaluate different fusion alternatives (pixel-level or feature-level) and choose pixel-level fusion for its low computation cost.

Lastly and most importantly, we develop an SR assurance module to guide the network to generate HR features that are capable of identifying small objects in vast backgrounds, thereby reducing false alarms induced by background-contaminated objects in RSI. Nevertheless, a naive SR solution can significantly increase the computation cost. Therefore, we engage the auxiliary SR branch in the training process and remove it in the inference stage, facilitating spatial information extraction in HR without increasing the computation cost.

In summary, this article makes the following contributions.

1) We propose a computation-friendly pixel-level fusion method to combine inner information bidirectionally in a symmetric and compact manner. It efficiently decreases the computation cost without sacrificing accuracy compared with feature-level fusion.

2) We introduce an assisted SR branch into multimodal object detection for the first time. Our approach not only makes a breakthrough in limited detection performance but also paves a more flexible way to study outstanding HR feature representations that are capable of discriminating small objects from vast backgrounds with low-resolution (LR) input.

3) Considering the demand for high-quality results and low computation cost, the SR module functioning as an auxiliary task is removed during the inference stage without introducing additional computation. The SR branch is general and extensible and can be inserted into existing fully convolutional network (FCN) frameworks.

4) The proposed SuperYOLO markedly improves the performance of object detection, outperforming SOTA detectors in real-time multimodal object detection. Our proposed model shows a favorable accuracy–speed tradeoff compared to the state-of-the-art models.

II. RELATED WORK

A. Object Detection With Multimodal Data

Recently, multimodal data have been widely leveraged in numerous practical application scenarios, including visual question answering [20], auto-pilot vehicles [21], saliency detection [22], and remote sensing classification [23]. It is found that combining the internal information of multimodal data can efficiently transfer complementary features and avoid certain information of a single modality from being omitted. In the field of RSI processing, there exist various modalities (e.g., RGB, synthetic aperture radar (SAR), Light Detection and Ranging (LiDAR), IR, panchromatic (PAN), and multispectral (MS) images) from diverse sensors, which can be fused with complementary characteristics to enhance the performance of various tasks [24], [25], [26]. For example, the additional IR modality [27] captures longer thermal wavelengths to improve detection under difficult weather conditions.
Fig. 2. Overview of the proposed SuperYOLO framework. Our new contributions include: 1) removal of the Focus module to reserve HR; 2) MF; and 3) assisted SR branch. The architecture is optimized in terms of mean square error (mse) loss for the SR branch and task-specific loss for object detection. During the training stage, the SR branch guides the related learning of the spatial dimension to enhance the HR information preservation for the backbone. During the test stage, the SR branch is removed to accelerate the inference speed to be equal to the baseline.

Manish et al. [27] proposed a real-time framework for object detection in multimodal remote sensing imaging, in which the extended version conducted mid-level fusion and merged data from multiple modalities. Although multisensor fusion can enhance the detection performance, as shown in Fig. 1, its limited detection accuracy and computing speed can hardly meet the requirements of real-time detection tasks.

The fusion methods are primarily grouped into three strategies, i.e., pixel-level fusion, feature-level fusion, and decision-level fusion [28]. The decision-level fusion methods fuse the detection results during the last stage, which may consume enormous computation resources due to repeated calculations for the different multimodal branches. In the field of remote sensing, feature-level fusion methods with multiple branches are mainly adopted. The multimodal images are input into parallel branches to extract the respective independent features of each modality, and these features are then combined by operations such as an attention module or simple concatenation. The parallel branches bring repeated computation as the number of modalities increases, which is not friendly to real-time tasks in remote sensing.

In contrast, the adoption of pixel-level fusion methods can reduce unnecessary computation. In this article, our proposed SuperYOLO fuses the modalities at the pixel level to significantly reduce the computation cost and designs operations in the spatial and channel domains to extract inner information from the different modalities, which helps enhance detection accuracy.

B. Super Resolution in Object Detection

In the recent literature, the performance of small object detection can be improved by multiscale feature learning [29], [30] and context-based detection [31]. These methods enhance the information representation ability of the network at different scales but ignore the preservation of HR contextual information. Conducted as a preprocessing step, SR has proven to be effective and efficient in various object detection tasks [32], [33]. Shermeyer and Van Etten [34] quantified its effect on the detection performance of satellite imaging at multiple resolutions of RSI. Based on generative adversarial networks (GANs), Courtrai et al. [35] utilized SR to generate HR images that were fed into the detector to improve its detection performance. Rabbi et al. [36] leveraged a Laplacian operator to extract edges from the input image to enhance the capability of reconstructing HR images, thus improving performance in object localization and classification. Ji et al. [37] introduced a cycle-consistent GAN structure as an SR network and modified the faster R-CNN architecture to detect vehicles from the enhanced images produced by the SR network. In these works, the adoption of the SR structure has effectively addressed the challenges regarding small objects. However, compared with single detection models, additional computation is introduced, which is attributed to the enlarged scale of the input image by the HR design.

Recently, Wang et al. [38] proposed an SR module that can maintain HR representations with LR input while reducing the model computation in segmentation tasks. Inspired by Wang et al. [38], we design an SR assisted branch.
Fig. 3. Backbone structure of YOLOv5s. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.

In contrast to the aforementioned work, in which SR is applied at the start stage, our assisted SR module guides the learning of high-quality HR representations for the detector, which not only strengthens the response of small dense objects but also improves the performance of object detection in the spatial space. Moreover, the SR module is removed in the inference stage to avoid extra computation.

III. BASELINE ARCHITECTURE

As shown in Fig. 2, the baseline YOLOv5 network consists of two main components: the backbone and the head (including the neck). The backbone is designed to extract low-level texture and high-level semantic features. Next, these hint features are fed to the head to construct the enhanced feature pyramid network, which transfers robust semantic features from top to bottom and propagates a strong response of local texture and pattern features from bottom to top. This resolves the varying-scale issue of the objects by enhancing detection at diverse scales.

In Fig. 3, CSPNet [39] is utilized as the backbone to extract the feature information, consisting of numerous simple Convolution-Batch-normalization-SiLU (CBS) components and cross stage partial (CSP) modules. The CBS is composed of a convolution, batch normalization, and the SiLU activation function [40]. The CSP duplicates the feature map of the previous layer into two branches and then halves the channel numbers through 1 × 1 convolution, by which the computation is therefore reduced. With respect to the two copies of the feature map, one is connected to the end of the stage, and the other is sent into ResNet blocks or CBS blocks as the input. Finally, the two copies of the feature map are concatenated to combine the features, which is followed by a CBS block. The spatial pyramid pooling (SPP) module [41] is composed of parallel maxpool layers with different kernel sizes and is utilized to extract multiscale deep features. The low-level texture and high-level semantic features are extracted by stacked CSP, CBS, and SPP structures.

Limitation 1: It is worth mentioning that the Focus module is introduced to decrease the number of computations. As shown in Fig. 2 (bottom left), inputs are partitioned into individual pixels, reconstructed at intervals, and, finally, concatenated in the channel dimension. The inputs are thereby resized to a smaller scale to reduce the computation cost and accelerate network training and inference. However, this may sacrifice object detection accuracy to a certain extent, especially for small objects vulnerable to resolution.

Limitation 2: It is known that the backbone of YOLO employs deep convolutional neural networks to extract hierarchical features with a stride step of 2, through which the size of the extracted features is halved. Hence, the feature size retained for multiscale detection is far smaller than that of the original input image. For example, when the input image size is 608, the sizes of the output features for the detection layers are 76, 38, and 19, respectively. LR features may result in the missing of some small objects.

IV. SUPERYOLO ARCHITECTURE

As summarized in Fig. 2, we introduce three new contributions to our SuperYOLO network architecture. First, we remove the Focus module in the backbone and replace it with an MF module to avoid resolution degradation and, thus, accuracy degradation. Second, we explore different fusion methods and choose the computation-efficient pixel-level fusion to fuse the RGB and IR modalities and refine dissimilar and complementary information. Finally, we add an assisted SR module in the training stage, which reconstructs the HR images to guide the related backbone learning in the spatial dimension and, thus, maintain HR information. In the inference stage, the SR branch is discarded to avoid introducing additional computation overhead.
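To make this training/inference asymmetry concrete, the following minimal PyTorch sketch (not the authors' released code; `Backbone`, `DetectHead`, and `SRHead` are simplified hypothetical stand-ins, and all channel widths are illustrative) shows how an auxiliary SR head can be attached during training and skipped entirely at inference:

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in for the YOLOv5s backbone (channel widths are illustrative)."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, width, 3, 2, 1), nn.BatchNorm2d(width), nn.SiLU())
        self.stage = nn.Sequential(nn.Conv2d(width, 2 * width, 3, 2, 1), nn.BatchNorm2d(2 * width), nn.SiLU())

    def forward(self, x):
        low = self.stem(x)      # low-level texture features
        high = self.stage(low)  # high-level semantic features
        return low, high

class DetectHead(nn.Module):
    """Stand-in for the YOLOv5 head; predicts per-cell outputs (channel count illustrative)."""
    def __init__(self, ch=64, out_ch=13):
        super().__init__()
        self.pred = nn.Conv2d(ch, out_ch, 1)

    def forward(self, high):
        return self.pred(high)

class SRHead(nn.Module):
    """Auxiliary branch reconstructing an HR image from backbone features."""
    def __init__(self, ch=64, out_ch=3, scale=8):
        super().__init__()
        self.up = nn.ConvTranspose2d(ch, out_ch, kernel_size=scale, stride=scale)

    def forward(self, high):
        return self.up(high)

class SuperYOLOSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone, self.head, self.sr = Backbone(), DetectHead(), SRHead()

    def forward(self, x):
        low, high = self.backbone(x)
        det = self.head(high)
        if self.training:               # SR branch participates only in training
            return det, self.sr(high)
        return det                      # inference: SR branch skipped, no extra cost

model = SuperYOLOSketch()
x = torch.randn(1, 3, 512, 512)
model.train(); det, sr = model(x)       # sr is 2x the 512x512 input (1024x1024)
model.eval();  det = model(x)
```

Because the SR head is only reached when `self.training` is true, removing it at test time changes neither the detector's structure nor its inference GFLOPs.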
Fig. 4. Architecture of the MF module at the pixel level.

A. Focus Removal

As presented in Section III and Fig. 2 (bottom left), the Focus module in the YOLOv5 backbone partitions images at intervals in the spatial domain and then reorganizes the new image to resize the input. Specifically, this operation collects a value for every group of pixels in an image and then reconstructs it to obtain smaller complementary images. The size of the rebuilt images decreases as the number of channels increases. As a result, it causes resolution degradation and spatial information loss for small targets. Considering that the detection of small targets depends more heavily on higher resolution, the Focus module is abandoned and replaced by an MF module (as shown in Fig. 4) to prevent the resolution from being degraded.

B. Multimodal Fusion

The more information is utilized to distinguish objects, the better the performance that can be achieved in object detection. MF is an effective path for merging different information from various sensors. The decision-, feature-, and pixel-level fusions are the three mainstream fusion methods that can be deployed at different depths of the network. Since decision-level fusion requires enormous computation, it is not considered in SuperYOLO.

We propose a pixel-level MF to extract the shared and specific information from the different modalities. The MF can combine multimodal inner information bidirectionally in a symmetric and compact manner. As shown in Fig. 4, for the pixel-level fusion, we first normalize an input RGB image and an input IR image into the interval [0, 1]. The input modalities X_RGB, X_IR ∈ R^{C×H×W} are subsampled to I_RGB, I_IR ∈ R^{C×(H/n)×(W/n)}, which are fed to SE blocks that extract inner information in the channel domain [42] to generate F_RGB and F_IR

    F_RGB = SE(I_RGB),  F_IR = SE(I_IR).    (1)

Then, the attention map that reveals the inner relationship of the different modalities in the spatial domain is defined as

    m_IR = f_1(F_IR),  m_RGB = f_2(F_RGB)    (2)

where f_1 and f_2 represent 1 × 1 convolutions for the RGB and IR modalities, respectively. Here, ⊗ denotes element-wise matrix multiplication. Inner spatial information between the different modalities is produced by

    F_in1 = m_RGB ⊗ F_RGB,  F_in2 = m_IR ⊗ F_IR.    (3)

To incorporate internal inner-view information and spatial texture information, the features are added to the original input modalities and then fed into 1 × 1 convolutions. The full features are

    F_ful1 = f_3(F_in1 + I_RGB),  F_ful2 = f_4(F_in2 + I_IR)    (4)

where f_3 and f_4 represent 1 × 1 convolutions. Finally, the features are fused by

    F_o = SE(Concat(F_ful1, F_ful2))    (5)

where Concat(·) denotes the concatenation operation along the channel axis. The result is then fed to the backbone to produce multilevel features. Note that X is subsampled to 1/n the size of the original image to accomplish the SR module discussed in Section IV-C and to accelerate the training process. Here, X represents the RGB or IR modality, and the sampled image is denoted as I ∈ R^{C×(H/n)×(W/n)} and generated by

    I = D(X)    (6)

where D(·) represents the n-times downsampling operation using bilinear interpolation.
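For concreteness, a minimal PyTorch sketch of the pixel-level MF module in (1)–(6) follows (this is not the authors' released implementation; the SE block internals, the channel widths, and the downsampling factor n are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SE(nn.Module):
    """Squeeze-and-excitation block (channel attention) [42]; reduction ratio assumed."""
    def __init__(self, ch, r=2):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(), nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # squeeze over H, W
        return x * w.unsqueeze(-1).unsqueeze(-1)  # re-weight channels

class PixelLevelMF(nn.Module):
    def __init__(self, ch=3, n=2):
        super().__init__()
        self.n = n                                 # downsampling factor of D(.) in Eq. (6)
        self.se_rgb, self.se_ir = SE(ch), SE(ch)
        self.f1, self.f2 = nn.Conv2d(ch, ch, 1), nn.Conv2d(ch, ch, 1)   # attention maps, Eq. (2)
        self.f3, self.f4 = nn.Conv2d(ch, ch, 1), nn.Conv2d(ch, ch, 1)   # full features, Eq. (4)
        self.se_out = SE(2 * ch)

    def forward(self, x_rgb, x_ir):
        # Eq. (6): bilinear downsampling D(.)
        i_rgb = F.interpolate(x_rgb, scale_factor=1 / self.n, mode="bilinear", align_corners=False)
        i_ir = F.interpolate(x_ir, scale_factor=1 / self.n, mode="bilinear", align_corners=False)
        f_rgb, f_ir = self.se_rgb(i_rgb), self.se_ir(i_ir)               # Eq. (1)
        m_ir, m_rgb = self.f1(f_ir), self.f2(f_rgb)                      # Eq. (2)
        f_in1, f_in2 = m_rgb * f_rgb, m_ir * f_ir                        # Eq. (3)
        f_ful1, f_ful2 = self.f3(f_in1 + i_rgb), self.f4(f_in2 + i_ir)   # Eq. (4)
        return self.se_out(torch.cat([f_ful1, f_ful2], dim=1))           # Eq. (5)

mf = PixelLevelMF(ch=3, n=2)
fused = mf(torch.randn(1, 3, 512, 512), torch.randn(1, 3, 512, 512))     # -> (1, 6, 256, 256)
```

The fused tensor then replaces the Focus output as the backbone input, so the fusion cost stays at the pixel level rather than being repeated in parallel per-modality branches.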
Fig. 5. SR structure of SuperYOLO. The SR structure can be regarded as a simple encoder–decoder model. The low- and high-level features of the backbone are selected to fuse local texture patterns and semantic information, respectively.

C. Super Resolution

As mentioned in Section III, the feature size retained for multiscale detection in the backbone is far smaller than that of the original input image. Most of the existing methods conduct upsampling operations to recover the feature size. Unfortunately, this approach has produced limited success due to the information loss in texture and pattern, which explains why it is inappropriate to employ this operation to detect small targets that require HR preservation in RSI.

To address this issue, as shown in Fig. 2, we introduce an auxiliary SR branch. First, the introduced branch shall facilitate the extraction of HR information in the backbone and achieve satisfactory performance. Second, the branch should not add computation that would reduce the inference speed; it shall realize a tradeoff between accuracy and computation time during the inference stage. Inspired by the study of Wang et al. [38], where the proposed SR succeeded in facilitating segmentation tasks without additional requirements, we introduce a simple and effective branch, named SR, into the framework. Our proposal can improve detection accuracy without computation and memory overload, especially under circumstances of LR input.

Specifically, the SR structure can be regarded as a simple encoder–decoder model. We select the backbone's low- and high-level features to fuse local textures and patterns, and semantic information, respectively. As depicted in Fig. 4, we select the results of the fourth and ninth modules as the low- and high-level features, respectively. The encoder integrates the low-level feature and the high-level feature generated in the backbone. As illustrated in Fig. 5, in the encoder, the first CR module is conducted on the low-level feature. For the high-level feature, we use an upsampling operation to match the spatial size of the low-level feature, and then we use a concatenation operation and two CR modules to merge the low- and high-level features. The CR module includes a convolution and a ReLU. For the decoder, the LR feature is upscaled to the HR space, in which the SR module's output size is twice as large as that of the input image. As illustrated in Fig. 5, the decoder is implemented using three deconvolutional layers. The SR branch guides the related learning of the spatial dimension and transfers it to the main branch, thereby improving the performance of object detection. In addition, we introduce EDSR [43] as an alternative encoder structure to explore the SR performance and its influence on detection performance.

To present a more visually interpretable description, we visualize the features of the backbones of YOLOv5s, YOLOv5x, and SuperYOLO in Fig. 6. The features are upsampled to the same scale as the input image for comparison. By comparing the pairwise images of (c), (f), and (i); (d), (g), and (j); and (e), (h), and (k) in Fig. 6, it can be observed that SuperYOLO contains clearer object structures with higher resolution with the assistance of the SR. Eventually, we obtain a high-quality HR representation with the SR branch and utilize the head of YOLOv5 to detect small objects.
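A minimal PyTorch sketch of this encoder–decoder is given below (an illustration rather than the released implementation; the feature strides and channel widths are assumptions, and the EDSR encoder variant is omitted). CR denotes convolution + ReLU as in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CR(nn.Module):
    """Convolution followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, 1, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class AssistedSRBranch(nn.Module):
    def __init__(self, low_ch=64, high_ch=256, mid_ch=64, out_ch=3):
        super().__init__()
        self.cr_low = CR(low_ch, mid_ch)                 # encoder: CR on the low-level feature
        self.cr_fuse = nn.Sequential(CR(mid_ch + high_ch, mid_ch), CR(mid_ch, mid_ch))
        # decoder: three deconvolutions upscale the fused LR feature to 2x the input image size
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_ch, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, low, high):
        # upsample the high-level feature to the spatial size of the low-level feature
        high_up = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.cr_fuse(torch.cat([self.cr_low(low), high_up], dim=1))
        return self.decoder(fused)                       # SR image used only for the training loss

# Example with assumed strides: low-level feature at 1/4 and high-level at 1/16 of a 512x512 input.
sr = AssistedSRBranch()
low, high = torch.randn(1, 64, 128, 128), torch.randn(1, 256, 32, 32)
print(sr(low, high).shape)                               # torch.Size([1, 3, 1024, 1024]) = 2x the input
```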
Fig. 6. Feature-level visualization of the backbone for YOLOv5s, YOLOv5x, and SuperYOLO with the same input: (a) RGB input, (b) IR input, (c)–(e) features of YOLOv5s, (f)–(h) features of YOLOv5x, and (i)–(k) features of SuperYOLO. The features are upsampled to the same scale as the input image for comparison. (c), (f), and (i) Features in the first layer. (d), (g), and (j) Low-level features. (e), (h), and (k) High-level features in layers at the same depth.

D. Loss Function

The overall loss of our network consists of two components: the detection loss L_o and the SR construction loss L_s, which can be expressed as

    L_total = c_1 L_o + c_2 L_s    (7)

where c_1 and c_2 are the coefficients that balance the two training tasks. The L1 loss (rather than the L2 loss) [44] is used to calculate the SR construction loss L_s between the input image X and the SR result S, whose expression is written as

    L_s = ||S − X||_1.    (8)

The detection loss involves three components [19]: the loss L_obj of judging whether there is an object, the loss L_loc of object location, and the loss L_cls of object classification, which are used to evaluate the loss of the prediction as

    L_o = λ_loc Σ_{l=0}^{2} a_l L_loc + λ_obj Σ_{l=0}^{2} b_l L_obj + λ_cls Σ_{l=0}^{2} c_l L_cls    (9)

where l represents the layer of the output in the head; a_l, b_l, and c_l are the weights of the different layers for the three loss functions; and the weights λ_loc, λ_obj, and λ_cls regulate the error emphasis among box coordinates, box dimensions, objectness, no-objectness, and classification.

V. EXPERIMENTAL RESULTS

A. Dataset

The popular Vehicle Detection in Aerial Imagery (VEDAI) dataset [45] is used in the experiments, which contains cropped images obtained from the much larger Utah Automated Geographic Reference Center (AGRC) dataset. Each image collected from the same altitude in AGRC has approximately 16 000 × 16 000 pixels, with a resolution of about 12.5 cm × 12.5 cm per pixel. RGB and IR are the two modalities for each image of the same scene. The VEDAI dataset consists of 1246 smaller images that cover diverse backgrounds involving grass, highways, mountains, and urban areas. All images have a size of 1024 × 1024 or 512 × 512. The task is to detect 11 classes of different vehicles, such as car, pickup, camping, and truck.

B. Implementation Details

Our proposed framework is implemented in PyTorch and runs on a workstation with an NVIDIA 3090 GPU. The VEDAI dataset is used to train our SuperYOLO. Following [27], the VEDAI dataset is devised for tenfold cross-validation. In each split, 1089 images are used for training, and another 121 images are used for testing. The ablation experiments are conducted on the first fold of data, while the comparisons with previous methods are performed on the ten folds by averaging their results. The annotations for each object in the image contain the coordinates of the bounding box center, the orientation of the object with respect to the positive x-axis, the four corners of the bounding box, the class ID, a binary flag identifying whether an object is occluded, and another binary flag identifying whether an object is cropped. We do not consider classes with fewer than 50 instances in the dataset, such as plane, motorcycle, and bus. Thus, the annotations of the VEDAI dataset are converted to YOLOv5 format, and we map the IDs of the classes of interest to 0, 1, . . . , 7, i.e., N = 8. Then, the center coordinates of the bounding box are normalized, and the absolute coordinates are transformed into relative coordinates. Similarly, the length and width of the bounding box are normalized to [0, 1]. To realize the SR assisted branch, the input images of the network are downsampled from 1024 × 1024 to 512 × 512 during the training process. In the test process, the image size is 512 × 512, which is consistent with the input of the other compared algorithms. In addition, data are augmented with hue saturation value (HSV), multiscale, translation, left-right flip, and mosaic. The augmentation strategy is canceled in the test stage. The standard stochastic gradient descent (SGD) [46] is used to train the network with a momentum of 0.937, a weight decay of 0.0005 with Nesterov accelerated gradients, and a batch size of 2. The learning rate is set to 0.01 initially. The entire training process involves 300 epochs.
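Putting the pieces together, a single training step under the composite loss of (7) and the optimizer settings above can be sketched as follows (hypothetical placeholders: `ToyModel` stands in for SuperYOLO, `yolo_detection_loss` for the detection loss of (9), and the balancing coefficients are assumed):

```python
import torch
import torch.nn as nn
from torch.optim import SGD

class ToyModel(nn.Module):
    """Tiny stand-in returning (detections, SR image), mirroring the training-time outputs."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, 2, 1)
        self.det_head = nn.Conv2d(8, 13, 1)                   # detection predictions
        self.sr_head = nn.ConvTranspose2d(8, 3, 4, stride=4)  # auxiliary SR output (training only)

    def forward(self, x):
        f = self.backbone(x)
        return self.det_head(f), self.sr_head(f)

def yolo_detection_loss(det, targets):
    """Placeholder for the weighted objectness/location/classification loss of Eq. (9)."""
    return det.abs().mean()

l1 = nn.L1Loss()                              # Eq. (8): SR reconstruction uses the L1 loss
model = ToyModel()
opt = SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=5e-4, nesterov=True)
c1, c2 = 1.0, 1.0                             # balancing coefficients of Eq. (7) (values assumed)

lr_img = torch.rand(2, 3, 512, 512)           # LR network input (512 x 512), batch size 2
hr_img = torch.rand(2, 3, 1024, 1024)         # HR target for the SR branch (1024 x 1024)
targets = None                                # ground-truth boxes would go here

det, sr = model(lr_img)
loss = c1 * yolo_detection_loss(det, targets) + c2 * l1(sr, hr_img)
opt.zero_grad(); loss.backward(); opt.step()
```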
TABLE I
Comparison results of model size and inference ability for different baseline YOLO frameworks on the first fold of the VEDAI validation set.

TABLE II
Influence of removing the Focus module in the network on the first fold of the VEDAI validation set.

TABLE III
Comparison results of pixel- and feature-level fusions in YOLOv5s (noFocus) for the multimodal dataset on the first fold of the VEDAI validation set.

C. Accuracy Metrics

The accuracy assessment measures the agreements and differences between the detection result and the reference mask. The recall, precision, and mean average precision (mAP) are used as accuracy metrics to evaluate the performance of the compared methods. The precision and recall metrics are defined as

    Precision = TP / (TP + FP)    (10)
    Recall = TP / (TP + FN)    (11)

where the true positives (TP) and true negatives (TN) denote correct predictions, and the false positives (FP) and false negatives (FN) denote incorrect outcomes. The precision and recall are correlated with the commission and omission errors, respectively. The mAP is a comprehensive indicator obtained by averaging the AP values; it uses an integral method to calculate the area enclosed by the precision–recall curve and the coordinate axis over all categories. Hence, the mAP can be calculated by

    mAP = (Σ AP) / N = (Σ ∫_0^1 p(r) dr) / N    (12)

where p denotes the precision, r denotes the recall, and N is the number of categories.

The number of giga floating-point operations (GFLOPs) and the parameter size are used to measure the model complexity and computation cost. In addition, PSNR and SSIM are used for image quality evaluation of the SR branch. Generally, higher PSNR and SSIM values represent a better quality of the generated image.
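A small Python sketch of these metrics is shown below (illustrative only; a full evaluation additionally requires IoU-based matching of predictions to ground truth at the chosen threshold, which is omitted here):

```python
import numpy as np

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # Eq. (10)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # Eq. (11)
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the precision-recall curve, integrated over recall in [0, 1]."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]           # monotonically decreasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_pr):
    """Eq. (12): average the per-class AP values over the N categories."""
    aps = [average_precision(r, p) for r, p in per_class_pr]
    return sum(aps) / len(aps)

# Example with two categories (recall/precision arrays are illustrative).
per_class = [(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.8, 0.6])),
             (np.array([0.3, 0.6]), np.array([0.7, 0.5]))]
print(mean_average_precision(per_class))
```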
D. Ablation Study

First, we verify the effectiveness of our proposed method by designing a series of ablation experiments that are conducted on the first fold of the validation set.

1) Validation of the Baseline Framework: In Table I, the model size and inference ability of different base frameworks are evaluated in terms of the number of layers, the parameter size, and GFLOPs. The detection performance of those models is measured by mAP50, i.e., the detection metric of mAP at an intersection over union (IOU) of 0.5. Although YOLOv4 achieves the best detection performance, it has 169 more layers than YOLOv5s (393 versus 224), its parameter size (params) is 7.4 times larger than that of YOLOv5s (52.5M versus 7.1M), and its GFLOPs is 7.2 times higher than that of YOLOv5s (38.2 versus 5.3). With respect to YOLOv5s, although its mAP is slightly lower than those of YOLOv4 and YOLOv5m, its number of layers, parameter size, and GFLOPs are much smaller than those of the other models. Therefore, it is easier to deploy YOLOv5s on board to achieve real-time performance in practical applications. The above facts verify the rationality of YOLOv5s as the baseline detection framework.

2) Impact of Removing the Focus Module: As presented in Section IV-A, the Focus module reduces the resolution of input images, which imposes an encumbrance on the detection performance of small objects in RSI. To investigate the influence of the Focus module, we conduct experiments on the four YOLOv5 network frameworks: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Note that the results here are collected after the concatenation pixel-level fusion of the RGB and IR modalities. As listed in Table II, after removing the Focus module, we observe a noticeable improvement in the detection performance of YOLOv5s (62.2%→69.5% in mAP50), YOLOv5m (64.5%→72.2%), YOLOv5l (63.7%→72.5%), and YOLOv5x (64.0%→69.2%). This is because, by removing the Focus module, not only can the resolution degradation be avoided, but the spatial interval information can also be retained for small objects in RSI, thereby reducing the missing errors of object detection. Generally, removing the Focus module brings more than a 5% improvement in the detection performance (mAP50) across the frameworks.

Meanwhile, we notice that the above removal increases the inference computation cost (GFLOPs) in YOLOv5s (5.3→20.4), YOLOv5m (16.1→63.6), YOLOv5l (36.7→145), and YOLOv5x (69.7→276.6). However, the GFLOPs of YOLOv5s-noFocus (20.4) are smaller than those of YOLOv3 (52.8), YOLOv4 (38.2), and YOLOrs (46.4), as shown in Table I. The parameters of these models are slightly reduced after removing the Focus module. In summary, in order to retain the resolution to better detect smaller objects, priority shall be given to the detection accuracy, for which the convolution operation is adopted to replace the Focus module.
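For reference, the interval slicing performed by the Focus module discussed above can be sketched as follows (a simplified sketch; in YOLOv5, the concatenated tensor is additionally passed through a convolution). It makes explicit how spatial resolution is traded for channels, which is what the removal avoids:

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Space-to-depth slicing of the Focus module: (B, C, H, W) -> (B, 4C, H/2, W/2)."""
    return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                      x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

x = torch.randn(1, 3, 512, 512)
print(focus_slice(x).shape)    # torch.Size([1, 12, 256, 256]) -- half the spatial resolution
```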
TABLE IV
Influence of different resolutions of the input image on network performance on the first fold of the VEDAI validation set.

Fig. 7. Feature-level fusion of different blocks in the latent layers. Fusion-n represents the concatenation fusion operation performed in the nth block. (a) and (b) Feature-level fusion. (c) Multistage feature-level fusion.

3) Comparison of Different Fusion Methods: To evaluate the influence of the devised fusion methods, we compare five fusion results on YOLOv5s-noFocus, as presented in Section IV-B. As shown in Fig. 7, fusion1, fusion2, fusion3, and fusion4 represent the concatenation fusion operation performed in the first, second, third, and fourth blocks, respectively. The IR image is expanded to three bands in feature-level fusion to obtain features that have equal channels for the two modes. The final results are listed in Table III. The parameter size, GFLOPs, and mAP50 of pixel-level fusion with the concatenation operation are 7.0705M, 20.37, and 69.5%, and those of the pixel-level fusion with the MF module are 7.0897M, 21.67, and 70.3%, which is the best among all the compared methods. There are some reasons why the model parameters of the feature-level fusions are close to those of the pixel-level fusion. First, the feature-level fusion is completed in the latent layers rather than in two entirely separate models. Second, the modules before the concatenation fusion are different, so the different fusion channels lead to different parameter counts. However, it can be observed that the calculation cost increases as the fusion layer becomes deeper.

In addition, we compare the multistage feature-level fusion [as shown in Fig. 7(c)] with the proposed pixel-level fusion. As shown in Table III, the accuracy of multistage feature-level fusion is only 59.3% mAP50, which is lower than that of pixel-level fusion, while its computation cost is 34.56 GFLOPs with 7.7545M parameters, which is higher than that of pixel-level fusion. These findings suggest that the pixel-level fusion methods are more effective than multistage shallow feature-level fusion because the multiple stages of fusion can lead to the accumulation of redundant information. The above results suggest that pixel-level fusion can accurately detect objects while reducing the computation. Our proposed MF fusion can further improve detection accuracy at some computation cost. Overall, the proposed method only uses pixel-level fusion to keep the computation cost low.

4) Impact of High Resolution: We compare different training and test modes to explore more possibilities in terms of the input image resolution in Table IV. First, we compare cases where the image resolutions of the training set and the test set are the same. Comparing the results of YOLOv5s, the detection metric mAP50 is improved from 62.2% to 77.7%, a 15.5% increase, when the image size is doubled from 512 to 1024. Similarly, YOLOv5s-noFocus (1024) outperforms YOLOv5s-noFocus (512) by a 9.8% mAP50 score (79.3% versus 69.5%). The mean recall and mean precision increase simultaneously, suggesting that preserving resolution reduces the commission and omission errors in object detection. Based on the above analysis, we argue that the characteristics of HR significantly influence the final performance of object detection. However, it is noteworthy that maintaining an HR input image for the network introduces a certain amount of calculation. The GFLOPs with a size of 1024 (HR) are higher than those with 512 (low resolution) in both YOLOv5s (21.3 versus 5.3) and YOLOv5s-noFocus (81.5 versus 20.4).

As shown in Table IV, the use of different image sizes during the training process (train size) and the test process (test size) results in a reduction of the mAP50 score, i.e., (10.6% versus 62.2%), (48.2% versus 77.7%), (13.4% versus 69.5%), and (62.9% versus 79.3%). This may be attributed to the inconsistent scale of objects between the test process and the training process, where the size of the predicted bounding box is no longer suitable for the objects in the test images.

Finally, the mAP50 of YOLOv5s-noFocus + SR is close to that of YOLOv5s-noFocus HR (1024) (78.0% versus 79.3%), and the GFLOPs are equal to those of YOLOv5s-noFocus LR (512) (20.4 versus 20.4). Our proposed network decreases the resolution of the input images in the test process to reduce computation and maintains accuracy by keeping the resolution of the training and testing data identical, thereby highlighting the advantage of the proposed SR branch.

5) Impact of the Super Resolution Branch: Several ablation experiments on the SR branch are summarized in Table V.
TABLE V
Ablation experiment results about the influence of the SR branch on detection performance on the first fold of the VEDAI validation set.

TABLE VI
Effective validation of the SR branch for the different baselines on the first fold of the VEDAI validation set.

Compared with the upsampling operation, the YOLOv5s (noFocus) network with the added SR branch shows favorable performance and achieves an mAP50 1.8% better than that of the upsampling operation. The SR network is a learnable upsampling method with a stronger reconstruction ability that can help the feature extraction in the backbone for detection. We deleted the PANet structure and two detectors, which are responsible for enhancing middle- and large-scale target detection, because the objects in RSI datasets such as VEDAI are on a small scale and can be detected with the small-scale detector. When we only use one detector, the number of parameters (7.0705M versus 4.8259M) and GFLOPs (20.37 versus 16.68) can be decreased, and the detection accuracy can be increased (78.0% versus 79.0%). When we utilize the EDSR network (rather than three ordinary deconvolutions) as the decoder and the L1 loss (rather than the L2 loss) as the SR loss function in the SR branch, both of which are powerful in the SR task, not only is the performance of SR improved, but the performance of the detection network is enhanced at the same time, because the SR branch helps the detection network extract more effective and superior features in the backbone, accelerating the convergence of the detection network and, thus, improving its performance. The performance of SR and object detection is complementary and cooperative.

Table VI shows the favorable accuracy–complexity tradeoff of the SR branch. For the different baselines, the influence of the SR branch on object detection is positive. Compared with the bare baselines, the baselines with the added SR branch show favorable performance: YOLOv3 + SR performs 9.2% better in mAP50 than YOLOv3, YOLOv4 + SR is 3.3% better in mAP50 than YOLOv4, and YOLOv5s + SR performs 2.2% better in mAP50 than YOLOv5s. Notably, SR can be removed in the inference stage. Hence, no extra parameters or computation costs are introduced, which is impressive considering that the SR branch does not require a lot of manpower to refine the design of the detection network. The SR branch is general and extensible and can be utilized in the existing FCN framework.

E. Comparisons With Previous Methods

The visual detection results of the compared YOLO methods and SuperYOLO are shown in Fig. 8 for a diverse set of scenes. It can be observed that SuperYOLO can accurately detect those objects that are not detected, or are predicted into a wrong category or with uncertainty, by YOLOv4, YOLOv5s, and YOLOv5m. The objects in RSIs are challenging to detect at small scales. In particular, Pickup and Car, or Van and Boat, are easily confused in the detection process due to their similarities. Hence, improving the detection classification is a necessity in object detection tasks beyond location detection, which can be accomplished by the proposed SuperYOLO with better performance.

Table VII summarizes the performance of YOLOv3 [47], YOLOv4 [48], YOLOv5s-x [19], YOLOrs [27], YOLO-Fine [49], YOLOFusion [50], and our proposed SuperYOLO. Note that the AP scores of the multimodal modes are significantly higher than those of the unimodal (RGB or IR) modes for most classes. The overall mAP50 of the multimodal (multi) modes outperforms those of the RGB or IR modes. These results confirm that MF is an effective and efficient strategy for object detection based on information complementation between multimodal inputs. However, it should be noted that the slight increase in parameters and GFLOPs with MF reflects the necessity of choosing pixel-level fusion rather than feature-level fusion.

It is obvious that SuperYOLO achieves a higher mAP50 than the other frameworks except for YOLOFusion. The results of YOLOFusion are slightly better than SuperYOLO, as YOLOFusion uses pretrained weights trained on MS COCO [7]. However, its parameter count is approximately three times that of SuperYOLO. The performance of YOLO-Fine is good on a single modality, but it lacks the development of multimodality fusion techniques. In particular, SuperYOLO outperforms YOLOv5x by a 12.44% mAP50 score in the multimodal mode. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18× and 3.8× less than those of YOLOv5x.

In addition, it can be noticed that superior performance is achieved for the classes of Car, Pickup, Tractor, and Camping, which have the most training instances.
Fig. 8. Visual results of object detection using different methods involving YOLOv4, YOLOv5s, YOLOv5m, and the proposed SuperYOLO. The red circles represent the false alarms, the yellow ones denote the FP detection results, and the blue ones are FN detection results. (a)–(e) Different images in the VEDAI dataset.

YOLOv5s performs best on GFLOPs, which relies on the Focus module to slim the input image, but this results in poor detection performance, especially for small objects. SuperYOLO performs 18.30% better in mAP50 than YOLOv5s. Our proposed SuperYOLO shows a favorable speed–accuracy tradeoff compared to the state-of-the-art models.

F. Generalization to Single-Modal Remote Sensing Images

At present, although there are massive multimodal images in remote sensing, labeled datasets for object detection tasks are lacking due to the expensive cost of manual annotation. To validate the generalization of our proposed network, we compare SuperYOLO with different one- or two-stage methods using data from a single modality, including the large-scale Dataset for Object Detection in Aerial images (DOTA), the object DetectIon in Optical Remote sensing images (DIOR) dataset, and the Northwestern Polytechnical University Very-High-Resolution 10-class (NWPU VHR-10) dataset.

1) DOTA: The DOTA dataset was proposed in 2018 for object detection in remote sensing. It contains 2806 large images and 188 282 instances, which are divided into 15 categories. The size of each original image is 4000 × 4000, and the images are cropped into 1024 × 1024 pixels with an overlap of 200 pixels in the experiment. We select half of the original images as the training set, 1/6 as the validation set, and 1/3 as the testing set. The size of the image is fixed to 512 × 512.
TABLE VII
Classwise average precision (AP), mean average precision (mAP50), parameters, and GFLOPs for the proposed SuperYOLO, YOLOv3, YOLOv4, YOLOv5s-x, YOLOrs, YOLO-Fine, and YOLOFusion, including unimodal and multimodal configurations on the VEDAI dataset. * represents using pretrained weights.

2) NWPU VHR-10: The NWPU VHR-10 dataset was proposed in 2016. It contains 800 images, of which 650 contain objects, so we use 520 images as the training set and 130 images as the testing set. The dataset contains ten categories, and the size of the image is fixed to 512 × 512.

3) DIOR: The DIOR dataset was proposed in 2020 for the task of object detection, and it involves 23 463 images and 192 472 instances. The size of each image is 800 × 800. We choose 11 725 images as the training set and 11 738 images as the testing set. The size of the image is fixed to 512 × 512.

The training strategy is modified to accommodate the new datasets. The entire training process involves 150 epochs for the NWPU and DIOR datasets and 100 epochs for DOTA. The batch size for DOTA and DIOR is 16, and for NWPU it is 8. To verify the superiority of the SuperYOLO proposed in this article, we selected 11 generic methods for comparison: one-stage algorithms (YOLOv3 [47], FCOS [53], ATSS [54], RetinaNet [51], and GFL [52]); a two-stage method (Faster R-CNN [5]); lightweight models (MobileNetV2 [55] and ShuffleNet [56]); a distillation-based method (ARSD [59]); and remote sensing designed approaches (FMSSD [58] and O2DNet [57]).

As presented in Table VIII, our SuperYOLO achieves the optimal detection results (69.99%, 93.30%, and 71.82% mAP50), and its model parameters (7.70M, 7.68M, and 7.70M) and GFLOPs (20.89, 20.86, and 20.93) are much smaller than those of the other SOTA detectors, regardless of whether they are two-stage, one-stage, lightweight, or distillation-based methods. The PANet structure and three detectors are kept to enhance small-, middle-, and large-scale target detection in consideration of the big objects, such as playgrounds, in these three datasets. Hence, the model parameters of SuperYOLO are larger than those in Table VII. We also compare two detectors designed for RSI, FMSSD [58] and O2DNet [57].
TABLE VIII
Performance of different algorithms on the DOTA, NWPU, and DIOR testing sets.

Although these models have performance close to that of our lightweight model, their much larger parameters and GFLOPs come at a massive cost in computation resources. Hence, our model achieves a better balance between detection efficiency and efficacy.

VI. CONCLUSION AND FUTURE WORK

In this article, we have presented SuperYOLO, a real-time lightweight network that is built on top of the widely used YOLOv5s to improve the detection performance of small objects in RSI. First, we have modified the baseline network by removing the Focus module to avoid resolution degradation, through which the baseline is significantly improved and the missing error of small objects is overcome. Second, we have investigated the fusion of multimodal data to improve the detection performance based on mutual information. Lastly and most importantly, we have introduced a simple and flexible SR branch that facilitates the backbone in constructing an HR representation feature, by which small objects can be easily recognized from vast backgrounds with merely LR input required. We remove the SR branch in the inference stage, accomplishing the detection without changing the original structure of the network and thus keeping the same GFLOPs. With the joint contributions of these ideas, the proposed SuperYOLO achieves 75.09% mAP50 with a lower computation cost on the VEDAI dataset, which is 18.30% higher than that of YOLOv5s and more than 12.44% higher than that of YOLOv5x.

The performance and inference ability of our proposal highlight the value of SR in remote sensing tasks, paving the way for future studies of multimodal object detection. Our future interests will focus on the design of a low-parameter model to extract HR features, thereby further satisfying real-time and high-accuracy motivations.

REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[2] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[4] P. Tang, X. Wang, X. Bai, and W. Liu, "Multiple instance detection network with online instance classifier refinement," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3059–3067.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[7] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Sep. 2014, pp. 740–755.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[9] Z. Zheng, Y. Zhong, J. Wang, and A. Ma, "Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4096–4105.
[10] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng, "R2-CNN: Fast tiny object detection in large-scale remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5512–5524, Aug. 2019.
[11] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, "Multi-scale object detection in remote sensing imagery with convolutional neural networks," ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 3–22, Apr. 2018.
[12] J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu, "Learning RoI transformer for oriented object detection in aerial images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2844–2853.
[13] Z. Liu, H. Wang, H. Weng, and L. Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, Aug. 2016.
[14] D. Hong et al., "More diverse means better: Multimodal deep learning meets remote-sensing imagery classification," IEEE Trans. Geosci. Remote Sens., vol. 59, no. 5, pp. 4340–4354, May 2021.
[15] Z. Wang, K. Jiang, P. Yi, Z. Han, and Z. He, "Ultra-dense GAN for satellite imagery super-resolution," Neurocomputing, vol. 398, pp. 328–337, Jul. 2020.
[16] M. T. Razzak, G. Mateo-García, G. Lecuyer, L. Gómez-Chova, Y. Gal, and F. Kalaitzis, "Multi-spectral multi-image super-resolution of Sentinel-2 with radiometric consistency losses and its effect on building delineation," ISPRS J. Photogramm. Remote Sens., vol. 195, pp. 1–13, Jan. 2023.
[17] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, "Edge-enhanced GAN for remote sensing image superresolution," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, Jun. 2019.
[18] Y. Xiao, X. Su, Q. Yuan, D. Liu, H. Shen, and L. Zhang, "Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5610819.
[19] (2021). Ultralytics/Yolov5: v5.0. [Online]. Available: https://github.com/ultralytics/yolov5
[20] S. Zhang, M. Chen, J. Chen, F. Zou, Y.-F. Li, and P. Lu, "Multimodal feature-wise co-attention method for visual question answering," Inf. Fusion, vol. 73, pp. 1–10, Sep. 2021.
[21] Y.-T. Chen, J. Shi, Z. Ye, C. Mertz, D. Ramanan, and S. Kong, "Multimodal object detection via probabilistic ensembling," 2021, arXiv:2104.02904.
[22] Q. Chen et al., "EF-Net: A novel enhancement and fusion network for RGB-D saliency detection," Pattern Recognit., vol. 112, Apr. 2021, Art. no. 107740.
[23] H. Zhu, M. Ma, W. Ma, L. Jiao, and B. Hou, "A spatial-channel progressive fusion ResNet for remote sensing classification," Inf. Fusion, vol. 70, no. 1, pp. 72–87, 2020.
[24] Y. Sun, Z. Fu, C. Sun, Y. Hu, and S. Zhang, "Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5404418.
[25] W. Li, Y. Gao, M. Zhang, R. Tao, and Q. Du, "Asymmetric feature fusion network for hyperspectral and SAR image classification," IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 18, 2022, doi: 10.1109/TNNLS.2022.3149394.
[26] Y. Gao et al., "Hyperspectral and multispectral classification for coastal wetland using depthwise feature interaction network," IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5512615.
[27] M. Sharma et al., "YOLOrs: Object detection in multimodal remote sensing imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 1497–1508, 2021.
[28] L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls, "Multimodal classification of remote sensing images: A review and future directions," Proc. IEEE, vol. 103, no. 9, pp. 1560–1584, Sep. 2015.
[29] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[30] C. Li, T. Yang, S. Zhu, C. Chen, and S. Guan, "Density map guided object detection in aerial images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 190–191.
[31] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, "R-CNN for small object detection," in Proc. Asian Conf. Comput. Vis. Cham, Switzerland: Springer, 2017, pp. 214–230.
[32] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, "Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis.
[40] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Netw., vol. 107, pp. 3–11, Nov. 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2014.
[42] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[43] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 136–144.
[44] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for image restoration with neural networks," IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 47–57, Mar. 2017.
[45] S. Razakarivony and F. Jurie, "Vehicle detection in aerial imagery: A small target detection benchmark," J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016.
[46] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. 19th Int. Conf. Comput. Statist., vol. 2010, pp. 177–186.
[47] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[48] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[49] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, "YOLO-fine: One-stage detector of small objects under various backgrounds in remote sensing images," Remote Sens., vol. 12, no. 15, p. 2501, Aug. 2020.
[50] F. Qingyun and W. Zhaokui, "Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery," Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108786.
[51] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[52] X. Li et al., "Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 21002–21012.
[53] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9627–9636.
[54] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9759–9768.
[55] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[56] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An
(ICCV), Oct. 2019, pp. 9725–9734. extremely efficient convolutional neural network for mobile devices,”
[33] M. Haris, G. Shakhnarovich, and N. Ukita, “Task-driven super in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
resolution: Object detection in low-resolution images,” 2018, pp. 6848–6856.
arXiv:1803.11316. [57] H. Wei, Y. Zhang, Z. Chang, H. Li, H. Wang, and X. Sun, “Oriented
[34] J. Shermeyer and A. Van Etten, “The effects of super-resolution on
objects as pairs of middle lines,” ISPRS J. Photogramm. Remote Sens.,
object detection performance in satellite imagery,” in Proc. IEEE/CVF
vol. 169, pp. 268–279, Nov. 2020.
Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, [58] P. Wang, X. Sun, W. Diao, and K. Fu, “FMSSD: Feature-merged
pp. 1432–1441. single-shot detection for multiscale objects in large-scale remote sens-
[35] L. Courtrai, M.-T. Pham, and S. Lefèvre, “Small object detection
ing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5,
in remote sensing images based on super-resolution with auxiliary
pp. 3377–3390, Dec. 2020.
generative adversarial networks,” Remote Sens., vol. 12, no. 19, p. 3152, [59] Y. Yang et al., “Adaptive knowledge distillation for lightweight remote
Sep. 2020. sensing object detectors optimizing,” IEEE Trans. Geosci. Remote Sens.,
[36] J. Rabbi, N. Ray, M. Schubert, S. Chowdhury, and D. Chao, “Small-
vol. 60, 2022, Art. no. 5623715.
object detection in remote sensing images with end-to-end edge-
enhanced GAN and object detector network,” Remote Sens., vol. 12,
no. 9, p. 1432, May 2020.
Jiaqing Zhang received the B.E. degree in telecommunications engineering from Ningbo University, Ningbo, Zhejiang, China, in 2019. She is currently pursuing the Ph.D. degree with the Image Coding and Processing Center, State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, China.
Her research interests include multimodal image processing, remote sensing object detection, and network compression.
Jie Lei (Member, IEEE) received the M.S. degree in telecommunication and information systems and the Ph.D. degree in signal and information processing from Xidian University, Xi'an, China, in 2006 and 2010, respectively.
He was a Visiting Scholar with the Department of Computer Science, University of California at Los Angeles, Los Angeles, CA, USA, from 2014 to 2015. He is currently a Professor with the School of Telecommunications Engineering, Xidian University, where he is also a member of the Image Coding and Processing Center, State Key Laboratory of Integrated Services Networks. He is also with the Science and Technology on Electro-Optic Control Laboratory, Luoyang, China. His research interests focus on image and video processing, computer vision, and customized computing for big-data applications.

Weiying Xie (Member, IEEE) received the B.S. degree in electronic information science and technology from the University of Jinan, Jinan, China, in 2011, the M.S. degree in communication and information systems from Lanzhou University, Lanzhou, China, in 2014, and the Ph.D. degree in communication and information systems from Xidian University, Xi'an, China, in 2017.
She is currently an Associate Professor with the State Key Laboratory of Integrated Services Networks, Xidian University. Her research interests include neural networks, machine learning, hyperspectral image processing, and high-performance computing.

Zhenman Fang (Member, IEEE) received the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2014.
He did his post-doctoral research at the University of California at Los Angeles (UCLA), Los Angeles, CA, USA, from 2014 to 2017, and worked as a Staff Software Engineer at Xilinx, San Jose, CA, USA, from 2017 to 2019. He is currently an Assistant Professor with the School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada. His recent research focuses on customizable computing with specialized hardware acceleration, including emerging application characterization and acceleration, novel accelerator-rich and near-data computing architecture designs, and corresponding programming, runtime, and tool support.
Dr. Fang is a member of the Association for Computing Machinery (ACM).

Yunsong Li (Member, IEEE) received the M.S. degree in telecommunication and information systems and the Ph.D. degree in signal and information processing from Xidian University, Xi'an, China, in 1999 and 2002, respectively.
He joined the School of Telecommunications Engineering, Xidian University, in 1999, where he is currently a Professor. He is currently the Director of the Image Coding and Processing Center, State Key Laboratory of Integrated Services Networks, Xidian University. His research interests focus on image and video processing and high-performance computing.

Qian Du (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of Maryland, Baltimore, MD, USA, in 2000.
She is currently the Bobby Shackouls Professor with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS, USA. She is also an Adjunct Professor with the College of Surveying and Geo-Informatics, Tongji University, Shanghai, China. Her research interests include hyperspectral remote sensing image analysis and applications, pattern classification, data compression, and neural networks.
Dr. Du is a fellow of the SPIE-International Society for Optics and Photonics. She received the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society. She was the Co-Chair of the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society from 2009 to 2013 and the Chair of the Remote Sensing and Mapping Technical Committee of the International Association for Pattern Recognition from 2010 to 2014. She was the General Chair of the fourth IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing held in Shanghai, China, in 2012. She has served as an Associate Editor for the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, the Journal of Applied Remote Sensing, and the IEEE Signal Processing Letters. Since 2016, she has been the Editor-in-Chief of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
