FFCA-YOLO for Small Object Detection in Remote Sensing Images
Abstract— Issues such as insufficient feature representation and background confusion make detection tasks for small objects in remote sensing arduous, particularly when the algorithm is to be deployed on board for real-time processing, which requires extensive optimization of accuracy and speed under limited computing resources. To tackle these problems, an efficient detector called feature enhancement, fusion and context aware YOLO (FFCA-YOLO) is proposed in this article. FFCA-YOLO includes three innovative lightweight and plug-and-play modules: the feature enhancement module (FEM), the feature fusion module (FFM), and the spatial context aware module (SCAM). These three modules improve the network capabilities of local area awareness, multiscale feature fusion, and global association across channels and space, respectively, while avoiding increases in complexity as much as possible. Thus, the weak feature representations of small objects are enhanced and the confusable backgrounds are suppressed. Two public remote sensing datasets (VEDAI and AI-TOD) for small object detection and one self-built dataset (USOD) are used to validate the effectiveness of FFCA-YOLO. The accuracy of FFCA-YOLO reaches 0.748, 0.617, and 0.909 (in terms of mAP50) on these datasets, exceeding several benchmark models and state-of-the-art methods. Meanwhile, the robustness of FFCA-YOLO is also validated under different simulated degradation conditions. Moreover, to further reduce computational resource consumption while ensuring efficiency, a lite version of FFCA-YOLO (L-FFCA-YOLO) is optimized by reconstructing the backbone and neck of FFCA-YOLO based on partial convolution (PConv). L-FFCA-YOLO has faster speed, a smaller parameter scale, and lower computing power requirements with little accuracy loss compared with FFCA-YOLO. The source code will be available at https://github.com/yemu1138178251/FFCA-YOLO.

Index Terms— Context information, feature fusion, lightweight network, remote sensing image, small object detection.

Manuscript received 23 September 2023; revised 16 December 2023; accepted 31 January 2024. Date of publication 6 February 2024; date of current version 28 February 2024. This work was supported in part by the Strengthening Project of National Defense Science and Technology under Grant 2021-JCJQ-JJ-0834, in part by the National Natural Science Foundation of China under Grant 61705104, and in part by the Fundamental Research Funds for the Central Universities of China under Grant NJ2022025. (Corresponding authors: Yin Zhang; Pengyu Guo.)

Yin Zhang, Mu Ye, Guiyi Zhu, and Junhua Yan are with the College of Astronautics, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Yong Liu and Pengyu Guo are with the National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TGRS.2024.3363057

I. INTRODUCTION

IN RECENT years, research on small object detection has achieved significant growth due to the rapid development of optical remote sensing technology [1], [2], [3], [4], [5], [6] for applications such as traffic supervision, search and rescue, security, and military uses. Remote sensing images generally have large fields of view, which makes them quite suitable for wide-area monitoring. However, because of their relatively low resolution and poor quality, the objects of interest are usually characterized by small sizes (less than 32 × 32 pixels [7], [57]), dim features, low contrast, and insufficient information, which causes extra difficulties in detection [8], [9]. At the same time, remote sensing systems face less controllable observing conditions and numerous interferences in the imaging chain, such as platform motion, the atmosphere, and various complex imaging scenes. All these factors lead to the aliasing of objects and backgrounds, which makes small objects indistinguishable. On the other hand, with the continuous increase of camera bands and resolution, massive data are generated during on-board imaging [10]. For example, WorldView-4 collects data covering 680 000 km² per day [11], which brings a huge amount of downstream data. The traditional ground processing mode after data downlink is facing severe challenges and can hardly meet the requirements of high-timeliness applications, such as military reconnaissance and emergency rescue. Real-time processing on board can significantly relieve the transmission pressure of imaging data and shorten the delay from information acquisition to strategic decision, which makes it one of the potential ways to solve this problem. Authoritative institutions, such as the European Space Agency (ESA), have already treated on-board processing technology as one of the key research directions [12]. Unfortunately, the strict constraints on on-board resources, such as power, weight, and volume, put forward higher requirements for the performance of processing algorithms in terms of reliability, speed, and scale.

In general, the main challenges of small object detection in remote sensing applications can be summarized into three points: insufficient feature representation, background confusion, and the optimization of speed and accuracy under limited hardware conditions.

In this study, our motivation is to design a small object detector with high accuracy that has the potential to be applied to real-time processing on board in the future. The key to alleviating the problems of insufficient feature representation and background confusion lies in feature enhancement and fusion. In terms of feature enhancement, fully utilizing local and global contextual information [13], [14], [15] can effectively enhance the perception of the network for small objects. A feature enhancement module (FEM) and a spatial context aware module (SCAM) are proposed to enrich the local and global contextual features, respectively. FEM expands the receptive
field of the backbone by multibranch atrous convolution. SCAM considers the association between small objects and global regions by constructing global context relationships. In terms of feature fusion, a feature fusion module (FFM) is proposed to improve the feature fusion strategy, which can reweight different feature maps by channel information without increasing computational complexity. These three modules are added to YOLO to obtain a new model: feature enhancement, fusion, and context aware YOLO (FFCA-YOLO). Finally, in order to further reduce computational resource consumption while ensuring efficiency, a lite version of FFCA-YOLO (L-FFCA-YOLO) is optimized by reconstructing the backbone and neck of FFCA-YOLO based on partial convolution (PConv).

The main contributions of this article are listed as follows.

1) An efficient detector (FFCA-YOLO) for small objects and its lite version L-FFCA-YOLO are designed for remote sensing applications. FFCA-YOLO has advanced performance in small object detection tasks compared with several benchmark models and state-of-the-art (SOTA) methods, and it has the potential for future real-time application on board.

2) Three innovative and lightweight plug-and-play modules are proposed: FEM, FFM, and SCAM. These three modules improve the network capabilities of local area awareness, multiscale feature fusion, and global association across channels and space, respectively. They can be inserted as common modules into any detection network to enhance the weak feature representations of small objects and suppress the confusable backgrounds.

3) A new small object dataset, USOD, is constructed based on aerial remote sensing images, in which the proportion of small objects (less than 32 × 32 pixels) is more than 99.9%, with many instances under low illumination and shadow occlusion conditions. In addition, USOD has multiple test sets under different simulated degradation conditions, such as image blurring, Gaussian noise, stripe noise, and fog, which allows it to serve as a benchmark dataset for small object detection in remote sensing.

The remainder of this article is organized as follows. After introducing the related works on small object detection in Section II, the proposed FFCA-YOLO and L-FFCA-YOLO architectures are elaborated in Section III. In Section IV, the experimental details are briefly introduced; the performance of the proposed method, several benchmark models, and SOTA methods is compared in detail; and the robustness and lightweight performance of FFCA-YOLO are also validated. In Section V, the entire article is summarized and future directions of small object detection in remote sensing are pointed out.

II. RELATED WORKS

This section briefly reviews the literature relevant to our work, including the applications of YOLO in remote sensing detection, feature extraction methods for small objects, global context feature representation, and lightweight network frameworks.

A. Applications of YOLO in Remote Sensing

The development of deep learning enables object detectors to adaptively extract image features and locate objects through an end-to-end learning framework. At present, detection methods can be classified into two categories: two-stage [16], [17] and one-stage detectors [18], [19], [20], [21]. Compared with two-stage detectors, one-stage detectors have faster computation speed with low accuracy loss, which gives them better potential for on-board applications. The YOLO series of algorithms [18], [19], [20], as typical one-stage object detectors, has advantages in achieving the desired performance for small objects. At present, some improved YOLO algorithms for object detection in remote sensing have emerged, such as TPH-YOLO [22], FE-YOLO [23], and CA-YOLO [24].

TPH-YOLO [22] integrates transformer encoder blocks into the backbone to obtain rich global context information and improves the quality of object feature representation. FE-YOLO [23] uses deformable convolution for the fusion of high- and low-level feature maps in the neck of YOLO, which aims to eliminate the impact of semantic gaps caused by top–down connections on objects. These two methods achieve good results but with a sharp increase in parameter count. CA-YOLO [24] embeds a coordinate attention module into shallow feature extraction, which suppresses redundant backgrounds and enhances the feature representation of objects by establishing long-range dependencies between pixels. In summary, YOLO has the superiority of scalability and efficiency, which makes it suitable for remote sensing tasks.

Therefore, we choose YOLO as the basic framework and add specifically designed modules for small object feature representation and background suppression.

B. Feature Enhancement and Fusion Methods of Small Object Detection

Object detection methods based on deep learning rely on the backbone to obtain high-dimensional features. However, in remote sensing images, the extracted features of small objects may only occupy one pixel on the output feature maps. Multiscale features need to be used to represent the features more effectively. Inspired by the pyramid structure derived from hand-engineered features, Lin et al. [25] propose the feature pyramid network (FPN), which yields the capacity to aggregate low-level features that have high resolution with high-level features that have low resolution. Since then, PANet [26], NAS-FPN [27], ASFF [28], and BiFPN [29] have been proposed and achieve good results in object detection tasks. Guo et al. [30] introduce AugFPN to address the inconsistency between detailed and semantic information in feature maps; the information gap is narrowed by using a one-time supervision method in the feature fusion stage. Liu et al. [31] present a high-resolution object detection network (HRDNet) to detect small vehicle objects, which uses a multidepth image pyramid combined with a multiscale FPN to deepen features. These methods demonstrate that strengthening the quality of multiscale feature fusion can effectively improve the detection performance of small objects to a certain extent. In addition,
feature enhancement before fusion can further improve the semantic representation of the network. Cheng et al. [32] use a dual attention mechanism to enhance features before fusion, which makes the network focus on the distinct features of objects. The feature enhancement module proposed by Zhang and Shen [33] is similar to Cheng's, which also uses the attention mechanism in spatial and channel dimensions to enhance features. Besides the attention mechanism, expanding the receptive field by multibranch convolution [8] and by transformer encoders [34], [35] are two other commonly used ways of feature enhancement.

In order to obtain a larger receptive field, a new lightweight FEM is designed in this article for obtaining richer local contextual information, which includes a multibranch structure containing standard convolution and atrous convolution. In addition, a new FFM is proposed by improving the multiscale fusion strategy with almost no additional parameters.

C. Global Context Feature Representation

After FEM and FFM, the feature representation of small objects has been enhanced to some extent. Modeling the global relationship between small objects and backgrounds at this stage is more effective than in the backbone.

According to the research results of [36], [37], and [38], obtaining the global receptive field and context information is very important for small object localization. The nonlocal neural network (NLNet) [13] aggregates the global context by calculating the pairwise correlations between spatial pixels. After that, GCNet [14] and SCP [38] simplify the multiplication of query and key to solve the problem of the excessive calculation of NLNet. SCP adds an additional path to GCNet to learn the information of each pixel. This additional path uses one 1 × 1 convolution to aggregate spatial information between different channels, which may still bring in some useless background features.

Based on these methods, a new SCAM is proposed considering the ideas of [39] and [40]. SCAM uses global average-pooling (GAP) and global max-pooling (GMP) to guide pixels in learning the relationship between space and channels. Therefore, the proposed SCAM can achieve contextual feature interaction across channels and space.

D. Lightweight Model Frameworks

Lightweight design is an important indicator for measuring detector performance, especially when aiming at on-board deployment in the future, which requires optimizing accuracy and speed with limited computing resources. There are two commonly used ways to make a network lightweight. The first one is model compression, represented by pruning [41], [42], [43], [44]. The essence of pruning is to delete redundant parameters lower than a threshold set by a designed filtering algorithm. Any model can be pruned to reduce the amount of parameters. Another way is to use lightweight convolutional networks to optimize the model structure. Its idea lies in designing more efficient computing methods for networks. MobileNet [45], ShuffleNet [46], and GhostNet [47] use depthwise convolution (DWConv) and/or group convolution to extract spatial features. DWConv can effectively reduce parameter count and FLOPs. Several network structures [48], [49], [50] for object detection in remote sensing implement lightweight design based on the above methods. Chen et al. [51] prove that the low FLOPs of DWConv are mainly due to frequent memory access by operators. Therefore, PConv is proposed to extract spatial features more effectively by reducing redundant calculations and memory access. Based on the idea of PConv, a lite version of FFCA-YOLO named L-FFCA-YOLO is presented by reconstructing the network in Section IV-E, which is faster with slightly lower accuracy.
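To make the PConv idea concrete, the following is a minimal PyTorch sketch of the partial convolution described in FasterNet [51]: a regular convolution is applied to only a fraction of the channels, while the remaining channels bypass the operator, reducing both computation and memory access. The class name, the 1/4 split ratio, and the 3 × 3 kernel are illustrative assumptions, not the exact configuration used in L-FFCA-YOLO.

```python
import torch
import torch.nn as nn


class PartialConv(nn.Module):
    """Sketch of partial convolution (PConv) from FasterNet [51]:
    convolve only the first dim // n_div channels, pass the rest through."""

    def __init__(self, dim: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.dim_conv = dim // n_div              # channels that are convolved
        self.dim_untouched = dim - self.dim_conv  # channels left unchanged
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        x1 = self.conv(x1)                        # spatial mixing on a channel subset
        return torch.cat((x1, x2), dim=1)         # identity on the remaining channels


if __name__ == "__main__":
    y = PartialConv(dim=64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```

In FasterNet, a pointwise convolution usually follows PConv to mix information across all channels; that detail is omitted here for brevity.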
III. PROPOSED METHOD

A. Overview

YOLOv5 is selected as our benchmark framework since it has fewer parameters compared with the latest YOLOv8 and can maintain a certain degree of accuracy in small object detection tasks. The overall architecture of FFCA-YOLO is shown in Fig. 1. First, FFCA-YOLO only uses four convolution subsampling operations as the backbone for feature extraction, which is different from the original YOLOv5. Second, three specially designed modules are added into the neck of YOLOv5: a lightweight FEM is proposed to improve the local area awareness of the network; FFM is proposed to improve the capability of multiscale feature fusion; and SCAM is designed to improve the capability of global association across channels and space. Finally, a lite version named L-FFCA-YOLO is obtained by reconstructing FFCA-YOLO based on PConv with little accuracy loss. Their detailed descriptions can be found in Sections III-B–III-E.

B. Feature Enhancement Module (FEM)

Due to the complexity of remote sensing images, false alarms with similar features are prone to occur in small object detection tasks. However, the extraction ability of the backbone is limited. The features extracted at this stage contain less semantic information and have narrow receptive fields, which makes it difficult to distinguish small objects from backgrounds. Accordingly, the proposed FEM enhances the features of small objects from two perspectives. From the view of increasing feature richness, a multibranch convolutional structure is adopted to extract multiple discriminative semantic cues. From the view of enlarging receptive fields, atrous convolution is applied to obtain richer local contextual information. The whole structure of FEM is shown in Fig. 2, which is inspired by RFB-s [52]. The difference is that FEM only has two branches with atrous convolution. Each branch performs a 1 × 1 convolution operation on the input feature map to preliminarily adjust the number of channels for subsequent processing. The first branch is a residual structure, which forms an identity mapping to retain the critical feature information of small objects. The other three branches perform cascaded standard convolution operations, whose kernel sizes are 1 × 3, 3 × 1, and 3 × 3, respectively. Additional atrous convolution layers are added to the middle two branches, so that the extracted feature maps can retain more context information.
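Because Fig. 2 is not reproduced here, the following PyTorch sketch only illustrates the multibranch idea described above; the per-branch widths, the dilation rate, and the exact placement of the residual connection are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn


def conv_bn_silu(c_in, c_out, k, d=1):
    """Conv + BatchNorm + SiLU; k may be an int or (kh, kw); padding keeps size."""
    k = (k, k) if isinstance(k, int) else k
    p = (d * (k[0] - 1) // 2, d * (k[1] - 1) // 2)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=p, dilation=d, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU())


class FEMSketch(nn.Module):
    """Hypothetical four-branch FEM: a 1x1 branch, two branches with asymmetric
    (1x3 / 3x1) convolutions followed by dilated 3x3 convolutions, and one plain
    3x3 branch, concatenated, projected back to c channels, and added to the input."""

    def __init__(self, c: int, dilation: int = 3):
        super().__init__()
        b = c // 4  # per-branch width (assumption)
        self.branch0 = conv_bn_silu(c, b, 1)
        self.branch1 = nn.Sequential(conv_bn_silu(c, b, 1), conv_bn_silu(b, b, (1, 3)),
                                     conv_bn_silu(b, b, 3, d=dilation))
        self.branch2 = nn.Sequential(conv_bn_silu(c, b, 1), conv_bn_silu(b, b, (3, 1)),
                                     conv_bn_silu(b, b, 3, d=dilation))
        self.branch3 = nn.Sequential(conv_bn_silu(c, b, 1), conv_bn_silu(b, b, 3))
        self.fuse = conv_bn_silu(4 * b, c, 1)

    def forward(self, x):
        y = torch.cat([self.branch0(x), self.branch1(x),
                       self.branch2(x), self.branch3(x)], dim=1)
        return self.fuse(y) + x  # residual path retains small-object details
```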
maps X_2 (160 × 160) and X_3 (80 × 80) processed by FEM and the high-level feature map X_4 (40 × 40) processed by SPPF. The top–down strategy of FFM is as follows. First, CSPBlock is applied to X_4 to get X_4'; then X_4' is upsampled to obtain a feature map with the same scale as X_3, and CRC is used to fuse them together. The fused feature map is processed by CSPBlock to get X_3'. The above operations are repeated on X_3' to create a new feature map X_2'. X_2', X_3', and X_4' realize the flow of semantic information from deep to shallow. The bottom–up process is similar to the top–down one, with the main difference being that the feature map is downsampled using a convolution with a stride of 2. X_3'' is obtained through the CRC of X_3, X_3', and X_2'. This operation can fuse more features without increasing much cost. X_2', X_3'', and X_4'', as the output results of FFM, are sent to SCAM for context information extraction. The calculation process of FFM can be expressed as follows:

X_2' = \mathrm{CSP}\{\mathrm{CRC}[f_{up}^{2\uparrow}(\mathrm{CBS}(X_3')),\; X_2]\}   (5)

X_3'' = \mathrm{CSP}\{\mathrm{CRC}[\mathrm{CBS}(X_3),\; X_3',\; \mathrm{CBS}(X_2', \mathrm{stride}=2)]\}   (6)

X_4'' = \mathrm{CSP}\{\mathrm{CRC}[X_4',\; \mathrm{CBS}(X_3'', \mathrm{stride}=2)]\}   (7)

where f_{up}^{2\uparrow} represents the upsampling operation and CBS denotes a 3 × 3 convolution including batch normalization and SiLU.

Compared with BiFPN, FFM improves the fusion strategy of multiscale feature maps by reweighting channels. The fusion strategy of BiFPN [29] operates between feature maps, which causes different channels to share the same weight. In order to strengthen the representation of small objects from multiscale features and fully utilize the features of different channels, the proposed CRC reweights the channels of the feature map, as shown in the lower half of Fig. 3.

We design three strategies for reweighting channels. The first strategy uses a channel attention mechanism similar to SENet [39] or ECANet [53] to reweight channels, as in formula (8). This strategy is feasible but increases the computational cost and parameter count significantly. The second strategy first concatenates the feature maps and then multiplies them by normalized trainable weights whose number equals the total number of channels, as shown in formula (9). The third strategy further considers the semantic gap between different feature maps; it first reweights the channels within each feature map and then reweights the different feature maps, as shown in formula (10):

\mathrm{Output} = \mathrm{Attention}(X) \cdot X   (8)

\mathrm{Output} = \sum_j \frac{\omega_j}{\varepsilon + \sum_m \omega_m} \cdot x_j   (9)

\mathrm{Output} = \sum_i \sum_j \frac{\omega_i}{\varepsilon + \sum_k \omega_k} \cdot \frac{\omega_j}{\varepsilon + \sum_{m_i} \omega_{m_i}} \cdot x_j   (10)

where Attention(·) represents the channel attention mechanism, such as SENet or ECANet, ω_i represents the trainable weight of the ith feature map, ω_j represents the trainable weight of the jth channel, m_i is the number of channels in the ith feature map, m represents the total number of channels after concatenation, and ε is set to 0.0001 to avoid numerical instability. According to the results of the ablation experiments in Section IV-D, all three strategies improve the performance, but the difference between the second and third strategies is not significant. As a result, we select the second strategy in FFM for feature reweighting. The structure of FFM and its channel reweighting strategy optimize the fusion process of multiscale semantic information for small objects, which provides more effective feature maps for subsequent global context modeling.
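A minimal sketch of how the selected second strategy, formula (9), could be realized is given below: the incoming feature maps are concatenated along the channel axis and each channel is scaled by a trainable weight normalized over all channels with ε = 0.0001. Whether the weights are constrained to be non-negative (as in BiFPN's fast normalized fusion) and how the reweighted tensor feeds the following CSPBlock are not specified above, so those details are assumptions.

```python
import torch
import torch.nn as nn


class CRCFusion(nn.Module):
    """Sketch of the second reweighting strategy (formula (9)): concatenate the
    incoming feature maps, then scale every channel by its normalized trainable weight."""

    def __init__(self, total_channels: int, eps: float = 1e-4):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(total_channels))  # one weight per channel
        self.eps = eps

    def forward(self, *features: torch.Tensor) -> torch.Tensor:
        x = torch.cat(features, dim=1)                  # (B, total_channels, H, W)
        w = self.weight / (self.eps + self.weight.sum())
        return x * w.view(1, -1, 1, 1)                  # channel-wise reweighting


if __name__ == "__main__":
    crc = CRCFusion(total_channels=96)
    out = crc(torch.randn(1, 64, 80, 80), torch.randn(1, 32, 80, 80))
    print(out.shape)  # torch.Size([1, 96, 80, 80])
```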
D. Spatial Context Aware Module (SCAM)

After FEM and FFM, the feature maps have already taken local contextual information into account and represent small object features well. Modeling the global relationship between small objects and backgrounds at this stage is more effective than in the backbone. Global context information can be used to represent the relationship between pixels across space, which suppresses useless background and enhances the discrimination between objects and backgrounds. Inspired by GCNet [14] and SCP [38], SCAM consists of three branches. The first branch uses GAP and GMP to integrate global information. The second branch uses a 1 × 1 convolution to generate a linear transform of the feature map, which is named value [54] in Fig. 4. The third branch uses a 1 × 1 convolution to simplify the multiplication of query and key; this convolution is named QK in Fig. 4. Subsequently, the first and third branches are matrix multiplied with the second branch, separately. The two resulting branches represent contextual information across channels and space, respectively. Finally, the output of SCAM is obtained by applying a broadcast Hadamard product to these two branches. The structure of SCAM is shown in Fig. 4. In each layer, the pixelwise spatial context can be expressed as follows:

Q_i^j = P_i^j + a_i^j \cdot \sum_{j=1}^{N_i} \frac{\exp(\omega_{qk} P_i^j)}{\sum_{n=1}^{N_i} \exp(\omega_{qk} P_i^n)} \, \omega_v P_i^j   (11)

a_i^j = \frac{\exp\big([\mathrm{avg}(P_i); \max(P_i)]\, P_i^j\big)}{\sum_{n=1}^{N_i} \exp\big([\mathrm{avg}(P_i); \max(P_i)]\, P_i^n\big)} \cdot \omega_v   (12)

where P_i^j and Q_i^j represent the input and output of the jth pixel in the i-level feature map, respectively, N_i denotes the total number of pixels, and ω_qk and ω_v are the linear transform matrices for projecting the feature maps, which are simplified by 1 × 1 convolutions. avg(·) and max(·) perform GAP and GMP, respectively. GAP and GMP can guide the feature map to select channels with significant information, which enables SCAM to learn context information along the channel dimension.
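The following PyTorch sketch approximates the three-branch structure and formulas (11) and (12): the QK branch aggregates spatial context per channel (as in GCNet), the GAP/GMP branch produces a pixel-wise attention map, and the two context terms are combined with a broadcast Hadamard product and added back to the input. The pooled statistics are summed here rather than concatenated, and the residual scaling of formula (11) is simplified, so every detail should be read as an assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCAMSketch(nn.Module):
    """Hedged sketch of SCAM with a GAP/GMP branch, a 1x1 'value' branch,
    and a 1x1 'QK' branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, 1)
        self.qk = nn.Conv2d(channels, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v = self.value(x).view(b, c, h * w)                     # (B, C, N)

        # spatial branch: softmax over positions, aggregate value per channel
        attn = F.softmax(self.qk(x).view(b, 1, h * w), dim=-1)  # (B, 1, N)
        channel_ctx = torch.bmm(v, attn.transpose(1, 2)).view(b, c, 1, 1)

        # channel branch: GAP/GMP-pooled query scores every pixel of the value map
        pooled = (F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1)).view(b, 1, c)
        spatial_ctx = F.softmax(torch.bmm(pooled, v), dim=-1).view(b, 1, h, w)

        # broadcast Hadamard product of the two context terms, residual connection
        return x + channel_ctx * spatial_ctx
```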
E. Lite-FFCA-YOLO (L-FFCA-YOLO)

A qualified lightweight model needs to strike a balance among parameter count, speed, and accuracy. FasterNet found that the main reason for the low FLOPs of DWConv is its frequent redundant memory access, which actually leads to a decrease in speed. To alleviate this phenomenon, FasterNet uses PConv, which considers the redundancy in feature maps [51] and applies standard convolution to only a portion of the input channels while leaving the rest untouched.
TABLE I
Parameter Counts of FFCA-YOLO and L-FFCA-YOLO in Backbone
Fig. 6. Ground truth annotations in UNICORN2008 and USOD. The red bounding boxes are the original annotated instances in UNICORN2008, while the bounding boxes with green corner points are the manual annotation supplements for USOD. (a) Original annotation. (b) Manual annotation. (c) Original annotation. (d) Manual annotation.
TABLE II
Comparison Experiments for FFCA-YOLO in VEDAI
Fig. 9. The detection results of FFCA-YOLO in USOD, VEDAI, and AI-TOD for typical scenarios, such as ports, highways, and buildings. (a) Results in
USOD dataset. (b) Results in VEDAI dataset. (c) Results in AI-TOD dataset.
TABLE III
Comparison Experiments for FFCA-YOLO in AI-TOD

TABLE IV
Comparison Experiments for FFCA-YOLO in USOD
hyperparameters, FFCA-YOLO has a smaller parameter count and higher performance compared with the benchmark methods. L-FFCA-YOLO reduces the parameter count by about 30% compared with FFCA-YOLO (from 7.12 to 5.04 M) while showing no significant decline in accuracy metrics. Fig. 10 shows the detection results of YOLOv5m, TPH-YOLO, and
Fig. 10. Detection results of YOLOv5m, TPH-YOLO, and FFCA-YOLO for low illumination and shadow occlusion scenes. The red bounding boxes represent the detection boxes output by the models, while the yellow circles represent missed detections.
TABLE V
Ablation Experiments for FEM, FFM, and SCAM in USOD
FFCA-YOLO in low illumination and shadow occlusion scenes. In the low illumination scene, the grayscale values of the objects and the background are close to each other, causing YOLOv5m and TPH-YOLO to have missed detections. In the occlusion scene, one small object is located in the shade of a tree, causing YOLOv5m to have a missed detection.

D. Ablation Experimental Result

To analyze the importance of each component in FFCA-YOLO, we progressively applied FEM, FFM, and SCAM to the baseline to verify their effectiveness. The ablation experiments were conducted on the USOD dataset. Table V shows the impact of adding or removing each module on the evaluation metrics, where √ represents using the module and × represents not using the module.

1) FEM: As shown in Table V, adding FEM obviously improves all evaluation metrics, especially in terms of precision (from 0.9 to 0.926) and mAPs (from 0.303 to 0.335). This confirms that FEM makes it easier for the model to distinguish small objects from backgrounds. To further validate this conclusion, we visualize the feature maps before and after FEM in Fig. 11. A brighter color represents that the model pays more attention to that area. Because FEM enriches the local contextual features, the network shows good suppression of complex backgrounds.

2) FFM: Table V shows that adding FFM improves all evaluation metrics, especially in terms of recall (from 0.826 to 0.837). In addition, we investigate the effects of different neck structures and the different fusion strategies for multiscale feature maps mentioned in Section III-C, as shown in Table VI. CRC_1, CRC_2, and CRC_3 represent the different channel reweighting strategies in formulas (8)–(10), respectively. It can be seen that the performance of CRC_2 and CRC_3 is significantly better in all aspects compared with BiFPN, and the performance difference between CRC_2 and CRC_3 is relatively small (mAP50:95 of CRC_2 is 0.003 higher than that of CRC_3). As a result, CRC_2 is selected as the channel reweighting strategy in FFM.

3) SCAM: Table V shows the performance improvement from adding SCAM, which improves all evaluation metrics. Table VII shows the comparison between SCAM and some typical baseline methods; SCAM achieves better performance on all evaluation metrics. Fig. 11 shows the impact of SCAM on the feature maps. Compared with the feature maps output by FEM, the same-level feature maps of SCAM further enhance the feature representation of small objects and suppress the
Fig. 11. Influence of FEM and SCAM on feature extraction. The brighter color represents that the model pays more attention to that area.
TABLE VI
Comparison Experiments for FFM in USOD
backgrounds. Through the above analysis of the ablation experiments, it can be concluded that the proposed modules FEM, FFM, and SCAM all steadily improve the performance of FFCA-YOLO without any conflicts.
TABLE VII
Comparison Experiments for SCAM in USOD

TABLE VIII
Robustness Experiments for FFCA-YOLO and YOLOv5m in USOD
[9] Q. Ran, Q. Wang, B. Zhao, Y. Wu, S. Pu, and Z. Li, "Lightweight oriented object detection using multiscale context and enhanced channel attention in remote sensing images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 5786–5795, 2021.
[10] B. Zhang et al., "Progress and challenges in intelligent remote sensing satellite systems," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 1814–1822, 2022.
[11] B. Vajsova, A. Walczynska, S. Bärisch, P. J. Åstrand, and S. Hain, "New sensors benchmark report on WorldView-4," Publications Office Eur. Union, Luxembourg, Tech. Rep. EUR 28761 EN, 2017.
[12] R. Trautner and R. Vitulli, "Ongoing developments of future payload data processing platforms at ESA," in Proc. On-Board Payload Data Compress. Workshop (OBPDC), 2010.
[13] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[14] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
[15] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "Global context networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 6881–6895, Jun. 2023.
[16] S. Q. Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017, doi: 10.1109/TPAMI.2016.2577031.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[20] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7464–7475.
[21] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[22] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2778–2788.
[23] M. Wang et al., "FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection," J. Vis. Commun. Image Represent., vol. 90, Feb. 2023, Art. no. 103752.
[24] L. Shen, B. Lang, and Z. Song, "CA-YOLO: Model optimization for remote sensing image object detection," IEEE Access, vol. 11, pp. 64769–64781, 2023.
[25] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768.
[27] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7036–7045.
[28] S. Liu, D. Huang, and Y. Wang, "Learning spatial fusion for single-shot object detection," 2019, arXiv:1911.09516.
[29] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10781–10790.
[30] C. Guo, B. Fan, Q. Zhang, S. Xiang, and C. Pan, "AugFPN: Improving multi-scale feature learning for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12595–12604.
[31] Z. Liu, G. Gao, L. Sun, and Z. Fang, "HRDNet: High-resolution detection network for small objects," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2021, pp. 1–6.
[32] G. Cheng et al., "Feature enhancement network for object detection in optical remote sensing images," J. Remote Sens., vol. 1, p. 14, 2021, doi: 10.34133/2021/9805389.
[33] K. Zhang and H. Shen, "Multi-stage feature enhancement pyramid network for detecting objects in optical remote sensing images," Remote Sens., vol. 14, no. 3, p. 579, Jan. 2022.
[34] R. Liu et al., "RAANet: A residual ASPP with attention framework for semantic segmentation of high-resolution remote sensing images," Remote Sens., vol. 14, no. 13, p. 3109, Jun. 2022.
[35] Y. Li, Z. Cheng, C. Wang, J. Zhao, and L. Huang, "RCCT-ASPPNet: Dual-encoder remote image segmentation based on transformer and ASPP," Remote Sens., vol. 15, no. 2, p. 379, Jan. 2023.
[36] W. Chen, S. Ouyang, W. Tong, X. Li, X. Zheng, and L. Wang, "GCSANet: A global context spatial attention deep learning network for remote sensing scene classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 1150–1162, 2022.
[37] Y. Zhou et al., "BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–17, 2022, Art. no. 5618617, doi: 10.1109/TGRS.2022.3152575.
[38] Y. Liu, H. Li, C. Hu, S. Luo, Y. Luo, and C. Wen Chen, "Learning to aggregate multi-scale context for instance segmentation in remote sensing images," 2021, arXiv:2111.11057.
[39] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[40] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[41] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2736–2744.
[42] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1389–1397.
[43] S. Guo, Y. Wang, Q. Li, and J. Yan, "DMCP: Differentiable Markov channel pruning for neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1536–1544.
[44] J. Chang, Y. Lu, P. Xue, Y. Xu, and Z. Wei, "Automatic channel pruning via clustering and swarm intelligence optimization for CNN," Appl. Intell., vol. 52, pp. 17751–17771, Apr. 2022.
[45] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[46] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848–6856.
[47] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1580–1589.
[48] L. Huyan et al., "A lightweight object detection framework for remote sensing images," Remote Sens., vol. 13, no. 4, p. 683, Feb. 2021.
[49] J. Yi, Z. Shen, F. Chen, Y. Zhao, S. Xiao, and W. Zhou, "A lightweight multiscale feature fusion network for remote sensing object counting," IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023, Art. no. 5902113, doi: 10.1109/TGRS.2023.3238185.
[50] J. Liu, R. Liu, K. Ren, X. Li, J. Xiang, and S. Qiu, "High-performance object detection for optical remote sensing images with lightweight convolutional neural networks," in Proc. IEEE 22nd Int. Conf. High Perform. Comput. Commun., IEEE 18th Int. Conf. Smart City, IEEE 6th Int. Conf. Data Sci. Syst. (HPCC/SmartCity/DSS), Dec. 2020, pp. 585–592.
[51] J. Chen et al., "Run, don't walk: Chasing higher FLOPS for faster neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 12021–12031.
[52] S. Liu and D. Huang, "Receptive field block net for accurate and fast object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 385–400.
[53] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11534–11542.
[54] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5998–6008.
[55] S. Razakarivony and F. Jurie, "Vehicle detection in aerial imagery: A small target detection benchmark," J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016.
[56] J. Wang, W. Yang, H. Guo, R. Zhang, and G.-S. Xia, "Tiny object detection in aerial images," in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 3791–3798.
[57] J. Wang, C. Xu, W. Yang, and L. Yu, "A normalized Gaussian Wasserstein distance for tiny object detection," 2021, arXiv:2110.13389.
[58] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755.
[59] G.-S. Xia et al., "DOTA: A large-scale dataset for object detection in aerial images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3974–3983.
[60] L. Colin et al. (2019). Unified Coincident Optical and Radar for Recognition (UNICORN) 2008 Dataset. [Online]. Available: https://github.com/AFRL-RY/data-unicorn-2008
[61] M. A. Momin, M. H. Junos, A. S. M. Khairuddin, and M. S. A. Talip, "Lightweight CNN model: Automated vehicle detection in aerial images," Signal, Image Video Process., vol. 17, no. 4, pp. 1209–1217, Jun. 2023.
[62] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, "YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images," Remote Sens., vol. 12, no. 15, p. 2501, Aug. 2020.
[63] J. Zhang, J. Lei, W. Xie, Z. Fang, Y. Li, and Q. Du, "SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–15, 2023, Art. no. 5605415, doi: 10.1109/TGRS.2023.3258666.
[64] F. Qingyun and W. Zhaokui, "Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery," Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108786.
[65] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6154–6162.
[66] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10213–10224.
[67] M. Ma and H. Pang, "SP-YOLOv8s: An improved YOLOv8s model for remote sensing image tiny object detection," Appl. Sci., vol. 13, no. 14, p. 8161, Jul. 2023.
[68] G. Guo, P. Chen, X. Yu, Z. Han, Q. Ye, and S. Gao, "Save the tiny, save the all: Hierarchical activation network for tiny object detection," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 1, pp. 221–234, Jan. 2024, doi: 10.1109/TCSVT.2023.3284161.
[69] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," 2017, arXiv:1701.06659.
[70] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4203–4212.
[71] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[72] G. Yang, J. Lei, Z. Zhu, S. Cheng, Z. Feng, and R. Liang, "AFPN: Asymptotic feature pyramid network for object detection," 2023, arXiv:2306.15988.
[73] C. Li, Z. Li, X. Liu, and S. Li, "The influence of image degradation on hyperspectral image classification," Remote Sens., vol. 14, no. 20, p. 5199, Oct. 2022.
[74] K. He, J. Sun, and X. Tang, "Single image haze removal using dark channel prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2341–2353, Dec. 2010.

Yin Zhang received the B.Sc. degree from Jilin University, Changchun, China, in 2009, and the M.Sc. and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 2011 and 2016, respectively. He is currently an Associate Professor with the Nanjing University of Aeronautics and Astronautics, Nanjing, China. His main research interests include simulating and processing photoelectric detection information.

Mu Ye received the B.Sc. and M.Sc. degrees from the Shanghai University of Engineering Science, in 2019 and 2022, respectively. He is currently pursuing the D.Eng. degree with the Nanjing University of Aeronautics and Astronautics, Nanjing, China. His main research interests include space-based object detection and signal processing.

Guiyi Zhu received the B.Sc. degree from Xidian University, Xi'an, China, in 2015. She is currently pursuing the M.S. degree with the Nanjing University of Aeronautics and Astronautics, Nanjing, China. Her main research interests include object detection and classification.

Yong Liu received the B.Sc. and M.Sc. degrees from Air Force Aviation University, Changchun, China, in 2012 and 2014, respectively, and the Ph.D. degree from the National University of Defense Technology, Changsha, China, in 2018. He is currently a Research Assistant with the National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing, China. His main research interests include remote sensing data processing and information fusion.

Pengyu Guo received the master's degree in computer science and technology from the National University of Defense Technology, Changsha, China, in 2008, and the Ph.D. degree in aerospace science and technology from the National University of Defense Technology, in 2015. He has experience in working for the China Xi'an Satellite Control Center, Xi'an, China. He is currently an Associate Research Fellow with the National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing, China. His current research focus is to devise algorithms based on computer vision and machine learning to enable unmanned platforms' imaging systems for detection, tracking, recognition, and relative pose estimation.

Junhua Yan received the B.Sc., M.Sc., and Ph.D. degrees from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1993, 2001, and 2004, respectively. She is currently a Professor with the Nanjing University of Aeronautics and Astronautics. Her main research interests include image quality assessment, multisource information fusion, object detection, tracking, and recognition.