FFCA-YOLO for Small Object Detection in Remote Sensing Images
Abstract— Issues such as insufficient feature representation and background confusion make detection tasks for small objects in remote sensing arduous, particularly when the algorithm is to be deployed on board for real-time processing, which requires extensive optimization of accuracy and speed under limited computing resources. To tackle these problems, an efficient detector called feature enhancement, fusion and context aware YOLO (FFCA-YOLO) is proposed in this article. FFCA-YOLO includes three innovative lightweight and plug-and-play modules: the feature enhancement module (FEM), the feature fusion module (FFM), and the spatial context aware module (SCAM). These three modules improve the network capabilities of local area awareness, multiscale feature fusion, and global association across channels and space, respectively, while avoiding increases in complexity as much as possible. Thus, the weak feature representations of small objects are enhanced and the confusable backgrounds are suppressed. Two public remote sensing datasets (VEDAI and AI-TOD) for small object detection and one self-built dataset (USOD) are used to validate the effectiveness of FFCA-YOLO. The accuracy of FFCA-YOLO reaches 0.748, 0.617, and 0.909 (in terms of mAP50) on these datasets, exceeding several benchmark models and state-of-the-art methods. Meanwhile, the robustness of FFCA-YOLO is also validated under different simulated degradation conditions. Moreover, to further reduce computational resource consumption while ensuring efficiency, a lite version of FFCA-YOLO (L-FFCA-YOLO) is optimized by reconstructing the backbone and neck of FFCA-YOLO based on partial convolution (PConv). L-FFCA-YOLO has faster speed, a smaller parameter scale, and lower computing power requirements with little accuracy loss compared with FFCA-YOLO. The source code will be available at https://github.com/yemu1138178251/FFCA-YOLO.

Index Terms— Context information, feature fusion, lightweight network, remote sensing image, small object detection.

Manuscript received 23 September 2023; revised 16 December 2023; accepted 31 January 2024. Date of publication 6 February 2024; date of current version 28 February 2024. This work was supported in part by the Strengthening Project of National Defense Science and Technology under Grant 2021-JCJQ-JJ-0834, in part by the National Natural Science Foundation of China under Grant 61705104, and in part by the Fundamental Research Funds for the Central Universities of China under Grant NJ2022025. (Corresponding authors: Yin Zhang; Pengyu Guo.)

Yin Zhang, Mu Ye, Guiyi Zhu, and Junhua Yan are with the College of Astronautics, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Yong Liu and Pengyu Guo are with the National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TGRS.2024.3363057

I. INTRODUCTION

IN RECENT years, research on small object detection has achieved significant growth due to the rapid development of optical remote sensing technology [1], [2], [3], [4], [5], [6] for applications such as traffic supervision, search and rescue, security, and military uses. Remote sensing images generally have large fields of view, which makes them quite suitable for wide-area monitoring. However, because of their relatively low resolution and poor quality, the objects of interest are usually characterized by small sizes (less than 32 × 32 pixels [7], [57]), dim features, low contrast, and insufficient information, which causes extra difficulties in detection [8], [9]. At the same time, remote sensing systems face less controllable observing conditions and numerous interferences in the imaging chain, such as platform motion, the atmosphere, and various complex imaging scenes. All these factors lead to the aliasing of objects and backgrounds, which makes small objects indistinguishable. On the other hand, with the continuous increase of camera bands and resolution, massive data are generated during on-board imaging [10]. For example, WorldView-4 collects data covering 680 000 km² per day [11], which brings a huge amount of downstream data. The traditional ground processing mode after data downlink is facing severe challenges and can hardly meet the requirements of high-timeliness applications, such as military reconnaissance and emergency rescue. Real-time processing on board can significantly relieve the transmission pressure of imaging data and shorten the delay from information acquisition to strategic decision, which makes it one of the potential ways to solve this problem. Authoritative institutions, such as the European Space Agency (ESA), have already treated on-board processing technology as one of the key research directions [12]. Unfortunately, the strict constraints on on-board resources, such as power, weight, and volume, put forward higher requirements for the performance of processing algorithms in terms of reliability, speed, and scale.

In general, the main challenges of small object detection in remote sensing applications can be summarized into three points: insufficient feature representation, background confusion, and the optimization of speed and accuracy under limited hardware conditions.

In this study, our motivation is to design a small object detector with high accuracy that has the potential to be applied to real-time processing on board in the future. The key to alleviating the problems of insufficient feature representation and background confusion lies in feature enhancement and fusion. In terms of feature enhancement, fully utilizing local and global contextual information [13], [14], [15] can effectively enhance the perception of the network for small objects. A feature enhancement module (FEM) and a spatial context aware module (SCAM) are proposed to enrich the local and global contextual features, respectively. FEM expands the receptive
field of the backbone by multibranch atrous convolution. SCAM considers the association between small objects and global regions by constructing global context relationships. In terms of feature fusion, a feature fusion module (FFM) is proposed to improve the feature fusion strategy, which can reweight different feature maps by channel information without increasing computational complexity. These three modules are added to YOLO to obtain a new model: feature enhancement, fusion, and context aware YOLO (FFCA-YOLO). Finally, in order to further reduce computational resource consumption while ensuring efficiency, a lite version of FFCA-YOLO (L-FFCA-YOLO) is optimized by reconstructing the backbone and neck of FFCA-YOLO based on partial convolution (PConv).

The main contributions of this article are listed as follows.

1) An efficient detector (FFCA-YOLO) for small objects and its lite version L-FFCA-YOLO are designed for remote sensing applications. FFCA-YOLO has advanced performance in small object detection tasks compared with several benchmark models and state-of-the-art (SOTA) methods, and it has the potential for future real-time application on board.

2) Three innovative and lightweight plug-and-play modules are proposed: FEM, FFM, and SCAM. These three modules improve the network capabilities of local area awareness, multiscale feature fusion, and global association across channels and space, respectively. They can be inserted as common modules into any detection network to enhance the weak feature representations of small objects and suppress the confusable backgrounds.

3) A new small object dataset, USOD, is constructed based on aerial remote sensing images, in which the proportion of small objects (less than 32 × 32 pixels) is more than 99.9%, with many instances under low illumination and shadow occlusion conditions. In addition, USOD has multiple test sets under different simulated degradation conditions, such as image blurring, Gaussian noise, stripe noise, and fog, which allows it to serve as a benchmark dataset for small object detection in remote sensing.

The remainder of this article is organized as follows. After introducing the related works on small object detection in Section II, the proposed FFCA-YOLO and L-FFCA-YOLO architectures are elaborated in Section III. In Section IV, the experimental details are briefly introduced; the performance of the proposed method, several benchmark models, and SOTA methods is compared in detail; and the robustness and lightweight performance of FFCA-YOLO are also validated. In Section V, the entire article is summarized and future directions of small object detection in remote sensing are pointed out.

II. RELATED WORKS

This section briefly reviews the literature relevant to our work, including the applications of YOLO in remote sensing detection, feature extraction methods for small objects, global context feature representation, and lightweight network frameworks.

A. Applications of YOLO in Remote Sensing

The development of deep learning enables object detectors to adaptively extract image features and locate objects through an end-to-end learning framework. At present, detection methods can be classified into two categories: two-stage [16], [17] and one-stage detectors [18], [19], [20], [21]. Compared with two-stage detectors, one-stage detectors have faster computation speed with low accuracy loss, which gives them better potential for on-board applications. The YOLO series of algorithms [18], [19], [20], as typical one-stage object detectors, has advantages in achieving the desired performance for small objects. At present, some improved YOLO algorithms for object detection in remote sensing have emerged, such as TPH-YOLO [22], FE-YOLO [23], and CA-YOLO [24].

TPH-YOLO [22] integrates transformer encoder blocks into the backbone to obtain rich global context information and improves the quality of object feature representation. FE-YOLO [23] uses deformable convolution for the fusion of high- and low-level feature maps in the neck of YOLO, which aims to eliminate the impact of semantic gaps caused by top–down connections on objects. These two methods achieve good results but with a sharp increase in parameter count. CA-YOLO [24] embeds a coordinate attention module into shallow feature extraction, which suppresses redundant backgrounds and enhances the feature representation of objects by establishing long-range dependencies between pixels. In summary, YOLO has the superiority of scalability and efficiency, which makes it suitable for remote sensing tasks.

Therefore, we choose YOLO as the basic framework and add specifically designed modules for small object feature representation and background suppression.

B. Feature Enhancement and Fusion Methods of Small Object Detection

Object detection methods based on deep learning rely on the backbone to obtain high-dimensional features. However, in remote sensing images, the extracted features of small objects may only occupy one pixel on the output feature maps. Multiscale features need to be used to represent the features more effectively. Inspired by the pyramid structure derived from hand-engineered features, Lin et al. [25] propose the feature pyramid network (FPN), which yields the capacity to aggregate low-level features that have high resolution with high-level features that have low resolution. Since then, PANet [26], NAS-FPN [27], ASFF [28], and BiFPN [29] have been proposed and achieve good results in object detection tasks. Guo et al. [30] introduce AugFPN to address the inconsistency between detailed and semantic information in feature maps; the information gap is narrowed by using a one-time supervision method in the feature fusion stage. Liu et al. [31] present a high-resolution object detection network (HRDNet) to detect small vehicle objects, which uses a multidepth image pyramid combined with a multiscale FPN to deepen features. These methods demonstrate that strengthening the quality of multiscale feature fusion can effectively improve the detection performance of small objects to a certain extent. In addition,
feature enhancement before fusion can further improve the semantic representation of the network. Cheng et al. [32] use a dual attention mechanism to enhance features before fusion, which makes the network focus on the distinct features of objects. The feature enhancement module proposed by Zhang and Shen [33] is similar to Cheng's, which also uses the attention mechanism in spatial and channel dimensions to enhance features. Besides the attention mechanism, expanding the receptive field by multibranch convolution [8] and by transformer encoders [34], [35] are two other commonly used ways of feature enhancement.

In order to obtain a larger receptive field, a new lightweight FEM is designed in this article for obtaining richer local contextual information, which includes a multibranch structure containing standard convolution and atrous convolution. In addition, a new FFM is proposed by improving the multiscale fusion strategy with almost no additional parameters.

C. Global Context Feature Representation

After FEM and FFM, the feature representation of small objects has been enhanced to some extent. Modeling the global relationship between small objects and backgrounds at this stage is more effective than in the backbone.

According to the research results of [36], [37], and [38], obtaining the global receptive field and context information is very important for small object localization. The nonlocal neural network (NLNet) [13] aggregates the global context by calculating the pairwise correlations between spatial pixels. After that, GCNet [14] and SCP [38] simplify the multiplication of query and key to solve the problem of the excessive calculation of NLNet. SCP adds an additional path to GCNet to learn the information of each pixel. This additional path uses one 1 × 1 convolution to aggregate spatial information between different channels, which may still bring in some useless background features.

Based on these methods, a new SCAM is proposed considering the ideas of [39] and [40]. SCAM uses global average-pooling (GAP) and global max-pooling (GMP) to guide pixels in learning the relationship between space and channels. Therefore, the proposed SCAM can achieve contextual feature interaction across channels and space.

D. Lightweight Model Frameworks

Lightweight design is an important indicator for measuring detector performance, especially when aiming at on-board deployment in the future, which requires optimizing accuracy and speed with limited computing resources. There are two commonly used ways to make a network lightweight. The first one is model compression, represented by pruning [41], [42], [43], [44]. The essence of pruning is to delete redundant parameters lower than a threshold set by a designed filtering algorithm. Any model can be pruned to reduce the amount of parameters. Another way is to use lightweight convolutional networks to optimize the model structure. Its idea lies in designing more efficient computing methods for networks. MobileNet [45], ShuffleNet [46], and GhostNet [47] use depthwise convolution (DWConv) and/or group convolution to extract spatial features. DWConv can effectively reduce parameter count and FLOPs. Several network structures [48], [49], [50] for object detection in remote sensing implement lightweight design based on the above methods. Chen et al. [51] prove that the low FLOPs of DWConv are mainly due to frequent memory access by operators. Therefore, PConv is proposed to extract spatial features more effectively by reducing redundant calculations and memory access. Based on the idea of PConv, a lite version of FFCA-YOLO named L-FFCA-YOLO is presented by reconstructing the network in Section IV-E, which is faster with slightly lower accuracy.
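To make the PConv idea concrete, the following is a minimal PyTorch sketch of the partial convolution described in FasterNet [51]: a regular convolution is applied to only a fraction of the channels, while the remaining channels bypass the operator, reducing both computation and memory access. The class name, the 1/4 split ratio, and the 3 × 3 kernel are illustrative assumptions, not the exact configuration used in L-FFCA-YOLO.

```python
import torch
import torch.nn as nn


class PartialConv(nn.Module):
    """Sketch of partial convolution (PConv) from FasterNet [51]:
    convolve only the first dim // n_div channels, pass the rest through."""

    def __init__(self, dim: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.dim_conv = dim // n_div              # channels that are convolved
        self.dim_untouched = dim - self.dim_conv  # channels left unchanged
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        x1 = self.conv(x1)                        # spatial mixing on a channel subset
        return torch.cat((x1, x2), dim=1)         # identity on the remaining channels


if __name__ == "__main__":
    y = PartialConv(dim=64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```

In FasterNet, a pointwise convolution usually follows PConv to mix information across all channels; that detail is omitted here for brevity.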
III. PROPOSED METHOD

A. Overview

YOLOv5 is selected as our benchmark framework since it has fewer parameters compared with the latest YOLOv8 and can maintain a certain degree of accuracy in small object detection tasks. The overall architecture of FFCA-YOLO is shown in Fig. 1. First, FFCA-YOLO only uses four convolution subsampling operations as the backbone for feature extraction, which is different from the original YOLOv5. Second, three specially designed modules are added into the neck of YOLOv5: a lightweight FEM is proposed to improve the local area awareness of the network; FFM is proposed to improve the capability of multiscale feature fusion; and SCAM is designed to improve the capability of global association across channels and space. Finally, a lite version named L-FFCA-YOLO is obtained by reconstructing FFCA-YOLO based on PConv with little accuracy loss. Their detailed descriptions can be found in Sections III-B–III-E.

B. Feature Enhancement Module (FEM)

Due to the complexity of remote sensing images, false alarms with similar features are prone to occur in small object detection tasks. However, the extraction ability of the backbone is limited. The features extracted at this stage contain less semantic information and have narrow receptive fields, which makes it difficult to distinguish small objects from backgrounds. Accordingly, the proposed FEM enhances the features of small objects from two perspectives. From the view of increasing feature richness, a multibranch convolutional structure is adopted to extract multiple discriminative semantic cues. From the view of enlarging receptive fields, atrous convolution is applied to obtain richer local contextual information. The whole structure of FEM is shown in Fig. 2, which is inspired by RFB-s [52]. The difference is that FEM only has two branches with atrous convolution. Each branch performs a 1 × 1 convolution operation on the input feature map to preliminarily adjust the number of channels for subsequent processing. The first branch is a residual structure, which forms an identity mapping to retain the critical feature information of small objects. The other three branches perform cascaded standard convolution operations, whose kernel sizes are 1 × 3, 3 × 1, and 3 × 3, respectively. Additional atrous convolution layers are added to the middle two branches, so that the extracted feature maps can retain more context information.
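Because Fig. 2 is not reproduced here, the following PyTorch sketch only illustrates the multibranch idea described above; the per-branch widths, the dilation rate, and the exact placement of the residual connection are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn


def conv_bn_silu(c_in, c_out, k, d=1):
    """Conv + BatchNorm + SiLU; k may be an int or (kh, kw); padding keeps size."""
    k = (k, k) if isinstance(k, int) else k
    p = (d * (k[0] - 1) // 2, d * (k[1] - 1) // 2)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=p, dilation=d, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU())


class FEMSketch(nn.Module):
    """Hypothetical four-branch FEM: a 1x1 branch, two branches with asymmetric
    (1x3 / 3x1) convolutions followed by dilated 3x3 convolutions, and one plain
    3x3 branch, concatenated, projected back to c channels, and added to the input."""

    def __init__(self, c: int, dilation: int = 3):
        super().__init__()
        b = c // 4  # per-branch width (assumption)
        self.branch0 = conv_bn_silu(c, b, 1)
        self.branch1 = nn.Sequential(conv_bn_silu(c, b, 1), conv_bn_silu(b, b, (1, 3)),
                                     conv_bn_silu(b, b, 3, d=dilation))
        self.branch2 = nn.Sequential(conv_bn_silu(c, b, 1), conv_bn_silu(b, b, (3, 1)),
                                     conv_bn_silu(b, b, 3, d=dilation))
        self.branch3 = nn.Sequential(conv_bn_silu(c, b, 1), conv_bn_silu(b, b, 3))
        self.fuse = conv_bn_silu(4 * b, c, 1)

    def forward(self, x):
        y = torch.cat([self.branch0(x), self.branch1(x),
                       self.branch2(x), self.branch3(x)], dim=1)
        return self.fuse(y) + x  # residual path retains small-object details
```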
maps X_2 (160 × 160) and X_3 (80 × 80) processed by FEM and the high-level feature map X_4 (40 × 40) processed by SPPF. The top–down strategy of FFM is as follows. First, CSPBlock is applied to X_4 to get X_4'; then X_4' is upsampled to obtain a feature map with the same scale as X_3, and CRC is used to fuse them together. The fused feature map is processed by CSPBlock to get X_3'. The above operations are repeated on X_3' to create a new feature map X_2'. X_2', X_3', and X_4' realize the flow of semantic information from deep to shallow. The bottom–up process is similar to the top–down one, with the main difference being that the feature map is downsampled using a convolution with a stride of 2. X_3'' is obtained through the CRC of X_3, X_3', and X_2'. This operation can fuse more features without increasing much cost. X_2', X_3'', and X_4'', as the output results of FFM, are sent to SCAM for context information extraction. The calculation process of FFM can be expressed as follows:

X_2' = \mathrm{CSP}\{\mathrm{CRC}[f_{up}^{2\uparrow}(\mathrm{CBS}(X_3')),\; X_2]\}   (5)

X_3'' = \mathrm{CSP}\{\mathrm{CRC}[\mathrm{CBS}(X_3),\; X_3',\; \mathrm{CBS}(X_2', \mathrm{stride}=2)]\}   (6)

X_4'' = \mathrm{CSP}\{\mathrm{CRC}[X_4',\; \mathrm{CBS}(X_3'', \mathrm{stride}=2)]\}   (7)

where f_{up}^{2\uparrow} represents the upsampling operation and CBS denotes a 3 × 3 convolution including batch normalization and SiLU.

Compared with BiFPN, FFM improves the fusion strategy of multiscale feature maps by reweighting channels. The fusion strategy of BiFPN [29] operates between feature maps, which causes different channels to share the same weight. In order to strengthen the representation of small objects from multiscale features and fully utilize the features of different channels, the proposed CRC reweights the channels of the feature map, as shown in the lower half of Fig. 3.

We design three strategies for reweighting channels. The first strategy uses a channel attention mechanism similar to SENet [39] or ECANet [53] to reweight channels, as in formula (8). This strategy is feasible but increases the computational cost and parameter count significantly. The second strategy first concatenates the feature maps and then multiplies them by normalized trainable weights whose number equals the total number of channels, as shown in formula (9). The third strategy further considers the semantic gap between different feature maps; it first reweights the channels within each feature map and then reweights the different feature maps, as shown in formula (10):

\mathrm{Output} = \mathrm{Attention}(X) \cdot X   (8)

\mathrm{Output} = \sum_j \frac{\omega_j}{\varepsilon + \sum_m \omega_m} \cdot x_j   (9)

\mathrm{Output} = \sum_i \sum_j \frac{\omega_i}{\varepsilon + \sum_k \omega_k} \cdot \frac{\omega_j}{\varepsilon + \sum_{m_i} \omega_{m_i}} \cdot x_j   (10)

where Attention(·) represents the channel attention mechanism, such as SENet or ECANet, ω_i represents the trainable weight of the ith feature map, ω_j represents the trainable weight of the jth channel, m_i is the number of channels in the ith feature map, m represents the total number of channels after concatenation, and ε is set to 0.0001 to avoid numerical instability. According to the results of the ablation experiments in Section IV-D, all three strategies improve the performance, but the difference between the second and third strategies is not significant. As a result, we select the second strategy in FFM for feature reweighting. The structure of FFM and its channel reweighting strategy optimize the fusion process of multiscale semantic information for small objects, which provides more effective feature maps for subsequent global context modeling.
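A minimal sketch of how the selected second strategy, formula (9), could be realized is given below: the incoming feature maps are concatenated along the channel axis and each channel is scaled by a trainable weight normalized over all channels with ε = 0.0001. Whether the weights are constrained to be non-negative (as in BiFPN's fast normalized fusion) and how the reweighted tensor feeds the following CSPBlock are not specified above, so those details are assumptions.

```python
import torch
import torch.nn as nn


class CRCFusion(nn.Module):
    """Sketch of the second reweighting strategy (formula (9)): concatenate the
    incoming feature maps, then scale every channel by its normalized trainable weight."""

    def __init__(self, total_channels: int, eps: float = 1e-4):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(total_channels))  # one weight per channel
        self.eps = eps

    def forward(self, *features: torch.Tensor) -> torch.Tensor:
        x = torch.cat(features, dim=1)                  # (B, total_channels, H, W)
        w = self.weight / (self.eps + self.weight.sum())
        return x * w.view(1, -1, 1, 1)                  # channel-wise reweighting


if __name__ == "__main__":
    crc = CRCFusion(total_channels=96)
    out = crc(torch.randn(1, 64, 80, 80), torch.randn(1, 32, 80, 80))
    print(out.shape)  # torch.Size([1, 96, 80, 80])
```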
D. Spatial Context Aware Module (SCAM)

After FEM and FFM, the feature maps have already taken local contextual information into account and represent small object features well. Modeling the global relationship between small objects and backgrounds at this stage is more effective than in the backbone. Global context information can be used to represent the relationship between pixels across space, which suppresses useless background and enhances the discrimination between objects and backgrounds. Inspired by GCNet [14] and SCP [38], SCAM consists of three branches. The first branch uses GAP and GMP to integrate global information. The second branch uses a 1 × 1 convolution to generate a linear transform of the feature map, which is named value [54] in Fig. 4. The third branch uses a 1 × 1 convolution to simplify the multiplication of query and key; this convolution is named QK in Fig. 4. Subsequently, the first and third branches are matrix multiplied with the second branch, separately. The two resulting branches represent contextual information across channels and space, respectively. Finally, the output of SCAM is obtained by applying a broadcast Hadamard product to these two branches. The structure of SCAM is shown in Fig. 4. In each layer, the pixelwise spatial context can be expressed as follows:

Q_i^j = P_i^j + a_i^j \cdot \sum_{j=1}^{N_i} \frac{\exp(\omega_{qk} P_i^j)}{\sum_{n=1}^{N_i} \exp(\omega_{qk} P_i^n)} \, \omega_v P_i^j   (11)

a_i^j = \frac{\exp\big([\mathrm{avg}(P_i); \max(P_i)]\, P_i^j\big)}{\sum_{n=1}^{N_i} \exp\big([\mathrm{avg}(P_i); \max(P_i)]\, P_i^n\big)} \cdot \omega_v   (12)

where P_i^j and Q_i^j represent the input and output of the jth pixel in the i-level feature map, respectively, N_i denotes the total number of pixels, and ω_qk and ω_v are the linear transform matrices for projecting the feature maps, which are simplified by 1 × 1 convolutions. avg(·) and max(·) perform GAP and GMP, respectively. GAP and GMP can guide the feature map to select channels with significant information, which enables SCAM to learn context information along the channel dimension.
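The following PyTorch sketch approximates the three-branch structure and formulas (11) and (12): the QK branch aggregates spatial context per channel (as in GCNet), the GAP/GMP branch produces a pixel-wise attention map, and the two context terms are combined with a broadcast Hadamard product and added back to the input. The pooled statistics are summed here rather than concatenated, and the residual scaling of formula (11) is simplified, so every detail should be read as an assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCAMSketch(nn.Module):
    """Hedged sketch of SCAM with a GAP/GMP branch, a 1x1 'value' branch,
    and a 1x1 'QK' branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, 1)
        self.qk = nn.Conv2d(channels, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        v = self.value(x).view(b, c, h * w)                     # (B, C, N)

        # spatial branch: softmax over positions, aggregate value per channel
        attn = F.softmax(self.qk(x).view(b, 1, h * w), dim=-1)  # (B, 1, N)
        channel_ctx = torch.bmm(v, attn.transpose(1, 2)).view(b, c, 1, 1)

        # channel branch: GAP/GMP-pooled query scores every pixel of the value map
        pooled = (F.adaptive_avg_pool2d(x, 1) + F.adaptive_max_pool2d(x, 1)).view(b, 1, c)
        spatial_ctx = F.softmax(torch.bmm(pooled, v), dim=-1).view(b, 1, h, w)

        # broadcast Hadamard product of the two context terms, residual connection
        return x + channel_ctx * spatial_ctx
```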
E. Lite-FFCA-YOLO (L-FFCA-YOLO)

A qualified lightweight model needs to strike a balance among parameter count, speed, and accuracy. FasterNet found that the main reason for the low FLOPs of DWConv is its frequent redundant memory access, which actually leads to a decrease in speed. To alleviate this phenomenon, FasterNet uses PConv, which considers the redundancy in feature maps [51] and applies standard convolution to only a portion of the input channels while leaving the rest untouched.
TABLE I
Parameter Counts of FFCA-YOLO and L-FFCA-YOLO in Backbone
Fig. 6. Ground truth annotations in UNICORN2008 and USOD. The red bounding boxes are the original annotated instances in UNICORN2008, while the bounding boxes with green corner points are the manual annotation supplements for USOD. (a) Original annotation. (b) Manual annotation. (c) Original annotation. (d) Manual annotation.
TABLE II
Comparison Experiments for FFCA-YOLO in VEDAI
Fig. 9. The detection results of FFCA-YOLO in USOD, VEDAI, and AI-TOD for typical scenarios, such as ports, highways, and buildings. (a) Results in
USOD dataset. (b) Results in VEDAI dataset. (c) Results in AI-TOD dataset.
TABLE III
Comparison Experiments for FFCA-YOLO in AI-TOD

TABLE IV
Comparison Experiments for FFCA-YOLO in USOD
hyperparameters, FFCA-YOLO has a smaller parameter count and higher performance compared with the benchmark methods. L-FFCA-YOLO reduces the parameter count by about 30% compared with FFCA-YOLO (from 7.12 to 5.04 M) while showing no significant decline in accuracy metrics. Fig. 10 shows the detection results of YOLOv5m, TPH-YOLO, and
Fig. 10. Detection results of YOLOv5m, TPH-YOLO, and FFCA-YOLO for low illumination and shadow occlusion scenes. The red bounding boxes represent the detection boxes output by the models, while the yellow circles represent missed detections.
TABLE V
Ablation Experiments for FEM, FFM, and SCAM in USOD
FFCA-YOLO in low illumination and shadow occlusion scenes. In the low illumination scene, the grayscale values of the objects and the background are close to each other, causing YOLOv5m and TPH-YOLO to have missed detections. In the occlusion scene, one small object is located in the shade of a tree, causing YOLOv5m to have a missed detection.

D. Ablation Experimental Result

To analyze the importance of each component in FFCA-YOLO, we progressively applied FEM, FFM, and SCAM to the baseline to verify their effectiveness. The ablation experiments were conducted on the USOD dataset. Table V shows the impact of adding or removing each module on the evaluation metrics, where √ represents using the module and × represents not using the module.

1) FEM: As shown in Table V, adding FEM obviously improves all evaluation metrics, especially in terms of precision (from 0.9 to 0.926) and mAPs (from 0.303 to 0.335). This confirms that FEM makes it easier for the model to distinguish small objects from backgrounds. To further validate this conclusion, we visualize the feature maps before and after FEM in Fig. 11. A brighter color represents that the model pays more attention to that area. Because FEM enriches the local contextual features, the network shows good suppression of complex backgrounds.

2) FFM: Table V shows that adding FFM improves all evaluation metrics, especially in terms of recall (from 0.826 to 0.837). In addition, we investigate the effects of different neck structures and the different fusion strategies for multiscale feature maps mentioned in Section III-C, as shown in Table VI. CRC_1, CRC_2, and CRC_3 represent the different channel reweighting strategies in formulas (8)–(10), respectively. It can be seen that the performance of CRC_2 and CRC_3 is significantly better in all aspects compared with BiFPN, and the performance difference between CRC_2 and CRC_3 is relatively small (mAP50:95 of CRC_2 is 0.003 higher than that of CRC_3). As a result, CRC_2 is selected as the channel reweighting strategy in FFM.

3) SCAM: Table V shows the performance improvement from adding SCAM, which improves all evaluation metrics. Table VII shows the comparison between SCAM and some typical baseline methods; SCAM achieves better performance on all evaluation metrics. Fig. 11 shows the impact of SCAM on the feature maps. Compared with the feature maps output by FEM, the same-level feature maps of SCAM further enhance the feature representation of small objects and suppress the
Fig. 11. Influence of FEM and SCAM on feature extraction. The brighter color represents that the model pays more attention to that area.
TABLE VI
Comparison Experiments for FFM in USOD
backgrounds. Through the above analysis of the ablation experiments, it can be concluded that the proposed modules FEM, FFM, and SCAM all steadily improve the performance of FFCA-YOLO without any conflicts.
TABLE VII
Comparison Experiments for SCAM in USOD

TABLE VIII
Robustness Experiments for FFCA-YOLO and YOLOv5m in USOD
[9] Q. Ran, Q. Wang, B. Zhao, Y. Wu, S. Pu, and Z. Li, "Lightweight oriented object detection using multiscale context and enhanced channel attention in remote sensing images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 5786–5795, 2021.
[10] B. Zhang et al., "Progress and challenges in intelligent remote sensing satellite systems," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 1814–1822, 2022.
[11] B. Vajsova, A. Walczynska, S. Bärisch, P. J. Åstrand, and S. Hain, "New sensors benchmark report on WorldView-4," Publications Office Eur. Union, Luxembourg, Tech. Rep. EUR 28761 EN, 2017.
[12] R. Trautner and R. Vitulli, "Ongoing developments of future payload data processing platforms at ESA," in Proc. On-Board Payload Data Compress. Workshop (OBPDC), 2010.
[13] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[14] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
[15] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "Global context networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 6881–6895, Jun. 2023.
[16] S. Q. Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017, doi: 10.1109/TPAMI.2016.2577031.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
[18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[20] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7464–7475.
[21] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[22] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2778–2788.
[23] M. Wang et al., "FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection," J. Vis. Commun. Image Represent., vol. 90, Feb. 2023, Art. no. 103752.
[24] L. Shen, B. Lang, and Z. Song, "CA-YOLO: Model optimization for remote sensing image object detection," IEEE Access, vol. 11, pp. 64769–64781, 2023.
[25] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[26] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768.
[27] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7036–7045.
[28] S. Liu, D. Huang, and Y. Wang, "Learning spatial fusion for single-shot object detection," 2019, arXiv:1911.09516.
[29] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10781–10790.
[30] C. Guo, B. Fan, Q. Zhang, S. Xiang, and C. Pan, "AugFPN: Improving multi-scale feature learning for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12595–12604.
[31] Z. Liu, G. Gao, L. Sun, and Z. Fang, "HRDNet: High-resolution detection network for small objects," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2021, pp. 1–6.
[32] G. Cheng et al., "Feature enhancement network for object detection in optical remote sensing images," J. Remote Sens., vol. 1, p. 14, 2021, doi: 10.34133/2021/9805389.
[33] K. Zhang and H. Shen, "Multi-stage feature enhancement pyramid network for detecting objects in optical remote sensing images," Remote Sens., vol. 14, no. 3, p. 579, Jan. 2022.
[34] R. Liu et al., "RAANet: A residual ASPP with attention framework for semantic segmentation of high-resolution remote sensing images," Remote Sens., vol. 14, no. 13, p. 3109, Jun. 2022.
[35] Y. Li, Z. Cheng, C. Wang, J. Zhao, and L. Huang, "RCCT-ASPPNet: Dual-encoder remote image segmentation based on transformer and ASPP," Remote Sens., vol. 15, no. 2, p. 379, Jan. 2023.
[36] W. Chen, S. Ouyang, W. Tong, X. Li, X. Zheng, and L. Wang, "GCSANet: A global context spatial attention deep learning network for remote sensing scene classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 1150–1162, 2022.
[37] Y. Zhou et al., "BOMSC-Net: Boundary optimization and multi-scale context awareness based building extraction from high-resolution remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1–17, 2022, Art. no. 5618617, doi: 10.1109/TGRS.2022.3152575.
[38] Y. Liu, H. Li, C. Hu, S. Luo, Y. Luo, and C. Wen Chen, "Learning to aggregate multi-scale context for instance segmentation in remote sensing images," 2021, arXiv:2111.11057.
[39] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[40] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[41] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2736–2744.
[42] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1389–1397.
[43] S. Guo, Y. Wang, Q. Li, and J. Yan, "DMCP: Differentiable Markov channel pruning for neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1536–1544.
[44] J. Chang, Y. Lu, P. Xue, Y. Xu, and Z. Wei, "Automatic channel pruning via clustering and swarm intelligence optimization for CNN," Appl. Intell., vol. 52, pp. 17751–17771, Apr. 2022.
[45] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[46] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848–6856.
[47] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1580–1589.
[48] L. Huyan et al., "A lightweight object detection framework for remote sensing images," Remote Sens., vol. 13, no. 4, p. 683, Feb. 2021.
[49] J. Yi, Z. Shen, F. Chen, Y. Zhao, S. Xiao, and W. Zhou, "A lightweight multiscale feature fusion network for remote sensing object counting," IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–13, 2023, Art. no. 5902113, doi: 10.1109/TGRS.2023.3238185.
[50] J. Liu, R. Liu, K. Ren, X. Li, J. Xiang, and S. Qiu, "High-performance object detection for optical remote sensing images with lightweight convolutional neural networks," in Proc. IEEE 22nd Int. Conf. High Perform. Comput. Commun., IEEE 18th Int. Conf. Smart City, IEEE 6th Int. Conf. Data Sci. Syst. (HPCC/SmartCity/DSS), Dec. 2020, pp. 585–592.
[51] J. Chen et al., "Run, don't walk: Chasing higher FLOPS for faster neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 12021–12031.
[52] S. Liu and D. Huang, "Receptive field block net for accurate and fast object detection," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 385–400.
[53] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11534–11542.
[54] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 5998–6008.
[55] S. Razakarivony and F. Jurie, "Vehicle detection in aerial imagery: A small target detection benchmark," J. Vis. Commun. Image Represent., vol. 34, pp. 187–203, Jan. 2016.
[56] J. Wang, W. Yang, H. Guo, R. Zhang, and G.-S. Xia, "Tiny object detection in aerial images," in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 3791–3798.
[57] J. Wang, C. Xu, W. Yang, and L. Yu, "A normalized Gaussian Wasserstein distance for tiny object detection," 2021, arXiv:2110.13389.
[58] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755.
[59] G.-S. Xia et al., "DOTA: A large-scale dataset for object detection in aerial images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3974–3983.
[60] L. Colin et al. (2019). Unified Coincident Optical and Radar for Recognition (UNICORN) 2008 Dataset. [Online]. Available: https://github.com/AFRL-RY/data-unicorn-2008
[61] M. A. Momin, M. H. Junos, A. S. M. Khairuddin, and M. S. A. Talip, "Lightweight CNN model: Automated vehicle detection in aerial images," Signal, Image Video Process., vol. 17, no. 4, pp. 1209–1217, Jun. 2023.
[62] M.-T. Pham, L. Courtrai, C. Friguet, S. Lefèvre, and A. Baussard, "YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images," Remote Sens., vol. 12, no. 15, p. 2501, Aug. 2020.
[63] J. Zhang, J. Lei, W. Xie, Z. Fang, Y. Li, and Q. Du, "SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1–15, 2023, Art. no. 5605415, doi: 10.1109/TGRS.2023.3258666.
[64] F. Qingyun and W. Zhaokui, "Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery," Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108786.
[65] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6154–6162.
[66] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 10213–10224.
[67] M. Ma and H. Pang, "SP-YOLOv8s: An improved YOLOv8s model for remote sensing image tiny object detection," Appl. Sci., vol. 13, no. 14, p. 8161, Jul. 2023.
[68] G. Guo, P. Chen, X. Yu, Z. Han, Q. Ye, and S. Gao, "Save the tiny, save the all: Hierarchical activation network for tiny object detection," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 1, pp. 221–234, Jan. 2024, doi: 10.1109/TCSVT.2023.3284161.
[69] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," 2017, arXiv:1701.06659.
[70] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4203–4212.
[71] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[72] G. Yang, J. Lei, Z. Zhu, S. Cheng, Z. Feng, and R. Liang, "AFPN: Asymptotic feature pyramid network for object detection," 2023, arXiv:2306.15988.
[73] C. Li, Z. Li, X. Liu, and S. Li, "The influence of image degradation on hyperspectral image classification," Remote Sens., vol. 14, no. 20, p. 5199, Oct. 2022.
[74] K. He, J. Sun, and X. Tang, "Single image haze removal using dark channel prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2341–2353, Dec. 2010.

Yin Zhang received the B.Sc. degree from Jilin University, Changchun, China, in 2009, and the M.Sc. and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 2011 and 2016, respectively. He is currently an Associate Professor with the Nanjing University of Aeronautics and Astronautics, Nanjing, China. His main research interests include simulating and processing photoelectric detection information.

Mu Ye received the B.Sc. and M.Sc. degrees from the Shanghai University of Engineering Science, in 2019 and 2022, respectively. He is currently pursuing the D.Eng. degree with the Nanjing University of Aeronautics and Astronautics, Nanjing, China. His main research interests include space-based object detection and signal processing.

Guiyi Zhu received the B.Sc. degree from Xidian University, Xi'an, China, in 2015. She is currently pursuing the M.S. degree with the Nanjing University of Aeronautics and Astronautics, Nanjing, China. Her main research interests include object detection and classification.

Yong Liu received the B.Sc. and M.Sc. degrees from Air Force Aviation University, Changchun, China, in 2012 and 2014, respectively, and the Ph.D. degree from the National University of Defense Technology, Changsha, China, in 2018. He is currently a Research Assistant with the National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing, China. His main research interests include remote sensing data processing and information fusion.

Pengyu Guo received the master's degree in computer science and technology from the National University of Defense Technology, Changsha, China, in 2008, and the Ph.D. degree in aerospace science and technology from the National University of Defense Technology, in 2015. He has experience in working for the China Xi'an Satellite Control Center, Xi'an, China. He is currently an Associate Research Fellow with the National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing, China. His current research focus is to devise algorithms based on computer vision and machine learning to enable unmanned platforms' imaging systems for detection, tracking, recognition, and relative pose estimation.

Junhua Yan received the B.Sc., M.Sc., and Ph.D. degrees from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1993, 2001, and 2004, respectively. She is currently a Professor with the Nanjing University of Aeronautics and Astronautics. Her main research interests include image quality assessment, multisource information fusion, object detection, tracking, and recognition.