Article
Small Object Detection Based on Deep Learning for Remote
Sensing: A Comprehensive Review
Xuan Wang 1 , Aoran Wang 1 , Jinglei Yi 1 , Yongchao Song 1 and Abdellah Chehri 2, *
1 School of Computer and Control Engineering, Yantai University, Yantai 264005, China;
[email protected] (X.W.); [email protected] (A.W.); [email protected] (J.Y.);
[email protected] (Y.S.)
2 Department of Mathematics and Computer Science, Royal Military College of Canada,
Kingston, ON K7K 7B4, Canada
* Correspondence: [email protected]
Abstract: With the accelerated development of artificial intelligence, remote-sensing image technolo-
gies have gained widespread attention in smart cities. In recent years, remote sensing object detection
research has focused on detecting and counting small dense objects in large remote sensing scenes.
Small object detection, as a branch of object detection, remains a significant challenge in research due
to the image resolution, size, number, and orientation of objects, among other factors. This paper
examines object detection based on deep learning and its applications for small object detection in
remote sensing. This paper aims to provide readers with a thorough comprehension of the research
objectives. Specifically, we aggregate the principal datasets and evaluation methods extensively em-
ployed in recent remote sensing object detection techniques. We also discuss the irregularity problem
of remote sensing image object detection and overview the small object detection methods in remote
sensing images. In addition, we select small target detection methods with excellent performance in
recent years for experiments and analysis. Finally, the challenges and future work related to small
object detection in remote sensing are highlighted.
The DPM algorithm is an upgrade and extension of the HOG algorithm, providing a more
effective way to handle objects seen from multiple perspectives. However, because these
algorithms were designed primarily for pedestrian detection, their detection performance
on remote sensing images is limited.
In recent years, Convolutional Neural Networks (CNNs), feed-forward neural net-
works with a convolutional structure, have been widely used. This architecture effectively
reduces the memory footprint of deep neural networks. Three fundamental operations,
namely local perceptual fields, weight sharing, and pooling layers, reduce the number of
network parameters and alleviate model overfitting.
In general, CNNs contain several convolutional and pooling layers, arranged alternately:
one convolutional layer is followed by one pooling layer, and so on. Each neuron of the
output feature map in a convolutional layer is connected only to a local region of its input.
The connection weights are multiplied with the local inputs and summed, and a bias value
is added to obtain the neuron's input value.
The CNN is so named because this process is equivalent to a convolution operation;
a schematic diagram of CNN object detection is shown in Figure 1. With the development
of deep learning, a large number of deep learning-based object detection algorithms
have been proposed and have achieved remarkable results on remote sensing image
datasets. Object detection in remote sensing images has consequently emerged as a
significant area of research, with a multitude of experiments and studies conducted on
the subject.
Figure 1. Schematic diagram of a CNN object detector: Input → [Conv → ReLU] × 2 → Pool → [Conv → ReLU] × 2 → Pool → FC → Output.
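To make this alternating structure concrete, the following is a minimal PyTorch sketch of such a network; the channel widths, input size, and number of classes are illustrative assumptions rather than values taken from the figure.

```python
import torch
import torch.nn as nn

# Minimal sketch of the alternating Conv -> ReLU -> Pool structure described above;
# channel sizes, the 64x64 input, and the number of classes are illustrative assumptions.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local perceptual field
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # pooling halves spatial size
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # weights shared across positions
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # FC head for 64x64 inputs

    def forward(self, x):
        x = self.features(x)          # each output neuron sees only a local input region
        x = torch.flatten(x, 1)
        return self.classifier(x)

out = TinyCNN()(torch.randn(1, 3, 64, 64))  # -> shape (1, 10)
```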
As the pioneering work of object detection algorithms based on deep learning, Region-
based Convolutional Neural Networks (RCNNs) [14] successfully link convolutional neural
networks with object detection. However, because RCNNs consist of four parts—generating
candidate windows, feature extraction, SVM classification, and window regression—the de-
tection efficiency of the algorithm is relatively low. Based on this problem, subsequent
SPPNet [15], Fast RCNN [16], Faster RCNN [17], FPN [18], Mask RCNN [19], etc., im-
proved the shortcomings of the previous algorithm to enhance the detector performance.
With the introduction of detectors such as the YOLO series [20–23] and SSD [24], the per-
formance of object detection algorithms has been improved, and the technology has been
continuously developed.
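As a hedged usage sketch of a detector from this two-stage family (this uses torchvision's off-the-shelf Faster R-CNN in a recent torchvision release, not any specific configuration from the works cited above; the image path is a placeholder):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Run a pre-trained Faster R-CNN (ResNet-50 FPN backbone) on a single image.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("scene.png").convert("RGB"))   # placeholder path
with torch.no_grad():
    prediction = model([image])[0]        # dict with "boxes", "labels", "scores"

keep = prediction["scores"] > 0.5         # simple confidence threshold
print(prediction["boxes"][keep], prediction["labels"][keep])
```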
Several researchers have achieved some results in summarizing the overview of object
detection algorithms [25–29]. They mainly review the problems faced by high-resolution
object detection and the proposed methodological approaches, remote sensing image
datasets, and the performance of the leading detection methods at this time.
This paper provides an in-depth analysis of the remote sensing images and evaluation
metrics that are commonly used for object detection, which differs from the existing
literature. The article focuses on various categories of object detection techniques, the
constraints associated with remote sensing images, and the challenges caused by object
irregularities, along with the strategies for addressing them. Additionally, it explores
methods for detecting small objects in remote sensing imagery.
The applications of small object detection on remote sensing images, especially rotating
small objects, are summarized. We classify the existing processes into six categories based
on different technical bases, including more recent techniques within the last two years.
In addition, we re-measure the mean Average Precision (mAP), Floating Point Operations
(FLOPs), number of parameters (Params), and Frames Per Second (FPS) for six of the
best-performing methods, and these algorithms are evaluated on this basis.
In this article, we present a comprehensive review of object identification methods
and how those methods have been applied to remote sensing in recent years. In addition,
we give a great deal of focus to developing algorithms and applications for detecting small
objects. The overview of this paper is shown as Figure 2.
Figure 2. Overview of the paper: remote sensing images; irregular object detection in remote sensing images (non-axial feature learning, e.g., DRN, SCRDet++, ...); and small object detection in remote sensing images (multi-scale prediction, contextual information, data enhancement, image processing, network design).
(8) UCAS-AOD dataset [41]: This dataset contains 2819 car images and 3210 aircraft images.
(9) RSC11 dataset [42]: This dataset contains 11 similar scene classes, so the classification
of scenes becomes difficult.
The Horizontal Bounding Box (HBB) is commonly used in dataset labeling to represent
horizontally oriented objects, whereas arbitrarily rotated objects are typically annotated
with the Oriented Bounding Box (OBB).
The HBB is constrained to be axis-aligned, i.e., perpendicular to the coordinate axes.
This restriction prevents the box from tightly enclosing large objects that are tilted or
partially distorted. In contrast, the orientation and scale of the OBB are determined from
the object's shape, so the box is not necessarily perpendicular to the coordinate axes. The
resulting enclosing box is therefore more compact than the horizontal bounding box.
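The difference in compactness can be illustrated with a short sketch; here OpenCV's boundingRect and minAreaRect stand in for the HBB and OBB of a made-up, tilted point set:

```python
import cv2
import numpy as np

# Illustrative comparison of HBB vs. OBB areas for a tilted, rectangle-like outline.
points = np.array([[10, 40], [60, 10], [80, 45], [30, 75]], dtype=np.float32)

x, y, w, h = cv2.boundingRect(points)                  # HBB: axis-aligned box
(cx, cy), (ow, oh), angle = cv2.minAreaRect(points)    # OBB: rotated minimum-area box

print("HBB area:", w * h)      # larger, encloses empty corners around the tilted object
print("OBB area:", ow * oh)    # tighter fit following the object's orientation
```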
Regarding methods for constructing an OBB, the Principal Component Analysis (PCA)
method [43] is dominant. PCA is first applied to the point cloud: the center of mass is
computed, the covariance matrix is calculated, and its eigenvalues and eigenvectors are
obtained, where the eigenvectors correspond to the three principal directions. In the second
step, the input point cloud is transformed to the origin using the principal directions and
the center of mass.
With the principal directions aligned to the coordinate axes, an enclosing box is built
for the point cloud transformed to the origin. Finally, the principal directions and the
enclosing box are mapped back to the input point cloud by the inverse of this transformation,
yielding the final oriented bounding box.
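The following is a minimal NumPy sketch of this PCA procedure, reduced to a 2-D point set purely for illustration; the function and variable names are our own:

```python
import numpy as np

def pca_obb(points):
    """Oriented bounding box of a 2-D point set via PCA (illustrative sketch)."""
    center = points.mean(axis=0)                   # center of mass
    cov = np.cov((points - center).T)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvectors = principal directions
    aligned = (points - center) @ eigvecs          # rotate points into the principal frame
    mins, maxs = aligned.min(axis=0), aligned.max(axis=0)
    corners_aligned = np.array([[mins[0], mins[1]], [maxs[0], mins[1]],
                                [maxs[0], maxs[1]], [mins[0], maxs[1]]])
    return corners_aligned @ eigvecs.T + center    # map the box back to the input frame

box_corners = pca_obb(np.random.rand(100, 2))      # four OBB corners in input coordinates
```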
As shown in Table 1, the composition of a benchmark dataset, including the number
of objects, classes, instances, and annotation style, significantly affects the training and
testing of a model. The effective training of the model can be facilitated by using rich
instances, diverse classes, and a suitable annotation style. In Table 1, the classes of DIOR
and DOTA are 20 and 15, respectively, and their number of instances is much higher than
other datasets. In addition, the annotation style of OBB helps to improve the detection of
rotating objects. DIOR uses a combination of both HBB and OBB annotation styles, whereas
RSOD and NWPU VHR-10 solely employ the HBB annotation style. Furthermore, DOTA
incorporates all OBB. The distinctive characteristics of DIOR and DOTA differentiate them
from other remote sensing datasets.
Table 1. Commonly used remote sensing object detection datasets (images / classes / instances / annotation style / typical object categories).
DIOR [26]: 23,463 images, 20 classes, 192,472 instances, HBB + OBB; aircraft, stadiums, bridges, dams, ports, etc.
RSOD [30]: 976 images, 4 classes, 6950 instances, HBB; aircraft, oil drums, overpasses, sports fields.
NWPU VHR-10 [32]: 800 images, 10 classes, 3775 instances, HBB; aircraft, ships, stadiums, ports, bridges, etc.
DOTA [37]: 2806 images, 15 classes, 188,282 instances, OBB; aircraft, vehicles, stadiums, etc.
VEDAI [38]: 1210 images, 9 classes, 3640 instances, OBB; vehicles.
ITCVD [39]: 173 images, 1 class, 29,088 instances, OBB; vehicles.
UCAS-AOD [44]: 910 images, 2 classes, 6029 instances, HBB + OBB; airplanes, cars.
RSC11 [42]: 1213 images, 11 classes, scene-level labels; dense forests, grasslands, buildings, ports, etc.
(1) IoU: When the Intersection over Union between the detection box A and the ground
truth box B is greater than 0.5, the detected object is considered to be detected (a minimal
computation sketch of metrics (1)–(3) is given after this list). The intersection-over-union
ratio is defined as follows:
IoU = (A ∩ B) / (A ∪ B). (1)
(2) Precision: Precision represents the proportion of correct predictions among all samples
in the prediction result. When the intersection-over-union ratio is greater than the
threshold, the result is classified as True Positive (TP), and otherwise as False Positive
(FP). If the detector does not detect an object labeled in the ground truth, that object
is counted as a False Negative (FN). Precision is defined as follows:
Precision = True Positive / (True Positive + False Positive) = True Positive / All Observations. (2)
(3) Recall: Recall indicates the proportion of positive samples that the model recovers out
of all positive samples, and is an important indicator of whether the model has
"found everything". Recall is defined as:
Recall = True Positive / (True Positive + False Negative) = True Positive / All Ground Truth. (3)
(4) AP [45]: Average Precision is the precision averaged over recall values in [0, 1]. The higher the
AP value, the better the detector's detection performance for a certain type of object
in the dataset. Average Precision is defined as follows:
AP_u = (1 / |Ω_u|) ∑_{i∈Ω_u} ( ∑_{j∈Ω_u} δ(p_{uj} < p_{ui}) + 1 ) / p_{ui}, (4)
where Ω_u denotes the Ground Truth set, p_{uj} denotes the rank position of object j,
and p_{uj} < p_{ui} denotes that object j is ranked before object i in the result list.
(5) mAP [45]: mAP averages the Average Precision of each class of objects detected by the
detector. Higher mAP values indicate better detector performance over the entire
dataset. The mean Average Precision is defined as:
mAP = ( ∑_{u∈U} AP_u ) / |U|. (5)
(6) FPS: FPS is used to evaluate detection speed, i.e., the number of images that can be
processed per second. The higher the FPS, the faster the detection speed of the model.
(7) FLOPs: FLOPs refers to the number of floating-point operations, which can also be
interpreted as the amount of computation. The smaller the FLOPs, the lower the
complexity of the model.
(8) Params: Params represents the number of parameters required by the model. The smaller
the Params, the fewer parameters the model needs and the lighter it is.
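The sketch referenced under metric (1) is given below; the (x1, y1, x2, y2) box format and the greedy, score-ordered matching rule are simplifying assumptions on our part, not the exact evaluation protocol of any particular benchmark:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(detections, ground_truths, iou_thr=0.5):
    """Greedy matching of detections (highest score first) to ground truths at an IoU threshold.

    detections: list of {"box": (x1, y1, x2, y2), "score": float}
    ground_truths: list of (x1, y1, x2, y2)
    """
    matched, tp = set(), 0
    for det in sorted(detections, key=lambda d: -d["score"]):
        best = max(range(len(ground_truths)),
                   key=lambda i: iou(det["box"], ground_truths[i]), default=None)
        if best is not None and best not in matched and \
           iou(det["box"], ground_truths[best]) >= iou_thr:
            matched.add(best)
            tp += 1                                  # true positive: matched above threshold
    fp = len(detections) - tp                        # unmatched detections
    fn = len(ground_truths) - tp                     # missed ground truths
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall
```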
3. Object Detection
Object detection has been researched and refined for over two decades. Object identifi-
cation, considered one of the key directions and fundamental challenges of computer vision,
is currently developing along with the requirements of many applications. Throughout its
development history, object detection can be divided into two main periods, i.e., traditional
object detection algorithms and deep learning-based object detection algorithms.
Deep learning-based object detection algorithms are further divided into several
technical branches. A diagram of the method development history is shown in Figure 3.
This section mainly discusses the different branches of traditional and deep learning-based
object detection algorithms.
Figure 3. Development history of object detection methods. Traditional detection methods (2001–2012) include VJ Det, HOG Det, and DPM. Deep learning-based detection methods (2014–2022, starting with AlexNet-based detectors) split into one-stage detectors (anchor-based: YOLOv1–YOLOv7, SSD, RetinaNet; and anchor-free variants) and two-stage detectors (RCNN, Fast RCNN, Faster RCNN, FPN, Mask RCNN, Cascade RCNN, Grid RCNN, Libra RCNN, SSPNet, CenterMap, ReDet, DODet).
SPPNet has the same limitations as RCNN: its network cannot achieve end-to-end detection.
The schematic diagram of SPPNet is shown in Figure 5.
Figure 5. Schematic diagram of SPPNet: region proposals are pooled by spatial pyramid pooling into fixed-length feature vectors (e.g., 16 × 256-d and 4 × 256-d).
The Fast RCNN approach employs a softmax classifier as a means to address the issue
of classification synchronization. Additionally, RoI layers are used to facilitate the mapping
of multi-scale features, thereby addressing the challenge of scale variation. The multitask loss
function of Fast RCNN enables end-to-end training for multitask purposes. Detection is slow
due to the intricate algorithm employed by Fast RCNN for selecting candidate regions [16].
Faster RCNN inherits the advantages of Fast RCNN. It innovatively proposes using a
region selection network to extract candidate frames, which improves the computational
speed. However, it has inaccurate localization frames and cannot effectively identify
small objects [17].
FPN extracts multi-scale features of images by constructing feature pyramids at differ-
ent scales, which significantly improves the network accuracy. Because the network can
only be trained for a specific single resolution, it can be contradictory to the multi-scale in-
ference [18]. The Cascade RCNN approach employs a cascade detector to select thresholds
merit-based. The proposed solution effectively addresses the issue of overfitting that may
arise from implementing high thresholds. However, it should be noted that this approach
does not facilitate real-time detection [47].
R-FCN adds a position-sensitive score map to improve the sensitivity of the convo-
lutional network to object position. It solves the problem of object location insensitivity,
but there is no improvement in computational speed [48]. Mask RCNN solves the prob-
lem of simultaneously localizing, classifying, and segmenting objects. It introduces an
instance segmentation branch in order to achieve pixel-level object detection. However,
its detection speed falls short of real-time due to the high cost of
instance segmentation [19].
In extracting multi-scale objects, TridentNet differs from the multi-scale feature pyra-
mid of FPN. It uses a multi-branch structure with different perceptual fields and shares
multi-branch structure weights, improving detection accuracy. However, it cannot perform
real-time detection due to its slow detection speed [49].
The one-stage detector can obtain the final detection result directly after only one stage,
which is faster than the two-stage detector. Its flow chart is shown in Figure 6. The YOLO
series, which is a one-stage detector, has been evolving.
YOLOv1 is the first to turn the object detection problem into a regression problem.
It has a simpler network structure and fast detection speed. However, its object localization
accuracy is relatively low, and when the object is small, or there are multiple objects,
the detection effect of YOLOv1 is poor [20].
YOLOv2 further improves detection accuracy and detection speed. However, it does
not overcome the limitations of YOLOv1 [21]. To address the issue of insufficient detection
of small objects, YOLOv3 employs a multi-scale feature map extraction method and an
improved classification network. It is less effective, however, at detecting medium and
large objects [22].
YOLOv4 uses Mosaic and self-adversarial training strategies for data enhancement. It
integrates FPN, PAN, and other modules to further improve model performance [23]. YOLOv5
performs slightly worse than YOLOv4, but it is flexible, fast, and better suited to rapid
model deployment. YOLOv6 further improves accuracy and speed, achieving the
highest accuracy so far among real-time detectors [50].
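As a hedged usage example, a pre-trained YOLOv5 model can be loaded through torch.hub and applied to a remote sensing tile as follows; this assumes the publicly released ultralytics/yolov5 hub entry point, and "tile.jpg" is a placeholder path:

```python
import torch

# Load a small pre-trained YOLOv5 model via torch.hub (downloads on first run).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.conf = 0.25                      # confidence threshold for reported detections

results = model("tile.jpg")            # placeholder remote sensing tile
results.print()                        # summary of detected classes and counts
boxes = results.xyxy[0]                # tensor of (x1, y1, x2, y2, conf, class) rows
```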
SSD [24] uses multi-scale feature map extraction and convolutional feature detection.
It is faster and has higher accuracy. However, SSD relies more on manual experience and
requires the manual setting of parameters for pre-selected boxes. Therefore, SSD has poor
detection accuracy for small objects and multiple objects.
RetinaNet [51] uses the Focal loss function, which solves the problem of category
imbalance. However, it cannot perform real-time detection and has poor detection results
for small and multiple objects.
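The focal loss down-weights easy, well-classified examples so that training concentrates on hard ones. A minimal binary-classification sketch is shown below; the alpha = 0.25 and gamma = 2 values follow commonly used defaults and are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: scales cross-entropy by (1 - p_t)^gamma to focus on hard examples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```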
EfficientDet [52] proposes a weighted bidirectional feature pyramid network. It is
simpler and faster in multi-scale feature fusion. EfficientDet proposes a composite scaling
method that simultaneously scales the backbone network’s resolution, depth, and width.
However, it has a slower detection speed.
Center-based anchor-free algorithms detect the central area and the boundary of the object.
Representative algorithms include CenterNet [53], FCOS [54], and TTFNet [55], to name a few.
For example, CenterNet [53] uses a keypoint detection algorithm. It treats detection
objects as points and uses center pooling and cascaded corner point pooling. CenterNet is
unsuitable for small object and multi-object detection due to the computationally intensive
nature of the model.
FCOS uses a fully convolutional network to perform regression operations on the
distance from each location of the feature map to the border. Similar to the principle of FCN,
it treats each position of each point as a training sample. Compared with the Anchor-based
algorithm, FCOS [54] saves a significant amount of memory space during training, which
is suitable for instance segmentation.
TTFNet can be seen as an improved version of CenterNet. It uses an elliptical Gaussian
kernel to generate negative-sample supervision signals and sampling regions around the
centroid. While maintaining performance, TTFNet [55] reduces the preprocessing
operations on the data, thus improving the learning efficiency and the quality of the
supervision signal.
Key point-based algorithms are also called corner point-based algorithms. The detection
box is formed from the object's top-left and bottom-right corner points. Representative
algorithms include CornerNet and ExtremeNet, among others. They are prone to false
positives due to the lack of information within the object [56].
Compared with CornerNet, ExtremeNet [57] uses five key points: the object's top, bottom,
left, right, and center points. ExtremeNet extracts local information with less noise
and more robust features, enabling better detection performance.
Guo et al. [87] introduced a convex hull representation to train the perception of the
shape and distribution of irregular objects. Learnable feature adaptation is also used to
avoid feature confounding.
DRN [73] uses a feature selection module to aggregate non-axisymmetric object infor-
mation of different shapes, directions, and core sizes. It also uses a dynamic filter generator
to regress this information. The above methods are aimed at improving the detection
performance of non-axisymmetric features.
The image pyramid is a classical method of constructing multi-scale features. The technique
scales the image to different resolutions and then extracts features separately at each
resolution. It uses a sliding-window-based approach to detect objects, detecting small
objects at the bottom (high-resolution) levels of the pyramid.
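A minimal image pyramid sketch (using OpenCV; the number of levels is arbitrary) illustrates the multi-resolution inputs described above:

```python
import cv2

def image_pyramid(image, levels=4):
    """Return a list of progressively half-resolution copies of the input image."""
    pyramid = [image]
    for _ in range(levels - 1):
        image = cv2.pyrDown(image)     # Gaussian blur + 2x downsampling
        pyramid.append(image)
    return pyramid

# Detection then slides a fixed-size window over every level; small objects are
# resolvable on the high-resolution (bottom) levels, large objects on coarser levels.
```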
For example, MTCNN [93] uses this idea and better recognizes small objects. However,
its detection time is relatively long due to the need to extract features of multiple resolutions.
With the development of deep learning techniques, CNN multi-scale feature extraction
replaced image pyramids, and SSD [24] was proposed. However, it was found that SSD is
not effective for small objects during detection.
Aiming at the problem of a single small object feature layer in SSD, DSSD [94] uses
Resnet-101 as the backbone network for extracting features, which combines the semantic
information of the higher-level features with the bottom-level information. This results in
richer semantic features and better detection in the small object layer.
FPN is similar to the idea of DSSD. Its bottom-up and top-down branching fully
integrates the high-level and bottom features, making each layer feature rich in semantic
information, which is beneficial for small object detection. PANet [95] improves on the
FPN by using fewer convolutional layers to build the path enhancement module, which
can retain more information on the underlying layers. It adds an adaptive feature pooling
module to make the region of interest contain multiple layers of features, further improving
the performance of small object detection.
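A minimal sketch of this top-down, FPN-style fusion is given below; the channel counts and the three-level setup are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down fusion of backbone features C3-C5 into P3-P5 (illustrative sketch)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                       # feats: [C3, C4, C5], low to high level
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: upsample and add
            laterals[i] = laterals[i] + F.interpolate(laterals[i + 1],
                                                      scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P3, P4, P5]

feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)]
p3, p4, p5 = TinyFPN()(feats)
```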
FPN introduces information from other layers, causing conflicts when detecting in a
single layer. To address this problem, ASFF proposes an adaptive spatial feature fusion
approach. It uses a learning weight approach to fuse the features of each layer for the final
detection, which further improves the small object detection performance.
Therefore, the authors of AugFPN [96] argue that FPN does not take into account
the semantic differences between features at different levels. This makes the top-down
feature fusion process lose features at higher levels, resulting in regions of interest in each
layer without feature information from other layers. To this end, the AugFPN proposer
reduces semantic differences by adding the same supervision information to each layer
before feature fusion.
A residual structure combines other layer features with the top-level features, which
enhances contextual information. In addition, by fusing the elements of the candidate
boxes pooled in different layers, it is ensured that the area of interest of each layer has the
feature information of the other layers. Its small object detection performance is further
improved. The current backbone networks used for feature extraction are trained on the
ImageNet dataset, while the COCO dataset is used for testing.
The authors of SNIP [97] concluded that the difference between the two datasets affects
the small object detection performance. During training, SNIP only calculates the gradients
of regions of interest close to the object scale in the ImageNet dataset. In this way, the scale
differences between different datasets are reduced.
For the problem of scale variation in object detection, the authors of TridentNet [49]
found that the perceptual field is positively correlated with the object scale. The larger
the perceptual field, the better the detection of large objects; the smaller the perceptual
field, the better the detection of small objects. The algorithm controls the perceptual
field by adjusting the dilation rate of dilated (atrous) convolutions. It generates three parallel
convolutional branches to detect objects at different scales and improves small object
detection performance.
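A minimal sketch of this shared-weight, multi-dilation idea is shown below; the dilation rates (1, 2, 3) and channel count are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TridentBranches(nn.Module):
    """Three parallel conv branches with different dilation rates but shared weights."""
    def __init__(self, channels=256, dilations=(1, 2, 3)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        self.bias = nn.Parameter(torch.zeros(channels))
        self.dilations = dilations

    def forward(self, x):
        # Each branch has a different receptive field (dilation) but identical parameters,
        # targeting small, medium, and large objects respectively.
        return [F.conv2d(x, self.weight, self.bias, padding=d, dilation=d)
                for d in self.dilations]

small, medium, large = TridentBranches()(torch.randn(1, 256, 32, 32))
```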
RTMDet [98] comprehensively improves the current single-stage object detector. It
uses CSPDarkNet as a baseline and performs multi-scale feature fusion using CSPPAFPN.
In terms of training strategy optimization, it uses a dynamic soft label assignment strategy
to make the matching results of classification cost more stable and accurate. In the data
enhancement stage, RTMDet introduces a caching mechanism, significantly improving
operation efficiency.
Gong et al. [99] improved small object detection performance by modifying the one-stage
detector YOLOv5. They added new feature fusion layers and detection heads from
shallow layers to maximize the retention of feature information. In addition, they replaced
the original convolutional prediction heads with Swin transformer prediction heads SPHs
to reduce the computational complexity. Finally, the normalization-based attention module
HAM was integrated into YOLOv5 to improve attention performance in a normalized manner.
Guan et al. [100] proposed a deep neural network (DNN) based on high-quality object
locations. Small object detection performance is improved by computing multiple layered
segmentations with superpixels to derive high-quality object locations and perform classification.
Fang et al. [101] proposed S2ANet-SR, an improved method based on S2ANet. The
model feeds both the original image and the restored image to the detection network,
designs a super-resolution enhancement module for the restored image to strengthen the
feature extraction of small objects, and proposes a perceptual loss function and a matching
texture loss as supervision. The feature network designs of some of these methods are
shown in Figure 7.
Figure 7. Feature network design. (P3–P7) Multi-scale features from level 3 to level 7; the different colored dots represent different feature layers. (a) FPN [18] uses top-down paths to fuse multi-scale features. (b) PANet [95] adds additional bottom-up paths to the FPN. (c) NAS-FPN [102] uses neural architecture search to obtain irregular feature network topologies, then applies the same blocks repeatedly. (d) BiFPN [52] introduces a feature fusion mechanism with weights to extract features, then uses the same blocks repeatedly.
These methods can promote detection accuracy by increasing the resolution of high-level
feature maps or by approximately transforming the feature representation of a small object
into that of a medium or large object. STDN [103] applies this idea, using a scale-transfer
module to increase resolution.
GAN-based PGAN [104] and SOD-MTGAN [105] inherit the generator and discrimina-
tor. Firstly, features containing enough small object information after the first convolution
layer are fed to the generator and are then enhanced by adding residual representation.
Secondly, the discriminative network has an adversarial branch and a perceptual branch.
The network is trained with instances of large objects first. The generator and discriminator
are trained in an iterative manner using a set of instances of both large and small objects,
to enhance the detection accuracy of small objects.
The GAN adversarial network framework diagram is shown in Figure 8. ViTAE-
B+RVSA_ORCN [106] uses the MAE [107] generative self-supervised pre-training method.
It extracts the image features of non-masked regions and predicts the image contents of
masked areas by an asymmetric network structure. The algorithm uses ViTAE as the
backbone network and replaces the MHSA module in Plain ViT with RVSA to adapt MAE
pre-training to remote sensing downstream tasks. Images generated by the Enhanced
Super Resolution GAN (ESRGAN) model, which is based on the Generative Adversarial
Network (GAN), usually miss high-frequency edge information. This can seriously affect
the detection of small objects in remote-sensing images. Inspired by this, the new edge-
enhanced super-resolution adversarial network (EESRGAN) [108] uses different detector
networks in an end-to-end manner to propagate detector loss directions into EESRGAN as
a way to improve detection performance.
Figure 8. GAN adversarial network framework: the generator maps a random variable to fake samples, while the discriminator receives true and fake samples and predicts true/fake.
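A highly simplified adversarial training step corresponding to Figure 8 is sketched below; the toy generator/discriminator and their dimensions are placeholders, not the architectures of PGAN, SOD-MTGAN, or EESRGAN:

```python
import torch
import torch.nn as nn

# Placeholder generator/discriminator on 1-D toy "samples"; real small object GANs
# operate on feature maps or image patches instead.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), 1e-3)
opt_d = torch.optim.Adam(D.parameters(), 1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)        # "true samples"
noise = torch.randn(8, 16)       # "random variable"

# Discriminator step: push real samples toward label 1 and generated fakes toward 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(noise).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator into labeling fakes as real.
g_loss = bce(D(G(noise)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```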
Compared with ResNet-50 [119], its small object detection performance is better.
Qiao et al. [120] proposed the DetectoRS algorithm, which feeds information from the FPN
layer to the backbone network. The recursive structural feature reuses the information
twice, greatly improving the small object detection performance.
DEA-net [121] proposed a dynamically improved anchor network to solve the issue of
small object labels being easily lost or mislabeled. In order to provide qualifying samples,
the network employs sample discriminators to carry out interactive sample screening
between anchored and unanchored units.
GGHL [122] is suitable for object detection in arbitrary directions. It uses an anchor-free
adaptive label assignment strategy (OLA), based on a two-dimensional oriented Gaussian
heat map, to define positive candidate objects. This enables adaptive fitting of the object
features after they are fed to the CNN for learning.
APE (Adaptive Period Embedding) is a method for representing oriented objects in
remote sensing images. The method exploits the angular periodicity of oriented objects:
the angle is represented by two two-dimensional feature vectors with different periods,
and these vectors remain continuous as the orientation changes.
CFC-Net [123], a critical feature capturing network, focuses on feature representation,
pre-defined anchors, and label assignment. The network constructs robust critical features
suitable for the respective tasks through a polarized attention module. It also extracts
discriminative regression features to refine the pre-defined anchors and uses a dynamic
anchor learning strategy to adaptively select high-quality anchors.
Li et al. proposed a novel backbone network, the Large Selective Kernel Network
(LSKNet) [124]. It can dynamically adjust its spatial receptive field to better model
the long-range context of the various objects in remote sensing scenes.
Pang et al. [125] proposed a unified self-reinforced network, R2CNN. The network
consists of a Tiny-Net backbone, an intermediate global attention module, and a classifier
and detector. As a lightweight residual structure, Tiny-Net allows fast extraction of
rich features from the input. The global attention module is used to suppress false positives.
The classifier predicts whether targets are present in each patch; if an object is present,
the detector is invoked to locate it. The classifier and detector are trained end-to-end to
further speed up detection and avoid false positives.
The TRD proposed by Li et al. [126] is a combination of CNN and a multilayer
transformer with an encoder and decoder. To detect objects in remote sensing images,
they designed an improved transformer module to aggregate multi-scale features and model the
interaction between instances. Considering the difference between the remote sensing
image dataset and the source dataset (ImageNet), they proposed the TRD with transmitted
CNN (T-TRD) based on the attention mechanism due to the limited samples in the remotely
sensed images and the large number of training samples required by the transformer.
To avoid overfitting, data enhancement in the model is combined with the transformer to
improve the detection performance.
Taken together, LSKNet-S* has higher mAP, APL, APM, and APS than AO2-DETR and Oriented
RepPoints using R-101-FPN. Therefore, LSKNet-S* has the best detection performance.
The following conclusions can be reached through this comparison: (1) At present,
ResNet-FPN serves as the backbone network for most object detection methods used on
remote sensing images; analyzing their mAP shows that the performance of these methods
is relatively stable, i.e., at a medium level. (2) When comparing anchor-based with
anchor-free methods, the anchor-based methods perform better overall. (3) The newly
proposed LSKNet backbone network shows significant advantages on the DOTA dataset,
in terms of both per-category accuracy (AP) and mAP.
By analyzing the above-related information, we can observe that, on the one hand,
object detection methods on remotely sensing images are constantly evolving and improv-
ing in performance. On the other hand, the proposal of new backbone networks helps
to improve object detection performance significantly. Therefore, the design of backbone
networks can be a major focus of future research.
of image segmentation cause missed detections due to incomplete objects. Thus, it can be
seen that the object incompleteness problem caused by image segmentation can
significantly affect object detection performance.
Figure 10. Visualization results of different object detection methods on the DOTA dataset:
(a) LSKNet-S [124], (b) Oriented-RepPoints [136], (c) R3Det [74], (d) S2A-Net [84], (e) CSL [79],
(f) CFA [87].
In order to compare the above six methods more comprehensively, we measured the
FLOPs, Params, FPS, and mAP evaluation metrics of the six methods on our own partial
DOTA dataset. The specific data are shown in Table 4. In terms of the number of floating-
point operations, LSKNet-S [124] is the smallest, followed by Oriented-RepPoints [136],
CFA [87], S2A-Net [84], R3Det [74], and CSL [79], and in terms of the number of parameters,
LSKNet-S [124] is the smallest, followed by Oriented-RepPoints [136], CFA [87], R3Det [74],
CSL [79], and S2A-Net [84].
It can be seen from these two metrics that LSKNet-S [124] is the lightest method among
the six methods with its smaller number of parameters and computations. In terms of
the number of frames per second transmitted, Oriented-RepPoints [136] is the largest,
followed by LSKNet-S [124], R3Det [74], CFA [87], CSL [79], and S2A-Net [84]. This
metric shows that Oriented-RepPoints [136] has the fastest computation speed, followed
by LSKNet-S [124]. In terms of average precision mean value, LSKNet-S [124] is the largest,
followed by CFA [87], S2A-Net [84], Oriented-RepPoints [136], CSL [79], and R3Det [74],
in that order. LSKNet-S [124] tops the list with a very high average precision value.
By comparison, we conclude that LSKNet-S [124] has the best performance, followed by
Oriented-RepPoints [136], CFA [87], S2A-Net [84], R3Det [74], and CSL [79], in that order.
Figure 11. Visualization results of different object detection methods on the DOTA dataset:
(a) LSKNet-S [124], (b) Oriented-RepPoints [136], (c) R3Det [74], (d) S2A-Net [84], (e) CSL [79],
(f) CFA [87].
Figure 12. Visualization results of different object detection methods on the DOTA dataset:
(a) LSKNet-S [124], (b) Oriented-RepPoints [136], (c) R3Det [74], (d) S2A-Net [84], (e) CSL [79],
(f) CFA [87].
7. Conclusions
This study presents a comprehensive review of object detection methods, particularly
methods for detecting small objects. This article discusses the use of common datasets,
evaluation methodologies, various classification criteria, the limitations of remote sensing
images, and challenges related to detecting irregular objects. Furthermore, we discussed
the diverse applications of object detection techniques in remote sensing imagery.
Finally, although the research on object detection methods in remote sensing im-
ages has made significant progress in recent years, there are still many problems, such
as low model inference efficiency and unsatisfactory object detection results. Therefore,
we propose promising research directions, such as better applications of image process-
ing techniques, more efficient and lightweight backbone networks, and more reasonable
learning strategies.
We hope the review in this paper can help researchers gain a deeper understanding
of object detection methods, especially the application of small object detection methods
in remote sensing images. It is expected to promote the development and progress of
remote-sensing image technology.
Author Contributions: Conceptualization, X.W. and A.C.; software, A.W.; investigation, A.W. and
J.Y.; formal analysis, X.W.; writing—original draft preparation, X.W., A.W. and J.Y.; writing—review
and editing, A.C. and Y.S.; supervision, Y.S.; funding acquisition, X.W. and Y.S. All authors have read
and agreed to the published version of the manuscript.
Funding: This research was funded by the Natural Science Foundation of Shandong Province
(ZR2020QF108, ZR2022QF037).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets are available on Github at https://captain-whu.github.
io/DOTA/dataset.html, accessed on 10 April 2023.
Acknowledgments: We would like to thank the anonymous reviewers for their supportive comments
to improve our manuscript.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Bai, L.; Li, Y.; Cen, M.; Hu, F. 3D Instance Segmentation and Object Detection Framework Based on the Fusion of Lidar Remote
Sensing and Optical Image Sensing. Remote Sens. 2021, 13, 3288. [CrossRef]
2. Wei, Z.; Liu, Y. Deep Intelligent Neural Network for Medical Geographic Small-target Intelligent Satellite Image Super-resolution.
J. Imaging Sci. Technol. 2021, 65, 030406-1–030406-10. [CrossRef]
3. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep learning classification of land cover and crop types using remote
sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782. [CrossRef]
4. Pi, Y.; Nath, N.D.; Behzadan, A.H. Convolutional neural networks for object detection in aerial imagery for disaster response and
recovery. Adv. Eng. Inform. 2020, 43, 101009. [CrossRef]
5. Bashir, S.M.A.; Wang, Y. Deep learning for the assisted diagnosis of movement disorders, including isolated dystonia. Front.
Neurol. 2021, 12, 638266. [CrossRef]
6. Bashir, S.M.A.; Wang, Y. Small object detection in remote sensing images with residual feature aggregation-based super-resolution
and object detector network. Remote Sens. 2021, 13, 1854. [CrossRef]
7. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
8. Wang, X.; Han, T.X.; Yan, S. An HOG-LBP human detector with partial occlusion handling. In Proceedings of the 2009 IEEE 12th
International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 32–39.
9. Lin, C. Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In Proceedings of the 2006 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; pp. 886–893.
10. Divvala, S.K.; Efros, A.A.; Hebert, M. How important are “deformable parts” in the deformable parts model? In Proceedings of
the Computer Vision–ECCV 2012—Workshops and Demonstrations: Florence, Italy, 7–13 October 2012; Proceedings, Part III 12;
Springer: Berlin/Heidelberg, Germany, 2012; pp. 31–40.
11. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models.
IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [CrossRef] [PubMed]
12. Girshick, R.; Iandola, F.; Darrell, T.; Malik, J. Deformable part models are convolutional neural networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 437–446.
13. Ouyang, W.; Wang, X. Joint deep learning for pedestrian detection. In Proceedings of the IEEE International Conference on
Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 2056–2063.
14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef]
16. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13
December 2015; pp. 1440–1448.
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28. [CrossRef]
18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
19. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 386–397. [CrossRef]
20. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
21. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
22. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
23. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings,
Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
25. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep Learning-Based Object Detection Techniques for
Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 2385. [CrossRef]
26. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark.
ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [CrossRef]
27. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A survey and performance evaluation of deep learning methods for small object detection.
Expert Syst. Appl. 2021, 172, 114602. [CrossRef]
28. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for
fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130.
[CrossRef]
29. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and
object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [CrossRef]
30. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks.
IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [CrossRef]
31. Xiao, Z.; Liu, Q.; Tang, G.; Zhai, X. Elliptic Fourier transformation-based histograms of oriented gradients for rotationally
invariant object detection in remote-sensing images. Int. J. Remote Sens. 2015, 36, 618–644. [CrossRef]
32. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote
sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [CrossRef]
33. Dong, R.; Xu, D.; Zhao, J.; Jiao, L.; An, J. Sig-NMS-based faster R-CNN combining transfer learning for small target detection in
VHR optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8534–8545. [CrossRef]
34. Rasche, C. Land use classification with engineered features. IEEE Geosci. Remote Sens. Lett. 2021, 19, 2500805. [CrossRef]
35. Xu, K.; Huang, H.; Li, Y.; Shi, G. Multilayer feature fusion network for scene classification in remote sensing. IEEE Geosci. Remote
Sens. Lett. 2020, 17, 1894–1898. [CrossRef]
36. Xue, W.; Dai, X.; Liu, L. Remote sensing scene classification based on multi-structure deep features fusion. IEEE Access 2020,
8, 28746–28755. [CrossRef]
37. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object
detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–23 June 2018; pp. 3974–3983.
38. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image
Represent. 2016, 34, 187–203. [CrossRef]
39. Yang, M.Y.; Liao, W.; Li, X.; Rosenhahn, B. Deep learning for vehicle detection in aerial images. In Proceedings of the 2018 25th
IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3079–3083.
40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September
2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
41. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional
neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada,
27–30 September 2015; pp. 3735–3739.
42. Zhao, L.; Tang, P.; Huo, L. Feature significance-based multibag-of-visual-words model for remote sensing image scene classifica-
tion. J. Appl. Remote Sens. 2016, 10, 035004. [CrossRef]
43. Dimitrov, D.; Knauer, C.; Kriegel, K.; Rote, G. Bounds on the quality of the PCA bounding boxes. Comput. Geom. 2009, 42, 772–789.
[CrossRef]
44. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Yang, X. Sparse Label Assignment for Oriented Object Detection in Aerial Images. Remote
Sens. 2021, 13, 2664. [CrossRef]
45. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016,
117, 11–28. [CrossRef]
46. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training
sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 9759–9768.
47. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
48. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst.
2016, 29.
49. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6054–6063.
50. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023,
arXiv:2301.05586.
51. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
52. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
53. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
54. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
55. Liu, Z.; Zheng, T.; Xu, G.; Yang, Z.; Liu, H.; Cai, D. Training-Time-Friendly Network for Real-Time Object Detection. In
Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
56. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer
Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
57. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859.
58. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part
I 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
59. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You only look at one sequence: Rethinking transformer in
vision through object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197.
60. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted
windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 11–27 October
2021; pp. 10012–10022.
61. Kuang, X.; Sui, X.; Liu, Y.; Chen, Q.; Gu, G. Single infrared image enhancement using a deep convolutional neural network.
Neurocomputing 2019, 332, 119–128. [CrossRef]
62. Suzuki, K.; Horiba, I.; Sugie, N. Neural edge enhancer for supervised edge enhancement from noisy images. IEEE Trans. Pattern
Anal. Mach. Intell. 2003, 25, 1582–1596. [CrossRef]
63. Sreedhar, K.; Panlal, B. Enhancement of images using morphological transformation. arXiv 2012, arXiv:1203.2514.
64. Piao, Y.; Shin, I.; Park, H. Image resolution enhancement using inter-subband correlation in wavelet domain. In Proceedings of
the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16 September–19 October 2007; Volume 1,
pp. 1–445.
65. Wu, X.; Liu, M.; Cao, Y.; Ren, D.; Zuo, W. Unpaired learning of deep image denoising. In Proceedings of the Computer Vision–
ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV; Springer: Berlin/Heidelberg,
Germany, 2020; pp. 352–368.
66. He, K.; Sun, J.; Tang, X. Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1397–1409. [CrossRef]
67. Lev, B. Sharpening the intangibles edge. Harv. Bus. Rev. 2004, 6, 109–116.
68. Lin, C.Y.; Wu, M.; Bloom, J.A.; Cox, I.J.; Miller, M.L.; Lui, Y.M. Rotation, scale, and translation resilient watermarking for images.
IEEE Trans. Image Process. 2001, 10, 767–782. [CrossRef]
69. Lin, X.; Ma, Y.l.; Ma, L.z.; Zhang, R.l. A survey for image resizing. J. Zhejiang Univ. Sci. C 2014, 15, 697–716. [CrossRef]
70. Dhawan, S. A review of image compression and comparison of its algorithms. Int. J. Electron. Commun. Technol. 2011, 2, 22–26.
71. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered
and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27
October–2 November 2019; pp. 8232–8241.
72. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A Context-Aware Detection Network for Objects in Remote Sensing Imagery. IEEE Trans.
Geosci. Remote Sens. 2019, 57, 10015–10024. [CrossRef]
73. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.W.; Ma, C.; Xu, C. Dynamic Refinement Network for Oriented and Densely
Packed Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Seattle, WA, USA, 13–19 June 2020; pp. 11204–11213.
74. Yang, X.; Liu, Q.; Yan, J.; Li, A. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings
of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
75. Han, J.; Ding, J.; Xue, N.; Xia, G. ReDet: A Rotation-equivariant Detector for Aerial Object Detection. In Proceedings of
the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021;
pp. 2785–2794.
76. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the 2021 IEEE/CVF
International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3500–3509.
77. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented
Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1452–1459. [CrossRef]
78. Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. In Proceedings of the AAAI
Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2458–2466.
79. Yang, X.; Yan, J. Arbitrary-Oriented Object Detection with Circular Smooth Label. In Proceedings of the European Conference on
Computer Vision, Glasgow, UK, 23–28 August 2020.
80. Yang, X.; Hou, L.; Zhou, Y.; Wang, W.; Yan, J. Dense Label Encoding for Boundary Discontinuity Free Rotation Detection. In
Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19
June 2020; pp. 15814–15824.
81. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. RepPoints: Point Set Representation for Object Detection. In Proceedings of the
2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019;
pp. 9656–9665.
82. Zhang, J.; Lin, L.; Li, Y.; chen Chen, Y.; Zhu, J.; Hu, Y.; Hoi, S.C.H. Attribute-Aware Pedestrian Detection in a Crowd. IEEE Trans.
Multimed. 2019, 23, 3085–3097. [CrossRef]
83. Zhang, J.; Wu, X.; Zhu, J.; Hoi, S.C.H. Feature Agglomeration Networks for Single Stage Face Detection. arXiv 2017,
arXiv:1712.00721.
84. Han, J.; Ding, J.; Li, J.; Xia, G. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2020,
60, 5602511. [CrossRef]
85. Ding, J.; Xue, N.; Long, Y.; Xia, G.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In
Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA,
16–17 June 2019; pp. 2844–2853.
86. Yang, X.; Yan, J.; Yang, X.; Tang, J.; Liao, W.; He, T. SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level
Feature Denoising and Rotation Loss Smoothing. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 2384–2399. [CrossRef]
87. Guo, Z.; Liu, C.; Zhang, X.; Jiao, J.; Ji, X.; Ye, Q. Beyond Bounding-Box: Convex-hull Feature Adaptation for Oriented and Densely
Packed Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Nashville, TN, USA, 20–25 June 2021; pp. 8788–8797.
88. Zhang, X.; Wan, F.; Liu, C.; Ji, X.; Ye, Q. Learning to Match Anchors for Visual Object Detection. IEEE Trans. Pattern Anal. Mach.
Intell. 2019, 44, 3096–3109. [CrossRef]
89. Kim, K.; Lee, H.S. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. In Proceedings of the European
Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
90. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. OTA: Optimal Transport Assignment for Object Detection. In Proceedings of the 2021
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 303–312.
91. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection. arXiv 2020,
arXiv:2012.04150.
92. Chen, C.; Liu, M.Y.; Tuzel, O.; Xiao, J. R-CNN for small object detection. In Proceedings of the Computer Vision–ACCV 2016:
13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part V 13; Springer:
Berlin/Heidelberg, Germany, 2017; pp. 214–230.
93. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.
IEEE Signal Process. Lett. 2016, 23, 1499–1503. [CrossRef]
94. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659.
95. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
96. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-Scale Feature Learning for Object Detection. In
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19
June 2020; pp. 12592–12601.
97. Singh, B.; Davis, L.S. An Analysis of Scale Invariance in Object Detection—SNIP. In Proceedings of the 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3578–3587.
98. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing
Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784.
99. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-Transformer-Enabled YOLOv5
with Attention Mechanism for Small Object Detection on Satellite Images. Remote Sens. 2022, 14, 2861. [CrossRef]
100. Guan, Y.; Aamir, M.; Hu, Z.; Dayo, Z.A.; Rahman, Z.; Abro, W.A.; Soothar, P. An Object Detection Framework Based on Deep
Features and High-Quality Object Locations. Trait. Signal 2021, 38, 719–730. [CrossRef]
101. Xiaolin, F.; Fan, H.; Ming, Y.; Tongxin, Z.; Ran, B.; Zenghui, Z.; Zhiyuan, G. Small object detection in remote sensing images based
on super-resolution. Pattern Recognit. Lett. 2022, 153, 107–112. [CrossRef]
102. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045.
103. Zhou, P.; Ni, B.; Geng, C.; Hu, J.; Xu, Y. Scale-Transferrable Object Detection. In Proceedings of the 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 528–537.
104. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual Generative Adversarial Networks for Small Object Detection. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July
2017; pp. 1951–1959.
105. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network.
In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
106. Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing Plain Vision Transformer Towards Remote Sensing
Foundation Model. arXiv 2022, arXiv:2208.03987.
107. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R.B. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of
the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022;
pp. 15979–15988.
108. Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-object detection in remote sensing images with end-to-end
edge-enhanced GAN and object detector network. Remote Sens. 2020, 12, 1432. [CrossRef]
109. Tang, X.; Du, D.K.; He, Z.; Liu, J. PyramidBox: A Context-assisted Single Shot Face Detector. arXiv 2018, arXiv:1803.07737.
110. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation Networks for Object Detection. In Proceedings of the 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597.
111. Chen, X.; Gupta, A.K. Spatial Memory for Context Reasoning in Object Detection. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4106–4116.
112. Zhu, Y.; Zhao, C.; Wang, J.; Zhao, X.; Wu, Y.; Lu, H. CoupleNet: Coupling Global Structure with Local Parts for Object
Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October
2017; pp. 4146–4154.
113. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296.
114. Zoph, B.; Cubuk, E.D.; Ghiasi, G.; Lin, T.Y.; Shlens, J.; Le, Q.V. Learning Data Augmentation Strategies for Object Detection. In
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
115. Wang, N.; Gao, Y.; Chen, H.; Wang, P.; Tian, Z.; Shen, C. NAS-FCOS: Fast Neural Architecture Search for Object Detection. In Pro-
ceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June
2020; pp. 11940–11948.
116. Guan, Y.; Aamir, M.; Hu, Z.; Abro, W.A.; Rahman, Z.; Dayo, Z.A.; Akram, S. A Region-Based Efficient Network for Accurate
Object Detection. Trait. Signal 2021, 38, 481–494. [CrossRef]
117. Wang, T.; Anwer, R.M.; Cholakkal, H.; Khan, F.S.; Pang, Y.; Shao, L. Learning Rich Features at High-Speed for Single-Shot Object
Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea,
27 October–2 November 2019; pp. 1971–1980.
118. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. DetNet: A Backbone network for Object Detection. arXiv 2018, arXiv:1804.06215.
119. Li, H.; Wu, X. Infrared and Visible Image Fusion with ResNet and zero-phase component analysis. arXiv 2018, arXiv:1806.07119.
120. Qiao, S.; Chen, L.C.; Yuille, A.L. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution.
In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA,
20–25 June 2021; pp. 10208–10219.
121. Liang, D.; Geng, Q.; Wei, Z.; Vorontsov, D.A.; Kim, E.L.; Wei, M.; Zhou, H. Anchor Retouching via Model Interaction for Robust
Object Detection in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5619213. [CrossRef]
122. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection.
IEEE Trans. Image Process. 2022, 31, 1895–1910. [CrossRef]
123. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection
in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5605814. [CrossRef]
124. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection.
arXiv 2023, arXiv:2303.09030.
125. Pang, J.; Li, C.; Shi, J.; Xu, Z.; Feng, H. R2-CNN: Fast tiny object detection in large-scale remote sensing images. IEEE Trans.
Geosci. Remote Sens. 2019, 57, 5512–5524. [CrossRef]
126. Li, Q.; Chen, Y.; Zeng, Y. Transformer with transfer CNN for remote-sensing-image object detection. Remote Sens. 2022, 14, 984.
[CrossRef]
127. Wang, X.; Wang, G.; Dang, Q.; Liu, Y.; Hu, X.; Yu, D. PP-YOLOE-R: An Efficient Anchor-Free Rotated Object Detector. arXiv 2022,
arXiv:2211.02386.
128. Lang, S.; Ventola, F.; Kersting, K. DAFNe: A one-stage anchor-free deep model for oriented object detection. arXiv 2021,
arXiv:2109.06148.
129. Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. In Proceedings of the
AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 923–932.
130. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-oriented object detection transformer. IEEE Trans. Circuits Syst.
Video Technol. 2023, 33, 2342–2356. [CrossRef]
131. Wang, J.; Yang, W.; Li, H.C.; Zhang, H.; Xia, G.S. Learning center probability map for detecting objects in aerial images. IEEE
Trans. Geosci. Remote Sens. 2021, 59, 4307–4323. [CrossRef]
132. Wang, J.; Ding, J.; Guo, H.; Cheng, W.; Pan, T.; Yang, W. Mask OBB: A semantic attention-based mask oriented bounding box
representation for multi-category object detection in aerial images. Remote Sens. 2019, 11, 2930. [CrossRef]
133. Li, C.; Xu, C.; Cui, Z.; Wang, D.; Zhang, T.; Yang, J. Feature-attentioned object detection in remote sensing imagery. In Proceedings
of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3886–3890.
134. Cheng, G.; Yao, Y.; Li, S.; Li, K.; Xie, X.; Wang, J.; Yao, X.; Han, J. Dual-aligned oriented detector. IEEE Trans. Geosci. Remote Sens.
2022, 60, 1–11. [CrossRef]
135. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE
Trans. Geosci. Remote Sens. 2022, 60, 5618111. [CrossRef]
136. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838.
137. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou loss: Towards accurate oriented object detection in complex
environments. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020;
Proceedings, Part V 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 195–211.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.