A Survey and Performance Evaluation of Deep Learning Methods for Small Object Detection
A R T I C L E  I N F O

Keywords:
Small object detection
Computer vision
Convolutional neural networks
Deep learning

A B S T R A C T

In computer vision, significant advances have been made on object detection with the rapid development of deep convolutional neural networks (CNN). This paper provides a comprehensive review of recently developed deep learning methods for small object detection. We summarize challenges and solutions of small object detection, and present major deep learning techniques, including fusing feature maps, adding context information, balancing foreground-background examples, and creating sufficient positive examples. We discuss related techniques developed in four research areas, including generic object detection, face detection, object detection in aerial imagery, and segmentation. In addition, this paper compares the performances of several leading deep learning methods for small object detection, including YOLOv3, Faster R-CNN, and SSD, based on three large benchmark datasets of small objects. Our experimental results show that while the detection accuracy on small objects by these deep learning methods was low, less than 0.4, Faster R-CNN performed the best, while YOLOv3 was a close second.
report our experimental results of comparing the performances of several state-of-the-art deep learning methods on benchmark datasets focusing on small objects.

The main contributions of this paper are as follows:

• Provide a comprehensive review of the state-of-the-art deep learning techniques for small object detection.
• Identify challenges for small object detection in four specific aspects, summarize major components of deep learning methods, and categorize existing methods in four aspects.
• Analyze and connect related techniques from four research areas, including generic object detection, face detection, object detection in aerial imagery, and segmentation.
• Provide an empirical performance evaluation of several state-of-the-art deep learning methods on three benchmark datasets of small objects.

1.2. Comparison with previous survey papers

The survey in (Zou, Shi, Guo, & Ye, 2019) covered object detection methods of the past 20 years, including both traditional detection methods and deep learning methods, whereas this paper focuses on deep learning methods for small object detection developed in the last 5 years. (Zou et al., 2019) surveyed deep learning methods for generic object detection, whereas this paper includes methods developed in four research areas: generic object detection, face detection, object detection in aerial imagery, and segmentation. (Leevy, Khoshgoftaar, Bauder, & Seliya, 2018; Oksuz, Cam, Kalkan, & Akbas, 2019) focused on methods to overcome the class imbalance problem. (Zhao, Zheng, Xu, & Wu, 2019) reviewed several state-of-the-art deep learning frameworks for several object detection tasks and analyzed different methods with experimental results on general object detection. (Liu et al., 2020; Jiao et al., 2019) reviewed deep learning methods and techniques for object detection, but did not provide experimental analysis of these methods. (Wu, Sahoo, & Hoi, 2020) reviewed object detection components, models, and learning strategies. Even though these works provide comprehensive reviews, their focus is on objects of general size, not small objects.

Recently, there have also been some reviews on small object detection. Nguyen, Do, Ngo, and Le (2020) provided a review of existing object detection methods for small objects and focused on a performance evaluation of four models. In comparison, this paper presents a more in-depth and comprehensive review and a different perspective on the challenges. Moreover, we summarize the major components of existing deep learning methods, categorize existing detection approaches in four aspects, connect and analyze current deep learning methods from four separate application areas of object detection, and evaluate three models' performances on three different datasets. Tong, Wu, and Zhou (2020) mainly reviewed existing methods from five aspects to improve small object detection and analyzed experimental results on two datasets. In comparison, this paper not only summarizes existing methods in different aspects, but also identifies and analyzes key challenges in four specific aspects, connects and analyzes solutions from several related research areas, and presents empirical results on different datasets.

In summary, this paper differs from previous review papers in several aspects. First, our review focuses on small objects. Secondly, our review includes summaries of major detection components and state-of-the-art object detection frameworks. Thirdly, we identify the challenges for small object detection and summarize major techniques to improve small object detection accuracy. In addition, we analyze and connect techniques from four small object detection application areas, which cover a wide range of small object detection tasks. Finally, we provide an empirical comparison of three representative deep learning frameworks on small object benchmark datasets.

The rest of the paper is organized as follows. Section 2 presents an overview of deep learning methods and major components for object detection in images. Section 3 presents major deep learning approaches and frameworks for small object detection. Section 4 identifies the challenges and solutions for small object detection and major techniques developed in four related research areas. Section 5 presents experimental results of several leading deep learning methods for small object detection on three benchmark datasets of small objects. Finally, Section 6 discusses some future research directions.

2. Overview of deep learning methods for image-based object detection

2.1. Problem definition

The goal of image-based object detection is to detect instances of objects of predefined classes in images and draw a tight bounding box around each object. More specifically, object detection consists of two tasks: object localization and classification, i.e., finding where objects are located in an image and determining which predefined class each object belongs to.

2.2. Major components of deep learning methods

In this section, we summarize the major components of deep learning methods for image-based object detection, which include backbone networks, region proposals, anchors, object classification, bounding box regression, loss functions, and non-maximum suppression.

2.2.1. Backbone networks

Backbone networks in deep neural network-based object detectors are used to extract high-level features from input images. The most commonly used backbone networks are derived from deep neural network image classifiers that performed well on large-scale image classification datasets, such as the ImageNet classification dataset (Huang, Liu, Van Der Maaten, & Weinberger, 2017; Szegedy et al., 2015; Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016; Newell, Yang, & Deng, 2016; He, Zhang, Ren, & Sun, 2016; Howard et al., 2017; Simonyan & Zisserman, 2014). Typically, the last classification layers are removed from these image classifiers and the remaining layers are used as the backbone network, to which detection layers are appended to form a complete object detector (see the sketch after the list below).

The main design objectives of backbone networks are high detection accuracy and computational efficiency. Some popular backbone networks are as follows.

• VGGNets (Simonyan & Zisserman, 2014), which use small filters of size 3 by 3 pixels in their convolutional layers, followed by 2 by 2 max pooling. VGG16 has 13 convolutional layers, whereas VGG19 has 16 convolutional layers. VGG won the ImageNet Challenge in 2014 and is still one of the most widely used networks.
• Residual networks, or ResNets (He et al., 2016), in which residual blocks make training very deep networks possible by overcoming the vanishing gradient problem in back propagation through a skip connection directly from the input of each module. There are several variations of residual networks; the most used versions are ResNet50 and ResNet101. ResNet is much deeper than VGGNet and won the ImageNet 2015 classification task.
• Inception networks (Szegedy et al., 2015, 2016), which increased the depth and width of networks without increasing computational complexity. The Inception module consists of 1x1, 3x3, and 5x5 convolutional layers and max pooling layers stacked in parallel with each other, so features at multiple scales can be extracted simultaneously in one layer. Inception networks are much faster than VGGNet.
• DenseNet (Huang et al., 2017), in which each layer is densely connected to all subsequent layers in a feedforward manner, so that lower level features are reused by all later layers. DenseNet can alleviate the vanishing-gradient problem.
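To make the truncation step above concrete, the following is a minimal PyTorch sketch, not taken from any of the surveyed papers, of deriving a detection backbone from a pretrained ImageNet classifier; the choice of ResNet50 and the 512x512 input size are illustrative assumptions.

```python
import torch
import torchvision

# A minimal sketch: turn a pretrained ImageNet classifier into a backbone
# by dropping its global-pooling and fully connected classification layers.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

images = torch.randn(1, 3, 512, 512)   # a dummy input batch
features = backbone(images)            # high-level feature maps
print(features.shape)                  # torch.Size([1, 2048, 16, 16])
```

Detection layers (for example, region proposal or SSD-style prediction heads) would then be appended on top of the returned feature maps.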
2.2.2. Region proposals

The input of a region proposal network (RPN) is an image, and its output contains regions of interest (RoIs) with object scores. Specifically, it uses small networks as sliding windows over the convolutional layers. Each of the sliding windows corresponds to one of the regions in the input image and can be viewed as a region proposal with a certain scale. The features are fed into two prediction layers: a classification layer and a box regression layer. The classification layer performs binary classification to predict whether the region contains any objects.

2.2.3. Anchors

Anchors, also called anchor boxes, were first proposed in (Ren et al., 2015). Anchors are a set of pre-defined bounding boxes with various scales and ratios placed regularly on the feature maps. Anchors at different locations on the feature maps are projected back to the input images, to be matched with the ground-truth bounding boxes. The stride for each feature map is calculated by H/h and W/w, where H and W are the height and width of an input image, respectively, and h and w are the height and width of a certain feature map, respectively. The scales and ratios of anchors are usually pre-defined to maximize the match with ground-truth bounding boxes. Some researchers used unsupervised clustering methods to calculate the scales and ratios directly from the training data.
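The stride computation and dense anchor tiling described above can be sketched as follows; this is an illustrative NumPy sketch rather than the implementation of any particular detector, and the default scales and ratios are assumptions.

```python
import numpy as np

def generate_anchors(img_h, img_w, feat_h, feat_w,
                     scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Tile anchor boxes (x1, y1, x2, y2) over a feature map.

    The stride maps each feature-map cell back to the input image,
    as described above: stride_y = H/h, stride_x = W/w.
    """
    stride_y, stride_x = img_h / feat_h, img_w / feat_w
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cy, cx = (i + 0.5) * stride_y, (j + 0.5) * stride_x
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# 9 anchors per cell on a 64x64 map of a 512x512 image -> (36864, 4)
print(generate_anchors(512, 512, 64, 64).shape)
```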
2.2.5. Bounding box regression

Bounding box regression refines the locations and sizes of candidate bounding boxes. Various bounding box regression loss functions have been proposed. In (He, Zhu, Wang, Savvides & Zhang, 2019), the KL (Kullback-Leibler) loss was proposed based on the Kullback-Leibler divergence between the predicted bounding box distribution and the ground truth distribution. In (Lee, Kwak, & Cho, 2018), a bounding box regression neural network with convolutional layers, fully connected layers, and an RoI-Align layer was proposed to be trained separately to minimize an IoU loss. (Yu, Jiang, Wang, Cao, & Huang, 2016) proposed the IoU loss for bounding box regression, which regresses all the bounding box variables together. The IoU loss is not sensitive to scale. (Rezatofighi et al., 2019) found that IoU does not provide a strong relationship with minimizing the l_n-norm loss functions. Therefore, GIoU was proposed to solve this problem by considering not only the overlap area but also the non-overlapping area. (Zheng et al., 2020) proposed the DIoU loss to solve the slow convergence issues of earlier work. Moreover, CIoU was further proposed, considering the overlap area, the central point distance, as well as the aspect ratio.
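As a concrete reference for the IoU and GIoU quantities above, here is a small illustrative sketch for boxes in (x1, y1, x2, y2) form; it follows the GIoU idea of penalizing the empty part of the smallest enclosing box.

```python
import numpy as np

def iou_and_giou(a, b):
    """IoU and GIoU of two boxes in (x1, y1, x2, y2) form.

    GIoU = IoU - |C \ (A U B)| / |C|, where C is the smallest box
    enclosing both A and B, so the non-overlapping area is penalized.
    """
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    c = cw * ch
    giou = iou - (c - union) / c
    return iou, giou

print(iou_and_giou([0, 0, 10, 10], [5, 5, 15, 15]))  # (~0.143, ~-0.079)
```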
2.2.6. Loss functions

In object detection, multi-task loss functions have been widely used to simultaneously minimize the errors of object classification and bounding box regression. For classification, the softmax loss has been applied to distinguish foreground and background classes. For the regression loss, smooth L1, as defined below, has been widely used. This loss is only computed for the bounding boxes predicted as the foreground class (i.e., objects).

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)    (5)

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i − v_i)    (6)

where

smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise    (7)

in which p is the probability distribution (per RoI) over all categories, p = (p_0, ..., p_k), computed by a softmax layer, and L_cls(p, u) = −log p_u is the log loss for the true class u. The second loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h). The L1 loss is used here since it is less sensitive to outliers than the L2 loss.
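A direct NumPy transcription of Eqs. (5)-(7) may help; the λ = 1 default and the toy inputs are illustrative assumptions.

```python
import numpy as np

def smooth_l1(x):
    """Eq. (7): quadratic near zero, linear otherwise."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Eq. (5): L = L_cls(p, u) + lam * [u >= 1] * L_loc(t_u, v).

    p   : softmax probabilities over all classes for one RoI
    u   : true class index (0 = background)
    t_u : predicted box tuple (tx, ty, tw, th) for class u
    v   : ground-truth regression targets (vx, vy, vw, vh)
    """
    l_cls = -np.log(p[u])                                     # log loss
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()  # Eq. (6)
    return l_cls + lam * (u >= 1) * l_loc   # box loss only for foreground

p = np.array([0.1, 0.7, 0.2])
print(multitask_loss(p, 1, (0.2, 0.1, 0.0, -0.3), (0.0, 0.0, 0.0, 0.0)))
```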
However, because of the lack of foreground objects in most input images, it is difficult to capture effective patterns and information of foreground objects in the training phase. To overcome this challenge, another type of multi-task loss function has been proposed for object detection to improve the performance (Lin, Goyal, Girshick, He, & Dollár, 2017). In (Lin, Goyal, et al., 2017), the focal loss was proposed to address the imbalance problem between foreground and background examples by down-weighting the easy examples:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)    (8)

where p_t equals the model's estimated probability p if the label y = 1 and 1 − p otherwise, γ is the focusing parameter, which reduces the weight on easy examples, and α_t controls the balance between foreground and background examples.
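The following sketch implements Eq. (8) for a single binary prediction; the defaults α = 0.25 and γ = 2 are the values reported in (Lin, Goyal, et al., 2017), and the example inputs are illustrative.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Eq. (8): FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p : predicted probability of the foreground class
    y : binary label (1 = foreground, 0 = background)
    Well-classified (easy) examples get (1 - p_t)^gamma close to 0,
    so the rare hard examples dominate the total loss.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

print(focal_loss(0.9, 1))   # easy positive: tiny loss
print(focal_loss(0.1, 1))   # hard positive: much larger loss
```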
2.2.7. Non-maximum suppression (NMS)

NMS serves as the post-processing step in the inference phase to remove redundant overlapping detection results for the same object. A greedy method is to first sort all detection boxes according to their scores, and then repeatedly select the detection box with the highest score and suppress the other regions that overlap significantly with the selected box, i.e., whose intersection-over-union (IoU) score is higher than a designed threshold. NMS is usually applied on each object class independently. In (Rothe, Guillaumin, & Van Gool, 2014), improved solutions were found by treating the problem as a message-passing clustering problem and learning the threshold parameters from training data, instead of using pre-defined thresholds. Recently, (Bodla, Singh, Chellappa, & Davis, 2017) and (Hosang, Benenson, & Schiele, 2017) proposed methods with soft thresholds to improve performance. In standard NMS, lower-score regions overlapping significantly with higher-score regions are discarded, whereas in soft-NMS (Bodla et al., 2017), lower-score regions are kept in the results. (Hosang et al., 2017) proposed a CNN that performs NMS by using neighboring regions' detection results to update each region's detection.
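A minimal NumPy sketch of the greedy procedure just described; the 0.5 IoU threshold and the toy boxes are illustrative choices.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress boxes whose
    IoU with it exceeds the threshold, and repeat on the remainder."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the selected box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # soft-NMS would decay scores instead
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # -> [0, 2]
```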
3. Major deep learning approaches for image-based object detection

3.1. Anchor-based deep learning approaches

Anchor-based deep learning approaches for object detection can be grouped into two main categories: two-stage object detectors and one-stage object detectors.

Two-stage object detectors, such as Faster R-CNN (Ren et al., 2015), divide the detection problem into two stages: a region proposal stage and a detection stage. The goal of the region proposal stage is to generate the regions where objects may exist. It outputs the region locations as well as an object score, 0 or 1, to indicate whether an object exists. In the detection stage, the candidate region proposals are classified into different classes. This stage outputs class probabilities and, optionally, refined region locations. Two-stage detectors achieved state-of-the-art performance, yet their running speeds were typically slow.

One-stage object detectors, such as YOLO (Redmon, Divvala, Girshick, & Farhadi, 2016) and SSD (Liu et al., 2016), perform region proposal and detection in one deep neural network. Initial regions are pre-defined anchors with various scales and ratios, tiled densely on the image. From the initial anchors, the detectors find those that likely contain objects. Compared with two-stage detectors, one-stage detectors are usually much faster, but achieve less accurate solutions.

Some previous works empirically evaluated the performances of some one-stage detectors and two-stage detectors (Soviany & Ionescu, 2018a, 2018b). They compared the detection accuracy and time of some two-stage detectors and one-stage detectors and concluded that, on average, one-stage detectors are faster than two-stage detectors, while two-stage detectors tend to be more accurate than one-stage detectors. (Soviany & Ionescu, 2018b) proposed separating images into easy and hard categories and training different models on the different categories to achieve faster speed and higher accuracy.

3.2. Representative two-stage detectors

R-CNN (Girshick et al., 2014) was the first work transferring deep CNN classification results from ImageNet to object detection. R-CNN adopted the two-stage approach of separate region proposal and object classification. Each region proposal was warped and fed into a deep CNN, and a 4096-dimension feature vector was extracted. During training, the network was first pre-trained on the ImageNet dataset using image-level labels only. R-CNN made a breakthrough in object detection and improved detection accuracy on the VOC 2012 dataset by more than 30%. This work successfully demonstrated the superiority of trained CNN features over human-designed features. However, the CNN feature extraction was applied to each region proposal independently, which made the network very slow.

Fast R-CNN (Girshick, 2015) improved R-CNN by addressing two major issues. First, the training of R-CNN for classification and bounding box regression was done separately in two different stages. Fast R-CNN combined the training for classification and regression. It had two output layers: one for the classification scores and the other for the bounding box location offsets represented as four values, i.e., the x, y coordinates of the bounding box's center point, and the width and height of the bounding box. A new multi-task loss function was proposed for simultaneous training. Secondly, the feature extraction in R-CNN was applied to each region proposal, which was time and space consuming. Fast R-CNN improved the efficiency by computing the convolutional feature map only once for an entire image. The Region of Interest (RoI) pooling layer was designed to extract fixed-size features for region proposals of different sizes. It divided the feature maps into fixed-size sub-windows and max pooled each sub-window to form the fixed-size features. Although Fast R-CNN was much faster than R-CNN, it was still based on the traditional region proposal method, which was time consuming.

Faster R-CNN (Ren et al., 2015) further improved the detection speed of Fast R-CNN by replacing the traditional region proposal stage with a convolutional neural network, called the Region Proposal Network (RPN). The RPN predicted the object regions and object confidence scores simultaneously. The main benefit of this design is that the RPN shares the same convolutional layers with the object detection network, which reduces the detection time.

Various anchor sizes captured a multi-scale representation and kept the computation lightweight, compared to image pyramid methods.
In the second stage, the outputs of the RPN were further classified and localized to generate the final detection results. Faster R-CNN can be trained end-to-end and the whole network is efficient.

Mask R-CNN (He, Gkioxari, Dollár, & Girshick, 2017) was proposed as an extension of Faster R-CNN with extra predictions for pixel-wise instance segmentation. Mask R-CNN used the Faster R-CNN two-stage pipeline with the same first-stage network. In the second stage, it added one extra output, a binary mask for each region proposal, and kept the original classification and bounding box regression. To the loss function used in training, it added one extra term for mask prediction in the form of a binary cross-entropy loss. One of the key contributions of Mask R-CNN was the RoI Align layer, which was introduced to fix the misalignment issues in the RoI pooling layer. The idea was to remove the quantization of the RoI boundary and instead calculate real values. Bilinear interpolation was used to calculate the real feature value at each sampling point. The results were averaged or max pooled over the sampling points. Mask R-CNN achieved the state-of-the-art on instance-level segmentation.

Feature pyramid network (FPN; Lin, Dollár, et al., 2017) focused on solving the problem that lower level feature maps contain more spatial information but less semantic information, whereas the latter layers of a deep neural network contain more high-level semantic information but less spatial information. FPN utilized the hierarchy of a CNN and implemented a bottom-up and a top-down path with lateral connections. In the bottom-up part, an input image was passed through a CNN, and pooling layers were used to shrink the feature map sizes. In the top-down part, the feature maps were up-sampled back to the same sizes as in the bottom-up part. Moreover, the lateral connections fused the feature maps of the same sizes from the bottom-up path and the top-down path with element-wise addition. FPN generated integrated feature maps that dramatically improve detection accuracy, especially for small objects.
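The top-down path with lateral connections can be sketched in a few lines of PyTorch; the channel widths and the three-level layout below are illustrative assumptions rather than the exact FPN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Minimal FPN-style top-down pathway (illustrative channel sizes).

    Each lateral 1x1 conv projects a bottom-up feature map to a common
    width; the coarser map is upsampled and fused by element-wise addition.
    """
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, c3, c4, c5):          # fine -> coarse bottom-up maps
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2)
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2)
        return p3, p4, p5                   # fused maps for detection heads

fpn = TopDownFusion()
c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
              [(256, 64), (512, 32), (1024, 16)])
print([p.shape[-1] for p in fpn(c3, c4, c5)])  # [64, 32, 16]
```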
3.3. Representative one-stage detectors

YOLO (Redmon et al., 2016) focused on improving the speed of object detectors. It treated object detection as a regression problem and removed the region proposal stage of two-stage detectors. Instead of using pre-defined anchors for object regions, it divided input images into 7*7 cells, and each cell was used to predict the objects whose centers fall into the cell. Each cell predicted bounding box locations, a score for each bounding box, and class probabilities. The network was implemented as convolutional layers followed by fully connected layers. The sum of squared error loss was used to minimize localization and classification errors. YOLO was a real-time object detector with a detection speed of 45 frames per second, which was extremely fast compared to other detectors. However, class probabilities were only predicted within each cell. It does not work well on objects partially located in one cell, could not handle a wide distribution of ground truth objects, and has difficulty predicting bounding box scales and ratios precisely, which results in low localization accuracy.

YOLOv2 (Redmon & Farhadi, 2017) proposed several improvements to YOLO. In order to increase recall, it removed the fully connected layers and adopted the anchor box concept to predict bounding boxes. Unsupervised learning methods were applied to generate bounding box scales and ratios directly from training data. Instead of only predicting one class probability per cell, it predicts both objectness and class for each bounding box, which improved the performance on detecting partially covered objects. For bounding box regression, it predicted the location relative to the top-left corner of the cell, which bounded the predictions between 0 and 1. Other proposed techniques included batch normalization, high-resolution classification, and multi-scale training. All these techniques dramatically improved detection accuracy while keeping the fast speed.

YOLOv3 (Redmon & Farhadi, 2018) proposed more improvements over YOLOv2. For class prediction, it used a binary cross-entropy loss instead of the softmax loss, to handle the case of multiple classes in one bounding box. It adopted the multi-scale framework and feature pyramid to predict objects at 3 different scales. A new backbone network with ResNet modules was proposed for improved speed and accuracy, especially on small object detection.

SSD (Liu et al., 2016) was a single-shot detector without a region proposal stage, as shown in Fig. 1. Different from Faster R-CNN, which only used the last layer for detection, SSD performed detection using multiple layers to better capture multi-scale objects. Since anchors were applied to multiple feature maps, SSD designed various anchor scale ranges between layers: the lower layers had smaller scales and the higher layers larger scales. This design could handle a wide range of objects, resulting in a higher recall rate. Different from YOLO, which used fully connected layers for object detection, SSD used fully convolutional layers to predict confidence scores and localization offsets. With some additional data augmentation and hard negative mining techniques, SSD achieved state-of-the-art performance on several benchmark datasets. However, SSD performed poorly on small objects, due to shallow layers without deep semantic information.

DSSD (Fu, Liu, Ranga, Tyagi, & Berg, 2017) improved SSD by using a larger network. Their experimental results showed that the deep and powerful backbone network ResNet-101 outperformed the VGG network. A deconvolutional module was introduced to add more context information. More importantly, the deconvolutional layers were trainable, making DSSD more flexible and achieving better performance. In order to improve the accuracy of anchor scales and ratios, K-means clustering was applied to group training boxes, with the square root of box area as the distance measurement. DSSD improved the SSD accuracy, especially for small objects.

RetinaNet (Lin, Goyal, et al., 2017) was proposed as a one-stage object detector to reduce the detection accuracy gap with existing two-stage detectors while maintaining fast detection time. This work found that the accuracy gap between one-stage detectors and two-stage detectors was mainly because the numbers of positive and negative examples, as well as easy and hard examples, used in training were highly unbalanced. The large number of easy examples dominated the loss function, which resulted in a degenerate model. This problem was solved by introducing a new loss function, called the focal loss, to reduce the weights of the easy examples adaptively.

4. Challenges and solutions for small object detection

In this section, we identify four major challenges of applying deep neural networks to small object detection and discuss existing solutions.

4.1. Challenges for small object detection

In this section, we summarize the four major challenges for small object detection.

4.1.1. Challenge 1: Individual feature layers do not contain sufficient information for small object detection
Deep CNN architectures provide hierarchical feature maps due to pooling and subsampling operations, resulting in different layers of feature maps containing different spatial resolutions. It is well known that the early-layer feature maps are of higher resolution and represent smaller receptive fields. At the same time, they do not contain the high-level semantic information that is important for object detection. On the other hand, the latter-layer feature maps contain stronger semantic information, which is essential for identifying and classifying objects under different poses or illuminations. Even though higher-level feature maps are useful for identifying large objects, they may not be sufficient for small object detection. After down sampling several times in a deep CNN architecture, the latter feature maps lose spatial information.
A small object of size 32x32 pixels is clearly visible in earlier (or shallower) feature maps, but not in the latter (or deeper) feature maps. Therefore, low-level features alone or high-level features alone are not sufficient for small object detection.

Solution: Combining features from shallow layers and deep layers. To better detect small objects, several deep CNN based methods combine lower-level feature maps and higher-level feature maps together to obtain the necessary spatial and semantic information. There are two main approaches to feature map fusion:

1) Bottom-Up Scheme
This scheme is incorporated into the standard feedforward CNN architecture. From early to latter layers, feature maps shrink after pooling operations. The final detection layers directly combine several bottom-up feature maps.

2) Top-Down Scheme
This scheme can be viewed as an attention mechanism that propagates higher-level semantic information back to lower-level feature maps. It usually uses a convolution-deconvolution or encoder-decoder network with an upsampling operation in the decoder to enlarge the feature maps' spatial resolution. Moreover, a skip paradigm or lateral connections are often used to connect lower-layer with higher-layer feature maps while bypassing intermediate layers. The fused feature maps are used by the detection layers. Typical operations to combine feature maps include summation, product, concatenation, and global pooling.

4.1.2. Challenge 2: Limited context information of small objects
Usually small objects are in low resolution, and it is difficult to recognize low-resolution objects. Since small objects themselves contain limited information, contextual information plays a critical role in small object detection (Torralba, Murphy, Freeman, & Rubin, 2003; Oliva & Torralba, 2007; Divvala, Hoiem, Hays, Efros, & Hebert, 2009; Palmer, 1975). Contextual information has been used in object recognition from a "global" image level to a "local" image level. A global image level considers image statistics from the entire image, whereas a local image level considers contextual information from neighboring areas of the objects. Context features can be categorized into three types (Divvala et al., 2009):

1) Local pixel context: The patches or pixels around an object, such as edges, colors, textures, etc. Local pixel context can be captured by increasing the size of the detection window in object detection networks.
2) Semantic context: The probability of an object being identified in certain surrounding scenes, such as events, activities, or scene categories.
3) Spatial context: The spatial locations of other objects in the image, e.g., the likelihood of finding an object in some positions with respect to other objects in the image. For example, in face detection systems, the subject's shoulder and neck are always close to their face.

Solution: Incorporating contextual information in the detection network. The local pixel context is usually added by enlarging filter sizes or detection windows to capture extra information around the objects, as illustrated in the sketch below. The semantic context is usually added by extracting deeper features from images, such as in deconvolution layers or recurrent neural networks (RNNs).
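As a minimal illustration of enlarging a detection window to capture local pixel context, the sketch below expands a box around its center and clips it to the image; the 1.5x factor matches the expansion factor of (Cai et al., 2016) discussed in Section 4.2.2, but the function itself is an illustrative assumption, not code from any surveyed work.

```python
def add_local_context(box, img_w, img_h, factor=1.5):
    """Enlarge a detection window around its center to capture local
    pixel context, clipping the result to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * factor, (y2 - y1) * factor
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

print(add_local_context((100, 100, 120, 120), 640, 480))
# (95.0, 95.0, 125.0, 125.0): 1.5x the original 20x20 window
```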
4.1.3. Challenge 3: Class imbalance for small objects
Class imbalance refers to an uneven data distribution between classes. There are two types of class imbalance. One is the imbalance of foreground and background examples. In object detection, region proposal networks are used to generate the candidate regions containing objects by densely scanning the entire image. The anchors are pre-defined rectangular boxes densely tiled on the entire input image. The scales and ratios of anchors are pre-defined based on the target objects' sizes in the training dataset. To detect small objects, more anchors are generated per image than for detecting large objects. Only the anchors with high Intersection over Union (IoU) with the ground truth bounding boxes are labeled as positive examples. Since most anchors have low or no overlap with the ground truth bounding boxes, they are considered negative examples. When densely generated anchors are matched with sparsely located real objects in the images, positive examples are a tiny fraction, resulting in a high class imbalance, e.g., a class ratio from 100:1 to 1000:1.

The anchor-based object detection approach has several drawbacks. First, due to the sparseness of ground-truth bounding boxes and the IoU matching strategies between ground truth and anchors, negative examples highly dominate positive examples, which leads to models favoring the negative class. Second, the dense sliding window strategy has a high time complexity, O(h²w²), where h is the height and w is the width of the anchors, which makes training slow.

Solution: Balance positive and negative examples in training. There are two main strategies: 1) data-based and 2) loss function-based. The data-based strategy is to change the numbers of foreground and background examples so that the positive and negative classes roughly carry the same weights. Hard sampling and soft sampling are two popular methods. Hard sampling selects a subset of samples, whereas soft sampling assigns different weights to examples. For example, random sampling is commonly used to randomly select examples to meet a certain ratio. Another sampling strategy is to sample more of the hard examples with large losses. For example, a machine learning model could be trained first, and then the false positives are treated as hard examples, which are weighted heavily in the second round of training.
The recently proposed Online Hard Example Mining (OHEM) method (Shrivastava, Gupta, & Girshick, 2016) performs one forward pass on the computed regions of interest (RoIs) and computes losses for all RoIs. Then, the examples are ranked by their loss values, and the examples with the largest losses are selected for the next round of training, since the current trained network model performs the worst on them. (Pang, Chen, et al., 2019) also proposed an IoU-balanced sampling technique in order to sample more training examples from difficult cases. In terms of soft sampling, (Cao, Chen, Loy & Lin, 2020) proposed a technique that selects samples based on their importance, where the importance of positive examples is measured by their IoU scores with the ground truth bounding boxes, and the importance of negative examples is calculated by considering both local region and global region properties.
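A simplified sketch of the hard-example selection step is shown below; the full OHEM method also suppresses highly overlapping RoIs before selection, which is omitted here, and the batch size of 128 is an illustrative assumption.

```python
import numpy as np

def select_hard_examples(roi_losses, batch_size=128):
    """Keep only the RoIs with the largest losses for the backward pass,
    as in OHEM-style hard sampling (overlap suppression omitted)."""
    order = np.argsort(roi_losses)[::-1]       # sort by loss, descending
    return order[:batch_size]                  # indices of hard examples

losses = np.random.default_rng(0).random(2000)  # one loss per RoI
hard = select_hard_examples(losses)
print(len(hard), losses[hard].min() > np.median(losses))  # 128 True
```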
Loss function-based strategies re-weight the examples of imbalanced classes in the loss function in order to balance the foreground and background examples. For example, AP loss (Chen et al., 2019) uses an average-precision loss to re-weight examples. DR Loss (Qian, Chen, Li, & Jin, 2019) re-weights examples based on the distribution of foreground examples over the distribution of background examples.

4.1.4. Challenge 4: Insufficient positive examples for small objects
Most deep neural network models for object detection were trained using objects of various scales. They typically perform well on large objects, but poorly on small objects. Reasons may include an insufficient number of small-scale anchor boxes generated to match the small objects and an insufficient number of examples successfully matched to the ground truth. The anchors are regions in the feature maps of some intermediate layers in a deep neural network, which are projected back to the original image. It is hard to generate anchors for small objects. Moreover, the anchors need to be matched to the ground truth bounding boxes. A widely used matching method is as follows. If an anchor has a high IoU score with respect to a ground truth bounding box, such as larger than 0.9, it is labeled as a positive example. In addition, the anchor with the highest IoU score with respect to each ground truth box is also labeled as a positive example. Therefore, small objects usually have very few anchors matched with the ground truth bounding boxes, i.e., very few positive examples. The sketch below illustrates this matching rule.
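The following sketch implements the two-part matching rule just described; the 0.5 default threshold is an illustrative choice (the 0.9 example in the text would be stricter), and the toy IoU matrix is assumed.

```python
import numpy as np

def match_anchors(iou, pos_thresh=0.5):
    """Label anchors positive from an IoU matrix of shape
    (num_anchors, num_gt_boxes): high-overlap anchors are positive,
    and the best anchor for each ground-truth box is positive even if
    it misses the threshold, so every object, however small, keeps at
    least one positive example."""
    labels = np.zeros(iou.shape[0], dtype=int)
    labels[iou.max(axis=1) >= pos_thresh] = 1   # threshold rule
    labels[iou.argmax(axis=0)] = 1              # best-match-per-gt rule
    return labels

# One small object; no anchor reaches the threshold, one is still kept.
iou = np.array([[0.05], [0.30], [0.10]])
print(match_anchors(iou))   # [0 1 0]
```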
Solution: Use methods that can generate more anchors for small objects and match more anchors with small objects. Existing techniques include:

1) Multi-scale mechanism. Multi-scale architectures consisting of separate branches for small, medium, and large-scale objects can generate anchors of different scales.
2) Matching strategy. Adaptively setting anchor scales and ratios to help more anchors match to ground truths of small objects.
3) Increasing positive examples of small objects. In the region proposal stage, generate more anchors by allowing them to overlap with each other.

4.2. Deep learning techniques for small object detection developed in generic object detection research

In this section, we summarize the deep learning techniques developed in generic object detection research that are effective for small object detection.

4.2.1. Technique 1: Improve feature maps for small objects
Low-level features are important for localization, whereas high-level features are important for classification. Using a combination of low-level and high-level feature maps at the detection layers of several deep neural networks has led to improved results for small object detection.

Using a bottom-up fashion, one group of network architectures merged feature maps from different layers (Fu et al., 2017; Lin, Milan, Shen, & Reid, 2017; Kong et al., 2017). Kong, Sun, Tan, Liu, and Huang (2018) proposed a non-linear feature map transformation considering both global and local information. The parameters of the non-linear transformation were learnable and shareable across different layers. The transformations were applied to feature maps in different layers, and each transformed layer generated detection results. (Yang, Liu, Yan, & Li, 2019) used deconvolutional layers in an "encoder-decoder" architecture, and the feature maps from the convolutional layers and deconvolutional layers were combined. (Bell, Lawrence, Zitnick, Bala & Girshick, 2016) used skip connections to directly add lower-level feature maps to higher-level feature maps. Features were pooled from convolutional layers with different receptive fields (conv3, conv4, and conv5). These features were normalized and concatenated to be fed into detection modules. (Zagoruyko et al., 2016) also concatenated features from different layers (conv3, conv4, conv5) after normalization. In (Cao et al., 2018), extensive experimental results showed that combining low-level and high-level features improved the detection accuracy of small objects. (Jeong, Park, & Kwak, 2017) concatenated features by performing pooling and deconvolution simultaneously, where pooling decreased the size of the low-level feature maps to combine them with high-level feature maps, and deconvolution increased the size of the high-level feature maps to combine them with low-level feature maps. (Yu et al., 2018) fused both semantic and spatial information by using Iterative Deep Aggregation (IDA) and Hierarchical Deep Aggregation (HDA). IDA merged the two types of features non-linearly, whereas HDA merged several CNN features in a tree structure. (Zhang, Qiao, et al., 2018) used an extra segmentation module to add more semantic information. (Li & Zhou, 2017) fused features from multiple lower layers, where low-level feature maps were down sampled with max pooling and high-level feature maps were resized with bilinear interpolation. Similarly, (Kong, Yao, Chen, & Sun, 2016) applied max pooling to low-level features and deconvolutional operations to high-level features. These feature maps were then concatenated together to form the so-called Hyper Feature Map.

Some network architectures used both top-down and bottom-up connections for combining features. (Shrivastava, Sukthankar, Malik, & Gupta, 2016) added a top-down module as well as lateral connections. The top-down module can generate features with more semantic and contextual information. The lateral connections can help to enrich the top-down features by transmitting lower-level features. (Kong et al., 2017) used reverse connections to add high-level semantic information from latter network layers back to earlier layers. (Ghiasi, Lin, & Le, 2019) proposed a search method to automatically find good feature pyramid architectures, which may make it possible to replace manual design. A recurrent neural network (RNN) served as the controller to merge any two input features with a sum or pooling operation. (Pang, Wang, Anwer, Khan, & Shao, 2019) applied the feature pyramid fusion mechanism at two levels. For global information, it constructed an image pyramid and combined the features from four levels of the image pyramid with the original features from the standard SSD framework. For local spatial information, features from both the previous and the current layers were fused together.

4.2.2. Technique 2: Incorporate context information of small objects
Context information of small objects can be divided into local context and semantic context information. More local context information can be included in deep neural networks through larger bounding boxes and proposal boxes. (Cai, Fan, Feris & Vasconcelos, 2016) added extra local context with bounding boxes 1.5 times the size of the object regions, which was useful for including more of the surroundings of small objects. The extra context information was combined with the object features for the detection layers. (Zagoruyko et al., 2016) incorporated local context information by increasing the sizes of region proposal boxes with four scales: 1x, 1.5x, 2x, and 4x. The outputs from the four regions were pooled with RoI-pooling and concatenated before being fed into the detection and classification layers.
For including more semantic context information, (Fu et al., 2017) used deconvolutional layers with "skip connections", resulting in better detection results on small objects. Instead of simply stacking deconvolutional layers on top of convolutional layers, the deconvolutional layers were designed to be much shallower than the convolutional layers, and an element-wise product was used. Similarly, (Cai et al., 2016) also used deconvolutional layers to increase the resolutions of feature maps. (Bell et al., 2016) used four Recurrent Neural Networks (RNNs) to capture global information of the input image. The RNNs added semantic context information around the objects, and 1x1 convolutions combined all the information together. In addition, (Zhang, Wen, Bian, Lei, & Li, 2018) passed higher-level feature maps to lower level features.

4.2.3. Technique 3: Correct foreground and background class imbalance for small objects
Techniques for improving the foreground and background class imbalance can be divided into two parts: a data-based approach and a loss function-based approach.

In the data-based approach, (Cai et al., 2016) addressed class imbalance using bootstrap sampling, which sampled negative examples based on their loss values. (Zhang, Wen, et al., 2018) used a two-step regression method to balance the foreground and background examples. Some easy negatives were dropped to make the ratio between positives and negatives about even. (Kong et al., 2017) implemented an objectness prior to filter out the bounding boxes without objects. In the loss function-based approach, (Galleguillos & Belongie, 2010) used a loss function that gave hard-negative examples greater weights.

4.2.4. Technique 4: Increase training examples for small objects
Many techniques have been developed to increase the amount of training examples for small objects. They include neural network architectures for multi-scale learning, scale transformation, and adaptive matching of anchor boxes.

Several neural network architectures have been designed for multi-scale learning, i.e., training detector networks for objects of different sizes, to address the problem of insufficient examples of small objects in training classifiers. (Singh, Najibi, & Davis, 2018) showed the effectiveness of a training scheme that used objects of various scales and poses to increase the detection performance on small objects through the increased quantity and variety of training examples. Small objects were up sampled and fed into convolutional networks for detection. Only the layers whose feature maps contained target objects within a certain size range were activated during training, so that small objects could be trained equally well as medium and large objects. (Najibi, Singh, & Davis, 2019b) proposed a multi-scale network to predict the regions most likely containing small objects and discard the regions unlikely to contain small objects. The network first predicted a binary segmentation map for small objects, aiming to achieve a large recall rate for small objects. Only the regions that likely contained small objects were used in training. (Yang, Choi, & Lin, 2016) proposed a technique called scale-dependent pooling to assign the appropriate feature maps to objects based on object scales. This approach was based on the idea that small objects mainly have strong activations in lower-level feature maps. The method pools the earlier feature maps for small objects and the latter feature maps for large objects. Several other works added prediction and detection layers after different convolutional layers, i.e., using different feature maps for prediction, and enlarged the region sizes to include more local context information (Cai et al., 2016; Li & Zhou, 2017; Chen, Liu, Tuzel, & Xiao, 2016). Scale transformation is another technique used to increase the amount of training examples for small objects (Kim, Kang, & Kim, 2018). Objects of different scales are mapped onto a single scale-invariant space in order to increase the number of examples. The mapping from different scales in the original input images to normalized patches is learned, so images containing objects of all scales can be used in training. (Singh & Davis, 2018a) proposed a novel training scheme called scale normalization for image pyramids (SNIP) that can minimize the scale changes during training.
Instead of using pre-defined anchor boxes, machine learning has been applied to find good scales and ratios for anchor boxes and to perform adaptive matching of anchor boxes (Redmon & Farhadi, 2017). For example, ground truth bounding boxes can be grouped into clusters based on their scales, and then anchor boxes are matched to ground truth bounding boxes of similar scales.
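Such clustering can be sketched in the style of the dimension clusters of (Redmon & Farhadi, 2017), using a 1 − IoU distance between boxes aligned at a common center; the synthetic data and k = 3 below are illustrative assumptions.

```python
import numpy as np

def kmeans_anchors(wh, k=3, iters=20, seed=0):
    """Cluster ground-truth box shapes (width, height) into k anchor
    shapes using a 1 - IoU distance, where IoU is computed between
    boxes aligned at a common center."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every cluster center
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + \
                (centers[:, 0] * centers[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)  # highest IoU = nearest
        for j in range(k):
            if np.any(assign == j):                # update each center
                centers[j] = wh[assign == j].mean(axis=0)
    return centers

# Illustrative data: 200 ground-truth boxes around 48x32 pixels
wh = np.abs(np.random.default_rng(1).normal([48, 32], 12, size=(200, 2)))
print(kmeans_anchors(wh, k=3))  # three anchor shapes fitted to the data
```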
4.3. Techniques for small object detection developed in face detection research

Face detection has been extensively studied and has achieved great success. Faces have distinct facial features, i.e., the nose, eyes, and mouth, and their relative positions with respect to each other, which make face detection different from generic object detection. However, it remains challenging when face sizes are very small, e.g., smaller than 16x16 pixels, and facial features are not distinguishable. Many techniques have been developed for small face detection. The state-of-the-art algorithms are shown in Table 1.
Table 1. Face detection state-of-the-art algorithms.
4.3.1. Technique 1: Improve feature maps for small faces
One of the techniques for combining multiple feature maps is skip connections, used to integrate lower, middle, and higher layer features. (Tian et al., 2018) proposed an iterative feature map generation scheme, which generated features at six different scales, and all feature maps from the backbone network were fed back to the beginning of the network to extract more semantic information for small objects. (Samangouei, Chellappa, Najibi, & Davis, 2018) fed the combined lower and higher layer features to a RoI-based block normalization layer. (Tian et al., 2018) merged four feature maps using skip connections to generate four new feature maps for detection. Specifically, the input features were fused with the next level features by element-wise multiplication. Furthermore, the fused features were added to the original input features to form the final features, which proved to be effective for detecting difficult tiny faces. (Zhu, Zheng, Luu, & Savvides, 2017) combined the lower level features with higher level features by first downsampling the lower level features to the size of the higher-level features and then concatenating them with L2 normalization. (Luo, Li, Zhu, & Zhang, 2019) combined the lower level features with neighboring features by using a bilinear upsampling strategy. (Yoo, Dan, & Yun, 2019) designed a feature map generation scheme that recurrently passes through the network.

4.3.2. Technique 2: Incorporate context information of small faces
For face detection, (Bai, Zhang, Ding, & Ghanem, 2018) showed that adding context information can improve the performance dramatically for small faces. However, too much context information can also hurt the performance on small faces due to over-fitting. Some methods added context information by enlarging the receptive fields around faces. (Najibi, Samangouei, Chellappa, & Davis, 2017) adopted larger filter sizes, e.g., 5x5 and 7x7 filters in the convolutional network, instead of 3x3 filters. (Wang, Yuan, & Network, 2017) adopted a feature fusion method to combine lower level features with higher level features with an agglomeration connection module. In this module, the lower feature maps first pass through an Inception-like network to increase the semantic information, and then the lower feature maps are concatenated with the higher feature maps. (Samangouei et al., 2018) added context information around each bounding box. In (Tang et al., 2018), extra context information from bodies and shoulders was integrated, and a semi-supervised method was used to generate labels for other body parts. (Tian et al., 2018) used a segmentation branch to add extra context and semantic information without extra annotations, and the segmentation branch shared the same receptive field as detection, which made the segmentation branch an extra source of more discriminative features. (Li, Tang, Han, Liu, & He, 2019) used the structure of the dense block from (Huang et al., 2017) to integrate extracted context features. (Zhu et al., 2017) integrated body information to reduce false positives. The body features were acquired by additional RoI-pooling operations to enlarge the receptive fields. The combined face and body features were used in both classification and bounding box regression.

4.3.3. Technique 3: Correct foreground and background class imbalance for small faces
For small face detection, detector networks usually place a lot of small anchors on the images, which usually generates a lot of negative anchors and very few positive anchors, resulting in a high false positive rate. There are two main approaches to handle class imbalance.

a) Filtering anchors. (Zhang et al., 2017) used a max-out background label, which predicted several scores for background labels and selected the largest as the final score. (Chi et al., 2019) used two-step classification on the lower layers to filter out false positives for small faces, which helped to balance the positive and negative examples and improve the classification results.

b) Sampling. (Zhang et al., 2017) applied hard-negative mining to make the ratio between negatives and positives at most 3:1. (Najibi, Singh, & Davis, 2019a) also used hard-negative mining; anchors were labeled positive if the overlap with the ground-truth bounding box was larger than 0.5. (Li et al., 2019) proposed a balanced-data-anchor-sampling strategy to select large size and small size anchors with equal probability. (Tang et al., 2018) proposed PyramidBox, which adopted the max-in-out technique on both positive and negative samples to reduce the false positive rate on small objects. (Wang, Li, Ji, & Wang, 2017) applied online hard example mining (OHEM) by sorting the examples based on loss and selecting the top examples with the highest loss as the hard examples. Also, they used a 1:1 ratio between positive hard examples and negative hard examples in each mini batch during training.

4.3.4. Technique 4: Increase training examples for small faces
It is important to make sure that small objects have sufficient anchors to match with; otherwise small objects cannot be trained well, and the trained model's recall will be low. There are three main techniques for increasing training examples for small faces.

a) Matching strategy. (Zhu, Tao, Luu, & Savvides, 2018) found the average IoU between small faces and small anchors to be much lower than that for large faces. In order to design anchors that match with more small scale objects, a new matching score was proposed that considers face scales and anchor strides so that small face scales can also achieve high IoU scores. Moreover, during training, faces were randomly shifted to match with more anchors. (Chi et al., 2019) identified the mismatch between anchor ratios and receptive fields and proposed inception-styled feature maps to increase the diversity of feature map scales and reduce the mismatches between faces and anchors.

b) Increasing anchors. (Zhang et al., 2017) added extra convolutional layers to generate more anchors for small objects. It also reduced the stride sizes on the lower anchor-associated layers to increase the number of anchors that can potentially match with more small scale objects. Moreover, it proposed a two-stage anchor matching strategy to make sure that each small object has enough anchors to match with. (Zhu et al., 2018) increased the number of anchors by reducing the stride and the distance between the face and the anchor center, as well as by adding extra shifted anchors. Moreover, for the hard faces whose highest IoU scores were still lower than the matching threshold, the top few anchors with the highest scores were selected as positive examples. (Luo et al., 2019) increased the ranges of anchors so that more anchors with small sizes can be matched with small faces.

c) Multi-scale training. (Wang, Chen, Huang, Yao, & Liu, 2017) resized input images to different sizes to generate objects of various sizes, so small objects can be resized to larger objects to match more anchor boxes. (Najibi et al., 2017) designed a multi-scale network with three different convolutional branches intended to detect different scales of faces: small, medium, and large faces, respectively. Small faces were detected with an element-wise sum of features from conv4 and conv5. Medium faces were directly detected from conv5. Large faces were detected from max pooling after conv5. (Hu & Ramanan, 2017) used an image pyramid, where the input images were scaled to ratios 0.5, 1, and 2 of the original resolution. Then two types of feature maps were applied to capture different scales of faces.

4.4. Techniques for object detection in aerial images

For detecting objects in aerial images, there are mainly four kinds of methods: (i) template matching-based, (ii) knowledge-based, (iii) OBIA-based, and (iv) machine learning-based (Cheng & Han, 2016). In recent years, deep learning based methods achieved the best performance. Commonly, CNNs pretrained on large image datasets, such as the ImageNet and COCO datasets, were fine-tuned on aerial images. In addition, new deep neural networks were proposed for the unique attributes of objects in aerial images, like multi-scale and multi-angle, to achieve better performance. For example, (Dong, Liu, & Xu, 2018) proposed rotation-invariant models to achieve good performance on remote sensing images. Moreover, weakly supervised learning methods (Peng et al., 2018) have been proposed to learn high-level features in an unsupervised manner to capture the structural information of objects in remote sensing images.

4.4.1. Technique 1: Deal with orientation of aerial image objects
Objects in aerial images can have arbitrary orientations or rotations. Deep neural network-based detectors have been designed to address this issue. Rotation-Invariant CNN (RICNN) in (Cheng, Zhou, & Han, 2016) introduced a new rotation-invariant layer into the basic CNN architecture.
4.4. Techniques for object detection in aerial images

For detecting objects in aerial images, there are mainly four kinds of methods: (i) template matching-based, (ii) knowledge-based, (iii) OBIA-based, and (iv) machine learning-based (Cheng & Han, 2016). In recent years, deep learning-based methods have achieved the best performance. Commonly, CNNs pretrained on large image datasets, such as the ImageNet and COCO datasets, are fine-tuned on aerial images. In addition, new deep neural networks have been proposed for the unique attributes of objects in aerial images, such as multiple scales and angles, to achieve better performance. For example, (Dong, Liu, & Xu, 2018) proposed rotation-invariant models to achieve good performance on remote sensing images. Moreover, weakly supervised learning methods (Peng et al., 2018) have been proposed to learn high-level features in an unsupervised manner to capture the structural information of objects in remote sensing images.

4.4.1. Technique 1: Deal with orientation of aerial image objects
Objects in aerial images can have arbitrary orientations or rotations. Deep neural network-based detectors have been designed to address this issue. Rotation-Invariant CNN (RICNN) in (Cheng, Zhou, & Han, 2016) introduced a new rotation-invariant layer into the basic CNN architecture. Rotation-Invariant and Fisher Discriminative CNN (RIFD) in (Cheng, Han, Zhou, & Xu, 2018) was proposed to combine a rotation-invariant regularizer and a Fisher discrimination regularizer on multi-scale features from the CNN. The rotation-invariant regularizer mapped the CNN feature representations of training samples before and after rotation to be similar, while the Fisher discrimination regularizer constrained the CNN features to be similar for within-class examples but dissimilar for examples of different classes. Recently, anchor rotation methods have been proposed in one-stage object detectors to achieve rotation invariance (Yang et al., 2018). A feature refinement technique was proposed to improve detection performance on aerial images, in which the position information of bounding boxes was encoded to the corresponding feature points through feature interpolation to improve feature reconstruction and alignment. R-Net in (Yang et al., 2018) proposed a network to generate rotatable region proposals.
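A simplified reading of the rotation-invariant regularizer is a penalty that pulls the features of rotated copies of an image toward the features of the original. The sketch below is an approximation of that idea, not the exact formulation in (Cheng, Zhou, & Han, 2016); the angle set and backbone interface are assumptions:

```python
import torch

def rotation_invariance_loss(backbone, images, angles=(90, 180, 270)):
    """Penalize feature differences between images and their rotated copies.

    backbone: maps a (B, C, H, W) batch to (B, D) feature vectors
    images:   (B, C, H, W) training batch
    """
    base = backbone(images)
    loss = 0.0
    for a in angles:
        k = a // 90                               # number of 90-degree turns
        rotated = torch.rot90(images, k, dims=(2, 3))
        # Squared distance between original and rotated-copy features
        loss = loss + (backbone(rotated) - base).pow(2).sum(dim=1).mean()
    return loss / len(angles)
```

This term would be added to the usual detection loss with a small weight.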
4.4.2. Technique 2: Incorporate context information of aerial image objects
More context information for small objects has been included in detection networks through combined feature maps and dilated convolutions. Feature maps from multiple convolutional layers can be concatenated to form a new feature map. Dilated convolutions can be added to CNN models to improve performance on small-scale object detection.
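Both ingredients are directly available in standard frameworks. A minimal sketch that concatenates parallel dilated convolutions to enlarge the receptive field without losing resolution (channel sizes and dilation rates are illustrative):

```python
import torch
import torch.nn as nn

class ContextHead(nn.Module):
    """Concatenate feature maps from parallel dilated convolutions."""
    def __init__(self, in_ch=256, out_ch=128):
        super().__init__()
        # Same kernel size, increasing dilation -> increasing context
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4)
        ])

    def forward(self, x):
        # Padding matches dilation, so every branch keeps H x W and
        # the outputs can be concatenated along the channel dimension.
        return torch.cat([b(x) for b in self.branches], dim=1)

feat = torch.randn(1, 256, 64, 64)
print(ContextHead()(feat).shape)   # torch.Size([1, 384, 64, 64])
```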
4.4.3. Technique 3: Correct foreground and background class imbalance for aerial image objects
Based on Faster R-CNN, IoU-Adaptive Deformable R-CNN in (Yan et al., 2019) was proposed to address the class imbalance issue in training the classifiers of object detectors. By analyzing the different roles that IoU can play in different parts of the network, an IoU-guided detection framework was proposed to reduce the loss of small object information during training. Besides, an IoU-based weighted loss was designed to learn the IoU information of positive ROIs to improve the detection accuracy. Finally, a class-aspect-ratio-constrained non-maximum suppression (CARC-NMS) was proposed to improve detection precision.
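An IoU-weighted classification loss in this spirit can be sketched as below; this is our hedged reading of the idea, and the exact weighting in (Yan et al., 2019) may differ:

```python
import torch
import torch.nn.functional as F

def iou_weighted_cls_loss(logits, targets, ious):
    """Weight per-ROI cross-entropy by each positive ROI's IoU with its
    matched ground-truth box, so well-localized positives dominate.

    logits:  (N, num_classes) classification scores
    targets: (N,) class indices (0 = background)
    ious:    (N,) IoU of each ROI with its matched box (unused for negatives)
    """
    per_roi = F.cross_entropy(logits, targets, reduction="none")
    # Positives are weighted by IoU; negatives keep weight 1.
    weights = torch.where(targets > 0, ious, torch.ones_like(ious))
    return (weights * per_roi).sum() / weights.sum().clamp(min=1e-6)
```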
4.4.4. Technique 4: Increase training examples for aerial image objects
Multi-scale network models have been proposed to detect objects of various sizes in aerial images, such as Multi-Scale and Rotation-Insensitive Convolutional Channel Features (MsRi-CCF) in (Wu, Hong, Ghamisi, Li, & Tao, 2018). MsRi-CCF was proposed for geospatial object detection by integrating robust low-level feature generation, classifier generation with outlier removal, and detection with a power law.
4.5. Instance segmentation methods for small object detection

Different from the popular bounding-box-based object detectors presented in the previous sections, deep CNNs for instance segmentation have also been applied to object detection. The main drawbacks of segmentation methods are that pixel-wise labeling is time-consuming and that they are compute- and memory-intensive. For small object detection, however, every pixel of an object matters, and using pixel information can generate good results.
FCN (Long, Shelhamer, & Darrell, 2015) was one of the first methods to use CNNs for semantic segmentation. FCN employs CNNs without fully connected layers, which allows the input image to have an arbitrary size. It uses pooling layers to reduce computation time and increase the receptive field size. Based on FCN, U-Net (Ronneberger, Fischer, & Brox, 2015) was proposed with an encoder-decoder architecture to address the issue of determining appropriate numbers of pooling layers. It has a U-shaped architecture to balance the trade-off between good localization accuracy and sufficient context information. In the encoder, it uses pooling layers to gradually reduce the layer size, whereas in the decoder stage, it uses up-convolutions to gradually increase the layer size. Moreover, U-Net uses short-cut connections from the encoder to the decoder to help the decoder recover fine-grained information. Regarding the trade-off between receptive field and localization accuracy, large receptive fields lead to lower localization accuracy; however, when the receptive field is too small, localization accuracy may also decrease due to the lack of context information.
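The encoder-decoder-with-skip pattern can be compressed to a single stage for illustration; a real U-Net stacks several such stages with more convolutions per stage:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One encoder/decoder stage with a skip connection, U-Net style."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # encoder pooling
        self.mid = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)  # up-convolution
        # The skip connection doubles the channels entering the decoder conv
        self.dec = nn.Conv2d(2 * ch, 1, 3, padding=1)

    def forward(self, x):
        e = self.enc(x)
        d = self.up(self.mid(self.down(e)))
        # Short-cut from encoder restores fine-grained detail
        return self.dec(torch.cat([d, e], dim=1))

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # (1, 1, 64, 64)
```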
Mask R-CNN (He et al., 2017) combines FCN and Faster R-CNN. On top of the two predictions that Faster R-CNN generates, (i) bounding box localization and (ii) bounding box recognition, Mask R-CNN added a third output, (iii) an instance mask prediction for segmentation. It also used some new techniques, such as a new RoIAlign layer, multi-task training, and better backbone networks, for further improvement.

For small object detection, some other techniques have also been proposed based on segmentation methods. To include more context information, capsule networks with deconvolutional capsules were proposed to expand the original layers in the network architectures (LaLonde & Bagci, 2018). Segmentations can be refined using bottom-up and top-down network architectures to combine features from different layers (Ronneberger et al., 2015), or using pyramid pooling layers to segment objects at multiple scales as in DeepLab (Chen et al., 2018). To increase training examples for small objects, a more robust embedding can be learned by jointly using unsupervised and supervised learning and combining features from different models to form a multi-scale representation (Lin, Milan, et al., 2017). ConceptMask (Wang, Lin, Shen, Zhang, & Cohen, 2018) used a semi-supervised learning method to train a deep neural network with image-level labels. Then, the results were refined and extended to predict attention maps. Finally, an attention-driven class segmentation network was trained.
5. Performance evaluation of deep learning methods for small object detection

In this section, the performances of representative state-of-the-art object detection methods on widely used public benchmark datasets are presented. The emphasis is on small object detection.

5.1. Dataset

We used several datasets from three different areas: generic object detection, face detection, and object detection in aerial imagery. Images in the generic object detection dataset were mostly collected in everyday living and indoor settings. The objects usually have rigid shapes, and the difficulty of detection usually comes from illumination and background clutter. In comparison, faces have a common structure containing several regions of fixed parts, such as eyes, noses, and mouths, and the relationships between the parts are known a priori. Aerial images, in contrast, were collected under much more diverse conditions, such as cameras mounted underneath airplanes, helicopters, or UAS (drones), which result in straight-down views of the objects, viewpoints very different from those in generic object detection or face detection datasets. Examples from the three datasets are shown in Fig. 2.

1) Generic object detection dataset. A combination of images from the Microsoft Common Objects in Context (COCO) dataset (Lin et al., 2014) and the SUN dataset (Xiao, Hays, Ehinger, Oliva, & Torralba, 2010) was used in the experiments. COCO consists of 82K training and 40K validation images belonging to 80 classes. COCO is a widely used and relatively difficult dataset for object detection, since its object sizes are relatively small compared with other datasets. For our experiments, we selected ten small object categories from COCO, where the largest physical dimension is smaller than 30 cm. The selected object categories are mouse, telephone, switch, outlet, clock, toilet paper, tissue box, faucet, plate, and jar. Then, we used the ground-truth bounding boxes in the COCO and SUN datasets to filter out large objects, creating a dataset containing small objects with small bounding boxes. Table 2 shows the statistics of this small object dataset used in our experiments. It contains about 8393 object instances in 4952 images. The mouse category has the largest number of object instances: 2,173.
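This filtering can be reproduced from the annotations alone. A sketch over COCO-style JSON, using the 50 × 50-pixel limit mentioned in the caption of Fig. 2 (the function name is ours; field names follow the COCO annotation schema):

```python
import json

def small_object_subset(ann_file, max_side=50):
    """Keep only annotations whose boxes fit within max_side pixels."""
    with open(ann_file) as f:
        coco = json.load(f)
    # COCO bbox format is [x, y, width, height]
    small = [a for a in coco["annotations"]
             if a["bbox"][2] < max_side and a["bbox"][3] < max_side]
    image_ids = {a["image_id"] for a in small}
    coco["annotations"] = small
    coco["images"] = [im for im in coco["images"] if im["id"] in image_ids]
    return coco
```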
Fig. 2. Examples of small object subsets of three benchmark datasets used in our experiments for object detection: (a) DOTA, (b) WIDER FACE, and (c) COCO and
SUN. All the objects used in our experiments are smaller than 50 × 50 pixels.
Table 3
Results of three representative detectors on the DOTA dataset. Class abbreviations: BD = baseball diamond, GTF = ground track field, SV = small vehicle, LV = large vehicle, TC = tennis court, BC = basketball court, ST = storage tank, SBF = soccer-ball field, RA = roundabout, SP = swimming pool, HC = helicopter.

Method        mAP    Plane  BD     Bridge  GTF    SV    LV    Ship   TC    BC    ST     SBF    RA    Harbor  SP    HC
Faster R-CNN  0.35   0.574  0.172  0.234   0.175  0.63  0.51  0.721  0.27  0.03  0.434  0.232  0.23  0.506   0.38  0.008
SSD           0.24   0.66   0.0    0.0     0.0    0.60  0.49  0.74   0.09  0.0   0.37   0.0    0.0   0.45    0.22  0.0
YOLOv3        0.32   0.58   0.15   0.20    0.132  0.58  0.41  0.68   0.25  0.01  0.35   0.20   0.21  0.45    0.35  0.0
Table 4
Results of three representative detectors on the small object subsets of the COCO and SUN datasets. TP = toilet paper; TB = tissue box.

Method        mAP    Mouse  Telephone  Outlet  Clock  TP    TB   Faucet  Plate  Jar   Switch
Faster R-CNN  0.241  0.517  0.106      0.368   0.627  0.05  0.0  0.251   0.161  0.06  0.27
SSD           0.17   0.481  0.05       0.155   0.509  0.05  0.0  0.15    0.13   0.02  0.22
YOLOv3        0.23   0.54   0.105      0.397   0.631  0.07  0.0  0.26    0.11   0.06  0.20
Table 5
Results of four representative detectors on a small face subset of the WIDER FACE dataset.

Method        mAP
Faster R-CNN  0.336
SSD           0.246
YOLOv3        0.315
SSH           0.308
In CornerNet (Law & Deng, 2018), a stacked hourglass network (Newell, Yang, & Deng, 2016) was used as the backbone network, and heat maps were predicted for both top-left corners and bottom-right corners. The corners belonging to the same object were further grouped with embedding vectors, which were predicted as the similarities between corners. Corner pooling layers were also proposed for combining the prior location knowledge of corners into the feature extraction process. (Wang, Chen, et al., 2017) represented bounding boxes using center points and four corners (top-left, top-right, bottom-left, and bottom-right). It divided the feature maps into grid cells and, for each cell, predicted the probability of center points and corner points in the cell, the x-offset and y-offset, as well as the probability of a point link. The point link contained two parts: the special index (the probability of a point linking to cells) and the point index (the linking probability between corner points and the center point). Different from other methods that estimated the corner points of bounding boxes, (Zhou, Zhuo, & Krahenbuhl, 2019) estimated the four extreme points on an object with fully appearance-based algorithms. It predicted five feature maps: one for the center point and four for the extreme points, which were extracted from the heat maps with max-pooling. Based on the prediction of any combination of four extreme points, the center was calculated and verified on the center heat map; the combinations with verified center points represented the detected objects. (Duan et al., 2019) improved the network of (Law & Deng, 2018): bounding boxes were represented using three points (two corners and one center point), and center pooling and cascade corner pooling were proposed to extract strong features.
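The grouping step can be sketched as nearest-embedding matching between the two corner sets. Thresholds here are illustrative, and CornerNet's actual decoding additionally filters by corner scores:

```python
import torch

def group_corners(tl_pts, tl_emb, br_pts, br_emb, max_dist=0.5):
    """Pair top-left and bottom-right corners whose 1-D embeddings are
    close, in the spirit of CornerNet's corner grouping.

    tl_pts, br_pts: (N, 2) and (M, 2) corner coordinates (x, y)
    tl_emb, br_emb: (N,) and (M,) predicted embeddings
    Returns a list of (x1, y1, x2, y2) boxes.
    """
    boxes = []
    for i in range(tl_pts.shape[0]):
        d = (br_emb - tl_emb[i]).abs()        # embedding distances
        j = int(d.argmin())
        # Accept the pair if embeddings agree and the geometry is valid
        if d[j] < max_dist and (br_pts[j] > tl_pts[i]).all():
            boxes.append((*tl_pts[i].tolist(), *br_pts[j].tolist()))
    return boxes
```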
CRediT authorship contribution statement

Yang Liu: Conceptualization, Writing - original draft, Investigation, Resources, Formal analysis, Software. Peng Sun: Writing - original draft. Nickolas Wergeles: Resources, Supervision. Yi Shang: Conceptualization, Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Bai, Y., Zhang, Y., Ding, M., & Ghanem, B. (2018). Finding tiny faces in the wild with generative adversarial network (pp. 21–30). Salt Lake City: IEEE Xplore.
Bell, S., Lawrence Zitnick, C., Bala, K., & Girshick, R. (2016). Inside-Outside Net: Detecting objects in context with skip pooling and recurrent neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 2874–2883). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2016/html/Bell_Inside-Outside_Net_Detecting_CVPR_2016_paper.html.
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS – improving object detection with one line of code (pp. 5561–5569). Venice, Italy: IEEE Xplore.
Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. European Conference on Computer Vision, 9908, pp. 354–370. Cham: Springer. https://doi.org/10.1007/978-3-319-46493-0_22
Cao, G., Xie, X., Yang, W., Liao, Q., Shi, G., & Wu, J. (2018). Feature-fused SSD: Fast detection for small objects. Ninth International Conference on Graphic and Image Processing (ICGIP 2017), 10615, p. 106151E. Qingdao, China: Proc. SPIE. https://doi.org/10.1117/12.2304811
Cao, Y., Chen, K., Loy, C. C., & Lin, D. (2020). Prime sample attention in object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 11583–11591). IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_CVPR_2020/html/Cao_Prime_Sample_Attention_in_Object_Detection_CVPR_2020_paper.html.
Chen, C., Liu, M. Y., Tuzel, O., & Xiao, J. (2016). R-CNN for small object detection. Asian Conference on Computer Vision (pp. 214–230). Cham: Springer. https://doi.org/10.1007/978-3-319-54193-8_14
Chen, K., Li, J., Lin, W., See, J., Wang, J., Duan, L., ... Zou, J. (2019). Towards accurate one-stage object detection with AP-loss. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 5119–5127). Long Beach, California: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/Chen_Towards_Accurate_One-Stage_Object_Detection_With_AP-Loss_CVPR_2019_paper.html.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184
Cheng, G., & Han, J. (2016). A survey on object detection in optical remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 117, 11–28. https://doi.org/10.1016/j.isprsjprs.2016.03.014
Cheng, G., Han, J., Zhou, P., & Xu, D. (2018). Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection. IEEE Transactions on Image Processing, 28(1), 265–278. https://doi.org/10.1109/TIP.2018.2867198
Cheng, G., Zhou, P., & Han, J. (2016). Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12), 7405–7415. https://doi.org/10.1109/TGRS.2016.2601622
Divvala, S. K., Hoiem, D., Hays, J. H., Efros, A. A., & Hebert, M. (2009). An empirical study of context in object detection. 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1271–1278). Miami, FL, USA: IEEE. https://doi.org/10.1109/CVPR.2009.5206532
Dong, C., Liu, J., & Xu, F. (2018). Ship detection in optical remote sensing images based on saliency and a rotation-invariant descriptor. Remote Sensing, 10(3), 400. https://doi.org/10.3390/rs10030400
Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Object detection with keypoint triplets. arXiv. Retrieved from https://arxiv.org/abs/1904.08189v1.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338. https://doi.org/10.1007/s11263-009-0275-4
Fu, C., Liu, W., Ranga, A., Tyagi, A., & Berg, A. (2017). DSSD: Deconvolutional single shot detector. arXiv. Retrieved from https://arxiv.org/abs/1701.06659.
Galleguillos, C., & Belongie, S. (2010). Context based object categorization: A critical survey. Computer Vision and Image Understanding, 114(6), 712–722. https://doi.org/10.1016/j.cviu.2010.02.004
Ghiasi, G., Lin, T. Y., & Le, Q. V. (2019). NAS-FPN: Learning scalable feature pyramid architecture for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 7036–7045). Long Beach, California: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/Ghiasi_NAS-FPN_Learning_Scalable_Feature_Pyramid_Architecture_for_Object_Detection_CVPR_2019_paper.html.
Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV; pp. 1440–1448). Santiago, Chile: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_iccv_2015/html/Girshick_Fast_R-CNN_ICCV_2015_paper.html.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 580–587). Columbus, Ohio: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.html.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV; pp. 2961–2969). Venice, Italy: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 770–778). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html.
He, Y., Zhu, C., Wang, J., Savvides, M., & Zhang, X. (2019). Bounding box regression with uncertainty for accurate object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 2888–2897). Long Beach, California: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/He_Bounding_Box_Regression_With_Uncertainty_for_Accurate_Object_Detection_CVPR_2019_paper.html.
Hosang, J., Benenson, R., & Schiele, B. (2017). Learning non-maximum suppression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 4507–4515). Honolulu, Hawaii: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/html/Hosang_Learning_Non-Maximum_Suppression_CVPR_2017_paper.html.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv. Retrieved from https://arxiv.org/abs/1704.04861.
Hu, P., & Ramanan, D. (2017). Finding tiny faces. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, Hawaii: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/html/Hu_Finding_Tiny_Faces_CVPR_2017_paper.html.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 4700–4708). Honolulu, Hawaii. Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/html/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.html.
Jeong, J., Park, H., & Kwak, N. (2017). Enhancement of SSD by concatenating feature maps for object detection. arXiv. Retrieved from https://arxiv.org/abs/1705.09587.
Jiao, L., Zhang, F., Liu, F., Yang, S., Li, L., Feng, Z., et al. (2019). A survey of deep learning-based object detection. IEEE Access, 7, 128837–128868. https://doi.org/10.1109/ACCESS.2019.2939201
Kim, K., Hong, S., Roh, B., Kim, K. H., Cheon, Y., & Park, M. (2016). PVANet: Lightweight deep neural networks for real-time object detection. arXiv. Retrieved from https://arxiv.org/abs/1611.08588.
Kim, Y., Kang, B. N., & Kim, D. (2018). SAN: Learning relationship between convolutional features for multi-scale object detection. Proceedings of the European Conference on Computer Vision (ECCV; pp. 316–331). Munich, Germany. Retrieved from https://openaccess.thecvf.com/content_ECCV_2018/html/Kim_SAN_Learning_Relationship_ECCV_2018_paper.html.
Kong, T., Sun, F., Tan, C., Liu, H., & Huang, W. (2018). Deep feature pyramid reconfiguration for object detection. Proceedings of the European Conference on Computer Vision (ECCV; pp. 169–185). Munich, Germany. Retrieved from https://openaccess.thecvf.com/content_ECCV_2018/html/Tao_Kong_Deep_Feature_Pyramid_ECCV_2018_paper.html.
Kong, T., Yao, A., Chen, Y., & Sun, F. (2016). HyperNet: Towards accurate region proposal generation and joint object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 845–853).
LaLonde, R., & Bagci, U. (2018). Capsules for object segmentation. arXiv. Retrieved from https://arxiv.org/abs/1804.04241.
Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. Proceedings of the European Conference on Computer Vision (ECCV) (pp. 734–750).
Lee, S., Kwak, S., & Cho, M. (2018). Universal bounding box regression and its applications. Asian Conference on Computer Vision (pp. 373–387). Cham: Springer.
Leevy, J. L., Khoshgoftaar, T. M., Bauder, R. A., & Seliya, N. (2018). A survey on addressing high-class imbalance in big data. Journal of Big Data, 5(1), 42. https://doi.org/10.1186/s40537-018-0151-6
Li, Z., & Zhou, F. (2017). FSSD: Feature fusion single shot multibox detector. arXiv. Retrieved from https://arxiv.org/abs/1712.00960.
Li, Z., Tang, X., Han, J., Liu, J., & He, R. (2019). PyramidBox++: High performance detector for finding tiny face. arXiv. Retrieved from https://arxiv.org/abs/1904.00386.
Lin, G., Milan, A., Shen, C., & Reid, I. (2017). RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 1925–1934). Honolulu, Hawaii: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/html/Lin_RefineNet_Multi-Path_Refinement_CVPR_2017_paper.html.
Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2117–2125).
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV) (pp. 2980–2988).
Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. (2014). Microsoft COCO: Common objects in context. European Conference on Computer Vision (pp. 740–755). Cham: Springer. https://doi.org/10.1007/978-3-319-10602-1_48
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2020). Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2), 261–318. https://doi.org/10.1007/s11263-019-01247-4
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. (2016). SSD: Single shot multibox detector. European Conference on Computer Vision (pp. 21–37). Cham: Springer.
Liu, Y. (2020). GitHub repository. Retrieved from https://github.com/ylt5b/A-Survey-and-Performance-Evaluation-of-Deep-Learning-Methods-for-Small-Object-Detection.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 3431–3440). Boston: IEEE Xplore. Retrieved from https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html.
Luo, S., Li, X., Zhu, R., & Zhang, X. (2019). SFA: Small faces attention face detector. IEEE Access, 7, 171609–171620.
Najibi, M., Samangouei, P., Chellappa, R., & Davis, L. S. (2017). SSH: Single stage headless face detector. Proceedings of the IEEE International Conference on Computer Vision (ICCV) (pp. 4875–4884).
Najibi, M., Singh, B., & Davis, L. S. (2019a). FA-RPN: Floating region proposals for face detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 7723–7732). Long Beach, California: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/Najibi_FA-RPN_Floating_Region_Proposals_for_Face_Detection_CVPR_2019_paper.html.
Najibi, M., Singh, B., & Davis, L. S. (2019b). AutoFocus: Efficient multi-scale inference. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 9745–9755).
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. European Conference on Computer Vision (pp. 483–499). Cham: Springer.
Nguyen, N., Do, T., Ngo, T., & Le, D. (2020). An evaluation of deep learning methods for small object detection. 2020, 18. https://doi.org/10.1155/2020/3189691
Oksuz, K., Cam, B., Kalkan, S., & Akbas, E. (2019). Imbalance problems in object detection: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1. https://doi.org/10.1109/TPAMI.2020.2981890
Oliva, A., & Torralba, A. (2007). The role of context in object recognition. Trends in Cognitive Sciences, 11(12), 520–527. https://doi.org/10.1016/j.tics.2007.09.009
Palmer, T. (1975). The effects of contextual scenes on the identification of objects. Memory & Cognition, 3, 519–526. https://doi.org/10.3758/BF03197524
Pang, J., Chen, K., Shi, J., Feng, H., Ouyang, W., & Lin, D. (2019). Libra R-CNN: Towards balanced learning for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 821–830).
Pang, Y., Wang, T., Anwer, R., Khan, F., & Shao, L. (2019). Efficient featurized image pyramid network for single shot detector. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7336–7344).
Peng, T., Wang, X., Wang, A., Yan, Y., Liu, W., Huang, J., & Yuille, A. (2018). Weakly supervised region proposal network and object detection. Proceedings of the European Conference on Computer Vision (ECCV) (pp. 352–368).
Pont-Tuset, J., Arbeláez, P., Barron, J. T., Marques, F., & Malik, J. (2016). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140. https://doi.org/10.1109/TPAMI.2016.2537320
Qian, Q., Chen, L., Li, H., & Jin, R. (2019). DR loss: Improving object detection by distributional ranking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 12164–12172).
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 7263–7271). Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/html/Redmon_YOLO9000_Better_Faster_CVPR_2017_paper.html.
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv. Retrieved from https://arxiv.org/abs/1804.02767.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (pp. 91–99).
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). Generalized intersection over union: A metric and a loss for bounding box regression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 658–666).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 9351, pp. 234–241. Cham: Springer. https://doi.org/10.1007/978-3-319-24574-4_28
Rothe, R., Guillaumin, M., & Van Gool, L. (2014). Non-maximum suppression for object detection by passing messages between windows. Asian Conference on Computer Vision (pp. 290–306). Cham: Springer.
Samangouei, P., Chellappa, R., Najibi, M., & Davis, L. S. (2018). Face-MagNet: Magnifying feature maps to detect small faces. IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Tahoe, NV, USA: IEEE. https://doi.org/10.1109/WACV.2018.00020
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 761–769).
Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016). Beyond skip connections: Top-down modulation for object detection. arXiv. Retrieved from https://arxiv.org/abs/1612.06851.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv. Retrieved from https://arxiv.org/abs/1409.1556.
Singh, B., Najibi, M., & Davis, L. S. (2018). SNIPER: Efficient multi-scale training. Advances in Neural Information Processing Systems, 9310–9320.
Soviany, P., & Ionescu, R. T. (2018a). Optimizing the trade-off between single-stage and two-stage deep object detectors using image difficulty prediction. Timisoara, Romania: IEEE Xplore. https://doi.org/10.1109/SYNASC.2018.00041
Soviany, P., & Ionescu, R. T. (2018b). Frustratingly easy trade-off optimization between single-stage and two-stage deep object detectors. Proceedings of the European Conference on Computer Vision (ECCV).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826).
Tang, X., Du, D., He, Z., & Liu, J. (2018). PyramidBox: A context-assisted single shot face detector. Proceedings of the European Conference on Computer Vision (ECCV; pp. 797–813). Retrieved from https://openaccess.thecvf.com/content_ECCV_2018/html/Xu_Tang_PyramidBox_A_Context-assisted_ECCV_2018_paper.html.
Tian, W., Wang, Z., Shen, H., Deng, W., Meng, Y., Chen, B., ... Huang, X. (2018). Learning better features for face detection with feature fusion and segmentation supervision. arXiv. Retrieved from https://arxiv.org/abs/1811.08557.
Tong, K., Wu, Y., & Zhou, F. (2020). Recent advances in small object detection based on deep learning: A review. Image and Vision Computing, 97. https://doi.org/10.1016/j.imavis.2020.103910
Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision system for place and object recognition. Proceedings Ninth IEEE International Conference on Computer Vision (pp. 273–280). Nice, France: IEEE Xplore. https://doi.org/10.1109/ICCV.2003.1238354
Tychsen-Smith, L., & Petersson, L. (2017). DeNet: Scalable real-time object detection with directed sparse sampling. Proceedings of the IEEE International Conference on Computer Vision (pp. 428–436). Venice, Italy: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_iccv_2017/html/Tychsen-Smith_DeNet_Scalable_Real-Time_ICCV_2017_paper.html.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171. https://doi.org/10.1007/s11263-013-0620-5
Wang, H., Li, Z., Ji, X., & Wang, Y. (2017). Face R-CNN. arXiv. Retrieved from https://arxiv.org/abs/1706.01061.
Wang, J., Chen, K., Yang, S., Loy, C. C., & Lin, D. (2019). Region proposal by guided anchoring. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 2965–2974). Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/Wang_Region_Proposal_by_Guided_Anchoring_CVPR_2019_paper.html.
Wang, J., Yuan, Y., & Yu, G. (2017). Face attention network: An effective face detector for the occluded faces. arXiv. Retrieved from https://arxiv.org/abs/1711.07246.
Wang, X., Chen, K., Huang, Z., Yao, C., & Liu, W. (2017). Point linking network for object detection. arXiv. Retrieved from https://arxiv.org/abs/1706.03646.
Wang, Y., Lin, Z., Shen, X., Zhang, J., & Cohen, S. (2018). ConceptMask: Large-scale segmentation from semantic concepts. Proceedings of the European Conference on Computer Vision (ECCV; pp. 530–546). Munich, Germany. Retrieved from https://openaccess.thecvf.com/content_ECCV_2018/html/Yufei_Wang_ConceptMask_Large-Scale_Segmentation_ECCV_2018_paper.html.
Wu, X., Hong, D., Ghamisi, P., Li, W., & Tao, R. (2018). MsRi-CCF: Multi-scale and rotation-insensitive convolutional channel features for geospatial object detection. Remote Sensing. https://doi.org/10.3390/rs10121990
Wu, X., Sahoo, D., & Hoi, C. S. (2020). Recent advances in deep learning for object detection. Neurocomputing, 396(5), 39–64. https://doi.org/10.1016/j.neucom.2020.01.085
Xia, G., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., ... Pelillo, M. (2018). DOTA: A large-scale dataset for object detection in aerial images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 3974–3983). Salt Lake City: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2018/html/Xia_DOTA_A_Large-Scale_CVPR_2018_paper.html.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 3485–3492). San Francisco, CA, USA: IEEE. https://doi.org/10.1109/CVPR.2010.5539970
Yan, J., Wang, H., Yan, M., Diao, W., Sun, X., & Li, H. (2019). IoU-adaptive deformable R-CNN: Make full use of IoU for multi-class object detection in remote sensing imagery. Remote Sensing, 11(3), 286. https://doi.org/10.3390/rs11030286
Yang, F., Choi, W., & Lin, Y. (2016). Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 2129–2137). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2016/html/Yang_Exploit_All_the_CVPR_2016_paper.html.
Yang, S., Luo, P., Loy, C., & Tang, X. (2016). WIDER FACE: A face detection benchmark. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 5525–5533). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2016/html/Yang_WIDER_FACE_A_CVPR_2016_paper.html.
Yang, X., Liu, Q., Yan, J., & Li, A. (2019). R3Det: Refined single-stage detector with feature refinement for rotating object. arXiv. Retrieved from https://arxiv.org/abs/1908.05612.
Yang, X., Sun, H., Fu, K., Yang, J., Sun, X., Yan, M., et al. (2018). Automatic ship detection in remote sensing images from Google Earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sensing, 10(1), 132. https://doi.org/10.3390/rs10010132
Yoo, Y., Dan, D., & Yun, S. (2019). EXTD: Extremely tiny face detector via iterative filter reuse. arXiv. Retrieved from https://arxiv.org/abs/1906.06579.
Yu, F., Wang, D., Shelhamer, E., & Darrell, T. (2018). Deep layer aggregation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 2403–2412). Salt Lake City: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2018/html/Yu_Deep_Layer_Aggregation_CVPR_2018_paper.html.
Yu, J., Jiang, Y., Wang, Z., Cao, Z., & Huang, T. (2016). UnitBox: An advanced object detection network. Proceedings of the 24th ACM International Conference on Multimedia (pp. 516–520). Retrieved from https://dl.acm.org/doi/abs/10.1145/2964284.2967274.
Zagoruyko, S., Lerer, A., Lin, T.-Y., Pinheiro, P., Gross, S., Chintala, S., & Dollár, P. (2016). A multipath network for object detection. arXiv. Retrieved from https://arxiv.org/abs/1604.02135.
Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-shot refinement neural network for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 4203–4212). Salt Lake City: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_Single-Shot_Refinement_Neural_CVPR_2018_paper.html.
Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., & Li, S. Z. (2017). S3FD: Single shot scale-invariant face detector. Proceedings of the IEEE International Conference on Computer Vision (ICCV; pp. 192–201). Venice, Italy: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_iccv_2017/html/Zhang_S3FD_Single_Shot_ICCV_2017_paper.html.
Zhang, Z., Qiao, S., Xie, C., Shen, W., Wang, B., & Yuille, A. (2018). Single-shot object detection with enriched semantics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5813–5821).
Zhao, Z.-Q., Zheng, P., Xu, S.-T., & Wu, X. (2019). Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232. https://doi.org/10.1109/TNNLS.2018.2876865
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). Distance-IoU loss: Faster and better learning for bounding box regression. AAAI (pp. 12993–13000).
Zhou, X., Zhuo, J., & Krahenbuhl, P. (2019). Bottom-up object detection by grouping extreme and center points. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR; pp. 850–859). Long Beach, California. Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/Zhou_Bottom-Up_Object_Detection_by_Grouping_Extreme_and_Center_Points_CVPR_2019_paper.html.
Zhu, C., Tao, R., Luu, K., & Savvides, M. (2018). Seeing small faces from robust anchor's perspective. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR; pp. 5127–5136). Salt Lake City: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2018/html/Zhu_Seeing_Small_Faces_CVPR_2018_paper.html.
Zhu, C., Zheng, Y., Luu, K., & Savvides, M. (2017). CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection. Deep Learning for Biometrics. Advances in Computer Vision and Pattern Recognition (pp. 57–79). https://doi.org/10.1007/978-3-319-61657-5_3
Zou, Z., Shi, Z., Guo, Y., & Ye, J. (2019). Object detection in 20 years: A survey. arXiv. Retrieved from https://arxiv.org/abs/1905.05055.