0% found this document useful (0 votes)

94 views14 pages

A Survey and Performance Evaluation of Deep Learning Methods For Small 2021

This document provides a survey and performance evaluation of deep learning methods for small object detection. It first reviews challenges in detecting small objects and techniques developed to address these challenges, including fusing feature maps, adding context information, and handling class imbalance. It then summarizes methods from four related areas: generic object detection, face detection, aerial imagery detection, and segmentation. Experimental results show that while accuracy on small objects is low, Faster R-CNN performs the best of the evaluated methods, with YOLOv3 close behind.

Uploaded by

sadeq

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

94 views14 pages

A Survey and Performance Evaluation of Deep Learning Methods For Small 2021

Uploaded by

sadeq

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Expert Systems With Applications 172 (2021) 114602

Contents lists available at ScienceDirect

Expert Systems With Applications

journal homepage: www.elsevier.com/locate/eswa

A survey and performance evaluation of deep learning methods for small

object detection
Yang Liu *, Peng Sun , Nickolas Wergeles , Yi Shang
Department of Electrical Engineering and Computer Science (EECS), University of Missouri, 201 Naka Hall, Columbia, MO, 65201, USA

A R T I C L E I N F O A B S T R A C T

Keywords: In computer vision, significant advances have been made on object detection with the rapid development of deep
Small object detection convolutional neural networks (CNN). This paper provides a comprehensive review of recently developed deep
Computer vision learning methods for small object detection. We summarize challenges and solutions of small object detection,
Convolutional neural networks
and present major deep learning techniques, including fusing feature maps, adding context information,
Deep learning
balancing foreground-background examples, and creating sufficient positive examples. We discuss related
techniques developed in four research areas, including generic object detection, face detection, object detection
in aerial imagery, and segmentation. In addition, this paper compares the performances of several leading deep
learning methods for small object detection, including YOLOv3, Faster R-CNN, and SSD, based on three large
benchmark datasets of small objects. Our experimental results show that while the detection accuracy on small
objects by these deep learning methods was low, less than 0.4, Faster R-CNN performed the best, while YOLOv3
was a close second.

1. Introduction Moreover, techniques to improve classification accuracy, such as those

addressing imbalanced class examples and insufficient training, have
Object detection is one of the fundamental tasks in computer vision. achieved good results.
Typically, object detection and recognition involve two steps: first, the
potential location of each target object is localized; then, the objects are
classified into different categories. Before the bloom of deep learning 1.1. Scope of this paper
methods, object detection methods relied on manually designed features
and designed classifiers based on how humans understand objects. In This paper focuses on deep learning techniques for detecting small
recent years, the field of object detection has dramatically advanced due objects in images. We provide a comprehensive review of related object
to the success of deep learning, especially deep convolutional neural detection and instance segmentation methods. We identify and analyze
networks (CNN). Object detection has been widely used in many ap major challenges and summarize strategies for improving detection
plications, such as autonomous driving, visual search, virtual reality performance on small objects. First, we discuss the challenges in four
(VR), and augmented reality (AR), etc. aspects: 1) features generated by individual layers in basic CNNs do not
Even though accurate detection of medium and large-size objects in contain sufficient information for small object detection; 2) context in
images has been achieved in many applications, accurate detection of formation is lacking for small object detection; 3) imbalance of fore
small objects, such as a 20x20 pixel duck in an aerial image, remains ground and background training examples make classification difficult;
challenging. Small objects are difficult to detect due to indistinguishable and 4) insufficient positive training examples for small objects. Then, we
features, low-resolution, complicated backgrounds, limited context in summarize existing techniques for small objects from the perspective of
formation, etc. This is an active research area and many deep learning 1) combining multiple feature maps, 2) adding context information, 3)
techniques have been developed in recent years with promising results. balancing class examples, and 4) creating sufficient number of positive
Some work showed the importance of combining different feature examples. We present related techniques developed in four different
layers, while others showed contextual infomation is very useful. research areas, including generic object detection, face detection, object
detection in aerial images, and instance segmentation. In the end, we

* Corresponding author.
E-mail addresses: [email protected] (Y. Liu), [email protected] (P. Sun), [email protected] (N. Wergeles), [email protected] (Y. Shang).

https://doi.org/10.1016/j.eswa.2021.114602
Received 9 April 2020; Received in revised form 13 October 2020; Accepted 10 January 2021
Available online 19 January 2021
0957-4174/© 2021 Elsevier Ltd. All rights reserved.
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

report our experimental results of comparing the performances of and frameworks for small object detection. Section 4 identifies the
several state-of-the-art deep learning methods on benchmark datasets challenges and solutions for small object detection and major techniques
focusing on small objects. developed in four related research areas. Section 5 presents experi
The main contributions of this paper are as follows: mental results of several leading deep learning methods for small object
detection on three benchmark datasets of small objects. Finally, Section
• Provide a comprehensive review of the state-of-the-art deep learning VI discusses some future research directions.
techniques on small object detection.
• Identify challenges for small object detection in four specific aspects, 2. Overview of deep learning methods for image-based object
summarize major components of deep learning methods, and cate detection
gorize existing methods in four aspects.
• Analyze and connect related techniques from four research areas, 2.1. Problem definition
including generic object detection, face detection, object detection in
aerial imagery, and segmentation. The goal of image-based object detection is to detect instances of
• Empirical performance evaluation of some state-of-the-art deep objects of predefined classes in images and draw a tight bounding box
learning methods on three benchmark datasets of small objects. around each object. More specifically, the object detection consists of
two tasks: object localization and classification, i.e., finding where ob
1.2. Comparison with previous survey papers jects are located in an image and determining which predefined class
each object belongs to.
The survey in (Zou, Shi, Guo, & Ye, 2019) covered object detection
methods in the past 20 years, including both traditional detection 2.2. Major components of deep learning methods
methods and deep learning methods. This paper focuses on deep
learning methods for small object detection developed in the last 5 In this section, we summarize the major components of deep learning
years. (Zou et al., 2019) surveyed deep learning methods for generic methods for image-based object detection, which include backbone
object detection, whereas this paper includes methods developed in four networks, region proposals, anchors, object classification, bounding box
research areas, including generic object detection, face detection, object regression, loss functions, and non-maximum suppression.
detection in aerial imagery, and segmentation. (Leevy, Khoshgoftaar,
Bauder, & Seliya, 2018; Oksuz, Cam, Kalkan, & Akbas, 2019) focused on 2.2.1. Backbone networks
methods to overcome the class imbalance problem. (Zhao, Zheng, Xu, & Backbone networks in deep neural network-based object detectors
Wu , 2019) reviewed several state-of-the-art deep learning frameworks are used to extract high-level features from input images. Most
in several object detection tasks and analyzed different methods with commonly used backbone networks are derived from deep neural
experimental results on general object detection. (Liu et al., 2020; Jiao network image classifiers that performed well on large-scale image
et al., 2019) reviewed the deep learning methods and techniques for classification datasets, such as the ImageNet classification dataset
object detection. However, they did not provide the experimental (Huang, Liu, Van Der Maaten, & Weinberger, 2017; Szegedy et al., 2015;
analysis for these deep learning methods. (Wu, Sahoo, & Hoi, 2020) Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016; Newell, Yang, &
reviewed object detection components, models and learning strategies. Deng, 2016; He, Zhang, Ren, & Sun, 2016; Howard et al., 2017;
Even though these works provide a comprehensive review, their focus is Simonyan & Zisserman, 2014). Typically, the last classification layers
on general size objects, not small objects. are removed from these image classifiers and the rest of the layers are
Recently, there are also some reviews on the small object detection. used as the backbone networks. Based on the backbone networks,
Nguyen, Do, Ngo, and Le (2020)) provided a review of the existing ob detection layers are appended to form complete object detectors.
ject detection methods for small objects and focused on performance The main design objectives of backbone networks are high detection
evaluation on four models. In comparison, this paper presents a more accuracy and computational efficiency. Some popular backbone net
depth and comprehensive review and different perspective of chal works are as follows.
lenges. Moreover, we summarize the major components of existing deep
learning methods, categorize existing detection approaches in four as • VGGNets (Simonyan & Zisserman, 2014) that use small filters of size
pects, connect and analyze current deep learning methods from four 3 by 3 pixels in their convolutional layers, followed by 2 by 2 max
separate application areas of object detection, and evaluate three pooling. VGG16 has 13 convolutional layers, whereas VGG19 has 16
models’ performances on three different datasets. Tong, Wu, and Zhou convolutional layers. VGG won the ImageNet Challenge in 2014 and
(2020) mainly reviewed existing methods from five aspects to improve is still one of the most widely used networks.
small object detection and analyzed experimental results on two data • Residual networks, or ResNets (He et al., 2016), in which residual
sets. In comparison, this paper not only summarizes existing methods in blocks were proposed to make training very deep networks possible
different aspects, but also identifies and analyzes key challenges in four by overcoming the gradient vanish problem in back propagation by
specific aspects, connects and analyzes solutions from several related adding a skip connection directly from input of each module. There
research areas, and presents empirical results on different datasets. are several variations of residual networks. The most used versions
In summary, this paper differs from the previous review papers in are ResNet50 and ResNet101. ResNet is much deeper than VGGNet.
several aspects. First, our review is focus on small objects. Secondly, our ResNet won the ImageNet 2015 classification task.
review includes summaries of major detection components and state-of- • Inception networks (Szegedy et al., 2015, 2016) that increased the
the-art object detection frameworks. Thirdly, we identify the challenges depth and width of networks without increasing computational
for small object detection, and summarize major techniques to improve complexity. The Inception module consists of 1x1, 3x3, and 5x5 filter
small object detection accuracy. In addition, we analyze and connect size convolution layers and max pooling layers stacked parallel with
techniques from four small object detection application areas, which each other. Multiple scales of feature can be extracted simulta
cover a wide range of small object detection tasks. Finally, we provide neously in one layer. Inception networks are much faster than
empirical comparison of three representative deep learning frameworks VGGNet.
on small object benchmark datasets. • DenseNet (Huang et al., 2017), in which each layer is densely con
The rest of the paper is organized as follows. Section 2 presents an nected to all other layers in a forward manor, so that lower level
overview of deep learning methods and major components for object features are used by all latter layers. DenseNet can alleviate the
detection in images. Section 3 presents major deep learning approaches vanishing-gradient problem.

2
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Many techniques have been developed to improve the backbone dataset.

networks. For example, Hourglass networks have been designed for During training, anchors are matched with the ground-truth
capturing multi-scale feature and has been widely used in pose estima bounding boxes by the IoU (intersection over union) score. Usually,
tion and object detection (Newell et al., 2016). A simple hourglass for each bounding box, the anchors with the highest scores or their IoU
module consists of convolutional layers and max pooling layers to with ground-truth bounding boxes are higher than a threshold are
extract features to a low resolution. After several pooling layers, the labeled as positive examples. Those anchors, whose IoU scores with
network uses upsampling layers and combined features from all ground-truth boxes are lower than a threshold, are labeled as negative
different sizes of scaled images. At the end, two 1x1 convolutional layers examples. The positive and negative examples are used for training the
are applied to generate the final prediction. Stacked multiple hourglass classifier for object classification. Only the positive examples are further
modules followings bottom-up and top-down approaches across regressed to locate the position of objects. (Wang, Chen, Yang, Loy, &
different sizes of scaled images. For deep learning models on embedded Lin, 2019) proprosed a new anchor sheme called guided anchoring, by
devices, (Howard et al., 2017) proposed a lightweight network that can using semantic features to dynamicly guid anchor generation. More
run on a mobile device. It replaced a standard convolution layer by a specifically, the anchors are generated on the feature maps where the
depth-wise convolution and a 1x1 pointwise convolution, and dramat prediciotn probability is higher than a certain threshold.
ically reduced the amount of computation compared to other deep
neural networks. PvaNet (Kim et al., 2016) designed a light and thin 2.2.4. Object classification
feature extraction using CReLU, Inception module and multi-scale The goal of object classification is to predict the class of an object in
outputs. an image, i.e. predict the probabilities of different class labels for a given
region of interest (RoI). Since the localization problem is handled by
2.2.2. Region proposals bounding box regression, the object classification component is similar
Major methods for generating region proposals include constrained to a standard classification problem applied to each RoI, instead of the
parametric min-cuts (CPMC), multi-scale combinatorial grouping, se whole images. The object classification and bounding box regression
lective search and region proposal network (RPN). CPMC (Carreira and tasks share the same backbone network for generating good results.
Sminchisescu, 2011) is to learn a model to rank the segments. Specif
ically, a large set of features are extracted from the segments, such as 2.2.5. Bounding box regression
graph features, region features and Gestalt features. The segments are Bounding box regression is to learn a transformation to map pre
ranked based on their similarity to ground truth. Therefore, the ranking dicted bounding boxes to the corresponding ground-truth bounding
problem is modeled as a regression problem to predict similarity to boxes. It is either applied to proposals, which is used in the first stage of
ground truth. The model is trained on images with various degree of two-stage object detectors or is used to refine estimated bounding boxes
background bias. Then the overlap scores between segments are maxi to make them more accurate. In (Girshick, Donahue, Darrell, & Malik,
mized. Multiscale combination grouping (Pont-Tuset, Arbeláez, Barron, 2014), a linear regression model with the CNN features is trained to
Marques, & Malik, 2016) proposed a hierarchical segmentation method better localize bounding boxes. It has four transformation parameters for
that considers the multiscale information. The image pyramid is con the center (x and y coordinates) and the width and height of a region
structed based on subsampling and supersampling. Single-scale seg proposal. For a region proposal (P) and ground truth (G), the four pa
mentation is applied on each image resolution. After rescaling and rameters are computed as follows:
alignment, the different segmentation maps are combined. The combi
tx = (Gx − Px )/Pw (1)
nation of multiscale regions is further processed together. Selective
Search (Uijlings, Van De Sande, Gevers, & Smeulders, 2013) is a ( )/
ty = Gy − Py Ph (2)
segmentation-based method widely used in object detection. It groups
candidate regions hierarchically and generates informative locations
tw = log(Gw /Pw ) (3)
containing category-independent objects. However, it is not neural
network based, cannot be trained using a dataset, and is slow. RPN (Ren, th = log(Gh /Ph ) (4)
He, Girshick, & Sun, 2015) is the first CNN based network and can be
trained with other detection network. It predicts object regions and Where P =
i
(Pix , Piy , Piw , Pih )
specifies the pixel coordinates of the cen
( )
object classification confidence scores simultaneously. The input of RPN ter of proposal P . Similar G = Gx , Gy , Gw , Gh specifies the ground-truth
i

is an image, and its output contains regions of interests (RoIs) with bounding box. Various bounding box regression loss functions have been
object scores. Specifically, it uses small networks as sliding windows proposed. In (He, Zhu, Wang, Savvides & Zhang, 2019), KL loss (Kull
over the convolutional layers. Each of the sliding windows corresponds back-Leibler) was proposed based on the Kullback-Leibler Divergence of
to one of the regions in the input image and can be viewed as region the predicted bounding box distribution and ground truth distribution.
proposals with various scales. The features are fed into two prediction In (Lee, Kwak, & Cho, 2018), a bounding box regression neural network
layers: classification layer and box regression layer. The classification was proposed to be trained separately with convolution layers, fully
layer performs binary classification to predict if the region contains any connected layers and ROI-Align layer to minimize IoU loss. (Yu, Jiang,
objects or not. Wang, Cao, & Huang, 2016) proposed the IoU loss for bounding box
regression and regress all the bounding box variables together. The IoU
2.2.3. Anchors loss is not sensitive to scale invariant (Rezatofighi et al., 2019) found
Anchors, also called anchor boxes, were first proposed in (Ren et al., that IoU does not provide a strong relationship with minimizing the
2015). Anchors are a set of pre-defined bounding boxes with various ln -norms loss function. Therefore, GIoU is proposed to solve this problem
scales and ratios placed regularly on the feature maps. Anchors at by not only considering the overlap area, but also focus on
different locations on feature maps are projected back to the input im no-overlapping area. (Zheng et al., 2020) proposed DIoU loss to solve
ages, to be matched with the ground-truth bounding boxes. The stride the slow converge issues in earlier work. Moreover, CIoU is further
for each feature map is calculated by H/h and W/w, where H and W are proposed by considering overlap area, central point distance asd well as
the height and width of an input image, respectively, and h and w are the aspect ratio.
height and width of a certain feature map, respectively. The scales and
ratios of anchors are usually pre-defined to maximize match ground- 2.2.6. Loss functions
truth bounding boxes. Some researchers used unsupervised clustering In object detection, multi-task loss functions have been widely used
methods to calculate the scales and ratios directly from the training

3
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

to simultaneously minimize errors of object classification and bounding divide the detection problem into two stages: region proposal stage and
boxes regression. For classification, softmax loss has been applied for detection stage. The goal of the region proposal stage is to generate the
calculating foreground background classes. For regression loss, regions where objects may exist. It outputs the regions locations as well
SmoothL1, as defined below, has been widely used. This loss is only as the object score, 0 or 1, to indicate if there exist objects. In the
computed for the bounding boxes predicted as foreground class (i.e. detection stage, the candidate region proposals are classified into
objects). different classes. This stage outputs class probabilities and, optionally,
refined region locations. Two-stage detectors achieved state-of-the-art
L(p, u, tu , v) = Lcls (p, u) + λ[u ≥ 1]Lloc (tu , v) (5)
performance; Yet their running speeds were typically slow.
∑ ( ) One-stage object detectors, such as YOLO (Redmon, Divvala, Gir
Lloc (tu , v) = smoothL1 tui − vi (6)
i∈{x,y,w,h}
shick, & Farhadi, 2016) and SSD (Liu et al., 2016), perform region
proposal and detection in one deep neural network. Initial regions are
where pre-defined anchors with various scales and ratios, tiled densely on the
{ image. From the initial anchors, the detectors find those that likely
0.5x2 if |x| < 1
smoothL1 (x) = (7) contain objects. Compared with two-stage detectors, one-stage detectors
|x| − 0.5otherwise
are usually much faster, but achieve less accurate solutions.
in which p is the probability distribution (per ROI) over all cate Some previous works empirically evaluated the performances of
gories, p = (p0 , ⋯, pk )p is computed by a softmax layer. Lcls (p, u) = some one-stage detectors and two-stage detectors (Soviany & Ionescu,
− logpu is the log loss for the true class u. The second loss is defined over a 2018a, 2018b). It compared the detection accuracy and time between
( )
tuple of true bounding-box regression for class u, v = ux , uy , uw , uh , and some two-stage detectors and one-stage detectors and concluded that,
( )
on average, one-stage detectors asssre faster than two-stage detsssectors,
a predicted tuple tu = txu , tyu , twu , thu The L1 loss is used here since it is less
while two-stage detectors tend to be more accurate than one-stage de
sensitive to outliers compared to the L2 loss. tectors. (Soviany & Ionescu, 2018b) proposed separating images into
The sentence has been re-written as follows: However, because of the easy and hard category and training different models on different cat
lack of foreground objects in most input images, it is difficult to capture egories to achieve faster speed and higher accuracy.
effective patterns and information of foreground objects in the training
phase. To overcome this challenge, another type of multi-task loss 3.2. Representative two-stage detectors
function has been proposed for object detection to improve the perfor
mance. (Lin, Goyal, Girshick, He, & Dollár, 2017). In (Lin, Goyal, et al., R-CNN (Girshick et al., 2014) was the first work transferring deep
2017), Focal Loss was proposed to address the imbalance problem be CNN classification results from ImageNet to object detection. R-CNN
tween foreground and background examples, by down-weighting the adopted the two-stage approach of separate region proposal and object
easy examples. classification. Each region proposal was warped and fed into a deep
FL(pt ) = − αt (1 − pt )r log(pt ) (8) CNN, and a 4096-dimension feature vector was extracted. During
training, the network was first pre-trained on the ImageNet dataset using
Where y is the label, and p if the model’s estimated probability. γ is image level label only. R-CNN made a breakthrough in object detection
the focusing prarmeter, which reduce the weight on easy examples and α and improved detection accuracy on the VOC 2012 dataset by more than
is to control the foreground and background examples balance. 30%. This work successfully demonstrated the superior trained CNN
features compared with human designed features. However, the CNN
2.2.7. Non-maximum suppression (NMS) feature extraction was applied to each region proposal independently,
NMS serves as the post-processing step in the inference phase to which made the network very slow.
remove redundant overlapping detection results for the same object. A Fast R-CNN (Girshick, 2015) improved R-CNN by addressing two
greedy method is to first sort all detection boxes according to their major issues. First, the training of R-CNN for classification and bounding
scores, and then greedily select the detection boxes with the highest box regression was done separately in two different stages. Fast R-CNN
scores and suppress other regions if they overlap significantly with the combined the training for classification and regression. It had two
selected boxes, i.e., their intersection-over-union (IoU) score is higher output layers: one for the classification scores and the other for the
than a designed threshold. NMS is usually applied on each object class bounding boxes location offset represented as four values, i.e. the ×, y
independently. In (Rothe, Guillaumin, & Van Gool, 2014), improved coordinates of bounding box’s center point, and the width and height of
solutions were found by treating the problem as a message-passing the bounding box. A new multi-task loss function was proposed for
clustering problem and learning the threshold parameters from simultaneously training. Secondly, the feature extraction in R-CNN was
training data, instead of pre-defined thresholds. Recently, (Bodla, Singh, applied to each region proposal, which was time and space consuming.
Chellappa, & Davis, 2017) and (Hosang, Benenson, & Schiele, 2017) Fast R-CNN improved the efficiency by only computing the convolu
proposed methods with soft thresholds to improve performance. In tional feature map once for an entire image. The Region of Interest (RoI)
standard NMS, lower-score regions overlapping significantly with pooling layer was designed to extract fixed size features for different size
higher-score regions are discarded. However, in soft-NMS (Bodla et al., region proposals. It divided the feature maps into fixed size sub-
2017), lower-score regions are kept in the results. (Hosang et al., 2017) windows, and max pooled each sub-window to form the fixed size fea
proposed a CNN network to perform NMS by using neighboring regions’ tures. Although Fast R-CNN was much faster than R-CNN, it was still
detection results to update one region’s detection. based on the traditional region proposal method, which was time
consuming.
3. Major deep learning approaches for image-based object Faster R-CNN (Ren et al., 2015) further improved the detection
detection speed of Fast R-CNN by replacing the traditional region proposals stage
with a convolutional neural network, called Region Proposal Network
3.1. Anchor-based deep learning approaches (RPN). RPN predicted the object region and object confidence scores
simultaneously. The main benefit of this design is that the RPN shares
Anchor-based deep learning approaches for object detection can be the same convolutional layers with the object detection network, which
grouped into two main categories: two-stage object detectors and one- reduces the detection time.
stage object detectors. Various anchor sizes captured multi-scale representation and made
Two-stage object detectors, such as Faster R-CNN (Ren et al., 2015), the computation light weight, compared to image pyramid methods. In

4
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

the second stage, the output of RPN were further classified and localized instead of softmax loss, to handle the case of multiple classes in one
to generate final detection results. Faster R-CNN can be trained end-to- bounding box. It adopted the multi-scale framework and feature pyra
end and the whole network is efficient. mid to predict objects in 3 different scales. A new backbone network
Mask R-CNN (He, Gkioxari, Dollár, & Girshick, 2017) was proposed with ResNet module was proposed for improved speed and accuracy,
as an extension of Faster R-CNN with extra predictions on pixel wise especially on small object detection.
instance segmentation. Mask R-CNN used the Faster R-CNN two-stage SSD (Liu et al., 2016) was a single shot detector without a region
pipelines with the same first stage network. In the second stage, it proposal stage, as shown in Fig. 1. Different from Faster R-CNN that only
added one extra output, a binary mask for each region proposal, and used the last layer for detection, SSD performed detection using multiple
kept the original classification and bounding box regression. In the loss layers to better capture multi-scale objects. Since anchors were applied
function used in training, it added one extra term for mask prediction in to multiple feature maps, SSD designed various anchor scale ranges
the form of binary cross-entropy loss. One of the key contributions of between layers. The lower layers had smaller scales and higher layers
Mask R-CNN was the RoI Align layer, which was introduced to fix the larger scales. This design could handle a wide range of objects, resulting
misalignment issuses in RoI pooling layer. The idea was to remove the in higher recall rate. Different from YOLO that used the fully connected
quantization of the RoI boundary, and instead calculating real values. layers for object detection, SSD used fully convolutional layers to predict
Bilinear interpolation was used to calculate the real feature value on confidence score and localization offset. With some additional data
each sampling point. The results were averaged or max pooled from the augmentation and hard negative mining techniques, SSD achieved the
sampling points. Mask R-CNN achieved the state-of-the-art on instance- state-of-the-art performance on several benchmark datasets. However,
level segmentation. SSD performed poorly on small objects, due to shallow layers without
Feature pyramid network (FPN; Lin, Dollár, et al., 2017) focused deep semantic information.
on solving the problem that lower level feature maps contain more DSSD (Fu, Liu, Ranga, Tyagi, & Berg , 2017) improved SSD by using a
spatial information but less semantic information, whereas the latter larger network. Their experimental results showed the deep and
layers of a deep neural network contain more high-level semantic in powerful backbone network ResNet-101 outperformed the VGG
formation but less spatial information. FPN utilized the hierarchy of a network. A deconvolutional module was introduced to add more context
CNN network and implemented a bottom-up and top-down path with information. More importantly, the deconvolutional layers were train
lateral connections. In the bottom-up part, an input image was passed able during training, making DSSD more flexible and achieving better
through a CNN and pooling layer was used to shrink feature maps size. performance. In order to improve the accuracy of anchor scales and
In the top-down part, the feature maps were up-sampled back into the ratios, K-means clustering are applied to group training boxes with
same size as in the bottom-up part. Moreover, the lateral connection squared root boxes areas as the distance measurement. DSSD improved
fused the feature maps in bottom-up path and top-down path of same the SSD accuracy, especially for small objects.
sizes with element-wise addition. FPN generated integrated feature RetinaNet (Lin, Goyal, et al., 2017) was proposed as a one-stage
maps that dramatically improve detection accuracy, especially for small object detector to reduce the detection accuracy gap with existing
objects. two-stage detectors while maintaining fast detection time. This work
found that the accuracy gap between one-stage detectors and two-stage
3.3. Representative one-stage detectors detectors was mainly due to the numbers of the positive examples and
negative examples as well as easy examples and hard examples used in
YOLO (Redmon et al., 2016) focused on improving the speed of training were highly unbalanced. The large number of easy examples
object detector. It treated the object detection problem as regression dominated the loss function, which resulted in a degenerated model.
problem and removed the region proposal stage in two-stage detectors. This problem was solved by introducing a new loss function, called focal
Instead of using pre-defined anchors for object region, it divided input loss function to reduce the weights of the easy examples adaptively.
images into 7*7 cells and each cell was used to predict the object’s
center falling into the cell. Each cell predicted bounding box locations, a 4. Challenges and solutions for small object detection
score for each bounding box, and class probabilities. The network was
implemented as convolutional layers followed by fully connected layers. In this section, we identify four major challenges of applying deep
The sum of squared error loss was used to minimize localization and neural networks to small object detection and discuss existing solutions.
classification error. YOLO was a real-time object detector with 45 frames
per second detection speed, which was extremely fast compared to other 4.1. Challenges for small object detection
detectors. However, class probabilities were only predicted within each
cell. It does not work well on objects partially located in one cell, could In this section, we summarize the four major challenges for small
not handle a wide distribution of ground truth objects, and has difficulty object detection.
to predict bounding box scales and ratios precisely, which results in a
low localization accuracy. 4.1.1. Challenge 1: Individual feature layers do not contain sufficient
YOLOv2 (Redmon & Farhadi, 2017) proposed several improvements information for small object detection.
on YOLO. In order to increase recall, it removed the fully connected Deep CNN architectures provide hierarchy feature maps due to
layers and adopted the anchor boxes concept to predict bounding boxes. pooling and subsampling operations, resulting in different layers of
Unsupervised learning methods were applied to generate bounding box feature maps containing different spatial resolutions. It is well known
scales and ratios directly from training data. Instead of only predicting that in the early-layer feature maps, the feature maps are of higher
one class probability per cell, it predicts both objectness and class for resolution and represent smaller reception fields. At the same time, they
each bounding box, which improved the performance on detecting do not contain high-level semantic information that is important for
partially covered objects. For the bounding box regression, it predicted object detection. On the other hand, the latter-layer feature maps
the location relative to the cell left top location, which made the pre contain stronger semantics information, which is essential for identi
diction bounds to be between 0 and 1. Other proposed techniques fying and classifying objects, including different object poses or illumi
included batch normalization, high-resolution classification, and multi- nations. Even though higher-level feature maps are useful for identifying
scale training. All the techniques dramatically improved detection ac large objects, they may not be sufficient for small object detection. After
curacy while keeping the fast speed. down sampling several times in the deep CNN architectures, the latter
YOLOv3 (Redmon & Farhadi, 2018) proposed more improvement feature maps lose spatial information. A small object of size 32x32 pixels
over YOLOv2. For class prediction, it used binary cross-entropy loss is clearly visible in earlier (or shallower) feature maps, but not in the

5
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Fig. 1. Architecture of SSD framework.

latter (or deeper) feature maps. Therefore, low-level features alone or other objects in the image. For example, in face detection systems,
high-level features alone are not sufficient for small object detection. the subject’s shoulder and neck are always close to their face.
Solution: Combining features from shallow layers and deep layers.
To better detect small objects, several deep CNN based methods combine Solution: Incorporating contextual information in the detection
lower-level feature maps and higher-level feature maps together to network. The local pixel context is usually added by enlarging filter sizes
obtain necessary spatial and semantic information. There are two main to capture extra information around the objects. The semantic context is
approaches to feature map fusion: usually added by extracting deeper features from images, such as in the
1) Bottom-Up Scheme deconvolution layers or recurrent neural networks (RNNs).
This scheme is incorporated into the standard feedforward CNN ar
chitecture. From early to latter layers, feature maps shrink after pooling 4.1.3. Challenge 3: Class imbalance for small objects.
operations. The final detection layers directly combine several bottom- Class imbalance refers to the uneven data distribution between
up feature maps. classes. There are two types of class imbalances. One is imbalance of
2) Top-Down Scheme foreground and background examples. In object detection, region pro
This scheme can be viewed as an attention mechanism that propa posal networks are used to generate the candidate regions containing
gates higher-level semantic information back to lower-level feature objects, by densely scanning the entire image. The anchors are pre-
maps. It usually uses a convolution-deconvolution or encoder-decoder defined rectangular boxes densely tiled on the entire input image. The
network with an upsampling operation in the decoder to enlarge the scales and ratios of anchors are pre-defined based on the target objects’
feature maps’ spatial resolution. Moreover, skip paradigm or lateral sizes in the training dataset. To detect small objects, there is an increase
connection are often used to connect lower-layer with higher-layer of anchors generated per image compared to detecting large objects.
feature maps while bypassing intermediate layers. The fused feature Only the anchors with high Intersection over Union (IoU) with the
maps are used by detection layers. Typical operations to combine feature ground truth bounding boxes are labeled as positive examples. Since
maps include summation, production, concatenation, and global most anchors have low or no overlap with the ground truth bounding
pooling. boxes, they are considered as negative examples. When densely gener
ated anchors are matched with sparsely located real objects in the im
4.1.2. Challenge 2: Limited context information of small objects. ages, positive examples are a tiny fraction, resulting in a high-class
Usually small objects are in low resolutions and it is difficult to imbalance, e.g. class ratio from 100:1 to 1000:1.
recognize low-resolution objects. Since small objects themselves contain The anchor-based object detection approach has several drawbacks.
limited information, contextual information plays a critical role in small First, due to the sparseness of ground-truth bounding boxes and the IoU
object detection (Torralba, Murphy, Freeman, & Rubin, 2003; Oliva & matching strategies between ground-truth and anchors, negative ex
Torralba, 2007; Divvala, Hoiem, Hays, Efros, & Hebert, 2009; Palmer, amples highly dominate positive examples, which leads to models fa
1975). Contextual information has been used in object recognition from voring the negative class. Second, the dense sliding window strategy has
a “global” image level to a “local” image level. A global image level high time complexity, (O(h2 w2 )), where h is the height and w is the
considers image statistics from the entire image, whereas a local image width of the anchors, which makes training slow.
level considers contextual information from neighbor areas of the ob Solution: Balance positive and negative examples in training. There
jects. Context features could be categorized into three types (Divvala are two main strategies: 1) data-based and 2) loss function-based. The
et al., 2009): data-based strategy is to change the foreground and background
example numbers to make the examples of the positive and negative
1) Local pixel context: The patches or pixels around an object, such as class roughly carry the same weights. Hard sampling and soft sampling
edges, colors, textures, etc. Local pixel context could be captured by are two popular methods. Hard sampling selects a subset of samples,
increasing the size of the detection window in object detection whereas soft sampling assigns different weights to examples. For
networks. example, random sampling is commonly used to randomly select ex
2) Semantic context: The probability of an object to be identify in some amples to meet a certain ratio. Another sampling strategy is to sample
surrounding scenes, such as events, activities, or scene categories. more of the hard examples with large losses. For example, a machine
3) Spatial context: The spatial location of other objects in the image, e. learning model could be trained first and then the false positives are
g. the likelihood of finding an object in some positions in respect to considered as hard examples, which are weighted heavily in the second
round of training. The recently proposed Online Hard Example Mining

6
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

(OHEM) method (Shrivastava, Gupta, & Girshick, 2016) performs one considering both global and local information. The parameters for the
forward pass on the calculated region of interest (RoI) and computes non-linear transformation were learnable and shareable with different
losses for all RoIs. Then, examples are ranked based on their loss func layers. The transformations were applied to feature maps in different
tion values, and the examples with the largest loss are selected to be used layers and each transformed layer generated detection results. (Yang,
in the next round of training since the current trained network model Liu, Yan, & Li, 2019) used deconvolutional layers in an “encoder-
performs the worst on them. (Pang, Chen, et al., 2019) also proposed an decoder” architecture in addition to the feature maps from convolu
IoU-balanced sampling technique in order to sample more training ex tional layers and deconvolutional layers were combined. (Bell, Law
amples from difficult cases. In terms of soft sampling, (Cao, Chen, Loy & rence, Zitnick, Bala & Girshick, 2016) used skip connections to directly
Lin, 2020) proposed a technique that selects samples based on their add lower-level feature maps to higher-level feature maps. Features
importance where the importance of positive examples is measured by were pooled from convolutional layers with different receptive fields
their IoU scores with the ground truth bounding boxes, and the impor (conv3, conv4, and conv5). These features were normalized and
tance of negative examples is calculated by considering both local region concatenated to be fed into detection modules. (Zagoruyko et al., 2016)
and global region properties. also concatenated features from different layers (conv3, conv4, conv5)
Loss function-based strategies will re-weight examples of imbalanced after normalization. In (Cao et al., 2018), extensive experimental results
classes in the loss function in order to balance the foreground and showed that combining low-level and high-level features improved
background examples. For example, AP loss (Chen et al., 2019) uses an detection accuracy of small objects. (Jeong, Park, & Kwak, 2017)
average-precision loss to re-weight examples. DR Loss (Qian, Chen, Li, & concatenated features by performing pooling and deconvolution
Jin , 2019) re-weights examples based on the distribution of foreground simultaneously, where pooling decreased the size of the low-level
examples over the distribution of background examples. feature maps to combine them with high-level feature maps and
deconvolution increased the high-level feature maps to combine them
4.1.4. Challenge 4: Insufficient positive examples for small objects with low-level feature maps. (Yu et al., 2018) fused both semantic and
Most deep neural network models for object detection were trained spatial information by using Iterative Deep Aggregation (IDA) and Hi
using objects of various scales. They typically perform well on large erarchical Deep Aggregation (HDA). IDA non-linearly two types of fea
objects, but poorly on small objects. Reasons may include an insufficient tures, whereas HDA merged several CNN features in a tree structure.
amount of small-scale anchor boxes generated to match the small objects (Zhang, Qiao, et al., 2018) used an extra segmentation module to add
and an insufficient number of examples to be successfully matched to the more semantic information. (Li & Zhou, 2017) fused features from
ground truth. The anchors are regions in the feature maps of some in multiple lower layers, where low-level feature maps were down sampled
termediate layers in a deep neural network, which would be projected with max pooling and high-level feature maps were resized with bilinear
back to the original image. It’s hard to generate anchors for small ob interpolation. Similarly, (Kong, Yao, Chen, & Sun, 2016) applied max
jects. Moreover, the anchors need to be matched to the ground truth pooling on low-level features and deconvolutional operation on high-
bounding boxes. A widely used matching method is as follows. If an level features. Then these feature maps were concatenated together to
anchor has a high IoU score with respect to a ground truth bounding box, form the so-called Hyper Feature Map.
such as larger than 0.9, it is labeled as a positive example. In addition, Some network architectures used both top-down and bottom-up
the anchor with the highest IoU score with respect to each ground truth connections for combing features. ( (Shrivastava, Sukthankar, Malik,
box is also labeled as a positive example. Therefore, small objects usu & Gupta, 2016) added a top-down module as well as lateral connections.
ally have very few anchors match with the ground truth bonding boxes, The top-down module can generate features with more semantic and
i.e. very few positive examples. contextual information. The lateral connection can help to enrich the
Solution: Use methods that can generate more anchors for small top-down features by transmitting lower-level features. (Kong et al.,
objects and match more anchors with small objects. Existing techniques 2017) used reverse connections to add high-level semantic information
include: from latter network layers back to earlier layers. (Ghiasi, Lin, & Le,
1) Multi-scale mechanism. Multi-scale architectures consisting of 2019) proposed a search method to automatic search for good feature
separate branches for small, medium, and large-scale objects can pyramid architectures which may make it possible to replace manual
generate anchors of different scales. design. A recurrent neural network (RNN) served as the controller to
2) Matching strategy. Adaptively setting anchor scales and ratios to merge any two input features with sum or pooling operation. (Pang,
help more anchors match to ground truths of small objects. Wang, Anwer, Khan, & Shao, 2019) applied the feature pyramid fusion
3) Increasing positive examples of small objects. In the region proposal mechanism at two levels. For global information, it constructed an
stage, generate more anchors by allowing them to overlap with each image pyramid and combined the features from four levels of the image
other. pyramid with the original features from the standard SSD framework.
For local spatial information, features from both the previous and the
4.2. Deep learning techniques for small object detection developed in current layers were fused together.
generic object detection research
4.2.2. Technique 2: Incorporate context information of small objects
In this section, we summarize the deep learning techniques devel Context information of small objects can be divided into local contact
oped in generic object detection research that are effective for small and sematic context information. More local context information could
object detection. be included in deep neural networks through larger bounding boxes and
proposal boxes. (Cai, Fan, Feris & Vasconcelos, 2016) added extra local
4.2.1. Technique 1: Improve feature maps for small objects context with bounding boxes 1.5 times of object regions, which was
Low-level features are important for localization, whereas high-level useful to include more of the surroundings of small objects. The extra
features are important for classification. Using a combination of low- context information was combined with object features for detection
level and high-level feature maps at the detection layers in several layers. (Zagoruyko et al., 2016) incorporated local context information
deep neural networks has led to an improvement of results for small by increasing the sizes of region proposal boxes with four scales, 1x,
object detection. 1.5x, 2x, and 4x. The outputs from four regions were pooled with ROI-
Using a bottom-up fashion, one group of network architectures pooling and concatenated before being fed into the detection and clas
merged feature maps from different layers (Fu et al., 2017; Lin, Milan, sification layers.
Shen, & Reid, 2017; Kong et al., 2017). Kong, Sun, Tan, Liu, and Huang For including more semantic context information, (Fu et al., 2017)
(2018) proposed a non-linear feature map transformation by used deconvolutional layers with “skip connections”, resulting in better

7
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

detection results on small objects. Instead of simply stacking deconvo applied to find perfect scales and ratios for anchor boxes and to perform
lutional layers on top of convolutional layersthe deconvolution layers adaptive matching of anchor boxes (Redmon & Farhadi, 2017). For
were designed to be much shallower than convolutional layers, and an example, ground truth bounding boxes can be grouped into clusters
element-wise product was used. Similarly, (Cai et al., 2016) also used based on their scales, and then anchor boxes are matched to ground
deconvolutional layers to increase the resolutions of feature maps. (Bell truth bounding boxes of similar scales.
et al., 2016) used four Recurrent Neural Network (RNNs) to capture
global information of the input image. The RNNs added semantic 4.3. Techniques for small object detection developed in face detection
context information around the objects and 1x1 convolutions combined research
all information together. In addition, (Zhang, Wen, Bian, Lei, & Li, 2018)
passed higher-level feature maps to lower level features. Face detection has been extensively studied and has achieved great
success. Faces have distinct facial features, i.e. the nose, eyes, mouth,
4.2.3. Technique 3: Correct foreground and background class imbalance and their relative positions with respect to each other, which make face
for small objects detection different than generic object detection. However, it remains
Techniques for improving the foreground and background class challenging when face sizes are very small, e.g. smaller than 16x16
imbalance can be divided into two part, which are a data-based pixels, and features are not distinguishable. Many techniques have been
approach and a loss function-based approach. developed for small face detection. The state-of-art algorithms shown in
In the data-based approach, (Cai et al., 2016) addressed class Table 1.
imbalance using bootstrap sampling that sampled negative examples
based on their loss values. (Zhang, Wen, et al., 2018) used a two-step 4.3.1. Technique 1: Improve feature maps for small faces
regression method to balance the foreground and background exam One of the techniques for combining multiple feature maps is the skip
ples. Some easy negatives were dropped to make the ratio between connections used to integrate lower, middle, and higher layer features.
positives and negatives about even. (Kong et al., 2017) implemented an (Tian et al., 2018) proposed an iterative feature map generation scheme,
objectness prior to filtering out the bounding boxes without objects. In which generated features in six different scales, and all feature maps
the loss function-based approach, (Galleguillos & Belongie, 2010) used a from the backbone network were fed back to the beginning of the
loss function that gave hard-negative examples greater weights. network to extract more semantic information for small objects.
(Samangouei, Chellappa, Najibi, & Davis, 2018) fed the combined lower
4.2.4. Technique 4: Increase training examples for small objects and higher layer features to a ROI-based block normalization layers.
Many techniques have been developed to increase the amount of (Tian et al., 2018) merged four feature maps using skip connections to
training examples for small objects. They include neural network ar generate four new feature maps for detection. Specifically, the input
chitectures for multi-scale learning, scale transformation, and adaptive features were fused with the next level features by element multiplica
matching of anchor boxes. tion. Furthermore, the fused features were added to the original input
Several neural network architectures have been designed for multi- features to form the final features, which proved to be effective for
scale learning, i.e. training detector networks for objects of different detecting difficult tiny faces. (Zhu, Zheng, Luu, & Savvides, 2017)
sizes, to addressing the problem of insufficient examples of small objects combined the lower level features with higher level features by first
in training classifiers. (Singh, Najibi, & Davis, 2018) showed the effec downsampling the lower level features to the size of higher-level fea
tiveness of a training scheme that used objects of various scales and tures and then concatenating them with L2 normalization. (Luo, Li, Zhu,
poses to increase the detection performance on small objects due to & Zhang, 2019) combined the lower level features with neighboring
increased quantity and variety of training examples. Small objects were features by using a bilinear upsampling strategy. (Yoo, Dan, & Yun,
up sampled and fed into convolutional networks for detection. Only the 2019) designed a feature map generation scheme by recurrently passing
layers whose feature maps contained target objects within a certain the network.
range were activated during training, so that small objects could be
trained equally well as medium and large objects. (Najibi, Singh, & 4.3.2. Technique 2: Incorporate context information of small faces
Davis, 2019b) proposed a multi-scale network to predict the regions For face detection, (Bai, Zhang, Ding, & Ghanem., 2018) showed that
most likely containing small objects and discarded the regions that did adding context information can improve the performance dramatically
not likely contain small objects. The network first predicted a binary for small faces. However, too much context information can also hurt the
segmentation map for small objects, aiming to achieve a large recall rate performance on small faces due to over-fitting. Some methods added
for small objects. Only the regions that likely contained small objects context information by enlarging the receptive fields around faces.
were used in training. (Yang, Choi, & Lin, 2016) proposed a technique (Najibi, Samangouei, Chellappa, & Davis, 2017) adopted larger filter
called scale-dependent pooling to assign the appropriate feature maps to sizes, e.g. 5x5 and 7x7 filters in the convolutional network, instead of
objects based on object scales. This approach was based on the idea that 3x3 filters. (Wang, Yuan, & Network, 2017) adopted the feature fusion
small objects mainly have strong activations in lower-level feature maps. method to combine lower level features with higher level features with
This method pools the earlier feature maps for small objects and latter an agglomeration connection module. In this module, lower feature
feature maps for large objects. Several other works added prediction and maps first pass through an inception like network to increase the se
detection layers after different convolutional layers, i.e. using different mantic information, then it concatenates lower feature maps with higher
feature maps for prediction, and enlarged the region sizes to include feature maps. (Samangouei et al., 2018) added context information
more local context information (Cai et al., 2016; Li & Zhou, 2017; Chen, around each bounding box. In (Tang et al., 2018), extra context infor
Liu, Tuzel, & Xiao, 2016). Scale transformation is another technique mation from bodies and shoulders were integrated and a semi-
used to increase the amount of training examples for small objects (Kim, supervised method was used to generate labels for other body parts.
Kang, & Kim, 2018). Objects of different scales are mapped onto a single (Tian et al., 2018) used a segmentation branch to add extra context and
scale-invariant space in order to increase the number of examples. semantic information, without extra annotations, and the segmentation
Learning the mapping from different scales in the original input images branch shared the same receptive field of detection, which made the
are used to normalized patches. Therefore, images containing objects of segmentation branch an extra source for more discriminative features.
all scales can be used in training. (Singh & Davis, 2018a) propsed a novel (Li, Tang, Han, Liu, & He, 2019) used the structure of the dense block
training scheme called scale normalization for image pyramids (SNIP) from (Huang et al., 2017) to integrate extracted context features. (Zhu
that can minimize the scale changes during training. et al., 2017) integrated body information to reduce false positives. The
Instead of pre-defined anchor boxes, machine learning has been body features were acquired by additional RoI-pooling operations to

8
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Table 1
Face Detection State-of-the-Art Algorithms.
Face Detection

Network One-Two stage Backbone Strategy Wider Face validation Set

Easy (mAP) Medium (mAP) Hard (mAP)
Tiny Face One-stage ResNet101 Context reasoning 0.919 0.908 0.823
SSH One-stage VGG16 Feature fusion/context reasoning 0.931 0.921 0.845
SRN Two-stage ResNet50 Anchor matching /balance classes 0.957 0.946 0.884
S3FD One-stage VGG16 Anchor matching /balance classes 0.937 0.924 0.852
EXTD One-stage Inverted Residual Feature fusion 0.912 0.903 0.85
DF2S2 One-stage ResNet50 Feature fusion/context reasoning 0.969 0.959 0.912
PyramidBox One-stage FPN Feature fusion/context reasoning 0.961 0.95 0.889
PyramidBox++ One-stage FPN Context reasoning/balance classes 0.965 0.959 0.912
FA-RPN One-stage RPN Anchor matching /balance classes 0.95 0.942 0.894
Face-MegNet Two-stage VGG16 Context reasoning/feature fusion 0.92 0.913 0.85
SFA One-stage VGG16 Anchor matching/feature fusion 0.949 0.936 0.866
Face-RCNN Two-stage VGG19 Anchor matching /balance classes 0.938 0.922 0.829
RetinaFace One-stage – Context reasoning 0.969 0.961 0.92
RAP One-stage Dilated convolutional Anchor matching 0.949 0.935 0.865

enlarge the receptive fields. The combined face and body features were stride sizes on the lower anchor-associated layer to increase the number
used in both classification and bounding box regression. of anchors that can potentially match with more small scale objects.
Moreover, it proposed a two-stage anchor matching strategy to make
4.3.3. Technique 3: Correct foreground and background class imbalance sure that each small object has enough anchors to match with. (Zhu
for small faces et al., 2018) increased the number of anchors by reducing stride and the
For small face detection, detector networks usually place a lot of distance between the face and anchor center, as well as adding extra
small anchors on the images which will usually generate a lot of negative shifted anchors. Moreover, for the hard faces for which the highest IoU
anchors and very few positive anchors, resulting in a high false positive scores were still lower than the matching threshold, the top few anchors
rate. There are two main approaches to handle class imbalance. with highest scores were selected as positive examples. (Luo et al., 2019)
a) Filtering anchors. (Zhang et al., 2017) used a max-out background increased the ranges of anchors so that more anchors with small sizes
label, which predicted several scores for background labels and selected can be matched with small faces.
the largest as the final score. (Chi et al., 2019) used two-step classifi c) Multi-scale training. (Wang, Chen, Huang, Yao, & Liu, 2017) resized
cation on the lower layers to filter out false positives for small faces, input images to different sizes to generate objects of various sizes and
which helped to balance the positive and negative examples to improve small objects can be resized to larger objects to match more anchor
the classification results. boxes. (Najibi et al., 2017) designed a multi-scale network with three
b) Sampling. (Zhang et al., 2017) applied hard-negative mining to different convolutional branches intended to detect different scales of
make the ratio between negatives and positives at most 3:1. (Najibi, faces: small, medium and large faces, respectively. Small faces were
Singh, & Davis, 2019a) also used hard-negatives mining. Anchors were detected by an element sum feature from conv4 and conv5. Medium
labeled positive if the overlap with the ground-truth bounding box was faces were directly detected from conv5. Large faces were detected from
larger than 0.5. (Li et al., 2019) proposed a balanced-data-anchor- max pooling after conv5. (Hu & Ramanan, 2017) used an image pyra
sampling strategy to select large size and small size anchors with mid, where the input images were scaled to ratio 0.5, 1, and 2 of the
equal probability. (Tang et al., 2018) proposed a Pyramid box that original resolution. Then two types of feature maps were applied to
adopted the max-in-out technique on both positive and negative samples capture different scales of faces.
to reduce the false positive rate of small objects. (Wang, Li, Ji, & Wang,
2017) applied the online hard exampling mining (OHEM) by sorting the
examples based on loss and selecting the top examples with the highest 4.4. Techniques for object detection in aerial images
loss as the hard examples. Also, they used a 1:1 ratio for positive hard
examples and negative hard examples in each mini batch during For detecting objects in aerial images, there are mainly four kinds of
training. methods: (i) template matching-based, (ii) knowledge-based, (iii) OBIA-
based, and (iv) machine learning-based (Cheng & Han, 2016). In recent
4.3.4. Technique 4: Increase training examples for small faces years, deep learning based methods achieved the best performance.
It is important to make sure that small objects have sufficient anchors Commonly, CNNs pretrained on large image datasets, such as the
to match with, otherwise small objects cannot be trained well, and the ImageNet and COCO dataset, were fine-tuned on aerial images. In
trained model’s recall will be low. There are three main techniques for addition, new deep neural networks were proposed for the unique at
increasing training examples for small faces. tributes of objects in aerial images, like multi-scale and multi-angle, to
a) Matching strategy. (Zhu, Tao, Luu, & Savvides 2018) found the achieve better performance. For example, (Dong, Liu, & Xu, 2018)
average IoU for small faces and small anchors, which are much lower proposed rotation-invariant models to achieve good performance on
than large faces. In order to design anchors to match with more small remote sensing images. Moreover, weakly supervised learning methods
scale objects, a new matching score was proposed to consider face scales (Peng et al., 2018) has been proposed to learn high-level features in an
and anchor strides so that small face scales can also achieve high IoU unsupervised manner to capture the structural information of objects in
scores. Moreover, during training, faces were randomly shifted to match remote sensor images.
with more anchors. (Chi et al., 2019) identified the mismatch between
anchors ratios and receptive fields and proposed inception-styled feature 4.4.1. Technique 1: Deal with orientation of aerial image objects
maps to increase the diversity of feature map scales and reduce the Objects in aerial image can have arbitrary orientations or rotations.
mismatches between faces and anchors. Deep neural network-based detectors have been designed to address this
b) Increasing anchors. (Zhang et al., 2017) added extra convolutional issue. Rotation-Invariant CNN (RICNN) in (Cheng, Zhou, & Han, 2016) a
layers to generate more anchors for small objects. It also reduced the new rotation-invariant layer in the basic CNN architecture. introduced
Rotation-Invariant and Fisher Discriminative CNN (RIFD) in (Cheng,

9
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Han, Zhou, & Xu, 2018) was proposed to contain a rotation-invariant the trade-off between reception field and localization accuracy, large
regularizer and fisher discrimination regularizer on multi-scale fea reception fields lead to lower localization accuracy. However, when the
tures from CNN. The rotation-invariant regularizer mapped the CNN reception field is too small, localization accuracy may also decrease due
feature representations of training samples before and after rotations to to the lack of context information.
be similar, while the fisher discrimination regularizer constrained the Feature Pyramid Networks (FPNs) combine FCN and Faster R-CNN.
CNN features to be similar for within-class examples, but dissimilar for On top of the two predictions that Faster R-CNN generates: (i) bounding
examples in different classes. Recently, anchor rotation methods have box localization and (ii) bounding box recognition, FPNs added the third
been proposed in one-stage object detectors to achieve rotation invari output, (iii) instance mask prediction for segmentation. FPNs also used
ance (Yang et al., 2018). A feature refinement technique was proposed some new techniques, such as new ROI align layers, multitask training
to improve detection performance on aerial images, in which the posi and better backbone networks for further improvement.
tion information of bounding boxes was encoded to the corresponding For small object detection, some other techniques have also been
feature points through feature interpolation to improve feature recon proposed based on segmentation methods. To include more context in
struction and alignment. R-Net in (Yang et al., 2018) proposed a formation, capsule networks with deconvolutional capsules were pro
network to generate rotatable region proposals. posed to expand the original layers in the network architectures
(LaLonde & Bagci, 2018). Segmentations could be refined using bottom-
4.4.2. Technique 2: Incorporate context information of aerial image objects up and top-down network architectures to combine features from
More context information for small objects have been included in the different layers (Ronneberger et al., 2015), or using pyramid pooling
detection networks through combined feature maps and dilated con layers to segment objects in multiple scales as in DeepLab (Chen et al.,
volutions. Feature maps from multiple convolutional layers can be 2018). To increase training examples for small objects, a more robust
concatenated to form a new feature map. Dilated convolutions can be embedding could be learned by jointly using unsupervised and super
added to CNN models to improve performance on small-scale object vised learning and combing features from different models to form a
detection. multi-scale representa tion (Lin, Milan, et al., 2017). Concept Mask in
(Wang Lin, Shen, Zhang, & Cohen, 2018) used a semi-supervised
4.4.3. Technique 3: Correcte foreground and background class imbalance learning method to train a deep neural network with image-level la
for aerial image objects bels. Then, the results were refined and extended to predict attention
Based on Faster RCNN, IoU-Adaptive Deformable R-CNN in (Yan maps. Finally, an attention-driven class segmentation network was
et al., 2019) was proposed to address the class imbalance issue in trained.
training classifiers in object detectors. By analyzing the different roles
that IoU can play in different parts of the network models, an IoU-guided 5. Performance evaluation of deep learning methods for small
detection framework was proposed to reduce the loss of small object object detection
information during training. Besides, an IoU-based weighted loss was
designed to learn the IoU information of positive ROIs to improve the In this section, performances of representative state-of-the-art object-
detection accuracy. Finally, the class aspect ratio constrained non- detection methods on widely used public benchmarkdatasets are pre
maximum suppression (CARC-NMS) was proposed to improve detec sented. The emphasis is on small object detection.
tion precision.
5.1. Dataset
4.4.4. Technique 4: Increase training examples for aerial image objects
Multi-scale network models have been proposed to detect objects of We used several datasets from three different areas: generic object
various sizes in aerial images, such as Multi-Scale and Rotation- detection, face detection, and object detection in aerial imagery. Images
Insensitive Convolutional Channel Features (MsRi-CCF) in (Wu, Hong, in the generic object detection dataset were mostly collected under
Ghamisi, Li, & Tao, 2018). MsRi-CCF was proposed for geo-spatial object everyday living and indoor settings. The objects are usually of rigid
detection by integrating robust low-level feature generation, classifier shapes. The difficulty of detection usually comes from illumination and
generation with outlier removal, and detection with a power law. background clutter. In comparison, faces have a common structure
containing several regions of fixed parts, such as eyes, noses, mouths,
4.5. Instance segmentation methods for small object detection etc., and the relationships between parts are known a prior. In terms of
aerial images, they were collected under much diverse conditions, such
Different from the popular bounding-box-based object detectors as camera mounted underneath airplanes, helicopters or UAS (drones),
presented in the previous sections, deep CNNs for instance segmentation which resulted in straight-down views of the objects, very different
have also been applied to object detection. The main drawbacks of viewpoints from images in generic object detection or face detection
segmentation methods are pixel-wise labeling, which is time consuming, datasets. We showed the examples of three datasets in Fig. 2.
and compute and memory intensive. For small object detection, each 1) Generic object detection dataset. A combination of images from
pixel of an object is important and using pixel information could the Microsoft Common Object in Context (COCO) dataset (Lin et al.,
generate good results. 2014) and the SUN dataset (Xiao, Hays, Ehinger, Oliva, & Torralba,
FCN (Long, Shelhamer, & Darrell, 2015) is one of the first methods 2010) were used in the experiments. COCO consists of 82 K training and
that use CNNs for semantic segmentation. FCN employs CNNs without 40 K validation images belonging to 80 classes. COCO is a widely used
fully connected layers, which allows the input image to have an arbi dataset in object detection and a relatively difficult dataset, since the
trary size. It uses pooling layers to reduce computation time and increase objects sizes are relatively small compared with other datasets. For our
the reception field size. Based on FCN, U-Net (Ronneberger, Fischer, & experiments, we selected ten small object categories from COCO, where
Brox, 2015) was proposed with an encoder-decoder architecture to the largest physical dimension is smaller than 30 cm. The selected object
address the issue of determining appropriate numbers of pooling layers. categories are mouse, telephone, switch, outlet, clock, toilet paper, tis
It has a U-shape architecture to balance the trade-off between good sue box, faucet, plate and jar. Then, we used the ground truth bounding
localization accuracy and efficient context information. In the encoder, boxes in the COCO and SUN datasets to filter out large objects to create a
it uses pooling layers to gradually reduce the layer size, whereas, in the dataset containing small objects with small bounding boxes. Table 2
decoder stage, it uses up-convolution to gradually increase the layer shows the statistics of this small object dataset used in our experiments.
size. Moreover, U-Net uses short-cut connections from encoder to It contains about 8393 object instances in 4952 images. The mouse
decoder to help the decoder recover fine-grain information. Regarding category has the largest number of object instances: 2,173 instances in

10
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Fig. 2. Examples of small object subsets of three benchmark datasets used in our experiments for object detection: (a) DOTA, (b) WIDER FACE, and (c) COCO and
SUN. All the objects used in our experiments are smaller than 50 × 50 pixels.

wherep(r) is the preicion at recall value r.

Table 2
Three benchmark datasets used to evaluate the performances of representative
small object detectors in our experiments. 5.2. Experimental results
Dataset # # test # Examples
train category The software we used to generate the results in this paper can be find
DOTA 4156 1186 15 Plane, ship, vehicle, harbor, …
on GitHub at (Liu, 2020).
WIDER 2730 708 1 Face Table 3 shows the results of 3 state-of-the-art object detectors, Faster
FACE R-CNN, SSD and YOLOv3, on the DOTA dataset. Faster R-CNN, which is
COCO + 3655 1560 10 Mouse, telephone, outlet, faucet, a two-stage object detector, performed better than the other two one-
SUN …
stage object detectors. Faster R-CNN achieved 35% on mAP, which is
3% and 11% higher than YOLOv3 and SSD respectively. Among the one
1,739 images. The tissue box category has the fewest instances: 103 stage object detectors, YOLOv3 performed much better than SSD. One
instances in 100 images. The object instances in the dataset are small. reason is that the feature fusion of lower level and higher level and
The median of relative areas of all the object instances is from 0.08% to multi-scale training techniques in YOLOv3 are effective for small object
0.58%. As a comparison, the median of relative areas of objects in the detection.
PASCAL VOC dataset is from 1.38% to 46.40%. This dataset is chal Table 4 shows the results of Faster R-CNN, SSD and YOLOv3 on the
lenging in two ways. First, the appearance cue for distinguishing a small small object subset of COCO and SUN datasets.
object from background clutters is much less due to the small size. Their overall mAP is quite low. Again, Faster R-CNN performed the
Second, the number of bounding box hypotheses for a small object in an best, with mAP 24.1%. YOLOv3 was slightly worse than Faster R-CNN,
image is much larger than that for a big object in VOC. while SSD was much woe.
2) Face detection dataset. We used Wider Face dataset (Yang, Luo, Table 5 shows the results of Faster R-CNN, SSD and YOLOv3, and
Loy, & Tang, 2016) in our experiments. This dataset contains 32,203 SSH on the Wider Face dataset. SSH was specifically designed for face
images containing 393,703 annotated faces, 158,989 of which are in the detection. Overall, Faster R-CNN performed the best with mAP 33.6%
train set, 39, 496 in the validation set and the rest are in the test set. The and YOLOv3 the second with 31.5% mAP. Although SSH was designed
validation and test set are divided into “easy”, “medium”, and “hard” for face detection, it did not perform well on small faces, worse than
subsets. This is one of the most challenging public face datasets mainly Faster R-CNN and YOLOv3. SSH achieved 30.8% mAP. SSD performed
due to the wide variety of face scales and occlusion. We trained all poorly with 24.6% mAP.
models on the train set of the WIDER dataset and evaluated their per The execution times of the methods were measured as frames per
formances on the validation and test sets. Moreover, a subset of small second (FPS) when predicting the results. All experiments were run on a
objects was created containing objects smaller than 50x50 pixels. Dell Alienware computer with GTX 980 M GPU with 8 GB memory
3) Aerial imagery dataset. The Large-scale dataset for Object running Ubuntu. For images from Wider Face dataset with size 512 ×
Detection in Aerial images (DOTA; Xia et al., 2018) was used in our 512 pixels, the running time of Faster R-CNN was 4 FPS, the slowest, due
experiments. It is one of the largest annotated object datasets of aerial to its two-stage object detection structure. SSH and SSD had similar
images, containing 2,806 high-resolution aerial images with 188,282 running times, 14 and 18 FPS, respectively. YOLOv3 ran the fastest, 40
object instances of various scales and shapes. It has 15 object categories, FPS.
including ships, planes, storage tanks, baseball diamonds, tennis courts,
basketball courts, ground track fields, harbors, bridges, large vehicles, 6. Future research directions
small vehicles, helicopters, roundabouts, and soccer ball field. DOTA
dataset contains challenging and complicated scenes. Different from anchors-based methods, recently several anchors-free
In our experiment, we use mean average precision (mAP) as our algorithms have been proposed and achieved the state-of-the-art results.
measurement metric. The definition of mAP was first proposed in the These methods get rid of the manual design of anchors and can represent
PASCAL VOC challenge (Everingham, Van Gool, Williams, Winn, & bounding boxes in any scales and ratios. (Duan et al., 2019) represented
Zisserman , 2010). Precision is the percentage of correct predictions and the bounding boxesas pairs of center points and corner points. (Tychsen-
recall is the percentage of true positives over all possible positives. Given Smith & Petersson, 2017) formulated the object detection task as a
a precision-recall curse, AP is defined as: sparse bounding boxes probability distribution. It first estimated the
distribution of four corners’ locations. The features for each corner were
1 ∑
AP = Pinterp (r) (9) constructed with nearest neighbor sampling and bounding box width
11 r∈{0,0,1,..1} and height. In its deep neural networks, deconvolutional layers were
applied to recover the information lost in pooling layers. (Law & Deng,
Pinterp (r) = maxp(r) (10) 2018) defined the object detection problem as detecting and grouping
r:r≥r
pairs of corners with extra embedding information. The network in was

11
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Table 3
Results of three representative detectors on DOTA dataset.
Method mAP Plane BD Bridge GTF SV LV Ship TC BC ST SBF RA Harbor SP HC

FasterR-CNN 0.35 0.574 0.172 0.234 0.175 0.63 0.51 0.721 0.27 0.03 0.434 0.232 0.23 0.506 0.38 0.008
SSD 0.24 0.66 0.0 0.0 0.0 0.60 0.49 0.74 0.09 0.0 0.37 0.0 0.0 0.45 0.22 0.0
YOLOv3 0.32 0.58 0.15 0.20 0.132 0.58 0.41 0.68 0.25 0.01 0.35 0.20 0.21 0.45 0.35 0.0

Table 4
Results of three representative detectors on the small object subsets of COCO and SUN datasets.
Method mAP Mouse Telephone Outlet Clock TP TB Faucet Plate Jar Switch

Faster R-CNN 0.241 0.517 0.106 0.368 0.627 0.05 0.0 0.251 0.161 0.06 0.27
SSD 0.17 0.481 0.05 0.155 0.509 0.05 0.0 0.15 0.13 0.02 0.22
YOLOv3 0.23 0.54 0.105 0.397 0.631 0.07 0.0 0.26 0.11 0.06 0.20

References
Table 5
Results of four representative detectors on a Bai, Y., Zhang, Y., Ding, M., & Ghanem, B. (2018). In Finding tiny faces in the wild with
small face subset of Wider Face dataset. generative adversarial network (pp. 21–30). Salt Lake City: IEEE Xplore.
Bell, S., Lawrence Zitnick, C., Bala, K., & Girshick, R. (2016). Inside-Outside Net:
Methods mAP
Detecting objects in context with skip pooling and recurrent neural networks.
Faster R-CNN 0.336 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
SSD 0.246 (CVPR; pp. 2874-2883). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.
YOLO v3 0.315 thecvf.com/content_cvpr_2016/html/Bell_Inside-Outside_Net_Detecting_CVPR_
2016_paper.html.
SSH 0.308
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). In Soft-NMS – improving object
detection with one line of code (pp. 5561–5569). Venice, Italy: IEEE Xplore.
Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep
used as the backbone network and heat maps were predicted for both convolutional neural network for fast object detection. European Conference on
top-left corners and bottom-right corners. The corners belonging to the Computer Vision. 9908, pp. 354–370. Cham: Springer. tps://doi.org/10.1007/9
78-3-319-46493-0_22.
same object were further grouped with embedding vectors, which were
Cao, G., Xie, X., Yang, W., Liao, Q., Shi, G., & Wu, J. (2018). Feature-fused SSD: fast
predicted as the similarities between corners. Corner pooling layers detection for small objects. Ninth International Conference on Graphic and Image
were also proposed for combining the prior location knowledge of cor Processing (ICGIP 2017). 10615, p. 106151E. Qingdao, China: Proc. SPIE. https
://doi.org/10.1117/12.2304811.
ners into the feature extraction process. (Wang, Chen, et al., 2017)
Cao, Y., Chen, K., Loy, C. C., & Lin, D. (2020). Prime Sample Attention in Object
represented bounding boxes using center points and four corners (top- Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
left, top-right, bottom-left and bottom-right). It divided the feature maps Recognition (CVPR; pp. 11583–11591). IEEE Xplore. Retrieved from https://ope
into grid cells and for each cell predicted the probability of center points naccess.thecvf.com/content_CVPR_2020/html/Cao_Prime_Sample_Attention_i
n_Object_Detection_CVPR_2020_paper.html.
and corner points in the cell, x-offset and y-offset, as well as the prob Chen, C., Liu, M. Y., Tuzel, O., & Xiao, J. (2016). R-CNN for Small Object Detection.
ability of point link. The point link contained two parts: special index Asian Conference on Computer Vision (pp. 214–230). Cham: Springer. https://doi.
(the probability of point linking to cells) and point index (the linking org/10.1007/978-3-319-54193-8_14.
Chen, K., Li, J., Lin, W., See, J., Wang, J., Duan, L., . . . Zou, J. (2019). Towards accurate
probability between corner points and center point). Different from one-stage object detection with AP-loss. Proceedings of the IEEE/CVF Conference on
other methods that estimated the points of bounding boxes, (Zhou, Computer Vision and Pattern Recognition (CVPR; pp. 5119-5127). Long Beach,
Zhuo, & Krahenbuhl, P, 2019)estimated the four extreme points on an California: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_C
VPR_2019/html/Chen_Towards_Accurate_One-Stage_Object_Detection_With_
object with fully appearance-based algorithms. It predicted five feature AP-Loss_CVPR_2019_paper.html.
maps: one for the center point and four for corner points. The extreme Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab:
points were generated with maxpooling. Based on the prediction of any Semantic image segmentation with deep convolutional nets, atrous convolution, and
fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence,
combination of four corners, the center was calculated and verified on
40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184
the heat map for center. The corner centers with verified center points Cheng, G., & Han, J. (2016). A survey on object detection in optical remote sensing
represented the detected objects. (Duan et al., 2019) improved the images. ISPRS Journal of Photogrammetry and Remote Sensing, 117, 11–28. https://
doi.org/10.1016/j.isprsjprs.2016.03.014
network of (Law & Deng, 2018). The work represented bounding boxes
Cheng, G., Han, J., Zhou, P., & Xu, D. (2018). Learning rotation-invariant and Fisher
using three points (two corners and one center point) and center pooling discriminative convolutional neural networks for object detection. IEEE Trans. on
and cascade corner pooling were proposed to extract strong features. Image Process., 28(1), 265–278. https://doi.org/10.1109/TIP.2018.2867198
Cheng, G., Zhou, P., & Han, J. (2016). Learning rotation-invariant convolutional neural
networks for object detection in VHR optical remote sensing images. IEEE Trans.
CRediT authorship contribution statement Geosci. Remote Sensing, 54(12), 7405–7415. https://doi.org/10.1109/
TGRS.2016.2601622
Yang Liu: Conceptualization, Writing - original draft, Investigation, Divvala, S. K., Hoiem, D., Hays, J. H., Efros, A. A., & Hebert, M. (2009). An empirical
study of context in object detection. 2009 IEEE Conference on Computer Vision and
Resources, Formal analysis, Software. Peng Sun: Writing - original Pattern Recognition (pp. 1271–1278). Miami, FL, USA: IEEE. https://dx.doi.org/10.
draft. Nickolas Wergeles: Resources, Supervision. Yi Shang: Concep 1109/CVPR.2009.5206532.
tualization, Supervision, Project administration. Dong, C., Liu, J., & Xu, F. (2018). Ship detection in optical remote sensing images based
on saliency and a rotation-invariant descriptor. Remote Sensing, 10(3), 400. https://
doi.org/10.3390/rs10030400
Declaration of Competing Interest Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). CenterNet: Object
detection with keypoint triplets. arXiv. Cornell University. Retrieved from https
://arxiv.org/abs/1904.08189v1.
The authors declare that they have no known competing financial
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The
interests or personal relationships that could have appeared to influence Pascal visual object classes (VOC) challenge. International Journal of Computer Vision,
the work reported in this paper. 88(2), 303–338. https://doi.org/10.1007/s11263-009-0275-4
Fu, C., Liu, W., Ranga, A., Tyagi, A., & Berg, A. (2017). DSSD: Deconvolutional Single
Shot Detector. arXiv. Cornell University. Retrieved from https://arxiv.org/abs/1701
.06659.

12
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

Galleguillos, C., & Belongie, S. (2010). Context based object categorization: A critical Li, Z., Tang, X., Han, J., Liu, J., & He, R. (2019). PyramidBox++: High Performance
survey. Computer Vision and Image Understanding, 114(6), 712–722. https://doi.org/ Detector for Finding Tiny Face. arXiv. Retrieved from https://arxiv.org/abs/1
10.1016/j.cviu.2010.02.004 904.00386.
Ghiasi, G., Lin, T. Y., & Le, Q. V. (2019). NAS-FPN: Learning scalable feature pyramid Lin, G., Milan, A., Shen, C., & Reid, I. (2017). RefineNet: Multi-path refinement networks
architecture for object detection. Proceedings of the IEEE/CVF Conference on for high-resolution semantic segmentation. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR; pp. 7036-7045). California: Long Computer Vision and Pattern Recognition (CVPR; pp. 1925–1934). Honolulu,
Beach. IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_CVPR_ Hawaii: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_cvp
2019/html/Ghiasi_NAS-FPN_Learning_Scalable_Feature_Pyramid_Architecture_for_ r_2017/html/Lin_RefineNet_Multi-Path_Refinement_CVPR_2017_paper.html.
Object_Detection_CVPR_2019_paper.html. Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature
Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on pyramid networks for object detection. In Proceedings of the IEEE Conference on
Computer Vision (ICCV; pp. 1440–1448). Santiago, Chile: IEEE Xplore. Retrieved Computer Vision and Pattern Recognition (CVPR) (pp. 2117–2125).
from https://openaccess.thecvf.com/content_iccv_2015/html/Girshick_Fast Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object
_R-CNN_ICCV_2015_paper.html. detection. In Proceedings of the IEEE International Conference on Computer Vision
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for (ICCV) (pp. 2980–2988).
accurate object detection and semantic segmentation. Proceedings of the IEEE Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., & Zitnick, C. (2014).
Conference on Computer Vision and Pattern Recognition (CVPR; pp. 580–587). Microsoft COCO: Common Objects in Context. European Conference on Computer
Columbus, Ohio: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/conte Vision (pp. 740–755). Cham: Springer. https://doi.org/10.1007/978-3-319-10602-1
nt_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.html. _48.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the Liu, L.i., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2020).
IEEE International Conference on Computer Vision (ICCV; pp. 2961-2969). Venice, Deep learning for generic object detection: A survey. International Journal of
Italy: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_iccv_ Computer Vision, 128(2), 261–318. https://doi.org/10.1007/s11263-019-01247-4
2017/html/He_Mask_R-CNN_ICCV_2017_paper.html. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., & Berg, A. (2016). Ssd:
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Single shot multibox detector. European conference on computer vision, pp. 21-37,
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 21–37). Springer, Cham.
(CVPR; pp. 770-778). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.th Liu, Y. (2020). GitHub. Retrieved from https://github.com/ylt5b/A-Survey-and-Perfor
ecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper. mance-Evaluation-of-Deep-Learning-Methods-for-Small-Object-Detection.
html. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic
He, Y., Zhu, C., Wang, J., Savvides, M., & Zhang, X. (2019). Bounding Box Regression segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern
With Uncertainty for Accurate Object Detection. Proceedings of the IEEE/CVF Recognition (CVPR; pp. 3431–3440). Boston: IEEE Xplore. Retrieved from https
Conference on Computer Vision and Pattern Recognition (CVPR; pp. 2888–2897). ://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Long_Fully_Conv
Long Beach, California: IEEE Xplore. Retrieved from https://openaccess.thecvf.co olutional_Networks_2015_CVPR_paper.html.
m/content_CVPR_2019/html/He_Bounding_Box_Regression_With_Uncertainty_for_ Luo, S., Li, X., Zhu, R., & Zhang, X. (2019). SFA: Small faces attention face detector. IEEE
Accurate_Object_Detection_CVPR_2019_paper.html. Access, 7, 171609–171620.
Hosang, J., Benenson, R., & Schiele, B. (2017). Learning Non-Maximum Suppression. Najibi, M., Samangouei, P., Chellappa, R., & Davis, L. S. (2017). SSH: Single Stage
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Headless Face Detector. In Proceedings of the IEEE International Conference on
(CVPR; pp. 4507-4515). Honolulu, Hawaii: IEEE Xplore. Retrieved from https://op Computer Vision (ICCV) (pp. 4875–4884).
enaccess.thecvf.com/content_cvpr_2017/html/Hosang_Learning_Non-M Najibi, M., Singh, B., & Davis, L. S. (2019a). FA-RPN: Floating region proposals for face
aximum_Suppression_CVPR_2017_paper.html. detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., . . . Adam, H. Recognition (CVPR; pp. 7723-7732). Long Beach, California: IEEE Xplore. Retrieved
(2017). MobileNets: Efficient convolutional neural networks for mobile vision from https://openaccess.thecvf.com/content_CVPR_2019/html/Najibi_FA-RPN_
applications. arXiv. Cornell University. Retrieved from https://arxiv.org/abs/1 Floating_Region_Proposals_for_Face_Detection_CVPR_2019_paper.html.
704.04861. Najibi, M., Singh, B., & Davis, L. S. (2019b). AutoFocus: Efficient multi-scale inference. In
Hu, P., & Ramanan, P. (2017). Finding Tiny Faces. Proceedings of the IEEE Conference Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (pp.
on Computer Vision and Pattern Recognition (CVPR). Honolulu, Hawaii: IEEE 9745–9755).
Xplore. Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/html/Hu_ Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose
Finding_Tiny_Faces_CVPR_2017_paper.html. estimation. European conference on computer vision, pp. 483-499. Cham: Springer.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected Nguyen, N., Do, T., Ngo, T., & Le, D. (2020). An Evaluation of Deep Learning Methods for
convolutional networks. Proceedings of the IEEE Conference on Computer Vision Small Object Detection. 2020, 18. https://doi.org/10.1155/2020/3189691.
and Pattern Recognition (CVPR), (pp. 4700–4708). Honolulu, Hawaii. Retrieved Oksuz, K., Cam, B., Kalkan, S., & Akbas, E. (2019). Imbalance problems in object
from https://openaccess.thecvf.com/content_cvpr_2017/html/Huang_Densel detection: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence.,
y_Connected_Convolutional_CVPR_2017_paper.html. 1–1. https://doi.org/10.1109/TPAMI.2020.2981890
Jeong, J., Park, H., & Kwak, N. (2017). Enhancement of SSD by concatenating feature Oliva, A., & Torralba, A. (2007). The role of context in object recognition. Trends in
maps for object detection. arXiv. Cornell University. Retrieved from https://arxiv. Cognitive Sciences, 11(12), 520–527. https://doi.org/10.1016/j.tics.2007.09.009
org/abs/1705.09587. Palmer, T. (1975). The effects of contextual scenes on the identification of objects.
Jiao, L., Zhang, F., Liu, F., Yang, S., Li, L., Feng, Z., et al. (2019). A Survey of Deep Memory & Cognition, 3, 519–526. https://doi.org/10.3758/BF03197524
learning-based object detection. IEEE Access, 7, 128837–128868. https://doi.org/ Pang, Jiangmiao, Chen, Kai, Shi, Jianping, Feng, Huajun, Ouyang, Wanli, & Lin, Dahua
10.1109/ACCESS.2019.2939201 (2019). Libra r-cnn: Towards balanced learning for object detection. In Proceedings of
Kim, K., Hong, S., Roh, B., Kim, K. H., Cheon, Y., & Park, M. (2016). PVANet: Lightweight the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 821–830).
Deep Neural Networks for Real-time Object Detection. arXiv. Cornell University. Pang, Y., Wang, T., Anwer, R., Khan, F., & Shao, L. (2019). Efficient featurized image
Retrieved from https://arxiv.org/abs/1611.08588. pyramid network for single shot detector. In Proceedings of the IEEE/CVF Conference
Kim, Y., Kang, B. N., & Kim, D. (2018). SAN: Learning relationship between on Computer Vision and Pattern Recognition (CVPR) (pp. 7336–7344).
convolutional features for multi-scale object detection. Proceedings of the European Peng, Tang, Wang, Xinggang, Wang, Angtian, Yan, Yongluan, Liu, Wenyu,
Conference on Computer Vision (ECCV), (pp. 316-331). Munich, Germany. Retrieved Huang, Junzhou, & Yuille, Alan (2018). Weakly supervised region proposal network
from https://openaccess.thecvf.com/content_ECCV_2018/html/Kim_SAN_Lea and object detection. In Proceedings of the European conference on computer vision
rning_Relationship_ECCV_2018_paper.html. (ECCV) (pp. 352–368).
Kong, T., Sun, F., Tan, C., Liu, H., & Huang, W. (2018). Deep feature pyramid Pont-Tuset, J., Arbeláez, P., Barron, J. T., Marques, F., & Malik, J. (2016). Multiscale
reconfiguration for object detection. Proceedings of the European Conference on combinatorial grouping for image segmentation and object proposal generation.
Computer Vision (ECCV), (pp. 169–185). Munich, Germany. Retrieved from http IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.
s://openaccess.thecvf.com/content_ECCV_2018/html/Tao_Kong_Deep_Feature https://doi.org/10.1109/TPAMI.2016.2537320
_Pyramid_ECCV_2018_paper.html. Qian, Q., Chen, L., Li, H., & Jin, R. (2019). DR loss: Improving object detection by
Kong, T., Yao, A., Chen, Y., & Sun, F. (2016). Hypernet: Towards accurate region distributional ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision
proposal generation and joint object detection. In Proceedings of the IEEE conference and Pattern Recognition (CVPR) (pp. 12164–12172).
on computer vision and pattern recognition (pp. 845–853). Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. Proceedings of the
LaLonde, R., & Bagci, U. (2018). Capsules for Object Segmentation. arXiv. Cornell IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (pp.
University. Retrieved from https://arxiv.org/abs/1804.04241. 7263–7271). Retrieved from https://openaccess.thecvf.com/content_cvpr_2017/ht
Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In ml/Redmon_YOLO9000_Better_Faster_CVPR_2017_paper.html.
Proceedings of the European Conference on Computer Vision (ECCV) (pp. 734–750). Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv.
Lee, S., Kwak, S., & Cho, M. (2018). Universal bounding box regression and its Retrieved from https://arxiv.org/abs/1804.02767.
applications. Asian Conference on Computer Vision (pp. 373–387). Cham: Springer. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified,
Leevy, J. L., Khoshgoftaar, T. M., Bauder, R. A., & Seliya, N. (2018). A survey on real-time object detection. In Proceedings of the IEEE conference on computer vision and
addressing high-class imbalance in big data. Journal of Big Data, 5(1), 42. https:// pattern recognition (pp. 779–788).
doi.org/10.1186/s40537-018-0151-6 Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object
Li, Z., & Zhou, F. (2017). FSSD: Feature fusion single shot multibox detector. arXiv. detection with region proposal networks. Advances in neural information processing
Retrieved from https://arxiv.org/abs/1712.00960. systems, (pp. 91–99).
Rezatofighi, Hamid, Tsoi, Nathan, Gwak, JunYoung, Sadeghian, Amir, Reid, Ian, &
Savarese, Silvio (2019). Generalized intersection over union: A metric and a loss for

13
Y. Liu et al. Expert Systems With Applications 172 (2021) 114602

bounding box regression.. In Proceedings of the IEEE/CVF Conference on Computer Xia, G., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., . . . Pelillo, M. (2018). DOTA: A
Vision and Pattern Recognition (pp. 658–666). large-scale dataset for object detection in aerial images. Proceedings of the IEEE
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for Conference on Computer Vision and Pattern Recognition (CVPR; pp. 3974–3983).
biomedical image segmentation. International Conference on Medical Image Salt Lake City: IEEE Xplore. Retrieved from https://openaccess.thecvf.com/content_c
Computing and Computer-Assisted Intervention. 9351, pp. 234–241. Springer, vpr_2018/html/Xia_DOTA_A_Large-Scale_CVPR_2018_paper.html.
Cham. https://doi.org/10.1007/978-3-319-24574-4_28. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-
Rothe, R., Guillaumin, M., & Van Gool, L. (2014). Non-maximum suppression for object scale scene recognition from abbey to zoo. IEEE Computer Society Conference on
detection by passing messages between windows. Asian conference on computer Computer Vision and Pattern Recognition (pp. 3485–3492). San Francisco, CA, USA:
vision (pp. 290–306). Springer, Cham. IEEE. https://dx.doi.org/10.1109/CVPR.2010.5539970.
Samangouei, P., Chellappa, R., Najibi, M., & Davis, L. S. (2018). Face-MagNet: Yan, J., Wang, H., Yan, M., Diao, W., Sun, X., & Li, H. (2019). IoU-adaptive deformable R-
Magnifying feature maps to detect small faces. IEEE Winter Conference on CNN: make full use of IoU for multi-class object detection in remote sensing imagery.
Applications of Computer Vision (WACV). Lake Tahoe, NV, USA: IEEE. https://dx.do Remote Sensing, 11(3), 286. https://doi.org/10.3390/rs11030286
i.org/10.1109/WACV.2018.00020. Yang, F., Choi, W., & Lin, Y. (2016). Exploit all the layers: Fast and accurate CNN object
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors detector with scale dependent pooling and cascaded rejection classifiers.
with online hard example mining. In Proceedings of the IEEE Conference on Computer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Vision and Pattern Recognition (CVPR) (pp. 761–769). (CVPR; pp. 2129–2137). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.
Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016). Beyond skip connections: thecvf.com/content_cvpr_2016/html/Yang_Exploit_All_the_CVPR_2016_paper.html.
Top-down modulation for object detection. arXiv. Retrieved from https://arxiv. Yang, S., Luo, P., Loy, C., & Tang, X. (2016). WIDER FACE: A face detection benchmark.
org/abs/1612.06851. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale (CVPR; pp. 5525-5533). Las Vegas: IEEE Xplore. Retrieved from https://openaccess.
image recognition. arXiv. Retrieved from preprint arXiv:1409.1556. thecvf.com/content_cvpr_2016/html/Yang_WIDER_FACE_A_CVPR_2016_paper.html.
Singh, B., Najibi, M., & Davis, L. S. (2018). SNIPER: Efficient multi-scale training. Yang, X., Liu, Q., Yan, J., & Li, A. (2019). R3Det: Refined single-stage detector with
Advances in Neural Information Processing Systems, 9310–9320. feature refinement for rotating object. arXiv. Cornell University. Retrieved from htt
Soviany, P., & Ionescu, R. T. (2018a). In Optimizing the Trade-Off between Single-Stage and ps://arxiv.org/abs/1908.05612.
Two-Stage Deep Object Detectors using Image Difficulty Prediction. Timisoara, Romania, Yang, X., Sun, H., Fu, K., Yang, J., Sun, X., Yan, M., et al. (2018). Automatic ship
Romania: IEEE Xplore. https://doi.org/10.1109/SYNASC.2018.00041. detection in remote sensing images from google earth of complex scenes based on
Soviany, P., & Ionescu, R. T. (2018b). Frustratingly easy trade-off optimization between multiscale rotation dense feature pyramid networks. Remote Sensing, 10(1), 132.
single-stage and two-stage deep object detectors. Proceedings of the European https://doi.org/10.3390/rs10010132
Conference on Computer Vision (ECCV). Yoo, Y., Dan, D., & Yun, S. (2019). EXTD: Extremely tiny face detector via iterative filter
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going reuse. arXiv. Cornell University. Retrieved from https://arxiv.org/abs/1906.06579.
deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision Yu, F., Wang, D., Shelhamer, E., & Darrell, T. (2018). Deep layer aggregation.
and Pattern Recognition (CVPR) (pp. 1–9). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the (CVPR; pp. 2403–2412). Salt Lake City: IEEE Xplore. Retrieved from https://op
inception architecture for computer vision. In Proceedings of the IEEE Conference on enaccess.thecvf.com/content_cvpr_2018/html/Yu_Deep_Layer_Aggregation_C
Computer Vision and Pattern Recognition (CVPR) (pp. 2818–2826). VPR_2018_paper.html.
Tang, X., Du, D., He, Z., & Liu , J. (2018). PyramidBox: A context-assisted single shot face Yu, J., Jiang, Y., Wang, Z., Cao, Z., & Huang, T. (2016). UnitBox: An advanced object
detector. Proceedings of the European Conference on Computer Vision (ECCV), (pp. detection network. Proceedings of the 24th ACM international conference on
797–813). Retrieved from https://openaccess.thecvf.com/content_ECCV_2018/html Multimedia., (pp. 516–520). Retrieved from https://dl.acm.org/doi/abs/10.1145/
/Xu_Tang_PyramidBox_A_Context-assisted_ECCV_2018_paper.html. 2964284.2967274?casa_token=qzC_K8AjZLAAAAAA:8HLv_FdNVhdvhFq5UEX3n_
Tian, W., Wang, Z., Shen, H., Deng, W., Meng, Y., Chen, B., . . . Huang, X. (2018). 5LZXGAVr4gL3Uw7JnnBTngyaYEnJkhWZxrrGfBlpfFWCSJ4Dw2EVTw.
Learning better features for face detection with feature fusion and segmentation Zagoruyko, S., Lerer, A., Lin, T.-Y., Pinheiro, P., Gross, S., Chintala, S., & Dollár, P.
supervision. arXiv. Retrieved from https://arxiv.org/abs/1811.08557. (2016). A multipath network for object detection. arXiv. Cornell University.
Tong, K., Wu, Y., & Zhou, F. (2020). Recent advances in small object detection based on Retrieved from https://arxiv.org/abs/1604.02135.
deep learning: A review. Image and Vision Computing, 97. https://doi.org/10.1016/j. Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-Shot Refinement Neural
imavis.2020.103910 Network for Object Detection. Proceedings of the IEEE Conference on Computer
Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003). Context-based vision Vision and Pattern Recognition (CVPR; pp. 4203-4212). Salt Lake City: IEEE Xplore.
system for place and object recognition. Proceedings Ninth IEEE International Retrieved from https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_
Conference on Computer Vision (pp. 273–280). Nice, France, France: IEEE Xplore. Single-Shot_Refinement_Neural_CVPR_2018_paper.html.
https://dx.doi.org/10.1109/ICCV.2003.1238354. Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., & Li, S. Z. (2017). S3FD: Single shot scale-
Tychsen-Smith, L., & Petersson, L. (2017). DeNet: Scalable real-time object detection invariant face detector. Proceedings of the IEEE International Conference on
with directed sparse sampling. In Proceedings of the IEEE international conference Computer Vision (ICCV; pp. 192–201). Venice, Italy: IEEE Xplore. Retrieved from
on computer vision (pp. 428-436). Venice, Italy: IEEE Xplore. Retrieved from http https://openaccess.thecvf.com/content_iccv_2017/html/Zhang_S3FD_Single_Shot
s://openaccess.thecvf.com/content_iccv_2017/html/Tychsen-Smith_DeNet_Scala _ICCV_2017_paper.html.
ble_Real-Time_ICCV_2017_paper.html. Zhang, Z., Qiao, S., Xie, C., Shen, W., Wang, B., & Yuille, A. (2018). Single-shot object
Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective detection with enriched semantics. In Proceedings of the IEEE Conference on Computer
search for object recognition. International Journal of Computer Vision, 104(2), Vision and Pattern Recognition (CVPR) (pp. 5813–5821).
154–171. https://doi.org/10.1007/s11263-013-0620-5 Zhao, Z.-Q., Zheng, P., Xu, S.-T., & Wu, X. (2019). Object detection with deep learning: A
Wang, H., Li, Z., Ji, X., & Wang, Y. (2017). Face R-CNN. arXiv. Cornell University. review. IEEE Transactions on Neural Networks and Learning Systems, 30(11),
Retrieved from https://arxiv.org/abs/1706.01061. 3212–3232. https://doi.org/10.1109/TNNLS.2018.2876865
Wang, J., Chen, K., Yang, S., Loy, C. C., & Lin, D. (2019). Region proposal by guided Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). Distance-IoU loss: Faster and
anchoring. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern better learning for bounding box regression. In AAAI, (pp. 12993–13000).
Recognition (CVPR), (pp. 2965–2974). Retrieved from https://openaccess.thecvf. Zhou, X., Zhuo, J., & Krahenbuhl, P. (2019). Bottom-up object detection by grouping
com/content_CVPR_2019/html/Wang_Region_Proposal_by_Guided_Anchoring_C extreme and center points. Proceedings of the IEEE/CVF Conference on Computer
VPR_2019_paper.html. Vision and Pattern Recognition (CVPR), (pp. 850-859). Long Beach, California.
Wang, J., Yuan, Y., & Network, G. Y. (2017). An Effective Face Detector for the Occluded Retrieved from https://openaccess.thecvf.com/content_CVPR_2019/html/Zhou_Bott
Faces. arXiv. Cornell University. Retrieved from https://arxiv.org/abs/1711.07246. om-Up_Object_Detection_by_Grouping_Extreme_and_Center_Points_CVPR_2019_pape
Wang, X., Chen, K., Huang, Z., Yao, C., & Liu, W. (2017). Point linking network for object r.html.
detection. arXiv. Retrieved from https://arxiv.org/abs/1706.03646. Zhu, C., Tao, R., Luu, K., & Savvides, M. (2018). Seeing small faces from robust anchor’s
Wang, Y., Lin, Z., Shen, X., Zhang, J., & Cohen, S. (2018). Concept Mask: Large-scale perspective. Proceedings of the IEEE Conference on Computer Vision and Pattern
segmentation from semantic concepts. Proceedings of the European Conference on Recognition (CVPR; pp. 5127–5136). Salt Lake City: IEEE Xplore. Retrieved from
Computer Vision (ECCV), (pp. 530-546). Munich, Germany. Retrieved from http https://openaccess.thecvf.com/content_cvpr_2018/html/Zhu_Seeing_Small_Faces_C
s://openaccess.thecvf.com/content_ECCV_2018/html/Yufei_Wang_ConceptM VPR_2018_paper.html.
ask_Large-Scale_Segmentation_ECCV_2018_paper.html. Zhu, C., Zheng, Y., Luu, Y., & Savvides, M. (2017). CMS-RCNN: Contextual multi-scale
Wu, X., Hong, D., Ghamisi, P., Li, W., & Tao, R. (2018). MsRi-CCF: Multi-Scale and region-based CNN for unconstrained face detection. Deep learning for biometrics.
Rotation-Insensitive Convolutional Channel Features for Geospatial Object Advances in computer vision and pattern recognition (pp. 57–79). https://doi.org/
Detection. Remote Sensing. https://doi.org/10.3390/rs10121990 10.1007/978-3-319-61657-5_3.
Wu, X., Sahoo, D., & Hoi, C. S. (2020). Recent advances in deep learning for object Zou, Z., Shi, Z., Guo, Y., & Ye, J. (2019). Object Detection in 20 Years: A Survey. arXiv.
detection. Neurocomputing, 396(5), 39–64. https://doi.org/10.1016/j. Retrieved from https://arxiv.org/abs/1905.05055.
neucom.2020.01.085

From Classical Techniques To Convolution-Based Models: A Review of Object Detection Algorithms
No ratings yet
From Classical Techniques To Convolution-Based Models: A Review of Object Detection Algorithms
6 pages
Caterpillar Model
100% (1)
Caterpillar Model
109 pages
Hot Fress
100% (2)
Hot Fress
37 pages
Real Time Object Detection Using Deep Learning
No ratings yet
Real Time Object Detection Using Deep Learning
6 pages
19 - Heating and Ventilating Systems - HVAC
No ratings yet
19 - Heating and Ventilating Systems - HVAC
6 pages
Havi Batch 10
No ratings yet
Havi Batch 10
15 pages
Endorsement of Higher Qualification-New
0% (1)
Endorsement of Higher Qualification-New
2 pages
Seminar Paper by Roquia Salam
No ratings yet
Seminar Paper by Roquia Salam
29 pages
2003 07442v1 PDF
No ratings yet
2003 07442v1 PDF
7 pages
Knowledge-Based Systems
No ratings yet
Knowledge-Based Systems
10 pages
10.3934 Mbe.2023282
No ratings yet
10.3934 Mbe.2023282
40 pages
Focus-And-Detect A Small Object Detection Framework For Aerial Images
No ratings yet
Focus-And-Detect A Small Object Detection Framework For Aerial Images
9 pages
A Comprehensive Survey of LIDAR-based 3D Object Detection Methods
No ratings yet
A Comprehensive Survey of LIDAR-based 3D Object Detection Methods
29 pages
1 s2.0 S0262885622001007 Main
No ratings yet
1 s2.0 S0262885622001007 Main
26 pages
Transformers in Small Object Detection - SOTA
No ratings yet
Transformers in Small Object Detection - SOTA
20 pages
A Survey of Modern Object Detection Literature Using Deep Learning
No ratings yet
A Survey of Modern Object Detection Literature Using Deep Learning
15 pages
Object Detection With Deep Learning: A Review
No ratings yet
Object Detection With Deep Learning: A Review
21 pages
Research Article: An Evaluation of Deep Learning Methods For Small Object Detection
No ratings yet
Research Article: An Evaluation of Deep Learning Methods For Small Object Detection
18 pages
A Survey of Deep Learning-Based Object Detection
No ratings yet
A Survey of Deep Learning-Based Object Detection
30 pages
Remotesensing 15 03265
No ratings yet
Remotesensing 15 03265
29 pages
Final Report - Removed
No ratings yet
Final Report - Removed
43 pages
Object Detection With Deep Learning: A Review
No ratings yet
Object Detection With Deep Learning: A Review
21 pages
Computer Vision 3
No ratings yet
Computer Vision 3
38 pages
A Novel Model To Detect and Categorize Objects From Images by Using A Hybrid Machine Learning Model
No ratings yet
A Novel Model To Detect and Categorize Objects From Images by Using A Hybrid Machine Learning Model
13 pages
Towards Large-Scale Small Object Detection: Survey and Benchmarks
No ratings yet
Towards Large-Scale Small Object Detection: Survey and Benchmarks
24 pages
Object Detection Using ELAN
No ratings yet
Object Detection Using ELAN
6 pages
Computer Vision Application
No ratings yet
Computer Vision Application
2 pages
Tata 1412g LPT BS6 Trucks Overview - Specs, Features & Images
100% (1)
Tata 1412g LPT BS6 Trucks Overview - Specs, Features & Images
2 pages
Electronics 12 01515
No ratings yet
Electronics 12 01515
21 pages
FE-YOLOv5 Feature Enhancement Network Based On YOLOv5 For Small Object Detection1
No ratings yet
FE-YOLOv5 Feature Enhancement Network Based On YOLOv5 For Small Object Detection1
8 pages
Ijlbps 6620dd20c5747
No ratings yet
Ijlbps 6620dd20c5747
8 pages
John 2020 Comparative
No ratings yet
John 2020 Comparative
7 pages
Object Detectionusing Machine Learningand Deep Learning
No ratings yet
Object Detectionusing Machine Learningand Deep Learning
9 pages
A Comprehensive Review For Aircraft Detection Tech
No ratings yet
A Comprehensive Review For Aircraft Detection Tech
21 pages
Object Detection With DL
No ratings yet
Object Detection With DL
17 pages
Attention and Feature Fusion SSD For Remote Sensing Object Detection
No ratings yet
Attention and Feature Fusion SSD For Remote Sensing Object Detection
9 pages
A Survey of 3D Object Detection: Wei Liang Pengfei Xu Ling Guo Heng Bai Yang Zhou Feng Chen
No ratings yet
A Survey of 3D Object Detection: Wei Liang Pengfei Xu Ling Guo Heng Bai Yang Zhou Feng Chen
25 pages
An Improved YOLOv5 Method For Small Object
No ratings yet
An Improved YOLOv5 Method For Small Object
10 pages
Improved YOLOv7-Tiny For Object Detection Based On
No ratings yet
Improved YOLOv7-Tiny For Object Detection Based On
23 pages
Slicing Aidedhyperinferenceandfine-Tuning Forsmallobjectdetection
No ratings yet
Slicing Aidedhyperinferenceandfine-Tuning Forsmallobjectdetection
5 pages
Fin Irjmets1684232858
No ratings yet
Fin Irjmets1684232858
9 pages
hw4 Sol PDF
100% (2)
hw4 Sol PDF
23 pages
Vijay Report
No ratings yet
Vijay Report
14 pages
Woodmizer LT15 Parts
No ratings yet
Woodmizer LT15 Parts
39 pages
1 s2.0 S0167865523000727 Main
No ratings yet
1 s2.0 S0167865523000727 Main
8 pages
Overview of Object Detection Based On Deep Learnin
No ratings yet
Overview of Object Detection Based On Deep Learnin
7 pages
Ref 19
No ratings yet
Ref 19
6 pages
Literature Survey For Robotics
No ratings yet
Literature Survey For Robotics
6 pages
E3sconf Iconnect2023 04032
No ratings yet
E3sconf Iconnect2023 04032
11 pages
Investigations of Object Detection in Im
No ratings yet
Investigations of Object Detection in Im
46 pages
5-Jul-11093 Paper
No ratings yet
5-Jul-11093 Paper
5 pages
Object Detection Presentation
No ratings yet
Object Detection Presentation
12 pages
基于相似距离的微小物体检测标签分配
No ratings yet
基于相似距离的微小物体检测标签分配
8 pages
A Survey of Modern Deep Learning Based Object Detection Models
No ratings yet
A Survey of Modern Deep Learning Based Object Detection Models
19 pages
Object Detection Using Machine Learningand Neural Networks
No ratings yet
Object Detection Using Machine Learningand Neural Networks
10 pages
Q6 6r80manual PDF
No ratings yet
Q6 6r80manual PDF
49 pages
Object Detectionwith Convolutional Neural Networks
No ratings yet
Object Detectionwith Convolutional Neural Networks
12 pages
E3sconf Icmed-Icmpc2023 01016
No ratings yet
E3sconf Icmed-Icmpc2023 01016
6 pages
An Evaluation of Deep Learning Methods For Small Object
No ratings yet
An Evaluation of Deep Learning Methods For Small Object
18 pages
Optimizing YOLOv5s
No ratings yet
Optimizing YOLOv5s
5 pages
An Investigation of Deep Neural Network Based Techniques For Object Detection An
No ratings yet
An Investigation of Deep Neural Network Based Techniques For Object Detection An
6 pages
Ding 2018 IOP Conf. Ser. Mater. Sci. Eng. 322 062024
No ratings yet
Ding 2018 IOP Conf. Ser. Mater. Sci. Eng. 322 062024
6 pages
5 Ijlemr 77839
No ratings yet
5 Ijlemr 77839
5 pages
Lab 6 Introduction To Basic Interface
No ratings yet
Lab 6 Introduction To Basic Interface
7 pages
Powertronic Installation Manual-Kawasaki Er-6N (2012-2018)
No ratings yet
Powertronic Installation Manual-Kawasaki Er-6N (2012-2018)
33 pages
Object Detection Using Deep Learning
No ratings yet
Object Detection Using Deep Learning
5 pages
Electronics-Object Detection YOLO
No ratings yet
Electronics-Object Detection YOLO
12 pages
IT 2023 - Digital - (SEGi Susan 012-2820 251)
No ratings yet
IT 2023 - Digital - (SEGi Susan 012-2820 251)
24 pages
Bricks
No ratings yet
Bricks
34 pages
LOCATE SWIFT MT103 Auto Cash Transfer in EURO FOLDER
No ratings yet
LOCATE SWIFT MT103 Auto Cash Transfer in EURO FOLDER
5 pages
Standard Truss Garage Plan
No ratings yet
Standard Truss Garage Plan
12 pages
Flow Over Weirs Apparatus: Model FM 02
No ratings yet
Flow Over Weirs Apparatus: Model FM 02
22 pages
Surveillance Systems
No ratings yet
Surveillance Systems
17 pages
Module 8 Artificial Intelligence in Monitoring and Evaluation
No ratings yet
Module 8 Artificial Intelligence in Monitoring and Evaluation
23 pages
3 Categories of Entrants
No ratings yet
3 Categories of Entrants
5 pages
Permission Forms
No ratings yet
Permission Forms
4 pages
WIREs Data Min Knowl - 2023 - Shaik - Remote Patient Monitoring Using Artificial Intelligence Current State
No ratings yet
WIREs Data Min Knowl - 2023 - Shaik - Remote Patient Monitoring Using Artificial Intelligence Current State
31 pages
Full Paper Title in Title Case: Name Surname, Name Surname
No ratings yet
Full Paper Title in Title Case: Name Surname, Name Surname
4 pages
Toyota Sienna 6
No ratings yet
Toyota Sienna 6
2 pages
Itri 613 Database Systems Assignment 1 29435927
No ratings yet
Itri 613 Database Systems Assignment 1 29435927
9 pages
2018 M.SC 2nd Sem
No ratings yet
2018 M.SC 2nd Sem
12 pages
?simplify Allocations With SAP Analytics Cloud?
No ratings yet
?simplify Allocations With SAP Analytics Cloud?
15 pages
Espan140 Solution 54860159 8697
No ratings yet
Espan140 Solution 54860159 8697
39 pages
Safetica Datasheet EN 2024-04-11
No ratings yet
Safetica Datasheet EN 2024-04-11
8 pages
Problem Set 1 Answers
No ratings yet
Problem Set 1 Answers
4 pages
JS7 ClassNotes
No ratings yet
JS7 ClassNotes
5 pages
BCN Campus Recruitment Process - FAQ
No ratings yet
BCN Campus Recruitment Process - FAQ
1 page
Object Detection: Advances, Applications, and Algorithms
From Everand
Object Detection: Advances, Applications, and Algorithms
Fouad Sabry
No ratings yet
Geometric Feature Learning: Unlocking Visual Insights through Geometric Feature Learning
From Everand
Geometric Feature Learning: Unlocking Visual Insights through Geometric Feature Learning
Fouad Sabry
No ratings yet
Activity Recognition: Fundamentals and Applications
From Everand
Activity Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet