
Computer Science Review 38 (2020) 100301


Review article

A comprehensive and systematic look up into deep learning based object detection techniques: A review

Vipul Sharma ∗, Roohie Naaz Mir
Department of Computer Science & Engineering, National Institute of Technology, Srinagar, India
∗ Corresponding author. E-mail addresses: [email protected] (V. Sharma), [email protected] (R.N. Mir).
https://doi.org/10.1016/j.cosrev.2020.100301

Article info

Article history:
Received 12 May 2020
Received in revised form 20 July 2020
Accepted 26 August 2020
Available online 11 September 2020

Keywords:
Convolutional neural networks
Object detection
Localization
Segmentation
Classification
Deep learning

Abstract

Object detection can be regarded as one of the most fundamental and challenging visual recognition tasks in computer vision, and it has received great attention over the past few decades. Object detection techniques find application in almost all spheres of life, the most prominent ones being surveillance, autonomous driving and pedestrian detection. The primary focus of visual object detection is to detect objects belonging to certain target classes with precise localization in a realistic scene or input image, and to assign each detected instance of an object a predefined class label. Owing to the rapid development of deep neural networks, the performance of object detectors has rapidly improved, and as a result deep learning based detection techniques have been actively studied over the past several years. In this paper we provide a comprehensive survey of the latest advances in deep learning based visual object detection. We first review a large body of recent work in the literature and use it to analyze traditional and current object detectors. We then provide a rigorous overview of backbone architectures for object detection, followed by a systematic review of current learning strategies. Popular datasets and metrics used for object detection are analyzed as well. Finally, we discuss applications of object detection and provide several future directions to facilitate future research on visual object detection with deep learning.
© 2020 Elsevier Inc. All rights reserved.

Contents

1. Introduction
 1.1. Summary of previous survey papers on object detection
 1.2. Contributions of this survey
 1.3. Article organization
2. Traditional detectors vs. state-of-the-art detectors
 2.1. Traditional object detectors
  2.1.1. Viola Jones detectors
  2.1.2. Histogram of oriented gradients (HOG) detector
  2.1.3. Deformable part based model (DPM)
 2.2. Deep learning based object detectors
  2.2.1. One stage object detectors
  2.2.2. Two stage detectors
  2.2.3. Latest object detectors
3. Proposal generation and feature representation learning for object detection
 3.1. Proposal generation
  3.1.1. Traditional proposal generation techniques
  3.1.2. Anchor based proposal generation methods
  3.1.3. Keypoint based proposal generation methods
 3.2. Feature representation learning for object detection
  3.2.1. Multi-scale feature representation learning
  3.2.2. Contextual reasoning
  3.2.3. Deformable feature learning


4. Backbone architecture for object detection
 4.1. Basic architecture of a CNN
 4.2. CNN backbone for object detection
5. Learning strategies for object detection
 5.1. Learning strategies for training
 5.2. Testing stage
 5.3. Other latest learning strategies
6. Popular datasets and metrics used for object detection
  6.0.1. PASCAL VOC dataset
  6.0.2. MS-COCO dataset
  6.0.3. ImageNet dataset
  6.0.4. OpenImages dataset
  6.0.5. VisDrone 2018 dataset
  6.0.6. LVIS [1]
 6.1. Pedestrian detection dataset
 6.2. Face detection datasets
7. Applications of object detection
 7.1. Face detection
 7.2. Pedestrian detection
 7.3. Anomaly detection
 7.4. License plate recognition
 7.5. Autonomous driving
 7.6. Traffic sign recognition
 7.7. Computer aided diagnosis (CAD) system
 7.8. Event detection
 7.9. Pattern detection
 7.10. Image caption generation
 7.11. Salient object detection
 7.12. Text detection
 7.13. Point clouds 3D object detection
 7.14. 2D/3D pose detection
 7.15. Fine grained visual recognition
 7.16. Edge detection
8. Concluding remarks and future research areas
 8.1. Future research trends
Declaration of competing interest
References
Glossary

1. Introduction

Object detection is one of the most important computer vision tasks; it deals with detecting visual instances of objects of a certain class in a cluttered realistic scene or an input image. The primary objective of object detection techniques is to develop computational models that provide the information needed by applications based on computer vision. Due to its wide range of applications and recent advancements, object detection has attracted colossal amounts of attention in recent years, and the task is currently under rigorous investigation in both real world applications and academia. Some prominent applications of object detection are autonomous driving, remote sensing, video surveillance and robotic vision. Many factors can be credited for the speedy evolution of detection models; among them, deep convolutional neural networks and the high computing power of GPUs can be considered the most important contributors [2–4]. Presently, deep learning based computational models are largely employed for both generic and domain-specific object detection. These computational models act as the backbone architecture in most object detectors and are used to perform a variety of tasks such as extracting features from an input image, segmentation, classification and object localization. The field of computer vision encompasses various recognition problems, like image classification [5], semantic segmentation [6], object detection and instance segmentation [7,8]. An example covering these recognition problems is given in Fig. 1.

Fig. 1. Progression of object recognition techniques: (a) image classification, (b) object detection, (c) semantic segmentation and (d) instance segmentation.

Image classification aims to identify the semantic categories/classes of objects in a given image; the task of object detection, on the other hand, is not only to identify the specific object categories but also to find the precise locations of the category specific objects with the help of bounding boxes. In image segmentation, the aim is to assign a specific class label to each pixel, thereby providing an even richer understanding of the image. However, if we are given an image comprising multiple objects belonging to the same class, semantic segmentation fails to draw a distinction between the different object instances. To overcome this drawback, a relatively new technique named instance segmentation has been proposed, which not only identifies different object instances belonging to the same category but also assigns each of them a separate pixel level class mask. Therefore, in instance segmentation, pixel-level localization is required instead of bounding-box based object localization. A good object detection algorithm should have a robust understanding of semantic cues as well as spatial information about the image, and in fact, as an important part of image understanding, object detection has been broadly used in many computer vision applications like face recognition [9–11], pedestrian detection [12–14], logo detection [15,16] and video analysis [17,18]. At the same time, due to immense variations in lighting conditions, occlusions, viewpoints and poses, the task of object detection along with localization has become a tedious one and is very hard to accomplish with absolute perfection. Due to these challenges, much attention has been given to this field in recent years [7,19–21].

Table 1 lists the acronyms used in this paper.

Table 1
List of acronyms used in the paper.

mAP: Mean Average Precision
NMS: Non Maximum Suppression
FPS: Frames Per Second
DCN: Deformable Convolution Network
SIFT: Scale-Invariant Feature Transform
MSCOCO: Microsoft Common Objects in Context
SURF: Speeded Up Robust Features
NAS-FPN: Neural Architecture Search Feature Pyramid Network
HOG: Histogram of Oriented Gradients
ResNet: Residual Network
SVM: Support Vector Machine
ReLU: Rectified Linear Unit
DPM: Deformable Parts Model
VGG: Visual Geometry Group
ROI: Region of Interest
SGD: Stochastic Gradient Descent
CNN: Convolutional Neural Network
DPN: Dual Path Network
R-CNN: Region-based Convolutional Neural Network
RPN: Region Proposal Network
R-FCN: Region-based Fully Convolutional Network
CRAFT: Cascade Region Proposal Network and Fast R-CNN
FPN: Feature Pyramid Network
GAN: Generative Adversarial Network
SPP-net: Spatial Pyramid Pooling Network
DSOD: Deeply Supervised Object Detector
YOLO: You Only Look Once
ILSVRC: ImageNet Large Scale Visual Recognition Challenge
SSD: Single Shot Multibox Detector
LVIS: Large Vocabulary Instance Segmentation
VOC: Visual Object Classes
M-DOD: Multi-Domain Object Detection
PAMI: Pattern Analysis and Machine Intelligence
CVIU: Computer Vision and Image Understanding
ACM CS: ACM Computing Surveys
PR: Pattern Recognition
LIDAR: Light Detection and Ranging

The problem of generic object detection is to find the precise positions where objects are located in a given image and the category to which each of these object instances belongs. The mathematical formulation of the problem is as follows.

Assume we are given a collection of M images with annotations x_1, x_2, ..., x_M, and that the i-th image x_i contains N_i objects belonging to K categories/classes, with ground truth

y_i = \{(k_{i1}, p_{i1}), (k_{i2}, p_{i2}), \ldots, (k_{iN_i}, p_{iN_i})\}

Here k_{ij} (k_{ij} \in C) denotes the category/class and p_{ij} the label mask of the j-th object in image x_i. The detector d is parameterized by \theta. For image x_i, the prediction y_i^{pred} shares the same format as y_i:

y_i^{pred} = \{(k_{i1}^{pred}, p_{i1}^{pred}), (k_{i2}^{pred}, p_{i2}^{pred}), \ldots, (k_{iN_i}^{pred}, p_{iN_i}^{pred})\}

In order to optimize the detector d, a loss function is defined as

\ell(x, \theta) = \frac{1}{M} \sum_{i=1}^{M} \ell\left(y_i^{pred}, x_i, y_i; \theta\right) + \frac{\lambda}{2} \lVert\theta\rVert_2^2 \quad (1)

Here the second term is a regularizer and \lambda is the trade-off parameter. In order to evaluate the quality of localization, we use a metric known as "intersection over union" (IoU) between predictions and ground truth objects:

IoU\left(p^{pred}, p^{groundtruth}\right) = \frac{Area\left(p^{pred} \cap p^{groundtruth}\right)}{Area\left(p^{pred} \cup p^{groundtruth}\right)} \quad (2)

Here p^{groundtruth} refers to the ground truth bounding box. In order to decide whether a prediction accurately covers an object, an IoU threshold denoted by \Omega is set; the most commonly used value for this threshold is 0.5. For an object detection system, a prediction with the correct class label as well as successful localization (IoU above the threshold) is referred to as a positive prediction, and as a negative prediction otherwise.

For generic object detection, mean average precision (mAP) over the K classes is used as the evaluation metric. Apart from detection accuracy, inference speed can also be regarded as an important metric to evaluate the performance of an object detection system. In real world scenarios, the efficiency of object detectors is often evaluated using FPS (frames per second), i.e. how many images can be processed per second; a real time object detector typically achieves an inference speed of at least 20 frames per second.
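To make Eq. (2) concrete, here is a minimal NumPy sketch of IoU for two axis-aligned boxes; the (x1, y1, x2, y2) corner format and the threshold check are illustrative assumptions, not fixed by the text:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# 25 / 175 ≈ 0.143, below the usual 0.5 threshold, so this would count as negative.
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))
```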
Generic object detection thus asks whether and where objects are located in a given image (object localization) and which category/class each of the objects belongs to (classification). The traditional object detection pipeline was divided into three steps: informative region selection, feature vector extraction and classification.

In the informative region selection stage, since different objects can be present at any position in an image, the objective is to search the specific locations in the image which may contain objects. It is therefore natural to scan the entire input image with a multiscale sliding window. The advantage of a multiscale sliding window is that it can capture objects of different scales and aspect ratios; the obvious disadvantage is that it is computationally very expensive and produces many redundant windows. In order to identify the different objects in an image, we then need to extract features that represent meaningful information about the underlying objects. Therefore, in the second stage of this detection pipeline, a fixed-size feature vector is extracted from each region covered by the multiscale sliding window. The features extracted from these regions can be encoded using descriptors like SIFT [22], SURF [23], Haar [24,25] and HOG [26,27]. The primary reason for using these descriptors to encode the extracted features is that they are robust to scale, illumination and rotation variance.
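The informative region selection stage described above can be sketched as follows; the window size, stride and pyramid levels are illustrative choices, and the crude subsampling stands in for proper image resizing:

```python
import numpy as np

def sliding_windows(image, window=64, stride=16, downsamples=(1, 2, 4)):
    """Yield every window position over a coarse image pyramid."""
    for d in downsamples:
        scaled = image[::d, ::d]  # crude pyramid level; real code resizes properly
        h, w = scaled.shape[:2]
        for y in range(0, h - window + 1, stride):
            for x in range(0, w - window + 1, stride):
                yield d, x, y, scaled[y:y + window, x:x + window]

image = np.zeros((256, 256))
# 195 candidate windows even for one small image: the source of redundancy noted above.
print(sum(1 for _ in sliding_windows(image)))
```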

However, due to varying appearances, different backgrounds and poor illumination conditions, it is very difficult to manually design a descriptor for all kinds of objects. With feature vectors in hand, a classifier is then needed to separate a target object category from all the other object classes, so as to make the feature representation more informative for visual recognition. Support vector machines can be considered a good choice of classifier due to their strong performance on small scale training data. Apart from SVMs, AdaBoost [28,29], bagging [30] and DPM [31] are also viable options, leading to further improvement in detection accuracy.
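As an illustration of the classification stage, a linear SVM can be trained on such fixed-length feature vectors. The sketch below uses random stand-in vectors of HOG-like dimensionality; a real pipeline would use descriptors extracted from image windows, typically with hard negative mining:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(200, 3780))   # stand-in "object" feature vectors
X_neg = rng.normal(-1.0, 1.0, size=(800, 3780))  # stand-in background vectors
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 800)

clf = LinearSVC(C=0.01).fit(X, y)    # one binary SVM per object class
print(clf.decision_function(X[:3]))  # scores; windows above a threshold are accepted
```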
Traditional methods for object detection thus aim at properly designing feature descriptors to obtain an embedding for each ROI ("region of interest"). Owing to these advancements in feature vector representations and classification models, highly impressive results were obtained on the PASCAL VOC datasets [32,33]. However, from 2007 to 2012 only minimal gains were obtained by building ensemble models and using light variants of these traditionally successful models. This is due to the following factors:

• During the region proposal stage, a large number of redundant bounding boxes are generated by the multi-window sliding strategy, which leads to a large number of false positives at the classification stage.
• All three stages of the object detection pipeline are designed and optimized separately, so a globally optimal solution for the entire setup is hard to obtain.
• It is very hard to bridge the semantic gap using manually crafted low level feature descriptors [34–36].
With the emergence of deep convolutional neural networks [37] and their widespread application to image classification [5,38], object detection techniques based on deep learning [6,21] have achieved significant progress in recent years and have outperformed the traditional detectors. Compared to the manually crafted feature descriptors employed by traditional object detectors, deep CNN based object detectors generate hierarchical feature representations which are learned automatically from data and show more discriminative expressive power.

The advantages of CNN based methods over traditional object detectors are summarized below:

• In contrast to shallow traditional models, a deep CNN based model provides exponentially increased expressive power.
• CNN based object detectors provide an opportunity to jointly optimize several detection related tasks.
• In CNN based object detectors, hierarchical feature vectors/representations can be extracted automatically from the underlying data and disentangled through multi-level non linear mappings.

All these advantages made it natural to design and develop deep learning based object detection techniques with expressive feature representation capability that can be optimized in an end-to-end manner. These properties further made it possible for CNNs to find applicability in many research domains like image classification [39], face recognition [40], video analysis [41,42] and pedestrian detection [27,43,44].

Recent deep learning based object detection architectures can be broadly divided into two categories:

• Single stage detectors like YOLO [20,45] and its different variants [46,47].
• Two stage detectors like R-CNN [7] and its variants [19,21,48].

Single stage detectors directly map each location of the extracted feature maps to class predictions, without a separate region classification step, whereas in two stage detectors a proposal generator first produces a coarse set of region proposals, from which feature vectors are extracted and then used to predict object categories with region classifiers (a compact sketch of the two pipelines follows below). Compared to single stage detectors, two stage detectors generally achieve better detection performance and report state-of-the-art results on publicly available datasets; single stage detectors, however, are considerably more time efficient and hence highly applicable to real time object detection [2,4]. Major developments and milestones of deep learning based object detection techniques are illustrated in Fig. 2.

Fig. 2. Important milestones in object detection research since 1999.
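The contrast between the two families can be summarized with a toy sketch; the "backbone", the dense head and the proposal step below are schematic stand-ins, not a real detection API:

```python
import numpy as np

# Toy "backbone": pool a 256x256 image into an 8x8 feature grid.
backbone = lambda img: img.reshape(8, 32, 8, 32).mean(axis=(1, 3))

def one_stage_detect(image):
    """Score every feature-map location in a single dense pass."""
    features = backbone(image)
    return [(y, x, float(features[y, x])) for y in range(8) for x in range(8)]

def two_stage_detect(image, k=5):
    """Stage 1: keep a sparse set of top-k 'proposals'; stage 2: classify each."""
    features = backbone(image)
    top = features.ravel().argsort()[-k:]
    return [(int(i // 8), int(i % 8), float(features.ravel()[i])) for i in top]

image = np.random.rand(256, 256)
print(len(one_stage_detect(image)), len(two_stage_detect(image)))  # 64 dense vs 5 sparse
```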
1.1. Summary of previous survey papers on object detection

Many prominent survey papers on object detection have been published over the years, as summarized in Table 2. These include many admirable reviews of specific object detection problems like face detection [49,50], text detection [51], pedestrian detection [52–54] and vehicle detection [55]. There are relatively few survey papers which focus directly on deep learning based generic object detection. Exceptions include Zhang et al. [56], who conducted an object class detection survey in 2013; Jiao et al. [57], whose 2019 survey focuses on describing and analyzing the deep learning based object detection task; and Zhao et al. [58], who in 2019 provided a systematic review summarizing representative models and their different characteristics in several generic object detection application domains.
Table 2
Summary of previous survey papers on object detection.
S.No Title of Survey References Venue Year Content

1 Detecting faces in images: a survey Yang et al. [49] PAMI 2002 First survey of face detection from a single image
2 On road vehicle detection: a review Sun et al. [59] PAMI 2006 A review of computer vision based on-road vehicle detection systems
3 Toward category level object recognition Ponce et al. [60] Book 2007 Survey papers on object classification, detection, and segmentation
4 The evolution of object categorization and the challenge of image abstraction Dickinson et al. [61] Book 2009 An outline of the evolution of object categorization over 4 decades
5 Monocular pedestrian detection: survey and experiments Enzweiler and Gavrila [53] PAMI 2010 An assessment of three pedestrian detection techniques
6 Survey of pedestrian detection for advanced driver assistance systems Geronimo et al. [54] PAMI 2009 A thorough review of pedestrian detection for advanced driver assistance systems
7 Context based object categorization: a critical survey Galleguillos and Belongie [62] CVIU 2010 A review of contextual information for object categorization
8 Visual object recognition Grauman and Leibe [63] Tutorial 2011 Object recognition techniques based on instance and category
9 Pedestrian detection: an evaluation of the state of the art Dollar et al. [52] PAMI 2012 A detailed assessment of detectors in monocular images
10 50 years of object recognition: directions forward Andreopoulos and Tsotsos [64] CVIU 2013 A review of the evolution of object recognition systems over 5 decades
11 Object class detection: a survey Zhang et al. [56] ACMCS 2013 Survey of generic object detection methods before 2011
12 Representation learning: a review and new perspectives Bengio et al. [65] PAMI 2013 Unsupervised feature learning and deep learning, probabilistic models, autoencoders, manifold learning, and deep networks
13 Salient object detection: a survey Borji et al. [66] arXiv 2014 A survey of salient object detection techniques
14 A survey on face detection in the wild: past, present and future Zafeiriou et al. [50] CVIU 2015 A survey of face detection techniques in the wild since 2000
15 Text detection and recognition in imagery: a survey Ye and Doermann [51] PAMI 2015 A thorough review of text detection and recognition in colored images
16 Feature representation for statistical learning based object detection: a review Li et al. [67] PR 2015 Feature representation methods in statistical learning based object detection, including handcrafted and deep learning based features
17 Deep learning LeCun et al. [68] Nature 2015 An introduction to deep learning and applications
18 A survey on deep learning in medical image analysis Litjens et al. [69] MIA 2017 A survey on deep learning based image classification, object detection, segmentation and registration in medical image analysis
19 Recent advances in convolutional neural networks Gu et al. [70] PR 2017 A broad survey of the recent advances in CNN and its applications in computer vision, speech and natural language processing
20 Object detection in 20 years: a survey Zou et al. [71] arXiv 2019 A comprehensive review in the light of the technical evolution of object detection techniques in the last two decades
21 A survey of deep learning based object detection Jiao et al. [57] IEEE Access 2019 A broad survey focused on describing and analyzing the deep learning based object detection task
22 Object detection with deep learning: a review Zhao et al. [58] arXiv 2019 A systematic review summarizing representative models and their different characteristics in several generic object detection application domains
23 Recent advances in deep learning for object detection Wu et al. [4] Neurocomputing 2020 Presents a comprehensive understanding of deep learning based object detection algorithms


Two more surveys appeared in the same year: Zou et al. [71] provided a comprehensive review in light of the technical evolution of object detection techniques over the last two decades, and Wu et al. [4] presented a comprehensive treatment of deep learning based object detection algorithms. However, these surveys do not provide a systematic view of deep learning based object detection techniques, the learning strategies that need to be employed, or up to date state-of-the-art detection solutions, and most surveys predate the recent striking success and dominance of deep learning and related methods.

The advantage of deep learning is that it permits computational models to learn extremely complex, subtle and abstract representations, thereby making significant inroads into a large set of problems like medical image analysis, natural language processing, speech recognition and generic object detection. Among the various types of deep neural networks, deep convolutional neural networks have revolutionized the fields of image, video, speech and audio processing. There is a wide range of published surveys on deep learning, including those of Bengio et al. [65], LeCun et al. [68], Gu et al. [70] and Litjens et al. [69], and more recently tutorials at ICCV and CVPR.

In contrast, even though many deep learning based methods have been set forth for object detection, we are not aware of any recent systematic and comprehensive survey covering the latest strategies and techniques related to generic object detection. A detailed review of existing techniques is very much necessary for further progress in object detection, notably for researchers who are willing to enter this field.

1.2. Contributions of this survey

The primary focus of this survey is to describe and analyze the deep learning based object detection problem. Unlike existing surveys, this paper covers state-of-the-art techniques for object detection and also provides various learning strategies for this task, along with future research directions reflecting the rapid development of computer vision research.

• This survey is featured by an in-depth analysis and thorough discussion of various aspects, some of which are, to the best of our knowledge, new.
• We list object detection learning strategies while deliberately omitting basic background material; the latest object detection trends are also discussed so that readers can easily witness the advancements in this field themselves.
• Unlike previous surveys in this field, this paper provides a comprehensive and systematic look into deep learning based object detection techniques, a write up of the most significant research trends and an up to date account of object detection algorithms.

So, the goal of this survey paper is to provide a thorough analysis of object detection techniques based on deep learning.

1.3. Article organization

The rest of the manuscript is organized as follows. In Section 2 we thoroughly compare traditional detection techniques with current object detectors. Proposal generation and feature representation learning techniques for object detection are covered in Section 3. Backbone architectures for object detection are presented in Section 4, followed by learning strategies for object detection in Section 5. Popular datasets and metrics used for generic object detection are provided in Section 6. Detection algorithms for real-world applications are covered in Section 7, and finally we conclude and discuss future research directions in Section 8.

2. Traditional detectors vs. state-of-the-art detectors

In the last 20 years, it has been widely accepted that advancements in the field of object detection have occurred in two historical periods: the traditional object detection era before 2014, and the deep learning based object detection era after 2014 [71].

2.1. Traditional object detectors

Most traditional object detection techniques were built on manually crafted features. Due to the absence of advanced image representation techniques before 2012, the only choice researchers had was to design highly complex feature vector representations, along with a colossal variety of speed-up techniques to compensate for the limited computational resources available.

2.1.1. Viola Jones detectors:
Nineteen years ago, P. Viola and M. Jones developed a human face detector that achieved real time detection [72,73]. This detector outperformed every other algorithm of its time at comparable detection accuracy. It follows a sliding window approach and covers all possible locations in an input image to figure out whether any of the windows contains a human face. Although the sliding window approach is time consuming, the VJ detector improves its detection speed by incorporating three techniques. The first is the integral image, which is basically a way to speed up the computation of Haar-like box features; it makes the computational complexity of each window independent of its size (a short sketch follows below). The second is feature selection, where the AdaBoost algorithm [74] is used to select, from a pool of about 180k dimensions, a small set of features that are helpful for face detection. Third, in order to reduce the computational load further, detection cascades are introduced: a multistage detection pipeline in which likely face targets are given more computational attention.
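The integral image trick can be sketched in a few lines of NumPy: after one cumulative-sum pass, the sum of any rectangle (the building block of Haar-like features) costs four array look-ups regardless of window size. This is an illustration, not the original implementation:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; padded so queries need no bounds checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) via four corner look-ups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())  # both print 30.0
```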
2.1.2. Histogram of oriented gradients (HOG) detector:
The HOG feature descriptor was first proposed by N. Dalal and B. Triggs in 2005 [26]. HOG can be regarded as a breakthrough advancement over SIFT features and shape contexts. It was designed to compute features over a dense grid of uniformly spaced cells. HOG can be used for detecting a wide variety of object categories, but the primary motivation behind its development was pedestrian detection. An important advantage of the HOG detector is that it can detect objects of different sizes: the input image is rescaled a number of times while the size of the detection window is kept unaltered [75–77]. A sketch of computing a HOG descriptor for one detection window follows below.
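A sketch of extracting a HOG descriptor for one detection window with scikit-image; the parameter values follow the common 9-orientation, 8 x 8 cell configuration and are illustrative:

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)  # one 128x64 detection window, e.g. a pedestrian crop
descriptor = hog(window,
                 orientations=9,          # gradient-orientation bins per cell
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),  # blocks give local contrast normalization
                 block_norm='L2-Hys')
print(descriptor.shape)  # (3780,) fixed-length vector, ready for a linear SVM
```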
2.1.3. Deformable part based model (DPM):
The deformable part based model was proposed by P. Felzenszwalb in 2008 [75]. It was originally developed as an extension of the HOG detector, and a series of improvements were later made by R. Girshick [31,76,78,79]. The deformable part based model follows a divide and conquer strategy, in which detecting an object in a given input image is treated as an amalgamation of detections of its parts. For example, the problem of detecting a bird can be considered as detecting its beak, legs, tail and wings.

In a basic DPM detector, we have a root filter along with a number of part filters. The configurations of these part filters can be learned using a weakly supervised strategy instead of being specified manually. To improve detection results, R. Girshick [80] formulated this whole process as a special case of multi-instance learning, incorporating additional techniques like bounding box regression and context priming. Moreover, R. Girshick further developed a strategy for compiling detection models into a cascade architecture which is 10 times faster than the traditional models without compromising accuracy [76,79]. It should be noted that although the majority of modern object detectors have outperformed DPM based models in terms of detection accuracy, most of them remain profoundly influenced by its insights.

2.2. Deep learning based object detectors

Deep learning based object detectors can broadly be divided into two major categories: two stage detectors and one stage detectors. In two stage detectors, a sparse set of proposals is generated in the first stage; in the second stage, feature vectors are extracted from these proposals and encoded by deep convolutional neural networks, followed by predictions over the predefined classes. A one stage detector, on the other hand, typically considers all positions on the input image as potential target objects and tries to categorize each region of interest as either a target object or background. One stage detectors are much quicker and highly viable for real time object detection applications, but exhibit comparatively poor performance when compared with two stage detectors, which often report remarkable results on many publicly available datasets.
2.2.1. One stage object detectors:
As mentioned earlier, one stage detectors do not possess a separate branch for generating proposals. These detectors basically consider all locations of an input image as prospective objects and try to categorize each ROI (region of interest) into either background or a predefined object class. Let us now quickly review some prominent state-of-the-art one stage detectors.

(i) Overfeat: This is one of the earliest single stage deep learning based object detectors, proposed by Sermanet et al. [81]. Since the task of object detection can be perceived as a multi-region classification problem, Overfeat extended an image classifier into a detector by treating the final fully connected layers as 1 x 1 convolutions, so as to permit arbitrarily sized input images (a sketch of this reinterpretation follows below). The classification part of the Overfeat detector produces a grid of predictions on each input region to indicate the presence of an object; once objects are discovered, bounding box regressors are used to refine the predicted regions. To recognize multi-scale objects, the input image is rescaled into multiple images which are fed into the Overfeat setup, and the predictions across all scales are finally merged. Overfeat achieved a remarkable speed advantage over R-CNN by sharing the computation of overlapping regions through its convolutional layers. Despite this speed advantage, it remains difficult to optimize the training of its classifier and regressor jointly.
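The reinterpretation of fully connected layers as convolutions can be sketched in PyTorch; the layer sizes below are illustrative, not Overfeat's exact configuration. On an input larger than the training size, the same head emits a spatial grid of predictions instead of a single one:

```python
import torch
import torch.nn as nn

# A classifier head trained on 5x5 feature maps. The fully connected layers are
# rewritten as convolutions with the same weights reshaped:
head = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=5),   # replaces Linear(256*5*5 -> 4096)
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),  # replaces the remaining "1x1" FC layer
)

small = torch.randn(1, 256, 5, 5)
large = torch.randn(1, 256, 9, 9)          # a bigger image gives a bigger feature map
print(head(small).shape)  # torch.Size([1, 1000, 1, 1]): one prediction
print(head(large).shape)  # torch.Size([1, 1000, 5, 5]): a dense grid of predictions
```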
(ii) YOLO: In 2015, J. Redmon et al. [20] came up with a real time one stage deep learning based detector called YOLO, shown in Fig. 3. YOLO perceives object detection as a regression problem and divides the entire image into a predefined grid of cells; each cell is then considered as a proposal for detecting the presence of objects. It is exceptionally fast: an extremely quick version of YOLO runs at 155 fps (frames per second) with an mAP of 52.7% on the VOC2007 dataset, whereas its enhanced versions run at 45 fps and achieve mAPs of 63.4% and 57.9% on VOC2007 and VOC2012 respectively. YOLO uses a single neural network for the entire image: the network partitions the input image into a grid of regions and forecasts class probabilities along with bounding boxes for each of these regions simultaneously. Later on, J. Redmon et al. came up with a chain of improvements on YOLO and released the YOLOv2 and YOLOv3 versions [46,82], which improve not only the detection accuracy but also the detection speed. Although YOLO made a remarkable improvement in detection speed, it still suffers from inaccurate localization when compared with state-of-the-art two stage detectors. Moreover, YOLO can recognize at most two objects at a given location, which makes it very difficult for it to detect small and crowded objects.

Fig. 3. YOLO framework for generic object detection.

(iii) SSD: To overcome the limitations of YOLO, Liu et al. proposed the single shot multibox detector in 2016 [47,48], shown in Fig. 4. The most important contribution of SSD is the use of multi-reference and multi-resolution techniques, which exceptionally improve the detection accuracy of a one stage detector, in particular for small objects. Like YOLO, SSD also partitions the input image into grid cells, but instead of recognizing objects from fixed grid cells as in YOLO, a set of anchors with varied scales and aspect ratios is generated at each cell to refine the bounding boxes (a sketch of such anchor generation follows below). SSD achieves a remarkable mAP of 76.8% and 74.9% on VOC2007 and VOC2012 respectively. Another key difference from earlier detectors is that SSD runs detection over feature maps from several different layers of the network, covering different scales, whereas earlier single stage detectors ran detection only on their deepest layer.

Fig. 4. SSD framework for generic object detection.
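A minimal sketch of anchor generation of the kind SSD (and later detectors) rely on; the scales, aspect ratios and stride are illustrative values:

```python
import numpy as np

def make_anchors(fmap_h, fmap_w, stride, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Centered (x1, y1, x2, y2) anchors at every feature-map cell."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)     # aspect ratio w/h = r
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(38, 38, stride=8).shape)  # (8664, 4) anchors for one feature map
```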

(iv) RetinaNet: In spite of their high detection speed and simplicity, single stage detectors are outperformed by two stage detectors in terms of detection accuracy. The reason behind this is the extreme foreground-background class imbalance; to overcome this problem, Lin et al. [83] came up with RetinaNet in 2017, shown in Fig. 5. In RetinaNet, a new loss function called "focal loss" is introduced which suppresses the gradients of easy negative samples instead of straightaway discarding them (a sketch follows below). The focal loss thus enables a single stage detector to achieve detection accuracy roughly equivalent to that of two stage detectors without compromising detection speed. RetinaNet achieves an mAP of 59.1% on the MSCOCO dataset.

Fig. 5. RetinaNet framework for generic object detection.
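A NumPy sketch of the focal loss for binary targets; gamma = 2 and alpha = 0.25 follow the commonly used defaults, and the code is an illustration rather than the reference implementation:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for 0/1 labels y.

    The (1 - p_t)^gamma factor down-weights easy, well-classified
    (mostly background) examples instead of discarding them."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -(alpha_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-8, 1.0)))

p = np.array([0.95, 0.95])  # same confidence: correct on the first, wrong on the second
y = np.array([1, 0])
print(focal_loss(p, y))     # easy positive ~ 3e-5; the hard negative dominates (~2.0)
```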
(v) CornerNet: Earlier single stage object detectors made use of manually designed anchor boxes for training. Later, a chain of anchor free object detectors was introduced, where the idea is to find important locations of the bounding box instead of fitting an object to an anchor. Following this idea, Law and Deng came up with an original anchor free setup called CornerNet [84], shown in Fig. 6, wherein an object is detected as a pair of corners. At each location of the generated feature map, category specific heatmaps, corner offsets and pair embeddings are predicted: the heatmaps give the probability of each location being a corner, the corner offsets refine the corner locations, and the pair embeddings group together corners belonging to the same object. With this design, CornerNet achieved significant performance on the MSCOCO dataset [85,86].

Fig. 6. Cornernet framework for generic object detection.
2.2.2. Two stage detectors:
Two stage detectors divide the detection task into two phases: (i) generation of proposals, and (ii) making class predictions from these generated proposals. The idea of the first stage is to find regions with high recall, such that every object in the image falls into at least one of the proposed regions. In the second stage, convolutional neural network based models categorize these proposals into either an object from the predefined classes or background. Let us now review some prominent state-of-the-art two stage object detectors.

(i) R-CNN: This is a region based object detector proposed by Girshick et al. [7] in 2014. R. Girshick was the first to show that a CNN based model could achieve considerably better object detection performance on the publicly available PASCAL visual object class datasets [33] than the traditional models based on simple HOG features. R-CNN drastically improved detection performance and achieved 53.7% mAP (mean average precision) on the PASCAL VOC dataset. A simple R-CNN detector, shown in Fig. 7, consists of three components: (i) generation of proposals; (ii) feature extraction from the proposals; (iii) categorization of the regions into predefined object classes. Using the selective search technique [87], R-CNN produces a sparse set of about 2000 proposals per input image and discards those which can easily be identified as background regions. Each proposal is cropped and warped into a fixed size region, which is then used to generate a feature vector (4096 dimensional) with the help of a deep convolutional neural network trained on ImageNet; this transfer of knowledge from the ImageNet dataset offers a considerable performance gain. Finally, linear SVM classifiers are brought into play to predict the presence of an object and to identify the predefined object categories [71]. In addition, the last fully connected layer of the underlying CNN is re-initialized for the detection task. Moreover, R-CNN discards a large number of easy negatives before training begins, which helps improve the pace of learning and mitigate the number of false positives. Although R-CNN made remarkable progress, its shortcomings are also apparent: it is extremely time consuming for both training and testing because of superfluous feature computations on highly overlapping regions, leading to an extremely slow detection speed of about 14 s per image even with a GPU.

Fig. 7. R-CNN framework for generic object detection.
(ii) SPP-net: In 2014, He et al., inspired by the idea of spatial pyramid matching [88], proposed SPP-net [89], shown in Fig. 8, to overcome the shortcomings associated with R-CNN. The significant contribution of SPP-net is the introduction of the spatial pyramid pooling layer, which lets a CNN produce a fixed length feature vector representation regardless of the size of the image region of interest, without rescaling it. SPP divides the feature map into an M x M grid for several values of M and performs pooling on each cell of the grid to produce a feature vector; the vectors obtained from all cells are then concatenated and fed to the region-wise SVM classifiers and bounding box regressors (a sketch follows below). SPP-net is about 20 times faster than R-CNN without compromising detection accuracy, and achieved a mean average precision (mAP) of 59.2% on VOC2007. Although it efficiently improved detection speed and accuracy, some shortcomings remain: first, training is multiphase in nature; second, SPP-net fine-tunes only its fully connected layers for the detection task and simply ignores all the previous layers.

Fig. 8. SPP-net framework for generic object detection.
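The spatial pyramid pooling layer can be sketched in NumPy: the feature map is max-pooled into M x M grids for several values of M and the results are concatenated, so the output length depends only on the pyramid, not on the input size. Pyramid levels {1, 2, 4} are an illustrative choice:

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) map into a fixed-length vector, whatever H and W are."""
    c, h, w = feature_map.shape
    outputs = []
    for m in levels:
        ys = np.linspace(0, h, m + 1).astype(int)  # m x m grid of (ragged) cells
        xs = np.linspace(0, w, m + 1).astype(int)
        for i in range(m):
            for j in range(m):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                outputs.append(cell.max(axis=(1, 2)))  # one C-vector per cell
    return np.concatenate(outputs)

print(spp(np.random.rand(256, 13, 13)).shape)  # (5376,) = 256 * (1 + 4 + 16)
print(spp(np.random.rand(256, 20, 31)).shape)  # (5376,) again: size-independent
```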

(iii) Fast R-CNN: To further improve upon R-CNN and SPP-net [7,79,89], R. Girshick came up with Fast R-CNN in 2015 [19], shown in Fig. 9. It is a multitask learning detector which, like SPP-net, extracts fixed length feature vectors from the shared feature map; the difference is that Fast R-CNN uses an ROI pooling layer to extract region features. ROI pooling is a special case of SPP which uses only a single scale to divide the underlying proposal into a predetermined number of partitions, and it also back propagates the error signals to the convolution kernels (a sketch follows below). The extracted features are fed into a stack of fully connected layers, followed by a classification layer and a regression layer: the former generates softmax probabilities over C + 1 classes, while the latter outputs the coordinates used to refine the bounding boxes. Although Fast R-CNN successfully combines the benefits of both R-CNN and SPP-net, its detection speed is still limited by a number of factors.

Fig. 9. Fast R-CNN framework for generic object detection.
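ROI pooling is the single-level special case: each proposal is divided into a fixed grid over the shared feature map and max-pooled per cell. A minimal sketch, with the 7 x 7 output grid commonly used in Fast R-CNN and RoI coordinates assumed to already be in feature-map cells:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Max-pool the region roi = (x1, y1, x2, y2) of a (C, H, W) map to out_size^2."""
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    ys = np.linspace(y1, y2, out_size + 1).astype(int)
    xs = np.linspace(x1, x2, out_size + 1).astype(int)
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # The max() guards keep cells non-empty for very small proposals.
            cell = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out  # same (C, 7, 7) shape for every proposal size

fmap = np.random.rand(256, 50, 60)
print(roi_pool(fmap, (5, 10, 35, 48)).shape)  # (256, 7, 7)
```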
(iv) Faster R-CNN: Although much progress had been made, proposal generation still relied on conventional techniques like selective search [87] and edge boxes [90], which utilize low level visual cues and are hard to learn in a data driven way. To address this issue, S. Ren et al. came up with Faster R-CNN [21] in 2015, shortly after the release of Fast R-CNN; it is shown in Fig. 10. Faster R-CNN makes use of a fully convolutional network called the RPN (region proposal network), which operates on images of arbitrary size and creates a set of proposals at each feature map location (a sketch of the RPN head follows below). The feature vectors obtained from the derived feature map are then fed into a classification layer and a bounding box regression layer for object localization, which finally leads to object detection. Faster R-CNN can be considered the first end-to-end, near real time deep learning detection model; it achieved an mAP of 73.2% on VOC2007 and 70.4% on VOC2012. Although Faster R-CNN outperforms Fast R-CNN in terms of detection speed, some computational redundancy remains at the subsequent detection stage.

Fig. 10. Faster R-CNN framework for generic object detection.
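The RPN head itself is tiny: a 3 x 3 convolution followed by two sibling 1 x 1 convolutions, one scoring objectness and one regressing box offsets for k anchors per location. The PyTorch sketch below is schematic and uses one sigmoid-style score per anchor, whereas the original formulation uses a two-way softmax (2k scores):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):        # k anchors per location
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.objectness = nn.Conv2d(512, k, 1)       # object vs background score
        self.box_deltas = nn.Conv2d(512, 4 * k, 1)   # (dx, dy, dw, dh) per anchor

    def forward(self, features):
        t = torch.relu(self.conv(features))
        return self.objectness(t), self.box_deltas(t)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # [1, 9, 38, 50] and [1, 36, 38, 50]
```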
(v) R-FCN: The problem with Faster R-CNN is that, after computing the feature map of the input image, it derives region based feature vectors whose computation is not shared across the different regions of the image. This extra computation can be exceptionally heavy, because each input image may have hundreds of proposals. To remove this additional cost, Dai et al. [91] proposed the region based fully convolutional network (R-FCN), shown in Fig. 11, which shares almost all computation in the region classification step. R-FCN achieves better performance than Faster R-CNN without operating region based fully connected layers. This is because R-FCN creates position sensitive score maps which encode relative location information for the different categories, and uses a position sensitive ROI pooling layer that derives spatially aware region features by encoding each relative location of the target region.

Fig. 11. R-FCN framework for generic object detection.

(vi) Feature Pyramid Networks (FPN): In deep convolutional neural networks, features from deep layers are semantically strong but spatially weak, whereas features from shallow layers are semantically weak but spatially strong. Lin et al. [48] exploited this property and proposed FPN [48], wherein deep layer features are integrated with shallow ones to enable detection of objects at different scales. The underlying idea is to reinforce the spatially sharp shallow features with the semantically rich features of the deeper layers (a sketch of this top-down fusion follows at the end of this subsection). Using FPN inside a conventional Faster R-CNN model achieves better detection results, with an mAP of 59.1% on the COCO dataset. It should further be noted that FPN has now become an essential building block of the latest detectors.

(vii) Mask R-CNN: Mask R-CNN [8,92] was proposed by He et al. as an extension of Faster R-CNN, purposely for instance segmentation; it predicts bounding boxes and segmentation masks in parallel from the underlying proposals and achieved state-of-the-art results.
(vi) Feature Pyramid Networks (FPN): In deep convolutional proper categorization and localization of target objects.
neural networks features belonging to deep layers are se- This added relation unit supplies superior features to re-
mantically strong but spatially weak whereas features be- gressor and classifier, but substitute NMS post processing
longing to shallow layers are sematically weak but spatially step thereby increasing the overall detection accuracy.
strong. Lin et al. [48] considered this property and pro- (ii) Deformable Convolutional Networks: Regular convolutional
posed FPN [48] wherein deep layer features are integrated neural networks consider features of predefined square
with the shallow ones to enable detection of objects at size kernels and due to this the amenable region field
So, to overcome this problem and to adapt to the geometric variations reflected in the useful spatial support regions of the target objects, DCN (Deformable Convolutional Networks) was proposed by Dai et al. [94], in which deformable kernels, along with box offsets from the initial convolutional anchors, can be learned and used to adapt to the part locations of target objects with different shapes and sizes. In addition to all this, DCN achieves a detection accuracy almost 4% higher than traditional ConvNets and reaches an mAP of 37.5% on MSCOCO.
(iii) NAS-FPN: In recent years, Google Brain researchers have taken up neural architecture search to discover a new feature pyramid architecture called NAS-FPN [95], which consists of both bottom-up and top-down connections to fuse features across a number of different scales. During the search, the underlying FPN architecture is repeated N times and the copies are concatenated together, thereby helping deeper feature layers to pick arbitrary features to emulate. Stacking more pyramid networks, adding feature dimensions and adopting large-capacity architectures all help in increasing the detection accuracy. Experiments have shown that the mAP of NAS-FPN supersedes that of the original FPN by 2.9% when ResNet-50 with 256-dimensional features is used as the backbone network on the COCO test-dev dataset. In conclusion, we can say that these baselines improve the detection accuracy by extracting rich features of target objects and adopting multi-level and multi-scale features for different-sized object detection.
which generate proposals based on already defined anchor points.
3. Proposal generation and feature representation learning for object detection:

3.1. Proposal generation:

Proposal generation, which creates a set of rectangular bounding boxes surrounding potential target objects, plays a very significant role in the object detection and recognition framework. The rectangular boxes thus generated are then used for classification and localization refinement. Proposal generation methods can be categorized into three different groups: traditional computer vision techniques, key-point based methods and anchor based techniques. It should be noted that both single-stage and double-stage object detectors make use of proposal generation techniques; the important difference is that a single-stage detector considers each region in the input image as a potential target proposal and thereby estimates the object class and bounding box coordinates, whereas in a double-stage detector a sparse set of proposals separating foreground from background is generated. Let us now briefly review the different proposal generation techniques.

3.1.1. Traditional proposal generation techniques:

These traditional methods make use of edge, color and corner information for generating proposals and can be divided into three principal categories: (i) calculating the objectness score of candidate rectangular boxes, (ii) assimilating superpixels from the original image, and (iii) generating many foreground and background segments.

(i) Objectness Score based Methods: These methods determine the score associated with each of the candidate proposals, a parameter used to determine the presence of an object inside the bounding box. Arbelaez et al. [96] used visual measures like saliency and density of edges to assign the objectness scores. Later, Rahtu et al. [97] introduced a highly effective cascaded learning based approach to assign objectness scores to candidate proposals.
(ii) Assimilation of Superpixels: This method of proposal generation merges superpixels generated from the image segmentation results. One such technique based on this concept is selective search [87], where multiple hierarchical segments are combined together using their visual factors, followed by placing bounding boxes on the merged segments. Another similar idea was proposed by Manen et al. [98], where the merging process was highly randomized.
(iii) Seed Segmentation based Proposal Generation: Here, multiple foreground and background segments are generated, and to avoid the build-up of hierarchical segments, CPMC [99] produced a number of overlapping regions initiated with different seed values. The ideas of selective search [87] and CPMC [99] were combined by Endres and Hoiem [100] to merge the generated superpixels with newly designed features, which are then used as seeds for generating larger segments.

The most important advantage of traditional proposal generation methods is that they are straightforward and can produce proposals with a high recall value; still, they cannot be jointly optimized with the entire object detection pipeline, because they are based on low-level visual features like color and edges.

3.1.2. Anchor based proposal generation methods:

This is a large family of supervised proposal generators which produce proposals based on predefined anchor points. For generating proposals in a supervised manner on top of deep convolutional feature maps, Ren et al. [21] came up with the region proposal network (RPN), where the network slides over the entire feature map using a 3 × 3 convolution filter. For each location, ``k'' anchors of different sizes and aspect ratios are taken into consideration; these different sizes and aspect ratios help in matching objects at varying scales across the entire image. Depending upon the available ground truth, the object locations are matched with the most suitable anchors, and a 256-dimensional feature vector is generated for each anchor, which is fed into classification and regression layers. Based on the ground truth, each anchor is predicted to be either an object or simply background, as shown in Fig. 12.

Fig. 12. Diagram of RPN. Here each feature of the feature map connects with a sliding window followed by two branches.
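To make the anchoring scheme concrete, the following minimal sketch enumerates RPN-style anchors for every feature map location. It is an illustration under simplified assumptions rather than the implementation of [21]; the stride, scales and aspect ratios are hypothetical values.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) boxes."""
    base = []
    for s in scales:
        for r in ratios:
            # Keep the anchor area near s*s while varying the ratio r = h/w.
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                # (k, 4)
    # Centers of every feature map cell, mapped back to image coordinates.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (centers + base).reshape(-1, 4)

anchors = generate_anchors(38, 50)   # e.g. a 600x800 image with stride 16
print(anchors.shape)                 # (38 * 50 * 9, 4) = (17100, 4)
```

Each of the k = 9 anchors per location would then be classified as object vs. background and regressed toward the matched ground truth box.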
Despite showing remarkable performance, the anchors are designed manually using multiple scales and aspect ratios in a heuristic manner. These designs may not be optimal, and hence different design strategies are required to build them. In recent years, many approaches have been proposed to improve the overall design choice of the anchors. Zhu et al. [101] proposed an anchor design technique for matching small objects by enlarging the input image while reducing the stride rate. Xie et al. [102] came up with a dimension-decomposition region proposal network, which decomposes the dimensions of anchor boxes based on the RPN, thereby helping to match objects with large scale variance while reducing the search space.

3.1.3. Keypoint based proposal generation methods:

These techniques are broadly divided into corner-based techniques and center-based methods. In corner-based techniques, bounding boxes are generated by merging together a pair of corners learned from the feature map. DeNet [103] modeled the distribution of corner points on the feature map and applied a Bayesian classifier to each corner of the object to generate the confidence score of the bounding box. With these corner-based methods, the design of anchor points is eliminated, which makes them one of the most effective ways to generate high quality proposals. Building upon DeNet [103], Law and Deng proposed a framework called CornerNet [84], which directly models the categorical information on the corner points.
Coming to center-based proposal generation methods, the probability of a point being the center of an object is predicted at each possible location of the feature map, and the width and height of the box are directly regressed without any prior knowledge of the anchors. Zhu et al. [104] proposed a feature-selective anchor-free model which can be inserted into a single-stage object detector with the FPN structure; in this model, an online feature selection module is used to train multi-level center-based branches at each level of the feature pyramid. Zhou et al. [85] developed a new center-based model built on a single hourglass network [84] without an FPN module and used it for 3-dimensional detection and human pose recognition. Duan et al. [105] proposed a model called CenterNet, which is based on both center-based and corner-based methods and achieved a remarkable improvement when compared with other baseline models. All these anchor-free techniques for proposal generation present a prominent future research direction.

3.2. Feature representation learning for object detection:

This is the most significant component of the entire object detection framework. Due to the presence of target objects in difficult environments and with different aspect ratios and scale variance, there is an immediate need to train a vigorous and discriminative feature embedding of objects so as to obtain significant detection results. Here we present some feature representation learning strategies for the detection and recognition of objects. Specifically, we have identified three feature representation learning strategies: multi-scale feature representation learning, deformable feature representation learning and contextual reasoning.

3.2.1. Multi-scale feature representation learning:

We know that many object detection techniques based on deep convolutional neural networks, like Fast R-CNN [106] and Faster R-CNN [21], make use of only a single-layer feature map for detection, but detecting objects across multiple scales and aspect ratios is challenging using a single feature map. Deep convolutional neural networks learn hierarchical features in different layers which capture information at varying scales. As we have already seen, for detecting tiny objects, features with rich spatial information, narrow receptive fields and high resolution can be extracted from the shallow layers, whereas features present in the deep layers, having large receptive fields and semantically rich information, are more suitable for detecting large objects. Some techniques like dilated/atrous convolutions [91,94] were put forward to avoid downsampling, which is one of the issues faced by researchers while detecting small objects using deep layer features. At the very same time, detection of larger objects from shallow layer features is also not optimal without large receptive fields. For all these reasons, the feature scaling issue has become a fundamental object detection problem. Four major paradigms deal with the multi-scale feature learning problem: image pyramid, prediction pyramid, integrated features and feature pyramid. These are briefly discussed below:

(i) Image Pyramid: One of the innate ideas to deal with the scale feature learning problem is the image pyramid, wherein input images are resized into multiple scales which are then used to train multiple object detectors, each responsible for detecting a particular range of scales. Similarly, during the testing phase, images are again resized into different scales and passed through the corresponding detectors, and the detection results thus obtained are merged together; this whole process can become computationally very expensive. Singh et al. [107] carried out all-inclusive experiments on small scale object detection and argued that learning a single-scale detector to detect objects of all scales is a more difficult task than learning scale-dependent detectors with image pyramids. In their research they proposed a model called scale normalization for image pyramids (SNIP) [107], where many scale-dependent object detectors are used, each responsible for detecting objects of a certain scale range.
(ii) Integrated Features: This is another multi-scale feature learning technique wherein a single feature map is constructed using features from multiple layers and is then used to make the predictions. By combining semantically rich deep layer and spatially rich shallow layer features, the newly obtained feature map holds highly rich information which can be used to detect objects at varying scales.
(iii) Prediction Pyramid: Coarse and fine-grained features from multiple layers are combined in the framework called SSD proposed by Liu et al. [47]. In SSD, object predictions are made using multiple layers, where each such layer is dedicated to detecting objects of a certain scale. Later, many models were proposed based on this principle [108–110].
Multi-scale deep convolutional neural networks [108] used deconvolutional layers on different feature maps to improve their resolutions, and these improved feature maps are later used for making the final predictions. The Receptive Field Block net was proposed by Liu et al. [110] for enriching the robustness and the receptive fields using a receptive field block, which captures multi-scale features using multiple branches with different convolution kernels that are then fused together.
(iv) Feature Pyramid: The feature pyramid network was proposed by Lin et al. [48]; it combines the advantages of both integrated features and the prediction pyramid. Here, features of different scales with lateral connections are combined in a top-down fashion to build a group of scale-invariant feature maps, which are then employed to train scale-dependent classifiers. In particular, semantically rich deep features are used to reinforce the spatially rich shallow features. These top-down features are then combined using element-wise summation or concatenation, along with small convolutions for reducing the overall dimensionality. A significant amount of improvement has been witnessed with FPN for object detection along with other applications, and it has achieved state-of-the-art results in learning multi-scale features. Many different variants of FPN with modified feature pyramid blocks were later introduced [111–117]. The top-down merging step is sketched below.
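The following minimal PyTorch sketch illustrates the top-down pathway with lateral connections in the spirit of FPN [48]; it is a simplified illustration, not the reference implementation, and the channel widths and map sizes are hypothetical values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Top-down pathway with lateral connections, after the idea of FPN [48]."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every backbone stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels)

    def forward(self, feats):  # feats ordered fine (shallow) to coarse (deep)
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        out = [laterals[-1]]   # start from the semantically strongest map
        for lat in reversed(laterals[:-1]):
            # Upsample the coarser map and merge by element-wise summation.
            top = F.interpolate(out[0], size=lat.shape[-2:], mode="nearest")
            out.insert(0, lat + top)
        return [s(o) for s, o in zip(self.smooth, out)]

feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
pyramid = FPNTopDown()(feats)
print([p.shape[-1] for p in pyramid])   # [64, 32, 16, 8], all with 256 channels
```

Each level of the resulting pyramid can then be fed to a scale-dependent classification/regression head.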
3.2.2. Contextual reasoning:

In the field of object detection, contextual information plays a significant role. Very often, objects are likely to appear in specific environments, and sometimes they coexist with other objects. For detecting objects with insufficient features, contextual information can effectively help in improving the detection performance. By capturing the association between objects and their surroundings, context can improve the ability of the detector to understand the scene. As far as traditional object detection techniques are concerned, several efforts exploring context have been made [118].
3.2.3. Deformable feature learning:

A good object detector should be robust to non-rigid deformations of objects. Just before the deep learning era, DPM [119] had been successfully employed for object detection and recognition. For enabling deep learning based object detectors to model the deformation of objects, many detection frameworks have been proposed which can directly model object parts [94,120–122]. Dai et al. [94] and Zhu et al. [121] proposed deformable convolutional layers that can automatically learn the supporting location offsets to embed information sampled at the standard sampling positions of the feature map.
Next we will discuss the backbone architectures of object detectors.

4. Backbone architecture for object detection:

Generally, backbone networks are employed to extract the features for the object detection task, where an image is provided as input and a feature map is produced as output. The majority of backbone architectures used for detection are basically networks used for carrying out the classification task, with the last fully connected layers discarded. Highly efficient versions of these basic classification networks are also available; Lin et al. [48] replaced some of the intermediate layers with specially designed ones to better meet specific requirements [20,123]. For meeting different requirements emphasizing detection accuracy vs. localization efficiency, researchers are free to make use of much deeper and densely connected backbone architectures like ResNeXt [124], AmoebaNet [95] and ResNet [8]. Apart from all these, there are various lightweight backbone networks which, when applied to mobile devices, can meet specific requirements; some prominent lightweight networks like MobileNet [125], ShuffleNet [126] and Xception [127] are highly popular. Wang et al. [128] combined PeleeNet with SSD [129] and came out with a state-of-the-art real-time object detection system with fast processing speed. Applications requiring high precision and accuracy need highly complex backbone networks, whereas well designed architectures that can trade off speed against accuracy are required to meet the real-time demands of applications involving videos and web cameras. For exploring highly competitive detection accuracy, shallower and sparsely connected backbone networks are replaced by much deeper and densely connected counterparts; as an example, He et al. [8] preferred ResNet [5] over VGG [130] to extract rich features for gaining high detection accuracy. As an efficient way to further enhance network performance, high-performance classification setups which can improve detection precision and reduce the complexity of the object detection task are under consideration. With the immense advancements in deep learning techniques and the unremitting improvement of computation power, enormous progress has been made in the field of object detection. R-CNN [6] successfully demonstrated that adopting convolutional weights from already trained models can provide rich semantic information, not only to train the detectors but also to enhance the detection performance. In the last few years, this strategy has been widely accepted and has now become the norm for most object detectors. In this section we will review some basic architectures being extensively used for detection, after providing a basic introduction to deep convolutional neural networks.

4.1. Basic architecture of a CNN:

For visual understanding and machine vision, deep CNNs have proven exceptionally useful [36,131]. A typical deep convolutional neural network, as shown in Fig. 13, generally comprises a series of convolutional layers, pooling layers and non-linear activation layers, and finally a bunch of fully connected layers. A sample image is fed as input to the convolutional layer, which generates a feature map from it through convolution operations performed by n × n kernels. The obtained feature map can be considered a multi-channel sequence, where each channel holds different information related to the input image. Each pixel present in the feature map is represented with the help of a neuron, which in turn interacts with a group of adjacent neurons from the previous feature map over a region known as its receptive field. Just after generating the feature maps, a non-linear activation function is applied, followed by a pooling layer (max or average pooling) to widen the receptive field as well as to mitigate the cost involved in computation. Using a sequence of convolutional layers, pooling layers and non-linear activation layers, a deep convolutional neural network is formed. Optimization algorithms like stochastic gradient descent [132] and Adam [133] are employed to minimize a predefined loss function and train the entire network in an end-to-end manner. AlexNet [36] is one such example of a deep CNN; it comprises five convolutional layers, 3 max pooling layers and 3 fully connected layers, and each of its convolutional layers is followed by a ReLU [134] non-linear activation layer.

Fig. 13. Basic CNN Architecture.
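To ground this description, the following toy model mirrors the conv → ReLU → pool pattern just described; the layer sizes are illustrative and are not those of AlexNet or any other published architecture.

```python
import torch
import torch.nn as nn

# A minimal CNN sketch: convolution, non-linear activation, pooling,
# and a fully connected classification head, as described above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # widens the receptive field, halves H and W
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 10),     # fully connected classification head
)
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)                # torch.Size([1, 10])
```

Such a network would be optimized end-to-end with SGD or Adam against a classification loss.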

Table 3
Details of prominent generic object detection architectures.
Architecture Proposal generation technique Learning strategy Loss function Softmax layer End-to-end training Multi-scale input Platform
RCNN [7] Selective search SGD, BP Hinge Loss Yes No No Caffe
Fast-RCNN [19] Selective search SGD Hinge Loss Yes No No Caffe
Faster-RCNN [21] Region Proposal Network SGD Class Log Loss Yes Yes Yes Caffe
SPP-Net [89] Edge Boxes SGD Hinge Loss Yes No Yes Caffe
R-FCN [91] Region Proposal Network SGD Class Log Loss No Yes Yes Caffe
Mask R-CNN [8] Region Proposal Network SGD Class Log Loss and Semantic Sigmoid Loss Yes Yes Yes TensorFlow/Keras
FPN [48] Region Proposal Network Synchronized SGD Class Log Loss Yes Yes Yes TensorFlow
SSD [47] Region Proposal Network SGD Class Softmax Loss No Yes No Caffe
YOLO [20] Region Proposal Network SGD Class sum squared error loss Yes Yes No DarkNet

4.2. CNN backbone for object detection:

Here we quickly review some common CNN backbone architectures such as ResNet [5,91], VGG16 [19,21], ResNeXt [83] and Hourglass [85], which are broadly used for object detection. An overview of prominent generic object detection architectures, as mentioned in [58], is provided in Table 3. VGG16 [130], which was built up based on AlexNet, comprises 5 clusters of convolutional layers and 3 fully connected layers. The first two clusters consist of two convolutional layers each, followed by the next 3 clusters comprising three convolutional layers each; between clusters, a max pooling layer is introduced to diminish the spatial dimensionality. The VGG16 architecture clearly highlighted that by increasing the depth of the network, i.e., stacking together many convolutional layers, we can increase the expressive potential of the model, which further leads to better performance. However, this strategy of increasing the number of convolutional layers suffers from a serious drawback: it makes optimizing the entire network in an end-to-end manner difficult. Keeping this observation in mind, He et al. [5] came up with ResNet [91], which mitigated the drawbacks associated with optimization by establishing shortcut connections through which a layer can omit the non-linear transformation and straightaway pass its values on to the next layer, as represented by the following equation:

$y_{k+1} = y_k + t_{k+1}(y_k, \theta)$    (3)

Here $y_k$ is the input feature present in the $k$th layer and $t_{k+1}$ represents the set of operations (normalization, non-linear transformation, convolution, activation, etc.) carried out on the input feature $y_k$; $t_{k+1}(y_k, \theta)$ is a residual function of the input feature $y_k$. Using this equation, any feature map can be regarded as the sum of the activations of the previous layer and a residual function, which considerably reduces the difficulties involved during the training process. Gradients from the deeper layers can easily be propagated back to the shallow layers through the shortcut connections; using a training setup involving such ``residual function blocks'', the depth of the model could be increased from 16 up to 152, thus allowing us to train high-capacity network models.
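A minimal residual block implementing Eq. (3) is sketched below; the two-convolution form of $t_{k+1}$ and the channel count are illustrative assumptions, not a prescription from [5].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y_{k+1} = y_k + t_{k+1}(y_k, theta); here t is two 3x3 conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.t = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, y):
        # The shortcut lets gradients flow directly to shallower layers.
        return self.relu(y + self.t(y))

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```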
In the following year, He et al. [135] came up with a pre-activation alternative of ResNet known as ResNetv2; the experiments carried out clearly showed that ResNetv2 can achieve better performance if batch normalization is applied properly [136], making it possible to successfully train models with more than 1000 layers. Huang et al. concluded that even though ResNet has overcome the shortcomings associated with training through its shortcut connections, it does not make use of all the features produced by the preceding layers: since features from the shallow layers go missing, they cannot be directly used in the element-wise operations. Hence they proposed DenseNet [137], which not only preserves the features from the shallow layers but also improves the flow of information by concatenating the inputs with the residual product instead of adding them element-wise:

$y_{k+1} = y_k \circ t_{k+1}(y_k, \theta)$    (4)

Here $\circ$ denotes the concatenation operation. DenseNet [137], too, suffers from various shortcomings: Chen et al. [138] concluded that there is a lot of duplication among the extracted features and that the computation cost involved is also very high. In the following years, the advantages of both ResNet and DenseNet were combined and a new model called the Dual Path Network (DPN) was proposed, wherein each individual channel $y_k$ is divided into two segments $y^d_k$ and $y^r_k$; here $y^d_k$ is used to perform the dense-connection computation, while element-wise summation is carried out using $y^r_k$, with non-shared residual learning stems $t^d_{k+1}$ and $t^r_{k+1}$. The final result thus produced is basically the concatenation of these two branches:

$y_{k+1} = \left(y^r_k + t^r_{k+1}(y^r_k, \theta^r)\right) \circ \left(y^d_k \circ t^d_{k+1}(y^d_k, \theta^d)\right)$    (5)

Xie et al. [124] came up with ResNeXt, based on the ResNet architecture, which not only mitigates the computation and memory cost but also maintains the classification accuracy. ResNeXt adopted group convolution layers [36], which sparsely connect the feature map channels to capture rich semantic features, thereby improving the backbone accuracy while keeping the computation cost consistent with traditional ResNet.
After some time, Howard et al. [125] came out with the strategy of setting the number of convolution groups equal to the number of feature channels for each feature map (depthwise convolutions) and hence proposed a new architecture called MobileNet, which not only lessens the cost of computation but also maintains the categorization accuracy without any considerable loss; it should also be noted that this model was designed for use on mobile platforms. We have already witnessed the benefits of increasing the model depth; in addition, some works discovered benefits for the learning capability from increasing the width of the model. One such work was proposed by Szegedy et al., called GoogleNet, which consists of an inception unit [139] wherein many convolutional kernels of varying shapes and sizes are applied to a common feature map in a particular layer, extracting multi-scale features which are then summed up as the output feature map. An advanced version of this model was later developed by introducing residual units [140].

The popular backbone architectures that we have just reviewed were all designed for the purpose of image classification and have been trained on ImageNet, from which they can be adapted to object detection applications. But it should be noted that these pre-trained models cannot be used for object detection directly, because of the highly probable divergence between the classification and detection tasks. This divergence occurs because classification favors large receptive fields and tolerates low spatial resolution, while detection requires high-resolution feature maps with smaller receptive fields to localize objects precisely. Moreover, classification makes its prediction from a single feature map, whereas detection needs feature maps with multiple representations for detecting multi-scale objects. Keeping the above points in mind, Li et al. proposed DetNet [123], developed specifically for detection purposes. Using dilated convolutions to increase the receptive fields, DetNet [123] preserves high-resolution feature maps for prediction; moreover, extracting multi-scale feature maps helps DetNet detect objects of varying shapes easily. Another architecture which was not typically designed for classification purposes is the Hourglass Network [141]. Hourglass, a fully convolutional setup with a series of hourglass modules, originally emerged in the human pose recognition problem [141]. The beauty of the hourglass module is that it first downsamples the given input image using a series of convolutional layers and pooling operations and then upsamples the generated feature maps using deconvolutional operations; skip connections are again used to avoid loss of information during downsampling. Since it can capture both local and global information, it is widely used in state-of-the-art object detection systems [84–86]. In the next section we will review some learning strategies used for object detection.

5. Learning strategies for object detection

The localization and classification tasks involved in object detection entail a joint optimization, which makes it hard to train vigorous detectors. Apart from that, there are a number of concerns, such as localization quality and imbalanced sampling, which need to be addressed. So there arises a strong need to design and develop groundbreaking learning strategies for training capable and competent object detectors. In this section, a comprehensive review of state-of-the-art learning strategies for object detection is provided. Learning strategies for training object detectors differ from their counterparts used during the testing phase, so we review them separately.

5.1. Learning strategies for training:

Here we review some learning strategies extensively used for training.

(i) Data Intensification: This is one of the most important learning strategies for almost all deep learning systems, since they are data-hungry and large amounts of training data lead to better results. In object detection, for generating training patches with a large number of visual features and for increasing the training data, horizontal flips are used during the training of the Faster R-CNN object detector [19]. In one-stage detectors, a more intense data intensification strategy (including expansion, color jittering, rotation, cropping, etc.) [47,108,142] is used for improving the detection accuracy; a minimal sketch of a flip-based augmentation follows below.
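The sketch below shows the key subtlety of detection-time augmentation, namely that the bounding boxes must be transformed together with the image; it is an illustration under simplified assumptions, with arbitrary example values, not a reproduction of any particular detector's pipeline.

```python
import torch

def horizontal_flip(image, boxes):
    """Flip a CHW image tensor and its (N, 4) x1,y1,x2,y2 boxes left-right."""
    _, _, width = image.shape
    flipped = torch.flip(image, dims=[2])   # mirror along the x axis
    x1 = width - boxes[:, 2]                # old right edge -> new left edge
    x2 = width - boxes[:, 0]                # old left edge  -> new right edge
    new_boxes = torch.stack([x1, boxes[:, 1], x2, boxes[:, 3]], dim=1)
    return flipped, new_boxes

img = torch.rand(3, 300, 300)
boxes = torch.tensor([[20.0, 40.0, 120.0, 200.0]])
aug_img, aug_boxes = horizontal_flip(img, boxes)
print(aug_boxes)   # tensor([[180.,  40., 280., 200.]])
```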
(ii) Refining Localization: The expected outcome of an object detector is a tight bounding box or a precise mask for each object in the image. To achieve this, many measures have been proposed which refine the bounding box coordinates and help in improving the localization accuracy. Exact localization is troublesome because proposal predictions are normally based on the most discriminative object parts and not on the region encompassing the whole object. In some scenarios, detection algorithms are required to make high quality predictions (with a high IoU threshold value); see Fig. 14 for an illustration of how a detector may fail in a high IoU threshold setup.

Fig. 14. Failure case of detection at a high IoU threshold. The red box is the ground truth and the yellow box is the prediction.

A common approach for refining localization is to provide high quality region proposals. Here we review other methods for refining the localization. In Fast R-CNN, smooth L1 regressors are learned end-to-end, as shown below:

$L_{reg}(p^c, g) = \sum_{i \in \{x,y,w,h\}} \mathrm{L1}_{smooth}(p^c_i - g_i)$    (6)

$\mathrm{L1}_{smooth}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$    (7)

Here the prediction offset is denoted by $p^c = (p^c_x, p^c_y, p^c_w, p^c_h)$ for each of the predefined target classes, ``g'' denotes the ground truth bounding box $g = (g_x, g_y, g_w, g_h)$, and x, y, w, h represent the center coordinates, width and height of the bounding box, respectively.
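A minimal sketch of Eqs. (6)-(7) is given below; the beta parameter generalizes the fixed transition point of Eq. (7) (beta = 1 reproduces it), and the example offsets are arbitrary values for illustration.

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Eqs. (6)-(7): 0.5*x^2/beta if |x| < beta, else |x| - 0.5*beta."""
    diff = torch.abs(pred - target)
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).sum()

p = torch.tensor([0.2, 0.8, 1.5, -2.0])   # predicted offsets (x, y, w, h)
g = torch.tensor([0.0, 1.0, 1.0, 0.0])    # ground-truth offsets
print(smooth_l1(p, g))                    # 0.02 + 0.02 + 0.125 + 1.5 = 1.665
```

The quadratic region keeps gradients small for nearly correct offsets, while the linear region prevents outliers from dominating the loss.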

In the R-CNN architecture, L2 auxiliary box regressors are used for localization refinement. Apart from these, some techniques even use supplementary setups to further process the localization. A looping bounding box regression technique was proposed by Gidaris et al. [143], where an R-CNN backbone was used to further process the predicted offsets multiple times. Some years later, another approach called LocNet was proposed by Gidaris et al. [144], which focused on the distribution of bounding box coordinates for refining the predictions. Zagoruyko et al. [145] introduced a multipath network wherein a cascade of classifiers was developed and later optimized using a predefined integral loss based on various quality parameters; each ensemble classifier was individually optimized with a predefined IoU threshold value, and the final prediction is a combination of the outputs of these classifiers. Tychsen-Smith et al. derived a bounding box regression loss over a group of IoU values so as to maximize the IoU of the prediction offsets with the objects; they also proposed Fitness-NMS [146], which learned a fresh fitness score function of the IoU between objects and their proposals. Taking inspiration from CornerNet [84] and DeNet [103], Lu et al. [147] came up with Grid R-CNN, which substitutes the bounding box regressor with a keypoint-corner based system.
(iii) Learning by Cascades: This is a coarse-to-fine learning strategy which gathers information from the output of given classifiers and develops a stronger one in a cascaded manner. Coming to deep learning based detection mechanisms, CRAFT (Cascade Region Proposal Network and Fast-RCNN) was proposed by Yang et al. [148], which employs cascade learning techniques to learn the RPN and region proposals. CRAFT learns a regular RPN followed by a two-class Fast R-CNN which discards the greater part of the easy negatives; the remaining samples are then used to build the cascade region classifier. For classifying multi-scale objects into one of the predefined target classes, a layer-wise cascade classifier was introduced by Yang et al. Cheng et al. [149] examined Faster R-CNN and found that although the localization of objects has improved a lot, there are still many classification errors, which can be attributed to the joint multi-task optimization of regression and classification; they also noted that the larger receptive fields of Faster R-CNN introduce noise into the detection process. To overcome these shortcomings, a cascade model was built based on both Faster R-CNN and R-CNN: an initial set of prediction offsets is produced by a well trained Faster R-CNN, and these offsets are then refined using an R-CNN.
(iv) Adversarial Learning: This learning strategy has shown remarkable advancement in generative models. The most famous recent example of adversarial learning is the generative adversarial network (GAN) [150], wherein a generator and a discriminator work against each other in a competitive manner: the generator tries to model the data by producing fake images and uses them to puzzle the discriminator, whereas the discriminator battles with the generator to distinguish the real images from the fake ones. GANs and their variants [151–153] have become largely popular and can therefore also be used for object detection. A new setup for detecting smaller objects was proposed by Li et al. [154], called perceptual GAN, where the generator module learns high-resolution features for small objects using the adversarial learning technique and the discriminator tries to tell them apart from real high-resolution features of small objects. A Faster R-CNN model trained using adversarial examples was proposed by Wang et al. [155], where a learned mask is produced on the region features followed by the region classifiers, due to which the detector becomes more robust after receiving more adversarial examples.
(v) Learning from scratch: The majority of object detectors, relying profoundly on pretrained ImageNet models, suffer from suboptimal performance because of the bias in loss function and data distribution existing between the classification and detection tasks. This problem can be mitigated to an extent by fine-tuning on the detection task, but it cannot be completely avoided; apart from that, using a pretrained classification model for detection in a new domain can lead to further complications. For these reasons there is a strong need to train detection models from scratch instead of only using pretrained ImageNet models. The primary difficulty one may face in training a model from scratch is the unavailability of enough data for object detection, which can cause overfitting. Moreover, object detection models also need bounding box level annotations, and therefore annotating a huge dataset for object detection is a time consuming and costly affair. There are various state-of-the-art works where object detectors trained from scratch were proposed: Shen et al. [109] proposed DSOD (deeply supervised object detectors), where the authors argued that optimization difficulties can be mitigated significantly if densely connected network models with intense supervision are used. Shen et al. [156] came out with a gated recurrent feature pyramid which dynamically adjusts the supervision intensities of intermediate layers for objects at various scales; this method proved to be more powerful than the traditional DSOD. Later, He et al. [157] examined the difficulty associated with training detectors from scratch on the MSCOCO dataset, concluded that vanilla detectors can obtain good performance with at least 10K annotated images, and showed that no specialized model is required for training from scratch.
(vi) Imbalanced Sampling: In the field of object detection, the unevenness of positive and negative samples is a serious issue. This is because the majority of proposals considered as regions of interest are basically background, and only a few among them actually contain the objects we are looking for. Due to this imbalance, two problems arise: class imbalance and difficulty imbalance. Class imbalance means that the majority of the generated proposals are just background and very few actually contain objects, whereas difficulty imbalance means that it is much easier to classify the proposals as background, while the objects are harder to classify. Various techniques have been developed to get rid of this problem. Two-stage detectors like R-CNN and Fast R-CNN discard a large number of negative samples and keep about 2000 proposals for further classification. In Fast R-CNN [19], the ratio of positive to negative samples is fixed at 1:3 in each minibatch to overcome the difficulty of class imbalance; random sampling can also be used to address class imbalance. To address the problem of difficulty imbalance, carefully fabricated loss functions can be used. For the purpose of object detection, a multi-class classifier is trained over C+1 classes. If $p = (p_0, p_1, p_2, \ldots, p_C)$ is the discrete probability distribution of the output over the C+1 classes and ``v'' is the ground truth class, then the loss function is defined as:

$Loss_{cls}(p, v) = -\log p_v$    (8)

Instead of rejecting all the easy samples, a novel focal loss [83] was proposed by Lin et al., where an importance factor is assigned to each sample with respect to its loss value, as shown:

$Loss_{FL} = -\alpha (1 - p_v)^\gamma \log(p_v)$    (9)

Here α and γ are the parameters that control the weight importance.
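A minimal sketch of Eq. (9) is given below; the values α = 0.25 and γ = 2.0 are the commonly quoted defaults from [83] and are used here purely for illustration.

```python
import torch

def focal_loss(p_v, alpha=0.25, gamma=2.0):
    """Eq. (9): down-weights easy examples whose p_v is already close to 1."""
    return -alpha * (1.0 - p_v) ** gamma * torch.log(p_v)

probs = torch.tensor([0.95, 0.5, 0.1])   # easy, medium, hard predictions
print(focal_loss(probs))                  # the hard example dominates the loss
```

The modulating factor $(1 - p_v)^\gamma$ shrinks the loss of well-classified (mostly background) samples, so the abundant easy negatives no longer swamp the gradient.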
(vii) Distillation of Knowledge: Knowledge distillation is defined as a training technique which extracts the knowledge from a cascade of models and imparts it to a single model using a teacher–student training scheme. This learning technique was first used for the classification of images [158]. The knowledge distillation technique was also used in some object detection works [149,159] to improve performance. A lightweight optimized object detector was proposed by Li et al. [159], which achieved comparable detection accuracy by using the knowledge distillation strategy while maintaining higher inference speeds. Another optimized framework based on the teacher–student training scheme was proposed by Cheng et al. [149], which showed an improvement in accuracy compared to a single-model optimization strategy.

5.2. Testing stage:

The algorithms and techniques used for object detection make dense predictions which cannot be used directly for evaluation due to the large number of duplicates; moreover, further strategies are needed to achieve better detection results. In this section, some of the popular strategies used during the testing phase are reviewed.

(i) Removal of Duplicates: Object detection techniques are known to make a dense set of predictions with many duplicates, and therefore NMS (non-maximum suppression) appears to be a fundamental part of object detection, used to get rid of duplicate false positives. For single-stage object detectors, the proposals enclosing the same object may have similar confidence scores, leading to false positives; for double-stage detectors, the bounding box regressor can be regarded as a cause of generating false positives. Therefore, in both cases NMS is needed to remove the duplicate predictions. Typically, for each target class, the generated prediction boxes are arranged in order of their confidence scores and the box with the highest score, denoted M, is selected. The IoU of every other box with M is then computed, and the boxes whose IoU exceeds a predefined threshold Ω are discarded. This whole process is repeated over all the predictions made; a typical example is given in Fig. 15. That is, the confidence of a box B which overlaps M by more than Ω is set to zero, as shown below:

$Score_B = \begin{cases} Score_B & IoU(B, M) < \Omega_{test} \\ 0 & IoU(B, M) \ge \Omega_{test} \end{cases}$    (10)

Fig. 15. Example to show how duplicate predictions are eliminated by the NMS operation.

However, if another object simply lies within the threshold Ω of M, this will result in a missed prediction. To address this issue, Navaneeth et al. [160] proposed the Soft-NMS technique, where instead of directly discarding the prediction B, the technique decays the confidence score of B as a linear function F of its overlap with M, as given below:

$Score_B = \begin{cases} Score_B & IoU(B, M) < \Omega_{test} \\ F(IoU(B, M)) & IoU(B, M) \ge \Omega_{test} \end{cases}$    (11)

Soft-NMS avoids discarding predictions of grouped objects and achieved a remarkable improvement on many common benchmarks.
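The following minimal NumPy sketch shows the greedy NMS loop of Eq. (10); it is an illustration under simplified assumptions, not the implementation of [160], and the example boxes, scores and threshold are arbitrary values. The comment marks where Soft-NMS (Eq. (11)) would decay scores instead of discarding boxes.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS (Eq. 10): keep the best box, drop overlapping duplicates."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        m = order[0]
        keep.append(m)
        # IoU of the top-scoring box M with all remaining boxes B.
        x1 = np.maximum(boxes[m, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[m, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[m, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[m, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_b = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_m + area_b - inter)
        # Soft-NMS (Eq. 11) would decay the scores here instead of discarding.
        order = order[1:][iou < iou_thresh]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2] -- the near-duplicate box 1 is suppressed
```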
(ii) Model Acceleration: For applying object detection in the real world, the algorithms and techniques are required to work efficiently. Although current object detectors [5,161] can attain remarkable results on publicly available datasets, their inference speeds make it very difficult to apply them in real world applications, so here we review some techniques for accelerating object detectors. As we have already seen, two-stage detectors are usually slower than single-stage detectors because of the two stages involved: one for generating the proposals and the other for region classification. For sharing the cost of computation, R-FCN [91] extracts position-sensitive features from position-sensitive feature maps. However, the number of channels of such position-sensitive feature maps increases extensively with a large number of object categories. To overcome this problem of an increased number of channels, Li et al. [162] developed Light-Head R-CNN, which remarkably mitigates the number of channels in the final feature map, thereby reducing the computational cost and increasing the inference speed. Another significant idea to accelerate the detection models is to use efficient backbone networks in place of heavy detection architectures; e.g., MobileNet [125,163] can be considered an efficient model, and it was adopted in many works [164,165]. A further approach to accelerate a model is to optimize the learned setup offline using model compression and quantization [166–173].

5.3. Other latest learning strategies:

Image pyramids [5,174], a widely used approach for improving detection results, can be employed; this is because of their hierarchical nature, which helps to make predictions at different object scales before finally merging the results into a final prediction. Horizontal flipping [8,174] has also been employed extensively during the testing phase and has achieved improved performance. All these strategies largely increase the detectors' capacity to handle multi-scale objects and hence are widely used in many object detection techniques, despite increasing the computation cost.

6. Popular datasets and metrics used for object detection

By object detection, we mean determining whether an object belongs to a predefined class or not and, if it does, identifying and locating it in the given image. Localization of an object in an image is generally represented using a bounding box. For carrying out a standard comparison between different algorithmic approaches and defined goals for solutions, exacting and challenging datasets which are considered significant are used in many different research areas. Earlier face detection techniques relied on various ad-hoc datasets, but in later years many state-of-the-art face detection datasets were created. Apart from that, several datasets have been created for problems such as pedestrian detection and face detection. Generic object detection datasets like PASCAL VOC [33], ImageNet [175] and MSCOCO [92] are widely used object detection benchmarks. In this section we will review some of the most popular datasets along with their performance evaluation.

Fig. 16. Sample images from PASCAL VOC 2007 dataset.

6.0.1. PASCAL VOC dataset:

For detecting common object classes, a multi-year effort from 2005 to 2012 was dedicated to the creation of a sequence of widely used benchmark datasets. The PASCAL VOC datasets [33] consist of 20 visual object classes spread over 11,000 images. These 20 classes can be grouped into four main branches — animals, vehicles, people and domestic items. In addition, classes of objects which are semantically similar, like trucks vs. buses, add up to increased difficulty levels for detection purposes. Some examples from the VOC 2007 and VOC 2012 datasets are given in Figs. 16 & 17 respectively.

Fig. 17. Sample images from PASCAL VOC 2012 dataset.

For the evaluation of classification and detection performance, interpolated average precision (Salton and McGill, 1986) was used as the criterion for the PASCAL datasets. It penalizes detection techniques for missing object instances, for false positive detections and also for duplicate detections of a single object instance.

$Recall(t) = \frac{\sum_{ij} 1\left[s_{ij} \ge t\right] z_{ij}}{N}$    (12)

$Precision(t) = \frac{\sum_{ij} 1\left[s_{ij} \ge t\right] z_{ij}}{\sum_{ij} 1\left[s_{ij} \ge t\right]}$    (13)

Here ``t'' is the threshold value used to evaluate the IoU between the ground truth and the prediction made; in the VOC datasets ``t'' is set to 0.5. ``N'' is the total number of predicted bounding boxes, ``i'' is the index of the ith image and ``j'' is the index of the jth object. A precision/recall curve is computed for a given detection task over the different object categories, where recall is the fraction of all positive samples above a given threshold value and precision is the fraction of samples above the defined threshold which belong to the positive class. To obtain the final results, mean Average Precision (mAP) is computed across all object categories. Comparative results of some state-of-the-art detection models on the VOC 2007 and VOC 2012 test sets are summarized in Tables 4 and 5 respectively.
6.0.2. MS-COCO dataset:

MSCOCO stands for the Microsoft Common Objects in Context dataset [92]. It consists of 91 object categories found in everyday life in their natural environments, and 82 of these categories have more than 5000 labeled instances each. In total, the dataset has 2,500,000 labeled instances spread over 328,000 images. With respect to the popular ImageNet dataset [186], MS-COCO has fewer object classes but more instances per object category, and compared to the PASCAL VOC dataset, MS-COCO has a significantly larger number of instances per object category, typically exceeding 27K on average. Furthermore, the MSCOCO dataset contains 3.5 object classes per image on average, compared to ImageNet with 1.7 classes per image, while for PASCAL VOC this figure stands at 1.4. As we already know, objects which are tiny in size often require more contextual reasoning for their recognition, and images from the MSCOCO dataset are super rich in this contextual information. The largest class in the MSCOCO dataset is persons, with almost 800,000 instances, whereas the smallest is the hair dryer, with roughly 600 instances in the entire dataset. Some typical annotated images from MSCOCO are given in Fig. 18.

Fig. 18. Sample annotated images from MS-COCO dataset.

The metric used to measure detection performance is similar to what we have seen for PASCAL VOC; the threshold value here can be chosen from a range of values [0.5, 0.95] with an interval of 0.05 to measure the mean Average Precision (mAP). Furthermore, to measure the performance of an object detector on different sized objects, separate average precision values for small, medium and large objects are computed. A comparison of the detection performance of some state-of-the-art techniques on the MSCOCO test-dev dataset is summarized in Table 6.

6.0.3. ImageNet Dataset:

It is one of the most important datasets, with about 200 object classes [175,196]. The ILSVRC object detection task assesses the capability of an algorithm to localize and recognize all possible instances of the target object categories present in an image. The metric used to measure the performance of an algorithm on this dataset is called the loosened threshold, mathematically given below:

$t = \min\left(0.5, \frac{wh}{(w+10)(h+10)}\right)$    (14)

Here ``w'' and ``h'' are the width and height of the ground truth bounding box. The beauty of this threshold is that it allows the predicted bounding box to extend at most 5 pixels on average in each possible direction around the object.
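As a quick worked illustration of Eq. (14), with box sizes chosen arbitrarily:

```python
def loosened_threshold(w, h):
    """Eq. (14): the IoU threshold is relaxed for small ground-truth boxes."""
    return min(0.5, (w * h) / ((w + 10) * (h + 10)))

print(loosened_threshold(100, 100))  # 0.5   -- large boxes keep the usual 0.5
print(loosened_threshold(20, 20))    # ~0.44 -- small boxes get a looser threshold
```

This shows why the metric is forgiving on tiny objects: for a 20 × 20 box, a prediction only needs roughly 0.44 IoU to count as correct.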

Table 4
Performance analysis of some state-of-the-art in terms of mean Average Precision (mAP) using PASCAL VOC 2007 test dataset.
Models Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Mbike Person Plant Sheep Sofa Train TV mAP
R-CNN (Alex) [7] 68.1 72.8 56.8 43.0 36.8 66.3 74.2 67.6 34.4 63.5 54.5 61.2 69.1 68.6 58.7 33.4 62.9 51.1 62.5 68.6 58.5
R-CNN (VGG) [19] 73.4 77.0 63.4 45.4 44.6 75.1 78.1 79.8 40.5 73.7 62.2 79.4 78.1 73.1 64.2 35.6 66.8 67.2 70.4 71.1 66.0
SPP-Net [89] 68.5 71.7 58.7 41.9 42.5 67.7 72.1 73.8 34.7 67.0 63.4 66.0 72.5 71.3 58.9 32.8 60.9 56.1 67.9 68.8 60.9
OHEM +Fast-RCNN [176] 80.6 85.7 79.8 69.9 60.8 88.3 87.9 89.6 59.7 85.1 76.5 87.1 87.3 82.4 78.8 53.7 80.5 78.7 84.5 80.7 78.9
HyperNet VGG [32] 84.2 78.5 73.6 55.6 53.7 78.7 79.8 87.7 49.6 74.9 52.1 86 81.7 83.3 81.8 48.6 73.5 59.4 79.9 65.7 71.415
Faster R-CNN [21] 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6 69.9
GCNN [177] 68.3 77.3 68.5 52.4 38.6 78.5 79.5 81.0 47.1 73.6 64.5 77.2 80.5 75.8 66.6 34.3 65.2 64.4 75.6 66.4 66.8
Bayes [178] 74.1 83.2 67.0 50.8 51.6 76.2 81.4 77.2 48.1 78.9 65.6 77.3 78.4 75.1 70.1 41.4 69.6 60.8 70.2 73.7 68.5
SDP +CRC [179] 76.1 79.4 68.2 52.6 46.0 78.4 78.4 81.0 46.7 73.5 65.3 78.6 81.0 76.7 77.3 39.0 65.1 67.2 77.5 70.3 68.9
SubCNN [27] 70.2 80.5 69.5 60.3 47.9 79.0 78.7 84.2 48.5 73.9 63.0 82.7 80.6 76.0 70.2 38.2 62.4 67.7 77.7 60.5 68.5
StuffNet30 [180] 72.6 81.7 70.6 60.5 53.0 81.5 83.7 83.9 52.2 78.9 70.7 85.0 85.7 77.0 78.7 42.2 73.6 69.2 79.2 73.8 72.7
NOC [181] 76.3 81.4 74.4 61.7 60.8 84.7 78.2 82.9 53.0 79.2 69.2 83.2 83.2 78.5 68.0 45.0 71.6 76.7 82.2 75.7 73.3
MR-CNN + S-CNN [182] 80.3 84.1 78.5 70.8 68.5 88.0 85.9 87.8 60.3 85.2 73.7 87.2 86.5 85.0 76.4 48.5 76.3 75.5 85.0 81.0 78.2
HyperNet [183] 77.4 83.3 75.0 69.1 62.4 83.1 87.4 87.4 57.1 79.8 71.4 85.1 85.1 80.0 79.1 51.2 79.1 75.7 80.9 76.5 76.3
SSD300 [47] 80.9 86.3 79.0 76.2 57.6 87.3 88.2 88.6 60.5 85.4 76.7 87.5 89.2 84.5 81.4 55.0 81.9 81.5 85.9 78.9 79.6
SSD512 [47] 86.6 88.3 82.4 76.0 66.3 88.6 88.9 89.1 65.1 88.4 73.6 86.5 88.9 85.3 84.6 59.1 85.0 80.4 87.4 81.2 81.6

Table 5
Performance analysis of some state-of-the-art in terms of mean Average Precision (mAP) using PASCAL VOC 2012 test dataset.
Models Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Mbike Person Plant Sheep Sofa Train TV mAP
R-CNN (Alex) [7] 71.8 65.8 52.0 34.1 32.6 59.6 60.0 69.8 27.6 52.0 41.7 69.6 61.3 68.3 57.8 29.6 57.8 40.9 59.3 54.1 53.3
R-CNN (VGG) [7] 79.6 72.7 61.9 41.2 41.9 65.9 66.4 84.6 38.5 67.2 46.7 82.0 74.8 76.0 65.2 35.6 65.4 54.2 67.4 60.3 62.4
Fast-RCNN [19] 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2 68.4
OHEM +Fast-RCNN [184] 90.1 87.4 79.9 65.8 66.3 86.1 85.0 92.9 62.4 83.4 69.5 90.6 88.9 88.9 83.6 59.0 82.0 74.7 88.2 77.3 80.1
Faster R-CNN [21] 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5 70.4
StuffNet30 [180] 83.0 76.9 71.2 51.6 50.1 76.4 75.7 87.8 48.3 74.8 55.7 85.7 81.2 80.3 79.5 44.2 71.8 61.0 78.5 65.4 70.0
NOC [181] 82.8 79.0 71.6 52.3 53.7 74.1 69.0 84.9 46.9 74.3 53.1 85.0 81.3 79.5 72.2 38.9 72.4 59.5 76.7 68.1 68.8
MR-CNN + S-CNN [182] 85.5 82.9 76.6 57.8 62.7 79.4 77.2 86.6 55.0 79.1 62.2 87.0 83.4 84.7 78.9 45.3 73.4 65.8 80.3 74.0 73.9
HyperNet [183] 84.2 78.5 73.6 55.6 53.7 78.7 79.8 87.7 49.6 74.9 52.1 86.0 81.7 83.3 81.8 48.6 73.5 59.4 79.9 65.7 71.4
ION [185] 87.5 84.7 76.8 63.8 58.3 82.6 79.0 90.9 57.8 82.0 64.7 88.9 86.5 84.7 82.3 51.4 78.2 69.2 85.2 73.5 76.4
SSD512 [47] 91.4 88.6 82.6 71.4 63.1 87.4 88.1 93.9 66.9 86.6 66.3 92.0 91.7 90.8 88.5 60.9 87.0 75.4 90.2 80.4 82.2
SSD300 [47] 91.0 86.0 78.1 65.0 55.4 84.9 84.0 93.4 62.1 83.6 67.3 91.3 88.9 88.6 85.6 54.7 83.8 77.3 88.3 76.5 79.3
YOLO [20] 77.0 67.2 57.7 38.3 22.7 68.3 55.9 81.4 36.2 60.8 48.5 77.2 72.3 71.3 63.5 28.9 52.2 54.8 73.9 50.8 57.9
YOLO + Fast R-CNN [20] 83.4 78.5 73.5 55.8 43.4 79.1 73.1 89.4 49.4 75.5 57.0 87.5 80.9 81.0 74.7 41.8 71.5 68.5 82.1 67.2 70.7
YOLOv2 [20] 88.8 87.0 77.8 64.9 51.8 85.2 79.3 93.1 64.4 81.4 70.2 91.3 88.1 87.2 81.0 57.7 78.1 71.0 88.5 76.8 78.2
R-FCN [19] 92.3 89.9 86.7 74.7 75.2 86.7 89.0 95.8 70.2 90.4 66.5 95.0 93.2 92.1 91.1 71.0 89.7 76.0 92.0 83.4 85.0

Table 6
Comparison of detection performance of some state-of-the-art techniques on MSCOCO test-dev dataset.
Model Backbone architecture Year of introduction AP AP50 AP75 APS APM APL
Single-Stage-Object-Detectors
SSD512 [16] VGG-16 2016 28.8 48.5 30.3 10.9 31.8 43.5
SSD513 [113] ResNet-101 2017 31.2 50.4 33.3 10.2 34.5 49.8
DSSD513 [113] ResNet-101 2017 33.2 53.3 35.2 13.0 35.4 51.1
STDN513 [187] DenseNet-169 2018 31.8 51.0 33.6 14.4 36.1 43.4
CornerNet511 [84] Hourglass-169 2018 40.5 56.5 43.1 19.4 42.7 53.9
CornerNet511 [86] Hourglass-169 2019 44.9 62.4 48.1 25.6 47.4 57.4
GHM SSD [188] ResNeXt-101 2018 41.6 62.8 44.2 22.3 45.1 55.3
FPN-Reconfig [116] ResNeXt-101 2018 34.6 54.3 37.3 NA NA NA
FCOS [189] ResNeXt-101 2019 42.1 62.1 45.2 25.6 44.9 52.0
FSAF [104] ResNeXt-101 2019 42.9 63.8 46.3 26.6 46.2 52.7
ExtremeNet [190] Hourglass-104 2019 40.2 55.5 43.2 20.4 43.2 53.1
M2Det800 [117] VGG-16 2019 41.0 59.7 45.0 22.1 46.5 53.8
RefineDet512 [174] ResNet-101 2018 36.4 57.5 39.5 16.6 39.9 51.4
YOLOv2 [46] DarkNet-19 2017 21.6 44.0 19.2 5.0 22.4 35.5
Two-Stage-Object-Detectors
Fast R-CNN [19] VGG-16 2015 19.7 35.9 NA NA NA NA
Faster R-CNN [21] VGG-16 2015 21.9 42.7 NA NA NA NA
Faster R-CNN w FPN [48] ResNet-101 2016 36.2 59.1 39.0 18.2 39.0 48.2
Faster R-CNN by G-RMI [161] Inception-ResNet-v2 2017 34.7 55.5 36.7 13.5 38.1 52.0
OHEM [184] VGG-16 2016 22.6 42.5 22.2 5.0 23.7 37.9
ION [185] VGG-16 2016 23.6 43.2 23.6 6.4 24.1 38.3
R-FCN [91] ResNet-101 2016 29.9 51.9 NA 10.8 32.8 45.0
CoupleNet [191] ResNet-101 2017 34.4 54.8 37.2 13.4 38.1 50.8
Deformable R-FCN [91] Aligned-Inception-ResNet 2017 37.5 58.0 40.8 19.4 40.1 52.5
DeNet-101 [103] ResNet-101 2017 33.8 53.4 36.1 12.3 36.1 50.8
Mask-RCNN [8] ResNeXt-101 2017 39.8 62.3 43.4 22.1 43.2 51.2
Fitness-NMS [146] ResNet-101 2017 41.8 60.9 44.9 21.5 45.0 57.5
Relation Net [93] ResNet-101 2018 39.0 58.6 42.9 NA NA NA
DeepRegionlets [192] ResNet-101 2018 39.3 59.8 NA 21.7 43.7 50.9
C-Mask RCNN [193] ResNet-101 2018 42.0 62.9 46.4 23.4 44.7 53.8
DCN + R-CNN [149] ResNet-101 + ResNet-152 2018 42.6 65.3 46.5 26.4 46.1 56.4
Cascade R-CNN [194] ResNet-101 2018 42.8 62.1 46.3 23.7 45.5 55.2
Grid R-CNN [147] ResNeXt-101 2019 43.2 63.0 46.6 25.1 46.5 55.2
DCN-v2 [121] ResNet-101 2019 44.8 66.3 48.8 24.4 48.1 59.6
TridentNet [195] ResNet-101 2019 42.7 63.6 46.5 23.9 46.6 56.6
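The COCO-style columns in Table 6 follow a common convention: AP is averaged over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, AP50 and AP75 fix the threshold at 0.5 and 0.75, and APS, APM and APL restrict evaluation to small, medium and large objects by pixel area. A minimal Python sketch of these two conventions follows (the official evaluation lives in the pycocotools library; the function names here are illustrative only):

import numpy as np

# COCO averages AP over the IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = np.round(np.arange(0.50, 1.00, 0.05), 2)

def coco_style_ap(ap_at_iou):
    # ap_at_iou maps an IoU threshold to the AP obtained at that threshold;
    # the headline AP is their mean, while AP50/AP75 are single entries.
    return float(np.mean([ap_at_iou[t] for t in IOU_THRESHOLDS]))

def size_bucket(box):
    # Assign a box (x1, y1, x2, y2) to the APS/APM/APL bucket by its area.
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return 'small'    # counted in APS
    if area < 96 ** 2:
        return 'medium'   # counted in APM
    return 'large'        # counted in APL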
6.0.4. OpenImages dataset:

This dataset consists of 1.9 million images containing approximately 15 million objects spread over 600 categories. The images in this dataset are already annotated with image level labels, object segmentation masks, visual relationships and object bounding boxes [197]. Some important aspects of this dataset are, firstly, that the bounding boxes are largely drawn manually; secondly, that the images are really assorted and consist of complex scenes with many objects; and thirdly, that the dataset often offers visual relationship annotations. Since it is a very diverse dataset, different metrics are used to measure the performance of an algorithm on it. Firstly, the unannotated class categories are discarded to avoid an incorrect count of false negatives. Secondly, if an object belongs to a predefined target class, the detection model should come up with detection results for each of the relevant classes. Thirdly, if a detection occurs inside a group-of box and the intersection of the detection with the box, divided by the area of the detected region, exceeds a threshold of 0.5, then it is counted as a true positive.
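Note that the quantity being thresholded in this group-of rule is an intersection-over-area rather than the usual IoU. A minimal sketch (the (x1, y1, x2, y2) box format is an assumption):

def intersection_over_area(det, group_box):
    # Overlap of the detection with the group-of box, divided by the
    # area of the detection itself (not by the union, as in IoU).
    ix = max(0.0, min(det[2], group_box[2]) - max(det[0], group_box[0]))
    iy = max(0.0, min(det[3], group_box[3]) - max(det[1], group_box[1]))
    det_area = (det[2] - det[0]) * (det[3] - det[1])
    return (ix * iy) / det_area if det_area > 0 else 0.0

# The detection counts as a true positive for the group-of box when
# intersection_over_area(det, group_box) > 0.5.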
Table 7
Statistics of some popular generic object detection datasets.
Dataset Training subset Validation subset Testing subset Trainval subset
Images Objects Images Objects Images Objects Images Objects
Pascal VOC-2007 2501 6301 2510 6307 4952 14976 5011 12608
Pascal VOC-2012 5717 13609 5823 13841 10991 NA 11540 27450
ILSVRC-2014 456567 478807 20121 55502 40152 NA 476688 534309
ILSVRC-2017 456567 478807 20121 55502 65500 NA 476688 534309
MS-COCO-2015 82783 604907 40504 291875 81434 NA 123287 896782
MS-COCO-2018 118287 860001 5000 36781 40670 NA 123287 896782
6.0.5. VIS Drone 2018 dataset:

In the year 2018, a new dataset comprising videos and images captured using drones was developed, called VIS Drone 2018 [198]. It is a large scale visual object detection and tracking dataset that aims at advancing visual understanding on drone based platforms. The images and videos in this dataset are captured over several urban areas of 14 different Chinese cities. Typically speaking, it comprises 263 video sequences and 10209 images with affluent annotations like ground truth bounding boxes, object categories, truncation ratios etc., amounting to more than 2.5 million annotated instances. This dataset adopts the MS-COCO metric for evaluating the performance of different object detection algorithms.

6.0.6. LVIS [1]:

It is a newly collected benchmark comprising 164000 images spread over 1000 plus object categories. It is a pretty new dataset without any pre-existing results as of now.

The statistics of some of these well known generic object detection datasets are provided in Table 7.

6.1. Pedestrian detection datasets:

In this subsection, we review the most famous and commonly used datasets for pedestrian detection along with the metrics being used for evaluation purposes.

(i) Caltech [52]: It is one of the most popular and tricky datasets used for the pedestrian detection task. It comprises approximately 10 h of VGA video sequence recorded by a vehicle bound camera driving through the streets of the Los Angeles metropolitan area. The training and testing sets comprise 42782 and 4024 video frames respectively.
(ii) ETH [199]: This dataset is basically used as a testing set to assess the performance of detection models trained on the Citypersons dataset. It comprises 1804 frames spread over three video sequences.
(iii) INRIA [26]: It comprises HD images of pedestrians mostly collected from vacation photographs; its 2120 images are divided into two sets consisting of 1832 images for training and the remaining 288 images for testing.
(iv) KITTI [200]: It consists of 7481 labeled HD images for training and 7518 images for testing. The person category in this dataset is subdivided into the two sub categories of pedestrian and cyclist. Models trained on it are evaluated using three metrics which differ in the minimum bounding box height and the maximum occlusion level.

For the INRIA [26] and ETH [199] datasets, the log average miss rate metric is used to evaluate the performance of detectors, while for KITTI, mean Average Precision with an IoU threshold of 0.5 is used. A comparative analysis of pedestrian detection datasets is given in Table 8, followed by a comparison of person detection benchmarks in Table 9; the table information is adapted from Dollár et al., IEEE TPAMI 2012 [52].
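For reference, the log average miss rate condenses a detector's miss rate versus FPPI curve into a single number. The sketch below follows the usual nine reference points evenly spaced in log-space over [10^-2, 10^0] from Dollár et al. [52]; the array handling details are assumptions:

import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    # fppi and miss_rate describe the detector's curve, with fppi sorted
    # in increasing order; sample the miss rate at nine reference FPPI
    # values and average the samples in log-space.
    refs = np.logspace(-2.0, 0.0, 9)
    samples = []
    for r in refs:
        below = np.where(np.asarray(fppi) <= r)[0]
        samples.append(miss_rate[below[-1]] if below.size else 1.0)
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))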
6.2. Face detection datasets:

Here many popular face detection datasets are reviewed along with their evaluation metrics.

(i) PASCAL FACE [33]: This dataset comprises 1335 labeled faces from 851 different images. It is collected from the PASCAL person layout set and is commonly used as a test set.
(ii) FDDB [205]: It stands for Face Detection Data set and Benchmark. It is a widely used dataset comprising 5171 faces spread over 2845 images. It is also used as a test set for evaluating the performance of face detection models.
(iii) Wider-Face [206]: It is a very large dataset comprising 32203 images with almost 400 K faces of varied ranges and scales. It is divided into three subsets: 40% for training, 10% for validation and the remaining 50% for testing. The annotated training and validation sets are available online at http://shuoyang1213.me/WIDERFACE/.

For Wider-Face and PASCAL-Face, mAP with an IoU threshold of 0.5 is used as the metric to evaluate the performance of face detectors, while for FDDB two annotation types (bounding box level and ellipse level) are mostly used. The details of the evaluation metrics (discussed above) are summarized in Table 10.
7. Applications of object detection

Object detection finds its applications in many fields like medicine, the military, security etc. to assist people. In this section, some prominent applications of object detectors are reviewed.
Table 8
Comparative analysis of pedestrian detection datasets.
Dataset Setup Training set Testing set
Pedestrians Positive images Negative images Pedestrians Positive images Negative images
Caltech [52] Mobile 192K 67K 61K 155K 65K 56K
INRIA [26] Photo 1208 614 1218 566 288 453
ETH [199] Mobile 2388 499 NA 12K 1804 NA
TUD-Brussels [201] Mobile 1776 1092 218 1498 508 NA
Daimler-DB [53] Mobile 192K 67K 61K 155K 65K 56K
Table 9
Comparison of person detection benchmarks.
Dataset Seasons Countries Cities Images Pedestrians Image resolution Weather Train-Val-Test-Split (%)
Caltech [52] 1 1 1 249884 289395 640 × 480 Dry 50-0-50
KITTI [202] 1 1 1 14999 9400 1240 × 376 Dry 50-0-50
City Persons [203] 3 3 27 5000 31514 2048 × 1024 Dry 60-10-30
TDC [204] 1 1 1 14674 8919 2048 × 1024 Dry 71-8-21
Table 10
Details of evaluation metrics for generic object detection.
Abbreviation Meaning Description
Ω IoU Threshold Used for evaluating localization accuracy.
D All Predictions All predictions generated by the object detector, ranked by confidence score.
TP True Positive Correct predictions, i.e. detections whose IoU with a ground truth box exceeds Ω.
FP False Positive Spurious predictions with no sufficiently overlapping ground truth box.
P Precision Fraction of true positives out of all predictions.
AP Average Precision Precision averaged over different values of recall.
TPR True Positive Rate Fraction of ground truth positives that are correctly detected.
FPPI False Positives Per Image Average number of false positives per image.
FPS Frames Per Second Number of images processed per second.
MR Miss Rate Average miss rate over diverse FPPI rates equally spaced in log-space.
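To make the entries of Table 10 concrete, the sketch below computes the IoU used against the Ω threshold and an (uninterpolated) AP from a ranked list of detections; the mAP reported in Tables 4 and 5 is then simply the mean of the per-class APs. PASCAL VOC uses an interpolated variant of this integral, so treat this as an approximation:

import numpy as np

def iou(a, b):
    # Intersection over Union for boxes in (x1, y1, x2, y2) format.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(scores, is_tp, num_gt):
    # Sort detections by confidence, accumulate TP/FP counts, and take
    # the area under the precision-recall curve (simple rectangle rule).
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-10)
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap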
7.1. Face detection:

The aim of face detection is to detect all the faces present in an image. Face detection is still considered a difficult task because of occlusion and illumination variations. Many state-of-the-art face detectors have been built which precisely detect faces in a given image. A novel Wasserstein CNN approach was proposed by He et al. [207] to learn invariant features for face detection. To enhance and speed up the discriminative abilities of a DCNN based face recognizer, appropriate loss functions need to be designed. One such loss function is the cosine based softmax loss [208–211], which is widely used. Guo et al. [212] proposed a fuzzy logic based sparse auto encoder framework for face recognition.
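The cosine based losses cited in [208–211] share one idea: L2-normalize both the feature and the class weight vectors so that the logits become cosine similarities, then rescale them and, in the large-margin variants, penalize the target class. A minimal NumPy sketch, with illustrative values for the scale s and margin m rather than the settings of any particular paper:

import numpy as np

def cosine_softmax_loss(feature, weights, label, s=30.0, m=0.35):
    # feature: (d,) embedding; weights: (num_classes, d) classifier matrix.
    f = feature / np.linalg.norm(feature)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ f                            # cosine similarity to every class
    logits = s * cos
    logits[label] = s * (cos[label] - m)   # margin on the target class only
    logits -= logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]               # softmax cross entropy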
7.2. Pedestrian detection:

This application aims at detecting pedestrians in a natural environment. Braun et al. [213] developed the EuroCity persons dataset, which typically comprises pedestrians and cyclists. For real time pedestrian detection, some cascaded pedestrian detection models were developed [214–216].

7.3. Anomaly detection:

A noteworthy part in fraud detection, healthcare monitoring and weather scrutiny is played by anomaly detection techniques. Current anomaly detection techniques study the data using point wise criteria [217–220]. For analyzing contiguous time and space intervals, Barz et al. [221] proposed an unsupervised technique called Maximally Divergent Intervals for anomaly detection.

7.4. License plate recognition:

With the growing craze for cars and vehicles, license plate recognition is considered a hot topic today, as it is required in vehicular and traffic violation tracking. For making license plate recognition more robust and trustworthy, several techniques like edge detection, morphology, sliding concentric windows etc. are clubbed together. In recent years, deep learning based methods [222–227] also provide solutions for this application.

7.5. Autonomous driving:

For automatically driving a car, perfect perception is needed to operate in a reliable manner. Deep learning based perception systems are normally employed for this purpose; they transform multi sensory data into semantic knowledge, thereby enabling automatic driving. Object detection is a fundamental aspect of such a driving system. In recent years, Lu et al. [228] made use of fresh setups containing 3D convolutions and RNNs to achieve localization accuracy in several real world driving scenarios. Apart from that, Song et al. [228] developed a 3D car instance understanding benchmark for autonomous driving, and Banerjee et al. [229] utilized sensor fusion to extract efficient features.

7.6. Traffic sign recognition:

For the sake of security and rule following, real time accurate traffic sign recognition is required, which assists driving by obtaining temporal and spatial information of various traffic signs. Deep learning based object detection methods [230–243] try to solve this problem with high accuracy.

7.7. Computer aided diagnosis (CAD) systems:

These systems can assist doctors and physicians in categorizing different kinds of cancer. The fundamental tasks carried out by a CAD setup can be recognized as image acquisition followed by segmentation, feature extraction, classification and finally object detection. Because of significant entity differences, lack of data and privacy concerns, there generally exists a difference of data distribution between the source and target domains. Hence domain adaptation setups [244] are required for medical image detection.
7.8. Event detection:

This application focuses on recognizing real world events from the web, like festivals, talk shows, disasters, election campaigns, protests etc. With the advancement of social media, multi-domain event detection systems can provide a detailed description of them. For disposing of multi-domain data, an event detection system was developed by Yang et al. [245]; Wang et al. [246] added some online social interaction features by developing affine based graphs for event detection purposes. Apart from these advancements, a multi dimensional graph based model for detecting events from millions of video sequences and images was developed by Schinas et al. [247].

7.9. Pattern detection:

Pattern detection always suffers from the difficulties of scene occlusion, varying illumination, pose variations and sensor based noise. For addressing duplicate patterns and periodic structure detections, researchers have designed and developed some state-of-the-art benchmarks for both 2D [248,249] and 3D images [129,250–260].

7.10. Image caption generation:

This application aims at automatically generating captions for a given image by capturing the underlying semantic information from the image and expressing it using natural language. Image captioning requires both computer vision and NLP technologies, which are themselves very challenging. For addressing these issues, reinforcement learning [261,262], attention networks [263,264] and encoder–decoder networks are widely used for this purpose. Apart from all these, deep CNN based techniques have proven to be highly effective and efficient [2,265].

7.11. Salient object detection:

For detecting salient objects, deep neural networks are utilized to foresee saliency scores of image regions and to produce correct saliency maps. Deep neural networks for detecting salient objects normally need to put together multi-level features of the backbone architecture. To obtain fast speed without compromising accuracy, Wu et al. [266] suggested that shallower layer features are sufficient to precisely obtain the saliency map. For detecting salient objects, Wang et al. [267] have used fixation prediction techniques, and to accurately detect salient objects, Wang et al. [268] incorporated prior knowledge of saliency into recurrent fully convolutional networks. To properly explore the structure of objects, an attentive feedback module has been designed by Feng et al. [269].

7.12. Text detection:

The aim of text detection is to discover the text regions of a given image or video, and it is one of the most important preconditions for many computer vision tasks like categorization and video analysis. Although there are already many thriving optical character recognition systems, the detection of text in cluttered natural scenes is still a challenging task because of blurring, different orientations, lighting conditions and various other distortions. Very recently, researchers have found that randomly-oriented text detection [270–272] is a research direction that requires attention. Generally, deep learning based scene text detection can be broadly categorized into two classes. The first category considers the scene text as a particular object and uses text box regression techniques to locate the text within that scene, whereas in the second category the textual image is directly segmented, which requires a complicated post processing step. Further, in order to get the required orientation of text boxes, some text detection techniques require multifaceted post processing, hence they are not as efficient as those methods which are directly based on detection networks. Lyu et al. [273] combined the ideas proposed in the above two text detection categories along with dividing the text regions into relative positions to recognize the underlying text. Ma et al. [274] developed a novel rotation based technique and an end to end text detection system to generate inclined region proposals with text orientation information.

7.13. Point clouds 3D object detection:

As compared to image based object detection, LIDAR based point clouds provide depth information which can further be used for accurately locating objects and characterizing their shapes. LIDAR point cloud based 3D object detection plays an important role in autonomous driving, robotics and virtual reality applications. Apart from its numerous applications, this technique also meets some challenges like the sparsity of LIDAR point clouds, inconsistent sampling of the 3D-space, occlusions and relative pose variations. Qi et al. [275] developed an end to end deep neural network called PointNet which can learn features directly from LIDAR clouds. For efficient mapping and processing of enormous 3D data, Engelcke et al. [276] proposed sparse convolution layers and L1 regularization. Zhou et al. [277] developed a general end to end 3D object detection framework called VoxelNet that can predict correct 3D bounding boxes by learning a discriminative feature representation from point clouds.
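As a rough sketch of the voxel grouping step that VoxelNet-style pipelines [277] perform before feature learning, the function below buckets raw LIDAR points into a regular 3D grid; the voxel size and the per-voxel point cap are illustrative, and the original work sub-samples points randomly rather than truncating:

import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_pts=35):
    # points: (N, 3) NumPy array of LIDAR coordinates. Each point is
    # assigned to a grid cell; keep at most max_pts points per voxel.
    coords = np.floor(points / np.asarray(voxel_size)).astype(np.int64)
    voxels = {}
    for pt, c in zip(points, map(tuple, coords)):
        bucket = voxels.setdefault(c, [])
        if len(bucket) < max_pts:
            bucket.append(pt)
    return voxels  # maps a voxel grid index to the points falling inside it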
7.14. 2D 3D pose detection:

The aim of human pose detection is to locate the 2D or 3D poses of the body joints along with the base classes, eventually returning the average pose of the maximum scoring class. Emblematic 2D human pose detection techniques [237,238,278–280] make use of deep CNN architectures. Rogez et al. [239] came up with an end to end architecture for joint 2D and 3D pose estimation in natural scenes, thereby predicting the 2D and 3D poses of many people together. Human pose estimation techniques can broadly be divided into two categories: single stage and multi stage methods. The best performing methods are typically based on single stage backbone architectures [240–242], while the most representative multi stage techniques are convolutional pose machines [243], the Hourglass network [280] and MSPN [281].

7.15. Fine grained visual recognition:

The aim is to recognize an exact class of objects within each basic level category, like identifying the model of a car or recognizing the species of a mammal. This task is challenging because the visual differences between the class categories are minimal and can easily be overwhelmed by factors like pose, location of an object and its viewpoint in the given image. Krause et al. [282] make use of 3D object representations to generalize across multiple viewpoints at the level of both locations and local features. A bilinear model consisting of two CNN streams was introduced by Lin et al. [283], wherein the outputs from the two CNN streams are multiplied using an outer product at each image location and then pooled together to get an image descriptor. A fine grained discriminative localization method based on saliency guided Faster R-CNN was introduced by He et al. [284], and later they introduced a weakly supervised version for fast fine-grained image categorization [285].
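The bilinear pooling step of Lin et al. [283] summarized above can be sketched in a few lines: take the outer product of the two streams' features at every spatial location, sum-pool over locations, then apply the signed square root and L2 normalization commonly used with bilinear descriptors. A minimal sketch under those assumptions:

import numpy as np

def bilinear_pool(feat_a, feat_b):
    # feat_a: (H, W, C1) and feat_b: (H, W, C2) feature maps from the
    # two CNN streams; the einsum sums the per-location outer products.
    desc = np.einsum('hwi,hwj->ij', feat_a, feat_b).reshape(-1)
    desc = np.sign(desc) * np.sqrt(np.abs(desc))   # signed square root
    return desc / (np.linalg.norm(desc) + 1e-12)   # L2 normalization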
7.16. Edge detection:

It aims at extracting the boundaries of objects and salient edges from an image, which is considered very important for many higher level computer vision tasks like object recognition, segmentation etc. Edge detection also suffers from major challenges: the edges of different scales present in an image require both object level boundaries and useful local region details, and, furthermore, to predict different parts in the final detection, each of the CNN layers should be trained with proper layer specific supervision. For addressing these issues, He et al. [286] came up with a bi-directional cascade network which allows one layer to be supervised by labeled edges at its own scale, while adopting dilated convolutions to generate multi-scale features.
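To illustrate why dilated convolutions yield multi-scale features without extra parameters, consider a toy 1D sketch: a kernel with dilation rate r samples inputs r steps apart, so the same three weights cover a receptive field of 3, 5 or 9 inputs for r = 1, 2 or 4 (this example is illustrative and not taken from [286]):

import numpy as np

def dilated_conv1d(x, kernel, rate):
    # Slide the kernel over x, reading inputs `rate` steps apart; the
    # effective receptive field grows while the weight count stays fixed.
    k, span = len(kernel), (len(kernel) - 1) * rate + 1
    out = np.zeros(max(len(x) - span + 1, 0))
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out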
8. Concluding remarks and future research areas

Due to their potent learning capabilities and usefulness in dealing with occlusions, scale alterations and background changes, deep learning based object detection techniques have become a highly researched area. In this paper we have provided a comprehensive survey of the latest advances in deep learning based visual object detection. The review starts by surveying a large body of recent works in the literature, followed by an analysis of traditional and current detectors. Then a rigorous overview of backbone architectures along with a systematic coverage of prominent learning strategies is performed. Finally, some popular datasets and benchmarks for visual object detection are discussed along with some application areas, to give a thorough understanding of the object detection landscape. With increasingly powerful visual object detectors in the fields of security, transportation, the military etc., the applications of object detection are witnessing a sharp increase. Despite all these advancements, there is still much room for further development. Below we provide some of the latest trends in this domain to facilitate future research in visual object detection with deep learning.

8.1. Future research trends:

(i) Video Object Detection: Video object detection suffers from motion target ambiguities, really tiny target objects, truncations, occlusions etc., which make it extremely difficult to achieve high accuracy and efficiency. Henceforth, investigating motion based cues and multifaceted data sources like video sequences is one of the most promising future research areas.
(ii) Weakly Supervised Object Detection: Weakly supervised object detection models focus on utilizing a small set of completely annotated images for detecting a much larger number of non-annotated counterparts. Hence it is a significant problem for future studies how a limited proportion of annotated and labeled images with target objects and bounding boxes can be used to efficiently train a network while achieving high effectiveness.
(iii) Multi-Domain Object Detection: Area specific detectors always tend to perform better, achieving high detection accuracies on a predefined dataset. The future therefore lies in developing a universal object detector which is capable of detecting multi domain objects without any prior knowledge.
(iv) Salient Object Detection: This area aims at stressing salient object regions in an image, and it is applied to a broad spectrum of object detection applications in different areas. Salient object regions of importance in each frame can help to accurately detect objects in a continuous scene or video sequence. Hence, for important recognition and detection tasks, saliency guided object detection can be considered a preliminary process.
(v) Multi-task Learning: Amassing multi level features of backbone architectures can be considered a significant step for improving detection performance. Moreover, performing many computer vision tasks simultaneously, like object detection, semantic and instance segmentation etc., can improve performance by a large extent by using richer information. Efficiently combining multiple tasks in a model presents a series of challenges for researchers who wish to improve detection accuracy without compromising processing speed.
(vi) Unsupervised Object Detection: Developing automatic annotation techniques to get rid of manual annotation is an exciting and promising trend for unsupervised object detection, which makes it a future research direction for sharp detection tasks.
(vii) Remote Sensing Real Time Detection: Remote sensing images find their applications in both the military and agricultural domains. Automatic detection models and integrated hardware units will enhance the rapid development of these fields.
(viii) GAN based Object Detectors: Deep learning based object detectors often require huge amounts of data for training, whereas a GAN is an influential structure which can generate fake images. Combining real world scenarios with simulated data produced by GANs helps detectors grow more robust and obtain greater generalization capabilities.

The research in object detection with deep learning needs further study. We hope that deep learning based object detectors will make life changing contributions in the years to come.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Agrim Gupta, Piotr Dollar, Ross Girshick, LVIS: A dataset for large vocabulary instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5356–5364.
[2] Licheng Jiao, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, Rong Qu, A survey of deep learning-based object detection, IEEE Access 7 (2019) 128837–128868.
[3] Jensen Huang, Nvidia, Accelerating AI with GPUs: A new computing model, 2020, Retrieved on June 20, 2020 at 11:45 am, from URL https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/.
[4] Xiongwei Wu, Doyen Sahoo, Steven C.H. Hoi, Recent advances in deep learning for object detection, Neurocomputing (2020).
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected crfs, 2014, arXiv preprint arXiv:1412.7062.
[7] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[9] Yi Sun, Ding Liang, Xiaogang Wang, Xiaoou Tang, Deepid3: Face recognition with very deep neural networks, 2015, arXiv preprint arXiv:1502.00873.
[10] Yi Sun, Yuheng Chen, Xiaogang Wang, Xiaoou Tang, Deep learning face representation by joint identification-verification, in: Advances in Neural Information Processing Systems, 2014, pp. 1988–1996.
[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, Le Song, Sphereface: Deep hypersphere embedding for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
[12] Jianan Li, Xiaodan Liang, ShengMei Shen, Tingfa Xu, Jiashi Feng, Shuicheng Yan, Scale-aware fast R-CNN for pedestrian detection, IEEE Trans. Multimedia 20 (4) (2017) 985–996.
[13] Jan Hosang, Mohamed Omran, Rodrigo Benenson, Bernt Schiele, Taking a deeper look at pedestrians, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4073–4082.
[14] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit Ogale, Dave Ferguson, Real-time pedestrian detection with deep network cascades, 2015.
[15] Steven C.H. Hoi, Xiongwei Wu, Hantang Liu, Yue Wu, Huiqiong Wang, Hui Xue, Qiang Wu, Logo-net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks, 2015, arXiv preprint arXiv:1511.02462.
[16] Hang Su, Xiatian Zhu, Shaogang Gong, Deep learning logo detection with data expansion by synthesising context, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 530–539.
[17] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[18] Hossein Mobahi, Ronan Collobert, Jason Weston, Deep learning from temporal coherence in video, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 737–744.
[19] Ross B. Girshick, Fast R-CNN, 2015, CoRR arXiv:1504.08083.
[20] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[22] David G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, IEEE, 1999, pp. 1150–1157.
[23] Herbert Bay, Tinne Tuytelaars, Luc Van Gool, Surf: Speeded up robust features, in: European Conference on Computer Vision, Springer, 2006, pp. 404–417.
[24] Rainer Lienhart, Jochen Maydt, An extended set of haar-like features for rapid object detection, in: Proceedings. International Conference on Image Processing, vol. 1, IEEE, 2002, pp. I–I.
[25] Eleonora Vig, Michael Dorr, David Cox, Large-scale optimization of hierarchical features for saliency prediction in natural images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2798–2805.
[26] Navneet Dalal, Bill Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, IEEE, 2005, pp. 886–893.
[27] Yu Xiang, Wongun Choi, Yuanqing Lin, Silvio Savarese, Subcategory-aware convolutional neural networks for object proposals and detection, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 924–933.
[28] Yoav Freund, Robert E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: European Conference on Computational Learning Theory, Springer, 1995, pp. 23–37.
[29] Yoav Freund, Robert E. Schapire, et al., Experiments with a new boosting algorithm, in: Icml, vol. 96, Citeseer, 1996, pp. 148–156.
[30] David Opitz, Richard Maclin, Popular ensemble methods: An empirical study, J. Artif. Intell. Res. 11 (1999) 169–198.
[31] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1627–1645.
[32] Mark Everingham, Luc Van Gool, Christopher K.I. Williams, John Winn, Andrew Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results, 2007.
[33] Mark Everingham, Luc Van Gool, Christopher K.I. Williams, John Winn, Andrew Zisserman, The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[34] David G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[35] Timo Ojala, Matti Pietikainen, Topi Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[36] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[37] Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, Jinjian Wu, Feature-fused SSD: Fast detection for small objects, in: Ninth International Conference on Graphic and Image Processing (ICGIP 2017), vol. 10615, International Society for Optics and Photonics, 2018, p. 106151E.
[38] Subarna Tripathi, Gokce Dane, Byeongkeun Kang, Vasudev Bhaskaran, Truong Nguyen, LCDet: Low-complexity fully-convolutional neural networks for object detection in embedded systems, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 94–103.
[39] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[40] Zhenheng Yang, Ramakant Nevatia, A multi-scale cascade fully convolutional network face detector, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 633–638.
[41] Ngiam Jiquan, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, A.Y. Ng, Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), vol. 689696, 2011.
[42] Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, Shih-Fu Chang, Modeling multimodal clues in a hybrid deep learning framework for video classification, IEEE Trans. Multimed. 20 (11) (2018) 3137–3147.
[43] Denis Tomè, Federico Monti, Luca Baroffio, Luca Bondi, Marco Tagliasacchi, Stefano Tubaro, Deep convolutional neural networks for pedestrian detection, Signal Process., Image Commun. 47 (2016) 482–489.
[44] Zhong-Qiu Zhao, Haiman Bian, Donghui Hu, Wenjuan Cheng, Hervé Glotin, Pedestrian detection based on fast R-CNN and batch normalization, in: ICIC 2017: Intelligent Computing Theories and Application, Springer, 2017, pp. 735–746.
[45] Chen Zhang, Joohee Kim, Object detection with location-aware deformable convolution and backward attention filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9452–9461.
[46] Joseph Redmon, Ali Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[47] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg, Ssd: Single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[48] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[49] Ming-Hsuan Yang, David J. Kriegman, Narendra Ahuja, Detecting faces in images: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 24 (1) (2002) 34–58.
[50] Stefanos Zafeiriou, Cha Zhang, Zhengyou Zhang, A survey on face detection in the wild: past, present and future, Comput. Vis. Image Underst. 138 (2015) 1–24.
[51] Qixiang Ye, David Doermann, Text detection and recognition in imagery: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 37 (7) (2014) 1480–1500.
[52] Piotr Dollar, Christian Wojek, Bernt Schiele, Pietro Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2011) 743–761.
[53] Markus Enzweiler, Dariu M. Gavrila, Monocular pedestrian detection: Survey and experiments, IEEE Trans. Pattern Anal. Mach. Intell. 31 (12) (2008) 2179–2195.
[54] David Geronimo, Antonio M. Lopez, Angel D. Sappa, Thorsten Graf, Survey of pedestrian detection for advanced driver assistance systems, IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2009) 1239–1258.
[55] Zehang Sun, George Bebis, Ronald Miller, On-road vehicle detection: A review, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 694–711.
[56] Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang, Chao Gao, Object class detection: A survey, ACM Comput. Surv. 46 (1) (2013) 1–53.
[57] Licheng Jiao, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, Rong Qu, A survey of deep learning-based object detection, IEEE Access 7 (2019) 128837–128868.
[58] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, Xindong Wu, Object detection with deep learning: A review, IEEE Trans. Neural Netw. Learn. Syst. 30 (11) (2019) 3212–3232.
[59] Zehang Sun, George Bebis, Ronald Miller, On-road vehicle detection: A review, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 694–711.
[60] Jean Ponce, Martial Hebert, Cordelia Schmid, Andrew Zisserman, Toward Category-Level Object Recognition, vol. 4170, Springer, 2007.
[61] Sven J. Dickinson, Aleš Leonardis, Bernt Schiele, Michael J. Tarr, Object Categorization: Computer and Human Vision Perspectives, Cambridge University Press, 2009.
[62] Carolina Galleguillos, Serge Belongie, Context based object categorization: A critical survey, Comput. Vis. Image Underst. 114 (6) (2010) 712–722.
[63] Kristen Grauman, Bastian Leibe, Visual object recognition, Synth. Lect. Artif. Intell. Mach. Learn. 5 (2) (2011) 1–181.
[64] Alexander Andreopoulos, John K. Tsotsos, 50 years of object recognition: Directions forward, Comput. Vis. Image Underst. 117 (8) (2013) 827–891.
[65] Yoshua Bengio, Aaron Courville, Pascal Vincent, Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[66] Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang, Jia Li, Salient object detection: A survey, Comput. Vis. Media (2019) 1–34.
[67] Yali Li, Shengjin Wang, Qi Tian, Xiaoqing Ding, Feature representation for statistical-learning-based object detection: A review, Pattern Recognit. 48 (11) (2015) 3542–3559.
[68] Yann LeCun, Yoshua Bengio, Geoffrey Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[69] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. Van Der Laak, Bram Van Ginneken, Clara I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88.
[70] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al., Recent advances in convolutional neural networks, Pattern Recognit. 77 (2018) 354–377.
[71] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, Jieping Ye, Object detection in 20 years: A survey, 2019, arXiv preprint arXiv:1905.05055.
[72] Paul Viola, Michael Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, vol. 1, IEEE, 2001, pp. I–I.
[73] Paul Viola, Michael J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[74] Yoav Freund, Robert Schapire, Naoki Abe, A short introduction to boosting, J. Japanese Soc. Artif. Intell. 14 (771–780) (1999) 1612.
[75] Pedro Felzenszwalb, David McAllester, Deva Ramanan, A discriminatively trained, multiscale, deformable part model, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
[76] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, Cascade object detection with deformable part models, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2241–2248.
[77] Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros, Ensemble of exemplar-svms for object detection and beyond, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 89–96.
[78] Ross B. Girshick, Pedro F. Felzenszwalb, David A. Mcallester, Object detection with grammar models, in: Advances in Neural Information Processing Systems, 2011, pp. 442–450.
[79] Ross Brook Girshick, From Rigid Templates to Grammars: Object Detection with Structured Models, Citeseer, 2012.
[80] Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann, Support vector machines for multiple-instance learning, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15, MIT Press, 2003, pp. 577–584, http://papers.nips.cc/paper/2232-support-vector-machines-for-multiple-instance-learning.pdf.
[81] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, Yann LeCun, Overfeat: Integrated recognition, localization and detection using convolutional networks, 2013, arXiv preprint arXiv:1312.6229.
[82] Joseph Redmon, Ali Farhadi, Yolov3: An incremental improvement, 2018, arXiv preprint arXiv:1804.02767.
[83] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[84] Hei Law, Jia Deng, Cornernet: Detecting objects as paired keypoints, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
[85] Xingyi Zhou, Dequan Wang, Philipp Krähenbühl, Objects as points, 2019, arXiv preprint arXiv:1904.07850.
[86] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian, Centernet: Object detection with keypoint triplets, 2019, arXiv preprint arXiv:1904.08189 1(2), 4.
[87] Jasper R.R. Uijlings, Koen E.A. Van De Sande, Theo Gevers, Arnold W.M. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[88] Jim Kleban, Xing Xie, Wei-Ying Ma, Spatial pyramid mining for logo detection in natural scenes, in: 2008 IEEE International Conference on Multimedia and Expo, IEEE, 2008, pp. 1077–1080.
[89] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[90] C. Lawrence Zitnick, Piotr Dollar, Edge boxes: Locating object proposals from edges, in: European Conference on Computer Vision, Springer, 2014, pp. 391–405.
[91] Jifeng Dai, Yi Li, Kaiming He, Jian Sun, R-fcn: Object detection via region-based fully convolutional networks, in: Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[92] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[93] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, Yichen Wei, Relation networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
[94] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei, Deformable convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
[95] Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le, Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045.
[96] Bogdan Alexe, Thomas Deselaers, Vittorio Ferrari, Measuring the objectness of image windows, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2189–2202.
[97] Esa Rahtu, Juho Kannala, Matthew Blaschko, Learning a category independent object detection cascade, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 1052–1059.
[98] Santiago Manen, Matthieu Guillaumin, Luc Van Gool, Prime object proposals with randomized prim's algorithm, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2536–2543.
[99] Joao Carreira, Cristian Sminchisescu, CPMC: Automatic object segmentation using constrained parametric min-cuts, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2011) 1312–1328.
[100] Ian Endres, Derek Hoiem, Category-independent object proposals with diverse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2013) 222–234.
[101] Chenchen Zhu, Ran Tao, Khoa Luu, Marios Savvides, Seeing small faces from robust anchor's perspective, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5127–5136.
[102] Lele Xie, Yuliang Liu, Lianwen Jin, Zecheng Xie, DeRPN: Taking a further step toward more general object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9046–9053.
[103] Lachlan Tychsen-Smith, Lars Petersson, Denet: Scalable real-time object detection with directed sparse sampling, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 428–436.
[104] Chenchen Zhu, Yihui He, Marios Savvides, Feature selective anchor-free module for single-shot object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
[105] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian, Centernet: Keypoint triplets for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569–6578.
[106] Ross Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[107] Bharat Singh, Larry S. Davis, An analysis of scale invariance in object detection snip, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.
[108] Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, Nuno Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: European Conference on Computer Vision, Springer, 2016, pp. 354–370.
[109] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, Xiangyang Xue, Dsod: Learning deeply supervised object detectors from scratch, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1919–1927.
[110] Songtao Liu, Di Huang, et al., Receptive field block net for accurate and fast object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 385–400.
[111] Jimmy Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan, Yu-Wing Tai, Li Xu, Accurate single stage detector using recurrent rolling convolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5420–5428.
[112] Jisoo Jeong, Hyojin Park, Nojun Kwak, Enhancement of SSD by concatenating feature maps for object detection, 2017, arXiv preprint arXiv:1705.09587.
[113] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, Alexander C. Berg, Dssd: Deconvolutional single shot detector, 2017, arXiv preprint arXiv:1701.06659.
[114] Sanghyun Woo, Soonmin Hwang, In So Kweon, Stairnet: Top-down semantic aggregation for accurate one shot detection, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1093–1102.
[115] Hongyang Li, Yu Liu, Wanli Ouyang, Xiaogang Wang, Zoom out-and-in network with recursive training for object proposal, 2017, arXiv preprint arXiv:1702.05711.
[116] Tao Kong, Fuchun Sun, Chuanqi Tan, Huaping Liu, Wenbing Huang, Deep feature pyramid reconfiguration for object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 169–185.
[117] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, Haibin Ling, M2det: A single-shot object detector based on multi-level feature pyramid network, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9259–9266.
[118] Carolina Galleguillos, Serge Belongie, Context based object categorization: A critical survey, Comput. Vis. Image Underst. 114 (6) (2010) 712–722.
[119] Pedro Felzenszwalb, Ross Girshick, David McAllester, Deva Ramanan, Discriminatively trained mixtures of deformable part models, PASCAL VOC Challenge (2008).
[120] Wanli Ouyang, Xiaogang Wang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Chen-Change Loy, et al., Deepid-net: Deformable deep convolutional neural networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2403–2412.
[121] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai, Deformable convnets v2: More deformable, better results, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316.
[122] Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik, Deformable part models are convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 437–446.
[123] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun, Detnet: A backbone network for object detection, 2018, arXiv preprint arXiv:1804.06215.
[124] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[125] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv preprint arXiv:1704.04861.
[126] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
[127] François Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
[128] R.J. Wang, C.X. Ling, Pelee: A real-time object detection system on mobile devices, Adv. Neural Inf. Process. Syst. (2018) 1963–1972.
[129] Emmanuel J. Candès, Xiaodong Li, Yi Ma, John Wright, Robust principal component analysis?, J. ACM 58 (3) (2011) 1–37.
[130] Karen Simonyan, Andrew Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[131] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[132] Herbert Robbins, Sutton Monro, A stochastic approximation method, Ann. Math. Stat. (1951) 400–407.
[133] Diederik P. Kingma, Jimmy Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[134] Vinod Nair, Geoffrey E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[135] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Identity mappings in deep residual networks, in: European Conference on Computer Vision, Springer, 2016, pp. 630–645.
[136] Sergey Ioffe, Christian Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015, arXiv preprint arXiv:1502.03167.
[137] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, Kilian Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[138] Hanliang Jiang, Fei Gao, Xingxin Xu, Fei Huang, Suguo Zhu, Attentive and ensemble 3D dual path networks for pulmonary nodules classification, Neurocomputing (2019).
[139] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[140] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alexander A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[141] Alejandro Newell, Kaiyu Yang, Jia Deng, Stacked hourglass networks for human pose estimation, in: European Conference on Computer Vision, Springer, 2016, pp. 483–499.
[142] Bharat Singh, Mahyar Najibi, Larry S. Davis, SNIPER: Efficient multi-scale training, in: Advances in Neural Information Processing Systems, 2018, pp. 9310–9320.
[143] Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, Jifeng Dai, Learning region features for object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 381–395.
[144] Spyros Gidaris, Nikos Komodakis, Locnet: Improving localization accuracy for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 789–798.
[145] Sergey Zagoruyko, Adam Lerer, Tsung-Yi Lin, Pedro O. Pinheiro, Sam Gross, Soumith Chintala, Piotr Dollár, A multipath network for object detection, 2016, arXiv preprint arXiv:1604.02135.
[146] Lachlan Tychsen-Smith, Lars Petersson, Improving object localization with fitness nms and bounded iou loss, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6877–6885.
[147] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, Junjie Yan, Grid r-cnn, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7363–7372.
[148] Bin Yang, Junjie Yan, Zhen Lei, Stan Z. Li, Craft objects from images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 6043–6051.
[149] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, Thomas Huang, Revisiting rcnn: On awakening the classification power of faster rcnn, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 453–468.
[150] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative adversarial networks, in: Annual Conference on Neural Information Processing Systems (NeurIPS), 2014, pp. 2672–2680.
[151] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[152] Alec Radford, Luke Metz, Soumith Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2015, arXiv preprint arXiv:1511.06434.
[153] Andrew Brock, Jeff Donahue, Karen Simonyan, Large scale gan training for high fidelity natural image synthesis, 2018, arXiv preprint arXiv:1809.11096.
[154] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, Shuicheng Yan, Perceptual generative adversarial networks for small object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1222–1230.
[155] Xiaolong Wang, Abhinav Shrivastava, Abhinav Gupta, A-fast-rcnn: Hard positive generation via adversary for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2606–2615.
[156] Zhiqiang Shen, Honghui Shi, Jiahui Yu, Hai Phan, Rogerio Feris, Liangliang Cao, Ding Liu, Xinchao Wang, Thomas Huang, Marios Savvides, Improving object detection from scratch via gated feature reuse, 2017, arXiv preprint arXiv:1712.00886.
[157] Kaiming He, Ross Girshick, Piotr Dollár, Rethinking imagenet pre-training, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4918–4927.
[158] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, Distilling the knowledge in a neural network, 2015, arXiv preprint arXiv:1503.02531.
[159] Quanquan Li, Shengying Jin, Junjie Yan, Mimicking very efficient network for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6356–6364.
[160] Navaneeth Bodla, Bharat Singh, Rama Chellappa, Larry S. Davis, Soft-NMS: improving object detection with one line of code, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5561–5569.
[161] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al., Speed/accuracy trade-offs for modern convolutional object detectors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7310–7311.
[162] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun, Light-head r-cnn: In defense of two-stage object detector, 2017, arXiv preprint arXiv:1711.07264.
[163] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[164] Alexander Womg, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl, Tiny SSD: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection, in: 2018 15th Conference on Computer and Robot Vision, CRV, IEEE, 2018, pp. 95–101.
26 V. Sharma and R.N. Mir / Computer Science Review 38 (2020) 100301

[165] Yuxi Li, Jiuwei Li, Weiyao Lin, Jianguo Li, Tiny-dsod: Lightweight object [188] Buyu Li, Yu Liu, Xiaogang Wang, Gradient harmonized single-stage de-
detection for resource-restricted usages, 2018, arXiv preprint arXiv:1807. tector, in: Proceedings of the AAAI Conference on Artificial Intelligence,
11013. vol. 33, 2019, pp. 8577–8584.
[166] Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee, Understanding [189] Zhi Tian, Chunhua Shen, Hao Chen, Tong He, Fcos: Fully convolutional
and improving convolutional neural networks via concatenated rectified one-stage object detection, in: Proceedings of the IEEE International
linear units, in: International Conference on Machine Learning, 2016, Conference on Computer Vision, 2019, pp. 9627–9636.
pp. 2217–2225. [190] Xingyi Zhou, Jiacheng Zhuo, Philipp Krahenbuhl, Bottom-up object de-
[167] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, tection by grouping extreme and center points, in: Proceedings of the
Dongjun Shin, Compression of deep convolutional neural networks for IEEE Conference on Computer Vision and Pattern Recognition, 2019,
fast and low power mobile applications, 2015, arXiv preprint arXiv: pp. 850–859.
1511.06530. [191] Yousong Zhu, Chaoyang Zhao, Jinqiao Wang, Xu Zhao, Yi Wu, Hanqing Lu,
[168] Yihui He, Xiangyu Zhang, Jian Sun, Channel pruning for accelerating Couplenet: Coupling global structure with local parts for object detection,
very deep neural networks, in: Proceedings of the IEEE International in: Proceedings of the IEEE International Conference on Computer Vision,
Conference on Computer Vision, 2017, pp. 1389–1397. 2017, pp. 4126–4134.
[169] Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev, Compressing deep [192] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, Navaneeth Bodla, Rama
convolutional networks using vector quantization, 2014, arXiv preprint Chellappa, Deep regionlets for object detection, in: Proceedings of the
arXiv:1412.6115. European Conference on Computer Vision, ECCV, 2018, pp. 798–814.
[193] Zhe Chen, Shaoli Huang, Dacheng Tao, Context refinement for object
[170] Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J Dally, Deep gradient
detection, in: Proceedings of the European Conference on Computer
compression: Reducing the communication bandwidth for distributed
Vision, ECCV, 2018, pp. 71–86.
training, 2017, arXiv preprint arXiv:1712.01887.
[194] Zhaowei Cai, Nuno Vasconcelos, Cascade r-cnn: Delving into high quality
[171] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, Jian Cheng, Quan-
object detection, in: Proceedings of the IEEE Conference on Computer
tized convolutional neural networks for mobile devices, in: Proceedings of
Vision and Pattern Recognition, 2018, pp. 6154–6162.
the IEEE Conference on Computer Vision and Pattern Recognition, 2016,
[195] Yanghao Li, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang, Scale-aware
pp. 4820–4828.
trident networks for object detection, in: Proceedings of the IEEE
[172] Song Han, Huizi Mao, William J Dally, Deep compression: Compressing International Conference on Computer Vision, 2019, pp. 6054–6063.
deep neural networks with pruning, trained quantization and huffman [196] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, Imagenet:
coding, 2015, arXiv preprint arXiv:1510.00149. A large-scale hierarchical image database, in: 2009 IEEE Conference on
[173] Song Han, Jeff Pool, John Tran, William Dally, Learning both weights Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
and connections for efficient neural network, in: Advances in Neural [197] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin,
Information Processing Systems, 2015, pp. 1135–1143. Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom
[174] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, Stan Z. Li, Single- Duerig, et al., The open images dataset v4: Unified image classification,
shot refinement neural network for object detection, in: Proceedings of object detection, and visual relationship detection at scale, 2018, arXiv
the IEEE Conference on Computer Vision and Pattern Recognition, 2018, preprint arXiv:1811.00982.
pp. 4203–4212. [198] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, Qinghua Hu, Vision
[175] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, meets drones: A challenge, 2018, arXiv preprint arXiv:1804.07437.
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bern- [199] Andreas Ess, Bastian Leibe, Luc Van Gool, Depth and appearance for
stein, et al., Imagenet large scale visual recognition challenge, Int. J. mobile scene analysis, in: 2007 IEEE 11th International Conference on
Comput. Vis. 115 (3) (2015) 211–252. Computer Vision, IEEE, 2007, pp. 1–8.
[176] Saining Xie, Zhuowen Tu, Holistically-nested edge detection, in: Proceed- [200] Andreas Geiger, Philip Lenz, Christoph Stiller, Raquel Urtasun, Vision
ings of the IEEE International Conference on Computer Vision, 2015, pp. meets robotics: The kitti dataset, Int. J. Robot. Res. 32 (11) (2013)
1395–1403. 1231–1237.
[177] Mahyar Najibi, Mohammad Rastegari, Larry S. Davis, G-cnn: an iterative [201] Christian Wojek, Stefan Walk, Bernt Schiele, Multi-cue onboard pedes-
grid based object detector, in: Proceedings of the IEEE Conference on trian detection, in: 2009 IEEE Conference on Computer Vision and Pattern
Computer Vision and Pattern Recognition, 2016, pp. 2369–2377. Recognition, IEEE, 2009, pp. 794–801.
[178] Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee, [202] Andreas Geiger, Philip Lenz, Raquel Urtasun, Are we ready for au-
Improving object detection with deep convolutional networks via tonomous driving? The kitti vision benchmark suite, in: 2012 IEEE
bayesian optimization and structured prediction, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, IEEE, 2012,
IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3354–3361.
pp. 249–258. [203] Shanshan Zhang, Rodrigo Benenson, Bernt Schiele, Citypersons: A diverse
[179] Fan Yang, Wongun Choi, Yuanqing Lin, Exploit all the layers: Fast and dataset for pedestrian detection, in: Proceedings of the IEEE Conference
accurate cnn object detector with scale dependent pooling and cascaded on Computer Vision and Pattern Recognition, 2017, pp. 3213–3221.
rejection classifiers, in: Proceedings of the IEEE Conference on Computer [204] Xiaofei Li, Fabian Flohr, Yue Yang, Hui Xiong, Markus Braun, Shuyue Pan,
Vision and Pattern Recognition, 2016, pp. 2129–2137. Keqiang Li, Dariu M Gavrila, A new benchmark for vision-based cyclist
[180] Samarth Brahmbhatt, Henrik I. Christensen, James Hays, Stuffnet: Using detection, in: 2016 IEEE Intelligent Vehicles Symposium, IV, IEEE, 2016,
‘stuff’to improve object detection, in: 2017 IEEE Winter Conference on pp. 1028–1033.
Applications of Computer Vision, WACV, IEEE, 2017, pp. 934–943. [205] Vidit Jain, Erik Learned-Miller, Fddb: A Benchmark for Face Detection
in Unconstrained Settings, Technical Report, UMass Amherst Technical
[181] Shaoqing Ren, Kaiming He, Ross Girshick, Xiangyu Zhang, Jian Sun, Object
Report, 2010.
detection networks on convolutional feature maps, IEEE Trans. Pattern
[206] Shuo Yang, Ping Luo, Chen-Change Loy, Xiaoou Tang, Wider face: A
Anal. Mach. Intell. 39 (7) (2016) 1476–1481.
face detection benchmark, in: Proceedings of the IEEE Conference on
[182] Spyros Gidaris, Nikos Komodakis, Object detection via a multi-region and
Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
semantic segmentation-aware cnn model, in: Proceedings of the IEEE
[207] Ran He, Xiang Wu, Zhenan Sun, Tieniu Tan, Wasserstein cnn: Learning
International Conference on Computer Vision, 2015, pp. 1134–1142.
invariant features for nir-vis face recognition, IEEE Trans. Pattern Anal.
[183] Tao Kong, Anbang Yao, Yurong Chen, Fuchun Sun, Hypernet: Towards Mach. Intell. 41 (7) (2018) 1761–1773.
accurate region proposal generation and joint object detection, in: [208] Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, Hongsheng Li, Adacos:
Proceedings of the IEEE Conference on Computer Vision and Pattern Adaptively scaling cosine logits for effectively learning deep face repre-
Recognition, 2016, pp. 845–853. sentations, in: Proceedings of the IEEE Conference on Computer Vision
[184] Abhinav Shrivastava, Abhinav Gupta, Ross Girshick, Training region-based and Pattern Recognition, 2019, pp. 10823–10832.
object detectors with online hard example mining, in: Proceedings of [209] Yu Liu, Hongyang Li, Xiaogang Wang, Rethinking feature discrimination
the IEEE Conference on Computer Vision and Pattern Recognition, 2016, and polymerization for large-scale recognition, 2017, arXiv preprint arXiv:
pp. 761–769. 1710.00870.
[185] Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick, Inside-outside [210] Rajeev Ranjan, Carlos D Castillo, Rama Chellappa, L2-constrained softmax
net: Detecting objects in context with skip pooling and recurrent neural loss for discriminative face verification, 2017, arXiv preprint arXiv:1703.
networks, in: Proceedings of the IEEE Conference on Computer Vision and 09507.
Pattern Recognition, 2016, pp. 2874–2883. [211] Feng Wang, Xiang Xiang, Jian Cheng, Alan Loddon Yuille, Normface: L2
[186] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, Jieping Ye, Object detection in hypersphere embedding for face verification, in: Proceedings of the 25th
20 years: A survey, 2019, arXiv preprint arXiv:1905.05055. ACM International Conference on Multimedia, 2017, pp. 1041–1049.
[187] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, Yi Xu, Scale-transferrable [212] Yuwei Guo, Licheng Jiao, Shuang Wang, Shuo Wang, Fang Liu, Fuzzy
object detection, in: Proceedings of the IEEE Conference on Computer sparse autoencoder framework for single image per person face
Vision and Pattern Recognition, 2018, pp. 528–537. recognition, IEEE Trans. Cybern. 48 (8) (2017) 2402–2415.
V. Sharma and R.N. Mir / Computer Science Review 38 (2020) 100301 27

[213] Markus Braun, Sebastian Krebs, Fabian Flohr, Dariu M. Gavrila, Eurocity [237] Xianjie Chen, Alan L Yuille, Articulated pose estimation by a graphical
persons: A novel benchmark for person detection in traffic scenes, IEEE model with image dependent pairwise relations, in: Advances in Neural
Trans. Pattern Anal. Mach. Intell. 41 (8) (2019) 1844–1861. Information Processing Systems, 2014, pp. 1736–1744.
[214] Zhaowei Cai, Mohammad Javad Saberian, Nuno Vasconcelos, Learning [238] Xiaochuan Fan, Kang Zheng, Yuewei Lin, Song Wang, Combining local
complexity-aware cascades for pedestrian detection, IEEE Trans. Pattern appearance and holistic view: Dual-source deep neural networks for
Anal. Mach. Intell. (2019). human pose estimation, in: Proceedings of the IEEE Conference on
[215] Mohammad Javad Saberian, Nuno Vasconcelos, Learning optimal em- Computer Vision and Pattern Recognition, 2015, pp. 1347–1355.
bedded cascades, IEEE Trans. Pattern Anal. Mach. Intell. 34 (10) (2012) [239] Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid, Lcr-net++: Multi-
2005–2018. person 2d and 3d pose detection in natural images, IEEE Trans. Pattern
[216] Piotr Dollár, Ron Appel, Serge Belongie, Pietro Perona, Fast feature Anal. Mach. Intell. 42 (5) (2019) 1146–1161.
pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 36 [240] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu,
(8) (2014) 1532–1545. Jian Sun, Cascaded pyramid network for multi-person pose estimation,
[217] Song Liu, Makoto Yamada, Nigel Collier, Masashi Sugiyama, Change-point in: Proceedings of the IEEE Conference on Computer Vision and Pattern
detection in time-series data by relative density-ratio estimation, Neural Recognition, 2018, pp. 7103–7112.
Netw. 43 (2013) 72–83. [241] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev,
[218] Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Jonathan Tompson, Chris Bregler, Kevin Murphy, Towards accurate
Boedihardjo, Crystal Chen, Susan Frankenstein, Grammarviz 3.0: Interac- multi-person pose estimation in the wild, in: Proceedings of the
tive discovery of variable-length time series patterns, ACM Trans. Knowl. IEEE Conference on Computer Vision and Pattern Recognition, 2017,
Dis. Data (TKDD) 12 (1) (2018) 1–28. pp. 4903–4911.
[219] Meng Jiang, Alex Beutel, Peng Cui, Bryan Hooi, Shiqiang Yang, Christos [242] Bin Xiao, Haiping Wu, Yichen Wei, Simple baselines for human pose
Faloutsos, A general suspiciousness metric for dense blocks in multimodal estimation and tracking, in: Proceedings of the European Conference on
data, in: 2015 IEEE International Conference on Data Mining, IEEE, 2015, Computer Vision, ECCV, 2018, pp. 466–481.
pp. 781–786. [243] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh, Con-
[220] Elizabeth Wu, Wei Liu, Sanjay Chawla, Spatio-temporal outlier detection volutional pose machines, in: Proceedings of the IEEE Conference on
in precipitation data, in: International Workshop on Knowledge Discovery Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
from Sensor Data, Springer, 2008, pp. 115–133. [244] Zhuoling Li, Minghui Dong, Shiping Wen, Xiang Hu, Pan Zhou, Zhigang
[221] Björn Barz, Erik Rodner, Yanira Guanche Garcia, Joachim Denzler, Detect- Zeng, CLU-CNNs: Object detection for medical images, Neurocomputing
ing regions of maximal divergence for spatio-temporal anomaly detection, 350 (2019) 53–59.
IEEE Trans. Pattern Anal. Mach. Intell. 41 (5) (2018) 1088–1101. [245] Zhenguo Yang, Qing Li, Liu Wenyin, Jianming Lv, Shared multi-view data
[222] Gong Cheng, Junwei Han, A survey on object detection in optical remote representation for multi-domain event detection, IEEE Trans. Pattern Anal.
sensing images, ISPRS J. Photogramm. Remote Sens. 117 (2016) 11–28. Mach. Intell. (2019).
[223] Palaiahnakote Shivakumara, Dongqi Tang, Maryam Asadzadehkaljahi, [246] Yanxiang Wang, Hari Sundaram, Lexing Xie, Social event detection with
Tong Lu, Umapada Pal, Mohammad Hossein Anisi, CNN-RNN based interaction graph modeling, in: Proceedings of the 20th ACM International
method for license plate recognition, CAAI Trans. Intell. Technol. 3 (3) Conference on Multimedia, 2012, pp. 865–868.
(2018) 169–175.
[247] Manos Schinas, Symeon Papadopoulos, Georgios Petkos, Yiannis Kompat-
[224] Muhammad Sarfraz, Mohammed Jameel Ahmed, An approach to license
siaris, Pericles A. Mitkas, Multimodal graph-based event detection and
plate recognition system using neural network, in: Exploring Critical
summarization in social media streams, in: Proceedings of the 23rd ACM
Approaches of Evolutionary Computation, IGI Global, 2019, pp. 20–36.
International Conference on Multimedia, 2015, pp. 189–192.
[225] Hui Li, Peng Wang, Chunhua Shen, Toward end-to-end car license plate
[248] Olivier Teboul, Iasonas Kokkinos, Loic Simon, Panagiotis Koutsourakis,
detection and recognition with deep neural networks, IEEE Trans. Intell.
Nikos Paragios, Shape grammar parsing via reinforcement learning, in:
Transp. Syst. 20 (3) (2018) 1126–1136.
CVPR 2011, IEEE, 2011, pp. 2273–2280.
[226] Jinxing Qian, Bo Qu, Fast license plate recognition method based on
[249] Peng Zhao, Tian Fang, Jianxiong Xiao, Honghui Zhang, Qinping Zhao,
competitive neural network, in: 2018 3rd International Conference on
Long Quan, Rectilinear parsing of architecture in urban environment, in:
Communications, Information Management and Network Security, CIMNS
2010 IEEE Computer Society Conference on Computer Vision and Pattern
2018, Atlantis Press, 2018.
Recognition, IEEE, 2010, pp. 342–349.
[227] Rayson Laroca, Evair Severo, Luiz A Zanlorensi, Luiz S Oliveira, Gabriel Re-
[250] Sam Friedman, Ioannis Stamos, Online detection of repeated structures
sende Gonçalves, William Robson Schwartz, David Menotti, A robust
in point clouds of urban scenes for compression and registration, Int. J.
real-time automatic license plate recognition based on the YOLO detector,
Comput. Vis. 102 (1–3) (2013) 112–128.
in: 2018 International Joint Conference on Neural Networks, IJCNN, IEEE,
[251] Chao-Hui Shen, Shi-Sheng Huang, Hongbo Fu, Shi-Min Hu, Adaptive
2018, pp. 1–10.
partitioning of urban facades, ACM Trans. Graph. 30 (6) (2011) 1–10.
[228] Xibin Song, Peng Wang, Dingfu Zhou, Rui Zhu, Chenye Guan, Yuchao Dai,
Hao Su, Hongdong Li, Ruigang Yang, Apollocar3d: A large 3d car instance [252] Grant Schindler, Panchapagesan Krishnamurthy, Roberto Lublinerman,
understanding benchmark for autonomous driving, in: Proceedings of Yanxi Liu, Frank Dellaert, Detecting and matching repeated patterns for
the IEEE Conference on Computer Vision and Pattern Recognition, 2019, automatic geo-tagging in urban environments, in: 2008 IEEE Conference
pp. 5452–5462. on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–7.
[229] Koyel Banerjee, Dominik Notz, Johannes Windelen, Sumanth Gavarraju, [253] Changchang Wu, Jan-Michael Frahm, Marc Pollefeys, Detecting large
Mingkang He, Online camera lidar fusion and object detection on hy- repetitive structures with salient boundaries, in: European Conference on
brid data for autonomous driving, in: 2018 IEEE Intelligent Vehicles Computer Vision, Springer, 2010, pp. 142–155.
Symposium, IV, IEEE, 2018, pp. 1632–1638. [254] Chao-Hui Shen, Shi-Sheng Huang, Hongbo Fu, Shi-Min Hu, Image-based
[230] Jia Li, Zengfu Wang, Real-time traffic sign recognition based on efficient procedural modeling of facades, ACM Trans. Graph. 26 (2007) 85–95.
CNNs in the wild, IEEE Trans. Intell. Transp. Syst. 20 (3) (2018) 975–984. [255] Olga Barinova, Victor Lempitsky, Elena Tretiak, Pushmeet Kohli, Geomet-
[231] T. Arinaga T. Moritani, Traffic sign recognition system, US Patent 9, 865, ric image parsing in man-made environments, in: European Conference
165, 2018. on Computer Vision, Springer, 2010, pp. 57–70.
[232] Sara Khalid, Nazeer Muhammad, Muhammad Sharif, Automatic measure- [256] Mateusz Kozinski, Raghudeep Gadde, Sergey Zagoruyko, Guillaume
ment of the traffic sign with digital segmentation and recognition, IET Obozinski, Renaud Marlet, A MRF shape prior for facade parsing with
Intell. Transp. Syst. 13 (2) (2018) 269–279. occlusions, in: Proceedings of the IEEE Conference on Computer Vision
[233] A. lvarez Garcıa, J.A.A. Arcos-Garcıa, L.M. Soria-Morillo, Deep neural and Pattern Recognition, 2015, pp. 2820–2828.
network for traffic sign recognition systems: An analysis of spatial trans- [257] Andrea Cohen, Alexander G. Schwing, Marc Pollefeys, Efficient structured
formers and stochastic optimisation methods, Neural Netw. 99 (2018) parsing of facades using dynamic programming, in: Proceedings of the
158–165. IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp.
[234] Dong Li, Dongbin Zhao, Yaran Chen, Qichao Zhang, Deepsign: Deep learn- 3206–3213.
ing based traffic sign recognition, in: 2018 International Joint Conference [258] Silvia Gandy, Benjamin Recht, Isao Yamada, Tensor completion and low-
on Neural Networks, IJCNN, IEEE, 2018, pp. 1–6. n-rank tensor recovery via convex optimization, Inverse Problems 27 (2)
[235] Bo-Xun Wu, Pin-Yu Wang, Yi-Ta Yang, Jiun-In Guo, Traffic sign recog- (2011) 025010.
nition with light convolutional networks, in: 2018 IEEE International [259] Ji Liu, Przemyslaw Musialski, Peter Wonka, Jieping Ye, Tensor completion
Conference on Consumer Electronics-Taiwan, ICCE-TW, IEEE, 2018, for estimating missing values in visual data, IEEE Trans. Pattern Anal.
pp. 1–2. Mach. Intell. 35 (1) (2012) 208–220.
[236] Shuren Zhou, Wenlong Liang, Junguo Li, Jeong-Uk Kim, Improved VGG [260] Juan Liu, Emmanouil Z. Psarakis, Yang Feng, Ioannis Stamos, A kronecker
model for road traffic sign recognition, Comput. Mater. Continua 57 (1) product model for repeated pattern detection on 2d urban images, IEEE
(2018) 11–24. Trans. Pattern Anal. Mach. Intell. 41 (9) (2018) 2266–2272.
28 V. Sharma and R.N. Mir / Computer Science Review 38 (2020) 100301

[261] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show [284] Xiangteng He, Yuxin Peng, Junjie Zhao, Fine-grained discriminative local-
and tell: A neural image caption generator, in: Proceedings of the ization via saliency-guided faster R-CNN, in: Proceedings of the 25th ACM
IEEE Conference on Computer Vision and Pattern Recognition, 2015, International Conference on Multimedia, 2017, pp. 627–635.
pp. 3156–3164. [285] Xiangteng He, Yuxin Peng, Junjie Zhao, Fast fine-grained image classifica-
[262] Jiuxiang Gu, Jianfei Cai, Gang Wang, Tsuhan Chen, Stack-captioning: tion via weakly supervised discriminative localization, IEEE Trans. Circuits
Coarse-to-fine learning for image captioning, in: Thirty-Second AAAI Syst. Video Technol. 29 (5) (2018) 1394–1407.
Conference on Artificial Intelligence, 2018. [286] Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, Tiejun Huang,
[263] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Bi-directional cascade network for perceptual edge detection, in: Proceed-
Ruslan Salakhudinov, Rich Zemel, Yoshua Bengio, Show, attend and tell: ings of the IEEE Conference on Computer Vision and Pattern Recognition,
Neural image caption generation with visual attention, in: International 2019, pp. 3828–3837.
Conference on Machine Learning, 2015, pp. 2048–2057.
Glossary

Average Precision: It is a popular metric for measuring the accuracy of object detectors such as Faster R-CNN. It is computed as the average of the precision values obtained over recall values ranging from 0 to 1; averaging AP over all object classes gives the commonly reported mAP.
Precision: It measures how accurate the predictions are, i.e., the fraction of predicted detections that are correct.
Recall: It measures how completely the positives are found, i.e., the fraction of all positive cases that are detected. For example, a detector may recover 60% of all the positive cases among its total of M predictions.
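To make the three metrics above concrete, the following minimal Python sketch (an illustration added here, not a method from the surveyed literature) computes the precision/recall curve and an 11-point interpolated AP in the PASCAL VOC style from a ranked list of detections; `ranked_hits` and `num_gt` are hypothetical inputs.

```python
import numpy as np

def precision_recall_ap(ranked_hits, num_gt):
    """Precision/recall curve and 11-point interpolated AP (VOC style).

    ranked_hits: binary sequence of detections sorted by descending score,
                 where 1 = matched a ground-truth box and 0 = false positive.
    num_gt:      total number of ground-truth objects (assumed > 0).
    """
    hits = np.asarray(ranked_hits, dtype=float)
    tp = np.cumsum(hits)            # true positives accumulated down the ranking
    fp = np.cumsum(1.0 - hits)      # false positives accumulated down the ranking
    precision = tp / (tp + fp)      # how accurate the predictions are so far
    recall = tp / num_gt            # how many positives have been found so far
    # Average the best achievable precision at recall levels 0.0, 0.1, ..., 1.0.
    ap = np.mean([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                  for r in np.linspace(0.0, 1.0, 11)])
    return precision, recall, ap

# Example: 5 ranked detections, 3 of them correct, against 4 ground-truth objects.
prec, rec, ap = precision_recall_ap([1, 1, 0, 1, 0], num_gt=4)
print(prec, rec, ap)  # AP is about 0.68 here
```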
IoU: It stands for intersection over union. It is used for measuring the overlap between two boundaries, and in detection it is particularly used to measure how much the predicted bounding box overlaps with the ground truth.
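For concreteness, a minimal sketch (ours, assuming axis-aligned boxes in (x1, y1, x2, y2) corner format) of the IoU computation:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if the boxes do not overlap).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, roughly 0.143
```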
NMS: It stands for non maximum suppression. It is a technique used for filtering out redundant region proposals based on some criterion, typically by keeping the highest-scoring box and discarding overlapping lower-scoring ones.
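A minimal sketch of the greedy NMS procedure just described, reusing the illustrative `iou` helper from the previous snippet; detectors typically apply it per class with an IoU threshold around 0.5:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and discard
    every remaining box whose IoU with it exceeds iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate second box is dropped
```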
FPS: It stands for frames per second. It is the frequency rate at which consecutive images appear on the display, and it is the usual measure of a detector's runtime speed.
DCN: It stands for deformable convolution network. It consists of two parts: a regular convolution layer and another layer that learns a 2D offset for each input location.
SIFT: It stands for scale invariant feature transform. It is a feature detection technique in computer vision used to detect and describe local image features.
MSCOCO: It stands for Microsoft common objects in context. It is a dataset containing images of 91 object types, with a total of 2.5 million labeled instances in 328K images.
SURF: It stands for speeded up robust features. It is a local feature detector and descriptor which can be used for tasks such as object recognition, 3D reconstruction and classification; its feature descriptor is based on the summation of Haar wavelet responses around points of interest.
NAS-FPN: It stands for neural architecture search feature pyramid network. It consists of a combination of top-down and bottom-up connections to fuse features across different scales.
HOG: It stands for histogram of oriented gradients. It is a feature descriptor used in image processing and computer vision for the task of object detection; it counts occurrences of gradient orientations in localized portions of an image.
ResNet: It stands for residual network. Its shortcut connections make it possible to train networks up to hundreds or even thousands of layers deep and still achieve remarkable performance.
SVM: It stands for support vector machine. It is a supervised machine learning model which makes use of classification algorithms for two-group classification problems.
ReLU: It stands for rectified linear unit. It is one of the most commonly used activation functions in deep learning models: a value of 0 is returned if the function receives a negative input, while a positive input x is returned unchanged, i.e., f(x) = max(0, x).
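A vectorized one-liner for this activation (illustrative only):

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit: max(0, x)."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```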
DPM: It stands for deformable part model. It is a classical learning-based object detector that represents an object category as a coarse root filter together with a set of deformable part filters.
VGG: It stands for visual geometry group. It is an influential object recognition model which supports up to 19 weight layers; instead of using large receptive fields as in AlexNet, VGG uses very small 3 x 3 receptive fields with a stride of 1.
ROI Pooling: It stands for region of interest pooling. It allows a single feature map to be reused for all the proposals generated by the RPN in a single pass, and it removes the fixed input-size requirement by pooling every proposal down to a fixed spatial extent (see the sketch below).
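A simplified single-channel sketch of RoI max pooling (our illustration; real implementations also handle batches, channels and sub-pixel bin boundaries); the RoI is assumed to be at least as large as the output grid:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool one region of interest down to a fixed out_size grid.
    feature_map: (H, W) array; roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    # Split the RoI into a fixed number of roughly equal bins per axis.
    h_bins = np.array_split(np.arange(region.shape[0]), out_size[0])
    w_bins = np.array_split(np.arange(region.shape[1]), out_size[1])
    out = np.empty(out_size)
    for i, hb in enumerate(h_bins):
        for j, wb in enumerate(w_bins):
            out[i, j] = region[np.ix_(hb, wb)].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(roi_pool(fmap, (0, 0, 4, 4)))  # any RoI yields a fixed 2 x 2 output
```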
SGD: It stands for stochastic gradient descent. It is an iterative technique for optimizing an objective function with suitable smoothness properties; each update steps the parameters against a gradient estimated on a randomly sampled mini-batch of the training data (see the sketch below).
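A toy sketch of the update rule θ ← θ − η∇L(θ) (our illustration; `grad_fn` is a hypothetical stand-in for a mini-batch gradient estimate):

```python
import numpy as np

def sgd_step(params, grad_fn, lr=0.1):
    """One SGD update: move the parameters against the (estimated) gradient."""
    return params - lr * grad_fn(params)

# Toy problem: minimize f(w) = ||w||^2, whose exact gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, lambda p: 2.0 * p)
print(w)  # converges toward the minimizer [0, 0]
```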
CNN: It stands for convolutional neural network. It is a deep learning architecture which can take an input image, assign importance to various aspects of the image, and differentiate one object from another.
DPN: It stands for dual path network. It is an image classification model which combines the advantages of both ResNet and DenseNet, and it outperforms them on the image classification task.
R-CNN: It stands for region-based convolutional neural network. Instead of trying to classify a huge number of candidate windows, it works with roughly 2000 region proposals per image, thereby minimizing the total number of computations required.
RPN: It stands for region proposal network. It is a neural network used for generating proposals for object detection in Faster R-CNN.
R-FCN: It stands for region-based fully convolutional network. It is used for accurate and efficient object detection; it is a region-based detector that is fully convolutional, with almost all of the computation shared over the entire image.
FPN: It stands for feature pyramid network. It is a top-down architecture with lateral connections developed for building high-level semantic feature maps at all scales, and it has achieved remarkable improvements as a generic feature extractor in many applications.
GAN: It stands for generative adversarial network. Given a training dataset, it learns to generate new data with the same statistics as the training set.
SPP-Net: It stands for spatial pyramid pooling network. It adds a new layer between the convolutional layers and the fully connected layer to map an input of any size down to a fixed-size output.
DSOD: It stands for deeply supervised object detector. It is a framework which can learn an object detector from scratch.
YOLO: It stands for you only look once. It is a state-of-the-art real-time single-stage object detector which processes images at about 30 fps and achieves an mAP of 57.9% on the COCO test-dev dataset.
ILSVRC: It stands for imagenet large scale visual recognition challenge. It is an annual computer vision competition built on a subset of the publicly available ImageNet dataset, and it evaluates algorithms for object detection and image classification at large scale.
SSD: It stands for single shot multibox detector. It is a method for detecting objects in images using a single deep neural network; it discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
LVIS: It stands for large vocabulary instance segmentation. It is a large-scale dataset comprising about 2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164K images.
VOC: It stands for visual object classes. It is a project which provides standardized image datasets for object class recognition, together with a common set of tools for accessing the datasets and annotations; it ran challenges evaluating performance on object class recognition from 2005 to 2012.
M-DOD: It stands for multi-domain object detection. It is a technique for universally detecting objects from multiple domains without prior knowledge of the domain.