A Comprehensive and Systematic Look Up Into Deep Learning Based Object Detection Techniques - A Review
Review article
Article info

Article history:
Received 12 May 2020
Received in revised form 20 July 2020
Accepted 26 August 2020
Available online 11 September 2020

Keywords:
Convolutional neural networks
Object detection
Localization
Segmentation
Classification
Deep learning

Abstract

Object detection can be regarded as one of the most fundamental and challenging visual recognition tasks in computer vision, and it has received great attention over the past few decades. Object detection techniques find application in almost all spheres of life, the most prominent being surveillance, autonomous driving, pedestrian detection and so on. The primary focus of visual object detection is to detect objects belonging to certain target classes with precise localization in a realistic scene or an input image, and also to assign each detected instance of an object a predefined class label. Owing to the rapid development of deep neural networks, the performance of object detectors has rapidly improved, and as a result deep learning based detection techniques have been actively studied over the past several years. In this paper we provide a comprehensive survey of the latest advances in deep learning based visual object detection. First, we review a large body of recent work in the literature and use it to analyze traditional and current object detectors. Afterwards, and primarily, we provide a rigorous overview of backbone architectures for object detection, followed by a systematic coverage of current learning strategies. Some popular datasets and metrics used for object detection are analyzed as well. Finally, we discuss applications of object detection and provide several future directions to facilitate future research on visual object detection with deep learning.
© 2020 Elsevier Inc. All rights reserved.
Contents
1. Introduction
1.1. Summary of previous survey papers on object detection
1.2. Contributions of this survey
1.3. Article organization
2. Traditional detectors vs. state-of-the-art detectors
2.1. Traditional object detectors
2.1.1. Viola jones detectors
2.1.2. Histogram of oriented gradients (HOG) detector
2.1.3. Deformable part based model (DPM)
2.2. Deep learning based object detectors
2.2.1. One stage object detectors
2.2.2. Two stage detectors
2.2.3. Latest object detectors
3. Proposal generation and feature representation learning for object detection
3.1. Proposal generation
3.1.1. Traditional proposal generation techniques
3.1.2. Anchor based proposal generation methods
3.1.3. Keypoint based proposal generation methods
3.2. Feature representation learning for object detection
3.2.1. Multi-scale feature representation learning
3.2.2. Contextual reasoning
3.2.3. Deformable feature learning
∗ Corresponding author.
E-mail addresses: [email protected] (V. Sharma), [email protected] (R.N. Mir).
https://doi.org/10.1016/j.cosrev.2020.100301
1574-0137/© 2020 Elsevier Inc. All rights reserved.
2 V. Sharma and R.N. Mir / Computer Science Review 38 (2020) 100301
Table 1
List of acronyms used in the paper.
Acronym | Full form | Acronym | Full form
mAP | Mean Average Precision | NMS | Non Maximum Suppression
FPS | Frames Per Second | DCN | Deformable Convolutional Network
SIFT | Scale-Invariant Feature Transform | MSCOCO | Microsoft Common Objects in Context
SURF | Speeded Up Robust Features | NAS-FPN | Neural Architecture Search Feature Pyramid Network
HOG | Histogram of Oriented Gradients | ResNet | Residual Network
SVM | Support Vector Machine | ReLU | Rectified Linear Unit
DPM | Deformable Parts Model | VGG | Visual Geometry Group
ROI | Region of Interest | SGD | Stochastic Gradient Descent
CNN | Convolutional Neural Network | DPN | Dual Path Network
R-CNN | Region-based Convolutional Neural Network | RPN | Region Proposal Network
R-FCN | Region-based Fully Convolutional Network | CRAFT | Cascade Region Proposal Network and Fast-RCNN
FPN | Feature Pyramid Network | GAN | Generative Adversarial Network
SPP-net | Spatial Pyramid Pooling Network | DSOD | Deeply Supervised Object Detectors
YOLO | You Only Look Once | ILSVRC | ImageNet Large Scale Visual Recognition Challenge
SSD | Single Shot Multibox Detector | LVIS | Large Vocabulary Instance Segmentation
VOC | Visual Object Classes | M-DOD | Multi-Domain Object Detection
PAMI | Pattern Analysis and Machine Intelligence | CVIU | Computer Vision and Image Understanding
ACM CS | Association for Computing Machinery Computing Surveys | PR | Pattern Recognition
LIDAR | Light Detection and Ranging | |
illumination and rotation level variance. However, due to the varied appearances of objects, different backgrounds and poor illumination conditions, it is very difficult to manually design a descriptor for all kinds of objects. Once feature vectors are in hand, a classifier is needed to separate a target object category from all the other object classes so as to make the feature representation more informative for visual recognition. Support vector machines are commonly considered a good choice of classifier because of their strong performance on small-scale training data. Apart from SVM, Adaboost [28,29], bagging [30] and DPM [31] can also be considered viable options, leading to further improvement in detection accuracy.

Traditional object detection methods aim at carefully designing feature descriptors to obtain an embedding for each region of interest (ROI). Owing to these advancements in feature vector representations and classification models, highly impressive results were obtained on the PASCAL VOC datasets [32,33]. However, from 2007 to 2012 only marginal gains were obtained, by building ensemble models and using light variants of these successful traditional models. This is due to the following factors.

• During the region proposal stage, a large number of redundant bounding boxes are generated by the multi-window sliding strategy, which leads to a large number of false positives at classification time.
• All three stages of the object detection pipeline are designed and optimized separately, and hence a globally optimal solution for the entire setup is hard to obtain.
• It is very hard to bridge the semantic gap using manually crafted low level feature descriptors [34–36].

With the emergence of deep convolutional neural networks [37] and their widespread application in image classification [5,38], object detection techniques based on deep learning [6,21] have achieved significant progress in recent years and have outperformed the traditional detectors. Compared to the manually crafted feature descriptors employed by traditional object detectors, deep CNN based object detectors generate hierarchical feature representations that are learned automatically and show more discriminative expressive power.

The advantages of CNN based methods over traditional object detectors are summarized below:

• In contrast to shallow traditional models, a deep CNN based model provides exponentially increased expressive power.
• CNN based object detectors provide an opportunity to optimize several detection related tasks together.
• In CNN based object detectors, hierarchical feature vectors/representations can be extracted automatically from the underlying data and can be disentangled through multi-level non-linear mappings.

All these advantages motivated the design and development of deep learning based object detection techniques with expressive feature representation capability that can be optimized in an end-to-end manner. These properties have further enabled CNNs to be applied in many research domains like image classification [39], face recognition [40], video analysis [41,42] and pedestrian detection [27,43,44].

Recent deep learning based object detection architectures can be broadly divided into two categories.

• Single stage detectors like YOLO [20,45] and its different variants [46,47].
• Two stage detectors like R-CNN [7] and its variants [19,21,48].

Single stage detectors directly predict the classes of the objects present at each location of the extracted feature maps, without a separate region classification step, whereas in two stage detectors a proposal generator is used to generate a coarse set of region proposals, from which feature vectors are extracted and then used by region classifiers to predict the object category. Compared to single stage detectors, two stage detectors generally achieve better detection performance and generate state-of-the-art results on publicly available datasets. However, single stage detectors are comparatively more time efficient and hence are well suited to real time object detection [2,4]. Fig. 2 illustrates the major developments and milestones of deep learning based object detection techniques.

1.1. Summary of previous survey papers on object detection

Many prominent survey papers on object detection have been published over the years, as summarized in Table 2. These include many admirable reviews of specific object detection problems like face detection [49,50], text detection [51], pedestrian detection [52–54], vehicle detection [55] etc. There are relatively few survey papers which directly focus on deep learning based generic object detection techniques, except for Zhang et al. [56], who conducted an object class detection survey in 2013, Jiao et al. [57], whose 2019 survey focuses on describing and analyzing the deep learning based object detection task, and Zhao et al. [58], who provided a systematic review to summarize representative models and
Table 2
Summary of previous survey papers on object detection.
S.No | Title of survey | References | Venue | Year | Content
1 | Detecting faces in images: a survey | Yang et al. [49] | PAMI | 2002 | First survey of face detection from a single image
2 | On road vehicle detection: a review | Sun et al. [59] | PAMI | 2006 | A review of computer vision based on-road vehicle detection systems
3 | Toward category level object recognition | Ponce et al. [60] | Book | 2007 | Survey papers on object classification, detection, and segmentation
4 | The evolution of object categorization and the challenge of image abstraction | Dickinson et al. [61] | Book | 2009 | An outline of the evolution of object categorization over 4 decades
5 | Monocular pedestrian detection: survey and experiments | Enzweiler and Gavrila [53] | PAMI | 2010 | An assessment of three pedestrian detection techniques
6 | Survey of pedestrian detection for advanced driver assistance systems | Geronimo et al. [54] | PAMI | 2009 | A thorough review of pedestrian detection for advanced driver assistance systems
7 | Context based object categorization: a critical survey | Galleguillos and Belongie [62] | CVIU | 2010 | A review of contextual information for object categorization
8 | Visual object recognition | Grauman and Leibe [63] | Tutorial | 2011 | Object recognition techniques based on instance and category
9 | Pedestrian detection: an evaluation of the state of the art | Dollar et al. [52] | PAMI | 2012 | A detailed assessment of detectors in monocular images
10 | 50 years of object recognition: directions forward | Andreopoulos and Tsotsos [64] | CVIU | 2013 | A review of the evolution of object recognition systems over 5 decades
11 | Object class detection: a survey | Zhang et al. [56] | ACM CS | 2013 | Survey of generic object detection methods before 2011
12 | Representation learning: a review and new perspectives | Bengio et al. [65] | PAMI | 2013 | Unsupervised feature learning and deep learning, probabilistic models, autoencoders, manifold learning, and deep networks
13 | Salient object detection: a survey | Borji et al. [66] | arXiv | 2014 | A survey of salient object detection techniques
14 | A survey on face detection in the wild: past, present and future | Zafeiriou et al. [50] | CVIU | 2015 | A survey of face detection techniques in the wild since 2000
15 | Text detection and recognition in imagery: a survey | Ye and Doermann [51] | PAMI | 2015 | A thorough review of text detection and recognition in colored images
16 | Feature representation for statistical learning based object detection: a review | Li et al. [67] | PR | 2015 | Feature representation methods in statistical learning based object detection, including handcrafted and deep learning based features
17 | Deep learning | LeCun et al. [68] | Nature | 2015 | An introduction to deep learning and applications
18 | A survey on deep learning in medical image analysis | Litjens et al. [69] | MIA | 2017 | A survey on deep learning based image classification, object detection, segmentation and registration in medical image analysis
19 | Recent advances in convolutional neural networks | Gu et al. [70] | PR | 2017 | A broad survey of the recent advances in CNN and its applications in computer vision, speech and natural language processing
20 | Object detection in 20 years: A survey | Zou et al. [71] | IEEE | 2019 | A comprehensive review in the light of technical evolution of different object detection techniques in last two decades
21 | A survey of deep learning based object detection | Jiao et al. [57] | IEEE | 2019 | A broad survey that focuses on describing and analyzing the deep learning based object detection task
22 | Object detection with deep learning: a review | Zhao et al. [58] | arXiv | 2019 | A systematic review to summarize representative models and their different characteristics in several generic object detection application domains
23 | Recent advances in deep learning for object detection | Wu et al. [4] | Neurocomputing | 2020 | Presents a comprehensive understanding of deep learning based object detection algorithms
their different characteristics in several generic object detection application domains in the year 2019. Apart from these, two more surveys appeared in the same year (2019): Zou et al. [71] provided a comprehensive review in the light of the technical evolution of different object detection techniques, and Wu et al. [4] presented a comprehensive understanding of deep learning based object detection algorithms. However, these surveys do not provide a systematic view of deep learning based object detection techniques, the learning strategies that need to be employed, or up-to-date state-of-the-art detection solutions, and many of them predate the recent striking success and dominance of deep learning and related methods.

The advantage of deep learning is that it permits computational models to learn highly complex, subtle and abstract representations, thereby making significant inroads into a large set of problems like medical image analysis, natural language processing, speech recognition, generic object detection etc. Among the various types of deep neural networks, deep convolutional neural networks have revolutionized the fields of image processing, video, speech and audio detection. Many surveys on deep learning have been published, including those of Bengio et al. [65], LeCun et al. [68], Gu et al. [70] and Litjens et al. [69], and more recently tutorials at ICCV and CVPR.

In contrast, although many deep learning based methods have been proposed for object detection, we are not aware of any systematic, recent and comprehensive survey covering the latest strategies and techniques related to generic object detection. A detailed review and coverage of existing techniques is very much necessary for further progress in object detection, notably for researchers who are willing to enter this field.

1.2. Contributions of this survey

The primary focus of this survey is to describe and analyze the deep learning based object detection problem. Unlike existing surveys, this paper covers state-of-the-art techniques for object detection and also presents various learning strategies for this task, along with future research directions motivated by the rapid development of computer vision research.

• This survey features an in-depth analysis and thorough discussion of various aspects, some of which are new to the best of our knowledge.
• In this paper we list a number of object detection learning strategies and deliberately omit basic background material; in addition, the latest object detection trends are discussed so that readers can easily see the advancements in this field for themselves.
• Unlike previous surveys in this field, this paper provides a comprehensive and systematic look into deep learning based object detection techniques, a write-up of the most significant research trends, and up-to-date object detection algorithms.

So, the goal of this survey paper is to provide a thorough analysis of object detection techniques based on deep learning.

1.3. Article organization

The rest of the manuscript is organized as follows. In Section 2 we thoroughly compare traditional detection techniques with current object detectors. Proposal generation and feature representation learning techniques for object detection are covered in Section 3. Backbone architectures for object detection are presented in Section 4, followed by learning strategies for object detection in Section 5. Popular datasets and metrics used for generic object detection are provided in Section 6. Detection algorithms for real-world applications are covered in Section 7, and finally we conclude and discuss future research directions in Section 8.

2. Traditional detectors vs. state-of-the-art detectors.

It is widely accepted that, over the last 20 years, advancements in the field of object detection have been made in two historical periods. One is the traditional object detection era before the year 2014 and the other is the deep learning based object detection era after 2014 [71].

2.1. Traditional object detectors

Most of the traditional object detection techniques were developed using manually crafted features. Due to the absence of advanced image representation techniques before 2012, the only choice researchers had was to design highly complex feature vector representations, along with a large variety of speed-up techniques to compensate for the limited computational resources available.

2.1.1. Viola jones detectors:
In 2001, P. Viola and M. Jones developed a human face detector that achieved real time detection [72,73]. This detector was faster than any other algorithm of its time at comparable detection accuracy. It follows a sliding window approach and covers all the possible locations in an input image to figure out whether any of the windows captures a human face. Although the sliding window approach is a time consuming process, the VJ detector is able to improve its detection speed by incorporating three techniques. The first is the integral image, which is basically a way to speed up the convolution process; it makes the computational complexity of each window independent of its size. The second is feature selection, where the Adaboost algorithm [74] is used to select a small set of features that are helpful for face detection from a pool of about 180k features. Moreover, in order to reduce the computational complexity, detection cascades are introduced. A cascade is basically a multistage detection pipeline in which more computation is spent on face targets than on background windows.

2.1.2. Histogram of oriented gradients (HOG) detector:
The HOG feature descriptor was first proposed by N. Dalal and B. Triggs in the year 2005 [26]. The HOG feature can be regarded as a breakthrough advancement over SIFT features and shape contexts. It was designed to perform computations on a dense grid of uniformly spaced cells. HOG can be used for detecting a wide variety of object categories, but the primary motivation behind developing HOG was pedestrian detection. The main advantage of the HOG detector is that it can detect objects of different sizes; this is achieved by rescaling the input image a number of times while keeping the size of the detection window unaltered [75–77].

2.1.3. Deformable part based model (DPM):
In the year 2008, the deformable part based model was proposed by P. Felzenszwalb [75]. It was originally developed as an extension of the HOG detector, and later a lot of improvements were made by R. Girshick [31,76,78,79]. The deformable part based model follows a divide and conquer strategy, in which the detection of an object is treated as an amalgamation of detections carried out on the different parts of that object. For example, the problem of detecting a bird can be considered as the problem of detecting its beak, legs, tail and wings. In a basic DPM detector,
Multi scale deep convolutional neural networks [108] used deconvolutional layers on different feature maps to improve their resolution, and these improved feature maps were later used for making the final predictions. The Receptive Field Block Net (RFBNet) was proposed by Liu et al. [110] to enhance robustness and receptive fields using a receptive field block, which captures multi scale features using multiple branches with different convolution kernels that are then fused together.

(iv) Feature Pyramid: The feature pyramid network (FPN) was proposed by Lin et al. [48]; it combines the advantages of both integrated features and the prediction pyramid. Here, features of different scales are combined through lateral connections in a top-down fashion to build a group of scale invariant feature maps, which are then employed to train scale-dependent classifiers. In particular, deep, semantically rich features are used to reinforce the shallow, spatially rich features. These top-down features are then combined using element wise summation or concatenation, along with small convolutions for reducing the overall dimensionality. FPN has brought significant improvement to object detection as well as other applications, and has achieved state-of-the-art results in learning multi-scale features. Many variants of FPN with modified feature pyramid blocks were later introduced [111–117].

3.2.2. Contextual reasoning:
In the field of object detection, contextual information plays a significant role. Objects are often likely to appear in specific environments, and sometimes they coexist with other objects. For detecting objects with insufficient features, contextual information can effectively help in improving the detection performance. By capturing the association between objects and their surroundings, context can improve the detector's ability to understand the scene. As far as traditional object detection techniques are concerned, several efforts exploring context have been made [118].

3.2.3. Deformable feature learning:
A good object detector should be robust to non-rigid deformation of objects. Just before the deep learning era, DPM [119] had been successfully employed for object detection and recognition. To enable deep learning based object detectors to model the deformation of objects, many detection frameworks have been proposed which directly model object parts [94,120–122]. Dai et al. [94] and Zhu et al. [121] proposed deformable convolutional layers that automatically learn auxiliary location offsets to augment the information sampled at the standard sampling positions of the feature map.

Next we will discuss the backbone architecture of object detectors.

4. Backbone architecture for object detection:

Backbone networks are generally employed as feature extractors for the object detection task: they take an image as input and produce feature maps as output. The majority of backbone architectures used for detection are basically classification networks with the last fully connected layers removed. More efficient versions of these basic classification networks are also available; for example, Lin et al. [48] replaced some of the intermediate layers with specially designed layers to better meet specific requirements [20,123]. To meet different requirements on detection accuracy vs. localization efficiency, researchers are free to make use of much deeper and densely connected backbone architectures like ResNeXt [124], AmoebaNet [95] and ResNet [8]. Apart from these, there are various lightweight backbone networks which can meet the specific requirements of mobile devices. Some prominent lightweight networks like MobileNet [125], ShuffleNet [126] and Xception [127] are highly popular. Wang et al. [128] combined PeleeNet with SSD [129] and came up with a state-of-the-art real time object detection system with fast processing speed. Applications that demand high precision and accuracy require highly complex backbone networks, whereas applications involving videos and webcams require well designed architectures that can be adapted to meet the trade-off between speed and accuracy and thus satisfy real time requirements. To achieve highly competitive detection accuracy, shallower and sparsely connected backbone networks are replaced by much deeper and densely connected counterparts. As an example, He et al. [8] preferred ResNet [5] over VGG [130] to extract rich features for gaining high detection accuracy. High-performance classification networks, which can improve detection precision and reduce the complexity of the object detection task, are also being considered as an efficient way to further enhance network performance. With the immense advancements in deep learning techniques and the unremitting improvement of computation power, enormous progress has been made in the field of object detection. R-CNN [6] successfully demonstrated that adopting convolutional weights from pre-trained models provides rich semantic information, both for training the detectors and for enhancing detection performance. In the last few years this strategy has been widely accepted, and it has now become the norm for most object detectors. In this section we will review some basic architectures that are extensively used for detection, after providing a basic introduction to deep convolutional neural networks.

4.1. Basic architecture of a CNN:

Deep CNNs have proven exceptionally useful for visual understanding and machine vision [36,131]. A typical deep convolutional neural network, as shown in Fig. 13, generally comprises a series of convolutional layers, pooling layers, non-linear activation layers and finally a few fully connected layers. A sample image is fed as input to the convolutional layer, which generates a feature map by convolving it with n × n kernels (filters). The obtained feature map can be considered as a multi-channel sequence, where each channel holds different information related to the input image. Each pixel in the feature map corresponds to a neuron that is connected to a group of adjacent neurons in the previous feature map; this group is known as its receptive field. After the feature maps are generated, a non-linear activation function is applied, followed by a pooling layer (max or average pooling) to widen the receptive field as well as to reduce the computational cost. A deep convolutional neural network is formed by stacking a sequence of convolutional layers, pooling layers and non-linear activation layers. Optimizers like stochastic gradient descent [132], Adam [133] etc. are employed to minimize a predefined loss function and thereby optimize the entire network in an end-to-end manner. AlexNet [36] is one such example of a deep CNN; it comprises five convolutional layers, three max pooling layers and three fully connected layers. It should further be noted that each convolutional layer is followed by a ReLU [134], which is a non-linear activation layer.
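The effect of stacking convolution and pooling layers on the receptive field can be made concrete with a small calculation. The following is only an illustrative sketch and is not taken from any of the surveyed works: the function name `receptive_field` and the two toy layer stacks are assumptions, each layer is described solely by its kernel size and stride, and padding and dilation are ignored.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one unit in the last layer.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input towards the output. Convolution and pooling layers are treated
    alike, since both slide a window over the previous feature map.
    """
    rf = 1    # a raw input pixel only "sees" itself
    jump = 1  # spacing, in input pixels, between neighbouring units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap reaches `jump` pixels further
        jump *= stride             # striding spreads neighbouring units apart
    return rf


# Hypothetical toy stacks (assumed for illustration only).
conv_only = [(3, 1), (3, 1)]                  # two 3x3 convolutions
conv_pool = [(3, 1), (2, 2), (3, 1), (2, 2)]  # 3x3 conv + 2x2 pool, twice

print(receptive_field(conv_only))  # 5
print(receptive_field(conv_pool))  # 10
```

With the same two 3 × 3 convolutions, inserting stride-2 pooling layers doubles the receptive field in this toy case, which is one reason pooling is interleaved with convolutions in networks such as AlexNet.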
Table 3
Details of prominent generic object detection architectures.
Architecture | Proposal generation technique | Learning strategy | Loss function | Softmax layer | End-to-end training | Multi-scale input | Platform
RCNN [7] | Selective search | SGD, BP | Hinge Loss | Yes | No | No | Caffe
Fast-RCNN [19] | Selective search | SGD | Hinge Loss | Yes | No | No | Caffe
Faster-RCNN [19] | Region Proposal Network | SGD | Class Log Loss | Yes | Yes | Yes | Caffe
SPP-Net [89] | Edge Boxes | SGD | Hinge Loss | Yes | No | Yes | Caffe
R-FCN [91] | Region Proposal Network | SGD | Class Log Loss | No | Yes | Yes | Caffe
Mask R-CNN [8] | Region Proposal Network | SGD | Class Log Loss and Semantic Sigmoid Loss | Yes | Yes | Yes | TensorFlow/Keras
FPN [48] | Region Proposal Network | Synchronized SGD | Class Log Loss | Yes | Yes | Yes | TensorFlow
SSD [47] | Region Proposal Network | SGD | Class Softmax Loss | No | Yes | No | Caffe
YOLO [20] | Region Proposal Network | SGD | Class sum squared error loss | Yes | Yes | No | DarkNet
4.2. CNN backbone for object detection:

Here we will quickly review some common CNN backbone architectures such as ResNet [5,91], VGG16 [19,21], ResNeXt [83] and Hourglass [85], which are widely used for object detection. An overview of prominent generic object detection architectures, as mentioned in [58], is provided in Table 3. VGG16 [130], which was built upon AlexNet, comprises 5 groups of convolutional layers and 3 fully connected layers. The first two groups consist of two convolutional layers each, and the next three groups consist of three convolutional layers each. Between consecutive groups a max pooling layer is introduced to reduce the spatial dimensionality. The VGG16 architecture clearly showed that increasing the depth of the network by stacking many convolutional layers increases the expressive power of the model, which in turn leads to better performance. However, this strategy of increasing the number of convolutional layers suffers from a serious drawback: the entire network becomes difficult to optimize in an end-to-end manner. Keeping this observation in mind, He et al. [5] came up with ResNet [91], which mitigated the drawbacks associated with optimization by establishing shortcut connections, where a layer could skip the non-linear transformation and pass its values straight on to the next layer. This is represented by the following equation.

$$y_{k+1} = y_k + t_{k+1}(y_k, \theta) \qquad (3)$$

Here $y_k$ is the input feature of the $k$th layer and $t_{k+1}$ represents the set of operations (normalization, non-linear transformation, convolutions, activations etc.) carried out on the input feature $y_k$; $t_{k+1}(y_k, \theta)$ is the residual function for the input feature $y_k$. Using this equation, any feature map can be regarded as the sum of the activations of the previous layer and the residual function, which considerably reduces the difficulties involved during the training process. Gradients from the deeper layers can easily be propagated back to the shallow layers through the shortcut connection. Using a training setup involving "residual function blocks", the depth of the model could be increased from 16 up to 152, thus allowing us to train high capacity network models. The following year, He et al. [135] came up with a pre-activation variant of ResNet known as ResNetv2. Their experiments clearly showed that ResNetv2 could achieve better performance if batch normalization is done properly [136]. Due to this it becomes possible to successfully train models with more than 1000 layers. Huang et al. concluded that even though ResNet has overcome the shortcomings associated with training by using shortcut connections, it did not make use of all the features produced by the preceding layers. Because the original features from the shallow layers are lost in the element-wise operation, they cannot be used directly later on. Hence they proposed DenseNet [137], which not only preserved the features from the shallow layers but also improved the flow of information by concatenating the input with the residual output instead of adding them element-wise.

$$y_{k+1} = y_k \circ t_{k+1}(y_k, \theta) \qquad (4)$$

Here $\circ$ denotes the concatenation operation. DenseNet [137] too suffered from shortcomings: Chen et al. [138] concluded that there is a lot of redundancy among the extracted features and that the computation cost involved is also very high. Later, the advantages of both ResNet and DenseNet were combined and a new model called Dual Path Network (DPN) was proposed, wherein the channels of $y_k$ are divided into two segments, $y_k^d$ and $y_k^r$. Here $y_k^d$ was used to perform dense connection computation and element-wise summation was carried out using $y_k^r$, with unshared residual learning branches $t_{k+1}^d$ and $t_{k+1}^r$. The final result is basically the concatenation of these two branches.

$$y_{k+1} = \left(y_k^r + t_{k+1}^r(y_k^r, \theta^r)\right) \circ \left(y_k^d \circ t_{k+1}^d(y_k^d, \theta^d)\right) \qquad (5)$$

Xie et al. [124] came up with ResNeXt, based on the ResNet architecture, which not only mitigated the computation and memory cost but also maintained the classification accuracy. ResNeXt adopted group convolution layers [36], which sparsely connect the feature map channels to capture rich semantic features
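The difference between the residual, dense and dual path updates in Eqs. (3) to (5) can be made concrete with a small sketch on plain Python lists. The function `transform` is a hypothetical stand-in for $t_{k+1}(\cdot, \theta)$ (here just a scaled ReLU, with the scale playing the role of $\theta$), so the numbers carry no meaning beyond showing how the feature width behaves.

```python
# Toy illustration of Eqs. (3)-(5): addition keeps the feature width fixed,
# concatenation grows it, and the dual path update does both on separate parts.
# `transform` is an assumed placeholder for the learned operations t_{k+1}.

def transform(features, scale):
    """Stand-in for t_{k+1}(y_k, theta): a scaled ReLU."""
    return [max(0.0, scale * v) for v in features]

def residual_step(y_k, scale=0.5):
    """Eq. (3): y_{k+1} = y_k + t_{k+1}(y_k, theta), element-wise addition."""
    return [a + b for a, b in zip(y_k, transform(y_k, scale))]

def dense_step(y_k, scale=0.5):
    """Eq. (4): y_{k+1} = y_k concatenated with t_{k+1}(y_k, theta)."""
    return y_k + transform(y_k, scale)

def dual_path_step(y_r, y_d, scale_r=0.5, scale_d=0.25):
    """Eq. (5): residual branch on y^r, dense branch on y^d, then concatenate.
    The two branches use different parameters (unshared), as in the text."""
    return residual_step(y_r, scale_r) + dense_step(y_d, scale_d)

y = [1.0, -2.0, 4.0]
res = residual_step(y)                        # same length as y
dense = dense_step(y)                         # twice the length of y
dual = dual_path_step([1.0], [-2.0, 4.0])     # 1 residual + 4 dense channels
```

In the dense update the original entries of `y` survive unchanged in the output, which is the sense in which shallow features are preserved, whereas in the residual update they are merged into the sum.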
used for further processing the defined prediction offset multiple times. Some years later, another approach called LocNet was proposed by Gidaris et al. [144], which focused on the distribution of bounding box coordinates for refining the predictions. Zagoruyko et al. [145] introduced a multipath network in which a cascade of classifiers was developed and later optimized using a predefined integral loss based on various quality parameters. Each ensemble classifier was individually optimized with a predefined IoU threshold value. The final prediction is a combination of the outputs from each of these classifiers. Tychsen-Smith et al. derived a bounding box regression loss over a group of IoU values so as to maximize the IoU of the predicted offsets with the objects. They also proposed Fitness-NMS [146], which learned a novel fitness score function of the IoU between objects and their proposals. Taking inspiration from CornerNet [84] and DeNet [103], Lu et al. [147] came up with Grid R-CNN, which substitutes the bounding box regressor with a keypoint (corner) based mechanism.

(iii) Learning by Cascades: It is defined as a coarse-to-fine learning strategy which gathers information from the results of the given classifiers and develops a stronger one in a cascaded manner. Coming to deep learning based detection mechanisms, CRAFT (Cascade Region Proposal Network and Fast-RCNN) was proposed by Yang et al. [148], which employs cascade learning techniques to learn the RPN and the region proposals. CRAFT learns a regular RPN followed by a two-class Fast R-CNN which discards the greater part of the easy negatives. The remaining samples are then used to build the cascade region classifier. For classifying multiscale objects into one of the predefined target classes, a layer-wise cascade classifier was introduced by Yang et al. Chang et al. [149] studied Faster R-CNN and found that although object localization has improved a lot, there are still many classification errors that can be attributed to the joint multitask optimization of regression and classification. They also noted that because of the larger receptive fields of Faster R-CNN, noise is introduced in the detection process. To overcome these shortcomings, a cascade model was built which was based on both Faster R-CNN and R-CNN. An initial set of prediction offsets is produced by a well-trained Faster R-CNN, and these offsets are then refined using an R-CNN.

(iv) Adversarial Learning: This learning strategy has shown remarkable advancement in generative models. The most famous recent example of adversarial learning is the generative adversarial network (GAN) [150], in which a generator and a discriminator work against each other in a competitive manner. The generator tries to model the data by producing fake images and using them to confuse the discriminator, whereas the discriminator competes with the generator to tell the real images from the fake ones. GAN and its variants [151–153] have become largely popular and can therefore also be used for object detection. A new setup for detecting smaller objects, called perceptual GAN, was proposed by Li et al. [154], where the generator module learned high-resolution features from small objects using an "adversarial learning technique" and the discriminator tried to identify the real high-resolution features for small objects. A Faster R-CNN model trained using GAN-generated examples was proposed by Wang et al. [155], where a learned mask was produced on region features followed by region classifiers, due to which the detector becomes more robust after receiving more adversarial examples.

(v) Learning from scratch: The majority of object detectors that rely heavily on pretrained ImageNet models suffer from poor performance, because of the bias in loss functions and data distributions that exists between classification and detection tasks. This problem can be mitigated to an extent by finetuning on the detection task but cannot be completely avoided. Apart from that, using a pretrained classification model for detection purposes in a new domain can lead to further complications. For these reasons there is a strong need to train detection models from scratch instead of only using pretrained ImageNet models. The primary difficulty one may face in training a model from scratch is the lack of sufficient data for object detection, which can cause overfitting. Moreover, object detection models also need bounding box level annotations, and therefore annotating a huge dataset for object detection is a time consuming and costly affair. Among the state-of-the-art works on training object detectors from scratch, Shen et al. [109] proposed DSOD (Deeply Supervised Object Detectors), where the authors argued that optimization difficulties can be mitigated significantly if densely connected network models with deep supervision are used. Shen et al. [156] came out with a gated recurrent feature pyramid which dynamically adjusted the supervision intensities of intermediate layers for objects of various scales. This method proved to be more powerful than the original DSOD. Later, He et al. [157] validated the difficulty associated with training detectors from scratch on the MSCOCO dataset, concluded that vanilla detectors can obtain competitive performance with at least 10K annotated images, and showed that no specific model is required for training from scratch.

(vi) Imbalanced Sampling: In the field of object detection, the imbalance between positive and negative samples is a serious issue. This is because the majority of the proposals considered as regions of interest are basically background, and only a few among them are actually the objects we are looking for. This imbalance gives rise to two problems, class imbalance and difficulty imbalance. Class imbalance means that the majority of the generated proposals are just background and very few among them actually contain objects, whereas difficulty imbalance means that most proposals are easy to classify as background while the objects are hard to classify. Various techniques have been developed to address this problem. Two-stage detectors like R-CNN and Fast-RCNN discard a large number of negative samples and keep about 2000 proposals for further classification. In Fast R-CNN [19] the ratio of positive to negative samples is fixed at 1:3 in each minibatch to overcome the difficulty of class imbalance. For addressing class imbalance, random sampling can be used. To address the problem of difficulty imbalance, carefully designed loss functions can be used. For the purpose of object detection, a multi-class classifier is trained over C+1 classes. If $p = (p_0, p_1, \ldots, p_C)$ is the discrete probability distribution of the output over the C+1 classes and $v$ is the ground truth class, then the loss function is defined as:

$$\mathrm{Loss}_{cls}(p, v) = -\log p_v \qquad (8)$$

Instead of rejecting all the easy samples, a novel focal loss [83] was proposed by Lin et al., where an importance factor is assigned to each sample with respect to its loss value, as shown.

$$\mathrm{Loss}_{FL} = -\alpha (1 - p_v)^{\gamma} \log(p_v) \qquad (9)$$

Here $\alpha$ and $\gamma$ are the parameters that control the importance weight.
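The effect of the importance factor in Eq. (9) relative to the plain classification loss in Eq. (8) can be checked numerically with a short sketch. The default values `alpha=0.25` and `gamma=2.0` are a commonly used setting assumed here for illustration, and the two example probability vectors are made up.

```python
import math

def cross_entropy(p, v):
    """Eq. (8): negative log-probability of the ground truth class v."""
    return -math.log(p[v])

def focal_loss(p, v, alpha=0.25, gamma=2.0):
    """Eq. (9): cross-entropy scaled by alpha * (1 - p_v)^gamma,
    so confidently correct (easy) samples contribute very little."""
    return -alpha * (1.0 - p[v]) ** gamma * math.log(p[v])

v = 1                 # ground truth class index
easy = [0.05, 0.95]   # well-classified sample, p_v = 0.95
hard = [0.70, 0.30]   # poorly classified sample, p_v = 0.30

# Ratio of hard-sample loss to easy-sample loss under each criterion.
ce_ratio = cross_entropy(hard, v) / cross_entropy(easy, v)
fl_ratio = focal_loss(hard, v) / focal_loss(easy, v)
```

With these inputs the hard sample outweighs the easy one far more strongly under the focal loss than under the cross-entropy loss, and setting `alpha=1.0, gamma=0.0` recovers Eq. (8) exactly.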
Table 4
Performance analysis of some state-of-the-art detectors in terms of mean Average Precision (mAP) using PASCAL VOC 2007 test dataset.
Models Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Mbike Person Plant Sheep Sofa Train TV mAP
R-CNN (Alex) [7] 68.1 72.8 56.8 43.0 36.8 66.3 74.2 67.6 34.4 63.5 54.5 61.2 69.1 68.6 58.7 33.4 62.9 51.1 62.5 68.6 58.5
R-CNN (VGG) [19] 73.4 77.0 63.4 45.4 44.6 75.1 78.1 79.8 40.5 73.7 62.2 79.4 78.1 73.1 64.2 35.6 66.8 67.2 70.4 71.1 66.0
SPP-Net [89] 68.5 71.7 58.7 41.9 42.5 67.7 72.1 73.8 34.7 67.0 63.4 66.0 72.5 71.3 58.9 32.8 60.9 56.1 67.9 68.8 60.9
OHEM +Fast-RCNN [176] 80.6 85.7 79.8 69.9 60.8 88.3 87.9 89.6 59.7 85.1 76.5 87.1 87.3 82.4 78.8 53.7 80.5 78.7 84.5 80.7 78.9
HyperNet VGG [32] 84.2 78.5 73.6 55.6 53.7 78.7 79.8 87.7 49.6 74.9 52.1 86 81.7 83.3 81.8 48.6 73.5 59.4 79.9 65.7 71.415
Faster R-CNN [21] 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6 69.9
GCNN [177] 68.3 77.3 68.5 52.4 38.6 78.5 79.5 81.0 47.1 73.6 64.5 77.2 80.5 75.8 66.6 34.3 65.2 64.4 75.6 66.4 66.8
Bayes [178] 74.1 83.2 67.0 50.8 51.6 76.2 81.4 77.2 48.1 78.9 65.6 77.3 78.4 75.1 70.1 41.4 69.6 60.8 70.2 73.7 68.5
SDP +CRC [179] 76.1 79.4 68.2 52.6 46.0 78.4 78.4 81.0 46.7 73.5 65.3 78.6 81.0 76.7 77.3 39.0 65.1 67.2 77.5 70.3 68.9
SubCNN [27] 70.2 80.5 69.5 60.3 47.9 79.0 78.7 84.2 48.5 73.9 63.0 82.7 80.6 76.0 70.2 38.2 62.4 67.7 77.7 60.5 68.5
StuffNet30 [180] 72.6 81.7 70.6 60.5 53.0 81.5 83.7 83.9 52.2 78.9 70.7 85.0 85.7 77.0 78.7 42.2 73.6 69.2 79.2 73.8 72.7
NOC [181] 76.3 81.4 74.4 61.7 60.8 84.7 78.2 82.9 53.0 79.2 69.2 83.2 83.2 78.5 68.0 45.0 71.6 76.7 82.2 75.7 73.3
MR-CNN + S-CNN [182] 80.3 84.1 78.5 70.8 68.5 88.0 85.9 87.8 60.3 85.2 73.7 87.2 86.5 85.0 76.4 48.5 76.3 75.5 85.0 81.0 78.2
HyperNet [183] 77.4 83.3 75.0 69.1 62.4 83.1 87.4 87.4 57.1 79.8 71.4 85.1 85.1 80.0 79.1 51.2 79.1 75.7 80.9 76.5 76.3
SSD300 [47] 80.9 86.3 79.0 76.2 57.6 87.3 88.2 88.6 60.5 85.4 76.7 87.5 89.2 84.5 81.4 55.0 81.9 81.5 85.9 78.9 79.6
SSD512 [47] 86.6 88.3 82.4 76.0 66.3 88.6 88.9 89.1 65.1 88.4 73.6 86.5 88.9 85.3 84.6 59.1 85.0 80.4 87.4 81.2 81.6
Table 5
Performance analysis of some state-of-the-art detectors in terms of mean Average Precision (mAP) using PASCAL VOC 2012 test dataset.
Models Aeroplane Bicycle Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Mbike Person Plant Sheep Sofa Train TV mAP
R-CNN (Alex) [7] 71.8 65.8 52.0 34.1 32.6 59.6 60.0 69.8 27.6 52.0 41.7 69.6 61.3 68.3 57.8 29.6 57.8 40.9 59.3 54.1 53.3
R-CNN (VGG) [7] 79.6 72.7 61.9 41.2 41.9 65.9 66.4 84.6 38.5 67.2 46.7 82.0 74.8 76.0 65.2 35.6 65.4 54.2 67.4 60.3 62.4
Fast-RCNN [19] 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2 68.4
OHEM +Fast-RCNN [184] 90.1 87.4 79.9 65.8 66.3 86.1 85.0 92.9 62.4 83.4 69.5 90.6 88.9 88.9 83.6 59.0 82.0 74.7 88.2 77.3 80.1
Faster R-CNN [21] 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5 70.4
StuffNet30 [180] 83.0 76.9 71.2 51.6 50.1 76.4 75.7 87.8 48.3 74.8 55.7 85.7 81.2 80.3 79.5 44.2 71.8 61.0 78.5 65.4 70.0
NOC [181] 82.8 79.0 71.6 52.3 53.7 74.1 69.0 84.9 46.9 74.3 53.1 85.0 81.3 79.5 72.2 38.9 72.4 59.5 76.7 68.1 68.8
MR-CNN + S-CNN [182] 85.5 82.9 76.6 57.8 62.7 79.4 77.2 86.6 55.0 79.1 62.2 87.0 83.4 84.7 78.9 45.3 73.4 65.8 80.3 74.0 73.9
HyperNet [183] 84.2 78.5 73.6 55.6 53.7 78.7 79.8 87.7 49.6 74.9 52.1 86.0 81.7 83.3 81.8 48.6 73.5 59.4 79.9 65.7 71.4
ION [185] 87.5 84.7 76.8 63.8 58.3 82.6 79.0 90.9 57.8 82.0 64.7 88.9 86.5 84.7 82.3 51.4 78.2 69.2 85.2 73.5 76.4
SSD512 [47] 91.4 88.6 82.6 71.4 63.1 87.4 88.1 93.9 66.9 86.6 66.3 92.0 91.7 90.8 88.5 60.9 87.0 75.4 90.2 80.4 82.2
SSD300 [47] 91.0 86.0 78.1 65.0 55.4 84.9 84.0 93.4 62.1 83.6 67.3 91.3 88.9 88.6 85.6 54.7 83.8 77.3 88.3 76.5 79.3
YOLO [20] 77.0 67.2 57.7 38.3 22.7 68.3 55.9 81.4 36.2 60.8 48.5 77.2 72.3 71.3 63.5 28.9 52.2 54.8 73.9 50.8 57.9
YOLO + Fast R-CNN [20] 83.4 78.5 73.5 55.8 43.4 79.1 73.1 89.4 49.4 75.5 57.0 87.5 80.9 81.0 74.7 41.8 71.5 68.5 82.1 67.2 70.7
YOLOv2 [20] 88.8 87.0 77.8 64.9 51.8 85.2 79.3 93.1 64.4 81.4 70.2 91.3 88.1 87.2 81.0 57.7 78.1 71.0 88.5 76.8 78.2
R-FCN [19] 92.3 89.9 86.7 74.7 75.2 86.7 89.0 95.8 70.2 90.4 66.5 95.0 93.2 92.1 91.1 71.0 89.7 76.0 92.0 83.4 85.0
Table 6
Comparison of detection performance of some state-of-the-art techniques on MSCOCO test-dev dataset.
Model Backbone architecture Year of introduction AP AP50 AP75 APS APM APL
Single-Stage-Object-Detectors
SSD512 [16] VGG-16 2016 28.8 48.5 30.3 10.9 31.8 43.5
SSD513 [113] ResNet-101 2017 31.2 50.4 33.3 10.2 34.5 49.8
DSSD513 [113] ResNet-101 2017 33.2 53.3 35.2 13.0 35.4 51.1
STDN513 [187] DenseNet-169 2018 31.8 51.0 33.6 14.4 36.1 43.4
CornerNet511 [84] Hourglass-169 2018 40.5 56.5 43.1 19.4 42.7 53.9
CornerNet511 [86] Hourglass-169 2019 44.9 62.4 48.1 25.6 47.4 57.4
GHM SSD [188] ResNeXt-101 2018 41.6 62.8 44.2 22.3 45.1 55.3
FPN-Reconfig [116] ResNeXt-101 2018 34.6 54.3 37.3 NA NA NA
FCOS [189] ResNeXt-101 2019 42.1 62.1 45.2 25.6 44.9 52.0
FSAF [104] ResNeXt-101 2019 42.9 63.8 46.3 26.6 46.2 52.7
ExtremeNet [190] Hourglass-104 2019 40.2 55.5 43.2 20.4 43.2 53.1
M2Det800 [117] VGG-16 2019 41.0 59.7 45.0 22.1 46.5 53.8
RefineDet512 [174] ResNet-101 2018 36.4 57.5 39.5 16.6 39.9 51.4
YOLOv2 [46] DarkNet-19 2017 21.6 44.0 19.2 5.0 22.4 35.5
Two-Stage-Object-Detectors
Fast R-CNN [19] VGG-16 2015 19.7 35.9 NA NA NA NA
Faster R-CNN [21] VGG-16 2015 21.9 42.7 NA NA NA NA
Faster R-CNN w FPN [48] ResNet-101 2016 36.2 59.1 39.0 18.2 39.0 48.2
Faster R-CNN by G-RMI [161] Inception-ResNet-v2 2017 34.7 55.5 36.7 13.5 38.1 52.0
OHEM [184] VGG-16 2016 22.6 42.5 22.2 5.0 23.7 37.9
ION [185] VGG-16 2016 23.6 43.2 23.6 6.4 24.1 38.3
R-FCN [91] ResNet-101 2016 29.9 51.9 NA 10.8 32.8 45.0
CoupleNet [191] ResNet-101 2017 34.4 54.8 37.2 13.4 38.1 50.8
Deformable R-FCN [91] Aligned-Inception-ResNet 2017 37.5 58.0 40.8 19.4 40.1 52.5
DeNet-101 [103] ResNet-101 2017 33.8 53.4 36.1 12.3 36.1 50.8
Mask-RCNN [8] ResNeXt-101 2017 39.8 62.3 43.4 22.1 43.2 51.2
Fitness-NMS [146] ResNet-101 2017 41.8 60.9 44.9 21.5 45.0 57.5
Relation Net [93] ResNet-101 2018 39.0 58.6 42.9 NA NA NA
DeepRegionlets [192] ResNet-101 2018 39.3 59.8 NA 21.7 43.7 50.9
C-Mask RCNN [193] ResNet-101 2018 42.0 62.9 46.4 23.4 44.7 53.8
DCN + R-CNN [149] ResNet-101 + ResNet-152 2018 42.6 65.3 46.5 26.4 46.1 56.4
Cascade R-CNN [194] ResNet-101 2018 42.8 62.1 46.3 23.7 45.5 55.2
Grid R-CNN [147] ResNeXt-101 2019 43.2 63.0 46.6 25.1 46.5 55.2
DCN-v2 [121] ResNet-101 2019 44.8 66.3 48.8 24.4 48.1 59.6
TridentNet [195] ResNet-101 2019 42.7 63.6 46.5 23.9 46.6 56.6
6.0.4. OpenImages dataset:
This dataset consists of 1.9 million images containing approximately 15 million objects spread over 600 categories. The images in this dataset are already annotated with image level labels, object segmentation masks, visual relationships and object bounding boxes [197]. Some important aspects of this dataset are as follows. Firstly, the bounding boxes are largely drawn manually. Secondly, the images are very diverse and consist of complex scenes
Table 7
Statistics of some popular generic object detection datasets.
Dataset Training subset Validation subset Testing subset Trainval subset
Images Objects Images Objects Images Objects Images Objects
Pascal VOC-2007 2501 6301 2510 6307 4952 14976 5011 12608
Pascal VOC-2012 5717 13609 5823 13841 10991 NA 11540 27450
ILSVRC-2014 456567 478807 20121 55502 40152 NA 476688 534309
ILSVRC-2017 456567 478807 20121 55502 65500 NA 476688 534309
MS-COCO-2015 82783 604907 40504 291875 81434 NA 123287 896782
MS-COCO-2018 118287 860001 5000 36781 40670 NA 123287 896782
with many objects, and thirdly, the dataset often offers visual relationship annotations. Since it is a very diverse dataset, different metrics are used to measure the performance of an algorithm on it. Firstly, the unannotated class categories are discarded to avoid miscounting false negatives. Secondly, if an object belongs to a predefined target class, then the detection model should come up with detection results for each of the relevant classes. Thirdly, if a detection lies inside a group-of box and the intersection of the detection and the box divided by the area of the detection exceeds a threshold of 0.5, then it is counted as a true positive.

6.0.5. VIS Drone 2018 dataset:
In the year 2018, a new dataset comprising videos and images captured using a drone was developed, called VIS Drone 2018 [198]. It is a large scale visual object detection and tracking dataset that aims at advancing visual understanding on a drone based platform. The images and videos in this dataset were captured over several urban areas of 14 different Chinese cities. It comprises 263 video sequences and 10209 images with rich annotations like ground truth bounding boxes, object category, truncation ratios etc. It comprises more than 2.5 million annotated instances. This dataset adopts the MS-COCO metric for evaluating the performance of different object detection algorithms.

6.0.6. LVIS [1]:
It is a newly collected benchmark comprising 164000 images spread over more than 1000 object categories. It is a fairly new dataset without any pre-existing results as of now.
The statistics of some of these well known generic object detection datasets are provided in Table 7.

6.1. Pedestrian detection dataset:

In this subsection, we will review the most famous and commonly used datasets for pedestrian detection along with the metrics being used for evaluation purposes.

(i) Caltech [52]: It is one of the most popular and challenging datasets used for the pedestrian detection task. It comprises approximately 10 h of VGA video recorded by a vehicle-mounted camera driving through the streets of the Los Angeles metropolitan area. The training and testing sets comprise 42782 and 4024 video frames respectively.
(ii) ETH [199]: This dataset is basically used as a testing set to assess the performance of detection models trained on the Citypersons dataset. It comprises 1804 frames spread over three video sequences.
(iii) INRIA [26]: It comprises HD images of pedestrians, mostly collected from vacation resorts. It contains 2120 images, which are further divided into two sets consisting of 1832 images for training and the remaining 288 images for testing.
(iv) KITTI [200]: It consists of 7481 labeled HD images for training and 7518 images for testing. The person category in this dataset is subdivided into two sub-categories, pedestrian and cyclist. The models trained on it are evaluated using three metrics that differ in the minimum bounding box height and the maximum occlusion level.

For the INRIA [26] and ETH [199] datasets, the log average miss rate metric is used to evaluate the performance of detectors, and for KITTI, mean Average Precision with an IoU threshold of 0.5 is used for evaluation purposes. A comparative analysis of pedestrian detection datasets is given in Table 8, followed by a comparison of person detection benchmarks in Table 9. Table information from Piotr et al., IEEE TPAMI 2012 [52].

6.2. Face detection datasets:

Here many popular face detection datasets are reviewed along with their evaluation metrics.

(i) PASCAL FACE [33]: This dataset comprises 1335 labeled faces from 851 different images. It is collected from the PASCAL person layout set and is commonly used as a test set.
(ii) FDDB [205]: It stands for Face Detection Dataset and Benchmark. It is a widely used dataset comprising 5171 faces spread over 2845 images. It is also used as a test set for evaluating the performance of face detection models.
(iii) Wider-Face [206]: It is a very large dataset comprising 32,203 images with almost 400K faces of varied range and scales. It is divided into three subsets, 40% for training, 10% for validation and the remaining 50% for testing. The annotated training and validation datasets are available online at http://shuoyang1213.me/WIDERFACE/.

For Wider-Face and PASCAL-Face, mAP with an IoU threshold of 0.5 is used as the metric to evaluate the performance of face detectors, and for evaluating on the FDDB dataset two annotation types (bounding box level and ellipse level) are mostly used. The details of the evaluation metrics (discussed above) are summarized in Table 10.

7. Applications of object detection

Object detection finds its applications in many fields like medical, military, security etc. to assist people. In this section, some prominent applications of object detectors are reviewed.

7.1. Face detection:

The aim of face detection is to detect all the faces present in an image. Face detection is still considered a difficult task because of occlusion and illumination variations. Many state-of-the-art face detectors have been built which precisely detect the faces in a given image. A novel Wasserstein CNN approach was proposed by He et al. [207] to learn invariant features for face detection. To enhance and speed up the discriminative abilities of a DCNN based face recognizer, appropriate loss functions need
Table 8
Comparative analysis of pedestrian detection datasets.
Dataset Setup Training set Testing set
Pedestrians Positive images Negative images Pedestrians Positive images Negative images
Caltech [52] Mobile 192K 67K 61K 155K 65K 56K
INRIA [26] Photo 1208 614 1218 566 288 453
ETH [199] Mobile 2388 499 NA 12K 1804 NA
TUD-Brussels [201] Mobile 1776 1092 218 1498 508 NA
Daimler-DB [53] Mobile 192K 67K 61K 155K 65K 56K
Table 9
Comparison of person detection benchmarks.
Dataset Seasons Countries Cities Images Pedestrians Image resolution Weather Train-Val-Test-Split (%)
Caltech [52] 1 1 1 249884 289395 640 X 480 Dry 50-0-50
KITTI [202] 1 1 1 14999 9400 1240 X 376 Dry 50-0-50
City Persons [203] 3 3 27 5000 31514 2048 X 1024 Dry 60-10-30
TDC [204] 1 1 1 14674 8919 2048 X 1024 Dry 71-8-21
Table 10
Details of evaluation metrics for generic object detection.
Abbreviation | Meaning | Description
Ω | IoU Threshold | Used for evaluating localization accuracy.
D | All Predictions | Top predictions with the highest confidence scores generated by the object detector.
TP | True Positive | Correct predictions made by the object detector.
FP | False Positive | Incorrect predictions made by the object detector.
P | Precision | Fraction of true positives out of all predictions.
AP | Average Precision | Computed over different values of recall.
TPR | True Positive Rate | Fraction of ground-truth positives that are correctly detected.
FPPI | False Positive Per Image | Number of false positives per image.
FPS | Frames Per Second | Number of images processed per second.
MR | Miss Rate | Average miss rate over diverse FPPI rates equally spaced in log-space.
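The quantities listed in Table 10 can be tied together in a short sketch that matches detections to ground truth boxes by IoU and then accumulates precision, recall and an uninterpolated average precision. This is a simplified illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, each ground truth box may be matched at most once, AP is taken as a plain rectangle-rule area under the precision-recall curve, and the helper names `iou` and `evaluate` are hypothetical. Benchmark toolkits such as those of PASCAL VOC and MS-COCO differ in matching and interpolation details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, iou_threshold=0.5):
    """detections: list of (confidence, box). Returns (curve, ap), where curve
    holds (precision, recall) after each detection in descending confidence."""
    matched = set()   # indices of ground truth boxes already claimed
    tp = fp = 0
    curve = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)   # true positive
            tp += 1
        else:
            fp += 1                 # false positive
        curve.append((tp / (tp + fp), tp / len(ground_truth)))
    # Area under the precision-recall curve (rectangle rule, no interpolation).
    ap, prev_recall = 0.0, 0.0
    for precision, recall in curve:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return curve, ap

ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [
    (0.9, (0, 0, 10, 10)),      # exact match with the first object
    (0.8, (50, 50, 60, 60)),    # background, overlaps nothing
    (0.7, (21, 21, 31, 31)),    # shifted match with the second object
]
curve, ap = evaluate(detections, ground_truth)
```

Here the third detection overlaps its object with an IoU of 81/119, which clears the 0.5 threshold, so the final precision is 2/3 at a recall of 1.0.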
to be designed. One such loss function is the cosine based softmax loss [208–211], which is widely used. Guo et al. [212] proposed a fuzzy logic based sparse autoencoder framework for face recognition.

7.2. Pedestrian detection:

This application aims at detecting pedestrians in a natural environment. Braun et al. [213] developed the EuroCity Persons dataset, which comprises pedestrians and cyclists. For real time pedestrian detection, some cascaded pedestrian detection models were developed [214–216].

7.3. Anomaly detection:
Anomaly detection plays a noteworthy part in fraud detection, healthcare monitoring and weather analysis. Current anomaly detection techniques study the data using point-wise criteria [217–220]. For analyzing contiguous time and space intervals, Barz et al. [221] proposed an unsupervised technique called Maximally Divergent Intervals for anomaly detection.

7.4. License plate recognition:
With the growing popularity of cars and other vehicles, license plate recognition is considered a hot topic today, as it is required in vehicle tracking and traffic violation tracking. To make license plate recognition more robust and trustworthy, several techniques like edge detection, morphology, sliding concentric windows etc. are combined. In recent years, deep learning based methods [222–227] have also provided solutions for this application.

7.5. Autonomous driving:

For a car to drive automatically and operate in a reliable manner, accurate perception is needed. Deep learning based perception systems are normally employed for this purpose; they take multi-sensory data and convert it into semantic knowledge, thereby enabling automatic driving. Object detection is a fundamental aspect of such a driving system. In recent years, Lu et al. [228] made use of novel architectures containing 3D convolutions and RNNs to achieve accurate localization in several real world driving scenarios. Apart from that, Song et al. [228] developed a 3D car instance understanding benchmark for autonomous driving and Banerjee et al. [229] utilized sensor fusion to extract efficient features.

7.6. Traffic sign recognition:

For the sake of safety and compliance with traffic rules, accurate real time traffic sign recognition is required, which assists in driving by obtaining temporal and spatial information about various traffic signs. Deep learning based object detection methods [230–243] try to solve this problem with high accuracy.

7.7. Computer aided diagnosis (CAD) system:
These systems can assist doctors and physicians in categorizing different kinds of cancer. The fundamental tasks carried out by a CAD setup are image acquisition followed by segmentation, feature extraction, classification and finally object detection. Because of significant differences between individuals, lack of data and privacy concerns, there generally exists a difference in data distribution between source and target domains. Hence domain adaptation setups [244] are required for medical image detection.
7.8. Event detection:

This application focuses on recognizing real world events from the web, such as festivals, talk
shows, disasters, election campaigns and protests. With the advancement of social media,
multi-domain event detection systems can provide a detailed description of them. For handling
multi-domain data, an event detection system was developed by Yang et al. [245]. Wang et al.
[246] added some online social interaction features by developing affine based graphs for event
detection purposes. Apart from these advancements, a multi dimensional graph based model for
detecting events from millions of video sequences and images was developed by Schinas et al.
[247].

7.9. Pattern detection:

Pattern detection always suffers from the difficulties of scene occlusion, varying illumination,
pose variations and sensor based noise. For addressing duplicate pattern and periodic structure
detection, researchers have designed and developed some state-of-the-art benchmarks for both 2D
[248,249] and 3D images [129,250–260].

7.10. Image caption generation:

This application aims at automatically generating captions for a given image by capturing the
underlying semantic information from the image and expressing it in natural language. Image
captioning requires both computer vision and NLP technologies, which are themselves very
challenging. To address these issues, reinforcement learning [261,262], attention networks
[263,264] and encoder–decoder networks have been widely used. Apart from these, deep CNN based
techniques have proven to be highly effective and efficient [2,265].

7.11. Salient object detection:

For detecting salient objects, deep neural networks are utilized to predict saliency scores of
image regions and to produce correct saliency maps. Deep neural networks for detecting salient
objects normally need to put together multi-level features of the backbone architecture. To
obtain fast speed without compromising accuracy, Wu et al. [266] suggested that shallower layer
features are sufficient to precisely obtain the saliency map. For detecting salient objects,
Wang et al. [267] used fixation prediction techniques, and to accurately detect salient objects,
Wang et al. [268] incorporated prior knowledge of saliency into recurrent fully convolutional
networks. To properly explore the structure of objects, an attentive feedback module was
designed by Feng et al. [269].

7.12. Text detection:

The aim of text detection is to discover the text region of a given image or video, and it is
one of the most important preconditions for many computer vision tasks such as categorization
and video analysis. Although there are already many successful optical character recognition
systems, the detection of text in cluttered natural scenes remains a challenging task because of
blurring, different orientations, lighting conditions and various other distortions. Very
recently, researchers have found that randomly-oriented text [270–272] is a research direction
that requires attention. Generally, deep learning based scene text detection can be broadly
categorized into two classes. The first category considers the scene text as a particular object
and uses text box regression techniques to locate the text within the scene, whereas in the
second category the textual image is directly segmented and requires a complicated post
processing step. Further, in order to get the required orientation of text boxes, some of the
text detection techniques require multifaceted post processing, hence they are not as efficient
as methods which are directly based on detection networks. Lyu et al. [273] combined the ideas
proposed in the above two text detection categories along with dividing the text regions into
relative positions to recognize the underlying text. Ma et al. [274] developed a novel rotation
based technique and an end to end text detection system to generate inclined region proposals
with text orientation information.

7.13. Point clouds 3D object detection:

As compared to image based object detection, LIDAR based point clouds provide depth information
which can further be used for accurately locating objects and characterizing their shapes. LIDAR
point cloud based 3D object detection plays an important role in autonomous driving, robotics
and virtual reality applications. Despite the numerous applications of point clouds in object
detection, this technique also faces challenges such as sparsity of LIDAR point clouds,
inconsistent sampling of the 3D space, occlusions and relative pose variation. Qi et al. [275]
developed an end to end deep neural network called PointNet which can learn features directly
from LIDAR point clouds. For efficient mapping and processing of enormous 3D data, Engelcke et
al. [276] proposed sparse convolution layers and L1 regularization. Zhou et al. [277] developed
a general end to end 3D object detection framework called VoxelNet that can predict correct 3D
bounding boxes by learning a discriminative feature representation from point clouds.

7.14. 2D 3D pose detection:

The aim of human pose detection is to locate the 2D or 3D pose of the body joints along with the
base classes and eventually return the average pose of the maximum scoring class. Emblematic 2D
human pose detection techniques [237,238,278–280] make use of deep CNN architectures. Rogez et
al. [239] came up with an end to end architecture for joint 2D and 3D pose estimation in natural
scenes, thereby predicting 2D and 3D poses of many people together. Human pose estimation
techniques can broadly be divided into two categories, single stage and multi stage methods. The
best performing methods are typically based on single stage backbone architectures [240–242].
The most representative multi stage techniques are convolutional pose machines [243], Hourglass
network [280] and MSPN [281].

7.15. Fine grained visual recognition:

The aim is to recognize the exact class of objects within a basic level category, such as
identifying the model of a car or recognizing the species of a mammal. This task is challenging
because the visual differences between class categories are minimal and can easily be
overwhelmed by factors like pose, location of an object and its viewpoint in the given image.
Krause et al. [282] make use of 3D object representations to generalize across multiple
viewpoints at the level of both locations and local features. A bilinear model consisting of two
CNN streams was introduced by Lin et al. [283], wherein the outputs from these two CNN streams
are multiplied using the outer product at each image location and then pooled together to get an
image descriptor. A fine-grained discriminative localization method based on saliency guided
Faster R-CNN was introduced by He et al. [284], and later they introduced a weakly supervised
version for fast fine-grained image categorization [285].
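
As an illustration of the bilinear pooling step described for the two-stream model of Lin et al. [283], the following minimal Python sketch forms the outer product of the two streams' feature vectors at each image location and sum-pools the results into one descriptor. It is only a sketch under stated assumptions: the function name `bilinear_pool`, the list-of-vectors layout and the toy values are hypothetical, sum pooling is assumed as the pooling operation, and any normalization a practical system might apply afterwards is omitted.

```python
from typing import List, Sequence


def bilinear_pool(stream_a: Sequence[Sequence[float]],
                  stream_b: Sequence[Sequence[float]]) -> List[float]:
    """Sum-pool the per-location outer products of two feature streams.

    Each stream is a list with one feature vector per image location
    (a hypothetical layout chosen for this sketch). The result is a flat
    descriptor of length len(stream_a[0]) * len(stream_b[0]).
    """
    if len(stream_a) != len(stream_b) or not stream_a:
        raise ValueError("streams must cover the same, non-empty set of locations")
    dim_a, dim_b = len(stream_a[0]), len(stream_b[0])
    pooled = [[0.0] * dim_b for _ in range(dim_a)]
    for vec_a, vec_b in zip(stream_a, stream_b):
        # Outer product at this location, accumulated into the pooled matrix.
        for i, a in enumerate(vec_a):
            for j, b in enumerate(vec_b):
                pooled[i][j] += a * b
    # Flatten row by row into a single image descriptor.
    return [value for row in pooled for value in row]


# Toy input: two image locations; stream A has 2 channels, stream B has 3.
features_a = [[1.0, 2.0], [0.0, 1.0]]
features_b = [[1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
descriptor = bilinear_pool(features_a, features_b)
print(descriptor)  # [1.0, 0.0, 1.0, 4.0, 1.0, 2.0]
```

The descriptor length is the product of the two channel counts, so in this toy case 2 × 3 = 6 values. With realistic CNN feature widths the descriptor becomes very large, which is one practical consideration when using such models.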
7.16. Edge detection:

It aims at extracting the boundary of an object and salient edges from an image, which is
considered very important for many higher level computer vision tasks like object recognition
and segmentation. Edge detection also faces major challenges: edges of different scales are
present in an image, so both object level boundaries and useful local region details are needed.
Furthermore, to predict different parts in the final detection, each of the CNN layers should be
trained with proper layer specific supervision. To address these issues, He et al. [286] came up
with a bi-directional cascade network which allows an individual layer to be supervised by
labeled edges while adopting dilated convolutions for generating multi-scale features.

8. Concluding remarks and future research areas

Due to its potent learning capabilities and usefulness in dealing with occlusions, scale
alterations and background changes, deep learning based object detection has become a highly
researched area. In this paper we have provided a comprehensive survey of the latest advances in
deep learning based visual object detection. The review starts with surveying a large body of
recent works in the literature, followed by analyzing traditional and current detectors. Then a
rigorous overview of backbone architectures along with systematic coverage of prominent learning
strategies is given. Finally, some popular datasets and benchmarks for visual object detection
are discussed along with some application areas to gain a thorough understanding of the object
detection landscape. With increasingly powerful visual object detectors in the fields of
security, transportation, military etc., the applications of object detection are witnessing a
sharp increase. Despite all these advancements, there is still much room for further
development. Below we provide some of the latest trends in this domain to facilitate future
research in visual object detection with deep learning.

8.1. Future research trends:

(i) Video Object Detection: Video object detection suffers from motion target ambiguities, very
small target objects, truncations and occlusions, which make it extremely difficult to achieve
high accuracy and efficiency. Hence, investigating motion based goals and multifaceted data
sources like video sequences is one of the most promising future research areas.

(ii) Weakly Supervised Object Detection: Weakly supervised object detection models focus on
utilizing a small set of completely annotated images for detecting a much larger number of
non-annotated counterparts. This is a significant problem for future studies, since a large
proportion of images annotated and labeled with target objects and bounding boxes is currently
needed to train a network efficiently and achieve high effectiveness.

(iii) Multi-Domain Object Detection: Domain specific detectors tend to perform better, achieving
high detection accuracy on a predefined dataset. The future therefore lies in developing a
universal object detector which is capable of detecting multi domain objects without any prior
knowledge.

(iv) Salient Object Detection: This area aims at emphasizing salient object regions in an image.
Salient object detection is applied to a broad spectrum of object detection applications in
different areas. Salient object regions of importance in each frame can help to accurately
detect the objects in a continuous scene or video sequence. Hence, for important recognition and
detection tasks, saliency guided object detection can be considered as a preliminary process.

(v) Multi-task Learning: For improving detection performance, aggregating multi level features
of backbone architectures can be considered a significant step. Moreover, performing many
computer vision tasks simultaneously, such as object detection and semantic and instance
segmentation, can improve performance to a large extent by using richer information. Although
adopting this technique is an efficient way to combine multiple tasks in one model, it presents
a series of challenges for researchers, who must improve detection accuracy without compromising
processing speed.

(vi) Unsupervised Object Detection: Developing automatic annotation techniques to get rid of
manual annotation is an exciting and promising trend for unsupervised object detection. For
sharp detection tasks, unsupervised object detection is a future research direction.

(vii) Remote Sensing Real Time Detection: Remote sensing images find applications in both
military and agricultural domains. Automatic detection models and integrated hardware units will
accelerate the development of these fields.

(viii) GAN based Object Detectors: Deep learning based object detectors often require huge
amounts of data for training, whereas a GAN is an influential structure which generates fake
images. Combining real world scenarios with simulated data produced by a GAN helps make
detectors more robust and gives them greater generalization capabilities.

Research in object detection with deep learning needs further study. We hope that deep learning
based object detectors will make life changing contributions in the years to come.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.

References

[1] Agrim Gupta, Piotr Dollar, Ross Girshick, LVIS: A dataset for large vocabulary instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5356–5364.
[2] Licheng Jiao, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, Rong Qu, A survey of deep learning-based object detection, IEEE Access 7 (2019) 128837–128868.
[3] Jensen Huang Nvidia, Accelerating AI with GPUs: A new computing model, 2020, Retrieved on June 20, 2020 at 11:45 am, from URL https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/.
[4] Xiongwei Wu, Doyen Sahoo, Steven C.H. Hoi, Recent advances in deep learning for object detection, Neurocomputing (2020).
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected crfs, 2014, arXiv preprint arXiv:1412.7062.
[7] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[8] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[9] Yi Sun, Ding Liang, Xiaogang Wang, Xiaoou Tang, Deepid3: Face recognition with very deep neural networks, 2015, arXiv preprint arXiv:1502.00873.
[10] Yi Sun, Yuheng Chen, Xiaogang Wang, Xiaoou Tang, Deep learning face representation by joint identification-verification, in: Advances in Neural Information Processing Systems, 2014, pp. 1988–1996.
[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, Le Song, Sphereface: Deep hypersphere embedding for face recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 212–220.
[12] Jianan Li, Xiaodan Liang, ShengMei Shen, Tingfa Xu, Jiashi Feng, Shuicheng Yan, Scale-aware fast R-CNN for pedestrian detection, IEEE Trans. Multimedia 20 (4) (2017) 985–996.
[13] Jan Hosang, Mohamed Omran, Rodrigo Benenson, Bernt Schiele, Taking a deeper look at pedestrians, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4073–4082.
[14] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit Ogale, Dave Ferguson, Real-time pedestrian detection with deep network cascades, 2015.
[15] Steven CH Hoi, Xiongwei Wu, Hantang Liu, Yue Wu, Huiqiong Wang, Hui Xue, Qiang Wu, Logo-net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks, 2015, arXiv preprint arXiv:1511.02462.
[16] Hang Su, Xiatian Zhu, Shaogang Gong, Deep learning logo detection with data expansion by synthesising context, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 530–539.
[17] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, Li Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[18] Hossein Mobahi, Ronan Collobert, Jason Weston, Deep learning from temporal coherence in video, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 737–744.
[19] Ross B. Girshick, Fast R-CNN, 2015, CoRR arXiv:1504.08083.
[20] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[22] David G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, IEEE, 1999, pp. 1150–1157.
[23] Herbert Bay, Tinne Tuytelaars, Luc Van Gool, Surf: Speeded up robust features, in: European Conference on Computer Vision, Springer, 2006, pp. 404–417.
[24] Rainer Lienhart, Jochen Maydt, An extended set of haar-like features for rapid object detection, in: Proceedings. International Conference on Image Processing, vol. 1, IEEE, 2002, pp. I–I.
[25] Eleonora Vig, Michael Dorr, David Cox, Large-scale optimization of hierarchical features for saliency prediction in natural images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2798–2805.
[26] Navneet Dalal, Bill Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, IEEE, 2005, pp. 886–893.
[27] Yu Xiang, Wongun Choi, Yuanqing Lin, Silvio Savarese, Subcategory-aware convolutional neural networks for object proposals and detection, in: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 924–933.
[28] Yoav Freund, Robert E. Schapire, A desicion-theoretic generalization of on-line learning and an application to boosting, in: European Conference on Computational Learning Theory, Springer, 1995, pp. 23–37.
[29] Yoav Freund, Robert E. Schapire, et al., Experiments with a new boosting algorithm, in: ICML, vol. 96, Citeseer, 1996, pp. 148–156.
[30] David Opitz, Richard Maclin, Popular ensemble methods: An empirical study, J. Artif. Intell. Res. 11 (1999) 169–198.
[31] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1627–1645.
[32] Mark Everingham, Luc Van Gool, Christopher K.I. Williams, John Winn, Andrew Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results, 2007.
[33] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, Andrew Zisserman, The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[34] David G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[35] Timo Ojala, Matti Pietikainen, Topi Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[36] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[37] Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, Jinjian Wu, Feature-fused SSD: Fast detection for small objects, in: Ninth International Conference on Graphic and Image Processing (ICGIP 2017), vol. 10615, International Society for Optics and Photonics, 2018, p. 106151E.
[38] Subarna Tripathi, Gokce Dane, Byeongkeun Kang, Vasudev Bhaskaran, Truong Nguyen, LCDet: Low-complexity fully-convolutional neural networks for object detection in embedded systems, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 94–103.
[39] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 675–678.
[40] Zhenheng Yang, Ramakant Nevatia, A multi-scale cascade fully convolutional network face detector, in: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016, pp. 633–638.
[41] Ngiam Jiquan, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, AY Ng, Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), vol. 689696, 2011.
[42] Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, Shih-Fu Chang, Modeling multimodal clues in a hybrid deep learning framework for video classification, IEEE Trans. Multimed. 20 (11) (2018) 3137–3147.
[43] Denis Tomè, Federico Monti, Luca Baroffio, Luca Bondi, Marco Tagliasacchi, Stefano Tubaro, Deep convolutional neural networks for pedestrian detection, Signal Process., Image Commun. 47 (2016) 482–489.
[44] Zhong-Qiu Zhao, Haiman Bian, Donghui Hu, Wenjuan Cheng, Hervé Glotin, Pedestrian detection based on fast R-CNN and batch normalization, in: ICIC 2017: Intelligent Computing Theories and Application, Springer, 2017, pp. 735–746.
[45] Chen Zhang, Joohee Kim, Object detection with location-aware deformable convolution and backward attention filtering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9452–9461.
[46] Joseph Redmon, Ali Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[47] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C Berg, Ssd: Single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[48] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[49] Ming-Hsuan Yang, David J. Kriegman, Narendra Ahuja, Detecting faces in images: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 24 (1) (2002) 34–58.
[50] Stefanos Zafeiriou, Cha Zhang, Zhengyou Zhang, A survey on face detection in the wild: past, present and future, Comput. Vis. Image Underst. 138 (2015) 1–24.
[51] Qixiang Ye, David Doermann, Text detection and recognition in imagery: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 37 (7) (2014) 1480–1500.
[52] Piotr Dollar, Christian Wojek, Bernt Schiele, Pietro Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2011) 743–761.
[53] Markus Enzweiler, Dariu M. Gavrila, Monocular pedestrian detection: Survey and experiments, IEEE Trans. Pattern Anal. Mach. Intell. 31 (12) (2008) 2179–2195.
[54] David Geronimo, Antonio M. Lopez, Angel D. Sappa, Thorsten Graf, Survey of pedestrian detection for advanced driver assistance systems, IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2009) 1239–1258.
[55] Zehang Sun, George Bebis, Ronald Miller, On-road vehicle detection: A review, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 694–711.
[56] Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang, Chao Gao, Object class detection: A survey, ACM Comput. Surv. 46 (1) (2013) 1–53.
[57] Licheng Jiao, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, Rong Qu, A survey of deep learning-based object detection, IEEE Access 7 (2019) 128837–128868.
[58] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, Xindong Wu, Object detection with deep learning: A review, IEEE Trans. Neural Netw. Learn. Syst. 30 (11) (2019) 3212–3232.
[59] Zehang Sun, George Bebis, Ronald Miller, On-road vehicle detection: A review, IEEE Trans. Pattern Anal. Mach. Intell. 28 (5) (2006) 694–711.
[60] Jean Ponce, Martial Hebert, Cordelia Schmid, Andrew Zisserman, Toward Category-Level Object Recognition, vol. 4170, Springer, 2007.
[61] Sven J. Dickinson, Aleš Leonardis, Bernt Schiele, Michael J. Tarr, Object Categorization: Computer and Human Vision Perspectives, Cambridge University Press, 2009.
[62] Carolina Galleguillos, Serge Belongie, Context based object categorization: A critical survey, Comput. Vis. Image Underst. 114 (6) (2010) 712–722.
[63] Kristen Grauman, Bastian Leibe, Visual object recognition, Synth. Lect. Artif. Intell. Mach. Learn. 5 (2) (2011) 1–181.
[64] Alexander Andreopoulos, John K. Tsotsos, 50 years of object recognition: Directions forward, Comput. Vis. Image Underst. 117 (8) (2013) 827–891.
[65] Yoshua Bengio, Aaron Courville, Pascal Vincent, Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828.
[66] Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang, Jia Li, Salient object detection: A survey, Comput. Vis. Media (2019) 1–34.
[67] Yali Li, Shengjin Wang, Qi Tian, Xiaoqing Ding, Feature representation for statistical-learning-based object detection: A review, Pattern Recognit. 48 (11) (2015) 3542–3559.
[68] Yann LeCun, Yoshua Bengio, Geoffrey Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[69] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, Clara I Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88.
[70] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al., Recent advances in convolutional neural networks, Pattern Recognit. 77 (2018) 354–377.
[71] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, Jieping Ye, Object detection in 20 years: A survey, 2019, arXiv preprint arXiv:1905.05055.
[72] Paul Viola, Michael Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, vol. 1, IEEE, 2001, pp. I–I.
[73] Paul Viola, Michael J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[74] Yoav Freund, Robert Schapire, Naoki Abe, A short introduction to boosting, J. Japanese Soc. Artif. Intell. 14 (771–780) (1999) 1612.
[75] Pedro Felzenszwalb, David McAllester, Deva Ramanan, A discriminatively trained, multiscale, deformable part model, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
[76] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, Cascade object detection with deformable part models, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 2241–2248.
[77] Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros, Ensemble of exemplar-svms for object detection and beyond, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 89–96.
[78] Ross B. Girshick, Pedro F. Felzenszwalb, David A. Mcallester, Object detection with grammar models, in: Advances in Neural Information Processing Systems, 2011, pp. 442–450.
[79] Ross Brook Girshick, From Rigid Templates to Grammars: Object Detection with Structured Models, Citeseer, 2012.
[80] Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann, Support vector machines for multiple-instance learning, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15, MIT Press, 2003, pp. 577–584, http://papers.nips.cc/paper/2232-support-vector-machines-for-multiple-instance-learning.pdf.
[81] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, Yann LeCun, Overfeat: Integrated recognition, localization and detection using convolutional networks, 2013, arXiv preprint arXiv:1312.6229.
[82] Joseph Redmon, Ali Farhadi, Yolov3: An incremental improvement, 2018, arXiv preprint arXiv:1804.02767.
[83] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[84] Hei Law, Jia Deng, Cornernet: Detecting objects as paired keypoints, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
[85] Xingyi Zhou, Dequan Wang, Philipp Krähenbühl, Objects as points, 2019, arXiv preprint arXiv:1904.07850.
[86] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian, Centernet: Object detection with keypoint triplets, 2019, arXiv preprint arXiv:1904.08189 1(2), 4.
[87] Jasper R.R. Uijlings, Koen E.A. Van De Sande, Theo Gevers, Arnold W.M. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[88] Jim Kleban, Xing Xie, Wei-Ying Ma, Spatial pyramid mining for logo detection in natural scenes, in: 2008 IEEE International Conference on Multimedia and Expo, IEEE, 2008, pp. 1077–1080.
[89] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[90] C. Lawrence Zitnick, Piotr Dollar, Edge boxes: Locating object proposals from edges, in: European Conference on Computer Vision, Springer, 2014, pp. 391–405.
[91] Jifeng Dai, Yi Li, Kaiming He, Jian Sun, R-fcn: Object detection via region-based fully convolutional networks, in: Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[92] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C Lawrence Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[93] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, Yichen Wei, Relation networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
[94] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei, Deformable convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
[95] Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le, Nas-fpn: Learning scalable feature pyramid architecture for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045.
[96] Bogdan Alexe, Thomas Deselaers, Vittorio Ferrari, Measuring the objectness of image windows, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2189–2202.
[97] Esa Rahtu, Juho Kannala, Matthew Blaschko, Learning a category independent object detection cascade, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 1052–1059.
[98] Santiago Manen, Matthieu Guillaumin, Luc Van Gool, Prime object proposals with randomized prim’s algorithm, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2536–2543.
[99] Joao Carreira, Cristian Sminchisescu, CPMC: Automatic object segmentation using constrained parametric min-cuts, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2011) 1312–1328.
[100] Ian Endres, Derek Hoiem, Category-independent object proposals with diverse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2013) 222–234.
[101] Chenchen Zhu, Ran Tao, Khoa Luu, Marios Savvides, Seeing small faces from robust anchor’s perspective, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5127–5136.
[102] Lele Xie, Yuliang Liu, Lianwen Jin, Zecheng Xie, DeRPN: Taking a further step toward more general object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9046–9053.
[103] Lachlan Tychsen-Smith, Lars Petersson, Denet: Scalable real-time object detection with directed sparse sampling, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 428–436.
[104] Chenchen Zhu, Yihui He, Marios Savvides, Feature selective anchor-free module for single-shot object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
[105] Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian, Centernet: Keypoint triplets for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569–6578.
[106] Ross Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[107] Bharat Singh, Larry S. Davis, An analysis of scale invariance in object detection snip, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3578–3587.
[108] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, Nuno Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: European Conference on Computer Vision, Springer, 2016, pp. 354–370.
[109] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, Xiangyang Xue, Dsod: Learning deeply supervised object detectors from scratch, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1919–1927.
[110] Songtao Liu, Di Huang, et al., Receptive field block net for accurate and fast object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 385–400.
[111] Jimmy Ren, Xiaohao Chen, Jianbo Liu, Wenxiu Sun, Jiahao Pang, Qiong Yan, Yu-Wing Tai, Li Xu, Accurate single stage detector using recurrent rolling convolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5420–5428.
[112] Jisoo Jeong, Hyojin Park, Nojun Kwak, Enhancement of SSD by concatenating feature maps for object detection, 2017, arXiv preprint arXiv:1705.09587.
[113] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, Alexander C Berg, Dssd: Deconvolutional single shot detector, 2017, arXiv preprint arXiv:1701.06659.
[114] Sanghyun Woo, Soonmin Hwang, In So Kweon, Stairnet: Top-down semantic aggregation for accurate one shot detection, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1093–1102.
V. Sharma and R.N. Mir / Computer Science Review 38 (2020) 100301 25
[115] Hongyang Li, Yu Liu, Wanli Ouyang, Xiaogang Wang, Zoom out-and-in network with recursive training for object proposal, 2017, arXiv preprint arXiv:1702.05711.
[116] Tao Kong, Fuchun Sun, Chuanqi Tan, Huaping Liu, Wenbing Huang, Deep feature pyramid reconfiguration for object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 169–185.
[117] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, Haibin Ling, M2det: A single-shot object detector based on multi-level feature pyramid network, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9259–9266.
[118] Carolina Galleguillos, Serge Belongie, Context based object categorization: A critical survey, Comput. Vis. Image Underst. 114 (6) (2010) 712–722.
[119] Pedro Felzenszwalb, Ross Girshick, David McAllester, Deva Ramanan, Discriminatively trained mixtures of deformable part models, PASCAL VOC Challenge (2008).
[120] Wanli Ouyang, Xiaogang Wang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Chen-Change Loy, et al., Deepid-net: Deformable deep convolutional neural networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2403–2412.
[121] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai, Deformable convnets v2: More deformable, better results, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316.
[122] Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik, Deformable part models are convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 437–446.
[123] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun, Detnet: A backbone network for object detection, 2018, arXiv preprint arXiv:1804.06215.
[124] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[125] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv preprint arXiv:1704.04861.
[126] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
[127] François Chollet, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
[128] C.X. Ling R.J. Wang, Pelee: A real-time object detection system on mobile devices, Adv. Neural Inf. Process. Syst. (2018) 1963–1972.
[129] Emmanuel J. Candès, Xiaodong Li, Yi Ma, John Wright, Robust principal component analysis?, J. ACM 58 (3) (2011) 1–37.
[130] Karen Simonyan, Andrew Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[131] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[132] Herbert Robbins, Sutton Monro, A stochastic approximation method, Ann. Math. Stat. (1951) 400–407.
[133] Diederik P. Kingma, Jimmy Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[134] Vinod Nair, Geoffrey E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[135] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Identity mappings in deep residual networks, in: European Conference on Computer Vision, Springer, 2016, pp. 630–645.
[136] Sergey Ioffe, Christian Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015, arXiv preprint arXiv:1502.03167.
[137] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, Kilian Q Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[138] Hanliang Jiang, Fei Gao, Xingxin Xu, Fei Huang, Suguo Zhu, Attentive and ensemble 3D dual path networks for pulmonary nodules classification, Neurocomputing (2019).
[139] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[140] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alexander A Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[141] Alejandro Newell, Kaiyu Yang, Jia Deng, Stacked hourglass networks for human pose estimation, in: European Conference on Computer Vision, Springer, 2016, pp. 483–499.
[142] Bharat Singh, Mahyar Najibi, Larry S. Davis, SNIPER: Efficient multi-scale training, in: Advances in Neural Information Processing Systems, 2018, pp. 9310–9320.
[143] Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, Jifeng Dai, Learning region features for object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 381–395.
[144] Spyros Gidaris, Nikos Komodakis, Locnet: Improving localization accuracy for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 789–798.
[145] Sergey Zagoruyko, Adam Lerer, Tsung-Yi Lin, Pedro O. Pinheiro, Sam Gross, Soumith Chintala, Piotr Dollár, A multipath network for object detection, 2016, arXiv preprint arXiv:1604.02135.
[146] Lachlan Tychsen-Smith, Lars Petersson, Improving object localization with fitness nms and bounded iou loss, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6877–6885.
[147] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, Junjie Yan, Grid r-cnn, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7363–7372.
[148] Bin Yang, Junjie Yan, Zhen Lei, Stan Z. Li, Craft objects from images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 6043–6051.
[149] Bowen Cheng, Yunchao Wei, Honghui Shi, Rogerio Feris, Jinjun Xiong, Thomas Huang, Revisiting rcnn: On awakening the classification power of faster rcnn, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 453–468.
[150] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative adversarial networks, in: Annual Conference on Neural Information Processing Systems (NeurIPS), 2014, pp. 2672–2680.
[151] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[152] Alec Radford, Luke Metz, Soumith Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2015, arXiv preprint arXiv:1511.06434.
[153] Andrew Brock, Jeff Donahue, Karen Simonyan, Large scale gan training for high fidelity natural image synthesis, 2018, arXiv preprint arXiv:1809.11096.
[154] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, Shuicheng Yan, Perceptual generative adversarial networks for small object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1222–1230.
[155] Xiaolong Wang, Abhinav Shrivastava, Abhinav Gupta, A-fast-rcnn: Hard positive generation via adversary for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2606–2615.
[156] Zhiqiang Shen, Honghui Shi, Jiahui Yu, Hai Phan, Rogerio Feris, Liangliang Cao, Ding Liu, Xinchao Wang, Thomas Huang, Marios Savvides, Improving object detection from scratch via gated feature reuse, 2017, arXiv preprint arXiv:1712.00886.
[157] Kaiming He, Ross Girshick, Piotr Dollár, Rethinking imagenet pre-training, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4918–4927.
[158] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, Distilling the knowledge in a neural network, 2015, arXiv preprint arXiv:1503.02531.
[159] Quanquan Li, Shengying Jin, Junjie Yan, Mimicking very efficient network for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6356–6364.
[160] Navaneeth Bodla, Bharat Singh, Rama Chellappa, Larry S. Davis, Soft-NMS–improving object detection with one line of code, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5561–5569.
[161] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al., Speed/accuracy trade-offs for modern convolutional object detectors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7310–7311.
[162] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, Jian Sun, Light-head r-cnn: In defense of two-stage object detector, 2017, arXiv preprint arXiv:1711.07264.
[163] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[164] Alexander Wong, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl, Tiny SSD: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection, in: 2018 15th Conference on Computer and Robot Vision, CRV, IEEE, 2018, pp. 95–101.
[165] Yuxi Li, Jiuwei Li, Weiyao Lin, Jianguo Li, Tiny-dsod: Lightweight object detection for resource-restricted usages, 2018, arXiv preprint arXiv:1807.11013.
[166] Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in: International Conference on Machine Learning, 2016, pp. 2217–2225.
[167] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, Dongjun Shin, Compression of deep convolutional neural networks for fast and low power mobile applications, 2015, arXiv preprint arXiv:1511.06530.
[168] Yihui He, Xiangyu Zhang, Jian Sun, Channel pruning for accelerating very deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
[169] Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev, Compressing deep convolutional networks using vector quantization, 2014, arXiv preprint arXiv:1412.6115.
[170] Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J Dally, Deep gradient compression: Reducing the communication bandwidth for distributed training, 2017, arXiv preprint arXiv:1712.01887.
[171] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, Jian Cheng, Quantized convolutional neural networks for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4820–4828.
[172] Song Han, Huizi Mao, William J Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, 2015, arXiv preprint arXiv:1510.00149.
[173] Song Han, Jeff Pool, John Tran, William Dally, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[174] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, Stan Z. Li, Single-shot refinement neural network for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.
[175] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[176] Saining Xie, Zhuowen Tu, Holistically-nested edge detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1395–1403.
[177] Mahyar Najibi, Mohammad Rastegari, Larry S. Davis, G-cnn: an iterative grid based object detector, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2369–2377.
[178] Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee, Improving object detection with deep convolutional networks via bayesian optimization and structured prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 249–258.
[179] Fan Yang, Wongun Choi, Yuanqing Lin, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2129–2137.
[180] Samarth Brahmbhatt, Henrik I. Christensen, James Hays, Stuffnet: Using ‘stuff’ to improve object detection, in: 2017 IEEE Winter Conference on Applications of Computer Vision, WACV, IEEE, 2017, pp. 934–943.
[181] Shaoqing Ren, Kaiming He, Ross Girshick, Xiangyu Zhang, Jian Sun, Object detection networks on convolutional feature maps, IEEE Trans. Pattern Anal. Mach. Intell. 39 (7) (2016) 1476–1481.
[182] Spyros Gidaris, Nikos Komodakis, Object detection via a multi-region and semantic segmentation-aware cnn model, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1134–1142.
[183] Tao Kong, Anbang Yao, Yurong Chen, Fuchun Sun, Hypernet: Towards accurate region proposal generation and joint object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 845–853.
[184] Abhinav Shrivastava, Abhinav Gupta, Ross Girshick, Training region-based object detectors with online hard example mining, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[185] Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick, Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[186] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, Jieping Ye, Object detection in 20 years: A survey, 2019, arXiv preprint arXiv:1905.05055.
[187] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, Yi Xu, Scale-transferrable object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 528–537.
[188] Buyu Li, Yu Liu, Xiaogang Wang, Gradient harmonized single-stage detector, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8577–8584.
[189] Zhi Tian, Chunhua Shen, Hao Chen, Tong He, Fcos: Fully convolutional one-stage object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9627–9636.
[190] Xingyi Zhou, Jiacheng Zhuo, Philipp Krahenbuhl, Bottom-up object detection by grouping extreme and center points, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 850–859.
[191] Yousong Zhu, Chaoyang Zhao, Jinqiao Wang, Xu Zhao, Yi Wu, Hanqing Lu, Couplenet: Coupling global structure with local parts for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4126–4134.
[192] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, Navaneeth Bodla, Rama Chellappa, Deep regionlets for object detection, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 798–814.
[193] Zhe Chen, Shaoli Huang, Dacheng Tao, Context refinement for object detection, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 71–86.
[194] Zhaowei Cai, Nuno Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
[195] Yanghao Li, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang, Scale-aware trident networks for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6054–6063.
[196] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[197] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al., The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale, 2018, arXiv preprint arXiv:1811.00982.
[198] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, Qinghua Hu, Vision meets drones: A challenge, 2018, arXiv preprint arXiv:1804.07437.
[199] Andreas Ess, Bastian Leibe, Luc Van Gool, Depth and appearance for mobile scene analysis, in: 2007 IEEE 11th International Conference on Computer Vision, IEEE, 2007, pp. 1–8.
[200] Andreas Geiger, Philip Lenz, Christoph Stiller, Raquel Urtasun, Vision meets robotics: The kitti dataset, Int. J. Robot. Res. 32 (11) (2013) 1231–1237.
[201] Christian Wojek, Stefan Walk, Bernt Schiele, Multi-cue onboard pedestrian detection, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 794–801.
[202] Andreas Geiger, Philip Lenz, Raquel Urtasun, Are we ready for autonomous driving? The kitti vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354–3361.
[203] Shanshan Zhang, Rodrigo Benenson, Bernt Schiele, Citypersons: A diverse dataset for pedestrian detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3213–3221.
[204] Xiaofei Li, Fabian Flohr, Yue Yang, Hui Xiong, Markus Braun, Shuyue Pan, Keqiang Li, Dariu M Gavrila, A new benchmark for vision-based cyclist detection, in: 2016 IEEE Intelligent Vehicles Symposium, IV, IEEE, 2016, pp. 1028–1033.
[205] Vidit Jain, Erik Learned-Miller, Fddb: A Benchmark for Face Detection in Unconstrained Settings, Technical Report, UMass Amherst Technical Report, 2010.
[206] Shuo Yang, Ping Luo, Chen-Change Loy, Xiaoou Tang, Wider face: A face detection benchmark, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
[207] Ran He, Xiang Wu, Zhenan Sun, Tieniu Tan, Wasserstein cnn: Learning invariant features for nir-vis face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 41 (7) (2018) 1761–1773.
[208] Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, Hongsheng Li, Adacos: Adaptively scaling cosine logits for effectively learning deep face representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10823–10832.
[209] Yu Liu, Hongyang Li, Xiaogang Wang, Rethinking feature discrimination and polymerization for large-scale recognition, 2017, arXiv preprint arXiv:1710.00870.
[210] Rajeev Ranjan, Carlos D Castillo, Rama Chellappa, L2-constrained softmax loss for discriminative face verification, 2017, arXiv preprint arXiv:1703.09507.
[211] Feng Wang, Xiang Xiang, Jian Cheng, Alan Loddon Yuille, Normface: L2 hypersphere embedding for face verification, in: Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 1041–1049.
[212] Yuwei Guo, Licheng Jiao, Shuang Wang, Shuo Wang, Fang Liu, Fuzzy sparse autoencoder framework for single image per person face recognition, IEEE Trans. Cybern. 48 (8) (2017) 2402–2415.
[213] Markus Braun, Sebastian Krebs, Fabian Flohr, Dariu M. Gavrila, Eurocity persons: A novel benchmark for person detection in traffic scenes, IEEE Trans. Pattern Anal. Mach. Intell. 41 (8) (2019) 1844–1861.
[214] Zhaowei Cai, Mohammad Javad Saberian, Nuno Vasconcelos, Learning complexity-aware cascades for pedestrian detection, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[215] Mohammad Javad Saberian, Nuno Vasconcelos, Learning optimal embedded cascades, IEEE Trans. Pattern Anal. Mach. Intell. 34 (10) (2012) 2005–2018.
[216] Piotr Dollár, Ron Appel, Serge Belongie, Pietro Perona, Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1532–1545.
[217] Song Liu, Makoto Yamada, Nigel Collier, Masashi Sugiyama, Change-point detection in time-series data by relative density-ratio estimation, Neural Netw. 43 (2013) 72–83.
[218] Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Boedihardjo, Crystal Chen, Susan Frankenstein, Grammarviz 3.0: Interactive discovery of variable-length time series patterns, ACM Trans. Knowl. Dis. Data (TKDD) 12 (1) (2018) 1–28.
[219] Meng Jiang, Alex Beutel, Peng Cui, Bryan Hooi, Shiqiang Yang, Christos Faloutsos, A general suspiciousness metric for dense blocks in multimodal data, in: 2015 IEEE International Conference on Data Mining, IEEE, 2015, pp. 781–786.
[220] Elizabeth Wu, Wei Liu, Sanjay Chawla, Spatio-temporal outlier detection in precipitation data, in: International Workshop on Knowledge Discovery from Sensor Data, Springer, 2008, pp. 115–133.
[221] Björn Barz, Erik Rodner, Yanira Guanche Garcia, Joachim Denzler, Detecting regions of maximal divergence for spatio-temporal anomaly detection, IEEE Trans. Pattern Anal. Mach. Intell. 41 (5) (2018) 1088–1101.
[222] Gong Cheng, Junwei Han, A survey on object detection in optical remote sensing images, ISPRS J. Photogramm. Remote Sens. 117 (2016) 11–28.
[223] Palaiahnakote Shivakumara, Dongqi Tang, Maryam Asadzadehkaljahi, Tong Lu, Umapada Pal, Mohammad Hossein Anisi, CNN-RNN based method for license plate recognition, CAAI Trans. Intell. Technol. 3 (3) (2018) 169–175.
[224] Muhammad Sarfraz, Mohammed Jameel Ahmed, An approach to license plate recognition system using neural network, in: Exploring Critical Approaches of Evolutionary Computation, IGI Global, 2019, pp. 20–36.
[225] Hui Li, Peng Wang, Chunhua Shen, Toward end-to-end car license plate detection and recognition with deep neural networks, IEEE Trans. Intell. Transp. Syst. 20 (3) (2018) 1126–1136.
[226] Jinxing Qian, Bo Qu, Fast license plate recognition method based on competitive neural network, in: 2018 3rd International Conference on Communications, Information Management and Network Security, CIMNS 2018, Atlantis Press, 2018.
[227] Rayson Laroca, Evair Severo, Luiz A Zanlorensi, Luiz S Oliveira, Gabriel Resende Gonçalves, William Robson Schwartz, David Menotti, A robust real-time automatic license plate recognition based on the YOLO detector, in: 2018 International Joint Conference on Neural Networks, IJCNN, IEEE, 2018, pp. 1–10.
[228] Xibin Song, Peng Wang, Dingfu Zhou, Rui Zhu, Chenye Guan, Yuchao Dai, Hao Su, Hongdong Li, Ruigang Yang, Apollocar3d: A large 3d car instance understanding benchmark for autonomous driving, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5452–5462.
[229] Koyel Banerjee, Dominik Notz, Johannes Windelen, Sumanth Gavarraju, Mingkang He, Online camera lidar fusion and object detection on hybrid data for autonomous driving, in: 2018 IEEE Intelligent Vehicles Symposium, IV, IEEE, 2018, pp. 1632–1638.
[230] Jia Li, Zengfu Wang, Real-time traffic sign recognition based on efficient CNNs in the wild, IEEE Trans. Intell. Transp. Syst. 20 (3) (2018) 975–984.
[231] T. Arinaga T. Moritani, Traffic sign recognition system, US Patent 9,865,165, 2018.
[232] Sara Khalid, Nazeer Muhammad, Muhammad Sharif, Automatic measurement of the traffic sign with digital segmentation and recognition, IET Intell. Transp. Syst. 13 (2) (2018) 269–279.
[233] A. lvarez Garcıa, J.A.A. Arcos-Garcıa, L.M. Soria-Morillo, Deep neural network for traffic sign recognition systems: An analysis of spatial transformers and stochastic optimisation methods, Neural Netw. 99 (2018) 158–165.
[234] Dong Li, Dongbin Zhao, Yaran Chen, Qichao Zhang, Deepsign: Deep learning based traffic sign recognition, in: 2018 International Joint Conference on Neural Networks, IJCNN, IEEE, 2018, pp. 1–6.
[235] Bo-Xun Wu, Pin-Yu Wang, Yi-Ta Yang, Jiun-In Guo, Traffic sign recognition with light convolutional networks, in: 2018 IEEE International Conference on Consumer Electronics-Taiwan, ICCE-TW, IEEE, 2018, pp. 1–2.
[236] Shuren Zhou, Wenlong Liang, Junguo Li, Jeong-Uk Kim, Improved VGG model for road traffic sign recognition, Comput. Mater. Continua 57 (1) (2018) 11–24.
[237] Xianjie Chen, Alan L Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, in: Advances in Neural Information Processing Systems, 2014, pp. 1736–1744.
[238] Xiaochuan Fan, Kang Zheng, Yuewei Lin, Song Wang, Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1347–1355.
[239] Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid, Lcr-net++: Multi-person 2d and 3d pose detection in natural images, IEEE Trans. Pattern Anal. Mach. Intell. 42 (5) (2019) 1146–1161.
[240] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, Jian Sun, Cascaded pyramid network for multi-person pose estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7103–7112.
[241] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy, Towards accurate multi-person pose estimation in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4903–4911.
[242] Bin Xiao, Haiping Wu, Yichen Wei, Simple baselines for human pose estimation and tracking, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 466–481.
[243] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh, Convolutional pose machines, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
[244] Zhuoling Li, Minghui Dong, Shiping Wen, Xiang Hu, Pan Zhou, Zhigang Zeng, CLU-CNNs: Object detection for medical images, Neurocomputing 350 (2019) 53–59.
[245] Zhenguo Yang, Qing Li, Liu Wenyin, Jianming Lv, Shared multi-view data representation for multi-domain event detection, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[246] Yanxiang Wang, Hari Sundaram, Lexing Xie, Social event detection with interaction graph modeling, in: Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 865–868.
[247] Manos Schinas, Symeon Papadopoulos, Georgios Petkos, Yiannis Kompatsiaris, Pericles A. Mitkas, Multimodal graph-based event detection and summarization in social media streams, in: Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 189–192.
[248] Olivier Teboul, Iasonas Kokkinos, Loic Simon, Panagiotis Koutsourakis, Nikos Paragios, Shape grammar parsing via reinforcement learning, in: CVPR 2011, IEEE, 2011, pp. 2273–2280.
[249] Peng Zhao, Tian Fang, Jianxiong Xiao, Honghui Zhang, Qinping Zhao, Long Quan, Rectilinear parsing of architecture in urban environment, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 342–349.
[250] Sam Friedman, Ioannis Stamos, Online detection of repeated structures in point clouds of urban scenes for compression and registration, Int. J. Comput. Vis. 102 (1–3) (2013) 112–128.
[251] Chao-Hui Shen, Shi-Sheng Huang, Hongbo Fu, Shi-Min Hu, Adaptive partitioning of urban facades, ACM Trans. Graph. 30 (6) (2011) 1–10.
[252] Grant Schindler, Panchapagesan Krishnamurthy, Roberto Lublinerman, Yanxi Liu, Frank Dellaert, Detecting and matching repeated patterns for automatic geo-tagging in urban environments, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–7.
[253] Changchang Wu, Jan-Michael Frahm, Marc Pollefeys, Detecting large repetitive structures with salient boundaries, in: European Conference on Computer Vision, Springer, 2010, pp. 142–155.
[254] Chao-Hui Shen, Shi-Sheng Huang, Hongbo Fu, Shi-Min Hu, Image-based procedural modeling of facades, ACM Trans. Graph. 26 (2007) 85–95.
[255] Olga Barinova, Victor Lempitsky, Elena Tretiak, Pushmeet Kohli, Geometric image parsing in man-made environments, in: European Conference on Computer Vision, Springer, 2010, pp. 57–70.
[256] Mateusz Kozinski, Raghudeep Gadde, Sergey Zagoruyko, Guillaume Obozinski, Renaud Marlet, A MRF shape prior for facade parsing with occlusions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2820–2828.
[257] Andrea Cohen, Alexander G. Schwing, Marc Pollefeys, Efficient structured parsing of facades using dynamic programming, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3206–3213.
[258] Silvia Gandy, Benjamin Recht, Isao Yamada, Tensor completion and low-n-rank tensor recovery via convex optimization, Inverse Problems 27 (2) (2011) 025010.
[259] Ji Liu, Przemyslaw Musialski, Peter Wonka, Jieping Ye, Tensor completion for estimating missing values in visual data, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2012) 208–220.
[260] Juan Liu, Emmanouil Z. Psarakis, Yang Feng, Ioannis Stamos, A kronecker product model for repeated pattern detection on 2d urban images, IEEE Trans. Pattern Anal. Mach. Intell. 41 (9) (2018) 2266–2272.
[261] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show [284] Xiangteng He, Yuxin Peng, Junjie Zhao, Fine-grained discriminative local-
and tell: A neural image caption generator, in: Proceedings of the ization via saliency-guided faster R-CNN, in: Proceedings of the 25th ACM
IEEE Conference on Computer Vision and Pattern Recognition, 2015, International Conference on Multimedia, 2017, pp. 627–635.
pp. 3156–3164. [285] Xiangteng He, Yuxin Peng, Junjie Zhao, Fast fine-grained image classifica-
[262] Jiuxiang Gu, Jianfei Cai, Gang Wang, Tsuhan Chen, Stack-captioning: tion via weakly supervised discriminative localization, IEEE Trans. Circuits
Coarse-to-fine learning for image captioning, in: Thirty-Second AAAI Syst. Video Technol. 29 (5) (2018) 1394–1407.
Conference on Artificial Intelligence, 2018. [286] Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, Tiejun Huang,
[263] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Bi-directional cascade network for perceptual edge detection, in: Proceed-
Ruslan Salakhudinov, Rich Zemel, Yoshua Bengio, Show, attend and tell: ings of the IEEE Conference on Computer Vision and Pattern Recognition,
Neural image caption generation with visual attention, in: International 2019, pp. 3828–3837.
Conference on Machine Learning, 2015, pp. 2048–2057.
Glossary
Average Precision: It is a popular metric for measuring the accuracy of object detectors such as Faster R-CNN. It computes the average precision value over recall values from 0 to 1.
Precision: It measures how accurate our predictions are, i.e., the fraction of our predictions that are correct.
Recall: It measures how well we find all the positives. For example, we may find 60% of all the positive cases in our total M predictions.
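The Precision, Recall and Average Precision entries above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that detections have already been matched to the ground truth (each one flagged as a true or false positive) and sorted by descending confidence. It uses all-point interpolation of the precision-recall curve, which is only one of several AP conventions in use.

```python
def precision_recall(num_true_positives, num_predictions, num_ground_truth):
    """Precision = TP / predictions, recall = TP / ground-truth positives."""
    precision = num_true_positives / num_predictions if num_predictions else 0.0
    recall = num_true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """AP from true-positive flags of detections sorted by descending score.

    Assumption: matching to the ground truth was done beforehand.
    """
    if num_ground_truth == 0 or not is_true_positive:
        return 0.0
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


if __name__ == "__main__":
    print(precision_recall(6, 8, 10))
    print(average_precision([True, False, True], num_ground_truth=2))
```

In the first call, 6 of 8 predictions are correct and 6 of 10 positives are found, which corresponds to a precision of 75% and a recall of 60%.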
IoU: It stands for intersection over union. It measures the overlap between two boundaries, and is used in particular to measure how much our predicted boundary overlaps with the ground truth.
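As an illustration of the IoU entry above, the following Python sketch computes the overlap of two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    ground_truth = (1, 1, 3, 3)
    print(iou(predicted, ground_truth))  # overlap area 1, union area 7
```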
NMS: It stands for non-maximum suppression. It is a technique for filtering out region proposals based on some criteria, typically by keeping the highest-scoring proposal and discarding the others that overlap it heavily.
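A common form of the technique in the NMS entry above is greedy suppression by score and overlap. The sketch below is one such variant, not the only one; the function names, the box layout `(x_min, y_min, x_max, y_max)` and the default threshold of 0.5 are assumptions made for illustration.

```python
def _iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes that are kept."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In this usage example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box does not overlap either and is kept.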
FPS: It stands for frames per second. It is the rate at which consecutive images appear on the display.
DCN: It stands for deformable convolution network. It consists of two parts: a regular convolution layer and another layer that learns a 2D offset for each input.
SIFT: It stands for scale-invariant feature transform. It is a feature detection technique in computer vision used to detect and describe local image features.
MSCOCO: It stands for Microsoft Common Objects in Context. It is a dataset containing images of 91 object types with a total of 2.5 million labeled instances in 328K images.
SURF: It stands for speeded-up robust features. It is a local feature detector and descriptor that can be used for tasks such as object recognition, 3D reconstruction and classification. Its feature descriptor is based on the sum of Haar wavelet responses around points of interest.
NAS-FPN: It stands for neural architecture search feature pyramid network. It consists of a combination of top-down and bottom-up connections to fuse features across different scales.
HOG: It stands for histogram of oriented gradients. It is a feature descriptor used in image processing and computer vision for the task of object detection. It counts occurrences of gradient orientations in localized portions of an image.
ResNet: It stands for residual network. It makes it possible to train up to hundreds or even thousands of layers and still achieve remarkable performance.
SVM: It stands for support vector machine. It is a supervised machine learning model that uses classification algorithms for two-group classification problems.
ReLU: It stands for rectified linear unit. It is one of the most commonly used activation functions in deep learning models. It returns 0 if it receives a negative input, and for a positive input "x" it returns the same value.
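The behaviour described in the ReLU entry above fits in a few lines of Python. The function name `relu` is chosen for this sketch and operates on a single scalar.

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, x itself otherwise."""
    return max(0.0, x)


if __name__ == "__main__":
    for value in (-2.5, 0.0, 3.0):
        print(value, relu(value))
```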
DPM: It stands for deformable part model detector. It is a learning-based object detector that models an object as a set of parts whose relative positions are allowed to deform.
VGG: It stands for visual geometry group. It is an object recognition model that supports up to 19 layers. Instead of using a large receptive field like AlexNet, VGG uses very small receptive fields (3 x 3 with a stride of 1).
ROI Pooling: It stands for region of interest pooling. It uses a single feature map for all the proposals generated by the RPN in a single pass. It solves the problem of the fixed image size requirement for object detection.
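To make the ROI Pooling entry above concrete, the following Python sketch max-pools one region of a single-channel feature map into a fixed-size grid, so that regions of different sizes yield outputs of the same shape. It is a simplified illustration under stated assumptions: the ROI is given in feature-map cells as `(row_min, col_min, row_max, col_max)` with exclusive maxima, bin edges use floor and ceiling division, and the names `roi_max_pool` and `_ceil_div` are hypothetical.

```python
def _ceil_div(a, b):
    """Ceiling of a / b for non-negative integers."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one ROI of a 2-D feature map into a fixed-size grid."""
    row_min, col_min, row_max, col_max = roi
    out_h, out_w = output_size
    height, width = row_max - row_min, col_max - col_min
    if height < 1 or width < 1:
        raise ValueError("ROI must cover at least one feature-map cell")
    pooled = []
    for i in range(out_h):
        # Row range of the i-th bin (floor for the start, ceiling for the end).
        r_start = row_min + (i * height) // out_h
        r_end = row_min + _ceil_div((i + 1) * height, out_h)
        row = []
        for j in range(out_w):
            c_start = col_min + (j * width) // out_w
            c_end = col_min + _ceil_div((j + 1) * width, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fmap = [[0, 1, 2, 3],
            [4, 5, 6, 7],
            [8, 9, 10, 11],
            [12, 13, 14, 15]]
    print(roi_max_pool(fmap, (0, 0, 4, 4)))  # 4 x 4 region -> 2 x 2 output
    print(roi_max_pool(fmap, (1, 0, 4, 3)))  # 3 x 3 region -> 2 x 2 output
```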
SGD: It stands for stochastic gradient descent. It is an iterative technique for optimizing an objective function with suitable smoothness properties.
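The iterative update in the SGD entry above can be sketched on a toy problem: fitting the slope w of the model y = w * x under a squared-error loss, one sample at a time. The function name, learning rate, epoch count and data are assumptions chosen only for this illustration.

```python
import random


def sgd_fit_slope(xs, ys, learning_rate=0.01, epochs=200, seed=0):
    """Fit w in y = w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order
        for i in indices:
            # Gradient of (w * x - y)^2 with respect to w for one sample.
            gradient = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            w -= learning_rate * gradient
    return w


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2
    print(sgd_fit_slope(xs, ys))
```

Each update uses the gradient of a single sample rather than of the whole dataset, which is what distinguishes the stochastic variant from plain gradient descent.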
CNN: It stands for convolutional neural network. It is a deep learning algorithm that can take an input image, assign importance to various aspects of the image and differentiate one from another.
DPN: It stands for dual path network. It is an image classification model that combines the advantages of both ResNet and DenseNet and also outperforms them on the image classification task.
R-CNN: It stands for region-based convolutional neural network. It allows us to work with about 2000 image regions instead of trying to classify a huge number of regions, thereby reducing the total number of computations required.
RPN: It stands for region proposal network. It is a neural network used for generating proposals for object detection in Faster R-CNN.
R-FCN: It stands for region-based fully convolutional network. It is used for accurate and efficient object detection. It is a region-based detector that is fully convolutional, with almost all the computation shared over the entire image.
ILSVRC: It stands for ImageNet Large Scale Visual Recognition Challenge. It is an annual computer vision competition developed on a subset of a publicly available dataset called ImageNet. It evaluates algorithms for object detection and image classification at large scale.
FPN: It stands for feature pyramid network. It is a top-down architecture with lateral connections, developed for building high-level semantic feature maps at all scales. It has achieved a remarkable improvement as a generic feature extractor in many applications.
SSD: It stands for single shot multibox detector. It is a method for detecting objects in images using a single deep neural network. It discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
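The default boxes mentioned in the SSD entry above can be sketched as follows. For every cell of a square feature map, the sketch emits one box per aspect ratio, in normalized `(center_x, center_y, width, height)` form. The function name, the example scale of 0.2 and the three aspect ratios are assumptions for illustration; the additional intermediate-scale box for aspect ratio 1 is omitted for brevity.

```python
import math


def default_boxes(feature_map_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h), normalized to [0, 1], for one feature map."""
    boxes = []
    for i in range(feature_map_size):
        for j in range(feature_map_size):
            # Box centers sit at the centers of the feature map cells.
            cx = (j + 0.5) / feature_map_size
            cy = (i + 0.5) / feature_map_size
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)
                height = scale / math.sqrt(ratio)
                boxes.append((cx, cy, width, height))
    return boxes


if __name__ == "__main__":
    boxes = default_boxes(feature_map_size=2, scale=0.2)
    print(len(boxes))  # 2 * 2 cells, 3 aspect ratios each
    print(boxes[0])
```

Because width times height equals the squared scale for every ratio, all boxes at one location cover the same area and differ only in shape.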
GAN: It stands for generative adversarial network. It is a network that, given a training dataset, learns to generate new data with the same statistics as the training set.
LVIS: It stands for large vocabulary instance segmentation. It is a large-scale dataset comprising 2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164K images.
SPP-Net: It stands for spatial pyramid pooling network. It is a network that adds a new layer between the convolutional layers and the fully connected layers to map an input of any size down to a fixed-size output.
DSOD: It stands for deeply supervised object detector. It is a framework that can learn object detectors from scratch.
YOLO: It stands for you only look once. It is a state-of-the-art real-time single-stage object detector that processes images at 30 fps and has an mAP of 57.9% on the COCO test-dev dataset.
VOC: It stands for visual object classes. It is a project that provides standardized image datasets for object class recognition and a common set of tools for accessing the datasets and annotations. It ran challenges evaluating performance on object class recognition from 2005 to 2012.
M-DOD: It stands for multi-domain object detection. It is a technique to universally detect multi-domain objects without prior knowledge of their domain.