Deep Learning in Object Detection: A Review: August 2020
3 authors:
Thokozani Shongwe
University of Johannesburg
All content following this page was uploaded by Ali N Hasan on 23 October 2020.
Abstract— Object detection continues to play a significant part in computer vision theory, study and practical application. Conventional object detection algorithms were primarily derived from machine learning. This involved the design of features for describing the object's characteristics, followed by integration with classifiers. In recent years, the application of deep learning (DL), and more specifically Convolutional Neural Networks (CNNs), has brought great advancement and promising progress, and has therefore received much attention on the global stage of computer vision research. This paper surveys some of the most important and recent developments and contributions that have been made towards research in the use of deep learning in object detection. Moreover, the findings of numerous studies suggest that the application of deep learning in object detection far surpasses conventional approaches focused on handcrafted and learned features.

Keywords— Deep learning; Object detection; Convolutional neural networks; Machine learning

I. INTRODUCTION

When carefully observing the brain, multiple processing levels can be identified. It is understood that every level can learn features or representations at increasing levels of abstraction. For instance, the typical design of the visual cortex [1] suggests that (roughly speaking) the brain initially extracts edges, followed by patches, then surfaces, then objects, and so on. This is one of the fundamental ways in which the brain performs vision. This observation is what has inspired the field of machine learning called deep learning, which attempts to reproduce a similar architecture in a computer [2].

Machine learning and deep learning have demonstrated significant application in computer vision research. Deep convolutional neural networks (a type of feed-forward artificial neural network) have outperformed other deep learning models on computer vision tasks, chiefly image classification, object detection, scene reconstruction, object pose estimation, learning, event tracking, and so on. The effective performance of Convolutional Neural Networks (CNNs) in object recognition is due primarily to the fact that they can learn significant mid-level image features, rather than the hand-designed low-level representations that are usually used in specific approaches to image classification [3]. But the question arises: What precisely is object detection?

An object is defined in this topic by its main features, which include form, size, colour, texture, and other attributes. To detect such an object means that an image clearly indicates the object's presence and, moreover, that its location is illustrated in the image [4]. Thus, object detection can be defined as a way to locate instances of real-world objects in images. Detection is closely related to classification, since it involves establishing the presence and position of a particular object in an image. There are various objects that can be detected in an image, for instance, vehicles, pedestrians, buildings, road signs, human faces, etc.

Improving object detection precision, robustness and effectiveness through deep learning methodologies, namely deep neural networks, region-based convolutional neural networks, and deep convolutional neural networks, may lead to more robust surveillance and protection systems designed to detect moving objects from video [5]. This is particularly important for tracing security threats such as intruders in a vulnerable area, locating abandoned objects that could be anomalies in a scene, such as bombs or explosives, tracing vehicles used in robberies, and studying and monitoring suspicious behaviours which usually lead to criminal situations in our society. In addition, intelligent visual cameras inspired by deep learning in object detection can be used for monitoring the activities and behaviour of animals in protected areas, either for ethology or the preservation of our natural environment. The use of deep learning algorithms for object detection has also become an important application for image processing in the medical field, such as the detection of cancerous cells in the human body [6].

Object detection is one of the computer vision tasks that has benefited from deep learning techniques in several papers in the literature. This paper reviews the deep learning algorithms and techniques that are utilized in object detection, both in still images and in the video domain. It covers an extensive review of deep learning techniques and their applications in the field of image detection [7]. Additionally, it demonstrates the role of deep neural networks in object detection and their performance over traditional machine learning techniques. Furthermore, it introduces conventional deep learning techniques for image and object detection research, and presents the most remarkable findings of recent years [8].

In recent years, a number of notable and innovative techniques have been employed towards improving the detection accuracy of deep learning models and solving complex problems experienced during the training and testing of deep learning object detection models. Among these innovative techniques are the modification of the activation function of deep CNNs [9], transfer learning [10-12], and ingenious approaches to the combined selection of the activation function and the optimization system for the proposed deep learning model [13].

This paper is organized in the following manner: Section 2 introduces the background in object detection and the conventional machine learning techniques that have been applied to it. Section 3 defines the common DL methods and their computational techniques that are applied to object detection; furthermore, it provides a brief overview of relevant DL models for solving advanced object detection problems in computer vision. Section 4 briefly discusses some of the innovative techniques that have been adopted to improve and
optimize deep learning models and solve some challenges that occur during training and testing. Section 5 is the conclusion of the survey.

II. BACKGROUND REVIEW

A. Object Detection

Colour often conveys essential information about the setting. Objects in images have distinct colours and, based on their colour, can be distinguished from the background. As a result, objects are often cut out from a scene by their distinct colours. Direct classification of the pixels into 'objects and background' is the strategy frequently used to accomplish this task. An object is identified or outlined by a set of colours that almost certainly belong to it. The background, in turn, can be described as the rest of the values from the image or, in like manner, by means of its characteristic colours [14]. This method is applied in a process that filters out the colours of a particular object from all the other objects and the background. Cyganek [15] introduced a technique of 'Road Signs Detection' by employing the approach of Direct Pixel Classification. Pixel classification methods are excellent for dimensionality reduction, and can be adapted to quick pre-processing of images. In addition, it is also possible to use features other than colour [16].

In the detection of road signs, there are two principal pixel-based methods that have frequently been used. These methods involve manually collecting samples from images of real traffic scenes [17]. A considerable number of techniques were previously designed for refining category-specific object detection accuracy. Histogram of Oriented Gradients [18] based detectors employing a multi-scale sliding-window scheme have traditionally been the most widely used and widely accepted methods for pedestrian detection.

The use of filtered channel features has been reported to produce high-quality pedestrian detection [16]. An in-depth analysis was conducted into the performance of cutting-edge pedestrian detectors to provide insights into the typical failures of some methods and an understanding of the effects of the quality of training data. These insights were utilised to investigate variants of advanced techniques, such as the filtered channel features and the R-CNN detectors, and to present improvements over the baseline [12].

One method is that of building a fuzzy classifier from the colour histograms. For each of the distinctive colours belonging to a group of traffic signs, a few thousand samples were collected to create their colour histograms [18]. The utility of this method is that it is simple to implement and takes less processing time. However, the fuzzy approach often shows a considerable percentage of false positives, therefore resulting in poor accuracy [19].

Another pixel-based method is that of Support Vector Machines (SVMs). Support Vector Machines were introduced by Vapnik [20], on the premise of the Structural Risk Minimization (SRM) method. SVMs have great generalization properties and regularization potential. They are characterized as large-margin binary classifiers. The goal of SVMs is to induce a classifier which possesses exceptional classification capability on unobserved data points. The SVM approach involves certain optimization strategies, which are important due to the demands on training speed, memory constraints and the accuracy of optimization variables. However, the difficulty is that the optimization tasks are large and costly, and the typical single-threaded implementation struggles with the sophisticated learning process. As a consequence, SVMs can incur an excessive computational cost, which may occur during testing as well [21].

Detection of standard figures, namely lines, circles, polygons, etc., is one of the elementary low-level computer vision tasks. These figures can be expressed parametrically through a mathematical scheme. The most highly regarded technique used in the detection of shapes was the one devised by Hough [22], as a voting method for the recognition of lines. In time, the method was extended to arbitrary figure detection by Ballard [23]. When detecting common shapes, the technique is computationally expensive. However, the application of the structural tensor greatly enhanced the detection of standard shapes. Rapid and precise information could be gathered by the analysis of the local phase φ of the tensor and its coherence. This led to a technique called the orientation-based Hough transform, which was proposed in [24]. This approach does not involve any initial image segmentation process. The structural tensor is computed at each point to provide the following information: whether a point belongs to an edge, what its local phase is if it does, and what the type of local structure is [25].

The final parameter to be determined is the length p0 of a line segment to the origin of the coordinate system. The formulas are presented below (see Fig. 1).

$$\frac{x_2 - x_2^{0}}{x_1^{0} - x_1} = \cot\varphi, \qquad x_2^{0} = p_0 \sin\varphi, \qquad x_1^{0} = p_0 \cos\varphi \tag{1}$$

Fig. 1. Orientation-based Hough transform and the UpWrite method (Jähne 2005) Methodology.

After rearranging the terms with lower and upper indices, we obtain the following transform:
$$\underbrace{\begin{bmatrix} \cos\varphi & \sin\varphi \end{bmatrix}}_{\mathbf{w}} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = p_0 \tag{2}$$

where w is a vector normal to the desired line and p0 is the length of the line segment to the origin of the image coordinate system [23].

This orientation-based method is associated with the concept called the UpWrite method, which was first proposed by McLaughlin and Alder [26] to detect circles, ellipses and lines.

This technique takes the local orientation to be the phase of the principal eigenvector of the covariance matrix of the image data. A sequence of points that passes through successive mean points of local pixel blobs, with local orientations following a presumed curvature (or its variations), forms a curve. Interpreted another way, the inertia tensor of pixel intensities is utilised to determine a curve and, ultimately, the points found can be fitted to the figure through the least-squares method [27].

Figure detection presents other strategies for recognizing objects in pictures. Objects may be determined based on observation of their distinguishing points. This approach is rooted in the area of sparse image coding, which is advancing dynamically [28]. The idea is to identify distinctive points that belong to an object and are most invariant to potential geometric transformations of the view of that object, as well as to noise and other distortions. HOG [18], SIFT [29] and their many variations, such as OpponentSIFT and PCA-SIFT, are among the widely known point descriptors.

Various researchers have compared the performance of sparse descriptors. For example, Mikolajczyk and Schmid made such a comparison [30] and presented the Gradient Location and Orientation Histogram (GLOH), which in quite a few instances outperformed SIFT. However, the drawback of these strategies is that they only capture low-level edge detail regarding any figure. It is more difficult to build features that can capture mid-level information like edge intersections or high-level representations like object parts. This is clearly observed even when extracting distinctive image features from scale-invariant key-points [71].

B. Machine Learning methods

Machine learning models or algorithms in object detection are based on a set of complex statistical and mathematical equations, which are very tightly interlinked by nature. Most of these methods are primarily feature-based; thus, they usually require a vast number of samples to learn and to train. Below is a list of some of the typical machine learning techniques in object detection.

1) Deformable Models

This is a statistical model used to capture the variability in constructing an object's actual instance, based on a prior distribution over template deformations. The model is described by reference to generators and inter-generator bonds. Variables that characterize the template deformation are used to denote the generators and bonds. Additionally, given a particular deformation of a template, a statistical model of the image data is produced. This model provides a fairly natural mechanism for coping with insufficient or poorly distributed data. It allows a prior to be placed on the coefficients and on the noise so that, in the absence of data, the priors can take over [31].

2) Sparse linear regression

In a linear model, scaled sparse linear regression jointly estimates the regression coefficients and the noise level. It reaches an equilibrium with a sparse regression technique by iteratively estimating the noise level through the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a sparse regression estimator's path or grid for penalty levels above an acceptable threshold [32].

3) Bayesian linear regression

This is a technique of linear regression whereby the statistical analysis is carried out within the context of Bayesian inference. When the regression model has normally distributed errors, and a particular form of prior distribution is assumed, explicit results are obtainable for the posterior probability distributions of the model's parameters [33]. When linear regression is understood as a Bayesian model, the prior variance and the noise variance can be inferred automatically, and calibrated predictions can be made. Bayesian linear regression is a useful component in intricate probabilistic models [34].

4) Bayesian logistic regression

Bayesian logistic regression is a machine learning model for binary classification, i.e. learning to classify data points into one of two categories. This is a linear model, since only the dot product of a weight vector with a feature vector governs the decision. The classification boundary may thus be expressed as a hyperplane. It is a widely used model, and a common motif in neural networks is the usual linear-followed-by-sigmoid configuration. Bayesian logistic regression is a variant of logistic regression in Bayesian form [35]. The name logistic regression comes from the fact that the regression dependent variable is a logistic function. It is one of the models commonly used in problems where the outcome is a binary variable. In logistic regression, Y is viewed as a linear function of the explanatory variables X. Consequently, the model of logistic regression can be described as:

$$\ln\left(\frac{\sigma(Y)}{1 - \sigma(Y)}\right) = \theta_1 b_1(X) + \theta_2 b_2(X) + \cdots + \theta_M b_M(X) \tag{3}$$

where $\{b_i(X)\}$ is the set of basis functions and $\{\theta_i\}$ are the model parameters.

C. Support Vector Machines

Support Vector Machines (SVMs) are algorithms for pattern recognition, which were developed by Vapnik [20]. This is a statistical model based primarily on a principle referred to as Structural Risk Minimization, which works to limit an upper bound on the generalization error. The principal objective of SVMs is to induce a classifier which possesses exceptional classification capability on unobserved data points [36]. Due to the high demand on
training speed and memory constraints, SVMs need certain optimisation processes. Therefore, SVMs tend to be computationally expensive [21].

Summary

Most of the conventional machine learning techniques in object detection have often demonstrated difficulty when they have to be extended to more complex objects such as people, vehicles, and many other complex classes of objects, because they involve a great amount of prior information and domain knowledge. Another challenge with these models is the fact that most of them require an appropriate image representation (visual coding) in order to capture the structural similarities between instances of an object class. Other learning-based techniques, which are data-driven, may seem to show more advantages; however, most of them require some optimization technique and tend to be computationally intensive.

contact between variables belonging to a particular layer. The standard Boltzmann Machine (BM) is a network of symmetrically connected stochastic binary units. The basic learning algorithm of BMs makes them very difficult to learn and slow to train [2]. In a Deep Boltzmann Machine (DBM), however, every layer extracts higher-order correlations among the activities of the hidden features in the layer beneath [37]. These DBMs can learn internal features that gradually become more complex, which is a good characteristic for solving object detection challenges as well as other computer vision problems. High-level features can be built from a wide collection of unlabelled sensory inputs. Furthermore, a highly constrained supply of labelled data can subsequently be employed to only calibrate and improve the network for the specific task being implemented. A DBM can be configured as depicted in figure 2 below.
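The description of a Boltzmann machine as a network of symmetrically connected stochastic binary units can be illustrated with a small sketch. It assumes the standard update rule, which the survey does not state explicitly: unit i switches on with probability σ(b_i + Σ_j w_ij s_j), where σ is the logistic function. The 3-unit network, its weights and biases, and the function names are hypothetical and chosen only for illustration; this is not the training procedure of any model cited here.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def activation_probability(i, state, weights, biases):
    """Probability that unit i is on, given the states of all other units."""
    total = biases[i] + sum(
        weights[i][j] * state[j] for j in range(len(state)) if j != i
    )
    return sigmoid(total)


def gibbs_sweep(state, weights, biases, rng):
    """Resample every unit once, in index order, as a stochastic binary unit."""
    for i in range(len(state)):
        p_on = activation_probability(i, state, weights, biases)
        state[i] = 1 if rng.random() < p_on else 0
    return state


# Hypothetical 3-unit network: weights are symmetric with a zero diagonal.
weights = [
    [0.0, 1.5, -0.5],
    [1.5, 0.0, 0.8],
    [-0.5, 0.8, 0.0],
]
biases = [0.1, -0.2, 0.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
state = [1, 0, 1]
for _ in range(5):
    state = gibbs_sweep(state, weights, biases, rng)
print(state)
```

Each sweep visits the units one at a time, which hints at why learning in a fully connected BM is slow: many such sweeps are typically needed before the samples reflect the network's equilibrium distribution.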
TABLE I. DEEP LEARNING IN OBJECT DETECTION (METHODS, BENCHMARK DL TECHNIQUE, SUCCESSES AND FAILURES)

| Method | Benchmark Deep Learning Technique | Success | Failure |
| --- | --- | --- | --- |
| DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection | Deep Convolutional Neural Networks | The model introduces a new efficient pre-training strategy for deep CNN models by incorporating object-level annotation. | This model was capable of improving the advanced performance obtained by RCNN from mAP 31.0% to 50.3% on the ImageNet dataset for object detection. However, the results were only valid for ILSVRC2014. |
| Boosted Convolutional Neural Network (BCNN) | Convolutional Neural Networks | The experimental results indicated that BCNN-LS, BCNN-TI and BCNN-BF can achieve a significant performance gain over Fast-RCNN, which serves as the baseline for performance boosting. BCNN-BF ranks third on the Caltech pedestrian dataset. | The model may still result in minor performance degradations in a few high-score samples. |
| Learning Framework for Robust Obstacle Detection, Recognition, and Obstacle Tracking | Convolutional Neural Networks and Auto-encoder | The model introduces a robust adaptive U-V disparity for detecting practical objects within different driving conditions. | The model requires a robust and powerful GPU implementation to speed up running and processing, which can be very expensive. The system also requires a heavy integration of multiple techniques to maximise detection accuracy. |
| Pedestrian detection model based on RCNN | Region-based Convolutional Neural Networks | The outcome of the experiment reveals that the proposed pedestrian detection system outperforms the traditional region proposal algorithms by yielding a miss rate of 23%. | The system is not completely implemented on a GPU, so it consumes a great deal of time during training and testing. |
| Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection | Convolutional Neural Networks | The model can handle scale variations of objects by efficiently utilising image pyramids by means of a feature extrapolating layer. | When there are many different views of objects, the model does not provide a significant detection accuracy. The subcategory-aware CNNs do not necessarily improve performance on the PASCAL VOC 2007 dataset. |
| You Only Look Once: Unified Real-time Object Detection | Convolutional Neural Network | YOLO can learn generalizable representations of objects; thus, it does not easily break down when used on new domains or unanticipated input. | The model learns to predict bounding boxes from data and therefore struggles to generalize to objects in unusual aspect ratios or configurations. |
| Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Region-based Convolutional Neural Network | The model uses anchors of many different sizes as the regression references; this is demonstrated to be an efficient solution for advancing detection accuracy. | The presence of low-resolution features, together with the absence of a bootstrapping strategy in the model, results in degraded detection accuracy. |
| Deep Residual Learning for Image Recognition | Deep Convolutional Neural Networks | The very deep residual networks are simple to optimize with various datasets. On account of the greatly increased depth of the network compared to VGG-16, the classification and detection accuracy of the model is significantly improved. | The deep plain network has extremely low convergence rates and therefore does not reduce the training error significantly. |
| SSD: Single Shot MultiBox Detector | Convolutional Neural Network | Data augmentation is implemented to make the model robust to different input object shapes and sizes. This is achieved by randomly sampling each training image by means of the complete initial image. | SSD is susceptible to confusion between comparable object categories (particularly for animals), to a certain extent because it shares locations for many categories. |
| Object Detection in Videos with Tubelet Proposal Networks [61] | Convolutional Neural Networks | The proposed model is capable of employing useful temporal information from tubelet proposals for increasing detection accuracy. | When the temporal window continues to increase, even with the proposed initialization techniques in place, the performance decreases. |