Deep Learning in Object Detection: A Review: August 2020


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/344052340

Deep Learning in Object Detection: a Review

Conference Paper · August 2020


DOI: 10.1109/icABCD49160.2020.9183866


3 authors:

Katleho Masita Ali N Hasan


University of Johannesburg University of Johannesburg

Thokozani Shongwe
University of Johannesburg


All content following this page was uploaded by Ali N Hasan on 23 October 2020.



Deep Learning in Object Detection: a Review
Katleho Masita, Electrical and Electronic Engineering, University of Johannesburg, Johannesburg, South Africa
Ali Hasan, Electrical Engineering, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates
Thokozani Shongwe, Electrical and Electronic Engineering, University of Johannesburg, Johannesburg, South Africa

Abstract: Object detection continues to play a significant part in computer vision theory, study and practical application. Conventional object detection algorithms were primarily derived from machine learning. This involved the design of features for describing the object's characteristics, followed by integration with classifiers. In recent years, the application of deep learning (DL), and more specifically Convolutional Neural Networks (CNNs), has brought great advancement and promising progress, and has therefore received much attention on the global stage of computer vision research. This paper surveys some of the most important and recent developments and contributions made towards research in the use of deep learning in object detection. Moreover, the findings of numerous studies suggest that the application of deep learning in object detection far surpasses conventional approaches focused on handcrafted and learned features.

Keywords: Deep learning; Object detection; Convolutional neural networks; Machine learning

I. INTRODUCTION

When carefully observing the brain, multiple processing levels can be identified. It is understood that every level can learn features or representations at increasing levels of abstraction. For instance, the typical design of the visual cortex [1] suggests that (roughly speaking) the brain initially extracts edges, followed by patches, then surfaces, then objects, and so on. This is one of the fundamental ways in which the brain performs vision. This observation is what has inspired the field of machine learning called deep learning, which attempts to reproduce a similar architecture in a computer [2].

Machine learning and deep learning have found significant application in computer vision research. Deep convolutional neural networks (a type of feed-forward artificial neural network) have outperformed other deep learning models on computer vision tasks such as image classification, object detection, scene reconstruction, object pose estimation, learning, event tracking, and so on. The effective performance of Convolutional Neural Networks (CNNs) in object recognition rests primarily on the fact that they can learn significant mid-level image features, rather than the hand-designed low-level representations that are usually used in specific approaches to image classification [3]. But the question arises: what precisely is object detection?

An object is defined in this topic by its main features, which include form, size, colour, texture, and other attributes. Detecting such an object means that the object's presence in an image is clearly indicated and, moreover, that its location in the image is marked [4]. Thus, object detection can be defined as a way to locate instances of real-world objects in images. Detection is closely related to classification, since it involves establishing both the presence and the position of a particular object in an image. There are various objects that can be detected in an image, for instance, vehicles, pedestrians, buildings, road signs, human faces, etc.

Improving the precision, robustness and effectiveness of object detection through deep learning methodologies, namely deep neural networks, region-based convolutional neural networks, and deep convolutional neural networks, may lead to more robust surveillance and protection systems designed to detect moving objects from video [5]. This is particularly important for tracing security threats such as intruders in a vulnerable area, locating abandoned objects that could be anomalies in a scene such as bombs or explosives, tracing robbery vehicles, and studying and monitoring suspicious behaviours that usually lead to criminal situations in our society. In addition, intelligent visual cameras inspired by deep learning in object detection can be used for monitoring the activities and behaviour of animals in protected areas, either for ethology or for the preservation of our natural environment. The use of deep learning algorithms for object detection has also become an important application for image processing in the medical field, including the detection of cancerous cells in the human body [6].

Object detection is one of the computer vision tasks that has benefited from deep learning techniques in several papers in the literature. This paper reviews the deep learning algorithms and techniques that are utilized in object detection, both in still images and in the video domain. It covers an extensive review of deep learning techniques and their applications in the field of image detection [7]. Additionally, it demonstrates the precise role of deep neural networks in object detection and their performance relative to traditional machine learning techniques. Furthermore, it introduces conventional deep learning techniques for image and object detection research, and presents the most remarkable findings of recent years [8].

In recent years, a number of notable and innovative techniques have been employed to improve the detection accuracy of deep learning models and to solve complex problems experienced during the training and testing of deep learning object detection models. Among these innovative techniques are the modification of the activation function of deep CNNs [9], transfer learning [10-12], and ingenious approaches to the combined selection of the activation function and the optimization method for the proposed deep learning model [13].

This paper is organized in the following manner: Section 2 introduces the background in object detection and the conventional machine learning techniques that have been applied to it. Section 3 defines the common DL methods and the computational techniques that are applied to object detection; furthermore, it provides a brief overview of relevant DL models for solving advanced object detection problems in computer vision. Section 4 briefly discusses some of the innovative techniques that have been adopted to improve and
optimize deep learning models and solve some challenges that occur during training and testing. Section 5 is the conclusion of the survey.

II. BACKGROUND REVIEW

A. Object Detection

Colour often conveys essential information about the setting. Objects in images have distinct colours and, based on their colour, can be distinguished from the background. As a result, objects are often cut out from a scene by their distinct colours. Direct classification of the pixels into "object" and "background" is the strategy frequently used to accomplish this task. An object is identified or outlined by a set of colours that almost certainly belong to it. The background, in turn, can be described as the rest of the values in the image or, in like manner, by means of its characteristic colours [14]. This method is applied in a process that filters out the colours of a particular object from all the other objects and the background. Cyganek [15] introduced a technique for road sign detection that employs Direct Pixel Classification. Pixel classification methods are excellent for dimensionality reduction, and can be adapted to quick pre-processing of images. In addition, it is also possible to use features other than colour [16].

In the detection of road signs, there are two principal pixel-based methods that have frequently been used. These methods involve manually collecting samples from images of real traffic scenes [17]. A considerable number of techniques were previously designed for refining category-specific object detection accuracy. Detectors based on the Histogram of Oriented Gradients (HOG) [18], employing a multi-scale sliding window, have traditionally been the most widely used and accepted methods for pedestrian detection.

The use of filtered channel features has been reported to produce high-quality pedestrian detection [16]. An in-depth analysis was conducted into the performance of cutting-edge pedestrian detectors to provide insights into the typical failures of some methods and an understanding of the effects of the quality of training data. These insights were utilised to investigate variants of advanced techniques such as the filtered channel features and the R-CNN detectors, and to present improvements over the baseline [12].

One method is that of building a fuzzy classifier from colour histograms. For each of the distinctive colours belonging to a group of traffic signs, a few thousand samples were collected to create their colour histograms [18]. The appeal of this method is that it is simple to implement and takes less processing time. However, the fuzzy approach often shows a considerable percentage of false positives, resulting in poor accuracy [19].

Another pixel-based method is that of Support Vector Machines (SVMs). Support Vector Machines were introduced by Vapnik [20], on the premise of the Structural Risk Minimization (SRM) method. SVMs have great generalization properties and regularization potential. They are characterized as large-margin binary classifiers. The goal of SVMs is to produce a classifier that possesses exceptional classification capability on unobserved data points. The SVM approach involves certain optimization strategies, which are important due to the demands on training speed, memory constraints and the accuracy of the optimization variables. However, the difficulty is that the optimization tasks are costly in size, and the typical single-threaded implementation struggles with the sophisticated learning process. As a consequence, SVMs can incur an excessive computational cost, which may occur during testing as well [21].

Detection of standard figures, namely lines, circles, polygons, etc., is one of the elementary low-level computer vision tasks. These figures can be expressed parametrically through a mathematical scheme. The highly regarded technique for the detection of shapes was the one devised by Hough [22] as a voting method for the recognition of lines. In time, the method was extended to arbitrary figure detection by Ballard [23]. When detecting common shapes, the technique is computationally expensive. However, the application of the structural tensor greatly enhanced the detection of standard shapes. Rapid and precise information can be gathered by analysing the local phase $\varphi$ of the tensor and its coherence. This led to a technique called the orientation-based Hough transform, which was proposed in [24]. This approach does not involve any initial image segmentation process. The structural tensor is computed at each point to provide the following information: whether a point belongs to an edge, what its local phase is if it does, and what the type of local structure is [25].

The final parameter to be determined is the length $p_0$ of a line segment to the origin of the coordinate system. The formulas are presented below (see Fig. 1).

$$\frac{x_2 - x_2^0}{x_1^0 - x_1} = \cot\varphi, \qquad x_2^0 = p_0 \sin\varphi, \qquad x_1^0 = p_0 \cos\varphi \tag{1}$$

Fig. 1. Methodology of the orientation-based Hough transform and the UpWrite method (Jähne 2005).

After rearranging the lower and upper indices, this yields the following transform:
$$\underbrace{\begin{bmatrix} \cos\varphi & \sin\varphi \end{bmatrix}}_{\mathbf{w}} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = p_0 \tag{2}$$

where $\mathbf{w}$ is a normal vector to the desired line and $p_0$ is the length of the line segment to the origin (centre) of the image coordinate system [23].

This orientation-based method is associated with the concept called the UpWrite method, which was first proposed by McLaughlin and Alder [26] to detect circles, ellipses and lines.

This technique computes the local orientations as the phase of the principal eigenvector of the covariance matrix of the image data. A curve is formed by a sequence of points that pass through successive mean points of local pixel blobs whose local orientations follow a presumed curvature (or its variations). Interpreted another way, the inertia tensor of pixel intensities is utilised to determine a curve, and ultimately the points found can be fitted to the figure through the least-squares method [27].

There are also other strategies for recognizing objects in pictures. Objects may be determined based on observation of their distinguishing points. This approach is rooted in the area of sparse image coding, which is advancing dynamically [28]. The idea is to identify distinctive points that belong to an object and are as invariant as possible to potential geometric transformations of the view of that object, as well as to noise and other distortions. HOG [18], SIFT [29] and their many variations, such as OpponentSIFT and PCA-SIFT, are among the widely known point descriptors. Various researchers have compared the performance of sparse descriptors; for example, Mikolajczyk and Schmid made such a comparison [30] and presented the Gradient Location and Orientation Histogram (GLOH), which in quite a few instances outperformed SIFT. However, the drawback of these strategies is that they capture only low-level edge detail of a figure. It is more difficult to build features that can capture mid-level information like edge intersections, or high-level representations like object parts. This is clearly observed even when extracting distinctive image features from scale-invariant key-points [71].

B. Machine Learning methods

Machine learning models or algorithms in object detection are based on a set of complex statistical and mathematical equations, which are tightly interlinked by nature. Most of these methods are primarily feature-based and thus usually require a vast number of samples for training. Below is a list of some of the typical machine learning techniques in object detection.

1) Deformable Models

This is a statistical model that describes the variability of an object's actual instances through a prior distribution on template deformations. The model is described by reference to generators and the bonds between subsets of generators. Variables that characterize the template deformation are used to label the generators and bonds. Additionally, given a particular deformation of a template, a statistical model of the image data is produced. This model provides a fairly natural mechanism to cope with insufficient or poorly distributed data. It allows a prior to be placed on the coefficients and on the noise so that, in the absence of data, the priors can take over [31].

2) Sparse linear regression

In a linear model, scaled sparse linear regression jointly estimates the regression coefficients and the noise level. It selects an equilibrium with a sparse regression technique by repeatedly estimating the noise level through the mean residual square and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs little beyond the computation of a path or grid of the sparse regression estimator for penalty levels above a suitable threshold [32].

3) Bayesian linear regression

This is a technique of linear regression in which the statistical analysis is carried out within the context of Bayesian inference. When the regression model has errors that follow a normal distribution, and a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model's parameters [33]. When linear regression is understood as a Bayesian model, the prior variance and the noise variance can be inferred automatically, and calibrated predictions can be made. Bayesian linear regression is a useful component in intricate probabilistic models [34].

4) Bayesian logistic regression

Bayesian logistic regression is a machine learning model for binary classification, i.e. learning to classify data points into one of two categories. This is a linear model, since only the dot product of a weight vector with a feature vector governs the decision. The classification boundary may thus be expressed as a hyperplane. It is a widely used model, and a common motif in neural networks is the standard linear-followed-by-sigmoid configuration. Bayesian logistic regression is a variant of logistic regression in Bayesian form [35]. The name logistic regression comes from the fact that the dependent variable is modelled through a logistic function. It is one of the models commonly used in problems where the outcome is a binary variable. In logistic regression, Y is viewed as a linear function of the explanatory variables X. Consequently, the logistic regression model can be described as:

$$\ln\left(\frac{\sigma(Y)}{1 - \sigma(Y)}\right) = \theta_1 b_1(X) + \theta_2 b_2(X) + \dots + \theta_M b_M(X) \tag{3}$$

where $\{b_i(X)\}$ is the set of basis functions and $\{\theta_i\}$ are the model parameters.

C. Support Vector Machines

Support Vector Machines (SVMs) are algorithms for pattern recognition developed by Vapnik [20]. This is a statistical model primarily based on a principle referred to as Structural Risk Minimization, which works by limiting an upper bound on the generalization error. The principal objective of SVMs is to produce a classifier that possesses exceptional classification performance on unobserved data points [36]. Due to the high demand on
training speed and memory constraints, SVMs need certain optimisation processes. Therefore, SVMs tend to be computationally expensive [21].

Summary

Most of the conventional machine learning techniques in object detection have often shown difficulty when extended to more complex objects such as people, vehicles, and many other complex classes of objects, because they involve a great amount of prior information and domain knowledge. Another challenge with these models is that most of them require an appropriate image representation (visual coding) in order to capture the structural similarities between instances of an object class. Other learning-based techniques, which are data-driven, may seem to show more advantages; however, most of them require some optimization technique and tend to be computationally intensive.

III. DEEP LEARNING TECHNIQUES

A. Deep learning methods

The field of deep learning consists of different methodologies that were developed to improve object detection. Deep Belief Networks consist of stacks of Restricted Boltzmann Machines (RBMs). An RBM is a two-layer, bipartite, undirected graphical model with full connectivity between the visible units and the units of the hidden layer. A Stack Auto-Encoder (SAE) is a two-layer neural network stack designed to reconstruct its own inputs; an SAE is trained by minimizing the reconstruction error. Convolutional neural networks differ in the pattern of connections between their neurons (local connections with shared weights), which imposes a form of regularization by itself, without the need of another algorithm to assist it. The objective of CNNs is to determine filters through a data-driven approach so that they can extract features that describe the inputs. Deep learning aims to overcome challenges faced by feature-based methods, such as learning mid-level and high-level information. This is done by automatically learning hierarchies of visual features, in both supervised and unsupervised settings, directly from data.

The many deep learning techniques applied to object detection are generally classified into three groups. The first group involves unsupervised feature learning; here the theory and principles of deep learning are employed for the extraction of features only. These features are then supplied to relatively simple machine learning methods for performing tasks such as classification, detection or tracking, depending on the operation. The second group is the supervised learning techniques; here end-to-end learning is used to simultaneously optimize the feature extractor and the classifier units of the complete model when substantial amounts of labelled data are provided. Finally, the third group is the hybrid deep networks, which use generative feature learning models to enhance the training of deep neural networks and of other deep supervised feature-learning methods through more effective regularization or optimization.

1) Deep Boltzmann Machine

A Deep Boltzmann Machine (DBM) is a type of generative feature learning model, or network. A DBM is made up of multiple layers that contain hidden variables, and there is no connection between variables belonging to the same layer. The standard Boltzmann Machine (BM) is a network of symmetrically connected, stochastic binary units. The basic learning rule of BMs makes them very difficult and slow to train [2]. In a Deep Boltzmann Machine, however, every layer captures higher-order correlations between the activities of the hidden features in the layer beneath [37]. DBMs can therefore learn internal features that become progressively more complex, which is a good characteristic for solving object detection challenges as well as other computer vision problems. High-level features can be built from a large collection of unlabelled sensory inputs. Furthermore, a very limited supply of labelled data can subsequently be employed to only fine-tune the network for the specific task being implemented. A DBM can be configured as depicted in Fig. 2 below.

Fig. 2. Deep Boltzmann Machine

2) Restricted Boltzmann Machine

A Restricted Boltzmann Machine is the special case where the number of hidden layers of a DBM has been limited to one. The similarity between RBMs and DBMs is that neither has hidden-to-hidden or visible-to-visible connections in its model [38]. When RBMs are stacked, the feature activations of one RBM are employed as the training data of the next RBM, which ensures that the hidden layers are learned efficiently. This is the most important characteristic of the RBM [39]. The architecture of a typical RBM is illustrated in Fig. 3 below.

Fig. 3. Restricted Boltzmann Machine

3) Convolutional Neural Networks

Convolutional neural networks are specially designed types of neural networks for handling data that possesses a known, grid-like topology. One example is time-series data, which can be regarded as a one-dimensional (1D) grid of samples taken at regular time intervals. Another example is image data, which is usually presented in the form of a 2D grid of pixels [40]. Convolutional neural networks have accomplished enormously outstanding results in practical
utilisation. The phrase "convolutional neural network" indicates that the system executes a mathematical operation termed convolution, a specialised type of linear operation [3]. CNNs use convolution in at least one of their layers, instead of traditional matrix multiplication. The following formula expresses the convolution operation:

$$s(t) = (x \ast w)(t) = \int x(a)\, w(t - a)\, da \tag{4}$$

where $x$ is the input, $w$ is the kernel filter and $s$ is the output or feature map, which is a function of the continuous time $t$. The discrete-time convolution is defined in the following way:

$$s(t) = (x \ast w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \tag{5}$$

For a 2D image $I$ and a 2D kernel filter $K$, the operation commonly applied is cross-correlation (convolution without flipping the kernel), expressed in the following manner:

$$s(i, j) = (I \ast K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \tag{6}$$

The CNN layer has three main stages. In the first stage, the layer executes numerous convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is passed through a non-linear activation function such as the rectified linear activation function; this stage is occasionally referred to as the detector stage [41]. In the third stage, a pooling function is used to further modify the output of the layer. A pooling function replaces the output of the network at a given location with a summary statistic of the neighbouring outputs.

4) Deep Neural Networks

A Deep Neural Network (DNN) is a type of discriminative feature learning technique: a neural network that contains multiple hidden layers. This is a simple conceptual extension of neural networks; however, it provides valuable advances with regard to the capability of these models, and new challenges as to training them [42]. The structure of deep neural networks makes them more sophisticated in design and more complex in their elements. There are two complexity aspects of a DNN model's architecture [37]. Firstly, how wide or narrow it is, in other words, how many neurons there are in each layer. Secondly, how deep it is, that is, how many layers of neurons there are. When dealing with data that has such a deep structure, deep neural networks can be very beneficial: a deep neural network can fit the data more accurately with fewer parameters than a shallow neural network, because more layers can be used for a more efficient and accurate representation. Shallow neural network models require far more parameters to achieve the same tasks [43].

The design of deep CNNs often involves the following fundamental elements:

Convolutional layers - these can be numerous, depending on the dimensions and complexity of the problem.
ReLU (Rectified Linear Unit) layers - these appear after the convolutional layer.
Pooling layers - these are repeated multiple times, and they follow the convolutional and ReLU layer pair.
Fully Connected (FC) layer - a single, fully connected output layer that follows at the end, and is used for classification and decision purposes.

The general design of a DNN is shown in Fig. 4 below.

Fig. 4. A deep neural network architecture (Michael A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015)

5) Stack Auto-encoder

A Stack Auto-encoder (SAE) is a neural network stack containing two layers and designed to reconstruct its own inputs; an SAE is trained by minimizing the reconstruction error. Convolutional neural networks differ in the pattern of connections between their neurons (local connections with shared weights), which imposes a form of regularization by itself, without the need of another algorithm to assist it [44]. The primary components of the auto-encoder network model are an encoder function $h = f(x)$ and a decoder that produces the reconstruction $\hat{x} = g(h)$. Thus, the reconstructed output is $\hat{x} = g(f(x))$, which approximately copies the input data. The auto-encoder is mathematically represented as:

$$\hat{x} = g(\mathbf{W}x + \mathbf{b}) \tag{7}$$

Here $x$ represents the input and $\hat{x}$ its reconstruction, $\mathbf{W}$ stands for the weights, $\mathbf{b}$ denotes the bias, and $g$ refers to the activation function. This activation function is either a sigmoid or a rectified linear function. This learning algorithm is commonly employed for dimensionality reduction, feature learning or the reconstruction of corrupted data [45]. Fig. 5 below illustrates the typical model of the auto-encoder.

Fig. 5. The typical design of the Auto-encoder (Almalaq A. and Edwards G., A Review of Deep Learning Methods Applied on Load Forecasting, 2017)
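The encoder/decoder relationship of the auto-encoder, $h = f(x)$ and $\hat{x} = g(h)$, and the idea of training by minimizing the reconstruction error can be illustrated with a small sketch. The code below is not taken from the reviewed works; it is a minimal illustration under several assumptions: sigmoid activations for both the encoder and the decoder, a mean squared reconstruction error, plain stochastic gradient descent, and a toy data set of four one-hot vectors. The names `encode`, `decode`, `train_step`, the hidden size of 2, the learning rate and the number of epochs are all hypothetical choices made for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, one of the two choices the text mentions for g."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a weight matrix W (list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def encode(params, x):
    """Encoder h = f(x), assumed here to be sigmoid(W_enc x + b_enc)."""
    return [sigmoid(z) for z in affine(params["W_enc"], params["b_enc"], x)]


def decode(params, h):
    """Decoder x_hat = g(h), assumed here to be sigmoid(W_dec h + b_dec)."""
    return [sigmoid(z) for z in affine(params["W_dec"], params["b_dec"], h)]


def reconstruction_error(params, data):
    """Mean squared error between each input x and its reconstruction g(f(x))."""
    total = 0.0
    for x in data:
        x_hat = decode(params, encode(params, x))
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return total / len(data)


def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases (an arbitrary initialisation)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W_enc": matrix(n_hidden, n_in), "b_enc": [0.0] * n_hidden,
            "W_dec": matrix(n_in, n_hidden), "b_dec": [0.0] * n_in}


def train_step(params, x, lr):
    """One gradient-descent step on the squared reconstruction error of x."""
    h = encode(params, x)
    x_hat = decode(params, h)
    n, m = len(x), len(h)
    # Gradient of the loss with respect to the decoder pre-activations.
    d_out = [2.0 / n * (xh - xi) * xh * (1.0 - xh) for xh, xi in zip(x_hat, x)]
    # Back-propagate to the encoder pre-activations (before updating W_dec).
    d_hid = [sum(params["W_dec"][i][j] * d_out[i] for i in range(n))
             * h[j] * (1.0 - h[j]) for j in range(m)]
    for i in range(n):
        for j in range(m):
            params["W_dec"][i][j] -= lr * d_out[i] * h[j]
        params["b_dec"][i] -= lr * d_out[i]
    for j in range(m):
        for k in range(n):
            params["W_enc"][j][k] -= lr * d_hid[j] * x[k]
        params["b_enc"][j] -= lr * d_hid[j]


# Toy data: four one-hot vectors compressed into a 2-unit hidden code.
data = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
params = init_params(n_in=4, n_hidden=2)
error_before = reconstruction_error(params, data)
for _ in range(500):
    for x in data:
        train_step(params, x, lr=0.5)
error_after = reconstruction_error(params, data)
print("reconstruction error before:", error_before, "after:", error_after)
```

Because the hidden code has fewer units than the input, the network cannot simply copy its input and has to learn a compressed representation, which is the sense in which the auto-encoder is used for dimensionality reduction and feature learning. A stacked variant would train a further auto-encoder on the codes returned by `encode`.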
6) Deep Belief Networks
A Deep Belief Network (DBN) is a generative deep network, which can be employed to initialise a DNN for supervised learning with the same network architecture; the DNN is then discriminatively trained or fine-tuned using the target labels provided [38]. The structure of a Deep Belief Network contains a number of layers of hidden units referred to as stacked RBMs. The stacked RBMs contain multiple hidden layers, which are trained by means of the back-propagation algorithm. The connections inside the DBN structure are established between every unit within a layer and every unit within the adjacent layer, although no connections exist between units within the same layer. A DBN can be perceived as an RBM with multiple hidden layers [38]. The architecture of the Deep Belief Network can be observed in Fig. 6 below.

Fig. 6. Deep Belief Networks

B. Deep Learning Techniques in Object detection

The following paragraphs provide an overview of how the deep learning methods explained earlier are used to solve the object detection problem.

A deformable deep convolutional neural network model for object detection, named DeepID-Net, was proposed by several researchers from a university in Hong Kong in 2016. DeepID-Net jointly learns feature representation and part deformation for a large number of object categories. The research work introduces a new methodology for pre-training the deep CNN model. This pre-training method was implemented on the ImageNet image classification and localization dataset. Fine-tuning was done on the ImageNet/PASCAL-VOC object detection datasets. This model improved on the state-of-the-art performance obtained by RCNN, from mAP 31.0% to 50.3% on the ImageNet object detection dataset [41]. The individual model performance and the model averaging performance were found to be the best on ILSVRC2014 [46], [70].

A group of authors and researchers suggested a Boosted Convolutional Neural Network for the identification of pedestrians in 2017. A technique is introduced in this study to

approach was used for training the score class, and the smooth loss is used for predicting the bounding boxes [49]. To configure the base detector, the VGG-16 network is used, and the ImageNet dataset is used to pre-train the base detector. The networks are trained using a single NVIDIA K40 GPU with the Caffe framework.

In June 2017, a number of researchers developed a deep learning framework based on depth information and multiple local patterns, using a stereo vision system to achieve high-accuracy detection of pedestrians and vehicles under varying driving circumstances. There are three primary phases in the system: pre-processing using multiple local patterns, unsupervised training, and supervised training. The framework comprises deep learning techniques such as the auto-encoder and convolutional neural networks [40], [44], [50]. The supervised training stage employs soft-max regression. The study includes a robust object detection algorithm that can produce detailed information about the objects' locations, widths, and heights from reliable disparity information, which was acquired by the use of an approach proposed by Nguyen [51]. The proposed model is executed on a Tesla K40 GPU in order to minimize its processing time. On this platform, the processing time for an input image of 136 × 136 is greatly reduced, from 5.5 s on a PC (Core i7 4.0 GHz, 8.0 GB RAM) to 70 ms on the GPU.

In 2016, a number of Hangzhou Dianzi University students created a pedestrian detection model based on RCNN. The model implements the Edge Boxes algorithm as follows: First, Edge Boxes are used to extract regions of interest (ROI) from the input image [52]. Second, the region proposals are scaled to 227×227 and delivered through two fully connected layers to the deep convolutional neural network. Finally, linear SVMs are used for the classification of CNN features and the identification of regions [53]. The outcome of the experiment shows that the proposed pedestrian detection method outperforms conventional algorithms, producing a 23% miss rate. This result is obtained at the same 10% false detection rate. Other region proposal algorithms are evaluated in the experiment, such as Viola-Jones, with a miss rate of 72%, the HOG algorithm, which generates a miss rate of 46%, and finally Selective Search, which achieves a competitive miss rate of 24% [54].

A group of authors proposed sub-category-aware convolutional neural networks for object proposals and detection in 2017. This research work explores sub-category information, which is broadly employed in conventional object detection methods, for solving two fundamental problems in CNN-based object detection [55]. Firstly, the inability of the system to efficiently handle object scale changes, occlusion and truncation. Secondly, the inability to estimate detailed information concerning objects, for example, the 2D segmentation boundary, the 3-dimensional pose, or the occlusion relationship between objects [5]. The design of the
identify complex samples for CNN, and to use them to boost proposed detection model is built on the Fast R-CNN
the performance of the model[47]. This approach can be detection model with several improvements [8]. Image
extended to any CNN architecture, and the Fast-RCNN pyramids are employed to manage the scale variation of
adopted from[48] is the model used as the basis for objects. The feature extrapolating layer is added immediately
demonstrating the boosting effect for this analysis. The input after the last conv layer for extracting features in order to
to this model is an entire image, instead of one region heighten the amount of scales in the conv feature pyramid [9].
proposal. Within the ROI-pooling layer, regional proposals On the KITTI val set the developed RPN obtains the best AP
are provided for achieving the pooling effect over the conv results in comparison to the initial RPN detection network in
feature in each region. The model is trained using the all the evaluation metrics for object detection, pedestrian
Stochastic Gradient Descent (SGD) technique. The softmax detection and cyclist detection [56].
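The image pyramid mentioned for the sub-category-aware model can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `downsample_2x` and `image_pyramid`, the 2×2 averaging rule and the fixed scale factor of 2 are assumptions chosen only to show how one image is turned into a stack of progressively smaller copies that a detector can scan at several scales.

```python
def downsample_2x(image):
    """Halve height and width by averaging each 2x2 block (odd edges are dropped)."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def image_pyramid(image, min_size=1):
    """Return [image, image/2, image/4, ...] while both sides stay >= min_size."""
    levels = [image]
    while len(levels[-1]) // 2 >= min_size and len(levels[-1][0]) // 2 >= min_size:
        levels.append(downsample_2x(levels[-1]))
    return levels


# Hypothetical 4x4 grayscale image used only for illustration.
toy_image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
pyramid = image_pyramid(toy_image)
```

Each level would be passed through the same network, so that an object that is too large or too small at one level appears at a more suitable size at another.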
In 2016, a group of writers proposed "You Only Look Once: Unified, Real-Time Object Detection". The distinct object detection elements in this model are integrated into a single convolutional neural network [57]. Every bounding box is predicted from the features of the entire image, and all bounding boxes for an image are predicted simultaneously [58]. Together with real-time detection, the model achieves end-to-end training at very good average precision. The input image is divided into an S×S grid, and a grid cell is responsible for detecting an object when the centre of that object falls inside the cell. Every grid cell predicts B bounding boxes as well as confidence scores for the respective bounding boxes. Confidence scores indicate the model's certainty that the box holds an object and how accurate it believes the predicted box to be [59]. Detection is framed as a regression problem and therefore no complex pipeline is necessary. At test time, the neural network is simply run on a new image to predict detections. The primary network runs on a TitanX GPU at 45 frames per second without batch processing. YOLO imposes strong spatial constraints on bounding box predictions, as every grid cell predicts only two boxes and can only have a single class. This spatial constraint limits the number of adjacent objects that the neural network model is capable of predicting. The model struggles with small objects that occur in clusters, for example, flocks of birds [58].

In June 2017 several researchers proposed Faster R-CNN with region proposal networks for real-time object detection. Region Proposal Networks (RPNs) in this work are constructed on top of the convolutional feature maps used by region-based detectors such as Fast R-CNN, by adding a few extra convolutional layers that simultaneously regress region bounds and objectness scores at every position on a regular grid [58]. The complete Faster R-CNN comprises two components, namely, a deep convolutional neural network which proposes the regions and the Fast R-CNN detector that exploits the proposed regions. Together these make up a unified network for object detection. The experimental results on the PASCAL VOC dataset indicate that the RPN combined with Fast R-CNN achieves a mAP of 59.9% with the use of approximately 300 proposals. The RPN produces a faster detection speed compared to Selective Search (SS) [59] and Edge Boxes (EB) [52] because it involves shared convolutional computations; the smaller number of proposals also reduces the cost of the region-wise fully-connected layers.

K. He, X. Zhang, S. Ren and J. Sun proposed Deep Residual Learning for Image Recognition in 2016. This study proposes a deep residual learning framework in which the stacked layers of the model fit a residual mapping [47]. The formulation F(x)+x is realized by means of shortcut connections (connections that skip one or more layers) in the residual network model. The plain network architecture is primarily inspired by the VGG nets [48]. Most of the convolutional layers have 3 × 3 filters and adhere to these two design guidelines: (i) for the same output feature map size, the layers have the same number of filters; and (ii) whenever the feature map size is halved, the number of filters is doubled in order to preserve the time complexity per layer. Faster R-CNN is adopted as the detection method [5]. The very deep residual networks are simple to optimize on various datasets. On account of the greatly increased depth of the network compared to VGG-16, the classification and detection accuracy of the model is significantly improved [60].

Several authors proposed SSD: Single Shot MultiBox Detector for object detection during 2017. The SSD technique is based on a feed-forward convolutional network, which produces a fixed-size collection of bounding boxes and scores indicating the presence of object class instances in those boxes [61]. The next stage is non-maximum suppression, which is used to produce the final detections. The initial network layers are based on a standard architecture used for high-quality image classification (truncated before any classification layers), which is referred to as the base network [5]. The network is then supplemented with an auxiliary structure to deliver detections with a defined set of principal features. Hard negative mining is employed for faster optimization and more stable training. This involves sorting the negative training examples that result from the negative default boxes after the matching step [62]. SSD is very sensitive to the bounding box size. Specifically, its performance is worse on smaller objects than on bigger ones. The reason is that small objects very often do not have adequate information at the very top layers.

In 2017, K. Kang, H. Li and fellow researchers proposed a Tubelet Proposal Network (TPN) that can generate tubelets for videos efficiently. The network comprises two primary components. The first sub-network extracts visual features over time based on static region proposals at a single frame [63]. Because the receptive fields (RFs) of CNNs are usually large enough, feature map pooling can simply be executed at the same bounding box locations across time to extract the visual features of moving objects [41]. The second component, based on the pooled visual features, is a regression layer that estimates the temporal displacements of bounding boxes to generate tubelet proposals. The object detection process consists of two networks: the TPN, which produces candidate object tubelets, and a CNN-LSTM classification network, which classifies each bounding box on the tubelets into different object classes [46]. The proposed model is capable of employing useful temporal information from tubelet proposals for increasing detection accuracy.

IV. INNOVATION OF DEEP LEARNING MODELS
For some time, advanced deep learning models have made headway and been readily accepted within the community of computer vision scientists and engineers. A number of notable innovations have been made to improve the existing benchmarks and simplify the training and testing process. Training a DNN is a computationally difficult task. This is because the conventional training technique is stochastic gradient descent, which is very difficult to parallelize across machines. This means that learning at a substantial scale is non-trivial. For instance, while it is feasible to use a single powerful GPU machine to train DNN-based object detectors on large image training databases with remarkable results, it becomes increasingly difficult to scale this up to substantially larger volumes of training data.

Deep Stacking Neural Networks (DSNs), which were designed with the need to solve the learning scalability problem are among the strategic and innovative ways of
solving the problem of computational errors during the training of DNNs [68 - 69]. The main idea of the DSN is stacking: simple modules of functions or classifiers are composed first and then stacked on top of one another in order to learn complex functions or classifiers. Stacking operations have previously been implemented in diverse ways, which have often involved the use of supervised information in the simple modules. The stacked classifier at a higher level of the stacking architecture uses new features that are derived by concatenating the classifier output of a lower module with the raw input features. The simple module usually used for stacking is the conditional random field. The DSN utilises supervision information to stack each of the basic modules, each of which takes the form of a simplified multilayer perceptron. The architecture of Deep Stacking Networks is illustrated in Fig. 7 below.

Fig. 7. Deep Stacking Networks

Transfer learning (TL) is one innovation that has opened a platform for many engineers and research scientists to develop deep learning models with a more robust capability to detect objects in complex scenes containing different objects. Transfer learning involves storing the knowledge acquired whilst solving one problem and applying it to a different but similar problem [64]. In principle, this process is achieved by using a pre-trained deep CNN classification model such as AlexNet, VGG-16 or VGG-19 as the backbone for the final deep learning model, which is then retrained for a different object detection scenario. In this way, the knowledge acquired by the pre-trained CNN model is leveraged for better performance.

Modification of the important performance elements of the CNN architecture is another way in which innovation is achieved in the development of deep learning models for object detection. One of these performance elements is the activation function. Conventional CNN models usually employ nonlinear functions such as Sigmoid [65], Tanh [66] and ReLU [67]. The sigmoid function maps a real-valued input to the range [0, 1]. As the activation value of the function approaches the extremum 0 or 1, the function gradient tends to 0. The average output value of the function is not 0, which causes the next layer of neurons to receive an input signal with a nonzero mean. The inputs to those neurons are therefore positive. Consequently, the weight becomes positive. These problems lead to slow convergence of the parameters and affect both the efficiency of training and the recognition performance of the model. The tanh function maps an input into the [-1, 1] range, but it is simply a variant of the sigmoid function, that is, $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$. Both the sigmoid function and the tanh function exhibit two problems: gradient saturation and the constantly positive weight problem.

Fig. 8. Sigmoid function

Fig. 9. Tanh function

The ReLU function $f(x) = \max(0, x)$, with outputs in $[0, +\infty)$, illustrated in Fig. 10, has the following characteristics: (1) Unsaturated gradient, given by $\mathbb{I}\{x > 0\}$. In this way, the vanishing gradient problem during backpropagation is alleviated, which helps to ensure that the first layer of the neural network can be updated rapidly. (2) Low computational complexity: the thresholds set by the ReLU layer are the following: if $x < 0$ then $f(x) = 0$; if $x \ge 0$ then $f(x) = x$. The unfortunate problem with ReLU units is that they can expire or 'die' [76]. A 'dead' ReLU always outputs the same value. By strategically modifying the activation function of a deep learning model, the problems of gradient saturation and constantly positive weight can be reduced. Moreover, this could improve convergence speeds.

Fig. 10. ReLU function

Another way of innovating deep learning models is by adopting an ingenious approach to the combined selection of the activation function, regularization technique, weight update method and a variable step size. For example, by using the tanh activation function, the dropout method, and gradient descent with momentum and a variable step size, a deep CNN model can be significantly optimized to produce better performance on various fronts [13].

There continue to be other innovations that have been adopted with the objective of improving the performance of deep learning CNN models for object detection and optimizing the training and testing phases. However, some of these deep learning models perform well only in specific dataset contexts, such as small object detection and category-specific object detection. This makes it essential to have the performance of newly developed deep learning models tested on additional datasets.
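The activation functions discussed in this section can be checked numerically with a minimal sketch. The function names below are assumptions introduced for illustration; the code only evaluates the standard definitions of sigmoid, tanh and ReLU, verifies the identity tanh(x) = 2 sigmoid(2x) − 1, and shows how the sigmoid gradient shrinks for large inputs while the ReLU gradient stays at 1 for positive inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)); it tends to 0 as |x| grows."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_via_sigmoid(x):
    """tanh expressed as a rescaled sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Indicator I{x > 0}: the gradient does not saturate for positive inputs."""
    return 1.0 if x > 0 else 0.0


# Compare gradients at a small and a large input (values chosen for illustration).
gradients = {x: (sigmoid_grad(x), relu_grad(x)) for x in (0.0, 10.0)}
```

At x = 10 the sigmoid gradient is already far below its value at x = 0, which is the saturation effect described above, whereas the ReLU gradient is unchanged.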

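Gradient descent with momentum and a variable step size, mentioned above as one of the combined design choices, can be sketched as follows. The exact schedule used in [13] is not given in this review, so the step-decay rule, the hyperparameter values and the name `momentum_descent` are assumptions made purely for illustration on a one-dimensional toy objective.

```python
def momentum_descent(grad, w0, base_lr=0.1, momentum=0.9,
                     decay=0.5, decay_every=50, steps=200):
    """Minimise a 1-D function given its gradient, using a velocity term.

    The step size is variable: it is multiplied by `decay` every
    `decay_every` iterations (an assumed schedule, not taken from [13]).
    """
    w, velocity = w0, 0.0
    for t in range(steps):
        lr = base_lr * (decay ** (t // decay_every))  # variable step size
        velocity = momentum * velocity - lr * grad(w)  # accumulate past gradients
        w = w + velocity                               # move along the velocity
    return w


# Toy objective (w - 3)^2 with gradient 2 * (w - 3); the minimiser is w = 3.
w_final = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

In a CNN the same update would be applied to every weight, with the gradient supplied by backpropagation instead of a closed-form expression.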
TABLE I. DEEP LEARNING IN OBJECT DETECTION (METHODS, BENCHMARK DL TECHNIQUE, SUCCESSES AND FAILURES)

| Method | Benchmark Deep Learning Technique | Success | Failure |
|---|---|---|---|
| DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection. | Deep Convolutional Neural Networks | The model introduces a new efficient pre-training strategy for Deep CNN models by incorporating object-level annotation. | This model improved on the state-of-the-art mAP obtained by RCNN, from 31.0% to 50.3%, on the ImageNet dataset for object detection. However, the results were only valid for the ILSVRC2014. |
| Boosted Convolutional Neural Network (BCNN) | Convolutional Neural Networks | The experiment results indicated that BCNN-LS, BCNN-TI and BCNN-BF can achieve significant performance gain over the Fast-RCNN, which serves as the baseline for performance boosting. The BCNN-BF ranks third on the Caltech pedestrian dataset. | The model may still result in minor performance degradations in a few high score samples. |
| Learning Framework for Robust Obstacle Detection, Recognition, and obstacle tracking. | Convolutional Neural Networks and Auto-encoder | The model introduces a robust adaptive U-V disparity for detecting practical objects within different driving conditions. | The model requires a robust and powerful GPU implementation to speed up processing, which can be very expensive. The system also requires a heavy integration of multiple techniques to maximise detection accuracy. |
| Pedestrian detection model based on RCNN. | Region-based Convolutional Neural Networks | The outcome of the experiment reveals that the proposed pedestrian detection system outperforms the traditional region proposal algorithms by yielding a miss rate of 23%. | The system is not completely implemented on a GPU, so training and testing consume a great deal of time. |
| Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection | Convolutional Neural Networks | The model can handle scale variations of objects by efficiently utilising the image pyramids by means of a feature extrapolating layer. | When there are many different views of objects, the model does not provide a significant detection accuracy. The sub-category-aware CNNs do not necessarily improve performance on the PASCAL VOC 2007 dataset. |
| You Only Look Once: Unified Real-time Object Detection | Convolutional Neural Network | YOLO can learn generalizable representations of objects; thus, it does not easily break down when used on new domains or unanticipated input. | The model learns to predict bounding boxes from data and therefore struggles to generalize to objects in unusual aspect ratios or configurations. |
| Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Region-based Convolutional Neural Network | The model uses anchors of many different sizes as the regression references, which is demonstrated to be an efficient way of improving detection accuracy. | The presence of low-resolution features and the absence of a bootstrapping strategy in the model result in degraded detection accuracy. |
| Deep Residual Learning for Image Recognition | Deep Convolutional Neural Networks | The very deep residual networks are simple to optimize on various datasets. On account of the greatly increased depth of the network compared to VGG-16, the classification and detection accuracy of the model is significantly improved. | The deep plain network has extremely low convergence rates and therefore does not reduce the training error significantly. |
| SSD: Single Shot MultiBox Detector | Convolutional Neural Network | Data augmentation is implemented to make the model robust to different input object shapes and sizes. This is achieved by randomly sampling each training image by means of the complete initial image. | SSD is susceptible to confusion between similar object categories (particularly for animals), to a certain extent because it shares locations for many categories. |
| Object Detection in Videos with Tubelet Proposal Networks [61] | Convolutional Neural Networks | The proposed model is capable of employing useful temporal information from tubelet proposals for increasing detection accuracy. | When the temporal window continues to increase, even with the proposed initialization techniques in place, the performance decreases. |
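To make the YOLO entry of Table I more concrete, the S×S grid formulation can be sketched as below. This is an illustrative sketch only: the function names are assumptions, and the output size S × S × (B × 5 + C) with S = 7, B = 2 and C = 20 follows the configuration reported in the YOLO paper for PASCAL VOC rather than anything stated in this review.

```python
def responsible_cell(center_x, center_y, image_w, image_h, S):
    """Return (row, col) of the grid cell that contains the object's centre.

    That cell is the one responsible for detecting the object.
    Centres lying exactly on the right or bottom border are clamped
    to the last cell.
    """
    col = min(int(center_x / image_w * S), S - 1)
    row = min(int(center_y / image_h * S), S - 1)
    return row, col


def output_size(S, B, C):
    """Number of values predicted per image.

    Each cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities shared by the whole cell, hence the single-class-per-cell
    constraint.
    """
    return S * S * (B * 5 + C)


# Assumed example: a 448x448 image with an object centred in the middle.
cell = responsible_cell(224, 224, 448, 448, S=7)
n_outputs = output_size(S=7, B=2, C=20)
```

Because the class probabilities are shared per cell, two objects of different classes whose centres fall in the same cell cannot both be predicted, which matches the clustering limitation noted for small adjacent objects.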

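The non-maximum suppression step that SSD applies to produce its final detections can be sketched in the same spirit. The greedy procedure below is the commonly used formulation, not code from the SSD authors; the box format (x1, y1, x2, y2), the 0.5 threshold and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (made-up coordinates).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In practice the suppression is run per class, so that overlapping detections of different categories are not removed by one another.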
V. CONCLUSION
In this paper, a comprehensive survey of some of the important developments and successes shown by the application of deep learning techniques in object detection is put forward. To demonstrate the efficiency of applying deep learning techniques in object detection, various experiments and studies that have recently been implemented and completed in the domain are carefully reviewed and analysed. To this end, it has been well demonstrated that Convolutional Neural Networks, Deep Neural Networks as well as Region-based Convolutional Neural Networks have repeatedly been used as the baseline for many robust detection systems, and have obtained, in many experiments, state-of-the-art performance on various datasets. It can also be inferred that deep learning has proved to be effective in object detection, but in order to check and confirm the feasibility of using deep learning techniques for object detection, further studies have to be performed on larger datasets containing different categories. In addition, due to the intensive computation involved in training and testing of deep learning models, additional experiments have to be carried out on various platforms to identify the most convenient computing platform for deep learning techniques.

REFERENCES

[1] M. Ranzato, F.J. Huang, Y. Boureau and Y. LeCun, Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, Proc. of Computer Vision and Pattern Recognition Conference (CVPR 2007), (Minneapolis 2007)
[2] M. Zeiler, Hierarchical Convolutional Deep Learning in Computer Vision (New York University, Jan. 2014)
[3] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning (2013), pages 654-720.
[4] P. Dollár, R. Appel, S. Belongie and P. Perona, Fast feature pyramids for object detection, Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 8 (2014), pp. 1532–1545.
[5] S. Ren, K. He, R. Girshick and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, No. 6 (Jun. 2017), pp. 1137-1149.
[6] H.C. Shin, Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning, in IEEE Transactions on Medical Imaging, vol. 35, no. 5 (May, 2016), pp. 1285-1298.
[7] D. Hu, X. Zhou and X. Yu, Deep learning and its applications in Object Recognition and Tracking, Computer Science and Engineering Technology (CSET2015), Medical Science and Biological Engineering (MSBE 2015), pp. 87-92.
[8] W. Nam, P. Dollár and J. Han, Local decorrelation for improved pedestrian detection, In Advances in Neural Information Processing Systems (2014), pages 424–432.
[9] J. Yang and G. Yang, Modified Convolutional Neural Network Based on Dropout and the Stochastic Gradient Descent Optimizer, Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, 2018 (Guiyang, China, 2018)
[10] S. Li, W. Liu and G. Xiao, "Detection of Screw Nut Images Based on Deep Transfer Learning Network," 2019 Chinese Automation Congress (CAC), Hangzhou, China, 2019, pp. 951-955.
[11] H. Yang, S. Jiao and P. Sun, "Bayesian-Convolutional Neural Network Model Transfer Learning for Image Detection of Concrete Water-Binder Ratio," in IEEE Access, vol. 8, pp. 35350-35367, 2020.
[12] K. L. Masita, A. N. Hasan and S. Paul, "Pedestrian Detection Using R-CNN Object Detector," 2018 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Guadalajara, Mexico, 2018, pp. 1-6.
[13] Y. Ouyang, K. Wang and S. Wu, "SAR Image Ground Object Recognition Detection Method based on Optimized and Improved CNN," 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 2019, pp. 1727-1731.
[14] K.E.A. van de Sande, T. Gevers, C.G.M. Snoek, Evaluating Color Descriptors for Object and Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32 (2010), No. 9, pp. 1582–1596
[15] B. Cyganek, Color Image Segmentation With Support Vector Machines: Applications To Road Signs Detection, International Journal of Neural Systems, Vol. 18, No. 4, World Scientific Publishing Company (2008), pp. 339–345.
[16] S. Zhang, R. Benenson and B. Schiele, Filtered channel features for pedestrian detection, In CVPR (2015), pages 1751–1760.
[17] C.P. Papageorgiou, M. Oren and T. Poggio, A general framework for object detection. In Computer vision, 1998. Sixth international conference (1998), IEEE pages 555–562.
[18] N. Dalal, and B. Triggs, Histograms of Oriented Gradients for Human Detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1 (2005), pp. 886–893.
[19] H. Wu, Q. Chen and M. Yachida, Face Detection From Color Images Using a Fuzzy Pattern Matching Method, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 6 (1999), pp. 557–563.
[20] V. Vapnik, Statistical Learning Theory, Wiley New York, Inc. (1998).
[21] N. Lopes and B. Ribeiro, Support Vector Machines (SVMs), In Machine Learning for Adaptive Many-Core Machines - A Practical Approach. Studies in Big Data, Vol. 7, Springer, Cham (2015)
[22] P.V.C. Hough, Machine Analysis of Bubble Chamber Pictures, Proceedings of International Conference on High Energy Accelerators and Instrumentation, CERN, (1959)
[23] D.H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern Recognition, Elsevier, Vol. 13, No. 2 (1981), pp. 111–122.
[24] B. Jähne, Digital Image Processing. 6th edition, Springer-Verlag (2005)
[25] B. Cyganek, Orientation-based Hough transform and the UpWrite method, figure, Object detection and Recognition in Digital Images: Theory and Practice, May (2016), pp. 365.
[26] R.A McLaughlin, and M.D. Alder, The Hough Transform Versus the UpWrite, IEEE Transactions On Pattern Analysis And Machine Intelligence, Vol. 20, No. 4 (1998), pp. 396–400.
[27] J. Bigun, G.H. Granlund and J. Wiklund, Multidimensional Orientation Estimation with Applications to Texture Analysis and Optical Flow, IEEE PAMI, Vol. 13, No. 8 (1991), pp. 775–790.
[28] T. Hazan, S. Polak and A. Shashua, Sparse Image Coding using a 3D Non-negative Tensor Factorization, ICCV 2005 10th IEEE International Conference on Computer Vision, Vol. 1 (2005), pp. 50–57.
[29] Y. Ke, and R. Sukthankar, PCA-SIFT: A More Distinctive Representation for Local Image Descriptors, Computer Vision and Pattern Recognition, Vol. 2 (2004), pp. 506–513
[30] K. Mikolajczyk and C. Schmid, A Performance Evaluation Of Local Descriptors, Vol. 27, No. 10 (2005), pp. 1615-1630.
[31] T. Albrecht, M. Luthi and T. Vetter, Deformable Models, University of Basel, (Switzerland 2015)
[32] L. Perrinet, Sparse Models for Computer Vision, Biologically inspired computer vision, (2015)
[33] T.J. Mitchell, and J.J. Beauchamp, Bayesian Variable Selection in Linear Regression, Journal of the American Statistical Association, (2012)
[34] B.A Hoadley, Bayesian Look at Inverse Linear Regression, Journal of the American Statistical Association, Vol. 65, issue 329 (2012), pp. 356-369.
[35] Z. Xu and R. Akella, A bayesian logistic regression model for active relevance feedback, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (Singapore, 2008), pp. 227-234.
[36] A.M. Farayola, A.N. Hasan and Ali. Ahmed, Efficient Photovoltaic MPPT System Using Coarse Gaussian Support Vector Machine and Artificial Neural Network Techniques, in International Journal of Innovative Computing Information and Control (IJICIC), Vol. 14, No. 1 (Feb. 2018).
[37] L. Deng and D. Yu, Deep Learning: Methods and Applications, Foundations and Trends in Signal Processing, Vol. 7: No. 3–4 (2014), pp 197-387.
[38] N. Le Roux, and Y. Bengio, Representational power of restricted Boltzmann machines and deep belief networks, Neural Computation, Vol. 20, No. 6 (2008), pp. 1631–1649.
[39] B. Cyganek, Object detection and Recognition in Digital Images, Theory and Practice, (May, 2016)
[40] A. Krizhevsky, I. Sutskever and G. Hinton, ImageNet classification with deep convolutional neural Networks, Advances in Neural Information Processing Systems, (2012), pp. 1106–1114.
[41] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, C. Loy, K. Wang, J. Yan and X. Tang, DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: PP, Issue: 99), (Jul. 2016)
[42] A. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015
[43] D. Erhan, C. Szegedy, A. Toshev and D. Anguelov, Scalable object detection using deep neural networks, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2014), pp. 2155–2162.
[44] P. Vincent, H. Larochelle, Y. Bengio and P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, In ICML 2008, pg. 241.
[45] A. Almalaq and G. Edwards, A Review of Deep Learning Methods Applied on Load Forecasting, IEEE International Conference on Machine Learning and Applications, (2017)
[46] B. Pepik, M. Stark, P. Gehler and P. Schiele, Multi-view and 3D deformable part models, Transactions on Pattern Analysis and Machine Intelligence, Vol. 37, No. 11 (2015), pp. 2232–2245.
[47] R. Girshick, Fast R-CNN, In Proceedings of the IEEE International Conference on Computer Vision (2015), pages 1440–1448.
[48] K. Simonyan, and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Visual Geometry Group, Department of Engineering Science (University of Oxford, Apr. 2014)
[49] P. Sermanet, K. Kavukcuoglu, S. Chintala and Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, In: Computer Vision and Pattern Recognition (CVPR), IEEE Conference (2013), pp. 3626–3633.
[50] E. Ohn-Bar, and M. Trivedi, Learning to detect vehicles by clustering appearance pattern, IEEE Transactions on Intelligent Transportation Systems, Vol. 16, No. 5 (2015), pp. 2511–2521.
[51] V. Nguyen, H. Van Nguyen, D. Tran, S. Lee and J. W. Jeon, Learning Framework for Robust Obstacle Detection, Recognition, and Tracking, in IEEE Transactions on Intelligent Transportation Systems, Vol. 18, No. 6 (Jun. 2017), pp. 1633-1646.
[52] C. Zitnick and P. Dollár, Edge boxes: Locating object proposals from edges, in Proceedings of the 13th European Conference on Computer Vision, (2014) pp. 391–405.
[53] H. Li, Z. Wu and J. Zhang, Pedestrian detection based on deep learning model, 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong (2016), pp. 796-800.
[54] J. Uijlings, K. van de Sande, T. Gevers and A. Smeulders, Selective search for object recognition, International Journal on Computer Vision, Vol. 104, No. 2, (Sep. 2013), pp. 154–171.
[55] Y. Xiang, W. Choi, Y. Lin and S. Savarese, Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection, IEEE Winter Conference on Applications of Computer Vision (WACV), (Santa Rosa, CA, USA, 2017), pp. 924-933
[56] A. Geiger, P. Lenz and R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (Jun. 2012), pp. 3354–3361.
[57] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Las Vegas, NV, 2016), pp. 770-778.
[58] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (Las Vegas, NV, 2016), pp. 779-788.
[59] J. Uijlings, K. van de Sande, T. Gevers and A. Smeulders, Selective search for object recognition, International Journal on Computer Vision, Vol. 104, No. 2, (Sep. 2013), pp. 154–171.
[60] M. Everingham, L. Van Gool, C. Williams, J. Winn and A. Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision, Vol. 88, No. 2 (Jun. 2010), pp. 303–338.
[61] C. Ning, H. Zhou, Y. Song and J. Tang, Inception Single Shot MultiBox Detector for object detection, IEEE International Conference on Multimedia & Expo Workshops (ICMEW), (Hong Kong, 2017), pp. 549-554.
[62] K. Simonyan, A. Vedaldi and A. Zisserman, Deep Fisher networks for large-scale image classification, Proc. NIPS, (2013)
[63] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu and X. Wang, Object Detection in Videos with Tubelet Proposal Networks, Shenzhen Key Lab of Computer Vision & Pattern Recognition, Shenzhen Institutes of Advanced Technology, CAS (China, 2017)
[64] P. Perera and V. M. Patel, "Deep Transfer Learning for Multiple Class Novelty Detection," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 11536-11544.
[65] Y.N. Zhang; L. Qu; J.W. Chen; J.R. Liu; D.S. Guo, Weights and structure determination method of multiple-input Sigmoid activation function neural network. Appl. Res. Comput. 2012, 29, 4113–4116.
[66] P. Luo; H.F. Li, Research on Quantum Neural Network and its Applications Based on Tanh Activation Function. Comput. Digit. Eng. 2012, 16, 33–39.
[67] Z. Tang; L. Luo; H. Peng; S. Li, A joint residual network with paired ReLUs activation for image super-resolution. Neurocomputing 2018, 273, 37–46.
[68] Y. Li, G. Cao and W. Cao, "Stacking-based deep neural network for Facial Expression Recognition," 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 2019, pp. 1338-1342.
[69] M. T. Vo, T. Nguyen and T. Le, "Robust Head Pose Estimation Using Extreme Gradient Boosting Machine on Stacked Autoencoders Neural Network," in IEEE Access, vol. 8, pp. 3687-3694, 2020.
[70] J. Yan, Z. Lei, L. Wen and S. Z. Li, The fastest deformable part model for object detection, In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference (2014), pages 2497–2504.
[71] Lowe, D., Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision (2004), Vol. 60, No. 2, pp. 91–110.
