Theoretical Understanding of Convolutional Neural Network: Concepts, Architectures, Applications, Future Directions

Mohammad M. Taye
Data Science and Artificial Intelligence, Philadelphia University, Amman 19392, Jordan;
[email protected]
Abstract: Convolutional neural networks (CNNs) are one of the main types of neural networks
used for image recognition and classification. CNNs have several uses, some of which are object
recognition, image processing, computer vision, and face recognition. Input for convolutional neural
networks is provided through images. Convolutional neural networks are used to automatically learn
a hierarchy of features that can then be utilized for classification, as opposed to manually creating
features. In achieving this, a hierarchy of feature maps is constructed by iteratively convolving the
input image with learned filters. Because of the hierarchical method, higher layers can learn more
intricate features that are also distortion and translation invariant. The main goals of this study are to
help academics understand where there are research gaps and to discuss in depth CNN's building
blocks, their roles, and other vital issues.
Keywords: artificial intelligence (AI); deep learning (DL); machine learning (ML); convolutional neural
network (CNN); deep learning applications; image classification; supervised learning
1. Introduction
There has been a dramatic surge in the usage of machine learning (ML) in recent years [1–3] for a wide range of purposes, from research to practical applications, including text mining, spam detection, video recommendation, picture categorization, and multimedia idea retrieval [4,5].

Deep learning (DL) is one machine learning (ML) approach that is commonly used in these contexts [6,7]. The working domain of DL is a subset of that of ML and artificial intelligence (AI); therefore, it may be seen as a function of AI that mimics the way the human brain processes information [1]. DL has significantly surpassed the performance of the traditional neural networks from which it originated. In addition, DL uses transformations and graph technologies in tandem to construct multi-layer learning models [8,9].

A closer examination of the "learning sub-fields" reveals that deep learning (DL), a subfield of machine learning (ML), focuses on creating algorithms that simulate how the human brain thinks and solves problems [5,8,10].

Recent years have seen a surge in interest in machine learning algorithms, which are now being used in various fields, including image recognition, optical character recognition, pricing prediction, spam filtering, fraud detection, healthcare, transportation, and many others [11]. The various machine-learning types and algorithms are depicted in Figure 1.

Over the past few years, deep learning has received a lot of attention and has been applied successfully to a wide range of problems in various application fields. Diverse deep-learning techniques are applied in several application areas [4]. These application areas include robots, enterprises, cybersecurity, virtual assistants, image recognition, and healthcare. They also involve sentiment analysis and natural language processing [8].
Figure 1. Machine Learning Parts.
The most established deep learning technique is the convolutional neural network (CNN), a subtype of an artificial neural network [12]. Since the astounding outcomes of the ImageNet Large Scale Visual Recognition Competition, an object recognition competition, in 2012 [13,14], CNN has dominated computer vision tasks.
CNN is useful in medical imaging because it can detect tumors and other irregularities more accurately in X-ray and MRI images. CNN models can analyze a picture of a human body component, such as the lungs, and identify potential tumor locations, as well as other abnormalities such as fractured bones in X-ray images, based on comparable images previously processed by CNN networks [15–17].
Convolutional neural networks (CNNs), which are used to represent spatial information, may be used to model images.
Because of their greater capacity to extract features from pictures, such as barriers and road signs, CNNs are characterized as universal non-linear function approximators.
CNN has been used for biometric user identity authentication by recognizing particular physical characteristics associated with a person's face. CNN models may be trained on photos or videos of individuals to recognize certain features of their faces, such as the distance between their eyes, the shape of their noses, and the curve of their lips [15–17].
Most of the time, convolutional neural networks have led to ground-breaking discoveries in many fields related to pattern recognition, such as voice and image processing. The most important property of CNNs is that they reduce the number of ANN parameters [18,19].
This success has encouraged researchers and developers to utilize more complex models to tackle difficult issues that conventional ANNs were unable to address. The most important presupposition regarding the problems that CNN resolves is that the features of interest should not be spatially dependent; that is, a feature can be detected regardless of where it appears in the image [19].
CNN is largely responsible for the current popularity of DL. The primary benefit of CNN over its forerunners is that it does everything automatically and without human supervision, making it the most popular approach. We have therefore covered a lot of ground with CNN by outlining its essential parts. The most prevalent CNN architectures, starting with the AlexNet network and concluding with the high-resolution network (HRNet), have also been explored at length [19,20].
This review's main objective is to draw attention to the elements of CNN that matter most, making it simple for researchers and students to comprehend CNN completely after reading just one review paper. Additionally, in order to encourage CNN research, we want readers to understand current developments in the field. By giving a more precise view of the available options, this should allow researchers to select the best route of study to follow.
CNNs learn as they are trained on images; therefore, the features they extract from
images are not pre-learned.
Automatic feature extraction is largely responsible for the remarkable success of deep learning models in computer vision. Deep CNN architectures require complex models, and obtaining more precision from them calls for larger image databases. For computer vision tasks such as object categorization, detection, tracking, and recognition, CNNs need access to huge labeled datasets.
Object identification, which has captivated researchers for much of this decade, has
significantly benefited from the application of deep learning techniques. Object identifi-
cation and tracking play a crucial role in video surveillance, making it one of the most
difficult but vital aspects of security systems. It keeps an eye on people in public places to
spot any signs of unusual or suspicious conduct.
The general contribution of this study is summarized as follows:
• This review provides an in-depth analysis of CNN's most important features.
• This review provides an in-depth analysis of CNN's networks and algorithms.
• In this paper, I have compiled the current deep learning-based object detection methods that can be found in the most recent academic literature.
• To help pick the best object detection method for a given application or dataset, I reviewed and compared several popular options.
• This article focuses on CNN's models and processes.
• We put together a list of CNN architectures to help developers and academics learn more about how to use CNN.
• We explain CNN, the most well-known deep learning algorithm, in depth by outlining its ideas, theories, and cutting-edge architectures.
Survey Methodology
I have analyzed the key research articles published in the field between 2017 and 2022,
with a focus on those from 2017, 2018, and 2019, along with a few from 2021 and 2022. The
primary emphasis was on publications from the most prestigious publishers, including
IEEE, Elsevier, MDPI, ACM, and Springer. Several papers from ArXiv have been chosen.
I have examined over 60 papers on a variety of DL-related topics. There are 14 papers
from 2017, 12 papers from 2018, 19 papers from 2019, and 15 papers from all other years
(2020–2022). This shows that the focus of this review was on the most recent publications in
DL and CNN. The selected publications were analyzed and evaluated in order to perform
the following:
(1) List and describe the DL and CNN techniques and network types;
(2) Present the problems of CNN and provide alternative solutions;
(3) List and explain CNN architectures;
CNN Layers

A CNN is typically composed of four types of layers:
• Convolutional;
• Pooling;
• Function of Activation;
• Fully Connected.

2.1.2. Convolutional Layer

The convolutional layer is a crucial part of CNN's overall structure. It is a set of filters, or kernels, that are applied to the input data. Each kernel's width, height, and weights are used to extract characteristics from the input data. The weights in a kernel are first assigned at random but gradually become more informed by the training data. In other words, the feature map is made by convolving the input image (which is represented by an N-dimensional matrix) with these filters [15–17].
A kernel is a set of discrete values or integers, each of which is referred to as a kernel weight. The initial kernel weights for a CNN are a set of values picked at random, and the weights can be initialized in various ways. In turn, the kernel learns to extract meaningful features because these weights are adjusted throughout the training process.
For an input of size K and a filter of size L (with unit stride and no padding), the output size is

O = (K − L + 1). (1)
Figure 3. A visual representation of the primary calculations.
Figure 5. Stride 1: the filter window moves only one step for each connection.
Equation (2) formalizes this, giving the output size O shown in Figure 6 for an N × N image and an F × F filter:

O = 1 + (N − F)/S, (2)

where N is the input size, F is the filter size, and S is the stride.
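To make Equation (2) concrete, the following minimal NumPy sketch (the input size, filter values, and stride are illustrative, not taken from the figures) slides an F × F filter over an N × N input and confirms that the resulting feature map has size O = 1 + (N − F)/S:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Naive 2D convolution: one filter, one channel, no padding."""
    n, f = image.shape[0], kernel.shape[0]
    o = 1 + (n - f) // stride              # Equation (2): O = 1 + (N - F)/S
    out = np.zeros((o, o))
    for i in range(o):
        for j in range(o):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

image = np.random.rand(7, 7)               # N = 7
kernel = np.random.rand(3, 3)              # F = 3
print(conv2d_valid(image, kernel, stride=2).shape)  # (3, 3), since 1 + (7 - 3)/2 = 3
```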
Figure 7. Zero-padding.
A note on the features of CNNs: due to the weight sharing described above, the model is also invariant under translational changes, so a learned filter is useful no matter where its feature appears in the image. Even when starting from random values, the filters will learn to detect features such as edges (as in Figure 3) if doing so improves performance. It is crucial to note that shared weights are a bad idea when the spatial position of a feature within the input is itself significant.
2.1.3. Pooling

The pooling layer, also known as the down-sampling layer, is used to decrease the feature maps' dimensionality while retaining the most important data. In the pooling layer, a filter applies the pooling operation (max, min, or average) to the input data by sliding over it. In the literature, maximum pooling is most frequently utilized.

The essential part of pooling, which is utilized to reduce the complexity of upper layers, is down-sampling. In terms of image processing, it may be comparable to reducing the resolution. The filter count is unaffected by pooling. Max-pooling is one of the most often used pooling methods: the picture is divided into rectangular subregions, and only the greatest value discovered inside each subregion is returned. One of the most prevalent max-pooling sizes is 2 × 2.

As shown in Figure 8, when pooling is applied to the 2-by-2 block in the top-left corner, the window then moves two steps toward the top-right corner; as a result, a stride of 2 is used for pooling. It is possible to use stride 1, which is unusual, to prevent downsampling. Keep in mind that downsampling does not preserve the position of the data.
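The following is a minimal sketch of the max-pooling operation described above, using the common 2 × 2 window with stride 2 (the feature-map values are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Return the largest value inside each size x size subregion."""
    h, w = feature_map.shape
    out_h, out_w = 1 + (h - size) // stride, 1 + (w - size) // stride
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = block.max()        # keep only the maximum of the subregion
    return out

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))   # the 4 x 4 map is down-sampled to 2 x 2; filter count is unchanged
```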
2.1.4. Function of Activation

Figure 9. Function of Activation.

Function and gradient definitions using ReLU are simpler.
Saturated functions, such as the sigmoid and the tanh, have issues with backpropagation. This phenomenon, known as the "vanishing gradient", occurs when the gradient signal gradually decreases as the depth of the neural network architecture grows. This happens because the gradient of these functions is essentially zero everywhere away from the center. Nonetheless, the ReLU has a constant gradient for positive inputs. While the function is not differentiable at zero, this can be ignored during implementation.

Third, the ReLU generates a sparser representation, because negative inputs produce complete zeros. For sigmoid and tanh, the outputs are never exactly zero, which may be counterproductive during training [22,23,26–29].

When using ReLU, a few significant problems may occasionally arise. Consider, for instance, an error backpropagation step with a large gradient flowing through a ReLU unit. Passing this gradient through the ReLU function may update the weights in such a way that the neuron is never activated again. This problem is known as "Dying ReLU". In order to address these problems, there are some ReLU substitutes; some of them are discussed below.

Leaky ReLU: This activation function makes sure that negative inputs are never entirely disregarded: instead of being zeroed out, as with ReLU, they are scaled down by a small factor. It is used to address the Dying ReLU issue.
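The following sketch contrasts ReLU with Leaky ReLU (the negative slope of 0.01 is a common illustrative choice, not a value prescribed here); it makes visible the zero gradient for negative inputs that underlies the Dying ReLU problem:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # exact zeros give a sparser representation

def relu_grad(x):
    return (x > 0).astype(float)           # gradient is zero for all negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # negative inputs are scaled, not dropped

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))         # [0.   0.   0.   0.5  2. ]
print(relu_grad(x))    # [0. 0. 0. 1. 1.]  <- no gradient flows where x < 0
print(leaky_relu(x))   # [-0.02 -0.005  0.  0.5  2. ]  <- "Dying ReLU" is avoided
```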
2.1.5. Fully Connected Layer
Neurons are organized into groups in the fully-connected layer that are reminiscent of those seen in traditional neural networks. As shown in Figure 10, any node in a fully connected layer is directly connected to every node in the layers above and below it. Figure 10 also shows that every node of the pooling layer's final feature maps is connected as a vector from the fully-connected layer to the top layer. The most often utilized CNN parameters are found within these layers; however, they need a lot of training time [22–24].

The biggest drawback of a fully connected layer is the large number of parameters, which necessitates laborious computation over the training samples. Consequently, we try to minimize the number of connections and nodes; the eliminated nodes and connections can be compensated for using the dropout approach. LeNet and AlexNet, for example, developed vast and deep networks while keeping the computational complexity constant [22,23].
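As a rough illustration of the parameter cost discussed above, this sketch (all layer sizes are illustrative) flattens a stack of pooled feature maps into a vector and feeds it through one fully connected layer:

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.random((64, 7, 7))      # 64 pooled feature maps of size 7 x 7
x = pooled.reshape(-1)               # flatten to a vector of 64 * 7 * 7 = 3136 values

n_out = 512                          # neurons in the fully connected layer
W = rng.standard_normal((n_out, x.size)) * 0.01   # one weight per connection
b = np.zeros(n_out)

h = W @ x + b                        # every output node is connected to every input
print(h.shape)                       # (512,)
print(W.size + b.size)               # 1,606,144 parameters for this single layer
```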
Figure 11. Forward and backpropagation in hidden CNN layers.

All inputs, including the bias unit, are summed by the activation unit; the activation function is then used to compute the result. The network will then calculate the cost function and send the error back to update the weights until the cost is minimized. The algorithm repeats this procedure until it reaches the first layer.

3. Regularization of CNN

When trying to create well-behaved generalizations for CNN models, over-fitting is the key obstacle. Over-fitting describes a situation in which a model does well on training data but poorly on test data (data it has never seen before), as will be shown in the next section. When the model does not pick up enough information from the training data, it is said to be under-fitted [26–28].

A model is considered to be "just fit" if it produces satisfactory results on both the training and testing data. These three types are shown in Figure 12.

Multiple intuitive conceptions are used to facilitate regularization and prevent over-fitting; more details on over-fitting and under-fitting are provided below.
Using dropout as a generalization strategy is popular: neurons are removed at random during each training session. Besides forcing the model to acquire several independent features, this method also makes feature selection equally weighted across the whole neural network.
A dropped neuron will not take part in either backward or forward propagation
during training. Testing makes use of the full-scale network to make predictions [29,30].
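Below is a minimal sketch of dropout under the common "inverted dropout" convention, in which activations are rescaled during training so that testing can use the full-scale network unchanged (the keep probability of 0.5 is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                   # testing: the full-scale network is used
    mask = rng.random(activations.shape) < keep_prob   # drop neurons at random
    return activations * mask / keep_prob    # rescale to keep the expected value

a = np.ones(8)
print(dropout(a, training=True))    # roughly half the neurons are zeroed out
print(dropout(a, training=False))   # all ones: every neuron participates
```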
The drop-weight method is quite similar to the dropout strategy. Drop-weight training
differs from dropout in that only the weights (connections) between neurons are eliminated
after each training iteration.
Over-fitting may easily be prevented when the model is trained using an enormous amount of data, and data augmentation is one way to obtain such data. There are a few methods that may be utilized to increase the size of the training dataset artificially; these data augmentation methods are discussed in further depth later.
Through batch normalization, the efficacy of the final activations may be ensured [31,32]. The normalized activations are well described by a unit Gaussian distribution: when normalizing the output at each layer, we first subtract the mean and then divide by the standard deviation. Despite being conceptualized as a pre-processing step at each tier of the network, the operation is differentiable and can therefore be integrated into end-to-end training.
Batch normalization is also employed to reduce the "internal covariate shift" of the activation layers. The change in internal covariates at each layer is defined by the variability of the activation distribution. This change is amplified by the continual update of weights during training, which might occur if training data samples come from a wide range of sources (for example, day and night images). As a result, the model needs more time to converge, and the training period lengthens. To deal with this issue, a layer that performs batch normalization is added to the CNN design [33–35].
The following are some benefits of using batch normalization:
• It stops the vanishing gradient issue before it starts.
• It can effectively manage poor weight initialization.
• It significantly reduces the amount of time needed for network convergence (which is very helpful for large datasets).
• It struggles to reduce the dependence of training on various hyperparameters.
• Over-fitting is less likely because batch normalization has a slight regularization effect [35].
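The normalization step itself can be sketched as follows: subtract the per-batch mean and divide by the standard deviation (epsilon and the learnable scale and shift, gamma and beta, follow the standard batch-normalization formulation; the mini-batch values are illustrative):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch: subtract the mean, divide by the std."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # approximately unit-Gaussian activations
    return gamma * x_hat + beta               # learnable scale and shift

batch = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
out = batch_norm(batch)
print(out.mean(axis=0))   # ~[0, 0]
print(out.std(axis=0))    # ~[1, 1]
```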
R-FCN swaps out the fully connected layers for position-sensitive score maps, which results in improved object detection.
Architecture Name: ZfNet [2014]
Layers: 8
Main Contribution: Conceptualization of middle levels.
Highlights: Illustrated parameter tweaking by displaying the output of intermediary layers; diminished the filter size and stride in the initial two layers of AlexNet.
Gaps: Further processing of information is necessary for visualization.

Architecture Name: ResNet [2016]
Layers: 50 in ResNet-50, 101 in ResNet-101, 152 in ResNet-152
Main Contribution: A unique design that features "skip connections" and extensive batch normalization.
Highlights: Identity-mapping-based skip connections; long-term retention of knowledge; resistant to over-fitting due to symmetry-mapping-based skip connections.
Strength: Reduces the error rate of deeper networks; introduces the concept of residual learning; mitigates the vanishing gradient problem.
Gaps: A slightly intricate structure; degrades feature-map information in feed-forwarding; excessive adaptation of hyperparameters for a specific task as a result of stacking identical modules.
The compound coefficient method was used by EfficientNet in 2019 to efficiently and effectively scale up models. Rather than arbitrarily picking one of these dimensions, compound scaling uses a uniform set of scaling coefficients to increase the network's depth, width, and input resolution together. The authors of the paper used the scaling method and AutoML to create seven models of varying dimensions that both outperformed and were more efficient than the state-of-the-art convolutional neural networks.
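A small sketch of the compound-scaling rule follows; the base coefficients alpha = 1.2, beta = 1.1, and gamma = 1.15 are the values reported in the original EfficientNet paper and are quoted here purely for illustration:

```python
# Compound scaling: depth, width, and input resolution grow together,
# governed by a single compound coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # grid-searched values from the EfficientNet paper

def compound_scale(phi):
    depth_mult = ALPHA ** phi         # scales the number of layers
    width_mult = BETA ** phi          # scales the number of channels
    res_mult = GAMMA ** phi           # scales the input image resolution
    return depth_mult, width_mult, res_mult

for phi in range(4):                  # phi = 0 corresponds to the EfficientNet-B0 baseline
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```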
CNNs have an advantage over other classification algorithms like SVM, K-NN, Ran-
dom Forest, and others because they learn the most important features to represent the
objects in an image.
5.2. Detection
The difficult computer vision task of object detection involves anticipating both the
location of the objects in the image and the kind of objects that were found. Beginners may
find it difficult to differentiate between various related computer vision tasks.
For instance, the distinction between object localization and object detection might be difficult to understand, even though these tasks, together with image classification, may be collectively referred to as object recognition. Image categorization, by contrast, is straightforward.
Image classification involves assigning a category label to an image, whereas ob-
ject localization involves drawing a bounding box around one or more objects in an
image [52–54].
The more challenging object detection challenge involves doing both of these things
at once, drawing bounding boxes around each object of interest in the image and then
labeling each object with its class.
Together, these real-world problems are referred to as "object recognition".
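Predicted bounding boxes are commonly scored against ground-truth boxes by their intersection over union (IoU); the following is a minimal sketch, with boxes in (x1, y1, x2, y2) form and illustrative coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # overlap area (0 if the boxes are disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14: poor localization
print(iou((10, 10, 50, 50), (12, 12, 52, 52)))  # ~0.82: good localization
```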
The process of object detection can be broken down into two distinct phases, as shown in Figure 13.
YOLO detects objects by learning generalizable representations of them [16]. While YOLOv1 used a fully
connected layer to generate bounding boxes, YOLOv2 introduced batch normalization
and a high-resolution classifier in 2016 [43,45]. YOLOv3 was proposed in 2018 [43,45]
using a 53-layer backbone-based network that predicted overlapping bounding boxes and
smaller objects using an independent logistic classifier and binary cross-entropy loss. In
contrast to YOLO models, which produce feature maps by constructing grids within an
image, SSD models were offered as a superior choice to execute inference on videos and
real-time applications since they share features between the classification and localization
task on the complete picture. Although YOLO models are quicker to run, they are not as
accurate as SSD models [45]. While YOLO and SSD models offer fast inference speeds, they
struggle with class imbalance when identifying tiny objects. The RetinaNet detector [20]
overcame this problem by using a dedicated network for classification and bounding box
regression during training and a focal loss function. Better methods for data augmentation
and regularization during training (‘bag of freebies’) and a post-processing module that
enables better mAP and faster inference (‘bag of specials’) were introduced in YOLOv4 [45].
YOLOv5 was then proposed, further improving data augmentation and loss calculation; it also used self-learning bounding box anchors to tailor itself to a specific dataset. Another variant, termed YOLOR (You Only Learn One Representation), was presented in 2021 to forecast the output; it employed a single network that encoded both implicit and explicit knowledge.
including object identification, multilabel picture classification, and feature embedding.
Similarly, the YOLOX model was introduced in 2021; it employs a decoupled head method
that eliminates the need for anchors and permits the network to process classification and
regression independently. When compared to YOLOv4 and YOLOv5 models, YOLOX has
fewer parameters and faster inference [45].
5.3. Segmentation
As the name implies, this is the process of segmenting an image into various parts.
Each pixel in the image is given an object type throughout this process. Semantic segmenta-
tion and instance segmentation are the two main categories of image segmentation [55,56].
In instance segmentation, related items receive their own unique labels, whereas, in
semantic segmentation, all objects of the same type are tagged using a single class name.
Semantic segmentation has come a long way in the last decade thanks to the develop-
ment of deep learning-based models, particularly fully convolutional networks (FCNs) [37]
and variants [38]. To learn stable and secure features, FCNs leverage pre-existing deep
neural networks. By replacing the fully connected layers with convolutional ones, an FCN converts popular classification models like VGG (16-layer net) [31] and ResNet [44] into fully convolutional ones that produce spatial maps rather than classification scores. To generate dense per-pixel labeled outputs, these maps are upsampled using fractionally-strided convolutions. U-Net is another model used for rapid and accurate image segmentation based on a convolutional network architecture; it was developed by the Computer Science Department of the University of Freiburg [44]. In the ISBI challenge for segmenting neuronal structures in electron microscopic stacks, it outperformed the previous best method (a sliding-window convolutional network). The main drawback of the U-Net architecture is that it can slow down training in the middle layers of deeper networks, increasing the risk that those layers are skipped over. The main cause of this phenomenon is that gradients weaken as the network moves away from the output layer, where the training loss is calculated. Table 2 shows a comparison between different types of CNN.
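Dense per-pixel labeling of the kind produced by FCN-style models amounts to taking an argmax over the class dimension of the spatial score maps; a minimal sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# An FCN-style head outputs one score map per class: (classes, height, width).
scores = rng.standard_normal((21, 4, 4))   # e.g., 21 classes on a 4 x 4 spatial map

labels = scores.argmax(axis=0)             # per-pixel class decision
print(labels.shape)                        # (4, 4): one class label per pixel
print(labels)                              # the resulting semantic segmentation mask
```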
6. Future Directions
Indeed, the performance of classification in terms of accuracy, misclassification rate,
precision, and recall is heavily influenced by the combination of convolutional layers,
the number of pooling layers, the number of filters, the filter size, the stride rate, and
the location of the pooling layer when designing a convolutional neural network. CNN
training necessitates the use of powerful and impressive hardware resources, such as GPUs.
Training and testing various combinations of parameters repeatedly requires a great deal of
time and high computing resources like GPUs in order to obtain a satisfactory result [59,60].
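To see why such hyperparameter sweeps are expensive, consider the standard parameter count of a single convolutional layer, C_out × (C_in × F × F + 1); a small sketch with illustrative layer sizes:

```python
def conv_params(c_in, c_out, f):
    """Weights plus one bias per filter for a convolutional layer."""
    return c_out * (c_in * f * f + 1)

# Increasing the filter size from 3 to 5 almost triples this layer's parameters,
# which is part of why repeated training runs demand so much GPU time.
print(conv_params(64, 128, 3))   # 73,856
print(conv_params(64, 128, 5))   # 204,928
```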
7. Conclusions
I have provided an organized and thorough overview of deep learning technology
in this paper, which is regarded as a fundamental component of both data science and
artificial intelligence.
It begins with a history of artificial neural networks before moving on to more modern
deep learning methods and innovations in several fields.
The main techniques in this field are then examined, along with deep neural network
modeling in multiple dimensions.
For this, I have also provided a taxonomy that takes into account the various deep-
learning tasks and their many applications.
In this comprehensive study, I took into account both supervised learning with deep networks and unsupervised, generative learning with deep networks. I have also considered hybrid learning, which may be used in a variety of real-world contexts depending on the specifics of the problem at hand.
Finally, I summarize several important problems with convolutional neural networks
(CNNs) and describe how each parameter affects the network’s performance. The convolu-
tion layer is the heart of a CNN and is responsible for the vast majority of processing time.
A network's performance can be affected by the number of layers it contains; moreover, training and testing the network take more time as the number of layers grows.
References
1. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications, and Research Directions. SN Comput. Sci. 2021, 2, 1–21.
[CrossRef]
2. Du, K.-L.; Swamy, M.N.S. Fundamentals of Machine Learning. Neural Netw. Stat. Learn. 2019, 21–63. [CrossRef]
3. Zhao, Q.; Zheng, P.; Xu, S.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [CrossRef]
4. Indrakumari, R.; Poongodi, T.; Singh, K. Introduction to Deep Learning. EAI/Springer Innov. Commun. Comput. 2021, 1–22.
[CrossRef]
5. AI vs Machine Learning vs Deep Learning|Edureka. Available online: https://www.edureka.co/blog/ai-vs-machine-learning-
vs-deep-learning/ (accessed on 11 August 2022).
6. Cintra, R.J.; Duffner, S.; Garcia, C.; Leite, A. Low-complexity approximate convolutional neural networks. IEEE Trans. Neural
Netw. Learn. Syst. 2018, 29, 5981–5992. [CrossRef]
7. Rusk, N. Deep learning. Nat. Methods 2017, 13, 35. [CrossRef]
8. Shrestha, A.; Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 2019, 7, 53040–53065. [CrossRef]
9. Zhang, Z.; Cui, P.; Zhu, W. Deep Learning on Graphs: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 249–270. [CrossRef]
10. Mishra, R.K.; Reddy, G.Y.S.; Pathak, H. The Understanding of Deep Learning: A Comprehensive Review. Math. Probl. Eng. 2021,
2021, 1–5. [CrossRef]
11. Ker, J.; Wang, L.; Rao, J.; Lim, T. Deep Learning Applications in Medical Image Analysis. IEEE Access 2017, 6, 9375–9379.
[CrossRef]
12. Dhillon, A.; Verma, G.K. Convolutional neural network: A review of models, methodologies, and applications to object detection.
Prog. Artif. Intell. 2019, 9, 85–112. [CrossRef]
13. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.
IEEE Trans. Neural Networks Learn. Syst. 2021, 1–21. [CrossRef] [PubMed]
14. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif.
Intell. Rev. 2020, 53, 5455–5516. [CrossRef]
15. Introduction to Convolutional Neural Networks (CNNs)|The Most Popular Deep Learning architecture|by Louis
Bouchard|What is Artificial Intelligence|Medium. Available online: https://medium.com/what-is-artificial-intelligence/
introduction-to-convolutional-neural-networks-cnns-the-most-popular-deep-learning-architecture-b938f62f133f (accessed on 8
August 2022).
16. Koushik, J. Understanding Convolutional Neural Networks. May 2016. Available online: http://arxiv.org/abs/1605.09081
(accessed on 13 August 2022).
17. Bezdan, T.; Džakula, N.B. Convolutional Neural Network Layers and Architectures. In International Scientific Conference on
Information Technology and Data Related Research; Singidunum University: Belgrade, Serbia, 2019; pp. 445–451. [CrossRef]
18. Zhang, J.; Huang, J.; Chen, X.; Zhang, D. How to fully exploit the abilities of aerial image detectors. In Proceedings of the IEEE
International Conference on Computer Vision Workshops 2019, Seoul, Republic of Korea, 27–28 October 2019.
19. Rodriguez, R.; Gonzalez, C.I.; Martinez, G.E.; Melin, P. An Improved Convolutional Neural Network Based on a Parameter
Modification of the Convolution Layer. In Fuzzy Logic Hybrid Extensions of Neural and Optimization Algorithms: Theory and
Applications; Springer: Cham, Switzerland, 2021; pp. 125–147. [CrossRef]
20. Batmaz, Z.; Yurekli, A.; Bilge, A.; Kaleli, C. A review on deep learning for recommender systems: Challenges and remedies. Artif.
Intell. Rev. 2019, 52, 137. [CrossRef]
21. Fang, X. Understanding deep learning via back-tracking and deconvolution. J. Big Data 2017, 4, 40. [CrossRef]
22. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.;
Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8,
83. [CrossRef] [PubMed]
23. Du, K.L.; Swamy, M.N.S. Neural networks and statistical learning, second edition. In Neural Networks and Statistical Learning, 2nd
ed.; Springer: Berlin/Heidelberg, Germany, 2019; pp. 1–988. [CrossRef]
24. Zhang, Q.; Zhang, M.; Chen, T.; Sun, Z.; Ma, Y.; Yu, B. Recent advances in convolutional neural network acceleration. Neurocom-
puting 2019, 323, 37–51. [CrossRef]
25. Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN variants for computer vision:
History, architecture, application, challenges and future scope. Electronics 2021, 10, 2470. [CrossRef]
26. Prakash, K.B.; Kannan, R.; Alexander, S.A.; Kanagachidambaresan, G.R. Advanced Deep Learning for Engineers and Scientists: A
Practical Approach; Springer: Berlin/Heidelberg, Germany, 2021.
27. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. arXiv 2017, arXiv:1708.02002.
28. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S. Single-shot refinement neural network for object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
29. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
30. Hsieh, M.-R.; Lin, Y.-L.; Hsu, W. Drone-based object counting by spatially regularized regional proposal network. In Proceedings
of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
31. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE International
Conference on Computer Vision 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8311–8320.
32. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
33. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Shi, J. Foveabox: Beyond anchor-based object detector. arXiv 2019, arXiv:1904.0379729.
34. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017.
35. Ghiasi, G.; Lin, T.-Y.; Le, Q. Dropblock: A regularization method for convolutional networks. In Proceedings of the 32nd International
Conference on Neural Information Processing Systems; Curran Associates Inc.: Dutchess County, NY, USA; pp. 10727–10737.
36. Müller, R.; Kornblith, S.; Hinton, G. When does label smoothing help? In Advances in Neural Information Processing Systems 2019;
Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; pp. 4696–4705.
37. Dollár, K.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy,
22–29 October 2017.
38. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
39. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE
International Conference on Computer Vision 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6569–6578.
40. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-head r-cnn: In defense of twostage object detector. arXiv 2017,
arXiv:1711.07264.
41. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [CrossRef]
42. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer
Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 734–750.
43. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
44. Swapna, M.; Sharma, D.Y.K.; Prasad, D.B. CNN Architectures: Alex Net, Le Net, VGG, Google Net, Res Net. Int. J. Recent Technol.
Eng. 2020, 8, 953–959. [CrossRef]
45. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
46. Wang, C.-Y.; Liao, H.-Y.; Yeh, I.-H.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W. CSPNet: A new backbone that can enhance learning
capability of CNN. arXiv 2019, arXiv:1911.11929.
47. Yun, S.; Han, D.; Oh, S.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable
features. In Proceedings of the IEEE International Conference on Computer Vision 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6023–6032.
48. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
49. Law, H.; Teng, Y.; Russakovsky, O.; Deng, J. Cornernet-lite: Efficient keypoint based object detection. arXiv 2019, arXiv:1904.08900.
50. Chen, K.; Fu, K.; Yan, M.; Gao, X.; Sun, X.; Wei, X. Semantic segmentation of aerial images with shuffling convolutional neural
networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 173–177. [CrossRef]
51. Pailla, D. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF
International Conference on Computer Vision Workshops 2019; IEEE: Piscataway, NJ, USA, 2019.
52. Terrail, J.D.; Jurie, F. On the use of deep neural networks for the detection of small vehicles in ortho-images. In Proceedings of the
2017 IEEE International Conference on Image Processing (ICIP 2017), Beijing, China, 17–20 September 2017; pp. 4212–4216.
53. Shen, J.; Shafiq, M.O. Deep Learning Convolutional Neural Networks with Dropout—A Parallel Approach. In Proceedings of the
17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, 17–20 December
2018; pp. 572–577. [CrossRef]
54. Xiao, Y.; Tian, Z.; Yu, J.; Zhang, Y.; Liu, S.; Du, S.; Lan, X. A review of object detection based on deep learning. Multimed. Tools
Appl. 2020, 79, 23729–23791. [CrossRef]
55. Adem, K. Impact of activation functions and number of layers on detection of exudates using circular Hough transform and
convolutional neural networks. Expert. Syst. Appl. 2022, 203, 117583. [CrossRef]
56. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
57. Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation Functions: Comparison of trends in Practice and Research for
Deep Learning. arXiv 2018, arXiv:1811.03378.
58. Zhang, Z. Improved Adam Optimizer for Deep Neural Networks. In Proceedings of the 2018 IEEE/ACM 26th International
Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018. [CrossRef]
59. Coelho, I.M.; Coelho, V.N.; Luz, E.J.D.S.; Ochi, L.S.; Guimarães, F.G.; Rios, E. A GPU deep learning metaheuristic based model for
time series forecasting. Appl. Energy 2017, 201, 412–418. [CrossRef]
60. Huh, J.-H.; Seo, Y.-S. Understanding edge computing: Engineering evolution with artificial intelligence. IEEE Access 2019, 7,
164229–164245. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.