

computation
Review
Theoretical Understanding of Convolutional Neural Network:
Concepts, Architectures, Applications, Future Directions
Mohammad Mustafa Taye

Data Science and Artificial Intelligence, Philadelphia University, Amman 19392, Jordan;
[email protected]

Abstract: Convolutional neural networks (CNNs) are one of the main types of neural networks
used for image recognition and classification. CNNs have several uses, some of which are object
recognition, image processing, computer vision, and face recognition. Input for convolutional neural
networks is provided through images. Convolutional neural networks are used to automatically learn
a hierarchy of features that can then be utilized for classification, as opposed to manually creating
features. In achieving this, a hierarchy of feature maps is constructed by iteratively convolving the
input image with learned filters. Because of the hierarchical method, higher layers can learn more
intricate features that are also distortion and translation invariant. The main goals of this study are to
help academics understand where there are research gaps and to talk in-depth about CNN’s building
blocks, their roles, and other vital issues.

Keywords: artificial intelligence (AI); deep learning (DL); machine learning (ML); convolution neural
network (CNN); deep learning applications; image classification; supervised learning

1. Introduction
There has been a dramatic surge in the usage of machine learning (ML) in recent
years [1–3] for a wide range of purposes, from research to practical applications, including
text mining, spam detection, video recommendation, picture categorization, and multimedia idea retrieval [4,5].
Deep learning (DL) is one machine learning (ML) approach that is commonly used in these contexts [6,7]. The working domain of DL is a subset of that of ML and artificial intelligence (AI); therefore, it may be seen as a function of AI that mimics the way the human brain processes information [1]. The traditional neural network from which DL originated has been significantly surpassed by its superior performance. In addition, DL uses transformations and graph technologies in tandem to construct multi-layer learning models [8,9].
A closer examination of the "learning sub-fields" reveals that deep learning (DL), a subfield of machine learning (ML), focuses on creating algorithms that simulate how the human brain thinks and solves problems [5,8,10].
Recent years have seen a surge in interest in machine learning algorithms, which are now being used in various fields, including image recognition, optical character recognition, pricing prediction, spam filtering, fraud detection, healthcare, transportation, and many others [11]. The various machine-learning types and algorithms are depicted in Figure 1.
Over the past few years, deep learning has received a lot of attention and has been applied successfully in addressing a wide range of problems in various application fields. Diverse deep-learning techniques are applied in several application areas [4].
These application areas include robots, enterprises, cybersecurity, virtual assistants, image recognition, and healthcare. They also involve sentiment analysis and natural language processing [8].



Figure 1. Machine Learning Parts.

The most established deep learning technique is the convolutional neural network (CNN), a subtype of an artificial neural network [12]. Since the astounding outcomes of the ImageNet Large Scale Visual Recognition Competition, an object recognition competition, in 2012 [13,14], CNN has dominated computer vision tasks.
CNN is useful in medical imaging because it can detect tumors and other irregularities more accurately in X-ray and MRI images. CNN models can analyze a picture of a human body component, such as the lungs, and identify potential tumor locations as well as other abnormalities, like fractured bones in X-ray images, based on comparable images previously processed by CNN networks [15–17].
Convolutional neural networks (CNN), which are used to represent spatial information, may be used to model images.
Because of their greater capacity to extract features from pictures, such as barriers and road signs, CNNs are characterized as universal non-linear function approximators. CNN has been used for biometric user identity authentication by recognizing particular physical characteristics associated with a person's face. CNN models may be trained on photos or videos of individuals to recognize certain features of their faces, such as the distance between their eyes, the shape of their noses, and the curve of their lips [15–17].
Most of the time, convolutional neural networks have led to ground-breaking discov-
eries in many fields related to pattern recognition, such as voice and image processing.
The reduction in the number of ANN parameters is the most important property of CNNs [18,19].

This success has encouraged researchers and developers to utilize more complex models to tackle difficult issues that conventional ANNs were unable to address. The most important presupposition regarding the problems that CNN resolves is that they should not contain spatially dependent features; in other words, the exact location of a feature in the image is not important [19].
CNN is largely responsible for the current popularity of DL. The primary benefit of CNN over its forerunners is that it identifies the relevant features automatically, without human supervision, which has made it the most popular approach. We have therefore covered a lot of ground with CNN by outlining its essential parts. The most prevalent CNN architectures, starting with the AlexNet network and concluding with the high-resolution network (HRNet), have also been explored at length [19,20].
This review's main objective is to draw attention to the elements of CNN that are most important, making it simple for researchers and students to comprehend CNN completely after reading just one review paper. Additionally, in order to encourage CNN research, we want readers to understand more about current developments in the field. By offering a more precise view of the field, we enable researchers to select the most promising route of study to follow.
CNNs learn as they are trained on images; therefore, the features they extract from
images are not pre-learned.
Automatic feature extraction is largely responsible for the remarkable success of
deep learning models in computer vision. Complex models are required for deep CNN
architecture. More precision from them calls for larger image databases. For computer
vision tasks like object categorization, detection, tracking, and recognition, CNNs need
access to huge labeled datasets.
Object identification, which has captivated researchers for much of this decade, has
significantly benefited from the application of deep learning techniques. Object identifi-
cation and tracking play a crucial role in video surveillance, making it one of the most
difficult but vital aspects of security systems. It keeps an eye on people in public places to
spot any signs of unusual or suspicious conduct.
The general contribution of this study is summarized as follows:
• This review provides an in-depth analysis of CNN's most important features.
• This review provides an in-depth analysis of CNN's networks and algorithms.
• In this paper, I have compiled the current deep learning-based object detection methods found in the most recent academic literature.
• To help pick the best object detection method for a given application or dataset, I reviewed and compared several popular options.
• This article focuses on CNN's models and processes.
• We compiled a list of CNN resources to help developers and academics learn more about how to use CNN.
• We explain CNN, the most well-known deep learning algorithm, in depth by outlining the ideas, theories, and cutting-edge architectures.

Survey Methodology
I have analyzed the key research articles published in the field between 2017 and 2022,
with a focus on those from 2017, 2018, and 2019, along with a few from 2021 and 2022. The
primary emphasis was on publications from the most prestigious publishers, including
IEEE, Elsevier, MDPI, ACM, and Springer. Several papers from ArXiv have been chosen.
I have examined over 60 papers on a variety of DL-related topics. There are 14 papers
from 2017, 12 papers from 2018, 19 papers from 2019, and 15 papers from all other years
(2020–2022). This shows that the focus of this review was on the most recent publications in
DL and CNN. The selected publications were analyzed and evaluated in order to perform
the following:
(1) List and describe the DL and CNN techniques and network types;
(2) Present the problems of CNN and provide alternative solutions;
(3) List and explain CNN architectures;

(4) Evaluate the applications of CNN.


“Deep Learning”, “Machine Learning”, “Convolution Neural Network”, “Convolution
Neural Network” and “Architectures”, “Convolution Neural Network” and “detection” or
“classification” or “segmentation” and “Convolution Neural Network” and “Overfitting”
are the most common search terms for this review.

2. Convolutional Neural Network (CNN or ConvNet)


Convolutional neural networks (CNNs) are artificial intelligence systems based on
multi-layer neural networks that can identify, recognize, and classify objects as well as
detect and segment objects in images. In fact, CNN or ConvNet is a popular discriminative deep learning architecture that learns directly from the input object without the need for manual feature extraction [15–17].
This network is frequently used in visual identification, medical image analysis,
image segmentation, NLP, and many other applications since it is specifically designed
to deal with a range of 2D shapes [15–17]. It is more effective than a regular network
since it can automatically identify key elements from the input without the need for
human participation.

2.1. CNN Fundamentals


Understanding the various CNN components and their applications is critical to comprehending the advancements in CNN architecture. Figure 2 displays several CNN parts.

Figure 2. The CNN Components.

CNN Layers
A CNN is typically composed of four types of layers:
• Convolutional;
• Pooling;
• Function of Activation;
• Fully Connected.

2.1.1. Input Image


The building blocks of a computer image are called pixels. They are the binary representation of the visual data. Pixels with values from 0 to 255 are sequentially organized in a matrix-like arrangement in the digital image's layout. The brightness and hue of each pixel are specified by its pixel value [15–17].
When first viewing an image, the human brain assimilates a tremendous amount of information.
The CNN layers are trained to first recognize more basic patterns, such as lines and
curves, before moving on to more intricate patterns, such as faces and objects. As a result,
it is possible to assert that using a CNN could provide computers with vision [15–17].

2.1.2. Convolutional Layer


The convolutional layer is a crucial part of CNN’s overall structure. It is a set of
filters—or kernels—applied to the data before it is used. Each kernel’s width, height, and
weight are used to extract characteristics from the input data. Weights in the kernel are first
assigned at random but gradually become more informed by the training data.
In other words, the feature map is made by combining the input image (which is represented by N-dimensional matrices) with these filters [15–17].
A kernel is a set of discrete values or integers, and each of these values is called a kernel weight. The initial kernel weights for a CNN are a set of numbers picked at random, although the weights can also be initialized in various other ways. In turn, the kernel learns to extract meaningful features because these weights are tweaked during the training process.
The kernel trick enables a model to operate in a high-dimensional, implicit feature space without calculating the coordinates of the data in that space; instead, it computes the inner products between the images of all data pairs in the feature space. By applying the kernel trick to a linear model, the model can be transformed into a non-linear one.
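A minimal Python sketch of this idea, assuming a degree-2 polynomial kernel and illustrative 2D vectors (neither is from the original text), shows that the kernel evaluates an inner product in feature space without ever constructing the feature vectors:

```python
import numpy as np

def poly_kernel(x: np.ndarray, y: np.ndarray) -> float:
    """Degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def explicit_features(x: np.ndarray) -> np.ndarray:
    """Explicit feature map for the same kernel (2D input):
    phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
# Both values agree: the kernel computes the feature-space inner product
# without building phi(x) or phi(y).
print(poly_kernel(x, y))                                          # 121.0
print(float(np.dot(explicit_features(x), explicit_features(y))))  # 121.0
```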
The CNN input format is first provided before the convolutional process begins. The
classic neural network takes in data in a vector format, whereas the CNN takes in a multi-
channeled image. While an RGB image contains three color “channels,” a grayscale image
has just one.
Examine this 4 × 4 grayscale image with a 2 × 2 random weight-initialized kernel
to learn about convolutions in action. The kernel will first pan horizontally and vertically
over the full picture. The dot product between the input picture and the kernel is also
computed in parallel; this is accomplished by multiplying the corresponding values and
adding the results to get a single scalar value. Thereafter, the procedure is repeated until no
more sliding is feasible [15–17].
Given the main image size (K × K) and the filter size (L × L), the output matrix size is based on the equation

(K − L + 1) × (K − L + 1). (1)

Here, 4 − 2 + 1 = 3, so the output is 3 × 3.


In fact, the values of the dot product indicate the feature map of the output. Figure 3
visually represents the primary calculations performed at each stage. In this diagram, the
smaller square (2 × 2) represents the kernel, while the larger square (4 × 4) represents the
input picture. A product is then presented as a number after multiplying by both, and this
sum provides an input value for the output feature map [15–17].
Figure 3. A visual representation of the primary calculations.
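The sliding dot product described above can be written as a minimal Python sketch (the image and kernel values here are illustrative, not taken from the figure):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid (no padding, stride 1) convolution: slide the kernel over the
    image and take the dot product at every position, as in Figure 3."""
    K, L = image.shape[0], kernel.shape[0]
    out = K - L + 1                       # Equation (1): output size
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = image[i:i + L, j:j + L]
            feature_map[i, j] = np.sum(window * kernel)  # single scalar value
    return feature_map

# 4 x 4 image and 2 x 2 kernel (illustrative values)
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(convolve2d(image, kernel))          # 3 x 3 feature map, since 4 - 2 + 1 = 3
```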

However, in the preceding example, the kernel is given a stride of 1 (designating the desired step size for vertical or horizontal movement), and the input image is not padded. Indeed, you are free to substitute a different stride value if you so choose. An additional benefit of increasing the stride value is a decrease in the dimensionality of the resulting feature map [15–17].
The border region of the supplied picture, however, is greatly influenced by padding: without it, the features at the borders are captured far less often than those at the center.
Padding makes the input picture bigger, which also makes the size of the feature map bigger.
Each filter could represent a feature. The filter does not activate when it moves over an image region that does not match its feature. CNN employs this method to discover the most effective object-description filters.
Figure 4 demonstrates how the matrix may be configured to find picture edges. Due
to the fact that they behave like the conventional filters used in image processing methods,
these matrices are also known as filters.

Figure 4. Effects of different convolution matrices [21].

In CNN, however, these filters are merely initialized before training; the training process then shapes them into filters better suited to the job at hand.
Weight Sharing: Since the entire set of weights in a CNN acts on each and every pixel of the input matrix, there are no individually assigned weights between any two neurons in nearby layers. Learning a single group of weights for the entire input significantly reduces the necessary training time and various costs, because additional weights for each neuron do not need to be learned.
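A minimal Python sketch, assuming an illustrative 28 × 28 input, 100 neurons/filters, and 3 × 3 kernels (all hypothetical sizes), shows how much weight sharing reduces the parameter count compared with a fully connected layer:

```python
# Hypothetical sizes for illustration: a 28 x 28 grayscale input.
height, width = 28, 28
n_inputs = height * width            # 784 input pixels

# Fully connected: every one of (say) 100 hidden neurons has its own
# weight for every pixel, plus a bias.
fc_params = (n_inputs + 1) * 100     # 78,500 parameters

# Convolutional with weight sharing: 100 filters of size 3 x 3 reuse
# the same 9 weights (plus a bias) at every image position.
conv_params = (3 * 3 + 1) * 100      # 1,000 parameters

print(fc_params, conv_params)        # 78500 1000
```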
Stride: In fact, CNN offers additional options that provide several opportunities to further narrow the settings while also reducing some of the undesirable impacts. One of these options is called stride. In the aforementioned scenario, the next-layer node is assumed to have numerous overlaps with its neighbors based only on an examination of the areas. We may modify the overlap by changing the stride. A unique 6 × 6 image is shown in Figure 5. Since the filter can only be moved in one-node increments, the maximum output size we can achieve is 4 × 4. As can be seen in Figure 5, there is an overlap between the output of the three left matrices (and the three middle ones together and the three right ones also). However, if we walk, counting each step as 2, the total will be 3 times 3. In other words, the total output size and total overlap will be reduced [5,12,16].

Figure 5. Stride 1, the filter windows move only one time for each connection.

Equation (2) formalizes this, resulting in the output size O, as shown in Figure 6, given the image's N × N dimensions and F × F filter size:

O = 1 + (N − F)/S (2)

where N is the input size, F is the filter size, and S is the stride size.

Figure 6. The effect of stride in the output.
Padding: One disadvantage of the convolution step is the possible loss of detail at the image's edges. They are only captured when the filter is moved over them, so they are almost never seen. One easy and practical solution is to use zero padding. You may also manage the output size with the aid of zero padding.
In Figure 6, for instance, the output will be 4 × 4 (which decreases from a 6 × 6 input) with N = 6, F = 3, and stride 1.
However, by including one layer of zero-padding, the result will be 6 × 6, which is identical to the initial input (the calculation now uses an effective N of 8). The formula is modified for zero padding as Equation (3):

O = 1 + (N + 2P − F)/S (3)

where P is the number of layers of zero-padding (e.g., P = 1 in Figure 7). We can prevent the network output size from decreasing with depth by using this padding concept. Consequently, any number of deep convolutional layers is feasible [21].

Figure 7. Zero-padding.
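Equations (2) and (3) can be collected into one small helper; a minimal Python sketch (assuming stride values that divide evenly) reproduces the numbers discussed above:

```python
def conv_output_size(n: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output size of a convolution: O = 1 + (N + 2P - F) / S (Equations (2), (3))."""
    return 1 + (n + 2 * p - f) // s

print(conv_output_size(6, 3, s=1, p=0))  # 4: a 6 x 6 input shrinks to 4 x 4
print(conv_output_size(6, 3, s=1, p=1))  # 6: one layer of zero-padding preserves the size
print(conv_output_size(4, 2, s=1, p=0))  # 3: the 4 x 4 example with a 2 x 2 kernel (Equation (1))
```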
Feature of CNNs: Due to the aforementioned weight sharing, the model is also invariant under translational changes: a learned filter can detect its feature in any location of the input. If beginning with random values for the filters improves performance, then the filters will learn to detect the edges (as in Figure 3). It is crucial to note that a shared weight is a bad idea when evaluating an input's spatial significance.
2.1.3. Pooling
The pooling layer, also known as the down-sampling layer, is used to decrease the feature maps' dimensionality while retaining the most important data. A filter applies the pooling operation (max, min, avg) to the input data by sliding over it in the pooling layer. In the literature, maximum pooling is most frequently utilized.
The essential part of pooling, which is utilized to reduce the complexity of upper layers, is down-sampling. In terms of image processing, it may be comparable to reducing the resolution. The filter count is unaffected by pooling. Max-pooling is one of the most often used pooling methods. The picture is divided into rectangular subregions, and only the greatest value discovered inside each subregion is returned. One of the most prevalent max-pooling sizes is 2 × 2.
As shown in Figure 8, when pooling is used on the 2-by-2 blocks in the top-left corner, attention is diverted to the top-right corner, and two steps are moved. As a result, stride 2 is used for pooling. It is possible to use stride 1, which is unusual, to prevent downsampling. Keep in mind that downsampling does not preserve the position of the data.

Figure 8. Pooling layer.
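A minimal Python sketch of 2 × 2 max-pooling with stride 2, using arbitrary input values, illustrates the operation shown in Figure 8:

```python
import numpy as np

def max_pool2d(x: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """2 x 2 max-pooling with stride 2: keep the largest value in each block."""
    out = 1 + (x.shape[0] - size) // stride
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            block = x[i * stride:i * stride + size, j * stride:j * stride + size]
            pooled[i, j] = block.max()
    return pooled

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool2d(x))  # [[6. 8.] [3. 4.]] -- a 4 x 4 map reduced to 2 x 2
```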
At various pooling levels, various pooling techniques may be applied. Global average pooling (GAP), global max pooling, average pooling, min pooling, and gated pooling are some of these methods. Figure 8 depicts each of these three pooling techniques [22,23].
The primary problem with the pooling layer is that it helps CNN determine whether a feature is present in an input image but not precisely where that feature is located. Therefore, there are times when CNN's total ratings take a dip, because the CNN model leaves out the necessary positional information.

2.1.4. Non-Linearity (Function of Activation)
The layer of non-linearity follows convolution. Non-linearity allows the generated output to be changed or terminated. This layer is used to restrict or oversaturate the output.
Every type of activation function in every type of neural network serves the essential function of mapping input to output. The input value is calculated as the weighted sum of the neuron inputs and the bias (if present). This indicates that the activation function determines whether or not to fire a neuron in response to a certain input by generating the matching output.
In the CNN architecture, non-linear activation layers are used after all layers with weights (also known as learnable layers, such as FC layers and convolutional layers).
The mapping of input to output will be non-linear because of the activation layers' non-linear performance, and these layers also enable the CNN to learn extremely complex things [22–24].
Additionally, the capacity to differentiate is a crucial requirement for the activation function since it enables the use of error backpropagation to train the network.
The most popular activation functions in CNNs and other deep neural networks are the ones listed below:
Sigmoid: This activation function only allows output values between 0 and 1 and accepts real numbers as input [25–28].
Tanh: It is comparable to the sigmoid function in that it accepts real numbers as input, but its output range is only between minus one and one.
ReLU is the most popular function in the CNN context. All of the input values are converted to the positive range. ReLU's primary benefit over other algorithms is the time and resources it saves when used.
For a long time, the Tanh and sigmoid non-linearities were the most prevalent. Non-linearities come in many forms, and they are shown in Figure 9. For these reasons, however, the rectified linear unit (ReLU) has seen a surge in popularity in recent years.

Figure 9. Function of Activation.

Function and gradient definitions using ReLU are simpler.
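A minimal Python sketch of the activation functions discussed in this subsection (leaky ReLU, covered below, is included for completeness; the input values are illustrative):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Like sigmoid, but the output range is (-1, 1).
    return np.tanh(x)

def relu(x):
    # Keeps positive inputs and maps every negative input to zero.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU (discussed below): negative inputs are scaled by a
    # small slope instead of being zeroed out.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")
```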
Saturated functions, such as the sigmoid and the tanh, have issues with backpropagation. This phenomenon, known as the "vanishing gradient," occurs when the gradient signal gradually decreases as the depth of the neural network architecture grows. This happens because the gradient of these functions is essentially zero everywhere except near the center. Nonetheless, the ReLU has a constant gradient for positive input. While the function is not differentiable at zero, this can be ignored during implementation.
Third, the ReLU generates a sparser representation because an exact zero is produced for negative inputs. For sigmoid and tanh, the gradient outcomes are never zero, which may be counterproductive during training [22,23,26–29].
When using ReLU, a few significant problems may occasionally arise. Consider a method for error backpropagation with a greater gradient flowing through it, for instance. The weights will be updated by passing this gradient through the ReLU function in a way that ensures that the neuron will not be stimulated again. This problem is known as "Dying ReLU".
In order to address these problems, there are some ReLU substitutes. These are some of them, as discussed below.
Leaky ReLU: This activation function makes sure that the negative inputs are never disregarded, as opposed to the negative inputs being downscaled by ReLU. It is used to address the Dying ReLU issue.

2.1.5. Fully Connected Layer
Neurons are organized into groups in the fully-connected layer that are reminiscent of those seen in traditional neural networks. As shown in Figure 10, any node in a layer that is entirely linked is, therefore, directly connected to every node in the layer above and below it. Figure 10 shows that every node in the pooling layer's most recent frames is connected as a vector from the fully-connected layer to the top layer. These are the most often utilized CNN parameters within these layers; however, they need a lot of training time [22–24].
The biggest drawback of a fully connected layer is the large number of parameters that necessitate laborious calculation in training samples. Consequently, we try to minimize the number of connections and nodes. The eliminated nodes and connections can be compensated for using the dropout approach. LeNet and AlexNet, for example, developed a vast and deep network while preserving constant computational complexity [22,23].

Figure 10. Fully-connected layer.


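A minimal Python sketch, with illustrative feature-map sizes, shows how pooled feature maps are flattened into a vector and fed to a fully connected layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the last pooling layer produced 8 feature maps of size 4 x 4.
feature_maps = rng.standard_normal((8, 4, 4))

# Flattening stacks the feature maps into one input vector for the FC layer.
x = feature_maps.reshape(-1)              # 128 values

# A fully connected layer: every output neuron sees every input value.
W = rng.standard_normal((10, x.size))     # 10 output neurons
b = np.zeros(10)
output = W @ x + b                        # one weighted sum per neuron
print(output.shape)                       # (10,)
```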
The convolution, which is the core element of the CNN network, is exposed when the non-linearity and pooling layer is added. The three points that are most commonly utilized in architecture are as follows:
– To rephrase, in a completely connected layer, all of the neurons communicate with their counterparts in the layer below. It is a classifier used by CNN.
– Being a feed-forward ANN, it performs similarly to a regular multi-layer perceptron network. Input to the FC layer comes from the last pooling or convolutional layer.
– This is a vector input created by increasing the thickness of the feature maps [24].
Figure 10 displays that the FC layer's output is consistent with the final CNN output.
The preceding part discussed the various types of layers used in the CNN design; this section will focus on loss functions.
Furthermore, the final classification is achieved by employing the output layer, the very last layer of the CNN architecture. A few loss functions are used in the CNN model's output layer to compute the predicted error across the training data. As a result of this mistake, the disparity between actual and predicted output is highlighted. Then, it will be improved using the CNN learning approach.
The loss function, however, takes advantage of two inputs to pinpoint the source of the mistake. For CNN, the first parameter is the forecast or estimated output. The second input is the desired output or label. There are many different kinds of loss functions used for different sorts of problems [25].
Below is a basic explanation of the many kinds of loss functions:
Training: A training dataset made up of a collection of images and labels (classes, bounding boxes, and masks) is used to train a CNN model.
Backpropagation is a CNN training procedure that measures an error value using the output value of the previous layer. Each neuron's weight in that layer is updated using the error value [26].
In order to measure an incorrect value and revise the old weights, fresh weights are employed, as shown in Figure 11.

Figure 11. Forward and backpropagation in hidden CNN layers.

Until it reaches the first layer, the algorithm repeats the procedure.
All inputs, including the bias unit, are summarized by the activation unit; then, the activation function is used to compute the result. The network will then calculate the cost function and send the error back to update the weights until the cost is minimized.
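A minimal Python sketch of this idea, using a single linear neuron and a squared-error cost (a deliberately tiny stand-in for a full CNN; the data are synthetic), shows the forward pass, the error, and the weight update repeating until the cost shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = 2 * x with one linear neuron and squared error.
x = rng.standard_normal(100)
y = 2.0 * x

w, lr = 0.0, 0.1
for epoch in range(100):
    y_pred = w * x                        # forward pass
    error = y_pred - y                    # disparity between prediction and label
    grad = np.mean(error * x)             # dL/dw for L = mean(error^2) / 2
    w -= lr * grad                        # update the weight with the error value
print(round(w, 3))                        # approximately 2.0 after training
```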
3. Regularization of CNN
When trying to create well-behaved generalizations for CNN models, over-fitting is the key obstacle. Over-fitting describes a situation in which a model does well on training data but poorly on test data (data it has never seen before), as will be shown in the next section. When the model does not pick up enough information from the training data, it is said to be under-fitted [26–28].
A model is considered to be "just fit" if it produces satisfactory results on both the training and testing data. These three types are shown in Figure 12.
Multiple intuitive conceptions are used to facilitate regularization and prevent over-fitting; more details on over-fitting and under-fitting are provided below.

Figure 12. Regularization to CNN.

Using dropout as a generalization strategy is popular. Neurons are removed at random during each training session. In addition to forcing the model to acquire several independent features, this method also makes feature selection equally weighted across the whole neural network.

A dropped neuron will not take part in either backward or forward propagation
during training. Testing makes use of the full-scale network to make predictions [29,30].
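A minimal Python sketch of (inverted) dropout, with an illustrative dropout rate of 0.5, shows how neurons are silenced during training while testing uses the full network:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations: np.ndarray, rate: float = 0.5, training: bool = True):
    """Inverted dropout: randomly silence neurons during training only."""
    if not training:
        return activations               # testing uses the full-scale network
    mask = rng.random(activations.shape) >= rate
    # Scaling by 1 / (1 - rate) keeps the expected activation unchanged.
    return activations * mask / (1.0 - rate)

h = np.ones(10)
print(dropout(h, rate=0.5))              # roughly half the units are zeroed
print(dropout(h, training=False))        # unchanged at test time
```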
The drop-weight method is quite similar to the dropout strategy. Drop-weight training
differs from dropout in that only the weights (connections) between neurons are eliminated
after each training iteration.
Over-fitting may easily be prevented when the model is trained using an enormous amount of data, and data augmentation supplies such data artificially: a few methods may be utilized to increase the size of the training dataset. Data augmentation methods are discussed in further depth later.
Through batch normalization, the efficacy of the final activations may be ensured [31,32].
The one-unit Gaussian distribution describes this execution well. When normaliz-
ing the output at each layer, we will first remove the mean and then divide it by the
standard deviation.
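A minimal Python sketch of this normalization step (omitting the learnable scale and shift parameters of a full batch-normalization layer) follows:

```python
import numpy as np

def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize a batch of layer outputs: subtract the batch mean, then
    divide by the batch standard deviation (per feature)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / (std + eps)

batch = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
normalized = batch_norm(batch)
print(normalized.mean(axis=0))           # ~0 per feature
print(normalized.std(axis=0))            # ~1 per feature
```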
Despite being conceptualized as a pre-processing activity at each tier of the network,
this may be differentiated and integrated with other networks.
It is also employed to reduce “internal covariance shift” in the activation layers. The
activation distribution’s variability defines the change in internal covariance at each layer.
The continual update of weights during training, which might occur if training data
samples come from a wide range of sources, amplifies this change (for example, day and
night images). However, the training period will lengthen since the model needs more
time for convergence. For this reason, we add a layer to the CNN design that mimics batch
normalization to help us deal with this issue [33–35].
The following are some benefits of using batch normalization:
• It stops the vanishing gradient issue before it starts.
• It has the ability to effectively manage bad weight initialization.
• It significantly reduces the amount of time needed for network convergence (which will be very helpful for large datasets).
• It reduces the dependence of training on various hyperparameters.
• Over-fitting is less likely because batch normalization provides a slight regularization effect [35].

4. Popular CNN Architecture


R-CNN [36–38] was the first ground-breaking model to use convolution neural net-
works (CNN). For each image that needs to be classified, the model creates 2000 region
proposals and resizes them to 227 × 227. R-CNN employs a region-of-interest (RoI) clas-
sifier based on a deep convolutional neural network (DCN) to perform region-specific
classification of input pictures. In addition, a convolutional neural network (CNN) is em-
ployed for feature extraction and model training, and then a support vector machine (SVM)
classifier is used for object categorization. This model moves at a snail’s pace. Eventually,
in 2015, Fast R-CNN was offered as a solution to the accuracy and speed issues [37,38]. RoI
extraction from feature maps is the focus of SPPNet and Fast R-CNN, an enhanced form of
R-CNN. It was discovered that this method outpaced the standard R-CNN framework by a
significant margin.
Faster R-CNN continues the trend by proposing region proposal networks (RPNs) for feature extraction and to eliminate storage costs [37,38]. Faster R-CNN is an enhanced variant of Fast R-CNN that achieves fully end-to-end training using the RPN, a regression-based network used in the process of producing RoIs.
Compared to earlier models, this one performs well in terms of accuracy and speed;
however, the ground truth and predicted bounding boxes are not aligned. To deal with
the issue of inaccuracy generated by the quantization process in the region of interest (RoI)
pooling layer, the authors introduce Mask R-CNN.
Mask R-CNN builds on top of the Faster R-CNN by including a mask prediction branch; this allows it to detect objects and predict their masks simultaneously. R-FCN swaps out the fully connected layers for position-sensitive score maps, which results in improved object detection.

Several CNN Architectures


CNN has many architectures, such as VGG, AlexNet, Xception, Inception, and
ResNet [37–49], which can be used in different application domains depending on how
well they can learn.
The convolutional neural network (CNN) is the most emblematic deep learning model.
It comprises the input, convolution, pooling, and full connection layers [37–49]. The
majority of existing networks are based on a series of CNN enhancements. LeCun first
presented the LeNet network for handwritten digit identification in 1998 and extended
CNN to the field of picture recognition. LeNet is an early neural network with only three
complete connection layers, two convolution layers, and two pooling layers. Due to the tiny
size of the model, it is unable to adequately fit other data, which hinders the advancement
of computer vision areas.
In 2012, Krizhevsky proposed AlexNet, resulting in a significant leap in computer vision. Five convolutional layers and three fully connected layers for learning features are present in AlexNet. After the first, second, and fifth convolutional layers, it has max-pooling. It has 630M connections, 60M parameters, and 650K neurons altogether. AlexNet was the first to demonstrate the use of deep learning for computer vision applications [44].
LeNet and AlexNet used a single convolutional layer with big kernels of size 7 × 7
and 11 × 11, while the VGG-16 was built with stacks of convolutional layers with smaller
kernels of size 3 × 3.
By adding more non-linear rectification layers, a stack of convolutional layers with
tiny filter sizes creates a more discriminatory decision function [39,44].
ZFNet [44] made modest modifications to the AlexNet network in 2013, mainly offer-
ing a new visualization technique. In the past, CNN was a black box; no theory or method-
ology was used to explain the network’s optimization and improvement process. Using
deconvolution, ZFNet visualizes the intermediate layer of features [44]. Simonyan [40]
introduced the VGG model in 2014, which investigates the effect of network depth on
accuracy. Unlike AlexNet, VGG uses many convolution layers of size 3 × 3 to replace
large-scale filters. The framework of the model is simple and effective, and it can be easily
ported to other networks; nevertheless, the parameters are too large and simple to adjust.
Researchers have effectively utilized VGG in numerous domains [31–33]. GoogLeNet [44]
is a network that not only investigates the impact of depth but also considers the breadth
of the network. The network eliminates the final full connection layer and intelligently
implements the 1 × 1 convolution operation in order to lower the dimension and prevent
the over-fitting issue caused by excessively large network parameters. The year 2015 saw
the proposal of the ResNet residual network and residual connection by He et al. [44]. As
a result, the depth of the network can reach 152 layers. The network employs a small
number of pooling layers and a high number of downsampling, which increases the for-
ward propagation efficiency of the network and obtains the greatest picture recognition
effect at that time, demonstrating the viability of residual connection [46,47]. Liu et al.
proposed DenseNet [39,44,45] in 2017. Using the ResNet network’s strategy for increasing
the depth and width of the network can also guarantee the model's precision. DenseNet created a densely connected network: all layers' feature maps are concatenated (dimensionally coupled) with one another. DenseNet may efficiently minimize the number of parameters and increase the reusability of features across many convolutional layers [40–44]. Table 1 shows a comparison between different popular CNN architectures. In 2018, boosting deep convolutional neural networks (BoostCNN) was introduced; it uses a deep learning architecture with a least-squares-based objective function and adds boosting weights to learn the model. BoostCNN applies various network architectures within the proposed boosting framework, choosing the most effective one after each iteration.

Table 1. A comparison between different popular CNN architectures.

LeNet-5 [1998]
Layers: 7 (5 convolution + 2 FC).
Main contribution: first popular CNN architecture.
Highlights: rapidly deployable and effective at resolving small-scale image recognition issues.
Strengths: utilized spatial correlation to decrease computation and parameter count; automated discovery of feature hierarchy structures.
Gaps: inadequate scaling to varied image classes; filter sizes that are too large; weak feature extraction.

AlexNet [2012]
Layers: 8 (5 convolution + 3 fully connected).
Main contribution: more depth and breadth than LeNet; employs ReLU, dropout, and overlap pooling; trained on NVIDIA GTX 580 GPUs.
Highlights: comparable to LeNet-5, except it is more complex, has more filters per layer, and employs stacked convolutional layers; makes use of dropout and ReLU.
Strengths: low-, middle-, and high-level feature extraction using large filters (11 × 11 and 5 × 5) in the early layers and tiny filters (3 × 3) in the final layers; implemented regularization in CNNs; commenced parallel usage of GPUs as accelerators to address difficult architectures.
Gaps: dormant neurons in the first and second layers; aliasing artifacts in learned feature maps as a result of large filter sizes.

ZfNet [2014]
Layers: 8.
Main contribution: conceptualization and visualization of the middle layers.
Strengths: illustrated parameter tweaking by displaying the output of intermediary layers; diminished the filter size and stride in the initial two layers of AlexNet.
Gaps: further processing of information is necessary for visualization.

VGG [2014]
Layers: 16–19 (13–16 convolution + 3 FC).
Main contribution: homogeneous structure; small kernel size; enhanced depth with reduced filter size.
Highlights: model accuracy is improved by employing small 3 × 3 convolutional filters in each layer.
Strengths: introduced the concept of an effective receptive field; presented the concept of a simple and homogeneous topology.
Gaps: implementation of computationally costly fully connected layers.

GoogLeNet [2015]
Layers: 22 convolution layers, 9 Inception modules.
Main contribution: presented the block concept; separated the transform and merge notions; increased depth; different filter sizes and the concatenation concept.
Highlights: a deeper and broader architecture with various receptive field sizes and a number of extremely small convolutions.
Strengths: introduced the concept of applying multiscale filters within layers; introduced the concept of divide, transform, and merge; reduced the number of parameters by the use of a bottleneck layer, global average pooling at the final layer, and sparse connections; used auxiliary classifiers to enhance the convergence rate.
Gaps: due to diverse topologies, parameter modification is arduous and time-consuming; useful information may be lost due to a representational bottleneck.

Inception-V3 [2015]
Layers: 48 deep (42 convolution layers, 10 Inception modules).
Main contribution: resolves the representational bottleneck issue; changes large-size filters to tiny-size filters; employs a tiny filter size and improved feature representation.
Highlights: enhances the efficiency of the network; the application of batch normalization expedites the training process; Inception building elements are employed effectively to go deeper.
Strengths: utilized asymmetric filters and a bottleneck layer to decrease the computational expense of deep designs.
Gaps: complexity of the architectural design; absence of uniformity.

ResNet [2016]
Layers: 50 in ResNet-50, 101 in ResNet-101, 152 in ResNet-152.
Main contribution: a unique design that features "skip connections" and extensive batch normalization.
Highlights: identity mapping based on information links; long-term retention of knowledge; overfitting-resistant due to symmetry-mapping-based skip connections.
Strengths: reduces the error rate of deeper networks; introduces the concept of residual learning; mitigates the vanishing gradient problem.
Gaps: a slightly intricate structure; degrades feature maps in feed-forwarding; excessive adaptation of hyperparameters for a specific task as a result of stacking identical modules.

DenseNet-121 [2017]
Layers: 117 convolution layers, 3 transition layers, and 1 classification layer.
Main contribution: information transmission between blocks of layers; layers that are interconnected.
Highlights: all layers are intimately connected to one another in a feed-forward fashion; mitigates the problem of vanishing gradients and requires few parameters.
Strengths: added depth in the cross-layer dimension; ensures maximum data flow across network layers; prevents relearning of redundant feature maps; both low-level and high-level features are available to the decision layers.
Gaps: significant rise in parameters as a result of the increase in the number of feature maps per layer.

In 2019, EfficientNet used the compound coefficient method to scale up models efficiently and effectively. Rather than arbitrarily enlarging a single dimension, compound scaling uses a fixed set of coefficients to scale a network's depth, width, and input resolution uniformly. The authors combined this scaling method with AutoML to create a family of seven models of varying sizes that both outperformed and were more efficient than the state-of-the-art convolutional neural networks.
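The scaling rule itself is simple enough to sketch. The snippet below uses the coefficient values reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15, chosen so that α·β²·γ² ≈ 2); the baseline depth, width, and resolution values here are illustrative rather than the exact EfficientNet-B0 configuration:

```python
# Sketch of EfficientNet-style compound scaling: depth, width, and input
# resolution grow together, governed by a single compound coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # coefficients from the EfficientNet paper
assert abs(ALPHA * BETA**2 * GAMMA**2 - 2.0) < 0.1  # FLOPs roughly double per unit of phi

def compound_scale(base_depth: int, base_width: int, base_resolution: int, phi: float):
    """Uniformly scale a baseline network's depth, width, and input resolution."""
    depth = round(base_depth * ALPHA**phi)                # number of layers
    width = round(base_width * BETA**phi)                 # number of channels
    resolution = round(base_resolution * GAMMA**phi)      # input image size
    return depth, width, resolution

# Illustrative baseline (not the published EfficientNet-B0 configuration):
for phi in range(4):
    print(f"phi={phi}:", compound_scale(18, 32, 224, phi))
```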
CNNs have an advantage over other classification algorithms, such as SVM, k-NN, and Random Forest, because they learn the most important features for representing the objects in an image.

5. Different Types of CNN Architectures


5.1. Classification
Image classification is crucial to the processing of multimedia information in the Internet of Things (IoT). Given an input image, the image classification procedure produces an output class label (for example, whether or not a disease is present in a medical scan).
Image classification and recognition technology have found widespread usage in artificial
intelligence applications, particularly in the areas of picture information retrieval, real-time
target tracking, and medical image analysis. Recent years have seen a rise in interest in
deep learning [50,51].
Popular CNNs for classification tasks include VGG-16, ResNet, and Inception [44].
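As a minimal sketch of such a classifier (written in PyTorch, with hypothetical layer sizes rather than any of the published architectures above), the following network maps a batch of RGB images to class scores:

```python
import torch
import torch.nn as nn

# Minimal CNN image classifier: a convolution/pooling feature extractor
# followed by a fully connected classification head.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),            # 10 hypothetical classes
)

x = torch.randn(4, 3, 32, 32)             # a batch of four 32x32 RGB images
logits = model(x)
print(logits.shape)                        # torch.Size([4, 10])
```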

5.2. Detection
Object detection is a difficult computer vision task that involves predicting both the location of the objects in an image and the type of each object found. Beginners may find it difficult to differentiate between the various related computer vision tasks.
For instance, the distinctions between object localization and object detection might be
difficult to understand, even though all three tasks may be collectively referred to as object
recognition. Image categorization, by contrast, is straightforward.
Image classification involves assigning a category label to an image, whereas ob-
ject localization involves drawing a bounding box around one or more objects in an
image [52–54].
The more challenging task of object detection involves doing both of these things at once: drawing a bounding box around each object of interest in the image and then labeling each object with its class. Collectively, these real-world problems are referred to as "object recognition".
The process of object detection can be broken down into two distinct approaches, as shown in Figure 13:
Figure 13. The process of object detection.
• One-step object detectors.
• Two-stage object detectors.

Object detectors built on two-stage deep learning pipelines have two distinct phases: (1) proposing regions and then (2) classifying the objects within those regions [37,38,43,44,55]. The region proposal stage entails proposing a number of Regions of Interest (ROIs) in an input image that have a high probability of containing items of interest. The second phase involves selecting promising ROIs (while discarding less promising ones) and classifying the items contained within them [53]. R-CNN, Fast R-CNN, and Faster R-CNN are all well-known examples of two-stage detectors. Single-stage object detectors, on the other hand, build bounding boxes and classify objects all in the same stage using a single feed-forward neural network. Although these detectors are quicker than their two-stage counterparts, they are often less precise. YOLO, SSD, EfficientDet, and RetinaNet are a few of the most well-known examples of single-stage detectors. The distinction between these two families of detectors is shown in Figure 13.

As one of the earliest deep learning-based object detectors, R-CNN implemented a two-stage detection process that included a highly effective selective search method for ROI proposals. A few issues with the R-CNN model, including slow inference speed and inaccurate predictions, were addressed with the introduction of Fast R-CNN. The Fast R-CNN model uses a convolutional neural network (CNN) to process an image and produce a feature map and ROI projections; using ROI pooling, these regions of interest are mapped onto the feature map for prediction. Fast R-CNN is thus an alternative to R-CNN that, instead of feeding each region of interest (ROI) to the CNN layers separately, processes the entire image once to create feature maps for object detection [37,38]. While Faster R-CNN uses a similar strategy, it adds a separate region proposal network to feed the ROIs to the ROI pooling layer and the feature map, which are subsequently reshaped and used for prediction [37].

Due to their ability to make predictions about an input with just one pass, single-stage object detectors like YOLO (You Only Look Once) are quicker than their two-stage counterparts. The first YOLO variant, YOLOv1, learned to quickly detect objects by learning generalizable representations of them [16]. While YOLOv1 used a fully
connected layer to generate bounding boxes, YOLOv2 introduced batch normalization
and a high-resolution classifier in 2016 [43,45]. YOLOv3 was proposed in 2018 [43,45]
using a 53-layer backbone-based network that predicted overlapping bounding boxes and
smaller objects using an independent logistic classifier and binary cross-entropy loss. In
contrast to YOLO models, which produce feature maps by constructing grids within an
image, SSD models were offered as a superior choice to execute inference on videos and
real-time applications since they share features between the classification and localization
task on the complete picture. Although YOLO models are quicker to run, they are not as
accurate as SSD models [45]. While YOLO and SSD models offer fast inference speeds, they
struggle with class imbalance when identifying tiny objects. The RetinaNet detector [20]
overcame this problem by using a dedicated network for classification and bounding box
regression during training and a focal loss function. Better methods for data augmentation
and regularization during training (‘bag of freebies’) and a post-processing module that
enables better mAP and faster inference (‘bag of specials’) were introduced in YOLOv4 [45].
YOLOv5 was subsequently proposed, further improving data augmentation and loss calculation; it also used self-learning bounding box anchors to adapt to a specific dataset. In 2021, YOLOR (You Only Learn One Representation) was presented; it employs a single network that encodes both implicit and explicit knowledge. With just a single model, YOLOR is capable of multitask learning across areas including object detection, multilabel image classification, and feature embedding.
Similarly, the YOLOX model was introduced in 2021; it employs a decoupled head method
that eliminates the need for anchors and permits the network to process classification and
regression independently. When compared to YOLOv4 and YOLOv5 models, YOLOX has
fewer parameters and faster inference [45].
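Whatever the detector family, the raw output is a large set of overlapping candidate boxes that must be pruned with non-maximum suppression (NMS). The following is a minimal sketch of IoU-based greedy NMS; the box format, scores, and threshold are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 160, 150)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too heavily
```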

5.3. Segmentation
As the name implies, this is the process of segmenting an image into various parts.
Each pixel in the image is assigned an object class during this process. Semantic segmentation and instance segmentation are the two main categories of image segmentation [55,56]. In instance segmentation, each individual object instance receives its own unique label, whereas, in semantic segmentation, all objects of the same type are tagged with a single class name.
Semantic segmentation has come a long way in the last decade thanks to the development of deep learning-based models, particularly fully convolutional networks (FCNs) [37] and their variants [38]. To learn robust features, FCNs leverage pre-existing deep neural networks. By replacing the fully connected layers with convolutional ones, an FCN converts popular classification models such as VGG (16-layer net) [31] and ResNet [44] into fully convolutional networks that produce spatial maps rather than classification scores. To generate dense per-pixel labeled outputs, these maps are upsampled using fractionally strided convolutions (a minimal sketch of this conversion follows Table 2). U-Net is another model used for rapid and accurate image segmentation based on a convolutional network architecture; it was developed at the University of Freiburg's Computer Science Department [44]. In the ISBI challenge for segmenting neuronal structures in electron microscopy stacks, U-Net outperformed the previous best method (a sliding-window convolutional network). The main drawback of the U-Net architecture is that it can slow down training in the middle layers of deeper networks, increasing the risk that they are skipped over. The main cause of this phenomenon is that gradients weaken as the network moves away from the output layer, where the training loss is calculated. Table 2 compares these different types of CNN-based models.
Table 2. A comparison between different types of CNN models.

Fast R-CNN [2015]
Type: object detection, two-stage framework.
Advantages: the CNN features are computed in a single pass, making the detection of objects 25 times faster than the R-CNN approach (an average of 20 s is required to analyze a picture).
Limitations: using an external candidate region generator slows down the detection procedure.

Faster R-CNN [2015]
Type: object detection, two-stage framework.
Advantages: the RPN approach enables near-real-time object detection, at around 0.12 s per image.
Limitations: despite the algorithm's effectiveness, it is too slow for applications requiring real-time operation, such as driverless vehicles.

Mask R-CNN [2017]
Type: object detection, two-stage framework.
Advantages: when segmenting the objects in an image, the localization of the objects becomes more exact.
Limitations: its execution time is longer than that of the Faster R-CNN approach; hence, it cannot be implemented in real-time applications.

YOLO [2015]
Type: object detection, one-stage framework.
Advantages: the efficiency of object localization enables its usage in real-time applications.
Limitations: the technique has trouble accurately detecting small objects.

SSD [2016]
Type: object detection, one-stage framework.
Advantages: balances the advantages of YOLO and Faster R-CNN, with high detection speed and a high object detection rate.
Limitations: object detection accuracy is less precise than that of the Fast R-CNN and Faster R-CNN algorithms.

FCN [2014]
Type: semantic segmentation.
Advantages: a fully convolutional design (without fully connected layers).
Limitations: poor precision of feature maps and significant GPU utilization.

UNet [2015]
Type: semantic segmentation.
Advantages: the structure has few parameters and is shaped roughly like the letter U; appropriate for object detection with limited medical image samples.
Limitations: it is difficult to establish uniform sub-sampling and up-sampling standards.
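To make the FCN idea from the discussion above concrete, the following sketch (with hypothetical channel counts, not the published FCN-8s configuration) replaces a classifier's fully connected head with a 1 × 1 convolution and a fractionally strided (transposed) convolution, so that the output is a per-pixel class score map:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21  # hypothetical label set, e.g., PASCAL VOC-style

# A toy backbone that downsamples the input by a factor of 8.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

# FCN head: a 1x1 convolution replaces the fully connected classifier, and a
# transposed (fractionally strided) convolution upsamples 8x back to the
# input resolution, yielding dense per-pixel predictions.
head = nn.Sequential(
    nn.Conv2d(256, NUM_CLASSES, kernel_size=1),
    nn.ConvTranspose2d(NUM_CLASSES, NUM_CLASSES, kernel_size=16, stride=8, padding=4),
)

x = torch.randn(1, 3, 224, 224)
out = head(backbone(x))
print(out.shape)  # torch.Size([1, 21, 224, 224]): one score map per class
```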

5.4. Popular Applications


Biometric detection: verifying identity via unique biological characteristics. Individuals can be uniquely identified through their biometric features, which include hand geometry, retina and iris patterns, and even DNA. The object detection method uses a matching template to make its determinations [57,58].
Surveillance: CCTV cameras and other surveillance equipment are used to monitor an area and record any suspicious activity; keeping track of potential criminals is a job for object detection.
Autonomous robotics: research on autonomous robots is a central problem in the field at the moment. The most widely used approach currently is the human–robot system, whose trusted vision component is grounded in computational behavior.
Medical imaging: object recognition applications include tumor detection in MRI scans and skin cancer screening.

6. Future Directions
Indeed, the performance of classification in terms of accuracy, misclassification rate,
precision, and recall is heavily influenced by the combination of convolutional layers,
the number of pooling layers, the number of filters, the filter size, the stride rate, and
the location of the pooling layer when designing a convolutional neural network. CNN
training necessitates the use of powerful and impressive hardware resources, such as GPUs.
Training and testing various combinations of parameters repeatedly requires a great deal of
time and high computing resources like GPUs in order to obtain a satisfactory result [59,60].
The choice of hyper-parameters has a substantial impact on CNN performance. The overall performance is sensitive to even a modest shift in the hyper-parameter settings. As a result, it is crucial to take into account the importance of appropriate parameter selection when designing optimization schemes.
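As a trivial illustration of why this search is expensive, the sketch below enumerates a small hyper-parameter grid; every name and value here is illustrative, and each combination would require a full train-and-evaluate cycle:

```python
from itertools import product

# Illustrative hyper-parameter grid for a CNN. Each combination requires a
# complete training and testing run, which is what makes the search costly.
grid = {
    "num_conv_layers": [2, 4, 8],
    "num_filters": [32, 64, 128],
    "filter_size": [3, 5],
    "stride": [1, 2],
    "pooling": ["max", "average"],
}

combos = list(product(*grid.values()))
print(f"{len(combos)} configurations to train and test")  # 3*3*2*2*2 = 72

for values in combos[:3]:  # in this sketch, only inspect the first few
    config = dict(zip(grid.keys(), values))
    # train_and_evaluate(config) would run here; omitted in this sketch
    print(config)
```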
The number of layers in CNNs has grown from a few (AlexNet) to hundreds, while the individual building blocks have become smaller and more effective (ResNet, ResNeXt, DenseNet). These networks have millions of parameters, so training them takes a lot of data and powerful GPUs. Therefore, scientists should take an interest in developing lightweight and compact networks in order to further reduce network redundancy.
Selecting the best detection network for a given application and embedded hardware means striking a balance between speed, memory usage, and accuracy. It is often preferable to train compact models with few parameters, even if this results in a decrease in detection accuracy; the loss can be remedied through the use of hint learning, knowledge distillation, and improved pre-training schemes.
Such enhancements allow CNNs to learn from data at varying depths and with varying structural modifications. Modern research has shown that CNN performance can be greatly improved when networks are designed at the level of blocks rather than individual layers.

7. Conclusions
I have provided an organized and thorough overview of deep learning technology
in this paper, which is regarded as a fundamental component of both data science and
artificial intelligence.
It begins with a history of artificial neural networks before moving on to more modern
deep learning methods and innovations in several fields.
The main techniques in this field are then examined, along with deep neural network
modeling in multiple dimensions.
For this, I have also provided a taxonomy that takes into account the various deep-
learning tasks and their many applications.
In this comprehensive review, I took into account both supervised learning using deep networks and unsupervised generative learning using deep networks.
I have also thought of hybrid learning, which may be used in a variety of real-world
contexts depending on the specifics of the problem at hand.
Finally, I summarize several important problems with convolutional neural networks
(CNNs) and describe how each parameter affects the network’s performance. The convolu-
tion layer is the heart of a CNN and is responsible for the vast majority of processing time.
A network’s performance can be affected by the number of layers it contains. In contrast,
training and testing the network takes more time as the number of layers grows.

Funding: This research received no external funding.


Data Availability Statement: Not applicable.
Conflicts of Interest: The author declares no conflict of interest.

References
1. Sarker, I.H. Machine Learning: Algorithms, Real-World Applications, and Research Directions. SN Comput. Sci. 2021, 2, 1–21.
[CrossRef]
2. Du, K.-L.; Swamy, M.N.S. Fundamentals of Machine Learning. Neural Netw. Stat. Learn. 2019, 21–63. [CrossRef]
3. Zhao, Q.; Zheng, P.; Xu, S.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Networks Learn. Syst. 2019,
30, 3212–3232. [CrossRef]
4. Indrakumari, R.; Poongodi, T.; Singh, K. Introduction to Deep Learning. EAI/Springer Innov. Commun. Comput. 2021, 1–22.
[CrossRef]
5. AI vs Machine Learning vs Deep Learning|Edureka. Available online: https://www.edureka.co/blog/ai-vs-machine-learning-
vs-deep-learning/ (accessed on 11 August 2022).
6. Cintra, R.J.; Duffner, S.; Garcia, C.; Leite, A. Low-complexity approximate convolutional neural networks. IEEE Trans. Neural
Netw. Learn. Syst. 2018, 29, 5981–5992. [CrossRef]
7. Rusk, N. Deep learning. Nat. Methods 2017, 13, 35. [CrossRef]
8. Shrestha, A.; Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 2019, 7, 53040–53065. [CrossRef]
9. Zhang, Z.; Cui, P.; Zhu, W. Deep Learning on Graphs: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 249–270. [CrossRef]
10. Mishra, R.K.; Reddy, G.Y.S.; Pathak, H. The Understanding of Deep Learning: A Comprehensive Review. Math. Probl. Eng. 2021,
2021, 1–5. [CrossRef]
11. Ker, J.; Wang, L.; Rao, J.; Lim, T. Deep Learning Applications in Medical Image Analysis. IEEE Access 2017, 6, 9375–9379.
[CrossRef]
12. Dhillon, A.; Verma, G.K. Convolutional neural network: A review of models, methodologies, and applications to object detection.
Prog. Artif. Intell. 2019, 9, 85–112. [CrossRef]
13. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.
IEEE Trans. Neural Networks Learn. Syst. 2021, 1–21. [CrossRef] [PubMed]
14. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif.
Intell. Rev. 2020, 53, 5455–5516. [CrossRef]
15. Introduction to Convolutional Neural Networks (CNNs)|The Most Popular Deep Learning architecture|by Louis
Bouchard|What is Artificial Intelligence|Medium. Available online: https://medium.com/what-is-artificial-intelligence/
introduction-to-convolutional-neural-networks-cnns-the-most-popular-deep-learning-architecture-b938f62f133f (accessed on 8
August 2022).
16. Koushik, J. Understanding Convolutional Neural Networks. May 2016. Available online: http://arxiv.org/abs/1605.09081
(accessed on 13 August 2022).
17. Bezdan, T.; Džakula, N.B. Convolutional Neural Network Layers and Architectures. In International Scientific Conference on
Information Technology and Data Related Research; Singidunum University: Belgrade, Serbia, 2019; pp. 445–451. [CrossRef]
18. Zhang, J.; Huang, J.; Chen, X.; Zhang, D. How to fully exploit the abilities of aerial image detectors. In Proceedings of the IEEE
International Conference on Computer Vision Workshops 2019, Seoul, Republic of Korea, 27–28 October 2019.
19. Rodriguez, R.; Gonzalez, C.I.; Martinez, G.E.; Melin, P. An Improved Convolutional Neural Network Based on a Parameter
Modification of the Convolution Layer. In Fuzzy Logic Hybrid Extensions of Neural and Optimization Algorithms: Theory and
Applications; Springer: Cham, Switzerland, 2021; pp. 125–147. [CrossRef]
20. Batmaz, Z.; Yurekli, A.; Bilge, A.; Kaleli, C. A review on deep learning for recommender systems: Challenges and remedies. Artif.
Intell. Rev. 2019, 52, 137. [CrossRef]
21. Fang, X. Understanding deep learning via back-tracking and deconvolution. J. Big Data 2017, 4, 40. [CrossRef]
22. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.;
Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8,
83. [CrossRef] [PubMed]
23. Du, K.L.; Swamy, M.N.S. Neural networks and statistical learning, second edition. In Neural Networks and Statistical Learning, 2nd
ed.; Springer: Berlin/Heidelberg, Germany, 2019; pp. 1–988. [CrossRef]
24. Zhang, Q.; Zhang, M.; Chen, T.; Sun, Z.; Ma, Y.; Yu, B. Recent advances in convolutional neural network acceleration. Neurocom-
puting 2019, 323, 37–51. [CrossRef]
25. Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN variants for computer vision:
History, architecture, application, challenges and future scope. Electronics 2021, 10, 2470. [CrossRef]
26. Prakash, K.B.; Kannan, R.; Alexander, S.A.; Kanagachidambaresan, G.R. Advanced Deep Learning for Engineers and Scientists: A
Practical Approach; Springer: Berlin/Heidelberg, Germany, 2021.
27. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. arXiv 2017, arXiv:1708.02002.
28. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S. Single-shot refinement neural network for object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
29. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
30. Hsieh, M.-R.; Lin, Y.-L.; Hsu, W. Drone-based object counting by spatially regularized regional proposal network. In Proceedings
of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
31. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE International
Conference on Computer Vision 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8311–8320.
32. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
33. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Shi, J. Foveabox: Beyond anchor-based object detector. arXiv 2019, arXiv:1904.03797.
34. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017.
35. Ghiasi, G.; Lin, T.-Y.; Le, Q. Dropblock: A regularization method for convolutional networks. In Proceedings of the 32nd International
Conference on Neural Information Processing Systems; Curran Associates Inc.: Dutchess County, NY, USA; pp. 10727–10737.
36. Müller, R.; Kornblith, S.; Hinton, G. When does label smoothing help? In Advances in Neural Information Processing Systems 2019;
Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; pp. 4696–4705.
37. Dollár, K.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy,
22–29 October 2017.
38. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
39. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE
International Conference on Computer Vision 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6569–6578.
40. Li, Z.; Peng, C.; Yu, G.; Zhang, X.; Deng, Y.; Sun, J. Light-head r-cnn: In defense of twostage object detector. arXiv 2017,
arXiv:1711.07264.
41. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [CrossRef]
42. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer
Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 734–750.
43. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
44. Swapna, M.; Sharma, D.Y.K.; Prasad, D.B. CNN Architectures: Alex Net, Le Net, VGG, Google Net, Res Net. Int. J. Recent Technol.
Eng. 2020, 8, 953–959. [CrossRef]
45. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
46. Wang, C.-Y.; Liao, H.-Y.; Yeh, I.-H.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W. CSPNet: A new backbone that can enhance learning
capability of CNN. arXiv 2019, arXiv:1911.11929.
47. Yun, S.; Han, D.; Oh, S.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable
features. In Proceedings of the IEEE International Conference on Computer Vision 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6023–6032.
48. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
49. Law, H.; Teng, Y.; Russakovsky, O.; Deng, J. Cornernet-lite: Efficient keypoint based object detection. arXiv 2019, arXiv:1904.08900.
50. Chen, K.; Fu, K.; Yan, M.; Gao, X.; Sun, X.; Wei, X. Semantic segmentation of aerial images with shuffling convolutional neural
networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 173–177. [CrossRef]
51. Pailla, D. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF
International Conference on Computer Vision Workshops 2019; IEEE: Piscataway, NJ, USA, 2019.
52. Terrail, J.D.; Jurie, F. On the use of deep neural networks for the detection of small vehicles in ortho-images. In Proceedings of the
2017 IEEE International Conference on Image Processing (ICIP 2017), Beijing, China, 17–20 September 2017; pp. 4212–4216.
53. Shen, J.; Shafiq, M.O. Deep Learning Convolutional Neural Networks with Dropout—A Parallel Approach. In Proceedings of the
17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, 17–20 December
2018; pp. 572–577. [CrossRef]
54. Xiao, Y.; Tian, Z.; Yu, J.; Zhang, Y.; Liu, S.; Du, S.; Lan, X. A review of object detection based on deep learning. Multimed. Tools
Appl. 2020, 79, 23729–23791. [CrossRef]
55. Adem, K. Impact of activation functions and number of layers on detection of exudates using circular Hough transform and
convolutional neural networks. Expert. Syst. Appl. 2022, 203, 117583. [CrossRef]
56. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017,
60, 84–90. [CrossRef]
57. Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation Functions: Comparison of trends in Practice and Research for
Deep Learning. arXiv 2018, arXiv:1811.03378.
58. Zhang, Z. Improved Adam Optimizer for Deep Neural Networks. In Proceedings of the 2018 IEEE/ACM 26th International
Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018. [CrossRef]
59. Coelho, I.M.; Coelho, V.N.; Luz, E.J.D.S.; Ochi, L.S.; Guimarães, F.G.; Rios, E. A GPU deep learning metaheuristic based model for
time series forecasting. Appl. Energy 2017, 201, 412–418. [CrossRef]
60. Huh, J.-H.; Seo, Y.-S. Understanding edge computing: Engineering evolution with artificial intelligence. IEEE Access 2019, 7,
164229–164245. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
