A Survey of Convolutional Neural Networks Analysis
A Survey of Convolutional Neural Networks Analysis
net/publication/340475800
CITATIONS READS
0 2,214
4 authors, including:
Fan Liu
Hohai University
52 PUBLICATIONS 533 CITATIONS
SEE PROFILE
All content following this page was uploaded by Zewen Li on 14 December 2020.
This work was supported in part by National Natural Science Foundation of Zewen Li, Wenjie Yang, Shouheng Peng, and Fan Liu are with College of
China under grant No. 61602150, Natural Science Foundation of Jiangsu Computer and Information, Hohai University, Nanjing, 210098, China
Province under grant No. BK20191298. (Corresponding author: Fan Liu) ([email protected], [email protected], [email protected],
[email protected])
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 2
recognition. Ajmal et al. [16] discussed CNN for image which can reduce the amount of data while retaining useful
segmentation. These reviews mentioned above mainly information. It can also reduce the number of parameters by
reviewed the applications of CNN in different scenarios without removing trivial features. The three appealing characteristics
considering CNN from a general perspective. Also, due to the make CNN become one of the most representative algorithms
rapid development of CNN, lots of inspiring ideas in this field in the deep learning field.
have been proposed, but these reviews did not fully cover them. To be specific, in order to build a CNN model, four
In this paper, we focus on analyzing and discussing CNN. In components are typically needed. Convolution is a pivotal step
detail, the key contributions of this review are as follows: 1) We for feature extraction. The outputs of convolution can be called
provide a brief overview of CNN, including some basic feature maps. When setting a convolution kernel with a certain
building blocks of modern CNN, in which some fascinating size, we will lose information in the border. Hence, padding is
convolution structures and innovations are involved. 2) Some introduced to enlarge the input with zero value, which can
classic CNN-based models are covered, from LeNet-5, AlexNet adjust the size indirectly. Besides, for the sake of controlling the
to MobileNet v3 and GhostNet. Innovations of these models are density of convolving, stride is employed. The larger the stride,
emphasized to help readers draw some useful experience from the lower the density. After convolution, feature maps consist
masterpieces. 3) Several representative activation functions, of a large number of features that is prone to causing overfitting
loss functions, and optimizers are discussed. We reach some problem [21]. As a result, pooling [22] (a.k.a. down-sampling)
conclusions about them through experiments. 4) Although is proposed to obviate redundancy, including max pooling and
applications of two-dimensional convolution are widely used, average pooling. The procedure of a CNN is shown in Fig. 1.
one-dimensional and multi-dimensional ones should not be Stride = 2
2 2 2 2 2 3
1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0
networks from scratch. 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0
0 1 3 2
Fig. 4. Part of classic CNN models. NiN: Network in Network; ResNet: Residual Netwrok; DCGAN: Deep Convolutional Generative Adversarial Network; SENet:
Squeeze-and-Excitation Network
Moreover, there exist a variety of awesome convolutions, AlexNet carries forward LeNet's ideas and applies the basic
such as Separable convolutions [24], [25], [26], [27], [28], principles of CNN to a deep and wide network. It successfully
group convolutions [11], [29], [30], [31] and multi-dimensional leverages ReLU activation function, dropout, and local
convolutions, which are discussed in Section 3 and Section 5. response normalization (LRN) for the first time on CNN. At the
same time, AlexNet also makes use of GPUs for computing
III. CLASSIC CNN MODELS acceleration. The main innovations of AlexNet lie in the
Since AlexNet was proposed in 2012, researchers have following:
invented a variety of CNN models—deeper, wider, and lighter. 1) AlexNet uses ReLU as the activation function of CNN,
Part of well-known models can be seen in Fig. 4. Due to the which mitigates the problem of gradient vanishing when the
limitation of paper length, this section aims to take an overview network is deep. Although the ReLU activation function was
of several representative models, and we will emphatically proposed long before AlexNet, it was not carried forward until
discuss the innovations of them to help readers understand the the appearance of AlexNet.
main points and propose their own promising ideas. 2) Dropout is used by AlexNet to randomly ignore some
neurons during training to avoid overfitting. This technique is
A. LeNet-5 mainly used in the last few fully-connected layers.
LeCun et al. [10] proposed LeNet-5 in 1998, which is an 3) In convolutional layers of AlexNet, overlapping max
efficient convolutional neural network trained with the pooling is used to replace average pooling that was commonly-
backpropagation algorithm for handwritten character used in the previous convolutional neural networks. Max
recognition. As shown in Fig. 5, LeNet-5 is composed of seven pooling can avoid the blurred result of average pooling, and
trainable layers containing two convolutional layers, two overlapping pooling can improve the richness of features.
pooling layers, and three fully-connected layers. LeNet-5 is the 4) LRN is proposed to simulate the lateral inhibition
pioneering convolutional neural network combining local mechanism of the biological nervous system, which means the
receptive fields, shared weights, and spatial or temporal sub- neuron receiving stimulation can inhibit the activity of
sampling, which can ensure shift, scale, and distortion peripheral neurons. Similarly, LRN can make neurons with
invariance to some extent. It is the foundation of modern CNN. small values are suppressed, and those with large values are
Although LeNet-5 is useful for recognizing handwriting relatively active, the function of which is very similar to
characters and reading bank checks, it still does not exceed the normalization. Hence, LRN is a way to enhance the
traditional support vector machine (SVM) and boosting generalization ability of the model.
algorithms. As a result, LeNet-5 did not obtain enough attention 5) AlexNet also employs two powerful GPUs to train group
at that time. convolutions. Since the computing resource limit of one GPU,
16@10x10
6@28x28
16@5x5
1x120
1x84
AlexNet designs a group convolution structure, which can be
6@14x14
1@32x32
1x10 trained on two distinct GPUs. And then, two feature maps
generated by two GPUs can be combined as the final output.
6) AlexNet adopts two data augmentation methods in
Convolution Subsampling Convolution Subsampling Full connection training. The first is extracting random 224 × 224 patches from
Fig. 5. Architecture of LeNet-5 the original 256 × 256 images and their horizontal reflections to
obtain more training data. Besides, the Principal Component
B. AlexNet
Analysis (PCA) is utilized to change the RGB values of the
Alex et al. [11] proposed the AlexNet in 2012, which won training set. When making predictions, AlexNet also enlarges
the championship in the ImageNet 2012 competition. As shown the dataset and then calculate the average of their predictions as
in Fig. 6, AlexNet has eight layers, containing five the final result. AlexNet shows that the use of data
convolutional layers and three fully-connected layers. augmentation can substantially mitigate overfitting problem
2048 2048
224
55
55
27
27
13 13
13
13
13
and improve generalization ability.
13
C. VGGNets
192 192 128
48 128
224
55
48 128
192 192 128
13, VGG-16, and VGG-19. VGGNets secured the first place in branch is factorized, shown in Fig. 8 (c).
the localization track of ImageNet Challenge 2014. The authors Concat
Concat
Concat
of VGGNets prove that increasing the depth of neural networks n×1
3×3
can improve the final performance of the network to some 1×n
1×3 3×1
extent. Compared with AlexNet, VGGNets have the following 1×1 3×3 3×3 n×1 n×1
1×1 1×3 3×1 3×3
improvements: Pool 1×1 1×1 1×1
1×1 1×n 1×n
Pool 1×1 1×1 1×1
1) The LRN layer was removed since the author of VGGNets Pool 1×1 1×1 1×1
found the effect of LRN in deep CNNs was not obvious. Input Input Input
2) VGGNets use 3 × 3 convolution kernels rather than 5 × 5 Fig. 8. Inception v2 module. (a) Each 5 x 5 convolution is replaced by two 3
x 3 convolutions. (b) n x n convolution is replaced by a 1 x n convolution and
or 5 × 5 ones, since several small kernels have the same a n x 1 convolution. (c) Inception modules with the last convolutional layer is
receptive field and more nonlinear variations compared with factorized.
larger ones. For instance, the receptive field of 3 × 3 kernels is 3) Inception v3
the same as one 5 × 5 kernel. Nevertheless, the number of Inception v3 [35] integrates major innovations mentioned in
parameters reduces by about 45%, and three kernels have three Inception v2. And factorizing 5 × 5 and 3 × 3 convolution
nonlinear variations. kernels into two one-dimensional ones (one by seven and seven
by one, one by three and three by one, respectively). This
D. GoogLeNet
operation accelerates the training and further increases the
GoogLeNet [33] is the winner of the ILSVRC 2014 image depth of networks and the non-linearity of networks. Besides,
classification algorithms. It is the first large-scale CNN formed the input size of the network changed from 224 by 224 to 299
by stacking with Inception modules. Inception networks have by 299. And Inception v3 utilizes RMSProp as the optimizer.
four versions, namely Inception v1 [33], Inception v2 [34], [35], 4) Inception v4 and Inception-ResNet
Inception v3 [35], and Inception v4 [36]. Inception v4 modules [36] are based upon that of Inception
1) Inception v1 v3. The architecture of Inception v4 is more concise and utilizes
Due to objects in images have different distances to cameras, more Inception modules. Experimental evaluation proved that
an object with a large proportion of an image usually prefers a Inception v4 is better than its predecessors.
large convolution kernel or a few small ones. However, a small In addition, ResNet structure [37] is harnessed to extend the
object in an image is the opposite. Based on the past experience, depth of Inception networks, namely Inception-ResNet-v1 and
large kernels have many parameters to train, and deep networks Inception-ResNet-v2. Experiments proved that they could
are hard to train. As a result, Inception v1 [33] deploys 1 × 1, 3 improve the training speed and performance.
× 3, 5 × 5 convolution kernels to construct a “wide” network,
which can be seen in Fig. 7, Convolution kernels with different E. ResNet
sizes can extract the feature maps of different scales of the Theoretically, Deep Neural Networks (DNN) outperform
image. Then, those feature maps are stacked to obtain a more shallow ones as the former can extract more complicated and
representative one. Besides, 1 × 1 convolution kernel is used to sufficient features of images. However, with the increase of
reduce the number of channels, i.e., reduce computational cost. layers, DNNs are prone to cause gradient vanishing, gradient
Concat exploding problems, etc. He et al. [37] proposed a 34-layer
Residual Network in 2016, which is the winner of the ILSVRC
1×1 3×3 5×5 2015 image classification and object detection algorithm. The
Pool 1×1 1×1 1×1
performance of ResNet exceeds the GoogLeNet Inception v3.
One of the significant contributions of ResNet is the two-
Input
layer residual block constructed by the shortcut connection, as
Fig. 7. Inception v1 module with dimension reductions
shown in Fig. 9 (a) below.
ReLU
2) Inception v2 ReLU +
Inception v2 [35] utilizes batch normalization to handle +
1×1, 256
internal covariate shift problem [34]. The output of every layer ReLU
is normalized to normal distribution, which can increase the 3×3, 64
ReLU 3×3, 64
robustness of the model and train the model with a relatively ReLU
large learning rate. 3×3, 64
1×1, 64
Furthermore, Inception v2 shows that a single 5 × 5
64-d 256-d
convolutional layers can be replaced by two 3 × 3 ones, shown
(a) (b)
in Fig. 8 (a). One n x n convolutional layer can be replaced by Fig. 9. Structure of ResNet blocks. (a) The structure of two-layer residual
one 1 x n and one n x 1 convolutional layer shown in Fig. 8 (b). block. (b) The structure of three-layer residual block
However, the original paper points out the use of factorization 50-layer ResNet, 101-layer ResNet, and 152-layer ResNet
is not effective in the early layers. It is better to use it on utilize three-layer residual blocks, as shown in the Fig. 9 (b)
medium-sized feature maps. And filter banks should be above, instead of two-layer one. Three-layer residual block is
expanded (wider but not deeper) to improve high dimensional also called the bottleneck module because the two ends of the
representations. Hence, only the last 3 × 3 convolution of each
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 5
(b)
Fig. 14. (a) Residual block (b) MobileNet v2 block: inverted residual block
When performing the steps: channel expansion—depth-wise
separable convolution—channel compression, one problem
will be encountered after "channel compression". That is, it is Fig. 16. The diagrams of sigmoid, h-sigmoid, swish and h-swish
easy to destroy the information when ReLU activation function
is utilized in low-dimensional space, whereas it will not happen H. ShuffleNets
in high-dimensional space. Therefore, ReLU activation ShuffleNets are a series of CNN-based models proposed by
function following the second 1 × 1 convolution of inverted MEGVII to solve the problem of insufficient computing power
residual blocks is removed, and a linear transformation is of mobile devices. These models combine pointwise group
adopted. Hence, this architecture is called the linear bottleneck convolution, channel shuffle, and some other techniques, which
module. significantly reduce the computational cost with little loss of
3) MobileNet v3 accuracy. So far, there are two versions of ShuffleNets, namely
MobileNet v3 [46] achieves three improvements: network ShuffleNet v1 [49] and ShuffleNet v2 [50].
search combining platform-aware neural architecture search 1) ShuffleNet v1
(platform-aware NAS) and NetAdapt algorithm [47], ShuffleNet v1 [49] was proposed to construct a high-efficient
lightweight attention model based upon squeeze and excitation, CNN structure for resource-limited devices. There are two
and h-swish activation function. innovations: pointwise group convolution and channel shuffle.
For MobileNet v3, researchers use platform-aware NAS for The authors of ShuffleNet v1 reckon that Xception [26] and
block-wise search. Platform-aware NAS utilizes an RNN-based ResNeXt [29] are less efficient in extremely small networks
controller and hierarchical search space to find the structure of since 1 × 1 convolution requires a lot of computing resources.
the global network. And then, the NetAdapt algorithm, Therefore, pointwise group convolution is proposed to reduce
complementary to platform-aware NAS, is used for layer-wise the computation complexity of 1 × 1 convolutions. Pointwise
search. It can fine-tune to find the optimal number of filters in group convolution, shown in Fig. 17 (a), requires each
each layer. convolution operation is only on the corresponding input
MobileNet v3 makes use of the squeeze and excitation (SE) channel group, which can reduce the computational complexity.
[48] to reweight the channels of each layer to achieve a However, one problem is that pointwise group convolutions
lightweight attention model. As shown in Fig. 15, after the prevent feature maps between different groups from
depth-wise convolution of an inverted residual block, the SE communicating with each other, which is harmful to extract
module is added. Global-pool operation is firstly performed, representative feature maps. Therefore, channel shuffle
then following a fully-connected layer, the number of channels operation, shown in Fig. 17 (b), is proposed to help the
is reduced to 1/4. The second fully-connected layer is utilized information in different groups flow to other groups randomly.
to recover the number of channels and get the weight of each
layer. Finally, multiply the weight and the depth-wise
convolution to get a reweighted feature map. Howard et al. [46]
proved that this operation could improve the accuracy without
extra time cost.
Fig. 17. Pointwise group convolution and channel shuffle in ShuffleNet v1 (a)
Pointwise group convolution where different colors represent different groups.
(b) After acquiring feature 1, channel shuffle operation is inserted to promote
information exchanges between groups
Fig. 15. MobileNet v3 block: MobileNet v2 block + Squeeze-and-Excite in
the residual layer Furthermore, ShuffleNet unit is proposed on the basis of
The authors of MobileNet v3 figure out that swish activation channel shuffle operation. As shown in figure below, Fig. 18 (a)
function can improve the accuracy of the network compared is a naïve residual block with depth-wise convolution
with ReLU. Nevertheless, swish function costs too much (DWConv); Fig. 18 (b) replaces standard convolution with
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 7
ReLU BN
element-wise addition is replaced by concatenation. These
BN ReLU BN ReLU
IV. DISCUSSION AND EXPERIMENTAL ANALYSIS exponential operation that require division while computing
derivatives, whereas the derivative of ReLU is a constant.
A. Activation function
Moreover, in the sigmoid and tanh function, if the value of x is
1) Discussion of Activation Function too large or too small, the gradient of the function is pretty small,
Convolutional neural networks can harness different which can cause the function to converge slowly. However,
activation functions to express complex features. Similar to the when x is less than 0, the derivative of ReLU is 0, and when x
function of the neuron model of the human brain, the activation is greater than 0, the derivative is 1, so it can obtain an ideal
function here is a unit that determines which information should convergence effect. AlexNet [11], the best model in ILSVRC-
be transmitted to the next neuron. Each neuron in the neural 2012, uses ReLU as the activation function of CNN-based
network accepts the output value of the neurons from the model, which mitigates gradient vanishing problem when the
previous layer as input, and passes the processed value to the network is deep, and verifies that the use of ReLU surpasses
next layer. In a multilayer neural network, there is a function sigmoid in deep networks.
between two layers. This function is called activation function, From what discussed above, we can find that ReLU does not
whose structure is shown in the following Fig. 21. consider the upper limit. In practice, we can set an upper limit,
x1
w1j such as ReLU6 [54].
n However, when x is less than 0, the gradient of ReLU is 0,
xi wij
∑ f yj = f xi wij bj
i 1 which means the back-propagated error will be multiplied by 0,
wnj
resulting in no error being passed to the previous layer. In this
xn bj scenario, the neurons will be regarded as inactivated or dead.
Fig. 21. General activation function structure
Therefore, some improved versions are proposed. Leaky ReLU
In this figure, xi represents the input feature; n features are (see Fig. 22 (d)) can reduce neuron inactivation. When x is less
input to the neuron j at the same time; wij represents the weight than 0, the output of Leaky ReLU is 𝑥/𝑎, instead of zero, where
value of the connection between the input feature xi and the ‘a’ is a fixed parameter in range (1, +∞).
neuron j; bj represents the internal state of the neuron j, which Another variant of ReLU is PReLU [38] (see Fig. 22 (e)).
is the bias value; and yj is the output of the neuron j. 𝑓(∙) is the Unlike Leaky ReLU, the slope of the negative part of PReLU is
activation function, which can be sigmoid function, tanh (x) based upon the data, not a predefined one. He et al. [38] reckon
function [10], Rectified Linear Unit [52], etc. [53] that PReLU is the key to surpassing the level of human
If an activation function is not used or a linear function is classification on the ImageNet 2012 classification dataset.
used, the input of each layer will be a linear function of the Exponential Linear Units (ELU) function [55] (see Fig. 22
output of the previous layer. In this case, He et al. [38] verify (f)) is another improved version of ReLU. Since ReLU is non-
no matter how many layers the neural network has, the output negatively activated, the average value of its output is greater
is always a linear combination of the input, which means hidden than 0. This problem will cause the offset of the next layer unit.
layers have no effect. This situation is the primitive perceptron ELU function has a negative value, so the average value of its
[2], [3], which has the limited learning ability. For this reason, output is close to 0, making the rate of convergence faster than
the nonlinear functions are introduced as activation functions. ReLU. However, the negative part is a curve, which demands
Theoretically, the deep neural networks with nonlinear lots of complicated derivatives.
activation function can approximate any function, which
greatly enhances the ability of neural networks to fit data.
In this section, we mainly focus on several frequently-used
activation functions. To begin with, sigmoid function is one of
the most typical non-linear activation functions with an overall
S-shape (see Fig. 22 (a)). With x value approaching 0, the (a) Sigmoid function (b) Tanh function (c) ReLU function
can map a real number to (-1, 1). Since the mean value of the Fig. 22. Diagrams of Sigmoid, Tanh, ReLU, Leaky ReLU, PReLU, and ELU
output of tanh is 0, it can achieve a kind of normalization. This 2) Experimental Evaluation
makes the next layer easier to learn. To compare aforementioned activation functions, two classic
In addition, Rectified Linear Unit (ReLU) [52] (see Fig. 22 CNN models, LeNet-5 [10] and VGG-16 [32], are tested on
(c)) is another effective activation function. When x is less than four benchmark datasets, including MNIST [10], Fashion-
0, its function value is 0; when x is greater than or equal to 0, MNIST [56], CIFAR-10 [57] and CIFAR-100 [57]. LeNet-5 is
its function value is x itself. Compared to sigmoid function and the first modern but relatively shadow CNN model. In the
tanh function, a significant advantage of using ReLU function following experiments, we train LeNet-5 from scratch. VGG-
is that it can speed up learning. Sigmoid and tanh involve in 16 is a deeper, larger, and frequently-used model. We conduct
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 9
Fig. 23. The experimental results on seven activation functions, respectively. (a) The accuracy of validation set and training loss on MNIST trained by LeNet-5.
(b) The accuracy of validation set and training loss on Fashion-MNIST trained by LeNet-5. (c) The accuracy of validation set and training loss on MNIST trained
by VGG-16. (d) The accuracy of validation set and training loss on Fashion-MNIST trained by VGG-16.
Fig. 24. The experimental results on seven activation functions, respectively. (a) The accuracy of validation set and training loss on CIFAR-10 trained by LeNet-
5. (b) The accuracy of validation set and training loss CIFAR-100 trained by LeNet-5. (c) The accuracy of validation set and training loss on CIFAR-10 trained by
VGG-16. (d) The accuracy of validation set and training loss on CIFAR-100 trained by VGG-16.
our experiments on the basis of a pre-trained VGG-16 model of sigmoid is slowest. Usually, the final performance of sigmoid
without the last three layers on ImageNet [58]. is not all that excellent. As a result, if we expect a fast
Both LeNet-5 and VGG-16 deploy softmax at the last layer convergence, sigmoid is not the best solution.
for multi-classification. All experiments are tested on Intel • From the perspective of accuracy, ELU possesses the best
Xeon E5-2640 v4 (X2), NVIDIA TITAN RTX (24GB), Python accuracy, but only a little better than ReLU, Leaky ReLU, and
3.5.6, and Keras 2.2.4. PReLU. In terms of training time, from Table I, ELU is prone
a) MNIST & Fashion-MNIST: MNIST is a dataset of to consume more time than ReLU and Leaky ReLU.
handwritten digits consisting of 10 categories, which has a • ReLU and Leaky ReLU have better stability during
training set of 60,000 examples and a test set of 10,000 training than PReLU and ELU.
examples. Each example is a 28 × 28 grayscale image, b) CIFAR-10 & CIFAR-100: CIFAR-10 and CIFAR-100 are
associated with a label from 10 classes, from 0 to 9. Fashion- labeled subsets of the 80 million tiny images dataset [57], which
MNIST dataset is a more complicated version of original are more complex than MNIST as well as Fashion-MNIST.
MNIST, sharing the same image size, structure, and split. These CIFAR-10 dataset consists of 60000 32 × 32 RGB images in 10
two datasets are trained on LeNet-5 and VGG-16, and results classes, with 6000 images per class. The whole dataset is
are exhibited in Table I and Fig. 23. From the results, we can divided into 50000 training images and 10000 test images, i.e.,
draw some meaningful conclusions. each class has 5000 training images and 1000 test images.
• Linear activation function indeed lead to the worst CIFAR-100 is like the CIFAR-10, except it has 100 classes
performance. Therefore, when building a deep neural network containing 600 images per class. And each class has 500
(more than one layer), we need to add a non-linear function. If training images and 100 test images. Similarly, we evaluate
not, multiple layers, theoretically, are equal to one layer. LeNet-5 and VGG-16 with different activation functions on
• Among these activation functions, the convergence speed these two datasets. The results can be seen in Table I and Fig.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 10
TABLE I
COMPARATIVE RESULTS OF DIFFERENT ACTIVATION FUNCTIONS
Data Batch Activation Validation set Training
Model Dataset Loss Optimizer epochs
Augmentation size function accuracy time (s)
Linear 98.56% 92.89
Sigmoid 98.94% 95.51
Tanh 99.03% 92.92
Cross
MNIST - Adam 256 50 ReLU 99.18% 95.04
entropy
Leaky ReLU 99.10% 99.82
PReLU 99.20% 113.42
ELU 99.20% 103.84
Linear 88.10% 169.95
Sigmoid 89.84% 174.26
Tanh 89.83% 181.99
Fashion- Cross
- Adam 256 100 ReLU 90.17% 191.77
MNIST entropy
Leaky ReLU 90.36% 190.02
PReLU 90.36% 217.20
ELU 90.37% 204.64
LeNet-5
Linear 62.56% 614.89
Sigmoid 62.65% 569.25
Tanh 62.94% 575.07
Cross
CIFAR-10 - Adam 256 200 ReLU 64.40% 550.35
entropy
Leaky ReLU 64.27% 582.08
PReLU 63.61% 650.75
ELU 65.70% 626.51
Linear 31.24% 2381.76
Sigmoid 32.32% 2355.64
Tanh 32.69% 2376.35
Cross
CIFAR-100 - Adam 512 1000 ReLU 32.69% 2418.18
entropy
Leaky ReLU 33.81% 2443.72
PReLU 31.84% 2615.02
ELU 35.10% 2475.30
Linear 9.82% 598.14
Sigmoid 11.35% 600.27
Tanh 11.35% 596.32
Cross
MNIST - Adam 512 30 ReLU 99.55% 606.83
entropy
Leaky ReLU 99.48% 608.91
PReLU 99.45% 607.27
ELU 11.35% 614.81
Linear 10.00% 599.18
Sigmoid 15.66% 595.11
Tanh 11.01% 596.48
Fashion- Cross
- Adam 512 30 ReLU 93.16% 608.19
MNIST entropy
Leaky ReLU 92.81% 610.82
PReLU 10.00% 612.75
VGG-16
ELU 93.87% 613.32
(pre-
Linear 83.25% 958.74
trained)
Sigmoid 10.00% 957.62
Tanh 10.00% 957.97
Cross
CIFAR-10 - Adam 512 100 ReLU 83.22% 957.74
entropy
Leaky ReLU 83.37% 958.39
PReLU 82.17% 963.67
ELU 83.14% 968.60
Linear 1.00% 1897.14
Sigmoid 1.00% 1868.02
Tanh 1.00% 1897.76
Cross
CIFAR-100 - Adam 512 200 ReLU 44.77% 1901.56
entropy
Leaky ReLU 48.22% 1916.38
PReLU 47.46% 1922.29
ELU 1.00% 1917.75
24, from which we can get some conclusions. ELU may make networks learn nothing. More often than not,
• Tanh, PReLU, and ELU activation functions are more Leaky ReLU has better performance in terms of accuracy and
likely to bring about oscillation at the end of the training. training speed.
• When training a deep CNN model with pre-trained 3) Rules of Thumb for Selection
weights, it is hard to converge by the use of sigmoid and tanh • For binary classification problems, the last layer can
activation functions. harness sigmoid; for multi-classification problems, the last
• The models trained by Leaky ReLU and ELU have better layer can harness softmax.
accuracy than the others in the experiments. But sometimes, • Sigmoid and tanh functions sometimes should be avoided
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 11
because of the gradient vanishment. Usually, in hidden layers, cross-entropy loss as the loss function in their original paper,
ReLU or Leaky ReLU is a good choice. which helped them reach state-of-the-art results.
• If you have no idea about choosing activation functions, However, cross entropy loss has some flaws. Cross entropy
feel free to try ReLU or Leaky ReLU. loss only cares about the correctness of the classification, not
• If a lot of neurons are inactivated in the training process, the degree of compactness within the same class or the margin
please try to utilize Leaky ReLU, PReLU, etc. between different classes. Hence, many loss functions are
• The negative slope in Leaky ReLU can be set to 0.02 to proposed to solve this problem.
speed up training. Contrastive loss [59] enlarges the distance between different
categories and minimizes the distance within the same
B. Loss function categories. It can be used in dimensionality reduction in
Loss function or cost function is harnessed to calculate the convolutional neural networks. After dimensionality reduction,
distance between the predicted value and the actual value. Loss the two samples that are originally similar are still similar in the
function is usually used as a learning criterion of the feature space, whereas the two samples that are originally
optimization problem. Loss function can be used with different are still different. Additionally, contrastive loss is
convolutional neural networks to deal with regression problems widely used with convolutional neural networks in face
and classification problems, the goal of which is to minimize recognition. It was firstly used in SiameseNet [60], and later
loss function. Common loss functions include Mean Absolute was deployed in DeepID2 [61], DeepID2+ [62] and DeepID3
Error (MAE), Mean Square Error (MSE), Cross Entropy, etc. [63].
After contrastive loss, triplet loss was proposed by Schroff et
1) Loss Function for Regression al. in FaceNet [64], with which the CNN model can learn better
In convolutional neural networks, when dealing with face embeddings. The definition of the triplet loss function is
regression problems, we are likely to use MAE or MSE. based upon three images. These three images are anchor image,
MAE calculates the mean of the absolute error between the positive image, and negative image. The positive image and the
predicted value and the actual value; MSE calculates the mean anchor image are from the same person, whereas the negative
of square error between them. image and the anchor image are from different people.
MAE is more robust to outliers than MSE, because MSE Minimizing triplet loss is to make the distance between the
would calculate the square error of outliers. However, the result anchor and the positive one closer, and make the distance
of MSE is derivable so that it can control the rate of update. The between the anchor and the negative one further. Triplet loss is
result of MAE is non-derivable, the update speed of which usually used with convolutional neural networks for fine-
cannot be determined during optimization. grained classification at the individual level, which requires
Therefore, if there are lots of outliers in the training set and model have ability to distinguish different individuals from the
they may have a negative impact on models, MAE is better than same category. Convolutional neural networks with triplet loss
MSE. Otherwise, MSE should be considered. or its variants can be used in identification problems, such as
2) Loss Function for Classification face identification [65], [66], [64], person re-identification [67],
In convolutional neural networks, when it comes to [68], and vehicle re-identification [69].
classification tasks, there are many loss functions to handle. Another one is center loss [70], which is an improvement
The most typical one, called cross entropy loss, is used to based upon cross entropy. The purpose of center loss is to focus
evaluate the difference between the probability distribution on the uniformity of the distribution within the same class. In
obtained from the current training and the actual distribution. order to make it evenly distributed around the center of the class,
This function compares the predicted probability with the actual center loss adds an additional constraint to minimize the intra-
output value (0 or 1) in each class and calculate the penalty class difference. Center loss was used with CNN in face
value based upon the distance from them. The penalty is recognition [70], image retrieval [71], person re-identification
logarithmic, so the function provides a smaller score (0.1 or 0.2) [72], speaker recognition [73], etc.
for smaller differences and a bigger score (0.9 or 1.0) for larger Another variant of cross entropy is large-margin softmax loss
differences. [74]. The purpose of it is also intra-class compression and inter-
Cross entropy loss is also called softmax loss, which class separation. Large-margin softmax loss adds a margin
indicates it is always used in CNNs with a softmax layer. For between different classes, and introduces the margin regularity
example, AlexNet [11], Inception v1 [33], and ResNet [37] uses through the angle of the constraint weight matrix. Large-margin
TABLE II
DIFFERENT LOSS FUNCTIONS FOR CONVOLUTIONAL NEURAL NETWORKS
Loss Brief Description
Mean absolute error Calculate the mean of absolute error of samples.
Mean square error Calculate the mean of square error of samples.
Cross entropy loss Calculate the difference between the probability distribution and the actual distribution. [11] [33] [37]
Contrastive loss Enlarge the distance between different categories and minimize the distance within the same categories. [60] [61] [62] [63]
Triplet loss Minimize the distance between anchor samples and positive samples, and enlarge the distance between anchor samples and
negative samples. [64] [65] [66] [67] [68]
Center loss Minimize intra-class distance. [70] [71] [72] [73]
Large-margin softmax loss Focus on intra-class compression and inter-class separation. [74] [75] [76]
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 12
softmax loss was used in face recognition [74], emotion FaceNet [64], DeepID [79], and DeepID2 [61].
recognition [75], speaker verification [76], etc. 2) Gradient Descent Optimization Algorithms
3) Rules of Thumb for Selection On the basis of MBGD, a series of effective algorithms for
• When using CNN models to deal with regression problems, optimization are proposed to accelerate model training process.
we can choose L1 loss or L2 loss as the loss function. A proportion of them are presented as follows.
• When dealing with classification problems, we can select Qian et al. proposed the Momentum algorithm [80]. It
the rest of the loss functions. simulates physical momentum, using the exponentially
• Cross entropy loss is the most popular choice, usually weighted average of the gradient to update weights. If the
appearing in CNN models with a softmax layer in the end. gradient in one dimension is much larger than the gradient in
• If the compactness within the class or the margin between another dimension, the learning process will become
different classes is concerned, the improvements based upon unbalanced. The Momentum algorithm can prevent oscillations
cross entropy loss can be considered, like center loss and large- in one dimension, thereby achieving faster convergence. Some
margin softmax loss. classic CNN models like VGG [32], Inception v1 [33], and
• The selection of loss function in CNNs also depends on Residual networks [37] use momentum in their original paper.
the application scenario. For example, when it comes to face However, for the Momentum algorithm, blindly following
recognition, contrastive loss and triplet loss are turned out to be gradient descent is a problem. Nesterov Accelerated Gradient
the commonly-used ones nowadays. (NAG) algorithm [81] gives the Momentum algorithm a
predictability that makes it slow down before the slope becomes
C. Optimizer positive. By getting the approximate gradient of the next
In convolutional neural networks, we often need to optimize position, it can adjust the step size in advance. Nesterov
non-convex functions. Mathematical methods require huge Accelerated Gradient has been used to train CNN-based models
computing power, so optimizers are used in the training process in many tasks [82], [83], [84].
to minimize the loss function for getting optimal network Adagrad algorithm [85] is another optimization algorithm
parameters within acceptable time. Common optimization based upon gradients. It can adapt the learning rate to
algorithms are Momentum, RMSprop, Adam, etc. parameters, performing smaller updates (i.e., a low learning rate)
1) Gradient Descent for frequent feature-related parameters, and performing larger-
There are three kinds of gradient descent methods that we can step updates (i.e., a high learning rate) for infrequent ones.
use to train our CNN models: Batch Gradient Descent (BGD), Therefore, it is very suitable for processing sparse data. One of
Stochastic Gradient Descent (SGD), and Mini-Batch Gradient the main advantages of Adagrad is that there is no need to adjust
Descent (MBGD). the learning rate manually. In most cases, we just use 0.01 as
The BGD indicates a whole batch of data need to be the default learning rate [50]. FaceNet [64] uses Adagrad as the
calculated to get a gradient for each update, which can ensure optimizer in training.
convergence to the global optimum of the convex plane and the Adadelta algorithm [86] is an extension of the Adagrad,
local optimum of the non-convex plane. However, it's pretty designed to reduce its monotonically decreasing learning rate.
slow to use BGD because the average gradient of whole batch It does not merely accumulate all squared gradients but sets a
samples should be calculated. Also, it can be tricky for data that fixed size window to limit the number of accumulated squared
is not suitable for in-memory calculation. Hence, BGD is hardly gradients. At the same time, the sum of gradients is recursively
utilized in training CNN-based models in practice. defined as the decaying average of all previous squared
On the contrary, SGD only use one sample for each update. gradients, rather than directly storing the previous squared
It is apparent that the time of SGD for each update greatly less gradients. Adadelta are leveraged in many tasks [87], [88], [89].
than BGD because only one sample’s gradient is needed to Root Mean Square prop (RMSprop) algorithm [90] is also
calculate. In this case, SGD is suitable for online learning [77]. designed to solve the problem of the radically diminishing
However, SGD is quickly updated with high variance, which learning rate in the Adagrad algorithm. MobileNet [44],
will cause the objective function to oscillate severely. On the Inception v3 [35], and Inception v4 [36] achieved their best
one hand, the oscillation of the calculation can make the models using RMSprop.
gradient calculation jump out of the local optimum, and finally Another frequently-used optimizer is Adaptive Moment
reach a better point; on the other hand, SGD may never Estimation (Adam) [91] It is essentially an algorithm formed by
converge because of endless oscillation. combining the Momentum and the RMSprop. Adam stores both
Based on BGD and SGD, MBGD was proposed, which the exponential decay average of the past square gradients like
combines the advantages of BGD and SGD. MBGD uses a the Adadelta algorithm and the average exponential decay
small batch of samples for each update, so that it can not only average of the past gradients like the Momentum algorithm.
perform more efficient gradient decent than BGD, but also Practice has proved that Adam algorithm works well on many
reduce the variance, making the convergence more stable. problems and is applicable to many different convolutional
Among these three methods, MBGD is the most popular one. neural network structures [92], [93], [88].
Lots of classic CNN models use it to train their networks in AdaMax algorithm [91] is a variant of Adam that makes the
original papers, like AlexNet [11], VGG [32], Inception v2 [34], boundary range of the learning rate simpler, and it has been used
ResNet [37] and DenseNet [78]. It has also been leveraged in to train CNN models [94], [95].
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 13
TABLE III
COMPARATIVE RESULTS OF DIFFERENT OPTIMIZERS
Last three Data Activation Validation set Training
Model Dataset Loss Batch size epochs Optimizer
layers Augmentation function accuracy time (s)
MBGD 85.70% 926.24
Momentum 86.37% 947.18
Nesterov 84.32% 945.92
Adagrad 84.68% 950.72
VGG-16 512, 256, Cross Adadelta 86.06% 965.90
CIFAR-10 - ReLU 512 100
(pre-trained) 10 entropy RMSprop 87.32% 959.33
Adam 83.09% 953.46
Adamax 86.18% 960.83
Nadam 86.26% 968.72
AMSgrad 83.25% 951.28
Nesterov-accelerated Adaptive Moment Estimation (Nadam) • In the experiment, we find that Nesterov, Adagrad,
[96] is a combination of Adam and NAG. Nadam has a stronger RMSprop, Adamax, and Nadam oscillate and even cannot
constraint on the learning rate and a direct impact on the update converge during training. In the further experiments (see Fig.
of the gradient. Nadam is used in many tasks [97], [98], [99]. 26.), we find that learning rate has huge impact on convergence.
AMSGrad [100] is an improvement on Adagrad. The author • Nesterov, RMSprop, and Nadam are likely to create
of AMSGrad algorithm found that there were errors in the oscillation, but it is this characteristic that may help models
update rules of the Adam algorithm, which caused it to fail to jump out of local optima.
converge to the optimal in some cases. Therefore, AMSGrad
algorithm uses the maximum value of the past squared gradient
instead of the original exponential average to update the
parameters. AMSGrad has been used to train CNN in many
tasks [101], [102], [103].
3) Experimental Evaluation
In the experiment, we tested ten kinds of optimizers—mini-
batch gradient decent, Momentum, Nesterov, Adagrad,
Adadelta, RMSprop, Adam, Adamax, Nadam, and AMSgrad
on CIFAR-10 data set. The last nine optimizers are based upon
mini batch. The format of the CIFAR -10 data set is the same
as the experiment in the section 2.B. We also do our
experiments on the basis of a pre-trained VGG-16 model
without the last three layers on ImageNet [58]. The results can
be seen in Table III and Fig. 25, from which we can get some
conclusions.
• Almost all optimizers that we tested can make the CNN-
based model converge at the end of the training.
Fig. 25. The accuracy of validation set and training loss on CIFAR-10 trained
• The convergence rate of mini-batch gradient decent is by VGG-16 with ten different optimizers, respectively.
slowest, even if it can converge at the end.
Fig. 26. The accuracy of validation set and training loss on CIFAR-10 trained by VGG-16 with Nesterov, Adagrad, RMSprop, Adamax, or Nadam optimizer with
different learning rates, respectively.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 14
replace the sliding window and manual feature extraction used In 2014, Long et al. [130] proposed the concept of Fully
in traditional object detection and designed the R-CNN Convolutional Networks (FCN) and applied CNN structures to
framework, which made a breakthrough in object detection. image semantic segmentation for the first time. In 2015,
Then, Girshick et al. [127] summarizing the shortcomings of R- Ronneberger et al. [131] proposed U-Net, which has more
CNN [126] and drawing lessons from the SPP-Net [110], multi-scale features and has been applied to medical image
proposed Fast R-CNN, which introduced the Regions of segmentation. Besides, ENet [132], PSPNet [133], etc. [134],
Interest (ROI) Pooling layer, making the network faster. [135] were proposed to handle specific problems.
Besides, Fast R-CNN shares convolution features between In terms of instance segmentation tasks, He et al. [136]
object classification and bounding box regression. However, proposed Mask-RCNN that shares convolution features
Fast R-CNN still retains the selective search algorithm of R- between two tasks through the cascade structure. In
CNN’s region proposals. In 2015, Ren et al. [128] proposed consideration of real time, Bolya et al. [137] based on
Faster R-CNN, which adds the selection of region proposals to RetinaNet [138] harnessed ResNet-101 and FPN to fuse multi-
make it faster. An essential contribution of Faster R-CNN is to scale features.
introduce an RPN network at the end of the convolutional layer. For panoptic segmentation tasks, it was first proposed by
In 2016, Lin et al. [129] added Feature Pyramid Network (FPN) Kirillov et al. [139] in 2018. They proposed panoramic FPN
to Faster R-CNN, where multi-scale features can be fused [140] in 2019, which combines FPN network with Mask-RCNN
through the feature pyramid in the forward process. to generate a branch of semantic segmentation. In the same year,
In one stage, the model directly returns the category Liu et al. [141] proposed OANet that also introduced the FPN
probability and position coordinates of the objects. Redmon et based on Mask-RCNN, but the difference is that they designed
al. regarded object detection as a regression problem and an end-to-end network.
proposed YOLO v1 [120], which directly utilizes a single 4) Face Recognition
neural network to predict bounding boxes and the category of Face recognition is a biometric identification technique based
objects. Afterward, YOLO v2 [121] proposed a new on the features of the human face. The development history of
classification model darknet-19, which includes 19 deep face recognition is shown in Fig. 29. DeepFace [142] and
convolutional layers and five max-pooling layers. Batch DeepID [79] achieved excellent results on the LFW [74] data
normalization layers are added after each convolution layer, set, surpassing humans for the first time in the unconstrained
which is beneficial to stable training and fast convergence. scenarios. Henceforth, deep learning-based approaches
YOLO v3 [122] was proposed to remove the max-pooling received much more attention. The process of DeepFace
layers and the fully-connected layers, using 1 × 1 and 3 × 3 proposed by Taigman et al. [142] is detection, alignment,
convolution and shortcut connections. Besides, YOLO v3 extraction, and classification. After detecting the face, using
borrows the idea from FPN to achieve multi-scale feature fusion. three-dimensional alignment generate a 152 × 152 image as the
For the benefits of the structure of YOLO v3, many classic input of CNN. Taigman et al. [142] leveraged Siamese network
networks replace the backend of it to achieve better results. All to train the model, which obtained state-of-the-art results.
of the aforementioned approaches leverage anchor boxes to Unlike DeepFace, DeepID directly inputs two face images into
determine where objects are. Their performance hinges on the CNN to extract feature vectors for classification. DeepID2 [61]
choice of anchor boxes, and a large number of hyperparameters introduces classification loss and verification loss. Based upon
are introduced. Therefore, Law et al. [124] proposed CornerNet, DeepID2 [61], DeepID2+ [62] adds the auxiliary loss between
which abandons anchor boxes and directly predicts the top-left convolutional layers. DeepID3 [63] proposed two kinds of
corner and bottom-right corner of bounding boxes of objects. In structures, which can be constructed by stacked convolutions of
order to decide which two corners in different categories are VGGNet or Inception modules.
paired with each other, and an embedding vector is introduced. The aforementioned approaches harness the standard
Then, CornerNet-Lite [125] optimized CornerNet in terms of softmax loss function. More recently, improvements in face
detection speed and accuracy. recognition are basically focused on the loss function. FaceNet
3) Image Segmentation [85] proposed by Google in 2015 utilizes 22-layer CNN and
Image segmentation is the task that divides an image into 200 million pictures, including eight million people, to train a
different areas. It has to mark the boundaries of different model. In order to learn more efficient embeddings, FaceNet
semantic entities in an image. The image segmentation task replaces softmax with triplet loss. Besides, VGGFace [65] also
completed by CNN is shown in Fig 28. deploys triplet loss to train the model. Besides, there are various
loss functions harnessed to reach better results, like L-softmax
loss, SphereFace, ArcFace, and large margin cosine loss, which
can be seen in Fig. 29. [74], [143], [144], [145]
DeepID2
(contrastive
loss)
DeepID FaceNet ArcFace
FaceNet
(softmax) (triplet loss) (triplet loss)
DeepFace L2-softmax
DeepID2+ DeepID3 VGGNet L-softmax A-softmax CosFace
(softmax) (feature
(contrastive loss) (contrastive loss) (triplet+softmax) (large margin) (large margin) (large margin)
normalization)
Fig. 28. Applications of CNN in image segmentation Fig. 29. The development history of deep face recognition
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 16
and multivariate quantization are harnessed to strike a proper C. Network Architecture Search
balance between model size and accuracy. Network Architecture Search (NAS) is another method to
B. Security of CNN realize automatic machine learning of CNN. NAS constructs a
search space through design choices, such as the number of
There are many applications of CNN in daily life, including
kernels, skip connections, etc. Besides, NAS finds a suitable
security identification system [166],[167], medical image
optimizer to control the search process in the search space. As
identification [168],[156],[155], traffic sign recognition [169],
shown in Fig. 32, NAS could be divided into NAS with agents
and license plate recognition [170]. These applications are
and without agents. Due to the high demand for NAS on
highly related to life and property security. Once models are
computing resources, the integrated models consist of learned
disturbed or destroyed, the consequences will be severe.
optimal convolutional layers in the small-scale data sets. Small-
Therefore, the security of CNN is expected to be attached great
scale data sets are the agents that generate the overall model, so
importance. More precisely, researchers [171],[172],[173],[174]
this approach is the NAS with agents. The agentless NAS refers
have proposed some methods to deceive CNN, resulting in a
to learning the whole model directly on large-scale data sets.
sharp drop in the accuracy. These methods can be classified into
two categories: data poisoning and adversarial attacks.
Target Target
Data poisoning indicates that poisoning the training data Learner Proxy
Task& Learner Task&
Task
during the training phase. Poison refers to the insertion of noise Hardware Hardware
CNN cannot correctly recognize Fig. 33 (b), (c), or (d) is the conclusions are reached. Also, we offer some rules of thumb for
same cat as the former, which is obvious to humans. This the selection of these functions.
problem is caused by the architecture of CNN. Therefore, in Fourth, we discuss some typical applications of CNN.
order to teach a CNN system to recognize different patterns of Different dimensional convolutions should be designed for
one object, a massive amount of data should be fed, making up various problems. Other than the most frequently-used two-
for the flaw of CNN architectures with diverse data. However, dimensional CNN used for image-related tasks, one-
labeled data is typically hard to obtain. Although some tricks dimensional and multi-dimensional CNN are harnessed in lots
like data augmentation can bring about some effects, the of scenarios as well.
improvement is relative limited. Finally, even though convolutions possess many benefits and
Pooling layer is widely used in CNN for many advantages, have been widely used, we reckon that it can be refined further
but it ignores the relationship between the whole and the part. in terms of model size, security, and easy hyperparameters
For effectively organizing network structures and solving the selection. Moreover, there are lots of problems that convolution
problem of spatial information loss of traditional CNN, Hinton is hard to handle, such as low generalization ability, lack of
et al. [186] proposed Capsule Neural Networks (CapsNet) equivariance, and poor crowded-scene results, so that several
where neurons on different layers focus on different entities or promising directions are pointed.
attributes, so that they add neurons to focus on the same
category or attribute, similar to the structure of a capsule. When REFERENCES
CapsNet is activated, the pathway between capsules forms a
tree structure composed of sparsely activated neurons. Each [1] W. S. Mcculloch, and W. H. Pitts, “A logical Calculus of Ideas
output of a capsule is a vector, the length of which represents Immanent in Nervous Activity,” The Bulletin of Mathematical
Biophysics, vol. 5, pp. 115-133, 1942.
the probability of the existence of an object. Therefore, the [2] F. Rosenblatt, “The Perceptron: A Probabilistic Model for Information
output features include the specific pose information of objects, Storage and Organization in the Brain,” Psychological Review, pp.
which means that CapsNet has the ability to recognize the 368-408, 1958.
[3] C. V. D. Malsburg, “Frank Rosenblatt: Principles of Neurodynamics:
orientation. In addition, unsupervised CapsNet was created by Perceptrons and the Theory of Brain Mechanisms.”
Hinton et al. [187], called Stacked Capsule Autoencoder [4] Davd. Rumhar, Geoffrey. Hinton, and RonadJ. Wams, “Learning
(SCAE). SCAE consists of four parts: Part Capsule representations by back-propagating errors.”
[5] A. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. J. Lang,
Autoencoder (PCAE), Object Capsule Autoencoder (OCAE), “Phoneme recognition using time-delay neural networks,” IEEE
and the decoders of PCAE and OCAE. PCAE is a CNN with a Transactions on Acoustics Speech & Signal Processing, vol. 37, no. 3,
top-down attention mechanism. It can identify the pose and pp. 328-339, 1989.
[6] W. Zhang, “Shift-invariant pattern recognition neural network and its
existence of capsules of different parts. OCAE is used to
optical architecture,” in Proceedings of annual conference of the Japan
implement inference. SCAE can predict the activations of Society of Applied Physics, 1988.
CapsNet directly based on the pose and the existence. Some [7] Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W.
experiments have proved that CapsNet is able to reach state-of- Hubbard, and L. D. Jackel, “Backpropagation Applied to Handwritten
Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541-
the-art results. Although it did not achieve satisfactory results 551.
on complicated large-scale data sets, like CIFAR-100 or [8] K. Aihara, T. Takabe, and M. Toyoda, “Chaotic neural networks,”
ImageNet, we can see that it is potential. Physics Letters A, vol. 144, no. 6-7, pp. 333-340.
VII. CONCLUSION

Owing to advantages such as local connections, weight sharing, and down-sampling dimensionality reduction, convolutional neural networks have been widely deployed in both research and industry projects. This paper provides a detailed survey on CNN, covering common building blocks, classic networks, related functions, applications, and prospects.
First, we discuss the basic building blocks of CNN and present how to construct a CNN-based model from scratch.
Second, some excellent networks are expounded. From them, we obtain guidelines for devising novel networks from the perspectives of accuracy and speed. More specifically, in terms of accuracy, deeper and wider neural structures are able to learn better representations than shallow ones. Besides, residual connections can be leveraged to build extremely deep neural networks, which increases their ability to handle complex tasks. In terms of speed, dimension reduction and low-rank approximation are very handy tools.
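To make the residual-connection guideline above concrete, the following is a minimal PyTorch-style sketch of an identity-shortcut block in the spirit of [37]; the class name, layer sizes, and hyperparameters are illustrative assumptions rather than code from any of the surveyed papers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Identity shortcut: the block learns a residual F(x) and outputs
    # F(x) + x, which eases the optimization of very deep networks.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # shortcut (skip) connection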
Third, we introduce activation functions, loss functions, and optimizers for CNN. Through experimental analysis, several conclusions are drawn, which can provide guidance for their selection. Moreover, there remain many problems that convolution struggles to handle, such as limited generalization ability, lack of equivariance, and poor performance in crowded scenes, so several promising research directions are pointed out.

REFERENCES

[1] W. S. McCulloch, and W. H. Pitts, "A logical Calculus of Ideas Immanent in Nervous Activity," The Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[2] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, pp. 368-408, 1958.
[3] C. V. D. Malsburg, "Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms."
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors."
[5] A. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics Speech & Signal Processing, vol. 37, no. 3, pp. 328-339, 1989.
[6] W. Zhang, "Shift-invariant pattern recognition neural network and its optical architecture," in Proceedings of annual conference of the Japan Society of Applied Physics, 1988.
[7] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, no. 4, pp. 541-551.
[8] K. Aihara, T. Takabe, and M. Toyoda, "Chaotic neural networks," Physics Letters A, vol. 144, no. 6-7, pp. 333-340.
[9] D. F. Specht, "A general regression neural network," IEEE Transactions on Neural Networks, vol. 2, no. 6, pp. 568-576.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[11] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in neural information processing systems, vol. 25, no. 2, 2012.
[12] N. Aloysius, and M. Geetha, "A review on deep convolutional neural networks," Proceedings of the 2017 IEEE International Conference on Communication and Signal Processing, ICCSP 2017, pp. 588-592.
[13] A. Dhillon, and G. K. Verma, "Convolutional neural network: a review of models, methodologies and applications to object detection," Progress in Artificial Intelligence, 2019.
[14] W. Rawat, and Z. Wang, "Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review," Neural Computation, pp. 1-98.
[15] Q. Liu, N. Zhang, W. Yang, S. Wang, Z. Cui, X. Chen, and L. Chen, "A Review of Image Recognition with Deep Convolutional Neural Network."
[16] S. Rehman, H. Ajmal, U. Farooq, Q. U. Ain, and A. Hassan, "Convolutional neural network based image segmentation: a review."
[17] T. Lindeberg, "Scale invariant feature transform," 2012.
[18] N. Dalal, and B. Triggs, "Histograms of oriented gradients for human detection." pp. 886-893.
[19] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, 2006.
[20] D. H. Hubel, and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, pp. 106-154, 1962.
[21] D. M. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, no. 1, pp. 1-12, 2004.
[22] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193-202.
[23] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable Convolutional Networks."
[24] L. Sifre, and S. Mallat, "Rigid-Motion Scattering for Texture Classification," 2014.
[25] F. Mamalet, and C. Garcia, "Simplifying ConvNets for Fast Learning," 2012.
[26] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions."
[27] W. Min, B. Liu, and H. Foroosh, "Factorized Convolutional Neural Networks."
[28] D. Li, A. Zhou, and A. Yao, "HBONet: Harmonious Bottleneck on Two Orthogonal Dimensions."
[29] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks."
[30] T. K. Lee, W. J. Baddar, S. T. Kim, and Y. M. Ro, "Convolution with Logarithmic Filter Groups for Efficient Shallow CNN."
[31] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi, "Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups."
[32] K. Simonyan, and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Computer Science, 2014.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions." pp. 1-9.
[34] S. Ioffe, and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision." pp. 2818-2826.
[36] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning."
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition." pp. 770-778.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification."
[39] S. Zagoruyko, and N. Komodakis, "Wide Residual Networks."
[40] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, "Deep Networks with Stochastic Depth."
[41] S. Targ, D. Almeida, and K. Lyman, "Resnet in Resnet: Generalizing Residual Architectures."
[42] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets." pp. 2672-2680.
[43] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," Computer Science, 2015.
[44] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[45] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks." pp. 4510-4520.
[46] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, "Searching for MobileNetV3," arXiv:1905.02244 [cs.CV], 2019.
[47] T. J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications."
[48] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks."
[49] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices."
[50] N. Ma, X. Zhang, H. T. Zheng, and J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design."
[51] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More Features from Cheap Operations," arXiv preprint arXiv:1911.11907, 2019.
[52] V. Nair, and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines."
[53] M. T. Hagan, H. B. Demuth, and M. H. Beale, Neural Network Design, 2002.
[54] A. Krizhevsky, "Convolutional Deep Belief Networks on CIFAR-10," 2010.
[55] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," 2016.
[56] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms."
[57] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," University of Toronto, 2012.
[58] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li, "ImageNet: A large-scale hierarchical image database," Proc. of IEEE Computer Vision & Pattern Recognition, pp. 248-255, 2009.
[59] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping." pp. 1735-1742.
[60] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification." pp. 539-546.
[61] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification." pp. 1988-1996.
[62] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust." pp. 2892-2900.
[63] Y. Sun, D. Liang, X. Wang, and X. Tang, "Deepid3: Face recognition with very deep neural networks," arXiv preprint arXiv:1502.00873, 2015.
[64] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering." pp. 815-823.
[65] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," 2015.
[66] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, "Openface: A general-purpose face recognition library with mobile applications," CMU School of Computer Science, vol. 6, pp. 2, 2016.
[67] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, "Person re-identification by multi-channel parts-based cnn with improved triplet loss function." pp. 1335-1344.
[68] A. Hermans, L. Beyer, and B. Leibe, "In defense of the triplet loss for person re-identification," arXiv preprint arXiv:1703.07737, 2017.
[69] R. Kuma, E. Weill, F. Aghdasi, and P. Sriram, "Vehicle re-identification: an efficient baseline using triplet embedding." pp. 1-9.
[70] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition." pp. 499-515.
[71] J. Yao, Y. Yu, Y. Deng, and C. Sun, "A feature learning approach for image retrieval." pp. 405-412.
[72] H. Jin, X. Wang, S. Liao, and S. Z. Li, "Deep person re-identification with improved embedding and efficient training." pp. 261-267.
[73] G. Wisniewski, H. Bredin, G. Gelly, and C. Barras, "Combining speaker turn embedding and incremental structure prediction for low-latency speaker diarization."
[74] W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-margin softmax loss for convolutional neural networks." p. 7.
[75] L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao, "Group emotion recognition with individual facial emotion CNNs and global image based CNNs." pp. 549-552.
[76] Y. Liu, L. He, and J. Liu, "Large margin softmax loss for speaker verification," arXiv preprint arXiv:1904.03479, 2019.
[77] D. Saad, On-line Learning in Neural Networks: Cambridge University Press, 2009.
[78] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks." pp. 4700-4708.
[79] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes." pp. 1891-1898.
[80] N. Qian, "On the momentum term in gradient descent learning algorithms," Neural Networks, vol. 12, no. 1, pp. 145-151, 1999.
[81] Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)." pp. 543-547.
[82] W. Su, L. Chen, M. Wu, M. Zhou, Z. Liu, and W. Cao, "Nesterov accelerated gradient descent-based convolution neural network with dropout for facial expression recognition." pp. 1063-1068.
[83] A. L. Maas, P. Qi, Z. Xie, A. Y. Hannun, C. T. Lengerich, D. Jurafsky, and A. Y. Ng, "Building DNN acoustic models for large vocabulary speech recognition," Computer Speech & Language, vol. 41, pp. 195-213, 2017.
[84] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks." pp. 1-7.
[85] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121-2159, 2011.
[86] M. D. Zeiler, "Adadelta: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[87] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition." pp. 577-585.
[88] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, "Very deep multilingual convolutional neural networks for LVCSR." pp. 4955-4959.
[89] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[90] G. Hinton, N. Srivastava, and K. Swersky, "Neural networks for machine learning lecture 6a overview of mini-batch gradient descent," Cited on, vol. 14, no. 8, 2012.
[91] D. P. Kingma, and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[92] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," arXiv preprint arXiv:1511.04119, 2015.
[93] F. Korzeniowski, and G. Widmer, "A fully convolutional deep auditory model for musical chord recognition." pp. 1-6.
[94] M. J. Van Putten, S. Olbrich, and M. Arns, "Predicting sex from brain rhythms with deep learning," Scientific Reports, vol. 8, no. 1, pp. 1-7, 2018.
[95] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution." pp. 261-270.
[96] T. Dozat, "Incorporating nesterov momentum into adam," 2016.
[97] D. Q. Nguyen, and K. Verspoor, "Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings," arXiv preprint arXiv:1805.10586, 2018.
[98] S. Maetschke, B. Antony, H. Ishikawa, G. Wollstein, J. Schuman, and R. Garnavi, "A feature agnostic approach for glaucoma detection in OCT volumes," PloS one, vol. 14, no. 7, 2019.
[99] A. Schindler, T. Lidy, and A. Rauber, "Multi-temporal resolution convolutional neural networks for acoustic scene classification," arXiv preprint arXiv:1811.04419, 2018.
[100] S. J. Reddi, S. Kale, and S. Kumar, "On the convergence of adam and beyond," arXiv preprint arXiv:1904.09237, 2019.
[101] M. Jahanifar, N. Z. Tajeddin, N. A. Koohbanani, A. Gooya, and N. Rajpoot, "Segmentation of skin lesions and their attributes using multi-scale convolutional neural networks and domain specific augmentations," arXiv preprint arXiv:1809.10243, 2018.
[102] F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein, "Fake news detection on social media using geometric deep learning," arXiv preprint arXiv:1902.06673, 2019.
[103] S. Liu, E. Gibson, S. Grbic, Z. Xu, A. A. A. Setio, J. Yang, B. Georgescu, and D. Comaniciu, "Decompose to manipulate: manipulable object synthesis in 3D medical images with structured image decomposition," arXiv preprint arXiv:1812.01737, 2018.
[104] E. Urtnasan, H. Kim, J.-U. Park, D. Kang, and K.-J. Lee, "Automatic Prediction of Atrial Fibrillation Based on Convolutional Neural Network Using a Short-term Normal Electrocardiogram Signal."
[105] S. Harbola, and V. Coors, "One dimensional convolutional neural network architectures for wind prediction," Energy Conversion and Management, vol. 195, pp. 70-75, 2019.
[106] D. Han, J. Chen, and J. Sun, "A parallel spatiotemporal deep learning network for highway traffic flow forecasting," International Journal of Distributed Sensor Networks, vol. 15, no. 2.
[107] Q. Zhang, D. Zhou, and X. Zeng, "HeartID: A Multiresolution Convolutional Neural Network for ECG-based Biometric Human Identification in Smart Health Applications," IEEE Access, pp. 1-1.
[108] O. Abdeljaber, O. Avci, S. Kiranyaz, M. Gabbouj, and D. J. Inman, "Real-Time Vibration-Based Structural Damage Detection Using One-Dimensional Convolutional Neural Networks," Journal of Sound & Vibration, vol. 388, pp. 154-170, 2017.
[109] O. Abdeljaber, S. Sassi, O. Avci, S. Kiranyaz, A. A. Ibrahim, and M. Gabbouj, "Fault Detection and Severity Identification of Ball Bearings by Online Condition Monitoring," IEEE Transactions on Industrial Electronics, pp. 1-1.
[110] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, 2015.
[111] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual path networks." pp. 4467-4475.
[112] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "Densenet: Implementing efficient convnet descriptor pyramids," arXiv preprint arXiv:1404.1869, 2014.
[113] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks." pp. 1492-1500.
[114] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, "Medical image classification with convolutional neural network." pp. 844-848.
[115] Y. Jiang, L. Chen, H. Zhang, and X. Xiao, "Breast cancer histopathological image classification using convolutional neural networks with small SE-ResNet module," PloS one, vol. 14, no. 3, 2019.
[116] D. R. Bruno, and F. S. Osório, "Image classification system based on deep learning applied to the recognition of traffic signs for intelligent robotic vehicle navigation purposes." pp. 1-6.
[117] R. Madan, D. Agrawal, S. Kowshik, H. Maheshwari, S. Agarwal, and D. Chakravarty, "Traffic Sign Classification using Hybrid HOG-SURF Features and Convolutional Neural Networks," 2019.
[118] M. Zhang, W. Li, and Q. Du, "Diverse region-based CNN for hyperspectral image classification," IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2623-2634, 2018.
[119] A. Sharma, X. Liu, X. Yang, and D. Shi, "A patch-based convolutional neural network for remote sensing image classification," Neural Networks, vol. 95, pp. 19-28, 2017.
[120] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection." pp. 779-788.
[121] J. Redmon, and A. Farhadi, "YOLO9000: better, faster, stronger." pp. 7263-7271.
[122] J. Redmon, and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[123] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector." pp. 21-37.
[124] H. Law, and J. Deng, "Cornernet: Detecting objects as paired keypoints." pp. 734-750.
[125] H. Law, Y. Teng, O. Russakovsky, and J. Deng, "Cornernet-lite: Efficient keypoint based object detection," arXiv preprint arXiv:1904.08900, 2019.
[126] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation." pp. 580-587.
[127] R. Girshick, "Fast r-cnn." pp. 1440-1448.
[128] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks." pp. 91-99.
[129] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection." pp. 2117-2125.
[130] E. Shelhamer, J. Long, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation."
[131] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation."
[132] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation."
[133] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network."
[134] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 4, pp. 834, 2018.
[135] A. Pal, S. Jaiswal, S. Ghosh, N. Das, and M. Nasipuri, "Segfast: A faster squeezenet based semantic image segmentation technique using depth-wise separable convolutions."
[136] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1-1.
[137] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time Instance Segmentation."
[138] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. PP, no. 99, pp. 2999-3007, 2017.
[139] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic Segmentation."
[140] A. Kirillov, R. Girshick, K. He, and P. Dollár, "Panoptic Feature Pyramid Networks."
[141] H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu, and W. Jiang, "An End-to-End Network for Panoptic Segmentation."
[142] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification." pp. 1701-1708.
[143] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep Hypersphere Embedding for Face Recognition."
[144] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition."
[145] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large Margin Cosine Loss for Deep Face Recognition."
[146] C. Cao, Y. Zhang, C. Zhang, and H. Lu, "Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors."
[147] A. Stergiou, and R. Poppe, "Spatio-Temporal FAST 3D Convolutions for Human Action Recognition."
[148] J. Huang, W. Zhou, H. Li, and W. Li, "Attention-based 3D-CNNs for large-vocabulary sign language recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 9, pp. 2822-2832, 2018.
[149] Y. Huang, S.-H. Lai, and S.-H. Tai, "Human Action Recognition Based on Temporal Pose CNN and Multi-dimensional Fusion."
[150] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A Deep Representation for Volumetric Shapes."
[151] D. Maturana, and S. Scherer, "Voxnet: A 3d convolutional neural network for real-time object recognition." pp. 922-928.
[152] S. Song, and J. Xiao, "Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images."
[153] Y. Zhou, and O. Tuzel, "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection."
[154] F. Pastor, J. M. Gandarias, A. J. García-Cerezo, and J. M. Gómez-de-Gabriel, "Using 3D Convolutional Neural Networks for Tactile Object Recognition with Robotic Palpation," Sensors, vol. 19, no. 24, pp. 5356, 2019.
[155] K. Jnawali, M. R. Arbabshirani, N. Rao, and A. A. Patel, "Deep 3D convolution neural network for CT brain hemorrhage classification." p. 105751C.
[156] S. Hamidian, B. Sahiner, N. Petrick, and A. Pezeshk, "3D convolutional neural network for automatic detection of lung nodules in chest CT." p. 1013409.
[157] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[158] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning." pp. 3088-3096.
[159] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[160] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, "Exploring the regularity of sparse structure in convolutional neural networks," arXiv preprint arXiv:1705.08922, 2017.
[161] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks."
[162] X. Lin, C. Zhao, and W. Pan, "Towards Accurate Binary Convolutional Neural Network."
[163] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained Ternary Quantization."
[164] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," arXiv preprint arXiv:1612.01543, 2016.
[165] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, and G. Venkatesh, "Mixed Precision Training."
[166] M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, C. Esposito, and S. W. Baik, "CNN-based anti-spoofing two-tier multi-factor authentication system," Pattern Recognition Letters, vol. 126, pp. 123-131, 2019.
[167] K. Itqan, A. Syafeeza, F. Gong, N. Mustafa, Y. Wong, and M. Ibrahim, "User identification system based on finger-vein patterns using convolutional neural network," ARPN Journal of Engineering and Applied Sciences, vol. 11, no. 5, pp. 3316-3319, 2016.
[168] H. Ke, D. Chen, X. Li, Y. Tang, T. Shah, and R. Ranjan, "Towards brain big data classification: Epileptic EEG identification with a lightweight VGGNet on global MIC," IEEE Access, vol. 6, pp. 14722-14733, 2018.
[169] A. Shustanov, and P. Yakimov, "CNN design for real-time traffic sign recognition," Procedia Engineering, vol. 201, pp. 718-725, 2017.
[170] J. Špaňhel, J. Sochor, R. Juránek, A. Herout, L. Maršík, and P. Zemčík, "Holistic recognition of low quality license plates by CNN using track annotated data." pp. 1-6.
[171] T. Xie, and Y. Li, "A Gradient-Based Algorithm to Deceive Deep Neural Networks." pp. 57-65.
[172] C. Liao, H. Zhong, A. Squicciarini, S. Zhu, and D. Miller, "Backdoor embedding in convolutional neural network models via invisible perturbation," arXiv preprint arXiv:1808.10307, 2018.
[173] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, "Analyzing federated learning through an adversarial lens," arXiv preprint arXiv:1811.12470, 2018.
[174] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint, 2018.
[175] K. Liu, B. Dolan-Gavitt, and S. Garg, "Fine-pruning: Defending against backdooring attacks on deep neural networks." pp. 273-294.
[176] N. Akhtar, and A. Mian, "Threat of adversarial attacks on deep learning in computer vision: A survey," IEEE Access, vol. 6, pp. 14410-14430, 2018.
[177] B. Zoph, and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[178] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," arXiv preprint arXiv:1802.03268, 2018.
[179] H. Cai, L. Zhu, and S. Han, "Proxylessnas: Direct neural architecture search on target task and hardware," arXiv preprint arXiv:1812.00332, 2018.
[180] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "Mnasnet: Platform-aware neural architecture search for mobile." pp. 2820-2828.
[181] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "Nas-fpn: Learning scalable feature pyramid architecture for object detection." pp. 7036-7045.
[182] T. Jajodia, and P. Garg, "Image Classification–Cat and Dog Images," Image, vol. 6, no. 12, 2019.
[183] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg, "Aggressive deep driving: Model predictive control with a cnn cost model," arXiv preprint arXiv:1707.05303, 2017.
[184] H. Gao, B. Cheng, J. Wang, K. Li, J. Zhao, and D. Li, "Object classification using CNN-based fusion of vision and LIDAR in autonomous vehicle environment," IEEE Transactions on Industrial Informatics, vol. 14, no. 9, pp. 4224-4231, 2018.
[185] A. Azulay, and Y. Weiss, "Why do deep convolutional networks generalize so poorly to small image transformations?," 2018.
[186] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules." pp. 3856-3866.
[187] A. Kosiorek, S. Sabour, Y. W. Teh, and G. E. Hinton, "Stacked capsule autoencoders." pp. 15486-15496.