A Survey of Convolutional Neural Networks Analysis
A Survey of Convolutional Neural Networks Analysis
net/publication/340475800
CITATIONS READS
0 2,214
4 authors, including:
Fan Liu
Hohai University
52 PUBLICATIONS 533 CITATIONS
SEE PROFILE
All content following this page was uploaded by Zewen Li on 14 December 2020.
This work was supported in part by National Natural Science Foundation of Zewen Li, Wenjie Yang, Shouheng Peng, and Fan Liu are with College of
China under grant No. 61602150, Natural Science Foundation of Jiangsu Computer and Information, Hohai University, Nanjing, 210098, China
Province under grant No. BK20191298. (Corresponding author: Fan Liu) ([email protected], [email protected], [email protected],
[email protected])
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 2
recognition. Ajmal et al. [16] discussed CNN for image which can reduce the amount of data while retaining useful
segmentation. These reviews mentioned above mainly information. It can also reduce the number of parameters by
reviewed the applications of CNN in different scenarios without removing trivial features. The three appealing characteristics
considering CNN from a general perspective. Also, due to the make CNN become one of the most representative algorithms
rapid development of CNN, lots of inspiring ideas in this field in the deep learning field.
have been proposed, but these reviews did not fully cover them. To be specific, in order to build a CNN model, four
In this paper, we focus on analyzing and discussing CNN. In components are typically needed. Convolution is a pivotal step
detail, the key contributions of this review are as follows: 1) We for feature extraction. The outputs of convolution can be called
provide a brief overview of CNN, including some basic feature maps. When setting a convolution kernel with a certain
building blocks of modern CNN, in which some fascinating size, we will lose information in the border. Hence, padding is
convolution structures and innovations are involved. 2) Some introduced to enlarge the input with zero value, which can
classic CNN-based models are covered, from LeNet-5, AlexNet adjust the size indirectly. Besides, for the sake of controlling the
to MobileNet v3 and GhostNet. Innovations of these models are density of convolving, stride is employed. The larger the stride,
emphasized to help readers draw some useful experience from the lower the density. After convolution, feature maps consist
masterpieces. 3) Several representative activation functions, of a large number of features that is prone to causing overfitting
loss functions, and optimizers are discussed. We reach some problem [21]. As a result, pooling [22] (a.k.a. down-sampling)
conclusions about them through experiments. 4) Although is proposed to obviate redundancy, including max pooling and
applications of two-dimensional convolution are widely used, average pooling. The procedure of a CNN is shown in Fig. 1.
one-dimensional and multi-dimensional ones should not be Stride = 2
2 2 2 2 2 3
1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0
networks from scratch. 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0
0 1 3 2
Fig. 4. Part of classic CNN models. NiN: Network in Network; ResNet: Residual Netwrok; DCGAN: Deep Convolutional Generative Adversarial Network; SENet:
Squeeze-and-Excitation Network
Moreover, there exist a variety of awesome convolutions, AlexNet carries forward LeNet's ideas and applies the basic
such as Separable convolutions [24], [25], [26], [27], [28], principles of CNN to a deep and wide network. It successfully
group convolutions [11], [29], [30], [31] and multi-dimensional leverages ReLU activation function, dropout, and local
convolutions, which are discussed in Section 3 and Section 5. response normalization (LRN) for the first time on CNN. At the
same time, AlexNet also makes use of GPUs for computing
III. CLASSIC CNN MODELS acceleration. The main innovations of AlexNet lie in the
Since AlexNet was proposed in 2012, researchers have following:
invented a variety of CNN models—deeper, wider, and lighter. 1) AlexNet uses ReLU as the activation function of CNN,
Part of well-known models can be seen in Fig. 4. Due to the which mitigates the problem of gradient vanishing when the
limitation of paper length, this section aims to take an overview network is deep. Although the ReLU activation function was
of several representative models, and we will emphatically proposed long before AlexNet, it was not carried forward until
discuss the innovations of them to help readers understand the the appearance of AlexNet.
main points and propose their own promising ideas. 2) Dropout is used by AlexNet to randomly ignore some
neurons during training to avoid overfitting. This technique is
A. LeNet-5 mainly used in the last few fully-connected layers.
LeCun et al. [10] proposed LeNet-5 in 1998, which is an 3) In convolutional layers of AlexNet, overlapping max
efficient convolutional neural network trained with the pooling is used to replace average pooling that was commonly-
backpropagation algorithm for handwritten character used in the previous convolutional neural networks. Max
recognition. As shown in Fig. 5, LeNet-5 is composed of seven pooling can avoid the blurred result of average pooling, and
trainable layers containing two convolutional layers, two overlapping pooling can improve the richness of features.
pooling layers, and three fully-connected layers. LeNet-5 is the 4) LRN is proposed to simulate the lateral inhibition
pioneering convolutional neural network combining local mechanism of the biological nervous system, which means the
receptive fields, shared weights, and spatial or temporal sub- neuron receiving stimulation can inhibit the activity of
sampling, which can ensure shift, scale, and distortion peripheral neurons. Similarly, LRN can make neurons with
invariance to some extent. It is the foundation of modern CNN. small values are suppressed, and those with large values are
Although LeNet-5 is useful for recognizing handwriting relatively active, the function of which is very similar to
characters and reading bank checks, it still does not exceed the normalization. Hence, LRN is a way to enhance the
traditional support vector machine (SVM) and boosting generalization ability of the model.
algorithms. As a result, LeNet-5 did not obtain enough attention 5) AlexNet also employs two powerful GPUs to train group
at that time. convolutions. Since the computing resource limit of one GPU,
16@10x10
6@28x28
16@5x5
1x120
1x84
AlexNet designs a group convolution structure, which can be
6@14x14
1@32x32
1x10 trained on two distinct GPUs. And then, two feature maps
generated by two GPUs can be combined as the final output.
6) AlexNet adopts two data augmentation methods in
Convolution Subsampling Convolution Subsampling Full connection training. The first is extracting random 224 × 224 patches from
Fig. 5. Architecture of LeNet-5 the original 256 × 256 images and their horizontal reflections to
obtain more training data. Besides, the Principal Component
B. AlexNet
Analysis (PCA) is utilized to change the RGB values of the
Alex et al. [11] proposed the AlexNet in 2012, which won training set. When making predictions, AlexNet also enlarges
the championship in the ImageNet 2012 competition. As shown the dataset and then calculate the average of their predictions as
in Fig. 6, AlexNet has eight layers, containing five the final result. AlexNet shows that the use of data
convolutional layers and three fully-connected layers. augmentation can substantially mitigate overfitting problem
2048 2048
224
55
55
27
27
13 13
13
13
13
and improve generalization ability.
13
C. VGGNets
192 192 128
48 128
224
55
48 128
192 192 128
13, VGG-16, and VGG-19. VGGNets secured the first place in branch is factorized, shown in Fig. 8 (c).
the localization track of ImageNet Challenge 2014. The authors Concat
Concat
Concat
of VGGNets prove that increasing the depth of neural networks n×1
3×3
can improve the final performance of the network to some 1×n
1×3 3×1
extent. Compared with AlexNet, VGGNets have the following 1×1 3×3 3×3 n×1 n×1
1×1 1×3 3×1 3×3
improvements: Pool 1×1 1×1 1×1
1×1 1×n 1×n
Pool 1×1 1×1 1×1
1) The LRN layer was removed since the author of VGGNets Pool 1×1 1×1 1×1
found the effect of LRN in deep CNNs was not obvious. Input Input Input
2) VGGNets use 3 × 3 convolution kernels rather than 5 × 5 Fig. 8. Inception v2 module. (a) Each 5 x 5 convolution is replaced by two 3
x 3 convolutions. (b) n x n convolution is replaced by a 1 x n convolution and
or 5 × 5 ones, since several small kernels have the same a n x 1 convolution. (c) Inception modules with the last convolutional layer is
receptive field and more nonlinear variations compared with factorized.
larger ones. For instance, the receptive field of 3 × 3 kernels is 3) Inception v3
the same as one 5 × 5 kernel. Nevertheless, the number of Inception v3 [35] integrates major innovations mentioned in
parameters reduces by about 45%, and three kernels have three Inception v2. And factorizing 5 × 5 and 3 × 3 convolution
nonlinear variations. kernels into two one-dimensional ones (one by seven and seven
by one, one by three and three by one, respectively). This
D. GoogLeNet
operation accelerates the training and further increases the
GoogLeNet [33] is the winner of the ILSVRC 2014 image depth of networks and the non-linearity of networks. Besides,
classification algorithms. It is the first large-scale CNN formed the input size of the network changed from 224 by 224 to 299
by stacking with Inception modules. Inception networks have by 299. And Inception v3 utilizes RMSProp as the optimizer.
four versions, namely Inception v1 [33], Inception v2 [34], [35], 4) Inception v4 and Inception-ResNet
Inception v3 [35], and Inception v4 [36]. Inception v4 modules [36] are based upon that of Inception
1) Inception v1 v3. The architecture of Inception v4 is more concise and utilizes
Due to objects in images have different distances to cameras, more Inception modules. Experimental evaluation proved that
an object with a large proportion of an image usually prefers a Inception v4 is better than its predecessors.
large convolution kernel or a few small ones. However, a small In addition, ResNet structure [37] is harnessed to extend the
object in an image is the opposite. Based on the past experience, depth of Inception networks, namely Inception-ResNet-v1 and
large kernels have many parameters to train, and deep networks Inception-ResNet-v2. Experiments proved that they could
are hard to train. As a result, Inception v1 [33] deploys 1 × 1, 3 improve the training speed and performance.
× 3, 5 × 5 convolution kernels to construct a “wide” network,
which can be seen in Fig. 7, Convolution kernels with different E. ResNet
sizes can extract the feature maps of different scales of the Theoretically, Deep Neural Networks (DNN) outperform
image. Then, those feature maps are stacked to obtain a more shallow ones as the former can extract more complicated and
representative one. Besides, 1 × 1 convolution kernel is used to sufficient features of images. However, with the increase of
reduce the number of channels, i.e., reduce computational cost. layers, DNNs are prone to cause gradient vanishing, gradient
Concat exploding problems, etc. He et al. [37] proposed a 34-layer
Residual Network in 2016, which is the winner of the ILSVRC
1×1 3×3 5×5 2015 image classification and object detection algorithm. The
Pool 1×1 1×1 1×1
performance of ResNet exceeds the GoogLeNet Inception v3.
One of the significant contributions of ResNet is the two-
Input
layer residual block constructed by the shortcut connection, as
Fig. 7. Inception v1 module with dimension reductions
shown in Fig. 9 (a) below.
ReLU
2) Inception v2 ReLU +
Inception v2 [35] utilizes batch normalization to handle +
1×1, 256
internal covariate shift problem [34]. The output of every layer ReLU
is normalized to normal distribution, which can increase the 3×3, 64
ReLU 3×3, 64
robustness of the model and train the model with a relatively ReLU
large learning rate. 3×3, 64
1×1, 64
Furthermore, Inception v2 shows that a single 5 × 5
64-d 256-d
convolutional layers can be replaced by two 3 × 3 ones, shown
(a) (b)
in Fig. 8 (a). One n x n convolutional layer can be replaced by Fig. 9. Structure of ResNet blocks. (a) The structure of two-layer residual
one 1 x n and one n x 1 convolutional layer shown in Fig. 8 (b). block. (b) The structure of three-layer residual block
However, the original paper points out the use of factorization 50-layer ResNet, 101-layer ResNet, and 152-layer ResNet
is not effective in the early layers. It is better to use it on utilize three-layer residual blocks, as shown in the Fig. 9 (b)
medium-sized feature maps. And filter banks should be above, instead of two-layer one. Three-layer residual block is
expanded (wider but not deeper) to improve high dimensional also called the bottleneck module because the two ends of the
representations. Hence, only the last 3 × 3 convolution of each
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 5
(b)
Fig. 14. (a) Residual block (b) MobileNet v2 block: inverted residual block
When performing the steps: channel expansion—depth-wise
separable convolution—channel compression, one problem
will be encountered after "channel compression". That is, it is Fig. 16. The diagrams of sigmoid, h-sigmoid, swish and h-swish
easy to destroy the information when ReLU activation function
is utilized in low-dimensional space, whereas it will not happen H. ShuffleNets
in high-dimensional space. Therefore, ReLU activation ShuffleNets are a series of CNN-based models proposed by
function following the second 1 × 1 convolution of inverted MEGVII to solve the problem of insufficient computing power
residual blocks is removed, and a linear transformation is of mobile devices. These models combine pointwise group
adopted. Hence, this architecture is called the linear bottleneck convolution, channel shuffle, and some other techniques, which
module. significantly reduce the computational cost with little loss of
3) MobileNet v3 accuracy. So far, there are two versions of ShuffleNets, namely
MobileNet v3 [46] achieves three improvements: network ShuffleNet v1 [49] and ShuffleNet v2 [50].
search combining platform-aware neural architecture search 1) ShuffleNet v1
(platform-aware NAS) and NetAdapt algorithm [47], ShuffleNet v1 [49] was proposed to construct a high-efficient
lightweight attention model based upon squeeze and excitation, CNN structure for resource-limited devices. There are two
and h-swish activation function. innovations: pointwise group convolution and channel shuffle.
For MobileNet v3, researchers use platform-aware NAS for The authors of ShuffleNet v1 reckon that Xception [26] and
block-wise search. Platform-aware NAS utilizes an RNN-based ResNeXt [29] are less efficient in extremely small networks
controller and hierarchical search space to find the structure of since 1 × 1 convolution requires a lot of computing resources.
the global network. And then, the NetAdapt algorithm, Therefore, pointwise group convolution is proposed to reduce
complementary to platform-aware NAS, is used for layer-wise the computation complexity of 1 × 1 convolutions. Pointwise
search. It can fine-tune to find the optimal number of filters in group convolution, shown in Fig. 17 (a), requires each
each layer. convolution operation is only on the corresponding input
MobileNet v3 makes use of the squeeze and excitation (SE) channel group, which can reduce the computational complexity.
[48] to reweight the channels of each layer to achieve a However, one problem is that pointwise group convolutions
lightweight attention model. As shown in Fig. 15, after the prevent feature maps between different groups from
depth-wise convolution of an inverted residual block, the SE communicating with each other, which is harmful to extract
module is added. Global-pool operation is firstly performed, representative feature maps. Therefore, channel shuffle
then following a fully-connected layer, the number of channels operation, shown in Fig. 17 (b), is proposed to help the
is reduced to 1/4. The second fully-connected layer is utilized information in different groups flow to other groups randomly.
to recover the number of channels and get the weight of each
layer. Finally, multiply the weight and the depth-wise
convolution to get a reweighted feature map. Howard et al. [46]
proved that this operation could improve the accuracy without
extra time cost.
Fig. 17. Pointwise group convolution and channel shuffle in ShuffleNet v1 (a)
Pointwise group convolution where different colors represent different groups.
(b) After acquiring feature 1, channel shuffle operation is inserted to promote
information exchanges between groups
Fig. 15. MobileNet v3 block: MobileNet v2 block + Squeeze-and-Excite in
the residual layer Furthermore, ShuffleNet unit is proposed on the basis of
The authors of MobileNet v3 figure out that swish activation channel shuffle operation. As shown in figure below, Fig. 18 (a)
function can improve the accuracy of the network compared is a naïve residual block with depth-wise convolution
with ReLU. Nevertheless, swish function costs too much (DWConv); Fig. 18 (b) replaces standard convolution with
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 7
ReLU BN
element-wise addition is replaced by concatenation. These
BN ReLU BN ReLU
IV. DISCUSSION AND EXPERIMENTAL ANALYSIS exponential operation that require division while computing
derivatives, whereas the derivative of ReLU is a constant.
A. Activation function
Moreover, in the sigmoid and tanh function, if the value of x is
1) Discussion of Activation Function too large or too small, the gradient of the function is pretty small,
Convolutional neural networks can harness different which can cause the function to converge slowly. However,
activation functions to express complex features. Similar to the when x is less than 0, the derivative of ReLU is 0, and when x
function of the neuron model of the human brain, the activation is greater than 0, the derivative is 1, so it can obtain an ideal
function here is a unit that determines which information should convergence effect. AlexNet [11], the best model in ILSVRC-
be transmitted to the next neuron. Each neuron in the neural 2012, uses ReLU as the activation function of CNN-based
network accepts the output value of the neurons from the model, which mitigates gradient vanishing problem when the
previous layer as input, and passes the processed value to the network is deep, and verifies that the use of ReLU surpasses
next layer. In a multilayer neural network, there is a function sigmoid in deep networks.
between two layers. This function is called activation function, From what discussed above, we can find that ReLU does not
whose structure is shown in the following Fig. 21. consider the upper limit. In practice, we can set an upper limit,
x1
w1j such as ReLU6 [54].
n However, when x is less than 0, the gradient of ReLU is 0,
xi wij
∑ f yj = f xi wij bj
i 1 which means the back-propagated error will be multiplied by 0,
wnj
resulting in no error being passed to the previous layer. In this
xn bj scenario, the neurons will be regarded as inactivated or dead.
Fig. 21. General activation function structure
Therefore, some improved versions are proposed. Leaky ReLU
In this figure, xi represents the input feature; n features are (see Fig. 22 (d)) can reduce neuron inactivation. When x is less
input to the neuron j at the same time; wij represents the weight than 0, the output of Leaky ReLU is 𝑥/𝑎, instead of zero, where
value of the connection between the input feature xi and the ‘a’ is a fixed parameter in range (1, +∞).
neuron j; bj represents the internal state of the neuron j, which Another variant of ReLU is PReLU [38] (see Fig. 22 (e)).
is the bias value; and yj is the output of the neuron j. 𝑓(∙) is the Unlike Leaky ReLU, the slope of the negative part of PReLU is
activation function, which can be sigmoid function, tanh (x) based upon the data, not a predefined one. He et al. [38] reckon
function [10], Rectified Linear Unit [52], etc. [53] that PReLU is the key to surpassing the level of human
If an activation function is not used or a linear function is classification on the ImageNet 2012 classification dataset.
used, the input of each layer will be a linear function of the Exponential Linear Units (ELU) function [55] (see Fig. 22
output of the previous layer. In this case, He et al. [38] verify (f)) is another improved version of ReLU. Since ReLU is non-
no matter how many layers the neural network has, the output negatively activated, the average value of its output is greater
is always a linear combination of the input, which means hidden than 0. This problem will cause the offset of the next layer unit.
layers have no effect. This situation is the primitive perceptron ELU function has a negative value, so the average value of its
[2], [3], which has the limited learning ability. For this reason, output is close to 0, making the rate of convergence faster than
the nonlinear functions are introduced as activation functions. ReLU. However, the negative part is a curve, which demands
Theoretically, the deep neural networks with nonlinear lots of complicated derivatives.
activation function can approximate any function, which
greatly enhances the ability of neural networks to fit data.
In this section, we mainly focus on several frequently-used
activation functions. To begin with, sigmoid function is one of
the most typical non-linear activation functions with an overall
S-shape (see Fig. 22 (a)). With x value approaching 0, the (a) Sigmoid function (b) Tanh function (c) ReLU function
can map a real number to (-1, 1). Since the mean value of the Fig. 22. Diagrams of Sigmoid, Tanh, ReLU, Leaky ReLU, PReLU, and ELU
output of tanh is 0, it can achieve a kind of normalization. This 2) Experimental Evaluation
makes the next layer easier to learn. To compare aforementioned activation functions, two classic
In addition, Rectified Linear Unit (ReLU) [52] (see Fig. 22 CNN models, LeNet-5 [10] and VGG-16 [32], are tested on
(c)) is another effective activation function. When x is less than four benchmark datasets, including MNIST [10], Fashion-
0, its function value is 0; when x is greater than or equal to 0, MNIST [56], CIFAR-10 [57] and CIFAR-100 [57]. LeNet-5 is
its function value is x itself. Compared to sigmoid function and the first modern but relatively shadow CNN model. In the
tanh function, a significant advantage of using ReLU function following experiments, we train LeNet-5 from scratch. VGG-
is that it can speed up learning. Sigmoid and tanh involve in 16 is a deeper, larger, and frequently-used model. We conduct
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 9
Fig. 23. The experimental results on seven activation functions, respectively. (a) The accuracy of validation set and training loss on MNIST trained by LeNet-5.
(b) The accuracy of validation set and training loss on Fashion-MNIST trained by LeNet-5. (c) The accuracy of validation set and training loss on MNIST trained
by VGG-16. (d) The accuracy of validation set and training loss on Fashion-MNIST trained by VGG-16.
Fig. 24. The experimental results on seven activation functions, respectively. (a) The accuracy of validation set and training loss on CIFAR-10 trained by LeNet-
5. (b) The accuracy of validation set and training loss CIFAR-100 trained by LeNet-5. (c) The accuracy of validation set and training loss on CIFAR-10 trained by
VGG-16. (d) The accuracy of validation set and training loss on CIFAR-100 trained by VGG-16.
our experiments on the basis of a pre-trained VGG-16 model of sigmoid is slowest. Usually, the final performance of sigmoid
without the last three layers on ImageNet [58]. is not all that excellent. As a result, if we expect a fast
Both LeNet-5 and VGG-16 deploy softmax at the last layer convergence, sigmoid is not the best solution.
for multi-classification. All experiments are tested on Intel • From the perspective of accuracy, ELU possesses the best
Xeon E5-2640 v4 (X2), NVIDIA TITAN RTX (24GB), Python accuracy, but only a little better than ReLU, Leaky ReLU, and
3.5.6, and Keras 2.2.4. PReLU. In terms of training time, from Table I, ELU is prone
a) MNIST & Fashion-MNIST: MNIST is a dataset of to consume more time than ReLU and Leaky ReLU.
handwritten digits consisting of 10 categories, which has a • ReLU and Leaky ReLU have better stability during
training set of 60,000 examples and a test set of 10,000 training than PReLU and ELU.
examples. Each example is a 28 × 28 grayscale image, b) CIFAR-10 & CIFAR-100: CIFAR-10 and CIFAR-100 are
associated with a label from 10 classes, from 0 to 9. Fashion- labeled subsets of the 80 million tiny images dataset [57], which
MNIST dataset is a more complicated version of original are more complex than MNIST as well as Fashion-MNIST.
MNIST, sharing the same image size, structure, and split. These CIFAR-10 dataset consists of 60000 32 × 32 RGB images in 10
two datasets are trained on LeNet-5 and VGG-16, and results classes, with 6000 images per class. The whole dataset is
are exhibited in Table I and Fig. 23. From the results, we can divided into 50000 training images and 10000 test images, i.e.,
draw some meaningful conclusions. each class has 5000 training images and 1000 test images.
• Linear activation function indeed lead to the worst CIFAR-100 is like the CIFAR-10, except it has 100 classes
performance. Therefore, when building a deep neural network containing 600 images per class. And each class has 500
(more than one layer), we need to add a non-linear function. If training images and 100 test images. Similarly, we evaluate
not, multiple layers, theoretically, are equal to one layer. LeNet-5 and VGG-16 with different activation functions on
• Among these activation functions, the convergence speed these two datasets. The results can be seen in Table I and Fig.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 10
TABLE I
COMPARATIVE RESULTS OF DIFFERENT ACTIVATION FUNCTIONS
Data Batch Activation Validation set Training
Model Dataset Loss Optimizer epochs
Augmentation size function accuracy time (s)
Linear 98.56% 92.89
Sigmoid 98.94% 95.51
Tanh 99.03% 92.92
Cross
MNIST - Adam 256 50 ReLU 99.18% 95.04
entropy
Leaky ReLU 99.10% 99.82
PReLU 99.20% 113.42
ELU 99.20% 103.84
Linear 88.10% 169.95
Sigmoid 89.84% 174.26
Tanh 89.83% 181.99
Fashion- Cross
- Adam 256 100 ReLU 90.17% 191.77
MNIST entropy
Leaky ReLU 90.36% 190.02
PReLU 90.36% 217.20
ELU 90.37% 204.64
LeNet-5
Linear 62.56% 614.89
Sigmoid 62.65% 569.25
Tanh 62.94% 575.07
Cross
CIFAR-10 - Adam 256 200 ReLU 64.40% 550.35
entropy
Leaky ReLU 64.27% 582.08
PReLU 63.61% 650.75
ELU 65.70% 626.51
Linear 31.24% 2381.76
Sigmoid 32.32% 2355.64
Tanh 32.69% 2376.35
Cross
CIFAR-100 - Adam 512 1000 ReLU 32.69% 2418.18
entropy
Leaky ReLU 33.81% 2443.72
PReLU 31.84% 2615.02
ELU 35.10% 2475.30
Linear 9.82% 598.14
Sigmoid 11.35% 600.27
Tanh 11.35% 596.32
Cross
MNIST - Adam 512 30 ReLU 99.55% 606.83
entropy
Leaky ReLU 99.48% 608.91
PReLU 99.45% 607.27
ELU 11.35% 614.81
Linear 10.00% 599.18
Sigmoid 15.66% 595.11
Tanh 11.01% 596.48
Fashion- Cross
- Adam 512 30 ReLU 93.16% 608.19
MNIST entropy
Leaky ReLU 92.81% 610.82
PReLU 10.00% 612.75
VGG-16
ELU 93.87% 613.32
(pre-
Linear 83.25% 958.74
trained)
Sigmoid 10.00% 957.62
Tanh 10.00% 957.97
Cross
CIFAR-10 - Adam 512 100 ReLU 83.22% 957.74
entropy
Leaky ReLU 83.37% 958.39
PReLU 82.17% 963.67
ELU 83.14% 968.60
Linear 1.00% 1897.14
Sigmoid 1.00% 1868.02
Tanh 1.00% 1897.76
Cross
CIFAR-100 - Adam 512 200 ReLU 44.77% 1901.56
entropy
Leaky ReLU 48.22% 1916.38
PReLU 47.46% 1922.29
ELU 1.00% 1917.75
24, from which we can get some conclusions. ELU may make networks learn nothing. More often than not,
• Tanh, PReLU, and ELU activation functions are more Leaky ReLU has better performance in terms of accuracy and
likely to bring about oscillation at the end of the training. training speed.
• When training a deep CNN model with pre-trained 3) Rules of Thumb for Selection
weights, it is hard to converge by the use of sigmoid and tanh • For binary classification problems, the last layer can
activation functions. harness sigmoid; for multi-classification problems, the last
• The models trained by Leaky ReLU and ELU have better layer can harness softmax.
accuracy than the others in the experiments. But sometimes, • Sigmoid and tanh functions sometimes should be avoided
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 11
because of the gradient vanishment. Usually, in hidden layers, cross-entropy loss as the loss function in their original paper,
ReLU or Leaky ReLU is a good choice. which helped them reach state-of-the-art results.
• If you have no idea about choosing activation functions, However, cross entropy loss has some flaws. Cross entropy
feel free to try ReLU or Leaky ReLU. loss only cares about the correctness of the classification, not
• If a lot of neurons are inactivated in the training process, the degree of compactness within the same class or the margin
please try to utilize Leaky ReLU, PReLU, etc. between different classes. Hence, many loss functions are
• The negative slope in Leaky ReLU can be set to 0.02 to proposed to solve this problem.
speed up training. Contrastive loss [59] enlarges the distance between different
categories and minimizes the distance within the same
B. Loss function categories. It can be used in dimensionality reduction in
Loss function or cost function is harnessed to calculate the convolutional neural networks. After dimensionality reduction,
distance between the predicted value and the actual value. Loss the two samples that are originally similar are still similar in the
function is usually used as a learning criterion of the feature space, whereas the two samples that are originally
optimization problem. Loss function can be used with different are still different. Additionally, contrastive loss is
convolutional neural networks to deal with regression problems widely used with convolutional neural networks in face
and classification problems, the goal of which is to minimize recognition. It was firstly used in SiameseNet [60], and later
loss function. Common loss functions include Mean Absolute was deployed in DeepID2 [61], DeepID2+ [62] and DeepID3
Error (MAE), Mean Square Error (MSE), Cross Entropy, etc. [63].
After contrastive loss, triplet loss was proposed by Schroff et
1) Loss Function for Regression al. in FaceNet [64], with which the CNN model can learn better
In convolutional neural networks, when dealing with face embeddings. The definition of the triplet loss function is
regression problems, we are likely to use MAE or MSE. based upon three images. These three images are anchor image,
MAE calculates the mean of the absolute error between the positive image, and negative image. The positive image and the
predicted value and the actual value; MSE calculates the mean anchor image are from the same person, whereas the negative
of square error between them. image and the anchor image are from different people.
MAE is more robust to outliers than MSE, because MSE Minimizing triplet loss is to make the distance between the
would calculate the square error of outliers. However, the result anchor and the positive one closer, and make the distance
of MSE is derivable so that it can control the rate of update. The between the anchor and the negative one further. Triplet loss is
result of MAE is non-derivable, the update speed of which usually used with convolutional neural networks for fine-
cannot be determined during optimization. grained classification at the individual level, which requires
Therefore, if there are lots of outliers in the training set and model have ability to distinguish different individuals from the
they may have a negative impact on models, MAE is better than same category. Convolutional neural networks with triplet loss
MSE. Otherwise, MSE should be considered. or its variants can be used in identification problems, such as
2) Loss Function for Classification face identification [65], [66], [64], person re-identification [67],
In convolutional neural networks, when it comes to [68], and vehicle re-identification [69].
classification tasks, there are many loss functions to handle. Another one is center loss [70], which is an improvement
The most typical one, called cross entropy loss, is used to based upon cross entropy. The purpose of center loss is to focus
evaluate the difference between the probability distribution on the uniformity of the distribution within the same class. In
obtained from the current training and the actual distribution. order to make it evenly distributed around the center of the class,
This function compares the predicted probability with the actual center loss adds an additional constraint to minimize the intra-
output value (0 or 1) in each class and calculate the penalty class difference. Center loss was used with CNN in face
value based upon the distance from them. The penalty is recognition [70], image retrieval [71], person re-identification
logarithmic, so the function provides a smaller score (0.1 or 0.2) [72], speaker recognition [73], etc.
for smaller differences and a bigger score (0.9 or 1.0) for larger Another variant of cross entropy is large-margin softmax loss
differences. [74]. The purpose of it is also intra-class compression and inter-
Cross entropy loss is also called softmax loss, which class separation. Large-margin softmax loss adds a margin
indicates it is always used in CNNs with a softmax layer. For between different classes, and introduces the margin regularity
example, AlexNet [11], Inception v1 [33], and ResNet [37] uses through the angle of the constraint weight matrix. Large-margin
TABLE II
DIFFERENT LOSS FUNCTIONS FOR CONVOLUTIONAL NEURAL NETWORKS
Loss Brief Description
Mean absolute error Calculate the mean of absolute error of samples.
Mean square error Calculate the mean of square error of samples.
Cross entropy loss Calculate the difference between the probability distribution and the actual distribution. [11] [33] [37]
Contrastive loss Enlarge the distance between different categories and minimize the distance within the same categories. [60] [61] [62] [63]
Triplet loss Minimize the distance between anchor samples and positive samples, and enlarge the distance between anchor samples and
negative samples. [64] [65] [66] [67] [68]
Center loss Minimize intra-class distance. [70] [71] [72] [73]
Large-margin softmax loss Focus on intra-class compression and inter-class separation. [74] [75] [76]
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 12
softmax loss was used in face recognition [74], emotion FaceNet [64], DeepID [79], and DeepID2 [61].
recognition [75], speaker verification [76], etc. 2) Gradient Descent Optimization Algorithms
3) Rules of Thumb for Selection On the basis of MBGD, a series of effective algorithms for
• When using CNN models to deal with regression problems, optimization are proposed to accelerate model training process.
we can choose L1 loss or L2 loss as the loss function. A proportion of them are presented as follows.
• When dealing with classification problems, we can select Qian et al. proposed the Momentum algorithm [80]. It
the rest of the loss functions. simulates physical momentum, using the exponentially
• Cross entropy loss is the most popular choice, usually weighted average of the gradient to update weights. If the
appearing in CNN models with a softmax layer in the end. gradient in one dimension is much larger than the gradient in
• If the compactness within the class or the margin between another dimension, the learning process will become
different classes is concerned, the improvements based upon unbalanced. The Momentum algorithm can prevent oscillations
cross entropy loss can be considered, like center loss and large- in one dimension, thereby achieving faster convergence. Some
margin softmax loss. classic CNN models like VGG [32], Inception v1 [33], and
• The selection of loss function in CNNs also depends on Residual networks [37] use momentum in their original paper.
the application scenario. For example, when it comes to face However, for the Momentum algorithm, blindly following
recognition, contrastive loss and triplet loss are turned out to be gradient descent is a problem. Nesterov Accelerated Gradient
the commonly-used ones nowadays. (NAG) algorithm [81] gives the Momentum algorithm a
predictability that makes it slow down before the slope becomes
C. Optimizer positive. By getting the approximate gradient of the next
In convolutional neural networks, we often need to optimize position, it can adjust the step size in advance. Nesterov
non-convex functions. Mathematical methods require huge Accelerated Gradient has been used to train CNN-based models
computing power, so optimizers are used in the training process in many tasks [82], [83], [84].
to minimize the loss function for getting optimal network Adagrad algorithm [85] is another optimization algorithm
parameters within acceptable time. Common optimization based upon gradients. It can adapt the learning rate to
algorithms are Momentum, RMSprop, Adam, etc. parameters, performing smaller updates (i.e., a low learning rate)
1) Gradient Descent for frequent feature-related parameters, and performing larger-
There are three kinds of gradient descent methods that we can step updates (i.e., a high learning rate) for infrequent ones.
use to train our CNN models: Batch Gradient Descent (BGD), Therefore, it is very suitable for processing sparse data. One of
Stochastic Gradient Descent (SGD), and Mini-Batch Gradient the main advantages of Adagrad is that there is no need to adjust
Descent (MBGD). the learning rate manually. In most cases, we just use 0.01 as
The BGD indicates a whole batch of data need to be the default learning rate [50]. FaceNet [64] uses Adagrad as the
calculated to get a gradient for each update, which can ensure optimizer in training.
convergence to the global optimum of the convex plane and the Adadelta algorithm [86] is an extension of the Adagrad,
local optimum of the non-convex plane. However, it's pretty designed to reduce its monotonically decreasing learning rate.
slow to use BGD because the average gradient of whole batch It does not merely accumulate all squared gradients but sets a
samples should be calculated. Also, it can be tricky for data that fixed size window to limit the number of accumulated squared
is not suitable for in-memory calculation. Hence, BGD is hardly gradients. At the same time, the sum of gradients is recursively
utilized in training CNN-based models in practice. defined as the decaying average of all previous squared
On the contrary, SGD only use one sample for each update. gradients, rather than directly storing the previous squared
It is apparent that the time of SGD for each update greatly less gradients. Adadelta are leveraged in many tasks [87], [88], [89].
than BGD because only one sample’s gradient is needed to Root Mean Square prop (RMSprop) algorithm [90] is also
calculate. In this case, SGD is suitable for online learning [77]. designed to solve the problem of the radically diminishing
However, SGD is quickly updated with high variance, which learning rate in the Adagrad algorithm. MobileNet [44],
will cause the objective function to oscillate severely. On the Inception v3 [35], and Inception v4 [36] achieved their best
one hand, the oscillation of the calculation can make the models using RMSprop.
gradient calculation jump out of the local optimum, and finally Another frequently-used optimizer is Adaptive Moment
reach a better point; on the other hand, SGD may never Estimation (Adam) [91] It is essentially an algorithm formed by
converge because of endless oscillation. combining the Momentum and the RMSprop. Adam stores both
Based on BGD and SGD, MBGD was proposed, which the exponential decay average of the past square gradients like
combines the advantages of BGD and SGD. MBGD uses a the Adadelta algorithm and the average exponential decay
small batch of samples for each update, so that it can not only average of the past gradients like the Momentum algorithm.
perform more efficient gradient decent than BGD, but also Practice has proved that Adam algorithm works well on many
reduce the variance, making the convergence more stable. problems and is applicable to many different convolutional
Among these three methods, MBGD is the most popular one. neural network structures [92], [93], [88].
Lots of classic CNN models use it to train their networks in AdaMax algorithm [91] is a variant of Adam that makes the
original papers, like AlexNet [11], VGG [32], Inception v2 [34], boundary range of the learning rate simpler, and it has been used
ResNet [37] and DenseNet [78]. It has also been leveraged in to train CNN models [94], [95].
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 13
TABLE III
COMPARATIVE RESULTS OF DIFFERENT OPTIMIZERS
Last three Data Activation Validation set Training
Model Dataset Loss Batch size epochs Optimizer
layers Augmentation function accuracy time (s)
MBGD 85.70% 926.24
Momentum 86.37% 947.18
Nesterov 84.32% 945.92
Adagrad 84.68% 950.72
VGG-16 512, 256, Cross Adadelta 86.06% 965.90
CIFAR-10 - ReLU 512 100
(pre-trained) 10 entropy RMSprop 87.32% 959.33
Adam 83.09% 953.46
Adamax 86.18% 960.83
Nadam 86.26% 968.72
AMSgrad 83.25% 951.28
Nesterov-accelerated Adaptive Moment Estimation (Nadam) • In the experiment, we find that Nesterov, Adagrad,
[96] is a combination of Adam and NAG. Nadam has a stronger RMSprop, Adamax, and Nadam oscillate and even cannot
constraint on the learning rate and a direct impact on the update converge during training. In the further experiments (see Fig.
of the gradient. Nadam is used in many tasks [97], [98], [99]. 26.), we find that learning rate has huge impact on convergence.
AMSGrad [100] is an improvement on Adagrad. The author • Nesterov, RMSprop, and Nadam are likely to create
of AMSGrad algorithm found that there were errors in the oscillation, but it is this characteristic that may help models
update rules of the Adam algorithm, which caused it to fail to jump out of local optima.
converge to the optimal in some cases. Therefore, AMSGrad
algorithm uses the maximum value of the past squared gradient
instead of the original exponential average to update the
parameters. AMSGrad has been used to train CNN in many
tasks [101], [102], [103].
3) Experimental Evaluation
In the experiment, we tested ten kinds of optimizers—mini-
batch gradient decent, Momentum, Nesterov, Adagrad,
Adadelta, RMSprop, Adam, Adamax, Nadam, and AMSgrad
on CIFAR-10 data set. The last nine optimizers are based upon
mini batch. The format of the CIFAR -10 data set is the same
as the experiment in the section 2.B. We also do our
experiments on the basis of a pre-trained VGG-16 model
without the last three layers on ImageNet [58]. The results can
be seen in Table III and Fig. 25, from which we can get some
conclusions.
• Almost all optimizers that we tested can make the CNN-
based model converge at the end of the training.
Fig. 25. The accuracy of validation set and training loss on CIFAR-10 trained
• The convergence rate of mini-batch gradient decent is by VGG-16 with ten different optimizers, respectively.
slowest, even if it can converge at the end.
Fig. 26. The accuracy of validation set and training loss on CIFAR-10 trained by VGG-16 with Nesterov, Adagrad, RMSprop, Adamax, or Nadam optimizer with
different learning rates, respectively.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 14
replace the sliding window and manual feature extraction used In 2014, Long et al. [130] proposed the concept of Fully
in traditional object detection and designed the R-CNN Convolutional Networks (FCN) and applied CNN structures to
framework, which made a breakthrough in object detection. image semantic segmentation for the first time. In 2015,
Then, Girshick et al. [127] summarizing the shortcomings of R- Ronneberger et al. [131] proposed U-Net, which has more
CNN [126] and drawing lessons from the SPP-Net [110], multi-scale features and has been applied to medical image
proposed Fast R-CNN, which introduced the Regions of segmentation. Besides, ENet [132], PSPNet [133], etc. [134],
Interest (ROI) Pooling layer, making the network faster. [135] were proposed to handle specific problems.
Besides, Fast R-CNN shares convolution features between In terms of instance segmentation tasks, He et al. [136]
object classification and bounding box regression. However, proposed Mask-RCNN that shares convolution features
Fast R-CNN still retains the selective search algorithm of R- between two tasks through the cascade structure. In
CNN’s region proposals. In 2015, Ren et al. [128] proposed consideration of real time, Bolya et al. [137] based on
Faster R-CNN, which adds the selection of region proposals to RetinaNet [138] harnessed ResNet-101 and FPN to fuse multi-
make it faster. An essential contribution of Faster R-CNN is to scale features.
introduce an RPN network at the end of the convolutional layer. For panoptic segmentation tasks, it was first proposed by
In 2016, Lin et al. [129] added Feature Pyramid Network (FPN) Kirillov et al. [139] in 2018. They proposed panoramic FPN
to Faster R-CNN, where multi-scale features can be fused [140] in 2019, which combines FPN network with Mask-RCNN
through the feature pyramid in the forward process. to generate a branch of semantic segmentation. In the same year,
In one stage, the model directly returns the category Liu et al. [141] proposed OANet that also introduced the FPN
probability and position coordinates of the objects. Redmon et based on Mask-RCNN, but the difference is that they designed
al. regarded object detection as a regression problem and an end-to-end network.
proposed YOLO v1 [120], which directly utilizes a single 4) Face Recognition
neural network to predict bounding boxes and the category of Face recognition is a biometric identification technique based
objects. Afterward, YOLO v2 [121] proposed a new on the features of the human face. The development history of
classification model darknet-19, which includes 19 deep face recognition is shown in Fig. 29. DeepFace [142] and
convolutional layers and five max-pooling layers. Batch DeepID [79] achieved excellent results on the LFW [74] data
normalization layers are added after each convolution layer, set, surpassing humans for the first time in the unconstrained
which is beneficial to stable training and fast convergence. scenarios. Henceforth, deep learning-based approaches
YOLO v3 [122] was proposed to remove the max-pooling received much more attention. The process of DeepFace
layers and the fully-connected layers, using 1 × 1 and 3 × 3 proposed by Taigman et al. [142] is detection, alignment,
convolution and shortcut connections. Besides, YOLO v3 extraction, and classification. After detecting the face, using
borrows the idea from FPN to achieve multi-scale feature fusion. three-dimensional alignment generate a 152 × 152 image as the
For the benefits of the structure of YOLO v3, many classic input of CNN. Taigman et al. [142] leveraged Siamese network
networks replace the backend of it to achieve better results. All to train the model, which obtained state-of-the-art results.
of the aforementioned approaches leverage anchor boxes to Unlike DeepFace, DeepID directly inputs two face images into
determine where objects are. Their performance hinges on the CNN to extract feature vectors for classification. DeepID2 [61]
choice of anchor boxes, and a large number of hyperparameters introduces classification loss and verification loss. Based upon
are introduced. Therefore, Law et al. [124] proposed CornerNet, DeepID2 [61], DeepID2+ [62] adds the auxiliary loss between
which abandons anchor boxes and directly predicts the top-left convolutional layers. DeepID3 [63] proposed two kinds of
corner and bottom-right corner of bounding boxes of objects. In structures, which can be constructed by stacked convolutions of
order to decide which two corners in different categories are VGGNet or Inception modules.
paired with each other, and an embedding vector is introduced. The aforementioned approaches harness the standard
Then, CornerNet-Lite [125] optimized CornerNet in terms of softmax loss function. More recently, improvements in face
detection speed and accuracy. recognition are basically focused on the loss function. FaceNet
3) Image Segmentation [85] proposed by Google in 2015 utilizes 22-layer CNN and
Image segmentation is the task that divides an image into 200 million pictures, including eight million people, to train a
different areas. It has to mark the boundaries of different model. In order to learn more efficient embeddings, FaceNet
semantic entities in an image. The image segmentation task replaces softmax with triplet loss. Besides, VGGFace [65] also
completed by CNN is shown in Fig 28. deploys triplet loss to train the model. Besides, there are various
loss functions harnessed to reach better results, like L-softmax
loss, SphereFace, ArcFace, and large margin cosine loss, which
can be seen in Fig. 29. [74], [143], [144], [145]
DeepID2
(contrastive
loss)
DeepID FaceNet ArcFace
FaceNet
(softmax) (triplet loss) (triplet loss)
DeepFace L2-softmax
DeepID2+ DeepID3 VGGNet L-softmax A-softmax CosFace
(softmax) (feature
(contrastive loss) (contrastive loss) (triplet+softmax) (large margin) (large margin) (large margin)
normalization)
Fig. 28. Applications of CNN in image segmentation Fig. 29. The development history of deep face recognition
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 16
and multivariate quantization are harnessed to strike a proper C. Network Architecture Search
balance between model size and accuracy. Network Architecture Search (NAS) is another method to
B. Security of CNN realize automatic machine learning of CNN. NAS constructs a
search space through design choices, such as the number of
There are many applications of CNN in daily life, including
kernels, skip connections, etc. Besides, NAS finds a suitable
security identification system [166],[167], medical image
optimizer to control the search process in the search space. As
identification [168],[156],[155], traffic sign recognition [169],
shown in Fig. 32, NAS could be divided into NAS with agents
and license plate recognition [170]. These applications are
and without agents. Due to the high demand for NAS on
highly related to life and property security. Once models are
computing resources, the integrated models consist of learned
disturbed or destroyed, the consequences will be severe.
optimal convolutional layers in the small-scale data sets. Small-
Therefore, the security of CNN is expected to be attached great
scale data sets are the agents that generate the overall model, so
importance. More precisely, researchers [171],[172],[173],[174]
this approach is the NAS with agents. The agentless NAS refers
have proposed some methods to deceive CNN, resulting in a
to learning the whole model directly on large-scale data sets.
sharp drop in the accuracy. These methods can be classified into
two categories: data poisoning and adversarial attacks.
Target Target
Data poisoning indicates that poisoning the training data Learner Proxy
Task& Learner Task&
Task
during the training phase. Poison refers to the insertion of noise Hardware Hardware
CNN cannot correctly recognize Fig. 33 (b), (c), or (d) is the conclusions are reached. Also, we offer some rules of thumb for
same cat as the former, which is obvious to humans. This the selection of these functions.
problem is caused by the architecture of CNN. Therefore, in Fourth, we discuss some typical applications of CNN.
order to teach a CNN system to recognize different patterns of Different dimensional convolutions should be designed for
one object, a massive amount of data should be fed, making up various problems. Other than the most frequently-used two-
for the flaw of CNN architectures with diverse data. However, dimensional CNN used for image-related tasks, one-
labeled data is typically hard to obtain. Although some tricks dimensional and multi-dimensional CNN are harnessed in lots
like data augmentation can bring about some effects, the of scenarios as well.
improvement is relative limited. Finally, even though convolutions possess many benefits and
Pooling layer is widely used in CNN for many advantages, have been widely used, we reckon that it can be refined further
but it ignores the relationship between the whole and the part. in terms of model size, security, and easy hyperparameters
For effectively organizing network structures and solving the selection. Moreover, there are lots of problems that convolution
problem of spatial information loss of traditional CNN, Hinton is hard to handle, such as low generalization ability, lack of
et al. [186] proposed Capsule Neural Networks (CapsNet) equivariance, and poor crowded-scene results, so that several
where neurons on different layers focus on different entities or promising directions are pointed.
attributes, so that they add neurons to focus on the same
category or attribute, similar to the structure of a capsule. When REFERENCES
CapsNet is activated, the pathway between capsules forms a
tree structure composed of sparsely activated neurons. Each [1] W. S. Mcculloch, and W. H. Pitts, “A logical Calculus of Ideas
output of a capsule is a vector, the length of which represents Immanent in Nervous Activity,” The Bulletin of Mathematical
Biophysics, vol. 5, pp. 115-133, 1942.
the probability of the existence of an object. Therefore, the [2] F. Rosenblatt, “The Perceptron: A Probabilistic Model for Information
output features include the specific pose information of objects, Storage and Organization in the Brain,” Psychological Review, pp.
which means that CapsNet has the ability to recognize the 368-408, 1958.
[3] C. V. D. Malsburg, “Frank Rosenblatt: Principles of Neurodynamics:
orientation. In addition, unsupervised CapsNet was created by Perceptrons and the Theory of Brain Mechanisms.”
Hinton et al. [187], called Stacked Capsule Autoencoder [4] Davd. Rumhar, Geoffrey. Hinton, and RonadJ. Wams, “Learning
(SCAE). SCAE consists of four parts: Part Capsule representations by back-propagating errors.”
[5] A. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. J. Lang,
Autoencoder (PCAE), Object Capsule Autoencoder (OCAE), “Phoneme recognition using time-delay neural networks,” IEEE
and the decoders of PCAE and OCAE. PCAE is a CNN with a Transactions on Acoustics Speech & Signal Processing, vol. 37, no. 3,
top-down attention mechanism. It can identify the pose and pp. 328-339, 1989.
[6] W. Zhang, “Shift-invariant pattern recognition neural network and its
existence of capsules of different parts. OCAE is used to
optical architecture,” in Proceedings of annual conference of the Japan
implement inference. SCAE can predict the activations of Society of Applied Physics, 1988.
CapsNet directly based on the pose and the existence. Some [7] Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W.
experiments have proved that CapsNet is able to reach state-of- Hubbard, and L. D. Jackel, “Backpropagation Applied to Handwritten
Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541-
the-art results. Although it did not achieve satisfactory results 551.
on complicated large-scale data sets, like CIFAR-100 or [8] K. Aihara, T. Takabe, and M. Toyoda, “Chaotic neural networks,”
ImageNet, we can see that it is potential. Physics Letters A, vol. 144, no. 6-7, pp. 333-340.
VII. CONCLUSION

Owing to advantages such as local connections, weight sharing, and down-sampling dimensionality reduction, convolutional neural networks have been widely deployed in both research and industry projects. This paper provides a detailed survey on CNN, covering common building blocks, classic networks, related functions, applications, and prospects.
First, we discuss the basic building blocks of CNN and present how to construct a CNN-based model from scratch.
Second, some excellent networks are expounded. From them, we obtain guidelines for devising novel networks from the perspectives of accuracy and speed. More specifically, in terms of accuracy, deeper and wider neural structures are able to learn better representations than shallow ones. Besides, residual connections can be leveraged to build extremely deep neural networks, which increases their ability to handle complex tasks. In terms of speed, dimension reduction and low-rank approximation are very handy tools.
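To make the residual-connection guideline above concrete, the following is a minimal PyTorch-style sketch of an identity-shortcut block in the spirit of [37]; the class name, layer sizes, and hyperparameters are illustrative assumptions rather than code from any of the surveyed papers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Identity shortcut: the block learns a residual F(x) and outputs
    # F(x) + x, which eases the optimization of very deep networks.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # shortcut (skip) connection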
Third, we introduce activation functions, loss functions, and optimizers for CNN. Through experimental analysis, several conclusions are drawn, which can provide guidance for their selection. Moreover, there remain many problems that convolution struggles to handle, such as limited generalization ability, lack of equivariance, and poor performance in crowded scenes, so several promising research directions are pointed out.

REFERENCES

[1] W. S. McCulloch, and W. H. Pitts, "A logical Calculus of Ideas Immanent in Nervous Activity," The Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[2] F. Rosenblatt, "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review, pp. 368-408, 1958.
[3] C. V. D. Malsburg, "Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms."
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors."
[5] A. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics Speech & Signal Processing, vol. 37, no. 3, pp. 328-339, 1989.
[6] W. Zhang, "Shift-invariant pattern recognition neural network and its optical architecture," in Proceedings of annual conference of the Japan Society of Applied Physics, 1988.
[7] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, no. 4, pp. 541-551.
[8] K. Aihara, T. Takabe, and M. Toyoda, "Chaotic neural networks," Physics Letters A, vol. 144, no. 6-7, pp. 333-340.
[9] D. F. Specht, "A general regression neural network," IEEE Transactions on Neural Networks, vol. 2, no. 6, pp. 568-576.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[11] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in neural information processing systems, vol. 25, no. 2, 2012.
[12] N. Aloysius, and M. Geetha, "A review on deep convolutional neural networks," Proceedings of the 2017 IEEE International Conference on Communication and Signal Processing, ICCSP 2017, pp. 588-592.
[13] A. Dhillon, and G. K. Verma, "Convolutional neural network: a review of models, methodologies and applications to object detection," Progress in Artificial Intelligence, 2019.
[14] W. Rawat, and Z. Wang, "Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review," Neural Computation, pp. 1-98.
[15] Q. Liu, N. Zhang, W. Yang, S. Wang, Z. Cui, X. Chen, and L. Chen, "A Review of Image Recognition with Deep Convolutional Neural Network."
[16] S. Rehman, H. Ajmal, U. Farooq, Q. U. Ain, and A. Hassan, "Convolutional neural network based image segmentation: a review."
[17] T. Lindeberg, "Scale invariant feature transform," 2012.
[18] N. Dalal, and B. Triggs, "Histograms of oriented gradients for human detection." pp. 886-893.
[19] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, 2006.
[20] D. H. Hubel, and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, pp. 106-154, 1962.
[21] D. M. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, no. 1, pp. 1-12, 2004.
[22] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193-202.
[23] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, "Deformable Convolutional Networks."
[24] L. Sifre, and S. Mallat, "Rigid-Motion Scattering for Texture Classification," 2014.
[25] F. Mamalet, and C. Garcia, "Simplifying ConvNets for Fast Learning," 2012.
[26] F. Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions."
[27] W. Min, B. Liu, and H. Foroosh, "Factorized Convolutional Neural Networks."
[28] D. Li, A. Zhou, and A. Yao, "HBONet: Harmonious Bottleneck on Two Orthogonal Dimensions."
[29] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks."
[30] T. K. Lee, W. J. Baddar, S. T. Kim, and Y. M. Ro, "Convolution with Logarithmic Filter Groups for Efficient Shallow CNN."
[31] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi, "Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups."
[32] K. Simonyan, and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Computer Science, 2014.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions." pp. 1-9.
[34] S. Ioffe, and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision." pp. 2818-2826.
[36] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning."
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition." pp. 770-778.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification."
[39] S. Zagoruyko, and N. Komodakis, "Wide Residual Networks."
[40] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger, "Deep Networks with Stochastic Depth."
[41] S. Targ, D. Almeida, and K. Lyman, "Resnet in Resnet: Generalizing Residual Architectures."
[42] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets." pp. 2672-2680.
[43] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," Computer Science, 2015.
[44] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[45] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks." pp. 4510-4520.
[46] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, "Searching for MobileNetV3," arXiv:1905.02244 [cs.CV], 2019.
[47] T. J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications."
[48] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks."
[49] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices."
[50] N. Ma, X. Zhang, H. T. Zheng, and J. Sun, "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design."
[51] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More Features from Cheap Operations," arXiv preprint arXiv:1911.11907, 2019.
[52] V. Nair, and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines."
[53] M. T. Hagan, H. B. Demuth, and M. H. Beale, Neural Network Design, 2002.
[54] A. Krizhevsky, "Convolutional Deep Belief Networks on CIFAR-10," 2010.
[55] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," 2016.
[56] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms."
[57] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," University of Toronto, 2012.
[58] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li, "ImageNet: A large-scale hierarchical image database," Proc. of IEEE Computer Vision & Pattern Recognition, pp. 248-255, 2009.
[59] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping." pp. 1735-1742.
[60] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification." pp. 539-546.
[61] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification." pp. 1988-1996.
[62] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust." pp. 2892-2900.
[63] Y. Sun, D. Liang, X. Wang, and X. Tang, "Deepid3: Face recognition with very deep neural networks," arXiv preprint arXiv:1502.00873, 2015.
[64] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering." pp. 815-823.
[65] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," 2015.
[66] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, "Openface: A general-purpose face recognition library with mobile applications," CMU School of Computer Science, vol. 6, pp. 2, 2016.
[67] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, "Person re-identification by multi-channel parts-based cnn with improved triplet loss function." pp. 1335-1344.
[68] A. Hermans, L. Beyer, and B. Leibe, "In defense of the triplet loss for person re-identification," arXiv preprint arXiv:1703.07737, 2017.
[69] R. Kuma, E. Weill, F. Aghdasi, and P. Sriram, "Vehicle re-identification: an efficient baseline using triplet embedding." pp. 1-9.
[70] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition." pp. 499-515.
[71] J. Yao, Y. Yu, Y. Deng, and C. Sun, "A feature learning approach for image retrieval." pp. 405-412.
[72] H. Jin, X. Wang, S. Liao, and S. Z. Li, "Deep person re-identification with improved embedding and efficient training." pp. 261-267.
[73] G. Wisniewski, H. Bredin, G. Gelly, and C. Barras, "Combining speaker turn embedding and incremental structure prediction for low-latency speaker diarization."
[74] W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-margin softmax loss for convolutional neural networks." p. 7.
[75] L. Tan, K. Zhang, K. Wang, X. Zeng, X. Peng, and Y. Qiao, "Group emotion recognition with individual facial emotion CNNs and global image based CNNs." pp. 549-552.
[76] Y. Liu, L. He, and J. Liu, "Large margin softmax loss for speaker verification," arXiv preprint arXiv:1904.03479, 2019.
[77] D. Saad, On-line Learning in Neural Networks: Cambridge University Press, 2009.
[78] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks." pp. 4700-4708.
[79] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes." pp. 1891-1898.
[80] N. Qian, "On the momentum term in gradient descent learning algorithms," Neural Networks, vol. 12, no. 1, pp. 145-151, 1999.
[81] Y. Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)." pp. 543-547.
[82] W. Su, L. Chen, M. Wu, M. Zhou, Z. Liu, and W. Cao, "Nesterov accelerated gradient descent-based convolution neural network with dropout for facial expression recognition." pp. 1063-1068.
[83] A. L. Maas, P. Qi, Z. Xie, A. Y. Hannun, C. T. Lengerich, D. Jurafsky, and A. Y. Ng, "Building DNN acoustic models for large vocabulary speech recognition," Computer Speech & Language, vol. 41, pp. 195-213, 2017.
[84] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks." pp. 1-7.
[85] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121-2159, 2011.
[86] M. D. Zeiler, "Adadelta: an adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[87] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition." pp. 577-585.
[88] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, "Very deep multilingual convolutional neural networks for LVCSR." pp. 4955-4959.
[89] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[90] G. Hinton, N. Srivastava, and K. Swersky, "Neural networks for machine learning lecture 6a overview of mini-batch gradient descent," Cited on, vol. 14, no. 8, 2012.
[91] D. P. Kingma, and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[92] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," arXiv preprint arXiv:1511.04119, 2015.
[93] F. Korzeniowski, and G. Widmer, "A fully convolutional deep auditory model for musical chord recognition." pp. 1-6.
[94] M. J. Van Putten, S. Olbrich, and M. Arns, "Predicting sex from brain rhythms with deep learning," Scientific Reports, vol. 8, no. 1, pp. 1-7, 2018.
[95] S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution." pp. 261-270.
[96] T. Dozat, "Incorporating nesterov momentum into adam," 2016.
[97] D. Q. Nguyen, and K. Verspoor, "Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings," arXiv preprint arXiv:1805.10586, 2018.
[98] S. Maetschke, B. Antony, H. Ishikawa, G. Wollstein, J. Schuman, and R. Garnavi, "A feature agnostic approach for glaucoma detection in OCT volumes," PloS one, vol. 14, no. 7, 2019.
[99] A. Schindler, T. Lidy, and A. Rauber, "Multi-temporal resolution convolutional neural networks for acoustic scene classification," arXiv preprint arXiv:1811.04419, 2018.
[100] S. J. Reddi, S. Kale, and S. Kumar, "On the convergence of adam and beyond," arXiv preprint arXiv:1904.09237, 2019.
[101] M. Jahanifar, N. Z. Tajeddin, N. A. Koohbanani, A. Gooya, and N. Rajpoot, "Segmentation of skin lesions and their attributes using multi-scale convolutional neural networks and domain specific augmentations," arXiv preprint arXiv:1809.10243, 2018.
[102] F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein, "Fake news detection on social media using geometric deep learning," arXiv preprint arXiv:1902.06673, 2019.
[103] S. Liu, E. Gibson, S. Grbic, Z. Xu, A. A. A. Setio, J. Yang, B. Georgescu, and D. Comaniciu, "Decompose to manipulate: manipulable object synthesis in 3D medical images with structured image decomposition," arXiv preprint arXiv:1812.01737, 2018.
[104] E. Urtnasan, H. Kim, J.-U. Park, D. Kang, and K.-J. Lee, "Automatic Prediction of Atrial Fibrillation Based on Convolutional Neural Network Using a Short-term Normal Electrocardiogram Signal."
[105] S. Harbola, and V. Coors, "One dimensional convolutional neural network architectures for wind prediction," Energy Conversion and Management, vol. 195, pp. 70-75, 2019.
[106] D. Han, J. Chen, and J. Sun, "A parallel spatiotemporal deep learning network for highway traffic flow forecasting," International Journal of Distributed Sensor Networks, vol. 15, no. 2.
[107] Q. Zhang, D. Zhou, and X. Zeng, "HeartID: A Multiresolution Convolutional Neural Network for ECG-based Biometric Human Identification in Smart Health Applications," IEEE Access, pp. 1-1.
[108] O. Abdeljaber, O. Avci, S. Kiranyaz, M. Gabbouj, and D. J. Inman, "Real-Time Vibration-Based Structural Damage Detection Using One-Dimensional Convolutional Neural Networks," Journal of Sound & Vibration, vol. 388, pp. 154-170, 2017.
[109] O. Abdeljaber, S. Sassi, O. Avci, S. Kiranyaz, A. A. Ibrahim, and M. Gabbouj, "Fault Detection and Severity Identification of Ball Bearings by Online Condition Monitoring," IEEE Transactions on Industrial Electronics, pp. 1-1.
[110] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, 2015.
[111] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual path networks." pp. 4467-4475.
[112] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "Densenet: Implementing efficient convnet descriptor pyramids," arXiv preprint arXiv:1404.1869, 2014.
[113] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks." pp. 1492-1500.
[114] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, "Medical image classification with convolutional neural network." pp. 844-848.
[115] Y. Jiang, L. Chen, H. Zhang, and X. Xiao, "Breast cancer histopathological image classification using convolutional neural networks with small SE-ResNet module," PloS one, vol. 14, no. 3, 2019.
[116] D. R. Bruno, and F. S. Osório, "Image classification system based on deep learning applied to the recognition of traffic signs for intelligent robotic vehicle navigation purposes." pp. 1-6.
[117] R. Madan, D. Agrawal, S. Kowshik, H. Maheshwari, S. Agarwal, and D. Chakravarty, "Traffic Sign Classification using Hybrid HOG-SURF Features and Convolutional Neural Networks," 2019.
[118] M. Zhang, W. Li, and Q. Du, "Diverse region-based CNN for hyperspectral image classification," IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2623-2634, 2018.
[119] A. Sharma, X. Liu, X. Yang, and D. Shi, "A patch-based convolutional neural network for remote sensing image classification," Neural Networks, vol. 95, pp. 19-28, 2017.
[120] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection." pp. 779-788.
[121] J. Redmon, and A. Farhadi, "YOLO9000: better, faster, stronger." pp. 7263-7271.
[122] J. Redmon, and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[123] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector." pp. 21-37.
[124] H. Law, and J. Deng, "Cornernet: Detecting objects as paired keypoints." pp. 734-750.
[125] H. Law, Y. Teng, O. Russakovsky, and J. Deng, "Cornernet-lite: Efficient keypoint based object detection," arXiv preprint arXiv:1904.08900, 2019.
[126] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation." pp. 580-587.
[127] R. Girshick, "Fast r-cnn." pp. 1440-1448.
[128] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks." pp. 91-99.
[129] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection." pp. 2117-2125.
[130] E. Shelhamer, J. Long, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation."
[131] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation."
[132] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation."
[133] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid Scene Parsing Network."
[134] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 4, pp. 834, 2018.
[135] A. Pal, S. Jaiswal, S. Ghosh, N. Das, and M. Nasipuri, "Segfast: A faster squeezenet based semantic image segmentation technique using depth-wise separable convolutions."
[136] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1-1.
[137] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time Instance Segmentation."
[138] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. PP, no. 99, pp. 2999-3007, 2017.
[139] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic Segmentation."
[140] A. Kirillov, R. Girshick, K. He, and P. Dollár, "Panoptic Feature Pyramid Networks."
[141] H. Liu, C. Peng, C. Yu, J. Wang, X. Liu, G. Yu, and W. Jiang, "An End-to-End Network for Panoptic Segmentation."
[142] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification." pp. 1701-1708.
[143] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep Hypersphere Embedding for Face Recognition."
[144] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition."
[145] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large Margin Cosine Loss for Deep Face Recognition."
[146] C. Cao, Y. Zhang, C. Zhang, and H. Lu, "Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors."
[147] A. Stergiou, and R. Poppe, "Spatio-Temporal FAST 3D Convolutions for Human Action Recognition."
[148] J. Huang, W. Zhou, H. Li, and W. Li, "Attention-based 3D-CNNs for large-vocabulary sign language recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 9, pp. 2822-2832, 2018.
[149] Y. Huang, S.-H. Lai, and S.-H. Tai, "Human Action Recognition Based on Temporal Pose CNN and Multi-dimensional Fusion."
[150] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A Deep Representation for Volumetric Shapes."
[151] D. Maturana, and S. Scherer, "Voxnet: A 3d convolutional neural network for real-time object recognition." pp. 922-928.
[152] S. Song, and J. Xiao, "Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images."
[153] Y. Zhou, and O. Tuzel, "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection."
[154] F. Pastor, J. M. Gandarias, A. J. García-Cerezo, and J. M. Gómez-de-Gabriel, "Using 3D Convolutional Neural Networks for Tactile Object Recognition with Robotic Palpation," Sensors, vol. 19, no. 24, pp. 5356, 2019.
[155] K. Jnawali, M. R. Arbabshirani, N. Rao, and A. A. Patel, "Deep 3D convolution neural network for CT brain hemorrhage classification." p. 105751C.
[156] S. Hamidian, B. Sahiner, N. Petrick, and A. Pezeshk, "3D convolutional neural network for automatic detection of lung nodules in chest CT." p. 1013409.
[157] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[158] V. Sindhwani, T. Sainath, and S. Kumar, "Structured transforms for small-footprint deep learning." pp. 3088-3096.
[159] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[160] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, "Exploring the regularity of sparse structure in convolutional neural networks," arXiv preprint arXiv:1705.08922, 2017.
[161] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks."
[162] X. Lin, C. Zhao, and W. Pan, "Towards Accurate Binary Convolutional Neural Network."
[163] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained Ternary Quantization."
[164] Y. Choi, M. El-Khamy, and J. Lee, "Towards the limit of network quantization," arXiv preprint arXiv:1612.01543, 2016.
[165] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, and G. Venkatesh, "Mixed Precision Training."
[166] M. Sajjad, S. Khan, T. Hussain, K. Muhammad, A. K. Sangaiah, A. Castiglione, C. Esposito, and S. W. Baik, "CNN-based anti-spoofing two-tier multi-factor authentication system," Pattern Recognition Letters, vol. 126, pp. 123-131, 2019.
[167] K. Itqan, A. Syafeeza, F. Gong, N. Mustafa, Y. Wong, and M. Ibrahim, "User identification system based on finger-vein patterns using convolutional neural network," ARPN Journal of Engineering and Applied Sciences, vol. 11, no. 5, pp. 3316-3319, 2016.
[168] H. Ke, D. Chen, X. Li, Y. Tang, T. Shah, and R. Ranjan, "Towards brain big data classification: Epileptic EEG identification with a lightweight VGGNet on global MIC," IEEE Access, vol. 6, pp. 14722-14733, 2018.
[169] A. Shustanov, and P. Yakimov, "CNN design for real-time traffic sign recognition," Procedia Engineering, vol. 201, pp. 718-725, 2017.
[170] J. Špaňhel, J. Sochor, R. Juránek, A. Herout, L. Maršík, and P. Zemčík, "Holistic recognition of low quality license plates by CNN using track annotated data." pp. 1-6.
[171] T. Xie, and Y. Li, "A Gradient-Based Algorithm to Deceive Deep Neural Networks." pp. 57-65.
[172] C. Liao, H. Zhong, A. Squicciarini, S. Zhu, and D. Miller, "Backdoor embedding in convolutional neural network models via invisible perturbation," arXiv preprint arXiv:1808.10307, 2018.
[173] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, "Analyzing federated learning through an adversarial lens," arXiv preprint arXiv:1811.12470, 2018.
[174] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint, 2018.
[175] K. Liu, B. Dolan-Gavitt, and S. Garg, "Fine-pruning: Defending against backdooring attacks on deep neural networks." pp. 273-294.
[176] N. Akhtar, and A. Mian, "Threat of adversarial attacks on deep learning in computer vision: A survey," IEEE Access, vol. 6, pp. 14410-14430, 2018.
[177] B. Zoph, and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[178] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, "Efficient neural architecture search via parameter sharing," arXiv preprint arXiv:1802.03268, 2018.
[179] H. Cai, L. Zhu, and S. Han, "Proxylessnas: Direct neural architecture search on target task and hardware," arXiv preprint arXiv:1812.00332, 2018.
[180] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, "Mnasnet: Platform-aware neural architecture search for mobile." pp. 2820-2828.
[181] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "Nas-fpn: Learning scalable feature pyramid architecture for object detection." pp. 7036-7045.
[182] T. Jajodia, and P. Garg, "Image Classification–Cat and Dog Images," Image, vol. 6, no. 12, 2019.
[183] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg, "Aggressive deep driving: Model predictive control with a cnn cost model," arXiv preprint arXiv:1707.05303, 2017.
[184] H. Gao, B. Cheng, J. Wang, K. Li, J. Zhao, and D. Li, "Object classification using CNN-based fusion of vision and LIDAR in autonomous vehicle environment," IEEE Transactions on Industrial Informatics, vol. 14, no. 9, pp. 4224-4231, 2018.
[185] A. Azulay, and Y. Weiss, "Why do deep convolutional networks generalize so poorly to small image transformations?," 2018.
[186] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules." pp. 3856-3866.
[187] A. Kosiorek, S. Sabour, Y. W. Teh, and G. E. Hinton, "Stacked capsule autoencoders." pp. 15486-15496.