Activation Functions and Their Characteristics in Deep Neural Networks
Bin Ding, Huimin Qian, Jun Zhou
College of Energy and Electrical Engineering, Hohai University, Nanjing, 211100
E-mail: [email protected], [email protected], [email protected]
Abstract: Deep neural networks have achieved remarkable results in many research areas, especially in computer vision and natural language processing. The great success of deep neural networks depends on several factors, among which the development of activation functions is one of the most important. Aware of this, a number of studies have concentrated on the performance improvements obtained by revising a certain activation function in some specific neural networks, but few papers review the activation functions employed by neural networks thoroughly. Therefore, considering their impact on the performance of neural networks with deep architectures, the status and development of commonly used activation functions are investigated in this paper. More specifically, the definitions, the impacts on the neural networks, and the advantages and disadvantages of quite a few activation functions are discussed. Furthermore, experimental results on the MNIST dataset are used to compare the performance of different activation functions.
Key Words: neural network, deep architecture, activation function
=aw’ +b output level of a neural network owing to its value distribu-
tion. While sigmoid function is rarely adopted in the deep
The function g(-), which is named as activation function, neural networks except the output level since its saturation
is used to simulate the response state of biological neuron Exactly, it is soft saturate since it only achieves zero gradi-
and obtain the output y, that is ent in the limit [13]
y=9(2) A) g =0
Caglar et. al. gave the definition of an activation function and
in [13], which is “ An activation function is a function g :
R — R that is differentiable almost everywhere.” mo)
g =0
Nonlinear activation functions have brought neural net- as seen in Fig. 2. The soft saturation results in the difficul-
works their nonlinear capabilities, which means neural net- ties of training a deep neural network. More specific, in the
works can differentiate those data that can not be classified process of optimizing loss function, the derivatives of sig-
linearly in the current data space. moid function, which is contributed to update the weights
The definition given by Caglar [13] shows that the contin- and the bias, will reduce to zero when it comes to the satu-
uous differentiable function can be used as an activation ration area, which brings about the less contributions of the
function. Whereas, it is not always the truth to the deep first several layers in the knowledge learning from training
neural networks due to the difficulties in training. Next, samples. In fact, the situation is called vanishing gradient.
the commonly used activation functions will be analyzed Generally, the vanishing gradient will emerge in a neural
thoroughly. network with more than five layers [6].
2.1 Sigmoid and Its Improvements According to the limitation of Sigmoid function, several
211 Sigmoid improvements have been proposed. Huang et al. [14] intro-
duced double-parameters to sigmoid function. One of the
Sigmoid function is one of the most common forms of ac- parameter is used to generate an appropriate input = which
tivation functions. The definition of sigmoid function is as can not lead the input to the saturation area; the other is to
follows, control the decay of residual error. Caglar et al. [13] pro-
1 posed to inject the noise to the activation functions which
9(x)
“Tter (1)
makes the loss function is easier to optimize. Another de-
in which @ € (—00, +00), g € (0, 1), as seen in Fig. 2. It velopment is hyperbolic tangent function, which will be il-
lustrated next.
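To make the soft saturation concrete, the short Python sketch below (our own illustration, not code from the original experiments; the function names are ours) evaluates the sigmoid of equation (1) and its derivative $g'(x) = g(x)(1 - g(x))$ at a few inputs and shows how the gradient collapses toward zero once $|x|$ becomes large.

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)), equation (1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x)); it tends to 0 as |x| grows (soft saturation)
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0):
    print(f"x = {x:6.1f}   g(x) = {sigmoid(x):.5f}   g'(x) = {sigmoid_grad(x):.6f}")
# The derivative peaks at 0.25 for x = 0 and is already about 4.5e-5 at |x| = 10,
# which is the soft saturation discussed above.
```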
2.1.2 Hyperbolic Tangent

The hyperbolic tangent function is defined as

$$g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad (2)$$

whose outputs lie in $(-1, 1)$, as seen in Fig. 3.

Figure 3: The graphic depiction of the hyperbolic tangent function and its derivatives.

Its computation is more complicated than that of the sigmoid function, and it exhibits the same soft saturation as the sigmoid function, so it also suffers from the vanishing gradient problem.
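As a rough, hypothetical illustration of why soft-saturating activations hamper deep architectures (this sketch is ours and ignores the weight factors that backpropagation would also multiply in), one can multiply one activation-derivative factor per layer and watch the product shrink with depth:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

# Assume the pre-activation at every layer happens to sit at x = 2.5,
# a mildly saturated value; the backpropagated gradient then picks up
# one derivative factor per layer and decays geometrically with depth.
x = 2.5
for depth in (1, 3, 5, 8, 10):
    print(f"depth {depth:2d}:  sigmoid factor {sigmoid_grad(x) ** depth:.2e}"
          f"   tanh factor {tanh_grad(x) ** depth:.2e}")
# With the sigmoid, five layers already shrink the factor to roughly 1e-6,
# consistent with vanishing gradients appearing beyond about five layers [6].
```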
2.2 ReLU and Its Improvements

Neuroscience research has found that cortical neurons are rarely in their maximum saturation regime, and it suggests that their activation function can be approximated by a rectifier [29]. That is to say, the operating mode of the neurons has the characteristic of sparsity. More specifically, only one to four percent of the neurons in the brain are activated simultaneously. However, in neural networks with sigmoid or hyperbolic tangent activation functions, almost one half of the neuron units are activated at the same time, which is inconsistent with the findings in neuroscience. Furthermore, activating more neuron units brings about more difficulties in the training of a deep neural network. Rectified linear units (ReLU), first introduced by Nair and Hinton for restricted Boltzmann machines [7], help the hidden layers of a neural network obtain a sparse output matrix, which improves efficiency. ReLU and its improvements are currently among the most popular activation functions used in deep neural networks [6, 16, 24, 26]. It is said that although the main difficulty of training deep networks was resolved by the idea of initializing each layer with unsupervised learning, the employment of ReLU activation functions can also be seen as a breakthrough in directly supervised training of deep networks.

2.2.1 ReLU

The definition of ReLU is as follows:

$$g(x) = \max(0, x) = \begin{cases} x & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad (3)$$

The graph is depicted in Fig. 4.

Figure 4: The graphic depiction of the ReLU function and its derivatives.

The ReLU activation function does not saturate for positive inputs, and its derivative function is

$$g'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$

which is constant when the input $x > 0$. Thus, the vanishing gradient problem is relieved. Specifically, the ReLU function has the following advantages [6, 16]:

- Computations of neural networks with ReLU functions are cheaper than with sigmoid and hyperbolic tangent activation functions, because there is no need to compute exponential functions in the activations.
- Neural networks with ReLU activation functions converge much faster than those with saturating activation functions in terms of training time with gradient descent.
- The ReLU function allows a network to easily obtain sparse representations. More specifically, the output is 0 when the input $x < 0$, which provides sparsity in the activation of the neuron units and improves the efficiency of data learning. When the input $x > 0$, the features of the data are largely retained.
- The derivative of the ReLU function keeps the constant value 1 for positive inputs, which helps avoid getting trapped in local optima and resolves the vanishing gradient effect that occurs with sigmoid and hyperbolic tangent activation functions.
- Deep neural networks with ReLU activation functions can reach their best performance on purely supervised tasks with large labeled datasets, without requiring any unsupervised pre-training.

However, the derivative $g'(x) = 0$ when $x < 0$, so the ReLU function is left-hard-saturating. The corresponding weights might not be updated any more, which leads to the death of some neuron units, meaning that these units will never be activated again. Another disadvantage of ReLU is that the average of the units' outputs is always positive, which causes a bias shift for the units in the next layer. Both attributes have a negative impact on the convergence of the corresponding deep neural networks [17].
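The sketch below (our own illustration, not code from the paper) implements equation (3) and its derivative and makes two of the points above concrete: the output is sparse because roughly half of zero-mean inputs are mapped to exactly zero, and the mean output is positive, which is the bias shift just mentioned; a negative input also receives a zero gradient, the situation that can leave a unit dead.

```python
import numpy as np

def relu(x):
    # g(x) = max(0, x), equation (3)
    return np.maximum(0.0, x)

def relu_grad(x):
    # g'(x) = 1 for x > 0 and 0 for x < 0 (left hard saturation)
    return (np.asarray(x) > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)                   # zero-mean inputs
y = relu(x)

print("fraction of exactly-zero outputs:", np.mean(y == 0.0))  # about 0.5 -> sparsity
print("mean of the outputs:            ", y.mean())            # positive -> bias shift
print("gradient at a negative input:   ", relu_grad(-3.0))     # 0.0 -> possibly 'dead' unit
```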
2.2.2 LReLU, PReLU, and RReLU

The possible death of some neuron units in neural networks with ReLU functions results from the compulsive operation of setting $g(x) = 0$ when $x < 0$. In order to alleviate this potential problem, Maas et al. [17] proposed the leaky rectified linear unit (LReLU), defined in equation (4):
$$g(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0.01x & \text{if } x < 0 \end{cases} \qquad (4)$$

Its derivative function is

$$g'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0.01 & \text{otherwise} \end{cases}$$

so the LReLU allows a small, non-zero gradient when the unit is saturated and not active. Therefore, there are no zero gradients, and no neuron unit stays "off" permanently. It should be acknowledged that sparsity is lost when ReLU is replaced with LReLU. Fortunately, the experimental results in [17] illustrate that the impact of this modification on classification accuracy is hardly noticeable, while the learning of the neural networks becomes more robust.
Furthermore, He et al. [12] presented the parametric rectified linear unit (PReLU) by replacing the constant 0.01 of equation (4) with a learnable parameter. The definition of PReLU is

$$g(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{if } x \le 0 \end{cases} \qquad (5)$$

where $a$ is a learnable parameter. The experiments in [12] show that LReLU can lead to better results than ReLU when the value of $a$ is chosen very carefully, but this requires tedious, repeated training. In contrast, PReLU learns the parameter from the data. It is verified that PReLU converges faster and has a lower training error than ReLU. In addition, it is reported that introducing the parameter $a$ into the activation function does not bring about over-fitting [12]. Wei et al. [18] applied a deep convolutional neural network combining L1 regularization and the PReLU activation function to image retrieval, and their experiments demonstrate that the over-fitting problem is indeed resolved and the efficiency of image retrieval is improved by adopting the PReLU activation function.

Another improvement of ReLU, namely the randomized rectified linear unit (RReLU), should also be discussed. As noted above, the slope of the negative part is set to a constant in the LReLU activation function and is a learnable parameter in the PReLU activation function. In RReLU [10], the slope is instead randomized within a given range during training and then fixed during testing, that is,

$$g(x) = \begin{cases} x & \text{if } x \ge 0 \\ ax & \text{if } x < 0 \end{cases} \qquad (6)$$

where the slope $a$ is drawn from a uniform distribution $U(l, u)$ during training and fixed to its average $(l + u)/2$ during testing.
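The following Python sketch (ours; the PReLU slope value and the RReLU sampling range are illustrative choices, not the exact settings of [12] or [10]) summarizes how the negative-part slope is fixed in LReLU, supplied as a learned parameter in PReLU, and randomized during training in RReLU.

```python
import numpy as np

def lrelu(x, slope=0.01):
    # LReLU, equation (4): a small fixed slope on the negative part.
    return np.where(x > 0, x, slope * x)

def prelu(x, a):
    # PReLU, equation (5): the slope a is a parameter learned together
    # with the network weights by back-propagation.
    return np.where(x > 0, x, a * x)

def rrelu(x, low=0.125, high=0.333, training=True, rng=None):
    # RReLU, equation (6): during training the slope is drawn uniformly
    # from [low, high]; at test time it is fixed to the average slope.
    if training:
        rng = rng if rng is not None else np.random.default_rng()
        a = rng.uniform(low, high, size=np.shape(x))
    else:
        a = (low + high) / 2.0
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(lrelu(x))
print(prelu(x, a=0.25))
print(rrelu(x, training=False))
```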
2.2.3 ELU and MPELU

The exponential linear unit (ELU) [11] keeps the identity for positive inputs and saturates exponentially for negative inputs:

$$g(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \le 0 \end{cases} \qquad (7)$$

Its derivative function is

$$g'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha e^{x} & \text{if } x \le 0 \end{cases}$$

The graphic depiction is shown in Fig. 6.

Figure 6: The graphic depiction of the ELU and MPELU functions.

The parameter $\alpha$ manages the value to which an ELU saturates for negative network inputs. The vanishing gradient problem is alleviated because the positive part of ELU is the identity; more specifically, the derivative is one when $x > 0$. The left saturation makes deep neural networks with ELU activation functions more robust to input perturbations and noise.
The output average of an ELU is close to zero, which contributes to faster convergence. Experimental results in [11] have shown that ELU enables faster learning, and that its generalization performance is better than that of the ReLU and LReLU activation functions when the deep neural network has more than five layers.

However, ELU shares a drawback with LReLU: searching for a reasonable $\alpha$ is important but time-consuming. Accordingly, Li et al. [8] proposed the multiple parametric exponential linear unit (MPELU):

$$g(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{\beta x} - 1) & \text{if } x \le 0 \end{cases} \qquad (8)$$

Here $\alpha$ and $\beta$ are learnable parameters that control, respectively, the value to which and the rate at which MPELU saturates. It has been reported that MPELU can become ELU, ReLU, LReLU, or PReLU by adjusting the two parameters $\alpha$ and $\beta$. Therefore, the advantages of MPELU include: 1) the convergence property of ELU is also possessed by MPELU; 2) the generalization capability of MPELU is better than that of ReLU and PReLU, based on experiments on the ImageNet database.
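A minimal Python sketch of ELU (equation (7)) and MPELU (equation (8)), written by us for illustration; it also checks numerically how particular settings of the two parameters recover earlier units, in line with the behaviour reported for MPELU [8].

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU, equation (7): identity for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def mpelu(x, alpha=1.0, beta=1.0):
    # MPELU, equation (8): alpha sets the saturation value, beta the rate
    # at which the negative part saturates; both are learnable in [8].
    return np.where(x > 0, x, alpha * (np.exp(beta * x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(mpelu(x, alpha=1.0, beta=1.0), elu(x)))            # beta = 1  -> ELU
print(np.allclose(mpelu(x, alpha=0.0, beta=1.0), np.maximum(0, x)))  # alpha = 0 -> ReLU
# For a small beta, alpha * (exp(beta * x) - 1) is approximately alpha * beta * x,
# so the negative part becomes a linear slope as in LReLU/PReLU.
```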
3 EXPERIMENTS

In this paper, we have conducted experiments with a deep convolutional neural network (DCNN), whose structure and parameters can be seen in Fig. 7. The DCNN contains two 5×5 convolutional layers, whose stride is fixed to 1 pixel, and two 2×2 max-pooling layers, followed by a fully-connected (FC) layer. During training, the input to the DCNN is a fixed-size 28×28 gray image. The MNIST dataset is split into a training set with 60,000 samples and a test set with 10,000 samples. The cross-entropy loss function is used, the learning rate is chosen as $10^{-4}$, and the number of iterations is 20,000.

The activation functions sigmoid, hyperbolic tangent, ReLU, RReLU, and ELU are applied to the DCNN in Fig. 7, respectively. For RReLU and ELU, we conducted different experiments by adjusting the values of the parameter $a$ for RReLU and $\alpha$ for ELU to choose their best performance. The classification errors with the different activation functions are listed in Tab. 1.

Table 1: Classification error comparisons of DCNNs with different activation functions on MNIST

Activation function   Parameter   Error (%)
Sigmoid               -           1.15
Tanh                  -           1.12
ReLU                  -           0.8
RReLU                 a           0.99
ELU                   α           1.1

It can be seen that the network with the ReLU activation function achieves the best classification performance among them.
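For reference, the following PyTorch sketch is our reconstruction of the described DCNN under stated assumptions: the channel counts (20 and 50), the SGD optimizer, and the reading of the learning rate as 1e-4 are our guesses; only the two 5×5 convolutions with stride 1, the two 2×2 max-pooling layers, the fully-connected layer, the 28×28 grayscale input, the cross-entropy loss, and the interchangeable activation come from the paper.

```python
import torch
import torch.nn as nn

class DCNN(nn.Module):
    """Rough reconstruction of the experimental DCNN; layer widths are assumptions."""

    def __init__(self, activation: nn.Module):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5, stride=1),   # 28x28 -> 24x24, 20 channels assumed
            activation,
            nn.MaxPool2d(kernel_size=2, stride=2),       # 24x24 -> 12x12
            nn.Conv2d(20, 50, kernel_size=5, stride=1),  # 12x12 -> 8x8, 50 channels assumed
            activation,
            nn.MaxPool2d(kernel_size=2, stride=2),       # 8x8 -> 4x4
        )
        self.classifier = nn.Linear(50 * 4 * 4, 10)      # fully-connected layer, 10 classes

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

# Swap in the activation under test: nn.Sigmoid(), nn.Tanh(), nn.ReLU(),
# nn.RReLU(), or nn.ELU().
model = DCNN(nn.ReLU())
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss, as in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)   # optimizer is assumed; lr read as 1e-4

images = torch.randn(8, 1, 28, 28)                         # a dummy batch of 28x28 gray images
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```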
4 CONCLUSION

In the last decades, deep neural networks have improved rapidly, especially in computer vision and natural language processing. As networks get deeper, training efficiency and accuracy have received a great deal of attention, which stimulates the development of activation functions. Saturating activation functions, such as the sigmoid and the hyperbolic tangent, are being replaced by non-saturating counterparts, such as ReLU and ELU. In this paper, the definitions, advantages, and disadvantages of several popular activation functions have been reviewed. It should be acknowledged that some effective activation functions, such as maxout [27] and softplus [28], have not been investigated in this paper. The aim of this paper is to contribute to the understanding of the development progress, the attributes, and the appropriate choice of activation functions. Further investigation and analysis are required to improve the views in this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, 436-444, 2015.

[2] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks, vol. 61, 85-117, 2015.

[3] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, et al., Deep learning for visual understanding: a review, Neurocomputing, vol. 187, 27-48, 2016.

[4] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9, 2015.

[5] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.

[6] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Journal of Machine Learning Research, vol. 9, 249-256, 2010.

[7] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, In Proceedings of the International Conference on Machine Learning, 807-814, 2010.

[8] Y. Li, C. Fan, Y. Li, and Q. Wu, Improving deep neural network with multiple parametric exponential linear units, arXiv:1606.00305, 2016.

[9] A. L. Maas, A. Y. Hannun, and A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, In Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 1-6, 2013.

[10] B. Xu, N. Wang, T. Chen, and M. Li, Empirical evaluation of rectified activations in convolutional network, arXiv:1505.00853, 2015.

[11] D. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), arXiv:1511.07289, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, In Proceedings of the IEEE International Conference on Computer Vision, 1026-1034, 2015.

[13] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, Noisy activation functions, In Proceedings of the International Conference on Machine Learning, 3059-3068, 2016.

[14] Y. Huang, X. Duan, S. Sun, et al., A study of training algorithm in deep neural networks based on Sigmoid activation function, Computer Measurement and Control, vol. 25, no. 2, 126-129, 2017.

[15] Y. LeCun, L. Bottou, G. B. Orr, and K. Müller, Efficient BackProp, Neural Networks: Tricks of the Trade, Springer Berlin Heidelberg, 9-50, 1998.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, In Proceedings of Advances in Neural Information Processing Systems, 1097-1105, 2012.

[17] A. L. Maas, A. Y. Hannun, and A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, In Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 2013.

[18] Q. Wei and W. Wang, Research on image retrieval using deep convolutional neural network combining L1 regularization and PReLU activation function, In IOP Conference Series: Earth and Environmental Science, vol. 69, no. 1, 012156, 2017.

[19] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.

[20] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, In Proceedings of the European Conference on Computer Vision, 818-833, 2014.

[21] P. Sermanet, D. Eigen, X. Zhang, et al., OverFeat: integrated recognition, localization and detection using convolutional networks, arXiv:1312.6229, 2013.

[22] K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, 1904-1916, 2015.

[23] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, arXiv:1405.3531, 2014.

[24] M. D. Zeiler, M. Ranzato, R. Monga, et al., On rectified linear units for speech processing, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3517-3521, 2013.

[25] M. Lin, Q. Chen, and S. Yan, Network in network, arXiv:1312.4400, 2013.

[26] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber, Compete to compute, In Proceedings of Advances in Neural Information Processing Systems, 2310-2318, 2013.

[27] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout networks, arXiv:1302.4389, 2013.

[28] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, Incorporating second-order functional knowledge for better option pricing, In Proceedings of Advances in Neural Information Processing Systems, 472-478, 2001.

[29] P. Lennie, The cost of cortical computation, Current Biology, vol. 13, no. 6, 493-497, 2003.