
Activation Functions and Their Characteristics in Deep Neural Networks

Bin Ding, Huimin Qian, Jun Zhou
College of Energy and Electrical Engineering, Hohai University, Nanjing, 211100
E-mail: [email protected], [email protected], [email protected]

Abstract: Deep neural networks have achieved remarkable results in many research areas, especially in computer vision and natural language processing. These successes depend on several aspects, among which the development of activation functions is one of the most important. Aware of this, a number of studies have concentrated on the performance improvements obtained by revising a certain activation function in some specific neural network, but we have noticed that few papers thoroughly review the activation functions employed by neural networks. Therefore, considering their impact on the performance of neural networks with deep architectures, the status and development of commonly used activation functions are investigated in this paper. More specifically, the definitions, the impacts on the neural networks, and the advantages and disadvantages of quite a few activation functions are discussed. Furthermore, experimental results on the MNIST dataset are employed to compare the performance of different activation functions.

Key Words: neural network, deep architecture, activation function

1 INTRODUCTION

In the past few years, tremendous improvements have been witnessed in the representation and recognition performance of neural networks with deep architectures [1, 2], due to which the landscapes of computer vision and natural language processing have been noticeably changed [3-5]. The revolutionary changes result from several crucial elements, including building more powerful models [4, 19-22, 25], accumulating larger-scale datasets [19, 21], developing higher-performance hardware, designing more effective regularization techniques [21, 23], and so on. Among these, the development of activation functions has also played a vital role in improving the performance of deep neural networks; thus, more and more efforts have been concentrated on the study of activation functions [6-12]. Since Nair and Hinton [7] first proposed rectified linear units to improve the performance of Restricted Boltzmann Machines, saturated activation functions, such as sigmoid and tanh, have been replaced by non-saturated counterparts, such as ReLU and ELU, to solve the so-called vanishing gradient problem and to accelerate convergence in neural networks with deep architectures, like convolutional and recurrent neural networks.

In order to understand the status and performance improvements of activation functions in deep neural networks thoroughly, the definitions, pros, and cons of commonly used activation functions will be discussed in this paper, and a comparison of experimental results on the MNIST dataset will be illustrated as well.

This work is supported by the Natural Science Foundation of Jiangsu Province under Grant BK20140860 and the Natural Science Foundation of China under Grant 61573001.

2 ACTIVATION FUNCTION

An artificial neural network is the analog of a biological neural network. Therefore, the primary task in constructing an artificial neural network is to design the artificial neuron model, since neurons are the basic units of a biological neural network. We first briefly explain the operating principle of a biological neuron. A biological neuron receives the electrical signals sent by other neurons with different weights. When the synthetic value of all the electrical signals is large enough to stimulate the neuron, it turns into the excited state and outputs a response. Otherwise, it keeps the inactive state.

According to this principle, the artificial neuron model is designed. Fig. 1 is the schematic diagram of an artificial neuron (AU), where {x_1, x_2, ..., x_n} are the inputs of the AU, {w_1, w_2, ..., w_n} are the weights corresponding to the inputs, b is the bias, and the addition unit Σ computes the linear weighted sum z of the inputs {x_1, x_2, ..., x_n} and the bias.

Figure 1: The schematic diagram of an artificial neuron.

Denote x = [x_1, x_2, ..., x_n] ∈ R^n and w = [w_1, ..., w_n] ∈ R^n; then the output z of the addition unit can be represented as

$$z = xw^{T} + b$$

The function g(·), which is named the activation function, is used to simulate the response state of the biological neuron and obtain the output y, that is,

$$y = g(z)$$

Caglar et al. gave the definition of an activation function in [13]: "An activation function is a function g : R → R that is differentiable almost everywhere."
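To make the neuron model concrete, here is a minimal sketch of the weighted sum z = xw^T + b and the response y = g(z) in Python/NumPy; the sigmoid used as g and the input values are only illustrative choices, not taken from the paper.

```python
import numpy as np

def neuron_output(x, w, b, g):
    """Single artificial neuron: weighted sum of the inputs plus bias, then activation."""
    z = np.dot(x, w) + b          # z = x . w + b
    return g(z)                   # y = g(z), the simulated response of the neuron

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # one possible activation function g

x = np.array([0.5, -1.2, 3.0])   # example inputs x_1, ..., x_n (arbitrary)
w = np.array([0.4, 0.1, -0.7])   # corresponding weights w_1, ..., w_n
b = 0.2                          # bias
print(neuron_output(x, w, b, sigmoid))
```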
Nonlinear activation functions give neural networks their nonlinear capabilities, which means that neural networks can discriminate data that cannot be classified linearly in the current data space.

The definition given by Caglar [13] suggests that any continuous, differentiable function can be used as an activation function. However, this is not always true for deep neural networks, due to the difficulties in training. Next, the commonly used activation functions will be analyzed thoroughly.

2.1 Sigmoid and Its Improvements

2.1.1 Sigmoid

The sigmoid function is one of the most common forms of activation functions. It is defined as follows,

$$g(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$

in which x ∈ (−∞, +∞) and g ∈ (0, 1), as seen in Fig. 2.

Figure 2: The graphic depiction of the sigmoid function and its derivative.

It is known that the modeling and training process of a multi-layer neural network can be divided into two parts: forward propagation and back propagation. In back propagation, the derivatives of the activation functions in each layer must be calculated. The sigmoid function is continuous and differentiable everywhere, and its derivative, g'(x) = g(x)(1 − g(x)), is easy to calculate. Therefore, the sigmoid function was commonly used in shallow neural networks. In addition, the sigmoid function is frequently employed in the output layer of a neural network owing to its value distribution. However, the sigmoid function is rarely adopted elsewhere in deep neural networks because of its saturation. More exactly, it is soft-saturating, since it only achieves zero gradient in the limit [13],

$$\lim_{x \to -\infty} g'(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} g'(x) = 0,$$

as seen in Fig. 2. This soft saturation results in difficulties when training a deep neural network. More specifically, in the process of optimizing the loss function, the derivative of the sigmoid function, which contributes to updating the weights and the bias, reduces to zero once the input reaches the saturation area, so the first several layers contribute less to the knowledge learned from the training samples. This situation is called the vanishing gradient. Generally, the vanishing gradient will emerge in a neural network with more than five layers [6].
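As a quick numerical illustration of equation (1) and the soft saturation just described, the following sketch (NumPy, arbitrary sample points) evaluates the sigmoid and its derivative g'(x) = g(x)(1 − g(x)); the gradient visibly collapses toward zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # equation (1)

def sigmoid_grad(x):
    g = sigmoid(x)
    return g * (1.0 - g)                      # g'(x) = g(x)(1 - g(x))

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    # the derivative peaks at x = 0 and approaches zero in both limits (soft saturation)
    print(f"x = {x:6.1f}   g(x) = {sigmoid(x):.4f}   g'(x) = {sigmoid_grad(x):.6f}")
```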

According to these limitations of the sigmoid function, several improvements have been proposed. Huang et al. [14] introduced two parameters into the sigmoid function: one is used to generate an appropriate input that does not drive the function into the saturation area; the other is used to control the decay of the residual error. Caglar et al. [13] proposed injecting noise into the activation function, which makes the loss function easier to optimize. Another development is the hyperbolic tangent function, which is illustrated next.

2.1.2 Hyperbolic Tangent

The hyperbolic tangent function can be defined as the ratio between the hyperbolic sine and the hyperbolic cosine functions,

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} \qquad (2)$$

It is similar to the sigmoid function and can be deduced from the sigmoid function in (1) as follows,

$$\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$$

where sigmoid(·) is g(x) in equation (1). The hyperbolic tangent function ranges between -1 and 1, as seen in Fig. 3. It is also a continuous and monotonic function, and it is differentiable everywhere. It is symmetric about the origin (see Fig. 3), so its outputs, namely the inputs of the next layer, are more likely to be close to zero on average. Therefore, the hyperbolic tangent function is preferred over the sigmoid function. In addition, neural networks with hyperbolic tangent activation functions converge faster than those with sigmoid activation functions [15], and they achieve lower classification error than those with sigmoid activation functions [6].

Figure 3: The graphic depiction of the hyperbolic tangent function and its derivatives.

However, the calculation of the derivative of the hyperbolic tangent function, listed as follows,

$$\tanh'(x) = 4\,\mathrm{sigmoid}'(2x)$$

is more complicated than that of the sigmoid function. Moreover, the hyperbolic tangent has the same soft saturation as the sigmoid function, so it also suffers from the vanishing gradient problem.
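The relation tanh(x) = 2 sigmoid(2x) − 1 and the soft saturation of the hyperbolic tangent are easy to check numerically; the short sketch below uses NumPy and arbitrary sample points.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
# tanh(x) = 2 * sigmoid(2x) - 1 holds numerically
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))

# tanh'(x) = 1 - tanh(x)^2, which equals 4 * sigmoid'(2x); it vanishes for large |x|
print(1.0 - np.tanh(x) ** 2)
```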
2.2 ReLU and Its Improvements

Neuroscience research has found that cortical neurons are rarely in their maximum saturation regime, which suggests that their activation function can be approximated by a rectifier [29]. That is to say, the operating mode of the neurons has the characteristic of sparsity. More specifically, only one to four percent of the neurons in the brain are activated simultaneously. However, in neural networks with sigmoid or hyperbolic tangent activation functions, almost half of the neuron units are activated at the same time, which is inconsistent with the neuroscience research. Furthermore, activating more neuron units brings about more difficulties in the training of a deep neural network. Rectified linear units (ReLU), first introduced by Nair and Hinton for Restricted Boltzmann Machines [7], help the hidden layers of the neural network obtain sparse outputs, which improves efficiency. The ReLU function and its improvements are currently almost the most popular activation functions used in deep neural networks [6, 16, 24, 26]. Although the main difficulty of training deep networks was resolved by the idea of initializing each layer by unsupervised learning, the employment of ReLU activation functions can also be seen as a breakthrough in the directly supervised training of deep networks.

2.2.1 ReLU

The definition of ReLU is as follows,

$$g(x) = \max(0, x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases} \qquad (3)$$

The graph is depicted in Fig. 4. The ReLU activation function is non-saturating, and its derivative function is

$$g'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$

which is a constant when the input x > 0. Thus, the vanishing gradient problem can be relieved.

Figure 4: The graphic depiction of the ReLU function and its derivatives.

Specifically, the ReLU function has the following advantages [6, 16]:

- Computations in neural networks with ReLU functions are cheaper than with sigmoid or hyperbolic tangent activation functions, because there is no need to compute exponential functions in the activations.
- Neural networks with ReLU activation functions converge much faster than those with saturating activation functions in terms of training time with gradient descent.
- The ReLU function allows a network to easily obtain a sparse representation. More specifically, the output is 0 when the input x < 0, which provides sparsity in the activation of neuron units and improves the efficiency of data learning. When the input x > 0, the features of the data are largely retained.
- The derivative of the ReLU function stays at the constant 1 for positive inputs, which helps avoid being trapped in local optima and resolves the vanishing gradient effect that occurs with sigmoid and hyperbolic tangent activation functions.
- Deep neural networks with ReLU activation functions can reach their best performance on purely supervised tasks with large labeled datasets without requiring any unsupervised pre-training.

However, the derivative g'(x) = 0 when x < 0, so the ReLU function is left hard-saturating. The corresponding weights might not be updated any more, which leads to the death of some neuron units, meaning that these neuron units will never be activated again. Another disadvantage of ReLU is that the average of the units' outputs is identically positive, which causes a bias shift for units in the next layer. Both attributes have a negative impact on the convergence of the corresponding deep neural networks [17].
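A minimal sketch of equation (3) and its derivative follows; it also measures how many units of a random batch output exactly zero, illustrating the sparsity and the hard zero gradient for x < 0 discussed above. The random pre-activations are an arbitrary stand-in.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # g(x) = max(0, x), equation (3)

def relu_grad(x):
    return (x > 0).astype(float)         # 1 for x > 0, 0 otherwise (left hard saturation)

rng = np.random.default_rng(0)
z = rng.normal(size=1000)                # arbitrary pre-activations of a hypothetical layer
print("fraction of units output as zero:", np.mean(relu(z) == 0.0))   # roughly one half here
print("gradients:", relu_grad(np.array([-2.0, 0.5, 3.0])))
```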
2.2.2 LReLU, PReLU, and RReLU

The possible death of some neuron units in networks with ReLU functions results from the compulsive operation of letting g(x) = 0 when x < 0. In order to alleviate this potential problem, Maas et al. [17] proposed the leaky rectified linear unit (LReLU), given in equation (4),

$$g(x) = \begin{cases} x & \text{if } x \ge 0 \\ 0.01x & \text{if } x < 0 \end{cases} \qquad (4)$$

Its derivative function is

$$g'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0.01 & \text{otherwise} \end{cases}$$

Fig. 5 gives the graph of the LReLU function, which is nearly identical to the ReLU function in Fig. 4.

Figure 5: The graphic depiction of the LReLU, PReLU, and RReLU functions.

As we can see, the LReLU allows a small, non-zero gradient when the unit is saturated and not active. Therefore, there are no zero gradients, and no neuron unit can be permanently "off". It should be acknowledged that the sparsity is lost when replacing ReLU with LReLU. Fortunately, the experimental results in [17] illustrated that the impact of this modification on classification accuracy is negligible, while the learning capabilities of the neural networks become more robust.

Furthermore, He et al. [12] presented the parametric rectified linear unit (PReLU) by replacing the constant 0.01 in equation (4) with a learnable parameter. The definition of PReLU is

$$g(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{if } x \le 0 \end{cases} \qquad (5)$$

where a is a learnable parameter. The experiments in [12] show that LReLU can lead to better results than ReLU and PReLU by choosing the value of a very carefully, but this requires tedious, repeated training. In contrast, PReLU learns the parameter from the data. It is verified that PReLU converges faster and has a lower training error than ReLU. In addition, it is reported that introducing the parameter a into the activation function does not bring about over-fitting [12]. Wei et al. [18] applied a deep convolutional neural network combining L1 regularization and the PReLU activation function to image retrieval; the experiments demonstrate that the over-fitting problem is indeed resolved and the efficiency of image retrieval is improved by adopting PReLU activation functions.
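The sketch below contrasts LReLU in equation (4) with PReLU in equation (5); the gradient of the output with respect to the slope a is what allows a to be learned by back-propagation. The initial value a = 0.25 follows [12]; the sample inputs are arbitrary.

```python
import numpy as np

def lrelu(x):
    return np.where(x >= 0, x, 0.01 * x)      # equation (4): fixed negative slope 0.01

def prelu(x, a):
    return np.where(x > 0, x, a * x)          # equation (5): slope a is learnable

def prelu_grad_a(x):
    return np.where(x > 0, 0.0, x)            # d g(x) / d a, used when updating a by back-propagation

x = np.array([-3.0, -0.5, 0.0, 1.0, 2.0])
print(lrelu(x))
print(prelu(x, a=0.25))                       # a initialised to 0.25 as in [12]
print(prelu_grad_a(x))
```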
Another improvement of ReLU, namely the randomized rectified linear unit (RReLU), should also be discussed. As noted above, the slope of the negative part is set as a constant in LReLU and as a learnable parameter in PReLU, respectively. In RReLU [10], the slopes are randomized within a given range during training and then fixed during testing. The definition of RReLU is

$$g(x) = \begin{cases} x & \text{if } x \ge 0 \\ ax & \text{if } x < 0 \end{cases} \qquad (6)$$

where

$$a \sim U(A, B), \quad A < B \ \text{and} \ A, B \in [0, 1)$$

In the training process, a is a random number sampled from the uniform distribution U(A, B). In the test process, the average of all the values of a used in training is taken, and the parameter is set to (A + B)/2 in [10]. The performance of RReLU has been reported to be better than that of ReLU, LReLU, and PReLU in specific experiments [10].
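A small sketch of equation (6) with the training/test rule described above: the negative slope is drawn from U(A, B) during training and replaced by the fixed average (A + B)/2 at test time. The bounds A = 1/8 and B = 1/3 are assumed here (they correspond to the range reported in [10]).

```python
import numpy as np

def rrelu(x, A=1/8, B=1/3, training=True, rng=np.random.default_rng(0)):
    if training:
        a = rng.uniform(A, B, size=x.shape)   # equation (6): random slope per unit while training
    else:
        a = (A + B) / 2.0                     # fixed average slope at test time
    return np.where(x >= 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(rrelu(x, training=True))                # stochastic negative part
print(rrelu(x, training=False))               # deterministic negative part
```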

2.3 ELU and Its Improvements

In order to push the activation mean closer to zero and thus decrease the bias shift effect of ReLU, the exponential linear unit (ELU) with α > 0 was proposed as follows [11],

$$g(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \le 0 \end{cases} \qquad (7)$$

And its derivative function is

$$g'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha e^{x} & \text{if } x \le 0 \end{cases}$$

Figure 6: The graphic depiction of the ELU and MPELU functions.

The graphic depiction is shown in Fig. 6. The parameter α controls the value to which an ELU saturates for negative network inputs. The vanishing gradient problem is alleviated because the positive part of ELU is the identity; more specifically, the derivative is one when x > 0. The left-saturation makes deep neural networks with ELU activation functions more robust to input perturbations and noise.
The output average of ELU approaches zero, which contributes to faster convergence. Experimental results in [11] have shown that ELU enables faster learning, and the generalization performance of ELU is better than that of the ReLU and LReLU activation functions when the deep neural network has more than five layers.
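The sketch below implements equation (7) and its derivative, and compares the output mean of ReLU and ELU on a random batch to illustrate the reduced bias shift; α = 1 is an assumed, commonly used value.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # equation (7)

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))          # 1 for x > 0, alpha * e^x otherwise

rng = np.random.default_rng(0)
z = rng.normal(size=10000)                                  # arbitrary pre-activations
print("mean ReLU output:", np.maximum(0.0, z).mean())       # clearly positive (bias shift)
print("mean ELU output: ", elu(z).mean())                   # noticeably closer to zero
```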
However, ELU shares a drawback with LReLU: searching for a reasonable α is important but time-consuming. Accordingly, Li et al. [8] proposed the multiple parametric exponential linear unit (MPELU),

$$g(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{\beta x} - 1) & \text{if } x \le 0 \end{cases} \qquad (8)$$

in which α and β are learnable parameters that control the value to which, and the rate at which, MPELU saturates, respectively. It has been reported that MPELU can become ELU, ReLU, LReLU, or PReLU by adjusting the two parameters α and β. Therefore, the advantages of MPELU include: 1) the convergence property of ELU is also possessed by MPELU; 2) the generalization capability of MPELU is better than that of ReLU and PReLU, based on the experiments on the ImageNet database.
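To illustrate the claim that MPELU generalizes the earlier units, the following sketch implements equation (8) and shows ELU-like, ReLU-like, and LReLU/PReLU-like behaviour for particular (here fixed, not learned) choices of α and β; the specific values are assumptions for demonstration only.

```python
import numpy as np

def mpelu(x, alpha, beta):
    return np.where(x > 0, x, alpha * (np.exp(beta * x) - 1.0))   # equation (8)

x = np.array([-4.0, -1.0, 0.5, 2.0])
print(mpelu(x, alpha=1.0, beta=1.0))    # behaves like ELU with alpha = 1
print(mpelu(x, alpha=0.0, beta=1.0))    # negative part collapses to 0: ReLU-like
print(mpelu(x, alpha=1.0, beta=0.01))   # small beta: negative part ~ alpha*beta*x, LReLU/PReLU-like
```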
3 EXPERIMENTS

In this paper, we have conducted experiments with a deep convolutional neural network (DCNN), whose structure and parameters can be seen in Fig. 7. From the figure we can see that the DCNN contains two 5x5 convolutional layers and two 2x2 max-pooling layers, with the stride fixed to 1 pixel, followed by a fully-connected (FC) layer. During training, the input to our DCNN is a fixed-size 28x28 gray image. The dataset is segmented into a training set with 60,000 samples and a test set with 10,000 samples. The cross-entropy loss function is used, the learning rate is chosen as e-4, and the number of iterations is 20,000.

The activation functions sigmoid, hyperbolic tangent, ReLU, RReLU, and ELU are each considered for the DCNN in Fig. 7. For RReLU and ELU, we conducted different experiments by adjusting the values of the parameter a for RReLU and α for ELU to choose their best performance. The classification errors with the different activation functions are listed in Tab. 1.

Table 1: Classification error comparisons of DCNNs with different activation functions on MNIST

Activation function    Parameter    Error (%)
Sigmoid                -            1.15
Tanh                   -            1.12
ReLU                   -            0.8
RReLU                  a            0.99
ELU                    α            1.1

It can be seen that the network in this paper with the ReLU function achieves the best classification performance among all the activation functions considered.
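For reference, a sketch of a network with the structure described above is given below in PyTorch. The paper does not state the implementation framework, the number of feature maps, the padding, or the pooling stride, so those are assumptions here; only the two 5x5 convolutions, the two 2x2 max-poolings, the single fully-connected layer, the 28x28 gray input, and the cross-entropy loss come from the text, and the 1e-4 learning rate is a plausible reading of the "e-4" value quoted above. The activation is left pluggable so the comparison of Tab. 1 can be reproduced in spirit.

```python
import torch
import torch.nn as nn

class SmallDCNN(nn.Module):
    def __init__(self, act=nn.ReLU):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2),   # 28x28 -> 28x28, 32 maps (assumed)
            act(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),  # 14x14 -> 14x14, 64 maps (assumed)
            act(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(64 * 7 * 7, 10)                 # single FC layer, 10 MNIST classes

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SmallDCNN(act=nn.ELU)                 # swap in nn.Sigmoid, nn.Tanh, nn.ReLU, nn.RReLU, ...
criterion = nn.CrossEntropyLoss()             # cross-entropy loss, as in the text
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

images = torch.randn(8, 1, 28, 28)            # stand-in for a batch of MNIST gray images
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```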
4 CONCLUSION

In the last decade, deep neural networks have improved rapidly, especially in computer vision and natural language processing. As networks have become deeper, training efficiency and accuracy have received much attention, which has stimulated the development of activation functions. Saturated activation functions, like the sigmoid and hyperbolic tangent, have been replaced by non-saturated counterparts, such as ReLU and ELU. In this paper, the definitions, pros, and cons of several popular activation functions are reviewed. It should be acknowledged that some effective activation functions, such as maxout [27] and softplus [28], have not been investigated in this paper. The aim of this paper is to contribute to the understanding of the development progress, attributes, and appropriate choice of activation functions. Further investigation and analysis are required to improve the views in this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, vol. 521, no. 7553, 436-444, 2015.

[2] J. Schmidhuber, Deep learning in neural networks: an overview, Neural Networks, vol. 61, 85-117, 2015.

[3] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, et al., Deep learning for visual understanding: a review, Neurocomputing, vol. 187, 27-48, 2016.

[4] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9, 2015.

[5] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.

[6] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Journal of Machine Learning Research, vol. 9, 249-256, 2010.

[7] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, In Proceedings of the International Conference on Machine Learning, 807-814, 2010.

[8] Y. Li, C. Fan, Y. Li, and Q. Wu, Improving deep neural network with multiple parametric exponential linear units, arXiv:1606.00305, 2016.

[9] A. L. Maas, A. Y. Hannun, and A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, In Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 1-6, 2013.

[10] B. Xu, N. Wang, T. Chen, and M. Li, Empirical evaluation of rectified activations in convolutional network, arXiv:1505.00853, 2015.

[11] D. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), arXiv:1511.07289, 2015.
Figure 7: The structure of the DCNN employed in this paper (convolution, max-pooling with kernel size 2 and stride 2, inner-product, ReLU, and softmax-with-loss layers).

[12] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, In Proceedings of the IEEE International Conference on Computer Vision, 1026-1034, 2015.

[13] C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, Noisy activation functions, In Proceedings of the International Conference on Machine Learning, 3059-3068, 2016.

[14] Y. Huang, X. Duan, S. Sun, et al., A study of training algorithm in deep neural networks based on Sigmoid activation function, Computer Measurement and Control, vol. 25, no. 2, 126-129, 2017.

[15] Y. LeCun, L. Bottou, G. B. Orr, and K. Müller, Efficient BackProp, Neural Networks: Tricks of the Trade, Springer Berlin Heidelberg, 9-50, 1998.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, In Proceedings of Advances in Neural Information Processing Systems, 1097-1105, 2012.

[17] A. L. Maas, A. Y. Hannun, and A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, In Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 2013.

[18] Q. Wei and W. Wang, Research on image retrieval using deep convolutional neural network combining L1 regularization and PReLU activation function, In IOP Conference Series: Earth and Environmental Science, vol. 69, no. 1, 012156, 2017.

[19] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, 2014.

[20] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, In Proceedings of the European Conference on Computer Vision, 818-833, 2014.

[21] P. Sermanet, D. Eigen, X. Zhang, et al., OverFeat: integrated recognition, localization and detection using convolutional networks, arXiv:1312.6229, 2013.

[22] K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, 1904-1916, 2015.

[23] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, arXiv:1405.3531, 2014.

[24] M. D. Zeiler, M. Ranzato, R. Monga, et al., On rectified linear units for speech processing, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3517-3521, 2013.

[25] M. Lin, Q. Chen, and S. Yan, Network in network, arXiv:1312.4400, 2013.

[26] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber, Compete to compute, In Proceedings of Advances in Neural Information Processing Systems, 2310-2318, 2013.

[27] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout networks, arXiv:1302.4389, 2013.

[28] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, Incorporating second-order functional knowledge for better option pricing, In Proceedings of Advances in Neural Information Processing Systems, 472-478, 2001.

[29] P. Lennie, The cost of cortical computation, Current Biology, vol. 13, no. 6, 493-497, 2003.
