Activation Functions and Initialization Methods
Leonid Datta
Delft University of Technology
Abstract
In an artificial neural network, the activation function and the weight initialization method play important roles in the training and performance of the network. The question arises: which properties of a function are important or necessary for it to be a well-performing activation function? In addition, the most widely used weight initialization methods, Xavier and He normal initialization, have a fundamental connection with the activation function. This survey discusses the important and necessary properties of an activation function and the most widely used activation functions (sigmoid, tanh, ReLU, LReLU and PReLU). It also explores the relationship between these activation functions and the two weight initialization methods, Xavier and He normal initialization.
1 Introduction
Artificial intelligence has long aimed to build intelligent machines [1]. Artificial neural networks have played an important role in achieving this goal [2] [3]. When an artificial neural network is built to execute a task, it is programmed to perceive a pattern, and its main task is to learn this pattern from data [1].
An artificial neural network is composed of a large number of interconnected working units known as perceptrons or neurons. A perceptron is composed of four components: the input node, the weight vector, the activation function and the output node. The first component, the input node, receives the input vector. I assume an m-dimensional input vector x = (x_1, x_2, ..., x_m). The second component is the weight vector, which has the same dimension as the input vector; here the weight vector is w = (w_1, w_2, ..., w_m). From the input vector and the weight vector, the inner product x^T w is calculated. While calculating the inner product, an extra term 1 is added to the input vector together with an initial weight value (here w_0). This is known as the bias. The bias term acts as an intercept so that the inner product can easily be shifted using the weight w_0. After the bias has been added, the input vector and the weight vector become x = (1, x_1, x_2, ..., x_m) and w = (w_0, w_1, w_2, ..., w_m). Thus, x^T w = w_0 + w_1 x_1 + w_2 x_2 + ... + w_m x_m.
The inner product x^T w is used as input to a function known as the activation function (here symbolized as f). The activation function is the third component of the perceptron. The output of the activation function is sent to the output node, which is the fourth component. This is how an input vector flows from the input node to the output node, a process known as forward propagation.
In mathematical terms, the output y(x_1, ..., x_m) can be expressed as

y(x_1, ..., x_m) = f(w_0 + w_1 x_1 + w_2 x_2 + ... + w_m x_m)     (1)
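As a concrete illustration of equation (1), the following minimal sketch computes the forward pass of a single perceptron in NumPy. The weight values, the choice of a sigmoid as the activation f and the helper names sigmoid and forward are illustrative assumptions, not part of the original formulation.

import numpy as np

def sigmoid(z):
    # Logistic sigmoid, used here only as an example activation f
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, f=sigmoid):
    # Forward pass of one perceptron: y = f(w0 + w1*x1 + ... + wm*xm)
    # x: input vector of length m (without the bias term)
    # w: weight vector of length m + 1, where w[0] is the bias weight w0
    x_aug = np.concatenate(([1.0], x))   # prepend the constant 1 for the bias
    z = x_aug @ w                        # inner product x^T w
    return f(z)

# Example usage with an arbitrary 3-dimensional input
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])      # w0, w1, w2, w3
print(forward(x, w))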
where z is the inner product z = x^T w and y = f(z). This factorization of the gradient, written out below, is known as the chain rule. The weight update is done in the direction opposite to the forward propagation: starting from the last layer and gradually moving to the first layer. This process is known as backward propagation or backpropagation. One forward propagation and one backward propagation over all the training examples (the training data) is termed an epoch.
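For completeness, the chain-rule factorization referred to above can be written out for a single perceptron with z = x^T w and y = f(z). The loss symbol L is introduced here only for illustration, since the surrounding derivation is not reproduced in full:

∂L/∂w_i = (∂L/∂y) · (∂y/∂z) · (∂z/∂w_i) = (∂L/∂y) · f'(z) · x_i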
Since the weight vector gets updated during training, it needs to be assigned initial values before the training starts. Assigning these initial values to the weight vector is known as weight initialization. Once the weight vector is updated, the input is again passed through the network in the forward direction to generate the output and calculate the loss. This process continues until the loss reaches a satisfactory minimum value. A network is said to converge when the loss achieves this satisfactory minimum value, and this process is called the training of a neural network. When an artificial neural network is to perform a task, it is first trained.
One of the most interesting characteristics of an artificial neural network is its ability to adapt its behaviour to the changing characteristics of the task. The activation function plays an important role in this behaviour adaptation and learning [4]. The choice of the activation function at different layers in a neural network is important, as it decides how the data will be presented to the next layer. It also controls how bounded the data passed on to the next layer or to the output layer will be (that is, the range of the data). When the network is sufficiently deep, learning the pattern becomes less difficult for common nonlinear functions [5] [6]. But when the network is not complex and deep, the choice of the activation function has a more significant effect on the learning and the performance of the neural network than it has in complex and deep networks [4].
Weight initialization plays an important role in the speed of training a neural network [7]. More precisely, it controls the speed of backpropagation, because the weights are updated during backpropagation [8]. If the weight initialization is not proper, it can lead to poor weight updates, so that the weights saturate at an early stage of training [9]. This can cause the training process to fail. Among the different weight initialization methods for artificial neural networks, Xavier and He normal initialization have gained popularity since they were proposed [10] [11] [12] [13] [14] [15].
1.1 Contribution
The main contributions of this survey are:
1. This survey discusses the necessary and important properties an activation function is expected to have and explores why they are necessary or important.
2. This survey discusses five widely used activation functions - sigmoid, tanh, ReLU, LReLU and PReLU - and why they are so widely used. It also discusses the problems faced by these activation functions and why they face them.
3. This survey discusses the Xavier and He normal weight initialization methods and their relation with the activation functions mentioned above.
2 Activation function
The activation function, also known as the transfer function, is the nonlinear function applied to the inner product x^T w in an artificial neural network.
Before discussing the properties of activation functions, it is important to know what a sigmoidal activation function is. A function is called sigmoidal in nature if it has all of the following properties:
1. it is continuous
2. it is bounded between a finite lower and upper bound
3. it is nonlinear
1. Nonlinear: There are two reasons why an activation function should be nonlinear:
(a) The boundaries or patterns in real-world problems cannot always be expected to be linear. A nonlinear function can easily approximate a linear boundary, whereas a linear function cannot approximate a nonlinear boundary. Since an artificial neural network learns the pattern or boundary from data, nonlinearity in the activation function is necessary so that the network can learn any linear or nonlinear boundary.
(b) If the activation function is linear, then a perceptron with multiple hidden layers can be collapsed into a single-layer perceptron, because a linear combination of another linear combination of the input vector can simply be expressed as a single linear combination of the input vector. In that case, the depth of the network has no effect, as the sketch below illustrates. This is another reason why activation functions are nonlinear.
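A minimal numerical sketch of point (b), with two arbitrary weight matrices W1 and W2 chosen only for illustration: without a nonlinearity, the two-layer map W2(W1 x) equals the single-layer map (W2 W1) x.

import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with purely linear activations (f(z) = z)
W1 = rng.normal(size=(4, 3))   # first layer weights (illustrative values)
W2 = rng.normal(size=(2, 4))   # second layer weights

x = rng.normal(size=3)         # an arbitrary input vector

two_layer_output = W2 @ (W1 @ x)       # forward pass through both layers
collapsed_output = (W2 @ W1) @ x       # equivalent single linear layer

# The two outputs agree, so the extra layer adds no expressive power
print(np.allclose(two_layer_output, collapsed_output))   # True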
6. Computational cost: The computational cost of an activation function is defined as the time required to produce its output when an input is fed to it. Along with the computational cost of the activation function itself, the computational cost of its gradient is important, since the gradient is calculated during the weight update in backpropagation. Gradient descent optimization is itself a time-consuming process that requires many iterations, so computational cost is an issue. If the activation function and its gradient have a low computational cost, the network requires less time to be trained; if they have a high computational cost, training takes more time. An activation function whose output and gradient are both cheap to compute is therefore preferable, as it saves time and energy (see the timing sketch below).
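The following rough sketch compares the evaluation time of the sigmoid and the ReLU (both discussed later in this section) together with their gradients. It assumes vectorized NumPy implementations, and the measured times are only indicative of relative cost on a single machine.

import time
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # involves an exponential

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                      # f'(z) = f(z)(1 - f(z))

def relu(z):
    return np.maximum(0.0, z)                 # simple thresholding

def relu_grad(z):
    return (z > 0).astype(z.dtype)            # 0 or 1, no exponential

def time_it(fn, z, repeats=100):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(z)
    return (time.perf_counter() - start) / repeats

z = np.random.default_rng(0).normal(size=1_000_000)
for name, fn in [("sigmoid", sigmoid), ("sigmoid_grad", sigmoid_grad),
                 ("relu", relu), ("relu_grad", relu_grad)]:
    print(f"{name:13s} {time_it(fn, z) * 1e3:.2f} ms per call")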
while the network performs. These forcefully deactivated neurons are called 'dead neurons', and this problem is termed the dead neuron problem.
Figure 2: (a) Graph of the sigmoid function; (b) graph of the gradient of the sigmoid function.
The sigmoid function contains an exponential term, as can be seen from its definition. Exponential functions have a high computational cost, and as a result the sigmoid function is computationally expensive. Although the function itself is expensive, its gradient is not: it can be calculated using the formula f'(x) = f(x)(1 − f(x)).
The sigmoid function suffers from some major drawbacks as well. It is bounded in the range (0, 1) and hence always produces a non-negative output, so it is not a zero-centered activation function. The sigmoid function also compresses a large range of inputs into the small range (0, 1), so a large change in the input value leads to only a small change in the output value. This results in small gradient values as well, and because of these small gradient values the function suffers from the vanishing gradient problem.
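A small numerical sketch of this saturation effect, using the gradient formula f'(x) = f(x)(1 − f(x)) quoted above; the sample points are arbitrary and chosen only for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for x = 0 and decays rapidly for large |x|,
# which is the root of the vanishing gradient problem.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")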
To retain the benefits of the sigmoid function while adding a zero-centered nature, the hyperbolic tangent or tanh function was introduced:

f(x) = (1 − e^(−2x)) / (1 + e^(−2x))

where x is the input to the activation function.
The tanh function can be seen as a modified version of the sigmoid function, because it can be expressed as tanh(x) = 2 sigmoid(2x) − 1. For this reason it has all the properties of the sigmoid function: it is continuous, differentiable and bounded. Its range is (−1, 1), so it produces negative, positive and zero outputs. It is thus a zero-centered activation function and solves the 'not a zero-centered activation function' problem of the sigmoid function.
Figure 3: (a) Graph of the tanh function; (b) graph of the gradient of the tanh function.
The tanh function also belongs to the sigmoidal group of functions, and thus the Cybenko theorem holds for tanh as well. The main advantage provided by the tanh function is that it produces zero-centered output and thereby aids the backpropagation process [21].
Tanh is computationally expensive for the same reason as the sigmoid: it is exponential in nature. Its gradient calculation, however, is not expensive; the gradient can be calculated using f'(x) = 1 − f(x)^2. The graph of tanh and the corresponding gradient is shown in figure 3.
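A brief numerical check of the two identities quoted above, tanh(x) = 2 sigmoid(2x) − 1 and f'(x) = 1 − f(x)^2, using a central finite difference as an illustrative comparison for the gradient.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)

# tanh expressed through the sigmoid
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True

# analytic gradient 1 - tanh(x)^2 versus a central finite difference
eps = 1e-6
numeric_grad = (np.tanh(x + eps) - np.tanh(x - eps)) / (2.0 * eps)
analytic_grad = 1.0 - np.tanh(x) ** 2
print(np.allclose(numeric_grad, analytic_grad, atol=1e-6))      # True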
The tanh function, in a similar way to the sigmoid, compresses a large range of inputs into the small range (−1, 1). So a large change in the input value leads to only a small change in the output value, which results in gradient values close to zero. Because of these close-to-zero gradient values, tanh also suffers from the vanishing gradient problem. The problem of vanishing gradients spurred further research into activation functions, and ReLU was introduced.
Figure 4: (a) Graph of the ReLU function; (b) graph of the gradient of the ReLU function.
by deactivating a large part of it. Hence, it suffers from the dead neuron problem, which is already explained in section 2.2.2. This is also termed the dying ReLU problem, and the neurons which are deactivated by ReLU are called dead neurons.
The dying ReLU problem led to a new variant of ReLU called the Leaky ReLU or LReLU; a brief sketch contrasting the two follows below.
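As a rough sketch of the difference, assuming the commonly used LReLU slope of 0.01 on the negative side (the exact slope is a design choice, not fixed by the text above): ReLU zeroes out negative pre-activations and their gradients, while LReLU keeps a small non-zero gradient there.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def lrelu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def relu_grad(z):
    return (z > 0).astype(float)            # exactly 0 for z < 0: no learning signal

def lrelu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)      # small but non-zero for z < 0

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z), relu_grad(z))     # negative inputs give output 0 and gradient 0
print(lrelu(z), lrelu_grad(z))   # negative inputs still receive a small gradient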
Figure 5: (a) Graph of the LReLU function; (b) graph of the gradient of the LReLU function.
2.3.5 Parametric ReLU function
The parametric ReLU or PReLU function was introduced by He et al. [11]. It is defined as

f(x) = ax  for x ≤ 0
f(x) = x   otherwise

where a is a learnable parameter and x is the input to the activation function. When a is fixed at 0.01, the PReLU function becomes the leaky ReLU, and for a = 0 it becomes ReLU. That is why PReLU can be seen as a general representation of rectifier nonlinearities.
The PReLU function is continuous, not bounded and zero-centered. When x < 0 the gradient of the function is a, and when x > 0 the gradient is 1. The function is not differentiable at x = 0, because the right-hand and left-hand derivatives are not equal there. On the positive part of PReLU, where the gradient is always 1, there is no vanishing gradient problem. On the negative part, however, the gradient is always a, which is typically close to zero, so there is a risk of a vanishing gradient problem. The graph of the PReLU function and its gradient is shown in figure 6.
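A minimal sketch of a PReLU layer with a learnable slope, assuming a single shared parameter a and a plain gradient-descent update; the initial value a = 0.25 and the learning rate are illustrative assumptions only.

import numpy as np

class PReLU:
    # Parametric ReLU with one learnable negative-side slope a

    def __init__(self, a=0.25):
        self.a = a

    def forward(self, x):
        self.x = x                                   # cache input for the backward pass
        return np.where(x > 0, x, self.a * x)

    def backward(self, grad_out, lr=0.01):
        # Gradient w.r.t. the input: 1 where x > 0, a where x <= 0
        grad_x = np.where(self.x > 0, 1.0, self.a) * grad_out
        # Gradient w.r.t. the learnable slope a: x where x <= 0, 0 elsewhere
        grad_a = np.sum(np.where(self.x > 0, 0.0, self.x) * grad_out)
        self.a -= lr * grad_a                        # plain gradient-descent update
        return grad_x

prelu = PReLU()
x = np.array([-2.0, -0.5, 1.0, 3.0])
y = prelu.forward(x)
grad_in = prelu.backward(np.ones_like(x))
print(y, grad_in, prelu.a)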
Figure 6: (a) Graph of the PReLU function; (b) graph of the gradient of the PReLU function.
Table 1: Calculation cost of activation functions and their gradients
3 Weight initialization
At the beginning of the training process, assigning the initial values to the weight vector is known as weight initialization. Rumelhart et al. first tried assigning equal initial values to the weight vector, but it was observed that the weights then move in groups during the weight update and maintain a symmetry [29]. This causes the network to fail to train properly. As a solution, Rumelhart et al. proposed the random weight initialization method [30], in which the initial weight values are chosen uniformly from a range [−δ, δ] [29]. The Xavier and He normal weight initialization methods have been used widely since they were proposed.
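A small sketch of the symmetry problem caused by equal initial weights, assuming a toy two-layer network with a tanh hidden layer and a squared-error loss chosen purely for illustration: hidden units that start out identical receive identical gradients and therefore stay identical.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                 # toy inputs
t = rng.normal(size=(8, 1))                 # toy targets

# All hidden units start with the same weights
W1 = np.full((3, 4), 0.5)
W2 = np.full((4, 1), 0.5)

h = np.tanh(x @ W1)                         # hidden activations (identical columns)
y = h @ W2                                  # network output
err = y - t                                 # derivative of 0.5 * squared error

# Backpropagated gradients
grad_W2 = h.T @ err
grad_W1 = x.T @ ((err @ W2.T) * (1 - h**2))

# Every hidden unit gets exactly the same gradient, so after any number of
# updates the units remain copies of each other.
print(np.allclose(grad_W1, grad_W1[:, [0]]))   # True
print(np.allclose(grad_W2, grad_W2[0]))        # True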
incoming network connections and n_j is the number of outgoing network connections from that layer.
The initial objective of the paper by Glorot and Bengio was to explore why standard gradient descent from random initialization performs poorly on deep neural networks. They find that the logistic sigmoid activation is not suited to deep neural networks with random initialization, because it leads to saturation of the initial layers at a very early stage of training. They also find that saturated units can move out of saturation by themselves with proper initialization [31].
To find a better range for the random initial weight values, they equated the variance across layers: the variance of the input and output of a layer is matched so that not much variance is lost between the input and output of a single layer. With this idea, the solution found is a range for the initial weight values:

U[−√6 / √(n_i + n_j),  √6 / √(n_i + n_j)]

where n_i is the number of incoming network connections, n_j is the number of outgoing network connections from that layer and U is a uniform distribution.
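A short sketch of this initialization rule, assuming a plain NumPy implementation; the He normal variant shown alongside for comparison uses the standard deviation sqrt(2 / n_i) commonly associated with He et al. [11], and the layer sizes are arbitrary.

import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng()):
    # Draw weights uniformly from [-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng=np.random.default_rng()):
    # Draw weights from a zero-mean normal with standard deviation sqrt(2 / n_in)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

# Example: initialize a 256 -> 128 fully connected layer both ways
W_xavier = xavier_uniform(256, 128)
W_he = he_normal(256, 128)
print(W_xavier.std(), W_he.std())   # empirical spread of the two schemes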
4 Discussion
Glorot and Bengio strongly suggest that the sigmoid activation function saturates easily at a very early stage of training, which causes the training to fail. They suggest the tanh activation function as a good alternative to the sigmoid because it does not saturate as easily. The main advantage of the tanh function over the sigmoid, as they show in the paper, is that its mean value is 0 (more precisely, it is a zero-centered activation function) [31]. As stated in the section 'Theoretical Considerations and a New Normalized Initialization' of their paper, the derivation assumes a linear regime of the network. Sun et al. make the same claim: Xavier initialization typically considers a linear activation function at each layer in order to bound the gradient values within a range [32].
5 Conclusion
The usage of sigmoidal activation functions is decreasing as rectifier nonlinearities become more popular. On the one hand, rectifier nonlinearities, especially ReLU, perform well with He normal initialization in several kinds of networks. On the other hand, the performance of He normal initialization beats that of Xavier initialization. The tanh activation function with Xavier initialization is still used, but only in cases where the network is not deep. He normal initialization along with rectifier nonlinearities, especially ReLU, is preferred when the network is deep.
Acknowledgement
I would like to thank Dr. David Tax, Assistant Professor, Faculty of Electrical
Engineering, Mathematics and Computer Science, TU Delft, Netherlands for
his supervision and guidance for this survey.
References
[1] John McCarthy. What is artificial intelligence? 2004.
[2] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice
Hall, 1999.
[5] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedfor-
ward networks are universal approximators. Neural networks, 2(5):359–366,
1989.
[6] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In
Advances in neural information processing systems, pages 342–350, 2009.
[9] Jim YF Yam and Tommy WS Chow. A weight initialization method for
improving training speed in feedforward neural network. Neurocomputing,
30(1-4):219–232, 2000.
[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse recti-
fier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav
Dudík, editors, Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine
Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr
2011. PMLR.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep
into rectifiers: Surpassing human-level performance on imagenet classifica-
tion. IEEE International Conference on Computer Vision (ICCV 2015),
1502, 02 2015.
[12] Dmytro Mishkin and Jiri Matas. All you need is a good init. 05 2016.
[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu-
tional networks for biomedical image segmentation. In International Con-
ference on Medical image computing and computer-assisted intervention,
pages 234–241. Springer, 2015.
[14] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Wein-
berger. Densely connected convolutional networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 4700–
4708, 2017.
[15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature,
521(7553):436–444, 2015.
[16] G. Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.
[17] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen
Marshall. Activation functions: Comparison of trends in practice and re-
search for deep learning. CoRR, abs/1811.03378, 2018.
[18] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward net-
works are universal approximators. Neural Netw., 2(5):359–366, July 1989.
[19] A Vehbi Olgac and Bekir Karlik. Performance analysis of various activation
functions in generalized mlp architectures of neural networks. International
Journal of Artificial Intelligence And Expert Systems, 1:111–122, 02 2011.
[20] Radford M. Neal. Connectionist learning of belief networks. Artif. Intell.,
56(1):71–113, July 1992.
[21] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen
Marshall. Activation functions: Comparison of trends in practice and re-
search for deep learning. 11 2018.
[22] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. Proceedings of ICML, 27:807–814, 06 2010.
[23] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for acti-
vation functions, 2018.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas-
sification with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.
[25] M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen,
A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton. On rectified linear
units for speech processing. In 38th International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Vancouver, 2013.
[26] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving
deep neural networks for lvcsr using rectified linear units and dropout. In
ICASSP, pages 8609–8613. IEEE, 2013.
[27] Mian Mian Lau and King Hann Lim. Investigation of activation functions
in deep belief network. pages 201–206, 04 2017.
[28] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlin-
earities improve neural network acoustic models. In in ICML Workshop on
Deep Learning for Audio, Speech and Language Processing, 2013.
[29] Sartaj Singh Sodhi and Pravin Chandra. Interval based weight initializa-
tion method for sigmoidal feedforward artificial neural networks. AASRI
Procedia, 6:19–25, 2014.
[30] James L McClelland, David E Rumelhart, PDP Research Group, et al.
Parallel distributed processing, volume 2. MIT press Cambridge, MA:, 1987.
[31] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence and Statistics, pages 249–256, 01 2010.
[32] Weichen Sun, Fei Su, and Leiquan Wang. Improving deep neural networks
with multi-layer maxout networks and a novel initialization method. Neu-
rocomputing, 278:34–40, 2018.
[33] Siddharth Krishna Kumar. On weight initialization in deep neural net-
works. arXiv preprint arXiv:1704.08863, 2017.
[34] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier
neural networks. In Proceedings of the fourteenth international conference
on artificial intelligence and statistics, pages 315–323, 2011.