Activation Functions and Initialization Methods
Leonid Datta
Delft University of Technology
Abstract
In an artificial neural network, the activation function and the weight initialization method play important roles in the training and performance of the network. The question arises: which properties of a function are important or necessary for it to be a well-performing activation function? In addition, the most widely used weight initialization methods, Xavier and He normal initialization, have a fundamental connection with the activation function. This survey discusses the important and necessary properties of an activation function and the most widely used activation functions (sigmoid, tanh, ReLU, LReLU and PReLU). It also explores the relationship between these activation functions and the two weight initialization methods, Xavier and He normal initialization.
1 Introduction
Artificial intelligence has long aimed to build intelligent machines [1]. Artificial neural networks have played an important role in achieving this goal [2] [3]. When an artificial neural network is built to execute a task, it is programmed to perceive a pattern, and its main task is to learn this pattern from data [1].
An artificial neural network is composed of a large number of interconnected working units known as perceptrons or neurons. A perceptron is composed of four components: the input node, the weight vector, the activation function and the output node. The first component, the input node, receives the input vector. I assume an m-dimensional input vector x = (x_1, x_2, ..., x_m). The second component is the weight vector, which has the same dimension as the input vector; here the weight vector is w = (w_1, w_2, ..., w_m). From the input vector and the weight vector, the inner product x^T w is calculated. While calculating the inner product, an extra term 1 is added to the input vector together with an initial weight value (here w_0). This is known as the bias. The bias term acts as an intercept so that the inner product can easily be shifted using the weight w_0. After the bias has been added, the input vector and the weight vector become x = (1, x_1, x_2, ..., x_m) and w = (w_0, w_1, w_2, ..., w_m). Thus, x^T w = w_0 + w_1 x_1 + w_2 x_2 + ... + w_m x_m.
The inner product x^T w is used as input to a function known as the activation function (here symbolized as f). The activation function is the third component of the perceptron. The output of the activation function is sent to the output node, which is the fourth component. This is how an input vector flows from the input node to the output node, a process known as forward propagation.
In mathematical terms, the output y(x_1, ..., x_m) can be expressed as

y(x_1, ..., x_m) = f(w_0 + w_1 x_1 + w_2 x_2 + ... + w_m x_m)     (1)
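As a concrete illustration of equation (1), the following minimal sketch computes the forward pass of a single perceptron in NumPy. The weight values, the choice of a sigmoid as the activation f and the helper names sigmoid and forward are illustrative assumptions, not part of the original formulation.

import numpy as np

def sigmoid(z):
    # Logistic sigmoid, used here only as an example activation f
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, f=sigmoid):
    # Forward pass of one perceptron: y = f(w0 + w1*x1 + ... + wm*xm)
    # x: input vector of length m (without the bias term)
    # w: weight vector of length m + 1, where w[0] is the bias weight w0
    x_aug = np.concatenate(([1.0], x))   # prepend the constant 1 for the bias
    z = x_aug @ w                        # inner product x^T w
    return f(z)

# Example usage with an arbitrary 3-dimensional input
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])      # w0, w1, w2, w3
print(forward(x, w))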
where z is the inner product z = x^T w and y = f(z). This factorization of the gradient, written out below, is known as the chain rule. The weight update is done in the direction opposite to the forward propagation: starting from the last layer and gradually moving to the first layer. This process is known as backward propagation or backpropagation. One forward propagation and one backward propagation over all the training examples (the training data) is termed an epoch.
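For completeness, the chain-rule factorization referred to above can be written out for a single perceptron with z = x^T w and y = f(z). The loss symbol L is introduced here only for illustration, since the surrounding derivation is not reproduced in full:

∂L/∂w_i = (∂L/∂y) · (∂y/∂z) · (∂z/∂w_i) = (∂L/∂y) · f'(z) · x_i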
Since the weight vector gets updated during training, it needs to be assigned initial values before the training starts. Assigning these initial values to the weight vector is known as weight initialization. Once the weight vector is updated, the input is again passed through the network in the forward direction to generate the output and calculate the loss. This process continues until the loss reaches a satisfactory minimum value. A network is said to converge when the loss achieves this satisfactory minimum value, and this process is called the training of a neural network. When an artificial neural network is to perform a task, it is first trained.
One of the most interesting characteristics of an artificial neural network is its ability to adapt its behaviour to the changing characteristics of the task. The activation function plays an important role in this behaviour adaptation and learning [4]. The choice of the activation function at different layers in a neural network is important, as it decides how the data will be presented to the next layer. It also controls how bounded the data passed on to the next layer or to the output layer will be (that is, the range of the data). When the network is sufficiently deep, learning the pattern becomes less difficult for common nonlinear functions [5] [6]. But when the network is not complex and deep, the choice of the activation function has a more significant effect on the learning and the performance of the neural network than it has in complex and deep networks [4].
Weight initialization plays an important role in the speed of training a neural network [7]. More precisely, it controls the speed of backpropagation, because the weights are updated during backpropagation [8]. If the weight initialization is not proper, it can lead to poor weight updates, so that the weights saturate at an early stage of training [9]. This can cause the training process to fail. Among the different weight initialization methods for artificial neural networks, Xavier and He normal initialization have gained popularity since they were proposed [10] [11] [12] [13] [14] [15].
1.1 Contribution
The main contributions of this survey are:
1. This survey discusses the necessary and important properties an activation function is expected to have and explores why they are necessary or important.
2. This survey discusses five widely used activation functions - sigmoid, tanh, ReLU, LReLU and PReLU - and why they are so widely used. It also discusses the problems faced by these activation functions and why they face them.
3. This survey discusses the Xavier and He normal weight initialization methods and their relation with the activation functions mentioned above.
2 Activation function
The activation function, also known as the transfer function, is the nonlinear function applied to the inner product x^T w in an artificial neural network.
Before discussing the properties of activation functions, it is important to know what a sigmoidal activation function is. A function is called sigmoidal in nature if it has all of the following properties:
1. it is continuous
2. it is bounded between a finite lower and upper bound
3. it is nonlinear
1. Nonlinear: There are two reasons why an activation function should be nonlinear:
(a) The boundaries or patterns in real-world problems cannot always be expected to be linear. A nonlinear function can easily approximate a linear boundary, whereas a linear function cannot approximate a nonlinear boundary. Since an artificial neural network learns the pattern or boundary from data, nonlinearity in the activation function is necessary so that the network can learn any linear or nonlinear boundary.
(b) If the activation function is linear, then a perceptron with multiple hidden layers can be collapsed into a single-layer perceptron, because a linear combination of another linear combination of the input vector can simply be expressed as a single linear combination of the input vector. In that case, the depth of the network has no effect, as the sketch below illustrates. This is another reason why activation functions are nonlinear.
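A minimal numerical sketch of point (b), with two arbitrary weight matrices W1 and W2 chosen only for illustration: without a nonlinearity, the two-layer map W2(W1 x) equals the single-layer map (W2 W1) x.

import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with purely linear activations (f(z) = z)
W1 = rng.normal(size=(4, 3))   # first layer weights (illustrative values)
W2 = rng.normal(size=(2, 4))   # second layer weights

x = rng.normal(size=3)         # an arbitrary input vector

two_layer_output = W2 @ (W1 @ x)       # forward pass through both layers
collapsed_output = (W2 @ W1) @ x       # equivalent single linear layer

# The two outputs agree, so the extra layer adds no expressive power
print(np.allclose(two_layer_output, collapsed_output))   # True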
6. Computational cost: The computational cost of an activation function is defined as the time required to produce its output when an input is fed to it. Along with the computational cost of the activation function itself, the computational cost of its gradient is important, since the gradient is calculated during the weight update in backpropagation. Gradient descent optimization is itself a time-consuming process that requires many iterations, so computational cost is an issue. If the activation function and its gradient have a low computational cost, the network requires less time to be trained; if they have a high computational cost, training takes more time. An activation function whose output and gradient are both cheap to compute is therefore preferable, as it saves time and energy (see the timing sketch below).
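The following rough sketch compares the evaluation time of the sigmoid and the ReLU (both discussed later in this section) together with their gradients. It assumes vectorized NumPy implementations, and the measured times are only indicative of relative cost on a single machine.

import time
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # involves an exponential

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                      # f'(z) = f(z)(1 - f(z))

def relu(z):
    return np.maximum(0.0, z)                 # simple thresholding

def relu_grad(z):
    return (z > 0).astype(z.dtype)            # 0 or 1, no exponential

def time_it(fn, z, repeats=100):
    start = time.perf_counter()
    for _ in range(repeats):
        fn(z)
    return (time.perf_counter() - start) / repeats

z = np.random.default_rng(0).normal(size=1_000_000)
for name, fn in [("sigmoid", sigmoid), ("sigmoid_grad", sigmoid_grad),
                 ("relu", relu), ("relu_grad", relu_grad)]:
    print(f"{name:13s} {time_it(fn, z) * 1e3:.2f} ms per call")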
while the network performs. These forcefully deactivated neurons are called 'dead neurons', and this problem is termed the dead neuron problem.
Figure 2: (a) Graph of the sigmoid function; (b) graph of the gradient of the sigmoid function.
The sigmoid function contains an exponential term, as can be seen from its definition. Exponential functions have a high computational cost, and as a result the sigmoid function is computationally expensive. Although the function itself is expensive, its gradient is not: it can be calculated using the formula f'(x) = f(x)(1 − f(x)).
The sigmoid function suffers from some major drawbacks as well. It is bounded in the range (0, 1) and hence always produces a non-negative output, so it is not a zero-centered activation function. The sigmoid function also compresses a large range of inputs into the small range (0, 1), so a large change in the input value leads to only a small change in the output value. This results in small gradient values as well, and because of these small gradient values the function suffers from the vanishing gradient problem.
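A small numerical sketch of this saturation effect, using the gradient formula f'(x) = f(x)(1 − f(x)) quoted above; the sample points are arbitrary and chosen only for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for x = 0 and decays rapidly for large |x|,
# which is the root of the vanishing gradient problem.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")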
To retain the benefits of the sigmoid function while adding a zero-centered nature, the hyperbolic tangent or tanh function was introduced:

f(x) = (1 − e^(−2x)) / (1 + e^(−2x))

where x is the input to the activation function.
The tanh function can be seen as a modified version of the sigmoid function, because it can be expressed as tanh(x) = 2 sigmoid(2x) − 1. For this reason it has all the properties of the sigmoid function: it is continuous, differentiable and bounded. Its range is (−1, 1), so it produces negative, positive and zero outputs. It is thus a zero-centered activation function and solves the 'not a zero-centered activation function' problem of the sigmoid function.
Figure 3: (a) Graph of the tanh function; (b) graph of the gradient of the tanh function.
The tanh function also belongs to the sigmoidal group of functions, and thus the Cybenko theorem holds for tanh as well. The main advantage provided by the tanh function is that it produces zero-centered output and thereby aids the backpropagation process [21].
Tanh is computationally expensive for the same reason as the sigmoid: it is exponential in nature. Its gradient calculation, however, is not expensive; the gradient can be calculated using f'(x) = 1 − f(x)^2. The graph of tanh and the corresponding gradient is shown in figure 3.
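A brief numerical check of the two identities quoted above, tanh(x) = 2 sigmoid(2x) − 1 and f'(x) = 1 − f(x)^2, using a central finite difference as an illustrative comparison for the gradient.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)

# tanh expressed through the sigmoid
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True

# analytic gradient 1 - tanh(x)^2 versus a central finite difference
eps = 1e-6
numeric_grad = (np.tanh(x + eps) - np.tanh(x - eps)) / (2.0 * eps)
analytic_grad = 1.0 - np.tanh(x) ** 2
print(np.allclose(numeric_grad, analytic_grad, atol=1e-6))      # True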
The tanh function, in a similar way to the sigmoid, compresses a large range of inputs into the small range (−1, 1). So a large change in the input value leads to only a small change in the output value, which results in gradient values close to zero. Because of these close-to-zero gradient values, tanh also suffers from the vanishing gradient problem. The problem of vanishing gradients spurred further research into activation functions, and ReLU was introduced.
Figure 4: (a) Graph of the ReLU function; (b) graph of the gradient of the ReLU function.
by deactivating a large part of it. Hence, it suffers from the dead neuron problem, which is already explained in section 2.2.2. This is also termed the dying ReLU problem, and the neurons which are deactivated by ReLU are called dead neurons.
The dying ReLU problem led to a new variant of ReLU called the Leaky ReLU or LReLU; a brief sketch contrasting the two follows below.
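As a rough sketch of the difference, assuming the commonly used LReLU slope of 0.01 on the negative side (the exact slope is a design choice, not fixed by the text above): ReLU zeroes out negative pre-activations and their gradients, while LReLU keeps a small non-zero gradient there.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def lrelu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def relu_grad(z):
    return (z > 0).astype(float)            # exactly 0 for z < 0: no learning signal

def lrelu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)      # small but non-zero for z < 0

z = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(z), relu_grad(z))     # negative inputs give output 0 and gradient 0
print(lrelu(z), lrelu_grad(z))   # negative inputs still receive a small gradient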
Figure 5: (a) Graph of the LReLU function; (b) graph of the gradient of the LReLU function.
2.3.5 Parametric ReLU function
The parametric ReLU or PReLU function was introduced by He et al. [11]. It is defined as

f(x) = ax  for x ≤ 0
f(x) = x   otherwise

where a is a learnable parameter and x is the input to the activation function. When a is fixed at 0.01, the PReLU function becomes the leaky ReLU, and for a = 0 it becomes ReLU. That is why PReLU can be seen as a general representation of rectifier nonlinearities.
The PReLU function is continuous, not bounded and zero-centered. When x < 0 the gradient of the function is a, and when x > 0 the gradient is 1. The function is not differentiable at x = 0, because the right-hand and left-hand derivatives are not equal there. On the positive part of PReLU, where the gradient is always 1, there is no vanishing gradient problem. On the negative part, however, the gradient is always a, which is typically close to zero, so there is a risk of a vanishing gradient problem. The graph of the PReLU function and its gradient is shown in figure 6.
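A minimal sketch of a PReLU layer with a learnable slope, assuming a single shared parameter a and a plain gradient-descent update; the initial value a = 0.25 and the learning rate are illustrative assumptions only.

import numpy as np

class PReLU:
    # Parametric ReLU with one learnable negative-side slope a

    def __init__(self, a=0.25):
        self.a = a

    def forward(self, x):
        self.x = x                                   # cache input for the backward pass
        return np.where(x > 0, x, self.a * x)

    def backward(self, grad_out, lr=0.01):
        # Gradient w.r.t. the input: 1 where x > 0, a where x <= 0
        grad_x = np.where(self.x > 0, 1.0, self.a) * grad_out
        # Gradient w.r.t. the learnable slope a: x where x <= 0, 0 elsewhere
        grad_a = np.sum(np.where(self.x > 0, 0.0, self.x) * grad_out)
        self.a -= lr * grad_a                        # plain gradient-descent update
        return grad_x

prelu = PReLU()
x = np.array([-2.0, -0.5, 1.0, 3.0])
y = prelu.forward(x)
grad_in = prelu.backward(np.ones_like(x))
print(y, grad_in, prelu.a)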
Figure 6: (a) Graph of the PReLU function; (b) graph of the gradient of the PReLU function.
Table 1: Calculation cost of activation functions and their gradients
3 Weight initialization
At the beginning of the training process, assigning the initial values to the weight vector is known as weight initialization. Rumelhart et al. first tried assigning equal initial values to the weight vector, but it was observed that the weights then move in groups during the weight update and maintain a symmetry [29]. This causes the network to fail to train properly. As a solution, Rumelhart et al. proposed the random weight initialization method [30], in which the initial weight values are chosen uniformly from a range [−δ, δ] [29]. The Xavier and He normal weight initialization methods have been used widely since they were proposed.
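A small sketch of the symmetry problem caused by equal initial weights, assuming a toy two-layer network with a tanh hidden layer and a squared-error loss chosen purely for illustration: hidden units that start out identical receive identical gradients and therefore stay identical.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                 # toy inputs
t = rng.normal(size=(8, 1))                 # toy targets

# All hidden units start with the same weights
W1 = np.full((3, 4), 0.5)
W2 = np.full((4, 1), 0.5)

h = np.tanh(x @ W1)                         # hidden activations (identical columns)
y = h @ W2                                  # network output
err = y - t                                 # derivative of 0.5 * squared error

# Backpropagated gradients
grad_W2 = h.T @ err
grad_W1 = x.T @ ((err @ W2.T) * (1 - h**2))

# Every hidden unit gets exactly the same gradient, so after any number of
# updates the units remain copies of each other.
print(np.allclose(grad_W1, grad_W1[:, [0]]))   # True
print(np.allclose(grad_W2, grad_W2[0]))        # True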
incoming network connections and n_j is the number of outgoing network connections from that layer.
The initial objective of the paper by Glorot and Bengio was to explore why standard gradient descent from random initialization performs poorly on deep neural networks. They find that the logistic sigmoid activation is not suited to deep neural networks with random initialization, because it leads to saturation of the initial layers at a very early stage of training. They also find that saturated units can move out of saturation by themselves with proper initialization [31].
To find a better range for the random initial weight values, they equated the variance across layers: the variance of the input and output of a layer is matched so that not much variance is lost between the input and output of a single layer. With this idea, the solution found is a range for the initial weight values:

U[−√6 / √(n_i + n_j),  √6 / √(n_i + n_j)]

where n_i is the number of incoming network connections, n_j is the number of outgoing network connections from that layer and U is a uniform distribution.
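A short sketch of this initialization rule, assuming a plain NumPy implementation; the He normal variant shown alongside for comparison uses the standard deviation sqrt(2 / n_i) commonly associated with He et al. [11], and the layer sizes are arbitrary.

import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng()):
    # Draw weights uniformly from [-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng=np.random.default_rng()):
    # Draw weights from a zero-mean normal with standard deviation sqrt(2 / n_in)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

# Example: initialize a 256 -> 128 fully connected layer both ways
W_xavier = xavier_uniform(256, 128)
W_he = he_normal(256, 128)
print(W_xavier.std(), W_he.std())   # empirical spread of the two schemes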
4 Discussion
Glorot and Bengio strongly suggest that the sigmoid activation function saturates easily at a very early stage of training, which causes the training to fail. They suggest the tanh activation function as a good alternative to the sigmoid because it does not saturate as easily. The main advantage of the tanh function over the sigmoid, as they show in the paper, is that its mean value is 0 (more precisely, it is a zero-centered activation function) [31]. As stated in the section 'Theoretical Considerations and a New Normalized Initialization' of their paper, the derivation assumes a linear regime of the network. Sun et al. make the same claim: Xavier initialization typically considers a linear activation function at each layer in order to bound the gradient values within a range [32].
5 Conclusion
The usage of sigmoidal activation functions is decreasing as rectifier nonlinearities become more popular. On the one hand, rectifier nonlinearities, especially ReLU, perform well with He normal initialization in several kinds of networks. On the other hand, the performance of He normal initialization beats that of Xavier initialization. The tanh activation function with Xavier initialization is still used, but only in cases where the network is not deep. He normal initialization along with rectifier nonlinearities, especially ReLU, is preferred when the network is deep.
Acknowledgement
I would like to thank Dr. David Tax, Assistant Professor, Faculty of Electrical
Engineering, Mathematics and Computer Science, TU Delft, Netherlands for
his supervision and guidance for this survey.
References
[1] John McCarthy. What is artificial intelligence? 2004.
[2] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice
Hall, 1999.
[5] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedfor-
ward networks are universal approximators. Neural networks, 2(5):359–366,
1989.
[6] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In
Advances in neural information processing systems, pages 342–350, 2009.
[9] Jim YF Yam and Tommy WS Chow. A weight initialization method for
improving training speed in feedforward neural network. Neurocomputing,
30(1-4):219–232, 2000.
[10] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse recti-
fier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav
Dudík, editors, Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine
Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr
2011. PMLR.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep
into rectifiers: Surpassing human-level performance on imagenet classifica-
tion. IEEE International Conference on Computer Vision (ICCV 2015),
1502, 02 2015.
[12] Dmytro Mishkin and Jiri Matas. All you need is a good init. 05 2016.
[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu-
tional networks for biomedical image segmentation. In International Con-
ference on Medical image computing and computer-assisted intervention,
pages 234–241. Springer, 2015.
[14] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Wein-
berger. Densely connected convolutional networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 4700–
4708, 2017.
[15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature,
521(7553):436–444, 2015.
[16] G. Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.
[17] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen
Marshall. Activation functions: Comparison of trends in practice and re-
search for deep learning. CoRR, abs/1811.03378, 2018.
[18] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward net-
works are universal approximators. Neural Netw., 2(5):359–366, July 1989.
[19] A Vehbi Olgac and Bekir Karlik. Performance analysis of various activation
functions in generalized mlp architectures of neural networks. International
Journal of Artificial Intelligence And Expert Systems, 1:111–122, 02 2011.
[20] Radford M. Neal. Connectionist learning of belief networks. Artif. Intell.,
56(1):71–113, July 1992.
[21] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen
Marshall. Activation functions: Comparison of trends in practice and re-
search for deep learning. 11 2018.
[22] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. Proceedings of ICML, 27:807–814, 06 2010.
[23] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for acti-
vation functions, 2018.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas-
sification with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.
[25] M.D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.V. Le, P. Nguyen,
A. Senior, V. Vanhoucke, J. Dean, and G.E. Hinton. On rectified linear
units for speech processing. In 38th International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Vancouver, 2013.
[26] George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving
deep neural networks for lvcsr using rectified linear units and dropout. In
ICASSP, pages 8609–8613. IEEE, 2013.
[27] Mian Mian Lau and King Hann Lim. Investigation of activation functions
in deep belief network. pages 201–206, 04 2017.
[28] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlin-
earities improve neural network acoustic models. In in ICML Workshop on
Deep Learning for Audio, Speech and Language Processing, 2013.
[29] Sartaj Singh Sodhi and Pravin Chandra. Interval based weight initializa-
tion method for sigmoidal feedforward artificial neural networks. AASRI
Procedia, 6:19–25, 2014.
[30] James L McClelland, David E Rumelhart, PDP Research Group, et al.
Parallel distributed processing, volume 2. MIT press Cambridge, MA:, 1987.
[31] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence and Statistics, pages 249–256, 01 2010.
[32] Weichen Sun, Fei Su, and Leiquan Wang. Improving deep neural networks
with multi-layer maxout networks and a novel initialization method. Neu-
rocomputing, 278:34–40, 2018.
[33] Siddharth Krishna Kumar. On weight initialization in deep neural net-
works. arXiv preprint arXiv:1704.08863, 2017.
[34] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier
neural networks. In Proceedings of the fourteenth international conference
on artificial intelligence and statistics, pages 315–323, 2011.