
International Journal of Engineering Applied Sciences and Technology, 2020

Vol. 4, Issue 12, ISSN No. 2455-2143, Pages 310-316


Published Online April 2020 in IJEAST (http://www.ijeast.com)

ACTIVATION FUNCTIONS IN NEURAL NETWORKS

Siddharth Sharma, Simone Sharma, Anidhya Athaiya
UG Scholar, Dept. of Computer Science and Engineering, Global Institute of Technology, Jaipur
Assistant Professor, Dept. of Computer Science and Engineering, Global Institute of Technology, Jaipur

Abstract—Artificial Neural Networks are inspired by the human brain and the network of neurons present in the brain. Information is processed and passed on from one neuron to another through neuro-synaptic junctions. Similarly, in artificial neural networks there are different layers of cells arranged and connected to each other. The output/information from the inner layers of the neural network is passed on to the next layers and finally to the outermost layer, which gives the output. The inner layers' output is given non-linearity before being passed onward so that it can be further processed. In an Artificial Neural Network, activation functions are very important as they help in learning and making sense of non-linear and complicated mappings between the inputs and corresponding outputs.

I. INTRODUCTION

Activation functions are used in artificial neural networks to transform an input signal into an output signal, which in turn is fed as input to the next layer in the stack.
In an artificial neural network, we calculate the sum of products of inputs and their corresponding weights and finally apply an activation function to it to get the output of that particular layer and supply it as the input to the next layer.
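To make this concrete, the following minimal sketch (in Python with NumPy, which is assumed here and is not part of the original paper) computes one layer's weighted sum of inputs plus a bias and then applies an activation function; tanh is used purely as an example choice.

    import numpy as np

    def layer_output(inputs, weights, bias, activation):
        # sum of products of inputs and their corresponding weights, plus bias
        net_input = np.dot(weights, inputs) + bias
        # the activation function turns the net input into this layer's output,
        # which would then be supplied as the input to the next layer
        return activation(net_input)

    x = np.array([0.5, -1.2, 3.0])        # example inputs
    W = np.array([[0.2, -0.1, 0.4],
                  [0.7,  0.3, -0.5]])     # example weights of a 2-neuron layer
    b = np.zeros(2)
    print(layer_output(x, W, b, np.tanh))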
A Neural Network's prediction accuracy depends on the number of layers used and, more importantly, on the type of activation function used. There is no manual that specifies the minimum or maximum number of layers to be used for better results and accuracy of a neural network, but a common rule of thumb is to use at least two layers. Neither is there any mention in the literature of which type of activation function should be used. It is evident from studies and research that using one or more hidden layers in a neural network reduces the error in predictions.
A neural network's prediction accuracy is also defined by the type of activation function used. The most commonly used activation functions are non-linear. If an activation function is not defined, a neural network works just like a linear regression model, where the predicted output is the same as the provided input. The same is the case if a linear activation function is used, where the output is similar to the input fed in, along with some error. A linear activation function has a linear decision boundary, so if it is used the network can adapt only to linear changes of the input; in the real world, however, errors possess non-linear characteristics, which interferes with the network's ability to learn from erroneous data. Hence non-linear activation functions are preferred over linear activation functions in a Neural Network.
The most appealing property of Artificial Neural Networks is the ability to adapt their behavior according to the changing characteristics of the system. In the last few decades many researchers and scientists have performed studies and investigated a number of methods to improve the performance of Artificial Neural Networks by optimizing the training methods, hyperparameter tuning, learned parameters or network structures, but not much attention has been paid to activation functions.

II. NEURAL NETWORKS

According to the inventor of one of the first neurocomputers, a neural network can be defined as:
"...a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs."
Artificial Neural Networks are based on the network of neurons in the mammalian cortex and are modelled loosely on it, but on a much smaller scale. Artificial Neural Networks can be algorithms or an actual piece of hardware. There are billions of neurons present in the mammalian brain, which gives an enormous magnitude of interaction and emergent behavior, whereas an Artificial Neural Network may have hundreds or thousands of processor units, which is very small compared to the mammalian brain structure.
Neural Networks are organized in multiple layers and each layer is made up of a number of interconnected nodes which have activation functions associated with them. Data is fed to the network via the input layer, which then communicates with the other layers and processes the input data with the help of a system of weighted connections. The processed data is then obtained through the output layer.

III. WHY NEURAL NETWORKS NEED ACTIVATION FUNCTIONS?

Neural Networks are a network of multiple layers of neurons consisting of nodes, which are used for classification and prediction of data, given some data as input to the network. There is an input layer, one or many hidden layers, and an output layer. All the layers have nodes and each node has a weight which is considered while processing information from one layer to the next layer.

Figure 1. Neural Network

If an activation function is not used in a neural network, then the output signal would simply be a linear function, which is just a polynomial of degree one. Although a linear equation is simple and easy to solve, its complexity is limited and it does not have the ability to learn and recognize complex mappings from data. A Neural Network without an activation function acts as a Linear Regression Model with limited performance and power most of the time. It is desirable that a neural network not only learn and compute a linear function but perform tasks more complicated than that, like modelling complicated types of data such as images, videos, audio, speech, text, etc.
This is the reason that we use activation functions and artificial neural network techniques like Deep Learning, which make sense of complicated, high-dimensional and non-linear datasets, where the model has multiple hidden layers and also a complex architecture for extracting knowledge, which again is our ultimate goal.
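The point that a network without activation functions stays linear can be checked numerically. The sketch below (Python with NumPy assumed; the matrices are made up for illustration) shows that two stacked layers with no activation collapse into a single linear transformation.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 3))
    W2 = rng.standard_normal((2, 4))
    x = rng.standard_normal(3)

    # two "layers" applied without any activation function...
    two_layers = W2 @ (W1 @ x)
    # ...are equivalent to one linear layer whose weights are W2 @ W1
    one_layer = (W2 @ W1) @ x
    print(np.allclose(two_layers, one_layer))   # True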
IV. THE NEED FOR NON-LINEARITY IN NEURAL NETWORKS

Functions which have a degree of more than one and have a curvature when plotted are known as non-linear functions. A neural network is required to learn, represent and process any data and any arbitrarily complex function which maps the inputs to the outputs. Neural Networks are also known as Universal Function Approximators, which means that they can compute and learn any function provided to them. Any imaginable process can be represented as a functional computation in Neural Networks.
Thus, we need to apply an activation function to make the network dynamic and give it the ability to extract complex and complicated information from data and to represent non-linear, convoluted, arbitrary functional mappings between input and output. Hence, by adding non-linearity to the network with the help of non-linear activation functions, we are able to achieve non-linear mappings from inputs to outputs. An important feature of an activation function is that it must be differentiable so that we can implement the backpropagation optimization strategy, in order to compute the errors or losses with respect to the weights and eventually optimize the weights using Gradient Descent or any other optimization technique to reduce the error.

V. TYPES OF ACTIVATION FUNCTIONS

The net input is one of the most important units in the structure of a neural network; it is processed and changed into an output result, known as the unit's activation, by applying a function called the activation function (also called the threshold function or transfer function), which is a scalar-to-scalar transformation.
Limiting the amplitude of the output of a neuron to a finite range is known as squashing; a squashing function squashes the amplitude of the output signal into a finite value.
1. Binary Step Function
2. Linear
3. Sigmoid
4. Tanh
5. ReLU
6. Leaky ReLU
7. Parametrized ReLU
8. Exponential Linear Unit
9. Swish
10. SoftMax

a. BINARY STEP FUNCTION
When we have an activation function, the most important thing to consider is the threshold-based classifier, i.e. whether the linear transformation's value should activate the neuron or not. In other words, a neuron is activated if the input to the activation function is greater than a threshold value, or else it is deactivated, in which case its output is not fed as input to the next layer.
The binary step function is the simplest activation function that exists, and it can be implemented with simple if-else statements in Python. Binary step activation functions are generally used while creating a binary classifier. However, the binary step function cannot be used in the case of multi-class classification of the target variable. Also, the gradient of the binary step function is zero, which may cause a hindrance in the backpropagation step, i.e. if we calculate the derivative of f(x) with respect to x, it is equal to zero.
Mathematically, the binary step function can be defined as:
f(x) = 1, x >= 0
f(x) = 0, x < 0

Figure 2. Binary Step Function
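A minimal sketch of the binary step function in Python (NumPy is assumed for the vectorized form; the threshold of zero matches the definition above):

    import numpy as np

    def binary_step(x):
        # outputs 1 when the input reaches the threshold (x >= 0), otherwise 0
        return np.where(x >= 0, 1, 0)

    print(binary_step(np.array([-2.0, 0.0, 3.5])))   # [0 1 1]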
b. LINEAR ACTIVATION FUNCTION
The linear activation function is directly proportional to the input. The main drawback of the binary step function was that it had zero gradient, because there is no component of x in the binary step function. In order to remove that, a linear function can be used. It can be defined as:
f(x) = ax
The value of the constant a can be any value chosen by the user.

Figure 3. Linear activation function

Here the derivative of the function f(x) is not zero but is equal to the value of the constant used. The gradient is not zero, but a constant value which is independent of the input value x, which implies that the weights and biases will be updated during the backpropagation step, although the updating factor will be the same every time.
There isn't much benefit in using a linear function, because the neural network would not improve the error due to the same value of the gradient in every iteration. Also, the network will not be able to identify complex patterns in the data. Therefore, linear functions are ideal only where interpretability is required and for simple tasks.
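A short sketch of the linear activation function (Python with NumPy assumed; a = 0.5 is an arbitrary example value for the user-chosen constant):

    import numpy as np

    def linear(x, a=0.5):
        # output is directly proportional to the input; the gradient is the constant a
        return a * x

    print(linear(np.array([-1.0, 0.0, 2.0])))   # [-0.5  0.   1. ]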
c. SIGMOID ACTIVATION FUNCTION
It is one of the most widely used activation functions, as it is a non-linear function. The sigmoid function transforms values into the range 0 to 1. It can be defined as:
f(x) = 1 / (1 + e^(-x))
The sigmoid function is continuously differentiable and a smooth S-shaped function. The derivative of the function is:
f'(x) = sigmoid(x) * (1 - sigmoid(x))
Also, the sigmoid function is not symmetric about zero, which means that the signs of all the output values of the neurons will be the same. This issue can be improved by scaling the sigmoid function.

Figure 4. Sigmoid Function
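A sketch of the sigmoid function and its derivative as defined above (Python, NumPy assumed):

    import numpy as np

    def sigmoid(x):
        # squashes any real input into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        # the derivative is expressed in terms of the sigmoid itself
        return s * (1.0 - s)

    print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approx [0.119 0.5   0.881]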


d. TANH FUNCTION
Tanh is the hyperbolic tangent function. The tanh function is similar to the sigmoid function, but it is symmetric around the origin. This results in different signs of the outputs from previous layers, which will be fed as input to the next layer. It can be defined as:
f(x) = 2 * sigmoid(2x) - 1
The tanh function is continuous and differentiable, and its values lie in the range -1 to 1. Compared to the sigmoid function, the gradient of the tanh function is steeper. Tanh is preferred over the sigmoid function as its gradients are not restricted to vary in one direction and, also, it is zero-centered.

Figure 5. Tanh Function
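A sketch of the tanh function (Python, NumPy assumed), written both via the sigmoid identity given above and via NumPy's built-in np.tanh; the two should agree:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh_via_sigmoid(x):
        # tanh(x) = 2*sigmoid(2x) - 1: zero-centered, outputs in (-1, 1)
        return 2.0 * sigmoid(2.0 * x) - 1.0

    x = np.linspace(-3.0, 3.0, 7)
    print(np.allclose(tanh_via_sigmoid(x), np.tanh(x)))   # True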
e. RELU FUNCTION
ReLU stands for rectified linear unit and is a non-linear activation function which is widely used in neural networks. The upper hand of using the ReLU function is that all the neurons are not activated at the same time. This implies that a neuron will be deactivated only when the output of the linear transformation is less than zero. It can be defined mathematically as:
f(x) = max(0, x)

Figure 6. ReLU Activation Function plot

ReLU is more efficient than other functions because all the neurons are not activated at the same time; rather, only a certain number of neurons are activated at any given time.
In some cases, however, the value of the gradient is zero, due to which the weights and biases are not updated during the backpropagation step of neural network training.
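A minimal sketch of ReLU (Python, NumPy assumed):

    import numpy as np

    def relu(x):
        # passes positive inputs through unchanged and outputs zero otherwise
        return np.maximum(0.0, x)

    print(relu(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 3.]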

f. LEAKY RELU FUNCTION
Leaky ReLU is an improved version of the ReLU function where, for negative values of x, instead of defining the function's value as zero, it is defined as an extremely small linear component of x. It can be expressed mathematically as:
f(x) = 0.01x, x < 0
f(x) = x, x >= 0

Figure 7. Plot of Leaky ReLU function
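A sketch of leaky ReLU with the small slope of 0.01 used above (Python, NumPy assumed):

    import numpy as np

    def leaky_relu(x, slope=0.01):
        # negative inputs are scaled by a small slope instead of being zeroed out
        return np.where(x >= 0, x, slope * x)

    print(leaky_relu(np.array([-2.0, 0.0, 3.0])))   # [-0.02  0.    3.  ]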
g. PARAMETRIZED RELU FUNCTION


It is also a variant of the Rectified Linear Unit, with better performance and a slight variation. It resolves the problem of the gradient of ReLU becoming zero for negative values of x by introducing a new parameter for the negative part of the function, i.e. its slope.
It is expressed as:
f(x) = x, x >= 0
f(x) = ax, x < 0

Figure 8. Plot of Parametrized ReLU function

When the value of a is set to 0.01, it behaves as the leaky ReLU function, but here a is also a trainable parameter. For faster and optimal convergence, the network learns the value of a.
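A sketch of the parametrized ReLU (Python, NumPy assumed). Here a is passed as an ordinary argument; in a real network it would be a trainable parameter updated alongside the weights during training:

    import numpy as np

    def prelu(x, a):
        # like leaky ReLU, but the negative-side slope a is learned during training
        return np.where(x >= 0, x, a * x)

    print(prelu(np.array([-2.0, 3.0]), a=0.25))   # [-0.5  3. ]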
h. EXPONENTIAL LINEAR UNIT
The Exponential Linear Unit, or ELU, is also a variant of the Rectified Linear Unit. ELU introduces a slope parameter for the negative values of x and uses an exponential curve to define the negative values:
f(x) = x, x >= 0
f(x) = a(e^x - 1), x < 0

Figure 9. Plot of ELU function
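A sketch of ELU (Python, NumPy assumed; a = 1.0 is an example value for the slope parameter):

    import numpy as np

    def elu(x, a=1.0):
        # identity for non-negative inputs, exponential curve for negative inputs
        return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

    print(elu(np.array([-2.0, 0.0, 3.0])))   # approx [-0.865  0.     3.   ]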

i. SWISH FUNCTION
The Swish function is a relatively new activation function which was proposed by researchers at Google. The distinguishing feature of the Swish function is that it is not monotonic, which means that the value of the function may decrease even though the values of the inputs are increasing. In some cases, Swish outperforms even the ReLU function.
It is expressed mathematically as:
f(x) = x * sigmoid(x)
f(x) = x / (1 + e^(-x))

Figure 10. Swish Function
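A sketch of the Swish function (Python, NumPy assumed), using the x * sigmoid(x) form given above; its small dip below zero for negative inputs is what makes it non-monotonic:

    import numpy as np

    def swish(x):
        # x * sigmoid(x): smooth and unbounded above, with a dip for negative x
        return x / (1.0 + np.exp(-x))

    print(swish(np.array([-5.0, -1.0, 0.0, 2.0])))   # approx [-0.033 -0.269  0.     1.762]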
j. SOFTMAX ACTIVATION FUNCTION
The softmax function is a combination of multiple sigmoid functions. As we know, a sigmoid function returns values in the range 0 to 1, and these can be treated as the probabilities of the data points of a particular class.
Unlike the sigmoid functions, which are used for binary classification, the softmax function can be used for multi-class classification problems. For every data point, it returns the probability of each of the individual classes. It can be expressed as:
softmax(z_i) = e^(z_i) / sum_j e^(z_j)
When we build a network or model for multi-class classification, the output layer of the network will have the same number of neurons as the number of classes in the target.
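A sketch of the softmax function (Python, NumPy assumed). Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result, since softmax is invariant to adding a constant to all inputs:

    import numpy as np

    def softmax(z):
        # exponentiate and normalize so the outputs sum to 1 (class probabilities)
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    p = softmax(np.array([2.0, 1.0, 0.1]))
    print(p, p.sum())   # approx [0.659 0.242 0.099] 1.0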


VI. CHOOSING THE RIGHT ACTIVATION FUNCTION

For better performance and less erroneous results, a lot of things have to be considered, like the number of hidden layers in a network, training methods, hyperparameter tuning, etc., and the activation function is one of the most important parameters to consider. Choosing the right activation function for any particular task may be a tedious process and may require a lot of research and study.
There is no rule of thumb for selecting an activation function; the choice is context dependent, i.e. it depends on the task that is to be accomplished. Different activation functions have advantages and disadvantages of their own, and the choice depends on the type of system that we are designing. For example:
• For classification problems, a combination of sigmoid functions gives better results.
• Due to the vanishing gradient problem, i.e. the gradient reaching the value zero, sigmoid and tanh functions are avoided.
• The ReLU function is the most widely used function and performs better than other activation functions in most cases.
• If there are dead neurons in our network, then we can use the leaky ReLU function.
• The ReLU function should be used only in the hidden layers and not in the output layer.
We can experiment with different activation functions while developing a model if there are no time constraints. We start with the ReLU function and then move on to other functions if it does not give satisfactory results.
Studies have shown that both sigmoid and tanh functions are not suitable for hidden layers, because the slope of the function becomes very small as the input becomes very large or very small, which in turn slows down gradient descent. ReLU is the most preferred choice for hidden layers, as the derivative of ReLU is 1 for positive inputs. Also, leaky ReLU can be used in the case of zero derivatives. An activation function which approximates the target function faster and can be trained at a faster rate can also be chosen. A small sketch following these guidelines is given below.
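As a rough illustration of the guidelines above (Python with NumPy assumed; the layer sizes and numbers are made up), a small classifier might use ReLU in the hidden layer and softmax at the output layer, with one output neuron per class:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4)                         # example input with 4 features
    W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)  # hidden layer (ReLU)
    W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)  # output layer (3 classes, softmax)

    hidden = relu(W1 @ x + b1)
    probs = softmax(W2 @ hidden + b2)
    print(probs, probs.sum())   # three class probabilities summing to 1.0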
VII. CONCLUSION

This paper provides a brief description of the various activation functions used in the field of deep learning, and also of the importance of activation functions in developing an effective and efficient deep learning model and improving the performance of artificial neural networks. The paper highlights the need for activation functions and the need for non-linearity in neural networks. Firstly, we have given a description of activation functions; then we have given a brief description of the need for activation functions and for non-linearity in neural networks. We then describe the various types of activation functions that are commonly used in neural networks. Activation functions have the ability to improve the learning rate and the learning of patterns present in the dataset which, in turn, helps in automating the process of feature detection, extraction and prediction. This paper justifies the use of activation functions in the hidden layers of neural networks and their usefulness in classification across various domains. These activation functions were developed after extensive research and experiments over the years.
There has been little emphasis on activation functions in the past, but at present scientists and developers are paying the much-needed attention to the study of activation functions, as they affect a neural network's performance. Also, there are various activation functions that are not discussed in this paper because they are not widely used in deep learning; rather, we have emphasized the most commonly used activation functions. In future we could also compare all the activation functions and analyze their performance using standard datasets and architectures, to see whether their performance can be further improved.


