Optimizers and Activation functions in Deep Learning
What is an optimizer?
Optimizers are algorithms or methods used to minimize an error function (loss function) or to maximize a measure of model performance. They are mathematical procedures that operate on the model's learnable parameters, i.e. the weights and biases, and they determine how the weights and the learning rate of a neural network should be changed to reduce the loss.
This post will walk you through the optimizers and some popular approaches.
Types of optimizers
Let’s learn about the different types of optimizers and how exactly they work to minimize the loss function.
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam
1. Gradient Descent
Gradient descent is an optimization algorithm that tweaks the parameters iteratively to drive a given function towards a (local) minimum. It reduces the loss function step by step by moving in the direction opposite to that of steepest ascent, and it relies on the derivatives of the loss function to find the minimum. Plain (batch) gradient descent uses the entire training set to calculate the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process.
Gradient descent update rule: θ = θ − η · ∇J(θ), where η is the learning rate and ∇J(θ) is the gradient of the loss with respect to the parameters θ.
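For illustration, here is a minimal NumPy sketch of one gradient descent step on a toy quadratic loss; the function names, the learning rate, and the loss are assumptions of this example, not part of the original post:

```python
import numpy as np

def gradient_descent_step(params, grad_fn, lr=0.1):
    """One batch gradient descent step: move against the gradient."""
    grad = grad_fn(params)          # gradient of the loss w.r.t. the parameters
    return params - lr * grad       # theta = theta - eta * grad

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda p: 2 * p, lr=0.1)
print(theta)  # very close to [0, 0]
```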
SGD with momentum
Momentum formula: v_t = γ · v_(t−1) + η · ∇J(θ) and θ = θ − v_t, where γ is the momentum coefficient (typically around 0.9) and v_t is the running velocity of the updates.
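A hedged sketch of the momentum update on the same toy loss; again, all names and values are illustrative assumptions:

```python
import numpy as np

def momentum_step(params, velocity, grad_fn, lr=0.1, gamma=0.9):
    """SGD with momentum: the velocity accumulates an exponentially weighted average of past gradients."""
    grad = grad_fn(params)
    velocity = gamma * velocity + lr * grad   # v_t = gamma * v_{t-1} + eta * grad
    return params - velocity, velocity        # theta = theta - v_t

theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
for _ in range(100):
    theta, v = momentum_step(theta, v, lambda p: 2 * p)
print(theta)  # converges towards [0, 0], with smoother updates than plain SGD
```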
Advantages of SGD with momentum
1. Momentum helps to reduce the noise in the updates.
2. An exponentially weighted average is used to smooth the curve of the updates.
Disadvantage of SGD with momentum
1. An extra hyperparameter (the momentum coefficient γ) is added.
Both NAG and SGD with momentum algorithms work equally well and share the same advantages and
disadvantages.
RMSprop
Advantages of RMSprop
1. In RMSprop the learning rate is adjusted automatically, and a different effective learning rate is chosen for each parameter.
Disadvantages of RMSprop
1. Slow learning
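A minimal sketch of the RMSprop update rule, assuming the usual formulation with a decay rate beta and a small epsilon for numerical stability; all names, values, and the toy loss are illustrative:

```python
import numpy as np

def rmsprop_step(params, avg_sq_grad, grad_fn, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: scale each parameter's step by a running average of its squared gradients."""
    grad = grad_fn(params)
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad ** 2
    params = params - lr * grad / (np.sqrt(avg_sq_grad) + eps)
    return params, avg_sq_grad

theta = np.array([3.0, -2.0])
s = np.zeros_like(theta)
for _ in range(500):
    theta, s = rmsprop_step(theta, s, lambda p: 2 * p)
print(theta)  # each parameter effectively gets its own step size
```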
AdaDelta
Adadelta is an extension of Adagrad that tries to fix Adagrad's aggressive, monotonically decreasing learning rate and thus removes the decaying learning rate problem. In Adadelta we do not need to set a default learning rate, as the update is scaled by the ratio of the running average of the previous parameter updates to the running average of the current gradients.
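A hedged sketch of the AdaDelta update, following the standard formulation with two running averages; the variable names, the decay rate rho, and the toy loss are assumptions of this example:

```python
import numpy as np

def adadelta_step(params, avg_sq_grad, avg_sq_delta, grad_fn, rho=0.95, eps=1e-6):
    """AdaDelta: no explicit learning rate; the step is the ratio of two RMS terms."""
    grad = grad_fn(params)
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return params + delta, avg_sq_grad, avg_sq_delta

theta = np.array([3.0, -2.0])
s_grad = np.zeros_like(theta)
s_delta = np.zeros_like(theta)
for _ in range(2000):
    theta, s_grad, s_delta = adadelta_step(theta, s_grad, s_delta, lambda p: 2 * p)
print(theta)
```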
Advantages of Adadelta
1. The main advantage of AdaDelta is that we do not need to set a default learning rate.
Disadvantages of Adadelta
1. Computationally expensive
Adam
Advantages of Adam
1. Easy to implement.
2. Computationally efficient.
3. Low memory requirements.
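A minimal sketch of the Adam update with bias-corrected first and second moment estimates; the hyperparameter values shown are the commonly used defaults and, like the toy loss, are assumptions of this example:

```python
import numpy as np

def adam_step(params, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-like first moment plus RMSprop-like second moment, both bias-corrected."""
    grad = grad_fn(params)
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, m, v, t, lambda p: 2 * p)
print(theta)
```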
Linear activation function
As a starting point, and for better comparability with the later functions, we begin with the simplest possible activation function. The linear activation function returns the input value unchanged and is described by the following formula:
f(x) = x
Although it appears that this function does not make any changes to the data, it does have an important
influence on how the network functions. It ensures that the neural network can only recognize linear
relationships in the data. This limits its performance immensely, as no more complex structures can be
learned from the data. For this reason, this simple activation function is rarely used in deep neural networks,
but only in simpler, linear models or in the output layer for regressions.
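As a small illustration of this limitation, stacking several layers with a linear (identity) activation collapses into a single linear transformation; the matrices below are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two "layers" with the identity activation ...
hidden = W1 @ x            # linear activation: f(z) = z
out = W2 @ hidden

# ... are equivalent to one linear layer with weights W2 @ W1
print(np.allclose(out, (W2 @ W1) @ x))  # True
```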
Sigmoid function
The sigmoid function is one of the oldest non-linear activation functions and has been used in the field of machine learning for many years. It is described by the following mathematical formula:
σ(x) = 1 / (1 + e^(−x))
This function ensures that the input value is mapped to a range between 0 and 1. The graph follows the
characteristic S-curve, which ensures that small values lie in a range close to 0 and high values are
transformed into a range close to 1.
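A short sketch of the sigmoid and its characteristic squashing behaviour, written with NumPy purely for illustration:

```python
import numpy as np

def sigmoid(x):
    """Maps any real input to the range (0, 1) along the S-shaped curve."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))  # approx. [0.00005, 0.269, 0.5, 0.731, 0.99995]
```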
This range of values makes the sigmoid function particularly suitable for applications in which binary values
are to be predicted so that the output can then be interpreted as the probability of membership. For this
reason, the sigmoid function is primarily used in the last layer of a network if a binary prediction is to be
made. This can be useful, for example, in the area of object recognition in images or in medical diagnoses
where a patient is to be classified as healthy or ill.
However, the sigmoid also has notable drawbacks. For very large or very small inputs the function saturates, so its gradient becomes vanishingly small and the weights in the early layers of a deep network are barely updated anymore (the vanishing gradient problem). It can also lead to problems that the output values of the sigmoid function are not centered around zero but lie between 0 and 1. As a result, the gradients for the weights of a layer always share the same sign, which further slows down the convergence of the model.
Due to these disadvantages, the sigmoid function is increasingly being replaced in modern network
architectures by other activation functions that enable more efficient training, which is particularly
important in deep architectures.
Tanh function
The hyperbolic tangent function, or tanh, is another non-linear activation function used in neural networks to learn more complex relationships in the data. It is based on the following mathematical formula:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
The tanh function transforms the input value into the range between -1 and 1. In contrast to Sigmoid, the
values are therefore distributed around zero. This results in some advantages compared to the previously
presented activation functions, as the centering around zero helps to improve the training effect and the
weight adjustments move faster in the right direction.
It is also advantageous that the tanh function scales smaller input values more strongly in the output range, so that the values can be separated from each other more easily, especially when the inputs lie close together.
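For illustration, a brief NumPy sketch contrasting the output ranges of tanh and the sigmoid (the input values are arbitrary examples):

```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.tanh(x))                # approx. [-0.995, -0.462, 0.0, 0.462, 0.995], centered around zero
print(1.0 / (1.0 + np.exp(-x)))  # approx. [0.047, 0.378, 0.5, 0.622, 0.953], shifted into (0, 1)
```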
Due to these properties, the hyperbolic tangent is often used in recurrent neural networks where temporal
sequences and dependencies play an important role. By using positive and negative values, the state changes
in an RNN can be represented much more precisely.
The hyperbolic tangent, however, struggles with the same problems as the sigmoid function. The vanishing gradient problem can also occur with this activation function, especially for extremely large or small input values. In very deep neural networks it then becomes difficult to keep the gradients in the early layers of the network strong enough to make sufficient weight adjustments. In addition, saturation effects occur in these value ranges, so that the gradient almost vanishes when the output is close to 1 or -1.
ReLU
The Rectified Linear Unit (ReLU for short) is a piecewise linear activation function that was introduced to address the vanishing gradient problem and has become increasingly popular in recent years. In short, it keeps positive values and sets negative input values equal to zero. Mathematically, this is expressed by the following term:
f(x) = x for x > 0 and f(x) = 0 for x ≤ 0
In simpler terms, it can be represented using the max function:
f(x) = max(0, x)
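A minimal sketch of the ReLU computation and its gradient behaviour; setting the gradient at exactly zero to 0 is an implementation convention chosen for this example, not something specified in the post:

```python
import numpy as np

def relu(x):
    """Keep positive values, clamp negative values to zero."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient is 1 for positive inputs, 0 for negative inputs (and 0 at x = 0 by convention)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(relu(x))       # [0.   0.   0.   0.1  2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```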
The ReLU activation function has established itself primarily due to the following advantages:
Simple calculation: Compared to the other options, the ReLU function is very easy to calculate and
therefore saves a lot of computing power, especially for large networks. This is reflected either in
lower costs or a shorter training time.
No vanishing gradient problem: Due to the piecewise linear structure, there are no regions where the function asymptotically flattens out parallel to the x-axis. For positive inputs the gradient therefore does not become vanishingly small, and the error signal propagates through all layers, even in large networks. This ensures that the network learns structures and that the learning process is significantly accelerated.
Better results for new model architectures: Compared to the other activation functions, ReLU can
set values to zero as soon as they are negative. With the sigmoid, softmax, and tanh functions, on the
other hand, the values only approach zero asymptotically, but never become zero. However, this
leads to problems in newer models, such as autoencoders, as real zeros are required in the so-called
code layer in order to achieve good results.
Economy: The ability of the activation function to set certain input values to zero makes the model much more economical with computing power. If many neurons output zero for a given input, the activations become sparse. This reduces the effective complexity of the model and can lead to better generalization.
However, there are also problems with this simple activation function. Because negative values are consistently set to zero, it can happen that individual neurons output zero for every input, receive a zero gradient, make no further contribution to the learning process, and therefore "die off". This may not initially be a problem for individual neurons, but it has been shown that in some cases as many as 20–50 % of neurons can "die off" as a result of ReLU.
This problem occurs more frequently if too high a learning rate has been defined so that the weights of the
neuron can change in such a way that the neuron only receives negative values. In the long term, these
neurons remain dead because they no longer generate a gradient and are no longer capable of learning. This
means that models with ReLU as an activation function are also highly dependent on a well-chosen learning
rate, which should be carefully considered in advance.
Furthermore, it is problematic that the ReLU function is not limited and can theoretically assume infinitely
large, positive values. Particularly in applications where the output range is limited, such as the prediction of
probabilities, the ReLU function must then be supplemented with another activation function such as the
softmax so that interpretable results are output.
The ReLU function is primarily used in deep neural networks, as convergence can be significantly
accelerated due to efficient gradient processing. In addition, computational effort can be saved, increasing
the efficiency of the entire model. A central application here is the training of autoencoders, which are used to learn compressed representations of the data. The sparse activations help such models find an efficient and compressed representation.
Leaky ReLU
To eliminate this disadvantage and make the ReLU function more robust, an optimization of the function has
been developed, which is known as Leaky ReLU. Compared to the conventional version of the function,
negative values are not set to zero but are given an (albeit small) positive slope. Mathematically, this looks
like this:
f(x) = x for x > 0 and f(x) = α · x for x ≤ 0
The parameter α is a positive constant that must be determined before training and can be 0.01, for
example. This ensures that even if a neuron receives negative values, its output does not become exactly zero and it can therefore still generate a small gradient. This prevents the neurons from dying, as they still make a small contribution to learning.
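A short sketch of the Leaky ReLU with the example value α = 0.01 mentioned above (the function name is illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of being zeroed out."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # [-0.05 -0.01  0.    1.    5.  ]
```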
Beyond this, the Leaky ReLU is also characterized by an increased learning ability of the model, as learning is possible even for negative values and their information is not lost. This property can lead to faster convergence, as more neurons remain active and participate in the learning process. In addition, despite the small change compared to ReLU, this activation function can be calculated with similar efficiency.
A possible disadvantage is that α introduces another hyperparameter, which must be chosen before training and has a major influence on the quality of the training. A value that is too small can slow down learning, because the affected neurons no longer die completely but produce outputs close to zero and therefore contribute very little to training.
Softmax
The softmax is a mathematical function that takes a vector as input and converts its values into probabilities depending on their size. A high numerical value leads to a high probability in the resulting vector.
In other words, each value of the vector is exponentiated and then divided by the sum of the exponentials of all values, and the result is stored in the new vector. In purely mathematical terms, this formula looks like this:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
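A minimal, numerically stable softmax sketch; subtracting the maximum before exponentiating is a standard implementation trick added here for illustration, not something stated in the post:

```python
import numpy as np

def softmax(x):
    """Convert a vector of scores into probabilities that sum to 1."""
    shifted = x - np.max(x)     # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approx. [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```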
At first glance, the sigmoid and softmax functions appear relatively similar, as both map their inputs to the numerical range between 0 and 1. The key difference is that the sigmoid acts on a single value and always passes through 0.5 at x = 0, whereas the softmax acts on a whole vector, so its output for a given entry also depends on all the other entries.
The difference between the functions lies in the application. The sigmoid function can be used for binary
classifications, i.e. for models in which a decision is to be made between two different classes. Softmax, on
the other hand, can also be used for classifications that predict more than two classes. The function
ensures that the probabilities of all classes sum to 1.
The advantages of softmax are that the outputs are interpretable and represent probabilities, which is particularly helpful in classification problems. In addition, the exponential function amplifies differences between the scores, so the function can handle large differences in the input data.
Disadvantages include overconfidence, which describes the tendency of the predicted probabilities to look very confident even though the model is actually quite uncertain. Measures for uncertainty estimation should therefore be included to mitigate this problem. Furthermore, although softmax is suitable for multi-class classification, the number of classes should not grow too large, as the exponential calculation for each class then becomes time-consuming and computationally intensive. In addition, the model may become unstable as the probabilities for individual classes become very small.