Optimizers and Activation Functions in Deep Learning

This document provides an overview of optimizers in deep learning, explaining their role in minimizing loss functions and detailing types such as Gradient Descent, Stochastic Gradient Descent, and Adam. It also discusses activation functions, their importance in neural networks, and compares types such as Sigmoid, Tanh, and ReLU, emphasizing that the right choice of optimizer and activation function depends on the data and the model architecture.

Optimizers in Deep Learning

What is an optimizer?
Optimizers are algorithms or methods used to minimize an error function (loss function) or, equivalently, to maximize the model's performance. They are mathematical procedures that depend on the model's learnable parameters, i.e. its weights and biases, and they determine how the weights and the learning rate of the neural network should be changed to reduce the loss.
This post will walk you through the optimizers and some popular approaches.
Types of optimizers
Let's learn about the different types of optimizers and how exactly they work to minimize the loss function.
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. RMSprop
8. AdaDelta
9. Adam

1. Gradient Descent
Gradient descent is the most basic optimization algorithm: it iteratively tweaks the parameters to move a given function toward a (local) minimum. At each step it moves in the direction opposite to that of steepest ascent, so it depends on the derivatives of the loss function to find a minimum. It uses the entire training set to compute the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process.

Advantages of Gradient Descent


1. Easy to understand
2. Easy to implement
Disadvantages of Gradient Descent
1. Because this method calculates the gradient over the entire data set for a single update, computation is very slow.
2. It requires a large amount of memory and is computationally expensive.
Learning Rate
How big or small the steps are that gradient descent takes in the direction of the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.
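To make this concrete, here is a minimal sketch of batch gradient descent on a toy least-squares problem. The data, the learning rate of 0.1, and the step count are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
learning_rate = 0.1

for step in range(200):
    # Gradient of the mean-squared-error loss over the ENTIRE training set.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * grad               # step opposite the gradient

print(w)   # should end up close to true_w
```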

2. Stochastic Gradient Descent


It is a variant of gradient descent that updates the model parameters one training example at a time. If the training set contains 10,000 examples, SGD performs 10,000 parameter updates per epoch.



Advantages of Stochastic Gradient Descent
1. Frequent updates of the model parameters.
2. Requires less memory.
3. Allows the use of large data sets, since only one example has to be processed at a time.
Disadvantages of Stochastic Gradient Descent
1. The frequent updates result in noisy gradients, which may cause the error to increase instead of decrease.
2. High Variance.
3. Frequent updates are computationally expensive.
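Despite the noise, the method is simple to express in code. Below is a minimal sketch of plain SGD on a toy least-squares problem; the data, the learning rate, and the epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.01

for epoch in range(20):
    for i in rng.permutation(len(y)):       # visit examples in random order
        xi, yi = X[i], y[i]
        grad = 2 * xi * (xi @ w - yi)       # gradient from a SINGLE example
        w -= lr * grad                      # one update per example
```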

3. Mini-Batch Gradient Descent


It combines the concepts of SGD and batch gradient descent: it splits the training dataset into small batches and performs an update for each of those batches. This creates a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent; it reduces the variance of the parameter updates, so convergence is more stable. The dataset is split into batches of between 50 and 256 examples, chosen at random.
Advantages of Mini Batch Gradient Descent:
1. It leads to more stable convergence.
2. More efficient gradient calculations.
3. Requires less amount of memory.
Disadvantages of Mini Batch Gradient Descent
1. Mini-batch gradient descent does not guarantee good convergence.
2. If the learning rate is too small, convergence will be slow; if it is too large, the loss function will oscillate around, or even diverge from, the minimum.
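A minimal sketch of the mini-batch loop follows; the batch size of 64 is an illustrative pick from the 50-256 range mentioned above, and the rest of the setup is assumed toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 64

for epoch in range(50):
    order = rng.permutation(len(y))          # shuffle, then slice into batches
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)   # gradient over ONE batch
        w -= lr * grad
```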

4. SGD with Momentum


SGD with Momentum is a stochastic optimization method that adds a momentum term to regular stochastic gradient descent. Momentum simulates the inertia of a moving object: the direction of the previous update is retained to a certain extent, while the current gradient is used to fine-tune the final update direction. This increases stability, allows faster learning, and gives the method some ability to escape shallow local optima.

The momentum update keeps a velocity term V(t):

V(t) = γ·V(t−1) + α·∂J(θ)/∂θ, with parameter update θ = θ − V(t)

where γ (typically around 0.9) is the momentum coefficient and α is the learning rate.
Advantages of SGD with momentum
1. Momentum helps to reduce the noise.
2. Exponential Weighted Average is used to smoothen the curve.
Disadvantage of SGD with momentum
1. Extra hyperparameter is added.
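The update rule can be written as a single step. This is a minimal sketch; the function name and the defaults lr=0.01 and gamma=0.9 are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update: v is the exponentially weighted
    average ("velocity") of past gradients."""
    v = gamma * v + lr * grad
    return w - v, v

# Illustrative usage with dummy values:
w, v = np.ones(3), np.zeros(3)
w, v = momentum_step(w, grad=np.array([0.1, -0.2, 0.3]), v=v)
```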

5. Nesterov Accelerated Gradient (NAG)


The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variation: in SGD with momentum, the gradient is computed at the current weights, and the accumulated momentum is simply added on top.
Momentum is a good method, but if the momentum is too high the algorithm may overshoot the minimum and keep going. To resolve this issue, the NAG algorithm was developed as a look-ahead method. Since we know we will be using γ·V(t−1) to modify the weights, θ − γ·V(t−1) approximately tells us the future location. We therefore calculate the gradient at this future position rather than the current one:
V(t) = γ·V(t−1) + α·∂J(θ − γ·V(t−1))/∂θ
and then update the parameters using θ = θ − V(t).
Again, we set the momentum term γ to a value of around 0.9. While momentum first computes the current gradient and then takes a big jump in the direction of the updated accumulated gradient, NAG first makes a big jump in the direction of the previously accumulated gradient, measures the gradient there, and then makes a correction, which yields the complete NAG update. This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly improved the performance of RNNs on a number of tasks.

Both NAG and SGD with momentum algorithms work equally well and share the same advantages and
disadvantages.
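A minimal sketch of one NAG step, assuming a callable grad_fn that returns the gradient of the loss at given weights (an assumption for illustration, not something from the text):

```python
import numpy as np

def nag_step(w, v, grad_fn, lr=0.01, gamma=0.9):
    """One NAG update: the gradient is evaluated at the look-ahead point
    w - gamma*v instead of at w itself."""
    g = grad_fn(w - gamma * v)      # gradient at the approximate future position
    v = gamma * v + lr * g
    return w - v, v

# Illustrative usage on the quadratic J(w) = 0.5*||w||^2, whose gradient is w:
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = nag_step(w, v, grad_fn=lambda theta: theta)
```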

6. Adaptive Gradient Descent(AdaGrad)


In all the algorithms discussed so far, the learning rate remains constant. The intuition behind AdaGrad is to use a different learning rate for every parameter at every iteration, adapting each rate based on the gradients that parameter has accumulated so far.
Advantages of AdaGrad
1. Learning Rate changes adaptively with iterations.
2. It also works well on sparse data.
Disadvantage of AdaGrad
1. If the neural network is deep, the accumulated squared gradients keep growing, so the effective learning rate becomes a very small number; updates then effectively stop, which can cause a dead-neuron problem.
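A minimal sketch of one AdaGrad step; names and defaults are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, grad, g2_sum, lr=0.01, eps=1e-8):
    """One AdaGrad update: each parameter's step is scaled by
    1/sqrt(cumulative sum of its squared gradients)."""
    g2_sum = g2_sum + grad**2                    # grows monotonically
    w = w - lr * grad / (np.sqrt(g2_sum) + eps)  # eps avoids division by zero
    return w, g2_sum
```

Because g2_sum only ever grows, the effective step size shrinks toward zero over time, which is exactly the disadvantage noted above.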

7. Root Mean Square Propagation (RMS-Prop )


RMS-Prop is a special version of AdaGrad in which the learning rate is scaled by an exponential moving average of the squared gradients instead of their cumulative sum. It therefore keeps AdaGrad's per-parameter adaptation while preventing the learning rate from decaying to zero, and in practice it is often combined with momentum.

Advantages of RMS-Prop
1. The learning rate is adjusted automatically, with a different effective learning rate for each parameter.
Disadvantages of RMS-Prop
1. Slow Learning
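A minimal sketch of one RMSProp step, under the same illustrative assumptions as the previous sketches:

```python
import numpy as np

def rmsprop_step(w, grad, avg_g2, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update: squared gradients are tracked with an
    exponential moving average, so the denominator stays bounded."""
    avg_g2 = rho * avg_g2 + (1 - rho) * grad**2
    w = w - lr * grad / (np.sqrt(avg_g2) + eps)
    return w, avg_g2
```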
8. AdaDelta
Adadelta is an extension of Adagrad that tries to remove its aggressive, monotonically decreasing learning rate. In Adadelta we do not need to set a default learning rate: the step size is taken as the ratio of a running average of past parameter updates to a running average of past squared gradients.
Advantages of Adadelta
1. The main advantage of AdaDelta is that we do not need to set a default learning rate.
Disadvantages of Adadelta
1. Computationally expensive
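A minimal sketch of one AdaDelta step, following Zeiler's formulation; rho=0.95 and eps=1e-6 are common illustrative defaults, not values from the text.

```python
import numpy as np

def adadelta_step(w, grad, avg_g2, avg_dx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: no explicit learning rate; the step size is
    the ratio of the running RMS of past updates to that of gradients."""
    avg_g2 = rho * avg_g2 + (1 - rho) * grad**2
    dx = -(np.sqrt(avg_dx2 + eps) / np.sqrt(avg_g2 + eps)) * grad
    avg_dx2 = rho * avg_dx2 + (1 - rho) * dx**2
    return w + dx, avg_g2, avg_dx2
```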

9. Adaptive Moment Estimation (Adam)


The Adam optimizer is one of the most popular gradient descent optimization algorithms. It computes adaptive learning rates for each parameter, storing both a decaying average of past gradients (similar to momentum) and a decaying average of past squared gradients (similar to RMS-Prop and Adadelta). It thus combines the advantages of both methods.

Advantages of Adam
1. Easy to implement
2. Computationally efficient.
3. Low memory requirements.
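A minimal sketch of one Adam step; the defaults (lr=0.001, beta1=0.9, beta2=0.999) are the commonly cited ones, used here for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t counts steps starting at 1): m is the decaying
    average of gradients, v of squared gradients; both are bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)        # bias correction for the warm-up phase
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```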

How to choose optimizers?


 If the data is sparse, use the adaptive methods: Adagrad, Adadelta, RMSprop, or Adam.
 RMSprop, Adadelta, and Adam behave similarly in many cases.
 Adam essentially adds bias-correction and momentum on top of RMSprop.
 As gradients become sparse, Adam tends to perform better than RMSprop.
ACTIVATION FUNCTION
What is an Activation Function?
The activation function is a mathematical function that is used within neural networks and decides whether
a neuron is activated or not. It processes the weighted sum of the neuron’s inputs and calculates a new value
to determine how strongly the signal is passed on to the next layer in the network. In simple terms, the
activation function determines the strength of the neuron’s response to the weighted input values.
The activation function plays a crucial role in the training of neural networks, as it enables the modeling of
non-linear relationships. The choice of the appropriate function for the model architecture and the
underlying data has a decisive influence on the final results and is therefore an important component in the
creation of a neural network.
What Properties do Activation Functions have?
The activation function has an important influence on the performance of neural networks and should be
chosen depending on the complexity of the data and the prediction type. Although there are a variety of
functions to choose from, they all share the following properties, which we explain in more detail in this
section.
One of the most important properties of activation functions is their non-linearity, which enables the
models to learn complex relationships from the data that go beyond simple, linear relationships. Only then
can the challenging applications in image or speech processing be mastered. Although linear activation functions can also be used, they have some disadvantages, as we will see in the following section.
In addition, activation functions must be differentiable (at least almost everywhere, as with ReLU). In other words, it must be mathematically
possible to form the derivative of the function so that the learning process of the neural networks can take
place. This process is based on the backpropagation algorithm, in which the gradient, i.e. the derivative in
several dimensions, is calculated in each iteration and the weights of the individual neurons are changed
based on the results so that the prediction quality increases. Only through this process and the
differentiability of the activation function is the model able to learn and continuously improve.
In addition to these positive or at least neutral properties, activation functions also have problematic
properties that can lead to challenges during training. Some activation functions, such as Sigmoid or Tanh,
have saturation regions in which the gradients become very small and come close to zero. Within these
ranges, changes in the input values cause only very small changes in the activation function, so that the
training of the network slows down considerably. This so-called vanishing gradient effect occurs above all in
the value ranges in which the activation function reaches the minimum or maximum values.
Linear activation function

As a starting point and for better comparability with later functions, we start with the simplest possible
activation function. The linear activation function returns the input value unchanged and is described by the
following formula:

f(x) = x

Although it appears that this function does not make any changes to the data, it does have an important
influence on how the network functions. It ensures that the neural network can only recognize linear
relationships in the data. This limits its performance immensely, as no more complex structures can be
learned from the data. For this reason, this simple activation function is rarely used in deep neural networks,
but only in simpler, linear models or in the output layer for regressions.

Sigmoid function

The sigmoid function is one of the oldest non-linear activation functions that has been used in the field of
machine learning for many years. It is described by the following mathematical formula:

σ(x) = 1 / (1 + e^(−x))

This function ensures that the input value is mapped to a range between 0 and 1. The graph follows the
characteristic S-curve, which ensures that small values lie in a range close to 0 and high values are
transformed into a range close to 1.

This range of values makes the sigmoid function particularly suitable for applications in which binary values
are to be predicted so that the output can then be interpreted as the probability of membership. For this
reason, the sigmoid function is primarily used in the last layer of a network if a binary prediction is to be
made. This can be useful, for example, in the area of object recognition in images or in medical diagnoses
where a patient is to be classified as healthy or ill.

[Figure: Graph of the Sigmoid Function | Source: Author]


The main disadvantage of the sigmoid function is the so-called vanishing gradient problem: for very large or very small input values, the gradient approaches zero. As a result, the neurons' weights are adjusted only very slightly, or not at all, during backpropagation, making training slow and inefficient.

It can also lead to problems if the output values of the sigmoid function are not centered around zero, but lie
between 0 and 1. This means that both the positive and negative gradients are always in the same direction,
which further slows down the convergence of the model.

Due to these disadvantages, the sigmoid function is increasingly being replaced in modern network
architectures by other activation functions that enable more efficient training, which is particularly
important in deep architectures.
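A small sketch makes the saturation behaviour visible; the sample inputs are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(x)
print(s)            # all outputs squashed into (0, 1)
print(s * (1 - s))  # derivative: near zero at the saturated ends
```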

Hyperbolic tangent (Tanh)

The hyperbolic tangent function, or tanh, is another non-linear activation function used in neural networks
to learn more complex relationships in the data. It is based on the following mathematical formula:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

The tanh function transforms the input value into the range between -1 and 1. In contrast to Sigmoid, the
values are therefore distributed around zero. This results in some advantages compared to the previously
presented activation functions, as the centering around zero helps to improve the training effect and the
weight adjustments move faster in the right direction.

It is also advantageous that the tanh function also scales smaller input values more strongly in the output
range so that the values can be better separated from each other, especially when the input range is close
together.

Due to these properties, the hyperbolic tangent is often used in recurrent neural networks where temporal
sequences and dependencies play an important role. By using positive and negative values, the state changes
in an RNN can be represented much more precisely.

Like the sigmoid function, however, the hyperbolic tangent struggles with the same problems. The
vanishing gradient problem can also occur with this activation function, especially with extremely large or
small values. With very deep neural networks, it then becomes difficult to keep the gradients in the front
part of the network strong enough to make sufficient weight adjustments. In addition, saturation effects can
also occur in these value ranges, so that the gradient decreases sharply near 1 or -1.
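A short sketch comparing tanh outputs and gradients on arbitrary sample inputs:

```python
import numpy as np

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
t = np.tanh(x)
print(t)         # zero-centered outputs in (-1, 1)
print(1 - t**2)  # derivative: near zero in the saturated regions near -1 and 1
```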

Rectified Linear Unit (ReLU)

The Rectified Linear Unit (ReLU for short) is a piecewise linear activation function that was introduced to solve the vanishing gradient problem and has become increasingly popular in recent years. In short, it keeps positive values and sets negative input values to zero. Mathematically, this is expressed by the following term:

f(x) = x for x > 0, and f(x) = 0 for x ≤ 0

In simpler terms, it can be represented using the max function:

f(x) = max(0, x)

The ReLU activation function has established itself primarily due to the following advantages:

 Simple calculation: Compared to the other options, the ReLU function is very easy to calculate and
therefore saves a lot of computing power, especially for large networks. This is reflected either in
lower costs or a shorter training time.
 No vanishing gradient problem: For positive inputs the slope is constant, so there are no saturating regions parallel to the x-axis. The gradient therefore does not become vanishingly small, and the error signal propagates through all layers even in large networks. This ensures that the network keeps learning structure and significantly accelerates the learning process.
 Better results for new model architectures: Compared to the other activation functions, ReLU can
set values to zero as soon as they are negative. With the sigmoid, softmax, and tanh functions, on the
other hand, the values only approach zero asymptotically, but never become zero. However, this
leads to problems in newer models, such as autoencoders, as real zeros are required in the so-called
code layer in order to achieve good results.
 Economy: The ability of the activation function to set certain input values to zero makes the model
much more economical with computing power. If neurons are permanently given zero values, they
"die" and become inactive. This reduces the complexity of the model and can lead to better
generalization.

However, there are also problems with this simple activation function. Because negative values are consistently set to zero, individual neurons can end up always outputting zero; they then receive no gradient, make no contribution to the learning process, and therefore "die off". This may not initially be a problem for individual neurons, but it has been shown that in some cases as many as 20 to 50 % of neurons can "die off" as a result of ReLU.

This problem occurs more frequently if too high a learning rate has been defined so that the weights of the
neuron can change in such a way that the neuron only receives negative values. In the long term, these
neurons remain dead because they no longer generate a gradient and are no longer capable of learning. This
means that models with ReLU as an activation function are also highly dependent on a well-chosen learning
rate, which should be carefully considered in advance.

Furthermore, it is problematic that the ReLU function is not limited and can theoretically assume infinitely
large, positive values. Particularly in applications where the output range is limited, such as the prediction of
probabilities, the ReLU function must then be supplemented with another activation function such as the
softmax so that interpretable results are output.

The ReLU function is primarily used in deep neural networks, as convergence can be significantly
accelerated due to efficient gradient processing. In addition, computational effort can be saved, increasing
the efficiency of the entire model. A central application here is the training of autoencoders, which is used to
learn compressed representations of the data. An efficient and compressed representation can be found
through the sparse activations.
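A minimal sketch of ReLU and its (sub)gradient on arbitrary sample inputs, showing the exact zeros that produce the sparse activations discussed above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))                # negative inputs clipped to exactly 0
print((x > 0).astype(float))  # gradient: 1 for positive inputs, 0 otherwise
```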
Leaky ReLU

To eliminate this disadvantage and make the ReLU function more robust, an optimization of the function has
been developed, which is known as Leaky ReLU. Compared to the conventional version of the function,
negative values are not set to zero but are given an (albeit small) positive slope. Mathematically, this looks
like this:

In a more compact form, the function looks like this:

The parameter α is a positive constant that must be determined before training and can be 0.01, for example. This ensures that even if the neuron receives negative values, its output does not become zero, so it can still generate a small gradient. This prevents the neurons from dying, as they continue to make a small contribution to learning.

In addition to this advantage of the Leaky ReLU function, this activation function is also characterized by the
fact that the learning ability of the model is increased, as it is possible to learn even with negative values,
and their information is not lost. This property can lead to faster convergence, as more neurons remain
active and participate in the learning process. In addition, this activation function has the advantage that it
can be calculated with similar efficiency despite the small changes to the ReLU.

A possible disadvantage is that α introduces another hyperparameter, which must be determined before
training and has a major influence on the quality of the training. A value that is too small can slow down
learning, as some neurons do not die, but come close to zero and therefore contribute little to training.
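A minimal sketch of Leaky ReLU with an illustrative α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs keep a small slope instead of being zeroed,
    # so their gradient is alpha rather than 0 and the neuron can still learn.
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))   # [-0.03, -0.005, 0.0, 0.5, 3.0]
```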

Softmax

The softmax is a mathematical function that takes a vector as input and converts its values into probabilities,
depending on their size. A high numerical value leads to a high probability in the resulting vector.

In other words, each value of the vector is exponentiated and divided by the sum of the exponentials of all values, and the result is stored in the new vector. In purely mathematical terms, this formula looks like this:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

A specific example illustrates how the softmax function works: for the input vector (1, 2, 3), the exponentials are approximately (2.72, 7.39, 20.09), their sum is about 30.19, and the output is therefore approximately (0.09, 0.24, 0.67).

The positive feature of this function is that it ensures that the output values always sum to exactly 1. This is particularly advantageous in probability calculations, as the outputs can be read directly as a probability distribution over the classes.

At first glance, the sigmoid and softmax functions appear relatively similar, as both functions map the input
value to the numerical range between 0 and 1. Their progression is also almost identical with the difference
that the sigmoid function passes through the value 0.5 at x = 0 and the softmax function is still below 0.5 at
this point.

[Figure: Sigmoid and Softmax function in the range [-4, 4] | Source: Nomidl]

The difference between the functions lies in the application. The sigmoid function can be used for binary
classifications, i.e. for models in which a decision is to be made between two different classes. Softmax, on
the other hand, can also be used for classifications that predict more than two classes. The function ensures that the probabilities of all classes sum to 1.

The advantages of softmax are that the outputs are interpretable and represent probabilities, which is particularly helpful in classification problems. In practice, the computation is also kept numerically stable by shifting the inputs by their maximum value before exponentiation, so the function can handle large differences in the input data (see the sketch below).
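A minimal sketch of such a numerically stable softmax; the max-subtraction is the standard trick and does not change the result:

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)   # shift by the maximum; output is unchanged
    e = np.exp(shifted)
    return e / e.sum()        # outputs sum to exactly 1

print(softmax(np.array([1.0, 2.0, 3.0])))   # ~[0.09, 0.24, 0.67]
```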

Disadvantages include overconfidence, meaning that the predicted probabilities can look very confident even when the model is in fact quite uncertain. Therefore, measures for uncertainty assessment should be
included to avoid this problem. Furthermore, although Softmax is suitable for multi-class classifications, the
number of classes should not increase too much, as the exponential calculation for each class would then be
too time-consuming and computationally intensive. In addition, the model may then become unstable as the
probabilities for individual classes become too low.
