UNIT-1 Foundations of Deep Learning
The value of the net input can be anything from -inf to +inf.
The neuron does not know how to bound this value and therefore cannot decide the firing pattern on its own.
The activation function is an important part of an artificial neural network. It decides whether a neuron should be activated or not, and thus it bounds the value of the net input.
The activation function is a non-linear transformation that we apply to the net input before sending it to the next layer of neurons or finalizing it as the output.
The activation function is also called a “Transfer Function”.
Different types of activation functions are used in Deep Learning.
1. Step Function:
The step function is one of the simplest kinds of activation functions. We choose a threshold value, and if the net input, say y, is greater than the threshold, the neuron is activated.
Mathematically,
f(y) = 1 if y >= threshold, and f(y) = 0 otherwise.
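As an illustration, here is a minimal NumPy sketch of a step activation; the threshold of 0 is an assumed example value, not something fixed by the text.

```python
import numpy as np

def step(y, threshold=0.0):
    # Fire (output 1) once the net input reaches the threshold, otherwise stay inactive (0).
    return np.where(y >= threshold, 1.0, 0.0)

print(step(np.array([-2.0, 0.5, 3.0])))  # [0. 1. 1.]
```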
For the tanh function, an equivalent form is:
tanh(x) = 2 * sigmoid(2x) - 1
The range of values is from -1 to +1, and tanh is a non-linear activation function.
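As a quick check of the identity above, a small NumPy sketch (the explicit sigmoid helper is assumed here for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
# tanh(x) equals 2 * sigmoid(2x) - 1, and both stay within (-1, +1).
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```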
7. Softmax Function:
The softmax function is also a type of sigmoid function and is used to handle classification problems.
It is used when the output has more than two classes, i.e., when handling multiple classes.
The softmax function squeezes the output for each class to between 0 and 1 and also divides by the sum of the outputs.
Softmax is a non-linear activation function.
The softmax function is ideally used in the output layer of a classifier, where we are trying to obtain the probabilities that define the class of each input.
The output of the softmax function for each class is the ratio of the exponential of that class's input to the sum of the exponentials of all the inputs.
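A minimal NumPy sketch of that ratio of exponentials (subtracting the maximum is an assumed, standard trick for numerical stability only):

```python
import numpy as np

def softmax(z):
    # Exponential of each score, divided by the sum of the exponentials.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # each value lies between 0 and 1, and they sum to 1
```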
CHOOSING THE RIGHT ACTIVATION FUNCTION
If you do not know which activation function to use, simply use ReLU, as it is a general-purpose activation function and is used in most cases these days.
If the output is for binary classification, the sigmoid function is a very good choice for the output layer.
If the output is for anything other than binary classification, the softmax function is the choice for the output layer.
For all hidden layers, ReLU is the usual choice of activation function.
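For illustration, a small sketch of a forward pass that follows these guidelines: ReLU in the hidden layer and softmax at the output for an assumed 3-class problem (all sizes and weights below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # hypothetical input with 4 features
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

h = np.maximum(0.0, W1 @ x + b1)              # hidden layer: ReLU
z = W2 @ h + b2                               # output layer scores for 3 classes
probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # output layer: softmax
print(probs)
```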
GRADIENT DESCENT OPTIMIZATION ALGORITHMS
1. Batch Gradient Descent:
In batch gradient descent, the gradient of the loss is calculated over the entire training set for each parameter update.
Advantages:
Easy to understand
Easy to implement
Easy for Computation
Disadvantages:
Because this method calculates the gradient for the entire dataset in one update, the calculation is very slow. So, if the dataset is too large, it may take very long to converge to the minima.
It requires a large amount of memory and is computationally expensive.
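A minimal sketch of batch gradient descent on a toy linear-regression problem (mean-squared-error loss; all data and the learning rate are made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + one feature
y = np.array([2.0, 3.0, 4.0])                       # y = 1 + x, so the target weights are [1, 1]
w = np.zeros(2)
lr = 0.1

for _ in range(500):
    # Gradient of the MSE loss computed over the ENTIRE dataset, then one update.
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    w -= lr * grad

print(w)  # close to [1, 1]
```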
2. Stochastic Gradient Descent (SGD):
In SGD, the model parameters are updated after every single training record.
Advantages:
1. Frequent updates of the model parameters.
2. Requires less memory, as there is no need to store the values of the loss function.
3. Allows the use of large datasets, as only one record has to be processed at a time for each update.
Disadvantages:
1. The frequent updates can also result in noisy gradients, which may cause the error to increase instead of decrease.
2. High Variance.
3. Frequent updates are computationally expensive.
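A sketch of the same toy problem with stochastic updates, one record at a time (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
w = np.zeros(2)
lr = 0.05

for epoch in range(200):
    for i in rng.permutation(len(X)):           # visit records in random order
        grad = 2.0 * X[i] * (X[i] @ w - y[i])   # gradient from ONE record only
        w -= lr * grad                          # frequent, noisy update
print(w)  # close to [1, 1]
```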
3. Mini-Batch Gradient Descent:
In mini-batch gradient descent, the parameters are updated after every small batch of training records.
Disadvantages:
Mini-batch gradient descent does not guarantee good convergence.
If the learning rate is too small, the convergence rate will be slow. If it is too large, the loss function will oscillate around or even diverge from the minimum value.
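A sketch of mini-batch updates on the same kind of toy problem (the batch size of 2 and the learning rate are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 3.0, 4.0, 5.0])
w = np.zeros(2)
lr, batch_size = 0.05, 2

for epoch in range(300):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # Gradient over one small batch of records, then an update.
        grad = 2.0 / len(idx) * X[idx].T @ (X[idx] @ w - y[idx])
        w -= lr * grad
print(w)  # close to [1, 1]
```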
4. SGD with Momentum:
Advantages:
Momentum helps to reduce the noise. It reduces the oscillations and the high variance of the parameters.
It converges faster than plain gradient descent.
An exponentially weighted average is used to smooth the curve.
Disadvantages:
An extra hyperparameter is added, which needs to be selected manually and accurately.
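A minimal sketch of a momentum update, where an exponentially weighted average of past gradients smooths the direction; beta is the extra hyperparameter mentioned above (values here are illustrative):

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # Exponentially weighted average of past gradients reduces noise and oscillation.
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

# Toy usage on f(w) = w**2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)  # approaches 0
```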
5. AdaGrad (Adaptive Gradient):
AdaGrad adapts the learning rate to each individual parameter.
Disadvantages:
1. Computationally expensive, as there is a need to calculate the second-order derivative.
2. The learning rate is always decreasing, which results in slow training.
3. If the neural network is deep, the learning rate becomes a very small number, which causes the dead neuron problem.
4. The default learning rate needs to be set first.
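A minimal sketch of an AdaGrad step; note how the squared-gradient accumulator only grows, which is exactly why the effective learning rate keeps shrinking (all values are illustrative):

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.1, eps=1e-8):
    # The accumulator of squared gradients never shrinks, so the effective
    # step size lr / sqrt(cache) decreases for the whole of training.
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w, cache = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, cache = adagrad_step(w, cache, grad=2 * w)  # gradient of sum(w**2)
print(w)  # moves toward 0, but more and more slowly as the accumulator grows
```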
6. AdaDelta:
AdaDelta is an extension of AdaGrad that tries to counter AdaGrad's aggressive, monotonically decreasing learning rate and remove the decaying-learning-rate problem.
In AdaDelta we do not need to set a default learning rate, as the step size is taken from the ratio of the running average of the updates from previous time steps to the running average of the current gradients.
Advantages:
1. Now the learning rate does not decay and the training does not
stop.
Disadvantages:
1. Computationally expensive.
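A minimal sketch of an AdaDelta step, assuming the usual formulation with running averages of both squared gradients and squared updates, so no default learning rate appears:

```python
import numpy as np

def adadelta_step(w, eg2, edx2, grad, rho=0.95, eps=1e-6):
    # Running average of squared gradients ...
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    # ... and of squared previous updates; their ratio replaces a fixed learning rate.
    delta = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * delta ** 2
    return w + delta, eg2, edx2

# One illustrative step on a single parameter with gradient 10.
w, eg2, edx2 = adadelta_step(np.array([5.0]), np.zeros(1), np.zeros(1), grad=np.array([10.0]))
print(w)
```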
7. Adam (Adaptive Moment Estimation)
The Adam optimizer is one of the most popular gradient descent optimization algorithms.
It is a method that computes adaptive learning rates for each parameter.
It works with moments of first and second order. The intuition behind Adam is that we do not want to roll so fast that we jump over the minimum; we want to decrease the velocity a little for a careful search.
Updates both bias and weights.
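A minimal sketch of one Adam step, combining bias-corrected first- and second-order moment estimates; the toy loop and the raised learning rate are purely illustrative (Adam's common default is 0.001):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # first-order moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second-order moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)  # gradient of w**2
print(w)  # close to the minimum at 0
```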
8. RMSprop Optimizer
The RMSprop optimizer is similar to the gradient descent algorithm
with momentum.
The RMSprop optimizer restricts the oscillations in the vertical direction. Therefore, we can increase the learning rate, and the algorithm can take larger steps in the horizontal direction and converge faster.
The following update rules show how the gradients are used in RMSprop and in gradient descent with momentum. The value of the momentum term is denoted by beta and is usually set to 0.9.
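The equations referred to above are not reproduced in the text; the sketch below shows one standard form of both update rules, with beta = 0.9 as stated (the learning rate and epsilon are assumed illustrative values):

```python
import numpy as np

beta, lr, eps = 0.9, 0.01, 1e-8

def momentum_update(w, v, grad):
    # Gradient descent with momentum: exponentially weighted average of gradients.
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v

def rmsprop_update(w, s, grad):
    # RMSprop: exponentially weighted average of SQUARED gradients scales each step,
    # damping the oscillations in the steep (vertical) direction.
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

# Tiny usage example on a two-parameter toy gradient.
w, v = momentum_update(np.array([1.0, 1.0]), np.zeros(2), grad=np.array([0.5, -0.5]))
w, s = rmsprop_update(np.array([1.0, 1.0]), np.zeros(2), grad=np.array([0.5, -0.5]))
print(w)
```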