Optimization
➢ Optimization algorithms are responsible for reducing the loss and producing the most accurate results possible.
➢ The weights are initialized using some initialization strategy and are updated after each epoch according to an update rule such as w_new = w_old − learning_rate × gradient (a minimal sketch follows the list of optimizers below).
➢ The best results are achieved using optimization strategies or algorithms called optimizers.
➢ Optimizers are algorithms or methods used to change the attributes of a neural network, such as the weights and the learning rate, in order to reduce the loss.
➢ Gradient Descent
➢ Batch Gradient Descent
➢ Stochastic Gradient Descent (SGD)
➢ Mini-Batch Stochastic Gradient Descent (MB-SGD)
➢ RMSProp
➢ Adam
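➢ As a minimal sketch of the update rule mentioned above (the weight and gradient values below are illustrative assumptions), each optimizer step moves the weights against the gradient of the loss:

```python
import numpy as np

# Generic weight update: w_new = w_old - learning_rate * gradient
def update_weights(weights, gradients, learning_rate=0.01):
    """One optimizer step: move the weights against the gradient of the loss."""
    return weights - learning_rate * gradients

w = np.array([0.5, -1.2, 3.0])   # illustrative weights
g = np.array([0.1, -0.4, 0.9])   # illustrative gradient of the loss w.r.t. w
w = update_weights(w, g, learning_rate=0.1)
print(w)                         # weights nudged in the direction that reduces the loss
```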
➢ Gradient descent is an optimization algorithm used to minimize the cost function (error).
➢ It updates the parameters of a machine learning model iteratively to minimize the cost function.
➢ Gradient descent is commonly used to train machine learning models and neural networks.
What is a Gradient?
➢ A gradient measures how much the output of a function changes if you change the inputs a
little bit.
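➢ A small sketch of both ideas, assuming a toy cost function (w − 3)²: the gradient is estimated by nudging the input a little, and gradient descent repeatedly steps against it:

```python
# Toy cost function with its minimum at w = 3
def cost(w):
    return (w - 3.0) ** 2

# The gradient measures how much the output changes when the input
# changes by a small amount eps (central finite difference)
def numerical_gradient(f, w, eps=1e-6):
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0                      # arbitrary starting point
learning_rate = 0.1
for step in range(50):       # repeated small steps downhill
    w -= learning_rate * numerical_gradient(cost, w)

print(round(w, 4))           # approaches the minimum at w = 3.0
```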
➢ Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after all training examples have been evaluated.
➢ One full pass over the training set is known as a training epoch.
➢ Batch gradient descent is also called vanilla gradient descent.
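➢ A minimal batch gradient descent sketch, assuming a toy one-weight regression problem (the data and settings below are illustrative):

```python
import numpy as np

# Toy data generated from y = 2x, so the ideal weight is 2.0
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0
learning_rate = 0.01

for epoch in range(200):
    # Evaluate ALL training examples before making a single update (one epoch)
    errors = w * X - y
    gradient = (2.0 / len(X)) * np.dot(errors, X)   # gradient of the mean squared error
    w -= learning_rate * gradient                   # one update per epoch

print(round(w, 3))   # converges toward 2.0
```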
➢ Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent.
➢ It splits the training dataset into small batches and performs an update on each of those batches.
➢ This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
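➢ A minimal mini-batch sketch on the same kind of toy regression problem (the batch size, data, and learning rate below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from y = 2x, so the ideal weight is 2.0
X = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = 2.0 * X

w = 0.0
learning_rate = 0.01
batch_size = 2

for epoch in range(100):
    # Shuffle, then split the training set into small batches
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        gradient = (2.0 / len(xb)) * np.dot(w * xb - yb, xb)
        # One update per mini-batch: cheaper than a full batch,
        # less noisy than a single example
        w -= learning_rate * gradient

print(round(w, 3))   # converges toward 2.0
```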
➢ The RMSprop optimizer is similar to the gradient descent algorithm with momentum.
➢ RMSprop restricts the oscillations in the vertical (steep) direction of the loss surface.
➢ Therefore, we can increase the learning rate, and the algorithm can take larger steps in the horizontal direction and converge faster.
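➢ A minimal RMSprop sketch on an elongated bowl that is steep in one direction and shallow in the other; the objective and hyperparameter values are assumptions chosen for illustration:

```python
import numpy as np

def grad(w):
    # Gradient of the bowl f(w) = w0**2 + 10 * w1**2 (steep along w1, shallow along w0)
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 2.0])
learning_rate = 0.1
decay = 0.9                        # decay rate of the squared-gradient moving average
eps = 1e-8                         # avoids division by zero
avg_sq = np.zeros_like(w)

for step in range(300):
    g = grad(w)
    # Per-parameter moving average of squared gradients
    avg_sq = decay * avg_sq + (1 - decay) * g ** 2
    # Dividing by its square root shrinks the step along the steep (oscillation-prone)
    # direction, which is what lets RMSprop tolerate a larger learning rate
    w = w - learning_rate * g / (np.sqrt(avg_sq) + eps)

print(np.round(w, 2))   # both coordinates end up near the minimum at (0, 0)
```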
➢ Adam is a replacement optimization algorithm for stochastic gradient descent when training deep learning models.
➢ Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an
optimization algorithm that can handle sparse gradients on noisy problems.
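➢ A minimal Adam sketch on a toy one-parameter objective; the hyperparameters are the commonly used defaults and the objective is an assumption for illustration:

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)            # gradient of the toy cost (w - 3)**2

w = 0.0
learning_rate = 0.1
beta1, beta2 = 0.9, 0.999             # decay rates for the two moving averages
eps = 1e-8
m, v = 0.0, 0.0                       # first moment (mean of gradients), second moment (mean of squared gradients)

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias corrections for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)

print(round(w, 2))   # ends up close to the minimum at 3.0
```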
➢ The learning rate is the size of the steps taken to reach the minimum.
➢ It is defined as the step size taken toward the minimum or lowest point of the cost function.
➢ It is typically a small value that is evaluated and adjusted based on the behavior of the cost function.
➢ A high learning rate results in larger steps but risks overshooting the minimum.
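➢ A small sketch of this trade-off on the toy cost (w − 3)²; the two learning-rate values are assumptions chosen to show the contrast:

```python
def run_gradient_descent(learning_rate, steps=20):
    w = 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)     # gradient of (w - 3)**2
        w -= learning_rate * gradient
    return w

print(round(run_gradient_descent(0.1), 3))    # small steps: converges steadily toward 3.0
print(round(run_gradient_descent(1.05), 3))   # too large: overshoots and moves away from the minimum
```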
What is a Cost Function?
➢ The cost function is defined as the measure of the difference, or error, between the actual values and the expected values at the current position, expressed as a single real number.
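➢ A small sketch of one common cost function, the mean squared error (the example values are illustrative assumptions):

```python
import numpy as np

# Mean squared error: one real number summarizing the gap between
# the actual values and the model's predictions
def mse(actual, predicted):
    return np.mean((actual - predicted) ** 2)

actual = np.array([2.0, 4.0, 6.0])
predicted = np.array([2.5, 3.5, 6.0])
print(mse(actual, predicted))   # 0.1666..., a single real number measuring the error
```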