
Role of optimizer in neural network

1. RMSProp (Root Mean Square Propagation)
RMSProp adjusts the learning rate automatically, choosing a different effective learning rate for each parameter.

RMSProp divides the learning rate by an exponentially decaying average of squared gradients:

$V_t = \rho\, V_{t-1} + (1-\rho)\, g_t^2$

$\Delta \omega_t = \dfrac{-\eta}{\sqrt{V_t + \epsilon}}\, g_t$
η : initial learning rate
V_t : exponential average of squared gradients
g_t : gradient at time t along ω_j
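A minimal NumPy sketch of the RMSProp update above, kept in the same notation; the default values of rho, eta and eps are common choices and are assumptions, not values taken from this document.

import numpy as np

def rmsprop_step(w, grad, v, eta=0.001, rho=0.9, eps=1e-8):
    # V_t = rho * V_{t-1} + (1 - rho) * g_t^2  (decaying average of squared gradients)
    v = rho * v + (1.0 - rho) * grad ** 2
    # delta_w_t = -eta / sqrt(V_t + eps) * g_t  (per-parameter scaled step)
    w = w - eta / np.sqrt(v + eps) * grad
    return w, v

The running average v is carried between calls, so a parameter whose recent squared gradients have been large receives a smaller effective learning rate.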


2. Adam
This optimization algorithm keeps running averages of both the gradients and the second moments of the gradients. If one wants to train the neural network in less time and more efficiently, Adam is the optimizer of choice: in addition to the average of squared gradients, it keeps an exponentially decaying average of past gradients (V_t in the formulas below).

Advantages:

• The effective learning rate does not decay to zero, so training does not stall.

Disadvantages:

• Computationally expensive.

$V_t = \beta_1\, V_{t-1} + (1-\beta_1)\, g_t$

$s_t = \beta_2\, s_{t-1} + (1-\beta_2)\, g_t^2$

$\Delta \omega_t = \dfrac{-\eta\, V_t}{\sqrt{s_t + \epsilon}}$
η : initial learning rate
V_t : exponential average of gradients along ω_j
g_t : gradient at time t along ω_j
β_1, β_2 : hyperparameters
s_t : exponential average of squared gradients along ω_j
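A minimal NumPy sketch of the Adam update in the same notation; bias correction is omitted to match the formulas above, and the default values of beta1, beta2, eta and eps are common choices assumed here, not taken from this document.

import numpy as np

def adam_step(w, grad, v, s, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # V_t = beta1 * V_{t-1} + (1 - beta1) * g_t     (average of gradients)
    v = beta1 * v + (1.0 - beta1) * grad
    # s_t = beta2 * s_{t-1} + (1 - beta2) * g_t^2   (average of squared gradients)
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    # delta_w_t = -eta * V_t / sqrt(s_t + eps)
    w = w - eta * v / np.sqrt(s + eps)
    return w, v, s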


We used different optimizers (Adam, RMSProp, SGD) to train our best CNN models, ResNet-50 and VGG16, and compared their results. Based on this comparison we chose the Adam optimizer to train our network: it outperformed the other optimization methods, its effective learning rate does not decay to zero, and training does not stall.
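A hedged PyTorch sketch of how such a setup could look; the torchvision model constructors, the learning rate and the loss function are assumptions for illustration, not details reported in this document.

import torch
import torchvision.models as models

# One of the CNN backbones mentioned above; models.vgg16(weights=None) would be the VGG16 variant.
model = models.resnet50(weights=None)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

# One training step (images and labels would come from a DataLoader, not shown here):
# optimizer.zero_grad()
# loss = criterion(model(images), labels)
# loss.backward()
# optimizer.step()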

3. SGD (Stochastic Gradient Descent)

SGD is a variant of Gradient Descent that updates the model's parameters more frequently: the parameters are altered after the loss is computed on each individual training example. So, if the dataset contains 1000 rows, SGD updates the model parameters 1000 times in one pass over the dataset, instead of once as in batch Gradient Descent (a minimal sketch contrasting the two follows the lists below).

Advantages:

• Frequent updates of the model parameters, so it converges in less time.

• Requires less memory, since the loss does not need to be accumulated over the whole dataset before each update.

• The noisy updates may reach new minima.

Disadvantages:

• High variance in the model parameters.

• May overshoot even after reaching the global minimum.

• To obtain the same convergence as Gradient Descent, the learning rate must be reduced slowly.
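A minimal NumPy sketch contrasting per-example SGD updates with a single full-batch Gradient Descent update; grad_fn, X, y and the learning rate are hypothetical placeholders, not part of this document.

import numpy as np

def sgd_epoch(w, X, y, grad_fn, eta=0.01):
    # One pass over the dataset: a parameter update after every single example,
    # so 1000 rows produce 1000 updates per epoch.
    for xi, yi in zip(X, y):
        w = w - eta * grad_fn(w, xi, yi)
    return w

def batch_gd_step(w, X, y, grad_fn, eta=0.01):
    # Plain (batch) Gradient Descent: one update from the gradient averaged over the whole dataset.
    g = np.mean([grad_fn(w, xi, yi) for xi, yi in zip(X, y)], axis=0)
    return w - eta * g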
