ANN 3
Optimizers

Agenda:
• Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• Gradient Descent with Momentum
• AdaGrad
• RMS Prop
• Adam
1. Gradient Descent:

[Figure: the loss function plotted against the weight w, with a starting point marked.]

Step 1: We set a starting value for weight w (this value can be randomly chosen).
Step 2: We take steps in the direction that makes the loss decrease.
Stopping condition:
• The step size gets very close to 0 (e.g. min step size = 0.001 or smaller)
Ideal learning rate: larger when the point is far from the minimum & smaller as it gets closer.
Example: How does Gradient Descent fit a line to data?

We want to predict Height H from Shoe Size SS with three observed points
(SS, H) = (0.6, 1.1), (0.8, 1.2), (1.1, 2.1), using the model:

H_pred = slope × SS + intercept

[Figure: scatter plot of Height H against Shoe Size SS with the candidate line.]

Loss function = (H_obs_1 − H_pred_1)² + (H_obs_2 − H_pred_2)² + (H_obs_3 − H_pred_3)²

To descend, we need both partial derivatives:
d(Loss function)/d(slope) and d(Loss function)/d(intercept)
Loss function = Sum of Squared Residuals (SSR)
  = (1.1 − (slope × 0.6 + intercept))²
  + (1.2 − (slope × 0.8 + intercept))²
  + (2.1 − (slope × 1.1 + intercept))²

Writing each squared residual as U_i², with U_i = H_obs_i − (slope × SS_i + intercept), the chain rule gives:

d(Loss function)/d(intercept) = d(Loss function)/d(U) × d(U)/d(intercept)
  = 2U_1 × (−1) + 2U_2 × (−1) + 2U_3 × (−1)
  = −2 × (1.1 − (0.6 × slope + intercept))
  − 2 × (1.2 − (0.8 × slope + intercept))
  − 2 × (2.1 − (1.1 × slope + intercept))

d(Loss function)/d(slope) = d(Loss function)/d(U) × d(U)/d(slope)
  = 2U_1 × (−0.6) + 2U_2 × (−0.8) + 2U_3 × (−1.1)
Stacking both derivatives gives the gradient:

∇ Loss Function = ( d(Loss function)/d(slope) , d(Loss function)/d(intercept) )

We can now use this Gradient to descend to the minimal point in the cost function (hence the name Gradient Descent).
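The whole procedure can be sketched in code. This is a minimal illustration (not from the slides): plain gradient descent fitting H_pred = slope × SS + intercept to the three example points, using the two partial derivatives derived above; the learning rate, step count, and minimum step size are assumed values.

```python
# Gradient descent on the line-fitting example (illustrative sketch).
def fit_line(lr=0.1, steps=5000, min_step=1e-7):
    data = [(0.6, 1.1), (0.8, 1.2), (1.1, 2.1)]   # (Shoe Size, Height)
    slope, intercept = 0.0, 0.0                    # Step 1: starting values
    for _ in range(steps):
        # d(Loss)/d(slope) and d(Loss)/d(intercept) summed over all samples
        d_slope = sum(-2 * x * (y - (slope * x + intercept)) for x, y in data)
        d_int = sum(-2 * (y - (slope * x + intercept)) for x, y in data)
        slope -= lr * d_slope
        intercept -= lr * d_int
        # stop when the step size gets very close to 0
        if max(abs(lr * d_slope), abs(lr * d_int)) < min_step:
            break
    return slope, intercept

slope, intercept = fit_line()   # converges near slope ≈ 2.08, intercept ≈ −0.27
```

The stopping test mirrors the slide's criterion: the loop ends once both parameter steps are nearly zero.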
Conclusion:

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000

The data is entirely plugged into the Neural Network and the parameters are adjusted.
2. Stochastic Gradient Descent:

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000

The data is plugged into the Neural Network one sample at a time and the parameters are adjusted after every sample.
Note:
• Sensitivity to the learning rate also applies to SGD
• The same 'schedule' strategy is used: starting with large values and reducing them gradually
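A minimal sketch (not from the slides) of the per-sample update, on the same linear model y = w·x + b used in the earlier example; the learning rate and epoch count are illustrative assumptions.

```python
import random

# Stochastic gradient descent: parameters are adjusted after every single sample.
def sgd_fit(data, lr=0.05, epochs=300, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                  # visit the samples in random order
        for x, y in samples:
            residual = y - (w * x + b)
            w -= lr * (-2 * x * residual)     # gradient from this one sample only
            b -= lr * (-2 * residual)
    return w, b
```

With a constant learning rate the parameters keep fluctuating around the minimum, which is why the schedule strategy above (shrinking the rate over time) matters.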
3. Mini-Batch Gradient Descent:

The data is plugged into the Neural Network in small batches and the parameters are adjusted after every batch. This:
- gives more stable results than using one sample per step (Stochastic Gradient Descent)
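A minimal sketch (not from the slides): the same fit, but each update averages the gradient over a small batch; the batch size of 2 and the other hyperparameters are assumed values.

```python
import random

# Mini-batch gradient descent: one parameter update per batch of samples.
def minibatch_fit(data, batch_size=2, lr=0.05, epochs=400, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # gradient averaged over the batch -> less noisy than one sample
            gw = sum(-2 * x * (y - (w * x + b)) for x, y in batch) / len(batch)
            gb = sum(-2 * (y - (w * x + b)) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b
```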
4. Gradient Descent with Momentum:

The gradient steps are smoothed with an exponentially weighted average:

1. Initialize: weighted_average_0 = 0
2. Compute:
weighted_average_t = β × weighted_average_(t−1) + (1 − β) × value_t
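A minimal sketch (not from the slides) of that recursion: the exponentially weighted average of a sequence of values (β = 0.9 is an assumed choice).

```python
# Exponentially weighted average, the smoothing step used by GD with momentum.
def weighted_averages(values, beta=0.9):
    wa, out = 0.0, []                        # initialized to 0
    for v in values:
        wa = beta * wa + (1 - beta) * v      # the update from step 2
        out.append(wa)
    return out
```

On a constant input of 1.0 the average climbs gradually (0.1, 0.19, 0.271, …) toward 1, which is exactly the start-up bias that Adam later corrects for.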
5. AdaGrad (Adaptive Gradients):

AdaGrad is a technique to change the learning rate over time, where:

stepsize_param_t = ∇L(param_t) × learning_rate / (ε + √( Σ_(i=1..t) ∇L²(param_i) ))

and

∇L(param_t) = d(Loss function)/d(param_t)

With every new time step, a new gradient is added, which causes the denominator to increase and the step size to decrease.
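A minimal sketch (not from the slides) of the AdaGrad step size for a single parameter; the learning rate is an assumed value.

```python
import math

# AdaGrad: the accumulator of squared gradients only grows, so the step shrinks.
def adagrad_steps(grads, lr=0.5, eps=1e-8):
    acc, steps = 0.0, []
    for g in grads:
        acc += g ** 2                                  # running sum of ∇L²
        steps.append(g * lr / (eps + math.sqrt(acc)))  # step size at this update
    return steps
```

Even with a constant gradient of 1, successive steps are lr/√1, lr/√2, lr/√3, …: the denominator can only increase, so training eventually stalls; this is the weakness RMS Prop addresses next.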
6. RMS Prop (Root Mean Squared Propagation):

RMS Prop replaces AdaGrad's growing sum of squared gradients with an exponentially weighted average, where:

stepsize_param_t = ∇L(param_t) × learning_rate / (ε + √( weighted_average(∇L²(param_t)) ))

and

∇L(param_t) = d(Loss function)/d(param_t)
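A minimal sketch (not from the slides) of the RMS Prop step size for a single parameter; β = 0.9 and the learning rate are assumed choices.

```python
import math

# RMS Prop: AdaGrad's sum is replaced by a weighted average of squared gradients,
# so the denominator can also shrink again when gradients get smaller.
def rmsprop_steps(grads, lr=0.01, beta=0.9, eps=1e-8):
    wa, steps = 0.0, []
    for g in grads:
        wa = beta * wa + (1 - beta) * g ** 2           # weighted_average(∇L²)
        steps.append(g * lr / (eps + math.sqrt(wa)))
    return steps
```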
7. Adam Optimizer (Adaptive Moment Estimation):

Adam is another optimizer with adaptive learning rates for each parameter.
Adam:
- uses the exponentially weighted average of the past squared gradients (like RMS Prop)
- uses an exponentially weighted average of past gradients (like GD with momentum)

where:

stepsize_param_t = weighted_average(∇L(param_t)) × learning_rate / (ε + √( weighted_average(∇L²(param_t)) ))

and

∇L(param_t) = d(Loss function)/d(param_t)
Default values:
β_1 = 0.9
β_2 = 0.999
ε = 10⁻⁸

However, as the weighted averages are initialized to 0, they are biased towards 0 during the first iterations
⇒ bias-corrected moments are calculated and used:

m_t = β_1 × m_(t−1) + (1 − β_1) × g_t        m̂_t = m_t / (1 − β_1^t)
v_t = β_2 × v_(t−1) + (1 − β_2) × g_t²       v̂_t = v_t / (1 − β_2^t)

w_(t+1) = w_t − learning_rate × m̂_t / (√v̂_t + ε)

where t = current iteration. As t increases, β^t approaches 0, so the correction factor 1/(1 − β^t) approaches 1 and the correction fades out.

In general, for any weighted average V_t = β × V_(t−1) + (1 − β) × θ_t, the bias-corrected value is V̂_t = V_t / (1 − β^t).
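A minimal sketch (not from the slides) of a one-parameter Adam update with the default values from the slide and bias correction.

```python
import math

# One Adam update for a single parameter w, given the current gradient.
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    t += 1
    m = beta1 * m + (1 - beta1) * grad         # weighted average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # weighted average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, t
```

On the very first step with gradient 1, both corrected moments equal 1, so the update is almost exactly the learning rate; without the correction it would be far smaller.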
The Learning Mechanism: Backpropagation

Backpropagation of error:
o An algorithm to calculate the gradient of a loss function relative to the model parameters
o Those gradients are then used by the optimizer to update the model weights
o Gradients are calculated backwards through the network starting at the output layer, one layer at a time

At each neuron, two quantities are involved:
✦ Preactivation (the weighted sum of the neuron's inputs)
✦ Activation (the activation function applied to the preactivation)
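A minimal sketch (not from the slides) of backpropagation through a single neuron, with a sigmoid activation and a squared-error loss L = (a − y)² assumed for illustration: the gradient flows backwards from the loss through the activation, then the preactivation.

```python
import math

# Backward pass through one neuron: returns dL/dw and dL/db for the optimizer.
def backprop_one_neuron(x, y, w, b):
    z = w * x + b                       # preactivation
    a = 1 / (1 + math.exp(-z))          # activation
    dL_da = 2 * (a - y)                 # start at the output (the loss)
    da_dz = a * (1 - a)                 # sigmoid derivative
    dL_dz = dL_da * da_dz               # chain rule, one step back
    return dL_dz * x, dL_dz             # gradients w.r.t. weight and bias
```

In a full network the same chain-rule step is repeated layer by layer, reusing dL/dz from the layer above, which is why the gradients are computed backwards.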