Ann 3

Optimizers are algorithms that minimize loss functions by adjusting model parameters such as weights and biases. Gradient descent is a commonly used optimizer that takes iterative steps down the gradient of the loss function to find parameter values that produce low loss. It works by calculating the gradient of the loss with respect to each parameter, then taking a step in the opposite direction. The step size is determined by the learning rate parameter.


Optimizers

What are optimizers ?

Optimizers (optimization algorithms) minimize the loss function by finding the best possible values for the model parameters.

Model parameters correspond to the weights and biases of the model.


Optimizers

What are optimizers ?

Training a model is an iterative process.

It is essential for optimizers to be :

• Fast
• Accurate
Optimizers

We will cover the following optimizers :

Ø Gradient Descent
Ø Stochastic Gradient Descent
Ø Mini-Batch Gradient Descent
Ø Gradient Descent with Momentum
Ø AdaGrad
Ø RMS Prop
Ø Adam
Optimizers

1 - Gradient Descent :

[Figure : loss function plotted against weight w]

Calculating the loss for every possible value of w is an exhaustive and inefficient way to find the minimum loss !

A more efficient method is Gradient Descent.


Optimizers

1 - Gradient Descent :

o An optimization algorithm that finds the minimum value of a function by taking steps, starting from an initial value, until it reaches the best value

o It takes big steps towards the minimum when far from the optimum value, and smaller steps as it gets closer
Optimizers

1 - Gradient Descent :

Step 1 : We set a starting value for weight w (this value can be randomly chosen)

Step 2 : The Gradient Descent algorithm calculates the gradient of the loss function at the starting point.
The gradient tells which direction allows the loss function to decrease.
Optimizers

1 - Gradient Descent :

Step 3 : The algorithm takes a step in the negative gradient direction
(note : the gradient always points towards the direction of steepest increase → we use the negative gradient)

Step size = gradient magnitude * learning rate

New weight = old weight − step size

Optimizers

1 - Gradient Descent :

The Gradient Descent algorithm repeats these steps until :

• Step size gets very close to 0 (ex. min step size = 0.001 or smaller)

• Maximum number of steps is reached (ex. 1000 or more)
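To make the loop concrete, here is a minimal Python sketch of these steps for a single weight. The quadratic loss L(w) = (w − 3)², the function names, and the default values are illustrative assumptions, not taken from the slides :

def gradient_descent(grad_fn, w0, learning_rate=0.1, min_step=0.001, max_steps=1000):
    # Minimize a one-parameter loss, given a function that returns its gradient
    w = w0
    for _ in range(max_steps):                 # stop when max number of steps is reached
        step = grad_fn(w) * learning_rate      # step size = gradient * learning rate
        w = w - step                           # new weight = old weight - step size
        if abs(step) < min_step:               # stop when the step size gets very close to 0
            break
    return w

# Hypothetical loss L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_best = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_best)  # close to 3.0, the minimum of this loss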
Optimizers

1 - Gradient Descent :

Learning rate : A scalar that determines the step size

If the learning rate is very small → learning can take too long

If the learning rate is very large → the point will perpetually bounce across the curve minimum
Optimizers
1 - Gradient Descent :

Ideal learning rate : larger when the point is far from the minimum & smaller as it gets closer

Gradient Descent is very sensitive to the learning rate.

In practice, the learning rate can be adjusted automatically during training (it starts large and diminishes gradually).
This strategy is called a 'schedule'.
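As an illustration, a simple exponential-decay schedule could look like the sketch below ; the starting rate and decay factor are made-up values :

def lr_schedule(initial_lr=0.1, decay=0.99):
    # Yield a learning rate that starts large and diminishes gradually
    lr = initial_lr
    while True:
        yield lr
        lr *= decay

rates = lr_schedule()
print(next(rates), next(rates), next(rates))  # 0.1, 0.099, 0.09801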
Optimizers

1 - Gradient Descent : Example : How does Gradient Descent fit a line to data ?

[Figure : scatter plot of Height H against Shoe Size SS, with an unknown best-fit line]

Use Gradient Descent to estimate the optimal values for the slope and the intercept :

H_pred = slope * SS + intercept
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

Data points (Shoe Size, Height) : (0.6, 1.1), (0.8, 1.2), (1.1, 2.1)

Loss function = Sum of Squared Residuals (SSR)
= (H_true_1 − H_pred_1)² + (H_true_2 − H_pred_2)² + (H_true_3 − H_pred_3)²
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

Loss function = Sum of Squared Residuals (SSR)
= (H_true_1 − H_pred_1)² + (H_true_2 − H_pred_2)² + (H_true_3 − H_pred_3)²
= (1.1 − (slope * 0.6 + intercept))²
+ (1.2 − (slope * 0.8 + intercept))²
+ (2.1 − (slope * 1.1 + intercept))²
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

Loss function = Sum of Squared Residuals (SSR)
= (1.1 − (slope * 0.6 + intercept))²
+ (1.2 − (slope * 0.8 + intercept))²
+ (2.1 − (slope * 1.1 + intercept))²

We need two partial derivatives : d(Loss function)/d(intercept) and d(Loss function)/d(slope)
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

To calculate the gradient with respect to each of the parameters,

d(Loss function)/d(intercept) and d(Loss function)/d(slope),

we use the Chain Rule.
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

Let U_i = H_true_i − H_pred_i be the residuals, so that :

Loss function = Sum of Squared Residuals (SSR) = U_1² + U_2² + U_3²
U_1 = 1.1 − (slope * 0.6 + intercept)
U_2 = 1.2 − (slope * 0.8 + intercept)
U_3 = 2.1 − (slope * 1.1 + intercept)

By the Chain Rule :

d(Loss function)/d(intercept) = d(Loss function)/d(U) × d(U)/d(intercept)

d(Loss function)/d(slope) = d(Loss function)/d(U) × d(U)/d(slope)
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

d(Loss function)/d(intercept)
= d(Loss function)/d(U) × d(U)/d(intercept)
= 2U_1 × (−1) + 2U_2 × (−1) + 2U_3 × (−1)
= −2(1.1 − (0.6 * slope + intercept))
− 2(1.2 − (0.8 * slope + intercept))
− 2(2.1 − (1.1 * slope + intercept))
Optimizers

1 - Gradient Descent : H_pred = slope * SS + intercept

d(Loss function)/d(slope)
= d(Loss function)/d(U) × d(U)/d(slope)
= 2U_1 × (−0.6) + 2U_2 × (−0.8) + 2U_3 × (−1.1)
= −1.2(1.1 − (0.6 * slope + intercept))
− 1.6(1.2 − (0.8 * slope + intercept))
− 2.2(2.1 − (1.1 * slope + intercept))
Optimizers

1 - Gradient Descent :

d(Loss function)/d(intercept)
= −2(1.1 − (0.6 * slope + intercept))
− 2(1.2 − (0.8 * slope + intercept))
− 2(2.1 − (1.1 * slope + intercept))

d(Loss function)/d(slope)
= −1.2(1.1 − (0.6 * slope + intercept))
− 1.6(1.2 − (0.8 * slope + intercept))
− 2.2(2.1 − (1.1 * slope + intercept))
Optimizers

1 - Gradient Descent :

Note : The partial derivatives of a multivariate function are stored in a vector called the Gradient (∇)

∇ Loss Function = [ d(Loss function)/d(slope) , d(Loss function)/d(intercept) ]
Optimizers

1 - Gradient Descent :

∇ Loss Function = [ d(Loss function)/d(slope) , d(Loss function)/d(intercept) ]

We can now use this Gradient to descend to the minimal point in the cost function :

Step 1 : We choose a random intercept ( = 0) and a random slope ( = 1)

Step 2 : We plug the values into the partial derivative formulas and get 2 values

Step 3 : stepsize_slope = value_1 * learning rate
         stepsize_intercept = value_2 * learning rate

Step 4 : slope_new = slope_old − stepsize_slope
         intercept_new = intercept_old − stepsize_intercept
Optimizers

1 - Gradient Descent :

We can now use this Gradient to descend to the minimal point in the cost function (hence the name Gradient Descent) :

Repeat steps 2 to 4 until :

✓ the step sizes are very small

✓ the maximum number of steps is reached
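The whole procedure for this three-point example can be written as a short Python sketch ; the learning rate of 0.01 and the stopping thresholds are illustrative assumptions :

# Fit Height = slope * ShoeSize + intercept on the three example points
data = [(0.6, 1.1), (0.8, 1.2), (1.1, 2.1)]   # (Shoe Size, Height)

slope, intercept = 1.0, 0.0                    # Step 1 : starting values
learning_rate = 0.01                           # assumed value

for step in range(1000):                       # maximum number of steps
    # Step 2 : plug the current values into the partial derivative formulas
    d_intercept = sum(-2 * (h - (slope * ss + intercept)) for ss, h in data)
    d_slope = sum(-2 * ss * (h - (slope * ss + intercept)) for ss, h in data)

    # Step 3 : step sizes = derivative values * learning rate
    step_slope = d_slope * learning_rate
    step_intercept = d_intercept * learning_rate

    # Step 4 : update the parameters
    slope -= step_slope
    intercept -= step_intercept

    if max(abs(step_slope), abs(step_intercept)) < 0.001:  # step sizes very small
        break

print(slope, intercept)  # approaches the least-squares fit of the 3 points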


Optimizers

1 - Gradient Descent :

Note : In the case where we have more parameters to estimate (the weights and biases of our ANN), we simply use more derivatives ; the whole procedure remains exactly the same

Note : Gradient Descent can be used with any loss function
Optimizers

1 - Gradient Descent :

Conclusion : The data is entirely plugged into the Neural Network and the parameters are adjusted.

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000
Optimizers

2 - Stochastic Gradient Descent :

Gradient Descent considers the whole dataset for each parameter estimation step.

If we have thousands of parameters and millions of data points, repeating all the necessary steps to find the optimal parameters takes a huge amount of time.
Optimizers

2 - Stochastic Gradient Descent :

Stochastic Gradient Descent :

- uses 1 sample per step (calculation of the new adjusted parameters)
- reduces the time taken for derivative calculation
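Reusing the line-fit example from above, an SGD sketch might look like this ; the shuffling, epoch count, and learning rate are assumed details :

import random

data = [(0.6, 1.1), (0.8, 1.2), (1.1, 2.1)]   # (Shoe Size, Height)
slope, intercept, lr = 1.0, 0.0, 0.01          # assumed starting values

for epoch in range(200):
    random.shuffle(data)                       # visit the samples in random order
    for ss, h in data:                         # ONE sample per parameter update
        residual = h - (slope * ss + intercept)
        slope -= lr * (-2 * ss * residual)     # gradient from this sample only
        intercept -= lr * (-2 * residual)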
Optimizers

2 - Stochastic Gradient Descent :

The data is plugged into the Neural Network one sample at a time, and the parameters are adjusted after every sample.

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000
Optimizers

2 - Stochastic Gradient Descent :

Notice the low memory requirement compared to Gradient Descent :

Estimates concerning previous data do not have to be kept in memory.

We can store only the very last estimates, and use them when new data is added.
Optimizers

2 - Stochastic Gradient Descent :

Note :
• Sensitivity to the learning rate also applies to SGD
• The same 'schedule' strategy is used : starting with a large learning rate and reducing it gradually
Optimizers

3 - Mini-Batch Gradient Descent :

Mini-Batch Gradient Descent :

- uses a small subset of the data (a mini-batch) for each step
- gives more stable results than using one sample per step (Stochastic Gradient Descent)
- is faster than using the whole dataset (Gradient Descent)
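A sketch of the idea on the same toy line-fit data ; the batch size of 2 and the other values are arbitrary assumptions :

data = [(0.6, 1.1), (0.8, 1.2), (1.1, 2.1)]   # (Shoe Size, Height)
slope, intercept, lr, batch_size = 1.0, 0.0, 0.01, 2  # assumed values

for epoch in range(200):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]         # a small subset of the data
        d_slope = sum(-2 * ss * (h - (slope * ss + intercept)) for ss, h in batch)
        d_intercept = sum(-2 * (h - (slope * ss + intercept)) for ss, h in batch)
        slope -= lr * d_slope                  # one update per mini-batch
        intercept -= lr * d_intercept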


Optimizers

3 - Mini-Batch Gradient Descent :

The data is plugged into the Neural Network one mini-batch (a few rows) at a time :

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000
Etc.
Optimizers

4 - Gradient Descent with Momentum :

Gradient Descent might make a lot of steps and keep oscillating very slowly towards the minimum of the loss function.

Gradient Descent with Momentum :

- uses the exponentially weighted average of the gradients to update the weights
- smoothes out the steps of Gradient Descent (because it takes into consideration the average of the past gradients)
- is faster than regular Gradient Descent
Optimizers

4 - Gradient Descent with Momentum :

1. On each iteration, we compute the partial derivatives while plugging in the current slope and intercept values :

d(Loss function)/d(slope) → value_slope
d(Loss function)/d(intercept) → value_intercept

2. Compute :

weighted_average_slope = β * weighted_average_slope + (1 − β) * value_slope
weighted_average_intercept = β * weighted_average_intercept + (1 − β) * value_intercept

(β is a parameter that controls the weighted average ; common value = 0.9 ; initial weighted_averages = 0)

3. Update :

slope_new = slope_old − (learning_rate * weighted_average_slope)
intercept_new = intercept_old − (learning_rate * weighted_average_intercept)
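A Python sketch of these three steps on the toy data ; β = 0.9 matches the slide's common value, while the learning rate is an assumption :

data = [(0.6, 1.1), (0.8, 1.2), (1.1, 2.1)]   # (Shoe Size, Height)
slope, intercept, lr, beta = 1.0, 0.0, 0.01, 0.9
avg_slope, avg_intercept = 0.0, 0.0            # initial weighted_averages = 0

for step in range(1000):
    # 1. partial derivatives at the current slope / intercept
    value_slope = sum(-2 * ss * (h - (slope * ss + intercept)) for ss, h in data)
    value_intercept = sum(-2 * (h - (slope * ss + intercept)) for ss, h in data)

    # 2. exponentially weighted averages of the gradients
    avg_slope = beta * avg_slope + (1 - beta) * value_slope
    avg_intercept = beta * avg_intercept + (1 - beta) * value_intercept

    # 3. update using the smoothed gradients
    slope -= lr * avg_slope
    intercept -= lr * avg_intercept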


Optimizers

5 - AdaGrad (Adaptive Gradients) :

AdaGrad is a technique to change the learning rate over time.

parameter_new = parameter_old − stepsize

where :

stepsize_parameter_i = ∇L(parameter_i) * learning_rate / (ε + √( Σ_{t=1..i} ∇L_t²(parameter) ))

and

∇L(parameter_i) = d(Loss function)/d(parameter_i)

Notes :
• The sum runs over all the gradients from the first time step until the current one.
• With every new time step, a new gradient is added, which causes the denominator to increase and the step size to decrease.
• ε is a small value to avoid division by 0.
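A per-parameter sketch of the rule ; the toy gradient, ε = 1e-8, and the learning rate are assumptions :

import math

def grad(w):                                   # hypothetical loss L(w) = (w - 3)^2
    return 2 * (w - 3)

w, lr, eps = 1.0, 0.1, 1e-8                    # assumed values
grad_squared_sum = 0.0                         # accumulates ∇L² over all time steps

for t in range(100):
    g = grad(w)
    grad_squared_sum += g ** 2                 # the denominator only ever grows...
    w -= g * lr / (eps + math.sqrt(grad_squared_sum))  # ...so the step size shrinks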
Optimizers

6 - RMS Prop (Root Mean Squared Propagation) :

RMS Prop is very similar to AdaGrad.

However, instead of using the sum of the gradients, it uses the exponentially weighted average of the squared gradients.

Instead of being concerned about all of the gradients, we are more concerned about the most recent gradients.
Optimizers

6 - RMS Prop (Root Mean Squared Propagation) :

RMS Prop is very similar to AdaGrad :

AdaGrad : the learning rate decreases monotonically
RMS Prop : the learning rate can increase or decrease with every step
Optimizers

6 - RMS Prop (Root Mean Squared Propagation) :

parameter_new = parameter_old − stepsize

Default values : learning rate = 0.001, β = 0.9

where :

stepsize_parameter_i = ∇L(parameter_i) * learning_rate / (ε + √( weighted_average(∇L²(parameter)) ))

and

∇L(parameter_i) = d(Loss function)/d(parameter_i)
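A sketch using the slide's default values ; the toy gradient and ε are assumptions :

import math

def grad(w):                                   # hypothetical loss L(w) = (w - 3)^2
    return 2 * (w - 3)

w, lr, beta, eps = 1.0, 0.001, 0.9, 1e-8       # lr and beta from the slide, eps assumed
avg_sq_grad = 0.0                              # weighted average of squared gradients

for t in range(1000):
    g = grad(w)
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * g ** 2  # recent gradients dominate
    w -= g * lr / (eps + math.sqrt(avg_sq_grad))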
Optimizers

7 - Adam Optimizer (Adaptive Moment Estimation) :

Adam is another optimizer with adaptive learning rates for each parameter.

Adam :
- uses the exponentially weighted average of the past squared gradients (like RMS Prop)
- uses the exponentially weighted average of the past gradients (like GD with Momentum)
Optimizers

7 - Adam Optimizer (Adaptive Moment Estimation) :

parameter_new = parameter_old − stepsize

where :

stepsize_parameter_i = weighted_average(∇L(parameter_i)) * learning_rate / (ε + √( weighted_average(∇L²(parameter)) ))

and

∇L(parameter_i) = d(Loss function)/d(parameter_i)
Optimizers

7 - Adam Optimizer (Adaptive Moment Estimation) :

Default values : β1 = 0.9, β2 = 0.999, ε = 10⁻⁸

However, as the weighted averages are initialized to 0, they are biased towards 0 during the first iterations
→ bias-corrected moments are calculated and used :

m_t = β1 * m_{t−1} + (1 − β1) * g_t
m̂_t = m_t / (1 − β1^t)

v_t = β2 * v_{t−1} + (1 − β2) * g_t²
v̂_t = v_t / (1 − β2^t)

w_{t+1} = w_t − learning_rate * m̂_t / (√(v̂_t) + ε)
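Putting the pieces together in Python ; the slide's default values are used, and the toy gradient is an assumption :

import math

def grad(w):                                   # hypothetical loss L(w) = (w - 3)^2
    return 2 * (w - 3)

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # defaults from the slide
m, v, w = 0.0, 0.0, 1.0                          # moments initialized to 0

for t in range(1, 5001):                         # t starts at 1 for the bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g              # EWA of past gradients (like Momentum)
    v = beta2 * v + (1 - beta2) * g ** 2         # EWA of squared gradients (like RMS Prop)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)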
Optimizers

Bias Correction in Exponentially Weighted Average

V̂_t = V_t / (1 − β^t) where t = current iteration

As t increases, β^t approaches 0 → the correction has less influence

→ V̂_1 = 0.1 θ_1 / (1 − β)   (with β = 0.9, this equals θ_1, removing the initial bias towards 0)
Optimizers

How does the Exponentially Weighted Average give more weight to the most recent observations ?

V_t = β V_{t−1} + (1 − β) θ_t    (here with β = 0.9)

V_100 = 0.9 V_99 + 0.1 θ_100
V_99 = 0.9 V_98 + 0.1 θ_99
V_98 = 0.9 V_97 + 0.1 θ_98

→ V_100 = 0.1 θ_100 + 0.9 (0.1 θ_99 + 0.9 (0.1 θ_98 + 0.9 V_97))

→ V_100 = 0.1 θ_100 + 0.09 θ_99 + 0.081 θ_98 + 0.729 V_97
Optimizers

How does the Exponentially Weighted Average give more weight to the most recent observations ?

V_100 = 0.1 θ_100 + 0.09 θ_99 + 0.081 θ_98 + 0.729 V_97

Ø The current observation will always have the highest coefficient

Ø The coefficient diminishes gradually further backwards
Backpropagation

Backpropagation of error :

o An algorithm to calculate the gradient of a loss function relative to the model parameters

o Those gradients are then used by the optimizer to update the model weights

o Gradients are calculated backwards through the network, starting at the output layer, one layer at a time

Together, backpropagation and Stochastic Gradient Descent (or variants) can be used to train a neural network.
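For concreteness, a minimal numpy sketch of backpropagation through one hidden layer ; the architecture, toy data, and learning rate are all illustrative assumptions :

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                    # 4 samples, 3 features (toy data)
y = rng.normal(size=(4, 1))                    # toy targets

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # hidden layer parameters
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # output layer parameters
lr = 0.01

for step in range(100):
    # forward propagation : preactivation + activation at each layer
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    y_hat = a1 @ W2 + b2

    loss = np.mean((y - y_hat) ** 2)           # error measured by the loss function

    # backpropagation : gradients computed backwards, output layer first
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T
    d_z1 = d_a1 * (1 - a1 ** 2)                # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # the optimizer (plain Gradient Descent here) updates the weights
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2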
The Learning Mechanism
The Learning Mechanism

So, how does the network learn ?

Step 1 : Forward propagation:

At each neuron :

✦ Preactivation

✦ Activation
The Learning Mechanism

So, how does the network learn ?

Step 2 : Error calculation by the loss function :

The loss function measures the error between the true values Y and the predicted values Ŷ.
The Learning Mechanism

So, how does the network learn ?

Step 3 : Backpropagation and Optimization :

✦ Backpropagation calculates the gradients of the loss function with respect to the model weights

✦ The optimizer uses the gradients to find new values for the weights that reduce the loss
The Learning Mechanism

So, how does the network learn ?

Step 4 : Steps 1 to 3 are repeated until :

✦ The assigned number of iterations is reached

✦ The target loss value is reached
