ANN 3
Optimizers

Agenda:
• Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• Gradient Descent with Momentum
• AdaGrad
• RMS Prop
• Adam
1. Gradient Descent:

[Figure: the loss function plotted against the weight w, with a starting point marked.]

Step 1: We set a starting value for weight w (this value can be randomly chosen).
Step 2: We take steps in the direction that makes the loss decrease.
Stopping condition:
• The step size gets very close to 0 (e.g. min step size = 0.001 or smaller)
Ideal learning rate: larger when the point is far from the minimum & smaller as it gets closer.
Example: How does Gradient Descent fit a line to data?

We want to predict Height H from Shoe Size SS with three observed points
(SS, H) = (0.6, 1.1), (0.8, 1.2), (1.1, 2.1), using the model:

H_pred = slope × SS + intercept

[Figure: scatter plot of Height H against Shoe Size SS with the candidate line.]

Loss function = (H_obs_1 − H_pred_1)² + (H_obs_2 − H_pred_2)² + (H_obs_3 − H_pred_3)²

To descend, we need both partial derivatives:
d(Loss function)/d(slope) and d(Loss function)/d(intercept)
Loss function = Sum of Squared Residuals (SSR)
  = (1.1 − (slope × 0.6 + intercept))²
  + (1.2 − (slope × 0.8 + intercept))²
  + (2.1 − (slope × 1.1 + intercept))²

Writing each squared residual as U_i², with U_i = H_obs_i − (slope × SS_i + intercept), the chain rule gives:

d(Loss function)/d(intercept) = d(Loss function)/d(U) × d(U)/d(intercept)
  = 2U_1 × (−1) + 2U_2 × (−1) + 2U_3 × (−1)
  = −2 × (1.1 − (0.6 × slope + intercept))
  − 2 × (1.2 − (0.8 × slope + intercept))
  − 2 × (2.1 − (1.1 × slope + intercept))

d(Loss function)/d(slope) = d(Loss function)/d(U) × d(U)/d(slope)
  = 2U_1 × (−0.6) + 2U_2 × (−0.8) + 2U_3 × (−1.1)
Stacking both derivatives gives the gradient:

∇ Loss Function = ( d(Loss function)/d(slope) , d(Loss function)/d(intercept) )

We can now use this Gradient to descend to the minimal point in the cost function (hence the name Gradient Descent).
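The whole procedure can be sketched in code. This is a minimal illustration (not from the slides): plain gradient descent fitting H_pred = slope × SS + intercept to the three example points, using the two partial derivatives derived above; the learning rate, step count, and minimum step size are assumed values.

```python
# Gradient descent on the line-fitting example (illustrative sketch).
def fit_line(lr=0.1, steps=5000, min_step=1e-7):
    data = [(0.6, 1.1), (0.8, 1.2), (1.1, 2.1)]   # (Shoe Size, Height)
    slope, intercept = 0.0, 0.0                    # Step 1: starting values
    for _ in range(steps):
        # d(Loss)/d(slope) and d(Loss)/d(intercept) summed over all samples
        d_slope = sum(-2 * x * (y - (slope * x + intercept)) for x, y in data)
        d_int = sum(-2 * (y - (slope * x + intercept)) for x, y in data)
        slope -= lr * d_slope
        intercept -= lr * d_int
        # stop when the step size gets very close to 0
        if max(abs(lr * d_slope), abs(lr * d_int)) < min_step:
            break
    return slope, intercept

slope, intercept = fit_line()   # converges near slope ≈ 2.08, intercept ≈ −0.27
```

The stopping test mirrors the slide's criterion: the loop ends once both parameter steps are nearly zero.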
Conclusion:

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000

The data is entirely plugged into the Neural Network and the parameters are adjusted.
2. Stochastic Gradient Descent:

Area | Number of rooms | Distance from train station | City       | Price (label)
32   | 3               | 5                           | Amiens     | 60 000
25   | 1               | 8                           | Lille      | 70 000
80   | 4               | 10                          | Versailles | 600 000
55   | 3               | 10                          | Nice       | 450 000

The data is plugged into the Neural Network one sample at a time and the parameters are adjusted after every sample.
Note:
• Sensitivity to the learning rate also applies to SGD
• The same 'schedule' strategy is used: starting with large values and reducing them gradually
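A minimal sketch (not from the slides) of the per-sample update, on the same linear model y = w·x + b used in the earlier example; the learning rate and epoch count are illustrative assumptions.

```python
import random

# Stochastic gradient descent: parameters are adjusted after every single sample.
def sgd_fit(data, lr=0.05, epochs=300, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                  # visit the samples in random order
        for x, y in samples:
            residual = y - (w * x + b)
            w -= lr * (-2 * x * residual)     # gradient from this one sample only
            b -= lr * (-2 * residual)
    return w, b
```

With a constant learning rate the parameters keep fluctuating around the minimum, which is why the schedule strategy above (shrinking the rate over time) matters.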
3. Mini-Batch Gradient Descent:

The data is plugged into the Neural Network in small batches and the parameters are adjusted after every batch. This:
- gives more stable results than using one sample per step (Stochastic Gradient Descent)
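A minimal sketch (not from the slides): the same fit, but each update averages the gradient over a small batch; the batch size of 2 and the other hyperparameters are assumed values.

```python
import random

# Mini-batch gradient descent: one parameter update per batch of samples.
def minibatch_fit(data, batch_size=2, lr=0.05, epochs=400, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # gradient averaged over the batch -> less noisy than one sample
            gw = sum(-2 * x * (y - (w * x + b)) for x, y in batch) / len(batch)
            gb = sum(-2 * (y - (w * x + b)) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b
```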
4. Gradient Descent with Momentum:

The gradient steps are smoothed with an exponentially weighted average:

1. Initialize: weighted_average_0 = 0
2. Compute:
weighted_average_t = β × weighted_average_(t−1) + (1 − β) × value_t
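A minimal sketch (not from the slides) of that recursion: the exponentially weighted average of a sequence of values (β = 0.9 is an assumed choice).

```python
# Exponentially weighted average, the smoothing step used by GD with momentum.
def weighted_averages(values, beta=0.9):
    wa, out = 0.0, []                        # initialized to 0
    for v in values:
        wa = beta * wa + (1 - beta) * v      # the update from step 2
        out.append(wa)
    return out
```

On a constant input of 1.0 the average climbs gradually (0.1, 0.19, 0.271, …) toward 1, which is exactly the start-up bias that Adam later corrects for.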
5. AdaGrad (Adaptive Gradients):

AdaGrad is a technique to change the learning rate over time, where:

stepsize_param_t = ∇L(param_t) × learning_rate / (ε + √( Σ_(i=1..t) ∇L²(param_i) ))

and

∇L(param_t) = d(Loss function)/d(param_t)

With every new time step, a new gradient is added, which causes the denominator to increase and the step size to decrease.
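A minimal sketch (not from the slides) of the AdaGrad step size for a single parameter; the learning rate is an assumed value.

```python
import math

# AdaGrad: the accumulator of squared gradients only grows, so the step shrinks.
def adagrad_steps(grads, lr=0.5, eps=1e-8):
    acc, steps = 0.0, []
    for g in grads:
        acc += g ** 2                                  # running sum of ∇L²
        steps.append(g * lr / (eps + math.sqrt(acc)))  # step size at this update
    return steps
```

Even with a constant gradient of 1, successive steps are lr/√1, lr/√2, lr/√3, …: the denominator can only increase, so training eventually stalls; this is the weakness RMS Prop addresses next.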
6. RMS Prop (Root Mean Squared Propagation):

RMS Prop replaces AdaGrad's growing sum of squared gradients with an exponentially weighted average, where:

stepsize_param_t = ∇L(param_t) × learning_rate / (ε + √( weighted_average(∇L²(param_t)) ))

and

∇L(param_t) = d(Loss function)/d(param_t)
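A minimal sketch (not from the slides) of the RMS Prop step size for a single parameter; β = 0.9 and the learning rate are assumed choices.

```python
import math

# RMS Prop: AdaGrad's sum is replaced by a weighted average of squared gradients,
# so the denominator can also shrink again when gradients get smaller.
def rmsprop_steps(grads, lr=0.01, beta=0.9, eps=1e-8):
    wa, steps = 0.0, []
    for g in grads:
        wa = beta * wa + (1 - beta) * g ** 2           # weighted_average(∇L²)
        steps.append(g * lr / (eps + math.sqrt(wa)))
    return steps
```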
7. Adam Optimizer (Adaptive Moment Estimation):

Adam is another optimizer with adaptive learning rates for each parameter.
Adam:
- uses the exponentially weighted average of the past squared gradients (like RMS Prop)
- uses an exponentially weighted average of past gradients (like GD with momentum)

where:

stepsize_param_t = weighted_average(∇L(param_t)) × learning_rate / (ε + √( weighted_average(∇L²(param_t)) ))

and

∇L(param_t) = d(Loss function)/d(param_t)
Default values:
β_1 = 0.9
β_2 = 0.999
ε = 10⁻⁸

However, as the weighted averages are initialized to 0, they are biased towards 0 during the first iterations
⇒ bias-corrected moments are calculated and used:

m_t = β_1 × m_(t−1) + (1 − β_1) × g_t        m̂_t = m_t / (1 − β_1^t)
v_t = β_2 × v_(t−1) + (1 − β_2) × g_t²       v̂_t = v_t / (1 − β_2^t)

w_(t+1) = w_t − learning_rate × m̂_t / (√v̂_t + ε)

where t = current iteration. As t increases, β^t approaches 0, so the correction factor 1/(1 − β^t) approaches 1 and the correction fades out.

In general, for any weighted average V_t = β × V_(t−1) + (1 − β) × θ_t, the bias-corrected value is V̂_t = V_t / (1 − β^t).
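A minimal sketch (not from the slides) of a one-parameter Adam update with the default values from the slide and bias correction.

```python
import math

# One Adam update for a single parameter w, given the current gradient.
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    t += 1
    m = beta1 * m + (1 - beta1) * grad         # weighted average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # weighted average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, t
```

On the very first step with gradient 1, both corrected moments equal 1, so the update is almost exactly the learning rate; without the correction it would be far smaller.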
The Learning Mechanism: Backpropagation

Backpropagation of error:
o An algorithm to calculate the gradient of a loss function relative to the model parameters
o Those gradients are then used by the optimizer to update the model weights
o Gradients are calculated backwards through the network starting at the output layer, one layer at a time

At each neuron, two quantities are involved:
✦ Preactivation (the weighted sum of the neuron's inputs)
✦ Activation (the activation function applied to the preactivation)
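A minimal sketch (not from the slides) of backpropagation through a single neuron, with a sigmoid activation and a squared-error loss L = (a − y)² assumed for illustration: the gradient flows backwards from the loss through the activation, then the preactivation.

```python
import math

# Backward pass through one neuron: returns dL/dw and dL/db for the optimizer.
def backprop_one_neuron(x, y, w, b):
    z = w * x + b                       # preactivation
    a = 1 / (1 + math.exp(-z))          # activation
    dL_da = 2 * (a - y)                 # start at the output (the loss)
    da_dz = a * (1 - a)                 # sigmoid derivative
    dL_dz = dL_da * da_dz               # chain rule, one step back
    return dL_dz * x, dL_dz             # gradients w.r.t. weight and bias
```

In a full network the same chain-rule step is repeated layer by layer, reusing dL/dz from the layer above, which is why the gradients are computed backwards.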