Lec 7 Optimization Part 2
Optimization
Intro to Deep Learning, Spring 2024
1
Recap
• Neural networks are universal approximators
• We must train them to approximate any
function
• Networks are trained to minimize total “error”
on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation
2
Recap
• Vanilla gradient descent may be too slow or unstable
4
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations
5
The training formulation
(Figure: training samples plotted as output (y) versus input (X))
7
Gradient descent
8-12
Effect of number of samples
14
Poll 1
15
Alternative: Incremental update
16-20
Incremental Update
• Given training instances (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– For all t = 1 … T
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t)^T
• Until the loss has converged
22
Incremental Update
• Given training instances (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do: (over multiple epochs)
– For all t = 1 … T (one epoch)
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t)^T (one update)
• Until the loss has converged
24
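A minimal NumPy-style sketch of this per-instance update loop (not the course's reference code); forward and backward are hypothetical helpers that return the network output and the per-layer gradients of the divergence:

def sgd_epoch(W, data, eta, forward, backward):
    # W: list of per-layer weight arrays; data: list of (x, d) training pairs
    # forward(W, x) -> network output; backward(W, x, y, d) -> per-layer gradients (assumed helpers)
    for x, d in data:                     # for all t
        y = forward(W, x)
        grads = backward(W, x, y, d)      # gradient of Div(y, d) for every layer
        for k in range(len(W)):
            W[k] = W[k] - eta * grads[k]  # one weight update per training instance
    return W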
Caveats: order of presentation
25-35
The expected behavior of the gradient
Batch: ∇_W Loss(W) = (1/N) Σ_i ∇_W Div(f(X_i; W), d_i)
SGD: E[∇_W Div(f(X_i; W), d_i)] = (1/N) Σ_i ∇_W Div(f(X_i; W), d_i) = ∇_W Loss(W)
• The expected value of the per-instance (SGD) gradient is the batch gradient
42
Incremental learning runs the risk of always chasing the latest input
(Figure: after each single-instance update, the fitted function swings toward the newest (X, y) point)
• Incremental learning: Update the model to always minimize the error on the latest
instance
– Shrink the learning rate with iterations
– With increasing iterations, it will swing less and less towards the new point
– Eventually arriving at the correct solution and not moving much further from it, because the
step sizes are now too small…
54
Incremental learning caveat: learning rate
(Figure: with a fixed learning rate the estimate keeps swinging toward each new point; with a shrinking learning rate it settles)
• For the updates to converge, the learning rate must shrink across iterations such that
Σ_t η_t = ∞ and Σ_t η_t² < ∞
SGD convergence
• For strongly convex loss functions:
– Strongly convex ⇒ can be placed inside a quadratic bowl, touching at any point
– Giving us the iterations to ε convergence as O(1/ε)
• For generically convex (but not strongly convex) functions, various proofs
report an ε convergence of O(1/ε²) using a learning rate of 1/√t.
59
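For example, the schedule η_t = η_0 / t satisfies both conditions, since Σ 1/t diverges while Σ 1/t² converges; a small numerical check with illustrative values:

import numpy as np

eta0 = 0.1
t = np.arange(1, 100001)
eta = eta0 / t              # shrinking learning-rate schedule eta_t = eta0 / t
print(eta.sum())            # partial sums keep growing (the series diverges)
print((eta ** 2).sum())     # approaches eta0**2 * pi**2 / 6 (the series converges)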
Batch gradient convergence
• In contrast, using the batch update method, for strongly
convex functions the number of iterations to ε convergence is O(log(1/ε))
61
SGD Convergence
• We can bound the expected difference between the
loss over our data using the optimal weights and
the loss using the weights at any single iteration k to O(1/k) for
strongly convex loss or O(log(k)/√k) for convex loss
62
SGD Convergence and weight
averaging
Polynomial Decay Averaging: W̄^(k) = (1 − (λ+1)/(k+λ)) W̄^(k−1) + ((λ+1)/(k+λ)) W^(k),
a running average of the iterates that emphasizes more recent weights (λ is a small constant, e.g. 3)
63
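A sketch of polynomial-decay averaging of the SGD iterates, following the standard scheme above with a small constant λ; W and W_avg are assumed to be weight arrays of the same shape:

def poly_decay_average(W_avg, W, k, lam=3.0):
    # weight given to the newest iterate; decays roughly like (lam + 1) / k
    alpha = (lam + 1.0) / (k + lam)
    return (1.0 - alpha) * W_avg + alpha * W

# usage: after SGD step k produces weights W, keep W_avg = poly_decay_average(W_avg, W, k)
# and evaluate / report W_avg rather than the raw iterate W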
SGD example
65
Poll 2
66
Recall: Modelling a function
67
Recall: The Empirical risk
(Figure: a target function g(X) sampled at training points (X_i, d_i))
Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
Ŵ = argmin_W Loss(W)
• The expected value of the empirical risk is actually the expected divergence
E[Loss(W)] = E[div(f(X; W), g(X))]
68
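A small illustration (not from the slides) of computing the empirical risk with a squared-error divergence; the linear "model" f and the data below are hypothetical:

import numpy as np

def empirical_risk(f, W, X, d):
    # Loss(W) = (1/N) * sum_i div(f(X_i; W), d_i), with squared error as the divergence
    preds = np.array([f(x, W) for x in X])
    return np.mean((preds - d) ** 2)

f = lambda x, W: W @ x                              # toy "network": a linear map
W = np.array([1.0, -2.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three training inputs
d = np.array([1.0, -2.0, 0.0])                      # their target outputs
print(empirical_risk(f, W, X, d))                   # average divergence over the training set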
Recall: The Empirical risk
(Figure: a target function g(X) sampled at training points (X_i, d_i))
Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
Ŵ = argmin_W Loss(W)
The empirical risk is an unbiased estimate of the expected divergence,
though there is no guarantee that minimizing it will minimize the
expected divergence
• The expected value of the empirical risk is actually the expected divergence
E[Loss(W)] = E[div(f(X; W), g(X))]
69
SGD
(Figure: individual training instances (X_i, d_i) drawn one at a time)
• The sample divergence is also an unbiased estimate of the expected error
• The variance of the sample divergence is the variance of the divergence itself:
var(div); this is N times the variance of the empirical average minimized by
batch update
74
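A small simulation (illustrative, not from the slides) of this variance claim: a single-sample divergence and the N-sample average are both unbiased, but the single-sample estimate has roughly N times the variance:

import numpy as np

rng = np.random.default_rng(0)
N = 100
divs = rng.exponential(1.0, size=(100000, N))   # simulated per-sample divergences

single = divs[:, 0]                             # SGD-style estimate: one random sample
batch = divs.mean(axis=1)                       # batch estimate: average over N samples

print(single.mean(), batch.mean())              # both close to the true expectation (1.0)
print(single.var() / batch.var())               # ratio close to N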
Explaining the variance
83
Alternative: Mini-batch update
84-87
Incremental Update: Mini-batch
update
• Given training instances (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– ∆W_k = 0
• For t′ = t : t + b − 1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t′}, d_{t′})
» ∆W_k = ∆W_k + (1/b) ∇_{W_k} Div(Y_{t′}, d_{t′})
• Update
– For every layer k:
W_k = W_k − η_j ∆W_k^T
• Until the loss has converged
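A minimal NumPy sketch of this mini-batch loop (not the course's reference code); forward and backward are hypothetical helpers returning the network output and the per-layer gradients of the divergence:

import numpy as np

def minibatch_epoch(W, X, d, b, eta, forward, backward, seed=0):
    # W: list of per-layer weight arrays; X, d: arrays of inputs and targets; b: batch size
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # randomly permute the training data
    for t in range(0, len(X), b):
        idx = order[t:t + b]
        grads = [np.zeros_like(w) for w in W]    # delta W_k = 0 for every layer
        for i in idx:                            # accumulate gradients over the batch
            y = forward(W, X[i])
            g = backward(W, X[i], y, d[i])
            for k in range(len(W)):
                grads[k] += g[k] / len(idx)      # average gradient over the batch
        for k in range(len(W)):
            W[k] = W[k] - eta * grads[k]         # one update per mini-batch
    return W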
Mini Batches
(Figure: the training data divided into mini-batches of b instances (X_i, d_i))
• Mini-batch updates compute and minimize a batch loss over b instances
• The expected value of the batch loss is also the expected divergence
90-91
Mini Batches
(Figure: the training data divided into mini-batches of b instances (X_i, d_i))
• Mini-batch updates compute and minimize a batch loss
• The expected value of the batch loss is also the expected divergence
• The minibatch loss is also an unbiased estimate of the expected error
• The variance of the minibatch loss is var(BatchLoss) = (1/b) var(div); this will be
much smaller than the variance of the sample error in SGD
92
Minibatch convergence
• For convex functions, the convergence rate for SGD is O(1/√T).
• For mini-batch updates with batches of size b, the
convergence rate is O(1/√(bT) + 1/T)
– Apparently an improvement of √b over SGD
– But since the batch size is b, we perform b times as many
computations per iteration as SGD
– So per computation, we actually get a degradation of √b
• However, in practice
– The objectives are generally not convex; mini-batches are more
effective with the right learning rates
– We also get additional benefits of vector processing
93
SGD example
97
Poll 3
98
Story so far
• SGD: Presenting training instances one-at-a-time can be more effective
than full-batch training
– Provided the instances are presented in random order
• For SGD to converge, the learning rate must shrink sufficiently rapidly with
iterations
– Otherwise the learning will continuously “chase” the latest sample
99
Training and minibatches
• Convergence depends on learning rate
– Simple technique: fix learning rate until the error
plateaus, then reduce learning rate by a fixed
factor (e.g. 10)
– Advanced methods: Adaptive updates, where the
learning rate is itself determined as part of the
estimation
100
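A sketch of the simple "reduce when the error plateaus" rule described above; the plateau test, patience, and factor of 10 are illustrative choices, not values prescribed by the slides:

def adjust_learning_rate(eta, val_errors, patience=3, factor=10.0, tol=1e-4):
    # val_errors: history of validation errors, one entry per epoch
    if len(val_errors) > patience:
        recent_best = min(val_errors[-patience:])
        earlier_best = min(val_errors[:-patience])
        if recent_best > earlier_best - tol:   # no meaningful improvement: plateau
            eta = eta / factor                 # reduce the learning rate by a fixed factor
    return eta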
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations
101
Recall: Momentum Update
Plain gradient update: W^(k) = W^(k−1) − η ∇_W Loss(W^(k−1))^T
With momentum: ΔW^(k) = β ΔW^(k−1) − η ∇_W Loss(W^(k−1))^T,   W^(k) = W^(k−1) + ΔW^(k)
103
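A side-by-side sketch of the two rules for a single weight array W, with g standing for ∇_W Loss (names are illustrative):

def plain_step(W, g, eta):
    # W(k) = W(k-1) - eta * gradient
    return W - eta * g

def momentum_step(W, dW, g, eta, beta=0.9):
    # dW(k) = beta * dW(k-1) - eta * gradient ;  W(k) = W(k-1) + dW(k)
    dW = beta * dW - eta * g
    return W + dW, dW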
Recall: Momentum Update
107
Nesterov's Accelerated Gradient
• Nesterov's method
112
Nesterov's Accelerated Gradient
• The momentum method, computed over the SGD instance or minibatch loss:
» ∇_{W_k} Loss += ∇_{W_k} Div(Y_t, d_t)
• Update
– For every layer k:
ΔW_k = β ΔW_k − η ∇_{W_k} Loss
W_k = W_k + ΔW_k
• Until the loss has converged
115
Nesterov's Accelerated Gradient
• Nesterov's method, computed over the SGD instance or minibatch loss: first step
along the previous update direction, then compute the gradient there
W̃^(k) = W^(k−1) + β ΔW^(k−1)
ΔW^(k) = β ΔW^(k−1) − η ∇_W Loss(W̃^(k))^T
W^(k) = W^(k−1) + ΔW^(k)
117
Nesterov: Mini-batch update
• Given training instances (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– W_k = W_k + β ΔW_k
– ∇_{W_k} Loss = 0
• For t′ = t : t + b − 1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t′}, d_{t′})
» ∇_{W_k} Loss += (1/b) ∇_{W_k} Div(Y_{t′}, d_{t′})
• Update
– For every layer k:
W_k = W_k − η_j ∇_{W_k} Loss^T
ΔW_k = β ΔW_k − η_j ∇_{W_k} Loss^T
• Until the loss has converged
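A sketch of a single Nesterov step for one layer, assuming a hypothetical grad_loss(W) that returns the mini-batch gradient evaluated at the supplied weights:

def nesterov_step(W, dW, grad_loss, eta, beta=0.9):
    W_ahead = W + beta * dW          # first move along the previous update direction
    g = grad_loss(W_ahead)           # gradient of the batch loss at the look-ahead point
    W_new = W_ahead - eta * g        # step from the look-ahead point
    dW_new = W_new - W               # equals beta*dW - eta*g, the next momentum term
    return W_new, dW_new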
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm
• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale learning rate of the parameter by the inverse of the root
mean squared derivative
125
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm
• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale learning rate of the parameter by the inverse of the root
mean squared derivative
• E[(∂_w D)²]_k = γ E[(∂_w D)²]_{k−1} + (1 − γ) (∂_w D)_k²
• w_{k+1} = w_k − (η / √(E[(∂_w D)²]_k + ε)) (∂_w D)_k
• Until loss has converged
127
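A sketch of the RMSProp update for one parameter tensor; the γ and ε defaults below are commonly used values, not values mandated by the slide:

import numpy as np

def rmsprop_step(W, g, mean_sq, eta=0.001, gamma=0.9, eps=1e-8):
    # running estimate of the mean squared derivative for each parameter
    mean_sq = gamma * mean_sq + (1.0 - gamma) * g ** 2
    # scale each parameter's step by the inverse root-mean-squared derivative
    W = W - eta * g / (np.sqrt(mean_sq) + eps)
    return W, mean_sq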
All the terms in gradient descent
• Standard gradient descent rule: W^(k) = W^(k−1) − η ∇_W Loss(W^(k−1))^T
• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each
parameter
– Learning rate is proportional to the inverse of the root mean squared
derivative
130
ADAM: RMSprop with momentum
• RMS prop only adapts the learning rate
• Momentum only smooths the gradient
• ADAM combines the two
• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each parameter
– Learning rate is proportional to the inverse of the root mean squared derivative
– Bias-correction terms ensure that the zero-initialized mean and mean-squared
estimates do not dominate in early iterations
131
ADAM: RMSprop with momentum
• RMS prop only considers a second-moment-normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient:
m_k = δ m_{k−1} + (1 − δ) (∂_w D)_k
v_k = γ v_{k−1} + (1 − γ) (∂_w D)_k²
m̂_k = m_k / (1 − δ^k),  v̂_k = v_k / (1 − γ^k)
w_{k+1} = w_k − (η / (√(v̂_k) + ε)) m̂_k
• Typically m_0 is 0 and δ is close to 1, so m_k stays close to 0 for a long time;
without the (1 − δ^k) denominator term this would result in minimal parameter updates
early on
• The denominator term ensures that updates actually happen; for large k, the
denominator just becomes 1
• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each parameter
– Learning rate is proportional to the inverse of the root mean squared derivative
132
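A sketch of one ADAM step for a single parameter tensor, using the bias-corrected first and second moments; the default constants below (δ = 0.9, γ = 0.999, ε = 10⁻⁸) are the commonly used ones, taken as assumptions here:

import numpy as np

def adam_step(W, g, m, v, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    # k is the 1-based step count; m, v are the running first/second moment estimates
    m = delta * m + (1.0 - delta) * g          # smoothed (momentum-like) gradient
    v = gamma * v + (1.0 - gamma) * g ** 2     # smoothed squared gradient
    m_hat = m / (1.0 - delta ** k)             # bias corrections: keep the zero-initialized
    v_hat = v / (1.0 - gamma ** k)             # estimates from suppressing early updates
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v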
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But come with other hyperparameters to be optimized
– Typical values:
• RMSProp: γ = 0.9, ε = 10⁻⁸
• ADAM: δ = 0.9, γ = 0.999, ε = 10⁻⁸
133
Poll 4 (@418)
134
Poll 4
135
Visualizing the optimizers: Beale’s Function
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
136
Visualizing the optimizers: Long Valley
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
137
Visualizing the optimizers: Saddle Point
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
138
Story so far
• Gradient descent can be sped up by incremental
updates
– Convergence is guaranteed under most conditions
• Learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each
observation. Can be much faster than batch learning
– Mini-batch updates: update after batches. Can be more
efficient than SGD
139