Optimization
Intro to Deep Learning, Fall 2020
Recap
• Neural networks are universal approximators
• We must train them to approximate any function
• Networks are trained to minimize total “error” on a training set
  – We do so through empirical risk minimization
• We use variants of gradient descent to do so
  – Gradients are computed through backpropagation
The training formulation
output (y)
Input (X)
22
Gradient descent
Effect of number of samples
Alternative: Incremental update
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do: (over multiple epochs)
  – For all t = 1:T (one epoch)
    • For every layer k:
      – Compute ∇_{W_k} Div(Y_t, d_t)
      – Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ (one update)
• Until the error has converged
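Below is a minimal numpy sketch of this per-instance loop for a single linear layer with a squared-error divergence; the toy data, layer shape, and learning rate are illustrative assumptions, not from the slides.

import numpy as np

# Toy training set: T pairs (X_t, d_t) for a 5-input, 1-output linear layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))

W = np.zeros((5, 1))                       # single layer: Y = X W
eta = 0.01                                 # learning rate

for epoch in range(20):                    # Do: over multiple epochs
    for t in range(X.shape[0]):            # For all t = 1:T (one epoch)
        x_t, d_t = X[t:t+1], d[t:t+1]
        y_t = x_t @ W                      # forward pass
        grad = x_t.T @ (y_t - d_t)         # ∇_W Div for Div = ½‖y − d‖²
        W -= eta * grad                    # one update per training instance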
Caveats: order of presentation
Batch SGD
[Figure: a network mapping input (X) to output (y)]
• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
  – Correcting the function for individual instances will lead to never-ending, non-convergent updates
  – We must shrink the learning rate with iterations to prevent this
• Correction for individual instances with the eventually minuscule learning rates will not modify the function
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
  – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
  – For all t = 1:T
    • j = j + 1
    • For every layer k:
      – Compute ∇_{W_k} Div(Y_t, d_t)
      – Update W_k = W_k − η_j ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
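A minimal numpy sketch of this SGD loop, with a fresh random permutation each epoch and a step size η_j that shrinks with the update count; the 1/j decay schedule and the toy linear model are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))
W = np.zeros((5, 1))

eta0, j = 0.05, 0                          # base step size and update counter
for epoch in range(20):
    order = rng.permutation(X.shape[0])    # randomly permute the training set
    for t in order:
        j += 1
        eta_j = eta0 / j                   # shrinking step size (assumed 1/j schedule)
        x_t, d_t = X[t:t+1], d[t:t+1]
        grad = x_t.T @ (x_t @ W - d_t)     # ∇_W Div for squared-error divergence
        W -= eta_j * grad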
Explaining the variance
Alternative: Mini-batch update
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
  – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
  – For t = 1:b:T (b is the mini-batch size)
    • j = j + 1
    • For every layer k:
      – ΔW_k = 0
    • For t′ = t : t+b−1
      – For every layer k:
        » Compute ∇_{W_k} Div(Y_{t′}, d_{t′})
        » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t′}, d_{t′})
    • Update
      – For every layer k:
        W_k = W_k − η_j ΔW_k (η_j: shrinking step size)
• Until the error has converged
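A minimal numpy sketch of this mini-batch loop: gradients are accumulated over b instances (ΔW), and the weights are updated once per mini-batch; the batch size, model, and 1/j step-size schedule are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))
W = np.zeros((5, 1))

b, eta0, j = 10, 0.05, 0                   # mini-batch size, base step size, update counter
for epoch in range(20):
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], b):  # For t = 1:b:T
        j += 1
        dW = np.zeros_like(W)              # ΔW = 0
        for t in order[start:start + b]:   # For t′ = t : t+b−1
            x_t, d_t = X[t:t+1], d[t:t+1]
            dW += x_t.T @ (x_t @ W - d_t)  # accumulate ∇_W Div
        W -= (eta0 / j) * dW / b           # one update per mini-batch (mean gradient used here)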
Mini Batches
[Figure: mini-batches drawn from training pairs (X_i, d_i)]
• The expected value of the batch loss is also the expected divergence
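To see why (assuming the b instances in a batch are drawn i.i.d. from the data distribution), linearity of expectation gives

  E[ (1/b) Σ_{t=1..b} Div(Y_t, d_t) ] = (1/b) Σ_{t=1..b} E[ Div(Y_t, d_t) ] = E[ Div(Y, d) ]

so the batch loss is an unbiased estimate of the expected divergence.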
SGD example
Momentum and incremental updates
• The momentum method, applied to the SGD instance or minibatch loss:
  » ∇_{W_k} Loss += ∇_{W_k} Div(Y_t, d_t)
• Update
  – For every layer k:
    ΔW_k = β ΔW_k − η (∇_{W_k} Loss)ᵀ
    W_k = W_k + ΔW_k
• Until the error has converged
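A minimal numpy sketch of momentum over mini-batches: the velocity term ΔW blends the previous update (weight β, assumed 0.9 here) with the current minibatch gradient; the toy model and batch size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))
W = np.zeros((5, 1))
dW = np.zeros_like(W)                      # momentum / velocity term ΔW

b, eta, beta = 10, 0.05, 0.9
for epoch in range(20):
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], b):
        grad = np.zeros_like(W)            # ∇_W Loss = 0
        for t in order[start:start + b]:
            x_t, d_t = X[t:t+1], d[t:t+1]
            grad += x_t.T @ (x_t @ W - d_t)   # ∇_W Loss += ∇_W Div
        dW = beta * dW - eta * grad / b    # ΔW = βΔW − η ∇_W Loss (mean gradient used here)
        W = W + dW                         # W = W + ΔW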
Nesterov’s Accelerated Gradient
• Nesterov’s method computes the SGD instance or minibatch loss gradient at the “looked-ahead” position:
  ΔW_k = β ΔW_k − η ∇_{W_k} Loss(W_k + β ΔW_k)
  W_k = W_k + ΔW_k
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0 for all k
• Do:
  – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
  – For t = 1:b:T
    • j = j + 1
    • For every layer k:
      – W_k = W_k + β ΔW_k
      – ∇_{W_k} Loss = 0
    • For t′ = t : t+b−1
      – For every layer k:
        » Compute ∇_{W_k} Div(Y_{t′}, d_{t′})
        » ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t′}, d_{t′})
    • Update
      – For every layer k:
        W_k = W_k − η_j (∇_{W_k} Loss)ᵀ
        ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)ᵀ
• Until the error has converged
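A minimal numpy sketch of one Nesterov step per the pseudocode above, written for a single weight matrix; grad_fn is an illustrative callable that returns the minibatch loss gradient at any given weights.

import numpy as np

def nesterov_step(W, dW, grad_fn, eta=0.05, beta=0.9):
    # One Nesterov update: look ahead, take the gradient there, then step.
    W_look = W + beta * dW                 # W_k = W_k + βΔW_k (look-ahead point)
    g = grad_fn(W_look)                    # ∇_W Loss evaluated at the look-ahead point
    W_new = W_look - eta * g               # W_k = W_k − η ∇_W Loss
    dW_new = beta * dW - eta * g           # ΔW_k = βΔW_k − η ∇_W Loss
    return W_new, dW_new

# Example with a toy quadratic loss ½‖XW − d‖² (illustrative data):
rng = np.random.default_rng(0)
X, d = rng.normal(size=(32, 5)), rng.normal(size=(32, 1))
W, dW = np.zeros((5, 1)), np.zeros((5, 1))
for _ in range(100):
    W, dW = nesterov_step(W, dW, lambda V: X.T @ (X @ V - d) / len(X))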
RMS Prop
• Procedure:
  – Maintain a running estimate of the mean squared value of derivatives for each parameter
  – Scale the update of the parameter by the inverse of the root mean squared derivative
RMS Prop (updates are for each weight of each layer)
• Do:
  – Randomly shuffle inputs to change their order
  – Initialize: k = 1; for all weights w in all layers, E[(∂_w D)²]_0 = 0
  – For all t (incrementing in blocks of B inputs)
    • For all weights in all layers initialize (∂_w D)_k = 0
    • For b = 0 : B−1
      – Compute
        » Output Y(X_{t+b})
        » Gradient dDiv(Y(X_{t+b}), d_{t+b}) / dw
        » (∂_w D)_k += dDiv(Y(X_{t+b}), d_{t+b}) / dw
    • Update for every weight w:
      E[(∂_w D)²]_k = γ E[(∂_w D)²]_{k−1} + (1 − γ) ((∂_w D)_k)²
      w = w − (η / √(E[(∂_w D)²]_k + ε)) (∂_w D)_k
    • k = k + 1
• Until loss has converged
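A minimal per-weight RMSprop step matching the procedure above: a running mean of squared derivatives scales each update; the values of γ, ε, and η are assumed typical defaults, not taken from the slides.

import numpy as np

def rmsprop_step(w, grad, mean_sq, eta=0.001, gamma=0.9, eps=1e-8):
    # mean_sq is the running estimate E[(∂D/∂w)²] for this parameter array.
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad**2   # update running mean square
    w = w - eta * grad / np.sqrt(mean_sq + eps)           # scale step by 1 / RMS derivative
    return w, mean_sq

# Usage: keep one mean_sq accumulator per parameter array across mini-batches.
w, mean_sq = np.zeros(5), np.zeros(5)
grad = np.array([0.1, -0.2, 0.05, 0.0, 0.3])              # an illustrative minibatch gradient
w, mean_sq = rmsprop_step(w, grad, mean_sq)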
ADAM: RMSprop with momentum
• RMSprop only considers a second-moment normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
  – Considers both first and second moments
• Procedure:
  – Maintain a running estimate of the mean derivative (m) for each parameter
  – Maintain a running estimate of the mean squared value of derivatives (v) for each parameter
  – Scale the update of the parameter by the inverse of the root mean squared derivative

    m̂_k = m_k / (1 − δᵏ),  v̂_k = v_k / (1 − γᵏ)
    w_{k+1} = w_k − (η / (√v̂_k + ε)) m̂_k

  – Dividing by (1 − δᵏ) and (1 − γᵏ) ensures that the δ and γ terms do not dominate in early iterations
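A minimal per-parameter Adam step consistent with the procedure above: running mean (m) and mean square (v) of the derivative, bias corrections 1/(1 − δᵏ) and 1/(1 − γᵏ), and an RMS-scaled update. The running-average recursions and the values of δ, γ, ε, η are the usual Adam defaults, stated here as assumptions.

import numpy as np

def adam_step(w, grad, m, v, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    # One Adam update at step k (k starts at 1); m, v are per-parameter accumulators.
    m = delta * m + (1.0 - delta) * grad              # running mean of the derivative
    v = gamma * v + (1.0 - gamma) * grad**2           # running mean of the squared derivative
    m_hat = m / (1.0 - delta**k)                      # bias-corrected first moment
    v_hat = v / (1.0 - gamma**k)                      # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)      # RMS-scaled, momentum-smoothed step
    return w, m, v

# Usage: keep m, v, and the step counter k alongside each parameter array.
w, m, v = np.zeros(5), np.zeros(5), np.zeros(5)
for k in range(1, 101):
    grad = 2.0 * (w - 1.0)                            # gradient of a toy loss ‖w − 1‖²
    w, m, v = adam_step(w, grad, m, v, k)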
Other variants of the same theme
• Many:
  – Adagrad
  – AdaDelta
  – AdaMax
  – …
• Generally no explicit learning rate to optimize
  – But they come with other hyperparameters to be optimized
  – Typical params:
    • RMSProp: γ (e.g., 0.9), ε (e.g., 10⁻⁸)
    • ADAM: δ (e.g., 0.9), γ (e.g., 0.999), ε (e.g., 10⁻⁸)
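For reference, a sketch of how such typical values map onto a widely used implementation (PyTorch’s torch.optim, used purely as an illustration; the linear model is a placeholder):

import torch

model = torch.nn.Linear(5, 1)                          # placeholder model

# RMSprop: alpha plays the role of γ (squared-derivative smoothing factor).
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)

# Adam: betas = (δ, γ), the first- and second-moment smoothing factors.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)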
Visualizing the optimizers: Beale’s Function, Long Valley, Saddle Point
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Story so far
• Gradient descent can be sped up by incremental updates
  – Convergence is guaranteed under most conditions
    • Learning rate must shrink with time for convergence
  – Stochastic gradient descent: update after each observation. Can be much faster than batch learning
  – Mini-batch updates: update after batches. Can be more efficient than SGD