
Training Neural Networks: Optimization
Intro to Deep Learning, Fall 2020

Recap
• Neural networks are universal approximators
• We must train them to approximate any function
• Networks are trained to minimize total "error" on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation

The training formulation

[Figure: training pairs plotted as output (y) against input (X)]

• Given input-output pairs at a number of locations, estimate the entire function

Gradient descent
• Start with an initial function
• Adjust its value at all points to make the outputs closer to the required value
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the training points

Effect of number of samples
• Problem with conventional gradient descent: we try to simultaneously adjust the function at all training points
– We must process all training points before making a single adjustment
– A "batch" update

Alternative: Incremental update
• Alternative: adjust the function at one training point at a time
– Keep the adjustments small
– Eventually, once we have processed all the training points, we will have adjusted the entire function
• With greater overall adjustment than we would make with a single "batch" update

Incremental Update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do: (over multiple epochs)
– For all $t = 1 \ldots T$ (one epoch)
• For every layer $k$:
– Compute $\nabla_{W_k}\mathit{Div}(Y_t, d_t)$
– Update $W_k = W_k - \eta\,\nabla_{W_k}\mathit{Div}(Y_t, d_t)^T$ (one update)
• Until the error has converged

Incremental Updates
• The iterations can make multiple passes over the data
• A single pass through the entire training data is called an "epoch"
– An epoch over a training set with $T$ samples results in $T$ updates of the parameters

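To make the loop concrete, here is a minimal sketch of incremental (per-sample) updates in NumPy, using a toy linear model and a squared-error divergence; the data, model, and learning rate are illustrative assumptions, not part of the slides:

```python
import numpy as np

# Toy data: 100 (x, d) pairs from a noisy linear function (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
d = 2.0 * X + 0.5 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0   # initialize all "weights" of the toy model
eta = 0.1         # fixed learning rate for now

for epoch in range(20):          # multiple passes over the data
    for x_t, d_t in zip(X, d):   # one epoch: T updates, one per sample
        y_t = w * x_t + b        # forward pass
        err = y_t - d_t          # from the gradient of 0.5 * (y - d)^2
        w -= eta * err * x_t     # update immediately after
        b -= eta * err           # each training instance
```
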
Caveats: order of presentation
• If we loop through the samples in the same order, we may get cyclic behavior
• We must go through them randomly to get more convergent behavior

Incremental Update: Stochastic Gradient Descent
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do:
– Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
– For all $t = 1 \ldots T$
• For every layer $k$:
– Compute $\nabla_{W_k}\mathit{Div}(Y_t, d_t)$
– Update $W_k = W_k - \eta\,\nabla_{W_k}\mathit{Div}(Y_t, d_t)^T$
• Until the error has converged

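The random permutation is the only change from the plain incremental loop. A sketch of how it is typically realized with NumPy's permutation (same toy setup as above, redefined here so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)   # toy inputs (illustrative)
d = 2.0 * X + 0.5                  # toy targets
w, b, eta = 0.0, 0.0, 0.1

for epoch in range(20):
    for t in rng.permutation(len(X)):   # fresh random order every epoch
        err = (w * X[t] + b) - d[t]     # gradient of 0.5 * (y - d)^2
        w -= eta * err * X[t]
        b -= eta * err
```
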
Story so far
• In any gradient descent optimization problem, presenting training instances incrementally can be more effective than presenting them all at once
– Provided the training instances are presented in random order
– This is "stochastic gradient descent"
• This also holds for training neural networks

Batch vs SGD

[Figure: convergence of batch gradient descent vs. SGD on the same problem]

• Batch gradient descent operates over T training instances to get a single update
• SGD gets T updates for the same amount of computation

Caveats: learning rate

[Figure: training samples of output (y) against input (X), with the fitted function]

• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
– Correcting the function for individual instances will lead to never-ending, non-convergent updates
– We must shrink the learning rate with iterations to prevent this
• Corrections for individual instances with the eventual minuscule learning rates will not modify the function

Incremental Update: Stochastic Gradient Descent
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$
• Do:
– Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$ (randomize input order)
– For all $t = 1 \ldots T$
• $j = j + 1$
• For every layer $k$:
– Compute $\nabla_{W_k}\mathit{Div}(Y_t, d_t)$
– Update $W_k = W_k - \eta_j\,\nabla_{W_k}\mathit{Div}(Y_t, d_t)^T$ (the learning rate $\eta_j$ reduces with $j$)
• Until the error has converged

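A sketch of one common way to shrink $\eta_j$: a $1/j$-style decay. Under such a schedule the step sizes still sum to infinity while their squares sum to a finite value (the Robbins-Monro conditions), so corrections for individual instances eventually stop moving the function. The specific schedule is an assumption; the slides only require that $\eta_j$ shrinks:

```python
eta0 = 0.5

def eta(j):
    # Decays toward zero, but slowly: sum_j eta(j) diverges,
    # while sum_j eta(j)^2 converges.
    return eta0 / (1.0 + j)

# eta(0) = 0.5, eta(9) = 0.05, eta(99) = 0.005
```
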
SGD example
• A simpler problem: K-means
• Note: SGD converges more slowly
• Also note the rather large variation between runs
– Let's try to understand these results…

Explaining the variance

[Figure: the target function (blue), the model's approximation (red), and shaded point-wise errors]

• The blue curve is the function being approximated
• The red curve is the approximation by the model at a given $W$
• The heights of the shaded regions represent the point-by-point error
– The divergence is a function of the error
– We want to find the $W$ that minimizes the average divergence

Explaining the variance
• The sample estimate approximates the shaded area with the average length of the error lines at the sampled points
– The average of these sample-dependent estimates is the red curve itself
• This average length will change with the position of the samples
• Variance: the spread between the different curves is the variance

Explaining the variance
• Having more samples makes the estimate more robust to changes in the position of the samples
– The variance of the estimate is smaller

Explaining the variance
With only one sample
• Having very few samples makes the estimate swing wildly with the sample position
– Since our estimator learns the $W$ to minimize this estimate, the learned $W$ too can swing wildly

SGD vs batch
• SGD uses the gradient from only one sample at a time, and is consequently high variance
• But it also provides significantly quicker updates than batch
• Is there a good medium?

Alternative: Mini-batch update
• Alternative: adjust the function at a small, randomly chosen subset of points
– Keep the adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
• As before, vary the subsets randomly in different passes through the training data

Incremental Update: Mini-batch update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$
• Do:
– Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
– For $t = 1 : b : T$ ($b$ is the mini-batch size)
• $j = j + 1$
• For every layer $k$:
– $\Delta W_k = 0$
• For $t' = t : t + b - 1$
– For every layer $k$:
» Compute $\nabla_{W_k}\mathit{Div}(Y_{t'}, d_{t'})$
» $\Delta W_k = \Delta W_k + \frac{1}{b}\nabla_{W_k}\mathit{Div}(Y_{t'}, d_{t'})^T$
• Update
– For every layer $k$:
$W_k = W_k - \eta_j\,\Delta W_k$ (the step size $\eta_j$ shrinks with $j$)
• Until the error has converged

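A minimal NumPy sketch of the mini-batch loop above, again on an assumed toy linear model (the batch size, decay constant, and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))   # toy data (illustrative)
d = 2.0 * X + 0.5
w, bias = np.zeros((1, 1)), 0.0
eta0, b = 0.5, 32                        # initial step size, mini-batch size

j = 0
for epoch in range(10):
    order = rng.permutation(len(X))      # randomly permute the data
    for start in range(0, len(X), b):    # t = 1 : b : T
        j += 1
        idx = order[start:start + b]
        err = X[idx] @ w + bias - d[idx]       # forward pass on the batch
        grad_w = X[idx].T @ err / len(idx)     # average of per-sample grads
        grad_b = err.mean()
        eta = eta0 / (1.0 + 0.01 * j)          # shrinking step size
        w -= eta * grad_w
        bias -= eta * grad_b
```
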
Mini Batches

[Figure: a mini-batch of training pairs (Xi, di)]

• Mini-batch updates compute and minimize a batch loss, the average divergence over the batch: $\mathit{BatchLoss} = \frac{1}{b}\sum_{t'} \mathit{Div}(Y_{t'}, d_{t'})$
• The expected value of the batch loss is also the expected divergence

SGD example
• Mini-batch performs comparably to batch training on this simple problem
– But converges orders of magnitude faster

Training and minibatches
• In practice, training is usually performed using mini-batches
– The mini-batch size is a hyperparameter to be optimized
• Convergence depends on the learning rate
– Simple technique: fix the learning rate until the error plateaus, then reduce the learning rate by a fixed factor (e.g. 10), as sketched below
– Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation

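The plateau-then-reduce technique is available off the shelf in most frameworks; for example, PyTorch provides ReduceLROnPlateau. A sketch of the wiring (model, train_one_epoch, and validation_loss are assumed to be defined elsewhere):

```python
import torch

# `model`, `train_one_epoch`, and `validation_loss` are assumed to be
# defined elsewhere; this only sketches the scheduler wiring.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)  # cut lr 10x on plateau

for epoch in range(100):
    train_one_epoch(model, optimizer)   # assumed training routine
    scheduler.step(validation_loss())   # reduce lr when loss stops improving
```
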
Momentum and incremental updates
• The momentum method:
$\Delta W^{(j)} = \beta\,\Delta W^{(j-1)} - \eta\,\nabla_W \mathit{Loss}$
where $\mathit{Loss}$ is the SGD instance or minibatch loss
• Incremental SGD and mini-batch gradients tend to have high variance
• Momentum smooths out the variations
– Smoother and faster convergence

Momentum: Mini-batch update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, \ldots, W_K$; $j = 0$, $\Delta W_k = 0$
• Do:
– Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
– For $t = 1 : b : T$
• $j = j + 1$
• For every layer $k$:
– $\nabla_{W_k}\mathit{Loss} = 0$
• For $t' = t : t + b - 1$
– For every layer $k$:
» Compute $\nabla_{W_k}\mathit{Div}(Y_{t'}, d_{t'})$
» $\nabla_{W_k}\mathit{Loss} \mathrel{+}= \frac{1}{b}\nabla_{W_k}\mathit{Div}(Y_{t'}, d_{t'})$
• Update
– For every layer $k$:
$\Delta W_k = \beta\,\Delta W_k - \eta_j\,(\nabla_{W_k}\mathit{Loss})^T$
$W_k = W_k + \Delta W_k$
• Until the error has converged

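A minimal sketch of the momentum update as a standalone function, exercised on a toy quadratic loss $0.5\,w^2$ whose gradient is simply $w$ (the step size and momentum factor are illustrative defaults):

```python
import numpy as np

def momentum_step(w, dW, grad, eta=0.01, beta=0.9):
    """One momentum update, following the slide:
        dW <- beta * dW - eta * grad
        w  <- w + dW
    `grad` is the SGD-instance or minibatch loss gradient."""
    dW = beta * dW - eta * grad
    return w + dW, dW

# Usage on a toy quadratic loss 0.5 * w^2 (gradient = w):
w, dW = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, dW = momentum_step(w, dW, grad=w)
```
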
Nestorov’s Accelerated Gradient

• At any iteration, to compute the current step:


– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
• This also applies directly to incremental update methods
– The accelerated gradient smooths out the variance in the
gradients
102
Nestorov’s Accelerated Gradient

SGD instance
or minibatch
loss
• Nestorov’s method
( )

103
Nesterov: Mini-batch update
• Given $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, \ldots, W_K$; $j = 0$, $\Delta W_k = 0$
• Do:
– Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
– For $t = 1 : b : T$
• $j = j + 1$
• For every layer $k$:
– $W_k = W_k + \beta\,\Delta W_k$ (extend the previous step)
– $\nabla_{W_k}\mathit{Loss} = 0$
• For $t' = t : t + b - 1$
– For every layer $k$:
» Compute $\nabla_{W_k}\mathit{Div}(Y_{t'}, d_{t'})$
» $\nabla_{W_k}\mathit{Loss} \mathrel{+}= \frac{1}{b}\nabla_{W_k}\mathit{Div}(Y_{t'}, d_{t'})$
• Update
– For every layer $k$:
$W_k = W_k - \eta_j\,(\nabla_{W_k}\mathit{Loss})^T$
$\Delta W_k = \beta\,\Delta W_k - \eta_j\,(\nabla_{W_k}\mathit{Loss})^T$
• Until the error has converged

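A corresponding sketch of the Nesterov step, written in the look-ahead form used above (grad_fn stands for whatever computes the instance or minibatch gradient; the toy quadratic loss is again illustrative):

```python
import numpy as np

def nesterov_step(w, dW, grad_fn, eta=0.01, beta=0.9):
    """One Nesterov update. `grad_fn(w)` returns the instance or
    minibatch loss gradient evaluated at w."""
    w_ahead = w + beta * dW    # first extend the previous step
    g = grad_fn(w_ahead)       # gradient at the resultant position
    w = w_ahead - eta * g      # take the gradient step from there
    dW = beta * dW - eta * g   # remember the combined step
    return w, dW

# Usage on a toy quadratic loss 0.5 * w^2 (gradient = identity):
w, dW = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, dW = nesterov_step(w, dW, grad_fn=lambda v: v)
```
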
Still higher-order methods
• Momentum and Nesterov's method improve convergence by normalizing the mean of the derivatives
• More recent methods take this one step further by also considering their variance
– RMS Prop
– Adagrad
– AdaDelta
– ADAM: very popular in practice
– …
• All are roughly equivalent in performance

Smoothing the trajectory

[Figure: a zig-zag descent trajectory whose numbered steps are tabulated below]

Step   X component   Y component
 1         1            +2.5
 2         1            -3
 3         2            +2.5
 4         1            -2
 5        1.5           +1.5

• Observation: steps in "oscillatory" directions show large total movement
– In the example, the total motion in the vertical direction (2.5 + 3 + 2.5 + 2 + 1.5 = 11.5) is much greater than in the horizontal direction (6.5)
– This can happen even when momentum or Nesterov's method is used
• Improvement: dampen the step size in directions with high motion
– A second-order term

Normalizing steps by second moment
• Modify the usual gradient-based update:
– Scale the update of every component in inverse proportion to the total movement of that component in the recent past
• According to their variation (not just their average)
• This will change the relative update sizes of the individual components
– In the above example it would scale down the Y component
– And scale up the X component (in comparison)
• We will see two popular methods that embody this principle (a numeric sketch follows below)

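A small numeric sketch of the principle, applied to the five steps in the table above: dividing each component by the root mean squared value of its recent movement damps the oscillatory Y direction relative to the steady X direction (the plain RMS over all five steps is a simplification; the actual methods below use running estimates):

```python
import numpy as np

steps = np.array([[1.0,  2.5],
                  [1.0, -3.0],
                  [2.0,  2.5],
                  [1.0, -2.0],
                  [1.5,  1.5]])   # (X, Y) step components from the table

rms = np.sqrt((steps ** 2).mean(axis=0))  # per-component RMS movement
print(rms)            # ~[1.36, 2.36]: Y has moved far more than X
print(steps / rms)    # normalized steps: Y damped relative to X
```
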
RMS Prop
• Notation:
– Updates are by parameter
– The derivative of the loss with respect to any individual parameter $w$ is shown as $\partial_w D$
• $D$ is the batch or minibatch loss, or the individual divergence for batch/minibatch/SGD
– The squared derivative is $\partial^2_w D = (\partial_w D)^2$
• This shorthand represents the squared derivative, not the second derivative
– The mean squared derivative is a running estimate of the average squared derivative. We will show this as $E[\partial^2_w D]$
• Modified update rule: we want to
– scale down updates with large mean squared derivatives
– scale up updates with small mean squared derivatives

RMS Prop
• This is a variant on the basic mini-batch SGD algorithm
• Procedure:
– Maintain a running estimate of the mean squared value of the derivatives for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative

RMS Prop (updates are for each weight of each layer)
• Do:
– Randomly shuffle inputs to change their order
– Initialize: $k = 1$; for all weights $w$ in all layers, $E[\partial^2_w D]_0 = 0$
– For all $t = 1 : B : T$ (incrementing in blocks of $B$ inputs)
• For all weights in all layers initialize $\partial_w D = 0$
• For $b = 0 : B - 1$
– Compute
» Output $Y(X_{t+b})$
» Gradient $\frac{d\,\mathit{Div}(Y(X_{t+b}),\,d_{t+b})}{dw}$
» $\partial_w D \mathrel{+}= \frac{1}{B}\,\frac{d\,\mathit{Div}(Y(X_{t+b}),\,d_{t+b})}{dw}$
• Update: for all weights $w$ in all layers:
$E[\partial^2_w D]_k = \gamma\,E[\partial^2_w D]_{k-1} + (1 - \gamma)\,(\partial^2_w D)_k$
$w_{k+1} = w_k - \frac{\eta}{\sqrt{E[\partial^2_w D]_k + \epsilon}}\,\partial_w D$
(typical value: $\gamma \approx 0.9$)
• $k = k + 1$
• Until the loss has converged

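A minimal sketch of the per-parameter RMSProp update, using the typical values noted above (the toy quadratic loss in the usage example is illustrative):

```python
import numpy as np

def rmsprop_step(w, msq, grad, eta=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update:
        E[g^2]_k = gamma * E[g^2]_{k-1} + (1 - gamma) * g^2
        w       <- w - eta / sqrt(E[g^2]_k + eps) * g"""
    msq = gamma * msq + (1.0 - gamma) * grad ** 2
    w = w - eta / np.sqrt(msq + eps) * grad
    return w, msq

# Usage on a toy quadratic loss 0.5 * w^2 (gradient = w):
w, msq = np.array([5.0]), np.zeros(1)
for _ in range(2000):
    w, msq = rmsprop_step(w, msq, grad=w)
```
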
ADAM: RMSprop with momentum
• RMSprop only considers a second-moment-normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
– It considers both the first and second moments

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter: $m_k = \delta\,m_{k-1} + (1 - \delta)\,(\partial_w D)_k$
– Maintain a running estimate of the mean squared value of the derivatives for each parameter: $v_k = \gamma\,v_{k-1} + (1 - \gamma)\,(\partial^2_w D)_k$
– Scale the update of the parameter by the inverse of the root mean squared derivative:

$\hat{m}_k = \frac{m_k}{1 - \delta^k}, \qquad \hat{v}_k = \frac{v_k}{1 - \gamma^k}$

$w_{k+1} = w_k - \frac{\eta}{\sqrt{\hat{v}_k} + \epsilon}\,\hat{m}_k$

– The $1 - \delta^k$ and $1 - \gamma^k$ denominators correct for the zero initialization of $m$ and $v$, ensuring these terms do not dominate in early iterations

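And a matching sketch of the ADAM update with bias correction; delta and gamma follow the slide's notation for the first- and second-moment smoothing factors (beta1 and beta2 in the original paper):

```python
import numpy as np

def adam_step(w, m, v, grad, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    """One ADAM update; k counts updates starting from 1."""
    m = delta * m + (1.0 - delta) * grad        # mean derivative
    v = gamma * v + (1.0 - gamma) * grad ** 2   # mean squared derivative
    m_hat = m / (1.0 - delta ** k)              # bias corrections: keep the
    v_hat = v / (1.0 - gamma ** k)              # zero init from dominating
    w = w - eta / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v

# Usage on a toy quadratic loss 0.5 * w^2 (gradient = w):
w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for k in range(1, 2001):
    w, m, v = adam_step(w, m, v, grad=w, k=k)
```
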
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But they come with other hyperparameters to be optimized
– Typical parameter values (common defaults):
• RMSProp: $\eta \approx 0.001$, $\gamma \approx 0.9$
• ADAM: $\eta \approx 0.001$, $\delta \approx 0.9$, $\gamma \approx 0.999$

Visualizing the optimizers: Beale’s Function

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

115
Visualizing the optimizers: Long Valley

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

116
Visualizing the optimizers: Saddle Point

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

117
Story so far
• Gradient descent can be sped up by incremental updates
– Convergence is guaranteed under most conditions
• The learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each observation. Can be much faster than batch learning
– Mini-batch updates: update after batches of observations. Can be more efficient than SGD
• Convergence can be improved using smoothed updates
– RMSprop and more advanced techniques
