Lec 7 Optimization Part 2

The document discusses the training of neural networks, focusing on optimization techniques such as gradient descent and its variants, including incremental updates and stochastic gradient descent (SGD). It highlights the importance of minimizing error through empirical risk minimization and the challenges of conventional batch updates. Additionally, it addresses the risks of incremental learning and the necessity of adjusting learning rates for effective convergence.

Training Neural Networks:

Optimization
Intro to Deep Learning, Spring 2024

1
Recap
• Neural networks are universal approximators
• We must train them to approximate any
function
• Networks are trained to minimize total “error”
on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation
2
Recap
• Vanilla gradient descent may be too slow or unstable

• Better convergence can be obtained through


– Second order methods that normalize the variation across
dimensions
– Adaptive or decaying learning rates that can improve
convergence
– Methods like Rprop that decouple the dimensions can
improve convergence
– Momentum methods which emphasize directions of
steady improvement and deemphasize unstable directions
3
Moving on…
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

4
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

5
The training formulation

output (y)

Input (X)

• Given input-output pairs at a number of
locations, estimate the entire function
6
Gradient descent

• Start with an initial function


• Adjust its value at all points to make the outputs closer to the required
value
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the
training points

7
Effect of number of samples

• Problem with conventional gradient descent: we try to


simultaneously adjust the function at all training points
– We must process all training points before making a single
adjustment
– “Batch” update
13
Poll 1 (@415)

Select all that are true [all correct]


 The actual loss function we try to minimize requires batch updates
 Batch updates minimize the total loss over the entire training data
 Batch updates optimize the actual loss function
 Batch updates require processing the entire training data before we perform a single
update

14
Alternative: Incremental update

• Alternative: adjust the function at one training point at a time


– Keep adjustments small
– Eventually, when we have processed all the training points, we will
have adjusted the entire function
• With greater overall adjustment than we would if we made a single “Batch”
update

16
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– For all t = 1 : T
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η ∇_{W_k} Div(Y_t, d_t)^T
• Until Loss has converged

21
Incremental Updates
• The iterations can make multiple passes over
the data
• A single pass through the entire training data
is called an “epoch”
– An epoch over a training set with T samples
results in T updates of parameters

22
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:                                                    ← Over multiple epochs
– For all t = 1 : T                                      ← One epoch
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η ∇_{W_k} Div(Y_t, d_t)^T       ← One update
• Until Loss has converged

23
Caveats: order of presentation

• If we loop through the samples in the same


order, we may get cyclic behavior

24
Caveats: order of presentation

• If we loop through the samples in the same order,


we may get cyclic behavior
• We must go through them randomly to get more
convergent behavior
28
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For all t = 1 : T
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η ∇_{W_k} Div(Y_t, d_t)^T
• Until Loss has converged

33
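As a concrete illustration of the SGD loop above, the following is a minimal Python/NumPy sketch for a toy single-layer (linear-regression) model. The data, model, and learning rate here are hypothetical placeholders that only illustrate the structure of the update, not anything specific from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                      # 100 training inputs, 3 features
    d = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)   # targets

    W = np.zeros(3)      # initialize all weights
    eta = 0.01           # fixed (small) learning rate

    for epoch in range(50):                    # "Do ... until converged"
        order = rng.permutation(len(X))        # randomly permute the training pairs
        for t in order:                        # one update per training instance
            y_t = X[t] @ W                     # forward pass
            grad = 2 * (y_t - d[t]) * X[t]     # gradient of the squared divergence w.r.t. W
            W -= eta * grad                    # incremental update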
Story so far
• In any gradient descent optimization problem,
presenting training instances incrementally
can be more effective than presenting them
all at once
– Provided training instances are provided in
random order
– “Stochastic Gradient Descent”

• This also holds for training neural networks


34
Explanations and restrictions
• So why does this process of incremental
updates work?
• Under what conditions?

• For “why”: first consider a simplistic


explanation that’s often given
– Look at an extreme example

35
The expected behavior of the gradient
  ∇_W Loss(W) = (1/T) Σ_i ∇_W div(f(X_i; W), d_i)

• The individual training instances contribute different directions to the
overall gradient
– The final gradient is the average of the individual gradients
– It points towards the net direction
36
Extreme example

• Extreme instance of data clotting: all the


training instances are exactly the same
37
The expected behavior of the gradient
  ∇_W Loss(W) = (1/T) Σ_i ∇_W div(f(X_i; W), d_i) = ∇_W div(f(X; W), d)   (all instances identical)

• The individual training instances contribute identical
directions to the overall gradient
– The final gradient is simply the gradient for an individual
instance
38
Batch vs SGD

Batch SGD

• Batch gradient descent operates over T training instances


to get a single update
• SGD gets T updates for the same computation
39
Clumpy data..

• Also holds if all the data are not identical, but


are tightly clumped together
40
Clumpy data..

• As data get increasingly diverse, the benefits of incremental


updates decrease, but do not entirely vanish
41
When does it work
• What are the considerations?

• And how well does it work?

42
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Modelling problem: Find a linear regression line


(through origin) to model the data
– Find the line through origin that has the lowest overall
squared projection error w.r.t. data
43
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Modelling problem: Find a linear regression line


(through origin) to model the data
– Batch processing: Find the line through origin that has the
lowest overall squared projection error w.r.t. data
44
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Incremental learning: Update the model to always


minimize the error on the latest instance
– It will never converge
– Solution?
45
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Incremental learning: Update the model to always minimize the error on


the latest instance
– Shrink the learning rate with iterations
– With increasing iterations, it will swing less and less towards the new point

49
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Incremental learning: Update the model to always minimize the error on the latest
instance
– Shrink the learning rate with iterations
– With increasing iterations, it will swing less and less towards the new point
– Eventually arriving at the correct solution and not moving much from it further because the
step sizes are now too small…
54
Incremental learning caveat: learning
rate

output (y)

Input (X)

• Incremental learning: Update the model to


always minimize the error on the latest instance
– Caveat: We must shrink the learning rate with
iterations for convergence
• Correction for individual instances with the eventual
minuscule learning rates will not modify the function
55
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)     ← Randomize input order
– For all t = 1 : T
  • j = j + 1
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η_j ∇_{W_k} Div(Y_t, d_t)^T         ← Learning rate η_j reduces with j
• Until Loss has converged

56-57
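The shrinking learning rate η_j can be implemented as a simple schedule over the update counter j. The Python sketch below (with an illustrative base rate) decays the step size as 1/j, the schedule discussed on the next slide.

    def eta(j, eta0=0.1):
        """Step size for update j (1-indexed); decays as 1/j so that
        sum(eta) diverges while sum(eta**2) stays finite."""
        return eta0 / j

    # Inside the SGD loop above one would use:
    #   j += 1
    #   W -= eta(j) * grad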
SGD convergence
• SGD converges "almost surely" to a global or local minimum for most
functions
– Sufficient condition: step sizes follow the following conditions
(Robbins and Monro 1951)

  Σ_k η_k = ∞
  • Eventually the entire parameter space can be searched

  Σ_k η_k² < ∞
  • The steps shrink

– The fastest converging series that satisfies both above requirements is
  η_k ∝ 1/k
  • This is the optimal rate of shrinking the step size for strongly convex functions
– More generally, the learning rates are heuristically determined
• If the loss is convex, SGD converges to the optimal solution
• For non-convex losses SGD converges to a local minimum
58
SGD convergence
• We will define convergence in terms of the number of iterations taken to
get within ε of the optimal solution
– |f(W^(k)) - f(W*)| < ε

– Note: f() here is the optimization objective on the entire training data,
although SGD itself updates after every training instance

• Using the optimal learning rate 1/k, for strongly convex functions,
  f(W^(k)) - f(W*) = O((1/k)(f(W^(0)) - f(W*)))

– Strongly convex → Can be placed inside a quadratic bowl, touching at any point
– Giving us the iterations to convergence as O(1/ε)

• For generically convex (but not strongly convex) functions, various proofs
report a convergence of O(1/√k) using a learning rate of η_k ∝ 1/√k

59
Batch gradient convergence
• In contrast, using the batch update method, for strongly
convex functions,
  f(W^(k)) - f(W*) = O(c^k) for some c < 1
– Giving us the iterations to convergence as O(log(1/ε))

• For generic convex functions, iterations to convergence
is O(1/ε)

• Batch gradients converge "faster"
– But SGD performs T updates for every batch update
60
SGD Convergence: Loss value
If:
• f(W) is λ-strongly convex, and
• at step t we have a noisy estimate of the
subgradient ĝ_t with E[‖ĝ_t‖²] ≤ G² for all t,
• and we use step size η_t = 1/(λt)
Then for any T > 1:
  E[f(W_T) - f(W*)] ≤ 17 G² (1 + log T) / (λ T)
61
SGD Convergence
• We can bound the expected difference between the
loss over our data using the optimal weights W* and
the weights W_T at any single iteration to O(log T / T) for
strongly convex loss or O(log T / √T) for convex loss

• Averaging schemes can improve the bound to O(1/T)
and O(1/√T)

• Smoothness of the loss is not required

62
SGD Convergence and weight averaging
Polynomial Decay Averaging:
  W̄_T = (1 - (γ+1)/(T+γ)) W̄_{T-1} + ((γ+1)/(T+γ)) W_T
With γ some small positive constant, e.g. γ = 3

Achieves O(1/T) (strongly convex) and
O(1/√T) (convex) convergence

63
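A sketch of how the polynomial-decay averaging recursion could be implemented in Python, assuming the reconstructed form written above; the constant gamma=3 is only an example value.

    def polynomial_decay_average(w_bar, w_t, t, gamma=3.0):
        """One step of polynomial-decay averaging of SGD iterates.
        w_bar: running average after t-1 steps; w_t: current iterate (step t >= 1)."""
        alpha = (gamma + 1.0) / (t + gamma)
        return (1.0 - alpha) * w_bar + alpha * w_t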
SGD example

• A simpler problem: K-means


• Note: SGD converges faster
– But to a poorer minimum
• Also note the rather large variation between runs
– Let's try to understand these results..
64
Poll 2 (@416)

Select all that are true [all correct]


 SGD is an online version of batch updates
 SGD can have oscillatory behavior if we do not randomize the order of the inputs
 SGD can converge faster than batch updates, but arrive at poorer optima
 SGD convergence to the global optimum can only be guaranteed if step sizes shrink
across iterations, but sum to infinity in the limit

65
Recall: Modelling a function

• To learn a network to model a function we


minimize the expected divergence

67
Recall: The Empirical risk

di
Xi

• In practice, we minimize the empirical risk (or loss)

  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)

• The expected value of the empirical risk is actually the expected divergence
  E[Loss(W)] = E[div(f(X; W), g(X))]
68
Recall: The Empirical risk

di
Xi

• In practice, we minimize the empirical risk (or loss)

  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)

The empirical risk is an unbiased estimate of the expected divergence
Though there is no guarantee that minimizing it will minimize the
expected divergence
• The expected value of the empirical risk is actually the expected divergence
  E[Loss(W)] = E[div(f(X; W), g(X))]
69
Recall: The Empirical risk

di
Xi

The variance of the empirical risk: var(Loss) = (1/N) var(div)
The variance of the estimator is proportional to 1/N
The larger this variance, the greater the likelihood that the W that
minimizes the empirical risk will differ significantly from the W that
minimizes the expected divergence

• In practice, we minimize the empirical risk

  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)

The empirical risk is an unbiased estimate of the expected divergence
Though there is no guarantee that minimizing it will minimize the
expected divergence
• The expected value of the empirical risk is actually the expected divergence
  E[Loss(W)] = E[div(f(X; W), g(X))]
70
SGD

di

Xi

• At each iteration, SGD focuses on the divergence


of a single sample
• The expected value of the sample error is still the
expected divergence 71
SGD

di

Xi

The sample divergence is also an unbiased estimate of the expected error

• At each iteration, SGD focuses on the divergence


of a single sample
• The expected value of the sample error is still the
expected divergence 72
SGD

di

Xi
The variance of the sample divergence is the variance of the divergence itself:
var(div). This is N times the variance of the empirical average minimized by
batch update
The sample divergence is also an unbiased estimate of the expected error

• At each iteration, SGD focuses on the divergence


of a single sample
• The expected value of the sample error is still the
expected divergence 73
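The variance argument can be checked numerically. The following Python sketch simulates per-sample divergences from an arbitrary, purely illustrative distribution and compares a single-sample estimate of the expected divergence against an N-sample empirical average; both are unbiased, but the single-sample estimate has roughly N times the variance.

    import numpy as np

    rng = np.random.default_rng(1)
    divs = rng.exponential(scale=2.0, size=(10000, 100))   # simulated per-sample divergences

    single_sample = divs[:, 0]          # SGD-style estimate: one sample per estimate
    full_average  = divs.mean(axis=1)   # batch-style estimate: average over N = 100 samples

    print(single_sample.mean(), full_average.mean())   # both approx. 2.0 (unbiased)
    print(single_sample.var() / full_average.var())    # approx. 100 = N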
Explaining the variance

• The blue curve is the function being approximated


• The red curve is the approximation by the model at a given W
• The heights of the shaded regions represent the point-by-point error
– The divergence is a function of the error
– We want to find the W that minimizes the average divergence

74
Explaining the variance

• Sample estimate approximates the shaded area with the
average length of the error lines
• Variance: The spread between the different curves is the
variance
75
Explaining the variance

• Sample estimate approximates the shaded area


with the average length of the error lines
• This average length will change with position of
the samples 76
Explaining the variance

• Having more samples makes the estimate more


robust to changes in the position of samples
– The variance of the estimate is smaller
78
Explaining the variance
With only one sample

• Having very few samples makes the estimate


swing wildly with the sample position
– Since our estimator learns the W to minimize this
estimate, the learned W too can swing wildly 79
SGD example

• A simpler problem: K-means


• Note: SGD converges faster
• But also has large variation between runs 82
SGD vs batch
• SGD uses the gradient from only one sample
at a time, and is consequently high variance

• But also provides significantly quicker updates


than batch
• Is there a good medium?

83
Alternative: Mini-batch update

• Alternative: adjust the function at a small, randomly chosen subset of


points
– Keep adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
• As before, vary the subsets randomly in different passes through the
training data

84
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T                            ← b is the mini-batch size
  • j = j + 1
  • For every layer k:
    – ΔW_k = 0
  • For t' = t : t + b - 1
    – For every layer k:
      » Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
      » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t'}, d_{t'})
  • Update
    – For every layer k:
      W_k = W_k - η_j ΔW_k^T                   ← shrinking step size η_j
• Until Loss has converged
88-89
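A minimal Python/NumPy sketch of the mini-batch loop above, again for a toy linear model; the batch size, base learning rate, and epoch count are illustrative placeholders.

    import numpy as np

    def minibatch_sgd(X, d, b=16, eta0=0.05, epochs=50):
        rng = np.random.default_rng(0)
        W = np.zeros(X.shape[1])
        j = 0                                    # update counter
        for _ in range(epochs):
            order = rng.permutation(len(X))      # randomly permute the data
            for start in range(0, len(X), b):    # one mini-batch per update
                j += 1
                idx = order[start:start + b]
                y = X[idx] @ W
                grad = 2 * (y - d[idx]) @ X[idx] / len(idx)   # batch-averaged gradient
                W -= (eta0 / j) * grad           # shrinking step size
        return W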


Mini Batches

di

Xi

• Mini-batch updates compute and minimize a batch loss

• The expected value of the batch loss is also the expected divergence

90
Mini Batches

di

Xi

The minibatch loss is also an unbiased estimate of the expected loss


• Mini-batch updates compute and minimize a batch loss

• The expected value of the batch loss is also the expected divergence

91
Mini Batches

di

Xi
The variance of the minibatch loss: var(BatchLoss) = 1/b var(div)
This will be much smaller than the variance of the sample error in SGD
The minibatch loss is also an unbiased estimate of the expected error
• Mini-batch updates compute and minimize a batch loss

• The expected value of the batch loss is also the expected divergence

92
Minibatch convergence
• For convex functions, convergence rate for SGD is O(1/√k)
• For mini-batch updates with batches of size b, the
convergence rate is O(1/√(bk) + 1/k)
– Apparently an improvement of √b over SGD
– But since the batch size is b, we perform b times as many
computations per iteration as SGD
– We actually get a degradation of √b

• However, in practice
– The objectives are generally not convex; mini-batches are more
effective with the right learning rates
– We also get additional benefits of vector processing
93
SGD example

• Mini-batch performs comparably to batch


training on this simple problem
– But converges orders of magnitude faster
94
Measuring Loss
• Convergence is generally
defined in terms of the
overall training loss
– Not sample or batch loss

• Infeasible to actually measure the overall training loss


after each iteration
• More typically, we estimate it as
– Divergence or classification error on a held-out set
– Average sample/batch loss over the past several
samples/batches
95
Training and minibatches
• In practice, training is usually performed using mini-
batches
– The mini-batch size is generally set to the largest that your
hardware will support (in memory) without compromising
overall compute time
• Larger minibatches = less variance
• Larger minibatches = fewer updates per epoch

• Convergence depends on learning rate


– Simple technique: fix learning rate until the error plateaus,
then reduce learning rate by a fixed factor (e.g. 10)
– Advanced methods: Adaptive updates, where the learning
rate is itself determined as part of the estimation
96
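The "fix the learning rate until the error plateaus, then reduce it by a fixed factor" heuristic can be sketched as below; the factor, patience, and tolerance values are illustrative, and in practice one would also reset the patience window after each reduction.

    def step_decay_on_plateau(eta, history, factor=10.0, patience=3, tol=1e-4):
        """Divide the learning rate by `factor` if the monitored error (history of
        validation or running training loss) has not improved by more than `tol`
        relative to the best value seen before the last `patience` measurements."""
        if len(history) > patience and history[-1] > min(history[:-patience]) - tol:
            return eta / factor
        return eta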
Poll 3 (@417)

Select all that are true


 Minibatch descent is an online version of batch updates
 Minibatch descent is faster than SGD when the batch size is 1 [false]
 The variance of minibatch updates decreases with batch size
 Minibatch gradient approaches batch updates in variance, but SGD in efficiency
when we use vector processing and large batches

97
Story so far
• SGD: Presenting training instances one-at-a-time can be more effective
than full-batch training
– Provided they are presented in random order

• For SGD to converge, the learning rate must shrink sufficiently rapidly with
iterations
– Otherwise the learning will continuously “chase” the latest sample

• SGD estimates have higher variance than batch estimates

• Minibatch updates operate on batches of instances at a time


– Estimates have lower variance than SGD
– Convergence rate is theoretically worse than SGD
– But we compensate by being able to perform batch processing

99
Training and minibatches
• Convergence depends on learning rate
– Simple technique: fix learning rate until the error
plateaus, then reduce learning rate by a fixed
factor (e.g. 10)
– Advanced methods: Adaptive updates, where the
learning rate is itself determined as part of the
estimation

100
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

101
Recall: Momentum Update
Plain gradient update vs. with momentum (figure)

• The momentum method maintains a running average of all gradients until
the current step
  ΔW^(k) = β ΔW^(k-1) - η ∇_W Loss(W^(k-1))^T
  W^(k) = W^(k-1) + ΔW^(k)
– Typical β value is 0.9

• The running average steps
– Get longer in directions where gradient retains the same sign
– Become shorter in directions where the sign keeps flipping
102
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First computes the gradient step at the current location
– Then adds in the historical average step

103
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First compute the gradient step at the current location

– Then adds in the historical average step


104
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First compute the gradient step at the current location
– Then add in the scaled previous step
• Which is actually a running average
105
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First compute the gradient step at the current location
– Then add in the scaled previous step
• Which is actually a running average
– To get the final step
106
Momentum update
1

• Momentum update steps are actually computed in two stages


– First: We take a step against the gradient at the current location
– Second: Then we add a scaled version of the previous step

• The procedure can be made more optimal by reversing the order of


operations..

107
Nesterov's Accelerated Gradient

• Change the order of operations


• At any iteration, to compute the current step:
– First extend by the (scaled) historical average
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
108
Nesterov's Accelerated Gradient

• Change the order of operations


• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
109
Nesterov's Accelerated Gradient

• Change the order of operations


• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient step at the resultant
position
– Add the two to obtain the final step
110
Nesterov's Accelerated Gradient

• Nesterov's method

112
Nesterov's Accelerated Gradient

• Comparison with momentum (example from


Hinton)
• Converges much faster
113
Momentum and incremental updates

SGD instance
or minibatch
loss
• The momentum method

• Incremental SGD and mini-batch gradients tend to have


high variance
• Momentum smooths out the variations
– Smoother and faster convergence
114
Momentum: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
  • j = j + 1
  • For every layer k:
    – ∇_{W_k} Loss = 0
  • For t' = t : t + b - 1
    – For every layer k:
      » Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
      » ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t'}, d_{t'})
  • Update
    – For every layer k:
      ΔW_k = β ΔW_k - η_j (∇_{W_k} Loss)^T
      W_k = W_k + ΔW_k
• Until Loss has converged
115
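A compact Python sketch of the momentum step used inside the loop above; the β and η values are illustrative, and `grad` stands for the accumulated mini-batch gradient of the loss with respect to W.

    def momentum_step(W, dW, grad, eta=0.01, beta=0.9):
        """One momentum update. `dW` is the running update (velocity)
        carried between calls; `grad` is the current mini-batch gradient."""
        dW = beta * dW - eta * grad     # running average of steps
        W = W + dW                      # apply the smoothed step
        return W, dW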
Nesterov's Accelerated Gradient

• At any iteration, to compute the current step:


– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
• This also applies directly to incremental update methods
– The accelerated gradient smooths out the variance in the
gradients
116
Nesterov's Accelerated Gradient

SGD instance
or minibatch
loss
• Nesterov's method
  ΔW^(k) = β ΔW^(k-1) - η ∇_W Loss(W^(k-1) + β ΔW^(k-1))^T
  W^(k) = W^(k-1) + ΔW^(k)

117
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
  • j = j + 1
  • For every layer k:
    – W_k = W_k + β ΔW_k
    – ∇_{W_k} Loss = 0
  • For t' = t : t + b - 1
    – For every layer k:
      » Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
      » ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t'}, d_{t'})
  • Update
    – For every layer k:
      W_k = W_k - η_j (∇_{W_k} Loss)^T
      ΔW_k = β ΔW_k - η_j (∇_{W_k} Loss)^T
• Until Loss has converged

118
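The corresponding Nesterov step, sketched in Python; `grad_fn` is a hypothetical callable that returns the gradient of the loss at a given W, and the hyperparameter values are illustrative.

    def nesterov_step(W, dW, grad_fn, eta=0.01, beta=0.9):
        """One Nesterov accelerated step. `grad_fn(W)` returns dLoss/dW."""
        lookahead = W + beta * dW     # first extend the previous step
        g = grad_fn(lookahead)        # gradient at the look-ahead position
        W = lookahead - eta * g       # then take the gradient step from there
        dW = beta * dW - eta * g      # running update for the next iteration
        return W, dW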
The other term in the update
• Standard gradient descent rule
  W^(k) = W^(k-1) - η ∇_W Loss(W^(k-1))^T
• Gradient descent invokes two terms for updates
– The derivative ∇_W Loss
– and the learning rate η
• Momentum methods fix the derivative term to reduce
unstable oscillation
• What about the learning rate term?
119
Adjusting the learning rate
Sequence of gradients (figure)
With separate learning rates in each direction, which should
have the lowest learning rate in the vertical direction?

• Have separate learning rates for each component


• Directions in which the derivatives swing more should likely have lower
learning rates
– Is likely indicative of more wildly swinging behavior

• Directions of greater swing are indicated by total movement


– Direction of greater movement should have lower learning rate
122
Smoothing the trajectory
Step | X component | Y component
  1  |      1      |    +2.5
  2  |      1      |    -3
  3  |      2      |    +2.5
  4  |      1      |    -2
  5  |     1.5     |    1.5

• Observation: Steps in “oscillatory” directions show large total movement


– In the example, total motion in the vertical direction is much greater than in
the horizontal direction

• Solution: Lower learning rate in the vertical direction than in the


horizontal direction
– Based on total motion
– As quantified by RMS value 123
RMS Prop
• Notation:
– Formulae are by parameter w
– Derivative of loss w.r.t any individual parameter w is shown as ∂_w D
  • D is the batch or minibatch loss, or individual divergence for batch/minibatch/SGD
– The squared derivative is ∂²_w D = (∂_w D)²
  • The short-hand notation represents the squared derivative, not the second derivative
– The mean squared derivative is a running estimate of the average squared
derivative. We will show this as E[∂²_w D]

• Modified update rule: We want to
– scale down learning rates for terms with large mean squared derivatives
– scale up learning rates for terms with small mean squared derivatives

124
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm

• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale learning rate of the parameter by the inverse of the root
mean squared derivative

125
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm

• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale learning rate of the parameter by the inverse of the root
mean squared derivative

Note similarity to RPROP


The magnitude of the derivative is being normalized out
126
RMS Prop (updates are for each weight of each layer)
• Do:
– Randomly shuffle inputs to change their order
– Initialize: k = 1; for all weights w in all layers, E[∂²_w D]_0 = 0
– For all t = 1 : b : T (incrementing in blocks of b inputs)
  • For all weights in all layers initialize ∂_w D = 0
  • For b' = 0 : b - 1
    – Compute
      » Output Y(X_{t+b'})
      » Compute gradient dDiv(Y(X_{t+b'}), d_{t+b'}) / dw
      » Compute ∂_w D += dDiv(Y(X_{t+b'}), d_{t+b'}) / dw
  • Update: for all weights w
      E[∂²_w D]_k = γ E[∂²_w D]_{k-1} + (1 - γ)(∂_w D)²_k
      w_{k+1} = w_k - (η / √(E[∂²_w D]_k + ε)) ∂_w D
      k = k + 1
• Until loss has converged
Typical value: γ ≈ 0.9
127
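A per-parameter Python/NumPy sketch of the RMSprop update above; the default values shown are common choices and are illustrative here, not prescribed by the slide.

    import numpy as np

    def rmsprop_step(w, grad, ms, eta=0.001, gamma=0.9, eps=1e-8):
        """One RMSprop update. `ms` is the running mean squared derivative E[dw^2]."""
        ms = gamma * ms + (1.0 - gamma) * grad ** 2    # update running estimate
        w = w - eta * grad / np.sqrt(ms + eps)         # scale the step by 1/RMS
        return w, ms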
All the terms in gradient descent
• Standard gradient descent rule

• RMSprop only adapts the learning rate


– by total movement
• Momentum only smooths the gradient

• How about combining both?


128
ADAM: RMSprop with momentum
• RMS prop only adapts the learning rate
• Momentum only smooths the gradient
• ADAM combines the two

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each
parameter
– Learning rate is proportional to the inverse of the root mean squared
derivative

130
ADAM: RMSprop with momentum
• RMSprop only adapts the learning rate
• Momentum only smooths the gradient
• ADAM combines the two

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter:
  m_k = δ m_{k-1} + (1 - δ)(∂_w D)_k
– Maintain a running estimate of the mean squared value of derivatives for each parameter:
  v_k = γ v_{k-1} + (1 - γ)(∂_w D)²_k
– Bias-correct both running estimates, which ensures that the δ and γ terms do not
dominate in early iterations:
  m̂_k = m_k / (1 - δ^k),   v̂_k = v_k / (1 - γ^k)
– Learning rate is scaled by the inverse of the root mean squared derivative:
  w_{k+1} = w_k - (η / (√v̂_k + ε)) m̂_k
131

ADAM: RMSprop with momentum
• RMSprop only considers a second-moment-normalized version of the
current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
• Typically m_0 = 0 and δ is close to 1, so without the (1 - δ^k) denominator term
m_k will stay close to 0 for a long time, resulting in minimal parameter updates
– The denominator term ensures that updates actually happen
– For large k, the denominator just becomes 1
132
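A Python/NumPy sketch of one ADAM step, including the bias correction discussed above; the hyperparameter values are the commonly used defaults and are shown only for illustration.

    import numpy as np

    def adam_step(w, grad, m, v, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
        """One ADAM update for step k (1-indexed).
        m: running mean of derivatives, v: running mean of squared derivatives."""
        m = delta * m + (1.0 - delta) * grad           # first moment (momentum-like)
        v = gamma * v + (1.0 - gamma) * grad ** 2      # second moment (RMSprop-like)
        m_hat = m / (1.0 - delta ** k)                 # bias correction
        v_hat = v / (1.0 - gamma ** k)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # scaled, smoothed step
        return w, m, v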
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But come with other hyperparameters to be optimized
– Typical params (standard defaults):
• RMSProp: γ = 0.9, ε ≈ 10⁻⁸
• ADAM: δ = 0.9, γ = 0.999, ε ≈ 10⁻⁸

133
Poll 4 (@418)

Which of the following are true


 Vanilla SGD considers the long-term trends of gradients in update steps [false]
 Momentum methods consider the long-term average of derivatives to make updates
 RMSprop only considers the second order moment of derivatives, but not their
average trend, to make updates
 ADAM considers both the average trend and second moment of derivatives to make
updates
 Trend-based optimizers like momentum, RMSprop and ADAM are important to
smooth out the variance of SGD or minibatch updates

134
Visualizing the optimizers: Beale’s Function

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

136
Visualizing the optimizers: Long Valley

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

137
Visualizing the optimizers: Saddle Point

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

138
Story so far
• Gradient descent can be sped up by incremental
updates
– Convergence is guaranteed under most conditions
• Learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each
observation. Can be much faster than batch learning
– Mini-batch updates: update after batches. Can be more
efficient than SGD

• Convergence can be improved using smoothed updates


– RMSprop and more advanced techniques

139
