Lec 7 Optimization Part 2

The document discusses the training of neural networks, focusing on optimization techniques such as gradient descent and its variants, including incremental updates and stochastic gradient descent (SGD). It highlights the importance of minimizing error through empirical risk minimization and the challenges of conventional batch updates. Additionally, it addresses the risks of incremental learning and the necessity of adjusting learning rates for effective convergence.

Training Neural Networks:

Optimization
Intro to Deep Learning, Spring 2024

1
Recap
• Neural networks are universal approximators
• We must train them to approximate any
function
• Networks are trained to minimize total “error”
on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation
2
Recap
• Vanilla gradient descent may be too slow or unstable

• Better convergence can be obtained through


– Second order methods that normalize the variation across
dimensions
– Adaptive or decaying learning rates that can improve
convergence
– Methods like Rprop that decouple the dimensions can
improve convergence
– Momentum methods which emphasize directions of
steady improvement and deemphasize unstable directions
3
Moving on…
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

4
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

5
The training formulation

output (y)

Input (X)

• Given input-output pairs at a number of
locations, estimate the entire function
6
Gradient descent

• Start with an initial function


• Adjust its value at all points to make the outputs closer to the required
value
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the
training points

7
Effect of number of samples

• Problem with conventional gradient descent: we try to


simultaneously adjust the function at all training points
– We must process all training points before making a single
adjustment
– “Batch” update
13
Poll 1 (@415)

Select all that are true [all correct]


 The actual loss function we try to minimize requires batch updates
 Batch updates minimize the total loss over the entire training data
 Batch updates optimize the actual loss function
 Batch updates require processing the entire training data before we perform a single
update

14
Alternative: Incremental update

• Alternative: adjust the function at one training point at a time


– Keep adjustments small
– Eventually, when we have processed all the training points, we will
have adjusted the entire function
• With greater overall adjustment than we would if we made a single “Batch”
update

16
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– For all t = 1 : T
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η ∇_{W_k} Div(Y_t, d_t)^T
• Until Loss has converged

21
Incremental Updates
• The iterations can make multiple passes over
the data
• A single pass through the entire training data
is called an “epoch”
– An epoch over a training set with T samples
results in T updates of parameters

22
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:                                                    ← Over multiple epochs
– For all t = 1 : T                                      ← One epoch
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η ∇_{W_k} Div(Y_t, d_t)^T       ← One update
• Until Loss has converged

23
Caveats: order of presentation

• If we loop through the samples in the same


order, we may get cyclic behavior

24
Caveats: order of presentation

• If we loop through the samples in the same order,


we may get cyclic behavior
• We must go through them randomly to get more
convergent behavior
28
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For all t = 1 : T
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η ∇_{W_k} Div(Y_t, d_t)^T
• Until Loss has converged

33
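As a concrete illustration of the SGD loop above, the following is a minimal Python/NumPy sketch for a toy single-layer (linear-regression) model. The data, model, and learning rate here are hypothetical placeholders that only illustrate the structure of the update, not anything specific from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                      # 100 training inputs, 3 features
    d = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)   # targets

    W = np.zeros(3)      # initialize all weights
    eta = 0.01           # fixed (small) learning rate

    for epoch in range(50):                    # "Do ... until converged"
        order = rng.permutation(len(X))        # randomly permute the training pairs
        for t in order:                        # one update per training instance
            y_t = X[t] @ W                     # forward pass
            grad = 2 * (y_t - d[t]) * X[t]     # gradient of the squared divergence w.r.t. W
            W -= eta * grad                    # incremental update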
Story so far
• In any gradient descent optimization problem,
presenting training instances incrementally
can be more effective than presenting them
all at once
– Provided training instances are provided in
random order
– “Stochastic Gradient Descent”

• This also holds for training neural networks


34
Explanations and restrictions
• So why does this process of incremental
updates work?
• Under what conditions?

• For “why”: first consider a simplistic


explanation that’s often given
– Look at an extreme example

35
The expected behavior of the gradient
  ∇_W Loss(W) = (1/T) Σ_i ∇_W div(f(X_i; W), d_i)

• The individual training instances contribute different directions to the
overall gradient
– The final gradient is the average of the individual gradients
– It points towards the net direction
36
Extreme example

• Extreme instance of data clotting: all the


training instances are exactly the same
37
The expected behavior of the gradient
  ∇_W Loss(W) = (1/T) Σ_i ∇_W div(f(X_i; W), d_i) = ∇_W div(f(X; W), d)   (all instances identical)

• The individual training instances contribute identical
directions to the overall gradient
– The final gradient is simply the gradient for an individual
instance
38
Batch vs SGD

Batch SGD

• Batch gradient descent operates over T training instances


to get a single update
• SGD gets T updates for the same computation
39
Clumpy data..

• Also holds if all the data are not identical, but


are tightly clumped together
40
Clumpy data..

• As data get increasingly diverse, the benefits of incremental


updates decrease, but do not entirely vanish
41
When does it work
• What are the considerations?

• And how well does it work?

42
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Modelling problem: Find a linear regression line


(through origin) to model the data
– Find the line through origin that has the lowest overall
squared projection error w.r.t. data
43
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Modelling problem: Find a linear regression line


(through origin) to model the data
– Batch processing: Find the line through origin that has the
lowest overall squared projection error w.r.t. data
44
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Incremental learning: Update the model to always


minimize the error on the latest instance
– It will never converge
– Solution?
45
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Incremental learning: Update the model to always minimize the error on


the latest instance
– Shrink the learning rate with iterations
– With increasing iterations, it will swing less and less towards the new point

49
Incremental learning runs the risk of
always chasing the latest input

output (y)

Input (X)

• Incremental learning: Update the model to always minimize the error on the latest
instance
– Shrink the learning rate with iterations
– With increasing iterations, it will swing less and less towards the new point
– Eventually arriving at the correct solution and not moving much from it further because the
step sizes are now too small…
54
Incremental learning caveat: learning
rate

output (y)

Input (X)

• Incremental learning: Update the model to


always minimize the error on the latest instance
– Caveat: We must shrink the learning rate with
iterations for convergence
• Correction for individual instances with the eventual
minuscule learning rates will not modify the function
55
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)     ← Randomize input order
– For all t = 1 : T
  • j = j + 1
  • For every layer k:
    – Compute ∇_{W_k} Div(Y_t, d_t)
    – Update W_k = W_k - η_j ∇_{W_k} Div(Y_t, d_t)^T         ← Learning rate η_j reduces with j
• Until Loss has converged

56-57
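The shrinking learning rate η_j can be implemented as a simple schedule over the update counter j. The Python sketch below (with an illustrative base rate) decays the step size as 1/j, the schedule discussed on the next slide.

    def eta(j, eta0=0.1):
        """Step size for update j (1-indexed); decays as 1/j so that
        sum(eta) diverges while sum(eta**2) stays finite."""
        return eta0 / j

    # Inside the SGD loop above one would use:
    #   j += 1
    #   W -= eta(j) * grad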
SGD convergence
• SGD converges "almost surely" to a global or local minimum for most
functions
– Sufficient condition: step sizes follow the following conditions
(Robbins and Monro 1951)

  Σ_k η_k = ∞
  • Eventually the entire parameter space can be searched

  Σ_k η_k² < ∞
  • The steps shrink

– The fastest converging series that satisfies both above requirements is
  η_k ∝ 1/k
  • This is the optimal rate of shrinking the step size for strongly convex functions
– More generally, the learning rates are heuristically determined
• If the loss is convex, SGD converges to the optimal solution
• For non-convex losses SGD converges to a local minimum
58
SGD convergence
• We will define convergence in terms of the number of iterations taken to
get within ε of the optimal solution
– |f(W^(k)) - f(W*)| < ε

– Note: f() here is the optimization objective on the entire training data,
although SGD itself updates after every training instance

• Using the optimal learning rate 1/k, for strongly convex functions,
  f(W^(k)) - f(W*) = O((1/k)(f(W^(0)) - f(W*)))

– Strongly convex → Can be placed inside a quadratic bowl, touching at any point
– Giving us the iterations to convergence as O(1/ε)

• For generically convex (but not strongly convex) functions, various proofs
report a convergence of O(1/√k) using a learning rate of η_k ∝ 1/√k

59
Batch gradient convergence
• In contrast, using the batch update method, for strongly
convex functions,
  f(W^(k)) - f(W*) = O(c^k) for some c < 1
– Giving us the iterations to convergence as O(log(1/ε))

• For generic convex functions, iterations to convergence
is O(1/ε)

• Batch gradients converge "faster"
– But SGD performs T updates for every batch update
60
SGD Convergence: Loss value
If:
• f(W) is λ-strongly convex, and
• at step t we have a noisy estimate of the
subgradient ĝ_t with E[‖ĝ_t‖²] ≤ G² for all t,
• and we use step size η_t = 1/(λt)
Then for any T > 1:
  E[f(W_T) - f(W*)] ≤ 17 G² (1 + log T) / (λ T)
61
SGD Convergence
• We can bound the expected difference between the
loss over our data using the optimal weights W* and
the weights W_T at any single iteration to O(log T / T) for
strongly convex loss or O(log T / √T) for convex loss

• Averaging schemes can improve the bound to O(1/T)
and O(1/√T)

• Smoothness of the loss is not required

62
SGD Convergence and weight averaging
Polynomial Decay Averaging:
  W̄_T = (1 - (γ+1)/(T+γ)) W̄_{T-1} + ((γ+1)/(T+γ)) W_T
With γ some small positive constant, e.g. γ = 3

Achieves O(1/T) (strongly convex) and
O(1/√T) (convex) convergence

63
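A sketch of how the polynomial-decay averaging recursion could be implemented in Python, assuming the reconstructed form written above; the constant gamma=3 is only an example value.

    def polynomial_decay_average(w_bar, w_t, t, gamma=3.0):
        """One step of polynomial-decay averaging of SGD iterates.
        w_bar: running average after t-1 steps; w_t: current iterate (step t >= 1)."""
        alpha = (gamma + 1.0) / (t + gamma)
        return (1.0 - alpha) * w_bar + alpha * w_t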
SGD example

• A simpler problem: K-means


• Note: SGD converges faster
– But to a poorer minimum
• Also note the rather large variation between runs
– Let's try to understand these results..
64
Poll 2 (@416)

Select all that are true [all correct]


 SGD is an online version of batch updates
 SGD can have oscillatory behavior if we do not randomize the order of the inputs
 SGD can converge faster than batch updates, but arrive at poorer optima
 SGD convergence to the global optimum can only be guaranteed if step sizes shrink
across iterations, but sum to infinity in the limit

65
Recall: Modelling a function

• To learn a network to model a function we


minimize the expected divergence

67
Recall: The Empirical risk

di
Xi

• In practice, we minimize the empirical risk (or loss)

  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)

• The expected value of the empirical risk is actually the expected divergence
  E[Loss(W)] = E[div(f(X; W), g(X))]
68
Recall: The Empirical risk

di
Xi

• In practice, we minimize the empirical risk (or loss)

  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)

The empirical risk is an unbiased estimate of the expected divergence
Though there is no guarantee that minimizing it will minimize the
expected divergence
• The expected value of the empirical risk is actually the expected divergence
  E[Loss(W)] = E[div(f(X; W), g(X))]
69
Recall: The Empirical risk

di
Xi

The variance of the empirical risk: var(Loss) = (1/N) var(div)
The variance of the estimator is proportional to 1/N
The larger this variance, the greater the likelihood that the W that
minimizes the empirical risk will differ significantly from the W that
minimizes the expected divergence

• In practice, we minimize the empirical risk

  Loss(W) = (1/N) Σ_i div(f(X_i; W), d_i)
  Ŵ = argmin_W Loss(W)

The empirical risk is an unbiased estimate of the expected divergence
Though there is no guarantee that minimizing it will minimize the
expected divergence
• The expected value of the empirical risk is actually the expected divergence
  E[Loss(W)] = E[div(f(X; W), g(X))]
70
SGD

di

Xi

• At each iteration, SGD focuses on the divergence


of a single sample
• The expected value of the sample error is still the
expected divergence 71
SGD

di

Xi

The sample divergence is also an unbiased estimate of the expected error

• At each iteration, SGD focuses on the divergence


of a single sample
• The expected value of the sample error is still the
expected divergence 72
SGD

di

Xi
The variance of the sample divergence is the variance of the divergence itself:
var(div). This is N times the variance of the empirical average minimized by
batch update
The sample divergence is also an unbiased estimate of the expected error

• At each iteration, SGD focuses on the divergence


of a single sample
• The expected value of the sample error is still the
expected divergence 73
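The variance argument can be checked numerically. The following Python sketch simulates per-sample divergences from an arbitrary, purely illustrative distribution and compares a single-sample estimate of the expected divergence against an N-sample empirical average; both are unbiased, but the single-sample estimate has roughly N times the variance.

    import numpy as np

    rng = np.random.default_rng(1)
    divs = rng.exponential(scale=2.0, size=(10000, 100))   # simulated per-sample divergences

    single_sample = divs[:, 0]          # SGD-style estimate: one sample per estimate
    full_average  = divs.mean(axis=1)   # batch-style estimate: average over N = 100 samples

    print(single_sample.mean(), full_average.mean())   # both approx. 2.0 (unbiased)
    print(single_sample.var() / full_average.var())    # approx. 100 = N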
Explaining the variance

• The blue curve is the function being approximated


• The red curve is the approximation by the model at a given W
• The heights of the shaded regions represent the point-by-point error
– The divergence is a function of the error
– We want to find the W that minimizes the average divergence

74
Explaining the variance

• Sample estimate approximates the shaded area with the
average length of the error lines
• Variance: The spread between the different curves is the
variance
75
Explaining the variance

• Sample estimate approximates the shaded area


with the average length of the error lines
• This average length will change with position of
the samples 76
Explaining the variance

• Having more samples makes the estimate more


robust to changes in the position of samples
– The variance of the estimate is smaller
78
Explaining the variance
With only one sample

• Having very few samples makes the estimate


swing wildly with the sample position
– Since our estimator learns the W to minimize this
estimate, the learned W too can swing wildly 79
SGD example

• A simpler problem: K-means


• Note: SGD converges faster
• But also has large variation between runs 82
SGD vs batch
• SGD uses the gradient from only one sample
at a time, and is consequently high variance

• But also provides significantly quicker updates


than batch
• Is there a good medium?

83
Alternative: Mini-batch update

• Alternative: adjust the function at a small, randomly chosen subset of


points
– Keep adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
• As before, vary the subsets randomly in different passes through the
training data

84
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T                            ← b is the mini-batch size
  • j = j + 1
  • For every layer k:
    – ΔW_k = 0
  • For t' = t : t + b - 1
    – For every layer k:
      » Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
      » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t'}, d_{t'})
  • Update
    – For every layer k:
      W_k = W_k - η_j ΔW_k^T                   ← shrinking step size η_j
• Until Loss has converged
88-89
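A minimal Python/NumPy sketch of the mini-batch loop above, again for a toy linear model; the batch size, base learning rate, and epoch count are illustrative placeholders.

    import numpy as np

    def minibatch_sgd(X, d, b=16, eta0=0.05, epochs=50):
        rng = np.random.default_rng(0)
        W = np.zeros(X.shape[1])
        j = 0                                    # update counter
        for _ in range(epochs):
            order = rng.permutation(len(X))      # randomly permute the data
            for start in range(0, len(X), b):    # one mini-batch per update
                j += 1
                idx = order[start:start + b]
                y = X[idx] @ W
                grad = 2 * (y - d[idx]) @ X[idx] / len(idx)   # batch-averaged gradient
                W -= (eta0 / j) * grad           # shrinking step size
        return W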


Mini Batches

di

Xi

• Mini-batch updates compute and minimize a batch loss

• The expected value of the batch loss is also the expected divergence

90
Mini Batches

di

Xi

The minibatch loss is also an unbiased estimate of the expected loss


• Mini-batch updates compute and minimize a batch loss

• The expected value of the batch loss is also the expected divergence

91
Mini Batches

di

Xi
The variance of the minibatch loss: var(BatchLoss) = 1/b var(div)
This will be much smaller than the variance of the sample error in SGD
The minibatch loss is also an unbiased estimate of the expected error
• Mini-batch updates compute and minimize a batch loss

• The expected value of the batch loss is also the expected divergence

92
Minibatch convergence
• For convex functions, convergence rate for SGD is O(1/√k)
• For mini-batch updates with batches of size b, the
convergence rate is O(1/√(bk) + 1/k)
– Apparently an improvement of √b over SGD
– But since the batch size is b, we perform b times as many
computations per iteration as SGD
– We actually get a degradation of √b

• However, in practice
– The objectives are generally not convex; mini-batches are more
effective with the right learning rates
– We also get additional benefits of vector processing
93
SGD example

• Mini-batch performs comparably to batch


training on this simple problem
– But converges orders of magnitude faster
94
Measuring Loss
• Convergence is generally
defined in terms of the
overall training loss
– Not sample or batch loss

• Infeasible to actually measure the overall training loss


after each iteration
• More typically, we estimate it as
– Divergence or classification error on a held-out set
– Average sample/batch loss over the past several
samples/batches
95
Training and minibatches
• In practice, training is usually performed using mini-
batches
– The mini-batch size is generally set to the largest that your
hardware will support (in memory) without compromising
overall compute time
• Larger minibatches = less variance
• Larger minibatches = fewer updates per epoch

• Convergence depends on learning rate


– Simple technique: fix learning rate until the error plateaus,
then reduce learning rate by a fixed factor (e.g. 10)
– Advanced methods: Adaptive updates, where the learning
rate is itself determined as part of the estimation
96
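The "fix the learning rate until the error plateaus, then reduce it by a fixed factor" heuristic can be sketched as below; the factor, patience, and tolerance values are illustrative, and in practice one would also reset the patience window after each reduction.

    def step_decay_on_plateau(eta, history, factor=10.0, patience=3, tol=1e-4):
        """Divide the learning rate by `factor` if the monitored error (history of
        validation or running training loss) has not improved by more than `tol`
        relative to the best value seen before the last `patience` measurements."""
        if len(history) > patience and history[-1] > min(history[:-patience]) - tol:
            return eta / factor
        return eta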
Poll 3 (@417)

Select all that are true


 Minibatch descent is an online version of batch updates
 Minibatch descent is faster than SGD when the batch size is 1 [false]
 The variance of minibatch updates decreases with batch size
 Minibatch gradient approaches batch updates in variance, but SGD in efficiency
when we use vector processing and large batches

97
Story so far
• SGD: Presenting training instances one-at-a-time can be more effective
than full-batch training
– Provided they are presented in random order

• For SGD to converge, the learning rate must shrink sufficiently rapidly with
iterations
– Otherwise the learning will continuously “chase” the latest sample

• SGD estimates have higher variance than batch estimates

• Minibatch updates operate on batches of instances at a time


– Estimates have lower variance than SGD
– Convergence rate is theoretically worse than SGD
– But we compensate by being able to perform batch processing

99
Training and minibatches
• Convergence depends on learning rate
– Simple technique: fix learning rate until the error
plateaus, then reduce learning rate by a fixed
factor (e.g. 10)
– Advanced methods: Adaptive updates, where the
learning rate is itself determined as part of the
estimation

100
Moving on: Topics for the day
• Incremental updates
• Revisiting “trend” algorithms
• Generalization
• Tricks of the trade
– Divergences..
– Activations
– Normalizations

101
Recall: Momentum Update
Plain gradient update vs. with momentum (figure)

• The momentum method maintains a running average of all gradients until
the current step
  ΔW^(k) = β ΔW^(k-1) - η ∇_W Loss(W^(k-1))^T
  W^(k) = W^(k-1) + ΔW^(k)
– Typical β value is 0.9

• The running average steps
– Get longer in directions where gradient retains the same sign
– Become shorter in directions where the sign keeps flipping
102
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First computes the gradient step at the current location
– Then adds in the historical average step

103
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First compute the gradient step at the current location

– Then adds in the historical average step


104
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First compute the gradient step at the current location
– Then add in the scaled previous step
• Which is actually a running average
105
Recall: Momentum Update

• The momentum method

• At any iteration, to compute the current step:


– First compute the gradient step at the current location
– Then add in the scaled previous step
• Which is actually a running average
– To get the final step
106
Momentum update
1

• Momentum update steps are actually computed in two stages


– First: We take a step against the gradient at the current location
– Second: Then we add a scaled version of the previous step

• The procedure can be made more optimal by reversing the order of


operations..

107
Nesterov's Accelerated Gradient

• Change the order of operations


• At any iteration, to compute the current step:
– First extend by the (scaled) historical average
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
108
Nesterov's Accelerated Gradient

• Change the order of operations


• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
109
Nesterov's Accelerated Gradient

• Change the order of operations


• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient step at the resultant
position
– Add the two to obtain the final step
110
Nesterov's Accelerated Gradient

• Nesterov's method

112
Nesterov's Accelerated Gradient

• Comparison with momentum (example from


Hinton)
• Converges much faster
113
Momentum and incremental updates

SGD instance
or minibatch
loss
• The momentum method

• Incremental SGD and mini-batch gradients tend to have


high variance
• Momentum smooths out the variations
– Smoother and faster convergence
114
Momentum: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
  • j = j + 1
  • For every layer k:
    – ∇_{W_k} Loss = 0
  • For t' = t : t + b - 1
    – For every layer k:
      » Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
      » ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t'}, d_{t'})
  • Update
    – For every layer k:
      ΔW_k = β ΔW_k - η_j (∇_{W_k} Loss)^T
      W_k = W_k + ΔW_k
• Until Loss has converged
115
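A compact Python sketch of the momentum step used inside the loop above; the β and η values are illustrative, and `grad` stands for the accumulated mini-batch gradient of the loss with respect to W.

    def momentum_step(W, dW, grad, eta=0.01, beta=0.9):
        """One momentum update. `dW` is the running update (velocity)
        carried between calls; `grad` is the current mini-batch gradient."""
        dW = beta * dW - eta * grad     # running average of steps
        W = W + dW                      # apply the smoothed step
        return W, dW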
Nesterov's Accelerated Gradient

• At any iteration, to compute the current step:


– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
• This also applies directly to incremental update methods
– The accelerated gradient smooths out the variance in the
gradients
116
Nesterov's Accelerated Gradient

SGD instance
or minibatch
loss
• Nesterov's method
  ΔW^(k) = β ΔW^(k-1) - η ∇_W Loss(W^(k-1) + β ΔW^(k-1))^T
  W^(k) = W^(k-1) + ΔW^(k)

117
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
  • j = j + 1
  • For every layer k:
    – W_k = W_k + β ΔW_k
    – ∇_{W_k} Loss = 0
  • For t' = t : t + b - 1
    – For every layer k:
      » Compute ∇_{W_k} Div(Y_{t'}, d_{t'})
      » ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t'}, d_{t'})
  • Update
    – For every layer k:
      W_k = W_k - η_j (∇_{W_k} Loss)^T
      ΔW_k = β ΔW_k - η_j (∇_{W_k} Loss)^T
• Until Loss has converged

118
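The corresponding Nesterov step, sketched in Python; `grad_fn` is a hypothetical callable that returns the gradient of the loss at a given W, and the hyperparameter values are illustrative.

    def nesterov_step(W, dW, grad_fn, eta=0.01, beta=0.9):
        """One Nesterov accelerated step. `grad_fn(W)` returns dLoss/dW."""
        lookahead = W + beta * dW     # first extend the previous step
        g = grad_fn(lookahead)        # gradient at the look-ahead position
        W = lookahead - eta * g       # then take the gradient step from there
        dW = beta * dW - eta * g      # running update for the next iteration
        return W, dW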
The other term in the update
• Standard gradient descent rule
  W^(k) = W^(k-1) - η ∇_W Loss(W^(k-1))^T
• Gradient descent invokes two terms for updates
– The derivative ∇_W Loss
– and the learning rate η
• Momentum methods fix the derivative term to reduce
unstable oscillation
• What about the learning rate term?
119
Adjusting the learning rate
Sequence of gradients (figure)
With separate learning rates in each direction, which should
have the lowest learning rate in the vertical direction?

• Have separate learning rates for each component


• Directions in which the derivatives swing more should likely have lower
learning rates
– Is likely indicative of more wildly swinging behavior

• Directions of greater swing are indicated by total movement


– Direction of greater movement should have lower learning rate
122
Smoothing the trajectory
Step | X component | Y component
  1  |      1      |    +2.5
  2  |      1      |    -3
  3  |      2      |    +2.5
  4  |      1      |    -2
  5  |     1.5     |    1.5

• Observation: Steps in “oscillatory” directions show large total movement


– In the example, total motion in the vertical direction is much greater than in
the horizontal direction

• Solution: Lower learning rate in the vertical direction than in the


horizontal direction
– Based on total motion
– As quantified by RMS value 123
RMS Prop
• Notation:
– Formulae are by parameter w
– Derivative of loss w.r.t any individual parameter w is shown as ∂_w D
  • D is the batch or minibatch loss, or individual divergence for batch/minibatch/SGD
– The squared derivative is ∂²_w D = (∂_w D)²
  • The short-hand notation represents the squared derivative, not the second derivative
– The mean squared derivative is a running estimate of the average squared
derivative. We will show this as E[∂²_w D]

• Modified update rule: We want to
– scale down learning rates for terms with large mean squared derivatives
– scale up learning rates for terms with small mean squared derivatives

124
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm

• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale learning rate of the parameter by the inverse of the root
mean squared derivative

125
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm

• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale learning rate of the parameter by the inverse of the root
mean squared derivative

Note similarity to RPROP


The magnitude of the derivative is being normalized out
126
RMS Prop (updates are for each weight of each layer)
• Do:
– Randomly shuffle inputs to change their order
– Initialize: k = 1; for all weights w in all layers, E[∂²_w D]_0 = 0
– For all t = 1 : b : T (incrementing in blocks of b inputs)
  • For all weights in all layers initialize ∂_w D = 0
  • For b' = 0 : b - 1
    – Compute
      » Output Y(X_{t+b'})
      » Compute gradient dDiv(Y(X_{t+b'}), d_{t+b'}) / dw
      » Compute ∂_w D += dDiv(Y(X_{t+b'}), d_{t+b'}) / dw
  • Update: for all weights w
      E[∂²_w D]_k = γ E[∂²_w D]_{k-1} + (1 - γ)(∂_w D)²_k
      w_{k+1} = w_k - (η / √(E[∂²_w D]_k + ε)) ∂_w D
      k = k + 1
• Until loss has converged
Typical value: γ ≈ 0.9
127
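A per-parameter Python/NumPy sketch of the RMSprop update above; the default values shown are common choices and are illustrative here, not prescribed by the slide.

    import numpy as np

    def rmsprop_step(w, grad, ms, eta=0.001, gamma=0.9, eps=1e-8):
        """One RMSprop update. `ms` is the running mean squared derivative E[dw^2]."""
        ms = gamma * ms + (1.0 - gamma) * grad ** 2    # update running estimate
        w = w - eta * grad / np.sqrt(ms + eps)         # scale the step by 1/RMS
        return w, ms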
All the terms in gradient descent
• Standard gradient descent rule

• RMSprop only adapts the learning rate


– by total movement
• Momentum only smooths the gradient

• How about combining both?


128
ADAM: RMSprop with momentum
• RMS prop only adapts the learning rate
• Momentum only smooths the gradient
• ADAM combines the two

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each
parameter
– Learning rate is proportional to the inverse of the root mean squared
derivative

130
ADAM: RMSprop with momentum
• RMSprop only adapts the learning rate
• Momentum only smooths the gradient
• ADAM combines the two

• Procedure:
– Maintain a running estimate of the mean derivative for each parameter:
  m_k = δ m_{k-1} + (1 - δ)(∂_w D)_k
– Maintain a running estimate of the mean squared value of derivatives for each parameter:
  v_k = γ v_{k-1} + (1 - γ)(∂_w D)²_k
– Bias-correct both running estimates, which ensures that the δ and γ terms do not
dominate in early iterations:
  m̂_k = m_k / (1 - δ^k),   v̂_k = v_k / (1 - γ^k)
– Learning rate is scaled by the inverse of the root mean squared derivative:
  w_{k+1} = w_k - (η / (√v̂_k + ε)) m̂_k
131

ADAM: RMSprop with momentum
• RMSprop only considers a second-moment-normalized version of the
current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
• Typically m_0 = 0 and δ is close to 1, so without the (1 - δ^k) denominator term
m_k will stay close to 0 for a long time, resulting in minimal parameter updates
– The denominator term ensures that updates actually happen
– For large k, the denominator just becomes 1
132
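A Python/NumPy sketch of one ADAM step, including the bias correction discussed above; the hyperparameter values are the commonly used defaults and are shown only for illustration.

    import numpy as np

    def adam_step(w, grad, m, v, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
        """One ADAM update for step k (1-indexed).
        m: running mean of derivatives, v: running mean of squared derivatives."""
        m = delta * m + (1.0 - delta) * grad           # first moment (momentum-like)
        v = gamma * v + (1.0 - gamma) * grad ** 2      # second moment (RMSprop-like)
        m_hat = m / (1.0 - delta ** k)                 # bias correction
        v_hat = v / (1.0 - gamma ** k)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # scaled, smoothed step
        return w, m, v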
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But come with other hyperparameters to be optimized
– Typical params (standard defaults):
• RMSProp: γ = 0.9, ε ≈ 10⁻⁸
• ADAM: δ = 0.9, γ = 0.999, ε ≈ 10⁻⁸

133
Poll 4 (@418)

Which of the following are true


 Vanilla SGD considers the long-term trends of gradients in update steps [false]
 Momentum methods consider the long-term average of derivatives to make updates
 RMSprop only considers the second order moment of derivatives, but not their
average trend, to make updates
 ADAM considers both the average trend and second moment of derivatives to make
updates
 Trend-based optimizers like momentum, RMSprop and ADAM are important to
smooth out the variance of SGD or minibatch updates

134
Visualizing the optimizers: Beale’s Function

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

136
Visualizing the optimizers: Long Valley

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

137
Visualizing the optimizers: Saddle Point

• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

138
Story so far
• Gradient descent can be sped up by incremental
updates
– Convergence is guaranteed under most conditions
• Learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each
observation. Can be much faster than batch learning
– Mini-batch updates: update after batches. Can be more
efficient than SGD

• Convergence can be improved using smoothed updates


– RMSprop and more advanced techniques

139
