Neural Networks for Machine Learning: Lecture 6a, Overview of Mini-Batch Gradient Descent


Neural Networks for Machine Learning

Lecture 6a
Overview of mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

Reminder: The error surface for a linear neuron

The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
For a linear neuron with a squared error, it is a quadratic bowl.
Vertical cross-sections are parabolas.
Horizontal cross-sections are ellipses.
For multi-layer, non-linear nets the error surface is much more complicated.
But locally, a piece of a quadratic bowl is usually a very good approximation.
[Figure: a quadratic bowl over the weight axes w1 and w2, with vertical axis E.]

Convergence speed of full batch learning when the error surface is a quadratic bowl

Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
The gradient is big in the direction in which we only want to travel a small distance.
The gradient is small in the direction in which we want to travel a large distance.
Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.

How the learning goes wrong

If the learning rate is big, the weights slosh to and fro across the ravine.
If the learning rate is too big, this oscillation diverges.
What we would like to achieve:
Move quickly in directions with small but consistent gradients.
Move slowly in directions with big but inconsistent gradients.
[Figure: the error E oscillating across the walls of a ravine as a weight w changes.]

Stochastic gradient descent

If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
The extreme version of this approach updates the weights after each case. It's called "online" learning.

Mini-batches are usually better than online learning.
Less computation is used updating the weights.
Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs.
Mini-batches need to be balanced for classes.
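To make this concrete, here is a minimal sketch of mini-batch gradient descent for a linear neuron with squared error; it is my own illustration, not code from the lecture, and the function name, batch size, and learning rate are arbitrary assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent for a linear neuron with squared error (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(scale=0.01, size=d)              # small random initial weights
    for epoch in range(epochs):
        order = rng.permutation(n)                  # reshuffle so batches stay balanced on average
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]          # errors on this mini-batch only
            grad = X[idx].T @ residual / len(idx)   # one efficient matrix multiply
            w -= lr * grad                          # update before seeing the rest of the data
    return w
```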

Two types of learning algorithm

If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of adaptation.

For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
The mini-batches may need to be quite big when adapting fancy methods.
Big mini-batches are more computationally efficient.

A basic mini-batch gradient descent algorithm

Guess an initial learning rate.
If the error keeps getting worse or oscillates wildly, reduce the learning rate.
If the error is falling fairly consistently but slowly, increase the learning rate.
Write a simple program to automate this way of adjusting the learning rate.
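The "simple program" the slide has in mind could look something like the sketch below; the specific thresholds and multipliers are illustrative assumptions, not values from the lecture.

```python
def adjust_learning_rate(lr, prev_error, error):
    """Crude automation of the slide's hand-tuning rule (all constants are assumptions)."""
    if error > prev_error:            # error got worse or is oscillating: back off
        return lr * 0.7
    if error > 0.999 * prev_error:    # falling, but only just: speed up a little
        return lr * 1.1
    return lr                         # falling fairly consistently: leave it alone
```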

Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
This removes fluctuations in the final weights caused by the variations between mini-batches.
Turn down the learning rate when the error stops decreasing.
Use the error on a separate validation set.

Neural Networks for Machine Learning

Lecture 6b
A bag of tricks for mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

Be careful about turning down the learning rate

Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.
So we get a quick win.
But then we get slower learning.
Don't turn down the learning rate too soon!
[Figure: error versus epoch; when the learning rate is reduced, the error drops quickly, then decreases more slowly than before.]

Initializing the weights

If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
So they can never learn to be different features.
We break symmetry by initializing the weights to have small random values.

If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
We generally want smaller incoming weights when the fan-in is big, so initialize the weights with a scale proportional to 1/sqrt(fan-in).
We can also scale the learning rate the same way.
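A minimal sketch of fan-in-scaled initialization for one fully connected layer, assuming Gaussian draws; the helper name and the use of numpy are my own choices.

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Small random weights whose scale shrinks as 1/sqrt(fan_in) to limit overshoot."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```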

Shifting the inputs

When using steepest descent, shifting the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
The hyperbolic tangent (which is 2*logistic - 1) produces hidden activations that are roughly zero mean.
In this respect it's better than the logistic.
[Figure: two training cases for a neuron with weights w1 and w2; color indicates the training case. Inputs (101, 101) → 2 and (101, 99) → 0 give a highly elongated error surface; the shifted inputs (1, 1) → 2 and (1, -1) → 0 give a nearly circular one.]

Scaling the inputs

When using steepest descent, scaling the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.
[Figure: two training cases for a neuron with weights w1 and w2; color indicates the weight axis. Inputs (0.1, 10) → 2 and (0.1, -10) → 0 give an error surface stretched along one weight axis; the rescaled inputs (1, 1) → 2 and (1, -1) → 0 give a nearly circular one.]
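Shifting and scaling together amount to standardizing each input component, as in this sketch; the function name and the epsilon guard are my additions.

```python
import numpy as np

def standardize(X_train, X_test):
    """Give every input component zero mean and unit variance over the training set."""
    mu = X_train.mean(axis=0)            # per-component mean (shifting)
    sigma = X_train.std(axis=0) + 1e-8   # per-component std (scaling); epsilon avoids 0/0
    return (X_train - mu) / sigma, (X_test - mu) / sigma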

A more thorough method: Decorrelate the input components

For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.
There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.
Drop the principal components with the smallest eigenvalues.
This achieves some dimensionality reduction.
Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis-aligned elliptical error surface into a circular one.
For a circular error surface, the gradient points straight towards the minimum.
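A sketch of this PCA whitening recipe, assuming numpy and a plain eigendecomposition of the covariance matrix; the function name and the small epsilon are my own.

```python
import numpy as np

def pca_whiten(X, n_components):
    """Decorrelate inputs with PCA, drop small-eigenvalue directions, then whiten."""
    Xc = X - X.mean(axis=0)                       # PCA needs zero-mean data
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]  # keep the largest components
    V, lam = eigvecs[:, keep], eigvals[keep]
    return (Xc @ V) / np.sqrt(lam + 1e-8)         # unit variance along every kept axis
```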

Common problems that occur in multilayer networks

If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
The error derivatives for the hidden units will all become tiny and the error will not decrease.
This is usually a plateau, but people often mistake it for a local minimum.

In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.
The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
This is another plateau that looks like a local minimum.

Four ways to speed up mini-batch learning

Use momentum.
Instead of using the gradient to change the position of the weight "particle", use it to change the velocity.
Use separate adaptive learning rates for each parameter.
Slowly adjust the rate using the consistency of the gradient for that parameter.
rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
This is the mini-batch version of just using the sign of the gradient.
Take a fancy method from the optimization literature that makes use of curvature information (not this lecture).
Adapt it to work for neural nets.
Adapt it to work for mini-batches.

Neural Networks for Machine Learning

Lecture 6c
The momentum method
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

The intuition behind the momentum method

Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
Its momentum makes it keep going in the previous direction.
It damps oscillations in directions of high curvature by combining gradients with opposite signs.
It builds up speed in directions with a gentle but consistent gradient.

The equations of the momentum method

$$v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$

$$\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t) = \alpha\, \Delta w(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$

The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$, which is slightly less than 1.
The weight change is equal to the current velocity.
The weight change can be expressed in terms of the previous weight change and the current gradient.
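In code, the update is two lines. This sketch uses the same symbols as the equations above, with lr standing in for the learning rate $\varepsilon$ and alpha for the momentum; the default values are placeholders.

```python
def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """One step of classical momentum."""
    v = alpha * v - lr * grad   # v(t) = alpha * v(t-1) - lr * dE/dw(t)
    w = w + v                   # the weight change equals the current velocity
    return w, v
```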

The behavior of the momentum method

If the error surface is a tilted plane, the ball reaches a terminal velocity.
If the momentum is close to 1, this is much faster than simple gradient descent:

$$v(\infty) = \frac{1}{1-\alpha}\left(-\varepsilon\, \frac{\partial E}{\partial w}\right)$$

At the beginning of learning there may be very large gradients.
So it pays to use a small momentum (e.g. 0.5).
Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).
This allows us to learn at a rate that would cause divergent oscillations without the momentum.

A better type of momentum (Nesterov 1983)

The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
Ilya Sutskever (2012, unpublished) suggested a new form of momentum that often works better.
It is inspired by the Nesterov method for optimizing convex functions.

First make a big jump in the direction of the previous accumulated gradient.
Then measure the gradient where you end up and make a correction.
It's better to correct a mistake after you have made it!

A picture of the Nesterov method

First make a big jump in the direction of the previous accumulated gradient.
Then measure the gradient where you end up and make a correction.
[Figure: brown vector = jump, red vector = correction, green vector = accumulated gradient; blue vectors = standard momentum.]
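A sketch of this jump-then-correct ordering; grad_fn is an assumed callable that returns $\partial E/\partial w$ at a given point, and the hyperparameter defaults are placeholders.

```python
def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    """Nesterov-style momentum: jump first, then correct."""
    g = grad_fn(w + alpha * v)   # measure the gradient where the jump lands
    v = alpha * v - lr * g       # fold the correction into the velocity
    return w + v, v              # move by the corrected velocity
```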

Neural Networks for Machine Learning

Lecture 6d
A separate, adaptive learning rate for each connection
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

The intuition behind separate adaptive learning rates

In a multilayer net, the appropriate learning rates can vary widely between weights:
The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
The fan-in of a unit determines the size of the overshoot effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.
So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
Gradients can get very small in the early layers of very deep nets.
The fan-in often varies widely between layers.

One way to determine the individual learning rates

Start with a local gain of 1 for every weight.
Increase the local gain if the gradient for that weight does not change sign.
Use small additive increases and multiplicative decreases (for mini-batch).
This ensures that big gains decay rapidly when oscillations start.
If the gradient is totally random, the gain will hover around 1 when we increase by plus $\delta$ half the time and decrease by times $1-\delta$ half the time.

$$\Delta w_{ij} = -\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$

$$\text{if}\ \left(\frac{\partial E}{\partial w_{ij}}(t)\,\frac{\partial E}{\partial w_{ij}}(t-1)\right) > 0\ \text{then}\ g_{ij}(t) = g_{ij}(t-1) + .05,\ \text{else}\ g_{ij}(t) = g_{ij}(t-1) \times .95$$
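A vectorized sketch of this gain update, assuming numpy arrays of per-weight gains and gradients; the clipping range follows the next slide, and the function name is my own.

```python
import numpy as np

def update_gains(gains, grad, prev_grad, delta=0.05, lo=0.1, hi=10.0):
    """Additive increase, multiplicative decrease of per-weight local gains."""
    same_sign = grad * prev_grad > 0    # gradient kept its sign for this weight
    gains = np.where(same_sign, gains + delta, gains * (1.0 - delta))
    return np.clip(gains, lo, hi)       # keep gains in a reasonable range

# the update for each weight is then: w -= lr * gains * grad
```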

Tricks for making adaptive learning rates work better

Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].
Use full batch learning or big mini-batches.
This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
Adaptive learning rates can be combined with momentum.
Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
Adaptive learning rates only deal with axis-aligned effects.
Momentum does not care about the alignment of the axes.

Neural Networks for Machine Learning

Lecture 6e
rmsprop: Divide the gradient by a running average of its recent magnitude
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

rprop: Using only the sign of the gradient

The magnitude of the gradient can be very different for different weights and can change during learning.
This makes it hard to choose a single global learning rate.
For full batch learning, we can deal with this variation by only using the sign of the gradient.
The weight updates are all of the same magnitude.
This escapes from plateaus with tiny gradients quickly.

rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
Otherwise decrease the step size multiplicatively (e.g. times 0.5).
Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice).
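A sketch of one full-batch rprop step under these rules, assuming numpy arrays of per-weight step sizes; the function name and default multipliers (1.2 and 0.5, as in the slide's examples) are my own packaging.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5,
               step_min=1e-6, step_max=50.0):
    """Full-batch rprop: per-weight step sizes, moves use only the gradient's sign."""
    agree = grad * prev_grad > 0        # did the last two gradients agree in sign?
    step = np.clip(np.where(agree, step * up, step * down), step_min, step_max)
    return w - np.sign(grad) * step, step
```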

Why rprop does not work with mini-batches

The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.
Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
We want this weight to stay roughly where it is.

rprop would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
So the weight would grow a lot.
Is there a way to combine:
The robustness of rprop.
The efficiency of mini-batches.
The effective averaging of gradients over mini-batches.

rmsprop: A mini-batch version of rprop

rprop is equivalent to using the gradient but also dividing by the size of the gradient.
The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
rmsprop: Keep a moving average of the squared gradient for each weight:

$$\text{MeanSquare}(w, t) = 0.9\,\text{MeanSquare}(w, t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

Dividing the gradient by $\sqrt{\text{MeanSquare}(w, t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).
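A minimal sketch of the resulting update; the 0.9/0.1 averaging follows the slide, while the learning rate and the epsilon guard against division by zero are my assumptions.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, lr=0.001, eps=1e-8):
    """rmsprop: divide each gradient by a running RMS of its recent magnitudes."""
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2     # the moving average above
    return w - lr * grad / (np.sqrt(mean_square) + eps), mean_square
```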

Further developments of rmsprop

Combining rmsprop with standard momentum:
Momentum does not help as much as it normally does. Needs more investigation.
Combining rmsprop with Nesterov momentum (Sutskever 2012):
It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.
Combining rmsprop with adaptive learning rates for each connection:
Needs more investigation.
Other methods related to rmsprop:
Yann LeCun's group has a fancy version in "No more pesky learning rates".

Summary of learning methods for neural networks

For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
Conjugate gradient, LBFGS ...
Adaptive learning rates, rprop ...
For big, redundant datasets use mini-batches.
Try gradient descent with momentum.
Try rmsprop (with momentum?).
Try LeCun's latest recipe.
Why there is no simple recipe:
Neural nets differ a lot:
Very deep nets (especially ones with narrow bottlenecks).
Recurrent nets.
Wide shallow nets.
Tasks differ a lot:
Some require very accurate weights, some don't.
Some have many very rare cases (e.g. words).
