Neural Networks for Machine Learning: Lecture 6a, Overview of Mini-Batch Gradient Descent


Neural Networks for Machine Learning

Lecture 6a
Overview of mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

Reminder: The error surface for a linear neuron

The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
For a linear neuron with a squared error, it is a quadratic bowl.
Vertical cross-sections are parabolas.
Horizontal cross-sections are ellipses.
For multi-layer, non-linear nets the error surface is much more complicated.
But locally, a piece of a quadratic bowl is usually a very good approximation.
[Figure: a quadratic bowl over the weight axes w1 and w2, with vertical axis E.]

Convergence speed of full batch learning when the error surface is a quadratic bowl

Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
The gradient is big in the direction in which we only want to travel a small distance.
The gradient is small in the direction in which we want to travel a large distance.
Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.

How the learning goes wrong

If the learning rate is big, the weights slosh to and fro across the ravine.
If the learning rate is too big, this oscillation diverges.
What we would like to achieve:
Move quickly in directions with small but consistent gradients.
Move slowly in directions with big but inconsistent gradients.
[Figure: the error E oscillating across the walls of a ravine as a weight w changes.]

Stochastic gradient descent

If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
The extreme version of this approach updates the weights after each case. It's called "online" learning.

Mini-batches are usually better than online learning.
Less computation is used updating the weights.
Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs.
Mini-batches need to be balanced for classes.
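To make this concrete, here is a minimal sketch of mini-batch gradient descent for a linear neuron with squared error; it is my own illustration, not code from the lecture, and the function name, batch size, and learning rate are arbitrary assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent for a linear neuron with squared error (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(scale=0.01, size=d)              # small random initial weights
    for epoch in range(epochs):
        order = rng.permutation(n)                  # reshuffle so batches stay balanced on average
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]          # errors on this mini-batch only
            grad = X[idx].T @ residual / len(idx)   # one efficient matrix multiply
            w -= lr * grad                          # update before seeing the rest of the data
    return w
```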

Two types of learning algorithm

If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of adaptation.

For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
The mini-batches may need to be quite big when adapting fancy methods.
Big mini-batches are more computationally efficient.

A basic mini-batch gradient descent algorithm

Guess an initial learning rate.
If the error keeps getting worse or oscillates wildly, reduce the learning rate.
If the error is falling fairly consistently but slowly, increase the learning rate.
Write a simple program to automate this way of adjusting the learning rate.
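The "simple program" the slide has in mind could look something like the sketch below; the specific thresholds and multipliers are illustrative assumptions, not values from the lecture.

```python
def adjust_learning_rate(lr, prev_error, error):
    """Crude automation of the slide's hand-tuning rule (all constants are assumptions)."""
    if error > prev_error:            # error got worse or is oscillating: back off
        return lr * 0.7
    if error > 0.999 * prev_error:    # falling, but only just: speed up a little
        return lr * 1.1
    return lr                         # falling fairly consistently: leave it alone
```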

Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
This removes fluctuations in the final weights caused by the variations between mini-batches.
Turn down the learning rate when the error stops decreasing.
Use the error on a separate validation set.

Neural Networks for Machine Learning

Lecture 6b
A bag of tricks for mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

Be careful about turning down the learning rate

Turning down the learning rate reduces the random fluctuations in the error due to the different gradients on different mini-batches.
So we get a quick win.
But then we get slower learning.
Don't turn down the learning rate too soon!
[Figure: error versus epoch; when the learning rate is reduced, the error drops quickly, then decreases more slowly than before.]

Initializing the weights

If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
So they can never learn to be different features.
We break symmetry by initializing the weights to have small random values.

If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
We generally want smaller incoming weights when the fan-in is big, so initialize the weights with a scale proportional to 1/sqrt(fan-in).
We can also scale the learning rate the same way.
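A minimal sketch of fan-in-scaled initialization for one fully connected layer, assuming Gaussian draws; the helper name and the use of numpy are my own choices.

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Small random weights whose scale shrinks as 1/sqrt(fan_in) to limit overshoot."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```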

Shifting the inputs

When using steepest descent, shifting the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
The hyperbolic tangent (which is 2*logistic - 1) produces hidden activations that are roughly zero mean.
In this respect it's better than the logistic.
[Figure: two training cases for a neuron with weights w1 and w2; color indicates the training case. Inputs (101, 101) → 2 and (101, 99) → 0 give a highly elongated error surface; the shifted inputs (1, 1) → 2 and (1, -1) → 0 give a nearly circular one.]

Scaling the inputs

When using steepest descent, scaling the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has unit variance over the whole training set.
[Figure: two training cases for a neuron with weights w1 and w2; color indicates the weight axis. Inputs (0.1, 10) → 2 and (0.1, -10) → 0 give an error surface stretched along one weight axis; the rescaled inputs (1, 1) → 2 and (1, -1) → 0 give a nearly circular one.]
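Shifting and scaling together amount to standardizing each input component, as in this sketch; the function name and the epsilon guard are my additions.

```python
import numpy as np

def standardize(X_train, X_test):
    """Give every input component zero mean and unit variance over the training set."""
    mu = X_train.mean(axis=0)            # per-component mean (shifting)
    sigma = X_train.std(axis=0) + 1e-8   # per-component std (scaling); epsilon avoids 0/0
    return (X_train - mu) / sigma, (X_test - mu) / sigma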

A more thorough method: Decorrelate the input components

For a linear neuron, we get a big win by decorrelating each component of the input from the other input components.
There are several different ways to decorrelate inputs. A reasonable method is to use Principal Components Analysis.
Drop the principal components with the smallest eigenvalues.
This achieves some dimensionality reduction.
Divide the remaining principal components by the square roots of their eigenvalues. For a linear neuron, this converts an axis-aligned elliptical error surface into a circular one.
For a circular error surface, the gradient points straight towards the minimum.
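A sketch of this PCA whitening recipe, assuming numpy and a plain eigendecomposition of the covariance matrix; the function name and the small epsilon are my own.

```python
import numpy as np

def pca_whiten(X, n_components):
    """Decorrelate inputs with PCA, drop small-eigenvalue directions, then whiten."""
    Xc = X - X.mean(axis=0)                       # PCA needs zero-mean data
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]  # keep the largest components
    V, lam = eigvecs[:, keep], eigvals[keep]
    return (Xc @ V) / np.sqrt(lam + 1e-8)         # unit variance along every kept axis
```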

Common problems that occur in multilayer networks

If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
The error derivatives for the hidden units will all become tiny and the error will not decrease.
This is usually a plateau, but people often mistake it for a local minimum.

In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of time it should be a 1.
The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
This is another plateau that looks like a local minimum.

Four ways to speed up mini-batch learning

Use momentum.
Instead of using the gradient to change the position of the weight "particle", use it to change the velocity.
Use separate adaptive learning rates for each parameter.
Slowly adjust the rate using the consistency of the gradient for that parameter.
rmsprop: Divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.
This is the mini-batch version of just using the sign of the gradient.
Take a fancy method from the optimization literature that makes use of curvature information (not this lecture).
Adapt it to work for neural nets.
Adapt it to work for mini-batches.

Neural Networks for Machine Learning

Lecture 6c
The momentum method
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

The intuition behind the momentum method

Imagine a ball on the error surface. The location of the ball in the horizontal plane represents the weight vector.
The ball starts off by following the gradient, but once it has velocity, it no longer does steepest descent.
Its momentum makes it keep going in the previous direction.
It damps oscillations in directions of high curvature by combining gradients with opposite signs.
It builds up speed in directions with a gentle but consistent gradient.

The equations of the momentum method

$$v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$

$$\Delta w(t) = v(t) = \alpha\, v(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t) = \alpha\, \Delta w(t-1) - \varepsilon\, \frac{\partial E}{\partial w}(t)$$

The effect of the gradient is to increment the previous velocity. The velocity also decays by $\alpha$, which is slightly less than 1.
The weight change is equal to the current velocity.
The weight change can be expressed in terms of the previous weight change and the current gradient.
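In code, the update is two lines. This sketch uses the same symbols as the equations above, with lr standing in for the learning rate $\varepsilon$ and alpha for the momentum; the default values are placeholders.

```python
def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """One step of classical momentum."""
    v = alpha * v - lr * grad   # v(t) = alpha * v(t-1) - lr * dE/dw(t)
    w = w + v                   # the weight change equals the current velocity
    return w, v
```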

The behavior of the momentum method

If the error surface is a tilted plane, the ball reaches a terminal velocity.
If the momentum is close to 1, this is much faster than simple gradient descent:

$$v(\infty) = \frac{1}{1-\alpha}\left(-\varepsilon\, \frac{\partial E}{\partial w}\right)$$

At the beginning of learning there may be very large gradients.
So it pays to use a small momentum (e.g. 0.5).
Once the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised to its final value (e.g. 0.9 or even 0.99).
This allows us to learn at a rate that would cause divergent oscillations without the momentum.

A better type of momentum (Nesterov 1983)

The standard momentum method first computes the gradient at the current location and then takes a big jump in the direction of the updated accumulated gradient.
Ilya Sutskever (2012, unpublished) suggested a new form of momentum that often works better.
It is inspired by the Nesterov method for optimizing convex functions.

First make a big jump in the direction of the previous accumulated gradient.
Then measure the gradient where you end up and make a correction.
It's better to correct a mistake after you have made it!

A picture of the Nesterov method

First make a big jump in the direction of the previous accumulated gradient.
Then measure the gradient where you end up and make a correction.
[Figure: brown vector = jump, red vector = correction, green vector = accumulated gradient; blue vectors = standard momentum.]
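A sketch of this jump-then-correct ordering; grad_fn is an assumed callable that returns $\partial E/\partial w$ at a given point, and the hyperparameter defaults are placeholders.

```python
def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    """Nesterov-style momentum: jump first, then correct."""
    g = grad_fn(w + alpha * v)   # measure the gradient where the jump lands
    v = alpha * v - lr * g       # fold the correction into the velocity
    return w + v, v              # move by the corrected velocity
```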

Neural Networks for Machine Learning

Lecture 6d
A separate, adaptive learning rate for each connection
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

The intuition behind separate adaptive learning rates

In a multilayer net, the appropriate learning rates can vary widely between weights:
The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
The fan-in of a unit determines the size of the overshoot effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.
So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
Gradients can get very small in the early layers of very deep nets.
The fan-in often varies widely between layers.

One way to determine the individual learning rates

Start with a local gain of 1 for every weight.
Increase the local gain if the gradient for that weight does not change sign.
Use small additive increases and multiplicative decreases (for mini-batch).
This ensures that big gains decay rapidly when oscillations start.
If the gradient is totally random, the gain will hover around 1 when we increase by plus $\delta$ half the time and decrease by times $1-\delta$ half the time.

$$\Delta w_{ij} = -\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$

$$\text{if}\ \left(\frac{\partial E}{\partial w_{ij}}(t)\,\frac{\partial E}{\partial w_{ij}}(t-1)\right) > 0\ \text{then}\ g_{ij}(t) = g_{ij}(t-1) + .05,\ \text{else}\ g_{ij}(t) = g_{ij}(t-1) \times .95$$
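A vectorized sketch of this gain update, assuming numpy arrays of per-weight gains and gradients; the clipping range follows the next slide, and the function name is my own.

```python
import numpy as np

def update_gains(gains, grad, prev_grad, delta=0.05, lo=0.1, hi=10.0):
    """Additive increase, multiplicative decrease of per-weight local gains."""
    same_sign = grad * prev_grad > 0    # gradient kept its sign for this weight
    gains = np.where(same_sign, gains + delta, gains * (1.0 - delta))
    return np.clip(gains, lo, hi)       # keep gains in a reasonable range

# the update for each weight is then: w -= lr * gains * grad
```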

Tricks for making adaptive learning rates work better

Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].
Use full batch learning or big mini-batches.
This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
Adaptive learning rates can be combined with momentum.
Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
Adaptive learning rates only deal with axis-aligned effects.
Momentum does not care about the alignment of the axes.

Neural Networks for Machine Learning

Lecture 6e
rmsprop: Divide the gradient by a running average of its recent magnitude
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky

rprop: Using only the sign of the gradient

The magnitude of the gradient can be very different for different weights and can change during learning.
This makes it hard to choose a single global learning rate.
For full batch learning, we can deal with this variation by only using the sign of the gradient.
The weight updates are all of the same magnitude.
This escapes from plateaus with tiny gradients quickly.

rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight.
Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
Otherwise decrease the step size multiplicatively (e.g. times 0.5).
Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice).
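A sketch of one full-batch rprop step under these rules, assuming numpy arrays of per-weight step sizes; the function name and default multipliers (1.2 and 0.5, as in the slide's examples) are my own packaging.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5,
               step_min=1e-6, step_max=50.0):
    """Full-batch rprop: per-weight step sizes, moves use only the gradient's sign."""
    agree = grad * prev_grad > 0        # did the last two gradients agree in sign?
    step = np.clip(np.where(agree, step * up, step * down), step_min, step_max)
    return w - np.sign(grad) * step, step
```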

Why rprop does not work with mini-batches

The idea behind stochastic gradient descent is that when the learning rate is small, it averages the gradients over successive mini-batches.
Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
We want this weight to stay roughly where it is.

rprop would increment the weight nine times and decrement it once by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
So the weight would grow a lot.
Is there a way to combine:
The robustness of rprop.
The efficiency of mini-batches.
The effective averaging of gradients over mini-batches.

rmsprop: A mini-batch version of rprop

rprop is equivalent to using the gradient but also dividing by the size of the gradient.
The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
rmsprop: Keep a moving average of the squared gradient for each weight:

$$\text{MeanSquare}(w, t) = 0.9\,\text{MeanSquare}(w, t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

Dividing the gradient by $\sqrt{\text{MeanSquare}(w, t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).
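A minimal sketch of the resulting update; the 0.9/0.1 averaging follows the slide, while the learning rate and the epsilon guard against division by zero are my assumptions.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, lr=0.001, eps=1e-8):
    """rmsprop: divide each gradient by a running RMS of its recent magnitudes."""
    mean_square = 0.9 * mean_square + 0.1 * grad ** 2     # the moving average above
    return w - lr * grad / (np.sqrt(mean_square) + eps), mean_square
```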

Further developments of rmsprop

Combining rmsprop with standard momentum:
Momentum does not help as much as it normally does. Needs more investigation.
Combining rmsprop with Nesterov momentum (Sutskever 2012):
It works best if the RMS of the recent gradients is used to divide the correction rather than the jump in the direction of accumulated corrections.
Combining rmsprop with adaptive learning rates for each connection:
Needs more investigation.
Other methods related to rmsprop:
Yann LeCun's group has a fancy version in "No more pesky learning rates".

Summary of learning methods for neural networks

For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
Conjugate gradient, LBFGS ...
Adaptive learning rates, rprop ...
For big, redundant datasets use mini-batches.
Try gradient descent with momentum.
Try rmsprop (with momentum?).
Try LeCun's latest recipe.
Why there is no simple recipe:
Neural nets differ a lot:
Very deep nets (especially ones with narrow bottlenecks).
Recurrent nets.
Wide shallow nets.
Tasks differ a lot:
Some require very accurate weights, some don't.
Some have many very rare cases (e.g. words).
