Neural Networks Tricks: Patrick Van Der Smagt


neural networks tricks

Patrick van der Smagt

most slides nicked from Geoffrey Hinton


still you need none of this for the exam
(but you’ll need it to train neural networks efficiently)

algorithm for backprop (“on-line” aka “stochastic” learning)
back-propagation algorithm:

initialise the weights
repeat
    for each training sample (x, z) do
    begin
        o = M(w, x)                              ; forward pass
        calculate error z − o at the output units
        for all w(2) compute δw(2)               ; backward pass
        for all w(1) compute δw(1)               ; backward pass continued
        update the weights using ∂E/∂wij = δj xi
    end
    (this is called one epoch)
until stopping criterion satisfied
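
A minimal NumPy sketch of the on-line loop above. It assumes that M is a two-layer network with tanh hidden units, a linear output and squared error; the toy data, layer size and learning rate `alpha` are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration): z = sin(x)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
Z = np.sin(X)

n_hidden, alpha = 10, 0.05
W1 = rng.normal(0.0, 0.5, size=(1, n_hidden))   # w(1): input -> hidden
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))   # w(2): hidden -> output
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

def mean_squared_error():
    return float(np.mean((forward(X)[1] - Z) ** 2))

error_before = mean_squared_error()
for epoch in range(50):
    for i in rng.permutation(len(X)):            # one pass over all samples = one epoch
        x, z = X[i], Z[i]
        h, o = forward(x)                        # forward pass
        delta2 = o - z                           # dE/do for E = 0.5 * (z - o)^2
        delta1 = (W2 @ delta2) * (1.0 - h**2)    # backward pass through tanh
        # update right after every sample, using dE/dw_ij = delta_j * x_i
        W2 -= alpha * np.outer(h, delta2)
        b2 -= alpha * delta2
        W1 -= alpha * np.outer(x, delta1)
        b1 -= alpha * delta1
error_after = mean_squared_error()
```

Every sample triggers its own weight update, so the loop does many small, noisy steps and cannot use efficient batched matrix products.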

What is wrong with this learning method?

better algorithm for backprop (“(mini) batch learning”)
back-propagation algorithm:

initialise the weights
repeat
    for each sample (x, z) in a randomly selected mini-batch do
    begin
        o = M(w, x)                              ; forward pass
        calculate error z − o at the output units
        for all w(2) compute δw(2)               ; backward pass
        for all w(1) compute δw(1)               ; backward pass continued
        sum the delta weights using ∂E/∂wij = δj xi
    end
    update the weights using summed delta weights
until stopping criterion satisfied
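
The mini-batch loop can be sketched the same way. This is only an illustration under assumptions: a two-layer tanh network with squared error, toy data, and a batch size and learning rate picked by hand. The summed delta weights are divided by the batch size here, which only rescales the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration): z = sin(x)
X = rng.uniform(-2.0, 2.0, size=(200, 1))
Z = np.sin(X)

n_hidden, alpha, batch_size = 10, 0.1, 20
W1 = rng.normal(0.0, 0.5, size=(1, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

def loss():
    H = np.tanh(X @ W1 + b1)
    return float(np.mean((H @ W2 + b2 - Z) ** 2))

loss_before = loss()
for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    xb, zb = X[idx], Z[idx]
    H = np.tanh(xb @ W1 + b1)             # forward pass for the whole batch
    O = H @ W2 + b2
    D2 = (O - zb) / batch_size            # output deltas, averaged over the batch
    D1 = (D2 @ W2.T) * (1.0 - H**2)       # backward pass
    # the matrix products sum the per-sample delta weights over the batch
    W2 -= alpha * (H.T @ D2)
    b2 -= alpha * D2.sum(axis=0)
    W1 -= alpha * (xb.T @ D1)
    b1 -= alpha * D1.sum(axis=0)          # one weight update per mini-batch
loss_after = loss()
```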

What is wrong with this learning method?

the error surface for a linear neuron

The error surface lies in a space with a horizontal axis for each weight
and one vertical axis for the error.
  - For a linear neuron with a squared error, it is a quadratic bowl.
For multi-layer, non-linear nets the error surface is much more complicated.
  - But locally, a piece of a quadratic bowl is usually a very good
    approximation.

Convergence speed of full batch learning when the error
surface is a quadratic bowl

Going downhill reduces the error, but the direction of steepest descent
does not point at the minimum unless the ellipse is a circle.

  - The gradient is big in the direction in which we only want to travel a
    small distance.
  - The gradient is small in the direction in which we want to travel a
    large distance.

Even for non-linear multi-layer nets, the error surface is locally quadratic,
so the same speed issues apply.
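
A small numerical illustration of this point, assuming the bowl E(w) = 0.5 wᵀAw with its minimum at the origin; the curvatures in `A` and the starting point `w` are made up for the example.

```python
import numpy as np

def angle_to_minimum(A, w):
    """Angle in degrees between the steepest-descent direction and the
    direction to the minimum of E(w) = 0.5 * w^T A w (minimum at the origin)."""
    descent = -(A @ w)            # negative gradient
    to_minimum = -w               # points straight at the minimum
    cosine = descent @ to_minimum / (
        np.linalg.norm(descent) * np.linalg.norm(to_minimum))
    return float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))

w = np.array([10.0, 1.0])
angle_ellipse = angle_to_minimum(np.diag([1.0, 100.0]), w)  # elongated bowl
angle_circle = angle_to_minimum(np.eye(2), w)               # circular bowl
```

With these numbers the descent direction on the elongated bowl is off by roughly 79 degrees, because the gradient is dominated by the steep direction even though most of the distance lies along the flat one. On the circular bowl the angle is zero up to rounding.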

how learning goes wrong

If the learning rate is big, the weights go back and forth across the ravine.
  - If the learning rate is too big, this oscillation diverges.
What we would like to achieve:
  - Move quickly in directions with small but consistent gradients.
  - Move slowly in directions with big but inconsistent gradients.
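
The effect of the learning rate can be checked on a toy ravine. The sketch assumes the quadratic error E(w) = 0.5 wᵀAw with curvatures 1 and 100; for this error, steepest descent is stable along a direction only while the learning rate stays below 2 divided by that direction's curvature.

```python
import numpy as np

A = np.diag([1.0, 100.0])          # flat along w[0], steep along w[1] (assumed values)

def run(alpha, steps=100):
    w = np.array([10.0, 1.0])
    for _ in range(steps):
        w = w - alpha * (A @ w)    # plain steepest descent
    return w

w_small = run(0.005)     # no oscillation, but slow along the flat direction
w_big = run(0.019)       # w[1] flips sign every step, yet still shrinks
w_too_big = run(0.021)   # above 2 / 100: the oscillation in w[1] diverges
```

No single learning rate suits both directions: the steep one caps the rate at 0.02, while the flat one would tolerate a rate close to 2.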

stochastic gradient descent

If the dataset is highly redundant, the gradient on the first half is almost
identical to the gradient on the second half.

  - So instead of computing the full gradient, update the weights using
    the gradient on the first half and then get a gradient for the new
    weights on the second half.
  - The extreme version of this approach updates weights after each
    case. It’s called “online”.

Mini-batches are usually better than online:

  - Less computation is used updating the weights.
  - Computing the gradient for many cases simultaneously uses
    matrix-matrix multiplies, which are very efficient, especially on GPUs.

Mini-batches need to be balanced for classes!

stochastic gradient descent

If we use the full gradient computed from all the training cases, there are
many clever ways to speed up learning (e.g. non-linear conjugate
gradient).

  - The optimisation community has studied the general problem of
    optimising smooth non-linear functions for many years.
  - Multilayer neural nets are not typical of the problems they study so
    their methods may need a lot of adaptation.

For large neural networks with very large and highly redundant training
sets, it is nearly always best to use mini-batch learning.

  - The mini-batches may need to be quite big when adapting fancy
    methods.
  - Big mini-batches are more computationally efficient.

reducing the learning rate

Turning down the learning rate reduces the random fluctuations in the error
due to the different gradients on different mini-batches.
  - So we get a quick win.
  - But then we get slower learning.

Don’t turn down the learning rate too soon!

initialising the weights

If two hidden units have exactly the same bias and exactly the same
incoming and outgoing weights, they will always get exactly the same
gradient.

  - So they can never learn to be different features.
  - We break symmetry by initialising the weights to have small random
    values.

If a hidden unit has a big fan-in, small changes on many of its incoming
weights can cause the learning to overshoot.

  - We generally want smaller incoming weights when the fan-in is big,
    so initialise the weights to be proportional to 1/sqrt(fan-in).

We can also scale the learning rate the same way.
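
A sketch of such an initialisation, under the assumption that "smaller weights for a bigger fan-in" means scaling random weights by 1/sqrt(fan-in). The function name `init_weights` and the Gaussian distribution are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    # small random values break the symmetry between hidden units;
    # dividing by sqrt(fan_in) shrinks the weights as the fan-in grows
    return rng.normal(0.0, 1.0, size=(fan_in, fan_out)) / np.sqrt(fan_in)

W_small_fan_in = init_weights(10, 50)
W_big_fan_in = init_weights(1000, 50)

# with unit-variance inputs, the summed input to each unit stays near unit scale
x = rng.normal(size=(2000, 1000))
pre_activation_std = float((x @ W_big_fan_in).std())
```

With this scaling the total input of a unit does not grow with the number of incoming connections, so the unit is less likely to saturate.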

shifting the inputs

When using steepest descent, shifting the input values makes a big difference.
  - It usually helps to transform each component of the input vector so
    that it has zero mean over the whole training set.
The tanh() produces hidden activations that are roughly zero mean.
  - In this respect it’s better than the logistic.

scaling the inputs

When using steepest descent, scaling the input values makes a big difference.
  - It usually helps to transform each component of the input vector so
    that it has unit variance over the whole training set.
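
Shifting and scaling together amount to standardising each input component. A minimal sketch with made-up toy data; applying the training-set statistics to the test set as well is common practice and an addition here, not something the slides state.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy inputs with a large offset and very different scales per component
X_train = rng.normal(loc=[100.0, -3.0], scale=[0.1, 20.0], size=(500, 2))
X_test = rng.normal(loc=[100.0, -3.0], scale=[0.1, 20.0], size=(100, 2))

mean = X_train.mean(axis=0)          # per-component mean over the training set
std = X_train.std(axis=0)            # per-component standard deviation

X_train_n = (X_train - mean) / std   # zero mean, unit variance
X_test_n = (X_test - mean) / std     # same transform, training statistics
```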

A more thorough method: Decorrelate the input
components

For a linear neuron, we get a big win by decorrelating each component of
the input from the other input components.

There are several different ways to decorrelate inputs. A reasonable
method is to use Principal Components Analysis (PCA).

  - Drop the principal components with the smallest eigenvalues.
  - Divide the remaining principal components by the square roots of
    their eigenvalues. For a linear neuron, this converts an axis-aligned
    elliptical error surface into a circular one.

For a circular error surface, the gradient points straight towards the
minimum.
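
The two PCA steps can be sketched as follows. The correlated toy data and the choice to keep two components are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# correlated toy inputs: 3 components, one of them nearly redundant
latent = rng.normal(size=(1000, 2))
mixing = np.array([[2.0, 0.0, 1.0],
                   [0.5, 1.0, 0.25]])
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 3))

Xc = X - X.mean(axis=0)                       # centre the inputs
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order

keep = 2                                      # drop the smallest component
top_vals = eigvals[-keep:]
top_vecs = eigvecs[:, -keep:]

# project onto the kept components, then divide by sqrt(eigenvalue)
X_white = (Xc @ top_vecs) / np.sqrt(top_vals)
cov_white = X_white.T @ X_white / len(X_white)
```

After the transform the covariance of the inputs is the identity matrix, which for a linear neuron with squared error corresponds to the circular error surface.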

speed-up tricks

If we start with a very large learning rate, weights get very large and one
suffers from “saturation” in the neurons. This leads to a vanishing
gradient, and learning is stuck.

Furthermore,

1. use momentum
2. use separate learning rates per parameter
3. rmsprop
4. use a second-order method

the behaviour of the momentum method

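$$u(t) = \beta\, u(t-1) - \alpha\, \frac{\partial E}{\partial w}(t)$$

$$u(\infty) = \frac{1}{1-\beta} \left( -\alpha\, \frac{\partial E}{\partial w} \right)$$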

If the momentum β is close to 1, this is much faster than simple gradient
descent.

At the beginning of learning there may be very large gradients.

  - So it pays to use a small momentum (e.g., 0.5).
  - Once the large gradients have disappeared and the weights are stuck
    in a ravine the momentum can be smoothly raised to its final value
    (e.g. 0.9 or even 0.99).

This allows us to learn at a rate that would cause divergent oscillations
without the momentum.
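
A small sketch of the momentum update. It first checks the terminal velocity for a constant gradient, then compares plain steepest descent with momentum on a toy ravine; the quadratic error, its curvatures and the values of `alpha` and `beta` are assumptions.

```python
import numpy as np

alpha, beta = 0.01, 0.9

# constant gradient: the velocity u approaches -alpha * grad / (1 - beta)
grad = 2.0
u = 0.0
for _ in range(500):
    u = beta * u - alpha * grad
terminal = -alpha * grad / (1.0 - beta)

# elongated quadratic bowl E(w) = 0.5 * w^T A w (curvatures assumed)
A = np.diag([1.0, 100.0])

def descend(beta, steps=200):
    w = np.array([10.0, 1.0])
    u = np.zeros(2)
    for _ in range(steps):
        u = beta * u - alpha * (A @ w)   # velocity update
        w = w + u                        # the weights move by the velocity
    return float(0.5 * w @ A @ w)        # final error

error_plain = descend(beta=0.0)          # beta = 0 is simple gradient descent
error_momentum = descend(beta=0.9)
```

With the same learning rate, momentum builds up speed along the flat direction, where the gradient is small but consistent, so the final error is far lower.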

separate adaptive learning rates

In a multilayer net, the appropriate learning rates can vary widely
between weights:

  - The magnitudes of the gradients are often very different for different
    layers, especially if the initial weights are small.
  - The fan-in of a unit determines the size of the “overshoot” effects
    caused by simultaneously changing many of the incoming weights of
    a unit to correct the same error.

So use a global learning rate (set by hand) multiplied by an appropriate
local gain that is determined empirically for each weight.

how to determine the individual learning rates

Start with a local gain of 1 for every weight.

Increase the local gain if the gradient for that weight does not change sign.

$$\Delta w_{ij} = -\alpha\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$

$$\text{if } \frac{\partial E}{\partial w_{ij}}(t)\, \frac{\partial E}{\partial w_{ij}}(t-1) > 0 \text{ then } g_{ij}(t) = g_{ij}(t-1) + 0.05 \text{ else } g_{ij}(t) = g_{ij}(t-1) \cdot 0.95$$

Use small additive increases and multiplicative decreases (for mini-batch).
  - This ensures that big gains decay rapidly when oscillations start.
  - If the gradient is totally random the gain will hover around 1 when we
    increase by plus δ half the time and decrease by times 1 − δ half the
    time.
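
The gain rule can be written in a few lines. The sketch uses the constants 0.05 and 0.95 from the slide; the function name `update_gains`, the example gradients and the global learning rate `alpha` are made up.

```python
import numpy as np

def update_gains(gains, grad, prev_grad, inc=0.05, dec=0.95):
    # additive increase where the gradient kept its sign,
    # multiplicative decrease everywhere else
    agree = grad * prev_grad > 0
    return np.where(agree, gains + inc, gains * dec)

gains = np.ones(3)                        # start with a local gain of 1
prev_grad = np.array([0.5, -0.2, 0.1])
grad = np.array([0.4, 0.3, 0.2])          # only the second weight changes sign
gains = update_gains(gains, grad, prev_grad)

alpha = 0.01                              # global learning rate, set by hand
delta_w = -alpha * gains * grad           # per-weight update
```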

rprop: using only the sign of the gradient
The magnitude of the gradient can be very different for different weights
and can change during learning.

  - This makes it hard to choose a single global learning rate.

For full batch learning, we can deal with this variation by only using the
sign of the gradient.

  - The weight updates are all of the same magnitude.
  - This escapes from plateaus with tiny gradients quickly.

rprop: This combines the idea of only using the sign of the gradient with
the idea of adapting the step size separately for each weight.

  - Increase the step size for a weight multiplicatively (e.g. times 1.2) if
    the signs of its last two gradients agree.
  - Otherwise decrease the step size multiplicatively (e.g. times 0.5).
  - (Rule of thumb: Limit the step sizes to be less than 50 and more
    than a millionth (Mike Shuster’s advice)).
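
A sketch of the simple rprop rule described on this slide, run as full-batch descent on a toy error. The factors 1.2 and 0.5 and the step limits come from the slide; the error function, its curvatures `c` and the initial step size are assumptions.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5,
               step_min=1e-6, step_max=50.0):
    agree = grad * prev_grad > 0                    # did the sign stay the same?
    step = np.where(agree, step * up, step * down)  # adapt each step size
    step = np.clip(step, step_min, step_max)
    return w - np.sign(grad) * step, step           # only the sign is used

# E(w) = 0.5 * sum(c * w^2), with very different curvatures per weight
c = np.array([1e-3, 1.0, 1e3])
w = np.ones(3)
step = np.full(3, 0.1)
prev_grad = np.zeros(3)
for _ in range(100):
    grad = c * w                                    # full-batch gradient
    w, step = rprop_step(w, grad, prev_grad, step)
    prev_grad = grad
```

Because only the sign enters the update, the three weights follow exactly the same path even though their gradients differ by six orders of magnitude.
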
rmsprop: a mini-batch version of rprop

rprop is equivalent to using the gradient but also dividing by the size of
the gradient.

The problem with mini-batch rprop is that we divide by a different
number for each mini-batch. So why not force the number we divide by
to be very similar for adjacent mini-batches?

rmsprop: Keep a moving average of the squared gradient for each weight

$$\text{meanSquare}(w, t) = 0.9\, \text{meanSquare}(w, t-1) + 0.1 \left( \frac{\partial E}{\partial w}(t) \right)^2$$

Dividing the gradient by $\sqrt{\text{meanSquare}(w, t)}$ makes the learning work
much better.
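
A minimal sketch of the rmsprop update. The averaging constants 0.9 and 0.1 are from the slide; the small `eps` that guards against division by zero, the learning rate and the toy error with curvatures `c` are assumptions.

```python
import numpy as np

def rmsprop_step(w, grad, mean_square, alpha=0.01, decay=0.9, eps=1e-8):
    # moving average of the squared gradient, kept separately for each weight
    mean_square = decay * mean_square + (1.0 - decay) * grad**2
    # divide the gradient by the root of that average (eps is an added safeguard)
    w = w - alpha * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

# E(w) = 0.5 * sum(c * w^2), with very different curvatures per weight
c = np.array([1e-3, 1.0, 1e3])
w = np.ones(3)
mean_square = np.zeros(3)
for _ in range(300):
    grad = c * w
    w, mean_square = rmsprop_step(w, grad, mean_square)
```

All three weights approach the minimum at a similar pace, since the division removes most of the dependence on the raw gradient magnitude.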

summary of learning methods
For small datasets (e.g. 10,000 cases) or bigger datasets without much
redundancy, use a full-batch method.

  - Conjugate gradient, LBFGS, ...
  - Adaptive learning rates, rprop, ...

For big, redundant datasets use mini-batches.

  - Try gradient descent with momentum.
  - Try rmsprop (with momentum?)

Why there is no simple recipe:

  - Neural nets differ a lot: Very deep nets (especially ones with narrow
    bottlenecks); Recurrent nets; Wide shallow nets.
  - Tasks differ a lot: Some require very accurate weights, some don’t;
    Some have many very rare cases (e.g., words).
