Neural Networks Tricks: Patrick Van Der Smagt
algorithm for backprop (“on-line” aka “stochastic” learning)
back-propagation algorithm:
initialise the weights
repeat
  for each training sample (x, z) do
  begin
    o = M(w, x)                          ; forward pass
    calculate error z − o at the output units
    for all w(2) compute δw(2)           ; backward pass
    for all w(1) compute δw(1)           ; backward pass continued
    update the weights using ∂E/∂wij = δj xi
  end
  (this is called one epoch)
until stopping criterion satisfied
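To make the loop above concrete, here is a minimal NumPy sketch of on-line backprop for an assumed two-layer net (tanh hidden layer, linear output, squared error); the toy data, layer sizes and learning rate α are all assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, alpha = 2, 3, 1, 0.1        # assumed sizes and learning rate

# toy training set: target is the sum of the inputs (assumption)
training_set = [(x, np.array([x.sum()])) for x in rng.normal(size=(50, n_in))]

# initialise the weights with small random values
W1 = rng.normal(0, 0.1, (n_hid, n_in))          # w(1)
W2 = rng.normal(0, 0.1, (n_out, n_hid))         # w(2)

for epoch in range(100):                        # "until stopping criterion satisfied"
    for x, z in training_set:                   # one pass over the data = one epoch
        h = np.tanh(W1 @ x)                     # forward pass: hidden activations
        o = W2 @ h                              # forward pass: o = M(w, x)
        delta2 = z - o                          # error z - o at the (linear) output units
        delta1 = (W2.T @ delta2) * (1 - h**2)   # backward pass through the tanh layer
        W2 += alpha * np.outer(delta2, h)       # update immediately: dE/dw_ij = delta_j * x_i
        W1 += alpha * np.outer(delta1, x)
```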
better algorithm for backprop (“(mini) batch learning”)
back-propagation algorithm:
initialise the weights
repeat
  for each sample (x, z) in a randomly selected mini-batch do
  begin
    o = M(w, x)                          ; forward pass
    calculate error z − o at the output units
    for all w(2) compute δw(2)           ; backward pass
    for all w(1) compute δw(1)           ; backward pass continued
    sum the delta weights using ∂E/∂wij = δj xi
  end
  update the weights using the summed delta weights
until stopping criterion satisfied
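The same assumed two-layer sketch, adapted to the mini-batch loop above: the delta weights are summed over a randomly drawn mini-batch and the weights are updated once per batch. Batch size, learning rate and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, batch_size, alpha = 2, 3, 1, 10, 0.1   # assumptions

X = rng.normal(size=(200, n_in))                 # toy inputs
Z = X.sum(axis=1, keepdims=True)                 # toy targets
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))

for step in range(500):                          # "until stopping criterion satisfied"
    idx = rng.choice(len(X), size=batch_size, replace=False)   # randomly select a mini-batch
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, z in zip(X[idx], Z[idx]):
        h = np.tanh(W1 @ x)                      # forward pass
        o = W2 @ h
        delta2 = z - o                           # error at the output units
        delta1 = (W2.T @ delta2) * (1 - h**2)    # backward pass
        dW2 += np.outer(delta2, h)               # sum the delta weights
        dW1 += np.outer(delta1, x)
    W2 += alpha * dW2 / batch_size               # one update per mini-batch (averaged here)
    W1 += alpha * dW1 / batch_size
```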
the error surface for a linear neuron
Convergence speed of full batch learning when the error surface is a quadratic bowl
Going downhill reduces the error, but the direction of steepest descent
does not point at the minimum unless the ellipse is a circle.
Even for non-linear multi-layer nets, the error surface is locally quadratic,
so the same speed issues apply.
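A small numeric illustration of the steepest-descent point above, on an assumed elongated bowl E(w) = ½(w₁² + 100 w₂²): at a typical point the steepest-descent direction and the direction to the minimum are far apart.

```python
import numpy as np

# assumed bowl: E(w) = 0.5 * (w1**2 + 100 * w2**2), minimum at the origin
w = np.array([1.0, 0.1])
grad = np.array([w[0], 100 * w[1]])               # gradient of E at w

steepest = -grad / np.linalg.norm(grad)           # direction of steepest descent
to_minimum = -w / np.linalg.norm(w)               # direction that points at the minimum
angle = np.degrees(np.arccos(steepest @ to_minimum))
print(f"angle between the two directions: {angle:.0f} degrees")   # roughly 79 degrees
```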
how learning goes wrong
stochastic gradient descent
If the dataset is highly redundant, the gradient on the first half is almost
identical to the gradient on the second half.
stochastic gradient descent
If we use the full gradient computed from all the training cases, there are
many clever ways to speed up learning (e.g. non-linear conjugate
gradient).
For large neural networks with very large and highly redundant training
sets, it is nearly always best to use mini-batch learning.
reducing the learning rate
initialising the weights
If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient, so they can never learn to compute different features; break this symmetry by initialising with small random weights.
If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot; scaling the initial weights like 1/√(fan-in) keeps this in check (see the sketch below).
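A minimal sketch of an initialisation that addresses both observations: small random weights break the symmetry between hidden units, and 1/√fan-in scaling keeps units with a big fan-in from overshooting. The layer sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, scale=1.0):
    # small *random* weights: units with identical weights would get identical
    # gradients forever and could never learn different features
    # 1/sqrt(fan_in) scaling: the more incoming weights a unit has, the smaller
    # each one starts, which limits the overshoot from changing all of them at once
    return rng.normal(0.0, scale / np.sqrt(fan_in), size=(fan_out, fan_in))

W1 = init_layer(784, 256)              # assumed layer sizes
W2 = init_layer(256, 10)
b1, b2 = np.zeros(256), np.zeros(10)   # zero biases are fine once the weights break symmetry
```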
shifting the inputs
scaling the inputs
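The two slide titles above presumably refer to the usual preprocessing recipe: shift each input component to zero mean and scale it to unit variance, with the statistics taken from the training set only. A minimal sketch (the function name and epsilon are assumptions):

```python
import numpy as np

def standardise(X_train, X_test, eps=1e-8):
    mean = X_train.mean(axis=0)          # shift: per-component training mean
    std = X_train.std(axis=0) + eps      # scale: per-component training std (eps avoids /0)
    return (X_train - mean) / std, (X_test - mean) / std
```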
A more thorough method: Decorrelate the input components
For a circular error surface, the gradient points straight towards the
minimum.
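A standard way to decorrelate the input components is PCA, optionally followed by rescaling each component to unit variance (whitening), which turns an elliptical quadratic error surface into a circular one. A minimal sketch, with the epsilon as an assumption:

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)                 # centre the data first
    cov = X.T @ X / len(X)                 # covariance of the input components
    eigval, eigvec = np.linalg.eigh(cov)   # principal components
    # rotate onto the principal components (decorrelates the inputs), then divide
    # each component by sqrt(eigenvalue) so every direction has unit variance
    return (X @ eigvec) / np.sqrt(eigval + eps)
```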
speed-up tricks
If we start with a very large learning rate, weights get very large and one
suffers from “saturation” in the neurons. This leads to a vanishing
gradient, and learning is stuck.
Furthermore, the following tricks speed up learning:
1. use momentum
2. use separate learning rates per parameter
3. rmsprop
4. use a second-order method
the behaviour of the momentum method
u(t) = β · u(t − 1) − α · ∂E/∂w(t)

u(∞) = − α/(1 − β) · ∂E/∂w
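A minimal sketch of the update above, with β and α as in the formulas; the quadratic test function and the chosen values of β and α are assumptions.

```python
import numpy as np

beta, alpha = 0.9, 0.01                        # assumed momentum and learning rate

def momentum_step(w, u, grad_fn):
    # u(t) = beta * u(t-1) - alpha * dE/dw(t); with a constant gradient the velocity
    # settles at its terminal value u(inf) = -alpha/(1-beta) * dE/dw
    u = beta * u - alpha * grad_fn(w)
    return w + u, u

# usage on an assumed quadratic bowl E(w) = 0.5 * w @ H @ w
H = np.diag([1.0, 100.0])
grad_fn = lambda w: H @ w
w, u = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, u = momentum_step(w, u, grad_fn)
```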
separate adaptive learning rates
- The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
- The fan-in of a unit determines the size of the “overshoot” effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.
how to determine the individual learning rates
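One common recipe (an assumption here, not spelled out on the slide): give each weight its own multiplicative “gain” on a global learning rate, increase the gain additively while the gradient for that weight keeps its sign, and decrease it multiplicatively when the sign flips. A sketch:

```python
import numpy as np

def adaptive_gain_step(w, gain, grad, prev_grad, alpha=0.01):
    agree = grad * prev_grad > 0                        # did this weight's gradient keep its sign?
    gain = np.where(agree, gain + 0.05, gain * 0.95)    # additive increase, multiplicative decrease
    gain = np.clip(gain, 0.1, 10.0)                     # keep the per-weight gains in a sensible range
    return w - alpha * gain * grad, gain                # per-parameter effective learning rate
```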
rprop: using only the sign of the gradient
The magnitude of the gradient can be very different for different weights
and can change during learning.
For full batch learning, we can deal with this variation by only using the
sign of the gradient.
rprop: This combines the idea of only using the sign of the gradient with
the idea of adapting the step size separately for each weight.
- Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
- Otherwise decrease the step size multiplicatively (e.g. times 0.5).
- Rule of thumb: limit the step sizes to be less than 50 and more than a millionth (Mike Shuster’s advice).
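A simplified sketch of an rprop step along those lines, for full-batch gradients: only the sign of the gradient enters the update, and each weight’s step size is adapted multiplicatively within the limits from the rule of thumb.

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    agree = grad * prev_grad > 0                       # did the last two gradient signs agree?
    step = np.where(agree, step * up, step * down)     # grow or shrink each weight's step size
    step = np.clip(step, step_min, step_max)           # "less than 50 and more than a millionth"
    return w - np.sign(grad) * step, step              # move by the step size, using only the sign
```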
rmsprop: a mini-batch version of rprop
rprop is equivalent to using the gradient but also dividing by the size of
the gradient.
rmsprop: Keep a moving average of the squared gradient for each weight
meanSquare(w, t) = 0.9 · meanSquare(w, t − 1) + 0.1 · (∂E/∂w(t))²

Dividing the gradient by √meanSquare(w, t) makes the learning work much better.
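A minimal sketch of an rmsprop step as described above; the learning rate and epsilon are assumptions.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, alpha=0.001, decay=0.9, eps=1e-8):
    # meanSquare(w, t) = 0.9 * meanSquare(w, t-1) + 0.1 * (dE/dw(t))**2
    mean_square = decay * mean_square + (1 - decay) * grad**2
    # divide the gradient by sqrt(meanSquare) before taking the step
    return w - alpha * grad / (np.sqrt(mean_square) + eps), mean_square
```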
summary of learning methods
For small datasets (e.g. 10,000 cases) or bigger datasets without much
redundancy, use a full-batch method.
- Neural nets differ a lot: Very deep nets (especially ones with narrow bottlenecks); Recurrent nets; Wide shallow nets.
- Tasks differ a lot: Some require very accurate weights, some don’t; Some have many very rare cases (e.g., words).