Neural Networks Tricks: Patrick Van Der Smagt
algorithm for backprop (“on-line” aka “stochastic” learning)
back-propagation algorithm:
initialise the weights
repeat
  for each training sample (x, z) do
  begin
    o = M(w, x)                          ; forward pass
    calculate error z − o at the output units
    for all w(2) compute δw(2)           ; backward pass
    for all w(1) compute δw(1)           ; backward pass continued
    update the weights using ∂E/∂wij = δj xi
  end
  (this is called one epoch)
until stopping criterion satisfied
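To make the loop above concrete, here is a minimal NumPy sketch of on-line backprop for an assumed two-layer net (tanh hidden layer, linear output, squared error); the toy data, layer sizes and learning rate α are all assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, alpha = 2, 3, 1, 0.1        # assumed sizes and learning rate

# toy training set: target is the sum of the inputs (assumption)
training_set = [(x, np.array([x.sum()])) for x in rng.normal(size=(50, n_in))]

# initialise the weights with small random values
W1 = rng.normal(0, 0.1, (n_hid, n_in))          # w(1)
W2 = rng.normal(0, 0.1, (n_out, n_hid))         # w(2)

for epoch in range(100):                        # "until stopping criterion satisfied"
    for x, z in training_set:                   # one pass over the data = one epoch
        h = np.tanh(W1 @ x)                     # forward pass: hidden activations
        o = W2 @ h                              # forward pass: o = M(w, x)
        delta2 = z - o                          # error z - o at the (linear) output units
        delta1 = (W2.T @ delta2) * (1 - h**2)   # backward pass through the tanh layer
        W2 += alpha * np.outer(delta2, h)       # update immediately: dE/dw_ij = delta_j * x_i
        W1 += alpha * np.outer(delta1, x)
```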
better algorithm for backprop (“(mini) batch learning”)
back-propagation algorithm:
initialise the weights
repeat
  for each sample (x, z) in a randomly selected mini-batch do
  begin
    o = M(w, x)                          ; forward pass
    calculate error z − o at the output units
    for all w(2) compute δw(2)           ; backward pass
    for all w(1) compute δw(1)           ; backward pass continued
    sum the delta weights using ∂E/∂wij = δj xi
  end
  update the weights using the summed delta weights
until stopping criterion satisfied
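The same assumed two-layer sketch, adapted to the mini-batch loop above: the delta weights are summed over a randomly drawn mini-batch and the weights are updated once per batch. Batch size, learning rate and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, batch_size, alpha = 2, 3, 1, 10, 0.1   # assumptions

X = rng.normal(size=(200, n_in))                 # toy inputs
Z = X.sum(axis=1, keepdims=True)                 # toy targets
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))

for step in range(500):                          # "until stopping criterion satisfied"
    idx = rng.choice(len(X), size=batch_size, replace=False)   # randomly select a mini-batch
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, z in zip(X[idx], Z[idx]):
        h = np.tanh(W1 @ x)                      # forward pass
        o = W2 @ h
        delta2 = z - o                           # error at the output units
        delta1 = (W2.T @ delta2) * (1 - h**2)    # backward pass
        dW2 += np.outer(delta2, h)               # sum the delta weights
        dW1 += np.outer(delta1, x)
    W2 += alpha * dW2 / batch_size               # one update per mini-batch (averaged here)
    W1 += alpha * dW1 / batch_size
```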
the error surface for a linear neuron
Convergence speed of full batch learning when the error surface is a quadratic bowl
Going downhill reduces the error, but the direction of steepest descent
does not point at the minimum unless the ellipse is a circle.
Even for non-linear multi-layer nets, the error surface is locally quadratic,
so the same speed issues apply.
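A small numeric illustration of the steepest-descent point above, on an assumed elongated bowl E(w) = ½(w₁² + 100 w₂²): at a typical point the steepest-descent direction and the direction to the minimum are far apart.

```python
import numpy as np

# assumed bowl: E(w) = 0.5 * (w1**2 + 100 * w2**2), minimum at the origin
w = np.array([1.0, 0.1])
grad = np.array([w[0], 100 * w[1]])               # gradient of E at w

steepest = -grad / np.linalg.norm(grad)           # direction of steepest descent
to_minimum = -w / np.linalg.norm(w)               # direction that points at the minimum
angle = np.degrees(np.arccos(steepest @ to_minimum))
print(f"angle between the two directions: {angle:.0f} degrees")   # roughly 79 degrees
```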
how learning goes wrong
stochastic gradient descent
If the dataset is highly redundant, the gradient on the first half is almost
identical to the gradient on the second half.
stochastic gradient descent
If we use the full gradient computed from all the training cases, there are
many clever ways to speed up learning (e.g. non-linear conjugate
gradient).
For large neural networks with very large and highly redundant training
sets, it is nearly always best to use mini-batch learning.
reducing the learning rate
initialising the weights
If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient, so they can never learn to compute different features; break this symmetry by initialising with small random weights.
If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot; scaling the initial weights like 1/√(fan-in) keeps this in check (see the sketch below).
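A minimal sketch of an initialisation that addresses both observations: small random weights break the symmetry between hidden units, and 1/√fan-in scaling keeps units with a big fan-in from overshooting. The layer sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, scale=1.0):
    # small *random* weights: units with identical weights would get identical
    # gradients forever and could never learn different features
    # 1/sqrt(fan_in) scaling: the more incoming weights a unit has, the smaller
    # each one starts, which limits the overshoot from changing all of them at once
    return rng.normal(0.0, scale / np.sqrt(fan_in), size=(fan_out, fan_in))

W1 = init_layer(784, 256)              # assumed layer sizes
W2 = init_layer(256, 10)
b1, b2 = np.zeros(256), np.zeros(10)   # zero biases are fine once the weights break symmetry
```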
shifting the inputs
scaling the inputs
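The two slide titles above presumably refer to the usual preprocessing recipe: shift each input component to zero mean and scale it to unit variance, with the statistics taken from the training set only. A minimal sketch (the function name and epsilon are assumptions):

```python
import numpy as np

def standardise(X_train, X_test, eps=1e-8):
    mean = X_train.mean(axis=0)          # shift: per-component training mean
    std = X_train.std(axis=0) + eps      # scale: per-component training std (eps avoids /0)
    return (X_train - mean) / std, (X_test - mean) / std
```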
A more thorough method: Decorrelate the input components
For a circular error surface, the gradient points straight towards the
minimum.
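A standard way to decorrelate the input components is PCA, optionally followed by rescaling each component to unit variance (whitening), which turns an elliptical quadratic error surface into a circular one. A minimal sketch, with the epsilon as an assumption:

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)                 # centre the data first
    cov = X.T @ X / len(X)                 # covariance of the input components
    eigval, eigvec = np.linalg.eigh(cov)   # principal components
    # rotate onto the principal components (decorrelates the inputs), then divide
    # each component by sqrt(eigenvalue) so every direction has unit variance
    return (X @ eigvec) / np.sqrt(eigval + eps)
```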
speed-up tricks
If we start with a very large learning rate, weights get very large and one
suffers from “saturation” in the neurons. This leads to a vanishing
gradient, and learning is stuck.
Furthermore, the following tricks speed up learning:
1. use momentum
2. use separate learning rates per parameter
3. rmsprop
4. use a second-order method
the behaviour of the momentum method
u(t) = β · u(t − 1) − α · ∂E/∂w(t)

u(∞) = − α/(1 − β) · ∂E/∂w
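A minimal sketch of the update above, with β and α as in the formulas; the quadratic test function and the chosen values of β and α are assumptions.

```python
import numpy as np

beta, alpha = 0.9, 0.01                        # assumed momentum and learning rate

def momentum_step(w, u, grad_fn):
    # u(t) = beta * u(t-1) - alpha * dE/dw(t); with a constant gradient the velocity
    # settles at its terminal value u(inf) = -alpha/(1-beta) * dE/dw
    u = beta * u - alpha * grad_fn(w)
    return w + u, u

# usage on an assumed quadratic bowl E(w) = 0.5 * w @ H @ w
H = np.diag([1.0, 100.0])
grad_fn = lambda w: H @ w
w, u = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w, u = momentum_step(w, u, grad_fn)
```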
separate adaptive learning rates
- The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small.
- The fan-in of a unit determines the size of the “overshoot” effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error.
how to determine the individual learning rates
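One common recipe (an assumption here, not spelled out on the slide): give each weight its own multiplicative “gain” on a global learning rate, increase the gain additively while the gradient for that weight keeps its sign, and decrease it multiplicatively when the sign flips. A sketch:

```python
import numpy as np

def adaptive_gain_step(w, gain, grad, prev_grad, alpha=0.01):
    agree = grad * prev_grad > 0                        # did this weight's gradient keep its sign?
    gain = np.where(agree, gain + 0.05, gain * 0.95)    # additive increase, multiplicative decrease
    gain = np.clip(gain, 0.1, 10.0)                     # keep the per-weight gains in a sensible range
    return w - alpha * gain * grad, gain                # per-parameter effective learning rate
```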
rprop: using only the sign of the gradient
The magnitude of the gradient can be very different for different weights
and can change during learning.
For full batch learning, we can deal with this variation by only using the
sign of the gradient.
rprop: This combines the idea of only using the sign of the gradient with
the idea of adapting the step size separately for each weight.
- Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
- Otherwise decrease the step size multiplicatively (e.g. times 0.5).
- Rule of thumb: limit the step sizes to be less than 50 and more than a millionth (Mike Shuster’s advice).
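A simplified sketch of an rprop step along those lines, for full-batch gradients: only the sign of the gradient enters the update, and each weight’s step size is adapted multiplicatively within the limits from the rule of thumb.

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    agree = grad * prev_grad > 0                       # did the last two gradient signs agree?
    step = np.where(agree, step * up, step * down)     # grow or shrink each weight's step size
    step = np.clip(step, step_min, step_max)           # "less than 50 and more than a millionth"
    return w - np.sign(grad) * step, step              # move by the step size, using only the sign
```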
rmsprop: a mini-batch version of rprop
rprop is equivalent to using the gradient but also dividing by the size of
the gradient.
rmsprop: Keep a moving average of the squared gradient for each weight
meanSquare(w, t) = 0.9 · meanSquare(w, t − 1) + 0.1 · (∂E/∂w(t))²

Dividing the gradient by √meanSquare(w, t) makes the learning work much better.
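A minimal sketch of an rmsprop step as described above; the learning rate and epsilon are assumptions.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, alpha=0.001, decay=0.9, eps=1e-8):
    # meanSquare(w, t) = 0.9 * meanSquare(w, t-1) + 0.1 * (dE/dw(t))**2
    mean_square = decay * mean_square + (1 - decay) * grad**2
    # divide the gradient by sqrt(meanSquare) before taking the step
    return w - alpha * grad / (np.sqrt(mean_square) + eps), mean_square
```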
summary of learning methods
For small datasets (e.g. 10,000 cases) or bigger datasets without much
redundancy, use a full-batch method.
- Neural nets differ a lot: Very deep nets (especially ones with narrow bottlenecks); Recurrent nets; Wide shallow nets.
- Tasks differ a lot: Some require very accurate weights, some don’t; Some have many very rare cases (e.g., words).