Neural Networks
Course 4: Making the neural network resistant to overfitting
Overview
Overfitting
Regularization
Dropout
Weight Initialization
What happens if we initialize all the weights and biases with 0? Then, for every neuron,
$$z_i^l = wx + b = 0 \cdot x + 0 = 0$$
$$\sigma(z_i^l) = \sigma(0) = \frac{1}{1 + e^{0}} = 0.5$$
and, similarly, $\sigma(z_i^L) = 0.5$, so every output is $y = 0.5$ regardless of the input.
What if we initialized all of them with a constant $c$ (different from 0)? Then all the weighted inputs $z_i^l$ (and $z_i^L$) within a layer are still equal, so all the neurons of a layer compute the same activation $y^L$ and receive the same updates.
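To see the problem concretely, here is a small NumPy illustration (not from the slides): with every weight and bias set to the same constant, all the hidden neurons receive the same weighted input, so they compute the same activation and will receive the same gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(784)            # one arbitrary input image (flattened)

c = 0.7                        # the constant used for every weight and bias
W1 = np.full((36, 784), c)     # hidden layer weights, all equal to c
b1 = np.full(36, c)            # hidden layer biases, all equal to c

z1 = W1 @ x + b1               # every hidden neuron gets the same weighted input
a1 = sigmoid(z1)

print(np.allclose(z1, z1[0]))  # True: identical z for all hidden neurons
print(np.allclose(a1, a1[0]))  # True: identical activations, hence identical gradients
```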
Experiment:
We will train a neural network with the same architecture as in the last course (784, 36, 10), but this time we will use a training set of 1000 images instead of 50000.
We will use the same learning rate (0.3), the same mini-batch size (10), and we will use cross-entropy as our cost function.
But since we use a smaller dataset, we will train it for more iterations: 500 instead of 30.
(So nothing changes in the network except the training set size and the number of iterations.)
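A minimal sketch of this experiment, assuming Keras and its built-in MNIST loader rather than the course's own code (a softmax output with categorical cross-entropy is used as a close analogue of the sigmoid + cross-entropy setup described above):

```python
import numpy as np
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Use only 1000 training images instead of the full training set.
x_small, y_small = x_train[:1000], y_train[:1000]

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(36, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Same learning rate (0.3) and mini-batch size (10) as before, but 500 passes over the data.
history = model.fit(x_small, y_small, batch_size=10, epochs=500,
                    validation_data=(x_test, y_test), verbose=0)
```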
Overfitting
This clearly shows that the ANN learns particularities of the training data and does not generalize.
The main reason why this happens is that the training data is too small compared to the size of the network.
Our network has 784*36 + 36*10 = 28584 weights and 36 + 10 = 46 biases. That means it has 28630 parameters that it can use in order to model just 1000 elements.
How does the same network behave on the larger dataset (50000)?
Overfitting
A good way to detect that you are not overfitting is to compare the accuracy on the training data with the accuracy on the test data. These should be close, with the accuracy on the training data slightly higher.
So, increasing the training data does help against overfitting. With very large datasets, it becomes very difficult for a neural network to overfit.
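Continuing the hypothetical Keras sketch above, the gap between the two accuracies can be read directly from the training history:

```python
# Compare final training vs. test accuracy from the sketch above (hypothetical variable names).
train_acc = history.history["accuracy"][-1]
test_acc = history.history["val_accuracy"][-1]
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap (a large gap suggests overfitting): {train_acc - test_acc:.3f}")
```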
Bias:
The bias measures how far the estimates (on average) are from the target.
Variance:
The variance measures how spread out the estimates are around their mean.
Solutions for Overfitting: Regularization
It can be shown that the Mean Squared Error actually makes a tradeoff between variance and bias.
Let's say we have some points $(x, y)$ and there is a relation between $x$ and $y$ such that $y = f(x)$. We want to find the function $f$, so we will try to approximate it by $g$.
In this case:
$$\mathrm{bias}[g(x)] = E[g(x)] - f(x)$$
$$\mathrm{var}[g(x)] = E\big[(g(x) - E[g(x)])^2\big]$$
$$\mathrm{MSE}[g] = E\big[(g(x) - f(x))^2\big]$$
where $E$ denotes the expected value (mean).
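These quantities can be estimated empirically by refitting the estimator $g$ on many independently sampled training sets; a hypothetical NumPy sketch (the target $f$ and the estimator are chosen arbitrarily here):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # the (unknown) target function, chosen arbitrarily
x_eval = 0.3                           # point where we measure bias and variance
predictions = []

for _ in range(2000):                  # many independently sampled training sets
    x = rng.uniform(0, 1, 20)
    y = f(x) + rng.normal(0, 0.3, 20)  # noisy observations of f
    coeffs = np.polyfit(x, y, deg=1)   # g = degree-1 polynomial fit (a simple estimator)
    predictions.append(np.polyval(coeffs, x_eval))

predictions = np.array(predictions)
bias = predictions.mean() - f(x_eval)          # bias[g(x)] = E[g(x)] - f(x)
variance = predictions.var()                   # var[g(x)] = E[(g(x) - E[g(x)])^2]
mse = np.mean((predictions - f(x_eval)) ** 2)  # E[(g(x) - f(x))^2]
print(f"bias^2 + variance = {bias**2 + variance:.4f}, MSE = {mse:.4f}")
```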
Solutions for Overfitting: Regularization
Proof:
$$E\big[(X - E[X])^2\big] = E\big[X^2 - 2X\,E[X] + E[X]^2\big] = E[X^2] - 2E[X]\,E[X] + E[X]^2 = E[X^2] - E[X]^2$$
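A quick numerical check of this identity on arbitrary numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])      # an arbitrary sample
print(np.mean((x - x.mean()) ** 2))     # E[(X - E[X])^2]  -> 5.25
print(np.mean(x ** 2) - x.mean() ** 2)  # E[X^2] - E[X]^2  -> 5.25
```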
Solutions for Overfitting: Regularization
$$E\big[(g(x) - f(x))^2\big] = E\big[g(x)^2 - 2\,g(x) f(x) + f(x)^2\big] = E[g(x)^2] - 2\,E[g(x) f(x)] + E[f(x)^2]$$
Adding and subtracting $E[g(x)]^2$ and $E[f(x)]^2$, and using the fact that the noise in the target does not depend on the estimator (so the cross term factors as $E[g(x)]\,E[f(x)]$):
$$= \big(E[g(x)^2] - E[g(x)]^2\big) + \big(E[g(x)]^2 - 2\,E[g(x)]\,E[f(x)] + E[f(x)]^2\big) + \big(E[f(x)^2] - E[f(x)]^2\big) =$$
$$= \mathrm{var}[g(x)] + \big(E[g(x)] - E[f(x)]\big)^2 + \mathrm{var}[f(x)] = \mathrm{variance} + \mathrm{bias}^2 + \mathrm{noise}$$
What this tells us is that, for a given value of the error, we have to trade off between variance and bias (since the noise does not depend on the estimator).
Both high bias and high variance are negative properties of a model: a model with high bias tends to underfit the data, while a model with high variance tends to overfit.
The model on the left underfits the data. It has high bias (a large distance from the targets) but low variance: if we move one point, the line would probably not change very much.
The model on the right overfits the data. It has low bias (it matches the target points almost exactly) but high variance: if we move one point, there will probably be a big change in the model.
The model in the center is the right one, since it has lower bias than the one on the left and lower variance than the one on the right.
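A small numerical sketch of this picture (an illustration using NumPy polynomial fits of increasing degree on the same noisy points):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy target points

for deg in (1, 3, 9):                      # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x, y, deg)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # Fresh points from the same process act as a "test set".
    x_new = rng.uniform(0, 1, 100)
    y_new = np.sin(2 * np.pi * x_new) + rng.normal(0, 0.2, x_new.size)
    test_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {deg}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```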
Solutions for Overfitting: Regularization
Compare, for example, $0.5x^2 + x + 1$ with $5x^2 + x + 1$: the function with the larger coefficient changes much faster.
Solutions for Overfitting: Regularization
So, in order to make the function smooth (low variance), we need to keep the weights small (possibly even zero).
The $\lambda$ (regularization parameter) in
$$C = C_0 + \frac{\lambda}{2n} \sum_w w^2$$
controls how much bias vs. variance we want.
If $\lambda = 0$, then we have the standard cost function.
The bigger $\lambda$ is, the more bias (and less variance) we want. If $\lambda$ is too big, we can go into undertraining.
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$$
$$w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \eta \frac{\lambda}{n} w = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$$
Since we'll most likely use SGD, the update rule we will actually use is:
$$w \rightarrow \left(1 - \frac{\eta\lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$$
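A one-step sketch of this update in NumPy (hypothetical function; `grad_C0` stands for the mini-batch average of $\partial C_x / \partial w$):

```python
import numpy as np

def sgd_step_l2(w, grad_C0, eta, lmbda, n):
    """One L2-regularized SGD update: w -> (1 - eta*lmbda/n) * w - eta * grad_C0.

    w       : weight matrix
    grad_C0 : mini-batch average of dC_x/dw (the unregularized gradient)
    eta     : learning rate
    lmbda   : regularization parameter
    n       : size of the full training set
    """
    return (1.0 - eta * lmbda / n) * w - eta * grad_C0
```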
Solutions for Overfitting: Regularization
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w), \quad \text{where } \mathrm{sgn}(w) \text{ is the sign of } w$$
$$w \rightarrow w - \eta \frac{\lambda}{n}\,\mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$$
Or, if we use SGD:
$$w \rightarrow w - \eta \frac{\lambda}{n}\,\mathrm{sgn}(w) - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$$
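The corresponding L1 step differs only in the shrinkage term (same hypothetical names as in the L2 sketch):

```python
import numpy as np

def sgd_step_l1(w, grad_C0, eta, lmbda, n):
    """One L1-regularized SGD update: w -> w - eta*lmbda/n * sgn(w) - eta * grad_C0."""
    return w - eta * lmbda / n * np.sign(w) - eta * grad_C0
```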
Solutions for Overfitting: Regularization
L1 tends to concentrate the model on a few weights (pushing many others toward zero), while L2 uses more weights but with smaller values.
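In a library such as Keras (an assumption; the course code itself is not shown), both penalties are usually attached per layer via a regularizer, e.g.:

```python
from tensorflow import keras
from tensorflow.keras import regularizers

model = keras.Sequential([
    keras.Input(shape=(784,)),
    # L2 (weight decay) penalty on the hidden weights ...
    keras.layers.Dense(36, activation="sigmoid",
                       kernel_regularizer=regularizers.l2(1e-4)),
    # ... or an L1 penalty, which tends to drive many weights exactly to zero.
    keras.layers.Dense(10, activation="softmax",
                       kernel_regularizer=regularizers.l1(1e-5)),
])
```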
Solutions for Overfitting: Dropout
The idea of this technique is to make the network use fewer neurons (and therefore fewer weights) during training: a random fraction of the hidden neurons is temporarily removed.
After we have finished the training, we restore all the hidden neurons and halve the weights going from the hidden neurons to the output neurons.
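A minimal NumPy sketch of this scheme (an illustration with 50% dropout on the hidden layer; the outgoing weights are scaled by 0.5 at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_forward_train(a_hidden, drop_prob=0.5):
    """Training pass: temporarily remove each hidden neuron with probability drop_prob."""
    mask = rng.random(a_hidden.shape) >= drop_prob   # 1 = keep the neuron, 0 = drop it
    return a_hidden * mask

def output_forward_test(a_hidden, W_out, b_out, drop_prob=0.5):
    """Test pass: all hidden neurons are active, so scale the outgoing weights down
    (for drop_prob = 0.5 this is exactly the halving described above)."""
    return (1.0 - drop_prob) * W_out @ a_hidden + b_out
```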
Solutions for Overfitting: Dropout
Solutions for Overfitting: Dropout
Improvement of the accuracy by using 20% dropout in the hidden layer, when trained on the entire dataset.
Solutions for Overfitting: Dropout
Other advantages:
It makes the network robust to the loss of any individual neuron.
It makes training more effective, since more neurons end up participating in learning.
Improvement of the accuracy by using 50% dropout in the hidden layer, on a dataset of just 1000 elements.
Solutions for Overfitting: Maxnorm
Improvement of the accuracy by using 50% dropout in the hidden layer together with a max-norm of 5 on that layer, on a dataset of just 1000 elements.
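Max-norm constrains the Euclidean norm of each neuron's incoming weight vector to at most a constant $c$ (here 5), rescaling the vector after each update if it grows too large. A small NumPy sketch of the constraint (an illustration, not the course code):

```python
import numpy as np

def apply_max_norm(W, c=5.0):
    """Rescale each row (one neuron's incoming weights) so its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```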
Solutions for Overfitting: Increase Dataset
References
http://neuralnetworksanddeeplearning.com/
http://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)
http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-overview/