
1 November, 2016

Neural Networks
Course 4: Making the neural network resistant to overfitting
Overview

 Overfitting
 Regularization
 Dropout
 Max Norm Constraint
 Increase Dataset
 Conclusions
Weight Initialization

 Why do we need to initialize weights with random values?

 What if we initialized all of them with 0s?

  $z_i^l = w x + b = 0 \cdot x + 0 = 0$
  $\sigma(z_i^l) = \sigma(0) = \frac{1}{1 + e^{0}} = 0.5$
 Similarly, $\sigma(z_i^L) = 0.5$

 Whatever the input, the output will be the same
Weight Initialization

 What if we initialized all of them with 0s?

 Every activation in the network is $y_i^l = 0.5$, so the output is $y = 0.5$ regardless of the input

 If cross entropy is used, $C = -\frac{1}{n}\sum_x \left[ y \ln a + (1 - y)\ln(1 - a) \right]$, the error will always be the same on the first iteration ($-\ln(0.5)$)

 The error that will be backpropagated will be:
  $\delta_i^l = y_i^l (1 - y_i^l) \sum_k \delta_k^{l+1} w_{ik}^{l+1} = 0.5 \cdot (1 - 0.5) \cdot \delta^L \cdot 0 = 0$
Weight Initialization

 What if we initialized all of them with 0s?

 As before, $y_i^l = 0.5$ and $\delta_i^l = 0$ for the hidden units

 The network will adjust the weights in the final layer, $w^{(2)}$, by the same amount for each hidden unit:
  $w = w + \eta \, \delta_i^l \, y^{l-1} = w + \eta \cdot 0 \cdot 0 = w = 0$
  $b = b + \eta \, \delta_i^l = b + \eta \cdot 0 = b = 0$

 In the same way it can be shown that $w^{(1)} = 0$

 We ended up in the same place as in the first iteration, so the network doesn't learn
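 As a quick numerical check of the argument above, here is a minimal NumPy sketch (the two-layer sigmoid network and all variable names are illustrative, not the course code) showing that with all-zero weights the hidden-layer error is zero and every output weight receives the same update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((784, 1))            # an arbitrary input
y = np.zeros((10, 1)); y[3] = 1.0   # an arbitrary one-hot target

# All-zero initialization (the case discussed above)
W1, b1 = np.zeros((36, 784)), np.zeros((36, 1))
W2, b2 = np.zeros((10, 36)), np.zeros((10, 1))

# Forward pass: every activation is sigmoid(0) = 0.5
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)

# Backward pass, using the cross-entropy delta for the output layer
delta2 = a2 - y                               # not zero...
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # ...but W2 is all zeros, so this is zero
grad_W2 = delta2 @ a1.T                       # gradient for the output-layer weights

print(np.abs(delta1).max())      # 0.0 -> the hidden-layer weights never change
print(np.ptp(grad_W2, axis=1))   # all 0.0 -> for each output neuron, every incoming weight gets the same update
```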
Weight Initialization

 What if we initialized all of them with a constant $c$ (different from 0)?

 [Diagram: a network in which every weight is set to the same constant $c$]
Weight Initialization

 What if we initialized all of them with a constant $c$ (different from 0)?

 Each hidden unit will compute the same activation: $y^l = \sigma(cx + c)$

 When the error is backpropagated, it will be the same for each hidden unit:
  $\delta_i^l = y^l (1 - y^l) \, c \, \delta^{l+1}$

 Since all weights are the same, and all weight updates are based on the current weight values and on the error, all weight adjustments will be the same
Weight Initialization

 What if we initialized all of them with a constant $c$ (different from 0)?

 So, even though we have multiple neurons in the hidden layer, they will always have the same weights, and thus they'll all be the same

 It is almost as if there were only one neuron in the hidden layer

 Making the weights random achieves a better exploration of the feature space
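 A common way to break this symmetry (a sketch of one standard scheme, not necessarily the one used later in the course) is to draw each weight from a small zero-mean Gaussian, scaled by $1/\sqrt{n_{\text{in}}}$ so the sigmoid does not start out saturated:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Gaussian weights scaled by 1/sqrt(n_in) to keep pre-activations small; zero biases."""
    W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(42)
W1, b1 = init_layer(784, 36, rng)   # hidden layer of the 784-36-10 network used below
W2, b2 = init_layer(36, 10, rng)    # output layer
```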
Overfitting
Overfitting

 Overfitting describes what happens when a model customizes itself too much to the training data and fails to generalize. It tries to memorize the data rather than generalize from it

 Which of the following models is better?


Overfitting

 Experiment:
  We will train a neural network with the same architecture as in the last course (784, 36, 10), but this time we will use a training set of 1000 images instead of 50000.

  We will use the same learning rate (0.3), the same mini-batch size (10), and we will use cross-entropy as our cost function.

  But since we use a smaller dataset, we will train it for more iterations: 500 instead of 30.

  (So nothing changes in the network but the training set size and the number of iterations; a sketch of this setup follows below)
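 A hedged sketch of how this experiment could be run with the network2.py / mnist_loader.py code that accompanies http://neuralnetworksanddeeplearning.com/ (listed in the bibliography); the module names and the SGD signature are assumptions based on that public code, not something shown on the slide:

```python
import mnist_loader   # helper module from the book's code (assumed available)
import network2       # the book's network with cross-entropy cost and regularization support

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
training_data = list(training_data)[:1000]   # keep only 1000 training images

net = network2.Network([784, 36, 10], cost=network2.CrossEntropyCost)
net.SGD(training_data,
        500,    # epochs: 500 instead of 30
        10,     # mini-batch size
        0.3,    # learning rate
        evaluation_data=list(test_data),
        monitor_evaluation_accuracy=True,
        monitor_training_cost=True)
```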
Overfitting

Cost of training data

 The cost of the training data decreases, as we expected

 At the end of the training cycle, the error is very small (0.005)
Overfitting

Accuracy on testing data

 As can be seen in the graph, the best accuracy is achieved at about iteration 270. After that it just fluctuates around the same point

 Beyond that point, the network is overtraining, or overfitting

 We should have stopped at iteration 280
Overfitting

Accuracy on training data

 And if we check the accuracy on the training data, from about iteration 42 it is 1000/1000 (100%).

 That means the network correctly identifies all the numbers in the training set
Overfitting

Cost of testing data

 The cost of the testing data decreases until around iteration 20, after which it starts increasing

 This shouldn't have happened if the model correctly approximated the data.

 There should have been a continuous decrease in the testing cost
Overfitting

 This clearly shows that the ANN learns particularities of the data and does not generalize.

 The main reason why this happens is that the training set is too small compared to the size of the network.

 Our network has 784·36 + 36·10 = 28584 weights and 36 + 10 = 46 biases. That means it has 28630 parameters that it can use in order to model 1000 elements.

 How does the same network behave on the larger dataset (50000 images)?
Overfitting

Accuracy on test data vs. training data

 Overfitting still happens here, but to a lower degree.

 Beyond iteration 7, the network isn't learning anymore.

 However, the difference between the two accuracies is not that large: only 2.42%
Solutions for Overfitting
Solutions for Overfitting

 A good way to detect that you are not overtraining is to compare the accuracy on the training data with the accuracy on the test data. These should be close, with the accuracy on training being slightly higher

 So, increasing the training data does help against overfitting. With very large datasets, it becomes very difficult for a neural network to overfit.

 Another obvious solution would be to reduce the network size

 The solutions above are sometimes impractical:
  We cannot always get more training data
  It may take much too long to train on a larger data set
  Large networks tend to be more powerful than smaller ones
Solutions for Overfitting: Regularization

 Understanding: bias and variance

 Bias:
  The bias measures how far the estimates (their average) are from the target
 Variance:
  The variance measures how spread out the estimates are around their mean
Solutions for Overfitting: Regularization

 Let's suppose there are some darts players who want to hit the center of the target (approximate a point), and each player shoots many darts.

 This is how the target would look depending on how biased the shots are (how far their average is from the target) and how spread out they are (variance).
Solutions for Overfitting: Regularization

 It can be shown that the Mean Squared Error actually makes a tradeoff between variance and bias.

 Let's say we have some points $(x, y)$ and there is a relation between $x$ and $y$ such that $y = f(x)$. We want to find the function $f$, so we will try to approximate it by $g$.

 In this case:
  $\text{bias}[g(x)] = E[g(x)] - f(x)$
  $\text{var}[g(x)] = E\big[(g(x) - E[g(x)])^2\big]$
  $\text{MSE}(f) = E\big[(g(x) - f(x))^2\big]$
 where $E$ denotes the expected value (mean)
Solutions for Overfitting: Regularization

 We will use the following lemma:
  $E[X^2] = E\big[(X - E[X])^2\big] + E[X]^2$

 Proof:
  $E\big[(X - E[X])^2\big] = E\big[X^2 + E[X]^2 - 2XE[X]\big] = E[X^2] + E\big[E[X]^2\big] - 2E\big[XE[X]\big]$
  $= E[X^2] + E[X]^2 - 2E[X]^2 = E[X^2] - E[X]^2$
 Adding $E[X]^2$ to both sides gives $E\big[(X - E[X])^2\big] + E[X]^2 = E[X^2]$
Solutions for Overfitting: Regularization

  $E\big[(g(x) - f(x))^2\big]$
  $= E\big[g(x)^2 - 2 g(x) f(x) + f(x)^2\big]$
  $= E[g(x)^2] - 2 E[g(x) f(x)] + E[f(x)^2]$
 Applying the lemma to both $E[g(x)^2]$ and $E[f(x)^2]$:
  $= E\big[(g(x) - E[g(x)])^2\big] + E[g(x)]^2 - 2 E[g(x) f(x)] + E\big[(f(x) - E[f(x)])^2\big] + E[f(x)]^2$
  $= E\big[(g(x) - E[g(x)])^2\big] + \big(E[g(x)] - f(x)\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$
  $= \text{variance} + \text{bias}^2 + \sigma^2\;(\text{noise})$


Solutions for Overfitting: Regularization

 $\text{MSE}(f) = \text{bias}^2 + \text{variance} + \sigma^2\,(\text{noise})$

 What this tells us is that, for a given value of the error, we have to trade off between variance and bias (since the noise doesn't depend on the estimator)

 Both high bias and high variance are negative properties for a model. A model with high bias tends to underfit the data, while a model with high variance tends to overfit

 The purpose is to find an equilibrium between the two.
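 A small numerical illustration of the tradeoff (a synthetic sketch, not from the slides): fit a low-degree and a high-degree polynomial to many noisy samples of a known function and estimate the bias² and variance of the prediction at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                               # the "true" function we pretend not to know
x_train = np.linspace(0, np.pi, 15)      # training inputs
x0, sigma = 1.0, 0.3                     # test point and observation-noise std

def fit_and_predict(degree):
    """Fit a polynomial of the given degree to one noisy sample of f and predict at x0."""
    y_noisy = f(x_train) + rng.normal(0, sigma, x_train.size)
    coeffs = np.polyfit(x_train, y_noisy, degree)
    return np.polyval(coeffs, x0)

for degree in (1, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
# Typically the degree-1 model shows the larger bias^2 (underfitting),
# while the degree-9 model shows the larger variance (overfitting).
```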


Solutions for Overfitting: Regularization

 The model on the left underfits the data. It has high bias (large distance from the targets), but low variance: if we move one point, the line would probably not change very much
 The model on the right overfits the data. It has low bias (there is an almost exact match of the target points) but high variance: if we move one point, there will probably be a big change in the model
 The model in the center is the right one, since it has lower bias than the one on the left and lower variance than the one on the right
Solutions for Overfitting: Regularization

 In order for a model to have high variance, it needs two things:

  A large number of weights (for example, the line in the first image has fewer weights than the polynomial in the right image)
  Large values for the weights. This is needed in order to be able to make rapid shifts. Think of a parabola $ax^2 + bx + c$: the greater $a$ is, the narrower the parabola will be

 [Plots: $0.5x^2 + x + 1$ vs. $5x^2 + x + 1$]
Solutions for Overfitting: Regularization

 So, in order to make the function smooth (low variance) we need to have small weights (even zero ones).

 We modify the cost function by introducing a term that penalizes large weights:
  $C = C_0 + \frac{\lambda}{2n} \sum_w w^2$
 where $C_0$ is the original cost function and $\lambda$ is called the regularization parameter

 So, the cross-entropy cost will look like this:
  $C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1 - y_j) \ln(1 - a_j^L) \right] + \frac{\lambda}{2n} \sum_w w^2$

 And the Mean Squared Error (quadratic cost):
  $C = \frac{1}{2n} \sum_x \sum_j \left( y_j - a_j^L \right)^2 + \frac{\lambda}{2n} \sum_w w^2$
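 A minimal NumPy sketch of the L2-regularized cross-entropy above (variable names and shapes are illustrative assumptions):

```python
import numpy as np

def l2_cross_entropy_cost(A, Y, weights, lmbda):
    """Cross-entropy over n examples plus the (lambda / 2n) * sum(w^2) penalty.

    A, Y    : arrays of shape (n_outputs, n_examples) -- output activations and targets
    weights : list of the network's weight matrices
    lmbda   : the regularization parameter
    """
    n = Y.shape[1]
    ce = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / n
    l2 = (lmbda / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return ce + l2
```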
Solutions for Overfitting: Regularization

 The $\lambda$ (regularization parameter) in $C = C_0 + \frac{\lambda}{2n} \sum_w w^2$ controls how much bias vs. variance we want.
  If $\lambda = 0$, then we have the standard cost function

 The bigger $\lambda$ is, the more bias (and less variance) we get. If $\lambda$ is too big, we can go into undertraining (underfitting)

 This is called L2 regularization, and it is also known as weight decay


Solutions for Overfitting: Regularization

 The easiest way to introduce regularization into backpropagation is to adjust how the cost varies with respect to the weights, $\frac{\partial C}{\partial w}$ (since the weights are the only parameters affected):

  $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$

  $w = w - \eta \frac{\partial C_0}{\partial w} - \eta \frac{\lambda}{n} w = \left(1 - \eta \frac{\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$

 Since we'll most likely use SGD, the update we will actually use is:

  $w = \left(1 - \eta \frac{\lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$
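 A sketch of this update rule for one mini-batch (a hedged NumPy-style illustration; grads is assumed to hold the summed $\partial C_x / \partial w$ over the mini-batch):

```python
def sgd_step_l2(weights, grads, eta, lmbda, n, m):
    """One SGD step with L2 weight decay: w <- (1 - eta*lmbda/n) * w - (eta/m) * sum_x dCx/dw.

    weights : list of weight matrices      grads : matching list of summed gradients
    eta     : learning rate                lmbda : regularization parameter
    n       : training-set size            m     : mini-batch size
    """
    return [(1 - eta * lmbda / n) * w - (eta / m) * g
            for w, g in zip(weights, grads)]
```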
Solutions for Overfitting: Regularization

 Let's train the ANN again on the 1000-image training set, but this time with regularization: $\lambda = 0.1$

 As can be seen, the test accuracy continues to increase (even after iteration 300).

 We have also achieved a greater accuracy than before: 87.5% vs. 82.6%
Solutions for Overfitting: Regularization

 The same experiment, but this time using the entire data set and $\lambda = 5$

 The difference between the test accuracy and the training accuracy is also smaller (1.13%)

 The overall accuracy is greater than before (96.34% vs. 94.86%)
Solutions for Overfitting: Regularization

 Another variant of regularization is L1, where instead of the squares of the weights we use their absolute values:
  $C = C_0 + \frac{\lambda}{n} \sum_w |w|$

 In order to use this in our backpropagation algorithm, we must first see how it changes the cost derivative:
  $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \operatorname{sgn}(w)$, where $\operatorname{sgn}(w)$ is the sign of $w$
  $w = w - \eta \frac{\lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$
 Or, if we use SGD:
  $w = w - \eta \frac{\lambda}{n} \operatorname{sgn}(w) - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$
Solutions for Overfitting: Regularization

 So, how is L1 different from L2?

  L1: $w = w - \eta \frac{\lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$
  L2: $w = \left(1 - \eta \frac{\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$

 L1 drives the weights down by a constant amount:
  - if some weights are very big, they won't be reduced by much (relative to their size)
  - if some weights are small, they will probably be driven to zero

 L2 drives the weights down by an amount that depends on the value of $w$:
  - for large weights, this means a large decrease
  - for small weights, this means a small decrease

 L1 tends to build the model on a few important weights, while L2 uses more weights but with smaller values
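 A tiny numerical illustration of this difference (a sketch in which $\partial C_0 / \partial w$ is set to zero, so only the two penalty terms act):

```python
import numpy as np

eta, lmbda, n = 0.5, 5.0, 1000.0
w_l1 = np.array([4.0, 0.01])    # one large and one small weight
w_l2 = w_l1.copy()

for _ in range(500):
    w_l1 = w_l1 - eta * (lmbda / n) * np.sign(w_l1)   # L1: constant shrinkage
    w_l2 = (1 - eta * lmbda / n) * w_l2               # L2: proportional shrinkage

print(w_l1)   # ~[2.75, ~0]   : the small weight is driven to (and then oscillates around) zero
print(w_l2)   # ~[1.15, 0.003]: both weights shrank by the same factor, neither is exactly zero
```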
Solutions for Overfitting: Dropout

 The idea of this technique is to make the network use fewer weights during training

 On each minibatch that we train on, we randomly select ½ of the hidden neurons and make them invisible (including, of course, the weights that go into or out of them)

 [Diagram: a network with half of the hidden-layer neurons dropped out]
Solutions for Overfitting: Dropout

 We update the visible weights as before, using backpropagation
 We restore the invisible hidden neurons and their connections
 We then select another minibatch, for which we randomly select another ½ of the hidden neurons to make invisible

 After we have finished the training, we restore all the hidden neurons and halve the weights from the hidden neurons to the output neurons
Solutions for Overfitting: Dropout

 If neuron 1 outputs the right value 80% of the time
 If neuron 2 outputs only random values
 Then neuron 3 will most likely compute:
  $\text{activation}_1 \cdot 1 + \text{activation}_2 \cdot 0$

 The gradients will get propagated only through neurons 3 and 1

 This means the network uses less than its potential (since neuron 2 will never be used)
Solutions for Overfitting: Dropout

 Improvement of the accuracy by using 20% dropout in the hidden layer, when used on the entire dataset
Solutions for Overfitting: Dropout

Other advantages:
 Makes the network robust to the loss of any individual neuron
 Makes it more efficient, since more neurons will participate in learning

How it is usually implemented (see the sketch after this list):

During training:
 1. Given the n activations from the previous layer, generate n numbers that are 1 with probability (1 - p) and 0 otherwise (Bernoulli distribution)
 2. Multiply the activations with the generated vector (element-wise) and also multiply by $\frac{1}{1 - p}$
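 A minimal NumPy sketch of these two steps (this scaled variant is usually called inverted dropout); the shapes and names are illustrative:

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    """Drop each activation with probability p; scale survivors by 1/(1-p) during training
    so that no extra rescaling is needed at test time."""
    if not training or p == 0.0:
        return activations
    mask = rng.binomial(1, 1.0 - p, size=activations.shape)   # 1 with probability (1 - p)
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = rng.random((36, 10))                     # 36 hidden activations for a mini-batch of 10
dropped = dropout_forward(hidden, p=0.5, rng=rng)
```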
Solutions for Overfitting: Dropout

 Improvement of the accuracy by using 50% dropout in the hidden layer, on a dataset of just 1000 elements
Solutions for Overfitting: Maxnorm

Maxnorm tries to limit large weights, but on each individual neuron.

 1. The weight updates are performed as usual
 2. If $\lVert w \rVert_2 > c$, i.e. $\sqrt{\sum_i w_i^2} > c$, then
 3. rescale the weights:
  $w = w \cdot \frac{c}{\sqrt{\sum_i w_i^2}}$
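 A hedged NumPy sketch of this constraint, applied row-wise (one row of W per neuron's incoming weights; the names and the usage below are illustrative):

```python
import numpy as np

def max_norm_constraint(W, c):
    """Rescale every row of W whose L2 norm exceeds c so that its norm becomes exactly c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)          # per-neuron incoming-weight norms
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))     # shrink only the offending rows
    return W * scale

# Applied right after a normal (e.g. SGD + dropout) weight update:
rng = np.random.default_rng(0)
W = rng.normal(0.0, 2.0, size=(36, 784))
W = max_norm_constraint(W, c=5.0)    # c = 5, the value used in the experiment on the next slide
```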
Solutions for Overfitting: Maxnorm

 Improvement of the accuracy by using:
  50% dropout in the hidden layer
  a max norm of 5
  on a dataset of just 1000 elements
Solutions for Overfitting: Increase Dataset

 As we've seen earlier, when using a bigger dataset, our network didn't overfit that badly

 So, another idea is to increase the amount of data.

 Since it is sometimes difficult to find more training data, a solution is to add small perturbations to our existing data set using different transformations, like slightly rotating the images
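 A small sketch of this idea using SciPy image rotation (the ±10° range and the doubling of the set are illustrative choices, not taken from the slides):

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, rng, max_angle=10.0):
    """Return one slightly rotated copy of each 28x28 image (angle drawn uniformly at random)."""
    rotated = [rotate(img, rng.uniform(-max_angle, max_angle), reshape=False, mode="nearest")
               for img in images]
    return np.array(rotated)

rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))   # stand-in for the 1000 MNIST training images
augmented_training_set = np.concatenate([images, augment_with_rotations(images, rng)])
```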
Solutions for Overfitting: Increase Dataset

 In “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, by Patrice Simard, Dave Steinkraus, and John Platt (2003), the authors used a neural network with 800 hidden neurons to classify MNIST digits.

 They achieved an increase of accuracy from 98.4% to 98.9% by:
  Rotating
  Translating
  Skewing the images

 By using “elastic distortions” (which should emulate random oscillations found in human muscles) they increased the accuracy to 99.3%
Questions & Discussion
Demo
Framework created by Andrej Karpathy
http://cs.stanford.edu/
Bibliography

 http://neuralnetworksanddeeplearning.com/
 http://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html
 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research 15 (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)
 http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-overview/
