
1 November, 2016

Neural Networks
Course 4: Making the neural network resistant to overfitting
Overview

 Overfitting
 Regularization
 Dropout
 Max Norm Constraint
 Increase Dataset
 Conclusions
Weight Initialization

 Why do we need to initialize weights with random values?

 What if we initialized all of them with 0s?

  $z_i^l = w x + b = 0 \cdot x + 0 = 0$
  $\sigma(z_i^l) = \sigma(0) = \frac{1}{1 + e^{0}} = 0.5$
 Similarly, $\sigma(z_i^L) = 0.5$

 Whatever the input, the output will be the same
Weight Initialization

 What if we initialized all of them with 0s?

 Every activation in the network is $y_i^l = 0.5$, so the output is $y = 0.5$ regardless of the input

 If cross entropy is used, $C = -\frac{1}{n}\sum_x \left[ y \ln a + (1 - y)\ln(1 - a) \right]$, the error will always be the same on the first iteration ($-\ln(0.5)$)

 The error that will be backpropagated will be:
  $\delta_i^l = y_i^l (1 - y_i^l) \sum_k \delta_k^{l+1} w_{ik}^{l+1} = 0.5 \cdot (1 - 0.5) \cdot \delta^L \cdot 0 = 0$
Weight Initialization

 What if we initialized all of them with 0s?

 As before, $y_i^l = 0.5$ and $\delta_i^l = 0$ for the hidden units

 The network will adjust the weights in the final layer, $w^{(2)}$, by the same amount for each hidden unit:
  $w = w + \eta \, \delta_i^l \, y^{l-1} = w + \eta \cdot 0 \cdot 0 = w = 0$
  $b = b + \eta \, \delta_i^l = b + \eta \cdot 0 = b = 0$

 In the same way it can be shown that $w^{(1)} = 0$

 We ended up in the same place as in the first iteration, so the network doesn't learn
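 As a quick numerical check of the argument above, here is a minimal NumPy sketch (the two-layer sigmoid network and all variable names are illustrative, not the course code) showing that with all-zero weights the hidden-layer error is zero and every output weight receives the same update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((784, 1))            # an arbitrary input
y = np.zeros((10, 1)); y[3] = 1.0   # an arbitrary one-hot target

# All-zero initialization (the case discussed above)
W1, b1 = np.zeros((36, 784)), np.zeros((36, 1))
W2, b2 = np.zeros((10, 36)), np.zeros((10, 1))

# Forward pass: every activation is sigmoid(0) = 0.5
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)

# Backward pass, using the cross-entropy delta for the output layer
delta2 = a2 - y                               # not zero...
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # ...but W2 is all zeros, so this is zero
grad_W2 = delta2 @ a1.T                       # gradient for the output-layer weights

print(np.abs(delta1).max())      # 0.0 -> the hidden-layer weights never change
print(np.ptp(grad_W2, axis=1))   # all 0.0 -> for each output neuron, every incoming weight gets the same update
```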
Weight Initialization

 What if we initialized all of them with a constant $c$ (different from 0)?

 [Diagram: a network in which every weight is set to the same constant $c$]
Weight Initialization

 What if we initialized all of them with a constant $c$ (different from 0)?

 Each hidden unit will compute the same activation: $y^l = \sigma(cx + c)$

 When the error is backpropagated, it will be the same for each hidden unit:
  $\delta_i^l = y^l (1 - y^l) \, c \, \delta^{l+1}$

 Since all weights are the same, and all weight updates are based on the current weight values and on the error, all weight adjustments will be the same
Weight Initialization

 What if we initialized all of them with a constant $c$ (different from 0)?

 So, even though we have multiple neurons in the hidden layer, they will always have the same weights, and thus they'll all be the same

 It is almost as if there were only one neuron in the hidden layer

 Making the weights random achieves a better exploration of the feature space
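 A common way to break this symmetry (a sketch of one standard scheme, not necessarily the one used later in the course) is to draw each weight from a small zero-mean Gaussian, scaled by $1/\sqrt{n_{\text{in}}}$ so the sigmoid does not start out saturated:

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """Gaussian weights scaled by 1/sqrt(n_in) to keep pre-activations small; zero biases."""
    W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(42)
W1, b1 = init_layer(784, 36, rng)   # hidden layer of the 784-36-10 network used below
W2, b2 = init_layer(36, 10, rng)    # output layer
```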
Overfitting
Overfitting

 Overfitting describes what happens when a model customizes itself too much to the training data and fails to generalize. It tries to memorize the data rather than generalize from it

 Which of the following models is better?


Overfitting

 Experiment:
  We will train a neural network with the same architecture as in the last course (784, 36, 10), but this time we will use a training set of 1000 images instead of 50000.

  We will use the same learning rate (0.3), the same mini-batch size (10), and we will use cross-entropy as our cost function.

  But since we use a smaller dataset, we will train it for more iterations: 500 instead of 30.

  (So nothing changes in the network but the training set size and the number of iterations; a sketch of this setup follows below)
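 A hedged sketch of how this experiment could be run with the network2.py / mnist_loader.py code that accompanies http://neuralnetworksanddeeplearning.com/ (listed in the bibliography); the module names and the SGD signature are assumptions based on that public code, not something shown on the slide:

```python
import mnist_loader   # helper module from the book's code (assumed available)
import network2       # the book's network with cross-entropy cost and regularization support

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
training_data = list(training_data)[:1000]   # keep only 1000 training images

net = network2.Network([784, 36, 10], cost=network2.CrossEntropyCost)
net.SGD(training_data,
        500,    # epochs: 500 instead of 30
        10,     # mini-batch size
        0.3,    # learning rate
        evaluation_data=list(test_data),
        monitor_evaluation_accuracy=True,
        monitor_training_cost=True)
```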
Overfitting

Cost of training data

 The cost of the training data decreases, as we expected

 At the end of the training cycle, the error is very small (0.005)
Overfitting

Accuracy on testing data

 As can be seen in the graph, the best accuracy is achieved at about iteration 270. After that it just fluctuates around the same point

 Beyond that point, the network is overtraining, or overfitting

 We should have stopped at iteration 280
Overfitting

Accuracy on training data

 And if we check the accuracy on the training data, from about iteration 42 it is 1000/1000 (100%).

 That means the network correctly identifies all the numbers in the training set
Overfitting

Cost of testing data

 The cost of the testing data decreases until around iteration 20, after which it starts increasing

 This shouldn't have happened if the model correctly approximated the data.

 There should have been a continuous decrease in the testing cost
Overfitting

 This clearly shows that the ANN learns particularities of the data and does not generalize.

 The main reason why this happens is that the training set is too small compared to the size of the network.

 Our network has 784·36 + 36·10 = 28584 weights and 36 + 10 = 46 biases. That means it has 28630 parameters that it can use in order to model 1000 elements.

 How does the same network behave on the larger dataset (50000 images)?
Overfitting

Accuracy on test data vs. training data

 Overfitting still happens here, but to a lower degree.

 Beyond iteration 7, the network isn't learning anymore.

 However, the difference between the two accuracies is not that large: only 2.42%
Solutions for Overfitting
Solutions for Overfitting

 A good way to detect that you are not overtraining is to compare the accuracy on the training data with the accuracy on the test data. These should be close, with the accuracy on training being slightly higher

 So, increasing the training data does help against overfitting. With very large datasets, it becomes very difficult for a neural network to overfit.

 Another obvious solution would be to reduce the network size

 The solutions above are sometimes impractical:
  We cannot always get more training data
  It may take much too long to train on a larger data set
  Large networks tend to be more powerful than smaller ones
Solutions for Overfitting: Regularization

 Understanding: bias and variance

 Bias:
  The bias measures how far the estimates (their average) are from the target
 Variance:
  The variance measures how spread out the estimates are around their mean
Solutions for Overfitting: Regularization

 Let's suppose there are some darts players who want to hit the center of the target (approximate a point), and each player shoots many darts.

 This is how the target would look depending on how biased the shots are (how far their average is from the target) and how spread out they are (variance).
Solutions for Overfitting: Regularization

 It can be shown that the Mean Squared Error actually makes a tradeoff between variance and bias.

 Let's say we have some points $(x, y)$ and there is a relation between $x$ and $y$ such that $y = f(x)$. We want to find the function $f$, so we will try to approximate it by $g$.

 In this case:
  $\text{bias}[g(x)] = E[g(x)] - f(x)$
  $\text{var}[g(x)] = E\big[(g(x) - E[g(x)])^2\big]$
  $\text{MSE}(f) = E\big[(g(x) - f(x))^2\big]$
 where $E$ denotes the expected value (mean)
Solutions for Overfitting: Regularization

 We will use the following lemma:
  $E[X^2] = E\big[(X - E[X])^2\big] + E[X]^2$

 Proof:
  $E\big[(X - E[X])^2\big] = E\big[X^2 + E[X]^2 - 2XE[X]\big] = E[X^2] + E\big[E[X]^2\big] - 2E\big[XE[X]\big]$
  $= E[X^2] + E[X]^2 - 2E[X]^2 = E[X^2] - E[X]^2$
 Adding $E[X]^2$ to both sides gives $E\big[(X - E[X])^2\big] + E[X]^2 = E[X^2]$
Solutions for Overfitting: Regularization

  $E\big[(g(x) - f(x))^2\big]$
  $= E\big[g(x)^2 - 2 g(x) f(x) + f(x)^2\big]$
  $= E[g(x)^2] - 2 E[g(x) f(x)] + E[f(x)^2]$
 Applying the lemma to both $E[g(x)^2]$ and $E[f(x)^2]$:
  $= E\big[(g(x) - E[g(x)])^2\big] + E[g(x)]^2 - 2 E[g(x) f(x)] + E\big[(f(x) - E[f(x)])^2\big] + E[f(x)]^2$
  $= E\big[(g(x) - E[g(x)])^2\big] + \big(E[g(x)] - f(x)\big)^2 + E\big[(f(x) - E[f(x)])^2\big]$
  $= \text{variance} + \text{bias}^2 + \sigma^2\;(\text{noise})$


Solutions for Overfitting: Regularization

 $\text{MSE}(f) = \text{bias}^2 + \text{variance} + \sigma^2\,(\text{noise})$

 What this tells us is that, for a given value of the error, we have to trade off between variance and bias (since the noise doesn't depend on the estimator)

 Both high bias and high variance are negative properties for a model. A model with high bias tends to underfit the data, while a model with high variance tends to overfit

 The purpose is to find an equilibrium between the two.
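 A small numerical illustration of the tradeoff (a synthetic sketch, not from the slides): fit a low-degree and a high-degree polynomial to many noisy samples of a known function and estimate the bias² and variance of the prediction at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                               # the "true" function we pretend not to know
x_train = np.linspace(0, np.pi, 15)      # training inputs
x0, sigma = 1.0, 0.3                     # test point and observation-noise std

def fit_and_predict(degree):
    """Fit a polynomial of the given degree to one noisy sample of f and predict at x0."""
    y_noisy = f(x_train) + rng.normal(0, sigma, x_train.size)
    coeffs = np.polyfit(x_train, y_noisy, degree)
    return np.polyval(coeffs, x0)

for degree in (1, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
# Typically the degree-1 model shows the larger bias^2 (underfitting),
# while the degree-9 model shows the larger variance (overfitting).
```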


Solutions for Overfitting: Regularization

 The model on the left underfits the data. It has high bias (large distance from the targets), but low variance: if we move one point, the line would probably not change very much
 The model on the right overfits the data. It has low bias (there is an almost exact match of the target points) but high variance: if we move one point, there will probably be a big change in the model
 The model in the center is the right one, since it has lower bias than the one on the left and lower variance than the one on the right
Solutions for Overfitting: Regularization

 In order for a model to have high variance, it needs two things:

  A large number of weights (for example, the line in the first image has fewer weights than the polynomial in the right image)
  Large values for the weights. This is needed in order to be able to make rapid shifts. Think of a parabola $ax^2 + bx + c$: the greater $a$ is, the narrower the parabola will be

 [Plots: $0.5x^2 + x + 1$ vs. $5x^2 + x + 1$]
Solutions for Overfitting: Regularization

 So, in order to make the function smooth (low variance) we need to have small weights (even zero ones).

 We modify the cost function by introducing a term that penalizes large weights:
  $C = C_0 + \frac{\lambda}{2n} \sum_w w^2$
 where $C_0$ is the original cost function and $\lambda$ is called the regularization parameter

 So, the cross-entropy cost will look like this:
  $C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1 - y_j) \ln(1 - a_j^L) \right] + \frac{\lambda}{2n} \sum_w w^2$

 And the Mean Squared Error (quadratic cost):
  $C = \frac{1}{2n} \sum_x \sum_j \left( y_j - a_j^L \right)^2 + \frac{\lambda}{2n} \sum_w w^2$
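 A minimal NumPy sketch of the L2-regularized cross-entropy above (variable names and shapes are illustrative assumptions):

```python
import numpy as np

def l2_cross_entropy_cost(A, Y, weights, lmbda):
    """Cross-entropy over n examples plus the (lambda / 2n) * sum(w^2) penalty.

    A, Y    : arrays of shape (n_outputs, n_examples) -- output activations and targets
    weights : list of the network's weight matrices
    lmbda   : the regularization parameter
    """
    n = Y.shape[1]
    ce = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / n
    l2 = (lmbda / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return ce + l2
```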
Solutions for Overfitting: Regularization

 The $\lambda$ (regularization parameter) in $C = C_0 + \frac{\lambda}{2n} \sum_w w^2$ controls how much bias vs. variance we want.
  If $\lambda = 0$, then we have the standard cost function

 The bigger $\lambda$ is, the more bias (and less variance) we get. If $\lambda$ is too big, we can go into undertraining (underfitting)

 This is called L2 regularization, and it is also known as weight decay


Solutions for Overfitting: Regularization

 The easiest way to introduce regularization into backpropagation is to adjust how the cost varies with respect to the weights, $\frac{\partial C}{\partial w}$ (since the weights are the only parameters affected):

  $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$

  $w = w - \eta \frac{\partial C_0}{\partial w} - \eta \frac{\lambda}{n} w = \left(1 - \eta \frac{\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$

 Since we'll most likely use SGD, the update we will actually use is:

  $w = \left(1 - \eta \frac{\lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$
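 A sketch of this update rule for one mini-batch (a hedged NumPy-style illustration; grads is assumed to hold the summed $\partial C_x / \partial w$ over the mini-batch):

```python
def sgd_step_l2(weights, grads, eta, lmbda, n, m):
    """One SGD step with L2 weight decay: w <- (1 - eta*lmbda/n) * w - (eta/m) * sum_x dCx/dw.

    weights : list of weight matrices      grads : matching list of summed gradients
    eta     : learning rate                lmbda : regularization parameter
    n       : training-set size            m     : mini-batch size
    """
    return [(1 - eta * lmbda / n) * w - (eta / m) * g
            for w, g in zip(weights, grads)]
```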
Solutions for Overfitting: Regularization

 Let's train the ANN again on the 1000-image training set, but this time with regularization: $\lambda = 0.1$

 As can be seen, the test accuracy continues to increase (even after iteration 300).

 We have also achieved a greater accuracy than before: 87.5% vs. 82.6%
Solutions for Overfitting: Regularization

 The same experiment, but this time using the entire data set and $\lambda = 5$

 The difference between the test accuracy and the training accuracy is also smaller (1.13%)

 The overall accuracy is greater than before (96.34% vs. 94.86%)
Solutions for Overfitting: Regularization

 Another variant of regularization is L1, where instead of the squares of the weights we use their absolute values:
  $C = C_0 + \frac{\lambda}{n} \sum_w |w|$

 In order to use this in our backpropagation algorithm, we must first see how it changes the cost derivative:
  $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} \operatorname{sgn}(w)$, where $\operatorname{sgn}(w)$ is the sign of $w$
  $w = w - \eta \frac{\lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$
 Or, if we use SGD:
  $w = w - \eta \frac{\lambda}{n} \operatorname{sgn}(w) - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$
Solutions for Overfitting: Regularization

 So, how is L1 different from L2?

  L1: $w = w - \eta \frac{\lambda}{n} \operatorname{sgn}(w) - \eta \frac{\partial C_0}{\partial w}$
  L2: $w = \left(1 - \eta \frac{\lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$

 L1 drives the weights down by a constant amount:
  - if some weights are very big, they won't be reduced by much (relative to their size)
  - if some weights are small, they will probably be driven to zero

 L2 drives the weights down by an amount that depends on the value of $w$:
  - for large weights, this means a large decrease
  - for small weights, this means a small decrease

 L1 tends to build the model on a few important weights, while L2 uses more weights but with smaller values
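 A tiny numerical illustration of this difference (a sketch in which $\partial C_0 / \partial w$ is set to zero, so only the two penalty terms act):

```python
import numpy as np

eta, lmbda, n = 0.5, 5.0, 1000.0
w_l1 = np.array([4.0, 0.01])    # one large and one small weight
w_l2 = w_l1.copy()

for _ in range(500):
    w_l1 = w_l1 - eta * (lmbda / n) * np.sign(w_l1)   # L1: constant shrinkage
    w_l2 = (1 - eta * lmbda / n) * w_l2               # L2: proportional shrinkage

print(w_l1)   # ~[2.75, ~0]   : the small weight is driven to (and then oscillates around) zero
print(w_l2)   # ~[1.15, 0.003]: both weights shrank by the same factor, neither is exactly zero
```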
Solutions for Overfitting: Dropout

 The idea of this technique is to make the network use fewer weights during training

 On each minibatch that we train on, we randomly select ½ of the hidden neurons and make them invisible (including, of course, the weights that go into or out of them)

 [Diagram: a network with half of the hidden-layer neurons dropped out]
Solutions for Overfitting: Dropout

 We update the visible weights as before, using backpropagation
 We restore the invisible hidden neurons and their connections
 We then select another minibatch, for which we randomly select another ½ of the hidden neurons to make invisible

 After we have finished the training, we restore all the hidden neurons and halve the weights from the hidden neurons to the output neurons
Solutions for Overfitting: Dropout

 If neuron 1 outputs the right value 80% of the time
 If neuron 2 outputs only random values
 Then neuron 3 will most likely compute:
  $\text{activation}_1 \cdot 1 + \text{activation}_2 \cdot 0$

 The gradients will get propagated only through neurons 3 and 1

 This means the network uses less than its potential (since neuron 2 will never be used)
Solutions for Overfitting: Dropout

 Improvement of the accuracy by using 20% dropout in the hidden layer, when used on the entire dataset
Solutions for Overfitting: Dropout

Other advantages:
 Makes the network robust to the loss of any individual neuron
 Makes it more efficient, since more neurons will participate in learning

How it is usually implemented (see the sketch after this list):

During training:
 1. Given the n activations from the previous layer, generate n numbers that are 1 with probability (1 - p) and 0 otherwise (Bernoulli distribution)
 2. Multiply the activations with the generated vector (element-wise) and also multiply by $\frac{1}{1 - p}$
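 A minimal NumPy sketch of these two steps (this scaled variant is usually called inverted dropout); the shapes and names are illustrative:

```python
import numpy as np

def dropout_forward(activations, p, rng, training=True):
    """Drop each activation with probability p; scale survivors by 1/(1-p) during training
    so that no extra rescaling is needed at test time."""
    if not training or p == 0.0:
        return activations
    mask = rng.binomial(1, 1.0 - p, size=activations.shape)   # 1 with probability (1 - p)
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = rng.random((36, 10))                     # 36 hidden activations for a mini-batch of 10
dropped = dropout_forward(hidden, p=0.5, rng=rng)
```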
Solutions for Overfitting: Dropout

 Improvement of the accuracy by using 50% dropout in the hidden layer, on a dataset of just 1000 elements
Solutions for Overfitting: Maxnorm

Maxnorm tries to limit large weights, but on each individual neuron.

 1. The weight updates are performed as usual
 2. If $\lVert w \rVert_2 > c$, i.e. $\sqrt{\sum_i w_i^2} > c$, then
 3. rescale the weights:
  $w = w \cdot \frac{c}{\sqrt{\sum_i w_i^2}}$
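 A hedged NumPy sketch of this constraint, applied row-wise (one row of W per neuron's incoming weights; the names and the usage below are illustrative):

```python
import numpy as np

def max_norm_constraint(W, c):
    """Rescale every row of W whose L2 norm exceeds c so that its norm becomes exactly c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)          # per-neuron incoming-weight norms
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))     # shrink only the offending rows
    return W * scale

# Applied right after a normal (e.g. SGD + dropout) weight update:
rng = np.random.default_rng(0)
W = rng.normal(0.0, 2.0, size=(36, 784))
W = max_norm_constraint(W, c=5.0)    # c = 5, the value used in the experiment on the next slide
```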
Solutions for Overfitting: Maxnorm

 Improvement of the accuracy by using:
  50% dropout in the hidden layer
  a max norm of 5
  on a dataset of just 1000 elements
Solutions for Overfitting: Increase Dataset

 As we've seen earlier, when using a bigger dataset, our network didn't overfit that badly

 So, another idea is to increase the amount of data.

 Since it is sometimes difficult to find more training data, a solution is to add small perturbations to our existing data set using different transformations, like slightly rotating the images
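 A small sketch of this idea using SciPy image rotation (the ±10° range and the doubling of the set are illustrative choices, not taken from the slides):

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, rng, max_angle=10.0):
    """Return one slightly rotated copy of each 28x28 image (angle drawn uniformly at random)."""
    rotated = [rotate(img, rng.uniform(-max_angle, max_angle), reshape=False, mode="nearest")
               for img in images]
    return np.array(rotated)

rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))   # stand-in for the 1000 MNIST training images
augmented_training_set = np.concatenate([images, augment_with_rotations(images, rng)])
```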
Solutions for Overfitting: Increase Dataset

 In “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, by Patrice Simard, Dave Steinkraus, and John Platt (2003), the authors used a neural network with 800 hidden neurons to classify MNIST digits.

 They achieved an increase of accuracy from 98.4% to 98.9% by:
  Rotating
  Translating
  Skewing the images

 By using “elastic distortions” (which should emulate random oscillations found in human muscles) they increased the accuracy to 99.3%
Questions & Discussion
Demo
Framework created by Andrej Karpathy
http://cs.stanford.edu/
Bibliography

 http://neuralnetworksanddeeplearning.com/
 http://www.kdnuggets.com/2015/04/preventing-overfitting-neural-networks.html
 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research 15 (https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)
 http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-overview/
