6 - Tips For Training Deep Neural Networks
Outline
Deep Neural Network
Parameters vs Hyperparameters
How to set network parameters
Bias / Variance Trade-off
Regularization Strategies
Batch normalization
Vanishing / Exploding gradients
Gradient Descent
Mini-batch Gradient Descent
Adagrad
Deep Neural Network
[Figure: a fully-connected deep network with inputs $x_1, \ldots, x_N$, hidden layers with weights $W^1, \ldots, W^L$ and biases $b^1, \ldots, b^L$, and outputs $y_1, \ldots, y_M$.]
$$a^1 = \sigma(W^1 x + b^1), \quad a^2 = \sigma(W^2 a^1 + b^2), \quad \ldots, \quad y = \sigma(W^L a^{L-1} + b^L)$$
Deep Neural Network
$$\theta = \{\, W^1, b^1, W^2, b^2, \ldots, W^L, b^L \,\}$$
[Figure: handwritten-digit example. A $16 \times 16$ image gives $256$ inputs $x_1, \ldots, x_{256}$ (ink $\to 1$, no ink $\to 0$); the outputs score each digit class, e.g. $y_1 = 0.1$ ("is 1"), $y_2 = 0.7$ ("is 2"), $y_{10} = 0.2$ ("is 0").]
Set the network parameters $\theta$ such that:
Input a "1" $\Rightarrow$ $y_1$ has the maximum value.
Input a "2" $\Rightarrow$ $y_2$ has the maximum value.
How do we let the neural network achieve this? By learning the network parameters $\theta$.
Parameters vs Hyperparameters
A model parameter is a variable of the selected model that can be estimated by fitting the given data to the model.
A hyperparameter is a parameter from a prior distribution; it captures the prior belief before data is observed.
– These are the parameters that control the model parameters.
– In any machine learning algorithm, these parameters need to be set before training the model.
Training Set: We train the model on the training data.
Dev Set: After training the model, we check how well it performs on the dev set.
Test Set: When we have a final model, we evaluate it on the test set in order to get an unbiased estimate of how well our algorithm is doing.
Train / Dev / Test sets
[Figure: typical splits of the data. A common split is Training 60% / Dev 20% / Test 20% (or Training 70% with the rest for Dev/Test); with very large datasets the training set can be as large as 98%, with Dev and Test sets around 1% each.]
Bias / Variance Trade-off
Make sure the distribution of the dev/test set is the same as the training set.
– Divide the training, dev and test sets in such a way that their distributions are similar.
– It is also acceptable to skip the test set and validate the model using the dev set only.
We want our model to be just right, which means having low bias and low variance.
– Bias is the difference between the predicted value and the expected value.
– Variance is the amount that the estimate of the target function will change, given different training data.
Source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf
Regularization Strategies
Parameter Norm Penalties
The most traditional form of regularization applicable to deep learning is the concept of parameter norm penalties.
This approach limits the capacity of the model by adding a penalty $\Omega(\theta)$ to the objective function, resulting in:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta),$$
where $\alpha \ge 0$ weights the contribution of the penalty.
L2 Norm Parameter Regularization
Using the L2 norm, we add a constraint to the original loss function so that the weights of the network do not grow too large:
$$\Omega(w) = \frac{1}{2}\, \lVert w \rVert_2^2$$
L1 Norm Parameter Regularization
The L1 norm is another option that can be used to penalize the size of the model parameters.
L1 regularization on the model parameters $w$ is:
$$\Omega(w) = \lVert w \rVert_1 = \sum_i \lvert w_i \rvert$$
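To make the two penalties concrete, here is a minimal NumPy sketch (the helper names and the toy squared-error objective are my own illustration, not code from the lecture):

```python
import numpy as np

def l2_penalty(w, alpha):
    # (alpha / 2) * ||w||_2^2
    return 0.5 * alpha * np.sum(w ** 2)

def l1_penalty(w, alpha):
    # alpha * ||w||_1
    return alpha * np.sum(np.abs(w))

def regularized_loss(w, X, y, alpha, penalty=l2_penalty):
    # toy base objective: mean squared error of a linear model
    base = np.mean((X @ w - y) ** 2)
    return base + penalty(w, alpha)

# tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
w = rng.normal(size=5)
print(regularized_loss(w, X, y, alpha=0.1, penalty=l2_penalty))
print(regularized_loss(w, X, y, alpha=0.1, penalty=l1_penalty))
```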
Early Stopping
When training models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, while the error on the validation set begins to rise again (or stays the same) after a certain number of iterations; at that point there is no benefit in training the model further.
This means we can obtain a model with better validation set error (and thus, hopefully, better test set error) by returning to the parameter setting at the point in time with the lowest validation set error.
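A minimal sketch of this idea (the `train_one_epoch` and `validation_error` callables are hypothetical placeholders for whatever training loop is being used):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=10):
    """Keep the parameters that achieved the lowest validation error."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training set
        val_err = validation_error(model)      # error on the dev/validation set

        if val_err < best_error:
            best_error = val_err
            best_model = copy.deepcopy(model)  # remember the best parameters so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # stop: validation error no longer improves

    return best_model, best_error
```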
Parameter Tying
Sometimes we might not know which region the parameters should lie in, but we do know that there are some dependencies between them.
Parameter tying refers to explicitly forcing the parameters of two models to be close to each other through a norm penalty.
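As a small illustration, one common formulation of such a penalty is $\Omega(w_A, w_B) = \lVert w_A - w_B \rVert_2^2$; the sketch below (my assumption, not code from the slides) shows how it could be computed:

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    # alpha * ||w_A - w_B||_2^2 : pushes the two parameter vectors towards each other
    return alpha * np.sum((w_a - w_b) ** 2)

w_a = np.array([0.5, -1.0, 2.0])
w_b = np.array([0.4, -1.2, 1.5])
print(tying_penalty(w_a, w_b, alpha=0.1))
```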
Dropout
Dropout is a bagging method.
– Bagging is a method of averaging over several models to improve generalization.
– It is usually impractical to train many separate neural networks, since doing so is expensive in time and memory; dropout is a way of applying bagging to neural networks.
Dropout is an inexpensive but powerful method of regularizing a broad family of models.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
Dropout - Intuitive Reason
Training: before each parameter update, each neuron is dropped with some probability p%, so each update trains a thinner sub-network.
Dropout
Testing: no dropout is applied.
If the dropout rate at training is p%, all the weights are multiplied by (1-p)% at testing.
Assume that the dropout rate is 50%: if a weight has value $w$ after training, set it to $0.5\,w$ for testing.
Why should the weights be multiplied by (1-p)% (one minus the dropout rate) when testing?
With two inputs and a dropout rate of 50%, training computes one of $z = w_1 x_1 + w_2 x_2$, $z = w_2 x_2$, $z = w_1 x_1$, or $z = 0$, depending on which units are dropped. The expectation over these cases is
$$\mathbb{E}[z] = \tfrac{1}{2} w_1 x_1 + \tfrac{1}{2} w_2 x_2,$$
which is exactly what testing (with no dropout) computes when the weights are halved.
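A minimal NumPy sketch of the two phases described above (helper names are assumptions; the test-time scaling is applied to the activations here, which is equivalent to scaling the outgoing weights as the slide states):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Drop each unit of the activation vector a with probability p."""
    mask = rng.random(a.shape) >= p      # keep each unit with probability (1 - p)
    return a * mask

def dropout_test(a, p):
    """No units are dropped; scale by (1 - p) to match the training-time expectation."""
    return a * (1.0 - p)

a = np.array([1.0, 2.0, 3.0, 4.0])
p = 0.5
print(dropout_train(a, p))   # roughly half of the units are zeroed
print(dropout_test(a, p))    # every unit scaled by 0.5
```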
Dropout is a kind of ensemble.
[Figure: ensemble training. The training set is sampled into subsets (Set 1, Set 2, Set 3, Set 4) and a separate network is trained on each subset.]
[Figure: ensemble testing. The testing data x is fed to all of the trained networks, and their outputs $y_1, y_2, y_3, y_4$ are averaged.]
Setting up your Optimization Problem
Normalizing Inputs
The range of values of raw training data often varies widely.
– Example: a "has kids" feature in {0, 1}.
– Value of a car: from about $500 to hundreds of thousands of dollars.
If one of the features has a broad range of values, the distance will be governed by this particular feature.
– After normalization, each feature contributes approximately proportionately to the final distance.
In general, gradient descent converges much faster with feature scaling than without it.
Feature Scaling
Given training examples $x^1, x^2, x^3, \ldots, x^r, \ldots, x^m$, compute for each dimension $i$ the mean $m_i$ and the standard deviation $\sigma_i$ over all examples, and then scale
$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}.$$
After scaling, the means of all dimensions are 0 and the variances are all 1.
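A minimal NumPy sketch of this standardization (the function name and toy data are illustrative assumptions):

```python
import numpy as np

def feature_scale(X):
    """Standardize each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # toy data with an arbitrary scale
X_scaled, mean, std = feature_scale(X)
print(X_scaled.mean(axis=0).round(6))   # approximately all zeros
print(X_scaled.std(axis=0).round(6))    # approximately all ones
```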
[Figure: processing a batch. Inputs $x^1, x^2, x^3$ are each multiplied by $W^1$ to give $z^1, z^2, z^3$, passed through the sigmoid to give $a^1, a^2, a^3$, then multiplied by $W^2$, and so on. The whole batch can be processed at once:
$$[\,z^1 \; z^2 \; z^3\,] = W^1 [\,x^1 \; x^2 \; x^3\,].$$]
Batch normalization
For a batch of three examples, compute $z^i = W^1 x^i$ and then the batch statistics
$$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i, \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3} (z^i - \mu)^2}.$$
Note that $\mu$ and $\sigma$ depend on the $z^i$.
Batch normalization
Each $z^i$ is normalized and then passed through the activation:
$$\tilde{z}^i = \frac{z^i - \mu}{\sigma + \varepsilon}, \qquad a^i = \mathrm{sigmoid}(\tilde{z}^i).$$
Again, $\mu$ and $\sigma$ depend on the $z^i$.
Batch normalization happens between computing $z$ and computing $a$: the intuition is that, instead of using the un-normalized value $z$, you use the normalized value $\tilde{z}$.
Batch normalization
Setting the mean to 0 and the variance to 1 works for most applications, but in the actual implementation we do not want the hidden units to always have mean 0 and variance 1.
Therefore, we replace $\tilde{z}^i$ with
$$\hat{z}^i = \gamma \odot \tilde{z}^i + \beta,$$
where $\gamma$ and $\beta$ are learnable network parameters.
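A minimal NumPy sketch of the forward pass described above (assumed helper name; note that here $\varepsilon$ is added to the variance inside the square root, a common variant of the $(\sigma + \varepsilon)$ form on the slide):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each unit over the batch, then rescale with learnable gamma, beta.

    Z:     (batch_size, num_units) pre-activations
    gamma: (num_units,) learnable scale
    beta:  (num_units,) learnable shift
    """
    mu = Z.mean(axis=0)                      # per-unit mean over the batch
    var = Z.var(axis=0)                      # per-unit variance over the batch
    Z_tilde = (Z - mu) / np.sqrt(var + eps)  # ~zero mean, unit variance per unit
    Z_hat = gamma * Z_tilde + beta           # restore representational flexibility
    return Z_hat, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=2.0, scale=4.0, size=(8, 3))
gamma, beta = np.ones(3), np.zeros(3)
Z_hat, mu, var = batch_norm_forward(Z, gamma, beta)
print(Z_hat.mean(axis=0).round(6), Z_hat.var(axis=0).round(6))  # ~0 means, ~1 variances
```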
Batch normalization at testing time
$$\tilde{z} = \frac{z - \mu}{\sigma}, \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$$
$\mu$ and $\sigma$ are computed from the batch; $\gamma$ and $\beta$ are network parameters.
We do not have a batch at the testing stage.
Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.
Practical solution: compute the moving average of the $\mu$ and $\sigma$ of the batches during training (e.g. $\mu^1, \ldots, \mu^{100}, \ldots, \mu^{300}$ across updates).
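A small sketch of the practical solution (the momentum value 0.9 and the helper names are assumptions for illustration):

```python
import numpy as np

def update_running_stats(running_mu, running_var, batch_mu, batch_var, momentum=0.9):
    """Exponential moving average of the batch statistics, maintained during training."""
    running_mu = momentum * running_mu + (1.0 - momentum) * batch_mu
    running_var = momentum * running_var + (1.0 - momentum) * batch_var
    return running_mu, running_var

def batch_norm_test(z, running_mu, running_var, gamma, beta, eps=1e-5):
    """At testing time there is no batch, so use the stored running statistics."""
    z_tilde = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_tilde + beta

# toy usage: accumulate statistics over a few "batches" during training
rng = np.random.default_rng(0)
running_mu, running_var = np.zeros(3), np.ones(3)
for _ in range(100):
    Z = rng.normal(loc=2.0, scale=4.0, size=(8, 3))
    running_mu, running_var = update_running_stats(
        running_mu, running_var, Z.mean(axis=0), Z.var(axis=0))
print(running_mu.round(2), running_var.round(2))  # running estimates of the per-unit statistics
```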
Why does normalizing the data make the algorithm faster?
[Figure: contour plots of the cost $J$ over $w$ and $b$ for unnormalized and normalized data.]
In the case of unnormalized data, the scale of the features will vary, which makes the cost function asymmetric.
It will take longer to converge to the minimum because the parameters for larger-scale features will dominate the updates.
In the case of normalized data, the scales are the same and the cost function is symmetric.
This makes it easier for gradient descent to find the global minimum quickly, which in turn makes the algorithm run much faster.
Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
Vanishing / Exploding gradients
When you're training a very deep network, sometimes the derivatives can get either very small or very big, and this makes training difficult.
[Figure: a deep network with weight matrices $W^1, W^2, \ldots, W^{L-1}, W^L$, where]
$$z^1 = W^1 x, \quad z^2 = W^2 z^1, \quad \ldots, \quad z^{L-1} = W^{L-1} z^{L-2}, \quad \hat{y} = W^L z^{L-1},$$
so $\hat{y} = W^L W^{L-1} \cdots W^2 W^1 x$: if the weights are even slightly larger or smaller than the identity, the activations and gradients grow or shrink exponentially with the depth $L$.
Solutions: Vanishing / Exploding gradients
Use a good initialization.
– Random initialization: the primary reason for initializing the weights randomly is to break symmetry, so that different hidden units learn different patterns.
Do not use the sigmoid for deep networks.
– Problem: saturation. The sigmoid tends to saturate for very large positive or negative inputs, leading to the vanishing gradient problem, especially in deep networks.
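Relating to the initialization point above, here is a minimal sketch of random initialization that breaks symmetry (the $1/\sqrt{\text{fan-in}}$ scaling is a common Xavier-style choice and is my assumption; the slide only calls for random initialization):

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    # small random weights break symmetry so hidden units learn different patterns;
    # scaling by 1/sqrt(fan_in) keeps activations from growing or shrinking with depth
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)          # biases can safely start at zero
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(256, 100, rng)
W2, b2 = init_layer(100, 10, rng)
print(W1.std().round(3))           # roughly 1/sqrt(256) ≈ 0.0625
```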
[Figure: the ReLU activation, $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$, compared with the sigmoid $\sigma(z)$.]
Reasons for using ReLU:
1. Fast to compute: no complex operations like the exponentials in the sigmoid or tanh functions.
2. It helps with the vanishing gradient problem.
[Figure: ReLU, $a = z$ for $z > 0$ and $a = 0$ otherwise. In a ReLU network, the units whose input falls in the zero region output 0.]
[Figure: removing the units that output 0 leaves a thinner, linear network from the inputs $x_1, x_2$ to the outputs $y_1, y_2$; in the linear region the activation does not produce smaller gradients.]
ReLU - variant
Variants keep a small, non-zero output for negative inputs: Leaky ReLU uses $a = 0.01\,z$ for $z < 0$, and Parametric ReLU uses $a = \alpha z$ for $z < 0$, where $\alpha$ is learnable.
This helps to avoid the "dying ReLU" problem of the standard ReLU, where a neuron with a negative bias may never activate and become "dead."
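A minimal NumPy sketch of the standard ReLU and these two variants (function names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # small slope for z < 0 keeps a non-zero gradient, avoiding "dead" units
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # same form as leaky ReLU, but alpha is a learnable parameter
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(leaky_relu(z))
print(parametric_relu(z, alpha=0.1))
```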
Gradient Descent
Assume there are only two parameters $w_1$ and $w_2$ in the network: $\theta = \{w_1, w_2\}$.
On the error surface, starting from a point $\theta^0$, compute the gradient
$$\nabla C(\theta^0) = \begin{bmatrix} \partial C(\theta^0)/\partial w_1 \\ \partial C(\theta^0)/\partial w_2 \end{bmatrix}$$
and move by $-\eta\, \nabla C(\theta^0)$, where $\eta$ is the learning rate.
Gradient Descent
Randomly pick a starting point $\theta^0$.
Compute the negative gradient at $\theta^0$: $-\nabla C(\theta^0)$.
Multiply by the learning rate and update: $\theta^1 = \theta^0 - \eta\, \nabla C(\theta^0)$.
Repeat: $\theta^2 = \theta^1 - \eta\, \nabla C(\theta^1)$, and so on.
Eventually, we reach a minimum.
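A minimal sketch of this update rule on a toy two-parameter cost (the quadratic cost function is made up purely for illustration):

```python
import numpy as np

def C(theta):
    # toy convex error surface over (w1, w2)
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def grad_C(theta):
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

eta = 0.1                                  # learning rate
theta = np.array([0.0, 0.0])               # randomly pick a starting point (here: the origin)
for _ in range(100):
    theta = theta - eta * grad_C(theta)    # move against the gradient
print(theta, C(theta))                     # approaches the minimum at (3, -1)
```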
Gradient Descent
– Pros
Guaranteed to converge to the global minimum for convex error surfaces; converges to a local minimum for non-convex error surfaces.
– Cons
Very slow.
Intractable for datasets that do not fit in memory.
[Figure: a non-convex cost $C(w_1, w_2)$; different starting points can reach different minima, and therefore different results.]
Gradient Descent: Practical Issues
Mini-batch
Randomly initialize $\theta^0$.
Pick the 1st mini-batch (e.g. $x^1, x^{31}, \ldots$): compute $C = L^1 + L^{31} + \cdots$ and update $\theta^1 \leftarrow \theta^0 - \eta\, \nabla C(\theta^0)$.
Pick the 2nd mini-batch (e.g. $x^2, x^{16}, \ldots$): compute $C = L^2 + L^{16} + \cdots$ and update $\theta^2 \leftarrow \theta^1 - \eta\, \nabla C(\theta^1)$.
C is different each time we update the parameters!
Mini-batch
Mini-batch training is faster, and often better.
The same procedure is repeated: pick the 1st mini-batch, compute $C = C^1 + C^{31} + \cdots$, update $\theta^1 \leftarrow \theta^0 - \eta\, \nabla C(\theta^0)$; pick the 2nd mini-batch, compute $C = C^2 + C^{16} + \cdots$, update $\theta^2 \leftarrow \theta^1 - \eta\, \nabla C(\theta^1)$; and so on.
When all mini-batches have been picked once, that is one epoch.
Source: https://www.coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent
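A minimal NumPy sketch of the mini-batch procedure above, using a toy linear-regression loss as a stand-in for the network's cost C (the data, model, and batch size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy linear-regression data; the model and loss are assumptions for illustration
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta = 0.05
batch_size = 32

for epoch in range(20):                          # one epoch = all mini-batches picked once
    order = rng.permutation(len(X))              # shuffle, then split into mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of this mini-batch's loss
        w = w - eta * grad                       # update after every mini-batch
print(w.round(2))                                # close to true_w
```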
Learning Rate
Set the learning rate $\eta$ carefully.
If the learning rate is too large, the update $-\eta\, \nabla C(\theta^0)$ may overshoot the minimum.
Can we give different parameters different learning rates?
Adagrad
Divide the learning rate by the "average" gradient.
• The high-dimensional, non-convex nature of neural network optimization can lead to a different sensitivity on each dimension.
• The learning rate could be too small in some dimensions and too large in others.
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sigma^{t}}\, g^{t}$$
$$w^{1} \leftarrow w^{0} - \frac{\eta}{\sigma^{0}}\, g^{0}, \qquad \sigma^{0} = \sqrt{(g^{0})^{2}}$$
$$w^{2} \leftarrow w^{1} - \frac{\eta}{\sigma^{1}}\, g^{1}, \qquad \sigma^{1} = \sqrt{\tfrac{1}{2}\left[(g^{0})^{2} + (g^{1})^{2}\right]}$$
$$w^{3} \leftarrow w^{2} - \frac{\eta}{\sigma^{2}}\, g^{2}, \qquad \sigma^{2} = \sqrt{\tfrac{1}{3}\left[(g^{0})^{2} + (g^{1})^{2} + (g^{2})^{2}\right]}$$
$$\cdots$$
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sigma^{t}}\, g^{t}, \qquad \sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^{2}}$$
Adagrad
Divide the learning rate by the "average" gradient.
– The "average" gradient is obtained while updating the parameters.
$$w^{t+1} \leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}}\, g^{t}, \qquad \sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^{2}}, \qquad \eta^{t} = \frac{\eta}{\sqrt{t+1}} \;\; (1/t \text{ decay})$$
Combining the two:
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}}\, g^{t}$$
Adagrad
Original gradient descent: $\theta^{t} \leftarrow \theta^{t-1} - \eta\, \nabla C(\theta^{t-1})$.
Adagrad uses a parameter-dependent learning rate
$$\eta_{w} = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}},$$
where $\eta$ is a constant and the denominator is the summation of the squares of the previous derivatives.
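A minimal NumPy sketch of the combined Adagrad update above (the small `eps` term for numerical stability and the toy gradient are my additions):

```python
import numpy as np

def adagrad_update(w, g, accum, eta=0.1, eps=1e-8):
    """One Adagrad step: divide the learning rate by the root of the summed squared gradients."""
    accum = accum + g ** 2                       # running sum of (g^i)^2
    w = w - eta / (np.sqrt(accum) + eps) * g     # per-parameter effective learning rate
    return w, accum

# toy quadratic with very different curvature per dimension
def grad(w):
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
accum = np.zeros(2)
for _ in range(200):
    w, accum = adagrad_update(w, grad(w), accum)
print(w.round(3))                                # both coordinates have moved towards 0
```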
Acknowledgement
http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf
https://heartbeat.fritz.ai/deep-learning-best-practices-regularization-techniques-for-better-performance-of-neural-network-94f978a4e518
https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.pdf
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/DNN%20tip.pptx
Accelerating Deep Network Training by Reducing Internal Covariate Shift, Jude W. Shavlik
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/ForDeep.pptx
Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.
On Predictive and Generative Deep Neural Architectures, Prof. Swagatam Das, ISICAL