6 - Tips For Training Deep Neural Networks

The document provides an overview of training deep neural networks, covering key concepts such as parameters vs hyperparameters, the bias/variance trade-off, regularization strategies, and gradient descent techniques. It emphasizes the importance of hyperparameter tuning, the effects of overfitting and underfitting, and various regularization methods like dropout and batch normalization. Additionally, it discusses the significance of feature scaling and normalization in improving model performance.


Tips for Training Deep Neural Networks

Outline
 Deep Neural Network
 Parameters vs Hyperparameters
 How to set network parameters
 Bias / Variance Trade-off
 Regularization Strategies
 Batch normalization
 Vanishing / Exploding gradients
 Gradient Descent
 Mini-batch Gradient Descent
 Adagrad
Deep Neural Network

[Diagram: a fully connected network. Inputs x1 … xN feed layers with weight matrices W1, …, WL and biases b1, …, bL, producing activations a1, a2, … and outputs y1 … yM.]

$a^1 = \sigma(W^1 x + b^1)$
$a^2 = \sigma(W^2 a^1 + b^2)$
…
$y = \sigma(W^L a^{L-1} + b^L)$
Deep Neural Network

$\theta = \{W^1, b^1, W^2, b^2, \cdots, W^L, b^L\}$

[Diagram: digit recognition. A 16 × 16 image gives 256 inputs x1 … x256, with ink → 1 and no ink → 0. The outputs score each digit, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0").]

Set the network parameters $\theta$ such that:
– Input "1": y1 has the maximum value
– Input "2": y2 has the maximum value

How do we let the neural network achieve this? By learning the network parameters $\theta$.
Parameters vs Hyperparameters
 A model parameter is a variable of the selected model which can be estimated by fitting the given data to the model.
 A hyperparameter is a parameter from a prior distribution; it captures the prior belief before data is observed.
– These are the parameters that control the model parameters.
– In any machine learning algorithm, these parameters need to be initialized before training a model.

Image Source: https://www.slideshare.net/AliceZheng3/evaluating-machine-learning-models-a-beginners-guide


Deep Neural Network: Parameters vs Hyperparameters
 Parameters:
– The weights W and biases b of each layer

 Hyperparameters:
– Learning rate in gradient descent
– Number of iterations in gradient descent
– Number of layers in a Neural Network
– Number of neurons per layer in a Neural Network
– Activation functions
– Mini-batch size
– Regularization parameters

Image Source: https://www.slideshare.net/AliceZheng3/evaluating-machine-learning-models-a-beginners-guide


Train / Dev / Test sets
 Hyperparameter tuning is a highly iterative process, where you
– start with an idea, i.e. start with a certain number of hidden layers, a certain learning rate, etc.
– try the idea by implementing it
– evaluate how well the idea has worked
– refine the idea and iterate this process
 Now how do we identify whether the idea is working? This is where the train / dev / test sets come into play.

– Training set: we train the model on the training data.
– Dev set: after training the model, we check how well it performs on the dev set.
– Test set: when we have a final model, we evaluate it on the test set in order to get an unbiased estimate of how well our algorithm is doing.
Train / Dev / Test sets

– Small datasets (previously): Training 60% / Dev 20% / Test 20%, or Training 70% with the rest split between dev and test.
– Large datasets (today): Training 98% / Dev 1% / Test 1%.

Previously, when we had small datasets, the data was most often distributed across the sets as above. As the availability of data has increased in recent years, we can use a huge slice of it for training the model.
Bias / Variance Trade-off
 Make sure the distribution of the dev/test set is the same as the training set
– Divide the training, dev and test sets in such a way that their distribution is similar
– Skip the test set and validate the model using the dev set only
 We want our model to be just right, which means having low bias and low variance.
– Bias is the difference between the predicted value and the expected value.
– Variance is the amount that the estimate of the target function will change, given different training data.

Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/


Bias / Variance Trade-off

 Overfitting: If the dev set error is much higher than the train set error, the model is overfitting and has high variance.
 Underfitting: When both train and dev set errors are high, the model is underfitting and has high bias.

Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/


Overfitting in Deep Neural Nets
 Deep neural networks contain multiple non-linear hidden layers
– This makes them very expressive models that can learn very complicated relationships between their inputs and outputs.
– In other words, the model learns even the tiniest details present in the data.
 But with limited training data, many of these complicated relationships will be the result of sampling noise
– So they will exist in the training set but not in real test data, even if it is drawn from the same distribution.
– So after learning all the possible patterns it can find, the model tends to perform extremely well on the training set but fails to produce good results on the dev and test sets.
Regularization
 Regularization is:
– "any modification to a learning algorithm to reduce its generalization error but not its training error"
– Reduce generalization error even at the expense of increasing training error
 E.g., limiting model capacity is a regularization method

Source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf
Regularization Strategies

Parameter Norm Penalties
 The most traditional form of regularization applicable to deep learning is the concept of parameter norm penalties.
 This approach limits the capacity of the model by adding a penalty $\Omega(\theta)$ to the objective function, resulting in:

$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$

 $\alpha$ is a hyperparameter that weights the relative contribution of the norm penalty to the value of the objective function.
L2 Norm Parameter Regularization
 Using the L2 norm, we add a constraint to the original loss function such that the weights of the network don't grow too large.

 Assuming there are no bias parameters, only weights:

$\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^\top w$

 By adding the regularized term, we're fooling the model such that it won't drive the training error to zero, which in turn reduces the complexity of the model.
L1 Norm Parameter Regularization
 The L1 norm is another option that can be used to penalize the size of model parameters.
 L1 regularization on the model parameters w is:

$\Omega(w) = \|w\|_1 = \sum_i |w_i|$

 The L2 norm penalty decays the components of the vector w that do not contribute much to reducing the objective function.
 On the other hand, the L1 norm penalty provides solutions that are sparse.
 This sparsity property can be thought of as a feature selection mechanism.
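A minimal sketch of both penalties (the `alpha` value and the list-of-matrices representation of the weights are illustrative assumptions):

```python
import numpy as np

def l2_penalty(weights, alpha=0.01):
    # (alpha / 2) * ||w||_2^2, summed over all weight matrices (biases excluded)
    return alpha / 2 * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, alpha=0.01):
    # alpha * ||w||_1 -- tends to drive many weights exactly to zero (sparsity)
    return alpha * sum(np.sum(np.abs(W)) for W in weights)

def regularized_loss(data_loss, weights, penalty=l2_penalty):
    # J~(w; X, y) = J(w; X, y) + alpha * Omega(w)
    return data_loss + penalty(weights)
```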
Early Stopping
 When training models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, while the error on the validation set begins to rise again (or stays the same) after a certain number of iterations; beyond that point there is no benefit in training the model further.
 This means we can obtain a model with better validation set error (and thus, hopefully, better test set error) by returning to the parameter setting at the point in time with the lowest validation set error.
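A minimal sketch of this procedure, assuming hypothetical `train_one_epoch` and `dev_error` callables supplied by the caller (the `patience` rule is one common stopping criterion, not prescribed by the slides):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, dev_error,
                              patience=5, max_epochs=200):
    """Stop when dev error has not improved for `patience` epochs and
    return the parameters from the epoch with the lowest dev error."""
    best_err, best_model, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = dev_error(model)
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # validation error stopped improving
    return best_model, best_err
```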
Parameter Tying
 Sometimes we might not know which region the parameters would lie in, but we do know that there are some dependencies between them.
 Parameter tying refers to explicitly forcing the parameters of two models to be close to each other, through the norm penalty:

$\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$

 Here, $w^{(A)}$ refers to the weights of the first model while $w^{(B)}$ refers to those of the second one.
Dropout
 Dropout is a bagging method
– Bagging is a method of averaging over several models to improve generalization
 It is impractical to train many separate neural networks, since doing so is expensive in time and memory
– Dropout is a method of bagging applied to neural networks
 Dropout is an inexpensive but powerful method of regularizing a broad family of models
 Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
Dropout - Intuitive Reason

 When people team up, if everyone expects their partner to do the work, nothing gets done in the end.
 However, if you know your partner might drop out, you will do better yourself.
 When testing, no one actually drops out, so good results are obtained eventually.
Dropout

Training:
 Each neuron has probability p% of being dropped out
 Dropping neurons changes the structure of the network, making it thinner
 The new, thinner network is used for training
Dropout

Testing:
 No dropout
 If the dropout rate at training is p%, multiply all the weights by (1-p)%
 Example: assume the dropout rate is 50%. If a weight w = 1 was learned by training, set w = 0.5 for testing.
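A minimal sketch of one linear layer under this scheme (the shapes and the use of a module-level RNG are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_train(a, W, p=0.5):
    """Training: each input unit is dropped with probability p."""
    mask = rng.random(a.shape) >= p          # keep each unit with prob. 1 - p
    return (a * mask) @ W

def layer_test(a, W, p=0.5):
    """Testing: no units are dropped; every weight is multiplied by (1 - p)
    so the expected pre-activation matches training (p = 0.5 halves w)."""
    return a @ (W * (1 - p))
```

In practice most frameworks use the equivalent "inverted dropout", which divides by (1 - p) during training so the test-time forward pass needs no change.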
Why should the weights be multiplied by (1-p)% (the keep probability) when testing?

With two inputs x1, x2, weights w1, w2, and a dropout rate of 50%, there are four possible sub-networks:

z = w1x1 + w2x2    z = w2x2    z = w1x1    z = 0

Their average is $z = \frac{1}{2} w_1 x_1 + \frac{1}{2} w_2 x_2$, which is exactly the full network with each weight multiplied by (1-p) = ½.
Dropout is a kind of ensemble.

[Diagram: a training set is split into sets 1-4, and networks 1-4 with different structures are trained on them.]

Train a bunch of networks with different structures.
Dropout is a kind of ensemble.

[Diagram: at testing, input x is fed to networks 1-4, producing y1-y4, which are averaged.]
Setting up your Optimization Problem

Normalizing Inputs
 The range of values of raw training data often varies widely
– Example: a "has kids" feature in {0, 1}
– Value of a car: $500 to hundreds of thousands of dollars
 If one of the features has a broad range of values, the distance will be governed by this particular feature.
– After normalization, each feature contributes approximately proportionately to the final distance.
 In general, gradient descent converges much faster with feature scaling than without it.
Feature Scaling

Given training examples $x^1, x^2, x^3, \cdots, x^r, \cdots, x^m$, for each dimension i:

mean: $m_i = \frac{1}{m}\sum_{r=1}^{m} x_i^r$

standard deviation: $\sigma_i = \sqrt{\frac{1}{m}\sum_{r=1}^{m}(x_i^r - m_i)^2}$

$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$

After scaling, the means of all dimensions are 0, and the variances are all 1.

In general, gradient descent converges much faster with feature scaling than without it.
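A minimal NumPy sketch of this standardization (the small `eps` guard against zero-variance features is an added assumption):

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)                  # m_i per dimension
    std = X.std(axis=0)                    # sigma_i per dimension
    return (X - mean) / (std + eps), mean, std

# Features on very different scales, as in the car-value example:
X_train = np.array([[500.0, 1.0], [90_000.0, 0.0], [30_000.0, 1.0]])
X_scaled, mu, sigma = standardize(X_train)
# The same mu and sigma must be reused to scale dev and test data.
```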
Internal Covariate Shift
• The first guy tells the second guy, "go water the plants", the second guy tells the third guy, "got water in your pants", and so on until the last guy hears, "kite bang eat face monkey" or something totally wrong.
• Let's say that the problems are entirely systemic and due entirely to faulty red cups. Then the situation is analogous to forward propagation.
• If we want to fix the problem by getting new cups through trial and error, it would help to have a consistent way of passing messages in a more controlled and standardized ("normalized") way, e.g. same volume, same language, etc.

"First layer parameters change and so the distribution of the input to your second layer changes."
Batch

[Diagram: three examples $x^1, x^2, x^3$ pass through the same layer: $z^i = W^1 x^i$, $a^i = \mathrm{sigmoid}(z^i)$, then on to $W^2$, …]

Processing a batch as a single matrix operation:

$\begin{bmatrix} z^1 & z^2 & z^3 \end{bmatrix} = W^1 \begin{bmatrix} x^1 & x^2 & x^3 \end{bmatrix}$
Batch normalization

[Diagram: $x^1, x^2, x^3$ produce $z^1, z^2, z^3$ via $W^1$.]

$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i$

$\sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3}(z^i - \mu)^2}$

$\mu$ and $\sigma$ depend on $z^i$.
Batch normalization

[Diagram: each $z^i$ is normalized to $\tilde{z}^i$ before the sigmoid produces $a^i$.]

$\tilde{z}^i = \frac{z^i - \mu}{\sigma + \varepsilon}$

Batch norm happens between computing Z and computing A. The intuition is that, instead of using the un-normalized value Z, you use the normalized value Z̃.
Batch normalization
 Setting the mean to 0 and the variance to 1 works for most applications, but in an actual implementation we don't want the hidden units to always have mean 0 and variance 1.
 Therefore, we replace $\tilde{z}^i$ with the following:

$\hat{z}^i = \gamma \odot \tilde{z}^i + \beta$

where $\gamma$ and $\beta$ are learnable parameters.

 $\tilde{z}^i$ is the special case of $\hat{z}^i$ at $\gamma = 1$ and $\beta = 0$.
Batch normalization at testing time

$\tilde{z} = \frac{z - \mu}{\sigma}$,  $\hat{z}^i = \gamma \odot \tilde{z}^i + \beta$

$\mu$, $\sigma$ are computed from the batch; $\gamma$, $\beta$ are network parameters.

We do not have a batch at the testing stage.

Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.

Practical solution: compute the moving average of $\mu$ and $\sigma$ over the batches during training (e.g. $\mu^1, \mu^{100}, \mu^{300}$ across updates).
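A minimal sketch tying the last few slides together (the momentum value, the `eps` placement following the slides' $\sigma + \varepsilon$ form, and tracking $\sigma$ rather than $\sigma^2$ are assumptions of this sketch; frameworks differ in these details):

```python
import numpy as np

class BatchNorm:
    """Batch normalization between computing z and the activation a."""
    def __init__(self, dim, eps=1e-5, momentum=0.9):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # learnable
        self.run_mu, self.run_sigma = np.zeros(dim), np.ones(dim)
        self.eps, self.momentum = eps, momentum

    def forward(self, z, training=True):
        if training:
            mu, sigma = z.mean(axis=0), z.std(axis=0)         # batch statistics
            # moving averages of mu and sigma, used at testing time
            self.run_mu = self.momentum * self.run_mu + (1 - self.momentum) * mu
            self.run_sigma = self.momentum * self.run_sigma + (1 - self.momentum) * sigma
        else:
            mu, sigma = self.run_mu, self.run_sigma
        z_tilde = (z - mu) / (sigma + self.eps)
        return self.gamma * z_tilde + self.beta               # z_hat
```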
Why does normalizing the data make the algorithm faster?
 In the case of unnormalized data, the scale of features will vary. This will make the cost function asymmetric, with elongated contours.
 It will take longer to converge to the minimum because the parameters for larger-scale features will dominate the updates.
 Whereas, in the case of normalized data, the scale will be the same and the cost function will also be symmetric.
 This makes it easier for the gradient descent algorithm to find the global minimum quickly, which in turn makes the algorithm run much faster.

[Figure: contours of the cost J over w and b for unnormalized (elongated) vs. normalized (symmetric) features.]

Image Source: https://www.analyticsvidhya.com/blog/2018/11/neural-networks-hyperparameter-tuning-regularization-deeplearning/


Vanishing / Exploding gradients
 When you're training a very deep network, sometimes the derivatives can get either very small or very big, and this makes training difficult.

[Diagram: a deep network with weight matrices $W^1, W^2, \ldots, W^{L-1}, W^L$.]

$Z^1 = W^1 x$,  $Z^2 = W^2 Z^1$,  …,  $Z^{L-1} = W^{L-1} Z^{L-2}$,  $y = W^L Z^{L-1}$

 For simplicity, we assume zero bias ($b = 0$) at every layer and a linear activation function, so

$y = W^L W^{L-1} W^{L-2} \cdots W^2 W^1 x$

 Assuming the entries in each weight matrix are of the form $W^l = \begin{bmatrix} a & 0 \\ 0 & a \end{bmatrix}$ for $l < L$, then $y = W^L a^{L-1} x$.

Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
Vanishing / Exploding gradients

(Same network as above: $y = W^L W^{L-1} W^{L-2} \cdots W^2 W^1 x$.)

 If $a > 1$ and the number of layers L in the network is large, the value of $y$ will explode.
 Similarly, if $a < 1$, the value of $y$ will be very small. Hence, gradient descent will take very tiny steps.

Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
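A three-line check makes the $a^{L-1}$ factor concrete (the depth L = 50 is an arbitrary choice for illustration):

```python
L = 50
for a in (1.5, 1.0, 0.5):
    print(f"a = {a}: a**(L-1) = {a ** (L - 1):.3e}")
# a = 1.5: 4.251e+08   -> activations/gradients explode
# a = 1.0: 1.000e+00   -> stable
# a = 0.5: 1.776e-15   -> activations/gradients vanish
```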
Solutions: Vanishing / Exploding gradients
 Use a good initialization
– Random initialization
 The primary reason behind initializing the weights randomly is to break symmetry.
 We want to make sure that different hidden units learn different patterns.
 Do not use sigmoid for deep networks
– Problem: saturation

[Figure: the sigmoid curve flattens for very large positive or negative inputs.] Sigmoid tends to saturate for very large positive or negative inputs, leading to the vanishing gradient problem, especially in deep networks.

Image Source: Pattern Recognition and Machine Learning, Bishop


ReLU
 Rectified Linear Unit (ReLU):

$a = z$ if $z > 0$;  $a = 0$ if $z \le 0$

Reasons:
1. Fast to compute: no complex operations like exponentials, unlike the sigmoid or tanh functions.
2. Alleviates the vanishing gradient problem.
ReLU

[Diagram: in a network of ReLU units, the units whose output is 0 can be removed.]

What remains is a thinner linear network, which does not have smaller (vanishing) gradients.
ReLU - variant

This helps to avoid the "dying ReLU" problem with the standard ReLU function, where a neuron with a negative bias may never activate and become "dead."

Leaky ReLU: $a = z$ for $z > 0$;  $a = 0.01 z$ otherwise
Parametric ReLU: $a = z$ for $z > 0$;  $a = \alpha z$ otherwise, where $\alpha$ is also learned by gradient descent
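A minimal NumPy sketch of the three variants (here `alpha` for parametric ReLU is passed in; in a real network it would be a learnable parameter):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)     # small fixed slope for z <= 0

def parametric_relu(z, alpha):
    return np.where(z > 0, z, alpha * z)     # alpha is learned by gradient descent
```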
Activation Functions


Image Source: https://sefiks.com/2020/02/02/dance-moves-of-deep-learning-activation-functions/


Optimization Algorithms

Gradient Descent

Assume there are only two parameters w1 and w2 in a network: $\theta = \{w_1, w_2\}$.

[Diagram: the error surface, where the colors represent the value of the cost C.]

 Randomly pick a starting point $\theta^0$
 Compute the negative gradient at $\theta^0$: $-\nabla C(\theta^0)$, where

$\nabla C(\theta^0) = \begin{bmatrix} \partial C(\theta^0)/\partial w_1 \\ \partial C(\theta^0)/\partial w_2 \end{bmatrix}$

 Times the learning rate: move by $-\eta \nabla C(\theta^0)$
Gradient Descent

[Diagram: repeating the steps traces a path $\theta^0 \rightarrow \theta^1 \rightarrow \theta^2 \rightarrow \cdots$ on the error surface.]

 Randomly pick a starting point $\theta^0$
 Compute the negative gradient at $\theta^t$: $-\nabla C(\theta^t)$
 Times the learning rate: $\theta^{t+1} = \theta^t - \eta \nabla C(\theta^t)$
 Eventually, we would reach a minimum.
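A minimal sketch of these steps on a two-parameter example (the quadratic cost $C(w_1, w_2) = w_1^2 + 10 w_2^2$ and the learning rate are illustrative assumptions):

```python
import numpy as np

def grad_C(theta):
    # gradient of the example cost C(w1, w2) = w1**2 + 10 * w2**2
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

eta = 0.05                        # learning rate
theta = np.array([3.0, 2.0])      # randomly picked starting point
for step in range(100):
    theta = theta - eta * grad_C(theta)   # move against the gradient
# theta ends up close to the minimum at (0, 0)
```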
Gradient Descent
 Gradient descent
– Pros
 Guaranteed to converge to the global minimum for convex error surfaces
 Converges to a local minimum for non-convex error surfaces
– Cons
 Very slow
 Intractable for datasets that do not fit in memory

[Diagram: on a non-convex surface C(w1, w2), different initial points reach different minima, and hence different results.]
Gradient Descent: Practical Issues

Mini-batch

 Randomly initialize $\theta^0$
 Pick the 1st mini-batch (e.g. $x^1, x^{31}, \ldots$): compute the losses $L^1, L^{31}, \ldots$, so $C = L^1 + L^{31} + \cdots$, then update $\theta^1 \leftarrow \theta^0 - \eta \nabla C(\theta^0)$
 Pick the 2nd mini-batch (e.g. $x^2, x^{16}, \ldots$): $C = L^2 + L^{16} + \cdots$, then update $\theta^2 \leftarrow \theta^1 - \eta \nabla C(\theta^1)$
 …

C is different each time we update the parameters!
Mini-batch

Faster! Better!

 Randomly initialize $\theta^0$
 Pick the 1st mini-batch: $C = C^1 + C^{31} + \cdots$, $\theta^1 \leftarrow \theta^0 - \eta \nabla C(\theta^0)$
 Pick the 2nd mini-batch: $C = C^2 + C^{16} + \cdots$, $\theta^2 \leftarrow \theta^1 - \eta \nabla C(\theta^1)$
 …
 Until all mini-batches have been picked: one epoch
 Repeat the above process for multiple epochs
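A minimal sketch of the epoch/mini-batch loop (the `grad_fn` callable computing the gradient of C on one mini-batch is a placeholder assumption):

```python
import numpy as np

def minibatch_gd(X, y, theta, grad_fn, eta=0.1, batch_size=32, epochs=10):
    """One epoch = one pass over all mini-batches."""
    rng = np.random.default_rng(0)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # C is computed on this mini-batch only, so it differs per update
            theta = theta - eta * grad_fn(theta, X[batch], y[batch])
    return theta
```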
How can we choose a mini-batch size?
 If the mini-batch size = m
– It is batch gradient descent, where all the training examples are used in each iteration. It takes too much time per iteration.
 If the mini-batch size = 1
– It is called stochastic gradient descent, where each training example is its own mini-batch.
– Since in every iteration we are taking just a single example, the updates can become extremely noisy and it takes much more time to reach the global minimum.
 If the mini-batch size is between 1 and m
– It is mini-batch gradient descent. The size of the mini-batch should not be too large or too small.

Source: https://www.coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent
Learning Rate

Set the learning rate η carefully.

If the learning rate is too large, the cost may not decrease after each update.

[Diagram: on the error surface, a step $-\eta \nabla C(\theta^0)$ that is too large overshoots the minimum.]
Learning Rate

Set the learning rate η carefully.

If the learning rate is too large, the cost may not decrease after each update.
If the learning rate is too small, training will be too slow.

Can we give different parameters different learning rates?
Adagrad
 Divide the learning rate by the "average" gradient
• The high-dimensional, non-convex nature of neural network optimization can lead to different sensitivity along each dimension.
• The learning rate could be too small in one dimension and too large in another.

$w^{t+1} \leftarrow w^t - \frac{\eta}{\sigma^t} g^t$

$\sigma^t$: "average" gradient of parameter w, estimated while updating the parameters.

If w has a small average gradient → larger effective learning rate.
If w has a large average gradient → smaller effective learning rate.
Adagrad

$w^1 \leftarrow w^0 - \frac{\eta}{\sigma^0} g^0$,  $\sigma^0 = \sqrt{(g^0)^2}$

$w^2 \leftarrow w^1 - \frac{\eta}{\sigma^1} g^1$,  $\sigma^1 = \sqrt{\frac{1}{2}\left[(g^0)^2 + (g^1)^2\right]}$

$w^3 \leftarrow w^2 - \frac{\eta}{\sigma^2} g^2$,  $\sigma^2 = \sqrt{\frac{1}{3}\left[(g^0)^2 + (g^1)^2 + (g^2)^2\right]}$

……

$w^{t+1} \leftarrow w^t - \frac{\eta}{\sigma^t} g^t$,  $\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$
Adagrad
 Divide the learning rate by the "average" gradient
– The "average" gradient is obtained while updating the parameters

$w^{t+1} \leftarrow w^t - \frac{\eta}{\sigma^t} g^t$,  $\sigma^t = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$

With the 1/t-decayed learning rate $\eta^t = \frac{\eta}{\sqrt{t+1}}$, the $\sqrt{t+1}$ factors cancel, giving:

$w^{t+1} \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}} g^t$
Adagrad

Original gradient descent:  $\theta^t \leftarrow \theta^{t-1} - \eta \nabla C(\theta^{t-1})$

In Adagrad, each parameter w is considered separately:

$w^{t+1} \leftarrow w^t - \eta_w g^t$,  $g^t = \frac{\partial C(\theta^t)}{\partial w}$

The parameter-dependent learning rate is

$\eta_w = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}$

i.e. a constant $\eta$ divided by the root of the summation of the squares of the previous derivatives.
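A minimal sketch of the per-parameter update in its summation form (the small `eps` added to avoid division by zero at the first step is an assumption of this sketch):

```python
import numpy as np

def adagrad_update(w, g, sum_sq_grad, eta=0.1, eps=1e-8):
    """One Adagrad step: each parameter's learning rate is the constant eta
    divided by the root of the sum of its squared past derivatives."""
    sum_sq_grad = sum_sq_grad + g ** 2
    w = w - eta / (np.sqrt(sum_sq_grad) + eps) * g
    return w, sum_sq_grad

# Reusing the example cost C(w1, w2) = w1**2 + 10 * w2**2:
w, sum_sq = np.array([3.0, 2.0]), np.zeros(2)
for _ in range(100):
    g = np.array([2.0 * w[0], 20.0 * w[1]])
    w, sum_sq = adagrad_update(w, g, sum_sq)
```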
Acknowledgement
 http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf
 https://heartbeat.fritz.ai/deep-learning-best-practices-regularization-techniques-for-better-performance-of-neural-network-94f978a4e518
 https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.pdf
 http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/DNN%20tip.pptx
 Accelerating Deep Network Training by Reducing Internal Covariate Shift, Jude W. Shavlik
 http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/ForDeep.pptx
 Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.
 On Predictive and Generative Deep Neural Architectures, Prof. Swagatam Das, ISICAL
