6 - Tips For Training Deep Neural Networks
Outline
Deep Neural Network
Parameters vs Hyperparameters
How to set network parameters
Bias / Variance Trade-off
Regularization Strategies
Batch normalization
Vanishing / Exploding gradients
Gradient Descent
Mini-batch Gradient Descent
Adagrad
Deep Neural Network
[Figure: a fully-connected deep network with inputs $x_1, \ldots, x_N$, hidden layers with weights $W^1, \ldots, W^L$ and biases $b^1, \ldots, b^L$, and outputs $y_1, \ldots, y_M$.]
$$a^1 = \sigma(W^1 x + b^1), \quad a^2 = \sigma(W^2 a^1 + b^2), \quad \ldots, \quad y = \sigma(W^L a^{L-1} + b^L)$$
Deep Neural Network
$$\theta = \{\, W^1, b^1, W^2, b^2, \ldots, W^L, b^L \,\}$$
[Figure: handwritten-digit example. A $16 \times 16$ image gives $256$ inputs $x_1, \ldots, x_{256}$ (ink $\to 1$, no ink $\to 0$); the outputs score each digit class, e.g. $y_1 = 0.1$ ("is 1"), $y_2 = 0.7$ ("is 2"), $y_{10} = 0.2$ ("is 0").]
Set the network parameters $\theta$ such that:
Input a "1" $\Rightarrow$ $y_1$ has the maximum value.
Input a "2" $\Rightarrow$ $y_2$ has the maximum value.
How do we let the neural network achieve this? By learning the network parameters $\theta$.
Parameters vs Hyperparameters
A model parameter is a variable of the selected model that can be estimated by fitting the given data to the model.
A hyperparameter is a parameter from a prior distribution; it captures the prior belief before data is observed.
– These are the parameters that control the model parameters.
– In any machine learning algorithm, these parameters need to be set before training the model.
Training Set: We train the model on the training data.
Dev Set: After training the model, we check how well it performs on the dev set.
Test Set: When we have a final model, we evaluate it on the test set in order to get an unbiased estimate of how well our algorithm is doing.
Train / Dev / Test sets
[Figure: typical splits of the data. A common split is Training 60% / Dev 20% / Test 20% (or Training 70% with the rest for Dev/Test); with very large datasets the training set can be as large as 98%, with Dev and Test sets around 1% each.]
Bias / Variance Trade-off
Make sure the distribution of the dev/test set is the same as the training set.
– Divide the training, dev and test sets in such a way that their distributions are similar.
– It is also acceptable to skip the test set and validate the model using the dev set only.
We want our model to be just right, which means having low bias and low variance.
– Bias is the difference between the predicted value and the expected value.
– Variance is the amount that the estimate of the target function will change, given different training data.
Source: https://cedar.buffalo.edu/~srihari/CSE574/Chap5/Chap5.5-Regularization.pdf
Regularization Strategies
Parameter Norm Penalties
The most traditional form of regularization applicable to deep learning is the concept of parameter norm penalties.
This approach limits the capacity of the model by adding a penalty $\Omega(\theta)$ to the objective function, resulting in:
$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta),$$
where $\alpha \ge 0$ weights the contribution of the penalty.
L2 Norm Parameter Regularization
Using the L2 norm, we add a constraint to the original loss function so that the weights of the network do not grow too large:
$$\Omega(w) = \frac{1}{2}\, \lVert w \rVert_2^2$$
L1 Norm Parameter Regularization
The L1 norm is another option that can be used to penalize the size of the model parameters.
L1 regularization on the model parameters $w$ is:
$$\Omega(w) = \lVert w \rVert_1 = \sum_i \lvert w_i \rvert$$
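To make the two penalties concrete, here is a minimal NumPy sketch (the helper names and the toy squared-error objective are my own illustration, not code from the lecture):

```python
import numpy as np

def l2_penalty(w, alpha):
    # (alpha / 2) * ||w||_2^2
    return 0.5 * alpha * np.sum(w ** 2)

def l1_penalty(w, alpha):
    # alpha * ||w||_1
    return alpha * np.sum(np.abs(w))

def regularized_loss(w, X, y, alpha, penalty=l2_penalty):
    # toy base objective: mean squared error of a linear model
    base = np.mean((X @ w - y) ** 2)
    return base + penalty(w, alpha)

# tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
w = rng.normal(size=5)
print(regularized_loss(w, X, y, alpha=0.1, penalty=l2_penalty))
print(regularized_loss(w, X, y, alpha=0.1, penalty=l1_penalty))
```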
Early Stopping
When training models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, while the error on the validation set begins to rise again (or stays the same) after a certain number of iterations; at that point there is no benefit in training the model further.
This means we can obtain a model with better validation set error (and thus, hopefully, better test set error) by returning to the parameter setting at the point in time with the lowest validation set error.
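A minimal sketch of this idea (the `train_one_epoch` and `validation_error` callables are hypothetical placeholders for whatever training loop is being used):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=100, patience=10):
    """Keep the parameters that achieved the lowest validation error."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training set
        val_err = validation_error(model)      # error on the dev/validation set

        if val_err < best_error:
            best_error = val_err
            best_model = copy.deepcopy(model)  # remember the best parameters so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # stop: validation error no longer improves

    return best_model, best_error
```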
Parameter Tying
Sometimes we might not know which region the parameters should lie in, but we do know that there are some dependencies between them.
Parameter tying refers to explicitly forcing the parameters of two models to be close to each other through a norm penalty.
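As a small illustration, one common formulation of such a penalty is $\Omega(w_A, w_B) = \lVert w_A - w_B \rVert_2^2$; the sketch below (my assumption, not code from the slides) shows how it could be computed:

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    # alpha * ||w_A - w_B||_2^2 : pushes the two parameter vectors towards each other
    return alpha * np.sum((w_a - w_b) ** 2)

w_a = np.array([0.5, -1.0, 2.0])
w_b = np.array([0.4, -1.2, 1.5])
print(tying_penalty(w_a, w_b, alpha=0.1))
```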
Dropout
Dropout is a bagging method.
– Bagging is a method of averaging over several models to improve generalization.
– It is usually impractical to train many separate neural networks, since doing so is expensive in time and memory; dropout is a way of applying bagging to neural networks.
Dropout is an inexpensive but powerful method of regularizing a broad family of models.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network.
Dropout - Intuitive Reason
Training: before each parameter update, each neuron is dropped with some probability p%, so each update trains a thinner sub-network.
Dropout
Testing: no dropout is applied.
If the dropout rate at training is p%, all the weights are multiplied by (1-p)% at testing.
Assume that the dropout rate is 50%: if a weight has value $w$ after training, set it to $0.5\,w$ for testing.
Why should the weights be multiplied by (1-p)% (one minus the dropout rate) when testing?
With two inputs and a dropout rate of 50%, training computes one of $z = w_1 x_1 + w_2 x_2$, $z = w_2 x_2$, $z = w_1 x_1$, or $z = 0$, depending on which units are dropped. The expectation over these cases is
$$\mathbb{E}[z] = \tfrac{1}{2} w_1 x_1 + \tfrac{1}{2} w_2 x_2,$$
which is exactly what testing (with no dropout) computes when the weights are halved.
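A minimal NumPy sketch of the two phases described above (helper names are assumptions; the test-time scaling is applied to the activations here, which is equivalent to scaling the outgoing weights as the slide states):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    """Drop each unit of the activation vector a with probability p."""
    mask = rng.random(a.shape) >= p      # keep each unit with probability (1 - p)
    return a * mask

def dropout_test(a, p):
    """No units are dropped; scale by (1 - p) to match the training-time expectation."""
    return a * (1.0 - p)

a = np.array([1.0, 2.0, 3.0, 4.0])
p = 0.5
print(dropout_train(a, p))   # roughly half of the units are zeroed
print(dropout_test(a, p))    # every unit scaled by 0.5
```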
Dropout is a kind of ensemble.
[Figure: ensemble training. The training set is sampled into subsets (Set 1, Set 2, Set 3, Set 4) and a separate network is trained on each subset.]
[Figure: ensemble testing. The testing data x is fed to all of the trained networks, and their outputs $y_1, y_2, y_3, y_4$ are averaged.]
Setting up your Optimization Problem
Normalizing Inputs
The range of values of raw training data often varies widely.
– Example: a "has kids" feature in {0, 1}.
– Value of a car: from about $500 to hundreds of thousands of dollars.
If one of the features has a broad range of values, the distance will be governed by this particular feature.
– After normalization, each feature contributes approximately proportionately to the final distance.
In general, gradient descent converges much faster with feature scaling than without it.
Feature Scaling
Given training examples $x^1, x^2, x^3, \ldots, x^r, \ldots, x^m$, compute for each dimension $i$ the mean $m_i$ and the standard deviation $\sigma_i$ over all examples, and then scale
$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}.$$
After scaling, the means of all dimensions are 0 and the variances are all 1.
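A minimal NumPy sketch of this standardization (the function name and toy data are illustrative assumptions):

```python
import numpy as np

def feature_scale(X):
    """Standardize each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # toy data with an arbitrary scale
X_scaled, mean, std = feature_scale(X)
print(X_scaled.mean(axis=0).round(6))   # approximately all zeros
print(X_scaled.std(axis=0).round(6))    # approximately all ones
```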
[Figure: processing a batch. Inputs $x^1, x^2, x^3$ are each multiplied by $W^1$ to give $z^1, z^2, z^3$, passed through the sigmoid to give $a^1, a^2, a^3$, then multiplied by $W^2$, and so on. The whole batch can be processed at once:
$$[\,z^1 \; z^2 \; z^3\,] = W^1 [\,x^1 \; x^2 \; x^3\,].$$]
Batch normalization
For a batch of three examples, compute $z^i = W^1 x^i$ and then the batch statistics
$$\mu = \frac{1}{3}\sum_{i=1}^{3} z^i, \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3} (z^i - \mu)^2}.$$
Note that $\mu$ and $\sigma$ depend on the $z^i$.
Batch normalization
Each $z^i$ is normalized and then passed through the activation:
$$\tilde{z}^i = \frac{z^i - \mu}{\sigma + \varepsilon}, \qquad a^i = \mathrm{sigmoid}(\tilde{z}^i).$$
Again, $\mu$ and $\sigma$ depend on the $z^i$.
Batch normalization happens between computing $z$ and computing $a$: the intuition is that, instead of using the un-normalized value $z$, you use the normalized value $\tilde{z}$.
Batch normalization
Setting the mean to 0 and the variance to 1 works for most applications, but in the actual implementation we do not want the hidden units to always have mean 0 and variance 1.
Therefore, we replace $\tilde{z}^i$ with
$$\hat{z}^i = \gamma \odot \tilde{z}^i + \beta,$$
where $\gamma$ and $\beta$ are learnable network parameters.
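A minimal NumPy sketch of the forward pass described above (assumed helper name; note that here $\varepsilon$ is added to the variance inside the square root, a common variant of the $(\sigma + \varepsilon)$ form on the slide):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each unit over the batch, then rescale with learnable gamma, beta.

    Z:     (batch_size, num_units) pre-activations
    gamma: (num_units,) learnable scale
    beta:  (num_units,) learnable shift
    """
    mu = Z.mean(axis=0)                      # per-unit mean over the batch
    var = Z.var(axis=0)                      # per-unit variance over the batch
    Z_tilde = (Z - mu) / np.sqrt(var + eps)  # ~zero mean, unit variance per unit
    Z_hat = gamma * Z_tilde + beta           # restore representational flexibility
    return Z_hat, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=2.0, scale=4.0, size=(8, 3))
gamma, beta = np.ones(3), np.zeros(3)
Z_hat, mu, var = batch_norm_forward(Z, gamma, beta)
print(Z_hat.mean(axis=0).round(6), Z_hat.var(axis=0).round(6))  # ~0 means, ~1 variances
```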
Batch normalization at testing time
$$\tilde{z} = \frac{z - \mu}{\sigma}, \qquad \hat{z} = \gamma \odot \tilde{z} + \beta$$
$\mu$ and $\sigma$ are computed from the batch; $\gamma$ and $\beta$ are network parameters.
We do not have a batch at the testing stage.
Ideal solution: compute $\mu$ and $\sigma$ using the whole training dataset.
Practical solution: compute the moving average of the $\mu$ and $\sigma$ of the batches during training (e.g. $\mu^1, \ldots, \mu^{100}, \ldots, \mu^{300}$ across updates).
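A small sketch of the practical solution (the momentum value 0.9 and the helper names are assumptions for illustration):

```python
import numpy as np

def update_running_stats(running_mu, running_var, batch_mu, batch_var, momentum=0.9):
    """Exponential moving average of the batch statistics, maintained during training."""
    running_mu = momentum * running_mu + (1.0 - momentum) * batch_mu
    running_var = momentum * running_var + (1.0 - momentum) * batch_var
    return running_mu, running_var

def batch_norm_test(z, running_mu, running_var, gamma, beta, eps=1e-5):
    """At testing time there is no batch, so use the stored running statistics."""
    z_tilde = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_tilde + beta

# toy usage: accumulate statistics over a few "batches" during training
rng = np.random.default_rng(0)
running_mu, running_var = np.zeros(3), np.ones(3)
for _ in range(100):
    Z = rng.normal(loc=2.0, scale=4.0, size=(8, 3))
    running_mu, running_var = update_running_stats(
        running_mu, running_var, Z.mean(axis=0), Z.var(axis=0))
print(running_mu.round(2), running_var.round(2))  # running estimates of the per-unit statistics
```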
Why does normalizing the data make the algorithm faster?
[Figure: contour plots of the cost $J$ over $w$ and $b$ for unnormalized and normalized data.]
In the case of unnormalized data, the scale of the features will vary, which makes the cost function asymmetric.
It will take longer to converge to the minimum because the parameters for larger-scale features will dominate the updates.
In the case of normalized data, the scales are the same and the cost function is symmetric.
This makes it easier for gradient descent to find the global minimum quickly, which in turn makes the algorithm run much faster.
Source: https://www.coursera.org/learn/deep-neural-network/lecture/C9iQO/vanishing-exploding-gradients
Vanishing / Exploding gradients
When you're training a very deep network, sometimes the derivatives can get either very small or very big, and this makes training difficult.
[Figure: a deep network with weight matrices $W^1, W^2, \ldots, W^{L-1}, W^L$, where]
$$z^1 = W^1 x, \quad z^2 = W^2 z^1, \quad \ldots, \quad z^{L-1} = W^{L-1} z^{L-2}, \quad \hat{y} = W^L z^{L-1},$$
so $\hat{y} = W^L W^{L-1} \cdots W^2 W^1 x$: if the weights are even slightly larger or smaller than the identity, the activations and gradients grow or shrink exponentially with the depth $L$.
Solutions: Vanishing / Exploding gradients
Use a good initialization.
– Random initialization: the primary reason for initializing the weights randomly is to break symmetry, so that different hidden units learn different patterns.
Do not use the sigmoid for deep networks.
– Problem: saturation. The sigmoid tends to saturate for very large positive or negative inputs, leading to the vanishing gradient problem, especially in deep networks.
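Relating to the initialization point above, here is a minimal sketch of random initialization that breaks symmetry (the $1/\sqrt{\text{fan-in}}$ scaling is a common Xavier-style choice and is my assumption; the slide only calls for random initialization):

```python
import numpy as np

def init_layer(fan_in, fan_out, rng):
    # small random weights break symmetry so hidden units learn different patterns;
    # scaling by 1/sqrt(fan_in) keeps activations from growing or shrinking with depth
    W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)          # biases can safely start at zero
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(256, 100, rng)
W2, b2 = init_layer(100, 10, rng)
print(W1.std().round(3))           # roughly 1/sqrt(256) ≈ 0.0625
```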
[Figure: the ReLU activation, $a = z$ for $z > 0$ and $a = 0$ for $z \le 0$, compared with the sigmoid $\sigma(z)$.]
Reasons for using ReLU:
1. Fast to compute: no complex operations like the exponentials in the sigmoid or tanh functions.
2. It helps with the vanishing gradient problem.
[Figure: ReLU, $a = z$ for $z > 0$ and $a = 0$ otherwise. In a ReLU network, the units whose input falls in the zero region output 0.]
[Figure: removing the units that output 0 leaves a thinner, linear network from the inputs $x_1, x_2$ to the outputs $y_1, y_2$; in the linear region the activation does not produce smaller gradients.]
ReLU - variant
Variants keep a small, non-zero output for negative inputs: Leaky ReLU uses $a = 0.01\,z$ for $z < 0$, and Parametric ReLU uses $a = \alpha z$ for $z < 0$, where $\alpha$ is learnable.
This helps to avoid the "dying ReLU" problem of the standard ReLU, where a neuron with a negative bias may never activate and become "dead."
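A minimal NumPy sketch of the standard ReLU and these two variants (function names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # small slope for z < 0 keeps a non-zero gradient, avoiding "dead" units
    return np.where(z > 0, z, slope * z)

def parametric_relu(z, alpha):
    # same form as leaky ReLU, but alpha is a learnable parameter
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(leaky_relu(z))
print(parametric_relu(z, alpha=0.1))
```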
Gradient Descent
Assume there are only two parameters $w_1$ and $w_2$ in the network: $\theta = \{w_1, w_2\}$.
On the error surface, starting from a point $\theta^0$, compute the gradient
$$\nabla C(\theta^0) = \begin{bmatrix} \partial C(\theta^0)/\partial w_1 \\ \partial C(\theta^0)/\partial w_2 \end{bmatrix}$$
and move by $-\eta\, \nabla C(\theta^0)$, where $\eta$ is the learning rate.
Gradient Descent
Randomly pick a starting point $\theta^0$.
Compute the negative gradient at $\theta^0$: $-\nabla C(\theta^0)$.
Multiply by the learning rate and update: $\theta^1 = \theta^0 - \eta\, \nabla C(\theta^0)$.
Repeat: $\theta^2 = \theta^1 - \eta\, \nabla C(\theta^1)$, and so on.
Eventually, we reach a minimum.
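A minimal sketch of this update rule on a toy two-parameter cost (the quadratic cost function is made up purely for illustration):

```python
import numpy as np

def C(theta):
    # toy convex error surface over (w1, w2)
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def grad_C(theta):
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

eta = 0.1                                  # learning rate
theta = np.array([0.0, 0.0])               # randomly pick a starting point (here: the origin)
for _ in range(100):
    theta = theta - eta * grad_C(theta)    # move against the gradient
print(theta, C(theta))                     # approaches the minimum at (3, -1)
```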
Gradient Descent
– Pros
Guaranteed to converge to the global minimum for convex error surfaces; converges to a local minimum for non-convex error surfaces.
– Cons
Very slow.
Intractable for datasets that do not fit in memory.
[Figure: a non-convex cost $C(w_1, w_2)$; different starting points can reach different minima, and therefore different results.]
Gradient Descent: Practical Issues
Mini-batch
Randomly initialize $\theta^0$.
Pick the 1st mini-batch (e.g. $x^1, x^{31}, \ldots$): compute $C = L^1 + L^{31} + \cdots$ and update $\theta^1 \leftarrow \theta^0 - \eta\, \nabla C(\theta^0)$.
Pick the 2nd mini-batch (e.g. $x^2, x^{16}, \ldots$): compute $C = L^2 + L^{16} + \cdots$ and update $\theta^2 \leftarrow \theta^1 - \eta\, \nabla C(\theta^1)$.
C is different each time we update the parameters!
Mini-batch
Mini-batch training is faster, and often better.
The same procedure is repeated: pick the 1st mini-batch, compute $C = C^1 + C^{31} + \cdots$, update $\theta^1 \leftarrow \theta^0 - \eta\, \nabla C(\theta^0)$; pick the 2nd mini-batch, compute $C = C^2 + C^{16} + \cdots$, update $\theta^2 \leftarrow \theta^1 - \eta\, \nabla C(\theta^1)$; and so on.
When all mini-batches have been picked once, that is one epoch.
Source: https://www.coursera.org/learn/deep-neural-network/lecture/qcogH/mini-batch-gradient-descent
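A minimal NumPy sketch of the mini-batch procedure above, using a toy linear-regression loss as a stand-in for the network's cost C (the data, model, and batch size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy linear-regression data; the model and loss are assumptions for illustration
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta = 0.05
batch_size = 32

for epoch in range(20):                          # one epoch = all mini-batches picked once
    order = rng.permutation(len(X))              # shuffle, then split into mini-batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of this mini-batch's loss
        w = w - eta * grad                       # update after every mini-batch
print(w.round(2))                                # close to true_w
```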
Learning Rate
Set the learning rate $\eta$ carefully.
If the learning rate is too large, the update $-\eta\, \nabla C(\theta^0)$ may overshoot the minimum.
Can we give different parameters different learning rates?
Adagrad
Divide the learning rate by the "average" gradient.
• The high-dimensional, non-convex nature of neural network optimization can lead to a different sensitivity on each dimension.
• The learning rate could be too small in some dimensions and too large in others.
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sigma^{t}}\, g^{t}$$
$$w^{1} \leftarrow w^{0} - \frac{\eta}{\sigma^{0}}\, g^{0}, \qquad \sigma^{0} = \sqrt{(g^{0})^{2}}$$
$$w^{2} \leftarrow w^{1} - \frac{\eta}{\sigma^{1}}\, g^{1}, \qquad \sigma^{1} = \sqrt{\tfrac{1}{2}\left[(g^{0})^{2} + (g^{1})^{2}\right]}$$
$$w^{3} \leftarrow w^{2} - \frac{\eta}{\sigma^{2}}\, g^{2}, \qquad \sigma^{2} = \sqrt{\tfrac{1}{3}\left[(g^{0})^{2} + (g^{1})^{2} + (g^{2})^{2}\right]}$$
$$\cdots$$
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sigma^{t}}\, g^{t}, \qquad \sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^{2}}$$
Adagrad
Divide the learning rate by the "average" gradient.
– The "average" gradient is obtained while updating the parameters.
$$w^{t+1} \leftarrow w^{t} - \frac{\eta^{t}}{\sigma^{t}}\, g^{t}, \qquad \sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^{2}}, \qquad \eta^{t} = \frac{\eta}{\sqrt{t+1}} \;\; (1/t \text{ decay})$$
Combining the two:
$$w^{t+1} \leftarrow w^{t} - \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}}\, g^{t}$$
Adagrad
Original gradient descent: $\theta^{t} \leftarrow \theta^{t-1} - \eta\, \nabla C(\theta^{t-1})$.
Adagrad uses a parameter-dependent learning rate
$$\eta_{w} = \frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^{2}}},$$
where $\eta$ is a constant and the denominator is the summation of the squares of the previous derivatives.
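A minimal NumPy sketch of the combined Adagrad update above (the small `eps` term for numerical stability and the toy gradient are my additions):

```python
import numpy as np

def adagrad_update(w, g, accum, eta=0.1, eps=1e-8):
    """One Adagrad step: divide the learning rate by the root of the summed squared gradients."""
    accum = accum + g ** 2                       # running sum of (g^i)^2
    w = w - eta / (np.sqrt(accum) + eps) * g     # per-parameter effective learning rate
    return w, accum

# toy quadratic with very different curvature per dimension
def grad(w):
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
accum = np.zeros(2)
for _ in range(200):
    w, accum = adagrad_update(w, grad(w), accum)
print(w.round(3))                                # both coordinates have moved towards 0
```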
Acknowledgement
http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_3.pdf
https://heartbeat.fritz.ai/deep-learning-best-practices-regularization-techniques-for-better-performance-of-neural-network-94f978a4e518
https://cedar.buffalo.edu/~srihari/CSE676/7.12%20Dropout.pdf
http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/DNN%20tip.pptx
Accelerating Deep Network Training by Reducing Internal Covariate Shift, Jude W. Shavlik
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/ForDeep.pptx
Deep Learning Tutorial. Prof. Hung-yi Lee, NTU.
On Predictive and Generative Deep Neural Architectures, Prof. Swagatam Das, ISICAL