
Chapter 11: Training Deep Neural Networks

Tsz-Chiu Au
[email protected]
Ulsan National Institute of Science and Technology (UNIST)
South Korea
Training a Deep Neural Network
• In Chapter 10, all the neural networks were shallow---they had just
a few hidden layers.
• To tackle a complex problem, we need to train a much deeper
DNN.
• Common problems when training deep DNNs.
» Vanishing gradients problem and exploding gradients problem
» Not enough training data
» Training may be extremely slow
» Overfitting the training set
• We will discuss how to address these issues.
The Vanishing/Exploding Gradients Problems

• The backpropagation algorithm works by propagating the error gradient
from the output layer to the input layer in the reverse pass.
» Uses these gradients to update each parameter with a Gradient Descent
step.
• Vanishing Gradients Problem: gradients often get smaller and
smaller as the algorithm progresses down to the lower layers.
» Gradient Descent fails to update the lower layers’ connection weights, and
training never converges to a good solution.
• Exploding gradients problem: In recurrent neural networks, the
gradients can grow bigger and bigger until layers get insanely large
weight updates and the algorithm diverges.
• For these reasons, DNNs were mostly abandoned in the early
2000s.
Logistic Activation Function Saturation
• Xavier Glorot and Yoshua Bengio found that these problems
are due to the use of the logistic sigmoid activation function
and the popular weight initialization technique at the time
(normal distribution with mean 0 and standard deviation 1)
» The variance of the outputs of each layer is much greater than the
variance of its inputs.
» The variance keeps increasing after each layer until the activation
function saturates at the top layers.
» There is then virtually no gradient to propagate back through the network.
Xavier/Glorot Initialization
• Glorot and Bengio propose a way to significantly alleviate the unstable
gradients problem.
» We need the signal to flow properly in both directions: forward when
making predictions, and backward when backpropagating gradients.
§ We don’t want the signal to die out, nor do we want it to explode and saturate.
» The idea: make the variance of the outputs of each layer equal to the variance
of its inputs, and make the gradients have equal variance before and after
flowing through a layer in the reverse direction.
• In Glorot initialization, where fanavg = (fanin + fanout)/2, initialize the
weights by:
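» Glorot initialization (Equation 11-1): draw the weights from a normal
distribution with mean 0 and variance σ² = 1/fanavg, or from a uniform
distribution between −r and +r, with r = √(3/fanavg).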

• If we replace fanavg with fanin in Eq. 11-1, we get LeCun initialization.


Xavier/Glorot Initialization (cont.)
• Using Glorot initialization can speed up training considerably, and it is one
of the tricks that led to the success of Deep Learning.
• For different activation functions:

• Keras uses Glorot initialization by default. To use He initialization:

• If you want He initialization with a uniform distribution but based on fanavg
rather than fanin:
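• A minimal sketch of both options in Keras (assuming "from tensorflow import
keras"; the layer size and activation are illustrative):

from tensorflow import keras

# He initialization (normal distribution, based on fan_in)
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He-style initialization with a uniform distribution based on fan_avg,
# via the VarianceScaling initializer
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
keras.layers.Dense(10, activation="relu", kernel_initializer=he_avg_init)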
Nonsaturating Activation Functions
• Glorot and Bengio also found that unstable gradients are partly
due to a poor choice of activation function.
» Although sigmoid activation functions look like the activation function
of biological neurons, there are better activation functions than
sigmoid activation functions.

• A popular alternative is the ReLU activation function, which does not
saturate for positive values and is fast to compute.
» But it suffers from the dying ReLUs problem---some neurons stop
outputting anything other than 0.
§ In some cases, half of the network’s neurons are dead.

• Solution: The leaky ReLU.


Other Leaky ReLU Activation Functions
• The randomized leaky ReLU (RReLU)
» where α is picked randomly in a given range during
training and is fixed to an average value during
testing.
» can act as a regularizer.
• The parametric leaky ReLU (PReLU)
» where α is authorized to be learned during training
(instead of being a hyperparameter)
» Strongly outperform ReLU on large image datasets,
but on smaller datasets it runs the risk of overfitting
the training set.
Exponential Linear Unit (ELU)
• Exponential linear unit (ELU) outperformed all the ReLU variants in
some experiments.

• The unit has an average output closer to 0.


• A nonzero gradient for z < 0, which avoids the dead neurons
problem.
• α is usually set to 1, and then the function is smooth everywhere.
• Drawback: slower to compute than the ReLU function and its variants.
» But its faster convergence rate during training compensates for the slower
computation. However, an ELU network is still slower than a ReLU network at
test time.
Scaled ELU (SELU)
• Scaled ELU (SELU) is a scaled variant of the ELU activation function
• If you build a neural network composed exclusively of a stack of
dense layers, and if all hidden layers use the SELU activation function,
then the network will self-normalize.
» The output of each layer will tend to preserve a mean of 0 and standard
deviation of 1 during training, which solves the vanishing/exploding
gradients problem.
• Conditions for self-normalization:
» The input features must be standardized (mean 0 and standard deviation 1).
» Every hidden layer’s weights must be initialized with LeCun normal
initialization.
» The network’s architecture must be sequential.
§ For recurrent networks and skip connections, SELU will not necessarily outperform other
activation functions.
» All layers are dense.
Which activation functions should you
use for hidden layers?
• In general, SELU > ELU > leaky ReLU (and its variants) >
ReLU > tanh > logistic.
• If the network’s architecture prevents it from self-
normalizing, then ELU may perform better than SELU.
• If you care a lot about runtime latency, then you may
prefer leaky ReLU.
• ReLU is the most used activation function (by far), and many libraries
and hardware accelerators provide ReLU-specific optimizations.
» If speed is your priority, ReLU might still be the best choice.
Using LeakyReLU and SELU in Keras
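• A minimal sketch (assuming "from tensorflow import keras"; layer sizes, the
leak α, and the surrounding architecture are illustrative):

from tensorflow import keras

# Leaky ReLU: add it as its own layer right after the layer it should apply to
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    keras.layers.Dense(10, activation="softmax")
])

# SELU: use the "selu" activation together with LeCun normal initialization
selu_layer = keras.layers.Dense(100, activation="selu",
                                kernel_initializer="lecun_normal")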
Batch Normalization
• Batch Normalization (BN): add a BN layer in the model just before or
after the activation function of each hidden layer.
» The BN layer zero-centers and normalizes each input, then scales and
shifts the result using two new parameter vectors per layer: one for
scaling, the other for shifting.
» The mean and standard deviation of the inputs are evaluated over the
current mini-batch.
Batch Normalization (cont.)
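• For reference, the Batch Normalization algorithm (Equation 11-3), for a
mini-batch B of size mB:
» μB = (1/mB) Σ x(i) — the vector of input means over the mini-batch
» σB² = (1/mB) Σ (x(i) − μB)² — the vector of input variances over the mini-batch
» x̂(i) = (x(i) − μB) / √(σB² + ε) — the zero-centered and normalized inputs
» z(i) = γ ⊗ x̂(i) + β — the rescaled and shifted outputs of the BN operation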
Batch Normalization (cont.)
• At test time we cannot compute the inputs’ mean and standard deviation
on the fly, since there is no mini-batch.
• Most implementations of Batch Normalization estimate the
inputs’ mean and standard deviation during training by using a
moving average of the layer’s input means and standard
deviations.
• Learn four parameter vectors during training:
» γ (the output scale vector) and β (the output offset vector) are learned
through regular backpropagation
» μ (the final input mean vector) and σ (the final input standard deviation
vector) are estimated using an exponential moving average.
• Note that μ and σ are estimated during training, but they are used only
after training.
» They replace the batch input means and standard deviations in Equation 11-3.
Batch Normalization (cont.)
• In the original paper, Batch Normalization considerably improved all the
deep neural networks the authors experimented with, leading to a huge
improvement in the ImageNet classification task.
• The vanishing gradients problem was strongly reduced.
• The networks were also much less sensitive to the weight
initialization.
• Can use much larger learning rates, significantly speeding up the
learning process.
• Batch Normalization acts like a regularizer, reducing the need for
other regularization techniques.
• To avoid the runtime penalty at inference, fuse each BN layer with the
previous layer by updating the previous layer’s weights and biases so that it
directly produces outputs of the appropriate scale and offset.
Implementing BN with Keras
• In Keras, you can add a BatchNormalization layer before or after each
hidden layer’s activation function, and optionally add a BN layer as the
first layer in your model.
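• For example, a minimal sketch for 28×28 inputs (layer sizes and activations
are illustrative; here the BN layers are placed after the activations):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),   # optional BN layer as the first layer
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])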
Implementing BN with Keras (cont.)
• Some argued in favor of adding the BN layers before the
activation functions, rather than after.
» But there is some debate about this.
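• One way to sketch the "BN before activation" variant: remove the activation
from the hidden layers, add it as a separate layer after BN, and drop the bias
terms (BN’s shift parameter β makes them redundant):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])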
Implementing BN with Keras (cont.)

• The BatchNormalization class has quite a few hyperparameters you can tweak.
» momentum – the decay factor used to update the exponential moving
averages (a value close to 1, e.g., 0.99).
» axis – determines which axis should be normalized (the last axis by default).
» The defaults will usually be fine.
• BatchNormalization has become one of the most-used layers in deep neural
networks.
Gradient Clipping
• Gradient clipping: mitigate the exploding gradients problem by clipping the
gradients during backpropagation so that they never exceed some threshold.
» most often used in recurrent neural networks.
• In Keras, set the clipvalue or clipnorm argument when creating an optimizer:
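• A sketch (the threshold 1.0 and the loss are illustrative; the model object
is assumed to be defined elsewhere):

from tensorflow import keras

# Clip each gradient component to the range [-1.0, 1.0]
optimizer = keras.optimizers.SGD(clipvalue=1.0)
# model.compile(loss="mse", optimizer=optimizer)

# Alternative: clip the whole gradient vector by its L2 norm (direction preserved)
# optimizer = keras.optimizers.SGD(clipnorm=1.0)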

• This optimizer will clip every component of the gradient vector to a value
between –1.0 and 1.0.
» However, it may change the orientation of the gradient vector.
• If you want to ensure that Gradient Clipping does not change the direction of
the gradient vector, you should clip by norm by setting clipnorm instead of
clipvalue.
» E.g., set clipnorm=1.0
• You may want to try both clipping by value and clipping by norm, with
different thresholds, and see which option performs best on the validation set.
Reusing Pretrained Layers
• It is generally not a good idea to train a very large
DNN from scratch.
» You should always try to find an existing neural network
that accomplishes a similar task, and then reuse the lower
layers of this network.
• Transfer learning:
» It not only speeds up training considerably, but also requires
significantly less training data.
Reusing Pretrained Layers (cont.)
• Suppose you have access to a
DNN that was trained to
classify pictures into 100 different categories, including
animals, plants, vehicles, and
everyday objects.
• You now want to train a DNN to
classify specific types of
vehicles.
• Then you should try to reuse
parts of the first network.
• The output layer of the original
model should usually be
replaced because it is most
likely not useful at all for the
new task
Reusing Pretrained Layers (cont.)
• The upper hidden layers of the original model are less likely
to be as useful as the lower layers, since the high-level
features that are most useful for the new task may differ
significantly from the ones that were most useful for the
original task.
• Try freezing all the reused layers first, then train your model
and see how it performs.
• Then try unfreezing one or two of the top hidden layers to
let backpropagation tweak them and see if performance
improves.
• It is also useful to reduce the learning rate when you
unfreeze reused layers: this will avoid wrecking their fine-
tuned weights.
Transfer Learning with Keras
• Suppose someone built and trained a Keras model (model A) on a subset of
Fashion MNIST containing only eight of the ten classes.
• You now want to tackle a different task: train a binary
classifier (positive=shirt, negative=sandal) (model B) with a
small dataset.
• Since your task is quite similar to the first task, try transfer
learning:

• To avoid affecting the weights of model_A when you train model_B_on_A,
clone model_A and copy its weights.
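• A sketch of this setup (the file name "my_model_A.h5" and the new output
layer are illustrative):

from tensorflow import keras

model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])    # reuse all but the output layer
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))  # new binary output layer

# Clone model A and copy its weights, so that training B leaves A untouched
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())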


Transfer Learning with Keras (cont.)
• Now you could train model_B_on_A for task B, but since the
new output layer was initialized randomly it will make large
errors, so there will be large error gradients that may wreck
the reused weights.
• Solution: freeze the reused layers during the first few epochs,
giving the new layer some time to learn reasonable weights.
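• Continuing the sketch above (freeze every reused layer, then recompile; the
optimizer and loss are illustrative):

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False        # freeze the reused layers

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])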
Transfer Learning with Keras (cont.)
• You can train the model for a few epochs, then unfreeze the
reused layers (which requires compiling the model again) and
continue training to fine-tune the reused layers for task B.
• After unfreezing the reused layers, it is usually a good idea to
reduce the learning rate, once again to avoid damaging the
reused weights.
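• Continuing the sketch (X_train_B/y_train_B and X_valid_B/y_valid_B are
assumed to hold the task-B data; epoch counts and the reduced learning rate
are illustrative):

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True          # unfreeze the reused layers

optimizer = keras.optimizers.SGD(learning_rate=1e-4)   # much lower learning rate
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))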
Transfer Learning with Keras (cont.)

• Transfer learning does not work very well with small dense
networks.
» Presumably because small networks learn few patterns, and dense
networks learn very specific patterns, which are unlikely to be useful
in other tasks.
• Transfer learning works best with deep convolutional neural
networks, which tend to learn feature detectors that are
much more general (especially in the lower layers).
Unsupervised Pretraining
• It is often cheap to gather unlabeled training examples, but
expensive to label them.
• Unsupervised pretraining: use the unlabeled data to train
an unsupervised model such as an autoencoder or a
generative adversarial network.
» Then you can reuse the lower layers of the autoencoder or the
lower layers of the GAN’s discriminator, add the output layer for
your task on top, and fine-tune the final network using
supervised learning
• A good option when you have a complex task to solve, no
similar model you can reuse, and little labeled training data
but plenty of unlabeled training data.
• Today, people typically use autoencoders or GANs rather than
restricted Boltzmann machines (RBMs).
Greedy Layer-wise Pretraining
• Greedy layer-wise pretraining was used in the early days of Deep Learning.
» First train an unsupervised model with a single layer (typically an RBM), then
freeze that layer and add another one on top of it, then train the model again
(effectively just training the new layer), then freeze the new layer and add
another layer on top of it, train the model again, and so on.
• But nowadays, people generally train the full unsupervised model in one
shot.
Pretraining on an Auxiliary Task
• If you do not have much labeled training data, one option is to train
a first neural network on an auxiliary task for which you can easily
obtain or generate labeled training data, then reuse the lower
layers of that network for your actual task.
• For example, you want to build a system to recognize faces, but you
only have a few pictures of each individual.
» Gather pictures of random people on the web and train a neural network
to detect whether or not two different pictures feature the same person.
» Reusing its lower layers would allow you to train a good face classifier that
uses little training data.
• Another option is self-supervised learning: automatically generate
the labels from the data itself, then you train a model on the
resulting “labeled” dataset using supervised learning techniques.
Faster Optimizers
• We’ve discussed four ways to speed up training of deep
neural networks:
» Apply a good initialization strategy for the connection weights
» Use a good activation function
» Use Batch Normalization
» Reuse parts of a pretrained network (possibly built on an
auxiliary task or using unsupervised learning).
• Another huge speed boost comes from using a faster
optimizer than the regular Gradient Descent optimizer.
» We will discuss momentum optimization, Nesterov Accelerated
Gradient, AdaGrad, RMSProp, and finally Adam and Nadam
optimization.
Momentum Optimization
• In Gradient Descent, if the local gradient is tiny, progress is very slow.
• Idea: like a bowling ball rolling down a gentle slope on a smooth surface,
the optimizer quickly picks up momentum.
• At each iteration, it subtracts the local gradient from the momentum
vector m (multiplied by the learning rate η), and it updates the weights by
adding this momentum vector.

• A new hyperparameter β, called the momentum, must be set between
0 (high friction) and 1 (no friction).
» A typical momentum value is 0.9.
• The gradient is used for acceleration, not for speed.
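• In Keras, a sketch (the learning rate is illustrative):

from tensorflow import keras

# Update rule: m <- beta*m - eta*gradient;  theta <- theta + m
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)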
Momentum Optimization (cont.)
• If β = 0.9 and the gradient remains constant, the terminal velocity is equal
to 10 times the gradient times the learning rate.
» can escape from plateaus much faster than Gradient Descent.

• In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
momentum optimization helps a lot.
» It can also help roll past local optima.
Nesterov Accelerated Gradient
• The Nesterov Accelerated Gradient (NAG) method, also known as Nesterov
momentum optimization, measures the gradient of the cost function not at the
local position θ but slightly ahead in the direction of the momentum, at θ + βm.

• NAG is generally faster than regular momentum optimization.
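• In Keras, a sketch: simply set nesterov=True on the momentum optimizer
(values illustrative):

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                 nesterov=True)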


AdaGrad
• The AdaGrad algorithm uses an adaptive learning rate: it scales down the
gradient vector along the steepest dimensions.
» This helps it point more directly toward the global optimum in the
elongated-bowl problem.
AdaGrad (cont.)

• Step 1: accumulate the squares of the gradients into the vector s.
» If the cost function is steep along the ith dimension, then si will get larger
and larger at each iteration.
• Step 2: perform a Gradient Descent step in which the gradient vector is
scaled down by a factor of √(s + ε).
» This decays the learning rate, and it does so faster for steep dimensions than
for dimensions with gentler slopes.
• AdaGrad frequently performs well for simple quadratic problems
» but it often stops too early when training neural networks.
» even though Keras has an Adagrad optimizer, you should not use it to train
deep neural networks
RMSProp
• AdaGrad runs the risk of slowing down a bit too fast and never converging
to the global optimum.
• The RMSProp algorithm fixes this by accumulating only the gradients from
the most recent iterations.
» It uses exponential decay in Step 1.

• The decay rate β is typically set to 0.9.

• RMSProp almost always performs much better than AdaGrad
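• In Keras, a sketch (rho corresponds to the decay rate β; values illustrative):

optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)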


Adam and Nadam Optimization
• Adaptive moment estimation (Adam) combines the ideas of momentum
optimization and RMSProp

• t represents the iteration number.


• The momentum decay hyperparameter β1 is typically initialized to 0.9,
while the scaling decay hyperparameter β2 is often initialized to 0.999.

• Adam requires less tuning of the learning rate hyperparameter η.


» You can often use the default value η = 0.001
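• In Keras, a sketch (these values match the defaults discussed above):

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                  beta_2=0.999)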
Adam and Nadam Optimization (cont.)
• AdaMax replaces the l2 norm with the l∞ norm.
» In practice, this can make AdaMax more stable than Adam, but it really depends on
the dataset
» In general, Adam performs better.
• Nadam optimization is Adam optimization plus the Nesterov trick
» Often converge slightly faster than Adam.
» Nadam generally outperforms Adam but is sometimes outperformed by RMSProp.
• Optimizer comparison (* is bad, ** is average, and *** is good):

• Try Nesterov Accelerated Gradient if RMSProp, Adam, and Nadam don’t work.
Learning Rate Scheduling
Learning Rate Scheduling (cont.)
• You can find a good learning rate by
» training the model for a few hundred iterations
» exponentially increasing the learning rate from a very small value to a
very large value
» looking at the learning curve and picking a learning rate slightly lower
than the one at which the learning curve starts shooting back up.
» Then reinitialize your model and train it with that learning rate.
• But you can do better than a constant learning rate:
» If you start with a large learning rate and then reduce it once training
stops making fast progress, you can reach a good solution faster than
with the optimal constant learning rate.
Learning Schedules
• Power scheduling
» Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c,
§ where η0 is the initial learning rate, c is the power (typically set to 1), and s is the
step size.
» This schedule first drops quickly, then more and more slowly.

» In Keras, the optimizer’s decay argument is the inverse of s, and Keras assumes
that c is equal to 1.


• Exponential scheduling
» Set the learning rate to η(t) = η0 · 0.1^(t/s).
» While power scheduling reduces the learning rate more and more slowly,
exponential scheduling keeps slashing it by a factor of 10 every s steps.
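• A sketch of both schedules in Keras (η0 = 0.01 and s = 20 epochs are
illustrative; the decay argument is the legacy tf.keras interface and may not
exist in newer optimizer versions):

from tensorflow import keras

# Power scheduling: decay is the inverse of s, and c is assumed to be 1
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

# Exponential scheduling: pass an epoch -> learning-rate function to a callback
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# model.fit(..., callbacks=[lr_scheduler])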
Learning Schedules (cont.)
• Piecewise constant scheduling
» Use a constant learning rate for a number of epochs, then a smaller learning
rate for another number of epochs, and so on.

• Performance scheduling
» Measure the validation error every N steps (just like for early stopping), and reduce the
learning rate by a factor of λ when the error stops dropping.
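• For performance scheduling, Keras provides the ReduceLROnPlateau callback
(a sketch; the factor and patience values are illustrative):

lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
# model.fit(..., callbacks=[lr_scheduler])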
Learning Schedules (cont.)
• 1cycle scheduling
» Starts by increasing the initial learning rate η0, growing linearly up to η1 halfway
through training.
» Then it decreases the learning rate linearly down to η0 again during the second half
of training.
» Finishing the last few epochs by dropping the rate down by several orders of
magnitude (still linearly).
• The maximum learning rate η1 is chosen using the same approach we used to
find the optimal learning rate. The initial learning rate η0 is chosen to be
roughly 10 times lower.
• When using a momentum,
» Start with a high momentum first (e.g., 0.95)
» Then drop it down to a lower momentum during the first half of training (e.g., down to 0.85,
linearly).
» Then bring it back up to the maximum value (e.g., 0.95) during the second half of training.
» Finishing the last few epochs with that maximum value.
• In summary, exponential decay, performance scheduling, and 1cycle can
considerably speed up convergence.
Avoiding Overfitting Through Regularization

• In Chapter 10, we discussed one of the best regularization techniques:
early stopping.
• Batch Normalization acts like a pretty good regularizer too.
• Other popular regularization techniques for neural networks
» l1 and l2 regularization
» Dropout
» max-norm regularization.
L1 and L2 Regularization
• Use L2 regularization to constrain a neural network’s
connection weights
• Use L1 regularization if you want a sparse model
(with many weights equal to 0).
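• For example, a sketch of applying an l2 penalty to a layer’s connection
weights in Keras (the factor 0.01 is illustrative):

from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
# use keras.regularizers.l1(...) for l1, or keras.regularizers.l1_l2(...) for both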
Dropout
• Dropout is one of the most popular regularization techniques for
deep neural networks.
» Even the state-of-the-art neural networks get a 1–2% accuracy boost
simply by adding dropout.
• At every training step, every neuron (including the input neurons,
but always excluding the output neurons) has a probability p of
being temporarily “dropped out,”
» It will be entirely ignored during this training step, but it may be active
during the next step.
• The hyperparameter p is called the dropout rate
» typically set between 10% and 50%
» closer to 20– 30% in recurrent neural nets
» closer to 40–50% in convolutional neural networks
• After training, neurons don’t get dropped anymore.
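• A sketch in Keras (the dropout rate and layer sizes are illustrative; Keras’
Dropout layer is automatically inactive at test time):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])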
Dropout (cont.)
Dropout (cont.)
• Neurons trained with dropout cannot co-adapt with their neighboring
neurons.
» Since each neuron can be either present or absent, there are a total of 2^N
possible networks (where N is the total number of droppable neurons).
» The resulting neural network can be seen as an averaging ensemble of all
these smaller neural networks.
• Suppose p = 50%, in which case during testing a neuron would be
connected to twice as many input neurons as it would be (on average)
during training.
» To compensate for this fact, we need to multiply each neuron’s input
connection weights by 0.5 after training.
• More generally, we need to multiply each input connection weight by the
keep probability (1 – p) after training.
Dropout (cont.)
Dropout (cont.)
• If you observe that the model is overfitting, you can increase the
dropout rate.
• If you observe that the model is underfitting, you can decrease the
dropout rate.
• It can also help to increase the dropout rate for large layers, and
reduce it for small ones.
• Many state-of-the-art architectures only use dropout after the last
hidden layer, so you may want to try this if full dropout is too
strong.
• Dropout does tend to significantly slow down convergence.
• If you want to regularize a self-normalizing network based on the
SELU activation function (as discussed earlier), you should use alpha
dropout.
Monte Carlo (MC) Dropout
• MC Dropout can boost the performance of any trained dropout model without
having to retrain it or even modify it at all, provides a much better measure of the
model’s uncertainty, and is also amazingly simple to implement.
• Idea: Averaging over multiple predictions with dropout on gives us a Monte Carlo
estimate that is generally more reliable than the result of a single prediction with
dropout off.
• Given a trained dropout model, run the following code to make the prediction:

• If your model contains other layers that behave in a special way during training
(such as BatchNormalization layers), then you should not force training mode like
we just did. Instead, you should replace the Dropout layers with the following
MCDropout class:
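• A sketch of both pieces (model is assumed to be a trained dropout model and
X_test a set of test instances; 100 Monte Carlo samples is illustrative):

import numpy as np
from tensorflow import keras

# MC predictions: keep dropout active at inference time and average the results
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)

# A Dropout layer that stays active even outside training mode, so that other
# layers (e.g., BatchNormalization) can keep their normal test-time behavior
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)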
Max-Norm Regularization
• Max-norm regularization: for each neuron, it constrains the weights w of the
incoming connections such that ∥w∥₂ ≤ r, where r is the max-norm
hyperparameter and ∥·∥₂ is the L2 norm.
» Typically implemented by computing ∥w∥₂ after each training step and
rescaling w if needed (w ← w · r / ∥w∥₂).
§ Reducing r increases the amount of regularization and helps reduce overfitting.
» does not add a regularization loss term to the overall loss function.
» Max-norm regularization can also help alleviate the unstable gradients
problems (if you are not using Batch Normalization).
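• In Keras, a sketch: set the kernel_constraint argument (r = 1.0 is illustrative):

from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_constraint=keras.constraints.max_norm(1.0))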
Practical Guidelines
• The following configuration works fine in most cases, without requiring
much hyperparameter tuning.

• If the network is a simple stack of dense layers, then it can self-normalize,
and you should use the following configuration.
Practical Guidelines (cont.)
• Don’t forget to normalize the input features.
• You should also try to reuse parts of a pretrained neural network if you can
find one that solves a similar problem.
• Use unsupervised pretraining if you have a lot of unlabeled data
• Use pretraining on an auxiliary task if you have a lot of labeled data for a
similar task.
• If you need a sparse model, you can use l1 regularization (and optionally zero
out the tiny weights after training).
• If you need an even sparser model, you can use the TensorFlow Model
Optimization Toolkit.
• If you need a low-latency model (one that performs lightning-fast predictions),
you may need to use fewer layers, fold the Batch Normalization layers into the
previous layers, and possibly use a faster activation function such as leaky
ReLU or just ReLU. Having a sparse model will also help.
• You may want to reduce the float precision from 32 bits to 16 or even 8 bits.
• If you are building a risk-sensitive application, or inference latency is not very
important in your application, you can use MC Dropout to boost performance
and get more reliable probability estimates, along with uncertainty estimates.
