
Chapter 11: Training Deep Neural Networks

Tsz-Chiu Au
[email protected]
Ulsan National Institute of Science and Technology (UNIST)
South Korea
Training a Deep Neural Network
• In Chapter 10, all the neural networks were shallow---they had just
a few hidden layers.
• To tackle a complex problem, we need to train a much deeper
DNN.
• Common problems when training deep DNNs.
» Vanishing gradients problem and exploding gradients problem
» Not enough training data
» Training may be extremely slow
» Overfitting the training set
• We will discuss how to address these issues.
The Vanishing/Exploding Gradients Problems

• The backpropagation algorithm works by propagating the error gradient
from the output layer to the input layer in the reverse pass.
» Uses these gradients to update each parameter with a Gradient Descent
step.
• Vanishing Gradients Problem: gradients often get smaller and
smaller as the algorithm progresses down to the lower layers.
» Gradient Descent fails to update the lower layers’ connection weights, and
training never converges to a good solution.
• Exploding gradients problem: In recurrent neural networks, the
gradients can grow bigger and bigger until layers get insanely large
weight updates and the algorithm diverges.
• For these reasons, DNNs were mostly abandoned in the early
2000s.
Logistic Activation Function Saturation
• Xavier Glorot and Yoshua Bengio found that these problems
are due to the use of the logistic sigmoid activation function
and the popular weight initialization technique at the time
(normal distribution with mean 0 and standard deviation 1)
» The variance of the outputs of each layer is much greater than the
variance of its inputs.
» The variance keeps increasing after each layer until the activation
function saturates at the top layers.
» There is then virtually no gradient to propagate back through the network.
Xavier/Glorot Initialization
• Glorot and Bengio propose a way to significantly alleviate the unstable
gradients problem.
» We need the signal to flow properly in both directions: forward when
making predictions, and backward when backpropagating gradients.
§ We don’t want the signal to die out, nor do we want it to explode and saturate.
» The idea: make the variance of the outputs of each layer equal to the variance
of its inputs, and make the gradients have equal variance before and after
flowing through a layer in the reverse direction.
• In Glorot initialization, where fanavg = (fanin + fanout)/2, initialize the
weights by:
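» Glorot initialization (Equation 11-1): draw the weights from a normal
distribution with mean 0 and variance σ² = 1/fanavg, or from a uniform
distribution between −r and +r, with r = √(3/fanavg).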

• If we replace fanavg with fanin in Eq. 11-1, we get LeCun initialization.


Xavier/Glorot Initialization (cont.)
• Using Glorot initialization can speed up training considerably, and it is one
of the tricks that led to the success of Deep Learning.
• For different activation functions:

• Keras uses Glorot initialization by default. To use He initialization:

• If you want He initialization with a uniform distribution but based on fanavg
rather than fanin:
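• A minimal sketch of both options in Keras (assuming "from tensorflow import
keras"; the layer size and activation are illustrative):

from tensorflow import keras

# He initialization (normal distribution, based on fan_in)
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

# He-style initialization with a uniform distribution based on fan_avg,
# via the VarianceScaling initializer
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
keras.layers.Dense(10, activation="relu", kernel_initializer=he_avg_init)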
Nonsaturating Activation Functions
• Glorot and Bengio also found that unstable gradients are partly
due to a poor choice of activation function.
» Although sigmoid activation functions look like the activation function
of biological neurons, there are better activation functions than
sigmoid activation functions.

• A popular alternative is the ReLU activation function, which does not
saturate for positive values and is fast to compute.
» But it suffers from the dying ReLUs problem---some neurons stop
outputting anything other than 0.
§ In some cases, half of the network’s neurons are dead.

• Solution: The leaky ReLU.


Other Leaky ReLU Activation Functions
• The randomized leaky ReLU (RReLU)
» where α is picked randomly in a given range during
training and is fixed to an average value during
testing.
» can act as a regularizer.
• The parametric leaky ReLU (PReLU)
» where α is authorized to be learned during training
(instead of being a hyperparameter)
» Strongly outperform ReLU on large image datasets,
but on smaller datasets it runs the risk of overfitting
the training set.
Exponential Linear Unit (ELU)
• Exponential linear unit (ELU) outperformed all the ReLU variants in
some experiments.

• The unit has an average output closer to 0.


• A nonzero gradient for z < 0, which avoids the dead neurons
problem.
• α is usually set to 1, and then the function is smooth everywhere.
• Drawback: slower to compute than the ReLU function and its variants.
» But its faster convergence rate during training compensates for the slower
computation. However, an ELU network is still slower than a ReLU network at
test time.
Scaled ELU (SELU)
• Scaled ELU (SELU) is a scaled variant of the ELU activation function
• If you build a neural network composed exclusively of a stack of
dense layers, and if all hidden layers use the SELU activation function,
then the network will self-normalize.
» The output of each layer will tend to preserve a mean of 0 and standard
deviation of 1 during training, which solves the vanishing/exploding
gradients problem.
• Conditions for self-normalization:
» The input features must be standardized (mean 0 and standard deviation 1).
» Every hidden layer’s weights must be initialized with LeCun normal
initialization.
» The network’s architecture must be sequential.
§ For recurrent networks and skip connections, SELU will not necessarily outperform other
activation functions.
» All layers are dense.
Which activation functions should you
use for hidden layers?
• In general, SELU > ELU > leaky ReLU (and its variants) >
ReLU > tanh > logistic.
• If the network’s architecture prevents it from self-
normalizing, then ELU may perform better than SELU.
• If you care a lot about runtime latency, then you may
prefer leaky ReLU.
• ReLU is the most used activation function (by far), and many libraries
and hardware accelerators provide ReLU-specific optimizations.
» If speed is your priority, ReLU might still be the best choice.
Using LeakyReLU and SELU in Keras
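• A minimal sketch (assuming "from tensorflow import keras"; layer sizes, the
leak α, and the surrounding architecture are illustrative):

from tensorflow import keras

# Leaky ReLU: add it as its own layer right after the layer it should apply to
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    keras.layers.Dense(10, activation="softmax")
])

# SELU: use the "selu" activation together with LeCun normal initialization
selu_layer = keras.layers.Dense(100, activation="selu",
                                kernel_initializer="lecun_normal")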
Batch Normalization
• Batch Normalization (BN): add a BN layer in the model just before or
after the activation function of each hidden layer.
» The BN layer zero-centers and normalizes each input, then scales and
shifts the result using two new parameter vectors per layer: one for
scaling, the other for shifting.
» The mean and standard deviation of the inputs are evaluated over the
current mini-batch.
Batch Normalization (cont.)
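• For reference, the Batch Normalization algorithm (Equation 11-3), for a
mini-batch B of size mB:
» μB = (1/mB) Σ x(i) — the vector of input means over the mini-batch
» σB² = (1/mB) Σ (x(i) − μB)² — the vector of input variances over the mini-batch
» x̂(i) = (x(i) − μB) / √(σB² + ε) — the zero-centered and normalized inputs
» z(i) = γ ⊗ x̂(i) + β — the rescaled and shifted outputs of the BN operation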
Batch Normalization (cont.)
• At test time we cannot compute the inputs’ mean and standard deviation
on the fly, since there is no mini-batch.
• Most implementations of Batch Normalization estimate the
inputs’ mean and standard deviation during training by using a
moving average of the layer’s input means and standard
deviations.
• Learn four parameter vectors during training:
» γ (the output scale vector) and β (the output offset vector) are learned
through regular backpropagation
» μ (the final input mean vector) and σ (the final input standard deviation
vector) are estimated using an exponential moving average.
• Note that μ and σ are estimated during training, but they are used only
after training.
» They replace the batch input means and standard deviations in Equation 11-3.
Batch Normalization (cont.)
• In the original paper, Batch Normalization considerably improved all the
deep neural networks the authors experimented with, leading to a huge
improvement in the ImageNet classification task.
• The vanishing gradients problem was strongly reduced.
• The networks were also much less sensitive to the weight
initialization.
• Can use much larger learning rates, significantly speeding up the
learning process.
• Batch Normalization acts like a regularizer, reducing the need for
other regularization techniques.
• To avoid the runtime penalty at inference, fuse each BN layer with the
previous layer by updating the previous layer’s weights and biases so that it
directly produces outputs of the appropriate scale and offset.
Implementing BN with Keras
• In Keras, you can add a BatchNormalization layer before or after each
hidden layer’s activation function, and optionally add a BN layer as the
first layer in your model.
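• For example, a minimal sketch for 28×28 inputs (layer sizes and activations
are illustrative; here the BN layers are placed after the activations):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),   # optional BN layer as the first layer
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])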
Implementing BN with Keras (cont.)
• Some argued in favor of adding the BN layers before the
activation functions, rather than after.
» But there is some debate about this.
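• One way to sketch the "BN before activation" variant: remove the activation
from the hidden layers, add it as a separate layer after BN, and drop the bias
terms (BN’s shift parameter β makes them redundant):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])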
Implementing BN with Keras (cont.)

• The BatchNormalization class has quite a few hyperparameters you can tweak.
» momentum – the decay factor used to update the exponential moving
averages (a value close to 1, e.g., 0.99).
» axis – determines which axis should be normalized (the last axis by default).
» The defaults will usually be fine.
• BatchNormalization has become one of the most-used layers in deep neural
networks.
Gradient Clipping
• Gradient clipping: mitigate the exploding gradients problem by clipping the
gradients during backpropagation so that they never exceed some threshold.
» most often used in recurrent neural networks.
• In Keras, set the clipvalue or clipnorm argument when creating an optimizer:
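• A sketch (the threshold 1.0 and the loss are illustrative; the model object
is assumed to be defined elsewhere):

from tensorflow import keras

# Clip each gradient component to the range [-1.0, 1.0]
optimizer = keras.optimizers.SGD(clipvalue=1.0)
# model.compile(loss="mse", optimizer=optimizer)

# Alternative: clip the whole gradient vector by its L2 norm (direction preserved)
# optimizer = keras.optimizers.SGD(clipnorm=1.0)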

• This optimizer will clip every component of the gradient vector to a value
between –1.0 and 1.0.
» However, it may change the orientation of the gradient vector.
• If you want to ensure that Gradient Clipping does not change the direction of
the gradient vector, you should clip by norm by setting clipnorm instead of
clipvalue.
» E.g., set clipnorm=1.0
• You may want to try both clipping by value and clipping by norm, with
different thresholds, and see which option performs best on the validation set.
Reusing Pretrained Layers
• It is generally not a good idea to train a very large
DNN from scratch.
» You should always try to find an existing neural network
that accomplishes a similar task, and then reuse the lower
layers of this network.
• Transfer learning:
» It not only speeds up training considerably, but also requires
significantly less training data.
Reusing Pretrained Layers (cont.)
• Suppose you have access to a
DNN that was trained to
classify pictures into 100 different categories, including
animals, plants, vehicles, and
everyday objects.
• You now want to train a DNN to
classify specific types of
vehicles.
• Then you should try to reuse
parts of the first network.
• The output layer of the original
model should usually be
replaced because it is most
likely not useful at all for the
new task
Reusing Pretrained Layers (cont.)
• The upper hidden layers of the original model are less likely
to be as useful as the lower layers, since the high-level
features that are most useful for the new task may differ
significantly from the ones that were most useful for the
original task.
• Try freezing all the reused layers first, then train your model
and see how it performs.
• Then try unfreezing one or two of the top hidden layers to
let backpropagation tweak them and see if performance
improves.
• It is also useful to reduce the learning rate when you
unfreeze reused layers: this will avoid wrecking their fine-
tuned weights.
Transfer Learning with Keras
• Suppose someone built and trained a Keras model (model A) on a subset of
Fashion MNIST containing only eight of the ten classes.
• You now want to tackle a different task: train a binary
classifier (positive=shirt, negative=sandal) (model B) with a
small dataset.
• Since your task is quite similar to the first task, try transfer
learning:

• To avoid affecting the weights of model_A when you train model_B_on_A,
clone model_A and copy its weights.
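• A sketch of this setup (the file name "my_model_A.h5" and the new output
layer are illustrative):

from tensorflow import keras

model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])    # reuse all but the output layer
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))  # new binary output layer

# Clone model A and copy its weights, so that training B leaves A untouched
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())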


Transfer Learning with Keras (cont.)
• Now you could train model_B_on_A for task B, but since the
new output layer was initialized randomly it will make large
errors, so there will be large error gradients that may wreck
the reused weights.
• Solution: freeze the reused layers during the first few epochs,
giving the new layer some time to learn reasonable weights.
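• Continuing the sketch above (freeze every reused layer, then recompile; the
optimizer and loss are illustrative):

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False        # freeze the reused layers

model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])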
Transfer Learning with Keras (cont.)
• You can train the model for a few epochs, then unfreeze the
reused layers (which requires compiling the model again) and
continue training to fine-tune the reused layers for task B.
• After unfreezing the reused layers, it is usually a good idea to
reduce the learning rate, once again to avoid damaging the
reused weights.
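• Continuing the sketch (X_train_B/y_train_B and X_valid_B/y_valid_B are
assumed to hold the task-B data; epoch counts and the reduced learning rate
are illustrative):

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True          # unfreeze the reused layers

optimizer = keras.optimizers.SGD(learning_rate=1e-4)   # much lower learning rate
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))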
Transfer Learning with Keras (cont.)

• Transfer learning does not work very well with small dense
networks.
» Presumably because small networks learn few patterns, and dense
networks learn very specific patterns, which are unlikely to be useful
in other tasks.
• Transfer learning works best with deep convolutional neural
networks, which tend to learn feature detectors that are
much more general (especially in the lower layers).
Unsupervised Pretraining
• It is often cheap to gather unlabeled training examples, but
expensive to label them.
• Unsupervised pretraining: use the unlabeled data to train
an unsupervised model such as an autoencoder or a
generative adversarial network.
» Then you can reuse the lower layers of the autoencoder or the
lower layers of the GAN’s discriminator, add the output layer for
your task on top, and fine-tune the final network using
supervised learning
• A good option when you have a complex task to solve, no
similar model you can reuse, and little labeled training data
but plenty of unlabeled training data.
• Today, people typically use autoencoders or GANs rather than
restricted Boltzmann machines (RBMs).
Greedy Layer-wise Pretraining
• Greedy layer-wise pretraining was used in the early days of Deep Learning.
» First train an unsupervised model with a single layer (typically an RBM), then
freeze that layer and add another one on top of it, then train the model again
(effectively just training the new layer), then freeze the new layer and add
another layer on top of it, train the model again, and so on.
• But nowadays, people generally train the full unsupervised model in one
shot.
Pretraining on an Auxiliary Task
• If you do not have much labeled training data, one option is to train
a first neural network on an auxiliary task for which you can easily
obtain or generate labeled training data, then reuse the lower
layers of that network for your actual task.
• For example, you want to build a system to recognize faces, but you
only have a few pictures of each individual.
» Gather pictures of random people on the web and train a neural network
to detect whether or not two different pictures feature the same person.
» Reusing its lower layers would allow you to train a good face classifier that
uses little training data.
• Another option is self-supervised learning: automatically generate
the labels from the data itself, then you train a model on the
resulting “labeled” dataset using supervised learning techniques.
Faster Optimizers
• We’ve discussed four ways to speed up training of deep
neural networks:
» Apply a good initialization strategy for the connection weights
» Use a good activation function
» Use Batch Normalization
» Reuse parts of a pretrained network (possibly built on an
auxiliary task or using unsupervised learning).
• Another huge speed boost comes from using a faster
optimizer than the regular Gradient Descent optimizer.
» We will discuss momentum optimization, Nesterov Accelerated
Gradient, AdaGrad, RMSProp, and finally Adam and Nadam
optimization.
Momentum Optimization
• In Gradient Descent, if the local gradient is tiny, progress is very slow.
• Idea: like a bowling ball rolling down a gentle slope on a smooth surface,
the optimizer quickly picks up momentum.
• At each iteration, it subtracts the local gradient from the momentum
vector m (multiplied by the learning rate η), and it updates the weights by
adding this momentum vector.

• A new hyperparameter β, called the momentum, must be set between
0 (high friction) and 1 (no friction).
» A typical momentum value is 0.9.
• The gradient is used for acceleration, not for speed.
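• In Keras, a sketch (the learning rate is illustrative):

from tensorflow import keras

# Update rule: m <- beta*m - eta*gradient;  theta <- theta + m
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)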
Momentum Optimization (cont.)
• If β = 0.9 and the gradient remains constant, the terminal velocity is equal
to 10 times the gradient times the learning rate.
» can escape from plateaus much faster than Gradient Descent.

• In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
momentum optimization helps a lot.
» It can also help roll past local optima.
Nesterov Accelerated Gradient
• The Nesterov Accelerated Gradient (NAG) method, also known as Nesterov
momentum optimization, measures the gradient of the cost function not at the
local position θ but slightly ahead in the direction of the momentum, at θ + βm.

• NAG is generally faster than regular momentum optimization.
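• In Keras, a sketch: simply set nesterov=True on the momentum optimizer
(values illustrative):

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                 nesterov=True)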


AdaGrad
• The AdaGrad algorithm uses an adaptive learning rate: it scales down the
gradient vector along the steepest dimensions.
» This helps it point more directly toward the global optimum in the
elongated-bowl problem.
AdaGrad (cont.)

• Step 1: accumulate the squares of the gradients into the vector s.
» If the cost function is steep along the ith dimension, then si will get larger
and larger at each iteration.
• Step 2: perform a Gradient Descent step in which the gradient vector is
scaled down by a factor of √(s + ε).
» This decays the learning rate, and it does so faster for steep dimensions than
for dimensions with gentler slopes.
• AdaGrad frequently performs well for simple quadratic problems
» but it often stops too early when training neural networks.
» even though Keras has an Adagrad optimizer, you should not use it to train
deep neural networks
RMSProp
• AdaGrad runs the risk of slowing down a bit too fast and never converging
to the global optimum.
• The RMSProp algorithm fixes this by accumulating only the gradients from
the most recent iterations.
» It uses exponential decay in Step 1.

• The decay rate β is typically set to 0.9.

• RMSProp almost always performs much better than AdaGrad
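• In Keras, a sketch (rho corresponds to the decay rate β; values illustrative):

optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)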


Adam and Nadam Optimization
• Adaptive moment estimation (Adam) combines the ideas of momentum
optimization and RMSProp

• t represents the iteration number.


• The momentum decay hyperparameter β1 is typically initialized to 0.9,
while the scaling decay hyperparameter β2 is often initialized to 0.999.

• Adam requires less tuning of the learning rate hyperparameter η.


» You can often use the default value η = 0.001
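• In Keras, a sketch (these values match the defaults discussed above):

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                  beta_2=0.999)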
Adam and Nadam Optimization (cont.)
• AdaMax replaces the l2 norm with the l∞ norm.
» In practice, this can make AdaMax more stable than Adam, but it really depends on
the dataset
» In general, Adam performs better.
• Nadam optimization is Adam optimization plus the Nesterov trick
» Often converge slightly faster than Adam.
» Nadam generally outperforms Adam but is sometimes outperformed by RMSProp.
• Optimizer comparison (* is bad, ** is average, and *** is good):

• Try Nesterov Accelerated Gradient if RMSProp, Adam, and Nadam don’t work.
Learning Rate Scheduling
Learning Rate Scheduling (cont.)
• You can find a good learning rate by
» training the model for a few hundred iterations
» exponentially increasing the learning rate from a very small value to a
very large value
» looking at the learning curve and picking a learning rate slightly lower
than the one at which the learning curve starts shooting back up.
» Then reinitialize your model and train it with that learning rate.
• But you can do better than a constant learning rate:
» If you start with a large learning rate and then reduce it once training
stops making fast progress, you can reach a good solution faster than
with the optimal constant learning rate.
Learning Schedules
• Power scheduling
» Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c,
§ where η0 is the initial learning rate, c is the power (typically set to 1), and s is the
step size.
» This schedule first drops quickly, then more and more slowly.

» In Keras, the optimizer’s decay argument is the inverse of s, and Keras assumes
that c is equal to 1.


• Exponential scheduling
» Set the learning rate to η(t) = η0 · 0.1^(t/s).
» While power scheduling reduces the learning rate more and more slowly,
exponential scheduling keeps slashing it by a factor of 10 every s steps.
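• A sketch of both schedules in Keras (η0 = 0.01 and s = 20 epochs are
illustrative; the decay argument is the legacy tf.keras interface and may not
exist in newer optimizer versions):

from tensorflow import keras

# Power scheduling: decay is the inverse of s, and c is assumed to be 1
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

# Exponential scheduling: pass an epoch -> learning-rate function to a callback
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# model.fit(..., callbacks=[lr_scheduler])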
Learning Schedules (cont.)
• Piecewise constant scheduling
» Use a constant learning rate for a number of epochs, then a smaller learning
rate for another number of epochs, and so on.

• Performance scheduling
» Measure the validation error every N steps (just like for early stopping), and reduce the
learning rate by a factor of λ when the error stops dropping.
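• For performance scheduling, Keras provides the ReduceLROnPlateau callback
(a sketch; the factor and patience values are illustrative):

lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
# model.fit(..., callbacks=[lr_scheduler])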
Learning Schedules (cont.)
• 1cycle scheduling
» Starts by increasing the initial learning rate η0, growing linearly up to η1 halfway
through training.
» Then it decreases the learning rate linearly down to η0 again during the second half
of training.
» Finishing the last few epochs by dropping the rate down by several orders of
magnitude (still linearly).
• The maximum learning rate η1 is chosen using the same approach we used to
find the optimal learning rate. The initial learning rate η0 is chosen to be
roughly 10 times lower.
• When using a momentum,
» Start with a high momentum first (e.g., 0.95)
» Then drop it down to a lower momentum during the first half of training (e.g., down to 0.85,
linearly).
» Then bring it back up to the maximum value (e.g., 0.95) during the second half of training.
» Finishing the last few epochs with that maximum value.
• In summary, exponential decay, performance scheduling, and 1cycle can
considerably speed up convergence.
Avoiding Overfitting Through Regularization

• In Chapter 10, we discussed one of the best regularization techniques:
early stopping.
• Batch Normalization acts like a pretty good regularizer too.
• Other popular regularization techniques for neural networks
» l1 and l2 regularization
» Dropout
» max-norm regularization.
L1 and L2 Regularization
• Use L2 regularization to constrain a neural network’s
connection weights
• Use L1 regularization if you want a sparse model
(with many weights equal to 0).
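• For example, a sketch of applying an l2 penalty to a layer’s connection
weights in Keras (the factor 0.01 is illustrative):

from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
# use keras.regularizers.l1(...) for l1, or keras.regularizers.l1_l2(...) for both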
Dropout
• Dropout is one of the most popular regularization techniques for
deep neural networks.
» Even the state-of-the-art neural networks get a 1–2% accuracy boost
simply by adding dropout.
• At every training step, every neuron (including the input neurons,
but always excluding the output neurons) has a probability p of
being temporarily “dropped out,”
» It will be entirely ignored during this training step, but it may be active
during the next step.
• The hyperparameter p is called the dropout rate
» typically set between 10% and 50%
» closer to 20– 30% in recurrent neural nets
» closer to 40–50% in convolutional neural networks
• After training, neurons don’t get dropped anymore.
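• A sketch in Keras (the dropout rate and layer sizes are illustrative; Keras’
Dropout layer is automatically inactive at test time):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])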
Dropout (cont.)
Dropout (cont.)
• Neurons trained with dropout cannot co-adapt with their neighboring
neurons.
» Since each neuron can be either present or absent, there are a total of 2^N
possible networks (where N is the total number of droppable neurons).
» The resulting neural network can be seen as an averaging ensemble of all
these smaller neural networks.
• Suppose p = 50%, in which case during testing a neuron would be
connected to twice as many input neurons as it would be (on average)
during training.
» To compensate for this fact, we need to multiply each neuron’s input
connection weights by 0.5 after training.
• More generally, we need to multiply each input connection weight by the
keep probability (1 – p) after training.
Dropout (cont.)
Dropout (cont.)
• If you observe that the model is overfitting, you can increase the
dropout rate.
• If you observe that the model is underfitting, you can decrease the
dropout rate.
• It can also help to increase the dropout rate for large layers, and
reduce it for small ones.
• Many state-of-the-art architectures only use dropout after the last
hidden layer, so you may want to try this if full dropout is too
strong.
• Dropout does tend to significantly slow down convergence.
• If you want to regularize a self-normalizing network based on the
SELU activation function (as discussed earlier), you should use alpha
dropout.
Monte Carlo (MC) Dropout
• MC Dropout can boost the performance of any trained dropout model without
having to retrain it or even modify it at all, provides a much better measure of the
model’s uncertainty, and is also amazingly simple to implement.
• Idea: Averaging over multiple predictions with dropout on gives us a Monte Carlo
estimate that is generally more reliable than the result of a single prediction with
dropout off.
• Given a trained dropout model, run the following code to make the prediction:

• If your model contains other layers that behave in a special way during training
(such as BatchNormalization layers), then you should not force training mode like
we just did. Instead, you should replace the Dropout layers with the following
MCDropout class:
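• A sketch of both pieces (model is assumed to be a trained dropout model and
X_test a set of test instances; 100 Monte Carlo samples is illustrative):

import numpy as np
from tensorflow import keras

# MC predictions: keep dropout active at inference time and average the results
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)

# A Dropout layer that stays active even outside training mode, so that other
# layers (e.g., BatchNormalization) can keep their normal test-time behavior
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)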
Max-Norm Regularization
• Max-norm regularization: for each neuron, it constrains the weights w of the
incoming connections such that ∥w∥₂ ≤ r, where r is the max-norm
hyperparameter and ∥·∥₂ is the L2 norm.
» Typically implemented by computing ∥w∥₂ after each training step and
rescaling w if needed (w ← w · r / ∥w∥₂).
§ Reducing r increases the amount of regularization and helps reduce overfitting.
» does not add a regularization loss term to the overall loss function.
» Max-norm regularization can also help alleviate the unstable gradients
problems (if you are not using Batch Normalization).
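• In Keras, a sketch: set the kernel_constraint argument (r = 1.0 is illustrative):

from tensorflow import keras

layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_constraint=keras.constraints.max_norm(1.0))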
Practical Guidelines
• The following configuration works fine in most cases, without requiring
much hyperparameter tuning.

• If the network is a simple stack of dense layers, then it can self-normalize,
and you should use the following configuration.
Practical Guidelines (cont.)
• Don’t forget to normalize the input features.
• You should also try to reuse parts of a pretrained neural network if you can
find one that solves a similar problem.
• Use unsupervised pretraining if you have a lot of unlabeled data
• Use pretraining on an auxiliary task if you have a lot of labeled data for a
similar task.
• If you need a sparse model, you can use l1 regularization (and optionally zero
out the tiny weights after training).
• If you need an even sparser model, you can use the TensorFlow Model
Optimization Toolkit.
• If you need a low-latency model (one that performs lightning-fast predictions),
you may need to use fewer layers, fold the Batch Normalization layers into the
previous layers, and possibly use a faster activation function such as leaky
ReLU or just ReLU. Having a sparse model will also help.
• You may want to reduce the float precision from 32 bits to 16 or even 8 bits.
• If you are building a risk-sensitive application, or inference latency is not very
important in your application, you can use MC Dropout to boost performance
and get more reliable probability estimates, along with uncertainty estimates.
