Training Deep Neural Networks
Tsz-Chiu Au
[email protected]
Gradient Clipping (cont.)
• An optimizer created with clipvalue=1.0 will clip every component of the gradient vector to a value between −1.0 and 1.0.
» However, it may change the orientation of the gradient vector.
• If you want to ensure that Gradient Clipping does not change the direction of
the gradient vector, you should clip by norm by setting clipnorm instead of
clipvalue.
» E.g., set clipnorm=1.0
• You may want to try both clipping by value and clipping by norm, with
different thresholds, and see which option performs best on the validation set.
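• A minimal Keras sketch of both options (the tiny model is only a placeholder; the point is the clipvalue / clipnorm arguments):

```python
import tensorflow as tf

# Clip each gradient component to [-1.0, 1.0]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Or clip the whole gradient vector by its L2 norm instead, which preserves
# its direction:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(loss="mse", optimizer=optimizer)
```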
Reusing Pretrained Layers
• It is generally not a good idea to train a very large
DNN from scratch.
» You should always try to find an existing neural network
that accomplishes a similar task, and then reuse the lower
layers of this network.
• Transfer learning:
» It will not only speed up training considerably, but also require
significantly less training data.
Reusing Pretrained Layers (cont.)
• Suppose you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects.
• You now want to train a DNN to classify specific types of vehicles.
• Then you should try to reuse parts of the first network.
• The output layer of the original model should usually be replaced, because it is most likely not useful at all for the new task.
Reusing Pretrained Layers (cont.)
• The upper hidden layers of the original model are less likely
to be as useful as the lower layers, since the high-level
features that are most useful for the new task may differ
significantly from the ones that were most useful for the
original task.
• Try freezing all the reused layers first, then train your model
and see how it performs.
• Then try unfreezing one or two of the top hidden layers to
let backpropagation tweak them and see if performance
improves.
• It is also useful to reduce the learning rate when you
unfreeze reused layers: this will avoid wrecking their fine-
tuned weights.
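• A minimal Keras sketch of this freeze-then-unfreeze workflow (the model, layer sizes, and learning rates are illustrative assumptions, not values from these slides):

```python
import tensorflow as tf

# Hypothetical model whose two lower Dense layers stand in for reused,
# pretrained layers; the output layer is new.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),    # "reused"
    tf.keras.layers.Dense(100, activation="relu"),    # "reused"
    tf.keras.layers.Dense(10, activation="softmax"),  # new output layer
])

# 1) Freeze all the reused layers, then train and see how the model performs.
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])
# model.fit(...)

# 2) Unfreeze the top reused layer and recompile with a lower learning rate,
#    so backpropagation does not wreck its fine-tuned weights.
model.layers[-2].trainable = True
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])
# model.fit(...)
```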
Transfer Learning with Keras
• Suppose someone built and trained a Keras model (model A) on a Fashion MNIST training set containing only eight classes (every class except sandal and shirt).
• You now want to tackle a different task: train a binary classifier (positive = shirt, negative = sandal), model B, on a small dataset.
• Since your task is quite similar to the first task, try transfer learning (see the sketch after this list).
• Transfer learning does not work very well with small dense
networks.
» Presumably because small networks learn few patterns, and dense
networks learn very specific patterns, which are unlikely to be useful
in other tasks.
• Transfer learning works best with deep convolutional neural
networks, which tend to learn feature detectors that are
much more general (especially in the lower layers).
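• A minimal Keras sketch of this setup, assuming model A is a Sequential model saved under the hypothetical path "my_model_A":

```python
import tensorflow as tf

# Load the pretrained model A (hypothetical path).
model_A = tf.keras.models.load_model("my_model_A")

# Clone A and copy its weights first, so that training B does not also
# silently modify A's weights (the layers would otherwise be shared).
model_A_clone = tf.keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

# Reuse all of A's layers except its output layer, then add a new sigmoid
# output layer for the binary shirt-vs-sandal task.
model_B_on_A = tf.keras.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Start with the reused layers frozen.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])
```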
Unsupervised Pretraining
• It is often cheap to gather unlabeled training examples, but
expensive to label them.
• Unsupervised pretraining: use the unlabeled data to train
an unsupervised model such as an autoencoder or a
generative adversarial network.
» Then you can reuse the lower layers of the autoencoder or the lower layers of the GAN’s discriminator, add the output layer for your task on top, and fine-tune the final network using supervised learning.
• A good option when you have a complex task to solve, no
similar model you can reuse, and little labeled training data
but plenty of unlabeled training data.
• Today people typically use autoencoders or GANs rather than restricted Boltzmann machines (RBMs).
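• A minimal sketch of this idea with a stacked autoencoder (the layer sizes, the 10-class output, and the variables X_unlabeled / X_labeled are assumptions for illustration):

```python
import tensorflow as tf

# Encoder and decoder for an unsupervised autoencoder.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(784),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10)   # plenty of unlabeled data

# Reuse the pretrained encoder, add the task's output layer on top, and
# fine-tune on the small labeled dataset with supervised learning.
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(10, activation="softmax"),
])
classifier.compile(loss="sparse_categorical_crossentropy",
                   optimizer="adam", metrics=["accuracy"])
# classifier.fit(X_labeled, y_labeled, epochs=10)        # little labeled data
```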
Greedy Layer-wise Pretraining
• Greedy layer-wise pretraining was used in the early days of Deep Learning.
» First, an unsupervised model with a single layer (typically an RBM) was trained; then that layer was frozen and another layer was added on top of it, and the model was trained again (effectively just training the new layer); then the new layer was frozen, another layer was added on top, the model was trained again, and so on.
• But nowadays, people generally train the full unsupervised model in one
shot.
Pretraining on an Auxiliary Task
• If you do not have much labeled training data, one option is to train
a first neural network on an auxiliary task for which you can easily
obtain or generate labeled training data, then reuse the lower
layers of that network for your actual task.
• For example, you want to build a system to recognize faces, but you
only have a few pictures of each individual.
» Gather pictures of random people on the web and train a neural network
to detect whether or not two different pictures feature the same person.
» Reusing its lower layers would allow you to train a good face classifier that
uses little training data.
• Another option is self-supervised learning: automatically generate labels from the data itself, then train a model on the resulting “labeled” dataset using supervised learning techniques.
Faster Optimizers
• We’ve discussed four ways to speed up training of deep
neural networks:
» Apply a good initialization strategy for the connection weights
» Use a good activation function
» Use Batch Normalization
» Reuse parts of a pretrained network (possibly built on an
auxiliary task or using unsupervised learning).
• Another huge speed boost comes from using a faster
optimizer than the regular Gradient Descent optimizer.
» We will discuss momentum optimization, Nesterov Accelerated
Gradient, AdaGrad, RMSProp, and finally Adam and Nadam
optimization.
Momentum Optimization
• Gradient Descent goes very slowly when the local gradient is tiny.
• Idea: like a bowling ball rolling down a gentle slope on a smooth surface, the optimizer quickly picks up momentum.
• At each iteration, it subtracts the local gradient from the momentum
vector m (multiplied by the learning rate η), and it updates the weights by
adding this momentum vector.
• In deep neural networks that don’t use Batch Normalization, the upper
layers will often end up having inputs with very different scales, so using
momentum optimization helps a lot.
» It can also help roll past local optima.
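• For reference, the standard momentum update is m ← βm − η∇θJ(θ), then θ ← θ + m, where β is the momentum hyperparameter (typically 0.9). In Keras, momentum optimization is just an argument to the SGD optimizer (the learning rate below is an arbitrary example):

```python
import tensorflow as tf

# Momentum optimization in Keras; momentum corresponds to β above.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
```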
Nesterov Accelerated Gradient
• The Nesterov Accelerated Gradient (NAG) method, also known as Nesterov
momentum optimization, measures the gradient of the cost function not at the
local position θ but slightly ahead in the direction of the momentum, at θ + βm.
• Try Nesterov Accelerated Gradient if RMSProp, Adam, and Nadam don’t work.
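• In Keras, NAG is the same SGD optimizer with nesterov=True (the learning rate is again an arbitrary example):

```python
import tensorflow as tf

# Nesterov Accelerated Gradient in Keras.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                    nesterov=True)
```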
Learning Rate Scheduling
• You can find a good learning rate by
» training the model for a few hundred iterations
» exponentially increasing the learning rate from a very small value to a
very large value (see the callback sketch after this list)
» looking at the learning curve and picking a learning rate slightly lower
than the one at which the learning curve starts shooting back up.
» Then reinitialize your model and train it with that learning rate.
• But you can do better than a constant learning rate:
» If you start with a large learning rate and then reduce it once training
stops making fast progress, you can reach a good solution faster than
with the optimal constant learning rate.
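• One way to implement that exponential increase is a small callback like this sketch (the class name ExponentialLR and the growth factor are assumptions; it assumes tf.keras from TensorFlow 2.x):

```python
import tensorflow as tf

class ExponentialLR(tf.keras.callbacks.Callback):
    """Multiply the learning rate by a constant factor after every batch and
    record (rate, loss) pairs so the learning curve can be plotted."""
    def __init__(self, factor=1.005):
        super().__init__()
        self.factor = factor
        self.rates, self.losses = [], []

    def on_train_batch_end(self, batch, logs=None):
        lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        self.rates.append(float(lr))
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.learning_rate,
                                   lr * self.factor)
```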
Learning Schedules
• Power scheduling
» Set the learning rate to a function of the iteration number t: η(t) = η0 / (1 + t/s)^c.
§ where η0 is the initial learning rate, c is the power (typically set to 1), and s is the
step size.
» This schedule first drops quickly, then more and more slowly.
• Performance scheduling
» Measure the validation error every N steps (just like for early stopping), and reduce the
learning rate by a factor of λ when the error stops dropping.
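• Hedged Keras sketches of both schedules (the concrete numbers are placeholders): power scheduling with c = 1 can be expressed as an InverseTimeDecay schedule, and performance scheduling as a ReduceLROnPlateau callback.

```python
import tensorflow as tf

# Power scheduling with c = 1: eta(t) = eta0 / (1 + t/s).
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01,   # eta0 (placeholder value)
    decay_steps=10_000,           # s (placeholder value)
    decay_rate=1.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

# Performance scheduling: multiply the learning rate by `factor` whenever the
# validation loss has not improved for `patience` epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=5)
# model.fit(..., callbacks=[reduce_lr])   # pass the callback to fit()
```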
Learning Schedules (cont.)
• 1cycle scheduling
» Start at the initial learning rate η0 and increase it linearly up to η1 halfway
through training.
» Then decrease the learning rate linearly back down to η0 during the second half
of training.
» Finish the last few epochs by dropping the rate down by several orders of
magnitude (still linearly).
• The maximum learning rate η1 is chosen using the same approach we used to
find the optimal learning rate. The initial learning rate η0 is chosen to be
roughly 10 times lower.
• When using momentum:
» Start with a high momentum (e.g., 0.95).
» Then drop it down to a lower value during the first half of training (e.g., down to 0.85, linearly).
» Then bring it back up to the maximum value (e.g., 0.95) during the second half of training.
» Finish the last few epochs with that maximum value.
• In summary, exponential decay, performance scheduling, and 1cycle can
considerably speed up convergence.
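• A rough 1cycle sketch as a custom callback (the class name, the 90% point where the final drop begins, and the use of tf.keras.backend.set_value with tf.keras from TF 2.x are assumptions):

```python
import tensorflow as tf

class OneCycleLR(tf.keras.callbacks.Callback):
    """Rough 1cycle schedule: eta0 -> eta1 over the first half of training,
    back down to eta0 over the second half, then a final sharp linear drop."""
    def __init__(self, total_steps, eta0, eta1, final_lr=None):
        super().__init__()
        self.total_steps, self.eta0, self.eta1 = total_steps, eta0, eta1
        self.final_lr = final_lr if final_lr is not None else eta0 / 1000
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        half = self.total_steps // 2
        last = int(0.9 * self.total_steps)   # start of the final drop (assumption)
        if self.step < half:                 # first half: eta0 -> eta1
            lr = self.eta0 + (self.eta1 - self.eta0) * self.step / half
        elif self.step < last:               # second half: eta1 -> eta0
            lr = self.eta1 - (self.eta1 - self.eta0) * (self.step - half) / (last - half)
        else:                                # last few epochs: eta0 -> final_lr
            frac = min(1.0, (self.step - last) / max(1, self.total_steps - last))
            lr = self.eta0 - (self.eta0 - self.final_lr) * frac
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr)
        self.step += 1

# Example usage (placeholder values):
# model.fit(X, y, epochs=25, callbacks=[OneCycleLR(total_steps, 1e-3, 1e-2)])
```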
Avoiding Overfitting Through Regularization
• If your model contains other layers that behave in a special way during training
(such as BatchNormalization layers), then you should not force training mode on the
entire model. Instead, you should replace the Dropout layers with the following
MCDropout class:
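• A minimal version of such a class (subclassing tf.keras.layers.Dropout so that dropout stays active even at inference time, letting you average predictions over many stochastic forward passes):

```python
import tensorflow as tf

class MCDropout(tf.keras.layers.Dropout):
    """Dropout layer that stays active at inference time (for MC Dropout)."""
    def call(self, inputs, training=None):
        # Always run in training mode so dropout is applied during prediction too.
        return super().call(inputs, training=True)
```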
Max-Norm Regularization
• Max-norm regularization: for each neuron, it constrains the
weights w of the incoming connections such that ∥w∥₂ ≤ r,
where r is the max-norm hyperparameter and ∥·∥₂ is the L2
norm.
» It is typically implemented by computing ∥w∥₂ after each training step and
rescaling w if needed (w ← w · r/∥w∥₂).
§ Reducing r increases the amount of regularization and helps reduce overfitting.
» does not add a regularization loss term to the overall loss function.
» Max-norm regularization can also help alleviate the unstable gradients
problems (if you are not using Batch Normalization).
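• In Keras, this can be done with a kernel constraint; a minimal sketch (the layer size and r = 1.0 are arbitrary):

```python
import tensorflow as tf

# After each training step, Keras rescales each neuron's incoming weight
# vector so that its L2 norm does not exceed max_value (r = 1.0 here).
dense = tf.keras.layers.Dense(
    100, activation="relu",
    kernel_initializer="he_normal",
    kernel_constraint=tf.keras.constraints.max_norm(1.0))
```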
Practical Guidelines
• The following configuration works fine in most cases, without requiring
much hyperparameter tuning.