Ch2-Training, Optimization and Regularization of DNN-new
Regularization of DNN
By Dr. Shraddha Atul Mithbavkar
• Multilayer Feed-Forward Neural Network
• A Multilayer Feed-Forward Neural Network (MFFNN) is an
interconnected artificial neural network with multiple
layers. Its neurons have weights associated with them
and compute their outputs using activation functions.
Signals flow only from the input units to the output
units: the network has no loops and no feedback, and no
signal moves backward from the output layer to the
hidden or input layers.
Multilayer Feed-Forward Neural Network
Learning factor:
• Initial weight
• Fixing up desired output
• Non separable patterns
• Learning constant
• Momentum
dW(t) = -η dE(t) + α dW(t-1), where α = 0.1 to 0.8
• Steepness of activation function(λ)
Activation function:
• It introduces non-linear properties into the neural
network.
• It converts the linear input signal of a node into a
non-linear output signal, which lets deep networks
learn higher-order relationships that go beyond
degree one.
• It should be differentiable (at least almost everywhere)
so that gradients can be computed during backpropagation.
• Need for non-linearity:
• With a non-linear activation function, the network
can learn complex problems such as speech
recognition and video, audio, and image
processing.
• Soft and hard limit function types:
• Linear:
Output = net
• Hard limit / unipolar binary:
Output = 0 if net <= 0
       = 1 if net > 0
• Symmetrical hard limit / bipolar binary:
Output = -1 if net <= 0
       = +1 if net > 0
• Saturating linear:
Output = 0 if net < 0
       = net if 0 <= net < 1
       = 1 if net >= 1
• Symmetrical saturating linear:
Output = -1 if net < -1
       = net if -1 <= net < 1
       = +1 if net >= 1
• Logistic / unipolar continuous (sigmoid function):
Output = 1 / (1 + exp(-λ·net)), where λ is the steepness of the activation function
• P(cat/X)= P(tiger/X)=
• P(dog/X)= P(non/X)=
From the above, we can say that the supplied image is of a dog.
• Leaky ReLU: with ReLU the output is zero whenever
net < 0, so a neuron can stop learning. This is called
the dying ReLU problem. It can be avoided by adding a
small slope in the negative range, which gives Leaky ReLU.
• F(net) = max(a·net, net), i.e. net if net > 0 and a·net
otherwise. Here a is a small positive constant
less than 1 (e.g. a = 0.1, 0.05, ...).
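The activation functions listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names are hypothetical, and the sigmoid form with steepness `lam` is the standard logistic function, assumed here because the slide's own formula did not survive.

```python
import math

def linear(net):
    # Output equals the net input.
    return net

def hard_limit(net):
    # Unipolar binary: 0 for net <= 0, 1 for net > 0.
    return 1.0 if net > 0 else 0.0

def symmetrical_hard_limit(net):
    # Bipolar binary: -1 for net <= 0, +1 for net > 0.
    return 1.0 if net > 0 else -1.0

def saturating_linear(net):
    # Clamp the net input to the range [0, 1].
    return min(1.0, max(0.0, net))

def symmetrical_saturating_linear(net):
    # Clamp the net input to the range [-1, 1].
    return min(1.0, max(-1.0, net))

def sigmoid(net, lam=1.0):
    # Logistic function; lam is the steepness (assumed standard form).
    return 1.0 / (1.0 + math.exp(-lam * net))

def relu(net):
    # Zero for negative net input, identity otherwise.
    return max(0.0, net)

def leaky_relu(net, a=0.1):
    # Small slope a in the negative range instead of a flat zero.
    return max(a * net, net)
```

For a negative input such as net = -2, `relu` returns 0 while `leaky_relu` returns a small negative value, which is what keeps a gradient flowing in the negative range.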
Loss Function
A loss function helps you measure how well your
model predicts and how well it is able to
generalize. It computes the error for every training
example as the distance between the current output
and the expected output.
• Squared Error loss
• Cross entropy
• Binary cross entropy
• Squared Error loss: Mean squared error (MSE) is calculated
by averaging the squared errors, i.e. the squared
differences between the actual and the predicted
values. A larger MSE indicates that the predictions
lie far from the actual values, whereas a smaller
MSE indicates that they lie close to them. A smaller
MSE is therefore preferred.
• MSE = (1/n) * Σ(actual - predicted)^2
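The MSE formula above translates directly into code. The sketch below uses a hypothetical helper name and made-up sample values purely for illustration.

```python
def mean_squared_error(actual, predicted):
    # MSE = (1/n) * sum((actual - predicted)^2)
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Illustrative values only: two exact predictions and one that is off by 2.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the errors are squared, a single large miss contributes much more to the loss than several small ones.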
• Cross-entropy loss function: Cross-entropy
loss is one of the most important cost functions. It is
used to optimize classification models.
Understanding cross-entropy depends on
understanding the Softmax activation function.
• Consider a 4-class classification task where an
image is classified as either a dog, cat, horse,
or cheetah.
In the above Figure, Softmax converts logits into probabilities. The purpose
of the Cross-Entropy is to take the output probabilities (P) and measure the
distance from the truth values (as shown in Figure below).
For the example above the desired output is [1,0,0,0] for the class dog but the
model outputs [0.775, 0.116, 0.039, 0.070] .
The objective is to make the model output as close as possible to the desired
output (truth values). During model training, the model weights are iteratively
adjusted with the aim of minimizing the cross-entropy loss. This
process of adjusting the weights is what defines model training, and as the model
keeps training and the loss keeps decreasing, we say that the model is learning.
Cross Entropy=-1*log2(0.775)=0.3677
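The calculation above can be checked with a short sketch. The function names are hypothetical, and the logits [3.2, 1.3, 0.2, 0.8] are an assumption chosen because they reproduce the probabilities quoted in the example; they do not appear in the slides.

```python
import math

def softmax(logits):
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, base=2.0):
    # Only classes with a non-zero target contribute to the sum.
    return -sum(t * math.log(p, base) for t, p in zip(target, probs) if t > 0)

# Assumed logits that give roughly [0.775, 0.116, 0.039, 0.070].
probs = softmax([3.2, 1.3, 0.2, 0.8])

# Loss for the desired output [1, 0, 0, 0] (class dog), using log base 2.
loss = cross_entropy([1, 0, 0, 0], [0.775, 0.116, 0.039, 0.070])
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct class, which is why only 0.775 appears in the calculation.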
• Binary cross-entropy is a special case
of cross-entropy, used when the target is either
0 or 1. In a neural network, you typically
produce this prediction with a sigmoid activation.
• The target is then a single value, not a probability vector. We can
still use cross-entropy with a little trick: treat the target y as the
two-class vector [y, 1 - y] and the prediction p as [p, 1 - p].
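A minimal sketch of binary cross-entropy follows, assuming the usual formulation in which the single target y and prediction p are expanded to the two-class pairs [y, 1 - y] and [p, 1 - p]. The function name and the clipping constant `eps` are illustrative assumptions, and the natural logarithm is used here.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip p away from 0 and 1 so that log() is always defined.
    p = min(max(p, eps), 1.0 - eps)
    # Cross-entropy between [y, 1 - y] and [p, 1 - p].
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# A confident correct prediction gives a small loss,
# a confident wrong prediction gives a large one.
loss_correct = binary_cross_entropy(1, 0.9)
loss_wrong = binary_cross_entropy(0, 0.9)
```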
How to select activation function and loss function

Problem         Output Type                     Activation Function  Loss Function
Regression      Numerical                       Linear               MSE
Classification  Binary                          Sigmoid              Binary cross entropy
Classification  Single label, multiple class    Softmax              Cross entropy
Classification  Multiple label, multiple class  Sigmoid              Binary cross entropy
Optimization
[Figure: multilayered feed-forward neural network with layers labelled Z, Y and O and weights labelled V and W]
delta = -learning_rate * gradient
theta += delta
Batch gradient
• Batch gradient descent: All training data is taken into
consideration to take a single step. We take the
average of the gradients of all training examples
and then use that mean gradient to update
our parameters.
• It works well for convex or smooth error
surfaces, where it moves directly toward the optimum
solution.
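As a minimal sketch of batch gradient descent, the example below fits a single weight w in the model y = w * x using a squared-error loss. The model, the toy data, the learning rate and the function name are all assumptions made for illustration.

```python
def batch_gradient_step(w, xs, ys, learning_rate):
    # Gradient of (w*x - y)^2 with respect to w, for every training example.
    grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    # One step uses the mean gradient over the whole dataset.
    mean_grad = sum(grads) / len(grads)
    return w - learning_rate * mean_grad

# Toy data generated from y = 2x, so the optimum is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
for _ in range(200):
    w = batch_gradient_step(w, xs, ys, learning_rate=0.05)
```

On this convex problem every step moves w straight toward the optimum, which is the behaviour described above for smooth error surfaces.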
Stochastic gradient descent
• Stochastic gradient descent:
• If our dataset is huge, it is difficult and time-consuming to
consider all training examples for every parameter update.
Hence, in stochastic gradient descent we consider one
example at a time to take a single step. We do the following
steps in one epoch:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Update the weights
5. Repeat steps 1 to 4 for all examples
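The five steps above can be sketched for a one-weight model y = w * x with a squared-error loss. The model, data, learning rate and function name are illustrative assumptions; shuffling the examples each epoch is common practice but is not stated in the steps.

```python
import random

def sgd_epoch(w, xs, ys, learning_rate, rng):
    order = list(range(len(xs)))
    rng.shuffle(order)  # assumption: visit the examples in random order
    for i in order:
        # Steps 1 to 3: take one example, feed it forward, compute its gradient.
        grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
        # Step 4: update the weight immediately.
        w -= learning_rate * grad
    # Step 5: the loop has repeated this for all examples.
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x

rng = random.Random(0)
w = 0.0
for _ in range(50):
    w = sgd_epoch(w, xs, ys, learning_rate=0.01, rng=rng)
```

The weight is updated four times per epoch here, once per example, instead of once per pass over the data.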
Stochastic gradient descent
• Disadvantage of SGD: We are considering one
example at a time, so the cost fluctuates from
example to example and does not decrease at every
step. In the long run the cost decreases on average,
but it keeps fluctuating around the minimum rather
than settling exactly at it.
Mini Batch Gradient Descent
• Mini Batch Gradient Descent:
• Batch gradient descent suits smooth error curves and SGD
suits huge datasets. Batch GD converges directly to the
minimum, and SGD converges faster for large datasets.
But SGD takes one example at a time, so each update is noisy. Hence a
combination of both methods is used, which is called
Mini Batch Gradient Descent.
• Here we use a batch of a fixed number of training
examples, smaller than the full dataset, and
call it a mini batch.
Mini Batch Gradient Descent
• Steps of Mini Batch Gradient Descent
1. Pick a mini batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini batch
4. Use the mean gradient to update the weights
5. Repeat steps 1 to 4 for all the mini batches we created
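The steps above can be sketched for a one-weight model y = w * x with a squared-error loss. The model, data, batch size, learning rate and function name are assumptions chosen for illustration.

```python
import random

def minibatch_epoch(w, xs, ys, learning_rate, batch_size, rng):
    order = list(range(len(xs)))
    rng.shuffle(order)
    # Step 5: loop over all the mini batches created from the shuffled data.
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]  # step 1: pick a mini batch
        # Steps 2 and 3: forward pass and mean gradient over the mini batch.
        mean_grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        # Step 4: update the weight with the mean gradient.
        w -= learning_rate * mean_grad
    return w

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # generated from y = 2x

rng = random.Random(0)
w = 0.0
for _ in range(100):
    w = minibatch_epoch(w, xs, ys, learning_rate=0.02, batch_size=2, rng=rng)
```

Setting `batch_size` to 1 gives stochastic gradient descent and setting it to the dataset size gives batch gradient descent, so the mini-batch version covers both extremes.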
Momentum based Gradient Descent
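• Momentum adds a fraction α of the previous update to the current
gradient step, using the momentum term listed under Learning factor:
delta = -learning_rate * gradient + alpha * previous_delta
theta += delta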
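A minimal sketch of a momentum update, assuming the rule given under Learning factor (the new update is the negative gradient step plus α times the previous update). The test function f(theta) = theta², the learning rate and the function name are illustrative assumptions; α = 0.8 is taken from the range 0.1 to 0.8 quoted earlier.

```python
def momentum_step(theta, previous_delta, gradient, learning_rate=0.1, alpha=0.8):
    # Current gradient step plus a fraction alpha of the previous update.
    delta = -learning_rate * gradient + alpha * previous_delta
    return theta + delta, delta

# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, delta = 1.0, 0.0
for _ in range(300):
    gradient = 2.0 * theta
    theta, delta = momentum_step(theta, delta, gradient)
```

The previous update acts like velocity: consecutive gradients that point the same way build up speed, while gradients that alternate in sign partly cancel.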
RMSprop optimizer
• RMSprop optimizer:
• Root Mean Squared Propagation, or RMSProp for short, is an extension to the
gradient descent optimization algorithm.
• RMSProp is designed to accelerate the optimization process, e.g. decrease the
number of function evaluations required to reach the optima, or to improve
the capability of the optimization algorithm, e.g. result in a better final result.
• A problem with AdaGrad is that it can slow the search down too much,
resulting in very small learning rates for each parameter or dimension of the
search by the end of the run. This has the effect of stopping the search too
soon, before the minimum can be located.
• RMSProp avoids this by using a decaying moving average of the squared partial
derivatives. This is achieved by adding a new hyperparameter, which we will call
rho, that acts like momentum for the partial derivatives.
• Using a decaying moving average of the partial derivative allows the search to
forget early partial derivative values and focus on the most recently seen
shape of the search space.
RMSprop optimizer
• The calculation of the mean squared partial derivative
for one parameter is as follows:
• s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))
• Where s(t+1) is the decaying moving average of the
squared partial derivative for one parameter for the
current iteration of the algorithm, s(t) is the decaying
moving average squared partial derivative for the
previous iteration, f'(x(t))^2 is the squared partial
derivative for the current parameter, and rho is a
hyperparameter, typically with the value of 0.9 like
momentum.
RMSprop optimizer
• We are using a decaying average of the squared partial
derivatives, and taking the square root of this average gives
the technique its name: the square root of the mean squared
partial derivatives, or root mean square (RMS). For example, the
custom step size for a parameter may be written as:
• cust_step_size(t+1) = step_size / (1e-8 + RMS(s(t+1))), where RMS(s(t+1)) = sqrt(s(t+1))
• Once we have the custom step size for the parameter, we can
update the parameter using the custom step size and the partial
derivative f'(x(t)).
• x(t+1) = x(t) – cust_step_size(t+1) * f'(x(t))
• This process is then repeated for each input variable until a new
point in the search space is created and can be evaluated.
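The RMSProp equations above can be sketched for a single parameter as follows. The test function f(x) = x², the step size of 0.01 and the function name are assumptions made for illustration; rho = 0.9 and the 1e-8 constant come from the text.

```python
import math

def rmsprop_step(x, s, grad, step_size=0.01, rho=0.9, eps=1e-8):
    # s(t+1) = s(t) * rho + f'(x(t))^2 * (1 - rho)
    s = s * rho + grad ** 2 * (1.0 - rho)
    # cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t+1)))
    cust_step_size = step_size / (eps + math.sqrt(s))
    # x(t+1) = x(t) - cust_step_size(t+1) * f'(x(t))
    x = x - cust_step_size * grad
    return x, s

# Minimise f(x) = x^2, whose derivative is 2 * x.
x, s = 1.0, 0.0
for _ in range(500):
    x, s = rmsprop_step(x, s, 2.0 * x)
```

With a fixed step size, the parameter ends up hovering near the minimum rather than landing on it exactly, because each normalised step has roughly the magnitude of `step_size`.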
sum_of_gradient_squared = previous_sum_of_gradient_squared * rho + gradient² * (1 - rho)
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
Adam
• Adam: It is one of the most effective optimization algorithms for
training neural networks. It combines ideas from RMSProp
and Momentum.
1. It calculates an exponentially weighted average of the past
gradients and stores it in the variables v (before bias
correction) and v_corrected (with bias correction).
2. It calculates an exponentially weighted average of the squares
of the past gradients and stores it in the variables s (before bias
correction) and s_corrected (with bias correction).
3. It updates the parameters in a direction based on combining
the information from 1 and 2.
Adam
Where:
• t counts the number of Adam steps taken
• L is the number of layers
• β1 and β2 are hyperparameters that control
the two exponentially weighted
averages
Adam
• AdaGrad uses the second moment with no decay to deal with
sparse features. RMSProp uses the second moment with a
decay rate to speed up over AdaGrad. Adam uses both the first
and second moments, and is generally a good default choice.
• Adam is a replacement optimization algorithm for stochastic
gradient descent for training deep learning models.
• Adam combines the best properties of the AdaGrad and
RMSProp algorithms to provide an optimization algorithm
that can handle sparse gradients on noisy problems.
• Adam is relatively easy to configure where the default
configuration parameters do well on most problems.
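sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1)
sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2)
delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
theta += delta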
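A minimal sketch of one Adam step for a single parameter, using the variable names v, s, v_corrected and s_corrected from the description earlier. The bias-correction form and the default values (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used ones and are assumptions here, as are the test function f(theta) = theta², the learning rate and the function name.

```python
import math

def adam_step(theta, v, s, grad, t, learning_rate=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # 1. Exponentially weighted average of past gradients (first moment).
    v = beta1 * v + (1.0 - beta1) * grad
    # 2. Exponentially weighted average of squared gradients (second moment).
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    # Bias correction; t is the 1-based count of Adam steps taken.
    v_corrected = v / (1.0 - beta1 ** t)
    s_corrected = s / (1.0 - beta2 ** t)
    # 3. Update the parameter by combining both averages.
    theta = theta - learning_rate * v_corrected / (math.sqrt(s_corrected) + eps)
    return theta, v, s

# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta, v, s = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, v, s = adam_step(theta, v, s, 2.0 * theta, t)
```

Thanks to the bias correction, the very first step has a magnitude of about `learning_rate` regardless of how large the gradient is.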
Nesterov Accelerated Gradient (NAG)
Although our input X was normalized, over time the outputs of the layers will no
longer be on the same scale. As the data passes through multiple layers of the
neural network and L activation functions are applied, this leads to an internal
covariate shift in the data, which is the problem that batch normalization is
designed to address.
Data Augmentation
• Data augmentation is the process of
generating new training examples for our
dataset. More training data means lower
model variance and hence lower generalization
error. It is a form of noise injection into the
training dataset.
Data augmentation
• Types of Data augmentation
• Basic Data Manipulations
• Feature space augmentation
• GAN based Augmentation
• Meta learning
Data augmentation
• Types of Data augmentation:
• Basic data manipulations: geometric and color transformations
applied to the data. Examples: image flipping, cropping,
rotation, translation, color modification, image mixing.
• Feature space augmentation: Instead of transforming
data in the input space as above, we can apply
transformations in the feature space. For example, an
autoencoder might be used to extract a latent
representation, and changing that representation (for
example by adding noise) results in a transformation of the
original data point.
• GAN based augmentation: Generative adversarial
networks have been proven to work extremely well for
data generation, so they are a natural choice for data
augmentation.
• Generative modeling is an unsupervised learning task
in machine learning that involves automatically
discovering and learning the regularities or patterns in
input data in such a way that the model can be used
to generate or output new examples that plausibly
could have been drawn from the original dataset.
[Figure: image transformation using a GAN. The picture shows how the place would have looked in the winter season.]
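The basic data manipulations listed earlier (flipping, cropping, translation) can be sketched on a tiny image stored as a list of pixel rows. The function names and the list-of-lists image format are assumptions for illustration; real pipelines would normally use an image library.

```python
def flip_horizontal(image):
    # Reverse every row (mirror left to right).
    return [row[::-1] for row in image]

def flip_vertical(image):
    # Reverse the order of the rows (mirror top to bottom).
    return [row[:] for row in image[::-1]]

def crop(image, top, left, height, width):
    # Keep a height x width window starting at (top, left).
    return [row[left:left + width] for row in image[top:top + height]]

def translate_right(image, shift, fill=0):
    # Shift pixels to the right by a non-negative amount, padding with fill.
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[:width - shift] for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]

# Each transformed copy can be added to the training set with the original label.
augmented = [flip_horizontal(image), flip_vertical(image),
             crop(image, 0, 1, 2, 2), translate_right(image, 1)]
```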
• Meta learning: we use a neural network to optimize other
neural networks by tuning their hyperparameters, improving
their layout, and more.
• In simple terms, we use a classification network to tune an
augmentation network into generating better images.
• Example: we feed random images to a GAN, which
generates augmented images. Both the augmented images and the
originals are passed into a second network, which compares
them and tells us how good the augmented images are. After
repeating the process, the augmentation network becomes
better and better at producing new images.
• Meta-learning algorithms learn from the
output of other machine learning algorithms
that learn from data. This means that
meta-learning requires the presence of other
learning algorithms that have already been
trained on data.
Weight Decay
• Weight decay is a regularization technique in
deep learning.
• Weight decay works by adding a penalty term
to the cost function of a neural network, which
has the effect of shrinking the weights during
backpropagation.
• This helps prevent the network from overfitting the
training data and can also help with the exploding
gradient problem.
Weight Decay
• Comparing weights and biases, the
weights directly influence the relationship
between the input and the output learned by
the neural network because they are
multiplied by the inputs.
• Mathematically, the bias only offsets the
relationship from the intercept. Therefore we
usually regularize only the weights.
Weight Decay with the L2 norm
• The L2 penalty is the most commonly used
regularization term for neural networks. You
apply L2 regularization by adding the sum of the
squared weights, multiplied by a hyperparameter
lambda that you pick manually, to the error
term E.
• The full equation for the cost function then
looks like this, where the function L represents
a loss function such as cross-entropy or mean
squared error:
cost = L + lambda * Σ w^2
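A minimal sketch of the L2 penalty and its effect on a gradient step, assuming the cost has the form loss + lambda * (sum of squared weights) as described above. Some texts scale the penalty by 1/2, which only changes the constant in the gradient. The function names and numeric values are illustrative assumptions.

```python
def l2_penalty(weights, lam):
    # lambda times the sum of the squared weights (biases are left out).
    return lam * sum(w ** 2 for w in weights)

def regularized_cost(loss, weights, lam):
    # Loss value plus the L2 penalty term.
    return loss + l2_penalty(weights, lam)

def step_with_weight_decay(weights, loss_grads, learning_rate, lam):
    # The penalty contributes 2 * lam * w to each weight's gradient,
    # which shrinks every weight toward zero on each update.
    return [w - learning_rate * (g + 2.0 * lam * w)
            for w, g in zip(weights, loss_grads)]

weights = [1.0, -2.0]
cost = regularized_cost(0.5, weights, lam=0.01)

# With zero loss gradient, only the decay acts: each weight is scaled by 0.9.
decayed = step_with_weight_decay(weights, [0.0, 0.0], learning_rate=0.1, lam=0.5)
```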
Adding noise to the input and output
• Adding noise means that the network is less able to
memorize training samples because they are
changing all of the time, resulting in smaller
network weights and a more robust network that
has lower generalization error.
• If the random noise corresponds to the coefficients
being 0, then it will pull our final estimates
toward being smaller: the actual data saying the
coefficients are larger will have to compete with the
random noise saying the coefficients are small.
Adding noise to the input and output
• Example: in speech recognition applications, the
collected speech dataset can be mixed
with noise in multiple combinations to enlarge
the dataset. This helps to train the neural
network with lots of data and avoid
overfitting.
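Adding noise to the inputs can be sketched as follows. Gaussian noise, the standard deviation, the toy samples and the function names are assumptions made for illustration; the slides do not specify the type of noise.

```python
import random

def add_gaussian_noise(sample, std, rng):
    # Add independent zero-mean Gaussian noise to every input value.
    return [x + rng.gauss(0.0, std) for x in sample]

def augment_with_noise(dataset, copies, std, rng):
    # Keep the original samples and append noisy copies of each one.
    augmented = [list(sample) for sample in dataset]
    for _ in range(copies):
        augmented.extend(add_gaussian_noise(sample, std, rng) for sample in dataset)
    return augmented

rng = random.Random(0)
dataset = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Three noisy copies per sample turn 2 samples into 8.
bigger_dataset = augment_with_noise(dataset, copies=3, std=0.05, rng=rng)
```

Drawing fresh noise every epoch, rather than once up front, means the network never sees exactly the same sample twice.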
Thank you