Ch2-Training, Optimization and Regularization of DNN-new

The document discusses the training, optimization, and regularization of Deep Neural Networks (DNN), focusing on Multilayer Feed-Forward Neural Networks (MFFNN) and various activation functions such as ReLU, Softmax, and Sigmoid. It also covers loss functions like Squared Error and Cross-Entropy, alongside optimization techniques including Gradient Descent, Stochastic Gradient Descent, and advanced methods like RMSprop and Adam. The content emphasizes the importance of selecting appropriate activation and loss functions based on the problem type for effective model training.

Uploaded by

CM-A-Jivhesh Choudhari

Training, Optimization and

Regularization of DNN
By Dr. Shraddha Atul Mithbavkar
• Multilayer Feed-Forward Neural Network
• A Multilayer Feed-Forward Neural Network (MFFNN) is an
interconnected artificial neural network with multiple
layers of neurons. Each neuron has weights associated with
it and computes its result using an activation function.
It is a type of neural network in which signals flow only
from the input units to the output units: it does not have
any loops or feedback, and no signal moves backward from
the output to the hidden and input layers.
Multilayer Feed-Forward Neural Network
Learning factor:
• Initial weight
• Fixing up desired output
• Non separable patterns
• Learning constant
• Momentum
ΔW(t) = -η ∇E(t) + α ΔW(t-1), where α = 0.1 to 0.8
• Steepness of activation function(λ)
Activation function:
• It introduces non-linear properties into the neural
network.
• It converts the linear input signal of a node into a
non-linear output signal, which lets deep networks learn
higher-order relationships that go beyond degree one.
• It should be differentiable so that gradient-based
training can be applied.
• Need for non-linearity:
• With non-linear activation functions, a network
is able to learn complex problems such as
speech recognition and video, audio, and image
processing.
• Soft and hard limit function types:
• Linear
Output = net
• Rectified Linear Unit (ReLU): a fast-learning activation function that
often shows better generalization and performance than the sigmoid
and tanh functions.
Output = max(0, net) = net if net >= 0
Output = max(0, net) = 0 if net < 0
• Hard limit / Unipolar binary
Output = 0 if net <= 0
       = 1 if net > 0
• Symmetrical hard limit / Bipolar binary
Output = -1 if net <= 0
       = 1 if net > 0
• Saturating linear
Output = 0 if net < 0
       = net if 0 <= net < 1
       = 1 if net >= 1
• Symmetrical saturating linear
Output = -1 if net < -1
       = net if -1 <= net < 1
       = 1 if net >= 1
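• Logistic / Unipolar continuous (Sigmoid function):
Output = 1 / (1 + e^(-λ·net)), with values between 0 and 1
It is used in back-propagation networks and it is differentiable.
• Tanh / Bipolar continuous:
Output = tanh(net) = (e^net - e^(-net)) / (e^net + e^(-net)), with values between -1 and +1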
• Softmax function: it converts a vector of
numbers into a vector of probabilities. It
is used for multiclass classification problems in
machine learning. Its output is the probability of
each class. The softmax function is used in
the last layer of the network.
• Example: we apply an input image X to a network with the classes cat, dog,
tiger and none, and at the output of the network we
get Zi = [1.25, 2.44, 0.78, 0.12]. The
probability distribution is then calculated as
P(class i | X) = e^(Zi) / Σj e^(Zj):
• P(cat|X) ≈ 0.191, P(tiger|X) ≈ 0.119
• P(dog|X) ≈ 0.628, P(none|X) ≈ 0.062
From the above we can say that the supplied image is a dog.
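The softmax example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `softmax` and the class labels are illustrative choices, and the max-subtraction is a common numerical-stability step that is not part of the slides.

```python
import math

def softmax(logits):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "tiger", "none"]
z = [1.25, 2.44, 0.78, 0.12]  # network outputs Zi from the example
probs = softmax(z)
predicted = classes[probs.index(max(probs))]

for name, p in zip(classes, probs):
    print(f"P({name}|X) = {p:.3f}")
print("predicted class:", predicted)
```

The largest logit always receives the largest probability, so the predicted class is the same as taking the maximum of Zi directly; softmax adds a normalized confidence for every class.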
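• Leaky ReLU: with ReLU the output is zero whenever
net < 0, so a neuron can stop learning; this is called the dying ReLU problem. It can be
avoided by adding a small slope in the negative range.
This is called Leaky ReLU.
• F(net) = max(a·net, net) = net if net > a·net (that is, net > 0), and a·net otherwise. Here a is
a positive constant less than 1 (e.g. a = 0.1, 0.05, ...).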
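The activation functions listed in this section can be written directly from their definitions. The sketch below is illustrative only: the function names are my own, `steepness` stands for the λ mentioned under learning factors, and `a=0.1` is one of the example slopes given for Leaky ReLU.

```python
import math

def linear(net):
    return net

def relu(net):
    return max(0.0, net)

def leaky_relu(net, a=0.1):
    # a is the small slope applied to negative inputs
    return max(a * net, net)

def hard_limit(net):
    # unipolar binary: 0 or 1
    return 1.0 if net > 0 else 0.0

def symmetric_hard_limit(net):
    # bipolar binary: -1 or +1
    return 1.0 if net > 0 else -1.0

def saturating_linear(net):
    # clip to the range [0, 1]
    return min(1.0, max(0.0, net))

def symmetric_saturating_linear(net):
    # clip to the range [-1, 1]
    return min(1.0, max(-1.0, net))

def sigmoid(net, steepness=1.0):
    # unipolar continuous, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-steepness * net))

def bipolar_continuous(net):
    # tanh, output in (-1, 1)
    return math.tanh(net)

for net in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(net, relu(net), leaky_relu(net), saturating_linear(net),
          round(sigmoid(net), 3), round(bipolar_continuous(net), 3))
```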
Loss Function
A loss function helps you measure the
performance of your model's predictions and, when evaluated on unseen data, how
well the model generalizes. It computes
the error for every training example. It is the distance between
the current output and the expected output.
• Squared Error loss
• Cross entropy
• Binary cross entropy
• Squared Error loss: Mean squared error (MSE) is calculated
by taking the average, specifically the mean, of
the squared errors between the actual values and the
model's predictions. A larger MSE indicates that the
predictions are dispersed widely around the actual
values, whereas a smaller MSE suggests
the opposite. A smaller MSE is preferred because it
indicates that the predictions lie close
to the actual values.
• MSE = (1/n) * Σ(actual - predicted)^2
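As a small illustration of the MSE formula, the sketch below computes it for a handful of made-up values; the numbers in `actual` and `predicted` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mean_squared_error(actual, predicted):
    """MSE = (1/n) * sum((actual - predicted)^2)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n

actual = [3.0, -0.5, 2.0, 7.0]     # hypothetical target values
predicted = [2.5, 0.0, 2.0, 8.0]   # hypothetical model outputs
mse = mean_squared_error(actual, predicted)
print(mse)  # squared errors are 0.25, 0.25, 0.0, 1.0, so the mean is 0.375
```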
• Cross-entropy loss function: Cross-entropy
loss is one of the most important cost functions. It is
used to optimize classification models.
Understanding cross-entropy depends on
understanding the softmax activation function.
• Consider a 4-class classification task where an
image is classified as either a dog, cat, horse
or cheetah.
In the above figure, softmax converts logits into probabilities. The purpose
of the cross-entropy is to take the output probabilities (P) and measure the
distance from the truth values (as shown in the figure below).
For the example above, the desired output is [1, 0, 0, 0] for the class dog, but the
model outputs [0.775, 0.116, 0.039, 0.070].
The objective is to make the model output as close as possible to the desired
output (truth values). During model training, the model weights are iteratively
adjusted with the aim of minimizing the cross-entropy loss. This
process of adjusting the weights is what defines model training; as the model
keeps training and the loss keeps decreasing, we say that the model is learning.
Cross Entropy = -1 * log2(0.775) = 0.3677
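The cross-entropy value for this example can be reproduced as follows. This is a sketch under the assumption that the loss is -Σ t·log(p); the slide uses a base-2 logarithm, while many libraries use the natural logarithm, so both are shown. The `eps` clamp is an added safeguard against log(0), not something from the slides.

```python
import math

def cross_entropy(target, predicted, base=2.0, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps), base)
                for t, p in zip(target, predicted))

target = [1, 0, 0, 0]                     # desired output for class dog
predicted = [0.775, 0.116, 0.039, 0.070]  # model output after softmax

ce_bits = cross_entropy(target, predicted, base=2.0)     # as on the slide
ce_nats = cross_entropy(target, predicted, base=math.e)  # natural logarithm
print(round(ce_bits, 4), round(ce_nats, 4))
```

With a one-hot target only the probability assigned to the correct class contributes, so the loss falls to 0 as that probability approaches 1.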
• Binary cross-entropy is another special case
of cross-entropy, used when the target is either
0 or 1. In a neural network, you typically
produce this prediction with a sigmoid activation.
• The target is then a single value, not a probability vector, but we can
still use cross-entropy with a little trick: treat the target y and the
prediction p as the two-class distributions [y, 1 - y] and [p, 1 - p].
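One common way to write binary cross-entropy is to treat the single target y and the sigmoid output p as two-class distributions [y, 1 - y] and [p, 1 - p]. The sketch below assumes that form and the natural logarithm; the sample values and the `eps` clamp are illustrative assumptions.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """BCE = -(y*log(p) + (1-y)*log(1-p)) for a target y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

loss_correct = binary_cross_entropy(1, 0.9)  # confident and right
loss_wrong = binary_cross_entropy(0, 0.9)    # confident and wrong
print(round(loss_correct, 4), round(loss_wrong, 4))
```

A confident wrong prediction is penalized far more heavily than a confident correct one, which is the behaviour wanted from a classification loss.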
How to select the activation function and loss function
Problem | Output Type | Activation Function | Loss Function
Regression | Numerical | Linear | MSE
Classification | Binary | Sigmoid | Binary Cross Entropy
Classification | Single label, multiple class | Softmax | Cross Entropy
Classification | Multiple label, multiple class | Sigmoid | Binary Cross Entropy
Optimization
Multilayered Feed Forward Neural Network
[Figure: three-layer network labelled Z, Y, O with weight matrices V and W; layer i (input), layer j (hidden), layer k (output).]

Back Propagation Training algorithm
Gradient descent
• Gradient descent: In mathematics, gradient descent (also
often called steepest descent) is a first-order
iterative optimization algorithm for finding a local
minimum of a differentiable function. The goal of gradient
descent here is to minimize the loss function
of the neural network. To achieve this goal, it performs two steps
iteratively:
1. Compute the slope (gradient), that is, the first-order derivative, at the
current point.
2. Move from the current point in the direction opposite to the slope
by the computed amount.
Example: when a person walks down a mountain, at each step they decide
which way to move downward based on their current position.
delta = - learning_rate * gradient

theta += delta
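To make the two-line update rule concrete, here is a minimal sketch that applies it to a one-dimensional function. The objective f(theta) = (theta - 3)^2, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
def gradient(theta):
    # derivative of the illustrative objective f(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0          # starting point
learning_rate = 0.1

for step in range(100):
    delta = -learning_rate * gradient(theta)
    theta += delta

print(theta)  # approaches the minimum at theta = 3
```

Each step shrinks the distance to the minimum by a constant factor (0.8 here), so the iterate converges geometrically for this simple convex function.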
Batch gradient
• Batch gradient descent: all training data is taken into
consideration to take a single step. We take the
average of the gradients of all training examples
and then use that mean gradient to update
our parameters.
• It works well for convex or smooth error
surfaces, where it moves directly towards the optimum
solution.
Stochastic gradient descent
• Stochastic gradient descent:
• If our dataset is huge, it is difficult and time-consuming to
consider all training examples for every parameter update.
Hence, in stochastic gradient descent we consider one
example at a time to take a single step. We do the following
steps in one epoch:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Update the weights
5. Repeat steps 1 to 4 for all examples
Stochastic gradient descent
• Disadvantage of SGD: because we consider one
example at a time, the cost fluctuates over
the training examples and does not necessarily
decrease at every step. In the long run the cost decreases with
fluctuations, but with a fixed learning rate it keeps oscillating around the minimum instead of settling at it.
Mini Batch Gradient Descent
• Mini Batch Gradient Descent:
• Batch gradient descent suits smooth curves and SGD
suits huge datasets. Batch GD converges directly to the
minimum, and SGD converges faster for large datasets.
But SGD takes one example at a time, so it cannot exploit vectorized
computation over many examples. Hence a
combination of both methods is used, which is called
Mini Batch Gradient Descent.
• Here we use a batch of a fixed number of training
examples, which is smaller than the full dataset, and
call it a mini batch.
Mini Batch Gradient Descent
• Steps of Mini Batch Gradient Descent
1. Pick a mini batch
2. Feed it to Neural network
3. Calculate the mean gradient of mini batch
4. Use mean gradient to update weight.
5. Repeat steps 1 to 4 for all the mini batches we created.
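The three variants differ only in how many examples contribute to each update, so one routine with a `batch_size` parameter can sketch all of them. The example below fits a line y = w*x + b to noise-free synthetic data; the data, learning rate, and epoch count are assumptions made for illustration.

```python
import random

def minibatch_gradient_descent(xs, ys, batch_size, learning_rate=0.1,
                               epochs=1000, seed=0):
    """Fit y = w*x + b by minimizing the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]   # 1. pick a mini batch
            # 2.-3. forward pass and mean gradient over the mini batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # 4. update the weights with the mean gradient
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]   # synthetic data from y = 2x + 1

w_batch, b_batch = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # batch GD
w_mini, b_mini = minibatch_gradient_descent(xs, ys, batch_size=4)          # mini batch
w_sgd, b_sgd = minibatch_gradient_descent(xs, ys, batch_size=1)            # SGD
print(w_batch, b_batch, w_mini, b_mini, w_sgd, b_sgd)
```

Because this synthetic data contains no noise, all three settings reach the same solution; on real data the smaller batch sizes would show the fluctuating cost described above.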
Momentum based Gradient Descent

• Momentum based Gradient Descent

• Because mini batch gradient descent makes a parameter
update after seeing just a subset of examples, the direction
of the update has some variance, and so the path taken by
mini batch gradient descent will oscillate on its way towards
convergence. Using momentum we can reduce these oscillations.
• Momentum takes into account the past gradients to smooth
out the update. We store the direction of the previous gradients
in the variable v. Formally, this is the exponentially
weighted average of the gradients from previous steps.
Momentum based Gradient Descent

• Steps of Momentum based Gradient Descent
β typically lies between 0.8 and 0.999; 0.9 is a common default.
The velocity is initialized to zero, so the algorithm takes a few iterations to
build up velocity and start taking bigger steps.
delta = - learning_rate * gradient + previous_delta * decay_rate   (eq. 1)
theta += delta   (eq. 2)
Equivalently, in terms of an accumulated sum of gradients:
sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate   (eq. 3)
delta = -learning_rate * sum_of_gradient   (eq. 4)
theta += delta   (eq. 5)
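The two ways of writing the momentum update (eq. 1 and 2, and eq. 3 to 5) produce the same sequence of parameters. The sketch below runs both on an illustrative objective f(theta) = (theta - 3)^2; the objective, learning rate, and step count are assumptions.

```python
def gradient(theta):
    # derivative of the illustrative objective f(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def momentum_delta_form(learning_rate=0.1, decay_rate=0.9, steps=400):
    theta, delta = 0.0, 0.0          # velocity starts at zero
    for _ in range(steps):
        delta = -learning_rate * gradient(theta) + delta * decay_rate  # eq. 1
        theta += delta                                                   # eq. 2
    return theta

def momentum_sum_form(learning_rate=0.1, decay_rate=0.9, steps=400):
    theta, sum_of_gradient = 0.0, 0.0
    for _ in range(steps):
        sum_of_gradient = gradient(theta) + sum_of_gradient * decay_rate  # eq. 3
        delta = -learning_rate * sum_of_gradient                          # eq. 4
        theta += delta                                                    # eq. 5
    return theta

theta_a = momentum_delta_form()
theta_b = momentum_sum_form()
print(theta_a, theta_b)
```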

Adaptive Gradient algorithm (AdaGrad)
• The Adaptive Gradient algorithm, or AdaGrad for short, is
an extension to the gradient descent optimization
algorithm.
• It is designed to accelerate the optimization process.
• A problem with the gradient descent algorithm is that the
step size (learning rate) is the same for each variable or
dimension in the search space. It is possible that better
performance can be achieved using a step size that is
tailored to each variable, allowing larger movements in
dimensions with a consistently steep gradient and smaller
movements in dimensions with less steep gradients.
Adaptive Gradient algorithm
• AdaGrad is designed to specifically explore the idea of
automatically tailoring the step size for each dimension in the
search space.
• This is achieved by first calculating a step size for a given
dimension, then using the calculated step size to make a
movement in that dimension using the partial derivative. This
process is then repeated for each dimension in the search
space.
Adaptive Gradient algorithm
• An internal variable is maintained for each input variable; it holds
the sum of the squared partial derivatives for that input variable
observed during the search.
• This sum of the squared partial derivatives is then used to calculate
the step size for the variable: the initial step size value
(e.g. the hyperparameter value specified at the start of the run) is divided
by the square root of the sum of the squared partial derivatives.
• cust_step_size = step_size / sqrt(s)
• It is possible for the square root of the sum of squared partial
derivatives to be 0.0, resulting in a divide-by-zero error.
Therefore, a tiny value such as 1e-8 can be added to the denominator to avoid this
possibility.
• cust_step_size = step_size / (1e-8 + sqrt(s))
Adaptive Gradient algorithm
• Here cust_step_size is the calculated step size for an input variable
at a given point during the search,
• step_size is the initial step size, sqrt() is the square root
operation,
• and s is the sum of the squared partial derivatives for the input
variable seen during the search so far.
The custom step size is then used to calculate the value of
the variable at the next point or solution in the search.
x(t+1) = x(t) - cust_step_size * f'(x(t))
Adaptive Gradient algorithm
• This process is then repeated for each input variable
until a new point in the search space is created and can
be evaluated.
• Importantly, the partial derivative for the current
solution (iteration of the search) is included in the sum
of the squared partial derivatives.
• We could maintain an array of partial derivatives or
squared partial derivatives for each input variable, but
this is not necessary. Instead, we simply maintain the
sum of the squared partial derivatives and add new
values to this sum along the way.
sum_of_gradient_squared = previous_sum_of_gradient_squared + gradient²
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
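A minimal AdaGrad sketch following the update above is shown below. The two-variable objective f(x, y) = x^2 + 10*y^2 is an assumption, picked because its two dimensions have very different gradient scales; the learning rate and step count are likewise illustrative.

```python
import math

def gradient(theta):
    # partial derivatives of the illustrative objective f(x, y) = x^2 + 10*y^2
    x, y = theta
    return [2.0 * x, 20.0 * y]

def adagrad(start, learning_rate=0.5, steps=100, eps=1e-8):
    theta = list(start)
    sum_of_gradient_squared = [0.0] * len(theta)  # one accumulator per variable
    for _ in range(steps):
        grad = gradient(theta)
        for i, g in enumerate(grad):
            sum_of_gradient_squared[i] += g * g
            # per-variable step size: learning_rate / sqrt(accumulated squares)
            theta[i] -= learning_rate * g / (eps + math.sqrt(sum_of_gradient_squared[i]))
    return theta

theta = adagrad([1.0, 1.0])
print(theta)
```

Although the gradient in y is ten times larger than in x, both coordinates follow almost the same path, because each one is divided by its own accumulated gradient magnitude.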
RMSprop optimizer
• RMSprop optimizer:
• Root Mean Squared Propagation, or RMSProp for short, is an extension to the
gradient descent optimization algorithm.
• RMSProp is designed to accelerate the optimization process, e.g. decrease the
number of function evaluations required to reach the optima, or to improve
the capability of the optimization algorithm, e.g. result in a better final result.
• A problem with AdaGrad is that it can slow the search down too much,
resulting in very small learning rates for each parameter or dimension of the
search by the end of the run. This has the effect of stopping the search too
soon, before the minimum can be located.
• RMSProp addresses this by using a decaying moving average of the squared partial
derivatives instead of their sum. This is achieved by adding a new hyperparameter we will call rho that acts like
momentum for the partial derivatives.
• Using a decaying moving average of the partial derivatives allows the search to
forget early partial derivative values and focus on the most recently seen
shape of the search space.
RMSprop optimizer
• The calculation of the mean squared partial derivative
for one parameter is as follows:
• s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))
• Where s(t+1) is the decaying moving average of the
squared partial derivative for one parameter for the
current iteration of the algorithm, s(t) is the decaying
moving average squared partial derivative for the
previous iteration, f'(x(t))^2 is the squared partial
derivative for the current parameter, and rho is a
hyperparameter, typically with the value of 0.9 like
momentum.
RMSprop optimizer
• We are using a decaying average of the squared partial
derivatives, and calculating the square root of this average gives
the technique its name, i.e. the square root of the mean squared
partial derivatives, or root mean square (RMS). For example, the
custom step size for a parameter may be written as:
• cust_step_size(t+1) = step_size / (1e-8 + RMS(s(t+1))), where RMS(s(t+1)) = sqrt(s(t+1))
• Once we have the custom step size for the parameter, we can
update the parameter using the custom step size and the partial
derivative f'(x(t)).
• x(t+1) = x(t) - cust_step_size(t+1) * f'(x(t))
• This process is then repeated for each input variable until a new
point in the search space is created and can be evaluated.
sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
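The RMSProp update differs from AdaGrad only in how the squared gradients are accumulated. The sketch below applies it to the illustrative objective f(x, y) = x^2 + 10*y^2; the objective, learning rate, and step count are assumptions, and the small `eps` term follows the 1e-8 safeguard described earlier.

```python
import math

def gradient(theta):
    # partial derivatives of the illustrative objective f(x, y) = x^2 + 10*y^2
    x, y = theta
    return [2.0 * x, 20.0 * y]

def rmsprop(start, learning_rate=0.01, decay_rate=0.9, steps=500, eps=1e-8):
    theta = list(start)
    s = [0.0] * len(theta)  # decaying average of squared partial derivatives
    for _ in range(steps):
        grad = gradient(theta)
        for i, g in enumerate(grad):
            s[i] = s[i] * decay_rate + g * g * (1.0 - decay_rate)
            theta[i] -= learning_rate * g / (eps + math.sqrt(s[i]))
    return theta

theta = rmsprop([1.0, 1.0])
print(theta)
```

Because old gradients are forgotten, the step size does not shrink towards zero as it does in AdaGrad; with a fixed learning rate the iterate ends up close to the minimum rather than exactly on it.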
Adam
• Adam: It is one of the most effective optimization algorithms for
training neural networks. It combines ideas from RMSProp
and Momentum.
1. It calculates an exponentially weighted average of past
gradients and stores it in the variables v (before bias
correction) and v_corrected (with bias correction).
2. It calculates an exponentially weighted average of the squares
of the past gradients and stores it in the variables s (before bias
correction) and s_corrected (with bias correction).
3. It updates the parameters in a direction based on combining
the information from steps 1 and 2.
Adam

Where
• t counts the number of Adam steps taken,
• L is the number of layers,
• β1 and β2 are hyperparameters that control
the two exponentially weighted averages.
Adam
• AdaGrad uses the second moment with no decay to deal with
sparse features. RMSProp uses the second moment with a
decay rate to speed up from AdaGrad. Adam uses both first
and second moments, and is generally a good default choice.
• Adam is a replacement optimization algorithm for stochastic
gradient descent for training deep learning models.
• Adam combines the best properties of the AdaGrad and
RMSProp algorithms to provide an optimization algorithm
that can handle sparse gradients on noisy problems.
• Adam is relatively easy to configure where the default
configuration parameters do well on most problems.
sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1)   [Momentum]
sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2)   [RMSProp]
delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
theta += delta
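The simplified update above omits the bias correction mentioned earlier (v_corrected and s_corrected). A sketch that includes it, assuming the commonly used form of dividing by 1 - β^t, could look like this; the objective, learning rate, and step count are illustrative assumptions, while β1 = 0.9 and β2 = 0.999 are widely used defaults.

```python
import math

def adam(grad_fn, theta, learning_rate=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=2000):
    v = 0.0  # exponentially weighted average of gradients (momentum part)
    s = 0.0  # exponentially weighted average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        v = beta1 * v + (1.0 - beta1) * g
        s = beta2 * s + (1.0 - beta2) * g * g
        # bias correction compensates for v and s starting at zero
        v_corrected = v / (1.0 - beta1 ** t)
        s_corrected = s / (1.0 - beta2 ** t)
        theta -= learning_rate * v_corrected / (math.sqrt(s_corrected) + eps)
    return theta

# illustrative objective f(theta) = (theta - 3)^2 with derivative 2*(theta - 3)
theta = adam(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta)
```

Without the correction, the first few updates would be scaled by averages that are still close to their zero initial values.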
Nesterov Accelerated Gradient (NAG)

• Nesterov Accelerated Gradient (NAG)

• Gradient descent is an optimization algorithm that follows the negative
gradient of an objective function in order to locate the minimum of the
function.
• A limitation of gradient descent is that it can get stuck in flat areas or
bounce around if the objective function returns noisy gradients.
Momentum is an approach that accelerates the progress of the search
to skim across flat areas and smooth out bouncy gradients.
• In some cases, the acceleration of momentum can cause the search to
miss or overshoot the minima at the bottom of basins or
valleys. Nesterov momentum is an extension of momentum that
involves calculating the decaying moving average of the gradients of
projected positions in the search space rather than the actual positions
themselves.
Nesterov Accelerated Gradient (NAG)
Nesterov Accelerated Gradient (NAG)

• In momentum based gradient descent, the steps become larger and larger
due to the accumulated momentum, and then we overshoot at
the 4th step. We then have to take steps in the opposite direction
to reach the minimum point.
• However, the update in NAG happens in two steps: first, a partial
step to reach the look-ahead point, and then the final update. If the
gradient at the look-ahead point is negative, our final update will
be smaller than that of regular momentum based gradient descent.
• As per the diagram, in step 4a the gradient is negative, hence the overall
update will be smaller than that of momentum based gradient
descent. Here momentum based gradient descent takes six steps to reach the
minimum and NAG takes five steps.
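The difference between the two methods can be sketched on a simple one-dimensional objective. In the code below, f(theta) = (theta - 3)^2, the learning rate, and the decay rate are assumptions; the step counts therefore differ from the diagram, but the smaller overshoot of NAG is visible in the same way.

```python
def gradient(theta):
    # derivative of the illustrative objective f(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def momentum_path(learning_rate=0.1, decay_rate=0.9, steps=400):
    theta, delta, path = 0.0, 0.0, []
    for _ in range(steps):
        delta = decay_rate * delta - learning_rate * gradient(theta)
        theta += delta
        path.append(theta)
    return path

def nag_path(learning_rate=0.1, decay_rate=0.9, steps=400):
    theta, delta, path = 0.0, 0.0, []
    for _ in range(steps):
        look_ahead = theta + decay_rate * delta   # partial step
        # final update uses the gradient at the look-ahead point
        delta = decay_rate * delta - learning_rate * gradient(look_ahead)
        theta += delta
        path.append(theta)
    return path

m_path = momentum_path()
n_path = nag_path()
print("largest value reached, momentum:", max(m_path))
print("largest value reached, NAG:     ", max(n_path))
```

Both runs end at the minimum, but the momentum run travels noticeably further past it before turning back.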
Regularization
• The error on the training data is usually lower than on the test
dataset. How much worse the algorithm does
on the test set than on the training set is known as
the algorithm's variance.
Generalization
• Generalization is the ability of an ML
model to provide a suitable output by
adapting to a given set of previously unseen inputs. It
means that, after being trained on the dataset,
the model can produce reliable and accurate output on new data.
Hence, underfitting and overfitting are the
two conditions that need to be checked to assess the
performance of the model and whether the
model is generalizing well or not.
Overfitting/ Underfitting
Overfitting/ Underfitting
• Underfitting occurs when our machine learning
model is not able to capture the underlying trend of
the data.
• It destroys the accuracy of the ML model.
• It usually happens when we have too little data to train on,
or when we fit a linear model to non-linear data.
• To avoid underfitting, use more
data and more informative features (for example through feature
engineering).
Technique to reduce Under fitting
• Increase model complexity
• Performing feature engineering
• Remove noise from data
• Increase number of epoch and duration of
training.
• Overfitting occurs when our machine learning
model tries to cover all the data points, or
more than the required data points, present in
the given dataset. Because of this, the model
starts capturing noise and inaccurate values
present in the dataset, and all these factors
reduce the efficiency and accuracy of the
model. The overfitted model has low
bias and high variance.
• It typically happens when we train a complex model for too long
on limited or noisy data.
• Non-parametric and non-linear methods are common
causes of overfitting, because
these types of machine learning algorithms have
more freedom in building the model based on
the dataset.
Technique to reduce overfitting
• Increase the training data
• Reduce model complexity
• Early stopping during the training phase (as
soon as the validation loss begins to increase during training,
stop training)
• Use dropout in neural networks to tackle
overfitting
• What is bias?
• Bias is the difference between the average prediction of our model
and the correct value which we are trying to predict. A model with
high bias pays very little attention to the training data and
oversimplifies the problem. It leads to high error on both training
and test data.
• What is variance?
• Variance is the variability of the model prediction for a given data
point, a value which tells us the spread of our data. A model with high
variance pays a lot of attention to the training data and does not
generalize to data which it hasn't seen before. As a result,
such models perform very well on training data but have high error
rates on test data.
• Why is there a bias-variance tradeoff?
• If our model is too simple and has very few
parameters, then it may have high bias and low
variance. On the other hand, if our model has a large
number of parameters, then it is going to have high
variance and low bias. So we need to find the
right balance, without overfitting or
underfitting the data.
Regularization Methods
• Regularization aims to reduce overfitting while
keeping the training error as low as possible.
• Regularization is the most used technique to
penalize complex models in machine learning.
It is deployed to reduce overfitting (that is,
to reduce the generalization error) by keeping
network weights small. It also improves the
performance of models on new inputs.
Regularization Methods
• Both L1 and L2 add a penalty to the cost
that depends on the model complexity. So
instead of computing the cost using the loss
function alone, there is an auxiliary component,
known as the regularization term, added in order
to penalize complex models.
Regularization Methods
• Regularization works by biasing the coefficients towards specific values, such as
small values near zero. It achieves this biasing by adding
a penalty weighted by a tuning parameter. For example:
• L1 regularization: It adds an L1 penalty that is equal to the
sum of the absolute values of the coefficients, which
restricts the size of the coefficients. For example, Lasso regression
implements this method.
• L2 regularization: It adds an L2 penalty that is equal to the
sum of the squares of the coefficients. For example, Ridge
regression and SVM implement this method.
• Elastic Net: When L1 and L2 regularization are combined, the result
is the elastic net method, which adds a hyperparameter to balance the two penalties.
Regularization Methods
• A basic regression model can be represented as
follows:
Y = W0 + W1*X1 + W2*X2 + ... + Wm*Xm
• where X is the feature matrix, and Wj are the
weight coefficients or regression coefficients. In basic
linear regression, the regression coefficients are
obtained by minimizing the loss function given
below, where ŷi is the model's prediction for sample i:
Loss = Σi (yi - ŷi)^2
L2 regularization
In L2 regularization, the regression coefficients
are obtained by minimizing the L2 loss function,
given as:
L2 loss = Σi (yi - ŷi)^2 + α * Σj Wj^2
L1 regularization
• In L1 regularization, the regression coefficients
are obtained by minimizing the L1 loss
function, given as:
L1 loss = Σi (yi - ŷi)^2 + α * Σj |Wj|
L1 and L2 regularization

• In both L1 and L2 regularization, when the regularization
parameter (α ∈ [0, 1]) is increased, the L1
norm or L2 norm of the coefficients decreases. With L1, this forces some of the
regression coefficients exactly to zero; with L2, the coefficients shrink towards zero but generally stay non-zero.
• Hence, L1 regularization in particular is used for
feature selection and dimensionality reduction.
• One advantage of L2 regularization over L1 regularization is
that the L2 loss function is easily differentiable.
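The different behaviour of the two penalties can be seen from their values and their gradients. The sketch below assumes the penalties α·Σ|Wj| and α·ΣWj²; the weight values and α = 0.1 are made up for illustration.

```python
def l1_penalty(weights, alpha):
    return alpha * sum(abs(w) for w in weights)

def l2_penalty(weights, alpha):
    return alpha * sum(w * w for w in weights)

def l1_gradient(weights, alpha):
    # subgradient alpha * sign(w); taken as 0 at w == 0
    return [alpha * ((w > 0) - (w < 0)) for w in weights]

def l2_gradient(weights, alpha):
    return [2.0 * alpha * w for w in weights]

weights = [0.5, -2.0, 0.01]  # hypothetical regression coefficients
alpha = 0.1

print("L1 penalty:", l1_penalty(weights, alpha))
print("L2 penalty:", l2_penalty(weights, alpha))
print("L1 gradient:", l1_gradient(weights, alpha))
print("L2 gradient:", l2_gradient(weights, alpha))
```

The L1 gradient has the same magnitude for the tiny coefficient 0.01 as for the large ones, so small coefficients keep being pushed until they reach zero. The L2 gradient is proportional to the coefficient, so small coefficients are barely pushed and tend to stay non-zero.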
S.No | L1 Regularization | L2 Regularization
1 | Penalizes the sum of absolute values of weights. | Penalizes the sum of squared weights.
2 | It has a sparse solution. | It has a non-sparse solution.
3 | It gives multiple solutions. | It has only one solution.
4 | Built-in feature selection. | No feature selection.
5 | Robust to outliers. | Not robust to outliers.
6 | It generates simple and interpretable models. | It gives more accurate predictions when the output variable is a function of all input variables.
7 | Unable to learn complex data patterns. | Able to learn complex data patterns.
8 | Computationally inefficient over non-sparse conditions. | Computationally efficient because of having analytical solutions.
Dropout
• The term “dropout” refers to dropping out the
nodes (input and hidden layer) in a neural
network. All the forward and backwards
connections with a dropped node are
temporarily removed, thus creating a new
network architecture out of the parent
network. The nodes are dropped by a dropout
probability of p.
Dropout
• During test time, all units are present, but their outputs are
scaled down by the keep probability (1 - p, where p is the dropout probability). This is needed because during training with dropout
the next layers received lower total input.
• By using dropout, the same layer will alter its connectivity and
will search for alternative paths to convey the information to
the next layer. As a result, each update to a layer during
training is performed with a different view of the configured
layer.
• Dropout has the effect of making the training process noisy.
• Dropout breaks up situations where network layers co-adapt to
correct mistakes from prior layers, making the model more robust.
• It increases the sparsity of the network.
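A small sketch can show why test-time scaling is needed. It assumes p is the probability of dropping a unit, so the kept fraction is 1 - p; the layer of 10,000 activations equal to 1.0 is a made-up input used only to compare averages.

```python
import random

def dropout_train(activations, p, rng):
    """Training: each unit is dropped (set to 0) with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    """Test: all units are present, scaled by the keep probability 1 - p."""
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
p = 0.5
activations = [1.0] * 10000  # hypothetical layer output

train_out = dropout_train(activations, p, rng)
test_out = dropout_test(activations, p)

train_mean = sum(train_out) / len(train_out)
test_mean = sum(test_out) / len(test_out)
print(train_mean, test_mean)
```

The average signal passed to the next layer is about the same in both modes, so the following layers see inputs on the scale they were trained with.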
Dropout
Early stopping
• Early stopping is the process of stopping the training when the
validation error starts to rise, even if the training error is
still decreasing.
• This implies that we store the trainable parameters
periodically and track the validation error. After the
training has stopped, we return the trainable parameters to
the exact point where the validation error started to
rise, instead of the last ones.
• Early stopping is a very efficient hyperparameter
selection algorithm, which sets the number of epochs to the
value with the best validation error.
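The bookkeeping behind early stopping can be sketched without a real network. In the example below the validation errors are a made-up sequence standing in for per-epoch measurements, and `patience` (how many non-improving epochs to tolerate) is an added parameter that the slides do not mention.

```python
def early_stopping(validation_errors, patience=2):
    """Return (best_epoch, best_error, stopped_at) for a sequence of errors."""
    best_error = float("inf")
    best_epoch = None
    stopped_at = None
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # a real training loop would also save the model parameters here
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            stopped_at = epoch  # validation error has kept rising: stop
            break
    return best_epoch, best_error, stopped_at

# hypothetical validation error per epoch: falls, then starts to rise
errors = [0.90, 0.70, 0.55, 0.48, 0.45, 0.47, 0.52, 0.60]
best_epoch, best_error, stopped_at = early_stopping(errors)
print(best_epoch, best_error, stopped_at)
```

Training stops two epochs after the best one, and the parameters saved at the best epoch are the ones that would be restored.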
Parameter sharing
• In parameter sharing, instead of penalizing model
parameters, we force a group of parameters to be
equal. This can be seen as a way to apply our
prior domain knowledge to the training
process.
• A CNN takes advantage of the spatial structure of
images by sharing parameters across different
blocks of the input image: the weights are shared among
the blocks instead of each block having separate ones.
Batch normalization
• Batch normalization (BN) can be used as a form of regularization. BN fixes the
mean and variance of a layer's inputs by bringing the features into
the same range. We concentrate the features in a compact,
Gaussian-like space.
• It has a regularizing effect and is often preferred over
dropout.
• It is a process to make neural networks faster and more
stable by adding extra layers to a deep neural network.
• The new layer performs the standardizing and normalizing
operations on the input of a layer coming from a previous
layer.
Batch normalization
• A typical neural network is trained using a collected set of
input data called a batch. Similarly, the normalizing process
in batch normalization takes place in batches, not on a
single input.
• Let’s understand this through an example: we have a
deep neural network as shown in the following image.
• Initially, our inputs X1, X2, X3, X4 are in normalized form
as they are coming from the pre-processing stage. When
the input passes through the first layer, it is transformed:
a sigmoid function is applied over the dot product of the input X
and the weight matrix W.
Batch normalization

Although our input X was normalized, with time the output will no longer be on
the same scale. As the data goes through multiple layers of the neural network
and L activation functions are applied, it leads to an internal covariate shift in
the data.
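The normalizing step itself is short. The sketch below standardizes one feature across a batch and then applies a scale and shift; the parameter names `gamma` and `beta`, the small `eps` constant, and the sample batch are assumptions based on the commonly used formulation rather than on these slides.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch to zero mean and unit variance."""
    n = len(batch)
    mean = sum(batch) / n
    variance = sum((x - mean) ** 2 for x in batch) / n
    # gamma and beta are learnable scale and shift parameters
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one unit
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
print(normalized, out_mean, out_var)
```

Whatever scale the incoming activations had, the next layer receives values with mean close to 0 and variance close to 1, which counteracts the shift described above.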
Data Augmentation
• Data augmentation is the process of
generating new training examples for our
dataset. More training data means lower
model variance, and hence lower generalization
error. It is a form of noise injection into the
training dataset.
Data augmentation
• Types of Data augmentation
• Basic Data Manipulations
• Feature space augmentation
• GAN based Augmentation
• Meta learning
Data augmentation
• Types of Data augmentation:
• Basic Data Manipulations: geometric transformations
of the data. Examples: image flipping, cropping,
rotation, translation, color modification, image mixing.
• Feature space augmentation: instead of transforming
data in the input space as above, we can apply
transformations in the feature space. For example, an auto
encoder might be used to extract the latent
representation, and transforming it results in a transformation of the
original data point.
• GAN based Augmentation: Generative adversarial
networks have been shown to work extremely well for
data augmentation, so they are a natural choice for
it.
• Generative modeling is an unsupervised learning task
in machine learning that involves automatically
discovering and learning the regularities or patterns in
input data in such a way that the model can be used
to generate or output new examples that plausibly
could have been drawn from the original dataset.
Image transformation using GAN
The below picture represents how the place would have looked in winter season.
• Meta learning: we use neural networks to optimize other
neural networks by tuning their hyperparameters, improving
their layout, and more.
• In simple terms, we use a classification network to tune an
augmentation network into generating better images.
• Example: we feed random images to a GAN, which
generates augmented images. Both the augmented images and
the originals are passed into a second network, which compares
them and tells us how good the augmented images are. After
repeating the process, the augmentation network becomes
better and better at producing new images.
• Meta-learning algorithms learn from the
output of other machine learning algorithms
that learn from data. This means that meta-
learning requires the presence of other
learning algorithms that have already been
trained on data.
Meta learning
Weight Decay
• Weight decay is a regularization technique in
deep learning.
• Weight decay works by adding a penalty term
to the cost function of a neural network, which
has the effect of shrinking the weights during
back propagation.
• This helps prevent the network from overfitting the
training data and also helps against the exploding gradient
problem.
Weight Decay
• Comparing weights and biases, the
weights directly influence the relationship
between the input and the output learned by
the neural network because they are
multiplied by the inputs.
• Mathematically, the bias only offsets the
relationship from the intercept. Therefore we
usually only regularize the weights.
Weight Decay with the L2 norm
• The L2 penalty is the most commonly used
regularization term for neural networks. You
apply L2 regularization by adding the sum of the
squared weights, multiplied by a hyperparameter lambda that
you pick manually, to the error term E.
• The full cost function is then Cost = L + lambda * Σ w^2, where the function L represents
a loss function such as cross-entropy or mean
squared error.
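The effect of the L2 penalty on a gradient step can be sketched as follows. The code assumes the cost is the loss plus lambda times the sum of squared weights (some texts include an extra factor of 1/2); the weights, loss value, and hyperparameters are made up for illustration.

```python
def l2_cost(loss, weights, lam):
    """Total cost = loss + lam * sum of squared weights."""
    return loss + lam * sum(w * w for w in weights)

def sgd_step_with_weight_decay(weights, loss_gradients, learning_rate, lam):
    # the gradient of lam * w^2 with respect to w is 2 * lam * w
    return [w - learning_rate * (g + 2.0 * lam * w)
            for w, g in zip(weights, loss_gradients)]

weights = [1.0, -2.0]   # hypothetical weights
lam = 0.5
cost = l2_cost(loss=0.3, weights=weights, lam=lam)

# With zero loss gradient, the penalty alone shrinks every weight
# by the factor (1 - 2 * learning_rate * lam) = 0.9.
decayed = sgd_step_with_weight_decay(weights, [0.0, 0.0], learning_rate=0.1, lam=lam)
print(cost, decayed)
```

This multiplicative shrinking on every update is where the name weight decay comes from.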
Adding noise to the input and output
• Adding noise means that the network is less able to
memorize training samples because they are
changing all of the time, resulting in smaller
network weights and a more robust network that
has lower generalization error.
• If the random noise corresponds to the coefficients
being 0, then it will pull our final estimates
towards being smaller; our actual data, saying the
coefficients are larger, will have to compete with the
random noise saying the coefficients are small.
Adding noise to the input and output
• Example: in speech recognition applications,
the collected speech dataset can be mixed
with noise to produce multiple dataset
combinations, which helps to train the neural
network with more data and avoid
overfitting.
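Mixing noise into the inputs can be sketched as follows. This is only an illustration under stated assumptions: the noise is zero-mean Gaussian, `add_gaussian_noise` and `noise_std` are hypothetical names, and a short sine wave stands in for one speech feature vector.

```python
import numpy as np

def add_gaussian_noise(batch, noise_std=0.1, rng=None):
    """Return a copy of `batch` with zero-mean Gaussian noise added.

    Assumption: additive Gaussian noise; real speech pipelines often
    mix in recorded background noise instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    return batch + rng.normal(0.0, noise_std, size=batch.shape)

rng = np.random.default_rng(0)

# Stand-in for one clean training sample with 8 features
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 8)).reshape(1, -1)

# Build several noisy variants of the same sample; each draw is different,
# so the network never sees exactly the same input twice
noisy_copies = np.concatenate(
    [add_gaussian_noise(clean, 0.1, rng) for _ in range(4)]
)

# Augmented set: the original sample plus its noisy variants
augmented = np.concatenate([clean, noisy_copies])
print(augmented.shape)
```

In practice the noise is often redrawn at every training epoch rather than stored once, so the samples keep changing during training.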
Thank you

