UNIT – 2
Training Neural Networks
Dr. D. SUDHEER
Assistant Professor
Computer Science and Engineering
VNRVJIET
© Dr. Devulapalli Sudheer
Risk Minimization
Loss Function
• The loss function can be defined in different forms for different
purposes:
• The empirical risk Remp(α) is defined as the measured mean error of the learned function over a given training set:
      Remp(α) = (1/n) Σi L(yi, f(xi, α))
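As an illustration, the empirical risk is simply the mean of a per-sample loss over the training set. A minimal sketch (the squared-error loss and the sample values here are assumptions for illustration):

```python
import numpy as np

def empirical_risk(y_true, y_pred, loss=lambda y, f: (y - f) ** 2):
    """Empirical risk: mean of a per-sample loss over the training set."""
    return np.mean([loss(y, f) for y, f in zip(y_true, y_pred)])

# Illustrative targets and predictions (assumed, not from the slides).
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.4])
print(empirical_risk(y_true, y_pred))  # mean squared error on this set
```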
Vapnik–Chervonenkis Dimension (VC Dimension)
• The VC dimension is a combinatorial characterization of the diversity of functions that can be computed by a given neural architecture; it is closely related to the growth function (random entropy) of the function class.
• A subset S of the domain X is shattered by a class of functions (e.g., a neural network N) if every function f : S → {0, 1} can be computed on N.
• For a set of n points there are 2^n possible ways to label it with {0, 1}.
• The VC dimension of a hypothesis space N defined over an instance space X is the size of the largest finite subset of X shattered by N.
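The definition can be checked by brute force: enumerate the labelings a hypothesis set realizes on a point set and compare the count against 2^n. The sketch below uses threshold classifiers on the real line as an assumed toy class (not from the slides); they shatter any single point but no pair of points, so their VC dimension is 1:

```python
import numpy as np

def shatters(points, hypotheses):
    """True iff the hypothesis set realizes all 2^n binary labelings of `points`."""
    n = len(points)
    realized = {tuple(int(h(x)) for x in points) for h in hypotheses}
    return len(realized) == 2 ** n

# Assumed toy class: threshold classifiers h_t(x) = 1 if x >= t else 0.
thresholds = np.linspace(-2, 2, 401)
hypotheses = [lambda x, t=t: x >= t for t in thresholds]

print(shatters([0.0], hypotheses))       # a single point is shattered
print(shatters([0.0, 1.0], hypotheses))  # the labeling (1, 0) is unrealizable
```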
• The principle of structural risk minimization (SRM) minimizes the risk functional
with respect to both the empirical risk and the VC dimension of the set of functions.
• SRM is a principle for reducing model risk via regularization.
• The SRM principle is crucial to obtain good generalization performances for a variety
of learning machines, including SVMs.
• It finds the function that achieves the minimum of the guaranteed risk for a fixed amount of data.
Loss Functions
• When the training data are corrupted by large noise, such as outliers, conventional learning algorithms may not yield acceptable performance, since even a small number of outliers can have a large impact on the MSE.
• An outlier is an observation that deviates significantly from the other observations.
• It may be due to erroneous measurements or noisy data.
• When the noise becomes large or outliers exist, the network may try to fit those improper data, and the learned system is corrupted.
• A robust loss function can be used to reduce the effect of such outliers during learning.
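One common robust choice is the Huber loss, which is quadratic for small residuals but only linear for large ones, so an outlier contributes far less than it does under squared error. (The Huber loss and the sample residuals below are illustrative assumptions, not from the slides.)

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear for |r| > delta, so outliers weigh less."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

residuals = np.array([0.1, -0.2, 0.15, 8.0])  # last entry is an outlier
mse_terms = 0.5 * residuals ** 2

print(mse_terms[-1])          # the outlier's squared-error term dominates
print(huber(residuals)[-1])   # under Huber it grows only linearly
```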
Model Selection
Noise variance
Regularization
• Regularization augments the error function with a complexity penalty:
      E_total = E + λ Ec
where E is the error function, Ec is the penalty for the complexity of the structure, and λ is the regularization parameter.
• A common choice of penalty is weight decay, Ec = (1/2) Σi wi^2, whose gradient shrinks every weight at each update.
• The amplitudes of the weights decrease continuously towards zero,
unless they are reinforced by the BP rule.
• At the end of training, only the essential weights deviate
significantly from zero.
• This effectively increases generalization and reduces the danger of
overtraining as well.
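The decay of non-essential weights can be seen directly in the update rule: the penalty term alone multiplies each weight by (1 − η·λ) per step, a geometric decay toward zero unless the data gradient pushes back. A minimal sketch (the learning rate, λ, and the zero data-gradient are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # illustrative initial weights
lam, lr = 0.1, 0.5       # assumed regularization parameter and learning rate

# With the data gradient dE/dw set to zero, the penalty alone drives the update:
# w <- w - lr * lam * w = (1 - lr*lam) * w, i.e. geometric decay toward zero.
for _ in range(100):
    data_grad = np.zeros_like(w)        # stands in for dE/dw from backprop
    w -= lr * (data_grad + lam * w)

print(np.max(np.abs(w)))  # all weights have decayed close to zero
```

Weights that backprop keeps reinforcing would resist this decay; the rest fade out, which is exactly the generalization effect described above.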
Optimization
Strategy #1: Random Search
Strategy #2: Random Local Search
Strategy #3: Following the gradient
Strategy 1: Since it is so simple to check how good a given set of parameters W is, the
first (very bad) idea that may come to mind is to simply try out many different random
weights and keep track of what works best.
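Strategy 1 can be sketched as follows (the linear scorer and the synthetic regression data are assumptions for illustration):

```python
import numpy as np

def loss(W, X, y):
    """Toy loss: mean squared error of a linear scorer (illustrative only)."""
    return np.mean((X @ W - y) ** 2)

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])      # assumed ground-truth weights

best_loss, best_W = float("inf"), None
for _ in range(1000):
    W = rng.normal(size=3)              # guess a weight vector completely at random
    l = loss(W, X, y)
    if l < best_loss:
        best_loss, best_W = l, W        # keep track of what works best

print(best_loss)
```

Even after 1000 guesses the best random W is usually far from optimal, which is why this strategy is described as very bad.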
Strategy 2: Try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we start out with a random W, generate random perturbations δW to it, and if the loss at the perturbed W + δW is lower, we perform an update.
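Strategy 2 in the same toy setting (again with an assumed linear scorer and synthetic data):

```python
import numpy as np

def loss(W, X, y):
    """Toy loss: mean squared error of a linear scorer (illustrative only)."""
    return np.mean((X @ W - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])      # assumed ground-truth weights

W = rng.normal(size=3)                  # start from a random W
step = 0.1                              # assumed perturbation scale
for _ in range(2000):
    W_try = W + step * rng.normal(size=3)  # random perturbation delta_W
    if loss(W_try, X, y) < loss(W, X, y):  # step only if it leads downhill
        W = W_try

print(loss(W, X, y))
```

Because only downhill moves are accepted, the loss decreases monotonically, making this far better than pure random search.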
Strategy 3: Following the gradient.
We can compute the best direction along which to change the weight vector, one that is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes to zero).
Gradient Descent
This direction will be related to the gradient of the loss function.
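A sketch of gradient descent using a centered finite-difference approximation of the gradient (the toy linear least-squares problem and hyperparameters are assumptions for illustration):

```python
import numpy as np

def loss(W, X, y):
    """Toy loss: mean squared error of a linear scorer (illustrative only)."""
    return np.mean((X @ W - y) ** 2)

def numerical_gradient(f, W, h=1e-5):
    """Centered finite differences: approximates the gradient of f at W."""
    grad = np.zeros_like(W)
    for i in range(W.size):
        e = np.zeros_like(W)
        e[i] = h
        grad[i] = (f(W + e) - f(W - e)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])      # assumed ground-truth weights

W = rng.normal(size=3)
lr = 0.1                                 # assumed step size
for _ in range(200):
    W -= lr * numerical_gradient(lambda w: loss(w, X, y), W)

print(loss(W, X, y))
```

Stepping against the gradient drives the loss to essentially zero here, far faster than either random strategy.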
Source: "Gradient descent algorithm explained with linear regression example" by Dhanoop Karunakaran, Intro to Artificial Intelligence, Medium.
SGD Momentum
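SGD with momentum keeps a velocity vector that accumulates past gradients, damping oscillations and speeding progress along consistently downhill directions. A minimal sketch of the classic momentum update (the toy problem and hyperparameters are assumptions for illustration):

```python
import numpy as np

def loss_grad(W, X, y):
    """Analytic gradient of the mean squared error of a linear scorer."""
    return 2 * X.T @ (X @ W - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # assumed ground-truth weights

W = rng.normal(size=3)
v = np.zeros_like(W)                 # velocity accumulates past gradients
lr, mu = 0.05, 0.9                   # assumed learning rate and momentum coefficient

for _ in range(200):
    v = mu * v - lr * loss_grad(W, X, y)  # classic momentum update
    W += v                                # move along the velocity, not the raw gradient

print(np.mean((X @ W - y) ** 2))
```

Setting mu = 0 recovers plain SGD; values near 0.9 are a common default because they smooth the update direction across steps.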