
Neural Networks and Deep Learning

UNIT – 2
Training Neural Network

Dr. D. SUDHEER
Assistant Professor
Computer Science and Engineering
VNRVJIET
Risk Minimization

• Risk minimization is one of the most widely used supervised learning frameworks.
• Because the training set is finite, learning theory cannot provide absolute guarantees on the performance of learning algorithms.
• Assume a set of N samples, {(xi, yi)}, drawn independently and identically distributed from some unknown probability distribution p(x, y).
• Assume a model defined by a set of possible mappings x → f(x, α), where α is an adjustable parameter; once α is fixed by training, the model is called a trained model.
• The expected risk, i.e., the expectation of the generalization error for a trained machine, is given by

R(α) = ∫ L(y, f(x, α)) dp(x, y)

where L(·, ·) is the loss function.

Loss Function
• The loss function L(y, f(x, α)) can be defined in different forms for different purposes.
• The empirical risk Remp(α) is defined as the measured mean error on a given training set:

Remp(α) = (1/N) Σi L(yi, f(xi, α))
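As an illustration, here is a minimal NumPy sketch that computes the empirical risk, assuming a squared-error loss and a hypothetical linear model f(x, α) = αᵀx (neither is specified in the slides):

import numpy as np

def empirical_risk(X, y, alpha):
    """Mean squared-error loss of a linear model f(x, alpha) = X @ alpha
    over the N training samples, i.e. the empirical risk R_emp(alpha)."""
    predictions = X @ alpha            # f(x_i, alpha) for every sample
    losses = (y - predictions) ** 2    # squared-error loss per sample
    return losses.mean()               # average over the N samples

# Toy usage: 5 samples, 3 features, random parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
alpha = rng.normal(size=3)
print(empirical_risk(X, y, alpha))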
Vapnik-Chervonenkis dimension (VC- Dimension)
• The VC dimension helps in understanding the random (VC) entropy of a function class.
• The VC dimension is a combinatorial characterization of the diversity of functions that can be computed by a given neural architecture.
• A subset S of the domain X is shattered by a class of functions (or a neural network) N if every function f : S → {0, 1} can be computed on N.
• For a set of n points there are 2^n possible binary labelings of the set.
• The VC dimension of a hypothesis space N defined over an instance space X is the size of the largest finite subset of X shattered by N.
• For example, linear classifiers in the plane can shatter three points in general position but no set of four, so their VC dimension is 3.

• The principle of structural risk minimization (SRM) minimizes the risk functional with respect to both the empirical risk and the VC dimension of the set of functions.
• SRM is a principle for reducing the risk of a model by means of regularization.
• The SRM principle is crucial for obtaining good generalization performance in a variety of learning machines, including SVMs.
• It finds the function that achieves the minimum of the guaranteed risk for a fixed amount of data.
Loss Functions

• When the training data are corrupted by large noise, such as outliers, conventional learning algorithms may not yield acceptable performance, since even a small number of outliers can have a large impact on the MSE.
• An outlier is an observation that deviates significantly from the other observations.
• It may be due to erroneous measurements or noisy data.
• When the noise becomes large or outliers exist, the network may try to fit those improper data, and the learned system is corrupted.
• A robust loss function can be used to reduce the effect of such outliers on learning, as in the sketch below.
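The slides do not name a specific robust loss; one common choice, shown here as an illustrative sketch, is the Huber loss, which is quadratic for small residuals and linear for large ones, so outliers contribute far less than under MSE:

import numpy as np

def huber_loss(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it,
    so large residuals (outliers) are penalized only linearly."""
    r = np.abs(residuals)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# An outlier residual of 10 contributes 50.0 under squared error,
# but only 9.5 under the Huber loss with delta = 1.
print(0.5 * 10.0 ** 2)                   # 50.0
print(huber_loss(np.array([10.0]))[0])   # 9.5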
Model Selection

• If two models of different complexity fit the data approximately equally well, the simpler one is usually the better predictive model.
• Among models approximating the noisy data, the ones with minimal complexity should be chosen.
• The objective of model selection is to find a model that is as simple as possible, fits a given data set with sufficient accuracy, and generalizes well to unseen data.
• Model-selection approaches can be broadly grouped into four categories: cross-validation, complexity criteria, regularization, and network pruning/growing.
Cross validation:
• In cross-validation methods, many networks of different complexity are trained and then tested on an independent validation set.
• Cross-validation is a standard model-selection method in statistics.
• The total pattern set is randomly partitioned into a training set and a validation set.
• When only one sample is used for validation, the method is called leave-one-out cross-validation.
• Let Di and D̄i, i = 1, . . ., m, be the data subsets of the total pattern set arising from the ith partitioning, used for training and testing, respectively.
• The cross-validation process trains the algorithm m times; the quality of the model is then scored by the average log-likelihood (or error) achieved on the held-out subsets.
• Validation uses data different from the training set; thus the validation set is independent of the estimated model.
• The popular K-fold cross-validation employs a non-overlapping test-set selection scheme.
• The data universe D is divided into K non-overlapping data subsets of the same size.
• Each data subset is used in turn as a test set, with the remaining K − 1 folds acting as the training set, and an error value is calculated by testing the classifier on the held-out fold.
• Finally, the K-fold cross-validation estimate of the error is the average of the errors committed on each fold.

(Figure: K-fold cross-validation with K = 10.)
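The following minimal NumPy sketch implements the K-fold procedure just described; `train` and `test_error` are hypothetical placeholders for the actual learning algorithm and error measure:

import numpy as np

def k_fold_cv_error(X, y, train, test_error, K=10, seed=0):
    """Average test error over K non-overlapping folds.
    `train(X, y)` returns a fitted model; `test_error(model, X, y)`
    returns that model's error on held-out data."""
    N = len(X)
    indices = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(indices, K)    # K non-overlapping subsets
    errors = []
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train(X[train_idx], y[train_idx])   # fit on K-1 folds
        errors.append(test_error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)                # average error over the K folds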
Complexity criteria:

• These methods use information criteria for statistical model selection:
  Akaike information criterion (AIC)
  Schwarz's Bayesian information criterion (BIC)
• These criteria are functions with two terms: one measuring the model's error on the data and one penalizing the number of parameters.
• A possible approach to model-order selection consists of minimizing the Kullback-Leibler discrepancy between the true pdf of the data and the pdf (or likelihood) of the model.
Both criteria can be expressed in terms of the maximum-likelihood estimated value L̂ of the model's likelihood:

AIC = −2 ln L̂ + 2 NP
BIC = −2 ln L̂ + NP ln N

where N is the size of the training set and NP is the number of parameters in the model. For a model with Gaussian noise of estimated variance σ̂², −2 ln L̂ reduces (up to an additive constant) to N ln σ̂², giving

AIC = N ln σ̂² + 2 NP
BIC = N ln σ̂² + NP ln N

In both cases, the model with the lower criterion value is preferred.
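As a sketch, assuming a Gaussian-noise regression model so that the noise-variance forms above apply, the two criteria can be computed from the fit residuals:

import numpy as np

def aic_bic(residuals, num_params):
    """AIC and BIC for a Gaussian-noise model, using the
    noise-variance forms N*ln(sigma^2) + penalty."""
    N = len(residuals)
    sigma2 = np.mean(residuals ** 2)      # estimated noise variance
    aic = N * np.log(sigma2) + 2 * num_params
    bic = N * np.log(sigma2) + num_params * np.log(N)
    return aic, bic

# Toy usage: with identical residuals, the 10-parameter model is
# penalized more heavily than the 3-parameter one by both criteria.
r = np.random.default_rng(1).normal(scale=1.0, size=100)
print(aic_bic(r, num_params=3))
print(aic_bic(r, num_params=10))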
Regularization

• Regularization is one of the most important concepts in neural networks. It is a technique to prevent the model from overfitting.
• Sometimes a machine learning model performs well on the training data but does not perform well on the test data.
• This means the model has fitted the noise in the training data and cannot predict the output for unseen data; such a model is said to be overfitted.
• Regularization maintains accuracy as well as the generalization ability of the model.
• It mainly regularizes, i.e., shrinks, the coefficients of the features toward zero.
• Regularization works by adding a penalty or complexity term to the error function:

ET = E + λ Ec

where E is the error function, Ec is the penalty for the complexity of the structure, and λ is the regularization parameter.

• Extra local minima are introduced into the optimization process by the penalty term.
• In the weight-decay technique, Ec is defined as a function of the weights, namely the sum of the squares of all the weights:

Ec = (1/2) Σi wi²

• The back-propagation update derived from ET with weight decay acquires an extra decay term:

Δwi = −η ∂E/∂wi − η λ wi

where η is the learning rate.
• The amplitudes of the weights decrease continuously toward zero unless they are reinforced by the BP rule.
• At the end of training, only the essential weights deviate significantly from zero.
• This effectively improves generalization and reduces the danger of overtraining as well; a sketch follows below.
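A minimal NumPy sketch of one gradient step with weight decay, assuming a hypothetical `grad_E` function that returns ∂E/∂w for the unregularized error:

import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    """One gradient step on E_T = E + lam * E_c with
    E_c = 0.5 * sum(w**2): the penalty gradient is simply w,
    so every weight is pulled toward zero by eta*lam*w."""
    return w - eta * (grad_E(w) + lam * w)

# Toy usage: with grad_E = 0, decay alone shrinks weights geometrically,
# illustrating how unreinforced weights drift toward zero.
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = weight_decay_step(w, grad_E=lambda w: np.zeros_like(w))
print(w)  # all entries close to zero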
Optimization

Optimization is the process of finding the set of parameters W that minimizes the loss function. Three strategies, in increasing order of quality:

Strategy #1: Random Search
Strategy #2: Random Local Search
Strategy #3: Following the gradient

Strategy 1: Since it is so simple to check how good a given set of parameters W is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best.
Strategy 2: A better idea is to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we start with a random W, generate random perturbations δW to it, and if the loss at the perturbed W + δW is lower, we perform an update; a sketch follows below.
Strategy 3: Following the gradient. We can compute the best direction along which to change the weight vector, a direction that is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes toward zero).
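A sketch of Strategy 2 (random local search), with a hypothetical `loss` function; Strategy 3 replaces the random probe with the gradient direction:

import numpy as np

def random_local_search(loss, W, steps=1000, step_size=0.01, seed=0):
    """Strategy 2: perturb W randomly and keep the move only if it
    lowers the loss (i.e., only step 'downhill')."""
    rng = np.random.default_rng(seed)
    best = loss(W)
    for _ in range(steps):
        W_try = W + step_size * rng.normal(size=W.shape)
        candidate = loss(W_try)
        if candidate < best:              # accept only downhill moves
            W, best = W_try, candidate
    return W, best

# Toy usage on a quadratic bowl; the search drifts toward the origin.
W, best = random_local_search(lambda W: np.sum(W ** 2), np.array([2.0, -1.0]))
print(W, best)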
Gradient Descent
This direction is given by the gradient of the loss function: the weights are updated by stepping against it, W ← W − η ∇W L(W), where η is the step size (learning rate).
Source: Gradient descent algorithm explained with linear regression example | by Dhanoop Karunakaran | Intro to Artificial Intelligence | Medium
SGD Momentum

SGD with momentum accumulates a velocity vector over successive steps:

v ← ρ v − η ∇W L(W),   W ← W + v

where ρ ∈ (0, 1) is the momentum parameter and η is the learning rate.
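A minimal sketch contrasting a plain SGD step with a momentum step, assuming a hypothetical `grad` function returning ∇W L(W):

import numpy as np

def sgd_step(W, grad, eta=0.01):
    """Plain SGD: step directly against the current gradient."""
    return W - eta * grad(W)

def momentum_step(W, v, grad, eta=0.01, rho=0.9):
    """SGD with momentum: the velocity v accumulates past gradients,
    damped by rho in (0, 1), which smooths the trajectory."""
    v = rho * v - eta * grad(W)
    return W + v, v

# Toy usage on the quadratic loss L(W) = 0.5 * ||W||^2, whose gradient is W.
W, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    W, v = momentum_step(W, v, grad=lambda W: W)
print(W)  # converges toward the minimum at the origin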


Back-Propagation

• When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network.
• The input x provides the initial information, which then propagates up through the hidden units at each layer and finally produces ŷ. This is called forward propagation.
• The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to flow backwards through the network in order to compute the gradient.
• Strictly speaking, back-propagation refers only to the method for computing the gradient; another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
• Back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function.
• In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters.
Computational graphs and chain rule

• To describe the back-propagation algorithm more precisely, it is helpful to have a precise computational graph language.
• Here, each node in the graph indicates a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.
• The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known: if y = g(x) and z = f(y), then dz/dx = (dz/dy)(dy/dx).
• Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
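As an illustration, here is a minimal NumPy sketch of forward and backward passes through the two-node chain z = f(g(x)) with g(x) = w·x and f(y) = y², applying the chain rule by hand; all names are illustrative, not from the slides:

import numpy as np

# Forward pass through the computational graph x -> y -> z,
# with y = w * x and z = y ** 2.
def forward(x, w):
    y = w * x
    z = y ** 2
    return y, z

# Backward pass: apply the chain rule dz/dw = (dz/dy) * (dy/dw),
# reusing the intermediate value y stored during the forward pass.
def backward(x, w, y):
    dz_dy = 2.0 * y          # local derivative of z = y^2
    dy_dw = x                # local derivative of y = w * x
    return dz_dy * dy_dw     # chain rule: dz/dw

x, w = 3.0, 2.0
y, z = forward(x, w)
print(backward(x, w, y))                 # analytic gradient: 2*(w*x)*x = 36
eps = 1e-6                               # finite-difference check
print((forward(x, w + eps)[1] - z) / eps)  # approximately 36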
