
Neural Networks and Deep Learning

UNIT – 2
Training Neural Network

Dr. D. SUDHEER
Assistant Professor
Computer Science and Engineering
VNRVJIET
Risk Minimization

• Risk minimization is one of the most widely used supervised learning frameworks.
• Because the training set is finite, learning theory cannot provide absolute guarantees on the performance of learning algorithms.
• Assume a set of N samples, {(xi, yi)}, drawn independently and identically distributed from some unknown probability distribution p(x, y).
• Assume a model defined by a set of possible mappings x → f(x, α), where α is an adjustable parameter; once α is fixed by training, the model is called a trained model.
• The expected risk, i.e., the expectation of the generalization error for a trained machine, is given by

R(α) = ∫ L(y, f(x, α)) dp(x, y)

where L(·, ·) is the loss function.

Loss Function
• The loss function L(y, f(x, α)) can be defined in different forms for different purposes.
• The empirical risk Remp(α) is defined as the measured mean error on a given training set:

Remp(α) = (1/N) Σi L(yi, f(xi, α))
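As an illustration, here is a minimal NumPy sketch that computes the empirical risk, assuming a squared-error loss and a hypothetical linear model f(x, α) = αᵀx (neither is specified in the slides):

import numpy as np

def empirical_risk(X, y, alpha):
    """Mean squared-error loss of a linear model f(x, alpha) = X @ alpha
    over the N training samples, i.e. the empirical risk R_emp(alpha)."""
    predictions = X @ alpha            # f(x_i, alpha) for every sample
    losses = (y - predictions) ** 2    # squared-error loss per sample
    return losses.mean()               # average over the N samples

# Toy usage: 5 samples, 3 features, random parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
alpha = rng.normal(size=3)
print(empirical_risk(X, y, alpha))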
Vapnik-Chervonenkis dimension (VC- Dimension)
• The VC dimension helps in understanding the random (VC) entropy of a function class.
• The VC dimension is a combinatorial characterization of the diversity of functions that can be computed by a given neural architecture.
• A subset S of the domain X is shattered by a class of functions (or a neural network) N if every function f : S → {0, 1} can be computed on N.
• For a set of n points there are 2^n possible binary labelings of the set.
• The VC dimension of a hypothesis space N defined over an instance space X is the size of the largest finite subset of X shattered by N.
• For example, linear classifiers in the plane can shatter three points in general position but no set of four, so their VC dimension is 3.

• The principle of structural risk minimization (SRM) minimizes the risk functional with respect to both the empirical risk and the VC dimension of the set of functions.
• SRM is a principle for reducing the risk of a model by means of regularization.
• The SRM principle is crucial for obtaining good generalization performance in a variety of learning machines, including SVMs.
• It finds the function that achieves the minimum of the guaranteed risk for a fixed amount of data.
Loss Functions

• When the training data are corrupted by large noise, such as outliers, conventional learning algorithms may not yield acceptable performance, since even a small number of outliers can have a large impact on the MSE.
• An outlier is an observation that deviates significantly from the other observations.
• It may be due to erroneous measurements or noisy data.
• When the noise becomes large or outliers exist, the network may try to fit those improper data, and the learned system is corrupted.
• A robust loss function can be used to reduce the effect of such outliers on learning, as in the sketch below.
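The slides do not name a specific robust loss; one common choice, shown here as an illustrative sketch, is the Huber loss, which is quadratic for small residuals and linear for large ones, so outliers contribute far less than under MSE:

import numpy as np

def huber_loss(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it,
    so large residuals (outliers) are penalized only linearly."""
    r = np.abs(residuals)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# An outlier residual of 10 contributes 50.0 under squared error,
# but only 9.5 under the Huber loss with delta = 1.
print(0.5 * 10.0 ** 2)                   # 50.0
print(huber_loss(np.array([10.0]))[0])   # 9.5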
Model Selection

• If two models of different complexity fit the data approximately equally well, the simpler one is usually the better predictive model.
• Among models approximating the noisy data, the ones with minimal complexity should be chosen.
• The objective of model selection is to find a model that is as simple as possible, fits a given data set with sufficient accuracy, and generalizes well to unseen data.
• Model-selection approaches can be broadly grouped into four categories: cross-validation, complexity criteria, regularization, and network pruning/growing.
Cross validation:
• In cross-validation methods, many networks of different complexity are trained and then tested on an independent validation set.
• Cross-validation is a standard model-selection method in statistics.
• The total pattern set is randomly partitioned into a training set and a validation set.
• When only one sample is used for validation, the method is called leave-one-out cross-validation.
• Let Di and D̄i, i = 1, . . ., m, be the data subsets of the total pattern set arising from the ith partitioning, used for training and testing, respectively.
• The cross-validation process trains the algorithm m times; the quality of the model is then scored by the average log-likelihood (or error) achieved on the held-out subsets.
• Validation uses data different from the training set; thus the validation set is independent of the estimated model.
• The popular K-fold cross-validation employs a non-overlapping test-set selection scheme.
• The data universe D is divided into K non-overlapping data subsets of the same size.
• Each data subset is used in turn as a test set, with the remaining K − 1 folds acting as the training set, and an error value is calculated by testing the classifier on the held-out fold.
• Finally, the K-fold cross-validation estimate of the error is the average of the errors committed on each fold.

(Figure: K-fold cross-validation with K = 10.)
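The following minimal NumPy sketch implements the K-fold procedure just described; `train` and `test_error` are hypothetical placeholders for the actual learning algorithm and error measure:

import numpy as np

def k_fold_cv_error(X, y, train, test_error, K=10, seed=0):
    """Average test error over K non-overlapping folds.
    `train(X, y)` returns a fitted model; `test_error(model, X, y)`
    returns that model's error on held-out data."""
    N = len(X)
    indices = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(indices, K)    # K non-overlapping subsets
    errors = []
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train(X[train_idx], y[train_idx])   # fit on K-1 folds
        errors.append(test_error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)                # average error over the K folds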
Complexity criteria:

• These methods use information criteria for statistical model selection:
  Akaike information criterion (AIC)
  Schwarz's Bayesian information criterion (BIC)
• These criteria are functions with two terms: one measuring the model's error on the data and one penalizing the number of parameters.
• A possible approach to model-order selection consists of minimizing the Kullback-Leibler discrepancy between the true pdf of the data and the pdf (or likelihood) of the model.
Both criteria can be expressed in terms of the maximum-likelihood estimated value L̂ of the model's likelihood:

AIC = −2 ln L̂ + 2 NP
BIC = −2 ln L̂ + NP ln N

where N is the size of the training set and NP is the number of parameters in the model. For a model with Gaussian noise of estimated variance σ̂², −2 ln L̂ reduces (up to an additive constant) to N ln σ̂², giving

AIC = N ln σ̂² + 2 NP
BIC = N ln σ̂² + NP ln N

In both cases, the model with the lower criterion value is preferred.
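As a sketch, assuming a Gaussian-noise regression model so that the noise-variance forms above apply, the two criteria can be computed from the fit residuals:

import numpy as np

def aic_bic(residuals, num_params):
    """AIC and BIC for a Gaussian-noise model, using the
    noise-variance forms N*ln(sigma^2) + penalty."""
    N = len(residuals)
    sigma2 = np.mean(residuals ** 2)      # estimated noise variance
    aic = N * np.log(sigma2) + 2 * num_params
    bic = N * np.log(sigma2) + num_params * np.log(N)
    return aic, bic

# Toy usage: with identical residuals, the 10-parameter model is
# penalized more heavily than the 3-parameter one by both criteria.
r = np.random.default_rng(1).normal(scale=1.0, size=100)
print(aic_bic(r, num_params=3))
print(aic_bic(r, num_params=10))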
Regularization

• Regularization is one of the most important concepts in neural networks. It is a technique to prevent the model from overfitting.
• Sometimes a machine learning model performs well on the training data but does not perform well on the test data.
• This means the model has fitted the noise in the training data and cannot predict the output for unseen data; such a model is said to be overfitted.
• Regularization maintains accuracy as well as the generalization ability of the model.
• It mainly regularizes, i.e., shrinks, the coefficients of the features toward zero.
• Regularization works by adding a penalty or complexity term to the error function:

ET = E + λ Ec

where E is the error function, Ec is the penalty for the complexity of the structure, and λ is the regularization parameter.

• Extra local minima are introduced into the optimization process by the penalty term.
• In the weight-decay technique, Ec is defined as a function of the weights, namely the sum of the squares of all the weights:

Ec = (1/2) Σi wi²

• The back-propagation update derived from ET with weight decay acquires an extra decay term:

Δwi = −η ∂E/∂wi − η λ wi

where η is the learning rate.
• The amplitudes of the weights decrease continuously toward zero unless they are reinforced by the BP rule.
• At the end of training, only the essential weights deviate significantly from zero.
• This effectively improves generalization and reduces the danger of overtraining as well; a sketch follows below.
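A minimal NumPy sketch of one gradient step with weight decay, assuming a hypothetical `grad_E` function that returns ∂E/∂w for the unregularized error:

import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    """One gradient step on E_T = E + lam * E_c with
    E_c = 0.5 * sum(w**2): the penalty gradient is simply w,
    so every weight is pulled toward zero by eta*lam*w."""
    return w - eta * (grad_E(w) + lam * w)

# Toy usage: with grad_E = 0, decay alone shrinks weights geometrically,
# illustrating how unreinforced weights drift toward zero.
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = weight_decay_step(w, grad_E=lambda w: np.zeros_like(w))
print(w)  # all entries close to zero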
Optimization

Optimization is the process of finding the set of parameters W that minimizes the loss function. Three strategies, in increasing order of quality:

Strategy #1: Random Search
Strategy #2: Random Local Search
Strategy #3: Following the gradient

Strategy 1: Since it is so simple to check how good a given set of parameters W is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best.
Strategy 2: A better idea is to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we start with a random W, generate random perturbations δW to it, and if the loss at the perturbed W + δW is lower, we perform an update; a sketch follows below.
Strategy 3: Following the gradient. We can compute the best direction along which to change the weight vector, a direction that is mathematically guaranteed to be the direction of steepest descent (at least in the limit as the step size goes toward zero).
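A sketch of Strategy 2 (random local search), with a hypothetical `loss` function; Strategy 3 replaces the random probe with the gradient direction:

import numpy as np

def random_local_search(loss, W, steps=1000, step_size=0.01, seed=0):
    """Strategy 2: perturb W randomly and keep the move only if it
    lowers the loss (i.e., only step 'downhill')."""
    rng = np.random.default_rng(seed)
    best = loss(W)
    for _ in range(steps):
        W_try = W + step_size * rng.normal(size=W.shape)
        candidate = loss(W_try)
        if candidate < best:              # accept only downhill moves
            W, best = W_try, candidate
    return W, best

# Toy usage on a quadratic bowl; the search drifts toward the origin.
W, best = random_local_search(lambda W: np.sum(W ** 2), np.array([2.0, -1.0]))
print(W, best)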
Gradient Descent
This direction is given by the gradient of the loss function: the weights are updated by stepping against it, W ← W − η ∇W L(W), where η is the step size (learning rate).
Source: Gradient descent algorithm explained with linear regression example | by Dhanoop Karunakaran | Intro to Artificial Intelligence | Medium
SGD Momentum

SGD with momentum accumulates a velocity vector over successive steps:

v ← ρ v − η ∇W L(W),   W ← W + v

where ρ ∈ (0, 1) is the momentum parameter and η is the learning rate.
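A minimal sketch contrasting a plain SGD step with a momentum step, assuming a hypothetical `grad` function returning ∇W L(W):

import numpy as np

def sgd_step(W, grad, eta=0.01):
    """Plain SGD: step directly against the current gradient."""
    return W - eta * grad(W)

def momentum_step(W, v, grad, eta=0.01, rho=0.9):
    """SGD with momentum: the velocity v accumulates past gradients,
    damped by rho in (0, 1), which smooths the trajectory."""
    v = rho * v - eta * grad(W)
    return W + v, v

# Toy usage on the quadratic loss L(W) = 0.5 * ||W||^2, whose gradient is W.
W, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    W, v = momentum_step(W, v, grad=lambda W: W)
print(W)  # converges toward the minimum at the origin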


Back-Propagation

• When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network.
• The input x provides the initial information, which then propagates up through the hidden units at each layer and finally produces ŷ. This is called forward propagation.
• The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to flow backwards through the network in order to compute the gradient.
• Strictly speaking, back-propagation refers only to the method for computing the gradient; another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
• Back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function.
• In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters.
Computational graphs and chain rule

• To describe the back-propagation algorithm more precisely, it is helpful to have a precise computational graph language.
• Here, each node in the graph indicates a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.
• The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known: if y = g(x) and z = f(y), then dz/dx = (dz/dy)(dy/dx).
• Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.
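As an illustration, here is a minimal NumPy sketch of forward and backward passes through the two-node chain z = f(g(x)) with g(x) = w·x and f(y) = y², applying the chain rule by hand; all names are illustrative, not from the slides:

import numpy as np

# Forward pass through the computational graph x -> y -> z,
# with y = w * x and z = y ** 2.
def forward(x, w):
    y = w * x
    z = y ** 2
    return y, z

# Backward pass: apply the chain rule dz/dw = (dz/dy) * (dy/dw),
# reusing the intermediate value y stored during the forward pass.
def backward(x, w, y):
    dz_dy = 2.0 * y          # local derivative of z = y^2
    dy_dw = x                # local derivative of y = w * x
    return dz_dy * dy_dw     # chain rule: dz/dw

x, w = 3.0, 2.0
y, z = forward(x, w)
print(backward(x, w, y))                 # analytic gradient: 2*(w*x)*x = 36
eps = 1e-6                               # finite-difference check
print((forward(x, w + eps)[1] - z) / eps)  # approximately 36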
