Clearly this applies to an MLP with M hidden units, since $\phi(\cdot)$ can be a sigmoid, $w_{jk}$ and $b_j$ can be the hidden layer weights and biases, and $\alpha_j$ can be the output weights. It follows that, given enough hidden units, a two-layer MLP can approximate any continuous function.
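As a rough illustration of this construction (not part of the original notes), the NumPy sketch below builds a one-hidden-layer MLP by hand and evaluates $\sum_j \alpha_j \, \phi\big(\sum_k w_{jk} x_k + b_j\big)$ for a single output unit; the choice of M = 20 hidden units and the random weight values are arbitrary assumptions.

```python
import numpy as np

def sigmoid(a):
    """Hidden unit activation function phi(.)."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_output(x, W, b, alpha):
    """Two-layer MLP output: sum_j alpha_j * phi(sum_k w_jk * x_k + b_j).

    x     : input vector, shape (n_inputs,)
    W     : hidden layer weights w_jk, shape (M, n_inputs)
    b     : hidden layer biases b_j, shape (M,)
    alpha : output weights alpha_j, shape (M,)
    """
    hidden = sigmoid(W @ x + b)   # M hidden unit activations
    return alpha @ hidden         # single linear output unit

# Example with M = 20 hidden units and arbitrary (untrained) weights.
rng = np.random.default_rng(0)
M, n_inputs = 20, 1
W = rng.normal(size=(M, n_inputs))
b = rng.normal(size=M)
alpha = rng.normal(size=M)
print(mlp_output(np.array([0.5]), W, b, alpha))
```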
Learning and Generalization Revisited
Recall the idea of getting a neural network to learn a classification decision boundary:
[Figure: two plots of training data points in the (in1, in2) input plane, each with a candidate classification decision boundary]
Our aim is for the network to generalize to classify new inputs appropriately. If we know
that the training data contains noise, we don’t necessarily want the training data to be
classified totally accurately as that is likely to reduce the generalisation ability.
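To make this concrete, here is a small sketch using a k-nearest-neighbour classifier in place of a neural network (a deliberate simplification, not the method of these notes): with roughly 10% label noise, the k = 1 classifier fits the training data perfectly but generalizes worse than the smoother k = 15 classifier. The data-generating rule and noise level are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    """Two-class data in the (in1, in2) plane with ~10% label noise (assumed setup)."""
    X = rng.uniform(-1, 1, (n, 2))
    labels = (X[:, 0] + X[:, 1] > 0).astype(int)   # true boundary: in1 + in2 = 0
    flip = rng.random(n) < 0.10                    # randomly flipped (noisy) labels
    return X, np.where(flip, 1 - labels, labels)

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote over the k nearest training points."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)
for k in (1, 15):   # k=1 classifies the training set perfectly; k=15 smooths over the noise
    train_acc = np.mean(knn_predict(X_tr, y_tr, X_tr, k) == y_tr)
    test_acc = np.mean(knn_predict(X_tr, y_tr, X_te, k) == y_te)
    print(f"k={k:2d}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```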
Generalization in Function Approximation
Similarly if our network is required to recover an underlying function from noisy data:
[Figure: noisy training data plotted as out versus in, with a fitted output curve]
We can expect the network to give a more accurate generalization to new inputs if its
output curve does not pass through all the data points. Again, allowing a larger error on
the training data is likely to lead to better generalization.
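A minimal numerical sketch of this point, using polynomial curve fitting as a stand-in for a network (the sine-shaped underlying function and Gaussian noise level are assumptions): the low-degree fit leaves a larger training error but a smaller error on new inputs than the high-degree fit that chases every noisy data point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical underlying function and noisy training data (assumptions).
g = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 15)
y_train = g(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)   # new inputs, compared against the true g

for degree in (3, 12):            # low vs. high complexity curve fits
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - g(x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```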
A Statistical View of the Training Data
Suppose we have a training data set D for our neural network:

$$D = \{\, x_i^p,\; y^p \;:\; p = 1, \dots, N \,\}$$

This consists of an output $y^p$ for each input pattern $x_i^p$. To keep the notation simple we shall assume we only have one output unit – the extension to many outputs is obvious.
Generally, the training data will be generated by some actual function $g(x_i)$ plus random noise $\varepsilon^p$ (which may, for example, be due to data gathering errors), so

$$y^p = g(x_i^p) + \varepsilon^p$$
We call this a regression model of the data. We can define a statistical expectation operator $E$ that averages over all possible training patterns, so

$$g(x_i) = E[\, y \mid x_i \,]$$

We say that the regression function $g(x_i)$ is the conditional mean of the model output $y$ given the inputs $x_i$.
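The following sketch (an illustration, with a hypothetical g and noise level) checks this numerically: drawing many samples $y^p = g(x) + \varepsilon^p$ at one fixed input $x$ and averaging them recovers the regression function $g(x)$, the conditional mean $E[\,y \mid x\,]$.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):
    """Hypothetical underlying function (an assumption for illustration)."""
    return 0.5 * x + np.sin(x)

def sample_y(x, n_samples, noise_std=0.3):
    """Draw noisy outputs y^p = g(x) + eps^p for one fixed input x."""
    return g(x) + rng.normal(scale=noise_std, size=n_samples)

x = 1.2
ys = sample_y(x, n_samples=100_000)
print("empirical E[y | x]:", ys.mean())   # approaches g(x) as the sample grows
print("g(x)              :", g(x))
```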
A Statistical View of Network Training
The neural network training problem is to construct an output function $net(x_i, W, D)$ of the network weights $W = \{w_{ij}^{(n)}\}$, based on the data $D$, that best approximates the regression model, i.e. the underlying function $g(x_i)$.
We have seen how to train a network by minimising the sum-squared error cost function

$$E_{\text{SSE}} = \sum_p \big( y^p - net(x_i^p, W, D) \big)^2$$

with respect to the network weights $W = \{w_{ij}^{(n)}\}$. However, we have also observed that, to get good generalisation, we do not necessarily want to achieve that minimum. What we really care about is how well the network approximates the underlying regression function. That natural error measure, $\big( E[\,y \mid x_i\,] - net(x_i, W, D) \big)^2$, depends on the specific training set $D$, and we really want our network training regime to produce good results averaged over all possible noisy training sets.
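As a toy sketch of this kind of training (an assumption-laden simplification, using a trivially simple linear model $net(x; w, b) = wx + b$ rather than a full MLP), the code below minimises the sum-squared error on one particular noisy training set D by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(3)

# One particular noisy training set D, with an assumed linear underlying function.
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# Simplest possible "network": net(x; w, b) = w*x + b, trained by gradient
# descent on the sum-squared error  E = sum_p (y^p - net(x^p, w, b))^2.
w, b, eta = 0.0, 0.0, 0.01
for epoch in range(500):
    err = y - (w * x + b)            # residuals on the training set
    w += eta * np.sum(err * x)       # step along -dE/dw (constant factors absorbed in eta)
    b += eta * np.sum(err)           # step along -dE/db
print(f"learned w = {w:.3f}, b = {b:.3f}")
print(f"final sum-squared error = {np.sum((y - (w * x + b)) ** 2):.4f}")
```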
Bias and Variance
If we define the expectation or average operator $E_D$ which takes the ensemble average over all possible training sets $D$, then some rather messy algebra allows us to show that:

$$E_D\Big[ \big( E[\,y \mid x_i\,] - net(x_i, W, D) \big)^2 \Big]
= \underbrace{\big( E_D[\,net(x_i, W, D)\,] - E[\,y \mid x_i\,] \big)^2}_{(\text{bias})^2}
+ \underbrace{E_D\Big[ \big( net(x_i, W, D) - E_D[\,net(x_i, W, D)\,] \big)^2 \Big]}_{\text{variance}}$$

The bias measures how far the average network output (averaged over all training sets) is from the regression function $E[\,y \mid x_i\,]$, and the variance measures how much the output for any particular training set fluctuates around that average.
In practice there will always be a trade-off between these two error components.
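These two components can be estimated numerically by generating many independent noisy training sets D, fitting the same model class to each, and averaging. The sketch below does this with polynomial fits standing in for the network; the underlying function, noise level, polynomial degree and ensemble size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

g = lambda x: np.sin(2 * np.pi * x)     # assumed underlying function
x_grid = np.linspace(0, 1, 50)          # fixed evaluation inputs x_i
noise_std, n_train, n_datasets, degree = 0.2, 20, 200, 5

# Fit the same model class to many independent noisy training sets D.
preds = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = g(x_tr) + rng.normal(scale=noise_std, size=n_train)
    preds[d] = np.polyval(np.polyfit(x_tr, y_tr, degree), x_grid)

mean_pred = preds.mean(axis=0)                     # estimate of E_D[net(x_i, W, D)]
bias2 = np.mean((mean_pred - g(x_grid)) ** 2)      # (bias)^2, averaged over x_i
variance = np.mean(preds.var(axis=0))              # variance, averaged over x_i
print(f"(bias)^2 ~ {bias2:.4f}   variance ~ {variance:.4f}")
```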
The Extreme Cases of Bias and Variance
We can best understand the concepts of bias and variance by considering the two
extreme cases of what the network might learn.
Suppose our network is lazy and just generates the same constant output whatever training data we give it, i.e. $net(x_i, W, D) = c$. In this case the variance term will be zero, but the bias will be large, because the network has made no attempt to fit the data.

Suppose instead our network is very hard working and makes sure that it fits every data point exactly, i.e. $net(x_i, W, D) = y(x_i) = g(x_i) + \varepsilon$. In this case the bias will be zero, since

$$E_D[\,net(x_i, W, D)\,] = E_D[\,y(x_i)\,] = E_D[\,g(x_i) + \varepsilon\,] = g(x_i) = E[\,y \mid x_i\,]$$

but the variance will be $E_D\big[ (net(x_i, W, D) - E_D[\,net(x_i, W, D)\,])^2 \big] = E_D[\,\varepsilon^2\,]$, i.e. the variance of the noise on the data, which could be substantial.
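The sketch below estimates (bias)² and variance for these two extremes directly, using a fixed set of inputs and an assumed underlying function with Gaussian noise: the lazy constant network (c = 0 here) has zero variance but a large bias, while the hard-working network that reproduces every data point has essentially zero bias but a variance close to the noise variance.

```python
import numpy as np

rng = np.random.default_rng(5)

g = lambda x: np.sin(2 * np.pi * x)   # assumed underlying function
x_pts = np.linspace(0, 1, 10)         # fixed training inputs x_i
noise_std, n_datasets = 0.3, 2000

lazy = np.empty((n_datasets, x_pts.size))
hard = np.empty((n_datasets, x_pts.size))
for d in range(n_datasets):
    y = g(x_pts) + rng.normal(scale=noise_std, size=x_pts.size)
    lazy[d] = 0.0    # constant output c = 0, ignores the training data entirely
    hard[d] = y      # reproduces every data point exactly: net(x_i, W, D) = y(x_i)

for name, preds in (("lazy", lazy), ("hard-working", hard)):
    bias2 = np.mean((preds.mean(axis=0) - g(x_pts)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"{name:12s}: (bias)^2 ~ {bias2:.3f}   variance ~ {variance:.3f}")
```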
Examples of the Two Extreme Cases
The lazy and hard-working networks approach our function approximation as follows:
[Figure: two plots of out versus in, one showing the lazy network's constant output and one showing the hard-working network's curve passing through every noisy data point]
Under-fitting, Over-fitting and the Bias/Variance Trade-off
If our network is to generalize well to new data, we obviously need it to generate a good approximation to the underlying function $g(x_i) = E[\,y \mid x_i\,]$, and we have seen that to do this we must minimise the sum of the bias and variance terms. There will clearly have to be a trade-off between minimising the bias and minimising the variance.
A network which is too closely fitted to the data will tend to have a large variance and
hence give a large expected generalization error. We then say that over-fitting of the
training data has occurred.
We can easily decrease the variance by smoothing the network outputs, but if this is
taken too far, then the bias becomes large, and the expected generalization error is large
again. We then say that under-fitting of the training data has occurred.
This trade-off between bias and variance plays a crucial role in applying neural network techniques to practical problems.
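One way to see the trade-off numerically is to sweep the model complexity and estimate (bias)² and variance at each setting, extending the earlier ensemble sketch. In the version below (polynomial degree stands in for network complexity; the underlying function and noise level are assumptions), low degrees under-fit with a large bias, high degrees over-fit with a large variance, and the sum is smallest at an intermediate complexity.

```python
import numpy as np

rng = np.random.default_rng(6)

g = lambda x: np.sin(2 * np.pi * x)     # assumed underlying function
x_grid = np.linspace(0, 1, 50)
noise_std, n_train, n_datasets = 0.2, 25, 300

for degree in (1, 3, 5, 9, 12):         # increasing model complexity
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = g(x_tr) + rng.normal(scale=noise_std, size=n_train)
        preds[d] = np.polyval(np.polyfit(x_tr, y_tr, degree), x_grid)
    bias2 = np.mean((preds.mean(axis=0) - g(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: (bias)^2 = {bias2:.4f}  variance = {variance:.4f}  "
          f"sum = {bias2 + variance:.4f}")
```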
Preventing Under-fitting and Over-fitting
Overview and Reading
1. We began by looking at the computational power of MLPs.
2. Then we saw why the generalization is often better if we don’t train the
network all the way to the minimum of its error function.
3. A statistical treatment of learning showed that there was a trade-off
between bias and variance.
4. Both under-fitting (giving high bias) and over-fitting (giving high
variance) will result in poor generalization.
5. There are many ways we can try to improve generalization.
Reading