
Bias and Variance, Under-Fitting and Over-Fitting

Introduction to Neural Networks : Lecture 9

© John A. Bullinaria, 2004

1. The Computational Power of MLPs

2. Learning and Generalization Revisited

3. A Statistical View of Network Training

4. Bias and Variance

5. Under-fitting, Over-fitting and the Bias/Variance Trade-off

6. Preventing Under-fitting and Over-fitting


Computational Power of MLPs
The universal approximation theorem can be stated as:

Let ϕ(⋅) be a non-constant, bounded, and monotone-increasing continuous function.


Then for any continuous function f(x) with x = {xi ∈ [0,1] : i = 1, …,m} and ε > 0, there
exists an integer M and real constants {αj, bj, wjk : j = 1, …,M , k = 1, …,m } such that
F(x_1, \ldots, x_m) = \sum_{j=1}^{M} \alpha_j \, \varphi\!\left( \sum_{k=1}^{m} w_{jk} x_k - b_j \right)
is an approximate realisation of f(⋅), that is
\left| \, F(x_1, \ldots, x_m) - f(x_1, \ldots, x_m) \, \right| < \varepsilon
for all x that lie in the input space.

Clearly this applies to an MLP with M hidden units, since ϕ(⋅) can be a sigmoid, the w_jk and b_j
can be hidden layer weights and biases, and the α_j can be output weights. It follows that,
given enough hidden units, a two layer MLP can approximate any continuous function.
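As a minimal NumPy sketch of the functional form in the theorem, F(x) = Σ_j α_j ϕ(Σ_k w_jk x_k − b_j): the target function f, the number of hidden units M, and the fitting strategy (random hidden parameters with least-squares output weights) are all illustrative assumptions, not part of the theorem itself.

# Sketch of the universal-approximation form with a sigmoid hidden layer.
# Target f, M, and the fitting strategy are assumptions for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
M = 20                                    # number of hidden units (assumed)
x = np.linspace(0.0, 1.0, 200)            # inputs in [0, 1], as in the theorem
f = np.sin(2 * np.pi * x)                 # an example continuous target f(x)

w = rng.normal(0.0, 10.0, size=M)         # hidden weights w_j, fixed at random
b = w * rng.uniform(0.0, 1.0, size=M)     # biases b_j, so each sigmoid turns on inside [0, 1]

H = sigmoid(np.outer(x, w) - b)           # hidden activations phi(w_j*x - b_j), shape (200, M)

# Choose the output weights alpha_j by least squares, giving F(x) = H @ alpha.
alpha, *_ = np.linalg.lstsq(H, f, rcond=None)
F = H @ alpha

print("max |F(x) - f(x)| =", np.max(np.abs(F - f)))   # typically shrinks as M grows

The theorem only guarantees that suitable constants {α_j, w_jk, b_j} exist for any chosen ε; it says nothing about how a training algorithm should find them.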

Learning and Generalization Revisited
Recall the idea of getting a neural network to learn a classification decision boundary:

[Figure: two scatter plots of the training data in the (in1, in2) plane, each showing a candidate decision boundary.]

Our aim is for the network to generalize to classify new inputs appropriately. If we know
that the training data contains noise, we don’t necessarily want the training data to be
classified totally accurately as that is likely to reduce the generalisation ability.

Generalization in Function Approximation
Similarly if our network is required to recover an underlying function from noisy data:

[Figure: a plot of out against in, showing noisy data points and a candidate fitted curve.]

We can expect the network to give a more accurate generalization to new inputs if its
output curve does not pass through all the data points. Again, allowing a larger error on
the training data is likely to lead to better generalization.

A Statistical View of the Training Data
Suppose we have a training data set D for our neural network:

D = \{\, x_i^p, y^p : i = 1, \ldots, n_{\mathrm{inputs}}, \; p = 1, \ldots, n_{\mathrm{patterns}} \,\}

This consists of an output y^p for each input pattern x_i^p. To keep the notation simple we
shall assume we only have one output unit – the extension to many outputs is obvious.

Generally, the training data will be generated by some actual function g(x_i) plus random
noise ε^p (which may, for example, be due to data gathering errors), so

y^p = g(x_i^p) + \varepsilon^p

We call this a regressive model of the data. We can define a statistical expectation
operator E that averages over all possible training patterns, so

g( xi ) = E [ y | xi ]

We say that the regression function g(xi) is the conditional mean of the model output y
given the inputs xi.
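A small numerical sketch of this regressive model (the choice of g, the noise level, and the sample sizes below are assumptions made purely for illustration):

# Generate a training set from the regressive model y^p = g(x^p) + eps^p.
# The underlying g, the noise level, and the sample sizes are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    return np.sin(2 * np.pi * x)             # the assumed underlying function g

n_patterns = 30
x = rng.uniform(0.0, 1.0, n_patterns)        # input patterns x^p
eps = rng.normal(0.0, 0.2, n_patterns)       # random noise eps^p
y = g(x) + eps                               # observed training outputs y^p

# Averaging y over many independent noisy observations at the same input
# recovers the conditional mean E[y | x] = g(x), i.e. the regression function.
x0 = 0.3
samples = g(x0) + rng.normal(0.0, 0.2, 100_000)
print(samples.mean(), "is close to", g(x0))

The sample mean at a fixed input converges on g(x_0), which is exactly what the conditional-mean statement g(x_i) = E[ y | x_i ] expresses.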

A Statistical View of Network Training
The neural network training problem is to construct an output function net(xi, W, D) of
the network weights W = {w_ij^(n)}, based on the data D, that best approximates the
regression model, i.e. the underlying function g(xi).

We have seen how to train a network by minimising the sum-squared error cost function:

E(W) = \frac{1}{2} \sum_{p \in D} \left( y^p - net(x_i^p, W, D) \right)^2

with respect to the network weights W = {w_ij^(n)}. However, we have also observed that, to
get good generalisation, we do not necessarily want to achieve that minimum. What we
really want to do is minimise the difference between the network's outputs net(x_i, W, D)
and the underlying function g(x_i) = E[ y | x_i ].

The natural sum-squared error function, i.e. (E[ y | x_i ] − net(x_i, W, D))^2, depends on the
specific training set D, and we really want our network training regime to produce good
results averaged over all possible noisy training sets.
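A minimal sketch of this distinction, using an assumed underlying g(x), an assumed noise level, and a simple stand-in "network" (a high-degree polynomial fitted to one training set): the cost E(W) on D can be driven almost to zero while the average squared difference from g(x) = E[ y | x ] stays large.

# Contrast the training cost E(W) on one data set D with the average squared
# difference from the regression function g(x) = E[y | x]. The function g,
# the noise level, and the polynomial "network" stand-in are assumptions.
import numpy as np

rng = np.random.default_rng(2)

def g(x):
    return np.sin(2 * np.pi * x)             # assumed underlying function

x = np.sort(rng.uniform(0.0, 1.0, 10))       # one noisy training set D
y = g(x) + rng.normal(0.0, 0.2, x.size)

# Stand-in "network" flexible enough to drive E(W) essentially to zero.
coeffs = np.polyfit(x, y, deg=9)

def net(xs):
    return np.polyval(coeffs, xs)

train_cost = 0.5 * np.sum((y - net(x)) ** 2)            # E(W) on D
x_dense = np.linspace(0.0, 1.0, 1000)
true_error = np.mean((g(x_dense) - net(x_dense)) ** 2)  # average (E[y|x] - net)^2

print(f"E(W) on D: {train_cost:.6f}   mean (g - net)^2: {true_error:.3f}")

Minimising E(W) exactly is therefore not the same as minimising the error against the regression function, which is the point of the bias/variance analysis that follows.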

Bias and Variance
If we define the expectation or average operator E_D which takes the ensemble average
over all possible training sets D, then some rather messy algebra allows us to show that:

E_D\!\left[ \left( E[\, y \mid x_i \,] - net(x_i, W, D) \right)^2 \right]
  = \left( E_D[\, net(x_i, W, D) \,] - E[\, y \mid x_i \,] \right)^2
    + E_D\!\left[ \left( net(x_i, W, D) - E_D[\, net(x_i, W, D) \,] \right)^2 \right]
  = (\mathrm{bias})^2 + (\mathrm{variance})

This error function consists of two positive components:


(bias)^2 : the squared difference between the average network output E_D[net(x_i, W, D)] and the
regression function g(x_i) = E[ y | x_i ]. This can be viewed as the approximation error.

(variance) : the variance of the approximating function net(x_i, W, D) over all the training
sets D. It represents the sensitivity of the results to the particular choice of data set D.

In practice there will always be a trade-off between these two error components.
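The "rather messy algebra" is essentially a single add-and-subtract step. A sketch of the derivation, written with our own shorthand n ≡ net(x_i, W, D), n̄ ≡ E_D[n] and ḡ ≡ E[ y | x_i ]:

\begin{align*}
E_D\!\left[(\bar{g} - n)^2\right]
  &= E_D\!\left[\big((\bar{g} - \bar{n}) + (\bar{n} - n)\big)^2\right] \\
  &= (\bar{g} - \bar{n})^2 + 2\,(\bar{g} - \bar{n})\,E_D[\bar{n} - n] + E_D\!\left[(\bar{n} - n)^2\right] \\
  &= \big(E_D[n] - \bar{g}\big)^2 + E_D\!\left[\big(n - E_D[n]\big)^2\right]
   = (\mathrm{bias})^2 + (\mathrm{variance}),
\end{align*}

since ḡ and n̄ do not depend on the particular training set D, so E_D[n̄ − n] = n̄ − n̄ = 0 and the cross term vanishes.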

The Extreme Cases of Bias and Variance
We can best understand the concepts of bias and variance by considering the two
extreme cases of what the network might learn.

Suppose our network is lazy and just generates the same constant output whatever
training data we give it, i.e. net ( xi , W , D) = c . In this case the variance term will be
zero, but the bias will be large, because the network has made no attempt to fit the data.

Suppose our network is very hard working and makes sure that it fits every data point:

E_D[\, net(x_i, W, D) \,] = E_D[\, y(x_i) \,] = E_D[\, g(x_i) + \varepsilon \,] = E[\, y \mid x_i \,]

so the bias is zero, but the variance is:

E_D\!\left[ \left( net(x_i, W, D) - E_D[\, net(x_i, W, D) \,] \right)^2 \right]
  = E_D\!\left[ \left( g(x_i) + \varepsilon - E_D[\, g(x_i) + \varepsilon \,] \right)^2 \right]
  = E_D[\, \varepsilon^2 \,]
i.e. the variance of the noise on the data, which could be substantial.
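A Monte Carlo sketch of these two extremes, evaluated at a single input x_0 (the underlying g, the noise level, and the lazy network's constant c are assumptions chosen for the example):

# Estimate bias^2 and variance over an ensemble of training sets for the
# "lazy" (constant output) and "hard-working" (fits its data exactly) networks.
# g, sigma, x0 and c are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)

def g(x):
    return np.sin(2 * np.pi * x)              # assumed underlying function

sigma = 0.2                                   # assumed noise standard deviation
x0 = 0.25                                     # evaluate bias and variance at one input
c = 0.0                                       # the lazy network's constant output (assumed)

n_sets = 100_000                              # ensemble of training sets D
y0 = g(x0) + rng.normal(0.0, sigma, n_sets)   # training target at x0 in each D

lazy = np.full(n_sets, c)                     # lazy: same output whatever the data
hard = y0                                     # hard-working: reproduces its target exactly

for name, out in [("lazy", lazy), ("hard-working", hard)]:
    bias2 = (out.mean() - g(x0)) ** 2         # (E_D[net] - E[y | x0])^2
    variance = out.var()                      # E_D[(net - E_D[net])^2]
    print(f"{name:>12}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")

# Expected: lazy gives bias^2 = g(x0)^2 = 1.0 and zero variance;
# hard-working gives bias^2 ~ 0 and variance ~ sigma^2 = 0.04.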

Examples of the Two Extreme Cases

The lazy and hard-working networks approach our function approximation as follows:

[Figure: two plots of out against in, one for each extreme network, with the same noisy data points in both.]

Ignore the data ⇒ big approximation errors (high bias); no variation between data sets (no variance).
Get every data point ⇒ no approximation errors (zero bias); variation between data sets (high variance).

Under-fitting, Over-fitting and the Bias/Variance Trade-off
If our network is to generalize well to new data, we obviously need it to generate a good
approximation to the underlying function g( xi ) = E [ y | xi ], and we have seen that to do
this we must minimise the sum of the bias and variance terms. There will clearly have
to be a trade-off between minimising the bias and minimising the variance.

A network which is too closely fitted to the data will tend to have a large variance and
hence give a large expected generalization error. We then say that over-fitting of the
training data has occurred.

We can easily decrease the variance by smoothing the network outputs, but if this is
taken too far, then the bias becomes large, and the expected generalization error is large
again. We then say that under-fitting of the training data has occurred.

This trade-off between bias and variance plays a crucial role in the application of neural
network techniques to practical applications.

Preventing Under-fitting and Over-fitting

To prevent under-fitting we need to make sure that:


1. The network has enough hidden units to represent the required mappings.
2. We train the network for long enough so that the sum-squared error cost function is
sufficiently minimised.

To prevent over-fitting we can:


1. Stop the training early – before it has had time to learn the training data too well.
2. Restrict the number of adjustable parameters the network has – e.g. by reducing the
number of hidden units, or by forcing connections to share the same weight values.
3. Add some form of regularization term to the error function to encourage smoother
network mappings (see the sketch after this list).
4. Add noise to the training patterns to smear out the data points.
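As a minimal sketch of approach 3 above: weight decay adds a term λ·Σ w² to the sum-squared error, penalising large weights and hence encouraging smoother mappings. The toy network here (a fixed random hidden layer with only the output weights trained) and the value of λ are assumptions for illustration.

# Weight decay as a regularisation term added to the sum-squared error.
# The toy network and the value of lam are assumptions for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 1.0, 20))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)   # noisy training data (assumed)

M = 50                                         # deliberately many hidden units
w = rng.normal(0.0, 10.0, M)                   # fixed random hidden weights
b = w * rng.uniform(0.0, 1.0, M)               # fixed random hidden biases
H = sigmoid(np.outer(x, w) - b)                # hidden activations, shape (20, M)

lam = 1e-2                                     # regularisation strength (assumed)

# Unregularised least-squares fit of the output weights.
a_plain = np.linalg.lstsq(H, y, rcond=None)[0]

# Weight decay: minimise 1/2*||y - H a||^2 + lam/2*||a||^2,
# whose closed-form solution is (H^T H + lam*I) a = H^T y.
a_reg = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)

print("output-weight norm, unregularised:  ", np.linalg.norm(a_plain))
print("output-weight norm, with weight decay:", np.linalg.norm(a_reg))

The weight-decay solution typically has a much smaller weight norm, and smaller output weights give a less wiggly fitted curve, which is the sense in which the regularisation term encourages smoother network mappings.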

Next lecture will be dedicated to looking at these approaches to improving generalization.

Overview and Reading
1. We began by looking at the computational power of MLPs.
2. Then we saw why the generalization is often better if we don’t train the
network all the way to the minimum of its error function.
3. A statistical treatment of learning showed that there was a trade-off
between bias and variance.
4. Both under-fitting (giving high bias) and over-fitting (giving high
variance) will result in poor generalization.
5. There are many ways we can try to improve generalization.

Reading

1. Bishop: Sections 6.1, 9.1, 9.2
2. Gurney: Sections 6.8, 6.9
3. Haykin: Sections 2.13, 4.13
