Clearly this applies to an MLP with M hidden units, since $\phi(\cdot)$ can be a sigmoid, $w_{jk}$ and $b_j$ can be the hidden layer weights and biases, and $\alpha_j$ can be the output weights. It follows that, given enough hidden units, a two-layer MLP can approximate any continuous function.
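As a rough illustration of this construction (not part of the original notes), the NumPy sketch below builds a one-hidden-layer MLP by hand and evaluates $\sum_j \alpha_j \, \phi\big(\sum_k w_{jk} x_k + b_j\big)$ for a single output unit; the choice of M = 20 hidden units and the random weight values are arbitrary assumptions.

```python
import numpy as np

def sigmoid(a):
    """Hidden unit activation function phi(.)."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_output(x, W, b, alpha):
    """Two-layer MLP output: sum_j alpha_j * phi(sum_k w_jk * x_k + b_j).

    x     : input vector, shape (n_inputs,)
    W     : hidden layer weights w_jk, shape (M, n_inputs)
    b     : hidden layer biases b_j, shape (M,)
    alpha : output weights alpha_j, shape (M,)
    """
    hidden = sigmoid(W @ x + b)   # M hidden unit activations
    return alpha @ hidden         # single linear output unit

# Example with M = 20 hidden units and arbitrary (untrained) weights.
rng = np.random.default_rng(0)
M, n_inputs = 20, 1
W = rng.normal(size=(M, n_inputs))
b = rng.normal(size=M)
alpha = rng.normal(size=M)
print(mlp_output(np.array([0.5]), W, b, alpha))
```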
Learning and Generalization Revisited
Recall the idea of getting a neural network to learn a classification decision boundary:
[Figure: two plots of training data points in the (in1, in2) input plane, each with a candidate classification decision boundary]
Our aim is for the network to generalize to classify new inputs appropriately. If we know
that the training data contains noise, we don’t necessarily want the training data to be
classified totally accurately as that is likely to reduce the generalisation ability.
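To make this concrete, here is a small sketch using a k-nearest-neighbour classifier in place of a neural network (a deliberate simplification, not the method of these notes): with roughly 10% label noise, the k = 1 classifier fits the training data perfectly but generalizes worse than the smoother k = 15 classifier. The data-generating rule and noise level are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n):
    """Two-class data in the (in1, in2) plane with ~10% label noise (assumed setup)."""
    X = rng.uniform(-1, 1, (n, 2))
    labels = (X[:, 0] + X[:, 1] > 0).astype(int)   # true boundary: in1 + in2 = 0
    flip = rng.random(n) < 0.10                    # randomly flipped (noisy) labels
    return X, np.where(flip, 1 - labels, labels)

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote over the k nearest training points."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)
for k in (1, 15):   # k=1 classifies the training set perfectly; k=15 smooths over the noise
    train_acc = np.mean(knn_predict(X_tr, y_tr, X_tr, k) == y_tr)
    test_acc = np.mean(knn_predict(X_tr, y_tr, X_te, k) == y_te)
    print(f"k={k:2d}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```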
Generalization in Function Approximation
Similarly if our network is required to recover an underlying function from noisy data:
[Figure: noisy training data plotted as out versus in, with a fitted output curve]
We can expect the network to give a more accurate generalization to new inputs if its
output curve does not pass through all the data points. Again, allowing a larger error on
the training data is likely to lead to better generalization.
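A minimal numerical sketch of this point, using polynomial curve fitting as a stand-in for a network (the sine-shaped underlying function and Gaussian noise level are assumptions): the low-degree fit leaves a larger training error but a smaller error on new inputs than the high-degree fit that chases every noisy data point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical underlying function and noisy training data (assumptions).
g = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 15)
y_train = g(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 200)   # new inputs, compared against the true g

for degree in (3, 12):            # low vs. high complexity curve fits
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - g(x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```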
A Statistical View of the Training Data
Suppose we have a training data set D for our neural network:

$$D = \{\, x_i^p,\; y^p \;:\; p = 1, \dots, N \,\}$$

This consists of an output $y^p$ for each input pattern $x_i^p$. To keep the notation simple we shall assume we only have one output unit – the extension to many outputs is obvious.
Generally, the training data will be generated by some actual function $g(x_i)$ plus random noise $\varepsilon^p$ (which may, for example, be due to data gathering errors), so

$$y^p = g(x_i^p) + \varepsilon^p$$
We call this a regression model of the data. We can define a statistical expectation operator $E$ that averages over all possible training patterns, so

$$g(x_i) = E[\, y \mid x_i \,]$$

We say that the regression function $g(x_i)$ is the conditional mean of the model output $y$ given the inputs $x_i$.
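The following sketch (an illustration, with a hypothetical g and noise level) checks this numerically: drawing many samples $y^p = g(x) + \varepsilon^p$ at one fixed input $x$ and averaging them recovers the regression function $g(x)$, the conditional mean $E[\,y \mid x\,]$.

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x):
    """Hypothetical underlying function (an assumption for illustration)."""
    return 0.5 * x + np.sin(x)

def sample_y(x, n_samples, noise_std=0.3):
    """Draw noisy outputs y^p = g(x) + eps^p for one fixed input x."""
    return g(x) + rng.normal(scale=noise_std, size=n_samples)

x = 1.2
ys = sample_y(x, n_samples=100_000)
print("empirical E[y | x]:", ys.mean())   # approaches g(x) as the sample grows
print("g(x)              :", g(x))
```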
A Statistical View of Network Training
The neural network training problem is to construct an output function $net(x_i, W, D)$ of the network weights $W = \{w_{ij}^{(n)}\}$, based on the data $D$, that best approximates the regression model, i.e. the underlying function $g(x_i)$.
We have seen how to train a network by minimising the sum-squared error cost function

$$E_{\text{SSE}} = \sum_p \big( y^p - net(x_i^p, W, D) \big)^2$$

with respect to the network weights $W = \{w_{ij}^{(n)}\}$. However, we have also observed that, to get good generalisation, we do not necessarily want to achieve that minimum. What we really care about is how well the network approximates the underlying regression function. That natural error measure, $\big( E[\,y \mid x_i\,] - net(x_i, W, D) \big)^2$, depends on the specific training set $D$, and we really want our network training regime to produce good results averaged over all possible noisy training sets.
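As a toy sketch of this kind of training (an assumption-laden simplification, using a trivially simple linear model $net(x; w, b) = wx + b$ rather than a full MLP), the code below minimises the sum-squared error on one particular noisy training set D by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(3)

# One particular noisy training set D, with an assumed linear underlying function.
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# Simplest possible "network": net(x; w, b) = w*x + b, trained by gradient
# descent on the sum-squared error  E = sum_p (y^p - net(x^p, w, b))^2.
w, b, eta = 0.0, 0.0, 0.01
for epoch in range(500):
    err = y - (w * x + b)            # residuals on the training set
    w += eta * np.sum(err * x)       # step along -dE/dw (constant factors absorbed in eta)
    b += eta * np.sum(err)           # step along -dE/db
print(f"learned w = {w:.3f}, b = {b:.3f}")
print(f"final sum-squared error = {np.sum((y - (w * x + b)) ** 2):.4f}")
```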
Bias and Variance
If we define the expectation or average operator $E_D$ which takes the ensemble average over all possible training sets $D$, then some rather messy algebra allows us to show that:

$$E_D\Big[ \big( E[\,y \mid x_i\,] - net(x_i, W, D) \big)^2 \Big]
= \underbrace{\big( E_D[\,net(x_i, W, D)\,] - E[\,y \mid x_i\,] \big)^2}_{(\text{bias})^2}
+ \underbrace{E_D\Big[ \big( net(x_i, W, D) - E_D[\,net(x_i, W, D)\,] \big)^2 \Big]}_{\text{variance}}$$

The bias measures how far the average network output (averaged over all training sets) is from the regression function $E[\,y \mid x_i\,]$, and the variance measures how much the output for any particular training set fluctuates around that average.
In practice there will always be a trade-off between these two error components.
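These two components can be estimated numerically by generating many independent noisy training sets D, fitting the same model class to each, and averaging. The sketch below does this with polynomial fits standing in for the network; the underlying function, noise level, polynomial degree and ensemble size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

g = lambda x: np.sin(2 * np.pi * x)     # assumed underlying function
x_grid = np.linspace(0, 1, 50)          # fixed evaluation inputs x_i
noise_std, n_train, n_datasets, degree = 0.2, 20, 200, 5

# Fit the same model class to many independent noisy training sets D.
preds = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = g(x_tr) + rng.normal(scale=noise_std, size=n_train)
    preds[d] = np.polyval(np.polyfit(x_tr, y_tr, degree), x_grid)

mean_pred = preds.mean(axis=0)                     # estimate of E_D[net(x_i, W, D)]
bias2 = np.mean((mean_pred - g(x_grid)) ** 2)      # (bias)^2, averaged over x_i
variance = np.mean(preds.var(axis=0))              # variance, averaged over x_i
print(f"(bias)^2 ~ {bias2:.4f}   variance ~ {variance:.4f}")
```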
The Extreme Cases of Bias and Variance
We can best understand the concepts of bias and variance by considering the two
extreme cases of what the network might learn.
Suppose our network is lazy and just generates the same constant output whatever training data we give it, i.e. $net(x_i, W, D) = c$. In this case the variance term will be zero, but the bias will be large, because the network has made no attempt to fit the data.

Suppose instead our network is very hard working and makes sure that it fits every data point exactly, i.e. $net(x_i, W, D) = y(x_i) = g(x_i) + \varepsilon$. In this case the bias will be zero, since

$$E_D[\,net(x_i, W, D)\,] = E_D[\,y(x_i)\,] = E_D[\,g(x_i) + \varepsilon\,] = g(x_i) = E[\,y \mid x_i\,]$$

but the variance will be $E_D\big[ (net(x_i, W, D) - E_D[\,net(x_i, W, D)\,])^2 \big] = E_D[\,\varepsilon^2\,]$, i.e. the variance of the noise on the data, which could be substantial.
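The sketch below estimates (bias)² and variance for these two extremes directly, using a fixed set of inputs and an assumed underlying function with Gaussian noise: the lazy constant network (c = 0 here) has zero variance but a large bias, while the hard-working network that reproduces every data point has essentially zero bias but a variance close to the noise variance.

```python
import numpy as np

rng = np.random.default_rng(5)

g = lambda x: np.sin(2 * np.pi * x)   # assumed underlying function
x_pts = np.linspace(0, 1, 10)         # fixed training inputs x_i
noise_std, n_datasets = 0.3, 2000

lazy = np.empty((n_datasets, x_pts.size))
hard = np.empty((n_datasets, x_pts.size))
for d in range(n_datasets):
    y = g(x_pts) + rng.normal(scale=noise_std, size=x_pts.size)
    lazy[d] = 0.0    # constant output c = 0, ignores the training data entirely
    hard[d] = y      # reproduces every data point exactly: net(x_i, W, D) = y(x_i)

for name, preds in (("lazy", lazy), ("hard-working", hard)):
    bias2 = np.mean((preds.mean(axis=0) - g(x_pts)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"{name:12s}: (bias)^2 ~ {bias2:.3f}   variance ~ {variance:.3f}")
```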
Examples of the Two Extreme Cases
The lazy and hard-working networks approach our function approximation as follows:
[Figure: two plots of out versus in, one showing the lazy network's constant output and one showing the hard-working network's curve passing through every noisy data point]
Under-fitting, Over-fitting and the Bias/Variance Trade-off
If our network is to generalize well to new data, we obviously need it to generate a good approximation to the underlying function $g(x_i) = E[\,y \mid x_i\,]$, and we have seen that to do this we must minimise the sum of the bias and variance terms. There will clearly have to be a trade-off between minimising the bias and minimising the variance.
A network which is too closely fitted to the data will tend to have a large variance and
hence give a large expected generalization error. We then say that over-fitting of the
training data has occurred.
We can easily decrease the variance by smoothing the network outputs, but if this is
taken too far, then the bias becomes large, and the expected generalization error is large
again. We then say that under-fitting of the training data has occurred.
This trade-off between bias and variance plays a crucial role in applying neural network techniques to practical problems.
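One way to see the trade-off numerically is to sweep the model complexity and estimate (bias)² and variance at each setting, extending the earlier ensemble sketch. In the version below (polynomial degree stands in for network complexity; the underlying function and noise level are assumptions), low degrees under-fit with a large bias, high degrees over-fit with a large variance, and the sum is smallest at an intermediate complexity.

```python
import numpy as np

rng = np.random.default_rng(6)

g = lambda x: np.sin(2 * np.pi * x)     # assumed underlying function
x_grid = np.linspace(0, 1, 50)
noise_std, n_train, n_datasets = 0.2, 25, 300

for degree in (1, 3, 5, 9, 12):         # increasing model complexity
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = g(x_tr) + rng.normal(scale=noise_std, size=n_train)
        preds[d] = np.polyval(np.polyfit(x_tr, y_tr, degree), x_grid)
    bias2 = np.mean((preds.mean(axis=0) - g(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree:2d}: (bias)^2 = {bias2:.4f}  variance = {variance:.4f}  "
          f"sum = {bias2 + variance:.4f}")
```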
Preventing Under-fitting and Over-fitting
Overview and Reading
1. We began by looking at the computational power of MLPs.
2. Then we saw why the generalization is often better if we don’t train the
network all the way to the minimum of its error function.
3. A statistical treatment of learning showed that there was a trade-off
between bias and variance.
4. Both under-fitting (giving high bias) and over-fitting (giving high
variance) will result in poor generalization.
5. There are many ways we can try to improve generalization.
Reading