Module 3 Final
Module 3 Final
DR. S. Chitrakala
Professor
DCSE/ CEG/ AU
Module 3
1. The Multi-Layer Perceptron
2. Back Propagation of Error
3. Multi-layer Perceptron in Practice
4. Deriving Back Propagation
5. Applications of MLP
The Multi-Layer Perceptron
Multi-layer Perceptron
Majority of interesting problems are not linearly separable.
That problems can be made linearly separable if we can work out how to transform
the features suitably.
Consider making more complicated networks.
Learning in the neural network happens in the weights. So, to perform more
computation it seems sensible to add more weights.
There are two things that we can do:
add some backwards connections, so that the output neurons connect to the inputs again, or
add more neurons.
18
The first approach leads into recurrent networks. but are not
that commonly used.
We will instead consider the second approach. We can add
neurons between the input nodes and the outputs, and this
will make more complex neural networks,
19
XOR: Input (1, 0) corresponds to node A being 1 and B being 0.
The input to neuron C is therefore −1 × 0.5 + 1 × 1 + 0 × 1 = −0.5 +
1 = 0.5. This is above the threshold of 0, and so neuron C fires,
giving output 1.
For neuron D the input is −1 × 1 + 1 × 1 + 0 × 1 = −1 + 1 = 0, and so
it does not fire, giving output 0.
Therefore the input to neuron E is −1×0.5+1×1+0×−1 = 0.5, so
neuron E fires.
20
MLP: GOING FORWARD:
we start at the left by filling in the values for the inputs.
We then use these inputs and the first level of weights to
calculate the activations of the hidden layer, and then we use
those activations and the next set of weights to calculate the
activations of the output layer.
Now that we’ve got the outputs of the network, we can compare
them to the targets and compute the error.
We need to include a bias input to each neuron.
By having an extra input that is permanently set to -1, and
adjusting the weights to each neuron as part of the training.
Thus, each neuron in the network (whether it is a hidden layer or
the output) has 1 extra input, with fixed value.
21
GOING BACKWARDS: BACK-PROPAGATION OF ERROR
The error function that we used for the Perceptron was
where N is the number of output nodes.
However, suppose that we make two errors.
In the first, the target is bigger than the output, while in the second the
output is bigger than the target.
If these two errors are the same size, then if we add them up we could
get 0, which means that the error value suggests that no error was made.
To get around this we need to make all errors have the same
sign.
sum-of-squares error function
22
If we differentiate a error function, then it tells us the gradient of
that function, which is the direction along which it increases and
decreases the most.
Since the purpose of learning is to minimize the error, following
the error function downhill (in other words, in the direction of the
negative gradient) will give us what we want.
23
THE MULTI-LAYER PERCEPTRON ALGORITHM
There are
L input nodes, plus the bias,
M hidden nodes, also plus a bias, and
N output nodes,
So that there are (L+1)×M weights between the input and the hidden layer
and (M+1)×N between the hidden layer and the output.
The sums that we write will start from 0 if they include the bias
nodes and 1 otherwise, and run up to L,M, or N, so that x0 = −1 is
the bias input, and a0 = −1 is the bias hidden node.
The algorithm that is described could have any number of hidden
layers, in which case there might be several values for M, and
extra sets of weights between the hidden layers.
We will also use i, j, k to index the nodes in each layer 24
1. An input vector is put into the input nodes
2. The inputs are fed forward through the network
The inputs and the first-layer weights (here labelled as v) are used to
decide whether the hidden nodes fire or not. The activation function
g(·) is the sigmoid function given in Equation (4.2) above
The outputs of these neurons and the second-layer weights (labelled
as w) are used to decide if the output neurons fire or not
25
26
27
28
Back Propagation Algorithm
INITIALIZING THE WEIGHTS
Each neuron is getting input from n different places (either input
nodes if the neuron is in the hidden layer, or hidden neurons if it
is in the output layer).
If we view the values of these inputs as having uniform variance,
then the typical input to the neuron will be
where w is the initialization value of the weights. So a common
trick is to set the weights in the range
where n is the number of nodes in the input layer to those
weights.
This makes the total input to a neuron have a maximum size of
about 1.
53
Further, if the weights are large, then the activation of a neuron is
likely to be at, or close to, 0 or 1 already, which means that the
gradients are small, and so the learning is very slow.
There is an interplay here with the value of in the logistic
function, which means that small values of β (say β= 3.0 or less)
are more effective.
We use random values for the initialization so that the learning
starts off from different places for each run, and we keep them all
about the same size because we want all of the weights to reach
their final values at about the same time. This is known as uniform
learning and it is important because otherwise the network will
do better on some inputs than others.
54
DIFFERENT OUTPUT ACTIVATION FUNCTIONS
In the algorithm described above, we used sigmoid neurons in the
hidden layer and the output layer. This is fine for classification
problems, since there we can make the classes be 0 and 1.
However, we might also want to perform regression problems,
where the output needs to be from a continuous range, not just 0
or 1.
The sigmoid neurons at the output are not very useful in that
case.
We can replace the output neurons with linear nodes that just
sum the inputs and give that as their activation (so g(h) = h ).
This does not mean that we change the hidden layer neurons;
they stay exactly the same, and we only modify the output nodes.
55
They are not models of neurons anymore, since they don’t have
the characteristic fire/don’t fire pattern.
They enable us to solve regression problems, where we want a
real number out, not just a 0/1 decision.
Soft-max activation function is most commonly used for
classification problems where the 1-of-N output encoding is used
The soft-max function rescales the outputs by calculating the
exponential of the inputs to that neuron, and dividing by the total
sumof the inputs to all of the neurons, so that the activations sum
to 1 and all lie between 0 and 1.
As an activation function it can be written as:
56
If we change the activation function, then the derivative of the
activation function will also change, and so the learning rule will be
different.
For the linear activation function
Derivative of this is
We can modify the error function as well, to have the cross-
entropy form (where ln is the natural logarithm):
57
SEQUENTIAL AND BATCH TRAINING
All of the training examples are presented to the neural network,
the average sum-of-squares error is then computed, and this is
used to update the weights.
Thus there is only one set of weight updates for each epoch (pass
through all the training examples).
This means that we only update the weights once for each
iteration of the algorithm, which means that the weights are
moved in the direction that most of the inputs want them to
move, rather than being pulled around by each input individually.
The batch method performs a more accurate estimate of the
error gradient, and will converge to the local minimum more
quickly.
58
The algorithm that was described earlier was the sequential version,
where the errors are computed and the weights updated after each
input.
This is not guaranteed to be as efficient in learning, but it is simpler
to program when using loops, and it is therefore much more
common.
Since it does not converge as well, it can also sometimes avoid local
minima, thus potentially reaching better solutions.
randomising the order of the input vectors at each iteration. This can
significantly improve the speed with which the algorithm learns.
NumPy has a useful function that assists with this,
np.random.shuffle(), which takes a list of numbers and reorders
them.
59
LOCAL MINIMA
There may be a much lower point over the next hill, but the ball can’t
see that, and it doesn’t have enough energy to climb over the hill and
find the global minimum
We don’t know where the global minimum is because we don’t know
what the error landscape looks like;
we can only compute local features of it for the place we are in at the
moment.
Which minimum we end up in depends on where we start.
If we begin near the global minimum, then we are very likely to end
up in it, but if we start near a local minimum we will probably end up
there.
60
In addition, how long it will take to get to the minimum that we do
find depends upon the exact appearance of the landscape at the
current point.
We can make it more likely that we find the global minimum by
trying out several different starting points by training several
different networks, and this is commonly done.
However, we can also try to make it less likely that the algorithm will
get stuck in local minima.
61
PICKING UP MOMENTUM
The reason that the ball stops rolling is because it runs out of energy
at the bottom of the dip.
If we give the ball some weight, then it will generate momentum as it
rolls, and so it is more likely to overcome a small hill on the other
side of the local minimum, and so more likely to find the global
minimum.
We can implement this idea in our neural network learning by adding
in some contribution from the previous weight change that we made
to the current one.
In two dimensions it will mean that the ball rolls more directly
towards the valley bottom, since on average that will be the correct
direction, rather than being controlled by the local changes
62
There is another benefit to momentum. It makes it possible to use a
smaller learning rate, which means that the learning is more stable.
There is another benefit to momentum. It makes it possible to use a
smaller learning rate, which means that the learning is more stable.
64
a single input vector is chosen from the training set, and the output
and hence the error for that one vector computed, and this is used
to estimate the gradient and so update the weights.
A new random input vector (which could be the same as the
previous one) is then chosen and the process repeated.
This is known as stochastic gradient descent, and can be used for
any gradient descent problem, not just the MLP.
It is often used if the training set is very large, since it would be
very expensive to use the whole dataset to estimate the gradient in
that case.
65
OTHER IMPROVEMENTS: One is to reduce the learning rate as the
algorithm progresses. The reasoning behind this is that the network
should only be making large-scale changes to the weights at the
beginning, when the weights are random; if it is still making large
weight changes later on, then something is wrong.
Second derivatives of the error with respect to the weights
66
Multi-layer Perceptron in
Practice
1. Amount of Training Data
2. Number of Hidden Layers
3. When to Stop Learning
THE MULTI-LAYER PERCEPTRON IN PRACTICE: AMOUNT OF TRAINING DATA :
For the MLP with one hidden layer there are (L + 1) ×M + (M + 1) × N weights,
where L,M,N are the number of nodes in the input, hidden, and output
layers, respectively.
The extra +1s come from the bias nodes, which also have adjustable weights.
This is a potentially huge number of adjustable parameters that we need to
set during the training phase.
Setting the values of these weights is the job of the back-propagation
algorithm, which is driven by the errors coming from the training data.
Clearly, the more training data there is, the better for learning, although the
time that the algorithm takes to learn increases.
68
Unfortunately, there is no way to compute what the minimum
amount of data required is, since it depends on the problem.
A rule of thumb that has been around for almost as long as the
MLP itself is that you should use a number of training examples
that is at least 10 times the number of weights.
This is probably going to be a very large number of examples, so
neural network training is a fairly computationally expensive
operation, because we need to show the network all of these
inputs lots of times.
NUMBER OF HIDDEN LAYERS:
it is possible to show mathematically that one hidden layer with
lots of hidden nodes is sufficient. This is known as the Universal
Approximation Theorem;
69
You just have to experiment by training networks with different
numbers of hidden nodes and then choosing the one that gives the
best results
we will never normally need more than two layers (that is, one
hidden layer and the output layer). This is because we can
approximate any smooth functional mapping using a linear
combination of localised sigmoidal functions.
The basic idea is that by combining sigmoid functions we can
generate ridge-like functions, and by combining ridge-like functions
we can generate functions with a unique maximum.
By combining these and transforming them using another layer of
neurons, we obtain a localised response (a ‘bump’ function), and any
functional mapping can be approximated to arbitrary accuracy using
70
71
WHEN TO STOP LEARNING
72
Applications of MLP
A Regression Problem
Classification with the MLP
A Classification Example: The Iris Dataset
Time-Series Prediction
Data Compression: The Auto-Associative Network
EXAMPLES OF USING THE MLP
A REGRESSION PROBLEM
We will take a set of samples generated by a simple
mathematical function, and try to learn the generating function
(that describes how the data was made) so that we can find the
values of any inputs, not just the ones we have training data for.
74
75
76
CLASSIFICATION WITH THE MLP
There are a couple of choices for the outputs. The first is to use a
single linear node for the output, y, and put some thresholds on
the activation value of that node.
78
A CLASSIFICATION EXAMPLE: THE IRIS DATASET
Classifying examples of three types of iris (flower) by the length and
width of the sepals and petals and is called iris.
79
80
81
TIME-SERIES PREDICTION
The target data for training the neural network is simple, because it
comes from further up the time-series, and so training is easy.
Suppose that = 2 and k = 3.
82
2855 data points. To minimise the sum-of-squares error. Since
there are no classes, the confusion matrix is not useful
83
84
DATA COMPRESSION: THE AUTO-ASSOCIATIVE NETWORK
Suppose that we train the network to reproduce the inputs at the
output layer (called auto-associative learning; sometimes the
network is known as an autoencoder).
The network is trained so that whatever you show it at the input is
reproduced at the output.
This bottleneck hidden layer has to represent all of the information in
the input, so that it can be reproduced at the output.
It therefore performs some compression of the data, representing it
using fewer dimensions than were used in the input.
They are finding a different (often lower dimensional) representation
of the input data that extracts important components of the data,
and ignores the noise.
85
86
87
A RECIPE FOR USING THE MLP
88