MLP Extra Notes
This seems fine on first examination, but a moment’s thought will
show that this arrangement of perceptrons in layers will be unable to learn.
Each neuron in the structure still takes the weighted sum of its inputs,
thresholds it, and outputs either a one or a zero. For the perceptrons in the
first layer, the inputs come from the actual inputs to the network, while the
perceptrons in the second layer take as their inputs the outputs from the first
layer. This means that the perceptrons in the second layer do not know
which of the real inputs were on or not; they are only aware of input from
the first layer. Since learning corresponds to strengthening the connections
between active inputs and active units (refer to section 3.3), it is impossible
to strengthen the correct parts of the network, since the actual inputs are
effectively masked off from the output units by the intermediate layer. The
two-state neuron, being “on” or “off”, gives us no indication of the scale by
which we need to adjust the weights, and so we cannot make a reasonable
adjustment.
The Solution
The way around the difficulty imposed by using the step function as
the thresholding function is to adjust it slightly, and use a slightly different
non-linearity. If we smooth it out, so that it still more or less turns on or off
as before, but has a sloping region in the middle that gives us some
information about the inputs, we will be able to determine when we need to
strengthen or weaken the relevant weights. This means that the network
will be able to learn, as required. A couple of possibilities for the new
thresholding function are shown in Figure 3.2.
Name                         Input/Output Relation                    MATLAB Function
Hard Limit                   a = 0 if n < 0;  a = 1 if n ≥ 0          hardlim
Symmetrical Hard Limit       a = −1 if n < 0;  a = +1 if n ≥ 0        hardlims
Log-Sigmoid                  a = 1 / (1 + e^(−n))                     logsig
Hyperbolic Tangent Sigmoid   a = (e^n − e^(−n)) / (e^n + e^(−n))      tansig
Positive Linear              a = 0 if n < 0;  a = n if n ≥ 0          poslin
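The functions in this table can be sketched directly in Python; a minimal illustration that only borrows the MATLAB names as labels (the toolbox itself is not involved):

```python
import math

# Plain-Python sketches of the activation functions in the table above.

def hardlim(n):  return 1.0 if n >= 0 else 0.0     # hard limit
def hardlims(n): return 1.0 if n >= 0 else -1.0    # symmetrical hard limit
def logsig(n):   return 1.0 / (1.0 + math.exp(-n)) # log-sigmoid
def tansig(n):   return math.tanh(n)               # (e^n - e^-n)/(e^n + e^-n)
def poslin(n):   return n if n >= 0 else 0.0       # positive linear

# Unlike the hard limiters, the log-sigmoid has a useful derivative
# everywhere, which is what makes weight adjustment possible:
def logsig_deriv(n):
    a = logsig(n)
    return a * (1.0 - a)
```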
Figure (3.2): The most used activation functions: (a) hard-limited threshold; (b) linear
threshold: if the input is above a certain threshold, the output saturates
(to a value of 1); there are different variants of this function,
depending on the range of the neuronal output values, shown in (b-1) and
(b-2); (c) sigmoid function: logistic function (c-1); bipolar logistic function
(c-2); Gaussian (bell-shaped) function (c-3).
In both cases, the value of the output will be practically one if the
weighted sum exceeds the threshold by a lot, and conversely, it will be
practically zero if the weighted sum is much less than the threshold value.
However, in the case when the threshold and the weighted sum are almost
the same, the output from the neuron will have a value somewhere between
the two extremes. This means that the output from the neuron is able to be
related to its inputs in a more useful and informative way.
THE NEW MODEL
THE NEW LEARNING RULE
The learning rule for multilayer perceptrons is called the
"generalized delta rule", or the "backpropagation rule".
The operation of the network is similar to that of the single-
layer perceptron, in that we show the net a pattern and calculate its
response. Comparison with the desired response enables the
weights to be altered so that the network can produce a more
accurate output next time. The learning rule provides the method
for adjusting the weights in the network. When we show the
untrained network an input pattern, it will produce an essentially
random output. We need to define an error function that represents
the difference between the network's current output and the correct
output that we want it to produce. Because we need to know the
"correct" pattern, this type of learning is known as "supervised
learning". In order to learn successfully, we want to make the output
of the net approach the desired output; that is, we want to
continually reduce the value of this error function. This is achieved
by adjusting the weights on the links between the units, and the
generalized delta rule does this by calculating the value of the error
function for that particular input, and then back-propagating (hence
the name!) the error from one layer to the previous one. Each unit in
the net has its weights adjusted so that it reduces the value of the
error function; for units actually on the output, their output and the
desired output are known, so adjusting the weights is relatively
simple, but for units in the middle layer, the adjustment is not so
obvious.
Weights from input nodes to hidden nodes are called input-layer
weights.
E_P = ½ · Σ_j (T_j − O_j)²   ────►(1)
"Only for output nodes, since they have targets."
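Equation (1) as a sketch; the target and output vectors here are made-up values for illustration:

```python
# Pattern error E_P = 1/2 * sum_j (T_j - O_j)^2, summed over output nodes.
def pattern_error(targets, outputs):
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# e.g. targets [1, 0] with actual outputs [0.8, 0.3]:
# E_P = 0.5 * (0.2**2 + 0.3**2) = 0.065
```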
The total linear sum at each unit j is:
net_j = Σ_i W_ij · O_i   ────►(2)
For example:
net3 = W13·X1 + W23·X2
net7 = W37·O3 + W47·O4 + W57·O5 + …
The unit's output is then:
O_j = f_j(net_j)   ────►(3)
i.e.
O4 = f4(net4)
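Equations (2) and (3) amount to a weighted sum followed by a squashing function. A minimal sketch, with the logistic function as f and arbitrary illustrative weights:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# One unit j: weighted sum of its inputs (eq. 2), then squash (eq. 3).
def unit_output(weights, inputs, f=logsig):
    net = sum(w * o for w, o in zip(weights, inputs))  # eq. (2)
    return f(net)                                      # eq. (3)

# With W13 = W23 = 1 and X = (0, 0), net3 = 0, so O3 = logsig(0) = 0.5
```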
W_ij(new) = W_ij(old) − η × (∂E_P / ∂W_ij)   ────►(4)
for any weight in the network.
Thus, the issue is to get (∂E_P / ∂W_ij) for each weight in the
network.
We can write, by the chain rule:
∂E_P/∂W_ij = (∂E_P/∂net_j) · (∂net_j/∂W_ij)   ────►(5)
Since net_j = Σ_i W_ij · O_i, we have ∂net_j/∂W_ij = O_i. Thus:
∂E_P/∂W_ij = O_i · (∂E_P/∂net_j)   ────►(5′)
i.e., the update error term for any weight depends on the input value going to
that weight (O_i).
Now we look at the second term of eq. (5′). Let us call:
δ_j = − ∂E_P/∂net_j   ────►(7)
so that, from (4) and (5′), ΔW_ij = η × δ_j × O_i. For example:
ΔW46 = η × δ6 × O4   (an output-layer weight)
ΔW14 = η × δ4 × O1   (a hidden-layer weight)
The δ_j for output nodes (concerning output-layer weights) is
simpler than the δ_j for hidden nodes (concerning input-layer
weights):
δ_j = − ∂E/∂net_j = − (∂E/∂O_j) · (∂O_j/∂net_j)   (chain rule)   ────►(9)
Since O_j = f_j(net_j),
∂O_j/∂net_j = f′_j(net_j)   ────►(10)
Thus,
δ_j = − f′_j(net_j) × (∂E/∂O_j)   ────►(9′)
The problem now is to get the term ∂E/∂O_j.
1) If O_j is an output unit:
E = ½ Σ_j (T_j − O_j)²
∂E/∂O_j = − (T_j − O_j)   ────►(11)
Thus,
δ_j = f′_j(net_j) × (T_j − O_j)   ────►(12)
for output units, and hence, for output-layer weights,
ΔW_ij = η × f′_j(net_j) × (T_j − O_j) × O_i
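Equation (12) and the resulting weight update can be sketched with the logistic f, whose derivative is f′(net) = O·(1 − O); the values of η, the target, and the inputs below are illustrative:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Output-unit delta from eq. (12): delta_j = f'(net_j) * (T_j - O_j),
# using the logistic derivative f'(net) = O * (1 - O).
def output_delta(net_j, target):
    o = logsig(net_j)
    return o * (1.0 - o) * (target - o)

# Corresponding output-layer weight change: eta * delta_j * O_i
def weight_update(eta, net_j, target, o_i):
    return eta * output_delta(net_j, target) * o_i
```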
2) For hidden-layer weights (input-layer weights):
δ_j = − f′_j(net_j) × (∂E/∂O_j)
Consider W14, so that O_j = O4. Since O4 feeds units 6 and 7:
∂E/∂O4 = (∂E/∂net6) × (∂net6/∂O4) + (∂E/∂net7) × (∂net7/∂O4)
Generally, if O_j is connected to (k) output nodes, then:
∂E/∂O_j = Σ_k (∂E/∂net_k) × (∂net_k/∂O_j)
where ∂net_k/∂O_j simplifies to W_jk. Thus:
∂E/∂O_j = Σ_k W_jk × (∂E/∂net_k)   ────►(13)
(the sum runs over the output layer). But from (9),
∂E/∂net_j = −δ_j,  or  ∂E/∂net_k = −δ_k
∴ ∂E/∂O_j = − Σ_k W_jk × δ_k   ────►(14)
Substituting back in (9′):
δ_j = − f′_j(net_j) × ( − Σ_k W_jk × δ_k ) = f′_j(net_j) × Σ_k W_jk × δ_k
and, from (12), δ_k = f′_k(net_k) × (T_k − O_k) for output units, so:
δ_j = f′_j(net_j) × Σ_k W_jk × f′_k(net_k) × (T_k − O_k)
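The hidden-unit delta just derived can be sketched the same way; the downstream weights W_jk and deltas δ_k below are arbitrary illustrative values:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Hidden-unit delta: delta_j = f'(net_j) * sum_k W_jk * delta_k,
# again using the logistic derivative f'(net) = O * (1 - O).
def hidden_delta(net_j, w_jk, delta_k):
    o = logsig(net_j)
    return o * (1.0 - o) * sum(w * d for w, d in zip(w_jk, delta_k))
```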
Multilayer Perceptron Learning Algorithm:
1) Initialise weights and thresholds:
Set all weights and thresholds to small random values.
4) Adapt weights:
Start from the output layer, and work backwards.
W_ij(t+1) = W_ij(t) + η · δ_pj · O_pi
W_ij(t) represents the weight from node i to node j at time t, η is a gain
term, and δ_pj is an error term for pattern p on node j.
For output units:
δ_pj = α · O_pj · (1 − O_pj) · (t_pj − O_pj)
For hidden units:
δ_pj = α · O_pj · (1 − O_pj) · Σ_k δ_pk · W_jk
where the sum is over the k nodes in the layer above node j.
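The steps above can be sketched as a small script that trains a 2-2-1 network on XOR. The layer sizes, learning rate, epoch count, random seed, and the use of trainable bias weights are illustrative assumptions, not prescribed by the notes (the gain α is folded into η here):

```python
import math
import random

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

random.seed(0)
ETA = 0.5  # learning rate (illustrative)

# w[i][j]: input i -> hidden j (row 2 holds the hidden biases)
# v[j]:    hidden j -> output   (v[2] is the output bias)
w = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
v = [random.uniform(-0.5, 0.5) for _ in range(3)]

def forward(x):
    h = [logsig(w[0][j] * x[0] + w[1][j] * x[1] + w[2][j]) for j in range(2)]
    o = logsig(v[0] * h[0] + v[1] * h[1] + v[2])
    return h, o

def train_step(x, t):
    h, o = forward(x)
    # output delta (logistic derivative, eq. 12)
    d_o = o * (1 - o) * (t - o)
    # hidden deltas: back-propagate d_o through the hidden->output weights
    d_h = [h[j] * (1 - h[j]) * v[j] * d_o for j in range(2)]
    # adapt weights, output layer first, then the hidden layer
    for j in range(2):
        v[j] += ETA * d_o * h[j]
    v[2] += ETA * d_o
    for j in range(2):
        w[0][j] += ETA * d_h[j] * x[0]
        w[1][j] += ETA * d_h[j] * x[1]
        w[2][j] += ETA * d_h[j]
    return 0.5 * (t - o) ** 2

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
initial_err = sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in patterns)
for _ in range(5000):
    err = sum(train_step(x, t) for x, t in patterns)
```

With these settings the summed pattern error after training falls well below its initial value; as the notes observe below, a small fraction of initialisations can still settle into a local minimum.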
THE XOR PROBLEM REVISITED:
In the previous chapter, we saw how the single-layer perceptron was
unable to solve the exclusive-or problem.
Considering the hidden unit, we can see that it detects when both the
inputs are on, since this is the only condition under which it turns on. Since
each of the input units detects when its input is on, the output unit is fed
with three items of information: whether the left input is on, whether the
right input is on, and whether both the left and the right inputs are on. Since
the output unit treats the hidden unit as another input unit, the apparent
input patterns it receives are now dissimilar enough for the classification to
be learnt.
Multilayer perceptrons can appear in all shapes and sizes, with the same
learning rule for them all. This means that it is possible to produce different
network topologies to solve the same problem. One of the more interesting
cases is when there is no direct connection from the input to the output.
This and the corresponding XOR solution are shown in figure 4.6.
The network shown in figure 4.7 will correctly respond to the input
patterns 00 and 10, but fails to produce the correct output for the patterns 01
or 11. The right-hand input turns on both hidden units. These produce a net
input of 0.8 at the output unit, exactly the same as the threshold value.
Since the thresholding function is the sigmoid, this gives an output value of
0.5. This situation is stable and does not alter with further training. This
local minimum occurs infrequently, about 1% of the time, in the XOR
problem.
Another minor problem can occur in training networks with the
generalized delta rule. Since the weight changes are proportional to the
weights themselves, if the system starts off with equal weights then non-
equal weights can never be developed, and so the net cannot settle into the
non-symmetric solution that may be required.
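This symmetry problem can be seen numerically. A sketch (all numeric values arbitrary) of one generalized-delta step on a 2-2-1 net whose two hidden units start with identical weights:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Both hidden units start with the same weights ...
w_h = [[0.3, 0.3], [0.3, 0.3]]   # input -> hidden, identical columns
w_o = [0.2, 0.2]                 # hidden -> output, identical
x, t = (1.0, 0.0), 1.0

# ... so they produce identical outputs,
h = [logsig(w_h[0][j] * x[0] + w_h[1][j] * x[1]) for j in range(2)]
o = logsig(sum(w_o[j] * h[j] for j in range(2)))

# ... receive identical deltas, and therefore stay identical forever.
d_o = o * (1 - o) * (t - o)
d_h = [h[j] * (1 - h[j]) * w_o[j] * d_o for j in range(2)]
```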
This energy surface is a rippling landscape of hills and valleys, wells and
mountains, with points of minimum energy corresponding to the wells and
maximum energy found on the peaks. The generalized delta rule aims to
minimize the error function E by adjusting the weights in the network so
that they correspond to those at which the energy surface is lowest. It does
this by a method known as gradient descent, where the gradient of the
energy function is calculated, and changes are made in the steepest
downward direction. This is guaranteed to find a solution in cases where
the energy landscape is simple. Each possible solution is represented as a
hollow, or a basin, in the landscape. These basins of attraction, as they are
known, represent the solutions: the values of the weights that produce the
correct output for a given input.
We can consider classifying patterns in another way. Any given input
pattern must belong to one of the classes that we are considering, and so
there is a mapping from the input to the required class. This mapping can be
viewed as a function that transforms the input pattern into the correct output
class, and we can consider that a network has learnt to perform correctly if
it can carry out this mapping. In fact, any function, no matter how complex,
can be represented by a multilayer perceptron of no more than three layers:
the inputs are fed through an input layer, a middle hidden layer, and an
output layer.
More than two units can be used in the first layer, which produces
a pattern-space partitioning that is a combination of more than two lines.
All regions produced in this way are known as convex regions or convex hulls.
A convex hull is a region in which any point can be connected to any other
by a straight line that does not cross the boundary of the region.
However, if we add another layer of perceptrons, the units in this layer
will receive as inputs, not lines, but convex hulls. The combinations of
these convex regions may intersect, overlap, or be separate from each other,
producing arbitrary shapes.
LEARNING DIFFICULTIES:
Momentum term
The weight changes can be given some “momentum” by introducing an
extra term into the weight adaptation equation that will produce a large
change in the weight if the changes are currently large, and will decrease as
the changes become less.
This momentum can be written as follows:
W_ji(t+1) = W_ji(t) + η · δ_pj · O_pi + k · (W_ji(t) − W_ji(t−1))
where k is the momentum factor, 0 < k < 1.
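A one-function sketch of the momentum update; all numeric inputs in the example are illustrative:

```python
# Momentum update: the previous weight change, scaled by k, is carried
# forward on top of the usual generalized-delta step.
def momentum_update(w_now, w_prev, eta, delta, o_i, k):
    change = eta * delta * o_i + k * (w_now - w_prev)
    return w_now + change

# e.g. w(t)=1.0, w(t-1)=0.9, eta=0.5, delta=0.2, O_i=1.0, k=0.9:
# change = 0.1 + 0.9*0.1 = 0.19, so w(t+1) = 1.19
```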
Addition of noise
If random noise is added, this perturbs the gradient descent algorithm
from the line of steepest descent, and often this noise is enough to knock
the system out of a local minimum. This approach has the advantage that it
takes very little extra computation time, and so is not noticeably slower
than the direct gradient descent algorithm.
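Noise injection can be sketched as a small random perturbation on each weight update; the noise scale (0.01) and the other values are arbitrary choices:

```python
import random

random.seed(1)

# Gradient-descent step plus a small uniform random perturbation,
# which can knock the system out of a shallow local minimum.
def noisy_update(w, eta, delta, o_i, noise_scale=0.01):
    return w + eta * delta * o_i + random.uniform(-noise_scale, noise_scale)
```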
Consider learning a Boolean function of d inputs: the domain contains 2^d
instances, and there are 2^(2^d) possible functions over it. The network for identifying
the true function should have “d” input units and one output unit. In
general, the larger the network, the larger the set of functions it can form
and the more likely the true function is in this set. We can think of the use
of training data as a means of rejecting incorrect functions. Hopefully, at
the end of training, there is only one function left which is the true function
sought. If this is not the case, we wish the function learned to be the
function that best approximates the true function. In this sense, the more
training instances, the more likely the network can identify the true
function. However, this is true only when the network is large enough (but
not too large) to implement the true function.
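The counting behind this argument: for d binary inputs the domain has 2^d instances, and each function assigns one of two outputs to each instance, giving 2^(2^d) possible functions. A one-line sketch of these counts:

```python
# Size of the instance space and the number of Boolean functions over it.
def domain_size(d):
    return 2 ** d

def function_count(d):
    return 2 ** (2 ** d)

# d = 2: 4 instances, 16 possible functions (XOR is one of them).
```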
Hush and Horne (1993) describe approaches to the study of generalization
and methods to improve it. In one approach,
the average generalization of the network is defined to be the average of the
generalization values of all functions that are consistent with the set of
training instances. The generalization value of a function is the fraction of
the domain (of 2^d instances).
activation will not be modified. This idea is implemented by changing the
input range to -1/2 to 1/2 and using the following activation function:
f(a) = 1 / (1 + e^(−a)) − 1/2
where a is the argument of the function. An alternative function is the
hyperbolic tangent function (tanh, MATLAB's "tansig"), which lies in the
range from −1 to 1.
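A sketch of the shifted logistic function described above (tanh, its alternative, comes with Python's math module):

```python
import math

# Shifted logistic: output range -1/2 .. 1/2, and zero at a = 0, so a
# unit whose output is zero contributes no weight change.
def shifted_logsig(a):
    return 1.0 / (1.0 + math.exp(-a)) - 0.5

# math.tanh plays the role of the tansig alternative (range -1 .. 1);
# it is also zero at a = 0.
```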
The first two criteria are sensitive to the choice of parameters and may
lead to poor results if the parameters are improperly chosen. The cross-
validation criterion does not have this drawback. It can avoid overfitting
the data and can actually improve the generalization performance of the
network. However, cross-validation is much more computationally
intensive and often demands more data.
Network Size: The backpropagation network can approximate an
arbitrary mapping only when the network is sufficiently large. In general, it
is not known what size of network will produce the best result for a
given problem. If the network is too small, it cannot learn to fit the training
data well. On the other hand, if the network is too big, the learning problem
becomes underconstrained and the network is able to learn many solutions
that are consistent with the training data, but most of them are likely to be
poor approximations of the actual model.
Complexity of Learning: It has been shown that the problem of
finding a set of weights for a fixed-size network which performs the desired
mapping exactly for given training data is NP-complete¹. That is, we cannot
find optimal weights in polynomial time. So, for a very large problem, it is
unlikely that we can determine the optimal solution in a reasonable amount
of time.
Learning algorithms like backpropagation are gradient descent
techniques which seek only a local minimum. These algorithms usually do
not take exponential time to run. Empirically, the learning time on a serial
machine for backpropagation is about O(N³), where N is the number of
weights in the network (Hinton 1989). The slowness of finding a local
minimum is perhaps due to the characteristics of the error surface being
searched. In the case of the single-layer perceptron, the error surface is a
quadratic bowl with a single minimum.
¹ NP-complete problems refer to a class of problems which lack a polynomial-time solution: the time taken by the best
method we know for solving such problems grows exponentially with problem size. It has been proven that if a
polynomial-time solution is found for one of the NP-complete problems, the solution can be generalized to the rest of the
problems in this category. The question, however, is whether such faster methods exist.