Multi-Layer Feed-Forward Networks
Multi-Layer Feed-Forward Networks
A single-layer network has severe restrictions: the class of tasks that can be accomplished is very
limited. In this chapter we will focus on feed-forward networks with layers of processing units.
Minsky and Papert (Minsky & Papert, 1969) showed in 1969 that a two layer feed-forward
network can overcome many restrictions, but did not present a solution to the problem of how to
adjust the weights from input to hidden units. An answer to this question was presented by
Rumelhart, Hinton and Williams in 1986 (Rumelhart, Hinton, & Williams, 1986), and similar
solutions appeared to have been published earlier (Werbos, 1974; Parker, 1985; Cun, 1985). The
central idea behind this solution is that the errors for the units of the hidden layer are determined
by back-propagating the errors of the units of the output layer. For this reason the method is
often called the back-propagation learning rule. Back-propagation can also be considered as a
generalisation of the delta rule for non-linear activation functions1 and multilayer networks.
The output of the hidden units is distributed over the next layer of Nh;2 hidden units, until the
last layer of hidden units, of which the outputs are fed into a layer of No output units .
Although backpropagation can be applied to networks with any number of layers, just as for
networks with binary units it has been shown (Hornik, Stinchcombe, & White, 1989; Funahashi,
1989; Cybenko, 1989; Hartman, Keeler, & Kowalski, 1990) that only one layer of hidden units
su
ces to approximate any function with finitely many discontinuities to arbitrary precision,
provided the activation functions of the hidden units are non-linear (the universal approximation
theorem). In most applications a feed-forward network with a single layer of hidden units is used
with a sigmoid activation function for the units.
Delta Rule
Since we are now using units with nonlinear activation functions, we have to generalise the delta
rule:
The error measure Ep is defined as the total quadratic error for pattern p at the output units:
where dpo is the desired output for unit o when pattern p is clamped.
define we will get an update rule which is equivalent to the delta rule as
described in the previous chapter, resulting in a gradient descent on the error surface if we make
the weight changes according to: The trick is to figure out what δp k should
be for each unit k in the network. The interesting result, which we now derive, is that there is a
simple recursive computation of these δ's which can be implemented by propagating error signals
backward through the network.
To compute δp k we apply the chain rule to write this partial derivative as the product of two
factors, one factor reflecting the change in error as a function of the output of the unit and one re
ecting the change in the output as a function of changes in the input. Thus, we have
Let us compute the second factor. By equation we see that
which is the same result as we obtained with the standard delta rule. Substituting this and
for any output unit o. Secondly, if k is not an output unit but a hidden unit k = h, we do not
readily know the contribution of the unit to the output error of the network. However, the error
measure can be written as a function of the net inputs from hidden to output layer; Ep = Ep(sp
1,sp 2,....... sp j.....) and we use the chain rule to write
Understanding backpropagation
The equations derived in the previous section may be mathematically correct, but what do they
actually mean? Is there a way of understanding back-propagation other than reciting the
necessary equations? The answer is, of course, yes. In fact, the whole back-propagation process
is intuitively very clear. What happens in the above equations is the following. When a learning
pattern is clamped, the activation values are propagated to the output units, and the actual
network output is compared with the desired output values, we usually end up with an error in
each of the output units. Let's call this error eo for a particular output unit o. We have to bring eo
to zero The simplest method to do this is the greedy method: we strive to change the connections
in the neural network in such a way that, next time around, the error eo will be zero for this
particular pattern. We know from the delta rule that, in order to reduce an error, we have to adapt
its incoming weights according to.
That's step one. But it alone is not enough: when we only apply this rule, the weights from input
to hidden units are never changed, and we do not have the full representational power of the
feed-forward network as promised by the universal approximation theorem. In order to adapt the
weights from input to hidden units, we again want to apply the delta rule. In this case, however,
we do not have a value for δ for the hidden units. This is solved by the chain rule which does the
following: distribute the error of an output unit o to all the hidden units that is it connected to,
weighted by this connection. Differently put, a hidden unit h receives a delta from each output
unit o equal to the delta of that output unit weighted with (= multiplied by) the weight of the
connection between those units.
written as:
The error signal for a hidden unit is determined recursively in terms of error signals of the
units to which it directly connects and the weights of those connections. For the sigmoid
activation function:
The learning procedure requires that the change in weight is proportional to True
gradient descent requires that in nitesimal steps are taken. The
constant of proportionality is the learning rate . For practical purposes we choose a learning rate
that is as large as possible without leading to oscillation. One way to avoid oscillation at large , is
to make the change in weight dependent of the past weight change by adding a momentum term:
Although, theoretically, the back-propagation algorithm performs gradient descent on the total
error only if the weights are adjusted after the full set of learning patterns has been presented,
more often than not the learning rule is applied to each pattern separately, i.e., a pattern p is
applied, Ep is calculated, and the weights are adapted (p = 1, 2,..... P). There exists empirical
indication that this results in faster convergence. Care has to be taken, however, with the order in
which the patterns are taught. For example, when using the same sequence over and over again
the network may become focused on the rst few patterns. This problem can be overcome by
using a permuted training method.