0% found this document useful (0 votes)
42 views6 pages

Multi-Layer Feed-Forward Networks

Backpropagation is a method for training multi-layer neural networks by calculating the gradient of the loss function with respect to the network parameters. It involves propagating errors from the output layer backwards through the network to calculate how each unit contributed to the overall error. This allows adjusting weights to minimize loss by gradient descent. Specifically, it uses the chain rule to distribute output errors to hidden units weighted by their connections to output units, then recursively calculates errors for hidden units based on errors of units they connect to.

Uploaded by

Vishnu Das
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views6 pages

Multi-Layer Feed-Forward Networks

Backpropagation is a method for training multi-layer neural networks by calculating the gradient of the loss function with respect to the network parameters. It involves propagating errors from the output layer backwards through the network to calculate how each unit contributed to the overall error. This allows adjusting weights to minimize loss by gradient descent. Specifically, it uses the chain rule to distribute output errors to hidden units weighted by their connections to output units, then recursively calculates errors for hidden units based on errors of units they connect to.

Uploaded by

Vishnu Das
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 6

BACK PROPOGATION

A single-layer network has severe restrictions: the class of tasks that can be accomplished is very
limited. In this chapter we will focus on feed-forward networks with layers of processing units.
Minsky and Papert (Minsky & Papert, 1969) showed in 1969 that a two layer feed-forward
network can overcome many restrictions, but did not present a solution to the problem of how to
adjust the weights from input to hidden units. An answer to this question was presented by
Rumelhart, Hinton and Williams in 1986 (Rumelhart, Hinton, & Williams, 1986), and similar
solutions appeared to have been published earlier (Werbos, 1974; Parker, 1985; Cun, 1985). The
central idea behind this solution is that the errors for the units of the hidden layer are determined
by back-propagating the errors of the units of the output layer. For this reason the method is
often called the back-propagation learning rule. Back-propagation can also be considered as a
generalisation of the delta rule for non-linear activation functions1 and multilayer networks.

Multi-layer feed-forward networks


A feed-forward network has a layered structure. Each layer consists of units which receive their
input from units from a layer directly below and send their output to units in a layer directly
above the unit. There are no connections within a layer. The Ni inputs are fed into the first layer
of Nh;1 hidden units. The input units are merely 'fan-out' units; no processing takes place in
these units. The activation of a hidden unit is a function Fi of the weighted inputs plus a bias, as
given in in eq

The output of the hidden units is distributed over the next layer of Nh;2 hidden units, until the
last layer of hidden units, of which the outputs are fed into a layer of No output units .
Although backpropagation can be applied to networks with any number of layers, just as for
networks with binary units it has been shown (Hornik, Stinchcombe, & White, 1989; Funahashi,
1989; Cybenko, 1989; Hartman, Keeler, & Kowalski, 1990) that only one layer of hidden units
su
ces to approximate any function with finitely many discontinuities to arbitrary precision,
provided the activation functions of the hidden units are non-linear (the universal approximation
theorem). In most applications a feed-forward network with a single layer of hidden units is used
with a sigmoid activation function for the units.

Delta Rule
Since we are now using units with nonlinear activation functions, we have to generalise the delta
rule:

The activation is a differentiable function of the total input, given by

in which To get the correct generalisation of the delta rule

as presented in the previous chapter, we must set

The error measure Ep is defined as the total quadratic error for pattern p at the output units:

where dpo is the desired output for unit o when pattern p is clamped.

We further set as the summed squared error. We can write

By equation we see that the second factor is When we

define we will get an update rule which is equivalent to the delta rule as
described in the previous chapter, resulting in a gradient descent on the error surface if we make

the weight changes according to: The trick is to figure out what δp k should
be for each unit k in the network. The interesting result, which we now derive, is that there is a
simple recursive computation of these δ's which can be implemented by propagating error signals
backward through the network.

To compute δp k we apply the chain rule to write this partial derivative as the product of two
factors, one factor reflecting the change in error as a function of the output of the unit and one re
ecting the change in the output as a function of changes in the input. Thus, we have
Let us compute the second factor. By equation we see that
which is the same result as we obtained with the standard delta rule. Substituting this and

equation in equation , we get

for any output unit o. Secondly, if k is not an output unit but a hidden unit k = h, we do not
readily know the contribution of the unit to the output error of the network. However, the error
measure can be written as a function of the net inputs from hidden to output layer; Ep = Ep(sp
1,sp 2,....... sp j.....) and we use the chain rule to write

Substituting this in equation yields

Equations and give a recursive procedure


for computing the δ's for all units in
the network, which are then used to compute the weight changes according to equation.
This procedure constitutes the generalised delta rule for a feed-forward network of non-linear
units.

Understanding backpropagation
The equations derived in the previous section may be mathematically correct, but what do they
actually mean? Is there a way of understanding back-propagation other than reciting the
necessary equations? The answer is, of course, yes. In fact, the whole back-propagation process
is intuitively very clear. What happens in the above equations is the following. When a learning
pattern is clamped, the activation values are propagated to the output units, and the actual
network output is compared with the desired output values, we usually end up with an error in
each of the output units. Let's call this error eo for a particular output unit o. We have to bring eo
to zero The simplest method to do this is the greedy method: we strive to change the connections
in the neural network in such a way that, next time around, the error eo will be zero for this
particular pattern. We know from the delta rule that, in order to reduce an error, we have to adapt
its incoming weights according to.

That's step one. But it alone is not enough: when we only apply this rule, the weights from input
to hidden units are never changed, and we do not have the full representational power of the
feed-forward network as promised by the universal approximation theorem. In order to adapt the
weights from input to hidden units, we again want to apply the delta rule. In this case, however,
we do not have a value for δ for the hidden units. This is solved by the chain rule which does the
following: distribute the error of an output unit o to all the hidden units that is it connected to,
weighted by this connection. Differently put, a hidden unit h receives a delta from each output
unit o equal to the delta of that output unit weighted with (= multiplied by) the weight of the
connection between those units.

Working with back-propagation


The application of the generalised delta rule thus involves two phases: During the first phase the
input x is presented and propagated forward through the network to compute the output values
yp o for each output unit. This output is compared with its desired value do, resulting in an error
signal δp o for each output unit. The second phase involves a backward pass through the network
during which the error signal is passed to each unit in the network and appropriate weight
changes are calculated.

Weight adjustments with sigmoid activation function.

 The weight of a connection is adjusted by an amount proportional to the product of an


error signal δ, on the unit k receiving the input and the output of the unit j sending this

signal along the connection:

 If the unit is an output unit, the error signal is given by Take as

the activation function F the 'sigmoid' function as defined In


this case the derivative is equal to
such that the error signal for an output unit can be

written as:

 The error signal for a hidden unit is determined recursively in terms of error signals of the
units to which it directly connects and the weights of those connections. For the sigmoid
activation function:

Learning rate and momentum

The learning procedure requires that the change in weight is proportional to True
gradient descent requires that in nitesimal steps are taken. The
constant of proportionality is the learning rate . For practical purposes we choose a learning rate
that is as large as possible without leading to oscillation. One way to avoid oscillation at large , is
to make the change in weight dependent of the past weight change by adding a momentum term:

where t indexes the presentation number and F is a


constant which determines the efect of the previous weight change.

Although, theoretically, the back-propagation algorithm performs gradient descent on the total
error only if the weights are adjusted after the full set of learning patterns has been presented,
more often than not the learning rule is applied to each pattern separately, i.e., a pattern p is
applied, Ep is calculated, and the weights are adapted (p = 1, 2,..... P). There exists empirical
indication that this results in faster convergence. Care has to be taken, however, with the order in
which the patterns are taught. For example, when using the same sequence over and over again
the network may become focused on the rst few patterns. This problem can be overcome by
using a permuted training method.

You might also like