
Supervised Training Via Error Backpropagation: Derivations: 4.1 A Closer Look at The Supervised Training Problem

This document discusses the error backpropagation algorithm for training multilayer perceptron (MLP) neural networks. It begins by outlining the supervised training problem and some fundamental questions about designing and training an MLP network. It then provides brief answers to those questions, including that a single hidden layer is sufficient, the number of input and hidden nodes depends on the problem, the output layer size depends on the number of classes, and target vectors and training are discussed in later sections. It introduces the minimum sum-squared error methodology for training the network to map inputs to the correct target outputs.

Uploaded by

George Tsavd
Copyright
© Attribution Non-Commercial (BY-NC)

4 Supervised Training via Error Backpropagation: Derivations


This chapter derives the error backpropagation algorithm in detail for various cases of MLPs and also for FLNs and RBFNs.

4.1 A Closer Look at the Supervised Training Problem


Some Fundamental Questions

This chapter derives the gradient descent formulation for training an MLP (multiple layered perceptron) network in the epochal mode of BP (backpropagation). Chapters 5 and 6 cover fine points and other formulations for better training and performance. We continue to refer to the artificial neural nodes as neurodes to distinguish them from biological neurons, which are very different.

The supervised training of an MLP requires: i) a sample of Q exemplar input feature vectors $\{x^{(q)}\}_{q=1,Q}$ and an associated set of exemplar output target vectors $\{t^{(k)}: k = 1,...,K\}$ to form the set $\{(x^{(q)}, t^{(k(q))}): q = 1,...,Q\}$ of Q exemplar pairs; ii) the selection of an initial synaptic weight set $\{w_{nm}^{(0)}, u_{mj}^{(0)}: 1 \le n \le N,\ 1 \le m \le M,\ 1 \le j \le J\}$; and iii) the repetitive adjusting of the current weights by some method to force each of the input exemplar feature vectors to be mapped closer to its correct output target vector that identifies a class in the input space. More than one exemplar feature vector of the sample $\{x^{(1)},...,x^{(Q)}\}$ may map to a single identifier $t^{(k)} = t^{(k(q))}$ for each of the K classes ($K \le Q$).

The weights $\{w_{nm}\}$ at the neurodes in the hidden layer and the weights $\{u_{mj}\}$ at the neurodes in the output layer are "knobs" to be adjusted to define a correct mapping of input feature vectors to their output class identifiers. A correct mapping takes each sample input vector $x^{(q)}$ from the kth class into an output vector $z^{(q)}$ that is closer to the target $t^{(k(q))}$ than to the identifier $t^{(p)}$ of any pth class, $p \ne k$. This means that $\|z^{(q)} - t^{(k(q))}\|$ must be very small for each q.

If this were all that there is to it, it would be a simple process, provided that we had a strategy that would adjust the weights properly. Unfortunately, the MLP architecture must be designed properly for the particular dataset to assure that the network will learn robustly and will be reasonably efficient. The main questions in laying out the architecture and then training the MLP are listed below.

1. How many layers of neurodes should we use?
2. How many input nodes should we use?
3. How many neurodes in the hidden layers should we use?
4. How many neurodes should we use in the output layer?
5. What should the target (identifier) vectors be?
6. How do we proceed to train the MLP?
7. How can we test to determine whether or not the MLP is properly trained?
8. How do we select parameters (such as the step gain $\eta$), speed up and improve the training?
9. What should be the range of the weights and the network inputs and outputs?

Some Answers

Answer 1 is provided by the Hornik-Stinchcombe-White result (Hornik et al, 1989) given in Section 3.9, which states that a hidden layer and an output layer of neurodes are sufficient, provided there are enough neurodes in the hidden layer. To reduce a large number of neurodes in some situations, we may use two hidden layers (see Chapter 9 on neural engineering), but this is not necessary. There is also the question of what effects an extra layer of neurodes in the middle can have. Is the training faster? Is the learning better? How do the answers to these questions depend upon the linearity or nonlinearity of the data? We discuss these neural engineering issues in Chapter 9, and make actual tests in Chapters 11 and 12 to see what the results are. For now, we prefer to use a single hidden layer to avoid new difficulties, although we also derive the backpropagation iterative training equations for the case of two hidden layers.

Answer 2 can be given tentatively. The number N of input nodes must be the number N of features in the feature vectors, so that once a set of features is chosen, their number N is fixed. Chapter 10 discusses feature and data engineering. The pattern attributes of a population may be mapped to one of many possible sets of features of various sizes, but we assume here that this has already been done, and that N is given and fixed.

Answer 3 on the number of neurodes in the hidden layer(s) is difficult. A lot of research from different approaches has been done to find an answer. We provide answers in Chapter 9. For now, we use M = 2K for a small number K of classes (for, say, K = 2 to 8) up to $M = K/2^p$ for larger K (say, $M = 128/2^3 = 16$ or $M = 512/2^4 = 32$). This allows from M+1 to $2^M$ (lower and upper bounds) convex regions in the feature space. Groups of these can be joined into nonconvex classes by the output neurodes (see Chapter 9).

Answer 4 gives the number J of output neurodes that depends on the resolution required (the number K of classes) and the representation encoding scheme to be used. We may take $J = \log_2 K$ (from $K = 2^J$), which permits $2^J$ combinations of high and low (1 and 0) outputs of the J components. This is discussed in Section 4.7.

Answer 5, on how to select the K target vectors of J components each, is also given in Section 4.7; they may be chosen from the $2^J$ combinations of high and low. It is usual practice to employ 0.9 for 1 and 0.1 for 0 (see Section 4.7) because standard MLPs cannot put out 0s and 1s unless the weights become infinite.

Answer 6 on the training of an MLP is given in Chapters 4, 5, and 6. The methods are steepest descent, accelerated gradient methods such as conjugate gradients, and strategic search methodologies that include polynomial line search. There appears to be no single algorithm that is best overall for all datasets, so we need multiple methods.

Answer 7, which tells whether or not the training is satisfactory, is given in Chapter 9, which also outlines the way to perform validation and verification testing for acceptance of the MLP training and the model (network architecture). This involves using a training subset of the sample of exemplar pairs and two other disjoint test subsets that are to be used for validation and verification but not for training. When there are sufficiently many exemplars, we may select 25% of them at random to save for validation, choose another 15% at random to serve as the final verification, and use the remaining 60% for training. During training, the training sum-squared error $E_{60\%}$ decreases, as does the testing sum-squared error $E_{25\%}$, which is computed after each small segment of training. When $E_{25\%}$ stops decreasing and begins to increase, we stop the training because the network is specializing on the training data and is becoming less accurate on other data from the same population.

Answer 8, on the question of how to select parameters and speed up the training, is dealt with in Chapters 5 and 6 (also, see (Looney, 1996)). These are mainly optimization techniques. For now we may put $\eta = 0.25$.

Answer 9, on the range of the weights, is perhaps the easiest to answer for practical purposes. While the initial weights may be drawn from the interval [-0.5,0.5], the training may require that some weights move out into the interval [-b,b], for some $b \ge 1$. We want to avoid weight drift, however, where certain weights become large in magnitude and other weights compensate with opposite sign to cancel out their effects. Ideally, the final weights should be in [-1,1] because the inputs and outputs do not exceed 1 in magnitude and the activation functions squash the summed values $r_m$ and $s_j$ to within unit magnitude. In computational practice, though, the weights may be allowed to wander slightly. The range of the inputs and outputs is typically taken to be [0,1] for unipolar activation functions and [-1,1] for bipolar ones (see Section 4.7).

Backpropagation with a single hidden layer is derived for both unipolar and bipolar inputs and activation functions in Section 4.4. The algorithm is presented in Section 4.5. It is also derived for two hidden layers with both unipolar and bipolar sigmoid activation functions in Section 4.6.
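The 60%/25%/15% partition and the stopping rule of Answer 7 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `split_exemplars` and `first_rise`, the seed, and the use of integer percentages are assumptions made here, not part of the text.

```python
import random

def split_exemplars(Q, seed=0):
    """Randomly partition exemplar indices 0..Q-1 into training (about 60%),
    validation (25%), and verification (15%) subsets that are disjoint."""
    indices = list(range(Q))
    random.Random(seed).shuffle(indices)
    n_valid = (25 * Q) // 100
    n_verify = (15 * Q) // 100
    validation = indices[:n_valid]
    verification = indices[n_valid:n_valid + n_verify]
    training = indices[n_valid + n_verify:]
    return training, validation, verification

def first_rise(validation_errors):
    """Return the index of the first training segment at which the validation
    error is larger than at the previous segment, or None if it never rises."""
    for i in range(1, len(validation_errors)):
        if validation_errors[i] > validation_errors[i - 1]:
            return i
    return None

training, validation, verification = split_exemplars(200)
stop_at = first_rise([0.90, 0.70, 0.60, 0.65, 0.50])
```

Here `first_rise` stops at the first increase of the validation error; a practical implementation might wait for several consecutive increases before stopping.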

Figure 4.1 - A Standard Feedforward Neural Network for Training

4.2 The Minimum Sum-Squared Error Methodology


The Network as an Input/Output Mapping

An MLP that has a single hidden layer is shown in Figure 4.1. There are N input branching nodes, M middle neurodes, and J output neurodes. The weights on the input lines of the middle and output neurodes are designated by $\{w_{nm}\}$ and $\{u_{mj}\}$, respectively. The activation functions for the middle and output layers are, respectively, $h(\cdot)$ and $g(\cdot)$, as shown in Figure 4.2. Both of the sigmoids at the hidden and output layers should be of the same type, either unipolar or bipolar. Figure 4.1 uses bipolar sigmoids because it has no (N+1)st weight that is the bias input, shown in Figure 3.4. The sigmoids are defined in Equations (3-2) and (3-3).

Figure 4.2 - Standard and Extended Sigmoid Functions

Let there be a sample of Q exemplar vectors $\{x^{(1)},...,x^{(Q)}\}$ from K classes ($K \le Q$). There may be multiple exemplars for certain classes. For each exemplar $x^{(q)}$ there is an associated target output vector $t^{(k)} = t^{(k(q))}$ that identifies its class number k = k(q). The problem is to train the MLP by adjusting the weights $w = (w_{11},...,w_{NM})$ and $u = (u_{11},...,u_{MJ})$ as shown in Figure 4.1 until each exemplar $x^{(q)}$ is mapped into an output $z^{(q)}$ that is very close to $t^{(k)} = t^{(k(q))}$. In cases of multiple input exemplars for the same class, the same target vector is associated with each such input exemplar feature vector. If $x^{(q_1)} \ne x^{(q_2)}$ are two different exemplars for Class k, then $t^{(k(q_1))} = t^{(k(q_2))}$ because $k(q_1) = k(q_2)$.

A neural network is a black box, that is, an input/output system for which we need not know the inner workings. This contrasts with an expert rule-based system where every logical implication is known in a set of rules that provides explanation of the partial steps (Looney, 1993). After we train an MLP, it merely maps inputs to outputs with no explanations of its behavior. MLPs are easy to use, however, and can be quickly trained on datasets, whereas even a modest expert system requires several man-months or man-years of development.

The Total Sum-Squared Error Function E

To force each actual output $z^{(q)}$ toward the correct output target $t^{(k(q))}$, we adjust the weights so as to minimize the total sum-squared error (TSSE) E between the targets $\{t^{(k)}: k = 1,...,K\}$ and the actual outputs $\{z^{(q)}: q = 1,...,Q\}$, over all Q exemplars. The TSSE is defined via the Euclidean distance to be

$$E = \sum_{q=1}^{Q} \| t^{(q)} - z^{(q)} \|^2 \tag{4-1a}$$

where $t^{(q)}$ abbreviates the target $t^{(k(q))}$ paired with $x^{(q)}$. The total mean-squared error (MSE) is

$$MSE = [1/(QJ)]E \tag{4-1b}$$

The partial sum-squared error (PSSE) with respect to a single exemplar input/output pair $(x^{(q)}, t^{(k(q))})$ is designated by $E^{(q)}$ and defined via

$$E^{(q)} = \sum_{j=1}^{J} (t_j^{(q)} - z_j^{(q)})^2 \tag{4-1c}$$

Likewise, the partial mean-squared error (PMSE) is

$$PMSE^{(q)} = (1/J)E^{(q)} \tag{4-1d}$$
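The four error measures of Equations (4-1a) to (4-1d) translate directly into code. The following is a minimal sketch in which targets and outputs are plain Python lists of length J; the function names and the two-exemplar data are hypothetical and chosen only for illustration.

```python
def psse(t, z):
    """Partial sum-squared error E(q) for one exemplar pair, Equation (4-1c)."""
    return sum((tj - zj) ** 2 for tj, zj in zip(t, z))

def pmse(t, z):
    """Partial mean-squared error (1/J) E(q), Equation (4-1d)."""
    return psse(t, z) / len(t)

def tsse(targets, outputs):
    """Total sum-squared error E over all Q exemplars, Equation (4-1a)."""
    return sum(psse(t, z) for t, z in zip(targets, outputs))

def total_mse(targets, outputs):
    """Total mean-squared error E / (Q J), Equation (4-1b)."""
    Q, J = len(targets), len(targets[0])
    return tsse(targets, outputs) / (Q * J)

# Two exemplars (Q = 2) with two output components each (J = 2).
targets = [[0.9, 0.1], [0.1, 0.9]]
outputs = [[0.8, 0.2], [0.3, 0.6]]
E = tsse(targets, outputs)        # 0.02 + 0.13, up to rounding
mse = total_mse(targets, outputs)
```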

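We consider E = E(w,u) to be a function of the weights $w = (w_{11},...,w_{NM})$ and $u = (u_{11},...,u_{MJ})$. The general minimum MSE methodology was invented and used independently by (Gauss, 1809) and (Legendre, 1810) in the late 1790s. E = E(w,u) is defined (refer to Figure 4.1) in detail by

$$\begin{aligned}
E(w,u) &= \sum_{q=1}^{Q} \| t^{(q)} - z^{(q)} \|^2 = \sum_{q=1}^{Q} \Big( \sum_{j=1}^{J} [t_j^{(q)} - z_j^{(q)}]^2 \Big) \\
&= \sum_{q=1}^{Q} \Big( \sum_{j=1}^{J} [t_j^{(q)} - g(s_j^{(q)})]^2 \Big) \\
&= \sum_{q=1}^{Q} \Big( \sum_{j=1}^{J} \Big[t_j^{(q)} - g\Big(\sum_{m=1}^{M} u_{mj}\, y_m^{(q)}\Big)\Big]^2 \Big) \\
&= \sum_{q=1}^{Q} \Big( \sum_{j=1}^{J} \Big[t_j^{(q)} - g\Big(\sum_{m=1}^{M} u_{mj}\, h(r_m^{(q)})\Big)\Big]^2 \Big) \\
&= \sum_{q=1}^{Q} \Big( \sum_{j=1}^{J} \Big[t_j^{(q)} - g\Big(\sum_{m=1}^{M} u_{mj}\, h\Big(\sum_{n=1}^{N} w_{nm}\, x_n^{(q)}\Big)\Big)\Big]^2 \Big)
\end{aligned} \tag{4-2}$$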

where $h(\cdot)$ and $g(\cdot)$ are sigmoidal activation functions for the hidden and output layers, respectively. The function E(w,u) is a nonnegative continuously differentiable function on the weight space $[-b,b]^{NM+MJ}$ (b > 0), which is a finite dimensional closed bounded domain that is complete and thus compact. Therefore, E(w,u) attains its minimum at some point (w*,u*) in the weight domain. This doesn't mean that the sum-squared error E will be zero at the solution weight set (w*,u*), but only that E will assume its minimum value there on the given weight domain. If the target vectors $\{t^{(k)}\}$ are chosen judiciously to be far apart and if the exemplars for different classes are not too close, then the minimum mapping will successfully recognize the input feature vectors by mapping them to their class identifiers. Section 9.7 gives conditions under which there exists a unique exact solution.
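The nested form of E(w,u) is easier to read as a forward pass through the network. The sketch below assumes unipolar sigmoids with unit rate factor and zero bias, and stores the weights as nested lists `w[n][m]` and `u[m][j]`; these storage choices and the function names are assumptions made for illustration.

```python
import math

def unipolar(a):
    """Unipolar sigmoid with unit rate factor and zero bias (an assumption)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w, u):
    """Map an input vector x to hidden outputs y and network outputs z.
    w[n][m] feeds input n into hidden neurode m; u[m][j] feeds hidden
    neurode m into output neurode j."""
    N, M, J = len(x), len(u), len(u[0])
    y = [unipolar(sum(w[n][m] * x[n] for n in range(N))) for m in range(M)]
    z = [unipolar(sum(u[m][j] * y[m] for m in range(M))) for j in range(J)]
    return y, z

def total_error(X, T, w, u):
    """E(w,u): squared output errors summed over all exemplars and components."""
    return sum(sum((tj - zj) ** 2 for tj, zj in zip(t, forward(x, w, u)[1]))
               for x, t in zip(X, T))

# With all weights zero, every sum is 0 and every neurode puts out 0.5.
w_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # N = 2, M = 3
u_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]] # M = 3, J = 2
y0, z0 = forward([0.3, 0.8], w_zero, u_zero)
E0 = total_error([[0.3, 0.8]], [[0.9, 0.1]], w_zero, u_zero)  # 2 * 0.4**2, up to rounding
```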

To solve for the minimizing weight set (w*,u*), we use the necessary conditions

$$\partial E(w^*,u^*)/\partial w_{nm} = 0, \qquad \partial E(w^*,u^*)/\partial u_{mj} = 0 \tag{4-3}$$

We cannot solve these nonlinear equations in closed form, but can approximate the solution (w*,u*) iteratively with steepest descent. The next section discusses this method and derives an algorithm by means of chain rules (see Appendix 4 for an intuitive review of chain rules and vector calculus).

4.3 Linear Gradient Descent


To find a local minimum $w_{locmin}$ for a nonlinear real valued function y = f(w), we put

$$df(w)/dw = 0 \tag{4-4}$$

and solve for $w = w_{locmin}$. However, in the general case of nonlinear $f(\cdot)$ we can only find an approximate solution $w_{approx}$ to $w_{locmin}$ by iterative methods. Starting from some initial point $w^{(0)}$, we move a step in the direction of steepest descent to $w^{(1)} = w^{(0)} - \eta\, df(w^{(0)})/dw$, which is opposite to the direction of steepest ascent. Note that the direction is either positive or negative along the w-axis. Figure 4.3 shows the first step. For the iterative (r+1)st step, we have

$$w^{(r+1)} = w^{(r)} - \eta\,(df(w^{(r)})/dw) \tag{4-5}$$

Figure 4.3 - Approximating a Local Minimum

The step gain $\eta > 0$ amplifies or attenuates the step size. If the step were too large, then it would move past the local minimum $w_{locmin}$, while if it were too small, a large number of steps might not yet reach the local minimum. The difficult problem of setting a proper step gain is addressed in Chapter 6. The step gain is called the learning rate in the literature on neural network training.

In Figure 4.4, where $y = f(w_1,w_2)$, the gradient is the vector of partial derivatives

$$\nabla f(w_1,w_2) = (\partial f(w_1,w_2)/\partial w_1,\ \partial f(w_1,w_2)/\partial w_2) \tag{4-6}$$

A function $y = f(w_1,...,w_P)$ of several variables can be locally minimized in an analogous manner. The iterative updates to approximate a solution in the general case are

$$(w_1,...,w_P) \leftarrow (w_1,...,w_P) - \eta\left(\frac{\partial f(w_1,...,w_P)}{\partial w_1},...,\frac{\partial f(w_1,...,w_P)}{\partial w_P}\right) \tag{4-7}$$

Figure 4.4 - Two-Dimensional Steepest Descent

In vector form, where $w = (w_1,...,w_P)$ and $\nabla f$ is the gradient vector of partial derivatives, the update is

$$w^{(r+1)} = w^{(r)} - \eta \nabla f(w^{(r)}) \tag{4-8}$$

The normalization of $\nabla f(w^{(r)})$ to unit length would change $\eta$. Usually, $\eta$ is permitted to adapt to absorb any normalizing factors (see Chapter 5). Equation (4-8) is linear in w and provides a piecewise linear approximation for an adjustment to move $w^{(r)}$ toward a local minimum. Appendix 4 derives Newton's second order method and the method of conjugate directions for finding local minima.
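As a small worked illustration of the update in Equation (4-8), the sketch below applies steepest descent to a quadratic chosen here for convenience, $f(w_1,w_2) = (w_1 - 1)^2 + 2(w_2 + 2)^2$, whose minimum is at (1, -2). The function, the step gain 0.1, and the step count are assumptions for the example only.

```python
def grad_f(w):
    """Gradient of the example function f(w1, w2) = (w1 - 1)^2 + 2 (w2 + 2)^2."""
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]

def steepest_descent(grad, w0, eta, steps):
    """Repeat the update w <- w - eta * grad(w) a fixed number of times."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

w_min = steepest_descent(grad_f, [0.0, 0.0], eta=0.1, steps=100)
```

For this particular function each step multiplies the distance to the minimum by (1 - 2η) along $w_1$ and by (1 - 4η) along $w_2$, so η = 0.1 converges, whereas η = 0.6 gives |1 - 4η| = 1.4 and the $w_2$ coordinate overshoots farther on every step. This is the too-large step gain described above.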

4.4 Derivation of Error Backpropagation


Partial Sum-Squared Errors

The architecture of the standard feedforward MLP, shown in Figure 4.1, has N components in the input feature vectors, M neurodes in the middle layer, and J neurodes in the output layer. We assume that there is a sample of Q exemplar input feature vectors paired with K output training vectors. Multiple input exemplar vectors for a class must each map to the same output target vector that identifies that class. For example, if $x^{(q_1)}$ and $x^{(q_2)}$ are from the same kth class, then both map to $t^{(k(q_1))} = t^{(k(q_2))} = t^{(k)}$.

The total sum-squared error E is the sum of all of the squared errors over all J output components and over all Q exemplar pairs, where the individual squared errors are $e_j^{(q)} = (t_j^{(k(q))} - z_j^{(q)})^2$. The total SSE can be decomposed into a sum of partial sum-squared errors $E = E^{(1)} + ... + E^{(Q)}$, of which each summand is the SSE over a single qth exemplar pair:

$$E^{(q)} = \sum_{j=1}^{J} (t_j^{(q)} - z_j^{(q)})^2 \tag{4-9}$$

$$E^{(q)} = \sum_{j=1}^{J} \Big[t_j^{(k(q))} - g\Big(\sum_{m=1}^{M} u_{mj}\, h\Big(\sum_{n=1}^{N} w_{nm}\, x_n^{(q)}\Big)\Big)\Big]^2 \tag{4-10a}$$

In the derivation for the steepest descent on $E^{(q)}$ that follows, we suppress the superscript "(q)" for convenience. We derive the error backpropagation algorithm for any single qth fixed exemplar pair, based on the PSSE (partial sum-squared error) function

$$E^{(q)} = E^{(q)}(w,u) = \sum_{j=1}^{J} (t_j - z_j)^2 \quad \Big( = \sum_{j=1}^{J} (t_j^{(q)} - z_j^{(q)})^2,\ \text{fixed } q \Big) \tag{4-10b}$$

The sigmoids are unipolar, that is,

$$h(r_m) = 1/[1 + \exp(-\alpha_1 r_m + b_1)], \qquad g(s_j) = 1/[1 + \exp(-\alpha_2 s_j + b_2)] \tag{4-10c}$$

are the activation functions at the hidden and output layers, respectively. In the derivation that follows, $b_i$ is a bias (i = 1, 2), $\alpha_i$ is the rate factor in the exponential, and $r_m$ and $s_j$ are the sums of the products of weights times the incoming line values (see Figure 4.1).

Iterative approximation may take a generalized Newtonian, or quasi-Newton, form (see Linz, 1979, p. 146). The simplest of these is the steepest descent linearization

$$w^{(r+1)} = w^{(r)} - \eta[\nabla E(w^{(r)})] = w^{(r)} + \Delta w \tag{4-11a}$$

$$u^{(r+1)} = u^{(r)} - \eta[\nabla E(u^{(r)})] = u^{(r)} + \Delta u \tag{4-11b}$$

for updating the weights w at the hidden neurodes, and u at the output neurodes, respectively, on the (r+1)st iteration. Upon taking each weight individually, we obtain the formulas

$$w_{nm}^{(r+1)} = w_{nm}^{(r)} - \eta\,(\partial E(w^{(r)},u^{(r)})/\partial w_{nm}) \tag{4-12a}$$

$$u_{mj}^{(r+1)} = u_{mj}^{(r)} - \eta\,(\partial E(w^{(r)},u^{(r)})/\partial u_{mj}) \tag{4-12b}$$

We now derive the backpropagation training equations to minimize the PSSE function $E^{(q)}$ for any fixed qth exemplar pair $(x^{(q)}, t^{(q)})$. We suppress the q notation for convenience. Appendix 4 explains intuitively the chain rules that we use.


The Derivation of Backpropagation with Unipolar Sigmoids

We first derive the computational formula for the weights $\{u_{mj}\}$ at the output neurodes by: i) applying the chain rule repeatedly to the partial derivative in Equation (4-12b); ii) using

$$s_j = \sum_{m=1}^{M} u_{mj}\, y_m \tag{4-13}$$

as the sum at the jth output neurode; and iii) using $g(s_j) = [1 + \exp(-s_j + b)]^{-1}$ as the unipolar activation function. We need not include $\alpha_i$ in the derivation (see Equation (4-10c)) because any constant multiplier can be absorbed into the step gain $\eta$. We also use b for the $b_i$ for notational convenience. The derivation is done on the PSSE $E^{(q)}$, which is denoted by E (q is suppressed) for notational convenience. The derivation for the $u_{mj}$ increment is

$$\begin{aligned}
\partial E/\partial u_{mj} &= (\partial E/\partial s_j)(\partial s_j/\partial u_{mj}) && \{\text{note: } s_j = s_j(u_{mj}), \text{ functionally}\} \\
&= [(\partial E/\partial z_j)(\partial z_j/\partial s_j)](\partial s_j/\partial u_{mj}) && \{\text{note: } z_j = g(s_j), \text{ functionally}\} \\
&= \Big[(\partial/\partial z_j)\Big(\sum_{p=1}^{J} (t_p - z_p)^2\Big)(\partial/\partial s_j)g(s_j)\Big][\partial s_j/\partial u_{mj}] && \{\text{note: } p \text{ is a dummy variable for } j\} \\
&= [2(-1)(t_j - z_j)\,g'(s_j)]\Big[(\partial/\partial u_{mj})\sum_{r=1}^{M} y_r u_{rj}\Big] && \{\text{note: } (\partial/\partial s_j)g(s_j) = g'(s_j),\ r \text{ is a dummy for } m\} \\
&= [(-2)(t_j - z_j)\,g'(s_j)][y_m] && \{\text{note: } (\partial/\partial u_{mj})\Big[\sum_{q=1}^{M} y_q u_{qj}\Big] = y_m\} \\
&= -2(t_j - z_j)\,g'(s_j)\,y_m
\end{aligned} \tag{4-14}$$

But the sigmoid activation function $g(\cdot)$ has derivative

$$\begin{aligned}
g'(s_j) &= (d/ds_j)g(s_j) = (d/ds_j)[1 + \exp(-s_j+b)]^{-1} \\
&= (-1)[1 + \exp(-s_j+b)]^{-2}\exp(-s_j+b)(-1) \\
&= [1 + \exp(-s_j+b)]^{-2}[\exp(-s_j+b)] \\
&= [z_j]^2[1 + \exp(-s_j+b) - 1] && \{\text{note: } z_j = [1 + \exp(-s_j+b)]^{-1};\ \text{add } (1 - 1) \text{ to rightmost factor}\} \\
&= [z_j]^2[1/z_j - 1] = [z_j]^2[(1-z_j)/z_j] = z_j(1-z_j)
\end{aligned}$$

so that

$$g'(s_j) = z_j(1-z_j) \tag{4-15}$$

Now we substitute Equation (4-15) back into Equation (4-14) to obtain

$$\partial E/\partial u_{mj} = -2(t_j - z_j)\,z_j(1-z_j)\,y_m \tag{4-16}$$

Substituting Equation (4-16) into Equation (4-12b) provides the update on the (r+1)st iteration as

$$u_{mj}^{(r+1)} = u_{mj}^{(r)} + \eta\,(t_j - z_j)\,z_j(1-z_j)\,y_m \tag{4-17}$$

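where the 2 has been absorbed into the step gain $\eta$.

Second, we derive the weight increments $\{\Delta w_{nm}\}$ at the middle neurodes by: i) applying the chain rule repeatedly to the partial derivative in Equation (4-12a); ii) using

$$r_m = \sum_{n=1}^{N} w_{nm}\, x_n \tag{4-18}$$

for the sums; and iii) using the hidden layer activation function $h(r_m) = 1/[1 + \exp(-r_m+b)]$ and derivative $h'(r_m) = y_m(1-y_m)$. The previous remark about omitting $\alpha_i$ in the derivation holds here also. Then

$$\begin{aligned}
\partial E/\partial w_{nm} &= (\partial E/\partial r_m)(\partial r_m/\partial w_{nm}) && \{\text{note: } r_m = \textstyle\sum_{n=1}^{N} w_{nm}x_n\} \\
&= [(\partial E/\partial y_m)(\partial y_m/\partial r_m)](\partial r_m/\partial w_{nm}) && \{\text{note: } E = E(y_m)\} \\
&= \Big[(\partial/\partial y_m)\Big(\sum_{j=1}^{J} (t_j - z_j)^2\Big)(\partial/\partial r_m)y_m\Big][\partial r_m/\partial w_{nm}] && \{\text{note: } y_m = h(r_m)\} \\
&= (\partial/\partial y_m)\Big(\sum_{j=1}^{J} (t_j - z_j)^2\Big)[(\partial/\partial r_m)h(r_m)]\Big[(\partial/\partial w_{nm})\Big(\sum_{n=1}^{N} x_n w_{nm}\Big)\Big] \\
&= (\partial/\partial y_m)\Big(\sum_{j=1}^{J} (t_j - z_j)^2\Big)[h'(r_m)][x_n] && \{\text{note: } (\partial/\partial w_{nm})(\textstyle\sum x_n w_{nm}) = x_n\} \\
&= (\partial/\partial y_m)[E(s(y_m))][h'(r_m)][x_n] && \{\text{note: } E = E(s(y_m)) = E(s_1(y_m),...,s_J(y_m))\} \\
&= \Big\{\sum_{j=1}^{J} (\partial E/\partial s_j)(\partial s_j/\partial y_m)\Big\}[h'(r_m)][x_n] && \{\text{note: pass } \partial/\partial y_m \text{ to inside of summation; use chain rule, full sum above}\} \\
&= \Big\{\sum_{j=1}^{J} (\partial/\partial s_j)\Big[\sum_{p=1}^{J} (t_p - z_p)^2\Big](\partial s_j/\partial y_m)\Big\}[h'(r_m)][x_n] \\
&= \Big\{\sum_{j=1}^{J} (2)(t_j - z_j)(-1)\,g'(s_j)[\partial s_j/\partial y_m]\Big\}[h'(r_m)][x_n] && \{\text{note: } p \text{ is a dummy},\ \partial z_j/\partial s_j = g'(s_j)\} \\
&= \Big\{\sum_{j=1}^{J} (-2)(t_j - z_j)[g'(s_j)]\Big[(\partial/\partial y_m)\sum_{i=1}^{M} y_i u_{ij}\Big]\Big\}[h'(r_m)][x_n] && \{\text{note: } s_j = \textstyle\sum_{i=1}^{M} y_i u_{ij}\} \\
&= \Big\{\sum_{j=1}^{J} (-2)(t_j - z_j)[z_j(1-z_j)][u_{mj}]\Big\}[h'(r_m)][x_n] && \{\text{note: } i \text{ is a dummy for } m;\ g'(s_j) = z_j(1-z_j) \text{ from above}\} \\
&= \Big\{\sum_{j=1}^{J} (-2)(t_j - z_j)[z_j(1-z_j)]\,u_{mj}\Big\}[y_m(1-y_m)][x_n] && \{\text{note: } h'(r_m) = y_m(1-y_m) \text{ analogously}\}
\end{aligned}$$

Therefore

$$\partial E/\partial w_{nm} = \Big\{\sum_{j=1}^{J} (-2)(t_j - z_j)[z_j(1-z_j)]\,u_{mj}\Big\}[y_m(1-y_m)]\,x_n \tag{4-19}$$

Upon substituting Equation (4-19) into (4-12a), and absorbing the 2 into $\eta$, we obtain the computational formula on the (r+1)st iteration, which is

$$w_{nm}^{(r+1)} = w_{nm}^{(r)} + \eta\,\Big\{\sum_{j=1}^{J} (t_j - z_j)[z_j(1-z_j)]\,u_{mj}^{(r)}\Big\}[y_m(1-y_m)]\,x_n \tag{4-20}$$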
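A useful sanity check on Equations (4-16) and (4-19) is to compare them against central finite differences of the PSSE. The sketch below does this for one exemplar pair, assuming unit rate factor and zero bias in both sigmoids; the helper names, the weight layout `w[n][m]`, `u[m][j]`, and the sample numbers are all assumptions for illustration.

```python
import copy
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w, u):
    """Hidden outputs y and network outputs z for w[n][m] and u[m][j]."""
    y = [sigmoid(sum(w[n][m] * x[n] for n in range(len(x)))) for m in range(len(u))]
    z = [sigmoid(sum(u[m][j] * y[m] for m in range(len(u)))) for j in range(len(u[0]))]
    return y, z

def psse(x, t, w, u):
    _, z = forward(x, w, u)
    return sum((tj - zj) ** 2 for tj, zj in zip(t, z))

def analytic_gradients(x, t, w, u):
    """Partial derivatives of the PSSE from Equations (4-19) and (4-16)."""
    y, z = forward(x, w, u)
    N, M, J = len(x), len(u), len(t)
    d = [-2.0 * (t[j] - z[j]) * z[j] * (1.0 - z[j]) for j in range(J)]
    dE_du = [[d[j] * y[m] for j in range(J)] for m in range(M)]
    dE_dw = [[sum(d[j] * u[m][j] for j in range(J)) * y[m] * (1.0 - y[m]) * x[n]
              for m in range(M)] for n in range(N)]
    return dE_dw, dE_du

def numeric_gradient(x, t, w, u, name, i, k, eps=1e-6):
    """Central difference of the PSSE with respect to w[i][k] or u[i][k]."""
    w2, u2 = copy.deepcopy(w), copy.deepcopy(u)
    target = w2 if name == "w" else u2
    target[i][k] += eps
    e_plus = psse(x, t, w2, u2)
    target[i][k] -= 2.0 * eps
    e_minus = psse(x, t, w2, u2)
    return (e_plus - e_minus) / (2.0 * eps)

def max_gradient_error(x, t, w, u):
    dE_dw, dE_du = analytic_gradients(x, t, w, u)
    errs = [abs(dE_dw[n][m] - numeric_gradient(x, t, w, u, "w", n, m))
            for n in range(len(w)) for m in range(len(w[0]))]
    errs += [abs(dE_du[m][j] - numeric_gradient(x, t, w, u, "u", m, j))
             for m in range(len(u)) for j in range(len(u[0]))]
    return max(errs)

x, t = [0.7, 0.2], [0.9, 0.1]
w = [[0.1, -0.2, 0.3], [0.4, 0.05, -0.3]]     # N = 2, M = 3
u = [[0.2, -0.1], [-0.3, 0.25], [0.15, 0.4]]  # M = 3, J = 2
worst = max_gradient_error(x, t, w, u)
```

If the derivation is right, `worst` should be at the level of finite-difference error, many orders of magnitude below the gradient entries themselves.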

Note that Equation (4-20) sums the differences $(t_j - z_j)$ over all j = 1,...,J output neurodes. This is intuitive because every output difference affects each weight $w_{nm}$ at the hidden layer, whereas only a single difference affects each weight in the output layer.

The original backpropagation (BP) algorithm uses Equations (4-17) and (4-20) to update each weight for a fixed qth exemplar input/output pair (Rumelhart et al, 1986). This constitutes one iteration (training all weights on a single exemplar pair $(x^{(q)}, t^{(k(q))})$ to minimize $E^{(q)}$). BP repeats this for each qth exemplar pair until all Q PSSEs $E^{(q)}$ have been used in training. This entire process over each PSSE one time constitutes an epoch. A single epoch takes a minimizing step over each of the PSSE functions $E^{(1)},...,E^{(Q)}$, and on each such partial $E^{(q)}$ all of the weights are adjusted. Thus a single epoch adjusts each weight Q times. A large number I of epochs may be required for training (each weight is adjusted QI times). To recapitulate, the unipolar learning equations for each qth exemplar (q is not suppressed) are

$$u_{mj} \leftarrow u_{mj} + \eta_1\,(t_j^{(q)} - z_j^{(q)})\,z_j^{(q)}(1-z_j^{(q)})\,y_m^{(q)} \tag{4-21a}$$

$$w_{nm} \leftarrow w_{nm} + \eta_2\,\Big\{\sum_{j=1}^{J} (t_j^{(q)} - z_j^{(q)})[z_j^{(q)}(1-z_j^{(q)})]\,u_{mj}\Big\}[y_m^{(q)}(1-y_m^{(q)})]\,x_n^{(q)} \tag{4-21b}$$

Backpropagation with Bipolar Sigmoids

Many researchers now use the bipolar sigmoid in place of the unipolar one to eliminate the bias b as a source of error and computational need (see Chapter 6). The derivation is the same except for the derivatives of $h(r_m)$ and $g(s_j)$, which now become the derivatives of the bipolar sigmoids, denoted here by $H(r_m)$ and $G(s_j)$. Let

$$z = G(s) = 2\{1/[1 + \exp(-\alpha s)]\} - 1 \tag{4-22a}$$

whose rational form is

$$z = G(s) = [1 - \exp(-\alpha s)]/[1 + \exp(-\alpha s)] \tag{4-22b}$$

From Equation (4-22a), we obtain

$$dz/ds = G'(s) = 2\alpha[1 + \exp(-\alpha s)]^{-2}(\exp(-\alpha s)) \tag{4-23}$$

Upon adding 1 to each side of Equation (4-22a), we obtain

$$(1 + z) = 2/[1 + \exp(-\alpha s)] \tag{4-24}$$

We solve for $\exp(-\alpha s)$ from Equation (4-24) by

$$\exp(-\alpha s) = 2/(1 + z) - 1 = (1 - z)/(1 + z) \tag{4-25a}$$

Similarly, we use Equation (4-24) to solve for $1 + \exp(-\alpha s)$ by

$$1 + \exp(-\alpha s) = 2/(1 + z) \tag{4-25b}$$

The substitution for $\exp(-\alpha s)$ and $1 + \exp(-\alpha s)$ from Equations (4-25a,b) into Equation (4-23) yields

$$dz/ds = G'(s) = 2\alpha[1 + \exp(-\alpha s)]^{-2}[\exp(-\alpha s)] = 2\alpha[(1 + z)^2/2^2][(1 - z)/(1 + z)] = \alpha(1 + z)(1 - z)/2 \tag{4-26}$$

The omission of $\alpha$ (it will be absorbed by the step gain $\eta$) yields the result

$$G'(s) = (1 + z)(1 - z)/2 \tag{4-27a}$$

Analogously, the bipolar sigmoid at the hidden neurodes has derivative

$$H'(r) = (1 + y)(1 - y)/2 \tag{4-27b}$$

The bipolar update formulas come from substituting Equations (4-27a,b) into the computational formulas of Equations (4-21a,b), respectively, to obtain (using the PSSE $E^{(q)}$)

$$u_{mj} \leftarrow u_{mj} + \eta_1\,(t_j^{(q)} - z_j^{(q)})[(1+z_j^{(q)})(1-z_j^{(q)})/2]\,y_m^{(q)} \tag{4-28a}$$

$$w_{nm} \leftarrow w_{nm} + \eta_2\,\Big\{\sum_{j=1}^{J} (t_j^{(q)} - z_j^{(q)})[(1+z_j^{(q)})(1-z_j^{(q)})/2]\,u_{mj}\Big\}[(1+y_m^{(q)})(1-y_m^{(q)})/2]\,x_n^{(q)} \tag{4-28b}$$
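The identity behind Equation (4-27a), that the slope of the bipolar sigmoid can be computed from its output alone as (1 + z)(1 - z)/2, is easy to confirm numerically. The sketch below takes the rate factor to be 1 (an assumption) and compares the closed form with a central finite difference; the function names are illustrative.

```python
import math

def bipolar(s):
    """Bipolar sigmoid with unit rate factor: 2 / (1 + exp(-s)) - 1."""
    return 2.0 / (1.0 + math.exp(-s)) - 1.0

def bipolar_rational(s):
    """Equivalent rational form (1 - exp(-s)) / (1 + exp(-s))."""
    e = math.exp(-s)
    return (1.0 - e) / (1.0 + e)

def slope_from_output(z):
    """Derivative of the bipolar sigmoid written in terms of its output z."""
    return (1.0 + z) * (1.0 - z) / 2.0

def numeric_slope(f, s, eps=1e-6):
    """Central finite-difference estimate of df/ds."""
    return (f(s + eps) - f(s - eps)) / (2.0 * eps)

samples = [-3.0, -0.5, 0.0, 1.0, 2.5]
slope_errors = [abs(slope_from_output(bipolar(s)) - numeric_slope(bipolar, s))
                for s in samples]
form_errors = [abs(bipolar(s) - bipolar_rational(s)) for s in samples]
```

Because the slope depends only on z, a backpropagation step can reuse the outputs already computed in the forward pass and never needs to evaluate the exponential a second time.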

Paul Werbos used backpropagation for regression analysis (Werbos, 1974). A similar stochastic approximation had previously been used by (Robbins and Monro, 1951). A recent book (Werbos, 1994) discusses the history and development of backpropagation. The books (Fu, 1994), (Kosko, 1992), (Hecht-Nielsen, 1990), and (Wasserman, 1989) provide historical notes, while (Fausett, 1994) and (Zurada, 1992) are also good references. Chapter 6 discusses a more efficient method, called conjugate gradient directions, for accelerated gradient minimization. Quasi-Newton methods (Parker, 1982) are second order methods that converge more rapidly, but require more computation per iterative step.

4.5 The Basic Backpropagation Algorithm


A High Level Description of Backpropagation

The flow chart of Figure 4.5 presents a higher level description of the most basic form of BP (backpropagation). In the first step, the MLP architecture is read from a file, the parameters are set ($\alpha$'s, b's for unipolar sigmoids, and $\eta$), and the number I of epochs is accepted from the keyboard. The MLP file consists of the number N of inputs, the number M of hidden layer neurodes, the number J of output neurodes, the number Q of exemplars, and the Q exemplar pairs $\{(x^{(q)}, t^{(k(q))}): q = 1,...,Q\}$ of exemplar input feature vectors and output target vectors. The second step randomly draws NM initial weights $\{w_{nm}^{(0)}\}$ for the hidden neurodes and MJ initial weights $\{u_{mj}^{(0)}\}$ for the output neurodes (between -0.5 and 0.5 for unipolar or bipolar sigmoids). The third step sets the current epoch number to be r = 1, while the fourth step sets the current exemplar number to be q = 1. The fifth step performs all summing and activations at the hidden and output layers. These values are then used in the sixth step to compute the weight increments $\Delta w_{nm}$ for each current weight $w_{nm}^{(r)}$ and $\Delta u_{mj}$ for $u_{mj}^{(r)}$ and then add them to the current weights to obtain the new weights $w_{nm}^{(r+1)}$ and $u_{mj}^{(r+1)}$.

Figure 4.5 - A Backpropagation Flow Chart

At this point, the test ($q \ge Q$?) is made. If it is false, then q is incremented by $q \leftarrow q+1$ and the process returns to the fifth step. If it is true, then an epoch has been completed, so the test ($r \ge I$?) is made. If false, then r is incremented via $r \leftarrow r+1$ and the process returns to the fourth step for another epoch; otherwise the process terminates (I epochs have been completed). The updated weights are computed from the computational formulas of Equations (4-21a,b) or (4-28a,b). These are sometimes written in the form of incremental weights as

$$u_{mj}^{(r+1)} = u_{mj}^{(r)} + \Delta u_{mj}, \qquad w_{nm}^{(r+1)} = w_{nm}^{(r)} + \Delta w_{nm} \tag{4-29}$$

A Backpropagation Algorithm

Inputs: {number of input nodes N; number of middle neurodes M; number of output neurodes J; number of exemplar vectors Q; number of identifiers (classes) K; the exemplar vectors {x(q)} and paired identifier vectors {t(k(q))}; number of epochs I; and biases bi and decay rates αi (i = 1, 2)}

Outputs: {the weights w = (w11, w21,...,wNM) and u = (u11, u21,...,uMJ) and the total SSE E}

Step 1: /Input N, M, J, Q, exemplar input vectors and corresponding identifiers {x(q),t(q)} and I/
    read MLP file;    /Data is stored in file/
    input I;    /Input no. epochs desired from keyboard/

Step 2: /Set parameters α1, b1, α2, b2, η/
    b1 ← N/2.0; b2 ← M/2.0;    /For bipolar sigmoids these are zero/
    α1 ← 2.4; α2 ← 2.4; η1 ← 0.4; η2 ← 0.25;    /Parameters may be different from these/

Step 3: /Generate initial weights randomly between -0.5 and 0.5/
    for m = 1 to M do
        for n = 1 to N do wnm ← Random() - 0.5;    /Draw uniform(0,1), shift down 0.5/
        for j = 1 to J do umj ← Random() - 0.5;    /Draw uniform(0,1), shift down 0.5/

Step 4: /Adjust all weights via steepest descent method/
    for r = 1 to I do    /Do I epochs/
        for q = 1 to Q do    /with each over all Q exemplar pairs/
            Update_NN();    /Call procedure to update MLP/
            for m = 1 to M do    /Update MJ umj's and NM wnm's/
                for j = 1 to J do    /For each m, update over J outputs/
                    umj(r+1) ← umj(r) + η1{(tj(q) - zj(q))[zj(q)(1-zj(q))]ym(q)};
                for n = 1 to N do    /For each n, m sum over J outputs/
                    wnm(r+1) ← wnm(r) + η2{Σ(j=1,J) (tj(q) - zj(q))[zj(q)(1-zj(q))]umj(r)}[ym(q)(1-ym(q))][xn(q)];
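The four steps of the algorithm can be sketched as a short runnable Python program. This is a sketch under stated assumptions rather than a transcription: it takes unit rate factors and zero biases in both sigmoids, uses the step gains 0.4 and 0.25 for the output and hidden weights, folds `Update_NN()` into a `forward` helper, and trains on a tiny two-exemplar dataset invented here for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w, u):
    """Plays the role of Update_NN(): hidden outputs y and network outputs z."""
    y = [sigmoid(sum(w[n][m] * x[n] for n in range(len(x)))) for m in range(len(u))]
    z = [sigmoid(sum(u[m][j] * y[m] for m in range(len(u)))) for j in range(len(u[0]))]
    return y, z

def total_sse(X, T, w, u):
    return sum(sum((tj - zj) ** 2 for tj, zj in zip(t, forward(x, w, u)[1]))
               for x, t in zip(X, T))

def train_bp(X, T, M, epochs, eta1=0.4, eta2=0.25, seed=0):
    """Per-exemplar backpropagation with unipolar sigmoids (Equations 4-21a,b)."""
    rng = random.Random(seed)
    N, J = len(X[0]), len(T[0])
    # Step 3: initial weights drawn uniformly from [-0.5, 0.5).
    w = [[rng.random() - 0.5 for _ in range(M)] for _ in range(N)]
    u = [[rng.random() - 0.5 for _ in range(J)] for _ in range(M)]
    # Step 4: I epochs, each passing once over all Q exemplar pairs.
    for _ in range(epochs):
        for x, t in zip(X, T):
            y, z = forward(x, w, u)
            d = [(t[j] - z[j]) * z[j] * (1.0 - z[j]) for j in range(J)]
            # Back-propagated sums use the output weights from before this update.
            back = [sum(d[j] * u[m][j] for j in range(J)) for m in range(M)]
            for m in range(M):
                for j in range(J):
                    u[m][j] += eta1 * d[j] * y[m]
                for n in range(N):
                    w[n][m] += eta2 * back[m] * y[m] * (1.0 - y[m]) * x[n]
    return w, u

# Hypothetical data: two classes, targets coded with 0.9 for high and 0.1 for low.
X = [[1.0, 0.0], [0.0, 1.0]]
T = [[0.9, 0.1], [0.1, 0.9]]
w0, u0 = train_bp(X, T, M=2, epochs=0)      # untrained weights, same seed
w1, u1 = train_bp(X, T, M=2, epochs=2000)
before = total_sse(X, T, w0, u0)
after = total_sse(X, T, w1, u1)
```

Comparing `before` with `after` shows how far the total SSE has fallen from its value at the initial random weights; on data that is harder to separate, more hidden neurodes or a different seed may be needed.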

The function Random() draws a uniform random value in the interval [0,1]. Early researchers restricted the initial weight set to magnitudes in [-0.5,0.5] as it was thought that the deeper local minima are close to the origin. Current research on this has mixed results. We note that during the training, some weights move away from the origin. Fahlman found that the initial intervals [-1,1] and [-2,2] yielded results that were just as good on his datasets as did the interval [-0.5,0.5] (Fahlman, 1988), while (McCormack and Doherty, 1993) used [-5,5] when training on certain data sets. The function Update_NN() puts the qth exemplar input vector through the network to update all of the $y_m^{(q)}$ and $z_j^{(q)}$ values put out by neurodes.

The advantages of BP and the MLP type of FANNs are: i) the learning is somewhat independent of the order in which the exemplar feature vectors are presented; ii) the architecture can be manipulated for better results (see Chapter 9 on neural engineering); and iii) the operational mode can be performed with parallel processors.

The disadvantages are: i) the training may converge to a local minimum that is shallow so that the learning is not robust; ii) the step gain (learning rate) $\eta$ cannot be predicted in advance and may be either too small, so that too many steps are required to converge, or too large, so that the process oscillates instead of converging; iii) the derivatives approach zero so the computed steps are essentially zero, in which case a large number of steps does not move the weight point much; iv) the gradient provides only a linear approximation to the actual local direction of steepest descent and the approximate directions may change drastically from step to step; v) upon changing the value of one or more weights, the PSSEs (partial sum-squared errors) $E^{(q)}$ are changed as a function of the other weights, which is the moving target effect (thrashing occurs during each epoch, where a step to minimize $E^{(q)}$ tends to increase some other $E^{(p)}$); and vi) the network may overtrain on the presented feature vectors and become too specialized and unable to accurately recognize other similar vectors.

The first problem, that of local minima, was demonstrated by (Gori and Tesi, 1992), whose simple example on nonlinearly separable patterns showed that BP becomes stuck in a local minimum on a FANN that contained a single hidden layer. We may train an MLP many times from different initial weight points and keep the weight set that yields the lowest total SSE (sum-squared error) E. This would appear to be a fruitful strategy for dealing with shallow undesirable local minima. However, the lowest SSE does not necessarily provide the best learning, as was observed by (Rumelhart, 1986), and sometimes indicates specialization (see Section 9.3).

The second problem, the lack of a priori information about the learning rate, requires a strategy for adjusting it on the way to the minimum. It should be sufficiently large on the early iterations, but as the process descends into the well of a local minimum, $\eta$ must be decreased appropriately. One of the early methods was the delta-bar-delta method (Jacobs, 1988), which we discuss in the next chapter.

The third problem is saturation, where a sigmoid derivative $h'(r_m)$ or $g'(s_j)$ approaches zero, which causes the weight increments $-\eta\nabla E(w,u)$ to become essentially zero unless the learning rate (step gain) $\eta = \eta(r)$ increases enormously to compensate for it.

The fourth problem of oscillating directions of steepest descent can be handled by smoothing. This was first done by (Rumelhart et al, 1986) with a "momentum" term (see Chapter 5) that performs the required smoothing.

The fifth problem of a moving target is a fact of life with MLPs. When any weights are changed, then E(w,u) becomes a different function on the remaining weights. The next chapter discusses this further. The problem of overtraining can be handled by testing during training and afterwards, as discussed in Section 9.8.

In practice, however, convergence takes place often, especially with small to moderate sized MLPs. When it does not, it may converge on another training run from a different initial weight set, or it may require a change in the MLP architecture. An adaptive step gain can improve the rate of convergence, as we will see in Chapter 5. The more accurate Newtonian formulations use second order approximations of the SSE function E that may include the Hessian matrix of second order mixed partial derivatives (discussed in Appendix 4 and Chapter 5). Parker used second order quasi-Newtonian methods to train MLPs (Parker, 1982). Conjugate gradient methods offer a tradeoff between BP and second order Newton or quasi-Newton methods in that they converge more rapidly than BP but use significantly less computation than the quasi-Newton and Newton methods.
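The multi-start strategy mentioned in this section (train from several initial weight points and keep the result with the lowest SSE) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_error` is a hypothetical one-dimensional quartic standing in for a slice of E with one shallow and one deep minimum, and `descend`, `multi_start` and `random_starts` are names invented here, not functions from the text.

```python
import random

def toy_error(w):
    """Hypothetical 1-D stand-in for E: a quartic with a shallow and a deep minimum."""
    return w ** 4 - 2.0 * w ** 2 + 0.5 * w

def toy_gradient(w):
    """Derivative of toy_error with respect to w."""
    return 4.0 * w ** 3 - 4.0 * w + 0.5

def descend(w, eta=0.05, steps=500):
    """Plain steepest descent with a fixed step gain eta."""
    for _ in range(steps):
        w -= eta * toy_gradient(w)
    return w

def multi_start(starts):
    """Run steepest descent from every start point and keep the lowest error."""
    finals = [descend(w0) for w0 in starts]
    return min(finals, key=toy_error)

def random_starts(count, half_width=0.5, seed=0):
    """Draw initial weights uniformly from [-half_width, half_width]."""
    rng = random.Random(seed)
    return [rng.uniform(-half_width, half_width) for _ in range(count)]
```

A start at 0.3 slides into the shallow minimum near 0.93, while a start at -0.4 reaches the deeper one near -1.06, so `multi_start([-0.4, 0.3])` keeps the latter. As the text cautions, the lowest SSE is not always the best learning, so this selection rule is only a starting point.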

4.6 Extending Backpropagation to Two Hidden Layers


The Extended Architecture

Figure 4.6 shows an MLP with two hidden layers. The input/output training pairs $\{(x^{(q)}, t^{(k(q))}): q = 1,...,Q\}$ are given and the initial weights $\{v^{(0)}, w^{(0)}, u^{(0)}\}$ are drawn randomly from $[-0.5,0.5]^{NL+LM+MJ}$, one component for each of the NL weights $v_{nl}$, the LM weights $w_{lm}$ and the MJ weights $u_{mj}$. The situation is the same as for a single hidden layer in that the exemplar feature vectors are paired with the desired identifier codewords that represent the classes. Chapter 9 provides more information on designing extended MLPs.

Figure 4.6 - The Extended MLP Architecture

Backpropagation for Two Hidden Layers with Unipolar Sigmoids

We first derive the computational formulas for updating the weights via steepest descent in an extended BP algorithm. For specificity, we use the unipolar sigmoids. We suppress the index "q" of the PSSE $E^{(q)}$, as we did above. The partial derivatives $\partial E/\partial v_{nl}$, $\partial E/\partial w_{lm}$ and $\partial E/\partial u_{mj}$ are to be used in the weight updates

$$v_{nl}^{(r+1)} = v_{nl}^{(r)} - \eta\,\partial E(v_{nl},w_{lm},u_{mj})/\partial v_{nl} \tag{4-30a}$$

$$w_{lm}^{(r+1)} = w_{lm}^{(r)} - \eta\,\partial E(v_{nl},w_{lm},u_{mj})/\partial w_{lm} \tag{4-30b}$$

$$u_{mj}^{(r+1)} = u_{mj}^{(r)} - \eta\,\partial E(v_{nl},w_{lm},u_{mj})/\partial u_{mj} \tag{4-30c}$$

From Figure 4.6, we see that the output layer has all of the same parameter names and indices, so we can use Equation (4-21a) for the increments on $u_{mj}$, which is

$$u_{mj}^{(r+1)} = u_{mj}^{(r)} + \eta\,(t_j - z_j)\,z_j(1-z_j)\,y_m \tag{4-31}$$

The hidden layer adjacent to the output layer is the second hidden layer. The difference between the variable designations here and those used to derive Equation (4-21b) is that the inputs to these hidden neurodes are $a = (a_1,...,a_L)$ instead of $x = (x_1,...,x_N)$. Upon making these changes in Equation (4-21b), we obtain

$$w_{lm}^{(r+1)} = w_{lm}^{(r)} + \eta\Big\{\sum_{j=1}^{J}(t_j - z_j)[z_j(1-z_j)]\,u_{mj}\Big\}[y_m(1-y_m)]\,a_l \tag{4-32}$$

The derivation of the weight updates at the first hidden layer is more tedious and uses different nomenclature of variables and indices. We omit the column of notes here, suppress the "q" superscripts for the exemplar number, and use the power of the chain rule. But first, we describe $E = E(v_{nl},w_{lm},u_{mj})$ as a function of its weights and intermediate variables by

$$\begin{aligned}
E = E(v_{nl},w_{lm},u_{mj}) &= \sum_{j=1}^{J}(t_j - z_j)^2 = \sum_{j=1}^{J}\big(t_j - g(s_j)\big)^2 \\
&= \sum_{j=1}^{J}\Big(t_j - g\Big(\sum_{m=1}^{M}u_{mj}\,y_m\Big)\Big)^2 = \sum_{j=1}^{J}\Big(t_j - g\Big(\sum_{m=1}^{M}u_{mj}\,h(r_m)\Big)\Big)^2 \\
&= \sum_{j=1}^{J}\Big(t_j - g\Big(\sum_{m=1}^{M}u_{mj}\,h\Big(\sum_{l=1}^{L}w_{lm}\,a_l\Big)\Big)\Big)^2 = \sum_{j=1}^{J}\Big(t_j - g\Big(\sum_{m=1}^{M}u_{mj}\,h\Big(\sum_{l=1}^{L}w_{lm}\,f(p_l)\Big)\Big)\Big)^2 \\
&= \sum_{j=1}^{J}\Big(t_j - g\Big(\sum_{m=1}^{M}u_{mj}\,h\Big(\sum_{l=1}^{L}w_{lm}\,f\Big(\sum_{n=1}^{N}v_{nl}\,x_n\Big)\Big)\Big)\Big)^2
\end{aligned} \tag{4-33a}$$

It is obvious from Figure 4.6 and less obvious from Equation (4-33a) that each weight vnl at the first hidden layer is affected not only by every difference (tj - zj) at the output layer, but also by every neurode in the second hidden layer. Thus we need to sum the total error adjustments of all j = 1,...,J and over all m = 1,...,M. We first

derive the gradient of the SSE with respect to $v_{nl}$ over a single jth output and a single mth neurode in the second hidden layer. Thus we use

$$E_{mj} = \Big(t_j - g\Big(u_{mj}\,h\Big(\sum_{l=1}^{L}w_{lm}\,f\Big(\sum_{n=1}^{N}v_{nl}\,x_n\Big)\Big)\Big)\Big)^2 \tag{4-33b}$$

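From Equation (4-33b) it is easy to write down the chain rule in terms of dependent variables that start at the output layer and work backward to the first hidden layer, which leads to the partial derivative

$$\begin{aligned}
\partial E/\partial v_{nl} &= (\partial E/\partial z_j)(\partial z_j/\partial v_{nl}) \\
&= (\partial E/\partial z_j)(\partial z_j/\partial s_j)(\partial s_j/\partial v_{nl}) \\
&= (\partial E/\partial z_j)(\partial z_j/\partial s_j)(\partial s_j/\partial y_m)(\partial y_m/\partial v_{nl}) \\
&= (\partial E/\partial z_j)(\partial z_j/\partial s_j)(\partial s_j/\partial y_m)(\partial y_m/\partial r_m)(\partial r_m/\partial v_{nl}) \\
&= (\partial E/\partial z_j)(\partial z_j/\partial s_j)(\partial s_j/\partial y_m)(\partial y_m/\partial r_m)(\partial r_m/\partial a_l)(\partial a_l/\partial v_{nl}) \\
&= (\partial E/\partial z_j)(\partial z_j/\partial s_j)(\partial s_j/\partial y_m)(\partial y_m/\partial r_m)(\partial r_m/\partial a_l)(\partial a_l/\partial p_l)(\partial p_l/\partial v_{nl}) \\
&= \{(-2)(t_j - z_j)[g'(s_j)]\,u_{mj}\}[h'(r_m)][w_{lm}\,f'(p_l)][x_n]
\end{aligned} \tag{4-34a}$$

Because we are using unipolar sigmoids, their derivatives all have the same form: the neurode output times one minus that output, for example $z_j(1-z_j)$. Substituting into Equation (4-34a) and absorbing the 2 in $\eta$, we obtain the (r+1)st iterate

$$v_{nl}^{(r+1)} = v_{nl}^{(r)} + \eta\{(t_j - z_j)[z_j(1-z_j)]\,u_{mj}\}[y_m(1-y_m)]\,w_{lm}\,a_l(1-a_l)\,x_n \tag{4-34b}$$

Now we sum over all such parts for j = 1,...,J and m = 1,...,M to obtain the final update

$$v_{nl}^{(r+1)} = v_{nl}^{(r)} + \eta\Big\{\sum_{j=1}^{J}(t_j - z_j)\,z_j(1-z_j)\Big[\sum_{m=1}^{M}u_{mj}\,y_m(1-y_m)\,w_{lm}\Big]a_l(1-a_l)\,x_n\Big\} \tag{4-35}$$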

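Recapitulating for easy reference, the case of unipolar sigmoids for extended backpropagation yields

$$u_{mj} \leftarrow u_{mj} + \eta\,(t_j - z_j)\,z_j(1-z_j)\,y_m \tag{4-36}$$

$$w_{lm} \leftarrow w_{lm} + \eta\Big\{\sum_{j=1}^{J}(t_j - z_j)\,z_j(1-z_j)\,u_{mj}\Big\}[y_m(1-y_m)]\,a_l \tag{4-37}$$

$$v_{nl} \leftarrow v_{nl} + \eta\Big\{\sum_{j=1}^{J}(t_j - z_j)\,z_j(1-z_j)\Big[\sum_{m=1}^{M}u_{mj}\,y_m(1-y_m)\,w_{lm}\Big]a_l(1-a_l)\,x_n\Big\} \tag{4-38}$$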
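The three updates (4-36) through (4-38) can be sketched in plain Python for a single exemplar. This is an illustrative sketch under stated assumptions, not the book's program: the sigmoid rate is taken as 1 and the bias as 0, the weights are stored as nested lists `v[n][l]`, `w[l][m]`, `u[m][j]`, and the function names (`forward`, `increments`, `bp_step`) are invented here. All three increments are computed from the old weights before any weight is changed.

```python
import math

def sigmoid(s):
    """Unipolar sigmoid with rate 1 and bias 0 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, v, w, u):
    """Return (a, y, z): first hidden, second hidden and output activations."""
    N, L, M, J = len(v), len(w), len(u), len(u[0])
    a = [sigmoid(sum(v[n][l] * x[n] for n in range(N))) for l in range(L)]
    y = [sigmoid(sum(w[l][m] * a[l] for l in range(L))) for m in range(M)]
    z = [sigmoid(sum(u[m][j] * y[m] for m in range(M))) for j in range(J)]
    return a, y, z

def psse(x, t, v, w, u):
    """Partial sum-squared error for one exemplar pair (x, t)."""
    z = forward(x, v, w, u)[2]
    return sum((tj - zj) ** 2 for tj, zj in zip(t, z))

def increments(x, t, v, w, u):
    """Bracketed terms of (4-36), (4-37), (4-38), before multiplying by eta."""
    a, y, z = forward(x, v, w, u)
    N, L, M, J = len(x), len(a), len(y), len(z)
    d = [(t[j] - z[j]) * z[j] * (1.0 - z[j]) for j in range(J)]
    du = [[d[j] * y[m] for j in range(J)] for m in range(M)]
    # e[m] = {sum_j (t_j - z_j) z_j (1 - z_j) u_mj} * y_m (1 - y_m)
    e = [sum(d[j] * u[m][j] for j in range(J)) * y[m] * (1.0 - y[m]) for m in range(M)]
    dw = [[e[m] * a[l] for m in range(M)] for l in range(L)]
    # c[l] = [sum_m e_m w_lm] * a_l (1 - a_l), equal to the double sum in (4-38)
    c = [sum(e[m] * w[l][m] for m in range(M)) * a[l] * (1.0 - a[l]) for l in range(L)]
    dv = [[c[l] * x[n] for l in range(L)] for n in range(N)]
    return dv, dw, du

def bp_step(x, t, v, w, u, eta=0.5):
    """One extended-BP step on a single exemplar; updates v, w, u in place."""
    dv, dw, du = increments(x, t, v, w, u)
    for weights, delta in ((v, dv), (w, dw), (u, du)):
        for row, drow in zip(weights, delta):
            for k in range(len(row)):
                row[k] += eta * drow[k]
```

Because the factor 2 has been absorbed into the step gain, each increment returned by `increments` equals minus one half of the corresponding partial derivative of the PSSE, which can be checked against a finite-difference estimate.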

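Backpropagation for Two Hidden Layers with Bipolar Sigmoids

If, on the other hand, we use the bipolar sigmoids, where the derivatives of the sigmoids have the form of one plus the neurode output times one minus that output, divided by 2, for example $(1+z_j)(1-z_j)/2$, then the weight updates are

$$u_{mj} \leftarrow u_{mj} + \eta\,(t_j - z_j)[(1+z_j)(1-z_j)/2]\,y_m \tag{4-39}$$

$$w_{lm} \leftarrow w_{lm} + \eta\Big\{\sum_{j=1}^{J}(t_j - z_j)[(1+z_j)(1-z_j)/2]\,u_{mj}\Big\}[(1+y_m)(1-y_m)/2]\,a_l \tag{4-40}$$

$$v_{nl} \leftarrow v_{nl} + \eta\Big\{\sum_{j=1}^{J}(t_j - z_j)[(1+z_j)(1-z_j)/2]\Big[\sum_{m=1}^{M}u_{mj}[(1+y_m)(1-y_m)/2]\,w_{lm}\Big][(1+a_l)(1-a_l)/2]\,x_n\Big\} \tag{4-41}$$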

Equations (4-36), (4-37) and (4-38) coincide with those of (Rogers and Kabrisky, 1993).
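The bipolar derivative form used in Equations (4-39) through (4-41) can be checked numerically. The sketch below assumes a bipolar sigmoid with rate 1 and bias 0, which coincides with tanh(s/2); the function names are hypothetical and chosen only for this illustration.

```python
import math

def bipolar(s):
    """Bipolar sigmoid with rate 1 and bias 0 (assumed); equals tanh(s / 2)."""
    return (1.0 - math.exp(-s)) / (1.0 + math.exp(-s))

def bipolar_slope(y):
    """Derivative written in terms of the output y: (1 + y)(1 - y) / 2."""
    return (1.0 + y) * (1.0 - y) / 2.0

def numerical_slope(s, h=1e-6):
    """Central-difference estimate of the derivative of bipolar at s."""
    return (bipolar(s + h) - bipolar(s - h)) / (2.0 * h)
```

At s = 0 the output is 0 and the slope is 1/2, its maximum; for large |s| the output approaches -1 or 1 and the slope approaches zero, which is the saturation effect discussed earlier.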

4.7 Selecting the Output Target Vectors for Training


Identifiers as Codewords

The requirement here is to design a set of identifiers $\{t^{(k(q))}: q = 1,...,Q\}$ to be paired with the input exemplar feature vectors $\{x^{(q)}: q = 1,...,Q\}$. Any output $t_j^{(k(q))}$, to be matched approximately by a computed $z_j^{(q)} = g(s_j^{(q)})$ at the jth output neurode, must be in the range of the activation function $g(\cdot)$ that squashes the sums $s^{(q)}$ into an interval such as [0,1] (unipolar) or [-1,1] (bipolar). The vectors $t^{(k(q))}$ are usually binary encoded codewords with values from a binary alphabet such as {0,1} or {-1,1} to produce identifiers such as $t^{(1)}$ = (1,0,0,1,1) or its bipolar equivalent (1,-1,-1,1,1). The output neurodes cannot actually attain values of 0 or 1 (unipolar case), or -1 or 1 (bipolar case), which are the limiting cases as the weighted sums $r_m$ or $s_j$ go to plus or minus infinity. Therefore we use values such as 0.1 and 0.9 or -0.9 and 0.9.

The number J of components in the identifiers must be selected according to the resolution required, and the resolution is directly dependent upon the number K of classes needed. A single output can be given a number of discrete values such as 0.1, 0.3, 0.5, 0.7, 0.9. Five output target vectors could be specified for J = 2, for example, as (0.1,0.1), (0.1,0.9), (0.9,0.1), (0.9,0.9) and (0.5,0.5), while 25 are possible. The design goal is to separate the input feature vectors without error, so we should select the identifier codewords to be as different as possible. Rather than putting values of a single output variable close together, we should put them far apart. Thus we would not use 0.4 and 0.6 when we could use 0.1 and 0.9. The basic idea here is that appropriately trained weights will push each exemplar feature vector component up or down to fit a particular target identifier (combination of high and low values) to distinguish the different classes from each other.

We have seen in Chapter 3 that the M hidden neurodes partition the feature space into a set of convex regions determined by a set of M hyperplanes. We will see in Chapter 9 that the output layer of neurodes can join combinations of these subclass regions together into classes that are neither convex nor linearly separable. The particular regions depend upon the weights $\{w_{nm}\}$ and $\{u_{mj}\}$ that respectively partition and join (two sets are joined by taking their union).

Error Detecting and Error Correcting Identifiers

Ideally, the identifiers of the classes should be sufficiently well separated for error correcting, or at least error detecting. We may choose each component of an identifier to be a value from a binary alphabet {0,1} or {-1,1}. The binary alphabet used depends upon which sigmoid activation function is used: i) unipolar, in which

case we use {0,1}; or ii) bipolar, where we use {-1,1}. The Hamming distance between two J-bit identifier codewords is the number of positions (components) in which the codewords are different. For example, the Hamming distance between (1,1,0,0,1) and (1,0,1,0,1) is 2 (they differ in the second and third positions). If we choose K identifiers for K classes such that the Hamming distance between each pair of them is at least 3, then a single error in the mapping can be corrected. The output vector in error is closer to its correct identifier than to any other codeword and is thus recognized correctly, because it differs in only one position from the correct codeword but differs in at least two positions from any other codeword.

But error correction comes with a cost. If the number of classes K is large, then J will need to be significantly larger to select K codewords that differ pairwise by a Hamming distance of 3 or more, so that there will be many output neurodes. This increases the number of weights to be trained and thus increases the complexity of computation both in training and in recognition. For single error detection, the pairs of identifier codewords must be at least a Hamming distance of 2 from each other. A single error will result in a codeword that is not an identifier and that exposes the error. Error detection and correction require that each output component be changed to either the high value or the low value, whichever it is closest to.

A reasonable trade-off between the extremes of error correction and a single output is to use the number of outputs J, where $2^{J-1} < K \le 2^J$ (for K classes needed). The use of J = 8 output neurodes and a binary alphabet, for example, allows $K = 2^8 = 256$ unique class identifiers. The input components are better left with continuous (analog) values that contain finer information than discretized values.
In case the x(q) are discretized, the resolution should be high (a fairly large number of discrete values) so that less information will be lost.
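The codeword ideas of this section (Hamming distance, snapping each output to the nearer of the high and low values, and choosing J from K) can be sketched as follows. The function names and the four 5-bit codewords in the usage note are hypothetical illustrations, not identifiers from the text, and the 0.5 threshold assumes unipolar outputs.

```python
def hamming(c1, c2):
    """Number of positions in which two equal-length codewords differ."""
    return sum(1 for p, q in zip(c1, c2) if p != q)

def min_outputs(K):
    """Smallest J satisfying 2**(J-1) < K <= 2**J for K classes."""
    J = 0
    while 2 ** J < K:
        J += 1
    return J

def decode(z, codewords, threshold=0.5):
    """Snap each output to 0 or 1, then return (class, distance) of the nearest codeword.

    codewords maps a class label to a tuple of 0/1 values (unipolar alphabet assumed).
    """
    bits = tuple(1 if zj >= threshold else 0 for zj in z)
    best = min(codewords, key=lambda k: hamming(bits, codewords[k]))
    return best, hamming(bits, codewords[best])

def min_pairwise_distance(codewords):
    """Smallest Hamming distance over all pairs of codewords."""
    labels = list(codewords)
    return min(hamming(codewords[a], codewords[b])
               for i, a in enumerate(labels) for b in labels[i + 1:])
```

For instance, with the hypothetical set A = (0,0,0,0,0), B = (1,1,1,0,0), C = (0,0,1,1,1), D = (1,1,0,1,1), every pair is at distance 3 or more, so an output vector such as (0.9, 0.8, 0.2, 0.1, 0.2), which snaps to (1,1,0,0,0), is one position away from B and at least two away from the others.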

4.8 Convergence, Parameters and Critical Values


Why Output Target Components Can Be Neither 0 nor 1

Suppose we use the unipolar sigmoid activation function at an output neurode where the desired (training) output is 1. Then

$$z_j = 1/[1 + \exp(-\alpha(s_j - b))] = 1$$

This means that $\exp(-\alpha(s_j - b)) = 0$, so that $s_j = \infty$. If the desired output is

$$z_j = 1/[1 + \exp(-\alpha(s_j - b))] = 0$$

then $[1 + \exp(-\alpha(s_j - b))] = \infty$, so that $s_j = -\infty$.

Similarly, if the bipolar equations are set to 1 via

$$z_j = [1 - \exp(-\alpha(s_j - b))]/[1 + \exp(-\alpha(s_j - b))] = 1$$

then $-\exp(-\alpha(s_j - b)) = \exp(-\alpha(s_j - b))$, which means that $\exp(-\alpha(s_j - b)) = 0$, so that $s_j = \infty$. If

$$z_j = [1 - \exp(-\alpha(s_j - b))]/[1 + \exp(-\alpha(s_j - b))] = -1$$

then $[1 - \exp(-\alpha(s_j - b))] = -[1 + \exp(-\alpha(s_j - b))]$, so that 1 = -1, unless $s_j = -\infty$.

The conclusion is that we never put the desired output vector components at 0 or 1 (unipolar) nor at 1 or -1 (bipolar). We often use 0.1 for 0 and 0.9 for 1 to make comparative runs, but in practice, we prefer to use 0.2 for 0 and 0.8 for 1 (unipolar), or -0.8 for -1 and 0.8 for 1 (bipolar). This causes less saturation and quicker convergence. It also helps prevent weight drift.
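To make the argument above concrete, the following sketch inverts the unipolar sigmoid to find the weighted sum needed to hit a given target. The function names are hypothetical, and the default rate of 1 and bias of 0 are assumptions chosen only for illustration.

```python
import math

def required_sum(target, alpha=1.0, b=0.0):
    """Weighted sum s_j for which 1/(1 + exp(-alpha*(s_j - b))) equals target.

    Only defined for 0 < target < 1; the sum diverges as target approaches 0 or 1.
    """
    return b + math.log(target / (1.0 - target)) / alpha

def unipolar_slope(z):
    """Sigmoid derivative (for rate 1) written in terms of the output z."""
    return z * (1.0 - z)
```

A target of 0.9 needs a sum of ln 9, about 2.2, while 0.999 already needs about 6.9, and the required sum grows without bound as the target approaches 1. The slope at an output of 0.8 is 0.16 against 0.09 at 0.9, which is consistent with the remark that the 0.2 and 0.8 targets cause less saturation.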

How Important Are Biases, Exponential Rates and Learning Rates?

Chapter 5 discusses training on biases $b_1$ and $b_2$ and exponential rates $\alpha_1$ and $\alpha_2$, which can be done via

$$b_p \leftarrow b_p - \eta_p(\partial E/\partial b_p), \qquad \alpha_r \leftarrow \alpha_r - \eta_r(\partial E/\partial \alpha_r), \qquad p, r = 1,2 \tag{4-42}$$

Such adjustments can lower the TSSE E. The derivatives given in Equations (4-42) are derived in Chapter 5, but it is often best to fix these to avoid the moving target effect. For large $\alpha_i$, the sigmoid derivatives are near zero for some neurodes and the convergence virtually halts. We may take $N/2 \le b_1 \le 1.2N/2$ and $M/2 \le b_2 \le 1.2M/2$. The step gains (learning rates) must be adapted as the training progresses for satisfactory convergence speed (see Section 5.3).

What Do the Error Functions Look Like?

To plot slices of the graphs of the SSE E, we suboptimize E with respect to a set of exemplar training pairs $\{(x^{(q)}, t^{(q)}): q = 1,...,Q\}$ from a small dataset. Next, we hold all values of the suboptimal weight set fixed except for a single pair $(w_{nm}, u_{mj})$ at a time, where the values n, m, and j are randomly selected from 1 to, respectively, N, M, and J. We increment both $w_{nm}$ and $u_{mj}$ from -2.5 to 2.5 in steps of 0.007812 in Figures 4.7 through 4.11, which display graphs of slices of the SSE function E as described above. Appendix 11 describes the datasets.
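The slicing procedure just described can be sketched as a grid sweep over one pair of weights while all others are held fixed. This is a minimal sketch under assumptions: a single-hidden-layer MLP with unipolar sigmoids of rate 1 and bias 0, weights stored as `w[n][m]` and `u[m][j]`, and a coarse default step of 0.25 rather than the 0.007812 used for the figures, to keep the grid small. The function names are hypothetical, and the prior suboptimization of the weights is not shown.

```python
import math

def sigmoid(s):
    """Unipolar sigmoid with rate 1 and bias 0 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-s))

def sse(w, u, data):
    """Total SSE of an N-M-J MLP over data, a list of (x, t) exemplar pairs."""
    total = 0.0
    for x, t in data:
        M = len(u)
        y = [sigmoid(sum(w[n][m] * x[n] for n in range(len(x)))) for m in range(M)]
        z = [sigmoid(sum(u[m][j] * y[m] for m in range(M))) for j in range(len(t))]
        total += sum((tj - zj) ** 2 for tj, zj in zip(t, z))
    return total

def sse_slice(w, u, data, n, m, j, lo=-2.5, hi=2.5, step=0.25):
    """Sweep w[n][m] and u[m][j] over [lo, hi]; return (axis, grid) of SSE values.

    grid[a][b] is the SSE with w[n][m] = axis[a] and u[m][j] = axis[b].
    The two swept weights are restored to their original values afterwards.
    """
    count = int(round((hi - lo) / step)) + 1
    axis = [lo + k * step for k in range(count)]
    saved_w, saved_u = w[n][m], u[m][j]
    grid = []
    for wv in axis:
        w[n][m] = wv
        row = []
        for uv in axis:
            u[m][j] = uv
            row.append(sse(w, u, data))
        grid.append(row)
    w[n][m], u[m][j] = saved_w, saved_u
    return axis, grid
```

The returned grid can be handed to any plotting tool to draw a surface like those in Figures 4.7 through 4.11; the datasets used for those figures are not reproduced here.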

Figure 4.7 - A Slice of the SSE Function E for the Blood10 Data Set

The slice of Figure 4.7 (Blood10 dataset) has two local minima in the weight domain of the slice. Both are fairly deep, but one is definitely deeper than the other. The slice shown in Figure 4.8 (on the Digit12 dataset) has two minima with slightly greater differences in the deepness, while Figure 4.9 (on the Rotate9 dataset) has an even greater difference between the two minima. These slices with two minima are essentially quartic, that is, can be fit by a quartic polynomial. Figure 4.10 (on the Ten5ten dataset) has a single minimum, but it is not a true quadratic. It behaves more like a quartic and there may be another local minimum off the graph to the right. The slice in Figure 4.11 (on the Parity3 dataset) is different in that it has both a very shallow and a very deep local minimum. It is clear that a solution stuck in the shallow minimum would be unsatisfactory. We note that all of the curves approximate either quadratic or quartic polynomials (none have linear, cubic, nor quintic form).

Figure 4.8 - A Digit12 Slice

Figure 4.9 - A Rotate9 Slice

Figure 4.10 - A Ten5ten Slice

Figure 4.11 - A Parity3 Slice

It can be observed from Figures 4.8 through 4.11 that on the one hand there are multiple global minima, or at least multiple deep minima that are approximately global, but that on the other hand, there are shallow local minima that would be bad solutions. When the training process falls into a shallow minimum during gradient descent, it becomes trapped there and the local minimum becomes a solution that may be very unsatisfactory.


4.9 The Slickpropagation Algorithms


The Hornik-Stinchcombe-White Theorem (Hornik et al, 1989) does not require sigmoid activation functions at the output layer. Leaving the sigmoids in the hidden layer, we may take the averaging function $g(s_j)$ at each output neurode to simplify the computation (see Figure 4.1 and Equations 4.10c). Thus we replace

$$g(s_j) = 1/[1 + \exp(-\alpha s_j + b)] \tag{4-43}$$

with the new activation function

$$g(s_j) = (1/J)\,s_j \tag{4-44}$$

This not only reduces the volume of computation, but eliminates many of the convoluted ridges and valleys from the surface of the error function E. Perhaps the greatest benefit is that it prevents any saturation at the output layer because the derivative can never be close to zero, as we see from

$$g'(s_j) = 1/J \tag{4-45}$$

It also permits the target output components to take the value 0 or 1 (or -1 or 1). Upon substitution of Equation (4-45) into Equations (4-21a,b), with the factor 1/J attached to the learning rate, the formulas for slickpropagation (SP or slickprop) with unipolar sigmoids at the hidden layer become

$$u_{mj} \leftarrow u_{mj} + (\eta_1/J)\,(t_j^{(q)} - z_j^{(q)})\,y_m^{(q)} \tag{4-46a}$$

$$w_{nm} \leftarrow w_{nm} + (\eta_2/J)\Big\{\sum_{j=1}^{J}(t_j^{(q)} - z_j^{(q)})\,u_{mj}\Big\}[y_m^{(q)}(1-y_m^{(q)})]\,x_n^{(q)} \tag{4-46b}$$
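Equations (4-46a,b) can be sketched in plain Python for one exemplar at a time. The sketch assumes a hidden-layer sigmoid of rate 1 and bias 0, weights stored as `w[n][m]` and `u[m][j]`, and hypothetical function names (`sp_forward`, `sp_step`); it is an illustration of the update formulas, not the author's program.

```python
import math

def sigmoid(s):
    """Unipolar sigmoid with rate 1 and bias 0 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-s))

def sp_forward(x, w, u):
    """Hidden outputs y (sigmoid) and outputs z = (1/J) * s_j (linear averaging)."""
    N, M, J = len(w), len(u), len(u[0])
    y = [sigmoid(sum(w[n][m] * x[n] for n in range(N))) for m in range(M)]
    z = [sum(u[m][j] * y[m] for m in range(M)) / J for j in range(J)]
    return y, z

def sp_step(x, t, w, u, eta1=0.5, eta2=0.5):
    """One slickprop step on a single exemplar, following (4-46a) and (4-46b)."""
    y, z = sp_forward(x, w, u)
    N, M, J = len(w), len(u), len(u[0])
    err = [t[j] - z[j] for j in range(J)]
    # Back-propagated term for each hidden neurode, computed with the old u.
    back = [sum(err[j] * u[m][j] for j in range(J)) * y[m] * (1.0 - y[m])
            for m in range(M)]
    for m in range(M):
        for j in range(J):
            u[m][j] += (eta1 / J) * err[j] * y[m]
    for n in range(N):
        for m in range(M):
            w[n][m] += (eta2 / J) * back[m] * x[n]
```

Since the output is linear, a target of exactly 1 (or 0) can be reached with finite weights, in contrast to the sigmoid output case discussed in Section 4.8.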

Similarly, the SP algorithms that use bipolar sigmoids in the single hidden layer, given in Equations (4-28a,b), now become

$$u_{mj} \leftarrow u_{mj} + (\eta_1/J)\,(t_j^{(q)} - z_j^{(q)})\,y_m^{(q)} \tag{4-47a}$$

$$w_{nm} \leftarrow w_{nm} + (\eta_2/J)\Big\{\sum_{j=1}^{J}(t_j^{(q)} - z_j^{(q)})\,u_{mj}\Big\}[(1+y_m^{(q)})(1-y_m^{(q)})/2]\,x_n^{(q)} \tag{4-47b}$$

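For extended MLPs with two hidden layers with unipolar sigmoids, Equations (4-36), (4-37) and (4-38) become

$$u_{mj} \leftarrow u_{mj} + (\eta_1/J)\,(t_j - z_j)\,y_m \tag{4-48a}$$

$$w_{lm} \leftarrow w_{lm} + (\eta_2/J)\Big\{\sum_{j=1}^{J}(t_j - z_j)\,u_{mj}\Big\}[y_m(1-y_m)]\,a_l \tag{4-48b}$$

$$v_{nl} \leftarrow v_{nl} + (\eta_3/J)\Big\{\sum_{j=1}^{J}(t_j - z_j)\Big[\sum_{m=1}^{M}u_{mj}\,y_m(1-y_m)\,w_{lm}\Big]a_l(1-a_l)\,x_n\Big\} \tag{4-48c}$$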

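Likewise, the updates for extended MLPs with two hidden layers with bipolar sigmoids become

$$u_{mj} \leftarrow u_{mj} + \eta_1\,(t_j - z_j)\,y_m \tag{4-49a}$$

$$w_{lm} \leftarrow w_{lm} + \eta_2\Big\{\sum_{j=1}^{J}(t_j - z_j)\,u_{mj}\Big\}[(1+y_m)(1-y_m)/2]\,a_l \tag{4-49b}$$

$$v_{nl} \leftarrow v_{nl} + \eta_3\Big\{\sum_{j=1}^{J}(t_j - z_j)\Big[\sum_{m=1}^{M}u_{mj}[(1+y_m)(1-y_m)/2]\,w_{lm}\Big][(1+a_l)(1-a_l)/2]\,x_n\Big\} \tag{4-49c}$$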

4.10 Derivations for FLNs and RBFNs


Functional link networks use the N input features $x_1,...,x_N$ and additionally create H new features from these via

$$x_{N+1} = f_1(x_1,...,x_N),\ \ldots,\ x_{N+H} = f_H(x_1,...,x_N) \tag{4-50}$$

The first N features are treated linearly in the single layer of neurodes (see Figure 3.14) whose activations are either unipolar or bipolar sigmoids. Once we compute the new feature inputs from Equations (4-50), we obtain

$$E = \sum_{q=1}^{Q}\sum_{j=1}^{J}\big(t_j^{(q)} - y_j^{(q)}\big)^2 = \sum_{q=1}^{Q}\sum_{j=1}^{J}\big(t_j^{(q)} - h(r_j^{(q)})\big)^2 = \sum_{q=1}^{Q}\sum_{j=1}^{J}\Big(t_j^{(q)} - h\Big(\sum_{n=1}^{N}w_{nj}\,x_n^{(q)}\Big)\Big)^2 \tag{4-51}$$

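Thus the partial derivatives for steepest descent have the form

$$\partial E/\partial w_{nj} = 2\sum_{q=1}^{Q}\big(t_j^{(q)} - y_j^{(q)}\big)(-1)\big[\partial y_j^{(q)}/\partial r_j^{(q)}\big]\big[\partial r_j^{(q)}/\partial w_{nj}\big] = -2\sum_{q=1}^{Q}\big(t_j^{(q)} - y_j^{(q)}\big)\,h'(r_j^{(q)})\,x_n^{(q)} \tag{4-52}$$

The FLN update algorithm for unipolar sigmoid $h(\cdot)$ is therefore

$$w_{nj} \leftarrow w_{nj} + 2\eta\sum_{q=1}^{Q}\big(t_j^{(q)} - y_j^{(q)}\big)\big[y_j^{(q)}(1-y_j^{(q)})\big]\,x_n^{(q)} \tag{4-53}$$

The FLN update algorithm for bipolar sigmoid $h(\cdot)$ is

$$w_{nj} \leftarrow w_{nj} + \eta\sum_{q=1}^{Q}\big(t_j^{(q)} - y_j^{(q)}\big)\big[(1+y_j^{(q)})(1-y_j^{(q)})\big]\,x_n^{(q)} \tag{4-54}$$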
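The unipolar FLN update (4-53) trains on all Q exemplars at once, which the following sketch illustrates. The assumptions are labeled here and in the code: the single added feature is a hypothetical functional link, the product of the first two inputs; the sigmoid has rate 1 and bias 0; weights are stored as `w[n][j]`; and all function names are invented for this illustration.

```python
import math

def sigmoid(s):
    """Unipolar sigmoid with rate 1 and bias 0 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-s))

def fln_features(x):
    """Original features plus one hypothetical functional link, the product x1*x2."""
    return list(x) + [x[0] * x[1]]

def fln_outputs(X, w):
    """y_j(q) = h(sum_n w[n][j] * x_n(q)) for every exemplar q in X."""
    J = len(w[0])
    return [[sigmoid(sum(w[n][j] * x[n] for n in range(len(x)))) for j in range(J)]
            for x in X]

def fln_sse(X, T, w):
    """Total sum-squared error over all exemplars and outputs."""
    Y = fln_outputs(X, w)
    return sum((t[j] - y[j]) ** 2 for t, y in zip(T, Y) for j in range(len(t)))

def fln_epoch(X, T, w, eta=0.2):
    """One batch update of every weight following (4-53); w is changed in place."""
    Y = fln_outputs(X, w)  # outputs computed once with the current weights
    for n in range(len(w)):
        for j in range(len(w[0])):
            w[n][j] += 2.0 * eta * sum(
                (T[q][j] - Y[q][j]) * Y[q][j] * (1.0 - Y[q][j]) * X[q][n]
                for q in range(len(X)))
```

On XOR data with targets 0.1 and 0.9 and zero initial weights, the SSE starts at 0.64 and decreases under repeated calls to `fln_epoch`. With no bias term in this sketch the exemplar (0,0) always maps to 0.5, so the error cannot reach zero here.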

Radial basis function networks adjust the weights $u_{mj}$ at the output layer (see Figure 3.17) via steepest descent. There are no sigmoid activation functions at the output layer. From

$$E = \sum_{q=1}^{Q}\sum_{j=1}^{J}\big(t_j^{(q)} - z_j^{(q)}\big)^2 = \sum_{q=1}^{Q}\sum_{j=1}^{J}\Big(t_j^{(q)} - (1/M)\sum_{m=1}^{M}y_m^{(q)}\,u_{mj}\Big)^2 \tag{4-55}$$

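we determine the partial derivatives to have the form

$$\partial E/\partial u_{mj} = \sum_{q=1}^{Q}\big(\partial E/\partial z_j^{(q)}\big)\big(\partial z_j^{(q)}/\partial u_{mj}\big) = (-2/M)\sum_{q=1}^{Q}\big(t_j^{(q)} - z_j^{(q)}\big)\,y_m^{(q)} \tag{4-56}$$

The RBFN weight updates at the output layer neurodes have the form

$$u_{mj} \leftarrow u_{mj} + (2\eta/M)\sum_{q=1}^{Q}\big(t_j^{(q)} - z_j^{(q)}\big)\,y_m^{(q)} \tag{4-57}$$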
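The output-layer update (4-57) can be sketched as follows. The Gaussian form of the hidden outputs, $y_m = \exp(-\|x - v^{(m)}\|^2 / (2\sigma^2))$, is an assumption of this sketch (this section does not restate it), as are the spread parameter `sigma`, the storage layout `u[m][j]` and all function names.

```python
import math

def rbf_hidden(x, centers, sigma):
    """Assumed Gaussian hidden outputs y_m = exp(-||x - v(m)||^2 / (2 sigma^2))."""
    return [math.exp(-sum((xn - vn) ** 2 for xn, vn in zip(x, v)) / (2.0 * sigma ** 2))
            for v in centers]

def rbfn_forward(x, centers, u, sigma):
    """Return (y, z) with z_j = (1/M) * sum_m y_m * u[m][j] as in (4-55)."""
    y = rbf_hidden(x, centers, sigma)
    M, J = len(centers), len(u[0])
    z = [sum(y[m] * u[m][j] for m in range(M)) / M for j in range(J)]
    return y, z

def rbfn_sse(X, T, centers, u, sigma):
    """Total sum-squared error over all exemplar pairs."""
    return sum((t[j] - rbfn_forward(x, centers, u, sigma)[1][j]) ** 2
               for x, t in zip(X, T) for j in range(len(t)))

def rbfn_epoch(X, T, centers, u, sigma, eta=0.5):
    """One batch update of the output weights following (4-57); u changes in place."""
    M, J = len(centers), len(u[0])
    passes = [rbfn_forward(x, centers, u, sigma) for x in X]  # uses the old weights
    for m in range(M):
        for j in range(J):
            u[m][j] += (2.0 * eta / M) * sum(
                (t[j] - z[j]) * y[m] for (y, z), t in zip(passes, T))
```

Because E is quadratic in the output weights, a sufficiently small step gain decreases the SSE on every epoch; the targets may be exactly 0 and 1 since there is no sigmoid at the output layer.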

The neurodal centers $\{v^{(m)}: m = 1,...,M\}$ at the hidden neurodes may be trained via steepest descent also, where

$$E = \sum_{q=1}^{Q}\sum_{j=1}^{J}\Big(t_j^{(q)} - (1/M)\sum_{m=1}^{M}y_m^{(q)}\,u_{mj}\Big)^2 = \sum_{q=1}^{Q}\sum_{j=1}^{J}\Big(t_j^{(q)} - (1/M)\sum_{m=1}^{M}f_m(x^{(q)})\,u_{mj}\Big)^2$$

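Thus

$$\partial E/\partial v_n^{(m)} = \big(\partial E/\partial z_j^{(q)}\big)\big(\partial z_j^{(q)}/\partial y_m^{(q)}\big)\big(\partial y_m^{(q)}/\partial v_n^{(m)}\big) \tag{4-58}$$

$$\partial E/\partial v_n^{(m)} = (-2/M)\sum_{q=1}^{Q}\big(t_j^{(q)} - z_j^{(q)}\big)(1/\sigma^2)\,u_{mj}\,y_m^{(q)}\Big[\sum_{n=1}^{N}\big(x_n^{(q)} - v_n^{(m)}\big)\Big] \tag{4-59}$$

The RBFN updates for the hidden neurode centers are therefore

$$v_n^{(m)} \leftarrow v_n^{(m)} + (2\eta/M)\sum_{q=1}^{Q}\big(t_j^{(q)} - z_j^{(q)}\big)(1/\sigma^2)\,u_{mj}\,y_m^{(q)}\Big[\sum_{n=1}^{N}\big(x_n^{(q)} - v_n^{(m)}\big)\Big] \tag{4-60}$$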

Exercises
4.1 Derive the specific iterative approximation formulas for steepest descent for an MLP with N = 2, M = 2, and J = 1. Use the unipolar sigmoid activation function. Write out all sums and differentiate term-by-term.
4.2 Equation 4-15 is the derivative of the unipolar sigmoid $g(s_j)$. Find the derivative of the bipolar sigmoid directly from G(s) = 2g(s+b) - 1.
4.3 Repeat Exercise 4.1, but this time use the bipolar sigmoid for the activation functions.
4.4 Equation 4-19 contains the substitution $h'(r_m) = y_m(1 - y_m)$ for the unipolar sigmoid function. Find the derivative directly for the bipolar sigmoid function $H(r_m) = \tanh(\alpha r_m/2)$ from Equation 3.4.
4.5 Write an algorithm for training an MLP that uses steepest descent at the last two layers and random search on the weights at the first hidden layer.
4.6 Write a simple computer program that implements supervised training of MLPs using the method of steepest descent with unipolar sigmoids (use the backpropagation algorithm that trains on a single exemplar pair at a time).

4.7 Use the program written in Exercise 4.6 above to train an MLP to perform XOR logic (2-bit parity). Use four middle neurodes on one training run, but use only two on another run. Make an even/odd map of the results on [0,1]×[0,1] by testing all points $(x_1,x_2)$ where the values of $x_1$ and $x_2$ assume multiples of 0.1 from 0 to 1. Compare the results for the two different runs.
4.8 Use the program written in Exercise 4.6 above and the XOR data to analyze via simulation the errors at the output components of z when the errors $\varepsilon_1$ on the single input component $x_1$ have been chosen from a uniform distribution on [0.0,0.25]. Compute sample means and variances of the outputs.
4.9 Suppose that an MLP has been trained to map an input vector $x = (x_1,...,x_N)$ into an output vector $z = (z_1,...,z_J)$ that approximates some $t = (t_1,...,t_J)$ closely. Let an error $\varepsilon_1$ with uniform distribution on [0.0,0.25] be added to the first component of the exemplar vector x. Analyze the error at the different components of the mapped output z due to $\varepsilon_1$ on $x_1$.

4.10 Use the process of Exercise 4.9 to analyze what happens when Gaussian random error is added to one input feature vector component. For a large number of input components with uniformly distributed errors, what is the net effect at each output $z_j$? Consider the Central Limit Theorem from statistics.
4.11 Let $v^{(r+1)} = v^{(r)} - \eta\nabla E(v^{(r)})$ be an iterative step to minimize the sum-squared error E(v) on the vector of all weights v = (w,u). Consider the modification: $v^{(r+1)} = \beta v^{(r)} + (1-\beta)[v^{(r)} - \eta\nabla E(v^{(r)})]$, where $\beta$ satisfies $0 < \beta < 1$, and $\eta$ is taken to be 1.0. Analyze the effect.
4.12 Analyze the effect (see Exercise 4.11) of using $v^{(r+1)} = v^{(r)} + \beta v^{(r-1)} - (1-\beta)\eta\nabla E(v^{(r)})$, where $\beta$ satisfies $0 < \beta < 1$. How will this affect the convergence?
4.13 Consider an MLP with two hidden layers. The last two layers on the right have their weights updated as derived in the text. Derive in complete detail the formulation for updating the weights of the first hidden layer that is adjacent to the input branching layer.
4.14 Modify the linearization $v^{(r+1)} = v^{(r)} - \eta\nabla E(v^{(r)})$ by adding a momentum term of the form $\beta^{(r)}\Delta v^{(r-1)}$, where $0 < \beta < 1$. Discuss the ramifications of this strategy.

4.15 Analyze the effects on $E(v^{(r)})$ of random errors on the exemplars $\{x^{(k)}\}$. Discuss and justify your conclusions.
4.16 Modify the program developed in Exercise 4.6 above so that it uses bipolar sigmoids. Compare the convergence rates on XOR logic between unipolar and bipolar versions (be sure to use bipolar data). Make at least 12 runs and compare the average behavior (each run draws a different initial weight set at random).
4.17 Write a flow chart similar to the one given in Figure 4.5, but for training on all exemplars simultaneously in a full steepest descent on E(w,u). Now write out the complete algorithm.

4.18 Modify the program developed in Exercise 4.16 above so that it trains on all exemplars simultaneously. Run it on the XOR logic function and compare with BP (backpropagation). Try different step gains from 0.05 to 2.0 and note the convergence behavior.
4.19 Show that the bipolar sigmoid functions have maximal slope when the weights are all zero. Can the initial weights be selected to be all zeros to obtain a faster rate of convergence? Explain. State the analogous case for unipolar sigmoid activation functions.
4.20 What is the effect on the weight increments when the weights are in a region where the sigmoid of a neurode has essentially zero slope (consider the update formula for the weights at that neurode)? What about a region where the slope is large?
4.21 Write out a training algorithm where the second order Newton method is used, i.e., where the Hessian matrix of second order partial derivatives is used to obtain a more accurate descent step.
4.22 Take the partial derivative with respect to the exponential rate $\alpha$ of the sum-squared error function E. Work out an equation using $\partial E/\partial\alpha$ to be used in a modification of the BP algorithm (see Zurada, 1992).

4.23 Design the output identifiers for K = 4 classes so that a single error can be detected.
4.24 Design the output identifiers for K = 4 classes so that a single error can be corrected.
4.25 Take the partial derivative of E with respect to the bias b and include it in the BP algorithm for supervised training as an extra weight on a line where the constant 1 is always input.
4.26 Suppose that an MLP is trained to recognize K classes from a sample of Q exemplar pairs ($K \le Q$). Further, suppose that during operation of the MLP it were discovered that while it could recognize every exemplar input vector, it would sometimes incorrectly recognize novel input vectors that were very close to one of the known exemplars. Explain why this may occur. Give a strategy that could be taken to reduce this undesirable behavior.
4.27 Suppose we copy the exemplar set $\{x^{(1)},...,x^{(Q)}\}$ to a set $\{x^{(Q+1)},...,x^{(2Q)}\}$. Now suppose that we add a moderate level of Gaussian noise onto the components of the second set and then use the combined set to train an MLP. What would be the effect on the generalization or specialization of the learning? Justify your arguments.
4.28 Under backpropagation training, the weight increments $-\eta(\partial E^{(q)}/\partial v_{ps})$ become arbitrarily close to zero as the updated weights $\{v_{ps}\}$ approach a local minimum because $\partial E/\partial v_{ps} \to 0$. This causes BP to converge at an excruciatingly slow rate as it nears a local minimum because the weight increments also approach zero. Find a method to speed up convergence in the vicinity of a local minimum.
4.29 Suppose that we use 50% of the exemplars to train an MLP to reduce the SSE $E_T$. After a few epochs of training, let us put the other 50% of the exemplars (that were not used in the training) through the network and compute that SSE as $E_N$. We repeat this process in a loop. Is it possible that at some point $E_N$ will stop decreasing and begin to increase, even though $E_T$ continues to decrease? What can we infer from this in terms of generalization and specialization?
4.30 Describe a method that does not use the gradients to arrive at weights that reduce the total SSE to a reasonably low value [consider drawing random numbers in some fashion].
4.31 Would you use an MLP of size N-M-J = 2-200-1 to train on XOR logic? Why not? What can be inferred from this?
4.32 Suppose that we want to train an MLP on a set of Q exemplar pairs by using gradients. We want to use all of the Q exemplar pairs simultaneously (fullpropagation). However, we are using a computer that does not have sufficient memory for all Q exemplar pairs (Q is very large). Describe an algorithm where a portion of these, say Q/4 exemplar pairs, are used on each weight updating, consecutively, until all of the Q exemplars have been used. Compare this with batching in backpropagation. Which is the most efficient?
4.33 Derive $\partial E/\partial\alpha_1$ and $\partial E/\partial\alpha_2$ for unipolar and bipolar sigmoid activation functions.

4.34 Derive $\partial E/\partial b_1$ and $\partial E/\partial b_2$ for unipolar sigmoid activation functions.
4.35 Find the equations for updating the decay rates $\alpha_1$ and $\alpha_2$ and also for updating the biases $b_1$ and $b_2$ for unipolar sigmoid activation functions.
4.36 Show that for $z_j = g(s_j) = [1 - \exp(-\alpha_2 s_j + b_2)] / [1 + \exp(-\alpha_2 s_j + b_2)]$, it either is or isn't possible that $z_j = -1$.
4.37 Make a copy of your BP program for Exercise 4.6 and then modify it to implement SP (slickpropagation). Test it on 2-bit parity (XOR logic) and 3-bit parity. Is it quicker than BP? Why?
4.38 Consider an MLP network that has a single hidden layer. Modify the SP algorithm by substituting activations at the hidden neurodes so that their derivatives will not go to zero.

References
Fahlman, S. E. (1988), An Empirical Study of Learning Speed in Backpropagation, Tech. Report CMU-CS-88-162, Carnegie Mellon University, Pittsburgh.
Fausett, Laurene (1994), Fundamentals of Neural Networks, Prentice-Hall, Englewood Cliffs.
Fu, LiMin (1994), Neural Networks in Computer Intelligence, McGraw-Hill, NY.
Gauss, Karl Frederick (1809), Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, F. Perthes & I. H. Besser (Editors), Hamburg (translation: Dover, NY, 1963).
Gori, M., and Tesi, A. (1992), On the problem of local minima in backpropagation, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, 76-86.
Hecht-Nielsen, R. (1990), Neurocomputing, Addison-Wesley, Reading, MA.
Hornik, K., Stinchcombe, M., and White, H. (1989), Multilayer feedforward networks are universal approximators, Neural Networks, vol. 2, no. 5, 359-366.
Jacobs, R. A. (1988), Increased rates of convergence through learning rate adaptation, Neural Networks, vol. 1, 295-307.
Kosko, B. (1992), Neural Networks and Fuzzy Systems, Prentice-Hall, Englewood Cliffs.
Legendre, A. M. (1810), Méthodes des moindres quarrés, pour trouver le milieu le plus probable entre les résultats de différentes observations, Mem. Inst. France, 149-154.
Linz, Peter (1979), Theoretical Numerical Analysis, Wiley-Interscience, NY.
Looney, C. (1996), Advances in feedforward neural networks: demystifying knowledge acquiring black boxes, IEEE Trans. Knowledge and Data Engineering, vol. 8, no. 2, 1-16.
Looney, C. (1993), Neural networks as expert systems, J. Expert Systems with Applications, vol. 6, no. 2, 129-136.
McCormack, C., and Doherty, J. (1993), Neural network super architectures, Proc. 1993 Int'l Conf. Neural Networks, Nagoya, 301-304.
Parker, D. B. (1982), Learning Logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University.
Robbins, H., and Monro, S. (1951), A stochastic approximation method, Annals of Math. Statistics 22, 400-407.
Rogers, S. K., and Kabrisky, M. (1993), An Introduction to Biological and Artificial Neural Networks for Pattern Recognition, SPIE Optical Engineering Press, Bellingham, WA, USA.
Rumelhart, D., Hinton, G., and Williams, R. (1986), Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, edited by Rumelhart and McClelland, MIT Press, Cambridge, 318-362.
Stornetta, W. S., and Huberman, B. A. (1987), An improved three-layer backpropagation algorithm, Proc. First IEEE Int'l Conf. Neural Networks, San Diego, vol. 2, 637-643.
Wasserman, P. D. (1989), Neural Computing, Van Nostrand Reinhold, NY.
Werbos, P. J. (1994), The Roots of Backpropagation, John Wiley, NY.
Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis in Applied Math., Harvard University.
Zurada, J. M. (1992), Artificial Neural Systems, West Publishing, St. Paul.
