Supervised Training Via Error Backpropagation: Derivations

4.1 A Closer Look at the Supervised Training Problem
Supervised training of an MLP involves: i) the selection of a sample of Q exemplar input feature vectors paired with output target vectors; ii) the drawing of initial weight sets {wnm: 1 ≤ n ≤ N, 1 ≤ m ≤ M} and {umj: 1 ≤ m ≤ M, 1 ≤ j ≤ J}; and iii) the repetitive adjusting of the current weights by some method to force each of the input exemplar feature vectors to be mapped closer to its correct output target vector that identifies a class in the input space. More than one exemplar feature vector of the sample {x(1),...,x(Q)} may map to a single identifier t(k) = t(k(q)) for each of the K classes (K ≤ Q). The weights {wnm} at the neurodes in the hidden layer and the weights {umj} at the neurodes in the output layer are "knobs" to be adjusted to define a correct mapping of input feature vectors to their output class identifiers. A correct mapping takes each sample input vector x(q) from the kth class into an output vector z(q) that is closer to the target t(k(q)) than to the identifier t(p) of any pth class, p ≠ k. This means that ||z(q) - t(k(q))|| must be very small for each q. If this were all that there is to it, it would be a simple process, provided that we had a strategy that would adjust the weights properly. Unfortunately, the MLP architecture must be designed properly for the particular dataset to assure that the network will learn robustly and will be reasonably efficient. The main questions in laying out the architecture and then training the MLP are listed below.

1. How many layers of neurodes should we use?
2. How many input nodes should we use?
3. How many neurodes in the hidden layers should we use?
4. How many neurodes should we use in the output layer?
5. What should the target (identifier) vectors be?
6. How do we proceed to train the MLP?
7. How can we test to determine whether or not the MLP is properly trained?
8. How do we select parameters (such as the step gain η) and speed up and improve the training?
9. What should be the range of the weights and the network inputs and outputs?
Some Answers

Answer 1 is provided by the Hornik-Stinchcombe-White result (Hornik et al, 1989) given in Section 3.9, which states that a hidden layer and an output layer of neurodes are sufficient, provided there are enough neurodes in the hidden layer. To reduce the number of neurodes required in some situations, we may use two hidden layers (see Chapter 9 on neural engineering), but this is not necessary. There is also the question of what effects an extra layer of neurodes in the middle can have. Is the training faster? Is the learning better? How do the answers to these questions depend upon the linearity or nonlinearity of the data? We discuss these neural engineering issues in Chapter 9, and make actual tests in Chapters 11 and 12 to see what the results are. For now, we prefer to use a single hidden layer to avoid new difficulties, although we also derive the backpropagation iterative training equations for the case of two hidden layers.

Answer 2 can be given tentatively. The number N of input nodes must be the number N of features in the feature vectors, so that once a set of features is chosen, their number N is fixed. Chapter 10 discusses feature and data engineering. The pattern attributes of a population may be mapped to one of many possible sets of features of various sizes, but we assume here that this has already been done, and that N is given and fixed.

Answer 3, on the number of neurodes in the hidden layer(s), is difficult. A lot of research from different approaches has been done to find an answer. We provide answers in Chapter 9. For now, we use M = 2K for a small number K of classes (for, say, K = 2 to 8) up to M = K/2^p for larger K (say, M = 128/2³ = 16 or M = 512/2⁴ = 32). This allows from M+1 to 2^M (lower and upper bounds) convex regions in the feature space. Groups of these can be joined into nonconvex classes by the output neurodes (see Chapter 9).

Answer 4 gives the number J of output neurodes, which depends on the resolution required (the number K of classes) and the representation encoding scheme to be used. We may take J = log₂K (from K = 2^J), which permits 2^J combinations of high and low (1 and 0) outputs of the J components. This is discussed in Section
4.7. Answer 5, on how to select the J target vectors, is also given in Section 4.7; they may be chosen from the 2^J combinations of high and low. It is usual practice to employ 0.9 for 1 and 0.1 for 0 (see Section 4.7) because standard MLPs cannot put out 0s and 1s unless the weights become infinite.

Answer 6, on the training of an MLP, is given in Chapters 4, 5, and 6. The methods are steepest descent, accelerated gradient methods such as conjugate gradients, and strategic search methodologies that include polynomial line search. There appears to be no single algorithm that is best overall for all datasets, so we need multiple methods.

Answer 7, which tells whether or not the training is satisfactory, is given in Chapter 9, which also outlines the way to perform validation and verification testing for acceptance of the MLP training and the model (network architecture). This involves using a training subset of the sample of exemplar pairs and two other disjoint test subsets that are to be used for validation and verification but not for training. When there are sufficiently many exemplars, we may select 25% of them at random to save for validation, choose another 15% at random to serve as the final verification, and use the remaining 60% for training. Under training, the training sum-squared error E60% decreases, as does the testing sum-squared error E25%, which is computed after each small segment of training. When E25% stops decreasing and begins to increase, then we stop the training, because the network is specializing on the training data and is becoming less accurate on other data from the same population.

Answer 8, on the question of how to select parameters and speed up the training, is dealt with in Chapters 5 and 6 (also, see (Looney, 1996)). These are mainly optimization techniques. For now we may put η = 0.25.

Answer 9, on the range of the weights, is perhaps the easiest to answer for practical purposes. While the initial weights may be drawn from the interval [-0.5,0.5], the training may require that some weights move out into the interval [-b,b], for some b > 1. We want to avoid weight drift, however, where certain weights become large in magnitude and other weights compensate with opposite sign to cancel out their effects. Ideally, the final weights should be in [-1,1] because the inputs and outputs do not exceed 1 in magnitude and the activation functions squash the summed values rm and sj to within unit magnitude. In computational practice, though, the weights may be allowed to wander slightly. The range of the inputs and outputs is typically taken to be [0,1] for unipolar activation functions and [-1,1] for bipolar ones (see Section 4.7). Backpropagation with a single hidden layer is derived for both unipolar and bipolar inputs and activation functions in Section 4.4. The algorithm is presented in Section 4.5. It is also derived for two hidden layers with both unipolar and bipolar sigmoid activation functions in Section 4.4.
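Answer 7's split-and-stop procedure is easy to mechanize. Below is a minimal Python sketch, assuming NumPy arrays of exemplars and targets; the 60/25/15 fractions follow the text, while the patience window for declaring that E25% has turned upward is an added assumption.

```python
import numpy as np

def split_sample(X, T, rng, frac_train=0.60, frac_val=0.25):
    """Randomly split the Q exemplar pairs into a 60% training subset,
    a 25% validation subset, and a 15% final-verification subset."""
    Q = len(X)
    idx = rng.permutation(Q)
    n_tr = int(frac_train * Q)
    n_va = int(frac_val * Q)
    tr, va, ve = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], T[tr]), (X[va], T[va]), (X[ve], T[ve])

def should_stop(val_sse_history, patience=3):
    """Stop when the validation SSE (E25%) has increased on `patience`
    consecutive checks; the patience window is an assumption added here
    to tolerate noisy error estimates."""
    h = val_sse_history
    if len(h) <= patience:
        return False
    return all(h[-k] > h[-k - 1] for k in range(1, patience + 1))
```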
Let there be a sample of Q exemplar vectors {x(1),...,x(Q)} from K classes (K ≤ Q). There may be multiple exemplars for certain classes. For each exemplar x(q) there is an associated target output vector t(k) = t(k(q)) that identifies its class number k = k(q). The problem is to train the MLP by adjusting the weights w = (w11,...,wNM) and u = (u11,...,uMJ) as shown in Figure 4.1 until each exemplar x(q) is mapped into an output z(q) that is very close to t(k) = t(k(q)). In cases of multiple input exemplars for the same class, the same target vector is associated with each such input exemplar feature vector. If x(q1) ≠ x(q2) are two different exemplars for Class k, then t(k(q1)) = t(k(q2)) because k(q1) = k(q2). A neural network is a black box, that is, an input/output system for which we need not know the inner workings. This contrasts with an expert rule-based system, where every logical implication is known in a set of rules that provides explanation of the partial steps (Looney, 1993). After we train an MLP, it merely maps inputs to outputs with no explanation of its behavior. MLPs are easy to use, however, and can be quickly trained on datasets, whereas even a modest expert system requires several man-months or man-years of development.
The Total Sum-Squared Error Function E

To force each actual output z(q) toward the correct output target t(k(q)), we adjust the weights so as to minimize the total sum-squared error (TSSE) E between the targets {t(k): k = 1,...,K} and the actual outputs {z(q): q = 1,...,Q}, over all Q exemplars. The TSSE is defined via the Euclidean distance to be

   E = Σ(q=1,Q) ||t(q) - z(q)||²   (4-1a)
The partial sum-squared error (PSSE) with respect to a single exemplar input/output pair (x(q),t(k(q))) is designated by E(q) and defined via

   E(q) = Σ(j=1,J) (tj(q) - zj(q))²   (4-1c)
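As a concrete rendering of Equations (4-1a) and (4-1c), here is a short Python sketch; the arrays T and Z are assumed to hold the Q target and actual output vectors.

```python
import numpy as np

def psse(t_q, z_q):
    """E(q) of Equation (4-1c): the sum over the J output components
    of the squared differences (t_j - z_j)^2 for one exemplar pair."""
    return np.sum((t_q - z_q) ** 2)

def tsse(T, Z):
    """E of Equation (4-1a): the sum of the squared Euclidean distances
    ||t(q) - z(q)||^2 over all Q exemplar pairs."""
    return sum(psse(t_q, z_q) for t_q, z_q in zip(T, Z))
```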
We consider E = E(w,u) to be a function of the weights w = (w11,...,wNM) and u = (u11,...,uMJ). The general minimum MSE methodology was invented and used independently by Gauss (1809) and Legendre (1810) in the late 1790s. E = E(w,u) is defined (refer to Figure 4.1) in detail by

   E(w,u) = Σ(q=1,Q) ||t(q) - z(q)||² = Σ(q=1,Q) (Σ(j=1,J) [tj(q) - zj(q)]²)
          = Σ(q=1,Q) (Σ(j=1,J) [tj(q) - g(sj(q))]²)
          = Σ(q=1,Q) (Σ(j=1,J) [tj(q) - g(Σ(m=1,M) umj ym(q))]²)
          = Σ(q=1,Q) (Σ(j=1,J) [tj(q) - g(Σ(m=1,M) umj h(rm(q)))]²)
          = Σ(q=1,Q) (Σ(j=1,J) [tj(q) - g(Σ(m=1,M) umj h(Σ(n=1,N) wnm xn(q)))]²)   (4-2)
where h(-) and g(-) are sigmoidal activation functions for the hidden and output layers, respectively. The function E(w,u) is a nonnegative continuously differentiable function on the weight space [-b,b]^(NM+MJ) (b > 0), which is a finite dimensional closed bounded domain that is complete and thus compact. Therefore, E(w,u) attains its minimum at some point (w*,u*) in the weight domain. This doesn't mean that the sum-squared error E will be zero at the solution weight set (w*,u*), but only that E will assume its minimum value there on the given weight domain. If the target vectors {t(k)} are chosen judiciously to be far apart and if the exemplars for different classes are not too close, then the minimum mapping will successfully recognize the input feature vectors by mapping them to their class identifiers. Section 9.7 gives conditions under which there exists a unique exact solution.
To solve for the minimizing weight set (w*,u*), we use the necessary conditions

   ∂E(w*,u*)/∂wnm = 0,   ∂E(w*,u*)/∂umj = 0   (4-3)
We cannot solve these nonlinear equations in closed form, but we can approximate the solution (w*,u*) iteratively with steepest descent. The next section discusses this method and derives an algorithm by means of chain rules (see Appendix 4 for an intuitive review of chain rules and vector calculus).
In the single-variable case we could set df(w)/dw = 0 and solve for w = wlocmin. However, in the general case of nonlinear f(-) we can only find an approximate solution wapprox to wlocmin by iterative methods. Starting from some initial point w(0), we move a step in the direction of steepest descent to w(1) = w(0) - η df(w(0))/dw, which is opposite to the direction of steepest ascent. Note that the direction is either positive or negative along the w-axis. Figure 4.3 shows the first step. For the iterative (r+1)st step, we have

   w(r+1) = w(r) - η(df(w(r))/dw)   (4-5)

Figure 4.3 - Approximating a Local Minimum
The step gain η > 0 amplifies or attenuates the step size. If the step were too large, then it would move past the local minimum wlocmin, while if it were too small, a large number of steps might not yet reach the local minimum. The difficult problem of setting a proper step gain is addressed in Chapter 6. The step gain is called the learning rate in the literature on neural network training. In Figure 4.4, where y = f(w1,w2), the gradient is the vector of partial derivatives

   ∇f(w1,w2) = (∂f(w1,w2)/∂w1, ∂f(w1,w2)/∂w2)   (4-6)

A function y = f(w1,...,wP) of several variables can be locally minimized in an analogous manner. The iterative updates to approximate a solution in the general case are

   (w1,...,wP) ← (w1,...,wP) - η(∂f(w1,...,wP)/∂w1, ..., ∂f(w1,...,wP)/∂wP)   (4-7)
In vector form, where w = (w1,...,wP) and ∇f(w) is the gradient vector of partial derivatives, this is

   w(r+1) = w(r) - η∇f(w(r))   (4-8)
The normalization of ∇f(w(r)) to unit length would change η. Usually, η is permitted to adapt to absorb any normalizing factors (see Chapter 5). Equation (4-8) is linear in w and provides a piecewise linear approximation for an adjustment to move w(r) toward a local minimum. Appendix 4 derives Newton's second order method and the method of conjugate directions for finding local minima.
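The iteration (4-8) is simple enough to sketch directly. The Python fragment below is a minimal illustration; the quadratic test function, its gradient, and the fixed step gain are hypothetical choices for the sketch.

```python
import numpy as np

def steepest_descent(grad_f, w0, eta=0.25, iters=200):
    """Iterate w(r+1) = w(r) - eta * grad f(w(r)) per Equation (4-8)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad_f(w)     # step opposite to the gradient direction
    return w

# Usage on f(w) = ||w - c||^2, whose gradient is 2(w - c); the iterates
# approach the unique minimizer w* = c.
c = np.array([1.0, -2.0])
w_star = steepest_descent(lambda w: 2.0 * (w - c), w0=np.zeros(2))
```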
We consider again an MLP with N input nodes for the feature vectors, M neurodes in the middle layer, and J neurodes in the output layer. We assume that there is a sample of Q exemplar input feature vectors paired with K output training vectors. Multiple input exemplar vectors for a class must each map to the same output target vector that identifies that class. For example, if x(q1) and x(q2) are from the same kth class, then both map to t(k(q1)) = t(k(q2)) = t(k). The total sum-squared error E is the sum of all of the squared errors over all J output components and over all Q exemplar pairs, where the individual squared errors are ej(q) = (tj(k(q)) - zj(q))². The total SSE can be decomposed into a sum of partial sum-squared errors E = E(1) + ... + E(Q), of which each summand is the SSE over a single qth exemplar pair:

   E(q) = Σ(j=1,J) (tj(k(q)) - zj(q))²   (4-9)

        = Σ(j=1,J) [tj(k(q)) - g(Σ(m=1,M) umj h(Σ(n=1,N) wnm xn(q)))]²   (4-10a)
In the derivation for the steepest descent on E(q) that follows, we suppress the superscript "(q)" for convenience. We derive the error backpropagation algorithm for any single qth fixed exemplar pair, based on the PSSE (partial sum-squared error) function

   E(q) = E(q)(w,u) = Σ(j=1,J) (tj - zj)²   ( = Σ(j=1,J) (tj(q) - zj(q))², fixed q )   (4-10b)
The sigmoids are unipolar, that is,

   h(rm) = 1/[1 + exp(-α1 rm + b1)],   g(sj) = 1/[1 + exp(-α2 sj + b2)]   (4-10c)

are the activation functions at the hidden and output layers, respectively. In the derivation that follows, bi is a bias (i = 1, 2), αi is the rate factor in the exponential, and r and s are the sums of the products of weights times the incoming line values (see Figure 4.1). Iterative approximation may take a generalized Newtonian, or quasi-Newton, form (see Linz (1979, p. 146)). The simplest of these is the steepest descent linearization

   w(r+1) = w(r) - η[∇w E(w(r))] = w(r) + Δw   (4-11a)
   u(r+1) = u(r) - η[∇u E(u(r))] = u(r) + Δu   (4-11b)
for updating the weights w at the hidden neurodes, and u at the output neurodes, respectively, on the (r+1)st iteration. Upon taking each weight individually, we obtain the formulas

   wnm(r+1) = wnm(r) - η(∂E(w(r),u(r))/∂wnm)   (4-12a)
   umj(r+1) = umj(r) - η(∂E(w(r),u(r))/∂umj)   (4-12b)
We now derive the backpropagation training equations to minimize the PSSE function E(q) for any fixed qth exemplar pair (x(q),t(q)). We suppress the q notation for convenience. Appendix 4 explains intuitively the chain rules that we use.
The Derivation of Backpropagation with Unipolar Sigmoids

We first derive the computational formula for the weights {umj} at the output neurodes by: i) applying the chain rule repeatedly to the partial derivative in Equation (4-12b); ii) using

   sj = Σ(m=1,M) umj ym   (4-13)

as the sum at the jth output neurode; and iii) using g(sj) = [1 + exp(-sj+b)]⁻¹ as the unipolar activation function. We need not include αi in the derivation (see Equation (4-10c)) because any constant multiplier can be absorbed into the step gain η. We also use b for the bi for notational convenience. The derivation is done on the PSSE E(q), which is denoted by E (q is suppressed) for notational convenience. The derivation for the umj increment is

   ∂E/∂umj = (∂E/∂sj)(∂sj/∂umj)                {note: sj = sj(umj), functionally}
           = [(∂E/∂zj)(∂zj/∂sj)](∂sj/∂umj)     {note: zj = g(sj), functionally}
           = [(∂/∂zj)(Σ(p=1,J) (tp - zp)²)][g′(sj)][∂/∂umj(Σ(q=1,M) yq uqj)]
                                               {note: p is a dummy variable for j; (∂/∂sj)g(sj) = g′(sj); q is a dummy for m}
           = [2(-1)(tj - zj)][g′(sj)][ym]      {note: ∂/∂umj[Σ(q=1,M) yq uqj] = ym}
           = -2(tj - zj)g′(sj)ym   (4-14)
But the sigmoid activation function g(-) has derivative

   g′(sj) = (d/dsj)g(sj) = (d/dsj)[1 + exp(-sj+b)]⁻¹
          = (-1)[1 + exp(-sj+b)]⁻² exp(-sj+b)(-1)
          = [1 + exp(-sj+b)]⁻²[exp(-sj+b)]     {note: zj = [1 + exp(-sj+b)]⁻¹}
          = [zj]²[1 + exp(-sj+b) - 1]          {note: add (1 - 1) to rightmost factor}
          = [zj]²[1/zj - 1] = [zj]²[(1-zj)/zj] = zj(1-zj)

so that

   g′(sj) = zj(1-zj)   (4-15)

Now we substitute Equation (4-15) back into Equation (4-14) to obtain

   ∂E/∂umj = -2(tj - zj)zj(1-zj)ym   (4-16)

Substituting Equation (4-16) into Equation (4-12b) provides the update on the (r+1)st iteration as

   umj(r+1) = umj(r) + η(tj - zj)zj(1-zj)ym   (4-17)

where the 2 has been absorbed into the step gain η.
Second, we derive the weight increments {wnm} at the middle neurodes via: i) applying the chain rule repeatedly to the partial derivative in (4-12a); ii) using

   rm = Σ(n=1,N) wnm xn   (4-18)

for the sums; and iii) using the hidden layer activation function h(rm) = 1/[1 + exp(-rm+b)] and its derivative h′(rm) = ym(1-ym). The previous remark about omitting αi applies here as well. The derivation is

   ∂E/∂wnm = (∂E/∂rm)(∂rm/∂wnm)
           = [(∂E/∂ym)(∂ym/∂rm)](∂rm/∂wnm)
           = (∂/∂ym)[E(s(ym))][h′(rm)][xn]     {note: (∂/∂wnm)(Σ(n=1,N) xn wnm) = xn; E = E(s(ym)) = E(s1(ym),...,sJ(ym))}
           = {Σ(j=1,J) (∂/∂sj)[(tj - zj)²](∂sj/∂ym)}[h′(rm)][xn]
                                               {note: pass ∂/∂ym to inside of the summation; use the chain rule on the full sum}
           = {Σ(j=1,J) (2)(tj - zj)(-1)g′(sj)[∂sj/∂ym]}[h′(rm)][xn]   {note: ∂zj/∂sj = g′(sj)}
           = {Σ(j=1,J) (-2)(tj - zj)[g′(sj)][(∂/∂ym)Σ(i=1,M) yi uij]}[h′(rm)][xn]
                                               {note: sj = Σ(i=1,M) yi uij; i is a dummy for m}
           = {Σ(j=1,J) (-2)(tj - zj)[g′(sj)]umj}[h′(rm)][xn]
           = {Σ(j=1,J) (-2)(tj - zj)[zj(1-zj)]umj}[ym(1-ym)][xn]
                                               {note: g′(sj) = zj(1-zj) from above; h′(rm) = ym(1-ym) analogously}

Therefore

   ∂E/∂wnm = {Σ(j=1,J) (-2)(tj - zj)[zj(1-zj)]umj}[ym(1-ym)]xn   (4-19)

Upon substituting Equation (4-19) into (4-12a), and absorbing the 2 into η, we obtain the computational formula on the (r+1)st iteration, which is

   wnm(r+1) = wnm(r) + η{Σ(j=1,J) (tj - zj)[zj(1-zj)]umj(r)}[ym(1-ym)]xn   (4-20)
Note that Equation (4-20) sums the differences (tj - zj) over all j = 1,...,J output neurodes. This is intuitive because every output difference affects each weight wnm at the hidden layer, whereas only a single difference affects each weight in the output layer. The original backpropagation (BP) algorithm uses Equations (4-17) and (4-20) to update each weight for a fixed qth exemplar input/output pair (Rumelhart et al, 1986). This constitutes one iteration (training all weights on a single exemplar pair (x(q),t(k(q))) to minimize E(q)). BP repeats this for each qth exemplar pair until all Q PSSEs E(q) have been used in training. This entire process over each PSSE one time constitutes an epoch. A single epoch takes a minimizing step over each of the PSSE functions E(1),...,E(Q), and on each such partial E(q), all of the weights are adjusted. Thus a single epoch adjusts each weight Q times. A large number I of epochs may be required for training (each weight is adjusted QI times). To recapitulate, the unipolar learning equations for each qth exemplar (q is not suppressed) are

   umj ← umj + η1(tj(q) - zj(q))zj(q)(1-zj(q))ym(q)   (4-21a)

   wnm ← wnm + η2{Σ(j=1,J) (tj(q) - zj(q))[zj(q)(1-zj(q))]umj}[ym(q)(1-ym(q))]xn(q)   (4-21b)
Backpropagation with Bipolar Sigmoids

Many researchers now use the bipolar sigmoid in place of the unipolar one to eliminate the bias b as a source of error and computational need (see Chapter 6). The derivation is the same except for the derivatives of h(rm) and g(sj), which now become the derivatives of the bipolar sigmoids, denoted here by H(rm) and G(sj). Let

   z = G(s) = 2{1/[1 + exp(-αs)]} - 1   (4-22a)

whose rational form is

   z = G(s) = [1 - exp(-αs)]/[1 + exp(-αs)]   (4-22b)

From Equation (4-22a), we obtain

   dz/ds = G′(s) = 2α[1 + exp(-αs)]⁻²(exp(-αs))   (4-23)

Upon adding 1 to each side of Equation (4-22a), we obtain

   (1 + z) = 2/[1 + exp(-αs)]   (4-24)

We solve for exp(-αs) from Equation (4-24) by

   exp(-αs) = 2/(1 + z) - 1 = (1 - z)/(1 + z)   (4-25a)

Similarly, we use Equation (4-24) to solve for 1 + exp(-αs) by

   1 + exp(-αs) = 2/(1 + z)   (4-25b)

The substitution for exp(-αs) and 1 + exp(-αs) from Equations (4-25a,b) into Equation (4-23) yields

   dz/ds = G′(s) = 2α[1 + exp(-αs)]⁻²[exp(-αs)] = 2α[(1 + z)²/2²][(1 - z)/(1 + z)] = α(1 + z)(1 - z)/2   (4-26)

Dropping the factor α (it will be absorbed by the step gain η) yields the results

   G′(sj) = (1 + zj)(1 - zj)/2   (4-27a)

   H′(rm) = (1 + ym)(1 - ym)/2   (4-27b)
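The bipolar derivative (4-26) can be checked numerically the same way as (4-15); the value α = 1.7 below is an arbitrary choice for the sketch.

```python
import numpy as np

def G(s, alpha):
    """Bipolar sigmoid G(s) = 2/[1 + exp(-alpha*s)] - 1, Equation (4-22a)."""
    return 2.0 / (1.0 + np.exp(-alpha * s)) - 1.0

alpha = 1.7
s = np.linspace(-3.0, 3.0, 13)
z = G(s, alpha)
eps = 1e-6
numeric = (G(s + eps, alpha) - G(s - eps, alpha)) / (2.0 * eps)
# Verify Equation (4-26): G'(s) = alpha*(1 + z)*(1 - z)/2.
assert np.allclose(numeric, alpha * (1.0 + z) * (1.0 - z) / 2.0, atol=1e-8)
```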
The bipolar update formulas come from substituting Equations (4-27a,b) into the computational formulas of Equations (4-21a,b), respectively, to obtain (using the PSSE E(q))

   umj ← umj + η1(tj(q) - zj(q))[(1+zj(q))(1-zj(q))/2]ym(q)   (4-28a)

   wnm ← wnm + η2{Σ(j=1,J) (tj(q) - zj(q))[(1+zj(q))(1-zj(q))/2]umj}[(1+ym(q))(1-ym(q))/2]xn(q)   (4-28b)
Paul Werbos used backpropagation for regression analysis (Werbos, 1974). A similar stochastic approximation had previously been used by Robbins and Monro (1951). A recent book (Werbos, 1994) discusses the history and development of backpropagation. The books by Fu (1994), Kosko (1992), Hecht-Nielsen (1990), and Wasserman (1989) provide historical notes, while Fausett (1994) and Zurada (1992) are also good references. Chapter 6 discusses a more efficient method, called conjugate gradient directions, for accelerated gradient minimization. Quasi-Newton methods (Parker, 1982) are second order methods that converge more rapidly, but require more computation per iterative step.
Figure 4.5 - A Backpropagation Flow Chart

In the flow chart of Figure 4.5, the weight increments are computed and added to the current weights to obtain the new weights wnm(r+1) and umj(r+1). If the test (q = Q?) is false, the process returns to the fifth step. If it is true, then an epoch has been completed, so the test (r = I?) is made. If false, then r is incremented via r ← r+1 and the process returns to the fourth step for another epoch; otherwise, the process terminates (I epochs have been completed). The updated weights are computed from the computational formulas of Equations (4-21a,b) or (4-28a,b). These are sometimes written in the form of incremental weights as

   umj(r+1) = umj(r) + Δumj,   wnm(r+1) = wnm(r) + Δwnm   (4-29)
A Backpropagation Algorithm

Inputs: {number of input nodes N; number of middle neurodes M; number of output neurodes J; number of exemplar vectors Q; number of identifiers (classes) K; the exemplar vectors {x(q)} and paired identifier vectors {t(k(q))}; number of epochs I; and biases bi and decay rates αi (i = 1,2)}

Outputs: {the weights w = (w11, w21,...,wNM) and u = (u11, u21,...,uMJ) and the total SSE E}
Step 1: /Input N, M, J, Q, exemplar input vectors and corresponding identifiers {x(q),t(q)}, and I/
   read MLP file;                      /Data is stored in file/
   input I;                            /Input no. of epochs desired from keyboard/

Step 2: /Set parameters α1, b1, α2, b2, η/
   b1 ← N/2.0;  b2 ← M/2.0;            /For bipolar sigmoids these are zero/
   α1 ← 2.4;  α2 ← 2.4;
   η1 ← 0.4;  η2 ← 0.25;               /Parameters may be different from these/

Step 3: /Generate initial weights randomly between -0.5 and 0.5/
   for m = 1 to M do
      for n = 1 to N do wnm ← Random() - 0.5;     /Draw uniform(0,1), shift down/
      for j = 1 to J do umj ← Random() - 0.5;

Step 4: /Adjust all weights via steepest descent method/
   for r = 1 to I do                   /Do I epochs/
      for q = 1 to Q do                /with each over all Q exemplar pairs/
         Update_NN();                  /Call procedure to update MLP/
         for m = 1 to M do             /Update MJ umj's and NM wnm's/
            for j = 1 to J do
               umj(r+1) ← umj(r) + η1(tj(q) - zj(q))[zj(q)(1-zj(q))]ym(q);
            for n = 1 to N do          /For each m, sum over J outputs/
               wnm(r+1) ← wnm(r) + η2{Σ(j=1,J) (tj(q) - zj(q))[zj(q)(1-zj(q))]umj(r)}[ym(q)(1-ym(q))][xn(q)];
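A minimal Python sketch of the algorithm above follows, implementing Equations (4-21a,b) with unipolar sigmoids. The rate factors αi are absorbed into the step gains and the fixed biases follow Step 2; the file input and the Update_NN() bookkeeping are folded into the loop, so this is a sketch of the method, not the text's exact program.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, T, M, eta1=0.4, eta2=0.25, epochs=2000, seed=0):
    """Backpropagation, one exemplar pair at a time (Equations (4-21a,b))."""
    Q, N = X.shape
    J = T.shape[1]
    b1, b2 = N / 2.0, M / 2.0                    # Step 2: fixed biases
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (N, M))           # Step 3: w_nm in [-0.5, 0.5]
    U = rng.uniform(-0.5, 0.5, (M, J))           #         u_mj in [-0.5, 0.5]
    for _ in range(epochs):                      # Step 4: I epochs
        for q in range(Q):                       # over all Q exemplar pairs
            x, t = X[q], T[q]
            y = sigmoid(x @ W - b1)              # hidden outputs y_m = h(r_m)
            z = sigmoid(y @ U - b2)              # actual outputs z_j = g(s_j)
            dz = (t - z) * z * (1.0 - z)         # (t_j - z_j) z_j (1 - z_j)
            dy = (U @ dz) * y * (1.0 - y)        # error propagated back to y_m
            U += eta1 * np.outer(y, dz)          # Equation (4-21a)
            W += eta2 * np.outer(x, dy)          # Equation (4-21b)
    return W, U

# Usage: XOR (2-bit parity) with targets at 0.1/0.9 as in Section 4.7;
# convergence depends on the initial weights and step gains.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.1], [0.9], [0.9], [0.1]])
W, U = train_bp(X, T, M=4)
Z = sigmoid(sigmoid(X @ W - 1.0) @ U - 2.0)      # b1 = N/2 = 1, b2 = M/2 = 2
```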
The function Random() draws a uniform random value in the interval [0,1]. Early researchers restricted the initial weight set to magnitudes in [-0.5,0.5], as it was thought that the deeper local minima are close to the origin. Current research on this has mixed results. We note that during the training, some weights move away from the origin. Fahlman found that the initial intervals [-1,1] and [-2,2] yielded results that were just as good on his datasets as did the interval [-0.5,0.5] (Fahlman, 1988), while McCormack and Doherty (1993) used [-5,5] when training on certain data sets. The function Update_NN() puts the qth exemplar input vector through the network to update all of the ym(q) and zj(q) values put out by the neurodes. The advantages of BP and the MLP type of FANNs are: i) the learning is somewhat independent of the order in which the exemplar feature vectors are presented; ii) the architecture can be manipulated for better
results (see Chapter 9 on neural engineering); and iii) the operational mode can be performed with parallel processors. The disadvantages are: i) the training may converge to a local minimum that is shallow, so that the learning is not robust; ii) the step gain (learning rate) cannot be predicted in advance and may be either too small, so that too many steps are required to converge, or too large, so that the process oscillates instead of converging; iii) the derivatives may approach zero so that the computed steps are essentially zero, in which case a large number of steps does not move the weight point much; iv) the gradient provides only a linear approximation to the actual local direction of steepest descent, and the approximate directions may change drastically from step to step; v) upon changing the value of one or more weights, the PSSEs (partial sum-squared errors) E(q) are changed as a function of the other weights, which is the moving target effect (thrashing occurs during each epoch, where a step to minimize E(q) tends to increase some other E(p)); and vi) the network may overtrain on the presented feature vectors and become too specialized and unable to accurately recognize other similar vectors. The first problem, of local minima, was demonstrated in a simple example by Gori and Tesi (1992), who showed on nonlinearly separable patterns that BP becomes stuck in a local minimum on a FANN that contains a single hidden layer. We may train an MLP many times from different initial weight points and keep the weight set that yields the lowest total SSE (sum-squared error) E. This would appear to be a fruitful strategy for dealing with shallow undesirable local minima. However, the lowest SSE does not necessarily provide the best learning, as was observed by Rumelhart et al (1986), and sometimes indicates specialization (see Section 9.3). The second problem, of no a priori information about the learning rate, requires a strategy for adjusting it on the way to the minimum. It should be sufficiently large on the early iterations, but as the process descends into the well of a local minimum, it must be decreased appropriately. One of the early methods was the delta-bar-delta method (Jacobs, 1988), which we discuss in the next chapter. The third problem is saturation, where a sigmoid derivative h′(rm) or g′(sj) approaches zero, which causes the weight increments -η∇E(w,u) to become essentially zero unless the learning rate (step gain) η = η(r) increases enormously to compensate for it. The fourth problem, of oscillating directions of steepest descent, can be handled by smoothing. This was first done by Rumelhart et al (1986) with a "momentum" term (see Chapter 5) that performs the required smoothing. The fifth problem, of a moving target, is a fact of life with MLPs. When any weights are changed, then E(w,u) becomes a different function on the remaining weights. The next chapter discusses this further. The problem of overtraining can be handled by testing during training and afterwards, as discussed in Section 9.8.
In practice, however, convergence often takes place, especially with small to moderately sized MLPs. When it does not, training may converge on another run from a different initial weight set, or it may require a change in the MLP architecture. An adaptive step gain can improve the rate of convergence, as we will see in Chapter 5. The more accurate Newtonian formulations use second order approximations of the SSE function E that may include the Hessian matrix of second order mixed partial derivatives (discussed in Appendix 4 and Chapter 5). Parker used second order quasi-Newtonian methods to train MLPs (Parker, 1982). Conjugate gradient methods offer a tradeoff between BP and second order Newton or quasi-Newton methods in that they converge more rapidly than BP but use significantly less computation than the quasi-Newton and Newton methods.
Backpropagation for Two Hidden Layers with Unipolar Sigmoids

We first derive the computational formulas for updating the weights via steepest descent in an extended BP algorithm. For specificity, we use the unipolar sigmoids. We suppress the index "q" of the PSSE E(q), as we did above. The partial derivatives ∂E/∂vnl, ∂E/∂wlm, ∂E/∂umj are to be used in the weight updates

   vnl(r+1) = vnl(r) - η ∂E(vnl,wlm,umj)/∂vnl   (4-30a)
   wlm(r+1) = wlm(r) - η ∂E(vnl,wlm,umj)/∂wlm   (4-30b)
   umj(r+1) = umj(r) - η ∂E(vnl,wlm,umj)/∂umj   (4-30c)
From Figure 4.6, we see that the output layer has all of the same parameter names and indices, so we can use Equation (4-21a) for the increments on umj, which gives

   umj(r+1) = umj(r) + η(tj - zj)zj(1-zj)ym   (4-31)
The hidden layer adjacent to the output layer is the second hidden layer. The difference between the variable designations here and those used to derive Equation (4-21b) is that the inputs to these hidden neurodes are a = (a1,...,aL) instead of x = (x1,...,xN). Upon making these changes in Equation (4-21b), we obtain

   wlm(r+1) = wlm(r) + η{Σ(j=1,J) (tj - zj)[zj(1-zj)]umj}[ym(1-ym)]al   (4-32)
The derivation of the weight updates at the first hidden layer is more tedious and uses different nomenclature of variables and indices. We omit the column of notes here, suppress the "q" superscripts for the exemplar number, and use the power of chain rules. But first, we describe E = E(vnl,wlm,umj) as a function of its weights and intermediate variables by

   E = E(vnl,wlm,umj) = Σ(j=1,J) (tj - zj)² = Σ(j=1,J) (tj - g(sj))²
     = Σ(j=1,J) (tj - g(Σ(m=1,M) umj ym))²
     = Σ(j=1,J) (tj - g(Σ(m=1,M) umj h(rm)))²
     = Σ(j=1,J) (tj - g(Σ(m=1,M) umj h(Σ(l=1,L) wlm al)))²
     = Σ(j=1,J) (tj - g(Σ(m=1,M) umj h(Σ(l=1,L) wlm f(pl))))²
     = Σ(j=1,J) (tj - g(Σ(m=1,M) umj h(Σ(l=1,L) wlm f(Σ(n=1,N) vnl xn))))²   (4-33a)
It is obvious from Figure 4.6, and less obvious from Equation (4-33a), that each weight vnl at the first hidden layer is affected not only by every difference (tj - zj) at the output layer, but also by every neurode in the second hidden layer. Thus we need to sum the total error adjustments over all j = 1,...,J and over all m = 1,...,M. We first
derive the gradient of the SSE with respect to vnl over a single jth output and a single mth neurode in the second hidden layer. Thus we use

   Emj = (tj - g(umj h(Σ(l=1,L) wlm f(Σ(n=1,N) vnl xn))))²   (4-33b)
From Equation (4-33b) it is easy to write down the chain rule in terms of dependent variables that start at the output layer and work backward to the first hidden layer, which leads to the partial derivative

   ∂Emj/∂vnl = (∂E/∂zj)(∂zj/∂vnl)
             = (∂E/∂zj)(∂zj/∂sj)(∂sj/∂vnl)
             = (∂E/∂zj)(∂zj/∂sj)(∂sj/∂ym)(∂ym/∂vnl)
             = (∂E/∂zj)(∂zj/∂sj)(∂sj/∂ym)(∂ym/∂rm)(∂rm/∂vnl)
             = (∂E/∂zj)(∂zj/∂sj)(∂sj/∂ym)(∂ym/∂rm)(∂rm/∂al)(∂al/∂vnl)
             = (∂E/∂zj)(∂zj/∂sj)(∂sj/∂ym)(∂ym/∂rm)(∂rm/∂al)(∂al/∂pl)(∂pl/∂vnl)
             = [(-2)(tj - zj)][g′(sj)umj][h′(rm)][wlm f′(pl)][xn]   (4-34a)

Because we are using unipolar sigmoids, their derivatives all have the same form (·)(1-·). Substituting into Equation (4-34a) and absorbing the 2 into η, we obtain the (r+1)st iterate

   vnl(r+1) = vnl(r) + η{(tj - zj)[zj(1-zj)]umj}[ym(1-ym)]wlm al(1-al)xn   (4-34b)

Now we sum over all such parts for j = 1,...,J and m = 1,...,M to obtain the final update

   vnl(r+1) = vnl(r) + η{Σ(j=1,J) (tj - zj)zj(1-zj)[Σ(m=1,M) umj ym(1-ym)wlm]al(1-al)xn}   (4-35)
Recapitulating for easy reference, the case of unipolar sigmoids for extended backpropagation yields

   umj ← umj + η(tj - zj)zj(1-zj)ym   (4-36)

   wlm ← wlm + η{Σ(j=1,J) (tj - zj)zj(1-zj)umj}[ym(1-ym)]al   (4-37)

   vnl ← vnl + η{Σ(j=1,J) (tj - zj)zj(1-zj)[Σ(m=1,M) umj ym(1-ym)wlm]al(1-al)xn}   (4-38)
Backpropagation for Two Hidden Layers with Bipolar Sigmoids

If, on the other hand, we use the bipolar sigmoids, whose derivatives have the form (1+·)(1-·)/2, then the weight updates are

   umj ← umj + η(tj - zj)[(1+zj)(1-zj)/2]ym   (4-39)

   wlm ← wlm + η{Σ(j=1,J) (tj - zj)[(1+zj)(1-zj)/2]umj}[(1+ym)(1-ym)/2]al   (4-40)

   vnl ← vnl + η{Σ(j=1,J) (tj - zj)[(1+zj)(1-zj)/2][Σ(m=1,M) umj[(1+ym)(1-ym)/2]wlm][(1+al)(1-al)/2]xn}   (4-41)
Equations (4-36), (4-37), and (4-38) coincide with those of Rogers and Kabrisky (1993).
The components of the target identifiers may be: i) binary, in which case we use {0,1}; or ii) bipolar, where we use {-1,1}. The Hamming distance between two J-bit identifier codewords is the number of positions (components) in which the codewords differ. For example, the Hamming distance between (1,1,0,0,1) and (1,0,1,0,1) is 2 (they differ in the second and third positions). If we choose K identifiers for K classes such that the Hamming distance between each pair of them is at least 3, then a single error in the mapping can be corrected. The output vector in error is closer to its correct identifier than to any other codeword and is thus recognized correctly, because it differs in only one position from the correct codeword but differs in at least two positions from any other codeword. But error correction comes with a cost. If the number of classes K is large, then J will need to be significantly larger to select K codewords that differ pairwise by a Hamming distance of 3 or more, so that there will be many output neurodes. This increases the number of weights to be trained and thus increases the complexity of computation both in training and in recognition. For single error detection, the pairs of identifier codewords must be at least a Hamming distance of 2 from each other. A single error will result in a codeword that is not an identifier and that exposes the error. Error detection and correction require that each output component be changed to either the high value or the low value, whichever it is closest to. A reasonable trade-off between the extremes of error correction and single output is to use the number of outputs J, where 2^(J-1) < K ≤ 2^J (for K classes needed). The use of J = 8 output neurodes and a binary alphabet, for example, allows K = 2⁸ = 256 unique class identifiers. The input components are better left with continuous (analog) values that contain finer information than discretized values. In case the x(q) are discretized, the resolution should be high (a fairly large number of discrete values) so that less information will be lost.
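A short sketch of the codeword check described above follows; the four J = 5 identifiers in the usage lines are a hypothetical choice, not ones prescribed by the text.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(codewords):
    """Smallest Hamming distance over all pairs: >= 2 permits single-error
    detection, >= 3 permits single-error correction."""
    return min(hamming(a, b) for a, b in combinations(codewords, 2))

# Usage: K = 4 identifiers with J = 5 components, checked for correction.
ids = [(0, 0, 0, 0, 0), (1, 1, 1, 0, 0), (0, 0, 1, 1, 1), (1, 1, 0, 1, 1)]
assert min_pairwise_distance(ids) >= 3
```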
Similarly, if the bipolar output equation is set to 1 via

   zj = [1 - exp(-α(sj - b))] / [1 + exp(-α(sj - b))] = 1

then -exp(-α(sj - b)) = exp(-α(sj - b)), which means that exp(-α(sj - b)) = 0, so that sj = ∞. If

   zj = [1 - exp(-α(sj - b))] / [1 + exp(-α(sj - b))] = -1

then [1 - exp(-α(sj - b))] = -[1 + exp(-α(sj - b))], so that 1 = -1, unless sj = -∞. The conclusion is that we never put the desired output vector components at 0 or 1 (unipolar) nor at -1 or 1 (bipolar). We often use 0.1 for 0 and 0.9 for 1 to make comparative runs, but in practice, we prefer to use 0.2 for 0 and 0.8 for 1 (unipolar), or -0.8 for -1 and 0.8 for 1 (bipolar). This causes less saturation and quicker convergence. It also helps prevent weight drift.
How Important Are Biases, Exponential Rates and Learning Rates?

Chapter 5 discusses training on the biases b1 and b2 and exponential rates α1 and α2 via

   bp ← bp - η(∂E/∂bp),   αr ← αr - η(∂E/∂αr),   p, r = 1,2   (4-42)

Such adjustments can lower the TSSE E. The derivatives given in Equations (4-42) are derived in Chapter 5, but it is often best to fix these parameters to avoid the moving target effect. For large αi, the sigmoid derivatives approach zero for some neurodes and the convergence virtually halts. We may take: N/2 ≤ b1 ≤ 1.2N/2, M/2 ≤ b2 ≤ 1.2M/2. The step gains (learning rates) must be adapted as the training progresses for satisfactory convergence speed (see Section 5.3).
What Do the Error Functions Look Like?

To plot slices of the graphs of the SSE E, we suboptimize E with respect to a set of exemplar training pairs {(x(q),t(q)): q = 1,...,Q} from a small dataset. Next, we hold all values of the suboptimal weight set fixed except for a single pair (wnm,umj) at a time, where the values n, m, and j are randomly selected from 1 to, respectively, N, M, and J. We increment both wnm and umj from -2.5 to 2.5 by Δ = 0.007812 in Figures 4.7 through 4.11, which display graphs of slices of the SSE function E as described above. Appendix 11 describes the datasets.
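The slicing procedure can be sketched directly. In the fragment below, E is assumed to be any function that maps a full weight vector to the scalar SSE, and i and j index the two weights being swept; this interface is hypothetical.

```python
import numpy as np

def sse_slice(E, w_star, i, j, lo=-2.5, hi=2.5, step=0.007812):
    """Sweep weights i and j of the suboptimized weight vector w_star
    over [lo, hi) in the increment used for Figures 4.7-4.11, holding
    all other weights fixed, and tabulate the SSE surface."""
    grid = np.arange(lo, hi, step)
    surface = np.empty((grid.size, grid.size))
    for p, wi in enumerate(grid):
        for r, wj in enumerate(grid):
            w = w_star.copy()
            w[i], w[j] = wi, wj
            surface[p, r] = E(w)
    return grid, surface
```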
Figure 4.7 - A Slice of the SSE Function E for the Blood10 Data Set
The slice of Figure 4.7 (Blood10 dataset) has two local minima in the weight domain of the slice. Both are fairly deep, but one is definitely deeper than the other. The slice shown in Figure 4.8 (on the Digit12 dataset) has two minima with a slightly greater difference in depth, while Figure 4.9 (on the Rotate9 dataset) has an even greater difference between the two minima. These slices with two minima are essentially quartic, that is, they can be fit by a quartic polynomial. Figure 4.10 (on the Ten5ten dataset) has a single minimum, but it is not a true quadratic. It behaves more like a quartic, and there may be another local minimum off the graph to the right. The slice in Figure 4.11 (on the Parity3 dataset) is different in that it has both a very shallow and a very deep local minimum. It is clear that a solution stuck in the shallow minimum would be unsatisfactory. We note that all of the curves approximate either quadratic or quartic polynomials (none have linear, cubic, or quintic form).
It can be observed from Figures 4.8 through 4.11 that on the one hand there are multiple global minima, or at least multiple deep minima that are approximately global, but that on the other hand, there are shallow local minima that would be bad solutions. When the training process falls into a shallow minimum during gradient descent, it becomes trapped there and the local minimum becomes a solution that may be very unsatisfactory.
This not only reduces the volume of computation, but eliminates much of the convoluted ridges and valleys from the surface of the error function E. Perhaps the greatest benefit is that it prevents any saturation at the output layer, because the derivative can never be close to zero, as we see from

   g′(sj) = 1/J   (4-45)

It also permits the target output components to take the value 0 or 1 (or -1 or 1). Upon substitution of Equation (4-45) into Equations (4-21a,b), with the factor 1/J taken into the learning rate, the formulas for slickpropagation (SP or slickprop) with unipolar sigmoids at the hidden layer become

   umj ← umj + (η1/J)(tj(q) - zj(q))ym(q)   (4-46a)

   wnm ← wnm + (η2/J){Σ(j=1,J) (tj(q) - zj(q))umj}[ym(q)(1-ym(q))]xn(q)   (4-46b)
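A sketch of one SP step per Equations (4-46a,b) follows, assuming the linear output activation g(sj) = sj/J implied by g′(sj) = 1/J in (4-45); the updates depend only on that derivative, so the sketch holds under this assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slickprop_step(x, t, W, U, eta1=0.4, eta2=0.25):
    """One slickprop update for a single exemplar pair: unipolar sigmoids
    at the hidden layer, a linear output layer that cannot saturate."""
    J = U.shape[1]
    y = sigmoid(x @ W)                   # hidden outputs y_m
    z = (y @ U) / J                      # linear outputs: z_j = s_j / J
    dz = (t - z) / J                     # (t_j - z_j) g'(s_j) with g' = 1/J
    dy = (U @ dz) * y * (1.0 - y)
    U += eta1 * np.outer(y, dz)          # Equation (4-46a)
    W += eta2 * np.outer(x, dy)          # Equation (4-46b)
    return W, U
```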
Similarly, the SP formulas that use bipolar sigmoids in the single hidden layer, given in Equations (4-28a,b), now become

   umj ← umj + (η1/J)(tj(q) - zj(q))ym(q)   (4-47a)

   wnm ← wnm + (η2/J){Σ(j=1,J) (tj(q) - zj(q))umj}[(1+ym(q))(1-ym(q))/2]xn(q)   (4-47b)
For extended MLPs with two hidden layers with unipolar sigmoids, Equations (4-36), (4-37), and (4-38) become

   umj ← umj + (η1/J)(tj - zj)ym

   wlm ← wlm + (η2/J){Σ(j=1,J) (tj - zj)umj}[ym(1-ym)]al

   vnl ← vnl + (η3/J){Σ(j=1,J) (tj - zj)[Σ(m=1,M) umj ym(1-ym)wlm]al(1-al)xn}
Likewise, the extended MLPs with two hidden layers with bipolar sigmoids become

   umj ← umj + η1(tj - zj)ym

   wlm ← wlm + η2{Σ(j=1,J) (tj - zj)umj}[(1+ym)(1-ym)/2]al

   vnl ← vnl + η3{Σ(j=1,J) (tj - zj)[Σ(m=1,M) umj[(1+ym)(1-ym)/2]wlm][(1+al)(1-al)/2]xn}
The first N features are treated linearly in the single layer of neurodes (see Figure 3.14), whose activations are either unipolar or bipolar sigmoids. Once we compute the new feature inputs from Equations (4-50), we obtain

   E = Σ(q=1,Q) Σ(j=1,J) (tj(q) - yj(q))² = Σ(q=1,Q) Σ(j=1,J) (tj(q) - h(rj(q)))²
     = Σ(q=1,Q) Σ(j=1,J) (tj(q) - h(Σ(n=1,N) wnj xn(q)))²   (4-51)
Thus the partial derivatives for steepest descent have the form

   ∂E/∂wnj = 2 Σ(q=1,Q) (tj(q) - yj(q))(-1)[h′(rj(q))]xn(q) = -2 Σ(q=1,Q) (tj(q) - yj(q))[h′(rj(q))]xn(q)   (4-52)

The FLN update algorithm for the unipolar sigmoid h(-) is therefore

   wnj ← wnj + 2η Σ(q=1,Q) (tj(q) - yj(q))[yj(q)(1-yj(q))]xn(q)   (4-53)

and for the bipolar sigmoid it is

   wnj ← wnj + η Σ(q=1,Q) (tj(q) - yj(q))[(1+yj(q))(1-yj(q))]xn(q)   (4-54)
Radial basis function networks (RBFNs) adjust the weights umj at the output layer (see Figure 3.17) via steepest descent. There are no sigmoid activation functions at the output layer. From

   E = Σ(q=1,Q) Σ(j=1,J) (tj(q) - zj(q))² = Σ(q=1,Q) Σ(j=1,J) (tj(q) - (1/M)Σ(m=1,M) ym(q) umj)²   (4-55)

the partial derivatives are

   ∂E/∂umj = -(2/M) Σ(q=1,Q) (tj(q) - zj(q))ym(q)   (4-56)

The RBFN weight updates at the output layer neurodes therefore have the form

   umj ← umj + (2η/M) Σ(q=1,Q) (tj(q) - zj(q))ym(q)   (4-57)
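Below is a Python sketch of the full-batch output-weight update (4-57), assuming Gaussian basis functions with a common spread σ for the network of Figure 3.17; σ and the array layout are assumptions of the sketch.

```python
import numpy as np

def rbfn_output_update(X, T, V, U, sigma=1.0, eta=0.1):
    """One batch update of the RBFN output weights per Equation (4-57).
    X: Q x N inputs, T: Q x J targets, V: M x N centers, U: M x J weights."""
    M = V.shape[0]
    grad_sum = np.zeros_like(U)
    for q in range(len(X)):
        d2 = np.sum((X[q] - V) ** 2, axis=1)    # squared distances to centers
        y = np.exp(-d2 / (2.0 * sigma ** 2))    # hidden outputs y_m
        z = (y @ U) / M                         # z_j = (1/M) sum_m y_m u_mj
        grad_sum += np.outer(y, T[q] - z)       # accumulates (t_j - z_j) y_m
    U += (2.0 * eta / M) * grad_sum             # Equation (4-57)
    return U
```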
The neurodal centers {v(m): m = 1,...,M} at the hidden neurodes may also be trained via steepest descent, where

   E = Σ(q=1,Q) Σ(j=1,J) (tj(q) - (1/M)Σ(m=1,M) ym(q) umj)² = Σ(q=1,Q) Σ(j=1,J) (tj(q) - (1/M)Σ(m=1,M) fm(x(q))umj)²   (4-58)

For the Gaussian basis functions ym(q) = fm(x(q)) = exp(-Σ(n=1,N)(xn(q) - vn(m))²/(2σ²)), differentiation with respect to a center component vn(m) brings out the factor (xn(q) - vn(m)):

   ∂E/∂vn(m) = -(2/Mσ²) Σ(q=1,Q) Σ(j=1,J) (tj(q) - zj(q))umj ym(q)[(xn(q) - vn(m))]   (4-59)

The RBFN updates for the hidden neurode centers are therefore

   vn(m) ← vn(m) + (2η/Mσ²) Σ(q=1,Q) Σ(j=1,J) (tj(q) - zj(q))umj ym(q)[(xn(q) - vn(m))]   (4-60)
Exercises
4.1 Derive the specific iterative approximation formulas for steepest descent for an MLP with N = 2, M = 2, and J = 1. Use the unipolar sigmoid activation function. Write out all sums and differentiate term-by-term.

4.2 Equation (4-15) is the derivative of the unipolar sigmoid g(sj). Find the derivative of the bipolar sigmoid directly from G(s) = 2g(s+b) - 1.

4.3 Repeat Exercise 4.1, but this time use the bipolar sigmoid for the activation functions.

4.4 Equation (4-19) contains the substitution h′(rm) = ym(1 - ym) for the unipolar sigmoid function. Find the derivative directly for the bipolar sigmoid function H(rm) = tanh(αrm/2) from Equation (3.4).

4.5 Write an algorithm for training an MLP that uses steepest descent at the last two layers and random search on the weights at the first hidden layer.

4.6 Write a simple computer program that implements supervised training of MLPs using the method of steepest descent with unipolar sigmoids (use the backpropagation algorithm that trains on a single exemplar pair at a time).
4.7 Use the program written in Exercise 4.6 above to train an MLP to perform XOR logic (2-bit parity). Use four middle neurodes on one training run, but use only two on another run. Make an even/odd map of the results on [0,1]×[0,1] by testing all points (x1,x2) where the values of x1 and x2 assume multiples of 0.1 from 0 to 1. Compare the results for the two different runs.

4.8 Use the program written in the sixth exercise above and the XOR data to analyze via simulation the errors at the output components of z when errors ε drawn from the uniform distribution on [0.0,0.25] are added to the inputs. Compute sample means and variances of the outputs.

4.9 Suppose that an MLP has been trained to map an input vector x = (x1,...,xN) into an output vector z = (z1,...,zJ) that approximates some t = (t1,...,tJ) closely. Let an error ε1 be added to the first component of the exemplar vector x. Analyze the error at the different components of the mapped output z due to ε1 on x1.
4.10 Use the process of Exercise 4.9 to analyze what happens when Gaussian random error is added to one input feature vector component. For a large number of input components with uniformly distributed errors, what is the net effect at each output zj? Consider the Central Limit Theorem from statistics.

4.11 Let v(r+1) = v(r) - η∇E(v(r)) be an iterative step to minimize the sum-squared error E(v) on the vector of all weights v = (w,u). Consider the modification: v(r+1) = βv(r) + (1-β)[v(r) - η∇E(v(r))], where β satisfies 0 < β < 1, and η is taken to be 1.0. Analyze the effect.

4.12 Analyze the effect (see Exercise 4.11) of using v(r+1) = βv(r) + (1-β)v(r-1) - η∇E(v(r)), where β satisfies 0 < β < 1. How will this affect the convergence?

4.13 Consider an MLP with two hidden layers. The last two layers on the right have their weights updated as derived in the text. Derive in complete detail the formulation for updating the weights of the first hidden layer that is adjacent to the input branching layer.

4.14 Modify the linearization v(r+1) = v(r) - η∇E(v(r)) by adding a momentum term of the form γΔv(r-1), where 0 < γ < 1. Discuss the ramifications of this strategy.
4.15 Analyze the effects on E(v(r)) of random errors on the exemplars {x(k)}. Discuss and justify your conclusions.

4.16 Modify the program developed in Exercise 4.6 above so that it uses bipolar sigmoids. Compare the convergence rates on XOR logic between the unipolar and bipolar versions (be sure to use bipolar data). Make at least 12 runs and compare the average behavior (each run draws a different initial weight set at random).

4.17 Write a flow chart similar to the one given in Figure 4.5, but for training on all exemplars simultaneously in a full steepest descent on E(w,u). Now write out the complete algorithm.
4.18 Modify the program developed in Exercise 4.16 above so that it trains on all exemplars simultaneously. Run it on the XOR logic function and compare with BP (backpropagation). Try different step gains from 0.05 to 2.0 and note the convergence behavior.

4.19 Show that the bipolar sigmoid functions have maximal slope when the weights are all zero. Can the initial weights be selected to be all zeros to obtain a faster rate of convergence? Explain. State the analogous case for unipolar sigmoid activation functions.

4.20 What is the effect on the weight increments when the weights are in a region where the sigmoid of a neurode has essentially zero slope (consider the update formula for the weights at that neurode)? What about a region where the slope is large?

4.21 Write out a training algorithm where the second order Newton method is used, i.e., where the Hessian matrix of second order partial derivatives is used to obtain a more accurate descent step.

4.22 Take the partial derivative ∂E/∂α of the sum-squared error function E with respect to the exponential rate α, and work out an update equation that uses it.
4.23 Design the output identifiers for K = 4 classes so that a single error can be detected.

4.24 Design the output identifiers for K = 4 classes so that a single error can be corrected.

4.25 Take the partial derivative of E with respect to the bias b and include it in the BP algorithm for supervised training as an extra weight on a line where the constant 1 is always input.

4.26 Suppose that an MLP is trained to recognize K classes from a sample of Q exemplar pairs (K ≤ Q). Further, suppose that during operation of the MLP it were discovered that while it could recognize every exemplar input vector, it would sometimes incorrectly recognize novel input vectors that were very close to one of the known exemplars. Explain why this may occur. Give a strategy that could be taken to reduce this undesirable behavior.

4.27 Suppose we copy the exemplar set {x(1),...,x(Q)} to a set {x(Q+1),...,x(2Q)}. Now suppose that we add a moderate level of Gaussian noise onto the components of the second set and then use the combined set to train an MLP. What would be the effect on the generalization or specialization of the learning? Justify your arguments.

4.28 Under backpropagation training, the weight increments -η(∂E(q)/∂vps) become arbitrarily close to zero as the updated weights {vps} approach a local minimum because ∂E/∂vps → 0. This causes BP to converge at an excruciatingly slow rate as it nears a local minimum, because the weight increments also approach zero. Find a method to speed up convergence in the vicinity of a local minimum.

4.29 Suppose that we use 50% of the exemplars to train an MLP to reduce the SSE ET. After a few epochs
of training, let us put the other 50% of the exemplars (that were not used in the training) through the network and compute that SSE as EN. We repeat this process in a loop. Is it possible that at some point EN will stop decreasing and begin to increase, even though ET continues to decrease? What can we infer from this in terms of generalization and specialization?

4.30 Describe a method that does not use the gradients to arrive at weights that reduce the total SSE to a reasonably low value [consider drawing random numbers in some fashion].

4.31 Would you use an MLP of size N-M-J = 2-200-1 to train on XOR logic? Why not? What can be inferred from this?

4.32 Suppose that we want to train an MLP on a set of Q exemplar pairs by using gradients. We want to use all of the Q exemplar pairs simultaneously (fullpropagation). However, we are using a computer that does not have sufficient memory for all Q exemplar pairs (Q is very large). Describe an algorithm where a portion of these, say Q/4 exemplar pairs, are used on each weight updating, consecutively, until all of the Q exemplars have been used. Compare this with batching in backpropagation. Which is the most efficient?

4.33 Derive ∂E/∂α1 and ∂E/∂α2 for unipolar sigmoid activation functions.

4.34 Derive ∂E/∂b1 and ∂E/∂b2 for unipolar sigmoid activation functions.

4.35 Find the equations for updating the decay rates α1 and α2 and the biases b1 and b2 for unipolar sigmoid activation functions.

4.36 Show that for zj = g(sj) = [1 - exp(-α2 sj + b2)] / [1 + exp(-α2 sj + b2)], it either is or isn't possible that zj = -1.

4.37 Make a copy of your BP program for Problem 4.6 and then modify it to implement SP (slickpropagation). Test it on 2-bit parity (XOR logic) and 3-bit parity. Is it quicker than BP? Why?

4.38 Consider an MLP network that has a single hidden layer. Modify the SP algorithm by substituting activations at the hidden neurodes so that their derivatives will not go to zero.
References
Fahlman, S. E. (1988), An Empirical Study of Learning Speed in Backpropagation, Tech. Report CMU-CS-88-162, Carnegie Mellon University, Pittsburgh.

Fausett, Laurene (1994), Fundamentals of Neural Networks, Prentice-Hall, Englewood Cliffs.

Fu, LiMin (1994), Neural Networks in Computer Intelligence, McGraw-Hill, NY.
Gauss, Carl Friedrich (1809), Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, F. Perthes & I. H. Besser (Editors), Hamburg (translation: Dover, NY, 1963).

Gori, M., and Tesi, A. (1992), On the problem of local minima in backpropagation, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, 76-86.

Hecht-Nielsen, R. (1990), Neurocomputing, Addison-Wesley, Reading, MA.

Hornik, K., Stinchcombe, M., and White, H. (1989), Multilayer feedforward networks are universal approximators, Neural Networks, vol. 2, no. 5, 359-366.

Jacobs, R. A. (1988), Increased rates of convergence through learning rate adaptation, Neural Networks, vol. 1, 295-307.

Kosko, B. (1992), Neural Networks and Fuzzy Systems, Prentice-Hall, Englewood Cliffs.

Legendre, A. M. (1810), Méthode des moindres quarrés, pour trouver le milieu le plus probable entre les résultats de différentes observations, Mem. Inst. France, 149-154.

Linz, Peter (1979), Theoretical Numerical Analysis, Wiley-Interscience, NY.

Looney, C. (1996), Advances in feedforward neural networks: demystifying knowledge acquiring black boxes, IEEE Trans. Knowledge and Data Engineering, vol. 8, no. 2, 1-16.

Looney, C. (1993), Neural networks as expert systems, J. Expert Systems with Applications, vol. 6, no. 2, 129-136.

McCormack, C., and Doherty, J. (1993), Neural network super architectures, Proc. 1993 Int'l Conf. Neural Networks, Nagoya, 301-304.

Parker, D. B. (1982), Learning Logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University.

Robbins, H., and Monro, S. (1951), A stochastic approximation method, Annals of Math. Statistics, vol. 22, 400-407.

Rogers, S. K., and Kabrisky, M. (1993), An Introduction to Biological and Artificial Neural Networks for Pattern Recognition, SPIE Optical Engineering Press, Bellingham, WA.

Rumelhart, D., Hinton, G., and Williams, R. (1986), Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, edited by Rumelhart and McClelland, MIT Press, Cambridge, 318-362.

Stornetta, W. S., and Huberman, B. A. (1987), An improved three-layer backpropagation algorithm, Proc. First IEEE Int'l Conf. Neural Networks, San Diego, vol. 2, 637-643.

Wasserman, P. D. (1989), Neural Computing, Van Nostrand Reinhold, NY.
Werbos, P. J. (1994), The Roots of Backpropagation, John Wiley, NY.

Werbos, P. J. (1974), Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis in Applied Math., Harvard University.

Zurada, J. M. (1992), Artificial Neural Systems, West Publishing, St. Paul.