Capacity of a Perceptron
This document discusses the capacity of simple perceptrons. It begins by explaining that the capacity (Pmax) is the maximum number of random input-output pairs that can be reliably stored in a network. For linear units, Pmax is equal to the number of input units (N).
The document then focuses on deriving the capacity formula for threshold units receiving continuous-valued random inputs. Through an analysis of how many ways a set of points can be divided into two classes by a hyperplane, it is shown that Pmax is equal to 2N. Graphs demonstrate how the number of classifiable patterns transitions sharply from all to none around this value as N increases.
FIVE Simple Perceptrons
5.6 Stochastic Units

Another generalization is from our deterministic units to stochastic units S_i^μ governed by (2.48):

\mathrm{Prob}(S_i^\mu = \pm 1) = \frac{1}{1 + \exp(\mp 2\beta h_i^\mu)}    (5.54)

with

h_i^\mu = \sum_k w_{ik}\,\xi_k^\mu    (5.55)

as before. This leads to

\langle S_i^\mu \rangle = \tanh\Big(\beta \sum_k w_{ik}\,\xi_k^\mu\Big)    (5.56)

just as in (2.42). In the context of a simulation we can use (5.56) to calculate ⟨S_i^μ⟩, whereas in a real stochastic network we would find it by averaging S_i^μ for a while, updating randomly chosen units according to (5.54). Either way, we then use ⟨S_i^μ⟩ as the basis of a weight change

\Delta w_{ik} = \eta \sum_\mu \delta_i^\mu \xi_k^\mu    (5.57)

where

\delta_i^\mu = \zeta_i^\mu - \langle S_i^\mu \rangle .    (5.58)

This is just the average over outcomes of the changes we would have made on the basis of individual outcomes using the ordinary delta rule. We will find it particularly important when we discuss reinforcement learning in Section 7.4.

It is interesting to prove that this rule always decreases the average error given by the usual quadratic measure

E = \frac{1}{2} \sum_{i\mu} (\zeta_i^\mu - S_i^\mu)^2 .    (5.59)

Since we are assuming output units and patterns are ±1, this is just twice the total number of bits in error, and can also be written

E = \sum_{i\mu} (1 - \zeta_i^\mu S_i^\mu) .    (5.60)

Thus the average error in the stochastic network is

\langle E \rangle = \sum_{i\mu} \big(1 - \zeta_i^\mu \langle S_i^\mu \rangle\big)
 = \sum_{i\mu} \Big[ 1 - \zeta_i^\mu \tanh\Big(\beta \sum_k w_{ik}\,\xi_k^\mu\Big) \Big] .    (5.61)

The change in ⟨E⟩ in one cycle of weight updating is thus

\delta\langle E \rangle = \sum_{ik} \frac{\partial\langle E \rangle}{\partial w_{ik}}\,\Delta w_{ik}
 = -\sum_{i\mu k} \eta\,\beta\,\big[1 - \zeta_i^\mu \tanh(\beta h_i^\mu)\big]\,\mathrm{sech}^2(\beta h_i^\mu)\,(\xi_k^\mu)^2    (5.62)

using d tanh(x)/dx = sech² x, where sech² x = 1 − tanh² x is a bell-shaped curve with its peak at x = 0. The result (5.62) is clearly always negative (recall |tanh(x)| < 1), so the procedure always improves the average error.

5.7 Capacity of the Simple Perceptron *

In the case of the associative network in Chapter 2 we were able to find the capacity p_max of a network of N units; for random patterns we found p_max = 0.138 N for large N if we used the standard Hebb rule. If we tried to store p patterns with p > p_max the performance became terrible. Similar questions can be asked for simple perceptrons: How many random input-output pairs can we expect to store reliably in a network of given size? How many of these can we expect to learn using a particular learning rule? The answer to the second question may well be smaller than the first (e.g., for nonlinear units), but is presently unknown in general. The first question, which this section deals with, gives the maximum capacity that any learning algorithm can hope to achieve.

For continuous-valued units (linear or nonlinear) we already know the answer, because the condition is simply linear independence. If we choose p random patterns, then they will be linearly independent if p ≤ N (except for cases with very small probability). So the capacity is p_max = N.

The case of threshold units depends on linear separability, which is harder to deal with. The answer for random continuous-valued inputs was derived by Cover [1965] (see also Mitchison and Durbin [1989]) and is remarkably simple:

p_{\max} = 2N .    (5.63)

As usual N is the number of input units, and is presumed large. The number of output units must be small and fixed (independent of N). Equation (5.63) is strictly true only in the N → ∞ limit.

FIGURE 5.11 The function C(p, N)/2^p given by (5.67) plotted versus p/N for N = 5, 20, and 100.

The rest of this section is concerned with proving (5.63), and may be omitted on first reading. We follow the approach of Cover [1965].
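Before going through the argument, the prediction (5.63) is easy to check numerically. The sketch below is a rough illustration, not part of the text: it assumes numpy and scipy are available, draws p Gaussian points in N dimensions, colors them at random, and uses a small linear program to test whether some hyperplane through the origin separates red from black. The estimated probability of separability is the quantity plotted in Fig. 5.11; the helper names, sizes, and trial counts are arbitrary choices.

```python
# Rough Monte Carlo check of p_max = 2N (eq. 5.63): estimate, for random
# +-1 colorings of p random points in N dimensions, the probability that a
# separating hyperplane through the origin exists (the curve of Fig. 5.11).
# Assumes numpy and scipy; sizes and trial counts are illustrative only.
import numpy as np
from scipy.optimize import linprog

def separable(points, colors):
    """True if some w satisfies colors_i * (w . points_i) > 0 for all i."""
    # Strict separability is equivalent to feasibility of y_i (w . x_i) >= 1,
    # since any strictly separating w can be rescaled.
    A_ub = -(colors[:, None] * points)              # encodes -y_i (x_i . w) <= -1
    b_ub = -np.ones(len(points))
    c = np.zeros(points.shape[1])                   # any feasible w will do
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * points.shape[1], method="highs")
    return res.status == 0                          # status 0: a feasible w was found

def fraction_separable(N, p, trials=200):
    rng = np.random.default_rng(0)
    hits = 0
    for _ in range(trials):
        x = rng.standard_normal((p, N))             # points in general position
        y = rng.choice([-1.0, 1.0], size=p)         # a random coloring
        hits += separable(x, y)
    return hits / trials

if __name__ == "__main__":
    N = 20
    for ratio in (1.0, 1.5, 2.0, 2.5, 3.0):
        p = int(ratio * N)
        print(f"p/N = {ratio:3.1f}   P(separable) ~ {fraction_separable(N, p):.2f}")
```

For growing N the estimated curve should steepen around p/N = 2, reproducing the behavior described for Fig. 5.11.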
A more general (but much more difficult) method for answering this sort of question was given by Gardner [1987] and is discussed in Chapter 10.

We consider a perceptron with N continuous-valued inputs and one ±1 output unit, using the deterministic threshold limit. The extension to several output units is trivial since output units and their connections are independent; the result (5.63) applies separately to each. For convenience we take the thresholds to be zero, but they could be reinserted at the expense of one extra input unit, as in (5.2).

In (5.11) we showed that the perceptron divides the N-dimensional input space into two regions separated by an (N − 1)-dimensional hyperplane. For the case of zero threshold this plane goes through the origin. All the points on one side give an output of +1 and all those on the other side give −1. Let us think of these as red (+1) and black (−1) points respectively. Then the question we need to answer is: how many points can we expect to place randomly in an N-dimensional space, some colored red and some black, and still find a hyperplane through the origin that divides the red points from the black points?

Let us consider a slightly different question. For a given set of p randomly placed points in an N-dimensional space, for how many out of the 2^p possible red and black colorings of the points can we find a hyperplane dividing red from black? Call the answer C(p, N). For p small we expect C(p, N) = 2^p, because we should be able to find a suitable hyperplane for any possible coloring; consider N = p = 2 for example. For p large we expect C(p, N) to drop well below 2^p, so an arbitrarily chosen coloring will probably not possess a dividing hyperplane. The transition between these regimes turns out to be sharp for large N, and gives us p_max.

We will calculate C(p, N) shortly, but let us first examine the result. Figure 5.11 shows a graph of C(p, N)/2^p against p/N for N = 5, 20, and 100. Our expectations for small and large p are fulfilled, and we see that the transition occurs quite rapidly in the neighborhood of p = 2N, in agreement with (5.63). As N is made larger and larger the transition becomes more and more sharp. Thus (5.63) is justified if we can demonstrate that Fig. 5.11 is correct.

FIGURE 5.12 Finding separating hyperplanes constrained to go through a point P as well as the origin O is equivalent to projecting the problem onto one lower dimension.

Randomness of the points is not actually necessary, nor is it well defined unless a distribution function is specified. All that we need is that the points be in general position. As discussed on page 97, this means (for the no-threshold case) that all subsets of N (or fewer) points must be linearly independent. As an example consider N = 2: a set of p points in a two-dimensional plane is in general position if no two lie on the same line through the origin. A set of points chosen from a continuous random distribution will obviously be in general position, except for coincidences that have zero probability.

We can now calculate C(p, N) by induction. Let us call a coloring that can be divided by a hyperplane a dichotomy. Suppose we know C(p, N) for a set of p points, and add a new point P. For those previous dichotomies where the dividing hyperplane could have been drawn through the point P, there will be two new dichotomies, one with P red and one with P black. This is because, when the points are in general position, any hyperplane through P can be shifted infinitesimally to go to either side of P, without changing the side of any of the other p points.
For the rest of the dichotomies only one color of point P will work, so there will be one new dichotomy for each old one. Thus

C(p+1, N) = C(p, N) + D    (5.64)

where D is the number of the C(p, N) dichotomies that could have had the dividing hyperplane drawn through P as well as the origin O. But this number is simply C(p, N−1), because constraining the hyperplanes to go through a particular point P makes the problem effectively (N − 1)-dimensional; as illustrated in Fig. 5.12, we can project the whole problem onto an (N − 1)-dimensional plane perpendicular to OP, since any displacement of a point along the OP direction cannot affect which side of any hyperplane containing OP it lies on. We thereby obtain the recursion relation

C(p+1, N) = C(p, N) + C(p, N-1) .    (5.65)

Iterating this equation for p, p−1, p−2, ..., 1 yields

C(p, N) = \binom{p-1}{0} C(1, N) + \binom{p-1}{1} C(1, N-1) + \cdots + \binom{p-1}{p-1} C(1, N-p+1) .    (5.66)

For p < N this is easy to handle, because C(1, N) = 2 for all N ≥ 1; one point can be colored red or black. For p > N the second argument of C becomes 0 or negative in some terms, but these terms can be eliminated by taking C(p, N) = 0 for N ≤ 0. It is easy to check that this choice is consistent with the recursion relation (5.65), and with C(p, 1) = 2 (in one dimension the only "hyperplane" is a point at the origin, allowing two dichotomies). Thus (5.66) makes sense for all values of p and N, and can be written as

C(p, N) = 2 \sum_{i=0}^{N-1} \binom{p-1}{i}    (5.67)

if we use the standard convention that \binom{n}{m} = 0 for m > n. Equation (5.67) was used to plot Fig. 5.11, thus completing the demonstration.

It is actually easy to show from the symmetry \binom{n}{m} = \binom{n}{n-m} of binomial coefficients that

C(2N, N) = \tfrac{1}{2}\, 2^{2N}    (5.68)

so the curve goes through 1/2 at p = 2N. To show analytically that the transition sharpens up for increasing N, one can appeal to the large-N Gaussian limit of the binomial coefficients, which leads to

\frac{C(p, N)}{2^p} \simeq \frac{1}{2}\Big[ 1 + \operatorname{erf}\Big( \frac{2N - p}{\sqrt{2p}} \Big) \Big]    (5.69)

for large N. It is worth noting that C(p, N) = 2^p if p ≤ N (this is shown on page 155). So any coloring of up to N points is linearly separable, provided only that the points are in general position. For N or fewer points general position is equivalent to linear independence, so the sufficient conditions for a solution are exactly the same in the threshold and continuous-valued networks. But this is not true, of course, for p > N.

SIX Multi-Layer Networks

The limitations of a simple perceptron do not apply to feed-forward networks with intermediate or "hidden" layers between the input and output layer. In fact, as we will see later, a network with just one hidden layer can represent any Boolean function (including for example XOR). Although the greater power of multi-layer networks was realized long ago, it was only recently shown how to make them learn a particular function, using "back-propagation" or other methods. This absence of a learning rule, together with the demonstration by Minsky and Papert [1969] that only linearly separable functions could be represented by simple perceptrons, led to a waning of interest in layered networks until recently.

Throughout this chapter, like the previous one, we consider only feed-forward networks. More general networks are discussed in the next chapter.

6.1 Back-Propagation

The back-propagation algorithm is central to much current work on learning in neural networks. It was invented independently several times, by Bryson and Ho [1969], Werbos [1974], Parker [1985], and Rumelhart et al. [1986a, b]. A closely related approach was proposed by Le Cun [1985].
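As a preview of the prescription spelled out in the following paragraphs, here is a minimal sketch of gradient descent on the quadratic error for a tiny two-layer tanh network learning XOR, the example mentioned above. It is an illustration only, written in numpy; the layer sizes, learning rate, initialization, and number of sweeps are arbitrary choices, not prescriptions from the text.

```python
# Minimal two-layer back-propagation sketch (tanh units, quadratic error).
# Notation follows the chapter: inputs xi_k, hidden units V_j, outputs O_i,
# with weights w_jk (input -> hidden) and W_ij (hidden -> output).
import numpy as np

xi = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])   # the four input pairs
zeta = -(xi[:, 0] * xi[:, 1]).reshape(-1, 1)                  # XOR targets: +1 if inputs differ

rng = np.random.default_rng(0)
w = 0.5 * rng.standard_normal((3, 2))      # input -> hidden weights w_jk
W = 0.5 * rng.standard_normal((1, 3))      # hidden -> output weights W_ij
eta = 0.1                                  # learning rate

for sweep in range(5000):
    V = np.tanh(xi @ w.T)                          # hidden activations
    O = np.tanh(V @ W.T)                           # network outputs
    delta_out = (zeta - O) * (1.0 - O**2)          # output-layer deltas
    delta_hid = (delta_out @ W) * (1.0 - V**2)     # deltas propagated back to the hidden layer
    W += eta * delta_out.T @ V                     # gradient descent on the
    w += eta * delta_hid.T @ xi                    # quadratic error

print("outputs after training:", np.round(O.ravel(), 2))
```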
The algorithm gives a prescription for changing the weights w_pq in any feed-forward network to learn a training set of input-output pairs. The basis is simply gradient descent, as described in Sections 5.4 (linear) and 5.5 (nonlinear) for a simple perceptron.

We consider first a two-layer network such as that illustrated by Fig. 6.1. Our notational conventions are shown in the figure; output units are denoted by O_i, hidden units by V_j, and input terminals by ξ_k. There are connections w_jk from the inputs to the hidden units, and connections W_ij from the hidden units to the output units.

TEN Formal Statistical Mechanics of Neural Networks

Only the second of these, which comes from ∂f/∂r = 0, is a little tricky, needing the identity

\langle\langle z\, f(z) \rangle\rangle_z = \langle\langle f'(z) \rangle\rangle_z    (10.75)

for any bounded function f(z). Equation (10.72) is just like (10.22) for the α = 0 case, except for the addition of the effective Gaussian random field term, which represents the crosstalk from the uncondensed patterns. For α = 0 it reduces directly to (10.22). Equation (10.73) is the obvious equation for the mean square magnetization. Equation (10.74) gives the (nontrivial) relation between q and the mean square value of the random field, and is identical to (2.67).

For memory states, i.e., m-vectors of the form (m, 0, 0, ...), the saddle-point equations (10.72) and (10.73) become simply

m = \big\langle\big\langle \tanh \beta(\sqrt{\alpha r}\, z + m) \big\rangle\big\rangle_z    (10.76)
q = \big\langle\big\langle \tanh^2 \beta(\sqrt{\alpha r}\, z + m) \big\rangle\big\rangle_z    (10.77)

where the averaging is solely over the Gaussian random field. These are identical to (2.65) and (2.68) that we found in the heuristic theory of Section 2.5. Their solution, and the consequent phase diagram of the model in α–T space, can be studied as we sketched there. Spurious states, such as the symmetric combinations (10.26), can also be analyzed at finite α using the full equations (10.72)–(10.74).

There are several subtle points in this replica method calculation:

We started by calculating ⟨⟨Z^n⟩⟩ for integer n but eventually interpreted n as a real number and took the n → 0 limit. This is not the only possible continuation from the integers to the reals; we might for example have added a function like sin(πn)/n.

We treated the order of limits and averages in a cavalier fashion, and in particular reversed the order of n → 0 and N → ∞.

We made the replica symmetry approximation (10.60)–(10.62), which was really only based on intuition.

Experience has shown that the replica method usually does work, but there are few rigorous mathematical results. It can be shown for the Sherrington-Kirkpatrick spin glass model, and probably for this one too, that the reversal of limits is justified, and that the replica symmetry assumption is correct for integer n [van Hemmen and Palmer, 1979]. But for some problems, including the spin glass, the method sometimes gives the wrong answer. This can be blamed on the integer-to-real continuation, and can be corrected by replica symmetry breaking, in which the replica symmetry assumption is replaced by a more complicated one. Then the natural continuation seems to give the right answer.

For the present problem Amit et al. showed that the replica symmetric approximation is valid except at very low temperatures, where there is replica symmetry breaking. This seems to lead only to very small corrections in the results. However, the predicted change in the capacity (α_c becomes 0.144 instead of 0.138) can be detected in numerical simulations [Crisanti et al., 1986].
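The memory-state equations (10.76)–(10.77) are easy to solve numerically by fixed-point iteration once the crosstalk strength r is supplied. The sketch below is a rough illustration in numpy: it assumes the standard form r = q/[1 − β(1 − q)]² for the relation that the text refers to as (10.74), since that equation is not reproduced in this excerpt, and the particular values of α and T are arbitrary.

```python
# Rough fixed-point solution of the memory-state equations (10.76)-(10.77).
# The crosstalk strength r is taken as r = q / [1 - beta*(1 - q)]^2, assumed
# to be the standard form of the relation referred to as (10.74).
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(80)   # Gauss-Hermite rule

def gauss_avg(f):
    """Average of f(z) over a zero-mean, unit-variance Gaussian z."""
    return np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

def memory_state(alpha, T, sweeps=500):
    beta, m, q = 1.0 / T, 1.0, 1.0
    for _ in range(sweeps):
        r = q / (1.0 - beta * (1.0 - q))**2
        h = lambda z: beta * (np.sqrt(alpha * r) * z + m)   # argument of tanh
        m_new = gauss_avg(lambda z: np.tanh(h(z)))          # eq. (10.76)
        q_new = gauss_avg(lambda z: np.tanh(h(z))**2)       # eq. (10.77)
        m, q = m_new, q_new
    return m, q

for alpha in (0.02, 0.05, 0.10):
    m, q = memory_state(alpha, T=0.2)
    print(f"alpha = {alpha:.2f}, T = 0.2:  m = {m:.3f}  q = {q:.3f}")
```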
10.2 Gardner Theory of the Connections

The second classic statistical mechanical tour de force in neural networks is the computation by Gardner [1987, 1988] of the capacity of a simple perceptron. The calculation applies in the same form to a Hopfield-like recurrent network for auto-associative memory if the connections are allowed to be asymmetric.

The theory is very general; it is not specific to any particular algorithm for the connections. On the other hand, it does not provide us with a specific set of connections even when it has told us that such a set exists.

As in Section 6.5, the basic idea is to consider the fraction of weight space that implements a particular input-output function; recall that weight space is the space of all possible connection weights w = {w_ij}. In Section 6.5 we used relatively simple methods to calculate weight space volumes. The present approach is more complicated, though often more powerful. We use many of the techniques introduced in the previous section, including replicas, auxiliary variables, and the saddle-point method.

We consider a simple perceptron with N binary inputs ξ_j = ±1 and M binary threshold units that compute the outputs

O_i = \operatorname{sgn}\Big( N^{-1/2} \sum_j w_{ij}\, \xi_j \Big) .    (10.78)
The N^{-1/2} factor will be discussed shortly. Given a desired set of associations ξ_j^μ → ζ_i^μ for μ = 1, 2, ..., p, we want to know in what fraction of weight space the equations

\zeta_i^\mu = \operatorname{sgn}\Big( N^{-1/2} \sum_j w_{ij}\, \xi_j^\mu \Big)    (10.79)

are satisfied (for all i and μ). Or equivalently, in what fraction of this space are the inequalities

\zeta_i^\mu\, N^{-1/2} \sum_j w_{ij}\, \xi_j^\mu > 0    (10.80)

true?
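For small N and p this fraction can be estimated directly by brute force: since only the direction of each row of w matters for (10.80), one can sample isotropic random weight vectors and count how often all p conditions hold. The sketch below is an illustration only, for a single output unit; it assumes numpy, and the sizes and sample count are arbitrary.

```python
# Brute-force estimate of the weight-space fraction in which the conditions
# (10.80) hold, for one random instance with a single output unit and small
# N, p.  Only the direction of w matters for (10.80), so isotropic random
# weight vectors are enough.  Assumes numpy; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, p, samples = 8, 5, 200_000

xi = rng.choice([-1.0, 1.0], size=(p, N))        # input patterns xi_j^mu
zeta = rng.choice([-1.0, 1.0], size=p)           # desired outputs zeta^mu

w = rng.standard_normal((samples, N))            # isotropic weight vectors

# zeta^mu N^{-1/2} sum_j w_j xi_j^mu  for every sampled w and every pattern
fields = (w @ xi.T) * zeta / np.sqrt(N)
fraction = np.mean(np.all(fields > 0.0, axis=1))
print(f"fraction of weight space storing all {p} associations: {fraction:.4f}")
```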
It is also interesting to ask the corresponding question if the condition (10.80) is strengthened so that there is a margin of size κ > 0, as in (5.20):

\zeta_i^\mu\, N^{-1/2} \sum_j w_{ij}\, \xi_j^\mu > \kappa .    (10.81)

A nonzero κ guarantees correction of small errors in the input pattern.

Until (10.81) the factor N^{-1/2} was irrelevant. We include it because it is convenient to work with w_ij's of order unity, and a sum of N such terms of random sign gives a result of order N^{1/2}. Thus the explicit factor N^{-1/2} makes the left-hand side of (10.81) of order unity, and it is appropriate to think about κ's that are independent of N. Of course this is only appropriate if the terms in the sum over j are really of random sign, but that turns out to be the case of most interest here. On the other hand, in Chapter 5 we were mainly dealing with a correlated sum, and so used a factor N instead of N^{1/2}.

For a recurrent autoassociative network, the same equations with ζ_j^μ = ξ_j^μ give the condition for the stability of the patterns, and a nonzero κ ensures finite basins of attraction.

The Capacity of a Simple Perceptron

The fundamental quantity that we want to calculate is the volume fraction of weight space in which (10.81) is satisfied. Adding an additional constraint

\sum_j w_{ij}^2 = N    (10.82)

for each unit i, so as to keep the weights within bounds, this fraction is

V = \frac{\int d\mathbf{w}\, \Big( \prod_{i\mu} \theta\big( \zeta_i^\mu N^{-1/2} \sum_j w_{ij}\xi_j^\mu - \kappa \big) \Big)\, \prod_i \delta\big( \sum_j w_{ij}^2 - N \big)}{\int d\mathbf{w}\, \prod_i \delta\big( \sum_j w_{ij}^2 - N \big)} .    (10.83)

Here we enforce the constraint (10.82) with the delta functions, and restrict the numerator to regions satisfying (10.81) with the step functions θ(x).

The expression (10.83) is rather like a statistical-mechanical partition function (10.1), but the conventional exponential weight is replaced by an all-or-nothing one given by the step functions. It is also important to recognize that here it is the weights w_ij that are the fundamental statistical-mechanical variables, not the activations of the units.

We observe immediately that (10.83) factors into a product of identical terms, one for each i. Therefore we can drop the index i altogether, reducing without loss of generality the calculation to the case of a single output unit. The corresponding step also works for the recurrent network if w_ij and w_ji are independent, but the calculation cannot be done this way if a symmetry constraint w_ij = w_ji is imposed.

In the same way as for Z in the previous section, the statistically relevant quantity is the average over the pattern distribution, not of V itself, but of its logarithm. Therefore we introduce replicas again and compute the average

\langle\langle V^n \rangle\rangle = \Bigg\langle\Bigg\langle \frac{\int \prod_\alpha d\mathbf{w}^\alpha\, \prod_{\mu\alpha} \theta\big( \zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu - \kappa \big)\, \prod_\alpha \delta\big( \sum_j (w_j^\alpha)^2 - N \big)}{\int \prod_\alpha d\mathbf{w}^\alpha\, \prod_\alpha \delta\big( \sum_j (w_j^\alpha)^2 - N \big)} \Bigg\rangle\Bigg\rangle    (10.84)

where the integrals are over all the w_j^α's and the average ⟨⟨...⟩⟩ is over the ξ_j^μ's and the ζ^μ's.

To proceed we use the same kinds of tricks as in the previous section. First we work on the step functions, using the integral representation

\theta(z - \kappa) = \int_\kappa^\infty d\lambda\, \delta(\lambda - z) = \int_\kappa^\infty d\lambda \int_{-\infty}^{\infty} \frac{dx}{2\pi}\, e^{i x (\lambda - z)} .    (10.85)

We have step functions for each α and μ, so at this point we need auxiliary variables λ_μ^α and x_μ^α. Thus a particular step function becomes

\theta\big( \zeta^\mu N^{-1/2} \textstyle\sum_j w_j^\alpha \xi_j^\mu - \kappa \big) = \int_\kappa^\infty d\lambda_\mu^\alpha \int \frac{dx_\mu^\alpha}{2\pi}\, e^{i x_\mu^\alpha (\lambda_\mu^\alpha - z_\mu^\alpha)}    (10.86)

where

z_\mu^\alpha = \zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu .    (10.87)

It is now easy to average over the patterns, which occur only in the last factor of (10.86). We consider the case of independent binary patterns, for which we have

\Big\langle\Big\langle \prod_{\mu\alpha} e^{-i x_\mu^\alpha z_\mu^\alpha} \Big\rangle\Big\rangle
 = \prod_{j\mu} \Big\langle\Big\langle \exp\Big( -i \zeta^\mu \xi_j^\mu N^{-1/2} \sum_\alpha x_\mu^\alpha w_j^\alpha \Big) \Big\rangle\Big\rangle
 = \exp\Big( \sum_{j\mu} \log\cos\Big[ N^{-1/2} \sum_\alpha x_\mu^\alpha w_j^\alpha \Big] \Big)
 \simeq \exp\Big( -\frac{1}{2N} \sum_{\mu\alpha\beta} x_\mu^\alpha x_\mu^\beta \sum_j w_j^\alpha w_j^\beta \Big) .    (10.88)
The resulting Σ_j w_j^α w_j^β term is not easy to deal with directly, so we replace it by a new variable q_αβ defined by

q_{\alpha\beta} = \frac{1}{N} \sum_j w_j^\alpha w_j^\beta .    (10.89)

This gives q_αα = 1 from (10.82), but we prefer to treat the α = β terms explicitly and use q_αβ only for α ≠ β. Thus we rewrite (10.88) as

\Big\langle\Big\langle \prod_{\mu\alpha} e^{-i x_\mu^\alpha z_\mu^\alpha} \Big\rangle\Big\rangle
 = \prod_\mu \exp\Big( -\tfrac{1}{2} \sum_\alpha (x_\mu^\alpha)^2 - \sum_{\alpha<\beta} x_\mu^\alpha x_\mu^\beta\, q_{\alpha\beta} \Big)    (10.90)

using q_αβ = q_βα. The q_αβ's play the same role in this problem that q_αβ and r_αβ did in the previous section.

When we insert (10.90) into (10.86) we see that we get an identical result for each μ, so we can drop all the μ's and write

\Big\langle\Big\langle \prod_{\mu\alpha} \theta\big( \zeta^\mu N^{-1/2} \textstyle\sum_j w_j^\alpha \xi_j^\mu - \kappa \big) \Big\rangle\Big\rangle
 = \Bigg[ \int_\kappa^\infty \Big( \prod_\alpha d\lambda_\alpha \Big) \int \Big( \prod_\alpha \frac{dx_\alpha}{2\pi} \Big)\, e^{K\{\lambda, x, q\}} \Bigg]^p    (10.91)

where

K\{\lambda, x, q\} = i \sum_\alpha x_\alpha \lambda_\alpha - \tfrac{1}{2} \sum_\alpha x_\alpha^2 - \sum_{\alpha<\beta} x_\alpha x_\beta\, q_{\alpha\beta} .    (10.92)

Now we turn to the delta functions. Using the basic integral representation

\delta(z) = \int \frac{dr}{2\pi i}\, e^{-r z}    (10.93)

we choose r = E_α/2 for each α to write the delta functions in (10.84) as

\delta\Big( \sum_j (w_j^\alpha)^2 - N \Big) = \int \frac{dE_\alpha}{4\pi i}\, \exp\Big[ -\frac{E_\alpha}{2} \Big( \sum_j (w_j^\alpha)^2 - N \Big) \Big] .    (10.94)

In the same way we enforce the condition (10.89) for each pair αβ (with α > β), using r = N F_αβ:

\delta\Big( q_{\alpha\beta} - N^{-1} \sum_j w_j^\alpha w_j^\beta \Big) = N \int \frac{dF_{\alpha\beta}}{2\pi i}\, \exp\Big[ -F_{\alpha\beta} \Big( N q_{\alpha\beta} - \sum_j w_j^\alpha w_j^\beta \Big) \Big] .    (10.95)

We also have to add an integral over each of the q_αβ's, so that the delta function can pick out the desired value.

A factorization of the integrals over the w's is now possible. Taking everything not involving w_j^α outside, the numerator of (10.84) includes a factor

\int \Big( \prod_\alpha dw_j^\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha (w_j^\alpha)^2 + \sum_{\alpha<\beta} F_{\alpha\beta}\, w_j^\alpha w_j^\beta \Big)    (10.96)

for each j. These factors are all identical (w_j^α is a dummy variable, and j no longer appears elsewhere), so we can drop the j's and rewrite (10.96) as

\Bigg[ \int \Big( \prod_\alpha dw_\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 + \sum_{\alpha<\beta} F_{\alpha\beta}\, w_\alpha w_\beta \Big) \Bigg]^N .    (10.97)

The same transformation applies to the denominator of (10.84), except that there are no F_αβ terms.

It is now time to collect together our factors from (10.92), (10.94), (10.95), and (10.97). Writing A^k as exp(k log A), and omitting prefactors, (10.84) becomes

\langle\langle V^n \rangle\rangle = \frac{\int \big( \prod_\alpha dE_\alpha \big) \big( \prod_{\alpha<\beta} dq_{\alpha\beta}\, dF_{\alpha\beta} \big)\, e^{N G\{q, F, E\}}}{\int \big( \prod_\alpha dE_\alpha \big)\, e^{N H\{E\}}}    (10.98)

where

G\{q, F, E\} = \alpha \log\Bigg[ \int_\kappa^\infty \Big( \prod_\alpha d\lambda_\alpha \Big) \int \Big( \prod_\alpha \frac{dx_\alpha}{2\pi} \Big)\, e^{K\{\lambda, x, q\}} \Bigg]
 + \log\Bigg[ \int \Big( \prod_\alpha dw_\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 + \sum_{\alpha<\beta} F_{\alpha\beta}\, w_\alpha w_\beta \Big) \Bigg]
 - \sum_{\alpha<\beta} F_{\alpha\beta}\, q_{\alpha\beta} + \tfrac{1}{2} \sum_\alpha E_\alpha    (10.99)

and

H\{E\} = \log\Bigg[ \int \Big( \prod_\alpha dw_\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 \Big) \Bigg] + \tfrac{1}{2} \sum_\alpha E_\alpha .    (10.100)

Since the exponents inside the integrals in (10.98) are proportional to N, we will be able to evaluate them exactly using the saddle-point method in the large-N limit.

As before, we make a replica-symmetric ansatz:

q_{\alpha\beta} = q ,\quad F_{\alpha\beta} = F ,\quad E_\alpha = E    (10.101)

(where the first two apply for α ≠ β only). This allows us to evaluate each term of G. For the first term we can rewrite K from (10.92) as

K\{\lambda, x, q\} = i \sum_\alpha x_\alpha \lambda_\alpha - \tfrac{1}{2}(1 - q) \sum_\alpha x_\alpha^2 - \tfrac{1}{2} q \Big( \sum_\alpha x_\alpha \Big)^2    (10.102)

and linearize the last term with the usual Gaussian integral trick

e^{b^2/2} = \int_{-\infty}^{\infty} \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2 + b t}    (10.103)

derived from (10.5). Then the x_α integrals can be done, leaving a product of identical integrals over the λ_α's. Upon replacing these by a single integral to the nth power we obtain for the whole first line of (10.99):

\alpha \log\Bigg\{ \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \Bigg[ \int_\kappa^\infty \frac{d\lambda}{\sqrt{2\pi(1-q)}} \exp\Big( -\frac{(\lambda + t\sqrt{q})^2}{2(1-q)} \Big) \Bigg]^n \Bigg\}
 \simeq n \alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \log\Bigg[ \int_\kappa^\infty \frac{d\lambda}{\sqrt{2\pi(1-q)}} \exp\Big( -\frac{(\lambda + t\sqrt{q})^2}{2(1-q)} \Big) \Bigg]    (10.104)

where α ≡ p/N. The second term in G can be evaluated in the same way, linearizing the (Σ_α w_α)² term with a Gaussian integral trick, then performing in turn the w_α integrals and the Gaussian integral. The final result in the small-n limit is

\frac{n}{2} \Big[ \frac{F}{E + F} - \log(E + F) \Big] .    (10.105)

Finally, the third term of G gives simply (again for small n)

\frac{n}{2} \big( E + q F \big) .    (10.106)

Now we are in a position to find the saddle point of G with respect to q, F, and E. The most important order parameter is q.
Its value at the saddle point is the most probable value of the overlap (10.89) between a pair of solutions. If, as at small α, there is a large region of w-space that solves (10.80), then different solutions can be quite uncorrelated and q will be small. As we increase α, it becomes harder and harder to find solutions, and the typical overlap between a pair of them increases. Finally, when there is just a single solution, q becomes equal to 1. This point defines the optimal perceptron: the one with the largest capacity for a given stability parameter κ, or equivalently the one with the highest stability for a given α. We focus on this case henceforth, taking q → 1 shortly.

The saddle-point equations ∂G/∂E = 0 and ∂G/∂F = 0 can readily be solved to express E and F in terms of q:

F = \frac{q}{(1-q)^2} , \qquad E = \frac{1 - 2q}{(1-q)^2} .    (10.107)

Substituting these into the expression for G (and making a change of variable in the dλ integral), we get

\frac{1}{n} G(q) = \alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \log\Bigg[ \int_{(\kappa + t\sqrt{q})/\sqrt{1-q}}^{\infty} \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2} \Bigg]
 + \frac{1}{2} \log(1-q) + \frac{q}{2(1-q)} + \frac{1}{2} .    (10.108)

Setting ∂G/∂q = 0 to find the saddle point gives

\alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \Bigg[ \int_u^\infty dz\, e^{-z^2/2} \Bigg]^{-1} e^{-u^2/2}\, \frac{t + \kappa\sqrt{q}}{2\sqrt{q}\,(1-q)^{3/2}} = \frac{q}{2(1-q)^2}    (10.109)

where u = (κ + t√q)/√(1−q). Taking the limit q → 1 is a little tricky, but can be done using L'Hospital's rule, yielding the final result

\frac{1}{\alpha_c(\kappa)} = \int_{-\kappa}^{\infty} \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2}\, (t + \kappa)^2 .    (10.110)

Equation (10.110) gives the capacity for fixed κ. Alternatively we can use it to find the appropriate κ for the optimal perceptron to store Nα patterns. In the limit κ = 0 it gives

\alpha_c = 2    (10.111)

in agreement with the result found geometrically by Cover that was outlined in Chapter 5.

One can also perform the corresponding calculation for biased patterns with a distribution

P(\xi_j^\mu) = \tfrac{1}{2}(1+m)\,\delta(\xi_j^\mu - 1) + \tfrac{1}{2}(1-m)\,\delta(\xi_j^\mu + 1)    (10.112)
so that ⟨⟨ξ_j^μ⟩⟩ = m. The calculation is just a little bit more complicated, with an extra set of variables M_α = N^{-1/2} Σ_j w_j^α, with respect to which G has to be maximized. The results for the storage capacity as a function of m and κ are shown in Fig. 10.1.

FIGURE 10.1 Capacity α_c as a function of κ for three values of m (from Gardner [1988]).

An interesting limit is that of m → 1 (sparse patterns). Then the result for κ = 0 is

\alpha_c = \frac{1}{(1-m)\,\log\big(\frac{1}{1-m}\big)}    (10.113)

which shows that one can store a great many sparse patterns. But there is nothing very surprising about this, because very sparse patterns have very small information content. Indeed, if we work out the total information capacity (the maximum information we can store, in bits), given by

I = -\frac{N^2 \alpha_c}{\log 2} \Big[ \tfrac{1}{2}(1-m) \log\Big(\frac{1-m}{2}\Big) + \tfrac{1}{2}(1+m) \log\Big(\frac{1+m}{2}\Big) \Big] ,    (10.114)

then we obtain

I = \frac{N^2}{2 \log 2}    (10.115)

in the limit m → 1. This is less than the result for the unbiased case (m = 0, α_c = 2), which is I = 2N². In fact the total information capacity is always of the order N², depending only slightly on m. It is interesting to note that a capacity of the order of the optimal one (10.113) is obtained for a Hopfield network from a simple Hebb-like rule [Willshaw et al., 1969; Tsodyks and Feigel'man, 1988], as we mentioned in Chapter 2.

A number of extensions of this work have been made, notably to patterns with a finite fraction of errors, binary weights, diluted connections, and (in the recurrent network) connections with differing degrees of correlation between w_ij and w_ji [Gardner and Derrida, 1988; Gardner et al., 1989].

Generalization Ability

A particularly interesting application is to the calculation of the generalization ability of a simple perceptron. Recall from Section 6.5 that the generalization ability of a network was defined as the probability of its giving the correct output for the mapping it is trained to implement, when tested on a random example of the mapping not restricted to the training set. This can be calculated analytically by Gardner's methods [Gyorgyi and Tishby, 1990; Gyorgyi, 1990; Opper et al., 1990].

The basic idea, first used by Gardner and Derrida [1989], is to perform a calculation of the weight-space volume like the one just described, but, instead of considering random input-target pairs (ξ_j^μ, ζ^μ), using pairs which are examples of a particular function f(ξ) = sgn(v · ξ) that the perceptron could learn. That is, we think of our perceptron as learning to imitate a teacher perceptron whose weights are v_j.

Under learning, the pupil perceptron's weight vector w will come to line up with that of its teacher. Its generalization ability will depend on one parameter, the dot product of the two vectors:

R = \frac{1}{N}\, \mathbf{w} \cdot \mathbf{v} .    (10.116)

Here both w and v are normalized as in (10.82). R is introduced into the calculation in the same way that q was earlier, by inserting a delta function and integrating over it. Ultimately one obtains saddle-point equations for both q and R.

To find the generalization ability from R, consider the two variables

x = N^{-1/2} \sum_j w_j \xi_j \quad\text{and}\quad y = N^{-1/2} \sum_j v_j \xi_j    (10.117)

which are the net inputs to the pupil and the teacher respectively. For large N, x and y are Gaussian variables, each of zero mean and unit variance, with covariance ⟨xy⟩ = R. Thus their joint distribution is

P(x, y) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\Big( -\frac{x^2 - 2Rxy + y^2}{2(1-R^2)} \Big) .    (10.118)

Having averaged over all inputs, the generalization ability g no longer depends on the specific mapping of the teacher (parametrized by v), but only on the number of examples.
We therefore write it as g(α). Clearly g(α) is the probability that x and y have the same sign. Simple geometry then leads to

g(\alpha) = 1 - \frac{1}{\pi} \cos^{-1} R    (10.119)

where R is obtained from the saddle-point condition as described above.

FIGURE 10.2 The generalization ability g(α) as a function of relative training-set size α. Adapted from Opper et al. [1990].

Figure 10.2 shows the resulting g(α). The necessary number of examples for good generalization is clearly of order N, in agreement with the estimate (6.81). In the limit of many training examples perfect generalization is approached:

1 - g(\alpha) = \frac{1}{\alpha} .    (10.120)

This form of approach means that the a priori generalization ability distribution ρ₀(g) discussed in Section 6.5 has no gap around g = 1.

This example shows how one can actually do an explicit calculation (for the simple perceptron) which fits into the theoretical framework for generalization introduced in Section 6.5. We hope it will guide us in future calculations for less trivial architectures.

All the preceding has been algorithm-independent: it is about the existence of connection weights that implement the desired association, not about how they are found. It is also possible to apply statistical mechanics methods to particular algorithms [Kinzel and Opper, 1990; Hertz et al., 1989; Hertz, 1990], including their dynamics, but these calculations lie outside the scope of the framework we have presented here.
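As a closing numerical illustration, the optimal-perceptron capacity (10.110) is a one-line quadrature. The sketch below (an illustration assuming numpy and scipy, with arbitrary sample values of κ) evaluates α_c(κ) and checks that κ = 0 reproduces α_c = 2, the Cover result recovered in (10.111).

```python
# Numerical evaluation of the Gardner capacity (10.110):
#   1 / alpha_c(kappa) = integral_{-kappa}^{infinity} Dt (t + kappa)^2
# with Dt the unit Gaussian measure.  At kappa = 0 this should give
# alpha_c = 2, as in (10.111).  A quadrature sketch using numpy and scipy.
import numpy as np
from scipy.integrate import quad

def alpha_c(kappa):
    gauss = lambda t: np.exp(-t * t / 2.0) / np.sqrt(2.0 * np.pi)
    val, _ = quad(lambda t: gauss(t) * (t + kappa)**2, -kappa, np.inf)
    return 1.0 / val

for kappa in (0.0, 0.5, 1.0, 2.0):
    print(f"kappa = {kappa:3.1f}   alpha_c = {alpha_c(kappa):.3f}")
```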