Capacity of a Perceptron
This document discusses the capacity of simple perceptrons. It begins by explaining that the capacity (Pmax) is the maximum number of random input-output pairs that can be reliably stored in a network. For linear units, Pmax is equal to the number of input units (N).
The document then focuses on deriving the capacity formula for threshold units receiving continuous-valued random inputs. Through an analysis of how many ways a set of points can be divided into two classes by a hyperplane, it is shown that Pmax is equal to 2N. Graphs demonstrate how the number of classifiable patterns transitions sharply from all to none around this value as N increases.
FIVE Simple Perceptrons
5.6 Stochastic Units

Another generalization is from our deterministic units to stochastic units S_i^μ governed by (2.48):

\mathrm{Prob}(S_i^\mu = \pm 1) = \frac{1}{1 + \exp(\mp 2\beta h_i^\mu)}    (5.54)

with

h_i^\mu = \sum_k w_{ik}\,\xi_k^\mu    (5.55)

as before. This leads to

\langle S_i^\mu \rangle = \tanh\Big(\beta \sum_k w_{ik}\,\xi_k^\mu\Big)    (5.56)

just as in (2.42). In the context of a simulation we can use (5.56) to calculate ⟨S_i^μ⟩, whereas in a real stochastic network we would find it by averaging S_i^μ for a while, updating randomly chosen units according to (5.54). Either way, we then use ⟨S_i^μ⟩ as the basis of a weight change

\Delta w_{ik} = \eta \sum_\mu \delta_i^\mu \xi_k^\mu    (5.57)

where

\delta_i^\mu = \zeta_i^\mu - \langle S_i^\mu \rangle .    (5.58)

This is just the average over outcomes of the changes we would have made on the basis of individual outcomes using the ordinary delta rule. We will find it particularly important when we discuss reinforcement learning in Section 7.4.

It is interesting to prove that this rule always decreases the average error given by the usual quadratic measure

E = \frac{1}{2} \sum_{i\mu} (\zeta_i^\mu - S_i^\mu)^2 .    (5.59)

Since we are assuming output units and patterns are ±1, this is just twice the total number of bits in error, and can also be written

E = \sum_{i\mu} (1 - \zeta_i^\mu S_i^\mu) .    (5.60)

Thus the average error in the stochastic network is

\langle E \rangle = \sum_{i\mu} \big(1 - \zeta_i^\mu \langle S_i^\mu \rangle\big)
 = \sum_{i\mu} \Big[ 1 - \zeta_i^\mu \tanh\Big(\beta \sum_k w_{ik}\,\xi_k^\mu\Big) \Big] .    (5.61)

The change in ⟨E⟩ in one cycle of weight updating is thus

\delta\langle E \rangle = \sum_{ik} \frac{\partial\langle E \rangle}{\partial w_{ik}}\,\Delta w_{ik}
 = -\sum_{i\mu k} \eta\,\beta\,\big[1 - \zeta_i^\mu \tanh(\beta h_i^\mu)\big]\,\mathrm{sech}^2(\beta h_i^\mu)\,(\xi_k^\mu)^2    (5.62)

using d tanh(x)/dx = sech² x, where sech² x = 1 − tanh² x is a bell-shaped curve with its peak at x = 0. The result (5.62) is clearly always negative (recall |tanh(x)| < 1), so the procedure always improves the average error.

5.7 Capacity of the Simple Perceptron *

In the case of the associative network in Chapter 2 we were able to find the capacity p_max of a network of N units; for random patterns we found p_max = 0.138 N for large N if we used the standard Hebb rule. If we tried to store p patterns with p > p_max the performance became terrible. Similar questions can be asked for simple perceptrons: How many random input-output pairs can we expect to store reliably in a network of given size? How many of these can we expect to learn using a particular learning rule? The answer to the second question may well be smaller than the first (e.g., for nonlinear units), but is presently unknown in general. The first question, which this section deals with, gives the maximum capacity that any learning algorithm can hope to achieve.

For continuous-valued units (linear or nonlinear) we already know the answer, because the condition is simply linear independence. If we choose p random patterns, then they will be linearly independent if p ≤ N (except for cases with very small probability). So the capacity is p_max = N.

The case of threshold units depends on linear separability, which is harder to deal with. The answer for random continuous-valued inputs was derived by Cover [1965] (see also Mitchison and Durbin [1989]) and is remarkably simple:

p_{\max} = 2N .    (5.63)

As usual N is the number of input units, and is presumed large. The number of output units must be small and fixed (independent of N). Equation (5.63) is strictly true only in the N → ∞ limit.

FIGURE 5.11 The function C(p, N)/2^p given by (5.67) plotted versus p/N for N = 5, 20, and 100.

The rest of this section is concerned with proving (5.63), and may be omitted on first reading. We follow the approach of Cover [1965].
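Before going through the argument, the prediction (5.63) is easy to check numerically. The sketch below is a rough illustration, not part of the text: it assumes numpy and scipy are available, draws p Gaussian points in N dimensions, colors them at random, and uses a small linear program to test whether some hyperplane through the origin separates red from black. The estimated probability of separability is the quantity plotted in Fig. 5.11; the helper names, sizes, and trial counts are arbitrary choices.

```python
# Rough Monte Carlo check of p_max = 2N (eq. 5.63): estimate, for random
# +-1 colorings of p random points in N dimensions, the probability that a
# separating hyperplane through the origin exists (the curve of Fig. 5.11).
# Assumes numpy and scipy; sizes and trial counts are illustrative only.
import numpy as np
from scipy.optimize import linprog

def separable(points, colors):
    """True if some w satisfies colors_i * (w . points_i) > 0 for all i."""
    # Strict separability is equivalent to feasibility of y_i (w . x_i) >= 1,
    # since any strictly separating w can be rescaled.
    A_ub = -(colors[:, None] * points)              # encodes -y_i (x_i . w) <= -1
    b_ub = -np.ones(len(points))
    c = np.zeros(points.shape[1])                   # any feasible w will do
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * points.shape[1], method="highs")
    return res.status == 0                          # status 0: a feasible w was found

def fraction_separable(N, p, trials=200):
    rng = np.random.default_rng(0)
    hits = 0
    for _ in range(trials):
        x = rng.standard_normal((p, N))             # points in general position
        y = rng.choice([-1.0, 1.0], size=p)         # a random coloring
        hits += separable(x, y)
    return hits / trials

if __name__ == "__main__":
    N = 20
    for ratio in (1.0, 1.5, 2.0, 2.5, 3.0):
        p = int(ratio * N)
        print(f"p/N = {ratio:3.1f}   P(separable) ~ {fraction_separable(N, p):.2f}")
```

For growing N the estimated curve should steepen around p/N = 2, reproducing the behavior described for Fig. 5.11.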
A more general (but much more difficult) method for answering this sort of question was given by Gardner [1987] and is discussed in Chapter 10.

We consider a perceptron with N continuous-valued inputs and one ±1 output unit, using the deterministic threshold limit. The extension to several output units is trivial since output units and their connections are independent; the result (5.63) applies separately to each. For convenience we take the thresholds to be zero, but they could be reinserted at the expense of one extra input unit, as in (5.2).

In (5.11) we showed that the perceptron divides the N-dimensional input space into two regions separated by an (N − 1)-dimensional hyperplane. For the case of zero threshold this plane goes through the origin. All the points on one side give an output of +1 and all those on the other side give −1. Let us think of these as red (+1) and black (−1) points respectively. Then the question we need to answer is: how many points can we expect to place randomly in an N-dimensional space, some colored red and some black, and still find a hyperplane through the origin that divides the red points from the black points?

Let us consider a slightly different question. For a given set of p randomly placed points in an N-dimensional space, for how many out of the 2^p possible red and black colorings of the points can we find a hyperplane dividing red from black? Call the answer C(p, N). For p small we expect C(p, N) = 2^p, because we should be able to find a suitable hyperplane for any possible coloring; consider N = p = 2 for example. For p large we expect C(p, N) to drop well below 2^p, so an arbitrarily chosen coloring will probably not possess a dividing hyperplane. The transition between these regimes turns out to be sharp for large N, and gives us p_max.

We will calculate C(p, N) shortly, but let us first examine the result. Figure 5.11 shows a graph of C(p, N)/2^p against p/N for N = 5, 20, and 100. Our expectations for small and large p are fulfilled, and we see that the transition occurs quite rapidly in the neighborhood of p = 2N, in agreement with (5.63). As N is made larger and larger the transition becomes more and more sharp. Thus (5.63) is justified if we can demonstrate that Fig. 5.11 is correct.

FIGURE 5.12 Finding separating hyperplanes constrained to go through a point P as well as the origin O is equivalent to projecting the problem onto one lower dimension.

Randomness of the points is not actually necessary, nor is it well defined unless a distribution function is specified. All that we need is that the points be in general position. As discussed on page 97, this means (for the no-threshold case) that all subsets of N (or fewer) points must be linearly independent. As an example consider N = 2: a set of p points in a two-dimensional plane is in general position if no two lie on the same line through the origin. A set of points chosen from a continuous random distribution will obviously be in general position, except for coincidences that have zero probability.

We can now calculate C(p, N) by induction. Let us call a coloring that can be divided by a hyperplane a dichotomy. Suppose we know C(p, N) for a set of p points, and add a new point P. For those previous dichotomies where the dividing hyperplane could have been drawn through the point P, there will be two new dichotomies, one with P red and one with P black. This is because, when the points are in general position, any hyperplane through P can be shifted infinitesimally to go to either side of P, without changing the side of any of the other p points.
For the rest of the dichotomies only one color of point P will work, so there will be one new dichotomy for each old one. Thus

C(p+1, N) = C(p, N) + D    (5.64)

where D is the number of the C(p, N) dichotomies that could have had the dividing hyperplane drawn through P as well as the origin O. But this number is simply C(p, N−1), because constraining the hyperplanes to go through a particular point P makes the problem effectively (N − 1)-dimensional; as illustrated in Fig. 5.12, we can project the whole problem onto an (N − 1)-dimensional plane perpendicular to OP, since any displacement of a point along the OP direction cannot affect which side of any hyperplane containing OP it lies on. We thereby obtain the recursion relation

C(p+1, N) = C(p, N) + C(p, N-1) .    (5.65)

Iterating this equation for p, p−1, p−2, ..., 1 yields

C(p, N) = \binom{p-1}{0} C(1, N) + \binom{p-1}{1} C(1, N-1) + \cdots + \binom{p-1}{p-1} C(1, N-p+1) .    (5.66)

For p < N this is easy to handle, because C(1, N) = 2 for all N ≥ 1; one point can be colored red or black. For p > N the second argument of C becomes 0 or negative in some terms, but these terms can be eliminated by taking C(p, N) = 0 for N ≤ 0. It is easy to check that this choice is consistent with the recursion relation (5.65), and with C(p, 1) = 2 (in one dimension the only "hyperplane" is a point at the origin, allowing two dichotomies). Thus (5.66) makes sense for all values of p and N, and can be written as

C(p, N) = 2 \sum_{i=0}^{N-1} \binom{p-1}{i}    (5.67)

if we use the standard convention that \binom{n}{m} = 0 for m > n. Equation (5.67) was used to plot Fig. 5.11, thus completing the demonstration.

It is actually easy to show from the symmetry \binom{n}{m} = \binom{n}{n-m} of binomial coefficients that

C(2N, N) = \tfrac{1}{2}\, 2^{2N}    (5.68)

so the curve goes through 1/2 at p = 2N. To show analytically that the transition sharpens up for increasing N, one can appeal to the large-N Gaussian limit of the binomial coefficients, which leads to

\frac{C(p, N)}{2^p} \simeq \frac{1}{2}\Big[ 1 + \operatorname{erf}\Big( \frac{2N - p}{\sqrt{2p}} \Big) \Big]    (5.69)

for large N. It is worth noting that C(p, N) = 2^p if p ≤ N (this is shown on page 155). So any coloring of up to N points is linearly separable, provided only that the points are in general position. For N or fewer points general position is equivalent to linear independence, so the sufficient conditions for a solution are exactly the same in the threshold and continuous-valued networks. But this is not true, of course, for p > N.

SIX Multi-Layer Networks

The limitations of a simple perceptron do not apply to feed-forward networks with intermediate or "hidden" layers between the input and output layer. In fact, as we will see later, a network with just one hidden layer can represent any Boolean function (including for example XOR). Although the greater power of multi-layer networks was realized long ago, it was only recently shown how to make them learn a particular function, using "back-propagation" or other methods. This absence of a learning rule, together with the demonstration by Minsky and Papert [1969] that only linearly separable functions could be represented by simple perceptrons, led to a waning of interest in layered networks until recently.

Throughout this chapter, like the previous one, we consider only feed-forward networks. More general networks are discussed in the next chapter.

6.1 Back-Propagation

The back-propagation algorithm is central to much current work on learning in neural networks. It was invented independently several times, by Bryson and Ho [1969], Werbos [1974], Parker [1985], and Rumelhart et al. [1986a, b]. A closely related approach was proposed by Le Cun [1985].
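As a preview of the prescription spelled out in the following paragraphs, here is a minimal sketch of gradient descent on the quadratic error for a tiny two-layer tanh network learning XOR, the example mentioned above. It is an illustration only, written in numpy; the layer sizes, learning rate, initialization, and number of sweeps are arbitrary choices, not prescriptions from the text.

```python
# Minimal two-layer back-propagation sketch (tanh units, quadratic error).
# Notation follows the chapter: inputs xi_k, hidden units V_j, outputs O_i,
# with weights w_jk (input -> hidden) and W_ij (hidden -> output).
import numpy as np

xi = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])   # the four input pairs
zeta = -(xi[:, 0] * xi[:, 1]).reshape(-1, 1)                  # XOR targets: +1 if inputs differ

rng = np.random.default_rng(0)
w = 0.5 * rng.standard_normal((3, 2))      # input -> hidden weights w_jk
W = 0.5 * rng.standard_normal((1, 3))      # hidden -> output weights W_ij
eta = 0.1                                  # learning rate

for sweep in range(5000):
    V = np.tanh(xi @ w.T)                          # hidden activations
    O = np.tanh(V @ W.T)                           # network outputs
    delta_out = (zeta - O) * (1.0 - O**2)          # output-layer deltas
    delta_hid = (delta_out @ W) * (1.0 - V**2)     # deltas propagated back to the hidden layer
    W += eta * delta_out.T @ V                     # gradient descent on the
    w += eta * delta_hid.T @ xi                    # quadratic error

print("outputs after training:", np.round(O.ravel(), 2))
```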
The algorithm gives a prescription for changing the weights w_pq in any feed-forward network to learn a training set of input-output pairs. The basis is simply gradient descent, as described in Sections 5.4 (linear) and 5.5 (nonlinear) for a simple perceptron.

We consider first a two-layer network such as that illustrated by Fig. 6.1. Our notational conventions are shown in the figure; output units are denoted by O_i, hidden units by V_j, and input terminals by ξ_k. There are connections w_jk from the inputs to the hidden units, and connections W_ij from the hidden units to the output units.

TEN Formal Statistical Mechanics of Neural Networks

Only the second of these, which comes from ∂f/∂r = 0, is a little tricky, needing the identity

\langle\langle z\, f(z) \rangle\rangle_z = \langle\langle f'(z) \rangle\rangle_z    (10.75)

for any bounded function f(z). Equation (10.72) is just like (10.22) for the α = 0 case, except for the addition of the effective Gaussian random field term, which represents the crosstalk from the uncondensed patterns. For α = 0 it reduces directly to (10.22). Equation (10.73) is the obvious equation for the mean square magnetization. Equation (10.74) gives the (nontrivial) relation between q and the mean square value of the random field, and is identical to (2.67).

For memory states, i.e., m-vectors of the form (m, 0, 0, ...), the saddle-point equations (10.72) and (10.73) become simply

m = \big\langle\big\langle \tanh \beta(\sqrt{\alpha r}\, z + m) \big\rangle\big\rangle_z    (10.76)
q = \big\langle\big\langle \tanh^2 \beta(\sqrt{\alpha r}\, z + m) \big\rangle\big\rangle_z    (10.77)

where the averaging is solely over the Gaussian random field. These are identical to (2.65) and (2.68) that we found in the heuristic theory of Section 2.5. Their solution, and the consequent phase diagram of the model in α–T space, can be studied as we sketched there. Spurious states, such as the symmetric combinations (10.26), can also be analyzed at finite α using the full equations (10.72)–(10.74).

There are several subtle points in this replica method calculation:

We started by calculating ⟨⟨Z^n⟩⟩ for integer n but eventually interpreted n as a real number and took the n → 0 limit. This is not the only possible continuation from the integers to the reals; we might for example have added a function like sin(πn)/n.

We treated the order of limits and averages in a cavalier fashion, and in particular reversed the order of n → 0 and N → ∞.

We made the replica symmetry approximation (10.60)–(10.62), which was really only based on intuition.

Experience has shown that the replica method usually does work, but there are few rigorous mathematical results. It can be shown for the Sherrington-Kirkpatrick spin glass model, and probably for this one too, that the reversal of limits is justified, and that the replica symmetry assumption is correct for integer n [van Hemmen and Palmer, 1979]. But for some problems, including the spin glass, the method sometimes gives the wrong answer. This can be blamed on the integer-to-real continuation, and can be corrected by replica symmetry breaking, in which the replica symmetry assumption is replaced by a more complicated one. Then the natural continuation seems to give the right answer.

For the present problem Amit et al. showed that the replica symmetric approximation is valid except at very low temperatures, where there is replica symmetry breaking. This seems to lead only to very small corrections in the results. However, the predicted change in the capacity (α_c becomes 0.144 instead of 0.138) can be detected in numerical simulations [Crisanti et al., 1986].
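The memory-state equations (10.76)–(10.77) are easy to solve numerically by fixed-point iteration once the crosstalk strength r is supplied. The sketch below is a rough illustration in numpy: it assumes the standard form r = q/[1 − β(1 − q)]² for the relation that the text refers to as (10.74), since that equation is not reproduced in this excerpt, and the particular values of α and T are arbitrary.

```python
# Rough fixed-point solution of the memory-state equations (10.76)-(10.77).
# The crosstalk strength r is taken as r = q / [1 - beta*(1 - q)]^2, assumed
# to be the standard form of the relation referred to as (10.74).
import numpy as np

nodes, weights = np.polynomial.hermite.hermgauss(80)   # Gauss-Hermite rule

def gauss_avg(f):
    """Average of f(z) over a zero-mean, unit-variance Gaussian z."""
    return np.sum(weights * f(np.sqrt(2.0) * nodes)) / np.sqrt(np.pi)

def memory_state(alpha, T, sweeps=500):
    beta, m, q = 1.0 / T, 1.0, 1.0
    for _ in range(sweeps):
        r = q / (1.0 - beta * (1.0 - q))**2
        h = lambda z: beta * (np.sqrt(alpha * r) * z + m)   # argument of tanh
        m_new = gauss_avg(lambda z: np.tanh(h(z)))          # eq. (10.76)
        q_new = gauss_avg(lambda z: np.tanh(h(z))**2)       # eq. (10.77)
        m, q = m_new, q_new
    return m, q

for alpha in (0.02, 0.05, 0.10):
    m, q = memory_state(alpha, T=0.2)
    print(f"alpha = {alpha:.2f}, T = 0.2:  m = {m:.3f}  q = {q:.3f}")
```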
10.2 Gardner Theory of the Connections

The second classic statistical mechanical tour de force in neural networks is the computation by Gardner [1987, 1988] of the capacity of a simple perceptron. The calculation applies in the same form to a Hopfield-like recurrent network for auto-associative memory if the connections are allowed to be asymmetric.

The theory is very general; it is not specific to any particular algorithm for the connections. On the other hand, it does not provide us with a specific set of connections even when it has told us that such a set exists.

As in Section 6.5, the basic idea is to consider the fraction of weight space that implements a particular input-output function; recall that weight space is the space of all possible connection weights w = {w_ij}. In Section 6.5 we used relatively simple methods to calculate weight space volumes. The present approach is more complicated, though often more powerful. We use many of the techniques introduced in the previous section, including replicas, auxiliary variables, and the saddle-point method.

We consider a simple perceptron with N binary inputs ξ_j = ±1 and M binary threshold units that compute the outputs

O_i = \operatorname{sgn}\Big( N^{-1/2} \sum_j w_{ij}\, \xi_j \Big) .    (10.78)
The N^{-1/2} factor will be discussed shortly. Given a desired set of associations ξ_j^μ → ζ_i^μ for μ = 1, 2, ..., p, we want to know in what fraction of weight space the equations

\zeta_i^\mu = \operatorname{sgn}\Big( N^{-1/2} \sum_j w_{ij}\, \xi_j^\mu \Big)    (10.79)

are satisfied (for all i and μ). Or equivalently, in what fraction of this space are the inequalities

\zeta_i^\mu\, N^{-1/2} \sum_j w_{ij}\, \xi_j^\mu > 0    (10.80)

true?
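For small N and p this fraction can be estimated directly by brute force: since only the direction of each row of w matters for (10.80), one can sample isotropic random weight vectors and count how often all p conditions hold. The sketch below is an illustration only, for a single output unit; it assumes numpy, and the sizes and sample count are arbitrary.

```python
# Brute-force estimate of the weight-space fraction in which the conditions
# (10.80) hold, for one random instance with a single output unit and small
# N, p.  Only the direction of w matters for (10.80), so isotropic random
# weight vectors are enough.  Assumes numpy; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, p, samples = 8, 5, 200_000

xi = rng.choice([-1.0, 1.0], size=(p, N))        # input patterns xi_j^mu
zeta = rng.choice([-1.0, 1.0], size=p)           # desired outputs zeta^mu

w = rng.standard_normal((samples, N))            # isotropic weight vectors

# zeta^mu N^{-1/2} sum_j w_j xi_j^mu  for every sampled w and every pattern
fields = (w @ xi.T) * zeta / np.sqrt(N)
fraction = np.mean(np.all(fields > 0.0, axis=1))
print(f"fraction of weight space storing all {p} associations: {fraction:.4f}")
```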
It is also interesting to ask the corresponding question if the condition (10.80) is strengthened so that there is a margin of size κ > 0, as in (5.20):

\zeta_i^\mu\, N^{-1/2} \sum_j w_{ij}\, \xi_j^\mu > \kappa .    (10.81)

A nonzero κ guarantees correction of small errors in the input pattern.

Until (10.81) the factor N^{-1/2} was irrelevant. We include it because it is convenient to work with w_ij's of order unity, and a sum of N such terms of random sign gives a result of order N^{1/2}. Thus the explicit factor N^{-1/2} makes the left-hand side of (10.81) of order unity, and it is appropriate to think about κ's that are independent of N. Of course this is only appropriate if the terms in the sum over j are really of random sign, but that turns out to be the case of most interest here. On the other hand, in Chapter 5 we were mainly dealing with a correlated sum, and so used a factor N instead of N^{1/2}.

For a recurrent autoassociative network, the same equations with ζ_j^μ = ξ_j^μ give the condition for the stability of the patterns, and a nonzero κ ensures finite basins of attraction.

The Capacity of a Simple Perceptron

The fundamental quantity that we want to calculate is the volume fraction of weight space in which (10.81) is satisfied. Adding an additional constraint

\sum_j w_{ij}^2 = N    (10.82)

for each unit i, so as to keep the weights within bounds, this fraction is

V = \frac{\int d\mathbf{w}\, \Big( \prod_{i\mu} \theta\big( \zeta_i^\mu N^{-1/2} \sum_j w_{ij}\xi_j^\mu - \kappa \big) \Big)\, \prod_i \delta\big( \sum_j w_{ij}^2 - N \big)}{\int d\mathbf{w}\, \prod_i \delta\big( \sum_j w_{ij}^2 - N \big)} .    (10.83)

Here we enforce the constraint (10.82) with the delta functions, and restrict the numerator to regions satisfying (10.81) with the step functions θ(x).

The expression (10.83) is rather like a statistical-mechanical partition function (10.1), but the conventional exponential weight is replaced by an all-or-nothing one given by the step functions. It is also important to recognize that here it is the weights w_ij that are the fundamental statistical-mechanical variables, not the activations of the units.

We observe immediately that (10.83) factors into a product of identical terms, one for each i. Therefore we can drop the index i altogether, reducing without loss of generality the calculation to the case of a single output unit. The corresponding step also works for the recurrent network if w_ij and w_ji are independent, but the calculation cannot be done this way if a symmetry constraint w_ij = w_ji is imposed.

In the same way as for Z in the previous section, the statistically relevant quantity is the average over the pattern distribution, not of V itself, but of its logarithm. Therefore we introduce replicas again and compute the average

\langle\langle V^n \rangle\rangle = \Bigg\langle\Bigg\langle \frac{\int \prod_\alpha d\mathbf{w}^\alpha\, \prod_{\mu\alpha} \theta\big( \zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu - \kappa \big)\, \prod_\alpha \delta\big( \sum_j (w_j^\alpha)^2 - N \big)}{\int \prod_\alpha d\mathbf{w}^\alpha\, \prod_\alpha \delta\big( \sum_j (w_j^\alpha)^2 - N \big)} \Bigg\rangle\Bigg\rangle    (10.84)

where the integrals are over all the w_j^α's and the average ⟨⟨...⟩⟩ is over the ξ_j^μ's and the ζ^μ's.

To proceed we use the same kinds of tricks as in the previous section. First we work on the step functions, using the integral representation

\theta(z - \kappa) = \int_\kappa^\infty d\lambda\, \delta(\lambda - z) = \int_\kappa^\infty d\lambda \int_{-\infty}^{\infty} \frac{dx}{2\pi}\, e^{i x (\lambda - z)} .    (10.85)

We have step functions for each α and μ, so at this point we need auxiliary variables λ_μ^α and x_μ^α. Thus a particular step function becomes

\theta\big( \zeta^\mu N^{-1/2} \textstyle\sum_j w_j^\alpha \xi_j^\mu - \kappa \big) = \int_\kappa^\infty d\lambda_\mu^\alpha \int \frac{dx_\mu^\alpha}{2\pi}\, e^{i x_\mu^\alpha (\lambda_\mu^\alpha - z_\mu^\alpha)}    (10.86)

where

z_\mu^\alpha = \zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu .    (10.87)

It is now easy to average over the patterns, which occur only in the last factor of (10.86). We consider the case of independent binary patterns, for which we have

\Big\langle\Big\langle \prod_{\mu\alpha} e^{-i x_\mu^\alpha z_\mu^\alpha} \Big\rangle\Big\rangle
 = \prod_{j\mu} \Big\langle\Big\langle \exp\Big( -i \zeta^\mu \xi_j^\mu N^{-1/2} \sum_\alpha x_\mu^\alpha w_j^\alpha \Big) \Big\rangle\Big\rangle
 = \exp\Big( \sum_{j\mu} \log\cos\Big[ N^{-1/2} \sum_\alpha x_\mu^\alpha w_j^\alpha \Big] \Big)
 \simeq \exp\Big( -\frac{1}{2N} \sum_{\mu\alpha\beta} x_\mu^\alpha x_\mu^\beta \sum_j w_j^\alpha w_j^\beta \Big) .    (10.88)
The resulting Σ_j w_j^α w_j^β term is not easy to deal with directly, so we replace it by a new variable q_αβ defined by

q_{\alpha\beta} = \frac{1}{N} \sum_j w_j^\alpha w_j^\beta .    (10.89)

This gives q_αα = 1 from (10.82), but we prefer to treat the α = β terms explicitly and use q_αβ only for α ≠ β. Thus we rewrite (10.88) as

\Big\langle\Big\langle \prod_{\mu\alpha} e^{-i x_\mu^\alpha z_\mu^\alpha} \Big\rangle\Big\rangle
 = \prod_\mu \exp\Big( -\tfrac{1}{2} \sum_\alpha (x_\mu^\alpha)^2 - \sum_{\alpha<\beta} x_\mu^\alpha x_\mu^\beta\, q_{\alpha\beta} \Big)    (10.90)

using q_αβ = q_βα. The q_αβ's play the same role in this problem that q_αβ and r_αβ did in the previous section.

When we insert (10.90) into (10.86) we see that we get an identical result for each μ, so we can drop all the μ's and write

\Big\langle\Big\langle \prod_{\mu\alpha} \theta\big( \zeta^\mu N^{-1/2} \textstyle\sum_j w_j^\alpha \xi_j^\mu - \kappa \big) \Big\rangle\Big\rangle
 = \Bigg[ \int_\kappa^\infty \Big( \prod_\alpha d\lambda_\alpha \Big) \int \Big( \prod_\alpha \frac{dx_\alpha}{2\pi} \Big)\, e^{K\{\lambda, x, q\}} \Bigg]^p    (10.91)

where

K\{\lambda, x, q\} = i \sum_\alpha x_\alpha \lambda_\alpha - \tfrac{1}{2} \sum_\alpha x_\alpha^2 - \sum_{\alpha<\beta} x_\alpha x_\beta\, q_{\alpha\beta} .    (10.92)

Now we turn to the delta functions. Using the basic integral representation

\delta(z) = \int \frac{dr}{2\pi i}\, e^{-r z}    (10.93)

we choose r = E_α/2 for each α to write the delta functions in (10.84) as

\delta\Big( \sum_j (w_j^\alpha)^2 - N \Big) = \int \frac{dE_\alpha}{4\pi i}\, \exp\Big[ -\frac{E_\alpha}{2} \Big( \sum_j (w_j^\alpha)^2 - N \Big) \Big] .    (10.94)

In the same way we enforce the condition (10.89) for each pair αβ (with α > β), using r = N F_αβ:

\delta\Big( q_{\alpha\beta} - N^{-1} \sum_j w_j^\alpha w_j^\beta \Big) = N \int \frac{dF_{\alpha\beta}}{2\pi i}\, \exp\Big[ -F_{\alpha\beta} \Big( N q_{\alpha\beta} - \sum_j w_j^\alpha w_j^\beta \Big) \Big] .    (10.95)

We also have to add an integral over each of the q_αβ's, so that the delta function can pick out the desired value.

A factorization of the integrals over the w's is now possible. Taking everything not involving w_j^α outside, the numerator of (10.84) includes a factor

\int \Big( \prod_\alpha dw_j^\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha (w_j^\alpha)^2 + \sum_{\alpha<\beta} F_{\alpha\beta}\, w_j^\alpha w_j^\beta \Big)    (10.96)

for each j. These factors are all identical (w_j^α is a dummy variable, and j no longer appears elsewhere), so we can drop the j's and rewrite (10.96) as

\Bigg[ \int \Big( \prod_\alpha dw_\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 + \sum_{\alpha<\beta} F_{\alpha\beta}\, w_\alpha w_\beta \Big) \Bigg]^N .    (10.97)

The same transformation applies to the denominator of (10.84), except that there are no F_αβ terms.

It is now time to collect together our factors from (10.92), (10.94), (10.95), and (10.97). Writing A^k as exp(k log A), and omitting prefactors, (10.84) becomes

\langle\langle V^n \rangle\rangle = \frac{\int \big( \prod_\alpha dE_\alpha \big) \big( \prod_{\alpha<\beta} dq_{\alpha\beta}\, dF_{\alpha\beta} \big)\, e^{N G\{q, F, E\}}}{\int \big( \prod_\alpha dE_\alpha \big)\, e^{N H\{E\}}}    (10.98)

where

G\{q, F, E\} = \alpha \log\Bigg[ \int_\kappa^\infty \Big( \prod_\alpha d\lambda_\alpha \Big) \int \Big( \prod_\alpha \frac{dx_\alpha}{2\pi} \Big)\, e^{K\{\lambda, x, q\}} \Bigg]
 + \log\Bigg[ \int \Big( \prod_\alpha dw_\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 + \sum_{\alpha<\beta} F_{\alpha\beta}\, w_\alpha w_\beta \Big) \Bigg]
 - \sum_{\alpha<\beta} F_{\alpha\beta}\, q_{\alpha\beta} + \tfrac{1}{2} \sum_\alpha E_\alpha    (10.99)

and

H\{E\} = \log\Bigg[ \int \Big( \prod_\alpha dw_\alpha \Big) \exp\Big( -\tfrac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 \Big) \Bigg] + \tfrac{1}{2} \sum_\alpha E_\alpha .    (10.100)

Since the exponents inside the integrals in (10.98) are proportional to N, we will be able to evaluate them exactly using the saddle-point method in the large-N limit.

As before, we make a replica-symmetric ansatz:

q_{\alpha\beta} = q ,\quad F_{\alpha\beta} = F ,\quad E_\alpha = E    (10.101)

(where the first two apply for α ≠ β only). This allows us to evaluate each term of G. For the first term we can rewrite K from (10.92) as

K\{\lambda, x, q\} = i \sum_\alpha x_\alpha \lambda_\alpha - \tfrac{1}{2}(1 - q) \sum_\alpha x_\alpha^2 - \tfrac{1}{2} q \Big( \sum_\alpha x_\alpha \Big)^2    (10.102)

and linearize the last term with the usual Gaussian integral trick

e^{b^2/2} = \int_{-\infty}^{\infty} \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2 + b t}    (10.103)

derived from (10.5). Then the x_α integrals can be done, leaving a product of identical integrals over the λ_α's. Upon replacing these by a single integral to the nth power we obtain for the whole first line of (10.99):

\alpha \log\Bigg\{ \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \Bigg[ \int_\kappa^\infty \frac{d\lambda}{\sqrt{2\pi(1-q)}} \exp\Big( -\frac{(\lambda + t\sqrt{q})^2}{2(1-q)} \Big) \Bigg]^n \Bigg\}
 \simeq n \alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \log\Bigg[ \int_\kappa^\infty \frac{d\lambda}{\sqrt{2\pi(1-q)}} \exp\Big( -\frac{(\lambda + t\sqrt{q})^2}{2(1-q)} \Big) \Bigg]    (10.104)

where α ≡ p/N. The second term in G can be evaluated in the same way, linearizing the (Σ_α w_α)² term with a Gaussian integral trick, then performing in turn the w_α integrals and the Gaussian integral. The final result in the small-n limit is

\frac{n}{2} \Big[ \frac{F}{E + F} - \log(E + F) \Big] .    (10.105)

Finally, the third term of G gives simply (again for small n)

\frac{n}{2} \big( E + q F \big) .    (10.106)

Now we are in a position to find the saddle point of G with respect to q, F, and E. The most important order parameter is q.
Its value at the saddle point is the most probable value of the overlap (10.89) between a pair of solutions. If, as at small α, there is a large region of w-space that solves (10.80), then different solutions can be quite uncorrelated and q will be small. As we increase α, it becomes harder and harder to find solutions, and the typical overlap between a pair of them increases. Finally, when there is just a single solution, q becomes equal to 1. This point defines the optimal perceptron: the one with the largest capacity for a given stability parameter κ, or equivalently the one with the highest stability for a given α. We focus on this case henceforth, taking q → 1 shortly.

The saddle-point equations ∂G/∂E = 0 and ∂G/∂F = 0 can readily be solved to express E and F in terms of q:

F = \frac{q}{(1-q)^2} , \qquad E = \frac{1 - 2q}{(1-q)^2} .    (10.107)

Substituting these into the expression for G (and making a change of variable in the dλ integral), we get

\frac{1}{n} G(q) = \alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \log\Bigg[ \int_{(\kappa + t\sqrt{q})/\sqrt{1-q}}^{\infty} \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2} \Bigg]
 + \frac{1}{2} \log(1-q) + \frac{q}{2(1-q)} + \frac{1}{2} .    (10.108)

Setting ∂G/∂q = 0 to find the saddle point gives

\alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \Bigg[ \int_u^\infty dz\, e^{-z^2/2} \Bigg]^{-1} e^{-u^2/2}\, \frac{t + \kappa\sqrt{q}}{2\sqrt{q}\,(1-q)^{3/2}} = \frac{q}{2(1-q)^2}    (10.109)

where u = (κ + t√q)/√(1−q). Taking the limit q → 1 is a little tricky, but can be done using L'Hospital's rule, yielding the final result

\frac{1}{\alpha_c(\kappa)} = \int_{-\kappa}^{\infty} \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2}\, (t + \kappa)^2 .    (10.110)

Equation (10.110) gives the capacity for fixed κ. Alternatively we can use it to find the appropriate κ for the optimal perceptron to store Nα patterns. In the limit κ = 0 it gives

\alpha_c = 2    (10.111)

in agreement with the result found geometrically by Cover that was outlined in Chapter 5.

One can also perform the corresponding calculation for biased patterns with a distribution

P(\xi_j^\mu) = \tfrac{1}{2}(1+m)\,\delta(\xi_j^\mu - 1) + \tfrac{1}{2}(1-m)\,\delta(\xi_j^\mu + 1)    (10.112)
so that ⟨⟨ξ_j^μ⟩⟩ = m. The calculation is just a little bit more complicated, with an extra set of variables M_α = N^{-1/2} Σ_j w_j^α, with respect to which G has to be maximized. The results for the storage capacity as a function of m and κ are shown in Fig. 10.1.

FIGURE 10.1 Capacity α_c as a function of κ for three values of m (from Gardner [1988]).

An interesting limit is that of m → 1 (sparse patterns). Then the result for κ = 0 is

\alpha_c = \frac{1}{(1-m)\,\log\big(\frac{1}{1-m}\big)}    (10.113)

which shows that one can store a great many sparse patterns. But there is nothing very surprising about this, because very sparse patterns have very small information content. Indeed, if we work out the total information capacity (the maximum information we can store, in bits), given by

I = -\frac{N^2 \alpha_c}{\log 2} \Big[ \tfrac{1}{2}(1-m) \log\Big(\frac{1-m}{2}\Big) + \tfrac{1}{2}(1+m) \log\Big(\frac{1+m}{2}\Big) \Big] ,    (10.114)

then we obtain

I = \frac{N^2}{2 \log 2}    (10.115)

in the limit m → 1. This is less than the result for the unbiased case (m = 0, α_c = 2), which is I = 2N². In fact the total information capacity is always of the order N², depending only slightly on m. It is interesting to note that a capacity of the order of the optimal one (10.113) is obtained for a Hopfield network from a simple Hebb-like rule [Willshaw et al., 1969; Tsodyks and Feigel'man, 1988], as we mentioned in Chapter 2.

A number of extensions of this work have been made, notably to patterns with a finite fraction of errors, binary weights, diluted connections, and (in the recurrent network) connections with differing degrees of correlation between w_ij and w_ji [Gardner and Derrida, 1988; Gardner et al., 1989].

Generalization Ability

A particularly interesting application is to the calculation of the generalization ability of a simple perceptron. Recall from Section 6.5 that the generalization ability of a network was defined as the probability of its giving the correct output for the mapping it is trained to implement, when tested on a random example of the mapping not restricted to the training set. This can be calculated analytically by Gardner's methods [Gyorgyi and Tishby, 1990; Gyorgyi, 1990; Opper et al., 1990].

The basic idea, first used by Gardner and Derrida [1989], is to perform a calculation of the weight-space volume like the one just described, but, instead of considering random input-target pairs (ξ_j^μ, ζ^μ), using pairs which are examples of a particular function f(ξ) = sgn(v · ξ) that the perceptron could learn. That is, we think of our perceptron as learning to imitate a teacher perceptron whose weights are v_j.

Under learning, the pupil perceptron's weight vector w will come to line up with that of its teacher. Its generalization ability will depend on one parameter, the dot product of the two vectors:

R = \frac{1}{N}\, \mathbf{w} \cdot \mathbf{v} .    (10.116)

Here both w and v are normalized as in (10.82). R is introduced into the calculation in the same way that q was earlier, by inserting a delta function and integrating over it. Ultimately one obtains saddle-point equations for both q and R.

To find the generalization ability from R, consider the two variables

x = N^{-1/2} \sum_j w_j \xi_j \quad\text{and}\quad y = N^{-1/2} \sum_j v_j \xi_j    (10.117)

which are the net inputs to the pupil and the teacher respectively. For large N, x and y are Gaussian variables, each of zero mean and unit variance, with covariance ⟨xy⟩ = R. Thus their joint distribution is

P(x, y) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\Big( -\frac{x^2 - 2Rxy + y^2}{2(1-R^2)} \Big) .    (10.118)

Having averaged over all inputs, the generalization ability g no longer depends on the specific mapping of the teacher (parametrized by v), but only on the number of examples.
We therefore write it as g(α). Clearly g(α) is the probability that x and y have the same sign. Simple geometry then leads to

g(\alpha) = 1 - \frac{1}{\pi} \cos^{-1} R    (10.119)

where R is obtained from the saddle-point condition as described above.

FIGURE 10.2 The generalization ability g(α) as a function of relative training-set size α. Adapted from Opper et al. [1990].

Figure 10.2 shows the resulting g(α). The necessary number of examples for good generalization is clearly of order N, in agreement with the estimate (6.81). In the limit of many training examples perfect generalization is approached:

1 - g(\alpha) = \frac{1}{\alpha} .    (10.120)

This form of approach means that the a priori generalization ability distribution ρ₀(g) discussed in Section 6.5 has no gap around g = 1.

This example shows how one can actually do an explicit calculation (for the simple perceptron) which fits into the theoretical framework for generalization introduced in Section 6.5. We hope it will guide us in future calculations for less trivial architectures.

All the preceding has been algorithm-independent: it is about the existence of connection weights that implement the desired association, not about how they are found. It is also possible to apply statistical mechanics methods to particular algorithms [Kinzel and Opper, 1990; Hertz et al., 1989; Hertz, 1990], including their dynamics, but these calculations lie outside the scope of the framework we have presented here.
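As a closing numerical illustration, the optimal-perceptron capacity (10.110) is a one-line quadrature. The sketch below (an illustration assuming numpy and scipy, with arbitrary sample values of κ) evaluates α_c(κ) and checks that κ = 0 reproduces α_c = 2, the Cover result recovered in (10.111).

```python
# Numerical evaluation of the Gardner capacity (10.110):
#   1 / alpha_c(kappa) = integral_{-kappa}^{infinity} Dt (t + kappa)^2
# with Dt the unit Gaussian measure.  At kappa = 0 this should give
# alpha_c = 2, as in (10.111).  A quadrature sketch using numpy and scipy.
import numpy as np
from scipy.integrate import quad

def alpha_c(kappa):
    gauss = lambda t: np.exp(-t * t / 2.0) / np.sqrt(2.0 * np.pi)
    val, _ = quad(lambda t: gauss(t) * (t + kappa)**2, -kappa, np.inf)
    return 1.0 / val

for kappa in (0.0, 0.5, 1.0, 2.0):
    print(f"kappa = {kappa:3.1f}   alpha_c = {alpha_c(kappa):.3f}")
```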