NN Notes PDF
Ivan F Wilde
Mathematics Department
[email protected]
Contents
1 Matrix Memory
4 The Perceptron
Bibliography
Chapter 1
Matrix Memory
The idea is that the association should not be defined so much between
the individual stimulus-response pairs, but rather embodied as a whole col-
lection of such input-output patterns—the system is a distributive associa-
tive memory (the input-output pairs are “distributed” throughout the sys-
tem memory rather than the particular input-output pairs being somehow
represented individually in various different parts of the system).
To attempt to realize such a system, we shall suppose that the input key (or prototype) patterns are coded as vectors in R^n, say, and that the responses are coded as vectors in R^m. For example, the input might be a digitized photograph comprising a picture with 100 × 100 pixels, each of which may assume one of eight levels of greyness (from white (= 0) to black (= 7)). In this case, by mapping the screen to a vector, via raster order, say, the input is a vector in R^10000 whose components take values in the set {0, . . . , 7}. The desired output might correspond to the name of the person in the photograph. If we wish to recognize up to 50 people, say, then we could give each a binary code name of 6 digits, which allows up to 2^6 = 64 different names. Then the output can be considered as an element of R^6.
Now, for any pair of vectors x ∈ R^n, y ∈ R^m, we can effect the map x ↦ y via the action of the m × n matrix

M^(x,y) = y x^T

where x is considered as an n × 1 (column) matrix and y as an m × 1 matrix. Indeed,

M^(x,y) x = y x^T x = α y,

where α = x^T x = ‖x‖², the squared Euclidean norm of x. The matrix y x^T is called the outer product of x and y. This suggests a model for our "associative system".
Suppose that we wish to consider p input-output pattern pairs, (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(p), y^(p)). Form the m × n matrix

M = Σ_{i=1}^p y^(i) x^(i)T.
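As a small illustration (my own sketch, not part of the notes), the construction can be written out in NumPy; the sizes n, m, p and the patterns are arbitrary choices, and the key patterns are taken orthonormal so that recall is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 16, 4, 3  # arbitrary sizes for the illustration

# Orthonormal key patterns x^(1), ..., x^(p) in R^n (via a QR factorization)
X = np.linalg.qr(rng.normal(size=(n, p)))[0]   # columns are orthonormal keys
Y = rng.normal(size=(m, p))                    # arbitrary responses y^(i) in R^m

# Correlation memory matrix M = sum_i y^(i) x^(i)T  (an m x n matrix)
M = sum(np.outer(Y[:, i], X[:, i]) for i in range(p))

# Perfect recall for orthonormal keys: M x^(j) = y^(j)
print(np.allclose(M @ X, Y))   # True
```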
M x^(j) = y^(j),

that is, perfect recall (this holds provided the key patterns x^(1), . . . , x^(p) are orthonormal, so that x^(i)T x^(j) = δ_ij). Note that R^n contains at most n mutually orthogonal vectors.
[Figure: the memory matrix maps an input signal in R^n to an output in R^m.]
• start with M = 0,
and we have seen that M has perfect recall, M x^(j) = x^(j) for all 1 ≤ j ≤ p. We would like to know what happens if M is presented with x, a corrupted version of one of the x^(j). In order to obtain a bipolar vector as output, we process the output vector M x as follows:

M x → Φ(M x)
where ρ_i(x) = ρ(x^(i), x), the Hamming distance between the input vector x and the prototype pattern vector x^(i).
Given x, we wish to know when x → x^(m), that is, when x → M x → Φ(M x) = x^(m). According to our bipolar quantization rule, it will certainly be true that Φ(M x) = x^(m) whenever the corresponding components of M x and x^(m) have the same sign. This will be the case when (M x)_j x_j^(m) > 0, that is, whenever

(1/n)(n − 2ρ_m(x)) x_j^(m) x_j^(m) + (1/n) Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) > 0   (∗)

for all 1 ≤ j ≤ n (note that x_j^(m) x_j^(m) = 1; we have used the fact that if s > |t| then certainly s + t > 0).
We wish to find conditions which ensure that the inequality (∗) holds. By the triangle inequality, we get

| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | ≤ Σ_{i=1, i≠m}^p |n − 2ρ_i(x)|   (∗∗)

since |x_j^(i) x_j^(m)| = 1 for all 1 ≤ j ≤ n. Furthermore, x^(i)T x = n − 2ρ_i(x) and, using the orthogonality of x^(m) and x^(i), for i ≠ m, we have

x^(i)T x = x^(i)T x^(m) + x^(i)T (x − x^(m)) = x^(i)T (x − x^(m)),

so that |n − 2ρ_i(x)| = |x^(i)T (x − x^(m))| ≤ 2ρ_m(x), since x − x^(m) has exactly ρ_m(x) non-zero components, each equal to ±2. Hence, we have

|n − 2ρ_i(x)| ≤ 2ρ_m(x)   (∗∗∗)

for all i ≠ m. This, together with (∗∗), gives

| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | ≤ Σ_{i=1, i≠m}^p |n − 2ρ_i(x)| ≤ 2(p − 1)ρ_m(x).

It follows that whenever 2(p − 1)ρ_m(x) < n − 2ρ_m(x) then (∗) holds, which means that Φ(M x) = x^(m). The condition 2(p − 1)ρ_m(x) < n − 2ρ_m(x) is just that 2pρ_m(x) < n, i.e., the condition that ρ_m(x) < n/2p.
Theorem 1.3. Suppose that {x^(1), x^(2), . . . , x^(p)} is a given set of mutually orthogonal bipolar patterns in {−1, 1}^n. If x ∈ {−1, 1}^n lies within Hamming distance (n/2p) of a particular prototype vector x^(m), say, then x^(m) is the nearest prototype vector to x.
Furthermore, if the autoassociative matrix memory based on the patterns {x^(1), x^(2), . . . , x^(p)} is augmented by subsequent bipolar quantization, then the input vector x invokes x^(m) as the corresponding output.
This means that the combined memory matrix and quantization system can correctly recognize (slightly) corrupted input patterns. The non-linearity (induced by the bipolar quantizer) has enhanced the system performance—small background "noise" has been removed. Note that it could happen that the output response to x is still x^(m) even if x is further than (n/2p) from x^(m). In other words, the theorem only gives sufficient conditions for x to recall x^(m).
As an example, suppose that we store 4 patterns built from a grid of 8 × 8 pixels, so that p = 4, n = 8² = 64 and (n/2p) = 64/8 = 8. Each of the 4 patterns can then be correctly recalled even when presented with up to 7 incorrect pixels.
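This example can be checked numerically (my own sketch; the Sylvester-Hadamard construction is just one convenient way to obtain mutually orthogonal bipolar patterns in {−1, 1}^64).

```python
import numpy as np

# Sylvester construction of a 64 x 64 Hadamard matrix: its rows are
# mutually orthogonal bipolar vectors in {-1, 1}^64.
H = np.array([[1]])
for _ in range(6):
    H = np.block([[H, H], [H, -H]])

n, p = 64, 4
patterns = H[1:p + 1]                 # p = 4 mutually orthogonal patterns

M = patterns.T @ patterns / n         # autoassociative memory (1/n) sum_i x^(i) x^(i)T

rng = np.random.default_rng(1)
x = patterns[2].copy()
flip = rng.choice(n, size=7, replace=False)    # corrupt 7 of the 64 pixels
x[flip] *= -1                                  # Hamming distance 7 < n/(2p) = 8

recalled = np.sign(M @ x)             # bipolar quantization Phi
print(np.array_equal(recalled, patterns[2]))   # True
```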
Remark 1.4. If x is close to −x(m) , then the output from the combined
autocorrelation matrix memory and bipolar quantizer is −x(m) .
[Figure: a network with n input nodes and m output nodes; the connection from input node j to output node i carries the weight M_ij.]
"Weights" are assigned to the connections. Since y_i = Σ_j M_ij x_j, this suggests that we assign the weight M_ij to the connection joining input node j to output node i; M_ij = weight(j → i).
The correlation memory matrix trained on the pattern pairs (x^(1), y^(1)), . . . , (x^(p), y^(p)) is given by M = Σ_{m=1}^p y^(m) x^(m)T, which has typical term

M_ij = Σ_{m=1}^p (y^(m) x^(m)T)_ij = Σ_{m=1}^p y_i^(m) x_j^(m).
Now, Hebb's law (1949) for "real", i.e., biological, brains says that if the excitation of cell j is involved in the excitation of cell i, then continued excitation of cell j causes an increase in its efficiency to excite cell i. To encapsulate a crude version of this idea mathematically, we might hypothesise that the weight between the two nodes be proportional to the excitation values of the nodes. Thus, for pattern label m, we would postulate that the weight, weight(input j → output i), be proportional to x_j^(m) y_i^(m).
We see that Mij is a sum, over all patterns, of such terms. For this reason,
the assignment of the correlation memory matrix to a content addressable
memory system is sometimes referred to as generalized Hebbian learning, or
one says that the memory matrix is given by the generalized Hebbian rule.
We have seen that the correlation memory matrix has perfect recall provided
that the input patterns are pairwise orthogonal vectors. Clearly, there can
be at most n of these. In practice, this orthogonality requirement may not
be satisfied, so it is natural ask for some kind of guide as to the number
of patterns that can be stored and effectively recovered. In other words,
how many patterns can there be before the cross-talk term becomes so large
that it destroys the recovery of the key patterns? Experiment confirms that,
indeed, there is a problem here. To give some indication of what might be
reasonable, consider the autoassociative correlation memory matrix based
on p bipolar pattern vectors x(1) , . . . , x(p) ∈ {−1, 1}n , followed by bipolar
quantization, Φ. On presentation of pattern x(m) , the system output is
Φ(M x^(m)) = Φ( (1/n) Σ_{i=1}^p x^(i) x^(i)T x^(m) ).
Consider the k-th bit. Then Φ(M x^(m))_k = x_k^(m) whenever x_k^(m) (M x^(m))_k > 0, that is, whenever

(1/n) x_k^(m) x_k^(m) x^(m)T x^(m) + (1/n) Σ_{i=1, i≠m}^p x_k^(m) x_k^(i) x^(i)T x^(m) > 0.

Here x_k^(m) x_k^(m) = 1 and x^(m)T x^(m) = n, so the first term equals 1 and, writing C_k for the second term, the condition becomes

1 + C_k > 0.

Expanding the inner product,

C_k = (1/n) Σ_{i=1, i≠m}^p Σ_{j=1}^n X_{m,k,i,j},   where X_{m,k,i,j} = x_k^(m) x_k^(i) x_j^(i) x_j^(m).

Next, we see that, for j ≠ k, each X_{m,k,i,j} takes the values ±1 with equal probability, namely 1/2, and that these different Xs form an independent family. Therefore, we may write C_k as

C_k = (p − 1)/n + (1/n) S
M a(i) = b(i)
for all 1 ≤ i ≤ p. Let A ∈ R^{n×p} be the matrix whose columns are the vectors a^(1), . . . , a^(p), i.e., A = (a^(1) · · · a^(p)), and let B ∈ R^{m×p} be the matrix with columns given by the b^(i)s, B = (b^(1) · · · b^(p)), thus A_ij = a_i^(j) and B_ij = b_i^(j). Then it is easy to see that M a^(i) = b^(i), for all i, is equivalent to M A = B. The problem, then, is to solve the matrix equation

M A = B,
In other words,
Av = v1 a(1) + · · · + vp a(p) .
The vector Av is a linear combination of the columns of A, considered as
elements of Rn .
Now, the statement that Av = 0 if and only if v = 0 is equivalent to the
statement that v1 a(1) + · · · + vp a(p) = 0 if and only if v1 = v2 = · · · = vp = 0
which, in turn, is equivalent to the statement that a(1) ,. . . ,a(p) are linearly
independent vectors in Rn .
Thus, the statement, Av = 0 if and only if v = 0, is true if and only if
the columns of A are linearly independent vectors in Rn .
Proof. The square matrix ATA is invertible if and only if the equation
ATAv = 0 has the unique solution v = 0, v ∈ Rp . (Certainly the invertibility
of ATA implies the uniqueness of the zero solution to ATAv = 0. For the
converse, first note that the uniqueness of this zero solution implies that
So, with this choice of M we get perfect recall, provided that the input
pattern vectors are linearly independent.
Note that, in general, the solution above is not unique. Indeed, for any matrix C ∈ R^{m×n}, the m × n matrix M′ = C(1l_n − A(A^T A)^{−1} A^T) satisfies

M′ A = C( A − A(A^T A)^{−1} A^T A ) = C(A − A) = 0 ∈ R^{m×p}.

Hence M + M′ satisfies (M + M′)A = B.
Can we see what M looks like in terms of the patterns a^(i), b^(i)? The answer is "yes and no". We have A = (a^(1) · · · a^(p)) and B = (b^(1) · · · b^(p)). Then

(A^T A)_ij = Σ_{k=1}^n (A^T)_ik A_kj = Σ_{k=1}^n A_ki A_kj = Σ_{k=1}^n a_k^(i) a_k^(j)
which gives A^T A directly in terms of the a^(i)s. Let Q = A^T A ∈ R^{p×p}. Then M = B Q^{−1} A^T, so that

M_ij = Σ_{k,ℓ=1}^p B_ik (Q^{−1})_kℓ (A^T)_ℓj = Σ_{k,ℓ=1}^p b_i^(k) (Q^{−1})_kℓ a_j^(ℓ),   since (A^T)_ℓj = A_jℓ.

This formula for M, valid for linearly independent input patterns, expresses M more or less in terms of the patterns. The appearance of the inverse, Q^{−1}, somewhat lessens its appeal, however.
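A brief numerical sketch of this formula (mine, with made-up sizes); for linearly independent input patterns, M = B(AᵀA)⁻¹Aᵀ gives perfect recall.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 10, 3, 4                      # p <= n so the a^(i) can be independent

A = rng.normal(size=(n, p))             # columns a^(1), ..., a^(p) (generically independent)
B = rng.normal(size=(m, p))             # columns b^(1), ..., b^(p)

Q = A.T @ A                             # Q = A^T A, invertible for independent columns
M = B @ np.linalg.inv(Q) @ A.T          # M = B Q^{-1} A^T

print(np.allclose(M @ A, B))            # perfect recall: M a^(i) = b^(i)
```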
To discuss the case where the columns of A are not necessarily linearly
independent, we need to consider the notion of generalized inverse.
Definition 1.7. For any given matrix A ∈ R^{m×n}, the matrix X ∈ R^{n×m} is said to be a generalized inverse of A if
(i) AXA = A,
(ii) XAX = X,
(iii) (AX)^T = AX,
(iv) (XA)^T = XA.
Examples 1.8.
If X and Y are both generalized inverses of A, then
X = XAX, by (i),
  = X(AX)^T, by (iii),
  = X X^T A^T
  = X X^T A^T Y^T A^T, by (i)^T,
  = X X^T A^T A Y, by (iii),
  = X A X A Y, by (iii),
  = X A Y, by (ii),
  = X A Y A Y, by (i),
  = X A A^T Y^T Y, by (iv),
  = A^T X^T A^T Y^T Y, by (iv),
  = A^T Y^T Y, by (i)^T,
  = Y A Y, by (iv),
  = Y, by (ii),
as required.
Proposition 1.10. For any A ∈ Rm×n , AA# is the orthogonal projection onto
ran A, the linear span in Rm of the columns of A, i.e., if P = AA# ∈ Rm×m ,
then P = P T = P 2 and P maps Rm onto ran A.
We see that

‖A‖_F² = Tr(A^T A) = Σ_{i=1}^n (A^T A)_ii = Σ_{i=1}^n Σ_{j=1}^m (A^T)_ij A_ji = Σ_{i=1}^n Σ_{j=1}^m A_ji²,

the sum of the squares of all the entries of A.
Proof. We have
as required.
Proof. We have
Hence
‖XA − B‖_F² = ‖(X − BA#)A‖_F² + ‖B(A# A − 1l_p)‖_F²
Now let us return to our problem of finding a memory matrix which stores the input-output pattern pairs (a^(i), b^(i)), 1 ≤ i ≤ p, with each a^(i) ∈ R^n and each b^(i) ∈ R^m. In general, it may not be possible to find a matrix M ∈ R^{m×n} such that M a^(i) = b^(i), for each i. Whatever our choice of M, the system output corresponding to the input a^(i) is just M a^(i). So, failing equality M a^(i) = b^(i), we would at least like to minimize the error b^(i) − M a^(i). A measure of such an error is ‖b^(i) − M a^(i)‖₂², the squared Euclidean norm of the difference. Taking all p patterns into account, the total system recall error is taken to be

Σ_{i=1}^p ‖b^(i) − M a^(i)‖₂².
Let A = (a^(1) · · · a^(p)) ∈ R^{n×p} and B = (b^(1) · · · b^(p)) ∈ R^{m×p} be the matrices whose columns are given by the pattern vectors a^(i) and b^(i), respectively. Then the total system recall error, above, is just

‖B − M A‖_F².

By the identity above, this is minimized by the choice M = BA#. We have seen that AA# is precisely the projection onto the range of A, i.e., onto the subspace of R^n spanned by the prototype patterns. In this case, we say that M is given by the projection rule.
Kohonen calls 1l − AA# the novelty filter and has applied these ideas to image-subtraction problems such as tumor detection in brain scans. Non-null novelty vectors may indicate disorders or anomalies.
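As a rough illustration of the projection rule and the novelty filter (my own sketch; here the generalized inverse A# is computed with NumPy's pinv):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
A = rng.normal(size=(n, p))          # columns: stored prototype patterns

P = A @ np.linalg.pinv(A)            # A A#: orthogonal projection onto ran A
novelty_filter = np.eye(n) - P       # Kohonen's novelty filter 1l - A A#

x_old = A @ rng.normal(size=p)       # lies in the span of the prototypes
x_new = rng.normal(size=n)           # generic vector: has a novel component

print(np.linalg.norm(novelty_filter @ x_old))   # ~0: nothing new
print(np.linalg.norm(novelty_filter @ x_new))   # > 0: non-null novelty vector
```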
Pattern classification
We have discussed the distributed associative memory (DAM) matrix as
an autoassociative or as a heteroassociative memory model. The first is
mathematically just a special case of the second. Another special case is
that of so-called classification. The idea is that one simply wants an input
signal to elicit a response “tag”, typically coded as one of a collection of
orthogonal unit vectors, such as given by the standard basis vectors of Rm .
used to construct the autoassociative memory via the projection rule. When
presented with incomplete or fuzzy versions of the original patterns, the
OLAM matrix satisfactorily reconstructed the correct image.
In another autoassociative recall experiment, twenty one different pro-
totype images were used to construct the OLAM matrix. These were each
composed of three similarly placed copies of a subimage. New pattern im-
ages, consisting of just one part of the usual triple features, were presented
to the OLAM matrix. The output images consisted of slightly fuzzy versions
of the single part but triplicated so as to mimic the subimage positioning
learned from the original twenty one prototypes.
An analysis comparing the performance of the correlation memory ma-
trix with that of the generalized inverse matrix memory has been offered by
Cherkassky, Fassett and Vassilas (IEEE Trans. on Computers, 40, 1429 (1991)).
Their conclusion is that the generalized inverse memory matrix performs
better than the correlation memory matrix for autoassociation, but that
the correlation memory matrix is better for classification. This is contrary
to the widespread belief that the generalized inverse memory matrix is the
superior model.
Chapter 2
Adaptive Linear Combiner
M x(i) = y (i) ,
[Figure: the adaptive linear combiner: the input signal components x_1, . . . , x_n are weighted by m_1, . . . , m_n and fed to an "adder" which forms the output y = Σ_{i=1}^n m_i x_i.]
We have seen that we may not be able to find M which satisfies the
exact input-output relationship M x(i) = y (i) , for each i. The idea is to look
for an M which is in a certain sense optimal. To do this, we seek m1 , . . . , mℓ
0 = ∂E/∂m_i = Σ_{k=1}^ℓ A_ik m_k − b_i,
with m(1) arbitrary and where the parameter α is called the learning rate. If we substitute for grad E, we find

m(n + 1) = m(n) + α ( b − A m(n) ).
U A U^T = D = diag(λ_1, . . . , λ_ℓ)

E = ½ m^T A m − b^T m + ½ c
  = ½ m^T U^T U A U^T U m − b^T U^T U m + ½ c
  = ½ z^T D z − v^T z + ½ c,   where z = U m and v = U b,
  = Σ_{i=1}^ℓ ½ λ_i z_i² − Σ_{i=1}^ℓ v_i z_i + ½ c.
A_jk = (1/p) Σ_{i=1}^p x_j^(i) x_k^(i).

This is an average of x_j^(i) x_k^(i) taken over the patterns. Given a particular pattern x^(i), we can think of x_j^(i) x_k^(i) as an estimate for the average A_jk. Similarly, we can think of b_j = (1/p) Σ_{i=1}^p y^(i) x_j^(i) as an average, and y^(i) x_j^(i) as an estimate for b_j. Accordingly, we change our algorithm for updating the memory matrix to the following.
Select an input-output pattern pair, (x^(i), y^(i)), say, and use the previous algorithm but with A_jk and b_j "estimated" as above. Thus,

m_j(n + 1) = m_j(n) + α ( y^(i) x_j^(i) − Σ_{k=1}^ℓ x_j^(i) x_k^(i) m_k(n) ),

that is,

m_j(n + 1) = m_j(n) + α δ^(i) x_j^(i)

where

δ^(i) = ( y^(i) − Σ_{k=1}^ℓ x_k^(i) m_k(n) ) = (desired output − actual output)

is the output error for pattern pair i. This is known as the delta-rule, or the Widrow-Hoff learning rule, or the least mean square (LMS) algorithm.
• First choose a value for α, the learning rate (in practice, this might be 0.1 or 0.05, say).
• Start with m_j(1) = 0 for all j, or perhaps with small random values.
• Keep selecting input-output pattern pairs (x^(i), y^(i)) and update m(n) by the rule

m_j(n + 1) = m_j(n) + α δ^(i) x_j^(i),   1 ≤ j ≤ ℓ,

where δ^(i) = y^(i) − Σ_{k=1}^ℓ m_k(n) x_k^(i) is the output error for the pattern pair (i) as determined by the memory matrix in operation at iteration step n. Ensure that every pattern pair is regularly presented and continue until the output error has reached and appears to remain at an acceptably small value.
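A minimal sketch (mine) of this procedure with made-up, noiseless data; the learning rate, sweep count and stopping tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
ell, p = 5, 20
X = rng.normal(size=(p, ell))            # input patterns x^(i) (rows)
m_true = rng.normal(size=ell)
y = X @ m_true                           # consistent target outputs y^(i)

alpha = 0.05                             # learning rate
m = np.zeros(ell)                        # start with m_j(1) = 0

for sweep in range(200):                 # keep presenting all pattern pairs
    for i in range(p):
        delta = y[i] - m @ X[i]          # output error for pattern pair i
        m = m + alpha * delta * X[i]     # delta rule / Widrow-Hoff / LMS update
    if np.max(np.abs(y - X @ m)) < 1e-6: # stop once the error stays small
        break

print(np.allclose(m, m_true, atol=1e-4)) # True for this noiseless example
```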
with m(1) arbitrary. This is exactly what we have already arrived at above.
It should be clear from this point of view that there is no reason a priori to
suppose that the algorithm converges. Indeed, one might be more inclined
to suspect that the m-values given by this rule simply “thrash about all over
the place” rather than settling down towards a limiting value.
m_j(n + 1) = m_j(n) + α ( y^(n) x_j^(n) − Σ_{k=1}^ℓ x_j^(n) x_k^(n) m_k(n) )
where x(n) and y (n) are the input-output pattern pair presented at step
n. If we assume that these patterns presented at the various steps are
independent, then, from the algorithm, we see that mk (n) only depends on
the patterns presented before step n and so is independent of x(n) . Taking
expectations we obtain the vector equation
• The patterns (a(i) , b(i) ) are presented one after the other, (a(1) , b(1) ),
(a(2) , b(2) ),. . . , (a(p) , b(p) )—thus constituting a (pattern) cycle. This cycle
is to be repeated.
• Let M (i) (n) denote the memory matrix just before the pattern pair
(a(i) , b(i) ) is presented in the nth cycle. On presentation of (a(i) , b(i) )
to the network, the memory matrix is updated according to the rule
M^(i+1)(n) = M^(i)(n) − α_n ( M^(i)(n) a^(i) − b^(i) ) a^(i)T
Remark 2.4. The gradient of the total error function E is given by the terms

∂E/∂M_jk = Σ_{i=1}^p ∂E^(i)/∂M_jk.
When the ith example is being learned, only the terms ∂E (i) /∂Mjk , for
fixed i, are used to update the memory matrix. So at this step the value of
E (i) will decrease but it could happen that E actually increases. The point
is that the algorithm is not a standard gradient-descent algorithm and so
standard convergence arguments are not applicable. A separate proof of
convergence must be given.
Remark 2.5. When m = 1, the output vectors are just real numbers and we
recover the adaptive linear combiner and the Widrow-Hoff rule as a special
case.
Remark 2.6. The algorithm is “local” in the sense that it only involves in-
formation available at the time of each presentation, i.e., it does not need
to remember any of the previously seen examples.
for each i = 1, . . . , p.
Remark 2.8. In general, the limit matrices M_α^(i) are different for different i.
Example 2.9. Consider the case when there is a single input node, so that the memory matrix M ∈ R^{1×1} is just a real number, m, say.
We shall suppose that the system is to learn the two pattern pairs (1, c_1) and (−1, c_2). Then the total system error function is
giving
We see that lim_{n→∞} m^(1)(n + 1) = β/(1 − λ), provided that |λ| < 1. This condition is equivalent to (1 − α)² < 1, or |1 − α| < 1, which is the same as 0 < α < 2. The limit is

m_α^(1) ≡ β/(1 − λ) = ( (1 − α)α c_1 − α c_2 ) / ( 1 − (1 − α)² ),

which simplifies to

m_α^(1) = ( (1 − α) c_1 − c_2 ) / (2 − α).
Now, for m(2) (n), we have
Hence

m^(1)(n + 1) = (1 − α_n) [ (1 − α_n) m^(1)(n) + α_n c_1 ] − α_n c_2,

giving

m^(1)(n + 1) = (1 − α_n)² m^(1)(n) + (1 − α_n) α_n c_1 − α_n c_2.

We wish to examine the convergence of m^(1)(n) (and m^(2)(n)) to m∗ = (c_1 − c_2)/2. So if we set y_n = m^(1)(n) − (c_1 − c_2)/2, then we would like to show that y_n → 0, as n → ∞. The recursion formula for y_n is

y_{n+1} + (c_1 − c_2)/2 = (1 − α_n)² ( y_n + (c_1 − c_2)/2 ) + (1 − α_n) α_n c_1 − α_n c_2,

which simplifies to

y_{n+1} = (1 − α_n)² y_n − α_n² (c_1 + c_2)/2.
Next, we impose suitable conditions on the learning rates, αn , which will
ensure convergence.
which deals with the second bracketed term in the expression for y_{n+1}. To estimate the first term, we rewrite it as

(r_0′ + r_1′ + · · · + r_m′) β_1 β_2 · · · β_n,

where we have set r_0′ = r_0 and r_j′ = r_j/(β_1 · · · β_j), for j > 0.
We claim that β_1 · · · β_n → 0 as n → ∞. To see this, we use the inequality

log(1 − t) ≤ −t,   for 0 ≤ t < 1,

which can be derived as follows. We have

− log(1 − t) = ∫_{1−t}^1 dx/x ≥ ∫_{1−t}^1 dx = t,   since 1/x ≥ 1 in the range of integration,

which gives log(1 − t) ≤ −t, as required. Using this, we may say that log(1 − α_j) ≤ −α_j, and so Σ_{j=1}^n log(1 − α_j) ≤ −Σ_{j=1}^n α_j. Thus

log(β_1 · · · β_n) = log Π_{j=1}^n (1 − α_j)² = 2 Σ_{j=1}^n log(1 − α_j) ≤ −2 Σ_{j=1}^n α_j.

But Σ_{j=1}^n α_j → ∞ as n → ∞, which means that log(β_1 · · · β_n) → −∞ as n → ∞, which, in turn, implies that β_1 · · · β_n → 0 as n → ∞, as claimed.
Thus, for this special simple example, we have demonstrated the con-
vergence of the LMS algorithm. The statement of the general case is as
follows.
Theorem 2.10 (LMS Convergence Theorem). Suppose that the learning rate α_n in the LMS algorithm satisfies the conditions

Σ_{n=1}^∞ α_n = ∞   and   Σ_{n=1}^∞ α_n² < ∞.
We will not present the proof here, which involves the explicit form of
the generalized inverse, as given via the Singular Value Decomposition. For
the details, we refer to the original paper of Luo.
Some of this signal then leaks across to the recipient’s mouthpiece and is
sent back to the caller. The time taken for this is about half a second, so
that the caller hears an echo of his own voice. By appropriate use of the
ALC in the circuit, this echo effect can be reduced.
Chapter 3
Artificial Neural Networks
Each neuron has only one axon, but it may branch out and so may be
able to reach perhaps thousands of other cells. There are many dendrites
(the word dendron is Greek for tree). The diameter of the soma is of the
order of 10 microns.
The outgoing signal is in the form of a pulse down the axon. On arrival
at a synapse (the junction where the axon meets a dendrite, or indeed, any
other part of another nerve cell) molecules, called neurotransmitters are re-
leased. These cross the synaptic gap (the axon and receiving neuron do not
quite touch) and attach themselves, very selectively, to receptor sites on the
receiving neuron. The membrane of the target neuron is chemically affected
and its own inclination to fire may be either enhanced or decreased. Thus,
the incoming signal can be correspondingly either excitatory or inhibitory.
Various drugs work by exploiting this behaviour. For example, curare de-
posits certain chemicals at particular receptor sites which artificially inhibit
motor (muscular) stimulation by the brain cells. This results in the inability
to move.
[Figure: a model neuron: the inputs x_1, . . . , x_n with weights w_1, . . . , w_n are summed, the threshold θ enters via a fixed input of −1, and the result u is passed through ϕ(·) to give the output y.]
Typically, ϕ(·) is a non-linear function. Commonly used forms for ϕ(·) are
the binary and bipolar threshold functions, the piece-wise linear function
(“hard-limited” linear function), and the so-called sigmoid function. Exam-
ples of these are as follows.
Examples 3.1.
This is just like the binary version, but the “off” output is represented
as −1 rather than as 0. We might call this a bipolar McCulloch-Pitts
neuron, and denote the function ϕ(·) by sign(·).
3. Piece-wise linear ("hard-limited" linear) function

ϕ(v) = 1 for v ≥ ½,   ϕ(v) = v + ½ for −½ < v < ½,   ϕ(v) = 0 for v ≤ −½.

[Figure: the graph of ϕ rises linearly from 0 at v = −½ to 1 at v = ½.]
4. Sigmoid function
A sigmoid function is any differentiable function ϕ(·), say, such that
ϕ(v) → 0 as v → −∞, ϕ(v) → 1 as v → ∞ and ϕ′ (v) > 0.
[Figure: the graph of a sigmoid function, increasing from 0 to 1.]
A standard example is

ϕ : v ↦ ϕ(v) = 1/(1 + exp(−αv)).
The larger the value of the constant parameter α, the greater is the slope.
(The slope is sometimes called the “gain”.) In the limit when α → ∞,
this sigmoid function becomes the binary threshold function (except for
the single value v = 0, for which ϕ(0) is equal to 1/2, for all α). One
could call this the threshold limit of ϕ.
If we want the output to vary between −1 and +1, rather than 0 and 1, we could simply change the definition to demand that ϕ(v) → −1 as v → −∞, thus defining a bipolar sigmoid function. One can easily transform between binary and bipolar sigmoid functions. For example, if ϕ is a binary sigmoid function, then 2ϕ − 1 is a bipolar sigmoid function.
2ϕ(v) − 1 = (1 − exp(−αv))/(1 + exp(−αv)) = tanh(αv/2).
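For reference, a small sketch (mine) of the activation functions just described:

```python
import numpy as np

def binary_threshold(v):
    """McCulloch-Pitts unit: 1 if v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def bipolar_threshold(v):
    """Bipolar version: the 'off' output is -1 rather than 0."""
    return np.where(v >= 0, 1.0, -1.0)

def piecewise_linear(v):
    """Hard-limited linear function: 0 below -1/2, 1 above 1/2, linear between."""
    return np.clip(v + 0.5, 0.0, 1.0)

def sigmoid(v, alpha=1.0):
    """Logistic sigmoid 1/(1 + exp(-alpha v)); larger alpha gives a steeper slope."""
    return 1.0 / (1.0 + np.exp(-alpha * v))

def bipolar_sigmoid(v, alpha=1.0):
    """2*sigmoid - 1, which equals tanh(alpha v / 2)."""
    return 2.0 * sigmoid(v, alpha) - 1.0

v = np.linspace(-3, 3, 7)
print(np.allclose(bipolar_sigmoid(v), np.tanh(v / 2)))   # True
```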
• Each connection can carry a signal, i.e., to each connection we may assign
a real number, called a signal. The signal is thought of as travelling in
the direction of the link.
• Each node has “local memory” in the form of a collection of real numbers,
called weights, each of which is assigned to a corresponding terminating
i.e., incoming connection and represents the synaptic efficacy.
• Some nodes may be specified as input nodes and others as output nodes.
In this way, the neural network can communicate with the external world.
One could consider a node with only outgoing links as an input node
(source) and one with only incoming links as an output node (sink).
[Figure: a general network of connected nodes, with designated input and output nodes.]
Often neural networks are arranged in layers such that the connections
are only between consecutive layers, all in the same direction, and there
being no connections within any given layer. Such neural networks are
called feedforward neural networks.
[Figure: a layered feedforward network: signals pass from the input layer through successive layers to the output layer.]
Chapter 4
The Perceptron
[Figure: the perceptron: associator units A_1, . . . , A_n feed, via weights w_1, . . . , w_n, a threshold decision unit with threshold θ; the output lies in {0, 1}.]
The story goes that after much hype about the promise of neural net-
works in the fifties and early sixties (for example, that artificial brains would
soon be a reality), Minsky and Papert’s work, elucidating the theory and
vividly illustrating the limitations of the perceptron, was the catalyst for
the decline of the subject and certainly (which is almost the same thing) the
withdrawal of U. S. government funding. This led to a lull in research into
neural networks, as such, during the period from the end of the sixties to the
beginning of the eighties, but work did continue under the headings of adap-
The associator units are thought of as “probing” the outside world, which
we shall call the “retina”, for pictorial simplicity, and then transmitting a
binary signal depending on the result. For example, a given associator unit
might probe a group of m × n pixels on the retina and output 1 if they are
all “on”, but otherwise output 0. Having set up a scheme of associators,
we might then ask whether the system can distinguish between vowels and
consonants drawn on the screen (retina).
Example 4.1. Consider a system comprising a retina formed by a 3 × 3 grid of pixels and six associator units A_h1, . . . , A_v3 (see Aleksander, I. and H. Morton, 1991).
[Figure: the associator units A_h1, A_h2, A_h3, A_v1, A_v2, A_v3 feed, via weights h_1, h_2, h_3, v_1, v_2, v_3, a single binary threshold decision unit.]
3-point vertical receptive field (the three columns). Each associator fires if
and only if a majority of its probes are “on”, i.e., each produces an output of
1 if and only if at least 2 of its 3 pixels are black. We wish to assign weights
h1 , . . . , v3 , to the six connections to the binary threshold unit so that the
system can successfully distinguish between the letters T and H so that the
network has an output of 1, say, corresponding to T, and an output of 0
corresponding to H.
Associator outputs:
      A_h1   A_h2   A_h3   A_v1   A_v2   A_v3
  T    1      0      0      0      1      0
  H    1      1      1      1      0      1
For the letter T, the associator output vector is given by (A_h1, . . . , A_v3) = (1, 0, 0, 0, 1, 0), which we see induces the weighted net input h_1 + v_2 = 1 + 1 = 2 to the binary decision unit (taking h_1 = v_2 = 1 and h_2 = h_3 = v_1 = v_3 = −1). Thus the output is 1, which is associated with the letter T.
On the other hand, the retina image for H leads to the associator output vector (A_h1, . . . , A_v3) = (1, 1, 1, 1, 0, 1), which induces the weighted net input h_1 + h_2 + h_3 + v_1 + v_3 = 1 − 1 − 1 − 1 − 1 = −3 to the binary decision unit. The output is 0, which is associated with the letter H.
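A quick check of this example in code (my own sketch; the weights h1 = v2 = 1, h2 = h3 = v1 = v3 = −1 with a threshold of 0 are one choice consistent with the net inputs computed above).

```python
import numpy as np

# Associator outputs (A_h1, A_h2, A_h3, A_v1, A_v2, A_v3) for the two letters
patterns = {"T": np.array([1, 0, 0, 0, 1, 0]),
            "H": np.array([1, 1, 1, 1, 0, 1])}

weights = np.array([1, -1, -1, -1, 1, -1])   # (h1, h2, h3, v1, v2, v3)
theta = 0                                    # any threshold between -3 and 2 works

for letter, a in patterns.items():
    net = weights @ a                        # weighted net input to the decision unit
    output = 1 if net >= theta else 0
    print(letter, net, output)               # T: 2 -> 1,   H: -3 -> 0
```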
[Figure: a pattern of width ≥ d + 1 divided into regions labelled 1, 2, 3, 4.]
The perceptron associator units can be split into three distinct groups:
Σ = ΣA + ΣB + ΣC .
[Figure: the three groups of associator units, A, B and C, each feeding its own partial sum.]
Clearly, the first two inequalities imply that Y ′ > Y which together with the
third gives X ′ + Y ′ > X ′ + Y > α. This contradicts the fourth inequality.
We conclude that there can be no such perceptron.
Two-class classification
We wish to set up a network which will be able to differentiate inputs from
one of two categories. To this end, we shall consider a simple perceptron
comprising many input units and a single binary decision threshold unit.
We shall ignore the associator units as such and just imagine them as pro-
viding inputs to the binary decision unit. (One could imagine the system
comprising as many associator units as pixels and where each associator unit
probes just one pixel to see if it is “on” or not.) However, we shall allow the
input values to be any real numbers, rather than just binary digits.
Suppose then that we have a finite collection of vectors in Rn , each of
which is classified as belonging to one of two categories, S1 , S2 , say. If
the vector (x1 , . . . , xn ) is presented to a linear binary threshold unit with
weights w1 , . . . , wn and threshold θ, then the output, z, say, is
z = step( Σ_{i=1}^n w_i x_i − θ ) = 1 if Σ_{i=1}^n w_i x_i ≥ θ, and 0 otherwise.
For given x ∈ Rn , let y = (x0 , x) = (−1, x) ∈ Rn+1 and let w ∈ Rn+1 be the
vector given by w = (w0 , w1 , . . . , wn ). The two categories of vectors S1 and
S2 in Rn determine two categories, C1 and C2 , say, in Rn+1 via the above
w · y = w T y = w 0 y0 + · · · + w n yn
w′ · y = (w + y) · y = w · y + y · y > w · y
w′ · y = (w − y) · y = w · y − y · y,
C = C1 ∪ (−C2 ) = {y : y ∈ C1 , or − y ∈ C2 }.
Then w · y > 0 for all y ∈ C would give correct classification. The error-correction rule now becomes the following.
• if w · y > 0 for the presented pattern y ∈ C, leave w unchanged;
• otherwise change w to w′ = w + y.
One would like to present the input patterns one after the other, each
time following this error-correction rule. Unfortunately, whilst changes in
the vector w may enhance classification for one particular input pattern,
these changes may spoil the classification of other patterns and this correc-
tion procedure may simply go on forever. The content of the Perceptron
Convergence Theorem is that, in fact, under suitable circumstances, this
endless loop does not occur. After a finite number of steps, further correc-
tions become unnecessary.
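A minimal sketch (mine) of the error-correction rule on randomly generated, linearly separable data; the Perceptron Convergence Theorem below guarantees that the loop stops.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 2, 50
w_true = np.array([0.5, 1.0, -1.0])            # (threshold, w1, w2) defining the two classes

X = rng.uniform(-1, 1, size=(N, n))
Y = np.hstack([-np.ones((N, 1)), X])           # augmented vectors y = (-1, x)
labels = np.sign(Y @ w_true)
C = Y * labels[:, None]                        # C = C1 u (-C2): we want w . y > 0 on all rows

w = np.zeros(n + 1)
changed = True
while changed:                                 # cycle until no corrections are needed
    changed = False
    for y in C:
        if w @ y <= 0:                         # misclassified (or on the boundary)
            w = w + y                          # error-correction: w -> w + y
            changed = True

print(np.all(C @ w > 0))                       # True: all patterns correctly classified
```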
Definition 4.3. We say that two subsets S_1 and S_2 in R^n are linearly separable if and only if there is some (w_1, . . . , w_n) ∈ R^n and θ ∈ R such that Σ_{i=1}^n w_i x_i > θ for each x ∈ S_1 and Σ_{i=1}^n w_i x_i < θ for each x ∈ S_2.
[Figure: two linearly separable classes in the (x_1, x_2)-plane: the points of S_1 (plotted as +) lie on one side of the line w_1 x_1 + w_2 x_2 = θ and those of S_2 (plotted as ∘) on the other.]
Then there is N such that w(N ) · y > 0 for all y ∈ C, that is, after at most
N update steps, the weight vector correctly classifies all the input patterns
and no further weight changes take place.
Proof. The perceptron error-correction rule can be written as

w^(k+1) · ŵ = w^(1) · ŵ + α^(1) y^(1) · ŵ + α^(2) y^(2) · ŵ + · · · + α^(k) y^(k) · ŵ
            ≥ w^(1) · ŵ + (α^(1) + α^(2) + · · · + α^(k)) δ,

and
Theorem 4.7. Let C be any bounded subset of R^{ℓ+1} such that there is some ŵ ∈ R^{ℓ+1} and δ > 0 such that ŵ · y > δ for all y ∈ C. For any given sequence (y^(k)) in C, define the sequence (w^(k)) by

w^(k+1) = w^(k) + α^(k) y^(k),   where α^(k) = 0 if w^(k) · y^(k) > 0, and α^(k) = 1 otherwise,

and w^(1) ∈ R^{ℓ+1} is arbitrary. Then there is M (not depending on the particular sequence (y^(k))) such that N(k) = α^(1) + α^(2) + · · · + α^(k) ≤ M for all k. In other words, α^(k) = 0 for all sufficiently large k and so there can be at most a finite number of non-zero weight changes.
N(k) ≤ b + c √N(k)
Remark 4.8. This means that if S1 and S2 are strictly linearly separable,
bounded subsets of Rℓ , then there is an upper bound to the number of
corrections that the perceptron learning rule is able to make. This does not
mean that the perceptron will necessarily learn to separate S1 and S2 in a
finite number of steps—after all, the update sequence may only “sample”
some of the data a few times, or perhaps not even at all! Indeed, having
chosen w(1) ∈ Rℓ+1 and the sequence y (1) , y (2) , . . . of sample data, it may be
that w(1) · y (j) > 0 for all j, but nonetheless, if y (1) , y (2) , . . . does not exhaust
C, there could be (possibly infinitely-many) points y ∈ C with w(1) · y <
0. Thus, the algorithm, starting with this particular w(1) and based on
the particular sample patterns y (1) , y (2) , . . . will not lead to a successful
classification of S1 and S2 .
[Figure: a single-layer network of m binary threshold units with thresholds θ_1, . . . , θ_m; the inputs x_1, . . . , x_n (together with a fixed input of −1) feed every unit, and the output is a vector in {0, 1}^m.]
• present the first (augmented) pattern vector and observe the output
values;
• now present the next input pattern and repeat, cycling through all
the patterns again and again, as necessary.
x = α1 e1 + · · · + αn en .
Then we have
ℓ(x) = ℓ(α1 e1 + · · · + αn en )
= ℓ(α1 e1 ) + · · · + ℓ(αn en )
= α1 ℓ(e1 ) + · · · + αn ℓ(en ).
Proposition 4.12. For any linearly independent set of vectors {x^(1), . . . , x^(m)} in R^n, the augmented vectors {x̂^(1), . . . , x̂^(m)}, with x̂^(i) = (−1, x^(i)) ∈ R^{n+1}, form a linearly independent collection.
Proof. Suppose that Σ_{j=1}^m α_j x̂^(j) = 0. This is a vector equation and so each of the n + 1 components of the left hand side must vanish, Σ_{j=1}^m α_j x̂_i^(j) = 0, for 0 ≤ i ≤ n. Note that the first component of x̂^(j) has been given the index 0.
In particular, Σ_{j=1}^m α_j x̂_i^(j) = 0, for 1 ≤ i ≤ n, that is, Σ_{j=1}^m α_j x^(j) = 0.
and
[Figure: the winner-takes-all network: n inputs feed m linear units, and a final stage computes the winning unit, i.e., the index j of the unit with the largest activation.]
There are n inputs which feed into m linear units, i.e., each unit simply
records its net weighted input. The network output is the index of the unit
which has maximum such activation. Thus, if w1 , . . . , wm are the weights
to the m linear units and the input pattern is x, then the system output is
the “winning unit”, i.e., that value of j with
w_j · x > w_i · x   for all i ≠ j.
In the case of a tie, one assigns a rule of convenience. For example, one
might choose one of the winners at random, or alternatively, always choose
the smallest index possible, say.
The question is whether or not the algorithm leads to a set of weights which
correctly classify the patterns. Under certain separability conditions, the
answer is yes.
Theorem 4.15 (Multi-class error-correction convergence theorem). Suppose that the m classes S_i are linearly separable in the sense that there exist weight vectors w_1*, . . . , w_m* ∈ R^n such that
Suppose we start with ŵ = w_1^(1) ⊕ · · · ⊕ w_m^(1) and apply the perceptron error correction algorithm with patterns drawn from C:

ŵ^(k+1) = ŵ^(k) + α^(k) x̂^(j)

where α^(k) = 0 if ŵ^(k) · x̂^(j) > 0, and otherwise α^(k) = 1. The correction α^(k) x̂^(j) gives

ŵ^(k+1) = (w_1^(k) + v_1) ⊕ · · · ⊕ (w_m^(k) + v_m)

where
v_i = α^(k) x, assuming x ∈ S_i,
v_j = −α^(k) x,
v_ℓ = 0, all ℓ ≠ i, j.
That is,
w_i^(k+1) = w_i^(k) + α^(k) x
w_j^(k+1) = w_j^(k) − α^(k) x
w_ℓ^(k+1) = w_ℓ^(k),   ℓ ≠ i, ℓ ≠ j,
which is precisely the multi-class error correction rule when unit j is the winning unit. In other words, the weight changes given by the two system algorithms are the same. If one is non-zero, then neither is the other. It follows that every misclassified pattern (from S) presented to the multi-class system induces a misclassified "super-pattern" (in C ⊂ R^{mn}) for the perceptron system, thus triggering a non-zero weight update.
However, the perceptron convergence theorem assures us that the "super-pattern" system can accommodate at most a finite number of non-zero weight changes. It follows that the multi-class algorithm can undergo at most a finite number of weight changes. We deduce that after a finite number of steps, the winner-takes-all system must correctly classify all patterns from S.
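The multi-class rule extracted in this proof can be sketched as follows (my own illustration on synthetic, linearly separable data).

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, N = 2, 3, 60

# Synthetic linearly separable classes: label = argmax_i (w*_i . x)
W_star = rng.normal(size=(m, n))
X = rng.normal(size=(N, n))
labels = np.argmax(X @ W_star.T, axis=1)

W = np.zeros((m, n))                     # weight vectors w_1, ..., w_m (rows)
changed = True
while changed:
    changed = False
    for x, i in zip(X, labels):
        j = int(np.argmax(W @ x))        # winner-takes-all output (ties: smallest index)
        if j != i:
            W[i] += x                    # strengthen the unit of the true class
            W[j] -= x                    # weaken the winning (wrong) unit
            changed = True

print(np.array_equal(np.argmax(X @ W.T, axis=1), labels))   # True
```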
Mapping implementation
[Figure: a separating line for the "and" function on {0, 1}²: the point (1, 1) is separated from (0, 0), (0, 1) and (1, 0); the corresponding simple perceptron has weights w_1 = 2, w_2 = 2 and threshold θ = 3.]
Example 4.17. Next we consider the so-called "or" function, f, on {0, 1}². This is given by

f : (0, 0) ↦ 0,  (0, 1) ↦ 1,  (1, 0) ↦ 1,  (1, 1) ↦ 1.

Thus, f is "on" if and only if at least one of its inputs is also "on", i.e., one or the other, or both. As above, we seek a line separating the point (0, 0) from the rest. A solution is indicated in the figure.
[Figure 4.9: A separating line for the "or" function and the corresponding simple perceptron: weights w_1 = 1, w_2 = 1 and threshold θ = ½ separate (0, 0) from the other three points.]
Here, the function is “on” only if exactly one of the inputs is “on”. (We dis-
cuss the n-parity function later.) It is clear from a diagram that it is impossi-
ble to find a line separating the two classes {(0, 0), (1, 1)} and {(0, 1), (1, 0)}.
We can also see this algebraically as follows. If w1 , w2 and θ were weights
and threshold values implementing the “xor” function, then we would have
0 w1 + 0 w2 < θ
0 w1 + w 2 ≥ θ
w1 + 0 w 2 ≥ θ
w1 + w2 < θ.
Adding the second and third of these inequalities gives w1 + w2 ≥ 2θ, which
is incompatible with the fourth inequality. Thus we come to the conclusion
that the simple perceptron (two input nodes, one output node) is not able
to implement the “xor” function.
However, it is possible to implement the “xor” function using slightly
more complicated networks. Two such examples are shown in the figure.
[Figure: two small networks of threshold units, each using hidden units, implementing the "xor" function.]
Definition 4.19. The n-parity function is the binary map f on the hypercube
{0, 1}n , f : {0, 1}n → {0, 1}, given by f (x1 , . . . , xn ) = 1 if the number of
xk s which are equal to 1 is odd, and f (x1 , . . . , xn ) = 0 otherwise. Thus, f
can be written as

f(x_1, . . . , x_n) = ½ ( 1 − (−1)^{x_1 + x_2 + · · · + x_n} ),
[Figure: a single threshold unit with inputs x_1, . . . , x_n, weights w_1, . . . , w_n and threshold θ; putting x_3 = x_4 = · · · = x_n = 0 reduces it to a unit with the two inputs x_1, x_2.]
Σ_{j=1}^k (−1)^{j+1} = 1 − 1 + 1 − · · · (k terms) = 1 if k is odd, and 0 if k is even.

Evidently, because of the threshold of ½ at the output unit, this will fire if and only if k is odd, as required.
[Figure: a network computing the n-parity function: n hidden threshold units, the j-th having all input weights 1 and threshold j − ½, feed an output threshold unit with threshold ½ via the alternating weights +1, −1, +1, . . . , (−1)^{n+1}.]
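The construction can be checked directly (my own sketch): hidden unit j fires when at least j inputs are on, and the alternating output weights then detect whether that count is odd.

```python
from itertools import product

def parity_network(x):
    """n-parity via n hidden threshold units and one output threshold unit."""
    n = len(x)
    s = sum(x)
    # hidden unit j (j = 1..n): all input weights 1, threshold j - 1/2
    hidden = [1 if s - (j - 0.5) >= 0 else 0 for j in range(1, n + 1)]
    # output unit: weights +1, -1, +1, ..., threshold 1/2
    net = sum(((-1) ** j) * h for j, h in enumerate(hidden))   # +1, -1, +1, ...
    return 1 if net - 0.5 >= 0 else 0

n = 4
ok = all(parity_network(x) == (sum(x) % 2) for x in product([0, 1], repeat=n))
print(ok)   # True
```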
Next we shall show that any binary mapping on the binary hypercube {0, 1}^N can be implemented by a 3-layer feedforward neural network of binary threshold units. The space {0, 1}^N contains 2^N points (each a string of N binary digits). Each of these is mapped into either 0 or 1 under a binary function, and each such assignment defines a binary function. It follows that there are 2^(2^N) binary-valued functions on the binary hypercube {0, 1}^N. We wish to show that any such function can be implemented by a suitable network. First we shall show that it is possible to construct perceptrons which are very selective.
Theorem 4.22 (Grandmother cell). Let z ∈ {0, 1}^n be given. Then there is an n-input perceptron which fires if and only if it is presented with z.
Proof. Write z = (b_1, . . . , b_n). We seek weights w_1, . . . , w_n and a threshold θ such that

w_1 b_1 + · · · + w_n b_n − θ > 0

but

w_1 x_1 + · · · + w_n x_n − θ < 0

for every x ∈ {0, 1}^n with x ≠ z. Define w_i, i = 1, . . . , n, by

w_i = 1 if b_i = 1, and w_i = −1 otherwise.
Now, both sides of the above inequality are integers, so if we set θ = Σ_{i=1}^n b_i − ½, it follows that

Σ_{i=1}^n w_i x_i < θ < Σ_{i=1}^n b_i = Σ_{i=1}^n w_i b_i
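A direct check of the grandmother-cell construction (my own sketch) for a small n:

```python
import numpy as np
from itertools import product

def grandmother_cell(z):
    """Return (weights, threshold) of a unit firing only on input z in {0,1}^n."""
    z = np.asarray(z)
    w = np.where(z == 1, 1.0, -1.0)      # w_i = 1 if b_i = 1, else -1
    theta = z.sum() - 0.5                # theta = sum_i b_i - 1/2
    return w, theta

z = (1, 0, 1, 1, 0)
w, theta = grandmother_cell(z)
fires = {x: int(w @ np.array(x) - theta > 0) for x in product([0, 1], repeat=len(z))}
print(all(fire == (x == z) for x, fire in fires.items()))   # True: fires only on z
```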
[Figure: a network implementing an arbitrary binary function f on {0, 1}^N: 2^N hidden "grandmother cell" threshold units (the j-th with weights w_j1, . . . , w_jN and threshold θ_j, firing only on its own point of the hypercube) feed an output threshold unit with threshold ½; the weight from hidden unit j to the output is f(b_1, . . . , b_N), the value of f at the corresponding point.]
Remark 4.24. Notice that those hidden units labelled by those z with f (z) =
0 are effectively not connected to the output unit (their weights are zero).
We may just as well leave out these units altogether. This leads to the
observation that, in fact, at most 2^(N−1) units are required in the hidden
layer.
Indeed, suppose that {0, 1}N = A ∪ B where f (x) = 1 for x ∈ A and
f (x) = 0 for x ∈ B. If the number of elements in A is not greater than that
in B, we simply throw away all hidden units labelled by members of B, as
suggested above. In this case, we have no more than 2^(N−1) units left in the
hidden layer.
On the other hand, if B has fewer members than A, we wish to throw
away those units labelled by A. However, we first have to slightly modify
the network—otherwise, we will throw the grandmother out with the bath
water.
We make the following modifications. Change θ to −θ, and then change
all weights from the A-labelled hidden units to output to zero, and change all
weights from B-labelled hidden units to the output unit (from the previous
value of zero) to the value −1. Then we see that the system fires only for
inputs x ∈ A. We now throw away all the A-labelled hidden units (they
have zero weights out from them, anyway) to end up with fewer than 2^(N−1)
hidden units, as required.
J = ½ ‖Y w − b‖² = ½ ‖Y w − b‖_F².

J = ½ ‖Y w − b‖² = ½ (Y w − b)^T (Y w − b) = ½ Σ_{j=1}^N (y^(j)T w − b_j)².

Suppose that w^(1), w^(2), . . . , and b^(1), b^(2), . . . are iterations for w and b, respectively. Then a gradient-descent technique for b might be to construct the b^(k)s by the rule

b^(k+1) = b^(k) − α ∂J/∂b = b^(k) + α (Y w^(k) − b^(k)),   k = 1, 2, . . . ,

where the constant α > 0 will be chosen to ensure convergence, and where ∆b^(k) is given as above.
For notational convenience, let e^(k) = Y w^(k) − b^(k). Then ∆b^(k) is the vector with components given by ∆b^(k)_j = e^(k)_j if e^(k)_j > 0, but otherwise ∆b^(k)_j = 0. It follows that e^(k)T ∆b^(k) = ∆b^(k)T ∆b^(k).
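The iteration being described can be sketched as follows (my own reading; I take w^(k) = Y# b^(k), which is consistent with the identity Yᵀe^(k) = 0 used in the proof below, and the data, α and iteration limit are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(7)
S1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))     # class S1
S2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))   # class S2

# Rows y^(j): (-1, x) for x in S1 and -(-1, x) for x in S2, so we want Y w > 0
Y = np.vstack([np.hstack([-np.ones((20, 1)), S1]),
               -np.hstack([-np.ones((20, 1)), S2])])
N = len(Y)

alpha = 0.9                         # 0 < alpha < 2
b = np.ones(N)                      # strictly positive initial b^(1)
Yp = np.linalg.pinv(Y)              # generalized inverse Y#

for k in range(5000):
    w = Yp @ b                      # w^(k) = Y# b^(k)  (assumption: w via the pseudo-inverse)
    e = Y @ w - b                   # error vector e^(k) = Y w^(k) - b^(k)
    if np.all(Y @ w > 0):           # every pattern satisfies y^(j)T w > 0
        break
    b = b + alpha * np.maximum(e, 0.0)   # Delta b^(k): keep only the positive parts of e^(k)

print(np.all(Y @ w > 0))            # expect True: the classes here are linearly separable
```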
Theorem 4.25. Suppose that S1 and S2 are linearly separable classes. Then,
for any k, the inequalities e(k) j ≤ 0 for all j = 1, . . . , N imply that e(k) = 0.
Proof. By hypothesis, there is some ŵ ∈ R^{ℓ+1} such that ŵ · y > 0 for all y ∈ C, i.e., ŵ · y^(j) > 0 for all 1 ≤ j ≤ N. In terms of the matrix Y, this
But Y Y^# = (Y Y^#)^T and so

Y^T Y Y^# = Y^T (Y Y^#)^T = (Y Y^# Y)^T = Y^T.

Hence Y^T e^(k) = 0, and so e^(k)T Y = 0. (We will also use this last equality later.) It follows that e^(k)T Y ŵ = 0, that is,

Σ_{j=1}^N e^(k)_j (Y ŵ)_j = 0.
Remark 4.26. This result means that if, during execution of the algorithm,
we find that e(k) has strictly negative components, then S1 and S2 are not
linearly separable.
Next we turn to a proof of convergence of the algorithm under the as-
sumption that S1 and S2 are linearly separable.
Theorem 4.27. Suppose that S1 and S2 are linearly separable, and let 0 <
α < 2. The algorithm converges to a solution after a finite number of steps.
Proof. By definition,
Hence
Furthermore,
Hence
f^T Y ŵ = lim_{n→∞} e^(k_n)T Y ŵ = 0,

i.e., Σ_{j=1}^N f_j (Y ŵ)_j = 0, where (Y ŵ)_j > 0 for all 1 ≤ j ≤ N. Since f_j ≤ 0, we must have f_j = 0, j = 1, . . . , N, and we conclude that e^(k_n) → 0 as n → ∞. But then the inequality ‖e^(k+1)‖ ≤ ‖e^(k)‖, for all k = 1, 2, . . . , implies that the whole sequence (e^(k))_{k∈N} converges to 0. (For any given ε > 0, there is N_0 such that ‖e^(k_n)‖ < ε whenever n > N_0. Put N_1 = k_{N_0+1}. Then for any k > N_1, we have ‖e^(k)‖ ≤ ‖e^(k_{N_0+1})‖ < ε.)
Let µ = max{b^(1)_j : 1 ≤ j ≤ N} be the maximum value of the components of b^(1). Then µ > 0 by our choice of b^(1). Since e^(k) → 0, as k → ∞, there is k_0 such that ‖e^(k)‖ < ½µ whenever k > k_0. In particular, −½µ < e^(k)_j < ½µ for each 1 ≤ j ≤ N whenever k > k_0. But for any j = 1, . . . , N, b^(k)_j ≥ b^(1)_j, and so
whenever k > k_0. Therefore the vector w^(k_0+1) satisfies y^(j)T w^(k_0+1) > 0 for all 1 ≤ j ≤ N and determines a separating vector for S_1 and S_2, which completes the proof.
Chapter 5
Multilayer Feedforward Networks
[Figure: a small network with n input units, a single hidden unit (threshold θ_3, receiving the input x_1 via the weight γ) and two output units (thresholds θ_1, θ_2) connected to the hidden unit by the weights α and β; the outputs are y_1 and y_2.]
The network has n input units, one unit in the middle layer and two
in the output layer. The various weights and thresholds are as indicated.
We suppose that the activation functions of the neurons in the second and
third layers are given by the sigmoid function ϕ, say. The nodes in the
input layer serve, as usual, merely as placeholders for the distribution of the
input values to the units in the second (hidden) layer. Suppose that the
input pattern x = (x1 , . . . , xn ) is to induce the desired output (d1 , d2 ). Let
y = (y_1, y_2) denote the actual output. Then the sum of squared differences

E = ½ ( (d_1 − y_1)² + (d_2 − y_2)² )

is a measure of the error between the actual and desired outputs. We seek to find weights and thresholds which minimize this. The approach is to use a gradient-descent method, that is, we use the update algorithm

w ↦ w + ∆w,

where we have set ∆_1 = (d_1 − y_1) ϕ′(v_1). In an entirely similar way, we get

∂E/∂β = −(d_2 − y_2) ϕ′(v_2) z ≡ −∆_2 z
Similarly, we find

∂y_2/∂γ = ϕ′(v_2) β ϕ′(v_3) x_1.

Hence

∂E/∂γ = −( d_1 − ϕ(v_1) ) ϕ′(v_1) α ϕ′(v_3) x_1 − ( d_2 − ϕ(v_2) ) ϕ′(v_2) β ϕ′(v_3) x_1
      = −(∆_1 α + ∆_2 β) ϕ′(v_3) x_1
      ≡ −∆_3 x_1,

where ∆_3 = (∆_1 α + ∆_2 β) ϕ′(v_3). The gradient-descent updates are therefore

α → α + λ∆_1 z,
β → β + λ∆_2 z,
θ_i → θ_i − λ∆_i,   i = 1, 2,
γ → γ + λ∆_3 x_1,   and
θ_3 → θ_3 − λ∆_3.
We have only considered the weight γ associated with the input to the second layer, but the general case is clear. (We could denote these n weights by γ_i, i = 1, . . . , n, in which case the adjustment to γ_i is +λ∆_3 x_i.) Notice that the weight changes are proportional to the signal strength along the corresponding connection.
Now we shall turn to the more general case of a three layer feedforward
neural network with n input nodes, M neurons in the middle (hidden) layer
and m in the third (output) layer. A similar analysis can be carried out on
a general multilayer feedforward neural network, but we will just consider
one hidden layer.
[Figure: a three-layer feedforward network: n input nodes, a hidden layer of M neurons and an output layer of m neurons producing y_1, . . . , y_m, with weights w on the connections.]
The output from the neuron j in the hidden layer is ϕ(v_j^h) ≡ z_j, say. The activation potential of the output neuron ℓ is v_ℓ^out = Σ_{k=0}^M w_ℓk z_k, where z_0 = −1 and w_ℓ0 = θ_ℓ, the threshold for the unit.
We consider the error function E = ½ Σ_{r=1}^m (d_r − y_r)².
The strategy is to try to minimize E using a gradient-descent algorithm based on the weight variables: w ↦ w − λ grad E, where the gradient is with respect to all the weights (a total of nM + M + Mm + m variables). We wish to calculate the partial derivatives ∂E/∂w_ji and ∂E/∂w_ℓk. We find

∂E/∂w_ℓk = −Σ_{r=1}^m (d_r − y_r) ∂y_r/∂w_ℓk
         = −(d_ℓ − y_ℓ) ∂ϕ(v_ℓ^out)/∂w_ℓk
         = −(d_ℓ − y_ℓ) ϕ′(v_ℓ^out) ∂v_ℓ^out/∂w_ℓk
         = −(d_ℓ − y_ℓ) ϕ′(v_ℓ^out) z_k
         = −∆_ℓ z_k,

where ∆_ℓ = (d_ℓ − y_ℓ) ϕ′(v_ℓ^out). Next, we have

∂E/∂w_ji = −Σ_{r=1}^m (d_r − y_r) ∂y_r/∂w_ji.

But

∂y_r/∂w_ji = ∂ϕ(v_r^out)/∂w_ji = ϕ′(v_r^out) ∂v_r^out/∂w_ji
           = ϕ′(v_r^out) ∂/∂w_ji ( Σ_{k=0}^M w_rk z_k )
           = ϕ′(v_r^out) w_rj ∂z_j/∂w_ji,   since only z_j depends on w_ji,
           = ϕ′(v_r^out) w_rj ϕ′(v_j^h) ∂v_j^h/∂w_ji,   since z_j = ϕ(v_j^h),
           = ϕ′(v_r^out) w_rj ϕ′(v_j^h) x_i.

It follows that

∂E/∂w_ji = −Σ_{r=1}^m (d_r − y_r) ϕ′(v_r^out) w_rj ϕ′(v_j^h) x_i
         = −( Σ_{r=1}^m ∆_r w_rj ) ϕ′(v_j^h) x_i
         = −∆_j x_i,

where ∆_j = ( Σ_{r=1}^m ∆_r w_rj ) ϕ′(v_j^h). As we have remarked earlier, the ∆s are calculated backwards through the network using the current weights. Thus we may formulate the algorithm as follows.
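A minimal sketch (mine) of the resulting procedure for a single hidden layer; thresholds are folded in as weights from a fixed −1 input, and the sizes, data and learning rate are arbitrary choices.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def total_error(Wh, Wo, X, D):
    E = 0.0
    for x, d in zip(X, D):
        z = np.concatenate(([-1.0], sigmoid(Wh @ np.concatenate(([-1.0], x)))))
        y = sigmoid(Wo @ z)
        E += 0.5 * np.sum((d - y) ** 2)
    return E

rng = np.random.default_rng(8)
n, M, m = 2, 4, 1                         # input, hidden and output layer sizes
lam = 0.5                                 # learning rate lambda

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training inputs
D = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs ("xor")

Wh = rng.uniform(-0.5, 0.5, size=(M, n + 1))   # hidden weights; column 0 is the threshold
Wo = rng.uniform(-0.5, 0.5, size=(m, M + 1))   # output weights; column 0 is the threshold

print(total_error(Wh, Wo, X, D))               # error before training

for epoch in range(5000):
    for x, d in zip(X, D):
        xa = np.concatenate(([-1.0], x))                      # augmented input (-1 for threshold)
        z = np.concatenate(([-1.0], sigmoid(Wh @ xa)))        # hidden outputs z_0 = -1, z_j
        y = sigmoid(Wo @ z)                                   # network outputs

        delta_out = (d - y) * y * (1.0 - y)                   # Delta_l = (d_l - y_l) phi'(v_l^out)
        delta_hid = (Wo[:, 1:].T @ delta_out) * z[1:] * (1.0 - z[1:])  # Delta_j, back-propagated

        Wo += lam * np.outer(delta_out, z)                    # w_lk -> w_lk + lambda Delta_l z_k
        Wh += lam * np.outer(delta_hid, xa)                   # w_ji -> w_ji + lambda Delta_j x_i

print(total_error(Wh, Wo, X, D))               # error after training (typically much smaller)
```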
It must be noted at the outset that there are no known general convergence
theorems for the back-propagation algorithm (unlike the LMS algorithm dis-
cussed earlier). This is a game of trial and error. Nonetheless, the method
has been widely applied and it could be argued that its study in the mid-
eighties led to the recent resurgence of interest (and funding) of artificial
neural networks. Some of the following remarks are various cook-book com-
ments based on experiment with the algorithm.
Remarks 5.1.
4. There may well be problems with local minima. The sequence of chang-
ing weights could converge towards a local minimum rather than a global
one. However, one might hope that such a local minimum is acceptably
close to the global minimum so that it would not really matter, in prac-
tice, should the system get stuck in one.
5. If the weights are large in magnitude then the sigmoid ϕ(v) is near
saturation (close to its maximum or minimum) so that its derivative will
be small. Therefore changes to the weights according to the algorithm
will also be correspondingly small and so learning will be slow. It is
therefore a good idea to start with weights randomized in a small band
around the midpoint of the sigmoid (where its slope is near a maximum).
For the usual sigmoid ϕ(v) = 1/(1 + exp(−v)), this is at the value v = 0.
[Figure: training data (plotted as + and ∘) with a smooth "desired curve" and a contorted "overtrained" curve that follows the training data too closely.]
ϕ′(v) = e^{−v}/(1 + e^{−v})² = ϕ(v) e^{−v}/(1 + e^{−v}) = ϕ(v) ( 1 − 1/(1 + e^{−v}) ) = ϕ(v) (1 − ϕ(v)).
10. We could use a different error function if we wish. All that is required
by the logic of the procedure is that it have a (global) minimum when
the actual output and desired output are equal. The only effect on the
calculations would be to replace the terms −2(dr − yr ) by the more
general expression ∂E/∂yr . This will change the formula for the “output
unit ∆ s”, ∆ℓ , but once this has been done the formulae for the remaining
weight updates are unchanged.
∆w_ji(t + 1) = −λ ∂E(t)/∂w_ji + α ∆w_ji(t),
where ∆wji (t) is the previous weight increment and α is called the momen-
tum parameter (chosen so that 0 ≤ α ≤ 1, but often taken to be 0.9). To
see how such a term may prove to be beneficial, suppose that the weights
∆w_ji(t) = −λ ∂E/∂w_ji + α ∆w_ji(t − 1)
         = −λ ∂E/∂w_ji + α ( −λ ∂E/∂w_ji + α ∆w_ji(t − 2) )
         = −λ (∂E/∂w_ji) ( 1 + α + α² + · · · + α^{t−2} ) + α^{t−1} ∆w_ji(1).

We see that

∆w_ji(t) → −( λ/(1 − α) ) ∂E/∂w_ji,
∆w_ji(t + 2) = −λ ∂E(t + 1)/∂w_ji + α ∆w_ji(t + 1)
             = −λ ∂E(t + 1)/∂w_ji − αλ ∂E(t)/∂w_ji + α² ∆w_ji(t)
             = −λ ( ∂E(t + 1)/∂w_ji + α ∂E(t)/∂w_ji ) + α² ∆w_ji(t).
We get

∆w_ji(t + 1) = −λ ∂E_t/∂w_ji
             = −λ ∂E^(t)/∂w_ji − λ ∂/∂w_ji ( E^(1)(w(t)) + · · · + E^(t−1)(w(t)) ).

The second term on the right hand side above is like ∂E_{t−1}/∂w_ji = ∆w_ji(t), a momentum contribution, except that it is evaluated at step t not t − 1.
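As a sketch (mine) of how a momentum term enters a generic weight update:

```python
import numpy as np

def momentum_step(w, grad, prev_dw, lam=0.1, alpha=0.9):
    """One update with momentum: dw(t+1) = -lam * dE/dw + alpha * dw(t)."""
    dw = -lam * grad + alpha * prev_dw
    return w + dw, dw

# toy quadratic error E(w) = 0.5 * w . w, so grad E = w
w = np.array([2.0, -1.5])
dw = np.zeros_like(w)
for t in range(300):
    w, dw = momentum_step(w, grad=w, prev_dw=dw)
print(np.round(w, 4))   # converges towards the minimum at 0
```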
Function approximation
We have thought of a multilayer feedforward neural network as a network
which learns input-output pattern pairs. Suppose such a network has n units
in the input layer and m in the output layer. Then any given function f ,
say, of n variables and with values in Rm determines input-output pattern
pairs by the obvious pairing (x, f (x)). One can therefore consider trying to
train a network to learn a given function and so it is of interest to know if
and in what sense this can be achieved.
It turns out that there is a theorem of Kolmogorov, extended by Sprecher,
on the representation of continuous functions of many variables in terms of
linear combinations of a continuous function of linear combinations of func-
tions of one variable.
Theorem 5.2 (Kolmogorov). For any continuous function f : [0, 1]^n → R (on the n-dimensional unit cube), there are continuous functions h_1, . . . , h_{2n+1} on R and continuous monotonic increasing functions g_ij, for 1 ≤ i ≤ n and 1 ≤ j ≤ 2n + 1, such that

f(x_1, . . . , x_n) = Σ_{j=1}^{2n+1} h_j ( Σ_{i=1}^n g_ij(x_i) ).
[Figure: a two-input network realizing a Kolmogorov-type representation of f(x_1, x_2): units computing functions g_1, . . . , g_5 of the inputs feed, via weights λ_1, λ_2, units computing h, and the outputs of the h-units are summed to give f(x_1, x_2).]
However, the precise form of the various activation functions and the
values of the weights are unknown. That is to say, the theorem provides an
(important) existence statement, rather than an explicit construction.
If we are prepared to relax the requirement that the representation be
exact, then one can be more explicit as to the nature of the activation
functions. That is, there are results to the effect that any f as above can
be approximated, in various senses, to within any preassigned degree of
accuracy, by a suitable feedforward neural network with certain specified
activation functions. However, one usually does not know how many neurons
are required. Usually, the more accurate the approximation is required to
be, so will the number of units needed increase. We shall consider such
function approximation results (E. K. Blum and L. L. Li, Neural Networks,
4 (1991), 511–515, see also K. Hornik, M. Stinchcombe and H. White, Neural
Networks, 2 (1989), 359–366). The following example will illustrate the
ideas.
[Figure: a one-input network: threshold units with thresholds b_0, b_1, . . . , b_n feed a summation unit via the weights f_0, f_1 − f_0, . . . , f_n − f_{n−1}, producing a step-function approximation to f(x).]
Remark 5.6. To say that A is an algebra simply means that if f and g belong
to A, then so do αf + g and f g, for any α ∈ R. A separates points of K
means that given any points x, x′ ∈ K, with x 6= x′ , then there is some
f ∈ A such that f (x) 6= f (x′ ). In other words, A is sufficiently rich to be
able to distinguish different points in K.
We emphasize that this is not Fourier analysis—the amn s are not any
kind of Fourier coefficients.
Theorem 5.8. Let f : [0, π]2 → R be continuous. For any given ε > 0, there
is a 3-layer feedforward neural network with McCulloch-Pitts neurons in the
hidden layer and a linear output unit which implements the function f on
the square [0, π]2 to within ε.
and also note that |mx ± ny| ≤ 2N π for any (x, y) ∈ [0, π]2 .
We can approximate ½ cos t on [−2Nπ, 2Nπ] by a simple function, γ(t), say, with

| ½ cos t − γ(t) | < ε / ( 4(N + 1)² K )
[Figure: a network implementing the approximation: the inputs x and y are combined with weights ±m, ±n and thresholds θ_j in a layer of threshold units, whose outputs, weighted by the coefficients a_mn w_j, are summed by a linear output unit.]
Remark 5.9. The particular square [0, π]2 is not crucial—one can obtain a
similar result, in fact, on any bounded region of R2 by rescaling the variables.
Also, there is an analogous result in any number of dimensions (products of
cosines in the several variables can always be rewritten as cosines of sums).
The theorem gives an approximation in the uniform sense. There are various
other results which, for example, approximate f in the mean square sense.
It is also worth noting that with a little more work, one can see that sigmoid units could be used instead of the threshold units. One would take γ above to be a sum of "smoothed" step functions, rather than a sum of step functions.
A problem with the above approach is with the difficulty in actually
finding the values of N and the amn s in any particular concrete situation. By
admitting an extra layer of neurons, a somewhat more direct approximation
can be made. We shall illustrate the ideas in two dimensions.
Proposition 5.10. Let f : R2 → R be continuous. Then f is uniformly
continuous on any square S ⊂ R2 .
Proof. Suppose that S is the square S = [−R, R] × [−R, R] and let ε > 0
be given. We must show that there is some δ > 0 such that |x − x′ | < δ and
x, x′ ∈ S imply that
|f (x) − f (x′ )| < ε.
Suppose that this is not true. Then, no matter what δ > 0 is, there will be
points x, x′ ∈ S such that |x − x′ | < δ but |f (x) − f (x′ )| ≥ ε. In particular,
if, for any n ∈ N, we take δ = 1/n, then there will be points xn and x′n in S
with |xn − x′n | < 1/n but |f (xn ) − f (x′n )| ≥ ε.
Now, (xn ) is a sequence in the compact set S and so has a convergent
subsequence, xnk → x as k → ∞, with x ∈ S. But then
|x′_{n_k} − x| ≤ |x′_{n_k} − x_{n_k}| + |x_{n_k} − x| < 1/n_k + |x_{n_k} − x| → 0
as k → ∞, i.e., x′nk → x, as k → ∞.
By the continuity of f , it follows that
|f (xnk ) − f (x′nk )| → |f (x) − f (x)| = 0
which contradicts the inequality |f (xnk ) − f (x′nk )| ≥ ε and the proof is
complete.
Theorem 5.11. Let f : R2 → R be continuous. Then for any R > 0 and any
ε > 0 there is g ∈ S such that
[Figure: the square divided into small rectangles B_1, B_2, . . . , B_n.]
Choose n so large that each Bm has diagonal smaller than δ and let xm
be the centre of Bm . Then |x − xm | < δ for any x ∈ Bm .
g(x) = Σ_{m=1}^n g_m χ_{B_m}(x).
H(s) = 1 if s ≥ 0 and 0 if s < 0,   and   H_0(s) = 1 if s > 0 and 0 if s ≤ 0.
Thus, H(·) is just the function step(·) used already. H_0 differs from H only in its value at 0—it is 0 there. The purpose of introducing these different versions of the step-function will soon become apparent. Indeed, H_0(s − a) = χ_{(a,∞)}(s) and H(−s + b) = χ_{(−∞,b]}(s), so that

χ_{(a,b]}(s) = H_0(s − a) H(−s + b).

This means that we can implement the function χ_{(a,b]} by the network shown.
[Figure: a network implementing χ_{(a,b]}: the input x feeds an H_0-unit with threshold a (weight 1) and an H-unit with threshold −b (weight −1); both outputs feed, with weight 1 each, an H-unit with threshold 3/2, which fires exactly when a < x ≤ b.]
Thus

χ_{B_j}(x_1, x_2) = χ_{(a_1^j, b_1^j]}(x_1) χ_{(a_2^j, b_2^j]}(x_2)
[Figure: a network implementing χ_{B_j}(x_1, x_2): each of x_1 and x_2 feeds an H_0-unit (thresholds a_1^j, a_2^j) and an H-unit (thresholds −b_1^j, −b_2^j, with input weight −1); the four outputs feed, with weight 1 each, an H-unit with threshold 7/2, which fires exactly when (x_1, x_2) ∈ B_j.]
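A direct check of this construction (my own sketch):

```python
H  = lambda s: 1.0 if s >= 0 else 0.0     # H(s)
H0 = lambda s: 1.0 if s > 0 else 0.0      # H0(s)

def chi_interval(x, a, b):
    """chi_(a,b](x) realized as a small threshold network."""
    u1 = H0(x - a)            # fires iff x > a
    u2 = H(-x + b)            # fires iff x <= b
    return H(u1 + u2 - 1.5)   # AND: threshold 3/2

a, b = 0.0, 1.0
for x in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(x, chi_interval(x, a, b))       # 1 exactly when 0 < x <= 1
```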
Proof. We have done all the preparation. Now we just piece together the
various parts. Let g, B1 , . . . , Bn etc. be an approximation to f (to within ε)
as given above. We shall implement g using the stated network. Each χBj
is implemented by a 3-layer network, as above. The outputs of these are
weighted by the corresponding gj and then fed into a linear (summation)
unit.
[Figure: the complete network. The inputs x_1 and x_2 feed the layer of 4n threshold units (for each j, an H0 unit with threshold a^j_1 and an H unit with threshold −b^j_1 for x_1, and likewise a^j_2 and −b^j_2 for x_2); these feed n threshold units with θ = 7/2, one computing each χ_{B_j}; the outputs, weighted by g_1, . . . , g_n, are summed by a linear output unit whose value approximates f(x_1, x_2).]
The first layer consists of place holders for the input, as usual. The
second layer consists of 4n threshold units (4 for each rectangle Bj , j =
1, . . . , n). The third layer consists of n threshold units, required to complete
the implementation of the n various χBj s. The final output layer consists of
a single linear unit.
For any given input (x_1, x_2) from the rectangle (−R, R) × (−R, R), precisely one of the units in the third layer will fire—this will be the j-th, corresponding to the unique j with (x_1, x_2) ∈ B_j. The system output is therefore equal to Σ_{i=1}^{n} g_i χ_{B_i}(x_1, x_2) = g_j = g(x_1, x_2), as required.
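The whole construction can be simulated directly. The following Python sketch is our own illustration: the function f, the value of R and the fixed number of boxes are arbitrary choices, and we simply take a fine grid of boxes rather than deriving the box size from a modulus of continuity.

# Sketch: the 4-layer network of the proof, for an illustrative f and R.
# The boxes B_j are products of half-open intervals (a,b] covering (-R, R]^2.
import numpy as np

H  = lambda s: 1.0 if s >= 0 else 0.0
H0 = lambda s: 1.0 if s >  0 else 0.0

def chi_box(x1, x2, a1, b1, a2, b2):
    units = [H0(x1 - a1), H(-x1 + b1), H0(x2 - a2), H(-x2 + b2)]
    return H(sum(units) - 3.5)

f = lambda x1, x2: np.sin(x1) * np.cos(2 * x2)   # illustrative continuous f
R, k = 2.0, 30                                   # k x k boxes (chosen by hand)
edges = np.linspace(-R, R, k + 1)

boxes, g_vals = [], []
for i in range(k):
    for j in range(k):
        a1, b1, a2, b2 = edges[i], edges[i + 1], edges[j], edges[j + 1]
        boxes.append((a1, b1, a2, b2))
        g_vals.append(f((a1 + b1) / 2, (a2 + b2) / 2))   # value at the box centre

def g(x1, x2):   # the network output: a linear unit summing g_j * chi_Bj
    return sum(gj * chi_box(x1, x2, *B) for gj, B in zip(g_vals, boxes))

rng = np.random.default_rng(0)
pts = rng.uniform(-R + 1e-6, R, size=(100, 2))
print("max |f - g| on sample:", max(abs(f(x1, x2) - g(x1, x2)) for x1, x2 in pts))

The maximum deviation observed is governed by how much f can vary across a single box, exactly as in the uniform-continuity argument above.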
Chapter 6
Radial Basis Functions
x^{(i)} ⇝ y^{(i)},   i = 1, . . . , p.

h(x^{(i)}) = y^{(i)},   i = 1, . . . , p,
but which should also “predict” values when applied to new but “similar”
input data. This means that we are not interested in finding a mapping h
which works precisely for just these x(1) , . . . , x(p) . We would like a mapping
h which will “generalize”.
We will look at functions of the form φ(‖x − x^{(i)}‖), i.e., functions of the distance between x and the prototype input x^{(i)}—so-called basis functions. Thus, we try

h(x) = Σ_{i=1}^{p} w_i φ(‖x − x^{(i)}‖)

and require y^{(j)} = h(x^{(j)}) = Σ_{i=1}^{p} w_i φ(‖x^{(j)} − x^{(i)}‖). If we set A = (A_{ji}) = (φ(‖x^{(j)} − x^{(i)}‖)) ∈ R^{p×p}, then our requirement is that

y^{(j)} = Σ_{i=1}^{p} A_{ji} w_i = (Aw)_j,
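In other words, the weights solve the p × p linear system Aw = y, provided A is invertible (which holds, for example, for the Gaussian choice of φ with distinct prototypes). A minimal Python sketch of this interpolation step follows; the Gaussian φ, the value of σ and the data are our own illustrative choices—the notes only require some radial function φ(‖x − x^{(i)}‖).

# Sketch: exact RBF interpolation on p prototype pairs (x^(i), y^(i)).
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 3
X = rng.normal(size=(p, n))          # prototype inputs x^(1), ..., x^(p)
y = rng.normal(size=p)               # target outputs y^(1), ..., y^(p)

sigma = 1.0
phi = lambda r: np.exp(-r**2 / (2 * sigma**2))   # a Gaussian basis function

# A_ji = phi(||x^(j) - x^(i)||)
A = phi(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
w = np.linalg.solve(A, y)            # weights w_1, ..., w_p

h = lambda x: phi(np.linalg.norm(x - X, axis=1)) @ w
print(max(abs(h(X[j]) - y[j]) for j in range(p)))   # ~ 0: perfect recall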
h(x^{(j)}) = Σ_{i=1}^{p} w_i φ(‖x^{(j)} − x^{(i)}‖)
           = w_j φ(‖x^{(j)} − x^{(j)}‖) + Σ_{i≠j} w_i φ(‖x^{(j)} − x^{(i)}‖)   (the first factor is φ(0) = 1)
           = w_j + ε
where ε is small provided that the prototypes x(j) are reasonably well-
separated. The various basis functions therefore “pick out” input patterns
in the vicinity of specified spatial locations.
To construct a mapping h from Rn to Rm , we simply construct suitable
combinations of basis functions for each of the m components of the vector-
valued map h = (h1 , . . . , hm ). Thus,
h_k(x) = Σ_{i=1}^{p} w_{ki} φ(‖x − x^{(i)}‖).

This leads to

y_k^{(j)} = h_k(x^{(j)}) = Σ_{i=1}^{p} w_{ki} φ(‖x^{(j)} − x^{(i)}‖)   (and φ(‖x^{(j)} − x^{(i)}‖) = A_{ji} = A_{ij})
          = Σ_{i=1}^{p} w_{ki} A_{ij}
          = (W A)_{kj}
• The basis functions are allowed to have different widths (σ)—these may
also be determined by the training data.
x ↦ y_k(x) = Σ_{j=1}^{M} w_{kj} φ_j(x) + w_{k0},   k = 1, 2, . . . , m,
[Figure: the radial basis function network. The inputs x_1, . . . , x_n feed basis-function units φ_1, . . . , φ_M, together with a bias unit φ_0 = 1; their outputs, weighted by the w_{kj}, are summed by linear output units giving y_1, . . . , y_m.]
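A forward pass through such a network takes only a few lines. The sketch below assumes Gaussian basis functions φ_j(x) = exp(−‖x − μ_j‖² / 2σ_j²) with centres μ_j and widths σ_j; the shapes and parameter values are arbitrary illustrative choices.

# Sketch of the RBF network above: n inputs, M basis functions, m outputs.
# Gaussian basis functions with centres mu_j and widths sigma_j are assumed.
import numpy as np

rng = np.random.default_rng(1)
n, M, m = 4, 6, 2
mu = rng.normal(size=(M, n))         # centres mu_1, ..., mu_M
sigma = np.full(M, 0.8)              # widths sigma_1, ..., sigma_M
W = rng.normal(size=(m, M))          # output weights w_kj
w0 = rng.normal(size=m)              # biases w_k0 (the phi_0 = 1 unit)

def phi(x):                          # the basis-function layer
    r2 = np.sum((x - mu)**2, axis=1)
    return np.exp(-r2 / (2 * sigma**2))

def network(x):                      # y_k(x) = sum_j w_kj phi_j(x) + w_k0
    return W @ phi(x) + w0

print(network(rng.normal(size=n)))   # a vector in R^m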
Training
To train such a network, the layers are treated quite differently. The second
layer is subjected to unsupervised training—the input data being used to
assign values to µj and σj . These are then held fixed and the weights to the
output layer are found in the second phase of training.
Once a suitable set of values for the µj s and σj s has been found, we seek
suitable wkj s. These are given by
y_k(x) = Σ_{j=0}^{M} w_{kj} φ_j(x)   (with φ_0 ≡ 1)
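Since the model is linear in the w_{kj} once the φ_j are fixed, this second phase is an ordinary linear least-squares problem. A minimal sketch follows, with made-up training data, Gaussian φ_j with a common width, and numpy's lstsq as the solver—all our own choices, since the notes do not prescribe a particular solver here.

# Sketch: phase two of training.  The centres mu_j and widths sigma_j are
# held fixed; the output weights w_kj (and biases w_k0) are found by
# least squares on the training pairs (x^(i), y^(i)).
import numpy as np

rng = np.random.default_rng(2)
N, n, M, m = 100, 3, 8, 2
X = rng.normal(size=(N, n))              # training inputs
Y = rng.normal(size=(N, m))              # training targets
mu = X[rng.choice(N, M, replace=False)]  # fixed centres (e.g. from clustering)
sigma = 1.0                              # fixed common width, for simplicity

def design(X):                           # Phi_ij = phi_j(x^(i)), with phi_0 = 1
    r2 = np.sum((X[:, None, :] - mu[None, :, :])**2, axis=2)
    Phi = np.exp(-r2 / (2 * sigma**2))
    return np.hstack([np.ones((len(X), 1)), Phi])   # N x (M+1)

Phi = design(X)
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)          # (M+1) x m weight matrix
print("training RMS error:", np.sqrt(np.mean((Phi @ W - Y)**2)))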
• Take the first k data points as k cluster centres (giving clusters each with
one member).
• Assign each of the remaining N − k data points one by one to the cluster
with the nearest centroid. After each assignment, recompute the centroid
of the gaining cluster.
• Select each data point in turn and compute the distances to all cluster
centroids. If the nearest centroid is not that particular data point’s
parent cluster, then reassign the data point (to the cluster with the
nearest centroid) and recompute the centroids of the losing and gaining
clusters.
• Repeat the above step until convergence—that is, until a full cycle through all data points in the training set fails to trigger any further cluster membership reallocations. (A sketch of this procedure in code is given below.)
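A direct transcription of these steps might look as follows; the data and the value of k are arbitrary, and ties and the possibility of emptying a cluster are handled in the simplest way.

# Sketch of the clustering procedure described above (a sequential variant
# of k-means).  The data and k are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(60, 2))
k = 4

centres = [data[i].copy() for i in range(k)]      # first k points as centres
members = [[i] for i in range(k)]                 # cluster membership lists

def nearest(x):
    return int(np.argmin([np.linalg.norm(x - c) for c in centres]))

def recompute(c):
    centres[c] = data[members[c]].mean(axis=0)

# assign the remaining points one by one, updating the gaining centroid
for i in range(k, len(data)):
    c = nearest(data[i])
    members[c].append(i)
    recompute(c)

# reassignment passes, until a full cycle produces no change
changed = True
while changed:
    changed = False
    for i in range(len(data)):
        old = next(c for c in range(k) if i in members[c])
        new = nearest(data[i])
        if new != old and len(members[old]) > 1:   # keep clusters non-empty (a simplification)
            members[old].remove(i)
            members[new].append(i)
            recompute(old)
            recompute(new)
            changed = True

print("cluster sizes:", [len(m) for m in members])

The resulting centres (and, for example, the spread of each cluster) can then be used to set the μ_j and σ_j of the basis functions.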
for suitable c, λ and ν. It follows that the linear span S of 1 and Gaussian
functions of the form ϕ(x, µ, σ) is, in fact, an algebra. That is, if S denotes
the collection of functions of the form
w0 + w1 ϕ(x, µ1 , σ1 ) + · · · + wj ϕ(x, µj , σj )
[Figure: a network with inputs x_1, . . . , x_n, units φ_0 = 1, φ_1, . . . , φ_N, and a single linear output unit with weights w_0, w_1, . . . , w_N.]
Chapter 7
Recurrent Neural Networks
E(x) = −½ xᵀ W x + xᵀ θ
     = −½ Σ_{i,j=1}^{n} x_i w_{ij} x_j + Σ_{i=1}^{n} x_i θ_i
Theorem 7.1. Suppose that the synaptic weight matrix (wij ) is symmetric
with non-negative diagonal terms. Then the sequential mode of operation of
the recurrent neural network has no cycles.
Proof. Consider a single update x_i ↦ x′_i, with all the other x_j remaining unchanged. Then we calculate the energy difference

E(x′) − E(x) = −½ Σ_{j≠i} (x′_i w_{ij} x_j + x_j w_{ji} x′_i) − ½ x′_i w_{ii} x′_i + x′_i θ_i
               + ½ Σ_{j≠i} (x_i w_{ij} x_j + x_j w_{ji} x_i) + ½ x_i w_{ii} x_i − x_i θ_i

             = −Σ_{j≠i} x′_i w_{ij} x_j − ½ x′_i w_{ii} x′_i + x′_i θ_i
               + Σ_{j≠i} x_i w_{ij} x_j + ½ x_i w_{ii} x_i − x_i θ_i

             = −Σ_{j=1}^{n} x′_i w_{ij} x_j + x′_i w_{ii} x_i − ½ x′_i w_{ii} x′_i + x′_i θ_i
               + Σ_{j=1}^{n} x_i w_{ij} x_j − x_i w_{ii} x_i + ½ x_i w_{ii} x_i − x_i θ_i

             = (x_i − x′_i) Σ_{j=1}^{n} w_{ij} x_j + ½ w_{ii} (2 x′_i x_i − x′_i x′_i − x_i x_i) + (x′_i − x_i) θ_i

             = −(x′_i − x_i) Σ_{j=1}^{n} w_{ij} x_j − ½ w_{ii} (x′_i − x_i)² + (x′_i − x_i) θ_i

             = −½ w_{ii} (x′_i − x_i)² − (x′_i − x_i) ( Σ_{j=1}^{n} w_{ij} x_j − θ_i ).
Proof. The state space is finite (and there are only a finite number of neurons
to be updated), so if there are no cycles the iteration process must simply
stop, i.e., the system reaches a fixed point.
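These two facts—the energy never increases under a sequential update, and the dynamics therefore halts at a fixed point—are easy to observe numerically. A minimal sketch, with random symmetric weights having non-negative diagonal and random thresholds (all illustrative choices of our own):

# Sketch: sequential (asynchronous) dynamics never increase the energy
# E(x) = -1/2 x^T W x + x^T theta, and the network reaches a fixed point.
# W, theta and the initial state are random illustrative choices.
import numpy as np

rng = np.random.default_rng(4)
n = 12
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                          # symmetric ...
np.fill_diagonal(W, np.abs(np.diag(W)))    # ... with non-negative diagonal
theta = rng.normal(size=n)
x = rng.choice([-1.0, 1.0], size=n)

E = lambda x: -0.5 * x @ W @ x + x @ theta

changed = True
while changed:
    changed = False
    for i in range(n):                     # one full sequential sweep
        s = W[i] @ x - theta[i]
        new = x[i] if s == 0 else np.sign(s)   # no change on a zero argument
        if new != x[i]:
            x[i] = new
            changed = True
            print(f"flip unit {i}: E = {E(x):.4f}")
print("fixed point reached:", x.astype(int))

The printed energies form a non-increasing sequence, as the calculation in the proof of Theorem 7.1 predicts.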
With initial state x(0) = z, one update gives

x_i(1) = sign( Σ_{k=1}^{n} w_{ik} x_k(0) ) = sign( Σ_{k=1}^{n} w_{ik} z_k ).

For z to be a fixed point, we must have z_i = sign( Σ_{k=1}^{n} w_{ik} z_k ) (or possibly Σ_{k=1}^{n} w_{ik} z_k = 0). This requires that

Σ_{k=1}^{n} w_{ik} z_k = α z_i

for some α ≥ 0. Guided by our experience with the adaptive linear combiner (ALC) network, we try w_{ik} = α z_i z_k, that is, we choose the synaptic matrix to be (w_{ik}) = α z zᵀ, the outer product (or Hebb rule). We calculate

Σ_{k=1}^{n} w_{ik} z_k = α z_i Σ_{k=1}^{n} z_k z_k = α n z_i

since z_k² = 1. It follows that the choice (w_{ik}) = α z zᵀ, α ≥ 0, does indeed give z as a fixed point.
Next, we consider what happens if the system starts out not exactly in the state z, but in some perturbation of z, say ẑ. We can think of ẑ as z but with some of its bits flipped. Taking x(0) = ẑ, we get

x_i(1) = sign( Σ_{k=1}^{n} w_{ik} ẑ_k ).

The choice α = 0 would give w_{ik} = 0 and so x_i(1) = ẑ_i and ẑ (and indeed any state) would be a fixed point. The dynamics would be trivial—nothing moves. This is no good for us—we want states ẑ sufficiently close to z to “evolve” into z. Incidentally, the same remark applies to the choice of synaptic matrix (w_{ik}) = 1l_n, the unit n × n matrix. For such a choice, it is clear that all vectors are fixed points. We want the stored patterns to act as “attractors”, each with a non-trivial “basin of attraction”. Substituting w_{ik} = α z_i z_k, we find

x_i(1) = sign( α z_i Σ_{k=1}^{n} z_k ẑ_k )
       = sign( z_i Σ_{k=1}^{n} z_k ẑ_k ),   since α > 0,
       = sign( z_i (n − 2ρ(z, ẑ)) ),

where ρ(z, ẑ) is the Hamming distance between z and ẑ, i.e., the number of differing bits. It follows that if ρ(z, ẑ) < n/2, then sign( z_i (n − 2ρ(z, ẑ)) ) = sign(z_i) = z_i, so that x(1) = z. In other words, any vector ẑ with ρ(z, ẑ) < n/2 is mapped onto the fixed point z directly, in one iteration cycle. (Whenever a component of ẑ is updated, either it agrees with the corresponding component of z and so remains unchanged, or it differs and is then mapped immediately onto the corresponding component of z. Each flip moves ẑ towards z.) We have therefore shown that the basin of direct attraction of z contains the disc {ẑ : ρ(z, ẑ) < n/2}.
Can we say anything about the situation when ρ(z, ẑ) > n/2? Evidently, sign( z_i (n − 2ρ(z, ẑ)) ) = sign(−z_i) = −z_i, and so x(1) = −z. Is −z a fixed point? According to our earlier discussion, the state −z is a fixed point if we take the synaptic matrix to be given by (−z)(−z)ᵀ. But (−z)(−z)ᵀ = z zᵀ and we conclude that −z is a fixed point for the network above. (Alternatively, we could simply check this using the definition of the dynamics.)
We can also use this to see that ẑ is directly attracted to −z whenever ρ(z, ẑ) > n/2. Indeed, ρ(z, ẑ) > n/2 implies that ρ(−z, ẑ) < n/2 and so, arguing as before, we deduce that ẑ is directly mapped onto the fixed point −z. If ρ(z, ẑ) = n/2, then there is no change, by definition of the dynamics.
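This basin-of-attraction behaviour is easy to check experimentally. A minimal sketch (the value of n and the numbers of flipped bits are arbitrary choices; n is taken odd so that the borderline case ρ = n/2 never occurs):

# Sketch: with W = z z^T (Hebb rule, alpha = 1), any state within Hamming
# distance < n/2 of z is mapped onto z in one synchronous update, and any
# state at distance > n/2 is mapped onto -z.
import numpy as np

rng = np.random.default_rng(5)
n = 101
z = rng.choice([-1, 1], size=n)
W = np.outer(z, z)

def update(x):                            # x(1) = sign(W x(0))
    return np.sign(W @ x)

for flips in (10, 50, 60, 90):            # flips = rho(z, z_hat)
    z_hat = z.copy()
    idx = rng.choice(n, size=flips, replace=False)
    z_hat[idx] *= -1
    x1 = update(z_hat)
    result = "z" if np.array_equal(x1, z) else ("-z" if np.array_equal(x1, -z) else "?")
    print(f"rho(z, z_hat) = {flips:3d}  ->  one update gives {result}")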
How can we store many patterns? We shall try the Hebbian rule,

W ≡ (w_{ik}) = Σ_{μ=1}^{p} z^{(μ)} z^{(μ)ᵀ}.

Presenting a state z as input, one update gives

z_i(1) = sign( (W z)_i ) = sign( Σ_{k=1}^{n} w_{ik} z_k )
       = sign( Σ_{k=1}^{n} Σ_{μ=1}^{p} z_i^{(μ)} z_k^{(μ)} z_k ).
It follows that each exemplar z^{(1)}, . . . , z^{(p)} is a fixed point if they are pairwise orthogonal. In general (not necessarily pairwise orthogonal), we have

Σ_{k=1}^{n} w_{ik} z_k^{(j)} = Σ_{μ=1}^{p} Σ_{k=1}^{n} z_i^{(μ)} z_k^{(μ)} z_k^{(j)}
    = Σ_{k=1}^{n} z_k^{(j)} z_i^{(j)} z_k^{(j)} + Σ_{μ≠j} Σ_{k=1}^{n} z_i^{(μ)} z_k^{(μ)} z_k^{(j)}
    = n ( z_i^{(j)} + (1/n) Σ_{μ≠j} z_i^{(μ)} Σ_{k=1}^{n} z_k^{(μ)} z_k^{(j)} ).        (∗)
The right hand side has the same sign as z_i^{(j)} whenever the first term is dominant. To get a rough estimate of the situation, let us suppose that the prototype vectors are chosen at random—this gives an “average” indication of what we might expect to happen (a “typical” situation). Then the second term in the brackets on the right hand side of (∗) involves a sum of n(p − 1) terms which can take on the values ±1 with equal probability. By the Central Limit Theorem, the term

(1/n) Σ_{μ≠j} z_i^{(μ)} Σ_{k=1}^{n} z_k^{(μ)} z_k^{(j)}
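The size of this “crosstalk” term, and its effect on recall, is easy to examine numerically. A minimal sketch (random ±1 patterns; the values of n and p are arbitrary illustrative choices) measures, for each stored pattern, the fraction of components whose sign is still reproduced after one update—perfect recall corresponds to a fraction of 1.

# Sketch: store p random bipolar patterns with the Hebb rule
# W = sum_mu z^(mu) z^(mu)^T and check how many bits of each stored
# pattern survive one update; the crosstalk term in (*) is what spoils
# perfect recall as p grows.
import numpy as np

rng = np.random.default_rng(6)
n = 200
for p in (5, 20, 60):
    Z = rng.choice([-1, 1], size=(p, n))          # patterns z^(1), ..., z^(p)
    W = Z.T @ Z                                    # sum of outer products
    stable = np.mean(np.sign(W @ Z.T) == Z.T)      # fraction of correct bits
    print(f"n = {n}, p = {p}: fraction of stable bits = {stable:.3f}")

As p grows relative to n, the crosstalk term increasingly often overturns the sign of the first term, and the stored patterns cease to be fixed points.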
x_i(t + 1) = sign( Σ_{k=1}^{n} w_{ik} x_k(t) − θ_i ).
Proof. The method uses an energy function, but now depending on the state
of the network at two consecutive time steps. We define
G(t) = − Σ_{i,j=1}^{n} x_i(t) w_{ij} x_j(t − 1) + Σ_{i=1}^{n} ( x_i(t) + x_i(t − 1) ) θ_i
     = −xᵀ(t) W x(t − 1) + ( xᵀ(t) + xᵀ(t − 1) ) θ.

Hence

G(t + 1) − G(t) = −xᵀ(t + 1) W x(t) + ( xᵀ(t + 1) + xᵀ(t) ) θ
                    + xᵀ(t) W x(t − 1) − ( xᵀ(t) + xᵀ(t − 1) ) θ
                = ( xᵀ(t − 1) − xᵀ(t + 1) ) W x(t) + ( xᵀ(t + 1) − xᵀ(t − 1) ) θ,   using W = Wᵀ,
                = −( xᵀ(t + 1) − xᵀ(t − 1) ) ( W x(t) − θ )
                = − Σ_{i=1}^{n} ( x_i(t + 1) − x_i(t − 1) ) ( Σ_{k=1}^{n} w_{ik} x_k(t) − θ_i ).

By the update rule, x_i(t + 1) has the same sign as Σ_{k=1}^{n} w_{ik} x_k(t) − θ_i, so each term of this last sum is non-negative, and is strictly positive whenever x_i(t + 1) ≠ x_i(t − 1). Hence, if x(t + 1) ≠ x(t − 1), we conclude that G(t + 1) < G(t). (We assume here that the threshold function is strict, that is, the weights and threshold are such that x = (x_1, . . . , x_n) ↦ Σ_k w_{ik} x_k − θ_i never vanishes on {−1, 1}ⁿ.) Since the state
[Figure: the nodes 1, 2, 3, . . . , n of the network.]
Let x(0) be the initial configuration of our original network (of n nodes).
Set z(0) = x(0) ⊕ x(0), that is, z_i(0) = x_i(0) = z_{n+i}(0) for 1 ≤ i ≤ n.
We update the larger (doubled) network sequentially in the order node
n+1, . . . , 2n, 1, . . . , n. Since there are no connections within the set of nodes
1, . . . , n and within the set n + 1, . . . , 2n, we see that the outcome is
where x(s) is the state of our original system run in parallel mode. By the
theorem, the larger system reaches a fixed point so that z(t) = z(t + 1) for
all sufficiently large t. Hence
for all sufficiently large t—which means that the original system has a cycle
of length 2 (or a fixed point).
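This behaviour of the parallel (synchronous) mode—settling into a fixed point or a cycle of length 2—can be observed directly. A minimal sketch, with random symmetric weights and thresholds of our own choosing:

# Sketch: synchronous dynamics x(t+1) = sign(W x(t) - theta) with symmetric W
# settles into a fixed point or a cycle of length 2, i.e. x(t+2) = x(t).
import numpy as np

rng = np.random.default_rng(7)
n = 15
A = rng.normal(size=(n, n))
W = (A + A.T) / 2
theta = rng.normal(size=n)

def step(x):
    s = W @ x - theta
    return np.where(s == 0, x, np.sign(s))      # no change on a zero argument

x = rng.choice([-1.0, 1.0], size=n)
t = 0
while not np.array_equal(x, step(step(x))):     # wait until x(t+2) = x(t)
    x = step(x)
    t += 1

kind = "fixed point" if np.array_equal(x, step(x)) else "cycle of length 2"
print(f"reached a {kind} after {t} parallel updates")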
The energy function for the larger system is

Ê(z) = −½ zᵀ Ŵ z + zᵀ θ̂
     = −½ ( x(2t) ⊕ x(2t − 1) )ᵀ Ŵ ( x(2t) ⊕ x(2t − 1) ) + ( x(2t) ⊕ x(2t − 1) )ᵀ θ̂
     = −½ ( x(2t) ⊕ x(2t − 1) )ᵀ ( W x(2t − 1) ⊕ W x(2t) ) + ( x(2t) ⊕ x(2t − 1) )ᵀ θ̂
     = −½ ( x(2t)ᵀ W x(2t − 1) + x(2t − 1)ᵀ W x(2t) ) + x(2t)ᵀ θ + x(2t − 1)ᵀ θ
     = −x(2t)ᵀ W x(2t − 1) + ( x(2t) + x(2t − 1) )ᵀ θ
     = G(2t)
[Figure: two layers of units, labelled X and Y, with connections running between the two layers.]
y_j(t + 1) = sign( Σ_{k=1}^{n} w_{jk}^{XY} x_k(t) )

and then

x_i(t + 1) = sign( Σ_{ℓ=1}^{m} w_{iℓ}^{YX} y_ℓ(t + 1) )
with the usual convention that there is no change if the argument to the
function sign(·) is zero.
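A small simulation of this two-layer scheme (a bidirectional associative memory) is sketched below. We assume here—this is not stated at this point in the notes—that the pairs are stored by the Hebb-type rule W^{XY} = Σ_μ y^{(μ)} x^{(μ)ᵀ} and that W^{YX} = (W^{XY})ᵀ, which is the standard choice; the sizes and the number of stored pairs are arbitrary.

# Sketch: bidirectional updates between an X layer (n units) and a Y layer
# (m units).  Assumed: W_XY = sum_mu y^(mu) x^(mu)^T and W_YX = W_XY^T.
import numpy as np

rng = np.random.default_rng(8)
n, m, p = 40, 25, 3
X = rng.choice([-1, 1], size=(p, n))          # patterns x^(1), ..., x^(p)
Y = rng.choice([-1, 1], size=(p, m))          # associated y^(1), ..., y^(p)
W_XY = Y.T @ X                                 # m x n
W_YX = W_XY.T                                  # n x m

def sgn(s, old):                               # no change on a zero argument
    return np.where(s == 0, old, np.sign(s))

x = X[0].astype(float)
x[:6] *= -1                                    # corrupt a few bits of x^(1)
y = np.zeros(m)
for _ in range(5):                             # alternate the two update steps
    y = sgn(W_XY @ x, y)                       # y_j(t+1) = sign(sum_k w^XY_jk x_k(t))
    x = sgn(W_YX @ y, x)                       # x_i(t+1) = sign(sum_l w^YX_il y_l(t+1))

print("x matches x^(1):", np.array_equal(x, X[0]))
print("y matches y^(1):", np.array_equal(y, Y[0]))

With only a few stored pairs, the corrupted input is usually cleaned up to the stored pair within a couple of passes.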
This system can be regarded as a special case of a recurrent neural
network operating under asynchronous dynamics, as we now show. First, let
z = (x1 , . . . , xn , y1 , . . . , ym ) ∈ {−1, 1}n+m . For i and j in {1, 2, . . . , n + m},
Chapter 8
Singular Value Decomposition
Theorem 8.1 (Singular Value Decomposition). For any given non-zero ma-
trix A ∈ Rm×n , there exist orthogonal matrices U ∈ Rm×m , V ∈ Rn×n and
positive real numbers λ1 ≥ λ2 ≥ · · · ≥ λr > 0, where r = rank A, such that
A = U D Vᵀ
where D ∈ Rm×n has entries Dii = λi , 1 ≤ i ≤ r and all other entries are
zero.
Proof. Suppose that m ≥ n. Then AᵀA ∈ R^{n×n} and AᵀA ≥ 0. Hence there is an orthogonal n × n matrix V such that

AᵀA = V Σ Vᵀ

where Σ ∈ R^{n×n} is the diagonal matrix Σ = diag(μ_1, . . . , μ_n), and μ_1 ≥ μ_2 ≥ · · · ≥ μ_n are the eigenvalues of AᵀA, counted according to multiplicity. If A ≠ 0, then AᵀA ≠ 0 and so has at least one non-zero eigenvalue. Thus, there is 0 < r ≤ n such that μ_1 ≥ μ_2 ≥ · · · ≥ μ_r > μ_{r+1} = · · · = μ_n = 0. Write

Σ = ( Λ²  0 ; 0  0 ),

where Λ = diag(λ_1, . . . , λ_r) with λ_1² = μ_1, . . . , λ_r² = μ_r. Partition V as V = ( V_1  V_2 ), where V_1 ∈ R^{n×r} and V_2 ∈ R^{n×(n−r)}. Since V is
AᵀA = V Σ Vᵀ
    = ( V_1  V_2 ) ( Λ²  0 ; 0  0 ) Vᵀ
    = ( V_1 Λ²  0 ) ( V_1ᵀ ; V_2ᵀ )
    = V_1 Λ² V_1ᵀ.

Hence V_2ᵀ AᵀA V_2 = V_2ᵀ V_1 Λ² V_1ᵀ V_2 = 0, since V_1ᵀ V_2 = 0. But V_2ᵀ AᵀA V_2 = (A V_2)ᵀ (A V_2), and so it follows that A V_2 = 0.
Now, the equality AᵀA = V_1 Λ² V_1ᵀ suggests at first sight that we might hope that A = Λ V_1ᵀ. However, this cannot be correct, in general, since A ∈ R^{m×n}, whereas Λ V_1ᵀ ∈ R^{r×n}, and so the dimensions are incorrect. However, if U ∈ R^{k×r} satisfies Uᵀ U = 1l_r, then V_1 Λ² V_1ᵀ = V_1 Λ Uᵀ U Λ V_1ᵀ and we might hope that A = U Λ V_1ᵀ. We use this idea to define a suitable matrix U. Accordingly, we define
Hence,

A = U ( Λ  0 ; 0  0 ) Vᵀ,

as claimed. Note that the condition m ≥ n means that m ≥ n ≥ r, and so the dimensions of the various matrices are all valid.
If m < n, consider B = Aᵀ instead. Then, by the above argument, we get that

Aᵀ = B = U′ ( Λ′  0 ; 0  0 ) V′ᵀ,

for orthogonal matrices U′ ∈ R^{n×n}, V′ ∈ R^{m×m}, and where Λ′² holds the positive eigenvalues of A Aᵀ. Taking the transpose, we have

A = V′ ( Λ′  0 ; 0  0 ) U′ᵀ.

Finally, we observe that from the given form of the matrix A, it is clear that the dimension of ran A is exactly r, that is, rank A = r.
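In practice the decomposition is computed by library routines. The following sketch (an arbitrary low-rank test matrix of our own choosing) checks the statement of Theorem 8.1 using numpy.linalg.svd, which returns the λ_i as a vector of singular values rather than as the matrix D.

# Sketch: checking Theorem 8.1 numerically with numpy.linalg.svd.
import numpy as np

rng = np.random.default_rng(9)
m, n, r = 6, 4, 3
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # a rank-r matrix

U, s, Vt = np.linalg.svd(A)        # A = U D V^T, with s = (lambda_1 >= lambda_2 >= ...)
D = np.zeros((m, n))
D[:len(s), :len(s)] = np.diag(s)

print("singular values:", np.round(s, 4))
print("U orthogonal:   ", np.allclose(U.T @ U, np.eye(m)))
print("V orthogonal:   ", np.allclose(Vt @ Vt.T, np.eye(n)))
print("A = U D V^T:    ", np.allclose(A, U @ D @ Vt))
print("rank A == r:    ", np.linalg.matrix_rank(A) == r)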
Remark 8.2. From this result, we see that the matrices AᵀA = V ( Λ²  0 ; 0  0 ) Vᵀ and A Aᵀ = U ( Λ²  0 ; 0  0 ) Uᵀ have the same non-zero eigenvalues, counted according to multiplicity. We can also see that
Proof. We just have to check that A#, as given above, really does satisfy the defining conditions of the generalized inverse. We will verify two of the four conditions by way of illustration.
Put X = V H Uᵀ, where H = ( Λ⁻¹  0 ; 0  0 ) ∈ R^{n×m}. Then

A X A = U D Vᵀ V H Uᵀ U D Vᵀ
      = U ( 1l_r  0 ; 0  0 ) ( Λ  0 ; 0  0 ) Vᵀ = A,

where the first block matrix is m × m and the second (which is just D) is m × n. Similarly,

X A = V H Uᵀ U D Vᵀ
    = V ( Λ⁻¹  0 ; 0  0 ) ( Λ  0 ; 0  0 ) Vᵀ   (using Uᵀ U = 1l_m)
    = V ( 1l_r  0 ; 0  0 ) Vᵀ
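The same construction is easily carried out numerically. The sketch below (an arbitrary test matrix of our own choosing) builds A# = V H Uᵀ with H = ( Λ⁻¹ 0 ; 0 0 ) from the computed SVD and checks the four Moore–Penrose conditions (which we take to be the defining conditions referred to above), comparing the result with numpy's built-in pinv.

# Sketch: the generalized (Moore-Penrose) inverse A# = V H U^T built from the
# singular value decomposition, checked against numpy.linalg.pinv.
import numpy as np

rng = np.random.default_rng(10)
m, n, r = 5, 7, 3
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r, with m < n here

U, s, Vt = np.linalg.svd(A)
H = np.zeros((n, m))                      # H = (Lambda^{-1} 0; 0 0) in R^{n x m}
H[:r, :r] = np.diag(1.0 / s[:r])
A_sharp = Vt.T @ H @ U.T                  # A# = V H U^T

checks = {
    "A X A = A": np.allclose(A @ A_sharp @ A, A),
    "X A X = X": np.allclose(A_sharp @ A @ A_sharp, A_sharp),
    "(A X)^T = A X": np.allclose((A @ A_sharp).T, A @ A_sharp),
    "(X A)^T = X A": np.allclose((A_sharp @ A).T, A_sharp @ A),
    "matches np.linalg.pinv": np.allclose(A_sharp, np.linalg.pinv(A)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")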
Proof. We have already seen that the choice X = BA# does indeed minimize
ψ(X). Let A = U DV T be the singular value decomposition of A with the
notation as above—so that, for example, Λ denotes the top left r × r block
of D. We have
where we have partitioned the matrices Y and C into ( Y_1  Y_2 ) and ( C_1  C_2 ), with Y_1, C_1 ∈ R^{ℓ×r},

‖Ŷ‖²_F = ‖C_1 Λ⁻¹‖²_F < ‖C_1 Λ⁻¹‖²_F + ‖Y_2‖²_F = ‖Y′‖²_F.

It follows that among those Y matrices minimizing ψ(X), Ŷ is the one with the least ‖·‖_F-norm. But if Y = X U, then ‖Y‖_F = ‖X‖_F and so X̂ given by X̂ = Ŷ Uᵀ is the X matrix which has the least ‖·‖_F-norm amongst those
= BA#