NN Notes PDF
Ivan F Wilde
Mathematics Department
[email protected]
Contents
1 Matrix Memory
4 The Perceptron
Bibliography
Chapter 1
Matrix Memory
The idea is that the association should not be defined so much between
the individual stimulus-response pairs, but rather embodied as a whole col-
lection of such input-output patterns—the system is a distributive associa-
tive memory (the input-output pairs are “distributed” throughout the sys-
tem memory rather than the particular input-output pairs being somehow
represented individually in various different parts of the system).
To attempt to realize such a system, we shall suppose that the input key (or prototype) patterns are coded as vectors in R^n, say, and that the responses are coded as vectors in R^m. For example, the input might be a digitized photograph comprising a picture with 100 × 100 pixels, each of which may assume one of eight levels of greyness (from white (= 0) to black (= 7)). In this case, by mapping the screen to a vector, via raster order, say, the input is a vector in R^10000 whose components take values in the set {0, . . . , 7}. The desired output might correspond to the name of the person in the photograph. If we wish to recognize up to 50 people, say, then we could give each a binary code name of 6 digits, which allows up to 2^6 = 64 different names. Then the output can be considered as an element of R^6.
Now, for any pair of vectors x ∈ R^n, y ∈ R^m, we can effect the map x ↦ y via the action of the m × n matrix

M^(x,y) = y x^T

where x is considered as an n × 1 (column) matrix and y as an m × 1 matrix. Indeed,

M^(x,y) x = y x^T x = α y,

where α = x^T x = ‖x‖², the squared Euclidean norm of x. The matrix y x^T is called the outer product of x and y. This suggests a model for our "associative system".
Suppose that we wish to consider p input-output pattern pairs, (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(p), y^(p)). Form the m × n matrix

M = Σ_{i=1}^p y^(i) x^(i)T.
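As a small illustration (my own sketch, not part of the notes), the construction can be written out in NumPy; the sizes n, m, p and the patterns are arbitrary choices, and the key patterns are taken orthonormal so that recall is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 16, 4, 3  # arbitrary sizes for the illustration

# Orthonormal key patterns x^(1), ..., x^(p) in R^n (via a QR factorization)
X = np.linalg.qr(rng.normal(size=(n, p)))[0]   # columns are orthonormal keys
Y = rng.normal(size=(m, p))                    # arbitrary responses y^(i) in R^m

# Correlation memory matrix M = sum_i y^(i) x^(i)T  (an m x n matrix)
M = sum(np.outer(Y[:, i], X[:, i]) for i in range(p))

# Perfect recall for orthonormal keys: M x^(j) = y^(j)
print(np.allclose(M @ X, Y))   # True
```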
M x^(j) = y^(j),

that is, perfect recall (this holds provided the key patterns x^(1), . . . , x^(p) are orthonormal, so that x^(i)T x^(j) = δ_ij). Note that R^n contains at most n mutually orthogonal vectors.
[Figure: the memory matrix maps an input signal in R^n to an output in R^m.]
• start with M = 0,
and we have seen that M has perfect recall, M x^(j) = x^(j) for all 1 ≤ j ≤ p. We would like to know what happens if M is presented with x, a corrupted version of one of the x^(j). In order to obtain a bipolar vector as output, we process the output vector M x as follows:

M x → Φ(M x)
where ρ_i(x) = ρ(x^(i), x), the Hamming distance between the input vector x and the prototype pattern vector x^(i).
Given x, we wish to know when x → x^(m), that is, when x → M x → Φ(M x) = x^(m). According to our bipolar quantization rule, it will certainly be true that Φ(M x) = x^(m) whenever the corresponding components of M x and x^(m) have the same sign. This will be the case when (M x)_j x_j^(m) > 0, that is, whenever

(1/n)(n − 2ρ_m(x)) x_j^(m) x_j^(m) + (1/n) Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) > 0   (∗)

for all 1 ≤ j ≤ n (note that x_j^(m) x_j^(m) = 1; we have used the fact that if s > |t| then certainly s + t > 0).
We wish to find conditions which ensure that the inequality (∗) holds. By the triangle inequality, we get

| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | ≤ Σ_{i=1, i≠m}^p |n − 2ρ_i(x)|   (∗∗)

since |x_j^(i) x_j^(m)| = 1 for all 1 ≤ j ≤ n. Furthermore, x^(i)T x = n − 2ρ_i(x) and, using the orthogonality of x^(m) and x^(i), for i ≠ m, we have

x^(i)T x = x^(i)T x^(m) + x^(i)T (x − x^(m)) = x^(i)T (x − x^(m)),

so that |n − 2ρ_i(x)| = |x^(i)T (x − x^(m))| ≤ 2ρ_m(x), since x − x^(m) has exactly ρ_m(x) non-zero components, each equal to ±2. Hence, we have

|n − 2ρ_i(x)| ≤ 2ρ_m(x)   (∗∗∗)

for all i ≠ m. This, together with (∗∗), gives

| Σ_{i=1, i≠m}^p (n − 2ρ_i(x)) x_j^(i) x_j^(m) | ≤ Σ_{i=1, i≠m}^p |n − 2ρ_i(x)| ≤ 2(p − 1)ρ_m(x).

It follows that whenever 2(p − 1)ρ_m(x) < n − 2ρ_m(x) then (∗) holds, which means that Φ(M x) = x^(m). The condition 2(p − 1)ρ_m(x) < n − 2ρ_m(x) is just that 2pρ_m(x) < n, i.e., the condition that ρ_m(x) < n/2p.
Theorem 1.3. Suppose that {x^(1), x^(2), . . . , x^(p)} is a given set of mutually orthogonal bipolar patterns in {−1, 1}^n. If x ∈ {−1, 1}^n lies within Hamming distance (n/2p) of a particular prototype vector x^(m), say, then x^(m) is the nearest prototype vector to x.
Furthermore, if the autoassociative matrix memory based on the patterns {x^(1), x^(2), . . . , x^(p)} is augmented by subsequent bipolar quantization, then the input vector x invokes x^(m) as the corresponding output.
This means that the combined memory matrix and quantization system can correctly recognize (slightly) corrupted input patterns. The non-linearity (induced by the bipolar quantizer) has enhanced the system performance—small background "noise" has been removed. Note that it could happen that the output response to x is still x^(m) even if x is further than (n/2p) from x^(m). In other words, the theorem only gives sufficient conditions for x to recall x^(m).
As an example, suppose that we store 4 patterns built from a grid of 8 × 8 pixels, so that p = 4, n = 8² = 64 and (n/2p) = 64/8 = 8. Each of the 4 patterns can then be correctly recalled even when presented with up to 7 incorrect pixels.
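This example can be checked numerically (my own sketch; the Sylvester-Hadamard construction is just one convenient way to obtain mutually orthogonal bipolar patterns in {−1, 1}^64).

```python
import numpy as np

# Sylvester construction of a 64 x 64 Hadamard matrix: its rows are
# mutually orthogonal bipolar vectors in {-1, 1}^64.
H = np.array([[1]])
for _ in range(6):
    H = np.block([[H, H], [H, -H]])

n, p = 64, 4
patterns = H[1:p + 1]                 # p = 4 mutually orthogonal patterns

M = patterns.T @ patterns / n         # autoassociative memory (1/n) sum_i x^(i) x^(i)T

rng = np.random.default_rng(1)
x = patterns[2].copy()
flip = rng.choice(n, size=7, replace=False)    # corrupt 7 of the 64 pixels
x[flip] *= -1                                  # Hamming distance 7 < n/(2p) = 8

recalled = np.sign(M @ x)             # bipolar quantization Phi
print(np.array_equal(recalled, patterns[2]))   # True
```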
Remark 1.4. If x is close to −x(m) , then the output from the combined
autocorrelation matrix memory and bipolar quantizer is −x(m) .
[Figure: a network with n input nodes and m output nodes; the connection from input node j to output node i carries the weight M_ij.]
"Weights" are assigned to the connections. Since y_i = Σ_j M_ij x_j, this suggests that we assign the weight M_ij to the connection joining input node j to output node i; M_ij = weight(j → i).
The correlation memory matrix trained on the pattern pairs (x^(1), y^(1)), . . . , (x^(p), y^(p)) is given by M = Σ_{m=1}^p y^(m) x^(m)T, which has typical term

M_ij = Σ_{m=1}^p (y^(m) x^(m)T)_ij = Σ_{m=1}^p y_i^(m) x_j^(m).
Now, Hebb's law (1949) for "real", i.e., biological, brains says that if the excitation of cell j is involved in the excitation of cell i, then continued excitation of cell j causes an increase in its efficiency to excite cell i. To encapsulate a crude version of this idea mathematically, we might hypothesise that the weight between the two nodes be proportional to the excitation values of the nodes. Thus, for pattern label m, we would postulate that the weight, weight(input j → output i), be proportional to x_j^(m) y_i^(m).
We see that Mij is a sum, over all patterns, of such terms. For this reason,
the assignment of the correlation memory matrix to a content addressable
memory system is sometimes referred to as generalized Hebbian learning, or
one says that the memory matrix is given by the generalized Hebbian rule.
We have seen that the correlation memory matrix has perfect recall provided
that the input patterns are pairwise orthogonal vectors. Clearly, there can
be at most n of these. In practice, this orthogonality requirement may not
be satisfied, so it is natural ask for some kind of guide as to the number
of patterns that can be stored and effectively recovered. In other words,
how many patterns can there be before the cross-talk term becomes so large
that it destroys the recovery of the key patterns? Experiment confirms that,
indeed, there is a problem here. To give some indication of what might be
reasonable, consider the autoassociative correlation memory matrix based
on p bipolar pattern vectors x(1) , . . . , x(p) ∈ {−1, 1}n , followed by bipolar
quantization, Φ. On presentation of pattern x(m) , the system output is
Φ(M x^(m)) = Φ( (1/n) Σ_{i=1}^p x^(i) x^(i)T x^(m) ).
Consider the k-th bit. Then Φ(M x^(m))_k = x_k^(m) whenever x_k^(m) (M x^(m))_k > 0, that is, whenever

(1/n) x_k^(m) x_k^(m) x^(m)T x^(m) + (1/n) Σ_{i=1, i≠m}^p x_k^(m) x_k^(i) x^(i)T x^(m) > 0.

Here x_k^(m) x_k^(m) = 1 and x^(m)T x^(m) = n, so the first term equals 1 and, writing C_k for the second term, the condition becomes

1 + C_k > 0.

Expanding the inner product,

C_k = (1/n) Σ_{i=1, i≠m}^p Σ_{j=1}^n X_{m,k,i,j},   where X_{m,k,i,j} = x_k^(m) x_k^(i) x_j^(i) x_j^(m).

Next, we see that, for j ≠ k, each X_{m,k,i,j} takes the values ±1 with equal probability, namely 1/2, and that these different Xs form an independent family. Therefore, we may write C_k as

C_k = (p − 1)/n + (1/n) S
M a(i) = b(i)
for all 1 ≤ i ≤ p. Let A ∈ R^{n×p} be the matrix whose columns are the vectors a^(1), . . . , a^(p), i.e., A = (a^(1) · · · a^(p)), and let B ∈ R^{m×p} be the matrix with columns given by the b^(i)s, B = (b^(1) · · · b^(p)), thus A_ij = a_i^(j) and B_ij = b_i^(j). Then it is easy to see that M a^(i) = b^(i), for all i, is equivalent to M A = B. The problem, then, is to solve the matrix equation

M A = B,
In other words,
Av = v1 a(1) + · · · + vp a(p) .
The vector Av is a linear combination of the columns of A, considered as
elements of Rn .
Now, the statement that Av = 0 if and only if v = 0 is equivalent to the
statement that v1 a(1) + · · · + vp a(p) = 0 if and only if v1 = v2 = · · · = vp = 0
which, in turn, is equivalent to the statement that a(1) ,. . . ,a(p) are linearly
independent vectors in Rn .
Thus, the statement, Av = 0 if and only if v = 0, is true if and only if
the columns of A are linearly independent vectors in Rn .
Proof. The square matrix ATA is invertible if and only if the equation
ATAv = 0 has the unique solution v = 0, v ∈ Rp . (Certainly the invertibility
of ATA implies the uniqueness of the zero solution to ATAv = 0. For the
converse, first note that the uniqueness of this zero solution implies that
So, with this choice of M we get perfect recall, provided that the input
pattern vectors are linearly independent.
Note that, in general, the solution above is not unique. Indeed, for any matrix C ∈ R^{m×n}, the m × n matrix M′ = C(1l_n − A(A^T A)^{−1} A^T) satisfies

M′ A = C( A − A(A^T A)^{−1} A^T A ) = C(A − A) = 0 ∈ R^{m×p}.

Hence M + M′ satisfies (M + M′)A = B.
Can we see what M looks like in terms of the patterns a^(i), b^(i)? The answer is "yes and no". We have A = (a^(1) · · · a^(p)) and B = (b^(1) · · · b^(p)). Then

(A^T A)_ij = Σ_{k=1}^n (A^T)_ik A_kj = Σ_{k=1}^n A_ki A_kj = Σ_{k=1}^n a_k^(i) a_k^(j)
which gives A^T A directly in terms of the a^(i)s. Let Q = A^T A ∈ R^{p×p}. Then M = B Q^{−1} A^T, so that

M_ij = Σ_{k,ℓ=1}^p B_ik (Q^{−1})_kℓ (A^T)_ℓj = Σ_{k,ℓ=1}^p b_i^(k) (Q^{−1})_kℓ a_j^(ℓ),   since (A^T)_ℓj = A_jℓ.

This formula for M, valid for linearly independent input patterns, expresses M more or less in terms of the patterns. The appearance of the inverse, Q^{−1}, somewhat lessens its appeal, however.
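A brief numerical sketch of this formula (mine, with made-up sizes); for linearly independent input patterns, M = B(AᵀA)⁻¹Aᵀ gives perfect recall.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 10, 3, 4                      # p <= n so the a^(i) can be independent

A = rng.normal(size=(n, p))             # columns a^(1), ..., a^(p) (generically independent)
B = rng.normal(size=(m, p))             # columns b^(1), ..., b^(p)

Q = A.T @ A                             # Q = A^T A, invertible for independent columns
M = B @ np.linalg.inv(Q) @ A.T          # M = B Q^{-1} A^T

print(np.allclose(M @ A, B))            # perfect recall: M a^(i) = b^(i)
```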
To discuss the case where the columns of A are not necessarily linearly
independent, we need to consider the notion of generalized inverse.
Definition 1.7. For any given matrix A ∈ R^{m×n}, the matrix X ∈ R^{n×m} is said to be a generalized inverse of A if
(i) AXA = A,
(ii) XAX = X,
(iii) (AX)^T = AX,
(iv) (XA)^T = XA.
Examples 1.8.
If X and Y are both generalized inverses of A, then
X = XAX, by (i),
  = X(AX)^T, by (iii),
  = X X^T A^T
  = X X^T A^T Y^T A^T, by (i)^T,
  = X X^T A^T A Y, by (iii),
  = X A X A Y, by (iii),
  = X A Y, by (ii),
  = X A Y A Y, by (i),
  = X A A^T Y^T Y, by (iv),
  = A^T X^T A^T Y^T Y, by (iv),
  = A^T Y^T Y, by (i)^T,
  = Y A Y, by (iv),
  = Y, by (ii),
as required.
Proposition 1.10. For any A ∈ Rm×n , AA# is the orthogonal projection onto
ran A, the linear span in Rm of the columns of A, i.e., if P = AA# ∈ Rm×m ,
then P = P T = P 2 and P maps Rm onto ran A.
We see that

‖A‖_F² = Tr(A^T A) = Σ_{i=1}^n (A^T A)_ii = Σ_{i=1}^n Σ_{j=1}^m (A^T)_ij A_ji = Σ_{i=1}^n Σ_{j=1}^m A_ji²,

the sum of the squares of all the entries of A.
Proof. We have
as required.
Proof. We have
Hence
‖XA − B‖_F² = ‖(X − BA#)A‖_F² + ‖B(A# A − 1l_p)‖_F²
Now let us return to our problem of finding a memory matrix which stores the input-output pattern pairs (a^(i), b^(i)), 1 ≤ i ≤ p, with each a^(i) ∈ R^n and each b^(i) ∈ R^m. In general, it may not be possible to find a matrix M ∈ R^{m×n} such that M a^(i) = b^(i), for each i. Whatever our choice of M, the system output corresponding to the input a^(i) is just M a^(i). So, failing equality M a^(i) = b^(i), we would at least like to minimize the error b^(i) − M a^(i). A measure of such an error is ‖b^(i) − M a^(i)‖₂², the squared Euclidean norm of the difference. Taking all p patterns into account, the total system recall error is taken to be

Σ_{i=1}^p ‖b^(i) − M a^(i)‖₂².
Let A = (a^(1) · · · a^(p)) ∈ R^{n×p} and B = (b^(1) · · · b^(p)) ∈ R^{m×p} be the matrices whose columns are given by the pattern vectors a^(i) and b^(i), respectively. Then the total system recall error, above, is just

‖B − M A‖_F².

By the identity above, this is minimized by the choice M = BA#. We have seen that AA# is precisely the projection onto the range of A, i.e., onto the subspace of R^n spanned by the prototype patterns. In this case, we say that M is given by the projection rule.
Kohonen calls 1l − AA# the novelty filter and has applied these ideas to image-subtraction problems such as tumor detection in brain scans. Non-null novelty vectors may indicate disorders or anomalies.
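As a rough illustration of the projection rule and the novelty filter (my own sketch; here the generalized inverse A# is computed with NumPy's pinv):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
A = rng.normal(size=(n, p))          # columns: stored prototype patterns

P = A @ np.linalg.pinv(A)            # A A#: orthogonal projection onto ran A
novelty_filter = np.eye(n) - P       # Kohonen's novelty filter 1l - A A#

x_old = A @ rng.normal(size=p)       # lies in the span of the prototypes
x_new = rng.normal(size=n)           # generic vector: has a novel component

print(np.linalg.norm(novelty_filter @ x_old))   # ~0: nothing new
print(np.linalg.norm(novelty_filter @ x_new))   # > 0: non-null novelty vector
```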
Pattern classification
We have discussed the distributed associative memory (DAM) matrix as
an autoassociative or as a heteroassociative memory model. The first is
mathematically just a special case of the second. Another special case is
that of so-called classification. The idea is that one simply wants an input
signal to elicit a response “tag”, typically coded as one of a collection of
orthogonal unit vectors, such as given by the standard basis vectors of Rm .
used to construct the autoassociative memory via the projection rule. When
presented with incomplete or fuzzy versions of the original patterns, the
OLAM matrix satisfactorily reconstructed the correct image.
In another autoassociative recall experiment, twenty one different pro-
totype images were used to construct the OLAM matrix. These were each
composed of three similarly placed copies of a subimage. New pattern im-
ages, consisting of just one part of the usual triple features, were presented
to the OLAM matrix. The output images consisted of slightly fuzzy versions
of the single part but triplicated so as to mimic the subimage positioning
learned from the original twenty one prototypes.
An analysis comparing the performance of the correlation memory ma-
trix with that of the generalized inverse matrix memory has been offered by
Cherkassky, Fassett and Vassilas (IEEE Trans. on Computers, 40, 1429 (1991)).
Their conclusion is that the generalized inverse memory matrix performs
better than the correlation memory matrix for autoassociation, but that
the correlation memory matrix is better for classification. This is contrary
to the widespread belief that the generalized inverse memory matrix is the
superior model.
Chapter 2
Adaptive Linear Combiner
M x(i) = y (i) ,
[Figure: the adaptive linear combiner: the input signal components x_1, . . . , x_n are weighted by m_1, . . . , m_n and fed to an "adder" which forms the output y = Σ_{i=1}^n m_i x_i.]
We have seen that we may not be able to find M which satisfies the
exact input-output relationship M x(i) = y (i) , for each i. The idea is to look
for an M which is in a certain sense optimal. To do this, we seek m1 , . . . , mℓ
0 = ∂E/∂m_i = Σ_{k=1}^ℓ A_ik m_k − b_i,
with m(1) arbitrary and where the parameter α is called the learning rate. If we substitute for grad E, we find

m(n + 1) = m(n) + α ( b − A m(n) ).
U A U^T = D = diag(λ_1, . . . , λ_ℓ)

E = ½ m^T A m − b^T m + ½ c
  = ½ m^T U^T U A U^T U m − b^T U^T U m + ½ c
  = ½ z^T D z − v^T z + ½ c,   where z = U m and v = U b,
  = Σ_{i=1}^ℓ ½ λ_i z_i² − Σ_{i=1}^ℓ v_i z_i + ½ c.
A_jk = (1/p) Σ_{i=1}^p x_j^(i) x_k^(i).

This is an average of x_j^(i) x_k^(i) taken over the patterns. Given a particular pattern x^(i), we can think of x_j^(i) x_k^(i) as an estimate for the average A_jk. Similarly, we can think of b_j = (1/p) Σ_{i=1}^p y^(i) x_j^(i) as an average, and y^(i) x_j^(i) as an estimate for b_j. Accordingly, we change our algorithm for updating the memory matrix to the following.
Select an input-output pattern pair, (x^(i), y^(i)), say, and use the previous algorithm but with A_jk and b_j "estimated" as above. Thus,

m_j(n + 1) = m_j(n) + α ( y^(i) x_j^(i) − Σ_{k=1}^ℓ x_j^(i) x_k^(i) m_k(n) ),

that is,

m_j(n + 1) = m_j(n) + α δ^(i) x_j^(i)

where

δ^(i) = ( y^(i) − Σ_{k=1}^ℓ x_k^(i) m_k(n) ) = (desired output − actual output)

is the output error for pattern pair i. This is known as the delta-rule, or the Widrow-Hoff learning rule, or the least mean square (LMS) algorithm.
• First choose a value for α, the learning rate (in practice, this might be 0.1 or 0.05, say).
• Start with m_j(1) = 0 for all j, or perhaps with small random values.
• Keep selecting input-output pattern pairs (x^(i), y^(i)) and update m(n) by the rule

m_j(n + 1) = m_j(n) + α δ^(i) x_j^(i),   1 ≤ j ≤ ℓ,

where δ^(i) = y^(i) − Σ_{k=1}^ℓ m_k(n) x_k^(i) is the output error for the pattern pair (i) as determined by the memory matrix in operation at iteration step n. Ensure that every pattern pair is regularly presented and continue until the output error has reached and appears to remain at an acceptably small value.
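A minimal sketch (mine) of this procedure with made-up, noiseless data; the learning rate, sweep count and stopping tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
ell, p = 5, 20
X = rng.normal(size=(p, ell))            # input patterns x^(i) (rows)
m_true = rng.normal(size=ell)
y = X @ m_true                           # consistent target outputs y^(i)

alpha = 0.05                             # learning rate
m = np.zeros(ell)                        # start with m_j(1) = 0

for sweep in range(200):                 # keep presenting all pattern pairs
    for i in range(p):
        delta = y[i] - m @ X[i]          # output error for pattern pair i
        m = m + alpha * delta * X[i]     # delta rule / Widrow-Hoff / LMS update
    if np.max(np.abs(y - X @ m)) < 1e-6: # stop once the error stays small
        break

print(np.allclose(m, m_true, atol=1e-4)) # True for this noiseless example
```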
with m(1) arbitrary. This is exactly what we have already arrived at above.
It should be clear from this point of view that there is no reason a priori to
suppose that the algorithm converges. Indeed, one might be more inclined
to suspect that the m-values given by this rule simply “thrash about all over
the place” rather than settling down towards a limiting value.
m_j(n + 1) = m_j(n) + α ( y^(n) x_j^(n) − Σ_{k=1}^ℓ x_j^(n) x_k^(n) m_k(n) )
where x(n) and y (n) are the input-output pattern pair presented at step
n. If we assume that these patterns presented at the various steps are
independent, then, from the algorithm, we see that mk (n) only depends on
the patterns presented before step n and so is independent of x(n) . Taking
expectations we obtain the vector equation
• The patterns (a(i) , b(i) ) are presented one after the other, (a(1) , b(1) ),
(a(2) , b(2) ),. . . , (a(p) , b(p) )—thus constituting a (pattern) cycle. This cycle
is to be repeated.
• Let M (i) (n) denote the memory matrix just before the pattern pair
(a(i) , b(i) ) is presented in the nth cycle. On presentation of (a(i) , b(i) )
to the network, the memory matrix is updated according to the rule
M^(i+1)(n) = M^(i)(n) − α_n ( M^(i)(n) a^(i) − b^(i) ) a^(i)T
Remark 2.4. The gradient of the total error function E is given by the terms

∂E/∂M_jk = Σ_{i=1}^p ∂E^(i)/∂M_jk.
When the ith example is being learned, only the terms ∂E (i) /∂Mjk , for
fixed i, are used to update the memory matrix. So at this step the value of
E (i) will decrease but it could happen that E actually increases. The point
is that the algorithm is not a standard gradient-descent algorithm and so
standard convergence arguments are not applicable. A separate proof of
convergence must be given.
Remark 2.5. When m = 1, the output vectors are just real numbers and we
recover the adaptive linear combiner and the Widrow-Hoff rule as a special
case.
Remark 2.6. The algorithm is “local” in the sense that it only involves in-
formation available at the time of each presentation, i.e., it does not need
to remember any of the previously seen examples.
for each i = 1, . . . , p.
Remark 2.8. In general, the limit matrices M_α^(i) are different for different i.
Example 2.9. Consider the case when there is a single input node, so that the memory matrix M ∈ R^{1×1} is just a real number, m, say.
We shall suppose that the system is to learn the two pattern pairs (1, c_1) and (−1, c_2). Then the total system error function is
giving
We see that lim_{n→∞} m^(1)(n + 1) = β/(1 − λ), provided that |λ| < 1. This condition is equivalent to (1 − α)² < 1, or |1 − α| < 1, which is the same as 0 < α < 2. The limit is

m_α^(1) ≡ β/(1 − λ) = ( (1 − α)α c_1 − α c_2 ) / ( 1 − (1 − α)² ),

which simplifies to

m_α^(1) = ( (1 − α) c_1 − c_2 ) / (2 − α).
Now, for m(2) (n), we have
Hence

m^(1)(n + 1) = (1 − α_n) [ (1 − α_n) m^(1)(n) + α_n c_1 ] − α_n c_2,

giving

m^(1)(n + 1) = (1 − α_n)² m^(1)(n) + (1 − α_n) α_n c_1 − α_n c_2.

We wish to examine the convergence of m^(1)(n) (and m^(2)(n)) to m∗ = (c_1 − c_2)/2. So if we set y_n = m^(1)(n) − (c_1 − c_2)/2, then we would like to show that y_n → 0, as n → ∞. The recursion formula for y_n is

y_{n+1} + (c_1 − c_2)/2 = (1 − α_n)² ( y_n + (c_1 − c_2)/2 ) + (1 − α_n) α_n c_1 − α_n c_2,

which simplifies to

y_{n+1} = (1 − α_n)² y_n − α_n² (c_1 + c_2)/2.
Next, we impose suitable conditions on the learning rates, αn , which will
ensure convergence.
which deals with the second bracketed term in the expression for y_{n+1}. To estimate the first term, we rewrite it as

(r_0′ + r_1′ + · · · + r_m′) β_1 β_2 · · · β_n,

where we have set r_0′ = r_0 and r_j′ = r_j/(β_1 · · · β_j), for j > 0.
We claim that β_1 · · · β_n → 0 as n → ∞. To see this, we use the inequality

log(1 − t) ≤ −t,   for 0 ≤ t < 1,

which can be derived as follows. We have

− log(1 − t) = ∫_{1−t}^1 dx/x ≥ ∫_{1−t}^1 dx = t,   since 1/x ≥ 1 in the range of integration,

which gives log(1 − t) ≤ −t, as required. Using this, we may say that log(1 − α_j) ≤ −α_j, and so Σ_{j=1}^n log(1 − α_j) ≤ −Σ_{j=1}^n α_j. Thus

log(β_1 · · · β_n) = log Π_{j=1}^n (1 − α_j)² = 2 Σ_{j=1}^n log(1 − α_j) ≤ −2 Σ_{j=1}^n α_j.

But Σ_{j=1}^n α_j → ∞ as n → ∞, which means that log(β_1 · · · β_n) → −∞ as n → ∞, which, in turn, implies that β_1 · · · β_n → 0 as n → ∞, as claimed.
Thus, for this special simple example, we have demonstrated the con-
vergence of the LMS algorithm. The statement of the general case is as
follows.
Theorem 2.10 (LMS Convergence Theorem). Suppose that the learning rate α_n in the LMS algorithm satisfies the conditions

Σ_{n=1}^∞ α_n = ∞   and   Σ_{n=1}^∞ α_n² < ∞.
We will not present the proof here, which involves the explicit form of
the generalized inverse, as given via the Singular Value Decomposition. For
the details, we refer to the original paper of Luo.
Some of this signal then leaks across to the recipient’s mouthpiece and is
sent back to the caller. The time taken for this is about half a second, so
that the caller hears an echo of his own voice. By appropriate use of the
ALC in the circuit, this echo effect can be reduced.
Chapter 3
Artificial Neural Networks
Each neuron has only one axon, but it may branch out and so may be
able to reach perhaps thousands of other cells. There are many dendrites
(the word dendron is Greek for tree). The diameter of the soma is of the
order of 10 microns.
The outgoing signal is in the form of a pulse down the axon. On arrival
at a synapse (the junction where the axon meets a dendrite, or indeed, any
other part of another nerve cell) molecules, called neurotransmitters are re-
leased. These cross the synaptic gap (the axon and receiving neuron do not
quite touch) and attach themselves, very selectively, to receptor sites on the
receiving neuron. The membrane of the target neuron is chemically affected
and its own inclination to fire may be either enhanced or decreased. Thus,
the incoming signal can be correspondingly either excitatory or inhibitory.
Various drugs work by exploiting this behaviour. For example, curare de-
posits certain chemicals at particular receptor sites which artificially inhibit
motor (muscular) stimulation by the brain cells. This results in the inability
to move.
[Figure: a model neuron: the inputs x_1, . . . , x_n with weights w_1, . . . , w_n are summed, the threshold θ enters via a fixed input of −1, and the result u is passed through ϕ(·) to give the output y.]
Typically, ϕ(·) is a non-linear function. Commonly used forms for ϕ(·) are
the binary and bipolar threshold functions, the piece-wise linear function
(“hard-limited” linear function), and the so-called sigmoid function. Exam-
ples of these are as follows.
Examples 3.1.
This is just like the binary version, but the “off” output is represented
as −1 rather than as 0. We might call this a bipolar McCulloch-Pitts
neuron, and denote the function ϕ(·) by sign(·).
3. Piece-wise linear ("hard-limited" linear) function

ϕ(v) = 1 for v ≥ ½,   ϕ(v) = v + ½ for −½ < v < ½,   ϕ(v) = 0 for v ≤ −½.

[Figure: the graph of ϕ rises linearly from 0 at v = −½ to 1 at v = ½.]
4. Sigmoid function
A sigmoid function is any differentiable function ϕ(·), say, such that
ϕ(v) → 0 as v → −∞, ϕ(v) → 1 as v → ∞ and ϕ′ (v) > 0.
[Figure: the graph of a sigmoid function, increasing from 0 to 1.]
A standard example is

ϕ : v ↦ ϕ(v) = 1/(1 + exp(−αv)).
The larger the value of the constant parameter α, the greater is the slope.
(The slope is sometimes called the “gain”.) In the limit when α → ∞,
this sigmoid function becomes the binary threshold function (except for
the single value v = 0, for which ϕ(0) is equal to 1/2, for all α). One
could call this the threshold limit of ϕ.
If we want the output to vary between −1 and +1, rather than 0 and 1, we could simply change the definition to demand that ϕ(v) → −1 as v → −∞, thus defining a bipolar sigmoid function. One can easily transform between binary and bipolar sigmoid functions. For example, if ϕ is a binary sigmoid function, then 2ϕ − 1 is a bipolar sigmoid function.
2ϕ(v) − 1 = (1 − exp(−αv))/(1 + exp(−αv)) = tanh(αv/2).
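For reference, a small sketch (mine) of the activation functions just described:

```python
import numpy as np

def binary_threshold(v):
    """McCulloch-Pitts unit: 1 if v >= 0, else 0."""
    return np.where(v >= 0, 1.0, 0.0)

def bipolar_threshold(v):
    """Bipolar version: the 'off' output is -1 rather than 0."""
    return np.where(v >= 0, 1.0, -1.0)

def piecewise_linear(v):
    """Hard-limited linear function: 0 below -1/2, 1 above 1/2, linear between."""
    return np.clip(v + 0.5, 0.0, 1.0)

def sigmoid(v, alpha=1.0):
    """Logistic sigmoid 1/(1 + exp(-alpha v)); larger alpha gives a steeper slope."""
    return 1.0 / (1.0 + np.exp(-alpha * v))

def bipolar_sigmoid(v, alpha=1.0):
    """2*sigmoid - 1, which equals tanh(alpha v / 2)."""
    return 2.0 * sigmoid(v, alpha) - 1.0

v = np.linspace(-3, 3, 7)
print(np.allclose(bipolar_sigmoid(v), np.tanh(v / 2)))   # True
```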
• Each connection can carry a signal, i.e., to each connection we may assign
a real number, called a signal. The signal is thought of as travelling in
the direction of the link.
• Each node has “local memory” in the form of a collection of real numbers,
called weights, each of which is assigned to a corresponding terminating
i.e., incoming connection and represents the synaptic efficacy.
• Some nodes may be specified as input nodes and others as output nodes.
In this way, the neural network can communicate with the external world.
One could consider a node with only outgoing links as an input node
(source) and one with only incoming links as an output node (sink).
[Figure: a general network of connected nodes, with designated input and output nodes.]
Often neural networks are arranged in layers such that the connections
are only between consecutive layers, all in the same direction, and there
being no connections within any given layer. Such neural networks are
called feedforward neural networks.
[Figure: a layered feedforward network: signals pass from the input layer through successive layers to the output layer.]
Chapter 4
The Perceptron
[Figure: the perceptron: associator units A_1, . . . , A_n feed, via weights w_1, . . . , w_n, a threshold decision unit with threshold θ; the output lies in {0, 1}.]
The story goes that after much hype about the promise of neural net-
works in the fifties and early sixties (for example, that artificial brains would
soon be a reality), Minsky and Papert’s work, elucidating the theory and
vividly illustrating the limitations of the perceptron, was the catalyst for
the decline of the subject and certainly (which is almost the same thing) the
withdrawal of U. S. government funding. This led to a lull in research into
neural networks, as such, during the period from the end of the sixties to the
beginning of the eighties, but work did continue under the headings of adap-
The associator units are thought of as “probing” the outside world, which
we shall call the “retina”, for pictorial simplicity, and then transmitting a
binary signal depending on the result. For example, a given associator unit
might probe a group of m × n pixels on the retina and output 1 if they are
all “on”, but otherwise output 0. Having set up a scheme of associators,
we might then ask whether the system can distinguish between vowels and
consonants drawn on the screen (retina).
Example 4.1. Consider a system comprising a retina formed by a 3 × 3 grid of pixels and six associator units A_h1, . . . , A_v3 (see Aleksander, I. and H. Morton, 1991).
[Figure: the associator units A_h1, A_h2, A_h3, A_v1, A_v2, A_v3 feed, via weights h_1, h_2, h_3, v_1, v_2, v_3, a single binary threshold decision unit.]
3-point vertical receptive field (the three columns). Each associator fires if
and only if a majority of its probes are “on”, i.e., each produces an output of
1 if and only if at least 2 of its 3 pixels are black. We wish to assign weights
h1 , . . . , v3 , to the six connections to the binary threshold unit so that the
system can successfully distinguish between the letters T and H so that the
network has an output of 1, say, corresponding to T, and an output of 0
corresponding to H.
Associator outputs:
      A_h1   A_h2   A_h3   A_v1   A_v2   A_v3
  T    1      0      0      0      1      0
  H    1      1      1      1      0      1
For the letter T, the associator output vector is given by (A_h1, . . . , A_v3) = (1, 0, 0, 0, 1, 0), which we see induces the weighted net input h_1 + v_2 = 1 + 1 = 2 to the binary decision unit (taking h_1 = v_2 = 1 and h_2 = h_3 = v_1 = v_3 = −1). Thus the output is 1, which is associated with the letter T.
On the other hand, the retina image for H leads to the associator output vector (A_h1, . . . , A_v3) = (1, 1, 1, 1, 0, 1), which induces the weighted net input h_1 + h_2 + h_3 + v_1 + v_3 = 1 − 1 − 1 − 1 − 1 = −3 to the binary decision unit. The output is 0, which is associated with the letter H.
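A quick check of this example in code (my own sketch; the weights h1 = v2 = 1, h2 = h3 = v1 = v3 = −1 with a threshold of 0 are one choice consistent with the net inputs computed above).

```python
import numpy as np

# Associator outputs (A_h1, A_h2, A_h3, A_v1, A_v2, A_v3) for the two letters
patterns = {"T": np.array([1, 0, 0, 0, 1, 0]),
            "H": np.array([1, 1, 1, 1, 0, 1])}

weights = np.array([1, -1, -1, -1, 1, -1])   # (h1, h2, h3, v1, v2, v3)
theta = 0                                    # any threshold between -3 and 2 works

for letter, a in patterns.items():
    net = weights @ a                        # weighted net input to the decision unit
    output = 1 if net >= theta else 0
    print(letter, net, output)               # T: 2 -> 1,   H: -3 -> 0
```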
[Figure: a pattern of width ≥ d + 1 divided into regions labelled 1, 2, 3, 4.]
The perceptron associator units can be split into three distinct groups:
Σ = ΣA + ΣB + ΣC .
[Figure: the three groups of associator units, A, B and C, each feeding its own partial sum.]
Clearly, the first two inequalities imply that Y ′ > Y which together with the
third gives X ′ + Y ′ > X ′ + Y > α. This contradicts the fourth inequality.
We conclude that there can be no such perceptron.
Two-class classification
We wish to set up a network which will be able to differentiate inputs from
one of two categories. To this end, we shall consider a simple perceptron
comprising many input units and a single binary decision threshold unit.
We shall ignore the associator units as such and just imagine them as pro-
viding inputs to the binary decision unit. (One could imagine the system
comprising as many associator units as pixels and where each associator unit
probes just one pixel to see if it is “on” or not.) However, we shall allow the
input values to be any real numbers, rather than just binary digits.
Suppose then that we have a finite collection of vectors in Rn , each of
which is classified as belonging to one of two categories, S1 , S2 , say. If
the vector (x1 , . . . , xn ) is presented to a linear binary threshold unit with
weights w1 , . . . , wn and threshold θ, then the output, z, say, is
z = step( Σ_{i=1}^n w_i x_i − θ ) = 1 if Σ_{i=1}^n w_i x_i ≥ θ, and 0 otherwise.
For given x ∈ Rn , let y = (x0 , x) = (−1, x) ∈ Rn+1 and let w ∈ Rn+1 be the
vector given by w = (w0 , w1 , . . . , wn ). The two categories of vectors S1 and
S2 in Rn determine two categories, C1 and C2 , say, in Rn+1 via the above
w · y = w T y = w 0 y0 + · · · + w n yn
w′ · y = (w + y) · y = w · y + y · y > w · y
w′ · y = (w − y) · y = w · y − y · y,
C = C1 ∪ (−C2 ) = {y : y ∈ C1 , or − y ∈ C2 }.
Then w · y > 0 for all y ∈ C would give correct classification. The error-correction rule now becomes the following.
• if w · y > 0 for the presented pattern y ∈ C, leave w unchanged;
• otherwise change w to w′ = w + y.
One would like to present the input patterns one after the other, each
time following this error-correction rule. Unfortunately, whilst changes in
the vector w may enhance classification for one particular input pattern,
these changes may spoil the classification of other patterns and this correc-
tion procedure may simply go on forever. The content of the Perceptron
Convergence Theorem is that, in fact, under suitable circumstances, this
endless loop does not occur. After a finite number of steps, further correc-
tions become unnecessary.
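A minimal sketch (mine) of the error-correction rule on randomly generated, linearly separable data; the Perceptron Convergence Theorem below guarantees that the loop stops.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 2, 50
w_true = np.array([0.5, 1.0, -1.0])            # (threshold, w1, w2) defining the two classes

X = rng.uniform(-1, 1, size=(N, n))
Y = np.hstack([-np.ones((N, 1)), X])           # augmented vectors y = (-1, x)
labels = np.sign(Y @ w_true)
C = Y * labels[:, None]                        # C = C1 u (-C2): we want w . y > 0 on all rows

w = np.zeros(n + 1)
changed = True
while changed:                                 # cycle until no corrections are needed
    changed = False
    for y in C:
        if w @ y <= 0:                         # misclassified (or on the boundary)
            w = w + y                          # error-correction: w -> w + y
            changed = True

print(np.all(C @ w > 0))                       # True: all patterns correctly classified
```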
Definition 4.3. We say that two subsets S_1 and S_2 in R^n are linearly separable if and only if there is some (w_1, . . . , w_n) ∈ R^n and θ ∈ R such that Σ_{i=1}^n w_i x_i > θ for each x ∈ S_1 and Σ_{i=1}^n w_i x_i < θ for each x ∈ S_2.
[Figure: two linearly separable classes in the (x_1, x_2)-plane: the points of S_1 (plotted as +) lie on one side of the line w_1 x_1 + w_2 x_2 = θ and those of S_2 (plotted as ∘) on the other.]
Then there is N such that w(N ) · y > 0 for all y ∈ C, that is, after at most
N update steps, the weight vector correctly classifies all the input patterns
and no further weight changes take place.
Proof. The perceptron error-correction rule can be written as

w^(k+1) · ŵ = w^(1) · ŵ + α^(1) y^(1) · ŵ + α^(2) y^(2) · ŵ + · · · + α^(k) y^(k) · ŵ
            ≥ w^(1) · ŵ + (α^(1) + α^(2) + · · · + α^(k)) δ,

and
Theorem 4.7. Let C be any bounded subset of R^{ℓ+1} such that there is some ŵ ∈ R^{ℓ+1} and δ > 0 such that ŵ · y > δ for all y ∈ C. For any given sequence (y^(k)) in C, define the sequence (w^(k)) by

w^(k+1) = w^(k) + α^(k) y^(k),   where α^(k) = 0 if w^(k) · y^(k) > 0, and α^(k) = 1 otherwise,

and w^(1) ∈ R^{ℓ+1} is arbitrary. Then there is M (not depending on the particular sequence (y^(k))) such that N(k) = α^(1) + α^(2) + · · · + α^(k) ≤ M for all k. In other words, α^(k) = 0 for all sufficiently large k and so there can be at most a finite number of non-zero weight changes.
N(k) ≤ b + c √N(k)
Remark 4.8. This means that if S1 and S2 are strictly linearly separable,
bounded subsets of Rℓ , then there is an upper bound to the number of
corrections that the perceptron learning rule is able to make. This does not
mean that the perceptron will necessarily learn to separate S1 and S2 in a
finite number of steps—after all, the update sequence may only “sample”
some of the data a few times, or perhaps not even at all! Indeed, having
chosen w(1) ∈ Rℓ+1 and the sequence y (1) , y (2) , . . . of sample data, it may be
that w(1) · y (j) > 0 for all j, but nonetheless, if y (1) , y (2) , . . . does not exhaust
C, there could be (possibly infinitely-many) points y ∈ C with w(1) · y <
0. Thus, the algorithm, starting with this particular w(1) and based on
the particular sample patterns y (1) , y (2) , . . . will not lead to a successful
classification of S1 and S2 .
[Figure: a single-layer network of m binary threshold units with thresholds θ_1, . . . , θ_m; the inputs x_1, . . . , x_n (together with a fixed input of −1) feed every unit, and the output is a vector in {0, 1}^m.]
• present the first (augmented) pattern vector and observe the output
values;
• now present the next input pattern and repeat, cycling through all
the patterns again and again, as necessary.
x = α1 e1 + · · · + αn en .
Then we have
ℓ(x) = ℓ(α1 e1 + · · · + αn en )
= ℓ(α1 e1 ) + · · · + ℓ(αn en )
= α1 ℓ(e1 ) + · · · + αn ℓ(en ).
Proposition 4.12. For any linearly independent set of vectors {x^(1), . . . , x^(m)} in R^n, the augmented vectors {x̂^(1), . . . , x̂^(m)}, with x̂^(i) = (−1, x^(i)) ∈ R^{n+1}, form a linearly independent collection.
Proof. Suppose that Σ_{j=1}^m α_j x̂^(j) = 0. This is a vector equation and so each of the n + 1 components of the left hand side must vanish, Σ_{j=1}^m α_j x̂_i^(j) = 0, for 0 ≤ i ≤ n. Note that the first component of x̂^(j) has been given the index 0.
In particular, Σ_{j=1}^m α_j x̂_i^(j) = 0, for 1 ≤ i ≤ n, that is, Σ_{j=1}^m α_j x^(j) = 0.
and
[Figure: the winner-takes-all network: n inputs feed m linear units, and a final stage computes the winning unit, i.e., the index j of the unit with the largest activation.]
There are n inputs which feed into m linear units, i.e., each unit simply
records its net weighted input. The network output is the index of the unit
which has maximum such activation. Thus, if w1 , . . . , wm are the weights
to the m linear units and the input pattern is x, then the system output is
the “winning unit”, i.e., that value of j with
w_j · x > w_i · x   for all i ≠ j.
In the case of a tie, one assigns a rule of convenience. For example, one
might choose one of the winners at random, or alternatively, always choose
the smallest index possible, say.
The question is whether or not the algorithm leads to a set of weights which
correctly classify the patterns. Under certain separability conditions, the
answer is yes.
Theorem 4.15 (Multi-class error-correction convergence theorem). Suppose that the m classes S_i are linearly separable in the sense that there exist weight vectors w_1*, . . . , w_m* ∈ R^n such that
Suppose we start with ŵ = w_1^(1) ⊕ · · · ⊕ w_m^(1) and apply the perceptron error correction algorithm with patterns drawn from C:

ŵ^(k+1) = ŵ^(k) + α^(k) x̂^(j)

where α^(k) = 0 if ŵ^(k) · x̂^(j) > 0, and otherwise α^(k) = 1. The correction α^(k) x̂^(j) gives

ŵ^(k+1) = (w_1^(k) + v_1) ⊕ · · · ⊕ (w_m^(k) + v_m)

where
v_i = α^(k) x, assuming x ∈ S_i,
v_j = −α^(k) x,
v_ℓ = 0, all ℓ ≠ i, j.
That is,
w_i^(k+1) = w_i^(k) + α^(k) x
w_j^(k+1) = w_j^(k) − α^(k) x
w_ℓ^(k+1) = w_ℓ^(k),   ℓ ≠ i, ℓ ≠ j,
which is precisely the multi-class error correction rule when unit j is the winning unit. In other words, the weight changes given by the two system algorithms are the same. If one is non-zero, then neither is the other. It follows that every misclassified pattern (from S) presented to the multi-class system induces a misclassified "super-pattern" (in C ⊂ R^{mn}) for the perceptron system, thus triggering a non-zero weight update.
However, the perceptron convergence theorem assures us that the "super-pattern" system can accommodate at most a finite number of non-zero weight changes. It follows that the multi-class algorithm can undergo at most a finite number of weight changes. We deduce that after a finite number of steps, the winner-takes-all system must correctly classify all patterns from S.
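The multi-class rule extracted in this proof can be sketched as follows (my own illustration on synthetic, linearly separable data).

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, N = 2, 3, 60

# Synthetic linearly separable classes: label = argmax_i (w*_i . x)
W_star = rng.normal(size=(m, n))
X = rng.normal(size=(N, n))
labels = np.argmax(X @ W_star.T, axis=1)

W = np.zeros((m, n))                     # weight vectors w_1, ..., w_m (rows)
changed = True
while changed:
    changed = False
    for x, i in zip(X, labels):
        j = int(np.argmax(W @ x))        # winner-takes-all output (ties: smallest index)
        if j != i:
            W[i] += x                    # strengthen the unit of the true class
            W[j] -= x                    # weaken the winning (wrong) unit
            changed = True

print(np.array_equal(np.argmax(X @ W.T, axis=1), labels))   # True
```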
Mapping implementation
[Figure: a separating line for the "and" function on {0, 1}²: the point (1, 1) is separated from (0, 0), (0, 1) and (1, 0); the corresponding simple perceptron has weights w_1 = 2, w_2 = 2 and threshold θ = 3.]
Example 4.17. Next we consider the so-called "or" function, f, on {0, 1}². This is given by

f : (0, 0) ↦ 0,  (0, 1) ↦ 1,  (1, 0) ↦ 1,  (1, 1) ↦ 1.

Thus, f is "on" if and only if at least one of its inputs is also "on", i.e., one or the other, or both. As above, we seek a line separating the point (0, 0) from the rest. A solution is indicated in the figure.
[Figure 4.9: A separating line for the "or" function and the corresponding simple perceptron: weights w_1 = 1, w_2 = 1 and threshold θ = ½ separate (0, 0) from the other three points.]
Here, the function is “on” only if exactly one of the inputs is “on”. (We dis-
cuss the n-parity function later.) It is clear from a diagram that it is impossi-
ble to find a line separating the two classes {(0, 0), (1, 1)} and {(0, 1), (1, 0)}.
We can also see this algebraically as follows. If w1 , w2 and θ were weights
and threshold values implementing the “xor” function, then we would have
0 w1 + 0 w2 < θ
0 w1 + w 2 ≥ θ
w1 + 0 w 2 ≥ θ
w1 + w2 < θ.
Adding the second and third of these inequalities gives w1 + w2 ≥ 2θ, which
is incompatible with the fourth inequality. Thus we come to the conclusion
that the simple perceptron (two input nodes, one output node) is not able
to implement the “xor” function.
However, it is possible to implement the “xor” function using slightly
more complicated networks. Two such examples are shown in the figure.
[Figure: two small networks of threshold units, each using hidden units, implementing the "xor" function.]
Definition 4.19. The n-parity function is the binary map f on the hypercube
{0, 1}n , f : {0, 1}n → {0, 1}, given by f (x1 , . . . , xn ) = 1 if the number of
xk s which are equal to 1 is odd, and f (x1 , . . . , xn ) = 0 otherwise. Thus, f
can be written as

f(x_1, . . . , x_n) = ½ ( 1 − (−1)^{x_1 + x_2 + · · · + x_n} ),
[Figure: a single threshold unit with inputs x_1, . . . , x_n, weights w_1, . . . , w_n and threshold θ; putting x_3 = x_4 = · · · = x_n = 0 reduces it to a unit with the two inputs x_1, x_2.]
Σ_{j=1}^k (−1)^{j+1} = 1 − 1 + 1 − · · · (k terms) = 1 if k is odd, and 0 if k is even.

Evidently, because of the threshold of ½ at the output unit, this will fire if and only if k is odd, as required.
[Figure: a network computing the n-parity function: n hidden threshold units, the j-th having all input weights 1 and threshold j − ½, feed an output threshold unit with threshold ½ via the alternating weights +1, −1, +1, . . . , (−1)^{n+1}.]
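The construction can be checked directly (my own sketch): hidden unit j fires when at least j inputs are on, and the alternating output weights then detect whether that count is odd.

```python
from itertools import product

def parity_network(x):
    """n-parity via n hidden threshold units and one output threshold unit."""
    n = len(x)
    s = sum(x)
    # hidden unit j (j = 1..n): all input weights 1, threshold j - 1/2
    hidden = [1 if s - (j - 0.5) >= 0 else 0 for j in range(1, n + 1)]
    # output unit: weights +1, -1, +1, ..., threshold 1/2
    net = sum(((-1) ** j) * h for j, h in enumerate(hidden))   # +1, -1, +1, ...
    return 1 if net - 0.5 >= 0 else 0

n = 4
ok = all(parity_network(x) == (sum(x) % 2) for x in product([0, 1], repeat=n))
print(ok)   # True
```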
Next we shall show that any binary mapping on the binary hypercube {0, 1}^N can be implemented by a 3-layer feedforward neural network of binary threshold units. The space {0, 1}^N contains 2^N points (each a string of N binary digits). Each of these is mapped into either 0 or 1 under a binary function, and each such assignment defines a binary function. It follows that there are 2^(2^N) binary-valued functions on the binary hypercube {0, 1}^N. We wish to show that any such function can be implemented by a suitable network. First we shall show that it is possible to construct perceptrons which are very selective.
Theorem 4.22 (Grandmother cell). Let z ∈ {0, 1}^n be given. Then there is an n-input perceptron which fires if and only if it is presented with z.
Proof. Write z = (b_1, . . . , b_n). We seek weights w_1, . . . , w_n and a threshold θ such that

w_1 b_1 + · · · + w_n b_n − θ > 0

but

w_1 x_1 + · · · + w_n x_n − θ < 0

for every x ∈ {0, 1}^n with x ≠ z. Define w_i, i = 1, . . . , n, by

w_i = 1 if b_i = 1, and w_i = −1 otherwise.
Now, both sides of the above inequality are integers, so if we set θ = Σ_{i=1}^n b_i − ½, it follows that

Σ_{i=1}^n w_i x_i < θ < Σ_{i=1}^n b_i = Σ_{i=1}^n w_i b_i
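A direct check of the grandmother-cell construction (my own sketch) for a small n:

```python
import numpy as np
from itertools import product

def grandmother_cell(z):
    """Return (weights, threshold) of a unit firing only on input z in {0,1}^n."""
    z = np.asarray(z)
    w = np.where(z == 1, 1.0, -1.0)      # w_i = 1 if b_i = 1, else -1
    theta = z.sum() - 0.5                # theta = sum_i b_i - 1/2
    return w, theta

z = (1, 0, 1, 1, 0)
w, theta = grandmother_cell(z)
fires = {x: int(w @ np.array(x) - theta > 0) for x in product([0, 1], repeat=len(z))}
print(all(fire == (x == z) for x, fire in fires.items()))   # True: fires only on z
```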
[Figure: a network implementing an arbitrary binary function f on {0, 1}^N: 2^N hidden "grandmother cell" threshold units (the j-th with weights w_j1, . . . , w_jN and threshold θ_j, firing only on its own point of the hypercube) feed an output threshold unit with threshold ½; the weight from hidden unit j to the output is f(b_1, . . . , b_N), the value of f at the corresponding point.]
Remark 4.24. Notice that those hidden units labelled by those z with f (z) =
0 are effectively not connected to the output unit (their weights are zero).
We may just as well leave out these units altogether. This leads to the
observation that, in fact, at most 2^(N−1) units are required in the hidden
layer.
Indeed, suppose that {0, 1}N = A ∪ B where f (x) = 1 for x ∈ A and
f (x) = 0 for x ∈ B. If the number of elements in A is not greater than that
in B, we simply throw away all hidden units labelled by members of B, as
suggested above. In this case, we have no more than 2^(N−1) units left in the
hidden layer.
On the other hand, if B has fewer members than A, we wish to throw
away those units labelled by A. However, we first have to slightly modify
the network—otherwise, we will throw the grandmother out with the bath
water.
We make the following modifications. Change θ to −θ, and then change
all weights from the A-labelled hidden units to output to zero, and change all
weights from B-labelled hidden units to the output unit (from the previous
value of zero) to the value −1. Then we see that the system fires only for
inputs x ∈ A. We now throw away all the A-labelled hidden units (they
have zero weights out from them, anyway) to end up with fewer than 2^(N−1)
hidden units, as required.
J = ½ ‖Y w − b‖² = ½ ‖Y w − b‖_F².

J = ½ ‖Y w − b‖² = ½ (Y w − b)^T (Y w − b) = ½ Σ_{j=1}^N (y^(j)T w − b_j)².

Suppose that w^(1), w^(2), . . . , and b^(1), b^(2), . . . are iterations for w and b, respectively. Then a gradient-descent technique for b might be to construct the b^(k)s by the rule

b^(k+1) = b^(k) − α ∂J/∂b = b^(k) + α (Y w^(k) − b^(k)),   k = 1, 2, . . . ,

where the constant α > 0 will be chosen to ensure convergence, and where ∆b^(k) is given as above.
For notational convenience, let e^(k) = Y w^(k) − b^(k). Then ∆b^(k) is the vector with components given by ∆b^(k)_j = e^(k)_j if e^(k)_j > 0, but otherwise ∆b^(k)_j = 0. It follows that e^(k)T ∆b^(k) = ∆b^(k)T ∆b^(k).
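The iteration being described can be sketched as follows (my own reading; I take w^(k) = Y# b^(k), which is consistent with the identity Yᵀe^(k) = 0 used in the proof below, and the data, α and iteration limit are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(7)
S1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))     # class S1
S2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))   # class S2

# Rows y^(j): (-1, x) for x in S1 and -(-1, x) for x in S2, so we want Y w > 0
Y = np.vstack([np.hstack([-np.ones((20, 1)), S1]),
               -np.hstack([-np.ones((20, 1)), S2])])
N = len(Y)

alpha = 0.9                         # 0 < alpha < 2
b = np.ones(N)                      # strictly positive initial b^(1)
Yp = np.linalg.pinv(Y)              # generalized inverse Y#

for k in range(5000):
    w = Yp @ b                      # w^(k) = Y# b^(k)  (assumption: w via the pseudo-inverse)
    e = Y @ w - b                   # error vector e^(k) = Y w^(k) - b^(k)
    if np.all(Y @ w > 0):           # every pattern satisfies y^(j)T w > 0
        break
    b = b + alpha * np.maximum(e, 0.0)   # Delta b^(k): keep only the positive parts of e^(k)

print(np.all(Y @ w > 0))            # expect True: the classes here are linearly separable
```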
Theorem 4.25. Suppose that S1 and S2 are linearly separable classes. Then,
for any k, the inequalities e(k) j ≤ 0 for all j = 1, . . . , N imply that e(k) = 0.
Proof. By hypothesis, there is some ŵ ∈ R^{ℓ+1} such that ŵ · y > 0 for all y ∈ C, i.e., ŵ · y^(j) > 0 for all 1 ≤ j ≤ N. In terms of the matrix Y, this
But Y Y^# = (Y Y^#)^T and so

Y^T Y Y^# = Y^T (Y Y^#)^T = (Y Y^# Y)^T = Y^T.

Hence Y^T e^(k) = 0, and so e^(k)T Y = 0. (We will also use this last equality later.) It follows that e^(k)T Y ŵ = 0, that is,

Σ_{j=1}^N e^(k)_j (Y ŵ)_j = 0.
Remark 4.26. This result means that if, during execution of the algorithm,
we find that e(k) has strictly negative components, then S1 and S2 are not
linearly separable.
Next we turn to a proof of convergence of the algorithm under the as-
sumption that S1 and S2 are linearly separable.
Theorem 4.27. Suppose that S1 and S2 are linearly separable, and let 0 <
α < 2. The algorithm converges to a solution after a finite number of steps.
Proof. By definition,
Hence
Furthermore,
Hence
f^T Y ŵ = lim_{n→∞} e^(k_n)T Y ŵ = 0,

i.e., Σ_{j=1}^N f_j (Y ŵ)_j = 0, where (Y ŵ)_j > 0 for all 1 ≤ j ≤ N. Since f_j ≤ 0, we must have f_j = 0, j = 1, . . . , N, and we conclude that e^(k_n) → 0 as n → ∞. But then the inequality ‖e^(k+1)‖ ≤ ‖e^(k)‖, for all k = 1, 2, . . . , implies that the whole sequence (e^(k))_{k∈N} converges to 0. (For any given ε > 0, there is N_0 such that ‖e^(k_n)‖ < ε whenever n > N_0. Put N_1 = k_{N_0+1}. Then for any k > N_1, we have ‖e^(k)‖ ≤ ‖e^(k_{N_0+1})‖ < ε.)
Let µ = max{b^(1)_j : 1 ≤ j ≤ N} be the maximum value of the components of b^(1). Then µ > 0 by our choice of b^(1). Since e^(k) → 0, as k → ∞, there is k_0 such that ‖e^(k)‖ < ½µ whenever k > k_0. In particular, −½µ < e^(k)_j < ½µ for each 1 ≤ j ≤ N whenever k > k_0. But for any j = 1, . . . , N, b^(k)_j ≥ b^(1)_j, and so
whenever k > k_0. Therefore the vector w^(k_0+1) satisfies y^(j)T w^(k_0+1) > 0 for all 1 ≤ j ≤ N and determines a separating vector for S_1 and S_2, which completes the proof.
Chapter 5
Multilayer Feedforward Networks
[Figure: a small network with n input units, a single hidden unit (threshold θ_3, receiving the input x_1 via the weight γ) and two output units (thresholds θ_1, θ_2) connected to the hidden unit by the weights α and β; the outputs are y_1 and y_2.]
The network has n input units, one unit in the middle layer and two
in the output layer. The various weights and thresholds are as indicated.
We suppose that the activation functions of the neurons in the second and
third layers are given by the sigmoid function ϕ, say. The nodes in the
input layer serve, as usual, merely as placeholders for the distribution of the
input values to the units in the second (hidden) layer. Suppose that the
input pattern x = (x1 , . . . , xn ) is to induce the desired output (d1 , d2 ). Let
y = (y_1, y_2) denote the actual output. Then the sum of squared differences

E = ½ ( (d_1 − y_1)² + (d_2 − y_2)² )

is a measure of the error between the actual and desired outputs. We seek to find weights and thresholds which minimize this. The approach is to use a gradient-descent method, that is, we use the update algorithm

w ↦ w + ∆w,

where we have set ∆_1 = (d_1 − y_1) ϕ′(v_1). In an entirely similar way, we get

∂E/∂β = −(d_2 − y_2) ϕ′(v_2) z ≡ −∆_2 z
Similarly, we find

∂y_2/∂γ = ϕ′(v_2) β ϕ′(v_3) x_1.

Hence

∂E/∂γ = −( d_1 − ϕ(v_1) ) ϕ′(v_1) α ϕ′(v_3) x_1 − ( d_2 − ϕ(v_2) ) ϕ′(v_2) β ϕ′(v_3) x_1
      = −(∆_1 α + ∆_2 β) ϕ′(v_3) x_1
      ≡ −∆_3 x_1,

where ∆_3 = (∆_1 α + ∆_2 β) ϕ′(v_3). The gradient-descent updates are therefore

α → α + λ∆_1 z,
β → β + λ∆_2 z,
θ_i → θ_i − λ∆_i,   i = 1, 2,
γ → γ + λ∆_3 x_1,   and
θ_3 → θ_3 − λ∆_3.
We have only considered the weight γ associated with the input to the second layer, but the general case is clear. (We could denote these n weights by γ_i, i = 1, . . . , n, in which case the adjustment to γ_i is +λ∆_3 x_i.) Notice that the weight changes are proportional to the signal strength along the corresponding connection.
Now we shall turn to the more general case of a three layer feedforward
neural network with n input nodes, M neurons in the middle (hidden) layer
and m in the third (output) layer. A similar analysis can be carried out on
a general multilayer feedforward neural network, but we will just consider
one hidden layer.
[Figure: a three-layer feedforward network: n input nodes, a hidden layer of M neurons and an output layer of m neurons producing y_1, . . . , y_m, with weights w on the connections.]
The output from the neuron j in the hidden layer is ϕ(v_j^h) ≡ z_j, say. The activation potential of the output neuron ℓ is v_ℓ^out = Σ_{k=0}^M w_ℓk z_k, where z_0 = −1 and w_ℓ0 = θ_ℓ, the threshold for the unit.
We consider the error function E = ½ Σ_{r=1}^m (d_r − y_r)².
The strategy is to try to minimize E using a gradient-descent algorithm based on the weight variables: w ↦ w − λ grad E, where the gradient is with respect to all the weights (a total of nM + M + Mm + m variables). We wish to calculate the partial derivatives ∂E/∂w_ji and ∂E/∂w_ℓk. We find

∂E/∂w_ℓk = −Σ_{r=1}^m (d_r − y_r) ∂y_r/∂w_ℓk
         = −(d_ℓ − y_ℓ) ∂ϕ(v_ℓ^out)/∂w_ℓk
         = −(d_ℓ − y_ℓ) ϕ′(v_ℓ^out) ∂v_ℓ^out/∂w_ℓk
         = −(d_ℓ − y_ℓ) ϕ′(v_ℓ^out) z_k
         = −∆_ℓ z_k,

where ∆_ℓ = (d_ℓ − y_ℓ) ϕ′(v_ℓ^out). Next, we have

∂E/∂w_ji = −Σ_{r=1}^m (d_r − y_r) ∂y_r/∂w_ji.

But

∂y_r/∂w_ji = ∂ϕ(v_r^out)/∂w_ji = ϕ′(v_r^out) ∂v_r^out/∂w_ji
           = ϕ′(v_r^out) ∂/∂w_ji ( Σ_{k=0}^M w_rk z_k )
           = ϕ′(v_r^out) w_rj ∂z_j/∂w_ji,   since only z_j depends on w_ji,
           = ϕ′(v_r^out) w_rj ϕ′(v_j^h) ∂v_j^h/∂w_ji,   since z_j = ϕ(v_j^h),
           = ϕ′(v_r^out) w_rj ϕ′(v_j^h) x_i.

It follows that

∂E/∂w_ji = −Σ_{r=1}^m (d_r − y_r) ϕ′(v_r^out) w_rj ϕ′(v_j^h) x_i
         = −( Σ_{r=1}^m ∆_r w_rj ) ϕ′(v_j^h) x_i
         = −∆_j x_i,

where ∆_j = ( Σ_{r=1}^m ∆_r w_rj ) ϕ′(v_j^h). As we have remarked earlier, the ∆s are calculated backwards through the network using the current weights. Thus we may formulate the algorithm as follows.
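A minimal sketch (mine) of the resulting procedure for a single hidden layer; thresholds are folded in as weights from a fixed −1 input, and the sizes, data and learning rate are arbitrary choices.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def total_error(Wh, Wo, X, D):
    E = 0.0
    for x, d in zip(X, D):
        z = np.concatenate(([-1.0], sigmoid(Wh @ np.concatenate(([-1.0], x)))))
        y = sigmoid(Wo @ z)
        E += 0.5 * np.sum((d - y) ** 2)
    return E

rng = np.random.default_rng(8)
n, M, m = 2, 4, 1                         # input, hidden and output layer sizes
lam = 0.5                                 # learning rate lambda

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # training inputs
D = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs ("xor")

Wh = rng.uniform(-0.5, 0.5, size=(M, n + 1))   # hidden weights; column 0 is the threshold
Wo = rng.uniform(-0.5, 0.5, size=(m, M + 1))   # output weights; column 0 is the threshold

print(total_error(Wh, Wo, X, D))               # error before training

for epoch in range(5000):
    for x, d in zip(X, D):
        xa = np.concatenate(([-1.0], x))                      # augmented input (-1 for threshold)
        z = np.concatenate(([-1.0], sigmoid(Wh @ xa)))        # hidden outputs z_0 = -1, z_j
        y = sigmoid(Wo @ z)                                   # network outputs

        delta_out = (d - y) * y * (1.0 - y)                   # Delta_l = (d_l - y_l) phi'(v_l^out)
        delta_hid = (Wo[:, 1:].T @ delta_out) * z[1:] * (1.0 - z[1:])  # Delta_j, back-propagated

        Wo += lam * np.outer(delta_out, z)                    # w_lk -> w_lk + lambda Delta_l z_k
        Wh += lam * np.outer(delta_hid, xa)                   # w_ji -> w_ji + lambda Delta_j x_i

print(total_error(Wh, Wo, X, D))               # error after training (typically much smaller)
```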
It must be noted at the outset that there are no known general convergence
theorems for the back-propagation algorithm (unlike the LMS algorithm dis-
cussed earlier). This is a game of trial and error. Nonetheless, the method
has been widely applied and it could be argued that its study in the mid-
eighties led to the recent resurgence of interest (and funding) of artificial
neural networks. Some of the following remarks are various cook-book com-
ments based on experiment with the algorithm.
Remarks 5.1.
4. There may well be problems with local minima. The sequence of chang-
ing weights could converge towards a local minimum rather than a global
one. However, one might hope that such a local minimum is acceptably
close to the global minimum so that it would not really matter, in prac-
tice, should the system get stuck in one.
5. If the weights are large in magnitude then the sigmoid ϕ(v) is near
saturation (close to its maximum or minimum) so that its derivative will
be small. Therefore changes to the weights according to the algorithm
will also be correspondingly small and so learning will be slow. It is
therefore a good idea to start with weights randomized in a small band
around the midpoint of the sigmoid (where its slope is near a maximum).
For the usual sigmoid ϕ(v) = 1/(1 + exp(−v)), this is at the value v = 0.
[Figure: training data (plotted as + and ∘) with a smooth "desired curve" and a contorted "overtrained" curve that follows the training data too closely.]
ϕ′(v) = e^{−v}/(1 + e^{−v})² = ϕ(v) e^{−v}/(1 + e^{−v}) = ϕ(v) ( 1 − 1/(1 + e^{−v}) ) = ϕ(v) (1 − ϕ(v)).
10. We could use a different error function if we wish. All that is required
by the logic of the procedure is that it have a (global) minimum when
the actual output and desired output are equal. The only effect on the
calculations would be to replace the terms −2(dr − yr ) by the more
general expression ∂E/∂yr . This will change the formula for the “output
unit ∆ s”, ∆ℓ , but once this has been done the formulae for the remaining
weight updates are unchanged.
∆w_ji(t + 1) = −λ ∂E(t)/∂w_ji + α ∆w_ji(t),
where ∆wji (t) is the previous weight increment and α is called the momen-
tum parameter (chosen so that 0 ≤ α ≤ 1, but often taken to be 0.9). To
see how such a term may prove to be beneficial, suppose that the weights
∆w_ji(t) = −λ ∂E/∂w_ji + α ∆w_ji(t − 1)
         = −λ ∂E/∂w_ji + α ( −λ ∂E/∂w_ji + α ∆w_ji(t − 2) )
         = −λ (∂E/∂w_ji) ( 1 + α + α² + · · · + α^{t−2} ) + α^{t−1} ∆w_ji(1).

We see that

∆w_ji(t) → −( λ/(1 − α) ) ∂E/∂w_ji,
∆w_ji(t + 2) = −λ ∂E(t + 1)/∂w_ji + α ∆w_ji(t + 1)
             = −λ ∂E(t + 1)/∂w_ji − αλ ∂E(t)/∂w_ji + α² ∆w_ji(t)
             = −λ ( ∂E(t + 1)/∂w_ji + α ∂E(t)/∂w_ji ) + α² ∆w_ji(t).
We get

∆w_ji(t + 1) = −λ ∂E_t/∂w_ji
             = −λ ∂E^(t)/∂w_ji − λ ∂/∂w_ji ( E^(1)(w(t)) + · · · + E^(t−1)(w(t)) ).

The second term on the right hand side above is like ∂E_{t−1}/∂w_ji = ∆w_ji(t), a momentum contribution, except that it is evaluated at step t not t − 1.
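As a sketch (mine) of how a momentum term enters a generic weight update:

```python
import numpy as np

def momentum_step(w, grad, prev_dw, lam=0.1, alpha=0.9):
    """One update with momentum: dw(t+1) = -lam * dE/dw + alpha * dw(t)."""
    dw = -lam * grad + alpha * prev_dw
    return w + dw, dw

# toy quadratic error E(w) = 0.5 * w . w, so grad E = w
w = np.array([2.0, -1.5])
dw = np.zeros_like(w)
for t in range(300):
    w, dw = momentum_step(w, grad=w, prev_dw=dw)
print(np.round(w, 4))   # converges towards the minimum at 0
```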
Function approximation
We have thought of a multilayer feedforward neural network as a network
which learns input-output pattern pairs. Suppose such a network has n units
in the input layer and m in the output layer. Then any given function f ,
say, of n variables and with values in Rm determines input-output pattern
pairs by the obvious pairing (x, f (x)). One can therefore consider trying to
train a network to learn a given function and so it is of interest to know if
and in what sense this can be achieved.
It turns out that there is a theorem of Kolmogorov, extended by Sprecher,
on the representation of continuous functions of many variables in terms of
linear combinations of a continuous function of linear combinations of func-
tions of one variable.
Theorem 5.2 (Kolmogorov). For any continuous function f : [0, 1]^n → R (on the n-dimensional unit cube), there are continuous functions h_1, . . . , h_{2n+1} on R and continuous monotonic increasing functions g_ij, for 1 ≤ i ≤ n and 1 ≤ j ≤ 2n + 1, such that

f(x_1, . . . , x_n) = Σ_{j=1}^{2n+1} h_j ( Σ_{i=1}^n g_ij(x_i) ).
[Figure: a two-input network realizing a Kolmogorov-type representation of f(x_1, x_2): units computing functions g_1, . . . , g_5 of the inputs feed, via weights λ_1, λ_2, units computing h, and the outputs of the h-units are summed to give f(x_1, x_2).]
However, the precise form of the various activation functions and the
values of the weights are unknown. That is to say, the theorem provides an
(important) existence statement, rather than an explicit construction.
If we are prepared to relax the requirement that the representation be
exact, then one can be more explicit as to the nature of the activation
functions. That is, there are results to the effect that any f as above can
be approximated, in various senses, to within any preassigned degree of
accuracy, by a suitable feedforward neural network with certain specified
activation functions. However, one usually does not know how many neurons
are required. Usually, the more accurate the approximation is required to
be, so will the number of units needed increase. We shall consider such
function approximation results (E. K. Blum and L. L. Li, Neural Networks,
4 (1991), 511–515, see also K. Hornik, M. Stinchcombe and H. White, Neural
Networks, 2 (1989), 359–366). The following example will illustrate the
ideas.
[Figure: a one-input network: threshold units with thresholds b_0, b_1, . . . , b_n feed a summation unit via the weights f_0, f_1 − f_0, . . . , f_n − f_{n−1}, producing a step-function approximation to f(x).]
Remark 5.6. To say that A is an algebra simply means that if f and g belong
to A, then so do αf + g and f g, for any α ∈ R. A separates points of K
means that given any points x, x′ ∈ K, with x 6= x′ , then there is some
f ∈ A such that f (x) 6= f (x′ ). In other words, A is sufficiently rich to be
able to distinguish different points in K.
We emphasize that this is not Fourier analysis—the amn s are not any
kind of Fourier coefficients.
Theorem 5.8. Let f : [0, π]2 → R be continuous. For any given ε > 0, there
is a 3-layer feedforward neural network with McCulloch-Pitts neurons in the
hidden layer and a linear output unit which implements the function f on
the square [0, π]2 to within ε.
and also note that |mx ± ny| ≤ 2N π for any (x, y) ∈ [0, π]2 .
We can approximate ½ cos t on [−2Nπ, 2Nπ] by a simple function, γ(t), say, with

| ½ cos t − γ(t) | < ε / ( 4(N + 1)² K )
[Figure: a network implementing the approximation: the inputs x and y are combined with weights ±m, ±n and thresholds θ_j in a layer of threshold units, whose outputs, weighted by the coefficients a_mn w_j, are summed by a linear output unit.]
Remark 5.9. The particular square [0, π]2 is not crucial—one can obtain a
similar result, in fact, on any bounded region of R2 by rescaling the variables.
Also, there is an analogous result in any number of dimensions (products of
cosines in the several variables can always be rewritten as cosines of sums).
The theorem gives an approximation in the uniform sense. There are various
other results which, for example, approximate f in the mean square sense.
It is also worth noting that with a little more work, one can see that sigmoid units could be used instead of the threshold units. One would take γ above to be a sum of "smoothed" step functions, rather than a sum of step functions.
A problem with the above approach is with the difficulty in actually
finding the values of N and the amn s in any particular concrete situation. By
admitting an extra layer of neurons, a somewhat more direct approximation
can be made. We shall illustrate the ideas in two dimensions.
Proposition 5.10. Let f : R2 → R be continuous. Then f is uniformly
continuous on any square S ⊂ R2 .
Proof. Suppose that S is the square S = [−R, R] × [−R, R] and let ε > 0
be given. We must show that there is some δ > 0 such that |x − x′ | < δ and
x, x′ ∈ S imply that
|f (x) − f (x′ )| < ε.
Suppose that this is not true. Then, no matter what δ > 0 is, there will be
points x, x′ ∈ S such that |x − x′ | < δ but |f (x) − f (x′ )| ≥ ε. In particular,
if, for any n ∈ N, we take δ = 1/n, then there will be points xn and x′n in S
with |xn − x′n | < 1/n but |f (xn ) − f (x′n )| ≥ ε.
Now, (xn ) is a sequence in the compact set S and so has a convergent
subsequence, xnk → x as k → ∞, with x ∈ S. But then
|x′_{n_k} − x| ≤ |x′_{n_k} − x_{n_k}| + |x_{n_k} − x| < 1/n_k + |x_{n_k} − x| → 0
as k → ∞, i.e., x′nk → x, as k → ∞.
By the continuity of f , it follows that
|f (xnk ) − f (x′nk )| → |f (x) − f (x)| = 0
which contradicts the inequality |f (xnk ) − f (x′nk )| ≥ ε and the proof is
complete.
Theorem 5.11. Let f : R2 → R be continuous. Then for any R > 0 and any
ε > 0 there is g ∈ S such that
[Figure: the square divided into small rectangles B_1, B_2, . . . , B_n.]
Choose n so large that each Bm has diagonal smaller than δ and let xm
be the centre of Bm . Then |x − xm | < δ for any x ∈ Bm .
g(x) = Σ_{m=1}^n g_m χ_{B_m}(x).
H(s) = 1 if s ≥ 0 and 0 if s < 0,   and   H_0(s) = 1 if s > 0 and 0 if s ≤ 0.
Thus, H(·) is just the function step(·) used already. H_0 differs from H only in its value at 0—it is 0 there. The purpose of introducing these different versions of the step-function will soon become apparent. Indeed, H_0(s − a) = χ_{(a,∞)}(s) and H(−s + b) = χ_{(−∞,b]}(s), so that

χ_{(a,b]}(s) = H_0(s − a) H(−s + b).

This means that we can implement the function χ_{(a,b]} by the network shown.
[Figure: a network implementing χ_{(a,b]}: the input x feeds an H_0-unit with threshold a (weight 1) and an H-unit with threshold −b (weight −1); both outputs feed, with weight 1 each, an H-unit with threshold 3/2, which fires exactly when a < x ≤ b.]
Thus

χ_{B_j}(x_1, x_2) = χ_{(a_1^j, b_1^j]}(x_1) χ_{(a_2^j, b_2^j]}(x_2)
[Figure: a network implementing χ_{B_j}(x_1, x_2): each of x_1 and x_2 feeds an H_0-unit (thresholds a_1^j, a_2^j) and an H-unit (thresholds −b_1^j, −b_2^j, with input weight −1); the four outputs feed, with weight 1 each, an H-unit with threshold 7/2, which fires exactly when (x_1, x_2) ∈ B_j.]
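A direct check of this construction (my own sketch):

```python
H  = lambda s: 1.0 if s >= 0 else 0.0     # H(s)
H0 = lambda s: 1.0 if s > 0 else 0.0      # H0(s)

def chi_interval(x, a, b):
    """chi_(a,b](x) realized as a small threshold network."""
    u1 = H0(x - a)            # fires iff x > a
    u2 = H(-x + b)            # fires iff x <= b
    return H(u1 + u2 - 1.5)   # AND: threshold 3/2

a, b = 0.0, 1.0
for x in (-0.5, 0.0, 0.5, 1.0, 1.5):
    print(x, chi_interval(x, a, b))       # 1 exactly when 0 < x <= 1
```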
Proof. We have done all the preparation. Now we just piece together the
various parts. Let g, B1 , . . . , Bn etc. be an approximation to f (to within ε)
as given above. We shall implement g using the stated network. Each χBj
is implemented by a 3-layer network, as above. The outputs of these are
weighted by the corresponding gj and then fed into a linear (summation)
unit.
[Figure: the complete network. The inputs x_1 and x_2 feed the layer of 4n threshold units (for each j, an H0 unit with threshold a^j_1 and an H unit with threshold −b^j_1 for x_1, and likewise a^j_2 and −b^j_2 for x_2); these feed n threshold units with θ = 7/2, one computing each χ_{B_j}; the outputs, weighted by g_1, . . . , g_n, are summed by a linear output unit whose value approximates f(x_1, x_2).]
The first layer consists of place holders for the input, as usual. The
second layer consists of 4n threshold units (4 for each rectangle Bj , j =
1, . . . , n). The third layer consists of n threshold units, required to complete
the implementation of the n various χBj s. The final output layer consists of
a single linear unit.
For any given input (x_1, x_2) from the rectangle (−R, R) × (−R, R), precisely one of the units in the third layer will fire—this will be the j-th, corresponding to the unique j with (x_1, x_2) ∈ B_j. The system output is therefore equal to Σ_{i=1}^{n} g_i χ_{B_i}(x_1, x_2) = g_j = g(x_1, x_2), as required.
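The whole construction can be simulated directly. The following Python sketch is our own illustration: the function f, the value of R and the fixed number of boxes are arbitrary choices, and we simply take a fine grid of boxes rather than deriving the box size from a modulus of continuity.

# Sketch: the 4-layer network of the proof, for an illustrative f and R.
# The boxes B_j are products of half-open intervals (a,b] covering (-R, R]^2.
import numpy as np

H  = lambda s: 1.0 if s >= 0 else 0.0
H0 = lambda s: 1.0 if s >  0 else 0.0

def chi_box(x1, x2, a1, b1, a2, b2):
    units = [H0(x1 - a1), H(-x1 + b1), H0(x2 - a2), H(-x2 + b2)]
    return H(sum(units) - 3.5)

f = lambda x1, x2: np.sin(x1) * np.cos(2 * x2)   # illustrative continuous f
R, k = 2.0, 30                                   # k x k boxes (chosen by hand)
edges = np.linspace(-R, R, k + 1)

boxes, g_vals = [], []
for i in range(k):
    for j in range(k):
        a1, b1, a2, b2 = edges[i], edges[i + 1], edges[j], edges[j + 1]
        boxes.append((a1, b1, a2, b2))
        g_vals.append(f((a1 + b1) / 2, (a2 + b2) / 2))   # value at the box centre

def g(x1, x2):   # the network output: a linear unit summing g_j * chi_Bj
    return sum(gj * chi_box(x1, x2, *B) for gj, B in zip(g_vals, boxes))

rng = np.random.default_rng(0)
pts = rng.uniform(-R + 1e-6, R, size=(100, 2))
print("max |f - g| on sample:", max(abs(f(x1, x2) - g(x1, x2)) for x1, x2 in pts))

The maximum deviation observed is governed by how much f can vary across a single box, exactly as in the uniform-continuity argument above.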
Chapter 6
Radial Basis Functions
x^{(i)} ⇝ y^{(i)},   i = 1, . . . , p.

h(x^{(i)}) = y^{(i)},   i = 1, . . . , p,
but which should also “predict” values when applied to new but “similar”
input data. This means that we are not interested in finding a mapping h
which works precisely for just these x(1) , . . . , x(p) . We would like a mapping
h which will “generalize”.
We will look at functions of the form φ(‖x − x^{(i)}‖), i.e., functions of the distance between x and the prototype input x^{(i)}—so-called basis functions. Thus, we try

h(x) = Σ_{i=1}^{p} w_i φ(‖x − x^{(i)}‖)

and require y^{(j)} = h(x^{(j)}) = Σ_{i=1}^{p} w_i φ(‖x^{(j)} − x^{(i)}‖). If we set A = (A_{ji}) = (φ(‖x^{(j)} − x^{(i)}‖)) ∈ R^{p×p}, then our requirement is that

y^{(j)} = Σ_{i=1}^{p} A_{ji} w_i = (Aw)_j,
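In other words, the weights solve the p × p linear system Aw = y, provided A is invertible (which holds, for example, for the Gaussian choice of φ with distinct prototypes). A minimal Python sketch of this interpolation step follows; the Gaussian φ, the value of σ and the data are our own illustrative choices—the notes only require some radial function φ(‖x − x^{(i)}‖).

# Sketch: exact RBF interpolation on p prototype pairs (x^(i), y^(i)).
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 3
X = rng.normal(size=(p, n))          # prototype inputs x^(1), ..., x^(p)
y = rng.normal(size=p)               # target outputs y^(1), ..., y^(p)

sigma = 1.0
phi = lambda r: np.exp(-r**2 / (2 * sigma**2))   # a Gaussian basis function

# A_ji = phi(||x^(j) - x^(i)||)
A = phi(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
w = np.linalg.solve(A, y)            # weights w_1, ..., w_p

h = lambda x: phi(np.linalg.norm(x - X, axis=1)) @ w
print(max(abs(h(X[j]) - y[j]) for j in range(p)))   # ~ 0: perfect recall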
h(x^{(j)}) = Σ_{i=1}^{p} w_i φ(‖x^{(j)} − x^{(i)}‖)
           = w_j φ(‖x^{(j)} − x^{(j)}‖) + Σ_{i≠j} w_i φ(‖x^{(j)} − x^{(i)}‖)   (the first factor is φ(0) = 1)
           = w_j + ε
where ε is small provided that the prototypes x(j) are reasonably well-
separated. The various basis functions therefore “pick out” input patterns
in the vicinity of specified spatial locations.
To construct a mapping h from Rn to Rm , we simply construct suitable
combinations of basis functions for each of the m components of the vector-
valued map h = (h1 , . . . , hm ). Thus,
h_k(x) = Σ_{i=1}^{p} w_{ki} φ(‖x − x^{(i)}‖).

This leads to

y_k^{(j)} = h_k(x^{(j)}) = Σ_{i=1}^{p} w_{ki} φ(‖x^{(j)} − x^{(i)}‖)   (and φ(‖x^{(j)} − x^{(i)}‖) = A_{ji} = A_{ij})
          = Σ_{i=1}^{p} w_{ki} A_{ij}
          = (W A)_{kj}
• The basis functions are allowed to have different widths (σ)—these may
also be determined by the training data.
x ↦ y_k(x) = Σ_{j=1}^{M} w_{kj} φ_j(x) + w_{k0},   k = 1, 2, . . . , m,
[Figure: the radial basis function network. The inputs x_1, . . . , x_n feed basis-function units φ_1, . . . , φ_M, together with a bias unit φ_0 = 1; their outputs, weighted by the w_{kj}, are summed by linear output units giving y_1, . . . , y_m.]
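A forward pass through such a network takes only a few lines. The sketch below assumes Gaussian basis functions φ_j(x) = exp(−‖x − μ_j‖² / 2σ_j²) with centres μ_j and widths σ_j; the shapes and parameter values are arbitrary illustrative choices.

# Sketch of the RBF network above: n inputs, M basis functions, m outputs.
# Gaussian basis functions with centres mu_j and widths sigma_j are assumed.
import numpy as np

rng = np.random.default_rng(1)
n, M, m = 4, 6, 2
mu = rng.normal(size=(M, n))         # centres mu_1, ..., mu_M
sigma = np.full(M, 0.8)              # widths sigma_1, ..., sigma_M
W = rng.normal(size=(m, M))          # output weights w_kj
w0 = rng.normal(size=m)              # biases w_k0 (the phi_0 = 1 unit)

def phi(x):                          # the basis-function layer
    r2 = np.sum((x - mu)**2, axis=1)
    return np.exp(-r2 / (2 * sigma**2))

def network(x):                      # y_k(x) = sum_j w_kj phi_j(x) + w_k0
    return W @ phi(x) + w0

print(network(rng.normal(size=n)))   # a vector in R^m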
Training
To train such a network, the layers are treated quite differently. The second
layer is subjected to unsupervised training—the input data being used to
assign values to µj and σj . These are then held fixed and the weights to the
output layer are found in the second phase of training.
Once a suitable set of values for the µj s and σj s has been found, we seek
suitable wkj s. These are given by
y_k(x) = Σ_{j=0}^{M} w_{kj} φ_j(x)   (with φ_0 ≡ 1)
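Since the model is linear in the w_{kj} once the φ_j are fixed, this second phase is an ordinary linear least-squares problem. A minimal sketch follows, with made-up training data, Gaussian φ_j with a common width, and numpy's lstsq as the solver—all our own choices, since the notes do not prescribe a particular solver here.

# Sketch: phase two of training.  The centres mu_j and widths sigma_j are
# held fixed; the output weights w_kj (and biases w_k0) are found by
# least squares on the training pairs (x^(i), y^(i)).
import numpy as np

rng = np.random.default_rng(2)
N, n, M, m = 100, 3, 8, 2
X = rng.normal(size=(N, n))              # training inputs
Y = rng.normal(size=(N, m))              # training targets
mu = X[rng.choice(N, M, replace=False)]  # fixed centres (e.g. from clustering)
sigma = 1.0                              # fixed common width, for simplicity

def design(X):                           # Phi_ij = phi_j(x^(i)), with phi_0 = 1
    r2 = np.sum((X[:, None, :] - mu[None, :, :])**2, axis=2)
    Phi = np.exp(-r2 / (2 * sigma**2))
    return np.hstack([np.ones((len(X), 1)), Phi])   # N x (M+1)

Phi = design(X)
W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)          # (M+1) x m weight matrix
print("training RMS error:", np.sqrt(np.mean((Phi @ W - Y)**2)))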
• Take the first k data points as k cluster centres (giving clusters each with
one member).
• Assign each of the remaining N − k data points one by one to the cluster
with the nearest centroid. After each assignment, recompute the centroid
of the gaining cluster.
• Select each data point in turn and compute the distances to all cluster
centroids. If the nearest centroid is not that particular data point’s
parent cluster, then reassign the data point (to the cluster with the
nearest centroid) and recompute the centroids of the losing and gaining
clusters.
• Repeat the above step until convergence—that is, until a full cycle through all data points in the training set fails to trigger any further cluster membership reallocations. (A sketch of this procedure in code is given below.)
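A direct transcription of these steps might look as follows; the data and the value of k are arbitrary, and ties and the possibility of emptying a cluster are handled in the simplest way.

# Sketch of the clustering procedure described above (a sequential variant
# of k-means).  The data and k are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(60, 2))
k = 4

centres = [data[i].copy() for i in range(k)]      # first k points as centres
members = [[i] for i in range(k)]                 # cluster membership lists

def nearest(x):
    return int(np.argmin([np.linalg.norm(x - c) for c in centres]))

def recompute(c):
    centres[c] = data[members[c]].mean(axis=0)

# assign the remaining points one by one, updating the gaining centroid
for i in range(k, len(data)):
    c = nearest(data[i])
    members[c].append(i)
    recompute(c)

# reassignment passes, until a full cycle produces no change
changed = True
while changed:
    changed = False
    for i in range(len(data)):
        old = next(c for c in range(k) if i in members[c])
        new = nearest(data[i])
        if new != old and len(members[old]) > 1:   # keep clusters non-empty (a simplification)
            members[old].remove(i)
            members[new].append(i)
            recompute(old)
            recompute(new)
            changed = True

print("cluster sizes:", [len(m) for m in members])

The resulting centres (and, for example, the spread of each cluster) can then be used to set the μ_j and σ_j of the basis functions.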
for suitable c, λ and ν. It follows that the linear span S of 1 and Gaussian
functions of the form ϕ(x, µ, σ) is, in fact, an algebra. That is, if S denotes
the collection of functions of the form
w0 + w1 ϕ(x, µ1 , σ1 ) + · · · + wj ϕ(x, µj , σj )
[Figure: a network with inputs x_1, . . . , x_n, units φ_0 = 1, φ_1, . . . , φ_N, and a single linear output unit with weights w_0, w_1, . . . , w_N.]
Chapter 7
Recurrent Neural Networks
E(x) = −½ xᵀ W x + xᵀ θ
     = −½ Σ_{i,j=1}^{n} x_i w_{ij} x_j + Σ_{i=1}^{n} x_i θ_i
Theorem 7.1. Suppose that the synaptic weight matrix (wij ) is symmetric
with non-negative diagonal terms. Then the sequential mode of operation of
the recurrent neural network has no cycles.
Proof. Consider a single update x_i ↦ x′_i, with all the other x_j remaining unchanged. Then we calculate the energy difference

E(x′) − E(x) = −½ Σ_{j≠i} (x′_i w_{ij} x_j + x_j w_{ji} x′_i) − ½ x′_i w_{ii} x′_i + x′_i θ_i
               + ½ Σ_{j≠i} (x_i w_{ij} x_j + x_j w_{ji} x_i) + ½ x_i w_{ii} x_i − x_i θ_i

             = −Σ_{j≠i} x′_i w_{ij} x_j − ½ x′_i w_{ii} x′_i + x′_i θ_i
               + Σ_{j≠i} x_i w_{ij} x_j + ½ x_i w_{ii} x_i − x_i θ_i

             = −Σ_{j=1}^{n} x′_i w_{ij} x_j + x′_i w_{ii} x_i − ½ x′_i w_{ii} x′_i + x′_i θ_i
               + Σ_{j=1}^{n} x_i w_{ij} x_j − x_i w_{ii} x_i + ½ x_i w_{ii} x_i − x_i θ_i

             = (x_i − x′_i) Σ_{j=1}^{n} w_{ij} x_j + ½ w_{ii} (2 x′_i x_i − x′_i x′_i − x_i x_i) + (x′_i − x_i) θ_i

             = −(x′_i − x_i) Σ_{j=1}^{n} w_{ij} x_j − ½ w_{ii} (x′_i − x_i)² + (x′_i − x_i) θ_i

             = −½ w_{ii} (x′_i − x_i)² − (x′_i − x_i) ( Σ_{j=1}^{n} w_{ij} x_j − θ_i ).
Proof. The state space is finite (and there are only a finite number of neurons
to be updated), so if there are no cycles the iteration process must simply
stop, i.e., the system reaches a fixed point.
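These two facts—the energy never increases under a sequential update, and the dynamics therefore halts at a fixed point—are easy to observe numerically. A minimal sketch, with random symmetric weights having non-negative diagonal and random thresholds (all illustrative choices of our own):

# Sketch: sequential (asynchronous) dynamics never increase the energy
# E(x) = -1/2 x^T W x + x^T theta, and the network reaches a fixed point.
# W, theta and the initial state are random illustrative choices.
import numpy as np

rng = np.random.default_rng(4)
n = 12
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                          # symmetric ...
np.fill_diagonal(W, np.abs(np.diag(W)))    # ... with non-negative diagonal
theta = rng.normal(size=n)
x = rng.choice([-1.0, 1.0], size=n)

E = lambda x: -0.5 * x @ W @ x + x @ theta

changed = True
while changed:
    changed = False
    for i in range(n):                     # one full sequential sweep
        s = W[i] @ x - theta[i]
        new = x[i] if s == 0 else np.sign(s)   # no change on a zero argument
        if new != x[i]:
            x[i] = new
            changed = True
            print(f"flip unit {i}: E = {E(x):.4f}")
print("fixed point reached:", x.astype(int))

The printed energies form a non-increasing sequence, as the calculation in the proof of Theorem 7.1 predicts.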
With initial state x(0) = z, one update gives

x_i(1) = sign( Σ_{k=1}^{n} w_{ik} x_k(0) ) = sign( Σ_{k=1}^{n} w_{ik} z_k ).

For z to be a fixed point, we must have z_i = sign( Σ_{k=1}^{n} w_{ik} z_k ) (or possibly Σ_{k=1}^{n} w_{ik} z_k = 0). This requires that

Σ_{k=1}^{n} w_{ik} z_k = α z_i

for some α ≥ 0. Guided by our experience with the adaptive linear combiner (ALC) network, we try w_{ik} = α z_i z_k, that is, we choose the synaptic matrix to be (w_{ik}) = α z zᵀ, the outer product (or Hebb rule). We calculate

Σ_{k=1}^{n} w_{ik} z_k = α z_i Σ_{k=1}^{n} z_k z_k = α n z_i

since z_k² = 1. It follows that the choice (w_{ik}) = α z zᵀ, α ≥ 0, does indeed give z as a fixed point.
Next, we consider what happens if the system starts out not exactly in the state z, but in some perturbation of z, say ẑ. We can think of ẑ as z but with some of its bits flipped. Taking x(0) = ẑ, we get

x_i(1) = sign( Σ_{k=1}^{n} w_{ik} ẑ_k ).

The choice α = 0 would give w_{ik} = 0 and so x_i(1) = ẑ_i and ẑ (and indeed any state) would be a fixed point. The dynamics would be trivial—nothing moves. This is no good for us—we want states ẑ sufficiently close to z to “evolve” into z. Incidentally, the same remark applies to the choice of synaptic matrix (w_{ik}) = 1l_n, the unit n × n matrix. For such a choice, it is clear that all vectors are fixed points. We want the stored patterns to act as “attractors”, each with a non-trivial “basin of attraction”. Substituting w_{ik} = α z_i z_k, we find

x_i(1) = sign( α z_i Σ_{k=1}^{n} z_k ẑ_k )
       = sign( z_i Σ_{k=1}^{n} z_k ẑ_k ),   since α > 0,
       = sign( z_i (n − 2ρ(z, ẑ)) ),

where ρ(z, ẑ) is the Hamming distance between z and ẑ, i.e., the number of differing bits. It follows that if ρ(z, ẑ) < n/2, then sign( z_i (n − 2ρ(z, ẑ)) ) = sign(z_i) = z_i, so that x(1) = z. In other words, any vector ẑ with ρ(z, ẑ) < n/2 is mapped onto the fixed point z directly, in one iteration cycle. (Whenever a component of ẑ is updated, either it agrees with the corresponding component of z and so remains unchanged, or it differs and is then mapped immediately onto the corresponding component of z. Each flip moves ẑ towards z.) We have therefore shown that the basin of direct attraction of z contains the disc {ẑ : ρ(z, ẑ) < n/2}.
Can we say anything about the situation when ρ(z, ẑ) > n/2? Evidently, sign( z_i (n − 2ρ(z, ẑ)) ) = sign(−z_i) = −z_i, and so x(1) = −z. Is −z a fixed point? According to our earlier discussion, the state −z is a fixed point if we take the synaptic matrix to be given by (−z)(−z)ᵀ. But (−z)(−z)ᵀ = z zᵀ and we conclude that −z is a fixed point for the network above. (Alternatively, we could simply check this using the definition of the dynamics.)
We can also use this to see that ẑ is directly attracted to −z whenever ρ(z, ẑ) > n/2. Indeed, ρ(z, ẑ) > n/2 implies that ρ(−z, ẑ) < n/2 and so, arguing as before, we deduce that ẑ is directly mapped onto the fixed point −z. If ρ(z, ẑ) = n/2, then there is no change, by definition of the dynamics.
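This basin-of-attraction behaviour is easy to check experimentally. A minimal sketch (the value of n and the numbers of flipped bits are arbitrary choices; n is taken odd so that the borderline case ρ = n/2 never occurs):

# Sketch: with W = z z^T (Hebb rule, alpha = 1), any state within Hamming
# distance < n/2 of z is mapped onto z in one synchronous update, and any
# state at distance > n/2 is mapped onto -z.
import numpy as np

rng = np.random.default_rng(5)
n = 101
z = rng.choice([-1, 1], size=n)
W = np.outer(z, z)

def update(x):                            # x(1) = sign(W x(0))
    return np.sign(W @ x)

for flips in (10, 50, 60, 90):            # flips = rho(z, z_hat)
    z_hat = z.copy()
    idx = rng.choice(n, size=flips, replace=False)
    z_hat[idx] *= -1
    x1 = update(z_hat)
    result = "z" if np.array_equal(x1, z) else ("-z" if np.array_equal(x1, -z) else "?")
    print(f"rho(z, z_hat) = {flips:3d}  ->  one update gives {result}")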
How can we store many patterns? We shall try the Hebbian rule,

W ≡ (w_{ik}) = Σ_{μ=1}^{p} z^{(μ)} z^{(μ)ᵀ}.

Presenting a state z as input, one update gives

z_i(1) = sign( (W z)_i ) = sign( Σ_{k=1}^{n} w_{ik} z_k )
       = sign( Σ_{k=1}^{n} Σ_{μ=1}^{p} z_i^{(μ)} z_k^{(μ)} z_k ).
It follows that each exemplar z^{(1)}, . . . , z^{(p)} is a fixed point if they are pairwise orthogonal. In general (not necessarily pairwise orthogonal), we have

Σ_{k=1}^{n} w_{ik} z_k^{(j)} = Σ_{μ=1}^{p} Σ_{k=1}^{n} z_i^{(μ)} z_k^{(μ)} z_k^{(j)}
    = Σ_{k=1}^{n} z_k^{(j)} z_i^{(j)} z_k^{(j)} + Σ_{μ≠j} Σ_{k=1}^{n} z_i^{(μ)} z_k^{(μ)} z_k^{(j)}
    = n ( z_i^{(j)} + (1/n) Σ_{μ≠j} z_i^{(μ)} Σ_{k=1}^{n} z_k^{(μ)} z_k^{(j)} ).        (∗)
The right hand side has the same sign as z_i^{(j)} whenever the first term is dominant. To get a rough estimate of the situation, let us suppose that the prototype vectors are chosen at random—this gives an “average” indication of what we might expect to happen (a “typical” situation). Then the second term in the brackets on the right hand side of (∗) involves a sum of n(p − 1) terms which can take on the values ±1 with equal probability. By the Central Limit Theorem, the term

(1/n) Σ_{μ≠j} z_i^{(μ)} Σ_{k=1}^{n} z_k^{(μ)} z_k^{(j)}
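The size of this “crosstalk” term, and its effect on recall, is easy to examine numerically. A minimal sketch (random ±1 patterns; the values of n and p are arbitrary illustrative choices) measures, for each stored pattern, the fraction of components whose sign is still reproduced after one update—perfect recall corresponds to a fraction of 1.

# Sketch: store p random bipolar patterns with the Hebb rule
# W = sum_mu z^(mu) z^(mu)^T and check how many bits of each stored
# pattern survive one update; the crosstalk term in (*) is what spoils
# perfect recall as p grows.
import numpy as np

rng = np.random.default_rng(6)
n = 200
for p in (5, 20, 60):
    Z = rng.choice([-1, 1], size=(p, n))          # patterns z^(1), ..., z^(p)
    W = Z.T @ Z                                    # sum of outer products
    stable = np.mean(np.sign(W @ Z.T) == Z.T)      # fraction of correct bits
    print(f"n = {n}, p = {p}: fraction of stable bits = {stable:.3f}")

As p grows relative to n, the crosstalk term increasingly often overturns the sign of the first term, and the stored patterns cease to be fixed points.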
x_i(t + 1) = sign( Σ_{k=1}^{n} w_{ik} x_k(t) − θ_i ).
Proof. The method uses an energy function, but now depending on the state
of the network at two consecutive time steps. We define
G(t) = − Σ_{i,j=1}^{n} x_i(t) w_{ij} x_j(t − 1) + Σ_{i=1}^{n} ( x_i(t) + x_i(t − 1) ) θ_i
     = −xᵀ(t) W x(t − 1) + ( xᵀ(t) + xᵀ(t − 1) ) θ.

Hence

G(t + 1) − G(t) = −xᵀ(t + 1) W x(t) + ( xᵀ(t + 1) + xᵀ(t) ) θ
                    + xᵀ(t) W x(t − 1) − ( xᵀ(t) + xᵀ(t − 1) ) θ
                = ( xᵀ(t − 1) − xᵀ(t + 1) ) W x(t) + ( xᵀ(t + 1) − xᵀ(t − 1) ) θ,   using W = Wᵀ,
                = −( xᵀ(t + 1) − xᵀ(t − 1) ) ( W x(t) − θ )
                = − Σ_{i=1}^{n} ( x_i(t + 1) − x_i(t − 1) ) ( Σ_{k=1}^{n} w_{ik} x_k(t) − θ_i ).

By the update rule, x_i(t + 1) has the same sign as Σ_{k=1}^{n} w_{ik} x_k(t) − θ_i, so each term of this last sum is non-negative, and is strictly positive whenever x_i(t + 1) ≠ x_i(t − 1). Hence, if x(t + 1) ≠ x(t − 1), we conclude that G(t + 1) < G(t). (We assume here that the threshold function is strict, that is, the weights and threshold are such that x = (x_1, . . . , x_n) ↦ Σ_k w_{ik} x_k − θ_i never vanishes on {−1, 1}ⁿ.) Since the state
[Figure: the nodes 1, 2, 3, . . . , n of the network.]
Let x(0) be the initial configuration of our original network (of n nodes).
Set z(0) = x(0) ⊕ x(0), that is, z_i(0) = x_i(0) = z_{n+i}(0) for 1 ≤ i ≤ n.
We update the larger (doubled) network sequentially in the order node
n+1, . . . , 2n, 1, . . . , n. Since there are no connections within the set of nodes
1, . . . , n and within the set n + 1, . . . , 2n, we see that the outcome is
where x(s) is the state of our original system run in parallel mode. By the
theorem, the larger system reaches a fixed point so that z(t) = z(t + 1) for
all sufficiently large t. Hence
for all sufficiently large t—which means that the original system has a cycle
of length 2 (or a fixed point).
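This behaviour of the parallel (synchronous) mode—settling into a fixed point or a cycle of length 2—can be observed directly. A minimal sketch, with random symmetric weights and thresholds of our own choosing:

# Sketch: synchronous dynamics x(t+1) = sign(W x(t) - theta) with symmetric W
# settles into a fixed point or a cycle of length 2, i.e. x(t+2) = x(t).
import numpy as np

rng = np.random.default_rng(7)
n = 15
A = rng.normal(size=(n, n))
W = (A + A.T) / 2
theta = rng.normal(size=n)

def step(x):
    s = W @ x - theta
    return np.where(s == 0, x, np.sign(s))      # no change on a zero argument

x = rng.choice([-1.0, 1.0], size=n)
t = 0
while not np.array_equal(x, step(step(x))):     # wait until x(t+2) = x(t)
    x = step(x)
    t += 1

kind = "fixed point" if np.array_equal(x, step(x)) else "cycle of length 2"
print(f"reached a {kind} after {t} parallel updates")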
The energy function for the larger system is

Ê(z) = −½ zᵀ Ŵ z + zᵀ θ̂
     = −½ ( x(2t) ⊕ x(2t − 1) )ᵀ Ŵ ( x(2t) ⊕ x(2t − 1) ) + ( x(2t) ⊕ x(2t − 1) )ᵀ θ̂
     = −½ ( x(2t) ⊕ x(2t − 1) )ᵀ ( W x(2t − 1) ⊕ W x(2t) ) + ( x(2t) ⊕ x(2t − 1) )ᵀ θ̂
     = −½ ( x(2t)ᵀ W x(2t − 1) + x(2t − 1)ᵀ W x(2t) ) + x(2t)ᵀ θ + x(2t − 1)ᵀ θ
     = −x(2t)ᵀ W x(2t − 1) + ( x(2t) + x(2t − 1) )ᵀ θ
     = G(2t)
[Figure: two layers of units, labelled X and Y, with connections running between the two layers.]
y_j(t + 1) = sign( Σ_{k=1}^{n} w_{jk}^{XY} x_k(t) )

and then

x_i(t + 1) = sign( Σ_{ℓ=1}^{m} w_{iℓ}^{YX} y_ℓ(t + 1) )
with the usual convention that there is no change if the argument to the
function sign(·) is zero.
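A small simulation of this two-layer scheme (a bidirectional associative memory) is sketched below. We assume here—this is not stated at this point in the notes—that the pairs are stored by the Hebb-type rule W^{XY} = Σ_μ y^{(μ)} x^{(μ)ᵀ} and that W^{YX} = (W^{XY})ᵀ, which is the standard choice; the sizes and the number of stored pairs are arbitrary.

# Sketch: bidirectional updates between an X layer (n units) and a Y layer
# (m units).  Assumed: W_XY = sum_mu y^(mu) x^(mu)^T and W_YX = W_XY^T.
import numpy as np

rng = np.random.default_rng(8)
n, m, p = 40, 25, 3
X = rng.choice([-1, 1], size=(p, n))          # patterns x^(1), ..., x^(p)
Y = rng.choice([-1, 1], size=(p, m))          # associated y^(1), ..., y^(p)
W_XY = Y.T @ X                                 # m x n
W_YX = W_XY.T                                  # n x m

def sgn(s, old):                               # no change on a zero argument
    return np.where(s == 0, old, np.sign(s))

x = X[0].astype(float)
x[:6] *= -1                                    # corrupt a few bits of x^(1)
y = np.zeros(m)
for _ in range(5):                             # alternate the two update steps
    y = sgn(W_XY @ x, y)                       # y_j(t+1) = sign(sum_k w^XY_jk x_k(t))
    x = sgn(W_YX @ y, x)                       # x_i(t+1) = sign(sum_l w^YX_il y_l(t+1))

print("x matches x^(1):", np.array_equal(x, X[0]))
print("y matches y^(1):", np.array_equal(y, Y[0]))

With only a few stored pairs, the corrupted input is usually cleaned up to the stored pair within a couple of passes.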
This system can be regarded as a special case of a recurrent neural
network operating under asynchronous dynamics, as we now show. First, let
z = (x1 , . . . , xn , y1 , . . . , ym ) ∈ {−1, 1}n+m . For i and j in {1, 2, . . . , n + m},
Chapter 8
Singular Value Decomposition
Theorem 8.1 (Singular Value Decomposition). For any given non-zero ma-
trix A ∈ Rm×n , there exist orthogonal matrices U ∈ Rm×m , V ∈ Rn×n and
positive real numbers λ1 ≥ λ2 ≥ · · · ≥ λr > 0, where r = rank A, such that
A = U D Vᵀ
where D ∈ Rm×n has entries Dii = λi , 1 ≤ i ≤ r and all other entries are
zero.
Proof. Suppose that m ≥ n. Then AᵀA ∈ R^{n×n} and AᵀA ≥ 0. Hence there is an orthogonal n × n matrix V such that

AᵀA = V Σ Vᵀ

where Σ ∈ R^{n×n} is the diagonal matrix Σ = diag(μ_1, . . . , μ_n), and μ_1 ≥ μ_2 ≥ · · · ≥ μ_n are the eigenvalues of AᵀA, counted according to multiplicity. If A ≠ 0, then AᵀA ≠ 0 and so has at least one non-zero eigenvalue. Thus, there is 0 < r ≤ n such that μ_1 ≥ μ_2 ≥ · · · ≥ μ_r > μ_{r+1} = · · · = μ_n = 0. Write

Σ = ( Λ²  0 ; 0  0 ),

where Λ = diag(λ_1, . . . , λ_r) with λ_1² = μ_1, . . . , λ_r² = μ_r. Partition V as V = ( V_1  V_2 ), where V_1 ∈ R^{n×r} and V_2 ∈ R^{n×(n−r)}. Since V is
AᵀA = V Σ Vᵀ
    = ( V_1  V_2 ) ( Λ²  0 ; 0  0 ) Vᵀ
    = ( V_1 Λ²  0 ) ( V_1ᵀ ; V_2ᵀ )
    = V_1 Λ² V_1ᵀ.

Hence V_2ᵀ AᵀA V_2 = V_2ᵀ V_1 Λ² V_1ᵀ V_2 = 0, since V_1ᵀ V_2 = 0. But V_2ᵀ AᵀA V_2 = (A V_2)ᵀ (A V_2), and so it follows that A V_2 = 0.
Now, the equality AᵀA = V_1 Λ² V_1ᵀ suggests at first sight that we might hope that A = Λ V_1ᵀ. However, this cannot be correct, in general, since A ∈ R^{m×n}, whereas Λ V_1ᵀ ∈ R^{r×n}, and so the dimensions are incorrect. However, if U ∈ R^{k×r} satisfies Uᵀ U = 1l_r, then V_1 Λ² V_1ᵀ = V_1 Λ Uᵀ U Λ V_1ᵀ and we might hope that A = U Λ V_1ᵀ. We use this idea to define a suitable matrix U. Accordingly, we define
Hence,

A = U ( Λ  0 ; 0  0 ) Vᵀ,

as claimed. Note that the condition m ≥ n means that m ≥ n ≥ r, and so the dimensions of the various matrices are all valid.
If m < n, consider B = Aᵀ instead. Then, by the above argument, we get that

Aᵀ = B = U′ ( Λ′  0 ; 0  0 ) V′ᵀ,

for orthogonal matrices U′ ∈ R^{n×n}, V′ ∈ R^{m×m}, and where Λ′² holds the positive eigenvalues of A Aᵀ. Taking the transpose, we have

A = V′ ( Λ′  0 ; 0  0 ) U′ᵀ.

Finally, we observe that from the given form of the matrix A, it is clear that the dimension of ran A is exactly r, that is, rank A = r.
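In practice the decomposition is computed by library routines. The following sketch (an arbitrary low-rank test matrix of our own choosing) checks the statement of Theorem 8.1 using numpy.linalg.svd, which returns the λ_i as a vector of singular values rather than as the matrix D.

# Sketch: checking Theorem 8.1 numerically with numpy.linalg.svd.
import numpy as np

rng = np.random.default_rng(9)
m, n, r = 6, 4, 3
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # a rank-r matrix

U, s, Vt = np.linalg.svd(A)        # A = U D V^T, with s = (lambda_1 >= lambda_2 >= ...)
D = np.zeros((m, n))
D[:len(s), :len(s)] = np.diag(s)

print("singular values:", np.round(s, 4))
print("U orthogonal:   ", np.allclose(U.T @ U, np.eye(m)))
print("V orthogonal:   ", np.allclose(Vt @ Vt.T, np.eye(n)))
print("A = U D V^T:    ", np.allclose(A, U @ D @ Vt))
print("rank A == r:    ", np.linalg.matrix_rank(A) == r)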
Remark 8.2. From this result, we see that the matrices AᵀA = V ( Λ²  0 ; 0  0 ) Vᵀ and A Aᵀ = U ( Λ²  0 ; 0  0 ) Uᵀ have the same non-zero eigenvalues, counted according to multiplicity. We can also see that
Proof. We just have to check that A#, as given above, really does satisfy the defining conditions of the generalized inverse. We will verify two of the four conditions by way of illustration.
Put X = V H Uᵀ, where H = ( Λ⁻¹  0 ; 0  0 ) ∈ R^{n×m}. Then

A X A = U D Vᵀ V H Uᵀ U D Vᵀ
      = U ( 1l_r  0 ; 0  0 ) ( Λ  0 ; 0  0 ) Vᵀ = A,

where the first block matrix is m × m and the second (which is just D) is m × n. Similarly,

X A = V H Uᵀ U D Vᵀ
    = V ( Λ⁻¹  0 ; 0  0 ) ( Λ  0 ; 0  0 ) Vᵀ   (using Uᵀ U = 1l_m)
    = V ( 1l_r  0 ; 0  0 ) Vᵀ
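The same construction is easily carried out numerically. The sketch below (an arbitrary test matrix of our own choosing) builds A# = V H Uᵀ with H = ( Λ⁻¹ 0 ; 0 0 ) from the computed SVD and checks the four Moore–Penrose conditions (which we take to be the defining conditions referred to above), comparing the result with numpy's built-in pinv.

# Sketch: the generalized (Moore-Penrose) inverse A# = V H U^T built from the
# singular value decomposition, checked against numpy.linalg.pinv.
import numpy as np

rng = np.random.default_rng(10)
m, n, r = 5, 7, 3
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r, with m < n here

U, s, Vt = np.linalg.svd(A)
H = np.zeros((n, m))                      # H = (Lambda^{-1} 0; 0 0) in R^{n x m}
H[:r, :r] = np.diag(1.0 / s[:r])
A_sharp = Vt.T @ H @ U.T                  # A# = V H U^T

checks = {
    "A X A = A": np.allclose(A @ A_sharp @ A, A),
    "X A X = X": np.allclose(A_sharp @ A @ A_sharp, A_sharp),
    "(A X)^T = A X": np.allclose((A @ A_sharp).T, A @ A_sharp),
    "(X A)^T = X A": np.allclose((A_sharp @ A).T, A_sharp @ A),
    "matches np.linalg.pinv": np.allclose(A_sharp, np.linalg.pinv(A)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")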
Proof. We have already seen that the choice X = BA# does indeed minimize
ψ(X). Let A = U DV T be the singular value decomposition of A with the
notation as above—so that, for example, Λ denotes the top left r × r block
of D. We have
where we have partitioned the matrices Y and C into ( Y_1  Y_2 ) and ( C_1  C_2 ), with Y_1, C_1 ∈ R^{ℓ×r},

‖Ŷ‖²_F = ‖C_1 Λ⁻¹‖²_F < ‖C_1 Λ⁻¹‖²_F + ‖Y_2‖²_F = ‖Y′‖²_F.

It follows that among those Y matrices minimizing ψ(X), Ŷ is the one with the least ‖·‖_F-norm. But if Y = X U, then ‖Y‖_F = ‖X‖_F and so X̂ given by X̂ = Ŷ Uᵀ is the X matrix which has the least ‖·‖_F-norm amongst those
= BA#