SIMULATING NEURAL NETWORKS
with
Mathematica
James A. Freeman
Loral Space Information Systems
and
University of Houston-Clear Lake
ADDISON-WESLEY PUBLISHING COMPANY
Reading, Massachusetts • Menlo Park, California
New York • Don Mills, Ontario • Wokingham, England
Amsterdam • Bonn • Sydney • Singapore • Tokyo
Madrid • San Juan • Milan • Paris
Mathematica is not associated with Mathematica Inc., Mathematica Policy Research, Inc.,
or MathTech, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in caps or initial caps.

The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.
Freeman, James A.
Simulating neural networks with Mathematica / James A. Freeman.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-56629-X
1. Neural networks (Computer science) 2. Mathematica (Computer program) I. Title.
QA76.87.F72 1994
006.3-dc20   92-2345   CIP
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher. Printed in
the United States of America.
1 2 3 4 5 6 7 8 9 10-DOC-97 96 95 94 93
Preface
I sat across the dinner table at a restaurant recently with a researcher from
Los Alamos National Laboratory. We were discussing a collaboration
on a neural-network project while eight of us had dinner at a quaint
restaurant in White Rock, just outside of Los Alamos. Fortified by a few
glasses of wine, I took the opportunity to mention that I had just reached
an agreement with my publisher to write a new book on neural networks
with Mathematica. My dinner companion looked over at me and asked a
question: "Why?"
I was somewhat surprised by this response, but reflecting for a moment on the environment in which this person works, I realized that the answer to his question was not necessarily so obvious. This person had access to incredible computational resources: large Crays, and a 64,000-node Connection Machine, for example. Considering the computational demands of most neural networks, why would anyone working with them want to incur the overhead of an interpreted language like Mathematica? To me the answer was obvious, but to him, and possibly to you, the answer may require some explanation.
During the course of preparing the manuscript for an earlier text, Neural Networks: Algorithms, Applications, and Programming Techniques (henceforth Neural Networks), I used Mathematica extensively for a variety of purposes.
of purposes. I simulated the operation of most of the neural networks
described in that text using Mathematica. Mathematica gave me the ability
to confirm nuances of network behavior and performance, as well as to
develop examples and exercises pertaining to the networks. In addition
I used Mathematica's graphics capability to illustrate numerous points
throughout the text. The ease and speed with which I was able to implement a new network spoke highly of using Mathematica as a tool for exploring neural-network technology.
The idea for Simulating Neural Networks with Mathematica grew along
with my conviction that Mathematica had saved me countless hours of
programming time. Even though Mathematica is an interpreted language
(much like the old BASIC) and neural networks are notorious CPU hogs,
there seemed to me much insight to be gained by using Mathematica in
the early stages of network development.
I explained all of these ideas (somewhat vigorously) to my dinner
companion. No doubt feeling a bit of remorse for asking the question in
the first place, he promised to purchase a copy of the book as soon as it
was available. I hope he finds it useful. I have the same wish for you.
This book will introduce you to the subject of neural networks within
the context of the interactive Mathematica environment. There are two
main thrusts of this text: teaching about neural networks and showing
how Mathematica can be used to implement and experiment with neural-
network architectures.
In Neural Networks my coauthor and I stated that you should do
some programming of your own, in a high-level language such as C,
FORTRAN, or Pascal, in order to gain a complete understanding of the
networks you study. I do not wish to retract that philosophy, and this
book is not an attempt to show you how you can avoid eventual translation of your neural networks into executable software. As I stated in
the previous paragraph, most neural networks are computationally quite
intensive. A neural network of any realistic size for an actual application
would likely overwhelm Mathematica. Nevertheless, a researcher can use
Mathematica to experiment with variants of architectures, debug a new
training algorithm, design techniques for the analysis of network performance, and perform many other analyses that would prove far more
time-consuming if done in traditional software or by hand. This book
illustrates many of those techniques.
The book is suitable for a course in neural networks at the upper-level
undergraduate or beginning graduate level in computer science, electrical
engineering, applied mathematics, and related areas. The book is also
suitable for self-study. The best way to study the material presented
here is interactively, executing statements and trying new ideas as you
progress.
This book does not assume expertise with either neural networks or
Mathematica, although I expect that the readers will most likely know
something of Mathematica. If you have read through the first chapter
of Stephen Wolfram's Mathematica book and have spent an hour or two interacting with Mathematica, you will have more than sufficient background for the Mathematica syntax in this book.
I have kept the Mathematica syntax as understandable as possible. It is too easy to spend an inordinate amount of time trying to decipher complex Mathematica expressions and programs, thereby missing the forest for the trees. Moreover, many individual functions are quite long, with embedded print statements and code segments that I could have written as separate functions. I chose not to, however, so the code stands as it is. For these reasons, some Mathematica experts may find the coding a bit inelegant: To them I extend my apologies in advance. I have also chosen to violate one of the "rules" of Mathematica programming: using complete English words for function and variable names. As an example, I use bpnMomentum instead of the more "correct" backpropagationNetworkWithMomentum. The former term is easily interpreted, requires less space (leading to more understandable expressions), and is less prone to typographical errors.
Readers with no prior experience with neural networks should have no trouble following the text, although the theoretical development is not as complete as in Neural Networks. I do assume a familiarity with basic linear algebra and calculus. Chapter 1 is a prerequisite to all other chapters in the book; the material in this chapter is fairly elementary, however, both in terms of the neural-network theory and the use of Mathematica. Readers who possess at least a working knowledge of each subject can safely skip Chapter 1. If you have never studied the gradient-descent method of learning, then you should also study Chapter 2 before proceeding to later chapters. The discussion of the Elman and Jordan networks in Chapter 6 requires an understanding of the backpropagation algorithm given in Chapter 3.
The text comprises eight chapters; each one, with the exception of the last, deals with a major topic related to neural networks or to a specific type of network architecture. Each chapter also includes a simulation of the networks using Mathematica and demonstrates the use of Mathematica to explore the characteristics of the network and, in many instances, to experiment with variations.
The last chapter introduces the subject of genetic algorithms (GAs).
We will study the application of GAs to a scaled-down version of the
traveling salesperson problem. To tie the subject back to neural networks,
we will look at one method of using GAs to find optimum weights for a
neural network. A brief description of the chapters follows.
The source code for all of the functions in this text is available, free of charge, from MathSource. MathSource is a repository for Mathematica-related material contributed by Mathematica users and by Wolfram Research, Inc. To find out more about MathSource, send a single-line email message "help intro" to the MathSource server at [email protected]. For information on other ways to access MathSource, send the message "help access" to [email protected].
If you do not have direct electronic access to MathSource, you can still get some of the material in it from Wolfram Research. Contact the MathSource Administrator at: Wolfram Research, Inc., 100 Trade Center Drive, Champaign, Illinois 61820-7237, USA, (217) 398-0700.
Acknowledgments
First, I would like to correct a grievous omission from my previous book. Dr. Robert Hecht-Nielsen was my first and only formal instructor of neural networks. I owe to him much of my appreciation and enthusiasm for the subject. Thanks, Robert! Of course, many others have contributed to this project and I would like to mention several of those individuals.
In particular, I would like to thank Dr. John Engvall, of Loral Space Information Systems, who has been a continuing source of support and encouragement. Don Brady, M. Alan Meghrigian, and Jason Deines, whom I met through the CompuServe AI Expert and Wolfram Research forums, reviewed portions of the manuscript and software in its early stages. I thank them greatly for their efforts and comments.
I initially wrote the manuscript as a series of Mathematica notebooks, which had to be translated to TeX. I want to thank Dr. Cameron Smith for providing software and many hours of consultation to help with this task.
I wish to express my appreciation to Alan Wylde, previously of Addison-Wesley Publishing Company, for his support early on in the project. I also want to thank Peter Gordon, Helen Goldstein, Mona Zeftel, Lauri Petrycki, and Patsy DuMoulin, all of Addison-Wesley, for their support, assistance, and exhortations throughout the preparation of this manuscript.
Finally, I dedicate this book to my family: Peggy, Geoffrey, and Deborah, without whose support and patience this project would never have been possible.
J.A.F.
Houston, TX
Contents

Preface
Bibliography
Index
Chapter 1
Introduction to Neural Networks and Mathematica
More times than I care to remember, I have had to verify, using a calculator, the computations performed by a neural network that I had programmed in C. To say that the process was laborious will bring a chuckle to anyone who has had the same experience. Then came Mathematica. For a while, Mathematica became my super calculator, reducing the amount of time spent doing such calculations by at least an order of magnitude.
Before long I was using Mathematica as a prototyping tool for experimenting with neural-network architectures. The answer to the question,
"Why use Mathematica to build neural networks?" will, I trust, become
self-evident as you see the tool used throughout this book.
Those of you for whom neural networks are a new technology may be asking a different question that requires answering before the issue of using Mathematica ever arises: "Why neural networks?" I would like
to spend a little time discussing the answer to that question in the first
section of this chapter. In Section 1.2, we will begin to use Mathematica to
perform some basic neural-network calculations and to do some simple
analyses. The techniques and conventions that I introduce in that section
will form the basis of the work we do in subsequent chapters.
Figure 1.1 We can recognize all of these letters as variations of the letter "A." Writing a
computer program to recognize all of these, and all other possible variations, is a formidable
task.
Figure 1.2 In a program designed to identify what appears in the picture, each picture
element, or pixel, becomes a single entity in data memory. The central processing unit
(CPU) reads instructions sequentially and operates on individual data elements, storing
the results, which can themselves be operated on by the CPU. In order for the CPU to
correctly classify the image, we must specify exactly the correct sequence of operations that
the CPU must perform on the data.
to search systematically for those features and match them against the known list for the various letters. While this approach might work for some letters, variations in writing style and typography would still necessitate a huge information base and would likely not account for all possibilities.
The problem with these approaches and with others that have been
tried in the past is that they depend on our ability to systematically pick
apart the picture of the letter on a pixel-by-pixel, or feature-by-feature,
basis. We must look for specific information in the picture and apply
rules or algorithms sequentially to the data, hoping that we are smart
enough to be able to write down in sufficient detail what it is exactly
that makes an "A" an "A" and not some other letter. The solution to
this problem has proved elusive. Figure 1.2 shows a simple schematic of
how data processing of this type takes place in a sequential computing
environment.
Figure 1.3 This figure shows a simple example of a neural network that is used to identify the image presented to it. Imagine that the input layer is the retina. The neurons on this layer respond simultaneously to data from various parts of the image. In hidden layer 1, data from all parts of the retina are combined at individual neurons. The output layer generates a hypothesis as to the identity of the input image, in this case a dog. In this particular network, each neuron on the output layer corresponds to a particular hypothesis concerning the identity of the pattern on the input layer. While there is some serial processing, much of the work is done in parallel by neurons on each layer.
Figure 1.4 In this two-layer network, units are connected with bidirectional connections,
illustrated as lines without arrows. Also notice that each unit has a feedback connection to
itself. In this network the distinction between input and output layer is ambiguous, since
either layer can act in either capacity, depending on the particular application.
Figure 1.5 This network architecture has only a single layer of units. The output of each
Figure 1.6 This network architecture is an example of a complex, multilayered structure. For the sake of clarity, not all of the individual connections between units are shown explicitly. For example, in the layer labeled "F2 Layer," all of the units are connected to each other by inhibitory connections. These connections are implied by the nature of the processing that takes place on this layer, but the connections themselves are not shown.
In biological neural networks, signals pass across synapses from one neuron to another. During this passage, the efficiency with which the synapse transfers the signal differs from synapse to synapse. In our neural-network models, this difference manifests itself as a multiplicative factor that modifies the incoming signal. This factor is called a weight or connection strength. Each connection has its associated weight value. As you will see shortly, these weight values contain the information that lets the network successfully process its data. Building a neural network to perform a specific task often depends on the development of a set of weights that encodes the appropriate data-processing algorithm. Fortunately, many neural-network models can learn the required weight values, so that we do not have to specify them in advance.
Figure 1.7 This figure shows a representation of a general processing element of a neural
network; in this case the ith unit.
xj = Table[Random[Integer,{0,1}],{10}]
{1, 0, 0, 1, 1, 1, 1, 1, 1, 0}
For weights, we shall assume random values between -0.5 and 0.5.
wij = Table[Random[Real,{-0.5,0.5}],{10}]
neti = xj . wij
-1.19356
ẋ_i ≈ Δx_i/Δt

and finally

x_i(t+1) = x_i(t) + Δt (-x_i(t) + net_i)   (1.4)
The loop in Listing 1.1 iterates Eq. (1.4) and appends each value to
a list, along with the timestep index, in such a way that we can easily
plot the results when the calculation is finished. In the If statement, we
anticipated the fact that the value of x would asymptotically approach
the value of neti. You can see this fact easily by setting the derivative in
Eq. (1.3) equal to zero to find the equilibrium value for x.
x_i = net_i
Listing 1.1
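Listing 1.1 is, in outline, a loop of the following shape (a sketch only; the step size, stopping tolerance, and variable names here are illustrative, not the original listing):

neti = 2.0; deltaT = 0.03;
x = 0.0; xlist = {};
Do[
x += deltaT (-x + neti); (* Euler step of Eq. (1.4) *)
AppendTo[xlist, {k deltaT, x}];
If[Abs[neti - x] < 0.005, Break[]], (* stop near equilibrium *)
{k, 1, 500}]
xiList = ListPlot[xlist];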
To see how close we came to the expected equilibrium value, look at the
last element in xlist.
Last[xlist]
{5.97, 1.99504}
Both the time step, deltaT, and the value of the stopping criterion in the
If statement affect the integration. Since Eq. (1.2) is easy to integrate
in closed form, we can compare the results of the numerical integration
with the actual solution.
Let's let Mathematica do the integration for us. We first clear the value of neti so that Mathematica will include this parameter symbolically in the result.
Clear[neti]
DSolve[{xi'[t]==-xi[t]+neti,xi[0]==0},xi[t],t]

{{xi[t] -> (-neti + E^t neti)/E^t}}

xi = xi[t] /. First[%]

(-neti + E^t neti)/E^t

xi = Simplify[xi]

neti - neti/E^t
We would most likely choose a slightly different form for the solution, such as Eq. (1.5):

x_i(t) = net_i (1 - e^(-t))   (1.5)
neti=2;
xiPlot = Plot[xi,{t,0,10}];
Show[{xiList,xiPlot}];
As yet, we have not considered any form of the output function other than the identity function. Let's look at some other forms for the output function.
For reference, let's plot the identity function:
out1 = Plot[neti,{neti,-5,5}];
We can change the slope and position of this graph by including some
constants in the output function. For example, consider the function
f(net_i) = c net_i + d
c=2;
d=-0.5;
out2 = Plot[c neti + d, {neti,-5,5}];
Show[{out1,out2}];
Another common output function is the sigmoid:

f(net_i) = 1/(1 + e^(-net_i))   (1.6)

To see the general shape, plot the function. First, however, let's define a function for the sigmoid, since we will be using it often.
sigmoid[x_] := 1/(1+E^(-x))

sig1 = Plot[sigmoid[x],{x,-10,10}];
Notice that the sigmoid function is limited in range to the values between zero and one. Also notice that if the net input to a unit having a sigmoid output is less than about negative five or greater than about positive five, the output of the unit will be approximately zero or one, respectively. Thus, a unit with a sigmoid output saturates at these net-input values and cannot distinguish between, for example, a net input of 8 or 10.
We can change the shape and location of the sigmoid curve by including parameters in the defining equation. Consider this example:
r = 2;
s = 2.0;
sig2 = Plot[1/(1+E^(-r neti)) + s,{neti,-10,10}];
r = 20.0;
sig3 = Plot[1/(1+E^(-r neti)),{neti,-10,10}];
r = 0.5;
sig4 = Plot[1/(1+E^(-r neti)),{neti,-10,10}];
r = 0.1;
sig5 = Plot[1/(1+E^(-r neti)),{neti,-10,10}];
Another common output function is the threshold function:

f(x) = 1 if x > θ, and 0 otherwise   (1.7)

where θ is the threshold value. The larger the value of r in the sigmoid equation, the closer the function will approximate a threshold function. Plot sig5 is essentially a linear function over the domain of interest.
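Eq. (1.7) translates directly into Mathematica (the names threshold and theta are illustrative, not from the text):

threshold[x_, theta_] := If[x > theta, 1, 0]
Plot[threshold[x, 0], {x, -10, 10}];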
The parameter s can shift the position of the graph along the ordinate.
For example,
sig6 = Plot[sigmoid[x]-0.5,{x,-10,10}]
The above plot shows the sigmoid shifted so that it passes through the
origin. The limiting values are ±0.5. One of the important features of
the sigmoid function is that it is differentiable everywhere. We will see
the importance of this characteristic in Chapter 3.
If the output function of a unit is a sigmoid function, then the following relationship holds:

f'(net) = f(net) (1 - f(net))   (1.8)
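We can let Mathematica verify this identity for the sigmoid defined earlier:

Simplify[D[sigmoid[x], x] - sigmoid[x] (1 - sigmoid[x])]

0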
If you look back at Figures 1.3 through 1.6, you will notice the characteristic layered structure of the networks. This structure is general: You can always decompose a neural network into layers. It may be true that
Figure 1.8 This figure shows two layers of a neural network. The bottom layer sends inputs
to the units in the top layer. The layer is fully interconnected; that is, all units on the bottom
layer send connections to all units on the upper layer.
some layer has only a single unit, or that some units are members of
more than one layer, but, nevertheless, layers are the rule.
We shall impose a constraint on our layers that requires all units on a
layer to have the same number of incoming connections. This constraint
would seem to be in keeping with the networks in Figures 1.3 through
1.6. On the other hand, not all neural networks are structured so that all
units that form the logical grouping of a layer have the same number of
input connections. As an example, a network may have units that are
connected randomly to units in previous layers, making the number of
connections on each unit potentially different from that on other units.
Nevertheless, we can force any such network to appear as though all units
on a layer have the same number of connections by adding connections
with zero weights. Any data flowing across these connections would
have no effect on the net input to the unit, so it is as if the connection
did not exist. The reason that we go to this trouble is for convenience in
calculation.
Consider the small network shown in Figure 1.8. Each unit in the
top layer receives four input values from the units in the previous layer.
Each unit on the bottom layer sends the identical value — its output
value — to all units in the upper layer. Each unit in the upper layer has
its own weight vector with one weight for each input connection. The
easiest way to deal with these individual weight vectors is to combine
them into a weight matrix. Let's define such a weight matrix for the
network in Figure 1.8, using random weight values.
w = Table[Table[Random[Real,{-0.5,0.5},3],{4}],{8}]
We can access the weight vector for each individual unit by indexing into the matrix at the proper row. For example, the weight vector on the third unit in Figure 1.8 is given by
w[[3]]
Similarly, the weight from the second unit on the first layer to the third
unit on the upper layer is given by
w[[3,2]]
-0.297
or alternatively
w[[3]][[2]]
-0.297
We will sometimes find it advantageous to view the weight matrix in a more familiar form. We can do the transformation with the MatrixForm command:

MatrixForm[w]
To calculate the net inputs to all of the upper-layer units, all we need do is to perform a vector-matrix multiplication using the output vector of the bottom layer and the weight matrix of the upper layer. First, we must define the output vector of the bottom layer. Let's keep it simple:

out1 = {1,2,3,4}

{1, 2, 3, 4}

netin2 = w . out1
Note that we could reverse the order of multiplication in the above expression, provided we first transpose the weight matrix.
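A quick check confirms the equivalence:

out1 . Transpose[w] == w . out1

True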
VectorQ[{1,2,3,4}]

True

MatrixForm[{1,2,3,4}]
1
2
3
4
This convention is in keeping with other texts that assume vectors are column vectors. For example, in Neural Networks, I would write the vector out1 as (1,2,3,4)^t, where the t superscript represents the transpose operation. In fact, if you attempt to transpose the out1 vector, you will get an error. Try it.
Let's return to the issue of randomly-connected networks. With the
above convention, connections that do not exist should have zeros at
those locations in the weight matrix. The following matrix is an example.
(0,0,0,0,1)
(0,0,0,0,1)
(0,0,0,0,1)
(1,1,1,1,1)
(1,0,0,0,1)
(1,0,0,0,1)
(1,1,1,1,1)

(b)
Figure 1.9 (a) This figure shows a 5 by 7 array of inputs to a neural network. Each array location, or pixel, may be either on (black) or off (white). (b) This binary representation of the image on the input array gives the 35 numbers used as inputs to the neural network.
w1 = {0.8,0.3,0.5,0.1,0.1,0.6,-0.5,-0.7,-0.5,0.6,
0.6,-0.5,-0.7,-0.3,0.5,0.4,0.3,0.6,0.5,0.4,
0.5,-0.5,-0.6,-0.1,0.3,0.4,-0.4,-0.5,-0.2,0.4,
0.4,0.5,0.3,0.7,0.2};
Notice that I have used a semicolon after the above expression in order
to suppress the output from Mathematica, which I shall do when I do not
feel that showing the output adds anything to the discussion.
Assume the second unit has the weight vector
w2 = {-0.4,-0.7,-0.5,-0.6,0.8,-0.5,-0.4,-0.7,-0.6,0.8,
-0.4,-0.3,-0.7,-0.4,0.6,0.4,0.5,0.4,0.7,0.4,
0.7,-0.5,-0.6,-0.1,0.4,0.5,-0.4,-0.5,-0.2,0.6,
0.4,0.7,0.7,0.6,0.4};
in = {0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,
1,1,1,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,1,1};
As a point of reference, let's find the net-input value for each of these
two units. For unit 1:
net1 = in . w1
7.1
net2 = in . w2
9.6
The second unit has a somewhat larger net input; that is, unit number 2
is excited more strongly by the input pattern than unit 1. Just looking at
the weight vectors, however, would not easily lead you to that conclusion
until you actually calculated the results.
Let's rewrite the weight vectors as rectangular weight matrices having
a row-column structure the same as that of the input pattern. Weight 1
becomes
w1 = Partition[w1,5]
{{0.8, 0.3, 0.5, 0.1, 0.1}, {0.6, -0.5, -0.7, -0.5, 0.6},
{0.6, -0.5, -0.7, -0.3, 0.5}, {0.4, 0.3, 0.6, 0.5, 0.4},
{0.5, -0.5, -0.6, -0.1, 0.3}, {0.4, -0.4, -0.5, -0.2, 0.4},
{0.4, 0.5, 0.3, 0.7, 0.2}}
In matrix form:
MatrixForm[w1]
w2 = Partition[w2,5]
{{-0.4, -0.7, -0.5, -0.6, 0.8}, {-0.5, -0.4, -0.7, -0.6, 0.8},
{-0.4, -0.3, -0.7, -0.4, 0.6}, {0.4, 0.5, 0.4, 0.7, 0.4},
{0.7, -0.5, -0.6, -0.1, 0.4}, {0.5, -0.4, -0.5, -0.2, 0.6},
{0.4, 0.7, 0.7, 0.6, 0.4}}
MatrixForm[w2]
The matrix forms of w1 and w2 are still not so transparent. Let's apply the command ListDensityPlot to both.
ListDensityPlot[Reverse[w1]];
ListDensityPlot[Reverse[w2]];
Each square in either of these two plots corresponds to one weight value, and each row corresponds to the weight vector on a single unit. If you stand back from these graphics, you should notice that the first one contains an image of an upper-case "B," and the second contains an image of a lower-case "d." In these pictures, the larger the weight, the lighter the shade of the square.
Notice that the second weight matrix has large values in the same
relative locations in which the input vector has a 1. This correspondence
leads to a larger dot product than in the case where the large weight
values and large input values do not match. This analysis assumes that
the weight vectors are normalized in some fashion, so that there is no
false match resulting from an unusually large weight-vector length.
Notice in the plots of the weight matrices that we had to Reverse the
matrices before plotting. If we had not done this reversal, the image of
the letters would have appeared upside down in the plot.
It is certainly not always the case that the weight matrix mimics one of the possible input vectors. When this does happen, we say that the corresponding unit has encoded the pattern, which itself is often called an exemplar. Nevertheless, it is often true that we can glean some insight from looking at the weights in this manner, even though their meaning may not always be apparent.
x1  x2  o
0   0   0
0   1   0
1   0   0
1   1   1
Figure 1.10 This figure illustrates a simple network with two input units and a single output unit. The table on the right shows the AND function. We wish to find weights such that, given any two binary inputs, the network correctly computes the AND function of those two inputs.
The output of the unit will be one if

Σ_{i=1}^{n} w_i x_i > θ   (1.9)

and otherwise, the output will be zero; n refers to the number of inputs. If we replace the inequality in Eq. (1.9) with an equality, the equation becomes the equation of a line in the x1x2 plane. If we position that line properly, we can determine the weights that will allow the network to solve the AND problem. Look at the following plot:
Show[{Graphics[{PointSize[0.03],{Point[{0,0}],Point[{0,1}],
Point[{1,0}],Point[{1,1}]}}],Graphics[Line[{{0,1.2},
{2.4,0}}],Axes->Automatic]}];
The line is the plot of the equation x2 = -0.5x1 + 1.2. Let θ = 1.2, w1 = 0.5, and w2 = 1.0. Rewrite the equation in the form of Eq. (1.9): 0.5x1 + 1.0x2 = 1.2. If x1 = x2 = 1, the left side of the equation is equal to 1.5, which is greater than 1.2, thus giving the correct output. For all other cases, we get an answer less than 1.2, giving zero as the result. So, by an astute placement of a line, we have determined a set of weights that solves the AND problem. There are an infinite number of other lines that also yield weights that solve this problem.
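A quick check of all four input pairs against the threshold condition (wAnd and theta are illustrative names for the values just derived):

wAnd = {0.5, 1.0}; theta = 1.2;
Map[If[wAnd . # > theta, 1, 0] &, {{0,0},{0,1},{1,0},{1,1}}]

{0, 0, 0, 1}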
The line in the preceding problem is an example of a decision surface (even though it is a line, we refer to it as a surface for the sake of generality). Notice how the line breaks up the space into two regions, one where the points in the region, when used as inputs, would satisfy the threshold condition, and one where the points in the region would not satisfy the threshold condition.
x1  x2  o
0   0   0
0   1   1
1   0   1
1   1   0
Figure 1.11 This figure shows the four input points for the XOR problem, a line representing
a decision surface, and the XOR truth table. Note that there is no way to position the line
so that it separates the points (0,0) and (1,1) from the points (0,1) and (1,0).
x1  x2  x1x2  o
0   0   0     0
0   1   0     1
1   0   0     1
1   1   1     0
Figure 1.12 This figure shows a three-input unit where the third input is the product of the
first two. The truth table still computes the XOR of the first two inputs; the network uses
the third input to help distinguish between the two classes.
the network using the other two inputs. For example, we could multiply
the two inputs together and use the result as a third input. Such a system
appears in Figure 1.12.
Let's plot the set of three-dimensional points from the truth table in Figure 1.12.
xor3d=Show[Graphics3D[{PointSize[0.02],{Point[{0,0,0}],
Point[{0,1,0}],Point[{1,0,0}],Point[{1,1,1}]}},
ViewPoint->{8.474, -2.607, 1.547}]];
Notice how the point (1,1,1) is elevated above the x1x2 plane. We can now separate the two classes of input points by constructing a plane to divide the space into the two proper regions.
Show[{%,Graphics3D[{GrayLevel[0.3],Polygon[{{0.7,0,0},
{0,0.7,0},{1,1,0.8}}]}]}];
As with the AND example, there are an infinite number of decision surfaces that will correctly solve this modified XOR problem. Increasing the dimension of the input space forms the basis for a particular type of network architecture, called the functional-link network, that we will study in Chapter 3.
We can construct a second solution to the XOR problem by using a network of the type shown in Figure 1.13. Notice in this case that we have constructed a hidden layer of units between the input and output units. It is this hidden layer that facilitates a solution. Each hidden-layer unit produces a decision surface, as shown in the figure. The first hidden-layer unit (the one on the left) will produce an output of one if either or both inputs are one. The hidden-layer unit on the right will produce an output of one only if both inputs are one.

The output unit will produce a one only if the output of the first hidden unit is one AND the output of the second hidden unit is zero; in other words, if only one, but not both, of the inputs are one.
To verify the correct operation of this network, let's calculate the output value for each of the possible input vectors. Although the calculation is simple enough to perform by hand (or in your head), we shall write Mathematica code (Listing 1.2) for practice. To keep the routine simple, we shall perform the calculation for one input vector at a time. Here is the result for one input vector:
Figure 1.13 The figure on the left shows a network made up of three layers: an input layer, a hidden layer comprising two units, and a single-unit output layer. All units are threshold units with the value of θ as the threshold in each case. The two hidden-layer units construct the two decision surfaces shown in the graph on the right. The output unit performs the logical function: (hidden unit 1) AND (NOT (hidden unit 2)).
By changing the value of the input vector, you can verify that the network
computes the XOR function correctly.
This example also serves to establish some notational conventions
that we shall use throughout the text. The input vector will always be
inputs, and the output vector will always be outputs. Other quantities will
generally have compound names, such as hidWt, outNetin, etc. In each
case, the first part of the name will refer to the layer, as in hid for hidden
layer, and out for output layer. The second part of the name will refer
to the particular quantity, as in Netin for net input value, and Threshold
for threshold value. Moreover, the second part of any name will be
capitalized for readability. If a third part of any name is necessary, the
conventions will be the same as for the second part. I will abbreviate some names when I feel there will be no uncertainty in the intended meaning.

Listing 1.2
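Listing 1.2 computed the network's output for a single input vector. A minimal sketch of that calculation, with illustrative weights and thresholds chosen to realize the decision surfaces of Figure 1.13, follows:

inputs = {1, 0};
hidWt = {{1, 1}, {1, 1}}; (* illustrative weights *)
hidThreshold = {0.5, 1.5}; (* left unit acts as OR, right unit as AND *)
hidOut = Map[If[# > 0, 1, 0] &, hidWt . inputs - hidThreshold];
outWt = {1, -2}; outThreshold = 0.5; (* output: h1 AND (NOT h2) *)
outputs = If[outWt . hidOut - outThreshold > 0, 1, 0]

1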
Let's turn our attention now to another type of learning that has its
basis in an early theory of how brains actually learn. The theory was
first described by a psychologist, Donald Hebb, and the learning method
bears his name: Hebbian learning.
Figure 1.14 In this figure, which represents a schematic of the classical conditioning process, each lettered circle corresponds to a neuron cell body. The long lines from the cell bodies are the axons that terminate at the synaptic junctions. S_BA and S_BC correspond to the two synaptic junctions. Although we have represented this process in terms of a few simple neurons, we intend this schematic to convey the concept of classical conditioning, not its actual implementation in real nerve tissue.
This theory can be used directly to explain the behavior known as clas¬
sical conditioning, or Pavlovian conditioning. Refer to Figure 1.14.
Suppose that the sight of food is sufficient to excite cell C, which
in turn, excites cell B and causes salivation. Suppose also, that in the
absence of the sight of food, sound input from a ringing bell is insufficient
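Assuming the simple Hebbian form dwij/dt = eta - a wij with wij(0) = 0 (the activities of the two cells folded into the constant eta), a DSolve call such as the following reproduces the output shown below:

DSolve[{wij'[t] == eta - a wij[t], wij[0] == 0}, wij[t], t]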
{{wij[t] -> 0. + (-1. eta + 1. E^(a t) eta)/(E^(a t) a)}}
Plot the result assuming typical values for the various parameters. The value asymptotically approaches the value eta/a; in this case, 1.6.
Many variations of the basic Hebbian learning law exist, and we shall
encounter a few of them in other places in this book. Moreover, we have
not yet addressed the issue of how a network, made up of multiple units,
learns to perform a particular function. This topic will consume most of
the remainder of this book.
Summary
In this chapter, we began our study of neural networks using Mathematica
by considering the rationale behind the neural-network approach to data
processing. We looked at the fundamentals of processing for individual
units including the net input and output calculations. We also introduced
the concept of learning and decision surfaces in neural networks. Along
the way, we introduced many of the Mathematica functions and methods
that we will use in the remaining chapters of this book. Both the neural
network and the Mathematica concepts covered in this chapter will serve
as the basis for the material in the chapters that follow.
Chapter 2
Training by Error Minimization
In the previous chapter, we discussed the fact that we train neural networks to perform their task, rather than programming them. In this chapter and the next, we shall explore a very powerful learning method that has its roots in a search technique called hill climbing.
Suppose you are standing on a hillside on a day so foggy that you can see only a few feet in front of your face. You want to get to the highest peak as quickly as possible, but you have no reference points, map, or compass to assist you on your journey. How do you proceed?
One logical way to proceed would be to look around and determine the direction of travel that has the steepest upward slope and to begin walking in that direction. As you walk, you change your direction so that, at any given time, you are walking in the direction with the steepest upward slope. Eventually you arrive at a location from which all directions of travel lead downward. You then conclude that you have reached your goal of the top of the hill. Without instrumentation (and assuming the fog does not lift) you cannot be absolutely sure that you are at the highest peak, or instead, at some intermediate peak. You could mark your passage at this location and begin an exhaustive search of the surrounding landscape to determine if there are any other peaks that are higher, or you could satisfy yourself that this peak is good enough.
The method of training that we shall examine in this chapter is based on a technique similar to hill climbing, but in the opposite sense; that is, we will seek the lowest valley rather than the highest peak. In this chapter we look at the training of a system comprising a single unit. In Chapter 3 we extend this method to cover the case of multiple units, and multiple layers of interconnected units.
Figure 2.1 The complete Adaline consists of the adaptive linear combiner, in the dashed
box, and a bipolar output function. The adaptive linear combiner resembles the general
processing element described in Chapter 1.
input value always equal to one. The inclusion of such a term is largely
a matter of experience.
The net input to the ALC is calculated as usual as the sum of the products of the inputs and the weights. In the case of the ALC, the output function is the identity function, so the output is the same as the net input. If the output is y, then

y = Σ_{i=1}^{n} w_i x_i + w_0

where the w_i are the weights and the x_i are the inputs. If we make the identification x_0 = 1, then we can write

y = Σ_{i=0}^{n} w_i x_i

or, in vector notation,

y = w · x   (2.1)
The bipolar output of the complete Adaline is then

o = Sign(y)
Sign [2]
1
Sign[-2]
-1
In the remainder of this chapter, we shall be concerned only with the
ALC portion of the Adaline. It is possible to connect many Adalines
together to form a layered neural network. We refer to such a structure
as a Madaline (Many ADALINEs), but we do not consider that network
in this book.
Suppose we have an ALC with four inputs and a bias term. Furthermore, suppose that we desire that the output of the ALC be the value 2.0 when the input vector is {1,0.4,1.2,0.5,1.1}, where the first value is the input to the bias term. We can represent the weight vector as {w0, w1, w2, w3, w4}. There are an infinite number of weight vectors that will solve this particular problem. To find one, simply select values for four of the weights at random and compute the fifth weight. Let's do an example to illustrate the use of the Solve function.
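A sketch of that calculation (the four randomly chosen weight values here are illustrative):

x = {1, 0.4, 1.2, 0.5, 1.1};
w = {w0, 0.3, -0.2, 0.5, 0.1};
Solve[w . x == 2.0, w0]

{{w0 -> 1.76}}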
w = w /. %
w.x
{2.}
Suppose we have a set of input vectors, {x1, x2, ..., xL}, each having its own, perhaps unique, correct or desired output value, dk, k = 1, ..., L. The problem of finding a single weight vector that can successfully associate each input vector with its desired output value is no longer simple. In this section we develop a method called the least-mean-square (LMS) learning rule, or the delta rule, which is one method of finding the desired weight vector. We refer to this process of finding the weight vector as training the ALC. Moreover, we call the process a supervised learning technique, in the sense that there is some external teacher that knows what the correct response should be for each given input vector. The learning rule can be embedded in the device itself, which can then self-adapt as inputs and desired outputs are presented to it. Small adjustments are made to the weight values as each input-output combination is processed until the ALC gives correct outputs. In a sense, this procedure is a true training procedure, because we do not calculate the value of the weight vector explicitly.
The error for the kth exemplar is

ε_k = d_k - y_k   (2.2)

and the mean-squared error is

ξ = ⟨ε_k²⟩ = (1/L) Σ_{k=1}^{L} ε_k²   (2.3)

For a single exemplar,

ε_k² = (d_k - w · x_k)²   (2.4)

Expanding Eq. (2.4) and averaging over the exemplars gives the mean-squared error as a function of the weights,

ξ = ⟨d_k²⟩ + w · R · w - 2 p · w   (2.5)

where R = ⟨x_k x_k^t⟩ is the input correlation matrix and p = ⟨d_k x_k⟩. Given values for ⟨d_k²⟩, R, and p, and without specifying the actual input vectors, we can construct the graph.
ClearAll[R,p,w1,w2,wt,d];
wt = {w1,w2}

R = {{3,1},{1,4}};
p = {4,5};
d = 10;
wtPlot=Plot3D[d+wt.R.wt-2 p.wt,{w1,-50,50},
{w2,-50,50}];
Although it may not be apparent to you from these graphs, the surface is a paraboloid. The function has a single minimum point. The weights corresponding to that minimum point are the best weights for this example. You may find it more instructive to look at a contour plot of the function.
We can find the minimum point by taking the derivative of Eq. (2.5).
The result is the weight vector that gives the minimum error:
minWt = Inverse[R].p
{1, 1}
and the minimum error is ξ_min = d - p · minWt = 1.
Given the knowledge of R, also called the input correlation matrix, and
p, we saw how it was possible to calculate the weight vector directly. In
many problems of practical interest, we do not know the values of R and
p. In these cases we must find an alternate method for discovering the
minimum point on the error surface.
Consider the situation shown in Figure 2.2. To initiate training, we assign arbitrary values to the weights, which establishes the error, ξ, at some arbitrary point on the error surface.

Figure 2.2 This figure illustrates gradient descent down an error surface toward the minimum weight value.
As an estimate of the gradient of the error surface, we use the gradient of the squared error for a single exemplar:

∇ξ ≈ ∇ε_k²   (2.6)

∂ε_k²/∂w_i = ∂/∂w_i (d_k - Σ_{i=1}^{n} w_i (x_i)_k)²
           = -2 (d_k - Σ_{i=1}^{n} w_i (x_i)_k) (x_i)_k = -2 ε_k (x_i)_k   (2.7)
We then adjust the weight value, in this case w_i, by a small amount in the direction opposite to the gradient. In other words, we update the weight value according to the following prescription:

w_i(t+1) = w_i(t) + 2 η ε_k (x_i)_k   (2.8)

or in vector form:

w(t+1) = w(t) + 2 η ε_k x_k   (2.9)

where η is called the learning rate parameter and usually has a value much less than one.
LMS rule, or delta rule. By repeated application of this rule using all of
the input vectors, the point on the error surface moves down the slope
toward the minimum point, though it does not necessarily follow the
exact gradient of the surface. As the weight vector moves toward the
minimum point, the error values will decrease. You must keep iterating
until the errors have been reduced to an acceptable value, the definition
of acceptable being determined by the requirements of the application.
Figure 2.3 This diagram shows the ALC in the transversal filter configuration. In this case, there are two inputs. At each iteration the first input is shifted down to become the second input. The kth input is the sine function shown. The desired output value is twice the cosine of the argument of the kth input. We have also added a random noise factor to the inputs. We show the weights as variable resistors to indicate that they will change as training proceeds.
Let's consider how to apply this learning rule by trying a specific example. We shall use an example from the text Adaptive Signal Processing, by Bernard Widrow and Samuel D. Stearns.¹ The ALC is a two-input unit arranged in a configuration known as a transversal filter. In this configuration, one input is simply a time-delayed copy of the other input. Figure 2.3 shows the ALC for this case.
At the kth timestep, the input value is given by

x_k = sin(πk/8)
¹Widrow, Bernard, and Samuel D. Stearns. Adaptive Signal Processing. Prentice Hall: Englewood Cliffs, NJ, 1985.
and the desired output value is

d_k = 2 cos(πk/8)

To each input value we shall add a random noise factor with a random signal power, φ = 0.01, where φ = ⟨r_k²⟩.
We can look at the input function by creating a table of points and then plotting those points.

Table[{k,Sin[Pi k/8]//N},{k,0,24}]

{{0, 0.}, {1, 0.382683}, {2, 0.707107}, {3, 0.92388}, {4, 1.},
{5, 0.92388}, {6, 0.707107}, {7, 0.382683}, {8, 3.79471 10^-19},
{9, -0.382683}, {10, -0.707107}, {11, -0.92388}, {12, -1.},
{13, -0.92388}, {14, -0.707107}, {15, -0.382683},
{16, -7.58942 10^-19}, {17, 0.382683}, {18, 0.707107}, {19, 0.92388},
{20, 1.}, {21, 0.92388}, {22, 0.707107}, {23, 0.382683},
{24, 1.35525 10^-18}}

ListPlot[%];
Adding a random value to the function disturbs the plot somewhat. Here
is a rendering of the function with the random value, plotted with the
points joined with lines.
ClearAll[k]
outputs = Table[{k,2 Cos[Pi k/8]//N},{k,0,24}]
{{0, 2.}, {1, 1.84776}, {2, 1.41421}, {3, 0.765367}, {4, 0.},
{5, -0.765367}, {6, -1.41421}, {7, -1.84776}, {8, -2.},
{9, -1.84776}, {10, -1.41421}, {11, -0.765367}, {12, 0.},
{13, 0.765367}, {14, 1.41421}, {15, 1.84776}, {16, 2.},
{17, 1.84776}, {18, 1.41421}, {19, 0.765367}, {20, 0.},
{21, -0.765367}, {22, -1.41421}, {23, -1.84776}, {24, -2.}}
Show[{inPlot,outPlot}];
SeedRandom[4729]
wts = Table[Random[],{2}] (* initialize weights *)
inputs = {0,Random[Real,{0, 0.175}]} (* initialize input vector *)
eta = 0.2 (* learning rate parameter *)
k=1
errorList=Table[
inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0, 0.175}];
outDesired = N[2 Cos[Pi k/8]]; (* desired output *)
outputs = wts.inputs; (* actual output *)
outError = outDesired-outputs; (* error *)
wts += eta outError inputs; (* update weights *)
inputs[[2]]=inputs[[1]]; (* shift input values *)
k++; (* increment counter *)
outError,{250}
] (* end of Table *)
Print["Final weight vector = ",wts]
ListPlot[errorList,PlotJoined->True] (* plot the errors *)
Listing 2.1
Listing 2.1 shows the Mathematica code for the ALC. The output giving
the final weight vector and plot of the errors is as follows:
Note that by setting the input vector equal to {0,Random} initially, all that we need do to prepare the first valid input vector is replace inputs[[1]] with Sin[Pi/8], which is accomplished the first time through the loop. After the weight updates, the value in inputs[[1]] is shifted forward to inputs[[2]], and inputs[[1]] is recalculated at the beginning of the loop. The actual optimum weight vector for this problem is {3.784, -4.178}. You can see from the plot of the error values that, initially, the errors appear quite sinusoidal in character. As the ALC learns, the error is reduced to its random component.
We can make a slight modification to our code, as shown in Listing 2.2, and look at how the weight vector moves as a function of iteration step. The output from the code in Listing 2.2 is as follows:

The following plot shows the weight vector as it converges on the known optimum weight value, depicted as a large dot. If we continued iterating the weight values, they would bounce around the optimum point. We could get closer by decreasing the learning rate parameter, at the cost of more iterations.
SeedRandom[4729]
wts = Table[Random[],{2}] (* initialize weights *)
inputs = {0,Random[Real,{0, 0.175}]} (* initialize input vector *)
eta = 0.2 (* learning rate parameter *)
k=1
wtList=Table[
inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0, 0.175}];
outDesired = N[2 Cos[Pi k/8]]; (* desired output *)
outputs = wts.inputs; (* actual output *)
outError = outDesired-outputs; (* error *)
wts += eta outError inputs; (* update weights *)
inputs[[2]]=inputs[[1]]; (* shift input values *)
k++; (* increment counter *)
wts,{250} (* add wts value to table *)
] (* end of Table *)
Print["Final weight vector = ",wts]
ListPlot[wtList,PlotJoined->True]; (* plot the weight track *)
Listing 2.2
For our purposes here, the current parameters are sufficient to illustrate the concepts involved with this type of training.
Show[{wtPlot,Graphics[{PointSize[0.03],
Point[{3.784,-4.178}]}]}];
Widrow and Stearns have calculated the exact equation for the error
surface for this example:
ClearAll[xi,wt1,wt2];
xi[wt1_,wt2_] := 0.51 (wt1^2+wt2^2) +
wt1 wt2 Cos[N[Pi/8]] + 2 wt2 Sin[N[Pi/8]] + 2

ClearAll[w1,w2]
errorPlot = Plot3D[xi[w1,w2],{w1,-2,8},{w2,-10,0},
ViewPoint->{-1.048, -2.529, 1.989},
Shading->False];
We can superimpose the contour plot with the plot of the movement of
the weight vector as the ALC learns. The crosshair indicates the position
of the actual optimum weight value.
ContourPlot[xi[w1,w2],{w1,-2,8},{w2,-10,0},
ContourLevels->20, Epilog->{Line[wtList],
Line[{{2,-4.178},{6,-4.178}}],
Line[{{3.784,-3},{3.784,-5}}]}];
alcTest[0.1,10]
alcTest[learnRate_,numIters_:250] :=
Module[{eta=learnRate,wts,k,inputs,wtList,outDesired,outputs,outError},
wts = Table[Random[],{2}]; (* initialize weights *)
Print["Starting weights = ",wts];
Print["Learning rate = ",eta];
Print["Number of iterations = ",numIters];
inputs = {0,Random[Real,{0, 0.175}]}; (* initialize input vector *)
k=1;
wtList=Table[
inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0, 0.175}];
outDesired = N[2 Cos[Pi k/8]]; (* desired output *)
outputs = wts.inputs; (* actual output *)
outError = outDesired-outputs; (* error *)
wts += eta outError inputs; (* update weights *)
inputs[[2]]=inputs[[1]]; (* shift input values *)
k++; wts,{numIters}]; (* end of Table *)
Print["Final weight vector = ",wts];
wtPlot=ListPlot[wtList,PlotJoined->True] (* plot the weights *)
] (* end of Module *)
Listing 2.3
alcTest[0.1,200]
Figure 2.4 This diagram shows the ALC in its standard configuration. For the XOR example,
there are two inputs.
alcTest[0.3]
alcXor[learnRate_,numInputs_,ioPairs_,numIters_:250] :=
Module[{wts,eta=learnRate,errorList,inputs,outDesired,outError,outputs},
SeedRandom[6460]; (* seed random number gen. *)
wts = Table[Random[],{numInputs}]; (* initialize weights *)
errorList=Table[ (* select ioPair at random *)
{inputs, outDesired} = ioPairs[[Random[Integer,{1,4}]]];
outputs = wts.inputs; (* actual output *)
outError = First[outDesired-outputs]; (* error *)
wts += eta outError inputs;
outError,{numIters}]; (* end of Table *)
ListPlot[errorList,PlotJoined->True];
Return[wts];
]; (* end of Module *)
Listing 2.4
We define the input/output-pair list, or ioPairs vector, outside of the function. This list should contain the
inputs and desired output values for the problem. The function, alcXor,
takes the learning rate and the ioPairs vector as an argument, as well as
the number of inputs and the number of iterations, which once again
defaults to 250. The function returns the weight matrix for use later. For
two inputs, the ioPairs vector is
ioPairsXor2 = {{{0,0},{0}},{{0,1},{1}},{{1,0},{1}},{{1,1},{0}}};
Let's execute the function with this ioPairs vector. The output will be a
plot of the error value as a function of iteration.
wtsXor = alcXor[0.2,2,ioPairsXor2];
ioPairsXor3 =
{{{0,0,0},{0}},{{0,1,0},{1}},{{1,0,0},{1}},
{{1,1,1},{0}}};
Once again, let's execute the function; this time with the new ioPairs
vector.
wtsXor = alcXor[0.2,3,ioPairsXor3];
Notice in this case that the error decreases as the iteration proceeds. To see how close we are to an acceptable solution, we can write a simple function — testXor in Listing 2.5 — to run through the ioPairs and determine the error of the ALC for each input. Executing this function shows the outputs and errors for each input pattern.
ClearAll[testXor]
testXor[ioPairs_,weights_] :=
Module[{errors,inputs,outDesired,outputs,mse},
inputs = Map[First,ioPairs]; (* extract inputs *)
outDesired = Map[Last,ioPairs]; (* extract desired outputs *)
outputs = inputs . weights; (* calculate actual outputs *)
errors = outDesired-outputs;
mse =
Flatten[errors] . Flatten[errors]/Length[ioPairs];
Print["Inputs = ",inputs];
Print["Outputs = ",outputs];
Print["Errors = ",errors];
Print["Mean squared error = ",mse]
]
Listing 2.5
In the examples that we have done so far, we have performed weight updates for a certain number of iterations. At the beginning of the section, we derived the delta rule based on a minimization of the mean-squared error; or rather its approximation, ε_k². Let's modify the code from the previous example to include a test of the mean-squared error, and a conditional termination based on its value.
We shall also write a function that calculates the mean squared error.
The ALC code will call this function every four iterations. Even though
we choose the input vector randomly, after a few iterations, the ALC
should be learning all four patterns about equally. The code for the
mean-squared error calculation appears in Listing 2.6. The code for the
complete simulation appears in Listing 2.7. Finally, we execute the code,
using 0.01 as the error tolerance value. The function returns the list of
final weight values, which appear following the error plot.
calcMse[ioPairs_,wtVec_] :=
Module[{errors,inputs,outDesired,outputs},
inputs = Map[First,ioPairs]; (* extract inputs *)
outDesired = Map[Last,ioPairs]; (* extract desired outputs *)
outputs = inputs . wtVec; (* calculate actual outputs *)
errors = Flatten[outDesired-outputs];
Return[errors.errors/Length[ioPairs]]
]
Listing 2.6
wtsXor = alcXorMin[0.2,3,ioPairsXor3,0.01]
The value of the coordinate on the abscissa of the resulting graph is the
number of cycles rather than the number of iterations. We define a cycle
to be equal to the number of exemplars; in this case four. Even though
we pick inputs at random rather than choosing the four exemplars in
sequence, we still consider a cycle to be four iterations. You could modify
the code to present the four exemplars in sequence: Typically, you would
use nested loops. As long as you have enough cycles so that all exemplars
are presented approximately the same number of times, random selection
is adequate.
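A sketch of that sequential presentation, assuming a numCycles parameter:

Do[
Do[
{inputs, outDesired} = ioPairs[[i]];
outputs = wts.inputs;
outError = First[outDesired-outputs];
wts += eta outError inputs, (* one cycle presents each exemplar once *)
{i, Length[ioPairs]}],
{cycle, numCycles}]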
Once again, we can test the network performance:
testXor[ioPairsXor3,wtsXor]
alcXorMin[learnRate_,numInputs_,ioPairs_,maxError_] :=
Module[{wts,eta=learnRate,errorList,inputs,outDesired,
meanSqError,done,k,outError,outputs,errorPlot},
wts = Table[Random[],{numInputs}]; (* initialize weights *)
meanSqError = 0.0;
errorList={};
For[k=1;done=False, !done,k++, (* until done *)
(* select ioPair at random *)
{inputs, outDesired} = ioPairs[[Random[Integer,{1,4}]]];
outputs = wts.inputs; (* actual output *)
outError = First[outDesired-outputs]; (* error *)
wts += eta outError inputs; (* update weights *)
If[Mod[k,4]==0,meanSqError=calcMse[ioPairs,wts];
AppendTo[errorList, meanSqError]; ];
If[k>4 && meanSqError<maxError,done=True,Continue]; (* test for done *)
]; (* end of For *)
errorPlot=ListPlot[errorList,PlotJoined->True];
Return[wts];
] (* end of Module *)
Listing 2.7
Before moving on, let's look at one item in the ALC simulation code.
Notice that we did not use the Table function this time; rather, we used
the For loop, as you might in a normal computer program. The reason for
this switch is that we no longer know in advance how big the final array
will be. When you know this value in advance, the Table construction is
a better choice.
Figure 2.5 This figure shows a single layer of ALCs. Each ALC receives the identical input
vector, but responds with a different output value.
Summary
In this section, we used the single-unit adaptive linear combiner to develop a powerful learning method called the least-mean-square method. This method forms the basis of the multi-unit, multi-layer learning algorithm called backpropagation, which we shall introduce in the next chapter. We also saw that we can view the learning process geometrically as finding the minimum point on a surface that represents the error plotted as a function of the weight values. In the case of the ALC, the error surface is always a concave-upward hyperparaboloid. In multi-unit, multi-layer networks, we can still think of the learning process as finding the minimum value on a surface, but the topology of that surface is not usually as simple as the one for the ALC.
Chapter 3
Backpropagation and Its Variants
Figure 3.1 This diagram shows a typical structure for a BPN. Although there is only one
hidden layer in this figure, you can have more than one. The superscripts on the various
quantities identify the layer. The p subscript refers to the pth input pattern.
$$\mathrm{net}^h_{pj} = \sum_{i=1}^{N} w^h_{ji}\,x_{pi} + \theta^h_j \qquad (3.1)$$

$$\mathrm{net}^o_{pk} = \sum_{j=1}^{L} w^o_{kj}\,i_{pj} + \theta^o_k \qquad (3.2)$$

where $i_{pj}$ is the input from the $j$th hidden-layer unit to the output-layer
units for the $p$th input pattern, and the $\theta$s are the bias values. $N$ and $L$
refer to the number of units on the input and hidden layers, respectively.
Unlike the ALC discussed in Chapter 2, the output function of these
units is not necessarily the simple identity function, although it can be in
the case of the output units. Most often, the output function will be the
sigmoid function described in Chapter 1. Then the outputs of the units
are
$$i_{pj} = f_j(\mathrm{net}^h_{pj}) \qquad (3.3)$$

$$o_{pk} = f_k(\mathrm{net}^o_{pk}) \qquad (3.4)$$
inputs = Table[Random[],{5}]
Since there are five inputs, the weight matrix should have five columns
(assuming no bias term). If there are four units on the layer, one possible
(but highly unlikely) weight matrix may appear as follows:
Or in matrix form:
MatrixForm[%]

1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
sigmoid[x_] := 1/(1+E^(-x))
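For example, with a few illustrative values:

sigmoid[{-1., 0., 1.}]
{0.268941, 0.5, 0.731059}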
Notice that we can supply the sigmoid function with a list of values as
an argument, rather than just a single value. Mathematica automatically
applies the function to each element in the list. Now that we have de­
scribed the basic feed-forward propagation in the BPN, let's move on to
a derivation of the GDR.
$$\mathbf{y} = \Phi(\mathbf{x}), \qquad \mathbf{x} \in \mathbb{R}^N,\ \mathbf{y} \in \mathbb{R}^M$$
$$E_p = \frac{1}{2}\sum_{k=1}^{M} \delta_{pk}^2 \qquad (3.5)$$
where

$$\delta_{pk} = (y_{pk} - o_{pk}) \qquad (3.6)$$
The subscript $p$ refers to the $p$th exemplar, $o_{pk}$ is the output of the $k$th
output-layer unit for the $p$th exemplar, and there are $M$ output-layer
units.
Equation (3.5) represents a local approximation to the global error
surface
$$E = \sum_{p=1}^{P} E_p = \sum_{p=1}^{P}\frac{1}{2}\sum_{k=1}^{M}\bigl(y_{pk} - f_k(\mathrm{net}^o_{pk})\bigr)^2$$
For now we shall write the partial derivative of the output function
as

$$f'_k(\mathrm{net}^o_{pk}) = \frac{\partial f_k(\mathrm{net}^o_{pk})}{\partial(\mathrm{net}^o_{pk})}$$
We calculate the errors on the output layer first, then bring those errors
back to the hidden layer to calculate the surface gradients there.
Once we have calculated the gradients, then we adjust each weight
value a small amount in the direction of the negative of the gradient. The
proportionality constant is called the learning-rate parameter, just as it
was for the ALC in Chapter 2. Next, we present the next input pattern
and repeat the weight-update process. The process continues until we are
satisfied that all output-layer errors have been reduced to an acceptable
value.
Before moving on to some examples, we can simplify the notation
somewhat through the use of some auxiliary variables. Define the output-
layer delta as

$$\delta^o_{pk} = (y_{pk} - o_{pk})\, f'_k(\mathrm{net}^o_{pk})$$

3.2 BPN Examples
In this section we will write the Mathematica code for the standard BPN
and use it to look at two specific examples. Because the BPN is so com­
putationally intensive, we shall be restricted to fairly small networks.
Nevertheless, we shall be able to experiment with several network pa¬
rameters to see their overall effect on the performance of the BPN.
(a) [letter grids]
0.9,0.9,0.9
0.1,0.9,0.1
0.1,0.9,0.1
(b)
Figure 3.2 This figure shows the data-representation scheme for the T-C problem. (a) Each
letter is superimposed on a 3 by 3 grid. (b) Filled grid-squares are represented by the real
number 0.9, and empty ones are represented by 0.1. Each input vector consists of nine real
numbers.
Figure 3.3 This figure shows a standard, three-layer BPN that we can use to solve the T-C
problem. We only show one hidden layer in this figure, although we could add more. The
number of units on the hidden layer can vary and has effects on the performance of the
network. For this example, we assume all units have sigmoid outputs.
ioPairsTC = {{{0.9,0.9,0.9,
0.9,0.1,0.1,
0.9,0.9,0.9},{0.1}},
{{0.9,0.9,0.9,
0.1,0.9,0.1,
0.1,0.9,0.1},{0.9}},
{{0.9,0.9,0.9,
0.9,0.1,0.9,
0.9,0.1,0.9},{0.1}},
{{0.1,0.1,0.9,
0.9,0.9,0.9,
0.1,0.1,0.9},{0.9}},
{{0.9,0.9,0.9,
0.1,0.1,0.9,
0.9,0.9,0.9},{0.1}},
{{0.1,0.9,0.1,
0.1,0.9,0.1,
0.9,0.9,0.9},{0.9}},
{{0.9,0.1,0.9,
0.9,0.1,0.9,
0.9,0.9,0.9},{0.1}},
{{0.9,0.1,0.1,
0.9,0.9,0.9,
0.9,0.1,0.1},{0.9}} };
inNumber = 9
hidNumber = 3
outNumber = 1
We need to initialize the matrices that will hold the weight values for the
units on each layer. For the BPN, we typically use small, random real
numbers.

hidWts = Table[Table[Random[Real,{-0.1,0.1}],
          {inNumber}],{hidNumber}]
outWts = Table[Table[Random[Real,{-0.1,0.1}],
          {hidNumber}],{outNumber}]
eta = 0.5
0.5
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]
{{0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}, {0.9}}
inputs=ioP[[1]]
outDesired=ioP[[2]]
{0.9}
To compute the output of the hidden-layer units, take the dot product of
the inputs and the hidden-layer weights and apply the sigmoid function
to each element of the resulting vector. The output-layer calculation works
the same way, using the hidden-layer outputs as its inputs:

hidOuts = sigmoid[hidWts . inputs];
outputs = sigmoid[outWts . hidOuts]
{0.501797}
outErrors = outDesired-outputs
{0.398203}
outDelta = outErrors (outputs (1-outputs))
{0.0995494}
We are now finished with the first training vector. To continue, we would
select a new input vector and repeat the above steps. To monitor our
progress, we can watch the value of outErrors until it, or its square, reaches
some acceptable level; or we can specify a certain number of iterations,
which will be our approach here.
Notice that all of the processing for the BPN training algorithm com¬
prises only six lines of Mathematica code. Let's put those lines together
in a function, shown in Listing 3.1, that implements the simple BPN.
Notice that we are constructing a table (vector) of error values as a part
of the function. If you were programming this function in a high-level
computer language, such as C, you would likely use a loop construct,
such as a for or while loop in the main body of the code. Since we know
exactly how many elements there will be in the final table (numlters), the
Table function is more appropriate. If we were to iterate training until a
certain error value was reached, we would also use a For construct, and
Append the error to a preexisting array.
The function returns three important pieces of information: the new
values for the hidden-layer weights, the new values for the output-layer
weights, and the list of errors generated as training occurred. We shall
use these values to assess how well the training went after we call the
function. Let's run a short test of 10 iterations.
bpnStandard[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,numIters_] :=
Module[{errors,hidWts,outWts,ioP,inputs,outDesired,hidOuts,
        outputs,outErrors,outDelta,hidDelta},
  hidWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber}],{hidNumber}];
  outWts = Table[Table[Random[Real,{-0.1,0.1}],{hidNumber}],{outNumber}];
  errors = Table[
    (* select ioPair *)
    ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
    inputs=ioP[[1]];
    outDesired=ioP[[2]];
    (* forward pass *)
    hidOuts = sigmoid[hidWts . inputs];
    outputs = sigmoid[outWts . hidOuts];
    (* determine errors and deltas *)
    outErrors = outDesired-outputs;
    outDelta = outErrors (outputs (1-outputs));
    hidDelta = (hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
    (* update weights *)
    outWts += eta Outer[Times,outDelta,hidOuts];
    hidWts += eta Outer[Times,hidDelta,inputs];
    (* add squared error to Table *)
    outErrors.outErrors, {numIters}]; (* end of Table *)
  Return[{hidWts,outWts,errors}];
]; (* end of Module *)
Listing 3.1
outs[[1]]
outs[[2]]
outs[[3]]
ListPlot[outs[[3]],PlotJoined->True];
bpnTest[outs[[1]],outs[[2]],ioPairsTC];
Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9}
Output 1 = {0.475994} desired = {0.1} Error = {-0.375994}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1}
Output 2 = {0.477661} desired = {0.9} Error = {0.422339}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9}
Output 3 = {0.476084} desired = {0.1} Error = {-0.376084}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9}
Output 4 = {0.477411} desired = {0.9} Error = {0.422589}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 5 = {0.476176} desired = {0.1} Error = {-0.376176}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}
Output 6 = {0.476818} desired = {0.9} Error = {0.423182}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 7 = {0.475935} desired = {0.1} Error = {-0.375935}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1}
Output 8 = {0.476766} desired = {0.9} Error = {0.423234}
All of the output values are clustered near the central region of the sig¬
moid function. The network has not been trained sufficiently to allow it
to distinguish between the two classes of input vectors. Let's try again,
this time increasing the number of iterations.
ListPlot[outs[[3]],PlotJoined->True];
bpnTest[outs[[1]],outs[[2]],ioPairsTC];
Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9}
Output 1 = {0.551113} desired = {0.1} Error = {-0.451113}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1}
Output 2 = {0.554287} desired = {0.9} Error = {0.345713}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9}
Output 3 = {0.552132} desired = {0.1} Error = {-0.452132}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9}
Output 4 = {0.55632} desired = {0.9} Error = {0.34368}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 5 = {0.552388} desired = {0.1} Error = {-0.452388}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}
Output 6 = {0.553846} desired = {0.9} Error = {0.346154}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 7 = {0.551687} desired = {0.1} Error = {-0.451687}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1}
Output 8 = {0.552483} desired = {0.9} Error = {0.347517}
We are not getting very far very fast. Let's try again.
That calculation took quite a long time on my computer. Let's see where
we are.
bpnTest[outs[[1]],outs[[2]],ioPairsTC];
Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9}
Output 1 = {0.278072} desired = {0.1} Error = {-0.178072}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1}
Output 2 = {0.684431} desired = {0.9} Error = {0.215569}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9}
Output 3 = {0.286354} desired = {0.1} Error = {-0.186354}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9}
Output 4 = {0.677043} desired = {0.9} Error = {0.222957}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 5 = {0.283591} desired = {0.1} Error = {-0.183591}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}
Output 6 = {0.684752} desired = {0.9} Error = {0.215248}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 7 = {0.276314} desired = {0.1} Error = {-0.176314}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1}
Output 8 = {0.695672} desired = {0.9} Error = {0.204328}
ListPlot[outs[[3]],PlotJoined->True];
This plot is about what we might expect. If we went back and performed
more iterations, we should do even better. By now, however, you should
be asking if there are any ways in which we might speed up the con­
vergence of the network. The answer, of course, is yes. The first thing
we can do is try a few variations of the learning-rate parameter to see
what effect they have on the speed of convergence.
outs=bpnStandard[9,3,1,ioPairsTC,0.9,500];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,1.3,500];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,2.0,350];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,3.0,250];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,4.0,200];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,5.0,150];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,10,150];
ListPlot[outs[[3]],PlotJoined->True];
outs=bpnStandard[9,3,1,ioPairsTC,30,150];
ListPlot[outs[[3]],PlotJoined->True];
Let's begin by using the Timing function to see how long iterations take
using the standard BPN model. My computer is a Macintosh IIsi.
Timing[outs=bpnStandard[2,3,1,ioPairsXOR,0.5,100];]
Not much learning has occurred. Let's increase the learning rate significantly
to see if that helps.
Timing[outs=bpnStandard[2,3,1,ioPairsXOR,5.0,100];]
Once again there has not been any learning. Let's try more passes
through the data.
Timing[outs=bpnStandard[2,3,1,ioPairsXOR,5.0,1500];]
ListPlot[outs[[3]], PlotJoined->True];
bpnTest[outs[[1]],outs[[2]],ioPairsXOR];
The network seems to be learning three of the four points. Perhaps more
iterations will do the trick. Based on the last results, this next calculation
should take about 16 minutes on my computer.
outs=bpnStandard[2,3,1,ioPairsXOR,5.0,2000];
ListPlot[outs[[3]], PlotJoined->True];
bpnTest[outs[[1]],outs[[2]],ioPairsXOR];
Well, we are not much better off than we were before. Perhaps we have
found another local minimum, where the network will never learn the
fourth training vector.
One variation that we have not yet tried is in the number of hidden-
layer units. In a sense, adding hidden-layer units adds "degrees of free­
dom" that can help the network converge to a better solution, much like
adding higher-order terms to a polynomial fit of a curve. You must use some
restraint, however, since there is a trade-off between faster convergence
in terms of the number of iterations required, and the time per iteration.
Moreover, if you add too many hidden-layer units, you could end up
worse off than when you started. Let's evaluate the case of doubling the
number of hidden units to six.
outs={0,0,0};
Timing[outs=bpnStandard[2,6,1,ioPairsXOR,5.0,100];]
We are at about the same place as we were at the end of 1500 iterations
using three hidden units. Moreover it has taken us just as much time to
get to this point as it did to run 1500 iterations with three hidden units.
While it may be true that learning requires fewer iterations with a larger
number of hidden units, the actual amount of CPU time may, in fact, be
as much, or even greater. Of course, if the network will not converge at
all, adding hidden units may be the way to get it to converge. The point
I am trying to make here is that the number of iterations required for
convergence is not necessarily the best measure of learning speed. We
shall need to keep this fact in mind when we examine other methods
of increasing learning speed in Section 3.3. I should also point out that
in most real-world problems where the number of inputs is large (say
10s or 100s) the number of hidden-layer units is typically less than the
number of inputs, unlike our simple example here.
Adding bias terms to the units often allows the network to converge in
fewer iterations. The trade-off, of course, is that there are more connec­
tions to process. The code appears in Listing 3.2.
Adding space for the bias terms in the weight matrices is no big
problem; we just increase the column dimension by one. Two other
modifications involve adding an additional input value of 1.0 to each
input vector:
inputs=Append[ioP[[1]],1.0]
and forcing the last hidden output to be 1.0:
outInputs = Append[hidOuts,1.0]
These two changes are indicated in the code by the comment
(* bias mod *)
A third modification appears in the statement that updates the hidden-
unit weights. Because of the way we calculate the weight deltas, the
equations for the weight updates would automatically try to calculate
new weights on connections from the input layer to the bias unit on the
hidden layer; however, there are no such connections. Therefore, we
must eliminate the last weight delta vector before updating the hidden
weights. The statement:
Drop[Outer[Times,hidDelta,inputs],-1]
performs this task.
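To see why, consider the array shapes (illustrative values only: three
hidden units plus the bias unit, nine inputs plus the appended 1.0):

hidDelta = Table[0.1,{4}];    (* 3 hidden units + bias unit *)
inputs = Table[0.5,{10}];     (* 9 inputs + appended 1.0 *)
Dimensions[Outer[Times,hidDelta,inputs]]
{4, 10}
Dimensions[Drop[Outer[Times,hidDelta,inputs],-1]]
{3, 10}

The dropped row is the one that would have updated the nonexistent
connections into the bias unit.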
I have also added some optional print statements, an automatic call
to the bpnTest routine, and an automatic plot of the errors. Notice that
there are now four returned items: the lists of the hidden weights, output
weights, output errors, and a graphics object representing the plot of the
errors. Let's try the T-C problem again, since it will require less time than
the XOR problem. The bpnTest function has an option that will allow it
to handle the bias terms correctly. We can set that option before running
the network.
outs={0,0,0,0};
SetOptions[bpnTest,bias->True];
Timing[outs=bpnBias[9,3,1,ioPairsTC,4.0,200];]
Listing 3.2
This result appears to be similar to that obtained without the bias term.
Nevertheless, the bias term is often incorporated as a part of a standard
BPN.
{{0.9,0.9,0.9,
0.1,0.9,0.1,
0.1,0.9,0.1},{0.9}}
If we had used zeros and ones in the input vector instead of 0.1 and 0.9
the above would appear as follows:
{{1,1,1,
0,1,0,
0,1,0},{0.9}}
3.3.1 Momentum
bpnMomentum[inNumber,hidNumber,
  outNumber,ioPairs,eta,alpha,numIters]
hidLastDelta = Table[Table[0,{inNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
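These two tables hold the weight changes from the previous iteration.
For reference, the momentum form of the weight update is the standard
one ($\alpha$ is the momentum parameter passed in as alpha):

$$\Delta w_{ji}(t+1) = \eta\,\delta_{pj}\,x_{pi} + \alpha\,\Delta w_{ji}(t)$$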
Let's try the T-C problem using this new network. We shall repeat an
example that we ran in the previous section, this time including a mo¬
mentum factor of 0.5.
outs={0,0,0,0};
Timing[outs=bpnMomentum[9,3,1,ioPairsTC,0.9,0.5,150];]
outs={0,0,0,0};
Timing[outs=bpnMomentum[2,3,1,ioPairsXOR,2.0,0.9,1500];]
outs={0,0,0,0};
Timing[outs=bpnMomentumSmart[2,3,1,ioPairsXOR,2.0,0.9,1500];]
Not only did the network converge, but the time required to run 1500
iterations was significantly less in this case than it was for the unmodified
program.
$$\delta_{pj} = \sum_{k=1}^{M} \delta^o_{pk}\, w^o_{kj}$$

Let

$$\epsilon_{pj} = \begin{cases} \max_m(\delta_{pm}) & j = \text{winning unit} \\ -\frac{1}{4}\max_m(\delta_{pm}) & \text{otherwise} \end{cases} \qquad (3.16)$$
$$w_{ji}(t+1) = w_{ji}(t) + \eta\,\epsilon_{pj}\,x_{pi}$$
with a similar equation for the output layer. Let's build these changes
into a standard BPN program without momentum.
The algorithm proceeds as in the case of the standard algorithm until
we calculate the delta values for each layer. After that calculation, we
determine the epsilon values. We shall look at the code for the output
layer here; the code for the hidden layer is analogous. First we search
the delta values to find the one with the largest absolute value and save
its position:
outPos = First[Flatten[Position[Abs[outDelta],Max[Abs[outDelta]]]]];
Since the Position function returns a list of lists, we must extract the actual
number as shown, using First and Flatten. We need to remember the delta
value at this position:

outEps = outDelta[[outPos]];

Next we replace every delta with -1/4 of this value, and then restore the
winning unit's delta:

outDelta = Table[-1/4 outEps,{Length[outDelta]}];
outDelta[[outPos]] = outEps;
We can now perform the same calculation for the hidden layer and up¬
date the weights as usual. The new program is called bpnCompete, and a
complete listing appears in the appendix.
Let's try the new algorithm on the T-C problem with one output.
Recall from Section 3.2 that without momentum, the error was still fairly
high, though it was diminishing after about 400 iterations.
outs={0,0,0,0};
outs = bpnCompete[9,3,1,ioPairsTC,5.0,150];
This result looks somewhat better than with the standard algorithm. You
might want to experiment with this algorithm further to determine quan¬
titatively if it is better than standard backpropagation.
There are dozens — perhaps hundreds — of other modifications that
we could explore. Our intent, however, is not to examine all possibilities
in an attempt to find the best one, but rather to learn how to use Math-
ematica as a tool to facilitate that exploration. With that philosophy in
mind, let's move on to the discussion of a different network architecture
called the functional link network.
Tensor Model of the XOR Problem Let's recall the ioPairs vectors
for this problem:
Figure 3.4 This figure illustrates the functional-expansion model for a three-input FLN.
Each input passes through a functional link that generates n functions of the input value.
These 3n values are passed to the next layer of the network. Note that each unit on the
next layer would receive all 3n data elements from the functional links.
Notice that 0.1 x 0.9 = 0.09, which is approximately 0.1, since this calculation represents 0 x 1.
Let's construct a program, called fln, to implement the network. We
can construct the program itself by modifying the bpnMomentum code. Since
there are no hidden units, we can eliminate all of the "hid" variables.
Inputs to the output units become inputs rather than hidOuts. We shall use
a linear unit as the output unit, so we do not need the sigmoid function,
and the equation for outDelta will change since the derivative of the linear
output function is unity. The argument list for fln is the same as that for
the bpnMomentum function, with the exception that there are no hidden units.
The function template is

fln[inNumber,outNumber,ioPairs,eta,alpha,numIters]
The function returns an array with four components: the new weight
matrix, the list of errors generated during the iterations, the graphic object
representing the plot of the errors, and the output vector generated by the
call to flnTest, which is a modified version of bpnTest. See the appendix
for the listings of these functions.
Let's try the fln function with only a few iterations.

outs={0,0,0,0};
Timing[outs=fln[3,1,ioPairsXORFLN,0.5,0.5,100];]
That result is not too bad, considering the small number of passes through
the data. Let's try again with a few more passes.
outs={0,0,0,0};
Timing[outs=fln[3,1,ioPairsXORFLN,0.5,0.5,500];]
The mean squared error for this run is not bad, although there appears
to be one point that has a significantly larger error than the others. Since
the output function is linear for this network, let's try using zeros and
ones in the ioPairs vectors.
outs={0,0,0,0};
Timing[outs=fln[3,1,ioPairsXOR01FLN,0.5,0.5,500];]
Mean Squared Error = 5.18305 10^-22
The result of the change in desired output values is dramatic. The FLN
performs quite well on the XOR problem as you can see, although it is
somewhat sensitive to the learning rate parameter. You can see this fact
for yourself if you experiment with larger values of eta.
We can plot these values if we flatten the array and then partition it into
pairs of coordinates.
functionListPlot = ListPlot[Partition[Flatten[ioPairsFunct],2]];
We must decide on the number and identity of the functions for the
functional link. We shall use the following six functions in this example:
$x$, $\sin(\pi x)$, $\cos(\pi x)$, $\sin(2\pi x)$, $\cos(2\pi x)$, and $\sin(4\pi x)$. In order to simulate
the functional link, we can replace the first element of each ioPairs vector
with a list of these functions already evaluated. Because we will use it
more than once, let's define a function that will produce the appropriate
list of functions.
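A plausible definition, inferred from the six functions just listed and from
the call to functionList that appears later in this section, is:

functionList[x_] := {x, Sin[Pi x], Cos[Pi x],
                     Sin[2 Pi x], Cos[2 Pi x], Sin[4 Pi x]}
ioPairsFunctFLN =
  Map[ReplacePart[#,functionList[#[[1,1]]],1]&, ioPairsFunct];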
Each ioPair input vector should now have six components, as the follow¬
ing example shows.
ioPairsFunctFLN[[1]]
outs={0,0,0,0};
Timing[outs=fln[6,1,ioPairsFunctFLN,0.005,0.50,1050];]
The agreement with the actual function appears to be fairly good. More¬
over, we have only executed 50 passes through the data set. These results
are sufficient to illustrate the concepts, so I shall not perform any more
executions of this network here.
By working on the data a bit, we can plot the output in order to
compare it to the correct answers. First, look at the list of output values:
outs[[4]]
ioPairsOut = MapThread[ReplacePart[#,#2,2]&,{ioPairsFunct,outs[[4]]}]
Now we can flatten and partition this array and plot it.
outListPlot = ListPlot[Partition[Flatten[ioPairsOut],2]];
To see where we are, we can plot the output along with the original
function.
Show[{functionPlot,outListPlot}];
The results are not bad, and presumably could be improved with further
training of the network. Let's see how well this network interpolates.
We can construct a new ioPairs array using points in between those used
for training.
{{{2.}, {6.50521 10^-20}}, {{2.1}, {0.194681}}, {{2.2}, {0.387938}},
 {{2.3}, {0.558222}}, {{2.4}, {0.684761}}, {{2.5}, {0.75}},
 {{2.6}, {0.741824}}, {{2.7}, {0.655304}}, {{2.8}, {0.49374}},
 {{2.9}, {0.268845}}, {{3.}, {2.78098 10^-18}}}
Test these new vectors using the weights from the previous run. The
code for the function flnTest appears in the appendix.
outputValues=flnTest[outs[[1]],ioPairsTestFLN];
These results are as good as the training set, indicating that the network
can interpolate well. Can it extrapolate, however? Let's find out by
constructing a new ioPairs array using the same function, but outside the
range of the original data.
ioPairsTest2FLN = Map[ReplacePart[#,functionList[#[[1,1]]],1]&,ioPairsTest2];
outputValues2=flnTest[outs[[1]],ioPairsTest2FLN];
Output 21 = {0.18331} desired = {1.16281 10^-17} Error = {-0.18331}
Mean Squared Error = 0.183144
As you might have guessed, these results are not so good. As a gen¬
eral rule, you should not expect a neural network to be able to respond
properly to data that is outside of the domain used during the training
process.
Summary
Figure 4.1 This figure shows the basic structure of the Hopfield network. Each of the n
units is connected by a weighted connection to all other units. There is no feedback from
a unit to itself. The I values are external inputs.
psi[inValue_,netIn_] := If[netIn>0,1,
  If[netIn<0,-1,inValue]]
It will simplify things later if we also define a function that takes input
vectors and net-input vectors as arguments and maps them onto the psi
function. We shall call this new function phi.
phi[inVector_List,netInVector_List] :=
  MapThread[psi[#,#2]&,{inVector,netInVector}]
The net-input value to each unit can be found in the usual way by find¬
ing the scalar product of the input vector and the weight vector. What
remains is to determine the appropriate weight vector for the given pairs
of vectors that we wish to store in the network.
To determine the weights we can use a training method known as
Hebb's rule (see Chapter 1). Hebb's rule was initially derived in an
attempt to explain how learning takes place in real, biological systems.
Simply put, it states that if two connected neurons are firing simulta¬
neously, the strength of the connection between them will increase. A
typical way to express this rule mathematically is
$$\Delta w_{ij} \propto x_i x_j$$
$$\mathbf{w} = \frac{1}{n}\sum_{i=1}^{L} \mathbf{x}_i\mathbf{x}_i^t \qquad (4.1)$$
where L is the number of vectors in the training set, and n is the number
of units in the network.
We can also associate with the Hopfield network a quantity called an
energy function. The energy function has the form
$$E = -\frac{1}{2}\,\mathbf{x}^t\mathbf{w}\,\mathbf{x} \qquad (4.2)$$
where the x vector is the current output vector of the network. This
equation can also be written as
$$E = -\frac{1}{2}\sum_{i,j=1}^{n} x_i w_{ij} x_j \qquad (4.3)$$
For future use, let's define the energy function using Mathematica.
energyHop[x_,w_] := -0.5 x . w . x;
Let's try a small Hopfield network, say one with 10 units, and three
random training patterns. First the training patterns. We can use the
Random function to generate binary vectors, then convert to bipolar vectors
by multiplying each component by 2 and subtracting 1.
trainPats = 2 Table[Table[Random[Integer,{0,1}],{10}],{3}]-1
To generate the weight matrix, we need to find the outer product of each
training vector with itself, and add the contributions from each.
wts = Apply[Plus,Map[Outer[Times,#,#]&,trainPats]];
MatrixForm[wts]
(10 x 10 symmetric weight matrix; every diagonal element equals 3.
First row: 3 3 -1 1 -1 -3 -3 1 1 -3)
As expected, the matrix is square and symmetric. Also notice that the
diagonal elements are all equal. To be consistent with the Hopfield ar­
chitecture as we described it above, we should set all of the diagonal
elements to zero. It will not affect the results if we leave them as they
are, however, so let's not do any more manipulations with the matrix.
Calculating the energy of the network for each of the training patterns
gives the following:
eTrainPats = Map[energyHop[#,wts]&,trainPats]
energyHop[input1,wts]
-20.
-60.
energyHop[output12,wts]
-60.
output2 = phi[input2,netInput2]
energyHop[output2,wts]
-66.
output21 = phi[output2,netInput21]
energyHop[output21,wts]
-66.
makeHopfieldWts[trainingPats_,printWts_:True] :=
Module[{wtVector},
  wtVector =
    Apply[Plus,Map[Outer[Times,#,#]&,trainingPats]];
  If[printWts,
     Print[];
     Print[MatrixForm[wtVector]];
     Print[];,Continue
  ]; (* end of If *)
  Return[wtVector];
] (* end of Module *)
Listing 4.1
discreteHopfield[wtVector_,inVector_,printAll_:True] :=
Module[{done, energy, newEnergy, netInput,
        newInput, output},
  done = False;
  newInput = inVector;
  energy = energyHop[inVector, wtVector];
  If[printAll,
     Print[ ];Print["Input vector = ",inVector];
     Print[ ];
     Print["Energy = ",energy];
     Print[ ],Continue
  ]; (* end of If *)
  While[!done,
    netInput = wtVector . newInput;
    output = phi[newInput, netInput];
    newEnergy = energyHop[output,wtVector];
    If[printAll,
       Print[ ];Print["Output vector = ",output];
       Print[ ];
       Print["Energy = ",newEnergy];
       Print[ ],Continue
    ]; (* end of If *)
    If[energy==newEnergy,
       done=True,
       energy=newEnergy;newInput=output,
       Continue
    ]; (* end of If *)
  ]; (* end of While *)
  If[!printAll,
     Print[ ];Print["Output vector = ",output];
     Print[ ];
     Print["Energy = ",newEnergy];
     Print[ ];
  ]; (* end of If *)
]; (* end of Module *)
Listing 4.2
Figure 4.2 You can think of a magnetic material as comprising individual atomic magnets
resulting from a quantum-mechanical property known as spin. In the presence of an ex¬
ternal magnetic field, the spin can be in one of two directions, each of which results in a
different orientation of the north and south poles of the individual atomic magnets.
The factor of one half in the first term on the right accounts for the
fact that each i and j index is counted twice in the summation.
At very low temperatures, individual magnets tend to line up in the
direction of the local field, $h_i$. Thus, the spin becomes either positive or
negative one depending on the sign of the local field. At higher tempera­
tures, thermal effects tend to disrupt the orderliness of this arrangement.
Consider the physical model that we just examined above in relation­
ship to the Hopfield network. The spins, $s_i$, are analogous to the unit
outputs, the local magnetic fields, $h_i$, are analogous to the net inputs,
and the potential energy of the system is analogous to the energy of the
network (in the absence of any external field). The exchange interaction
strengths form a symmetric matrix analogous to the weight matrix of the
Hopfield network.
Incidentally, if all of the exchange interaction strengths are positive,
we refer to the material as a ferromagnet. Random strengths result in a
substance known as a spin glass. Since weights in a Hopfield network
are more likely to appear to be random than all positive, the analogy is
often made between the Hopfield network and spin glasses.
Thermal effects enter the picture because the random motions caused
by thermal energy can cause a magnet to flip to a different state. Since
there are a large number of atomic magnets in any system, you would not
likely notice that the magnetization state of the material was changing,
provided the temperature remained constant. However, if the tempera¬
ture were high enough, these thermal motions could completely random¬
ize the individual spin directions, resulting in a material that had no net
magnetization.
As individual spins flipped, the total energy of the system would
change slightly. On the average, however, the energy would remain at a
$$P_r = C e^{-\beta E_r} \qquad (4.6)$$

where $C = \left(\sum_r e^{-\beta E_r}\right)^{-1}$, and the sum is taken over all possible energy states of the system.
We can now apply the results of the previous section to neural networks.
Each unit in the Hopfield network has an output of either positive one or
negative one. Thus, we can think of these units as the analogues of the
magnets in the physical system described in the previous section, where
the output of the units correspond to the values of the magnetic spin.
The weight values in the network correspond to the internal magnetic
fields at each of the individual magnets. We shall consider a physical
system with no external magnetic field. By doing so, Eq. (4.5) for the
energy of the system of magnets corresponds directly with Eq. (4.3) for
the energy of the Hopfield network.
In a neural network, such as the Hopfield network, we can impart
a stochastic nature to the system by means of a fictitious temperature,
whereby the unit outputs are made to fluctuate due to fictitious thermal
motions. Moreover, if we assume a condition of thermal equilibrium, we
can use the Boltzmann distribution to describe the effect of these fluctua¬
tions. The net effect is that the outputs of the network units are no longer
completely determined by the net-input values to those units. Instead,
there is the possibility that a unit will be bumped into a higher energy state
due to random thermal fluctuations within the system. Remember that
the ideas of temperature and thermal fluctuations in a neural network
are strictly mathematical constructs.
To build such a network we must be able to calculate the probabil¬
ity that any given unit will undergo a change of state due to a random
thermal fluctuation. To accomplish this task, we shall limit our con¬
sideration to a single unit and correspondingly change the unit-update
strategy from a synchronous update to a random, asynchronous update
procedure.
Let's focus our attention on the $k$th unit, whose output we denote
$x_k$. We shall call the energy of the system with $x_k = +1$, $E_{(+1)}$, and the
energy with $x_k = -1$, $E_{(-1)}$. According to the Boltzmann distribution,
the probability that the system will be in the state with $x_k = +1$ is

$$P_{(+1)} = \frac{e^{-\beta E_{(+1)}}}{e^{-\beta E_{(+1)}} + e^{-\beta E_{(-1)}}}$$

which simplifies to

$$P_{(+1)} = \frac{1}{1 + e^{-\mathrm{net}_k/T}}$$
where we have absorbed the Boltzmann constant into the fictitious tem¬
perature. This equation gives the probability that a unit has an output of
+1, regardless of the value of the net input. Notice that the probability
curve has a sigmoidal shape, identical to the output function that we
have used in other networks. Let's define the function for use later on.
prob[n_,T_] := 1/(1+E^(-n/T));
We can plot the probability function for several values of the temperature
parameter. Notice that as the temperature decreases, the system behaves
more and more like a deterministic system, until, at T = 0, the system is
completely deterministic.
Plot[{prob[n,0.01],prob[n,.5],prob[n,5]},{n,-5,5},
  AxesLabel->{"net","P"}];
To determine the output of any unit, we must first calculate the net
input. Then, depending on the temperature, calculate the probability
that the unit will have a +1 output. We then compare this probability
to a randomly generated number between zero and one. If this random
number is less than or equal to the probability, then we set the unit
output equal to +1, regardless of the net input. If the random number
is greater than the probability, then the normal rules for the Hopfield
network apply in determining the output.
By now, you may reasonably question why we would go to all this
trouble when the deterministic Hopfield network seems to work just fine.
The answer lies in the fact that this network, like others — in particular
the BPN — potentially can settle into a local minimum energy state, or
a spurious energy minimum, rather than one that corresponds to one of
the items stored in the network's weight matrix. Adding a stochastic
element to the unit outputs helps the network avoid such errors. The
procedure we employ is similar to the annealing process used in materials
processing, and hence we call it simulated annealing.
1. Learning and relearning in Boltzmann machines. In David E. Rumelhart and James L. McClelland,
editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press,
Cambridge, MA, pages 282-317, 1986.
Figure 4.3 This figure shows a simple energy landscape with two minima, a local minimum,
Ea, and a global minimum, Eb. The system begins with some energy, EB. We can draw
an analogy to a ball bearing rolling down a hill. The bearing rolls down the hill toward
the local minimum, Ea, but has insufficient energy to roll up the other side and down into
the global minimum.
each pair comprises a temperature and the number of sweeps at that tem¬
perature. A sweep is the number of times each unit has an opportunity
to change while the network is at a particular temperature. An example
of an annealing schedule is: ((10,5), (8,10), (4,20), (1,20), (0.1,50)).
You can imagine that executing such an annealing schedule would
take a considerable number of computations for a network of any size,
and you would be correct. To ensure that the network reaches the global
minimum energy, the temperature needs to be reduced very slowly. Often
we can live with an imperfect annealing schedule that may get us close
to the global minimum, but will get us there in our lifetime.
probPsi[inValue_,netIn_,temp_] :=
  If[Random[]<=prob[netIn,temp],1,psi[inValue,netIn]];
For[i=1,i<=numSweeps,i++,
  For[j=1,j<=numUnits,j++,
    (* select unit *)
    indx = Random[Integer,{1,numUnits}];
    (* net input to unit *)
    net=inVector . weights[[indx]];
    (* update input vector *)
    inVector[[indx]]=probPsi[inVector[[indx]],net,temp];
  ]; (* end For j *)
]; (* end For i *)
Listing 4.3
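As a quick, illustrative check of probPsi in isolation (values chosen only
for demonstration), a unit with a negative net input still turns on
occasionally at a moderate temperature:

SeedRandom[42];                    (* reproducible illustration *)
Table[probPsi[1,-0.5,2.0],{10}]    (* a mix of +1 and -1 values *)

Here prob[-0.5,2.0] is about 0.44, so roughly 44 percent of the entries
come out +1; the deterministic psi would always return -1 for this net
input.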
wts = Apply[Plus,Map[Outer[Times,#,#]&,trainPats]];
MatrixForm[wts]
(10 x 10 symmetric weight matrix; every diagonal element equals 3.
Representative rows: 1 3 -1 1 1 -3 1 -1 -3 1 and 3 1 -3 3 3 -1 3 1 -1 3)
stochasticHopfield[input1,wts,10,8];
i= 1
New input vector =
{-1, 1, 1, -1, -1, 1, -1, 1, -1, -1}
stochasticHopfield[inVector_,weights_,numSweeps_,temp_]:=
Module[{input, net, indx, numUnits, indxList, output},
  numUnits=Length[inVector];
  indxList=Table[0,{numUnits}];
  input=inVector;
  For[i=1,i<=numSweeps,i++,
    Print["i= ",i];
    For[j=1,j<=numUnits,j++,
      (* select unit *)
      indx = Random[Integer,{1,numUnits}];
      (* net input to unit *)
      net=input . weights[[indx]];
      (* update input vector *)
      output=probPsi[input[[indx]],net,temp];
      input[[indx]]=output;
      indxList[[indx]]+=1;
    ]; (* end For numUnits *)
    Print[ ];Print["New input vector = "];Print[input];
  ]; (* end For numSweeps *)
  Print[ ];Print["Number of times each unit was updated:"];
  Print[ ];Print[indxList];
]; (* end of Module *)
Listing 4.4
i= 2
New input vector =
{-1, -1, 1, -1, -1, 1, -1, -1, 1, -1}
i= 3
New input vector =
{-1, 1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 4
New input vector =
{-1, 1, 1, -1, -1, 1, -1, 1, 1, -1}
You should notice that the results tend to cluster around one of the train¬
ing vectors, but that there is some variation due to the finite temperature.
When you run this code on your computer, you will see different results
due to the random nature of the process.
Remember, this function executes the network only at a single tem¬
perature. To anneal this network properly, we should begin at a tem¬
perature considerably higher than 5, perform more sweeps, then reduce
the temperature according to some annealing schedule. You can use the
above code as a basis for a complete function that performs the entire
annealing process.
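A minimal sketch of such a function, assuming a variant of
stochasticHopfield that returns its final input vector rather than only
printing it, and a schedule given as a list of {temperature, sweeps} pairs:

annealHopfield[inVector_,weights_,schedule_] :=
Module[{input=inVector},
  Do[
    (* run the network at each temperature in turn *)
    input = stochasticHopfield[input,weights,
              schedule[[k,2]],schedule[[k,1]]],
    {k,Length[schedule]}
  ];
  Return[input]
]

For example, annealHopfield[input1, wts, {{10,5},{8,10},{4,20},{1,20},{0.1,50}}]
would execute the sample annealing schedule given earlier.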
The Hopfield network is not the only one that can be annealed in the man¬
ner of the previous section. A network called the Boltzmann Machine
performs an annealing during the learning process as well as during
the postlearning production process. Figures 4.4 and 4.5 illustrate two
variations of the Boltzmann machine architecture. Because the learning
procedure for the Boltzmann machine is so time consuming, we shall not
attempt to simulate it here.
Figure 4.4 In the Boltzmann-completion architecture, there are two layers of units: visible
and hidden. The network is fully interconnected between layers and among units on each
layer. The connections are bidirectional and the weights are symmetric, Wij = Wji. All
of the units are updated according to the stochastic method that we described for the
Hopfield network. The function of the Boltzmann-completion network is to learn a set of
input patterns and then to be able to supply missing parts of the patterns when a partial,
or noisy, input pattern is processed.
Figure 4.5 For the Boltzmann-input/output network, the visible units are separated into
input and output units. There are no connections among input units, and the connections
from input units to other units in the network are unidirectional. All other connections
are bi-directional, as in the Boltzmann-completion network. This network functions as a
heteroassociative memory. During the recall process, the input-vector units are clamped
permanently and are not updated during the annealing process. All hidden units and
output units are updated according to the simulated-annealing procedure described above
for the Hopfield network.
Suppose we have drawn a blue marble from a box that held one red, one
white, and one blue marble; what is the probability that our next choice
will be the white marble? The answer depends on what we do with the
blue marble that we have already chosen. If we replace the blue marble
in the box before we make a second choice, then the probability of
choosing the white marble is P(W) = 1/3, because the two events of
choosing are totally independent of one another. P(B) and P(W) are
called a priori probabilities. They are the probabilities we would assign
to the selection of a blue or white marble initially, without any other
information.
Let's look at the different ways that we can choose two objects, one
after the other, without replacing the first object in the box after it has
been chosen.
outcomes=Map[Drop[#,-1]&,Permutations[{R, W, B}]]
{{R, W}, {R, B}, {W, R}, {W, B}, {B, R}, {B, W}}
There are only two possibilities, having chosen the blue marble first: red
or white. Then the probability of choosing the white marble with our
next choice is 0.5. In other words, the probability changes based on the
results of the first choice. We shall call this new result the conditional
probability of choosing the white marble, given that we have already
chosen the blue marble and have not replaced it in the box. The symbol
that we give to this probability is P(W2\B). Similarly, the conditional
probability of choosing the blue marble given that we have already cho¬
sen the white one is denoted P(B2\W). Notice, however, that there are
two out of six ways in which the white marble may be chosen second.
Thus, P(W2) = 1/3.
We can also define the joint probability of choosing, for example, a
blue and a white marble as the result of two successive choices. We call
that value $P(B \cap W_2)$.
We already know that P(W) = 1/3, and P(B) = 1/3; also, P(R) =
1/3:
P[R]=1./3;
P[W]=1./3;
P[B]=1./3;
P[W2|B]=P[W2|R]=P[R2|B]=P[R2|W]=P[B2|W]=P[B2|R]=0.5;
P[BandW2]=1./6;
$$\frac{n_{B \cap W_2}}{n_B} = \frac{n_{B \cap W_2}/n}{n_B/n}$$
Figure 4.6 This figure shows a region of two-dimensional space broken up into three cate­
gories, or classes: B, C, and D. Each point in a given region belongs to the same class. The
circular region, A, is superimposed on the space.
selected first. The ratio, then, represents the ratio of the frequency of se¬
lecting a white and a blue marble given that a blue marble was selected
first. As the number of trials increases, the frequencies approach the true
probabilities of occurrence. Then the ratio becomes the conditional prob¬
ability of W given B. In other words, we define the conditional probability
as follows:

$$P(W_2|B) = \frac{P(B \cap W_2)}{P(B)}$$
You can see that this relationship holds for the above example:
P[BandW2]/P[B]
0.5
Bayes' theorem is a generalization of this last result. Let $B_1, B_2, \ldots, B_n$
partition a region of space, $S$. Further, let $A$ be an event associated with
one of the points of $S$. Then the probability that event $A$ is in region $B_i$
is

$$P(B_i|A) = \frac{P(A|B_i)\,P(B_i)}{\sum_j P(A|B_j)\,P(B_j)} = \frac{P(A|B_i)\,P(B_i)}{P(A)} \qquad (4.10)$$
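For instance, take $A = W_2$ in the marble example, with the partition given
by the color of the first marble drawn. Since $P(W_2|B) = P(W_2|R) = 0.5$
and $P(W_2|W) = 0$ (there is only one white marble), Eq. (4.10) gives

$$P(B|W_2) = \frac{P(W_2|B)\,P(B)}{P(W_2|B)\,P(B) + P(W_2|R)\,P(R) + P(W_2|W)\,P(W)} = \frac{(0.5)(1/3)}{1/3} = 0.5$$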
Bayes' theorem shows us how to convert the a priori probability,
$P(B_i)$, into an a posteriori probability, $P(B_i|A)$, based on first obtaining
the result $A$. Now that we have Bayes' theorem, let's move on to its
application to decision theory.
$$P(\mathbf{x}|A) = K\,P(\mathbf{x}|B)$$

where $K$ is a constant that depends on the a priori probabilities of the
two classes.
Figure 4.7 This figure shows a plane with two categories of points, A and B. The points
are uniformly distributed within each region. The exemplar points are indicated by the
black dots.
gauss2[x_,y_] :=
  (1/((2.0 Pi) sigma^2)) E^(-{(x-xa),(y-ya)} .
    {(x-xa),(y-ya)}/(2 sigma^2))
Clear[xa,ya];
ruleSetA = Map[{xa->#[[1]], ya->#[[2]]}&, exemplarsA]
{{xa -> 2.5, ya -> 1.5}, {xa -> 2.5, ya -> 2.5},
{xa -> 3.5, ya -> 1.5}, {xa -> 3.5, ya -> 2.5}}
gaussList={};
For[i=1,i<=Length[exemplarsA],i++,
  AppendTo[gaussList,gauss2[x,y]/.ruleSetA[[i]]];
];
gaussList
{0.5/(E^(((-2.5 + x)^2 + (-1.5 + y)^2)/(2 sigma^2)) Pi sigma^2),
 0.5/(E^(((-2.5 + x)^2 + (-2.5 + y)^2)/(2 sigma^2)) Pi sigma^2),
 0.5/(E^(((-3.5 + x)^2 + (-1.5 + y)^2)/(2 sigma^2)) Pi sigma^2),
 0.5/(E^(((-3.5 + x)^2 + (-2.5 + y)^2)/(2 sigma^2)) Pi sigma^2)}
ClearAll[classA];
classA[x_,y_] :=
Apply[Plus,gaussList]/Length[exemplarsA]
Plot3D[classA[x,y]/.sigma->0.1,{x,0,5},{y,0,4},
PlotPoints->25,PlotRange->All];
With a sigma of 0.1, the classA function represents the exemplars well,
but does not approximate the desired distribution over the entire class.
By adjusting the value of sigma, we can make the distribution more rea­
sonable.
Plot3D[classA[x,y]/.sigma->0.45,{x,0,5},{y,0,4},
  PlotPoints->25,PlotRange->All];
To continue with this example, you should repeat the above development
to construct a function classB, as sketched below. These functions can then
be used in the Bayes' decision rule, Eq. (4.11), to classify any point in the
plane.
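A sketch of classB that follows the classA construction step by step,
assuming exemplarsB holds the class-B exemplar points:

ClearAll[classB];
ruleSetB = Map[{xa->#[[1]], ya->#[[2]]}&, exemplarsB];
gaussListB = {};
For[i=1, i<=Length[exemplarsB], i++,
  AppendTo[gaussListB, gauss2[x,y]/.ruleSetB[[i]]];
];
classB[x_,y_] := Apply[Plus,gaussListB]/Length[exemplarsB]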
Figure 4.8 This figure shows the connectivity of a PNN designed to classify input vectors
into one of two classes. The input values are fully connected to the layer of pattern units.
Pattern units that correspond to a single class are connected to one summation unit. Both
summation units are connected to the output unit. Details of the processing performed by
these units are in the text.
Notice that points that are very far from either class will still be classified
in one of the two classes. You can fix this problem by requiring some
threshold value for either class function to validate the classification. Let's move
on now to the implementation of this methodology in a neural network.
There is one pattern unit for each exemplar in the training set. The weight
vector for each pattern unit is a copy of the corresponding exemplar, and
is also normalized. Moreover, training is accomplished by adding pattern
units with the appropriate weight vectors in place.
In the pattern units, we take the dot product of the input vector and
the weight vector, as is typically done in feed-forward networks. We then
apply a nonlinear output function, although we depart from the sigmoid
function commonly used. Instead of the sigmoid, we use the gaussian
function, $\exp\{(\mathrm{net}-1)/\sigma^2\}$, where $\mathrm{net} = \mathbf{x}\cdot\mathbf{w}$ is the net input to the
unit, $\mathbf{x}$ is the input vector, $\mathbf{w}$ is the weight vector, and $\sigma$ is a smoothing
parameter.
Since both $\mathbf{x}$ and $\mathbf{w}$ are normalized, the value of net is restricted to
the range $-1 \le \mathrm{net} \le 1$. Let's define the output function and plot it
between these limits.
gaussOut[x_] := E^((x-1)/sigma^2)
Plot[gaussOut[x]/.sigma->.7,{x,-1,1},PlotRange->All];
$$f_A(\mathbf{x}) = (2\pi)^{p/2}\,\sigma^p\, n_A\, P(\mathbf{x}|A)$$
$$\frac{P(A)\,f_A(\mathbf{x})}{n_A} = \frac{P(B)\,f_B(\mathbf{x})}{n_B}$$

or

$$f_A(\mathbf{x}) - C\,f_B(\mathbf{x}) = 0$$

where

$$C = \frac{P(B)\,n_A}{P(A)\,n_B}$$
This result suggests that we configure the output unit to have two
inputs: one from each summation unit. The connection from class A
will have a weight of one, and the connection from class B will have a
weight of -C. The output unit computes the net input as usual. The
output function can be the sign function: If the net input is positive, the
output is +1, corresponding to class A; if the net input is negative, the
output is -1, corresponding to class B.
You can make a final simplification if you can be sure that the number
of exemplars from each class is taken in proportion to the corresponding
a priori probability. In that case, C = 1 and the weight on the connection
from class B is just -1.
Let's step through an example using the classes described in the pre¬
vious section:
exemplarsA
exemplarsB
normalize[x_List] := x/(Sqrt[x.x]//N)
Since each exemplar array has more than one vector in it, use the Map
function to perform the normalization.
exemplarsAnorm = Map[normalize,exemplarsA]
exemplarsBnorm = Map[normalize,exemplarsB]
Before continuing on with the calculation, let's plot the original exemplars
from class A along with the normalized points.
p1 = ListPlot[exemplarsA,PlotStyle->PointSize[0.05],
  PlotRange->{{0,5},{0,4}},
  Prolog->{Line[{{2,1},{2,3}}],
           Line[{{2,3},{4,3}}],
           Line[{{4,3},{4,1}}],
           Line[{{4,1},{2,1}}]}];
l1 = ListPlot[exemplarsAnorm];
Show[{p1,l1}];
When you normalize vectors in the plane, all of the points project down
to the unit circle. All points that lie along a line in a particular direction
from the origin, when normalized, will fall on the same point on the
unit circle. Thus, you would not be able to separate classes that are
in different regions of the plane, but that lie along the same direction.
When you want to use normalization of input vectors, make sure that the
direction of the vector is the only relevant attribute, and that the vector's
magnitude does not matter. With that point made, let's return to the
calculation.
The weights on the pattern units are equal to the normalized exem¬
plar vectors.
weightsA = exemplarsAnorm;
weightsB = exemplarsBnorm;
inputs = Table[{Random[Real,{1,5}],
  Random[Real,{-4,4}]},{10}]
Net inputs to the pattern units are the standard dot products. Since
there is more than one input vector, we cannot simply dot inputsNorm with
weightsA and weightsB; instead we use Transpose:

inputsNorm = Map[normalize,inputs];
patternAnet = inputsNorm . Transpose[weightsA];
patternBnet = inputsNorm . Transpose[weightsB];
sigma = 0.45;
patternAout = gaussOut[patternAnet]
patternBout = gaussOut[patternBnet]
Now we must sum the outputs of the pattern units for each of the input
vectors.
sumAout = Map[Apply[Plus,#]&,patternAout]
sumBout = Map[Apply[Plus,#]&,patternBout]
The sign of the difference of each of these output values is the network
output for the corresponding input vector.
pnnTwoClass[class1Exemplars_,class2Exemplars_,
            testInputs_,sig_] :=
Module[{weightsA,weightsB,inputsNorm,patternAout,
        patternBout,sumAout,sumBout},
  weightsA = Map[normalize,class1Exemplars];
  weightsB = Map[normalize,class2Exemplars];
  inputsNorm = Map[normalize,testInputs];
  sigma = sig;
  patternAout =
    gaussOut[inputsNorm . Transpose[weightsA]];
  patternBout =
    gaussOut[inputsNorm . Transpose[weightsB]];
  sumAout = Map[Apply[Plus,#]&,patternAout];
  sumBout = Map[Apply[Plus,#]&,patternBout];
  outputs = Sign[sumAout-sumBout];
  sigma=.;
  Return[outputs];
]
Listing 4.5
outputs = Sign[sumAout-sumBout]
Summary
In this chapter we have explored two different ways in which proba¬
bilistic concepts can be used to advantage in neural networks. The Ising
model from statistical mechanics has a direct analog with the Hopfield
neural network. Using concepts from statistical mechanics, in particu¬
lar the concepts of temperature and stochastic processes, we can change
the processing performed by neural-network units from deterministic to
stochastic. In doing so, we can endow the network with the ability to
escape from local minima in the energy function.

Chapter 5
Optimization and Constraint Satisfaction
The subject of this chapter is a class of problems for which there may
be many solutions, but for which one solution may be judged to be
better than another. The classic example of this type of problem is the
traveling salesperson problem: Given a list of cities and a known cost of
traveling from one city to the next, what is the most efficient route such
that all cities are visited, no cities are visited twice, and the total distance
traveled, and hence cost, is kept to a minimum?
Conditions that we impose on the problem, such as the restriction
that each city be visited only once, are called strong constraints. Any
solution must satisfy all strong constraints. On the other hand, the desire
to minimize the cost is a weak constraint, since not all possible solutions
will be minimum-cost solutions. Let's look at some of the details, and
then apply neural networks to the solution of this problem.
If you choose a path at random through a given list of cities, you are
likely to find that it is not the most efficient in terms of cost. One way to
ensure efficiency is to compute the cost for all possible paths, and then
to follow the one with the least cost. Unfortunately, such a computation
may take an extremely long time. Given any $n$ cities, there are $n!$ possible
tours. If you consider the fact that, for a given tour, it does not matter
where you begin, or in which direction you travel, then the total number
of independent tours is $n!/(2n)$.
numTours[n_]:=n!/(2 n)
For a small number of cities, you would simply compute the cost of each
tour, and choose the one with the minimum. Unfortunately, for a tour
of more than a few cities, this exhaustive search can become quite time
consuming. For five cities, the number of possible tours is
numTours[5]
12
We could easily compute the cost of these twelve tours and select the most
efficient one. However, if there are ten cities on the tour, the number of
different possibilities is
numTours[10]
181440
You can see that the computations involved will quickly overwhelm us
for a tour of any size greater than a few cities. Let's examine some of the
details of this problem by considering a specific case: a five-city tour.
We can express the cost of travel from one city to another using a
matrix of values where each row refers to a starting city, and each column
refers to a destination city. We assume that the cost of travel from one city
to another is independent of the direction of travel, making the matrix
symmetric. Moreover, all of the diagonal elements will be zero, since
there will be no travel from one city to itself. We construct such a matrix
as follows, choosing random values for the costs. First, we construct a
lower triangular matrix.
costs =
  Table[If[i<=j,0,Random[Integer,{1,10}]],{i,5},{j,5}]
MatrixForm[%]
0 0 0 0 0
4 0 0 0 0
3 9 0 0 0
8 4 6 0 0
3 4 8 5 0
Adding the transpose produces the full symmetric cost matrix:

costs = costs + Transpose[costs];
MatrixForm[costs]
0 4 3 8 3
4 0 9 4 4
3 9 0 6 8
8 4 6 0 5
3 4 8 5 0
The costs of the first two sample tours, computed in the same way as t3
below, come out to 27 and 34.
t3=costs[[1,2]]+costs[[2,4]]+costs[[4,5]]+costs[[5,3]]+costs[[3,1]]
24
and so on. The cheapest tour is tour nine, and the range goes from 20 to
34.
Remember that your results may be quite different. Nevertheless,
you can see that any random selection is likely to result in a tour that is
not optimum from a cost standpoint.
It is often the case with problems such as the TSP that a good solution
obtained quickly is more desirable than the best solution obtained after
laborious calculation. In the example above, we might be quite satisfied
with any solution whose cost is less than 25. In the next section we shall
look at how to apply a neural network to this problem. We shall see that
the network can provide a solution quickly (relative to an exhaustive
search), but that the solution is not always the absolute best.
where $\lambda$ is a constant called the gain parameter. Let's plot this function
for several values of $\lambda$. We can use the Mathematica package, Graphics`Legend`,
to help us distinguish the various graphs.
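A definition of g consistent with the gain-parameter description and with
the comparison to sigmoid below (an inferred sketch, not necessarily the
exact original form) is the scaled hyperbolic tangent:

g[lambda_,u_] := 0.5 (1 + Tanh[lambda u])

At lambda = 0.5 this form coincides with the logistic sigmoid, which is
the point of the comparison plot below.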
<<Graphics`Legend`
Plot[{g[0.2,u],g[0.5,u],g[1,u],g[5,u]},{u,-5,5},
  PlotStyle->{GrayLevel[0],Dashing[{0.01}],
    Dashing[{0.03}],Dashing[{0.05}]},
  PlotLegend->{"l=.2","l=.5","l=1","l=5"}];
Plot[{g[0.5,u],sigmoid[u]},{u,-5,5>];
We shall denote v_i = g[λ, u_i] as the output of the ith unit. In real neurons,
there will be a time delay between the appearance of the outputs, v_j, of
other cells, and the resulting net input, u_i, to a cell. This delay is caused
by the resistance and capacitance of the cell membrane and the finite
conductance of the synapse between the jth and ith cells. These ideas are
incorporated into the circuit shown in Figure 5.1. At each connection,
we place a resistor having a value R_ij = 1/|T_ij|, where T_ij represents
the weight matrix. Inverting amplifiers simulate inhibitory signals. If
the output of a particular element excites some other element, then the
connection is made with the signal from the noninverting amplifier. If
the connection is inhibitory, it is made from the inverting amplifier.
Each amplifier has an input resistance, ρ, and an input capacitance,
C, as shown. Also shown are the external signals, I_i. In the case of an
actual circuit, the external signals would supply a constant current to
each amplifier.
The net-input current to each amplifier is the sum of the individual
current contributions from other units, plus the external-input current,
minus leakage across the input resistor, ρ. The contribution from each
connecting unit is the voltage value across the resistor at the connection,
divided by the connection resistance. For the connection from the jth
unit to the ith, this contribution would be (v_j - u_i)/R_ij = (v_j - u_i)T_ij.
The leakage current is u_i/ρ. If we make the definition

    1/R_i = 1/ρ + Σ_j 1/R_ij        (5.1)
then we can write a differential equation describing the input voltage for
each amplifier by considering the charging of the capacitor as a result of
Figure 5.1 In this circuit diagram for the continuous Hopfield memory, amplifiers with a sigmoid
output characteristic are the processing elements. The black circles at the intersection
points of the lines represent connections between processing elements.
the net-input current:

    C du_i/dt = Σ_j T_ij v_j - u_i/R_i + I_i        (5.2)
These equations, one for each unit in the memory circuit, completely
describe the time evolution of the system. Unfortunately, since these
equations are a set of coupled differential equations, they cannot be
solved in closed form. If each processing element is given an initial
value, u_i(0), these equations can be solved on a digital computer using
the numerical techniques for initial-value problems. Before we can proceed,
however, we must determine the values of the weight matrix, T_ij,
and the external inputs, I_i.
Provided that the gain parameter is sufficiently high, we can write the
energy function of the continuous Hopfield network as
    E = -(1/2) Σ_i Σ_j T_ij v_i v_j - Σ_i I_i v_i        (5.3)
The n-out-of-N Problem  Suppose for the moment that the only constraint
in the problem is that n cities are visited out of a total of N. We
can represent that constraint in the form of an equation as follows:

    Σ_{i=1..N} v_i = n        (5.4)

where n is the number of cities, and N = n² is the number of units in
the network. The energy function

    E = ( Σ_{i=1..N} v_i - n )²        (5.7)
[Figure 5.2 layout: (a) one group of five units per city:
 A: 01000   B: 10000   C: 00010   D: 00001   E: 00100
(b) the same bits arranged as a matrix, rows = cities A-E, columns = tour positions 1-5]
Figure 5.2 (a) In this representation scheme for the output vectors in a five-city TSP problem,
five units are associated with each of the five cities. The cities are labeled A through E.
The position of the 1 within any group of five represents the location of that particular
city in the sequence of the tour. For this example, the sequence is B-A-E-C-D with the
return to B assumed. Notice that N = n² processing elements are required to represent
the information for an n-city tour. (b) This figure shows an alternative way of looking at
the units. The processing elements are arranged in a two-dimensional matrix configuration
with each row representing a city and each column representing a position on the sequence
of the tour.
If we expand Eq. (5.7) and ignore the n² term, we can rewrite the
energy function as

    E = -(1/2) Σ_{i=1..N} Σ_{j≠i} (-2) v_i v_j - Σ_{i=1..N} v_i (2n - 1)        (5.8)

Comparing Eq. (5.8) with Eq. (5.3) identifies the weights and external inputs:

    T_ij = -2 if i ≠ j, 0 otherwise        (5.9)
    I_i = 2n - 1
for all i. Notice that each unit in the network exerts an inhibitory strength
of -2 on all other units in the network. Moreover, the number of units
to be on is strictly a function of the external inputs, Ii.
Before we continue on with the weight calculation for the TSP, let's
apply the results so far to a simple problem where we have four units,
and only two of them are to be on; it does not matter which two. We
can calculate the weight matrix from Eq. (5.9).
testWts1 = Table[If[i == j, 0, -2], {i,4}, {j,4}]
{{0, -2, -2, -2}, {-2, 0, -2, -2}, {-2, -2, 0, -2},
 {-2, -2, -2, 0}}
MatrixForm[%]
0 -2 -2 -2
-2 0 -2 -2
-2 -2 0 -2
-2 -2 -2 0
testIn1 = Table[2*2 - 1, {4}]
{3, 3, 3, 3}
We also choose values for lambda and deltat; in addition we should initialize the vi
array that we shall need later.
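The statement creating the initial ui values is missing from this copy; a sketch consistent with the reset branch of the nOutOfN function below:

ui = Table[Random[], {4}];   (* one random initial net input per unit *)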
lambda=2;
deltat = 0.01;
vi = g[lambda,ui];
vi
indx = Random[Integer, {1,4}]
ui[[indx]] = ui[[indx]] +
  deltat (vi . testWts1[[indx]] -
   ui[[indx]] + testIn1[[indx]])
0.31662
vi[[indx]] = g[lambda,ui[[indx]]]
0.78014
vi
nOutOfN[testWts1, testIn1, 4, 10, 0.1, 100, 20, True];
iteration = 20
net inputs = {1.63788, 0.772332, -0.209322, -0.534506}
outputs =
{1., 1., 0.0149727, 0.0000227684}
nOutOfN[weights_, externIn_, numUnits_, lambda_, deltaT_,
  numIters_, printFreq_, reset_:False] :=
 Module[{iter, l, dt, indx, ins},
  dt = deltaT;
  l = lambda;
  iter = numIters;
  ins = externIn;
  (* only reset if starting over *)
  If[reset, ui = Table[Random[], {numUnits}];
   vi = g[l, ui], Continue]; (* end of If *)
  Print["initial ui = ", N[ui,2]]; Print[];
  Print["initial vi = ", N[vi,2]];
  For[iter=1, iter<=numIters, iter++,
   indx = Random[Integer, {1,numUnits}];
   ui[[indx]] = ui[[indx]] +
     dt (vi . weights[[indx]] - ui[[indx]] + ins[[indx]]);
   vi[[indx]] = g[l, ui[[indx]]];
   If[Mod[iter, printFreq] == 0,
    Print[]; Print["iteration = ", iter];
    Print["net inputs = "];
    Print[N[ui,2]];
    Print["outputs = "];
    Print[N[vi,2]]; Print[];
   ]; (* end of If *)
  ]; (* end of For *)
  Print[]; Print["iteration = ", iter];
  Print["final outputs = "];
  Print[vi];
 ]; (* end of Module *)
Listing 5.1
iteration = 40
net inputs = {4.1303, 1.35895, -1.35663, -1.04605}
outputs =
{1., 1., 1.64622 10^-12, 8.20544 10^-10}
iteration = 60
net inputs = {4.64333, 5.11851, -2.13667, -2.62471}
outputs =
{1., 1., 2.71051 10^-19, 0.}
iteration = 80
net inputs = {5.82843, 7.9581, -4.05164, -7.54687}
outputs =
{1., 1., 0., 0.}
iteration = 100
net inputs = {11.097, 16.4568, -5.11248, -12.7648}
outputs =
{1., 1., 0., 0.}
iteration = 100
final outputs =
{1., 1., 0., 0.}
You can see that the network quickly settles on an appropriate solution.
Suppose we now require that one of the two "on" units come from the first pair of units and one from the second pair. We can superimpose inhibitory connections within each pair:
testWtsAdd = { {0,-2,0,0}, {-2,0,0,0}, {0,0,0,-2}, {0,0,-2,0} }
{{0, -2, 0, 0}, {-2, 0, 0, 0}, {0, 0, 0, -2}, {0, 0, -2, 0}}
MatrixForm[testWtsAdd]
0 -2 0 0
-2 0 0 0
0 0 0 -2
0 0 -2 0
testInAdd = {1,1,1,1}
{1, 1, 1, 1}
MatrixForm[testWts2 = testWts1 + testWtsAdd]
0 -4 -2 -2
-4 0 -2 -2
-2 -2 0 -4
-2 -2 -4 0
testIn2 = testIn1 + testInAdd
{4, 4, 4, 4}
TSP Solutions Let's rerun the network with the weights that we
calculated in the previous section. First we shall make a few changes in
the nOutOfN code to accommodate specifics of the TSP. The function tsp
appears in Listing 5.2.
One particular item to note about the code is the way we calculate
the initial u_i values. We know that the sum of the output values should
be equal to Sqrt[numUnits] when the network has settled on a solution.
We calculate an initial u_i so that the network starts out with the sum of
its outputs equal to the proper number, but in addition, we add a little
random noise to the values to give the network a start. Let's run the
code with the new weight and input values.
tsp[weights_, externIn_, numUnits_, lambda_, deltaT_,
  numIters_, printFreq_, reset_:False] :=
 Module[{iter, l, dt, indx, ins, utemp},
  dt = deltaT;
  l = lambda;
  iter = numIters;
  ins = externIn;
  (* only reset if starting over *)
  If[reset,
   utemp = ArcTanh[(2.0/Sqrt[numUnits]) - 1]/l;
   ui = Table[utemp + Random[Real, {-utemp/10, utemp/10}],
     {numUnits}]; (* end of Table *)
   vi = g[l, ui], Continue]; (* end of If *)
  Print["initial ui = ", N[ui,2]]; Print[];
  Print["initial vi = ", N[vi,2]];
  For[iter=1, iter<=numIters, iter++,
   indx = Random[Integer, {1,numUnits}];
   ui[[indx]] = ui[[indx]] +
     dt (vi . weights[[indx]] - (* symmetric weights: row equals column *)
       ui[[indx]] + ins[[indx]]);
   vi[[indx]] = g[l, ui[[indx]]];
   If[Mod[iter, printFreq] == 0,
    Print[]; Print["iteration = ", iter];
    Print["net inputs = "];
    Print[N[ui,2]];
    Print["outputs = "];
    Print[N[vi,2]]; Print[];
   ]; (* end of If *)
  ]; (* end of For *)
  Print[]; Print["iteration = ", iter];
  Print["final outputs = "];
  Print[MatrixForm[Partition[N[vi,2], Sqrt[numUnits]]]];
 ]; (* end of Module *)
Listing 5.2
tsp[testWts2,testln2,4,10,0.1,100,20,True];
iteration = 20
net inputs = {1.12986, -0.700786, 0.3861, 0.25233}
outputs =
{1., 8.18559 10^-7, 0.999557, 0.99361}
iteration = 40
net inputs = {3.20793, -1.61986, -1.13265, 1.23898}
outputs =
{1., 8.5124 10^-15, 1.45196 10^-10, 1.}
iteration = 60
net inputs = {5.62494, -1.98185, -4.71511, 4.31185}
outputs =
{1., 6.12574 10^-18, 0., 1.}
iteration = 80
net inputs = {6.38743, -4.41281, -13.8339, 8.1653}
outputs =
{1., 0., 0., 1.}
iteration = 100
net inputs = {8.14879, -9.36068, -23.5006, 17.8093}
outputs =
{1., 0., 0., 1.}
iteration = 100
final outputs =
{1., 0., 0., 1.}
Notice that the additional constraint has been satisfied by this solution:
one of the "on" units comes from the first pair, and one from the final
pair.
Think back to the data representation for the TSP given in Figure 5.2.
In order to account for the fact that each city is visited only once, units in
each row of Figure 5.2(b) would have to exert an inhibitory connection
on all other units in the same row. This situation amounts to a 1-out-of-5
problem for each row. Similarly, since you can only visit one city at a
time, units in each column would have to inhibit all other units in the
column.
We can construct the weight matrix for the TSP starting with the
original n-out-of-N. Because a five-city problem results in a weight matrix
having 25² = 625 elements, we would be better off here to consider a
simple three-city problem. In that case, n=3 and N=9, and the weight
matrix has only 81 elements. First, we construct the part of the weight
matrix that accounts for the 3-out-of-9 constraint.
MatrixForm[tspWts1 = Table[If[i == j, 0, -2], {i,9}, {j,9}]]
0 -2 -2 -2 -2 -2 -2 -2 -2
-2 0 -2 -2 -2 -2 -2 -2 -2
-2 -2 0 -2 -2 -2 -2 -2 -2
-2 -2 -2 0 -2 -2 -2 -2 -2
-2 -2 -2 -2 0 -2 -2 -2 -2
-2 -2 -2 -2 -2 0 -2 -2 -2
-2 -2 -2 -2 -2 -2 0 -2 -2
-2 -2 -2 -2 -2 -2 -2 0 -2
-2 -2 -2 -2 -2 -2 -2 -2 0
The corresponding external inputs are 2n - 1 = 5 for each unit:
tspIn1 = Table[2*3 - 1, {9}]
{5, 5, 5, 5, 5, 5, 5, 5, 5}
We can add the next constraint — that we visit each city only once —
with the following weights. Remember, each set of three units represents
all three cities at a particular position on the tour; thus, we must select
only one unit from units 1-3, one from 4-6, and one from 7-9. For ex¬
ample, among the first three units we would have inhibitory connections
between the unit pairs, 1-2, 1-3, 2-1, 3-1, 2-3, and 3-2. The appropriate
additions to the weight matrix and input values are as follows.
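The display of tspWts2 did not survive in this copy. A sketch that produces the inhibition pattern just described, with mutual inhibition inside each group of three units:

tspWts2 = Table[
   If[i != j && Quotient[i-1, 3] == Quotient[j-1, 3], -2, 0],
   {i,9}, {j,9}];   (* groups 1-3, 4-6, 7-9 inhibit internally *)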
tspIn2 = Table[2*1 - 1, {9}];
To account for the fact that we can only visit one city at a time,
we must inhibit the corresponding unit in each of the three groups; for
example, units 1, 4, and 7. The corresponding weights and inputs are
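The weight and input displays are likewise missing here; a sketch consistent with the description (units 1, 4, and 7 mutually inhibitory, and similarly for the other two triples) and with the total input vector shown later:

tspWts3 = Table[
   If[i != j && Mod[i-1, 3] == Mod[j-1, 3], -2, 0],
   {i,9}, {j,9}];   (* corresponding units across the groups inhibit one another *)
tspIn3 = Table[2*1 - 1, {9}];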
We must still account for the weak constraint: that constraint having
to do with the distances between the cities. Assume that the distance
between cities one and two is one unit, that between cities two and three
is two units, and that between one and three is three units. At a given step
on the tour, each unit corresponding to a certain city should inhibit the
units at either the next step or the previous step that correspond to other
cities, in proportion to the distance to that city. For example, unit four,
which corresponds to city one, tour position two, should inhibit units
eight (city two, position three) and nine (city three, position three), as
well as units two (city two, position one) and three (city three, position
one). Of course, since there is only one unique tour for the three-city
problem, the results that we get will be trivial. Nevertheless, the network
should settle on a solution that conforms to the strong constraints. The
corresponding weight matrix for the distances is
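The matrix display itself is lost. One plausible construction, assuming the stated distances d12 = 1, d23 = 2, d13 = 3, the 0.2 weak-constraint factor discussed below, and distance inhibition between every pair of tour positions (for three cities all positions are adjacent); the printed sum reproduced after this paragraph differs from this sketch in a few entries, so treat it only as an approximation of the author's construction:

dists = {{0, 1, 3}, {1, 0, 2}, {3, 2, 0}};   (* assumed city-to-city distances *)
tspWts4 = Table[
   If[Quotient[i-1, 3] != Quotient[j-1, 3],
    -0.2 dists[[Mod[i-1, 3] + 1, Mod[j-1, 3] + 1]],
    0], {i,9}, {j,9}];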
The final weight matrix is the sum of the four individual matrices, and
likewise for the external inputs. I have included the factor of 0.2 in the
above formula because the distance constraint is a weak constraint; thus
its effect on the network should not be such as to overpower the other
constraints. A factor of 0.2 may, in fact, be too small, but if you run
the network without any multiplicative factor, you will sometimes get a
solution with only two cities. This result is presumably because of the
stronger inhibitory connections due to the distances between cities. Here
is the weight matrix and the vector of external inputs:
MatrixForm[tspWts = tspWts1 + tspWts2 + tspWts3 + tspWts4]
0 -4 -4 -4 -2.2 -2.6 -4 -2.2 -2.6
-4 0 -4 -2.2 -4 -2.4 -2.2 -4 -2.4
-4 -4 0 -2.6 -2.4 -4 -2.6 -2.4 -4
-4 -2.2 -2.6 0 -4 -4 -4 -2 -2
-2.2 -4 -2.4 -4 0 -4 -2 -4 -2
-2.6 -2.4 -4 -4 -4 0 -2 -2 -4
-4 -2.2 -2.6 -4 -2 -2 0 -4 -4
-2.2 -4 -2.4 -2 -4 -2 -4 0 -4
-2.6 -2.4 -4 -2 -2 -4 -4 -4 0
tspIn = tspIn1 + tspIn2 + tspIn3
{7, 7, 7, 7, 7, 7, 7, 7, 7}
Now let's run the network. For space considerations, I have suppressed
printing of all but the initial values and final result.
tsp[tspWts, tspIn, 9, 50, 0.002, 800, 200, True];
iteration = 800
final outputs =
0.  1.  0.
1.  1.6 10^-19  0.
0.  0.  1.
The solution meets all of the strong constraints, as we might expect. Let's
try again.
tsp[tspWts, tspIn, 9, 50, 0.002, 800, 200, True];
iteration = 800
final outputs =
3.8 10^-14  5.1 10^-9  1.7 10^-14
0.  1.  0.
1.  0.  1.
Notice in this example that the network found a solution that violated
one of the strong constraints (although I had to run the program about
a dozen times before this solution appeared). We can attempt to fix this
problem by increasing the efficacy of the weights associated with those
constraints; thus, we can recompute tspWts as follows:
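The recomputation does not appear in this copy; one hypothetical version simply doubles the strong-constraint matrices while leaving the distance term alone:

tspWts = 2 (tspWts1 + tspWts2 + tspWts3) + tspWts4;   (* hypothetical rescaling *)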
tsp[tspWts, tspIn, 9, 50, 0.002, 800, 200, True];
I ran this program numerous times and never saw a forbidden solution.
That is not to say that you may never see one; nevertheless, the modi¬
fication appears to have helped. I suggest that you construct a four-city
problem on your own. The solutions will not be trivial in that case.
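The next example sets up a grouped n-out-of-N problem: one unit on among a first group of two, and two on among a second group of four, with units five and six serving as slack units. The matrix group1Wts, which inhibits the two units of the first group against each other, is missing from this copy, but its values follow directly from the total matrix shown below:

group1Wts = { { 0,-2, 0, 0, 0, 0},
              {-2, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0} };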
group2Wts = { { 0, 0, 0, 0, 0, 0},
{ 0, 0, 0, 0, 0, 0},
{ 0, 0, 0,-2,-2,-2},
{ 0, 0,-2, 0,-2,-2},
{ 0, 0,-2,-2, 0,-2},
{ 0, 0,-2,-2,-2, 0} };
The total weight matrix is the sum of the above two matrices,
MatrixForm[groupWts = group1Wts + group2Wts]
0 -2 0 0 0 0
-2 0 0 0 0 0
0 0 0 -2 -2 -2
0 0 -2 0 -2 -2
0 0 -2 -2 0 -2
0 0 -2 -2 -2 0
The first group of two units gets an external input of 2*1-1 = 1, while the
second group gets 2*2-1 = 3.
groupIn = {1,1,3,3,3,3}
{1, 1, 3, 3, 3, 3}
Let's run the network several times to see how the results are distributed.
Once again I have suppressed all output but the initial values and final
results.
nOutOfN[groupWts, groupIn, 6, 10, 0.1, 150, 150, True]
Notice that the first group of two has only one unit on, while the second
group has two out of four on. Since the two that are on in the second
group are units three and four, the actual solution that we are interested
in is 0, 1, 1, 1, assuming that units five and six are the slack units. Let's
try again.
nOutOfN[groupWts, groupIn, 6, 10, 0.1, 100, 100, True]
In this solution, neither unit three nor unit four is on, but that condition still
satisfies the constraint. The actual solution in this case is 0, 1, 0, 0. Let's
try one more time.
nOutOfN[groupWts, groupIn, 6, 10, 0.1, 100, 100, True]
Summary
Constraint satisfaction and optimization problems form a large class for
which the traveling salesperson problem is the prototypical example. In
this chapter we modified the Hopfield network to allow the units to take
on continuous values. Then, using a procedure based on n-out-of-N units
allowed to be in the "on" state, we showed how to calculate the weights
for a Hopfield network that solves the TSP. This method is quite general
and can be applied to many similar constraint-satisfaction problems.
Chapter 6
Feedback and Recurrent Networks
The title of this chapter implies the existence of neural networks whose
outputs find their way back to become inputs, or in which data moves
both forward and backward in the network. There are many varieties of
these networks. One such network is the Hopfield network, which was
the subject of Chapters 4 and 5. The Hopfield network is a derivative of
a two-layer network called a bidirectional associative memory (BAM).
(You could, alternatively, think of the BAM as a generalization of the
Hopfield network.) The BAM is a recurrent network that implements an
associative memory. We shall study the BAM first in this chapter.
Following the BAM, we shall look at two multilayer network architectures
that have feedback paths within their structures. These networks
are named after individuals: Elman and Jordan. With these architectures,
we shall be able to develop neural networks that can learn a time
sequence of input vectors.
6.1 The BAM
The BAM is similar to the Hopfield network in several ways. Like the
Hopfield network, we can compute a weight matrix in advance, provided
we know what we want to store. Moreover, you will notice a similarity
in the way the weights are determined and in the way processing is
done by the individual units in the BAM. This network implements a
heteroassociative memory rather than an autoassociative memory, as was
the case with the Hopfield network. For example, we might store pairs
of vectors representing the names and corresponding phone numbers of
our friends or customers. To see how we can accomplish this feat, let's
look at the BAM architecture.
Figure 6.1 illustrates the architecture of the BAM. The BAM comprises
two layers of units that are fully interconnected between the layers. The
units may, or may not, have feedback connections to themselves.
To apply the BAM to a particular problem, we assemble a set of pairs
of vectors where each pair comprises two pieces of information that we
would like to associate with each other; for example, a name and a phone
number. To store this information in the BAM, we must first represent
each datum as a vector having components in {-1, +1}. We refer to these
Figure 6.1 The BAM shown here has n units on the x layer, and m units on the y layer. For
convenience, we shall call the x vector the input vector, and the y vector, the output vector.
In this network all of the elements in either the x or y vectors must be members of the
set {—1, +1}. All connections between units are bi-directional with weights at each end.
Information passes back and forth from one layer to the other, through these connections.
Feedback connections at each unit may not be present in all BAM architectures.
exemplars = { {{1,-1,-1,1,-1,1,1,-1,-1,1}, {1,-1,-1,-1,-1,1}},
              {{1,1,1,-1,-1,-1,1,1,-1,-1}, {1,1,1,1,-1,-1}} };
There are two vector pairs in this example. The first vector of each pair
is the x-layer vector, and the second is the y-layer vector. There are 10
units on the x layer and six on the y layer, and hence, 10 connections to
each of the six y-layer units. We construct the weight matrix as the sum
of the outer products of the exemplar pairs:

    W = Σ_{i=1..L} y_i x_i^T        (6.1)
makeXtoYwts[exemplars_] :=
 Module[{temp},
  temp = Map[Outer[Times, #[[2]], #[[1]]]&, exemplars];
  Apply[Plus, temp]
 ]; (* end of Module *)
MatrixForm[x2yWts = makeXtoYwts[exemplars]]
2 0 0 0 -2 0 2 0 -2 0
0 2 2 -2 0 -2 0 2 0 -2
0 2 2 -2 0 -2 0 2 0 -2
0 2 2 -2 0 -2 0 2 0 -2
-2 0 0 0 2 0 -2 0 2 0
0 -2 -2 2 0 2 0 -2 0 2
MatrixForm[y2xWts = Transpose[x2yWts]]
2 0 0 0 -2 0
0 2 2 2 0 -2
0 2 2 2 0 -2
0 -2 -2 -2 0 2
-2 0 0 0 2 0
0 -2 -2 -2 0 2
2 0 0 0 -2 0
0 2 2 2 0 -2
-2 0 0 0 2 0
0 -2 -2 -2 0 2
Unit outputs will be -1, or +1, depending on the value of the net input to
the unit. We calculate net inputs in the usual manner of the dot product
between the input vector and the weight vector for each unit. Then the
output of the unit is given by
    s_i(t+1) = { +1      if net_i > 0
                 s_i(t)  if net_i = 0        (6.2)
                 -1      if net_i < 0 }
where s refers to a unit on either layer, and we use the discrete variable
t to denote a particular timestep. Notice that if the net input is zero, the
output does not change from what it was in the previous timestep.
The two functions, psi and phi, that we defined for the Hopfield net¬
work (see Section 4.1), also apply to the BAM. These functions implement
the output function of the BAM units as specified in Eq. (6.2).
psi[inValue_, netIn_] := If[netIn > 0, 1,
  If[netIn < 0, -1, inValue]];
phi[inVector_List, netInVector_List] :=
  MapThread[psi[#1, #2]&, {inVector, netInVector}];
The energy function for the BAM is also similar to that of the Hopfield
network.
    E = -y . W . x        (6.3)
We can use this function to calculate the energy of the network with the
exemplars.
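The energyBAM function comes from a listing on an earlier page; a one-line sketch consistent with Eq. (6.3) and with the values computed below:

energyBAM[outputs_, weights_, inputs_] := - outputs . weights . inputs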
energyBAM[exemplars[[1,2]], x2yWts, exemplars[[1,1]]]
-64
energyBAM[exemplars[[2,2]], x2yWts, exemplars[[2,1]]]
-64
2. Propagate the information from the x layer to the y layer and update
the values on the y-layer units. Although we shall consistently
begin with the x-to-y propagation, you could begin in the other
direction.
This algorithm is what gives the BAM its bi-directional nature. The
terms input and output refer to different quantities, depending on the
current direction of the propagation. For example, in going from y to x,
the y vector is considered as the input to the network, and the x vector
is the output. The opposite is true when propagating from x to y.
If all goes well, the final, stable state will recall one of the exemplars
used to construct the weight matrix. Since in this example, we assume we
know something about the desired x vector, but perhaps nothing about
the associated y vector, we hope that the final output is the exemplar
whose x_i vector is closest to the original input vector. There are many
definitions of the word close that are used when discussing neural net¬
works. In the case of the BAM, we use Hamming distance as the measure
of closeness between two vectors. The Hamming distance between two
vectors is the number of bits that differ between the two. The concept
applies equally to bipolar or binary vectors.
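As an illustration (not part of the original listings), a Hamming-distance function for such vectors might read:

hamming[u_, v_] := Count[u - v, x_ /; x != 0]   (* number of differing components *)

For example, hamming[{1,-1,1}, {1,1,1}] returns 1, and the definition works unchanged for binary vectors.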
The above scenario works well provided we have not overloaded the
BAM with exemplars. If we try to put too much information in a given
BAM, a phenomenon known as crosstalk occurs between exemplar patterns.
Crosstalk occurs when exemplar patterns are too close to each
other. The interaction between these patterns can result in the creation
of spurious stable states. In that case, the BAM could stabilize on meaningless
vectors. If we think of the BAM in terms of an energy surface
in weight space, each exemplar pattern occupies a deep minimum well
in the space. Spurious stable states correspond to energy minima that
appear between the minima that correspond to the exemplars.
bam[initialX_, initialY_, x2yWeights_, y2xWeights_, printAll_:False] :=
 Module[{done, newX, newY, energy1, energy2},
  done = False;
  newX = initialX;
  newY = initialY;
  While[done == False,
   newY = phi[newY, x2yWeights . newX];
   If[printAll, Print[]; Print[]; Print["y = ", newY]];
   energy1 = energyBAM[newY, x2yWeights, newX];
   If[printAll, Print["energy = ", energy1]];
   newX = phi[newX, y2xWeights . newY];
   If[printAll, Print[]; Print["x = ", newX]];
   energy2 = energyBAM[newY, x2yWeights, newX];
   If[printAll, Print["energy = ", energy2]];
   If[energy1 == energy2, done=True, Continue];
  ]; (* end of While *)
  Print[]; Print[];
  Print["final y = ", newY, " energy = ", energy1];
  Print["final x = ", newX, " energy = ", energy2];
 ]; (* end of Module *)
Listing 6.2
initX = {-1,-1,-1,1,-1,1,1,-1,-1,1};
initY = {1,1,1,1,-1,-1};
energyBAM[initY, x2yWts, initX]
40
bam[initX, initY, x2yWts, y2xWts, True]
ClearAll[initX, initY];
initX = {-1,1,1,-1,1,1,1,-1,1,-1};
initY = {-1,1,-1,1,-1,-1};
energyBAM[initY, x2yWts, initX]
-8
bam[initX, initY, x2yWts, y2xWts, True]
y = {-1, 1, 1, 1, 1, -1}
energy = -24
x = {-1, 1, 1, -1, 1, -1, -1, 1, 1, -1}
energy = -24
y = {-1, 1, 1, 1, 1, -1}
energy = -64
exemplars
{{{1, -1, -1, 1, -1, 1, 1, -1, -1, 1}, {1, -1, -1, -1, -1, 1}},
 {{1, 1, 1, -1, -1, -1, 1, 1, -1, -1}, {1, 1, 1, 1, -1, -1}}}
You will notice that the final output vectors do not match any of the
exemplars. Furthermore, they are actually the complement of the first
training pair, (x_out, y_out) = (x_1^c, y_1^c), where the c superscript refers to the
complement. This example illustrates a basic property of the BAM: If
you encode an exemplar, (x, y), you also encode its complement, (x^c, y^c).
Let's try a pair of random vectors to see if we always get an exemplar
or a complement of an exemplar.
initY = 2 Table[Random[Integer, {0,1}], {6}] - 1
{-1, 1, 1, -1, 1, -1}
bam[initX,initY,x2yWts,y2xWts,False]
6.2 Recognition of Time Sequences
The networks we have studied to this point are static: an output does
not depend on any previous output. There are applications, however, for
which a network that can accommodate a time-ordered sequence of patterns
would be necessary. An example, currently unattainable by any
neural network except a human one, would be learning the sequence of
finger and arm movements necessary to play a piece on the piano.
One way of encoding such a sequence is with a type of neural network
called an avalanche, in which a series of units are triggered in a
time sequence. A second way of dealing with time in a neural network
is to take the results of processing at one particular time step and feed
that data back to the network inputs at the next time step. This latter
method is the one that we shall explore in this section.
Figure 6.2 In this representation of the Elman network, outputs from each of the hidden-
layer units at timestep t, become additional inputs to the network at timestep t + 1. At
each timestep, information from all previous timesteps influences the output of the hidden
layer and hence also influences the network output.
ClearAll[ioPairsEl]
ioPairsEl =
(* inputs outputs *)
{ (* 1 2 3 1 2 3 *)
{{ {0.9, 0.1, 0.1}, {0.1, 0.9, 0.1} },
{ {0.1, 0.9, 0.1}, {0.1, 0.1, 0.9} },
{ {0.1, 0.1, 0.9}, {0.9, 0.1, 0.1} } },
{{ {0.1, 0.9, 0.1}, {0.9, 0.1, 0.1} },
{ {0.9, 0.1, 0.1}, {0.1, 0.1, 0.9} },
{ {0.1, 0.1, 0.9}, {0.1, 0.9, 0.1} }
} };
The degree of nesting in the ioPairsEl list maintains separation between
the individual sequences and between the pattern pairs within each sequence.
ioPairsEl[[1]]
ioPairsEl[[1,1]]
hidWts = Table[Table[Random[Real, {-0.5,0.5}],
   {inNumber+hidNumber}], {hidNumber}];
hidLastDelta = Table[Table[0, {inNumber+hidNumber}],
   {hidNumber}];
ioSequence = ioPairs[[Random[Integer, {1, Length[ioPairs]}]]];
Next we begin a second For loop to cycle through the individual patterns
in the sequence. Using i as the index, we select the next pattern to be
processed:
ioP = ioSequence[[i]];
then identify the input vector and desired output vector. In this case,
however, we concatenate the context units to the sequence's input vector.
inputs = Join[conUnits, ioP[[1]]];
outDesired = ioP[[2]];
conUnits = hidOuts;
Finally, we add the square of the current error value to the errorList.
AppendTo[errorList, outErrors.outErrors];
We then repeat the inner loop for as many patterns as there are in the
sequence. When finished with a sequence, we select another, reset the
context units to 0.5, and begin another inner loop. The entire program
appears in Listing 6.3. We also must write a new test program to ac¬
commodate the sequences. Call the new test program, elmanTest. This
program, which appears in Listing 6.4, is similar in intent to bpnTest, but
has an additional parameter, conNumber, that explicitly defines the number
of context units. Let's use this code to attempt our example problem.
For space considerations I have suppressed some of the printout from
the function. After this first run, I will also suppress the printout of the
results of individual patterns, showing only the error plot.
ClearAll[elOut];
elOut = 0;
Timing[elOut = elman[3,4,3,ioPairsEl,0.5,0.9,100];]
Sequence 1 input 1
inputs:
{0.5, 0.5, 0.5, 0.5, 0.9, 0.1, 0.1}
outputs:
elman[inNumber_, hidNumber_, outNumber_, ioPairs_, eta_, alpha_, numIters_] :=
 Module[{hidWts, outWts, ioP, inputs, hidOuts, outputs, outDesired,
   i, indx, hidLastDelta, outLastDelta, outDelta, errorList={},
   ioSequence, conUnits, hidDelta, outErrors},
  hidWts = Table[Table[Random[Real,{-0.5,0.5}], {inNumber+hidNumber}], {hidNumber}];
  outWts = Table[Table[Random[Real,{-0.5,0.5}], {hidNumber}], {outNumber}];
  hidLastDelta = Table[Table[0, {inNumber+hidNumber}], {hidNumber}];
  outLastDelta = Table[Table[0, {hidNumber}], {outNumber}];
  For[indx=1, indx<=numIters, indx++, (* begin forward pass; select a sequence *)
   ioSequence = ioPairs[[Random[Integer, {1, Length[ioPairs]}]]];
   conUnits = Table[0.5, {hidNumber}]; (* reset conUnits *)
   For[i=1, i<=Length[ioSequence], i++, (* process the sequence in order *)
    ioP = ioSequence[[i]]; (* pick out the next ioPair *)
    inputs = Join[conUnits, ioP[[1]]]; (* join context and input units *)
    outDesired = ioP[[2]];
    hidOuts = sigmoid[hidWts . inputs]; (* hidden-layer outputs *)
    outputs = sigmoid[outWts . hidOuts]; (* output-layer outputs *)
    outErrors = outDesired - outputs; (* calculate errors *)
    outDelta = outErrors (outputs (1-outputs));
    hidDelta = (hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
    outLastDelta = eta Outer[Times, outDelta, hidOuts] + alpha outLastDelta;
    outWts += outLastDelta; (* update weights *)
    hidLastDelta = eta Outer[Times, hidDelta, inputs] + alpha hidLastDelta;
    hidWts += hidLastDelta;
    conUnits = hidOuts; (* update context units *)
    (* put the sum of the squared errors on the list *)
    AppendTo[errorList, outErrors.outErrors];
   ]; (* end of For i *)
  ]; (* end of For indx *)
  Print["New hidden-layer weight matrix: "];
  Print[ ]; Print[hidWts]; Print[ ];
  Print["New output-layer weight matrix: "];
  Print[ ]; Print[outWts]; Print[ ];
  elmanTest[hidWts, outWts, ioPairs, hidNumber]; (* check how close we are *)
  errorPlot = ListPlot[errorList, PlotJoined->True];
  Return[{hidWts, outWts, errorList, errorPlot}];
 ] (* end of Module *)
Listing 6.3
elmanTest[hiddenWts_, outputWts_, ioPairVectors_, conNumber_, printAll_:False] :=
 Module[{inputs, hidden, outputs, desired, errors, i, j,
   conUnits, ioSequence, ioP},
  If[printAll, Print[]; Print["ioPairs:"]; Print[]; Print[ioPairVectors]];
  (* loop through the sequences *)
  For[i=1, i<=Length[ioPairVectors], i++,
   (* select the next sequence *)
   ioSequence = ioPairVectors[[i]];
   (* reset the context units *)
   conUnits = Table[0.5, {conNumber}];
   (* loop through the chosen sequence *)
   For[j=1, j<=Length[ioSequence], j++,
    ioP = ioSequence[[j]];
    (* join context and input units *)
    inputs = Join[conUnits, ioP[[1]]];
    desired = ioP[[2]];
    hidden = sigmoid[hiddenWts . inputs];
    outputs = sigmoid[outputWts . hidden];
    errors = desired - outputs;
    (* update context units *)
    conUnits = hidden;
    Print[ ];
    Print["Sequence ", i, " input ", j];
    Print[ ]; Print["inputs:"]; Print[ ];
    Print[inputs];
    If[printAll, Print[ ]; Print["hidden-layer outputs:"];
     Print[hidden]; Print[];];
    Print["outputs:"]; Print[ ];
    Print[outputs]; Print[];
    Print["desired:"]; Print[]; Print[desired]; Print[ ];
    Print["Mean squared error:"];
    Print[errors.errors/Length[errors]];
    Print[ ];
   ]; (* end of For j *)
  ]; (* end of For i *)
 ] (* end of Module *)
Listing 6.4
A variant, elmanComp, adds a competitive modification at the output layer:
outputs = outWts . hidOuts;
(* modify by inhibitory connections *)
outputs = sigmoid[outputs -
   0.3 Apply[Plus, outputs] + .5 outputs];
ClearAll[elOut];
elOut = 0;
elOut = elmanComp[3,4,3,ioPairsEl,0.5,0.9,100];
Figure 6.3 This figure illustrates the architecture of the Jordan network. Notice that units
called state units receive their inputs from the output-layer units instead of the hidden-layer
units as in the case of the Elman network. Notice also that there are connections between
all of the state units as well as feedback from each state unit to itself. The function of the
plan units and state units is described in the text.
Plan   State   Output
 0     0 0     0 1
 0     0 1     1 0
 0     1 0     1 1
 0     1 1     0 0
 1     0 0     1 1
 1     1 1     1 0
 1     1 0     0 1
 1     0 1     0 0
Table 6.1 This table shows the inputs and outputs for the counting example.
The state units take their values according to

    s_i(t+1) = μ s_i(t) + o_i(t)

where s_i is the output of the ith state unit, o_i is the output of the ith
output unit, and the value of μ determines the amount of influence of
previous time steps. If μ is less than one, then the influence of previous
time steps decreases exponentially as we look farther back in time. In the
following examples, we shall not use the connections between the state
units.
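Inside a jordan implementation this update is a single statement; a sketch, assuming the state-to-state connections are omitted as just stated:

stateUnits = mu stateUnits + outputs;   (* decay the old state, add the current outputs *)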
Let's apply the Jordan network to a simple example and discuss the
processing within the context of that example. We shall call this example
the counting example. For this first example, we shall assume that μ = 0.
Table 6.1 shows the various vectors in their proper sequence.
We require one plan unit that can take on a value of 0 or 1. The
sequence corresponding to a plan unit of 0 counts upward from binary
1. The other sequence counts down from binary 11. Because μ = 0, the
state units take on values equal to the output units at the previous time
step. The network will have two state units, two output units, and two
hidden units, although the number of hidden units is not specified by
the example. Although we are using binary units here, there is nothing
that precludes the use of continuous-value units.
There are not many changes required to convert the elman function
into a jordan function. First we need to add an additional parameter, mu,
to the calling sequence.
jordan[inNumber_, hidNumber_, outNumber_, ioPairs_,
  eta_, alpha_, mu_, numIters_]
Then we need to alter the size of the hidden-unit weight matrix and
hidWts = Table[Table[Random[Real, {-0.5,0.5}],
   {inNumber+outNumber}], {hidNumber}];
hidLastDelta = Table[Table[0, {inNumber+outNumber}],
   {hidNumber}];
stateUnits = Table[0.1,{outNumber}];
ioPairsJor = {
{{{0.1}, {0.1, 0.9}},
{{0.1}, {0.9, 0.1}},
{{0.1}, {0.9, 0.9}},
{{0.1}, {0.1, 0.1}}},
{{{0.9}, {0.9, 0.9}},
{{0.9}, {0.9, 0.1}},
{{0.9}, {0.1, 0.9}},
{{0.9}, {0.1, 0.1}}} };
Let's make a number of different runs using the Jordan network in var¬
ious configurations. There are several modifications that we can try in
order to assess the corresponding performance impact. You should be
aware that the runs that follow generally show only a relatively few iter¬
ations. If you actually want to reduce the error to a value that we would
consider appropriate for actual applications, you would likely have to
run the network for a significantly longer time. We restrict the number
of iterations here so that we can perform this experiment in a reasonable
time. Note, however, that many of the runs are quite time consuming. We
begin with the standard Jordan network as we have described it above.
For the first run, we set μ = 0. As with the Elman-network output, I have
edited out some information. For this first example I will leave intact the
results from individual inputs.
Timing[jordan[l,2,2,ioPairsJor,0.5,0.9,0,200];]
Sequence 1 input 1
inputs:
{0.1, 0.1, 0.1}
outputs:
{0.497832, 0.84752}
desired:
{0.1, 0.9}
Mean squared error:
0.0805123
Sequence 1 input 2
inputs:
{0.497832, 0.84752, 0.1}
outputs:
{0.48581, 0.112169}
desired:
{0.9, 0.1}
Mean squared error:
0.0858507
Sequence 1 input 3
inputs:
{0.48581, 0.112169, 0.1}
outputs:
{0.498776, 0.898134}
desired:
{0.9, 0.9}
Mean squared error:
0.080492
Sequence 1 input 4
inputs:
{0.498776, 0.898134, 0.1}
outputs:
{0.485458, 0.102813}
desired:
{0.1, 0.1}
If we consider all output values above 0.5 to be "1" and all below
0.5 to be "0," then this network appears to be on the verge of performing
well, although it seems to be stalled. Adjustments in the parameters may
help, but we shall not undertake such a study here. Instead, let's redo
the example with a nonzero value of μ.
Timing[jordan[l,2,2,ioPairsJor,0.5,0.9,0.1,200];]
The results here seem to be fairly close to those for μ = 0. Let's see if
increasing the number of hidden units helps.
Timing[jordan[l,4,2,ioPairsJor,0.5,0.9,0.1,100];]
It does not look like we accomplished very much with that change,
though more passes through the data may help. Let's try something
different. Instead of setting the stateUnits equal to the actual output units
during training, we can set them equal to the desired outputs. The func¬
tion jordan2 implements this variation. For this test I have set the μ factor
back to zero.
Timing[jordan2[1,4,2,ioPairsJor,0.5,0.9,0,100];]
The error dropped much faster for this run than it did for the previous
runs. Using the desired outputs as state vectors appears to have helped.
Let's make another change for purely aesthetic reasons. If you look at
the code, you will see that we are capturing the sum of the squares of the
errors as each pattern is presented to the network. When we print the
results, we show the mean squared error, averaged over all of the output
units. The function jordan2a plots the average mean squared error; that
is, the squared error averaged over the output units, then averaged over
all of the patterns in a sequence.
Timing[jordan2a[1,4,2,ioPairsJor,0.5,0.9,0,100];]
We can also try a different representation of the plan vector. The follow¬
ing uses a two-element plan vector for the counter problem.
ioPairsJor2 = {
{{{0.1, 0.9}, {0.1, 0.9}},
{{0.1, 0.9}, {0.9, 0.1}},
{{0.1, 0.9}, {0.9, 0.9}},
{{0.1, 0.9}, {0.1, 0.1}}},
{{{0.9, 0.1}, {0.9, 0.9}},
{{0.9, 0.1}, {0.9, 0.1}},
{{0.9, 0.1}, {0.1, 0.9}},
{{0.9, 0.1}, {0.1, 0.1}}}
};
Timing[jordan2a[2,4,2,ioPairsJor2,0.5,0.9,0,100];]
Although we are not seeing much difference here from run to run, such
changes often result in performance improvements. One final modification
represents a somewhat radical change from the way we have been
doing the generalized delta rule algorithm.
The Tanh[u] function has the same general shape as the sigmoid func¬
tion, but the limits are +1 and -1, rather than 0 and 1. We can use
Tanh in place of the sigmoid function and at the same time change the
representation of the vectors.
Plot[Tanh[u], {u,-4,4}];
For the desired-output vectors, we use -0.9 to represent binary 0, and 0.9
to represent binary 1. The corresponding iopair vectors for the counter
problem appear as follows:
ioPairsJor3 = {
  {{{-1}, {-0.9, 0.9}},
   {{-1}, { 0.9,-0.9}},
   {{-1}, { 0.9, 0.9}},
   {{-1}, {-0.9,-0.9}} },
  {{{1}, { 0.9, 0.9}},
   {{1}, { 0.9,-0.9}},
   {{1}, {-0.9, 0.9}},
   {{1}, {-0.9,-0.9}} } };
If you recall from Chapter 3, the calculation of the weight updates involved
the derivative of the output function. For the sigmoid, the derivative
turned out to be outputs(1-outputs) for the output layer, with a similar
expression for the hidden layer. In the case of the Tanh[u] output function,
the derivative is Sech[u]^2 = 1 - Tanh[u]^2, which for the output layer would
be (1-outputs^2). We must modify two expressions in the function to reflect
these differences.
outDelta = outErrors (outputs (1-outputs));
becomes
outDelta = outErrors (1-outputs^2);
and
hidDelta = (hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
becomes
hidDelta = (1-hidOuts^2) Transpose[outWts].outDelta;
We call the new function jordan2aTanh. We must also modify the test
function. Both functions appear in the appendix. Let's try the new func¬
tion.
jordan2aTanh[l,4,2,ioPairsJor3,0.1,0.9,0,100];
Sequence 1 input 1
inputs:
{-0.9, -0.9, -1}
outputs:
{-0.905053, 0.900023}
desired:
{-0.9, 0.9)
Mean squared error:
0.0000127663
Sequence 1 input 2
inputs:
{-0.9, 0.9, -1)
outputs:
{0.903497, -0.897313}
desired:
{0.9, -0.9)
Mean squared error:
9.72306 10^-6
Sequence 1 input 3
inputs:
{0.9, -0.9, -1)
outputs:
{0.910307, 0.885798}
desired:
{0.9, 0.9)
Mean squared error:
0.000153967
Sequence 1 input 4
inputs:
{0.9, 0.9, -1)
outputs:
{-0.893995, -0.959445}
desired:
{-0.9, -0.9}
Mean squared error:
0.00178488
Sequence 2 input 1
inputs:
{-0.9, -0.9, 1}
outputs:
{0.893995, 0.959445}
desired:
{0.9, 0.9}
Mean squared error:
0.00178488
Sequence 2 input 2
inputs:
{0.9, 0.9, 1}
outputs:
{0.905053, -0.900023}
desired:
{0.9, -0.9}
Mean squared error:
0.0000127663
Sequence 2 input 3
inputs:
{0.9, -0.9, 1}
outputs:
{-0.903497, 0.897313}
desired:
{-0.9, 0.9}
Mean squared error:
9.72306 10^-6
Sequence 2 input 4
inputs:
{-0.9, 0.9, 1}
outputs:
{-0.910307, -0.885798}
desired:
{-0.9, -0.9}
Mean squared error:
0.000153967
[Error plot: average mean squared error per sequence vs. iteration (1-100); the error falls from about 0.025 to well under 0.005 within the first 20 iterations.]
This version appears to work extremely well; notice that far fewer than
100 cycles would have been sufficient. You should be aware, however, of
two minor additional changes: First, note that the learning rate is only
0.1 instead of 0.5 for previous runs. Second, if you examine the code, you
will see that I initialized the state units to -0.9 instead of 0.1. I know that
it is not advisable to change more than one item at a time when running
these experiments, but I have done so here in the interest of space.
I have presented a large number of variations to further illustrate
the kind of experimentation that is often necessary to get a network to
perform adequately. Feel free to continue to experiment.
Summary
In this chapter we studied several networks that include feedback connections
as a part of their architecture. The BAM is a general form of a
network that reduces to the Hopfield network if we think of both layers
of BAM units as the same layer. The Elman and Jordan networks
incorporate feedback from hidden and output units respectively back to
the input layer. These networks can learn sequences of input patterns
using a backpropagation algorithm. Both networks are fertile ground for
experimentation with feedback structures.
Chapter 7
Adaptive Resonance Theory
One of the nice features of human memory is its ability to learn many
new things without necessarily forgetting things learned in the past. A
frequently cited example is the ability to recognize your parents even if
you have not seen them for some time and have learned many new faces
in the interim. Some popular neural networks, the backpropagation network
in particular, cannot learn new information incrementally without
forgetting old information, unless it is retrained with the old information
along with the new.
Another characteristic of most neural networks is that if we present
to them a previously unseen input pattern, there is generally no built-in
mechanism for the network to be able to recognize the novelty of the
input. The neural network doesn't know that it doesn't know the input
pattern. On the other hand, suppose that an input pattern is simply a
distorted or noisy version of one already learned by the network. If a
network treats this pattern as a totally new pattern, then it may be overworking
itself to learn what it has already learned in a slightly different
form.
We have been describing a situation called the stability-plasticity
dilemma. We can restate this dilemma as a series of questions: How
can a learning system remain adaptive (plastic) in response to significant
input, yet remain stable in response to irrelevant input? How does the
system know to switch between its plastic and its stable modes? How
can the system retain previously-learned information while continuing
to learn new information?
Adaptive resonance theory (ART) attempts to address the stability-
plasticity dilemma. A key to solving the stability-plasticity dilemma is
the addition of a feedback mechanism between the layers of the ART
network. This feedback mechanism facilitates the learning of new infor¬
mation without destroying old information, automatic switching between
stable and plastic modes, and stabilization of the encoding of the classes
done by the nodes. We shall discuss two classes of neural-network ar¬
chitectures that result from this approach. We refer to these network
architectures as ART1 and ART2. ART1 and ART2 differ in the nature
of their input patterns. ART1 networks require that the input vectors be
binary. ART2 networks are suitable for processing analog, or grey-scale,
patterns.
ART gets its name from the particular way in which learning and recall
interplay in the network. In physics, resonance occurs when a small-amplitude
vibration of the proper frequency causes a large-amplitude response in the system.
7.1 ART1
As mentioned in the introduction to this chapter, ART1 networks require
binary input vectors. This limitation is not necessarily a severe handicap,
since many problems lend themselves reasonably well to a binary rep¬
resentation. In this section we shall examine some of the details of the
ART1 architecture and processing. The treatment of those topics here will be brief.
The basic features of the ART architecture appear in Figure 7.1. The
two major subsystems are the attentional subsystem and the orienting
subsystem. Patterns of activity that develop over the units in the two
layers of the attentional subsystem are called short term memory (STM)
traces because they exist only in association with a single application of an
input vector. The weights associated with the bottom-up and top-down
connections between F1 and F2 are called long term memory (LTM)
traces because they encode information that remains a part of the network
for an extended period.
We shall delay any consideration of the mathematics governing the
ART1 network and describe the processing in the form of an algorithm.
Furthermore, we shall ignore, for the moment, the gain control system;
we will revisit that topic later in this section.
The following algorithm is brief, and omits many details of ART1
processing, but it does illustrate conceptually how the network responds
to inputs.
2. Determine the output of the F1-layer units and propagate that output
up to the F2 layer.
4. Send the output from the winning F2 unit back down to Fi where
it stimulates the appearance of a top-down template pattern.
Figure 7.1 This figure illustrates the ART1 system diagram. The two major subsystems are
the attentional subsystem and the orienting subsystem. F1 and F2 represent two layers
of units in the attentional subsystem. Units on each layer are fully interconnected to the
units on the other layer. Not shown are interconnects among the units on each layer. Other
connections between components are indicated by the arrows. A plus sign indicates an
excitatory connection and a minus sign indicates an inhibitory connection. The function of
the various subsystems is discussed in the text.
8. Reset all units, clear the input vector, and begin at step 1 with a
new input vector.
resulting in a reset signal to the F2 layer. The effect of the reset signal
depends on the state of the individual F2 unit. If the unit currently has a
nonzero output, that unit is disabled for the duration of the current input
vector. If the unit does not have a nonzero output, the unit ignores the
reset signal.
    V_i = Σ_j u_j z_ij        (7.3)
Figure 7.2 This figure shows a processing element, v_i, on the F1 layer of an ART1 network.
The activity of the unit is x1i. It receives a binary input value, I_i, from below, and an
excitatory signal, G, from the gain control. In addition, the top-down signals, u_j, from F2
are gated (multiplied by) weights, z_ij. Outputs, s_i, from the processing element go up to
F2 and across to the orienting subsystem, A.
F2. We shall set the inhibitory input to F1 units equal to unity. With the
above definitions, the equation for the equilibrium activity on F1 units becomes

    x1i = -B1 / (1 + C1)        (7.5)

Notice that the equilibrium activities are negative, meaning that the units
are kept in a highly inhibited state.
During the initial stages of processing, that is, when an input is
present from below, but there has yet been no response from F2, V
remains at zero, but the gain control has become active: G = 1. The
Figure 7.3 This figure shows a processing element, v_j, on the F2 layer of an ART1 network.
The activity of the unit is x2j. The unit v_j receives inputs from the F1 layer, the gain control
system, G, and the orienting subsystem, A. Bottom-up signals, s_i, from F1 are gated by
the weights, z_ji. Outputs, u_j, are sent back down to F1. In addition, each unit receives
a positive feedback term from itself, g(x2j), and sends an identical signal through an
inhibitory connection to all other units on the layer.
    x1i = I_i / (1 + A1(I_i + B1) + C1)        (7.6)
assume that we want a positive activity if the unit receives both a top-down
input and a nonzero bottom-up input (I_i = 1). This assumption
translates into conditions on the quantities in the numerator, namely

    V_i > (B1 - 1)/D1        (7.8)

    z_ij > (B1 - 1)/D1        (7.9)
In the complete analysis of processing on F1, a condition relating the
values of B1 and D1 arises. We state that condition here, but you can find
the details in Chapter 8 of Neural Networks.
    s_i = { 1  if x1i > 0
            0  if x1i ≤ 0 }        (7.11)
winner. The winning unit will have an output signal of unit strength,
and all other units will remain inactive. We calculate the net input to the
F2 units in the usual manner,

    T_j = Σ_i s_i z_ji        (7.12)

and the output of the F2 units is

    u_j = { 1  if T_j = max_k{T_k}
            0  otherwise }        (7.13)
    0 < z_ji(0) < L / (L - 1 + M)        (7.14)
where L > 1, and M is equal to the number of units on the Fi layer. Once
again, we shall not digress into the rather lengthy discussion of how we
derive this condition. Suffice it to say that this condition helps to ensure
that a unit, which has encoded a particular pattern, continues to win
over uncommitted units in the F2 layer when that particular pattern is
presented to the network.
We can also define the value of the output vector of the F1 units by
the following:

    S = { I         if F2 is inactive
          I ∩ V^J   if F2 is active }        (7.16)
Eq. (7.16) states that the output of F1 will be identical to the input
vector, I, if no top-down signal is present from F2, and will be equal to
the intersection of the input vector and the top-down template pattern,
V^J, received from the winning F2 unit, when F2 is active.
Provided we select a value for ρ which is less than or equal to one,
we can describe the matching condition as

    |S| / |I| ≥ ρ        (7.17)
If Eq. (7.17) holds, then the network will take the top-down template
as a match to the current input pattern.
Defining a vigilance criterion in this manner endows the network
with an important property called self-scaling. This property, illustrated
in Figure 7.4, enables the network to ignore minor differences in patterns,
such as might arise due to random noise, and yet remain sensitive to
major differences.
    z_ij = { 1  if v_i is active
             0  if v_i is inactive }        (7.18)

    z_ji = { L/(L - 1 + |S|)  if v_i is active
             0                if v_i is inactive }        (7.19)
Figure 7.4 This figure illustrates the self-scaling property of ART1 networks. (a) For a value
of ρ = 0.8, the existence of the extra feature in the center of the top-down pattern on
the right is ignored by the orienting subsystem, which considers both patterns to be of the
same class. (b) For the same value of ρ, these bottom-up and top-down patterns will cause
the orienting subsystem to send a reset to F2.
f1dim = 5;
f2dim = 6;
We set the system vigilance parameter at a high value to ensure exact
matches.
rho = 0.9;
The vector magnitude for ART1 networks has a different definition than
the standard vector magnitude.
vmag1[v_] := Count[v, 1]
resetflag1[outp_, inp_] := If[vmag1[outp]/vmag1[inp] < rho, True, False]
where outp refers to the output of Fx and inp refers to the input vector.
We shall also use a function that returns the index of the winning unit
in a competitive layer whose outputs are assembled into a vector, p. This
function assumes that the winning unit is the one having an activation
of val.
winner[p_,val_] := First[First[Position[p,val]]]
droplistinit = Table[1, {f2dim}]
{1, 1, 1, 1, 1, 1}
droplist = droplistinit
{1, 1, 1, 1, 1, 1}
The output functions for the two layers have identical definitions. For
the F1 layer:
h[x_] := If[x > 0, 1, 0]
s[x_] := Map[h, x]
and for the F2 layer:
u[x_] := Map[h, x]
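The parameter assignments fell at a page boundary; values consistent with every numerical result that follows (for example, 1/(1 + a1(1 + b1) + c1) = 0.118 and el/(el - 1 + f1dim) - 0.1 = 0.329) are:

a1 = 1; b1 = 1.5; c1 = 5; d1 = 0.9;   (* assumed values, checked against the outputs below *)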
MatrixForm[z12 = Table[
   Table[N[(b1-1)/d1 + .2, 3], {f2dim}], {f1dim}]]
el = 3;
Weights on connections from F1 units to F2 units comprise the matrix,
z21, having one row for each F2 unit and one column for each F1 unit.
The subtraction of 0.1 is optional, since the weights may be initialized
identically to L/(L - 1 + M) from Eq. (7.14).
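The initialization statement for z21 is missing from this copy; a sketch matching the art1Init function at the end of this section and the 0.329 entries shown later:

MatrixForm[z21 = Table[
   Table[N[el/(el - 1 + f1dim) - 0.1, 3], {f1dim}], {f2dim}]]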
The initial activities of the F1 units are all negative, according to Eq. (7.5).
xf1 = Table[-b1/(1 + c1), {f1dim}]
The function compete takes as its argument the list of activities on the F2
layer, and returns a new list in which only the unit with the largest activity
retains a nonzero activity; in other words, this function implements
competition on F2.
compete[f2Activities_] :=
 Module[{i, x, f2dim, maxpos},
  x = f2Activities;
  maxpos = First[First[Position[x, Max[x]]]];
  f2dim = Length[x];
  For[i=1, i<=f2dim, i++,
   If[i != maxpos, x[[i]] = 0; Continue] (* end of If *)
  ]; (* end of For *)
  Return[x];
 ]; (* end of Module *)
in[[1]]
{0, 0, 0, 1, 0}
xf1 = N[in[[1]]/(1 + a1*(in[[1]] + b1) + c1), 3]
{0, 0, 0, 0.118, 0}
The only unit with a nonzero activity, however, is the one with a nonzero
input from below. The output of F1 is
sf1 = s[xf1]
{0, 0, 0, 1, 0}
The dot product between the weights and input values from below determines
the net inputs to the F2 units.
t = N[z21 . sf1, 3]
xf2 = t
xf2 = compete[xf2]
{0.329, 0, 0, 0, 0, 0}
uf2 = u[xf2]
{1, 0, 0, 0, 0, 0}
We could combine the above two steps into one, but I left them separate
in order to remain true to the individual steps of the sequence.
Notice that the compete function, finding no clear winner of the competition,
returned the first, previously uncommitted unit on the list of F2
units.
We shall need to save the index of the winning unit for later.
windex = winner[uf2, 1]
1
Going back to the F1 layer, the net inputs back to F1 from F2 are
v = N[z12 . uf2, 3]
xf1 = N[(in[[1]] + d1*v - b1)/(1 + a1*(in[[1]] + d1*v) + c1), 3]
sf1 = s[xf1]
{0, 0, 0, 1, 0}
Notice that the new output vector is identical to the input vector. We
expect, therefore, that the orienting subsystem, implemented partially
as the resetflag1 function, will indicate no mismatch.
resetflag1[sf1, in[[1]]]
False
A False from resetflag1 indicates that resonance has been reached; therefore,
we can adjust the weights on both layers. The following procedure
sets the windex element of each weight vector on F1 equal to 1 if the output
of the corresponding F1 unit is equal to 1.
z12 = Transpose[z12];
z12[[windex]] = sf1; (* just use the values in sf1 *)
MatrixForm[z12 = Transpose[z12]]
As you can see, the weights on the F1 units have encoded the input
vector, but the encoding is distributed over the entire set of units. Since
unit 1 was the winner on F2, weights from that unit back to each F1 unit
are the only ones to be changed. On the other hand, only the winning
unit on F2 has its weights updated, according to Eq. (7.19). Again you will see that the first unit
on F2 has encoded the input vector.
z21[[windex]] = N[el sf1/(el - 1 + vmag1[sf1])]
{0, 0, 0, 1., 0}
Now let's repeat the matching process for an input vector that is orthogonal
to in[[1]].
in[[2]]
{0, 0, 1, 0, 1}
xf1 = N[in[[2]]/(1 + a1*(in[[2]] + b1) + c1), 3]
sf1 = s[xf1]
{0, 0, 1, 0, 1}
t = N[z21 . sf1, 3]
xf2 = t
Notice that since the first unit on F2 has encoded a vector orthogonal to
the current input vector, the net input to that unit is zero.
xf2 = compete[xf2]
{0, 0.657, 0, 0, 0, 0}
Once again, compete returned the first uncommitted unit, since none of the
units was a clear winner. Continuing on,
uf2 = u[xf2]
{0, 1, 0, 0, 0, 0}
windex = winner[uf2, 1]
2
v = N[z12 . uf2, 3]
xf1 = N[(in[[2]] + d1*v - b1)/(1 + a1*(in[[2]] + d1*v) + c1), 3]
sf1 = s[xf1]
{0, 0, 1, 0, 1}
resetflag1[sf1, in[[2]]]
False
Once again, we have resonance in one pass through the network. Since
unit 2 on F2 is the winner, the second weight values on the F1 units
encode the new input vector.
z12 = Transpose[z12];
z12[[windex]] = sf1; (* just use the values in sf1 *)
MatrixForm[z12 = Transpose[z12]]
and, the second unit on F2 also encodes the input vector. Notice, however,
that the nonzero weight values are not equal to one in this case, since
vmag1[sf1] is not equal to one. Nevertheless, the pattern of the weights is
identical to the input vector.
z21[[windex]] = N[el sf1/(el - 1 + vmag1[sf1])];
MatrixForm[z21]
0 0 0 1. 0
0 0 0.75 0 0.75
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
Now let's try an input vector that is a subset of the input vector, in[[2]],
namely in[[7]].
in[[7]]
{0, 0, 0, 0, 1}
xf1 = N[in[[7]]/(1 + a1*(in[[7]] + b1) + c1), 3]
{0, 0, 0, 0, 0.118}
sf1 = s[xf1]
{0, 0, 0, 0, 1}
t = N[z21 . sf1, 3]
xf2 = t
{0, 0.75, 0, 0, 0, 0}
Even though unit 2 has already encoded a vector different than the current
input vector, there is enough similarity between the encoded vector
and the input vector to allow unit 2 to win the competition. Trace through
the following six steps very carefully.
uf2 = u[xf2]
{0, 1, 0, 0, 0, 0}
uindex = winner[uf2,1]
v = N[z12 . uf2,3];
xf1 = N[(in[[7]]+d1*v-b1)/(1+a1*(in[[7]]+d1*v)+c1),3];
sf1 = s[xf1]
{0, 0, 0, 0, 1}
resetflag1[sf1,in[[7]]]
False
z12=Transpose[z12];
z12[[uindex]]=sf1; (* just use the values in sf1 *)
MatrixForm[z12=Transpose[z12]]
z21[[uindex]] = N[e1/(e1-1+vmag1[sf1]) sf1,3]
{0, 0, 0, 0, 1.}
MatrixForm[z21]
0 0 0 1. 0
0 0 0 0 1.
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
Now let's put the superset vector back in to see what happens.
in[[2]]
{0, 0, 1, 0, 1}
xf1 = N[in[[2]]/(1+a1*(in[[2]]+b1)+c1),3];
sf1 = s[xf1]
{0, 0, 1, 0, 1}
t = N[z21 . sf1,3];
xf2 = t
{0, 1., 0, 0, 0, 0}
Once again, unit two wins the competition. The ART1 network will not
turn out to be very useful if this unit again recodes itself to the superset
vector. If that situation prevailed, we would not be able to encode both
a superset and a subset vector at the same time, a serious limitation in a
network that uses only binary vectors as inputs. Let's see what happens.
uf2 = u[xf2]
{0, 1, 0, 0, 0, 0}
windex = winner[uf2,1]
v = N[z12 . uf2,3]
{0, 0, 0, 0, 1.}
xf1 = N[(in[[2]]+d1*v-b1)/(1+a1*(in[[2]]+d1*v)+c1),3];
sf1 = s[xf1]
{0, 0, 0, 0, 1}
resetflag1[sf1,in[[2]]]
True
If[resetflag1[sf1,in[[2]]]==True,
    droplist[[windex]]=0,Continue]
droplist
{1, 0, 1, 1, 1, 1}
We shall see momentarily how we employ this droplist. For the moment,
let's reestablish the input vector and begin another matching cycle.
in[[2]]
{0, 0, 1, 0, 1}
xf1 = N[in[[2]]/(1+a1*(in[[2]]+b1)+c1),3];
sf1 = s[xf1]
{0, 0, 1, 0, 1}
t = N[z21 . sf1,3];
t = t droplist;   (* exclude the inhibited unit from the competition *)
xf2 = compete[t]
{0, 0, 0.657, 0, 0, 0}
Having eliminated the second unit from the competition, compete returns
the third unit as the winning unit.
uf2 = u[xf2]
{0, 0, 1, 0, 0, 0}
windex = winner[uf2,1];
v = N[z12 . uf2,3];
xf1 = N[(in[[2]]+d1*v-b1)/(1+a1*(in[[2]]+d1*v)+c1),3];
sf1 = s[xf1]
{0, 0, 1, 0, 1}
resetflag1[sf1,in[[2]]]
False
Since the third unit had not previously encoded a pattern, we do not get
a reset, and the superset vector is encoded by the third unit. Now both
superset and subset vectors are encoded in the network independently, as
the following weight matrices show.
z12=Transpose[z12];
z12[[windex]]=sf1; (* just use the values in sf1 *)
MatrixForm[z12=Transpose[z12]]
z21[[windex]] = N[e1/(e1-1+vmag1[sf1]) sf1,3];
MatrixForm[z21]
0 0 0 1. 0
0 0 0 0 1.
0 0 0.75 0 0.75
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
If other subset vectors had been previously encoded in the network, then
we may have had other resets until all units encoding subset vectors had
been disabled by the orienting subsystem. At the end of the matching
cycle, the droplist vector will have zeros in all positions corresponding
to units that won the competition on F2 but then caused a reset.
Thus, before beginning the next matching cycle with a new input vector,
you must remember to reinitialize droplist as follows:
droplist = droplistinit;
Experiment with the other input vectors in the list. You will learn more
about how ART1 operates by this experimentation than you will by
reading about it.
The parameters del1 and del2 in the initialization routine that follows
are used to alter the initial values of the weights slightly. Be sure to select
del1 and del2 within the constraints of the weight initialization equations,
Eqs. (7.9) and (7.14).
art1Init[f1dim_,f2dim_,b1_,d1_,e1_,del1_,del2_] :=
Module[{z12,z21},
 z12 = Table[
   Table[(b1-1)/d1 + del1,{f2dim}],{f1dim}];
 z21 = Table[
   Table[(e1/(e1-1+f1dim)-del2),{f1dim}],{f2dim}];
 Return[{z12,z21}];
]; (* end of Module *)
For the letter example that we shall perform later in this section, we shall
set the dimension of the F1 layer to 25 and the dimension of the F2 layer
to six. For now, let's rerun the example of the previous section, using the
same parameter values that we used there.
{topDown,bottomUp} =
    art1Init[5,6,1.5,0.9,3,0.2,0.1];
ins = { {0,0,0,1,0},{0,0,1,0,1},{0,0,0,0,1} };
{td,bu,mlist} = art1[5,6,1,1.5,5,0.9,3,0.9,topDown,bottomUp,ins];
236 Chapter 7. Adaptive Resonance Theory
art1[f1dim_,f2dim_,a1_,b1_,c1_,d1_,e1_,rho_,f1Wts_,f2Wts_,inputs_] :=
Module[{droplistinit,droplist,notDone=True,i,nIn=Length[inputs],reset,
        n,sf1,t,xf2,uf2,v,windex,matchList,newMatchList,tdWts,buWts},
 droplistinit = Table[1,{f2dim}];          (* initialize droplist *)
 tdWts=f1Wts; buWts=f2Wts;
 matchList =    (* construct list of F2 units and encoded input patterns *)
   Table[{StringForm["Unit ``",n]},{n,f2dim}];
 While[notDone==True,newMatchList = matchList; (* process until stable *)
  For[i=1,i<=nIn,i++,in = inputs[[i]];     (* process inputs in sequence *)
   droplist = droplistinit; reset=True;    (* initialize *)
   While[reset==True,                      (* cycle until no reset *)
    xf1 = in/(1+a1*(in+b1)+c1);            (* F1 activities *)
    sf1 = Map[If[#>0,1,0]&,xf1];           (* F1 outputs *)
    t = buWts . sf1;                       (* F2 net-inputs *)
    t = t droplist;                        (* turn off inhibited units *)
    xf2 = compete[t];                      (* F2 activities *)
    uf2 = Map[If[#>0,1,0]&,xf2];           (* F2 outputs *)
    windex = winner[uf2,1];                (* winning index *)
    v = tdWts . uf2;                       (* F1 net-inputs *)
    xf1 = (in+d1*v-b1)/(1+a1*(in+d1*v)+c1); (* new F1 activities *)
    sf1 = Map[If[#>0,1,0]&,xf1];           (* new F1 outputs *)
    reset = resetflag1[sf1,in,rho];        (* check reset *)
    If[reset==True,droplist[[windex]]=0;   (* update droplist *)
      Print["Reset with pattern ",i," on unit ",windex],Continue];
   ]; (* end of While reset==True *)
   Print["Resonance established on unit ",windex," with pattern ",i];
   tdWts=Transpose[tdWts]; (* resonance, so update weights, top down first *)
   tdWts[[windex]]=sf1;
   tdWts=Transpose[tdWts];
   buWts[[windex]] = e1/(e1-1+vmag1[sf1]) sf1; (* then bottom up *)
   matchList[[windex]] =                   (* update matching list *)
     Reverse[Union[matchList[[windex]],{i}]];
  ]; (* end of For i=1 to nIn *)
  If[matchList==newMatchList,notDone=False; (* see if matchList is static *)
    Print["Network stable"],
    Print["Network not stable"];
    newMatchList = matchList];]; (* end of While notDone==True *)
 Return[{tdWts,buWts,matchList}];
]; (* end of Module *)
Listing 7.1
The network exhibits the same behavior that we saw in the previous
section. On the first cycle through the patterns, unit two of F2 was
recoded to the subset vector after previously encoding the superset vector.
On the second cycle through, unit two caused a reset on the second
pattern that was subsequently encoded on unit 3. A third cycle produced
no changes so the network was declared stable. The match list is
TableForm[mlist]
Unit 1   1
Unit 2   3   2
Unit 3   2
Unit 4
Unit 5
Unit 6
To interpret this list, you must recognize that the patterns encoded
by each unit are listed from left to right, with the most recently encoded
pattern appearing first on the left (immediately after the unit number). If
a pattern appears on more than one unit's list, the unit with that pattern
farthest to the left is the one that currently encodes the pattern. For
example, pattern two appears on the lists of both unit two and unit three.
Since the encoding on unit three is more recent, that unit currently
encodes pattern two. This list may get difficult to interpret if there are a
large number of patterns and recodings, but it is simple to implement
here and serves the immediate illustrative purpose.
You can compare the new weight matrices to verify that the calculations
were correct. The top-down weights are
N[MatrixForm[td],3]
and the bottom-up weights are
N[MatrixForm[bu],3]
0 0 0 1. 0
0 0 0 0 1.
0 0 0.75 0 0.75
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
ListDensityPlot[Reverse[Partition[letterIn[[6]],5]]]
letterIn = { {0,0,1,0,0,
              0,1,0,1,0,
              1,0,0,0,1,    (* A *)
              1,1,1,1,1,
              1,0,0,0,1},
             {1,1,1,1,0,
              1,0,0,0,1,
              1,1,1,1,0,    (* B *)
              1,0,0,0,1,
              1,1,1,1,0},
             {1,1,1,1,1,
              1,0,0,0,0,
              1,0,0,0,0,    (* C *)
              1,0,0,0,0,
              1,1,1,1,1},
             {1,1,1,1,0,
              1,0,0,0,1,
              1,0,0,0,1,    (* D *)
              1,0,0,0,1,
              1,1,1,1,0},
             {1,1,1,1,1,
              1,0,0,0,0,
              1,1,1,1,0,    (* E *)
              1,0,0,0,0,
              1,1,1,1,1},
             {1,1,1,1,1,
              1,0,0,0,0,
              1,1,1,1,0,    (* F *)
              1,0,0,0,0,
              1,0,0,0,0} };
Listing 7.2
These six patterns have sufficient similarities between several letter pairs
that relatively minor changes in the vigilance parameter can drastically
affect how the network encodes them. Let's look at two examples. First,
consider ρ = 0.9, which for all intents means perfect matching for this
example, and use the same parameters to initialize the weights that we
used in the previous example.
{topDown,bottomUp} =
    art1Init[25,6,1.5,0.9,3,0.2,0.1];
{td,bu,mlist} = art1[25,6,1,1.5,5,0.9,3,0.9,topDown,bottomUp,letterIn];
For the second example, let's raise e1 from 3 to 25. The initial bottom-up
weight value, e1/(e1-1+f1dim), then becomes
N[25/(25-1+25),3]
0.51
{topDown,bottomUp} = art1Init[25,6,1.5,0.9,25,0.2,0.1];
{td,bu,mlist} = art1[25,6,1,1.5,5,0.9,25,0.9,topDown,bottomUp,letterIn];
The parameter change reduced the number of resets from 15 down to 10.
There are a great many more such experiments that you can perform to
gain experience with the ART1 network.
7.2 ART2
Superficially, the main difference between ART1 and ART2 is that ART2
accepts input vectors whose components can have any real number as
their value. In its execution, the ART2 network is considerably different
from the ART1 network.
Aside from the obvious fact that binary and analog patterns differ
in the nature of their respective components, ART2 must deal with
additional complications. For example, ART2 must be able to recognize
the underlying similarity of identical patterns superimposed on constant
backgrounds having different levels. Compared in an absolute sense, two
such patterns may appear entirely different when, in fact, they should be
classified as the same pattern.
The price for this additional capability is primarily an increase in
complexity on the F1 processing level. The ART2 F1 level comprises
several sublevels and several gain-control systems. Processing on F2 is
the same for ART2 as it was for ART1. As partial compensation for the
added complexity on the F1 layer, the weight update equations are a bit
simpler for ART2 than they were for ART1. In a software simulation,
however, weight updates have the same complexity in either network.
Figure 7.5 This figure shows the ART2 architecture. The overall structure is the same as
that of ART1. The F1 layer has been divided into six sublayers, w, x, u, v, p, and q. Each
node labeled G is a gain-control unit that sends an inhibitory signal to each unit on the
layer it feeds. All sublayers on F1, as well as the r layer of the orienting subsystem, have
the same number of units. Individual sublayers on F1 are connected unit-to-unit; that is,
the layers are not fully interconnected, with the exception of the bottom-up connections to
F2 and the top-down connections from F2.
Layer   A   D   J+ (excitatory input)      J- (inhibitory input)
w       1   1   I_i + a u_i                0
x       e   1   w_i                        |w|
u       e   1   v_i                        |v|
v       1   1   f(x_i) + b f(q_i)          0
p       1   1   u_i + Sum_j g(y_j) z_ij    0
q       e   1   p_i                        |p|
r       e   1   u_i + c p_i                |u| + |c p|
Table 7.1 This table summarizes the factors in Eq. (7.21) for each of the sublayers on F1,
and the r layer. I_i is the ith component of the input vector; a, b, and c are constants; e is a
small, positive constant used to prevent division by zero in the case where the magnitude
of the vector quantities is zero; y_j is the activity of the jth unit of the F2 layer; f and g
are functions that are described in the text.
Using the definitions from Table 7.1, we can write the equilibrium
equations for each of the sublayers on F1 as follows:

    w_i = I_i + a u_i                     (7.22)

    x_i = w_i/(e + |w|)                   (7.23)

    v_i = f(x_i) + b f(q_i)               (7.24)

    u_i = v_i/(e + |v|)                   (7.25)

    p_i = u_i + Sum_j g(y_j) z_ij         (7.26)

    q_i = p_i/(e + |p|)                   (7.27)
The function f() acts as a thresholding function that the F1 layer uses
as a filter for noise. A sigmoid function would work well in this case,
but we shall use a simpler, linear threshold function:

    f(x) = 0   if 0 <= x < theta
    f(x) = x   if x >= theta              (7.28)
Each F2 unit receives a net input equal to the dot product of the p-sublayer
output with that unit's bottom-up weight vector,

    T_j = Sum_i p_i z_ji                  (7.29)

and we then determine a winning unit according to which has the largest
net-input value.
g(y) is the output function for the units on the F2 layer. Since the F2
layer is a winner-take-all competitive layer, just as in ART1, g(y) has a
particularly simple form:

    g(y_j) = d   if T_j = max_k {T_k}
    g(y_j) = 0   otherwise               (7.30)

With this definition of g, Eq. (7.26) for the p sublayer reduces to

    p_i = u_i              if F2 is inactive
    p_i = u_i + d z_iJ     if the Jth node on F2 is active    (7.31)
During learning, the top-down weight vector on the winning F2 unit
moves toward the pattern on the u sublayer; at equilibrium,

    z_J = u/(1 - d)                       (7.32)

The top-down weights are initialized to zero so that there is no reset at the
time when a new F2 node is being recruited to encode a new input
pattern. Weights on F2 are initialized with fairly large values in order
to bias the network toward the selection of a new, uncommitted node,
which, as in ART1, helps to keep the number of matching cycles to a
minimum. There are a number of alternate ways to initialize the bottom-up
weights. We shall stay with the method described in Chapter 8 of
Neural Networks:

    z_ji(0) <= 1/((1 - d) Sqrt[M])        (7.33)

Similar considerations lead to a condition relating the parameters c and
d:

    c d/(1 - d) <= 1                      (7.34)
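As a quick check, the parameter values that we shall use in the simulation
later in this section, c = 0.1 and d = 0.9, satisfy this constraint:
0.1*0.9/(1-0.9)
0.9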
We are almost ready to begin an example calculation. First, we must
examine the orienting subsystem in some detail, as the matching process
on ART2 is not quite as straightforward as it was on ART1.
From Table 7.1 we can construct the equation for the activities of the
nodes on the r layer of the orienting subsystem:

    r_i = (u_i + c p_i)/(|u| + |c p|)     (7.35)

Since u is normalized to unit length, we can rewrite the magnitude of r as

    |r| = Sqrt[1 + 2 |c p| cos(u,p) + |c p|^2]/(1 + |c p|)    (7.36)

using Eq. (7.35), where cos(u,p) is the cosine of the angle between u and
p.
First, note that if u and p are parallel, then the above equation reduces
to |r| = 1, and there will be no reset. As long as there is no output from
F2, Eq. (7.26) shows that u = p, and there will be no reset in this case.
Suppose now that F2 does have an output from some winning unit,
and that the input pattern needs to be learned, or encoded, by the F2
unit. We also do not want a reset in this case. From Eq. (7.26) we see
that p = u + d z_J, where z_J is the weight vector on the winning F2 unit.
If we initialize all of the top-down weights, z_ij, to zero, then the initial
output from F2 will have no effect on the value of p; that is, p will remain
equal to u.
During the learning process itself, z_J becomes parallel to u according
to Eq. (7.32). Thus, p also becomes parallel to u, and again |r| = 1
and there is no reset. As with ART1, a sufficient mismatch between the
bottom-up input vector and the top-down template results in a reset. In
ART2, the bottom-up pattern is taken at the u sublevel of F1 and the
top-down template is taken at p.
Since the calculations that are done on F2 are the same for ART2 as they
were for ART1, we shall not spend time describing them here. On the
other hand, what happens to an input vector, as it passes through the
various layers of the ART2 network, is far from obvious. Moreover,
there are many variations on the F1 architecture. We shall not concern
ourselves with the second issue here, but rather, we shall spend a bit of
time looking at the calculations on the particular F1 layer described in
the previous section.
The simulation of F1 requires that we calculate the vector magnitude
on several of the sublayers. We shall use the standard definition of the
magnitude of a vector:
vmag2[v_] := Sqrt[v . v]
f1dim = 5;   (* F1 dimension *)
f2dim = 6;   (* F2 dimension *)
theta = 0.2;   (* noise threshold; the same value is used in the art2 call later *)
fv[x_] := If[x>theta,x,0]
f[x_] := Map[fv,x]   (* output function on v layer *)
Even though there is no output from F2 at this time, we shall define the
F2 output function and initialize the weight vectors in anticipation of
continuing on with the F2 calculation later.
d=0.9;
g[x_] := If[x>0,d,0]
MatrixForm[z21 = N[Table[Table[0.5/((1-d)*Sqrt[f1dim]),
    {f1dim}],{f2dim}],4]]
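With d = 0.9 and f1dim = 5, each entry of z21 is initialized to
N[0.5/((1-0.9) Sqrt[5]),4]
2.236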
Notice that the second input vector is a multiple of the first, while the
third bears little resemblance to either of the first two.
Initialize the sublayer outputs to zero vectors.
a = 10; b = 10;   (* F1 parameters; the same values are used in the art2 call later *)
ClearAll[w,x,u,v,p,q,r];
w=x=u=v=p=q=r=Table[0,{f1dim}];
w = inputs[[1]] + a u
x is a normalized version of w.
x = w / vmag2[w]
v = f[x] + b f[q]
Notice that the third component of v is zero, since the third component
of x did not meet the threshold criterion. Since there is no top-down
signal from F2, the remaining three sublayers, u, p, and q, all have the
same outputs.
u = v / vmag2[v]
p=u
q = p / vmag2[p]
We cannot stop the F1 calculation yet, however, since both u and p are
now nonzero. These sublayers provide feedback to other layers, so we
must iterate the F1 sublayers.
w = inputs[[l]] + a u
x = w / vmag2[w]
v = f[x] + b f[q]
u = v / vmag2[v]
p = u
q = p / vmag2[p]
w = inputs[[l]] + a u
x = w / vmag2[w]
v = f[x] + b f[q]
u = v / vmag2[v]
p = u
q = p / vmag2[p]
which does not change the results. We shall stop at two iterations through
F1 for all of our calculations.
Let's now apply the second input vector to F1. Recall that the second
input vector is a multiple of the first. Reinitialize the sublayer outputs
first.
ClearAll[w,x,u,v,p,q,r];
w=x=u=v=p=q=r=Table[0,{f1dim}];
w = inputs[[2]] + a u
x = w / vmag2[w]
v = f[x] + b f[q]
u = v / vmag2[v]
p= u
q = p / vmag2[p]
w = inputs[[2]] + a u
x = w / vmag2[w]
v = f[x] + b f[q]
u = v / vmag2[v]
art2F1[in_,a_,b_,d_,tdWts_,f1d_,winr_:0] :=
Module[{w,x,u,v,p,q,i},
 w=x=u=v=p=q=Table[0,{f1d}];
 For[i=1,i<=2,i++,
  w = in + a u;
  x = w / vmag2[w];
  v = f[x] + b f[q];
  u = v / vmag2[v];
  p = If[winr==0,u,u + d Transpose[tdWts][[winr]] ];
  q = p / vmag2[p];
 ]; (* end of For i *)
 Return[{u,p}];
] (* end of Module *)
Listing 7.3
p = u
q = p / vmag2[p]
After the v layer, the results are identical for the two input vectors. We
can conclude that the F1 layer performs several functions on an input
vector: Vectors are normalized to the same ambient background level,
noise is eliminated using a threshold condition, and the final vector is
normalized to a length of one. Incidentally, the noise-reduction process
described above often goes by the name of contrast enhancement, since,
as you can see by comparing the original input vector, after reduction to
a common background level, to the final output vector, values above the
threshold have been enhanced, while values below the threshold have
been reduced to zero.
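To verify this behavior numerically, here is a short sketch using art2F1
from Listing 7.3. The two input vectors and the zero top-down matrix
are illustrative choices of my own; the second vector is simply a multiple
of the first, as in the example above.
z12test = Table[Table[0,{f2dim}],{f1dim}];   (* ignored when winr = 0 *)
v1 = {0.2, 0.0, 0.1, 0.5, 0.4};
v2 = 5 v1;                                   (* same pattern on a higher level *)
{u1,p1} = art2F1[v1,a,b,d,z12test,f1dim];
{u2,p2} = art2F1[v2,a,b,d,z12test,f1dim];
Chop[u1 - u2]                                (* {0, 0, 0, 0, 0}: identical F1 representations *)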
For later use, we shall assemble the F1 sublayers into a single function,
art2F1 in Listing 7.3. The function returns the values of u and p,
since these are used later. in is the input vector; a, b, and d are the
layer parameters defined above; tdWts is the top-down weight matrix; f1d
is the dimension of F1; and winr is the index of the winning F2 unit. If
winr is zero (the default), there is no top-down contribution to p.
{u,p} = art2F1[inputs[[3]],a,b,d,z12,f1dim];
We can now assemble the functions into a complete ART2 simulator. That
development is the subject of the final section of this chapter.
We can pattern the ART2 simulator after the ART1 simulator; in fact, to
construct the ART2 simulator, I began with the ART1 code. Also, as we
did before, we shall initialize the weights in a separate routine.
art2Init[f1dim_,f2dim_,d_,del1_] :=
Module[{z12,z21},
 z12 = Table[Table[0,{f2dim}],{f1dim}];
 z21 = Table[Table[del1/((1.0-d)*Sqrt[f1dim]) //N,
   {f1dim}],{f2dim}];
 Return[{z12,z21}];
]; (* end of Module *)
The parameter dell determines what fraction less than the maximum
value to which the top-down weights are initialized; see Eq. (7.33).
The simulator requires the winner and compete routines that we devel¬
oped previously for ART1, but the routine to determine the reset flag is
slightly different. The reset routine for ART2 is
resetflag2[u_,p_,c_,rho_]:=
Module[{r,flag},
 r = (u + c p) / (vmag2[u] + vmag2[c p]);
 If[rho/vmag2[r] > 1,flag=True,flag=False];
 Return[flag];
];
art2[f1dim_,f2dim_,a1_,b1_,c1_,d_,theta_,rho_,f1Wts_,f2Wts_,inputs_] :=
Module[{droplistinit,droplist,notDone=True,i,nIn=Length[inputs],reset,
        u,p,t,xf2,uf2,windex,matchList,newMatchList,tdWts,buWts},
 droplistinit = Table[1,{f2dim}];        (* initialize droplist *)
 tdWts = f1Wts; buWts = f2Wts;
 u = p = Table[0,{f1dim}];
 (* construct list of F2 units and encoded input patterns *)
 matchList = Table[{StringForm["Unit ``",n]},{n,f2dim}];
 While[notDone==True,newMatchList = matchList; (* process until stable *)
  For[i=1,i<=nIn,i++,           (* process each input pattern in sequence *)
   droplist = droplistinit;     (* initialize droplist for new input *)
   reset=True;
   in = inputs[[i]];            (* next input pattern *)
   windex = 0;                  (* initialize *)
   While[reset==True,           (* cycle until no reset *)
    {u,p} = art2F1[in,a1,b1,d,tdWts,f1dim,windex];
    t = buWts . p;              (* F2 net-inputs *)
    t = t droplist;             (* turn off inhibited units *)
    xf2 = compete[t];           (* F2 activities *)
    uf2 = Map[g,xf2];           (* F2 outputs *)
    windex = winner[uf2,d];     (* winning index *)
    {u,p} = art2F1[in,a1,b1,d,tdWts,f1dim,windex];
    reset = resetflag2[u,p,c1,rho];  (* check reset *)
    If[reset==True,droplist[[windex]]=0;  (* update droplist *)
      Print["Reset with pattern ",i," on unit ",windex],Continue];
   ]; (* end of While reset==True *)
   Print["Resonance established on unit ",windex," with pattern ",i];
   tdWts=Transpose[tdWts];      (* resonance, so update weights *)
   tdWts[[windex]]=u/(1-d); tdWts=Transpose[tdWts];
   buWts[[windex]] = u/(1-d);
   matchList[[windex]] =        (* update matching list *)
     Reverse[Union[matchList[[windex]],{i}]];
  ]; (* end of For i=1 to nIn *)
  If[matchList==newMatchList,notDone=False; (* see if matchList is static *)
    Print["Network stable"],Print["Network not stable"];
    newMatchList = matchList];
 ]; (* end of While notDone==True *)
 Return[{tdWts,buWts,matchList}];
]; (* end of Module *)
Listing 7.4
The complete ART2 simulator appears in Listing 7.4. To run the network
with the inputs and parameters defined in the previous section, first
initialize the weights,
{f1W,f2W} = art2Init[5,6,0.9,0.5];
MatrixForm[f1W]
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
MatrixForm[f2W]
then use these weights and the various parameters in the calling sequence
for the art2 function
outputs = art2[5,6,10,10,0.1,0.9,0.2,0.999,
    f1W,f2W,inputs];
TableForm[outputs[[3]]]
Unit 1   2   1
Unit 2   3
Unit 3
Unit 4
Unit 5
Unit 6
As we should expect, unit one encoded both the first and second patterns.
The third pattern was encoded by unit 2. We also might guess what the
weight matrices look like. The top-down matrix is
MatrixForm[N[outputs[[l]],3]]
2.06 0 0 0 0 0
7.22 0 0 0 0 0
0 2.59 0 0 0 0
5.16 4.32 0 0 0 0
4.13 8.64 0 0 0 0
MatrixForm[N[outputs[[2]],3]]
As with the ART1 network, facility with this ART2 network can come
only with experience. Moreover, the sublayer structure on F1 provides
fertile ground for additional experimentation.
Summary
The ART1 and ART2 networks that we studied in this chapter represent
a class of network architectures based on adaptive resonance theory.
Among other characteristics, these networks retain their ability to learn
new information without having to be retrained on old information as
well as the new. Moreover, these networks know when presented
information is new and automatically incorporate this new information. You
should view the ART1 and ART2 networks as building blocks. Hierarchical
structures based on combinations of these networks can exhibit
complex behavior. As with the other neural networks in this book, the
ART1 and ART2 networks should be used as starting points for your
own experimentation.
Chapter 8
Genetic Algorithms
8.1 GA Basics
In the introduction to this chapter, I used the words chance, perhaps, and
potentially, all contributing to the probabilistic flavor of GAs. Often people
attempt to disprove evolutionary theory on the basis that mere chance
could not possibly have resulted in the complex organisms that exist today.
Usually a person offers some calculation that, in essence, proves
that no matter how many monkeys sit at typewriters for an infinite time,
the probability that one will type (insert your favorite book title here) is
vanishingly small. Supposedly, by inference, any theory of evolution based
on mere chance is equally improbable. Let's play the typing-monkey game
ourselves. The following function generates ten random 13-character strings.
generate[] := TableForm[Map[FromCharacterCode,
    Table[Random[Integer,{97,122}],{i,1,10},{j,1,13}]]]
generate[]
vbkafmujpdneg
nifcmmuucqonf
ymvasqzsagpnp
ikowydtridswq
ilqsrqxahklvs
jogvhcjuzzxcj
xctkuuachkbwr
skfqtjbrjrrpq
kqrelgjcobhbj
alqlbsezaibdy
generate[]
ywporsuuapyfh
owqozltdvbgpu
1 Richard Dawkins. The Blind Watchmaker. New York: W. W. Norton & Company, 1987.
qxzsaymyozwdt
uummtfbgezlir
edkpejxjhkowk
qmhefxgdsblgt
tkxlovoyglvnx
mbzkdxtlrlvvc
jowseiaersfel
momgjesfuvgbf
generate[]
hdahgucjzagag
uvgmxanpjpkqi
ikksflflyzqgh
oqozewohuaema
vtnxclgtcwuvz
axjoakjjnflqe
etvlqillzceja
xxanjvmlooltj
uxeijueevpous
urvkzgrxscyxb
We could go on, but you probably get the picture: Not one of these strings
has any noticeable resemblance to the target string. Occasionally the
proper letter does appear in the proper location in one of the strings, but
it is likely to take a very long time before our pseudomonkeys generate
the correct sequence purely by chance. Since there are 26 possible letters
for each of 13 locations in the string, the probability that a single monkey
will type the correct phrase is
(1./26)^13
4.03038 10^-19
A better strategy, suggested by Dawkins's experiment, is to start with a
population of random strings and select the string that most closely
matches the desired string. Then we allow each letter to change with
some fixed probability. From the resulting generation, we choose the
most closely matching string to form a new generation, and so on.
Begin by defining the phrase as a text string.
tobe = "tobeornottobe"
tobeornottobe
keyPhrase = ToCharacterCode[tobe]
{116, 111, 98, 101, 111, 114, 110, 111, 116, 116, 111, 98, 101}
initialPop = Table[Random[Integer,{97,122}],{i,1,10},{j,1,13}]
{{119, 105, 118, 110, 120, 117, 116, 110, 111, 116, 115,
106, 112}, {105, 113, 104, 111, 118, 114, 98, 119,
120, 109}}
Just out of curiosity, let's convert this population to strings to see what
they look like.
TableForm[Map[FromCharacterCode,initialPop]]
wivnxutnotsjp
iqhovrbwkyrbh
clgkbpftfkswh
eavaaxuuajtob
rvnnzuqerbdyg
lncpajjxrgohl
hftljwsioihrp
tukbwblwzzntt
fjunrugmwczjn
frszjkeuowdxm
flip[x_] := If[Random[]<=x,True,False]
mutateLetter[pmute_,letter_] :=
    If[flip[pmute],Random[Integer,{97,122}],letter];
newGenerate[0.1,keyPhrase,initialPop,50]
newGenerate[pmutate_,keyPhrase_,pop_,numGens_] :=
Module[{i,newPop,parent,diff,matches,
        index,fitness},
 newPop=pop;
 For[i=1,i<=numGens,i++,
  diff = Map[(keyPhrase-#)&,newPop];
  matches = Map[Count[#,0]&,diff];
  fitness = Max[matches];
  index = Position[matches,fitness];
  parent = newPop[[First[Flatten[index]]]];
  Print["Generation ",i,": ",
    FromCharacterCode[parent],
    "  Fitness = ",fitness];
  newPop =
    Table[Map[mutateLetter[pmutate,#]&,parent],
      {100}];
 ]; (* end of For *)
]; (* end of Module *)
Listing 8.1
As the experiment shows, combining random mutation with selection
does not lead to chaos, but rather to order and meaning. But we have not
yet added mating and reproduction to the process. Let's move on to the
discussion of a more complete genetic algorithm.
8.2 A Basic Genetic Algorithm (BGA)
In this section we shall develop an algorithm that has all of the
characteristic processing normally associated with a true GA. Although we
will build a very simple algorithm, it has application in a number of areas.
More complex GAs will have the same high-level structure, but will
differ in the details of the computations performed. Before we actually
perform the development of the GA in Section 8.2.2, a review of the
relevant vocabulary is in order. Like neural networks, GAs are inspired by
certain results from biology.
8.2.1 GA Vocabulary
Plot[f[x],{x,-45,45},PlotPoints->200];
Notice that the optimal value of this function occurs at x = 0, but notice
also that there are many local maxima which are suboptimal. A
traditional hill-climbing technique, unless it were fortuitously to begin
somewhere on the central peak, would quickly reach one of the suboptimal
peaks and would get stuck there.
With a ten-bit chromosome there are 1024 possible values, which we map
onto the interval from -40 to 40. The step size is
N[80.0/1023,10]
0.07820136852
so the extreme chromosomes decode to the interval endpoints:
1023 (80.0/1023) - 40
40.
0 (80.0/1023) - 40
-40
pList = Flatten[Position[{1,0,1,0,1,0,1,0,1,0},1]]
{1, 3, 5, 7, 9}
values = Map[2^(10-#)&,pList]
{512, 128, 32, 8, 2}
decimal = Apply[Plus,values]
682
decodeBGA[chromosome_] :=
Module[{pList,lchrom,values,decimal,phenotype},
 lchrom = Length[chromosome];
 (* convert from binary to decimal *)
 pList = Flatten[Position[chromosome,1] ];
 values = Map[2^(lchrom-#)&,pList];
 decimal = Apply[Plus,values];
 (* scale to proper range *)
 phenotype = decimal (0.07820136852394916911) - 40;
 Return[phenotype];
]; (* end of Module *)
Listing 8.2
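The conversion from binary to decimal can also be written with Horner's
rule. The following one-liner is my own sketch (with base 2 as the default),
consistent with the Horner function used below:
Horner[digits_List,base_:2] := Fold[(base #1 + #2)&, 0, digits]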
which works for arbitrary bases. We can test the function on the same
binary number.
Horner[{1,0,1,0,1,0,1,0,1,0}]
682
Scaling this result into the phenotype range with decodeBGA gives
decodeBGA[{1,0,1,0,1,0,1,0,1,0}]
13.3333
The complete function is in Listing 8.2. One thing to notice about our
choice of chromosomal representation is that the optimal phenotype (x =
0) is not represented by any chromosome. The largest negative phenotype
has the chromosome {0,1,1,1,1,1,1,1,1,1}.
decodeBGA[{0,1,1,1,1,1,1,1,1,1}]
-0.0391007
decodeBGA[{1,0,0,0,0,0,0,0,0,0}]
0.0391007
f[-0.0391007]
1.99922
The selection procedure implements a fitness-proportionate ("roulette-wheel")
scheme:
1. Form the list of cumulative fitness values for the population.
2. Generate a random number between zero and the total of all of the
fitnesses in the population.
3. Return the first individual whose fitness, added to the fitness of all
other elements before it, from the list in step 1, is greater than or
equal to the random number from step 2.
A small worked example follows the list.
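With made-up fitness values of 1, 2, and 3, the folded list of step 1 is
demoFolded = FoldList[Plus,First[{1.,2.,3.}],Rest[{1.,2.,3.}]]
{1., 3., 6.}
so a spin of 2.5 selects the second individual, since 2.5 exceeds 1. but
not 3.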
MatrixForm[pop =
    Table[Random[Integer,{0,1}],{i,10},{j,10}]]
1 1 0 0 1 0 0 1 1 0
0 0 0 1 0 0 0 1 0 1
0 1 0 0 1 0 0 1 1 1
1 0 1 1 1 0 1 1 0 1
1 1 1 0 0 0 0 0 1 1
1 0 1 1 0 0 0 1 0 1
1 0 1 1 0 0 1 0 0 0
1 0 0 0 1 0 1 1 1 0
1 1 1 1 0 1 0 1 1 0
0 1 1 1 0 0 1 1 0 1
Then decode the population by mapping the decodeBGA function onto pop.
pheno = Map[decodeBGA,pop]
Next, compute the fitness of each phenotype.
fitList = Map[f,pheno]
We use the function FoldList to add each element of fitList to all of the
successive elements.
fitListSum = FoldList[Plus,First[fitList],Rest[fitList]]
The total of all fitness values is the last element in the above list.
fitSum = Last[fitListSum]
8.1069
The function selectOne in Listing 8.3 takes the folded list of fitness values
and returns the index of the individual who came up on a single spin of
the roulette wheel. The two parents are
selectOne[foldedFitnessList_,fitTotal_] :=
Module[{randFitness,elem,index},
 randFitness = Random[] fitTotal;
 elem = Select[foldedFitnessList,#>=randFitness&,1];
 index = Flatten[Position[foldedFitnessList,First[elem]]];
 Return[First[index]];
]; (* end of Module *)
Listing 8.3
parent1Index = selectOne[fitListSum,fitSum]
parent2Index = selectOne[fitListSum,fitSum]
parent1 = pop[[parent1Index]]
{1, 1, 1, 1, 0, 1, 0, 1, 1, 0}
parent2 = pop[[parent2Index]]
{1, 0, 1, 1, 0, 0, 0, 1, 0, 1}
These two parents have fitnesses of 1.04247 and 0.714787 respectively. Of
course, if you execute the above two statements, you may get different
parents. We can now proceed to the reproduction process.
Figure 8.1 This figure illustrates the crossover operation. (a) Two parents are selected
according to their fitness, and a crossover point, illustrated by the vertical line, is chosen
by a uniform random selection. (b) The children's chromosomes are formed by combining
opposite parts of each parent's chromosome.
crossover[pcross_,pmutate_,parent1_,parent2_] :=
Module[{child1,child2,crossAt,lchrom},
 (* chromosome length *)
 lchrom = Length[parent1];
 If[ flip[pcross],
  (* True: select cross site at random *)
  crossAt = Random[Integer,{1,lchrom-1}];
  (* construct children *)
  child1 = Join[Take[parent1,crossAt], Drop[parent2,crossAt]];
  child2 = Join[Take[parent2,crossAt], Drop[parent1,crossAt]],
  (* False: return parents as children *)
  child1 = parent1;
  child2 = parent2;
 ]; (* end of If *)
 (* perform mutation *)
 child1 = Map[mutateBGA[pmutate,#]&,child1];
 child2 = Map[mutateBGA[pmutate,#]&,child2];
 Return[{child1,child2}];
]; (* end of Module *)
Listing 8.4
a very small value, perhaps as low as one chance in a thousand for any
particular gene.
Since the chromosomes are binary strings, I have written the mutation
algorithm in terms of an XOR function. Moreover, since Mathematica does
not equate True and False with 1 and 0 respectively, I have had to write
my own XOR function:
myXor[x_,y_] := If[x==y,0,1];
mutateBGA[pmute_,allel_] := If[flip[pmute],myXor[allel,1],allel];
The procedure in Listing 8.4, which comprises both the crossover and
mutation algorithms, returns two children to be added to the next
generation of the population. Let's see if we make any progress after one
crossover-mutation operation.
MatrixForm[children = crossover[0.75,0.001,parent1,parent2]]
1 1 1 1 0 0 0 1 0 1
1 0 1 1 0 1 0 1 1 0
decodeList = Map[decodeBGA,children]
{35.4643, 16.7742}
newfitness = Map[f,decodeList]
{0.95461, 0.87324}
We gained ground in one case, but lost ground in the other. On the
whole, however, the average fitness of the children,
(newfitness[[1]]+newfitness[[2]])/2
0.913925
is greater than the average fitness of the parents.
(1.04247+0.714787)/2
0.878628
The results could have turned out differently, however, since this process
does contain an element of chance.
There is one further issue to discuss before putting everything together.
That issue concerns how we are going to go about populating
the next generation.
initPop[psize_,csize_] :=
 Table[Random[Integer,{0,1}],{psize},{csize}];
displayBest[fitnessList_,number2Print_] :=
Module[{i,sortedList},
 sortedList = Sort[fitnessList,Greater];
 For[i=1,i<=number2Print,i++,
  Print["fitness = ",sortedList[[i]] ];
 ]; (* end of For i *)
]; (* end of Module *)
Listing 8.5
bga[pcross_,pmutate_,popInitial_,fitFunction_,numGens_,printNum_] :=
Module[{i,newPop,parent1,parent2,
        oldPop,reproNum,fitList,fitListSum,
        fitSum,pheno,pIndex1,pIndex2,f,children},
 oldPop=popInitial;             (* initialize first population *)
 reproNum = Length[oldPop]/2;   (* calculate number of reproductions *)
 f = fitFunction;               (* assign the fitness function *)
 For[i=1,i<=numGens,i++,        (* perform numGens generations *)
  pheno = Map[decodeBGA,oldPop]; (* decode the chromosomes *)
  fitList = f[pheno];           (* determine the fitness of each phenotype *)
  Print[" "];                   (* print out the best individuals *)
  Print["Generation ",i,"  Best ",printNum];
  Print[" "];
  displayBest[fitList,printNum];
  fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
  fitSum = Last[fitListSum];    (* find the total fitness *)
  newPop = Flatten[Table[       (* determine the new population *)
   pIndex1 = selectOne[fitListSum,fitSum]; (* select parent indices *)
   pIndex2 = selectOne[fitListSum,fitSum];
   parent1 = oldPop[[pIndex1]]; (* identify parents *)
   parent2 = oldPop[[pIndex2]];
   children = crossover[pcross,pmutate,parent1,parent2]; (* crossover and mutate *)
   children,{reproNum}], 1      (* add children to list; flatten to first level *)
  ]; (* end of Flatten[Table] *)
  oldPop = newPop;              (* new becomes old for next gen *)
 ]; (* end of For i *)
]; (* end of Module *)
Listing 8.6
initialPopulation = initPop[100,10];
Generation 1 Best 10
fitness = 1.90724
fitness = 1.90724
fitness = 1.58669
fitness = 1.42534
fitness = 1.42534
fitness = 1.3598
fitness = 1.3598
fitness = 1.3334
fitness = 1.33098
fitness = 1.31726
Generation 2 Best 10
fitness = 1.96206
fitness = 1.93756
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.78363
fitness = 1.48717
fitness = 1.46704
fitness = 1.34657
Generation 3 Best 10
fitness = 1.96206
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.87132
fitness = 1.83002
fitness = 1.78363
fitness = 1.73246
fitness = 1.66991
fitness = 1.48717
Generation 4 Best 10
fitness = 1.99299
fitness = 1.99299
fitness = 1.99299
fitness = 1.90724
fitness = 1.87132
fitness = 1.83002
fitness = 1.83002
fitness = 1.78363
fitness = 1.78363
fitness = 1.78363
Generation 5 Best 10
fitness = 1.99299
fitness = 1.99299
fitness = 1.99299
fitness = 1.99299
fitness = 1.98058
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.87132
Not only does the best individual become better as time goes on, but
the population as a whole appears to be getting better as well. In most
problems we will probably be interested in the best solution; nevertheless,
as the population as a whole gets better, future generations will be
produced by a better and better group of parents.
sigmoid[x_] := 1./(1+E^(-x));
You should begin to see why this GA may take a long time to compute.
Suppose we have a population of 100 networks, each initially with
randomly generated weights. To determine a new generation (if we are
using generational replacement) requires 50 matings, each of which
involves three crossover-mutation operations. Then we must decode the
chromosomes and determine the mse for each network.
We are going to employ some time-saving measures for this example.
First, we will use a steady-state population methodology, in which we
replace only a few of the worst-performing individuals for each new
generation. Second, since we need not evaluate the fitness of each network
for each new generation, we need decode only the new children in order
to assess their fitness.
initXorPop[psize_,csize_,ioPairs_] :=
Module[{i,iPop,hidWts,outWts,mseInv},
 (* first the chromosomes *)
 iPop = Table[
  {
   Table[Random[Integer,{0,1}],{csize}], (* h1 *)
   Table[Random[Integer,{0,1}],{csize}], (* h2 *)
   Table[Random[Integer,{0,1}],{csize}]  (* o1 *)
  }, {psize} ]; (* end of Table *)
 (* then decode and eval fitness *)
 (* use For loop for clarity *)
 For[i=1,i<=psize,i++,
  (* make hidden weight matrix *)
  hidWts = Join[iPop[[i,1]],iPop[[i,2]] ];
  hidWts = Partition[hidWts,20];
  hidWts = Map[decodeXorChrom,hidWts];
  hidWts = Partition[hidWts,2];
  (* make output weight matrix *)
  outWts = Partition[iPop[[i,3]],20];
  outWts = Map[decodeXorChrom,outWts];
  (* get mse for this network *)
  mseInv = gaNetFitness[hidWts,outWts,ioPairs];
  (* prepend mseInv *)
  PrependTo[iPop[[i]],mseInv];
 ]; (* end For *)
 Return[iPop];
]; (* end of Module *)
Listing 8.7
decodeXorChrom[chromosome_] :=
Module[{pList,lchrom,values,p,decimal},
 lchrom = Length[chromosome];
 (* convert from binary to decimal *)
 pList = Flatten[Position[chromosome,1] ];
 values = Map[2^(lchrom-#)&,pList];
 decimal = Apply[Plus,values];
 (* scale to proper range *)
 p = decimal (9.536752259018191355*10^-6) - 5;
 Return[p];
]; (* end of Module *)
gaNetFitness[hiddenWts_,outputWts_,ioPairVectors_] :=
Module[{inputs,hidden,outputs,desired,errors,
        len,errorTotal,errorSum},
 inputs=Map[First,ioPairVectors];
 desired=Map[Last,ioPairVectors];
 len = Length[inputs];
 hidden=sigmoid[inputs.Transpose[hiddenWts]];
 outputs=sigmoid[hidden.Transpose[outputWts]];
 errors= desired-outputs;
 errorSum = Apply[Plus,errors^2,2]; (* second level *)
 errorTotal = Apply[Plus,errorSum];
 (* inverse of mse *)
 Return[len/errorTotal];
] (* end of Module *)
Listing 8.8
The scaling step inside encodeNetGa (Listing 8.12) converts a weight in
the range [-5, 5] into a 20-bit integer:
dec = Round[(weight+5.)/(9.536752259018191355*10^-6)];
encodeNetGa[0.5,20]
{1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0}
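As a consistency check (my own test, not part of the original example),
decoding an encoded weight should recover it to within one quantization
step of about 9.5 x 10^-6:
decodeXorChrom[encodeNetGa[0.5,20]]   (* approximately 0.5 *)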
pop = initXorPop[20,40,ioPairsXOR];
Sort the population according to fitness. Although this step is not
necessary to begin the GA, it will allow us to easily determine the
characteristics of the initial population.
pop = Sort[pop,Greater[First[#1],First[#2]]&];
crossoverXor[pcross_,pmutate_,parent1_,parent2_] :=
Module[{child1,child2,crossAt,lchrom,
        i,numchroms,chroms1,chroms2},
 (* strip off mse *)
 chroms1 = Rest[parent1];
 chroms2 = Rest[parent2];
 (* chromosome length *)
 lchrom = Length[chroms1[[1]]];
 (* number of chromosomes in each list *)
 numchroms = Length[chroms1];
 For[i=1,i<=numchroms,i++,  (* for each chrom *)
  If[ flip[pcross],
   (* True: select cross site at random *)
   crossAt = Random[Integer,{1,lchrom-1}];
   (* construct children *)
   chroms1[[i]] = Join[Take[chroms1[[i]],crossAt],
                       Drop[chroms2[[i]],crossAt]];
   chroms2[[i]] = Join[Take[chroms2[[i]],crossAt],
                       Drop[chroms1[[i]],crossAt]],
   (* False: don't change chroms[[i]] *)
   Continue]; (* end of If *)
  (* perform mutation *)
  chroms1[[i]] = Map[mutateBGA[pmutate,#]&,chroms1[[i]]];
  chroms2[[i]] = Map[mutateBGA[pmutate,#]&,chroms2[[i]]];
 ]; (* end of For i *)
 Return[{chroms1,chroms2}];
]; (* end of Module *)
Listing 8.9
gaXor[pcross_,pmutate_,popInitial_,numReplace_,ioPairs_,numGens_,printNum_] :=
Module[{i,j,newPop,parent1,parent2,
        oldPop,reproNum,fitList,fitListSum,
        fitSum,pIndex1,pIndex2,children,hids,outs,mseInv},
 (* initialize first population sorted by fitness value *)
 oldPop= Sort[popInitial,Greater[First[#1],First[#2]]&];
 reproNum = numReplace;        (* calculate number of reproductions *)
 For[i=1,i<=numGens,i++,
  fitList = Map[First,oldPop]; (* list of fitness values *)
  (* make the folded list of fitness values *)
  fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
  fitSum = Last[fitListSum];   (* find the total fitness *)
  newPop = Drop[oldPop,-reproNum]; (* new population; eliminate reproNum worst *)
  For[j=1,j<=reproNum/2,j++,   (* make reproNum new children *)
   (* select parent indices *)
   pIndex1 = selectOne[fitListSum,fitSum];
   pIndex2 = selectOne[fitListSum,fitSum];
   parent1 = oldPop[[pIndex1]]; (* identify parents *)
   parent2 = oldPop[[pIndex2]];
   children = crossoverXor[pcross,pmutate,parent1,parent2]; (* crossover and mutate *)
   {hids,outs} = decodeXorGenotype[children[[1]] ]; (* fitness of children *)
   mseInv = gaNetFitness[hids,outs,ioPairs];
   children[[1]] = Prepend[children[[1]],mseInv];
   {hids,outs} = decodeXorGenotype[children[[2]] ];
   mseInv = gaNetFitness[hids,outs,ioPairs];
   children[[2]] = Prepend[children[[2]],mseInv];
   newPop = Join[newPop,children]; (* add children to new population *)
  ]; (* end of For j *)
  oldPop = Sort[newPop,Greater[First[#1],First[#2]]&]; (* for next gen *)
  (* print best mse values (1/mseInv) *)
  Print[ ];Print["Best of generation ",i];
  For[j=1,j<=printNum,j++,Print[(1.0/oldPop[[j,1]])]; ];
 ]; (* end of For i *)
 Return[oldPop];
]; (* end of Module *)
Listing 8.10
decodeXorGenotype[genotype_] :=
Module[{hidWts,outWts},
 hidWts = Join[genotype[[1]],genotype[[2]] ];
 hidWts = Partition[hidWts,20];
 hidWts = Map[decodeXorChrom,hidWts];
 hidWts = Partition[hidWts,2];
 (* make output weight matrix *)
 outWts = Partition[genotype[[3]],20];
 outWts = Map[decodeXorChrom,outWts];
 Return[{hidWts,outWts}];
];
Listing 8.11
The first individual in the sorted population has the highest fitness, or
the smallest mse.
1/newpop[[1,1]]
0.156526
The last population individual has the lowest fitness, or the largest mse.
1/newpop[[20,1]]
0.394886
You can see the distribution by plotting the fitness values, or alternatively,
plotting the mse's. Here we plot the mse values.
poplist = Map[First,newpop];
ListPlot[1/poplist]
encodeNetGa[weight_,len_] :=
Module[{dec,l,i},
 i=len;
 l=Table[0,{i}];
 (* scale to proper range *)
 dec = Round[(weight+5.)/(9.536752259018191355*10^-6)];
 While[dec!=0 && dec!=1,
  l=ReplacePart[l,Mod[dec,2],i];
  dec=Quotient[dec,2];
  --i;
 ];
 l=ReplacePart[l,dec,i]
]; (* end of Module *)
Listing 8.12
newpop = gaXor[0.8,0.01,pop,10,ioPairsXOR,100,1];
Best of generation 1
0.159407
Best of generation 5
0.14867
Best of generation 10
0.125276
Best of generation 15
0.125276
Best of generation 20
0.112863
Best of generation 25
0.102992
Best of generation 30
0.102976
Best of generation 35
0.102687
Best of generation 40
0.102538
Best of generation 50
0.102463
Best of generation 55
0.102264
Best of generation 60
0.102223
Best of generation 65
0.101803
Best of generation 70
0.10178
Best of generation 75
0.101705
Best of generation 80
0.101669
Best of generation 85
0.101669
Best of generation 90
0.101669
Best of generation 95
0.101667
Best of generation 100
0.101644
You will notice a steady, but very slow, decline in the mse. At this
rate, it may take hundreds of generations to reach an acceptable error.
Moreover, this run of 100 generations required considerably more time
than was required for the standard backpropagation method to converge
on an acceptable solution. We might also ask whether we are doing any
better than we would if we simply generated networks at random. To
evaluate that situation, we can do just that: generate 100 populations of
20 individuals at random and see if we do as well. The function randomPop
in Listing 8.13 generates the required populations.
randomPop[20,40,ioPairsXOR,100];
Random generation 1
0.132439
Random generation 2
0.149261
Random generation 3
randomPop[psize_,csize_,ioPairs_,numGens_] :=
Module[{i,pop},
 For[i=1,i<=numGens,i++,
  pop = initXorPop[psize,csize,ioPairs];
  pop = Sort[pop,Greater[First[#1],First[#2]]&];
  Print[ ];
  Print["Random generation ",i];
  Print[(1.0/pop[[1,1]])];
 ];
];
Listing 8.13
0.157786
Random generation 4
0.16147
Random generation 5
0.151606
Random generation 6
0.156832
Random generation 7
0.150389
Random generation 8
0.128841
Random generation 9
0.150084
Random generation 10
0.156748
Random generation 11
0.14517
Random generation 12
0.147808
Random generation 13
0.163434
Random generation 14
0.146134
Random generation 15
0.136783
Random generation 16
0.126648
Random generation 17
0.153591
Random generation 18
0.154377
Random generation 19
0.162773
Random generation 20
0.16142
and so on . . .
I have reduced the output to the first 20 generations, but the results for
the remaining 80 generations are similar to those of the first 20. The best
I got in 100 random generations was 0.125274; the worst was 0.186056.
There is, of course, a possibility that some random population will
accidentally produce a very good individual. You should know, however,
that generating the random population and evaluating the fitness of all
of the individuals required more time per generation than the GA did.
Moreover, it appears as though the GA might eventually reach
an acceptable population, whereas we can never be sure about random
generations.
The above development represents a first attempt, and you should
not conclude that the method we employed is the best one for the task. In
fact, you will find it necessary to make several modifications to the data
representation in order to allow the GA to find a good solution. We could
be more clever about how chromosomes are combined during the mating
process. The binary representation and standard crossover algorithm are
probably not the best ones for our purposes in this case. Perhaps we
should maintain the chromosomes as lists of real numbers and search for
more appropriate crossover mechanisms; one possibility is sketched below.
Rather than pursue this matter further here, I will leave it to your creativity.
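As one concrete possibility (a sketch of an alternative scheme, not code
from this chapter), we could keep each chromosome as a list of real weights,
blend parents with a random coefficient, and mutate by adding small noise:
realCrossover[parent1_List,parent2_List] :=
  Module[{alpha = Random[]},            (* blend the two weight lists *)
    {alpha parent1 + (1-alpha) parent2,
     (1-alpha) parent1 + alpha parent2}]
realMutate[pmute_,sigma_,chrom_List] :=  (* jitter each weight with probability pmute *)
  Map[If[flip[pmute], # + Random[Real,{-sigma,sigma}], #]&, chrom]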
Even if we can generate weights for neural networks using the method of
the previous sections, it is not clear that there is a particular advantage
to doing so. If you look back at Chapter 3, you will recall that we were
able to train a comparable network directly with backpropagation in far
less time.
Summary
In a book on neural networks, a chapter on genetic algorithms may seem
out of place, even though in this chapter we did look at a way of using the
two technologies together. GAs, like some neural networks, are good at
finding solutions to optimization problems when you can determine a
score or cost function for each potential solution. We built a very basic
GA in this chapter; many variations are possible. Whether you continue
to experiment with combining GA and neural-network technologies, as
we have done in this chapter, or simply use GAs for other applications,
does not really matter. GAs are another tool in the problem-solving
toolbox which can be brought to bear on a variety of problems. Like
neural networks, GAs will not guarantee you a perfect solution, but can,
in many cases, arrive at an acceptable solution without the time and
expense of an exhaustive search.
Appendix A
Code Listings
alcTest[learnRate_,numIters_:250] :=
Module[{eta=learnRate,wts,k,inputs,wtList,outDesired,outputs,outError},
 wts = Table[Random[],{2}];            (* initialize weights *)
 Print["Starting weights = ",wts];
 Print["Learning rate = ",eta];
 Print["Number of iterations = ",numIters];
 inputs = {0,Random[Real,{0,0.175}]};  (* initialize input vector *)
 k=1;
 wtList=Table[
  inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0,0.175}];
  outDesired = N[2 Cos[Pi k/8]];       (* desired output *)
  outputs = wts.inputs;                (* actual output *)
  outError = outDesired-outputs;       (* error *)
  wts += eta outError inputs;          (* update weights *)
  inputs[[2]]=inputs[[1]];             (* shift input values *)
  k++; wts,{numIters}];                (* end Table *)
 Print["Final weight vector = ",wts];
 wtPlot=ListPlot[wtList,PlotJoined->True]  (* plot the weights *)
] (* end of Module *)
calcMse[ioPairs_,wtVec_] :=
Module[{errors,inputs,outDesired,outputs},
 inputs = Map[First,ioPairs];          (* extract inputs *)
 outDesired = Map[Last,ioPairs];       (* extract desired outputs *)
 outputs = inputs . wtVec;             (* calculate actual outputs *)
 errors = Flatten[outDesired-outputs];
 Return[errors.errors/Length[ioPairs]]
]
alcXor[learnRate_,numInputs_,ioPairs_,numIters_:250] :=
Module[{wts,eta=learnRate,errorList,inputs,outDesired,outError,outputs},
 SeedRandom[6460];                     (* seed random number gen. *)
 wts = Table[Random[],{numInputs}];    (* initialize weights *)
 errorList=Table[                      (* select ioPair at random *)
  {inputs,outDesired} = ioPairs[[Random[Integer,{1,4}]]];
  outputs = wts.inputs;                (* actual output *)
  outError = First[outDesired-outputs]; (* error *)
  wts += eta outError inputs;
  outError,{numIters}];                (* end of Table *)
 ListPlot[errorList,PlotJoined->True];
 Return[wts];
]; (* end of Module *)
testXor[ioPairs_,weights_] :=
Module[{errors,inputs,outDesired,outputs,mse},
 inputs = Map[First,ioPairs];          (* extract inputs *)
 outDesired = Map[Last,ioPairs];       (* extract desired outputs *)
 outputs = inputs . weights;           (* calculate actual outputs *)
 errors = outDesired-outputs;
 mse =
  Flatten[errors] . Flatten[errors]/Length[ioPairs];
 Print["Inputs = ",inputs];
 Print["Outputs = ",outputs];
 Print["Errors = ",errors];
 Print["Mean squared error = ",mse]
]
alcXorMin[learnRate_,numInputs_,ioPairs_,maxError_] :=
Module[{wts,eta=learnRate,errorList,inputs,outDesired,
        meanSqError,done,k,outError,outputs,errorPlot},
 wts = Table[Random[],{numInputs}];    (* initialize weights *)
 meanSqError = 0.0;
 errorList={};
 For[k=1;done=False,!done,k++,         (* until done *)
  (* select ioPair at random *)
  {inputs,outDesired} = ioPairs[[Random[Integer,{1,4}]]];
  outputs = wts.inputs;                (* actual output *)
  outError = First[outDesired-outputs]; (* error *)
  wts += eta outError inputs;          (* update weights *)
  If[Mod[k,4]==0,meanSqError=calcMse[ioPairs,wts];
    AppendTo[errorList,meanSqError]; ];
  If[k>4 && meanSqError<maxError,done=True,Continue]; (* test for done *)
 ]; (* end of For *)
 errorPlot=ListPlot[errorList,PlotJoined->True];
 Return[wts];
] (* end of Module *)
Options[sigmoid] = {xShift->0,yShift->0,temperature->1};
Options[bpnTest] = {printAll->False,bias->False};
sigmoid[x_,opts___Rule] :=
Module[{xshft,yshft,temp},
 xshft = xShift /. {opts} /. Options[sigmoid];
 yshft = yShift /. {opts} /. Options[sigmoid];
 temp = temperature /. {opts} /. Options[sigmoid];
 yshft+1/(1+E^(-(x-xshft)/temp)) //N
]
bpnTest[hiddenWts_,outputWts_,ioPairVectors_,opts___Rule] :=
Module[{inputs,hidden,outputs,desired,errors,i,len,
        prntAll,errorTotal,errorSum,biasVal},
 prntAll = printAll /. {opts} /. Options[bpnTest];
 biasVal = bias /. {opts} /. Options[bpnTest];
 inputs=Map[First,ioPairVectors];
 If[biasVal,inputs=Map[Append[#,1.0]&,inputs] ];
 desired=Map[Last,ioPairVectors];
 len = Length[inputs];
 hidden=sigmoid[inputs.Transpose[hiddenWts]];
 If[biasVal,hidden = Map[Append[#,1.0]&,hidden] ];
 outputs=sigmoid[hidden.Transpose[outputWts]];
 errors= desired-outputs;
 If[prntAll,Print["ioPairs:"];Print[ ];Print[ioPairVectors];
  Print[ ];Print["inputs:"];Print[ ];Print[inputs];
  Print[ ];Print["hidden-layer outputs:"];
  Print[hidden];Print[ ];
  Print["output-layer outputs:"];Print[ ];
  Print[outputs];Print[ ];Print["errors:"];
  Print[errors];Print[ ]; ]; (* end of If *)
 For[i=1,i<=len,i++,Print["  Output ",i," = ",outputs[[i]],"  desired = ",
  desired[[i]],"  Error = ",errors[[i]]];Print[]; ]; (* end of For *)
 errorSum = Apply[Plus,errors^2,2];    (* second level *)
 errorTotal = Apply[Plus,errorSum];
 Print["Mean Squared Error = ",errorTotal/len];
] (* end of Module *)
bpnMomentum[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,
            alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
        hidLastDelta,outLastDelta,outDelta,hidDelta,outErrors},
 hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber}],{hidNumber}];
 outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
 hidLastDelta = Table[Table[0,{inNumber}],{hidNumber}];
 outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
 errorList = Table[
  (* begin forward pass *)
  ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
  inputs=ioP[[1]];
  outDesired=ioP[[2]];
  hidOuts = sigmoid[hidWts.inputs];    (* hidden-layer outputs *)
  outputs = sigmoid[outWts.hidOuts];   (* output-layer outputs *)
  (* calculate errors *)
  outErrors = outDesired-outputs;
  outDelta= outErrors (outputs (1-outputs));
  hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
  (* update weights *)
  outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
  outWts += outLastDelta;
  hidLastDelta = eta Outer[Times,hidDelta,inputs]+
                 alpha hidLastDelta;
  hidWts += hidLastDelta;
  outErrors.outErrors,                 (* this puts the error on the list *)
  {numIters}] ;                        (* this many times, Table ends here *)
 Print["New hidden-layer weight matrix: "];
 Print[]; Print[hidWts];Print[];
 Print["New output-layer weight matrix: "];
 Print[]; Print[outWts];Print[];
 bpnTest[hidWts,outWts,ioPairs,bias->False,printAll->False];
 errorPlot = ListPlot[errorList, PlotJoined->True];
 Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
bpnMomentumSmart[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,
                 alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
        hidLastDelta,outLastDelta,outDelta,hidDelta,outErrors},
 hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber}],{hidNumber}];
 outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
 hidLastDelta = Table[Table[0,{inNumber}],{hidNumber}];
 outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
 errorList = Table[
  (* begin forward pass *)
  ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
  inputs=ioP[[1]];
  outDesired=ioP[[2]];
  hidOuts = sigmoid[hidWts.inputs];    (* hidden-layer outputs *)
  outputs = sigmoid[outWts.hidOuts];   (* output-layer outputs *)
  (* calculate errors *)
  outErrors = outDesired-outputs;
  If[First[Abs[outErrors]]>0.1,
   outDelta= outErrors (outputs (1-outputs));
   hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
   (* update weights *)
   outLastDelta= eta Outer[Times,outDelta,hidOuts]+
                 alpha outLastDelta;
   outWts += outLastDelta;
   hidLastDelta = eta Outer[Times,hidDelta,inputs]+
                  alpha hidLastDelta;
   hidWts += hidLastDelta,Continue];   (* end of If *)
  outErrors.outErrors,                 (* this puts the error on the list *)
  {numIters}] ;                        (* this many times, Table ends here *)
 Print["New hidden-layer weight matrix: "];
 Print[]; Print[hidWts];Print[];
 Print["New output-layer weight matrix: "];
 Print[]; Print[outWts];Print[];
 bpnTest[hidWts,outWts,ioPairs,bias->False,printAll->False];
 errorPlot = ListPlot[errorList, PlotJoined->True];
 Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
bpnCompete[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
        hidEps,outEps,outDelta,hidPos,outPos,hidDelta,outErrors},
 hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber}],{hidNumber}];
 outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
 errorList = Table[                    (* begin forward pass *)
  ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
  inputs=ioP[[1]];
  outDesired=ioP[[2]];
  hidOuts = sigmoid[hidWts.inputs];
  outputs = sigmoid[outWts.hidOuts];
  outErrors = outDesired-outputs;      (* calculate errors *)
  outDelta= outErrors (outputs (1-outputs));
  hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
  (* index of max delta *)
  outPos = First[Flatten[Position[Abs[outDelta],Max[Abs[outDelta]]]]];
  outEps = outDelta[[outPos]];         (* max value *)
  outDelta=Table[-1/4 outEps,{Length[outDelta]}]; (* new outDelta table *)
  outDelta[[outPos]] = outEps;         (* reset this one *)
  (* index of max delta *)
  hidPos = First[Flatten[Position[Abs[hidDelta],Max[Abs[hidDelta]]]]];
  hidEps = hidDelta[[hidPos]];         (* max value *)
  hidDelta=Table[-1/4 hidEps,{Length[hidDelta]}]; (* new hidDelta table *)
  hidDelta[[hidPos]] = hidEps;         (* reset this one *)
  outWts += eta Outer[Times,outDelta,hidOuts];
  hidWts += eta Outer[Times,hidDelta,inputs];
  outErrors.outErrors,                 (* this puts the error on the list *)
  {numIters}] ;                        (* this many times, Table ends here *)
 Print["New hidden-layer weight matrix: "];
 Print[ ]; Print[hidWts];Print[ ];
 Print["New output-layer weight matrix: "];
 Print[ ]; Print[outWts];Print[ ];
 bpnTest[hidWts,outWts,ioPairs,bias->False,printAll->False];
 errorPlot = ListPlot[errorList, PlotJoined->True];
 Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
fln[inNumber_,outNumber_,ioPairs_,eta_,alpha_,numIters_] :=
Module[{outWts,ioP,inputs,outputs,outDesired,
outVals,outLastDelta,outDelta,outErrors},
outVals=0;
outWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber}],{outNumber}];
outLastDelta = Table[Table[0,{inNumber}],{outNumber}];
errorList = Table[
(* begin forward pass *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=ioP[[1]];
outDesired=ioP[[2]];
outputs = outWts.inputs; (* output-layer outputs *)
(* calculate errors *)
outErrors = outDesired-outputs;
outDelta= outErrors;
(* update weights *)
outLastDelta= eta Outer[Times,outDelta,inputs]+alpha outLastDelta;
outWts += outLastDelta;
outErrors.outErrors, (* this puts the error on the list *)
{numIters}] ; (* this many times; Table ends here *)
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
outVals=flnTest[outWts,ioPairs];
errorPlot = ListPlot[errorList, PlotJoined->True];
Return[{outWts,errorList,errorPlot,outVals}];
] (* end of Module *)
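(* A usage sketch (added): the functional-link net takes input vectors that
   already carry the extra functional terms, e.g. {x1, x2, x1 x2} for an
   XOR-style problem; the data here are hypothetical. *)
flnPairs = {{{0,0,0},{0}}, {{0,1,0},{1}}, {{1,0,0},{1}}, {{1,1,1},{0}}};
{oWts,errs,ePlot,outs} = fln[3,1,flnPairs,0.3,0.5,500];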
flnTest[outputWts_,ioPairVectors_] :=
Module[{inputs,hidden,outputs,desired,errors,i,len,
errorTotal,errorSum},
inputs=Map[First,ioPairVectors];
desired=Map[Last,ioPairVectors];
len = Length[inputs];
outputs=inputs.Transpose[outputWts];
errors= desired-outputs;
For[i=1,i<=len,i++,
(*Print["Input ",i," = ",inputs[[i]]];*)
Print[" Output ",i," = ",outputs[[i]]," desired = ",
desired[[i]]," Error = ",errors[[i]]];Print[];
]; (* end of For *)
(*Print["errors= ".errors];Print[];*)
errorSum = Apply[Plus,errors“2,2]; (* second level *)
(*Print["errorSum= ".errorSum];Print[];*)
errorTotal = Apply[Plus,errorSum];
(*Print["errorTotal= ".errorTotal];*)
Print["Mean Squared Error = ",errorTotal/len];
Return[outputs];
] (* end of Module *)
normalize[x_List] := x/Sqrt[x.x] //N
energyHop[x_,w_] := -0.5 x . w . x;
psi[inValue_,netIn_] := If[netIn>0,1,
If[netIn<0,-1,inValue]]
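(* Worked examples (added) of the threshold function: psi returns 1 for a
   positive net input, -1 for a negative net input, and leaves the unit
   unchanged when the net input is exactly zero:
     psi[1, 0.7]  -->  1
     psi[1,-0.7]  --> -1
     psi[1, 0  ]  -->  1   (unchanged) *)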
phi[inVector_List,netInVector_List] :=
MapThread[psi[#,#2]&,{inVector,netInVector}]
makeHopfieldWts[trainingPats_,printWts_:True] :=
Module[{wtVector},
wtVector =
Apply[Plus,Map[Outer[Times,#,#]&,trainingPats]];
If[printWts,
Print [];
Print[MatrixForm[wtVector]];
Print[];,Continue
]; (* end of If *)
Return[wtVector];
] (* end of Module *)
discreteHopfield[wtVector_,inVector_,printAll_:True] :=
Module[{done, energy, newEnergy, netInput,
newInput, output},
done = False;
newInput = inVector;
energy = energyHop[inVector,wtVector];
If[printAll,
Print[];Print["Input vector = ",inVector];
Print[];
Print["Energy = ",energy];
Print[],Continue
]; (* end of If *)
While[!done,
netInput = wtVector . newInput;
output = phi[newInput,netInput];
newEnergy = energyHop[output,wtVector];
If[printAll,
Print[];Print["Output vector = ",output];
Print[];
Print["Energy = ",newEnergy];
Print[],Continue
]; (* end of If *)
If[energy==newEnergy,
done=True,
energy=newEnergy;newInput=output,
Continue
]; (* end of If *)
]; (* end of While *)
If[!printAll,
Print[];Print["Output vector = ",output];
Print[];
Print["Energy = ",newEnergy];
Print[];
]; (* end of If *)
]; (* end of Module *)
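(* A usage sketch (added): store two hypothetical bipolar patterns, then
   recall from a probe with one flipped bit; energyHop and phi are defined
   above. *)
pats = {{1,-1,1,-1}, {1,1,-1,-1}};
wts  = makeHopfieldWts[pats];
discreteHopfield[wts, {1,-1,1,1}, True];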
probPsi[inValue_,netIn_,temp_] :=
If[Random[]<=prob[netIn,temp],1,psi[inValue,netIn]];
stochasticHopfield[inVector_,weights_,numSweeps_,temp_]:=
Module[ {input, net, indx, numUnits, indxList, output},
numUnits=Length[inVector];
indxList=Table[0,{numUnits}];
input=inVector;
For[i=1,i<=numSweeps,i++,
Print["i= ",i];
For[j=1,j<=numUnits,j++,
(* select unit *)
indx = Random[Integer,{1,numUnits}];
(* net input to unit *)
net=input . weights[[indx]];
(* update input vector *)
output=probPsi[input[[indx]],net,temp];
input[[indx]]=output;
indxList[[indx]]+=l;
]; (* end For numUnits *)
Print[ ];Print["New input vector = "];Print[input];
]; (* end For numSweeps *)
Print[ ];Print["Number of times each unit was updated:"];
Print[ ];Print[indxList];
]; (* end of Module *)
pnnTwoClass[class1Exemplars_,class2Exemplars_,
testInputs_,sig_] :=
Module[{weightsA,weightsB,inputsNorm,patternAout,
patternBout,sumAout,sumBout},
weightsA = Map[normalize,class1Exemplars];
weightsB = Map[normalize,class2Exemplars];
inputsNorm = Map[normalize,testInputs];
sigma = sig;
patternAout =
gaussOut[inputsNorm . Transpose[weightsA]];
patternBout =
gaussOut[inputsNorm . Transpose[weightsB]];
sumAout = Map[Apply[Plus,#]&,patternAout];
sumBout = Map[Apply[Plus,#]&,patternBout];
outputs = Sign[sumAout-sumBout];
sigma=.;
Return[outputs];
]
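(* A usage sketch (added, hypothetical exemplars): the result is +1 where a
   test vector is assigned to class A and -1 for class B. Assumes gaussOut
   (which reads the global sigma) as defined elsewhere in the text. *)
classA = {{1.0,0.1}, {0.9,0.2}};
classB = {{0.1,1.0}, {0.2,0.8}};
pnnTwoClass[classA, classB, {{0.95,0.15},{0.15,0.9}}, 0.1]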
nOutOfN[weights_,externIn_,numUnits_,lambda_,deltaT_,
numIters_,printFreq_,reset_:False]:=
Module[{iter,l,dt,indx,ins},
dt=deltaT;
l=lambda;
iter=numIters;
ins=externIn;
(* only reset if starting over *)
If[reset,ui=Table[Random[],{numUnits}];
vi = g[l,ui],Continue]; (* end of If *)
Print["initial ui = ",N[ui,2]];Print[];
Print["initial vi = ",N[vi,2]];
For[iter=1,iter<=numIters,iter++,
indx = Random[Integer,{1,numUnits}];
ui[[indx]] = ui[[indx]]+
dt (vi . Transpose[weights[[indx]]] +
ui[[indx]] + ins[[indx]]);
vi[[indx]] = g[l,ui[[indx]]];
If[Mod[iter,printFreq]==0,
Print[];Print["iteration = ",iter];
Print["net inputs = "];
Print[N[ui,2]];
Print["outputs = "];
Print[N[vi,2]];Print[];
]; (* end of If *)
]; (* end of For *)
Print[];Print["iteration = ",--iter];
Print["final outputs = "];
Print[vi];
]; (* end of Module *)
Traveling Salesperson Problem
tsp[weights_,externIn_,numUnits_,lambda_,deltaT_,
numIters_,printFreq_,reset_:False]:=
Module[{iter,l,dt,indx,ins,utemp},
dt=deltaT;
l=lambda;
iter=numIters;
ins=externIn;
(* only reset if starting over *)
If[reset,
utemp = ArcTanh[(2.0/Sqrt[numUnits])-1]/l;
ui=Table[
utemp+Random[Real,{-utemp/10,utemp/10}],
{numUnits}]; (* end of Table *)
vi = g[l,ui],Continue]; (* end of If *)
Print["initial ui = ",N[ui,2]];Print[];
Print["initial vi = ",N[vi,2]];
For[iter=1,iter<=numIters,iter++,
indx = Random[Integer,{1,numUnits}];
ui[[indx]] = ui[[indx]]+
dt (vi . Transpose[weights[[indx]]] +
ui[[indx]] + ins[[indx]]);
vi[[indx]] = g[l,ui[[indx]]];
If[Mod[iter,printFreq]==0,
Print[];Print["iteration = ",iter];
Print["net inputs = "];
Print[N[ui,2]];
Print["outputs = "];
Print[N[vi,2]];Print[];
]; (* end of If *)
]; (* end of For *)
Print[];Print["iteration = ",--iter];
Print["final outputs = "];
Print[MatrixForm[Partition[N[vi,2],Sqrt[numUnits]]]];
]; (* end of Module *)
BAM
makeXtoYwts[exemplars_] :=
Module[{temp},
temp = Map[Outer[Times,#[[2]],#[[1]]]&,exemplars];
Apply[Plus,temp]
]; (* end of Module *)
psi[inValue_,netIn_] := If[netIn>0,1,If[netIn<0,-1,inValue]];
phi[inVector_List,netInVector_List] :=
MapThread[psi[#,#2]&,{inVector,netInVector}];
bam[initialX_,initialY_,x2yWeights_,y2xWeights_,printAll_:False] :=
Module[{done,newX,newY,energy1,energy2},
done = False;
newX = initialX;
newY = initialY;
While[done == False,
newY = phi[newY,x2yWeights.newX];
If[printAll,Print[];Print[];Print["y = ",newY]];
energy1 = energyBAM[newY,x2yWeights,newX];
If[printAll,Print["energy = ",energy1]];
newX = phi[newX,y2xWeights . newY];
If[printAll,Print[];Print["x = ",newX]];
energy2 = energyBAM[newY,x2yWeights,newX];
If[printAll,Print["energy = ",energy2]];
If[energy1 == energy2, done=True, Continue];
]; (* end of While *)
Print[];Print[];
Print["final y = ",newY," energy= ",energy1];
Print["final x = ",newX," energy= ",energy2];
]; (* end of Module *)
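(* A usage sketch (added, hypothetical exemplars): build the X-to-Y weight
   matrix, use its transpose in the Y-to-X direction, and let the BAM settle
   from a noisy x. Assumes energyBAM as defined elsewhere in the text. *)
exemplars = {{{1,-1,1,-1},{1,-1}}, {{-1,-1,1,1},{-1,1}}};
x2y = makeXtoYwts[exemplars];
bam[{1,-1,1,1}, {1,-1}, x2y, Transpose[x2y], True];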
Elman Network
elman[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,conUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+hidNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+hidNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass; select a sequence *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
conUnits = Table[0.5,{hidNumber}]; (* reset conUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[conUnits,ioP[[1]]]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta;
conUnits = hidOuts; (* update context units *)
(* put the sum of the squared errors on the list *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
elmanTest[hidWts,outWts,ioPairs,hidNumber];
errorPlot = ListPlot[errorList, PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
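(* A usage sketch (added): each element of ioPairs is itself a sequence of
   {input, output} pairs; here two hypothetical two-step sequences train a
   2-input, 3-hidden, 1-output network. Assumes sigmoid and elmanTest. *)
seqs = {{{{0,1},{1}}, {{1,0},{0}}},
        {{{1,1},{0}}, {{0,0},{1}}}};
{hWts,oWts,errs,ePlot} = elman[2,3,1,seqs,0.5,0.7,500];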
elmanTest[hiddenWts_,outputWts_,ioPairVectors_,conNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,conUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];Print[ioPairVectors]];
(* loop through the sequences *)
For[i=1,i<=Length[ioPairVectors],i++,
(* select the next sequence *)
ioSequence = ioPairVectors[[i]];
(* reset the context units *)
conUnits = Table[0.5,{conNumber}];
(* loop through the chosen sequence *)
For[j=1,j<=Length[ioSequence],j++,
ioP = ioSequence[[j]];
(* join context and input units *)
inputs=Join[conUnits,ioP[[1]]];
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=sigmoid[outputWts.hidden];
errors= desired-outputs;
(* update context units *)
conUnits = hidden;
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[];
];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)
elmanComp[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,conUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+hidNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+hidNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
outErrors = Table[0,{outNumber}];
For[indx=1,indx<=numIters,indx++,
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
conUnits = Table[0.5,{hidNumber}]; (* reset conUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[conUnits,ioP[[1]]]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = outWts.hidOuts; (* output-layer outputs *)
outputs = sigmoid[outputs - 0.3 Apply[Plus,outputs] + 0.5 outputs];
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
conUnits = hidOuts; (* update context units *)
(* put the sum of the squared errors on the list *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts]; Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts]; Print[];
elmanCompTest[hidWts,outWts,ioPairs,hidNumber];
errorPlot = ListPlot[errorList, PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
elmanCompTest[hiddenWts_,outputWts_,ioPairVectors_,conNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,prntAll,conUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
conUnits = Table[0.5,{conNumber}]; (* reset the context units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[conUnits,ioP[[1]]]; (* join context and input units *)
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=outputWts.hidden;
outputs=sigmoid[outputs -
0.3 Apply[Plus,outputs] + 0.5 outputs];
errors= desired-outputs;
(* update context units *)
conUnits = hidden;
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[];
];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)
Jordan Network
jordan[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,mu_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,stateUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+outNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+outNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
stateUnits = Table[0.1,{outNumber}]; (* reset stateUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[stateUnits,ioP[[1]]]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
stateUnits = mu stateUnits + outputs; (* update state units *)
(* put the sum of the squared errors on the list *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
jordanTest[hidWts,outWts,ioPairs,mu,outNumber];
errorPlot = ListPlot[errorList, PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
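(* A usage sketch (added), analogous to the Elman example: the Jordan net
   feeds outputs back through state units of length outNumber, decayed by
   mu. Assumes sigmoid and jordanTest; the sequences are hypothetical. *)
seqs = {{{{0,1},{1}}, {{1,0},{0}}},
        {{{1,1},{0}}, {{0,0},{1}}}};
{hWts,oWts,errs,ePlot} = jordan[2,3,1,seqs,0.5,0.7,0.5,500];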
jordanTest[hiddenWts_,outputWts_,ioPairVectors_,
mu_,stateNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,stateUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
stateUnits = Table[0.1,{stateNumber}]; (* reset the context units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[stateUnits,ioP[[1]]]; (* join context and input units *)
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=sigmoid[outputWts.hidden];
errors= desired-outputs;
stateUnits = mu stateUnits + outputs; (* update context units *)
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[]; ];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)
(* this version sets the state units equal to the desired output values,
rather than the actual output values, during the training process *)
jordan2[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,mu_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,stateUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+outNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+outNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
stateUnits = Table[0.1,{outNumber}]; (* reset stateUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[stateUnits,ioP[[1]]]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
stateUnits = mu stateUnits + outDesired; (* update state units *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
jordan2Test[hidWts,outWts,ioPairs,mu,outNumber];
errorPlot = ListPlot[errorList, PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)
jordan2Test[hiddenWts_,outputWts_,ioPairVectors_,
mu_,stateNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,stateUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];
Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
stateUnits = Table[0.1,{stateNumber}]; (* reset the context units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[stateUnits,ioP[[1]]]; (* join context and input units *)
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=sigmoid[outputWts.hidden];
errors= desired-outputs;
stateUnits = mu stateUnits + desired; (* update context units *)
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[];];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)
jordan2aTanhTest[hiddenWts_,outputWts_,ioPairVectors_,
mu_,stateNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,stateUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[]; Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
stateUnits = Table[0.1,{stateNumber}]; (* reset the state units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[stateUnits,ioP[[1]]]; (* join context and input units *)
desired=ioP[[2]];
hidden=Tanh[hiddenWts.inputs];
outputs=Tanh[outputWts.hidden];
errors= desired-outputs;
stateUnits = mu stateUnits + desired; (* update context units *)
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[]; ];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)
ART
compete[f2Activities_] :=
Module[{i,x,f2dim,maxpos},
x=f2Activities;
maxpos=First[First[Position[x,Max[f2Activities]]]];
f2dim = Length[x];
For[i=1,i<=f2dim,i++,
If[i!=maxpos,x[[i]]=0;Continue] (* end of If *)
]; (* end of For *)
Return[x];
]; (* end of Module *)
art1Init[f1dim_,f2dim_,b1_,d1_,el_,del1_,del2_] :=
Module[{z12,z21},
z12 = Table[Table[(b1-1)/d1 + del1,{f2dim}],{f1dim}];
z21 = Table[Table[(el/(el-1+f1dim)-del2),{f1dim}],{f2dim}];
Return[{z12,z21}];
]; (* end of Module *)
art1[f1dim_,f2dim_,a1_,b1_,c1_,d1_,el_,rho_,f1Wts_,f2Wts_,inputs_] :=
Module[{droplistInit,droplist,notDone=True,i,n,nIn=Length[inputs],reset,
in,xf1,sf1,t,xf2,uf2,v,windex,matchList,newMatchList,tdWts,buWts},
droplistInit = Table[1,{f2dim}]; (* initialize droplist *)
tdWts=f1Wts; buWts=f2Wts;
matchList = (* construct list of F2 units and encoded input patterns *)
Table[{StringForm["Unit ``",n]},{n,f2dim}];
While[notDone==True,newMatchList = matchList; (* process until stable *)
For[i=1,i<=nIn,i++,in = inputs[[i]]; (* process inputs in sequence *)
droplist = droplistInit;reset=True; (* initialize *)
While[reset==True, (* cycle until no reset *)
xf1 = in/(1+a1*(in+b1)+c1); (* activities *)
sf1 = Map[If[#>0,1,0]&,xf1]; (* F1 outputs *)
t= buWts . sf1; (* F2 net-inputs *)
t = t droplist; (* turn off inhibited units *)
xf2 = compete[t]; (* F2 activities *)
uf2 = Map[If[#>0,1,0]&,xf2]; (* F2 outputs *)
windex = winner[uf2,1]; (* winning index *)
v= tdWts . uf2; (* F1 net-inputs *)
xf1 =(in+ d1*v-b1)/(1+a1*(in+d1*v)+c1); (* new F1 activities *)
sf1 = Map[If[#>0,1,0]&,xf1]; (* new F1 outputs *)
reset = resetflag1[sf1,in,rho]; (* check reset *)
If[reset==True, droplist[[windex]]=0; (* update droplist *)
Print["Reset with pattern ",i," on unit ",windex],Continue];
]; (* end of While reset==True *)
Print["Resonance established on unit ",windex," with pattern ",i];
tdWts=Transpose[tdWts]; (* resonance, so update weights, top down first *)
tdWts[[windex]]=sf1;
tdWts=Transpose[tdWts];
buWts[[windex]] = el/(el-1+vmag1[sf1]) sf1; (* then bottom up *)
matchList[[windex]] = (* update matching list *)
Reverse[Union[matchList[[windex]],{i}]];
]; (* end of For i=1 to nIn *)
If[matchList==newMatchList, notDone=False; (* see if matchList is static *)
Print["Network stable"],
Print["Network not stable"];
newMatchList = matchList];]; (* end of While notDone==True *)
Return[{tdWts,buWts,matchList}];
]; (* end of Module *)
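(* A usage sketch (added): initialize the ART1 weights, then cluster three
   hypothetical binary patterns. The parameter values are illustrative only;
   winner, resetflag1, and vmag1 are assumed from elsewhere in the text. *)
{z12,z21} = art1Init[4, 3, 1.5, 0.9, 3.0, 0.2, 0.1];
{td,bu,matches} = art1[4, 3, 1.0, 1.5, 5.0, 0.9, 3.0, 0.9,
                       z12, z21, {{1,0,1,0},{1,1,0,0},{0,0,1,1}}];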
art2F1[in_,a_,b_,d_,tdWts_,f1d_,winr_:0] :=
Module[{w,x,u,v,p,q,i},
w=x=u=v=p=q=Table[0,{f1d}];
For[i=1,i<=2,i++,
w = in + a u;
x = w / vmag2[w];
v = f[x] + b f[q];
u = v / vmag2[v];
p = If[winr==0,u,
u + d Transpose[tdWts][[winr]] ];
q = p / vmag2[p];
]; (* end of For i *)
Return[{u,p}];
] (* end of Module *)
art2Init[f1dim_,f2dim_,d_,del1_] :=
Module[{z12,z21},
z12 = Table[Table[0,{f2dim}],{f1dim}];
z21 = Table[Table[del1/((1-d)*Sqrt[f1dim]),
{f1dim}],{f2dim}];
Return[{z12,z21}];
]; (* end of Module *)
art2[f1dim_,f2dim_,a1_,b1_,c1_,d_,theta_,rho_,f1Wts_,f2Wts_,inputs_] :=
Module[{droplistInit,droplist,notDone=True,i,n,nIn=Length[inputs],reset,
in,u,p,t,xf2,uf2,v,windex,matchList,newMatchList,tdWts,buWts},
droplistInit = Table[1,{f2dim}]; (* initialize droplist *)
tdWts = f1Wts; buWts = f2Wts;
u = p = Table[0,{f1dim}];
(* construct list of F2 units and encoded input patterns *)
matchList = Table[{StringForm["Unit ``",n]},{n,f2dim}];
While[notDone==True,newMatchList = matchList; (* process until stable *)
For[i=1,i<=nIn,i++, (* process each input pattern in sequence *)
droplist = droplistInit; (* initialize droplist for new input *)
reset=True;
in = inputs[[i]]; (* next input pattern *)
windex = 0; (* initialize *)
While[reset==True, (* cycle until no reset *)
{u,p} = art2F1[in,a1,b1,d,tdWts,f1dim,windex];
t= buWts . p; (* F2 net-inputs *)
t = t droplist; (* turn off inhibited units *)
xf2 = compete[t]; (* F2 activities *)
uf2 = Map[g,xf2]; (* F2 outputs *)
windex = winner[uf2,d]; (* winning index *)
{u,p} = art2F1[in,a1,b1,d,tdWts,f1dim,windex];
reset = resetflag2[u,p,d,rho]; (* check reset *)
If[reset==True,droplist[[windex]]=0; (* update droplist *)
Print["Reset with pattern ",i," on unit ",windex],Continue];
]; (* end of While reset==True *)
Print["Resonance established on unit ",windex," with pattern ",i];
tdWts=Transpose[tdWts]; (* resonance, so update weights *)
tdWts[[windex]]=u/(1-d); tdWts=Transpose[tdWts];
buWts[[windex]] = u/(1-d);
matchList[[windex]] = (* update matching list *)
Reverse[Union[matchList[[windex]],{i}]];
]; (* end of For i=1 to nIn *)
If[matchList==newMatchList, notDone=False; (* see if matchList is static *)
Print["Network stable"], Print["Network not stable"];
newMatchList = matchList];
]; (* end of While notDone==True *)
Return[{tdWts,buWts,matchList}];
]; (* end of Module *)
Genetic Algorithms
flip[x_] := If[Random[]<=x,True,False]
newGenerate[pmutate_,keyPhrase_,pop_,numGens_] :=
Module[{i,newPop,parent,diff,matches,
index,fitness},
newPop=pop;
For[i=1,i<=numGens,i++,
diff = Map[(keyPhrase-#)&,newPop];
matches = Map[Count[#,0]&,diff];
fitness = Max[matches];
index = Position[matches,fitness];
parent = newPop[[First[Flatten[index]]]];
Print["Generation ",i,": ",FromCharacterCode[parent],
" Fitness= ",fitness];
newPop = Table[Map[mutateLetter[pmutate,#]&,parent],{100}];
]; (* end of For *)
]; (* end of Module *)
decodeBGA[chromosome_] :=
Module[{pList,lchrom,values,decimal,phenotype},
lchrom = Length[chromosome];
(* convert from binary to decimal *)
pList = Flatten[Position[chromosome,1]];
values = Map[2^(lchrom-#)&, pList];
decimal = Apply[Plus,values];
(* scale to proper range *)
phenotype = decimal (0.07820136852394916911)-40;
Return[phenotype];
]; (* end of Module *)
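(* A worked example (added): the scale factor equals 80/1023, so a 10-bit
   chromosome decodes onto the interval [-40, 40]:
     decodeBGA[{0,0,0,0,0,0,0,0,0,0}]  -->  -40
     decodeBGA[{1,1,1,1,1,1,1,1,1,1}]  -->   40.  *)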
selectOne[foldedFitnessList_,fitTotal_] :=
Module[{randFitness,elem,index},
randFitness = Random[] fitTotal;
elem = Select[foldedFitnessList,#>=randFitness&,1];
index =
Flatten[Position[foldedFitnessList,First[elem]]];
Return[First[index]];
]; (* end of Module *)
myXor[x_,y_] := If[x==y,0,1];
mutateBGA[pmute_,allel_] :=
If[flip[pmute],myXor[allel,1],allel];
crossover[pcross_,pmutate_,parent1_,parent2_] :=
Module[{child1,child2,crossAt,lchrom},
(* chromosome length *)
lchrom = Length[parent1];
If[ flip[pcross],
(* True: select cross site at random *)
crossAt = Random[Integer,{1,lchrom-1}];
(* construct children *)
child1 = Join[Take[parent1,crossAt], Drop[parent2,crossAt]];
child2 = Join[Take[parent2,crossAt], Drop[parent1,crossAt]],
(* False: return parents as children *)
child1 = parent1;
child2 = parent2;
]; (* end of If *)
(* perform mutation *)
child1 = Map[mutateBGA[pmutate,#]&,child1];
child2 = Map[mutateBGA[pmutate,#]&,child2];
Return[{child1,child2}];
]; (* end of Module *)
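(* A usage sketch (added, hypothetical parents): single-point crossover with
   probability 0.75 and a per-bit mutation probability of 0.01. *)
{c1,c2} = crossover[0.75, 0.01, {1,1,1,1,1,1}, {0,0,0,0,0,0}];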
initPop[psize_,csize_] :=
Table[Random[Integer,{0,1}],{psize},{csize}];
displayBest[fitnessList_,number2Print_] :=
Module[{i,sortedList},
sortedList = Sort[fitnessList, Greater];
For[i=1,i<=number2Print,i++,
Print["fitness = ", sortedList[[i]] ];
]; (* end of For i *)
]; (* end of Module *)
bga[pcross_,pmutate_,popInitial_,fitFunction_,numGens_,printNum_] :=
Module[{i,newPop,parent1,parent2,diff,matches,
oldPop,reproNum,index,fitList,fitListSum,
fitSum,pheno,pIndex1,pIndex2,f,children},
oldPop=popInitial; (* initialize first population *)
reproNum = Length[oldPop]/2; (* calculate number of reproductions *)
f = fitFunction; (* assign the fitness function *)
For[i=1,i<=numGens,i++, (* perform numGens generations *)
pheno = Map[decodeBGA,oldPop]; (* decode the chromosomes *)
fitList = f[pheno]; (* determine the fitness of each phenotype *)
Print[" "]; (* print out the best individuals *)
Print["Generation ",i," Best ",printNum];
Print[" "];
displayBest[fitList,printNum];
fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
fitSum = Last[fitListSum]; (* find the total fitness *)
newPop = Flatten[Table[ (* determine the new population *)
pIndex1 = selectOne[fitListSum,fitSum]; (* select parent indices *)
pIndex2 = selectOne[fitListSum,fitSum];
parent1 = oldPop[[pIndex1]]; (* identify parents *)
parent2 = oldPop[[pIndex2]];
children = crossover[pcross,pmutate,parent1,parent2]; (* crossover and mutate *)
children,{reproNum}],1 (* add children to list; flatten to first level *)
]; (* end of Flatten[Table] *)
oldPop = newPop; (* new becomes old for next gen *)
]; (* end of For i *)
]; (* end of Module *)
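(* A usage sketch (added): maximize the hypothetical fitness function
   1600 - x^2 over the decoded range [-40, 40]. bga applies the fitness
   function to the whole list of phenotypes at once, which plain list
   arithmetic handles; fitness must stay nonnegative for roulette selection. *)
peak[phenoList_] := 1600. - phenoList^2;
pop0 = initPop[20, 10];   (* 20 chromosomes of 10 bits, matching decodeBGA *)
bga[0.75, 0.01, pop0, peak, 30, 3];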
sigmoid[x_] := 1./(1+E^(-x));
initXorPop[psize_,csize_,ioPairs_] :=
Module[{i,iPop,hidWts,outWts,mseInv},
(* first the chromosomes *)
iPop = Table[
{Table[Random[Integer,{0,1}],{csize}], (* h1 *)
Table[Random[Integer,{0,1}],{csize}], (* h2 *)
Table[Random[Integer,{0,1}],{csize}]  (* o1 *)
}, {psize} ]; (* end of Table *)
(* then decode and eval fitness *)
(* use For loop for clarity *)
For[i=1,i<=psize,i++,
(* make hidden weight matrix *)
hidWts = Join[iPop[[i,1]],iPop[[i,2]]];
hidWts = Partition[hidWts,20];
hidWts = Map[decodeXorChrom,hidWts];
hidWts = Partition[hidWts,2];
(* make output weight matrix *)
outWts = Partition[iPop[[i,3]],20];
outWts = Map[decodeXorChrom,outWts];
(* get mse for this network *)
mseInv = gaNetFitness[hidWts,outWts,ioPairs];
(* prepend mseInv *)
PrependTo[iPop[[i]],mseInv];
]; (* end For *)
Return[iPop];
]; (* end of Module *)
decodeXorChrom[chromosome_] :=
Module[{pList,lchrom,values,p,decimal},
lchrom = Length[chromosome];
(* convert from binary to decimal *)
pList = Flatten[Position[chromosome,1]];
values = Map[2^(lchrom-#)&, pList];
decimal = Apply[Plus,values];
(* scale to proper range *)
p = decimal (9.536752259018191355*10^-6)-5;
Return[p];
]; (* end of Module *)
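(* A note (added): the scale factor equals 10/(2^20 - 1), so each 20-bit
   chromosome segment decodes to a weight in the interval [-5, 5]. *)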
gaNetFitness[hiddenWts_,outputWts_,ioPairVectors_] :=
Module[{inputs,hidden,outputs,desired,errors,
len,errorTotal,errorSum},
inputs=Map[First,ioPairVectors];
desired=Map[Last,ioPairVectors];
len = Length[inputs];
hidden=sigmoid[inputs.Transpose[hiddenWts]];
outputs=sigmoid[hidden.Transpose[outputWts]];
errors= desired-outputs;
errorSum = Apply[Plus,errors^2,2]; (* second level *)
errorTotal = Apply[Plus,errorSum];
(* inverse of mse *)
Return[len/errorTotal];
] (* end of Module *)
crossoverXor[pcross_,pmutate_,parent1_,parent2_] :=
Module[{child1,child2,crossAt,lchrom,
i,numchroms,chroms1,chroms2},
(* strip off mse *)
chroms1 = Rest[parent1];
chroms2 = Rest[parent2];
(* chromosome length *)
lchrom = Length[chroms1[[1]]];
(* number of chromosomes in each list *)
numchroms = Length[chroms1];
For[i=1,i<=numchroms,i++, (* for each chrom *)
If[ flip[pcross],
crossAt = Random[Integer,{1,lchrom-1}]; (* True: select cross site at random *)
(* construct both children at once, so each draws on the unmodified parents *)
{chroms1[[i]],chroms2[[i]]} =
{Join[Take[chroms1[[i]],crossAt],Drop[chroms2[[i]],crossAt]],
 Join[Take[chroms2[[i]],crossAt],Drop[chroms1[[i]],crossAt]]},
Continue]; (* False: don't change chroms[[i]]. End of If *)
(* perform mutation *)
chroms1[[i]] = Map[mutateBGA[pmutate,#]&,chroms1[[i]]];
chroms2[[i]] = Map[mutateBGA[pmutate,#]&,chroms2[[i]]];
]; (* end of For i *)
Return[{chroms1,chroms2}];
]; (* end of Module *)
gaXor[pcross_,pmutate_,popInitial_,numReplace_,ioPairs_,numGens_,printNum_] :=
Module[{i,j,newPop,parent1,parent2,diff,matches,
oldPop,reproNum,index,fitList,fitListSum,
fitSum,pheno,pIndex1,pIndex2,f,children,hids,outs,mseInv},
(* initialize first population sorted by fitness value *)
oldPop= Sort[popInitial,Greater[First[#],First[#2]]&];
reproNum = numReplace; (* calculate number of reproductions *)
For[i=1,i<=numGens,i++,
fitList = Map[First,oldPop]; (* list of fitness values *)
(* make the folded list of fitness values *)
fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
fitSum = Last[fitListSum]; (* find the total fitness *)
newPop = Drop[oldPop,-reproNum]; (* new population; eliminate reproNum worst *)
For[j=1,j<=reproNum/2,j++, (* make reproNum new children *)
(* select parent indices *)
pIndex1 = selectOne[fitListSum,fitSum];
pIndex2 = selectOne[fitListSum,fitSum];
parent1 = oldPop[[pIndex1]]; (* identify parents *)
parent2 = oldPop[[pIndex2]];
children = crossoverXor[pcross,pmutate,parent1,parent2]; (* cross and mutate *)
{hids,outs} = decodeXorGenotype[children[[1]]]; (* fitness of children *)
mseInv = gaNetFitness[hids,outs,ioPairs];
children[[1]] = Prepend[children[[1]],mseInv];
{hids,outs} = decodeXorGenotype[children[[2]]];
mseInv = gaNetFitness[hids,outs,ioPairs];
children[[2]] = Prepend[children[[2]],mseInv];
newPop = Join[newPop,children]; (* add children to new population *)
]; (* end of For j *)
oldPop = Sort[newPop,Greater[First[#],First[#2]]&]; (* for next gen *)
(* print best mse values (1/mseInv) *)
Print[ ];Print["Best of generation ",i];
For[j=1,j<=printNum,j++,Print[(1.0/oldPop[[j,1]])]; ];
]; (* end of For i *)
Return[oldPop];
]; (* end of Module *)
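(* A usage sketch (added): evolve 2-2-1 XOR networks. With csize = 40, each
   of the three unit chromosomes decodes to two 20-bit weights, matching the
   partitioning in initXorPop; all run parameters here are illustrative. *)
xorPairs = {{{0,0},{0}}, {{0,1},{1}}, {{1,0},{1}}, {{1,1},{0}}};
pop0 = initXorPop[20, 40, xorPairs];
best = gaXor[0.75, 0.005, pop0, 10, xorPairs, 50, 3];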
decodeXorGenotype[genotype_] :=
Module[{hidWts,outWts},
hidWts = Join[genotype[[1]],genotype[[2]]];
hidWts = Partition[hidWts,20];
hidWts = Map[decodeXorChrom,hidWts];
hidWts = Partition[hidWts,2];
(* make output weight matrix *)
outWts = Partition[genotype[[3]],20];
outWts = Map[decodeXorChrom,outWts];
Return[{hidWts,outWts}];
]; (* end of Module *)
encodeNetGa[weight_,len_] :=
Module[{pList,values,dec,l,i},
i=len;
l=Table[0,{i}];
(* scale to proper range *)
dec = Round[(weight+5.)/(9.536752259018191355*10^-6)];
While[dec!=0&&dec!=1,
l=ReplacePart[l,Mod[dec,2],i];
dec=Quotient[dec,2];
--i;
];
l=ReplacePart[l,dec,i]
]; (* end of Module *)
randomPop[psize_,csize_,ioPairs_,numGens_] :=
Module[{i,pop},
For[i=1,i<=numGens,i++,
pop = initXorPop[psize,csize,ioPairs];
pop = Sort[pop,Greater[First[#],First[#2]]&];
Print[ ];
Print["Random generation ",i];
Print[(1.0/pop[[1,1]])];
];
]; (* end of Module *)
This book introduces neural networks, their operation and their application, in the context of
Mathematica®, a mathematical programming language. Readers will learn how to simulate
neural network operations using Mathematica, and will learn techniques for employing
Mathematica to assess neural network behavior and performance. They will see how this
popular and widely available software can be used to explore neural network technology,
experiment with various architectures, debug new training algorithms, and design techniques
for analyzing network performance.
Simulating Neural Networks with Mathematica® is suitable for professionals and students in
computer science, electrical engineering, applied mathematics, and related areas who need an
efficient way to learn about neural networks, and to gain some proficiency in their use. The
source code for the programs in the book is available free of charge via electronic mail from
“[email protected]”.
Highlights:
• Addresses a major neural network topic or a specific network architecture in each chapter.
• Includes an introduction to genetic algorithms.
• Includes Mathematica listings in an appendix.