
SIMULATING
NEURAL
NETWORKS
with
Mathematica

James A. Freeman
Loral Space Information Systems
and
University of Houston-Clear Lake

ADDISON-WESLEY PUBLISHING COMPANY
Reading, Massachusetts • Menlo Park, California
New York • Don Mills, Ontario • Wokingham, England
Amsterdam • Bonn • Sydney • Singapore • Tokyo
Madrid • San Juan • Milan • Paris
Mathematica is not associated with Mathematica Inc., Mathematica Policy Research, Inc.,
or MathTech, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products
are claimed as trademarks. Where those designations appear in this book, and
Addison-Wesley was aware of a trademark claim, the designations have been printed in
caps or initial caps.

The programs and applications presented in this book have been included for their
instructional value. They have been tested with care, but are not guaranteed for any
particular purpose. The publisher does not offer any warranties or representations, nor
does it accept any liabilities with respect to the programs or applications.

Library of Congress Cataloging-in-Publication Data

Freeman, James A.
Simulating neural networks with Mathematica / James A. Freeman,
p. cm.
Includes bibliographical references and index.
ISBN 0-201-56629-X
1. Neural networks (Computer science) 2. Mathematica (Computer
program) I. Title.
QA76.87.F72 1994
006.3-dc20

92-2345
CIP

Reproduced by Addison-Wesley from camera-ready copy supplied by the author.

Copyright (c) 1994 by Addison-Wesley Publishing Company, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher. Printed in
the United States of America.

Preface

I sat across the dinner table at a restaurant recently with a researcher from
Los Alamos National Laboratory. We were discussing a collaboration
on a neural-network project while eight of us had dinner at a quaint
restaurant in White Rock, just outside of Los Alamos. Fortified by a few
glasses of wine, I took the opportunity to mention that I had just reached
an agreement with my publisher to write a new book on neural networks
with Mathematica. My dinner companion looked over at me and asked a
question: "Why?"
I was somewhat surprised by this response, but reflecting for a moment on the
environment in which this person works, I realized that the answer to his question was
not necessarily so obvious. This person had access to incredible computational
resources: large Crays, and a 64,000-node Connection Machine, for example.
Considering the computational demands of most neural networks, why would anyone
working with them want to incur the overhead of an interpreted language like
Mathematica? To me the answer was obvious, but to him, and possibly to you, the
answer may require some explanation.
During the course of preparing the manuscript for an earlier text, Neural Networks:
Algorithms, Applications, and Programming Techniques (henceforth "Neural
Networks"), I used Mathematica extensively for a variety of purposes. I simulated the
operation of most of the neural networks described in that text using Mathematica.
Mathematica gave me the ability to confirm nuances of network behavior and
performance, as well as to develop examples and exercises pertaining to the networks.
In addition, I used Mathematica's graphics capability to illustrate numerous points
throughout the text. The ease and speed with which I was able to implement a new
network spoke highly of using Mathematica as a tool for exploring neural-network
technology.

The idea for Simulating Neural Networks with Mathematica grew along
with my conviction that Mathematica had saved me countless hours of
programming time. Even though Mathematica is an interpreted language
(much like the old BASIC) and neural networks are notorious CPU hogs,
there seemed to me much insight to be gained by using Mathematica in
the early stages of network development.
I explained all of these ideas (somewhat vigorously) to my dinner
companion. No doubt feeling a bit of remorse for asking the question in
the first place, he promised to purchase a copy of the book as soon as it
was available. I hope he finds it useful. I have the same wish for you.
This book will introduce you to the subject of neural networks within
the context of the interactive Mathematica environment. There are two
main thrusts of this text: teaching about neural networks and showing
how Mathematica can be used to implement and experiment with neural-
network architectures.
In Neural Networks my coauthor and I stated that you should do some programming of
your own, in a high-level language such as C, FORTRAN, or Pascal, in order to gain a
complete understanding of the networks you study. I do not wish to retract that
philosophy, and this book is not an attempt to show you how you can avoid eventual
translation of your neural networks into executable software. As I stated in the previous
paragraph, most neural networks are computationally quite intensive. A neural network
of any realistic size for an actual application would likely overwhelm Mathematica.
Nevertheless, a researcher can use Mathematica to experiment with variants of
architectures, debug a new training algorithm, design techniques for the analysis of
network performance, and perform many other analyses that would prove far more
time-consuming if done in traditional software or by hand. This book illustrates many
of those techniques.
The book is suitable for a course in neural networks at the upper-level
undergraduate or beginning graduate level in computer science, electrical
engineering, applied mathematics, and related areas. The book is also
suitable for self-study. The best way to study the material presented
here is interactively, executing statements and trying new ideas as you
progress.
This book does not assume expertise with either neural networks or
Mathematica, although I expect that the readers will most likely know
something of Mathematica. If you have read through the first chapter
of Stephen Wolfram's Mathematica book and have spent an hour or two

interacting with Mathematica, you will have more than sufficient background for the
Mathematica syntax in this book.
I have kept the Mathematica syntax as understandable as possible. It is too easy to
spend an inordinate amount of time trying to decipher complex Mathematica
expressions and programs, thereby missing the forest for the trees. Moreover, many
individual functions are quite long, with embedded print statements and code segments
that I could have written as separate functions. I chose not to, however, so the code
stands as it is. For these reasons, some Mathematica experts may find the coding a bit
inelegant: To them I extend my apologies in advance. I have also chosen to violate one
of the "rules" of Mathematica programming: using complete English words for function
and variable names. As an example, I use bpnMomentum instead of the more "correct"
backpropagationNetworkWithMomentum. The former term is easily interpreted,
requires less space (leading to more understandable expressions), and is less prone to
typographical errors.
Readers with no prior experience with neural networks should have no trouble
following the text, although the theoretical development is not as complete as in Neural
Networks. I do assume a familiarity with basic linear algebra and calculus. Chapter 1 is
a prerequisite to all other chapters in the book; the material in this chapter is fairly
elementary, however, both in terms of the neural-network theory and the use of
Mathematica. Readers who possess at least a working knowledge of each subject can
safely skip Chapter 1. If you have never studied the gradient-descent method of
learning, then you should also study Chapter 2 before proceeding to later chapters. The
discussion of the Elman and Jordan networks in Chapter 6 requires an understanding of
the backpropagation algorithm given in Chapter 3.
The text comprises eight chapters; each one, with the exception of the last, deals with a
major topic related to neural networks or to a specific type of network architecture.
Each chapter also includes a simulation of the networks using Mathematica and, in
many instances, demonstrates the use of Mathematica to explore the characteristics of
the network and to experiment with variations.
The last chapter introduces the subject of genetic algorithms (GAs).
We will study the application of GAs to a scaled-down version of the
traveling salesperson problem. To tie the subject back to neural networks,
we will look at one method of using GAs to find optimum weights for a
neural network. A brief description of the chapters follows.

Chapter 1: Introduction to Neural Networks and Mathematica. This chapter contains
basic information about neural networks and defines many of the conventions in
notation and terminology that are used throughout the text. I also introduce the concept
of learning in neural networks, and that of Hinton diagrams, and show how to construct
them with Mathematica.
Chapter 2: Training by Error Minimization. This chapter introduces the topic of
gradient descent on an error surface as a method of learning in a simple neural network.
The chapter begins with the Adaline, showing both the theoretical calculation of the
optimum weight vector and the least-mean-square algorithm. I also describe techniques
for simulating the Adaline using Mathematica and for analyzing the performance of the
network.
Chapter 3: Backpropagation and Its Variants. In this chapter we develop the generalized
delta rule and apply it to a multilayered network generally known as a backpropagation
network. To demonstrate how Mathematica can be used to experiment with different
network architectures, we investigate several modifications to the basic network
structure in an attempt to improve the performance of the network. We also study a
related network known as the functional link network.
Chapter 4: Probability and Neural Networks. In this chapter we explore the use of
probability concepts in neural networks using two different types of networks. After
describing the deterministic Hopfield network, we examine some concepts in the
thermodynamics of physical systems and see how they relate to the Hopfield network.
Then we use a technique called simulated annealing to develop a stochastic Hopfield
network. We also examine the probabilistic neural network, which is an example of a
network that implements a traditional classification algorithm based on Bayesian
probability theory.
Chapter 5: Optimization and Constraint Satisfaction with Neural Networks. In this
chapter we explore the concept of constraint satisfaction using the familiar traveling
salesperson problem. Then we illustrate how this problem, and other
constraint-satisfaction problems, map onto the Hopfield network.
Chapter 6: Feedback and Recurrent Networks. In this chapter we describe and simulate
networks that depart from the simple feedforward structures typical of the
backpropagation network. We begin with a two-layer network called a bidirectional
associative memory (BAM). We then proceed to networks that can recognize and
reproduce time-sequences of input patterns. Specifically, we look at the networks
described by Jordan and Elman.
Chapter 7: Adaptive Resonance Theory. Adaptive Resonance Theory (ART) defines a
class of neural networks that includes several different, but related, types, including
ART1, ART2, ART3, ARTMAP, Fuzzy ART, and other variations. This chapter
explores ART1 and ART2. ART2 is distinguished from ART1 primarily by its ability to
accommodate other than binary input vectors.
Chapter 8: Genetic Algorithms. In this final chapter, I depart from the neural-network
field to describe the basics of a data-processing technique called genetic algorithms
(GAs). Like some neural networks, GAs are good at solving optimization problems of
many types. Since determining weight vectors in a neural network is an optimization
problem in its own right, a discussion of GAs is appropriate in a book on neural
networks.
Beginning with Chapter 2, each chapter has associated with it a set
of Mathematica functions. A complete listing of the functions for each
chapter appears in the Appendix. The Bibliography contains a list of
reference books from which you can obtain additional information about
the neural networks described in this book, as well as many other neural
networks. I have also included a few references on Mathematica and
genetic algorithms.
Because this text is supposed to convey a spirit of exploration, I have not attempted to
include only the best, or most efficient, neural-network techniques. In fact, some of the
experiments presented result in networks that do not actually work very well, but such
is the nature of experimentation. We can learn from these attempts as well as from
those that are successful. The best way to study the material is interactively, executing
statements and trying new ideas as you progress. I trust that you will be able to accept a
text that does not always give the "right" answer and that you will forgive any lingering
errors, for which I accept full responsibility.

Source Code Availability

The source code for all of the functions in this text is available, free of charge, from
MathSource. MathSource is a repository for Mathematica-related material contributed
by Mathematica users and by Wolfram Research, Inc. To find out more about
MathSource, send a single-line email message, "help intro", to the MathSource server at
[email protected]. For information on other ways to access MathSource, send the message
"help access" to [email protected].
If you do not have direct electronic access to MathSource, you can still get some of the
material in it from Wolfram Research. Contact the MathSource Administrator at:
Wolfram Research, Inc., 100 Trade Center Drive, Champaign, Illinois 61820-7237,
USA, (217) 398-0700.

Acknowledgments
First, I would like to correct a grievous omission from my previous book. Dr. Robert
Hecht-Nielsen was my first and only formal instructor of neural networks. I owe to him
much of my appreciation and enthusiasm for the subject. Thanks, Robert! Of course,
many others have contributed to this project, and I would like to mention several of
those individuals.
In particular, I would like to thank Dr. John Engvall, of Loral Space Information
Systems, who has been a continuing source of support and encouragement. Don Brady,
M. Alan Meghrigian, and Jason Deines, whom I met through the CompuServe AI
Expert and Wolfram Research forums, reviewed portions of the manuscript and
software in its early stages. I thank them greatly for their efforts and comments.
I initially wrote the manuscript as a series of Mathematica notebooks, which had to be
translated to TeX. I want to thank Dr. Cameron Smith for providing software and many
hours of consultation to help with this task.
I wish to express my appreciation to Alan Wylde, previously of Addison-Wesley
Publishing Company, for his support early on in the project. I also want to thank Peter
Gordon, Helen Goldstein, Mona Zeftel, Lauri Petrycki, and Patsy DuMoulin, all of
Addison-Wesley, for their support, assistance, and exhortations throughout the
preparation of this manuscript.
Finally, I dedicate this book to my family: Peggy, Geoffrey, and Deborah, without
whose support and patience this project would never have been possible.

J.A.F.
Houston, TX
Contents

Preface

1 Introduction to Neural Networks and Mathematica 1
  1.1 The Neural-Network Paradigm 2
  1.2 Neural-Network Fundamentals 7

2 Training by Error Minimization 39
  2.1 Adaline and the Adaptive Linear Combiner 40
  2.2 The LMS Learning Rule 42
  2.3 Error Minimization in Multilayer Networks 63

3 Backpropagation and Its Variants 67
  3.1 The Generalized Delta Rule 68
  3.2 BPN Examples 74
  3.3 BPN Variations 87
  3.4 The Functional Link Network 103

4 Probability and Neural Networks 115
  4.1 The Discrete Hopfield Network 116
  4.2 Stochastic Methods for Neural Networks 124
  4.3 Bayesian Pattern Classification 135
  4.4 The Probabilistic Neural Network 144

5 Optimization and Constraint Satisfaction 153
  5.1 The Traveling Salesperson Problem (TSP) 154
  5.2 Neural Networks and the TSP 156

6 Feedback and Recurrent Networks 177
  6.1 The BAM 178
  6.2 Recognition of Time Sequences 185

7 Adaptive Resonance Theory 209
  7.1 ART1 211
  7.2 ART2 243

8 Genetic Algorithms 259
  8.1 GA Basics 260
  8.2 A Basic Genetic Algorithm (BGA) 266
  8.3 A GA for Training Neural Networks 281

Appendix A Code Listings 295

Bibliography 335

Index 337
Chapter 1

Introduction to Neural Networks and Mathematica

More times than I care to remember, I have had to verify, using a calculator, the
computations performed by a neural network that I had programmed in C. To say that
the process was laborious will bring a chuckle to anyone who has had the same
experience. Then came Mathematica. For a while, Mathematica became my super
calculator, reducing the amount of time spent doing such calculations by at least an
order of magnitude. Before long I was using Mathematica as a prototyping tool for
experimenting with neural-network architectures. The answer to the question, "Why
use Mathematica to build neural networks?" will, I trust, become self-evident as you see
the tool used throughout this book.
Those of you for whom neural networks are a new technology may be asking a different
question, one that requires answering before the issue of using Mathematica ever arises:
"Why neural networks?" I would like to spend a little time discussing the answer to that
question in the first section of this chapter. In Section 1.2, we will begin to use
Mathematica to perform some basic neural-network calculations and to do some simple
analyses. The techniques and conventions that I introduce in that section will form the
basis of the work we do in subsequent chapters.

1.1 The Neural-Network Paradigm

Ask someone who is doing research or applications development in neural networks
why neural networks are worth pursuing. The ensuing discussion will probably include
comments about how well humans perform tasks such as vision and language, which
are difficult to program a computer to perform. The short form of the argument is that
humans do some things well (e.g., vision and language), and computers do some things
well (e.g., numerical integration), and never the twain shall meet.
The fact is that computers do little useful work that a human cannot also do; computers
just do it significantly faster. After all, a human must develop the algorithm and write
the program that enables a computer to perform its function. In order to perform this
development, humans must think in a serial fashion because most computers execute
instructions one after the other.

Figure 1.1 We can recognize all of these letters as variations of the letter "A." Writing a
computer program to recognize all of these, and all other possible variations, is a
formidable task.

1.1.1 Neural Networks vs. Traditional Algorithms for Pattern Classification
Take a look at the letters in Figure 1.1. Think about how you might construct a
computer program that can take as its input a picture of a letter, and that produces an
output indicating the identity of the letter.
One approach to writing the program is to store all of the possible patterns in a
database, and to do a bit-by-bit comparison of the letter in question with all of the
stored templates. If the pattern matches one of the templates to within a certain degree
of error, we can conclude that the letter is in fact an "A."
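As a toy illustration of the template approach (the function name mismatch and the tiny
3 x 3 "templates" below are my own, not code from the text), we could count
mismatched pixels against each stored template and accept the closest match:

(* Hypothetical bit-by-bit template matcher on 3 x 3 binary patterns. *)
templateA = {0,1,0, 1,1,1, 1,0,1};   (* a crude "A" *)
templateT = {1,1,1, 0,1,0, 0,1,0};   (* a crude "T" *)
mismatch[pattern_, template_] := Apply[Plus, Abs[pattern - template]]
pattern = {0,1,0, 1,1,1, 1,0,1};
{mismatch[pattern, templateA], mismatch[pattern, templateT]}

{0, 7}

A mismatch count of zero, or one below some error tolerance, identifies the letter; the
discussion that follows explains why this strategy scales so poorly.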
This approach is not satisfactory, however, for a number of reasons.
First, the number of templates that you would have to store would be
large, especially if you wanted to recognize all letters, both upper and
lower case, and all numerals. Try to imagine all of the possible variations.
There could be variations in size and in angle of rotation, as well as in the
style of the letter. If you added the complication of wanting to recognize
handwritten characters, the number of templates could be astronomical.
The situation may not be hopeless, however, since we could add a bit of intelligence to
our computer program. For example, we could study all of the ways that the letter "A"
is formed, and attempt to discern commonalities — features, as they are often called —
that distinguish one letter from another. Then we could program an intelligent system
to search systematically for those features and match them against the known list for
the various letters. While this approach might work for some letters, variations in
writing style and typography would still necessitate a huge information base and would
likely not account for all possibilities.

Figure 1.2 In a program designed to identify what appears in the picture, each picture
element, or pixel, becomes a single entity in data memory. The central processing unit
(CPU) reads instructions sequentially and operates on individual data elements, storing
the results, which can themselves be operated on by the CPU. In order for the CPU to
correctly classify the image, we must specify exactly the correct sequence of operations
that the CPU must perform on the data.
The problem with these approaches and with others that have been tried in the past is
that they depend on our ability to systematically pick apart the picture of the letter on a
pixel-by-pixel, or feature-by-feature, basis. We must look for specific information in
the picture and apply rules or algorithms sequentially to the data, hoping that we are
smart enough to be able to write down in sufficient detail what it is exactly that makes
an "A" an "A" and not some other letter. The solution to this problem has proved
elusive. Figure 1.2 shows a simple schematic of how data processing of this type takes
place in a sequential computing environment.

It is curious that, although we can identify letters and objects visually with great speed
and accuracy, we have trouble writing down an algorithm that will allow a computer
program to accomplish the same task. Perhaps we are not just stupid; instead, it may be
that we have trouble translating our ability into a sequential program because we do not
perform the task sequentially ourselves.
Look again at one of the letters in Figure 1.1. Think about the process
you went through to identify the letter. Did you scan the letter from left
to right and top to bottom? Did you assemble information about the
angle of lines and the intersection between line segments? Did you pick
apart the letter's features and logically conclude that, on the basis of
the information, the letter must be an "A"? Probably not, at least not
consciously, and certainly not in a long sequence of individual steps that
treat each feature or pixel sequentially: The brain's processing elements
(neurons) are much too slow to account for the almost instantaneous
recognition. What then accounts for our remarkable ability to recognize
almost any handwritten or printed letter without the slightest effort? The
answer, I believe, lies in the massive parallelism of the brain, and the way
in which the brain processes data.
When you look at one of the letters in Figure 1.1, information from every point reaches
your retina simultaneously. That information is transformed into electrical impulses
that travel along nerves to locations deep within your brain. There, the information
from all of the pixels is processed. Features are extracted from various parts of the
image, and information from one part of the image is combined with information from
other parts, allowing the brain to rapidly form a hypothesis about the identity of the
letter. It is likely that some of this processing is serial in nature: Simple features are
extracted first, then these features are combined to identify more complex features, and
so on. The main point, however, is that much of this work is done in parallel, with the
brain operating on the whole image at once, sharing information among neurons about
features in all parts of the image. This massive parallelism gives the brain its ability to
process data so quickly and efficiently; the existence of that same parallelism frustrates
our attempts to render the process in a sequential algorithm.
A neural network, as we use the term here, is a computational paradigm inspired by the
parallelism of the brain. Using this paradigm we can begin to build computers that
mimic those functions that humans do so easily, but for which sequential algorithms
either do not exist or are so hard to discover that they are impractical. As a simple
example, Figure 1.3 shows how we might translate the picture-recognition task from
Figure 1.2 into one that uses the neural-network paradigm.

Figure 1.3 This figure shows a simple example of a neural network that is used to
identify the image presented to it. Imagine that the input layer is the retina. The
neurons on this layer respond simultaneously to data from various parts of the image.
In hidden-layer 1, data from all parts of the retina are combined at individual neurons.
The output layer generates a hypothesis as to the identity of the input image, in this case
a dog. In this particular network, each neuron on the output layer corresponds to a
particular hypothesis concerning the identity of the pattern on the input layer. While
there is some serial processing, much of the work is done in parallel by neurons on each
layer.
The example of Figure 1.3 illustrates some general characteristics of the neural-network
models that we shall study in this text, although not all models share all of the same
characteristics. Individual neurons, which we shall call processing elements (PEs),
nodes, or units, are represented by the circles. Arrows show the direction of data flow
from one unit to the next. Notice that a unit can have numerous inputs and can send
data to numerous other units. Although not explicitly indicated in the figure, each unit
has only a single output value, which can be sent to many other units. Also typical of
most neural-network models is the layered structure apparent in Figure 1.3.

Figure 1.4 In this two-layer network, units are connected with bidirectional connections,
illustrated as lines without arrows. Also notice that each unit has a feedback connection to
itself. In this network the distinction between input and output layer is ambiguous, since
either layer can act in either capacity, depending on the particular application.

1.1.2 Example Neural-Network Architectures

So that you do not get the idea that all neural networks look alike, Figures 1.4, 1.5, and
1.6 show examples of possible neural-network architectures. Even these examples,
however, do not exhaust the possibilities.

1.2 Neural-Network Fundamentals

Figure 1.5 This network architecture has only a single layer of units. The output of
each unit becomes an input to other units.

Figure 1.6 This network architecture is an example of a complex, multilayered
structure. For the sake of clarity, all of the individual connections between units are not
shown explicitly. For example, in the layer labeled "F2 Layer," all of the units are
connected to each other by inhibitory connections. These connections are implied by
the nature of the processing that takes place on this layer, but the connections
themselves are not shown.

Having now spent some time discussing the neural-network paradigm and its value, I
would like to move on directly into some of the preliminary technical information
without stopping to elaborate further on the earlier question of why Mathematica is a
useful tool in this environment. It is better that you become convinced as you see
Mathematica put to use to build and experiment with neural networks.
The neural-network paradigm has, at its core, the notion of individual units. These
units share some basic characteristics with real, biological neurons, although to say that
they are simulations of real neurons would overstate the reality.
Like their biological counterparts, individual units generally have many inputs but only
one output. Moreover, this output can be fanned out to many other units, again in
analogy to actual neurons. In a neural network, real numbers suffice to represent the
strength of electrical signals sent from one neuron to another. Synapses, the junctions
between neurons, are called connections in neural-network models. In real neural
networks, signals pass across synapses from one neuron to another. During this
passage, the efficiency with which the synapse transfers the signal differs from synapse
to synapse. In our neural-network models, this difference manifests itself as a
multiplicative factor that modifies the incoming signal. This factor is called a weight or
connection strength. Each connection has its associated weight value. As you will see
shortly, these weight values contain the information that lets the network successfully
process its data. Building a neural network to perform a specific task often depends on
the development of a set of weights that encodes the appropriate data-processing
algorithm. Fortunately, many neural-network models can learn the required weight
values, so that we do not have to specify them in advance.

1.2.1 The General Neural-Network Processing Element

Figure 1.7 shows a schematic of a typical processing element in a neural network. This
unit is somewhat more general than we will need for most networks, but it is intended
to cover the majority of cases.
Generally speaking, output values from units will be positive numbers. Weights can be
either positive or negative. This situation leads us to categorize the inputs to a unit
according to their effect. Inputs whose connections have a positive weight contribute a
net positive value to the overall excitation of the unit. Those inputs whose connections
have negative weights detract from the overall excitation. We refer to the former type
as excitatory connections, and the latter as inhibitory connections. Other types of
connections are possible, as we shall see in later chapters.
We shall refer to the overall excitation of a unit as its net-input value, or simply, the net
input. We usually calculate that value by summing the products of the input values and
the weights on the associated connections. For the ith unit, the net input is:

net_i = \sum_{j=1}^{n} w_{ij} x_j    (1.1)

where n is the number of units having connections to the ith unit, x_j is the output of
the jth unit, and w_ij is the weight on the connection from the jth unit to the ith unit.
Let's use Mathematica to perform this calculation for a single unit. First, we must set
up vectors for the weights and inputs. For this example, we shall assume that the inputs
are all binary numbers.

Figure 1.7 This figure shows a representation of a general processing element of a neural
network; in this case the ith unit.

xj = Table[Random[Integer,{0,1}],{10}]

{1, 0, 0, 1, 1, 1, 1, 1, 1, 0}

For weights, we shall assume random values between -0.5 and 0.5.

wij = Table[Random[Real,{-0.5,0.5}],{10}]

{-0.435611, 0.294576, -0.36385, 0.0753376, 0.472324, -0.437153,
 -0.440387, -0.394219, -0.033852, -0.311905}

Finding the net-input value is simply a matter of multiplying the vectors using the dot
product. Remember that the vectors must be of the same length.

neti = xj . wij

-1.19356

In some neural-network models, units transform their net-input value into an activation
value as an intermediate step before producing an output value. Many architectures
skip this intermediate step and proceed directly to the generation of an output value.
We shall ignore the complication of activation values for the moment, taking them up
again in a later chapter. For now we shall just be interested in the output value.
We can express the output value of a unit in the form of a differential equation. Like
their biological counterparts, the outputs of our units are dynamic functions of time.
The simplest form of the equation for the outputs in which we shall be interested is the
following:

\dot{x}_i = -x_i + f(net_i)    (1.2)

where f(net_i) is a function that we shall refer to as the output function.
The function f(net_i) can take many different forms. For now we shall look at a simple
case where f(net_i) is the identity function; that is, f(net_i) = net_i. Then we can write
Eq. (1.2) as

\dot{x}_i = -x_i + net_i    (1.3)


We can use Mathematica to integrate this differential equation in order to study the
behavior of the variable x_i as a function of time. To do this integration, we
approximate the derivative as

\dot{x}_i \approx \Delta x_i / \Delta t

and rewrite Eq. (1.3) as

\Delta x_i \approx \Delta t (-x_i + net_i)

and finally

x_i(t+1) = x_i(t) + \Delta t (-x_i(t) + net_i)    (1.4)

The loop in Listing 1.1 iterates Eq. (1.4) and appends each value to a list, along with
the timestep index, in such a way that we can easily plot the results when the
calculation is finished. In the If statement, we anticipated the fact that the value of x_i
would asymptotically approach the value of net_i. You can see this fact easily by
setting the derivative in Eq. (1.3) equal to zero to find the equilibrium value for x_i:

x_i = net_i

Let's plot the results of our calculation.



xlist = {};                           (* define list to hold results *)
done = False;                         (* initialize flag *)
deltaT = 0.01;                        (* pick value for timestep size *)
xi = 0.0;                             (* initialize variables *)
neti = 2.0;                           (* pick value for net input *)
For[i=1, done==False, i++,            (* until flag is true *)
  xi = xi + deltaT (-xi + neti);      (* update xi *)
  AppendTo[xlist, {i deltaT, xi}];    (* append to list *)
  If[Abs[xi-neti] < 0.005, done=True, Continue];  (* stop ? *)
]                                     (* end of For *)

Listing 1.1

xiList = ListPlot[xlist];

(plot: xi rises from 0 toward its asymptote at neti = 2)

To see how close we came to the expected equilibrium value, look at the
last element in xlist.

Last[xlist]

{5.97, 1.99504}
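As an aside, Mathematica can also perform this numerical integration for us with its
built-in NDSolve function. The following one-liner is my own sketch, not code from
the text, using the same net-input value of 2.0:

sol = NDSolve[{x'[t] == -x[t] + 2.0, x[0] == 0}, x, {t, 0, 6}];
x[5.97] /. First[sol]    (* approximately 1.995, in agreement with Last[xlist] *)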

Both the time step, deltaT, and the value of the stopping criterion in the
If statement affect the integration. Since Eq. (1.2) is easy to integrate
in closed form, we can compare the results of the numerical integration
with the actual solution.
Let's let Mathematica do the integration for us. We first clear the value
of neti so that Mathematica will include this parameter symbolically in

the result. We have to include the initial condition, xi[0]==0, in order to prevent
Mathematica from adding an unknown constant of integration.

Clear[neti]
DSolve[{xi'[t]==-xi[t]+neti, xi[0]==0}, xi[t], t]

{{xi[t] -> (-neti + E^t neti)/E^t}}

Now let's assign the result to the variable xi

xi = xi[t] /. First[%]

(-neti + E^t neti)/E^t

and put the result into a more familiar form

xi = Simplify[xi]

neti - neti/E^t

We would most likely choose a slightly different form for the solution, such as
Eq. (1.5):

x_i(t) = net_i (1 - e^{-t})    (1.5)

Next, we can assign a value to neti and do the plot.

neti=2;
xiPlot = Plot[xi,{t,0,10}];

Mathematica's choice of axes makes comparison of the two plots a little difficult. We
can get an idea of the accuracy of the numerical integration by showing both graphs
plotted on the same set of axes.

Show[{xiList,xiPlot}];

On my computer the numerical integration of Eq. (1.2) required a fair number of
seconds to compute. We can generally assume that the units always have enough time
to reach their equilibrium state; thus, we can forego the numerical integration step
altogether for many of our experiments.

1.2.2 Output Function

So far, we have not considered any form of the output function other than the identity
function. Let's look at some other forms for the output function.
For reference, let's plot the identity function:

out1 = Plot[neti,{neti,-5,5}];

We can change the slope and position of this graph by including some constants in the
output function. For example, consider the function

f(net_i) = c \, net_i + d

where c and d are constants. If we pick some numbers for c and d, we can plot the
resulting function.

c=2;
d=-0.5;
out2 = Plot[c neti + d, {neti,-5,5}];

Plot them together to see the difference.

Show[{out1,out2}];

A particularly useful output function is one called a sigmoid. One version of a sigmoid
function is defined by the equation

f(net_i) = \frac{1}{1 + e^{-net_i}}    (1.6)

To see the general shape, plot the function. First, however, let's define a function for
the sigmoid, since we will be using it often.

sigmoid[x_] := 1/(1+E^(-x))

Now the plot:

sig1 = Plot[sigmoid[x],{x,-10,10}];

Notice that the sigmoid function is limited in range to values between zero and one.
Also notice that if the net input to a unit having a sigmoid output is less than about
negative five or greater than about positive five, the output of the unit will be
approximately zero or one, respectively. Thus, a unit with a sigmoid output saturates at
such net-input values and cannot distinguish between, for example, a net input of 8
or 10.
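A quick numerical check of this saturation, using the sigmoid function defined above:

{sigmoid[8.], sigmoid[10.]}

{0.999665, 0.999955}

The two outputs differ only in the fourth decimal place, even though the net inputs
differ by 2.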

We can change the shape and location of the sigmoid curve by including parameters in
the defining equation. Consider this example:

r = 2;
s = 2.0;
sig2 = Plot[1/(1+E^(-r neti)) + s, {neti,-10,10}];

By adjusting the value of r we can change the slope of the sigmoid. Consider the
following three plots, where we have set s=0:

r = 20.0;
sig3 = Plot[1/(1+E^(-r neti)), {neti,-10,10}];

r = 0.5;
sig4 = Plot[1/(1+E^(-r neti)), {neti,-10,10}];

r = 0.1;
sig5 = Plot[1/(1+E^(-r neti)), {neti,-10,10}];


Let's plot them together with the original sigmoid graph:

Show[{sig1,sig3,sig4,sig5}];

Plot sig3 is almost a threshold function. A threshold function is one defined by an
equation such as the following:

f(x) = \begin{cases} 1 & x > \theta \\ 0 & \text{otherwise} \end{cases}    (1.7)

where \theta is the threshold value. The larger the value of r in the sigmoid equation,
the closer the function will approximate a threshold function.
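Expressed directly in Mathematica, Eq. (1.7) might be written as follows; the function
name threshold and the sample inputs are mine, not the book's:

threshold[x_, theta_] := If[x > theta, 1, 0]

Map[threshold[#, 0]&, {-1.0, 0.0, 2.0}]

{0, 0, 1}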
Plot sig5 is essentially a linear function over the domain of interest.
The parameter s can shift the position of the graph along the ordinate.
For example,

sig6 = Plot[sigmoid[x]-0.5,{x,-10,10}]

The above plot shows the sigmoid shifted so that it passes through the
origin. The limiting values are ±0.5. One of the important features of
the sigmoid function is that it is differentiable everywhere. We will see
the importance of this characteristic in Chapter 3.
If the output function of a unit is a sigmoid function, then the following relationship
holds:

f'(net_i) = f(net_i)(1 - f(net_i))

If we define o_i = f(net_i), then we can write

f'(net_i) = o_i (1 - o_i)    (1.8)

Equation (1.8) will be useful during the discussion of the backpropagation learning
method in Chapter 3.
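As a quick check (my addition, using the sigmoid function defined earlier), we can ask
Mathematica to verify the identity in Eq. (1.8); the difference should simplify to zero:

Simplify[D[sigmoid[x], x] - sigmoid[x] (1 - sigmoid[x])]

0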

1.2.3 Layers and Weight Matrices

If you look back at Figures 1.3 through 1.6, you will notice the characteristic layered
structure of the networks. This structure is general: You can always decompose a
neural network into layers. It may be true that some layer has only a single unit, or that
some units are members of more than one layer, but, nevertheless, layers are the rule.

Figure 1.8 This figure shows two layers of a neural network. The bottom layer sends
inputs to the units in the top layer. The layer is fully interconnected; that is, all units on
the bottom layer send connections to all units on the upper layer.
We shall impose a constraint on our layers that requires all units on a
layer to have the same number of incoming connections. This constraint
would seem to be in keeping with the networks in Figures 1.3 through
1.6. On the other hand, not all neural networks are structured so that all
units that form the logical grouping of a layer have the same number of
input connections. As an example, a network may have units that are
connected randomly to units in previous layers, making the number of
connections on each unit potentially different from that on other units.
Nevertheless, we can force any such network to appear as though all units
on a layer have the same number of connections by adding connections
with zero weights. Any data flowing across these connections would
have no effect on the net input to the unit, so it is as if the connection
did not exist. The reason that we go to this trouble is for convenience in
calculation.
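For example (the numbers here are made up purely for illustration), an input arriving
over a zero-weight connection drops out of the net-input sum entirely:

{1, 2, 3} . {0.5, 0, -0.2}    (* the second input is effectively disconnected *)

-0.1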
Consider the small network shown in Figure 1.8. Each unit in the
top layer receives four input values from the units in the previous layer.
Each unit on the bottom layer sends the identical value — its output
value — to all units in the upper layer. Each unit in the upper layer has
its own weight vector with one weight for each input connection. The
easiest way to deal with these individual weight vectors is to combine
them into a weight matrix. Let's define such a weight matrix for the
network in Figure 1.8, using random weight values.

w = Table[Table[Random[Real,{-0.5,0.5},3],{4}],{8}]

{{0.364, -0.273, -0.453, -0.425},
 {-0.159, -0.489, 0.379, -0.471},
 {-0.196, -0.297, -0.0228, 0.148},
 {-0.411, -0.385, -0.0487, -0.354},
 {-0.0832, 0.177, 0.00528, 0.194},
 {-0.362, -0.129, -0.0755, -0.208},
 {-0.0372, 0.388, 0.31, -0.442},
 {-0.11, 0.0686, -0.278, -0.431}}

We can access the weight vector for each individual unit by indexing into the matrix at
the proper row. For example, the weight vector on the third unit in Figure 1.8 is
given by

w[[3]]

{-0.196, -0.297, -0.0228, 0.148}

Similarly, the weight from the second unit on the first layer to the third
unit on the upper layer is given by

w[[3,2]]

-0.297

or alternatively

w[[3]][[2]]

-0.297
We will sometimes find it advantageous to view the weight matrix in a more familiar
form. We can do the transformation with the MatrixForm command:

MatrixForm[w]

 0.364  -0.273  -0.453  -0.425
-0.159  -0.489   0.379  -0.471
-0.196  -0.297  -0.0228  0.148
-0.411  -0.385  -0.0487 -0.354
-0.0832  0.177   0.00528 0.194
-0.362  -0.129  -0.0755 -0.208
-0.0372  0.388   0.31   -0.442
-0.11    0.0686 -0.278  -0.431

To calculate the net inputs to all of the upper-layer units, all we need
do is to perform a vector-matrix multiplication using the output vector
of the bottom layer, and the weight matrix of the upper layer. First, we
must define the output vector of the bottom layer. Let's keep it simple:

out1 = {1,2,3,4}

{1, 2, 3, 4}

The net-input calculation results in a vector of net-input values for the upper layer.

netin2 = w . out1

{-3.24135, -1.88379, -0.268188, -2.74074, 1.0637, -1.68074,
 -0.0994623, -2.53006}

Note that we could reverse the order of multiplication in the above expression, provided
we first transpose the weight matrix.

netin2a = out1 . Transpose[w]

{-3.24135, -1.88379, -0.268188, -2.74074, 1.0637, -1.68074,
 -0.0994623, -2.53006}

I should point out a fundamental difference between vectors as they appear in
Mathematica and vectors as they might appear in another text. In Mathematica, a
vector is the same as a list; that is, the vector appears as though it were a row vector.

VectorQ[{1,2,3,4}]

True

A list, however, is actually a vector in column form.

MatrixForm[{1,2,3,4}]

1
2
3
4

This convention is in keeping with other texts that assume vectors are column vectors.
For example, in Neural Networks, I would write the vector out1 as (1,2,3,4)^t, where
the t superscript represents the transpose operation. In fact, if you attempt to transpose
the out1 vector, you will get an error. Try it.
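If you do try it, you should see Mathematica issue a complaint and return the expression
unevaluated; the exact message text depends on your version:

Transpose[{1,2,3,4}]    (* fails: a flat list has no second level to exchange *)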
Let's return to the issue of randomly-connected networks. With the
above convention, connections that do not exist should have zeros at
those locations in the weight matrix. The following matrix is an example.

0.364 0.000 -0.450 -0.425


-0.159 -0.489 0.000 -0.471
0.000 -0.297 -0.022 0.148
0.000 0.000 0.000 -0.354
-0.083 0.177 0.005 0.194
-0.362 -0.129 0.000 0.000
0.000 0.388 0.310 0.000
-0.110 0.000 -0.278 -0.431

In some high-level programming languages, such as Pascal, it might


be more advantageous to use a linked-list approach to layers that are not
fully connected, especially when the connection matrix is sparse. Here,
however, we shall stay with the weight-matrix approach. We may suffer
a bit in performance due to the number of additional multiplications that
do not contribute anything to the net input, but the ease in notation that
this approach affords is well worth the price. Bear in mind, however, that
when translating to a high-level language where performance is more of
an issue, there may be benefits to using the linked-list approach when
the connection matrix is sparse.

1.2.4 Hinton Diagrams


By now, it should come as no surprise to you that the weight values
play a crucial role in neural networks. It is the weights that encode the
processing algorithm that allows a network to transform its inputs into
some meaningful output value. In the next section we shall begin to
look at how we can go about determining the proper set of weights to
encode a particular processing function. In this section, we shall examine
an interesting way of displaying weight values that often will allow us
to infer certain facts about how the weights accomplish their task. That
method of display is called a Hinton diagram, after Geoffrey Hinton,

who first used them.

Figure 1.9 (a) This figure shows a 5 by 7 array of inputs to a neural network. Each
array location, or pixel, may be either on (black) or off (white). (b) The binary
representation of the image on the input array gives the 35 numbers used as inputs to
the neural network:

(0,0,0,0,1)
(0,0,0,0,1)
(0,0,0,0,1)
(1,1,1,1,1)
(1,0,0,0,1)
(1,0,0,0,1)
(1,1,1,1,1)


Assume that we have constructed a neural network with 35 input units, arranged in a
rectangular array of 5 inputs by 7 inputs. Furthermore, assume that the input values can
take on only the values of zero or one. Figure 1.9 shows an example of such an input
array along with one possible input pattern.
We shall assume that the 35 input values of the array in Figure 1.9 are sent to two units
of a neural network. Let's examine hypothetical weight vectors on those two units. For
the first unit, assume the weight vector is

w1 = {0.8,0.3,0.5,0.1,0.1,0.6,-0.5,-0.7,-0.5,0.6,
      0.6,-0.5,-0.7,-0.3,0.5,0.4,0.3,0.6,0.5,0.4,
      0.5,-0.5,-0.6,-0.1,0.3,0.4,-0.4,-0.5,-0.2,0.4,
      0.4,0.5,0.3,0.7,0.2};

Notice that I have used a semicolon after the above expression in order to suppress the
output from Mathematica, which I shall do when I do not feel that showing the output
adds anything to the discussion.
Assume the second unit has the weight vector

w2 = {-0.4,-0.7,-0.5,-0.6,0.8,-0.5,-0.4,-0.7,-0.6,0.8,
      -0.4,-0.3,-0.7,-0.4,0.6,0.4,0.5,0.4,0.7,0.4,
      0.7,-0.5,-0.6,-0.1,0.4,0.5,-0.4,-0.5,-0.2,0.6,
      0.4,0.7,0.7,0.6,0.4};

The input vector is



in = {0,0,0,0,1, 0,0,0,0,1, 0,0,0,0,1, 1,1,1,1,1,
      1,0,0,0,1, 1,0,0,0,1, 1,1,1,1,1};

As a point of reference, let's find the net-input value for each of these two units. For
unit 1:

net1 = in . w1

7.1

and for the second:

net2 = in . w2

9.6

The second unit has a somewhat larger net input; that is, unit number 2 is excited more
strongly by the input pattern than unit 1. Just looking at the weight vectors, however,
would not easily lead you to that conclusion until you actually calculated the results.
Let's rewrite the weight vectors as rectangular weight matrices having a row-column
structure the same as that of the input pattern. Weight vector 1 becomes

w1 = Partition[w1,5]

{{0.8, 0.3, 0.5, 0.1, 0.1}, {0.6, -0.5, -0.7, -0.5, 0.6},
{0.6, -0.5, -0.7, -0.3, 0.5}, {0.4, 0.3, 0.6, 0.5, 0.4},
{0.5, -0.5, -0.6, -0.1, 0.3}, {0.4, -0.4, -0.5, -0.2, 0.4},
{0.4, 0.5, 0.3, 0.7, 0.2}}

In matrix form:

MatrixForm[w1]

0.8  0.3  0.5  0.1  0.1
0.6 -0.5 -0.7 -0.5  0.6
0.6 -0.5 -0.7 -0.3  0.5
0.4  0.3  0.6  0.5  0.4
0.5 -0.5 -0.6 -0.1  0.3
0.4 -0.4 -0.5 -0.2  0.4
0.4  0.5  0.3  0.7  0.2

Similarly, weight vector two yields:

w2 = Partition[w2,5]

{{-0.4, -0.7, -0.5, -0.6, 0.8}, {-0.5, -0.4, -0.7, -0.6, 0.8},
{-0.4, -0.3, -0.7, -0.4, 0.6}, {0.4, 0.5, 0.4, 0.7, 0.4},
{0.7, -0.5, -0.6, -0.1, 0.4}, {0.5, -0.4, -0.5, -0.2, 0.6},
{0.4, 0.7, 0.7, 0.6, 0.4}}

MatrixForm[w2]

-0.4 -0.7 -0.5 -0.6  0.8
-0.5 -0.4 -0.7 -0.6  0.8
-0.4 -0.3 -0.7 -0.4  0.6
 0.4  0.5  0.4  0.7  0.4
 0.7 -0.5 -0.6 -0.1  0.4
 0.5 -0.4 -0.5 -0.2  0.6
 0.4  0.7  0.7  0.6  0.4

The matrix forms of w1 and w2 are still not so transparent. Let's apply the command
ListDensityPlot to both.

ListDensityPlot[Reverse[w1]];

ListDensityPlot[Reverse[w2]];

Each square in either of these two plots corresponds to one weight value, and each row
corresponds to the weight vector on a single unit. If you stand back from these
graphics, you should notice that the first one contains an image of an upper-case "B,"
and the second contains an image of a lower-case "d." In these pictures, the larger the
weight, the lighter the shade of the square.
Notice that the second weight matrix has large values in the same
relative locations in which the input vector has a 1. This correspondence
leads to a larger dot product than in the case where the large weight
values and large input values do not match. This analysis assumes that
the weight vectors are normalized in some fashion, so that there is no
false match resulting from an unusually large weight-vector length.
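One such normalization, added here as my own illustration rather than code from the
text, is to divide each dot product by the lengths of the vectors involved, giving the
cosine of the angle between the input and each weight vector. Since w1 and w2 have
been partitioned into matrices above, we Flatten them first:

cos1 = (in . Flatten[w1]) / (Sqrt[in . in] Sqrt[Flatten[w1] . Flatten[w1]]);
cos2 = (in . Flatten[w2]) / (Sqrt[in . in] Sqrt[Flatten[w2] . Flatten[w2]]);

With this measure the second unit still wins, but now the comparison cannot be skewed
by an unusually long weight vector.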
Notice in the plots of the weight matrices that we had to Reverse the
matrices before plotting. If we had not done this reversal, the image of
the letters would have appeared upside down in the plot.
It is certainly not always the case that the weight matrix mimics one of the possible
input vectors. When this does happen, we say that the corresponding unit has encoded
the pattern, which itself is often called an exemplar. Nevertheless, it is often true that
we can glean some insight from looking at the weights in this manner, even though
their meaning may not be so obvious.

Figure 1.10 This figure illustrates a simple network with two input units and a single
output unit. The table on the right shows the AND function. We wish to find weights
such that, given any two binary inputs, the network correctly computes the AND
function of those two inputs.

x1  x2  O
 0   0  0
 0   1  0
 1   0  0
 1   1  1

1.2.5 Learning in Neural Networks

Learning in neural networks involves finding a set of weights such that the network
performs correctly whatever data-processing function we intend. There are many
different ways to determine a proper set of weights, and there is often more than one set
of weights that will encode a particular function. In much of the remainder of this
book, we investigate various methods of determining weights. For the moment,
however, let's look at a couple of examples.

Some Simple Examples For very simple cases, we might be able to arrive at weights by
a trial-and-error procedure. Let's consider constructing a single-unit system that
computes the AND function of two binary inputs (see Figure 1.10).
The unit in Figure 1.10 has two weights and a threshold output function, as in
Eq. (1.7). We can rewrite the threshold condition in the following way: The network
will have an output of one if

\sum_{i=1}^{n} w_i x_i - \theta > 0    (1.9)

and otherwise, the output will be zero; n refers to the number of inputs.
If we replace the inequality in Eq. (1.9) with an equality, the equation becomes the
equation of a line in the x1-x2 plane. If we position that line properly, we can
determine the weights that will allow the network to solve the AND problem. Look at
the following plot:

Show[{Graphics[{PointSize[0.03], {Point[{0,0}], Point[{0,1}],
      Point[{1,0}], Point[{1,1}]}}],
    Graphics[Line[{{0,1.2},{2.4,0}}], Axes->Automatic]}];

The line is the plot of the equation x2 = -0.5 x1 + 1.2. Let \theta = 1.2, w1 = 0.5, and
w2 = 1.0. Rewrite the equation in the form of Eq. (1.9): 0.5 x1 + 1.0 x2 = 1.2. If
x1 = x2 = 1, the left side of the equation is equal to 1.5, which is greater than 1.2, thus
giving the correct output. For all other cases, we get an answer less than 1.2, giving
zero as the result. So, by an astute placement of a line, we have determined a set of
weights that solves the AND problem. There are an infinite number of other lines that
also yield weights that solve this problem.
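We can quickly verify this weight choice in Mathematica; the function name andUnit is
my own:

andUnit[x_] := If[{0.5, 1.0} . x - 1.2 > 0, 1, 0]

Map[andUnit, {{0,0}, {0,1}, {1,0}, {1,1}}]

{0, 0, 0, 1}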
The line in the preceding problem is an example of a decision surface (even though it is
a line, we refer to it as a surface for the sake of generality). Notice how the line breaks
up the space into two regions: one where the points in the region, when used as inputs,
would satisfy the threshold condition, and one where the points in the region would not
satisfy the threshold condition.

x1  x2  O
 0   0  0
 0   1  1
 1   0  1
 1   1  0

Figure 1.11 This figure shows the four input points for the XOR problem, a line
representing a decision surface, and the XOR truth table. Note that there is no way to
position the line so that it separates the points (0,0) and (1,1) from the points (0,1)
and (1,0).

The device in Figure 1.10 is often referred to as a perceptron. Perceptrons were an
early development in neural networks, dating from the late 1950s. They were invented
by a psychologist named Frank Rosenblatt, who actually referred to collections of the
above devices, rather than the individual unit, as perceptrons. Rosenblatt favored
random connectivity among layers of these devices in his models of perception and
vision. Unfortunately, the individual unit has a serious flaw that limits its use. Because
of this flaw, early optimism in the perceptron soon gave way to extreme pessimism, and
the field of neural networks languished for many years.
We can explain the limitation of the perceptron with the following simple example.
Consider once again a single unit, as in Figure 1.10. This time, however, we wish to
solve the XOR problem. Figure 1.11 shows the truth table and the space of input points
for this problem.
There is no orientation of the decision surface (line) that will correctly separate the
points having an output of zero from the points having an output of one. The linear
decision surface is a characteristic of the perceptron unit. We say that the perceptron
unit can only separate categories, or classes, if they are naturally linearly separable.
This characteristic is considered by many to be a serious weakness, since many real
problems, of which the XOR is a very simple example, do not have classes that are
linearly separable.
Although this problem appears formidable, there are actually one or two easy ways of
overcoming it. One way is to construct a third input to the network using the other two
inputs. For example, we could multiply the two inputs together and use the result as a
third input. Such a system appears in Figure 1.12.

x1  x2  x1*x2  O
 0   0    0    0
 0   1    0    1
 1   0    0    1
 1   1    1    0

Figure 1.12 This figure shows a three-input unit where the third input is the product of
the first two. The truth table still computes the XOR of the first two inputs; the network
uses the third input to help distinguish between the two classes.
Let's plot the set of three-dimensional points from the truth table in
Figure 1.12.

xor3d=Show[Graphics3D[{PointSize[0.02],{Point[{0,0,0}],
  Point[{0,1,0}],Point[{1,0,0}],Point[{1,1,1}]}},
  ViewPoint->{8.474, -2.607, 1.547}]];

Notice how the point (1,1,1) is elevated above the x1-x2 plane. We can
now separate the two classes of input points by constructing a plane to
divide the space into the two proper regions.

Show[{xor3d,Graphics3D[{GrayLevel[0.3],Polygon[{{0.7,0,0},
  {0,0.7,0},{1,1,0.8}}]}]}];

As with the AND example, there are an infinite number of decision surfaces
that will correctly solve this modified XOR problem. Increasing
the dimension of the input space forms the basis for a particular type
of network architecture, called the functional-link network, that we will
study in Chapter 3.
We can construct a second solution to the XOR problem by using a
network of the type shown in Figure 1.13. Notice in this case that we
have constructed a hidden layer of units between the input and output
units. It is this hidden layer that facilitates a solution. Each hidden-
layer unit produces a decision surface, as shown in the figure. The first
hidden-layer unit (the one on the left) will produce an output of one if
either or both inputs are one. The hidden-layer unit on the right will
produce an output of one only if both inputs are one.
The output unit will produce a one only if the output of the first
hidden unit is one AND the output of the second hidden unit is zero; in
other words, if only one, but not both, of the inputs are one.
To verify the correct operation of this network, let's calculate the output
value for each of the possible input vectors. Although the calculation
is simple enough to perform by hand (or in your head), we shall write
Mathematica code (Listing 1.2) for practice. To keep the routine simple,
we shall perform the calculation for one input vector at a time. Here is
the result for one input vector:

Figure 1.13 The figure on the left shows a network made up of three layers: an input layer,
a hidden layer comprising two units, and a single-unit output layer. All units are threshold
units with the value of θ as the threshold in each case. The two hidden-layer units construct
the two decision surfaces shown in the graph on the right. The output unit performs the
logical function: (hidden unit 1) AND (NOT (hidden unit 2)).

Net inputs to hidden layer = {2, 2}
Outputs of hidden layer = {1, 1}
Input vector= {1, 1} Output= 0

By changing the value of the input vector, you can verify that the network
computes the XOR function correctly.
This example also serves to establish some notational conventions
that we shall use throughout the text. The input vector will always be
inputs, and the output vector will always be outputs. Other quantities will
generally have compound names, such as hidWt, outNetin, etc. In each
case, the first part of the name will refer to the layer, as in hid for hidden
layer, and out for output layer. The second part of the name will refer
to the particular quantity, as in Netin for net input value, and Threshold
for threshold value. Moreover, the second part of any name will be
capitalized for readability. If a third part of any name is necessary, the
conventions will be the same as for the second part. I will abbreviate
some names when I feel there will be no uncertainty in the intended

inputs = {1,1}               (* input vector, change for each input *)
hidWt = {{1,1},{1,1}}        (* hidden layer weight matrix *)
outWt = {0.6,-0.2}           (* output layer weight matrix *)
hidThreshold = {0.4,1.2}     (* thresholds on hidden units *)
outThreshold = 0.5           (* threshold on the output unit *)
hidNetin = hidWt . inputs    (* net inputs to hidden layer *)
Print["Net inputs to hidden layer = ",hidNetin]
(* good idea to print intermediate values for debug *)
hidOut = Module[{i},
  Table[If[hidNetin[[i]] > hidThreshold[[i]], 1, 0],
    {i,Length[hidNetin]}] ]  (* apply threshold *)
Print["Outputs of hidden layer = ",hidOut]
outNetin = hidOut . outWt    (* net input of output unit *)
outputs = If[outNetin > outThreshold, 1, 0]  (* apply threshold *)
Print["Input vector= ",inputs," Output= ",outputs]

Listing 1.2

meaning.
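Rather than editing inputs in Listing 1.2 by hand for each case, you can
wrap the same calculation in a function and map it over all four input
vectors; this convenience wrapper is my own addition, built on the
quantities defined in the listing:

xorNet[in_] := Module[{h},
  h = Table[If[(hidWt . in)[[i]] > hidThreshold[[i]], 1, 0],
        {i, Length[hidThreshold]}];    (* hidden-layer outputs *)
  If[h . outWt > outThreshold, 1, 0]]  (* output-layer threshold *)
Map[xorNet, {{0,0},{0,1},{1,0},{1,1}}]

{0, 1, 1, 0}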
Let's turn our attention now to another type of learning that has its
basis in an early theory of how brains actually learn. The theory was
first described by a psychologist, Donald Hebb, and the learning method
bears his name: Hebbian learning.

Hebbian Learning    First we must digress briefly to discuss a few
facts concerning neurobiology. In the introduction to this section, we
mentioned the concept of synapses, and the fact that electrical impulses
are transferred from one neuron to another across these junctions. The
two connecting neurons do not actually come into physical contact at the
synapse; instead, there is a region between the cells called the synaptic
cleft. When an electrical impulse, traveling down the axon of a neuron,
reaches the area of the presynaptic membrane, it causes the cell to
release certain chemicals into the synaptic cleft. We refer collectively to
these chemicals as neurotransmitters. These neurotransmitters diffuse
across the synaptic cleft and bond with receptor sites on the postsynaptic
membrane of the receiving cell. These chemicals cause changes in
the permeability of the postsynaptic membrane to certain ionic species,
resulting ultimately in changes in the electrical polarization of the fluid

Figure 1.14 In this figure, which represents a schematic of the classical conditioning process,
each lettered circle corresponds to a neuron cell body. The long lines from the cell bodies
are the axons that terminate at the synaptic junctions. S_BA and S_BC correspond to the two
synaptic junctions. Although we have represented this process in terms of a few simple
neurons, we intend this schematic to convey the concept of classical conditioning, not its
actual implementation in real nerve tissue.

in the receiving neuron. If the change in polarization is sufficient, the
receiving neuron may itself become excited and send electrical impulses
down its axon toward other neurons.
The type and strength of the effect that the presynaptic cell has on the
postsynaptic cell depends on the identity and amount of neurotransmitter
released and absorbed at the synapse. Hebb theorized that learning
consisted of the modification of the strength of the effect that one cell had
on the other. In particular, he felt that if a presynaptic cell and a postsynaptic
cell were simultaneously on — that is, they were both transmitting
pulses down their respective axons — then the strength of the synaptic
connection between the two cells would be increased. Hebb put it this
way in his 1949 book The Organization of Behavior:

When an axon of cell A is near enough to excite a cell
B and repeatedly or persistently takes part in firing it, some
growth process or metabolic change takes place in one or both
cells such that A's efficiency as one of the cells firing B is increased.

This theory can be used directly to explain the behavior known as classical
conditioning, or Pavlovian conditioning. Refer to Figure 1.14.
Suppose that the sight of food is sufficient to excite cell C, which
in turn, excites cell B and causes salivation. Suppose also, that in the
absence of the sight of food, sound input from a ringing bell is insufficient

to excite cell B, as we might expect. Now let's apply simultaneous sight
and sound stimulation and analyze the result in accordance with Hebb's
theory.
The strength of the synapse S_BC is such that sight alone excites cell
B; but notice that cell A is also being excited due to the sound input.
Thus, cells A and B are on simultaneously, and, according to Hebb, the
strength of the synapse S_BA will increase. If we repeat this experiment
often enough, the strength of S_BA may increase to the point where the
excitation from cell A is sufficient by itself to excite cell B and cause
salivation, even in the absence of excitation from cell C.
There is an additional convention embedded in this example that I
should explain. Notice that the synaptic strengths have symbols such
as S_BC. In all cases the subscript has a "to-from" connotation. In other
words, S_BC is the strength of the connection to cell B from cell C. Another
example is w_ij, which refers to the weight on the connection to the ith
unit from the jth unit.
Using this notation, we can describe the mathematical formulation of
Hebbian learning. If x_i and x_j are the outputs of the ith and jth units
respectively, then we can express Hebbian learning with the differential
equation

$$\frac{dw_{ij}}{dt} = -a\,w_{ij} + \eta\,x_i x_j \qquad (1.10)$$

where η is a proportionality constant, usually less than one, and a is also
a constant less than one.
There is actually more in Eq. (1.10) than Hebbian learning as expressed
in the quotation cited above. Consider, for example, what happens
when either or both of the units have an output of zero. In that
case, the weight will decrease (we assume that η, x_i, and x_j are all
nonnegative quantities). This situation would correspond to forgetting rather
than learning. Moreover, the appearance of the -a w_ij term limits the
magnitude of the resulting weight. Let's use Mathematica to solve Eq.
(1.10) for the weight as a function of time.

DSolve[{wij'[t] == -a wij[t] + eta xi xj, wij[0]==0}, wij[t], t]

{{wij[t] -> 0. + ((-1. eta + 1. E^(a t) eta) xi xj)/(E^(a t) a)}}

Plot the result assuming typical values for the various parameters. The
value asymptotically approaches η/a; in this case, 1.6.

wij = wij[t] /. First[%]

eta = 0.8
a = 0.5
xi = xj = 1.0
Plot[wij,{t,0,10}];
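Using the parameter values just assigned, a discrete-time version of Eq.
(1.10), updated once per unit time step, approaches the same asymptote;
this sketch is my own addition:

NestList[(# + (-a # + eta xi xj))&, 0.0, 10]  (* Euler step of size 1 *)

{0., 0.8, 1.2, 1.4, 1.5, 1.55, 1.575, 1.5875, 1.59375, 1.59688, 1.59844}

The weight again converges toward eta xi xj/a = 1.6.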

Many variations of the basic Hebbian learning law exist, and we shall
encounter a few of them in other places in this book. Moreover, we have
not yet addressed the issue of how a network, made up of multiple units,
learns to perform a particular function. This topic will consume most of
the remainder of this book.

Summary
In this chapter, we began our study of neural networks using Mathematica
by considering the rationale behind the neural-network approach to data
processing. We looked at the fundamentals of processing for individual
units including the net input and output calculations. We also introduced
the concept of learning and decision surfaces in neural networks. Along
the way, we introduced many of the Mathematica functions and methods
that we will use in the remaining chapters of this book. Both the neural
network and the Mathematica concepts covered in this chapter will serve
as the basis for the material in the chapters that follow.
Chapter 2

Training by Error
Minimization

In the previous chapter, we discussed the fact that we train neural networks
to perform their task, rather than programming them. In this
chapter and the next, we shall explore a very powerful learning method
that has its roots in a search technique called hill climbing.
Suppose you are standing on a hillside on a day that is so foggy, you
can see only a few feet in front of your face. You want to get to the
highest peak as quickly as possible, but you have no reference points,
map, or compass to assist you on your journey. How do you proceed?
One logical way to proceed would be to look around and determine
the direction of travel that has the steepest, upward slope and to begin
walking in that direction. As you walk, you change your direction so
that, at any given time, you are walking in the direction with the steep¬
est upward slope. Eventually you arrive at a location from which all
directions of travel lead downward. You then conclude that you have
reached your goal of the top of the hill. Without instrumentation (and
assuming the fog does not lift) you cannot be absolutely sure that you
are at the highest peak, or instead, at some intermediate peak. You could
mark your passage at this location and begin an exhaustive search of the
surrounding landscape to determine if there are any other peaks that are
higher, or you could satisfy yourself that this peak is good enough.
The method of training that we shall examine in this chapter is based
on a technique similar to hill-climbing, but in the opposite sense; that
is, we will seek the lowest valley rather than the highest peak. In this
chapter we look at the training of a system comprising a single unit. In
Chapter 3 we extend this method to cover the case of multiple units, and
multiple layers of interconnected units.

2.1 Adaline and the Adaptive Linear Combiner

The Adaline comprises two major parts, as illustrated in Figure 2.1: an
adaptive linear combiner (ALC), a unit almost identical in structure to
the general processing element described in Chapter 1, and a bipolar
output function, which determines its output based on the sign of the
net-input value of the ALC. Adaline is an acronym for ADAptive LINear
Element, or ADAptive Linear NEuron, depending on how you feel about
calling these units neurons.
Notice the addition of a connection with weight w_0, which we refer
to as the bias term. This term is a weight on a connection that has its


Figure 2.1 The complete Adaline consists of the adaptive linear combiner, in the dashed
box, and a bipolar output function. The adaptive linear combiner resembles the general
processing element described in Chapter 1.

input value always equal to one. The inclusion of such a term is largely
a matter of experience.
The net input to the ALC is calculated as usual as the sum of the
products of the inputs and the weights. In the case of the ALC, the
output function is the identity function, so the output is the same as the
net input. If the output is y, then

$$y = \sum_{i=1}^{n} w_i x_i$$

where the w_i are the weights and the x_i are the inputs. If we make the
identification x_0 = 1, then we can write

$$y = \sum_{i=0}^{n} w_i x_i$$

or in terms of the vector dot product

$$y = \mathbf{w} \cdot \mathbf{x} \qquad (2.1)$$

The final output of the Adaline is

$$o = \mathrm{Sign}(y)$$

where the value of the Sign function is +1, 0, or -1, depending on
whether the value of y is positive, zero, or negative. Mathematica contains
a built-in function Sign[] that performs the appropriate calculation. For
example:

Sign[2]

1

Sign[-2]

-1
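Putting the pieces together, the complete Adaline output for given weight
and input vectors takes just one line; the vectors here are arbitrary values
of my own choosing:

w = {0.2, -0.5, 0.1}; x = {1, 0.8, -0.3};  (* bias input first *)
Sign[w . x]

-1

The net input is -0.23, so the bipolar output function reports -1.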
In the remainder of this chapter, we shall be concerned only with the
ALC portion of the Adaline. It is possible to connect many Adalines
together to form a layered neural network. We refer to such a structure
as a Madaline (Many ADALINEs), but we do not consider that network
in this book.

2.2 The LMS Learning Rule

Suppose we have an ALC with four inputs and a bias term. Furthermore,
suppose that we desire that the output of the ALC be the value 2.0 when
the input vector is {1, 0.4, 1.2, 0.5, 1.1}, where the first value is the input to
the bias term. We can represent the weight vector as {w0, w1, w2, w3, w4}.
There is an infinite number of weight vectors that will solve this particular
problem. To find one, simply select values for four of the weights at
random and compute the fifth weight. Let's do an example to illustrate
the use of the Solve function.

o = 2.0 (* desired output value *)
x = {1, 0.4, 1.2, 0.5, 1.1} (* input vector *)
w = Append[Table[Random[],{4}], w5]
     (* weights, with one unknown *)
Solve[o == w.x, w5]

{{w5 -> 0.496487}}

The weight vector is

w = w /. %

{{0.342886, 0.491887, 0.35746, 0.970542, 0.496487}}

Verify the calculation:

w.x

{2.}
Suppose we have a set of input vectors, {x_1, x_2, ..., x_L}, each having its
own, perhaps unique, correct or desired output value, d_k, k = 1, ..., L. The
problem of finding a single weight vector that can successfully associate
each input vector with its desired output value is no longer simple. In
this section we develop a method called the least-mean-square (LMS)
learning rule, or the delta rule, which is one method of finding the desired
weight vector. We refer to this process of finding the weight vector
as training the ALC. Moreover, we call the process a supervised learning
technique, in the sense that there is some external teacher that knows
what the correct response should be for each given input vector. The
learning rule can be embedded in the device itself, which can then self-adapt
as inputs and desired outputs are presented to it. Small adjustments
are made to the weight values as each input-output combination is processed
until the ALC gives correct outputs. In a sense, this procedure is
a true training procedure, because we do not calculate the value of the
weight vector explicitly.

2.2.1 Weight Vector Calculations


Before we develop the LMS rule, we can gain some insight into the procedure
by looking at a method with which we can calculate the weight
vector. To begin, let's restate the problem: Given examples (also called
exemplars), (x_1, d_1), (x_2, d_2), ..., (x_L, d_L), of some processing function that
associates (or maps) input vectors, x_k, with output values, d_k, what is the
best weight vector, w_min, for an ALC that performs this mapping? We
shall assume L > n + 1, where n is the number of inputs and there is
one additional weight for the bias term. This assumption means that
we cannot find the weight vector by solving a system of simultaneous
equations, because such a system is overdetermined.
The answer to the question posed in the previous paragraph depends
on how we define the word best within the context of the problem. Once
we find this best weight vector, we would like the application of each

input vector to result in the precise, corresponding output value. Since
it may not be possible to find a set of weights that allows this mapping
to be performed without error, we would like at least to minimize the
error. Thus, we choose to look for a set of weights that minimizes the
mean-squared error over the entire set of input vectors. If the actual
output value for the kth input vector is y_k, then we define the error as

$$\epsilon_k = d_k - y_k \qquad (2.2)$$

and the mean-squared error is

$$\xi = \langle \epsilon_k^2 \rangle = \frac{1}{L} \sum_{k=1}^{L} \epsilon_k^2 \qquad (2.3)$$

where the angled brackets indicate the mean, or expectation, value.


Substituting Eqs. (2.1) and (2.2) into Eq. (2.3) shows that the mean-squared
error is an explicit function of the weight values:

$$\xi = \langle (d_k - \mathbf{w} \cdot \mathbf{x}_k)^2 \rangle \qquad (2.4)$$

Expanding this equation we find

$$\xi = \langle d_k^2 \rangle + \mathbf{w}^t \langle \mathbf{x}_k \mathbf{x}_k^t \rangle \mathbf{w} - 2 \langle d_k \mathbf{x}_k^t \rangle \mathbf{w} \qquad (2.5)$$

The fact that ξ is a function of the weights means that we should be
able to find weights that minimize ξ. To visualize our approach, let's
plot the function ξ(w) for the case of an ALC with only two inputs and
no bias term. Using the following definitions

$$d = \langle d_k^2 \rangle, \quad \mathbf{R} = \langle \mathbf{x}_k \mathbf{x}_k^t \rangle, \quad \mathbf{p} = \langle d_k \mathbf{x}_k^t \rangle$$

and without specifying the actual input vectors, we can construct the
graph.
graph.

ClearAll[R,p,w1,w2,wt,d];
wt = {w1,w2}
R = {{3,1},{1,4}};
p = {4,5};
d = 10;
wtPlot=Plot3D[d+wt.R.wt-2 p.wt,{w1,-50,50},
  {w2,-50,50}];

To view the graph from a slightly different perspective, use Show.

Show[%, ViewPoint->{8.578, -2.639, 0.671}];


Although it may not be apparent to you from these graphs, the surface
is a paraboloid. The function has a single minimum point. The weights
corresponding to that minimum point are the best weights for this example.
You may find it more instructive to look at a contour plot of the
function.

ContourPlot[d+wt.R.wt-2 p.wt,{w1,-10,10},
  {w2,-10,10}];

We can find the minimum point by taking the derivative of Eq. (2.5)
with respect to the weight vector and setting the result equal to zero.
The result is the weight vector that gives the minimum error:

$$\mathbf{w}_{min} = \mathbf{R}^{-1} \mathbf{p} \qquad (2.6)$$

For our example:

minWt = Inverse[R].p

{1, 1}

and the minimum error is

minError = d+minWt.R.minWt-2 p.minWt

1
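For a concrete data set, R and p are just averages over the exemplars.
This small construction, using invented data and names of my own,
shows how Eq. (2.6) would be applied in practice:

Module[{xs, ds, Rm, pv},
  xs = {{1, 0.4}, {1, 1.2}, {1, -0.5}};   (* input vectors *)
  ds = {2.0, 1.0, 0.5};                   (* desired outputs *)
  Rm = Apply[Plus, Map[Outer[Times,#,#]&, xs]]/Length[xs];
  pv = Apply[Plus, ds xs]/Length[xs];
  Inverse[Rm] . pv ]                      (* Eq. (2.6) *)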

2.2.2 Gradient Descent on the Error Surface

Given the knowledge of R, also called the input correlation matrix, and
p, we saw how it was possible to calculate the weight vector directly. In
many problems of practical interest, we do not know the values of R and
p. In these cases we must find an alternate method for discovering the
minimum point on the error surface.
Consider the situation shown in Figure 2.2. To initiate training, we
assign arbitrary values to the weights, which establishes the error, ξ,

Figure 2.2 This figure illustrates gradient descent down an error surface toward the minimum
weight value.

at a certain value. As we apply each training pattern to the network,
we can adjust the weight vector slightly in the direction of the greatest
downward slope. This procedure is exactly opposite to the hill-climbing
procedure described at the beginning of this chapter.
To perform this gradient descent exactly, we would need to know the
equation of the error surface, in which case we could simply calculate
the minimum weight vector directly, as in the previous section. The
point of this discussion is to help you understand the principles of
gradient descent, so that the idea will not be foreign when, in the next
section, we discuss how to approximate the process in the absence of
complete knowledge of the error surface.

2.2.3 The Delta Rule

Suppose we cannot specify the R matrix or p vector in advance, or suppose
that the number of input vectors is so large as to make the calculations
excessively time consuming. There may also be a case in which the
distribution function of the input vectors changes as a function of time.

We can still take advantage of the gradient-descent method by employing
a local approximation to the error surface that is valid for a particular
input vector.
First, apply a particular input pattern, say the kth, and note the output,
y_k. Then determine the error ε_k. Instead of applying other patterns and
accumulating the squared error, we use this error value directly. As an
approximation to the mean-squared error in Eq. (2.3), we can use the
local value of the squared error for a particular pattern. That is:

$$\xi \approx \epsilon_k^2 \qquad (2.7)$$

Since ε_k² is a function of the weights, we can compute the gradient:

$$\epsilon_k^2 = \left( d_k - \sum_{i=1}^{n} w_i (x_i)_k \right)^2$$

$$\frac{\partial \epsilon_k^2}{\partial w_i} = -2 \left( d_k - \sum_{i=1}^{n} w_i (x_i)_k \right) (x_i)_k = -2\, \epsilon_k (x_i)_k$$

We then adjust the weight value, in this case w_i, by a small amount
in the direction opposite to the gradient. In other words, we update the
weight value according to the following prescription:

$$w_i(t+1) = w_i(t) + \eta\, \epsilon_k (x_i)_k \qquad (2.8)$$

or in vector form:

$$\mathbf{w}(t+1) = \mathbf{w}(t) + \eta\, \epsilon_k \mathbf{x}_k \qquad (2.9)$$

where η is called the learning rate parameter and usually has a value
much less than one.
Equations (2.8) and (2.9) are expressions of a learning law called the
LMS rule, or delta rule. By repeated application of this rule using all of
the input vectors, the point on the error surface moves down the slope
toward the minimum point, though it does not necessarily follow the
exact gradient of the surface. As the weight vector moves toward the
minimum point, the error values will decrease. You must keep iterating
until the errors have been reduced to an acceptable value, the definition
of acceptable being determined by the requirements of the application.
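A single application of Eq. (2.9) is just one Mathematica assignment; all
the numbers below are invented, for illustration only:

Module[{eta = 0.1, w = {0.5, -0.3}, x = {1.0, 0.2}, d = 1.0, err},
  err = d - w . x;    (* the error, 0.56 in this case *)
  w + eta err x ]     (* the updated weight vector *)

{0.556, -0.2888}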

Figure 2.3 This diagram shows the ALC in the transversal filter configuration. In this case,
there are two inputs. At each iteration the first input is shifted down to become the second
input. The kth input is the sine function shown. The desired output value is twice the
cosine of the argument of the kth input. We have also added a random noise factor to
the inputs. We show the weights as variable resistors to indicate that they will change as
training proceeds.

2.2.4 A Delta-Rule Example

Let's consider how to apply this learning rule by trying a specific example.
We shall use an example from the text Adaptive Signal Processing, by
Bernard Widrow and Samuel D. Stearns.¹ The ALC is a two-input unit
arranged in a configuration known as a transversal filter. In this configuration,
one input is simply a time-delayed copy of the other input.
Figure 2.3 shows the ALC for this case.
At the kth timestep, the input value is given by

$$x_k = \sin\!\left( \frac{\pi k}{8} \right)$$

¹ Widrow, Bernard and Samuel D. Stearns. Adaptive Signal Processing. Prentice Hall: Englewood Cliffs,
NJ, 1985.

and the desired output value is

$$d_k = 2 \cos\!\left( \frac{\pi k}{8} \right)$$

To each input value we shall add a random noise factor with a random
signal power φ = 0.01, where

$$\phi = \langle r_k^2 \rangle$$
We can look at the input function by creating a table of points and
then plotting those points.

Table[{k,Sin[Pi k/8.] //N},{k,0,24}]

{{0, 0.}, {1, 0.382683}, {2, 0.707107}, {3, 0.92388}, {4, 1.},
 {5, 0.92388}, {6, 0.707107}, {7, 0.382683}, {8, 3.79471 10^-19},
 {9, -0.382683}, {10, -0.707107}, {11, -0.92388}, {12, -1.},
 {13, -0.92388}, {14, -0.707107}, {15, -0.382683},
 {16, -7.58942 10^-19}, {17, 0.382683}, {18, 0.707107}, {19, 0.92388},
 {20, 1.}, {21, 0.92388}, {22, 0.707107}, {23, 0.382683},
 {24, 1.35525 10^-18}}

ListPlot[%];

Adding a random value to the function disturbs the plot somewhat. Here
is a rendering of the function with the random value, plotted with the
points joined with lines.
2.2. The LMS Learning Rule 51

inPlot = ListPlot[Table[{k,Sin[Pi k/8.]+Random[Real,
  {0, 0.175}] //N},{k,0,24}], PlotJoined->True];

The desired output looks like this:

ClearAll[k]
outputs = Table[{k,2 Cos[Pi k/8]//N},{k,0,24}]

{{0, 2.}, {1, 1.84776}, {2, 1.41421}, {3, 0.765367}, {4, 0.},
{5, -0.765367}, {6, -1.41421}, {7, -1.84776}, {8, -2.},
{9, -1.84776}, {10, -1.41421}, {11, -0.765367}, {12, 0.},
{13, 0.765367}, {14, 1.41421}, {15, 1.84776}, {16, 2.},
{17, 1.84776}, {18, 1.41421}, {19, 0.765367}, {20, 0.},
{21, -0.765367}, {22, -1.41421}, {23, -1.84776}, {24, -2.}}

outPlot = ListPlot[outputs, PlotJoined->True];

Shown together, the inputs and outputs look like this:

Show[{inPlot,outPlot}];

SeedRandom[4729]
wts = Table[Random[],{2}]             (* initialize weights *)
inputs = {0, Random[Real,{0, 0.175}]} (* initialize input vector *)
eta = 0.2                             (* learning rate parameter *)
k=1
errorList=Table[
  inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0, 0.175}];
  outDesired = N[2 Cos[Pi k/8]];      (* desired output *)
  outputs = wts.inputs;               (* actual output *)
  outError = outDesired-outputs;      (* error *)
  wts += eta outError inputs;         (* update weights *)
  inputs[[2]]=inputs[[1]];            (* shift input values *)
  k++;                                (* increment counter *)
  outError,{250}
  ]                                   (* end of Table *)
Print["Final weight vector = ",wts]
ListPlot[errorList,PlotJoined->True]  (* plot the errors *)

Listing 2.1

Listing 2.1 shows the Mathematica code for the ALC. The output giving
the final weight vector and plot of the errors is as follows:

Final weight vector = {3.81539, -4.34484}



Note that by setting the input vector equal to {0, Random} initially, all that
we need do to prepare the first valid input vector is replace inputs[[1]]
with Sin[Pi/8], which is accomplished the first time through the loop.
After the weight updates, the value in inputs[[1]] is shifted forward to
inputs[[2]], and inputs[[1]] is recalculated at the beginning of the loop.
The actual optimum weight vector for this problem is {3.784, -4.178}.
You can see from the plot of the error values that, initially, the errors
appear quite sinusoidal in character. As the ALC learns, the error is
reduced to its random component.
We can make a slight modification to our code, as shown in Listing
2.2, and look at how the weight vector moves as a function of iteration
step. The output from the code in Listing 2.2 is as follows:

Final weight vector = {3.81539, -4.34484}

The following plot shows the weight vector as it converges on the known
optimum weight value, depicted as a large dot. If we continued iterating
the weight values, they would bounce around the optimum point. We
could get closer by decreasing the learning rate parameter, at the cost

SeedRandom[4729]
wts = Table[Random[],{2}]             (* initialize weights *)
inputs = {0, Random[Real,{0, 0.175}]} (* initialize input vector *)
eta = 0.2                             (* learning rate parameter *)
k=1
wtList=Table[
  inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0, 0.175}];
  outDesired = N[2 Cos[Pi k/8]];      (* desired output *)
  outputs = wts.inputs;               (* actual output *)
  outError = outDesired-outputs;      (* error *)
  wts += eta outError inputs;         (* update weights *)
  inputs[[2]]=inputs[[1]];            (* shift input values *)
  k++;                                (* increment counter *)
  wts,{250}                           (* add wts value to table *)
  ]                                   (* end of Table *)
Print["Final weight vector = ",wts]
ListPlot[wtList,PlotJoined->True];    (* plot the weights *)

Listing 2.2

of more iterations. For our purposes here, the current parameters are
sufficient to illustrate the concepts involved with this type of training.

Show[{wtPlot,Graphics[{PointSize[0.03],
  Point[{3.784,-4.178}]}]}];

Widrow and Stearns have calculated the exact equation for the error
surface for this example:

ClearAll[xi,wt1,wt2];
xi[wt1_,wt2_] := 0.51 (wt1^2+wt2^2) +
  wt1 wt2 Cos[N[Pi/8]] + 2 wt2 Sin[N[Pi/8]] + 2

(Be careful where you break an expression for a function definition. If
you break the above expression before one of the + signs, Mathematica
will think the function definition ends there.)
The error surface looks like this:
The error surface looks like this:

ClearAll[w1,w2]
errorPlot = Plot3D[xi[w1,w2],{w1,-2,8},{w2,-10,0},
  ViewPoint->{-1.048, -2.529, 1.989},
  Shading->False];

We can superimpose the contour plot with the plot of the movement of
the weight vector as the ALC learns. The crosshair indicates the position
of the actual optimum weight value.

ContourPlot[xi[w1,w2],{w1,-2,8},{w2,-10,0},
  ContourLevels->20, Epilog->{Line[wtList],
  Line[{{2,-4.178},{6,-4.178}}],
  Line[{{3.784,-3},{3.784,-5}}]}];

If you wish to experiment with different parameters, I suggest you construct
a function based on the code. The example in Listing 2.3 allows you
to set the learning-rate parameter and the number of iterations through
arguments passed in the function call. The number of iterations defaults
to 250. Some examples of the use of the alcTest function follow:

alcTest[0.1,10]

Starting weights = {0.232585, 0.222531}


Learning rate = 0.1
Number of iterations = 10
Final weight vector = {0.0926947, -0.330191}

alcTest[learnRate_,numIters_:250] :=
  Module[{eta=learnRate,wts,k,inputs,wtList,outDesired,outputs,outError},
    wts = Table[Random[],{2}];         (* initialize weights *)
    Print["Starting weights = ",wts];
    Print["Learning rate = ",eta];
    Print["Number of iterations = ",numIters];
    inputs = {0,Random[Real,{0, 0.175}]}; (* initialize input vector *)
    k=1;
    wtList=Table[
      inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0, 0.175}];
      outDesired = N[2 Cos[Pi k/8]];   (* desired output *)
      outputs = wts.inputs;            (* actual output *)
      outError = outDesired-outputs;   (* error *)
      wts += eta outError inputs;      (* update weights *)
      inputs[[2]]=inputs[[1]];         (* shift input values *)
      k++; wts,{numIters}];            (* end of Table *)
    Print["Final weight vector = ",wts];
    wtPlot=ListPlot[wtList,PlotJoined->True] (* plot the weights *)
    ]                                  (* end of Module *)

Listing 2.3

alcTest[0.1,200]

Starting weights = {0.232585, 0.222531}


Learning rate = 0.1
Number of iterations = 200
Final weight vector = {2.38218, -2.9718}

Figure 2.4 This diagram shows the ALC in its standard configuration. For the XOR example,
there are two inputs.

alcTest[0.3]

Starting weights = {0.232585, 0.222531}


Learning rate = 0.3
Number of iterations = 250
Final weight vector = {4.22935, -4.81559}

2.2.5 The XOR Problem and the ALC

As a second example, let's attempt to solve the XOR problem using a
two-input ALC. (See Section 1.2.5 for a discussion of this problem.) We
shall configure the ALC in its standard form, rather than as a transversal
filter, as Figure 2.4 illustrates.
To run this example we shall construct a single function, alcXor, appearing
in Listing 2.4. We will need to define a list called the ioPairs

alcXor[learnRate_,numInputs_,ioPairs_,numIters_:250] :=
  Module[{wts,eta=learnRate,errorList,inputs,outDesired,outError,outputs},
    SeedRandom[6460];                  (* seed random number gen. *)
    wts = Table[Random[],{numInputs}]; (* initialize weights *)
    errorList=Table[                   (* select ioPair at random *)
      {inputs, outDesired} = ioPairs[[Random[Integer,{1,4}]]];
      outputs = wts.inputs;            (* actual output *)
      outError = First[outDesired-outputs]; (* error *)
      wts += eta outError inputs;      (* update weights *)
      outError,{numIters}];            (* end of Table *)
    ListPlot[errorList,PlotJoined->True];
    Return[wts];
    ];                                 (* end of Module *)

Listing 2.4

list, or ioPairs vector, outside of the function. This list should contain the
inputs and desired output values for the problem. The function alcXor
takes the learning rate and the ioPairs vector as arguments, as well as
the number of inputs and the number of iterations, which once again
defaults to 250. The function returns the weight vector for use later. For
two inputs, the ioPairs vector is

ioPairsXor2 = {{{0,0},{0}},{{0,1},{1}},{{1,0},{1}},{{1,1},{0}}};

Let's execute the function with this ioPairs vector. The output will be a
plot of the error value as a function of iteration.

wtsXor = alcXor[0.2,2,ioPairsXor2];

As you might expect, the error value shows no tendency to decrease.
You can convince yourself further by trying more iterations, or varying
the parameters, but the results will be the same.
Notice a couple of things about the code. We have introduced a new
convention with the symbol ioPairs, and variation of that name used to
identify specific examples. The format for the ioPairs table is:
ioPairs = {{{input vector 1}, {output vector 1}}, ... }

In other words, ioPairs is a list of pairs of lists. Each pair of lists
comprises a list of input values (the input vector) and a corresponding list
of desired output values (the output vector). We shall use this convention
throughout the text.
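Given this structure, the pieces of an exemplar separate cleanly with
ordinary list operations; these one-liners are my own illustration:

{inputs, outDesired} = ioPairsXor2[[2]] (* parallel assignment, as in alcXor *)
Map[First, ioPairsXor2]                 (* all four input vectors *)

{{0, 1}, {1}}
{{0, 0}, {0, 1}, {1, 0}, {1, 1}}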
Let's add a third input to the ALC as we suggested in Section 1.2.5.
The modification to the code is easy. All we need do is add the third
input value to the ioPairs table, and initialize three weights instead of
two.

ioPairsXor3 =
  {{{0,0,0},{0}},{{0,1,0},{1}},{{1,0,0},{1}},
   {{1,1,1},{0}}};

Once again, let's execute the function; this time with the new ioPairs
vector.

wtsXor = alcXor[0.2,3,ioPairsXor3];

Notice in this case that the error decreases as the iteration proceeds. To
see how close we are to an acceptable solution, we can write a simple
function — testXor in Listing 2.5 — to run through the ioPairs and determine
the error of the ALC for each input. Executing this function shows

ClearAll[testXor]
testXor[ioPairs_,weights_] :=
  Module[{errors,inputs,outDesired,outputs,mse},
    inputs = Map[First,ioPairs];     (* extract inputs *)
    outDesired = Map[Last,ioPairs];  (* extract desired outputs *)
    outputs = inputs . weights;      (* calculate actual outputs *)
    errors = outDesired-outputs;
    mse =
      Flatten[errors] . Flatten[errors]/Length[ioPairs];
    Print["Inputs = ",inputs];
    Print["Outputs = ",outputs];
    Print["Errors = ",errors];
    Print["Mean squared error = ",mse]
    ]
Listing 2.5

the explicit results, rather than just the errors.


testXor[ioPairsXor3,wtsXor]

Inputs = {{0, 0, 0}, {0, 1, 0}, {1, 0, 0}, {1, 1, 1}}


Outputs = {0, 0.956687, 0.972699, 0.0177852}
Errors = {{0}, {0.0433128}, {0.0273013}, {-0.0177852}}
Mean squared error = 0.000734417

In the examples that we have done so far, we have performed weight updates
for a certain number of iterations. At the beginning of the section,
we derived the delta rule based on a minimization of the mean-squared
error, or rather its approximation, ε_k². Let's modify the code from the
previous example to include a test of the mean-squared error, and a conditional
termination based on its value.
We shall also write a function that calculates the mean squared error.
The ALC code will call this function every four iterations. Even though
we choose the input vector randomly, after a few iterations, the ALC
should be learning all four patterns about equally. The code for the
mean-squared error calculation appears in Listing 2.6. The code for the
complete simulation appears in Listing 2.7. Finally, we execute the code,
using 0.01 as the error tolerance value. The function returns the list of
final weight values, which appear following the error plot.

calcMse[ioPairs_,wtVec_] :=
  Module[{errors,inputs,outDesired,outputs},
    inputs = Map[First,ioPairs];     (* extract inputs *)
    outDesired = Map[Last,ioPairs];  (* extract desired outputs *)
    outputs = inputs . wtVec;        (* calculate actual outputs *)
    errors = Flatten[outDesired-outputs];
    Return[errors.errors/Length[ioPairs]]
    ]
Listing 2.6

wtsXor = alcXorMin[0.2,3,ioPairsXor3,0.01]

{0.843864, 0.920131, -1.68958}

The value of the coordinate on the abscissa of the resulting graph is the
number of cycles rather than the number of iterations. We define a cycle
to be equal to the number of exemplars; in this case four. Even though
we pick inputs at random rather than choosing the four exemplars in
sequence, we still consider a cycle to be four iterations. You could modify
the code to present the four exemplars in sequence: Typically, you would
use nested loops. As long as you have enough cycles so that all exemplars
are presented approximately the same number of times, random selection
is adequate.
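Here is a minimal sketch of that sequential variant, assuming the same
wts, eta, and ioPairs names used in Listing 2.4; the cycle count of 60 is
arbitrary:

Do[                                  (* one outer pass = one cycle *)
  Do[
    {inputs, outDesired} = ioPairs[[i]];
    outError = First[outDesired - wts.inputs];
    wts += eta outError inputs,      (* delta-rule update *)
    {i, Length[ioPairs]}],
  {60}]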
Once again, we can test the network performance:

testXor[ioPairsXor3,wtsXor]

Inputs = {{0, 0, 0}, {0, 1, 0}, {1, 0, 0}, {1, 1, 1}}



alcXorMin[learnRate_,numInputs_,ioPairs_,maxError_] :=
  Module[{wts,eta=learnRate,errorList,inputs,outDesired,
          meanSqError,done,k,outError,outputs,errorPlot},
    wts = Table[Random[],{numInputs}]; (* initialize weights *)
    meanSqError = 0.0;
    errorList={};
    For[k=1;done=False, !done, k++,    (* until done *)
      (* select ioPair at random *)
      {inputs, outDesired} = ioPairs[[Random[Integer,{1,4}]]];
      outputs = wts.inputs;            (* actual output *)
      outError = First[outDesired-outputs]; (* error *)
      wts += eta outError inputs;      (* update weights *)
      If[Mod[k,4]==0, meanSqError=calcMse[ioPairs,wts];
        AppendTo[errorList, meanSqError]; ];
      If[k>4 && meanSqError<maxError, done=True, Continue]; (* test for done *)
      ];                               (* end of For *)
    errorPlot=ListPlot[errorList,PlotJoined->True];
    Return[wts];
    ]                                  (* end of Module *)

Listing 2.7

Outputs = {0, 0.920131, 0.843864, 0.0744141}


Errors = {{0}, {0.079869}, {0.156136}, {-0.0744141}}
Mean squared error = 0.00907371

Before moving on, let's look at one item in the ALC simulation code.
Notice that we did not use the Table function this time; rather, we used
the For loop, as you might in a normal computer program. The reason for
this switch is that we no longer know in advance how big the final array
will be. When you know this value in advance, the Table construction is
a better choice.
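The contrast is easy to see in miniature; both fragments below build the
same list and are my own illustration:

squares = Table[i^2, {i,5}]        (* size known in advance *)

squares = {};                      (* size not known in advance *)
For[i=1, i<=5, i++,
  AppendTo[squares, i^2]]
squares

{1, 4, 9, 16, 25}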

2.3 Error Minimization in Multilayer Networks

Before we take up the study of the backpropagation network in the next
chapter, I wish to motivate that topic by posing a question: How would
we do a gradient-descent search for optimum weight values in a neural

Figure 2.5 This figure shows a single layer of ALCs. Each ALC receives the identical input
vector, but responds with a different output value.

network with multiple units and multiple layers?


The question of multiple units is not, in and of itself, so difficult to
deal with. Suppose we had the situation as depicted in Figure 2.5.
In this situation, we presume that each of the five ALCs receives
the same input vector and is supposed to produce some output value,
different for each of the five units. Instead of having a single, desired-
output value, we have a desired-output vector having, in this case, five
components. Since we know the desired-output value for each of the
ALCs, we can apply the delta rule individually for each ALC, and adjust
the weights accordingly.
Consider, however, the network shown in Figure 2.6. Presumably the
same logic that applied above still applies to the layer of units called the
output layer: We still know what the desired outputs are — even though
the inputs now come from the hidden layer — and can use the delta rule
directly to update weights on the output-layer units.
The problem arises when we try to determine weight updates on the
hidden-layer units. We do not know what the outputs of these units
should be for any given input vector.
We can resolve this problem by realizing that the actual output values
of the output-layer units do depend on the hidden-layer weights, because
these weights form part of the calculation of the hidden-layer outputs,
which subsequently are used in the calculation of the output-layer outputs.
Therefore, it is possible to determine the local gradient of the error
surface with respect to the hidden-layer weights and to use this value to
update these weights.
We shall not carry through with such a derivation in this chapter because
this type of layered architecture turns out to be not very interesting
when made up only of ALCs. A more useful network results when we
use units with a nonlinear output function, such as the complete Adaline.

Figure 2.6 This figure shows a general, feed-forward neural network.

Unfortunately, the threshold output function is nondifferentiable because
of the discontinuity, and therefore we cannot derive an equivalent of
the delta rule in this case. Other training paradigms are possible for a
network of Adalines, but we shall not consider them further in this text.
There are other nonlinear output functions that are differentiable and
do yield interesting network architectures. We shall look at such a network
in the next chapter.

Summary
In this chapter, we used the single-unit adaptive linear combiner to develop
a powerful learning method called the least-mean-square method.
This method forms the basis of the multi-unit, multi-layer learning algorithm
called backpropagation, which we shall introduce in the next
chapter. We also saw that we can view the learning process geometrically
as finding the minimum point on a surface that represents the error
plotted as a function of the weight values. In the case of the ALC, the error
surface is always a concave-upward hyperparaboloid. In multi-unit,
multi-layer networks, we can still think of the learning process as finding
the minimum value on a surface, but the topology of that surface is not
usually as simple as the one for the ALC.
Chapter 3

Backpropagation and Its


Variants

In the previous chapter we used a learning paradigm, known as the delta
rule, to calculate an approximation to the optimum weight vector that
would allow an ALC to correctly map input vectors to output values in
accordance with certain examples used during the training process. In
the last section of Chapter 2, we described how we could extend the delta
rule to cover the case of multiple ALCs on a single layer. Extending the
rule to multiple-layer networks requires that we add a nonlinear output
function to the units, and this fact complicates the situation. Moreover,
since we have no foreknowledge of the correct output values for units on
any layer other than the output layer, we have to resort to other methods
to determine the weight updates.
In this chapter we shall study a method for calculating weight updates
that is commonly known as the method of backpropagation of
errors, or more simply, backpropagation. By using this method, we can
train a multilayered network to perform a great many processing functions.
Because it is so powerful, the backpropagation network (BPN)
has become an industry standard.
First we describe the architecture and develop the training algorithm.
As a part of that development, we shall see that the BPN is quite expensive
computationally, especially during the training process. Many
people have attempted, therefore, to modify the basic backpropagation
algorithm to speed training. We examine a few of these methods, not
only for their relative value for neural-network applications, but also as
examples of the ease with which we can use Mathematica to experiment
with variations on a theme. Finally, we shall describe a backpropagation-like
network called the functional-link network which, in some cases,
can eliminate the need for a hidden layer (as with the XOR problem
discussed in Chapters 1 and 2).

3.1 The Generalized Delta Rule

In this section we extend the delta rule to multi-layered networks. Before
we perform the derivation of the generalized delta rule (GDR), let's
review the architecture of multi-layered neural networks and point out
the features particular to the BPN.

Figure 3.1 This diagram shows a typical structure for a BPN. Although there is only one
hidden layer in this figure, you can have more than one. The superscripts on the various
quantities identify the layer. The p subscript refers to the pth input pattern.

3.1.1 BPN Architecture


The standard BPN architecture appears in Figure 3.1. The bias units
shown in that figure are optional. Bias units always have an output of
one and they are connected to all units on their respective layer. The
weights on the connections from bias units are called bias terms or bias
weights.
Units on all layers calculate their net-input values in accordance with
the standard sum-of-products calculation described in Chapter 1. For the
hidden-layer units:

$$\text{net}_{pj}^{h} = \sum_{i=1}^{N} w_{ji}^{h} x_{pi} + \theta_{j}^{h} \qquad (3.1)$$

and for the output-layer units

$$\text{net}_{pk}^{o} = \sum_{j=1}^{L} w_{kj}^{o}\, i_{pj} + \theta_{k}^{o} \qquad (3.2)$$

where i_pj is the input from the jth hidden-layer unit to the output-layer
units for the pth input pattern, and the θs are the bias values. N and L
refer to the number of units on the input and hidden layers respectively.
Unlike the ALC discussed in Chapter 2, the output function of these
units is not necessarily the simple identity function, although it can be in
the case of the output units. Most often, the output function will be the
sigmoid function described in Chapter 1. Then the outputs of the units
are

$$i_{pj} = f_j^h(\text{net}_{pj}^h) \qquad (3.3)$$

for units on the hidden layer, and

$$o_{pk} = f_k^o(\text{net}_{pk}^o) \qquad (3.4)$$

for units on the output layer.


We can use the identity function on the output-layer units, in which
case we have

$$o_{pk} = \text{net}_{pk}^o$$

If we were to use the identity function on the hidden-layer units,
then the network would not be able to perform many of the complex
input-output mappings that it could otherwise.
When we propagate data through the network from inputs to outputs,
we can streamline the calculation by putting all of the weight values for
a single layer into a weight matrix. Each row of the matrix represents
the weights on a single unit on the layer. There would then be L rows,
where L is the number of units on the layer. If there are N inputs, there
would be N or N + 1 columns, the latter figure including a place for the
bias weight.
Let's look at a sample calculation for a single layer with four units,
each having five inputs. To calculate the net inputs for all of the units
on a layer, we can multiply the weight matrix by the input vector. First,
define a five-element input vector with random components.

inputs = Table[Random[],{5}]

{0.775091, 0.324416, 0.724447, 0.596067, 0.662368}



Since there are five inputs, the weight matrix should have five columns
(assuming no bias term). If there are four units on the layer, one possible
(but highly unlikely) weight matrix may appear as follows:

wts = {{1,0,0,0,0},{0,1,0,0,0},{0,0,1,0,0},{0,0,0,1,0}};

Or in matrix form:

MatrixForm[%]

1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0

The net inputs are

netIn = wts . inputs

{0.775091, 0.324416, 0.724447, 0.596067}

We shall once again require the sigmoid function.

sigmoid[x_] := 1/(1+E^(-x))

The output values are

outputs = sigmoid[netIn]

{0.684621, 0.5804, 0.673586, 0.644756}

Notice that we can supply the sigmoid function with a list of values as
an argument, rather than just a single value. Mathematica automatically
applies the function to each element in the list. Now that we have described
the basic feed-forward propagation in the BPN, let's move on to
a derivation of the GDR.

3.1.2 Derivation of the GDR


As we did with the ALC, let's begin by stating the problem in slightly
more formal terms. Suppose we have a set of P vector-pairs (exemplars),
(x_1, y_1), (x_2, y_2), ..., (x_P, y_P), that are examples of a functional mapping

$$\mathbf{y} = \Phi(\mathbf{x}), \quad \mathbf{x} \in \mathbf{R}^N, \ \mathbf{y} \in \mathbf{R}^M$$

where x and y are N- and M-dimensional real vectors respectively. We
wish to train a neural network (i.e., find a set of weights) to learn an
approximation to that functional mapping. To develop the training algorithm
we use the same approach that we used for the ALC in Chapter 2:
gradient descent down an error surface.
The error that we choose to minimize by our training algorithm is

$$E_p = \frac{1}{2} \sum_{k=1}^{M} \delta_{pk}^2 \qquad (3.5)$$

where

$$\delta_{pk} = (y_{pk} - o_{pk}) \qquad (3.6)$$
The subscript p refers to the pth exemplar, o_pk is the output of the kth
output-layer unit for the pth exemplar, and there are M output-layer
units.
Equation (3.5) represents a local approximation to the global error
surface

$$E = \sum_{p=1}^{P} E_p$$

Using the local approximation simplifies the calculation here, as it did in
Chapter 2.
Substituting Eq. (3.6) into Eq. (3.5), and using Eq. (3.4), we find

$$E_p = \frac{1}{2} \sum_{k=1}^{M} \left( y_{pk} - f_k^o(\text{net}_{pk}^o) \right)^2$$

The gradient of E_p with respect to the output-layer weights is

$$\frac{\partial E_p}{\partial w_{kj}^o} = -(y_{pk} - o_{pk}) \frac{\partial f_k^o(\text{net}_{pk}^o)}{\partial(\text{net}_{pk}^o)} \frac{\partial(\text{net}_{pk}^o)}{\partial w_{kj}^o}$$

For now we shall write the partial derivative of the output function
as

$$f_k^{o\prime}(\text{net}_{pk}^o) = \frac{\partial f_k^o(\text{net}_{pk}^o)}{\partial(\text{net}_{pk}^o)}$$

Using Eq. (3.2) we can show that

$$\frac{\partial(\text{net}_{pk}^o)}{\partial w_{kj}^o} = i_{pj}$$

Finally, we can write the gradient of the error surface as

$$\frac{\partial E_p}{\partial w_{kj}^o} = -(y_{pk} - o_{pk})\, f_k^{o\prime}(\text{net}_{pk}^o)\, i_{pj} \qquad (3.7)$$

By a similar, and only slightly more complicated, analysis we can
find the gradient of the error surface with respect to the hidden-layer
weights:

$$\frac{\partial E_p}{\partial w_{ji}^h} = -f_j^{h\prime}(\text{net}_{pj}^h)\, x_{pi} \sum_{k=1}^{M} (y_{pk} - o_{pk})\, f_k^{o\prime}(\text{net}_{pk}^o)\, w_{kj}^o \qquad (3.8)$$

The reason that I left the derivatives of the output functions as
"primed" functions, instead of explicitly calculating the derivative, is that
the value of that derivative depends on the form of the output
function. The two primary cases of interest are the sigmoid and the
identity function. In these two cases, the derivatives of the functions for
output-layer units are

$$f_k^{o\prime}(\text{net}_{pk}^o) = o_{pk}(1 - o_{pk}) \qquad (3.9)$$

for the sigmoid, and

$$f_k^{o\prime}(\text{net}_{pk}^o) = 1 \qquad (3.10)$$

for the identity function.
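We can let Mathematica confirm Eq. (3.9) by differentiating the sigmoid
directly; this check is my own addition:

Clear[z]
Simplify[D[1/(1+E^(-z)), z] - (1/(1+E^(-z))) (1 - 1/(1+E^(-z)))]

0

The difference simplifies to zero, so the derivative of the sigmoid is
indeed f(z)(1 - f(z)).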


As each training pattern is presented to the network, we first prop¬
agate the information forward to determine the actual network outputs.
Then we calculate the error terms on the output layer, and the gradient of
the error surface with respect to each of the output-layer weights. Next,
we calculate the gradient of the error surface with respect to each of the
weights on the hidden layer. If you look carefully at Eq. (3.8) you will
notice that, for any given unit on the hidden layer, the gradient of the
error surface depends on nil of the errors on the output layer. This de¬
pendency is reasonable, since any change on a hidden-layer weight will
have an effect on all of the output values of the output layer. Here is
where the concept of backpropagation enters formally: We calculate errors

on the output layer first, then bring those errors back to the hidden layer
to calculate the surface gradients there.
Once we have calculated the gradients, then we adjust each weight
value a small amount in the direction of the negative of the gradient. The
proportionality constant is called the learning-rate parameter, just as it
was for the ALC in Chapter 2. Next, we present the next input pattern
and repeat the weight-update process. The process continues until we are
satisfied that all output-layer errors have been reduced to an acceptable
value.
Before moving on to some examples, we can simplify the notation
somewhat through the use of some auxiliary variables. Define the output-layer
delta as

$$\delta_{pk}^o = (y_{pk} - o_{pk})\, f_k^{o\prime}(\text{net}_{pk}^o) = \delta_{pk}\, f_k^{o\prime}(\text{net}_{pk}^o) \qquad (3.11)$$

and the hidden-layer delta as

$$\delta_{pj}^h = f_j^{h\prime}(\text{net}_{pj}^h) \sum_{k=1}^{M} \delta_{pk}^o\, w_{kj}^o \qquad (3.12)$$

Using these definitions, the weight-update equations on both layers
take on a similar form:

$$w_{kj}^o(t+1) = w_{kj}^o(t) + \eta\, \delta_{pk}^o\, i_{pj} \qquad (3.13)$$

on the output layer, and

$$w_{ji}^h(t+1) = w_{ji}^h(t) + \eta\, \delta_{pj}^h\, x_{pi} \qquad (3.14)$$

on the hidden layer. η is the learning-rate parameter, and we have assumed
that it is the same on all units on all layers. This assumption is
typically a good one, and we will employ it exclusively in this book.

3.2 BPN Examples

In this section we will write the Mathematica code for the standard BPN
and use it to look at two specific examples. Because the BPN is so intensive
computationally, we shall be restricted to fairly small networks.
Nevertheless, we shall be able to experiment with several network parameters
to see their overall effect on the performance of the BPN.


Figure 3.2 This figure shows the data-representation scheme for the T-C problem, (a) Each
letter is superimposed on a 3 by 3 grid, (b) Filled grid-squares are represented by the real
number 0.9, and empty ones are represented by 0.1. Each input vector consists of nine real
numbers.

3.2.1 The T-C Problem

The T-C problem is a fairly simple pattern-recognition problem. It will
not severely tax the computational resources of our computer, and the
network converges quickly to a solution, making this problem ideal as
a first example. We wish to train a network to distinguish between the
letters T and C, independent of the angle of rotation of these letters. We
shall restrict the rotation angles to multiples of 90 degrees, resulting in
four possible inputs for each letter, as shown in Figure 3.2.
The neural network for this problem appears in Figure 3.3. We shall
use only a single output unit. We choose to represent the letter "T" by
an output value of 0.1, and the letter "C" by an output value of 0.9. An
alternate approach would be to have two output units. Then an output
vector of {0.9, 0.1} could represent the letter "T," and {0.1, 0.9} could
represent the letter "C."
Notice that we do not use zero and one in the input or output vectors.

Figure 3.3 This figure shows a standard, three-layer BPN that we can use to solve the T-C
problem. We only show one hidden layer in this figure, although we could add more. The
number of units on the hidden layer can vary and has effects on the performance of the
network. For this example, we assume all units have sigmoid outputs.

Recall that the sigmoid function asymptotically approaches the limits of
zero and one for infinite arguments. If we insisted that the actual network
outputs attain the values of zero and one, we could be iterating
the weights forever, and they would grow to extremely large values
(positive or negative). To avoid this problem, we let 0.1 represent the binary
zero state, and 0.9 the binary one state. As an alternative, we could
use the identity output function (on the output layer only); then zero
and one would be acceptable as desired output values. Nevertheless, we
shall stick with the sigmoid function here.
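The saturation is easy to see numerically; inverting the sigmoid shows
that the net input needed for a given output o is Log[o/(1-o)], so (as my
own illustration):

N[Log[0.9/0.1]]     (* net input that yields an output of 0.9 *)

2.19722

N[Log[0.999/0.001]] (* an output of 0.999 needs over three times that *)

6.90675

Chasing outputs of exactly one (or zero) would require unbounded net
inputs, and hence unbounded weights.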
Let's step through one cycle of the training algorithm of a BPN. Then
we will put all of the steps together in a function that we can call with
appropriate arguments. First, we define the ioPairs vectors. I will write
each input vector in a matrix form, followed by the appropriate output
vector. We shall also define the quantity ioPairsTC for convenience later
on.

ioPairsTC =
ioPairs = {{{0.9,0.9,0.9,
0.9,0.1,0.1,
0.9,0.9,0.9},{0.1}},
{{0.9,0.9,0.9,
0.1,0.9,0.1,
0.1,0.9,0.1},{0.9}},
{{0.9,0.9,0.9,
0.9,0.1,0.9,
0.9,0.1,0.9},{0.1}},
{{0.1,0.1,0.9,
0.9,0.9,0.9,
0.1,0.1,0.9},{0.9}},
{{0.9,0.9,0.9,
0.1,0.1,0.9,
0.9,0.9,0.9},{0.1}},
{{0.1,0.9,0.1,
0.1,0.9,0.1,
0.9,0.9,0.9},{0.9}},
{{0.9,0.1,0.9,
0.9,0.1,0.9,
0.9,0.9,0.9},{0.1}},
{{0.9,0.1,0.1,
0.9,0.9,0.9,
0.9,0.1,0.1},{0.9}} };

Next we shall establish the number of input units

inNumber = 9

the number of hidden units

hidNumber = 3

and the number of output-layer units

outNumber = 1

We need to initialize the matrices that will hold the weight values for the
units on each layer. For the BPN, we use typically small, random real
numbers.

hidWts = Table[Table[Random[Real,{-0.1,0.1}],
               {inNumber}],{hidNumber}]

{{-0.0339828, 0.00385251, -0.0174868, -0.0257278, -0.0029669,
  0.0580986, 0.0879445, 0.079082, -0.0360604},
 {-0.0250644, 0.0615698, -0.0770543, 0.0152563, -0.03313,
  -0.0941993, -0.00596479, -0.0440699, -0.05413},
 {-0.0214526, 0.0939241, 0.0398202, -0.026531, 0.0867308,
  0.0580687, -0.0883566, 0.0569707, 0.0859322}}

outWts = Table[Table[Random[Real,{-0.1,0.1}],
               {hidNumber}],{outNumber}]

{{-0.0671577, 0.0939334, -0.00400093}}

Finally, we pick a value for the learning-rate parameter:

eta = 0.5

0.5

We are now ready to begin forward propagation of an input vector
through the network. During training, we will select input vectors at
random from the ioPairs vector.

ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]

{{0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}, {0.9}}

Then extract the input and desired-output portions

inputs=ioP[[1]]

{0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}



outDesired=ioP[[2]]

{0.9}

To compute the output of the hidden-layer units, take the dot product of
the inputs and the hidden-layer weights and apply the sigmoid function
to each element of the resulting vector.

hidOuts = sigmoid[hidWts . inputs]

{0.529156, 0.478449, 0.553957}

We can compute the output-layer outputs with a similar statement:

outputs = sigmoid[outWts . hidOuts]

{0.501797}

Forward propagation is now complete. Calculation of the deltas is next,
starting with the output layer. We will employ an auxiliary variable to
hold the difference between the desired and actual output.

outErrors = outDesired-outputs

{0.398203}

Then the output delta is

outDelta= outErrors (outputs (1-outputs))

{0.0995494}

The hidden-layer delta is a bit more complicated. The factor
Transpose[outWts] . outDelta calculates the sum of products of the output
weights and output deltas in Eq. (3.12).

hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts] .outDelta

{-0.00166569, 0.00233341, -0.0000984131}



To determine the new weights, we update according to Eqs. (3.13) and
(3.14). We must take the outer product of the deltas on each layer and
the inputs to the layer.

outWts += eta Outer[Times,outDelta,hidOuts]

{{-0.0408191, 0.117748, 0.0235721}}

hidWts += eta Outer[Times,hidDelta,inputs]

{{-0.0340661, 0.00310295, -0.0175701, -0.0258111, -0.00371646,
  0.0580153, 0.087195, 0.0783325, -0.0368099},
 {-0.0249478, 0.0626198, -0.0769376, 0.015373, -0.03208,
  -0.0940826, -0.00491475, -0.0430199, -0.0530799},
 {-0.0214575, 0.0938798, 0.0398153, -0.0265359, 0.0866865,
  0.0580638, -0.0884009, 0.0569265, 0.0858879}}

We are now finished with the first training vector. To continue, we would
select a new input vector and repeat the above steps. To monitor our
progress, we can watch the value of outErrors until it, or its square, reaches
some acceptable level; or we can specify a certain number of iterations,
which will be our approach here.
Notice that all of the processing for the BPN training algorithm comprises
only six lines of Mathematica code. Let's put those lines together
in a function, shown in Listing 3.1, that implements the simple BPN.
Notice that we are constructing a table (vector) of error values as a part
of the function. If you were programming this function in a high-level
computer language, such as C, you would likely use a loop construct,
such as a for or while loop in the main body of the code. Since we know
exactly how many elements there will be in the final table (numlters), the
Table function is more appropriate. If we were to iterate training until a
certain error value was reached, we would also use a For construct, and
Append the error to a preexisting array.
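For instance, a sketch of that variation might look like the following
(errorLimit and the loop shown here are illustrative assumptions, not part of
Listing 3.1):

errors = {};
err = 1.0;
While[err > errorLimit,
  (* one training pass, exactly as in the six lines above *)
  err = outErrors . outErrors;
  AppendTo[errors, err]
];

A While loop serves the same purpose as the For construct mentioned above;
either way the error list grows with Append rather than being preallocated
by Table.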
The function returns three important pieces of information: the new
values for the hidden-layer weights, the new values for the output-layer
weights, and the list of errors generated as training occurred. We shall
use these values to assess how well the training went after we call the
function. Let's run a short test of 10 iterations.
bpnStandard[inNumber_, hidNumber_, outNumber_, ioPairs_, eta_, numIters_] :=
  Module[{errors,hidWts,outWts,ioP,inputs,outDesired,hidOuts,
          outputs,outErrors,outDelta,hidDelta},
    hidWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber}],{hidNumber}];
    outWts = Table[Table[Random[Real,{-0.1,0.1}],{hidNumber}],{outNumber}];
    errors = Table[
      (* select ioPair *)
      ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
      inputs=ioP[[1]];
      outDesired=ioP[[2]];
      (* forward pass *)
      hidOuts = sigmoid[hidWts . inputs];
      outputs = sigmoid[outWts . hidOuts];
      (* determine errors and deltas *)
      outErrors = outDesired-outputs;
      outDelta = outErrors (outputs (1-outputs));
      hidDelta = (hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
      (* update weights *)
      outWts += eta Outer[Times,outDelta,hidOuts];
      hidWts += eta Outer[Times,hidDelta,inputs];
      (* add squared error to Table *)
      outErrors.outErrors, {numIters}]; (* end of Table *)
    Return[{hidWts,outWts,errors}];
  ]; (* end of Module *)

Listing 3.1

outs={0,0,0}; (* place holder for returned values *)


outs=bpnStandard[9,3,1,ioPairsTC,0.5,10];

Examine the results. First, the hidden-unit weight values:

outs[[1]]

{{0.0100008, 0.0554054, 0.0299096, 0.0309324, -0.0318445,
  0.0242747, 0.0837774, 0.0710365, 0.0341162},
 {0.0739625, -0.0485557, 0.0695227, -0.0686431, -0.0373425,
  0.0408982, 0.0913859, 0.00508364, -0.0908467},
 {0.0350314, 0.037319, -0.00618832, 0.0978313, -0.0689194,
  0.0794352, 0.0327677, 0.012315, -0.0586869}}

Next, the output-unit weights:

outs[[2]]

{{0.021414, 0.032426, -0.0746567}}

Finally, the list of errors:

outs[[3]]

{0.162749, 0.166452, 0.157103, 0.148467, 0.179714, 0.148903,
 0.179508, 0.149288, 0.178831, 0.170143}

We will generally be interested in a plot of the error values. We can get
one as follows:

ListPlot[outs[[3]],PlotJoined->True];

Rather than do individual calculations to assess how far along we are in
the training process, we can define another function to take the weight
vectors and the ioPairsTC vector and calculate the individual errors for each
input. That function is called bpnTest. You will find the listing for it in the
appendix. The arguments of the function are the hidden-layer weights,
the output-layer weights, and the ioPairs vector.

bpnTest[outs[[1]],outs[[2]],ioPairsTC];

Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9}
Output 1 = {0.475994} desired = {0.1} Error = {-0.375994}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1}
Output 2 = {0.477661} desired = {0.9} Error = {0.422339}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9}
Output 3 = {0.476084} desired = {0.1} Error = {-0.376084}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9}
Output 4 = {0.477411} desired = {0.9} Error = {0.422589}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 5 = {0.476176} desired = {0.1} Error = {-0.376176}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}
Output 6 = {0.476818} desired = {0.9} Error = {0.423182}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 7 = {0.475935} desired = {0.1} Error = {-0.375935}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1}
Output 8 = {0.476766} desired = {0.9} Error = {0.423234}

Mean Squared Error = {{0.160101}}

All of the output values are clustered near the central region of the sigmoid
function. The network has not been trained sufficiently to allow it
to distinguish between the two classes of input vectors. Let's try again,
this time increasing the number of iterations.

outs={0,0,0}; (* place holder for returned values *)


outs=bpnStandard[9,3,1,ioPairsTC,0.5,200];

ListPlot[outs[[3]],PlotJoined->True];


bpnTest[outs[[1]],outs[[2]],ioPairsTC];

Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9}
Output 1 = {0.551113} desired = {0.1} Error = {-0.451113}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1}
Output 2 = {0.554287} desired = {0.9} Error = {0.345713}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9}
Output 3 = {0.552132} desired = {0.1} Error = {-0.452132}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9}
Output 4 = {0.55632} desired = {0.9} Error = {0.34368}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 5 = {0.552388} desired = {0.1} Error = {-0.452388}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}
Output 6 = {0.553846} desired = {0.9} Error = {0.346154}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 7 = {0.551687} desired = {0.1} Error = {-0.451687}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1}
Output 8 = {0.552483} desired = {0.9} Error = {0.347517}

Mean Squared Error = {{0.161853}}

We are not getting very far very fast. Let's try again.

outs={0,0,0}; (* place holder for returned values *)


outs=bpnStandard[9,3,1,ioPairsTC,0.5,700];

That calculation took quite a long time on my computer. Let's see where
we are.

bpnTest[outs[[1]],outs[[2]],ioPairsTC];

Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9}
Output 1 = {0.278072} desired = {0.1} Error = {-0.178072}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1}
Output 2 = {0.684431} desired = {0.9} Error = {0.215569}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9}
Output 3 = {0.286354} desired = {0.1} Error = {-0.186354}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9}
Output 4 = {0.677043} desired = {0.9} Error = {0.222957}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 5 = {0.283591} desired = {0.1} Error = {-0.183591}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9}
Output 6 = {0.684752} desired = {0.9} Error = {0.215248}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9}
Output 7 = {0.276314} desired = {0.1} Error = {-0.176314}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1}
Output 8 = {0.695672} desired = {0.9} Error = {0.204328}

Mean Squared Error = 0.0394364

It looks like the categories are beginning to separate. We should see a
decrease in the error values.

ListPlot[outs[[3]],PlotJoined->True];


This plot is about what we might expect. If we went back and performed
more iterations, we should do even better. By now, however, you should
be asking if there are any ways in which we might speed up the convergence
of the network. The answer, of course, is yes. The first thing
we can do is try a few variations of the learning-rate parameter to see
if it has any effect on the convergence. In Section 3.3 we shall look at
some other methods that have been proposed to speed convergence of
the algorithm.

outs=bpnStandard[9,3,1,ioPairsTC,0.9,500];
ListPlot[outs[[3]],PlotJoined->True];

The larger value of learning rate resulted in a faster convergence. If larger
is better, let's keep going.

outs=bpnStandard[9,3,1,ioPairsTC,1.3,500];
ListPlot[outs[[3]],PlotJoined->True];

outs=bpnStandard[9,3,1,ioPairsTC,2.0,350];
ListPlot[outs[[3]],PlotJoined->True];

outs=bpnStandard[9,3,1,ioPairsTC,3.0,250];
ListPlot[outs[[3]],PlotJoined->True];


outs=bpnStandard[9,3,1,ioPairsTC,4.0,200];
ListPlot[outs[[3]],PlotJoined->True];




outs=bpnStandard[9,3,1,ioPairsTC,5.0,150];
ListPlot[outs[[3]],PlotJoined->True];

outs=bpnStandard[9,3,1,ioPairsTC,10,150];
ListPlot[outs[[3]],PlotJoined->True];

outs=bpnStandard[9,3,1,ioPairsTC,30,150];
ListPlot[outs[[3]],PlotJoined->True];




What is amazing about this example is the magnitude of the learning
rate that you can use and still have a network that converges. It looks
like somewhere between 10 and 30, convergence breaks down. In no
way should you infer that such large learning rates are appropriate for
other networks. Typically, most real networks require a very small (much
less than 1) learning rate for convergence. Because it does take a fair
amount of time for each run, and we are limited to small networks here,
I suggest that you conduct parametric studies of this type using compiled
code.
When you execute this network yourself, you might find that the
network fails to converge with a learning rate as low as 2.5, for example.
Alternatively, you may find that you can use learning rates in excess of
40. Given the combination of starting location (determined by the initial
weights) and the learning rate, the network may find a local minimum
in the weight space causing learning to cease before the overall error
reaches a value low enough for each of the exemplars.
This phenomenon occurs sometimes when trying to solve real problems
with neural networks; the network may fail to converge after a large
number of training passes. If you get this result, you can try changing
the weight initialization, or adjusting the learning rate. It may also be
that there is not enough information in the set of exemplars to allow the
network to learn properly, or you may have too few or too many units
in the hidden layer. Experience is the best teacher in these cases.

3.2.2 The XOR Problem and the BPN


We have discussed aspects of the XOR problem in Chapters 1 and 2. In
this section we will begin to examine how a BPN responds to the XOR
problem. In comparison to the T-C problem, the XOR problem is quite
hard. In fact, we will not actually see a solution to this problem until
after we add a mechanism to speed convergence of the BPN, which we
shall do in Section 3.3. The ioPairs array for the XOR problem appears
below.

ioPairsXOR = { {{0.1,0.1},{0.1}}, {{0.1,0.9},{0.9}},
               {{0.9,0.1},{0.9}}, {{0.9,0.9},{0.1}} };

Let's begin by using the Timing function to see how long iterations take
using the standard BPN model. My computer is a Macintosh IIsi with a

math coprocessor, and the time required will be different depending on
your computer.

Timing[outs=bpnStandard[2,3,1,ioPairsXOR,0.5,100];]

{36.15 Second, Null}

Let's see if anything has happened.


ListPlot[outs[[3]], PlotJoined->True];


Not much has. Let's increase the learning rate significantly to see if that
helps.

Timing[outs=bpnStandard[2,3,1,ioPairsXOR,5.0,100];]

{46.85 Second, Null}

ListPlot[outs[[3]], PlotJoined->True];

Once again there has not been any learning. Let's try more passes
through the data.

Timing[outs=bpnStandard[2,3,1,ioPairsXOR,5.0,1500];]

{710.6 Second, Null}

ListPlot[outs[[3]], PlotJoined->True];

Let's look at the outputs explicitly to see what is happening.

bpnTest[outs[[1]],outs[[2]],ioPairsXOR];

Input 1 = {0.1, 0.1}
Output 1 = {0.217616} desired = {0.1} Error = {-0.117616}
Input 2 = {0.1, 0.9}
Output 2 = {0.753512} desired = {0.9} Error = {0.146488}
Input 3 = {0.9, 0.1}
Output 3 = {0.795341} desired = {0.9} Error = {0.104659}
Input 4 = {0.9, 0.9}
Output 4 = {0.554083} desired = {0.1} Error = {-0.454083}

Mean Squared Error = 0.0631092

The network seems to be learning three of the four points. Perhaps more
iterations will do the trick. Based on the last results, this next calculation
should take about 16 minutes on my computer.

outs=bpnStandard[2,3,1,ioPairsXOR,5.0,2000];

ListPlot[outs[[3]], PlotJoined->True];

bpnTest[outs[[1]],outs[[2]],ioPairsXOR];

Input 1 = {0.1, 0.1}
Output 1 = {0.137168} desired = {0.1} Error = {-0.0371683}
Input 2 = {0.1, 0.9}
Output 2 = {0.864222} desired = {0.9} Error = {0.0357782}
Input 3 = {0.9, 0.1}
Output 3 = {0.861855} desired = {0.9} Error = {0.0381451}
Input 4 = {0.9, 0.9}
Output 4 = {0.51232} desired = {0.1} Error = {-0.41232}

Mean Squared Error = 0.0435311

Well, we are not much better off than we were before. Perhaps we have
found another local minimum, where the network will never learn the
fourth training vector.
One variation that we have not yet tried is in the number of hidden-layer
units. In a sense, adding hidden-layer units adds "degrees of freedom"
that can help the network converge to a better solution, much like
adding higher orders to the polynomial fit to a curve. You must use some
restraint, however, since there is a trade-off between faster convergence
in terms of the number of iterations required, and the time per iteration.
Moreover, if you add too many hidden-layer units, you could end up
worse off than when you started. Let's evaluate the case of doubling the
number of hidden units to six.

outs={0,0,0};
Timing[outs=bpnStandard[2,6,1,ioPairsXOR,5.0,100];]

{66.2667 Second, Null}


The time required is about 50% more than the case with three hidden
units. Let's see if we get as good a solution as before with only two
thirds the number of iterations.
outs={0,0,0};
Timing[outs=bpnStandard[2,6,1,ioPairsXOR,5.0,1000];]

{660. Second, Null}


ListPlot[outs[[3]], PlotJoined->True];

We are at about the same place as we were at the end of 1500 iterations
using three hidden units. Moreover it has taken us just as much time to
get to this point as it did to run 1500 iterations with three hidden units.
While it may be true that learning requires fewer iterations with a larger
number of hidden units, the actual amount of CPU time may, in fact, be
as much, or even greater. Of course, if the network will not converge at
all, adding hidden units may be the way to get it to converge. The point
I am trying to make here is that the number of iterations required for
convergence is not necessarily the best measure of learning speed. We
shall need to keep this fact in mind when we examine other methods
of increasing learning speed in Section 3.3. I should also point out that
in most real-world problems where the number of inputs is large (say
10s or 100s) the number of hidden-layer units is typically less than the
number of inputs, unlike our simple example here.

3.2.3 Adding a Bias Unit


Let's add the bias units to our BPN code. These units provide an extra
degree of freedom that may help the network converge to a solution in

fewer iterations. The trade-off, of course, is that there are more connections
to process. The code appears in Listing 3.2.
Adding space for the bias terms in the weight matrices is no big
problem; we just increase the column dimension by one. Two other
modifications involve adding an additional input value of 1.0 to each
input vector:
inputs=Append[ioP[[1]],1.0]
and forcing the last hidden output to be 1.0:
outInputs = Append[hidOuts,1.0]
These two changes are indicated in the code by the comment
(* bias mod *)
A third modification appears in the statement that updates the hidden-unit
weights. Because of the way we calculate the weight deltas, the
equations for the weight updates would automatically try to calculate
new weights on connections from the input layer to the bias unit on the
hidden layer; however, there are no such connections. Therefore, we
must eliminate the last weight delta vector before updating the hidden
weights. The statement:
Drop[Outer[Times,hidDelta,inputs],-1]
performs this task.
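A quick check of the dimensions shows why the Drop is needed. For the T-C
network of Section 3.2, hidDelta now has four components and inputs has ten,
so (values aside):

Dimensions[Outer[Times,hidDelta,inputs]]

{4, 10}

Dimensions[Drop[Outer[Times,hidDelta,inputs],-1]]

{3, 10}

and the latter matches the 3-by-10 shape of hidWts.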
I have also added some optional print statements, an automatic call
to the bpnTest routine, and an automatic plot of the errors. Notice that
there are now four returned items: the lists of the hidden weights, output
weights, output errors, and a graphics object representing the plot of the
errors. Let's try the T-C problem again, since it will require less time than
the XOR problem. The bpnTest function has an option that will allow it
to handle the bias terms correctly. We can set that option before running
the network.

outs={0,0,0,0};
SetOptions[bpnTest,bias->True];
Timing[outs=bpnBias[9,3,1,ioPairsTC,4.0,200];]

New hidden-layer weight matrix:


{{-0.375183, -0.235498, -0.41385, -0.31976, -0.00324652, -0.377699,
-0.279365, -0.250795, -0.256884, -0.321761},
{-0.917095, -0.198925, -0.708324, -0.150139, 2.98298, -0.267209,
-0.797281, -0.137803, -0.708624, 0.637737},
{-0.594783, -0.273314, -0.648397, -0.165679, 2.2331, -0.315633,
-0.615488, -0.213412, -0.559986, 0.500099}}

bpnBias[inNumber_, hidNumber_, outNumber_, ioPairs_, eta_, numIters_] :=
  Module[{errors,hidWts,outWts,ioP,inputs,outDesired,hidOuts,
          outputs,outErrors,outDelta,hidDelta},
    hidWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber+1}],{hidNumber}];
    outWts = Table[Table[Random[Real,{-0.1,0.1}],{hidNumber+1}],{outNumber}];
    errorList = Table[
      (* select ioPair *)
      ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
      inputs=Append[ioP[[1]],1.0]; (* bias mod *)
      outDesired=ioP[[2]];
      (* forward pass *)
      hidOuts = sigmoid[hidWts . inputs];
      outInputs = Append[hidOuts, 1.0]; (* bias mod *)
      outputs = sigmoid[outWts . outInputs];
      (* determine errors and deltas *)
      outErrors = outDesired-outputs;
      outDelta = outErrors (outputs (1-outputs));
      hidDelta = (outInputs (1-outInputs)) Transpose[outWts].outDelta;
      (* update weights *)
      outWts += eta Outer[Times,outDelta,outInputs];
      hidWts += eta Drop[Outer[Times,hidDelta,inputs],-1]; (* bias mod *)
      (* add squared error to Table *)
      outErrors.outErrors, {numIters}]; (* end of Table *)
    Print["New hidden-layer weight matrix: "];
    Print[]; Print[hidWts]; Print[];
    Print["New output-layer weight matrix: "];
    Print[]; Print[outWts]; Print[];
    bpnTest[hidWts,outWts,ioPairs]; (* check how close we are *)
    errorPlot = ListPlot[errorList, PlotJoined->True];
    Return[{hidWts,outWts,errorList,errorPlot}];
  ]; (* end of Module *)

Listing 3.2

New output-layer weight matrix:

{{0.384772, 3.2809, 2.42758, 2.38318}}
Input 1 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 1.}
Output 1 = {0.143134} desired = {0.1} Error = {-0.0431341}
Input 2 = {0.9, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 1.}
Output 2 = {0.876863} desired = {0.9} Error = {0.023137}
Input 3 = {0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 1.}
Output 3 = {0.137324} desired = {0.1} Error = {-0.0373242}
Input 4 = {0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 1.}
Output 4 = {0.883766} desired = {0.9} Error = {0.0162339}
Input 5 = {0.9, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 1.}
Output 5 = {0.136798} desired = {0.1} Error = {-0.0367985}
Input 6 = {0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 1.}
Output 6 = {0.886159} desired = {0.9} Error = {0.0138414}
Input 7 = {0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9, 1.}
Output 7 = {0.14003} desired = {0.1} Error = {-0.0400298}
Input 8 = {0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 1.}
Output 8 = {0.869965} desired = {0.9} Error = {0.0300346}
Mean Squared Error = 0.0010128

{152.967 Second, Null}

This result appears to be similar to that obtained without the bias term.
Nevertheless, the bias term is often incorporated as a part of a standard
BPN.

3.3 BPN Variations


If a BPN as small as that which we used for the XOR problem takes
such a long time to converge, you might imagine that networks for more
realistic problems might be incredibly time consuming. I recently did a
problem with a BPN having only about 45,000 connections (a small network
compared to some). The network took over two weeks to converge
to a solution (using an 80386-class computer). Needless to say, it is difficult
to justify waiting for two weeks to find out if you have a solution.
If you have access to a supercomputer, the time may be reduced significantly,
but the overall cost may rise. The problem is the BPN algorithm:
It requires a large amount of computation for each iteration. Surely there
must be a way to speed the convergence of these networks.
The quest for the BPN holy grail has taken researchers down many
and varied paths. A recent conference proceedings contained upwards of
fifty papers, each claiming to have found a method to speed convergence
of the backpropagation algorithm. In this section we shall look at a few
of those methods. The methods that we shall examine are not necessarily
the best or fastest. My purpose here is not to identify the best method, but
to illustrate how easily experimentation can be accommodated within the
Mathematica environment. As a matter of fact, we have already employed
some variations of what you might call the original BPN method.
For both the T-C and XOR problems, we used 0.1 and 0.9 instead
of 0 and 1, for the components of both the input and output vectors.
The argument that we gave was that the sigmoid function could never
reach 0 or 1 and, therefore, we needed to back off from those limits. This
argument is appropriate to explain why 0.1 and 0.9 are used as desired
output values; it is not adequate to explain why we used them as input
values instead of 0 and 1. To answer the latter question, let's look at
one of the ioPairs for the T-C problem, and recall how the weight-update
values are calculated on the hidden layer.

(* This is the "T" vector *)

{{0.9,0.9,0.9,
  0.1,0.9,0.1,
  0.1,0.9,0.1},{0.9}}

If we had used zeros and ones in the input vector instead of 0.1 and 0.9
the above would appear as follows:

{{1,1,1,
  0,1,0,
  0,1,0},{0.9}}

To update weights on the hidden layer, the equation is

hidWts += eta Outer[Times, hidDelta, inputs]

Each delta value for a given weight is multiplied by the corresponding
input value. If that input value is zero, then there will be no change to
that weight. If there is no change, there is no learning. Learning would
only take place with inputs that are one. Thus, convergence should take a
larger number of iterations, because some weights do not change during
a given iteration. By using 0.1 and 0.9 as inputs, we ensure that weight
changes will be nonzero. This simple technique is not actually a variation
of the algorithm, however, so we will now turn our attention to some
techniques that are.
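A one-line check makes the point about zero inputs; the delta value 0.2 here
is invented for the illustration:

0.2 {0, 1, 0}

{0, 0.2, 0}

0.2 {0.1, 0.9, 0.1}

{0.02, 0.18, 0.02}

With binary inputs, two of the three weight changes vanish; with 0.1 and 0.9,
all three are nonzero.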

3.3.1 Momentum

As a first example of a modification of the algorithm, we shall look at
the addition of a term called momentum to the weight-update equations.
This term will have a significant effect on the learning speed, in terms of
the number of iterations required.
The idea behind momentum in a neural network is straightforward:
Once you start adjusting weights in a certain direction, keep them moving
generally in that direction. In more practical terms, after you adjust the
weights during one training iteration, save the value of that adjustment;
when calculating the adjustment for the next iteration, add a fraction of
the previous change to the new one. In terms of an equation (in this case,
for the hidden-layer weights):

$$ w_{ji}(t+1) = w_{ji}(t) + \eta\,\delta_{pj}\,x_{pi} + \alpha\,\Delta w_{ji}(t) \qquad (3.15) $$

where $\alpha$ is called the momentum term, typically a positive number less
than one, and

$$ \Delta w_{ji}(t) = w_{ji}(t) - w_{ji}(t-1) $$

The function bpnMomentum incorporates this momentum term as a parameter
alpha in the function call. The following is a template of the
function. A complete listing appears in the appendix.

bpnMomentum[inNumber, hidNumber,
            outNumber, ioPairs, eta, alpha, numIters]

To implement the function, we must keep track of the weight changes
from one iteration to the next. To do that we introduce the following
matrices into the code:

hidLastDelta = Table[Table[0,{inNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];

We add a fraction, alpha, of these values to the weight changes before
updating the weights:

outLastDelta = eta Outer[Times,outDelta,hidOuts] +
               alpha outLastDelta;
outWts += outLastDelta;

hidLastDelta = eta Outer[Times,hidDelta,inputs] +
               alpha hidLastDelta;
hidWts += hidLastDelta;

Let's try the T-C problem using this new network. We shall repeat an
example that we ran in the previous section, this time including a momentum
factor of 0.5.

outs={0,0,0,0};
Timing[outs=bpnMomentum[9,3,1,ioPairsTC,0.9,0.5,150];]

Mean Squared Error = 0.000846143




{102.15 Second, Null}

This result represents a significant improvement in convergence of the
network. Let's try the network on the XOR problem to see if we obtain
similar results.

outs={0,0,0,0};
Timing[outs=bpnMomentum[2,3,1,ioPairsXOR,2.0,0.9,1500];]

Mean Squared Error = 0.0709606

{786.083 Second, Null}

The network seems to be acting as it did before when it was learning
only three of the four patterns. Perhaps we have found a local minimum
and the network will never learn the fourth pattern. Alternately, more
iterations may result in a complete solution.
In the spirit of experimentation, let's add another modification to the
program. We can arbitrarily set a maximum acceptable error for any
one pattern to some number, say 0.1. Then, we can add a conditional
statement to the program so that, if the error for an input pattern is
less than this acceptable value, no weight updates occur during that
iteration. That way, the network is not overlearning one pattern at the
expense of the others. I have included this modification in a program
called bpnMomentumSmart (not that the other programs are not smart, but I
am running out of names).
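The essential change is a single conditional around the weight updates. A
sketch of the idea, using the momentum-version variable names (the comparison
of the squared error against 0.01, which is 0.1 squared, is my reading; see the
appendix for the actual bpnMomentumSmart listing):

If[outErrors.outErrors > 0.01,
  outLastDelta = eta Outer[Times,outDelta,hidOuts] + alpha outLastDelta;
  outWts += outLastDelta;
  hidLastDelta = eta Outer[Times,hidDelta,inputs] + alpha hidLastDelta;
  hidWts += hidLastDelta
];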

outs={0,0,0,0};
Timing[outs=bpnMomentumSmart[2,3,1,ioPairsXOR,2.0,0.9,1500];]

Mean Squared Error = 0.00434073

{575.25 Second, Null}

Not only did the network converge, but the time required to run 1500
iterations was significantly less in this case than it was for the unmodified
program.

3.3.2 Competitive Weight Updates


In this section we shall examine a modification of the BPN that uses a
competitive algorithm to update weight values. Recall that, according to
Eqs. (3.13) and (3.14), weight changes are proportional to the delta terms
(hidDelta and outDelta in our code); thus, we might reason that the unit
with the largest value of delta should adjust its weights by the largest
amount. All other units on the layer should adjust their weights in the
direction opposite to that winning unit. In other words, after we calculate
the delta values on a layer, we search for the unit with the largest delta
(actually we need to look for the largest magnitude). That unit is declared
the winner of the competition, and the delta value for all units on the
layer becomes a function of that unit's delta.
For the hidden layer, the delta value is given by Eq. (3.12):

$$ \delta^h_{pj} = f'(\mathrm{net}^h_{pj}) \sum_{k=1}^{M} \delta^o_{pk}\, w^o_{kj} $$

Let

$$ \epsilon_{pj} = \begin{cases} \max_m(\delta_{pm}) & j = \text{winning unit} \\ -\tfrac{1}{4}\,\max_m(\delta_{pm}) & \text{otherwise} \end{cases} \qquad (3.16) $$

Then Eq. (3.14) becomes

$$ w_{ji}(t+1) = w_{ji}(t) + \eta\,\epsilon_{pj}\,x_{pi} $$

with a similar equation for the output layer. Let's build these changes
into a standard BPN program without momentum.
The algorithm proceeds as in the case of the standard algorithm until
we calculate the delta values for each layer. After that calculation, we
determine the epsilon values. We shall look at the code for the output
layer here; the code for the hidden layer is analogous. First we search
the delta values to find the one with the largest absolute value and save
its position:

outPos = First[Flatten[Position[Abs[outDelta],Max[Abs[outDelta]]]]];

Since the Position function returns a list of lists, we must extract the actual
number as shown, using First and Flatten. We need to remember the delta
value at this position:

outEps = outDelta [[outPos]]

All outDelta values are changed to -(1/4) outEps

outDelta=Table[-1/4 outEps,{Length[outDelta]}]

except for the one at position outPos:

outDelta[[outPos]] = outEps
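A quick numeric illustration of the whole sequence, with invented delta values:

outDelta = {0.02, -0.08, 0.05};

Running the three steps above on this vector gives outPos = 2 and
outEps = -0.08, and outDelta becomes

{0.02, -0.08, 0.02}

The winning unit keeps its delta, while the other units are pushed in the
opposite direction with one quarter of its magnitude.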

We can now perform the same calculation for the hidden layer and up¬
date the weights as usual. The new program is called bpnCompete, and a
complete listing appears in the appendix.
Let's try the new algorithm on the T-C problem with one output.
Recall from Section 3.2 that without momentum, the error was still fairly
high, though it was diminishing after about 400 iterations.

outs={0,0,0,0};
outs = bpnCompete[9,3,1,ioPairsTC,5.0,150];

Mean Squared Error = 0.000880959



This result looks somewhat better than with the standard algorithm. You
might want to experiment with this algorithm further to determine quantitatively
if it is better than standard backpropagation.
There are dozens — perhaps hundreds — of other modifications that
we could explore. Our intent, however, is not to examine all possibilities
in an attempt to find the best one, but rather to learn how to use
Mathematica as a tool to facilitate that exploration. With that philosophy in
mind, let's move on to the discussion of a different network architecture
called the functional link network.

3.4 The Functional Link Network

The XOR problem is a difficult one, and it is prototypical of problems
whose classes are not linearly separable. By using hidden layers and
units with nonlinear output functions, we can overcome this difficulty.
The price we pay is the added computational complexity associated with
the backpropagation algorithm.
In Chapter 1, we showed a way of transforming the XOR problem into
one where the classes were linearly separable. In essence, we increased
the dimensionality of the input space by adding a third input made up
of the product of the original two inputs. This method allowed us to
construct a solution with a single output unit and no hidden units. In
this section we shall describe the functional link network (FLN) that
uses the concept of functional links to increase the dimensionality of the
input space.

3.4.1 FLN Architecture

In a typical feed-forward network, input units distribute input patterns
unchanged to units on succeeding layers. In the FLN, input units pass
their data through a functional link before distributing the data to other
units. The purpose of the functional link is to produce multiple data
elements from each individual input element by using the input elements
as the arguments to certain functions, or by multiplying certain data el¬
ements together. The first method is called the functional-expansion
model, and the second is called the tensor model, or outer product
model.
The XOR example from Chapter 1 typifies the tensor model. In the
tensor model, you increase the dimension of the input vector by multiplying
components together using combinations of two inputs, then three,
etc. You can keep the size of the resulting vector somewhat under control
by eliminating redundant products and products whose components are
uncorrelated over the set of input vectors.
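A sketch of such a tensor expansion for pairwise products (tensorExpand is a
name invented for this illustration; note that it forms the literal numeric
products, whereas the XOR example below re-encodes the product in the 0.1/0.9
scheme):

tensorExpand[x_List] :=
  Join[x, Flatten[Table[x[[i]] x[[j]],
       {i,Length[x]-1}, {j,i+1,Length[x]}]]]

tensorExpand[{0.1, 0.9}]

{0.1, 0.9, 0.09}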
Figure 3.4 illustrates the concept of the functional-expansion model.
You can also combine the two methods into a hybrid model where functions
of data elements are multiplied by functions of other data elements.
The choice and number of functions to use is problematical, as is
whether to use the functional-expansion or tensor models. Experience
is bound to yield insight in this area, and I do not wish to make any
predictions about what might work. A major advantage of the network
is that you can generally eliminate the hidden layer. Let's look at some
examples in the next section.

3.4.2 FLN Examples

We shall look at an example of each type of FLN in this section. First,
let's apply the tensor model to the XOR problem. Then we shall apply
the functional-expansion model to a more complicated problem.

Tensor Model of the XOR Problem Let's recall the ioPairs vectors
for this problem:

ioPairsXOR = { { {0.1,0.1},{0.1} },
               { {0.1,0.9},{0.9} },
               { {0.9,0.1},{0.9} },
               { {0.9,0.9},{0.1} } };

Figure 3.4 This figure illustrates the functional-expansion model for a three-input FLN.
Each input passes through a functional link that generates n functions of the input value.
These 3n values are passed to the next layer of the network. Note that each unit on the
next layer would receive all 3n data elements from the functional links.

If we enhance the ioPairs vectors by adding to each input the product
of the original two inputs, we will have the appropriate inputs to the
network. Then, we need not write our code to be specific to the model or
the problem: The same code can apply to either the tensor or functional-
expansion model. The new ioPairs is:

ioPairsXORFLN = { { {0.1,0.1,0.1},{0.1} },
                  { {0.1,0.9,0.1},{0.9} },
                  { {0.9,0.1,0.1},{0.9} },
                  { {0.9,0.9,0.9},{0.1} } };

Notice that we write 0.1 x 0.9 as 0.1, since this product represents the logical 0 x 1.
Let's construct a program, called fln, to implement the network. We
can construct the program itself by modifying the bpnMomentum code. Since
there are no hidden units, we can eliminate all of the "hid" variables.
Inputs to the output units become inputs rather than hidOuts. We shall use
a linear unit as the output unit, so we do not need the sigmoid function,
and the equation for outDelta will change since the derivative of the linear
output function is unity. The argument list for fln is the same as that for

the bpnMomentum function, with the exception that there are no hidden units.
The function template is

fln[inNumber, outNumber, ioPairs, eta, alpha, numIters]

The function returns an array with four components: the new weight
matrix, the list of errors generated during the iterations, the graphic object
representing the plot of the errors, and the output vector generated by the
call to flnTest, which is a modified version of bpnTest. See the appendix
for the listings of these functions.
Let's try the fln function with only a few iterations.

outs={0,0,0,0};
Timing[outs=fln[3,1,ioPairsXORFLN,0.5,0.5,100];]

Mean Squared Error = 0.0019762

{22.2167 Second, Null}

That result is not too bad, considering the small number of passes through
the data. Let's try again with a few more passes.

outs={0,0,0,0};
Timing[outs=fln[3,1,ioPairsXORFLN,0.5,0.5,500];]

Mean Squared Error = 0.0019765



{78.5667 Second, Null}

The mean squared error for this run is not bad, although there appears
to be one point that has a significantly larger error than the others. Since
the output function is linear for this network, let's try using zeros and
ones in the ioPairs vectors.

ioPairsXOR01FLN = { { {0,0,0},{0} },
                    { {0,1,0},{1} },
                    { {1,0,0},{1} },
                    { {1,1,1},{0} } };

outs={0,0,0,0};
Timing[outs=fln[3,1,ioPairsXOR01FLN,0.5,0.5,500];]

Mean Squared Error = 5.18305 10^-22

{83.2167 Second, Null}



The result of the change in desired output values is dramatic. The FLN
performs quite well on the XOR problem as you can see, although it is
somewhat sensitive to the learning rate parameter. You can see this fact
for yourself if you experiment with larger values of eta.

The Functional-Expansion Model In this section we shall teach an
FLN a continuous function of one variable. That is, the input will be a
single value, and the output will be some function of the input value.
Moreover, we shall use only a finite number of input points to enable us
to investigate how well the network learns to interpolate when we give
it input values that were not used during the training procedure.
Let's choose a nontrivial, but well-behaved function. In this context,
"well-behaved" means that there are no wild oscillations in the function
that would require us to use an inordinate number of sample points for
our training set. Here is the function, plotted between x = 1 and x = 3.

functionPlot=Plot[0.3x Sin[Pi x], {x,1,3}];

We build an ioPairs matrix by sampling the function at intervals of 0.1:

ioPairsFunct = Table[{{i},{0.3i Sin[Pi i]//N}}, {i,1,3,0.1}]

{{{1}, {0}}, {{1.1}, {-0.101976}}, {{1.2}, {-0.211603}},
 {{1.3}, {-0.315517}}, {{1.4}, {-0.399444}}, {{1.5}, {-0.45}},
 {{1.6}, {-0.456507}}, {{1.7}, {-0.412599}}, {{1.8}, {-0.317404}},
 {{1.9}, {-0.17614}}, {{2.}, {6.50521 10^-20}}, {{2.1}, {0.194681}},
 {{2.2}, {0.387938}}, {{2.3}, {0.558222}}, {{2.4}, {0.684761}},
 {{2.5}, {0.75}}, {{2.6}, {0.741824}}, {{2.7}, {0.655304}},
 {{2.8}, {0.49374}}, {{2.9}, {0.268845}}, {{3.}, {2.78098 10^-18}}}

We can plot these values if we flatten the array and then partition it into
pairs of coordinates.

functionListPlot = ListPlot[Partition[Flatten[ioPairsFunct],2]];

We must decide on the number and identity of the functions for the
functional link. We shall use the following six functions in this example:
$x$, $\sin(\pi x)$, $\cos(\pi x)$, $\sin(2\pi x)$, $\cos(2\pi x)$, and $\sin(4\pi x)$. In order to simulate
the functional link, we can replace the first element of each ioPairs vector
with a list of these functions already evaluated. Because we will use it
more than once, let's define a function that will produce the appropriate
list of functions.

functionList[y_] := {y, Sin[Pi y]//N, Cos[Pi y]//N,
                     Sin[2 Pi y]//N, Cos[2 Pi y]//N,
                     Sin[4 Pi y]//N}

We generate the new ioPairs vectors by replacing the first element of
each ioPair by the appropriate list of functions.
ioPairsFunctFLN = Map[ReplacePart[#,functionList[#[[1,1]]],1]&,ioPairsFunct];

Each ioPair input vector should now have six components, as the following
example shows.

ioPairsFunctFLN[[1]]

{{2, 0, 1., 0, 1., 0}, {1.41421}}

We can now try the FLN on this data.

outs={0,0,0,0};
Timing[outs=fln[6,1,ioPairsFunctFLN,0.005,0.50,1050];]

Mean Squared Error = 0.00104127


{193.183 Second, Null}

The agreement with the actual function appears to be fairly good. Moreover,
we have only executed 50 passes through the data set. These results
are sufficient to illustrate the concepts, so I shall not perform any more
executions of this network here.
By working on the data a bit, we can plot the output in order to
compare it to the correct answers. First, look at the list of output values:
outs[[4]]

{{0.0118456}, {-0.13436}, {-0.255246}, {-0.347552}, {-0.406662}, {-0.432413},
 {-0.429038}, {-0.39617}, {-0.32316}, {-0.197473}, {-0.0204939}, {0.186378},
 {0.39137}, {0.565842}, {0.688298}, {0.741125}, {0.712402}, {0.605634},
 {0.445142}, {0.266315}, {0.097578}}

We need to substitute these values into the ioPairsFunct array in place
of the correct answers. To do this, we can use the MapThread function as
follows:

ioPairsOut = MapThread[ReplacePart[#,#2,2]&,{ioPairsFunct,outs[[4]]}]

{{{1}, {0.0118456}}, {{1.1}, {-0.13436}}, {{1.2}, {-0.255246}},
 {{1.3}, {-0.347552}}, {{1.4}, {-0.406662}}, {{1.5}, {-0.432413}},
 {{1.6}, {-0.429038}}, {{1.7}, {-0.39617}}, {{1.8}, {-0.32316}},
 {{1.9}, {-0.197473}}, {{2.}, {-0.0204939}}, {{2.1}, {0.186378}},
 {{2.2}, {0.39137}}, {{2.3}, {0.565842}}, {{2.4}, {0.688298}},
 {{2.5}, {0.741125}}, {{2.6}, {0.712402}}, {{2.7}, {0.605634}},
 {{2.8}, {0.445142}}, {{2.9}, {0.266315}}, {{3.}, {0.097578}}}

Now we can flatten and partition this array and plot it.

outListPlot = ListPlot[Partition[Flatten[ioPairsOut],2]];


To see where we are, we can plot the output along with the original
function.

Show[{functionPlot,outListPlot}];

The results are not bad, and presumably could be improved with further
training of the network. Let's see how well this network interpolates.
We can construct a new ioPairs array using points in between those used
for training.

ioPairsTest=Table[{{i},{0.3i Sin[Pi i]//N}},{i,1.1,3,0.1}]

{{{1.1}, {-0.101976}}, {{1.2}, {-0.211603}}, {{1.3}, {-0.315517}},
 {{1.4}, {-0.399444}}, {{1.5}, {-0.45}}, {{1.6}, {-0.456507}},
 {{1.7}, {-0.412599}}, {{1.8}, {-0.317404}}, {{1.9}, {-0.17614}},
 {{2.}, {6.50521 10^-20}}, {{2.1}, {0.194681}}, {{2.2}, {0.387938}},
 {{2.3}, {0.558222}}, {{2.4}, {0.684761}}, {{2.5}, {0.75}},
 {{2.6}, {0.741824}}, {{2.7}, {0.655304}}, {{2.8}, {0.49374}},
 {{2.9}, {0.268845}}, {{3.}, {2.78098 10^-18}}}

Expand the input vectors as before.

ioPairsTestFLN = Map[ReplacePart[#,functionList[#[[1,1]]],1]&,ioPairsTest];

Test these new vectors using the weights from the previous run. The
code for the function flnTest appears in the appendix.

outputValues=flnTest[outs[[1]],ioPairsTestFLN];

Output 1 = {-0.13436} desired = {-0.101976} Error = {0.0323845}
Output 2 = {-0.255246} desired = {-0.211603} Error = {0.0436432}
Output 3 = {-0.347552} desired = {-0.315517} Error = {0.0320357}
Output 4 = {-0.406662} desired = {-0.399444} Error = {0.00721808}
Output 5 = {-0.432413} desired = {-0.45} Error = {-0.0175867}
Output 6 = {-0.429038} desired = {-0.456507} Error = {-0.0274695}
Output 7 = {-0.39617} desired = {-0.412599} Error = {-0.0164282}
Output 8 = {-0.32316} desired = {-0.317404} Error = {0.00575561}
Output 9 = {-0.197473} desired = {-0.17614} Error = {0.0213337}
Output 10 = {-0.0204939} desired = {6.50521 10^-20} Error = {0.0204939}
Output 11 = {0.186378} desired = {0.194681} Error = {0.00830251}
Output 12 = {0.39137} desired = {0.387938} Error = {-0.00343193}
Output 13 = {0.565842} desired = {0.558222} Error = {-0.00762052}
Output 14 = {0.688298} desired = {0.684761} Error = {-0.0035372}
Output 15 = {0.741125} desired = {0.75} Error = {0.00887472}
Output 16 = {0.712402} desired = {0.741824} Error = {0.0294223}
Output 17 = {0.605634} desired = {0.655304} Error = {0.04967}
Output 18 = {0.445142} desired = {0.49374} Error = {0.0485978}
Output 19 = {0.266315} desired = {0.268845} Error = {0.00253004}
Output 20 = {0.097578} desired = {2.78098 10^-18} Error = {-0.097578}
Mean Squared Error = 0.00108632

These results are as good as the training set, indicating that the network
can interpolate well. Can it extrapolate, however? Let's find out by

constructing a new ioPairs array using the same function, but outside the
range of the original data.

ioPairsTest2=Table[{{i},{0.3i Sin[Pi i]//N}},{i,3,5,0.1}]

{{{3}, {0}}, {{3.1}, {-0.287386}}, {{3.2}, {-0.564274}},
 {{3.3}, {-0.800927}}, {{3.4}, {-0.970078}}, {{3.5}, {-1.05}},
 {{3.6}, {-1.02714}}, {{3.7}, {-0.898009}}, {{3.8}, {-0.670075}},
 {{3.9}, {-0.36155}}, {{4.}, {-4.94396 10^-18}}, {{4.1}, {0.380091}},
 {{4.2}, {0.740609}}, {{4.3}, {1.04363}}, {{4.4}, {1.25539}},
 {{4.5}, {1.35}}, {{4.6}, {1.31246}}, {{4.7}, {1.14071}},
 {{4.8}, {0.846411}}, {{4.9}, {0.454255}}, {{5.}, {1.16281 10^-17}}}

ioPairsTest2FLN = Map[ReplacePart[#,functionList[#[[1,1]]],1]&,ioPairsTest2];

outputValues2=flnTest[outs[[1]],ioPairsTest2FLN];

Output 1 = {0.097578} desired = {0} Error = {-0.097578}
Output 2 = {-0.0486277} desired = {-0.287386} Error = {-0.238758}
Output 3 = {-0.169513} desired = {-0.564274} Error = {-0.39476}
Output 4 = {-0.26182} desired = {-0.800927} Error = {-0.539107}
Output 5 = {-0.320929} desired = {-0.970078} Error = {-0.649148}
Output 6 = {-0.346681} desired = {-1.05} Error = {-0.703319}
Output 7 = {-0.343305} desired = {-1.02714} Error = {-0.683836}
Output 8 = {-0.310438} desired = {-0.898009} Error = {-0.587571}
Output 9 = {-0.237427} desired = {-0.670075} Error = {-0.432648}
Output 10 = {-0.111741} desired = {-0.36155} Error = {-0.249809}
Output 11 = {0.0652385} desired = {-4.94396 10^-18} Error = {-0.0652385}
Output 12 = {0.272111} desired = {0.380091} Error = {0.10798}
Output 13 = {0.477103} desired = {0.740609} Error = {0.263507}
Output 14 = {0.651575} desired = {1.04363} Error = {0.392057}
Output 15 = {0.77403} desired = {1.25539} Error = {0.481364}
Output 16 = {0.826858} desired = {1.35} Error = {0.523142}
Output 17 = {0.798134} desired = {1.31246} Error = {0.514324}
Output 18 = {0.691366} desired = {1.14071} Error = {0.449348}
Output 19 = {0.530874} desired = {0.846411} Error = {0.315536}
Output 20 = {0.352047} desired = {0.454255} Error = {0.102208}

Output 21 = {0.18331} desired = {1.16281 10^-17} Error = {-0.18331}
Mean Squared Error = 0.183144

As you might have guessed, these results are not so good. As a general
rule, you should not expect a neural network to be able to respond
properly to data that is outside of the domain used during the training
process.

Summary

In this chapter we have explored a very powerful and robust learning
methodology known as the generalized delta rule. This rule, which is
the learning algorithm for the backpropagation network, is a multi-unit,
multi-layer version of the delta rule discussed in Chapter 2. We also used
the backpropagation network as a starting point for experimentation with
modifications in an attempt to speed the convergence of the network to
a solution for specific problems. Using Mathematica we can quickly make
alterations in the code to accommodate our ideas. The functional link
network, introduced in the final section, often allows us to find a solution
to a problem without the need of a hidden layer of units. By expanding
the dimension of the input space by combining inputs or mapping inputs
onto a set of functions, the hidden layer may become unnecessary.
Chapter 4

Probability and Neural Networks

In the previous chapters, we have studied several types of neural networks,
and in all cases the calculations that we performed were deterministic.
In this chapter, we add elements of probability and stochastic processing
to some neural-network models. The Hopfield network, which
is the topic of Section 4.1, is a deterministic network. I introduce it here
because there are analogies between the Hopfield network and physical
systems having the properties of magnetic materials. In the next two
sections, we shall look at some of the properties of magnetic materials
and see how we can use these properties to extend the basic Hopfield
model to include stochastic processes.
Next, we will examine a standard probabilistic technique for pattern
classification, called Bayesian classification. We shall see that by making
suitable changes in the basic feed-forward neural network architecture
(such as that of the BPN) we can implement a Bayesian classifier in a
neural network. The probabilistic neural network (PNN), based on this
approach, is the subject of the final section in this chapter.

4.1 The Discrete Hopfield Network


Unlike the networks presented so far in this book, the Hopfield network
has only a single layer of processing elements. Furthermore, the connectivity
scheme has each unit connected to all other units on the layer.
Figure 4.1 shows the architecture.
The Hopfield network is classified as an associative memory. Given
pairs of vectors, such as (x1, y1), (x2, y2), etc., an input of one of the x
vectors should result in the output of the corresponding y vector. Moreover,
a corrupted, or noisy, version of one of the input vectors should
also recall the corresponding output vector. In the case of the Hopfield
network, the y vectors are identical to the x vectors, making the network
an autoassociative memory. The network stores x vectors so that later, if
an incomplete, or noisy vector is used as an input, the network will find
the stored vector closest in some sense to the input vector.
We use the adjective discrete to indicate that the output values of the
Hopfield units take on discrete, rather than continuous, values. In this
case, we shall restrict the outputs to the values of ±1.
The output function is straightforward: If the net input is negative,
the output value is -1; if the net input is positive, the output value is
+1; and if the net input is zero, the output does not change. Let's call

Figure 4.1 This figure shows the basic structure of the Hopfield network. Each of the n
units is connected by a weighted connection to all other units. There is no feedback from
a unit to itself. The I values are external inputs.

that output function psi, and define it as follows:

psi[inValue_,netIn_] := If[netIn>0,1,
                          If[netIn<0,-1,inValue]]

It will simplify things later if we also define a function that takes input
vectors and net-input vectors as arguments and maps them onto the psi
function. We shall call this new function phi.

phi[inVector_List,netInVector_List] :=
  MapThread[psi[#,#2]&,{inVector,netInVector}]
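A quick check of these definitions on a sample state vector and net-input
vector:

phi[{1, -1, 1}, {5, 0, -3}]

{1, -1, -1}

The second unit sees a net input of zero, so it retains its previous value of -1.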

The net-input value to each unit can be found in the usual way by finding
the scalar product of the input vector and the weight vector. What
remains is to determine the appropriate weight vector for the given pairs
of vectors that we wish to store in the network.
To determine the weights we can use a training method known as
Hebb's rule (see Chapter 1). Hebb's rule was initially derived in an
attempt to explain how learning takes place in real, biological systems.
Simply put, it states that if two connected neurons are firing simultaneously,
the strength of the connection between them will increase. A
typical way to express this rule mathematically is

$$ \Delta w_{ij} \propto x_i x_j $$

where $x_i$ and $x_j$ are the firing rates, or outputs, of neurons $i$ and $j$.



In a practical sense, such a simple formulation can lead to extremely
large weight values, given that the units stay on long enough. Rather
than try to come up with a more realistic model, we can just limit the
weight value to the product of the outputs. In other words, let $\Delta w_{ij} = x_i x_j$.
If there is more than one vector in the training set, we can sum the
contributions to each weight from the individual vectors. We write the
equation for the entire weight matrix in the general case as

$$ \mathbf{W} = \frac{1}{n} \sum_{i=1}^{L} \mathbf{x}_i \mathbf{x}_i^t \qquad (4.1) $$

where L is the number of vectors in the training set, and n is the number
of units in the network.
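Translated directly into Mathematica, Eq. (4.1) is a one-liner (hopfieldWts is
a name chosen here for illustration; the session below omits the 1/n factor,
which merely rescales the net inputs and energies without changing any output
signs):

hopfieldWts[pats_List] :=
  Apply[Plus, Map[Outer[Times,#,#]&, pats]] / Length[First[pats]]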
We can also associate with the Hopfield network a quantity called an
energy function. The energy function has the form

$$ E = -\frac{1}{2}\, \mathbf{x}^t \mathbf{W} \mathbf{x} \qquad (4.2) $$

where the x vector is the current output vector of the network. This
equation can also be written as

$$ E = -\frac{1}{2} \sum_{i,j=1}^{n} x_i\, w_{ij}\, x_j \qquad (4.3) $$

For future use, let's define the energy function using Mathematica.

energyHop[x_,w_] := -0.5 x . w . x;

Let's try a small Hopfield network, say one with 10 units, and three
random training patterns. First the training patterns. We can use the
Random function to generate binary vectors, then convert to bipolar vectors
by multiplying each component by 2 and subtracting 1.

trainPats = 2 Table[Table[Random[Integer,{0,1}],{10}],{3}] - 1

{{1, 1, 1, -1, -1, -1, -1, 1, 1, -1},
 {-1, -1, 1, -1, -1, 1, 1, -1, -1, 1},
 {-1, -1, 1, -1, 1, 1, 1, 1, 1, 1}}

To generate the weight matrix, we need to find the outer product of each
training vector with itself, and add the contributions from each.

wts = Apply[Plus,Map[Outer[Times,#,#]&,trainPats]];
MatrixForm[wts]

 3   3  -1   1  -1  -3  -3   1   1  -3
 3   3  -1   1  -1  -3  -3   1   1  -3
-1  -1   3  -3  -1   1   1   1   1   1
 1   1  -3   3   1  -1  -1  -1  -1  -1
-1  -1  -1   1   3   1   1   1   1   1
-3  -3   1  -1   1   3   3  -1  -1   3
-3  -3   1  -1   1   3   3  -1  -1   3
 1   1   1  -1   1  -1  -1   3   3  -1
 1   1   1  -1   1  -1  -1   3   3  -1
-3  -3   1  -1   1   3   3  -1  -1   3

As expected, the matrix is square and symmetric. Also notice that the
diagonal elements are all equal. To be consistent with the Hopfield architecture
as we described it above, we should set all of the diagonal
elements to zero. It will not affect the results if we leave them as they
are, however, so let's not do any more manipulations with the matrix.
Calculating the energy of the network for each of the training patterns
gives the following:

eTrainPats = Map[energyHop[#,wts]&,trainPats]

{-60., -66., -60.}

These values represent minima on a hypersurface as a function of the
vector that is the first argument to the energyHop function, in direct analogy
to the error surfaces of the Adaline and BPN. Input vectors that are close
to one of the training patterns should have an energy less negative than
that of the training patterns (in this context, close refers to Hamming
distance, or the number of bits that are different between the two vectors).
Moreover, as the processing in the network proceeds, we might expect
that the output vector will evolve toward the nearest local-minimum
state. We also might expect that once the network settles into one of the
local minima, further processing will result in no changes to the output
(more on this issue shortly); in other words, the network reaches a fixed
point. Let's try with an initial input vector that differs by two bits, the
first and the ninth, from the first training pattern:

input1 = {-1, 1, 1, -1, -1, -1, -1, 1, -1, -1};
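As a quick check of that Hamming distance (the helper hamming is mine, purely illustrative): two bipolar vectors differ exactly where their elementwise product is -1, so we can simply count those positions.

hamming[x_, y_] := Count[x y, -1]

hamming[input1, trainPats[[1]]]

2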



The energy of the system with this input vector is

energyHop[input1, wts]

-20.

which is less negative than -60 as we expected. Before we proceed to


propagate this vector through the network, we must confront an issue
relating to the way in which that propagation occurs. There are two
ways that we might proceed. The first is to calculate the net-input for
all of the units in parallel, then calculate the new output vector and the
new energy. We refer to this method as synchronous updating. The
second method, called asynchronous updating, is to calculate the new
output for one unit at a time, that unit generally being chosen at random,
then propagating this new output value around to the other units before
another unit is chosen. Although the asynchronous method is probably
more indicative of the way real brains operate, we shall stick with the
synchronous method here because it will make the code easier.
The calculation proceeds as follows:

netInput1 = wts . input1

{8, 8, 4, -4, -8, -8, -8, 4, 4, -8}

output1 = phi[input1, netInput1]

{1, 1, 1, -1, -1, -1, -1, 1, 1, -1}


energyHop[output1, wts]

-60.

Notice that the output is identical to the original training pattern,


and the energy is at the corresponding local minimum. Further propagation through the network should result in no changes. Let's see if that statement is true. The new input vector is now output1.

netInput12 = wts . output1

{16, 16, 4, -4, -8, -16, -16, 12, 12, -16}

output12 = phi[output1, netInput12]



{1, 1, 1, -1, -1, -1, -1, 1, 1, -1}

energyHop[output12, wts]

-60.

which confirms the earlier statements.


Let's try a second example. For the next input vector, choose the
following:

input2 = {1, 1, 1, 1, 1, -1, 1, -1, 1, -1}

{1, 1, 1, 1, 1, -1, 1, -1, 1, -1}


Examine this input vector closely, and, before you continue reading, try to
guess to which of the three training patterns the network will converge.

netlnput2 = wts . input2

{8, 8, -4, 4, 0, -8, -8, 4, 4, -8}

output2 = phi[input2,netlnput2]

{1, 1, -1, 1, 1, -1, -1, 1, 1, -1}

energyHop[output2,wts]

-66.

Let's continue processing to see if things change.

netlnput21 = wts . output2

{18, 18, -10, 10, 2, -18, -18, 10, 10, -18}

output21 = phi[output2,netInput21]

{1, 1, -1, 1, 1, -1, -1, 1, 1, -1}

energyHop[output21,wts]

-66.

makeHopfieldWts[trainingPats_, printWts_:True] :=
Module[{wtVector},
 wtVector =
  Apply[Plus, Map[Outer[Times, #, #]&, trainingPats]];
 If[printWts,
  Print[];
  Print[MatrixForm[wtVector]];
  Print[],
  Continue
 ]; (* end of If *)
 Return[wtVector];
] (* end of Module *)

Listing 4.1

There were no changes, so the network must be at a stable equilibrium


point. This energy is equal to that of training pattern two, but the output
vector is the negative, or complement, of that training pattern.
This example illustrates a peculiarity of this network (and some others that we shall discuss in Chapter 6): If you store a training vector, you also store its complement. Moreover, as you try to store an increasingly large number of training patterns, the network will experience a phenomenon known as crosstalk. Crosstalk manifests itself as the appearance of stable configurations that have no relationship to any of the training vectors. We call these states spurious stable states, and their existence limits the number of training patterns that we can store in a network of any given size. Usually, the limit is about 0.14n, where n is the number of units in the network.
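As a rough rule of thumb (this helper is illustrative, not part of the text), that capacity estimate translates into a one-liner:

capacity[n_] := Floor[0.14 n]

capacity[10]

1

so our 10-unit example, with three stored patterns, is already well beyond this rough limit, and spurious stable states should come as no surprise.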
The code for the discrete Hopfield network appears below. First, you must make the appropriate weight matrix from the training set. The function makeHopfieldWts in Listing 4.1 performs this computation and returns the weight matrix.
After calculating the weight matrix, we pass it and the input vector to the function discreteHopfield shown in Listing 4.2. The network will iterate until there are no further changes in the output vector.

discreteHopfield[wtVector_, inVector_, printAll_:True] :=
Module[{done, energy, newEnergy, netInput,
  newInput, output},
 done = False;
 newInput = inVector;
 energy = energyHop[inVector, wtVector];
 If[printAll,
  Print[ ]; Print["Input vector = ", inVector];
  Print[ ];
  Print["Energy = ", energy];
  Print[ ], Continue
 ]; (* end of If *)
 While[!done,
  netInput = wtVector . newInput;
  output = phi[newInput, netInput];
  newEnergy = energyHop[output, wtVector];
  If[printAll,
   Print[ ]; Print["Output vector = ", output];
   Print[ ];
   Print["Energy = ", newEnergy];
   Print[ ], Continue
  ]; (* end of If *)
  If[energy == newEnergy,
   done = True,
   energy = newEnergy; newInput = output,
   Continue
  ]; (* end of If *)
 ]; (* end of While *)
 If[!printAll,
  Print[ ]; Print["Output vector = ", output];
  Print[ ];
  Print["Energy = ", newEnergy];
  Print[ ];
 ]; (* end of If *)
]; (* end of Module *)

Listing 4.2
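Neither listing is exercised directly above, but a hypothetical call sequence, assuming phi is defined as before, might look like this:

wts = makeHopfieldWts[trainPats, False];
discreteHopfield[wts, input2];

The second call prints the initial input vector and its energy, then each successive output vector and energy, stopping when the energy no longer changes.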


Figure 4.2 You can think of a magnetic material as comprising individual atomic magnets resulting from a quantum-mechanical property known as spin. In the presence of an external magnetic field, the spin can be in one of two directions, each of which results in a different orientation of the north and south poles of the individual atomic magnets.

4.2 Stochastic Methods for Neural Networks

In this section, we shall examine briefly several concepts from physics


that have direct analogs for neural networks. Energy, temperature, atomic
spin, entropy, and magnetic interactions are useful concepts, as is the
process known as annealing. In particular, you should keep in mind the
details of the Hopfield network as you read through this material.

4.2.1 Some Ideas About Magnetism and Statistical Physics

Let's talk about some simple physical concepts before we attempt to


apply any of this physics to neural networks. First, consider Figure 4.2,
which shows a simplified representation of a magnetic material.
We can begin to construct a physical model for this magnetic system by assigning a value to a variable, s_i, for each atom. The value is +1 if the spin is up (north pole is up in Figure 4.2), and -1 if the spin is down. Such a model is known as an Ising model.
The individual atomic magnets are acted upon by the constant external field, h, and the fields arising from the other magnets in the system. Each magnet, m_i, exerts an influence on the other magnets, m_j, in proportion to its own magnetic field, h_i. The proportionality constant is called the exchange interaction strength, w_{ij}. Moreover, this interaction is symmetric, so that w_{ij} = w_{ji}. Then the total magnetic field influencing the ith magnet is

h_i = \sum_j w_{ij} s_j + h \qquad (4.4)

We can associate a certain potential energy with these interactions. The potential energy of this system is

E = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j - h \sum_i s_i \qquad (4.5)

The factor of one half in the first term on the right accounts for the fact that each i and j index is counted twice in the summation.
At very low temperatures, individual magnets tend to line up in the direction of the local field, h_i. Thus, the spin becomes either positive or negative one depending on the sign of the local field. At higher temperatures, thermal effects tend to disrupt the orderliness of this arrangement.
Consider the physical model that we just examined above in relationship to the Hopfield network. The spins, s_i, are analogous to the unit outputs, the local magnetic fields, h_i, are analogous to the net inputs, and the potential energy of the system is analogous to the energy of the network (in the absence of any external field). The exchange interaction strengths form a symmetric matrix analogous to the weight matrix of the Hopfield network.
Incidentally, if all of the exchange interaction strengths are positive,
we refer to the material as a ferromagnet. Random strengths result in a
substance known as a spin glass. Since weights in a Hopfield network
are more likely to appear to be random than all positive, the analogy is
often made between the Hopfield network and spin glasses.
Thermal effects enter the picture because the random motions caused by thermal energy can cause a magnet to flip to a different state. Since there are a large number of atomic magnets in any system, you would not likely notice that the magnetization state of the material was changing, provided the temperature remained constant. However, if the temperature were high enough, these thermal motions could completely randomize the individual spin directions, resulting in a material that had no net magnetization.
As individual spins flipped, the total energy of the system would
change slightly. On the average, however, the energy would remain at a

constant value, provided the temperature does not change. Nevertheless,


at any given instant, the total energy may differ from this average. In statistical physics, the probability, P_r, that a system in thermal equilibrium at a given temperature has an energy E_r, is proportional to a quantity known as the Boltzmann factor, e^{-\beta E_r}. In other words,

P_r = C e^{-\beta E_r} \qquad (4.6)

where C is the proportionality constant, and \beta is a factor proportional to the inverse of the temperature of the system. In a physical system, this factor would take the form

\beta = \frac{1}{k_B T}

where k_B is called Boltzmann's constant, and T is the absolute temperature.
The proportionality constant, C, is related to a quantity known as the partition function. The partition function, Z, is the sum of all of the possible Boltzmann factors for the system, that is

Z = \sum_r e^{-\beta E_r}

where the sum is taken over all possible energy states of the system.

4.2.2 Statistical Mechanics and Neural Networks

We can now apply the results of the previous section to neural networks.
Each unit in the Hopfield network has an output of either positive one or
negative one. Thus, we can think of these units as the analogues of the
magnets in the physical system described in the previous section, where
the outputs of the units correspond to the values of the magnetic spin.
The weight values in the network correspond to the internal magnetic
fields at each of the individual magnets. We shall consider a physical
system with no external magnetic field. By doing so, Eq. (4.5) for the
energy of the system of magnets corresponds directly with Eq. (4.3) for
the energy of the Hopfield network.
In a neural network, such as the Hopfield network, we can impart
a stochastic nature to the system by means of a fictitious temperature,
whereby the unit outputs are made to fluctuate due to fictitious thermal
motions. Moreover, if we assume a condition of thermal equilibrium, we

can use the Boltzmann distribution to describe the effect of these fluctuations. The net effect is that the outputs of the network units are no longer completely determined by the net-input values to those units. Instead, there is the possibility that a unit will be bumped into a higher energy state due to random thermal fluctuations within the system. Remember that the ideas of temperature and thermal fluctuations in a neural network are strictly mathematical constructs.
To build such a network we must be able to calculate the probability that any given unit will undergo a change of state due to a random thermal fluctuation. To accomplish this task, we shall limit our consideration to a single unit and correspondingly change the unit-update strategy from a synchronous update to a random, asynchronous update procedure.
Let's focus our attention on the kth unit, whose output we denote x_k. We shall call the energy of the system with x_k = +1, E_{(+1)}, and the energy with x_k = -1, E_{(-1)}. According to the Boltzmann distribution, the probability that the system will be in the state with x_k = +1 is

P_{(+1)} = \frac{e^{-\beta E_{(+1)}}}{e^{-\beta E_{(+1)}} + e^{-\beta E_{(-1)}}}

or

P_{(+1)} = \frac{1}{1 + e^{-\beta (E_{(-1)} - E_{(+1)})}}

The energy difference is given by

E_{(-1)} - E_{(+1)} = \sum_{j=1}^{n} w_{kj} x_j = \mathrm{net}_k

so that

P_{(+1)} = \frac{1}{1 + e^{-\mathrm{net}_k/T}}

where we have absorbed the Boltzmann constant into the fictitious temperature. This equation gives the probability that a unit has an output of +1, regardless of the value of the net input. Notice that the probability curve has a sigmoidal shape, identical to the output function that we have used in other networks. Let's define the function for use later on.

prob[n_, T_] := 1/(1 + E^(-n/T));

We can plot the probability function for several values of the temperature
parameter. Notice that as the temperature decreases, the system behaves
more and more like a deterministic system, until, at T = 0, the system is
completely deterministic.

Plot[{prob[n, 0.01], prob[n, .5], prob[n, 5]}, {n, -5, 5},
 AxesLabel -> {"net", "P"}];

To determine the output of any unit, we must first calculate the net
input. Then, depending on the temperature, calculate the probability
that the unit will have a +1 output. We then compare this probability
to a randomly generated number between zero and one. If this random
number is less than or equal to the probability, then we set the unit
output equal to +1, regardless of the net input. If the random number
is greater than the probability, then the normal rules for the Hopfield
network apply in determining the output.
By now, you may reasonably question why we would go to all this
trouble when the deterministic Hopfield network seems to work just fine.
The answer lies in the fact that this network, like others — in particular
the BPN — potentially can settle into a local minimum energy state, or
a spurious energy minimum, rather than one that corresponds to one of
the items stored in the network's weight matrix. Adding a stochastic
element to the unit outputs helps the network avoid such errors. The
procedure we employ is similar to the annealing process used in materials
processing, and hence we call it simulated annealing.

4.2.3 Simulated Annealing

Let's consider the example of a silicon boule being grown in a furnace to


be used as a substrate for integrated-circuit devices. It is highly desirable
that the crystal structure be a perfect, regular crystal lattice at ambient
temperature. Once the silicon boule is formed, it must be cooled slowly
to ensure that the crystal lattice forms properly. Rapid cooling can result
in many imperfections within the crystal structure, or in a substance that
is glasslike, with no regular crystalline structure at all. Both of these
configurations have a higher energy than the crystal with the perfect
lattice structure.
An annealing process must be used to find the global energy minimum. The temperature of the boule must be lowered gradually, giving atoms within the structure time to rearrange themselves into the proper configuration. At each temperature, sufficient time must be allowed so that the material reaches an equilibrium. To understand how this annealing process helps the crystal avoid a local minimum, we shall employ an intuitive argument used by Hinton and Sejnowski in their discussion of simulated annealing1. Consider the simple energy landscape shown in Figure 4.3.
The ball-bearing, which begins with an energy of E_s, has insufficient
energy initially to roll up the other side of the hill and down into the
global minimum. If we shake the whole system, we might give the ball
enough of a push to get it up the hill. The harder we shake, the more
likely it is that the ball will be given enough energy to get over that hill.
On the other hand, vigorous shaking might also push the ball from the
valley with the global minimum back over to the local minimum side.
If we give the system a gentle shaking, then once the ball gets to the
global minimum side, it is less likely to acquire sufficient energy to get
back across to the local minimum side. However, because of the gentle
shaking, it might take a very long time before the ball gets just the right
push to get it over to the global minimum side in the first place.
Annealing represents a compromise between hard shaking and gentle
shaking. At high temperatures, the large thermal energy corresponds to
vigorous shaking. Low temperatures correspond to gentle shaking. To
anneal an object, we raise the temperature then gradually lower it back

1 Hinton, G. E., and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press: Cambridge, MA, pages 282-317, 1986.

Figure 4.3 This figure shows a simple energy landscape with two minima, a local minimum, E_a, and a global minimum, E_b. The system begins with some energy, E_s. We can draw an analogy to a ball bearing rolling down a hill. The bearing rolls down the hill toward the local minimum, E_a, but has insufficient energy to roll up the other side and down into the global minimum.

to ambient temperature, allowing the object to reach equilibrium at each


stop along the way. The technique of gradually lowering the temperature
is the best way to ensure that a local minimum can be avoided without
having to spend an infinite amount of time waiting for a transition out
of a local minimum.
To anneal a neural network, raise the fictitious temperature to some value above zero and apply the stochastic processing prescription described above. You must continue allowing the units to change until equilibrium is reached. You could measure the properties of the network to determine when it has reached equilibrium, or you could simply guess at how long it might take to reach equilibrium. The latter procedure has the advantage of requiring fewer computations, and a little trial and error should give you a feel for making a reasonable choice.
Once the network has reached equilibrium at one temperature, then lower the temperature slightly and allow the network to reach equilibrium at the new temperature. Continue reducing the temperature in this manner until you reach a suitably low temperature. Once again, when to stop is a matter of personal experience for a given problem.
The annealing process is often described in terms of an annealing schedule. An annealing schedule is a list of pairs of numbers, for which each pair comprises a temperature and the number of sweeps at that temperature. A sweep is the number of times each unit has an opportunity to change while the network is at a particular temperature. An example of an annealing schedule is: {{10, 5}, {8, 10}, {4, 20}, {1, 20}, {0.1, 50}}.
You can imagine that executing such an annealing schedule would
take a considerable number of computations for a network of any size,
and you would be correct. To ensure that the network reaches the global
minimum energy, the temperature needs to be reduced very slowly. Often
we can live with an imperfect annealing schedule that may get us close
to the global minimum, but will get us there in our lifetime.

4.2.4 The Stochastic Hopfield Network


The following function implements the stochastic output function for the
Hopfield network:

probPsi[inValue_, netIn_, temp_] :=
 If[Random[] <= prob[netIn, temp], 1, psi[inValue, netIn]];

We shall adopt the asynchronous processing model for this network


rather than the synchronous one that we used earlier. We must, therefore, have two loops: one that specifies how many sweeps at a given temperature, and one that specifies how many unit-updates per sweep, depending on the number of units in the network. Moreover, we must run through this double loop for each temperature specified in the annealing schedule. After updating each unit, we must remember to replace the new value into the appropriate slot in the input vector.
If numUnits is the number of units in the network, and numSweeps is the
number of sweeps, then the code for processing each temperature step
in the annealing schedule would look like Listing 4.3. Notice that there
is no guarantee that every unit will be updated numSweeps times.
The function stochasticHopfield in Listing 4.4 incorporates the above
code and computes the output vectors for a given number of sweeps at
a particular temperature. As we did with the discrete Hopfield network,
let's compute a random set of input vectors to test this network.

trainPats = 2 Table[Table[Random[Integer, {0, 1}], {10}], {3}] - 1

{{-1, -1, 1, -1, -1, 1, -1, 1, 1, -1},
 {1, -1, -1, 1, 1, 1, 1, 1, 1, 1},
 {-1, -1, 1, -1, -1, 1, -1, -1, 1, -1}}

For[i = 1, i <= numSweeps, i++,
 For[j = 1, j <= numUnits, j++,
  (* select unit *)
  indx = Random[Integer, {1, numUnits}];
  (* net input to unit *)
  net = inVector . weights[[indx]];
  (* update input vector *)
  inVector[[indx]] = probPsi[inVector[[indx]], net, temp];
 ]; (* end For j *)
]; (* end For i *)

Listing 4.3

wts = Apply[Plus, Map[Outer[Times, #, #]&, trainPats]];
MatrixForm[wts]

 3  1 -3  3  3 -1  3  1 -1  3
 1  3 -1  1  1 -3  1 -1 -3  1
-3 -1  3 -3 -3  1 -3 -1  1 -3
 3  1 -3  3  3 -1  3  1 -1  3
 3  1 -3  3  3 -1  3  1 -1  3
-1 -3  1 -1 -1  3 -1  1  3 -1
 3  1 -3  3  3 -1  3  1 -1  3
 1 -1 -1  1  1  1  1  3  1  1
-1 -3  1 -1 -1  3 -1  1  3 -1
 3  1 -3  3  3 -1  3  1 -1  3

We shall use the following as an initial input vector:

input1 = {-1, 1, 1, -1, -1, -1, -1, 1, -1, -1};

With an arbitrary selection of temperature of 5, and the number of sweeps set at 10, we can run the network through its paces:

stochasticHopfield[input1, wts, 10, 5];

i= 1
New input vector =
{-1, 1, 1, -1, -1, 1, -1, 1, -1, -1}

stochasticHopfield[inVector_, weights_, numSweeps_, temp_] :=
Module[{input, net, indx, numUnits, indxList, output},
 numUnits = Length[inVector];
 indxList = Table[0, {numUnits}];
 input = inVector;
 For[i = 1, i <= numSweeps, i++,
  Print["i= ", i];
  For[j = 1, j <= numUnits, j++,
   (* select unit *)
   indx = Random[Integer, {1, numUnits}];
   (* net input to unit *)
   net = input . weights[[indx]];
   (* update input vector *)
   output = probPsi[input[[indx]], net, temp];
   input[[indx]] = output;
   indxList[[indx]] += 1;
  ]; (* end For numUnits *)
  Print[ ]; Print["New input vector = "]; Print[input];
 ]; (* end For numSweeps *)
 Print[ ]; Print["Number of times each unit was updated:"];
 Print[ ]; Print[indxList];
]; (* end of Module *)

Listing 4.4

i= 2
New input vector =
{-1, -1, 1, -1, -1, 1, -1, -1, 1, -1}
i= 3
New input vector =
{-1, 1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 4
New input vector =
{-1, 1, 1, -1, -1, 1, -1, -1, 1, -1}
i= 5
New input vector =
{-1, -1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 6
New input vector =
{-1, -1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 7
New input vector =
{-1, -1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 8
New input vector =
{-1, -1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 9
New input vector =
{-1, -1, 1, -1, -1, 1, -1, 1, 1, -1}
i= 10
New input vector =
{-1, 1, 1, -1, -1, 1, -1, 1, 1, -1}

Number of times each unit was updated:

{11, 11, 11, 8, 9, 14, 8, 11, 11, 6}

You should notice that the results tend to cluster around one of the training vectors, but that there is some variation due to the finite temperature. When you run this code on your computer, you will see different results due to the random nature of the process.
Remember, this function executes the network only at a single temperature. To anneal this network properly, we should begin at a temperature considerably higher than 5, perform more sweeps, then reduce the temperature according to some annealing schedule. You can use the above code as a basis for a complete function that performs the entire annealing process.
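A minimal sketch of such a function, assuming the annealing schedule is given as a list of {temperature, sweeps} pairs (the name annealHopfield and the schedule values below are illustrative):

annealHopfield[inVector_, weights_, schedule_] :=
Module[{input = inVector, numUnits, temp, sweeps,
  net, indx, i, j, k},
 numUnits = Length[inVector];
 For[k = 1, k <= Length[schedule], k++,
  {temp, sweeps} = schedule[[k]]; (* next temperature step *)
  For[i = 1, i <= sweeps, i++,
   For[j = 1, j <= numUnits, j++,
    (* pick a unit at random and update it stochastically *)
    indx = Random[Integer, {1, numUnits}];
    net = input . weights[[indx]];
    input[[indx]] = probPsi[input[[indx]], net, temp];
   ];
  ];
 ];
 Return[input];
]

For example, annealHopfield[input1, wts, {{10, 5}, {8, 10}, {4, 20}, {1, 20}, {0.1, 50}}] starts hot and cools according to the schedule described earlier.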

4.2.5 Boltzmann Machine Architecture

The Hopfield network is not the only one that can be annealed in the manner of the previous section. A network called the Boltzmann Machine performs an annealing during the learning process as well as during the postlearning production process. Figures 4.4 and 4.5 illustrate two variations of the Boltzmann machine architecture. Because the learning procedure for the Boltzmann machine is so time consuming, we shall not attempt to simulate it here.


Figure 4.4 In the Boltzmann-completion architecture, there are two layers of units: visible
and hidden. The network is fully interconnected between layers and among units on each
layer. The connections are bidirectional and the weights are symmetric, Wij = Wji. All
of the units are updated according to the stochastic method that we described for the
Hopfield network. The function of the Boltzmann-completion network is to learn a set of
input patterns and then to be able to supply missing parts of the patterns when a partial,
or noisy, input pattern is processed.

4.3 Bayesian Pattern Classification

In this section we look at the theory of Bayesian pattern classification.


This look will be necessarily brief and we will consider only those points
needed for the implementation of the probabilistic neural network in the
next section.

4.3.1 Conditional Probability and Bayes' Theorem

To begin the discussion, let's consider a simple problem involving choosing an object at random from a number of possible objects. To simplify the problem further, let's assume we have a red marble, a white marble, and a blue marble in a box and we are going to choose one blindly. What is the probability that the chosen marble will be blue? Clearly, the answer is 1/3. Denote this probability as P(B). Similarly, the probability of choosing the white marble is 1/3. Or is it?
Suppose we actually choose the blue marble from the box. Now
Suppose we actually choose the blue marble from the box. Now


Figure 4.5 For the Boltzmann-input/output network, the visible units are separated into
input and output units. There are no connections among input units, and the connections
from input units to other units in the network are unidirectional. All other connections
are bi-directional, as in the Boltzmann-completion network. This network functions as a
heteroassociative memory. During the recall process, the input-vector units are clamped
permanently and are not updated during the annealing process. All hidden units and
output units are updated according to the simulated-annealing procedure described above
for the Hopfield network.

what is the probability that our next choice will be the white marble? The answer depends on what we do with the blue marble that we have already chosen. If we replace the blue marble in the box before we make a second choice, then the probability of choosing the white marble is P(W) = 1/3, because the two events of choosing are totally independent of one another. P(B) and P(W) are called a priori probabilities. They are the probabilities we would assign to the selection of a blue or white marble initially, without any other information.
Let's look at the different ways that we can choose two objects, one
after the other, without replacing the first object in the box after it has
been chosen.

outcomes = Map[Drop[#, -1]&, Permutations[{R, W, B}]]

{{R, W}, {R, B}, {W, R}, {W, B}, {B, R}, {B, W}}

There are only two possibilities, having chosen the blue marble first: red or white. Then the probability of choosing the white marble with our next choice is 0.5. In other words, the probability changes based on the results of the first choice. We shall call this new result the conditional probability of choosing the white marble, given that we have already chosen the blue marble and have not replaced it in the box. The symbol that we give to this probability is P(W2|B). Similarly, the conditional probability of choosing the blue marble given that we have already chosen the white one is denoted P(B2|W). Notice, however, that there are two out of six ways in which the white marble may be chosen second. Thus, P(W2) = 1/3.
We can also define the joint probability of choosing, for example, a blue and a white marble as the result of two successive choices. We call that value P(B ∩ W2).
We already know that P(W) = 1/3, and P(B) = 1/3; also, P(R) =
1/3:

P[R] = 1./3;
P[W] = 1./3;
P[B] = 1./3;

The conditional probabilities are:

P[W2|B] = P[W2|R] = P[R2|B] = P[R2|W] = P[B2|W] = P[B2|R] = 0.5;

We can also calculate joint probabilities, such as P(B ∩ W2). From the list of results, there is only a one-out-of-six possibility of obtaining a blue marble, followed by a white one. Thus:

P[BandW2] = 1./6;

Assume that we conduct n independent tests where we select two


marbles in the manner described above. Let n_B be the total number of times a blue marble was selected first. Furthermore, let n_{B ∩ W2} be the number of times the two choices resulted in a blue and a white marble, in that order. We can write the ratio of these two quantities as

\frac{n_{B \cap W2}}{n_B} = \frac{n_{B \cap W2}/n}{n_B/n}

The numerator is the frequency of a blue then a white marble being selected, and the denominator is the frequency of the blue marble being

Figure 4.6 This figure shows a region of two-dimensional space broken up into three categories, or classes: B, C, and D. Each point in a given region belongs to the same class. The circular region, A, is superimposed on the space.

selected first. The ratio, then, represents the relative frequency of selecting a white and a blue marble given that a blue marble was selected first. As the number of trials increases, the frequencies approach the true probabilities of occurrence. Then the ratio becomes the conditional probability of W given B. In other words, we define the conditional probability as follows
P(W2|B) = \frac{P(B \cap W2)}{P(B)}

You can see that this relationship holds for the above example:

P[BandW2]/P[B]

0.5

Moving on to a slightly more complicated example, consider the situation shown in Figure 4.6. We can think of the two-dimensional space in that figure as possible outcomes, or events.
We can write the set A as follows: A = (A ∩ B) ∪ (A ∩ C) ∪ (A ∩ D). If we select a point at random from the space, what is the probability of that point being in the region A? The answer is

P(A) = P(A \cap B) + P(A \cap C) + P(A \cap D) \qquad (4.8)

If we use the definition of conditional probability, the equation becomes

P(A) = P(A|B)P(B) + P(A|C)P(C) + P(A|D)P(D) \qquad (4.9)



We are now in a position to ask the following question: Given that


the selected point is in region A, what is the probability that it is also in region B? According to the definition of conditional probability, the answer is

P(B|A) = \frac{P(A \cap B)}{P(A)}

We can use Eq. (4.9) for the denominator, resulting in

P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|C)P(C) + P(A|D)P(D)}

Bayes' theorem is a generalization of this last result. Let B_1, B_2, ..., B_n partition a region of space, S. Further, let A be an event associated with one of the points of S. Then the probability that event A is in region B_i is

P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_j P(A|B_j)P(B_j)} = \frac{P(A|B_i)P(B_i)}{P(A)} \qquad (4.10)
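As a quick illustration (the function bayes is mine, not part of the text), Eq. (4.10) translates directly into Mathematica: given the list of likelihoods P(A|B_j) and the list of priors P(B_j), it returns the list of a posteriori probabilities P(B_j|A).

bayes[likelihoods_, priors_] := (likelihoods priors)/(likelihoods . priors)

bayes[{0.2, 0.6, 0.4}, {0.5, 0.3, 0.2}]

{0.277778, 0.5, 0.222222}

The numbers here are arbitrary; note that the results sum to one, as they must.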
Bayes' theorem shows us how to convert the a priori probability, P(B_i), into an a posteriori probability, P(B_i|A), based on first obtaining the result A. Now that we have Bayes' theorem, let's move on to its application to decision theory.

4.3.2 Bayesian Strategies for Pattern Classification


Pattern classification is another name for decision making based on in¬
formation. Consider a simple example where you must decide whether
a particular pattern belongs to one of two classes, A or B. The pattern
may, in fact, be a picture comprising many picture elements, or pixels,
or it may be a series of facts or measurements about the pattern. The
only restriction we impose is that these measurements all be reducible
to numerical form. In other words, we can describe the set of measure¬
ments by a vector, x = (xux2,.. Each component of the vector
represents some feature of the pattern. Based on the value of x, we must
decide into which class to put the pattern.
If we know the conditional probabilities, P(A\x) and P(B\x), then
our best guess will be to choose class A if P(A\x) > P(B|x), and class
B if P(A\x) < P(B\x). This example illustrates the Bayes' decision rule.
Using the notation d(x) for the decision as a function of x, we can use
Bayes' theorem to write an equation for the result

d(x) = \begin{cases} A & P(x|A)P(A) > P(x|B)P(B) \\ B & P(x|A)P(A) < P(x|B)P(B) \end{cases} \qquad (4.11)

The boundary between the two decisions is known as the decision surface or decision boundary; it is defined by the equation

P(x|A) = K P(x|B)

where

K = \frac{P(B)}{P(A)}
In a typical problem, we would have several examples of the measurement results from each class. In order to be able to classify some
previously unseen pattern, we must use these exemplars to estimate the
decision boundary. To do this estimation, we must either know, or be
able to approximate, the underlying probability distribution functions,
P(x|A) and P(x|B).
A common way of estimating the probability distribution functions is to construct some standard distribution function centered at each exemplar point and sum the results. A Gaussian function is a common choice, although it is by no means a required choice. Let's look at a simple example to see how this process works.
Figure 4.7 shows a two-dimensional space having points in two
classes, evenly distributed within the regions shown. In this case, each x
vector has only two dimensions. If the ith exemplar from category A is x_{Ai}, then the estimate of the probability distribution function for class A is

P(x|A) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2\pi)^{p/2} \sigma^p} \, e^{-(x - x_{Ai}) \cdot (x - x_{Ai})/(2\sigma^2)}

where n is the number of exemplars, p is the dimension of the space, and \sigma is an adjustable parameter that determines the width of the individual Gaussians.
The exemplars for categories A and B are:

exemplarsA={ {2.5, 1.5}, {2.5, 2.5},


{3.5, 1.5}, {3.5, 2.5} };

exemplarsB={ {2.5, -1.5}, {2.5, -2.5},


{3.5, -1.5}, {3.5, -2.5} };

Figure 4.7 This figure shows a plane with two categories of points, A and B. The points are uniformly distributed within each region. The exemplar points are indicated by the black dots.

The two-dimensional Gaussian function has the form:

gauss2[x_, y_] :=
 (1/((2.0 Pi) sigma^2)) E^(-{(x - xa), (y - ya)} .
  {(x - xa), (y - ya)}/(2 sigma^2))

To approximate the distribution function for one of the classes, we must sum four of these functions, each with different values of xa and ya. There are several ways to accomplish this task. The particular method that I show here is not the most elegant, but will serve to illustrate the calculation explicitly. First, let's transform the exemplarsA array into an array of rules.

Clear[xa, ya];
ruleSetA = Map[{xa -> #[[1]], ya -> #[[2]]}&, exemplarsA]

{{xa -> 2.5, ya -> 1.5}, {xa -> 2.5, ya -> 2.5},
 {xa -> 3.5, ya -> 1.5}, {xa -> 3.5, ya -> 2.5}}

Then create a list of functions, each with the appropriate replacements


for xa and ya.

gaussList = {};
For[i = 1, i <= Length[exemplarsA], i++,
 AppendTo[gaussList, gauss2[x, y] /. ruleSetA[[i]]];
];
gaussList

{0.5/(E^(((-2.5 + x)^2 + (-1.5 + y)^2)/(2 sigma^2)) Pi sigma^2),
 0.5/(E^(((-2.5 + x)^2 + (-2.5 + y)^2)/(2 sigma^2)) Pi sigma^2),
 0.5/(E^(((-3.5 + x)^2 + (-1.5 + y)^2)/(2 sigma^2)) Pi sigma^2),
 0.5/(E^(((-3.5 + x)^2 + (-2.5 + y)^2)/(2 sigma^2)) Pi sigma^2)}

Sum these functions, then divide by the number of exemplars.

ClearAll[classA];
classA[x_,y_] :=
Apply[Plus,gaussList]/Length[exemplarsA]

Now let's plot the result.

Plot3D[classA[x,y]/.sigma->0.1,{x,0,5},{y,0,4},
PlotPoints->25,PlotRange->All];

With a sigma of 0.1, the classA function represents the exemplars well, but does not approximate the desired distribution over the entire class. By adjusting the value of sigma, we can make the distribution more reasonable.

Plot3D[classA[x, y] /. sigma -> 0.45, {x, 0, 5}, {y, 0, 4},
 PlotPoints -> 25, PlotRange -> All];

To continue with this example, you should repeat the above development to construct a function classB. These functions can then be used in the Bayes' decision rule, Eq. (4.11), to classify any point in the plane.

Figure 4.8 This figure shows the connectivity of a PNN designed to classify input vectors into one of two classes. The input values are fully connected to the layer of pattern units. Pattern units that correspond to a single class are connected to one summation unit. Both summation units are connected to the output unit. Details of the processing performed by these units are in the text.

Notice that points that are very far from either class will still be classified in one of the two classes. You can fix this problem by requiring some threshold value for either class function to validate the classification.
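A minimal sketch of such a thresholded classifier, assuming classB has been built by repeating the classA construction with exemplarsB (the name classify and the threshold value are illustrative):

classify[xv_, yv_, threshold_:0.05] :=
Module[{a, b},
 a = classA[x, y] /. {x -> xv, y -> yv, sigma -> 0.45};
 b = classB[x, y] /. {x -> xv, y -> yv, sigma -> 0.45};
 Which[Max[a, b] < threshold, 0, (* too far from both classes *)
  a >= b, 1,                     (* class A *)
  True, -1]                      (* class B *)
]

Let's move on now to the implementation of this methodology in a neural network.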

4.4 The Probabilistic Neural Network


The probabilistic neural network (PNN) is a feed-forward neural net¬
work that implements a Bayesian decision strategy for classifying input
vectors. The basic architecture appears in Figure 4.8.
The input units distribute their values unchanged to the units of
the second layer. We shall assume that the input vectors have all been
normalized to unity.
Each unit in the second layer is called a pattern unit. There is one

pattern unit for each exemplar in the training set. The weight vector for
each pattern unit is a copy of the corresponding exemplar, and is also
normalized. Moreover, training is accomplished by adding pattern units
with the appropriate weight vectors in place.
In the pattern units, we take the dot product of the input vector and the weight vector, as is typically done in feed-forward networks. We then apply a nonlinear output function, although we depart from the sigmoid function commonly used. Instead of the sigmoid, we use the Gaussian function, e^{(net - 1)/\sigma^2}, where net = x · w is the net input to the unit, x is the input vector, w is the weight vector, and \sigma is a smoothing parameter.
Since both x and w are normalized, the value of net is restricted to the range -1 ≤ net ≤ 1. Let's define the output function and plot it between these limits.
gaussOut[x_] := E^((x - 1)/sigma^2)

Plot[gaussOut[x] /. sigma -> .7, {x, -1, 1}, PlotRange -> All];

This Gaussian output function is equivalent to the exponential term in


the Gaussian distribution function that we used in the previous section,
provided both the input and weight vectors are normalized. You can
verify this statement with a simple calculation.
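For instance (a quick numerical check with arbitrary vectors):

xv = {2.5, 1.5}/Sqrt[{2.5, 1.5} . {2.5, 1.5}] // N;
wv = {3.5, 2.5}/Sqrt[{3.5, 2.5} . {3.5, 2.5}] // N;
sig = 0.45;
Chop[Exp[-(xv - wv) . (xv - wv)/(2 sig^2)] - Exp[(xv . wv - 1)/sig^2]]

0

The two forms agree because, for unit vectors, (x - w) . (x - w) = 2 - 2 x . w = 2(1 - net).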
By summing the outputs of all of the pattern units belonging to a single class, we have in effect computed the a posteriori probability distribution function for that class, evaluated at the input point. In other words, if the class is A and the input vector is x, then the combination of the pattern units and the summation unit for class A computes

f_A(x) = (2\pi)^{p/2} \sigma^p n_A P(x|A)

where n_A is the number of exemplars in class A. The decision boundary occurs at P(x|A) = K P(x|B), or in terms of the f functions

\frac{f_A(x)}{n_A} = K \frac{f_B(x)}{n_B}

or

f_A(x) - C f_B(x) = 0

where

C = \frac{P(B) n_A}{P(A) n_B}
This result suggests that we configure the output unit to have two
inputs: one from each summation unit. The connection from class A
will have a weight of one, and the connection from class B will have a
weight of -C. The output unit computes the net input as usual. The
output function can be the sign function: If the net input is positive, the
output is +1, corresponding to class A; if the net input is negative, the
output is -1, corresponding to class B.
You can make a final simplification if you can be sure that the number
of exemplars from each class is taken in proportion to the corresponding
a priori probability. In that case, C = 1 and the weight on the connection
from class B is just -1.
Let's step through an example using the classes described in the previous section:

exemplarsA

{{2.5, 1.5}, {2.5, 2.5}, {3.5, 1.5}, {3.5, 2.5}}

exemplarsB

{{2.5, -1.5}, {2.5, -2.5}, {3.5, -1.5}, {3.5, -2.5}}

Since the input vectors must be normalized, we must define a function


to accomplish that task.

normalize[x_List] := x/(Sqrt[x . x] // N)

Since each exemplar array has more than one vector in it, use the Map function to perform the normalization.

exemplarsAnorm = Map[normalize,exemplarsA]

{{0.857493, 0.514496}, {0.707107, 0.707107},


{0.919145, 0.393919}, {0.813733, 0.581238}}

exemplarsBnorm = Map[normalize, exemplarsB]

{{0.857493, -0.514496}, {0.707107, -0.707107},


{0.919145, -0.393919}, {0.813733, -0.581238}}

Before continuing on with the calculation, let's plot the original exemplars from class A along with the normalized points.

p1 = ListPlot[exemplarsA, PlotStyle -> PointSize[0.05],
 PlotRange -> {{0, 5}, {0, 4}},
 Prolog -> {Line[{{2, 1}, {2, 3}}],
  Line[{{2, 3}, {4, 3}}],
  Line[{{4, 3}, {4, 1}}],
  Line[{{4, 1}, {2, 1}}]}];


l1 = ListPlot[exemplarsAnorm];




Show[{p1, l1}];


When you normalize vectors in the plane, all of the points project down
to the unit circle. All points that lie along a line in a particular direction
from the origin, when normalized, will fall on the same point on the
unit circle. Thus, you would not be able to separate classes that are
in different regions of the plane, but that lie along the same direction.
When you want to use normalization of input vectors, make sure that the
direction of the vector is the only relevant attribute, and that the vector's
magnitude does not matter. With that point made, let's return to the
calculation.
The weights on the pattern units are equal to the normalized exemplar vectors.

weightsA = exemplarsAnorm;
weightsB = exemplarsBnorm;

For input points, we shall use a table of random numbers, generally


within the range of the two classes.

inputs = Table[{Random[Real, {1, 5}],
 Random[Real, {-4, 4}]}, {10}]

{{2.52819, -1.74313}, {1.27255, -0.78729},
 {3.03809, -0.768469}, {4.08654, 2.79729},
 {2.91019, -0.760464}, {2.27138, -3.95153}, {2.5855, 1.5419},
 {3.52378, -1.41451}, {2.65525, -1.11967},
 {1.25173, 0.721864}}

Seven of these vectors belong to class B and three (components 4, 7, and


10) belong to class A. Next, we must normalize these input vectors.

inputsNorm = Map[normalize, inputs]

{{0.823282, -0.567633}, {0.850408, -0.526124},


{0.969467, -0.245222}, {0.82519, 0.564855},
{0.967513, -0.252821}, {0.498348, -0.866977},
{0.858867, 0.512198}, {0.928022, -0.372525},
{0.921428, -0.388549}, {0.866273, 0.499571}}

Net inputs to the pattern units are the standard dot products. Since there is more than one input vector, we cannot simply dot inputsNorm with weightsA and weightsB.

patternAnet = inputsNorm . Transpose[weightsA]

{{0.413913, 0.180771, 0.533114, 0.340002},


{0.45853, 0.229303, 0.574397, 0.386202},
{0.705146, 0.512119, 0.794483, 0.646356},
{0.99821, 0.98291, 0.980977, 0.9998},
{0.69956, 0.505363, 0.789694, 0.640348},
{-0.0187266, -0.260661, 0.116535, -0.0983982},
{0.999996, 0.96949, 0.991188, 0.996598},
{0.60411, 0.392796, 0.706242, 0.538637},
{0.590211, 0.376802, 0.693869, 0.523957},
{0.99985, 0.965798, 0.993021, 0.995285}}

patternBnet = inputsNorm . Transpose[weightsB]

{{0.998003, 0.983525, 0.980317, 0.999862},


{0.999907, 0.973355, 0.988899, 0.997809},
{0.957477, 0.858915, 0.987678, 0.93142},
{0.416979, 0.184085, 0.535962, 0.34317},
{0.959711, 0.862907, 0.988876, 0.934247},
{0.873386, 0.965431, 0.799573, 0.909442},
{0.472949, 0.245132, 0.587659, 0.40118},
{0.987435, 0.919626, 0.999732, 0.971688},
{0.990025, 0.926294, 0.999983, 0.975636},
{0.485795, 0.259297, 0.599439, 0.414545}}

We determine the outputs of the pattern units by applying the Gaussian output function. We must also select a value for sigma.

sigma = 0.45;
patternAout = gaussOut[patternAnet]

{{0.0553402, 0.0174996, 0.0996977, 0.0384172},


{0.0689808, 0.0222389, 0.122243, 0.0482624},
{0.23315, 0.0898791, 0.362439, 0.174402},
{0.991201, 0.91907, 0.910335, 0.999014},
{0.226807, 0.0869301, 0.353967, 0.169304},
{0.00653392, 0.00197837, 0.0127428, 0.00440864},
{0.999982, 0.860133, 0.957419, 0.983341},
{0.141563, 0.0498599, 0.234417, 0.102455},
{0.132172, 0.0460734, 0.220522, 0.0952902},
{0.99926, 0.844593, 0.966123, 0.976985}}

patternBout = gaussOut[patternBnet]

{{0.990187, 0.921865, 0.907374, 0.999318},


{0.999542, 0.876709, 0.946653, 0.989237},
{0.810591, 0.498218, 0.940967, 0.71272},
{0.0561845, 0.0177884, 0.10111, 0.0390229},
{0.819584, 0.508137, 0.946548, 0.72274},
{0.535124, 0.843063, 0.371664, 0.639417},
{0.0740718, 0.0240471, 0.130517, 0.0519676},
{0.939836, 0.672394, 0.998676, 0.869523},
{0.951934, 0.694904, 0.999916, 0.886642},
{0.078923, 0.0257894, 0.138335, 0.0555132}}

Now we must sum the outputs of the pattern units for each of the input
vectors.

sumAout = Map[Apply[Plus, #]&, patternAout]

{0.210955, 0.261726, 0.859871, 3.81962, 0.837009, 0.0256637,


3.80088, 0.528294, 0.494058, 3.78696}

sumBout = Map[Apply[Plus, #]&, patternBout]

{3.81874, 3.81214, 2.9625, 0.214106, 2.99701, 2.38927,


0.280603, 3.48043, 3.5334, 0.298561}

The sign of the difference of each of these output values is the network
output for the corresponding input vector.

pnnTwoClass[class1Exemplars_, class2Exemplars_,
  testInputs_, sig_] :=
Module[{weightsA, weightsB, inputsNorm, patternAout,
  patternBout, sumAout, sumBout, outputs},
 weightsA = Map[normalize, class1Exemplars];
 weightsB = Map[normalize, class2Exemplars];
 inputsNorm = Map[normalize, testInputs];
 sigma = sig;
 patternAout =
  gaussOut[inputsNorm . Transpose[weightsA]];
 patternBout =
  gaussOut[inputsNorm . Transpose[weightsB]];
 sumAout = Map[Apply[Plus, #]&, patternAout];
 sumBout = Map[Apply[Plus, #]&, patternBout];
 outputs = Sign[sumAout - sumBout];
 sigma =.;
 Return[outputs];
]
Listing 4.5

outputs = Sign[sumAout-sumBout]

{-1, -1, -1, 1, -1, -1, 1, -1, -1, 1}


These results are consistent with our analysis of the original ten input
vectors. The function pnnTwoClass, shown in Listing 4.5, implements the
two-class PNN.

Summary
In this chapter we have explored two different ways in which probabilistic concepts can be used to advantage in neural networks. The Ising model from statistical mechanics has a direct analog with the Hopfield neural network. Using concepts from statistical mechanics, in particular the concepts of temperature and stochastic processes, we can change the processing performed by neural-network units from deterministic to stochastic. In doing so, we can endow the network with the ability to escape from local minimum states. Bayesian pattern classification is a traditional methodology based on probabilistic concepts. By drawing on this technology as a basis, we were able to construct a probabilistic neural network which embodies the main features of Bayesian pattern classification.
Chapter 5

Optimization and Constraint Satisfaction

The subject of this chapter is a class of problems for which there may
be many solutions, but for which one solution may be judged to be
better than another. The classic example of this type of problem is the
traveling salesperson problem: Given a list of cities and a known cost of
traveling from one city to the next, what is the most efficient route such
that all cities are visited, no cities are visited twice, and the total distance
traveled, and hence cost, is kept to a minimum?
Conditions that we impose on the problem, such as the restriction
that each city be visited only once, are called strong constraints. Any
solution must satisfy all strong constraints. On the other hand, the desire
to minimize the cost is a weak constraint, since not all possible solutions
will be minimum-cost solutions. Let's look at some of the details, and
then apply neural networks to the solution of this problem.

5.1 The Traveling Salesperson Problem (TSP)

If you choose a path at random through a given list of cities, you are likely to find that it is not the most efficient in terms of cost. One way to ensure efficiency is to compute the cost for all possible paths, and then to follow the one with the least cost. Unfortunately, such a computation may take an extremely long time. Given any n cities, there are n! possible tours. If you consider the fact that, for a given tour, it does not matter where you begin, or in which direction you travel, then the total number of independent tours is n!/(2n).

numTours[n_]:=n!/(2 n)

For a small number of cities, you would simply compute the cost of each
tour, and choose the one with the minimum. Unfortunately, for a tour
of more than a few cities, this exhaustive search can become quite time
consuming. For five cities, the number of possible tours is

numTours[5]

12

We could easily compute the cost of these twelve tours and select the most
efficient one. However, if there are ten cities on the tour, the number of
different possibilities is

numTours[10]

181440

You can see that the computations involved will quickly overwhelm us
for a tour of any size greater than a few cities. Let's examine some of the
details of this problem by considering a specific case: a five-city tour.
We can express the cost of travel from one city to another using a
matrix of values where each row refers to a starting city, and each column
refers to a destination city. We assume that the cost of travel from one city
to another is independent of the direction of travel, making the matrix
symmetric. Moreover, all of the diagonal elements will be zero, since
there will be no travel from one city to itself. We construct such a matrix
as follows, choosing random values for the costs. First, we construct a
lower triangular matrix.

costs =
 Table[If[i <= j, 0, Random[Integer, {1, 10}]], {i, 5}, {j, 5}]

{{0, 0, 0, 0, 0}, {4, 0, 0, 0, 0}, {3, 9, 0, 0, 0},
 {8, 4, 6, 0, 0}, {3, 4, 8, 5, 0}}

MatrixForm[%]

0 0 0 0 0
4 0 0 0 0
3 9 0 0 0
8 4 6 0 0
3 4 8 5 0

By transposing the matrix and adding it to the original, we construct a


symmetric matrix with all diagonal elements equal to zero.

costs = costs + Transpose[costs]

{{0, 4, 3, 8, 3}, {4, 0, 9, 4, 4}, {3, 9, 0, 6, 8},
 {8, 4, 6, 0, 5}, {3, 4, 8, 5, 0}}

MatrixForm[%]

0 4 3 8 3
4 0 9 4 4
3 9 0 6 8
8 4 6 0 5
3 4 8 5 0

To compute the cost of a particular tour, we add the appropriate elements, for example:

t1 = costs[[1,2]] + costs[[2,3]] + costs[[3,4]] + costs[[4,5]] + costs[[5,1]]

27

Continuing the process for the remaining tours, we find:

t2 = costs[[1,2]] + costs[[2,3]] + costs[[3,5]] + costs[[5,4]] + costs[[4,1]]

34

t3 = costs[[1,2]] + costs[[2,4]] + costs[[4,5]] + costs[[5,3]] + costs[[3,1]]

24

and so on. The cheapest tour is tour nine, and the range goes from 20 to
34.
Remember that your results may be quite different. Nevertheless,
you can see that any random selection is likely to result in a tour that is
not optimum from a cost standpoint.
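Rather than writing out each sum by hand, we can also sketch the exhaustive search itself (the helpers tourCost and allTours are mine). Fixing city 1 as the starting point leaves the Permutations of the remaining four cities; each distinct tour still appears twice, once per direction, but that does not affect the minimum.

tourCost[c_, tour_] :=
 Apply[Plus, MapThread[c[[#1, #2]]&, {tour, RotateLeft[tour]}]]

allTours = Map[Prepend[#, 1]&, Permutations[{2, 3, 4, 5}]];

Min[Map[tourCost[costs, #]&, allTours]]

20

which confirms the minimum tour cost of 20 quoted above for this particular costs matrix.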
It is often the case with problems such as the TSP, that a good solution
obtained quickly is more desirable than the best solution obtained after
laborious calculation. In the example above, we might be quite satisfied
with any solution whose cost is less than 25. In the next section we shall
look at how to apply a neural network to this problem. We shall see that
the network can provide a solution quickly (relative to an exhaustive
search), but that the solution is not always the absolute best.

5.2 Neural Networks and the TSP

In this section we shall look at how to apply a Hopfield network to the


solution of constraint satisfaction problems in general and to the TSP in
particular. We use a different procedure for calculating the weights than
we used for the associative memory application in Chapter 4. Although
there are several different ways of determining weights for the TSP, we
shall follow a method outlined by Page, Tagliarini, and Christ1. First,
however, we shall need to modify the Hopfield network slightly so that
the unit output values are continuous functions of the net input, rather
than binary.
1 Page, Ed, G. A. Tagliarini, and F. Christ. Optimization using neural networks. IEEE Transactions on Computers, Vol. 40, No. 12, pp. 1347-58, Dec. 1991.

5.2.1 The Continuous Hopfield Network

By allowing the output of the units in a Hopfield network to be a continuous function of the net-input value, the units will more closely resemble the biological neurons they emulate. Moreover, there exists an analogous electrical circuit, using nonlinear amplifiers, resistors, and capacitors, which suggests the possibility of building a continuous Hopfield memory circuit using VLSI technology.
To develop the continuous model, we shall define u_i to be the net input to the ith processing element. One possible biological analog of u_i is the summed action potentials at the axon hillock of a neuron. In the case of the neuron, the output of the cell would be a series of potential spikes whose mean frequency versus total action potential resembles the sigmoid curve that we introduced in previous chapters.
We use the following function as the output function,

g[lambda_, u_] := 0.5 (1 + Tanh[lambda u])

where lambda is a constant called the gain parameter. Let's plot this function for several values of lambda. We can use the Mathematica package Graphics`Legend` to help us distinguish the various graphs.

<<Graphics`Legend`

Plot[{g[0.2, u], g[0.5, u], g[1, u], g[5, u]}, {u, -5, 5},
 PlotStyle -> {GrayLevel[0], Dashing[{0.01}],
  Dashing[{0.03}], Dashing[{0.05}]},
 PlotLegend -> {"lambda=0.2", "lambda=0.5", "lambda=1", "lambda=5"}];

The function g[0.5,u] is identical to the sigmoid function.



Plot[{g[0.5, u], sigmoid[u]}, {u, -5, 5}];
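As a quick check of that identity: since Tanh[u/2] = (1 - E^-u)/(1 + E^-u), we have g[0.5, u] = (1/2)(1 + Tanh[u/2]) = 1/(1 + E^-u), which is exactly the sigmoid function.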

We shall denote v_i = g[\lambda, u_i] as the output of the ith unit. In real neurons, there will be a time delay between the appearance of the outputs, v_j, of other cells, and the resulting net input, u_i, to a cell. This delay is caused by the resistance and capacitance of the cell membrane and the finite conductance of the synapse between the jth and ith cells. These ideas are incorporated into the circuit shown in Figure 5.1. At each connection, we place a resistor having a value R_{ij} = 1/|T_{ij}|, where T_{ij} represents the weight matrix. Inverting amplifiers simulate inhibitory signals. If the output of a particular element excites some other element, then the connection is made with the signal from the noninverting amplifier. If the connection is inhibitory, it is made from the inverting amplifier.
Each amplifier has an input resistance, \rho, and an input capacitance, C, as shown. Also shown are the external signals, I_i. In the case of an actual circuit, the external signals would supply a constant current to each amplifier.
The net-input current to each amplifier is the sum of the individual current contributions from other units, plus the external-input current, minus leakage across the input resistor, \rho. The contribution from each connecting unit is the voltage value across the resistor at the connection, divided by the connection resistance. For the connection from the jth unit to the ith, this contribution would be (v_j - u_i)/R_{ij} = (v_j - u_i)T_{ij}. The leakage current is u_i/\rho. If we make the definition

\frac{1}{R_i} = \frac{1}{\rho} + \sum_j \frac{1}{R_{ij}} \qquad (5.1)
then we can write a differential equation describing the input voltage for
each amplifier by considering the charging of the capacitor as a result of

Figure 5.1 In this circuit diagram for the continuous Hopfield memory, amplifiers with a sigmoid output characteristic are the processing elements. The black circles at the intersection points of the lines represent connections between processing elements.

the total input current.

C \frac{du_i}{dt} = \sum_j T_{ij} v_j - \frac{u_i}{R_i} + I_i \qquad (5.2)

These equations, one for each unit in the memory circuit, completely
describe the time evolution of the system. Unfortunately, since these
equations are a set of coupled differential equations, they cannot be
solved in closed form. If each processing element is given an initial
value, u_i(0), these equations can be solved on a digital computer using
the numerical techniques for initial-value problems. Before we can proceed,
however, we must determine the value of the weight matrix, T_ij,
and the external inputs, I_i.

5.2.2 Calculating Weights and External Inputs for the TSP

Provided that the gain parameter is sufficiently high, we can write the
energy function of the continuous Hopfield network as

    E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} T_{ij} v_i v_j - \sum_{i=1}^{N} v_i I_i    (5.3)

A solution to the TSP, or for that matter any constraint-satisfaction
problem, will minimize the value of the energy function. We can exploit
this fact to determine the values for T_ij and I_i. In particular, we shall be
interested in networks for which the outputs equilibrate at values near
zero or one, even though each unit's output may assume any value in
the range of zero to one.
In our Hopfield network for the TSP, each unit represents a hypothesis
that we visit a particular city at a particular point in the sequence of the
tour. Figure 5.2 illustrates the data representation.

The n-out-of-N Problem    Suppose for the moment that the only constraint
in the problem is that n cities are visited out of a total of N. We
can represent that constraint in the form of an equation as follows:

    \sum_{i=1}^{N} v_i = n                                                  (5.4)

where n is the number of cities, and N = n^2 is the number of units in
the network. The energy function

    E = \left( \sum_{i=1}^{N} v_i - n \right)^2                             (5.5)

is a minimum when n of the units have outputs of 1. Furthermore, if we
add the term

    \sum_{i=1}^{N} v_i (1 - v_i)                                            (5.6)

then the energy function will be a minimum for v_i \in \{0, 1\}, since this
condition minimizes the new term. The new energy function is then

    E = \left( \sum_{i=1}^{N} v_i - n \right)^2 + \sum_{i=1}^{N} v_i (1 - v_i)    (5.7)

      A       B       C       D       E
    01000   10000   00010   00001   00100

                    (a)

        1 2 3 4 5
        0 1 0 0 0    A
        1 0 0 0 0    B
        0 0 0 1 0    C
        0 0 0 0 1    D
        0 0 1 0 0    E

                    (b)

Figure 5.2 (a) In this representation scheme for the output vectors in a five-city TSP problem,
five units are associated with each of the five cities. The cities are labeled A through E.
The position of the 1 within any group of five represents the location of that particular
city in the sequence of the tour. For this example, the sequence is B-A-E-C-D with the
return to B assumed. Notice that N = n^2 processing elements are required to represent
the information for an n-city tour. (b) This figure shows an alternative way of looking at
the units. The processing elements are arranged in a two-dimensional matrix configuration
with each row representing a city and each column representing a position on the sequence
of the tour.
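To make the representation concrete, here is a small helper of my own
(not code from the book) that converts a tour, given as the list of cities
in visiting order, into the 0/1 matrix of Figure 5.2(b).

(* rows are cities, columns are positions in the tour sequence *)
tourToMatrix[tour_List] :=
  Table[If[Position[tour, city][[1,1]] == pos, 1, 0],
        {city, Length[tour]}, {pos, Length[tour]}];
MatrixForm[tourToMatrix[{2, 1, 5, 3, 4}]]   (* the B-A-E-C-D tour of the figure *)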

If we expand Eq. (5.7) and ignore the constant n^2 term, we can rewrite the
energy function as

    E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} (-2) v_i v_j - \sum_{i=1}^{N} v_i (2n - 1)    (5.8)

which is identical to Eq. (5.3), provided we make the following definitions:

    T_{ij} = \begin{cases} -2 & i \neq j \\ 0 & \text{otherwise} \end{cases}, \qquad I_i = 2n - 1          (5.9)

for all i. Notice that each unit in the network exerts an inhibitory strength
of -2 on all other units in the network. Moreover, the number of units
to be on is strictly a function of the external inputs, I_i.
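We can let Mathematica confirm the algebra behind Eq. (5.8). The
following check is my own addition: it expands Eq. (5.7) for the small
case N = 3, n = 2 and subtracts the weight-and-input form of Eq. (5.8),
leaving only the ignored constant n^2.

vs = {v1, v2, v3};
e57 = (Apply[Plus, vs] - 2)^2 + Apply[Plus, vs (1 - vs)];   (* Eq. (5.7) *)
e58 = -(1/2) Sum[If[i != j, -2, 0] vs[[i]] vs[[j]], {i,3}, {j,3}] -
      (2*2 - 1) Apply[Plus, vs];                            (* Eq. (5.8) *)
Expand[e57 - e58]   (* --> 4, the n^2 term we agreed to ignore *)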

Before we continue on with the weight calculation for the TSP, let's
apply the results so far to a simple problem where we have four units,
and only two of them are to be on; it does not matter which two. We
can calculate the weight matrix from Eq. (5.9).

testWts1 = Table[If[i != j, -2, 0], {i,4}, {j,4}]

{{0, -2, -2, -2}, {-2, 0, -2, -2}, {-2, -2, 0, -2},
 {-2, -2, -2, 0}}

MatrixForm[%]

0 -2 -2 -2
-2 0 -2 -2
-2 -2 0 -2
-2 -2 -2 0

Since n = 2, the vector of external inputs is

testIn1 = Table[2*2 - 1, {4}]

{3, 3, 3, 3}

We must integrate the equations for each of the units, as specified in
Eq. (5.2). In order to perform that integration, we first make the assumptions
that C = 1 and that R_i = R = 1 for all units. Furthermore, we
approximate the derivative by a difference equation:

    \frac{du_i}{dt} \approx \frac{\Delta u_i}{\Delta t} = \frac{u_i(t+1) - u_i(t)}{\Delta t}    (5.10)

Let's assign random starting values to the net inputs of the units,

ui = Table[Random[], {4}]

{0.114204, 0.332446, 0.551176, 0.708964}

and values for lambda and deltat; in addition we should initialize the vi
array that we shall need later.

lambda=2;
deltat = 0.01;
vi = g[lambda,ui];

The current output values are given by the array vi.

vi

{0.612258, 0.790805, 0.900671, 0.944583}

We begin the calculation by selecting a unit at random,

indx = Random[Integer, {1,4}]

Then we calculate the new net-input value for that unit.

ui[[indx]] = ui[[indx]] +
    deltat (vi . testWts1[[indx]] -
            ui[[indx]] + testIn1[[indx]])

0.31662

Calculate the new output values.

vi[[indx]] = g[lambda,ui[[indx]]]

0.78014

vi

{0.612258, 0.78014, 0.900671, 0.944583}

To continue we would select another unit at random and perform the
required updates until the network settled into a stable solution. Let's
assemble the pieces into a function called nOutOfN to indicate the specific
problem we are solving. See Listing 5.1. Let's try the code with our
example problem.

nOutOfN[testWts1, testIn1, 4, 10, 0.1, 100, 20, True];

initial ui = {0.405085, 0.677264, 0.0962308, 0.0902886}


initial vi = {0.999697, 0.999999, 0.872652, 0.85885}

iteration = 20
net inputs = {1.63788, 0.772332, -0.209322, -0.534506}

outputs =
{1., 1., 0.0149727, 0.0000227684}

nOutOfN[weights_, externIn_, numUnits_, lambda_, deltaT_,
        numIters_, printFreq_, reset_:False] :=
Module[{iter, l, dt, indx, ins},
  dt = deltaT;
  l = lambda;
  iter = numIters;
  ins = externIn;
  (* only reset if starting over *)
  If[reset, ui = Table[Random[], {numUnits}];
     vi = g[l, ui], Continue];              (* end of If *)
  Print["initial ui = ", N[ui, 2]]; Print[];
  Print["initial vi = ", N[vi, 2]];
  For[iter = 1, iter <= numIters, iter++,
    indx = Random[Integer, {1, numUnits}];
    ui[[indx]] = ui[[indx]] +
      dt (vi . weights[[indx]] - ui[[indx]] + ins[[indx]]);
    vi[[indx]] = g[l, ui[[indx]]];
    If[Mod[iter, printFreq] == 0,
      Print[]; Print["iteration = ", iter];
      Print["net inputs = "];
      Print[N[ui, 2]];
      Print["outputs = "];
      Print[N[vi, 2]]; Print[];
    ];                                      (* end of If *)
  ];                                        (* end of For *)
  Print[]; Print["iteration = ", iter];
  Print["final outputs = "];
  Print[vi];
];                                          (* end of Module *)

Listing 5.1

iteration = 40
net inputs = {4.1303, 1.35895, -1.35663, -1.04605}
outputs =
{1., 1., 1.64622 10^-12, 8.20544 10^-10}

iteration = 60
net inputs = {4.64333, 5.11851, -2.13667, -2.62471}
outputs =
{1., 1., 2.71051 10^-19, 0.}

iteration = 80
net inputs = {5.82843, 7.9581, -4.05164, -7.54687}
outputs =
{1., 1., 0., 0.}

iteration = 100
net inputs = {11.097, 16.4568, -5.11248, -12.7648}
outputs =
{1., 1., 0., 0.}

iteration = 100
final outputs =
{1., 1., 0., 0.}
You can see that the network quickly settles on an appropriate solution.
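Because ui and vi are deliberately global in nOutOfN, we can also evaluate
the energy of Eq. (5.3) at the final state. This little check is my own
addition and assumes the run above has just finished.

(* continuous-Hopfield energy of Eq. (5.3) *)
energy53[v_, w_, in_] := -0.5 v.w.v - v.in;
energy53[vi, testWts1, testIn1]   (* --> -4., the minimum for any 2-out-of-4 state *)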

Adding Constraints    The TSP is a bit more complicated than a simple
n-out-of-N problem. In a five-city problem, we would have N = 25
units. In a simple n-out-of-N problem, the network would converge on
a solution with five cities (a 5-out-of-25 problem in this case), but
there would be no guarantee that we would not violate some other hard
constraint; for example, the constraint that we should visit each city only
once. To illustrate how we can account for the additional constraints,
recall the 2-out-of-4 problem that we did in the previous section, and add
the additional constraint that of the two units selected, one must be from
units one and two, and one must be from units three and four. Each of
these constraints is a 1-out-of-2 problem.
These constraints translate to additional weight values of -2 between
the units of each of the pairs of units. For example, units one and two
would each exert an additional -2 inhibitory connection on the other
unit. We add these weights to the original weights.

testWtsAdd = { { 0,-2, 0, 0}, {-2, 0, 0, 0}, { 0, 0, 0,-2}, { 0, 0,-2, 0} }

{{0, -2, 0, 0}, {-2, 0, 0, 0}, {0, 0, 0, -2}, {0, 0, -2, 0}}

MatrixForm[testWtsAdd]

 0 -2  0  0
-2  0  0  0
 0  0  0 -2
 0  0 -2  0

All four of the units receive an additional +1 external-input value,
according to the second part of Eq. (5.9).

testInAdd = {1,1,1,1}

{1, 1, 1, 1}

MatrixForm[testWts2 = testWts1 + testWtsAdd]

 0 -4 -2 -2
-4  0 -2 -2
-2 -2  0 -4
-2 -2 -4  0

testIn2 = testIn1 + testInAdd

{4, 4, 4, 4}

TSP Solutions    Let's rerun the network with the weights that we
calculated in the previous section. First we shall make a few changes in
the nOutOfN code to accommodate specifics of the TSP. The function tsp
appears in Listing 5.2.
One particular item to note about the code is the way we calculate
the initial u_i values. We know that the sum of the output values should
be equal to Sqrt[numUnits] when the network has settled on a solution.
We calculate an initial u_i so that the network starts out with the sum of
its outputs equal to the proper number; in addition, we add a little
random noise to the values to give the network a start.
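A quick check of that initialization scheme (my own, not from the book):
each unit's output starts near 1/Sqrt[numUnits], so the numUnits outputs
sum to approximately Sqrt[numUnits].

With[{numUnits = 9, l = 50},
  numUnits g[l, ArcTanh[(2.0/Sqrt[numUnits]) - 1]/l]]   (* --> 3. *)

Let's run the code with the new weight and input values.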

tsp[weights_, externIn_, numUnits_, lambda_, deltaT_,
    numIters_, printFreq_, reset_:False] :=
Module[{iter, l, dt, indx, ins, utemp},
  dt = deltaT;
  l = lambda;
  iter = numIters;
  ins = externIn;
  (* only reset if starting over *)
  If[reset,
    utemp = ArcTanh[(2.0/Sqrt[numUnits]) - 1]/l;
    ui = Table[utemp + Random[Real, {-utemp/10, utemp/10}],
               {numUnits}];                 (* end of Table *)
    vi = g[l, ui], Continue];               (* end of If *)
  Print["initial ui = ", N[ui, 2]]; Print[];
  Print["initial vi = ", N[vi, 2]];
  For[iter = 1, iter <= numIters, iter++,
    indx = Random[Integer, {1, numUnits}];
    (* the TSP weight matrix is symmetric, so its row suffices here *)
    ui[[indx]] = ui[[indx]] +
      dt (vi . weights[[indx]] -
          ui[[indx]] + ins[[indx]]);
    vi[[indx]] = g[l, ui[[indx]]];
    If[Mod[iter, printFreq] == 0,
      Print[]; Print["iteration = ", iter];
      Print["net inputs = "];
      Print[N[ui, 2]];
      Print["outputs = "];
      Print[N[vi, 2]]; Print[];
    ];                                      (* end of If *)
  ];                                        (* end of For *)
  Print[]; Print["iteration = ", iter];
  Print["final outputs = "];
  Print[MatrixForm[Partition[N[vi, 2], Sqrt[numUnits]]]];
];                                          (* end of Module *)

Listing 5.2

tsp[testWts2, testIn2, 4, 10, 0.1, 100, 20, True];

initial ui = {0.892655, 0.46726, 0.996994, 0.989973}


initial vi = {1., 0.999913, 1., 1.}

iteration = 20
net inputs = {1.12986, -0.700786, 0.3861, 0.25233}
outputs =
{1., 8.18559 10^-7, 0.999557, 0.99361}

iteration = 40
net inputs = {3.20793, -1.61986, -1.13265, 1.23898}
outputs =
{1., 8.5124 10^-15, 1.45196 10^-10, 1.}

iteration = 60
net inputs = {5.62494, -1.98185, -4.71511, 4.31185}
outputs =
{1., 6.12574 10^-18, 0., 1.}

iteration = 80
net inputs = {6.38743, -4.41281, -13.8339, 8.1653}
outputs =
{1., 0., 0., 1.}

iteration = 100
net inputs = {8.14879, -9.36068, -23.5006, 17.8093}
outputs =
{1., 0., 0., 1.}

iteration = 100
final outputs =
{1., 0., 0., 1.}
Notice that the additional constraint has been satisfied by this solution:
one "on" unit is among the first two units, and one is among the final
two units.
Think back to the data representation for the TSP given in Figure 5.2.
In order to account for the fact that each city is visited only once, units in
each row of Figure 5.2(b) would have to exert an inhibitory connection
on all other units in the same row. This situation amounts to a 1-out-of-5
problem for each row. Similarly, since you can only visit one city at a

time, units in each column would have to inhibit all other units in the
column.
We can construct the weight matrix for the TSP starting with the
original n-out-of-N matrix. Because a five-city problem results in a weight
matrix having 25^2 = 625 elements, we would be better off here to consider a
simple three-city problem. In that case, n = 3 and N = 9, and the weight
matrix has only 81 elements. First, we construct the part of the weight
matrix that accounts for the 3-out-of-9 constraint.

MatrixForm[tspWts1 = Table[If[i != j, -2, 0], {i,9}, {j,9}]]

 0 -2 -2 -2 -2 -2 -2 -2 -2
-2  0 -2 -2 -2 -2 -2 -2 -2
-2 -2  0 -2 -2 -2 -2 -2 -2
-2 -2 -2  0 -2 -2 -2 -2 -2
-2 -2 -2 -2  0 -2 -2 -2 -2
-2 -2 -2 -2 -2  0 -2 -2 -2
-2 -2 -2 -2 -2 -2  0 -2 -2
-2 -2 -2 -2 -2 -2 -2  0 -2
-2 -2 -2 -2 -2 -2 -2 -2  0

tspIn1 = Table[2*3 - 1, {9}]

{5, 5, 5, 5, 5, 5, 5, 5, 5}
We can add the next constraint, that we visit only one city at a time,
with the following weights. Remember, each set of three units represents
all three cities at a particular position on the tour; thus, we must select
only one unit from units 1-3, one from 4-6, and one from 7-9. For example,
among the first three units we would have inhibitory connections
between the unit pairs 1-2, 1-3, 2-1, 3-1, 2-3, and 3-2. The appropriate
additions to the weight matrix and input values are as follows.

tspWts2 = { { 0,-2,-2, 0, 0, 0, 0, 0, 0},
            {-2, 0,-2, 0, 0, 0, 0, 0, 0},
            {-2,-2, 0, 0, 0, 0, 0, 0, 0},
            { 0, 0, 0, 0,-2,-2, 0, 0, 0},
            { 0, 0, 0,-2, 0,-2, 0, 0, 0},
            { 0, 0, 0,-2,-2, 0, 0, 0, 0},
            { 0, 0, 0, 0, 0, 0, 0,-2,-2},
            { 0, 0, 0, 0, 0, 0,-2, 0,-2},
            { 0, 0, 0, 0, 0, 0,-2,-2, 0} };

tspIn2 = Table[2*1 - 1, {9}];

To account for the fact that we must visit each city only once,
we must inhibit the corresponding unit in each of the three groups; for
example, units 1, 4, and 7. The corresponding weights and inputs are

tspWts3 = { { 0, 0, 0,-2, 0, 0,-2, 0, 0},
            { 0, 0, 0, 0,-2, 0, 0,-2, 0},
            { 0, 0, 0, 0, 0,-2, 0, 0,-2},
            {-2, 0, 0, 0, 0, 0,-2, 0, 0},
            { 0,-2, 0, 0, 0, 0, 0,-2, 0},
            { 0, 0,-2, 0, 0, 0, 0, 0,-2},
            {-2, 0, 0,-2, 0, 0, 0, 0, 0},
            { 0,-2, 0, 0,-2, 0, 0, 0, 0},
            { 0, 0,-2, 0, 0,-2, 0, 0, 0} };

tspIn3 = Table[2*1 - 1, {9}];

We must still account for the weak constraint: that constraint having
to do with the distances between the cities. Assume that the distance
between cities one and two is one unit, that between cities two and three
is two units, and that between one and three is three units. At a given step
on the tour, each unit corresponding to a certain city should inhibit the
units at either the next step or the previous step that correspond to other
cities, in proportion to the distance to that city. For example, unit four,
which corresponds to city one, tour position two, should inhibit units
eight (city two, position three) and nine (city three, position three), as
well as units two (city two, position one) and three (city three, position
one). Of course, since there is only one unique tour for the three-city
problem, the results that we get will be trivial. Nevertheless, the network
should settle on a solution that conforms to the strong constraints. The
corresponding weight matrix for the distances is

tspWts4 = 0.2 { { 0, 0, 0, 0,-1,-3, 0,-1,-3},
                { 0, 0, 0,-1, 0,-2,-1, 0,-2},
                { 0, 0, 0,-3,-2, 0,-3,-2, 0},
                { 0,-1,-3, 0, 0, 0, 0, 0, 0},
                {-1, 0,-2, 0, 0, 0, 0, 0, 0},
                {-3,-2, 0, 0, 0, 0, 0, 0, 0},
                { 0,-1,-3, 0, 0, 0, 0, 0, 0},
                {-1, 0,-2, 0, 0, 0, 0, 0, 0},
                {-3,-2, 0, 0, 0, 0, 0, 0, 0} };

The final weight matrix is the sum of the four individual matrices, and
likewise for the external inputs. I have included the factor of 0.2 in the
above formula because the distance constraint is a weak constraint; thus
its effect on the network should not be such as to overpower the other
constraints. A factor of 0.2 may, in fact, be too small, but if you run
the network without any multiplicative factor, you will sometimes get a
solution with only two cities. This result is presumably because of the
stronger inhibitory connections due to the distances between cities. Here
is the weight matrix and the vector of external inputs:

MatrixForm[tspWts = tspWts1 + tspWts2 + tspWts3 + tspWts4]

 0    -4    -4    -4    -2.2  -2.6  -4    -2.2  -2.6
-4     0    -4    -2.2  -4    -2.4  -2.2  -4    -2.4
-4    -4     0    -2.6  -2.4  -4    -2.6  -2.4  -4
-4    -2.2  -2.6   0    -4    -4    -4    -2    -2
-2.2  -4    -2.4  -4     0    -4    -2    -4    -2
-2.6  -2.4  -4    -4    -4     0    -2    -2    -4
-4    -2.2  -2.6  -4    -2    -2     0    -4    -4
-2.2  -4    -2.4  -2    -4    -2    -4     0    -4
-2.6  -2.4  -4    -2    -2    -4    -4    -4     0

tspIn = tspIn1 + tspIn2 + tspIn3

{7, 7, 7, 7, 7, 7, 7, 7, 7}

Now let's run the network. For space considerations, I have suppressed
printing of all but the initial values and final result.

tsp[tspWts, tspIn, 9, 50, 0.002, 800, 200, True];

initial ui = {-0.0072, -0.0067, -0.0068, -0.0071, -0.007,
  -0.0075, -0.0066, -0.0075, -0.0074}

initial vi = {0.33, 0.34, 0.34, 0.33, 0.33, 0.32, 0.34, 0.32, 0.32}

iteration = 800
final outputs =
0.   1.           0.
1.   1.6 10^-19   0.
0.   0.           1.

The solution meets all of the strong constraints, as we might expect. Let's
try again.

tsp[tspWts, tspIn, 9, 50, 0.002, 800, 200, True];

initial ui = {-0.0073, -0.0065, -0.0067, -0.0069, -0.0071,
  -0.0071, -0.0063, -0.0066, -0.0075}

initial vi = {0.32, 0.34, 0.34, 0.33, 0.33, 0.33, 0.35, 0.34, 0.32}

iteration = 800
final outputs =
3.8 10^-14   5.1 10^-9   1.7 10^-14
0.           1.          0.
1.           0.          1.

Notice in this example that the network found a solution that violated
one of the strong constraints (although I had to run the program about
a dozen times before this solution appeared). We can attempt to fix this
problem by increasing the efficacy of the weights associated with those
constraints; thus, we can recompute tspWts as follows:

MatrixForm[tspWts = tspWts1 + 2 tspWts2 + 2 tspWts3 + tspWts4]

 0    -6    -6    -6    -2.2  -2.6  -6    -2.2  -2.6
-6     0    -6    -2.2  -6    -2.4  -2.2  -6    -2.4
-6    -6     0    -2.6  -2.4  -6    -2.6  -2.4  -6
-6    -2.2  -2.6   0    -6    -6    -6    -2    -2
-2.2  -6    -2.4  -6     0    -6    -2    -6    -2
-2.6  -2.4  -6    -6    -6     0    -2    -2    -6
-6    -2.2  -2.6  -6    -2    -2     0    -6    -6
-2.2  -6    -2.4  -2    -6    -2    -6     0    -6
-2.6  -2.4  -6    -2    -2    -6    -6    -6     0

Let's try the network again.

tsp[tspWts, tspIn, 9, 50, 0.002, 800, 200, True];

initial ui = {-0.0069, -0.0062, -0.0072, -0.0062, -0.0071,
  -0.007, -0.0063, -0.0076, -0.0064}

initial vi = {0.33, 0.35, 0.33, 0.35, 0.33, 0.33, 0.35, 0.32, 0.35}
iteration = 800
final outputs =
0. 1. 0.
0. 0. 1.
1. 0. 0.

I ran this program numerous times and never saw a forbidden solution.
That is not to say that you may never see one; nevertheless, the modi¬
fication appears to have helped. I suggest that you construct a four-city
problem on your own. The solutions will not be trivial in that case.
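For a larger problem it helps to test a settled output vector mechanically.
The helper below is my own sketch (it is not in the book): it rounds the
outputs, folds them into the matrix form of Figure 5.2(b), and checks that
every row and every column contains exactly one "on" unit.

validTourQ[v_List] :=
  Module[{m = Partition[Round[v], Sqrt[Length[v]]]},
    (And @@ Map[(Apply[Plus, #] == 1)&, m]) &&
    (And @@ Map[(Apply[Plus, #] == 1)&, Transpose[m]])];
validTourQ[{0., 1., 0.,  0., 0., 1.,  1., 0., 0.}]   (* --> True *)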

5.2.3 Constraints Expressed As Inequalities


So far, our constraints have required that a fixed number of units be
"on." Suppose we have the condition that n or fewer units out of a group
are "on." Consider, for example, the problem we did in Section 5.2.2
where we had four units, and one unit among units one and two, and
one unit among units three and four, were to be "on." Suppose that we
impose the constraint that two or fewer units are to be "on" among units
three and four.
We can impose such a constraint within the context of an n-out-of-N
problem through the addition of extra units, which we shall call slack
units. These units are added to the problem during its solution, but are
ignored during the interpretation of the solution.
Let's set up the problem that we posed in the first paragraph: four
units, with one of the first two on, and two or fewer of the
second two on. This problem requires that we add two slack units to the
second group. For the first group, we set up a 1-out-of-2 problem, and
for the second group, we set up a 2-out-of-4 problem.
The weight matrix for the first group is
group1Wts = { { 0,-2, 0, 0, 0, 0},
              {-2, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0} };
For the second group, the weight matrix is

group2Wts = { { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0, 0, 0, 0},
              { 0, 0, 0,-2,-2,-2},
              { 0, 0,-2, 0,-2,-2},
              { 0, 0,-2,-2, 0,-2},
              { 0, 0,-2,-2,-2, 0} };
The total weight matrix is the sum of the above two matrices,

MatrixForm[groupWts = group1Wts + group2Wts]

0 -2 0 0 0 0
-2 0 0 0 0 0
0 0 0 -2 -2 -2
0 0 -2 0 -2 -2
0 0 -2 -2 0 -2
0 0 -2 -2 -2 0

The first group of two units gets an external input of 2*1-1 = 1, while the
second group gets 2*2-1 = 3.

groupIn = {1,1,3,3,3,3}

{1, 1, 3, 3, 3, 3}

Let's run the network several times to see how the results are distributed.
Once again I have suppressed all output but the initial values and final
results.

nOutOfN[groupWts, groupIn, 6, 10, 0.1, 150, 150, True]

initial ui = {0.53, 0.88, 0.94, 0.54, 0.024, 0.098}


initial vi = {1., 1., 1., 1., 0.62, 0.88}
iteration = 150
final outputs =
{0., 1., 1., 1., 0., 0.}

Notice that the first group of two has only one unit on, while the second
group has two out of four on. Since the two that are on in the second
group are units three and four, the actual solution that we are interested
in is 0, 1, 1, 1, assuming that units five and six are the slack units. Let's
try again.

nOutOfN[groupWts, groupIn, 6, 10, 0.1, 100, 100, True]

initial ui = {0.27, 0.31, 0.3, 0.24, 0.77, 0.97}


initial vi = {1., 1., 1., 0.99, 1., 1.}
iteration = 100
final outputs =
{0., 1., 0., 0., 1., 1.}

In this solution, neither unit three nor unit four is on, but that condition still
satisfies the constraint. The actual solution in this case is 0, 1, 0, 0. Let's
try one more time.

nOutOfN[groupWts, groupIn, 6, 10, 0.1, 100, 100, True]

initial ui = {0.28, 0.56, 0.68, 0.42, 0.7, 0.75}


initial vi = {1., 1., 1., 1., 1., 1.}
iteration = 100
final outputs =
{0., 1., 1., 0., 1., 0.}

The solution here would be 0, 1, 1, 0, which again satisfies all of the
constraints.
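As with the TSP, we can codify the constraint test. The checker below is
my own addition; it verifies the 1-out-of-2 condition on the first group and
the 2-out-of-4 condition on the second group (real units three and four
plus the two slack units).

validQ[v_List] :=
  (Round[v[[1]]] + Round[v[[2]]] == 1) &&
  (Apply[Plus, Round[Take[v, {3, 6}]]] == 2);
validQ[{0., 1., 1., 0., 1., 0.}]   (* --> True *)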

Summary
Constraint satisfaction and optimization problems form a large class for
which the traveling salesperson problem is the prototypical example. In
this chapter we modified the Hopfield network to allow the units to take
on continuous values. Then, using a procedure based on n-out-of-N units
allowed to be in the "on" state, we showed how to calculate the weights
for a Hopfield network that solves the TSP. This method is quite general
and can be applied to many similar constraint-satisfaction problems.
Chapter 6

Feedback and Recurrent Networks

The title of this chapter implies the existence of neural networks whose
outputs find their way back to become inputs, or in which data moves
both forward and backward in the network. There are many varieties of
these networks. One such network is the Hopfield network, which was
the subject of Chapters 4 and 5. The Hopfield network is a derivative of
a two-layer network called a bidirectional associative memory (BAM).
(You could, alternatively, think of the BAM as a generalization of the
Hopfield network.) The BAM is a recurrent network that implements an
associative memory. We shall study the BAM first in this chapter.
Following the BAM, we shall look at two multilayer network architectures
that have feedback paths within their structures. These networks
are named after individuals: Elman and Jordan. With these architectures,
we shall be able to develop neural networks that can learn a time
sequence of input vectors.

6.1 The BAM

The BAM is similar to the Hopfield network in several ways. Like the
Hopfield network, we can compute a weight matrix in advance, provided
we know what we want to store. Moreover, you will notice a similarity
in the way the weights are determined and in the way processing is
done by the individual units in the BAM. This network implements a
heteroassociative memory rather than an autoassociative memory, as was
the case with the Hopfield network. For example, we might store pairs
of vectors representing the names and corresponding phone numbers of
our friends or customers. To see how we can accomplish this feat, let's
look at the BAM architecture.

6.1.1 BAM Architecture and Processing

Figure 6.1 illustrates the architecture of the BAM. The BAM comprises
two layers of units that are fully interconnected between the layers. The
units may, or may not, have feedback connections to themselves.
To apply the BAM to a particular problem, we assemble a set of pairs
of vectors where each pair comprises two pieces of information that we
would like to associate with each other; for example, a name and a phone
number. To store this information in the BAM, we must first represent
each datum as a vector having components in {-1, +1}. We refer to these


Figure 6.1 The BAM shown here has n units on the x layer, and m units on the y layer. For
convenience, we shall call the x vector the input vector, and the y vector, the output vector.
In this network all of the elements in either the x or y vectors must be members of the
set {—1, +1}. All connections between units are bi-directional with weights at each end.
Information passes back and forth from one layer to the other, through these connections.

Feedback connections at each unit may not be present in all BAM architectures.

vectors as bipolar vectors, rather than binary vectors, whose components
would be in the set {0, +1}. Then we construct weight matrices for the x-
and y-layers in a manner similar to that used for the Hopfield network.
Once the weight matrix has been constructed, the BAM can be used to
recall information (for example, a phone number), when presented with
some key information (a name corresponding to a particular phone
number). If the desired information is only partially known in advance or is
noisy (a misspelled name such as "Simth"), the BAM may be able to
correct the error and provide the correct corresponding information (giving
the proper spelling, "Smith," and the correct phone number). Since our
examples here will all be fictitious, we shall make the assumption that
the data-representation issue has been addressed elsewhere. Let's define
some exemplars and examine BAM processing in some detail.

exemplars = {{{1,-1,-1,1,-1,1,1,-1,-1,1}, {1,-1,-1,-1,-1,1}},
             {{1,1,1,-1,-1,-1,1,1,-1,-1}, {1,1,1,1,-1,-1}}};

There are two vector pairs in this example. The first vector of each pair
is the x-layer vector, and the second is the y-layer vector. There are 10
units on the x layer and six on the y layer, and hence, 10 connections to


each y-layer unit and six to each x-layer unit.


There are two weight matrices: one on the connections from the x
layer to the y layer, and one on the connections from the y layer to the x
layer. To construct the first weight matrix (x layer to y layer), we compute
the sum of the outer products of the vector pairs as follows:

    w = \sum_{i=1}^{L} y_i x_i^T                                            (6.1)

where L is the number of vector pairs in the training set. To calculate
the second weight matrix (y layer to x layer), we simply transpose the
matrix in Eq. (6.1). The following function will calculate the x-to-y weight
matrix.

makeXtoYwts[exemplars_] :=
  Module[{temp},
    temp = Map[Outer[Times, #[[2]], #[[1]]]&, exemplars];
    Apply[Plus, temp]
  ]; (* end of Module *)

We can now calculate the two weight matrices.


MatrixForm[x2yWts=makeXtoYwts[exemplars]]

2 0 0 0 -2 0 2 0 -2 0
0 2 2 -2 0 -2 0 2 0 -2
0 2 2 -2 0 -2 0 2 0 -2
0 2 2 -2 0 -2 0 2 0 -2
-2 0 0 0 2 0 -2 0 2 0
0 -2 -2 2 0 2 0 -2 0 2

MatrixForm[y2xWts = Transpose[x2yWts]]

2 0 0 0 -2 0
0 2 2 2 0 -2
0 2 2 2 0 -2
0 -2 -2 -2 0 2
-2 0 0 0 2 0
0 -2 -2 -2 0 2
2 0 0 0 -2 0
0 2 2 2 0 -2
-2 0 0 0 2 0
0 -2 -2 -2 0 2
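Before we exercise the full bidirectional procedure, a one-line sanity
check of my own: pushing each exemplar x through the x-to-y weights
and thresholding should reproduce the paired y. Here Sign stands in for
the threshold rule that we define below.

Map[Sign[x2yWts . #[[1]]] == #[[2]]&, exemplars]   (* --> {True, True} *)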

Unit outputs will be -1 or +1, depending on the value of the net input to
the unit. We calculate net inputs in the usual manner of the dot product
between the input vector and the weight vector for each unit. Then the
output of the unit is given by

    s_i(t+1) = \begin{cases} +1 & \text{net}_i > 0 \\ s_i(t) & \text{net}_i = 0 \\ -1 & \text{net}_i < 0 \end{cases}    (6.2)

where s refers to a unit on either layer, and we use the discrete variable
t to denote a particular timestep. Notice that if the net input is zero, the
output does not change from what it was in the previous timestep.
The two functions, psi and phi, that we defined for the Hopfield network
(see Section 4.1) also apply to the BAM. These functions implement
the output function of the BAM units as specified in Eq. (6.2).

psi[inValue_, netIn_] := If[netIn > 0, 1,
    If[netIn < 0, -1, inValue]];
phi[inVector_List, netInVector_List] :=
    MapThread[psi[#1, #2]&, {inVector, netInVector}];
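A one-line illustration with made-up values: the zero net input in the
second position leaves that component of the input vector unchanged.

phi[{1, -1, 1}, {3, 0, -2}]   (* --> {1, -1, -1} *)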

The energy function for the BAM is also similar to that of the Hopfield
network:

    E = -y \cdot w \cdot x                                                  (6.3)

energyBAM[xx_, w_, zz_] := -(xx . w . zz)

We can use this function to calculate the energy of the network with the
exemplars.

energyBAM[exemplars[[1,2]], x2yWts, exemplars[[1,1]]]

-64

energyBAM[exemplars[[2,2]], x2yWts, exemplars[[2,1]]]

-64

To recall stored information from the BAM, we perform the following
steps:

1. Apply an initial vector pair, (x0, y0), to the processing elements of
   the BAM.

2. Propagate the information from the x layer to the y layer and update
   the values on the y-layer units. Although we shall consistently
   begin with the x-to-y propagation, you could begin in the other
   direction.

3. Propagate the updated y information back to the x layer and update
   the units there.

4. Repeat steps 2 and 3 until there is no further change in the units on
   each layer, or, equivalently, there is no further change in the energy
   of the system.

This algorithm is what gives the BAM its bi-directional nature. The
terms input and output refer to different quantities, depending on the
current direction of the propagation. For example, in going from y to x,
the y vector is considered as the input to the network, and the x vector
is the output. The opposite is true when propagating from x to y.
If all goes well, the final, stable state will recall one of the exemplars
used to construct the weight matrix. Since in this example we assume we
know something about the desired x vector, but perhaps nothing about
the associated y vector, we hope that the final output is the exemplar
whose x vector is closest to the original input vector. There are many
definitions of the word close that are used when discussing neural networks.
In the case of the BAM, we use Hamming distance as the measure
of closeness between two vectors. The Hamming distance between two
vectors is the number of bits that differ between the two. The concept
applies equally to bipolar or binary vectors.
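Hamming distance is easy enough to compute directly. The helper below
is my own sketch rather than code from the book.

hamming[a_List, b_List] := Count[a - b, x_ /; x != 0];
hamming[{1,-1,-1,1}, {1,1,-1,-1}]   (* --> 2 *)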
The above scenario works well provided we have not overloaded the
BAM with exemplars. If we try to put too much information in a given
BAM, a phenomenon known as crosstalk occurs between exemplar patterns.
Crosstalk occurs when exemplar patterns are too close to each
other. The interaction between these patterns can result in the creation
of spurious stable states. In that case, the BAM could stabilize on meaningless
vectors. If we think of the BAM in terms of an energy surface
in weight space, each exemplar pattern occupies a deep minimum well
in the space. Spurious stable states correspond to energy minima that
appear between the minima that correspond to the exemplars.

bam[initialX_, initialY_, x2yWeights_, y2xWeights_, printAll_:False] :=
Module[{done, newX, newY, energy1, energy2},
  done = False;
  newX = initialX;
  newY = initialY;
  While[done == False,
    newY = phi[newY, x2yWeights . newX];
    If[printAll, Print[]; Print[]; Print["y = ", newY]];
    energy1 = energyBAM[newY, x2yWeights, newX];
    If[printAll, Print["energy = ", energy1]];
    newX = phi[newX, y2xWeights . newY];
    If[printAll, Print[]; Print["x = ", newX]];
    energy2 = energyBAM[newY, x2yWeights, newX];
    If[printAll, Print["energy = ", energy2]];
    If[energy1 == energy2, done = True, Continue];
  ]; (* end of While *)
  Print[]; Print[];
  Print["final y = ", newY, " energy= ", energy1];
  Print["final x = ", newX, " energy= ", energy2];
]; (* end of Module *)

Listing 6.2

6.1.2 BAM Processing Examples


Let's use the exemplars and weights that we calculated in the previous
section to exercise the BAM for a number of different initial vectors. First,
let's assemble the necessary code into a procedure as shown in Listing
6.2. For our first example, we shall select an initial x vector that differs
from the first x-vector exemplar by only one bit. The initial y vector will
be equal to the second y-vector exemplar.

initX = {-1,-1,-1,1,-1,1,1,-1,-1,1};
initY = {1,1,1,1,-1,-1};

The energy of a BAM in this initial state is

energyBAM[initY, x2yWts, initX]

40

By setting printAll to True, we can watch the progression of events as the
data propagate back and forth through the BAM.

bam[initX, initY, x2yWts, y2xWts, True]

y = {1, -1, -1, -1, -1, 1}


energy = -56
x = {1, -1, -1, 1, -1, 1, 1, -1, -1, 1}
energy = -56
y = {1, -1, -1, -1, -1, 1}
energy = -64
x = {1, -1, -1, 1, -1, 1, 1, -1, -1, 1}
energy = -64
final y = {1, -1, -1, -1, -1, 1} energy= -64
final x = {1, -1, -1, 1, -1, 1, 1, -1, -1, 1} energy= -64

Notice that we have recovered the first exemplar.


For our second example, we shall define the starting vectors as
follows:

ClearAll[initX, initY];
initX = {-1,1,1,-1,1,1,1,-1,1,-1};
initY = {-1,1,-1,1,-1,-1};

energyBAM[initY, x2yWts, initX]

-8

The Hamming distances of the x0 vector from the training vectors are
h(x0, x1) = 7 and h(x0, x2) = 5. For the y0 vector, the values are h(y0, y1) =
4 and h(y0, y2) = 2. Based on these results, we might expect that the BAM
would settle on the second exemplar as a final solution. Let's see what
happens.

bam[initX, initY, x2yWts, y2xWts, True]

y = {-1, 1, 1, 1, 1, -1}
energy = -24
x = {-1, 1, 1, -1, 1, -1, -1, 1, 1, -1}
energy = -24
y = {-1, 1, 1, 1, 1, -1}
energy = -64
x = {-1, 1, 1, -1, 1, -1, -1, 1, 1, -1}
energy = -64
final y = {-1, 1, 1, 1, 1, -1} energy= -64
final x = {-1, 1, 1, -1, 1, -1, -1, 1, 1, -1} energy= -64

Compare these results with the exemplars

exemplars

{{{1, -1, -1, 1, -1, 1, 1, -1, -1, 1}, {1, -1, -1, -1, -1, 1}},
 {{1, 1, 1, -1, -1, -1, 1, 1, -1, -1}, {1, 1, 1, 1, -1, -1}}}

You will notice that the final output vectors do not match any of the
exemplars. Furthermore, they are actually the complement of the first
training pair, (x_out, y_out) = (x_1^c, y_1^c), where the c superscript refers to the
complement. This example illustrates a basic property of the BAM: If
you encode an exemplar, (x, y), you also encode its complement, (x^c, y^c).
Let's try a pair of random vectors to see if we always get an exemplar
or a complement of an exemplar.

initX = 2 Table[Random[Integer, {0,1}], {10}] - 1

{-1, -1, 1, 1, -1, -1, 1, -1, 1, 1}

initY = 2 Table[Random[Integer, {0,1}], {6}] - 1

{-1, 1, 1, -1, 1, -1}

bam[initX, initY, x2yWts, y2xWts, False]

final y = {-1, -1, -1, -1, 1, 1} energy= -64
final x = {-1, -1, -1, 1, 1, 1, -1, -1, 1, 1} energy= -64

Right away, we have found a spurious stable state whose energy is
equal to that of the exemplars' states. You can see that the problem of
spurious stable states is not one to be dismissed casually.

6.2 Recognition of Time Sequences


Mapping networks, such as the backpropagation network, and associative
memories, such as the BAM and Hopfield networks, generally deal
with static or spatial patterns. The final output from these networks does
not depend on any previous output. There are applications, however, for
which a network that can accommodate a time-ordered sequence of patterns
would be necessary. An example, currently unattainable by any
neural network except a human one, would be learning the sequence of
finger and arm movements necessary to play a piece on the piano.
One way of encoding such a sequence is with a type of neural network
called an avalanche, in which a series of units are triggered in a
time sequence. A second way of dealing with time in a neural network
is to take the results of processing at one particular time step and feed
that data back to the network inputs at the next time step. This latter
method is the one that we shall explore in this section.

6.2.1 The Elman Network


The architecture of the Elman network appears in Figure 6.2. The
network resembles a standard, feed-forward, layered network such as a
BPN. In fact, the training of all of the feed-forward connection weights
follows the standard generalized delta rule. The extra input units are
called context units.
Let's apply this network to a simple problem in which we present
two different sequences of numbers to the network. The first sequence
is 1, 2, 3, 1, the second is 2, 1, 3, 2. We shall ask the network to supply
the next number in the sequence as we present it with the first three
numbers. In each sequence, the third number is the same, but the fourth
number is different, which requires that the network "remembers" the
context in which the "3" is presented. After the third input, we shall
reset the network to begin on a new sequence.
This problem would likely confound a standard BPN. Such a network
could easily learn to map a "1" to a "2," a "2" to a "3," and a "3" to a "1,"
as in the first sequence. However, when learning the second sequence,
we ask the network to learn that a "2" maps to a "1" and that a "3"
maps to a "2." In a standard BPN, where each input pattern is presented
independently, there is no context to alert the network as to which of
the two contradictory mappings is the appropriate one. For the BPN,
this situation represents a one-to-many mapping, for which the BPN is
ill-equipped to handle.
To solve this problem, we shall construct an Elman network with
three input units, one for each of the possible values (1, 2, or 3) in the
first three positions of the input sequences. For this example, we put four


Figure 6.2 In this representation of the Elman network, outputs from each of the hidden-
layer units at timestep t become additional inputs to the network at timestep t + 1. At
each timestep, information from all previous timesteps influences the output of the hidden
layer and hence also influences the network output.

units on the hidden layer. Correspondingly there will be four context
units, for a total of seven units on the first layer. The output can be one of
three numbers (1, 2, or 3), so we assign one unit to each of those values,
making three units on the output layer.
We begin by constructing the input-output sequences.

ClearAll[ioPairsEl]
ioPairsEl =
(* inputs outputs *)
{ (* 1 2 3 1 2 3 *)
{{ {0.9, 0.1, 0.1}, {0.1, 0.9, 0.1} },
{ {0.1, 0.9, 0.1}, {0.1, 0.1, 0.9} },
{ {0.1, 0.1, 0.9}, {0.9, 0.1, 0.1} } },
{{ {0.1, 0.9, 0.1}, {0.9, 0.1, 0.1} },
{ {0.9, 0.1, 0.1}, {0.1, 0.1, 0.9} },
{ {0.1, 0.1, 0.9}, {0.1, 0.9, 0.1} }
} };
The degree of nesting in the ioPairsEl list maintains separation between

the sequences. For example:

ioPairsEl[[1]]

{{{0.9, 0.1, 0.1}, {0.1, 0.9, 0.1}},


{{0.1, 0.9, 0.1}, {0.1, 0.1, 0.9}},
{{0.1, 0.1, 0.9}, {0.9, 0.1, 0.1}}}

is the first sequence, comprising three ioPair vectors:

ioPairsEl[[1,1]]

{{0.9, 0.1, 0.1}, {0.1, 0.9, 0.1}}

Since the Elman network uses a standard backpropagation-of-errors
during the learning process, we can begin the code development with the
BPN code from Chapter 3. We shall need to make several modifications,
but the basic algorithm remains intact.
First, we need to increase the size of the hidden-unit weight table. Instead
of (hidNumber by inNumber), the dimension of the table will be (hidNumber
by (inNumber + hidNumber)), where hidNumber is the number of hidden units,
which is equal to the number of context units. Likewise, we must increase
the vector that stores the previous delta values on the hidden-layer units
(see Chapter 3).

hidWts = Table[Table[Random[Real, {-0.5, 0.5}],
  {inNumber+hidNumber}], {hidNumber}];
hidLastDelta = Table[Table[0, {inNumber+hidNumber}],
  {hidNumber}];

Instead of calculating a Table of errors, as we did in the BPN code, we
shall switch over to a For loop. The numIters parameter will be the number
of times that we cycle through each sequence. We begin the actual
processing by selecting a sequence at random.

ioSequence = ioPairs[[Random[Integer, {1, Length[ioPairs]}]]];

Then, since we are beginning a new sequence, we reset the outputs of


the context units to a value of 0.5.

conUnits = Table [0.5, {hidNumber}];



Next we begin a second For loop to cycle through the individual patterns
in the sequence. Using i as the index, we select the next pattern to be
processed:

ioP = ioSequence[[i]];

then identify the input vector and desired output vector. In this case,
however, we concatenate the context units to the sequence's input vector.

inputs = Join[conUnits, ioP[[1]]];
outDesired = ioP[[2]];

The remainder of the processing in this inner loop is identical to the
BPN processing. Before completing the loop, we must remember to set
the context-unit outputs equal to the current outputs of the hidden-layer
units.

conUnits = hidOuts;

Finally, we add the square of the current error value to the errorList.

AppendTo[errorList, outErrors.outErrors];

We then repeat the inner loop for as many patterns as there are in the
sequence. When finished with a sequence, we select another, reset the
context units to 0.5, and begin another inner loop. The entire program
appears in Listing 6.3. We also must write a new test program to
accommodate the sequences. Call the new test program elmanTest. This
program, which appears in Listing 6.4, is similar in intent to bpnTest, but
has an additional parameter, conNumber, that explicitly defines the number
of context units. Let's use this code to attempt our example problem.
For space considerations I have suppressed some of the printout from
the function. After this first run, I will also suppress the printout of the
results of individual patterns, showing only the error plot.

ClearAll[elOut];
elOut = 0;
Timing[elOut = elman[3, 4, 3, ioPairsEl, 0.5, 0.9, 100];]

Sequence 1 input 1
inputs:
{0.5, 0.5, 0.5, 0.5, 0.9, 0.1, 0.1}
outputs:

elman[inNumber_, hidNumber_, outNumber_, ioPairs_, eta_, alpha_, numIters_] :=
Module[{hidWts, outWts, ioP, inputs, hidOuts, outputs, outDesired,
        i, indx, hidLastDelta, outLastDelta, outDelta, errorList={},
        ioSequence, conUnits, hidDelta, outErrors},
  hidWts = Table[Table[Random[Real,{-0.5,0.5}], {inNumber+hidNumber}], {hidNumber}];
  outWts = Table[Table[Random[Real,{-0.5,0.5}], {hidNumber}], {outNumber}];
  hidLastDelta = Table[Table[0, {inNumber+hidNumber}], {hidNumber}];
  outLastDelta = Table[Table[0, {hidNumber}], {outNumber}];
  For[indx=1, indx<=numIters, indx++,      (* begin forward pass; select a sequence *)
    ioSequence = ioPairs[[Random[Integer, {1, Length[ioPairs]}]]];
    conUnits = Table[0.5, {hidNumber}];    (* reset conUnits *)
    For[i=1, i<=Length[ioSequence], i++,   (* process the sequence in order *)
      ioP = ioSequence[[i]];               (* pick out the next ioPair *)
      inputs = Join[conUnits, ioP[[1]]];   (* join context and input units *)
      outDesired = ioP[[2]];
      hidOuts = sigmoid[hidWts.inputs];    (* hidden-layer outputs *)
      outputs = sigmoid[outWts.hidOuts];   (* output-layer outputs *)
      outErrors = outDesired - outputs;    (* calculate errors *)
      outDelta = outErrors (outputs (1-outputs));
      hidDelta = (hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
      outLastDelta = eta Outer[Times, outDelta, hidOuts] + alpha outLastDelta;
      outWts += outLastDelta;              (* update weights *)
      hidLastDelta = eta Outer[Times, hidDelta, inputs] + alpha hidLastDelta;
      hidWts += hidLastDelta;
      conUnits = hidOuts;                  (* update context units *)
      (* put the sum of the squared errors on the list *)
      AppendTo[errorList, outErrors.outErrors];
    ]; (* end of For i *)
  ]; (* end of For indx *)
  Print["New hidden-layer weight matrix: "];
  Print[]; Print[hidWts]; Print[];
  Print["New output-layer weight matrix: "];
  Print[]; Print[outWts]; Print[];
  elmanTest[hidWts, outWts, ioPairs, hidNumber];  (* check how close we are *)
  errorPlot = ListPlot[errorList, PlotJoined->True];
  Return[{hidWts, outWts, errorList, errorPlot}];
] (* end of Module *)

Listing 6.3

elmanTest[hiddenWts_, outputWts_, ioPairVectors_, conNumber_, printAll_:False] :=
Module[{inputs, hidden, outputs, desired, errors, i, j,
        conUnits, ioSequence, ioP},
  If[printAll, Print[]; Print["ioPairs:"]; Print[]; Print[ioPairVectors]];
  (* loop through the sequences *)
  For[i=1, i<=Length[ioPairVectors], i++,
    (* select the next sequence *)
    ioSequence = ioPairVectors[[i]];
    (* reset the context units *)
    conUnits = Table[0.5, {conNumber}];
    (* loop through the chosen sequence *)
    For[j=1, j<=Length[ioSequence], j++,
      ioP = ioSequence[[j]];
      (* join context and input units *)
      inputs = Join[conUnits, ioP[[1]]];
      desired = ioP[[2]];
      hidden = sigmoid[hiddenWts.inputs];
      outputs = sigmoid[outputWts.hidden];
      errors = desired - outputs;
      (* update context units *)
      conUnits = hidden;
      Print[];
      Print["Sequence ", i, " input ", j];
      Print[]; Print["inputs:"]; Print[];
      Print[inputs];
      If[printAll, Print[]; Print["hidden-layer outputs:"];
         Print[hidden]; Print[];];
      Print["outputs:"]; Print[];
      Print[outputs]; Print[];
      Print["desired:"]; Print[]; Print[desired]; Print[];
      Print["Mean squared error:"];
      Print[errors.errors/Length[errors]];
      Print[];
    ]; (* end of For j *)
  ]; (* end of For i *)
] (* end of Module *)

Listing 6.4

{0.00991694, 0.848143, 0.342235}


desired:
{0.1, 0.9, 0.1}
Mean squared error:
0.0231606
Sequence 1 input 2
inputs:
{0.033297, 0.686708, 0.352224, 0.801336, 0.1, 0.9, 0.1}
outputs:
{0.164773, 0.095997, 0.676043}
desired:
{0.1, 0.1, 0.9}
Mean squared error:
0.0181227
Sequence 1 input 3
inputs:
{0.187346, 0.130752, 0.673253, 0.595115, 0.1, 0.1, 0.9}
outputs:
{0.549356, 0.366293, 0.105307}
desired:
{0.9, 0.1, 0.1}
Mean squared error:
0.0646305
Sequence 2 input 1
inputs:
{0.5, 0.5, 0.5, 0.5, 0.1, 0.9, 0.1}
outputs:
{0.83715, 0.00746378, 0.296124}
desired:
{0.9, 0.1, 0.1}
Mean squared error:
0.0169926
Sequence 2 input 2
inputs:
{0.720089, 0.0110963, 0.815086, 0.348975, 0.9, 0.1, 0.1}
outputs:
{0.0757287, 0.183033, 0.672605}
desired:
{0.1, 0.1, 0.9}

Mean squared error:


0.0197307
Sequence 2 input 3
inputs:
{0.152164, 0.214958, 0.591788, 0.733609, 0.1, 0.1, 0.9}
outputs:
{0.229314, 0.771018, 0.0831117}
desired:
{0.1, 0.9, 0.1}
Mean squared error:
0.0112145

{393.383 Second, Null}

We do not need to require that the error drop to an extremely small
value in this case. We can be satisfied if one node has an output that is
significantly larger than the other nodes. Using this criterion, the network
has learned all of the patterns after 100 passes. We also learn from this
test that the program takes a while to run.
The wild oscillations in the error plot may be an indication that either
the learning rate or the momentum, or both, is too large. You may
want to experiment with different parameters to see if you can get better
performance.
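One such experiment (the parameter values here are my own guesses,
not tuned values from the book) halves the learning rate, reduces the
momentum, and doubles the number of passes:

elOut = elman[3, 4, 3, ioPairsEl, 0.25, 0.45, 200];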
Before leaving this network, let's consider experimenting with the
architecture a bit. Since we know that we would like only one unit in
the output layer to be "on" for any given input pattern, we can use our
knowledge of inhibitory connections, gained in Chapter 5, to facilitate
this behavior. There are several ways in which you might implement

inhibitory connections between units on the output layer. The following


code represents one attempt.

outputs = outWts.hidOuts;
(* modify by inhibitory connections *)
outputs = sigmoid[outputs -
  0.3 Apply[Plus, outputs] + 0.5 outputs];

The function elmanComp, which appears in the appendix, implements this
code. You can see by the following example that we did not gain much,
if anything, using the specific changes listed above. Nevertheless the
idea holds promise, and you may want to see if you can improve on the
above attempt.

ClearAll[elOut];
elOut = 0;
elOut = elmanComp[3, 4, 3, ioPairsEl, 0.5, 0.9, 100];


6.2.2 The Jordan Network

With a very straightforward modification of the Elman network architecture,
we produce another type of feedback network called the Jordan
network. Instead of taking the feedback from the hidden-layer units, we
take it from the output-layer units. The architecture appears in Figure
6.3.
We shall use this network to learn sequences, but in a slightly different
manner than we did with the Elman network. Assume that we have
several sequences of output vectors that we would like to encode in this
network. We can represent each sequence as a set of output vectors;
for example, the ith sequence would be {x_i1, x_i2, ..., x_in}, where we


Figure 6.3 This figure illustrates the architecture of the Jordan network. Notice that units
called state units receive their inputs from the output-layer units instead of the hidden-layer
units as in the case of the Elman network. Notice also that there are connections between
all of the state units as well as feedback from each state unit to itself. The function of the
plan units and state units is described in the text.

have assumed that the sequence contains n vectors. Let's associate a
unique plan vector, p_i, with the ith sequence. This plan vector uniquely
identifies the associated output sequence. When we apply the plan vector
to the plan units of the Jordan network, we want the network to respond
with the appropriate sequence of output vectors.
Because of the feedback connections from each state unit to itself,
the output of the state units can be influenced by all previous states in
the sequence, thus providing a context for the next output vector in the
sequence. We shall assume that the connection between the output unit
and its corresponding state unit carries a connection-weight value of one.
In a more general network, these connection weights could be learned
like other weights in the network. In general, the output of each state unit
is some function of the corresponding output-unit value at the previous
time step and the previous output of the state unit. For our examples we
shall use the following formulation:

    s_i(t) = \mu s_i(t-1) + o_i                                             (6.4)


Plan Unit    State Units    Output Units

    0            0 0            0 1
    0            0 1            1 0
    0            1 0            1 1
    0            1 1            0 0

    1            0 0            1 1
    1            1 1            1 0
    1            1 0            0 1
    1            0 1            0 0

Table 6.1 This table shows the inputs and outputs for the counting example.

where s_i is the output of the ith state unit, o_i is the output of the ith
output unit, and the value of μ determines the amount of influence of
previous time steps. If μ is less than one, then the influence of previous
time steps decreases exponentially as we look farther back in time. In the
following examples, we shall not use the connections between the state
units.
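The exponential decay in Eq. (6.4) is easy to see by iterating the state
update by hand. This sketch of my own holds the output o_i fixed at 1
with mu = 0.5:

NestList[0.5 # + 1 &, 0, 6]
(* --> {0, 1, 1.5, 1.75, 1.875, 1.9375, 1.96875}; older outputs enter as powers of mu *)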
Let's apply the Jordan network to a simple example and discuss the
processing within the context of that example. We shall call this example
the counting example. For this first example, we shall assume that μ = 0.
Table 6.1 shows the various vectors in their proper sequence.
We require one plan unit that can take on a value of 0 or 1. The
sequence corresponding to a plan unit of 0 counts upward from binary
1. The other sequence counts down from binary 11. Because μ = 0, the
state units take on values equal to the output units at the previous time
step. The network will have two state units, two output units, and two
hidden units, although the number of hidden units is not specified by
the example. Although we are using binary units here, there is nothing
that precludes the use of continuous-value units.
There are not many changes required to convert the elman function
into a jordan function. First we need to add an additional parameter, mu,
to the calling sequence.

jordan[inNumber.,hidNumber.,outNumber.,ioPairs_,
eta_,alpha.,mu.,numlters_]

Then we need to alter the size of the hidden-unit weight matrix and
last-delta matrix, changing hidNumber in elman to outNumber.

hidWts = Table[Table[Random[Real, {-0.5, 0.5}],
  {inNumber+outNumber}], {hidNumber}];
hidLastDelta = Table[Table[0, {inNumber+outNumber}],
  {hidNumber}];

In addition, we initialize the stateUnits array to zeros (actually 0.1),
in place of the conUnits array in elman, which we initialized to 0.5.

stateUnits = Table[0.1, {outNumber}];

We update the value of the state units according to Eq. (6.4):

stateUnits = mu stateUnits + outputs;

We must also make corresponding changes to convert elmanTest into
jordanTest. Both complete routines appear in the appendix.
Using the table for the counting problem we can construct the appropriate
ioPair vector.

ioPairsJor = {
{{{0.1}, {0.1, 0.9}},
{{0.1}, {0.9, 0.1}},
{{0.1}, {0.9, 0.9}},
{{0.1}, {0.1, 0.1}}},
{{{0.9}, {0.9, 0.9}},
{{0.9}, {0.9, 0.1}},
{{0.9}, {0.1, 0.9}},
{{0.9}, {0.1, 0.1}}} };

Let's make a number of different runs using the Jordan network in various
configurations. There are several modifications that we can try in
order to assess the corresponding performance impact. You should be
aware that the runs that follow generally show only a relatively few iterations.
If you actually want to reduce the error to a value that we would
consider appropriate for actual applications, you would likely have to
run the network for a significantly longer time. We restrict the number
of iterations here so that we can perform this experiment in a reasonable
time. Note, however, that many of the runs are quite time consuming. We
begin with the standard Jordan network as we have described it above.
For the first run, we set μ = 0. As with the Elman-network output, I have
edited out some information. For this first example I will leave intact the
results from individual inputs.

Timing[jordan[1, 2, 2, ioPairsJor, 0.5, 0.9, 0, 200];]

Sequence 1 input 1
inputs:
{0.1, 0.1, 0.1}
outputs:
{0.497832, 0.84752}
desired:
{0.1, 0.9}
Mean squared error:
0.0805123
Sequence 1 input 2
inputs:
{0.497832, 0.84752, 0.1}
outputs:
{0.48581, 0.112169}
desired:
{0.9, 0.1}
Mean squared error:
0.0858507
Sequence 1 input 3
inputs:
{0.48581, 0.112169, 0.1}
outputs:
{0.498776, 0.898134}
desired:
{0.9, 0.9}
Mean squared error:
0.080492
Sequence 1 input 4
inputs:
{0.498776, 0.898134, 0.1}
outputs:
{0.485458, 0.102813}
desired:
{0.1, 0.1}

Mean squared error:


0.074293
Sequence 2 input 1
inputs:
{0.1, 0.1, 0.9}
outputs:
{0.500204, 0.886611}
desired:
{0.9, 0.9}
Mean squared error:
0.0800082
Sequence 2 input 2
inputs:
{0.500204, 0.886611, 0.9}
outputs:
{0.486778, 0.124524}
desired:
{0.9, 0.1}
Mean squared error:
0.0856769
Sequence 2 input 3
inputs:
{0.486778, 0.124524, 0.9}
outputs:
{0.500819, 0.9171}
desired:
{0.1, 0.9}
Mean squared error:
0.0804742
Sequence 2 input 4
inputs:
{0.500819, 0.9171, 0.9}
outputs:
{0.486519, 0.117791}
desired:
{0.1, 0.1}
Mean squared error:
0.0748567

{581.1 Second, Null}

If we consider all output values above 0.5 to be "1" and all below
0.5 to be "0," then this network appears to be on the verge of performing
well, although it seems to be stalled. Adjustments in the parameters may
help, but we shall not undertake such a study here. Instead, let's redo
the example with a nonzero value of μ.

Timing[jordan[1, 2, 2, ioPairsJor, 0.5, 0.9, 0.1, 200];]

{715.517 Second, Null}

The results here seem to be fairly close to those for μ = 0. Let's see if
increasing the number of hidden units helps.

Timing[jordan[1, 4, 2, ioPairsJor, 0.5, 0.9, 0.1, 100];]

{486.333 Second, Null}

It does not look like we accomplished very much with that change,
though more passes through the data may help. Let's try something
different. Instead of setting the stateUnits equal to the actual output units
during training, we can set them equal to the desired outputs. The function
jordan2 implements this variation. For this test I have set the μ factor
back to zero.
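Although jordan2 itself appears only in the appendix, the essential change
is presumably the single state-update line, with the actual outputs replaced
by the desired outputs; this is my reconstruction, not the appendix listing.

stateUnits = mu stateUnits + outDesired;   (* teacher-forced state units *)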

Timing[jordan2[1,4,2,ioPairsJor,0.5,0.9,0,100];]

{497.133 Second, Null}

The error dropped much faster for this run than it did for the previous
runs. Using the desired outputs as state vectors appears to have helped.
Let's make another change for purely aesthetic reasons. If you look at
the code, you will see that we are capturing the sum of the squares of the
errors as each pattern is presented to the network. When we print the

results, we show the mean squared error, averaged over all of the output
units. The function jordan2a plots the average mean squared error; that
is, the squared error averaged over the output units, then averaged over
all of the patterns in a sequence.

Timing[jordan2a[1, 4, 2, ioPairsJor, 0.5, 0.9, 0, 100];]

{484.75 Second, Null}

We can also try a different representation of the plan vector. The follow¬
ing uses a two-element plan vector for the counter problem.

ioPairsJor2 = {
{{{0.1, 0.9}, {0.1, 0.9}},
{{0.1, 0.9}, {0.9, 0.1}},
{{0.1, 0.9}, {0.9, 0.9}},
{{0.1, 0.9}, {0.1, 0.1}}},
{{{0.9, 0.1}, {0.9, 0.9}},
{{0.9, 0.1}, {0.9, 0.1}},
{{0.9, 0.1}, {0.1, 0.9}},
{{0.9, 0.1}, {0.1, 0.1}}}
};

Timing[jordan2a[2,4,2,ioPairsJor2,0.5,0.9,0,100];]

{274.05 Second, Null}

Although we are not seeing much difference here from run to run, such
changes often result in performance improvements. One final modifica¬
tion represents a somewhat radical change from the way we have been
doing the generalized delta rule algorithm.
The Tanh[u] function has the same general shape as the sigmoid func¬
tion, but the limits are +1 and -1, rather than 0 and 1. We can use
Tanh in place of the sigmoid function and at the same time change the
representation of the vectors.

Plot[Tanh[u], {u, -4, 4}];

For the desired-output vectors, we use -0.9 to represent binary 0, and 0.9
to represent binary 1. The corresponding ioPairs vectors for the counter
problem appear as follows:

ioPairsJor3 = {
  {{{-1}, {-0.9, 0.9}},
   {{-1}, { 0.9,-0.9}},
   {{-1}, { 0.9, 0.9}},
   {{-1}, {-0.9,-0.9}} },
  {{{1}, { 0.9, 0.9}},
   {{1}, { 0.9,-0.9}},
   {{1}, {-0.9, 0.9}},
   {{1}, {-0.9,-0.9}} } };

To accomplish this modification of the Jordan network we must make
several changes in the code. First, of course, we must change the
calculation of the hidden- and output-layer output values to incorporate
the Tanh function.

hidOuts = Tanh[hidWts . inputs];

outputs = Tanh[outWts . hidOuts];

If you recall from Chapter 3, the calculation of the weight updates involved
the derivative of the output function. For the sigmoid, the derivative
turned out to be outputs (1-outputs) for the output layer, with a similar
expression for the hidden layer. In the case of the Tanh[u] output function,
the derivative is Sech[u]^2 = 1 - Tanh[u]^2, which for the output layer
would be (1 - outputs^2). We must modify two expressions in the function
to reflect these differences.

outDelta = outErrors (outputs (1-outputs));

becomes

outDelta = outErrors (1 - outputs^2);

and

hidDeltas = (hidOuts (1-hidOuts)) Transpose[outWts] . outDelta;

becomes

hidDeltas = (1 - hidOuts^2) Transpose[outWts] . outDelta;
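As a quick check of the derivative identity (this is not part of the network
code; it simply verifies the algebra):

Simplify[D[Tanh[u], u] - Sech[u]^2]       (* -> 0 *)

Simplify[Sech[u]^2 - (1 - Tanh[u]^2)]     (* -> 0 *)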

We call the new function jordan2aTanh. We must also modify the test
function. Both functions appear in the appendix. Let's try the new
function.

jordan2aTanh[1,4,2,ioPairsJor3,0.1,0.9,0,100];

Sequence 1 input 1
inputs:
{-0.9, -0.9, -1}
outputs:
{-0.905053, 0.900023}
desired:
{-0.9, 0.9}
Mean squared error:
0.0000127663
Sequence 1 input 2
inputs:
{-0.9, 0.9, -1}
outputs:
{0.903497, -0.897313}
desired:
{0.9, -0.9}
Mean squared error:
9.72306 10^-6
Sequence 1 input 3
inputs:
{0.9, -0.9, -1}
outputs:
{0.910307, 0.885798}
desired:
{0.9, 0.9}
Mean squared error:
0.000153967
Sequence 1 input 4
inputs:
{0.9, 0.9, -1)
outputs:
{-0.893995, -0.959445}
desired:
{-0.9, -0.9}
Mean squared error:
0.00178488

Sequence 2 input 1
inputs:
{-0.9, -0.9, 1}
outputs:
{0.893995, 0.959445}
desired:
{0.9, 0.9}
Mean squared error:
0.00178488
Sequence 2 input 2
inputs:
{0.9, 0.9, 1}
outputs:
{0.905053, -0.900023}
desired:
{0.9, -0.9}
Mean squared error:
0.0000127663
Sequence 2 input 3
inputs:
{0.9, -0.9, 1}
outputs:
{-0.903497, 0.897313}
desired:
{-0.9, 0.9}
Mean squared error:
9.72306 10^-6
Sequence 2 input 4
inputs:
{-0.9, 0.9, 1}
outputs:
{-0.910307, -0.885798}
desired:
{-0.9, -0.9}
Mean squared error:
0.000153967

[Plot: average mean squared error versus training cycle; vertical axis 0.005 to 0.025, horizontal axis 20 to 100.]

This version appears to work extremely well; notice that far fewer than
100 cycles would have been sufficient. You should be aware, however, of
two minor additional changes: First, note that the learning rate is only
0.1 instead of 0.5 for previous runs. Second, if you examine the code, you
will see that I initialized the state units to -0.9 instead of 0.1. I know that
it is not advisable to change more than one item at a time when running
these experiments, but I have done so here in the interest of space.
I have presented a large number of variations to further illustrate
the kind of experimentation that is often necessary to get a network to
perform adequately. Feel free to continue to experiment.

Summary
In this chapter we studied several networks that include feedback con¬
nections as a part of their architecture. The BAM is a general form of a
network that reduces to the Hopfield network if we think of both lay¬
ers of BAM units as the same layer. The Elman and Jordan networks
incorporate feedback from hidden and output units respectively back to
the input layer. These networks can learn sequences of input patterns
using a backpropagation algorithm. Both networks are fertile ground for
experimentation with feedback structures.
Chapter 7

Adaptive Resonance Theory



One of the nice features of human memory is its ability to learn many
new things without necessarily forgetting things learned in the past. A
frequently cited example is the ability to recognize your parents even if
you have not seen them for some time and have learned many new faces
in the interim. Some popular neural networks, the backpropagation net¬
work in particular, cannot learn new information incrementally without
forgetting old information, unless it is retrained with the old information
along with the new.
Another characteristic of most neural networks is that if we present
to them a previously unseen input pattern, there is generally no built-in
mechanism for the network to be able to recognize the novelty of the
input. The neural network doesn't know that it doesn't know the input
pattern. On the other hand, suppose that an input pattern is simply a
distorted or noisy version of one already learned by the network. If a
network treats this pattern as a totally new pattern, then it may be over¬
working itself to learn what it has already learned in a slightly different
form.
We have been describing a situation called the stability-plasticity
dilemma. We can restate this dilemma as a series of questions: How
can a learning system remain adaptive (plastic) in response to significant
input, yet remain stable in response to irrelevant input? How does the
system know to switch between its plastic and its stable modes? How
can the system retain previously-learned information while continuing
to learn new information?
Adaptive resonance theory (ART) attempts to address the stability-
plasticity dilemma. A key to solving the stability-plasticity dilemma is
the addition of a feedback mechanism between the layers of the ART
network. This feedback mechanism facilitates the learning of new infor¬
mation without destroying old information, automatic switching between
stable and plastic modes, and stabilization of the encoding of the classes
done by the nodes. We shall discuss two classes of neural-network ar¬
chitectures that result from this approach. We refer to these network
architectures as ART1 and ART2. ART1 and ART2 differ in the nature
of their input patterns. ART1 networks require that the input vectors be
binary. ART2 networks are suitable for processing analog, or grey-scale,
patterns.
ART gets its name from the particular way in which learning and re¬
call interplay in the network. In physics, resonance occurs when a small
amplitude vibration of the proper frequency causes a large amplitude

vibration in an electrical or mechanical system. In an ART network,
information in the form of processing-element outputs reverberates back
and forth between layers of the network. If the proper patterns develop,
a stable oscillation ensues, which is the neural-network equivalent of
resonance. During this resonant period, learning, or adaptation, can occur.
Before the network has achieved a resonant state, no learning takes place.
Before the network has achieved a resonant state, no learning takes place.
An ART network can achieve a resonant state in one of two ways. If
the network has previously learned to recognize an input vector, then it
will achieve a resonant state quickly when that input vector is presented
again. During the resonance period, the adaptation process will reinforce
the memory of the stored pattern. If the network does not immediately
recognize the input vector, it will rapidly search through its stored pat¬
terns looking for a match. If no match is found, the network will enter
a resonant state whereupon the new pattern will be stored for the first
time. Thus, the network responds quickly to previously-learned data, yet
remains able to learn when we present novel data.
Among other details, which we shall develop in this chapter, ART networks
utilize the concept of competition between units. We have already
seen an example of competition in the constraint-satisfaction problem
that we discussed in Chapter 5. Recall that, in that problem, only a
certain few units were allowed to be on for any given solution, according
to the constraints of the problem. We implemented these constraints
among the units by means of equally strong inhibitory connections between
all units subject to a particular constraint. Thus, if a particular unit
in a set were to have a larger output than the others, it would tend to
inhibit the others more strongly than it was inhibited by them. This
situation represents a form of competition among units for the privilege
of being on. Units with large outputs tend to inhibit those with smaller
outputs, driving their outputs to zero, while they themselves increase their
output as a result of diminished inhibition from others.

7.1 ART1
As mentioned in the introduction to this chapter, ART1 networks require
binary input vectors. This limitation is not necessarily a severe handicap,
since many problems lend themselves reasonably well to a binary rep¬
resentation. In this section we shall examine some of the details of the
ART1 architecture and processing. The treatment of those topics here will
not be as complete as it is in Neural Networks, however, due to space
considerations. On the other hand, Section 7.1.3 shows a detailed calculation
using Mathematica to illustrate the performance of ART1.
using Mathematica to illustrate the performance of ART1.

7.1.1 ART1 Architecture

The basic features of the ART architecture appear in Figure 7.1. The
two major subsystems are the attentional subsystem and the orienting
subsystem. Patterns of activity that develop over the units in the two
layers of the attentional subsystem are called short-term memory (STM)
traces because they exist only in association with a single application of an
input vector. The weights associated with the bottom-up and top-down
connections between F1 and F2 are called long-term memory (LTM)
traces because they encode information that remains a part of the network
for an extended period.
We shall delay any consideration of the mathematics governing the
ART1 network and describe the processing in the form of an algorithm.
Furthermore, we shall ignore, for the moment, the gain control system;
we will revisit that topic later in this section.
The following algorithm is brief, and omits many details of ART1
processing, but it does illustrate conceptually how the network responds
to inputs.

1. Apply a binary input vector to the F1-layer units.

2. Determine the output of the F1-layer units and propagate that output
up to the F2 layer.

3. Allow the F2 units to compete based on their net-input value so
that one unit wins and is the only one with a nonzero output.

4. Send the output from the winning F2 unit back down to F1, where
it stimulates the appearance of a top-down template pattern.

5. Compare the top-down template to the initial input vector. If the
patterns match to within a specified amount (determined by the vigilance
parameter), a resonant condition has been attained; continue with
step 7.

6. Failing to achieve resonance, disable the winning F2 unit, preventing
it from competing further, reset all units to their initial values, and
begin a new matching cycle with step 1.


Figure 7.1 This figure illustrates the ART1 system diagram. The two major subsystems are
the attentional subsystem and the orienting subsystem. F1 and F2 represent two layers
of units in the attentional subsystem. Units on each layer are fully interconnected to the
units on the other layer. Not shown are interconnects among the units on each layer. Other
connections between components are indicated by the arrows. A plus sign indicates an
excitatory connection and a minus sign indicates an inhibitory connection. The function of
the various subsystems is discussed in the text.

7. Upon resonance, modify weights on both layers to encode the pattern.

8. Reset all units, clear the input vector, and begin at step 1 with a
new input vector.

There is one subtlety involving ART networks in general that merits
a slight digression. In the above algorithm, learning takes place in step
seven, after resonance is established. In a real ART network, that is, one
that is not a software simulation, weight modification is not turned on or
off depending on the existence of a resonant condition; instead, weights
are always subject to modification, even during a matching cycle that
leads to a mismatch and a reset condition. What keeps the network from
learning mismatches is that the time scale over which significant changes
in weight values occur is much longer than the time scale over which

the matching cycles occur. In other words, during a matching cycle in
which a particular F2 unit wins, but the resulting top-down template
mismatches the input pattern, the weights to and from the winning unit
can undergo modification; but the matching process and subsequent reset
occur so quickly that the weights do not have a chance to change
significantly before the reset has removed the offending unit from
consideration. Significant weight modification occurs only during
a resonant condition, in which data passage between layers is stable for
a considerable period of time.
A not-so-subtle, but equally important, part of ART is the gain control
mechanism. The gain control system works in concert with what is called
the 2/3 rule. If you refer back to Figure 7.1, you will notice that the units
on the F1 layer have three possible sources of inputs: bottom-up inputs,
top-down inputs, and inputs from the gain control system (the same is
true of the F2 layer, but that fact is not obvious from the diagram since
the top-down inputs are missing). The 2/3 rule states that, for a given
unit to have a nonzero output, it must be receiving an input from two,
and only two, of the possible three input paths. The gain control is
configured such that any output from F2 sends an inhibitory signal to
the gain control system, which completely shuts it off; therefore, in the
presence of an output from F2, the gain control is disabled.
Disabling the gain control when F2 is active has the effect of preventing
any output from F1 in the presence of an input from F2 alone,
that is, when there is no bottom-up input to F1. In the algorithm for
ART1 processing, it would appear that F2 could never be active unless
preceded by activity on F1. However, the basic ART structure was
intended to be a building block in a hierarchical structure, meaning that
in some circumstances the F2 layer may be active from influences other
than the associated F1 layer. If we allowed the F1 layer to respond to F2
activity alone, then a resonant condition could ensue without any inputs
to the F1 layer from below, a condition that does not make sense within
the context of trying to encode input patterns. You can find more details
about gain control and the 2/3 rule in Chapter 8 of Neural Networks.
The orienting subsystem receives the same inputs as does the F1
layer, but it also receives as inputs the outputs of the F1 layer. If these
two input vectors match to within the criterion specified by the vigilance
parameter, nothing happens; or rather, the orienting subsystem remains
inhibited. If, on the other hand, the orienting subsystem's input vectors do
not match, there will be a net excitatory signal to the orienting subsystem,
resulting in a reset signal to the F2 layer. The effect of the reset signal
depends on the state of the individual F2 unit. If the unit currently has a
nonzero output, that unit is disabled for the duration of the current input
vector. If the unit does not have a nonzero output, the unit ignores the
reset signal.

7.1.2 ART1 Processing Equations


The equations that describe the dynamics of the activity values on the F1
and F2 layers are identical in their general form:

    ẋ_k = -x_k + (1 - A x_k) J_k^+ - (B + C x_k) J_k^-        (7.1)

J_k^+ is an excitatory input to the kth unit, and J_k^- is an inhibitory input.
A, B, and C are positive constants. To distinguish between layers, we
shall adopt the convention that the subscript i will always refer to a
unit on the F1 layer, and the subscript j will always refer to a unit
on the F2 layer. Furthermore, we shall append a 1 or a 2 subscript to
various quantities to identify the relevant layer; for example, A_1 refers
to the constant A defined specifically for the F1 layer. Similarly, we shall
use x_{1i} and x_{2j} to refer to the activities on F1 and F2 units, respectively.
When it becomes necessary to label a particular unit, we shall use v_i for
F1 units and v_j for F2 units.
Figure 7.2 illustrates the various quantities that relate to units on the
F1 layer. Figure 7.3 provides the same information for an F2 unit.

Processing on F1  We can write the total excitatory input to F1 units
as

    J_i^+ = I_i + D_1 V_i + B_1 G        (7.2)

where D_1 and B_1 are positive constants, G is the output of the gain
control system, and V_i is the net-input contribution from the top-down
connections from F2. We calculate V_i by the usual method of the dot
product of the input vector and the weight vector:

    V_i = Σ_j u_j z_{ij}        (7.3)

The value of G is 1 as long as an input pattern is present from below
and there is no output from F2, and is zero if there is an output from F2.


Figure 7.2 This figure shows a processing element, v_i, on the F1 layer of an ART1 network.
The activity of the unit is x_{1i}. It receives a binary input value, I_i, from below, and an
excitatory signal, G, from the gain control. In addition, the top-down signals, u_j, from F2
are gated (multiplied) by the weights, z_{ij}. Outputs, s_i, from the processing element go up to
F2 and across to the orienting subsystem, A.

We shall set the inhibitory input to F1 units equal to unity. With the
above definitions, the equation for the activity on F1 units becomes

    ẋ_{1i} = -x_{1i} + (1 - A_1 x_{1i})(I_i + D_1 V_i + B_1 G) - (B_1 + C_1 x_{1i})        (7.4)

Before any inputs are present on the F1 layer, G = 0, and V_i = 0, for
all i. We can substitute these conditions into Eq. (7.4) and solve for the
equilibrium values of the activities:

    x_{1i} = -B_1 / (1 + C_1)        (7.5)

Notice that the equilibrium activities are negative, meaning that the units
are kept in a highly inhibited state.
During the initial stages of processing, that is, when an input is
present from below but there has as yet been no response from F2, V_i
remains at zero, but the gain control has become active: G = 1.


Figure 7.3 This figure shows a processing element, v_j, on the F2 layer of an ART1 network.
The activity of the unit is x_{2j}. The unit v_j receives inputs from the F1 layer, the gain control
system, G, and the orienting subsystem, A. Bottom-up signals, s_i, from F1 are gated by
the weights, z_{ji}. Outputs, u_j, are sent back down to F1. In addition, each unit receives
a positive feedback term from itself, g(x_{2j}), and sends an identical signal through an
inhibitory connection to all other units on the layer.

The equilibrium activities under these conditions are

    x_{1i} = I_i / (1 + A_1(I_i + B_1) + C_1)        (7.6)

The unit activity will be nonzero in the case of I_i = 1, and will be
zero if I_i = 0. The activity of the gain control brings the unit activities
up to a zero value, ready to fire if the unit receives a nonzero input from
below.
When F2 finally responds, sending a top-down input to F1, the gain
control becomes inactive. In this case, the equilibrium activity values
become

    x_{1i} = (I_i + D_1 V_i - B_1) / (1 + A_1(I_i + D_1 V_i) + C_1)        (7.7)
The activity on any particular unit can now be either positive or
negative, depending on the value of the numerator in Eq. (7.7). We
assume that we want a positive activity if the unit receives both a top-down
input and a nonzero bottom-up input (I_i = 1). This assumption
translates into a condition on the quantities in the numerator, namely

    V_i > (B_1 - 1) / D_1        (7.8)

We implement this condition in the network by initializing the weights
on the top-down connections to be at least the value of the right-hand
side of Eq. (7.8). In other words,

    z_{ij} > (B_1 - 1) / D_1        (7.9)
In the complete analysis of processing on F1, a condition relating the
values of B_1 and D_1 arises. We state that condition here, but you can find
the details in Chapter 8 of Neural Networks.

    max{D_1, 1} < B_1 < D_1 + 1        (7.10)

At any point in the calculation, the output of an F1 unit will be 1 if
the unit has a positive activity:

    s_i = 1  if x_{1i} > 0
    s_i = 0  if x_{1i} ≤ 0        (7.11)

Processing on F2  The processing on F2 units is more complicated
than that performed on F1, yet it is much easier to implement in a
computer simulation. The complication arises because F2 is a competitive
layer in which each unit sends inhibitory signals to all other units on the
layer, while the unit itself receives a positive feedback signal from itself.
This arrangement goes by the name of on-center, off-surround, and is
the exact scheme that we used in the Hopfield network to ensure that
only one of a certain group of units was allowed to turn on. On such
a competitive layer, the equations describing the processing are complicated
by the fact that they are coupled differential equations, making it
impossible to find simple equilibrium values for the activities as we did
on the F1 layer.
On the other hand, because the layer is a winner-take-all competitive
layer, all that we need do is calculate the net inputs for all of the units,
select the one with the largest net-input value, and declare that unit the
winner. The winning unit will have an output signal of unit strength,
and all other units will remain inactive. We calculate the net input to the
F2 units in the usual manner:

    T_j = net_j = Σ_i s_i z_{ji}        (7.12)

Then the outputs are given by

    u_j = 1  if T_j = max_k{T_k}
    u_j = 0  otherwise        (7.13)
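In Mathematica terms, Eqs. (7.12) and (7.13) reduce to a dot product
followed by a comparison with the maximum. The fragment below is only a
sketch, using the names z21 and sf1 that appear in the example of the next
section; note that a tie would produce more than one winner here, a case
the compete function of Section 7.1.3 resolves by taking the first maximum:

t = z21 . sf1;                           (* Eq. (7.12): net inputs to F2 *)
uf2 = Map[If[# == Max[t], 1, 0] &, t];   (* Eq. (7.13): winner takes all *)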

We shall give the winning F2 unit a special designation: v_J.
We must also impose an initial condition on the values of the weights
on the F2 units. The condition is

    0 < z_{ji}(0) < L / (L - 1 + M)        (7.14)

where L > 1, and M is equal to the number of units on the F1 layer. Once
again, we shall not digress into the rather lengthy discussion of how we
derive this condition. Suffice it to say that this condition helps to ensure
that a unit that has encoded a particular pattern continues to win
over uncommitted units in the F2 layer when that particular pattern is
presented to the network.

Maintaining Vigilance in ART1  The orienting subsystem determines
whether top-down templates, encoded in the network weights,
match input patterns presented to the network. The degree of match is
also an important consideration. The parameter ρ, called the vigilance
parameter, specifies the degree to which one pattern must be similar to
another in order to be considered a match for the pattern. To examine
how vigilance works, we must first make a definition or two. Given a
binary vector, X, we define the magnitude of X by the following expression:

    |X| = Σ_i x_i        (7.15)

In other words, the magnitude of a binary vector is equal to the number
of nonzero components in that vector.
We can also define the value of the output vector, S, of the F1 units by
the following:

    S = I          if F2 is inactive
    S = I ∩ V^J    if F2 is active        (7.16)

Eq. (7.16) states that the output of F1 will be identical to the input
vector, I, if no top-down signal is present from F2, and will be equal to
the intersection of the input vector and the top-down template pattern,
V^J, received from the winning F2 unit, when F2 is active.
Provided we select a value for ρ that is less than or equal to one,
we can describe the matching condition as

    |S| / |I| > ρ        (7.17)

If Eq. (7.17) holds, then the network will take the top-down template
as a match to the current input pattern.
Defining a vigilance criterion in this manner endows the network
with an important property called self-scaling. This property, illustrated
in Figure 7.4, enables the network to ignore minor differences in patterns,
such as might arise due to random noise, and yet remain sensitive to
major differences.
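As a small illustration (a sketch only; the function name matchQ is mine,
and vmag1 is defined formally in the next section), the vigilance test of
Eqs. (7.15) and (7.17) is nearly a one-liner in Mathematica:

vmag1[v_] := Count[v, 1]                            (* Eq. (7.15) *)
matchQ[s_, in_, rho_] := vmag1[s]/vmag1[in] > rho   (* Eq. (7.17) *)

matchQ[{0, 0, 0, 0, 1}, {0, 0, 1, 0, 1}, 0.9]

False

Here |S|/|I| = 1/2, well below the vigilance of 0.9, so the template would
be rejected.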

Weight Updates on ART1  Weight updates on both layers of an
ART1 network are easy to describe in terms of the final results, but somewhat
more complex to derive. As we have done previously, we present
only the final results here. The example calculation in the next section
will serve to illustrate how the particular weight-update equations serve
to encode input patterns in the network.
Weight updates occur only on the winning F2 unit, and on those
connections from the winning F2 unit down to the F1 units. Using the
subscript J to denote the index of the winning F2 unit, we can express
the weight updates on the two layers as follows. For F1 units,

    z_{iJ} = 1  if v_i is active
    z_{iJ} = 0  if v_i is inactive        (7.18)

and for the winning F2 unit,

    z_{Ji} = L / (L - 1 + |S|)  if v_i is active
    z_{Ji} = 0                  if v_i is inactive        (7.19)

Figure 7.4 This figure illustrates the self-scaling property of ART1 networks. (a) For a value
of ρ = 0.8, the existence of the extra feature in the center of the top-down pattern on
the right is ignored by the orienting subsystem, which considers both patterns to be of the
same class. (b) For the same value of ρ, these bottom-up and top-down patterns will cause
the orienting subsystem to send a reset to F2.

We have now assembled all of the equations and conditions necessary
to build an ART1 network. Let's proceed to a step-by-step calculation as
a demonstration of some of the processing characteristics of this network.

7.1.3 ART1 Processing Example


In this section, we shall walk through an example that illustrates some
of the characteristics of the learning process in an ART1 example. This
particular example follows the one that we give in Section 8.2.3 of Neural
Networks. First, we initialize some of the network parameters and define
the input vectors that we will use later.

Network Initialization and Function Definitions  The dimension of
the F1 layer is

f1dim = 5;

The dimension of the F2 layer is

f2dim = 6;
We set the system vigilance parameter at a high value to ensure exact
matches.

rho = 0.9;

The input vectors to layer F1 are arbitrary, though we shall exploit
some relationships between some of them later on. Here is a list of the
10 input vectors.

in = {{0, 0, 0, 1, 0}, {0, 0, 1, 0, 1}, {1, 0, 0, 0, 1},
      {1, 1, 1, 1, 1}, {1, 1, 0, 0, 0}, {0, 1, 0, 1, 1},
      {0, 0, 0, 0, 1}, {0, 1, 1, 1, 1}, {1, 0, 1, 1, 0},
      {0, 1, 1, 1, 0}};

The vector magnitude for ART1 networks has a different definition than
the standard vector magnitude.

vmag1[v_] := Count[v, 1]

The function resetflag1 tells us whether the matching condition established
by the vigilance parameter has been met or not.

resetflag1[outp_, inp_] := If[vmag1[outp]/vmag1[inp] < rho, True, False]

where outp refers to the output of F1 and inp refers to the input vector.
We shall also use a function that returns the index of the winning unit
in a competitive layer whose outputs are assembled into a vector, p. This
function assumes that the winning unit is the one having an activation
of val.

winner[p_,val_] := First[First[Position[p,val]]]

To keep track of which F2 elements have been inhibited by the orienting
subsystem, we maintain a list as follows (1 for not inhibited, 0 for
inhibited):

droplistinit = Table[1, {f2dim}]

{1, 1, 1, 1, 1, 1}

droplist = droplistinit

{1, 1, 1, 1, 1, 1}

The output functions for the two layers have identical definitions. For
the F1 layer:

h[x_] := If[x > 0, 1, 0]

s[x_] := Map[h, x]

and for the F2 layer back to F1:

f[x_] := If[x > 0, 1, 0]

u[x_] := Map[f, x]

The F1 layer has several parameters, which we define as follows:

a1 = 1; b1 = 1.5; c1 = 5; d1 = 0.9;

The top-down weights on connections from F2 units to F1 units comprise
a matrix, which we denote z12, having one row for each unit on F1, and
one column for each unit on F2. According to Eq. (7.9), the initial weights
must be greater than (B1 - 1)/D1. This fact explains the addition of 0.2
to the weight initialization equation in the following.

MatrixForm[z12 = Table[
  Table[N[(b1 - 1)/d1 + 0.2, 3], {f2dim}], {f1dim}]]

0.756 0.756 0.756 0.756 0.756 0.756


0.756 0.756 0.756 0.756 0.756 0.756
0.756 0.756 0.756 0.756 0.756 0.756
0.756 0.756 0.756 0.756 0.756 0.756
0.756 0.756 0.756 0.756 0.756 0.756

The only parameter required on the F2 layer is L from Eq. (7.14); we
define that parameter here.

el = 3;

Weights on connections from F1 units to F2 units comprise the matrix
z21, having one row for each F2 unit and one column for each F1 unit.
The subtraction of 0.1 is optional, since the weights may be initialized
identically to L/(L - 1 + M) from Eq. (7.14).

MatrixForm[z21 = Table[
  Table[N[(el/(el - 1 + f1dim) - 0.1), 3], {f1dim}], {f2dim}]]

0.329 0.329 0.329 0.329 0.329


0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329

The initial activities of the F1 units are all negative, according to Eq. (7.5).

xf1 = Table[-b1/(1 + c1), {f1dim}]

{-0.25, -0.25, -0.25, -0.25, -0.25}

The function compete takes as its argument the list of activities on the F2
layer, and returns a new list in which only the unit with the largest activity
retains a nonzero activity; in other words, this function implements
competition on F2.

compete[f2Activities_] :=
  Module[{i, x, f2dim, maxpos},
    x = f2Activities;
    maxpos = First[First[Position[x, Max[f2Activities]]]];
    f2dim = Length[x];
    For[i = 1, i <= f2dim, i++,
      If[i != maxpos, x[[i]] = 0; Continue] (* end of If *)
    ]; (* end of For *)
    Return[x];
  ]; (* end of Module *)

Processing Sequence  To begin, we select the first input vector in
the list.

in[[1]]

{0, 0, 0, 1, 0}

With positive inputs from below, the F1 activities rise.

xf1 = N[in[[1]]/(1 + a1*(in[[1]] + b1) + c1), 3]

{0, 0, 0, 0.118, 0}

The only unit with a nonzero activity, however, is the one with a nonzero
input from below. The output of F1 is

sf1 = s[xf1]

{0, 0, 0, 1, 0}

The dot product between the weights and the input values from below
determines the net inputs to the F2 units.

t = N[z21 . sf1, 3]

{0.329, 0.329, 0.329, 0.329, 0.329, 0.329}

The activity on F2 equals the net inputs in this approximation.

xf2 = t

{0.329, 0.329, 0.329, 0.329, 0.329, 0.329}

Next, we find the unit with maximum net input,

xf2 = compete[xf2]

{0.329, 0, 0, 0, 0, 0}

and compute the output values of F2.

uf2 = u[xf2]

{1, 0, 0, 0, 0, 0}

We could combine the above two steps into one, but I left them separate
in order to remain true to the individual steps of the sequence.
Notice that the compete function, finding no clear winner of the competition,
returned the first, previously uncommitted unit on the list of F2
units.
We shall need to save the index of the winning unit for later.

windex = winner[uf2, 1]

Going back to the F1 layer, the net inputs back to F1 from F2 are

v = N[z12 . uf2, 3]

{0.756, 0.756, 0.756, 0.756, 0.756}

The new equilibrium activities of the F1 units are

xf1 = N[(in[[1]] + d1*v - b1)/(1 + a1*(in[[1]] + d1*v) + c1), 3]

{-0.123, -0.123, -0.123, 0.0234, -0.123}



and the new output values of F1 are

sf1 = s[xf1]

{0, 0, 0, 1, 0}

Notice that the new output vector is identical to the input vector. We
expect, therefore, that the orienting subsystem (implemented partially
as the resetflag1 function) will indicate no mismatch.

resetflag1[sf1, in[[1]]]

False

A False from resetflag1 indicates that resonance has been reached; therefore,
we can adjust the weights on both layers. The following procedure
sets the windex element of each weight vector on F1 equal to 1 if the output
of the corresponding F1 unit is equal to 1.

z12 = Transpose[z12];
z12[[windex]] = sf1; (* just use the values in sf1 *)
MatrixForm[z12 = Transpose[z12]]

0 0.756 0.756 0.756 0.756 0.756


0 0.756 0.756 0.756 0.756 0.756
0 0.756 0.756 0.756 0.756 0.756
1 0.756 0.756 0.756 0.756 0.756
0 0.756 0.756 0.756 0.756 0.756

As you can see, the weights on the F1 units have encoded the input
vector, but the encoding is distributed over the entire set of units. Since
unit 1 was the winner on F2, weights from that unit back to each F1 unit
are the only ones to be changed. On the other hand, only the winning
unit on F2 has its weights updated. Again you will see that the first unit
on F2 has encoded the input vector.

z21[[windex]] = N[el/(el - 1 + vmag1[sf1]) sf1, 3]

{0, 0, 0, 1., 0}

Now let's repeat the matching process for an input vector that is
orthogonal to in[[1]].

in[[2]]

{0, 0, 1, 0, 1}

xf1 = N[in[[2]]/(1 + a1*(in[[2]] + b1) + c1), 3]

{0, 0, 0.118, 0, 0.118}

sf1 = s[xf1]

{0, 0, 1, 0, 1}

t = N[z21 . sf1, 3]

{0, 0.657, 0.657, 0.657, 0.657, 0.657}

xf2 = t

{0, 0.657, 0.657, 0.657, 0.657, 0.657}

Notice that since the first unit on F2 has encoded a vector orthogonal to
the current input vector, the net input to that unit is zero.

xf2 = compete[xf2]

{0, 0.657, 0, 0, 0, 0}

Once again, compete returned the first uncommitted unit, since none of the
units was a clear winner. Continuing on,

uf2 = u[xf2]

{0, 1, 0, 0, 0, 0}

windex = winner[uf2, 1]

v = N[z12 . uf2, 3]

{0.756, 0.756, 0.756, 0.756, 0.756}

xf1 = N[(in[[2]] + d1*v - b1)/(1 + a1*(in[[2]] + d1*v) + c1), 3]

{-0.123, -0.123, 0.0234, -0.123, 0.0234}



sf1 = s[xf1]

{0, 0, 1, 0, 1}

resetflag1[sf1, in[[2]]]

False

Once again, we have resonance in one pass through the network. Since
unit 2 on F2 is the winner, the second weight values on the F1 units
encode the new input vector.

z12 = Transpose[z12];
z12[[windex]] = sf1; (* just use the values in sf1 *)
MatrixForm[z12 = Transpose[z12]]

0 0 0.756 0.756 0.756 0.756


0 0 0.756 0.756 0.756 0.756
0 1 0.756 0.756 0.756 0.756
1 0 0.756 0.756 0.756 0.756
0 1 0.756 0.756 0.756 0.756

and the second unit on F2 also encodes the input vector. Notice, however,
that the nonzero weight values are not equal to one in this case, since
vmag1[sf1] is not equal to one. Nevertheless, the pattern of the weights is
identical to the input vector.

z21[[windex]] = N[el/(el - 1 + vmag1[sf1]) sf1, 3]

{0, 0, 0.75, 0, 0.75}

The entire weight matrix on F2 is

MatrixForm[z21]

0 0 0 1. 0
0 0 0.75 0 0.75
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329

Now let's try an input vector that is a subset of the input vector in[[2]],
namely in[[7]].

in[[7]]

{0, 0, 0, 0, 1}

xf1 = N[in[[7]]/(1 + a1*(in[[7]] + b1) + c1), 3]

{0, 0, 0, 0, 0.118}

sf1 = s[xf1]

{0, 0, 0, 0, 1}

t = N[z21 . sf1, 3]

{0, 0.75, 0.329, 0.329, 0.329, 0.329}

xf2 = t

{0, 0.75, 0.329, 0.329, 0.329, 0.329}

xf2 = compete[xf2]

{0, 0.75, 0, 0, 0, 0}

Even though unit 2 has already encoded a vector different from the current
input vector, there is enough similarity between the encoded vector
and the input vector to allow unit 2 to win the competition. Trace through
the following six steps very carefully.

uf2 = u[xf2]

{0, 1, 0, 0, 0, 0}

windex = winner[uf2, 1]

v = N[z12 . uf2, 3]

{0, 0, 1., 0, 1.}

xf1 = N[(in[[7]] + d1*v - b1)/(1 + a1*(in[[7]] + d1*v) + c1), 3]

{-0.25, -0.25, -0.087, -0.25, 0.0506}



sf1 = s[xf1]

{0, 0, 0, 0, 1}

resetflag1[sf1, in[[7]]]

False

It may seem surprising that, even though unit 2 previously encoded a
vector that does not match the current input vector within the vigilance
parameter, there is still no reset. This behavior is characteristic of ART1,
in that the appearance of a subset vector, following the encoding of a
superset vector, may cause the unit encoding the superset vector to recode
its weights to match the subset vector. The weight matrices now appear
as follows:

z12 = Transpose[z12];
z12[[windex]] = sf1; (* just use the values in sf1 *)
MatrixForm[z12 = Transpose[z12]]

0 0 0.756 0.756 0.756 0.756


0 0 0.756 0.756 0.756 0.756
0 0 0.756 0.756 0.756 0.756
1 0 0.756 0.756 0.756 0.756
0 1 0.756 0.756 0.756 0.756

z21[[windex]] = N[el/(el - 1 + vmag1[sf1]) sf1, 3]

{0, 0, 0, 0, 1.}

MatrixForm[z21]

0 0 0 1. 0
0 0 0 0 1.
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329

Now let's put the superset vector back in to see what happens.

in[[2]]

{0, 0, 1, 0, 1}

xf1 = N[in[[2]]/(1 + a1*(in[[2]] + b1) + c1), 3]

{0, 0, 0.118, 0, 0.118}

sf1 = s[xf1]

{0, 0, 1, 0, 1}

t = N[z21 . sf1, 3]

{0, 1., 0.657, 0.657, 0.657, 0.657}

xf2 = t

{0, 1., 0.657, 0.657, 0.657, 0.657}

xf2 = compete[xf2]

{0, 1., 0, 0, 0, 0}
Once again, unit two wins the competition. The ART1 network will not
turn out to be very useful if this unit again recodes itself to the superset
vector. If that situation prevailed, we would not be able to encode both
a superset and a subset vector at the same time, a serious limitation in a
network that uses only binary vectors as inputs. Let's see what happens.

uf2 = u[xf2]

{0, 1, 0, 0, 0, 0}

windex = winner[uf2, 1]

v = N[z12 . uf2, 3]

{0, 0, 0, 0, 1.}
xf1 = N[(in[[2]] + d1*v - b1)/(1 + a1*(in[[2]] + d1*v) + c1), 3]

{-0.25, -0.25, -0.0714, -0.25, 0.0506}

sf1 = s[xf1]

{0, 0, 0, 0, 1}

resetflag1[sf1, in[[2]]]

True

Fortunately, this time we get a reset. To implement the second part of
the orienting subsystem, the part that inhibits units that have won but
caused a reset, we use the droplist.

If[resetflag1[sf1, in[[2]]] == True,
  droplist[[windex]] = 0, Continue]

droplist

{1, 0, 1, 1, 1, 1}

We shall see momentarily how we employ this droplist. For the moment,
let's reestablish the input vector and begin another matching cycle.

in[[2]]

{0, 0, 1, 0, 1}

xf1 = N[in[[2]]/(1 + a1*(in[[2]] + b1) + c1), 3]

{0, 0, 0.118, 0, 0.118}

sf1 = s[xf1]

{0, 0, 1, 0, 1}

t = N[z21 . sf1, 3]

{0, 1., 0.657, 0.657, 0.657, 0.657}

xf2 = N[t droplist, 3] (* here is where we inhibit units *)

{0, 0, 0.657, 0.657, 0.657, 0.657}



By multiplying the activity vector on F2 by the droplist vector, we effectively
inhibit those units that correspond to positions on droplist having
a zero value, thus removing them from the competition.

xf2 = compete[xf2]

{0, 0, 0.657, 0, 0, 0}

Having eliminated the second unit from the competition, compete returns
the third unit as the winning unit.

uf2 = u[xf2]

{0, 0, 1, 0, 0, 0}

windex = winner[uf2, 1]

v = N[z12 . uf2, 3]

{0.756, 0.756, 0.756, 0.756, 0.756}

xf1 = N[(in[[2]] + d1*v - b1)/(1 + a1*(in[[2]] + d1*v) + c1), 3]

{-0.123, -0.123, 0.0234, -0.123, 0.0234}

sf1 = s[xf1]

{0, 0, 1, 0, 1}

resetflag1[sf1, in[[2]]]

False

Since the third unit had not previously encoded a pattern, we do not get
a reset, and the superset vector is encoded by the third unit. Now both
superset and subset vectors are encoded in the network independently, as
the following weight matrices show.

z12 = Transpose[z12];
z12[[windex]] = sf1; (* just use the values in sf1 *)
MatrixForm[z12 = Transpose[z12]]

0 0 0 0.756 0.756 0.756


0 0 0 0.756 0.756 0.756
0 0 1 0.756 0.756 0.756
1 0 0 0.756 0.756 0.756
0 1 1 0.756 0.756 0.756

z21[[windex]] = N[el/(el - 1 + vmag1[sf1]) sf1, 3]

{0, 0, 0.75, 0, 0.75}

MatrixForm[z21]

0 0 0 1. 0
0 0 0 0 1.
0 0 0.75 0 0.75
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329

If other subset vectors had been previously encoded in the network, then
we may have had other resets until all units encoding subset vectors had
been disabled by the orienting subsystem. At the end of the matching
cycle, the droplist vector will have zeros in all positions corresponding
to units that won the competition on F2 but caused a reset.
Thus, before beginning the next matching cycle with a new input vector,
you must remember to reinitialize droplist as follows:

droplist = droplistinit;

Experiment with the other input vectors in the list. You will learn more
about how ART1 operates by this experimentation than you will by
reading about it.

7.1.4 The Complete ART1 Network

By collecting the various processing steps into one procedure, we can
build a complete ART1 simulator. We will write a separate procedure to
initialize the weights.
To initialize the weights, we need to know the dimensions of the two
layers, and the values of the parameters B1, D1, and L. In addition,
this routine takes two additional parameters, del1 and del2, which can be
used to alter the initial values of the weights slightly. Be sure to select
del1 and del2 within the constraints of the weight-initialization equations,
Eqs. (7.9) and (7.14).

art1Init[f1dim_, f2dim_, b1_, d1_, el_, del1_, del2_] :=
  Module[{z12, z21},
    z12 = Table[
      Table[(b1 - 1)/d1 + del1, {f2dim}], {f1dim}];
    z21 = Table[
      Table[(el/(el - 1 + f1dim) - del2), {f1dim}], {f2dim}];
    Return[{z12, z21}];
  ]; (* end of Module *)

For the example that we shall perform later in this section we shall set
the dimension of the F1 layer to 25, and the dimension of the F2 layer
to five. For the other parameters, we shall use the same values that we
used in the example of the previous section.

{topDown, bottomUp} =
  art1Init[25, 5, 1.5, 0.9, 3, 0.2, 0.1];

As before, all weights on F1 are initialized to a value of 0.756. Weights
on F2 are initialized to a value of 0.111.
The ART1 simulator in Listing 7.1 requires all of the inputs provided
to the art1Init routine, as well as the additional parameters A1 and C1,
the vigilance parameter rho, the two weight matrices, and a list of input
vectors to be encoded. The network will process the list of input
patterns until all patterns have been encoded and the network has stabilized.
As with art1Init, the function returns the weight matrices, along with a list
called matchList which we can use to determine the sequence of codings
and recodings performed by the network. The function also provides a
running commentary so that you can follow the processing. The previously
defined functions vmag1, compete, and winner are also required, along with
resetflag1, which the listing calls with the vigilance parameter as an
explicit third argument. Let's verify the code by using the same three
input vectors that we did for the example in the previous section.

{topDown, bottomUp} = art1Init[5, 6, 1.5, 0.9, 3, 0.2, 0.1];

ins = {{0, 0, 0, 1, 0}, {0, 0, 1, 0, 1}, {0, 0, 0, 0, 1}};

{td, bu, mlist} = art1[5, 6, 1, 1.5, 5, 0.9, 3, 0.9, topDown, bottomUp, ins];

art1[f1dim_, f2dim_, a1_, b1_, c1_, d1_, el_, rho_, f1Wts_, f2Wts_, inputs_] :=
Module[{droplistinit, droplist, notDone = True, i, nIn = Length[inputs], reset,
        n, in, xf1, sf1, t, xf2, uf2, v, windex, matchList, newMatchList, tdWts, buWts},
 droplistinit = Table[1, {f2dim}];              (* initialize droplist *)
 tdWts = f1Wts; buWts = f2Wts;
 matchList =   (* construct list of F2 units and encoded input patterns *)
   Table[{StringForm["Unit ``", n]}, {n, f2dim}];
 While[notDone == True, newMatchList = matchList;  (* process until stable *)
  For[i = 1, i <= nIn, i++, in = inputs[[i]];      (* process inputs in sequence *)
   droplist = droplistinit; reset = True;          (* initialize *)
   While[reset == True,                            (* cycle until no reset *)
    xf1 = in/(1 + a1*(in + b1) + c1);              (* F1 activities *)
    sf1 = Map[If[# > 0, 1, 0] &, xf1];             (* F1 outputs *)
    t = buWts . sf1;                               (* F2 net inputs *)
    t = t droplist;                                (* turn off inhibited units *)
    xf2 = compete[t];                              (* F2 activities *)
    uf2 = Map[If[# > 0, 1, 0] &, xf2];             (* F2 outputs *)
    windex = winner[uf2, 1];                       (* winning index *)
    v = tdWts . uf2;                               (* F1 net inputs *)
    xf1 = (in + d1*v - b1)/(1 + a1*(in + d1*v) + c1);  (* new F1 activities *)
    sf1 = Map[If[# > 0, 1, 0] &, xf1];             (* new F1 outputs *)
    reset = resetflag1[sf1, in, rho];              (* check reset *)
    If[reset == True, droplist[[windex]] = 0;      (* update droplist *)
      Print["Reset with pattern ", i, " on unit ", windex], Continue];
   ]; (* end of While reset==True *)
   Print["Resonance established on unit ", windex, " with pattern ", i];
   tdWts = Transpose[tdWts];  (* resonance, so update weights, top down first *)
   tdWts[[windex]] = sf1;
   tdWts = Transpose[tdWts];
   buWts[[windex]] = el/(el - 1 + vmag1[sf1]) sf1;  (* then bottom up *)
   matchList[[windex]] =                            (* update matching list *)
     Reverse[Union[matchList[[windex]], {i}]];
  ]; (* end of For i=1 to nIn *)
  If[matchList == newMatchList, notDone = False;  (* see if matchList is static *)
    Print["Network stable"],
    Print["Network not stable"];
    newMatchList = matchList];]; (* end of While notDone==True *)
 Return[{tdWts, buWts, matchList}];
]; (* end of Module *)

Listing 7.1

Resonance established on unit 1 with pattern 1


Resonance established on unit 2 with pattern 2
Resonance established on unit 2 with pattern 3
Network not stable
Resonance established on unit 1 with pattern 1
Reset with pattern 2 on unit 2
Resonance established on unit 3 with pattern 2
Resonance established on unit 2 with pattern 3
Network not stable
Resonance established on unit 1 with pattern 1
Resonance established on unit 3 with pattern 2
Resonance established on unit 2 with pattern 3
Network stable

The network exhibits the same behavior that we saw in the previous
section. On the first cycle through the patterns, unit two of F2 was
recoded to the subset vector after previously encoding the superset vector.
On the second cycle through, unit two caused a reset on the second
pattern, which was subsequently encoded on unit 3. A third cycle produced
no changes, so the network was declared stable. The match list is

TableForm[mlist]

Unit 1   1
Unit 2   3 2
Unit 3   2
Unit 4
Unit 5
Unit 6

To interpret this list, you must recognize that the patterns encoded
by each unit are listed from left to right with the most recent pattern
appearing first on the left (after the unit number which is the first number
next to "Unit"). If a pattern appears on more than one unit's list, the unit
with that pattern farthest to the left is the one that currently encodes the
pattern; for example, pattern two appears on the list of both unit two
and unit three. Since the encoding for unit three is more recent, that unit
currently encodes pattern two. This list may get difficult to interpret if
there are a large number of patterns and recodings, but it is simple to
implement here and serves the immediate illustrative purpose.

You can compare the new weight vectors to verify that the calculations
were correct. The top-down weights are

N[MatrixForm[td], 3]

0 0 0 0.756 0.756 0.756


0 0 0 0.756 0.756 0.756
0 0 1. 0.756 0.756 0.756
1. 0 0 0.756 0.756 0.756
0 1. 1. 0.756 0.756 0.756

and the bottom-up weights are

N[MatrixForm[bu],3]

0 0 0 1. 0
0 0 0 0 1.
0 0 0.75 0 0.75
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329
0.329 0.329 0.329 0.329 0.329

which are identical to the results we obtained earlier.


To end this discussion of ART1, I shall construct for you a list of
input vectors that correspond to the first six letters of the alphabet. To
gain facility with ART1, you should experiment with this input list, vary¬
ing the network parameters and especially the vigilance parameter. You
might also try constructing a noisy version of these letters to see how the
network responds, once trained with the noise-free letters. The letters
are five-by-five pixel representations and appear in Listing 7.2. To view
these input patterns in a more illuminating manner, we can employ the
density plotting capability of Mathematica as we did in Chapter 1; for
example:

ListDensityPlot[Reverse[Partition[letterln[[6]],5]]]

letterIn = { {0,0,1,0,0,
              0,1,0,1,0,
              1,0,0,0,1,   (* A *)
              1,1,1,1,1,
              1,0,0,0,1},
             {1,1,1,1,0,
              1,0,0,0,1,
              1,1,1,1,0,   (* B *)
              1,0,0,0,1,
              1,1,1,1,0},
             {1,1,1,1,1,
              1,0,0,0,0,
              1,0,0,0,0,   (* C *)
              1,0,0,0,0,
              1,1,1,1,1},
             {1,1,1,1,0,
              1,0,0,0,1,
              1,0,0,0,1,   (* D *)
              1,0,0,0,1,
              1,1,1,1,0},
             {1,1,1,1,1,
              1,0,0,0,0,
              1,1,1,1,0,   (* E *)
              1,0,0,0,0,
              1,1,1,1,1},
             {1,1,1,1,1,
              1,0,0,0,0,
              1,1,1,1,0,   (* F *)
              1,0,0,0,0,
              1,0,0,0,0} };

Listing 7.2
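If you do try noisy versions of the letters, a short helper can flip a few
randomly chosen pixels (a sketch; the name noisy is mine, and note that
flipping the same pixel twice restores it):

noisy[pattern_, nFlips_] :=
  Module[{p = pattern, k},
    Do[k = Random[Integer, {1, Length[p]}];
       p[[k]] = 1 - p[[k]], {nFlips}];
    p]

noisy[letterIn[[1]], 2]   (* the letter A with up to two pixels flipped *)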

These six patterns have sufficient similarities between several letter pairs
that relatively minor changes in the vigilance parameter can drastically
affect how the network encodes them. Let's look at two examples. First,
consider ρ = 0.9, which for all intents means perfect matching for this
example; and use the same parameters to initialize the weights that we
used in the previous example.

{topDown, bottomUp} =
  art1Init[25, 6, 1.5, 0.9, 3, 0.2, 0.1];

{td, bu, mlist} = art1[25, 6, 1, 1.5, 5, 0.9, 3, 0.9, topDown, bottomUp, letterIn];

Resonance established on unit 1 with pattern 1


Reset with pattern 2 on unit 1
Resonance established on unit 2 with pattern 2
Reset with pattern 3 on unit 2
Reset with pattern 3 on unit 1
Resonance established on unit 3 with pattern 3
Resonance established on unit 2 with pattern 4
Reset with pattern 5 on unit 3
Reset with pattern 5 on unit 2
Reset with pattern 5 on unit 1
Resonance established on unit 4 with pattern 5
Resonance established on unit 4 with pattern 6

Network not stable


Resonance established on unit 1 with pattern 1
Reset with pattern 2 on unit 2
Reset with pattern 2 on unit 4
Reset with pattern 2 on unit 3
Reset with pattern 2 on unit 1
Resonance established on unit 5 with pattern 2
Resonance established on unit 3 with pattern 3
Resonance established on unit 2 with pattern 4
Reset with pattern 5 on unit 3
Reset with pattern 5 on unit 4
Reset with pattern 5 on unit 5
Reset with pattern 5 on unit 2
Reset with pattern 5 on unit 1
Resonance established on unit 6 with pattern 5
Resonance established on unit 4 with pattern 6
Network not stable
Resonance established on unit 1 with pattern 1
Resonance established on unit 5 with pattern 2
Resonance established on unit 3 with pattern 3
Resonance established on unit 2 with pattern 4
Resonance established on unit 6 with pattern 5
Resonance established on unit 4 with pattern 6
Network stable

This network required a total of 15 resets during the encoding process.
Recall from several paragraphs ago that the initial values of the F2-layer
weights are 0.111 in this example. The larger these weights, the more
favored is an uncommitted node over a previously committed node in the
competition on F2. Let's see what happens if we increase the initial values
of these weights considerably. We can effect this change by increasing
the value of the L parameter; in this case we let L = 25, so that each weight
on F2 has an initial value of

N[25/(25 - 1 + 25), 3]

0.51

{topDown, bottomUp} = art1Init[25, 6, 1.5, 0.9, 25, 0.2, 0.1];

{td, bu, mlist} = art1[25, 6, 1, 1.5, 5, 0.9, 25, 0.9, topDown, bottomUp, letterIn];

Resonance established on unit 1 with pattern 1


Resonance established on unit 2 with pattern 2
Reset with pattern 3 on unit 2
Resonance established on unit 3 with pattern 3
Resonance established on unit 2 with pattern 4
Reset with pattern 5 on unit 3
Reset with pattern 5 on unit 2
Resonance established on unit 4 with pattern 5
Resonance established on unit 4 with pattern 6
Network not stable
Resonance established on unit 1 with pattern 1
Reset with pattern 2 on unit 2
Reset with pattern 2 on unit 4
Reset with pattern 2 on unit 3
Resonance established on unit 5 with pattern 2
Resonance established on unit 3 with pattern 3
Resonance established on unit 2 with pattern 4
Reset with pattern 5 on unit 3
Reset with pattern 5 on unit 5
Reset with pattern 5 on unit 4
Reset with pattern 5 on unit 2
Resonance established on unit 6 with pattern 5
Resonance established on unit 4 with pattern 6
Network not stable
Resonance established on unit 1 with pattern 1
Resonance established on unit 5 with pattern 2
Resonance established on unit 3 with pattern 3
Resonance established on unit 2 with pattern 4
Resonance established on unit 6 with pattern 5
Resonance established on unit 4 with pattern 6
Network stable

The parameter change reduced the number of resets from 15 down to 10.
There are a great many more such experiments that you can perform to
gain experience with the ART1 network.

7.2 ART2
Superficially, the main difference between ART1 and ART2 is that ART2
accepts input vectors whose components can have any real number as
their value. In its execution, the ART2 network is considerably different
from the ART1 network.
Aside from the obvious fact that binary and analog patterns differ
in the nature of their respective components, ART2 must deal with
additional complications. For example, ART2 must be able to recognize
the underlying similarity of identical patterns superimposed on constant
backgrounds having different levels. Compared in an absolute sense, two
such patterns may appear entirely different when, in fact, they should be
classified as the same pattern.
The price for this additional capability is primarily an increase in
complexity on the F1 processing level. The ART2 F1 level comprises
several sublayers and several gain-control systems. Processing on F2 is
the same for ART2 as it was for ART1. As partial compensation for the
added complexity on the F1 layer, the weight-update equations are a bit
simpler for ART2 than they were for ART1. In a software simulation,
however, weight updates have the same complexity in either network.

7.2.1 ART2 Architecture


Figure 7.5 illustrates the architecture of the ART2 network; the similarities
to and differences from ART1 should be apparent from that diagram.
Both networks have attentional and orienting subsystems, as well as gain
control systems.
You should be aware that there are many variations of the ART2
network. The version shown in Figure 7.5 matches the one in Neural
Networks, but I do not intend to imply that this particular version is the
best; it is merely a starting point for your own investigations.

7.2.2 ART2 Processing Equations


The equations governing the dynamics of the ART2 network are almost
identical to those for ART1. Moreover, since the processing done on the
F2 layer of ART2 is identical to that of the ART1 F2 layer, we shall be
concerned primarily with the F1 layer in the ensuing discussion. On each

Figure 7.5 This figure shows the ART2 architecture. The overall structure is the same as
that of ART1. The F1 layer has been divided into six sublayers, w, x, u, v, p, and q. Each
node labeled G is a gain-control unit that sends an inhibitory signal to each unit on the
layer it feeds. All sublayers on F1, as well as the r layer of the orienting subsystem, have
the same number of units. Individual sublayers on F1 are connected unit-to-unit; that is,
the layers are not fully interconnected, with the exception of the bottom-up connections to
F2 and the top-down connections from F2.

of the F1 sublayers, the processing equations take the form

    ẋ_k = -A x_k + (1 - B x_k) J_k^+ - (C + D x_k) J_k^-        (7.20)

where A, B, C, and D are constants, and the definitions of J_k^+ and J_k^-
depend on the particular sublayer. In all cases we shall set B = C = 0.
Since we shall, once again, be interested in the asymptotic solutions,
we can solve Eq. (7.20) in that case to find equilibrium values for the
activities:

    x_k = J_k^+ / (A + D J_k^-)        (7.21)

Table 7.1 summarizes the values of A, D, J_k^+, and J_k^- for the six
sublayers of F1, as well as for the units in the r layer of the orienting
subsystem.

Layer   A   D   J_k^+                       J_k^-
w       1   1   I_i + a u_i                 0
x       e   1   w_i                         |w|
u       e   1   v_i                         |v|
v       1   1   f(x_i) + b f(q_i)           0
p       1   1   u_i + Σ_j g(y_j) z_{ij}     0
q       e   1   p_i                         |p|
r       e   1   u_i + c p_i                 |u| + |cp|

Table 7.1 This table summarizes the factors in Eq. (7.21) for each of the sublayers on F1,
and for the r layer. I_i is the ith component of the input vector; a, b, and c are constants; e is a
small, positive constant used to prevent division by zero in the case where the magnitude
of the vector quantities is zero; y_j is the activity of the jth unit of the F2 layer. f and g
are functions that are described in the text.

Using the definitions from Table 7.1, we can write the equilibrium
equations for each of the sublayers on F1 as follows:

    w_i = I_i + a u_i        (7.22)

    x_i = w_i / (e + |w|)        (7.23)

    v_i = f(x_i) + b f(q_i)        (7.24)

    u_i = v_i / (e + |v|)        (7.25)

    p_i = u_i + Σ_j g(y_j) z_{ij}        (7.26)

    q_i = p_i / (e + |p|)        (7.27)

The function f() acts as a thresholding function that the F1 layer uses
as a filter for noise. A sigmoid function would work well in this case,
but we shall use a simpler, linear threshold function:

    $f(x) = \begin{cases} 0 & 0 \le x < \theta \\ x & x \ge \theta \end{cases}$        (7.28)

where $\theta$ is a positive constant less than one.
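As a quick illustration of Eq. (7.28) (a check of my own, using the value
theta = 0.2 that we adopt in the example below; thresh is a hypothetical
name for the function f given above):

theta = 0.2;
thresh[x_] := If[x >= theta, x, 0]

{thresh[0.15], thresh[0.25]}

{0, 0.25}

Components below threshold are squelched, while the rest pass through
unchanged.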


Processing on F2 is identical to that on ART1. We determine a net-input
value for each unit,

    $T_j = \sum_i p_i z_{ij}$        (7.29)

then determine a winning unit according to which has the largest net-input
value.

g(y) is the output function for the units on the F2 layer. Since the F2
layer is a winner-take-all competitive layer, just as in ART1, g(y) has a
particularly simple form:

    $g(y_j) = \begin{cases} d & T_j = \max_k\{T_k\}\ \forall k \\ 0 & \text{otherwise} \end{cases}$        (7.30)

where d is a positive constant less than one.


With the above definition of the output function on F2, and knowing
that F2 is a winner-take-all competitive layer, we can rewrite the
equation for processing on the p sublayer of F1 as

    $p_i = \begin{cases} u_i & \text{if } F_2 \text{ is inactive} \\ u_i + d z_{iJ} & \text{if the } J\text{th node on } F_2 \text{ is active} \end{cases}$        (7.31)

The weight-update equations on ART2 turn out to be simpler than
those on ART1. Once again, only weights to or from the winning F2 unit
get updated, and only after resonance has been established. If $v_J$ is the
winning F2 node, then weights are modified according to

    $z_{iJ} = z_{Ji} = \frac{u_i}{1-d}$        (7.32)

Our choice of initial values for the weights is driven by performance
considerations, much as it was on ART1. From the discussion of the
orienting subsystem in the next section, we shall find that setting the
top-down weights initially to zero, $z_{ij}(0) = 0$, prevents reset during the
time when a new F2 node is being recruited to encode a new input
pattern. Weights on F2 are initialized with fairly large values in order
to bias the network toward the selection of a new, uncommitted node,
which, as in ART1, helps to keep the number of matching cycles to a
minimum. There are a number of alternate ways to initialize the bottom-up
weights. We shall stay with the method described in Chapter 8 of
Neural Networks:

    $z_{ji}(0) \le \frac{1}{(1-d)\sqrt{M}}$        (7.33)

where M is the dimension of the F1 layer. Similar considerations lead
to a condition relating the parameters c and d:

    $\frac{cd}{1-d} \le 1$        (7.34)
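Before moving on, a quick bit of arithmetic (my own check) on Eq. (7.33):
with the parameters of the upcoming example, M = 5 and d = 0.9, the
bound is

1/((1 - 0.9) Sqrt[5.])

4.47214

We shall initialize the bottom-up weights at half this bound,

0.5/((1 - 0.9) Sqrt[5.])

2.23607

which is the value that appears in the initial weight matrices below.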
We are almost ready to begin an example calculation. First, we must
examine the orienting subsystem in some detail, as the matching process
on ART2 is not quite as straightforward as it was on ART1.

7.2.3 ART2 Orienting Subsystem

From Table 7.1 we can construct the equation for the activities of the
nodes on the r layer of the orienting subsystem:

    $r_i = \frac{u_i + c p_i}{|u| + |cp|}$        (7.35)

where we have assumed that e = 0. The condition for reset is

    $\frac{\rho}{|r|} > 1$        (7.36)

where $\rho$ is the vigilance parameter, as in ART1.


Notice that two F1 sublayers, p and u, participate in the matching
process. As top-down weights change on the p layer during learning, the
activity of the units on the p layer also changes. The u layer remains stable
during this process, so including it in the matching process prevents
reset from occurring while learning of a new pattern is taking place.
Since $|r| = (r \cdot r)^{1/2}$, we can write

    $|r| = \frac{[1 + 2|cp|\cos(u,p) + |cp|^2]^{1/2}}{1 + |cp|}$        (7.37)

using Eq. (7.35), where cos(u, p) is the cosine of the angle between u and
p.
First, note that if u and p are parallel, then the above equation reduces
to |r| = 1, and there will be no reset. As long as there is no output from
F2, Eq. (7.26) shows that u = p, and there will be no reset in this case.
Suppose now that F2 does have an output from some winning unit,
and that the input pattern needs to be learned, or encoded, by the F2
unit. We also do not want a reset in this case. From Eq. (7.26) we see
that p = u + d z_J, where z_J is the weight vector on the winning F2 unit.
If we initialize all of the top-down weights, z_ij, to zero, then the initial
output from F2 will have no effect on the value of p; that is, p will remain
equal to u.
During the learning process itself, z_J becomes parallel to u according
to Eq. (7.32). Thus, p also becomes parallel to u, and again |r| = 1
and there is no reset. As with ART1, a sufficient mismatch between the
bottom-up input vector and the top-down template results in a reset. In
ART2, the bottom-up pattern is taken at the u sublevel of F1 and the
top-down template is taken at p.
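Here is a quick numerical confirmation (my own check, not from the text)
that |r| = 1 whenever p is parallel to u, independent of the constant c:

u = {0.6, 0.8}; p = 2 u; c = 0.1;
r = (u + c p)/(Sqrt[u.u] + Sqrt[(c p).(c p)]);
Sqrt[r.r]

1.

Clear[u,p,c,r];  (* clean up before the example calculation *)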

7.2.4 Example ART2 Calculation

Since the calculations that are done on F2 are the same for ART2 as they
were for ART1, we shall not spend time describing them here. On the
other hand, what happens to an input vector, as it passes through the
various layers of the ART2 network, is far from obvious. Moreover,
there are many variations on the F1 architecture. We shall not concern
ourselves with the second issue here, but rather, we shall spend a bit of
time looking at the calculations on the particular F1 layer described in
the previous section.
The simulation of F1 requires that we calculate the vector magnitude
on several of the sublayers. We shall use the standard definition of the
magnitude of a vector:

vmag2[v_] := Sqrt[v . v]

The parameters and other functions needed are as follows. We assume
e = 0 in all cases.

f1dim = 5;   (* F1 dimension *)

f2dim = 6;   (* F2 dimension *)

rho = 0.9;   (* we want close matches for now *)

fv[x_] := If[x>theta,x,0]
f[x_] := Map[fv,x]   (* output function on v layer *)

theta = 0.2;   (* threshold value *)

a1=10; b1=10; c1=0.1;   (* F1 parameters *)

Even though there is no output from F2 at this time, we shall define the
F2 output function and initialize the weight vectors in anticipation of
continuing on with the F2 calculation later.

d=0.9;
g[x_] := If[x>0,d,0]

z12 = Table[Table[0,{f2dim}],{f1dim}];   (* top-down *)

z21 = N[Table[Table[0.5/((1-d)*Sqrt[f1dim]),
         {f1dim}],{f2dim}]];
MatrixForm[N[z21,4]]

2.236 2.236 2.236 2.236 2.236
2.236 2.236 2.236 2.236 2.236
2.236 2.236 2.236 2.236 2.236
2.236 2.236 2.236 2.236 2.236
2.236 2.236 2.236 2.236 2.236
2.236 2.236 2.236 2.236 2.236
2.236 2.236 2.236 2.236 2.236

Let's define a set of three input vectors.

inputs = { {0.2, 0.7, 0.1, 0.5, 0.4},
           {0.8, 2.8, 0.4, 2.0, 1.6},
           {0.1, 0.3, 1.2, 2.0, 4.0} };

Notice that the second input vector is a multiple of the first, while the
third bears little resemblance to either of the first two.
Initialize the sublayer outputs to zero vectors.

ClearAll[w,x,u,v,p,q,r];
w=x=u=v=p=q=r=Table[0,{f1dim}];

We begin the calculation with the first input vector.

w = inputs[[1]] + a1 u

{0.2, 0.7, 0.1, 0.5, 0.4}

x is a normalized version of w.

x = w / vmag2[w]

{0.205196, 0.718185, 0.102598, 0.512989, 0.410391}

v = f[x] + b1 f[q]

{0.205196, 0.718185, 0, 0.512989, 0.410391}

Notice that the third component of v is zero, since the third component
of x did not meet the threshold criterion. Since there is no top-down
signal from F2, the remaining three sublayers, u, p, and q, all have the
same outputs.

u = v / vmag2[v]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

p=u

{0.206284, 0.721995, 0, 0.515711, 0.412568}

q = p / vmag2[p]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

We cannot stop the F1 calculation yet, however, since both u and p are
now nonzero. These sublayers provide feedback to other layers, so we
must iterate the F1 sublayers.

w = inputs[[1]] + a1 u

{2.26284, 7.91995, 0.1, 5.65711, 4.52568}

x = w / vmag2[w]

{0.206276, 0.721965, 0.00911578, 0.515689, 0.412551}

v = f[x] + b1 f[q]

{2.26912, 7.94191, 0, 5.6728, 4.53824}



u = v / vmag2[v]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

p = u

{0.206284, 0.721995, 0, 0.515711, 0.412568}

q = p / vmag2[p]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

A third iteration results in

w = inputs[[1]] + a1 u

{2.26284, 7.91995, 0.1, 5.65711, 4.52568}

x = w / vmag2[w]

{0.206276, 0.721965, 0.00911578, 0.515689, 0.412551}

v = f[x] + b1 f[q]

{2.26912, 7.94191, 0, 5.6728, 4.53824}

u = v / vmag2[v]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

p = u

{0.206284, 0.721995, 0, 0.515711, 0.412568}

q = p / vmag2[p]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

which does not change the results. We shall stop at two iterations through
F1 for all of our calculations.
Let's now apply the second input vector to F1. Recall that the second
input vector is a multiple of the first. Reinitialize the sublayer outputs
first.

ClearAll[w,x,u,v,p,q,r];
w=x=u=v=p=q=r=Table[0,{f1dim}];

w = inputs[[2]] + a1 u

{0.8, 2.8, 0.4, 2., 1.6}

x = w / vmag2[w]

{0.205196, 0.718185, 0.102598, 0.512989, 0.410391}

v = f[x] + b1 f[q]

{0.205196, 0.718185, 0, 0.512989, 0.410391}

u = v / vmag2[v]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

p= u

{0.206284, 0.721995, 0, 0.515711, 0.412568}

q = p / vmag2[p]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

Run a second iteration to ensure stability.

w = inputs[[2]] + a1 u

{2.86284, 10.0199, 0.4, 7.15711, 5.72568}

x = w / vmag2[w]

{0.206199, 0.721695, 0.0288103, 0.515497, 0.412397}

v = f[x] + b1 f[q]

{2.26904, 7.94164, 0, 5.6726, 4.53808}

u = v / vmag2[v]

{0.206284, 0.721995, 0, 0.515711, 0.412568}



art2F1[in_,a_,b_,c_,d_,tdWts_,f1d_,winr_:0] :=
  Module[{w,x,u,v,p,q,i},
    w=x=u=v=p=q=Table[0,{f1d}];
    For[i=1,i<=2,i++,
      w = in + a u;
      x = w / vmag2[w];
      v = f[x] + b f[q];
      u = v / vmag2[v];
      (* c is unused here; it enters only in the reset computation *)
      p = If[winr==0, u, u + d Transpose[tdWts][[winr]] ];
      q = p / vmag2[p];
    ]; (* end of For i *)
    Return[{u,p}];
  ] (* end of Module *)

Listing 7.3

p = u

{0.206284, 0.721995, 0, 0.515711, 0.412568}

q = p / vmag2[p]

{0.206284, 0.721995, 0, 0.515711, 0.412568}

After the v layer, the results are identical for the two input vectors. We
can conclude that the F1 layer performs several functions on an input
vector: vectors are normalized to the same ambient background level,
noise is eliminated using a threshold condition, and the final vector is
normalized to a length of one. Incidentally, the noise-reduction process
described above often goes by the name of contrast enhancement since,
as you can see by comparing the original input vector, after reduction to
a common background level, to the final output vector, values above the
threshold have been enhanced, while values below the threshold have
been reduced to zero.
For later use, we shall assemble the F1 sublayers into a single function,
art2F1, in Listing 7.3. The function returns the values of u and p, since
these are used later. in is the input vector; a, b, c, and d are the layer
parameters defined above; tdWts is the top-down weight matrix; f1d is
the dimension of F1; and winr is the index of the winning F2 unit. If winr
is zero, as it is by default, p will be equal to u. Let's try this function on
the third input vector to verify that claim.

{u,p} = art2F1[inputs[[3]],a1,b1,c1,d,z12,f1dim];

u

{0, 0, 0.259161, 0.431934, 0.863868}

p

{0, 0, 0.259161, 0.431934, 0.863868}

We can now assemble the functions into a complete ART2 simulator. That
development is the subject of the final section of this chapter.

7.2.5 The Complete ART2 Simulator

We can pattern the ART2 simulator after the ART1 simulator. In fact, to
construct the ART2 simulator, I began with the ART1 code. Also, as we
did before, we shall initialize the weights in a separate routine.

art2Init[f1dim_,f2dim_,d_,dell_] :=
  Module[{z12,z21},
    z12 = Table[Table[0,{f2dim}],{f1dim}];
    z21 = Table[Table[dell/((1.0-d)*Sqrt[f1dim]) //N,
          {f1dim}],{f2dim}];
    Return[{z12,z21}];
  ]; (* end of Module *)

The parameter dell determines what fraction of the maximum value in
Eq. (7.33) the bottom-up weights are initialized to.
The simulator requires the winner and compete routines that we developed
previously for ART1, but the routine to determine the reset flag is
slightly different. The reset routine for ART2 is

resetflag2[u_,p_,c_,rho_] :=
  Module[{r,flag},
    r = (u + c p) / (vmag2[u] + vmag2[c p]);
    If[rho/vmag2[r] > 1, flag=True, flag=False];
    Return[flag];
  ];
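As a quick sanity check (my own example, not from the text): when p
equals u, r reduces to u/|u|, so |r| = 1 and no reset occurs even at a
vigilance close to one:

resetflag2[{0.6, 0.8}, {0.6, 0.8}, 0.1, 0.999]

False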

art2[f1dim_,f2dim_,a1_,b1_,c1_,d_,theta_,rho_,f1Wts_,f2Wts_,inputs_] :=
  Module[{droplistinit,droplist,notDone=True,i,nIn=Length[inputs],reset,in,
          u,p,t,xf2,uf2,windex,matchList,newMatchList,tdWts,buWts},
    droplistinit = Table[1,{f2dim}]; (* initialize droplist *)
    tdWts = f1Wts; buWts = f2Wts;
    u = p = Table[0,{f1dim}];
    (* construct list of F2 units and encoded input patterns *)
    matchList = Table[{StringForm["Unit ``",n]},{n,f2dim}];
    While[notDone==True, newMatchList = matchList; (* process until stable *)
      For[i=1,i<=nIn,i++, (* process each input pattern in sequence *)
        droplist = droplistinit; (* initialize droplist for new input *)
        reset = True;
        in = inputs[[i]]; (* next input pattern *)
        windex = 0; (* initialize *)
        While[reset==True, (* cycle until no reset *)
          {u,p} = art2F1[in,a1,b1,c1,d,tdWts,f1dim,windex];
          t = buWts . p; (* F2 net inputs *)
          t = t droplist; (* turn off inhibited units *)
          xf2 = compete[t]; (* F2 activities *)
          uf2 = Map[g,xf2]; (* F2 outputs *)
          windex = winner[uf2,d]; (* winning index *)
          {u,p} = art2F1[in,a1,b1,c1,d,tdWts,f1dim,windex];
          reset = resetflag2[u,p,c1,rho]; (* check reset *)
          If[reset==True, droplist[[windex]] = 0; (* update droplist *)
            Print["Reset with pattern ",i," on unit ",windex],Continue];
        ]; (* end of While reset==True *)
        Print["Resonance established on unit ",windex," with pattern ",i];
        tdWts = Transpose[tdWts]; (* resonance, so update weights *)
        tdWts[[windex]] = u/(1-d); tdWts = Transpose[tdWts];
        buWts[[windex]] = u/(1-d);
        matchList[[windex]] = (* update matching list *)
          Reverse[Union[matchList[[windex]],{i}]];
      ]; (* end of For i=1 to nIn *)
      If[matchList==newMatchList,notDone=False; (* see if matchList is static *)
        Print["Network stable"],Print["Network not stable"];
        newMatchList = matchList];
    ]; (* end of While notDone==True *)
    Return[{tdWts,buWts,matchList}];
  ]; (* end of Module *)

Listing 7.4
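Listing 7.4 also relies on the compete and winner routines developed for
the ART1 simulator earlier in this book. If you do not have those loaded,
minimal stand-ins consistent with the way they are used here might look
like the following; these are my own sketches under that assumption,
not the book's ART1 definitions:

compete[t_] := Map[If[#==Max[t],#,0]&,t]   (* winner-take-all on net inputs *)
winner[u_,d_] := First[Flatten[Position[u,d]]]   (* index of the unit whose output is d *)

Both assume a unique winner with a positive net input.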

The complete ART2 simulator appears in Listing 7.4. To run the network
with the inputs and parameters defined in the previous section, first
initialize the weights

{f1W, f2W} = art2Init[5,6,0.9,0.5];

MatrixForm[f1W]

0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0

MatrixForm[f2W]

2.23607 2.23607 2.23607 2.23607 2.23607
2.23607 2.23607 2.23607 2.23607 2.23607
2.23607 2.23607 2.23607 2.23607 2.23607
2.23607 2.23607 2.23607 2.23607 2.23607
2.23607 2.23607 2.23607 2.23607 2.23607
2.23607 2.23607 2.23607 2.23607 2.23607

then use these weights and the various parameters in the calling sequence
for the art2 function

outputs = art2[5,6,10,10,0.1,0.9,0.2,0.999,
               f1W,f2W,inputs];

Resonance established on unit 1 with pattern 1


Resonance established on unit 1 with pattern 2
Reset with pattern 3 on unit 1
Resonance established on unit 2 with pattern 3
Network not stable
Resonance established on unit 1 with pattern 1
Resonance established on unit 1 with pattern 2
Resonance established on unit 2 with pattern 3
Network stable

TableForm[outputs[[3]]]

Unit 1 2 1
Unit 2 3
Unit 3
Unit 4
Unit 5
Unit 6

As we should expect, unit 1 encoded both the first and second patterns.
The third pattern was encoded by unit 2. We also might guess what the
weight matrices look like. The top-down matrix is

MatrixForm[N[outputs[[1]],3]]

2.06 0 0 0 0 0
7.22 0 0 0 0 0
0 2.59 0 0 0 0
5.16 4.32 0 0 0 0
4.13 8.64 0 0 0 0

and the bottom-up matrix is

MatrixForm[N[outputs[[2]],3]]

2.06 7.22 0    5.16 4.13
0    0    2.59 4.32 8.64
2.24 2.24 2.24 2.24 2.24
2.24 2.24 2.24 2.24 2.24
2.24 2.24 2.24 2.24 2.24
2.24 2.24 2.24 2.24 2.24
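As a quick verification (my own arithmetic): after resonance the stored
weights are u/(1 - d), which is 10u for d = 0.9. Multiplying the stable u
computed for the first pattern by 10 reproduces the first column of the
top-down matrix and the first row of the bottom-up matrix:

10 {0.206284, 0.721995, 0, 0.515711, 0.412568}

{2.06284, 7.21995, 0, 5.15711, 4.12568}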

As with the ART1 network, facility with this ART2 network can only
come by experience. Moreover, the sublayer structure on F1 provides
fertile ground for additional experimentation.

Summary
The ART1 and ART2 networks that we studied in this chapter represent
a class of network architectures based on adaptive resonance theory.
Among other characteristics, these networks retain their ability to learn
new information without having to be retrained on old information as
well as the new. Moreover, these networks know when presented information
is new and automatically incorporate this new information. You
should view the ART1 and ART2 networks as building blocks. Hierarchical
structures based on combinations of these networks can exhibit
complex behavior. As with the other neural networks in this book, the
ART1 and ART2 networks should be used as starting points for your
own experimentation.
Chapter 8

Genetic Algorithms

Whether or not you believe in an evolutionary development of life on this
planet, the theory of natural selection offers some compelling arguments
that individuals with certain characteristics are better able to survive and
pass on those characteristics to their progeny. We can think of natural
selection as nature's way of searching for better and better organisms. A
genetic algorithm (GA) is a general search procedure based on the ideas
of genetics and natural selection.
The solution methodology of a GA is reminiscent of sexual reproduction
in which the genes of two parents combine to form those of their
children. Our basic premise will be that we can create a population of
individuals that somehow represent possible solutions to a problem we
are trying to solve. These individuals have certain characteristics that
make them more or less fit as members of the population, and we shall
associate this fitness with procreational probability. The most fit members
will have a higher probability of mating than the less fit members.
The power of GAs lies in the fact that as members of the population
mate, they produce offspring that have a significant chance of retaining
the desirable characteristics of their parents, perhaps even combining the
best characteristics of both parents. In this manner, the overall fitness of
the population can potentially increase from generation to generation as
we discover better solutions to our problem.
I intend this chapter to be a brief introduction to GAs. In the first
section, we shall look at a few of the fundamental ideas behind GAs and
do a simple example calculation. In the second section, we shall develop
the code for a basic GA (BGA) that solves a simple optimization problem.
In the final section, we shall look at one method of applying GAs to the
problem of learning in a neural network.

8.1 GA Basics

In the introduction to this chapter, I used the words chance, perhaps, and
potentially, all contributing to the probabilistic flavor of GAs. Often people
attempt to disprove evolutionary theory on the basis that mere chance
could not possibly have resulted in the complex organisms that exist today.
Usually a person offers some calculation that, in essence, proves
that no matter how many monkeys sit at typewriters for an infinite time,
the probability that one will type (insert your favorite book title here) is
vanishingly small. Supposedly, by inference, any theory of evolution based
on probability and random mutations is similarly doomed to failure.


I am not going to attempt to argue the pros and cons of evolution,
but I do want to point out that the theory of natural selection that we will
use here, while having probabilistic elements, is far from being based on
chance mutations. In fact, there is a very nice demonstration, described
by Richard Dawkins in his book The Blind Watchmaker, that illustrates the
philosophical differences between the theory of natural selection and the
theory of randomness.1 The following presentation is in the spirit of the
one given by Dawkins.
Consider the following experiment. Set 10 monkeys down at 10 typewriters
and allow each monkey to type 13 lower-case letters. We shall
call that set of phrases a generation. If none of the monkeys has typed the
expression "tobeornottobe," then begin again with a new sheet of paper
for each monkey. Let's simulate the experiment by having Mathematica
produce a few generations of monkey prose. The Mathematica function
FromCharacterCode converts a list of ascii integers into their appropriate
characters. We use that function to generate text from a randomly generated
list of character codes.

generate[] := TableForm[Map[FromCharacterCode,
  Table[Random[Integer,{97,122}],{i,1,10},{j,1,13}]]]

gene rate []

vbkafmujpdneg
nifcmmuucqonf
ymvasqzsagpnp
ikowydtridswq
ilqsrqxahklvs
jogvhcjuzzxcj
xctkuuachkbwr
skfqtjbrjrrpq
kqrelgjcobhbj
alqlbsezaibdy

generate []

ywporsuuapyfh
owqozltdvbgpu

1 Richard Dawkins. The Blind Watchmaker. New York: W. W. Norton & Company, 1987.

qxzsaymyozwdt
uummtfbgezlir
edkpejxjhkowk
qmhefxgdsblgt
tkxlovoyglvnx
mbzkdxtlrlvvc
jowseiaersfel
momgjesfuvgbf

generate []

hdahgucjzagag
uvgmxanpjpkqi
ikksflflyzqgh
oqozewohuaema
vtnxclgtcwuvz
axjoakjjnflqe
etvlqillzceja
xxanjvmlooltj
uxeijueevpous
urvkzgrxscyxb

We could go on, but you probably get the picture: Not one of these strings
has any noticeable resemblance to the target string. Occasionally the
proper letter does appear in the proper location in one of the strings, but
it is likely to take a very long time before our pseudomonkeys generate
the correct sequence purely by chance. Since there are 26 possible letters
for each of 13 locations in the string, the probability that a single monkey
will type the correct phrase is

(1./26)^13

4.03038 10^-19

which is about four chances in ten billion billion. We agree, therefore,
that pure chance is probably not a sufficient operator to drive evolutionary
changes; but we have not been describing natural selection with this
example.
Let's change the generation process somewhat. At first, we generate
ten sequences at random, as before. For subsequent generations, however,
we base the generation on the string which, however slightly, best
matches the desired string. Then we allow each letter to change with
some fixed probability. From the resulting generation, we choose the
most closely matching string to form a new generation, and so on.
Begin by defining the phrase as a text string.

tobe = "tobeornottobe"

tobeornottobe

Convert the string to a generic variable, keyPhrase.

keyPhrase = ToCharacterCode[tobe]

{116, 111, 98, 101, 111, 114, 110, 111, 116, 116, 111, 98, 101}

Generate an initial population of 10 random phrases, but leave them in
ascii code form for use by the program.

initialPop = Table[Random[Integer,{97,122}],{i,1,10},{j,1,13}]

{{119, 105, 118, 110, 120, 117, 116, 110, 111, 116, 115, 106, 112},
 {105, 113, 104, 111, 118, 114, 98, 119, 107, 121, 114, 98, 104},
 {99, 108, 103, 107, 98, 112, 102, 116, 102, 107, 115, 119, 104},
 {101, 97, 118, 97, 97, 120, 117, 117, 97, 106, 116, 111, 98},
 {114, 118, 110, 110, 122, 117, 113, 101, 114, 98, 100, 121, 103},
 {108, 110, 99, 112, 97, 106, 106, 120, 114, 103, 111, 104, 108},
 {104, 102, 116, 108, 106, 119, 115, 105, 111, 105, 104, 114, 112},
 {116, 117, 107, 98, 119, 98, 108, 119, 122, 122, 110, 116, 116},
 {102, 106, 117, 110, 114, 117, 103, 109, 119, 99, 122, 106, 110},
 {102, 114, 115, 122, 106, 107, 101, 117, 111, 119, 100, 120, 109}}

Just out of curiosity, let's convert this population to strings to see what
they look like.

TableForm[Map[FromCharacterCode,initialPop]]

wivnxutnotsjp

iqhovrbwkyrbh
clgkbpftfkswh
eavaaxuuajtob
rvnnzuqerbdyg
lncpajjxrgohl
hftljwsioihrp
tukbwblwzzntt
fjunrugmwczjn
frszjkeuowdxm

We will need a few more functions to implement this demonstration. The
first, flip[prob], implements a biased coin toss; that is, a coin toss that will
come up heads with probability prob, instead of 50-50.

flip[x_] := If[Random[]<=x,True,False]

The function mutateLetter will result in a change to a given letter with a
probability pmute.

mutateLetter[pmute_,letter_] :=
  If[flip[pmute],Random[Integer,{97,122}],letter];
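You can convince yourself that flip behaves as advertised (a check of my
own); roughly 30 percent of 1000 tosses with a probability of 0.3 should
come up True:

Count[Table[flip[0.3],{1000}],True]

The count will vary from run to run, but it should land near 300.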

newGenerate in Listing 8.1 is the main program that produces each
group of new phrases. Let's produce 50 new generations to see how
well this scheme works.

newGenerate[0.1,keyPhrase,initialPop,50]

Generation 1: iqhovrbwkyrbh Fitness= 2
Generation 2: tqhovrbwkyrbh Fitness= 3
Generation 3: tohovrbrkyrbh Fitness= 4
Generation 4: tohovrbrktrbh Fitness= 5
Generation 5: tohovrboktrbh Fitness= 6
Generation 6: tohocrboktrbh Fitness= 6
Generation 7: tobocrboktrbh Fitness= 7
Generation 8: tobocrnoktrbh Fitness= 8
Generation 9: toboornoktrbh Fitness= 9
Generation 10: toboornoktrbh Fitness= 9
Generation 11: toboornoktobh Fitness= 10
Generation 12: toboornoktobh Fitness= 10
Generation 13: toboornoktobh Fitness= 10
Generation 14: toboornoktobo Fitness= 10

newGenerate[pmutate_,keyPhrase_,pop_,numGens_] :=
  Module[{i,newPop,parent,diff,matches,
          index,fitness},
    newPop = pop;
    For[i=1,i<=numGens,i++,
      diff = Map[(keyPhrase-#)&,newPop];
      matches = Map[Count[#,0]&,diff];
      fitness = Max[matches];
      index = Position[matches,fitness];
      parent = newPop[[First[Flatten[index]]]];
      Print["Generation ",i,": ",
        FromCharacterCode[parent],
        " Fitness= ",fitness];
      newPop =
        Table[Map[mutateLetter[pmutate,#]&,parent],
          {100}];
    ]; (* end of For *)
  ]; (* end of Module *)

Listing 8.1

Generation 15: toboornoktobo Fitness= 10
Generation 16: toboornoktobe Fitness= 11
Generation 17: toboornoktobe Fitness= 11
Generation 18: tobeornoktobe Fitness= 12
Generation 19: tobeornoktobe Fitness= 12
Generation 20: tobeornoktobe Fitness= 12
Generation 21: tobeornoktobe Fitness= 12
Generation 22: tobeornoktobe Fitness= 12
Generation 23: tobeornottobe Fitness= 13

After only 23 generations, the population produced the desired phrase.
In all fairness to the random method, we should go back and produce
230 random phrases and then see if we have a match. Please feel free to
run the experiment. If you come up with anything that is even close, I
would like to hear about it: You have phenomenal luck, and we should
collaborate on some lottery tickets.
We have seen that the addition of randomness to the natural-selection
process does not lead to chaos, but rather to order and meaning. We
have not yet added mating and reproduction to the process, however.
Let's move on to the discussion of a more complete genetic algorithm.

8.2 A Basic Genetic Algorithm (BGA)

In this section we shall develop an algorithm that has all of the characteristic
processing normally associated with a true GA. Although we
will build a very simple algorithm, it has application in a number of areas.
More complex GAs will have the same high-level structure, but will
differ in the details of the computations performed. Before we actually
perform the development of the GA in Section 8.2.2, a review of the relevant
vocabulary is in order. Like neural networks, GAs are inspired by
certain results from biology.

8.2.1 GA Vocabulary

In a biological organism, the structure that encodes the prescription that
specifies how the organism is to be constructed is called a chromosome.
One or more chromosomes may be required to specify the complete organism.
The complete set of chromosomes is called a genotype, and the
resulting organism is called a phenotype.
Each chromosome comprises a number of individual structures called
genes. Each gene encodes a particular feature of the organism, and the
location, or locus, of the gene within the chromosome structure determines
what particular characteristic the gene represents. At a particular
locus, a gene may encode any of several different values of the particular
characteristic it represents; eye color, for example, may be green, blue,
hazel, etc. The different values of a gene are called alleles.
The development of a new generation involves sexual reproduction
between two parent phenotypes resulting in child phenotypes. During
this reproduction, the chromosomes of the parents are combined to form
the chromosomes of the children. The children inherit certain characteristics
from each of their parents. If, for example, the child inherits the best
characteristics from each of its parents, then it will supposedly be more
fit to survive and reproduce, thus passing the favorable characteristics
on to its progeny.
In a GA, chromosomes are typically represented by a string of some
variable type. In the example of the previous section, the chromosome is
a string of ascii values. Since there is only one chromosome per organism,
the chromosome and the genotype were the same. Each position in the
chromosome string is a gene which can take on different ascii values,
which themselves are the alleles. The phenotype is the string of characters
decoded from the genotype. In the BGA that we shall develop in the
next section, the chromosomes comprise binary numbers. In this case,
the alleles are zero and one. You may, of course, eschew the biological
terminology and speak instead of strings, positions on the string, and
values, instead of chromosomes, genes, and alleles.
There is one aspect of the theory of GAs that we shall not treat in
this chapter, but which is, nonetheless, an important topic in the development
of the theoretical basis for GAs. That topic is schemata. Briefly
stated, schemata are subsets of the chromosome string which have similar
alleles at specific locations. We can view schemata as building blocks
that are combined during the reproduction process to produce a new
chromosome. Assuming that the schemata represent certain favorable
characteristics, then combining these schemata in an offspring should result
in an increase in survival probability for that offspring. Part of the
theory of GAs is involved with demonstrating how GAs manipulate these
schemata from one generation to the next in order to form individuals
whose fitness increases from generation to generation.

8.2.2 A BGA Algorithm


In this section we shall construct a GA that solves a specific problem
and in the process illustrate many of the techniques required to solve
other, real-world problems. Most problems to which GAs apply have
the characteristic that something or some quantity needs to be optimized.
We can identify that quantity with the fitness, or survival probability, of
an individual.
Describing the actual BGA requires only a few statements.

1. Begin with a population of individuals generated at random.

2. Determine the fitness of each individual in the current population.

3. Select parents for the next generation with a probability proportional
   to their fitness.

4. Mate the selected parents to produce offspring to populate the new
   generation.

5. Repeat items 2-4.

The Fitness Function Let's begin our example by defining a quantity
that determines an individual's fitness. Consider the function

f[x_] := 1+Cos[x]/(1+0.01 x^2)

We shall assume that this function measures the fitness of an individual
phenotype, x. The phenotype, x, is a numerical value which we
decode from a chromosome. Let's examine the behavior of the fitness
function.

Plot[f[x],{x,-45,45},PlotPoints->200];

Notice that the optimal value of this function occurs at x = 0, but notice
also that there are many local maxima which are suboptimal. A
traditional hill-climbing technique, unless it were fortuitously to begin
somewhere on the central peak, would quickly reach one of the suboptimal
peaks and would get stuck there.
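To see this concretely (my own experiment, not from the text), hand the
negated fitness function to a local optimizer. Started at x = 20, it should
settle on a nearby suboptimal peak rather than on the global maximum
at x = 0:

FindMinimum[-f[x], {x, 20}]

The exact numbers depend on your version of Mathematica, but the
returned x should sit near 19, on a local peak whose fitness is well below
the global maximum of 2.0.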

Chromosome Representation Each phenotype is a value decoded
from a chromosome. We shall use chromosomes with binary alleles for
this example. We choose to represent a chromosome as a ten-digit binary
string, and we shall restrict the phenotypes to values between -40 and
40. Therefore, the chromosome {0,0,0,0,0,0,0,0,0,0} must decode to
the value -40, and the chromosome {1,1,1,1,1,1,1,1,1,1} must decode
to the value 40.

The binary number 1111111111 is equal to decimal 1023. If we multiply
this number by 80/1023

N[80.0/1023,10]

0.07820136852

and subtract 40, we get the number 40.

1023 (80.0/1023) -40

40.

Similarly treated, 0000000000 yields -40.

0 (80.0/1023) -40

-40

The function decodeBGA embodies this conversion procedure. The function
first converts the binary number into a decimal number as follows,
using 1010101010 as an example. First, determine the locations of the 1s
in the string.

pList = Flatten[Position[{1,0,1,0,1,0,1,0,1,0},1]]

{1, 3, 5, 7, 9}

Then, calculate the power of 2 represented by each of these locations,

values = Map[2^(10-#)&,pList]

{512, 128, 32, 8, 2}

Add these values,

decimal = Apply[Plus,values]

682

Incidentally, an elegant way of accomplishing the conversion of a
binary number to a decimal number is to use a technique called Horner's
rule. That technique is embodied in the following function:

Horner[u_List,base_:2] := Fold[base #1 + #2 &,0,u]



decodeBGA[chromosome_] :=
  Module[{pList,lchrom,values,decimal,phenotype},
    lchrom = Length[chromosome];
    (* convert from binary to decimal *)
    pList = Flatten[Position[chromosome,1]];
    values = Map[2^(lchrom-#)&,pList];
    decimal = Apply[Plus,values];
    (* scale to proper range *)
    phenotype = decimal (0.07820136852394916911) - 40;
    Return[phenotype];
  ]; (* end of Module *)

Listing 8.2

which works for arbitrary bases. We can test the function on the same
binary number.

Horner[{1,0,1,0,1,0,1,0,1,0}]

682

Finally, convert to a number between -40 and 40.

phenotype = decimal (0.07820136852394916911) - 40

13.3333

The complete function is in Listing 8.2. One thing to notice about our
choice of chromosomal representation is that the optimal phenotype (x =
0) is not represented by any chromosome. The largest negative phenotype
has the chromosome {0,1,1,1,1,1,1,1,1,1},

decodeBGA[{0,1,1,1,1,1,1,1,1,1}]

-0.0391007

The smallest positive phenotype has the chromosome {1,0,0,0,0,0,0,0,0,0},

decodeBGA[{1,0,0,0,0,0,0,0,0,0}]

0.0391007

We could adjust our representation slightly, but in a real problem, we
will not have advance knowledge of the actual optimal values. Bear
this issue in mind when designing a representation scheme.
The function f[x] determines the fitness of a particular phenotype.
For example,

f[-0.0391007]

1.99922

Statements 3 and 4 require a little more discussion; the details of how
we implement them comprise the essence of the particular GA that we
are developing.

Parent Selection In keeping with the ideas of natural selection, we
presume that individuals with a higher fitness are more likely to mate
than individuals with a lower fitness. One way to accomplish this scenario
is to select parents with a probability in direct proportion to their
fitness values. That may seem to be an obvious choice, but it is certainly
not the only way parents can be selected.
The method we shall use is called the roulette-wheel method. In
principle, we construct a roulette wheel on which each member of the
population is given a sector whose size is proportional to the individual's
fitness. Then we spin the wheel, and whichever individual comes up
becomes a parent.
We can describe the implementation of this method as follows:

1. Construct a list of the fitnesses of all individuals in the population.

2. Generate a random number between zero and the total of all of the
fitnesses in the population.

3. Return the first individual whose fitness, added to the fitness of all
other elements before it, from the list in step 1, is greater than or
equal to the random number from step 2.

Let's walk through an example. First, construct a random population
with ten individuals.

MatrixForm[pop =
  Table[Random[Integer,{0,1}],{i,10},{j,10}]]

1 1 0 0 1 0 0 1 1 0
0 0 0 1 0 0 0 1 0 1
0 1 0 0 1 0 0 1 1 1
1 0 1 1 1 0 1 1 0 1
1 1 1 0 0 0 0 0 1 1
1 0 1 1 0 0 0 1 0 1
1 0 1 1 0 0 1 0 0 0
1 0 0 0 1 0 1 1 1 0
1 1 1 1 0 1 0 1 1 0
0 1 1 1 0 0 1 1 0 1

Then decode the population by mapping the decodeBGA function onto pop.

pheno = Map[decodeBGA,pop]

{23.0303, -34.6041, -16.9306, 18.5728, 30.303, 15.4448,
 15.6794, 3.63636, 36.7937, -3.94917}

The fitness of the individuals is found from

fitList = Map[f,pheno]

{0.919582, 0.923009, 0.911761, 1.21619, 1.04341, 0.714787,
 0.710969, 0.222705, 1.04247, 0.40201}

We use the function FoldList to add each element of fitList to all of the
successive elements.

fitListSum = FoldList[Plus,First[fitList],Rest[fitList]]

{0.919582, 1.84259, 2.75435, 3.97055, 5.01396, 5.72875,
 6.43972, 6.66242, 7.70489, 8.1069}

The total of all fitness values is the last element in the above list.

fitSum = Last[fitListSum]

8.1069

The function selectOne in Listing 8.3 takes the folded list of fitness values
and returns the index of the individual who came up on a single spin of
the roulette wheel. The two parents are

parent1Index = selectOne[fitListSum,fitSum]



selectOne[foldedFitnessList_,fitTotal_] :=
  Module[{randFitness,elem,index},
    randFitness = Random[] fitTotal;
    elem = Select[foldedFitnessList,#>=randFitness&,1];
    index = Flatten[Position[foldedFitnessList,First[elem]]];
    Return[First[index]];
  ]; (* end of Module *)

Listing 8.3

parent2Index = selectOne[fitListSum,fitSum]

parent1 = pop[[parent1Index]]

{1, 1, 1, 1, 0, 1, 0, 1, 1, 0}

parent2 = pop[[parent2Index]]

{1, 0, 1, 1, 0, 0, 0, 1, 0, 1}
These two parents have fitnesses of 1.04247 and 0.714787 respectively. Of
course, if you execute the above two statements, you may get different
parents. We can now proceed to the reproduction process.

Reproduction: Crossover and Mutation In Section 8.1 we looked
at an example of reproduction involving only random mutation of genes
from a single parent; a case of asexual reproduction. In this section we
look at sexual reproduction in which each parent contributes part of its
genetic structure to the offspring. Crossover is the name that we give to
this sharing of genes. Mutation, in this scenario, occurs at a much lower
probability than in the previous example.
There are many different crossover methods, and often each new application
requires the development of a special crossover mechanism. We
shall restrict our attention here to a method called single-point crossover.
Figure 8.1 illustrates the results of single-point crossover on a pair of
chromosomes.

Figure 8.1 This figure illustrates the crossover operation. (a) Two parents are selected
according to their fitness, and a crossover point, illustrated by the vertical line, is chosen
by a uniform random selection. (b) The children's chromosomes are formed by combining
opposite parts of each parent's chromosome.

After we select two parents for mating, we perform a biased coin
flip with a certain probability of heads that will determine whether to
proceed with the crossover. If the coin toss is successful, we choose, at
random, a particular locus which we call the crossover point. Two children
are produced by splicing the genes up to the crossover point from one
parent with the genes beyond the crossover point from the other parent,
as Figure 8.1 shows. If the coin toss is not successful, we simply return
the parents themselves as the new children. The theory of GAs shows
how this crossover mechanism can result in a population whose overall
fitness increases with time as the desirable features of each parent are
combined in their progeny. It can happen, however, that crossover results
in children who are less fit than their parents. The mutation process can
help to combat the effects of a destructive crossover.
Crossover is a powerful method for natural selection, but as we
pointed out in the previous paragraph, crossover may not always work
the way we want it to. By subjecting each of the genes in a chromosome
to a small probability of mutation, we can, on occasion, reverse the results
of a bad crossover. Suppose during crossover, a chromosome which has
an allele of 1 at a particular location that is very favorable to survival
has that allele changed to a 0 during reproduction. Mutation can flip the
gene back to a 1. Of course, since all of the genes are subjected to mutation,
we could end up with the opposite situation where a favorable gene
is altered. For this reason we need to keep the probability of mutation to
a very small value, perhaps as low as one chance in a thousand for any
particular gene.

crossover[pcross_,pmutate_,parent1_,parent2_] :=
  Module[{child1,child2,crossAt,lchrom},
    (* chromosome length *)
    lchrom = Length[parent1];
    If[ flip[pcross],
      (* True: select cross site at random *)
      crossAt = Random[Integer,{1,lchrom-1}];
      (* construct children *)
      child1 = Join[Take[parent1,crossAt], Drop[parent2,crossAt]];
      child2 = Join[Take[parent2,crossAt], Drop[parent1,crossAt]],
      (* False: return parents as children *)
      child1 = parent1;
      child2 = parent2;
    ]; (* end of If *)
    (* perform mutation *)
    child1 = Map[mutateBGA[pmutate,#]&,child1];
    child2 = Map[mutateBGA[pmutate,#]&,child2];
    Return[{child1,child2}];
  ]; (* end of Module *)

Listing 8.4

Since the chromosomes are binary strings, I have written the mutation
algorithm in terms of an XOR function. Moreover, since Mathematica does
not equate True and False with 1 and 0, respectively, I have had to write
my own XOR function:

myXor[x_,y_] := If[x==y,0,1];

mutateBGA[pmute_,allele_] := If[flip[pmute],myXor[allele,1],allele];
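For instance (my own check): with a mutation probability of 1.0 the allele
is always flipped, and with 0.0 it is left alone:

{mutateBGA[1.0, 0], mutateBGA[0.0, 0]}

{1, 0}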

The procedure in Listing 8.4, which comprises both the crossover and
mutation algorithms, returns two children to be added to the next generation
of the population. Let's see if we make any progress after one
crossover-mutation operation.

MatrixForm[children = crossover[0.75,0.001,parent1,parent2]]

1 1 1 1 0 0 0 1 0 1
1 0 1 1 0 1 0 1 1 0

decodeList = Map[decodeBGA,children]

{35.4643, 16.7742}

newfitness = Map[f,decodeList]

{0.95461, 0.87324}

We gained ground in one case, but lost ground in the other. On the
whole, however, the average fitness of the children

(newfitness[[1]]+newfitness[[2]])/2

0.913925

is greater than that of their parents

(1.04247+0.714787)/2

0.878628

The results could have turned out differently, however, since this process
does contain an element of chance.
There is one further issue to discuss before putting everything together.
That issue concerns how we are going to go about populating
the next generation.

Populating the New Generation A simplistic approach to building
a new population is to mate enough parents so that enough children are
produced to completely replace their parents. This technique, called
generational replacement, allows for the most thorough mixing of genes
possible in the new generation, but it has some drawbacks. There is no
guarantee that all, or even most, of the children will
turn out better than their parents; thus generational replacement might
result in a loss of individuals with the best genes. Not only could the
best individuals be lost, but the population as a whole could diminish in
overall fitness, and we need to be concerned with the overall population
as well as the best individual.

initPop[psize_,csize_] :=
  Table[Random[Integer,{0,1}],{psize},{csize}];

displayBest[fitnessList_,number2Print_] :=
  Module[{i,sortedList},
    sortedList = Sort[fitnessList,Greater];
    For[i=1,i<=number2Print,i++,
      Print["fitness = ",sortedList[[i]]];
    ]; (* end of For i *)
  ]; (* end of Module *)

Listing 8.5

We could counter some of the negative effects of generational replacement
by retaining a certain number of the best individuals from the
previous generation. We call this strategy elitism.
At the opposite end of the spectrum from generational replacement
is steady-state reproduction. In this method, a select number of one population
are deleted and are replaced with an equal number of children.
In the extreme case, only two children are replaced in each successive
generation.
Other strategies may be brought to bear to address this issue, but we
shall spend no additional time on the problem here. We are going to use
generational replacement, but feel free to experiment on your own.
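Here is a minimal sketch of the elitism strategy mentioned above (my
own illustration; the function name and calling convention are hypothetical).
It assumes a population already sorted best-first and a list of freshly
produced children:

elitist[sortedPop_,children_,numElite_] :=
  Join[Take[sortedPop,numElite],
       Take[children,Length[sortedPop]-numElite]];

With numElite = 2, for example, the two best individuals survive
unchanged no matter how the crossover turns out.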

Testing the BGA In order to make the program bga as generic as
possible, we need several support functions. The function initPop (Listing 8.5)
provides an initial random population of appropriate size. The
function displayBest (Listing 8.5) allows us to see the fitness of the best
chromosomes of the current generation. The arguments of the function
bga (Listing 8.6) include the crossover probability, the mutation probability,
the initial population, the name of the function that defines the
fitness of each phenotype, the number of generations to calculate, and
the number of individuals to print out at each generation. Let's choose
an initial population and run the program for several generations to see
how it performs. From the graph of the function f, you can see that the
maximum value of fitness is 2.0, and that any result greater than about
1.6 must be on the central peak.

bga[pcross_,pmutate_,popInitial_,fitFunction_,numGens_,printNum_] :=
  Module[{i,newPop,parent1,parent2,oldPop,reproNum,fitList,
          fitListSum,fitSum,pheno,pIndex1,pIndex2,f,children},
    oldPop = popInitial; (* initialize first population *)
    reproNum = Length[oldPop]/2; (* calculate number of reproductions *)
    f = fitFunction; (* assign the fitness function *)
    For[i=1,i<=numGens,i++, (* perform numGens generations *)
      pheno = Map[decodeBGA,oldPop]; (* decode the chromosomes *)
      fitList = f[pheno]; (* determine the fitness of each phenotype *)
      Print[" "]; (* print out the best individuals *)
      Print["Generation ",i," Best ",printNum];
      Print[" "];
      displayBest[fitList,printNum];
      fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
      fitSum = Last[fitListSum]; (* find the total fitness *)
      newPop = Flatten[Table[ (* determine the new population *)
        pIndex1 = selectOne[fitListSum,fitSum]; (* select parent indices *)
        pIndex2 = selectOne[fitListSum,fitSum];
        parent1 = oldPop[[pIndex1]]; (* identify parents *)
        parent2 = oldPop[[pIndex2]];
        children = crossover[pcross,pmutate,parent1,parent2]; (* crossover and mutate *)
        children, {reproNum}], 1 (* add children to list; flatten to first level *)
      ]; (* end of Flatten[Table] *)
      oldPop = newPop; (* new becomes old for next gen *)
    ]; (* end of For i *)
  ]; (* end of Module *)

Listing 8.6

initialPopulation = initPop[100,10];

bga[0.75, 0.008, initialPopulation,f,5,10];

Generation 1 Best 10

fitness = 1.90724
fitness = 1.90724
fitness = 1.58669
fitness = 1.42534
fitness = 1.42534
fitness = 1.3598
fitness = 1.3598
fitness = 1.3334
fitness = 1.33098
fitness = 1.31726

Generation 2 Best 10

fitness = 1.96206
fitness = 1.93756
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.78363
fitness = 1.48717
fitness = 1.46704
fitness = 1.34657

Generation 3 Best 10

fitness = 1.96206
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.87132
fitness = 1.83002
fitness = 1.78363

fitness = 1.73246
fitness = 1.66991
fitness = 1.48717

Generation 4 Best 10

fitness = 1.99299
fitness = 1.99299
fitness = 1.99299
fitness = 1.90724
fitness = 1.87132
fitness = 1.83002
fitness = 1.83002
fitness = 1.78363
fitness = 1.78363
fitness = 1.78363

Generation 5 Best 10

fitness = 1.99299
fitness = 1.99299
fitness = 1.99299
fitness = 1.99299
fitness = 1.98058
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.90724
fitness = 1.87132

Not only does the best individual become better as time goes on, but
the population as a whole appears to be getting better as well. In most
problems we will probably be interested in the best solution; nevertheless,
as the population as a whole gets better, future generations will be
produced by a better and better group of parents.

8.3 A GA for Training Neural Networks

Finding the proper weights in a neural network, such as backpropagation,
is an optimization problem in itself. We search through weight space
looking for weights that optimize (in this case, minimize) the value of the
mean squared error of the neural-network outputs for the given problem.
It would seem that a GA would be well suited to this task. We shall begin
to explore this idea in this section. I have chosen the XOR problem as a
basis for the effort.

8.3.1 Data Representation

We shall restrict our experiments to a standard, feedforward network
with two inputs, two hidden-layer units, and a single output. Each of
the three units (2 hidden, 1 output) will have two weights, for a total
of six in each network. The hidden- and output-layer units will have
the standard sigmoidal output function, reproduced here in a simplified
form.

sigmoid[x_] := 1./(1+E^(-x));

The ioPairs for this problem are

ioPairsXOR = { {{0.1,0.1},{0.1}}, {{0.1,0.9},{0.9}},
               {{0.9,0.1},{0.9}}, {{0.9,0.9},{0.1}} };

Whereas in the example of the previous section each phenotype had a
single chromosome, the neural-network problem is complicated by the
existence of multiple weights. We must decide on a data representation
suitable to account for all of the weights. Although it is possible to
work directly with real-valued chromosomes, I am going to stay with a
binary representation, which will keep us on familiar ground and allow
us to reuse a fair amount of code from the previous example. We shall
represent each weight value by a 20-bit binary string, after scaling the
weight to a number between -5 and +5. At issue, then, is how to view
the individual weights. One representation would be to concatenate all
of the weights together into one long string. At the other extreme, we
could treat each weight as a separate chromosome and mate it only with
its counterparts in other networks.
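Once decodeXorChrom (Listing 8.8) is defined, you can verify the scaling at
the endpoints (my own check): an all-ones 20-bit chromosome decodes to
+5, and an all-zeros chromosome decodes to -5:

{decodeXorChrom[Table[1,{20}]], decodeXorChrom[Table[0,{20}]]}

{5., -5}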

I am going to adopt a middle-of-the-road approach for this example.
We shall concatenate the two weights on each unit to form a single chromosome;
thus, our genotype will comprise three chromosomes, which we
label h1, h2, and o1, for the two hidden-unit chromosomes and the output
chromosome, respectively. During mating, h1 chromosomes will cross
with h1 chromosomes only, and so on. For the fitness function, we shall
use the inverse of the mean squared error (mse), mseInv, of the network
outputs over the four input patterns. We use the inverse of the mse because
we want a fitness function for which the larger values represent
the better fitness.
Each generation will consist of a number of neural networks. We
shall represent each network as a list comprising four parts: the network
fitness, its h1 chromosome, its h2 chromosome, and its o1 chromosome.
Symbolically, each network has the following representation:

{ mseInv, {h1}, {h2}, {o1} }

You should begin to see why this GA may take a long time to compute.
Suppose we have a population of 100 networks, each initially with
randomly generated weights. To determine a new generation (if we are
using generational replacement) requires 50 matings, each of which involves
three crossover-mutation operations. Then we must decode the
chromosomes and determine the mse for each network.
We are going to employ some time-saving measures for this example.
First, we will use a steady-state population methodology, in which we
replace only a few of the worst-performing individuals for each new generation.
Second, since we need not evaluate the fitness of each network
for each new generation, we need decode only the new children in order
to assess their fitness.

8.3.2 Calculating the Generations

To set up the initial population, we compute the appropriate number


of chromosomes and prepend the fitness to the list of chromosomes for
each network. The function initXorPop, in Listing 8.7, accomplishes this
task. The initialization program requires one routine to decode the chro¬
mosomes into weight vectors, and one routine to calculate the fitness of
each of the networks. Those tasks are embodied in the two functions ap¬
pearing in Listing 8.8. Notice that gaNetFitness returns the inverse of the
mean squared error. The parent-selection function remains unchanged.

initXorPop[psize_,csize_,ioPairs_] :=
  Module[{i,iPop,hidWts,outWts,mseInv},
    (* first the chromosomes *)
    iPop = Table[
      {
        Table[Random[Integer,{0,1}],{csize}], (* h1 *)
        Table[Random[Integer,{0,1}],{csize}], (* h2 *)
        Table[Random[Integer,{0,1}],{csize}]  (* o1 *)
      }, {psize} ]; (* end of Table *)
    (* then decode and eval fitness *)
    (* use For loop for clarity *)
    For[i=1,i<=psize,i++,
      (* make hidden weight matrix *)
      hidWts = Join[iPop[[i,1]],iPop[[i,2]]];
      hidWts = Partition[hidWts,20];
      hidWts = Map[decodeXorChrom,hidWts];
      hidWts = Partition[hidWts,2];
      (* make output weight vector *)
      outWts = Partition[iPop[[i,3]],20];
      outWts = Map[decodeXorChrom,outWts];
      (* get mse for this network *)
      mseInv = gaNetFitness[hidWts,outWts,ioPairs];
      (* prepend mseInv *)
      PrependTo[iPop[[i]],mseInv];
    ]; (* end For *)
    Return[iPop];
  ]; (* end of Module *)

Listing 8.7

decodeXorChrom[chromosome_] :=
  Module[{pList,lchrom,values,p,decimal},
    lchrom = Length[chromosome];
    (* convert from binary to decimal *)
    pList = Flatten[Position[chromosome,1]];
    values = Map[2^(lchrom-#)&,pList];
    decimal = Apply[Plus,values];
    (* scale to proper range *)
    p = decimal (9.536752259018191355*10^-6) - 5;
    Return[p];
  ]; (* end of Module *)

gaNetFitness[hiddenWts_,outputWts_,ioPairVectors_] :=
  Module[{inputs,hidden,outputs,desired,errors,
          len,errorTotal},
    inputs = Map[First,ioPairVectors];
    desired = Flatten[Map[Last,ioPairVectors]];
    len = Length[inputs];
    hidden = sigmoid[inputs.Transpose[hiddenWts]];
    (* single output unit, so outputWts is a flat weight vector *)
    outputs = sigmoid[hidden.outputWts];
    errors = desired-outputs;
    errorTotal = Apply[Plus,errors^2];
    (* inverse of mse *)
    Return[len/errorTotal];
  ] (* end of Module *)

Listing 8.8

We shall have to modify the generation and crossover routines somewhat
to accommodate multiple crossovers for each mating and to account for
the steady-state population methodology. The new crossover routine appears
in Listing 8.9. To minimize the amount of decoding and encoding,
the generation routine, gaXor (Listing 8.10), prints out only the fitness values
for each generation. The routine also returns the final population,
so that we can decode any members for our own analysis or use the
population as a starting point for more generations.
You should also notice that gaXor calls decodeXorGenotype (Listing 8.11),
a function that takes as its argument a list of the form { {h1}, {h2},
{o1} } and returns the weights on the hidden and output layers. In case
you want to see what a particular weight value looks like encoded as
a 20-bit binary chromosome, I have included the function encodeNetGa.
Remember that weight values are restricted to a range of -5 to +5. You
also must supply the length of the chromosome, which, for this particular
function, must be 20. I wrote the function this way so that you could
easily change the length and range by changing the two numerical values
in one statement in the routine:

dec = Round[(weight+5.)/(9.536752259018191355*10^-6)];

The function is in Listing 8.12. As an example, a weight of 0.5 would
encode to a chromosome of

encodeNetGa[0.5,20]

{1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0}

8.3.3 An Example Calculation


We are now in a position to try our GA. First, we populate a starting
generation at random. We shall use populations of 20 individuals, in the
interest of expediting the calculation.

pop = initXorPop[20,40,ioPairsXOR];

Sort the population according to fitness. Although this step is not necessary
to begin the GA, it will allow us to easily determine the characteristics
of the initial population.

newpop = Sort[pop,Greater[First[#1],First[#2]]&];

The best fitness is the inverse of the best mse.



crossoverXor[pcross_,pmutate_,parent1_,parent2_] :=
  Module[{child1,child2,crossAt,lchrom,
          i,numchroms,chroms1,chroms2},
    (* strip off mse *)
    chroms1 = Rest[parent1];
    chroms2 = Rest[parent2];
    (* chromosome length *)
    lchrom = Length[chroms1[[1]]];
    (* number of chromosomes in each list *)
    numchroms = Length[chroms1];
    For[i=1,i<=numchroms,i++, (* for each chrom *)
      If[ flip[pcross],
        (* True: select cross site at random *)
        crossAt = Random[Integer,{1,lchrom-1}];
        (* construct children from opposite pieces of the parents *)
        child1 = Join[Take[chroms1[[i]],crossAt],
                      Drop[chroms2[[i]],crossAt]];
        child2 = Join[Take[chroms2[[i]],crossAt],
                      Drop[chroms1[[i]],crossAt]];
        chroms1[[i]] = child1;
        chroms2[[i]] = child2,
        (* False: don't change chroms[[i]] *)
        Continue]; (* end of If *)
      (* perform mutation *)
      chroms1[[i]] = Map[mutateBGA[pmutate,#]&,chroms1[[i]]];
      chroms2[[i]] = Map[mutateBGA[pmutate,#]&,chroms2[[i]]];
    ]; (* end of For i *)
    Return[{chroms1,chroms2}];
  ]; (* end of Module *)

Listing 8.9

gaXor[pcross_,pmutate_,popInitial_,numReplace_,ioPairs_,numGens_,printNum_] :=
Module[{i,j,newPop,parent1,parent2,diff,matches,
oldPop,reproNum,index,fitList,fitListSum,
fitSum,pheno,pIndex1,pIndex2,f,children,hids,outs,mseInv},
(* initialize first population sorted by fitness value *)
oldPop= Sort[popInitial,Greater[First[#],First[#2]]&];
reproNum = numReplace; (* calculate number of reproductions *)
For[i=1,i<=numGens,i++,
fitList = Map[First,oldPop]; (* list of fitness values *)
(* make the folded list of fitness values *)
fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
fitSum = Last[fitListSum]; (* find the total fitness *)
newPop = Drop[oldPop,-reproNum]; (* new population; eliminate reproNum worst *)
For[j=1,j<=reproNum/2,j++, (* make reproNum new children *)
(* select parent indices *)
pIndex1 = selectOne[fitListSum,fitSum];
pIndex2 = selectOne[fitListSum,fitSum];
parent1 = oldPop[[pIndex1]]; (* identify parents *)
parent2 = oldPop[[pIndex2]];
children = crossoverXor[pcross,pmutate,parent1,parent2]; (* crossover and mutate *)
{hids,outs} = decodeXorGenotype[children[[1]] ]; (* fitness of children *)
mseInv = gaNetFitness[hids,outs,ioPairs];
children[[1]] = Prepend[children[[1]],mseInv];
{hids,outs} = decodeXorGenotype[children[[2]] ];
mseInv = gaNetFitness[hids,outs,ioPairs];
children[[2]] = Prepend[children[[2]],mseInv];
newPop = Join[newPop,children]; (* add children to new population *)
]; (* end of For j *)
oldPop = Sort[newPop,Greater[First[#],First[#2]]&]; (* for next gen *)
(* print best mse values (1/mseInv) *)
Print[ ];Print["Best of generation ",i];
For[j=1,j<=printNum,j++,Print[(1.0/oldPop[[j,1]])]; ];
]; (* end of For i *)
Return[oldPop];
]; (* end of Module *)

Listing 8.10

decodeXorGenotype[genotype_] :=
Module[{hidWts,outWts},
hidWts = Join[genotype[[1]],genotype[[2]] ];
hidWts = Partition[hidWts,20];
hidWts = Map[decodeXorChrom,hidWts];
hidWts = Partition[hidWts,2];
(* make output weight matrix *)
outWts = Partition[genotype[[3]],20];
outWts = Map[decodeXorChrom,outWts];
Return[{hidWts,outWts}];
];
Listing 8.11

1/newpop[[1,1]]

0.156526

The last population individual has the lowest fitness, or the largest mse.

1/newpop[[20,1]]

0.394886

You can see the distribution by plotting the fitness values, or alternatively,
plotting the mse's. Here we plot the mse values.

poplist = Map[First,newpop];

ListPlot[1/poplist]

[ListPlot of the mse values for the 20 sorted individuals: the values rise from about 0.157 for the best individual to about 0.395 for the worst.]

encodeNetGa[weight_,len_] :=
Module[{l,dec,i},
i=len;
l=Table[0,{i}];
(* scale to proper range *)
dec = Round[(weight+5.)/(9.536752259018191355*10^-6)];
While[dec!=0 && dec!=1,
l=ReplacePart[l,Mod[dec,2],i];
dec=Quotient[dec,2];
--i;
];
l=ReplacePart[l,dec,i]
]; (* end of Module *)

Listing 8.12

Let's begin the calculation by producing 100 generations where we
replace half the population at each generation.

newpop = gaXor[0.8,0.01,pop,10,ioPairsXOR,100,1];

Best of generation 1
0.159407
Best of generation 5
0.14867
Best of generation 10
0.125276
Best of generation 15
0.125276
Best of generation 20
0.112863
Best of generation 25
0.102992
Best of generation 30
0.102976
Best of generation 35
0.102687
Best of generation 40
0.102538

Best of generation 50
0.102463
Best of generation 55
0.102264
Best of generation 60
0.102223
Best of generation 65
0.101803
Best of generation 70
0.10178
Best of generation 75
0.101705
Best of generation 80
0.101669
Best of generation 85
0.101669
Best of generation 90
0.101669
Best of generation 95
0.101667
Best of generation 100
0.101644

You will notice a steady, but very slow, decline in the mse. At this
rate, it may take hundreds of generations to reach an acceptable error.
Moreover, this run of 100 generations required considerably more time
than was required for the standard backpropagation method to converge
on an acceptable solution. We might also ask whether we are doing any
better than we would if we simply generated networks at random. To
evaluate that situation, we can do just that: generate 100 populations of
20 individuals at random and see if we do as well. The function randomPop
in Listing 8.13 generates the required populations.

randomPop[20,40,ioPairsXOR,100];

Random generation 1
0.132439
Random generation 2
0.149261
Random generation 3

randomPop[psize_,csize_,ioPairs_,numGens_] :=
Module[{i,pop},
For[i=1,i<=numGens,i++,
pop = initXorPop[psize,csize,ioPairs];
pop = Sort[pop,Greater[First[#],First[#2]]&];
Print[ ];
Print["Random generation ",i];
Print[(1.0/pop[[1,1]])];
];
];
Listing 8.13

0.157786
Random generation 4
0.16147
Random generation 5
0.151606
Random generation 6
0.156832
Random generation 7
0.150389
Random generation 8
0.128841
Random generation 9
0.150084
Random generation 10
0.156748
Random generation 11
0.14517
Random generation 12
0.147808
Random generation 13
0.163434
Random generation 14
0.146134
Random generation 15
0.136783

Random generation 16
0.126648
Random generation 17
0.153591
Random generation 18
0.154377
Random generation 19
0.162773
Random generation 20
0.16142

and so on . . .

I have reduced the output to the first 20 generations, but the results for
the remaining 80 generations are similar to those of the first 20. The best
I got in 100 random generations was 0.125274; the worst was 0.186056.
There is, of course, a possibility that some random population will
accidentally produce a very good individual. You should know, however,
that generating the random population and evaluating the fitness of all
of the individuals required more time per generation than the GA
did. Moreover, it appears as though the GA might eventually reach
an acceptable population, whereas we can never be sure about random
generations.
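Since gaXor returns its final population, you can also pull out the best network and examine it directly. A minimal sketch (each individual carries its fitness value as its first element; decodeXorGenotype is Listing 8.11, and bpnTest is the test routine from Chapter 3):

best = Rest[First[newpop]]; (* drop the stored fitness value *)
{hidWts,outWts} = decodeXorGenotype[best];
bpnTest[hidWts,outWts,ioPairsXOR]

bpnTest prints the output and error for each training pair along with the mean squared error of the decoded network.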
The above development represents a first attempt, and you should
not conclude that the method we employed is the best one for the task. In
fact, you will find it necessary to make several modifications to the data
representation in order to allow the GA to find a good solution. We could
be more clever about how chromosomes are combined during the mating
process. The binary representation and standard crossover algorithm are
probably not the best ones for our purposes in this case. Perhaps we
should maintain the chromosomes as lists of real numbers and search for
more appropriate crossover mechanisms. Rather than pursue this matter
further here, I will leave it to your creativity.
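To give one example of what I mean, with real-valued chromosomes a blended, or arithmetic, crossover becomes possible. The following sketch is not from the text, and its per-gene mixing scheme is only one of many possibilities:

blendCrossover[parent1_List,parent2_List] :=
Module[{a},
(* one random mixing coefficient per gene *)
a = Table[Random[],{Length[parent1]}];
{a parent1 + (1-a) parent2,
(1-a) parent1 + a parent2}
];

Mutation could then add a small random increment to a weight instead of flipping a bit.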

8.3.4 Other Uses for GAs

Even if we can generate weights for neural networks using the method of
the previous sections, it is not clear that there is a particular advantage
to doing so. If you look back at Chapter 3, you will recall that we
implemented the guts of a backpropagation algorithm in about a dozen
lines of code; the GA required many times more than that, with a similar
increase in the amount of time required to produce a solution. What
role then, if any, should GAs have in neural networks? I think there are
several ways to use GAs effectively along with neural networks.
We could use GAs as a supplement to standard training algorithms,
rather than as a replacement. As a simple example, we might choose to
use the standard backpropagation algorithm to train the output layer of
a neural network and use a GA to train the hidden layer. We might also
be able to use a GA to help a network escape from a local minimum.
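As a concrete sketch of such a hybrid (hypothetical code, not from the text): bpnStandard from Chapter 3 generates its own random starting weights, so refining a GA-found network requires a variant that accepts initial weight matrices. The loop below mirrors bpnStandard's weight updates exactly:

gaThenBpn[hidWts0_,outWts0_,ioPairs_,eta_,numIters_] :=
Module[{hidWts=hidWts0,outWts=outWts0,ioP,inputs,outDesired,
hidOuts,outputs,outErrors,outDelta,hidDelta},
Do[
ioP = ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs = ioP[[1]]; outDesired = ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* forward pass *)
outputs = sigmoid[outWts.hidOuts];
outErrors = outDesired-outputs; (* errors and deltas *)
outDelta = outErrors (outputs (1-outputs));
hidDelta = (hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outWts += eta Outer[Times,outDelta,hidOuts]; (* update weights *)
hidWts += eta Outer[Times,hidDelta,inputs],
{numIters}];
{hidWts,outWts}
];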
We could use GAs in a more fundamental role to help determine the
particular architecture suitable for a given problem. A question that is
often asked, for example, is: How many hidden-layer units should I have
in my network for such-and-such a problem? Rules-of-thumb are often
quoted in response to such a question, or experience is invoked as being
the best teacher. We could bring a GA to bear on this issue to provide
a more analytical, and perhaps, therefore, more satisfying answer. We
could also use a GA to optimize the connectivity scheme among units in
a network or to tune parameters such as the learning rate. Moreover, a
GA might be used to find a good initial set of weights.
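To make the architecture idea concrete (a purely illustrative fragment; this helper appears nowhere else in the book), the number of hidden units could itself be carried on a short binary chromosome:

(* hypothetical: decode a 4-bit chromosome into a hidden-layer size from 1 to 16 *)
decodeHidSize[chrom_List] := 1 + Fold[2 #1 + #2 &, 0, chrom];

decodeHidSize[{0,1,0,1}]

6

A fitness function would then build and train a network with that many hidden units and score the trained network, letting the GA search over architectures rather than weights.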
All of the above ignores the potential of GAs as an independent
methodology for the solution of optimization problems. Although our
emphasis in this book is neural networks, GAs have applicability to many
other fields, and by ignoring that potential here, I hope I have not misled
anyone.

Summary
In a book on neural networks, a chapter on genetic algorithms may seem
out of place, even though in this chapter we did look at a way of using the
two technologies together. GAs, like some neural networks, are good at
finding solutions to optimization problems when you can determine a
score or cost function for each potential solution. We built a very basic
GA in this chapter; many variations are possible. Whether you continue
to experiment with combining GA and neural-network technologies, as
we have done here, or simply use GAs for other applications,
does not really matter. GAs are another tool in the problem-solving
toolbox which can be brought to bear on a variety of problems. Like
neural networks, GAs will not guarantee you a perfect solution, but can,
in many cases, arrive at an acceptable solution without the time and
expense of an exhaustive search.
Appendix A

Code Listings

Adaline and ALC

alcTest[learnRate_,numIters_:250] :=
Module[{eta=learnRate,wts,k,inputs,wtList,outDesired,outputs,outError},
wts = Table[Random[],{2}]; (* initialize weights *)
Print["Starting weights = ",wts];
Print["Learning rate = ",eta];
Print["Number of iterations = ",numIters];
inputs = {0,Random[Real,{0,0.175}]}; (* initialize input vector *)

k=1;
wtList=Table[
inputs[[1]] = N[Sin[Pi k/8]]+Random[Real,{0,0.175}];
outDesired = N[2 Cos[Pi k/8]]; (* desired output *)
outputs = wts.inputs; (* actual output *)
outError = outDesired-outputs; (* error *)
wts += eta outError inputs; (* update weights *)
inputs[[2]]=inputs[[1]]; (* shift input values *)
k++; wts,{numIters}]; (* end of Table *)
Print["Final weight vector = ",wts];
wtPlot=ListPlot[wtList,PlotJoined->True] (* plot the weights *)
] (* end of Module *)

calcMse[ioPairs_,wtVec_] :=
Module[{errors,inputs,outDesired,outputs},
inputs = Map[First,ioPairs]; (* extract inputs *)
outDesired = Map[Last,ioPairs]; (* extract desired outputs *)
outputs = inputs . wtVec; (* calculate actual outputs *)
errors = Flatten[outDesired-outputs];
Return[errors.errors/Length[ioPairs]]
]
alcXor[learnRate_,numInputs_,ioPairs_,numIters_:250] :=
Module[{wts,eta=learnRate,errorList,inputs,outDesired,outError,outputs},
SeedRandom[6460]; (* seed random number gen. *)
wts = Table[Random[],{numInputs}]; (* initialize weights *)
errorList=Table[ (* select ioPair at random *)
{inputs,outDesired} = ioPairs[[Random[Integer,{1,4}]]];
outputs = wts.inputs; (* actual output *)
outError = First[outDesired-outputs]; (* error *)
wts += eta outError inputs;
outError,{numIters}]; (* end of Table *)
ListPlot[errorList,PlotJoined->True];
Return[wts];
]; (* end of Module *)

testXor[ioPairs_,weights_] :=
Module[{errors,inputs,outDesired,outputs,wts,mse},
inputs = Map[First,ioPairs]; (* extract inputs *)
outDesired = Map[Last,ioPairs]; (* extract desired outputs *)
outputs = inputs . weights; (* calculate actual outputs *)
errors = outDesired-outputs;
mse =
Flatten[errors] . Flatten[errors]/Length[ioPairs];
Print["Inputs = ",inputs];
Print["Outputs = ",outputs];
Print["Errors = ",errors];
Print["Mean squared error = ",mse]
]

alcXorMin[learnRate_,numInputs_,ioPairs_,maxError_] :=
Module[{wts,eta=learnRate,errorList,inputs,outDesired,
meanSqError,done,k,outError,outputs,errorPlot},
wts = Table[Random[],{numInputs}]; (* initialize weights *)
meanSqError = 0.0;
errorList={};
For[k=1;done=False,!done,k++, (* until done *)
(* select ioPair at random *)
{inputs,outDesired} = ioPairs[[Random[Integer,{1,4}]]];
outputs = wts.inputs; (* actual output *)
outError = First[outDesired-outputs]; (* error *)
wts += eta outError inputs; (* update weights *)
If[Mod[k,4]==0,meanSqError=calcMse[ioPairs,wts];
AppendTo[errorList,meanSqError]; ];
If[k>4 && meanSqError<maxError,done=True,Continue]; (* test for done *)
]; (* end of For *)
errorPlot=ListPlot[errorList,PlotJoined->True];
Return[wts];
] (* end of Module *)

Backpropagation and Functional Link Network

Options[sigmoid] = {xShift->0,yShift->0,temperature->1};
Options[bpnTest] = {printAll->False,bias->False};

sigmoid[x_,opts___Rule] :=
Module[{xshft,yshft,temp},
xshft = xShift /. {opts} /. Options[sigmoid];
yshft = yShift /. {opts} /. Options[sigmoid];
temp = temperature /. {opts} /. Options[sigmoid];
yshft+1/(1+E^(-(x-xshft)/temp)) //N
]
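A few sample calls (these evaluations are mine, not reproduced from the text) show how the options shift and rescale the basic logistic curve:

sigmoid[0.5]

0.622459

sigmoid[0.5,temperature->0.5] (* steeper slope *)

0.731059

sigmoid[0.5,xShift->1] (* curve shifted right *)

0.377541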

bpnTest[hiddenWts_,outputWts_,ioPairVectors_,opts___Rule] :=
Module[{inputs,hidden,outputs,desired,errors,i,len,
prntAll,errorTotal,errorSum,biasVal},
prntAll = printAll /. {opts} /. Options[bpnTest];
biasVal = bias /. {opts} /. Options[bpnTest];
inputs=Map[First,ioPairVectors];
len = Length[inputs]; (* set after inputs is assigned *)
If[biasVal,inputs=Map[Append[#,1.0]&,inputs] ];
desired=Map[Last,ioPairVectors];
hidden=sigmoid[inputs.Transpose[hiddenWts]];
If[biasVal,hidden = Map[Append[#,1.0]&,hidden] ];
outputs=sigmoid[hidden.Transpose[outputWts]];
errors= desired-outputs;
If[prntAll,Print["ioPairs:"];Print[ ];Print[ioPairVectors];
Print[ ];Print["inputs:"];Print[ ];Print[inputs];
Print[ ];Print["hidden-layer outputs:"];
Print[hidden];Print[ ];
Print["output-layer outputs:"];Print[ ];
Print[outputs];Print[ ];Print["errors:"];
Print[errors];Print[ ]; ]; (* end of If *)
For[i=1,i<=len,i++,Print[" Output ",i," = ",outputs[[i]]," desired = ",
desired[[i]]," Error = ",errors[[i]]];Print[]; ]; (* end of For *)
errorSum = Apply[Plus,errors^2,2]; (* second level *)
errorTotal = Apply[Plus,errorSum];
Print["Mean Squared Error = ",errorTotal/len];
] (* end of Module *)

bpnStandard[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,numIters_] :=
Module[{errors,hidWts,outWts,ioP,inputs,outDesired,hidOuts,
outputs,outErrors,outDelta,hidDelta},
hidWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.1,0.1}],{hidNumber}],{outNumber}];
errors = Table[
(* select ioPair *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=ioP[[1]];
outDesired=ioP[[2]];
(* forward pass *)
hidOuts = sigmoid[hidWts.inputs];
outputs = sigmoid[outWts.hidOuts];
(* determine errors and deltas *)
outErrors = outDesired-outputs;
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
(* update weights *)
outWts += eta Outer[Times,outDelta,hidOuts];
hidWts += eta Outer[Times,hidDelta,inputs];
(* add squared error to Table *)
outErrors.outErrors,{numIters}]; (* end of Table *)
Return[{hidWts,outWts,errors}];
]; (* end of Module *)

bpnBias[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,numIters_] :=
Module[{errors,hidWts,outWts,ioP,inputs,outDesired,hidOuts,
outputs,outErrors,outDelta,hidDelta},
hidWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber+1}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.1,0.1}],{hidNumber+1}],{outNumber}];
errorList = Table[
(* select ioPair *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=Append[ioP[[1]],1.0]; (* bias mod *)
outDesired=ioP[[2]];
(* forward pass *)
hidOuts = sigmoid[hidWts.inputs];
outInputs = Append[hidOuts,1.0]; (* bias mod *)
outputs = sigmoid[outWts.outInputs];
(* determine errors and deltas *)
outErrors = outDesired-outputs;
outDelta= outErrors (outputs (1-outputs));
hidDelta=(outInputs (1-outInputs)) Transpose[outWts].outDelta;
(* update weights *)
outWts += eta Outer[Times,outDelta,outInputs];
hidWts += eta Drop[Outer[Times,hidDelta,inputs],-1]; (* bias mod *)
(* add squared error to Table *)
outErrors.outErrors,{numIters}]; (* end of Table *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
bpnTest[hidWts,outWts,ioPairs]; (* check how close we are *)
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

bpnMomentum[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,
alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
hidLastDelta,outLastDelta,outDelta,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
errorList = Table[
(* begin forward pass *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=ioP[[1]];
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
(* calculate errors *)
outErrors = outDesired-outputs;
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
(* update weights *)
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta;
hidLastDelta = eta Outer[Times,hidDelta,inputs]+
alpha hidLastDelta;
hidWts += hidLastDelta;
outErrors.outErrors, (* this puts the error on the list *)
{numIters}]; (* this many times, Table ends here *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
bpnTest[hidWts,outWts,ioPairs,bias->False,printAll->False];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

bpnMomentumSmart[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,
alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
hidLastDelta,outLastDelta,outDelta,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
errorList = Table[
(* begin forward pass *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=ioP[[1]];
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
(* calculate errors *)
outErrors = outDesired-outputs;
If[First[Abs[outErrors]]>0.1,
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
(* update weights *)
outLastDelta= eta Outer[Times,outDelta,hidOuts]+
alpha outLastDelta;
outWts += outLastDelta;
hidLastDelta = eta Outer[Times,hidDelta,inputs]+
alpha hidLastDelta;
hidWts += hidLastDelta,Continue]; (* end of If *)
outErrors.outErrors, (* this puts the error on the list *)
{numIters}]; (* this many times, Table ends here *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
bpnTest[hidWts,outWts,ioPairs,bias->False,printAll->False];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

bpnCompete[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
outInputs,hidEps,outEps,outDelta,hidPos,outPos,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
errorList = Table[ (* begin forward pass *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=ioP[[1]];
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs];
outputs = sigmoid[outWts.hidOuts];
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
(* index of max delta *)
outPos = First[Flatten[Position[Abs[outDelta],Max[Abs[outDelta]]]]];
outEps = outDelta[[outPos]]; (* max value *)
outDelta=Table[-1/4 outEps,{Length[outDelta]}]; (* new outDelta table *)
outDelta[[outPos]] = outEps; (* reset this one *)
(* index of max delta *)
hidPos = First[Flatten[Position[Abs[hidDelta],Max[Abs[hidDelta]]]]];
hidEps = hidDelta[[hidPos]]; (* max value *)
hidDelta=Table[-1/4 hidEps,{Length[hidDelta]}]; (* new hidDelta table *)
hidDelta[[hidPos]] = hidEps; (* reset this one *)
outWts += eta Outer[Times,outDelta,hidOuts];
hidWts += eta Outer[Times,hidDelta,inputs];
outErrors.outErrors, (* this puts the error on the list *)
{numIters}]; (* this many times, Table ends here *)
Print["New hidden-layer weight matrix: "];
Print[ ]; Print[hidWts];Print[ ];
Print["New output-layer weight matrix: "];
Print[ ]; Print[outWts];Print[ ];
bpnTest[hidWts,outWts,ioPairs,bias->False,printAll->False];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

fln[inNumber_,outNumber_,ioPairs_,eta_,alpha_,numIters_] :=
Module[{outWts,ioP,inputs,outputs,outDesired,
outVals,outLastDelta,outDelta,outErrors},
outVals={};
outWts = Table[Table[Random[Real,{-0.1,0.1}],{inNumber}],{outNumber}];
outLastDelta = Table[Table[0,{inNumber}],{outNumber}];
errorList = Table[
(* begin forward pass *)
ioP=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
inputs=ioP[[1]];
outDesired=ioP[[2]];
outputs = outWts.inputs; (* output-layer outputs *)
(* calculate errors *)
outErrors = outDesired-outputs;
outDelta= outErrors;
(* update weights *)
outLastDelta= eta Outer[Times,outDelta,inputs]+alpha outLastDelta;
outWts += outLastDelta;
outErrors.outErrors, (* this puts the error on the list *)
{numIters}]; (* this many times, Table ends here *)
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
outVals=flnTest[outWts,ioPairs];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{outWts,errorList,errorPlot,outVals}];
] (* end of Module *)

flnTest[outputWts_,ioPairVectors_] :=
Module[{inputs,hidden,outputs,desired,errors,i,len,
errorTotal,errorSum},
inputs=Map[First,ioPairVectors];
desired=Map[Last,ioPairVectors];
len = Length[inputs];
outputs=inputs.Transpose[outputWts];
errors= desired-outputs;
For[i=1,i<=len,i++,
(*Print["Input ",i," = ",inputs[[i]]];*)
Print[" Output ",i," = ",outputs[[i]]," desired = ",
desired[[i]]," Error = ",errors[[i]]];Print[];
]; (* end of For *)
(*Print["errors= ",errors];Print[];*)
errorSum = Apply[Plus,errors^2,2]; (* second level *)
(*Print["errorSum= ",errorSum];Print[];*)
errorTotal = Apply[Plus,errorSum];
(*Print["errorTotal= ",errorTotal];*)
Print["Mean Squared Error = ",errorTotal/len];
Return[outputs];
] (* end of Module *)

Probabilistic Neural Network and Hopfield Network

normalize[x_List] := x/Sqrt[x.x] //N

energyHop[x_,w_] := -0.5 x . w . x;

psi[inValue_,netIn_] := If[netIn>0,1,
If[netIn<0,-1,inValue]]

phi[inVector_List,netInVector_List] :=
MapThread[psi[#1,#2]&,{inVector,netInVector}]
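A quick illustration (the numbers are mine): a unit with positive net input goes to 1, a negative net input gives -1, and zero net input leaves the unit's current value unchanged:

phi[{1,-1,1},{0.5,0,-2.0}]

{1, -1, -1}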

makeHopfieldWts[trainingPats_,printWts_:True] :=
Module[{wtVector},
wtVector =
Apply[Plus,Map[Outer[Times,#,#]&,trainingPats]];
If[printWts,
Print[];
Print[MatrixForm[wtVector]];
Print[];,Continue
]; (* end of If *)
Return[wtVector];
] (* end of Module *)

discreteHopfield[wtVector_,inVector_,printAll_:True] :=
Module[{done,energy,newEnergy,netInput,
newInput,output},
done = False;
newInput = inVector;
energy = energyHop[inVector,wtVector];
If[printAll,
Print[];Print["Input vector = ",inVector];
Print[];
Print["Energy = ",energy];
Print[],Continue
]; (* end of If *)
While[!done,
netInput = wtVector . newInput;
output = phi[newInput,netInput];
newEnergy = energyHop[output,wtVector];
If[printAll,
Print[];Print["Output vector = ",output];
Print[];
Print["Energy = ",newEnergy];
Print[],Continue
]; (* end of If *)
If[energy==newEnergy,
done=True,
energy=newEnergy;newInput=output,
Continue
]; (* end of If *)
]; (* end of While *)
If[!printAll,
Print[];Print["Output vector = ",output];
Print[];
Print["Energy = ",newEnergy];
Print[];
]; (* end of If *)
]; (* end of Module *)
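For example (the patterns below are my own illustrative choices, not from the text), you can store two bipolar vectors and then relax a noisy probe; the routine prints the energy after each pass and stops when the energy no longer changes:

pats = {{1,-1,1,-1},{1,1,-1,-1}};
wts = makeHopfieldWts[pats,False];
discreteHopfield[wts,{1,-1,-1,-1}]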

prob[n_,T_] := 1/(1+E^(-n/T)) //N;
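Two quick evaluations (mine, not from the text) show how raising the temperature flattens the update probability toward 1/2:

prob[1.0,0.5]

0.880797

prob[1.0,5.0]

0.549834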

probPsi[inValue_,netIn_,temp_] :=
If[Random[]<=prob[netIn,temp],1,psi[inValue,netIn]];

stochasticHopfield[inVector_,weights_,numSweeps_,temp_] :=
Module[{input,net,indx,numUnits,indxList,output},
numUnits=Length[inVector];
indxList=Table[0,{numUnits}];
input=inVector;
For[i=1,i<=numSweeps,i++,
Print["i= ",i];
For[j=1,j<=numUnits,j++,
(* select unit *)
indx = Random[Integer,{1,numUnits}];
(* net input to unit *)
net=input . weights[[indx]];
(* update input vector *)
output=probPsi[input[[indx]],net,temp];
input[[indx]]=output;
indxList[[indx]]+=1;
]; (* end For numUnits *)
Print[ ];Print["New input vector = "];Print[input];
]; (* end For numSweeps *)
Print[ ];Print["Number of times each unit was updated:"];
Print[ ];Print[indxList];
]; (* end of Module *)

pnnTwoClass[class1Exemplars_,class2Exemplars_,
testInputs_,sig_] :=
Module[{weightsA,weightsB,inputsNorm,patternAout,
patternBout,sumAout,sumBout},
weightsA = Map[normalize,class1Exemplars];
weightsB = Map[normalize,class2Exemplars];
inputsNorm = Map[normalize,testInputs];
sigma = sig;
patternAout =
gaussOut[inputsNorm . Transpose[weightsA]];
patternBout =
gaussOut[inputsNorm . Transpose[weightsB]];
sumAout = Map[Apply[Plus,#]&,patternAout];
sumBout = Map[Apply[Plus,#]&,patternBout];
outputs = Sign[sumAout-sumBout];
sigma=.;
Return[outputs];
]

Traveling Salesperson Problem

nOutOfN[weights_,externIn_,numUnits_,lambda_,deltaT_,
numIters_,printFreq_,reset_:False]:=
Module[{iter,l,dt,indx,ins},
dt=deltaT;
l=lambda;
iter=numIters;
ins=externIn;
(* only reset if starting over *)
If[reset,ui=Table[Random[],{numUnits}];
vi = g[l,ui],Continue]; (* end of If *)
Print["initial ui = ",N[ui,2]];Print[];
Print["initial vi = ",N[vi,2]];
For[iter=1,iter<=numIters,iter++,
indx = Random[Integer,{1,numUnits}];
ui[[indx]] = ui[[indx]]+
dt (vi . Transpose[weights[[indx]]] +
ui[[indx]] + ins[[indx]]);
vi[[indx]] = g[l,ui[[indx]]];
If[Mod[iter,printFreq]==0,
Print[];Print["iteration = ",iter];
Print["net inputs = "];
Print[N[ui,2]];
Print["outputs = "];
Print[N[vi,2]];Print[];
]; (* end of If *)
]; (* end of For *)
Print[];Print["iteration = ",--iter];
Print["final outputs = "];
Print[vi];
]; (* end of Module *)

tsp[weights_,externIn_,numUnits_,lambda_,deltaT_,
numIters_,printFreq_,reset_:False]:=
Module[{iter,l,dt,indx,ins,utemp},
dt=deltaT;
l=lambda;
iter=numIters;
ins=externIn;
(* only reset if starting over *)
If[reset,
utemp = ArcTanh[(2.0/Sqrt[numUnits])-1]/l;
ui=Table[
utemp+Random[Real,{-utemp/10,utemp/10}],
{numUnits}]; (* end of Table *)
vi = g[l,ui],Continue]; (* end of If *)
Print["initial ui = ",N[ui,2]];Print[];
Print["initial vi = ",N[vi,2]];
For[iter=1,iter<=numIters,iter++,
indx = Random[Integer,{1,numUnits}];
ui[[indx]] = ui[[indx]]+
dt (vi . Transpose[weights[[indx]]] +
ui[[indx]] + ins[[indx]]);
vi[[indx]] = g[l,ui[[indx]]];
If[Mod[iter,printFreq]==0,
Print[];Print["iteration = ",iter];
Print["net inputs = "];
Print[N[ui,2]];
Print["outputs = "];
Print[N[vi,2]];Print[];
]; (* end of If *)
]; (* end of For *)
Print[];Print["iteration = ",--iter];
Print["final outputs = "];
Print[MatrixForm[Partition[N[vi,2],Sqrt[numUnits]]]];
]; (* end of Module *)

BAM

makeXtoYwts[exemplars_] :=
Module[{temp},
temp = Map[Outer[Times,#[[2]],#[[1]]]&,exemplars];
Apply[Plus,temp]
]; (* end of Module *)

psi[inValue_,netIn_] := If[netIn>0,1,If[netIn<0,-1,inValue]];
phi[inVector_List,netInVector_List] :=
MapThread[psi[#1,#2]&,{inVector,netInVector}];

energyBAM[xx_,w_,zz_] := - (xx . w . zz)

bam[initialX_,initialY_,x2yWeights_,y2xWeights_,printAll_:False] :=
Module[{done,newX,newY,energy1,energy2},
done = False;
newX = initialX;
newY = initialY;
While[done == False,
newY = phi[newY,x2yWeights . newX];
If[printAll,Print[];Print[];Print["y = ",newY]];
energy1 = energyBAM[newY,x2yWeights,newX];
If[printAll,Print["energy = ",energy1]];
newX = phi[newX,y2xWeights . newY];
If[printAll,Print[];Print["x = ",newX]];
energy2 = energyBAM[newY,x2yWeights,newX];
If[printAll,Print["energy = ",energy2]];
If[energy1 == energy2,done=True,Continue];
]; (* end of While *)
Print[];Print[];
Print["final y = ",newY," energy= ",energy1];
Print["final x = ",newX," energy= ",energy2];
]; (* end of Module *)
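A small usage sketch (the exemplars here are my own illustrative choices): makeXtoYwts builds the forward weight matrix from a list of {x, y} pairs, and its transpose serves as the y-to-x matrix for the reverse direction:

exemplars = {{{1,-1,1,-1,1,-1},{1,1,-1,-1}},
{{-1,1,-1,1,-1,1},{-1,-1,1,1}}};
x2yWts = makeXtoYwts[exemplars];
bam[{1,-1,1,-1,1,1},{1,1,-1,-1},x2yWts,Transpose[x2yWts],True]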

Elman Network

elman[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,conUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+hidNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+hidNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass; select a sequence *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]];
conUnits = Table[0.5,{hidNumber}]; (* reset conUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[conUnits,ioP[[1]] ]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta;
conUnits = hidOuts; (* update context units *)
(* put the sum of the squared errors on the list *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[ ]; Print[hidWts];Print[ ];
Print["New output-layer weight matrix: "];
Print[ ]; Print[outWts];Print[ ];
elmanTest[hidWts,outWts,ioPairs,hidNumber];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

elmanTest[hiddenWts_,outputWts_,ioPairVectors_,conNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,conUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];Print[ioPairVectors]];
(* loop through the sequences *)
For[i=1,i<=Length[ioPairVectors],i++,
(* select the next sequence *)
ioSequence = ioPairVectors[[i]];
(* reset the context units *)
conUnits = Table[0.5,{conNumber}];
(* loop through the chosen sequence *)
For[j=1,j<=Length[ioSequence],j++,
ioP = ioSequence[[j]];
(* join context and input units *)
inputs=Join[conUnits,ioP[[1]] ];
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=sigmoid[outputWts.hidden];
errors= desired-outputs;
(* update context units *)
conUnits = hidden;
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[];
];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)

elmanComp[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,conUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+conNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+conNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
outErrors = Table[0,{outNumber}];
For[indx=1,indx<=numIters,indx++,
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
conUnits = Table[0.5,{conNumber}]; (* reset conUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[conUnits,ioP[[1]] ]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = outWts.hidOuts; (* output-layer outputs *)
outputs = sigmoid[outputs - 0.3 Apply[Plus,outputs] + .5 outputs];
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
conUnits = hidOuts; (* update context units *)
(* put the sum of the squared errors on the list *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[ ]; Print[hidWts]; Print[ ];
Print["New output-layer weight matrix: "];
Print[ ]; Print[outWts]; Print[ ];
elmanCompTest[hidWts,outWts,ioPairs,conNumber];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

elmanCompTest[hiddenWts_,outputWts_,ioPairVectors_,conNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,prntAll,conUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
conUnits = Table[0.5,{conNumber}]; (* reset the context units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[conUnits,ioP[[1]] ]; (* join context and input units *)
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=outputWts.hidden;
outputs=sigmoid[outputs -
0.3 Apply[Plus,outputs] + 0.5 outputs];
errors= desired-outputs;
(* update context units *)
conUnits = hidden;
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[];
];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)

Jordan Network

jordan[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,mu_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,stateUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+outNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+outNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
stateUnits = Table[0.1,{outNumber}]; (* reset stateUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[stateUnits,ioP[[1]] ]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
stateUnits = mu stateUnits + outputs; (* update state units *)
(* put the sum of the squared errors on the list *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
jordanTest[hidWts,outWts,ioPairs,mu,outNumber];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

jordanTest[hiddenWts_,outputWts_,ioPairVectors_,
mu_,stateNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,stateUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
stateUnits = Table[0.1,{stateNumber}]; (* reset the state units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[stateUnits,ioP[[1]] ]; (* join state and input units *)
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=sigmoid[outputWts.hidden];
errors= desired-outputs;
stateUnits = mu stateUnits + outputs; (* update state units *)
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[]; ];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)

(* this version sets the state units equal to the desired output values,
rather than the actual output values, during the training process *)
jordan2[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,mu_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
ioSequence,stateUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+outNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+outNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
stateUnits = Table[0.1,{outNumber}]; (* reset stateUnits *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[stateUnits,ioP[[1]] ]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
stateUnits = mu stateUnits + outDesired; (* update state units *)
AppendTo[errorList,outErrors.outErrors];
]; (* end of For i *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
jordan2Test[hidWts,outWts,ioPairs,mu,outNumber];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

jordan2Test[hiddenWts_,outputWts_,ioPairVectors_,
mu_,stateNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,stateUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[];
Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
stateUnits = Table[0.1,{stateNumber}]; (* reset the state units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[stateUnits,ioP[[1]] ]; (* join state and input units *)
desired=ioP[[2]];
hidden=sigmoid[hiddenWts.inputs];
outputs=sigmoid[outputWts.hidden];
errors= desired-outputs;
stateUnits = mu stateUnits + desired; (* update state units *)
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[];];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];
Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)

(* this is a modification of jordan2 in which the mean squared error
is calculated over the entire training pass before being added to the list *)
jordan2a[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,mu_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList={},
cycleError,ioSequence,stateUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+outNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+outNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
stateUnits = Table[0.1,{outNumber}]; (* reset stateUnits *)
cycleError = 0.0; (* initialize error *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[stateUnits,ioP[[1]] ]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = sigmoid[hidWts.inputs]; (* hidden-layer outputs *)
outputs = sigmoid[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (outputs (1-outputs));
hidDelta=(hidOuts (1-hidOuts)) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
stateUnits = mu stateUnits + outDesired; (* update state units *)
(* compute mse for this sequence *)
cycleError=cycleError + outErrors.outErrors/Length[outErrors];
]; (* end of For i *)
AppendTo[errorList,cycleError/Length[ioSequence]];
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];Print[ ]; Print[hidWts];Print[ ];
Print["New output-layer weight matrix: "];Print[ ]; Print[outWts];Print[ ];
jordan2Test[hidWts,outWts,ioPairs,mu,outNumber];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

(* this is a modification of jordan2a using the Tanh function *)
jordan2aTanh[inNumber_,hidNumber_,outNumber_,ioPairs_,eta_,alpha_,mu_,numIters_] :=
Module[{hidWts,outWts,ioP,inputs,hidOuts,outputs,outDesired,
i,indx,hidLastDelta,outLastDelta,outDelta,errorList = {},
cycleError,ioSequence,stateUnits,hidDelta,outErrors},
hidWts = Table[Table[Random[Real,{-0.5,0.5}],{inNumber+outNumber}],{hidNumber}];
outWts = Table[Table[Random[Real,{-0.5,0.5}],{hidNumber}],{outNumber}];
hidLastDelta = Table[Table[0,{inNumber+outNumber}],{hidNumber}];
outLastDelta = Table[Table[0,{hidNumber}],{outNumber}];
For[indx=1,indx<=numIters,indx++, (* begin forward pass *)
ioSequence=ioPairs[[Random[Integer,{1,Length[ioPairs]}]]]; (* select a sequence *)
stateUnits = Table[-0.9,{outNumber}]; (* reset stateUnits *)
cycleError = 0.0; (* initialize error *)
For[i=1,i<=Length[ioSequence],i++, (* process the sequence in order *)
ioP = ioSequence[[i]]; (* pick out the next ioPair *)
inputs=Join[stateUnits,ioP[[1]] ]; (* join context and input units *)
outDesired=ioP[[2]];
hidOuts = Tanh[hidWts.inputs]; (* hidden-layer outputs *)
outputs = Tanh[outWts.hidOuts]; (* output-layer outputs *)
outErrors = outDesired-outputs; (* calculate errors *)
outDelta= outErrors (1-outputs^2);
hidDelta=(1-hidOuts^2) Transpose[outWts].outDelta;
outLastDelta= eta Outer[Times,outDelta,hidOuts]+alpha outLastDelta;
outWts += outLastDelta; (* update weights *)
hidLastDelta = eta Outer[Times,hidDelta,inputs]+alpha hidLastDelta;
hidWts += hidLastDelta; (* update weights *)
stateUnits = mu stateUnits + outDesired; (* update state units *)
cycleError=cycleError + outErrors.outErrors/Length[outErrors];
]; (* end of For i *)
AppendTo[errorList,cycleError/Length[ioSequence]]; (* put the average mse on the list *)
]; (* end of For indx *)
Print["New hidden-layer weight matrix: "];
Print[]; Print[hidWts];Print[];
Print["New output-layer weight matrix: "];
Print[]; Print[outWts];Print[];
jordan2aTanhTest[hidWts,outWts,ioPairs,mu,outNumber];
errorPlot = ListPlot[errorList,PlotJoined->True];
Return[{hidWts,outWts,errorList,errorPlot}];
] (* end of Module *)

jordan2aTanhTest[hiddenWts_,outputWts_,ioPairVectors_,
mu_,stateNumber_,printAll_:False] :=
Module[{inputs,hidden,outputs,desired,errors,i,j,
prntAll,stateUnits,ioSequence,ioP},
If[printAll,Print[];Print["ioPairs:"];Print[]; Print[ioPairVectors]];
For[i=1,i<=Length[ioPairVectors],i++, (* loop through the sequences *)
ioSequence = ioPairVectors[[i]]; (* select the next sequence *)
stateUnits = Table[-0.9,{stateNumber}]; (* reset the state units *)
For[j=1,j<=Length[ioSequence],j++, (* loop through the chosen sequence *)
ioP = ioSequence[[j]];
inputs=Join[stateUnits,ioP[[1]] ]; (* join state and input units *)
desired=ioP[[2]];
hidden=Tanh[hiddenWts.inputs];
outputs=Tanh[outputWts.hidden];
errors= desired-outputs;
stateUnits = mu stateUnits + desired; (* update state units *)
Print[];
Print["Sequence ",i," input ",j];
Print[];Print["inputs:"];Print[];
Print[inputs];
If[printAll,Print[];Print["hidden-layer outputs:"];
Print[hidden];Print[]; ];
Print["outputs:"];Print[];
Print[outputs];Print[];
Print["desired:"];Print[];Print[desired];Print[];
Print["Mean squared error:"];
Print[errors.errors/Length[errors]];Print[];
]; (* end of For j *)
]; (* end of For i *)
] (* end of Module *)

ART

(* vmag for ART1 networks *)
vmag1[v_] := Count[v,1]

(* vmag for ART2 networks *)
vmag2[v_] := Sqrt[v . v]

(* reset for ART1 *)
resetflag1[outp_,inp_,rho_] :=
If[vmag1[outp]/vmag1[inp]<rho,True,False]

(* reset for ART2 *)
resetflag2[u_,p_,c_,rho_] :=
Module[{r,flag},
r = (u + c p) / (vmag2[u] + vmag2[c p]);
If[rho/vmag2[r] > 1,flag=True,flag=False];
Return[flag];
];
winner[p_,val_] := First[First[Position[p,val]]]

compete[f2Activities_] :=
Module[{i,x,f2dim,maxpos},
x=f2Activities;
maxpos=First[First[Position[x,Max[f2Activities]]]];
f2dim = Length[x];
For[i=1,i<=f2dim,i++,
If[i!=maxpos,x[[i]]=0,Continue] (* end of If *)
]; (* end of For *)
Return[x];
]; (* end of Module *)

art1Init[f1dim_,f2dim_,b1_,d1_,e1_,del1_,del2_] :=
Module[{z12,z21},
z12 = Table[Table[(b1-1)/d1 + del1,{f2dim}],{f1dim}];
z21 = Table[Table[(e1/(e1-1+f1dim)-del2),{f1dim}],{f2dim}];
Return[{z12,z21}];
]; (* end of Module *)

art1[f1dim_,f2dim_,a1_,b1_,c1_,d1_,e1_,rho_,f1Wts_,f2Wts_,inputs_] :=
Module[{droplistInit,droplist,notDone=True,i,nIn=Length[inputs],reset,
n,sf1,t,xf2,uf2,v,windex,matchList,newMatchList,tdWts,buWts},
droplistInit = Table[1,{f2dim}]; (* initialize droplist *)
tdWts=f1Wts; buWts=f2Wts;
matchList = (* construct list of F2 units and encoded input patterns *)
Table[{StringForm["Unit ``",n]},{n,f2dim}];
While[notDone==True,newMatchList = matchList; (* process until stable *)
For[i=1,i<=nIn,i++,in = inputs[[i]]; (* process inputs in sequence *)
droplist = droplistInit;reset=True; (* initialize *)
While[reset==True, (* cycle until no reset *)
xf1 = in/(1+a1*(in+b1)+c1); (* activities *)
sf1 = Map[If[#>0,1,0]&,xf1]; (* F1 outputs *)
t= buWts . sf1; (* F2 net-inputs *)
t = t droplist; (* turn off inhibited units *)
xf2 = compete[t]; (* F2 activities *)
uf2 = Map[If[#>0,1,0]&,xf2]; (* F2 outputs *)
windex = winner[uf2,1]; (* winning index *)
v= tdWts . uf2; (* F1 net-inputs *)
xf1 =(in+ d1*v-b1)/(1+a1*(in+d1*v)+c1); (* new F1 activities *)
sf1 = Map[If[#>0,1,0]&,xf1]; (* new F1 outputs *)
reset = resetflag1[sf1,in,rho]; (* check reset *)
If[reset==True,droplist[[windex]]=0; (* update droplist *)
Print["Reset with pattern ",i," on unit ",windex],Continue];
]; (* end of While reset==True *)
Print["Resonance established on unit ",windex," with pattern ",i];
tdWts=Transpose[tdWts]; (* resonance, so update weights, top down first *)
tdWts[[windex]]=sf1;
tdWts=Transpose[tdWts];
buWts[[windex]] = e1/(e1-1+vmag1[sf1]) sf1; (* then bottom up *)
matchList[[windex]] = (* update matching list *)
Reverse[Union[matchList[[windex]],{i}]];
]; (* end of For i=1 to nIn *)
If[matchList==newMatchList,notDone=False; (* see if matchList is static *)
Print["Network stable"],
Print["Network not stable"];
newMatchList = matchList];]; (* end of While notDone==True *)
Return[{tdWts,buWts,matchList}];
]; (* end of Module *)

art2F1[in_,a_,b_,d_,tdWts_,f1d_,winr_:0] :=
Module[{w,x,u,v,p,q,i},
w=x=u=v=p=q=Table[0,{f1d}];
For[i=1,i<=2,i++,
w = in + a u;
x = w / vmag2[w];
v = f[x] + b f[q];
u = v / vmag2[v];
p = If[winr==0,u,
u + d Transpose[tdWts][[winr]] ];
q = p / vmag2[p];
]; (* end of For i *)
Return[{u,p}];
] (* end of Module *)

art2Init[f1dim_,f2dim_,d_,del1_] :=
Module[{z12,z21},
z12 = Table[Table[0,{f2dim}],{f1dim}];
z21 = Table[Table[del1/((1-d)*Sqrt[f1dim]),
{f1dim}],{f2dim}];
Return[{z12,z21}];
]; (* end of Module *)

art2[f1dim_,f2dim_,a1_,b1_,c1_,d_,theta_,rho_,f1Wts_,f2Wts_,inputs_] :=
Module[{droplistInit,droplist,notDone=True,i,nIn=Length[inputs],reset,
u,p,t,xf2,uf2,v,windex,matchList,newMatchList,tdWts,buWts},
droplistInit = Table[1,{f2dim}]; (* initialize droplist *)
tdWts = f1Wts; buWts = f2Wts;
u = p = Table[0,{f1dim}];
(* construct list of F2 units and encoded input patterns *)
matchList = Table[{StringForm["Unit ``",n]},{n,f2dim}];
While[notDone==True,newMatchList = matchList; (* process until stable *)
For[i=1,i<=nIn,i++, (* process each input pattern in sequence *)
droplist = droplistInit; (* initialize droplist for new input *)
reset=True;
in = inputs[[i]]; (* next input pattern *)
windex = 0; (* initialize *)
While[reset==True, (* cycle until no reset *)
{u,p} = art2F1[in,a1,b1,d,tdWts,f1dim,windex];
t= buWts . p; (* F2 net-inputs *)
t = t droplist; (* turn off inhibited units *)
xf2 = compete[t]; (* F2 activities *)
uf2 = Map[g,xf2]; (* F2 outputs *)
windex = winner[uf2,d]; (* winning index *)
{u,p} = art2F1[in,a1,b1,d,tdWts,f1dim,windex];
reset = resetflag2[u,p,c1,rho]; (* check reset *)
If[reset==True,droplist[[windex]]=0; (* update droplist *)
Print["Reset with pattern ",i," on unit ",windex],Continue];
]; (* end of While reset==True *)
Print["Resonance established on unit ",windex," with pattern ",i];
tdWts=Transpose[tdWts]; (* resonance, so update weights *)
tdWts[[windex]]=u/(1-d); tdWts=Transpose[tdWts];
buWts[[windex]] = u/(1-d);
matchList[[windex]] = (* update matching list *)
Reverse[Union[matchList[[windex]],{i}]];
]; (* end of For i=1 to nIn *)
If[matchList==newMatchList,notDone=False; (* see if matchList is static *)
Print["Network stable"],Print["Network not stable"];
newMatchList = matchList];
]; (* end of While notDone==True *)
Return[{tdWts,buWts,matchList}];
]; (* end of Module *)

Genetic Algorithms

f[x_] := 1+Cos[x]/(1+0.01 x^2)

flip[x_] := If[Random[]<=x,True,False]

newGenerate[pmutate_,keyPhrase_,pop_,numGens_] :=
Module[{i,newPop,parent,diff,matches,
index,fitness},
newPop=pop;
For[i=1,i<=numGens,i++,
diff = Map[(keyPhrase-#)&,newPop];
matches = Map[Count[#,0]&,diff];
fitness = Max[matches];
index = Position[matches,fitness];
parent = newPop[[First[Flatten[index]]]];
Print["Generation ",i,": ",FromCharacterCode[parent],
" Fitness= ",fitness];
newPop = Table[Map[mutateLetter[pmutate,#]&,parent],{100}];
]; (* end of For *)
]; (* end of Module *)

decodeBGA[chromosome_] :=
Module[{pList,lchrom,values,phenotype},
lchrom = Length[chromosome];
(* convert from binary to decimal *)
pList = Flatten[Position[chromosome,1] ];
values = Map[2^(lchrom-#)&,pList];
decimal = Apply[Plus,values];
(* scale to proper range *)
phenotype = decimal (0.07820136852394916911)-40;
Return[phenotype];
]; (* end of Module *)

selectOne[foldedFitnessList_,fitTotal_] :=
Module[{randFitness,elem,index},
randFitness = Random[] fitTotal;
elem = Select[foldedFitnessList,#>=randFitness&,1];
index =
Flatten[Position[foldedFitnessList,First[elem]]];
Return[First[index]];
]; (* end of Module *)

myXor[x_,y_] := If[x==y,0,1];

mutateBGA[pmute_,allel_] :=
If[flip[pmute],myXor[allel,1],allel];

crossover[pcross_,pmutate_,parent1_,parent2_] :=
Module[{child1,child2,crossAt,lchrom},
(* chromosome length *)
lchrom = Length[parent1];
If[ flip[pcross],
(* True: select cross site at random *)
crossAt = Random[Integer,{1,lchrom-1}];
(* construct children *)
child1 = Join[Take[parent1,crossAt],Drop[parent2,crossAt]];
child2 = Join[Take[parent2,crossAt],Drop[parent1,crossAt]],
(* False: return parents as children *)
child1 = parent1;
child2 = parent2;
]; (* end of If *)
(* perform mutation *)
child1 = Map[mutateBGA[pmutate,#]&,child1];
child2 = Map[mutateBGA[pmutate,#]&,child2];
Return[{child1,child2}];
]; (* end of Module *)

initPop[psize_,csize_] :=
Table[Random[Integer,{0,1}],{psize},{csize}];

displayBest[fitnessList_,number2Print_] :=
Module[{i,sortedList},
sortedList = Sort[fitnessList,Greater];
For[i=1,i<=number2Print,i++,
Print["fitness = ",sortedList[[i]] ];
]; (* end of For i *)
]; (* end of Module *)

bga[pcross_,pmutate_,popInitial_,fitFunction_,numGens_,printNum_] :=
Module[{i,newPop,parent1,parent2,diff,matches,
oldPop,reproNum,index,fitList,fitListSum,
fitSum,pheno,pIndex1,pIndex2,f,children},
oldPop=popInitial; (* initialize first population *)
reproNum = Length[oldPop]/2; (* calculate number of reproductions *)
f = fitFunction; (* assign the fitness function *)
For[i=1,i<=numGens,i++, (* perform numGens generations *)
pheno = Map[decodeBGA,oldPop]; (* decode the chromosomes *)
fitList = f[pheno]; (* determine the fitness of each phenotype *)
Print[" "]; (* print out the best individuals *)
Print["Generation ",i," Best ",printNum];
Print[" "];
displayBest[fitList,printNum];
fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
fitSum = Last[fitListSum]; (* find the total fitness *)
newPop = Flatten[Table[ (* determine the new population *)
pIndex1 = selectOne[fitListSum,fitSum]; (* select parent indices *)
pIndex2 = selectOne[fitListSum,fitSum];
parent1 = oldPop[[pIndex1]]; (* identify parents *)
parent2 = oldPop[[pIndex2]];
children = crossover[pcross,pmutate,parent1,parent2]; (* crossover and mutate *)
children,{reproNum}],1 (* add children to list; flatten to first level *)
]; (* end of Flatten[Table] *)
oldPop = newPop; (* new becomes old for next gen *)
]; (* end of For i *)
]; (* end of Module *)

sigmoid[x_] := 1./(1+E^(-x));

initXorPop[psize_,csize_,ioPairs_] :=
Module[{i,iPop,hidWts,outWts,mseInv},
(* first the chromosomes *)
iPop = Table[
{Table[Random[Integer,{0,1}],{csize}], (* h1 *)
Table[Random[Integer,{0,1}],{csize}], (* h2 *)
Table[Random[Integer,{0,1}],{csize}] (* o1 *)
},{psize} ]; (* end of Table *)
(* then decode and eval fitness *)
(* use For loop for clarity *)
For[i=1,i<=psize,i++,
(* make hidden weight matrix *)
hidWts = Join[iPop[[i,1]],iPop[[i,2]] ];
hidWts = Partition[hidWts,20];
hidWts = Map[decodeXorChrom,hidWts];
hidWts = Partition[hidWts,2];
(* make output weight matrix *)
outWts = Partition[iPop[[i,3]],20];
outWts = Map[decodeXorChrom,outWts];
(* get mse for this network *)
mseInv = gaNetFitness[hidWts,outWts,ioPairs];
(* prepend mseInv *)
PrependTo[iPop[[i]],mseInv];
]; (* end For *)
Return[iPop];
]; (* end of Module *)

decodeXorChrom[chromosome_] :=
  Module[{pList,lchrom,values,p,decimal},
    lchrom = Length[chromosome];
    (* convert from binary to decimal *)
    pList = Flatten[Position[chromosome,1]];
    values = Map[2^(lchrom-#)&, pList];
    decimal = Apply[Plus,values];
    (* scale to the range [-5,5]; the factor is 10/(2^20 - 1) for a 20-bit chromosome *)
    p = decimal (9.536752259018191355*10^-6)-5;
    Return[p];
  ]; (* end of Module *)
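
Again the endpoints provide a quick check (not part of the original listing):

  decodeXorChrom[Table[0,{20}]]   (* -> -5. *)
  decodeXorChrom[Table[1,{20}]]   (* -> 5., to within roundoff *)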

gaNetFitness[hiddenWts_,outputWts_,ioPairVectors_] :=
  Module[{inputs,hidden,outputs,desired,errors,
      len,errorTotal,errorSum},
    inputs = Map[First,ioPairVectors];
    desired = Map[Last,ioPairVectors];
    len = Length[inputs];
    hidden = sigmoid[inputs.Transpose[hiddenWts]];
    outputs = sigmoid[hidden.Transpose[outputWts]];
    errors = desired-outputs;
    errorSum = Apply[Plus,errors^2,2]; (* sum squares within each pattern *)
    errorTotal = Apply[Plus,errorSum];
    (* inverse of mse *)
    Return[len/errorTotal];
  ]; (* end of Module *)
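
gaNetFitness expects ioPairVectors as a list of {inputVector, targetVector} pairs. One plausible XOR training set for the 2-2-1 network used here (the specific 0.1/0.9 target values are an assumption, chosen to stay off the asymptotes of the sigmoid), together with a hand-built 2 x 2 hidden matrix and 1 x 2 output matrix:

  ioPairs = {{{0.1,0.1},{0.1}}, {{0.1,0.9},{0.9}},
             {{0.9,0.1},{0.9}}, {{0.9,0.9},{0.1}}};
  gaNetFitness[{{1.,-1.},{-1.,1.}}, {{2.,2.}}, ioPairs]   (* inverse mse *)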

crossoverXor[pcross_,pmutate_,parent1_,parent2_] :=
  Module[{crossAt,lchrom,i,numchroms,chroms1,chroms2},
    (* strip off mse *)
    chroms1 = Rest[parent1];
    chroms2 = Rest[parent2];
    (* chromosome length *)
    lchrom = Length[chroms1[[1]]];
    (* number of chromosomes in each list *)
    numchroms = Length[chroms1];
    For[i=1,i<=numchroms,i++, (* for each chrom *)
      If[ flip[pcross],
        (* True: select cross site at random *)
        crossAt = Random[Integer,{1,lchrom-1}];
        (* construct children; assign in parallel so the second child
           is built from the unmodified first chromosome *)
        {chroms1[[i]],chroms2[[i]]} =
          {Join[Take[chroms1[[i]],crossAt], Drop[chroms2[[i]],crossAt]],
           Join[Take[chroms2[[i]],crossAt], Drop[chroms1[[i]],crossAt]]},
        (* False: leave this chrom unchanged and skip its mutation *)
        Continue[] ]; (* end of If *)
      (* perform mutation *)
      chroms1[[i]] = Map[mutateBGA[pmutate,#]&,chroms1[[i]]];
      chroms2[[i]] = Map[mutateBGA[pmutate,#]&,chroms2[[i]]];
    ]; (* end of For i *)
    Return[{chroms1,chroms2}];
  ]; (* end of Module *)
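
Note that crossoverXor consumes full individuals, {mseInv, gene1, gene2, gene3}, but returns bare genotypes; gaXor prepends the new fitness values afterward. A sketch (not part of the original listing) with 40-bit genes:

  p1 = Prepend[Table[Table[0,{40}],{3}], 0.5];
  p2 = Prepend[Table[Table[1,{40}],{3}], 0.9];
  {c1, c2} = crossoverXor[1.0, 0.0, p1, p2];  (* children lack the fitness slot *)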

gaXor[pcross_,pmutate_,popInitial_,numReplace_,ioPairs_,numGens_,printNum_] :=
  Module[{i,j,newPop,parent1,parent2,oldPop,reproNum,fitList,
      fitListSum,fitSum,pIndex1,pIndex2,children,hids,outs,mseInv},
    (* initialize first population sorted by fitness value *)
    oldPop = Sort[popInitial, Greater[First[#1],First[#2]]&];
    reproNum = numReplace;   (* calculate number of reproductions *)
    For[i=1,i<=numGens,i++,
      fitList = Map[First,oldPop];   (* list of fitness values *)
      (* make the folded list of fitness values *)
      fitListSum = FoldList[Plus,First[fitList],Rest[fitList]];
      fitSum = Last[fitListSum];     (* find the total fitness *)
      newPop = Drop[oldPop,-reproNum]; (* new population; eliminate reproNum worst *)
      For[j=1,j<=reproNum/2,j++,     (* make reproNum new children *)
        (* select parent indices *)
        pIndex1 = selectOne[fitListSum,fitSum];
        pIndex2 = selectOne[fitListSum,fitSum];
        parent1 = oldPop[[pIndex1]]; (* identify parents *)
        parent2 = oldPop[[pIndex2]];
        children = crossoverXor[pcross,pmutate,parent1,parent2]; (* cross and mutate *)
        {hids,outs} = decodeXorGenotype[children[[1]]]; (* fitness of children *)
        mseInv = gaNetFitness[hids,outs,ioPairs];
        children[[1]] = Prepend[children[[1]],mseInv];
        {hids,outs} = decodeXorGenotype[children[[2]]];
        mseInv = gaNetFitness[hids,outs,ioPairs];
        children[[2]] = Prepend[children[[2]],mseInv];
        newPop = Join[newPop,children]; (* add children to new population *)
      ]; (* end of For j *)
      oldPop = Sort[newPop, Greater[First[#1],First[#2]]&]; (* for next gen *)
      (* print best mse values (1/mseInv) *)
      Print[ ]; Print["Best of generation ",i];
      For[j=1,j<=printNum,j++, Print[(1.0/oldPop[[j,1]])]; ];
    ]; (* end of For i *)
    Return[oldPop];
  ]; (* end of Module *)
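
A minimal run, assuming the 40-bit gene length implied by the 20-bit partitioning above and the ioPairs structure shown with gaNetFitness; numReplace should be even, since children are created in pairs:

  pop = initXorPop[50, 40, ioPairs];   (* 50 individuals, 40-bit genes *)
  finalPop = gaXor[0.8, 0.01, pop, 10, ioPairs, 20, 3];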

decodeXorGenotype[genotype_] :=
  Module[{hidWts,outWts},
    (* make hidden weight matrix *)
    hidWts = Join[genotype[[1]],genotype[[2]] ];
    hidWts = Partition[hidWts,20];
    hidWts = Map[decodeXorChrom,hidWts];
    hidWts = Partition[hidWts,2];
    (* make output weight matrix *)
    outWts = Partition[genotype[[3]],20];
    outWts = Map[decodeXorChrom,outWts];
    outWts = Partition[outWts,2];  (* single output unit: a 1 x 2 matrix *)
    Return[{hidWts,outWts}];
  ]; (* end of Module *)

encodeNetGa[weight_,len_] :=
  Module[{dec,i,l},
    i = len;
    l = Table[0,{i}];
    (* scale to proper range *)
    dec = Round[(weight+5.)/(9.536752259018191355*10^-6)];
    (* peel binary digits off the low end *)
    While[dec!=0 && dec!=1,
      l = ReplacePart[l,Mod[dec,2],i];
      dec = Quotient[dec,2];
      --i;
    ];
    l = ReplacePart[l,dec,i]  (* the list l is the returned chromosome *)
  ]; (* end of Module *)
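
encodeNetGa is the (approximate) inverse of decodeXorChrom, so a round trip should reproduce a weight to within the quantization step; for example (not part of the original listing):

  bits = encodeNetGa[1.5, 20];   (* 20-bit encoding of the weight 1.5 *)
  decodeXorChrom[bits]           (* -> 1.50000..., up to quantization *)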

randomPop[psize_,csize_,ioPairs_,numGens_] :=
  Module[{i,pop},
    For[i=1,i<=numGens,i++,
      pop = initXorPop[psize,csize,ioPairs];
      pop = Sort[pop, Greater[First[#1],First[#2]]&];
      Print[ ];
      Print["Random generation ",i];
      Print[(1.0/pop[[1,1]])];
    ];
  ]; (* end of Module *)
Bibliography

[1] Igor Aleksander, editor. Neural Computing Architectures. MIT Press, Cambridge, MA, 1989.
[2] James A. Anderson and Edward Rosenfeld, editors. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA, 1988.
[3] Maureen Caudill and Charles Butler. Naturally Intelligent Systems. MIT Press, Cambridge, MA, 1990.
[4] Lawrence Davis, editor. Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York, 1991.
[5] John S. Denker, editor. Neural Networks for Computing: AIP Conference Proceedings 151. American Institute of Physics, New York, 1986.
[6] Russell C. Eberhart and Roy W. Dobbins, editors. Neural Network PC Tools: A Practical Guide. Academic Press, San Diego, CA, 1990.
[7] Rolf Eckmiller and Christoph v. d. Malsburg, editors. Neural Computers. NATO ASI Series F: Computer and Systems Sciences. Springer-Verlag, Berlin, 1988.
[8] James A. Freeman and David M. Skapura. Neural Networks: Algorithms, Applications, and Programming Techniques. Addison-Wesley, Reading, MA, 1991.
[9] David E. Goldberg. Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, Reading, MA, 1989.
[10] Stephen Grossberg, editor. Studies of Mind and Brain. D. Reidel Publishing, Boston, MA, 1982.
[11] Stephen Grossberg, editor. Neural Networks and Natural Intelligence. MIT Press, Cambridge, MA, 1988.
[12] Robert Hecht-Nielsen. Neurocomputing. Addison-Wesley, Reading, MA, 1990.


[13] John Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA, 1991.
[14] Geoffrey E. Hinton and James A. Anderson, editors. Parallel Models of Associative Memory. Lawrence Erlbaum Associates, Hillsdale, NJ, 1981.
[15] Tarun Khanna. Foundations of Neural Networks. Addison-Wesley, Reading, MA, 1990.
[16] C. Klimasauskas. The 1989 Neuro-Computing Bibliography. MIT Press, Cambridge, MA, 1989.
[17] Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, New York, 1984.
[18] Bart Kosko. Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[19] Roman Maeder. Programming in Mathematica. 2nd Edition, Addison-Wesley, Reading, MA, 1991.
[20] James McClelland and David Rumelhart. Explorations in Parallel Distributed Processing. MIT Press, Cambridge, MA, 1988.
[21] Marvin Minsky and Seymour Papert. Perceptrons: Expanded Edition. MIT Press, Cambridge, MA, 1988.
[22] B. Muller and J. Reinhardt. Neural Networks: An Introduction. Springer-Verlag, Berlin, 1990.
[23] Yoh-Han Pao. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA, 1989.
[24] David Rumelhart and James McClelland. Parallel Distributed Processing. MIT Press, Cambridge, MA, 1986.
[25] Patrick K. Simpson. Artificial Neural Systems: Foundations, Paradigms, Applications, and Implementations. Pergamon Press, New York, 1990.
[26] Philip D. Wasserman. Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York, 1989.
[27] Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, 1985.
[28] Stephen Wolfram. Mathematica: A System for Doing Mathematics by Computer. 2nd Edition, Addison-Wesley, Reading, MA, 1991.
[29] Steven F. Zornetzer, Joel L. Davis, and Clifford Lau, editors. An Introduction to Neural and Electronic Networks. Academic Press, San Diego, CA, 1990.
Index

action potential, 157
activation, 222
    value, 11
Adaline, 40-65, 119
adaptive linear combiner (ALC), 40-65, 68, 70, 71, 74
adaptation, 211
adaptive resonance theory (ART), 210
    ART1, 211-242
        attentional subsystem, 212
        bottom-up inputs, 214, 218, 221
        feedback, 207, 210
        F1 layer, 212-226, 235
        F2 layer, 212-227, 233-235, 237, 241
        gain control, 214-217
        orienting subsystem, 212, 214, 216, 217, 219, 221, 222, 232, 234
        reset, 215, 221, 232-234, 237
        resonance, 212-214, 226, 228
        top down,
            connections, 215
            inputs, 214, 217, 218
            pattern, 221
            template, 212, 214
            weights, 223, 238
        2/3 rule, 214
        vigilance parameter, 212, 214, 219-222, 230, 235, 238, 240
    ART2, 243-257
        bottom up,
            connections, 244
            inputs, 248
            weights, 247, 257
        feedback, 250
        F1 layer, 243-248, 250, 251, 253, 257
        F2 layer, 243-250, 253
        gain control, 244
        orienting subsystem, 244, 246-248
        reset, 248, 254
        top down,
            connections, 244
            template, 248
            weights, 248, 253, 254, 257
        vigilance parameter, 247
allele, 226, 267, 274
    binary, 268
AND function, 28, 29, 32
annealing, 124, 128, 130, 134, 136
    simulated, 128-131, 136
a posteriori probability, 139
a priori probability, 136, 146
ART. See adaptive resonance theory
associative memory, 116, 151, 178
asynchronous,
    processing, 131
    updating, 120, 127
autoassociative memory, 116, 178
avalanche, 186
axon hillock, 157
backpropagation, 19, 65, 68-103, 188, 207, 281, 290, 293
    network, 63, 68-103, 114, 119, 128, 185, 186, 188, 189, 210
    of errors, 88, 188


BAM. See bidirectional associative memory
Bayes',
    classifier, 116
    decision rule, 139, 143
    decision strategy, 144
    pattern classification, 116, 135-144, 152
    theorem, 135
Bayesian. See Bayes'
bias,
    term, 40, 43, 69, 71, 94, 96
    unit, 69, 93-96
    weight, 69
bidirectional associative memory (BAM), 178-185
bidirectional connections, 135, 136, 179
binary,
    inputs, 28, 211
    numbers, 9
bipolar,
    output function, 40
    vectors, 118, 179
Boltzmann,
    constant, 126, 127
    completion network, 135
    distribution, 127
    factor, 126
    input/output network, 136
    machine, 134
BPN. See backpropagation network
brain, 5
children, 266, 276
Christ, Fury, 156
chromosome, 266, 267, 270, 273-275, 277, 281, 282, 285, 292
    representation, 268-271
classical conditioning, 35
competition, 101, 211, 224, 225, 231, 233, 234
    winner-take-all, 218, 246
competitive,
    layer, 218, 222
    weight updates, 101-103
conditional probability, 135, 137-139
connection, 7-9, 20, 36, 40, 94
    excitatory, 9, 213
    feedback, 7, 195
    inhibitory, 8, 9, 168, 169, 194, 211, 213, 217
    matrix,
        sparse, 23
    strength, 9
constraint, 162, 168, 173, 211
    satisfaction, 153-175
    strong, 154
    weak, 154, 170
context unit, 186-189
contrast enhancement, 253
crossover, 273-276, 282, 285, 292
    point, 274
    single point, 273
crosstalk, 122, 182
Dawkins, Richard, 261
delta rule, 43, 47-49, 61, 64, 65, 68, 114
decision,
    boundary, 140, 146
    surface, 29, 32, 37, 140
desired output, 43, 50, 51, 60, 201, 203
dot product, 10, 27, 41, 79, 145, 149, 215, 224
elitism, 277
Elman, J. E., 179
Elman network, 186-194, 198, 207
    context units, 186-189
energy, 124, 125, 182
    function, 118, 160
    global minimum, 131
    potential, 125
    surface, 182
    total, 126
entropy, 124
equilibrium, 130
    activities, 216
    values, 244
error, 46, 48, 53, 59, 60, 193
    mean squared (mse), 44, 48, 61, 107, 202, 281, 282, 288, 290
    minimum, 46
    surface, 46, 48, 55, 73, 74

evolution, 261, 262
exchange interaction strength, 125
excitatory,
    connections, 9, 213
    inputs, 215
exemplar, 27, 43, 62, 71, 72, 89, 140, 143, 145, 146, 179, 182, 185
expectation value, 44
external magnetic field, 126
feed-forward propagation, 71
feedback connections, 7, 195
ferromagnet, 125
fitness, 267, 271, 272, 276, 277, 282, 285, 288, 292
    function, 268, 282
FLN. See functional link network
function
    AND, 28, 29, 32
    bipolar output, 40
    energy, 118, 160
    fitness, 268, 282
    Gaussian, 140, 141, 145
    Gaussian output, 145, 149
    identity, 70, 73
    linear output, 105
    linear threshold, 246
    nonlinear output, 103, 145
    output, 14, 70, 222
    partition, 126
    probability distribution, 140
    sigmoid, 16, 17, 19, 70, 71, 73, 76, 79, 83, 105, 127, 145, 203, 204, 246
    tanh, 203, 204
    threshold, 18, 19, 246
    threshold output, 28, 65
    XOR, 31-33, 275
functional link network (FLN), 32, 68, 103-114
    functional expansion model, 104, 105, 108
    functional links, 103-105, 109
    outer product model, 104
    tensor model, 104
GA. See genetic algorithm
gain parameter, 157, 160
Gaussian,
    function, 140, 141, 145
    output function, 145, 159
GDR. See generalized delta rule
generalized delta rule (GDR), 68, 71, 114
generation, 277, 280, 282, 285, 289, 290, 292
generational replacement, 276, 277
genes, 266, 267, 273-276
    locus of, 266
genetic algorithm, 259-294
genetics, 260
genotype, 266, 267
gradient, 64, 72-74
gradient descent, 46-48, 63
Hamming distance, 119, 182, 184
Hebb, Donald O., 34, 35
Hebb's,
    rule, 117
    theory, 36
Hebbian learning, 34
heteroassociative memory, 136, 178
hidden layer, 32, 33, 64
    units, 32, 33, 70, 92, 187, 188, 194, 281, 293
    weights, 73, 83
hill climbing, 40, 268
Hinton diagrams, 23
Hinton, Geoffrey, 23, 129
Hopfield network, 116-118, 122, 124-126, 128, 151, 156, 175, 178, 181, 185, 207, 218, 305-308
    continuous, 157-160
    discrete, 116-124, 131
    stochastic, 131-134
Horner's rule, 269
identity function, 70, 73
inhibitory,
    connections, 8, 9, 168, 169, 194, 211, 213, 217
    input, 215, 216
    signals, 218, 244
    strength, 161

input correlation matrix, 46
Ising model, 124, 151
Jordan, M. I., 178
Jordan network, 194-207
layer,
    hidden, 32, 33, 64
learning, 28, 98
learning rate, 48, 56, 59, 74, 78, 89, 90, 108, 193, 207, 293
linear output function, 105
linear threshold function, 246
linear separability, 30, 103
linked list, 23
LMS learning rule, 42-49
local minimum, 159, 293
long term memory (LTM), 212
Madaline, 42
magnet, 124
magnetic,
    field, 124
    material, 124
    spin, 126
minimum cost, 154
momentum, 98-102, 193
mse. See mean squared error
mutation, 273-277, 282
n-out-of-N problem, 160, 165, 169, 173, 175
natural selection, 260-262, 265, 271, 274
net-input value, 9-11, 16, 22, 25, 37, 41, 69, 71, 117, 127, 157, 163
neurobiology, 34
neurons, 5, 6, 8, 34, 157
neurotransmitters, 34, 35
node, 6
nonlinear output function, 103, 145
noise reduction, 253
normalization, 145, 146, 148
on-center, off-surround, 218
one-to-many mapping, 186
optimization, 175, 281, 293
optimum weight vector, 53
output function, 11, 14, 70, 222
Page, Ed, 156
parallelism, 5
parents, 266, 271, 273, 276, 280
parent selection, 271, 272, 282
partition function, 126
pattern,
    classification, 139
    recognition, 75
    unit, 144
Pavlovian conditioning, 35
perceptron, 30
phenotype, 266-268, 270, 271, 277, 281
pixel, 5, 24, 139, 238
plan vector, 195
PNN. See probabilistic neural network
population, 276, 280, 285, 289, 292
positive feedback, 218
postsynaptic,
    cell, 35
    membrane, 34
potential,
    energy, 125
    spikes, 157
presynaptic,
    cell, 35
    membrane, 34
probabilistic neural network (PNN), 116, 135, 144-152
probability distribution function, 140
processing element, 6, 9, 10
randomly connected network, 23, 30
recurrent network, 178
reproduction, 273
retina, 5, 6
reset, 215, 221, 232-234, 237, 248
resonance, 210-214, 226, 228
Rosenblatt, Frank, 30
roulette-wheel method, 271
schemata, 267
Sejnowski, Terrence J., 129
self-adaptation, 43

self-scaling, 220
short term memory (STM), 212
sigmoid function. See function, sigmoid
slack units, 173
spatial patterns, 185
spin, 124, 125
spin glass, 125
spurious stable state, 122, 182
stability-plasticity dilemma, 210
statistical mechanics, 151
steady state,
    population, 285
    reproduction, 277
Stearns, Samuel D., 49
stochastic processes, 116, 151
supervised learning, 43
survival probability, 267
sweep, 131, 132
synapse, 8, 9, 34, 36
synaptic,
    cleft, 34
    junction, 35
    strength, 36
synchronous updating, 120, 127
T-C problem, 75-89, 94, 97, 99, 102
Tagliarini, Gene, 156
tanh function, 203, 204
temperature, 124-126, 128-132, 134, 151
    fictitious, 127, 130
thermal,
    effects, 125
    equilibrium, 26
    fluctuations, 127
threshold, 33
    condition, 29, 253
    function, 18, 19, 246
    output function, 28, 65
    value, 19
    unit, 33
training, 43, 46, 108, 293
    algorithm, 68, 72
transversal filter, 49, 58
traveling salesperson problem (TSP), 154-173
units, 6, 8, 11, 16, 20, 27
    winning, 101
vector-matrix multiplication, 22
VLSI, 157
weight, 9, 10, 27, 28, 36, 44, 49, 126
    matrix, 19-27, 59, 70, 71, 162, 169, 171, 173, 174, 178-180, 196, 230, 233, 235
    space, 281
    vectors, 24-27, 42, 43, 46-48, 122, 145, 238
Widrow, Bernard, 49, 54
XOR,
    function, 31-33, 275
    problem, 30, 58, 68, 89, 94, 97, 100, 103, 104, 108, 281
Neural Networks/Mathematica

JAMES A. FREEMAN

SIMULATING NEURAL NETWORKS

This book introduces neural networks, their operation and their application, in the context of Mathematica®, a mathematical programming language. Readers will learn how to simulate neural network operations using Mathematica, and will learn techniques for employing Mathematica to assess neural network behavior and performance. They will see how this popular and widely available software can be used to explore neural network technology, experiment with various architectures, debug new training algorithms, and design techniques for analyzing network performance.

Simulating Neural Networks with Mathematica® is suitable for professionals and students in computer science, electrical engineering, applied mathematics, and related areas who need an efficient way to learn about neural networks, and to gain some proficiency in their use. The source code for the programs in the book is available free of charge via electronic mail from the author.

Highlights:
• Addresses a major neural network topic or a specific network architecture in each chapter.
• Includes an introduction to genetic algorithms.
• Includes Mathematica listings in an appendix.

About the Author


James A. Freeman is a senior engineer in the Artificial Intelligence Laboratory of Loral Space Information Systems in Houston, where he develops neural network applications and special parallel computer systems. Dr. Freeman is also a member of the adjunct faculty of the University of Houston-Clear Lake.
