Parallel Distributed Processing - Vol. 1
PREFACE
the simple insights about the interactive nature of processing that had
led to such models as the HEARSAY model of speech understanding.
More generally, they had failed to provide a framework for representing
knowledge in a way that allowed it to be accessed by content and effectively combined with other knowledge to produce useful automatic
syntheses that would allow intelligence to be productive. And they
made no contact with the real strengths and weaknesses of the
hardware in the brain. A Cray computer can perform on the order of
100 million double-precision multiplications in a second, but it does not
exhibit natural intelligence. How then are we to understand the capabilities of human thought, given the time constants and noisiness
inherent in neural systems? It seemed obvious that to get any processing done in real time, the slow, noisy hardware in the brain would have
to do massively parallel processing.
As our interest in parallel mechanisms developed, we began to study
the work of others who shared our convictions and to build on their
work. Particularly important in this regard was Hinton and J. A.
Anderson's (1981) Parallel Models of Associative Memory. Indeed, we
see our book as a descendant of their book on two accounts. First, the
material presented here represents further developments on the work
presented in Hinton and Anderson's book. Second, we owe a particular
intellectual debt to both Hinton and Anderson. Our interest in distributed, associative memories goes back to interactions with Jim Anderson, beginning as early as 1968. Our interest in these topics began in
earnest, however, during the period when we were developing the
interactive activation model of word perception, in 1979, shortly after
Geoffrey Hinton began a postdoctoral fellowship at UCSD. Geoffrey's
crisp explanations showed us the potential power and generality of
models created from connections among simple processing units, and
fit together nicely with our own developing conviction that various
aspects of perception, language processing, and motor control were best
thought of in terms of massively parallel processing (see McClelland,
1979, and Rumelhart, 1977, for our earliest steps in this direction).
The project culminating in this book formally began in December
1981, when the two of us and Geoffrey Hinton decided to work together
exploring the implications of network models and to write a book outlining our conclusions. We expected the project to take about six
months. We began in January 1982 by bringing a number of our colleagues together to form a discussion group on these topics. During
the first six months we met twice weekly and laid the foundation for
most of the work presented in these volumes. Our first order of business was to develop a name for the class of models we were investigating. It seemed to us that the phrase parallel distributed processing (PDP
for short) best captured what we had in mind. It emphasized the parallel nature of the processing, the use of distributed representations and
distributed control, and the fact that these were general processing systems, not merely memories we were studying, as the phrase associative
memory suggests. Thus the PDP research group was born. Hinton and
McClelland left after the first six months - Hinton to CMU and
McClelland to MIT and later to CMU. The group continued to meet,
however, and its membership has varied from five or six of us at times
to as many as 15 at others; it was clear that there was much work to be
done and many directions to explore. Thus, our work continued and
expanded as we and our colleagues followed the implications of the
PDP approach in many different ways.
A good deal has happened since we began this project. Though much
of the initial groundwork was laid in early 1982, most of the material
described in these volumes did not take its present form until much
later.
The book consists of six parts, three in each of the two volumes.
The overall structure is indicated in the accompanying table. Part I provides an overview. Chapter 1 presents the motivation for the approach
and describes much of the early work that led to the developments
reported in later sections. Chapter 2 describes the PDP framework in
more formal terms. Chapter 3 focuses on the idea of distributed
representation, and Chapter 4 provides a detailed discussion of several
general issues that the PDP approach has raised and explains how these
issues are addressed in the various later chapters of the book.
The remaining parts of the book present different facets of our
explorations in parallel distributed processing. The chapters in Part II
address central theoretical problems in the development of models of
parallel distributed processing, focusing for the most part on fundamental problems in learning. The chapters in Part III describe various
mathematical and computational tools that have been important in
the development and analysis of PDP models. Part IV considers
A CONDENSED TABLE OF CONTENTS

VOLUME I

I. THE PDP PERSPECTIVE: 3. Distributed Representations; 4. General Issues
II. BASIC MECHANISMS: 5. Competitive Learning; 6. Harmony Theory; 7. Boltzmann Machines; 8. Learning by Error Propagation
III. FORMAL ANALYSES: 9. Linear Algebra; 10. Activation Functions; 11. The Delta Rule

VOLUME II

IV. PSYCHOLOGICAL PROCESSES: 14. Schemata and PDP
V. BIOLOGICAL MECHANISMS: 20. Anatomy and Physiology; 21. Computation in the Brain; 22. Neural and Conceptual Levels; 23. Place Recognition; 24. Neural Plasticity; 25. Amnesia
VI. CONCLUSION: 26. Reflections; Future Directions
applications and implications of PDP models to various aspects of
human cognition, including perception, memory, language, and higher level thought processes. Part V considers the relation between parallel
distributed processing models and the brain, reviews relevant aspects of
the anatomy and physiology, and describes several models that apply
PDP models to aspects of the neurophysiology and neuropsychology of
information processing, learning, and memory. Part VI contains two
short pieces: a reflection on PDP models by Don Norman and a brief
discussion of our thoughts about promising future directions.
How to read this book? It is too long to read straight through. Nor
is it designed to be read this way. Chapter 1 is a good entry point for
readers unfamiliar with the PDP approach, but beyond that the various
parts of the book may be approached in various orders, as one might
explore the different parts of a complex object or machine. The various facets of the PDP approach are interrelated, and each part informs
the others, but there are few strict sequential dependencies. Though
we have tried to cross-reference ideas that come up in several places,
we hope that most chapters can be understood without reference to the
rest of the book. Where dependencies exist, they are noted in the introductory sections at the beginning of each part of the book.
This book charts the explorations we and our colleagues have made
in the microstructure of cognition . There is a lot of terrain left to be
explored . We hope this book serves as a guide that helps others join us
in these ongoing explorations .
December 1985

James L. McClelland
Pittsburgh, Pennsylvania

David E. Rumelhart
La Jolla, California
Acknowledgments
As we have already said, nearly all the ideas in this book were born
out of interactions, and one of our most important acknowledgments is
to the environment that made these interactions possible. The Institute
for Cognitive Science at UCSD and the members of the Institute have
made up the core of this environment .
Don Norman, our colleague and friend, the Founder and Director of
the Institute, deserves special credit for making ICS an exciting and
stimulating place, for encouraging our explorations in parallel distributed processing, and for his central role in arranging much of the financial support this book has benefited from (of which more below). The
atmosphere depends as well on the faculty, visiting scholars, and graduate students in and around ICS. The members of the PDP Research
Group itself, of course, have played the most central role in helping to
shape the ideas found in this book. All those who contributed to the
actual contents of the book are listed on the cover page; they have all
contributed, as well, in many other ways. Several other participants in
the group who do not have actual contributions to the book also
deserve mention. Most prominent among these are Mike Mozer and
Yves Chauvin, two graduate students in the Cognitive Science Lab, and
Gary Cottrell, a recent addition to the group from the University of
Rochester.
Several other members of the intellectual community in and around
ICS have played very important roles in helping us to shape our
thoughts. These include Liz Bates, Michael Cole, Steve Draper, Don
Gentner, Ed Hutchins, Jim Hollan, Jean Mandler, George Mandler,
Jeff Miller, Guy van Orden, and many others, including the participants
in Cognitive Science 200.
There are also several colleagues at other universities who have
helped us in our explorations. Indeed, the annual connectionist
workshops (the first of which resulted in the Hinton and Anderson
book) have been important opportunities to share our ideas and get
feedback on them from others from the connectionist community.
Those with whom one or both of us have interacted a great deal
include Bill Brewer, Neal Cohen, Al Collins, Billy Salter, Ed Smith, and
Walter Schneider. All of these people have contributed more or less
directly to the development of the ideas presented in this book.
An overlapping group of colleagues deserves credit for helping us
improve the book itself. Jim Anderson, Andy Barto, Larry Barsalou,
Chris Riesbeck, Walter Schneider, and Mark Seidenberg all read several
chapters of the book and sent useful comments and suggestions. Many
other people read and commented on individual chapters, and we are
book; her cheerful, thoughtful, and very careful assistance made the
production of the book run much more smoothly than we have had any
right to hope and allowed us to keep working on the content of some of
the chapters even as the final production was rolling forward on other
sections. Eileen Conway's assistance with graphics and formatting has
also been invaluable and we are very grateful to her as well. Mark Wallen kept the computers running, served as chief programming consultant and debugger par excellence, and tamed troff, the phototypesetter.
Without him we would never have gotten all the formatting to come
out right . Karol Lightner worked very hard toward the end of the
N00014-79-C-0323, NR 667-
The people behind both SDF and ONR deserve acknowledgment too .
The entire PDP enterprise owes a particular debt of gratitude to Charlie
Smith, formerly of SDF, who appreciated the appeal of parallel distributed processing very early on, understood our need for computing
resources, and helped provide the entire PDP research group with the
funds and encouragement needed to complete such a project. Henry
Halff, formerly of ONR, was also an early source of support,
encouragement, and direction. Charlie Smith has been succeeded by
Carl York, and Henry Halff has been succeeded by Susan Chipman,
Michael Shafto, and Harold Hawkins. We are grateful to all of these
people for their commitment to the completion of this book and to the
ongoing development of the ideas.
Several other sources have contributed to the support of individual
... under Grant ... for Human Information Processing.

JLM / DER
ADDRESSES OF THE PDP RESEARCH GROUP

Chisato Asanuma
Salk Institute, P.O. Box 85800, San Diego, CA 92138

Francis H. C. Crick
Salk Institute, P.O. Box 85800, San Diego, CA 92138

Jeffrey L. Elman
Department of Linguistics, University of California, San Diego, La Jolla, CA 92093

Geoffrey E. Hinton
Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213

Michael I. Jordan
Department of Computer and Information Science, University of Massachusetts, Amherst, MA 01003

Alan H. Kawamoto
Department of Psychology, Carnegie-Mellon University, Pittsburgh, PA 15213

James L. McClelland
Department of Psychology, Carnegie-Mellon University, Pittsburgh, PA 15213

Paul W. Munro

Donald A. Norman
University of California, San Diego, La Jolla, CA 92093

Daniel E. Rabin
Intellicorp, 1975 El Camino Real West, Mountain View, CA 94040

David E. Rumelhart
Institute for Cognitive Science, University of California, San Diego, La Jolla, CA 92093

Terrence J. Sejnowski
Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218

Paul Smolensky
Department of Computer Science, University of Colorado, Boulder, CO 80309

Gregory O. Stone
Center for Adaptive Systems, Department of Mathematics, Boston University, Boston, MA 02215

Ronald J. Williams
Institute for Cognitive Science, University of California, San Diego, La Jolla, CA 92093

David Zipser
Institute for Cognitive Science, University of California, San Diego, La Jolla, CA 92093
CHAPTER 1

The Appeal of Parallel Distributed Processing

J. L. McCLELLAND, D. E. RUMELHART, and G. E. HINTON
What makes people smarter than machines? They certainly are not
quicker or more precise. Yet people are far better at perceiving objects
in natural scenes and noting their relations, at understanding language
and retrieving contextually appropriate information from memory, at
making plans and carrying out contextually appropriate actions, and at a
wide range of other natural cognitive tasks. People are also far better at
learning to do these things more accurately and fluently through processing experience.
What is the basis for these differences? One answer, perhaps the
classic one we might expect from artificial intelligence, is "software." If
we only had the right computer program, the argument goes, we might
be able to capture the fluidity and adaptability of human information
processing.
Certainly this answer is partially correct. There have been great
breakthroughs in our understanding of cognition as a result of the
development of expressive high-level computer languages and powerful
algorithms . No doubt there will be more such breakthroughs in the
future . However, we do not think that software is the whole story.
In our view, people are smarter than today's computers because the
brain employs a basic computational architecture that is more suited to
deal with a central aspect of the natural information processing tasks
that people are so good at. In this chapter, we will show through examples that these tasks generally require the simultaneous consideration of
many pieces of information or constraints. Each constraint may be
imperfectly specified and ambiguous, yet each can play a potentially
decisive role in determining the outcome of processing. After examining these points, we will introduce a computational framework for
modeling cognitive processes that seems well suited to exploiting these
constraints and that seems closer than other frameworks to the style of
computation as it might be done by the brain. We will review several
early examples of models developed in this framework , and we will
show that the mechanisms these models employ can give rise to powerful emergent properties that begin to suggest attractive alternatives to
traditional accounts of various aspects of cognition . We will also show
that models of this class provide a basis for understanding how learning
can occur spontaneously, as a by-product of processing activity .
palm is facing the cup and the thumb and index finger are
below. The turning motion occurs just in time , as my hand
drops, to avoid hitting the coffee cup. My index finger and
thumb close in on the knob and grasp it, with my hand completely upside down.
these sentences, then, is determined in part by the semantic relations
that the constituents of the sentence might plausibly bear to one
another. Thus, the influences appear to run both ways, from the syntax to the semantics and from the semantics to the syntax.
In these examples, we see how syntactic considerations influence
semantic ones and how semantic ones influence syntactic ones. We
cannot say that one kind of constraint is primary.
Mutual constraints operate, not only between syntactic and semantic
processing, but also within each of these domains as well. Here we
consider an example from syntactic processing, namely, the assignment
of words to syntactic categories. Consider the sentences:
I like the joke.
I like the drive.
I like to joke.
I like to drive.
In this case it looks as though the words the and to serve to determine
whether the following word will be read as a noun or a verb. This , of
course, is a very strong constraint in English and can serve to force a
verb interpretation of a word that is not ordinarily used this way:
I like to mud.
FIGURE 2. Some ambiguous displays. The first one is from Selfridge, 1955. The
second line shows that three ambiguous characters can each constrain the identity of the
others. The third, fourth, and fifth lines show that these characters are indeed ambiguous in that they assume other identities in other contexts. (The ink-blot technique of
making letters ambiguous is due to Lindsay and Norman, 1972.)
until we have established the identities of the others. The resolution of
the paradox, of course, is simple. One of the different possible letters
in each position fits together with the others. It appears then that our
perceptual system is capable of exploring all these possibilities without
committing itself to one until all of the constraints are taken into
account.
However, it is important to bear in mind that most everyday situations cannot be rigidly assigned to just a single script. They generally
involve an interplay between a number of different sources of information.
Representations like scripts, frames, and schemata are useful structures for encoding knowledge, although we believe they only approximate the underlying structure of knowledge representation that emerges
from the class of models we consider in this book, as explained in
Chapter 14. Our main point here is that any theory that tries to
account for human knowledge using script-like knowledge structures
will have to allow them to interact with each other to capture the generative capacity of human understanding in novel situations. Achieving
such interactions has been one of the greatest difficulties associated
with models using script- or frame-based representations.
PARALLEL DISTRIBUTED PROCESSING
In the examples we have considered, a number of different pieces of
information must be kept in mind at once. Each plays a part, constraining others and being constrained by them. What kinds of
mechanisms seem well suited to these task demands? Intuitively, these
tasks seem to require mechanisms in which each aspect of the information in the situation can act on other aspects, simultaneously influencing other aspects and being influenced by them. To articulate these
intuitions, we and others have turned to a class of models we call Parallel Distributed Processing (PDP) models. These models assume that
information processing takes place through the interactions of a large
number of simple processing elements called units, each sending excitatory and inhibitory signals to other units. In some cases, the units
stand for possible hypotheses about such things as the letters in a particular display or the syntactic roles of the words in a particular sentence. In these cases, the activations stand roughly for the strengths
associated with the different possible hypotheses, and the interconnections among the units stand for the constraints the system knows to
exist between the hypotheses. In other cases, the units stand for possible goals and actions, such as the goal of typing a particular letter, or
the action of moving the left index finger, and the connections relate
goals to subgoals, subgoals to actions, and actions to muscle movements. In still other cases, units stand not for particular hypotheses or
goals, but for aspects of these things. Thus a hypothesis about the
identity of a word, for example, is itself distributed in the activations of
a large number of units.
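To make this style of computation concrete, here is a minimal sketch in Python. It is purely illustrative - the three units and the connection strengths are invented for this example, not taken from any model in the book - but it shows units passing excitatory (positive) and inhibitory (negative) signals over weighted connections.

```python
import numpy as np

# Hypothetical connection strengths: unit 0 excites unit 1 and inhibits unit 2.
W = np.array([[ 0.0, 0.0, 0.0],
              [ 0.8, 0.0, 0.0],
              [-0.8, 0.0, 0.0]])

a = np.array([1.0, 0.0, 0.0])                  # initial activations
for _ in range(10):                            # repeated local updates
    a = np.clip(a + 0.1 * (W @ a), -1.0, 1.0)  # net input nudges each unit
print(a)                                       # unit 1 driven up, unit 2 down
```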
The Microstructure of Cognition
The process of human cognition , examined on a time scale of
seconds and minutes , has a distinctly sequential character to it . Ideas
come, seem promising , and then are rejected; leads in the solution to a
problem are taken up, then abandoned and replaced with new ideas.
Though the process may not be discrete, it has a decidedly sequential
character, with transitions from state-to-state occurring, say, two or
three times a second. Clearly, any useful description of the overall
organization of this sequential flow of thought will necessarily describe
a sequence of states.
But what is the internal structure of each of the states in the
sequence, and how do they come about? Serious attempts to model
even the simplest macrosteps of cognition - say, recognition of single
words- require vast numbers of microsteps if they are implemented
sequentially. As Feldman and Ballard ( 1982) have pointed out , the
biological hardware is just too sluggish for sequential models of the
microstructure to provide a plausible account, at least of the
microstructure of human thought . And the time limitation only gets
worse, not better, when sequential mechanisms try to take large
numbers of constraints into account. Each additional constraint
requires more time in a sequential machine, and, if the constraints are
imprecise, the constraints can lead to a computational explosion. Yet
people get faster, not slower, when they are able to exploit additional
constraints.
Parallel distributed processing models offer alternatives to serial
models of the microstructure of cognition . They do not deny that there
is a macrostructure, just as the study of subatomic particles does not
deny the existence of interactions between atoms. What PDP models
do is describe the internal structure of the larger units , just as
subatomic physics describes the internal structure of the atoms that
form the constituents of larger units of chemical structure .
We shall show as we proceed through this book that the analysis of
the microstructure of cognition has important implications for most of
the central issues in cognitive science. In general, from the PDP point
of view, the objects referred to in macrostructural models of cognitive
processing are seen as approximate descriptions of emergent properties
of the microstructure . Sometimes these approximate descriptions may
be sufficiently accurate to capture a process or mechanism well enough;
but many times, we will argue, they fail to provide sufficiently elegant
or tractable accounts that capture the very flexibility and open-endedness of cognition that their inventors had originally intended to
capture. We hope that our analysis of PDP models will show how an
examination of the microstructure of cognition can lead us closer to an
adequate description of the real extent of human processing and learning capacities.
The development of PDP models is still in its infancy. Thus far the
models which have been proposed capture simplified versions of the
kinds of phenomena we have been describing rather than the full elaboration that these phenomena display in real settings. But we think
there have been enough steps forward in recent years to warrant a concerted effort at describing where the approach has gotten and where it
is going now, and to point out some directions for the future.
The first section of the book represents an introductory course in
parallel distributed processing. The rest of this chapter attempts to
describe in informal terms a number of the models which have been
proposed in previous work and to show that the approach is indeed a
fruitful one. It also contains a brief description of the major sources of
the inspiration we have obtained from the work of other researchers.
This chapter is followed, in Chapter 2, by a description of the quantitative framework within which these models can be described and examined. Chapter 3 explicates one of the central concepts of the book: distributed representation. The final chapter in this section, Chapter 4,
returns to the question of demonstrating the appeal of parallel
distributed processing models and gives an overview of our explorations
in the microstructure of cognition as they are laid out in the remainder
of this book.
EXAMPLES OF PDP MODELS
In what follows, we review a number of recent applications of PDP
models to problems in motor control, perception, memory, and
language. In many cases, as we shall see, parallel distributed processing
mechanisms are used to provide natural accounts of the exploitation of
multiple, simultaneous, and often mutual constraints. We will also see
that these same mechanisms exhibit emergent properties which lead to
novel interpretations of phenomena which have traditionally been interpreted in other ways.
Motor Control
models in this domain. These models have not developed far enough
to capture the full details of obstacle avoidance and multiple constraints
on reaching and grasping, but there have been applications to two problems with these characteristics.
High speed film of a good typist shows that the right hand moves up to
anticipate the typing of the u, even as the left hand is just beginning to
type the v. By the time the c is typed, the right index finger is in position over the u and ready to strike it.
When two successive key strokes are to be typed with the fingers of
the same hand, concurrent preparation to type both can result in similar
or conflicting instructions to the fingers and/or the hand. Consider, in
this light, the difference between the sequence ev and the sequence er.
The first sequence requires the typist to move up from home row to
type the e and to move down from the home row to type the v, while
in the second sequence, both the e and the r are above the home row.
The hands take very different positions in these two cases. In the
first case, the hand as a whole stays fairly stationary over the home
row. The middle finger moves up to type the e, and the index finger
moves down to type the v. In the second case, the hand as a whole
moves up, bringing the middle finger over the e and the index finger
over the r. Thus, we can see that several letters can simultaneously
influence the positioning of the fingers and the hands.
From the point of view of optimizing the efficiency of the typing
motion, these different patterns seem very sensible. In the model of
typing proposed by Rumelhart and Norman (1982), an attempt to type
a word caused activation of a unit for that word. The word unit, in
turn, activated units for each of the letters in the word, and the unit
for the first letter to be typed was arranged to inhibit the units for all
the letters following it, the unit
for the second to inhibit the third and following letters, and so on. As
a result of the interplay of activation and inhibition among these units ,
the unit for the first letter was at first the most strongly active, and the
units for the other letters were partially activated.
Each letter unit exerts influences on the hand and finger involved in
typing the letter . The v unit , for example, tends to cause the index
finger to move down and to cause the whole hand to move down with
it . The e unit , on the other hand, tends to cause the middle finger on
the left hand to move up and to cause the whole hand to move up also.
The r unit also causes the left index finger to move up and the left
hand to move up with it .
The extent of the influences of each letter on the hand and finger it
directs depends on the extent of the activation of the letter. Therefore,
at first, in typing the word very, the v exerts the greatest control.
FIGURE 4. The interaction of activations in typing the word very. The very unit is
activated from outside the model. It in turn activates the units for each of the component letters. Each letter unit specifies the target finger positions, specified in a keyboard coordinate system. L and R stand for the left and right hands, and I and M for the
index and middle fingers. The letter units receive information about the current finger
position from the response system. Each letter unit inhibits the activation of all letter
units that follow it in the word; inhibitory connections are indicated by the lines with
solid dots at their terminations. (From "Simulating a Skilled Typist: A Study of Skilled
Cognitive-Motor Performance" by D. E. Rumelhart and D. A. Norman, 1982, Cognitive
Science, 6, p. 12. Copyright 1982 by Ablex Publishing. Reprinted by permission.)
Because the e and r are simultaneously pulling the hand up, though, the
v is typed primarily by moving the index finger, and there is little
movement of the whole hand.
Once a finger is within a certain striking distance of the key to be
typed, the actual pressing movement is triggered, and the keypress
occurs. The keypress itself causes a strong inhibitory signal to be sent
to the unit for the letter just typed, thereby removing this unit from the
picture and allowing the unit for the next letter in the word to become
the most strongly activated.
This mechanism provides a simple way for all of the letters to jointly
determine the successive configurations the hand will enter into in the
process of typing a word. This model has shown considerable success
predicting the time between successive keystrokes as a function of the
different keys involved . Given a little noise in the activation process, it
can also account for some of the different kinds of errors that have
been observed in transcription typing .
The typing model represents an illustration of the fact that serial
behavior - a succession of key strokes - is not necessarily the result of
an inherently serial processing mechanism. In this model, the sequential structure of typing emerges from the interaction of the excitatory
and inhibitory influences among the processing units .
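The serial-order mechanism just described can be caricatured in a few lines of Python. The sketch below is our construction, not Rumelhart and Norman's simulation: the word unit's effect is summarized as an activation gradient over the letter units (the residue of the cascade of forward inhibition), and each keypress strongly inhibits the unit for the letter just typed, letting the next letter become most active.

```python
word = "very"
n = len(word)

# Forward inhibition leaves the first letter most active, the second next, etc.
act = [1.0 - 0.2 * i for i in range(n)]

typed = []
while len(typed) < n:
    i = max(range(n), key=lambda j: act[j])  # most active letter is struck next
    typed.append(word[i])
    act[i] = float("-inf")                   # the keypress shuts that unit off
print("".join(typed))                        # -> very
```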
Reaching for an object without falling over. Similar mechanisms
can be used to model the process of reaching for an object without losing one's balance while standing, as Hinton (1984) has shown. He considered a simple version of this task using a two-dimensional "person"
with a foot, a lower leg, an upper leg, a trunk, an upper arm, and a
lower arm. Each of these limbs is joined to the next at a joint which
has a single degree of rotational freedom. The task posed to this person is to reach a target placed somewhere in front of it, without taking
any steps and without falling down. This is a simplified version of the
situation in which a real person has to reach out in front for an object
placed somewhere in the plane that vertically bisects the body. The
task is not as simple as it looks, since if we just swing an arm out in
front of ourselves, it may shift our center of gravity so far forward that
we will lose our balance. The problem, then, is to find a set of joint
angles that simultaneously solves the two constraints on the task. First,
the tip of the forearm must touch the object. Second, to keep from
falling down, the person must keep its center of gravity over the foot.
To do this, Hinton assigned a single processor to each joint. On each
computational cycle, each processor received information about how far
the tip of the hand was from the target and where the center of gravity
was with respect to the foot . Using these two pieces of information ,
each joint adjusted its angle so as to approach the goals of maintaining
balance and bringing the tip closer to the target. After a number of
iterations , the stick-person settled on postures that satisfied the goal of
reaching the target and the goal of maintaining the center of gravity
over the "feet."
Though the simulation was able to perform the task, eventually satisfying both goals at once, it had a number of inadequacies stemming
from the fact that each joint processor attempted to achieve a solution
in ignorance of what the other joints were attempting to do. This problem was overcome by using additional processors responsible for setting
combinations of joint angles. Thus, a processor for flexion and extension of the leg would adjust the knee, hip , and ankle joints synergistically, while a processor for flexion and extension of the arm would
adjust the shoulder and elbow together. With the addition of processors of this form , the number of iterations required to reach a solution
was greatly reduced, and the form of the approach to the solution
looked very natural. The sequence of configurations attained in one
processing run is shown in Figure 5.
Explicit attempts to program a robot to cope with the problem of
maintaining balance as it reaches for a desired target have revealed the
difficulty of deriving explicitly the right combinations of actions for
each possible starting state and goal state. This simple model illustrates
that we may be wrong to seek such an explicit solution . We see here
that a solution to the problem can emerge from the action of a number
of simple processors each attempting to honor the constraints
independently.
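A toy version of this relaxation scheme, in Python, may make the idea concrete. It is our sketch, not Hinton's (1984) program: it uses a two-segment arm, drops the balance constraint for brevity, and picks the step size and iteration count arbitrarily. Each joint processor independently nudges its own angle in whichever direction brings the tip closer to the target, and a solution emerges from these independent local adjustments.

```python
import math

def tip(a1, a2):
    """Forward kinematics of a two-segment arm with unit-length links."""
    x1, y1 = math.cos(a1), math.sin(a1)
    return x1 + math.cos(a1 + a2), y1 + math.sin(a1 + a2)

def dist2(a1, a2, target):
    tx, ty = tip(a1, a2)
    return (tx - target[0]) ** 2 + (ty - target[1]) ** 2

target = (1.2, 0.8)            # an arbitrary reachable point
a1 = a2 = 0.0
for _ in range(300):
    for joint in (0, 1):       # each joint adjusts itself independently
        for delta in (-0.01, 0.01):
            c1, c2 = (a1 + delta, a2) if joint == 0 else (a1, a2 + delta)
            if dist2(c1, c2, target) < dist2(a1, a2, target):
                a1, a2 = c1, c2
print(tip(a1, a2))             # settles close to (1.2, 0.8)
```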
Perception
Stereoscopic vision. One early model using parallel distributed processing was the model of the perception of depth in random-dot stereograms proposed by Marr and Poggio (1976). Their theory proposed to
explain the perception of depth in Julesz's (1971) random-dot stereograms in terms of a simple parallel distributed processing mechanism.
Julesz's random-dot stereograms present an interesting challenge to
simple theories of depth perception. Each of the two views of such a
stereogram consists of a pattern of random dots, and one is an exact
copy of the other except that one region of the dots is shifted horizontally with respect to the corresponding dots in the other view (see Figure 3). Considered alone, either pattern contains no information to
indicate the presence of two surfaces, let alone the depth relations
among those surfaces. Yet when one pattern is projected to the left eye
and the other to the right eye, the observer sees the shifted region as a
surface hovering in front of or behind the rest of the pattern, depending on the direction of the shift.

FIGURE 3. Random-dot stereograms. The two patterns are identical except that the
pattern of dots in the central region of the left pattern is shifted over with respect to
those in the right. When viewed stereoscopically, so that the left pattern projects to the
left eye and the right pattern to the right eye, the shifted area appears to hover above the
page. Some readers may be able to achieve this by converging to a distant point (e.g., a
far wall) and then interposing the figure into the line of sight. (From Foundations of
Cyclopean Perception by B. Julesz, 1971, Chicago: University of Chicago Press. Copyright
1971 by Bell Telephone Laboratories, Inc. Reprinted by permission.)
What kind of a mechanism might we propose to account for these
facts? Marr and Poggio (1976) began by explicitly representing the two
views in two arrays, as human observers might in two different retinal
images. They noted that corresponding black dots at different perceived distances from the observer will be offset from each other by
different amounts in the two views. The job of the model is to determine which points correspond. This task is, of course, made difficult
by the fact that there will be a very large number of spurious
correspondences of individual dots. The goal of the mechanism, then,
is to find those correspondences that represent real correspondences in
depth and suppress those that represent spurious correspondences.
To carry out this task, Marr and Poggio assigned a processing unit to
each possible conjunction of a point in one image and a point in the
other. Since the eyes are offset horizontally, the possible conjunctions
occur at various offsets or disparities along the horizontal dimension.
Thus, for each point in one eye, there was a set of processing units
with one unit assigned to the conjunction of that point and the point at
each horizontal offset from it in the other eye.
Each processing unit received activation whenever both of the points
the unit stood for contained dots. So far, then, units for both real and
spurious correspondences would be equally activated. To allow the
mechanism to find the right correspondences, they pointed out two
general principles about the visual world: (a) Each point in each view
generally corresponds to one and only one point in the other view, and
(b) neighboring points in space tend to be at nearly the same depth and
therefore at about the same disparity in the two images. While there
are discontinuities at the edges of things, over most of a two-dimensional view of the world there will be continuity. These principles are called the uniqueness and continuity constraints, respectively.
Marr and Poggio incorporated these principles into the interconnections between the processing units. The uniqueness constraint was captured by inhibitory connections among the units that stand for alternative correspondences of the same dot. The continuity principle was
captured by excitatory connections among the units that stand for similar offsets of adjacent dots.
These additional connections allow the Marr and Poggio model to
"solve" stereograms like the one shown in the figure. At first, when a
pair of patterns is presented, the units for all possible correspondences
of a dot in one eye with a dot in the other will be equally excited.
However, the excitatory connections cause the units for the correct
conjunctions to receive more excitation than units for spurious conjunctions, and the inhibitory connections allow the units for the correct
conjunctions to turn off the units for the spurious connections. Thus,
the model tends to settle down into a stable state in which only the
correct correspondence of each dot remains active.
There are a number of reasons why Marr and Poggio (1979) modified this model (see Marr, 1982, for a discussion), but the basic
mechanisms of mutual excitation between units that are mutually consistent and mutual inhibition between units that are mutually incompatible provide a natural mechanism for settling on the right conjunctions
of points and rejecting spurious ones. The model also illustrates how
general principles or rules such as the uniqueness and continuity principles may be embodied in the connections between processing units, and
how behavior in accordance with these principles can emerge from the
interactions determined by the pattern of these interconnections.
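The cooperative scheme lends itself to a compact sketch. The Python below is our miniature, not Marr and Poggio's program, and its sizes, rates, and initial evidence are arbitrary; but it embodies the two constraints as stated: inhibition among units for alternative disparities of the same dot (uniqueness) and excitation among units for the same disparity at neighboring positions (continuity).

```python
import numpy as np

np.random.seed(0)
n_pos, n_disp = 20, 5                      # dot positions x candidate disparities
act = 0.1 * np.random.rand(n_pos, n_disp)  # weak noise: spurious correspondences
act[:, 2] += 0.2                           # slightly stronger evidence at disparity 2

for _ in range(30):
    support = np.zeros_like(act)
    support[1:, :] += act[:-1, :]          # continuity: neighbors, same disparity
    support[:-1, :] += act[1:, :]
    rivals = act.sum(axis=1, keepdims=True) - act  # uniqueness: same position
    act = np.clip(act + 0.10 * support - 0.15 * rivals, 0.0, 1.0)

print(act.argmax(axis=1))                  # settles on disparity 2 everywhere
```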
Perceptual completion of familiar patterns. Perception, of course, is
influenced by familiarity. It is a well-known fact that we often misperceive unfamiliar objects as more familiar ones and that we can get by
with less time or with lower-quality information in perceiving familiar
items than we need for perceiving unfamiliar items. Not only does
familiarity help us determine what the higher-level structures are when
the lower-level information is ambiguous; it also allows us to fill in
missing lower-level information within familiar higher-order patterns.
The well-known phonemic restoration effect is a case in point. In this
phenomenon, perceivers hear sounds that have been cut out of words
as if they had actually been present. For example, Warren (1970)
presented legi#lature to subjects, with a click in the location marked by
the #. Not only did subjects correctly identify the word legislature;
they also heard the missing /s/ just as though it had been presented.
They had great difficulty localizing the click, which they tended to hear
as a disembodied sound. Similar phenomena have been observed in
visual perception of words since the work of Pillsbury (1897).
Two of us have proposed a model describing the role of familiarity in
perception based on excitatory and inhibitory interactions among units
standing for various hypotheses about the input at different levels of
abstraction (McClelland & Rumelhart, 1981; Rumelhart & McClelland,
1982). The model has been applied in detail to the role of familiarity
in the perception of letters in visually presented words, and has proved
to provide a very close account of the results of a large number of
experiments.
The model assumes that there are units that act as detectors for the
visual features which distinguish letters, with one set of units assigned
to detect the features in each of the different letter -positions in the
word. For four -letter words, then , there are four such sets of detectors.
There are also four sets of detectors for the letters themselves and a set
of detectors for the words.
In the model, each unit has an activation value, corresponding
roughly to the strength of the hypothesis that what that unit stands for
is present in the perceptual input. The model honors the following
important relations which hold between these "hypotheses" or activations: First , to the extent that two hypotheses are mutually consistent,
they should support each other . Thus, units that are mutually consistent, in the way that the letter T in the first position is consistent
with the word TAKE, tend to excite each other . Second, to the extent
that two hypotheses are mutually inconsistent, they should weaken each
other . Actually , we can distinguish two kinds of inconsistency: The
first kind might be called between-level inconsistency. For example,
the hypothesis that a word begins with a T is inconsistent with the
hypothesis that the word is MOVE. The second might be called mutual
exclusion. For example, the hypothesis that a word begins with T
excludes the hypothesis that it begins with R since a word can only
begin with one letter . Both kinds of inconsistencies operate in the word
perception model to reduce the activations of units. Thus, the letter
units in each position compete with all other letter units in the same
position , and the word units compete with each other . This type of
inhibitory interaction is often called competitive inhibition. In addition,
there are inhibitory interactions between incompatible units on different
levels. This type of inhibitory interaction is simply called
between-level inhibition.
The set of excitatory and inhibitory interactions between units can be
diagrammed by drawing excitatory and inhibitory links between them .
The whole picture is too complex to draw, so we illustrate only with a
fragment: Some of the interactions between some of the units in this
model are illustrated in Figure 7.
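A drastically reduced sketch of these interactions, again ours rather than the actual simulation, may help fix ideas. Two word units, for WORD and WORK, receive support from letter hypotheses and compete with each other; the bottom-up evidence is set up as in the display discussed next, with R and K equally supported in the fourth position, and top-down feedback from the winning word then favors one of the two letters.

```python
words = ["WORD", "WORK"]
# Bottom-up evidence (invented values): W, O, R are clear; position 4
# supports R and K equally, and D not at all.
letters = {(0, "W"): 1.0, (1, "O"): 1.0, (2, "R"): 1.0,
           (3, "R"): 0.5, (3, "K"): 0.5}
word_act = {w: 0.0 for w in words}

for _ in range(20):
    new = {}
    for w in words:
        support = sum(letters.get((i, l), 0.0) for i, l in enumerate(w))
        rivals = sum(a for v, a in word_act.items() if v != w)  # competition
        new[w] = min(max(word_act[w] + 0.10 * support - 0.35 * rivals, 0.0), 1.0)
    word_act = new                           # synchronous word-level update

for w in words:                              # top-down feedback to the letters
    for i, l in enumerate(w):
        letters[(i, l)] = letters.get((i, l), 0.0) + 0.3 * word_act[w]

print(word_act)                              # WORK suppresses WORD
print(letters[(3, "K")], letters[(3, "R")])  # K now dominates R in position 4
```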
Let us consider what happens in a system like this when a familiar
stimulus is presented under degraded conditions . For example, consider the display shown in Figure 8. This display consists of the letters
W, O, and R, completely visible, and enough of a fourth letter to rule
out all letters other than R and K. Before onset of the display, the
activations of the units are set at or below 0. When the display is
presented, detectors for the features present in each position become
active (i.e., their activations grow above 0). At this point, they begin to
excite and inhibit the corresponding detectors for letters. In the first
three positions, W, O, and R are unambiguously activated, so we will
focus our attention on the fourth position, where R and K are both
equally consistent with the active features. Here, the activations of the
detectors for R and K start out growing together, as the feature detectors below them become activated. As these detectors become active,
they and the active letter detectors for W, O, and R in the other positions start to activate detectors for words which have these letters in
perception of letters in unfamiliar letter strings which are word-like but
not themselves actually familiar.
One way of accounting for such performances is to imagine that the
perceiver possesses, in addition to detectors for familiar words , sets of
detectors for regular subword units such as familiar letter clusters , or
that they use abstract rules , specifying which classes of letters can go
with which others in different contexts. It turns out, however, that the
model we have described needs no such additional structure to produce
perceptual facilitation of word-like letter strings. What happens is
that, when such a string is shown, a number of words sharing letters
with it become partially activated as a result of the pattern of activation
at the letter level.
While they compete with each other , none of these words gets strongly
enough activated to completely suppress all the others . Instead , these
units act as a group to reinforce particularly the letters E and A . There
are no close partial matches which include the letter F in the second
position , so this letter receives no feedback support . As a result , E
comes to dominate , and eventually suppress , the F in the second
position .
The fact that the word perception model exhibits perceptual facilitation to pronounceable nonwords as well as words illustrates once again
how behavior in accordance with general principles or rules can emerge
from the interactions of simple processing elements . Of course , the
behavior of the word perception model does not implement exactly any
of the systems of orthographic rules that have been proposed by
linguists (Chomsky & Halle, 1968; Venezky, 1970) or psychologists
(Spoehr & Smith, 1975). In this regard, it only approximates such
rule-based descriptions of perceptual processing. However, rule systems such as Chomsky and Halle's or Venezky's appear to be only
approximately honored in human performance as well (Smith & Baker,
1976). Indeed, some of the discrepancies between human performance
data and rule systems occur in exactly the ways that we would predict
from the word perception model (Rumelhart & McClelland, 1982).
This illustrates the possibility that PDP models may provide more
accurate accounts of the details of human performance than models
FIGURE 9. Activation at the word level and the letter level as a function of time.
Retrieving Information From Memory
FIGURE 10. Characteristics of a number of individuals belonging to two gangs, the Jets
and the Sharks. Each individual is listed with his name, gang, age (20's, 30's, or 40's),
education, marital status (single, married, or divorced), and occupation (burglar, bookie,
or pusher).
of the units needed to represent this information is shown in Figure 11.
In this network , there is an " instance unit " for each of the characters
described in Figure 10, and that unit is linked by mutually excitatory
connections to all of the units for the fellow 's properties. Note that we
have included property units for the names of the characters, as well as
units for their other properties.
Now, suppose we wish to retrieve the properties of a particular individual, say Lance. And suppose that we know Lance's name. Then we
can probe the network by activating Lance's name unit, and we can see
what pattern of activation arises as a result. Assuming that we know of
no one else named Lance, we can expect the Lance name unit to be
hooked up only to the instance unit for Lance. This will in turn
activate the property units for Lance, thereby creating the pattern of
activation corresponding to Lance. In effect, we have retrieved a
representation of Lance. More will happen than just what we have
described so far, but for the moment let us stop here.
Of course, we do not always probe memory with a name; sometimes
we wish to retrieve an individual from some of his other properties. If
the set of property units we activate is sufficient to characterize a particular individual uniquely, it will activate the instance unit for that
individual, and that unit will in turn fill in the remaining properties,
including the name.
Graceful degradation. A few of the desirable properties of this kind
of model are visible from considering what happens as we vary the set
of features we use to probe the memory in an attempt to retrieve a particular individual's name. Any set of features which is sufficient to
uniquely characterize a particular item will activate the instance node
for that item more strongly than any other instance node. A probe
which contains misleading features will most strongly activate the node
that it matches best. This will be a poorer cue than one which contains
no misleading information, but it will still activate the "right" instance
node more strongly than any other, as long as the misleading information does not make the probe match some other node better. Thus,
while the degree of activation of the instance node and of the
corresponding name node varies as a function of the exact content of
the probe, errors in the probe will not be fatal, and incomplete probes
will still work. The model requires no special error-recovery scheme to
handle wrong or partial cues; graceful degradation of this kind is a
natural by-product of the nature of the retrieval mechanism.
These aspects of the model's behavior deserve more detailed consideration than we have space for here. One reason we do not go into
them is that we view this model as a stepping stone in the development
of other models, such as the models using more distributed representations, which we examine in other parts of this book. We do, however,
have more to say about this simple model, because it exhibits other
useful properties as well.
Default assignment. It probably will have occurred to the reader
that in many of the situations we have been considering, there will be
other units partially activated along with those the probe picks out, and
that these can influence the pattern of activation that is retrieved. This
is just what happens, and the effect is often a desirable one. Suppose,
for example, that we probe memory by asking, in effect, "What do you
know of a Shark in his 20s?" It turns out that there is a single Shark in
his 20s, Ken. So, when we activate these two properties, we will
activate Ken, and this in turn will activate his name unit and fill in his
other properties as well, effectively retrieving a description of Ken.
The same mechanism also lets the model retrieve what is common to
all the individuals that match a probe, spontaneously generalizing over
them.
Thus, if we want to know, for example, what people in their 20s with a
junior high school education are like, we can probe the model by
activating these two units. Since all such people are Jets and Burglars,
these two units are strongly activated by the model in this case; two of
them are divorced and two are married, so both of these units are partially activated.1
The sort of model we are considering, then, is considerably more
than a content addressable memory. In addition, it performs default
assignment, and it can spontaneously retrieve a general concept of the
individuals that match any specifiable probe. These properties must be
explicitly implemented as complicated computational extensions of
other models of knowledge retrieval, but in PDP models they are
natural by-products of the retrieval process itself.
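The retrieval behavior described in this section can be sketched in miniature. The Python below is illustrative only: it collapses name units into the property set, includes just three individuals with properties abridged in the spirit of Figure 10, and invents all the rates. Even so, probing the mutually excitatory instance-property network with "Shark" and "20s" settles on Ken, with no retrieval machinery beyond the spread of activation.

```python
people = {   # three individuals, properties abridged in the spirit of Figure 10
    "Lance": ["Jet", "20s", "JH", "Married", "Burglar"],
    "Ken":   ["Shark", "20s", "HS", "Single", "Burglar"],
    "Art":   ["Jet", "40s", "JH", "Single", "Pusher"],
}
props = {p for ps in people.values() for p in ps}

def settle(clamped, steps=20):
    act = {p: (1.0 if p in clamped else 0.0) for p in props}
    inst = dict.fromkeys(people, 0.0)
    for _ in range(steps):
        for name, ps in people.items():                 # property -> instance
            support = sum(act[p] for p in ps)
            rivals = sum(v for n, v in inst.items() if n != name)
            inst[name] = min(max(inst[name] + 0.2 * support - 0.5 * rivals, 0.0), 1.0)
        for p in props:                                 # instance -> property
            if p not in clamped:
                act[p] = min(0.5 * sum(inst[n] for n, qs in people.items() if p in qs), 1.0)
    return inst

result = settle({"Shark", "20s"})    # probe: "a Shark in his 20s"
print(max(result, key=result.get))   # -> Ken
```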
REPRESENTATION AND LEARNING IN PDP MODELS
In the Jets and Sharks model, we can speak of the model's active
representation at a particular time, and associate this with the pattern of
activation over the units in the system. We can also ask: What is the
stored knowledge that gives rise to that pattern of activation? In considering this question, we see immediately an important difference
between PDP models and other models of cognitive processes. In most
models, knowledge is stored as a static copy of a pattern. Retrieval
amounts to finding the pattern in long-term memory and copying it into
a buffer or working memory. There is no real difference between the
stored representation in long-term memory and the active representation in working memory. In PDP models, though, this is not the case.
In these models, the patterns themselves are not stored. Rather, what
is stored is the connection strengths between units that allow these patterns to be re-created. In the Jets and Sharks model, there is an
instance unit assigned to each individual, but that unit does not contain
a copy of the representation of that individual. Instead, it is simply the
case that the connections between it and the other units in the system
are such that activation of the unit will cause the pattern for the
individual to be reinstated on the property units.
1 In this and all other cases, there is a tendency for the pattern of activation to be influenced by partially activated, near neighbors, which do not quite match the probe. Thus,
in this case, there is a Jet, Al, who is a Married Burglar. The unit for Al gets slightly
activated, giving Married a slight edge over Divorced in the simulation.
This difference between PDP models and conventional models has
enormous implications, both for the processing of knowledge and for
its acquisition.

with considerable innate knowledge about a domain, and/or some starting set of primitive propositional representations, then formulate
hypothetical general rules, e.g., by comparing particular cases and formulating explicit generalizations.
The approach that we take in developing PDP models is completely
different. First, we do not assume that the goal of learning is the formulation of explicit rules. Rather, we assume it is the acquisition of
connection strengths which allow a network of simple units to act as
though it knew the rules. Second, we do not attribute powerful computational capabilities to the learning mechanism. Rather, we assume
very simple connection strength modulation mechanisms which adjust
the strength of connections between units based on information locally
available at the connection.
FIGURE 12. A simple pattern associator. The example assumes that patterns of activation in the A units can be produced by the visual system and patterns in the B units can
be produced by the olfactory system. The synaptic connections allow the outputs of the
A units to influence the activations of the B units. The synaptic weights linking the A
units to the B units were selected so as to allow the pattern of activation shown on the A
units to reproduce the pattern of activation shown on the B units without the need for
any olfactory input .
the A units are produced upon viewing a rose or a grilled steak, and
alternative patterns on the B units are produced upon sniffing the same
objects. Figure 13 shows two pairs of patterns, as well as sets of interconnections necessary to allow the A member of each pair to reproduce
the B member.
The details of the behavior of the individual units vary among different versions of pattern associators. For present purposes, we'll
assume that the units can take on positive or negative activation values,
with 0 representing a kind of neutral intermediate value. The strengths
of the interconnections between the units can be positive or negative
real numbers.
The effect of an A unit on a B unit is determined by multiplying the
activation of the A unit times the strength of its synaptic connection
with the B unit . For example, if the connection from a particular A
unit to a particular B unit has a positive sign, when the A unit is
excited (activation greater than 0), it will excite the B unit. For this
example, we'll simply assume that the activation of each unit is set to
the sum of the excitatory and inhibitory effects operating on it. This is
one of the simplest possible cases.
Suppose, now, that we have created on the A units the pattern
corresponding to the first visual pattern shown in Figure 13, the rose.
How should we arrange the strengths of the interconnections between
the A units and the B units to reproduce the pattern corresponding to
the aroma of a rose? We simply need to arrange for each A unit to
tend to excite each B unit which has a positive activation in the aroma
pattern and to inhibit each B unit which has a negative activation in the
aroma pattern. It turns out that this goal is achieved by setting the
strength of the connection between a given A unit and a given B unit
to a value proportional to the product of the activation of the two units .
In Figure 12, the weights on the connections were chosen to allow the
A pattern illustrated there to produce the illustrated B pattern according
to this principle . The actual strengths of the connections were set to
±.25, rather than ±1, so that the A pattern will produce the right magnitude, as well as the right sign, for the activations of the units in the B pattern. The same connections are reproduced in matrix form in Figure 13A.
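To make the prescription concrete, here is a small sketch in Python (with numpy); the pattern values are illustrative stand-ins, not the exact values of Figure 13, and the code is ours rather than anything given in the text.

    import numpy as np

    # Hypothetical "sight" (A) and "smell" (B) patterns of +1/-1 values;
    # the actual patterns appear in Figure 13.
    a_rose = np.array([+1.0, -1.0, -1.0, +1.0])
    b_rose = np.array([-1.0, +1.0, -1.0])

    # Each weight is proportional to the product of the activations of the
    # two units it connects; with four A units the factor .25 gives the
    # right magnitudes as well as the right signs.
    W = np.outer(b_rose, a_rose) * 0.25

    print(W @ a_rose)        # exactly reproduces b_rose: [-1.  1. -1.]

    a_partial = a_rose.copy()
    a_partial[2] = 0.0       # degrade the input: one element set to 0
    print(W @ a_partial)     # same signs, weaker: [-0.75  0.75 -0.75]

The degraded-input behavior in the last two lines is the graceful fall-off described in the next paragraph.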
Pattern associators like the one in Figure 12 have a number of nice
properties. One is that they do not require a perfect copy of the input
to produce the correct output, though the output will be weaker in this case. For example, suppose that the associator shown in Figure 12 were presented with an A pattern of (1, -1, 0, 1). This is the A pattern shown in the figure, with the activation of one of its elements set to 0. The B pattern produced in response will have the activations of all of the B units in the right direction; however, they will be somewhat weaker than they would be, had the complete A pattern been shown. Similar effects occur with other incomplete or distorted versions of the input pattern.

The connection strengths needed by such a pattern associator need not be wired in by hand; they can be acquired through a simple learning rule, the Hebb rule, under which the strength of the connection between two units is adjusted in proportion to the product of their activations.
It is very important to note that the information needed to use the
Hebb rule to determine the value each connection should have is locally
available at the connection. All a given connection needs to consider is
the activation of the units on both sides of it . Thus, it would be possible to actually implement such a connection modulation scheme locally,
in each connection, without requiring any programmer to reach into
each connection and set it to just the right value.
It turns out that the Hebb rule as stated here has some serious limitations, and, to our knowledge, no theorists continue to use it in this simple form. More sophisticated connection modulation schemes have been proposed by other workers; most important among these are the delta rule, discussed extensively in Chapters 8 and 11; the competitive learning rule, discussed in Chapter 5; and the rules for learning in stochastic parallel models, described in Chapters 6 and 7. All of these
learning rules have the property that they adjust the strengths of connections between units on the basis of information that can be assumed
to be locally available to the unit . Learning , then , in all of these cases,
amounts to a very simple process that can be implemented locally at
each connection without the need for any overall supervision. Thus,
models which incorporate these learning rules train themselves to have
the right interconnections in the course of processing the members of
an ensemble of patterns.
Learning multiple patterns in the same set of interconnections. Up
to now, we have considered how we might teach our pattern associator
to associate the visual pattern for one object with a pattern for the
aroma of the same object. Obviously , different patterns of interconnections between the A and B units are appropriate for causing the visual
pattern for a different object to give rise to the pattern for its aroma.
The same principles apply, however, and if we presented our pattern associator with the A and B patterns for steak, it would learn the right set of interconnections for that case instead (these are shown in Figure
13B) . In fact, it turns out that we can actually teach the same pattern
associator a number of different associations. The matrix representing
the set of interconnections that would be learned if we taught the same
pattern associator both the rose association and the steak association is
shown in Figure 14. The reader can verify this by adding the two
matrices for the individual patterns together. The reader can also verify
that this set of connections will allow the rose A pattern to produce the
rose B pattern, and the steak A pattern to produce the steak B pattern:
when either input pattern is presented, the correct corresponding output
is produced.
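A minimal sketch of this superposition property, in Python with numpy (our illustration; the patterns are hypothetical, chosen to be uncorrelated as the text requires):

    import numpy as np

    def hebb(b, a):
        # One association's matrix: weights proportional to activation products.
        return np.outer(b, a) * 0.25

    a_rose,  b_rose  = np.array([+1., -1., -1., +1.]), np.array([-1., +1., -1.])
    a_steak, b_steak = np.array([+1., +1., -1., -1.]), np.array([+1., +1., +1.])
    assert a_rose @ a_steak == 0      # completely uncorrelated, as in the text

    W = hebb(b_rose, a_rose) + hebb(b_steak, a_steak)   # add the two matrices

    print(W @ a_rose)    # -> [-1.  1. -1.]  (the rose B pattern)
    print(W @ a_steak)   # -> [ 1.  1.  1.]  (the steak B pattern)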
FIGURE 14. The weights in the third matrix allow either A pattern shown in Figure 13 to recreate the corresponding B pattern. Each weight in this case is equal to the sum of the corresponding weights for the rose association and for the steak association, as illustrated.

The examples used here have the property that the two different visual patterns are completely uncorrelated with each other. This being
the case, the rose pattern produces no effect when the interconnections
for the steak have been established, and the steak pattern produces no
effect when the interconnections for the rose association are in effect .
For this reason, it is possible to add together the pattern of interconnections for the rose association and the pattern for the steak association, and still be able to associate the sight of the steak with the smell
of a steak and the sight of a rose with the smell of a rose. The two sets
of interconnections do not interact at all.
One of the limitations of the Hebbian learning rule is that it can
learn the connection strengths appropriate to an entire ensemble of patterns only when all the patterns are completely uncorrelated. This
restriction does not , however, apply to pattern associators which use
more sophisticated learning schemes.
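The difference is easy to demonstrate. In the sketch below (ours, not the authors'; the delta rule itself is developed in Chapters 8 and 11), the two input patterns are correlated, so Hebbian superposition produces cross-talk, while repeated error-correcting presentations do not:

    import numpy as np

    # Correlated input patterns (nonzero dot product) and their targets.
    a1, t1 = np.array([+1., +1., +1., -1.]), np.array([+1., -1.])
    a2, t2 = np.array([+1., +1., +1., +1.]), np.array([-1., +1.])
    assert a1 @ a2 != 0

    # Hebbian superposition: each association contaminates the other.
    W_hebb = (np.outer(t1, a1) + np.outer(t2, a2)) * 0.25
    print(W_hebb @ a1)               # [ 0.5 -0.5], not the target [ 1. -1.]

    # Delta rule: adjust weights by the remaining error, over many presentations.
    W, eta = np.zeros((2, 4)), 0.1
    for _ in range(500):
        for a, t in ((a1, t1), (a2, t2)):
            W += eta * np.outer(t - W @ a, a)
    print(W @ a1, W @ a2)            # both now close to their targets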
Extracting the structure of an ensemble of patterns . The fact that
similar patterns tend to produce similar effects allows distributed
models to exhibit a kind of spontaneous generalization, extending
behavior appropriate for one pattern to other similar patterns. This
property is shared by other PDP models, such as the word perception
model and the Jets and Sharks model described above~ the main differ ence here is in the existence of simple, local, learning mechanisms that
can allow the acquisition of the connection strengths needed to produce
these generalizations through experience with members of the ensemble of patterns. Distributed models have another interesting property
as well: If there are regularities in the correspondences between pairs
of patterns, the model will naturally extract these regularities. This
property allows distributed
models to acquire patterns of
interconnections that lead them to behave in ways we ordinarily take as
evidence for the use of linguistic rules.
A detailed example of such a model is described in Chapter 18.
Here, we describe the model very briefly . The model is a mechanism
that learns how to construct the past tenses of words from their root
forms through repeated presentations of examples of root forms paired
with the corresponding past-tense form . The model consists of two
pools of units . In one pool, patterns of activation representing the phonological structure of the root form of the verb can be represented,
and, in the other , patterns representing the phonological structure of
the past tense can be represented. The goal of the model is simply to
learn the right connection strengths between the root units and the
past-tense units , so that whenever the root form of a verb is presented
the model will construct the corresponding past-tense form . The model
is trained by presenting the root form of the verb as a pattern of activation over the root units , and then using a simple, local, learning rule to
adjust the connection strengths so that this root form will tend to produce the correct pattern of activation over the past-tense units . The
model is tested by simply presenting the root form as a pattern of
activation over the root units and examining the pattern of activation
produced over the past-tense units .
The model is trained initially with a small number of verbs children
learn early in the acquisition process. At this point in learning, it can
only produce appropriate outputs for inputs that it has explicitly been
shown. But as it learns more and more verbs, it exhibits two interesting behaviors. First , it produces the standard ed past tense when tested
with pseudo-verbs or verbs it has never seen. Second, it " overregularizes" the past tense of irregular words it previously completed correctly .
Often , the model will blend the irregular past tense of the word with
the regular ed ending, and produce errors like CAMED as the past of COME. Behavior of this sort has been taken as strong evidence that the language learner has induced the rule which states that the regular correspondence for the past tense in English is to add a final ed (Berko, 1958). On the evidence of its performance, then, the model can be said to have acquired the rule. However, no special rule-induction mechanism is used, and no special language-acquisition device is required. The model learns to behave in accordance with the rule, not by explicitly noting that most words take ed in the past tense in English and storing this rule away explicitly, but simply by building up a set of connections in a pattern associator through a long series of simple learning experiences. The same mechanisms of parallel distributed processing and connection modification which produce this behavior can be applied in a wide range of other domains.
changing synaptic connections. Rosenblatt's work was very controversial at the time, and the specific models he proposed were not up to all the hopes he had for them. But his vision of the human information processor as a parallel network of simple units, whose knowledge resides in modifiable connections, anticipated much of what was to come. Among those whose work has been most influential on our own have been J. A. Anderson and his colleagues, and these ideas fed directly into the interactive activation model of word recognition (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982).
The ideas represented in the interactive activation model had other precursors as well. Morton's logogen model (Morton, 1969) was one of the first models to capture concretely the principle of interaction of different sources of information, and Marslen-Wilson (e.g., Marslen-Wilson & Welsh, 1978) provided important empirical demonstrations of interaction between different levels of language processing. Levin's (1976) Proteus model demonstrated the virtues of activation-competition mechanisms, and Glushko (1979) helped us see how conspiracies of partial activations could account for certain aspects of apparently rule-guided behavior.
Our work also owes a great deal to a number of colleagues who have been working on related ideas in recent years. Many of these colleagues appear as authors or coauthors of chapters in this book. But there are others as well. Several of these people have been very influential in the development of the ideas in this book. Feldman and Ballard (1982) laid out many of the computational principles of the PDP approach (under the name of connectionism), and stressed the biological implausibility of most of the prevailing computational models in artificial intelligence. Hofstadter (1979, 1985) deserves credit for stressing the existence of a subcognitive (what we call microstructural) level, and pointing out how important it can be to delve into the microstructure to gain insight. A sand dune, he has said, is not a grain of sand. Others have contributed crucial technical
insights. Sutton and Barto (1981) provided an insightful analysis of the connection modification scheme we call the delta rule, demonstrating the power of the rule to account for some of the subtler properties of classical conditioning. And Hopfield's (1982) contribution of the idea that
network models can be seen as seeking minima in energy landscapes played a prominent role in the development of the Boltzmann machine (Chapter 7), and in the crystallization of the ideas presented in Chapters 6 and 14 on harmony theory and schemata.
The power of parallel distributed processing is becoming more and
more apparent , and many others have recently joined in the exploration
of the capabilities of these mechanisms. We hope this book represents the nature of the enterprise we are all involved in, and that it does justice to the potential of the PDP approach.
ACKNOWLEDGMENTS
This research was supported by Contract N00014-79-C-0323, NR 667-437 with the Personnel and Training Research Programs of the Office of Naval Research, by grants from the System Development Foundation, and by a NIMH Career Development Award (MH00385) to the first author.
CHAPTER 2

A General Framework for Parallel Distributed Processing

D. E. RUMELHART, G. E. HINTON, and J. L. McCLELLAND
The General Framework

It is useful to begin with an analysis of the various components of our models and then to describe the various specific assumptions we can make about each of these components. We begin with the general analysis and then, in the course of the chapter, describe a number of specific models in terms of it. We are not the first to attempt a general characterization of this class of models; Kohonen (1977, 1984), Amari (1977a), and Feldman and Ballard (1982) have similar aims in their papers.

There are eight major aspects of a parallel distributed processing model:
• A set of processing units
• A state of activation
• An output function for each unit
• A pattern of connectivity among units
• A propagation rule for propagating patterns of activities through the network of connectivities
• An activation rule for combining the inputs impinging on a unit with the current state of that unit to produce a new level of activation for the unit
• A learning rule whereby patterns of connectivity are modified by experience
• An environment within which the system must operate
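Read as a specification, these eight aspects map naturally onto a small simulation skeleton. The sketch below is ours, in Python with numpy; every name in it is illustrative rather than anything defined in the text.

    import numpy as np

    class PDPModel:
        """Skeleton holding the aspects listed above (illustrative only)."""

        def __init__(self, n_units, W, output_fn, activation_rule):
            self.a = np.zeros(n_units)     # state of activation, a(t)
            self.W = W                     # pattern of connectivity
            self.output_fn = output_fn     # output function for each unit
            self.activation_rule = activation_rule

        def step(self, external_input):
            o = self.output_fn(self.a)           # o(t) = f(a(t))
            net = self.W @ o + external_input    # propagation rule: weighted sums
            self.a = self.activation_rule(self.a, net)   # activation rule
            return self.a

    # A learning rule would modify self.W after each step, and the environment
    # would supply the stream of external_input patterns.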
FIGURE 1. The basic components of a parallel distributed processing system. (One inset shows a sigmoid activation function.)
The units may represent conceptual objects such as features, letters, words, or concepts, or they may be simply abstract elements over which meaningful patterns can be defined. When we speak of a distributed representation, we mean one in which the units represent small, feature-like entities; in this case it is the pattern as a whole that is the meaningful level of analysis. This should be contrasted to a one-unit-one-concept representational system in which single units represent entire concepts or other large meaningful entities.
We let N be the number of units. We can order the units arbitrarily and designate the ith unit u_i. All of the processing of a PDP model is carried out by these units. There is no executive or other overseer. There are only relatively simple units, each doing its own relatively simple job. A unit's job is simply to receive input from its neighbors and, as a function of the inputs it receives, to compute an output value which it sends to its neighbors. The system is inherently parallel in that many units can carry out their computations at the same time.
Within any system we are modeling, it is useful to characterize three types of units: input, output, and hidden. Input units receive inputs from sources external to the system under study. These inputs may be either sensory inputs or inputs from other parts of the processing system in which the model is embedded. The output units send signals out of the system; they may either directly affect motoric systems or simply influence other systems external to the ones we are modeling. The hidden units are those whose only inputs and outputs are within the system we are modeling; they are not "visible" to outside systems.

The state of activation. In addition to the set of units, we need a representation of the state of the system at time t. This is primarily specified by a vector of N real numbers, a(t), representing the pattern of activation over the set of processing units. Each element of the vector stands for the activation of one of the units at time t; the activation of unit u_i at time t is designated a_i(t). It is the pattern of activation over the set of units that captures what the system is representing at any time. It is useful to see processing in the system as the evolution, through time, of a pattern of activity over the set of units.

Different models make different assumptions about the activation values a unit is allowed to take on. Activation values may be continuous or discrete. If they are continuous, they may be unbounded, or bounded between some minimum and maximum, such as the interval [0,1]. When activation values are restricted to discrete values, they are most often binary, taking on the values 0 and 1, where 1 is usually taken to mean that the unit is active and 0 that it is inactive; sometimes they take on the values +1 and -1, or any of a small set of values. As we shall see, each of these assumptions leads to models with slightly different characteristics, and it is part of the program of research represented in this book to determine the implications of these various assumptions.
Output of the units. Units interact. They do so by transmitting signals to their neighbors. The strength of their signals, and therefore the degree to which they affect their neighbors, is determined by their degree of activation. Associated with each unit, u_i, there is an output function, f_i(a_i(t)), which maps the current state of activation a_i(t) to an output signal o_i(t); that is, o_i(t) = f_i(a_i(t)).
Activation rule. We also need a rule whereby the net inputs impinging on a unit are combined with the current state of that unit to produce a new state of activation. In general, the new state of activation depends on the old one as well as the current input, and we have

a(t+1) = F(a(t), net(t)_1, net(t)_2, . . .),

where the function F itself is what we call the activation rule. Usually, the function is assumed to be deterministic. Thus, for example, if a threshold is involved it may be that a_i(t) = 1 if the total input exceeds some threshold value and equals 0 otherwise. Other times it is assumed that F is stochastic. Sometimes activations are assumed to decay slowly with time so that even with no external input the activation of a unit will simply decay and not go directly to zero. Whenever a_i(t) is assumed to take on continuous values it is common to assume that F is a kind of sigmoid, or S-shaped, function; in this case an individual unit can saturate and reach a minimum or maximum value of activation. Perhaps the most common case is the quasi-linear one, in which the activation function, F, is a nondecreasing function of the net input.
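For concreteness, here are simple Python versions of the kinds of activation rule F just mentioned (illustrative forms of our own; particular models make their own choices):

    import numpy as np

    def threshold_rule(a, net, theta=0.0):
        # Deterministic: a_i(t+1) = 1 if the net input exceeds the threshold.
        return (net > theta).astype(float)

    def decay_rule(a, net, decay=0.1):
        # Activation decays slowly toward zero while the input pushes it around.
        return (1.0 - decay) * a + net

    def sigmoid_rule(a, net):
        # Quasi-linear: bounded, S-shaped, nondecreasing in the net input,
        # so the unit saturates at its minimum (0) and maximum (1) values.
        return 1.0 / (1.0 + np.exp(-net))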
Virtually all learning rules for models of this type can be considered a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behavior (1949). Hebb's basic idea is this: If a unit, u_i, receives an input from another unit, u_j, then, if both are highly active, the weight, w_ij, from u_j to u_i should be strengthened. In the simplest version of this rule, the change in a weight is proportional to the product of the activation of the receiving unit and the output of the sending unit. Thus we have

Δw_ij = η a_i(t) o_j(t),

where η is a constant of proportionality representing the learning rate.
The environment. Finally, to complete the specification of a PDP model, it is necessary to have a clear model of the environment in which the model is to exist. In PDP models, we characterize the environment as a time-varying stochastic function over the space of input patterns. That
is, we imagine that at any point in time , there is some probability that
any of the possible set of input patterns is impinging on the input units .
This probability function may in general depend on the history of
inputs to the system as well as outputs of the system. In practice, most
PDP models involve a much simpler characterization of the environ ment . Typically , the environment is characterized by a stable probability
distribution over the set of possible input patterns independent of past
inputs and past responses of the system. In this case, we can imagine
listing the set of possible inputs to the system and numbering them
from 1 to M. The environment is then characterized by a set of probabilities, p_i for i = 1, . . . , M. Since each input pattern can be considered a vector, it is sometimes useful to characterize those patterns with nonzero probabilities as constituting orthogonal or linearly independent sets of vectors.2 Certain PDP models are restricted in the kinds of patterns they are able to learn: some being able to learn to respond correctly only if the input vectors form an orthogonal set; others if they form a linearly independent set of vectors; and still others are able to learn to respond to essentially arbitrary patterns of inputs.

2 See Chapter 9 for a discussion of these terms.
CLASSES OF PDP MODELS

There are many paradigms and classes of PDP models that have been developed. In this section we describe some general classes of assumptions and paradigms. In the following section we describe some specific PDP models and show their relationships to the general framework outlined here.
Paradigms of Learning
Although most learning rules have roughly the form indicated above,
we can categorize the learning situation into two distinct sorts. These
are:
• Associative learning, in which we learn connections that allow an arbitrary pattern of activation on one set of units to produce another arbitrary pattern on another set of units.
• Regularity discovery, in which units learn to respond to "interesting" patterns in their input. In general, such a scheme should be able to form the basis for the development of feature detectors and therefore the basis for knowledge representation in a PDP system.
In certain cases these two modes of learning blend into one another,
but it is valuable to see the different goals of the two kinds of learning.
Associative learning is employed whenever we are concerned with storing patterns so that they can be re-evoked in the future . These rules
are primarily concerned with storing the relationships among subpatterns. Regularity detectors are concerned with the meaning of a single unit's response. These kinds of rules are used when feature discovery is the essential task at hand.
The associative learning case generally can be broken down into two
subcases- pattern association and auto-association. A pattern association
paradigm is one in which the goal is to build up an association between
patterns defined over one subset of the units and other patterns defined
over a second subset of units . The goal is to find a set of connections
so that whenever a particular pattern reappears on the first set of units ,
the associated pattern will appear on the second set. In this case, there
is usually a teaching input to the second set of units during training indi cating the desired pattern association. An auto-association paradigm is
one in which an input pattern is associated with itself . The goal here is
pattern completion . Whenever a portion of the input pattern is
presented, the remainder of the pattern is to be filled in or completed.
This is similar to simple pattern association, except that the input pattern plays both the role of the teaching input and of the pattern to be
associated. It can be seen that simple pattern association is a special
case of auto-association. Figure 3 illustrates the two kinds of 'learning
paradigms. Figure 3A shows the basic structure of the pattern association situation . There are two distinct groups of units - a set of input
units and a set of output units . Each input unit connects with each output unit and each output unit receives an input from each input unit .
During training , patterns are presented to both the input and output
units . The weights connecting the input to the output units are modi fied during this period. During a test, patterns are presented to the
input units and the response on the output units is measured. Figure
3B shows the connectivity matrix for the pattern associator. The only
modifiable connections are from the input units to the output units .
All other connections are fixed at zero. Figure 3C shows the basic structure of the auto-association paradigm. In this case, there is potentially a modifiable connection from every unit to every other unit.
In the case of pattern association, however, the units are broken into
two subpatterns, one representing the input pattern and another
representing the teaching input . The only modifiable connections are
those from the input units to the output units receiving the teaching
input . In other cases of associative learning the teaching input may be
more or less indirect . The problem of dealing with indirect feedback is
difficult , but central to the development of more sophisticated models
of learning. Barto and Sutton ( 1981) have begun a nice analysis of
such learning situations.
In the case of regularity detectors, a teaching input is not explicitly provided; instead, the teaching function is determined by the unit itself.
The form of the internal teaching function and the nature of its input
patterns determine what features the unit will learn to respond to. This
is sometimes called unsupervised learning. Each different kind of
unsupervised learning procedure has its own evaluation function . The
particular evaluation procedures are mentioned when we treat these
models. The three unsupervised learning models discussed in this book
are addressed in Chapters 5, 6, and 7.
Bottom-Up Processing
The fundamental characteristic of a bottom -up system is that units at
level i may not affect the activity of units at levels lower than i . To
see how this maps onto the current formulation , it is useful to partition
the coalitions of units into a set of discrete categories corresponding to
the levels their inputs come from. There are assumed to be no coalitions with inputs from more than one level. Assume that there are L_i units at level i in the system. We then order the units such that those in level 1 are numbered u_1, . . . , u_{L_1}, those in level 2 are numbered u_{L_1+1}, . . . , u_{L_1+L_2}, etc. Then, the constraint that the system be a pure
bottom-up system is equivalent to the constraint that the connectivity matrix, W, has zero entries for w_ij whenever u_j is a member of a level higher than that of u_i. This amounts to the requirement that the upper right-hand region of W contains zero entries. Table 1 shows this constraint graphically. The table shows an example of a three-level system
with four units at each level.3 This leads to a 12×12 connectivity matrix and an activation vector, a, of length 12. The matrix can be divided up into 9 regions. The upper-left region represents interactions among Level 1 units. The entries in the left-middle region of the matrix represent the effects of Level 1 units on Level 2 units. The lower-left region represents the effects of Level 1 units on Level 3 units. Often bottom-up models do not allow units at level i to affect units at level i+2; thus, in the diagram we have left that region empty, representing no effect of Level 1 on Level 3. It is typical in a bottom-up system to assume as well that the lowest level units (Level 1) are input units and that the highest level units (Level 3) are output units. That is, the lowest level of the system is the only one to receive direct inputs from outside of this module and only the highest level units affect other units outside of this module.
TABLE 1

                        Level 1           Level 2           Level 3
                        Input Units       Hidden Units      Output Units
                        u1  u2  u3  u4    u5  u6  u7  u8    u9  u10 u11 u12

Level 1    u1-u4        within-Level 1
Units                   effects

Level 2    u5-u8        Level 1           within-Level 2
Units                   affecting         effects
                        Level 2

Level 3    u9-u12                         Level 2           within-Level 3
Units                                     affecting         effects
                                          Level 3
3 In general , of course , we would expect many levels and many units at each level .
Top-Down Processing
The generalization to a hierarchical top-down system should be clear
enough. Let us order the units into levels just as before. A top-down
model then requires that the lower-left regions of the weight matrix be
empty - that is, no lower level unit affects a higher level unit . Table 2
illustrates a simple example of a top-down processing system. Note, in this case, we have to assume a top-down input or "message" that is propagated down the system from higher to lower levels as well as any data input that might be coming directly into Level 1 units.
Interactive Models
Interactive models are simply models in which there can be both
top-down and bottom -up connections. Again the generalization is
straightforward . In the general interactive model, any of the cells of
the weight matrix could be nonzero. The more restricted models in
which information flows both ways, but in which information only
flows between adjacent levels, assume only that the regions of the
matrix more than one region away from the main diagonal are zero.
Table 3 illustrates a simple three-level interactive model with both topdown and bottom -up input . Most of the models that actually have been
suggested count as interactive models in this sense.
TABLE 2

                        Level 1           Level 2           Level 3
                        Input Units       Hidden Units      Output Units
                        u1  u2  u3  u4    u5  u6  u7  u8    u9  u10 u11 u12

Level 1    u1-u4        within-Level 1    Level 2
Units                   effects           affecting
                                          Level 1

Level 2    u5-u8                          within-Level 2    Level 3
Units                                     effects           affecting
                                                            Level 2

Level 3    u9-u12                                           within-Level 3
Units                                                       effects
TABLE 3

                        Level 1           Level 2           Level 3
                        Input Units       Hidden Units      Output Units
                        u1  u2  u3  u4    u5  u6  u7  u8    u9  u10 u11 u12

Level 1    u1-u4        within-Level 1    Level 2
Units                   effects           affecting
                                          Level 1

Level 2    u5-u8        Level 1           within-Level 2    Level 3
Units                   affecting         effects           affecting
                        Level 2                             Level 2

Level 3    u9-u12                         Level 2           within-Level 3
Units                                     affecting         effects
                                          Level 3
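The three connectivity constraints amount to zeroing different blocks of W. A small sketch in Python with numpy (our construction; unit ordering as in the tables, and ignoring the optional rule against skipping levels):

    import numpy as np

    levels = np.repeat([1, 2, 3], 4)     # levels of units u1..u12, as in the tables

    def mask(kind):
        # mask[i, j] is True where w_ij may be nonzero (unit j affecting unit i).
        recv, send = np.meshgrid(levels, levels, indexing="ij")
        if kind == "bottom-up":          # upper right-hand region forced to zero
            return send <= recv
        if kind == "top-down":           # lower left-hand region forced to zero
            return send >= recv
        if kind == "interactive":        # both directions, adjacent levels only
            return np.abs(send - recv) <= 1
        raise ValueError(kind)

    W = np.random.randn(12, 12) * mask("bottom-up")   # a legal bottom-up matrix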
Synchronous Versus Asynchronous Update
Even given all of the components of the POP models we have
described so far , there is still another important issue to be resolved in
the development of specific models; that is the timing of the application
of the activation rule . In some models, there is a kind of central timing
pulse and after each such clock tick a new value is determined simul taneously for all units . This is a synchronousupdate procedure. It is
usually viewed as a discrete, difference approximation to an underlying
continuous , differential equation in which all units are continuously
updated. In some models, however, units are updated asynchronously
and at random. The usual assumption is that at each point in time each
unit has a fixed probability of evaluating and applying its activation rule
and updating its activation value. This later method has certain
theoretical advantages and was developed by Hopfield (1982) and has
been employed in Chapters 6, 7, and 14. The major advantage is that
since the units are independently being updated, if we look at a short
enough time interval , only one unit is updating at a time . Among
other things, this system can help the stability of the network by
keeping it out of oscillations that are more readily entered into with
synchronous update procedures.
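The two timing schemes differ only in which units apply the activation rule on a given tick. A sketch of our own in Python:

    import numpy as np

    rng = np.random.default_rng(0)

    def synchronous_step(a, W):
        # One clock tick: every unit recomputes from the same old state.
        return np.tanh(W @ a)

    def asynchronous_step(a, W, p=0.1):
        # Each unit updates with fixed probability p; over a short enough
        # interval, effectively only one unit changes at a time.
        new = np.tanh(W @ a)
        chosen = rng.random(a.shape[0]) < p
        return np.where(chosen, new, a)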
In the following sections we will show how specification of the particular functions involved produces various kinds of these models. Many authors have contributed to the field whose work might equally well have been discussed; we discuss only a representative sample of this work.
Simple Linear Models

Perhaps the simplest models of this class are the simple linear models, in which activation values are real numbers and the units are organized into an input layer and an output layer. (As we note below, there is no need for hidden units, since all computation possible with a multiple-step linear system can be done with a single-step linear system.) In general, any unit in the input layer may connect to any unit
in the output layer . All connections in a linear model are of the same
type . Thus , only a single connectivity matrix is required . The matrix
consists of a set of positive , negative , and zero values , for excitatory
values , inhibitory values , and zero connections , respectively . The new
value of activation of each unit is simply given by the weighted sums of
the inputs . For the simple linear model with connectivity matrix W we
have
a(t+1) = W a(t).
In general, it can be shown that a linear model of this sort has a number of limitations. In particular, anything that can be computed in two or more steps can be computed in a single step. This follows because the equation above implies

a(t) = W^t a(0).

We can see this by proceeding step by step. Clearly,

a(2) = W a(1) = W (W a(0)) = W^2 a(0).
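This is easy to verify numerically (a quick check of our own, not from the text):

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.standard_normal((4, 4))
    a0 = rng.standard_normal(4)

    two_steps = W @ (W @ a0)                         # a(2), computed stepwise
    one_step = np.linalg.matrix_power(W, 2) @ a0     # a(2) = W^2 a(0), in one step
    assert np.allclose(two_steps, one_step)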
Since this is a linear network, there is no feedback in the system, nor are there hidden units between the inputs and outputs. There are two sources of input in the system: the input patterns that establish a pattern of activation on the input units, and the teaching units that establish a pattern of activation on the output units. Any of several learning rules could be employed with a linear network such as this, but the most common are the simple Hebbian rule and the delta rule. The linear model with the simple Hebbian rule is called the
simple linear associator (cf . Anderson , 1970; Kohonen , 1977, 1984) . In
this case, the increment in the weight w_ij is given by Δw_ij = η t_i a_j, where t_i is the teaching input to unit i. In matrix notation, this means that ΔW = η t a^T. The system is then tested by presenting an input pattern without a teaching input and
seeing how close the pattern generated on the output layer matches the
original teaching input . It can be shown that if the input patterns are
orthogonal ,4 there will be no interference and the system will perfectly
produce the relevant associated patterns exactly on the output layer. If
they are not orthogonal , however, there will be interference among the
input patterns. It is possible to make a modification in the learning rule
and allow a much larger set of possible associations. In particular , it is
possible to build up correct associations among patterns whenever the
set of input patterns are linearly independent. To achieve this , an error
correcting rule must be employed. The delta rule is most commonly
employed. In this case, the rule becomes .6W;j = T) (t;- a; )aj . What is
learned is essentially the difference between the desired response and
that actually attained at unit u; due to the input . Although it may take
many presentations of the input pattern set, if the patterns are linearly
independent the system will eventually be able to produce the desired
outputs. Kohonen ( 1977, 1984) has provided an important analysis of
this and related learning rules.
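A compact sketch of delta-rule training for the linear associator (our illustration; the two input vectors are linearly independent but not orthogonal, so the simple Hebbian rule would not suffice):

    import numpy as np

    inputs  = [np.array([1., 0., 1.]), np.array([1., 1., 0.])]
    targets = [np.array([1., -1.]),    np.array([-1., 1.])]

    W, eta = np.zeros((2, 3)), 0.25
    for _ in range(200):                          # many presentations of the set
        for a, t in zip(inputs, targets):
            W += eta * np.outer(t - W @ a, a)     # learn the remaining error

    for a, t in zip(inputs, targets):
        print(W @ a, "target:", t)                # outputs converge to the targets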
The examples described above were for the case of the pattern associator. Essentially the same results hold for the auto-associator version
of the linear model. In this case, the input patterns and the teaching
patterns are the same, and the input layer and the output layer are also
the same. The tests of the system involve presenting a portion of the
input pattern and having the system attempt to reconstruct the missing
parts.
Linear Threshold Units

It is useful to see some of the kinds of functions that can be computed with linear threshold units that cannot be computed with simple linear models. The classic such function is the exclusive or (XOR), illustrated in Figure 4. The idea is to have a system which responds {1} if it receives a {0,1} or a {1,0} and responds {0} otherwise. The figure shows a network capable of this pattern. In this case we require two
FIGURE 4. An XOR network. The two input units connect to two internal units with weights of +1 and −1; the internal units connect to the output unit with weights of +1; thresholds are .01. The input patterns {0,0} and {1,1} yield output 0; {0,1} and {1,0} yield output 1.
layers of units. Each unit has a zero threshold and responds just in case its input is greater than zero. The weights are ±1. Since the set of
stimulus patterns is not linearly independent, this is a discrimination
that can never be made by a simple linear model and cannot be done in
a single step by any network of linear threshold units .
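One wiring consistent with this description (the figure's exact layout does not fully reproduce here, so treat the specific connections as an assumption): two internal linear threshold units, each detecting one of the two mixed input patterns, feeding an output unit that fires if either is on.

    def threshold(x):
        # A linear threshold unit: responds just in case its input exceeds zero.
        return 1 if x > 0 else 0

    def xor_net(x1, x2):
        h1 = threshold(+1 * x1 - 1 * x2)    # detects {1,0}
        h2 = threshold(-1 * x1 + 1 * x2)    # detects {0,1}
        return threshold(+1 * h1 + 1 * h2)  # fires if either internal unit fires

    for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(pair, "->", xor_net(*pair))   # 0, 1, 1, 0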
Although multilayered systems of linear threshold units are very
powerful and, in fact, are capable of computing any boolean function ,
there is no generally known learning algorithm for this general case
(see Chapter 8) . There is, however, a well-understood learning algorithm for the special case of the perceptron. A perceptron is essentially
a single-layer network of linear threshold units without feedback. The
learning situation here is exactly the same as that for the linear model.
An input pattern is presented along with a teaching input. The perceptron learning rule is precisely of the same form as the delta rule for error correcting in the linear model, namely, Δw_ij = η(t_i − a_i)a_j. Since the teaching input and the activation values are only 0 or 1, the rule reduces to the statements that:

1. Weights are only changed on a given input line when that line is turned on (i.e., a_j = 1).

2. If the system is correct on unit i (i.e., t_i = a_i), make no change in the weights.

3. If unit i responds 0 when its teaching input is 1, increment the weights on the active lines by η; if it responds 1 when its teaching input is 0, decrement the weights on the active lines by η.
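In code, the three statements collapse back into the single delta-rule line (a sketch of our own, shown learning the linearly separable OR function with a constant bias line):

    import numpy as np

    def perceptron_step(W, a, t, eta=1.0):
        out = (W @ a > 0).astype(float)        # linear threshold response
        return W + eta * np.outer(t - out, a)  # changes only active, erring lines

    # OR over two inputs, with a constant bias line as the third input.
    data = [(np.array([x1, x2, 1.]), np.array([float(x1 or x2)]))
            for x1 in (0, 1) for x2 in (0, 1)]

    W = np.zeros((1, 3))
    for _ in range(10):
        for a, t in data:
            W = perceptron_step(W, a, t)

    print([int((W @ a)[0] > 0) for a, _ in data])   # [0, 1, 1, 1]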
(A fuller discussion, including the extension to units with continuous, differentiable activation functions, appears in Chapter 8.) As we shall see there, the limitations of the one-step perceptron in no way apply to the more complex networks.

Brain State in a Box

The brain state in a box (BSB) model of J. A. Anderson is a close relative of the simple linear associator. There is, however, a maximum and a minimum activation value associated with each unit.
The auto-associator illustrated in Figure 3 is the typical learning paradigm for BSB. Note that with this pattern of interconnections the system feeds back on itself and thus the activation can recycle through the
system in a positive feedback loop . The positive feedback is especially
evident in J. A . Anderson and Mozer ' s ( 1981) version . Their activation
rule is given by

a_i(t+1) = a_i(t) + Σ_j w_ij a_j(t)

if a_i is less than 1 and greater than −1; otherwise, if the quantity is greater than 1, the activation value is set to 1, and if it is less than −1, it is set to −1. Thus, activation values in BSB are bounded. For a three-unit version of such a system, we can picture the state of the system as a point in a three-dimensional space in which the first coordinate corresponds to the activation value of the first unit, the second coordinate corresponds to the activation value of the second unit, and the third coordinate corresponds to the activation value of the third unit. Thus, each point in the space corresponds to a possible state of the system. The feature that each unit is limited to the region [−1,1] means that all points must lie somewhere within the box whose vertices are given by the points (−1,−1,−1), (−1,−1,+1), (−1,+1,−1), (−1,+1,+1), (+1,−1,−1),
(+1,−1,+1), (+1,+1,−1), and (+1,+1,+1). Moreover, since the system feeds back on itself with positive feedback, processing drives the state of the system outward until each unit reaches its maximum or minimum value; BSB systems thus always end up in one of the corners of the box, as illustrated in Figure 5.

FIGURE 5. The state space for a three-unit version of a BSB model. Each dimension of the box represents the activation value of one unit. Each unit is bounded in activation between [−1,1]. The curving arrow in the box represents the sequence of states the system moved through: it began at the black spot near the middle of the box and, as processing proceeded, moved to the (−,+,+) corner of the box. BSB systems always end up in one or another of the corners. The particular corner depends on the start state of the network, the input to the system, and the pattern of connections among the units.
Learning in the BSB system involves auto-association. In different applications two different learning rules have been applied. J. A. Anderson and Mozer (1981) applied the simplest rule: they simply allowed the system to settle down and then employed the simple Hebbian learning rule, Δw_ij = η a_i a_j. The error correction rule has also been applied to the BSB model. In this case we use the input as the teaching input as well as the source of activation to the system. The learning rule thus becomes Δw_ij = η(t_i − a_i)a_j, where t_i is the input to unit i and where a_i and a_j are the activation values of the system after it has stabilized in one of the corners of the hypercube.
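A three-unit BSB run, sketched in Python (the weights are illustrative; symmetric, with positive feedback):

    import numpy as np

    W = np.array([[ 0.0,  0.5, -0.5],
                  [ 0.5,  0.0,  0.5],
                  [-0.5,  0.5,  0.0]])   # illustrative symmetric weights

    a = np.array([0.1, 0.2, 0.1])        # start near the middle of the box
    for _ in range(20):
        # Anderson & Mozer's rule, with activations pinned to [-1, 1]:
        a = np.clip(a + W @ a, -1.0, 1.0)
    print(a)                             # -> [1. 1. 1.], a corner of the box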
Thermodynamic Models
Other more recent developments are the thermodynamic models.
Two examples of such models are presented in the book. One, harmony theory, was developed by Paul Smolensky and is described in
detail in Chapter 6. The other , the Boltzmann machine, was developed
by Hinton and Sejnowski and is described in Chapter 7. Here we
describe the basic idea behind these models and show how they relate to the general class of models under discussion. To begin, the thermodynamic models employ binary units which take on the values {0,1}. The units are divided into two categories: the visible units, corresponding to our input and output units, and the hidden units. In general, any unit may connect to any other unit. However, there is a constraint that the connections must be symmetric; that is, w_ij = w_ji. In these models, there is no distinction between the output of the unit and its activation value. The activation values are, however, a stochastic function of the inputs. That is,
p(a_i(t) = 1) = 1 / (1 + e^{−(Σ_j w_ij a_j + η_i − θ_i)/T})

where η_i is the input from outside of the system into unit i, θ_i is the threshold for the unit, and T is a parameter, called temperature, which
determines the slope of the probability function . Figure 6 shows how
the probabilities vary with various values of T . It should be noted that
as T approaches zero, the individual units become more and more like
linear threshold units . In general, if the unit exceeds threshold by a
great enough margin it will always attain value 1. If it is far enough below threshold, it always takes on value 0. Whenever the unit is above threshold, the probability that it will turn on is greater than 1/2.
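The probability rule is a one-liner; evaluating it at a few temperatures shows the sharpening that Figure 6 plots (our sketch):

    import math

    def p_on(excess, T):
        # excess = net input plus external input, minus the threshold
        return 1.0 / (1.0 + math.exp(-excess / T))

    for T in (20.0, 5.0, 0.5):
        print(T, [round(p_on(x, T), 2) for x in (-10, -1, 0, 1, 10)])
    # As T shrinks, the unit approaches a linear threshold unit: probability
    # near 0 below threshold, 1/2 at threshold, and near 1 above it.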
FIGURE 6. Probability of a unit attaining value 1 as a function of its net input, plotted for several values of the temperature T.
The thermodynamic models employ a simple yet powerful learning scheme. Learning occurs in two phases. During the performance phase, a portion of the visible units is clamped by a stimulus pattern from the environment; the system then settles, and the remaining visible units take on activation values that, if the system has learned correctly, complete the pattern. During this environmentally driven phase, each weight w_ij is incremented by η a_i a_j whenever the two units it connects are both on; this is a simple Hebbian rule. During the second, so-called free-running phase, no stimuli are presented, and the activations of the units are determined only by the internal structure of the system; during this phase each weight is decremented by the same amount whenever the two units are both on, an anti-Hebbian rule. The intuition is roughly this: since the correlations among units during the free-running phase reflect only the structure internal to the network, subtracting them out leaves only those correlations that are actually due to the environment. Thus, even though no explicit teaching input is employed, the weights come to capture the structure of the environmentally presented patterns, and the system will respond correctly, completing a learned pattern when only a portion of it is presented. These issues are addressed in Chapter 7.
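A bare-bones version of the two-phase weight update (a sketch under our reading of the procedure; the full algorithm, with stochastic settling and averaging over many settled states, is developed in Chapter 7):

    import numpy as np

    def two_phase_update(W, a_plus, a_minus, eta=0.05):
        # a_plus:  settled 0/1 state with the environment clamping visible units
        # a_minus: settled 0/1 state while free-running (no stimulus)
        W = W + eta * np.outer(a_plus, a_plus)     # Hebbian increment
        W = W - eta * np.outer(a_minus, a_minus)   # anti-Hebbian decrement
        np.fill_diagonal(W, 0.0)                   # no self-connections
        return W                                   # W remains symmetric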
Grossberg

Stephen Grossberg has been one of the major contributors to models of this class over many years. His work is complex and contains many important aspects that we cannot review here; we attempt only a summary of some of the clearest points of contact between his models and the present framework. (Grossberg, 1980, is perhaps the clearest presentation of his work.) In Grossberg's models, the activation value of a unit is a real number that may take on any value between some minimum and maximum. The output function of a unit is typically a threshold function, so that a given unit affects another unit only if its activation level is above its threshold. Moreover, Grossberg argues that the output function must be a sigmoid, or S-shaped, function of the unit's activation value. Grossberg's units do not simply sum their inputs; rather, excitatory and inhibitory inputs are treated separately, and the effect of an input on a unit depends on how far the unit's activation value is from its maximum (for excitatory inputs) or from its minimum (for inhibitory inputs).
In one of the simplest cases, Grossberg's activation rule takes roughly the following form:

a_j(t+1) = a_j(t)(1 − A) + (max − a_j(t)) net_ej − (a_j(t) − min) net_ij,

where A is a decay rate, net_ej is the net excitatory input into unit j (a weighted sum, of the same kind as in the models above, taken over the excitatory inputs to the unit), and net_ij is the net inhibitory input into the unit. Thus, the closer the activation of a unit is to its maximum, the smaller the effect of a given excitatory input, and the closer it is to its minimum, the smaller the effect of a given inhibitory input. Grossberg has also studied a number of learning rules, each of which can be seen as a variant of the Hebbian scheme described above, in which the change in a weight depends on the conjunction of activity in the units it connects. The details of the many cases Grossberg has analyzed are beyond the scope of this chapter.
The Interactive Activation Model

The interactive activation model of word perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) is an interactive model in the sense described above. The units are organized into layers: a layer of units representing the visual features of letters, a layer representing letters, and a layer representing words. Each unit connects with units in the layers above and below it. A unit excites the units in adjacent layers with which it is consistent (for example, a letter unit excites the units for all of the words that contain that letter in the appropriate position) and inhibits the units with which it is inconsistent; in addition, each unit inhibits the other units within its own layer. The weights were assumed to be equal in magnitude for all excitatory connections and for all inhibitory connections of a given kind.

Units in this model could take on any activation value in the range [min, max]. The activation rule is similar in spirit to Grossberg's. The net input to a unit, net_i(t) = Σ_j w_ij o_j(t), is the weighted sum of the outputs of the units that impinge on it. If the net input is excitatory (net_i(t) > 0), it pushes the activation of the unit up toward its maximum; if it is inhibitory, it pushes the activation down toward its minimum:

a_i(t+1) = a_i(t) − Θ(a_i(t) − rest) + net_i(t)(max − a_i(t))  if net_i(t) > 0
a_i(t+1) = a_i(t) − Θ(a_i(t) − rest) + net_i(t)(a_i(t) − min)  otherwise,

where Θ is a decay rate that, in the absence of input, returns the activation of the unit toward its resting value. The output of a unit is a threshold function of its activation, so that a unit affects its neighbors only to the degree that its activation exceeds its threshold. The model was aimed at accounting for specific aspects of word perception, and no learning was involved: the pattern of interconnections was specified by the modelers. As we shall see elsewhere in this book, however, learning rules of the kind already described might explain how such a network could have been learned, giving a plausible account of where the particular connections come from.
Feldman and Ballard

Feldman and Ballard (1982) have also proposed a general framework, which they call connectionist modeling, that is in most respects similar to our own, along with a number of specific models within that framework. Their units have a potential, roughly the equivalent of our activation value, which in practice takes on small integer values in the range [−10, 10]. The output of a unit is either 0 or 1 or, in some of their models, one of a small number of discrete values. The simplest of their unit types, which they call the P-unit, has an activation rule given (in our notation) by

a_j(t+1) = a_j(t) + β net_j(t),

where β is a constant. Once the potential reaches its maximum or minimum value it is simply pinned there, and a decay factor pulls the potential back toward its resting value. Feldman and Ballard also propose conjunctive connections, in which the inputs from two or more units are multiplied before being combined with the other inputs to a unit; this is essentially the same idea as the sigma-pi units discussed below. Their framework allows for learning rules of the general kind already described, although in the applications they have actually reported, relatively little of this learning machinery has been employed.
SIGMA-PI UNITS
Before completing our section on a general framework , it should be
mentioned that we have sometimes found it useful to postulate units
that are more complex than those described up to this point in this
chapter. In our descriptions thus far, we have assumed a simple additive unit in which the net input to the unit is given by Σ_j w_ij a_j. This is certainly the most common form in most of our models. Sometimes,
however, we want multiplicative connections in which the output values
of two (or possibly more) units are multiplied before entering into the
sum. Such a multiplicative connection allows one unit to gate another.
Thus, if one unit of a multiplicative pair is zero, the other member of
the pair can have no effect , no matter how strong its output . On the
other hand, if one unit of a pair has value 1, the output of the other is
passed unchanged to the receiving unit . Figure 7 illustrates several
such connections. In this case, the input to unit A is the weighted sum
of the products of units Band C and units D and E. The pairs, BC and
DE are called conjuncts. In this case we have conjuncts of size 2. In
general, of course, the conjuncts could be of any size. We have no
applications, however, which have required conjuncts larger than size 2.
In general, then , we assume that the net input to a unit is given by the
weighted sum of the products of a set of individual inputs. That is, the net input to a unit is given by net_j = Σ_i w_ji (a_{i1} a_{i2} · · · a_{ik}), where i indexes the conjuncts impinging on unit j and u_{i1}, u_{i2}, . . . , u_{ik} are the k units in the conjunct. We call units such as these sigma-pi units.
In addition to their use as gates, sigma-pi units can be used to convert the output level of a unit into a signal that acts like a weight connecting two units. Thus, assume we have the pattern of connections illustrated in the figure, and assume further that the weights on those connections are all 1. In this case, we can use the output levels of units B and D to, in effect, set the weights from C to A and E to A, respectively. Since, in general, it is the weights among the units that determine the behavior of the network, sigma-pi units allow for a dynamically programmable network in which the activation value of some units determines what another network can do.
In addition to its general usefulness in these cases, one might ask
whether we might not sometime need still more complex patterns of
interconnections. Interestingly, as described in Chapter 10, we will never be forced to develop any more complex interconnection type, since sigma-pi units are sufficient to mimic any monotonic function of the inputs.
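A sigma-pi net input in a few lines of Python (our sketch), with Figure 7's arrangement as the example; note how a zero in one member of a conjunct gates its partner off, while a 1 passes the partner through:

    import numpy as np

    def sigma_pi_net(weights, conjuncts, out):
        # net = sum over conjuncts i of w_i times the product of the
        # outputs of the units in conjunct i (here conjuncts of size 2).
        return sum(w * np.prod([out[u] for u in conj])
                   for w, conj in zip(weights, conjuncts))

    out = {"B": 1.0, "C": 0.8, "D": 0.0, "E": 0.9}
    net_A = sigma_pi_net([1.0, 1.0], [("B", "C"), ("D", "E")], out)
    print(net_A)   # 0.8: B passes C through unchanged; D gates E off entirely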
FIGURE 7. Two conjunctive inputs to unit A: one from the conjunct of B and C, the other from the conjunct of D and E. The net input to unit A is the weighted sum of the products of the outputs of the conjoined units: net_A = Σ_i w_iA Con_i, where Con_1 = a_B · a_C and Con_2 = a_D · a_E.
CONCLUSION
We have provided a very general mathematical and conceptual
framework within which we develop our models. This framework provides a language for expressing PDP models, and, though there is a lot of freedom within it, it is at least as constrained as most computational formalisms, such as production systems or high-level languages such as Lisp.
We must take note of the fact, however, that the framework does
not specify all of the constraints we have imposed on ourselves in our
model building efforts . For example, virtually any computing device,
serial or parallel, can be described in the framework we have described
here.
There is a further set of considerations which has guided our particular formulations. These further considerations arise from two sources: our beliefs about the nature of the hardware available for carrying out mental processes in the brain, and our beliefs about the essential character of these mental processes themselves. We discuss below the additional constraints on our model building which arise from these two beliefs.
First, the operations in our models can be characterized as "neurally inspired." We wish to replace the "computer metaphor" as a model of mind with the "brain metaphor" as model of mind. This leads us to a number of considerations which further inform and constrain our model building efforts. Perhaps the most crucial of these is time. Neurons are remarkably slow relative to components in modern computers. Neurons operate in the time scale of milliseconds whereas computer components operate in the time scale of nanoseconds, a factor of 10^6 faster. This means that human processes that take on the order of a second or less can involve only a hundred or so time steps. Since most of the processes we have studied (perception, memory retrieval, speech processing, sentence comprehension, and the like) take about a second or so, it makes sense to impose what Feldman (1985) calls the "100-step program" constraint. That is, we seek explanations for these mental phenomena which do not require more than about a hundred elementary sequential operations. Given that the processes we seek to characterize are often quite complex and may involve consideration of large numbers of simultaneous constraints, our algorithms must involve considerable parallelism. Thus, although a serial computer could be created out of the kinds of components represented by our units, such an implementation would surely violate the 100-step program constraint for any but the simplest processes.
A second consideration differentiates our models from those inspired by the computer metaphor: that is, the constraint that all the knowledge is in the connections. From conventional programmable computers we are used to thinking of knowledge as being stored in the states of certain units in the system. In our systems we assume that only very short term storage can occur in the states of units; long term storage takes place in the connections among units. Indeed, it is the connections, or perhaps the rules for forming them through experience, which primarily differentiate one model from another. This is a profound difference between our approach and other more conventional approaches, for it means that almost all knowledge is implicit in the structure of the device that carries out the task rather than explicit in the states of units themselves. Knowledge is not directly accessible to interpretation by some separate processor, but it is built into the processor itself and directly determines the course of
processing. It is acquired through tuning of connections as these are used in processing, rather than formulated and stored as declarative facts.
In addition to these two neurally inspired working assumptions, there are a number of other constraints that derive rather directly from our understanding of the nature of neural information processing. These assumptions are discussed more fully in Chapter 4.

The second class of constraints arises from our beliefs about the nature of human information processing considered at a more abstract, computational level of analysis. We see the kinds of phenomena we have been studying as products of a kind of constraint satisfaction procedure in which a very large number of constraints act simultaneously to produce the behavior. Thus, we see most behavior not as the product of a single, separate component of the cognitive system, but as the product of a large set of interacting components, each mutually constraining the others and contributing in its own way to the globally observable behavior of the system. It is very difficult to use serial algorithms to implement such a conception, but very natural to use highly parallel ones. These problems can often be characterized as best match or optimization problems. As Minsky and Papert (1969) have pointed out, it is very difficult to solve best match problems serially. However, this is precisely the kind of problem that is readily implemented using highly parallel algorithms of the kind we consider in this book. See Kanerva (1984) for a discussion of the best match problem and its solution with parallel processing systems.

To summarize, the PDP framework consists not only of a formal language, but a perspective on our models. Other qualitative and quantitative considerations arising from our understanding of brain processing and of human behavior combine with the formal system to form what might be viewed as an aesthetic for our model building enterprises. The remainder of our book is largely a study of this aesthetic in practice.
ACKNOWLEDGMENTS
This research was supported by Contract N00014-79-C-0323, NR 667-437 with the Personnel and Training Research Programs of the Office of Naval Research, by grants from the System Development Foundation, and by a NIMH Career Development Award (MH00385) to the second author.
CHAPTER 3

Distributed Representations

G. E. HINTON, J. L. McCLELLAND, and D. E. RUMELHART
Every representational scheme has its good points and its bad points, and distributed representations are no exception. Some desirable properties arise very naturally from the use of patterns of activity as representations; other properties, like the ability to temporarily store a large set of arbitrary associations, are much harder to achieve. As we shall see, the best evidence for distributed representations is the degree to which their strengths and weaknesses match those of the human mind.

The first section of this chapter stresses some of the virtues of distributed representations: content-addressable memory, automatic generalization, and the selection of the representation best fitted to the current situation. The second section considers the relation between distributed representations and local ones, and the relative efficiency of the two kinds of scheme. A final section discusses some difficult issues, including the representation of constituent structure, which are often avoided by advocates of distributed representations.

One common source of confusion concerns the relation between distributed representations and the higher-level formalisms, such as schemata, production systems, and semantic networks, that have been found useful in cognitive psychology and artificial intelligence. It would be wrong to view distributed representations simply as an alternative to these formalisms; it is more fruitful to view them as one way of implementing such higher-level constructs in a parallel network (the implementation of schemata is taken up in Chapter 14). The implementation matters, however. When abstract constructs like schemata or rules are implemented as patterns of activity over networks of units, they acquire emergent properties, such as content-addressability, automatic generalization, and the ability to deal in a reasonable way with situations that only partially fit stored patterns or that are entirely unexpected, and these properties do not follow from the higher-level characterizations themselves. The higher-level formalisms, on this view, provide only approximate characterizations of the behavior of the underlying distributed
representations. Thus, the contribution that an analysis of distributed representations can make to these higher-level formalisms is to legitimize certain powerful, primitive operations which would otherwise appear to be an appeal to magic; to enrich our repertoire of primitive operations beyond those which can conveniently be captured in many higher-level formalisms; and to suggest that these higher-level formalisms may only capture the coarse features of the computational capabilities of the underlying processing mechanisms.
Another common source of confusion is the idea that distributed
representations are somehow in conflict with the extensive evidence for
localization of function in the brain (Luria, 1973). A system that uses distributed representations still requires many different modules for representing completely different kinds of things at the same time. The distributed representations occur within these localized modules. For
example, different modules would be devoted to things as different as
mental images and sentence structures, but two different mental images
would correspond to alternative patterns of activity in the same module .
The representations advocated here are local at a global scale but global
at a local scale.
VIRTUES OF DISTRIBUTED REPRESENTATIONS
This section considers three important features of distributed
representations: (a) their essentially constructive character; (b) their
ability to generalize automatically to novel situations; and (c) their
tunability to changing environments . Several of these virtues are
shared by certain local models, such as the interactive activation model
of word perception, or McClelland 's ( 1981) model of generalization and
retrieval described in Chapter 1.
Memory as Inference
People have a very flexible way of accessing their memories: They
can recall items from partial descriptions of their contents (Norman &
Bobrow, 1979) . Moreover , they can do this even if some parts of the
partial description are wrong. Many people, for example, can rapidly
retrieve the item that satisfies the following partial description: It is an
actor, it is intelligent , it is a politician . This kind of content-addressable
memory is very useful, and it is very hard to implement on a conventional computer because computers store each item at a particular address, and to retrieve an item they must know its address. If all the
combinations of descriptors that will be used for access are free of errors and are known in advance, it is possible to use a method called hash coding that quickly yields the address of an item when given part of its content. In general, however, content-addressable memory requires a massive search for the item that best fits the partial description. The central computational problem in memory is how to make this search efficient. When the cues can contain errors, this is very difficult because the failure to fit one of the cues cannot be used as a filter for quickly eliminating inappropriate answers.
Distributed representations provide an efficient way of using parallel hardware to implement best-fit searches. The basic idea is fairly simple, though it is quite unlike a conventional computer memory. Different items correspond to different patterns of activity over the very same group of hardware units. A partial description is presented in the form of a partial activity pattern, activating some of the hardware units.1 Interactions between the units then allow the set of active units to influence others of the units, thereby completing the pattern, and generating the item that best fits the description. A new item is "stored" by modifying the interactions between the hardware units so as to create a new stable pattern of activity. The main difference from a conventional computer memory is that patterns which are not active do not exist anywhere. They can be re-created because the connection strengths between units have been changed appropriately, but each connection strength is involved in storing many patterns, so it is impossible to point to a particular place where the memory for a particular item is stored.
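As a concrete illustration of this pattern-completion story, here is a minimal sketch (ours, not part of the original chapter) of a Hopfield-style associative memory in Python: items are "stored" as Hebbian changes to a shared weight matrix, and a partial cue is completed by letting the units interact. The pattern size, number of items, and update schedule are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_units = 64

    # Two "items" as random +1/-1 activity patterns over the same units.
    items = rng.choice([-1, 1], size=(2, n_units))

    # "Store" each item by Hebbian modification of the shared connections;
    # every connection strength is involved in storing every pattern.
    W = np.zeros((n_units, n_units))
    for pattern in items:
        W += np.outer(pattern, pattern)
    np.fill_diagonal(W, 0)

    # Partial description: half of item 0, with the other units unknown.
    state = items[0].astype(float)
    state[n_units // 2:] = 0.0

    # Interactions among the units complete the pattern.
    for _ in range(10):
        state = np.sign(W @ state)
        state[state == 0] = 1.0

    print("recovered item 0:", bool(np.array_equal(state, items[0])))

Note that the inactive item is re-created from the connection strengths alone; there is no single place in W where item 0 is stored.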
Many people are surprised when they understand that the connections between a set of simple processing units are capable of supporting a large number of different patterns. Illustrations of this aspect of distributed models are provided in a number of papers in the literature (e.g., Anderson, 1977; Hinton, 1981a); this property is illustrated in the model of memory and amnesia described in Chapters 17 and 25.
One way of thinking about distributed memories is in terms of a very large set of plausible inference rules. Each active unit represents a "microfeature" of an item, and the connection strengths stand for plausible "microinferences" between microfeatures. Any particular pattern
1 This is easy if the partial description is simply a set of features, but it is much more difficult if the partial description mentions relationships to other objects. If, for example, the system is asked to retrieve John's father, it must represent John, but if John and his father are represented by mutually exclusive patterns of activity in the very same group of units, it is hard to see how this can be done without preventing the representation of John's father. A distributed solution to this problem is described in the text.
of activity of the units will satisfy some of the microinferences and
violate others. A stable pattern of activity is one that violates the plausible microinferences less than any of the neighboring patterns. A new
stable pattern can be created by changing the inference rules so that the
new pattern violates them less than its neighbors. This view of
memory makes it clear that there is no sharp distinction between
genuine memory and plausible reconstruction . A genuine memory is a
pattern that is stable because the inference rules were modified when it occurred before. A "confabulation" is a pattern that is stable because of the way the inference rules have been modified to store several different previous patterns. So far as the subject is concerned, this may be indistinguishable from the real thing.
The blurring of the distinction between veridical recall and confabulation or plausible reconstruction seems to be characteristic of human
memory (Bartlett , 1932; Neisser, 1981) . The reconstructive nature of
human memory is surprising only because it conflicts with the standard
metaphors we use. We tend to think that a memory system should
work by storing literal copies of items and then retrieving the stored
copy, as in a filing cabinet or a typical computer database. Such systems are not naturally reconstructive.
If we view memory as a process that constructs a pattern of activity
which represents the most plausible item that is consistent with the
given cues, we need some guarantee that it will converge on the
representation of the item that best fits the description, though it might
be tolerable to sometimes get a good but not optimal fit . It is easy to
imagine this happening, but it is harder to make it actually work . One
recent approach to this problem is to use statistical mechanics to
analyze the behavior of groups of interacting stochastic units . The
analysis guarantees that the better an item fits the description, the more
likely it is to be produced as the solution. This approach is described in Chapter 7, and a related approach is described in Chapter 6. An alternative approach, using units with continuous activations (Hopfield, 1984), is described in Chapter 14.
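To make the stochastic idea concrete, here is a small illustrative sketch (ours, with an assumed temperature parameter T) of the kind of unit analyzed in Chapter 7: a binary unit that comes on with a probability that grows with its net input, so that patterns which fit the constraints better are visited more often.

    import numpy as np

    rng = np.random.default_rng(1)

    def stochastic_sweep(W, state, T=1.0, sweeps=50):
        # Update binary units probabilistically: the better a unit's net
        # input fits the stored constraints, the more likely it is to be on.
        n = len(state)
        for _ in range(sweeps):
            for i in rng.permutation(n):
                p_on = 1.0 / (1.0 + np.exp(-(W[i] @ state) / T))
                state[i] = 1.0 if rng.random() < p_on else 0.0
        return state

Gradually lowering T during the search makes the best-fitting stable patterns increasingly likely to be the ones the network settles on.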
For unrelated patterns, however, there will be very little transfer of learning. There are many variations of the basic idea of a distributed memory (see Hinton & Anderson, 1981, for several examples).
It is possible to prevent interference altogether by using orthogonal
patterns of activity for the various items to be stored (a rudimentary
example of such a case is given in Chapter 1). However, this eliminates one of the most interesting properties of distributed representations: They automatically give rise to generalizations. If the task is simply to remember accurately a set of unrelated items, the generalization effects are harmful and are called interference. But generalization is normally a helpful phenomenon. It allows us to deal effectively with situations that are similar but not identical to previously experienced situations.
the very same set of units. It would then be hard to represent chimps and onions at the same time. This problem can be solved by using separate modules for each possible role of an item within a larger structure. Chimps, for example, are the "agent" of the liking and so a pattern representing chimps occupies the "agent" module and the pattern representing onions occupies the "patient" module (see Figure 1).
2 The internal structure of this pattern may also change. There is always a choice between changing the weights on the outgoing connections and changing the pattern itself so that different outgoing connections become relevant. Changes in the pattern itself alter its similarity to other patterns and thereby alter how generalization will occur in the future. It is generally much harder to figure out how to change the pattern that represents an item than it is to figure out how to change the outgoing connections so that a particular pattern will have the desired effects on another part of the network.
Each module can have alternative patterns for all the various items, so
this scheme does not involve local representations of items. What is
localized is the role.
If you subsequently learn that gibbons and orangutans do not like onions, your estimate of the probability that gorillas like onions will fall, though it may still remain higher than it was initially. Obviously, the
combination of facts suggests that liking onions is a peculiar quirk of
chimpanzees. A system that uses distributed representations will
automatically arrive at this conclusion, provided that the alternative patterns that represent the various apes are related to one another in a particular way that is somewhat more specific than just being similar to
one another: There needs to be a part of each complete pattern that is
identical for all the various apes. In other words, the group of units
used for the distributed representations must be divided into two
FIGURE 1. In this simplified scheme there are two different modules, one of which represents the agent and the other the patient. To incorporate the fact that chimpanzees like onions, the pattern for chimpanzees in one module must be associated with the pattern for onions in the other module. Relationships other than "liking" can be implemented by having a third group of units whose pattern of activity represents the relationship. This pattern must then "gate" the interactions between the agent and patient groups. Hinton (1981a) describes one way of doing this gating by using a fourth group of units.
subgroups, and all the various apes must be represented by the same
pattern in the first subgroup, but by different patterns in the second
subgroup. The pattern of activity over the first subgroup represents the
type of the item , and the pattern over the second subgroup represents
additional microfeatures that discriminate each instance of the type
from the other instances. Note that any subset of the microfeatures
can be considered to define a type. One subset might be common to all
apes, and a different (but overlapping) subset might be common to all
pets. This allows an item to be an instance of many different types
simultaneously.
When the system learns a new fact about chimpanzees, it usually has
no way of knowing whether the fact is true of all apes or is just a
property of chimpanzees. The obvious strategy is therefore to modify
the strengths of the connections emanating from all the active units , so
that the new knowledge will be partly a property of apes in general and
partly a property of whatever features distinguish chimps from other
apes. If it is subsequently learned that other apes do not like onions,
correcting modifications will be made so that the information about
onions is no longer associated with the subpattern that is common to all
apes. The knowledge about onions will then be restricted to the subpattern that distinguishes chimps from other apes. If it had turned out
that gibbons and orangutans also liked onions, the modifications in the
weights emanating from the subpattern representing apes would have
reinforced one another, and the knowledge would have become associated with the subpattern shared by all apes rather than with the patterns
that distinguish one ape from another.
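A hedged sketch of this account of generalization (our own toy version, not the simulation reported by Hinton, 1981a): each animal's pattern is a shared "ape" subpattern plus distinguishing microfeatures, a fact is learned by changing the weights emanating from all the active units, and the shared subpattern is what carries the fact to the other apes. The patterns and learning parameters below are illustrative assumptions.

    import numpy as np

    # Each animal's pattern = a subpattern shared by all apes plus
    # microfeatures that distinguish the instance.
    ape     = np.array([1., 1., 1., 0., 0., 0., 0., 0., 0.])
    chimp   = ape + np.array([0., 0., 0., 1., 1., 0., 0., 0., 0.])
    gorilla = ape + np.array([0., 0., 0., 0., 0., 1., 1., 0., 0.])
    gibbon  = ape + np.array([0., 0., 0., 0., 0., 0., 0., 1., 1.])

    # Weights from the animal units to a single "likes onions" unit.
    weights = np.zeros(9)

    def learn(pattern, target, lr=0.1, steps=30):
        # Modify the strengths of the connections emanating from all
        # the active units, in proportion to the remaining error.
        for _ in range(steps):
            weights[:] += lr * (target - weights @ pattern) * pattern

    learn(chimp, 1.0)                          # "chimpanzees like onions"
    print(round(float(weights @ chimp), 2))    # ~1.0: the trained fact
    print(round(float(weights @ gorilla), 2))  # ~0.6: generalization via the shared subpattern

    learn(gibbon, 0.0)                         # "gibbons do not like onions"
    print(round(float(weights @ gorilla), 2))  # falls, but remains above its initial value of 0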
A very simple version of this theory of generalization has been
implemented in a computer simulation (Hinton , 1981a) . Several applications that make use of this property can be found in Part IV of this
book.
There is an obvious generalization of the idea that the representation
of an item is composed of two parts, one that represents the type and
another that represents the way in which this particular instance differs
from others of the same type. Almost all types are themselves
instances of more general types, and this can be implemented by dividing the pattern that represents the type into two subpatterns, one for
the more general type of which this type is an instance, and the other
for the features that discriminate this particular type from other instances of the same general type. Thus the relation between a type
and an instance can be implemented by the relationship between a set
of units and a larger set that includes it. Notice that the more general
the type, the smaller the set of units used to encode it . As the number
of terms in an intensional description gets smaller, the corresponding extensional set gets larger.
In traditional semantic networks that use local representations, generalization is not a direct consequence of the representation. Given
knowledge is by changing the strengths of connections belonging to the
chimpanzee unit . But this does not automatically change connections
that belong to the gorilla unit . So extra processes must be invoked to
implement generalization in a localist scheme . One commonly used
method is to allow activation to spread from a local unit to other units
that represent similar concepts (Collins & Loftus , 1975; Quillian ,
1968) . Then when one concept unit is activated , it will partially
activate its neighbors and so any knowledge stored in the connections
emanating from these neighbors will be partially effective . There are
many variations of this basic idea (Fahlman , 1979; Levin , 1976;
McClelland , 1981) .
It is hard to make a clean distinction between systems that use local representations plus spreading activation and systems that use distributed representations. In both cases the result of activating a concept is that many different hardware units become active, and the distinction almost vanishes if one thinks in terms of the sets of units that are active for instances of a concept. The main difference is that in one case there is a particular individual hardware unit that acts as a "handle" which makes it easy to attach purely conventional properties like the name of the concept, and easier for the theorist who constructed the network to know what the individual parts of the network stand for.
oneself. But if it is the entire, distributed pattern of interacting influences among the units in the network that is doing the work, this understanding can often be illusory. Second, it seems intuitively obvious that it is harder to attach an arbitrary name to a distributed pattern than it is to attach it to a single unit. What is intuitively harder, however, is not necessarily less efficient. We will see that it is actually possible to implement arbitrary associations with fewer units using distributed representations. Before we turn to such considerations, however, we examine a different advantage of distributed representations: They make it possible to create new concepts without allocating new hardware.
Creating New Concepts
Any plausible scheme for representing knowledge must be capable of
learning novel concepts that could not be anticipated at the time the
network was initially wired up. A scheme that uses local representations must first make a discrete decision about when to form a new concept, and then it must find a spare hardware unit that has suitable connections for implementing the concept involved . Finding such a unit
may be difficult if we assume that, after a period of early development,
new knowledge is incorporated by changing the strengths of the existing
connections rather than by growing new ones. If each unit only has
connections to a small fraction of the others, there will probably not be
any units that are connected to just the right other ones to implement a
new concept. For example, in a collection of a million units each connected at random to ten thousand others, the chance of there being any
unit that is connected to a particular set of 6 others is only one in a
million .
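To spell out the arithmetic behind this estimate (our own check, consistent with the text's figures): each unit is connected to 10^4 of the 10^6 units, so the probability that a given unit is connected to one particular other unit is 10^4/10^6 = 10^-2, and the probability that it is connected to all of a particular set of 6 others is (10^-2)^6 = 10^-12. With 10^6 candidate units, the expected number of units connected to all 6 is 10^6 x 10^-12 = 10^-6, a chance of about one in a million.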
In an attempt to rescue local representations from this problem ,
several clever schemes have been proposed that use two classes of
units . The units that correspond to concepts are not directly connected
to one another. Instead, the connections are implemented by indirect
pathways through several layers of intermediate units (Fahlman, 1980;
Feldman, 1982) . This scheme works because the number of potential
pathways through the intermediate layers far exceeds the total number
of physical connections. If there are k layers of units , each of which
has a fan-out of n connections to randomly selected units in the following layer, there are n^k potential pathways. There is almost certain to be
a pathway connecting any two concept-units , and so the intermediate
units along this pathway can be dedicated to connecting those two
concept-units . However, these schemes end up having to dedicate
several intermediate units to each effective connection, and once the
dedication has occurred, all but one of the actual connections emanating from each intermediate unit are wasted. The use of several intermediate units to create a single effective connection may be appropriate
in switching networks containing elements that have units with relatively small fan-out , but it seems to be an inefficient way of using the
hardware of the brain.
The problems of finding a unit to stand for a new concept and wiring
it up appropriately do not arise if we use distributed representations.
All we need to do is modify the interactions between units so as to
create a new stable pattern of activity . If this is done by modifying a
large number of connections very slightly , the creation of a new pattern
need not disrupt the existing representations. The difficult problem is to decide on the pattern of activity that is to be used for representing the new item.
DISTRIBUTED REPRESENTATIONS THAT WORK EFFICIENTLY
In this section, we consider some of the technical details about the
implementation of distributed representations. First , we point out that
certain distributed representation schemes can fail to provide a sufficient basis for differentiating different concepts, and we point out what
is required to avoid this limitation . Then , we describe a way of using
distributed representations to get the most information possible out of a
simple network of connected units . The central result is a surprising
one: If you want to encode features accurately using as few units as
possible , it pays to use units that are very coarsely tuned , so that each
feature activates many different units and each unit is activated by
many different features . A specific feature is then encoded by a pattern
of activity in many units rather than by a single active unit , so coarse
coding is a form of distributed representation .
To keep the analysis simple , we shall assume that the units have only
two values , on and off .3 We shall also ignore the dynamics of the system because the question of interest , for the time being , is how many
units it takes to encode features with a given accuracy . We start by
considering the kind of feature that can be completely specified by giving a type (e.g., line-segment, corner, dot) and the values of some continuous parameters that distinguish it from other features of the same type (e.g., position, orientation, size). For each type of feature there is a space of possible instances. Each continuous parameter defines a dimension of the feature space, and each particular feature corresponds to a point in the space. For features like dots in a plane, the space of possible features is two-dimensional. For features like stopped, oriented edge-segments in three-dimensional space, the feature space is six-dimensional. We shall start by considering two-dimensional feature spaces and then generalize to higher dimensionalities.
Suppose that we wish to represent the position of a single dot in a
plane , and we wish to achieve high accuracy without using too many
units . We define the accuracy of an encoding scheme to be the number
of different encodings that are generated as the dot is moved a standard
distance through the space. One encoding scheme would be to divide
the units into an X group and a Y group , and dedicate each unit to
encoding a particular X or Y interval as shown in Figure 2. A given dot
would then be encoded by activity in two units , one from each group ,
and the accuracy would be proportional to the number of units used .
Unfortunately , there are two problems with this . First , if two dots have
to be encoded at the same time , the method breaks down . The two
dots will activate two units in each group , and there will be no way of
telling, from the active units, whether the dots were at (x1, y1) and (x2, y2) or at (x1, y2) and (x2, y1). This is called the binding problem. It arises because the representation does not specify what goes with what.
3 Similar arguments apply with multivalued activity levels, but it is important not to allow activity levels to have arbitrary precision because this makes it possible to represent an infinite amount of information in a single activity level. Units that transmit a discrete impulse with a probability that varies as a function of their activation seem to approximate the kind of precision that is possible in neural circuitry (see Chapters 20 and 21).
FIGURE 2. A: A simple way of using two groups of binary units to encode the position of a point in a two-dimensional space. The active units in the X and Y groups represent the x- and y-coordinates. B: When two points must be encoded at the same time, it is impossible to tell which x-coordinate goes with which y-coordinate.
y1 and the unit for y2 would both have to activate the response. There would be no way of preventing the response from being activated when the unit for x1 and the unit for y2 were both activated. This is another aspect of the binding problem since, again, the representation fails to specify what must go with what.
In a conventional computer it is easy to solve the binding problem .
We simply create two records in the computer memory . Each record
contains a pair of coordinates that go together as coordinates of one
dot, and the binding information is encoded by the fact that the two
coordinate values are sitting in the same record (which usually means
they are sitting in neighboring memory locations) . In parallel networks
it is much harder to solve the binding problem .
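For instance, a two-line sketch (ours) of the conventional solution, in which the binding is carried by the data structure itself rather than by which units are active:

    # Binding by location: each record keeps one dot's coordinates together,
    # so (x1, y1) and (x2, y2) cannot be confused with (x1, y2) and (x2, y1).
    dots = [(1.0, 5.0), (2.0, 7.0)]  # illustrative coordinate values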
Conjunctive Encoding
One approach is to set aside , in advance , one unit for each possible
combination of X and Y values . This amounts to covering the plane
with a large number of small , nonoverlapping zones and dedicating a
unit to each zone. A dot is then represented by activity in a single unit so this is a local representation. The use of one unit for each discriminable feature solves the binding problem by having units which stand for the conjunction of values on each of two dimensions. In general, to permit an arbitrary association between particular combinations of features and some output or other pattern of activation, some conjunctive representation may be required.
However, this kind of local encoding is very expensive. It is much less efficient than the previous scheme because the accuracy of pinpointing a point in the plane is only proportional to the square root of the number of units. In general, for a k-dimensional feature space, the local encoding yields an accuracy proportional to the kth root of the number of units. Achieving high accuracy without running into the binding problem is thus very expensive.
The use of one unit for each discriminable feature may be a reasonable encoding if a very large number of features are presented on each occasion, so that a large fraction of the units are active. However, it is a very inefficient encoding if only a very small fraction of the possible features are presented at once. The average amount of information conveyed by the state of a binary unit is 1 bit if the unit is active half the time, and it is much less if the unit is only rarely active.4 It would therefore be more efficient to use an encoding in which a larger fraction of the units were active at any moment. This can be done if we abandon the idea that each discriminable feature is represented by activity in a single unit.

4 The amount of information conveyed by a unit that has a probability of p of being on is -p log p - (1 - p) log(1 - p).
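Working out footnote 4's formula for a few cases (our arithmetic): with p = 0.5 it gives -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit; with p = 0.05 it gives about 0.29 bits; and with p = 0.01 only about 0.08 bits. A rarely active unit therefore conveys far less information than one that is on half the time.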
Coarse Coding
Suppose we divide the space into larger, overlapping zones and assign
a unit to each zone. For simplicity , we will assume that the zones are
circular , that their centers have a uniform random distribution
throughout the space, and that all the zones used by a given encoding
scheme have the same radius. The question of interest is how
accurately a feature is encoded as a function of the radius of the zones.
If we have a given number of units at our disposal is it better to use
large zones so that each feature point falls in many zones, or is it better
to use small zones so that each feature is represented by activity in
fewer but more finely tuned units ?
The accuracy is proportional to the number of different encodings
that are generated as we move a feature point along a straight line from
one side of the space to the other . Every time the line crosses the
boundary of a zone, the encoding of the feature point changes because
the activity of the unit corresponding to that zone changes. So the
number of discriminable features along the line is just twice the
number of zones that the line penetrates.5 The line penetrates every
zone whose center lies within one radius of the line (see Figure 3) .
This number is proportional to the radius of the zones, r , and it is also
proportional to their number , n . Hence the accuracy, a , is related to
the number of zones and to their radius as follows :
a ∝ nr.
In general, for a k-dimensional space, the number of zones whose centers lie within one radius of a line through the space is proportional to the volume of a k-dimensional hypercylinder of radius r. This volume is equal to the length of the cylinder (which is fixed) times its (k - 1)-dimensional cross-sectional area, which is proportional to r^(k-1). Hence, in k dimensions, the accuracy is proportional to n r^(k-1).
5 Problems arise if you enter and leave a zone without crossing other zone borders in between, because you revert to the same encoding as before, but this effect is negligible if the zones are dense enough for there to be many zones containing each point in the space.
FIGURE 3. The line penetrates every zone whose center lies within one radius of the line.
So, for example, doubling the radius of the zones increases by a factor of 32 the linear accuracy with which a six-dimensional feature like a stopped oriented three-dimensional edge is represented. The intuitive idea that larger zones lead to sloppier representations is entirely wrong, because distributed representations hold information much more efficiently than local ones. Even though each active unit is less specific in its meaning, the combination of active units is far more specific. Notice also that with coarse coding the accuracy is proportional to the number of units, which is much better than being proportional to the kth root of the number.
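The proportionality can also be checked numerically. The following small Monte Carlo sketch (ours; the number of zones, the line, and the radii are illustrative assumptions) counts the distinct coarse codes produced as a feature point moves along a line; the count grows roughly linearly with the radius, as the argument predicts.

    import numpy as np

    rng = np.random.default_rng(2)
    n_zones = 200

    # Random circular zones in the unit square: uniform centers, shared radius.
    centers = rng.random((n_zones, 2))

    def encodings_along_line(radius, steps=2000):
        # Count how many distinct coarse codes appear as a feature point
        # moves along a horizontal line through the space.
        xs = np.linspace(0.0, 1.0, steps)
        line = np.stack([xs, np.full(steps, 0.5)], axis=1)
        dists = np.linalg.norm(line[:, None, :] - centers[None, :, :], axis=2)
        codes = dists < radius  # which zones contain each point
        return len({tuple(row) for row in codes})

    for r in (0.05, 0.1, 0.2):
        print(f"radius {r}: {encodings_along_line(r)} distinct encodings")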
Units that respond to complex features in retinotopic maps in visual
cortex often have fairly large receptive fields . This is often interpreted
as the first step on the way to a translation invariant representation.
However, it may be that the function of the large fields is not to
achieve translation invariance but to pinpoint accurately where the
feature is!
Limitations on coarse coding. So far, only the advantages of coarse coding have been mentioned, and its problematic aspects have been ignored. There are a number of limitations that cause the coarse coding strategy to break down when the "receptive fields" become too large. One obvious limitation occurs when the fields become comparable in size to the whole space. This limitation is generally of little interest because other, more severe, problems arise before the receptive fields become this large.
Coarse coding is only effective when the features that must be
represented are relatively sparse. If many feature points are crowded
together , each receptive field will contain many features and the activity
pattern in the coarse-coded units will not discriminate between many
alternative combinations of feature points. (If the units are allowed to have integer activity levels that reflect the number of feature points falling within their fields, a few nearby points can be tolerated, but not many.) Thus there is a resolution/accuracy trade-off. Coarse coding
can give high accuracy for the parameters of features provided that
features are widely spaced so that high resolution is not also required .
As a rough rule of thumb , the diameter of the receptive fields should
be of the same order as the spacing between simultaneously present
feature points .6
The fact that coarse coding only works if the features are sparse
should be unsurprising given that its advantage over a local encoding is
that it uses the information capacity of the units more efficiently by
making each unit active more often . If the features are so dense that
the units would be active for about half the time using a local encoding ,
coarse coding can only make things worse .
A second major limitation on the use of coarse coding stems from
the fact that the representation of a feature must be used to affect other
representations . There is no point using coarse coding if the features
have to be recoded as activity in finely tuned units before they can
have the appropriate effects on other representations . If we assume
that the effect of a distributed representation is the sum of the effects
of the individual active units that constitute the representation , there is
a strong limitation on the circumstances under which coarse coding can
be used effectively . Nearby features will be encoded by similar sets of
active units , and so they will inevitably tend to have similar effects .
Broadly speaking , coarse coding is only useful if the required effect of a
feature is the average of the required effects of its neighbors . At a fine
enough scale this is nearly always true for spatial tasks . The scale at
which it breaks down determines an upper limit on the size of the
receptive fields .
coding is efficient in the connection ... We illustrate these points with a very simple example. Consider a microlanguage consisting of the three-letter words of English made up of w or l, followed by i or e, followed by g or r. The strings wig and leg are words, but weg, lig, and all strings ending in r are not. Suppose we wanted to use a distributed representation scheme as a basis for representing the words, and we wanted to be able to use the distributed pattern as a basis for deciding whether the string is a word or a nonword. For simplicity we will have a single "decision" unit. The problem is to find connections from the units representing the word to the decision unit such that it fires whenever a word is present but does not fire when no word is present.7
7 Note that the problem remains the same if the decision unit is replaced by a set of units and the task of the network is to produce a different pattern for the word and nonword decisions. For when we examine each unit, it either takes the same or a different value in the two patterns; in the cases where the value is the same, there is no problem, but neither do such units differentiate the two patterns. When the values are different, the unit behaves just like the single decision unit discussed in the text.
problem facing our network. Conjunctive distributed representations will suffice.
The scheme illustrated in the second panel of the figure provides a conjunctive distributed representation. In this scheme, there are units for pairs of letters which, in this limited vocabulary, happen to capture the combinations that are essential for determining whether a string of letters is a word or not. These are, of course, the pairs wi and le. These conjunctive units, together with direct input to the decision unit from the g unit, are sufficient to construct a network which correctly classifies all strings consisting of a w or an l, followed by an i or an e, followed by a g or r.
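A minimal sketch of such a network (our rendering; the particular weights and threshold are illustrative assumptions consistent with the scheme just described): the conjunctive units wi and le, together with the direct input from g, feed a linear threshold decision unit.

    def word_unit(string):
        # Letter units for a three-letter string from the microlanguage.
        w, l = string[0] == "w", string[0] == "l"
        i, e = string[1] == "i", string[1] == "e"
        g = string[2] == "g"

        # Conjunctive units for the two critical letter pairs.
        wi = w and i
        le = l and e

        # Decision unit: +1 from each conjunctive unit and from g, threshold 2.
        return int(wi) + int(le) + int(g) >= 2

    for s in ("wig", "leg", "weg", "lig", "wir", "ler"):
        print(s, word_unit(s))

Only wig and leg reach threshold; every other string, including all strings ending in r, leaves the decision unit silent.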
This example illustrates that conjunctive coding is often necessary if distributed representations are to be used to solve problems that might easily be posed to networks. This same point could be illustrated with many other examples; the exclusive-or problem is the classic example (Minsky & Papert, 1969). Other examples of problems requiring some sort of conjunctive encoding can be found in Hinton (1981a) and in Chapters 7 and 8. An application of conjunctive coding to a psychological model is found in Chapter 18.
Some problems (mostly very simple ones) can be solved without any conjunctive encoding at all, and others will require conjuncts of more than two units at a time. In general, it is hard to specify in advance just what "order" of conjunctions will be required. Instead, it is better to search for a learning scheme that can find representations that are adequate. The mechanisms proposed in Chapters 7 and 8 represent two steps toward this goal.
case, then they are certainly better in cases where there are underlying
regularities that can be captured by regularities in the patterns of activation on the units in one group and the units in another. A discussion
of the benefit distributed representations can provide in such cases can
be found in Chapter 18.
If we restrict ourselves to monomorphemic words , the mapping from
strings of graphemes onto meanings appears to be arbitrary in the sense
that knowing what some strings of graphemes mean does not help one
predict what a new string means.8 This arbitrariness in the mapping
from graphemes to meanings is what gives plausibility to models that
have explicit word units . It is obvious that arbitrary mappings can be
implemented if there are such units . A grapheme string activates
exactly one word unit , and this activates whatever meaning we wish to
associate with it (see Figure 5A). The semantics of similar grapheme
strings can then be completely independent because they are mediated
by separate word units . There is none of the automatic generalization
that is characteristic of distributed representations .
Intuitively , it is not at all obvious that arbitrary mappings can be
implemented in a system where the intermediate layer of units encodes
the word as a distributed pattern of activity instead of as activity in a
single local unit . The distributed alternative appears to have a serious
drawback . The effect of a pattern of activity on other representations is
the combined result of the individual effects of the active units in the
pattern . So similar patterns tend to have similar effects . It appears that
we are not free to make a given pattern have whatever effect we wish
on the meaning representations without thereby altering the effects that
other patterns have. This kind of interaction appears to make it difficult to implement arbitrary mappings from distributed representations
of words onto meaning representations . We shall now show that these
intuitions are wrong and that distributed representations of words can
work perfectly well and may even be more efficient than single word
units .
Figure 5B shows a three-layered system in which grapheme/position units feed into word-set units which, in turn, feed into semantic or sememe units. Models of this type, and closely related variants, have been analyzed by Willshaw (1981), V. Dobson (personal communication, 1984), and by David Zipser (personal communication, 1981); some further relevant analyses are discussed in Chapter 12.

8 Even for monomorphemic words there may be particular fragments that have associated meaning. For example, words starting with sn usually mean something unpleasant to do with the lips or nose (sneer, snarl, snigger), and words with long vowels are more likely to stand for large, slow things than words with short vowels (George Lakoff, personal communication). Much of Lewis Carroll's poetry relies on such effects.
For simplicity, suppose that each word activates u word-set units, that each word-set unit takes part in the representation of w different words, and that each sememe unit is part of any given word's meaning with probability p. A sememe unit that is not part of the current word's meaning will nevertheless be activated (a "false-positive" sememe) if every one of the u active word-set units happens to provide input to it. The probability of this is

    f = (1 - i)^u = [1 - (1 - p)^(w-1)]^u

where i = (1 - p)^(w-1) is the probability that a particular active word-set unit provides no input to the sememe unit. By inspection, this probability of a "false-positive" sememe reduces to zero when w is 1. Table 1 shows the value of f for various combinations of values of p, u, and w. Notice that if p is very small, f can
remain negligible even if w is quite large. This means that distributed
representations in which each word-set unit participates in the representation of many words do not lead to errors if the semantic features are
relatively sparse in the sense that each word meaning contains only a
small fraction of the total set of sememes. So the word-set units can be
fairly nonspecific provided the sememe units are fairly specific (not
shared by too many different word meanings) . Some of the entries in
the table make it clear that for some values of p, there can be a negligible chance of error even though the number of word-set units is considerably less than the number of words (the ratio of words to word-set units is w/u).
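The entries in Table 1 can be regenerated directly from the formula; here is a short sketch (ours) that does so:

    def false_positive_prob(u, w, p):
        # Probability that a sememe outside the current word's meaning
        # receives input from all u active word-set units.
        return (1.0 - (1.0 - p) ** (w - 1)) ** u

    # Spot checks against Table 1:
    print(false_positive_prob(5, 10, 0.01))   # ~4.8e-6
    print(false_positive_prob(10, 80, 0.01))  # ~0.0024
    print(false_positive_prob(5, 20, 0.1))    # ~0.48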
The example described above makes many simplifying assumptions.
For example, each word-set unit is assumed to be connected to every
relevant sememe unit . If any of these connections were missing, we
could not afford to give the sememe units a threshold equal to the
number of active word-set units . To allow for missing connections we
could lower the threshold. This would increase the false-positive error
rate, but the effect may be quite small and can be compensated by
adding word-set units to increase the specificity of the word-level
representations (Willshaw , 1981) . Alternatively , we could make each
word-set unit veto the sememes that do not occur in any of its words.
This scheme is robust against missing connections because the absence
of one veto can be tolerated if there are other vetos (V . Dobson, personal communication , 1984) .
There are two more simplifying assumptions both of which lead to an
underestimate of the effectiveness of distributed representations for the
arbitrary mapping task. First , the calculations assume that there is no
fine -tuning procedure for incrementing some weights and decrementing
others to improve performance in the cases where the most frequent
errors occur. Second, the calculations ignore cross-connections among
the sememes. If each word meaning is a familiar stable pattern of
sememes, there will be a strong "clean-up" effect which tends to
suppress erroneous sememes as soon as the pattern of activation at the
sememe level is sufficiently close to the familiar pattern for a particular
word meaning. Interactions among the sememes also provide an explanation for the ability of a single grapheme string (e.g., bank) to elicit
two quite different meanings. The bottom-up effect of the activated
word-set units helps both sets of sememes, but as soon as top-down factors give an advantage to one meaning, the sememes in the other meaning will be suppressed by competitive interactions at the sememe level (Kawamoto & Anderson, 1984).

TABLE 1

  u     w     p      f
  5     5    .01    9.5x10^-8
  5    10    .01    4.8x10^-6
  5    20    .01    0.00016
  5    40    .01    0.0036
  5    80    .01    0.049

  5     5    .1     0.0048
  5    10    .1     0.086
  5    20    .1     0.48
  5    40    .1     0.92
  5    80    .1     1.0

  5     5    .2     0.071
  5    10    .2     0.49
  5    20    .2     0.93
  5    40    .2     1.0
  5    80    .2     1.0

 10    10    .01    2.3x10^-11
 10    20    .01    2.5x10^-8
 10    40    .01    1.3x10^-5
 10    80    .01    0.0024
 10   160    .01    0.10

 10    10    .1     0.0074
 10    20    .1     0.23
 10    40    .1     0.85
 10    80    .1     1.0
 10   160    .1     1.0

 10    10    .2     0.24
 10    20    .2     0.86
 10    40    .2     1.0
 10    80    .2     1.0
 10   160    .2     1.0

 40    40    .01    2.7x10^-20
 40    80    .01    3.5x10^-11
 40   160    .01    0.00012
 40   320    .01    0.19
 40   640    .01    0.94

 40    40    .1     0.52
 40    80    .1     0.99
 40   160    .1     1.0
 40   320    .1     1.0
 40   640    .1     1.0

 40    40    .2     0.99
 40    80    .2     1.0
 40   160    .2     1.0
 40   320    .2     1.0
 40   640    .2     1.0

100   100    .01    9.0x10^-21
100   200    .01    4.8x10^-7
100   400    .01    0.16
100   800    .01    0.97

100   100    .1     1.0
100   200    .1     1.0
100   400    .1     1.0
100   800    .1     1.0

100   100    .2     1.0
100   200    .2     1.0
100   400    .2     1.0
100   800    .2     1.0

The probability, f, of a false-positive sememe as a function of the number of active word-set units per word, u, the number of words in each word-set, w, and the probability, p, of a sememe being part of a word meaning.
A simulation. As soon as there are cross-connections among the
sememe units and fine -tuning of individual weights to avoid frequent
errors, the relatively straightforward probabilistic analysis given above
breaks down. To give the cross-connections time to clean up the output, it is necessary to use an iterative procedure instead of the simple
"straight-through " processing in which each layer completely determines
the states of all the units in the subsequent layer in a single, synchronous step. Systems containing cross-connections, feedback, and asynchronous processing elements are probably more realistic, but they are
generally very hard to analyze. However, we are now beginning to discover that there are subclasses of these more complex systems that
behave in tractable ways. One example of this subclass is described in
more detail in Chapter 7. It uses processing elements that are
inherently
stochastic . Surprisingly , the use of stochastic elements
makes these networks better at performing searches , better at learning ,
and easier to analyze .
A simple network of this kind can be used to illustrate some of the
claims about the ability to " clean up " the output by using interactions
among sememe units and the ability to avoid errors by fine -tuning the
appropriate weights . The network contains 30 grapheme units , 20
word-set units, and 30 sememe units. There are no direct connections between grapheme and sememe units, but each word-set unit is connected to all the grapheme and sememe units. The grapheme units are divided into three sets of ten, and each word is encoded by activating one unit in each group of ten (units can only have activity levels of 1 or 0). The "meaning" of a word is chosen at random by selecting each sememe unit to be active with a probability of 0.2. The network shown in Figure 6 has learned to associate 20 different grapheme strings with their chosen meanings. Each word-set unit is involved in the representation of many words, and the representation of each word involves many word-set units.
The details of the learning procedure used to create this network and the search procedure which is used to settle on a set of active sememe units are described in Chapter 7. Even if a few of the word-set units are knocked out, the network may settle on a near-perfect pattern of sememes, because no single unit is crucial to the representation of any word.

The network also exhibits a kind of recovery from damage that is something of a signature of distributed representations. When it has been damaged by adding noise to every connection strength, so that its performance is substantially reduced, it can be retrained on a subset of the original word-meaning pairings. Relearning proceeds much faster than the original learning; see Chapter 7 for a geometrical argument about why retraining should be so rapid. More surprising, performance also recovers on the word-meaning pairings that were omitted from the retraining set, even though they are never shown during retraining. In one such test, the performance of the network was reduced from 99.3% correct9 to 64.3% correct by adding noise to randomly selected connection strengths; during retraining on a subset of the words, performance on the other words recovered as well, much faster than the original rate of learning would predict. This "spontaneous recovery" of unrehearsed items is a result of the use of distributed representations: All the words are encoded in the very same set of connection strengths, so the weight changes that restore the retrained words tend to restore the others too. A scheme that used a separate unit for each word would not behave in this way, and it is unclear how it could be made to do so without special further assumptions. Rapid relearning and the spontaneous recovery of unrehearsed items can therefore be taken as a qualitative signature of distributed representations.

9 Performance was 99.3% rather than 99.9% in this example because the network was forced to respond faster, so the cooperative interactions had less influence on the output.

STRUCTURED REPRESENTATIONS AND PROCESSES

In this section we consider two issues that are often taken to show the importance of structure in representations and processes: the representation of constituent structure, and the sequential focusing of processing. These issues are particularly important because some of the major insights of the artificial intelligence field concern structured representations and structured processes; the discussion gives some indication of the directions in which the distributed scheme can be extended to deal with them.
Representing Constituent Structure
Any system that attempts to implement the kinds of conceptual
structures that people use has to be capable of representing two rather
different kinds of hierarchy. The first is the " IS-A" hierarchy that
relates types to instances of those types. The second is the part/ whole
hierarchy that relates items to the constituent items that they are composed of. The most important characteristics of the IS-A hierarchy are
that known properties of the types must be " inherited " by the instances,
and properties that are found to apply to all instances of a type must
normally be attributed to the type. Earlier in this chapter we saw how
the IS-A hierarchy can be implemented by making the distributed representation of an instance include, as a subpart, the distributed representation for the type. This representational trick automatically yields the most important characteristics of the IS-A hierarchy, but the
trick can only be used for one kind of hierarchy. If we use the
part/ whole relationship between patterns of activity to represent the
type/ instance relationship between items, it appears that we cannot also
use it to represent the part/ whole relationship between items. We cannot make the representation of the whole be the sum of the representations of its parts.
The question of how to represent the relationship between an item
and the constituent items of which it is composed has been a major
stumbling block for theories that postulate distributed representations.
In the rival , localist scheme, a whole is a node that is linked by labeled
arcs to the nodes for its parts. But the central tenet of the distributed
scheme is that different items correspond to alternative patterns of
activity in the same set of units , so it seems as if a whole and its parts
cannot both be represented at the same time .
Hinton ( 1981a) described one way out of this dilemma . It relies on
the fact that wholes are not simply the sums of their parts. They are
composed of parts that play particular roles within the whole structure .
A shape, for example, is composed of smaller shapes that have a particular size, orientation , and position relative to the whole. Each constituent shape has its own spatial role, and the whole shape is composed
of a set of shape/role pairs.10 Similarly, a proposition is composed of
objects that occupy particular semantic roles in the whole propositional
10 Relationships between the various parts of an object are important as well. One can view the shape/role pairs as slots and the shapes of the various parts as the fillers of these slots. Knowledge of a whole object can then be implemented by positive interactions between the various slot-fillers, and one advantage of representing shape/role pairs explicitly is that it allows the different pairs to support each other.
Sequential Symbol Processing
If constituent structure is implemented in the way described above,
there is a serious issue about how many structures can be active at any
one time . The obvious way to allocate the hardware is to use a group
of units for each possible role within a structure and to make the pattern of activity in this group represent the identity of the constituent
that is currently playing that role. This implies that only one structure
can be represented at a time , unless we are willing to postulate multiple
copies of the entire arrangement. One way of doing this , using units
with programmable rather than fixed connections, is described in
Chapter 16. However, even this technique runs into difficulties if more than a few modules must be "programmed" at once. People do, in fact, seem to suffer from strong constraints on the number of structures of the same general type that they can process at once. The
sequentiality becomes much easier to understand if we abandon our localist predilections in favor of the distributed alternative, which uses the parallelism to give each active representation a very rich internal structure that allows the right kinds of generalization and content-addressability. There may be some truth to the notion that people are sequential symbol processors if each "symbolic representation" is identified with a
successive state of a large interactive network. See Chapter 14 for further discussion of these issues.
structures into a reduced form that allows them to be used as a less cumbersome representation, so that several identity/role combinations can be represented simultaneously within the representation of a larger structure.
SUMMARY

Distributed representations are efficient whenever there are underlying regularities which can be captured by interactions among microfeatures. By encoding each piece of knowledge as a large set of interactions, it is possible to achieve useful properties like content-addressable memory and automatic generalization, and new items can be created without having to create new connections at the hardware level. Arbitrary mappings, like the mapping between word forms and word meanings, are harder to implement in a distributed scheme, but even
for this task, distributed representations can be made fairly efficient and
they exhibit some psychologically interesting effects when damaged.
There are several difficult problems that must be solved before
distributed representations can be used effectively . One is to decide on
the pattern of activity that is to be used for representing an item . The
similarities between the chosen pattern and other existing patterns will
determine the kinds of generalization and interference that occur. The
search for good patterns to use is equivalent to the search for the underlying regularities of the domain. This learning problem is addressed in the chapters of Part II.
Another hard problem is to clarify the relationship between distributed representations and techniques used in artificial intelligence like schemas, or hierarchical structural descriptions. Existing artificial intelligence programs have great difficulty in rapidly finding the schema that best fits the current situation. Parallel networks offer the potential of rapidly applying a lot of knowledge to this best-fit search, but this potential will only be realized when there is a good way of implementing schemas in parallel networks. A discussion of how this might be done can be found in Chapter 14.
ACKNOWLEDGMENTS
This chapter is based on a technical report by the first author , whose
work is supported by a grant from the System Development Foundation . We thank Jim Anderson, Dave Ackley , Dana Ballard, Francis
Crick , Scott Fahlman, Jerry Feldman, Christopher Longuet-Higgins,
Don Norman , Terry Sejnowski, and Tim Shallice for helpful
discussions.
CHAPTER 4

PDP Models and General Issues in Cognitive Science

D. E. RUMELHART and J. L. McCLELLAND
SOME OBJECTIONS TO THE PDP APPROACH

PDP Models Are Too Weak

The one-layer perceptron. The most commonly heard objection to PDP models is a variant of the claim that PDP models cannot perform any interesting computations. One variant goes like this: "These PDP models sound a lot like perceptrons to me. Didn't Minsky and Papert show that perceptron-like models couldn't do anything interesting?"
This comment represents a misunderstanding of what Minsky and
Papert (1969) have actually shown. A brief sketch of the context in
which Minsky and Papert wrote will help clarify the situation . (See
Chapter 5 for a somewhat fuller account of this history .)
In the late 1950s and early 1960s there was a great deal of effort in
the development of self-organizing networks and similar PDP-like computational devices. The best known of these was the perceptron
developed by Frank Rosenblatt (see, for example, Rosenblatt, 1962) .
Rosenblatt was very enthusiastic about the perceptron and hopeful that it could serve as the basis both of artificial intelligence and of the modeling of the brain. Minsky and Papert, who favored a serial symbol-processing approach to artificial intelligence, undertook a very careful mathematical analysis of the perceptron in their 1969 book entitled, simply, Perceptrons.
The perceptron Minsky and Papert analyzed most closely is illustrated
in Figure 1. Such machines consist of what is generally called a retina,
an array of binary inputs sometimes taken to be arranged in a two-dimensional spatial layout; a set of predicates, a set of binary threshold
units with fixed connections to a subset of units in the retina such that
each predicate computes some local function over the subset of units to
which it is connected; and one or more decision units, with modifiable
connections to the predicates. This machine has only one layer of
modifiable connections; for this reason we will call it a one-layer perceptron.
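To make the machine concrete, here is a minimal sketch in Python of a one-layer perceptron of this kind; the function names and the particular choice of predicates are illustrative assumptions of ours, not anything specified by Minsky and Papert.

    # A binary retina feeds fixed local predicates; a single decision unit
    # carries the machine's only layer of modifiable connections.

    def make_local_predicate(indices, threshold):
        # A fixed binary threshold unit computing a local function over
        # a subset of retinal points.
        def predicate(retina):
            return 1 if sum(retina[i] for i in indices) >= threshold else 0
        return predicate

    def decide(retina, predicates, weights, bias):
        # Only weights and bias are modifiable; the predicates are fixed.
        total = sum(w * p(retina) for w, p in zip(weights, predicates))
        return 1 if total + bias > 0 else 0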
Minsky and Papert set out to show which functions can and cannot
be computed by this class of machines. They demonstrated, in particular, that such perceptrons are unable to calculate such mathematical
functions as parity (whether an odd or even number of points are on in
the retina) or the topological function of connectedness (whether all
points that are on are connected to all other points that are on either
directly or via other points that are also on) without making use of
absurdly large numbers of predicates. The analysis is extremely elegant
and demonstrates the importance of a mathematical approach to analyzing computational systems.
Minsky and Papert's analysis of the limitations of the one-layer perceptron, coupled with some of the early successes of the symbolic processing approach in artificial intelligence, was enough to suggest to a
large number of workers in the field that there was no future in
perceptron-like computational devices for artificial intelligence and cognitive psychology. The problem is that although Minsky and Papert were perfectly correct in their analysis, the results apply only to these simple one-layer perceptrons and not to the larger class of perceptron-like models. In particular (as Minsky and Papert actually conceded), it
can be shown that a multilayered perceptron system, including several
layers of predicates between the retina and the decision stage, can compute functions such as parity, using reasonable numbers of units each
computing a very local predicate. (See Chapters 5 and 8 for examples of multilayer networks that compute parity.) Similarly, it is not difficult to develop networks capable of solving the connectedness or
inside/ outside problem . Hinton and Sejnowski have analyzed a version
of such a network (see Chapter 7) .
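One standard construction of this kind can be sketched as follows; the sketch is ours, not the book's, and simply illustrates how a single layer of intermediate predicates suffices for parity, with the k-th hidden unit active when at least k retinal points are on and with alternating weights to the decision unit.

    # Parity with one hidden layer: hidden unit k fires iff at least k
    # inputs are on; alternating +1/-1 output weights signal odd parity.

    def parity_network(inputs):
        n_on = sum(inputs)
        hidden = [1 if n_on >= k else 0 for k in range(1, len(inputs) + 1)]
        # Odd-numbered hidden units excite the decision unit; even ones inhibit it.
        net = sum(h if k % 2 == 1 else -h for k, h in enumerate(hidden, start=1))
        return 1 if net > 0.5 else 0

    assert parity_network([1, 0, 1, 1]) == 1   # three points on: odd
    assert parity_network([1, 1, 0, 0]) == 0   # two points on: even

Each hidden unit here computes only a simple threshold function of its inputs, which is the sense in which the construction uses reasonable numbers of units each computing a very local predicate.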
Essentially, then, although Minsky and Papert were exactly correct in their analysis of the one-layer perceptron, the theorems don't apply to systems which are even a little more complex. In particular, they don't apply to multilayer systems nor to systems that allow feedback loops.
Minsky and Papert argued that there would not be much value to multilayer perceptrons. First, they argued that these systems are sufficiently unrestricted as to be vacuous. They pointed out, for example,
that a universal computer could be built out of linear threshold units .
Therefore, restricting consideration to machines made out of linear threshold units is no restriction at all on what can be computed.
We don't , of course, believe that the class of models sketched in
Chapter 2 is a small or restrictive class. (Nor , for that matter, are the
languages of symbol processing systems especially restrictive .) The real
issue, we believe, is that different algorithms are appropriate to different architectural designs. We are investigating an architecture in
which cooperative computation and parallelism is natural. Serial symbolic systems such as those favored by Minsky and Papert have a
natural domain of algorithms that differs from those in PDP models.
Not everything can be done in one step without feedback or layering
(both of which suggest a kind of "seriality " ) . We have been led to consider models that have both of these features. The real point is that we
seek algorithms that are as parallel as possible. We believe that such
algorithms are going to be closer in form to the algorithms which could
be employed by the hardware of the brain and that the kind of parallelism we employ allows the exploitation of multiple information sources
and cooperative computation in a natural way.
A further argument advanced by Minsky and Papert against
perceptron-like models with hidden units is that there was no indication of how such multilayer networks were to be trained. One of the appealing
features of the one-layer perceptron is the existence of a powerful
learning procedure, the perceptron convergence procedure of Rosenblatt . In Minsky and Papert's day, there was no such powerful learning
procedure for the more complex multilayer systems. This is no longer
true. Chapters 5, 6, 7, and 8 all provide schemes for learning in systems with hidden units. Indeed, Chapter 8 provides a direct generalization of the perceptron learning procedure which can be applied to arbitrary networks with multiple layers and feedback among layers. This
procedure can, in principle , learn arbitrary functions including , of
course, parity and connectedness.
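For comparison, the perceptron convergence procedure itself is simple enough to state in a few lines. The sketch below assumes binary targets and illustrative names of our own choosing, and the procedure converges only when the classes are linearly separable.

    # Rosenblatt's perceptron convergence procedure: nudge the modifiable
    # weights whenever the decision unit's output disagrees with the target.

    def train_perceptron(samples, n_features, epochs=100, lr=1.0):
        weights, bias = [0.0] * n_features, 0.0
        for _ in range(epochs):
            for features, target in samples:
                out = 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0
                error = target - out          # -1, 0, or +1
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
        return weights, bias

    # Logical OR is linearly separable, so the procedure converges on it;
    # parity, as discussed above, is not, and needs hidden units.
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
    weights, bias = train_perceptron(data, n_features=2)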
The problem of stimulus equivalence. A second problem with early
PDP models - and one that is not necessarily completely overcome by multilayer systems - is the problem of invariance or stimulus equivalence.
An A is an A is an A , no matter where on the retina it appears or how
large it is or how it is oriented; and people can, in general, recognize
patterns rather well despite various transformations . It has always
seemed elegant and natural to imagine that an A , no matter where it is
presented, is normalized and then processed for recognition using
stored knowledge of the appearance of the letter (Marr, 1982; Neisser,
1967) .
In conventional computer programs this seems to be a rather
straightforward matter requiring , first , normalization of the input , and,
second, recognition of the normalized pattern. The inability to cope with such transformations was among the criticisms of perceptrons leveled at them, since recognition seems to require a response to properties of the whole.

While it is certainly true that certain PDP models lack explicit attentional mechanisms, it is far from true that PDP mechanisms are incompatible with normalization or attention. Hinton (1981b) has described a PDP mechanism that addresses this problem. His scheme, illustrated in Figure 2, is restricted for simplicity to a one-dimensional rotational transformation: the connection from each retinocentric unit to the central unit that corresponds to it under a given rotation is gated by a mapping unit standing for that rotation.
Letter Units
Canonical Feature Units
Mapping Units
Retinocentric Feature Units

FIGURE 2. Hinton's (1981b) scheme for mapping retinocentric feature units onto canonical feature units. Each mapping unit stands for one rotational transformation of the retinal pattern (in 90° steps) and gates, via multiplicative connections, the excitatory flow of activation from each retinocentric feature unit to the canonical feature unit that corresponds to it under that transformation; the canonical feature units in turn feed the letter units.
translational mappings, in addition to the rotational mappings shown here, it would be possible to focus the attention of the system successively on each of several different patterns merely by changing the mapping. Thus it would not be difficult to implement a complete system for sequential processing of a series of patterns using Hinton's scheme (a number of papers have proposed mechanisms for performing a set of operations in sequence, including Grossberg, 1978, and Rumelhart & Norman, 1982; the latter paper is discussed in Chapter 1).
So far, we have described what amounts to a PDP implementation of
a conventional pattern recognition system. First , map the pattern into
the canonical frame of reference, then recognize it . Such is the procedure advocated, for example, by Neisser ( 1967) and Marr ( 1982) .
The demonstration shows that PDP mechanisms are in fact capable of
normalization and of focusing attention successively on one pattern
after another.
But the demonstration may also seem to give away too much. For it
seems to suggest that the PDP network is simply a method for implementing standard sequential algorithms of pattern recognition. We seem to be left with the question, what has the PDP implementation
added to our understanding of the problem ?
It turns out that it has added something very important . It allows us
to begin to see how we could solve the problem of recognizing an input
pattern even in the case where we do not know in advance either what
the pattern is or which mapping is correct. In a conventional sequential
algorithm, we might proceed by serial search, trying a sequence of mappings and looking to see which mapping resulted in the best recognition performance. With Hinton's mapping units, however, we can actually
perform this search in parallel. To see how this parallel search would
work, it is first necessary to see how another set of multiplicative connections can be used to choose the correct mapping for a pattern given
both the retinal input and the correct central pattern of activation .
In this situation, the simultaneous activation of a central feature and
a retinal feature constitutes evidence that the mapping that connects
them is the correct mapping. We can use this fact to choose the mapping by allowing central and retinal units that correspond under a particular mapping to project to a common multiplicative connection on
the appropriate mapping unit. Spurious conjunctions will of course occur, but the correct mapping units will generally receive more conjunctions of canonical and retinal features than any other (unless there
is an ambiguity due to a symmetry in the figure ) . If the mapping units
compete so that the one receiving the most excitation is allowed to win ,
the network can settle on the correct mapping.
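The parallel choice of a mapping can be sketched as follows; the representation of mappings as correspondence tables and all the names here are our own illustrative assumptions.

    # Each mapping unit sums multiplicative conjunctions of the retinal and
    # central features it would place in correspondence; a winner-take-all
    # competition then settles on the most-supported mapping.

    def mapping_evidence(retinal, central, correspondence):
        # correspondence maps a retinal feature index to the central feature
        # index paired with it under this mapping; evidence counts co-active pairs.
        return sum(retinal[i] * central[j] for i, j in correspondence.items())

    def choose_mapping(retinal, central, mappings):
        scores = {name: mapping_evidence(retinal, central, table)
                  for name, table in mappings.items()}
        return max(scores, key=scores.get)   # the most-excited mapping unit wins

The point of the sketch is that all the candidate mappings accumulate their evidence simultaneously; the serial search of the conventional algorithm becomes a single parallel competition.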
We are now ready to see how it may be possible to simultaneously
settle on a mapping and a central representation using both sets of
multiplicative connections at once. The input, through the partially active mappings, produces partial activation of a pattern over the central units; mutual reinforcement between this central pattern and higher level detectors for the pattern then reinforces the elements of the pattern relative to the noise. This process by itself allows the central units to work together with the input pattern to support the correct mapping over the other partially active mappings via the multiplicative connections. Top-down inputs from higher level detectors can help (they facilitate performance), but are not, in general, essential unless the input is in fact ambiguous without them.
Hinton's mapping scheme allows us to make two points. First, that parallel distributed processing is in fact compatible with normalization and focusing of attention; and second, that a PDP implementation of a normalization mechanism can actually produce a computational advantage by allowing what would otherwise be a painful, slow, serial search to be carried out in a single settling of a parallel network. In general, Hinton's mapping system illustrates that PDP mechanisms are not restricted to fixed computations but are quite clearly capable of modulation and control by signals arising from other parts of an integrated processing system; and that they can, when necessary, be used to implement a serial process, in which each of several patterns is considered one at a time.

The introduction of multiplicative or contingent connections (Feldman & Ballard, 1982) is a way of greatly increasing the power of PDP
networks of fixed numbers of units (Marr, 1982; Poggio & Torre, 1978; see Chapter 10). It means, essentially, that each unit can perform computations as complex as those that could be performed by an entire one-layer perceptron, including both the predicates and the decision unit. However, it must also be noted that multiplicative connections are
Recursion. There are many other specific points that have been raised with respect to existing PDP models. Perhaps the most common
one has to do with recursion . The ability to perform recursive function
calls is a major feature of certain computational frameworks , such as
augmented transition network (ATN) parsers (Woods, 1973; Woods & Kaplan, 1971), and is a property of such frameworks that gives them the capability of processing recursively defined structures such as sentences, in which embedding may produce dependencies between elements of a surface string that are indefinitely far removed from each
other (Chomsky, 1957). It has often been suggested that PDP mechanisms lack the capacity to perform recursive computations and so are simply incapable of providing mechanisms for processing sentences
and other recursively defined structures .
As before , these suggestions are simply wrong . As we have already
seen , one can make an arbitrary computational machine out of linear
threshold units , including , for example , a machine that can carry out all
the operations necessary for implementing a Turing machine; the one
limitation is that real biological systems cannot be Turing machines
because they have finite hardware . In Chapter 14, however , we point
out that with external memory aids (such as paper and pencil and a
notational system ) such limitations can be overcome as well .
We have not dwelt on PDP implementations of Turing machines and
recursive processing engines because we do not agree with those who
would argue that such capabilities are of the essence of human computation. As anyone who has ever attempted to process sentences like
" The man the boy the girl hit kissed moved " can attest , our ability to
process even moderate degrees of center -embedded structure is grossly
impaired relative to that of an ATN parser. And yet, the human ability
to use semantic and pragmatic contextual information
to facilitate
comprehension far exceeds that of any existing sentence processing
machine we know of. What is needed is not a mechanism for flawless and effortless processing of embedded structures, but a kind of mechanism which facilitates the simultaneous consideration of many different sources of information in a natural way.
This challenge is one that has not yet been fully met . However ,
some initial steps toward a PDP model of language processing are described in Chapter 19. The model whose implementation is
described in that chapter provides an account of how a network can use multiple, simultaneous constraints to assign underlying roles to the constituents of sentences, and indicates how the framework might be extended to sentences involving embedded clauses. We do not claim to have solved the problems of sentence processing, but we have begun to make progress on them, and the limitations of these models appear in a variety of ways to be roughly consistent with the limitations of human sentence processing.

A different kind of objection is that our models are not cognitive: because the knowledge in a PDP model resides in the connections among simple processing units rather than in explicit rules, it is argued, such models cannot account for the apparently rule-governed character of thought and language. Our answer requires distinguishing between behavior that is describable by rules and behavior that is generated through the explicit application of rules. The behavior of a bouncing ball or an orbiting planet is lawful, but neither the ball nor the planet applies rules to produce its behavior; the regularities emerge from interactions among the underlying elements. We believe that much of cognition has this character. Our earlier work on word perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) and more recent explorations of the learning of English verb morphology (see Chapter 18) indicate how lawful, apparently rule-governed behavior can emerge from the cooperative interactions of large numbers of simple units, with no unit or connection corresponding to an explicit rule and no process corresponding to the firing of a production. To the question of whether there is a rule somewhere in such a model to which one could point and answer "yes," these models give no simple answer; rather, the regularities of behavior are attributed to the microstructure of distributed processing. Not everything that has been done in cognitive science through the study of explicit rules could be done in this way, but these demonstrations indicate, we believe, that the discovery of regularities in behavior need not be viewed as the discovery of rules that the mind explicitly applies.
A related claim that some people have made is that our models
appear to share much in common with behaviorist accounts of
behavior. While they do involve simple mechanisms of learning, there
is a crucial difference between our models and the radical behaviorism
of Skinner and his followers . In our models, we are explicitly concerned with the problem of internal representation and mental processing, whereas the radical behaviorist explicitly denies the scientific utility
and even the validity of the consideration of these constructs. The
training of hidden units is, as is argued in Chapters 5 to 8, the construction of internal representations. The models described throughout
the book all concern internal mechanisms for activating and acquiring
the ability to activate appropriate internal representations. In this
sense, our models must be seen as completely antithetical to the radical
behaviorist program and strongly committed to the study of representation and process.
TABLE 1

THE THREE LEVELS AT WHICH ANY MACHINE CARRYING OUT INFORMATION PROCESSING TASKS MUST BE UNDERSTOOD

Computational Theory
Representation and Algorithm
Hardware Implementation
or so it would seem (Feldman & Ballard, 1982).

In short, the claim that our models address the wrong level of description is based on a failure to acknowledge that the psychological level is the primary level of description to which much psychological theorizing is directed, even in Marr's scheme. Our models address a fundamentally different, psychological level of description, and at this level they must be considered as competitors of other models in accounting for psychological data.

Many of our colleagues have challenged our approach with a rather different conception of levels borrowed from the analysis of levels of programming languages. It might be argued that a theory such as, say, schema theory or the ACT* model of John R. Anderson (1983) is a statement in a "higher level" language analogous, let us say, to the Pascal or LISP programming languages, and that our distributed model is a statement in a "lower level" theory that is, let us say, analogous to the assembly code into which higher level programs can be compiled. Both Pascal and assembler, of course, are considerably above the hardware level, though the latter may in some sense be closer to the hardware and more machine dependent than the former.
From this point of view one might ask why we are mucking
around
trying to specify our algorithms
at the level of assembly code when we
could state them more succinctly
in a high - level language . We believe
that most people who raise the levels issue with regard to our models
have a relationship
something
like this in mind . People who adopt this
notion
have no objection
to our models . They only believe
that
psychological
models are more simply and easily stated in an equivalent
higher level language - so why bother ?
We believe that the programming language analogy is very misleading, unless it is analyzed more carefully. The relationship between a
between a
Pascal program and its assembly code counterpart
is very special indeed .
It is necessary for the Pascal and assembly language to map exactly onto
one another
only when the program
was written in Pascal and the
assembly code was compiled from it. Given
that most of the programming that might be taking place in the brain is
taking place at a "lower level " rather than a "higher level ," it seems
unlikely that some particular higher level description will be identical to
some particular lower level description. We may be able to capture the
actual code approximately in a higher level language- and it may often
be useful to do so - but this does not mean that the higher level
language is an adequate characterization.
There is still another notion of levels which illustrates our view.
This is the notion of levels implicit in the distinction between
Newtonian mechanics on the one hand and quantum theory on the
other.3 It might be argued that conventional symbol processing models
are macroscopic accounts, analogous to Newtonian mechanics, whereas
our models offer more microscopic accounts, analogous to quantum
theory. Note that over much of their range, these two theories make
precisely the same predictions about behavior of objects in the world .
Moreover , the Newtonian theory is often much simpler to compute
with since it involves discussions of entire objects and ignores much of
their internal structure . However, in some situations Newtonian theory
breaks down. In these situations we must rely on the microstructural
account of quantum theory . Through a thorough understanding of the
relationship between the Newtonian mechanics and quantum theory we
can understand that the macroscopic level of description may be only an
approximation to the more microscopic theory . Moreover , in physics,
we understand just when the macrotheory will fail and the microtheory
must be invoked . We understand the macrotheory as a useful formal
tool by virtue of its relationship to the microtheory . In this sense the
objects of the macrotheory can be viewed as emerging from interactions
of the particles described at the microlevel .
The basic perspective of this book is that many of the constructs of
macrolevel descriptions such as schemata, prototypes, rules, productions, etc. can be viewed as emerging out of interactions of the
microstructure of distributed models. These points are most explicitly
considered in Chapters 6, 14, 17, and 18. We view macrotheories as
approximations to the underlying microstructure which the distributed
model presented in our paper attempts to capture. As approximations
they are often useful, but in some situations it will turn out that an
examination of the microstructure may bring much deeper insight .
Note for example, that in a conventional model of language acquisition,
one has to make very delicate decisions about the exact circumstances
under which a new rule will be added to the rule system. In our PDP models no such decision need be made. Since the analog to a rule is
3 This analogy was suggested to us by Paul Smolensky.
not necessarily discrete but simply something that may emerge from
interactions among an ensemble of processing units, there is no problem with having the functional equivalent of a "partial" rule. The same
observation applies to schemata (Chapter 14) , prototypes and logogens
(Chapter 18) , and other cognitive constructs too numerous to mention .
Thus , although we imagine that rule -based models of language
acquisition - the logogen model , schema theory , prototype theory , and
other macrolevel theories - may all be more or less valid approximate
macrostructural
descriptions , we believe that the actual algorithms
involved cannot be represented precisely in any of those macrotheories .
It may also be, however , that some phenomena are too complex to
be easily represented as PDP models. If these phenomena took place at
a time frame over which a macrostructural model was an adequate
approximation , there is no reason that the macrostructural model ought
not be applied . Thus , we believe that the concepts of symbols and
symbol processing can be very useful . Such models may sometimes
offer the simplest accounts . It is , however , important to keep in mind
that these models are approximations and should not be pushed too far .
We suspect that when they are, some account similar to our PDP account will again be required. Indeed, a large part of our own motivation for exploring the PDP approach came from the failure of schema
theory to provide an adequate account of knowledge application even to
the task of understanding very simple stories .
Lest it may seem that we have given too much away , however , it
should be noted that as we develop clearer understandings of the
microlevel models, we may wish to formulate rather different macrolevel models. As pointed out in Chapter 3, PDP mechanisms provide a powerful alternative set of macrolevel primitives.4
Imagine a computational system that has as a primitive , " Relax into a
state that represents an optimal global interpretation
of the current
input ." This would be, of course , an extremely powerful place to begin
building up a theory of higher level computations . Related primitives
would be such things as " Retrieve the representation in memory best
matching the current input , blending into it plausible reconstructions of
details missing from the original memory trace ," and " Construct a
dynamic configuration of knowledge structures that captures the present
situation, with variables instantiated properly." These sorts of primitives would be unthinkable in most conventional approaches to higher
level cognition, but they are the kinds of emergent properties that PDP
mechanisms give us , and it seems very likely that the availability of
4 We thank Walter Schneider for stressing in his comments on an earlier draft of this chapter the importance of the differences between the computational primitives offered by PDP and those offered by other formalisms for modeling cognitive processes.
such primitives will change the shape of higher level theory considerably.
PDP mechanisms may also place some constraints on what we might
realistically ask for in the way of computational primitives because of
the costs of implementing
certain kinds of computations in parallel
hardware in a single relaxation search. The parallel matching of variablized productions is one case in point. Theories such as ACT* (J. R.
Anderson , 1983) assume that this can be done without worrying about
the implementation and , therefore , provide no principled accounts of
the kinds of crosstalk exhibited in human behavior when processing
multiple patterns simultaneously . However , it appears to be a quite
general property of PDP mechanisms that they will exhibit crosstalk
when processing multiple patterns in parallel (Hinton & Lang , 1985;
Mozer , 1984; see Chapters 12 and 16) .
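A toy example makes the point; the patterns and the Hebbian storage rule below are our own illustration, not the ACT* matching mechanism.

    # Two associations stored in one linear associator interfere when their
    # input patterns share features: retrieving one drags in some of the other.

    a_in, a_out = [1, 0, 1, 0], [1, 1, 0, 0]
    b_in, b_out = [0, 1, 1, 0], [0, 0, 1, 1]   # shares input feature 2 with a_in

    # Hebbian (outer-product) storage of both associations in one weight matrix.
    W = [[a_out[i] * a_in[j] + b_out[i] * b_in[j] for j in range(4)]
         for i in range(4)]

    response = [sum(W[i][j] * a_in[j] for j in range(4)) for i in range(4)]
    print(response)   # [2, 2, 1, 1]: mostly a_out, contaminated by b_out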
High -level languages often preserve some of the character of the
lower level mechanisms that implement them , and the resource and
time requirements of algorithms depend drastically on the nature of
the underlying hardware . Higher level languages that preserve the
character of PDP mechanisms and exploit the algorithms that are effective descriptions of parallel networks are not here yet, but we expect
such things to be coming along in the future . This will be a welcome
development , in our view , since certain aspects of cognitive theory
have been too strongly influenced by the discrete , sequential algorithms
available for expression in most current high -level languages .
As we look closely, both at the hardware in which cognitive algorithms are implemented and at the fine structure of the behavior that
these algorithms are designed to capture , we begin to see why it may be
appropriate to formulate models which come closer to describing the
microstructure of cognition . The fact that our microstructural models
can account for many of the facts about the representation of general
and specific information , for example , as discussed in Chapter 18,
makes us ask why we should view constructs like logogens , prototypes ,
and schemata as anything other than convenient approximate descriptions of the underlying structure of memory and thought.
Reductionism
A slightly different, though related, argument is that the PDP enterprise is an exercise in reductionism - an exercise in which all of
psychology is reduced to neurophysiology and ultimately to physics . It
is argued that coherent phenomena which emerge at any level (psychology or physics or sociology) require their own language of description
and explanation and that we are denying the essence of what is cognitive by reducing it to units and connections rather than adopting a more
psychologically relevant language in our explanations.
We do not classify our enterprise as reductionist , but rather as
interactional . We understand that new and useful concepts emerge at
different levels of organization. We are simply trying to understand the
essence of cognition as a property emerging from the interactions of
connected units in networks.
We certainly believe in emergent phenomena in the sense of
phenomena which could never be understood or predicted by a study of
the lower level elements in isolation . These phenomena are functions
of the particular kinds of groupings of the elementary units . In general,
a new vocabulary is useful to talk about aggregate phenomena rather
than the characteristics of isolated elements. This is the case in many
fields . For example, we could not know about diamonds through the
study of isolated atoms; we can't understand the nature of social systems through the study of isolated individuals; and we can't understand the behavior of networks of neurons from the study of isolated neurons. A feature such as the hardness of the diamond is understandable
through the interaction of the carbon atoms and the way they line up.
The whole is different than the sum of the parts. There are nonlinear
interactions among the parts. This does not , however, suggest that the
nature of the lower level elements is irrelevant to the higher level of
organization- on the contrary, the higher level is, we believe, to be
understood primarily through the study of the interactions among lower
level units . The ways in which units interact is not predictable from the
lower level elements as isolated entities . It is, however, predictable if
part of our study involves the interactions among these lower level
units . We can understand why diamonds are hard, not as an isolated
fact, but because we understand how the atoms of carbon can line up to
form a perfect lattice. This is a feature of the aggregate, not of the
individual atom, but the features of the atom are necessary for understanding the aggregate behavior. Until we understand that, we are left
with the unsatisfactory statement that diamonds are hard, period. A
useful fact, but not an explanation. Similarly , at the social level, social
organizations cannot be understood without understanding the
individuals which make up the organization. Knowing about the
individuals tells us little about the structure of the organization, but we
can't understand the structure of the higher level organizations without
knowing a good deal about individuals and how they function . This is
the sense of emergence we are comfortable with . We believe that it is
entirely consistent with the POP view of cognition .
There is a second, more practical reason for rejecting radical reductionism as a research strategy. This has nothing to do with emergence;
it has to do with the fact that we can' t know everything and find out
everything at once. The approach we have been arguing for suggests
that to understand something thoroughly at some level requires
knowledge at that level, plus knowledge of the lower levels. Obviously ,
this is impractical. In practice, even though there might be effects of
lower levels on higher levels, one cannot always know them . Thus,
attempting to formulate a description at this higher level as a first order
of approximation is an important research strategy. We are forced into
it if we are to learn anything at all. It is possible to learn a good deal
about psychology without any reference whatsoever to any lower levels.
This practical strategy is not , however, an excuse for ignoring what is
known about the lower levels in the formulation of our higher level
theories. Thus, the economist is wrong to ignore what we might know
about individuals when formulating his theories. The chemist would be
wrong to ignore what is known about the structure of the carbon atom
in explaining the hardness of diamonds. We argued above that the
view that the computational level is correct derives from experience
with a very special kind of device in which the higher level was designed
to give the right answers- exactly. In describing natural intelligence
that can't , we suspect, be right - exactly. It can be a first order of
approximation . As we learn more about a topic and as we look at it in
more and more detail we are going to be forced to consider more and
more how it might emerge (in the above sense) from the interactions
among its constituents. Interaction is the key word here. Emergent
properties occur whenever we have nonlinear interactions. In these
cases the principles of interaction themselves must be formulated and
the real theory at the higher level is, like chemistry , a theory of interactions of elements from a theory one level lower.
believe that neuroscientists can be guided in their bottom -up search for
an understanding of how the brain functions .
We agree with many of these sentiments. We believe that an understanding of the relationships between cognitive phenomena and brain
functions will slowly evolve. We also believe that cognitive theories can
provide a useful source of information for the neuroscientist. We do
not , however, believe that current knowledge from neuroscience provides no guidance to those interested in the functioning of the mind .
We have not , by and large, focused on the kinds of constraints which
arise from detailed analysis of particular circuitry and organs of the
brain. Rather we have found that information concerning brain-style
processing has itself been very provocative in our model building
efforts . Thus, we have, by and large, not focused on neural modeling
(i .e., the modeling of neurons) , but rather we have focused on neurally
inspired modeling of cognitive processes. Our models have not
depended strongly on the details of brain structure or on issues that are
very controversial in neuroscience. Rather, we have discovered that if
we take some of the most obvious characteristics of brain-style processing seriously we are led to postulate models which differ in a number of
important ways from those postulated without regard for the hardware
on which these algorithms are to be implemented . We have found that
top-down considerations revolving about a need to postulate parallel,
cooperative computational models (cf . Rumelhart , 1977) have meshed
nicely with a number of more bottom-up considerations of brain-style processing.
There are many brain characteristics which ought to be attended to in
the formulation of our models (see Chapters 20 and 21) . There are a
few which we have taken most seriously and which have most affected
our thinking . We discuss these briefly below.
Neurons are slow. One of the most important characteristics of
brain-style processing stems from the speed of its components. Neurons are much slower than conventional computational components.
Whereas basic operations in our modern serial computers are measured
in the nanoseconds, neurons operate at times measured in the
milliseconds - perhaps tens of milliseconds. Thus, the basic hardware of the brain is some 10^6 times slower than that of serial computers. Imagine slowing down our conventional AI programs by a factor of 10^6.
More remarkable is the fact that we are able to do very sophisticated
processing in a few hundred milliseconds. Clearly, perceptual processing, most memory retrieval, much of language processing, much intuitive reasoning, and many other processes occur in this time frame.
That means that these tasks must be done in no more than 100 or so
serial steps. This is what Feldman (1985) calls the 100-step program
constraint. Moreover , note that individual neurons probably don't compute very complicated functions . It seems unlikely that a single neuron
computes a function much more complex than a single instruction in a
digital computer . Imagine, again, writing an interesting program in
even 1000 operations of this limited complexity of a serial computer .
Evidently, the brain succeeds through massive parallelism. Thus, we
conclude, the mechanisms of mind are most likely best understood as
resulting from the cooperative activity of very many relatively simple
processing units operating in parallel.
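The arithmetic behind the constraint is worth making explicit; the particular numbers below are only order-of-magnitude assumptions.

    # Neurons take a few milliseconds per basic operation, yet perception and
    # retrieval finish within a few hundred milliseconds: at most ~100 serial steps.
    neuron_step = 3e-3        # seconds per neural "operation" (order of magnitude)
    task_duration = 0.3       # seconds for perception, retrieval, etc.
    print(task_duration / neuron_step)   # ~100: the 100-step program constraint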
There is a very large number of processing units. Another important feature of brain-style processing derives from the scale of the brain. Although the estimates vary somewhat, the human brain contains on the order of 10^10 to 10^11 neurons, very large numbers of which are active at once; this can be contrasted with conventional digital computers, in which only a small number of processors make decisions at any moment. This self-evident fact suggests models in which computation is the cooperative product of very many simple processing units working in parallel, and it gives us another kind of constraint for evaluating the plausibility of our models: they should scale to very large numbers of units.

Neurons receive inputs from and send outputs to very many other neurons. A related aspect of brain-style processing is the degree of connectivity, or fan-in and fan-out. A cortical neuron receives inputs at synapses on its dendrites from on the order of 1,000 to 100,000 other neurons and sends outputs to a similarly large number, whereas the logic circuits of digital computers involve fan-in and fan-out measured in the tens. This suggests that the computation a unit performs is statistical rather than logical: a neuron does not make decisions for the system on the basis of single action potentials arriving from single neighbors, but on the statistics of the large numbers of inputs it receives, and this kind of parallelism pushes the system toward stability, since no single unit's immediate behavior is essential. (See Chapters 12 and 20 for discussion of these estimates and of the limits that connectivity places on some simple models.) Large fan-in and fan-out also limits the depth of processing required. If we assume that every
cortical neuron is connected to 1,000 other neurons and that the system
forms a lattice, all of the neurons in the brain would be within , at most,
four synapses from one another. Thus, large fan-in and fan-out leads
to shallow networks. It should finally be noted that even though the
fan-in and fan-out is large, it is not unlimited . As described in Chapter
12, the limitations can cause problems for extending some simple ideas
of memory storage and retrieval .
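The arithmetic behind the shallow-network point is simple; the fan-out figure below is the illustrative one used above.

    # With fan-out 1,000 in a lattice, the neurons reachable within k synapses
    # grow as 1000^k, exceeding the ~10^10-10^11 neurons of the brain by k = 4.
    fan_out = 1000
    for k in range(1, 5):
        print(k, fan_out ** k)   # 10^3, 10^6, 10^9, 10^12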
Learning involves modifying connections. Another key feature of our models is the assumption that the knowledge of the system is in its connections and that learning involves modifying the strengths of those connections. The learning mechanisms in our models are generally simple procedures for incrementally modifying connection strengths rather than procedures for storing symbols, and this allows knowledge to develop gradually from experience. (See Chapters 5, 6, 7, 8, 11, 17, 18, 24, and 25 for learning procedures of this kind and some of their implications.)

Neurons communicate by excitation and inhibition. Another feature of our models derives from the view that the basic currency of neural computation is simple excitation and inhibition rather than the passing of arbitrary symbolic messages. Unlike systems such as Hewitt's (1975) ACTOR systems, in which arbitrary symbolic messages are passed among processing units, the units in our models communicate only simple activation values of limited precision - a few bits per message. This simplicity and homogeneity means that, to the degree that symbols and symbolic processing are required, they must emerge from the interactions among units at a subsymbolic level rather than be assumed as computational primitives (Hofstadter, 1979).
Reciprocal connectivity has also shaped our thinking. Interactive processing, as in our work on word perception, is a feature of many of our models, and the error propagation learning scheme of Chapter 8 requires an error signal to be propagated back through the network. In general, reciprocally interacting systems are very important in our models. We have also taken note of the view that connections between systems are excitatory and those within a region are inhibitory.
This is employed to advantage in Chapters 5 and 15.
The geometric structure of connections in the brain has not had
much impact on our work . We generally have not concerned ourselves
with where the units might physically be with respect to one another .
However, if we imagine that there is a constraint toward the conservation of connection length (which there must be), it is easy to see that
those units which interact most should be the closest together . If you
add to this the view that the very high -dimensional space determined
by the number of interconnections must be embedded into the two - or
three -dimensional space (perhaps two and a half dimensions ) of the
cortex , we can see the importance of mapping the major dimensions
physically in the geometry of the brain (see Ballard, in press, for a discussion of embedding high-dimensional spaces into two dimensions).
Graceful degradation with damage and overload.
From the study of brain lesions and other forms of brain damage , it
seems fairly clear there is not some single neuron whose functioning is
essential for the operation of any particular cognitive process . While
reasonably circumscribed regions of the brain may play fairly specific
roles , particularly at lower levels of processing , it seems fairly clear that
within regions, performance is characterized by a kind of graceful degradation in which the system's performance gradually deteriorates as more
and more neural units are destroyed , but there is no single critical point
at which it fails. Indeed, if any single neuron was not used, there would be no effect on the system.
Distributed control of the general flow of processing is another characteristic of the brain. In conventional programming frameworks it is easy to imagine an executive system which calls subroutines to carry out its necessary tasks, and in some information processing models this notion of an executive is retained.
Neuropsychological investigation of patients with brain damage indicates that there is no part of the cortex on whose operation all other
parts depend. Rather it seems that all parts work together, influencing
one another, and each region contributes its own kind of processing of the information. To be sure, brainstem mechanisms control
vital bodily functions and the overall state of the system , and certain
parts of the cortex are critical for receiving information in particular
modalities. But higher level functions seem very much to be characterized by distributed, rather than central, control.
This point has been made most clearly by the Russian neuropsychologist Luria (1966, 1973). Luria's investigations show that for every
integrated behavioral function (e.g., visual perception, language
comprehension or production, problem solving, reading), many different parts of the cortex play a role, so that damage to any part influences performance but is not absolutely crucial to it. Even the frontal
lobes , most frequently associated with executive functions , are not
absolutely necessary in Luria ' s view , in that some residual function is
generally observed even after massive frontal damage (and mild frontal
damage may result in no detectable symptomatology at all). The frontal lobes have a characteristic role to play, facilitating strategy shifts and
inhibiting impulsive responding, but the overall control of processing
can be as severely impaired by damage to parietal lobe structures as by damage to the frontal lobes.
We have come to believe that the notion of subroutines, with
one system " calling " another is probably not a good way to view the
operation of the brain. Rather, we believe that subsystems may modulate the behavior of other subsystems, that they may provide constraints
to be factored into the relaxation computation . An elaboration of some
aspects of these ideas may be found in Chapter 14.
As can be seen , this list does not depend on specific discoveries from
neuroscience . Rather , it depends on rather global considerations .
Although none of these general properties of the brain tell us in any
detail how the brain functions to support cognitive phenomena ,
together they lead to an understanding of how the brain works that
serves as a set of constraints on the development of models of cognitive processes. We find that these assumptions, together with those
that derive from the constraints imposed by the tasks we are trying to
PDP Models Lack Neural Realism
There may, of course, be some aspect of neural function that would make the difference between an accurate and an inaccurate account of the behavioral data. Beyond this, there are a number of other facts from neuroscience that we have not included in most of our models, but that we imagine will be important when we learn how to include them. One of these is
the fact that we normally assume that units communicate via numbers.
These are sometimes associated with mean firing rates. In fact, of
course, neurons produce spikes and this spiking itself may have some
computational significance (see Chapters 7 and 21 for discussions of the
possible computational significance of neural spiking). Another example of possibly important facts of neuroscience which have not played a
role in our models is the diffuse pattern of communication which
occurs by means of the dispersal of chemicals into various regions of
the brain through the blood stream or otherwise. We generally assume
that communication is point -to-point from one unit to another. However, we understand that diffuse communication can occur through
chemical means and such communication may play an important role in
setting parameters and modulating the networks so that they can perform rather different tasks in different situations. We have employed
the idea of diffuse distribution of chemicals in our account of amnesia
(Chapter 25) , but, in general, we have not otherwise integrated such
assumptions into our models. Roughly, we imagine that we are studying networks in which there is a fixed setting of such parameters, but
the situation may well be much more complex than that. (See Chapter
24 for some discussion of the role of norepinephrine and other neuromodulators.)
Most of our models are homogeneous with respect to the functioning
of our units . Some of them may be designated as inhibitory and others
as excitatory, but beyond that , they are rarely differentiated . We
understand that there are perhaps hundreds of kinds of neurons (see
Chapter 20). No doubt each of these kinds plays a slightly different role
in the information processing system. Our assumptions in this regard
are obviously only approximate. Similarly , we understand that there
are many different kinds of neurotransmitters and that there are different systems in which different of these neurotransmitters are dominant. Again, we have ignored this difference (except for excitatory
and inhibitory connections) and presume that as more is understood
about the information processing implications of such facts we will be
able to determine how they fit into our class of models.
It is also true that we have assumed a number of mechanisms that
are not known to exist in the brain (see Chapter 20) . In general, we
have postulated mechanisms which seemed to be required to achieve
certain important functional goals, such as, for example, the development of internal representations in multilayer networks (see Chapter
8) . It is possible that these hypothesized mechanisms do exist in the
brain but have not yet been recognized. In that sense our work could
be considered as a source of hypotheses for neuroscience. It is also
possible that we are correct about the computations that are performed ,
but that they are performed by a different kind of neural mechanism
than our formulations seem at first glance to suggest. If this is the
case, it merely suggests that the most obvious mapping of our models
onto neural structures is incorrect .
A neuroscientist might be concerned about the ambiguity inherent in
the fact that many of the mechanisms we have postulated could be
implemented in different ways. From our point of view, though , this is
not a serious problem . We think it useful to be clear about how our
mechanisms might be implemented in the brain, and we would certainly
be worried if we proposed a process that could not be implemented in
the brain. But since our primary concern is with the computations
themselves, rather than the detailed neural implementation of these
computations, we are willing to be instructed by neuroscientists on
which of the possible implementations are actually employed. This
position does have its dangers. We have already argued in this chapter
that the mechanism whereby a function is computed often has strong implications about exactly what function is being computed. Nevertheless, we have chosen a level of approximation which seems to us the
most fruitful , given our goal of understanding the human information
processing system.
We close this section by noting two different ways in which PDP
models can be related to actual neurophysiological processes, apart from
the possibility that they might actually be intended to model what is
known about the behavior of real neural circuitry (see Chapters 23 and
24 for examples of models of this class) . First , they might be intended
as idealizations. In this approach, the emergent properties of systems
of real neurons are studied by idealizing the properties of the individual
neurons, in much the same way that the emergent properties of real
gases can be studied by idealizing the properties of the individual gas
molecules. This approach is described at the end of Chapter 21. An
alternative is that they might be intended to provide a higher level of
description, but one that could be mapped onto a real neurophysiological implementation. Our interactive activation model of word recognition has some of this flavor, as do most of the models described in
Chapters 14 through 19. Specifically with regard to the word recognition model, we do not claim that there are individual neurons that stand for visual feature, letter, and word units, or that they are connected together just as we proposed in that model. Rather, we really
suppose that the physiological substrate provides a mechanism whereby
various abstract informational states- such as, for example, the state in
which the perceptual system is entertaining the hypothesis that the
second letter in a word is either an H or an A - can give rise to other
informational states that are contingent upon them .
Nativism vs. Empiricism
Historically , perceptron -like models have been associated with the
idea of " random self -organizing " networks , the learning of arbitrary
associations , very general , very simple learning rules , and similar ideas
which show the emergence of structure from the tabula rasa . We often
find, especially in discussion with colleagues from linguistics surrounding issues of language acquisition (see Chapters 18 and 19), that PDP
models are judged to involve learning processes that are too general
and , all in all , give too little weight to innate characteristics of language
or other information processing structures . This feeling is brought out
even more by demonstrations that some PDP learning mechanisms are
capable of learning to respond to symmetry and of learning how to deal
with such basic perceptual problems as perceptual constancy under
translation and rotation (see Chapter 8) . In fact , however , PDP models
are , in and of themselves , quite agnostic about issues of nativism
versus empiricism. Indeed, they seem to us to offer a very useful perspective on the issue of innate versus acquired knowledge.
For the purposes of discussion let us consider an organism that consists of a very large set of very simple but highly interconnected processing units. The units are assumed to be homogeneous in their
properties except that some are specialized to serve as "input" units because they receive inputs from the environment and some are specialized to serve as "output" units because they drive the effectors of
the system . The behavior of such a system is thus entirely determined
by the pattern of inputs , the pattern of interconnections among the
units, and the nature of and connections to the effectors. Note that interconnections can have various strengths - positive, negative, and zero. If the strength of connection is positive, then activity in one unit
tends to increase the activity of the second unit. If the strength of connection is negative, then the activity in the first unit tends to decrease
the activity of the second unit . If the strength is zero , then activity of
the first unit has no effect on the activity of the second .
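A minimal sketch of such an organism's dynamics, under illustrative assumptions of ours about the activation rule, might look like this.

    # Homogeneous units with signed connection strengths: positive weights
    # excite, negative weights inhibit, and zero weights have no effect.

    def step(activations, weights):
        n = len(activations)
        return [max(0.0, sum(weights[j][i] * activations[j] for j in range(n)))
                for i in range(n)]

On this description, everything about the organism's behavior is carried by the weight matrix, which is exactly why the nativism-empiricism question reduces to the question of how the weights are set.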
In such a system the radical nativism hypothesis would consist of the
view that all of the interconnections are genetically determined at birth
and develop only through a biologically driven process of maturation .
If such were the case, the system could have any particular behavior
entirely wired in . The system could be designed in such a way as to
respond differentially to human speech as opposed to other acoustic stimuli, to perform any sort of computation that had proven evolutionarily adaptive, to mimic any behavior it might observe, to have certain stimulus
dimensions to which it was pretuned to respond , etc . In short , if all of
the connections were genetically predetermined , the system could
perform any behavior that such a system of units, interconnections, and effectors might ever be capable of.

At the other extreme, the radical empiricist hypothesis would consist of the view that all of the interconnections are modifiable by experience. In short, if all connections in the system were modifiable, the system could learn to perform any behavior that such a system of units, interconnections, and effectors might ever be capable of. The question of what behaviors it actually did carry out would presumably be determined by the learning process and the patterns of inputs the system actually experienced. In this sense, the simple PDP model is clearly consistent with a radically empiricist world view.6
Between these extremes lies the view that some interconnections are prewired and others modifiable. In a system such as this we have, it seems to us, the benefits of both nativism and empiricism.
6 Obviously both of these views are overstatements. Clearly the genes do not determine every connection at birth. Probably some sort of random processes are also involved. Equally clearly, not every pattern of interconnectivity is possible since the spatial layout of the neurons in the cortex, for example, surely limits the connectivity. Still, there is probably a good deal of genetic specification of neural connection, and there is a good deal of plasticity in the pattern of connectivities after birth.
4. GENERALISSUES
141
among
themselves
and relatively
weakly
connected
to units
out -
side of the set ; of course , this concept admits all gradations of modu larity , just as our view of schemata allows all degrees of schematization
of knowledge.) There is, on this view, no such thing as "hardwiring." Neither is there any such thing as "software." There are only connections. All connections are in some sense hardwired (inasmuch as they are physical entities) and all are software (inasmuch as they can be changed). Thus, it may very well be that there is a part of the network prewired to deal with this or that processing task. If that task is not relevant in the organism's environment, that part of the network can be used for something else. If that part of the network is damaged, another part can come to play the role "normally" carried out by the damaged portion. These very properties have been noted as characteristics of the brain since Hughlings Jackson's work in the late 19th century (e.g., Jackson, 1869/1958); Jackson pointed them out as difficulties for the strict localizationist views then popular among students of the brain.
Note too that our scheme allows for the organism to be especially sensitive to certain relationships (such as the relationship between nausea and eating, for which there might be stronger or more direct prewired 7 connections) while at the same time allowing quite arbitrary associations to be learned.

7 Here again, our organism oversimplifies a bit. It appears that some parts of the nervous system - particularly lower level, reflexive, or regulatory mechanisms - seem to be prewired and subject only to control by trainable modulatory connections to higher level, more adaptive mechanisms, rather than being directly modifiable themselves; for discussion see Teitelbaum (1967) and Gallistel (1980).
Of course, not all connections may be plastic - certainly, many subcortical mechanisms are considerably less plastic than cortical ones. Also, plasticity may not continue throughout life (see Chapter 24). It
would, of course, be a simple matter to suppose that certain connections are fixed and not modifiable. This is an issue about which our framework itself takes no particular stand.

Another set of issues concerns conscious, deliberate thought, which is often described as a serial process: a movement through a problem space through a series of sequential steps. Can parallel distributed processing have anything to say about these explicit, introspectively accessible, temporally extended acts of thinking? Some have suggested that the answer is no: that PDP models may be fine as accounts of perception, motor control, and other low-level phenomena, but that they are simply unable to account for the higher level mental processing of the kind involved in reasoning, problem solving, and other higher level aspects of thought.
We agree that many of the most natural applications of PDP models are in the domains of perception and memory (see, for example, Chapters 15, 16, and 17). However, we are convinced that these models are equally applicable to higher level cognitive processes and offer new insights into these phenomena as well. We must be clear, though, about the fact that we cannot and do not expect PDP models to handle complex, extended, sequential reasoning processes as a single settling of a parallel network. We think that PDP models describe the microstructure of the thought process, and the mechanisms whereby these processes come, through practice, to flow more quickly and run together into each other.
Partly because of the temporally extended nature of sequential thought processes - the fact that they involve many settlings of a network instead of just one - they are naturally more difficult to deal with, and our efforts in these areas are, as yet, somewhat tentative. Nevertheless, we have begun to develop models of language processing (Chapter 19), language acquisition (Chapter 18), sequential thought processes and consciousness (Chapter 14), and problem solving and thinking in general (Chapters 6, 8, and 14). We view this work as preliminary, and we firmly believe that other frameworks provide additional, important levels of description that can augment our accounts, but we are encouraged by the progress we have made in these areas and believe that the new perspectives that arise from these efforts are sufficiently provocative to be added to the pool of possible explanations of these higher level cognitive processes. Obviously, the extension of our explorations more deeply into these domains is high on our ongoing agenda. We see no principled reasons why these explorations cannot succeed, and every indication is that they will lead us somewhat further toward an understanding of the microstructure of cognition.
MANY MODELS OR JUST ONE?
Before concluding this chapter, some comment should be made
about the status of the various models we and other members of the
PDP research group have developed and explored. As the title of this book suggests, we view ourselves as engaged in an exploration: the space of possible PDP models is large and still mostly uncharted, and each of our specific models represents a rough first approximation, a foray into one region of that space, developed to deal with a particular problem area. We have not tried to produce a single unifying super model capable of accounting for all of the phenomena at once. Rather, each application has led us to elaborate certain features of the general framework while suppressing or simplifying others, so that the details of our models vary from one application to the next. The models are related nonetheless: each is a particular instantiation of the general PDP principles outlined in this book, and each is to be understood both as a detailed account of some specific phenomena and as part of the exploration of the underlying metatheory. In developing each model we have tried to keep one eye on the specific phenomena under study and the other on the bigger picture, and we have felt free to change certain details from one application to the next; we don't feel that any single one of these models is likely to be the proper or final version. Sometimes the insights generated by a detailed model in one problem area turn out to apply, directly or indirectly, in quite different areas, and in this way the specific models connect with one another and with the framework as a whole. In our view it is the whole family of models, taken together, that best represents our developing approach to the study of human information processing.
CONCLUSION
In this chapter we have presented an overview of some of the main general issues that have been raised in response to the PDP enterprise and of the ways in which we have tried to respond to them or, in some cases, to overcome them. Some of these issues, such as the questions of levels of analysis and of the relation between our models and neural mechanisms, are central to the enterprise itself. Others, such as the nature vs. nurture question, are crucial issues for the study of cognition quite generally; indeed, we feel that PDP models provide a useful perspective on many of them. A number of the questions touched on here are considered in more detail in particular chapters in the rest of the book. The chapters in Part II describe the basic mechanisms we have been exploring and provide some of the formal and computational tools of network analysis, along with a consideration of their scope and limitations. The chapters in Part III present some of our views on the kinds of cognitive models these tools make possible. The
chapters in Part IV address themselves to cognitive constructs and
attempt to redefine the cognitive structures of earlier theories in terms
of emergent properties of PDP networks. The chapters in Part V consider the neural mechanisms themselves and their relation to the algorithmic level of most of the work described in Parts II and IV.
ACKNOWLEDGMENTS
We would like to thank the many people who have raised the questions and the objections that we have attempted to discuss here. These people include John Anderson, Francis Crick, Steve Draper, Jerry Fodor, Jim Greeno, Allen Newell, Zenon Pylyshyn, Chris Riesbeck, Kurt VanLehn, and many others in San Diego, Pittsburgh, and elsewhere.
In addition, we would like to thank Allan Collins and Keith . Preparation of this chapter was supported by ONR Contract N00014-82-C-0374, NR 667-437, and by a grant from the National Institute of Mental Health.
PART II

BASIC MECHANISMS
The chapters of Part II represent explorations into specific architectures and learning mechanisms for PDP models. These explorations proceed through mathematical analysis coupled with results from simulations. The major theme which runs through all of these explorations is a focus on the learning problem: How can PDP networks evolve to perform the kinds of tasks we require of them? Since one of the primary features of PDP models in general is their ability to self-modify, these studies form an important base for the application of these models to specific psychological and biological phenomena.
In Chapter 5, Rumelhart and Zipser begin with a summary of the history of early work on learning in parallel distributed processing systems. They then study an unsupervised learning procedure called competitive learning. This is a procedure whereby feature detectors capable of discriminating among the members of a set of stimulus input patterns evolve without a specific teacher guiding the learning. The basic idea is to let pools of potential feature detector units compete among themselves to respond to each stimulus pattern. The winner within each pool - the one whose connections make it respond most strongly to the pattern - then adjusts its connections slightly toward the pattern on which it won. Several earlier investigators have considered variants of the competitive learning idea (e.g., Grossberg, 1976; von der Malsburg, 1973). Rumelhart and Zipser show that when a competitive network is trained through repeated presentations of members of a set of patterns, each unit in a pool comes to respond when patterns with a particular
environmental structure are presented.
CHAPTER 5

Feature Discovery by Competitive Learning

D. E. RUMELHART and D. ZIPSER
Start with a set of units that are all the same except for some
randomly distributed parameter which makes each of them
respond slightly differently to a set of input patterns.
Limit the "strength" of each unit.

Allow the units to compete in some way for the right to respond to a given subset of inputs.

(This chapter originally appeared in Cognitive Science, 1985, 9, 75-112. Copyright 1985 by Ablex Publishing. Reprinted by permission.)

The net result of correctly applying these three components to a learning paradigm is that individual units learn to specialize on sets of
similar patterns and thus become "feature detectors" or "pattern classifiers." In addition to Frank Rosenblatt, whose work will be discussed
below, other investigators have sought learning schemes for neuron-like elements capable of more than pattern classification. One issue we address concerns the limitations inherent in a one-level system and the difficulty of developing learning schemes for multilayered systems. Competitive learning is a scheme in which important features can be discovered at one level that a multilayer system can use to classify pattern sets which cannot be classified with a single-level system.
Thirty -five years of experience have shown that getting neuron -like
elements to learn some easy things is often quite straightforward , but
designing systems with powerful general learning properties is a difficult
problem , and the competitive learning paradigm does not change this
fact . What we hope to show is that competitive learning is a powerful
strategy which , when used in a variety of situations , greatly expedites
some difficult tasks. Since the competitive learning paradigm has roots which go back to the very beginnings of the study of artificial learning devices, it seems reasonable to put the issue into historical perspective. This is even more to the point, since one of the first simple learning devices, the perceptron, caused great furor and debate, the reverberations of which are still with us.
In the beginning , thirty -five or forty years ago, it was very hard to
see how anything resembling a neural network could learn at all , so any
example of learning was immensely interesting . Learning was elevated
to a status of great importance in those days because it was somehow
uniquely associated with the properties of animal brains . After
McCulloch and Pitts ( 1943) showed how neural -like networks could
compute , the main problem then facing workers in this area was to
understand how such networks could learn.
The first set of ideas that really got the enterprise going was contained in Donald Hebb's Organization of Behavior (1949). Before Hebb's work, it was believed that some physical change must occur in a network to support learning, but it was not clear what this change could be. Hebb proposed that a reasonable and biologically plausible change would be to strengthen the connections between elements of the network only when both the presynaptic and postsynaptic units were active simultaneously. The essence of Hebb's ideas still persists today in
many learning paradigms. The details of the rules for changing weights
may be different , but the essential notion that the strength of connections between the units must change in response to some function of
the correlated activity of the connected units still dominates learning
models.
Hebb's ideas remained untested speculations about the nervous system until it became possible to build some form of simulated network
to test learning theories. Probably the first such attempt occurred in
1951 when Dean Edmonds and Marvin Minsky built their learning
machine. The flavor of this machine and the milieu in which it
operated is captured in Minsky ' s own words which appeared in a
wonderful New Yorker profile of him by Jeremy Bernstein ( 1981) :
Minsky remembers:
We sort of quit science for awhile to watch the machine. We
were amazed that it could have several activities going on at
once in this little nervous system. Because of the random
wiring it had a sort of fail safe characteristic. If one of the
neurons wasn' t working , it wouldn 't make much difference and
with nearly three hundred tubes, and the thousands of
connections we had soldered there would usually be something
wrong somewhere. . . . I don't think we ever debugged our
machine completely , but that didn 't matter. By having this
crazy random design it was almost sure to work no matter how
you built it . (p. 69)
In fact, the functioning of this machine apparently stimulated Minsky
sufficiently to write his PhD thesis on a problem related to learning
(Minsky , 1954) . The whole idea must have generated rather wide
interest; von Neumann, for example, was on Minsky's PhD committee
and gave him encouragement. Although Minsky was perhaps the first
on the scene with a learning machine, the real beginnings of meaningful neuron-like network learning can probably be traced to the work of
Frank Rosenblatt, a Bronx High School of Science classmate of
Minsky's. Rosenblatt was not the first to study learning in neural-like networks by means of digital simulation (cf. Farley & Clark, 1954), but he pioneered the two-pronged approach that has dominated this area ever since: digital simulation of hypothetical networks coupled with formal mathematical analysis. Simulation permits exploratory answers to questions about particular mechanisms that cannot yet be answered exactly; analysis, in the cases where it is applicable, yields exact results about the lawful behavior of whole types of networks. Since some of Rosenblatt's concepts and results appear again in competitive learning, it is worthwhile reviewing his ideas here. Rosenblatt stated the aims of his work with perceptrons boldly (Rosenblatt, 1962):

[Perceptrons] are not intended to serve as detailed copies of any actual nervous system. They are simplified networks, designed to permit the study of lawful relationships between the organization of a nerve net, the organization of its environment, and the "psychological" performances of which the network is capable. Perceptrons might actually correspond to parts of more extended networks in biological systems; in this case, the results obtained will be directly applicable. More likely, they represent extreme simplifications of the central nervous system, in which some properties are exaggerated, others suppressed. In this case, successive perturbations and refinements of the system may yield a closer approximation. (p. 28)

Rosenblatt's most influential result was the perceptron convergence theorem, which asserts that a perceptron, using its error-correcting learning rule, can learn in a finite number of steps any classification for which a solution exists.
As it turned out , the real problems arose out of the phrase "for which a
solution exists" - more about this later.
Less widely known is Rosenblatt's work on what he called "spontaneous learning." All network learning models require rules which tell how to present the stimuli and change the values of the weights in accordance with the response of the model. These rules can be characterized as forming a spectrum, at one end of which is learning with an error-correcting teacher, and at the other of which is completely spontaneous, unsupervised discovery. In between is a continuum of rules that depend on manipulating the content of the input stimulus stream to bring about learning. These intermediate rules are often referred to as "forced learning." Here we are concerned primarily with attempts to design a perceptron that would discover something interesting without a teacher, because this is similar to what happens in the competitive learning case.
In fact, Rosenblatt was able to build a perceptron that spontaneously dichotomized a random sequence of input patterns into classes
such that the members of a single class were similar to each other , and
different from the members of the other class. Rosenblatt realized that
any randomly initialized perceptron would have to dichotomize an arbitrary input pattern stream into a "1-set," consisting of those patterns that happened to produce a response of 1, and a "0-set," consisting of those that produced a response of 0. Of course one of these sets could be empty by chance, and neither would be of much interest in general.
He reasoned that if a perceptron could reinforce these sets by an
appropriate rule based only on the perceptron's spontaneous response
and not on a teacher's error correction , it might eventually end up with
a dichotomization in which the members of each set were more like
each other than like the members of the opposite set. What was the
appropriate rule to use to achieve the desired dichotomization? The first rule he tried for these perceptrons, which he called C-type, was to increment weights on lines active with patterns in the 1-set, and decrement weights on lines active with patterns in the 0-set. The idea was to force a dichotomization into sets whose members were similar in the sense that they activated overlapping subsets of lines. The results were
disastrous. Sooner or later all the input patterns were classified in one
set. There was no dichotomy but there was stability . Once one of the
sets won, it remained the victor forever .
Not to be daunted, he examined why this undesirable result occurred
and realized that the problem lay in the fact that since the weights
could grow without limit , the set that initially had a majority of the patterns would receive the majority of the reinforcement . This meant that
weights on lines which could be activated by patterns in both sets would
grow to infinite magnitudes in favor of the majority set, which in turn
would lead to the capture of minority patterns by the majority set and
ultimate total victory for the majority . Even where there was initial
equality between the sets, inevitable fluctuations in the random presentation of patterns would create a majority set that would then go on to
win . Rosenblatt overcame this problem by introducing mechanisms to
limit weight growth in such a way that the set that was to be positively
reinforced at active lines would compensate the other set by giving up
some weight from all its lines . He called the modified perceptrons C '.
An example of a C ' rule is to lower the magnitude of all weights by a
fixed fraction of their current value before specifically incrementing the
weights on the lines active with the current stimulus. With this rule, Rosenblatt achieved the desired result: the stream of patterns was dichotomized into sets whose members were more similar to each other than to the members of the opposite set. Rosenblatt's own view of his achievements can be described by considering what he thought his perceptrons had shown. He claimed that the perceptron introduces a new kind of information processing automaton (Rosenblatt, 1959):

It seems clear that the perceptron introduces a new kind of information processing automaton: For the first time, we have a machine which is capable of having original ideas. . . . As a concept, it would seem that the perceptron has established, beyond doubt, the feasibility and principle of non-human systems which may embody human cognitive functions at a level far beyond that which can be achieved through present day automatons. The future of information processing devices which operate on statistical, rather than logical, principles seems to be clearly indicated. (p. 449)
It is this notion, that systems built on statistical rather than logical principles can do things that ordinary computers cannot, that ignited an acrimonious debate at the 1959 conference on the Mechanization of Thought Processes at which Rosenblatt presented his results. Rosenblatt contributed two main claims to that debate: (a) that a statistically organized, neural-like system can spontaneously self-organize, drawing the information it needs from its environment and improving its own performance without complete external control (Rosenblatt, 1959, p. 423); and (b) that important human cognitive functions, including those involved in decision making, are based on such statistical processes, so that a brain-like network might come to share abilities that purely logical machines seem never to attain. There is no doubt that this debate, and the arguments made on both sides of it, had significant and lasting effects on the development of artificial intelligence and the study of cognition.
Clearly in some sense, Rosenblatt was saying that there were things
that the brain and perceptrons, because of their statistical properties,
could do which computers could not do. Now this may seem strange
since Rosenblatt knew that a computer program could be written that
would simulate the behavior of statistical perceptrons to any arbitrary
degree of accuracy. Indeed, he was one of the pioneers in the application of digital simulation to this type of problem . What he was actually
referring to is made clear when we examine the comments of other participants at the conference, such as Minsky (1959) and McCarthy (1959), who were using the symbol manipulating capabilities of the computer to directly simulate the logical processes involved in decision
making, theorem proving , and other intellectual activities of this sort.
Rosenblatt believed the computer used in this way would be inadequate
to mimic the brain's true intellectual powers. This task, he thought ,
could only be accomplished if the computer or other electronic devices
were used to simulate perceptrons. We can summarize these divergent
points of view by saying that Rosenblatt was concerned not only with
what the brain did, but with how it did it, whereas others, such as Minsky and McCarthy, were concerned with simulating what the brain did,
and didn 't really care how it was done. The subsequent history of AI
has shown both the successes and failures of the standard AI approach. We still have the problems today, and it's still not clear to what degree
computational strategies similar to the ones used by the brain must be
employed in order to simulate its performance.
In addition to producing fertilizer , as all debates do, this one also
stimulated the growth of some new results on perceptrons, some of
which came from Minsky. Rosenblatt had shown that a two-layer perceptron could carry out any of the $2^{2^N}$ possible classifications of N binary inputs; that is, a solution to the classification problem had always existed in principle. This result was of no practical value, however, because $2^N$ units were required to accomplish the task in the completely general case. Rosenblatt's approach to this problem was to use a
much smaller number of units in the first layer with each unit connected to a small subset of the N inputs at random. His hope was that
this would give the perceptron a high probability of learning to carry
out classifications of interest. Experiments and formal analysis showed that these random devices could learn to recognize patterns to a significant degree but that they had severe limitations. Rosenblatt (1962)
characterized his random perceptron as follows :
It does not generalize well to similar forms occurring in new
positions in the retinal field , and its performance in detection
experiments, where a familiar figure appears against an
unfamiliar background, is apt to be weak. More sophisticated
psychological capabilities, which depend on the recognition of
topological properties of the stimulus field , or on abstract relations between the components of a complex image, are lacking.
(pp. 191-192)
Minsky and Papert worked through most of the sixties on a mathematical analysis of the computing powers of perceptrons with the goal of
understanding these limitations . The results of their work are available
in a book called Perceptrons (Minsky & Papert, 1969) . The central
theme of this work is that parallel recognizing elements, such as perceptrons, are beset by the same problems of scale as serial pattern
recognizers. Combinatorial explosion catches you sooner or later,
although sometimes in different ways for parallel machines than for serial ones. Minsky
and Papert's book had a very dampening effect on the study of
neuron-like networks as computational devices. Minsky has recently come to reconsider this negative effect.
THE COMPETITIVE LEARNING MECHANISM
Paradigms of Learning
It is possible to classify learning mechanisms in several ways. One
useful classification is in terms of the learning paradigm in which the
model is supposed to operate. There are at least four such paradigms in common use:

FIGURE 1. A parity network of the kind analyzed by Minsky and Papert (1969). The network is built from layers of linear threshold units, each of which fires only when the weighted sum of its inputs is greater than its threshold; the rightmost unit fires only when the number of active input units is odd.
. Auto Associator. In this paradigm a set of patterns is repeatedly presented, and the system is supposed to "store" the patterns. Then, later, part of one of the original patterns, or possibly a pattern similar to one of the original patterns, is presented, and the task is to "retrieve" the original pattern through a kind of pattern completion procedure. This is an auto-association process in which a pattern is associated with itself so that a degraded version of the original pattern can act as a retrieval cue.
. Pattern Associator. This paradigm is a variant of the auto-associator in which pairs of patterns are presented together during learning; later, when one member of a pair is presented alone, the task is to produce the other member.

. Classification Paradigm. Here the stimulus patterns are to be sorted into a fixed set of categories, and during learning a teacher indicates the correct classification of each pattern.

. Regularity Detector. In this paradigm there is no a priori set of categories and no teacher; the system must discover statistically salient features of the stimulus population for itself. Competitive learning is a mechanism well-suited for regularity detection, as in the environment described above.
Competitive Learning
The architecture of a competitive learning system (illustrated in Figure 2) is a common one. It consists of a set of hierarchically layered
units in which each layer connects, via excitatory connections, with the
layer immediately above it . In the most general case, each unit of a
layer receives an input from each unit of the layer immediately below
and projects output to each unit in the layer immediately above it .
Moreover , within a layer, the units are broken into a set of inhibitory
clusters in which all elements within a cluster inhibit all other elements
in the cluster. Thus the elements within a cluster at one level compete
with one another to respond to the pattern appearing on the layer
below. The more strongly any particular unit responds to an incoming
stimulus , the more it shuts down the other members of its cluster.
There are many variations on the competitive learning theme. A
number of researchers have developed variants of competitive learning
mechanisms and a number of results already exist in the literature . We
have already mentioned the pioneering work of Rosenblatt. In addition, von der Malsburg (1973), Fukushima (1975), and Grossberg
( 1976) , among others, have developed models which are competitive
learning models, or which have many properties in common with competitive learning. We believe that the essential properties of the competitive learning mechanism are quite general. However, for the sake
of concreteness, in this paper we have chosen to study, in some detail,
the simplest of the systems which seem to be representative of the
essential characteristics of competitive learning. Thus, the system we
have analyzed has much in common with the previous work , but wherever possible we have simplified our assumptions. The system that we
have studied most is described below:
.
The units in a given layer are broken into a set of nonoverlapping clusters. Each unit within a cluster inhibits every other unit within the cluster. The clusters are winner-take-all, such that the unit receiving the largest input achieves its maximum value while all other units in the cluster are pushed to their minimum value. 1 We have arbitrarily set the maximum value to 1 and the minimum value to 0.
FIGURE 2. The architecture of the competitive learning mechanism. Layer 1 consists of the input units, which receive the input pattern; the units of Layer 2 and Layer 3 are organized into inhibitory clusters, and excitatory connections run from each layer to the layer immediately above it.
. Every unit in a cluster has a fixed amount of weight (all weights are positive) distributed among its input lines: $\sum_i w_{ij} = 1$, where $w_{ij}$ is the weight on the line from input unit $i$ to unit $j$.

. A unit learns only if it wins the competition. Let $c_{ik}$ be 1 if input unit $i$ is active in stimulus pattern $S_k$ and 0 otherwise, and let $n_k = \sum_i c_{ik}$ be the number of active input units in $S_k$. A learning trial shifts a fraction $g$ of the winner's weight from its inactive lines to its active lines:

$$\Delta w_{ij} \;=\; \begin{cases} 0 & \text{if unit } j \text{ loses on stimulus } k,\\[4pt] g\,\dfrac{c_{ik}}{n_k} - g\, w_{ij} & \text{if unit } j \text{ wins on stimulus } k. \end{cases}$$

Note that this rule preserves each unit's total weight: the winner gains $g \sum_i c_{ik}/n_k = g$ and gives up $g \sum_i w_{ij} = g$.

Variants of this normalization were proposed by von der Malsburg (1973) and by Grossberg (1976); one can normalize either the weights or the input patterns, and it does not much matter which option is chosen. We chose the simpler course of assuming that all stimulus patterns contain the same number of active lines, so that no renormalization of the inputs is necessary.

A geometric analogy is useful at this point. If all of the stimulus patterns contain the same number of active lines, then all of the stimulus vectors are of the same length, and each can be viewed as a point on an $N$-dimensional hypersphere,
where N is the number of units in the lower level, and therefore, also
the number of input lines received by each unit in the upper level.
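To make the update concrete, here is a minimal Python sketch of one winner-take-all learning step (our illustration, not code from the chapter); g, c_ik, n_k, and w_ij follow the notation above, and stimuli are assumed to be binary vectors that all contain the same number of active lines.

import numpy as np

def competitive_step(w, stimulus, g=0.1):
    """One competitive learning step for a single winner-take-all cluster.

    w        : (units, lines) weight matrix; each row sums to 1.
    stimulus : binary vector c_k marking the active input lines.
    g        : learning rate.
    """
    n_k = stimulus.sum()                    # number of active lines in S_k
    winner = int(np.argmax(w @ stimulus))   # unit with the largest input wins
    # The winner gives up a fraction g of its weight and redistributes it
    # across the currently active lines; its row still sums to 1 afterward.
    w[winner] += g * stimulus / n_k - g * w[winner]
    return winner

# Example: a cluster of 2 units looking at 16 input lines.
rng = np.random.default_rng(0)
w = rng.random((2, 16))
w /= w.sum(axis=1, keepdims=True)           # start with total weight 1 per unit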
Each x in Figure 3A represents a particular pattern. Those patterns
that are very similar are near one another on the sphere; those that are
very different will be far from one another on the sphere. Now note
that since there are N input lines to each unit in the upper layer, its
weights can also be considered a vector in N-dimensional space. Since
all units have the same total quantity of weight, we have N-dimensional
vectors of approximately fixed length for each unit in the cluster. 3
Thus, properly scaled, the weights themselves form a set of vectors
which (approximately) fall on the surface of the same hypersphere. In
Figure 3B, the o's represent the weights of two units superimposed on
the same sphere with the stimulus patterns. Now, whenever a stimulus
pattern is presented, the unit which responds most strongly is simply
the one whose weight vector is nearest that for the stimulus . The
learning rule specifies that whenever a unit wins a competition for a
stimulus pattern, it moves a percentage g of the way from its current
location toward the location of the stimulus pattern on the hypersphere.
Now , suppose that the input patterns fell into some number , M ,
" natural" groupings. Further , suppose that an inhibitory cluster receiving inputs from these stimuli contained exactly M units (as in Figure
3C) . After sufficient training , and assuming that the stimulus groupings are sufficiently distinct , we expect to find one of the vectors for
the M units placed roughly in the center of each of the stimulus groupings. In this case, the units have come to detect the grouping to which
the input patterns belong. In this sense, they have "discovered" the
structure of the input pattern sets.
Each cluster classifies the stimulus set into M groups , one for
each unit in the cluster . Each of the units captures roughly an
equal number of stimulus patterns . It is possible to consider a
cluster as forming an M -ary feature in which every stimulus
pattern is classified as having exactly one of the M possible
values.

The particular grouping a cluster discovers depends on the starting values of its weights and on the actual sequence of stimulus patterns presented. 4 When the number of stimulus patterns is much larger than the number of units in a cluster, the classification is necessarily a coarse one. A system containing a number of clusters, each receiving inputs from the same stimulus lines, can in general classify the inputs into a variety of different groupings or, alternatively, discover a variety of independent features of the stimulus population; in such cases the set of clusters provides a kind of coarse coding of the stimulus. 5

4 Grossberg (1976) has addressed this problem. He proved that if there are enough units in the inhibitory cluster and the stimulus groupings are sufficiently distinct, the system can find a perfectly stable classification; he also showed that when these conditions do not hold, the classification can be unstable. Most of the cases we study below are of this latter kind.
Formal Analysis
Perhaps the simplest mathematical analysis that can be given of the
competitive learning model under discussion involves the determination
of the sets of equilibrium states of the system- that is, states in which
the average inflow of weight to a particular line is equal to the average
outflow of weight on that line. Let $p_k$ be the probability that stimulus $S_k$ is presented on any trial. Let $v_{jk}$ be the probability that unit $j$ wins when stimulus $S_k$ is presented. Now we want to consider the case in which $\sum_k \Delta w_{ij}\, v_{jk}\, p_k = 0$, that is, the case in which the average change in the weights is zero. We refer to such states as equilibrium states. Thus, using the learning rule and averaging over stimulus patterns we can write
$$0 \;=\; g \sum_k \frac{c_{ik}}{n_k}\, v_{jk}\, p_k \;-\; g \sum_k w_{ij}\, v_{jk}\, p_k ,$$

and thus

$$w_{ij} \;=\; \frac{\sum_k p_k \,(c_{ik}/n_k)\, v_{jk}}{\sum_k p_k\, v_{jk}} .$$
There are a number of important observations to note about this equation. First, note that $\sum_k p_k v_{jk}$ is simply the probability that unit $j$ wins, averaged over all stimulus patterns. Note further that $\sum_k p_k c_{ik} v_{jk}$ is the probability that input line $i$ is active and unit $j$ wins. Thus, the ratio $\sum_k p_k c_{ik} v_{jk} / \sum_k p_k v_{jk}$ is the conditional probability that line $i$ is active given that unit $j$
5 There is a problem in that one can't be certain that the different clusters will discover different features. A slight modification of the system, in which clusters "repel" one another, can insure that different clusters find different features. We shall not pursue that further in this paper.
wins. If, as we assume, all of the stimulus patterns contain the same number, $n$, of active lines, then $n_k = n$ for all $k$, and at equilibrium we have

$$w_{ij} \;=\; \frac{1}{n}\, P(\text{line } i \text{ is active} \mid \text{unit } j \text{ wins}) ;$$

that is, each unit distributes its fixed total weight across its input lines in proportion to the probability that each line is active when the unit wins. We can now specify the response of a unit at equilibrium. Let $\alpha_{jl}$ be the response of unit $j$ when stimulus $S_l$ is presented, that is, the sum of the weights on its active lines:

$$\alpha_{jl} \;=\; \sum_i w_{ij}\, c_{il} .$$

Substituting the equilibrium value of the weights gives

$$\alpha_{jl} \;=\; \frac{\sum_k p_k\, r_{kl}\, v_{jk}}{\sum_k p_k\, v_{jk}} ,$$

where $r_{kl} = \sum_i c_{ik}\, c_{il} / n$ represents the overlap between stimulus $S_k$ and stimulus $S_l$. Thus, at equilibrium a unit responds most strongly to the patterns that overlap most with the patterns it wins on, and most weakly to the patterns farthest from the patterns it wins on. Note that these equations only state relationships that must hold at equilibrium; in general many different states satisfy them. In the competitive learning system the $v_{jk}$ can take on only the values 0 and 1, so a further restriction holds: the stated values of $v_{jk}$ must be consistent with the responses they determine, that is, $v_{jk} = 1$ exactly when $\alpha_{jk} > \alpha_{ik}$ for all $i \neq j$, and $v_{jk} = 0$ otherwise.

Moreover, not all equilibrium states are equally stable. Whenever a stimulus pattern is presented to which two units respond almost equally, the pattern may be captured first by one unit and then by the other; the weights keep changing, and the system can eventually be pushed out of such a state. An equilibrium will be relatively stable when each stimulus pattern is responded to strongly by one unit and only relatively weakly by all of the others. The degree of stability of an equilibrium state can be measured by the average amount by which the response of the winning unit exceeds the responses of all of the losing units, summed over all stimulus patterns:

$$T \;=\; \sum_k p_k \sum_{j,i} v_{jk}\,(\alpha_{jk} - \alpha_{ik}) .$$
The larger the value of T, the more stable the system can be expected
to be and the more time we can expect the system to spend in that
state. Roughly , if we assume that the system moves into states which
maximize T, we can show that this amounts to maximizing the overlap
among patterns within a group while minimizing the overlap among
patterns between groups. In the geometric analogy above, this will
occur when the weight vectors point toward maximally compact
stimulus regions that are as distant as possible from other such regions.
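As a rough numerical illustration (our own sketch, not part of the original chapter), one can train a small cluster and check the two predictions of the analysis: that each weight approaches (1/n) P(line active | unit wins), and that the stability measure T grows during training. All specific sizes and rates below are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(0)
K, lines, n, g = 8, 16, 3, 0.05

# K equally likely binary patterns, each with exactly n active lines.
patterns = np.zeros((K, lines))
for k in range(K):
    patterns[k, rng.choice(lines, size=n, replace=False)] = 1.0
p = np.full(K, 1.0 / K)

w = rng.random((2, lines))
w /= w.sum(axis=1, keepdims=True)

def stability_T(w):
    alpha = (w @ patterns.T) / n              # alpha[j, k]: unit responses
    win = alpha.argmax(axis=0)                # winner for each pattern (v_jk)
    margins = alpha[win, np.arange(K)] - alpha
    return float((p * margins.sum(axis=0)).sum())

T_before = stability_T(w)
for _ in range(4000):                         # train on random presentations
    k = rng.integers(K)
    j = int(np.argmax(w @ patterns[k]))
    w[j] += g * patterns[k] / n - g * w[j]
print(T_before, stability_T(w))               # T should have increased

# Check w_ij ~ (1/n) P(line i active | unit j wins) over the population.
win = (w @ patterns.T).argmax(axis=0)
for j in (0, 1):
    if (win == j).any():
        cond = patterns[win == j].mean(axis=0) / n
        print(np.abs(w[j] - cond).max())      # should be close to zero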
SOME EXPERIMENTAL RESULTS
Dipole Experiments
The essential structure that a competitive learning mechanism can
discover is represented in the overlap of stimulus patterns. The simplest stimulus population in which stimulus patterns can overlap with
one another is one constructed out of dipoles- stimulus patterns consisting of exactly two active elements and the rest inactive. If we have
a total of N input units, there are N(N-1)/2 possible dipole stimuli. Of course, if the actual stimulus population consists of all N(N-1)/2 possibilities, there is no structure to be discovered. There are no clusters for our units to point at (unless we have one unit for each of the possible stimuli, in which case we can point a weight vector at each of the
possible input stimuli).
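The dipole populations are easy to enumerate. The sketch below (our illustration) builds the restricted population used in the experiments described next: only pairs of horizontally or vertically adjacent units on a 4 x 4 grid may be jointly active, which is the generation rule given in the caption of Figure 4.

import numpy as np

def dipole_patterns(rows=4, cols=4):
    """All dipole stimuli with exactly two adjacent units active."""
    patterns = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):    # right and down neighbors
                r2, c2 = r + dr, c + dc
                if r2 < rows and c2 < cols:
                    v = np.zeros(rows * cols)
                    v[r * cols + c] = 1.0
                    v[r2 * cols + c2] = 1.0
                    patterns.append(v)
    return np.array(patterns)

print(len(dipole_patterns()))   # 24 adjacent pairs on a 4 x 4 grid
# Compare: the unrestricted population has N(N-1)/2 = 16*15/2 = 120 dipoles.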
If, however, we restrict the stimuli to dipoles of adjacent units, there is coherent structure for the system to discover. Figure 5 shows the results of one such experiment with a cluster of two units. Each circle in the grids stands for one of the sixteen input units; a filled circle indicates that Unit 1 had the largest weight on that line, and if the circle is unfilled, that means that Unit 2 had the largest weight on that line. The grids on the left indicate the initial configurations of the weights, and the grids on the right show the configurations after training.
FIGURE 4. A dipole stimulus defined on a 4 x 4 matrix of input units . The rule for generating such stimuli is simply that any two adjacent units may be simultaneously active .
Nonadjacent units may not be active and more than two units may not be simultaneously
active . Active units are indicated by filled circles .
The lines joining adjacent circles stand for the dipole stimuli: the wide line indicates that Unit 1 was the winner, and the narrow line indicates that Unit 2 was
the winner . It should be noted, therefore , that two unfilled circles
must always be joined by a narrow line and two filled circles must
always be joined by a wide line . The reason for this is that if a particular unit has more weight on both of the active lines then that unit must
win the competition. The results clearly show that the weights move from a rather chaotic initial arrangement to an arrangement in which essentially all of those on one side of the grid are filled and all on the other side are unfilled. The border separating the two halves of the grid may be at any orientation, but most often it is oriented vertically or horizontally, as shown in the upper two examples. Only rarely is the orientation diagonal, as in the example in the lower right-hand grid. Thus, we have a case in which each unit has chosen a coherent half of the grid to respond to. It is important to realize that, as far as the competitive learning mechanism is concerned, the sixteen input units are unordered; the two-dimensional arrangement of the grid is reflected only in the overlap among the stimulus patterns.
FIGURE 6. The evolution of the weights for two experimental runs (A and B), shown at several points during training (after 50 and 400 trials).

The system has nevertheless extracted the kind of stimulus structure inherent in the stimulus
population and has devised binary feature detectors that tell which half
of the grid contains the stimulus pattern. Note that each unit responds to
roughly half of the stimulus patterns. Note also that while some units
break the grid vertically , some break the grid horizontally , and some
break it diagonally ; a combination of several clusters offers a rather
more precise classification of a stimulus pattern .
In other experiments , we tried clusters of other sizes . For example ,
Figure 7 shows the results for a cluster of size four . It shows the initial
configuration and its sequence of evolution after 100, 200, 400, 800,
and after 4000 training trials . Again , initially the regions are chaotic .
After training , however , the system settles into a state in which stimuli
in compact regions of the grid are responded to by the same units . It
can be seen, in this case, that the trend is toward a given unit responding to a maximally compact group of stimuli. In this experiment, three
of the units settled on compact square regions while the remaining one
settled on two unconnected stimulus regions. It can be shown that the
state into which the system settled does not quite maximize the value
T, but does represent a relatively stable equilibrium state .
In the examples discussed thus far, the system, to a first approximation, divided the stimulus patterns into coherent regions of the stimulus space.
In
the
case of
three
dimensions
, there
are
three equally good planes which can be passed through the space and ,
depending on the starting directions of the weight vectors and on the
sequence
of
stimuli
, different
clusters
will
choose
different
ones
of
these planes. Thus, a system which receives input from a set of such clusters will be given information as to the quadrant of the space in which the pattern appears. It is important to emphasize that the coherence of the space is entirely in the choice of input stimuli, not in the
architecture of the competitive learning mechanism. The system discovers the spatial structure in the input lines.
Formal analysis. For the dipole examples described above, it is possible to develop a rather precise characterization of the behavior of the
competitive learning system . Recall our argument that the most stable
equilibrium state (and therefore the one the system is most likely to
end up in) is the one in which T, the stability measure introduced above, is maximized:

$$T \;=\; \sum_k p_k \sum_{j,i} v_{jk}\,(\alpha_{jk} - \alpha_{ik}) .$$

FIGURE 7. The relative weights of each of the four elements of the cluster after 0, 100, 200, 400, 800, and 4000 training trials.
Now, in the dipole examples, all stimulus patterns in the stimulus population are equally likely (i.e., $p_k = 1/N$, where $N$ is the number of patterns in the population), all stimulus patterns involve two active lines, and for every stimulus pattern there is a fixed number of other stimulus patterns in the population that overlap it.

FIGURE 8. The relative weights for a system in which the stimulus patterns were chosen from a three-dimensional grid, after 4000 presentations.
Under these conditions, the analysis shows that T is maximized when the patterns captured by each unit occupy a compact, essentially spherical region of the space: adjacent pairs of stimulus points then fall within the same region wherever possible, and the total overlap across the borders between regions is minimized. 6

6 Note that to eliminate edge and border effects we carried out these experiments on tori; it is not clear that the results hold in quite the same form when edge effects are present.

Since the most stable states are the ones that maximize T, and since the same conditions hold in the examples presented above, we can conclude that whenever the stimulus points are drawn from a population of overlapping patterns arranged on a grid embedded in a
high-dimensional hyperspace, our competitive learning mechanism will
form essentially spherical regions that partition the space into one such
spherical region for each element of the cluster.
Another result of our simulations which can be explained by these
equations is the tendency for each element of the cluster to capture
roughly equally sized regions. This results from the interconnectedness
of the stimulus population. The result is easiest to see in the case in which M = 2. In this case, the function we want to minimize is given by

$$\frac{B_1}{N_1} + \frac{B_2}{N_2} ,$$

where $B_j$ is the number of border patterns in region $j$ and $N_j$ is the number of stimulus patterns captured by unit $j$.
There is thus a pressure to reduce the number of border stimuli, and, since all of the units in a cluster have the same total amount of weight, there is also a pressure for the units to capture roughly equal numbers of the equally likely stimulus patterns. Whenever one unit has captured many more patterns than another, this pressure works to divide the patterns more evenly; in practice, such systems evolve toward a minimum in which each unit captures a roughly equal share of a compact region of the stimulus space.

Learning Words and Letters

The interactive activation model of word perception (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) offers detailed accounts of a variety of word perception phenomena. That network, however, was handcrafted: every connection was put in place by the modelers so that the system would do its particular job. The question arises as to whether a competitive learning mechanism could create networks capable of successfully carrying out such tasks, rather than depending on the modeler to craft them.
Let's begin with the fact that the word perception model required a set of position-specific letter detectors. Suppose that a competitive learning mechanism is faced with a set of words - to what features would the system learn to respond? Would it create position-specific letter detectors or their equivalent? We proceeded to answer this question by again viewing the lower level units as forming a two-dimensional grid. Letters and words could then be presented by activating those units on the grid corresponding to the points of a standard CRT font. Figure 9 gives examples of some of the stimuli used in our experiments. The grid we used was a 7 x 14 grid. Each letter
occurred in a 7 x 5 rectangular region on the grid . There was room for
two letters with some space in between , as shown in the figure . We
then carried out a series of experiments in which we presented a set of word or letter stimuli to the system. Because the letter stimuli only sparsely covered the grid, many of
the units in the lower level never became active at all. Therefore
there was a possibility that , by chance , one of the units would have
most of its weight on input lines that were never active , whereas
another unit may have had most of its weight on lines common to all of
the stimulus patterns.

FIGURE 9. Example stimuli for the word and letter experiments.

Since a unit never learns unless it wins, it is possible that one of the units will never win, and therefore never learn. This, of course, takes the competition out of competitive learning.
This situation is analogous to the situation in the geometric analogy in
which all of the stimulus points are relatively close together on the
hypersphere, and one of the weight vectors, by chance, points near the
cluster while the other one points far from the stimuli . (See Figure
10) . It is clear that the more distant vector is not closest to any
stimulus and thus can never move toward the collection . We have
investigated two modifications to the system which deal with the problem . One, which we call the leaky learning model, modifies the learning rule to state that both the winning and the losing units move toward
the presented stimulus; the closer vector simply moves much farther.
In symbols, this suggests that

$$\Delta w_{ij} \;=\; \begin{cases} g_l\, \dfrac{c_{ik}}{n_k} - g_l\, w_{ij} & \text{if unit } j \text{ loses on stimulus } k,\\[6pt] g_w\, \dfrac{c_{ik}}{n_k} - g_w\, w_{ij} & \text{if unit } j \text{ wins on stimulus } k, \end{cases}$$

where $g_l$ is the learning rate for the losing units, $g_w$ is the learning rate for the winning unit, and $g_l \ll g_w$. In our experiments we made
$g_l$ an order of magnitude smaller than $g_w$. This change has the property that it slowly moves the losing units into the region where the actual stimuli lie, at which point they begin to capture some stimulus patterns and the ordinary dynamics of competitive learning take over.
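In code, the leaky variant changes only the update step: the losers also drift toward the stimulus, with a much smaller learning rate. A minimal sketch under the same conventions as the earlier one (our illustration, with the specific rates chosen arbitrarily):

import numpy as np

def leaky_step(w, stimulus, g_w=0.1, g_l=0.01):
    """Leaky competitive learning: losing units also drift toward the stimuli."""
    n_k = stimulus.sum()
    winner = int(np.argmax(w @ stimulus))
    for j in range(w.shape[0]):
        # Losers learn an order of magnitude more slowly than the winner.
        g = g_w if j == winner else g_l
        w[j] += g * stimulus / n_k - g * w[j]
    return winner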
The second method is similar to that employed by Bienenstock,
Cooper, and Munro ( 1982) , in which a unit modulates its own sensitivity so that when it is not receiving enough inputs , it becomes
increasingly sensitive. When it is receiving too many inputs, it
decreases its sensitivity . This mechanism can be implemented in the
present context by assuming that there is a threshold and that the
relevant activation is the degree to which the unit exceeds its threshold.
If, whenever a unit fails to win, it decreases its threshold and whenever
it does win it increases its threshold, then this method will also make
all of the units eventually respond, thereby engaging the mechanism of
competitive learning. This second method can be understood in terms of the geometric analogy by imagining that each weight vector has a circle surrounding its end point. The relevant measure is not the distance to the vector itself but the distance to the circle surrounding the vector.
Every time a unit loses, it increases the radius of the circle; every time
it wins, it decreasesthe radius of the circle. Eventually , the circle on
the losing unit will be large enough to be closer to some stimulus pattern than the other units.
We have used both of these mechanisms in our experiments and
they appear to result in essentially similar behavior. The former, the leaky learning method, does not alter the formal analysis as long as the ratio $g_l / g_w$ is sufficiently small. The varying threshold method is more
difficult to analyze and may, under some circumstances, distort the
competitive learning process somewhat. After this diversion , we can
now return to our experiments on the development of word- and position-specific letter detectors and other feature detectors.
Position -specific letter detectors. In our first experiment , we
presented letter pairs drawn from the set: AA, AB, BA, and BB. We
began with clusters of size two. The results were unequivocal. The system developed position-specific letter detectors. In some experimental
runs, one of the units responded whenever AA or AB was presented,
and the other responded whenever BA or BB was presented. In this
case, Unit 1 represents an A detector in position 1 and Unit 2
represents a B detector for position 1. Moreover, as in the word perception model, the letter detectors are, of course, in a mutually inhibitory pool. On other experimental runs, the pattern was reversed. One
of the units responded whenever there was an A in the second position
and the other unit responded whenever there was a B in the second
position.

FIGURE 11. The final configuration of weights for one of the experimental runs with the stimulus set AA, AB, BA, and BB.

Figure 11 shows the final configuration of weights for one of our experimental runs. Note that although the units illustrated here
respond only to the letter in the first position , there is still weight on
the active lines in the second position. It is just that the weights on the first position differentiate between A and B, whereas those on the second position respond equally to the two letters. In particular, as suggested by our formal analysis, asymptotically the weights on a given line are proportional to the probability that that line is active when the unit wins; that is, $w_{ij} \to P(\text{line } i = 1 \mid \text{unit } j \text{ wins})$. Since the lower
level units unique to A occur equally as often as those unique to B, the
weights on those lines are roughly equal. The input lines common to
the two letters are on twice as often as those unique to either letter ,
and hence, they have twice as much weight. Those lines that never
come on reach zero weight.
Word detection units. In another experiment , we presented the
same stimulus patterns, but increased the elements in the cluster from
two to four . In this case, each of the four level-two units came to
respond to one of the four input patterns- in short, the system
developed word detectors. Thus, if layer two were to consist of a
number of clusters of various sizes, large clusters with approximately
one unit per word pattern will develop into word detectors, while
smaller clusters with approximately the number of letters per spatial
position will develop into position-specific letter detectors. As we shall
see below, if the number of elements of a cluster is substantially less
than the number of letters per position , then the cluster will come to
detect position-specific letter features.
Effects of number of elements per serial position . In another experiment , we varied the number of elements in a cluster and the number
of letters per serial position . We presented stimulus patterns drawn
from the set: AA, AB, AC, AD, BA, BB, BC, BD. In this case, we
found that with clusters of size two , one unit responded to the patterns
beginning with A and the other unit to the patterns beginning with B, whereas with clusters of size four, each unit came to respond to one of the four letters in the second position. Which distinction is discovered thus reflects the level of structure the cluster can represent. If the patterns are to be put into two categories, the relevant distinction is the letter in the first position; on the other hand, if they are to be put into four categories, the relevant distinction is the letter in the second position, and which of these is found depends on the number of elements in a cluster.

Letter similarity effects. In this experiment we used the
effects of letter similarity to look for units that detect letter features .
We presented letter patterns consisting of a letter in the first position
only . We chose the patterns so they formed two natural clusters based
on the similarity of the letters to one another . We presented the letters
A , B, S, and E . The letters were chosen so that they fell naturally into
two classes. In our font, the letters A and E are quite similar, and the letters B and S are quite similar. Eventually, one of the units responded to the A or the E while the other unit
responded to the B or the S. The weights were largest on those
features of the stimulus pairs which were common among each of these similar pairs. Thus, the system developed subletter-size feature detectors for the features relevant to the discrimination between the two classes.
Correlated teaching inputs. We carried out one other set of experiments with the word/ letter patterns. In this case, we used clusters of
size two and presented stimuli in which the letter in the second position was correlated with the letter in the first position. Note that on the left-hand side, we have the same four letters as we had in the previous experiment, but on the right-hand side we have
only two patterns ; these two patterns are correlated with the letter in
the first position . An A in the second position means that the first
position contains either an A or a B , whereas a B in the second position
means that the first position contains either an S or an E. In this case, the units came to respond to the correlated classes (one unit to an A or a B in the first position, the other to an S or an E) rather than to the classes defined by letter similarity: the letter in the second position determined the classification. This suggests that even though the competitive learning system is
an " unsupervised " learning mechanism , one can control what it learns
by controlling the statistical structure of the stimulus patterns being
presented to it . In this sense, we can think of the right -hand letter in
this experiment as being a kind of teaching stimulus aimed at determin ing the classification learned for other aspects of the stimulus . It
should also be noted that this teaching mechanism is essentially the same as Rosenblatt's forced learning: it turns the system into a kind of C'-type device, and it can cause the competitive learning mechanism to discover features which it otherwise would be unable to discover.

FIGURE 12. The pattern of weights developed in the correlated learning experiment.
It is therefore of some interest to see what the system fails to learn as well as our successes, because the way in which the system fails is elucidating. It should be noted at the outset that our goal is
not so much to present a model of how the human learns to distinguish
between vertical and horizontal lines ( indeed , such a distinction is prob ably prewired in the human system ) , but rather to show how competi tive learning can discover features which allow for the system to learn
distinctions with multiple layers of units that cannot be learned by
single -layered systems . Learning to distinguish vertical and horizontal
lines is simply a paradigm case.
In this set of experiments, we represented the lower level of units as if they were on a 6x6 grid. We then had a total of 12 stimulus patterns, each consisting of turning on six Level 1 units in a single row or column of the grid. Figure 13 illustrates the grid and several of the stimulus patterns. Ideally, one might hope that one of the units would respond whenever a vertical line is presented and the other would respond whenever a horizontal line is presented. Unfortunately, a little thought indicates that this is impossible, since every input unit participates in exactly one vertical and one horizontal line.
FIGURE 13. (The 6x6 grid and several of the stimulus patterns.)
Suppose, for example, that one unit of a cluster came to concentrate its weights on the vertical lines. Since both units would then receive about the same input in the face of a horizontal line, we might expect that sometimes one and sometimes the other would win the competition, but that the primary response would be to vertical lines. If other clusters settled down similarly to horizontal lines, then a unit at the third level looking at the output of the various clusters could distinguish vertical and horizontal. Unfortunately, that is not the pattern of weights discovered by the competitive learning mechanism. Rather, a typical pattern of weights is illustrated in Figure 15. In this arrangement, each cluster responds to exactly three horizontal and three vertical lines. Such a cluster has lost all information that might distinguish vertical from horizontal. We have discovered a feature of absolutely no use in this distinction. In fact, such features systematically throw away the information relevant to horizontal versus vertical. Some further thought indicates why such a result occurred. Note, in particular, that two horizontal lines have exactly nothing in common. The grid that we show in the diagrams is merely for our convenience. As far as the units are concerned, there are 36 unordered input units; sometimes some of those units are active. Pattern similarity is determined entirely by pattern overlap. Since horizontal lines don't intersect, they have no units in common and thus are not seen as similar at all. However, every horizontal line intersects with every vertical line and thus has much more in common with vertical lines than with other horizontal lines.
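This combinatorial point is easy to verify directly. The following few lines (an illustrative check, not from the original work) confirm that on the 6x6 grid distinct horizontal lines share no input units, while every horizontal line shares exactly one unit with every vertical line:

```python
# One stimulus = six active units forming a full row or column of the grid.
horizontals = [{(r, c) for c in range(6)} for r in range(6)]
verticals   = [{(r, c) for r in range(6)} for c in range(6)]

# Distinct horizontal lines never share a unit...
assert all(len(h1 & h2) == 0
           for i, h1 in enumerate(horizontals) for h2 in horizontals[i + 1:])
# ...but every horizontal line meets every vertical line in exactly one unit.
assert all(len(h & v) == 1 for h in horizontals for v in verticals)
print("h/h overlap: 0 units; h/v overlap: 1 unit")
```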
FIGURE 15. (A typical pattern of weights: Unit 1 and Unit 2 of Cluster 1 and of Cluster 2 each respond to three horizontal and three vertical lines.)

We then tried the correlated "teaching" procedure with the line stimuli. On the left-hand side of the grid we always presented the uppermost horizontal line whenever any horizontal line was presented on the right-hand grid, and we always presented the vertical line furthest to the left on the left-hand grid whenever we presented any vertical line on the right-hand side of the grid. We then had a cluster of two units receiving inputs from all 12 x 6 = 72 lower level units. (Figure 16 shows several of the stimulus patterns.)
As expected, the two units soon learned to discriminate between vertical and horizontal lines. One of the units responded whenever a vertical line was presented and the other responded whenever a horizontal line was presented. They were responding, however, to the pattern on the left-hand side rather than to the vertical and horizontal patterns on the right. This too should be expected. Recall that the value of $w_{ij}$ approaches a value which is proportional to the probability that input unit $i$ is active, given that unit $j$ won the competition. Now, in the case of the unit that responds to vertical lines, for example, every unit on the right-hand grid occurs equally often, so all of the weights connecting to units in that grid are equal. The same is true for the unit responding to the horizontal lines. As a result, the weights from
the right-hand grid are identical for the two cluster members. Thus, when the "teacher" is turned off and only the right-hand figure is presented, the two units respond randomly and show no evidence of having learned the horizontal/vertical distinction.
Suppose, however, that we have four, rather than two, units in the level-two clusters. We ran this experiment and found that of the four units, two of them divided up the vertical patterns and two of them divided up the horizontal patterns. Figure 17 illustrates the weight values for one of our runs. One of the units took three of the vertical line patterns; another unit took the three other vertical patterns. A third unit responded to three of the horizontal line patterns, and the last unit responded to the remaining three horizontal lines. Moreover, after we took away the "teaching" pattern, the system continued to classify the vertical and horizontal lines just as it did when the left-hand "teaching" pattern was present.
FIGURE 17. The weight values for the two clusters of size four for the vertical/horizontal discrimination experiment with a correlated "teaching" stimulus.
FIGURE 18. (The network for the three-layer experiment: input units, Layer 2, and a two-unit cluster at the top.)

Finally, we built a three-layer system in which clusters of four units at the second level feed a cluster of two units at the third level. Whenever a vertical line is presented, one of the vertical-responding units in each second-level cluster becomes active, and none of the horizontal-responding units will ever be active for such a stimulus. Thus, one of the units at the highest level learns to respond whenever a vertical line is presented, and the other unit responds whenever a horizontal line is
presented, and once the system has been trained, this occurs despite the absence of the "teaching" stimulus. Thus, we have shown that the competitive learning mechanism can, under certain conditions, develop feature detectors which allow the system to distinguish among patterns that are not differentiable by a simple linear unit in one level.
CONCLUSION
We have shown how a very simple competitive mechanism can discover a set of feature detectors that capture important aspects of the set of stimulus patterns. We have also shown how these feature detectors can form the basis of a multilayer system that can serve to learn categorizations of stimulus sets that are not linearly separable. We have shown how the use of correlated stimuli can serve as a kind of "teaching" input to the system, allowing the development of feature detectors that would not otherwise emerge. Competitive learning is not, of course, offered as a complete theory of learning; it is an essentially nonassociative, statistical learning scheme, and we certainly imagine that other kinds of learning mechanisms are involved in the building of adaptive networks. We offer this analysis of simple competitive learning mechanisms to further our understanding of how a system can discover features important in the description of the stimulus environment in which it finds itself.
ACKNOWLEDGMENTS
This research was supported by grants from the System Development Foundation and by Contract N00014-79-C-0323, NR 667-437 with the Personnel and Training Research Programs of the Office of Naval Research.
APPENDIX
For the case of homogeneous dipole stimulus patterns, it is possible to derive an expression for the most stable equilibrium state of the system. We say that a set of dipole stimulus patterns is homogeneous if (a) they are equally likely and (b) for every input pattern in the set there are a fixed number of other input patterns that overlap them. These conditions were met in our simulations. Our measure of stability is given by

$$T = \sum_k p_k \sum_j \sum_i v_{jk}\,(a_{jk} - a_{ik}).$$
Since $p_k = 1/N$, we can write

$$T = \frac{1}{N}\sum_i \sum_j \sum_k v_{jk}\,a_{jk} \;-\; \frac{1}{N}\sum_i \sum_j \sum_k v_{jk}\,a_{ik}.$$

Summing the first portion of the equation over $i$ and the second over $j$, we have

$$T = \frac{M}{N}\sum_j \sum_k v_{jk}\,a_{jk} \;-\; \frac{1}{N}\sum_i \sum_k a_{ik} \sum_j v_{jk}.$$
Now note that when $p_k = 1/N$, we have $a_{ik} = \sum_l r_{kl}\,v_{il} \big/ \sum_l v_{il}$. Furthermore, $\sum_l v_{lk} = 1$ and $\sum_k v_{lk} = N_l$, where $N_l$ is the number of patterns captured by unit $l$. Thus, we have

$$T = \frac{M}{N}\sum_j \sum_k v_{jk}\,a_{jk} \;-\; \frac{1}{N}\sum_i \sum_k \frac{\sum_l r_{kl}\,v_{il}}{N_i}.$$
Now, since all stimuli are the same size, we have $r_{ij} = r_{ji}$. Moreover, since all stimuli have the same number of neighbors, we have

$$\sum_j r_{ij} = \sum_i r_{ij} = R,$$

where $R$ is a constant determined by the dimensionality of the stimulus space from which the dipole stimuli are drawn. The second term thus reduces to $\frac{1}{N}\sum_i R\,N_i/N_i = RM/N$, and we have

$$T = \frac{M}{N}\sum_j \sum_k v_{jk}\,a_{jk} \;-\; \frac{RM}{N}.$$
Since $R$, $M$, and $N$ are constants, $T$ is maximum whenever $T' = \sum_j \sum_k v_{jk}\,a_{jk}$ is maximum. Now, substituting for $a_{jk}$, we can write

$$T' = \sum_j \frac{1}{N_j}\sum_k \sum_l r_{kl}\,v_{jk}\,v_{jl}.$$
We can now substitute for the product $v_{jk}v_{jl}$ the term $v_{jk} - v_{jk}(1 - v_{jl})$. We then can write

$$T' = \sum_j \frac{1}{N_j}\sum_k \sum_l r_{kl}\,v_{jk} \;-\; \sum_j \frac{1}{N_j}\sum_k \sum_l r_{kl}\,v_{jk}\,(1 - v_{jl}).$$

Summing the first term of the equation first over $l$, then over $k$, and then over $j$ gives us

$$T' = MR \;-\; \sum_j \frac{1}{N_j}\sum_k \sum_l r_{kl}\,v_{jk}\,(1 - v_{jl}).$$
Now, recall that $r_{kl}$ is given by the degree of stimulus overlap between stimulus $l$ and stimulus $k$. In the case of dipoles there are only three possible values of $r_{kl}$:

$$r_{kl} = \begin{cases} 1 & k = l \\ 1/2 & \text{overlap} \\ 0 & \text{no overlap.} \end{cases}$$

Now, the second term of the equation for $T'$ is 0 if either $r_{kl} = 0$ or $v_{jk}(1 - v_{jl}) = 0$. Since each $v_{jk}$ is either 1 or 0, $v_{jk}(1 - v_{jl})$ is zero whenever $k = l$. Thus, for all nonzero cases in the second term we have $r_{kl} = 1/2$, and we have
$$T' = MR - \frac{1}{2}\sum_j \frac{B_j}{N_j},$$

where $B_j$ is the number of pairs of overlapping patterns such that one member of the pair is captured by unit $j$ and the other is not. Finally, we can see that $T'$ will be a maximum whenever

$$T'' = \sum_j \frac{B_j}{N_j}$$

is minimum. Thus, minimizing $T''$ leads to the maximally stable solution in this case.
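To make $T''$ concrete, here is a small illustrative computation (not from the chapter; the ring environment and the unit assignments are invented for the example). For dipole patterns on a ring, it computes each $B_j$ directly and shows that a contiguous assignment of patterns to units yields a much smaller $T''$, hence a more stable solution, than an interleaved one:

```python
# Dipole stimuli on a ring of 16 input units: pattern k activates units
# k and k+1 (mod 16); two dipoles overlap iff they share a unit.
N = 16

def overlaps(k, l):
    return k != l and bool({k, (k + 1) % N} & {l, (l + 1) % N})

def T_double_prime(assignment):
    # assignment[k] = unit that captured pattern k;  T'' = sum_j B_j / N_j
    total = 0.0
    for j in set(assignment):
        captured = [k for k in range(N) if assignment[k] == j]
        # B_j: overlapping pairs split between unit j and some other unit
        B_j = sum(1 for k in captured for l in range(N)
                  if overlaps(k, l) and assignment[l] != j)
        total += B_j / len(captured)
    return total

contiguous  = [0] * (N // 2) + [1] * (N // 2)  # each unit takes one arc
interleaved = [k % 2 for k in range(N)]        # units alternate patterns
print(T_double_prime(contiguous))    # 0.5 (stable: few split neighbors)
print(T_double_prime(interleaved))   # 4.0 (unstable: every neighbor split)
```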
CHAPTER 6

Information Processing in Dynamical Systems: Foundations of Harmony Theory

P. SMOLENSKY
The classical theory of computation formalized the notion of the effective procedure, including languages for precisely expressing such procedures and theoretical machines for realizing them. This body of theory grew out of mathematical logic, and in turn contributed to computer science, physical computing systems, and the theoretical paradigm in cognitive science often called the (von Neumann) computer metaphor.1
In his paper "Physical Symbol Systems," Allen Newell (1980) articulated the role of the mathematical theory of symbolic computation in cognitive science and furnished a manifesto for what I will call the symbolic paradigm. The present book offers an alternative paradigm for cognitive science, the subsymbolic paradigm, in which the most powerful level of description of cognitive systems is hypothesized to be lower than the level that is naturally described by symbol manipulation.

The fundamental insights into cognition explored by the subsymbolic paradigm do not involve effective procedures and symbol manipulation. Instead they involve the "spread of activation," relaxation, and statistical correlation. The mathematical language in which these concepts are naturally expressed is that of probability theory and the theory of dynamical systems. By dynamical systems theory I mean the study of sets of numerical variables (e.g., activation levels) that evolve in time in parallel and interact through differential equations. The classical theory of dynamical systems includes the study of natural physical systems (e.g., mathematical physics) and artificially designed systems (e.g., control theory). Mathematical characterizations of dynamical systems that formalize the insights of the subsymbolic paradigm would be most helpful in developing the paradigm.

This chapter introduces harmony theory, a mathematical framework for studying a class of dynamical systems that perform cognitive tasks according to the account of the subsymbolic paradigm. These dynamical systems can serve as models of human cognition or as designs for artificial cognitive systems. The ultimate goal of the enterprise is to develop a body of mathematical results for the theory of information processing that complements the results of the classical theory of (symbolic) computation. These results would serve as the basis for a manifesto for the subsymbolic paradigm comparable to Newell's manifesto for the symbolic paradigm. The promise offered by this goal will, I hope, be suggested by the results of this chapter, despite their very limited scope.
1 Mathematical logic has recently given rise to another approach to formalizing information: situation semantics (Barwise & Perry, 1983). This is related to Shannon's (1948/1963) measure of information through the work of Dretske (1981). The approach of this chapter is more faithful to the probabilistic formulation of Shannon than is the symbolic approach of situation semantics. (This results from Dretske's move of identifying information with conditional probabilities of 1.)
A Top-Down Theoretical Strategy
How can mathematical analysis be used to study the processing mechanisms underlying the performance of some cognitive task? One strategy, often associated with David Marr (1982), is to characterize the task in a way that allows mathematical derivation of mechanisms that perform it. This top-down theoretical strategy is pursued in harmony theory. My claim is not that the strategy leads to descriptions that are necessarily applicable to all cognitive systems, but rather that the strategy leads to new insights, mathematical results, computer architectures, and computer models that fill in the relatively unexplored conceptual world of parallel, massively distributed systems that perform cognitive tasks. Filling in this conceptual world is a necessary subtask, I believe, for understanding how brains and minds are capable of intelligence and for assessing whether computers with novel architectures might share this capability.
The study of a cognitive task such as perception customarily begins from descriptions at two extreme levels. At the lowest levels, the peripheral, sensorimotor stages of processing are naturally conceptualized in terms of neural entities, such as the firing rates of individual retinal neurons. At the highest levels, tasks such as logical reasoning are naturally conceptualized in terms of formal symbol manipulation. Between these extremes lies a conceptual abyss, and the vast majority of cognitive processing, including all the intermediate levels of perceptual processing, lies somewhere within it. The theorist's problem is to find some way of conceptualizing the processing in the abyss. Theorists working in the symbolic paradigm clutch the rope anchored at the top: They take symbol manipulation, so well suited to the highest levels, and stretch it down into the abyss, hoping it will extend all the way to the low, peripheral levels. Theorists working in the subsymbolic paradigm climb up from an anchor at the opposite end: They take descriptive entities appropriate to extremely low levels of processing and extend them up toward the distant top. On the way up through the levels, the theoretical entities change character, and descriptions of the intermediate levels involve entities that belong neither to neuroscience proper nor to symbolic cognitive science.2

2 There is no contradiction between a top-down theoretical strategy and descriptive entities anchored at low levels of processing: The distinction between descriptive levels is not the same as the distinction between stages of causal processing. (Higher level descriptive entities, for example the average firing rates of pools of neurons in visual cortex, are by definition related to lower level entities, such as the firing rates of individual retinal neurons; changes in the higher level entities comprise changes in the lower level entities, instantly and without any causal information propagation from the lower level description.) Thus in harmony theory, models of higher level processes are derived from models of lower level, perceptual, processes, while lower level descriptions of these models are derived from higher level descriptions.
An application that will be described in some detail is qualitative problem solving in circuit analysis.3

The central idea of the top-down theoretical strategy is that properties of the task are powerfully constraining on mechanisms. This idea will be developed through the rest of the chapter.

3 Many cognitive tasks involve interpreting or controlling environments that evolve over an extended period of time. To deal properly with such tasks, harmony theory must be extended from the interpretation of static environments to the interpretation of dynamic environments.
Whatever redundancy is introduced by this expository organization will, I hope, be repaid by greater accessibility.
Section 1 is a top-down presentation of how the perceptual perspective on cognition leads to the basic features of harmony theory. This presentation starts with a particular perceptual model, the letter-perception model of McClelland and Rumelhart (1981), and abstracts from it general features that can apply to modeling of higher cognitive processes. Crucial to the development is a particular formulation of aspects of schema theory, along the lines of Rumelhart (1980).

Section 2, the majority of the chapter, is a bottom-up presentation of harmony theory that starts with the primitives of the knowledge representation. Theorems are informally described that provide a competence theory for a cognitive system that performs the completion task, a machine that realizes this theory, and a learning procedure through which the machine can absorb the necessary information from its environment. Then an application of the general theory is described: a model of intuitive, qualitative problem solving in elementary electric circuits. This model illustrates several points about the relation between symbolic and subsymbolic descriptions of cognitive phenomena; for example, it furnishes a sharp contrast between the descriptions at these two levels of the nature and acquisition of expertise.

The final part of the chapter is an Appendix containing a concise but self-contained formal presentation of the definitions and theorems.
SECTION 1: SCHEMA THEORY AND SELF-CONSISTENCY

THE LOGICAL STRUCTURE OF HARMONY THEORY
The logical structure of harmony theory is shown schematically in
Figure 1. The box labeled Mathematical Theory represents the use of
mathematical analysis and computer simulation for drawing out the
implications of the fundamental principles. These principles comprise a
mathematical characterization of computational requirements of a cognitive system that performs the completion task. From these principles
FIGURE 1. (The logical structure of harmony theory. Conceptual level: a descriptive characterization of computational requirements is formalized as a mathematical characterization; from this, the mathematical theory supports derivation of a characterization of performance and a machine implementation as a simulated process.)
it is possible to mathematically analyze aspects of the resulting performance as well as rigorously derive the rules for a machine implementing the computational requirements. The rules defining this machine have a different status from those defining most other computer models of cognition: They are not ad hoc, or post hoc; rather, they are logically derived from a set of computational requirements. This is one sense in which harmony theory has a top-down theoretical development.

Where do the "mathematically characterized computational requirements" of Figure 1 come from? They are a formalization of a descriptive characterization of cognitive processing, a simple form of schema theory. In Section 1 of this chapter, I will give a description of this form of schema theory and show how to transform the descriptive characterization into a mathematical one: how to get from the conceptual box of Figure 1 into the mathematical box. Once we are in the formal world, mathematical analysis and computer simulation can be put to work.
Throughout Section 1, the main points of the development will be
explicitly enumerated.
Point 1. The mathematics of harmony theory is founded on familiar concepts of cognitive science: inference through activation of schemata.
(Figure: "Headache in a restaurant." The restaurant context activates a restaurant script and the headache context a headache script; the combined restaurant-and-headache context activates a special-purpose script, each script supporting inferences and goals, e.g., "ask waitress for aspirin.")
The view developed here is not that whole prestored scripts are retrieved, but instead that the knowledge base is a set of knowledge atoms that configure themselves dynamically in each context to form tailor-made scripts. This is the fundamental idea formalized in harmony theory.4

The degree of flexibility demanded of scripts is equaled by that demanded of all conceptual structures.5 For example, metaphor is an extreme example of the flexibility demanded of word meanings; even so-called literal meaning on closer inspection actually relies on extreme flexibility of knowledge application (Rumelhart, 1979). In this chapter I will consider knowledge structures that embody our knowledge of objects, words, and other concepts of comparable complexity; these I will refer to as schemata. The defining properties of schemata are that they have conceptual interpretations and that they support inference. For lack of a better term, I will use knowledge atoms to refer to the elementary constituents of which I assume schemata to be composed.6 These atoms will shortly be given a precise description; they will be interpreted as a particular instantiation of the idea of memory trace.

Point 2. At the time of inference, stored knowledge atoms are dynamically assembled into context-sensitive schemata.
6 One might call these particles gnosons or sophons, but these terms might not seem quite to serve. An acronym for "Units for Constructing Schemata Dynamically" might perhaps serve, but it might be taken as an advertising gimmick. So I have stuck with "knowledge atoms."
FIGURE 4. (Knowledge atoms W1A2, M1A2, and A2K3 and their knowledge vectors of +, -, and 0 values.)
This simple model, built over the line-segment and letter units of the letter-perception model, illustrates several points about the nature of knowledge atoms in harmony theory. The digraph unit W1A2 represents a pattern of values over the letter units: W1 and A2 on, with all other letter units for positions 1 and 2 off. This pattern is shown in Figure 4, using labeled connections. These indicate whether there is an excitatory connection, inhibitory connection, or no connection between the corresponding nodes.7
Figure 4 shows the basic structure of harmony models. There are atoms of knowledge, represented by nodes in an upper layer, and a lower layer of nodes that comprises a representation of the state of the perceptual or problem domain with which the system deals. Each node is a feature in the representation of the domain. We can now view "atoms of knowledge" like W1 and A2 in several ways. Mathematically, each atom is simply a vector of +, -, and 0 values, one for each node in the lower, representation layer. This pattern can also be viewed as a fragment of a percept: The 0 values mark those features omitted in the fragment. This fragment can in turn be interpreted as a trace left behind in memory by perceptual experience.
7 Omitted are the knowledge atoms that relate the letter nodes to the line-segment nodes. Both line-segment and letter nodes are in the lower layer, and all knowledge atoms are in the upper layer. Hierarchies in harmony theory are imbedded within an architecture of only two layers of nodes, as will be discussed in Section 2.
THE COMPLETION TASK
Having specified more precisely what the atoms of knowledge are, it is time to specify the task in which they are used.

Many cognitive tasks can be viewed as inference tasks. In problem solving, the role of inference is obvious; in perception and language comprehension, inference is less obvious but just as central. In harmony theory, a tightly prescribed but extremely general inferential task is studied: the completion task. In a problem-solving completion task, a partial description of a situation is given (for example, the initial state of a system); the problem is to complete the description to fill in the missing information (the final state, say). In a story understanding completion task, a partial description of some events and actors' goals is given; comprehension involves filling in the missing events and goals. In perception, the stimulus gives values for certain low-level features of the environmental state, and the perceptual system must fill in values for other features. In general, in the completion task some features of an environmental state are given as input, and the cognitive system must complete that input by assigning likely values to unspecified features.

A simple example of a completion task (Lindsay & Norman, 1972) is shown in Figure 5. The task is to fill in the features of the obscured portions of the stimulus and to decide what letters are present. This task can be performed by the model shown in Figure 3, as follows. The stimulus assigns values of on and off to the unobscured letter features. What happens is summarized in Table 1.
FIGURE 5. A perceptual completion task.

TABLE 1
A PROCEDURE FOR PERFORMING THE COMPLETION TASK
Input: ...
Activation: ...
Inference: ...

Note that which atoms are activated affects how the representation is
filled in, and how the representation is filled in affects which atoms are activated. The activation and inference processes mutually constrain each other; these processes must run in parallel. Note also that all the decisions come out of a striving for consistency.

Point 6. Assembly of schemata (activation of atoms) and inference (completing missing parts of the representation) are both achieved by finding maximally self-consistent states of the system that are also consistent with the input.
FIGURE 6. The state of the network in the completion of the stimulus shown in Figure 5. (Filled nodes: active/on; open nodes: inactive/off.)

The completion of the stimulus shown in Figure 5 is shown in Figure 6. The consistency is high because wherever an active atom is
connected to a representational feature by a + (respectively, -) connection, that feature has value on (respectively, off). In fact, we can define a very simple measure of the degree of self-consistency just by considering all active atoms, counting +1 for every agreement between one of its connections and the value of the corresponding feature, and counting -1 for every disagreement. (Here + with on or - with off constitutes agreement.) This is the simplest example of a harmony function, and it brings us into the mathematical formulation.
THE HARMONY FUNCTION

The measure just defined is a special case of the general form:

harmony( representational feature vector, activations ) =
  sum over all knowledge atoms a in the knowledge base of
    [ strength of atom a ]
    x [ similarity( feature vector of atom a, representation feature vector ) ]
    x [ 0 if atom a is inactive; 1 if active ].

In symbols,

$$H(\mathbf{r}, \mathbf{a}) = \sum_a \sigma_a\, a_a\, h(\mathbf{r}, \mathbf{k}_a),$$

where $\sigma_a$ is the strength of atom $a$, $a_a$ is 1 if atom $a$ is active and 0 if it is inactive, and $h$ measures the similarity between the knowledge vector $\mathbf{k}_a$ of atom $a$ and the representational feature vector $\mathbf{r}$.
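In code, this harmony measure is only a few lines. The sketch below (feature values and atom definitions are invented for illustration, not taken from the chapter) computes $H(\mathbf{r}, \mathbf{a})$ using the normalized similarity of Equation 1, which is introduced shortly:

```python
def h_kappa(r, k, kappa=0.0):
    # similarity of Equation 1: (r . k_a)/|k_a| - kappa, taken over the
    # atom's nonzero connections
    nz = [i for i, ki in enumerate(k) if ki != 0]
    return sum(r[i] * k[i] for i in nz) / sum(abs(k[i]) for i in nz) - kappa

def harmony(r, atoms, activations, strengths, kappa=0.0):
    # H(r, a) = sum_a  sigma_a * a_a * h_kappa(r, k_a)
    return sum(s * a * h_kappa(r, k, kappa)
               for k, a, s in zip(atoms, activations, strengths))

# Four features; two unit-strength atoms, both active.
r = [+1, -1, +1, +1]
atoms = [[+1, -1, 0, 0],    # agrees with r on both of its features: h = 1
         [0, 0, -1, +1]]    # one agreement, one disagreement:       h = 0
print(harmony(r, atoms, activations=[1, 1], strengths=[1.0, 1.0]))  # 1.0
```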
A PROBABILISTIC FORMULATION OF SCHEMA THEORY

HARMONY THEORY
According to Points 2, 6, and 7, schemata are collections of knowledge atoms that become active in order to maximize harmony, and the states of maximal harmony can be found by lowering the temperature to zero. This process, cooling, is fundamental to harmony theory. Concepts and techniques from thermal physics can be used to understand and analyze decision-making processes in harmony theory. In particular, a technique for performing Monte Carlo computer studies of thermal systems can be readily adapted to harmony theory.

Point 11. A massively parallel stochastic machine performs completions in accordance with Points 1-10.
8 Since harmony corresponds to minus energy, at low physical temperatures only the state with the lowest energy (the ground state) has nonnegligible probability.
For a given harmony model (e.g., that of Figure 4), this machine is constructed as follows. Every node in the network becomes a simple processor, and every link in the network becomes a communication link between two processors. The processors each have two possible values (+1 and -1 for the representational feature processors; 1 = active and 0 = inactive for the knowledge atom processors). The input to a completion problem is provided by fixing the values of some of the feature processors. Each of the other processors continually updates its value by making stochastic decisions based on the harmony associated at the current time with its two possible values. It is most likely to choose the value that corresponds to greater harmony, but with some probability (greater the higher the computational temperature T) it will make the other choice. Each processor computes the harmony associated with its possible values by a numerical calculation that uses as input the numerical values of all the other processors to which it is connected. Alternately, all the atom processors update in parallel, and then all the feature processors update in parallel. The process repeats many times, implementing the procedure of Table 1. All the while, the temperature is lowered to zero, pursuant to Point 10. It can be proved that the machine will eventually "freeze" into a completion that maximizes the harmony.

I call this machine harmonium because, like the Selfridge and Neisser (1960) pattern recognition system pandemonium, it is a parallel distributed processing system in which many atoms of knowledge are simultaneously "shouting" out their little contributions to the inference process; but unlike pandemonium, there is an explicit method to the madness: the collective search for maximal harmony.9
The final point concerns the account of learning in harmony theory.

Point 12. There is a procedure for accumulating knowledge atoms through exposure to the environment so that the system will perform the completion task optimally.

The precise meaning of "optimality" will be an important topic in the subsequent discussion.

This completes the descriptive account of the foundations of harmony theory. Section 2 fills in many of the steps and details omitted
9 Harmonium is closely related to the Boltzmann machine discussed in Chapter 7: The basic dynamics of the machines are the same, although there are differences in most details. In the Appendix, it is shown that in a certain sense the Boltzmann machine is a special case of harmonium, in which knowledge atoms connected to more than two features are forbidden. In another sense, harmonium is a special case of the Boltzmann machine, in which the connections are restricted to go only between two layers.
above, and reports the results of some particular studies. The most formal matters are treated in the Appendix.

SECTION 2: HARMONY THEORY
. . . the privileged unconscious phenomena, those susceptible of becoming conscious, are those which . . . affect most profoundly our emotional sensibility . . . Now, what are the mathematic entities to which we attribute this character of beauty and elegance . . . ? They are those whose elements are harmoniously disposed so that the mind without effort can embrace their totality while realizing the details. This harmony is at once a satisfaction of our esthetic needs and an aid to the mind, sustaining and guiding . . . . Figure the future elements of our combinations as something like the unhooked atoms of Epicurus. . . . They flash in every direction through the space . . . like the molecules of a gas in the kinematic theory of gases. Then their mutual impacts may produce new combinations.

Henri Poincaré (1913), Mathematical Creation 10
In Section 1, a top-down analysis led from the demands of the completion task and a probabilistic formulation of schema theory to perceptual features, knowledge atoms, the central notion of harmony, and the role of harmony in estimating probabilities of environmental states. In Section 2, the presentation will be bottom-up, starting from the primitives.
KNOWLEDGE REPRESENTATION

Representation Vector

At the center of any harmony theoretic model of a particular cognitive process is a set of representational features $r_1, r_2, \ldots$. These
10 I am indebted to Yves Chauvin for recently pointing out this remarkable passage by the great mathematician. See also Hofstadter (1985, pp. 655-656).
features constitute the cognitive system's representation of the possible states of the environment with which it deals. A state of the environment is represented by a collection of values for all the representational variables $\{r_i\}$. This collection can be designated by a list or vector of +'s and -'s: the representation vector $\mathbf{r}$.
Where do the features used in the representation vector come from? Are they "innate" or do they develop with experience? These crucial questions will be deferred until the last section of this chapter. The evaluation of various possible representations for a given environment and the study of the development of good representations through exposure to the environment is harmony theory's raison d'etre. But a prerequisite for understanding the appropriateness of a representation is understanding how the representation supports performance on the task for which it is used; that is the primary concern of this chapter. For now, we simply assume that somehow a set of representational features has already been set up: by a programmer, or experience, or evolution.
11 While continuous values make the analysis more complex, they may well improve the performance of the simulation models. In simulations with discrete values, the system state jumps between corners of a hypercube; with continuous values, the system state crawls smoothly around inside the hypercube. It was observed in the work reported in Chapter 14 that "bad" corners corresponding to stable nonoptimal completions (local harmony maxima) were typically not visited by the smoothly moving continuous state; these corners typically are visited by the jumping discrete state and can only be escaped from through thermal stochasticity. Thus continuous values may sometimes eliminate the need for stochastic simulation.
Activation Vector
The representational features serve as the blackboard on which the cognitive system carries out its computations. The knowledge that guides those computations is associated with the second set of entities, the knowledge atoms. Each such atom $a$ is characterized by a knowledge vector $\mathbf{k}_a$, which is a list of +1, -1, and 0 values, one for each representation variable $r_i$. This list encodes a piece of knowledge that specifies what value each $r_i$ should have: +1, -1, or unspecified (0).

Associated with knowledge atom $a$ is its activation variable, $a_a$. This variable will also be taken to be binary: 1 will denote active; 0, inactive. Because harmony theory is probabilistic, degrees of activation are represented by varying probability of being active rather than varying values for the activation variable. (Like continuous values for representation variables, continuous values for activation variables could be incorporated into the theory with little difficulty, but a need to do so has not yet arisen.) The list of $\{0, 1\}$ values for the activations $\{a_a\}$ comprises the activation vector $\mathbf{a}$.
Knowledge atoms encode subpatterns of feature values that occur in the environment. The different frequencies with which various such patterns occur are encoded in the set of strengths, $\{\sigma_a\}$, of the atoms. In the example of qualitative circuit analysis, each knowledge atom records a pattern of qualitative changes in some of the circuit features (currents, voltages, etc.). These patterns are the ones that are consistent with the laws of physics, which are the constraints characterizing the circuit environment. Knowledge of the laws of physics is encoded in the set of knowledge atoms. For example, the atom whose knowledge vector contains all zeroes except those features encoding the pattern <current decreases, voltage decreases, resistance increases> is one of the atoms encoding qualitative knowledge of Ohm's law. Equally important is the absence of an atom like one encoding the pattern <current increases, voltage decreases, resistance increases>, which violates Ohm's law.
There is a very useful graphical representation for knowledge atoms; it was illustrated in Figure 4 and is repeated as Figure 8. The representational features are designated by nodes drawn in a lower layer; the activation variables are depicted by nodes drawn in an upper layer. The connections from an activation variable $a_a$ to the representation variables $\{r_i\}$ show the knowledge vector $\mathbf{k}_a$. When $\mathbf{k}_a$ contains a + or - for $r_i$, the connection between $a_a$ and $r_i$ is labeled with the appropriate sign; when $\mathbf{k}_a$ contains a 0 for $r_i$, the connection is omitted.

In Figure 8, all atoms are assumed to have unit strength. In general, different atoms will have different strengths; the strength of each atom
would then be indicated above the atom in the drawing. (For the completely general case, see Figure 13.)

FIGURE 8. (The knowledge atoms of Figure 4, shown with their knowledge vectors.)
217
connect that information with the letter nodes, completing the
representation to include letter recognition . Other knowledge atoms
connect patterns on the letter nodes with word nodes, and these complete the representation to include word recognition .
The pattern of connectivity of Figure 9 allows the network to be redrawn as shown in Figure 10. This network shows an alternation of representation and knowledge nodes, restoring the image of a series of layers. In this sense, "vertically" hierarchical networks of many layers can be imbedded as "horizontally" hierarchical networks within a two-layer harmony network.

Figure 10 graphically displays the fact that in a harmony architecture, the nodes that encode patterns are not part of the representation; there is a firm distinction between representation and knowledge nodes. This distinction is not made in the original letter-perception model, where the nodes that detect a pattern over the line-segment features are identical with the nodes that actually represent the presence of letters. This distinction seems artificial; why is it made?
I claim that the artificiality actually resides in the original letter-perception model, in which the presence of a letter can be identified with a single pattern over the primitive graphical features (line segments). In a less idealized reading task, the presence of a letter would have to be inferable from many different combinations of primitive graphical features. In harmony theory, the idea is that there would be a set of representation nodes dedicated to the representation of the presence of letters independent of their shapes, sizes, orientations, and so forth. There would also be a set of representation nodes for graphical features.

FIGURE 9. (Segment/letter knowledge atoms and letter/word knowledge atoms connecting letter nodes and word nodes of increasing abstractness.)
FIGURE 11. (A two-layer harmony network: knowledge atoms in the upper layer connect graphical features, letter nodes, word nodes, and phonological, syntactic, and semantic features in the lower layer.)
This uniform two-layer architecture admits a truly parallel implementation. The uniformity also means that we can imagine a system starting out with an "innate" two-layer structure, and harmony theory is set up to analyze the environmental conditions under which certain kinds of representations (e.g., hierarchical ones) might emerge or be expedient.12
12 A negative connection between two lower level nodes means that the value pairs (+,-) and (-,+) are favored relative to the other two pairs. This effect can be achieved by creating two knowledge atoms that each encode one of the two favored patterns. A positive connection similarly can be replaced by two atoms for the patterns (+,+) and (-,-).
$$h_\kappa(\mathbf{r}, \mathbf{k}_a) = \frac{\mathbf{r}\cdot\mathbf{k}_a}{|\mathbf{k}_a|} - \kappa. \tag{1}$$

FIGURE 12. (A decomposable harmony network, with atoms $a_1$, $a_2$ and features $r_1$, $r_2$.)

13 In physics, one says that H must be an extensive quantity.
The vector inner product (see Chapter 9) is defined by

$$\mathbf{r}\cdot\mathbf{k}_a = \sum_i r_i\,(\mathbf{k}_a)_i,$$

and the norm $|\mathbf{k}_a|$14 is defined by

$$|\mathbf{k}_a| = \sum_i |(\mathbf{k}_a)_i|.$$
Based on these definitions, the harmony function contains one term for each knowledge atom, with the term for atom $a$ depending only on those representation variables $r_i$ to which it has a nonzero connection $(\mathbf{k}_a)_i$. Thus it satisfies the additivity requirement. The contribution to H of an inactive atom is zero. The contribution of an active atom $a$ is the product of its strength and the similarity $h_\kappa$ between its knowledge vector and the representation vector $\mathbf{r}$.

The parameter $\kappa$ sets the criterion an active atom must meet in order to contribute positive harmony. As $\kappa$ goes from -1 through 0 towards 1, the fraction of an atom's connections that must match the representation goes from 0% through 50% toward perfect matching; positive values of $\kappa$ impose a criterion more stringent than 50%. The criterion must remain short of perfection because any atom in a model has only a finite number $|\mathbf{k}_a|$ of nonzero connections, and hence only a finite number of possible matches. Indeed, it is easy to compute
14 This is the so-called $L_1$ norm, which is different from the $L_2$ norm defined in Chapter 9. For each $p$ in $(0, \infty)$, the $L_p$ norm of a vector $\mathbf{v}$ is defined by $|\mathbf{v}|_p = \left[\sum_i |v_i|^p\right]^{1/p}$.
that an atom with $n$ nonzero connections that fails to match the representation on even one of them has $\mathbf{r}\cdot\mathbf{k}_a/|\mathbf{k}_a|$ at most $1 - 2/n$. Thus, if $\kappa$ exceeds $1 - 2/n$ for every atom in the model, an atom can contribute positive harmony only when it matches the representation perfectly, that is, only when $r_i$ agrees with $(\mathbf{k}_a)_i$ for every feature $i$ on which the atom specifies a value. I will call this the perfect matching limit.

A word is in order about the binary values chosen for the representational features. The values $\{+1, -1\}$ are symmetric, and the harmony function built on them is unbiased: It is invariant under the exchange of the values + and -, provided the signs in the knowledge vectors are flipped at the same time. Had the values $\{1, 0\}$ been used instead, the harmony function would have been systematically biased toward one of the two feature values. There is nothing sacred about unbiased functions; a bias toward one value could be deliberately inserted if there were a principled reason for it. But it seemed reasonable to start from the outset with a function that favors neither feature value.

Estimating Probabilities With the Harmony Function

In Section 1, I suggested that a cognitive system performing the completion task should estimate the probabilities of the possible values of the unknown variables, and Point 9 asserted that the estimated probability corresponding to a state should be an exponential function of its harmony:

$$\text{prob} \propto e^{H/T}. \tag{2}$$

It is this relationship that establishes the mapping with statistical physics. In the next section, the relationship between harmony
and probability is analyzed. In this section I will point out that if probabilities are to be estimated using H, then the exponential relationship of Equation 2 should be used. In the next section I adapt an argument of Stuart Geman (personal communication, 1984) to show that, starting from the extremely general probabilistic assumption known as the principle of maximum missing information, both Equation 2 and the form of the harmony function (Equation 1) can be jointly derived.

What we know about harmony functions in general is that they are additive under network decomposition. If a harmony network consists of two unconnected components, the harmony of any given state of the whole network is the sum of the harmonies of the states of the component networks. In the case of such a network, what is required of the probability assigned to the state? I claim it should be the product of the probabilities assigned to the states of the component networks. The meaning of the unconnectedness is that the knowledge used in the inference process does not relate the features in the two networks to each other. Thus the results of inference about these two sets of features should be independent. Since the probabilities assigned to the states in the two networks should be independent, the probability of their joint occurrence (the state of the network as a whole) should be the product of their individual probabilities.

In other words, adding the harmonies of the components' states should correspond to multiplying the probabilities of the components' states. The exponential function of Equation 2 establishes just this correspondence. It is a mathematical fact that the only continuous functions f that map addition into multiplication,
$$f(x + y) = f(x)\,f(y),$$

are the exponential functions $f(x) = a^x$ for some positive number $a$. Equivalently, these functions can be written

$$f(x) = e^{x/T}$$

for some $T$; the correspondence is $T = 1/\ln a$, and the sign of $T$ can be made positive by choosing $a$ greater than 1.
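A compact sketch of that mathematical fact (standard analysis, not specific to this chapter):

```latex
% Suppose f is continuous, not identically zero, and f(x+y) = f(x) f(y).
% Then f(x) = f(x/2)^2 >= 0, and f can vanish nowhere: if f(x0) = 0,
% then f(x) = f(x0) f(x - x0) = 0 for all x.  So f > 0 everywhere.
% Induction gives f(nx) = f(x)^n, hence f(p/q) = f(1)^{p/q} for every
% rational p/q; continuity extends this to all reals:
\[
  f(x) \;=\; f(1)^{x} \;=\; e^{x \ln f(1)} \;=\; e^{x/T},
  \qquad T \;=\; \frac{1}{\ln f(1)} .
\]
```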
This argument leaves the value of the computational temperature T undetermined. However, several observations can be made about the value of T.

The first observation is that T must be positive; otherwise, states of greater harmony would be estimated to have smaller probability.

The second observation is that the magnitude of T simply sets the scale on which harmony is measured. Consider a cognitive system that estimates probabilities using a harmony function $H_a$ at temperature $T_a$, and another that uses the function $H_b = (T_b/T_a)H_a$ at temperature $T_b$. Since $H_b/T_b = H_a/T_a$, the two systems make exactly the same estimates, and their behavior on any completion task is indistinguishable. Thus the choice of T is meaningful only once the scale of the harmony function has been fixed, whether by the modeler programming the system or through learning; until then, any convenient positive value can be used.

The third observation refines the second. Equation 2 can be rewritten to express the likelihood ratio of two states $s_1$ and $s_2$:

$$\frac{\text{prob}(s_1)}{\text{prob}(s_2)} = e^{[H(s_1) - H(s_2)]/T}. \tag{3}$$

(This can also be understood as the number $e^{H(s_1) - H(s_2)}$ raised to the power $1/T$.) Thus harmony differences are measured on a scale set by T. Differences in harmony that are large compared to T correspond to "significant" differences in likelihood, while differences small compared to T correspond to likelihood ratios near one. If the harmony differences are held fixed while T goes to infinity, all states end up with essentially the same probability. In the opposite limit, as T goes to zero, the distribution becomes more and more sharply peaked: If $s_1$ has greater harmony than $s_2$, then as T gets smaller the likelihood ratio of $s_1$ to $s_2$ gets ever larger, and in the zero-temperature limit all the probability is concentrated on the states of maximal harmony. Of course, this limiting distribution is a discontinuous function of the harmony values and is not itself of the exponential form,
but it can be obtained as the limit of exponential distributions; in fact, the zero-temperature limit plays a major role in the theory, since the states of maximal harmony are the best answers to completion problems.
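A quick numerical illustration (invented values) of this sharpening: the same pair of harmonies produces nearly even odds at T = 1 but essential certainty as T approaches zero.

```python
import math

H = [1.0, 0.9]                 # harmonies of two competing states
for T in (1.0, 0.1, 0.01):
    Z = sum(math.exp(h / T) for h in H)
    print(T, [round(math.exp(h / T) / Z, 4) for h in H])
# T=1.0  -> [0.525, 0.475]   (nearly even odds)
# T=0.01 -> [1.0, 0.0]       (all probability on the maximal-harmony state)
```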
THE COMPETENCE, REALIZABILITY, AND LEARNABILITY THEOREMS

In this section, the mathematical results that currently form the core of harmony theory are informally described. A formal presentation may be found in the Appendix.
The Competence Theorem
In harmony theory, a cognitive system's knowledge is encoded in its knowledge atoms. Each atom represents a pattern of values for a few features describing environmental states, values that sometimes co-occur in the system's environment. The strengths of the atoms encode the frequencies with which the different patterns occur in the environment. The atoms are used to estimate the probabilities of events in the environment.

Suppose then that a particular cognitive system is capable of observing the frequency with which each pattern in some pre-existing set $\{\mathbf{k}_a\}$ occurs in its environment. (The larger the set $\{\mathbf{k}_a\}$, the greater is the potential power of this cognitive system.) Given the frequencies of these patterns, how should the system estimate the probabilities of environmental events? What probability distribution should the system guess for the environment?

There will generally be many possible environmental distributions that are consistent with the known pattern frequencies. How can one be selected from all these possibilities?

Consider a simple example. Suppose there are only two environmental features in the representation, $r_1$ and $r_2$, and that the system's only information is that the pattern $r_1 = +$ occurs with a frequency of 80%. There are infinitely many probability distributions for the four environmental events $(r_1, r_2) \in \{(+,+),\ (+,-),\ (-,+),\ (-,-)\}$ that are consistent with the given information. For example, we know nothing about the relative likelihood of the two events $(+,+)$ and $(+,-)$; all we know is that together their probability is .80.
One respect in which the possible probability distributions differ is their degree of homogeneity. A distribution P in which P(+,+) = .7
and P(+,-) = .1 is less homogeneous than one in which P(+,+) = P(+,-) = .4. Another way of putting this is that there is more missing information in the second, more homogeneous distribution: Given only that the first feature is +, the homogeneous distribution leaves us with greater uncertainty about the environmental state than the inhomogeneous one. Shannon's (1948/1963) formula quantifies the amount of missing information in a probability distribution, and by that formula the missing information in the homogeneous probabilities is greater than the missing information in the inhomogeneous probabilities.

The principle of maximal missing information states that, of all the distributions consistent with what is known, the cognitive system should select the one with the most missing information. The justification is that to choose any other distribution is to pretend to information about the environment that the system does not in fact have. This principle is often used as the formal basis for statistical extrapolation from a set of known frequencies to an estimate of an entire probability distribution, and it can be given quite general justifications (Christensen, 1981; Levine & Tribus, 1979). For the simple example above, it selects the distribution P(+,+) = P(+,-) = .40 and P(-,+) = P(-,-) = .10: The known 80% frequency of the pattern $r_1 = +$ is accounted for, and the remaining uncertainty is spread as homogeneously as possible.

In the general case, the environmental distribution the system should guess is the one that maximizes the Shannon missing information while remaining
consistent with the observed frequencies of the patterns $\mathbf{k}_a$. The result is a probability distribution

$$\pi(\mathbf{r}) \propto e^{U(\mathbf{r})},$$

where U is defined by

$$U(\mathbf{r}) = \sum_a \lambda_a\,\chi_a(\mathbf{r}).$$
The values of the real parameters $\lambda_a$ (one for each atom) are constrained by the known pattern frequencies; they will shortly be seen to be proportional to the atom strengths, $\sigma_a$, the system should use for modeling the environment. The value of $\chi_a(\mathbf{r})$ is simply 1 when the environmental state $\mathbf{r}$ includes the pattern $\mathbf{k}_a$ defining atom $a$, and 0 otherwise.

By introducing the activation variables of the different atoms, this estimate can be put in the form of the distribution

$$p(\mathbf{r}, \mathbf{a}) \propto e^{H(\mathbf{r}, \mathbf{a})}.$$

Here, H is the harmony function defined previously, where the strengths are proportional to the $\lambda_a$; the activation variables serve to eliminate the functions $\chi_a$ from the formula. To perform a completion, the unknown features in $\mathbf{r}$ must be assigned values, and the unknown activations in $\mathbf{a}$ must be filled in as well. In other words, to perform the completion, the cognitive system must find those values of the unknown $r_i$ and those values of the $a_a$ that together maximize the harmony $H(\mathbf{r}, \mathbf{a})$ and thereby maximize the estimated probability $p(\mathbf{r}, \mathbf{a})$.

This discussion is summarized in the following theorem:
Theorem 1: Competence. Suppose a cognitive system can observe the frequency of the patterns $\{\mathbf{k}_a\}$ in its environment. The probability distribution with the most Shannon missing information that is consistent with these observations is

$$\pi(\mathbf{r}) \propto e^{U(\mathbf{r})},$$

with U defined as above. The maximum-likelihood completions of this distribution are the same as those of the distribution

$$p(\mathbf{r}, \mathbf{a}) \propto e^{H(\mathbf{r}, \mathbf{a})},$$

with the harmony function H defined as above.
This theorem describes how a cognitive system should perform completions, according to some mathematical principles for statistical extrapolation and inference. In this sense, it is a competence theorem. The obvious next question is: Can we design a system that will really compute completions according to the specifications of the competence theorem?
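The estimate that the competence theorem formalizes can be checked numerically on the running example. The sketch below (mine, not from the chapter) searches over all distributions with $P(r_1 = +) = .80$ and confirms that the Shannon missing information $-\sum_x P(x)\ln P(x)$ peaks at the homogeneous assignment (.40, .40, .10, .10):

```python
import math

def missing_info(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

best, best_h = None, -1.0
steps = 400
for i in range(steps + 1):            # P(+,+) ranges over [0, .8]
    x = 0.8 * i / steps
    for j in range(steps + 1):        # P(-,+) ranges over [0, .2]
        y = 0.2 * j / steps
        ps = (x, 0.8 - x, y, 0.2 - y)     # (+,+), (+,-), (-,+), (-,-)
        h = missing_info(ps)
        if h > best_h:
            best, best_h = ps, h
print([round(p, 3) for p in best])    # -> [0.4, 0.4, 0.1, 0.1]
```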
The "Physics Analogy"

It turns out that designing a machine to do the required computations is a relatively straightforward application of a computational technique from statistical physics. It is therefore an appropriate time to discuss the "analogy" to physics that is exploited in harmony theory.

Why is the relation between probability and harmony expressed in the competence theorem the same as the relation between probability and energy in statistical physics? The mapping between statistical physics and inference that is being exploited is one that has been known for a long time.
The second law of thermodynamics states that as physical systems
evolve in time, they will approach conditions that maximize randomness or entropy, subject to the constraint that a few conserved quantities, like the system's energy, must always remain unchanged. One of the
triumphs of statistical mechanics was the understanding that this law is
the macroscopic manifestation of the underlying microscopic description
of matter in terms of constituent particles. The particles will occupy various states, and the macroscopic properties of a system will depend on the probabilities with which the states are occupied. The randomness or entropy of the system, in particular, is the homogeneity of this probability distribution. It is measured by the formula

−Σ_x p(x) ln p(x).

A system evolves to maximize this entropy, and, in particular, a system that has come to equilibrium in contact with a large reservoir of heat will have a probability distribution that maximizes entropy subject to the constraint that its energy have a fixed average value.

Shannon realized that the homogeneity of a probability distribution, as measured by the microscopic formula for entropy, was a measure of the missing information of the distribution. He started the book of information theory with a page from statistical mechanics.
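The formula is easy to compute; the following small sketch (mine, added for illustration) shows that the homogeneous distribution has the most missing information.

import numpy as np

def missing_information(p):
    """Shannon missing information (entropy), in natural-log units."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p(x) = 0 contribute nothing
    return float(-np.sum(p * np.log(p)))

print(missing_information([0.25, 0.25, 0.25, 0.25]))   # ln 4, the maximum for 4 states
print(missing_information([0.97, 0.01, 0.01, 0.01]))   # nearly deterministic: much smaller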
The competence theorem shows that the exponential relation between harmony and probability stems from maximizing missing information subject to the constraint that given information be accounted for. The exponential relation between energy and probability stems from maximizing entropy subject to a constraint on average energy. The physics analogy therefore stems from the fact that entropy and missing information share exactly the same relation to probability. It is not surprising that the theory of information processing should share formal features with the theory of statistical physics.
Shannon began a mapping between statistical physics and the theory of information by mapping entropy onto information content. Harmony theory extends this mapping by mapping self-consistency (i.e., harmony) onto energy. In the next subsection, the mapping will be further extended to map stochasticity of inference (i.e., computational temperature) onto physical temperature.
The Realizability Theorem
The mapping with statistical physics allows harmony theory to exploit a computational technique for studying thermal systems that was developed by N. Metropolis, M. Rosenbluth, A. Rosenbluth, A. Teller, and E. Teller in 1953. This technique uses stochastic or "Monte Carlo" computation to simulate the probabilistic dynamical system under study. (See Binder, 1979.)

The procedure for simulating a physical system at temperature T is as follows: The variables of the system are assigned random initial
values. Then, one by one, the variables are updated stochastically: each variable is assigned a new value at random, with a probability proportional to e^{H_x/T}, where H_x is the harmony (minus the energy) the system would have if the variable were assigned the value x. Values that would give the system higher harmony are thus more probable, and as T becomes higher, the decisions become more random. Adapting this technique to harmony theory leads to an analysis that defines the machine harmonium and establishes the following theorem:

Theorem 2: Realizability. The completions prescribed by the competence theorem (Theorem 1) can be computed by a stochastic machine, harmonium, defined as follows. The nodes of harmonium are the feature nodes and knowledge-atom nodes of the graphical representation of the harmony function (see Figure 13); the strengths σ_a and patterns k_a of the knowledge atoms are fixed throughout the computation. In a completion problem, the feature nodes specified in the input are assigned their given values, and these values are held fixed; all the other nodes are initially assigned random values. Each node is a processor that repeatedly updates its value stochastically: it chooses the value 1 with probability

1/(1 + e^{−I/T}),

where I is the "input" to the node from the other nodes (defined below) and T is a global system parameter, the computational temperature. All the nodes in the upper layer update in parallel, then all the nodes in the lower layer update in parallel, and so on alternately throughout the computation. During the update process, T starts out at some positive value and is gradually lowered. If T is lowered to 0 sufficiently slowly, then asymptotically, with probability 1, the system state forms the best completion (or one of the best completions if there are more than one that maximize harmony).
To define the input I to each node, it is convenient to assign to the link in the graph between atom α and feature i a weight w_{iα} whose sign is that of the link and whose magnitude is the strength of the atom divided by the number of links to the atom:

w_{iα} = σ_α (k_α)_i / |k_α|,

where (k_α)_i is the ±1 value the pattern assigns to feature i and |k_α| is the number of links to the atom.
Using these weights, the input to a node is essentially the weighted sum of the values of the nodes connected to it. The exact definitions are

I_i = 2 Σ_α w_{iα} a_α

for feature nodes, and

I_α = Σ_i w_{iα} r_i − σ_α κ

for knowledge atoms.
The formulae for I_i and I_α are both derived from the fact that the input to a node is precisely the harmony the system would have if the given node were to choose the value 1, minus the harmony resulting from not choosing 1. The factor of 2 in the input to a feature node is in fact the difference (+1) − (−1) between its possible values. The term κ in the input to an atom comes from the κ in the harmony function; it is a threshold that must be exceeded if activating the atom is to increase the harmony.
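These definitions translate directly into a small simulation. The following runnable sketch — an illustration under my own naming conventions, not the original simulation code — performs one cooling run of harmonium using the weights, inputs, and stochastic decision rule just defined, with a simple linear cooling schedule assumed for the example.

import numpy as np

rng = np.random.default_rng(0)

def relax(W, sigma, kappa, r, clamped, T0=2.0, Tmin=0.05, steps=200):
    """One annealed completion. Features r take values +-1; atoms take 0/1.
    W[i, a] = sigma_a * (k_a)_i / |k_a|; `clamped` marks the input features."""
    n_feat, n_atom = W.shape
    a = rng.integers(0, 2, n_atom).astype(float)
    for t in range(steps):
        T = max(Tmin, T0 * (1 - t / steps))            # cooling schedule
        I_atom = W.T @ r - sigma * kappa               # input to each atom
        a = (rng.random(n_atom) < 1 / (1 + np.exp(-I_atom / T))).astype(float)
        I_feat = 2 * (W @ a)                           # input to each feature
        new_r = np.where(rng.random(n_feat) < 1 / (1 + np.exp(-I_feat / T)), 1.0, -1.0)
        r = np.where(clamped, r, new_r)                # clamped inputs stay fixed
    return r, a

# Example: two atoms over three features; feature 0 is the clamped input.
W = np.array([[0.5, -0.5], [0.5, 0.0], [0.0, 0.5]])
sigma = np.array([1.0, 1.0]); kappa = 0.5
r0 = np.array([1.0, -1.0, -1.0]); clamped = np.array([True, False, False])
print(relax(W, sigma, kappa, r0, clamped))

Updating all atoms, then all unclamped features, matches the alternating parallel updates of the two layers described in the realizability theorem.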
The stochastic decision rule can be understood with the aid of Figure 14. If the input to the node is large and positive (i.e., selecting value 1 would produce much greater system harmony), then it will almost certainly choose the value 1. If the input to the node is large and negative (i.e., selecting value 1 would produce much lower system harmony), then it will almost certainly not choose the value 1. If the input to the node is near zero, it will choose the value 1 with a probability near .5. The width of the zone of random decisions around zero input is larger the greater is T.
The process of gradually lowering T can be thought of as cooling the system.

FIGURE 14. The stochastic decision rule: the probability that a node selects the value 1, as a function of its input, has slope 1/T at zero input. As T shrinks to zero, the zone of random decisions shrinks and the stochastic rule becomes the deterministic linear threshold decision rule of perceptrons (Minsky & Papert); at nonzero T, there is always some probability that the node will select the value with lower harmony H.
Harmonium thus realizes the second half of the competence theorem, which deals with optimal completions. But Theorem 1 also states that estimates of environmental probabilities are given by the distribution π ∝ e^U, which involves the feature variables only; this distribution can be simulated by a corresponding machine whose graph contains only feature nodes (Figure 15).

FIGURE 15. The graph for a one-atom harmony function and the graph for the corresponding U function. In the latter, there are only feature nodes. Each feature node has a single input point labeled ±λ, where the sign is the same as that assigned to the feature by the knowledge atom. Into this input point come links from all the other features assigned values by the knowledge atom. The label on each arc leaving a feature is the same as the value assigned to that feature by the knowledge atom.
Physicists have applied this kind of stochastic search for maxima of functions to practical optimization problems, under the name of simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983). Benchmarks of simulated annealing against classical optimization procedures, performed at IBM, have produced mixed results (Aragon, Johnson, & McGeoch, 1985). Theorem 2 describes a use of the same technique dedicated to a different kind of problem: performing the completion computations of a cognitive system.

The Learnability Theorem

The competence and realizability theorems assume that the knowledge atoms and their strengths are given. The third basic theorem of harmony theory concerns how the strengths can be acquired from experience. The idea is that the patterns defining the atoms are predetermined; what varies across environments is the frequencies with which the patterns appear, and it is the strengths representing those frequencies that the cognitive system must learn by observing its environment.

Theorem 3: Learnability. Suppose the patterns of an environment are presented to the cognitive system, each appearing with its environmental frequency. Then there is a procedure by which the system, gradually modifying its atom strengths, will converge to the strengths required for the performance specified in Theorem 1.

In the harmonium implementation, the procedure takes a simple form. Whenever an atom's pattern matches the pattern present on the feature nodes, the strength of that atom is incremented by a small amount Δσ (correspondingly, the parameter λ is incremented by Δλ = Δσ(1 − κ)). In this sense, each knowledge atom can be viewed as a memory
trace of a feature pattern, and the strength of the atom is the strength
of the trace: greater the more often it has been experienced.
There is an error-correcting mechanism in the learning procedure that decrements parameters when they become too large. Intermixed with its observation of the environment, the cognitive system must perform simulation of the environment. As discussed above, this can be done by running the simulation machine at temperature 1 without input from the environment. During simulation, patterns that appear in the feature nodes produce exactly the opposite effect as during environmental observation, i.e., a decrement in the corresponding parameters.

Harmonium can be used to approximate the simulation machine. By running harmonium at temperature 1, without input, states are visited with a probability proportional to e^H, which approximates the probabilities of the simulation machine, proportional to e^U.16 When harmonium is used to approximately simulate the environment, every time an atom matches the feature vector its strength is decremented by Δσ.

16 Theorem 1 makes this approximation precise: These two distributions are not equal, but the maximum-probability states are the same for any possible input.

This error-correcting mechanism has the following effect. The strength of each atom will stabilize when it gets (on the average) incremented during environmental observation as often as it gets decremented during environmental simulation. If environmental observation and simulation are intermixed in equal proportion, the strength of each atom will stabilize when its pattern appears as often in simulation as in real observation. This means the simulation is as veridical as it can be, and that is why the procedure leads to the strengths required by the competence theorem.
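Put as code, the whole learning loop is a two-phase, error-correcting procedure; the sketch below is my own schematic rendering of it, with observe_environment and simulate_environment standing in for the processes described in the text.

# Schematic of the learning procedure (illustrative sketch, Python).
def matches(r, pattern):
    return all(r[i] == v for i, v in pattern.items())

def learn_step(sigma, patterns, observe_environment, simulate_environment,
               d_sigma=0.01):
    observed = observe_environment()      # completion of an observed input
    simulated = simulate_environment()    # harmonium run at T = 1, no input
    for a, k in enumerate(patterns):
        if matches(observed, k):
            sigma[a] += d_sigma           # strengthen the memory trace
        if matches(simulated, k):
            sigma[a] -= d_sigma           # error-correcting decrement
    return sigma

A strength stops changing, on average, exactly when its atom's pattern is as frequent in simulation as in observation — the stabilization condition stated above.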
DECISION-MAKING AND FREEZING

Solving a completion problem requires simultaneously satisfying the many constraints encoded by the knowledge atoms that become activated. As harmonium computes, at high
temperatures, it occupies states that are local solutions, but finally , at
low temperatures, it occupies only states that are global solutions . If
the problem is well posed, there is only one such state.
Thus the process of solving the problem corresponds to the passage
of the harmonium dynamical system from a high-temperature phase to
a low-temperature phase. An important question is: Is there a sharp
transition between these phases? This is a "freezing point" for the system, where major decisions are made that can only be undone at lower
temperatures by waiting a very long time . It is important to cool slowly
through phase transitions, to maximize the chance for these decisions
to be made properly; then the system will relatively quickly find the
global harmony maximum without getting stuck for very long times in
local maxima.
In this section, I will discuss an analysis that suggests that phase transitions do exist in very simple harmony theory models of decision-making. In the next section, a more complex model that answers simple physics questions will furnish another example of a harmony system that seems to possess a phase transition.17

17 It is tempting to identify freezing or "crystallization" of harmonium with the phenomenal experience of sudden "crystallization" of scattered thoughts into a coherent form. There may even be some usefulness in this identification. However, it should be pointed out that since cooling should be slow at the freezing point, in terms of iterations of harmonium, the transition from the disordered to the ordered phase may not be sudden. If iterations of harmonium are interpreted as real cognitive processing time, this calls into question the argument that "sudden" changes as a function of temperature correspond to "sudden" changes as a function of real time.
The cooling process is an essentially new feature of the account of
cognitive processing offered by harmony theory. To analyze the implications of cooling for cognition, it is necessary to analyze the temperature dependence of harmony models. Since the mathematical framework of harmony theory significantly overlaps that of statistical
mechanics, general concepts and techniques of thermal physics can be
used for this analysis. However, since the structure of harmony models
is quite different from the structure of models of real physical systems,
specific results from physics cannot be carried over. New ideas particular to cognition enter the analysis; some of these will be discussed in a
later section on the macrolevel in harmony theory .
Symmetry Breaking
At high temperatures, physical systems typically have a disordered phase, like a fluid, which dramatically shifts to a highly ordered phase, like a crystal, at a certain freezing temperature. In the low-temperature
phase, a single ordered configuration is adopted by the system, while at
high temperatures, parts of the system shift independently among
pieces of ordered configurations so that the system as a whole is a constantly changing, disordered blend of pieces of different ordered states.
Thus we might expect that at high temperatures, the states of harmonium models will be shifting blends of pieces of reasonable completions of the current input; it will form locally coherent solutions. At
low temperatures (in equilibrium ) , the model will form completions
that are globally coherent.
Finding the best solution to a completion problem may involve fine
discriminations among states that all have high harmonies. There may
even be several completions that have exactly the same harmonies, as
in interpreting ambiguous input . This is a useful case to consider, for
in an ordered phase, harmonium must at any time construct one of
these " best answers" in its pure form , without admixing parts of other
best answers (assuming that such mixtures are not themselves best
answers, which is typically the case) . In physical terminology , the system must break the symmetry between the equally good answers in order
to enter the ordered phase. One technique for finding phase transitions
is to look for critical temperatures above which symmetry is respected,
and below which it is broken .
An Idealized Decision
This suggests we consider the following idealized decision-making
task. Suppose the environment is always in one of two states, A and
B , with equal probability . Consider a cognitive system performing the
completion task. Now for some of the system's representational
features, these two states will correspond to the same feature value.
These features do not enter into the decision about which state the
environment is in , so let us remove them . Now the two states
correspond to opposite values on all features. We can assume without
loss of generality that for each feature, + is the value for A , and - the
value for B (for if this were not so we could redefine the features,
exploiting the symmetry of the theory under flipping signs of features) .
After training in this environment , the knowledge atoms of our system
each have either all + connections or all - connections to the features.
To look for a phase transition , we see if the system can break symmetry . We give the system a completely ambiguous input : no input at
all. It will complete this to either the all-+ state, representing A , or
the all- - state, representing B , each outcome being equally likely .
Observing the harmonium model we see that for high temperatures, the
states are typically blends of the all-+ and all-- states. These blends
are not themselves good completions since the environment has no
such states. But at low temperatures, the model is almost always in one
pure state or the other , with only short-lived intrusions on a feature or
two of the other state. It is equally likely to cool into either state and,
given enough time , will flip from one state to the other through a
sequence of (very improbable) intrusions of the second state into the
first . The transition between the high- and low-temperature phases
occurs over a quite narrow temperature range. At this freezing temperature, the system drifts easily back and forth between the two pure
states.
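This behavior is simple to reproduce. The sketch below — an invented miniature, not the original simulation — implements the idealized two-choice model with one all-+ atom and one all-− atom, and measures how pure the feature state is after relaxing at a fixed temperature.

import numpy as np

rng = np.random.default_rng(1)

n, kappa, sigma = 8, 0.5, 1.0
# Column 0: the all-plus atom; column 1: the all-minus atom.
W = np.stack([np.full(n, sigma / n), np.full(n, -sigma / n)], axis=1)

def purity(T, sweeps=500):
    r = np.where(rng.random(n) < 0.5, 1.0, -1.0)   # ambiguous input: nothing clamped
    for _ in range(sweeps):
        I_atom = W.T @ r - sigma * kappa
        a = (rng.random(2) < 1 / (1 + np.exp(-I_atom / T))).astype(float)
        I_feat = 2 * (W @ a)
        r = np.where(rng.random(n) < 1 / (1 + np.exp(-I_feat / T)), 1.0, -1.0)
    return abs(r.mean())                           # 1.0 = pure all-+ or all-- state

for T in (2.0, 0.5, 0.1):
    print(T, round(float(np.mean([purity(T) for _ in range(20)])), 2))

At high T the mean feature value stays near zero (a shifting blend); at low T it approaches 1 (a pure state), with either pure state equally likely across runs.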
The harmonium simulation gives empirical evidence that there is a
critical temperature below which the symmetry between the interpretations of ambiguous input is broken. There is also analytic evidence for
a phase transition in this case. This analysis rests on an important concept from statistical mechanics: the thermodynamic limit .
The Thermodynamic Limit
Statistical mechanics relates microscopic descriptions that view matter
as dynamical systems of constituent particles to the macrolevel descriptions of matter used in thermodynamics. Thermodynamics provides a
good approximate description of the bulk properties of systems containing an extremely large number of particles. The thermodynamic limit is
a theoretical limit in which the number of particles in a statistical
mechanical system is taken to infinity , keeping finite certain aggregate
properties like the system's density and pressure. It is in this limit that
the microtheory provably admits the macrotheory as a valid approximate description.
The thermodynamic limit will later be seen to relate importantly to
the limit of harmony theory in which symbolic macro-accounts become
valid. But for present purposes, it is relevant to the analysis of phase
transitions. One of the important insights of statistical mechanics is
that qualitative changes in thermal systems, like those characteristic of
genuine phase transitions , cannot occur in systems with a finite number
of degrees of freedom (e.g., particles) . It is only in the thermodynamic
limit that phase transitions can occur.
This means that an analysis of freezing in the idealized-decision
model must consider the limit in which the number of features and
knowledge atoms go to infinity . In this limit , certain approximations
become valid that suggest that indeed there is a phase transition .
AN APPLICATION: ELECTRICITY PROBLEM SOLVING
Theoretical context of the model. In this section I show how the
framework of harmony theory can be used to model the intuition that
allows experts to answer, without any conscious application of " rules,"
questions like that posed in Figure 16. Theoretical conceptions of how such problems are answered play an increasingly significant role in the design of instruction. (For example, see the new journal, Cognition and
Instruction.)

FIGURE 16. A simple electric circuit: a battery with voltage V_total drives a current I through two resistors, R1 and R2, connected in series, with voltage drops V1 and V2. The problem asks for the qualitative effects on the other circuit variables when one of the quantities is changed while the battery voltage stays fixed.

The model I will describe was created in collaboration with Peter DeMarzo (DeMarzo, 1984); it grew out of work on intuitive physics problem solving with Mary Riley (Riley & Smolensky, 1984). The model addresses the intuition that allows experts to answer qualitative circuit questions like that of Figure 16 directly, without consciously manipulating equations. We do not claim that symbolic, rule-based reasoning plays no role in such expertise; traditional production-system models of physics problem solving (e.g., Larkin, 1983) describe much about expert performance. Rather, the harmony model is offered as a complement to such accounts: a model of the perception-like component of expertise that symbolic accounts do not capture.

Our claim is that through much experience in a domain, experts develop representational features specially tuned to that domain: they come to perceive problems through a richer, deeper set of features than novices possess. Many psychological studies support this view of expertise; the classic example is the study of chess perception by Chase and Simon (1973), which showed that experts perceive whole meaningful chunks of a problem where novices see only the components. So the first step in constructing the harmony model is to hypothesize a set of representational features through which an expert perceives the environment. Here, the environment is the particular circuit of Figure 16, and its states are described by the qualitative changes in the circuit variables: whether each variable goes up, goes down, or stays the same. The variables represented are the current, I; the voltage drops V1 and V2 and the total voltage V_total; and the resistances R1, R2, and R_total. Encoding the qualitative changes of these seven variables gives one reasonable set of representational features;
there are undoubtedly many other possibilities, with different sets being
appropriate for modeling different experts.
Next , the three qualitative changes up, down, and same for these
seven variables need to be given binary encodings. The encoding I will
discuss here uses one binary variable to indicate whether there is any
change and a second to indicate whether the change is up. Thus there
are two binary variables, I.c and I.u, that represent the change in the current, I. To represent no change in I, the change variable I.c is set to −1; the value of I.u is, in this case, irrelevant. To represent increase or decrease of I, I.c is given the value +1 and I.u is assigned a value of +1 or −1, respectively. Thus the total number of representational features in the model is 14: two for each of the seven circuit variables.
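The encoding is trivial to state as code; this small sketch (mine, not the chapter's) fixes the irrelevant bit arbitrarily at −1.

def encode(change):                # change is "up", "down", or "same"
    c = +1 if change in ("up", "down") else -1     # is there any change?
    u = +1 if change == "up" else -1               # is the change up?
    return (c, u)

assert encode("up") == (+1, +1) and encode("down") == (+1, -1)
assert encode("same")[0] == -1     # u is irrelevant when c = -1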
Knowledge atoms. The next step in constructing a harmony model
is to encode the necessaryknowledge into a set of atoms, each of which
encodes a subpattern of features that co-occur in the environment . The
environment of idealized circuits is governed by formal laws of physics,
so a specification of the knowledge required for modeling the environ ment is straightforward . In most real-world environments , no formal
laws exist, and it is not so simple to give a priori methods for directly
constructing an appropriate knowledge base. However, in such
environments , the fact that harmony models encode statistical informa tion rather than rules makes them much more natural candidates for
viable models than rule-based systems. One way that the statistical properties of the environment can be captured in the strengths of
knowledge atoms is given by the learning procedure. Other methods
can probably be derived for directly passing from statistics about the
domain (e.g., medical statistics) to an appropriate knowledge base.
The fact that the environment of electric circuits is explicitly rule-governed makes a probabilistic model of intuition, like the model under
construction , a particularly interesting theoretical contrast to the obvious rule -applying models of explicit conscious reasoning.
For our model we selected a minimal set of atoms; more realistic
models of experts would probably involve additional atoms. A minimal
specification of the necessary knowledge is based directly on the equations constraining the circuit: Ohm's law, Kirchhoff's law, and the equation for the total resistance of two resistors in series. Each of these is
an equation constraining the simultaneous change in three of the circuit
variables. For each law, we created a knowledge atom for each combination of changes in the three variables that does not violate the law.
These are memory traces that might be left behind after experiencing
many problems in this domain, i.e., after observing many states of this
environment. It turns out that this process gives rise to 65 knowledge atoms,18 all of which we gave strength 1.
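The count of 65 can be checked by enumeration. The sketch below is my own rendering of the counting argument spelled out in footnote 18, with a qualitative-consistency test written for the purpose: for an equation relating three circuit variables, it counts the triples of changes that some positive quantities could realize.

from itertools import product

def legal(da, db, dc):
    # Qualitative consistency for a = b + c (or a = b * c with positives):
    if db == dc:                       # b and c move together, or both stay
        return da == db
    if "same" in (db, dc):             # only one of b, c moves
        return da == (db if dc == "same" else dc)
    return True                        # b and c move oppositely: a may do anything

triples = [t for t in product(("up", "down", "same"), repeat=3) if legal(*t)]
print(len(triples))        # 13 legal combinations per equation
print(5 * len(triples))    # 5 constraint equations -> 65 knowledge atoms

Each of the five circuit equations (three instances of Ohm's law, Kirchhoff's law, and the series-resistance formula) contributes 13 atoms.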
A portion of the model is shown in Figure 17. The two atoms shown
are respectively instances of Ohm's law for R 1 and of the formula for
the total resistance of two resistors in series.
It can be shown that with the knowledge base I have described, whenever a completion problem posed has a unique correct answer, that answer will correspond to the state with highest harmony. This assumes that κ is set within the range determined by Theorem 1: the perfect matching limit.19
The parameter κ. According to the formula defining the perfect matching limit, κ must be less than 1 and greater than 1 − 2/6 = 2/3, because the knowledge atoms are never connected to more than 6 features (two binary features for each of three variables).
FIGURE 17. A portion of the circuit-analysis harmony model. Each circuit variable (I, V1, V2, V_total, R1, R2, and R_total) is represented by a pair of binary feature nodes encoding its qualitative change; u, d, and s denote up, down, and same, and the connection labels (+, −) represent the binary encoding of these changes. The two knowledge atoms shown are each connected to the six feature nodes for the three variables related by the corresponding law.

18 Ohm's law applies three times in this circuit: once for each of the two resistors and once for the circuit as a whole. Together with the other two laws, this gives five constraint equations. In each of these equations, the three variables involved can undergo 13 combinations of qualitative changes consistent with the law; this gives the 65 atoms.
[FIGURE: a quantity plotted against computation time (100 to 400); vertical axis from 0.00 to 0.75.]
20 In the reported simulations , one node , selected randomly , was updated at a time .
The computation lasted for 400 " iterations " of 100 node updates each ; that is, on the
average each of the 79 nodes was updated about 500 times . " Updating " a node means
deciding whether to change the value of that node , regardless of whether the decision
changes the value . (Note on "psychological plausibility " : 500 updates may seem like a lot to
solve such a simple problem . But I claim the model cannot be dismissed as implausible
on this ground. According to current very general hypotheses about neural computation
[see Chapter 20] , each node update is a computation comparable to what a neuron can
perform in its "cycle time " of about 10 msec. Because harmonium could actually be
implemented in parallel hardware , in accordance with the realizability theorem , the 500
updates could be achieved in 500 cycles . With the cycle time of the neuron , this comes
to about 5 seconds . This is clearly the correct order of magnitude for solving such prob lems intuitively . While it is also possible to solve such problems by firing a few symbolic
productions , it is not so clear that an implementation
of a production system model could
be devised that would run in 500 cycles of parallel computations comparable to neural
computations .)
FIGURE 19. The temperature schedule used in the simulation: T as a function of time during the computation. The final temperature (0.25) was sufficiently small that node values were hardly ever flipped; the final decisions were essentially frozen.

Considerable information about the course of the computation is provided by a graphical display of the simulation. Each node is displayed as a box whose color — black, white, or gray — encodes its current value. Initially, the feature nodes representing the given information are assigned their correct values, which stay fixed throughout the computation; all the other nodes are assigned random initial values. At the high starting temperature, each unassigned node flickers back and forth between its possible values. As the computation proceeds and the system cools, nodes maintain their values for longer and longer, and eventually the whole system settles into a state in which the answer can be read off the feature nodes.21

21 Except for bits of the representation that are ignored in the answer produced by the network: bits (like R1.u) that encode circuit variables that are in state no change (indicated by the active knowledge atoms), and feature nodes connected only to knowledge atoms that are inactive towards the end of the computation. These representation variables continue to flicker at arbitrarily low temperatures, spending 50% of the time in each state.

Watching the simulation, it is hard to avoid anthropomorphizing. Early in the computation, when a feature node is flickering furiously, it seems that the system "can't make up its mind" about that variable; when the flickering stops, "the decision has been made." These macrodecisions emerge through the dynamical process, but no laws of the dynamics express them: the stochastic update rule characterizes the system's dynamics, while the competence theorem characterizes the states the system eventually settles into. This distinction between laws that govern a system's dynamics and laws that neatly characterize the states it settles into is a common one in physics. Kepler's laws neatly characterize the planetary orbits, but it is Newton's utterly different laws of motion and gravitation that express the dynamics generating them; Balmer's formula neatly characterizes the light emitted by the hydrogen atom, but it is quantum dynamics that governs the process generating it. In harmony theory, it is the competence of the model that is neatly describable symbolically, but not its dynamics.
Of course, macrodecisions emerge first about those variables that are most directly constrained by the given inputs, but not because rules are being used that have conditions that only allow them to apply when all but one of the variables is known. Rather it is because the variables given in the input are fixed and do not fluctuate: They provide the information that is the most consistent over time, and therefore the knowledge consistent with the input is most consistently activated, allowing those variables involved in this knowledge to be more consistently completed than other variables. As the temperature is lowered, those variables "near" the input (with respect to the connections provided by the knowledge) stop fluctuating first, and their relative constancy of value over time makes them function somewhat like the input.
FIGURE 20. [Feature values plotted against computation time.]

FIGURE 21. Emergent seriality: the circuit variables "freeze in" in the order R_total, I_total, V1, V2 (the R and I decisions are very close).

The freezing of decisions can also be tracked with a concept from statistical mechanics, the specific heat:

C = (⟨H²⟩ − ⟨H⟩²) / T².

FIGURE 22. The specific heat of the circuit analysis model through the course of the computation.

FIGURE 23. There is a peak in the specific heat at the time when the R and I decisions are being made.
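The specific heat is straightforward to estimate from harmony samples gathered at a fixed temperature; the helper below is my own illustration, not the chapter's analysis code.

import numpy as np

def specific_heat(harmony_samples, T):
    """C = (<H^2> - <H>^2) / T^2, estimated from samples at temperature T."""
    h = np.asarray(harmony_samples, dtype=float)
    return float((np.mean(h ** 2) - np.mean(h) ** 2) / T ** 2)

print(specific_heat([1.0, 1.2, 0.8, 1.1], T=0.5))

Computed across the cooling schedule, a peak in this quantity marks the temperature range in which the big, hard-to-undo decisions are being frozen in.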
MACRODESCRIPTION: PRODUCTIONS, SCHEMATA, AND EXPERTISE
Productions and Expertise
While there are important differences, the harmony theory account of problem solving and of the acquisition of expertise is roughly consistent with accounts based on symbolic production systems (e.g., Lewis, 1978; Anderson, 1982). In a production-system account of circuit problem solving, the knowledge base contains productions that each encode a single inference licensed by a circuit law: a rule like "IF R1 goes up and R2 goes up, THEN conclude that R_total goes up," which can be abbreviated R1u R2u → R_total u. The conditions of such a production allow it to fire only when enough of the variables in the corresponding equation have known values, so a problem like that of Figure 16 is solved by a series of production firings, each assigning a value to one unknown, and the answer is produced through a series of consciously experienced steps. This is the account of novice performance. Through experience, productions that fire together become composed into single productions with larger conditions and actions — the knowledge is compiled and proceduralized — so that fewer firings are needed to solve a given problem. Eventually, the expert's knowledge base contains composed productions that map the given values of a familiar problem onto the values of all its unknowns at once; the expert gets the answer in a single apparent step, without consciously manipulating the circuit equations.

The harmony theory account of the differences between novices and experts has a similar macrodescription, but a rather different basis: the productions are not prestored. What is stored is a set of knowledge atoms whose strengths are tuned through experience, as described by the learnability theorem. Beginning students do not yet have representational features specialized for perceiving circuits. What they do have, built up through years of experience with language and mathematics, are networks for perceiving and manipulating symbols. Novices must therefore solve circuit problems through symbol manipulation: they consciously inspect the circuit
equations and draw inferences from them to solve circuit problems.
With experience, features dedicated to the perception of circuits evolve ,
and knowledge atoms relating these features develop. The final network for circuit perception contains within it something like the model
described in the previous section (as well as other portions for analyzing other types of simple circuits) . This final network can solve the
entire problem of Figure 16 in a single cooling. Thus experts perceive
the solution in a single conscious step. (Although sufficiently careful
perceptual experiments that probe the internal structure of the construction of the percept should reveal the kind of sequential filling -in
that was displayed by the model.) Earlier networks, however, are not
sufficiently well-tuned by experience; they can only solve pieces of the problem in a single cooling. Several coolings are necessary to solve the
problem, and the answer is derived by a series of consciously experienced steps. (This gives the symbol-manipulating network a chance to
participate, offering justifications of the intuited conclusions by citing
circuit laws.) The number of circuit constraints that can be satisfied in
parallel during a single cooling grows as the network is learned. Productions are higher level descriptions of what input/output pairs — completions — can be reliably performed by the network in a single cooling. Thus, in terms of their productions, novices are described by productions with simple conditions and actions, and experts are described by
complex conditions and actions.
Dynamic creation of productions . The point is, however, that in the
harmony theory account, productions are just descriptive entities; they are
not stored, precompiled, and fed through a formal inference engine; rather
they are dynamically created at the time they are needed by the appropriate collective action of the small knowledge atoms. Old patterns that
have been stored through experience can be recombined in completely
novel ways, giving the appearance that productions had been precompiled even though the particular condition/action pair had never before
been performed . When a familiar input is changed slightly , the network can settle down in a slightly different way, flexing the usual production to meet the new situation . Knowledge is not stored in large
frozen chunks; the productions are truly context sensitive. And since
the productions are created on-line by combining many small pieces of
stored knowledge, the set of available productions has a size that is an
exponential function of the number of knowledge atoms. The
exponential explosion of compiled productions is virtual , not precompiled and stored.
Contrasts with logical inference. It should be noted that the harmonium model can answer ill-posed questions just as it can answer
well-posed ones. If insufficient information is provided, there will be
more than one state of highest harmony , and the model will choose one
of them . It does not stop dead due to " insufficient information " for
any formal inference rule to fire . If inconsistent information is given,
no available state will have a harmony as high as that of the answer to a
well-posed problem; nonetheless, those answers that violate as few circuit laws as possible will have the highest harmony and one of these
will therefore be selected. It is not the case that "any conclusion fol lows from a contradiction ." The mechanism that allows harmonium to
solve well-posed problems allows it to find the best possible answers to
ill -posed problems, with no modification whatever.
Schemata
Productions are higher level descriptions of the completion process
that ignore the internal structures that bring about the input / output
mapping. Schemata are higher level descriptions of chunks of the
knowledge base that ignore the internal structure within the chunk . To
suggest how the relation between knowledge atoms and schemata can
be formalized , it is useful to begin with the idealized two-choice decision model discussed in the preceding section entitled Decision-Making
and Freezing.
Two-choice model. In this model, each knowledge atom had either
all + or all - connections. To form a higher level description of the
knowledge, let 's lump all the + atoms together into the + schema, and
denote it with the symbol S+. The activation level of this schema,
A (8+) , will be defined to be the average of the activations of its constituent atoms. Now let us consider all the feature nodes together as a
slot or variable, s , for this schema. There are two states of the slot that
occur in completions: all + and all - . We can define these to be the
possible fillers or values of the slot and symbolize them by f + and f - .
The information in the schema S+ is that the slot s should be filled
with f+; the proposition s = f+. The "degree of truth" of this proposition, T(s = f+), can be defined to be the average value of all the feature nodes comprising the slot: If they are all +, this is 1, or true; if all −, this is −1, or false. At intermediate points in the computation when there may be a mixture of signs on the feature nodes, the degree of truth is somewhere between 1 and −1.
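In code, these macrovariables are just averages; the two helpers below are my own illustrative definitions.

import numpy as np

def schema_activation(atom_activations):
    """A(S): mean activation of the schema's constituent atoms (0..1)."""
    return float(np.mean(atom_activations))

def degree_of_truth(feature_values):
    """T(s = f+): mean feature value; 1 is true, -1 is false."""
    return float(np.mean(feature_values))

print(schema_activation([1, 0, 1]), degree_of_truth([+1, +1, -1]))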
Repeating the construction for the schema S−, we end up with a
higher level description of the original model depicted in Figure 24.
FIGURE 24. Microdescription and macrodescription of the two-choice decision model. At the microlevel, the + atoms and the − atoms are connected to the individual feature nodes; at the macrolevel, the schemata S+ and S− are connected to the single slot s.
The interesting fact is that the harmony of any state of the original model can now be re-expressed using the higher level variables.
The analysis of decision making in this model considered the limit as
the number of features and atoms goes to infinity - for only in this
"thermodynamic limit " can we see real phase transitions . In this limit ,
the set of possible values for the averages that define the aggregate
variables comes closer and closer to a continuum . The central limit
theorem constrains these averages to deviate less and less from their
means; statistical fluctuations become less and less significant; the
model's behavior becomes more and more deterministic .
Thus, just as the statistical behavior of matter disappears into the
deterministic laws of thermodynamics as systems become macroscopic
in size, so the statistical behavior of individual features and atoms in
harmony models becomes more and more closely approximated by the
higher level description in terms of schemata as the number of constituents aggregated into the schemata increases. However there are two
important differences between harmony theory and statistical physics
relevant here. First, the number of constituents aggregated into schemata is nowhere near the number — 10^23 — of particles aggregated into
bulk matter. Schemata provide a useful but significantly limited
description of real cognitive processing. And second, the process of
aggregation in harmony theory is much more complex than in physics.
This point can be brought out by passing from the grossly oversimplified two-choice decision model just considered to a more realistic cognitive domain.
Schemata for rooms. In a realistically complicated and large network , the schema approximation would go something like this. The
knowledge atoms encode clusters of values for features that occur in
the environment . Commonly recurring clusters would show up in
many atoms that differ slightly from each other . (In a different
language, the many exemplars of a schema would correspond to
knowledge atoms that differ slightly but share many common features.)
These atoms can be aggregated into a schema, and their average activation at any moment defines the activation of the schema. Now among
the atoms in the cluster corresponding to a schema for a living-room, for
example, might be a subcluster corresponding to the schema for
sofa/coffee-table. These atoms comprise a subschema and the average of
their activations would be the activation variable for this subschema.
The many atoms comprising the schema for kitchen share a set of
connections to representational features relating to cooking devices. It
is convenient to group together these connections into a cooking-device slot, s_cooking. Different atoms for different instances of kitchen encode
various patterns of values over these representational features,
corresponding to instances of stove, conventional oven, microwave oven,
and so forth. Each of these patterns defines a possible filler, f_k, for the
a probability distribution. Such a distribution of course contains potentially a phenomenal amount of information: the joint statistics of all combinations of all features used to represent the environment. How can we hope to encode such a distribution effectively? Schemata provide an answer. They comprise a way of breaking up the environment into modules — schemata — that can individually be represented as miniprobability distributions. These minidistributions must then be folded together during processing to form an estimate of the whole distribution. To analyze a room scene, we don't need information about
the joint probability of all possible features; rather, our schema for "chair" takes care of the joint probability of the features of chairs; the
schema for " sofa/ coffee-table" contains information about the joint
probability of sofa and coffee -table features , and so on . Each schema
ignores the features of the others , by and large .
This modularization of the encoding can reduce tremendously the
amount of information the cognitive system needs to encode . If there
are f binary features, the whole probability distribution requires 2^f numbers to specify. If we can break the features into s groups corresponding to schemata, each involving f/s features, then only s·2^{f/s} numbers are needed. This can be an enormous reduction; even with such small numbers as f = 100 and s = 10, for example, the
reduction factor is about 10^26. The cost of the modularization is an assumption of statistical independence between schemata: each feature of the chair, for example, is assumed to covary
with
other
features
of the chair
but
not
with
features
of the
sofa . To some extent this assumption is valid , but there clearly are
limits to its validity .
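The arithmetic behind the claimed reduction is worth spelling out; this tiny computation (mine) uses the numbers from the text.

f, s = 100, 10
full = 2 ** f                    # numbers needed for the full joint distribution
modular = s * 2 ** (f // s)      # s schemata of f/s features each
print(f"{full:.2e} vs {modular:.2e}; reduction factor ~ {full / modular:.1e}")
# -> about 1.27e+30 vs 1.02e+04: a reduction factor of about 1e+26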
A less crude approximation is to allow schemata to share features, so that the shared features can be constrained simultaneously by the joint probabilities of the different sets of variables contained in the different schemata to which each relates. Now we are in a situation not as simple as that of independent, nonoverlapping feature groups, but much of the informational savings remains. The trick is to define the groups of features comprising schemata so that each schema encompasses only a small fraction of the features while the most important interrelationships in the environmental distribution are captured within these subdistributions. How the feature groups defining schemata should be constructed is one of the most important issues for further investigation.

LEARNING NEW REPRESENTATIONS

The Learning Procedure

Throughout this chapter, the features the cognitive system uses to represent states of its environment have been established by the modeler: their meaning comes from a correspondence, set up a priori, between feature values and environmental states. In this final section, I would like to make a few comments on the possibility that some features instead acquire their meaning through experience. Features whose values are directly determined by the state of the external environment will be called exogenous; features whose values the environment never sets directly, and whose relation to the environment must therefore evolve through learning, will be called endogenous.23 As an example, consider the domain of letters and words, and the network of Figure 25.

23 The former correspond to the visible units, and the latter to the hidden units, of the Boltzmann machine discussed by Hinton and Sejnowski in Chapter 7.

FIGURE 25. A network in which the letter and word nodes could be learned.24 The lowest layer contains the line-segment feature nodes; segment/letter knowledge atoms connect these to the letter nodes, and letter/word knowledge atoms connect the letter nodes to the word nodes.
The features representing the line segments are taken to be the exogenous features given a priori. This network comes into existence with
these line -segment nodes , together with extra endogenous feature
nodes which , through experience , will become the letter and word
nodes .
The environment
(in this case, a set of words ) is observed . Each
time a word is presented , the appropriate values for the line -segment
nodes are set. The current atom strengths are used to complete the
input , through the cooling procedure discussed above. The endogenous
features are thus assigned values for the particular
input .
Initially ,
24 The issue of selecting patterns on exogenous features for use in defining endogenous features — including the word domain — is discussed in Smolensky (1983). To map the terminology of that paper onto that of this chapter, replace schemas by knowledge atoms and beliefs by feature values. That paper offers an alternative use of the harmony concept, as well as computer simulations of the learning in several environments.
when the atoms have received little environmental tuning, the values assigned to the endogenous features in these completions are essentially random. The strengths of atoms that match the resulting feature values are incremented. Intermixed with this incrementing of strengths by environmental observation is a process of decrementing strengths during environmental simulation. Thus the learning process is exactly like the one referred to in the learnability theorem, except that during observation not all the features are set by the environment: the endogenous features must be filled in by completion.

Initially, the values of the endogenous features are random. But as learning occurs, correlations between recurring patterns in the exogenous features and the random endogenous features will be amplified by the strengthening of atoms that encode those correlations. An endogenous feature that by chance tends to be + when patterns of line segments defining the letter A are present leads to strengthening of atoms relating it to those patterns; it gradually comes to represent A. In this way, self-organization of the endogenous features can potentially lead them to acquire meaning.

The learnability theorem states that when no endogenous features are present, this learning process will produce strengths that optimally encode the environmental regularities, in the sense that the completions performed are those prescribed by the competence theorem.25
25 In Chapter 7, Hinton and Sejnowski use a different but related optimality condition. They use a function G which measures the information-theoretic difference between the true environmental probability distribution and the estimated distribution e^H. For the case of no endogenous features, the following is true (see Theorem 4 of the Appendix): The strengths that correspond to the maximal-missing-information distribution consistent with observable statistics are the same as the strengths that minimize G. That the estimated distribution is of the form e^H must be assumed a priori in using the minimal-G criterion; it is entailed by the maximal-missing-information criterion.
When endogenous features acquire meaning in this way, the modeler never needs to assign them meaning, and never needs to decide that "now the system has a new concept." Rather, the strengths of the atoms connecting a feature to the rest of the network shift slowly, and something like a new schema, with its own coherent meaning for the system, gradually emerges. In this respect the subsymbolic account of learning contrasts sharply with symbolic accounts, in which new representations must be discretely created and stored. Because subsymbolic learning proceeds by many small changes in strengths rather than by decisions to create new symbolic structures, the potential for the flexible, automatic emergence of genuinely new representations seems greater in the subsymbolic paradigm. And because subsymbolic systems are formulated mathematically, their learning procedures are more analytically tractable than those of symbolic systems. The hope that learning can be mathematically analyzed, rather than merely programmed, is one of the most important motivations for studying subsymbolic computation; it may be in the study of learning that the subsymbolic approach has the most to offer.

CONCLUSIONS

In this chapter I have described the foundations of harmony theory: a formal, subsymbolic framework for performing an important class of cognitive computations, the completion of partial descriptions of static states of an environment. Knowledge is encoded as constraints among perceptual features, imbedded in the strengths of knowledge atoms, and performance is achieved through the parallel satisfaction of these numerical constraints rather than through rules fed to a powerful symbolic inference engine. When the knowledge is well tuned to the environment through experience, rules — and even the seriality sometimes found in processing — can emerge at the macrolevel, neatly expressed in symbolic descriptions, although no rules or interpreter for them are explicitly imbedded in the machine.

The theory formalizes the centrality of self-consistency in cognition through the harmony function, which plays the role that energy — the Hamiltonian function — plays in statistical physics; computational temperature plays the role of physical temperature, and the randomness of computation that of thermal randomness. The relationship between Shannon's information theory and statistical mechanics, through entropy, is thereby extended: competence, realizability, and learnability theorems are derived by adapting mathematical concepts and techniques from statistical physics. Insights can be passed from physics to the theory of information processing, and the macrolevel and microlevel accounts of cognition can gradually be related, as this subsymbolic formulation is pursued towards its ultimate computational goal: an understanding of how symbolic, aggregate descriptions of cognition can emerge from numerical, subsymbolic computation.
ACKNOWLEDGMENTS
The framework of harmony theory grew out of an attempt to formalize insights about cognition that emerged over several years of conversations with the members of the Parallel Distributed Processing research group at UCSD; to all of them go my thanks. Special thanks go to Geoff Hinton, Jay McClelland, and Dave Rumelhart, who have made several important contributions to this work and from whom I have learned very much. Working with Peter DeMarzo and Mary Riley contributed greatly to this chapter. Thanks too to Francis Crick, Gerhard Dirlich, Stu Geman, Steve Greenspan, Doug Hofstadter, Rutie Kimchi, Mike Mozer, Dan Rabin, and Judith Stewart for instructive conversations; to Don Norman and Don Gentner for their support; to Eileen Conway and Mark Wallen for excellent graphics and computer support; and to Sondra Buffett for excellent help. The research was supported by the System
Development Foundation, the Alfred P. Sloan Foundation, National
Institute of Mental Health Grant PHS MH 14268 to the Center for
Human Information Processing
, and Personnel and Training Research Programs of the Office of Naval Research, Contract N00014-79-C-0323, NR 667-437.
APPENDIX: FORMAL PRESENTATION OF THE THEOREMS
Formal relationships between statistical mechanics and computation have recently been pursued by several research groups. The ideas presented in this chapter make close contact with those of the Boltzmann machine of Hinton, Sejnowski, and their colleagues (Ackley, Hinton, & Sejnowski, 1985; Fahlman, Hinton, & Sejnowski, 1983; Hinton & Sejnowski, 1983a; see Chapter 7) and with the Gibbs sampler of Geman and Geman (1984); the three groups initially developed these ideas independently, from rather different perspectives.26 Preliminary presentations of harmony theory are contained in Smolensky (1983) and Riley and Smolensky (1984).

The ideas of harmony theory are presented at three levels of formality: informally in the text of the chapter, precisely but without proofs in this appendix, and at full length in the referenced papers. Since the text and this appendix are deliberately self-contained, a certain degree of redundancy is incurred; this is the inevitable consequence of presenting the ideas at several levels of formality within a single, linearly ordered document.

26 The ideas of Hofstadter made contact in a way that was more inspirational than formal: his model of anagram processing (Hofstadter, 1983; see also 1985, pp. 654-665) uses the idea of a computational temperature to modulate the randomness of processing.
Overview of Definitions

The basic definitions of the theory are represented schematically in Figure 26. The environment is structured: Some states of the environment are more likely to occur than others. The cognitive system's representation of the environment includes exogenous features, whose values are passed into the system from the environment through transducers, and endogenous features, whose values are internally generated. Depending on the level at which the framework is applied, the exogenous features might be the unanalyzed output of low-level perceptual transducers, or they might be high-level features of a symbolic description; quite different insights come from applications at the two levels.
FIGURE 26. Schematic representation of the basic definitions: the knowledge atoms, the mental space M, the representation space R = R_ex x R_en, and the transducer T mapping the environment E, with probability distribution p, onto R_ex.
Def. A representational space R is a cartesian product R_ex x R_en of two binary hypercubes. Each of the N (N_ex; N_en) binary-valued coordinate functions r_i of R (R_ex; R_en) is called an (exogenous; endogenous) feature.

Def. A transduction map T from an environment E_distal to a representational space R = R_ex x R_en is a map T: E -> R_ex. T induces a probability distribution p on R_ex, the image under T of the probability distribution on E. This distribution p is the (proximal) environment.
266
BASICMECHANISMS
(Points of the representation space R are called representation vectors.)
Def. A basic event α has the form [feature i_1 has value b_1] & . . . & [feature i_|α| has value b_|α|], where the i_μ are a collection of exogenous features and each b_μ is +1 or −1. The event α can be characterized by the function

    χ_α(r) = ∏_{μ=1}^{|α|} (1/2) [ r_{i_μ}(r) b_μ + 1 ]

which is 1 if the features all have the values specified by α, and 0 otherwise.
Def. The entropy of a probability distribution p on a finite space X is

    S(p) = − ∑_{x∈X} p(x) ln p(x).

Def. The Gibbs distribution determined by a function V on X is p_V(x) = Z^{-1} e^{V(x)}, where the normalization (partition function) is

    Z = ∑_{x∈X} e^{V(x)}.
Theorem 1. A: The function U can be written

    U(r) = ∑_{α∈O} λ_α χ_α(r)

for suitable parameters λ = {λ_α}_{α∈O} (S. Geman, personal communication, 1984). B: The completion function c is the maximum-likelihood completion function c_{p_H} of the Gibbs distribution p_H, where H: M -> R, M = R x A, A = {0,1}^{|O|}, is defined by

    H(r, a) = ∑_{α∈O} σ_α a_α h(r, k_α)

and

    h(r, k_α) = (r · k_α)/|k_α| − κ

for suitable parameters σ = {σ_α}_{α∈O} and for κ sufficiently close to 1:

    1 > κ > 1 − 2/[max_{α∈O} |k_α|].
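The construction in Theorem 1 is concrete enough to compute directly. The chapter gives no code, so the following Python sketch is mine; the toy atoms, strengths, and function names are assumptions, not the chapter's. It evaluates h(r, k_α) and H(r, a) for a tiny system and shows that, with κ in the allowed range, h is positive exactly when an atom's feature pattern is matched.

```python
import numpy as np

# A knowledge vector k has entries +1/-1 on the features it constrains
# and 0 elsewhere; a representation r is a full +1/-1 vector.
def h(r, k, kappa):
    """Consistency of r with k: h = (r . k)/|k| - kappa."""
    size = np.sum(k != 0)                  # |k|, number of constrained features
    return np.dot(r, k) / size - kappa

def harmony(r, a, atoms, sigma, kappa):
    """H(r, a) = sum over atoms of sigma_a * a_a * h(r, k_a)."""
    return sum(s * ai * h(r, k, kappa) for s, ai, k in zip(sigma, a, atoms))

# Hypothetical two-atom system over four features.
atoms = [np.array([+1, +1, 0, 0]),         # atom 0: features 1 and 2 both on
         np.array([0, 0, +1, -1])]         # atom 1: feature 3 on, feature 4 off
sigma = [1.0, 1.0]
kappa = 0.5        # Theorem 1 requires 1 > kappa > 1 - 2/max|k| = 0 here

r = np.array([+1, +1, +1, -1])             # matches both atoms
for i, k in enumerate(atoms):
    print(f"atom {i}: h = {h(r, k, kappa):+.2f}")   # positive iff matched
print("H(r, a=11) =", harmony(r, [1, 1], atoms, sigma, kappa))
```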
As far as the cognitive system is concerned, the environment is the distribution p, and the statistical regularities of the environment are what the system must exploit. The completion process is an inference process: the system must distinguish those values of the unknown features that are more self-consistent with the input than others, and fill them in. The measure of self-consistency is the harmony function H.

Def. The harmony network associated with a cognitive system is a graph with one node for each feature of the representation space and one node for each knowledge atom. Each feature node carries a binary feature variable and each atom node carries a binary activation variable; the node for atom α is joined by a link to the node for feature i if and only if the knowledge vector k_α has a nonzero value at i, and the link carries that value, +1 or −1. The strength σ_α labels the node for atom α. By the definition of H, the harmony of a state is determined entirely by the values at the two ends of the links, so each node's value can be interpreted as being computed by a processor at that node that communicates only along its links. These graphs are two-color: all paths between feature nodes go through atom nodes, so the feature nodes may be assigned one color and the atom nodes another. As will be seen shortly, this permits a high degree of parallelism in the processing. (See Figure 27.)

FIGURE 27. The graph of a harmony network. Feature nodes occupy one color class and knowledge-atom nodes the other; links join each atom to the features appearing in its knowledge vector.

Viewing the mental space as a binary cube whose coordinates are the feature and activation variables, completion is performed by a stochastic process on the states of this cube, defined as follows.

Def. Let p be a probability distribution on a finite space X and let T be a temperature. The heat bath stochastic process at temperature T determined by p is a sequence x(0), x(1), x(2), . . . of X-valued random variables defined as follows. The process begins at Time 0 with
some arbitrary initial distribution, pr(x(0) = x). Given the initial state
x , the new state at Time 1, x ( 1) , is constructed as follows . One of the
n coordinates of M is selected (with uniform distribution ) for updating .
All the other n - 1 coordinates of x ( 1) will be the same as those of
x(0) = x. The updated coordinate can retain its previous value, leading to x(1) = x, or it can flip to the other binary value, leading to a
new state that will be denoted x′. The selection of the value of the updated coordinate for x(1) is stochastically determined by the likelihood ratio

    p_T(x′) / p_T(x),

where p_T = N_T^{-1} p^{1/T} and the normalization is N_T = ∑_{x∈X} p(x)^{1/T}.

Def. An annealing schedule T is a sequence of positive values {T_t}_{t=0}^∞ that converge to zero. The annealing process determined by p and T is the heat bath stochastic process determined by the sequence of distributions p_{T_t}. If p is the Gibbs distribution determined by V, then p_T is itself a Gibbs distribution:

    p_T(x) = Z_T^{-1} e^{V(x)/T},  where  Z_T = ∑_{x∈X} e^{V(x)/T}.
This is the same (except for the sign of the exponent) as the relationship that holds in classical statistical mechanics between the probability p(x) of a microscopic state x, its energy V(x), and the temperature T. This is the basis for the names "temperature" and "annealing schedule." In the annealing process for the Gibbs distribution p_H of Theorem 1 on
the space M , the graph of the harmony network has the following significance. The updating of a coordinate can be conceptualized as being
performed by a processor at the corresponding node in the graph. To
make its stochastic choice with the proper probabilities, a node updating
at time t must compute the ratio
    p_{T_t}(x′) / p_{T_t}(x) = e^{[H(x′) − H(x)] / T_t}.
The exponent is the difference in harmony between the two choices of
value for the updating node, divided by the current computational temperature. By examining the definitions of the harmony function and its
graph, this difference is easily seen to depend only on the values of
nodes connected to the updating node. Suppose at times t and t + 1 two
nodes in a harmony network are updated. If these nodes are not connected, then the computation of the second node is not affected by the
outcome of the first : They are statistically independent. These computations can be performed in parallel without changing the statistics of
the outcomes (assuming the computational temperature to be the same
at t and t + 1) . Because the graph of harmony networks is two-color ,
this means there is another stochastic process that can be used without
violating the validity of the upcoming Theorem 2.27 All the nodes of one
color can update in parallel. To pass from x (t ) to x (t+ 1) , all the nodes
of one color update in parallel; then to pass from x (t+ 1) to x (t + 2) , all
the nodes of the other color update in parallel. In twice the time it
takes a processor to perform an update, plus twice the time required to
pass new values along the links , a cycle is completed in which an
entirely new state (potentially different in all N + |O| coordinates) is
computed.
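The alternating-color update just described is easy to simulate. The Python sketch below is mine, not the chapter's; the tiny quadratic stand-in for a harmony function and all names are assumptions. Because nodes of the same color are unconnected (the zero blocks in W), updating a whole color class in parallel from the old state leaves the statistics of the process unchanged.

```python
import numpy as np
rng = np.random.default_rng(0)

def heat_bath_sweep(state, H, nodes, T):
    """Update every node in `nodes` in parallel: each node flips (or not)
    with the heat-bath probability implied by e^{[H(x') - H(x)]/T}."""
    new = state.copy()
    for i in nodes:                         # independent given the other color
        flipped = state.copy()
        flipped[i] = 1 - flipped[i]
        dH = H(flipped) - H(state)          # all nodes use the OLD state
        if rng.random() < 1.0 / (1.0 + np.exp(-dH / T)):
            new[i] = flipped[i]
    return new

# Toy two-colorable system: nodes 0,1 in one color class, 2,3 in the other.
W = np.array([[0, 0, 1, -1],
              [0, 0, 1,  1],
              [1, 1, 0,  0],
              [-1, 1, 0, 0]], dtype=float)
H = lambda x: 0.5 * x @ W @ x               # stand-in "harmony" function
color0, color1 = [0, 1], [2, 3]

x = rng.integers(0, 2, size=4).astype(float)
for T in [10, 5, 2, 1, 0.5]:                # a simple annealing schedule
    x = heat_bath_sweep(x, H, color0, T)
    x = heat_bath_sweep(x, H, color1, T)
print("final state:", x)
```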
27 This is an important respect in which the following theorem about harmony networks differs from the corresponding result for Boltzmann machines, in which the specified input features are allowed to have arbitrary values.

Def. Suppose an input is given in which the values of some of the exogenous features are specified. The completion process runs as follows. The features assigned values by the input are fixed throughout; all the remaining features, and the activations of all the knowledge atoms, are the stochastic variables of the system. Initially, the unfixed variables are assigned arbitrary values, say all zero. The state of the system then evolves according to the heat bath stochastic process on the space R x A determined by the distribution p_H (conditioned on the fixed features), with the temperature lowered according to an annealing schedule as time progresses. As the temperature drops toward zero, only the states of maximum probability, the maximum-likelihood completions of the input, retain nonzero probability.

Theorem 2. Part A: At any fixed nonzero temperature, whatever the initial state, the heat bath process converges to the corresponding Gibbs distribution, conditioned on the input. Part B: As time progresses and the temperature goes to zero through an annealing schedule, the completions the system produces approach, with probability approaching 1, the maximum-likelihood completions. (These results are closely related to those exploited in Geman & Geman, 1984, and Hinton & Sejnowski, 1983a.)

Def. Let (R, p, O) be a cognitive system. The trace learning procedure iteratively modifies the parameters {λ_α}_{α∈O} as follows. Initially, all the λ_α are zero. Repeat the following cycle. First, in the environmental observation phase, present the system with states drawn from the environmental distribution p, and for each observed state r and each atom α, increment λ_α by an amount proportional to χ_α(r). Next, in the environmental simulation phase, let the system generate states from its own current distribution, by running the same annealing process with no input, and for each generated state decrement λ_α by the same amount. Throughout, the strengths σ_α are determined from the λ_α as in Theorem 1. For small Δλ, a good way to approximately implement this procedure is to alternately observe and simulate the environment in equal
proportions, and to increment (respectively, decrement) λ_α by Δλ each time the feature pattern defining α appears during observation (respectively, simulation). It is in this sense that σ_α is the strength of the memory trace for the feature pattern k_α defining α. Note that in learning, equilibrium is established when the frequency of occurrence of each pattern k_α during simulation equals that during observation (i.e., λ_α has no net change).
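The observe/simulate cycle is simple to express in code. The Python sketch below is mine; the stand-in environment, the exact-sampling shortcut used in place of annealing, and all names are assumptions, not the chapter's. At equilibrium the increments and decrements cancel, exactly as described above.

```python
import numpy as np
rng = np.random.default_rng(1)

def chi(r, k):
    """chi_alpha(r): 1 iff r matches k on all of k's nonzero features."""
    mask = k != 0
    return float(np.all(r[mask] == k[mask]))

# Hypothetical system: three +1/-1 features, two knowledge atoms.
atoms = [np.array([+1, +1, 0]), np.array([0, -1, -1])]
lam = np.zeros(len(atoms))
dlam = 0.01

def sample_environment():
    # Stand-in environment: features 0 and 1 usually agree.
    r = rng.choice([-1, +1], size=3)
    r[1] = r[0] if rng.random() < 0.9 else -r[0]
    return r

def simulate_system(lam):
    # Stand-in for the annealing process: sample exactly from the
    # distribution proportional to exp(sum_a lam_a chi_a(r)).
    states = [2 * np.array(s) - 1 for s in np.ndindex(2, 2, 2)]
    u = np.array([sum(l * chi(r, k) for l, k in zip(lam, atoms))
                  for r in states])
    p = np.exp(u - u.max()); p /= p.sum()
    return states[rng.choice(len(states), p=p)]

for sweep in range(2000):
    r_obs = sample_environment()            # observation phase: increment
    r_sim = simulate_system(lam)            # simulation phase: decrement
    for a, k in enumerate(atoms):
        lam[a] += dlam * (chi(r_obs, k) - chi(r_sim, k))

print("learned lambda:", lam)   # stabilizes when frequencies match
```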
Theorem 3: Learnability. Suppose all knowledge atoms are independent. Then if sufficient sampling is done in the trace learning procedure to produce accurate estimates of the observable statistics, λ and σ will converge to the values required by Theorem 1.

Independence of the knowledge atoms means that the functions {χ_α}_{α∈O} are linearly independent. This means no two atoms can have exactly the same knowledge vector. It also means no knowledge atom can be simply the "or" of some other atoms: for example, the atom with knowledge vector +0 is the "or" of the atoms ++ and +−, and so is not independent of them. (Indeed, χ_{+0} = χ_{++} + χ_{+−}.) The sampling condition of this theorem indicates the tradeoff between learning speed and performance accuracy. By adding higher order statistics to O (longer patterns), we can make π a more accurate representation of p and thereby increase performance accuracy, but then learning will require greater sampling of the environment.
To see that these second-order observables are not independent of the others, first consider a particular pair of features, r_i and r_j, and let

    χ_{b1 b2} = χ_{[r_i = b1] & [r_j = b2]}.

Then notice that

    χ_{+−} = χ_{+0} − χ_{++}
    χ_{−+} = χ_{0+} − χ_{++}
    χ_{−−} = 1 − χ_{++} − χ_{+−} − χ_{−+}
          = 1 − χ_{++} − [χ_{+0} − χ_{++}] − [χ_{0+} − χ_{++}].

Thus all the second-order observables can be linearly generated from the set

    O = {χ_{ij}}_{i<j} ∪ {χ_i}_i

which will now be taken to be the set of observables. I will abbreviate λ_{[r_i=+]&[r_j=+]} as λ_{ij} and λ_{[r_i=+]} as λ_i; then

    U = ∑_{i<j} λ_{ij} χ_i χ_j + ∑_i λ_i χ_i.
If we regard the variables of the system to be the χ_i instead of the r_i, this formula for U can be identified with minus the formula for energy, E, in the Boltzmann machine formalism (see Chapter 7). The mapping takes the harmony feature r_i to the Boltzmann node x_i, the harmony parameter λ_{ij} to the Boltzmann weight w_{ij}, and minus the parameter λ_i to the threshold θ_i. Harmony theory's estimated probability for states of the environment, e^U, is then mapped onto the Boltzmann machine's estimate, e^{−E}. For the isomorphism to be complete, the value of λ that arises from learning in harmony theory must map onto the weights and thresholds given by the Boltzmann machine learning procedure. This is established by the following theorem, which also incorporates the preceding results.
Theorem 4. Consider a cognitive system with the above set of first- and second-order observables, O. Then the weights {w_{ij}}_{i<j} and thresholds {θ_i}_i learned by the Boltzmann machine are related to the parameters λ generated by the trace learning procedure by the relations w_{ij} = λ_{ij} and θ_i = −λ_i. It follows that the Boltzmann machine energy function, E, is equal to −U, and the Boltzmann machine's estimated probabilities for environmental states are the same as those of the cognitive system.
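Theorem 4's dictionary can be checked mechanically. The Python sketch below is my own illustration, with randomly chosen parameters standing in for learned ones: it builds U from {λ_{ij}, λ_i}, builds E from the mapped weights and thresholds, and verifies E(x) = −U(x) on random binary states.

```python
import numpy as np
rng = np.random.default_rng(2)

n = 4
lam_pair = np.triu(rng.normal(size=(n, n)), k=1)   # lambda_ij for i < j
lam_single = rng.normal(size=n)                    # lambda_i

def U(x):
    # Harmony theory: U = sum_{i<j} lam_ij x_i x_j + sum_i lam_i x_i
    return x @ lam_pair @ x + lam_single @ x

W = lam_pair + lam_pair.T      # Boltzmann weights:    w_ij = lambda_ij
theta = -lam_single            # Boltzmann thresholds: theta_i = -lambda_i

def E(x):
    # Boltzmann machine energy (Equation 1 of Chapter 7)
    return -0.5 * x @ W @ x + theta @ x

for _ in range(5):
    x = rng.integers(0, 2, size=n).astype(float)   # chi variables in {0,1}
    assert np.isclose(E(x), -U(x))                 # E = -U, so e^U = e^{-E}
print("E(x) = -U(x) verified on random states")
```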
    ∑_{r∈R} π(r) = 1.

We introduce the Lagrange multipliers λ and λ_α (see, for example, Thomas, 1968) and solve for the values of π(r) obeying

    0 = ∂/∂π(r) { − ∑_{r′∈R} π(r′) ln π(r′)
          − ∑_{α∈O} λ_α [ ∑_{r′∈R} χ_α(r′) π(r′) − p_α ]
          − λ [ ∑_{r′∈R} π(r′) − 1 ] }.
This leads directly to A. Part B: Since χ_α can be expressed as the product of |k_α| terms each linear in the feature variables, the function U is a polynomial in the features of degree |k_α|. By introducing new variables a_α, U will now be replaced by a quadratic function H. The trick is to write

    χ_α(r) = 1 if (r · k_α)/|k_α| = 1, and 0 otherwise,

as

    χ_α(r) = (1 − κ)^{-1} max_{a_α ∈ {0,1}} [ a_α h(r, k_α) ],

where κ is chosen close enough to 1 that (r · k_α)/|k_α| can only exceed κ by equaling 1. This is assured by the condition on κ of the theorem. Now U can be written

    U(r) = max_{a ∈ A} ∑_{α∈O} σ_α a_α h(r, k_α) = max_{a ∈ A} H(r, a);

the σ_α are simply the Lagrange multipliers, rescaled: σ_α = λ_α / (1 − κ). Finally,

    max_{r ⊃ ι} U(r) = max_{r ⊃ ι} max_{a ∈ A} H(r, a).
Thus the maximum-likelihood completion function c_π = c_{p_U} determined by π, the Gibbs distribution determined by U, is the same as the maximum-likelihood completion function c_{p_H} determined by p_H, the Gibbs distribution determined by H. Note that p_H is a distribution on the joint space M = R x A.
Proof of Theorem 2. Part A: This is another classic result; since Metropolis et al. (1953), it has provided the foundation for the computer simulation of thermal systems. We prove that the heat bath stochastic process determined by the probability distribution p always converges to p. The stochastic process x is a Markov process with a stationary transition probability matrix.
(The probability of making a transition from one state to another is time-independent. This is not true of a process in which variables are updated in a fixed sequence rather than by randomly selecting a variable according to some fixed probability distribution. For the sequential updating process, Theorem
proof is less direct [see , for example , Smolensky , 1981 ] ) . Since only
one variable can change per time step, |X| steps are required to completely change from any state to any other state; in |X| time steps there is a nonzero probability of changing to any other state, so the process is irreducible. It is a classic result that in a finite state space any irreducible Markov process converges to a unique stationary distribution, the distribution remaining unchanged under the process. It therefore suffices to show that p is a stationary distribution for the process. We assume that at time t the distribution of states is p. The distribution at time t + 1 is then
    pr(x(t+1) = x) = ∑_{x′∈X_x} pr(x(t) = x′) pr(x(t+1) = x | x(t) = x′)
                   = ∑_{x′∈X_x} p(x′) W_{x′x}.
The sum here runs over Xx , the set of states that differ from x in at
most one coordinate ; for the remaining
states, the one time-step transition probability
W_{x′x} ≡ pr(x(t+1) = x | x(t) = x′) is zero. Next we use the important detailed balance condition,

    p(x′) W_{x′x} = p(x) W_{xx′},
which states that in an ensemble of systems with states distributed
according to p , the number of transitions from x' to x is equal to the
number from x to x ' . Detailed balance holds because, for the nontrivial case in which x' and x differ in the single coordinate v, the transition matrix W determined by the distribution p is
    W_{x′x} = (1/n) · p(x) / [p(x) + p(x′)],

the probability of selecting the coordinate v for updating times the heat-bath probability of then choosing the value x_v; detailed balance follows immediately. Using it,

    ∑_{x′∈X_x} p(x′) W_{x′x} = ∑_{x′∈X_x} p(x) W_{xx′} = p(x).
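Detailed balance and stationarity can be verified numerically for a small state space. The Python sketch below is my own check, not part of the proof: it builds the heat-bath transition matrix for an arbitrary target distribution p over three binary coordinates and confirms that p is stationary.

```python
import numpy as np
from itertools import product
rng = np.random.default_rng(3)

n = 3                                          # three binary coordinates
states = [np.array(s) for s in product([0, 1], repeat=n)]
p = rng.random(len(states)); p /= p.sum()      # arbitrary target distribution

def index(x):
    return int("".join(map(str, x)), 2)

# Heat-bath transition matrix: pick one of n coordinates uniformly,
# then set it according to the likelihood ratio of the two candidates.
W = np.zeros((len(states), len(states)))
for x in states:
    i = index(x)
    for v in range(n):
        y = x.copy(); y[v] = 1 - y[v]
        j = index(y)
        W[j, i] += (1 / n) * p[j] / (p[i] + p[j])   # flip coordinate v
        W[i, i] += (1 / n) * p[i] / (p[i] + p[j])   # keep coordinate v

# Detailed balance p(x')W_{x'x} = p(x)W_{xx'} implies p is stationary.
assert np.allclose(W @ p, p)
print("p is a stationary distribution of the heat-bath process")
```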
    F(λ) = ln Z_V(λ) = ln ∑_{r∈R} exp( ∑_{α∈O} λ_α [χ_α(r) − p_α] ).
Proof of Lemma: Note that p_U = p_V, where

    V(r) = ∑_{α∈O} λ_α [χ_α(r) − p_α] = U(r) − ∑_{α∈O} λ_α p_α.

From this it follows that the gradient of F is

    ∂F/∂λ_α = ⟨χ_α⟩_{p_U} − p_α.

The constraint that λ enforces is precisely that this vanish for all α; then p_U = π. Thus the correct λ is a critical point of F. To see that
in fact the correct λ is a minimum of F, we show that F has a positive-definite matrix of second-partial derivatives and is therefore convex. It is straightforward to verify that the quadratic form

    ∑_{α,α′∈O} q_α (∂²F / ∂λ_α ∂λ_{α′}) q_{α′}

is the variance

    ⟨(Q − ⟨Q⟩_{p_U})²⟩_{p_U}

of the random variable Q defined by Q(r) = ∑_{α∈O} q_α χ_α(r). This variance is clearly nonnegative definite. That Q cannot vanish is assured by the assumption that the χ_α are linearly independent. Since a Gibbs distribution p_U is nowhere zero, this means that the variance of Q is positive, so the Lemma is proved.
Proof of Theorem 3: Since F is convex, we can find its minimum, λ, by gradient descent from any starting point. The process of learning the correct λ, then, can proceed in time according to the gradient descent equation
    dλ_α/dt = −∂F/∂λ_α = −(⟨χ_α⟩_{p_U} − p_α) = ⟨χ_α⟩_p − ⟨χ_α⟩_{p_U}.

In the environmental observation phase, the increment ⟨χ_α⟩_p is estimated; in the environmental simulation phase, the decrement ⟨χ_α⟩_{p_U} is estimated (following Theorem 2). By hypothesis, these estimates are accurate. (That is, this theorem treats the ideal case of perfect samples, with sample means equal to the true population means.) Thus λ will converge to the correct value. The proportional relation between σ and λ was derived in the proof of Theorem 1.
Proof of Theorem 4: The proof of Theorem 3 shows that the trace learning procedure does gradient descent in the function F. The Boltzmann learning procedure does gradient descent in the function G:

    G(λ) = − ∑_r p(r) ln [ p_U(r) / p(r) ].

Since p_α = ⟨χ_α⟩_p,

    V(r) = U(r) − ∑_{α∈O} λ_α ⟨χ_α⟩ = U(r) − ⟨U⟩,

so

    ∑_r e^{V(r)} = e^{−⟨U⟩} ∑_r e^{U(r)},

i.e.,

    Z_V = Z_U e^{−⟨U⟩}.

By the definition of F,

    F = ln Z_V = ln Z_U − ⟨U⟩.

Since ln p_U(r) = − ln Z_U + U(r), the preceding equation for F becomes

    F = − ∑_r p(r) [U(r) − ln Z_U] = − ∑_r p(r) ln p_U(r).
Thus, as claimed, G is just F minus a constant, the entropy of the environment, which is independent of λ; gradient descent in G is therefore the same as gradient descent in F.
CHAPTER 7

Learning and Relearning in Boltzmann Machines

G. E. HINTON and T. J. SEJNOWSKI
Many of the chapters in this volume make use of the ability of a parallel network to perform cooperative searches for good solutions to problems. The basic idea is simple: The weights on the connections
between processing units encode knowledge about how things normally
fit together in some domain and the initial states or external inputs to a
subset of the units encode some fragments of a structure within the
domain. These fragments constitute a problem : What is the whole
structure from which they probably came? The network computes a
"good solution " to the problem by repeatedly updating the states of
units that represent possible other parts of the structure until the network eventually settles into a stable state of activity that represents the
solution .
One field in which this style of computation seems particularly
appropriate is vision (Ballard, Hinton , & Sejnowski, 1983) . A visual
system must be able to solve large constraint-satisfaction problems
rapidly in order to interpret a two-dimensional intensity image in terms
of the depths and orientations of the three-dimensional surfaces in the
world that gave rise to that image. In general, the information in the
image is not sufficient to specify the three-dimensional surfaces unless
the interpretive process makes use of additional plausible constraints
about the kinds of structures that typically appear. Neighboring pieces
of an image, for example, usually depict fragments of surface that have
similar depths, similar surface orientations , and the same reflectance.
The most plausible interpretation of an image is the one that satisfies
7. LEARNING
INBOLTZMANN
MACHINES
283
constraints of this kind as well as possible, and the human visual system stores enough plausible constraints and is good enough at applying
them that it can arrive at the correct interpretation of most normal
.
Images.
The computation may be performed by an iterative search which
starts with a poor interpretation and progressively improves it by reducing a cost function that measures the extent to which the current
interpretation violates the plausible constraints. Suppose, for example,
that each unit stands for a small three-dimensional surface fragment ,
and the state of the unit indicates the current bet about whether that
surface fragment is part of the best three-dimensional interpretation .
Plausible constraints about the nature of surfaces can then be encoded
by the pairwise interactions between processing elements. For
example, two units that stand for neighboring surface fragments of
similar depth and surface orientation can be mutually excitatory to
encode the constraints that each of these hypotheses tends to support
the other (becauseobjects tend to have continuous surfaces) .
RELAXATION
SEARCHES
The general idea of using parallel networks to perform relaxation
searchesthat simultaneously satisfy multiple constraints is appealing. It
might even provide a successor to telephone exchanges, holograms, or
communities of agents as a metaphor for the style of computation in
cerebral cortex. But some tough technical questions have to be
answered before this style of computation can be accepted as either
efficient or plausible:
How are the weights that encode the knowledge acquired? For
models of low-level vision it is possible for a programmer to
decide on the weights, and evolution might do the same for the
earliest stages of biological visual systems. But if the same kind
of constraint-satisfaction searches are to be used for higher
level functions like shape recognition or content-addressable
memory , there must be some learning procedure that automatically encodes properties of the domain into the weights.
This chapter is mainly concerned with the last of these questions, but
the learning procedure we present is an unexpected consequence of our
attempt to answer the other questions, so we shall start with them.
good solutions to problems in which there are discrete hypotheses that are true or false. Even though the allowable solutions all assign values of true or false to the hypotheses, the search measures how severely the current state of the variables violates the constraint and tries to alter the values of the variables to reduce this violation.
Linear programming and its variants make a sharp distinction between constraints (which must be satisfied) and costs. A solution which achieves a very low cost by violating one or two of the constraints is simply not allowed. In many domains, the distinction between constraints and costs is not so clear-cut. In vision, for example, there is no distinction between constraints and
costs . The optimal solution is then the one which minimizes the total
constraint violation where different constraints are given different
strengths depending on how reliable they are . Another way of saying
this is that all the constraints have associated plausibilities , and the
most plausible solution is the one which fits these plausible constraints
as well as possible .
Some relaxation schemes dispense with separate feedback loops for
the constraints and implement weak constraints directly in the excita tory and inhibitory interactions between units . We would like these
networks to settle into states in which a few units are fully active and
the rest are inactive . Such states constitute clean " digital " interpreta tions . To prevent the network from hedging its bets by settling into a
state where many units are slightly active , it is usually necessary to use
a strongly nonlinear decision rule , and this also speeds convergence .
However , the strong nonlinearities
that are needed to force the network
to make a decision also cause it to converge on different
states on dif -
ferent occasions : Even with the same external inputs , the final state
depends on the initial state of the net. This has led many people (Hopfield, 1982; Rosenfeld, Hummel, & Zucker, 1976) to assume that the
the network rather than by sustained external input to some of its
units .
Hummel and Zucker ( 1983) and Hopfield ( 1982) have shown that
some relaxation schemes have an associated " potential " or cost function
and that the states to which the network converges are local minima of
this function. This means that the networks are performing optimization of a well-defined function. Unfortunately, there is no guarantee
that the network will find the best minimum . One possibility is to
redefine the problem as finding the local minimum which is closest to
the initial state . This is useful if the minima are used to represent
"items " in a memory , and the initial states are queries to memory
which may contain missing or erroneous information . The network
simply finds the minimum that best fits the query . This idea was used
by Hopfield ( 1982) who introduced an interesting kind of network in
which the units were always in one of two states. I Hopfield showed that
if the units are symmetrically connected (i .e., the weight from unit j to
unit j exactly equals the weight from unit j to unit i) and if they are
updated one at a time , each update reduces (or at worst does not
increase ) the value of a cost function which he called " energy " because
of the analogy with physical systems . Consequently , repeated iterations
are guaranteed to find an energy minimum . The global energy of the
system is defined as
    E = − ∑_{i<j} w_{ij} s_i s_j + ∑_i θ_i s_i     (1)

where w_{ij} is the strength of connection between the ith and the jth unit, s_i is the state of the ith unit (0 or 1), and θ_i is a threshold. A simple rule for finding a combination of states that is a local minimum of E is to switch each unit into whichever of its two states yields the lower total energy given the current states of the other units; the relevant quantity is the energy gap of the kth unit,

    ΔE_k = ∑_i w_{ki} s_i − θ_k .     (2)
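The guarantee that asynchronous updates never increase E is easy to demonstrate. The Python sketch below is mine (random weights, hypothetical names): it applies the deterministic rule based on the energy gap of Equation 2 and asserts at every step that the energy of Equation 1 does not rise.

```python
import numpy as np
rng = np.random.default_rng(4)

n = 8
W = rng.normal(size=(n, n)); W = (W + W.T) / 2   # symmetric weights
np.fill_diagonal(W, 0)
theta = rng.normal(size=n)
s = rng.integers(0, 2, size=n).astype(float)     # binary unit states

def energy(s):
    return -0.5 * s @ W @ s + theta @ s          # Equation 1

for step in range(200):
    k = rng.integers(n)                          # update one unit at a time
    gap = W[k] @ s - theta[k]                    # energy gap (Equation 2)
    e_before = energy(s)
    s[k] = 1.0 if gap > 0 else 0.0               # adopt the lower-energy state
    assert energy(s) <= e_before + 1e-12         # E never increases
print("E after 200 asynchronous updates:", energy(s))
```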
Using Probabilistic Decisions to Escape From Local Minima
At about the same time that Hopfield showed how parallel networks of this kind could be used to access memories stored as local energy minima, Kirkpatrick, Gelatt, and Vecchi (1983) introduced an interesting search technique, simulated annealing, for solving hard combinatorial problems on conventional computers. It is based on an analogy with statistical mechanics, and it uses noise to guide the use of occasional uphill steps. To find a very low energy state of a metal, the best strategy is to melt it and then slowly reduce its temperature.

FIGURE 1. A simple energy landscape containing two local minima, A and B, separated by an energy barrier. Shaking can be used to allow the state of the network (represented here by a ball-bearing) to escape from local minima.

Figure 1 shows why this helps. Shaking the system does not by itself favor A or B because both minima have the same width and so the initial
random point is equally likely to lie in either minimum . If we shake
the whole system, we are more likely to shake the ball-bearing from A
to B than vice versa because the energy barrier is lower from the A
side. If the shaking is gentle, a transition from A to B will be many
times as probable as a transition from B to A , but both transitions will
be very rare. So although gentle shaking will ultimately lead to a very
high probability of being in B rather than A , it will take a very long
time before this happens. On the other hand, if the shaking is violent ,
the ball-bearing will cross the barrier frequently and so the ultimate
probability ratio will be approached rapidly , but this ratio will not be
very good: With violent shaking it is almost as easy to cross the barrier
in the wrong direction (from B to A ) as in the right direction . A good
compromise is to start by shaking hard and gradually shake more and
more gently. This ensures that at some stage the noise level passes
through the best possible compromise between the absolute probability
of a transition and the ratio of the probabilities of good and bad transitions. It also means that at the end, the ball-bearing stays right at the
bottom of the chosen minimum .
This view of why annealing helps is not the whole story. Figure 1 is
misleading because all the states have been laid out in one dimension .
Complex systems have high-dimensional state spaces, and so the barrier
between two low-lying states is typically massively degenerate: The
number of ways of getting from one low-lying state to another is an
exponential function of the height of the barrier one is willing to cross.
This means that a rise in the level of thermal noise opens up an enormous variety of paths for escaping from a local minimum and even
though each path by itself is unlikely , it is highly probable that the system will cross the barrier. We conjecture that simulated annealing will
only work well in domains where the energy barriers are highly
degenerate.
    p_k = 1 / (1 + e^{−ΔE_k / T})     (3)
where T is a parameter which acts like the temperature of a physical
system . This local decision rule ensures that in thermal equilibrium the
relative probability of two global states is determined solely by their
energy difference, and follows a Boltzmann distribution:

    P_α / P_β = e^{−(E_α − E_β)/T}     (4)

where P_α is the probability of being in the αth global state, and E_α is the energy of that state. At low temperatures there is a strong bias in favor of states with low energy.
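Equations 3 and 4 are easy to verify empirically. The Python sketch below is my own illustration with arbitrary random weights: units adopt states by the stochastic rule of Equation 3, and after the chain equilibrates, the observed frequency ratio of two global states is compared (approximately, given finite sampling) with the Boltzmann ratio of Equation 4.

```python
import numpy as np
rng = np.random.default_rng(5)

n, T = 5, 1.5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2
np.fill_diagonal(W, 0)
theta = rng.normal(size=n)
energy = lambda s: -0.5 * s @ W @ s + theta @ s

s = rng.integers(0, 2, size=n).astype(float)
counts = {}
for step in range(100_000):
    k = rng.integers(n)
    gap = W[k] @ s - theta[k]                      # Delta E_k (Equation 2)
    s[k] = 1.0 if rng.random() < 1 / (1 + np.exp(-gap / T)) else 0.0  # Eq. 3
    if step > 10_000:                              # discard burn-in
        counts[tuple(s)] = counts.get(tuple(s), 0) + 1

a = (1.0, 0.0, 1.0, 0.0, 1.0)                      # two arbitrary global states
b = (0.0, 1.0, 0.0, 1.0, 0.0)
empirical = counts.get(a, 0) / max(counts.get(b, 1), 1)
predicted = np.exp(-(energy(np.array(a)) - energy(np.array(b))) / T)  # Eq. 4
print(f"P_a/P_b empirical {empirical:.2f} vs predicted {predicted:.2f}")
```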
Pattern Completion

One way of using a parallel network is to treat it as a pattern completion device. A subset of the units are "clamped" into their on or off states and the weights in the network then complete the pattern by determining the states of the remaining units. There are strong limitations on the sets of binary vectors that can be learned if the network has one unit for each component of the vector. These limits can be transcended by using extra units whose states do not correspond to components in the vectors to be learned. The weights of connections to these extra units can be used to represent complex interactions that
cannot be expressed as pairwise correlations between the components of the vectors to be learned. We will call the units whose states are specified by the patterns visible units and the extra units hidden units, by analogy with the hidden states of a hidden Markov process. The visible units are the interface between the network and the environment; the hidden units are where the network can build its own internal representations.

Completion can be used in two different paradigms. Sometimes the visible units are divided into two parts which always act as input and output: the environment clamps a pattern on the input units, and the network completes it by determining the states of the output units. At other times, any sufficiently large part of a learned vector may be clamped, and the network must complete the remainder; in advance, we never know which parts will be given and which parts will have to be completed.

EASY AND HARD LEARNING

Consider the problem of controlling the probabilities of the global states of a freely running network. At thermal equilibrium, each global state occurs with a probability determined by its energy (Equation 4), and its energy depends on many weights. So it would seem that changing any single weight will affect the probabilities of global states in ways that depend on all the other weights. Fortunately, using Equations 3 and 4, it is straightforward to derive how the equilibrium probability of a particular global state depends on a single weight, without knowing how that weight affects the energies of the other states:

    ∂ ln P_α^- / ∂w_{ij} = (1/T) [ s_i^α s_j^α − ∑_β P_β^- s_i^β s_j^β ]     (5)
where s_i^α is the binary state of the ith unit in the αth global state, and P_β^- is the probability, at thermal equilibrium, of global state β of the network when none of the visible units are clamped (the lack of clamping is denoted by the superscript −). Equation 5 shows that the effect
of a weight on the log probability of a global state depends only on the behavior of the two units that the weight connects (the second term is just the probability of finding the two units on together when the network is running freely). In a learning task, the environment specifies only the required probabilities of the states of the visible units, and so the environment or
teacher only has direct access to the states of these units . The difficult
learning problem is to decide how to use the hidden units to help
achieve the required behavior of the visible units . A learning rule
which assumes that the network is instructed from outside on how to use all of its units is of limited interest because it evades the main problem, which is to find representations for a given task among the hidden units.
In statistical terms , there are many kinds of statistical structure impli cit in a large ensemble of environmental vectors . The separate proba bility of each visible unit being active is the first -order structure and
can be captured by the thresholds of the visible units . The v2/ 2 pairwise
correlations
between
the
v visible
units
constitute
the
second -
order structure and this can be captured by the weights between pairs of
units.2 All structure higher than second-order cannot be captured by pairwise weights between the visible units alone. A second-order method restricts itself to capturing as much of the statistical structure as possible in a few underlying "factors." It ignores all higher order structure, which is where much of the interesting information lies. Such a second-order learning
292 BASIC
MECHANISMS
rule can do easy learning, but it cannot do the kind of hard learning
that involves deciding how to use extra units whose behavior is not
directly specified by the task.
It is tempting to think that networks with pairwise connections can
never capture higher than second-order statistics. There is one sense in
which this is true and another in which it is false. By introducing extra
units which are not part of the definition of the original ensemble, it is
possible to express the third -order structure of the original ensemble in
the second-order structure of the larger set of units . In the example
given above, we can add a fourth component to get the ensemble {(1101), (1010), (0110), (0000)}. It is now possible to use the thresholds and weights between all four units to express the third-order structure in the first three components. A more familiar way of saying this
is that we introduce an extra "feature detector" which in this example
detects the case when the first two units are both on. We can then
make each of the first two units excite the third unit , and use strong
inhibition from the feature detector to overrule this excitation when both of the first two units are on. The difficult problem in introducing
the extra unit was deciding when it should be on and when it should be
off - deciding what feature it should detect.3
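This can be checked by direct enumeration. In the Python sketch below, which is my own illustration, the three-component ensemble is the one implied by the fourth-component construction in the text (the fourth bit of each vector above is just the AND of the first two): its first- and second-order statistics look exactly like independent coin flips, so only the added detector unit exposes the regularity to pairwise weights.

```python
import numpy as np

# The three-component vectors from the ensemble in the text, with the
# fourth (detector) component stripped off. All first- and second-order
# statistics match independent fair bits; the structure is third-order.
ensemble = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 0]])

print("P(unit i on):", ensemble.mean(axis=0))                  # all 0.5
for i, j in [(0, 1), (0, 2), (1, 2)]:
    both = (ensemble[:, i] * ensemble[:, j]).mean()
    print(f"P(units {i},{j} both on) = {both}  (= 0.5 * 0.5)") # uncorrelated

# Add the feature detector: on iff the first two units are both on.
detector = (ensemble[:, 0] * ensemble[:, 1])[:, None]
augmented = np.hstack([ensemble, detector])

# The third-order regularity now appears in ordinary pairwise statistics
# involving the detector, which weights and thresholds can express.
corr = np.corrcoef(augmented.T)
print("corr(detector, unit 2):", round(corr[3, 2], 3))
```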
One way of thinking about the higher order structure of an ensemble
of environmental vectors is that it implicitly specifies good sets of
underlying features that can be used to model the structure of the
environment . In common-sense terms, the weights in the network
should be chosen so that the hidden units represent significant underlying features that bear strong, regular relationships to each other and to
the states of the visible units . The hard learning problem is to figure
out what these features are, i .e., to find a set of weights which turn the
hidden units into useful feature detectors that explicitly represent
properties of the environment which are only implicitly present as
higher order statistics in the ensemble of environmental vectors.
A different view of learning is that the weights in the network constitute a generative model of the environment: We would like to find a set of weights so that when the network is running freely, the patterns of activity that occur over the visible units are the same as they would be if the environment was clamping them. The number of units in the network and their interconnections define a space of possible models, and any particular set of weights defines a particular model within this space. The learning problem is to find a combination
of weights that gives a good model given the limitations imposed by the
architecture of the network and by the given ensemble of environmental
vectors . This
is called
a maximum
likelihood
model
and
there is a large literature within statistics on maximum likelihood esti mation . The learning procedure we describe actually has a close rela tionship to a method called Expectation and Maximization
(EM )
( Dempster , Laird , & Rubin , 1976) . EM is used by statisticians for
estimating missing parameters . It represents probability distributions by
using parameters like our weights that are exponentially related to
probabilities , rather than using probabilities themselves . The EM algo rithm is closely related to an earlier algorithm invented by Baum that
manipulates probabilities directly. Baum's algorithm has been successfully used to estimate the parameters of a hidden Markov chain, a transition net-
work which has a fixed structure but variable probabilities on the arcs
and variable probabilities of emitting a particular output symbol as it
arrives at each internal node . Given an ensemble of strings of symbols
and a fixed -topology transition network , the algorithm finds the combi nation of transition probabilities and output probabilities that is most
likely to have produced these strings (actually it only finds a local max imum ) .
Maximum likelihood methods work by adjusting the parameters to
increase the probability that the generative model will produce the
observed data . Baum ' s algorithm and EM are able to estimate new
values for the probabilities (or weights ) that are guaranteed to be better
than the previous values . Our algorithm simply estimates the gradient
of the log likelihood with respect to a weight , and so the magnitude of
the weight change must be decided using additional criteria . Our algo rithm , however , has the advantage that it is easy to implement in a
parallel network of neuron -like units .
The idea of a stochastic generative model is attractive because it pro vides a clean quantitative way of comparing alternative representational
schemes . The problem of saying which of two representational schemes
is best appears to be intractable . Many sensible rules of thumb are
available , but these are generally pulled out of thin air and justified by
commonsense and practical experience . They lack a firm mathematical
foundation. If we confine ourselves to a space of allowable stochastic models, however, there is an objective criterion: How probable is the observed ensemble of
environmental vectors given the representational scheme? In our networks , representations are patterns of activity in the units , and the
representational scheme therefore corresponds to the set of weights that
determines which units are active.

THE BOLTZMANN MACHINE LEARNING ALGORITHM
    G = ∑_α P^+(V_α) ln [ P^+(V_α) / P^-(V_α) ]     (6)

where P^+(V_α) is the probability of the αth state of the visible units in
phase+ when their states are determined by the environment , and
p - ( VQ) is the corresponding probability in phase- when the network is
running freely with no environmental input .
G is never negative and is only zero if the distributions are identical.
G is actually the distance in bits from the free running distribution to
the environmental distribution .4 It is sometimes called the asymmetric
divergence or information gain. The measure is not symmetric with
respect to the two distributions . This seems odd but is actually very
reasonable. When trying to approximate a probability distribution , it is
more important to get the probabilities correct for events that happen
frequently than for rare events. So the match between the actual and
predicted probabilities of an event should be weighted by the actual
probability as in Equation 6.
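The measure of Equation 6 is simple to compute for any pair of distributions. The Python sketch below is my own illustration with made-up probabilities; base-2 logarithms are used so that, as Footnote 4 notes, G is a distance in bits, and the second call shows the asymmetry discussed above.

```python
import numpy as np

def G(p_plus, p_minus):
    """Equation 6: asymmetric divergence (in bits) from the free-running
    distribution P- to the environmental distribution P+."""
    p_plus, p_minus = np.asarray(p_plus), np.asarray(p_minus)
    nz = p_plus > 0                       # terms with P+ = 0 contribute 0
    return np.sum(p_plus[nz] * np.log2(p_plus[nz] / p_minus[nz]))

env = [0.70, 0.20, 0.05, 0.05]     # hypothetical P+ over four visible states
model = [0.40, 0.40, 0.10, 0.10]   # hypothetical free-running P-

print("G =", round(G(env, model), 4), "bits")   # 0 iff the two are identical
# Errors on frequent events are weighted more heavily, so reversing the
# arguments gives a different number:
print("G reversed =", round(G(model, env), 4), "bits")
```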
It is possible to improve the network 's model of the structure of its
environment by changing the weights so as to reduce G . 5 To perform
gradient descent in G , we need to know how G will change when a
weight is changed. But changing a single weight changes the energies
of one quarter of all the global states of the network , and it changes the
probabilities of all the states in ways that depend on all the other
weights in the network . Consider, for example, the very simple network shown in Figure 2. If we want the two units at the ends of the
chain to be either both on or both off , how should we change the
weight w_{3,4}? It clearly depends on the signs of remote weights like w_{1,2}
because we need to have an even number of inhibitory weights in the
chain. 6 So the partial derivative of G with respect to one weight
depends on all the other weights and minimizing G appears to be a
FIGURE 2. A very simple network with one input unit, one output unit, and two hidden units. The task is to make the output unit adopt the same state as the input unit. The difficulty is that the correct value for weight w_{3,4} depends on remote information like the value of weight w_{1,2}.
4 If we use base 2 logarithms .
very hard problem. Surprisingly, at thermal equilibrium the gradient takes a simple form:

    ∂G/∂w_{ij} = − (1/T) [ p_{ij}^+ − p_{ij}^- ]     (7)

where p_{ij}^+ is the probability, averaged over all environmental inputs and measured at equilibrium, that units i and j are both on in phase+, and p_{ij}^- is the corresponding probability when the network is running freely in phase-.
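Equation 7 makes the weight update purely local. The Python sketch below is my own illustration (the co-occurrence numbers are made up): descending the gradient of G means raising a weight when its two units co-occur more often with the environment present than without, and lowering it in the opposite case.

```python
import numpy as np

def boltzmann_weight_update(p_plus, p_minus, epsilon=0.1):
    """Gradient descent in G via Equation 7: dG/dw_ij is proportional to
    -(p+_ij - p-_ij), so step each weight by epsilon * (p+_ij - p-_ij)."""
    return epsilon * (np.asarray(p_plus) - np.asarray(p_minus))

# Hypothetical equilibrium co-occurrence probabilities for 3 units (i < j).
p_plus  = np.array([[0, .6, .1], [0, 0, .5], [0, 0, 0]])   # phase+
p_minus = np.array([[0, .4, .3], [0, 0, .5], [0, 0, 0]])   # phase-

dW = boltzmann_weight_update(p_plus, p_minus)
print(dW)   # weight 0-1 grows, weight 0-2 shrinks, weight 1-2 is matched
```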
Unlearning
Crick and Mitchison (1983) have suggestedthat a form of reverse
learning might occur during REM sleep in mammals. Their proposal
was basedon the assumptionthat parasiticmodesdevelopin large networks that hinder the distributed storageand retrieval of information.
The mechanismthat Crick and Mitchison proposeis basedon
More or less random stimulation of the forebrain by the brain
stem that will tend to stimulate the inappropriatemodes of
brain activity . . . and especiallythosewhich are too prone to be
set off by random noise rather than by highly structured
specificsignals. (p. 112)
During this state of random excitation and free running they postulate
that changes occur at synapsesto decreasethe probability of the
spuriousstates.
A simulation of reverselearning was performed by Hopfield, Feinstein, and Palmer (1983) who independentlyhad been studyingwaysto
improve the associativestoragecapacityof simple networks of binary
processors(Hopfield, 1982) . In their algorithm an input is presentedto
the network as an initial condition, and the system evolves by falling
into a nearby local energy minimum . However , not all local energy
minima represent stored information . In creating the desired minima ,
they accidentally create other spurious minima , and to eliminate these
they use " unlearning " : The learning procedure is applied with reverse
sign to the states found after starting from random initial conditions .
Following this procedure , the performance of the system in accessing
stored states was found to be improved .
There is an interesting relationship between the reverse learning pro posed by Crick and Mitchison and Hopfield et ala and the form of the
learning algorithm which we derived by considering how to minimize
an information theory measure of the discrepancy between the environ mental structure
and the network ' s internal
model ( Hinton
&
Sejnowski , 1983b ) . The two phases of our learning algorithm resemble
the learning and unlearning procedures : Positive Hebbian learning
occurs in phase+ during which information in the environment is captured by the weights ; during phase- the system randomly samples states
according to their Boltzmann distribution and Hebbian learning occurs
with a negative coefficient .
However , these two phases need not be implemented in the manner
suggested by Crick and Mitchison . For example , during phase- the
average co-occurrences could be computed without making any changes
to the weights . These averages could then be used as a baseline for
making changes during phase+ ; that is, the co-occurrences during
phase+ could be computed and the baseline subtracted before each per manent weight change . Thus , an alternative but equivalent proposal for
the function
of dream sleep is to recalibrate the baseline for
plasticity - the break -even point which determines whether a synaptic
weight is incremented or decremented . This would be safer than mak ing permanent weight decrements to synaptic weights during sleep and
solves the problem of deciding how much " unlearning " to do .
Our learning algorithm refines Crick and Mitchison ' s interpretation
of why two phases are needed . Consider a hidden unit deep within the
network : How should its connections with other units be changed to
best capture regularity present in the environment ? If it does not
receive direct input from the environment , the hidden unit has no way
to determine whether the information
it receives from neighboring
units is ultimately caused by structure in the environment or is entirely
a result of the other weights . This can lead to a " folie a deux " where
two parts of the network each construct a model of the other and
ignore the external environment . The contribution
of internal and
external sources can be separated by comparing the co-occurrences in
phase+ with similar information
that is collected in the absence of
environmental input . phase- thus acts as a control condition . Because
of the special properties of equilibrium it is possible to subtract off this
purely internal contribution and use the difference to update the
weights. Thus , the role of the two phases is to make the system maximally responsive to regularities present in the environment and to
prevent the system from using its capacity to model internally generated regularities.
Ways in Which the Learning Algorithm Can Fail
The ability to discover the partial derivative of G by observing p_{ij}^+ and p_{ij}^- does not completely determine the learning algorithm. It is still
necessary to decide how much to change each weight, how long to collect co-occurrence statistics before changing the weight, how many
weights to change at a time , and what temperature schedule to use during the annealing searches. For very simple networks in very simple
environments , it is possible to discover reasonable values for these
parameters by trial and error . For more complex and interesting cases,
serious difficulties arise because it is very easy to violate the assumptions on which the mathematical results are based (Derthick , 1984) .
The first difficulty is that there is nothing to prevent the learning
algorithm from generating very large weights which create such high
energy barriers that the network cannot reach equilibrium in the allotted time . Once this happens, the statistics that are collected will not be
the equilibrium statistics required for Equation 7 to hold and so all bets
are off . We have observed this happening for a number of different
networks. They start off learning quite well and then the weights
become too large and the network "goes sour" - its performance
deteriorates dramatically.
One way to ensure that the network gets close to equilibrium is to
keep the weights small. Pearlmutter (personal communication ) has
shown that the learning works much better if , in addition to the weight
changes caused by the learning, every weight continually decays towards
a value of zero, with the speed of the decay being proportional to the
absolute magnitude of the weight. This keeps the weights small and
eventually leads to a relatively stable situation in which the decay rate
of a weight is balanced by the partial derivative of G with respect to the
weight. This has the satisfactory property that the absolute magnitude
of a weight shows how important it is for modeling the environmental
structure .
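The decay just described, with speed proportional to a weight's absolute magnitude, is simply exponential shrinkage toward zero. The Python sketch below is my own illustration (all numbers hypothetical): when the gradient term of Equation 7 vanishes, the weights drain away, and otherwise they settle where decay balances the gradient.

```python
import numpy as np

def update_with_decay(w, p_plus, p_minus, epsilon=0.1, gamma=0.001):
    """One learning step plus weight decay: the decay term -gamma * w
    shrinks each weight toward zero at a speed proportional to |w|, so at
    the stable point |w| reflects how much gradient pressure sustains it."""
    return w + epsilon * (p_plus - p_minus) - gamma * w

w = np.array([2.0, -1.0, 0.1])
p_plus  = np.array([0.5, 0.2, 0.3])
p_minus = np.array([0.5, 0.2, 0.3])     # gradient term vanishes here
for _ in range(1000):
    w = update_with_decay(w, p_plus, p_minus)
print(w)    # with no gradient pressure, all weights shrink toward zero
```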
The use of weight-decay has several other consequences which are
not so desirable. Because the weights stay small, the network cannot
construct very deep minima in the energy landscape and so it cannot
make the probability ratios for similar global states be very different .
AN EXAMPLE OF HARD LEARNING
A simple example which can only be solved by capturing the higher
order statistical structure in the ensemble of input vectors is the
"shifter " problem. The visible units are divided into three groups.
Group V 1 is a one-dimensional array of 8 units , each of which is
clamped on or off at random with a probability of 0.3 of being on.
Group V2 also contains 8 units and their states are determined by shift ing and copying the states of the units in group V 1. The only shifts
allowed are one to the left , one to the right , or no shift . Wrap-around
is used so that when there is a right shift , the state of the right -most
unit in V 1 determines the state of the left -most unit in V2. The three
possible shifts are chosen at random with equal probabilities. Group V3
contains three units to represent the three possible shifts , so at anyone
time one of them is clamped on and the others are clamped off .
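The training distribution just described is easy to generate. The Python sketch below is my own; the function name and representation details are assumptions. It produces one training vector: a random V1, the shifted copy V2 with wrap-around (a right shift moves the right-most unit of V1 to the left-most position of V2), and the one-of-three shift code V3.

```python
import numpy as np
rng = np.random.default_rng(6)

def shifter_example(n=8, p_on=0.3):
    """One shifter training vector: V1 is random binary with P(on) = 0.3,
    V2 is V1 shifted left, not at all, or right (with wrap-around), and
    V3 is a one-of-three code for the shift."""
    v1 = (rng.random(n) < p_on).astype(int)
    shift = rng.integers(3)                 # 0: left, 1: none, 2: right
    v2 = np.roll(v1, shift - 1)             # roll by -1, 0, or +1
    v3 = np.eye(3, dtype=int)[shift]
    return v1, v2, v3

v1, v2, v3 = shifter_example()
print("V1:", v1)
print("V2:", v2)
print("V3:", v3)
```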
The problem is to learn the structure that relates the states of the three groups. One facet of this problem is to "recognize" the shift, i.e., to complete a partial input vector in which the states of V1 and V2 are clamped but the units in V3 are left free. It is fairly easy to see why this problem cannot possibly be solved by just adding together a lot of pairwise interactions between units in V1, V2, and V3. If you know
that a particular unit in V1 is on, it tells you nothing whatsoever about what the shift is: It is only combinations of active units in V1 and V2 that are informative, so the information required to predict the shift is of at least third order. No set of pairwise weights among the visible units alone is capable of performing the task; extra, hidden units are required. One obvious way to recognize the shift is to use hidden units that detect informative local features, for example, a unit in V1 being active together with the unit one place to its right in V2, which supports the conclusion that the shift is to the right. It is not at all obvious that the learning algorithm, which is told nothing whatsoever about the meaning of the units in V3, is capable of finding this kind of feature detector, nor that detectors of this kind are the optimal way to perform the task; all the procedure does is generate weights that minimize G.

The network used 24 hidden units, each connected to all 16 units in V1 and V2 and to all 3 units in V3, but not to one another. Figure 3 shows the weights of the hidden units after learning. Most of them have become spatially localized shift detectors of the kind just described: Each has large positive weights to one unit in V1 and to the correspondingly shifted unit in V2, votes for the corresponding shift unit in V3, and has negative weights that keep it from coming on when there is an alternative explanation for the activity of its two favorite input units. Several different instances of each type of detector occur at different locations, which means the set of detectors is far from the smallest that could perform the task. (Learning procedures that produce localized feature detectors of a different kind, through competitive interactions and lateral inhibition among the units rather than gradient descent, are described by Fukushima, 1980, and Rumelhart & Zipser, 1985.)

The Training Procedure

The network was trained by alternating between the two phases. In phase+, one of the training vectors was clamped on all the units in V1, V2, and V3; the network was annealed to thermal equilibrium, and the co-occurrence statistics p_{ij}^+ were then measured over a further period at the final temperature. In phase-, the same annealing schedule was run with no units clamped, and the statistics p_{ij}^- were measured in the same way. This was repeated over a number of training vectors, and the weights were then changed. Since this is a truly parallel asynchronous system, each unit must decide when to update its own state, and the network must tolerate the time delays involved in communicating states between units.
FIGURE 3. The weights of the 24 hidden units in the shifter network. Each large region corresponds to a unit. Within this region the black rectangles represent negative weights and the white rectangles represent positive ones. The size of a rectangle represents the magnitude of the weight. The two rows of weights at the bottom of each unit are its connections to the two groups of input units, V1 and V2. These weights therefore represent the "receptive field" of the hidden unit. The three weights in the middle of the top row of each unit are its connections to the three output units that represent shift-left, no-shift, and shift-right. The solitary weight at the top left of each unit is its threshold. Each hidden unit is directly connected to all 16 input units and all 3 output units. In this example, the hidden units are not connected to each other. The top-left unit has weights that are easy to understand: Its optimal stimulus is activity in the fourth unit of V1 and the fifth unit of V2, and it votes for shift-right. It has negative weights to make it less likely to come on when there is an alternative explanation for why its two favorite input units are active.
302 BASIC
MECHANISMS
The weights shown in Figure 3 were learned using the procedure described above. After each learning sweep, every weight was incremented by $5(p_{ij}^+ - p_{ij}^-)$, and, in addition, every weight had its absolute magnitude eroded by a process called weight decay. The decay prevented the weights from becoming excessively large, and it also helped to resuscitate hidden units that had such large, predominantly positive or predominantly negative, weights that they spent all their time in one state and therefore conveyed no information: As its weights decay, such a unit eventually comes back to life, spending part of its time in each state.

Whenever the network was required to reach equilibrium, the following annealing schedule was used: 2 iterations at a temperature of 40, 2 at 35, 2 at 30, 2 at 25, 2 at 20, 2 at 15, 2 at 12, and 2 at 10. An iteration is defined as a number of random probes equal to the number of units, so that on average each unit is probed once per iteration; when it is probed, a unit decides stochastically which of its states to adopt. Because the units act asynchronously and in parallel, each unit must make its decision using information about the current states of the other units that may be out of date, so a truly parallel network of this kind must tolerate time delays, and networks with time delays of this kind can still settle (Kienker, Sejnowski, Hinton, & Schumacher, 1985).

Performance of the Shifter Network

The shifter example illustrates both the recent encouraging progress and the remaining weaknesses of this approach. On the encouraging side, the network learns a structure that is clearly beyond the capability of first-order perceptrons: No single unit, and no single first-order decision rule, can decide the shift. On the other side, the learning was very slow: It required 9000 learning sweeps, each of which involved reaching equilibrium 20 times with the states of the units in V1, V2, and V3 clamped, and 20 times with only the units in V1 and V2 clamped. This seems excessively slow for low-level perceptual learning. There were other weaknesses as well:
The weights are fairly clearly not optimal because of the 5 hidden units that appear to do nothing useful. Also, the performance is far from perfect. When the states of the units in V1 and V2 are clamped and the network is annealed gently to half the final temperature used during learning, the units in V3 quite frequently adopt the wrong states. If the number of on units in V1 is 1, 2, 3, 4, 5, 6, or 7, the percentage of correctly recognized shifts is 50%, 71%, 81%, 86%, 89%, 82%, and 66%, respectively. The wide variation in the number of active units in V1 naturally makes the task harder to learn than if a constant proportion of the units were active. Also, some of the input patterns are ambiguous: When all the units in V1 and V2 are off, the network can do no better than chance.
AN EXAMPLE OF THE EFFECTS OF DAMAGE
To show the effects of damage on a network, it is necessary to choose a task for the network to perform. Since we are mainly interested in the effects of damage on networks that use distributed representations, we chose a task of associating arbitrary grapheme strings with arbitrary sets of semantic features.
The Network
The network consisted of three groups or layers of units. The grapheme group was used to represent the letters in a three-letter word. It contained 30 units and was subdivided into three groups of 10 units each. Each subgroup was dedicated to one of the three letter positions within a word, and it represented one of the 10 possible letters in that position by having a single active unit for that letter. The three-letter grapheme strings were not English words. They were chosen randomly, subject to the constraint that each of the 10 possible graphemes in each position had to be used at least once. The sememe group was used to encode the semantic features of the "word." 7 It contained 30 units, one for each possible semantic feature. The semantic features to be associated with a word were chosen randomly, with each feature having a probability of 0.2 of being chosen for each word. There were connections between all pairs of units in the sememe group to allow the network to learn familiar combinations of semantic features. There were no direct connections between the grapheme and sememe groups. Instead, there was an intermediate layer of 20 units, each of which was connected to all the units in both the grapheme and the sememe groups. Figure 4 is an artist's impression of the network. It uses English letters and words to convey the functions of the units in the various layers. Most of the connections are missing.
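The construction of the training ensemble can be made concrete in a few lines. The following Python fragment is our illustration, not code from the original simulation: It generates the 20 grapheme vectors (one active unit per 10-unit subgroup, every grapheme used at least once in each position) and the 20 sememe vectors (each of the 30 features present with probability 0.2).

```python
import random

random.seed(0)                     # reproducible illustration

N_LETTERS, N_POSITIONS = 10, 3     # 10 possible graphemes in each of 3 positions
N_SEMEMES, N_WORDS = 30, 20        # 30 semantic features, 20 "words"
P_FEATURE = 0.2                    # probability a feature is assigned to a word

def letters_for_position():
    # Each of the 10 graphemes in a position must be used at least once
    # across the 20 words, so seed the pool with one of each.
    pool = list(range(N_LETTERS))
    pool += [random.randrange(N_LETTERS) for _ in range(N_WORDS - N_LETTERS)]
    random.shuffle(pool)
    return pool

# grapheme vectors: one active unit per 10-unit subgroup, 30 units in all
columns = [letters_for_position() for _ in range(N_POSITIONS)]
words = []
for w in range(N_WORDS):
    vec = []
    for pos in range(N_POSITIONS):
        units = [0] * N_LETTERS
        units[columns[pos][w]] = 1
        vec += units
    words.append(vec)

# sememe vectors: each of the 30 features present with probability 0.2
meanings = [[1 if random.random() < P_FEATURE else 0 for _ in range(N_SEMEMES)]
            for _ in range(N_WORDS)]
```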
The network was trained to associate each of 20 patterns of activity in the grapheme units with an arbitrarily related pattern of activity in the sememe units.

7 The use of "semantic" here should not be taken too literally: The representation of the meaning of a word is clearly more complicated than just a set of features.
FIGURE 4. Part of the network used for associating three-letter words with sets of semantic features. English words are used in this figure to help convey the functional roles of the units. In the actual simulation, the letter-strings and semantic features were chosen randomly.
In phase+ all the grapheme and sememe units were clamped in states
that represented the physical form and the meaning of a single word ,
and the intermediate units were allowed to reach equilibrium. Statistics were then collected; this was repeated twice for each of the 20 possible grapheme/sememe associations, and the statistics from these 40 annealings were averaged to yield an estimate, for each connection, of $p_{ij}^+$. In phase-,
only the grapheme units were clamped and the network settled to equilibrium, deciding for itself which sememe units should be active. The network was then run for a further 5 iterations, during which statistics were collected for all connected pairs of units. This was repeated twice for each of the 20 grapheme strings, and the co-occurrence statistics were averaged to yield an estimate of $p_{ij}^-$. Each learning sweep thus involved a total of 80 annealings.
After each sweep, every weight was either incremented or decremented by 1, with the sign of the change being determined by the sign of $(p_{ij}^+ - p_{ij}^-)$. In addition to these increments, there was a decay process in which every weight had its absolute magnitude decreased by 1. For each weight, the probability of this happening was 0.0005 times the absolute magnitude of the weight.
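Stated as code, the update applied after each sweep is just a signed unit step plus stochastic decay. The sketch below is our paraphrase in Python; it assumes the co-occurrence estimates for each connected pair have already been collected from the phase+ and phase- annealings, and all names in it are ours.

```python
import random

def update_weights(w, p_plus, p_minus):
    """One learning sweep for a Boltzmann machine, as described above.

    w, p_plus, p_minus are dicts keyed by connected pairs (i, j);
    p_plus and p_minus hold the co-occurrence probabilities estimated
    from this sweep's phase+ and phase- annealings."""
    for pair in w:
        # Fixed-size step whose sign follows the estimated gradient of G.
        if p_plus[pair] > p_minus[pair]:
            w[pair] += 1
        elif p_plus[pair] < p_minus[pair]:
            w[pair] -= 1
        # Decay: with probability 0.0005|w|, move one step toward zero.
        if w[pair] != 0 and random.random() < 0.0005 * abs(w[pair]):
            w[pair] -= 1 if w[pair] > 0 else -1
    return w
```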
We found that the network performed better if there was a preliminary learning stage which just involved the sememe units. In this stage, the intermediate units were not yet connected. During phase+ the required patterns were clamped on the sememe units and $p_{ij}^+$ was measured (annealing was not required because all the units involved were clamped). During phase- no units were clamped and the network was allowed to reach equilibrium 20 times using the annealing schedule given above. After annealing, $p_{ij}^-$ was estimated from the co-occurrences as before, except that only 20 phase- annealings were used instead of 40. There were 300 sweeps of this learning stage and they resulted in weights between pairs of sememe units that were sufficient to give the sememe group an energy landscape with 20 strong minima corresponding to the 20 possible "word meanings." This helped subsequent learning considerably, because it reduced the tendency for the intermediate units to be recruited for the job of modeling the structure among the sememe units. They were therefore free to model the structure between the grapheme units and the sememe units.9 The results described here were obtained using the preliminary learning stage and so they correspond to learning to associate grapheme strings with "meanings" that are already familiar.
9 There was no need for a similar stage in which weights among the grapheme units were learned, because in the main stage of learning the grapheme units are always clamped for each of the different words, and so the learning procedure generates no tendency for the intermediate layer of the network to be used to try to model the structure among the grapheme units.
The network was then tested with each of the intermediate units removed in turn. In all 10,000
tests, using the careful annealing schedule, it made 140 errors (98.6% correct). Many errors consisted of the correct set of semantic features with one or two extra or missing features, but 83 of the errors consisted of the precise meaning of some other grapheme string. An analysis of these 83 errors showed that the Hamming distance between the correct meanings and the erroneous ones had a mean of 9.34 and a standard deviation of 1.27, which is significantly lower (p < .01) than the complete set of Hamming distances, which had a mean of 10.30 and a standard deviation of 2.41. We also looked at the Hamming distances between the grapheme strings that the network was given as input and the grapheme strings that corresponded to the erroneous familiar meanings. The mean was 3.95 and the standard deviation was 0.62, which is significantly lower (p < .01) than the complete set, which had mean 5.53 and standard deviation 0.87. (Since each letter is coded by a single active unit, strings that differ in k of their three letters are separated by a Hamming distance of 2k; a Hamming distance of 4 therefore means that the strings have one letter in common.)
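The distance arithmetic can be made explicit with a short check (ours, illustrative only):

```python
def hamming(u, v):
    # Number of positions at which two binary vectors differ.
    return sum(a != b for a, b in zip(u, v))

# One differing letter contributes distance 2, since each letter is a
# single active unit out of 10 in its position.
a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # letter 0 in some position
b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # letter 1 in the same position
assert hamming(a, b) == 2            # so distance 4 means two letters differ
```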
In summary, when a single unit is removed from the intermediate layer, the network still performs well. The majority of its errors consist of producing exactly the meaning of some other grapheme string, and the erroneous meanings tend to be similar to the correct one and to be associated with a grapheme string that has one letter in common with the string used as input.

The original learning was very slow. Each item had to be presented 5000 times to eliminate almost all the errors.
In some directions, the surface slopes steeply down for a short distance and then steeply up again (like the cross-section of a ravine).10 In most other directions the surface slopes gently upwards. In
a relatively narrow cone of directions , the surface slopes gently down
with very low curvature . This narrow cone corresponds to the floor of
the ravine and to get a low value of G (which is the definition of good
performance ) the learning must follow the floor of the ravine without
going up the sides. This is particularly hard in a high-dimensional
space. Unless the gradient of the surface is measured very accurately , a
step in the direction of the estimated gradient will have a component
along the floor of the ravine and a component up one of the many
sides of the ravine . Because the sides are much steeper than the floor ,
the result of the step will be to raise the value of G which makes
performance worse . Once out of the bottom of the ravine , almost all
the measurable gradient will be down towards the floor of the ravine
instead of along the ravine . As a result , the path followed in weight
space tends to consist of an irregular sloshing across the ravine with
only a small amount of forward progress . We are investigating ways of
ameliorating this difficulty, but it is a well-known problem of gradient descent techniques in high-dimensional spaces, and it may be
unavoidable .
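The sloshing path described above is easy to reproduce on a toy error surface. The sketch below is our illustration, unrelated to the original simulations: It runs plain gradient descent on an ill-conditioned quadratic whose ravine floor runs along the x-axis.

```python
# Gradient descent on G(x, y) = (x**2 + 100 * y**2) / 2.  The floor of the
# "ravine" is the x-axis; the y-direction is the steep sides.
def descend(lr, steps=50, x=10.0, y=1.0):
    for _ in range(steps):
        gx, gy = x, 100.0 * y          # gradient of G
        x, y = x - lr * gx, y - lr * gy
    return x, y

print(descend(lr=0.001))   # stable, but x barely moves along the ravine
print(descend(lr=0.019))   # y sloshes across the ravine; x moves faster
```

With the small step the descent is safe but makes almost no forward progress; with the larger step the y-component overshoots and changes sign on every iteration, which is exactly the irregular sloshing described in the text.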
The ravine problem leads to a very interesting prediction about
relearning when random noise is added to the weights . The original
learning takes the weights a considerable distance along a ravine which
is slow and difficult because most directions in weight space are up the
sides of the ravine . When a lot of random noise is added , there will
typically be a small component along the ravine and a large component
up the sides . Performance will therefore get much worse (because
height in this space means poor performance ) , but relearning will be
fast because the network can get back most of its performance by simply descending to the floor of the ravine (which is easy) without making progress along the ravine (which is hard).
The same phenomenon can be understood by considering the energy landscape rather than the weight space (recall that one point in weight space constitutes a whole energy landscape). Good performance requires a rather precise balance between the relative depths of the 20 energy minima, and it also requires that all the 20 minima have considerably lower energy than other parts of the energy landscape. The balance between the minima in energy space is the cross-section of the ravine in weight space (see Figure 5), and the depth of all the minima compared with the rest of the energy landscape corresponds to the direction along the ravine. Random noise upsets the precise balance
10 The surface is never very steep. Its gradient parallel to any weight axis must always lie between 1 and -1 because it is the difference of two probabilities.
FIGURE 5. A cross-section of the ravine in weight space: Moving in one direction increases the weights that help energy minimum B; moving in the other direction helps minimum A.
FIGURE 6. The recovery of performance after various types of damage. Each data-point represents 500 tests (25 with each of the 20 associations). The heavy line is a section of the original learning curve after a considerable number of learning sweeps; it shows that in the original learning, performance increases by less than 10% in 30 learning sweeps. All the other lines show recovery after damaging a net that had very good performance (99.3% correct). The lines with open circles show the rapid recovery after 20% or 50% of the weights to the hidden units have been set to zero (but then allowed to relearn). The dashed line shows recovery after 5 of the 20 hidden units have been permanently ablated. The remaining line is the case in which uniform random noise between -22 and +22 is added to all the connections to and from the hidden units. In all cases, a successful trial was defined as one in which the network produced exactly the correct semantic features when given the graphemic input.

Spontaneous Recovery of Unrehearsed Associations

Because the representations are distributed, each association is encoded in changes to many of the weights, and each weight is involved in encoding several different associations. Damage to the network will therefore affect many different associations at once, and relearning any one of them involves changing weights that the other associations also use. If relearning some of the associations tends to restore the energy minima of all of them to their
previous depths. So, in relearning any one of the associations, there should be a positive transfer effect which tends to restore the others.
This effect is actually rather weak and is easily masked, so it can only be seen clearly for associations whose performance was severely disrupted by the damage. Retraining on the other 18 associations caused an improvement in the two associations that were not retrained, to 90/100 correct with the careful schedule, but the few errors that remained tended to be completely wrong answers rather than minor perturbations of the correct answer. We repeated the experiment, selecting two associations for which the error rate was high and the errors were typically large, and retraining on the other 18 associations again caused an improvement in terms of both of these measures. Now, consider two dif-
damaged and relearns. However, we have not made any serious attempt to fit the simulation to particular data.
CONCLUSION
We have presented three ideas:
• The use of stochastic units and annealing search to find low-energy states of a parallel network.

• A learning algorithm, using only locally available information, which constructs distributed representations in the hidden units of a layered network.

• Distributed representations which are resistant to minor damage, which exhibit rapid relearning after major damage, and in which relearning the practiced associations can bring back other, unrehearsed associations through spontaneous recovery.

These three ideas can be assessed separately, even though the Boltzmann machine is a convenient testbed in which to bring them together. The annealing search is crucial to the learning algorithm, which only works if the statistics it uses are collected at equilibrium; the relationship between the other two ideas and the search is looser. Ackley, Hinton, and Sejnowski (1985) give a detailed example of the learning algorithm, and Hinton and Sejnowski (in press) discuss related kinds of search. Others have investigated similar ideas in nonstochastic layered networks: Fukushima (1980) and Huberman and Hogg (1984), for example, have demonstrated networks which construct internal representations and which exhibit some resistance to damage and self-repair. The loose relationship between networks of this kind and the mammalian cortex is discussed elsewhere.

One promising direction for future work is a kind of learning in which the network constructs efficient internal codes for communicating information across channels of narrow bandwidth. At present, however, the learning algorithm is too slow to be tested properly on large networks, and future progress hinges on being able to speed it up.
ACKNOWLEDGMENTS
This research was supported by grants from the System Development Foundation. We thank David Ackley, Peter Brown, Francis Crick, Mark Derthick, Scott Fahlman, Stuart Geman, John Hopfield, Paul Kienker, Jay McClelland, Barak Pearlmutter, David Rumelhart, Tim Shallice, and Paul Smolensky for helpful discussions.
APPENDIX: DERIVATION OF THE LEARNING ALGORITHM
When a network is free-running at equilibrium, the probability distribution over the visible units is given by

$$P^-(V_\alpha) \;=\; \sum_\beta P^-(V_\alpha \wedge H_\beta) \;=\; \frac{\sum_\beta e^{-E_{\alpha\beta}/T}}{\sum_{\lambda\mu} e^{-E_{\lambda\mu}/T}} \tag{8}$$

where $V_\alpha$ is a vector of the states of the visible units, $H_\beta$ is a vector of the states of the hidden units, and $E_{\alpha\beta}$ is the energy of the system in state $V_\alpha \wedge H_\beta$,

$$E_{\alpha\beta} = -\sum_{i<j} w_{ij}\, s_i^{\alpha\beta} s_j^{\alpha\beta}.$$

Hence,

$$\frac{\partial e^{-E_{\alpha\beta}/T}}{\partial w_{ij}} = \frac{1}{T}\, s_i^{\alpha\beta} s_j^{\alpha\beta}\, e^{-E_{\alpha\beta}/T}.$$

Differentiating (8) then yields

$$\frac{\partial P^-(V_\alpha)}{\partial w_{ij}} = \frac{1}{T}\left[\sum_\beta P^-(V_\alpha \wedge H_\beta)\, s_i^{\alpha\beta} s_j^{\alpha\beta} \;-\; P^-(V_\alpha)\sum_{\lambda\mu} P^-(V_\lambda \wedge H_\mu)\, s_i^{\lambda\mu} s_j^{\lambda\mu}\right].$$
The measure of the learning is

$$G = \sum_\alpha P^+(V_\alpha)\,\ln\frac{P^+(V_\alpha)}{P^-(V_\alpha)},$$

where $P^+(V_\alpha)$ is the clamped probability distribution over the visible units and is independent of $w_{ij}$. So

$$\frac{\partial G}{\partial w_{ij}} = -\sum_\alpha \frac{P^+(V_\alpha)}{P^-(V_\alpha)}\,\frac{\partial P^-(V_\alpha)}{\partial w_{ij}}.$$
Substituting the derivative computed above,

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T}\left[\sum_{\alpha\beta}\frac{P^+(V_\alpha)}{P^-(V_\alpha)}\,P^-(V_\alpha \wedge H_\beta)\,s_i^{\alpha\beta} s_j^{\alpha\beta} \;-\; \sum_\alpha P^+(V_\alpha)\sum_{\lambda\mu}P^-(V_\lambda \wedge H_\mu)\,s_i^{\lambda\mu} s_j^{\lambda\mu}\right].$$
Now,

$$P^-(V_\alpha \wedge H_\beta) = P^-(H_\beta \mid V_\alpha)\,P^-(V_\alpha)$$

and

$$P^-(H_\beta \mid V_\alpha) = P^+(H_\beta \mid V_\alpha).$$

The second equation holds because the probability of a hidden state, given some visible state, must be the same in equilibrium whether the visible units arrived at that state by free-running or were clamped there. Hence,

$$P^-(V_\alpha \wedge H_\beta) = \frac{P^+(V_\alpha \wedge H_\beta)}{P^+(V_\alpha)}\,P^-(V_\alpha).$$
Also,

$$\sum_\alpha P^+(V_\alpha) = 1.$$
Therefore,

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T}\left[\,p_{ij}^+ - p_{ij}^-\,\right]$$

where

$$p_{ij}^+ = \sum_{\alpha\beta} P^+(V_\alpha \wedge H_\beta)\,s_i^{\alpha\beta} s_j^{\alpha\beta} \qquad\text{and}\qquad p_{ij}^- = \sum_{\lambda\mu} P^-(V_\lambda \wedge H_\mu)\,s_i^{\lambda\mu} s_j^{\lambda\mu}.$$
The Boltzmann machine learning algorithm can also be formulated as an input-output model. The visible units are divided into an input set $I$ and an output set $O$, and an environment specifies a set of conditional probabilities of the form $P^+(O_\beta \mid I_\alpha)$. During phase+ the environment
clamps both the input and output units, and the $p_{ij}^+$ are estimated. During phase- the input units are clamped and the output units and hidden units free-run, and the $p_{ij}^-$ are estimated. The appropriate G measure in this case is
$$G = \sum_{\alpha\beta} P^+(I_\alpha \wedge O_\beta)\,\ln\frac{P^+(O_\beta \mid I_\alpha)}{P^-(O_\beta \mid I_\alpha)}.$$
Similar mathematics apply in this formulation and $\partial G/\partial w_{ij}$ is the same as before.
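The final result can be verified numerically on a network small enough to enumerate. The sketch below is our own check, assuming an arbitrary clamped distribution over two visible units: It computes $G$, $p_{ij}^+$, and $p_{ij}^-$ exactly and compares a finite-difference derivative of $G$ against $-(p_{ij}^+ - p_{ij}^-)/T$.

```python
import itertools, math, random

random.seed(1)
N_VIS, N_HID, T = 2, 2, 1.0
N = N_VIS + N_HID
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
w = {p: random.uniform(-1, 1) for p in pairs}

# an arbitrary clamped distribution over the four visible states
P_plus = {v: p for v, p in zip(itertools.product([0, 1], repeat=N_VIS),
                               [0.1, 0.2, 0.3, 0.4])}

def energy(s, w):
    return -sum(w[i, j] * s[i] * s[j] for i, j in w)

def free_dist(w):
    # Boltzmann distribution over all 2**N global states.
    states = list(itertools.product([0, 1], repeat=N))
    z = {s: math.exp(-energy(s, w) / T) for s in states}
    total = sum(z.values())
    return {s: p / total for s, p in z.items()}

def G(w):
    P = free_dist(w)
    P_minus = {v: sum(p for s, p in P.items() if s[:N_VIS] == v) for v in P_plus}
    return sum(p * math.log(p / P_minus[v]) for v, p in P_plus.items())

def p_corr(w):
    # p+ij: visibles drawn from P+, hiddens at equilibrium given the visibles
    # (the conditional equals the free-running conditional, as shown above).
    # p-ij: everything free-running.
    P = free_dist(w)
    P_minus = {v: sum(p for s, p in P.items() if s[:N_VIS] == v) for v in P_plus}
    plus, minus = {pr: 0.0 for pr in w}, {pr: 0.0 for pr in w}
    for s, p in P.items():
        v = s[:N_VIS]
        for (i, j) in w:
            minus[i, j] += p * s[i] * s[j]
            plus[i, j] += P_plus[v] * (p / P_minus[v]) * s[i] * s[j]
    return plus, minus

plus, minus = p_corr(w)
pair, eps = (0, 3), 1e-5
w2 = dict(w); w2[pair] += eps
numeric = (G(w2) - G(w)) / eps
analytic = -(plus[pair] - minus[pair]) / T
print(numeric, analytic)   # the two agree to within O(eps)
```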
CHAPTER 8

Learning Internal Representations by Error Propagation
THE PROBLEM
We now have a rather good understanding of simple two-layer associative networks in which a set of input patterns arriving at an input layer
are mapped directly to a set of output patterns at an output layer. Such
networks have no hidden units . They involve only input and output
units. In these cases there is no internal representation. The coding provided by the external world must suffice. These networks have proved
useful in a wide variety of applications (cf . Chapters 2, 17, and 18) .
Perhaps the essential character of such networks is that they map simi lar input patterns to similar output patterns. This is what allows these
networks to make reasonable generalizations and perform reasonably on
patterns that have never before been presented. The similarity of patterns in a PDP system is determined by their overlap. The overlap in
such networks is determined outside the learning system itself - by
whatever produces the patterns.
The constraint that similar input patterns lead to similar outputs can
lead to an inability of the system to learn certain mappings from input
to output . Whenever the representation provided by the outside world
is such that the similarity structures of the input and output patterns are very different, a network without internal representations (i.e., a
TABLE 1

Input Patterns    Output Patterns
00    →    0
01    →    1
10    →    1
11    →    0
TABLE 2

Input Patterns    Output Patterns
000    →    0
010    →    1
100    →    1
111    →    0
FIGURE 1. A multilayer network. In this case the information coming to the input units is recoded into an internal representation and the outputs are generated by the internal representation rather than by the original pattern. Input patterns can always be encoded, if there are enough hidden units, in a form so that the appropriate output pattern can be generated from any input pattern.
FIGURE 2. A simple XOR network with one hidden unit. See text for explanation.
hidden units. There have been three basic responses to this lack.
The learning procedure we propose involves the presentation of a set of pairs of input and output patterns. The system first uses the input vector to produce its own output vector and then compares this with the desired output, or target, vector. If there is no difference, no learning takes place. Otherwise the weights are changed to reduce the difference. In the case with no hidden units, this generates the standard delta rule as described in Chapters 2 and 11. The rule for changing weights following presentation of input/output pair $p$ is given by

$$\Delta_p w_{ji} = \eta\,(t_{pj} - o_{pj})\,i_{pi} = \eta\,\delta_{pj}\,i_{pi} \tag{1}$$

where $t_{pj}$ is the target input for the $j$th component of the output pattern for pattern $p$, $o_{pj}$ is the $j$th element of the actual output pattern produced by presentation of input pattern $p$, $i_{pi}$ is the value of the $i$th element of the input pattern, $\delta_{pj} = t_{pj} - o_{pj}$, and $\Delta_p w_{ji}$ is the change to be made to the weight from the $i$th to the $j$th unit following presentation of pattern $p$.
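Equation 1 translates directly into code. The sketch below, ours rather than the authors', performs one learning trial for a two-layer linear network; all names in it are hypothetical.

```python
def present_pattern(w, inp, target, eta=0.1):
    """One presentation of an input/output pair for a two-layer linear
    network with no hidden units (Equation 1).  w[j][i] is the weight
    from input unit i to output unit j."""
    # The system first produces its own output vector from the input.
    out = [sum(wji * ipi for wji, ipi in zip(row, inp)) for row in w]
    # Then each weight is changed in proportion to the difference.
    for j, row in enumerate(w):
        delta = target[j] - out[j]           # delta_pj = t_pj - o_pj
        for i, ipi in enumerate(inp):
            row[i] += eta * delta * ipi      # Delta_p w_ji = eta delta_pj i_pi
    return w
```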
The delta rule and gradient descent. There are many ways of deriv ing this rule. For present purposes, it is useful to see that for linear
units it minimizes the squares of the differences between the actual and
the desired output values summed over the output units and all pairs of
input / output vectors. One way to show this is to show that the derivative of the error measure with respect to each weight is proportional to
the weight change dictated by the delta rule , with negative constant of
proportionality . This corresponds to performing steepest descent on a
surface in weight space whose height at any point in weight space is
equal to the error measure. (Note that some of the following sections
are written in italics. These sections constitute informal derivations of
the claims made in the surrounding text and can be omitted by the
reader who finds such derivations tedious.)
To be more specific, then, let

$$E_p = \frac{1}{2}\sum_j (t_{pj} - o_{pj})^2 \tag{2}$$

be our measure of the error on input/output pattern $p$, and let $E = \sum_p E_p$ be our overall measure of the error. We wish to show that

$$-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj}\, i_{pi},$$
which is proportional to $\Delta_p w_{ji}$ as prescribed by the delta rule. When there are no hidden units it is straightforward to compute the relevant derivative. For this purpose we use the chain rule to write the derivative as the product of two parts: the derivative of the error with respect to the output of the unit times the derivative of the output with respect to the weight.
$$\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial o_{pj}}\,\frac{\partial o_{pj}}{\partial w_{ji}}. \tag{3}$$
The first part tells how the error changes with the output of the $j$th unit and the second part tells how much changing $w_{ji}$ changes that output. Now, the derivatives are easy to compute. First, from Equation 2,
$$\frac{\partial E_p}{\partial o_{pj}} = -(t_{pj} - o_{pj}) = -\delta_{pj}. \tag{4}$$
Second, since in the linear case

$$o_{pj} = \sum_i w_{ji}\, i_{pi}, \tag{5}$$

we have

$$\frac{\partial o_{pj}}{\partial w_{ji}} = i_{pi}.$$

Thus,

$$-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj}\, i_{pi}. \tag{6}$$
Combining this with the observation that

$$\frac{\partial E}{\partial w_{ji}} = \sum_p \frac{\partial E_p}{\partial w_{ji}}$$
should lead us to conclude that the net change in $w_{ji}$ after one complete cycle of pattern presentations is proportional to this derivative and hence that the delta rule implements a gradient descent in $E$. In fact, this is strictly true only if the values of the weights are not changed during this cycle. By changing the weights after each pattern is presented we depart to some extent from a true gradient descent in $E$. Nevertheless, provided the learning rate (i.e., the constant of proportionality) is sufficiently small, this departure will be negligible and the delta rule will implement a very close approximation to gradient descent in sum-squared error. In particular, with small enough learning rate, the delta rule will find a set of weights minimizing this error function.
The analysis extends to the case of semilinear activation functions, in which the output of a unit is a nondecreasing and differentiable function of the net total input,

$$net_{pj} = \sum_i w_{ji}\, o_{pi}, \tag{7}$$

where $o_{pi} = i_{pi}$ if unit $i$ is an input unit. Thus, a semilinear activation function is one in which

$$o_{pj} = f_j(net_{pj}) \tag{8}$$

and $f$ is differentiable and nondecreasing. (The linear threshold function does not qualify, because its derivative is infinite at the threshold and zero elsewhere.) To get the correct generalization of the delta rule, we must set

$$\Delta_p w_{ji} \propto -\frac{\partial E_p}{\partial w_{ji}},$$

where $E$ is defined as before. As in the standard delta rule, it is useful to see this derivative as the product of two parts: one reflecting the change in error as a function of the change in the net input to the unit, and one representing the effect of changing a particular weight on the net input. Thus we can write

$$\frac{\partial E_p}{\partial w_{ji}} = \frac{\partial E_p}{\partial net_{pj}}\,\frac{\partial net_{pj}}{\partial w_{ji}}. \tag{9}$$

By Equation 7 we see that the second factor is

$$\frac{\partial net_{pj}}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\sum_k w_{jk}\, o_{pk} = o_{pi}. \tag{10}$$

Now define

$$\delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}}.$$
(By comparing this to Equation 4, note that this is consistent with the definition of $\delta_{pj}$ used in the original delta rule for linear units, since $o_{pj} = net_{pj}$ when unit $u_j$ is linear.) Equation 9 thus has the equivalent form

$$-\frac{\partial E_p}{\partial w_{ji}} = \delta_{pj}\, o_{pi}.$$
This says that to implement gradient descent in $E$ we should make our weight changes according to
$$\Delta_p w_{ji} = \eta\, \delta_{pj}\, o_{pi}, \tag{11}$$
just as in the standard delta rule. The trick is to figure out what $\delta_{pj}$ should be for each unit $u_j$ in the network. The interesting result, which we now derive, is that there is a simple recursive computation of these $\delta$'s which can be implemented by propagating error signals backward through the network.

To compute $\delta_{pj} = -\partial E_p/\partial net_{pj}$, we apply the chain rule to write this partial derivative as the product of two factors, one factor reflecting the change in error as a function of the output of the unit and one reflecting the change in the output as a function of changes in the input. Thus, we have

$$\delta_{pj} = -\frac{\partial E_p}{\partial net_{pj}} = -\frac{\partial E_p}{\partial o_{pj}}\,\frac{\partial o_{pj}}{\partial net_{pj}}. \tag{12}$$

Let us compute the second factor. By Equation 8 we see that

$$\frac{\partial o_{pj}}{\partial net_{pj}} = f_j'(net_{pj}), \tag{13}$$
(13)
for
the
weight
tutes
the
These
eralized
of
amount
computing
is
not
an
the
changes
in
generalized
all
the
network
be
rule
units
Equations
in
13
the
for
has
the
to
the
product
11
of
three
same
as
line
of
recursive
an
used
This
semi
the
to
linear
consti
units
First
the
gen
delta
changed
available
rule
by
standard
be
signal
compute
procedure
should
error
procedure
then
equations
form
each
give
are
network
in
on
14
which
Equation
afeetiforward
exactly
weight
to
summarized
The
and
network
according
rule
can
delta
proportional
unit
for
delta
results
Equation
output
' s
(14)
an
to
8. LEARNING
INTERNAL
REPRESENTATIONS
327
the unit receiving input along that line and the output of the unit sending activation along that line. In symbols,

$$\Delta_p w_{ji} = \eta\, \delta_{pj}\, o_{pi}. \tag{14}$$
The other two equations specify the error signal. Essentially, the determination of the error signal is a recursive process which starts with the output units. If a unit is an output unit, its error signal is very similar to that of the standard delta rule. It is given by

$$\delta_{pj} = (t_{pj} - o_{pj})\, f_j'(net_{pj})$$

whenever $u_j$ is an output unit. The error signal for a unit that is not an output unit is determined recursively from the error signals of the units to which it directly connects:

$$\delta_{pj} = f_j'(net_{pj}) \sum_k \delta_{pk}\, w_{kj}$$

whenever the unit is not an output unit.
The application of the generalized delta rule, thus, involves two phases: During the first phase the input is presented and propagated forward through the network to compute the output value $o_{pj}$ for each unit. This output is then compared with the targets, resulting in an error signal $\delta_{pj}$ for each output unit. The second phase involves a backward pass through the network (analogous to the initial forward pass) during which the error signal is passed to each unit in the network and the appropriate weight changes are made. This second, backward pass allows the recursive computation of $\delta$ as indicated above. The first step is to compute $\delta$ for each of the output units. This is simply the difference between the actual and desired output values times the derivative of the squashing function. We can then compute weight changes for all connections that feed into the final layer. After this is done, then compute $\delta$'s for all units in the penultimate layer. This propagates the errors back one layer, and the same process can be repeated for every layer. The backward pass has the same computational complexity as the forward pass, and so it is not unduly expensive.
We have now generated a gradient descent method for finding
weights in any feedforward network with semilinear units. Before
reporting our results with these networks, it is useful to note some
further observations. It is interesting that not all weights need be variable. Any number of weights in the network can be fixed . In this
case, error is still propagated as before; the fixed weights are simply not
modified . It should also be noted that there is no reason why some
output units might not receive inputs from other output units in earlier
layers. In this case, those units receive two different kinds of error :
that from the direct comparison with the target and that passedthrough
the other output units whose activation it affects. In this case, the
correct procedure is to simply add the weight changes dictated by the
direct comparison to that propagated back from the other output units .
SIMULATION RESULTS
We now have a learning procedure which could, in principle , evolve
a set of weights to produce an arbitrary mapping from input to output .
However, the procedure we have produced is a gradient descent procedure and, as such, is bound by all of the problems of any hill-climbing procedure, namely, the problem of local maxima or (in our case)
minima . Moreover , there is a question of how long it might take a system to learn. Even if we could guarantee that it would eventually find
a solution , there is the question of whether our procedure could learn
in a reasonable period of time . It is interesting to ask what hidden
units the system actually develops in the solution of particular problems. This is the question of what kinds of internal representations the
system actually creates. We do not yet have definitive answers to these
questions. However, we have carried out many simulations which lead
us to be optimistic about the local minima and time questions and to be
surprised by the kinds of representations our learning mechanism discovers. Before proceeding with our results, we must describe our simulation system in more detail. In particular, we must specify an activation function and show how the system can compute the derivative of
this function .
A useful activation function. In our above derivations the derivative of the activation function of unit $u_j$, $f_j'(net_j)$, always played a role.
This implies that we need an activation function for which a derivative
exists. It is interesting to note that the linear threshold function , on
which the perceptron is based, is discontinuous and hence will not suffice for the generalized delta rule . Similarly , since a linear system
achieves no advantage from hidden units , a linear activation function
will not suffice either . Thus, we need a continuous , nonlinear activation function . In most of our experiments we have used the logistic
activation function in which
$$o_{pj} = \frac{1}{1 + e^{-\left(\sum_i w_{ji}\, o_{pi} + \theta_j\right)}} \tag{15}$$

where $\theta_j$ is a bias similar in function to a threshold. In this case the derivative takes a particularly simple form:

$$\frac{\partial o_{pj}}{\partial net_{pj}} = o_{pj}\,(1 - o_{pj}).$$
Thus, for the logistic activation function, the error signal, $\delta_{pj}$, for an output unit is given by

$$\delta_{pj} = (t_{pj} - o_{pj})\, o_{pj}\,(1 - o_{pj}),$$

and the error signal for an arbitrary hidden unit $u_j$ is given by

$$\delta_{pj} = o_{pj}\,(1 - o_{pj}) \sum_k \delta_{pk}\, w_{kj}.$$
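Taken together, the forward pass, these two error-signal formulas, and the weight-update rule of Equation 11 form a complete procedure. The following Python sketch is our minimal rendering for a single hidden layer of logistic units, applied to XOR; the learning rate, initial weight range, and number of sweeps are arbitrary illustrative choices of ours, and a run can occasionally settle into the local minimum discussed later in this section.

```python
import math, random

random.seed(0)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_h, w_o, inp):
    # Last entry in each weight row is the bias (a weight from a unit
    # that is always on).
    hid = [logistic(sum(wi * x for wi, x in zip(row[:-1], inp)) + row[-1])
           for row in w_h]
    out = [logistic(sum(wi * h for wi, h in zip(row[:-1], hid)) + row[-1])
           for row in w_o]
    return hid, out

def backprop_step(w_h, w_o, inp, target, eta=0.25):
    hid, out = forward(w_h, w_o, inp)
    # Output-unit error signals: delta = (t - o) o (1 - o).
    d_out = [(t - o) * o * (1 - o) for t, o in zip(target, out)]
    # Hidden-unit error signals: delta = o (1 - o) sum_k delta_k w_kj.
    d_hid = [h * (1 - h) * sum(d * w_o[k][j] for k, d in enumerate(d_out))
             for j, h in enumerate(hid)]
    for k, row in enumerate(w_o):                # Equation 11
        for j in range(len(hid)):
            row[j] += eta * d_out[k] * hid[j]
        row[-1] += eta * d_out[k]                # bias: its input is always 1
    for j, row in enumerate(w_h):
        for i in range(len(inp)):
            row[i] += eta * d_hid[j] * inp[i]
        row[-1] += eta * d_hid[j]

# XOR with two hidden units, starting from small random weights
w_h = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_o = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
patterns = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
for sweep in range(20000):
    for inp, target in patterns:
        backprop_step(w_h, w_o, inp, target)
for inp, target in patterns:
    print(inp, round(forward(w_h, w_o, inp)[1][0], 2))
```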
It should be noted that the derivative $o_{pj}(1 - o_{pj})$ reaches its maximum for $o_{pj} = 0.5$ and approaches zero as $o_{pj}$ approaches zero or one. Since the amount of weight change is proportional to this derivative, weights are changed most for units that are in their midrange and, in some sense, not yet committed to being either on or off. This feature, we believe, contributes to the stability of the learning of the system.

One other feature of this activation function should be noted. The system cannot actually achieve its extreme output values of 1 or 0 without infinitely large weights, so in a practical learning situation in which the desired outputs are binary, the system can never actually reach the values sought. We therefore typically use the values 0.1 and 0.9 as targets, even though we will talk as if the values 0 and 1 are sought.

The learning rate. Our learning procedure requires only that the change in a weight be proportional to $-\partial E_p/\partial w$.1 True gradient descent requires that infinitesimal steps be taken. The constant of proportionality is the learning rate in our procedure: The larger this constant, the larger the changes in the weights. For practical purposes we choose a learning rate that is as large as possible without leading to oscillation.

1 Note that the biases can be learned just like any other weights: We simply imagine that a bias is the weight on a line from a unit that is always on.
This offers the most rapid learning . One way to increase the learning
rate without
leading to oscillation
is to modify the generalized delta rule
to include a momentum term . This can be accomplished
by the following rule:
$$\Delta w_{ji}(n+1) = \eta\,\delta_{pj}\, o_{pi} + \alpha\,\Delta w_{ji}(n) \tag{16}$$

where the subscript $n$ indexes the presentation number and $\alpha$ is a constant that determines the effect of past weight changes on the current direction of movement in weight space. Without a momentum term it is necessary to take very small steps to avoid oscillation, and this causes slow learning; the system typically learns much faster with larger values of $\alpha$ than by setting $\alpha = 0$ and reducing the overall size of the steps.
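In code, Equation 16 amounts to remembering the previous weight changes. The sketch below is ours; the value $\alpha = 0.9$ is a commonly used setting that we assume for illustration, not one taken from the surrounding text.

```python
def momentum_step(w, dw_prev, grads, eta=0.25, alpha=0.9):
    """One application of Equation 16.  grads[k] holds delta_pj * o_pi
    for the weight keyed by k; dw_prev holds the previous changes."""
    dw = {k: eta * g + alpha * dw_prev.get(k, 0.0) for k, g in grads.items()}
    for k, change in dw.items():
        w[k] += change
    return w, dw     # dw becomes dw_prev on the next presentation
```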
Symmetry
breaking . Our learning procedure
has one more problem
that can be readily overcome
and this is the problem
of symmetry
breaking . If all weights start out with equal values and if the solution
requires that unequal weights be developed , the system can never learn .
This is because error is propagated back through the weights in proportion to the values of the weights. This means that all hidden units connected directly to the output units will get identical error signals, and, since the weight changes depend on the error signals, the weights from those units to the output units will always be identical; the system therefore receives no information, in return, that could make the hidden units differ from one another. We counteract this problem by starting the system with small random weights. Under these conditions symmetry problems of this kind do not arise.
of
8. LEARNING
INTERNAL
REPRFSENT
AnONS 331
problems involve an XOR as a subproblem. We have run the XOR
problem many times and with a couple of exceptions discussed below,
the system has always solved the problem . Figure 3 shows one of the
solutions to the problem. This solution was reached after 558 sweeps
through the four stimulus patterns with a learning rate of η = 0.5. In
this case, both the hidden unit and the output unit have positive biases
so they are on unless turned off . The hidden unit turns on if neither
input unit is on. When it is on, it turns off the output unit . The connections from input to output units arranged themselves so that they
turn off the output unit whenever both inputs are on. In this case, the
network has settled to a solution which is a sort of mirror image of the
one illustrated in Figure 2.
We have taught the system to solve the XOR problem hundreds of
times. Sometimes we have used a single hidden unit and direct connections to the output unit as illustrated here, and other times we have
allowed two hidden units and set the connections from the input units
to the outputs to be zero, as shown in Figure 4. In only two cases has
the system encountered a local minimum and thus been unable to solve
the problem. Both cases involved the two hidden units version of the
FIGURE 3. Observed XOR network. The connection weights are written on the arrows and the biases are written in the circles. Note a positive bias means that the unit is on unless turned off.
FIGURE 4. A simple architecture for solving XOR with two hidden units and no direct connections from input to output.
FIGURE 5. A network at a local minimum for the exclusive-or problem. The dotted lines indicate negative weights. Note that whenever the right-most input unit is on it turns on both hidden units. The weights connecting the hidden units to the output are arranged so that when both hidden units are on, the output unit gets a net input of zero. This leads to an output value of 0.5. In the other cases the network provides the correct answer.
We also studied the effects of varying the number of hidden units and the learning rate on the time to solve the problem. Using as a learning criterion an error of 0.01 per pattern, Yves found that the average number of presentations to solve the problem with η = 0.25 varied from about ... for two hidden units to about 120 presentations for 32 hidden units. The results can
learning rates larger than this led to unstable behavior. However,
within this range larger learning rates speeded the learning substantially.
In most of our problems we have employed learning rates of η = 0.25
or smaller and have had no difficulty .
Parity
One of the problems given a good deal of discussion by Minsky and
Papert ( 1969) is the parity problem, in which the output required is 1 if
the input pattern contains an odd number of 1s and 0 otherwise. This
is a very difficult problem because the most similar patterns (those
which differ by a single bit) require different answers. The XOR problem is a parity problem with input patterns of size two . We have tried a
number of parity problems with patterns ranging from size two to eight.
Generally we have employed layered networks in which direct connections from the input to the output units are not allowed, but must be
mediated through a set of hidden units . In this architecture, it requires
at least N hidden units to solve parity with patterns of length N . Figure 6 illustrates the basic paradigm for the solutions discovered by the
system. The solid lines in the figure indicate weights of + 1 and the
dotted lines indicate weights of - 1. The numbers in the circles
represent the biases of the units. Basically, the hidden units arranged themselves so that the mth hidden unit comes on whenever at least m of the input units are on, and successive hidden units connect to the output unit with weights of alternating sign, so that the output comes on only for an odd number of active inputs.

FIGURE 6. A paradigm for the solutions to the parity problem discovered by the learning system. See text for explanation.
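A short script can confirm that the paradigm computes parity. This is our rendering of the idealized solution, with unit thresholds written as comparisons; an actual trained network only approximates these weights.

```python
from itertools import product

def parity_net(bits):
    # Hidden unit m comes on when at least m inputs are on; the output
    # weights from successive hidden units alternate +1, -1, +1, ...
    n_on = sum(bits)
    hidden = [1 if n_on >= m else 0 for m in range(1, len(bits) + 1)]
    net = sum(h * (1 if m % 2 == 0 else -1) for m, h in enumerate(hidden))
    return 1 if net > 0.5 else 0          # threshold of 0.5

# With k inputs on, the first k hidden units are on and the net input is
# 1 for odd k and 0 for even k, so the output is exactly the parity.
assert all(parity_net(p) == sum(p) % 2 for p in product([0, 1], repeat=4))
```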
The Encoding Problem
Ackley , Hinton , and Sejnowski ( 1985) have posed a problem in
which a set of orthogonal input patterns are mapped to a set of orthogonal output patterns through a small set of hidden units. In such cases
the internal representations of the patterns on the hidden units must be
rather efficient . Suppose that we attempt to map N input patterns onto
N output patterns. Suppose further that log2N hidden units are provided. In this case, we expect that the system will learn to use the
hidden units to form a binary code with a distinct pattern of activation for each of the input patterns.

TABLE 3

Number of On Input Units    Hidden Unit Patterns    Output Value
4    →    1111    →    0
3    →    1011    →    1
2    →    1010    →    0
1    →    0010    →    1
0    →    0000    →    0
--+
8. LEARNING
INTERNAL
REPR
~ ENTATIONS
337
output as in the input . Table 5 shows the mapping generated by our
learning system on this example. It is of some interest that the system
employed its ability to use intermediate values in solving this problem .
It could, of course, have found a solution in which the hidden units
took on only the values of zero and one. Often it does just that , but in
this instance, and many others, there are solutions that use the inter mediate values, and the learning system finds them even though it has
a bias toward extreme values. It is possible to set up problems that
require the system to make use of intermediate values in order to solve
a problem. We now turn to such a case.
Table 6 shows a very simple problem in which we have to convert
from a distributed representation over two units into a local representation
over four units . The similarity structure of the distributed input patterns is simply not preserved in the local output representation.
We presented this problem to our learning system with a number of
constraints which made it especially difficult . The two input units were
only allowed to connect to a single hidden unit which , in turn , was
allowed to connect to four more hidden units . Only these four hidden
units were allowed to connect to the four output units . To solve
this problem, then, the system must first convert the distributed
TABLE 5

Input Patterns      Hidden Unit Patterns    Output Patterns
10000000    →    .5   0   0     →    10000000
01000000    →    1    1   1     →    01000000
00100000    →    1    0   .5    →    00100000
00010000    →    1    0   1     →    00010000
00001000    →    1    0   0     →    00001000
00000100    →    0    0   0     →    00000100
00000010    →    0    1   1     →    00000010
00000001    →    1    .5  .5    →    00000001
TABLE 6

Input Patterns    Output Patterns
00    →    1000
01    →    0100
10    →    0010
11    →    0001
representation of the input patterns into various intermediate values of
the singleton hidden unit in which different activation values
correspond to the different input patterns. These continuous values
must then be converted back through the next layer of hidden units first to another distributed representation and then , finally , to a local
representation. This problem was presented to the system and it
reached a solution after 5,226 presentations with η = 0.05.3 Table 7
shows the sequence of representations the system actually developed in
order to transform the patterns and solve the problem . Note each of
the four input patterns was mapped onto a particular activation value of
the singleton hidden unit . These values were then mapped onto distri buted patterns at the next layer of hidden units which were finally
mapped into the required local representation at the output level . In
principle , this trick of mapping patterns into activation values and then
converting those activation values back into patterns could be done for
any number of patterns, but it becomes increasingly difficult for the
system to make the necessary distinctions as ever smaller differences
among activation values must be distinguished . Figure 8 shows the
network the system developed to do this job . The connection weights
from the hidden units to the output units have been suppressed for
clarity . (The sign of the connection, however, is indicated by the form
of the connection - e.g., dashed lines mean inhibitory connections) .
The four different activation values were generated by having relatively
large weights of opposite sign. One input line turns the hidden unit full
on, one turns it full off . The two differ by a relatively small amount so
that when both turn on, the unit attains a value intermediate between 0
and 0.5. When neither turns on, the near zero bias causes the unit to
attain a value slightly over 0.5. The connections to the second layer of
hidden units is likewise interesting . When the hidden unit is full on,
TABLE 7

Input Patterns    Singleton Hidden Unit    Remaining Hidden Units    Output Patterns
10    →    0     →    1  1  .5  0    →    0010
11    →    .2    →    1  1  0   0    →    0001
00    →    .6    →    1  0  0   0    →    1000
01    →    1     →    0  0  .3  1    →    0100

3 Relatively small learning rates were required to obtain this solution.
FIGURE 8. The network developed for converting the distributed representation over the two input units into a local representation over the four output units. See text for explanation.
the right -most of these hidden units is turned on and all others turned
off . When the hidden unit is turned off , the other three of these hidden units are on and the left -most unit off . The other connections
from the singleton hidden unit to the other hidden units are graded so
that a distinct pattern is turned on for its other two values. Here we
have an example of the flexibility of the learning system.
Our experience is that there is a propensity for the hidden units to
take on extreme values, but , whenever the learning problem calls for it ,
they can learn to take on graded values. It is likely that the propensity
to take on extreme values follows from the fact that the logistic is a sigmoid so that increasing magnitudes of its inputs push it toward zero or
one. This means that in a problem in which intermediate values are
required, the incoming weights must remain of moderate size. It is
interesting that the derivation of the generalized delta rule does not
depend on all of the units having identical activation functions . Thus,
it would be possible for some units , those required to encode informa tion in a graded fashion, to be linear while others might be logistic.
The linear unit would have a much wider dynamic range and could
encode more different values. This would be a useful role for a linear
unit in a network with hidden units.
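The mechanism described above, two large weights of nearly equal and opposite size plus a near-zero bias, can be checked directly. The weights below are illustrative values of ours, not those the simulation found.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

w1, w2, bias = 9.0, -10.0, 0.4      # hypothetical, for illustration
for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, round(logistic(w1 * x1 + w2 * x2 + bias), 2))
# (0,0) -> ~0.6 (slightly over 0.5), (1,0) -> ~1.0 (full on),
# (0,1) -> ~0.0 (full off), (1,1) -> ~0.35 (between 0 and 0.5):
# four distinct activation values from a single logistic unit.
```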
Symmetry
Another interesting problem we studied is that of classifying input
strings as to whether or not they are symmetric about their center. We
used patterns of various lengths with various numbers of hidden units.
To our surprise, we discovered that the problem can always be solved
with only two hidden units. To understand the derived representation,
consider one of the solutions generated by our system for strings of
length six. This solution was arrived at after 1,208 presentations of each
six-bit pattern with η = 0.1. The final network is shown in Figure 9.
For simplicity we have shown the six input units in the center of the
diagram with one hidden unit above and one below. The output unit ,
which signals whether or not the string is symmetric about its center, is
shown at the far right . The key point to see about this solution is that
for a given hidden unit , weights that are symmetric about the middle
are equal in magnitude and opposite in sign. That means that if a symmetric pattern is on, both hidden units will receive a net input of zero
from the input units , and, since the hidden units have a negative bias,
both will be off . In this case, the output unit , having a positive bias,
FIGURE 9. Network for solving the symmetry problem . The six open circles represent
the input units. There are two hidden units, one shown above and one below the input
units. The output unit is shown to the far right . See text for explanation.
will be on. The next most important thing to note about the solution is
that the weights on each side of the midpoint of the string are in the
ratio of 1:2:4. This insures that each of the eight patterns that can
occur on each side of the midpoint sends a unique activation sum to
the hidden unit . This assures that there is no pattern on the left that
will exactly balance a non-mirror-image pattern on the right. Finally,
the two hidden units have identical patterns of weights from the input
units except for sign. This insures that for every nonsymmetric pattern , at least one of the two hidden units will come on and turn on the
output unit . To summarize, the network is arranged so that both hid den units will receive exactly zero activation from the input units when
the pattern is symmetric, and at least one of them will receive positive
input for every nonsymmetric pattern.
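The argument can be verified exhaustively for strings of length six. The sketch below is our encoding of the idealized solution; the trained network realizes the same scheme with larger, approximate weights.

```python
from itertools import product

def symmetry_net(bits):
    # Weights in the ratio 1:2:4, equal magnitude and opposite sign at
    # mirror positions; the second hidden unit sees the negated weights.
    w = [1, 2, 4, -4, -2, -1]
    net = sum(wi * b for wi, b in zip(w, bits))
    h1 = 1 if net > 0.5 else 0        # hidden units have negative biases,
    h2 = 1 if -net > 0.5 else 0       # so they stay off when net is zero
    return 1 if h1 + h2 == 0 else 0   # output on iff both hidden units off

# The 1:2:4 coding makes each half-pattern send a unique sum, so net is
# zero exactly for symmetric strings.
for p in product([0, 1], repeat=6):
    assert symmetry_net(p) == (1 if p[:3] == (p[5], p[4], p[3]) else 0)
```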
This problem was interesting to us because the learning system
developed a much more elegant solution to the problem than we had
previously considered. This problem was not the only one in which this
happened. The parity solution discovered by the learning procedure
was also one that we had not discovered prior to testing the problem
with our learning procedure. Indeed, we frequently discover these
more elegant solutions by giving the system more hidden units than it
needs and observing that it does not make use of some of those provided. Some analysis of the actual solutions discovered often leads us
to the discovery of a better solution involving fewer hidden units.
Addition
Another interesting problem on which we have tested our learning
algorithm is the simple binary addition problem. This problem is
interesting because there is a very elegant solution to it , because it is
the one problem we have found where we can reliably find local
minima and because the way of avoiding these local minima gives us
some insight into the conditions under which local minima may be
found and avoided. Figure 10 illustrates the basic problem and a
minimal solution to it . There are four input units , three output units ,
and two hidden units . The output patterns can be viewed as the binary
representation of the sum of two two-bit binary numbers represented
by the input patterns. The second and fourth input units in the
diagram correspond to the low-order bits of the two binary numbers
and the first and third units correspond to the two higher order bits.
The hidden units correspond to the carry bits in the summation . Thus
the hidden unit on the far right comes on when both of the lower order
bits in the input pattern are turned on, and the one on the left comes
FIGURE 10. Minimal network for adding two two-bit binary numbers. There are four
input units, three output units, and two hidden units. The output patterns can be viewed
as the binary representation of the sum of two two-bit binary numbers represented by the
input patterns. The second and fourth input units in the diagram correspond to the loworder bits of the two binary numbers, and the first and third units correspond to the two
higher order bits. The hidden units correspond to the carry bits in the summation . The
hidden unit on the far right comes on when both of the lower order bits in the input pattern are turned on, and the one on the left comes on when both higher order bits are
turned on or when one of the higher order bits and the other hidden unit is turned on.
The weights on all lines are assumed to be + 1 except where noted. Negative connections are indicated by dashed lines. As usual, the biases are indicated by the numbers in
the circles.
on when both higher order bits are turned on or when one of the
higher order bits and the other hidden unit is turned on. In the
diagram, the weights on all lines are assumed to be + 1 except where
noted. Inhibitory connections are indicated by dashed lines. As usual,
the biases are indicated by the numbers in the circles. To understand
how this network works, it is useful to note that the lowest order output bit is determined by an exclusive-or among the two low-order input
bits. One way to solve this XOR problem is to have a hidden unit
come on when both low-order input bits are on and then have it inhibit
the output unit . Otherwise either of the low-order input units can turn
on the low-order output bit . The middle bit is somewhat more
8. LEARNING
INTERNAL
REPRESENTATIONS
343
difficult . Note that the middle bit should come on whenever an odd
number of the set containing the two higher order input bits and the
lower order carry bit is turned on . Observation will confirm that the
network shown performs that task . The left - most hidden unit receives
inputs from the two higher order bits and from the carry bit . Its bias is
such that it will come on whenever two or more of its inputs are turned
on . The middle output unit receives positive inputs from the same
three units and a negative input of - 2 from the second hidden unit .
This insures that whenever just one of the three are turned on , the
second hidden unit will remain off and the output bit will come on .
Whenever exactly two of the three are on , the hidden unit will turn on
and counteract the two units exciting the output bit , so it will stay off .
Finally , when all three are turned on , the output bit will receive - 2
from its carry bit and + 3 from its other three inputs . The net is posi tive , so the middle unit will be on . Finally , the third output bit should
turn on whenever the second hidden unit is on - that is , whenever
there is a carry from the second bit . Here then we have a minimal net work to carry out the job at hand . Moreover , it should be noted that
the concept behind this network is generalizable to an arbitrary number
of input and output bits . In general , for adding two m bit binary
numbers we will require 2m input units , m hidden units , and m + 1 out put units .
Unfortunately , this is the one problem we have found that reliably
leads the system into local minima . At the start in our learning trials
on this problem we allow any input unit to connect to any output unit
and to any hidden unit . We allow any hidden unit to connect to any
output unit , and we allow one of the hidden units to connect to the
other hidden unit , but , since we can have no loops , the connection in
the opposite direction is disallowed . Sometimes the system will discover
essentially the same network shown in the figure .4 Often , however , the
system ends up in a local minimum . The problem arises when the XOR
problem on the low -order bits is not solved in the way shown in the
diagram . One way it can fail is when the " higher " of the two hidden
units is " selected " to solve the XOR problem . This is a problem
because then the other hidden unit cannot " see" the carry bit and there fore cannot finally solve the problem . This problem seems to stem
from the fact that the learning of the second output bit is always depen dent on learning the first (because information about the carry is necessary to learn the second bit ) and therefore lags behind the learning of
the first bit and has no influence on the selection of a hidden unit to
4 The network is the same except for the highest order bit. The highest order bit is always on whenever three or more of the input units are on. This is always learned first and always learned with direct connections to the input units.
solve the first XOR problem . Thus , about half of the time (in this
problem ) the wrong unit is chosen and the problem cannot be solved .
In this case, the system finds a solution that is correct on every pattern except the 11 + 11 → 110 (3 + 3 = 6) case, in which it misses the carry into the middle bit and gets 11 + 11 → 100 instead. This problem differs from
others we have solved in as much as the hidden units are not " equi potential " here . In most of our other problems the hidden units have
been equipotential , and this problem has not arisen .
It should be noted , however , that there is a relatively simple way out
of the problem - namely , add some extra hidden units . In this case we
can afford
to make a mistake on one or more selections and the system can still solve the problem. For the problem of adding two-bit numbers we have found that the system always solves the problem with one extra hidden unit. With larger numbers it may require two or three more.
For purposes of illustration, we show the results of one of our
runs with three rather than the minimum two hidden units . Figure 11
shows the state reached by the network after 3,020 presentations of
each input pattern and with a learning rate of η = 0.5. For convenience, we show the network in four parts. In Figure 11A we show the connections to and among the hidden units. This figure shows the internal representation generated for this problem. The "lowest" hidden unit turns off whenever it detects either of the low-order bits on; in other words, it turns on only when no low-order bit is on.
The " highest " hidden unit is arranged so that it comes on whenever the
sum
is less than
two . The
conditions
under
which
the middle
hidden
unit comes on are more complex . Table 8 shows the patterns of hidden
units which occur to each of the sixteen input patterns . Figure 11B
shows the connections to the lowest order output unit . Noting that the
relevant hidden unit comes on when neither low -order input unit is on ,
it is clear how the system computes XOR . When both low -order inputs
are off , the output unit is turned off by the hidden unit . When both
low -order input units are on , the output is turned off directly by the
two input units . If just one is on , the positive bias on the output unit
keeps it on . Figure IIC gives the connections to the middle output
unit, and in Figure 11D we show those connections to the left-most, highest order output unit. As can be verified from the figures, the network is balanced so that this works.
It should be pointed out that most of the problems described thus far
have involved hidden units with quite simple interpretations . It is
much more often the case, especially when the number of hidden units
exceeds the minimum number required for the task , that the hidden
units are not readily interpreted . This follows from the fact that there
is very little tendency for localist representations to develop . Typically
FIGURE 11. Network found for the summation problem. A: The connections from the input units to the three hidden units and the connections among the hidden units. B: The connections from the input and hidden units to the lowest order output unit. C: The connections from the input and hidden units to the middle output unit. D: The connections from the input and hidden units to the highest order output unit.
TABLE 8

Input Patterns    Hidden Unit Patterns    Output Patterns
00 + 00    →    111    →    000
00 + 01    →    110    →    001
00 + 10    →    011    →    010
00 + 11    →    010    →    011
01 + 00    →    110    →    001
01 + 01    →    010    →    010
01 + 10    →    010    →    011
01 + 11    →    000    →    100
10 + 00    →    011    →    010
10 + 01    →    010    →    011
10 + 10    →    001    →    100
10 + 11    →    000    →    101
11 + 00    →    010    →    011
11 + 01    →    000    →    100
11 + 10    →    000    →    101
11 + 11    →    000    →    110
The Negation Problem
Consider a situation in which the input to a system consists of patterns of n+ 1 binary values and an output of n values. Suppose further
that the general rule is that n of the input units should be mapped
directly to the output patterns. One of the input bits, however, is special. It is a negation bit . When that bit is off , the rest of the pattern is
supposed to map straight through , but when it is on, the complement
of the pattern is to be mapped to the output . Table 9 shows the
appropriate mapping. In this case the left element of the input pattern
is the negation bit , but the system has no way of knowing this and
must learn which bit is the negation bit . In this case, weights were
allowed from any input unit to any hidden or output unit and from any
hidden unit to any output unit . The system learned to set all of the
weights to zero except those shown in Figure 12. The basic structure
of the problem and of the solution is evident in the figure . Clearly the
problem was reduced to a set of three XORs between the negation bit
TABLE 9

Input Patterns    Output Patterns
0000    →    000
0001    →    001
0010    →    010
0011    →    011
0100    →    100
0101    →    101
0110    →    110
0111    →    111
1000    →    111
1001    →    110
1010    →    101
1011    →    100
1100    →    011
1101    →    010
1110    →    001
1111    →    000
and each input. In the case of the two right-most input units, the XOR problems were solved by recruiting a hidden unit to detect the case in which neither the negation unit nor the corresponding input unit was on. In the third case, the hidden unit detects the case in which both the negation unit and the relevant input were on. In this case the problem was solved in fewer than 5,000 passes through the stimulus set with η = 0.25.
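The mapping of Table 9 is easy to state procedurally. The following sketch, an illustration rather than part of the original simulations, generates the full input-output table:

```python
# A minimal sketch of the negation mapping of Table 9: the left-most
# input bit is the negation bit; the remaining bits map straight
# through when it is 0 and are complemented when it is 1.
from itertools import product

for bits in product((0, 1), repeat=4):
    neg, rest = bits[0], bits[1:]
    target = tuple(1 - b for b in rest) if neg else rest
    print(bits, "->", target)
```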
FIGURE 12. The solution discovered for the negation problem. The left-most unit is the negation unit. The problem has been reduced and solved as three exclusive-ors between the negation unit and each of the other three units.
FIGURE 13. The stimulus set for the T-C problem: a block T and a block C, each in four orientations. One of the eight patterns is presented on each trial.

5 Terry Sejnowski pointed out to us that the T-C problem was difficult for models of
FIGURE 14. The network for solving the T-C problem. See text for explanation.
this way, the whole field of hidden units consists simply of replications of a single feature detector centered on different regions of the input space, and the learning that occurs in one part of the field is automatically shared by the rest. Consider first the receptive field shown in Figure 15A. Since a T can extend into the on-center and achieve a net input of +1, this detector will be turned on for a T at any orientation. On the other hand, any C extending into the center must cover at least two inhibitory cells. With this detector the bias can be set so that only one of the whole field of hidden units will come on whenever a T is presented and none of the hidden units will be turned on by any C. This is a kind of protrusion detector which differentiates between a T and a C by detecting the protrusion of the T.
The receptive field shown in Figure 15B is again a kind of T detector. Every T activates one of the hidden units by an amount +2, and none of the hidden units receives more than +1 from any C. As shown in the figure, T's at 90° and 270° send a total of +2 to the hidden units on which the crossbar lines up. In the remaining receptive field, a T turns off 21 of the inhibitory connections whereas a C turns off only 20. This is a truly distributed representation of the difference between T's and C's.
7 The ratios of the weights are about right. The actual values can be larger or smaller than the values given in the figure.
FIGURE 15. Receptive fields of the hidden units found in different simulations of the T-C problem, including an on-center-off-surround detector, a bar detector, and a compactness detector.

In each of our simulations, the system reached one of these solutions after typically 5,000 to 10,000 presentations of the eight patterns.8

8 Since translation independence was built into the learning procedure, it makes no difference where in the input field a pattern is presented: the same thing will be learned wherever it occurs. Thus, there are only eight distinct patterns to be presented to the system.
SOME FURTHER GENERALIZATIONS

We have intensively studied the learning characteristics of the generalized delta rule on feedforward networks with semilinear activation functions. Interestingly, these are not the most general cases to which the learning procedure is applicable. As yet we have studied only a few examples of the more fully generalized system, but it is relatively easy to apply the same learning rule to sigma-pi units and to recurrent networks. We will simply sketch the basic ideas here.
The Generalized Delta Rule and Sigma-Pi Units

It will be recalled from Chapter 2 that in the case of sigma-pi units we have

o_j = f_j ( Σ_i w_ji Π_k o_i_k )    (17)

where i varies over the set of conjuncts feeding into unit j and k varies over the elements of those conjuncts. For simplicity of exposition, we restrict ourselves to the case in which no conjuncts involve more than two elements. In this case we can notate the weight from the conjunction of units i and j to unit k by w_kij. The weight on the direct connection from unit i to unit j would thus be w_jii, and since the relation is multiplicative, w_kij = w_kji. We can now rewrite Equation 17 as

o_k = f_k ( Σ_{i,j} w_kij o_i o_j ).

We now set

Δ_p w_kij ∝ − ∂E_p / ∂w_kij.

Taking the derivative and simplifying, we get a rule for sigma-pi units strictly analogous to the rule for semilinear activation functions:

Δ_p w_kij = δ_k o_i o_j.

We can see the correct form of the error signal, δ, for this case by inspecting Figure 16. Consider the appropriate value of δ_i for unit u_i in the figure. As before, the correct value of δ_i is given by the sum of the δ's for all of the units into which u_i feeds, weighted by the amount of effect due to the activation of u_i, times the derivative of the activation function. In the case of semilinear functions, the measure of a unit's effect on another unit is given simply by the weight w connecting the first unit to the second. In this case, u_i's effect on u_k depends not only on w_kij, but also on the value of o_j. Thus, we have

δ_i = f_i′(net_i) Σ_{k,j} δ_k w_kij o_j

if u_i is not an output unit and, as before,

δ_i = f_i′(net_i) (t_i − o_i)

if it is an output unit.
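The following is a minimal sketch of this rule, assuming a single layer of sigma-pi output units with two-element conjuncts and logistic activation functions; the sizes, names, and random values are illustrative.

```python
# Sketch of the generalized delta rule for sigma-pi units with
# two-element conjuncts (an illustration, not the original code).
import numpy as np

def f(net):                                   # logistic activation
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(0)
n_in, n_out, eta = 4, 2, 0.25
W = rng.normal(0, 0.3, size=(n_out, n_in, n_in))   # w[k, i, j]
W = (W + W.transpose(0, 2, 1)) / 2                  # enforce w_kij = w_kji

o_in = rng.random(n_in)
net = np.einsum('kij,i,j->k', W, o_in, o_in)        # sum of products
o_out = f(net)

t = np.array([1.0, 0.0])                            # targets
delta = o_out * (1 - o_out) * (t - o_out)           # f'(net) (t - o)
W += eta * np.einsum('k,i,j->kij', delta, o_in, o_in)  # dw_kij = eta*delta_k*o_i*o_j
```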
FIGURE 16. The generalized delta rule for sigma-pi units. The products of activation values of individual units activate output units. See text for explanation of how the δ values are computed in this case.
Recurrent Nets

We have thus far restricted ourselves to feedforward nets. This may seem like a substantial restriction, but as Minsky and Papert point out, for every recurrent network there is a feedforward network with identical behavior (over a finite period of time).9 We will now indicate how this construction can proceed and thereby show the correct form of the learning rule for recurrent networks. Suppose we subscript each unit with its unit number and with the time step it represents, creating one copy of every unit for every time step and repeating the connections of the recurrent network between the copies at successive time steps. Then, as long as we constrain the weights at each level of the resulting network to be the same, we have a feedforward network which performs identically with the recurrent network of Figure 17A.
9 Note that in this discussion, and indeed in our entire development here, we have
assumed a discrete time system with synchronous update and with each connection
involving a unit delay.
FIGURE 17. A comparison of a recurrent network and a feedforward network with
identical behavior . A: A completely connected recurrent network with two units. B: A
feedforward network which behaves the same as the recurrent network . In this case, we
have a separate unit for each time step and we require that the weights connecting each
layer of units to the next be the same for all layers. Moreover , they must be the same as
the analogous weights in the recurrent case.
The appropriate method for maintaining the constraint that all weights
be equal is simply to keep track of the changes dictated for each weight
at each level and then change each of the weights according to the sum
of these individually prescribed changes. Now , the general rule for
determining the change prescribed for a weight in the system for a particular time is simply to take the product of an appropriate error
measure δ and the input along the relevant line, both for the appropriate times. Thus, the problem of specifying the correct learning rule for recurrent networks is simply one of determining the appropriate value of δ for each time. In a feedforward network we determine δ by multiplying the derivative of the activation function by the sum of the δ's for those units it feeds into, weighted by the connection strengths. The same process works for the recurrent network, except in this case the value of δ associated with a particular unit changes in time as a unit passes error back, sometimes to itself. After each iteration, as error is being passed back through the network, the change in weight for that iteration must be added to the weight changes specified by the preceding iterations and the sum stored. This process of passing error through the network should continue for a number of iterations equal to the number of iterations through which the activation was originally passed. At this point, the appropriate changes to all of the weights can be made.
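The bookkeeping described in this paragraph can be sketched as follows, assuming discrete time, synchronous update, logistic units, and error injected only at the final step. This is an illustrative reading of the procedure, not the authors' code.

```python
# Sketch of backpropagation through time with tied weights: run the
# recurrent net forward for T steps, then pass error back for T
# iterations, summing the dictated weight changes before applying them.
import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(1)
n, T, eta = 3, 4, 0.25
W = rng.normal(0, 0.3, size=(n, n))     # the same weights reused at every step

o = [rng.random(n)]                      # each unit records its activation
for _ in range(T):                       # history during the forward pass
    o.append(f(W @ o[-1]))

target = np.roll(o[0], -2)               # e.g., a shifted copy of the input
delta = o[-1] * (1 - o[-1]) * (target - o[-1])

dW = np.zeros_like(W)
for t in range(T - 1, -1, -1):           # pass error back through time
    dW += eta * np.outer(delta, o[t])    # change dictated at this level
    delta = (W.T @ delta) * o[t] * (1 - o[t])

W += dW                                  # apply the summed changes
```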
In general, the procedure for a recurrent network is that an input
(generally a sequence) is presented to the system while it runs for some
number of iterations . At certain specified times during the operation of
the system, the outputs of certain units are compared to the target for that unit at that time and error signals are generated. Each such error
signal is then passed back through the network for a number of iterations equal to the number of iterations used in the forward pass.
Weight changes are computed at each iteration and a sum of all the
weight changes dictated for a particular weight is saved. Finally , after
all such error signals have been propagated through the system, the
weights are changed. The major problem with this procedure is the
memory required. Not only does the system have to hold its summed
weight changes while the error is being propagated, but each unit must
somehow record the sequence of activation values through which it was
driven during the original processing. This follows from the fact that
during each iteration while the error is passed back through the system, the current δ is relevant to a point earlier in time and the required
weight changes depend on the activation levels of the units at that time .
It is not entirely clear how such a mechanism could be implemented in
the brain. Nevertheless, it is tantalizing to realize that such a procedure
is potentially very powerful , since the problem it is attempting to solve
amounts to that of finding a sequential program (like that for a digital
computer) that produces specified input -sequence/ output -sequence
pairs. Furthermore , the interaction of the teacher with the system can
be quite flexible , so that , for example, should the system get stuck in a
local minimum , the teacher could introduce "hints " in the form of
desired output values for intermediate stages of processing. Our experience with recurrent networks is limited , but we have carried out some
experiments. We turn first to a very simple problem in which the system is induced to invent a shift register to solve the problem.
Learning to be a shift register. Perhaps the simplest class of recurrent problems we have studied is one in which the input and output units are one and the same and there are no hidden units. We simply present a pattern and let the system process it for a period of time.
The state of the system is then compared to some target state . If it
hasn ' t reached the target state at the designated time , error is injected
into the system and it modifies its weights . Then it is shown a new
input pattern and restarted . In these cases, there is no constraint on
the connections in the system . Any unit can connect to any other unit .
The simplest such problem we have studied is what we call the shift
register problem. In this problem, the units are conceptualized as a circular shift register. An arbitrary bit pattern is first established on the
units . They are then allowed to process for two time -steps. The target
state , after those two time -steps, is the original pattern shifted two
spaces to the left. The interesting question here concerns the state of
the units between the presentation of the start state and the time at
which the target state is presented . One solution to the problem is for
the system to become a shift register and shift the pattern exactly one
unit to the left during each time period . If the system did this then it
would surely be shifted two places to the left after two time units . We
have tried this problem with groups of three or five units and, if we constrain the biases on all of the units to be negative (so the units are off unless turned on), the system always learns to be a shift register of this sort.10 Thus, even though in principle any unit can connect to any other unit, the system actually learns to set all weights to zero except the ones connecting a unit to its left neighbor. Since the target states were determined on the assumption of a circular register, the left-most unit developed a strong connection to the right-most unit. The system learns this relatively quickly. With η = 0.25 it learns perfectly in fewer than 200 sweeps through the set of possible patterns with either three- or five-unit systems.
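The task itself is compact enough to state in a few lines. The sketch below, an illustration rather than the original simulation, shows the target pattern and the weight structure the text says the system discovers:

```python
# Sketch of the shift-register task: the target after two time steps is
# the starting pattern rotated two places to the left.
import numpy as np

pattern = np.array([1, 0, 1, 0, 0])
target = np.roll(pattern, -2)        # shift two places left (circular)

# The learned solution: each unit feeds only its left neighbor,
# with wraparound from the left-most unit to the right-most unit.
W = np.eye(5, k=1)                   # W[i, i+1] = 1: unit i+1 drives unit i
W[4, 0] = 1                          # circular wraparound
once = W @ pattern                   # one step: one place left
assert np.array_equal(W @ once, target)
```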
The tasks we have described so far are exceptionally simple , but they
do illustrate how the algorithm works with unrestricted networks . We
have attempted a few more difficult problems with recurrent networks .
10 If the constraint that biases be negative is not imposed, other solutions are possible. These solutions can involve the units passing through the complements of the shifted pattern or even through more complicated intermediate states. These trajectories are interesting in that they match a simple shift register on all even numbers of shifts, but do not match following an odd number of shifts.
TABLE 10

THE 25 SEQUENCES TO BE LEARNED

AA1212  AB1223  AC1231  AD1221  AE1213
BA2312  BB2323  BC2331  BD2321  BE2313
CA3112  CB3123  CC3131  CD3121  CE3113
DA2112  DB2123  DC2131  DD2121  DE2113
EA1312  EB1323  EC1331  ED1321  EE1313
11 The constraint in feedforward networks is that it must be possible to arrange the units into layers such that units do not influence units in the same or lower layers. In recurrent networks this amounts to the constraint that during the forward iteration, future states must not affect past ones.
and every output unit was also connected to every other output unit
and to itself . All the connections started with small random weights
uniformly distributed between - 0.3 and + 0.3. All the hidden and output units started with an activity level of 0.2 at the beginning of each
sequence.
We used a version of the learning procedure in which the gradient of
the error with respect to each weight is computed for a whole set of
examples before the weights are changed. This means that each connection must accumulate the sum of the gradients for all the examples
and for all the time steps involved in each example. During training ,
we used a particular set of 20 examples, and after these were learned
almost perfectly we tested the network on the remaining examples to
see if it had picked up on the obvious regularity that relates the first
two items of a sequence to the subsequent four . The results are shown
in Table 11. For four out of the five test sequences, the output units
all have the correct values at all times (assuming we treat values above
0.5 as 1 and values below 0.5 as 0) . The network has clearly captured
the rule that the first item of a sequence determines the third and
fourth , and the second determines the fifth and sixth . We repeated the
simulation with a different set of random initial weights, and it got all five test sequences correct.
The learning required 260 sweeps through all 20 training sequences.
The errors in the output units were computed as follows : For a unit
that should be on, there was no error if its activity level was above 0.8,
otherwise the derivative of the error was the amount below 0.8. Similarly , for output units that should be off , the derivative of the error was
the amount above 0.2. After each sweep, each weight was decremented
by .02 times the total gradient accumulated on that sweep plus 0.9
times the previous weight change.
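These conventions translate into a small sketch; the function names are ours, not from the original simulation.

```python
# Sketch of the error and update conventions described above.

def error_derivative(activity, should_be_on):
    """Error is zero once the unit is within 0.2 of its extreme."""
    if should_be_on:
        return max(0.0, 0.8 - activity)    # amount below 0.8
    return max(0.0, activity - 0.2)        # amount above 0.2

def weight_change(total_gradient, previous_change):
    """Decrement by 0.02 times the gradient summed over the sweep,
    plus 0.9 times the previous weight change (momentum)."""
    return -0.02 * total_gradient + 0.9 * previous_change
```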
We have shown that the learning procedure can be used to create a
network with interesting sequential behavior, but the particular problem
we used can be solved by simply using the hidden units to create "delay
lines" which hold information for a fixed length of time before allowing
it to influence the output . A harder problem that cannot be solved
with delay lines of fixed duration is shown in Table 12. The output is
the same as before, but the two input items can arrive at variable times
so that the item arriving at time 2, for example, could be either the
first or the second item and could therefore determine the states of the
output units at either the fifth and sixth or the seventh and eighth
times. The new task is equivalent to requiring a buffer that receives
two input "words" at variable times and outputs their " phonemic realizations" one after the other . This problem was solved successfully by a
network similar to the one above except that it had 60 hidden units and
half of their possible interconnections were omitted at random. The
TABLE 11

PERFORMANCE OF THE NETWORK ON FIVE NOVEL TEST SEQUENCES

For each of the five test sequences, the table lists the input sequence, the desired outputs, and the actual states of output units 1, 2, and 3 at each time step. Every unit begins each sequence at an activity level of 0.2.
TABLE 12

SIX VARIATIONS OF THE SEQUENCE EA1312 PRODUCED BY PRESENTING THE FIRST TWO ITEMS AT VARIABLE TIMES

E-A-1312
-E-A1312
EA--1312
-EA-1312
E--A1312
--EA1312

Note: With these temporal variations, the 25 sequences shown in Table 10 can be used to generate 150 different sequences.
CONCLUSION

In their 1969 pessimistic discussion of perceptrons, Minsky and Papert finally discuss multilayer machines near the end of their book. They state:

The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting "learning theorem" for the multilayered machine will be found. (pp. 231-232)
Although our learning results do not guarantee that we can find a solution for all solvable problems, our analyses and results have shown that
as a practical matter , the error propagation scheme leads to solutions in
virtually every case. In short, we believe that we have answered Minsky and Papert's challenge and have found a learning result sufficiently powerful to demonstrate that their pessimism about learning in multilayer machines was misplaced.
One way to view the procedure we have been describing is as a parallel computer that, having been shown the appropriate input / output
exemplars specifying some function , programs itself to compute that
function in general. Parallel computers are notoriously difficult to program. Here we have a mechanism whereby we do not actually have to
know how to write the program in order to get the system to do it .
Parker ( 1985) has emphasized this point .
On many occasions we have been surprised to learn of new methods
of computing interesting functions by observing the behavior of our
learning algorithm. This also raised the question of generalization. In
most of the cases presented above, we have presented the system with
the entire set of exemplars. It is interesting to ask what would happen
if we presented only a subset of the exemplars at training time and then
watched the system generalize to remaining exemplars. In small problems such as those presented here, the system sometimes finds solutions to the problems which do not properly generalize. However, preliminary results on larger problems are very encouraging in this regard.
This research is still in progress and cannot be reported here. This is
currently a very active interest of ours.
Finally , we should say that this work is not yet in a finished form .
We have only begun our study of recurrent networks and sigma-pi
units . We have not yet applied our learning procedure to many very
complex problems. However, the results to date are encouraging and
we are continuing our work .
PART III

FORMAL ANALYSES

Part III is focused on the formal tools employed in the study of PDP models and their application in the analysis of several specific aspects of PDP mechanisms.
In Chapter 10, Williams provides a useful analysis of activation functions. Throughout the book we employ several different activation functions. There is a question as to whether we will need to look for more complex activation functions to carry out some of the more complex computations. Williams shows that we will never have to consider activation functions more complex than the sigma-pi function.
In Chapter 11, Stone provides a useful analysis of the delta rule ,
which plays an important role throughout the book. Stone shows how a
change of basis can be employed to reveal the internal workings of the
delta rule . He shows that when there is no deterministic relationship
between inputs and targets, the delta rule leads to a system whose outputs match the central tendencies of the target patterns. Finally, he
shows how these analyses can be applied to several of the learning models described in this book.

In Chapter 12, McClelland analyzes the effects of limitations on the fan-in and fan-out of connections in PDP networks, exploring both the costs such limitations impose and the capacities of the resulting networks, for standard units and for a kind of programmable unit. Finally, in Chapter 13, Zipser and Rabin describe P3, a simulation system and computer language for building and observing the behavior of PDP networks. The P3 system is useful for two reasons. First, it provides a general description language in which many of the models described in this book can be specified. Second, and perhaps more importantly, it gives an example of the kind of simulation tools that can be used for constructing PDP networks and observing their behavior.
CHAPTER 9

An Introduction to Linear Algebra in Parallel Distributed Processing

M. I. JORDAN

VECTORS
FIGURE 1. A: Three-component vectors giving the age, height, and weight of each of four individuals (Joe, Mary, Brad, and Carol). B: A five-component vector for Joe.
There is no reason to restrict vectors to only three components, however. If, for example, we also wanted to keep track of Joe's shoe size and year of birth, then we would simply make a vector with five components, as in Figure 1B.
One important reason for the great utility of linear algebra lies in the simplicity of its notation. We will use bold, lower-case letters such as v to stand for vectors. With this notation, an arbitrarily long list of numbers can be referred to by a single symbol. When a vector has no more than three components, it can be pictured geometrically as a point or an arrow in space, and we will use such pictures to develop intuitions. No real distinction is made, however, between such low-dimensional vectors and vectors with more components: all of the operations on vectors that we develop apply whatever the number of components.
FIGURE 2. A vector drawn as an arrow in a plane whose axes are Height and Weight.
Multiplication by Scalars
In linear algebra, a single real number is referred to as a scalar. A
vector can be multiplied by a scalar by multiplying every component of
the vector by the scalar.
Examples:

2 [1; 2] = [2; 4]        5 [-3; 4] = [-15; 20]
FIGURE 3.

Addition of Vectors

Two vectors are added by adding their corresponding components. Only vectors with the same number of components can be added.
Examples:

[1; 2] + [2; 1] = [3; 3]        [0; 2] + [5; -3] = [5; -1]

FIGURE 4.

Consider Figure 4. It can be seen that the sum of
the two vectors is the diagonal of this parallelogram. In two and three
dimensions this is easy to visualize, but not when the vectors have
more than three components. Nevertheless, it will be useful to imagine
vector addition as forming the diagonal of a parallelogram. One impli cation of this view, which we will find useful , is that the sum of two
vectors is a vector that lies in the same plane as the vectors being
added.
Example: Calculating averages. We can demonstrate the use of the two operations thus far defined in calculating the average vector. Suppose we want to find the average age, height, and weight of the four individuals in Figure 1A. Clearly this involves summing the components separately and then dividing each sum by 4. Using vectors, this corresponds to adding the four vectors and then multiplying the resulting sum by the scalar 1/4. Using u to denote the average vector,

u = (1/4) ( [37; 72; 175] + [10; 30; 61] + [25; 65; 121] + [66; 67; 155] ) = [34.5; 58.5; 128].

Using vector notation, if we denote the four vectors by v1, v2, v3, and v4, then we can write the averaging operation as

u = (1/4) (v1 + v2 + v3 + v4).
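As a quick check, the same computation in a few lines of Python:

```python
# Verifying the averaging example with the four vectors of Figure 1A.
import numpy as np

v = np.array([[37, 72, 175], [10, 30, 61], [25, 65, 121], [66, 67, 155]])
u = v.sum(axis=0) / 4
print(u)    # [ 34.5  58.5 128. ]
```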
The result is a vector whose components are the averages of the corresponding components of the four individual vectors. Notice that the vectors are first added and the resulting vector is then multiplied by the scalar; the same vector is obtained if each vector is first multiplied by the scalar and the resulting vectors are then added. This shows that vector addition and multiplication by scalars obey the same distributive law as in ordinary algebra.
Consider the vectors v1 = [1; 2], v2 = [3; 2], and u = [9; 10]. Can u be written as the sum of scalar multiples of v1 and v2? That is, can scalars c1 and c2 be found such that u can be written in the form

u = c1 v1 + c2 v2 ?

If so, then u is said to be a linear combination of the vectors v1 and v2. The reader can verify that c1 = 3 and c2 = 2 will work, and thus u is a linear combination of v1 and v2. This can be seen geometrically in Figure 5, where these vectors are plotted.

FIGURE 5.
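Finding such coefficients is equivalent to solving a pair of linear equations, as the following sketch illustrates:

```python
# Finding the coefficients of the linear combination u = c1*v1 + c2*v2
# amounts to solving two linear equations in two unknowns.
import numpy as np

v1, v2, u = np.array([1, 2]), np.array([3, 2]), np.array([9, 10])
c = np.linalg.solve(np.column_stack([v1, v2]), u)
print(c)    # [3. 2.]
```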
Remembering that multiplication by a positive scalar lengthens or shortens a vector without changing its direction, and that vector addition corresponds to forming the diagonal of a parallelogram, it seems clear that by adjusting the scalars c1 and c2 we can generate any vector in the shaded area of the figure. Because multiplication by a negative scalar reverses the direction of a vector, it also seems clear that, using both positive and negative scalars, any vector in the plane can be written as a linear combination of v1 and v2. The vectors v1 and v2 are therefore said to span the plane.

In general, given a set v1, v2, . . . , vn of vectors, a vector u is said to be a linear combination of the vi if scalars c1, c2, . . . , cn can be found such that

u = c1 v1 + c2 v2 + · · · + cn vn.

The set of all linear combinations of the vi is called the set spanned by the vi.

Example: The three vectors [1; 0; 0], [0; 1; 0], and [0; 0; 1] span three-dimensional space, since any vector can be written as a linear combination of them:

[a; b; c] = a [1; 0; 0] + b [0; 1; 0] + c [0; 0; 1].

These vectors are referred to as the standard basis for three-dimensional space (more on the idea of a basis in the next section).
Linear Independence
To say that a set of vectors spans a given space is to say that every vector in the space can be written as a linear combination of the vectors in the set. We have to this point been using the term dimension without defining it. From the examples we have seen, it might seem that two vectors span a two-dimensional space, that three vectors span a three-dimensional space, and that in general a good definition of n-dimensional space would be the set of all linear combinations of a set of n vectors.
To make this definition work , we would require that the same size
space be generated by any set of n vectors. However, this is not the
case, as can be easily shown. Consider any pair of collinear vectors, for
example. Such vectors lie along a single line , thus any linear combination of the vectors will lie along the same line . The space spanned by
these two vectors is therefore only a one-dimensional set. The collinear vectors [1; 1] and [2; 2] are a good example. Any linear combination of these vectors will have equal components, thus they do not span
the plane.
Another example is a set of three vectors that lie on a plane in three-dimensional space. Any parallelograms that we form will be in the same plane, thus all linear combinations will remain in the plane and we cannot span all of three-dimensional space.

The general rule arising from these examples is that of a set of n vectors, if at least one can be written as a linear combination of the others, then the vectors span something less than a full n-dimensional space. We call such a set of vectors linearly dependent. If, on the other hand, none of the vectors can be written as a linear combination of the others, then the set is called linearly independent. We now revise the definition of dimensionality as follows: n-dimensional space is the set of vectors spanned by a set of n linearly independent vectors. The n vectors are referred to as a basis for the space.
Examples:

1. The collinear vectors [1; 1] and [2; 2] are linearly dependent; they span only a line, a one-dimensional space.
2. The vectors [1; 2] and [1; 3] are linearly independent; they span the plane, a two-dimensional space.
3. The vectors of example (2), together with any third two-component vector, form a linearly dependent set.
Notice the relationship between examples (2) and (3). The vectors in example (2) are linearly independent; therefore they span the plane. Thus any other vector with two components is a linear combination of these two vectors. In example (3), then, we know that the set will be linearly dependent before being told what the third vector is. This suggests the following rule: There can be no more than n linearly independent vectors in n-dimensional space.
FIGURE 6.
VECTOR SPACES
Let us pause for a moment to consider what a vector is. I have implied that a vector is a list of numbers, and I have also used a geometric representation in which a vector is a point or an arrow in space. These heuristics are useful, but in linear algebra the question of what a vector is receives a more abstract answer: a vector is an element of a vector space. A vector space V is a set of objects, called vectors, on which two operations, vector addition and scalar multiplication, are defined in such a way that the following requirements are met. For every pair of vectors u and v in V, there is a vector u + v in V, called the sum of u and v, and vector addition must be commutative and associative.1 For any vector v in V and any scalar c, there is a vector cv in V, called the product of c and v, and scalar multiplication must be associative and must obey the expected distributive properties with respect to vector addition.

Any set of objects with operations that fill these requirements is a vector space, whatever the objects themselves may be. Lists of numbers are a vector space when addition and multiplication by scalars are defined separately on the components, as we have been doing. Arrows fill the requirements when vector addition is defined as taking the diagonal of the parallelogram formed by two arrows and scalar multiplication as the lengthening, shortening, or reversing of an arrow. Seemingly unrelated objects such as polynomials also form a vector space, because the sum of two polynomials is a polynomial and a scalar times a polynomial is again a polynomial. This sort of abstraction is common in mathematics: the term vector is deliberately left undefined except through the properties of its operations, much as point and line are left undefined in geometry. The abstraction is useful because a general theorem proved about vector spaces is

1 I have left out certain technicalities that are usually included as axioms in the definition of a vector space: there must be a zero vector in the space, and for every vector there must be an additive inverse.
true about any instantiation of a vector space. We can therefore discuss general properties of vector spaces without being committed to choosing a particular representation such as a list of numbers. Much of the discussion about linear combinations and linear independence was of this nature.

When we do choose numbers to represent vectors, we use the following scheme. First we choose a basis for the space. Since every vector in the space can be written as a linear combination of the basis vectors, each vector has a set of coefficients c1, c2, . . . , cn which are the coefficients in the linear combination. These coefficients are the numbers used as the components of the vector. As was shown in the previous section, the coefficients of a given vector are unique because basis vectors are linearly independent.

There is a certain arbitrariness in assigning the numbers, since there are infinitely many sets of basis vectors, and each vector in the space has a different description depending on which basis is used. That is, the coefficients, which are referred to as coordinates, are different for different choices of basis. The implications of this fact are discussed further in a later section, where I also discuss how to relate the coordinates of a vector in one basis to the coordinates of the vector in another basis. Chapter 22 contains a lengthy discussion of several issues relating to the choice of basis.
INNER PRODUCTS
As of yet, we have no way to speak of the length of a vector or of
the similarity between two vectors. This will be rectified with the
notion of an inner product.
The inner product of two vectors is the sum of the products of the
vector components. The notation for the inner product of vectors
v and w is v . w . As with vector addition , the inner product is defined
only if the vectors have the same number of components.
Example:

v = [3; -1; 2],  w = [1; 2; 1]

v · w = (3 · 1) + (-1 · 2) + (2 · 1) = 3.
Length

FIGURE 7.

As a special case, consider taking the inner product of a vector with itself, v · v. An example is the vector v = [3; 4], for which v · v = 9 + 16 = 25; as Figure 7 suggests, the Pythagorean theorem tells us that the length of the arrow representing v is √25 = 5. In general, we define the length of a vector v, written ||v||, as

||v|| = √(v · v).

Although the definition was motivated by an example in two dimensions, it can be applied to any vector. Notice that many of the
properties of length accord with our intuitions. For example, multiplying a vector by a scalar multiplies its length by the absolute value of the scalar:

||cv|| = |c| ||v||.

This is easily proved from the definition, because the scalar contributes its square to each squared component inside the square root. It can also be easily proved that the absolute value of the inner product of two vectors is never larger than the product of their lengths. Somewhat harder to prove is the triangle inequality, which states that the length of the sum of two vectors is no longer than the sum of their lengths:

||v1 + v2|| ≤ ||v1|| + ||v2||.

Geometrically, the triangle inequality corresponds to the statement that one side of a triangle is no longer than the sum of the lengths of the other two sides, with equality only in the special case where the two vectors point in the same direction. These properties show that the definition is closely related to the idea of length in ordinary geometry.
Angle

The angle between two vectors v and w is defined in terms of the inner product by the following definition:

cos θ = (v · w) / (||v|| ||w||).    (2)

Example: For the vectors v1 = [1; 0] and v2 = [1; 1], we have v1 · v2 = 1, ||v1|| = 1, and ||v2|| = √2, so

cos θ = 1/√2 = 0.707
θ = cos⁻¹(0.707) = 45°.
This result could also clearly have been obtained by plotting the vectors and measuring the angle between them; the point of Equation 2 is that the angle can be computed from the components alone.

Let us try to gain an intuitive understanding of Equation 2. Suppose we hold the lengths of the two vectors constant. Then the inner product is proportional to the cosine of the angle between the vectors. The cosine is at its maximum when the angle is zero; thus the inner product is largest when the two vectors point in the same direction. As the vectors move apart, like the hands of a clock, the cosine, and therefore the inner product, decreases, reaching zero when the vectors are 90° apart. It then becomes more and more negative, reaching its most negative value when the angle is 180° and the vectors point in opposite directions.

This understanding of the inner product is important in the study of PDP models, because the inner product is the basic operation by which such models compute the similarity, or "correlation," between pairs of vectors. Writing the inner product in terms of the components,

v · w = Σ_i v_i w_i = ||v|| ||w|| cos θ,

we see that it is a sum of products that is larger the better the components of the two vectors match. We must be careful, however, in using the inner product as a measure of similarity. The claim that vectors with larger inner products are closer together holds only if the lengths of the vectors are held constant, because the inner product depends on the lengths as well as on the angle. We
must remember to divide the inner product by the lengths of the vectors involved to make such comparative statements.

An important special case occurs when the inner product is zero. In this case, the two vectors are said to be orthogonal. Plugging zero into the right side of Equation 2 gives

cos θ = 0,

which implies that the angle between the vectors is 90°. Thus, orthogonal vectors are vectors which lie at right angles to one another. We will often speak of a set of orthogonal vectors. This means that every vector in the set is orthogonal to every other vector in the set. That is, every vector lies at a right angle to every other vector. A good example in three-dimensional space is the standard basis referred to earlier.
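These definitions translate directly into a few lines of Python, here using the example vectors introduced above:

```python
# Length, angle, and orthogonality, all computed from the inner product.
import numpy as np

v, w = np.array([3.0, -1.0, 2.0]), np.array([1.0, 2.0, 1.0])
inner = v @ w                                        # v . w = 3
length_v = np.sqrt(v @ v)                            # ||v||
cos_theta = inner / (np.sqrt(v @ v) * np.sqrt(w @ w))
orthogonal = np.isclose(inner, 0.0)                  # zero inner product?
```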
Projections
A further application of the inner product, closely related to the ideas of length and angle, is the notion of a projection of one vector onto another. An example is given in Figure 8. The distance x is the projection of v on w. In two dimensions, we readily know how to calculate the projection. It is

x = ||v|| cos θ.    (3)
FIGURE 8.
There is a close relationship between the inner product and the projection. Using Equation 2, we can rewrite the formula for the projection:

x = ||v|| cos θ = ||v|| (v · w) / (||v|| ||w||) = (v · w) / ||w||.

Thus, the projection is the inner product divided by the length of w. In particular, if w has length one, then ||w|| = 1, and the projection of v on w and the inner product of v and w are the same thing. This way of thinking about the inner product is consistent with our earlier comments. That is, if we hold the lengths of v and w constant, then we know that the inner product gets larger as v moves toward w. From the picture, we see that the projection gets larger as well. When the two vectors are orthogonal, the projection as well as the inner product are zero.
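A two-line sketch of the computation:

```python
# The projection of v on w: the inner product divided by ||w||.
# When w has length one, projection and inner product coincide.
import numpy as np

def projection(v, w):
    return (v @ w) / np.sqrt(w @ w)
```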
FIGURE 9.
Let l denote the projection of v on w. We have l = ||v|| cos θ from geometry. We can break l into two pieces, l_x and l_y, as shown in the figure. l_y can be computed from the diagram by noticing that triangles OAD and COB in Figure 10 are similar triangles. Thus, the ratio of corresponding sides is constant:

l_y / v_y = w_y / ||w||,

giving

l_y = v_y w_y / ||w||.

FIGURE 10.
FIGURE 11.
Similarly,

l_x / v_x = w_x / ||w||,

giving

l_x = v_x w_x / ||w||.

Thus,

l = ||v|| cos θ = l_x + l_y = (v_x w_x + v_y w_y) / ||w|| = (v · w) / ||w||,

and therefore

cos θ = (v · w) / (||v|| ||w||),

in agreement with Equation 2.
The following theorems about the inner product can all be proved directly from its definition in terms of the components. In what follows, c and the ci will be any scalars, and the v and w will be vectors with the same number of components.

v · w = w · v    (4)

(cv) · w = c (v · w)    (5)

w · (v1 + v2) = w · v1 + w · v2    (6)

The first theorem says simply that order is unimportant; the inner product is commutative. The second and third theorems show that the inner product is a linear function, as we will discuss at length in a later section. We can combine these two equations to get

w · (c1v1 + c2v2) = c1 (w · v1) + c2 (w · v2).    (7)

It is also well worth our while to use mathematical induction to generalize this formula, giving us

w · (c1v1 + c2v2 + · · · + cnvn) = c1 (w · v1) + c2 (w · v2) + · · · + cn (w · vn).    (8)
A SIMPLE PARALLEL DISTRIBUTED PROCESSING MODEL
In this section, we show how some of the concepts we have introduced can be used in analyzing a very simple PDP model. Consider the processing unit in Figure 12 which receives inputs from the n units below. Associated with each of the n + 1 units there is a scalar activation value. We shall use the scalar u to denote the activation of the output unit and the vector v to denote the activations of the n input units.
That is, the ith component of v is the activation of the ith input unit .
Since there are n input units , v is an n -dimensional vector.
Associated with each link between the input units and the output
unit , there is a scalar weight value, and we can think of the set of n
FIGURE 12.

FIGURE 13.
weights as an n -dimensional vector w . This is the weight vector
corresponding to the output unit . Later we will discuss a model with
many output units , each of which will have its own weight vector.
Another way to draw the same model is shown in Figure 13. Here
we have drawn the n input units at the top with the output unit on the
right . The components of the weight vector are stored at the junctions
where the vertical input lines meet the horizontal output line . Which
diagram is to be preferred (Figure 12 or Figure 13 ) is mostly a matter
of taste, although we will see that the diagram in Figure 13 generalizes
better to the case of many output units .
Now to the operation of the model: Let us assume that the activation of each input unit is multiplied by the weight on its link, and that these products are added up to give the activation of the output unit. Using the definition of the inner product, we translate that statement into mathematics as follows:

u = w · v.

The activation of the output unit is the inner product of its weight vector with the vector of input activations.
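A sketch of this unit in Python; the weights and inputs are illustrative values, not drawn from the text.

```python
# A single processing unit: its activation is the inner product of its
# weight vector with the input vector.
import numpy as np

w = np.array([0.5, -0.3, 0.8])       # stored weight vector
v = np.array([1.0, 0.0, 0.5])        # input activations
u = w @ v                             # activation of the output unit
thresholded = 1 if u > 0 else 0       # the linear threshold variant
                                      # discussed below
```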
The geometric properties of the inner product give us the following
picture to help in understanding what the model is computing . We
imagine that the set of possible inputs to the model is a vector space.
It is an n -dimensional space, where n is the number of input lines.
The weight vector also has n components, thus we can plot the weight
vector in the input space. The advantage of doing this is that we can
now state how the system will respond to the various inputs . As we
have seen, the inner product gives an indication of how close two vectors are. Thus, in this simple PDP model, the output activation gives
an indication or measurement of how close the input vector is to the
stored weight vector. The inputs lying close to the weight vector will
yield a large positive response, those lying near 90° will yield a zero response, and those pointing in the opposite direction will yield a large negative response. If we present a succession of input vectors of constant length, the output unit will respond most strongly to that input vector which is closest to its weight vector, and will drop off in response as the input vectors move away from the weight vector.

One way to describe the functioning of the processing unit is to say that it splits the input space into two parts, the part where the response is negative and the part where the response is positive. We can easily imagine augmenting the unit in the following way: if the inner product is positive, output a 1; if the inner product is negative, output a 0.
This unit , referred to as a linear threshold unit, explicitly computes
which part of the space the input lies in .
In some models, the weight vector is assumed to be normalized, that is, ||w|| = 1. As we have seen, in this case, the activation of the output unit is simply the projection of the input vector on the weight vector.
MATRICES AND LINEAR SYSTEMS
The first section introduced the concepts of a vector space and the
inner product. We have seen that vectors may be added together and
multiplied by scalars. Vectors also have a length , and there is an angle
between any pair of vectors. Thus, we have good ways of describing
the structure of a set of vectors.
The usefulness of vectors can be broadened considerably by introduc ing the concept of a matrix . From an abstract point of view, matrices
are a kind of " operator" that provide a mapping from one vector space
to another vector space. They are at the base of most of the models in
this book which take vectors as inputs and yield vectors as outputs.
First , we will define matrices and show that they have an algebra of
their own which is analogous to that of vectors. In particular, matrices
can be added together and multiplied by scalars.
MATRICES

A matrix is a rectangular array of scalars. A matrix with m rows and n columns is referred to as an m × n matrix.

Examples:

M = [3 4 5; 1 0 1]      N = [3 0 0; 0 7 0; 0 0 1]      P = [1 0 -1; -1 2 7]

Multiplication by Scalars

A matrix is multiplied by a scalar by multiplying every element by the scalar.

Example:

3M = 3 [3 4 5; 1 0 1] = [9 12 15; 3 0 3]
Addition of Matrices

Matrices are added together by adding corresponding elements. Only matrices that have the same number of rows and columns can be added together.

Example:

[3 4 5; 1 0 1] + [-1 0 2; 4 1 -1] = [2 4 7; 5 1 0]
A matrix can also multiply a vector to produce a new vector. Consider the matrix W = [3 4 5; 1 0 1] and the vector v = [1; 0; 2]. The product of W and v, denoted u = Wv, is computed by taking the inner product of v with each row of W:

u = Wv = [3 4 5; 1 0 1] [1; 0; 2] = [3·1 + 4·0 + 5·2; 1·1 + 0·0 + 1·2] = [13; 3].
The components of u are the inner products of v with the row vectors of W.

For a general m × n matrix W and an n-dimensional vector v,2 the product Wv is an m-dimensional vector u whose elements are the inner products of v with the row vectors of W. As suggested by Figure 14, the ith component of u is the inner product of v with the ith row vector of W. Thus, the multiplication of a vector by a matrix can be thought of as simply a shorthand way to write down a series of inner products of a vector with a set of other vectors. The vector u tabulates the results. This way of thinking about the multiplication operation is a good way to conceptualize what is happening in a PDP model with many output units, as we will see in the next section.
There is another way of writing the multiplication operation that gives a different perspective on what is occurring. If we imagine breaking the matrix up into its columns, then we can equally well speak of the column vectors of the matrix. It can then be easily shown that the multiplication operation Wv produces a vector u that is a linear combination of the column vectors of W. Furthermore, the coefficients of the linear combination are the components of v. For example, letting w1, w2, w3 be the column vectors of W, we have

Wv = v1 w1 + v2 w2 + v3 w3,

FIGURE 14. The ith component of u is the inner product of v with the ith row of W.
where the vi are the components of v. This way of viewing the multiplication operation is suggested in Figure 15 for a matrix with n columns.

If we let the term column space refer to the space spanned by the column vectors of a matrix, then we have the following interesting result: The vector u is in the column space of W.

Finally, it is important to understand what is happening on an abstract level. Notice that for each vector v, the operation Wv produces another vector u. The operation can thus be thought of as a mapping or function from one set of vectors to another set of vectors. That is, if we consider an n-dimensional vector space V (the domain) and an m-dimensional vector space U (the range), then the operation of multiplication by a fixed matrix W is a function from V to U, as shown in Figure 16. It is a function whose domain and range are both vector spaces.
FIGURE 15. The product Wv written as a linear combination of the columns of W: Wv = v1 w1 + · · · + vn wn.
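Both views can be checked numerically, here with the example matrix and vector used above:

```python
# Two views of matrix-vector multiplication: u = Wv as a list of row
# inner products, and as a linear combination of the columns of W.
import numpy as np

W = np.array([[3, 4, 5], [1, 0, 1]])
v = np.array([1, 0, 2])

u_rows = np.array([row @ v for row in W])          # inner products
u_cols = sum(v[j] * W[:, j] for j in range(3))     # column combination
assert np.array_equal(u_rows, u_cols)              # both give [13, 3]
```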
For the product Wv to be defined, the number of components of the vector v must equal the number of columns of the matrix W. Matrix-vector multiplication obeys the following laws:

W(av) = a Wv    (9)

W(u + v) = Wu + Wv    (10)

W(c1v1 + c2v2 + · · · + cnvn) = c1 Wv1 + c2 Wv2 + · · · + cn Wvn    (11)

Moreover, for matrices M and N, which must have the same number of rows and columns,

Mv + Nv = (M + N) v.    (12)
ONE LAYER OF A PARALLEL DISTRIBUTED PROCESSING SYSTEM
Consider the network shown in Figure 17, in which a set of input units sends activation to a set of output units.

FIGURE 17.

As before, the activation rule
says that the activation of each output unit is the inner product of its weight vector with the input vector. If we write the weight vector of the ith output unit as wi, then the activation of the ith output unit is ui = wi · v. Now let W be the matrix whose row vectors are the weight vectors wi. Then we can use matrix multiplication to write all of the computations performed by the network at once, with the very succinct expression

u = Wv.

It says that the network produces an output vector u, whose components are the activations of the output units, by multiplying the weight matrix W by the input vector v.

Another way to draw the network, which generalizes the diagram of Figure 13 to the case of many output units, is shown in Figure 18. The output units appear at the right, each associated with a horizontal line, and the weights appear at the junctions where the vertical input lines meet the horizontal output lines. When the network is drawn in this way, it is clear why the matrix W appears in the equation: the array of weights at the junctions is exactly the weight matrix, with the weight vector of each output unit lying along its horizontal line.
FIGURE 18. Note that the weight wij in the ith row and jth column connects the jth input unit to the ith output unit.
Now let us attempt to understand geometrically what the model is computing. Each output unit computes the inner product of its weight vector with the input vector. As we have seen, this means that each output unit is measuring how close the input vector is to its own weight vector. We can therefore form a graphic representation of what the network does by plotting the weight vectors of all of the output units in the input space. If the weight vectors all have the same length, then a given input will produce the largest response in the output unit whose weight vector is closest to the input vector, with smaller responses spread over the units whose weight vectors lie farther away. Real PDP models can have hundreds or thousands of input units, so this graphic representation cannot always be used directly; it is a conceptual tool, however, and one that will lead us to some useful results in later sections.

The approach we have taken is to view the rows of the weight matrix as vectors, each row being the vector of incoming weights of one output unit. It is also useful to take the other perspective and consider the columns of the matrix. Each column of W is associated with one input unit: it is the vector of outgoing weights on the lines leading from that input unit to all of the output units. From this perspective, the operation of the model is just the linear combination view of matrix multiplication discussed in the previous section: each input unit multiplies its outgoing weight vector by its own activation, and the resulting vectors are added over all of the input units to yield the output vector. Thus, the output vector is a linear combination of the column vectors of the weight matrix, in which the coefficients are the activations of the input units.
In a multilayer system, both views can be useful at once: a unit at an intermediate level has an incoming weight vector, which appears as a row of the weight matrix at one level, and an outgoing weight vector, which appears as a column of the weight matrix at the next level, as shown in Figure 19. Such a unit can be thought of as matching its incoming weight vector against the current input by an inner product, and then sending the result, multiplied by its outgoing weight vector, on to the next level.
LINEARITY
A distinction of fundamental importance in the study of PDP systems is that between linear and nonlinear systems. In general, linear systems are relatively easy to analyze and understand, whereas nonlinear systems are difficult, everything else being equal. In this section I define linearity and give some specific examples of linear and nonlinear systems.

Suppose that the function f represents a system, in the sense that the output of the system is given by f(x) when the input x is presented; the inputs and outputs might be scalars or vectors. The function f is said to be linear if, for any inputs x1 and x2 and any scalar c, the following two equations hold:

f(cx) = c f(x)    (13)

f(x1 + x2) = f(x1) + f(x2)    (14)

FIGURE 19.
The first equation says that if we multiply the input to the system by a scalar c, then the output is multiplied by the same scalar. The second equation says that the response of the system to the sum of two inputs is the sum of its responses to the inputs taken separately. These properties are important because of what they allow us to predict. If we know the outputs of a linear system to the inputs x1 and x2 presented separately, then we know, without any further measurement, what the output must be when the sum of the two inputs is presented: it is simply the sum of the separate outputs. In a nonlinear system this need not hold; the response to the sum of two inputs might be much larger, or much smaller, than the sum of the separate responses, so knowing the responses to the inputs separately does not tell us the response to their sum. Functions such as f(x) = x² are nonlinear in exactly this sense.

The simplest examples of linear functions are the functions of one scalar variable of the form f(x) = cx, in which the output is proportional to the input, c being some fixed constant; Equations 13 and 14 are easily verified for such functions. The inputs and outputs of a linear function may also be vectors. The example of most concern to us is multiplication of a vector by a fixed matrix: the function f(v) = Wv is a linear function, because W(cv) = cWv and W(u + v) = Wu + Wv, as discussed in the previous section (Equations 9 and 10). It follows that the one-layer PDP model presented in the previous section is a linear system, because the output vector is obtained from the input vector by multiplication by the weight matrix. Note that if the ith standard basis vector is presented to such a network, the output is simply the ith column of the weight matrix. Because the system is linear, we then know what the output will be when any two input vectors are presented together: it is the sum of the outputs obtained when
the vectors are presented separately. We also know what the output
will be to scalar multiples of a vector. These properties imply that if we
know the output to all of the vectors in some set {vi}, then we can calculate the output to any linear combination of the vi. That is, if v = c1v1 + c2v2 + . . . + cnvn, then the output when v is presented to the system is
    Wv = W(c1v1 + c2v2 + . . . + cnvn) = c1(Wv1) + c2(Wv2) + . . . + cn(Wvn).    (15)

The terms in parentheses are the known outputs to the vi, so the output to any linear combination can be calculated by simply multiplying each known output by the corresponding coefficient ci and summing; there is no need to present the new input to the system at all. The preceding statement expresses an extremely important property of linear systems. Imagine that we are studying some physical or electronic system whose responses to various inputs we can measure, and suppose that the system is known to be linear. If we carefully measure the outputs of the system to a set of basis vectors for the input space, then the responses to every other input are immediately determined, because any input vector can be written as a linear combination of the basis vectors, and Equation 15 then gives its output. No further measurements need be made. Thus, a linear system is completely characterized by its responses to any set of basis vectors, for example, the vectors of the standard basis.

MATRIX MULTIPLICATION AND MULTILAYER SYSTEMS
The systems considered until now have been one-layer systems. That
is, the input arrives at a set of input units , is passed through a set of
weighted connections described by a matrix , and appears on a set of
output units . Let us now arrange two such systems in cascade, so that
the output of the first system becomes the input to the next system, as
shown in Figure 20. The composite system is a two-layer system and is
described by two matrix -vector multiplications . An input vector v is
first multiplied by the matrix N to produce a vector z on the set of
intermediate units :
z = Nv ,
FIGURE 20. [Two one-layer systems in cascade: the input vector v feeds through the matrix N to the intermediate units z, and then through the matrix M to the output units u.]
and then z is multiplied by M to produce a vector u on the uppermost set of units:

    u = Mz.

Substituting the first equation into the second gives the overall input-output function of the composite system:

    u = M(Nv).    (16)
It will simplify the analysis if we can find a single matrix P that allows us to compute u directly from v, so that u = Pv describes the composite system. Such a matrix exists; it is called the product of M and N and is written P = MN. The ijth element of P is the inner product of the ith row of M with the jth column of N. An example of matrix multiplication is

    [3 4 5] [ 1 2]   [(3+8-5) (6+0+5)]   [6 11]
    [1 0 1] [ 2 0] = [(1+0-1) (2+0+1)] = [0  3]
    [0 1 2] [-1 1]   [(0+2-2) (0+0+2)]   [0  2].

Another way to think about matrix multiplication follows from the definition of matrix-vector multiplication: each column vector of P is computed by multiplying the corresponding column vector of N by the matrix M.
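To make the composition property concrete, here is a minimal NumPy sketch using the matrices from the example above (the code and the test vector are our own additions, not the chapter's):

    import numpy as np

    # Matrices from the example above.
    M = np.array([[3, 4, 5],
                  [1, 0, 1],
                  [0, 1, 2]])
    N = np.array([[ 1, 2],
                  [ 2, 0],
                  [-1, 1]])

    P = M @ N                      # matrix product, P = MN
    print(P)                       # rows: [6 11], [0 3], [0 2]

    # Cascading the two layers agrees with the single matrix P.
    v = np.array([0.5, -1.0])      # arbitrary input vector
    z = N @ v                      # intermediate layer
    u = M @ z                      # output of the cascade, u = M(Nv)
    assert np.allclose(u, P @ v)   # same as (MN)v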
FIGURE 21. [The two-layer system of Figure 20 and the equivalent one-layer system described by the product matrix P = MN.]
6 The two systems are identical in the sense that they compute the same function . Of
course, they may have different internal dynamics and therefore take different amounts
of time to compute their outputs.
FIGURE 22. [The connections from an element of v to an element of u through the s intermediate units.]
This equation can be easily generalized to give the strength of the connection between the jth element of v and the ith element of u:

    Pij = mi1 n1j + mi2 n2j + . . . + mis nsj.

This formula calculates the inner product between the ith row of M and the jth column of N, which shows that P is equal to the product MN.
This result can be extended to systems with more than two layers by repeated matrix multiplication: a cascade of any number of linear layers is equivalent to a single layer whose matrix is the product of the individual matrices. The definition of matrix multiplication may seem somewhat odd, especially since it would seem more straightforward to define it by analogy with matrix addition as the element-wise product. In fact, it would be perfectly acceptable to define multiplication as the element-wise product, and then to use another name for the operation we have discussed in this section. However, element-wise multiplication has never found much of an application in linear algebra. Therefore, the term multiplication has been reserved for the operation described here, which proves to be a useful definition, as the application to multilayer systems demonstrates.
Algebraic Properties of Matrix Multiplication
    M(cN) = cMN    (17)

    M(N + P) = MN + MP    (18)

    (N + P)M = NM + PM    (19)
EIGENVECTORS AND EIGENVALUES
The next two sections develop some of the mathematics important
for the study of learning in PDP networks. First, I will discuss eigenvectors and eigenvalues and show how they relate to matrices. Second, I
will discuss outer products. Outer products provide one way of constructing matrices from vectors. In a later section, I will bring these
concepts together in a discussion of learning.
Recall the abstract point of view of matrices and vectors that was discussed earlier: The equation u = Wv describes a function or mapping
from one space, called the domain, to another space, called the range.
In such vector equations, both the domain and the range are vector
spaces, and the equation associates a vector u in the range with each
vector v in the domain.
In general, a function from one vector space to another can associate
an arbitrary vector in the range with each vector in the domain. However, knowing that u = Wv is a linear function highly constrains the
form the mapping between the domain and range can have. For example, if v1 and v2 are close together in the domain, then the vectors
u1 = Wv1 and u2 = Wv2 must be close together in the range. This is
known as a continuity property of linear functions. Another important
constraint on the form of the mapping is the following, which has
already been discussed. If v3 is a linear combination of v1 and v2, and
the vectors u1 = Wv1 and u2 = Wv2 are known, then u3 = Wv3 is completely determined: it is the same linear combination of u1 and u2.
Furthermore , if we have a set of basis vectors for the domain, and it is
known which vector in the range each basis vector maps to , then the
mappings of all other vectors in the domain are determined (cf. Equation 15).
In this section, let us specialize to the case of square matrices, that is, matrices of dimensions n x n. In this case, the domain and the range are vector spaces of the same dimensionality (because the vectors v and u must have the same number of components), and the vectors in the domain and the range can be plotted in
the same space. This is done in Figure 23 , where we have shown two
vectors before and after multiplication by a matrix .
In general, vectors in this space will change direction as well as length when multiplied by the matrix. For certain special vectors, however, the effect of the matrix is simply to change the vector's length while leaving its direction alone. Such a vector v satisfies the equation

    Wv = λv

for some scalar λ. A vector with this property is called an eigenvector of the matrix W, and the scalar λ is called the corresponding eigenvalue.

FIGURE 23. [The vectors v1 and v2 and their images Wv1 and Wv2 under multiplication by a matrix, plotted in the same space.]
A matrix can have more than one eigenvector, which, geometrically, means that it is possible to have eigenvectors in more than one direction. For example, the leftmost matrix above also has the eigenvector (1, 1) with eigenvalue 3, and the diagonal matrix on the right also has the eigenvector (0, 1) with eigenvalue 4.
There is another, more trivial, sense in which a matrix can have multiple eigenvectors: Each vector that is collinear with an eigenvector is itself an eigenvector. If v is an eigenvector with eigenvalue λ, and if y = cv, then it is easy to show that y is also an eigenvector with eigenvalue λ. For the ensuing discussion, the collinear eigenvectors will just confuse things, so I will adopt the convention of reserving the term eigenvector only for vectors of length 1. This is equivalent to choosing a representative eigenvector for each direction in which there are eigenvectors.
Let us now return to the diagonal matrix

    [3 0]
    [0 4].

We have seen that this matrix has two eigenvectors, (1, 0) and (0, 1), with eigenvalues 3 and 4. The fact that the eigenvalues are the same as the diagonal elements of the matrix is no coincidence: This is true for all diagonal matrices, as can be seen by multiplying any diagonal matrix by one of its eigenvectors, a vector in the standard basis. It is also true that this matrix has only two eigenvectors. This can be seen by considering any vector of the form (a, b), where a and b are both nonzero. Then we have

    [3 0] [a]   [3a]
    [0 4] [b] = [4b].

Such a vector is not an eigenvector, because the components are multiplied by different scalars. The fact that the matrix has distinct eigenvalues is the determining factor here. If the diagonal elements had been identical, then any two-dimensional vector would indeed have been an eigenvector. This can also be seen in the case of the n x n identity matrix I, for which every n-dimensional vector is an eigenvector with eigenvalue 1.

In general, an n x n matrix can have up to, but no more than, n distinct eigenvalues. Furthermore, distinct eigenvalues correspond to distinct directions. To be more precise, if a matrix has n distinct
eigenvalues, then its eigenvectors lie in n distinct directions and form a basis for the space. Any vector v can therefore be written as a linear combination of the eigenvectors: v = c1v1 + c2v2 + . . . + cnvn. Using linearity, we can now write

    Wv = W(c1v1 + c2v2 + . . . + cnvn)
       = c1(Wv1) + c2(Wv2) + . . . + cn(Wvn)
       = c1λ1v1 + c2λ2v2 + . . . + cnλnvn.    (21)
Notice that there are no matrices in this last equation. Each term ciλi is a scalar; thus we are left with a simple linear combination of vectors after having started with a matrix multiplication.

This equation should give some idea of the power and utility of the eigenvectors and eigenvalues of a matrix. If we know the eigenvectors and eigenvalues, then, in essence, we can throw away the matrix. We simply write a vector as a linear combination of eigenvectors, then multiply each term by the appropriate eigenvalue to produce Equation 21, which can be recombined to produce the result. Eigenvectors turn matrix multiplication into simple multiplication by scalars.

It is also revealing to consider the magnitudes of the eigenvalues for a particular matrix. In Equation 21, all of the vectors vi are of unit length; thus the length of the vector u depends directly on the products of the magnitudes of the ci and the eigenvalues λi. Consider the vectors that tend to point in the directions of the eigenvectors with large eigenvalues. These are the vectors with large ci for those eigenvectors. Equation 21 says that after multiplication by the matrix they will be longer than vectors of the same initial length that point in other directions. In particular, of all unit length vectors, the vector that will be the longest after multiplication by the matrix is the eigenvector with the largest eigenvalue. In other words, knowledge of the eigenvectors and eigenvalues of a system tells which input vectors the system will give a large response to. This fact can be useful in the analysis of linear models.
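Equation 21 is easy to verify numerically; here is a minimal NumPy sketch, using the diagonal matrix from the example above (the test vector is our own choice):

    import numpy as np

    W = np.array([[3.0, 0.0],
                  [0.0, 4.0]])

    # Eigenvalues and unit-length eigenvectors (columns of V).
    lam, V = np.linalg.eig(W)

    v = np.array([2.0, -1.0])          # an arbitrary input vector
    c = np.linalg.solve(V, v)          # coefficients of v in the eigenvector basis

    # Equation 21: Wv = sum_i c_i * lambda_i * v_i, with no matrix needed.
    u = sum(c[i] * lam[i] * V[:, i] for i in range(len(lam)))
    assert np.allclose(u, W @ v)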
TRANSPOSES AND THE OUTER PRODUCT
The transpose of an n x m matrix W is an m x n matrix denoted W^T. The i,jth element of W^T is the j,ith element of W.
Example: if

    W = [1  0 2]
        [3 -4 5]

then

    W^T = [1  3]
          [0 -4]
          [2  5].
    (W^T)^T = W

    (cW)^T = cW^T

    (M + N)^T = M^T + N^T

    (MN)^T = N^T M^T

If a matrix is its own transpose, that is, if W^T = W, then the matrix is symmetric.
Outer Products
Before discussing outer products, let me attempt to ward off a possible confusion: What is the difference between an n x 1 matrix, that is, a matrix with a single column, and an n-dimensional vector? For the purposes of calculation there is no difference: the same results will be obtained in equations involving these entities whether they are regarded as vectors or as matrices. For calculating values and manipulating equations, there is no need to distinguish between vectors and n x 1 matrices. Rather, by treating them as the same thing, we have a uniform set of procedures for dealing with all equations involving vectors and matrices. Conceptually, however, the two should be kept separate: a vector is an element in a vector space, whereas a matrix can be used to define a linear mapping from one vector space to another. These are very different concepts.
Example: let v = (3, 1, 2) and u = (0, 4, 1). Then

    v^T u = [3 1 2] [0]
                    [4] = [6].
                    [1]
Notice that the result has only a single component, and that this component is calculated by taking the inner product of the vectors v and u.
In many applications, there is no need to distinguish between vectors
with one component and scalars, thus the notation v Tu is often used
for the inner product.
Let us next consider the product uv^T. This is a legal product because the number of columns in u and the number of rows in v^T are the same, namely one. Following the rule for matrix multiplication, we find that there are n^2 inner products to calculate and that each inner product involves vectors of length one.
Example:

    uv^T = [0]            [ 0 0 0]
           [4] [3 1 2] =  [12 4 8]
           [1]            [ 3 1 2].
The i,jth element of the resulting matrix is equal to the product ui vj.
For those who may have forgotten the noncommutativity of matrix
multiplication , this serves as a good reminder : Whereas the product
v Tu has a single component, a simple change in the order of multipli cation yields an n x n matrix .
Products of the form uv T are referred to as outer products, and will
be discussed further in the next section. Note that the rows of the
resulting matrix are simply scalar multiples of the vector v. In other words, if we let W be the matrix uv^T, and let wi be the ith row of W, then we have
    wi = ui v,

where ui is the ith component of the vector u.
OUTER PRODUCTS, EIGENVECTORS, AND LEARNING

In this section, I bring together several of the concepts that have now been discussed and use them to analyze a simple learning scheme for linear PDP systems. The scheme is a rudimentary form of Hebbian learning, whereby a weight matrix that associates given input vectors with given output vectors can be constructed; such schemes have been discussed by Kohonen (1977) and by Anderson, Silverstein, Ritz, and Jones (1977), among others. Until now, the weight matrix of the system has been taken as given; the question here is how a weight matrix capable of producing a desired input-output behavior can be chosen.

Let us begin with the single-unit model discussed earlier. The unit has a weight vector w, and when an input vector v is presented, the unit's output is the inner product w . v. Suppose that we wish the unit to produce a particular scalar output u when v is presented. The problem is then one of finding a weight vector w such that w . v = u. Geometrically, the solution is not unique: any weight vector whose projection onto the direction of v is of the appropriate length gives the same inner product, and these vectors lie along the dotted line shown in Figure 24. A particularly simple choice is to make the weight vector point in the same direction as v; that is, to let w = cv for some scalar c. Simple algebra using the linearity of the inner product shows what the scalar must be: we need w . v = (cv) . v = c(v . v) = u. If we make the assumption, as we will for the rest of this section, that all input vectors are of unit length, then v . v = 1 and c is just the desired output itself, so that w = uv. The weight vector is the input vector multiplied by the desired output.

To generalize this scheme to a system with more than one output unit, recall that the rows of the weight matrix W are the weight vectors of the output units. Each row of W is involved in an inner product with the
FIGURE 24. [The input vector v and the line (dashed) of weight vectors all having the same inner product with v.]
input vector v , and these inner products are the components of the output vector u . To implement a learning scheme, we need to be able to
choose weight vectors that produce the desired components of u .
Clearly, for each component, we can use the scheme already described
for the single unit model above. In other words, the ith weight vector
should be given by
    wi = ui v.    (22)

Each weight vector is thus the input vector multiplied by the scalar ui, so each row of W is a scalar multiple of v. As we have seen, a matrix whose ith row is ui times the vector v is an outer product, so the whole weight matrix can be written

    W = uv^T.

We can check the correctness of this choice by calculating the output when v is presented:

    Wv = (uv^T)v = u(v^T v) = u,

where the last step uses the fact that v^T v = 1 for an input vector of unit length. Thus, when v is presented, the system produces the desired output vector u. The fact that the appropriate weight matrix is the outer product of the desired output vector with the input vector has important implications for the implementation of learning in PDP networks, as will now be seen.
This scheme can be generalized to allow the same weight matrix to make several associations at once. Let us assume that we have input vectors v1, . . . , vn that we wish to associate with output vectors u1, . . . , un. In other words, for each i, we wish to have ui = Wvi. Let us further assume that the input vectors are orthonormal:

    vi^T vj = 1 if i = j
              0 otherwise.

For each pair, form the weight matrix developed above:

    Wi = ui vi^T.

Finally, we form a composite weight matrix by summing the individual matrices:

    W = W1 + W2 + . . . + Wn.

When the input vector vi is presented to this composite system, the orthonormality of the input vectors guarantees that only the ith term contributes, so the output is the desired vector ui.
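This outer-product construction is easy to exercise numerically; a minimal sketch, assuming orthonormal inputs (all pattern values below are our own examples):

    import numpy as np

    # Orthonormal input patterns (the standard basis here, for simplicity)
    # paired with arbitrary target outputs.
    v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    u1, u2 = np.array([3.0, -1.0]), np.array([0.5, 2.0])

    # Hebbian outer-product learning: sum the matrices u_i v_i^T.
    W = np.outer(u1, v1) + np.outer(u2, v2)

    # With orthonormal inputs, each association is recalled exactly.
    assert np.allclose(W @ v1, u1)
    assert np.allclose(W @ v2, u2)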
When the set of input vectors is not orthogonal, the Hebb rule will not correctly associate the input vectors with the output vectors. However, a modification of the rule, known as the delta rule or the Widrow-Hoff rule, can make such associations. The requirement for the delta rule to work is that the input vectors be linearly independent. The delta rule is discussed further in Chapter 11, and at length in Kohonen (1977).
Earlier it was discussed how, at least for square matrices, knowledge of the eigenvectors of a matrix permits an important simplification to be made. The matrix multiplication of a vector can be replaced by scalar multiplication (cf. Equation 21). I will now show that the Hebbian learning scheme fits nicely with the notion of eigenvectors. Suppose that we wish to associate vectors with scalar copies of themselves. This is what is done, for example, in an auto-associator like those discussed in J. A. Anderson et al. (1977); see Chapters 2 and 17. In other words, we want the vectors ui to be of the form ui = λi vi, where the vi are the orthonormal input vectors. Let us further assume that the n scalars λi are distinct.
Using the outer product learning rule, we have

    W = W1 + . . . + Wi + . . . + Wn,

where

    Wi = ui vi^T = λi vi vi^T.

If we now present the vector vi to the matrix W thus formed, we have

    Wvi = (W1 + . . . + Wi + . . . + Wn) vi.
Because the input vectors are orthonormal, only the ith term survives, and we obtain

    Wvi = λi vi (vi^T vi) = λi vi.

Thus, each of the stored input vectors is an eigenvector of the matrix W, with the corresponding scalar λi as its eigenvalue.
MATRIX INVERSES
Throughout this chapter, I have discussed the linear vector equation
u ~ Wv . First , I discussed the situation in which v was a known vector
and W a known matrix . This corresponds to knowing the input to a
system and its matrix , and wanting to know the output of the system.
Next , I discussed the situation in which v and u were known vectors,
and a matrix W was desired to associate the two vectors. This is the
learning problem discussed in the previous section. Finally, in this section, I discuss the case in which both u and W are known, but v is
unknown . There are many situations in which this problem arises,
including the change of basis discussed in the next section.
As we will see, the solution to this problem involves the concept of a
matrix inverse. Let us first assume that we are dealing with square
matrices. The inverse of a matrix W , if it exists, is another matrix
denoted W^-1 that obeys the following equations:

    W^-1 W = I

    W W^-1 = I.
Example: if

    W = [ 1  1/2]
        [-1   1 ]

then

    W^-1 = [2/3 -1/3]
           [2/3  2/3],

as can be checked by carrying out the multiplications W^-1 W and W W^-1 and verifying that each yields the identity matrix I.

The inverse can be used to solve the equation u = Wv for an unknown v, because multiplying both sides by W^-1 gives W^-1 u = W^-1 (Wv) = Iv = v. For example, if u = (3, 3), then

    v = W^-1 u = [2/3 -1/3] [3]   [1]
                 [2/3  2/3] [3] = [4],

and indeed

    Wv = [ 1  1/2] [1]   [3]
         [-1   1 ] [4] = [3] = u.
It is important to realize that W^-1, despite the new notation, is simply a matrix like any other. Furthermore, the equation v = W^-1 u is nothing more than a linear mapping of the kind we have studied throughout this chapter. The domain of this mapping is the range of
W, and the range of the mapping is the domain of W. This inverse relationship is shown in Figure 25. The fact that W^-1 represents a function from one vector space to another has an important consequence. For every u in the domain of W^-1, there can be only one v in the range such that v = W^-1 u. This is true because of the definition of
a function . Now let us look at the consequence of this fact from the
point of view of the mapping represented by W . If W maps any two
distinct points VI and V2 in its domain to the same point u in its range,
that is, if W is not one-to-one, then there can be no W- I to represent
the inverse mapping.
We now wish to characterize matrices that can map distinct points in
the domain to a single point in the range, for these are the matrices
that do not have inverses. To do so, first recall that one way to view
the equation u ~ W v is that u is a linear combination of the column
vectors of W . The coefficients of the linear combination are the components of v . Thus, there is more than one v which maps to the same
point u exactly in the case in which there is more than one way to write
u as a linear combination of the column vectors of W. These are completely equivalent statements. As discussed earlier, we know that a vector u can be written as a unique linear combination of a set of vectors only in the case where the vectors are linearly independent. Otherwise, if the vectors are linearly dependent, then there are an infinite
number of ways to write u as a linear combination . Therefore , we have
the result that a matrix has an inverse only if its column vectors are
linearly independent.
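A quick numerical check of this use of the inverse, with the matrix from the example above (our own code):

    import numpy as np

    W = np.array([[ 1.0, 0.5],
                  [-1.0, 1.0]])
    u = np.array([3.0, 3.0])

    W_inv = np.linalg.inv(W)       # exists because the columns of W
    v = W_inv @ u                  # are linearly independent
    print(v)                       # [1. 4.]
    assert np.allclose(W @ v, u)

    # In practice np.linalg.solve(W, u) is preferred to forming the inverse.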
For square matrices with linearly dependent column vectors and for
non-square matrices, it is possible to define an inverse called the generalized inverse, which performs part of the inverse mapping. In the
case in which an infinite number of points map to the same point , there
will be an infinite number of generalized inverses for a particular
matrix , each of which will map from the point in the range to one of
the points in the domain .
FIGURE 25. [The mapping W from its domain to its range, and the inverse mapping W^-1 from the range back to the domain.]
In summary, the matrix inverse W^-1 can be used to solve the equation u = Wv, where v is the unknown, by multiplying u by W^-1. The inverse exists only when the column vectors of W are linearly independent. Let me mention in passing that the maximum number of linearly independent column vectors of a matrix is called the rank of the matrix.7 An n x n matrix is defined to have full rank if the rank is equal to n. Thus, the condition that a matrix have an inverse is equivalent to the condition that it have full rank.

CHANGE OF BASIS
As was discussed earlier, a basis for a vector space is any set of linearly independent vectors that span the space. Although we most naturally tend to think in terms of the standard basis, for a variety of reasons it is often convenient to change to some other basis. When numbers are used to represent a vector, it should be remembered that the numbers are relative to a particular choice of basis; the vector itself, and its relation to other vectors, does not change when the basis changes.

In Figure 26, there is a vector v, written in the standard basis as v = (2, 1). We now change basis by choosing two new basis vectors, y1 = (1, -1) and y2 = (1/2, 1). As shown in Figure 27, v can be written as a linear combination of y1 and y2. It turns out, as we will see below, that the coefficients of the combination are 1 and 2; these coefficients represent v in the new basis, so v* = (1, 2).
7 An important
theorem in linear algebra establishes that , for any matrix , the maximum number of linearly independent column vectors is equal to the maximum number
of linearly independent row vectors. Thus, the rank can be taken as either .
FIGURE 26. [The vector v plotted in the standard basis.]
We now want to show how to find the coordinates of a vector in the new basis y1, y2, . . . , yn. These coordinates are simply the coefficients ci in the equation

    v = c1y1 + c2y2 + . . . + cnyn.    (23)
Let us form a matrix Y whose columns are the new basis vectors yi, and let v* be the vector whose components are the ci. Then Equation 23 is equivalent to the following equation:

    v = Yv*.    (24)
Multiplying both sides of Equation 24 by Y^-1, we obtain the solution for the new coordinates:

    v* = Y^-1 v.

Example: letting y1 = (1, -1) and y2 = (1/2, 1), we have

    Y = [ 1  1/2]      Y^-1 = [2/3 -1/3]
        [-1   1 ],            [2/3  2/3].
FIGURE 27. [The vector v written as a linear combination of the new basis vectors y1 and y2.]
Thus,

    v* = Y^-1 v = [2/3 -1/3] [2]   [1]
                  [2/3  2/3] [1] = [2].
Notice that we have also solved the inverse problem along the way. That is, suppose that we know the coordinates v* in the new basis, and we wish to find the coordinates v in the old basis. This transformation is that shown in Equation 24: We simply multiply the vector of new coordinates by Y.
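A small check of Equations 23 and 24 with the basis of the example (our own code):

    import numpy as np

    # Columns of Y are the new basis vectors y1 = (1, -1), y2 = (1/2, 1).
    Y = np.array([[ 1.0, 0.5],
                  [-1.0, 1.0]])
    v = np.array([2.0, 1.0])           # coordinates in the standard basis

    v_star = np.linalg.inv(Y) @ v      # new coordinates: (1, 2)
    assert np.allclose(Y @ v_star, v)  # Equation 24: v = Y v*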
We have shown how to represent vectors when the basis is changed. Now, let us accomplish the same thing for matrices. Let there be a square matrix W that transforms vectors in accordance with the equation u = Wv. Suppose we now change basis and write v and u in the new basis as v* and u*. We want to know if there is a matrix that does the same thing in the new basis as W did in the original basis. In other words, we want to know if there is a matrix W' such that u* = W'v*. This is shown in the diagram in Figure 28, where it should be remembered that v and v* (and u and u*) are really the same vector, just described in terms of different basis vectors.

To see how to find W', consider a somewhat roundabout way of solving u* = W'v*. We can convert v* back to the original basis, then map from v to u using the matrix W, and finally convert u to u*.
FIGURE 28. [The mapping W' from v* to u* in the new basis is equivalent to converting with Y, applying W, and converting back with Y^-1.]
Luckily, we already know how to make each of these transformations; they are given by the equations:

    v = Yv*
    u = Wv
    u* = Y^-1 u.

Putting the three steps together,

    u* = Y^-1 u = Y^-1 Wv = Y^-1 WYv*.

Thus, W' must be equal to Y^-1 WY. Matrices related by an equation of the form W' = Y^-1 WY are called similar.
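A minimal sketch of the similarity relation (the matrix W here is our own example):

    import numpy as np

    W = np.array([[1.0, 2.0],
                  [0.0, 3.0]])
    Y = np.array([[ 1.0, 0.5],
                  [-1.0, 1.0]])
    Y_inv = np.linalg.inv(Y)

    W_prime = Y_inv @ W @ Y            # the similar matrix W' = Y^-1 W Y

    v = np.array([2.0, 1.0])
    v_star = Y_inv @ v                 # v in the new basis
    u = W @ v                          # map in the old basis
    u_star = W_prime @ v_star          # map in the new basis
    assert np.allclose(u_star, Y_inv @ u)   # same vector, new coordinates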
One aspect of this discussion needs further elaboration. We have seen that the entries of a matrix, like the components of a vector, are relative to a particular choice of basis: they are simply numbers used for representing an underlying mapping, just as coordinates are numbers used for representing vectors. When the basis changes, the numbers change according to the equation W' = Y^-1 WY. The underlying mapping, which remains the same when the matrix W is used in the original basis and the matrix W' is used in the new basis, is called a linear transformation. The same linear transformation is represented by different matrices in different bases.
Let us now consider a particularly worthwhile choice of basis for a linear PDP model: the eigenvectors of the weight matrix. Let W be a matrix with n distinct eigenvalues λi, and let Y be the matrix whose columns are the corresponding eigenvectors yi. For each of the eigenvectors we can write Wyi = λi yi. Collecting these n equations into a single matrix equation, we have

    WY = YΛ,    (25)

where Λ is the diagonal matrix whose diagonal entries are the eigenvalues. (You should try to convince yourself that the placement of Λ to the right of Y is correct: W multiplies each column of Y, whereas Λ scales each column.) If we now premultiply both sides of Equation 25 by Y^-1, we find

    Y^-1 WY = Λ.

That is, when the basis is changed to the set of eigenvectors, the matrix that does the same thing in the new basis as W did in the original basis is simply a diagonal matrix with the eigenvalues along the main diagonal. This is nothing more than a restatement, from the change-of-basis perspective, of results seen earlier: in the eigenvector basis, the behavior of the model is particularly simple, each coordinate of the input vector being multiplied by the corresponding eigenvalue (cf. Equation 21).

One more question should be considered: how does the linear structure of a set of vectors behave under a change of basis? Let a vector w be written as a linear combination of two vectors, w = av1 + bv2. Then, premultiplying by Y^-1, we have

    Y^-1 w = Y^-1 (av1 + bv2)
           = aY^-1 v1 + bY^-1 v2
           = av1* + bv2*.
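A sketch of Equation 25 and the diagonalizing change of basis (the example matrix is our own; a symmetric matrix is used so the eigenvalues are real):

    import numpy as np

    W = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    lam, Y = np.linalg.eig(W)             # columns of Y are eigenvectors
    Lam = np.diag(lam)

    assert np.allclose(W @ Y, Y @ Lam)    # Equation 25: WY = Y Lambda
    W_prime = np.linalg.inv(Y) @ W @ Y    # change to the eigenvector basis
    assert np.allclose(W_prime, Lam)      # W' is diagonal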
The coefficients in the linear combination are the same in the old and in the new basis. The equations show that this result holds because change of basis is a linear operation.

The behavior of a linear PDP model depends entirely on the linear structure of the input vectors. That is, if w = av1 + bv2, then the response of the system to w is determined by its response to v1 and v2 and the coefficients a and b. The fact that a change of basis preserves
the linear structure of the vectors shows that it is this linear structure that is important, not the particular basis used to describe the vectors.
NONLINEAR
SYSTEMS
deemed necessary.8 Although these reasons are based on the desire for behaviors outside the domain of linear systems, it should be stated that many of the nonlinearities represent comparatively small changes to underlying models which are linear in themselves. Other models are more fundamentally nonlinear. Further discussions of nonlinear systems and the mathematics appropriate to them can be found elsewhere in this book.

Let us begin with the single unit with one output that computes the inner product of its weight vector and the input vector. This is a linear system,
given the linearity of the inner product . The geometrical properties of
the inner product led us to picture the operation of this system as com puting the closeness of input vectors to the weight vector in space.
Suppose we draw a line perpendicular to the weight vector at some
point , as in Figure 29. Since all vectors on this line project to the same
point on the weight vector , their inner products with the weight vector
are equal . Furthermore , all vectors to the left of this line have a
smaller inner product , and all vectors to the right have a larger inner
product . Let us choose a fixed number as a threshold for the unit by
requiring that if the inner product is greater than the threshold, the unit outputs a 1; otherwise it outputs a 0. Such a unit converts the graded information carried by the inner product into a binary decision.
FIGURE 29. [A line drawn perpendicular to the weight vector w: all input vectors on the line have the same inner product with w; vectors on one side have smaller inner products, and vectors on the other side have larger ones.]
Systems that take their input from this unit could choose completely different behaviors based on the decision. Notice also that the unit is a categorizer: All input vectors that are on the same side of the space lead to the same response.
To introduce a threshold into the mathematical description of the processing unit, it is necessary to distinguish between the activation of the unit and its output. A function relating the two quantities is shown in Figure 30. It produces a one or a zero based on the magnitude of the activation. It is also possible to have a probabilistic threshold. In this case, the farther the activation is above the threshold, the more

FIGURE 30. [A threshold output function: the output as a function of the activation, stepping from 0 to 1 at the threshold.]
420
FORMALANALYS~
likely the unit is to have an output of one, and the farther the activation is below the threshold , the more likely the unit is to have an output of zero. Units such as these are discussed in Chapters 6 and 7.
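A minimal sketch of such a threshold unit (the weight vector, inputs, and threshold value are our own examples):

    import numpy as np

    def threshold_unit(w, v, theta):
        """Output 1 if the inner product w.v exceeds the threshold, else 0."""
        activation = w @ v
        return 1 if activation > theta else 0

    w = np.array([1.0, 2.0])
    print(threshold_unit(w, np.array([1.0, 1.0]), theta=2.5))   # 3.0 > 2.5 -> 1
    print(threshold_unit(w, np.array([1.0, 0.5]), theta=2.5))   # 2.0 < 2.5 -> 0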
The threshold unit is a good example of many of the nonlinearities
that are to be found in PDP models. An underlying linear model is
modified with a nonlinear function relating the output of a unit to its
activation . Another related example of such a nonlinearity is termed
subthreshold summation. It is often observed in biological systems that
two stimuli presented separately to the system provoke no response,
although when presented simultaneously, a response is obtained. Furthermore, once the system is responding, further stimuli are responded
to in a linear fashion. Such a system can be modeled by endowing a
linear PDP unit with the nonlinear output function in Figure 31. Note
that only if the sum of the activations produced by vectors exceeds T
will a response be produced. Also , there is a linear range in which the
system responds linearly . It is often the case in nonlinear systems that
there is such a linear range, and the system can be treated as linear provided that the inputs are restricted to this linear range.
One reason why subthreshold summation is desirable is that it
suppresses noise. The system will not respond to small random inputs
that are assumed to be noise.
All physical systems have a limited dynamic range. That is, the
response of the system cannot exceed a certain maximum response.
This fact can be modeled with the output function in Figure 32, which
shows a linear range followed by a cutoff . The system will behave
linearly until the output reaches M , at which point no further increase
can occur. In Figure 33, a nonlinear function is shown which also has a linear range and a maximum, but which approaches the maximum smoothly rather than abruptly.
FIGURE 31. [Subthreshold summation: the output is zero until the activation exceeds T, then increases linearly.]
FIGURE 32. [An output function with a linear range followed by a cutoff at the maximum output M.]

FIGURE 33. [A smooth output function that approaches the maximum output M gradually.]
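The output functions of Figures 30 through 33 can be summarized in a few lines of NumPy; a sketch, with the smooth saturating function assumed to be a scaled logistic (the parameter names are ours):

    import numpy as np

    def hard_threshold(a, theta):
        return (a > theta).astype(float)   # Figure 30

    def subthreshold_sum(a, T):
        return np.maximum(a - T, 0.0)      # Figure 31: zero below T, linear above

    def clipped_linear(a, M):
        return np.clip(a, 0.0, M)          # Figure 32: linear range, cutoff at M

    def smooth_saturation(a, M):
        return M / (1.0 + np.exp(-a))      # Figure 33: approaches M gradually

    a = np.linspace(-3.0, 3.0, 7)          # a range of activation values
    print(hard_threshold(a, 0.0))
    print(subthreshold_sum(a, 1.0))
    print(clipped_linear(a, 2.0))
    print(smooth_saturation(a, 2.0))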
FURTHER
READING
Halmos, P. R. (1974). Finite-dimensional vector spaces. New York: Springer-Verlag. For the more mathematically minded. An excellent account of linear algebra from an abstract point of view.

Kohonen, T. (1977). Associative memory: A system theoretic approach. Berlin: Springer-Verlag. This book has a short tutorial on linear algebra.
CHAPTER 10

The Logic of Activation Functions

R. J. WILLIAMS
examples of this are the concepts of spatial filtering and spatial Fourier analysis in the visual system and the concept of correlational processing in matrix models of associative memory (Kohonen, 1977). Chapter 9 describes the relevant mathematics for this approach, that of linear algebra.

This chapter explores some ideas motivated by the first of these two views of a PDP unit's computation (i.e., as some generalization of the notion of a Boolean function), but the approach is implicitly based on a very liberal interpretation of what this means. Essentially, the only structure assumed for the set of confidence values is that it be a totally ordered set with a Boolean interpretation of its endpoints. While a fully developed mathematical theory along these lines would deal with those properties that are invariant under any transformations preserving this structure, the ideas presented here do not go this far.

The specific program to be embarked upon here is probably best described as an exploratory interweaving of several threads, all related to these notions of logical computation and their potential applicability to the study of activation functions. First, the point of view is taken that any function whatsoever is a candidate for being an activation function. From this perspective, the traditional linear and thresholded linear activation functions may be viewed as very isolated examples from a much larger range of possibilities. Next, several ways to shrink this vast space of possibilities are suggested. One way proposed here is the imposition of a constraint based on the requirement that the notion of excitatory or inhibitory input be meaningful. Another way is the introduction of an equivalence relation on activation functions based on invariance under transformations preserving the logical and ordinal structure. Finally, an investigation is carried out to determine just where certain familiar types of activation functions, built out of the more traditional ingredients such as additive, subtractive, and multiplicative interactions among input values and weights, fit into this scheme. As a by-product of this development, some elementary results concerning implementation of Boolean functions via real-valued functions are also obtained.

This last aspect is closely related to what is historically one of the oldest formal approaches to the theory of neural computation, in which neurons are treated as Boolean devices. This approach was pioneered by McCulloch and Pitts (1943); an introductory overview of this whole subject can be found in the text by Glorioso and Colon Osorio (1980). An important influence on much of the work done in this area has been the perceptron research of Rosenblatt (1962; Minsky & Papert, 1969).

In what follows, several simplifying assumptions will be made. The first is that the range of values over which each input to a unit may
vary is the same closed unit interval [0, 1], the confidence interval. The output of the unit is assumed to range over this same interval and is identified with the unit's activation, so the activation function computes the unit's output directly from the values on its input lines. A further simplifying assumption is that time may be ignored: the activation function is just a function from n-tuples of values in the confidence interval into the confidence interval, with the dynamics of the unit left unspecified. In order to avoid cluttering the presentation, the key results are given here with detailed proofs omitted; short sketches indicating the basic steps are given instead, and a more detailed and rigorous formulation may be found in Williams (1983).

EXAMPLES OF ACTIVATION RULES

Example 1. A linear activation function has the form

    a(x1, . . . , xn) = w1x1 + w2x2 + . . . + wnxn,

where the wi are fixed weights. Activation functions of this form are commonly used in linear models (e.g., Kohonen, 1977).

Example 2. a = f o g, where g is a linear function and f is a nondecreasing function from R into the confidence interval. Such a function f is sometimes called a squashing or thresholding function, and an activation function of this form will be called quasi-linear. The obvious example is the simple threshold logic unit, the basis of the perceptron (Rosenblatt, 1962; Minsky & Papert, 1969), in which f is a step function that jumps from 0 to 1 at a fixed threshold.

Example 3. a(x1, . . . , xn) = w1x1x2 + w2x3x4 + . . . , in which fixed weights multiply products of pairs of inputs.

Example 4. a(x1, . . . , xn) = x1x2 + x3x4 + . . . + xn-1xn. This is a variant of the preceding example
except that the coefficients have now become explicit inputs . This type
of activation function will be called a gating activation function because
the odd -numbered inputs gate the even -numbered ones (and vice versa ) .
Example 5. a = f o g, where g is an arbitrary multilinear function and f is a nondecreasing function from R into the confidence interval. That is, g is of the form

    g(x1, . . . , xn) = Σj wj Πi∈Sj xi,

where each Sj is an element of the power set (i.e., the set of subsets) of {1, . . . , n} and the coefficients wj are called weights. Such an activation function is called quasi-multilinear. (We might also call this a sigma-pi activation function, to emphasize its relationship to the sigma-pi units introduced earlier.) Note that Examples 3 and 4 are just special cases of this form.
Fixing the values of all coordinates but the kth yields a family of functions of the single variable xk, parameterized by the remaining coordinates. Such a function is called a
section of the original function a along the kth coordinate. Note that
there is one such section along the kth coordinate for each possible
combination of values for the remaining n - 1 coordinates. Now make
the following definitions : 1
1. a is monotonic-in-context along the k th coordinate if all its sections along the k th coordinate are monotonic .
2. a is uniformly monotonic in the kth coordinate if all sections
along the k th coordinate are monotonic and have the same
sense (i .e., all are nondecreasing or all are nonincreasing) .
3. a is monotonic-in-context if it is monotonic -in-context along all
its coordinates
.
used to set the context for the computation of its output as a function
of the remaining inputs , and each input in this latter group has purely
excitatory or purely inhibitory effect on the unit's output in this particular context. Whether this turns out to be a useful way to view the
monotonic -in -context activation function and its possible role in activa tion models will not be explored here . The main reason for introducing
the concept is simply that it appears to be the strongest variant on
monotonicity satisfied by any activation function capable of computing
an arbitrary Boolean function (such as the multilinear and sigma -pi
activation functions , as will be seen later ) .
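For instance, XOR, which no linear or quasi-linear activation function can realize, has a simple multilinear realization; a small sketch (our own example):

    def xor_multilinear(x1, x2):
        """Multilinear function agreeing with XOR at the vertices of [0,1]^2."""
        return x1 + x2 - 2.0 * x1 * x2

    for x1 in (0.0, 1.0):
        for x2 in (0.0, 1.0):
            print(x1, x2, xor_multilinear(x1, x2))   # 0, 1, 1, 0 at the vertices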
In order to capture the notion of an activation function being simply an extension of a Boolean function, define an activation function to be Boolean-like if its restriction to the vertices of the unit hypercube (i.e., to inputs whose values are all 0 or 1) takes only the values 0 and 1; in other words, an activation function is Boolean-like if it computes a Boolean function when restricted to the vertices. It is also useful to define two activation functions to be vertex-equivalent if they agree at all the vertices. There may be a certain interchangeability between different activation functions that are vertex-equivalent, in that the logic of a unit's computation might be considered to reside solely in what it does when all input lines are set to their extreme values (corresponding to true or false). If two vertex-equivalent activation functions are additionally monotonic-in-context and continuous, then an even stronger sense of their interchangeability might be claimed, but this will not be pursued here.

ILLUSTRATION OF THESE CONCEPTS
The concepts just introduced can be illustrated with some simple two-dimensional examples. The figures that follow each display a single activation function defined on the unit square, as a three-dimensional plot together with a contour plot and some of its sections; the captions point out the essential observations. Among the functions shown are Boolean-like activation functions realizing the OR function and the XOR (exclusive or) function at the vertices. One realization of OR is uniformly monotonic, while another, rather pathological, realization of the very same Boolean function fails to be monotonic, although the two are vertex-equivalent. The realization of XOR is monotonic-in-context but, as will be seen below, cannot be uniformly monotonic. A further example shows a function whose restriction to the vertices is constant while its values in the interior of the confidence interval are not, suggesting that the vertex behavior of a unit does not fully determine how it behaves on intermediate confidence values. A PDP network taking its inputs from such units might thus behave very differently depending on which realization of a given Boolean function is used.
SOME RESULTS

Before stating the main results, it will be helpful to make our notation clear. Boolean functions of the n variables x1, . . . , xn will be used in two kinds of context: as maps from the set of vertices {0, 1}^n into {0, 1}, and as formal expressions built from the variables. In formal expressions, conjunction is denoted by juxtaposition, disjunction by +, and negation by an overbar; whether + denotes the Boolean or the real algebraic operation, and whether juxtaposition denotes conjunction or real multiplication, will always be clear from context. The mapping assigning to each formal expression a Boolean function on the vertices is defined in the standard manner. For example, the expression x1x̄2 + x̄1x2 (the exclusive or of x1 and x2) yields 1 when applied to the vertex (0, 1) and 0 when applied to the vertex (1, 1).
[Figure: three-dimensional plot, contour plot, and sections of a Boolean-like activation function over the unit square; axes x1, x2, a.]
[Figure: three-dimensional plot, contour plot (contours a = 0, .25, .5, .75, 1), and sections of a second activation function; axes x1, x2, a.]
[Three-dimensional plot, contour plot, and sections for Figure 6; axes x1, x2, a.]
FIGURE 6. a(x1, x2) = min(1, x1 + x2). A: Three-dimensional plot. The cube is bounded by the planes where each coordinate is 0 or 1. B: Contour plot. C: Some sections along x1. Note that the three-dimensional plot of this function consists of two planar surfaces. Evidently, each section along x1 is a nondecreasing function; by symmetry the same is true of each section along x2. Thus this function is uniformly nondecreasing.
[Three-dimensional plot and contour plot for Figure 7; axes x1, x2, a; contours a = 0, .25, .5, .75, 1.]

FIGURE 7. A: Three-dimensional plot; the cube is bounded by the planes where each coordinate is 0 or 1. B: Contour plot. C: Some sections along x1. The sections along x1 are linear functions with slopes ranging from 1 to -1; by symmetry the same is true of the sections along x2. Thus this function is not uniformly monotonic.
[Figures: three-dimensional plots, contour plots, and sections of two further activation functions; axes x1, x2, a; contours a = 0, .25, .5, .75, 1.]
The real-valued function corresponding to a Boolean expression is defined by replacing:

1. True by 1.
2. False by 0.
3. The disjunction operator by addition.
4. The conjunction operator by multiplication.
5. The negation of a variable, x̄, by 1 - x.
6. Each Boolean variable by the corresponding real variable.
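Applying this replacement scheme to the XOR expression x1x̄2 + x̄1x2 yields a real-valued multilinear function; a small sketch (our own code):

    def xor_from_expression(x1, x2):
        """x1*(1 - x2) + (1 - x1)*x2, obtained from the Boolean expression
        x1 x2-bar + x1-bar x2 by the replacement scheme above."""
        return x1 * (1.0 - x2) + (1.0 - x1) * x2

    # Agrees with Boolean XOR at the vertices of the unit square.
    for x1 in (0.0, 1.0):
        for x2 in (0.0, 1.0):
            assert xor_from_expression(x1, x2) == (x1 != x2)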
Boolean functions that can be realized by quasi-linear activation functions are called linearly separable. It is easily seen that every linearly separable function can be realized by a uniformly monotonic activation function, but the converse is not true: uniformly monotonic Boolean functions are not necessarily linearly separable. An example is the function β1(x1, x2, x3, x4) = x1x2 + x3x4, which is uniformly monotonic but not linearly separable. The XOR function β2(x1, x2) = x1x̄2 + x̄1x2 is not even uniformly monotonic; the easiest way to see that XOR cannot be realized by any quasi-linear activation function is to observe that any quasi-linear function is uniformly monotonic, but XOR is not.

Our next result shows that the very general class of all activation functions may be represented, up to vertex-equivalence, by the narrower and fairly simple class of multilinear activation functions. A lemma behind this result is that a multilinear function is uniquely determined by its values at the vertices.

Theorem 2. Every activation function is vertex-equivalent to a multilinear activation function.

The next result suggests that monotonicity-in-context is a property enjoyed by a wide variety of activation functions.

Theorem 3. Every multilinear activation function is monotonic-in-context.
Theorem 4. A multilinear activation function is uniformly monotonic if and only if its restriction to vertices is uniformly monotonic.

The key step in the proof of this result is the observation that a multilinear function may be built up inductively through linear interpolation, starting with the values at the vertices. This follows from the fact that a multilinear function is linear in each variable when the other variables are held constant. The remainder of the proof consists of verifying that each step of this inductive construction preserves uniform monotonicity.
This theorem may be generalized to cover arbitrary senses of uniform monotonicity by running any inputs for which the activation function is nonincreasing through the "inverter" f(x) = 1 - x. Thus the general class of all uniformly monotonic Boolean-like activation functions may be represented up to vertex-equivalence by a narrower class of sigma-pi activation functions of a certain form.

It is instructive to contrast the sigma-pi activation functions which result from applying Theorems 1 and 5 to a particular uniformly monotonic activation function. Consider the Boolean function of six variables β3(x1, x2, x3, x4, x5, x6) = x1x2 + x3x4 + x5x6. Theorem 1 realizes this using the vertex normal form, which, after simplification, becomes a sigma-pi function in the six inputs.
CONCLUSION

The relationship between the computations performed by individual units and the behavior of whole networks is difficult to foresee. One way this gap between local and global computation might be bridged is by dealing with questions of learning in such networks. The goal of learning is generally to cause the network to have a particular global behavior, but the learning rules themselves must operate locally, on the activation functions and weights of individual units.
ACKNOWLEDGMENTS
This research was supported by a grant to David Zipser from the System Development Foundation. I am also grateful to James McClelland and David Zipser for their many helpful comments and suggestions.
CHAPTER
11
An Analysis of the Delta Rule
and the Learning of Statistical Associations
G. O. STONE
A central problem for learning in PDP systems is to specify how the weights should be modified when a desired behavior is not achieved. An early and influential answer to this question was proposed by
Widrow and Hoff (1960). This learning rule, which has been analyzed
and employed by a number of authors (Amari , 1977a, 1977b ;
Kohonen , 1974, 1977; Sutton & Barto , 1981) , has been called the
Widrow -Hoff rule by Sutton and Barto (1981) and is generally referred
to as the delta rule in this book . This rule is introduced in Chapter 2,
discussed extensively and generalized in Chapter 8, and employed in
models discussed in a number of chapters - most notably Chapters 17
and 18. In the present chapter I show how concepts from linear algebra
and vector spaces can provide insight into the operation of this learning
mechanism. I then show how this mechanism can be used for learning
statistical relationships between patterns, and finally show how the delta
rule relates to multiple linear regression. Concepts from linear algebra
are used extensively; for explanation of these concepts, especially as
applied to PDP models, the reader is referred to Chapter 9.
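For concreteness, here is a minimal sketch of the delta rule as it is used in this chapter (the patterns, learning rate, and trial count are our own choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Linearly independent input patterns and arbitrary targets.
    inputs  = [np.array([1.0, 0.0]), np.array([0.7, 0.7])]
    targets = [np.array([1.0, -1.0]), np.array([0.0, 2.0])]

    W = np.zeros((2, 2))
    eta = 0.3                                  # learning rate

    for trial in range(500):
        p = rng.integers(len(inputs))          # sample a pattern pair
        i, t = inputs[p], targets[p]
        delta = t - W @ i                      # error: target minus output
        W += eta * np.outer(delta, i)          # delta-rule weight change

    for i, t in zip(inputs, targets):
        assert np.allclose(W @ i, t, atol=1e-6)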
In the Hebbian version of the rule considered first, the weight change produced by input pattern p is

    Δp wji = tpj ipi,

where tpj represents the target or desired activation value for the jth output unit and ipi represents the activation value of the ith input element for pattern p; Δp wji is the change in the weight from the ith input unit to the jth output unit.1

1 Note that this is essentially the Hebbian learning rule, in which the weight change is given by the product of the activation values of the input and output units. In the rule described here, the activation of each output unit is assumed to be identically the target value, so it is the desired output, rather than the obtained output, that determines the change.

In matrix notation, the change for the entire weight matrix produced by pattern p is the outer product of the target vector and the input vector:

    Δp W = tp ip^T,
where, as usual, bold letters indicate vectors and uppercase bold letters indicate matrices. The composite Hebbian weight matrix is the sum of all of the changes:

    W = t1 i1^T + t2 i2^T + . . . + tP iP^T.

The delta rule differs in that the weight change on each trial is based not on the target itself but on the error, the difference between the target and the output actually obtained. After learning trial n, the weight matrix is

    W(n) = W(n - 1) + η δ(n) i^T(n),    (1)

where η is the learning rate and the error δ(n) is the desired output minus the output produced by the previous weight matrix:

    δ(n) = t(n) - W(n - 1) i(n).    (2)
The Delta Rule in Pattern-Based Coordinates

To this point we have discussed the delta rule for what Smolensky calls a local representation, in which there is a separate component for each concept. In general, the input and output patterns
correspond to an arbitrary set of vectors . Interestingly , it is possible to
show that the delta rule applies only to the " structure " of the input and
output vectors and not to other details of the representation . In a
linear system , it is only the pattern of correlations among the patterns
that matters, not the contents of the specific patterns themselves.
We can demonstrate this by deriving the same learning rule following
a change of basis from the unit basis to the pattern basis. Since a detailed discussion of change of basis was given in Chapter 9, the approach is only sketched here. If there are N units, then the vectors are of dimension N. In the unit basis, each component of a vector corresponds to the activation of a single unit. Changing to the pattern basis merely involves transforming the coordinate system so that the
patterns line up with the axes. Figure 1 illustrates a simple case of this
transformation for a two-unit system with two patterns. Figure 1A shows the patterns in the unit basis: Pattern 1, p1, involves both units taking on the activation value +1, while Pattern 2, p2, has the first unit at +1 and the second at -1. The patterns are described by the vectors

    p1 = [+1, +1] and p2 = [+1, -1].
Figure 1B shows the same two vectors, but now expressed with respect to a new coordinate system whose axes line up with the patterns rather than with the units. In the new coordinates the vectors are

    p*1 = [1, 0] and p*2 = [0, 1].
FIGURE 1. [A: The patterns <+1, +1> and <+1, -1> plotted in unit coordinates. B: The same patterns in pattern-based coordinates, where they become <+1, 0> and <0, +1>.]

To carry out this transformation in general, we need two basis-change matrices: one that
transforms the input patterns into a coordinate space based on the input
patterns, which we denote PI , and one that transforms the target patterns into a coordinate space based on the target patterns, PT . In this
case, we have i*j = PI ij for the input vectors and t*j = PT tj for the target vectors. Moreover, since the output vectors must be in the same space as the target vectors, we have o*j = PT oj. We must also
transform the weight matrix W to the new basis. Since the weight
matrix maps the input space onto the output space, both transformations must be involved in transforming the weight matrix . We can see
what this transformation must be by considering the job that the weight
matrix must do. Suppose that in the old bases Wi = o for some input i and output o. In the new bases we should be able to write W* i* = o*. Thus, W* PI i = PT o and PT^-1 W* PI i = o = Wi. From this we can readily see that PT^-1 W* PI = W and finally, we can write the appropriate transformation matrix for W as

    W* = PT W PI^-1.
We can multiply both sides of Equation 1 by PT on the left and PI^-1 on the right. This leads to

    PT W PI^-1 (n) = PT W PI^-1 (n - 1) + PT η δ(n) i^T(n) PI^-1,

which, by substitution, can be written as

    W*(n) = W*(n - 1) + η δ*(n) [PI^-1 i*(n)]^T PI^-1,    (3)

or, more compactly,

    W*(n) = W*(n - 1) + η δ*(n) i*^T(n) C,    (4)
where the matrix C, given by C = (PI^-1)^T PI^-1, is a matrix which holds
the correlational information among the original input patterns. To see this, recall that we are changing the input patterns into their pattern basis and the target patterns into their pattern basis. Therefore, the vector i*j consists of a 1 in the jth cell and zeros everywhere else. Thus, since ij = PI^-1 i*j, we see that PI^-1 must be a matrix whose jth column is the jth original input vector. Therefore, C is a matrix with the inner product of the input vectors ii and ij occupying the ith row and jth column. This inner product is the vector correlation between the two patterns.
FIGURE 2. Key-target pairs (x1, . . . , x4 mapped to y1, . . . , y4) and the key correlation structure:

    C = [1.00  .75   0    0  ]
        [ .75 1.00   0    0  ]
        [  0    0  1.00  .25 ]
        [  0    0   .25 1.00].
FIGURE 3. Comparison of unit-based and pattern-based weight matrices after one, four, and eight learning cycles (learning with η = 1.3); the mean squared error is shown for each matrix.
The error for pattern $j$ at trial $n$ is
$$\delta^*_j(n) = t^*_j - W^*(n)\, i^*_j. \qquad (5)$$
Substituting for $W^*(n)$ from Equation 4 gives
$$\delta^*_j(n) = t^*_j - W^*(n-1)\, i^*_j - \eta\, \delta^*_k(n-1)\, i^{*T}_k\, C\, i^*_j,$$
where $k$ is the index of the pattern presented on trial $n$. Simplifying further, we have the recursive form
$$\delta^*_j(n) = \delta^*_j(n-1) - \eta\, \delta^*_k(n-1)\, i^{*T}_k\, C\, i^*_j. \qquad (6)$$
Since the vectors $i^*_j$ and $i^*_k$ consist of a 1 and the rest zeros, the entire expression $i^{*T}_k\, C\, i^*_j$ reduces to $C_{kj}$, the entry in the $k$th row and $j$th column of $C$:
$$\delta^*_j(n) = \delta^*_j(n-1) - \eta\, \delta^*_k(n-1)\, C_{kj}. \qquad (7)$$
In other words, the decrease in error to the $j$th input/output pair due to a new learning trial is a constant times the error pattern on the new learning trial. The constant is given by the learning rate, $\eta$, times the correlation of the currently tested input and the input from the learning trial. Thus, the degree to which learning affects performance on each test input is proportional to its correlation with the pattern just used in learning. Note that if $\eta$ is small enough, the error to the presented pattern always decreases. In this case Equation 7 can be rewritten
$$\delta^*_k(n) = \delta^*_k(n-1)\,(1 - \eta\, C_{kk}).$$
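A small simulation illustrates this decay. The sketch below, our own illustration with arbitrary random patterns and a hypothetical learning rate, trains repeatedly on a single pattern and checks that the error shrinks by exactly the factor $(1 - \eta C_{kk})$ on every trial, as Equation 7 predicts:

    import numpy as np

    rng = np.random.default_rng(0)

    # Random input/target pairs (4 patterns over 8 units) and the
    # delta rule W <- W + eta * (t - W i) i^T, as in Equation 1.
    I = rng.normal(size=(8, 4))          # columns are input patterns
    T = rng.normal(size=(8, 4))          # columns are target patterns
    C = I.T @ I                          # pattern correlations, as above
    eta = 0.02
    W = np.zeros((8, 8))

    # Train repeatedly on pattern k = 0 and watch the error to that
    # pattern shrink by (1 - eta * C_kk) per trial (Equation 7).
    k = 0
    err = np.linalg.norm(T[:, k] - W @ I[:, k])
    for _ in range(5):
        delta = T[:, k] - W @ I[:, k]
        W += eta * np.outer(delta, I[:, k])
        new_err = np.linalg.norm(T[:, k] - W @ I[:, k])
        print(new_err / err, 1 - eta * C[k, k])   # the two agree
        err = new_err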
In summary, this exercise has demonstrated that a mechanism can often be made more conceptually tractable by a judicious transformation. In this case, expressing the possible input and output representations in the appropriate pattern bases clarified the importance, indeed
the sufficiency , of the input "structure " (i .e., the pattern of inner products among the input vectors) in determining the role of the input
representations in learning . Furthermore , converting the weight matrix
into a form from which the errors at any stage of learning can be read
directly allowed us to " see" the learning more obviously . The result
has been a clearer understanding of the operation of the delta rule for
learning .
might vary with regard to pitch, timbre, etc. In this case, the system is simultaneously learning the categories of dog and bark at the same time it is learning the association between the two concepts. In addition, when we have category associations, statistical relationships between the input and output patterns within a category can be picked up. For example, the system could learn that small dogs tend to have high-pitched barks whereas large dogs may tend to have low-pitched barks.
In order to analyze the case of statistical learning, we now treat the input/output pairs of patterns as random variables. In other words, each time pattern $i_j$ is selected as input, its entries can take different values. Similarly, the target output for pair $j$, $t_j$, will have variable entries. The probability distributions of these random variables may take any form whatsoever, but they are assumed not to change over time. Moreover, we can consider the entire set of input/output pairs to form a single probability distribution. We then assume that on each trial an input/output pair is randomly sampled from this overall probability distribution.
We proceed with our analysis of statistical learning by computing the expected or average change in the weight matrix following a presentation. From Equations 1 and 2 we get the following form of the delta rule:
$$W(n) = W(n-1) + \eta\,[t(n) - W(n-1)\,i(n)]\,i^T(n).$$
Simplifying and taking the expected value of each side we have
$$E[W(n)] = E[W(n-1)]\,(I - \eta\, E[i(n)\,i^T(n)]) + \eta\, E[t(n)\,i^T(n)]. \qquad (8)$$
Note that we may take $R_I = E[i(n)\,i^T(n)]$ and $R_{IO} = E[t(n)\,i^T(n)]$ to be the statistical correlations among the input patterns and between the input and target patterns, respectively, giving
$$E[W(n)] = E[W(n-1)]\,(I - \eta R_I) + \eta R_{IO}.$$
If we solve the recursion by replacing $W(n-1)$ with an expression in terms of $W(n-2)$, and so on down to $W(0)$, and assume that $W(0) = 0$, the matrix of all 0 entries, we can write the expected value of the weight matrix after $n$ trials as
$$E[W(n)] = \eta R_{IO} \sum_{j=0}^{n-1} (I - \eta R_I)^j. \qquad (9)$$
To evaluate the limit of this series, note that the pseudo-inverse of a matrix $B$ can be written
$$B^+ = \eta B^T \sum_{j=0}^{\infty} (I - \eta B B^T)^j,$$
which is a matrix that, unlike the inverse, is certain to exist for all $B$, provided $\eta$ is sufficiently small. Writing $R_I = P P^T$, we have
$$\lim_{n\to\infty} E[W(n)] = E[W_\infty] = R_{IO}\,(P^T)^{-1}\left[\eta P^T \sum_{j=0}^{\infty} (I - \eta P P^T)^j\right]. \qquad (10)$$
Now, by substituting in for the pseudo-inverse of $P$ and simplifying, we get
$$E[W_\infty] = R_{IO}\,(P^T)^{-1} P^+ = R_{IO}\,(P P^T)^{-1} \qquad (11)$$
$$= R_{IO}\, R_I^{-1}. \qquad (12)$$
Now we wish to show that, after training, the system will respond appropriately. Without further restrictions, we can demonstrate a minimal appropriateness of the response; namely, we can show that $E[W_\infty i] = E[t]$. In other words, we can show that the mean output of the system, after learning, is the mean target. Since the test trials and learning trials are statistically independent we can write
$$E[W_\infty i] = E[W_\infty]\,E[i].$$
Substituting from Equation 12,
$$E[W_\infty i] = R_{IO} R_I^{-1} E[i] = E[t\,i^T]\,(E[i\,i^T])^{-1}\,E[i].$$
Although it is not generally true that this expression reduces to $E[t]$, the relation does hold for normally distributed patterns, and an approximately normal distribution arises naturally in the situation where it appears. When input patterns being associated are themselves the output of a linear system, each entry in the pattern will be a linear combination of the original inputs' entries. If the patterns have large dimensionality (i.e., there are many components to the vectors), one obtains an approximation to an infinite series of random variables. A powerful central-limit theorem due to Lyapunov (Eisen, 1969, Ch. 13) shows that such a series will converge to a normal distribution so long as several weak assumptions hold (most importantly, the means and variances of each random variable must exist and none of the random variables may be excessively dominant).
Thus we have shown that
$$E[W_\infty i] = E[t]. \qquad (13)$$
This relationship between the delta rule and the statistics of the patterns suggests a comparison with multiple linear regression. In multiple linear regression we seek coefficients $b_0, b_1, \ldots, b_n$ such that the predicted value is
$$\hat{Y}_j = b_0 X_{0j} + b_1 X_{1j} + b_2 X_{2j} + \cdots + b_n X_{nj} = \mathbf{b}^T X_j$$
(where $X_0$ is taken to be 1) and the sum-squared error
$$\sum_j (Y_j - \hat{Y}_j)^2$$
is minimized . This is precisely the problem that the delta rule seeks to
solve. In this case, each element of the target vector for input / output
pair $(i_p, t_p)$ is analogous to a to-be-predicted observation $Y_j$; our prediction variables $X_j$ are analogous to our input vectors $i_p$; our regression
coefficients b correspond to a row of the weight matrix W ; and the
intercept of the regression line , bo, corresponds to the bias often
assumed for our units (cf. Chapter 8) . In our typical case the target
vectors have many components, so we are simultaneously solving a
multiple regression problem for each of the components of the target
vectors. Now, the standard result from linear regression, for zero-mean random variables, is that our estimate for the vector $\mathbf{b}$, $\hat{\mathbf{b}}$, is given by
$$\hat{\mathbf{b}} = (X^T X)^{-1} X^T Y,$$
where $X$ is the matrix whose columns represent the values of the predictors and whose rows represent the individual observations. (Again, we take the first column to be all 1s.) Now, note from Equation 12 that the delta rule converges to
$$E[W_\infty] = (E[i^T i])^{+}\, E[i^T t].$$
This equation is the strict analog of that from linear regression theory . 2
If we assume that each output unit has a bias corresponding to the
intercept bo of the regression line , we can see that the delta rule is , in
effect, an iterative method of computing the best, in the sense of least squares, set of linear regression coefficients relating the inputs to the targets.
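This equivalence is easy to check numerically. The following sketch, our own illustration with arbitrary dimensions and a hypothetical learning rate, runs the delta rule on randomly sampled pairs and compares the result with the closed-form least-squares weights:

    import numpy as np

    rng = np.random.default_rng(1)

    # Random zero-mean input/target pairs.
    n_in, n_out, n_obs = 5, 3, 200
    I = rng.normal(size=(n_in, n_obs))            # columns are inputs i
    T = (0.5 * rng.normal(size=(n_out, n_in)) @ I
         + 0.1 * rng.normal(size=(n_out, n_obs)))  # noisy linear targets

    # Iterative delta rule, sampling pairs at random.
    eta, W = 0.01, np.zeros((n_out, n_in))
    for step in range(20000):
        p = rng.integers(n_obs)
        i, t = I[:, p], T[:, p]
        W += eta * np.outer(t - W @ i, i)

    # Closed-form least-squares solution: each row of W_ls is a vector
    # of regression coefficients, W_ls = T I^T (I I^T)^{-1}.
    W_ls = T @ I.T @ np.linalg.inv(I @ I.T)
    print(np.max(np.abs(W - W_ls)))   # small: the two nearly coincide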
SUMMARY
To summarize , this chapter has shown that close examination of the
delta rule reveals a number of interesting and useful properties . When
fixed patterns are being learned , the rule ' s operation can be elucidated
by converting from a unit -based description to a pattern -based descrip tion . In particular , the analysis showed that the correlations between
the input patterns, and not the specific patterns used, determined their role in learning. When the patterns to be learned are random variables, learning converges on the pattern of covariation between the inputs and targets. Finally, we showed that the delta
rule carries out the equivalent of a multiple linear regression from the
input patterns to the targets. Those familiar with linear regression
should conclude from this both the power of the rule and its limitations.
2 This difference is due to a difference between our convention and that
typical of linear regression. In our case, the stimulus vectors are the column vectors,
whereas in linear regression the predictor variables are the rows of the matrix X . Thus
this equation differs by a transposition from Equation 12. This has no consequencesfor
the points made here .
The preceding discussion does not by any means provide a complete analysis of the delta rule. Rather, it illustrates two important ideas: first, that principled analysis of the basic operation of a mechanism can often reveal new insights, and that the use of pattern-based rather than unit-component representations can provide a useful means of analysis; and second, that mechanisms can provide valuable applications beyond those for which they were designed.
CHAPTER
12
Resource Requirements of
Standard and Programmable Nets
J . L . McCLELLAND
In this chapter I continue this line of thinking and extend it in various ways, drawing on the work of several other researchers, particularly Willshaw (1971, 1981). The analysis is far from exhaustive, but it focuses on several fairly central questions about the resource requirements of PDP networks. In the first part of the chapter, I consider the resource requirements of a simple pattern associator. I review the analysis offered by Willshaw (1981) and extend it in one or two small ways, and I consider how it might be possible to overcome some
limitations that arise in networks consisting of units with limited connectivity . In the second part of the chapter, I consider the resource
requirements of a distributed version of the dynamically programmable
networks described in Chapter 16.
THE STANDARD PATTERN ASSOCIATOR
In this section, we will consider pattern associator models similar to
the models studied by J. A . Anderson (e.g., Anderson, 1983) and
Kohonen (1977, 1984) , and to the past-tense learning model described
in Chapter 18. A small pattern associator is illustrated in Figure 1. A
pattern associator consists of two sets of units , called input and output
units , and a connection from each input unit to each output unit . The
associator takes as input a pattern of activation on its input units and
produces in response a pattern on the output units based on the connections between the input and output units .
Different pattern associators make slightly different assumptions
about the processing characteristics of the units . We will follow
Willshaw 's ( 1981) analysis of a particular, simple case; he used binary
units and binary connections between units . Thus, units could take on
activation values of 0 or 1. Similarly, the connections between the
units could take on only binary values of 0 and 1.
In Willshaw nets, processing is an extremely simple matter. A pattern of activation is imposed on the input units , turning each one either
on or off. Each active input unit then sends a quantum of activation to
each of the output units it has a switched-on connection to. Output
units go on if the number of quanta they receive exceeds a threshold;
otherwise they stay off .
The learning rule Willshaw studied is equally simple. Training
amounts to presenting each input pattern paired with the corresponding
output pattern, and turning on the connection from each active input
unit to each active output unit. This is, of course, a simple variant of
Hebbian learning. Given this learning rule , it follows that when the
input pattern of a known association is presented to the network , each
of the activated input units will send one quanturn of activation to all of
the correct output units . This rneans that the number of quanta of
activation each correct output unit will receive will be equal to the
number of active input units .
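The storage and retrieval procedure is simple enough to state in a few lines of code. The sketch below is our own illustration of the scheme just described, with arbitrarily chosen parameter values; it stores random associations Hebbian-fashion and retrieves one of them with the threshold set to the number of active input units:

    import numpy as np

    rng = np.random.default_rng(2)
    n_i = n_o = 100                       # input and output units
    m_i = m_o = 5                         # active units per pattern
    r = 50                                # associations to store

    pairs = [(rng.choice(n_i, m_i, replace=False),
              rng.choice(n_o, m_o, replace=False)) for _ in range(r)]

    # Hebbian storage: switch on the junction from every active input
    # unit to every active output unit of each pair.
    W = np.zeros((n_o, n_i), dtype=int)
    for ins, outs in pairs:
        W[np.ix_(outs, ins)] = 1

    # Retrieval: an output unit fires if the quanta it receives reach
    # the threshold m_i (the number of active input units).
    ins, outs = pairs[0]
    x = np.zeros(n_i, dtype=int)
    x[ins] = 1
    retrieved = np.flatnonzero(W @ x >= m_i)
    print(sorted(retrieved), sorted(outs))   # all correct units on;
                                             # spurious ones are rare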
In examining the learning capacity of this network , Willshaw made
several further assumptions. First , he assumed that all of the associations (or pairs of patterns) to be learned have the same number of
active input units and the same number of active output units . Second,
he assumed that the threshold of each output unit is set equal to the
number of active input units . Given this assumption , only those out put units with switched -on connections from all of the active input
units will reach threshold .
Now we can begin to examine the capacity of these networks . In
particular, we can ask questions like the following. How many input
units (n; ) and output units (no) would be needed to allow retrieval of
the correct mate of each of r different input patterns ?
The answer to such a question depends on the criterion of correct retrieval used. For present purposes, we can adopt the following criterion: All of the correct output units should be turned on, and, on the
average , no more than one output unit should be turned on spuriously .
Since the assumptions of the model guarantee that all the correct output units will be turned on when the correct input is shown, the analysis focuses on the number of units needed to store r patterns without exceeding the acceptable number of spurious activations.

The answer to our question also depends on the number of units active in the input and output patterns in each pattern pair and on the similarity relations among the patterns. A very useful case that Willshaw considered is the case in which each of the r associations involves a random selection of $m_i$ input units and $m_o$ output units. From the assumption of randomness, it is easy to compute the probability that any given junction will be turned on after learning all r associations. From this it is easy to compute the average number of spurious activations. We will now go through these computations.

First we consider the probability $P_{on}$ that any given junction will end up being turned on, for a particular choice of the parameters $n_i$, $n_o$, $m_i$, $m_o$, and $r$. Imagine that the $r$ patterns are stored, one after the other, in the $n_i n_o$ connections between the $n_i$ input units and the $n_o$ output units. As each pattern is stored, it turns on $m_i m_o$ of the $n_i n_o$ connections, so each junction in the network is turned on with probability $m_i m_o / n_i n_o$. The probability that a junction is not turned on by a single association is just 1 minus this quantity. Since each of the $r$ associations is a new random sample of $m_i$ of the $n_i$ input units and $m_o$ of the $n_o$ output units, the probability that a junction has not been turned on (1 minus the probability that it has been turned on) after $r$ patterns have been stored is
$$1 - P_{on} = \left[1 - \frac{m_i m_o}{n_i n_o}\right]^r,$$
so that
$$P_{on} = 1 - \left[1 - \frac{m_i m_o}{n_i n_o}\right]^r.$$
An output unit that should be off will be spuriously turned on only if all $m_i$ of the connections it receives from the active input units happen to be on, which occurs with probability $P_{on}^{m_i}$. The average number of spurious activations is therefore
$$(n_o - m_o)\,P_{on}^{m_i}.$$
We want to keep this number less than 1. Adopting a slightly more stringent criterion to simplify the calculations, we can set
$$1 \geq n_o\, P_{on}^{m_i}$$
or
$$\left[\frac{1}{n_o}\right]^{1/m_i} \geq P_{on} = 1 - \left[1 - \frac{m_i m_o}{n_i n_o}\right]^r.$$
Rearranging, we get
$$\left[1 - \left[\frac{1}{n_o}\right]^{1/m_i}\right]^{1/r} \leq 1 - \frac{m_i m_o}{n_i n_o},$$
so that
$$r \leq \frac{\log\left[1 - \left[1/n_o\right]^{1/m_i}\right]}{\log\left[1 - m_i m_o / n_i n_o\right]}. \qquad (1)$$
If we take $m_i = \log_2 n_o$, so that $[1/n_o]^{1/m_i} = 1/2$, and approximate $\log(1 - m_i m_o / n_i n_o)$ by $-m_i m_o / n_i n_o$, this becomes
$$r \leq \frac{.69\, n_i n_o}{m_i m_o}, \qquad \text{or} \qquad n_i n_o \geq 1.45\, r\, m_i m_o.$$
This result tells us that the number of storage elements (that is, connections, $n_i n_o$) that we need is proportional to the number of associations we wish to store times the number of connections ($m_i m_o$) activated in storing each association. This seems about right, intuitively. In fact, this is an upper bound rather greater than the true number of storage elements required for less sparse patterns, as can be seen by plugging values of $m_i$ greater than $\log_2 n_o$ into Equation 1.

It is interesting to compare Willshaw nets to various kinds of local representation. One very simple local representation would associate a single, active input unit with one or more active output units. Obviously, such a network would have a capacity of only $n_i$ patterns. We can use the connections of a Willshaw net more effectively with a distributed input if the input and output patterns are reasonably sparse. For instance, in a square net with the same number $n$ of input and output units and the same number $m$ of active elements in each, if $n = 1000$ and $m = 10$, we find that we can store about 7,000 associations instead of the 1,000 we could store using local representation over the input units.
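Equation 1 and its approximation are easy to evaluate. The function below is our own restatement of the formula; running it for the square net just described reproduces the figure of about 7,000 associations:

    import math

    def willshaw_capacity(n_i, n_o, m_i, m_o):
        """Largest r giving no more than one spurious activation on
        average, per Equation 1."""
        p_allowed = (1.0 / n_o) ** (1.0 / m_i)   # largest tolerable P_on
        return (math.log(1.0 - p_allowed)
                / math.log(1.0 - m_i * m_o / (n_i * n_o)))

    print(round(willshaw_capacity(1000, 1000, 10, 10)))  # about 6,950
    print(0.69 * 1000 * 1000 / (10 * 10))                # approximation: 6,900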
Another scheme to compare to the Willshaw scheme would be one that encodes each pattern to be learned with a single hidden unit between the input and output layers. Obviously a net that behaved perfectly in performing $r$ associations between $m_i$ active input elements and $m_o$ active output units could be handcrafted using $r$ hidden units, each having $m_i$ input connections and $m_o$ output connections. Such a network can be economical once it is wired up exactly right: it only needs $r(m_i + m_o)$ connections. However, there are two points to note. First, it is not obvious how to provide enough hardware in advance to handle an arbitrary $r$ patterns of $m_i$ active input units and $m_o$ active output units. The number of such patterns possible is approximately $(n_i^{m_i}/m_i!)\,(n_o^{m_o}/m_o!)$, and if we had to provide a unit in advance for each of these our hardware cost would get out of hand very fast.
Second, the economy of the scheme is not due to the use of local
representation, but to the use of hidden units . In many cases even
more economical representation can be achieved with coarse-coded hidden units (see Chapter 3 and Kanerva, 1984).
RANDOMLY CONNECTED NETS

Thus far we have been considering nets in which each input unit projects to every output unit. A more realistic assumption is that each input unit projects to some number $f$ of the output units, with the connections falling at random. In such a net we can no longer guarantee that an output unit that should be on will receive connections from all of the active input units, nor can we set a sharp threshold and guarantee that all the correct output units, and no incorrect ones, will exceed it. Instead we must reformulate the problem. We bypass the question of exactly where the threshold should be set and calculate a measure of the ability of the network to distinguish between output units that should be on and output units that should not, using the signal-detection sensitivity index $d'$ (Green & Swets, 1966). The performance of the net then depends on how well separated the number of quanta received by units that should be on is from the number received by units that should be off.

The analysis again considers a square net, with $n$ input and $n$ output units, storing $r$ associations of $m$ active input and $m$ active output units each; but now each input unit projects to just $f$ of the output units, chosen at random. First consider the probability that an arbitrary connection will be turned on during the learning of a particular pattern. Pick an arbitrary connection. The probability that its input unit is active in the pattern is $m/n$, and the probability that the output unit it happens to reach is active is likewise $m/n$, so the probability that the connection is turned on by storing one association is $m^2/n^2$, exactly the same value we had in considering the fully connected Willshaw model. The earlier analysis therefore still applies, and the probability $P_{on}$ that an arbitrary connection will be on after learning $r$ patterns is $1 - (1 - m^2/n^2)^r$, independent of $f$, the number of connections
each unit makes. This factor will become important soon, but it does
not affect the probability that a particular connection will be on after
learning r patterns.
Now consider what happens during the testing of a particular learned
association. We activate the correct m input units and examine the
mean number of quanta of activation that each output unit that should
be on will receive. The $m$ active input units each have $f$ outputs, so there are $mf$ total "active" connections. A particular one of these connections reaches a particular output unit with probability $1/n$, since each connection is assumed to fall at random among the $n$ output units. Thus, the average number of active connections each output unit receives will simply be $mf/n$. For output units that should be on, each of these connections will have been turned on during learning, so $mf/n$ is the average number of quanta that unit will receive. Assuming that $n$ is reasonably large, the distribution of this quantity is approximately Poisson, so its variance is also given by $mf/n$.

Units that should not be on also receive an arbitrary connection from an active input unit with probability $1/n$, but each such connection is only on with probability $P_{on}$. Thus, the average number of quanta such units receive is $(mf/n)P_{on}$. This quantity is also approximately Poisson, so its variance is also equal to its mean.
Our measure of sensitivity , d ', is the difference between these means
divided by the square root of the average of the variances. That is,
'
.Jn
~ 77n
Pon
Pon
(2)
) /
'
We can get boundson the true value of d ' by noting that the denominator above cannot be greater than 1 or less than .Jill. The largest
value of the denominatorsetsa lower bound on d ', so we find that
d ' ~ .Jn ~77n ( 1 - Pon) .
Substituting
d ' ~ .J ,~77n [ 1 - ~
Ir .
(3)
Taking logs of both sides, invoking the $\log(1-x) \approx -x$ approximation, and solving for $r$, we obtain
$$r \leq .5\,\frac{n^2}{m^2}\left[\log(mf/n) - 2\log(d')\right].$$
One of the first things to note from this expression is its similarity to the expression we had for the case of Willshaw's fully connected net. In particular, if we let $f = n$, so that each unit sends an average of one connection to each other unit, we get
$$r \leq .5\,\frac{n^2}{m^2}\left[\log(m) - 2\log(d')\right]. \qquad (4)$$
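To see the effect of fan-out numerically, the sketch below evaluates Equation 4 directly; the parameter values are our own choices for illustration, not values from the text:

    import math

    def capacity(n, m, f, d_prime):
        """r from Equation 4: .5 (n/m)^2 [log(mf/n) - 2 log(d')]."""
        return 0.5 * (n / m) ** 2 * (math.log(m * f / n)
                                     - 2 * math.log(d_prime))

    # Square net with n = 10,000 units, m = 100 active, criterion d' = 5.
    for f in (3000, 5000, 10000):
        print(f, round(capacity(10_000, 100, f, 5)))
    # Capacity grows only logarithmically with f, and once mf/n falls
    # below d'^2 the criterion cannot be met at all.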
Effects of limited fan-out. The ratio $n^2/m^2$ expresses, as before, the fact that capacity is proportional to the ratio of the total number of connections to the number activated in storing each pattern, and the result is relatively insensitive to moderate values of $d'$. A new factor, however, emerges in the term $\log(mf/n)$. This term indicates that capacity goes up as the fan-out $f$ goes up, but only as the log of the fan-out: increasing $f$ is beneficial, but with rapidly diminishing returns. The term also indicates that there is a point beyond which the desired sensitivity cannot be achieved at all: when $mf/n$ falls to $d'^2$, the bracketed factor goes to zero, since the distributions of correct and spurious activations can no longer be pulled far enough apart, no matter how few patterns are stored. As $mf/n$ approaches this point performance degrades gracefully, the achievable $d'$ growing roughly as the square root of $mf/n$. Several of these observations on the effects of limited fan-out are due to Mitchison (personal communication, 1984). Figure 2 gives the quantitative picture.
FIGURE 2. Maximum number of patterns that can be stored in a limited fan-out net, as a function of the fan-out $f$ of each unit, based on Equation 4. The curves give results for several values of $d'$ and for nets of 1,000 and 10,000 units; in all cases, capacity increases with fan-out only up to the maximum achievable with full connectivity.
Biological limits on storage capacity of neural nets. With these analyses in mind , we can now consider what limits biological hardware
might place on the storage capacity of a neural net. Of course, we must
be clear on the fact that we are considering a very restricted class of
distributed models and there is no guarantee that our results will generalize. Nevertheless, it is reasonably interesting to consider what it
would take to store a large body of information , say, a million different
pairs of patterns, with each pattern consisting of 1,000 active input
To be on the safe side, let's adopt a d ' of 5. With this value, if the
units have an unbiased threshold, the network will miss less than 1% of
the units that should be on and false alarm to less than 1% of the units that should be off.

How big a net do we need to meet these specifications? Setting $d'$ to 5 and $m$ to 1,000 and consulting Equation 4, we find that the number of units must be near $10^6$ to give a capacity of a million patterns. The number of units is not the problem: a million units is small compared to the roughly $10^{10}$ neurons generally attributed to the brain. The problem is the number of connections. A fully connected net of this size requires each unit to project to all $10^6$ output units, for $10^{12}$ connections in all, whereas the fan-out of individual neurons is generally put at something like 1,000 to perhaps 150,000 connections, not a million.

It turns out that this fan-out limitation can be largely overcome by a simple trick: interposing layers of intermediate units. Let each input unit project to a set of units we may call dispersion units, each of which simply passes on the activation of the single unit that drives it; and let each output unit collect its inputs through a set of collection units, each of which simply sums the activations of the units feeding into it. The scheme is illustrated in Figure 3. With such layers, the effective fan-out of an input unit is the product of its own fan-out and that of its dispersion units, so that two layers of units with fan-outs of 1,000 give an effective fan-out of $10^6$. The costs are an increase in the total number of units and connections and some loss of sensitivity due to the noise inherent in the intermediate units, but these costs do not change the order-of-magnitude estimates. Assuming the fan-out problem can be handled in some such way, the capacity of nets of this general kind falls in a range that seems not unreasonable in relation to the brain's resources. The estimates could be revised upward, moreover, if the connections were not placed completely at random but were shaped by a
sensible learning rule, as shown in Chapter 18. Thus random nets like the ones that have been analyzed in this section probably represent a lower limit on efficiency that we can use as a benchmark against which to measure "smarter" PDP mechanisms.
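The order-of-magnitude figures in the preceding argument can be recovered from Equation 4. The sketch below is our own back-of-the-envelope restatement, inverting the equation for the fully connected case $f = n$:

    import math

    # Store a million pairs with m = 1,000 active units at d' = 5.
    # Equation 4 with f = n:  r = .5 (n/m)^2 [log m - 2 log d'].
    r_target, m, d = 1_000_000, 1_000, 5.0
    bracket = math.log(m) - 2 * math.log(d)     # about 3.7
    n = m * math.sqrt(2 * r_target / bracket)   # invert for n
    print(f"n   = {n:.2e} units")               # ~7e5: about a million units
    print(f"n^2 = {n * n:.2e} connections")     # ~5e11: approaching 10^12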
One frequently noted virtue of distributed memories is that their performance degrades gracefully when parts of the input or of the network are lost. Note that in each of the cases we have considered, the sensitivity has the form
$$d' = k\,\sqrt{mf/n}$$
for some constant $k$.
Now, suppose that during testing we turn on only some proportion $p_t$ of the $m$ units representing a pattern. The $m$ in the above equation becomes $m p_t$, so we see that the sensitivity of the network as indexed by $d'$ falls off as the square root of the fraction of the probe that is presented. Similarly, suppose some of the connections leading out of each unit are destroyed, leaving a random intact proportion $p_i$ of the $mf$ active connections. Again, the sensitivity of the network will be proportional to the square root of the number of remaining connections. Thus, performance degrades gracefully under both kinds of damage.
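The square-root falloff can be stated in two lines of code; the sketch below (our own illustration, with arbitrary parameter values) scales $m$ by the probe fraction and reports the relative sensitivity:

    import math

    def dprime(m, f, n, k=1.0):
        """Sensitivity of the form d' = k * sqrt(mf/n)."""
        return k * math.sqrt(m * f / n)

    base = dprime(1000, 1000, 1000)
    for p_t in (1.0, 0.5, 0.25):          # fraction of the probe shown
        print(p_t, dprime(1000 * p_t, 1000, 1000) / base)  # = sqrt(p_t)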
Another frequently noted virtue of distributed memories is the
redundancy they tend naturally to provide . The ability of simple distri buted memories to cope with degraded input patterns is really just a
matter of their redundancy, as Willshaw (1981) pointed out . For , if a
network is fully loaded, in the sense that it can hold no more associations and still meet some predetermined standard of accuracy with complete patterns, it will not be able to meet that same criterion with degradation. The only way to guard against this problem is to load the network lightly enough so that the criterion can still be met after subjecting the network or the inputs to the specified degree of degradation.
PROGRAMMABLE PATTERN ASSOCIATORS
In this section, I extend the sort of analysis we have performed on
simple associator models to the resource requirements of connection
information distribution (CID ) networks of the type discussed in
Chapter 16.
The mechanism shown in Figure 4 is a distributed CID mechanism.
The purpose of this network is to allow connection information stored
in a central associative network to be used to set connections in several
local or programmable networks in the course of processing so that more
than one input pattern can be processed at one time . The mechanism
works as follows : One or more patterns to be processed are presented
as inputs, with each pattern going to the input units in a different programmable network. The input pattern to each local net is also
transmitted to the input units of the central associative network . When
more than one pattern is presented at a time , the input to the central
network is just the pattern that results from superimposing all of the
input patterns . This pattern , via the connections in the central associative network , causes a pattern of activation over the central output
units. The central output pattern, of course, is a composite representation of all of the input patterns. It is not itself the desired output of
the system , but is the pattern that serves as the basis for programming
(or turning on connections ) in the local , programmable networks . The
local networks are programmed via a set of units called the connection
activation (CA ) units . The CA units act essentially as switches that
turn on connections in the programmable networks . In the version of
the model we will start with , each CA unit projects to the one specific
connection it corresponds to in each programmable network , so there
are as many CA units as there are connections in a single programmable net. In the figure, the CA units are laid out so that the location of
each one corresponds to the location of the connection it commands in
each of the programmable networks .
To program the local networks , then , central output units activate the
CA units corresponding to the connections needed to process the patterns represented on the central output units . The CA units turn on
the corresponding connections . This does not mean that the CA units
actually cause activation to pass to the local output units . Rather , they
simply enable connections in the programmable nets . Each active local
input unit sends a quantum of activation to a given local output unit if
the connection between them is turned on .
The question we will be concerned with first is the number of CA
units required to make the mechanism work properly . In a later section , we will consider the effect of processing multiple items simultane ously on the resource requirements of the central network .
The first thing to notice is that, if each local network must be capable of processing $r$ different patterns, we are in trouble: it is as though the number of connections required in each local network grows linearly with the number of known patterns times the number of connections needed for each. If we had one CA unit for each programmable connection, a programmable version of our square 1-million-pattern associator would require $10^{12}$ CA units, a figure which is one or two orders of magnitude larger than conventional estimates of the number of neurons in the brain. Putting the matter another way: to use programmable connections, it appears that we need $n^2$ CA units just to specify the connections needed for a standard net that could do the same work with just the connections between $n$ input and $n$ output units.2
However , things are not nearly as bad as this argument
suggests. The computation I just gave misses the very important fact that it is generally not necessary to pinpoint only those connections that are relevant to a particular association. We can do very well if we allow
each CA unit to activate a whole cohort of connections, as long as (a) we activate all the connections that we need to process any particular pattern of interest, and (b) we do not activate so many that we give rise to an inordinate number of spurious activations of output units.
The idea of using one CA unit for a whole cohort of programmable connections is a kind of coarse coding. In this case, we will see that we can reap a considerable benefit from coarse coding, compared to using one CA unit per connection. A simple illustration of the idea is shown in Figure 5. The figure illustrates CA units projecting to a single one
2 Many readers will observe that the CA units are not strictly necessary. However , the
specificity of their connections to connections in local networks is an issue whether CA
units are used as intermediaries or not . Thus, even if the CA units were eliminated , it
would not change the relevance of the following results. In a later section, the CA units
and central output units will be collapsed into one set of units; in that case, this analysis
will apply directly to the number of such units that will be required.
of the programmable networks.

FIGURE 5. CA units projecting to a single programmable network. Each CA unit projects to a randomly chosen set of connections in the programmable network (the sets are drawn at random, without replacement), and whenever a CA unit is activated it turns on all of the connections it projects to. Note that when there is more than one programmable network, a given CA unit must activate the same set of connections in each.

We want this scheme to buy us some savings, so we ask how small a number $n_{ca}$ of CA units we can program the local nets with. We start by considering the case in which only a single pattern is to be processed at one time: how small a number of CA units can we get
by with, assuming that each one activates a distinct, randomly selected set of $n_i n_o / n_{ca}$ connections?
First of all, the number of CA units that must be activated may have to be as large as $m_i m_o$, in case each of the different connections required to process the pattern is a member of a distinct cohort. Second, for comparability to our analysis of the standard network, we want the total fraction of connections turned on to allow no more than an average of 1 output unit to be spuriously activated. As before, this constraint is represented by
$$P_{on} \leq \left[\frac{1}{n_i}\right]^{1/m_i}.$$
As long as $m_i \geq \log_2 n_i$, .5 will be less than the right-hand side of the expression, so we will be safe if we keep $P_{on}$ less than or equal to .5. Since we may have to activate $m_i m_o$ CA units to activate all the right connections and since we do not want to activate more than half of the connections in all, we conclude that
$$n_{ca} \geq 2\, m_i m_o.$$
From this result we discover that the number of CA units required
does not depend at all on the number of connections in each programmable network. Nor in fact does it depend on the number of different
known patterns. The number of known patterns does of course influ ence the complexity of the central network , but it does not affect the
number of CA units . The number of CA units depends on m; mo, the
number of connections that need to be turned on per pattern. Obviously, this places a premium on the sparsenessof the patterns. Regardless of this , we are much better off than before.
Now consider what happens when we want to program the local networks for several patterns at once. We still want to keep the total proportion of connections turned on
below .5. Formally , assume that we know which s patterns we want to
process. Each one will need to turn on its own set of $m_i m_o$ CA units out of the total number $n_{ca}$ of CA units. The proportion of connections turned on will then be
$$P_{on} = 1 - \left[1 - \frac{m_i m_o}{n_{ca}}\right]^s.$$
This formula is, of course, the same as the one we saw before for the number of connections activated in the standard net, with $s$, the number of different patterns to be processed simultaneously, replacing $r$, the number of patterns stored in the memory, and with $n_{ca}$, the number of connection activation units, replacing $n_i n_o$, the total number of connections. Using $P_{on} = .5$ and taking the log of both sides we get
$$-.69 = s\,\log\left[1 - \frac{m_i m_o}{n_{ca}}\right],$$
so that, approximating $\log(1-x)$ by $-x$ as before,
$$n_{ca} \approx 1.45\, s\, m_i m_o.$$
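The approximation is good once $s$ is more than a few patterns, as the following check (our own, with arbitrarily chosen pattern sizes) shows:

    def p_on(n_ca, m_i, m_o, s):
        """Fraction of connections enabled when s patterns, each needing
        m_i * m_o CA units, are programmed at once."""
        return 1.0 - (1.0 - m_i * m_o / n_ca) ** s

    m_i = m_o = 10
    for s in (2, 4, 16):
        n_ca = 1.45 * s * m_i * m_o           # the approximation above
        print(s, int(n_ca), round(p_on(n_ca, m_i, m_o, s), 3))
    # P_on comes out near the .5 criterion, approaching it as s grows;
    # note that n_ca never depends on the size of the local networks.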
Overlapping the Programmable Networks
In Chapter 16, the CID scheme we have been considering thus far
was generalized to the case where the programmable networks overlapped with each other . This allowed strings of letters starting in any of
a large number of input locations to correctly activate units for the
corresponding word at the appropriate location at the next higher level .
Here I will consider a more general overlapping scheme using distri buted representations in the overlapping local networks. A set of three
overlapping local networks is illustrated in Figure 6. In this scheme,
both the input and the output units can play different roles depending
on the alignment of the input pattern with the input units . In consequence, some of the connections also play more than one role. These
connections are assumed to be programmable by a number of different
CA units , one for each of the connection' s different roles. Obviously ,
this will tend to increase the probability that a connection will be turned
FIGURE 6. Three overlapping programmable networks of 8 x 8 units each. The networks overlap every four units, so the input and output units can participate in two different, partially overlapping networks.
on, and therefore will require a further revision of our estimate of the
number of CA units required.
Unfortunately , an exact mathematical analysis is a bit tricky due to
the fact that different junctions have different numbers of opportunities
to be turned on. In addition , input patterns in adjacent locations will
tend to cross-activate each other 's output units . If the patterns to be
processedare well separated, this will not be a problem . Restricting our
attention to the well-separated case, we can get an upper bound on the
cost in CA units of allowing overlapping modules by considering the
case where all of the connections are assumed to play the maximum
number of roles. This number is equivalent to the step size or grain,
g, of the overlap, relative to the size of the pattern as a whole. For this worst case, the requirement on the number of CA units becomes
$$n_{ca} \approx 1.45\, sg\, m_i m_o.$$
In summary, the number of CA units required to program a programmable network depends on different variables than the number of
connections required in a standard associator. We can unify the two
analyses by noting that both depend on the number of patterns the net
must be ready to process at any given time . For the standard associator , the number is r , the number of known patterns; for the programmable net, the number is sg, the number of patterns the net is programmed for times the grain of overlap allowed in the starting locations
of input patterns.
This analysis greatly increases the plausibility of the CID scheme.
For we find that the " initial investment " in CA units needed to program a set of networks to process a single association is related to the
content of the association or the number of connections required to
allow each of the active input elements to send a quantum of activation
to each of the active output elements. Incorporating a provision for
overlapping networks, we find that the investment required for processing one association is related to the content of the association times the
grain of the overlap. This cost is far more reasonable than it looked
like it might be at first , and, most importantly , it does not depend on
the number of patterns known .
An additional important result is that the cost of programming a set
of networks grows with the number of patterns we wish to program for
at one time . This cost seems commensurate with the linear speedup we
would get by being able to process several patterns simultaneously.
A further issue raised by superimposing input patterns is that the superposition may spuriously contain known patterns that were never presented; we may call these spurious composites ghosts. To get a quantitative grip on this problem, assume, as we have been assuming throughout, that the known patterns are random selections of $m$ of the $n$ input units. When $s$ patterns are presented at once, the probability that a particular input unit is active in the superposition is $1 - (1 - m/n)^s$. The probability that all $m$ units of some other known pattern are active in the superposition, so that the pattern is spuriously represented, is just this quantity raised to the power $m$, and the average number of known patterns spuriously represented in the input is therefore
$$r\left[1 - \left(1 - \frac{m}{n}\right)^s\right]^m.$$
If we take an average of one or fewer spurious patterns as acceptable performance, we require
$$1 \geq r\left[1 - \left(1 - \frac{m}{n}\right)^s\right]^m.$$
Rearranging and taking logs, the boundary of acceptable performance is given by
$$\log\left[1 - \left(\frac{1}{r}\right)^{1/m}\right] = s\,\log\left(1 - \frac{m}{n}\right).$$
Simultaneous Access Using Distributed Representations

The results we have just described depend on the use of local representations at the level of the central output units: one central output unit stands for each known pattern. Obviously, we would like to know what happens if the central network instead uses distributed output representations. To consider this case, we collapse the CA units and the central output units into one set, as remarked earlier, so that the central output pattern directly programs the connections in the local networks. The general situation remains as shown in Figure 4, except that the mapping from central input units to central output units is now simply a distributed pattern associator of the kind analyzed in the first part of this chapter. Two somewhat separate questions arise. First, what happens to the $d'$ analysis of the central network when several input patterns are presented at once? Second, what is the probability that the superposition of the $s$ central output patterns will contain ghosts?

We begin with the first question. Consider what a particular central output unit will receive when $s$ patterns are superimposed on the central input units. A unit that should be on for one of the presented patterns receives one quantum from each of that pattern's $m_i$ active input units, since all of those connections were turned on during learning; in addition, it receives quanta from the input units belonging to the other presented patterns, each arriving over a connection that is on with probability $P_{on}$. A unit that should not be on simply receives quanta from all of the active inputs, each with probability $P_{on}$. The difference between the two means therefore remains $m_i(1 - P_{on})$, just as before; what changes is that the variances of both distributions are increased by the quanta contributed by the other patterns. The number of active input units other than the $m_i$ representing the tested pattern is, on average, $(n_i - m_i)[1 - (1 - m_i/n_i)^{s-1}]$. Approximating the two distributions as Poisson, taking the square root of the average of the two variances as the denominator, and replacing the $P_{on}$ in the variance term by its maximum value of 1 (which gives us a slight overestimate of the variance and hence a slight underestimate of $d'$), we obtain the simpler expression
$$d' = \frac{m_i\,(1 - P_{on})}{\sqrt{m_i + (n_i - m_i)\left[1 - (1 - m_i/n_i)^{s-1}\right]}}.$$
Using this equation we can examine the effects of increasing s on the
value of d '. Not too surprisingly , d ' does go down as s goes up , but
the effect is relatively benign. For example, with $n = 10^6$, $r = 10^6$, $m = 1{,}000$, and $s = 1$, $d'$ is about 11.6. It drops to half that value at $s = 4$, and drops much more gradually thereafter. With $n = 2 \times 10^6$ units and the same values of $r$ and $m$, we can get an acceptable value of $d'$ ($\geq 6$) with $s$ as high as 16.
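These figures can be checked against the expression given above. The function below is our own restatement of that expression; running it reproduces the values quoted in the text:

    def d_prime(n, m, r, s):
        """Sensitivity with s superimposed patterns, using the simplified
        form in which the P_on in the variance is replaced by its
        maximum value of 1."""
        p_on = 1.0 - (1.0 - (m / n) ** 2) ** r
        extra = (n - m) * (1.0 - (1.0 - m / n) ** (s - 1))
        return m * (1.0 - p_on) / (m + extra) ** 0.5

    print(round(d_prime(1_000_000, 1000, 1_000_000, 1), 1))    # 11.6
    print(round(d_prime(1_000_000, 1000, 1_000_000, 4), 1))    # 5.8
    print(round(d_prime(2_000_000, 1000, 1_000_000, 16), 1))   # 6.2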
The final issue we will consider is the possibility that new spurious
output patterns have been introduced in the superposition of the s out put patterns simultaneously activated in processing the s input patterns .
For simplicity , we will just consider the probability of a " ghost ," given
that all and only the correct mo units are active for each of the s pat terns . The analysis is entirely the same as the one we gave before for
the probability of ghosts showing up in the input patterns . We get an
average of one ghost when
$$1 \approx r\left[1 - (1 - m/n)^s\right]^m.$$
As before, the number of simultaneous patterns we can tolerate goes down with $m$ and is relatively insensitive to $r$.
Discussion

This analysis of the resource requirements of networks like the CID model has really discovered a number of very simple things. The requirements of the local, programmable networks depend on $s$, the number of patterns to be programmed for at one time, and are independent of $r$, the number of known patterns. In the central network, the number of units required to keep spurious activations under control grows with $s$, as does the number of units required to keep ghosts from emerging in the input and output patterns. It is worth noting, also, that the probability of ghosts increases as we increase $m$.

The fact that the resource requirements of the local networks are independent of the number of patterns known is obviously important. Relative to the central network, it means that the local networks are very cheap. The number of distinct inputs that are needed to program them is quite reasonable, and we can even get by with far fewer units in the local networks than we need at the central level.

On the other hand, there are several important further observations concerning the central network. One is that, at fixed numbers of units and patterns, the decline in sensitivity as a function of $s$ is rather gradual. Simultaneous access by multiple patterns is very much like degradation: a network can handle it without a noticeable decrement if it is lightly loaded, and in a lightly loaded network one can take $s$ up to reasonable values without catastrophe. A second observation concerns the choice of $m$: at a fixed number of units $n$, the choice of $m$ essentially
corresponds to the question of coarse coding taken up in Chapter 3. Capacity goes down as the patterns become less sparse, so there is a premium on keeping $m$ small relative to $n$; at the same time, very sparse coding restricts the resolution with which patterns can be differentiated, and the best choice of $m$ is a compromise between capacity and resolution.

A final observation concerns the relation of this scheme to the idea of a spotlight of attention. Throughout this chapter we have assumed that the central network can be programmed for only a small number of patterns at one time, so that most of the knowledge in the system must be accessed sequentially, with simultaneous, parallel access restricted to a few patterns. We may imagine the programmable networks as holding the contents of a large blackboard, with the central network providing the knowledge that ties the parts together. The spotlight of attention then corresponds to the set of patterns the central network is currently programmed for; it moves over the blackboard sequentially, while the patterns within it are processed in parallel. As we saw in Chapter 16, an arrangement of this kind would allow, for example, several letters of a word to be processed simultaneously in reading. Such a combined system gets much of the best of both worlds: the sharply tuned knowledge of the central network and the cheap, reusable processing capacity of the local networks.
CONCLUSION

This chapter has shown how Willshaw's (1971, 1981) analysis of the simple pattern associator can be extended in several directions: into the effects of limited connectivity, and into the resource requirements of the programmable networks introduced in Chapter 16. Several interesting issues remain to be explored, and the observations of Mitchison (personal communication, 1984) on the capacity of nets with limited fan-out indicate that continuing elaboration of this kind of analysis can lead in fruitful directions. It is to be hoped that the observations described here will be of aid in this continuing exploration.
ACKNOWLEDGMENTS
This work was supported by Contract N00014-82-C-0374, NR 667-483 with the Personnel and Training Research Programs of the Office
of Naval Research , by a grant from the System Development Founda tion to the Institute for Cognitive Science at UCSD , and by an NIMH
Research Scientist Career Development Award (MH-00385). This chapter was developed in response to a number of questions raised by
Geoff Hinton and Scott Fahlman about the resource requirements of
programmable nets . I thank Dave Rumelhart for several useful discus sions and for encouraging me to pursue the issues described herein .
The material described in the section entitled " Randomly Connected
Nets " was developed in collaboration with Dave , and the application of
the d ' analysis to the problem of simultaneous access to a distributed
memory network was Dave ' s suggestion .
CHAPTER 13
P3: A Parallel Network Simulating System
D. ZIPSER and D. RABIN
Research on parallel distributed processing is to a large extent dependent upon the use of computer simulation, and a good deal of the researcher's time is spent writing programs for this purpose. Virtually all the PDP systems described in this book require special-purpose computer programs to emulate the networks under study. In writing programs of this type, it is usually found that the basic algorithms of the PDP network are easy to program but that these rather simple "core" programs are of little value unless they are embedded in a system that
lets the researcher observe and interact with their functions. These
These
user interface programs are generally tedious and very time consuming
to write . What is more , when they are directed toward one particular
system they can be quite inflexible , making it difficult to easily modify
the POP network being studied . Also , because of the time involved ,
particularly for interactive graphics programs , the researcher often
makes do with very limited facilities for analyzing the performance of
the network . In this chapter we will describe a general -purpose parallel
system simulator called P3. It was developed with PDP research explicitly in mind and its major goal is to facilitate simulation by providing
both the tools for network description and a powerful user interface
that can be used with any network described using the tools . There are
several major parts to the P3 system:

• The plan language, which is used to describe the architecture of a network: its units and the connections among them.

• The method language, an extension to LISP, which implements the internal computational behaviors of the units in a model.

• The constructor, which transforms the plan and associated methods into a computer program and, when run, simulates the network.
Input to units described in a P3 plan can come only from other units
in the plan. That is, there is no "outside world " in a P3 plan language
description of a network . This means that at the level of description of
the P3 plan language, the P3 network is closed. Access to the outside
world must occur inside a unit through its method. Methods may
accessthe world outside the P3 system through any available computer
peripheral. The only thing that methods are not allowed to do is to
reconfigure the P3 system itself or communicate with other
methods through "underground connections" not mentioned in the P3
plan.
In any simulation , the relationship between real and modeled time is
of key importance. A real unit , such as a neuron , would read inputs
continuously and update its outputs asynchronously, but this cannot be
simulated exactly on a digital computer. Many simulations use a simple
synchronous approximation to real time . However, sometimes this produces unwanted artifacts and a closer approximation of asynchrony is
required. Often , in fact, what the investigator really wants to do is to
experiment with the effect of different kinds of time simulation on the
network under study. Since there is no way for the system designer to
know in advance all the possible ways that the investigator will want to
handle time , some strategy has to be used that allows great flexibility .
The approach taken by P3 is that this flexibility can come through the
methods that can use conditional updating. The P3 system itself is
completely synchronous and updates all units on each cycle. Since
updating a unit involves invoking its method, the question of whether
or not the outputs of a unit actually change on any P3 cycle can be
decided by the unit's method itself. Units in a P3 plan are also given locations in a spatial coordinate system, P3 space, which is convenient for models in which the connectivity is described in terms of geometrical relations.
This is often the case when dealing with realistic neuronal modeling ,
especially of primary sensory processing structures .
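The timing strategy described above is easy to picture in code. The sketch below is a toy illustration in Python, not the P3 implementation: the outer loop is completely synchronous, but each unit's method decides for itself whether its output actually changes on a given cycle.

    import random

    class Unit:
        """A toy unit whose method conditionally updates its output."""
        def __init__(self, update_prob):
            self.output = 0.0
            self.update_prob = update_prob      # crude asynchrony knob

        def method(self, net_input):
            # The simulator calls every method on every cycle; the
            # method itself decides whether the output changes.
            if random.random() < self.update_prob:
                self.output = max(0.0, net_input)   # e.g., threshold rule

    units = [Unit(0.3) for _ in range(4)]
    for cycle in range(3):                      # synchronous outer cycles
        for u in units:
            u.method(net_input=1.0)
        print(cycle, [round(u.output, 1) for u in units])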
A unit may have either a single output or any number of outputs, and any number of parameters. Before the start of a simulation, values must be given to all parameters and to all outputs. Each value is always a single computer word in length. The interpretation of this word depends on how the methods use it. As the simulation proceeds, these initial values are continuously updated. Taken together, the values of the parameters and the output values constitute the state of the network during simulation.

There are two kinds of parameters: unit parameters and terminal parameters. The unit parameters apply to the whole unit, for example, the threshold in a linear threshold unit. The terminal parameters are associated with individual input and output terminals, for example, the weights on a unit's input lines. Terminal names are also used within the method programs to read inputs and set outputs. The basic form of the CONNECT statement is

    (CONNECT <unit-name> OUTPUT <output-name>
        TO <unit-name> INPUT <input-name>)
For units with only a few inputs or outputs each input or output can be
to illustrate
how
P3 works , we will
describe
a model
of a
members
of the cluster
work . The cluster unit which wins is the only one that learns and it
'.'i/. if unit j
g-Cik
nk- gUlt
wins on stimulus k
Note that the input array "C" of the cluster units will be in one-to-one correspondence with the output array of the stimulus-type unit. Also notice that a terminal parameter, the weight "W," is associated with the input array "C." The competitor unit needs an additional input called "i-A" which will receive information from the outputs of all the other members of the cluster.
We have described the two unit types we will use; we can now go ahead and instantiate the units that make up the network. Part of the statement that creates the stimulus unit is shown below; it declares the output array "d" on which the stimulus patterns will appear:

    outputs (d array (i 0 5) (j 0 5))))

The unit statement names the unit it is creating. This is the name of a real unit that is actually going to exist in our model, and it is the name that will be referred to when this unit is connected to other units. There can be any number of units of the same type, and they can all have different names. Since every real unit in P3 has a location in P3 space, we must specify it in the unit statement that instantiates the unit. The at clause is used for this. The at is followed by a location specifier that simply evaluates to the location of the unit in P3 space. The cluster units have input arrays that will receive the patterns appearing on "d." The statement that instantiates the learning cluster must create a cluster
of two or more units, so we want a way to vary the number of units in a cluster. In the first line of the unit statement we give the name cluster to the array, and then we indicate the size of the array with a subscript specifier. The name of this subscript is "k"; its initial value is 0. Its final value is one less than the global constant "cluster-size." The value of cluster-size, which will occur at various points in the plan, is set by a statement at the beginning of the P3 plan that determines the value of global constants. This feature means that we can change parameters such as cluster-size globally throughout the plan by adjusting a single value. The upper bound of the stimulus input line array has also been set with the use of a global constant "stimulus-size" rather than with an integer as was done previously. Also notice that the variable "k" is used in an at clause to place each unit of the array at a different place in P3 space.
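Assembling these pieces, the statement that instantiates the cluster has roughly the shape sketched below. This is our reconstruction from the fragmentary plan in Appendix A; the location arithmetic and the initialized unit parameter "q" are illustrative rather than authoritative:

    (unit (cluster array (k 0 (- cluster-size 1)))   ; one unit per value of "k"
          of type competitor
          at (@ (* (+ k 1) 10) 0)                    ; "k" spreads the units out in P3 space
          initialize (q = 0.05)                      ; a unit parameter, as in Appendix A
          inputs (C array (i 0 stimulus-size) (j 0 stimulus-size)))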
Our next task is to connect the units together in the appropriate fashion. We have two classes of connections: those that go from the stimulus generator to the learning cluster and those that interconnect the units within the learning cluster. Each of these classes of connections has many individual connections within it, but these individual connections can be specified algorithmically in such a way that only a few CONNECT statements are needed to generate the entire network. What is more, the algorithmic specification of these connections makes it possible to change the size of the cluster or the size of the stimulus array without altering the CONNECT statements at all. The code required to connect the stimulus to the learning cluster is given below:
    (for (k 0 (+ 1 k))
         exit when (= k cluster-size) do
      (for (i 0 (+ 1 i))
           exit when (> i stimulus-size) do
        (for (j 0 (+ 1 j))
             exit when (> j stimulus-size) do
          (connect unit stimulus output d i j
                   to unit (cluster k) input C i j
                   terminal initialize (W = (si:random-in-range
                       0.0 (// 2.0 (expt (+ stimulus-size 1) 2))))))))
There are three nested loops. The first ranges over each member of the cluster, and the next two range over each dimension of the stimulus array. Inside these three nested loops is a single CONNECT statement. The CONNECT statement has the job of initializing the value of any terminal parameters. In our model we have a very important terminal parameter, "W," the weight between a stimulus line and a cluster unit, which we want to initialize to random values that sum to one for each unit.
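To spell out the arithmetic behind this initialization (our gloss): writing the upper bound stimulus-size as $s$, each cluster unit has $(s+1)^2$ input lines, and each weight is drawn uniformly from the interval $[0,\, 2/(s+1)^2]$, whose mean is $1/(s+1)^2$. The expected sum of a unit's weights is therefore

$$E\!\left[\sum_i w_{ij}\right] \;=\; (s+1)^2 \cdot \frac{1}{(s+1)^2} \;=\; 1,$$

so the weights of a unit sum to one only on average.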
If this random initialization is not satisfactory for the quantity being modeled, the general P3 plan language allows the simple random function to be replaced by a more complex sum-normalizing LISP function that forces the initial weights of each unit to sum to exactly one.

Each member of the cluster must also know about the outputs of all the other members of the cluster; this is how the competition is carried out. These connections are generated in the same general way as the stimulus connections, with a pair of nested loops whose subscripts range over the members of the cluster. The only wrinkle is that a unit does not receive its own output on the input line "i-A," so the CONNECT statement is applied only when the subscript of the originating unit differs from the subscript of the receiving unit:

    (for (k 0 (+ 1 k))
         exit when (= k cluster-size) do
      (for (j 0 (+ 1 j))
           exit when (= j cluster-size) do
        (if (not (= j k))
            (connect unit (cluster k) output o-A
                     to unit (cluster j) input i-A))))

Note that because the connections are generated algorithmically, the number of connections required, which grows as the square of the number of units in a cluster, can be changed by setting a single value, the global constant cluster-size.

We have now described all the features of the P3 plan language that this model requires. To have a running simulation, however, we must also have the method programs that describe how each unit computes. Methods are ordinary LISP programs; the only difference between a normal LISP program and a P3 method is the way the values named in the plan are accessed. Inputs, outputs, and parameters are accessed by special forms. For example, to read the value on one of the input lines of the array "C," the following form is used:

    (read-input (C i j))

In this case the form returns the current value of the input line of "C" whose subscripts are i and j. Corresponding forms are used to set outputs and to read and set unit and terminal parameters. Since the complete plan and the complete method programs for this competitive learning model are given in the appendix to this chapter, we will not examine the methods in detail here.

THE P3 SIMULATION ENVIRONMENT

The P3 simulation environment is a highly interactive system for constructing, debugging, and running P3 models. The current implementation runs on the Symbolics 3600. The first step in simulating a model is to construct it. The constructor, which is similar in purpose to a compiler, reads the plan, which is a P3 description of a network, and produces a data structure containing all the relevant information about the units, their connections, their locations in P3 space, and their methods.

Once a network has been constructed, it must be debugged. Debugging really occurs at two levels. First, the user must check that the network that has been created is the network that was intended: that all the units exist and that the connections among them have been established correctly. Second, the methods themselves must be debugged, for a method is a program and can contain bugs like any other program. To check the wiring of a network before a simulation is run, P3 provides an extensive interactive display facility. The network is shown on a display device, with each unit at its particular location in P3 space. Using a mouse as a pointing device, the user can point at any unit and, through a menu, examine the connections emanating from that unit one at a time, tracing any connection to the unit at its far end. Clicking a mouse button on a unit also makes it possible to examine the current values of its parameters and outputs. This facility has proved much more useful for checking correctness than a wiring diagram of the whole network. Once the user is convinced that the network has the structure that was envisioned, the functioning of the model itself can be examined: an input can be presented to the network, parameter values can be set, and the user can simply begin a simulation run.
Analyzing the running of a complex simulation is a demanding task.
It is in this analysis that we have found that all the features of the P3
system come together and begin to justify their existence. Because
every object in the model has a location in P3 space that corresponds to
the user's mental image of the network , the simulation system can
display values representing the state of the system at locations on the
screen that have meaning to the user. This means that during the
course of a simulation , meaningful patterns of P3 variables can be
displayed. This approach is widely used in analyzing the function of
parallel systems. What P3 has done is to standardize it and relieve the
user of the need to implement the details of this display strategy for
each new model.
In the current implementation of P3, each object in the model is
represented at its designated location by a small rectangular icon. By
the use of a mouse-pointer-driven menu system, the user can assign to the icon representing a unit the variable whose value is to be displayed.
Thus, for example, the icons representing the input terminals of a cluster unit in our example can be assigned either the value of the input to
that terminal or the value of the weight on that terminal . These assignments can be made or changed at any time during a simulation run .
They can be set to be updated continually as a simulation proceeds, or
they can be examined in detail when the simulation is temporarily
interrupted . The current P3 implementation displays the relevant state
values at two possible levels of precision. The approximate value of the
state value is indicated by the degree of darkening of the icon. There
are five levels of intensity . Their range is under user control and can
be changed at any time . This enables the user to adjust the range so
that the difference between the lightest and the darkest icons will
optimize the information content of the display. There is also a high
precision display that permits the exact value of any P3 variable to be
examined.
Figure 1 shows how the screen of the Symbolics 3600 looks after 588
P3 cycles of simulation of a competitive learning model with a 6 x 6
stimulus array and a dipole stimulus . There are six windows displayed
and each shows a different aspect of the simulation . Window A shows
the three units in the model at their respective positions in P3 space.
The upper narrow rectangle is the pattern generator. It is not displaying
any value. The lower two rectangles represent the two cluster units.
They are displaying the approximate value of their outputs by the size
of the contained black rectangle. Clearly the unit on the left has a
lower output value than the one on the right . Window B shows the
output of the pattern generator unit , which was called " stimulus " in the
plan. The lines form a square array because that is the way they were
specified in the plan. The two dark rectangles show the current dipole stimulus.
FIGURE 1. Display of the Symbolics 3600 screen during a P3 session. The mouse arrow is pointing to the word "command" in the upper left. Clicking a mouse button will start the simulation.
In addition to all the features already described, the P3 simulation environment gives the user access to the full power of the Symbolics 3600 programming tools, and this has proved invaluable in the development and debugging of models. For example, suppose that a running simulation produces an error that the user believes is due to a bug in the method code of a particular unit. The user can interrupt the simulation at any point, go directly to an editor buffer containing the method code, alter it, recompile just that method, and then return to the simulation at exactly the point at which it was interrupted. This kind of direct alteration of a running model takes much of the pain out of serious debugging work.

As our work with P3 developed, we also found it important to be able to view the results of a simulation with analytical tools that were not envisioned when the model was created. To make this possible, P3 provides a class of analytical tools called "instruments." The concept is analogous to that of the instruments in a laboratory: an instrument can be connected by a "probe" to any variable of any unit in the model, and while the simulation runs it records or displays the values that the variable takes on. One instrument that has proved very useful is the "strip chart recorder," which displays the time course of a number of variables in the same way as a multiple-pen strip chart. Since the number of variables in a model can be overwhelming, and since it is not always possible to know in advance which of them will turn out to be important, the modeler can simply connect instruments to those variables that require examination as the analysis develops. New instruments can be created and added to the system, and every instrument is then available for the analysis of any model. An instrument can also be used to record the entire state of the model as the simulation runs, producing a complete record of a run for later analysis.

Performance

So far we have said nothing about how fast models run in P3. This is a question of considerable importance, since big parallel simulations inherently run slowly on serial computers. Any general-purpose simulation system will generally pay some performance penalty relative to a special-purpose program tailored to a particular model, because the general system cannot exploit the details of the structure of any one model. Tailoring a program to a single model, however, makes the model much harder to change, and ease of change is of tremendous importance in the early stages of model development, which is just where we envision P3 to be most useful. When the structure and parameters of a model have been decided upon and it is necessary to run the model extremely rapidly, it may in some cases be advantageous to write a special program to implement the model.
The general-purpose systems, however, have several things going for them with respect to model performance. First of all, since the data structure has the same form for all models, it is possible to put a lot of effort into optimizing running speed for the particular hardware on which the system is implemented. This optimization only has to be done once, rather than once for each model. A second way in which general-purpose systems can improve performance is through the use of special-purpose hardware. The models generated by the P3 system are inherently parallel models and map well onto some parallel computer architectures. One way to get blinding speed from parallel models is to implement real parallelism on parallel computers. In some cases, array processors can also be highly beneficial. Since all P3 models are of the same sort, a constructor can be made that will provide the appropriate data structures to run any P3 model on these kinds of hardware. This will make the hardware transparently available to the P3 user.
APPENDIX A

    ;;;---------------------------------------------------------------------
    ;;; Unit types
    ;;;---------------------------------------------------------------------

    ;;; ...

    ;;;---------------------------------------------------------------------
    ;;; Unit instances
    ;;;---------------------------------------------------------------------

    ;;; ...

    ;;; ******** Learning units of type competitor ********

    (unit (cluster array (k 0 (- cluster-size 1)))
          of type competitor
          at (@ (* (+ k 1) 10) 0)
          initialize (q = 0.05)
          inputs (C array (i 0 stimulus-size) (j 0 stimulus-size)))

    ;;;---------------------------------------------------------------------
    ;;; Connections
    ;;;---------------------------------------------------------------------

    ;;; ******** Stimulus to both clusters ********

    (for (k 0 (+ 1 k))
         exit when (= k cluster-size) do
      (for (i 0 (+ 1 i))
           exit when (> i stimulus-size) do
        (for (j 0 (+ 1 j))
             exit when (> j stimulus-size) do
          (connect unit stimulus output d i j
                   to unit (cluster k) input C i j
                   terminal initialize (W = (si:random-in-range
                       0.0 (// 2.0 (expt (+ stimulus-size 1) 2))))))))

    ;;; ******* Interconnect the clusters to implement the competition ********

    (for (k 0 (+ 1 k))
         exit when (= k (- cluster-size 1)) do
      (for (j (+ k 1) (+ 1 j))
           exit when (= j cluster-size) do
        (connect unit (cluster k) output o-A
                 to unit (cluster j) input i-A)
        (connect unit (cluster j) output o-A
                 to unit (cluster k) input i-A)))
APPENDIX B

    ;;;---------------------------------------------------------------------
    ;;; Method for the competitive learners
    ;;;---------------------------------------------------------------------

    method
      (let ((win t)
            ...)
        ...
        (set-unit-parameter p 0)
        ...

        ;; ******** Find out whether we won ********
        ;; Win was initialized to t in the let at the top level of this method.
        (for-terminals k of input i-A
          (if (< (read-unit-parameter p)
                 (read-input (i-A terminal k)))
              (setq win nil)))

        (when win
          ...
          ;; Update the terminal parameter to the new weight
          (set-terminal-parameter (C i j) W new-weight))
        ...
        (set-unit-parameter flag 1))

    ;;;---------------------------------------------------------------------
    ;;; Dipole pattern generator method
    ;;;---------------------------------------------------------------------

    method
      ;; ******** Do we need a new pattern on this iteration? ********
      (cond
        ;; ******** Yes. Erase old dipole and make new one. ********
        ((< (read-unit-parameter flag) 1)
         (let ((imax (- (output-dimension-n d 1) 2))
               (jmax (- (output-dimension-n d 2) 2)))
           (set-output (d (read-unit-parameter i1) (read-unit-parameter j1)) 0)
           (set-output (d (read-unit-parameter i2) (read-unit-parameter j2)) 0)
           (set-unit-parameter i1 (+ (random imax) 1))
           (set-unit-parameter j1 (+ (random jmax) 1))
           (cond ((> (random 2) 0.5)
                  ;; The new dipole is horizontal: i2 differs from i1 by one.
                  (cond ((> (random 2) 0.5)
                         (set-unit-parameter i2 (+ (read-unit-parameter i1) 1)))
                        (t
                         (set-unit-parameter i2 (- (read-unit-parameter i1) 1))))
                  (set-unit-parameter j2 (read-unit-parameter j1)))
                 (t
                  ;; The new dipole is vertical: j2 differs from j1 by one.
                  (cond ((> (random 2) 0.5)
                         (set-unit-parameter j2 (+ (read-unit-parameter j1) 1)))
                        (t
                         (set-unit-parameter j2 (- (read-unit-parameter j1) 1))))
                  (set-unit-parameter i2 (read-unit-parameter i1))))
           (set-output (d (read-unit-parameter i1) (read-unit-parameter j1)) 1)
           (set-output (d (read-unit-parameter i2) (read-unit-parameter j2)) 1)))
        ...)
References
508
REFERENCES
REFERENC
~ 509
Crick , F ., & Mitchison , G . ( 1983) .
Thefunctionof dreamsleep
. Nature
, 304,
111 - 114 .
to mathematical probability
theory . Englewood
Hall .
Fahlman , S. E . ( 1979) . NETL .. A system for representing and using real - world
knowledge . Cambridge , MA : MIT Press .
Fahlman , S. E . ( 1980) . The Hashnet interconnection scheme (Tech . Rep . CMU -
puter Science
.
Fahlman, S. E., Hinton, G. E., & Sejnowski, T. J. (1983) . Massivelyparallel
architecturesfor AI : NETL, Thistle, and Boltzmannmachines. Proceedings
o/ theNationalConference
on Artificial Intelligence
AAAI-83.
Farley, B. G., & Clark, W. A . (1954) . Simulationof self-organizingsystemsby
digital computer. IRE Transactions
o/ InformationTheory
, 4, 76-84.
Feldman, J. A. (1981). A connectionistmodel of visual memory. In G. E.
Hinton & J. A. Anderson (Eds.) , Parallelmodels0/ associative
memory(pp.
49-81) . Hillsdale, NJ: Erlbaum.
Feldman, J. A. (1982) . Dynamic connectionsin neural networks. Biological
Cybernetics
, 46, 27-39.
Feldman, J. A. (1985) . Connectionistmodelsand their applications
: Introduction. CognitiveScience
, 9, 1-2.
Feldman, J. A ., & Ballard, D. H. (1982) . Connectionist models and their
properties. CognitiveScience
, 6, 205-254.
Fodor, J. A. (1983). Modularityof mind.' An essayonfaculty psychology
. Cambridge, MA : MIT Press.
Fukushima, K. (1975) . Cognitron: A self-organizingmultilayeredneural network. BiologicalCybernetics
, 20, 121-136.
Fukushima, K. (1980) . Neocognitron: A self-organizingneural network model
for a mechanismof pattern recognitionunaffectedby shift in position. BiologicalCybernetics
, 36, 193-202.
Gallistel, C. R. (1980) . The organizationof action.' A newsynthesis
. Hillsdale,
NJ: Erlbaum.
Geman, S., & Geman, D. (1984). Stochasticrelaxation, Gibbs distributions,
and the Bayesianrestoration of images. IEEE Transactionson Pattern
Analysisand MachineIntelligence
, 6, 721-741.
Ginsburg, H. P. (1983) . Thedevelopment
of mathematical
thinking. New York:
AcademicPress.
510 REFERENCES
Glorioso
, R . M . , & Colon
Bedford
Glushko
, MA : Digital
R . J.
knowledge
Human
in
( 1979 ) .
The
reading
Perception
organization
words
aloud .
and Performance
and
activation
Journal
systems .
of
of
orthographic
Experimental
Psychology ..
, 5 , 674 - 691 .
intelligent
Press .
detection
York : Wiley .
Grossberg
, S. ( 1976 ) .
Part I . Parallel
Adaptive
development
pattern
classification
and coding
of neural
and
universa
feature
! recoding
detectors
. Biologi -
, S . ( 1978 ) .
A theory
In E . L . J . Leeuwenberg
visual perception . New
, P . R . ( 1974 ) .
coding , memory
, and development
( Eds .) , Formal
theories
of
York : Wiley .
of visual
& H . F . J . M . Buffart
does the brain
Finite -dimensional
build
a cognitive
vector spaces .
code ? Psychologi -
New
York : Springer
Verlag .
Hebb , D . o . ( 1949 ) . The organization
Hewitt
, C . ( 1975 ) .
problem
of
Theoretical
Stereotypes
procedural
attachment
Issues
Natural
in
workshop . Cambridge
Hinton
Language
, University
York : Wiley .
approach
in FRAME
, MA : Bolt , Beranek
, G . E . ( 1977 ) . Relaxation
dissertation
o/ behavior . New
as an ACTOR
, & Newman
based frames
ence on Artificial
Hinton
, NJ : Erlbaum
computation
. Proceedings
Intelligence
, G . E . ( 1984 ) . Parallel
Motor
Hinton
of reference
In
An
solving
the
Proceedings
of
interdisciplinary
.
Unpublished
doctoral
Processing ..
of Edinburgh
towards
theories
.
that
in parallel hardware .
models of associative
assigns
canonical
Joint
object Confer -
.
computations
for controlling
an arm . Journal
of
, G . E ., & Anderson
models o/ associative
memory
. Hillsdale, NJ: Erlbaum.
Hinton, G. E., & Lang, K. (1985). Shaperecognitionand illusory conjunctions. Proceedings
of the Ninth InternationalJoint Conferenceon Artificial
Intelligence
.
Hinton, G. E., & Sejnowski, T. J. (1983a
) . Analyzing cooperativecomputation. Proceedings
of the Fifth Annual Conferenceof the CognitiveScience
Society.
Hinton , G . E., & Sejnowski, T . J. (1983b) . Optimal perceptual inference.
Proceedingsof the IEEE Computer Society Conferenceon Computer Vision and
PatternRecognition
, 448-453.
Hinton, G. E., Sejnowski, T. J., & Ackley, D. H. (1984). Boltzmannmachines
..
Constraintsatisfactionnetworksthat learn (Tech. Rep. No. CMU-CS-84-119) .
Pittsburgh, PA: Carnegie-Mellon University, Department of Computer
Science
.
511
REFERENCES
Hofstadter, D. R. (1979) . Giidel, Escher
, Bach.. An eternalgoldenbraid. New
York: BasicBooks.
Hofstadter, D. R. (1983) . The architectureof Jumbo. Proceedings
of the InternationalMachineLearningWorkshop
.
Hofstadter, D. R. (1985) . Metamagicalthemas
. New York: BasicBooks.
Hogg, T., & Huberman, B. A. (1984) . Understandingbiologicalcomputation.
Proceedings
of theNationalAcademyo/ Sciences
, USA, 81, 6871-6874.
Hopfield, J. J. (1982) . Neural networks and physicalsystemswith emergent
collectivecomputationalabilities. Proceedings
o/ theNationalAcademyofSciences
, USA, 79, 2554-2558.
Hopfield, J. J. (1984) . Neuronswith gradedresponsehavecollectivecomputational propertieslike thoseof two-state neurons. Proceedings
of the National
Academyo
/ Sciences
, USA, 81, 3088-3092.
Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). "Unlearning" has a
stabilizingeffect in collectivememories. Nature, 304, 158-159.
Hummel, R. A ., & Zucker, S. W. (1983) . On the foundations of relaxation
labelingprocesses
. IEEE Transactions
on PatternAnalysisand MachineIntelligence
, 5, 267-287.
Isenberg, D., Walker, E. C. T., Ryder, J. M., & Schweikert, J. (1980,
November) . A top-downeffecton the identificationoj junction words
. Paper
presentedat the AcousticalSocietyof America, Los Angeles.
Jackson
, J. H. (1958) . On localization. In Selectedwritings (Vol. 2) . New
York: BasicBooks. (Original work published1869)
Julesz, B. (1971) . Foundations0/ cyclopean
perception
. Chicago: University of
ChicagoPress.
Kanerva, P. (1984) . Self-propagatingsearch
.. A unifiedtheoryof memory(Rep.
No. CSLI-84-7) . Stanford, CA: Stanford University, Center for the Study
of Languageand Information.
Kawamoto, A . H., & Anderson, J. A . (1984) . Lexical accessusing a neural
network. Proceedings
0/ the SixthAnnual Conference
of the CognitiveScience
Society
Kienker
204
- 213
Separating
simulated
. ,
from
Gelatt
( 1974
- 23
( 1977
444
J . ,
Hinton
with
Science
An
- 445
ground
annealing
sactions
Kohonen
Sejnowski
figure
Kirkpatrick
Kohonen
. ,
adaptive
Jr
. ,
&
220
parallel
Vecchi
671
. ,
&
Schumacher
network
- 680
associative
Unpublished
1983
( 1985
Optimization
by
memory
principle
IEEE
Tran
Associative
memory
.'
system
theoretical
approach
New
York: Springer.
Kohonen, T. (1982) . Clustering, taxonomy, and topologicalmapsof patterns.
In M. Lang (Ed.) , Proceedings
of theSixth InternationalConference
on Pattern
Recognition(pp. 114-125) . Silver Spring, MD: IEEE Computer Society
Press.
Kohonen, T. (1984) . Self-organization and associativememory
. Berlin:
Springer-Verlag.
Kullback, S. (1959) . Informationtheoryand statistics
. New York: Wiley.
Lamperti, J. (1977). Lecturenotesin appliedmathematicalsciences
.' Stochastic
processes
. Berlin: Springer-Verlag.
512 REFERENCES
Larkin , J. H . ( 1983) . The role of problem representation in physics . In D .
Gentner & A . L . Stevens (Eds .) , Mental models ( pp . 75-98) . Hillsdale , NJ :
Erlbaum
, Information
Sciences
Institute
Levine , R . D ., & Tribus , M . ( 1979 ) . The maximum entropy formalism . Cam bridge , MA : MIT Press .
Lewis , C . H . ( 1978 ) . Production
system models
of practice
effects .
Unpublished
- Hill .
REFERENCES
513
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., &
Teller, E. (1953) . Equation of state calculations for fast computing
machines. Journal0/ ChemicalPhysics
, 6, 1087. .
Minsky, M. (1954). Neuralnetsand the brain-modelproblem. Unpublisheddoctoral dissertation, PrincetonUniversity.
Minsky, M. (1959). Somemethodsof artificial intelligenceand heuristic programming. In Mechanisation
0/ thoughtprocesses
.. Proceedings
of a symposium
held at the National Physical Laboratory
, November 1958. Vol. 1
(pp. 3-28) . London: Her Majesty's StationeryOffice.
Minsky, M. (1975) . A frameworkfor representingknowledge. In P. H. Winston (Ed.) , The psychologyof computervision (pp. 211-277) . New York:
McGraw-Hill .
Minsky, M., & Papert, S. (1969) . Perceptrons
. Cambridge, MA : MIT Press.
Morton, J. (1969). Interactionof information in word recognition. Psychologi
cal Review, 76, 165-178.
Moussouris, J. (1974) . Gibbs and Markov random systemswith constraints.
Journalo/ StatisticalPhysics
, 10, 11-33.
Mozer, M. C. (1984) . Theperceptionof multipleobjects
. A parallel, distributed
processing
approach
. Unpublishedmanuscript, University of California, San
Diego~Institute for CognitiveScience
.
Neisser, U. (1967). Cognitivepsychology
. New York: Appleton-Century-Crofts.
Neisser, U. (1981) . John Dean's memory: A casestudy. Cognition
, 9, 1-22.
Newell, A. (1980). Physicalsymbolsystems. CognitiveScience
, 4, 135-183.
Norman, D. A., & Bobrow, D. G. (1975) . On data-limited and resourcelimited processes
. CognitivePsychology
, 7, 44-64.
Norman, D. A., & Bobrow, D. G. (1976). On the role of active memory
processes
in perceptionand cognition. In C. N. Cofer (Ed.) , Thestructureof
humanmemory(pp. 114-132) . Freeman: SanFrancisco.
Norman, D. A., & Bobrow, D. G. (1979) . Descriptions: An intermediatestage
in memoryretrieval. CognitivePsychology
, 11, 107-123.
Palmer, S. E. (1980) . What makestriangles point: Local and global effects in
configurationsof ambiguoustriangles. CognitivePsychology
, 9, 353-383.
Parker, D. B. (1985) . Learning-logic (TR-47) . Cambridge, MA : Massachusetts
Institute of Technology, Center for ComputationalResearchin Economics
and ManagementScience
.
Pillsbury, W. B. (1897). A study in apperception
. AmericanJournalofPsychology, 8, 315-393.
Poggio, T., & Torre, V. (1978). A new approachto synapticinteractions. In
R. Heim & G. Palm (Eds.) , Approachesto complex systems
. Berlin:
Springer-Verlag.
Poincare, H. (1913) . Foundationsof science(G. B. Halstead, Trans.) . New
York: SciencePress.
Quillian, M. R. (1968) . Semanticmemory. In M. Minsky (Ed.) , Semantic
informationprocessing
(pp. 227-270) . Cambridge, MA : MIT Press.
Rao, C. R., & Mitra, S. K. (1971). Generalizedinverseof a matrix and applications. Sixth BerkeleySymposium
on MathematicalStatisticsand Probability
,
1, 601-620.
514
REFERENCES
Riley , M . S. ( 1984 ) .
Structural
understanding
in performance
and learning .
Press .
Feature
discovery
by competitive
REFERENC
~ 515
Sejnowski, T . J., & Hinton , G . E. (in press) . Separating figure from ground
with a Boltzmann machine. In M . A . Arbib & A . R. Hanson (Eds.) , Vision,
brain, and cooperativecomputation. Cambridge, MA : MIT Press/ Bradford .
Sejnowski, T . J., Hinton , G . E., Kienker , P., & Schumacher, L . E. ( 1985) .
Figure-ground separationby simulated annealing. Unpublished manuscript.
Selfridge, O. G . ( 1955) . Pattern recognition in modern computers. Proceedings
of the WesternJoint Computer Conference.
Selfridge, O. G ., & Neisser, U . ( 1960) . Pattern recognition by machine. Scientific American, 203, 60-68.
Shannon, C. E. ( 1963) . The mathematical theory of communication . In C. E.
Shannon & W. Weaver (Eds.) , The mathematical theory of communication
(pp. 29- 125) . Urbana: University of Illinois Press. (Reprinted from Bell
SystemTechnicalJournal, 1948, July and October)
Shepard, R. N . ( 1984) . Ecological constraints on internal representation:
Resonant kinematics of perceiving, imagining , thinking , and dreaming.
PsychologicalReview, 91, 417-447.
Smith , P. T ., & Baker, R. G . ( 1976) . The influence of English spelling patterns on pronounciation . Journal of Verbal Learning and Verbal Behavior,
15, 267-286.
Smolensky, P. ( 1981) . Lattice renormalization of ~ 4 theory. Unpublished doctoral dissertation, Indiana University .
Smolensky, P. ( 1983) . Schema selection and stochastic inference in modular
environments . Proceedingsof the National Conferenceon Artijiciallntelligence
AAAI -83, 109- 113.
Smolensky, P. ( 1984) . The mathematical role of self-consistency in parallel
computation . Proceedingsof the Sixth Annual Conferenceof the CognitiveScienceSociety.
Smolensky, P., & Riley , M . S. ( 1984) . Harmony theory.. Problem solving, parallel
cognitive models, and thermal physics (Tech. Rep. No . 8404) . La Jolla:
University of California , San Diego, Institute for Cognitive Science.
Spoehr, K ., & Smith , E. ( 1975) . The role of orthographic and phonotactic
rules in perceiving letter patterns. Journal of Experimental Psychology
..
Human Perceptionand Performance, 1, 21-34.
Sternberg, S. ( 1969) . Memory scanning: Mental processes revealed by
reaction-time experiments. American Scientist, 57, 421-457.
Strang, G . (1976) . Linear algebra and its applications. New York : Academic
Press.
Sutton, R. S., & Barto, A . G . ( 1981) . Toward a modern theory of adaptive
networks: Expectation and prediction . PsychologicalReview, 88, 135- 170.
Teitelbaum , P. ( 1967) . The biology of drive .. In G. Quarton , T . Melnechuk , &
F. O. Schmitt (Eds.) , The neurosciences
.' A study program. New York :
Rockefeller Press.
Terrace, H . S. ( 1963) . Discrimination learning with and without errors. Journal of the ExperimentalAnalysis of Behavior, 6, 1-27.
Thomas, G . B. Jr. ( 1968) . Calculus and analytic geometry (4th ed.) . Reading,
MA : Addison -Wesley.
Venesky, R. L . ( 1970) . The structure of English orthography. The Hague:
Mouton .
516
REFERENCES
Williams , R . J. ( 1983 ) . Unit activation rules for cognitive network models (Tech .
Rep . No . ICS 8303 ) . La Jolla : University of California , San Diego , Institute
- Hill .
Index
, R . P ., 18 , 19 , 574
Abrams
, T . W ., 552 , 561
Abramson
unit , 425
logic of
conclusions
, A . S ., 94 , 560
Abstraction
models
199 - 206
of memory
main
comparison of experimental
simulation
and
, 202 -203
on , 201 -
202
examples of
gating activation
semilinear
on , 442 - 443
description of , 363
examples of , 428 -429
concepts
of , 426 - 428
of , 423 - 425
rule . See Activation
functions
Activation
vector of harmony
theoretical
Active representation
function , 426
, 52 , 324 - 328
in PDP
representation in
Addition problem , simple binary ,
341 - 346 . See also Delta
generalized
rule ,
518
INDEX
Anderson
, C . W ., 539 , 554
Anderson
, J . A ., 33 , 42 , 56 , 62 , 66 ,
, G . M ., 477 , 554
, M . A . , 362 , 577
Arbitrary
mapping , implementing
96 - 104
machines
assignment , model of
Ary , M ., 477 , 564
Asanuma
, H ., 357 , 554
Asynchronous update vs .
synchronous update , 61
parsers
of , 55 , 161
of , 211 - 212
INDEX 519
Blakemore
Barrow , H . G ., 63 , 554
Blanchard
Bartlett
Blank , M . A ., 79 , Ill
, F . C ., 81 , 508 , 17 , 19 , 554
, 559
of memory
554 , 576
cells , 363 - 364
Blumstein
Bobillier
models
of
issue in , 463
of , 469 - 470
Bernstein
machines
performance of , 464
programming , 463
Berman
, S., 62 , 576
, P ., 350 , 555
304 - 313
simulation
, E . L ., 42 , 180 , 508 ,
on , 313 - 314
searches , use of to
with , 264
second - order
273 - 275
observables
Boolean
, 377 -378
Binocular deprivation
Bipolarcells, 363
Bishop
, P. O., 480, 574
and ,
Bond , Z . S ., 61 , 555
model
of
, H . E ., 161 , 568
function
activation
428
AND
function
disjunctive
function
as extension
, 429
of ,
520 INDEX
Boolean function
(continued)
treated
function
, 429
123
activation
unit
Brachman
Carew
Canonical
feature
detectors
, R . J . , 313 , 555
, T . J . , 552 , 561
cortex , anatomy
Carlson
, M ., 275 , 573
Carman
, J . B ., 350 , 556
, 114
, Lewis , 97
Cascade
neuropsychological investigation
patients with , 134- 135
simulated
effects
of
architecture
semantic
details
descriptions
discussion
of , 313 - 325
of , 314 - 315
parsers , conventional
interfacing
recursion
Brown , P ., 295
Brown , R ., 219 , 241 , 556
Bruce , C . J ., 368 , 556 , 558
state in a box
, 318 -320
multiple
of , 3-4
323 - 325
on , 277
constraints
studies
simulation
model
Buisseret
. 278 - 283
of , 288 - 289
sentence -structure
Brodmann
Brodmann
microfeatures
66 - 68
Broadbent
555
of , 277 -289
of , 304 -313
Bresnan
model , 42
on , 272 -275
experiments
basic results
of , 289 -292
of , 292 -293
INDEX 521
feature patterns, shading of , 305306
roles , distributed
representation
cues ,
verb -frame
selection
module
Charniak
, 130 - 131
cortex , anatomy
areas of , 345
cortical
outputs
Christen
, W . G . , 478 , 570
Christensen
, R ., 227 , 508
CID . See Connection
information
distribution
mechanism
feature
detection
neuron
of , 367 -371
behavior
neurons
in , 365 - 366
, nature
of
vs . inhibitory
neurons ,
362
( Rosenblatt ) ,
155 - 156
, 356 -357
excitatory
cortical
neocortical
, E ., 324 , 556
and
352 - 353
neuron
, 300 - 301
neuron
of , 304 -305
physiology of
cortex inputs , 346 -356
group
312
structural
Cerebral
Chandelier
Central
of , 312 - 313
semantic
377
307 - 310
other
387 - 389
in , 364 - 365
peptides , 339
synapses , types of , 338 -339
on , 92 - 94
to
COHORT
77 , 97 - 106
basic assumptions
dilemma
of , 98- 99
522 INDEX
COHORT
model (continued)
resolution
cerebral
Colon
Coltheart
, M ., 102 , 508
identification
, factors
influencing
effect of at phoneme level , 89 - 90
Competitive learning
architecture of competitive learning
basic components
definition
, 81
activation
system , 130 -
activation
unit (CA ) .
134
Connection
information
mechanism
of , 132 - 134
multiple
478
on , 190
of , 147
information
distribution
183
letter similarity
number
effects , 182
of elements
, effect
of per
units , 181
of , 164 - 166
simulation
of word
extensions
of , 167 - 168
resource requirements
of , 473 -486 ,
166 - 167
Connectionist
models , 72
166
Connors
benefits
180 - 181
word
pattern
definition
of , 151 - 152
of , 166 - 167
conclusions
, as a stable
distribution
, A . M ., 85 , 508
Colonnier
in , 387 - 389
temperature ( 7) , 211
viewpoint on PDP
cortex
Computational
Computational
of , 101 - 106
and vertical
experiments
mechanism
lines ,
of , 159
system
Computational
Computational
level , 121
models , role of
of
-addressable
memory
, 79 - 80
INDEX523
Cooper , L . N ., 42 , 180 , 508 , 480 ,
487 , 489 , 499 , 555 , 557 , 573
Corkin
Cortical
in
Boltzmann
Daniel
Davis
. ,
robustness
238
239
102
effects
of
545
237
237
and
472
benefits
See
/ so
associators
control
See
/ so
PDP
DCC
on
rule
rule
Delta
43
analysis
rule
and
53
62
of
63
363
417
457
See
also
generalized
multiple
, 474 -
linear
regression
458
pattern
based
coordinates
447
453
Reading , programmable
of
statistical
learning
summary
of
in
vector
Delta
, R . G ., 64 , 94 , 209 , 557
application
of
354
pi
units
361
322
324
323
319
321
functions
networks
447
149
362
activation
feedforward
457
361
318
problem
sigma
328
descent
semilinear
in
and
361
352
nets
453
445
327
gradient
XOR
; Brain
cases
for
459
on
problem
of
recurrent
458
generalized
conclusions
general
, A ., 350 , 553
.and
notation
rule
and
approach
418
in
machines
pattern
Delta
236
conscious
Boltzmann
redundancy
Delta
Dahlstrom
in
239
Deliberate
on , 478
computational
limit
reflections
, 476 - 477
model
blackboard
238
of
543
statistics
dyslexia
Standard
freezing
breaking
Deep
Memory
significance
/ so
240
task
plasticity, 473-484
of
transitions
of
577
556
to
See
and
Degradation
interpretation
dominance
thermodynamic
558
524
increments
181
symmetry
theory
phase
Cottrell
574
500
519
making
idealized
of environment
. ,
in
of
, 476 - 478
model
Decision
313
558
process
554
480
497
304
558
. ,
. ,
coherent
Coulter
modulate
432
477
. ,
354
distributed
, S ., 99 , 105 , 557
Crowder
. ,
harmony
modulate
Crosstalk
weights
ocular
476
Decay
effect
353
Cowan
Smith
Deacedo
to cerebral
Cotton
Darien
Daw
output
Daniels
thalamus , as gateway
cortex - 349 - 352
machines
353
324
328
524 INDEX
Delta rule , generalized (continued)
simulation
arbitrary
binding
coarse
use of to determine
size and
and
plausible
in
of
problems
of
Dobson
Dukes
450
. ,
. ,
. ,
61
. ~
500
558
578
558
509
. ,
477
380
558
558
110
functional
561
570
system
systems
PDP
machine
571
Dynamical
- 454
100
490
. ,
Dynamic
of
- 456
336
- 451
453
435
195
Durham
191 - 193
. ,
Duda
97
Duckrow
location
453
J . ,
Place
453
goal
/ so
- 453
454
. ,
Dostrovsky
451
of
Dretske
- 91
for
See
LISP
of
Drager
for
90
models
3600
Dowling
451
of
use
goal
used
testing
Desimone
of
simulation
Dendrites
- 90
model
- 460
biologically
properties
Derthick
449
Symbolics
- 96
- field
recognition
network
, 179 - 181
88
91
encoding
view
generalized , 209
DeMarzo
conjunctive
P3
implementing
problem
coding
description
of changes in
104
location
connections
Distributed
341 - 346
direction
pairing
96
models
as
41
perspective
397
of
- 398
of
. ,
336
367
370
553
558
Edelman
. ,
. ,
. ,
387
153
163
Eichenbaum
, 79 - 81
558
519
573
. ,
385
524
556
565
Eigenvalues
of , 78 - 79
eigen
structure in representations
and
structure
106 - 108
on , 108 - 109
details
of , 87 - 104
Eigenvectors
and
See
eigenvalues
/ so
. ,
94
Matrices
399
and
linear
systems
Eimas
Eisen
Elbaum
technical
See
ues
and
403
val
Eigenvectors
Ehrlich
new concepts , 86 - 87
status
555
Edmonds
Eccles
. ,
Electricity
Electroencephalography
456
. ,
499
problem
473
558
560
240
509
573
- solving
( EEG
- 250
JJ5
525
INDEX
on amount
Elman , J . L ., 58 , 63 , 69 , 71 , 80 , 81 ,
Reading, programmable
blackboard
EM . See Expectation
maximization
Empiricism
and
, delta
activation
rule for
functions
in ,
rule ,
generalized
Feinstein
Feldman
, J . A ., 12 , 43 , 45 , 72 , 75 ,
Feldman
by . See Delta
, D . J ., 296 , 511
conjunctive
units , 72
units , 72
rule , generalized
Ervin , S ., 219 , 220 , 254 , 559
of
networks
semilinear
Evanczuk
model
Feedforward
method
representations
Felleman
570
Fennell
, R . D ., 63 , 122 , 573
Feustel
Fillmore
, C . J ., 273 , 559
Neocortical
, 293
neurons
, nature
of
identification
, factors
influencing
32
, S . E ., 85 , 86 , 264 , 509
of
sequences
of , 161 - 164
Fan - out
definition
limited
of , 51
, effects
of , 468 -472
limitation ,
470 - 472
of
526 INDEX
Freezing and decision -making in
harmony theory . See Decision making and freezing in harmony
theory
Freund
, T . F . , 363 , 576
Friedlander
, M . J ., 351 , 560
Glorioso
Gluck
, R . M . , 424 , 510
Glushko
Goodness
- of - fit function
, 14 - 16 , 31 ,
of
Gradient
descent
and delta
models ~of
Gallistel
Glaser
, E. M., 366
, 577
Glass
-, L.-, 494
-. 571
, C . R ., 141 , 509
computers
, 287 - 288
560 , 577
Garnes , S ., 61 , 555
Green , K ., 96 , 569
Garrett
, M . F ., 274 , 559
Garrud
Greenberg , Z ., 254
Greenway , A . P., 477 , 554
Grinvald
Gating activation
also Activation
functions
, A ., 385 , 561
Geman
Gutnick
289 , 509
Generalization
, 30 , 85 . See also
Haggard , M ., 85 , 576
Gernsbacher
Gerstein
, M . A ., 79 , 559
, G . L ., 385 , 560
Halmos
Gibson
, E . J ., 198
Hamiltonian
Gilbert
Hamori
, P . R ., 422 , 510
function
(1:1) , 211
, J . , 351 , 577
machines
INDEX 527
Hebbian learning rule , 36, 37, 38,
machine
53 , 69 - 70 , 297
, 212
retrieving
formulation
information
of , 208 ,
from , 269 -
272
Heise , G ., 63 , 569
Hendrickson
572
conclusions
on , 261 - 262
, T . , 387 , 556
, C ., 132 , 510
model
of , 48
definition
of , 148
Hinton
, G . E ., 17 , 20 , 33 , 42 , 80 , 82 ,
theorem
, 226 - 229
also Attention
, networks
Hintzman
Hiorns
research
with , 264
of , 264 - 281
Hawken
, M . J ., 490 , 555
Hawkins
, R . D ., 552 , 561
, R . W ., 344 , 573
Hofstadter
for
focusing
Hash
of
units , definition
, 123
Hokfelt
Hopfield
and vertical
experiments
lines ,
528 INDEX
Horizontally
hierarchical
networks ,
associators
217
; Reading
programmable
of ; Standard
also
IS - A
hierarchy
and
of
Iverson
Imbert
Jackson
in
in
and
nonlinear
models
localized
damage
, learning
linear
, 121
of
, 413
connections
systems
, 411
, L . L . , 340
Jakimik
, J . , 61 , 98 , 99 , 100
Jelinek
Jenkins
, J . J . , 473
Jenkins
, W . M . , 385
of word
Johnson
, D . S . , 235
Johnson
, E . , 324
Johnston
, D . , 477
Johnston
, J . C . , 159
, F . , 293
Jones
563
Jones
of word perception
Jouvet
Just
, 107
, 569
, 563
, 508
, 563
, 562
, 160
, 352
, R . S . , 56 , 406
Julesz
, 563
, 357
, 574
, 363
, 564
507
layered systems , 60
, 193
, 508
, E . G . , 351
, 192
, 557
in two - dimensional
of
- 413
, 511
, 172
, 563
111
and
, 563
, J . H . , 41 , 141
, L . L . , 171
556
model
levels
- 418
Jacoby
Implementational (physiological)
activation
of
neural
416
416
Interactive
also
- 424
levels
See
- 396
levels
477 - 496
memory
and
interpretation
conceptual
422
of distributed
, neural
conceptual
6-hydroxydopamine (6-0HDA ) ,
level
hypothesis
failure
Delta
, 105
models
395
See
; Representation
, D . , 7 , 511
PDP
of
Isomorphism
, generalized
Isenberg
of
- 209
of
learning
, 208
, a distributed
representations
rule
model
associators
memories
Memory
model
Internal
Hummel
of
See
pattern
Interference
, C . R ., 351 , 562
blackboard
, 173
, 226
, M . , 350
, 409
, 311
, 551
, 410
, 418
, 554
, 555
, B . , 18 , 511
, M . A . , 153
, 161
, 549
, 564
, 577
, 352
, 354
, 385
, 386
single-level, 60
three - level , 59
Interference
of patterns
Kaas
, 139 - 142 .
, J . H . , 345
387
Kahneman
, 569
, 570
, 578
, D . , 536
, 564
529
INDEX
Kaiserman
-Abramof, I. R., 336, 361,
365, 571
Kalaska
, J. F., 378, 560
Kandel,E. R., 333, 364, 507, 552,
561, 564, 576
Kanerva
, P., 76, 465, 511
Kant, E., 17, 19, 564
Kaplan
, R. M., 119, 516, 274, 559,
564
Kasamatsu
, T., 476, 477, 492, 497,
564
Kawamoto
, A. H., 101, 511, 277,
311, 564
Kawashima
, T., 77, 94, 560
Keele, S. W., 171, 183, 200, 203,
572
, H . S ., 275 , 565
, G ., 97
LaManna
, J . C ., 477 , 561
of
, J . H ., 241 , 512
Larsson
Kienker
Kinematics
of , 202
and electricity
, T ., 42 , 45 , 62 , 63 , 152 ,
, M ., 473 , 565
, S . W ., 367 , 565
, K ., 350 , 553
machines
of ,
315 - 318
, R . Y ., 351 , 563
Lehiste
Kullback
, So
, 294
, 511
KuDo
, Mo
, 336
, 565
, I ., 61 , 565
Lesser , U . R ., 63 , 559
530 INDEX
vectors
activation
model
of
ba & ic
word perception
LeVay , S., 350 , 353 , 355 , 361 , 374 ,
497 , 500 , 563 , 565 , 566
reductionism
and emergent
- associator
, W ., 63 , 569
Lieberman
, F . , 489 , 557
Limited
fan - out . See Fan - out ,
limited ; Standard
associators
Limited
pattern
hypothesis , 520 .
matrices
and linear
systems
LISP
use
of
in
language
P3
system
, R . , 364
Local
representations
, 558
, G . E . , 383
- 477
, 208
, 566
Luce
, notion
of .
model
, 43 , 192
de
No
- 193
, 566
, R . , 344
, 358
, 566
, J . B . , 62 , 560
, R . D .,
Luce
choice
Lund
, 410 - 413
75 , 93 , 195
rule
, 566
, application
of , 90 -
, J . S . , 357
566
, 358
, 361
, 366
Lund
, R . D . , 336
Luria
, A . R . , 41 , 79 , 135
Luria
' s dynamic
, 357
, 566
, 512
functional
41
Lynch
Ma
, J . C . , 366
, S . - K . , 388
Macchi
, 374
, 573
, 567
410
nonlinear
See
functions
, L . H . , 402
Lorente
, 493
, 566
computation
Loomis
- 497
, 566
) , 476
, E . F . , 85 , 508
Logogen
124
, 496
, 77 , 85 , 96
( LC
Activation
, 492
, 370
coeruleus
Loeb
, 386 - 390
inverses
, 64 - 65
91 , 195
linearity , 393
matrix
, 65
Llinas
399 - 403
matrices
, 63 - 66 , 425
Activation
programming
Lovins
, 62
units
a / so
function
Logical
associator
See
perceptron
Loftus
of , 63
, 63
threshold
Locus
increment
See
of , 63
XOR
, J . C . R ., 106 , 566
, 61 - 63 .
functions
, A . M ., 61 , 62 , 73 , 92 , 94 ,
Licklider
linear
224
- 373
- 375
version
associator
Linear
cells , 480
, 374
algebra
weaknesses
and
, simple
simple
- 366
, 370
Linear
Lewis
of
- 385
, 375
models
a / so
Lichten
of , 365
spaces
Linear
pattern
Liberman
, 383
combinations
auto
LGN
model
linear
PDP
products
, B . , 341 , 566
analysis
inner
vector
in
independence
- 370
of
description
of , 367
, use
simple
operations
concepts
, G . , 350
Macrodescription
mechanism
, 567
, 567
of
, 246
harmony
- 258
system
531
INDEX
problem solving for , 246-247
productions
schemata
one
, 253
, 256 - 258
inverses
function
and
, W . D . , 43 , 512 , 63 ,
77 , 79 , 80 , 97 , 98 , 99 , 275 , 567
, K . A . C ., 363 , 567 , 576
Massaro , D . W ., 77 , 81 , 86 , 94 , 567 ,
568 , 571
and linear
Linear algebra
basis for vector space, change of ,
413 -418
90
See
/ so
properties
also
Matrices
. ,
345
58
119
352
model
157
158
. ,
20
85
126
75
127
138
195
217
370
380
381
383
394
532
558
559
568
574
. ,
480
161
. ,
152
77
142
135
130
172
141
28
133
McConkie
294
512
71
27
121
69
140
321
123
24
120
63
292
512
22
216
59
79
202
353
577
131
567
163
568
512
573
McCulloch
McDermott
McGeoch
McGill
MD
See
/ so
424
568
235
508
575
. ,
374
. ,
375
448
553
568
deprivation
171
181
200
205
568
Memory
Monocular
536
432
Medin
. ,
145
McNaughton
. ,
456
512
distributed
model
Distributed
of
representations
retrieving
See
information
from
conclusions
on
214
215
assumptions
of
experimental
results
repetition
192
and
182
effects
general
information
extensions
of
208
176
simulations
familiarity
of
specific
amnesia
of
199
representation
, 410 - 413
and multilayer
170
detailed
of , 387 - 389
inverses
403
systems
199
71
Memory
addition
McClurkin
403
matrix
43
177
Meditch
matrices
391
413
likelihood
42
See
568
McGuinness
McClelland
, W . H ., 352 , 579
500
Martin
algebraic
Marshall
McCarthy
80
- Wilson
390
systems
Maximum
, R . T ., 480 , 567
390
linear
356
514
, J . C ., 102 , 508
Marshall
389
Maunsell
Marslen
product
linear
mapping
of
, o . S . M ., 134 , 514
Marrocco
outer
410
and
Matrix
Matrices
Marin
system
and
Matrix
also Boolean
POP
405
approximation
two -choice
layer
transposes
spared
and
199
206
model
learning
in
207
532 INDEX
Miller
memory
emergence
semantic
of , 207
memory
, emergence
of
modular
structure
, 174 - 175
pattern
of activation
, 175
, key aspects
of ,
information
representations , features of ~
Memory , a distributed model of
content addressability , 25-29
default assignment , 29- 30
graceful degradation , 29
563 , 569
Models
, similarities
, M . M ., 350 , 569
Modular
, 80 - 81
Microfeatures
Semantic
, semantic . See
microfeatures
of cognition , 12- 13
Miller
, D . T ., 536 , 564
Miller
Miller
, J . L ., 62 , 96 , 569
Miller
Miller
, P ., 515 , 576
structure
distributed
Molinari
of memory
, 79 ,
model
of
, M . , 35 , 567
Monocular
deprivation
( MD ) , 474 ,
475
OD shift
under , 487
rearing , 489
Monocular occlusion (AM ) , 474
Monotonicity concepts for activation
functions
Activation
Monotonicity
Morrell
Microfeature
of
Michael
of patterns
interconnectivity
in PDP models ,
52- 53 . See a/so Learning
Montero
and differences
, 4-5
Modifications
Miezin
between
Memory , retrieving
Microstructure
, D . E ., 475 , 558
Merzenich
Mitchell
Mitchinson
509
Mesulam
, D ., 350 , 567
blends , 208
regularities of behavior ,
Mercer
Minciacchi
209
relation
theories
, W ., 254
Milner
Moran
, V . M ., 351 , 569
, J ., 374 , 558
Morris
Morrison
, F ., 379 , 569
, R . G . M ., 434 , 466 , 569
, J . H ., 364 , 569
example of multiple
constraints
in ,
4 -6
INDEX 533
Moussouris
, J., 289, 513
Mower-. G. D.~
. 478, 570
groups of neurons
behavior
cerebral cortex , 365 - 366
565 , 570
effects
~ Graceful
duplication of connection
information , need for to exploit ,
124
mutual
constraints
in
of on a network
degradation
Neural and conceptual interpretation
in
, delta
rule
in
, L ., 61 , 110 , 570
, P . W ., 514 , 515 , 574
Nativism
NE . See Norepinephrine
Necker
Schemata
, concept
of
Neocortical
neurons
cell types
of
in , 358 - 360
experimental
feature
, nature
detection
, 367 - 371
on , 478
discussion
, 496 - 497
569 , 570
Nelson
environment
Nelson
neuron
ocularity
, 479 - 480
534 INDEX
isomorphism , failure of in , 422 -424
quasi -linear systems with , 418 -422
484 - 489
subthreshold
respect
499 - 501
to ocular
with
Norepinephrine
in ,
48O . 493
dominance
, 420
also Neocortical
of
of , 501
terminology , 473
visual
summation
Nonpyramidal
neurons
, nature
Norman , D . A . , 8 , 9 , 14 , 15 , 79 ,
116 , 133 , 206 , 512 , 513 , 514 , 18 ,
Neural
570 , 574
cortex ,
of neurons
, 365 -366
between
Nusbaum
number
among , 132
output
of , 133 - 134
, 358 -365
of , 131
Newsome
Newtonian
, H . C ., 120 , 571
Ocular dominance
mechanics
(00 )
class , 484
, D ., 63 , 100 , 571
shift , under
communication
continuous
, in brain ,
Norris
MO , 487
, 474 - 476
dominance
484 , 490
, 479 - 480
, 125
Nicholson
, C ., 364 , 566
00
Ninteman
, F . W ., 493 , 572
Nonlinear
. See Ocular
dominance
dominance
histogram
001 . See Ocular
dominance
index
INDEX535
A ) , 502 - 503
plan language of , 490 , 491 -497
simplified diagrams of , 454 -455 ,
and linear
464
systems
in interactive
models , 60
units
definition
of , 112 - 113
simulation
of , 48
PARSIFAL
(Marcus ) , 317
Parsing in PDP models , 317 -323
Part / whole hierarchy , 105
Past-tense acquisition , three stages
of . 219 - 221 . See a /so Verbs ,
48 - 49
process
programmer
model
state as , 176
definition
of , 55
Pattern associators . See also
Programmable
Standard
pattern associators ~
pattern
associators
Paradigms of learning
, 54 - 55
definition
regularity discovery , 55
of , 161
processing . See
patterns
in , 226
POP .
restrictions
of , 228 - 233
33 , 533
of ,
176
, 42
Parallel distributed
, 175
associative
, in distributed
of memory
of
Pandemonium
of activation
model
process
system .
programmer
( P3
, delta
rule
in , 447 -453
in conceptual interpretation
models , 406 - 411
of PDP
536 INDEX
Pattern completion
device , use of
Boltzmann
machine
See also Boltzmann
as , 289 - 290 .
machines
Pattern of connectivity
aspect of PDP
Patterson
as major
models , 46 , 49 - 51
science
on , 145 - 146
of , 537 - 538
of , 539 - 543
evaluative
541 - 542
multiple
structure
, need for ,
realism
in , 136 - 138
reductionism
and emergent
of , 111 - 120
of various
to serial
models , 12 -
retrieval
control
, 25 - 31
, 13 - 17
perception , 18-24
history of , 41 -44
local vs . distributed representation ,
32 - 33
associator
models
attractive properties of , 38
ensemble of patterns , extracting
structure
from , 39 -40
models
associators
attractive
, 33 - 37
of ensemble
of patterns
extracting , 39-40
POP models
, 32
associator
properties of , 38
, 540 - 541
other
pattern
result
weaknesses
542 - 543
121
status
in , 545 - 546
strengths of , 535-539
, K ., 102 , 508
conclusions
Cerebral
and
neurophysiology
mechanisms
relevant to , 327 -
328
POP models , future
547 - 552
directions
on , 74 - 76
hierarchical
organizations , 57
interactive
models , 59 - 60
rule , 51 - 52
for ,
for
INDEX
versions, specific
brain statein a box model (BSB) ,
66-68
Feldmanand Ballard's units, 72
Grossberg
' s units, 70-71
interactiveactivationmodel, 71
linear thresholdunits, 63-66
simple linear models, 61-63
thermodynamicmodels, 68-70
POPmodels, learningrate as factor
in, 472-473. SeealsoNeural
plasticityand learning
POPmodels, neuraland conceptual
interpretationof
conclusionson, 429-431
considerationof, 390-391
distributednonlinearmodels,
natural competitionin, 424-429
as dynamicalsystems, 397-398
interpretations, 391-J96
isomorphismhypothesis, failure of
in nonlinearmodels, 422-424
kinematics, 400-403
kinematicsand dynamics, 398-399
learningconnectionsand
isomorphismof levels, 416-418
linear systems, isomorphismof
levelsin, 411-413
localizeddamageand, 413-416
nonlinearity, Quasi
-linear systems
with, 418-422
patterncoordinates
, 406-411
vector space, structureof, 403-406
Pearlmutter, B., 298
Pearson
, J. C., 385, 575
Peck, C. K., 475, 571
Peptidesin the cerebralcortex, 340
Perez, R., 494, 571
Perception
, examplesof for POP
models
familiar patternsand, 20-23
novel patterns, completionof, 2325
stereoscopicvision, 18-20
Perceptrons(Rosenblatt
) , 154-158
convergenceprocedure, 41-42, 65,
225-226
classC', 157-158
537
, E ., 376 , 578
, S . E ., 347 , 554
, J . W . , 119 , 565
Phoneme
identification
, factors
influencing
categorical perception , 84 , 88 - 95
detectors , retuning of by context ,
95 - 97
lexical
effects
on , 77 - 78
, summary
of , 97
restoration
effect , 20
538
INDEX
distal
on , 466
landmarks
distributed
460
parameters
model , 449 -
functions
of , 491 - 493
, H ., 213 , 513
, L . J ., 350 , 572
Probability
information
, 441 - 443
features
pattern associators .
of
, 445
view - field
of
Programmable
representational system , 47
Programmable blackboard model of
conclusions
discussion
on , 486
of , 485 - 486
distributed
representations ,
simultaneous
Prototypes
coexistence of and repeated
exemplars , 189- 192
neurons
, nature
of
function , 52 ,
, 394 -395 .
models , neural
and
conceptual interpretation of
Quasi -multilinear
activation
Quillan , M . R ., 85 , 513
function ,
INDEX539
Rader , R . K ., 497 , 577
Rail , W ., 336 , 381 , 382 , 569 , 572 ,
575
Ralston
also Amnesia
, H . J ., 336 , 572
Ramachandran
, V . S ., 478 , 572
- dot stereograms
, 18
Rawlins
Reading, programmable blackboard model of (PABLO). See also Standard pattern associators
    ambiguous characters, 155-157
    amount of feedback, effect of
    bottom-up activations, pattern of, 161-164
    coarse coding, 146-147
    computer simulation, results of, 164-166
    conclusions on, 168-169
    connection information distribution (CID) in, 127-129
        benefits of, 136-141
        mechanism, details of, 129-136
    extensions of, 167-168
    fixations, sequences of
    interference and crosstalk, 139-141
    overlapping slots in, 143-147
    PABLO simulation model, details of, 151-153
    role-specific letter units, 145-146
    word perception model, version of, 137-138
    words of different lengths, 147
Recurrent networks, PDP models as
Relaxation, 286
    interactive models, 292-294
    local minima, escape from, 287-288
    maximum likelihood optimization
Repetition and familiarity effects, 192-199
    alternative interpretation of, 193-194
    response curves, effects of experimental variables on, 195-199
    traditional interpretation of, 192-193
Roberts, E., 351, 562
Rockland
Rosenblatt
Rosenfeld, A., 285, 514
Rosenthal, M., 477, 561
See Reverse suture
Rudnicky, A., 58, 557
Rumelhart, D. E., 6, 9, 14, 15, 20, 518-524
feature detectors, 114
Ryder, J. M., 7, 511
Saffran, E. M., 134, 514
Samuel, A. G., 95, 160, 574
Sanderson, K. J., 480, 574
and linear systems. See Vectors
Schacter, D., 519, 574
Schaffer
18, 19, 324, 574
Scheibel, A. B., 351, 574
Scheibel, M. E., 351, 574
Schemata
    conclusions on, 53-57
    consciousness, contents of, 39
    constraint satisfaction and, 17
    content of, 33-34
    conversations and, 42-44
    description of, 1, 7-8
    example of, 22-25
    examples of processing of, 25-31
    goodness-of-fit landscapes of, 15, 16, 28, 29, 30, 31, 32-33, 35
    history of, 17-19
    important features of, 19-20
    interpretation differences and, 21
    mental models and, 40-42
    mental simulations and practice, 42
    PDP models as constraint satisfaction networks, 8-17
    summary of, 48
    variables, concept of, 9, 78-79
, 201-203
formulation of, 209-210
function, 205-206
modified, 203-205
Schiller, P. H., 367, 574
, T. M., 96, 569
Schmolze, J. G., 313, 555
Schneider
Schumacher, 515
Schwartz, M. F., 134, 514
Schwartz, M., 477, 554
Schweikert
Scott, J., 7, 511
, G. L., 351, 569
Semantic memory, model of
    microfeatures
    properties of, 278-283
Semantic networks, 36
    knowledge representation, 36
    processes, 36-37
Sememe units, 97-98, 99, 101-102
subschema structure, 204-205
emergence of, 208
Semilinear activation functions. See Activation functions, semilinear
Sentence processing in PDP networks. See Case role assignment, model of
Sentence-structure (SS) representation, 283-286. See also Case role assignment, model of
Sequential symbol processing, 106-108
Shankweiler, D., 92, 94, 566
Shannon
, J. L., 274, 567
Sherman, S. M., 351, 560
Shiffrin, R. M., 197, 198, 559, 574
Silverman, D. J., 385, 575
Silverstein
Simulated annealing, 287, 288-289. See also Harmony theory
Simulation environment, as major component of P3 system
Slowiaczek, L. M., 120, 571
Speech, architecture for TRACE model of
    context effects, left and right, 60
    cues, sensitivity of, 62
    importance of, 63-64
    lack of boundaries, 60, 61
, L. R., 350, 577
Spencer, W. A., 364, 576
Spoehr, K., 24, 515
Standard pattern associators, conclusions on. See also Programmable pattern associators
Steedman, M., 274, 557
Structure in, 155-156
Studdert-Kennedy
Subsymbolic vs. symbolic paradigms
Subthreshold summation, 374
Summerfield
Sur, 352
Surround, 377
Sutton
Swanson, 354
Swets
Swinney
Symbol processing, sequential, and constituent structure, 97-98, 104
Symbolic model, 402
Symmetry, 262
Synapses in the cerebral cortex, types of, 338-341
Synaptic change, 317
Synchronous vs. asynchronous update
Syntactic processing
Szentagothai, 339
Tabula rasa
Takeuchi
Talbot
Tank
Tash
Teitelbaum
Teller
Tenenbaum
units, analysis of, 81
coordinate systems for, 400
PDP model, 46, 48
, L. R., 351, 560
Terminal parameters, 491. See also P3 system
harmony function H
    definitions of, 264-267
    from H, 269-272
    storing information in H, 272-273
observables, second-order
Thermal equilibrium
Thermodynamic limit, 239
Thermodynamic models, 68-70
    Boltzmann machines, 68, 69
    harmony theory, 68
Thibadeau
Thomas
, M., 350, 555
Threshold, 199-213
Timing
Torre, V., 117, 129, 322, 365, 381, 513
Touret
, M., 227, 512
TRACE model
    conclusions on, 120-121
    deficiencies of, 119-120
    description of, 2, 58-59, 64-68
    effects on, 77-81
    interactive model, 59
    word segmentation in, 106-115
    word segmentation simulations, summary of, 115-117
    successes, summary of, 117-119
    TRACE I model, 69, 70-71, 75, 76
    TRACE II model, 69, 70, 71-75, 76, 102, 110
Turner, M. R., 385, 560
Two-dimensional
stimulation, 383-384
Two-layer scheme in harmony theory, advantages of, 218-219
Tyler, L. K., 97, 98, 99, 104, 275, 567, 577
, K., 339, 577
function, 427
UNIT TYPE statement for pattern generator in P3, 493-494
Valverde
, F., 363, 559
Vector space, structure of, 403-406
Vectors
    combinations of, linear
    independence of, 365-366
    inner products of, 368-369
Venezky, R. L., 24, 515
Verb-frame selection for role assignment
Verbs
    novel, transfer to, 239-240, 261-266
    structure of, 221-223
    summary of structure of
Verbs, regular, 245-246, 247, 254-257, 258-260. See also Verbs
Videen, T. O., 497, 577
Vital-Durand
monocular occlusion, alternating
and horizontal lines
Volman, S. F., 367, 574
Walker, E. C. T., 7, 511
Walker, J. A., 435, 554
Warren, R. M., 20, 516
Webster, H. de F., 338, 571
Welsh, A., 43, 63, 77, 79, 80, 98, 99, 512, 567
Werblin, F. S., 336, 578
, L. E., 336, 578
Whittlesea
Widrow-Hoff rule. See also Delta rule
Wiener
Willshaw, D. J., 42, 97, 100, 460
Willshaw model, 534-535
    difficulties of
Winston
metaphor, 195
Wolverton, A. E., 349, 578
Word-order and semantic cues to
Word perception, model of
, G., 321, 516
, R. E., 345, 578
, S. R., 350, 577
Word segmentation, lexical basis of, for TRACE, 106-115
    multiple identification and
    summary of, 115-117
Wu, T. Y., 350, 577
Wyatt, H. J., 500, 558
XOR problem. See also Delta rule, generalized
, E. D., 383, 574
, D., 161, 568
Zucker