Parallel Distributed Processing: Explorations in the Microstructure of Cognition
A Bradford Book
The MIT Press
Cambridge, Massachusetts
London, England
© 1986 by The Massachusetts Institute of Technology
Rumelhart, David E.
Parallel distributed processing.
CHAPTER 7

Learning in Boltzmann Machines

G. E. HINTON and T. J. SEJNOWSKI
Many of the chapters in this volume make use of the ability of a paral-
lel network to perform cooperative searches for good solutions to prob-
lems. The basic idea is simple: The weights on the connections
between processing units encode knowledge about how things normally
fit together in some domain and the initial states or external inputs to a
subset of the units encode some fragments of a structure within the
domain. These fragments constitute a problem: What is the whole
structure from which they probably came? The network computes a
"good solution" to the problem by repeatedly updating the states of
units that represent possible other parts of the structure until the net-
work eventually settles into a stable state of activity that represents the
solution.
One field in which this style of computation seems particularly
appropriate is vision (Ballard, Hinton, & Sejnowski, 1983). A visual
system must be able to solve large constraint-satisfaction problems
rapidly in order to interpret a two-dimensional intensity image in terms
of the depths and orientations of the three-dimensional surfaces in the
world that gave rise to that image. In general, the information in the
image is not sufficient to specify the three-dimensional surfaces unless
the interpretive process makes use of additional plausible constraints
about the kinds of structures that typically appear. Neighboring pieces
of an image, for example, usually depict fragments of surface that have
similar depths, similar surface orientations, and the same reflectance.
The most plausible interpretation of an image is the one that satisfies
constraints of this kind as well as possible, and the human visual system stores enough plausible constraints and is good enough at applying them that it can arrive at the correct interpretation of most normal images.
The computation may be performed by an iterative search which starts with a poor interpretation and progressively improves it by reducing a cost function that measures the extent to which the current interpretation violates the plausible constraints. Suppose, for example,
that each unit stands for a small three-dimensional surface fragment,
and the state of the unit indicates the current bet about whether that
surface fragment is part of the best three-dimensional interpretation.
Plausible constraints about the nature of surfaces can then be encoded
by the pairwise interactions between processing elements. For
example, two units that stand for neighboring surface fragments of
similar depth and surface orientation can be mutually excitatory to
encode the constraint that each of these hypotheses tends to support
the other (because objects tend to have continuous surfaces).
RELAXATION SEARCHES
How are the weights that encode the knowledge acquired? For
models of low-level vision it is possible for a programmer to
decide on the weights, and evolution might do the same for the
earliest stages of biological visual systems. But if the same kind
of constraint-satisfaction searches are to be used for higher
level functions like shape recognition or content-addressable
memory, there must be some learning procedure that automatically encodes properties of the domain into the weights.
Following Hopfield (1982), the energy of a global configuration of the network can be defined as

$$E = -\sum_{i<j} w_{ij} s_i s_j + \sum_i \theta_i s_i$$

where $w_{ij}$ is the strength of connection (synaptic weight) from the $j$th to the $i$th unit, $s_i$ is the state of the $i$th unit (0 or 1), and $\theta_i$ is a threshold.
The updating rule is to switch each unit into whichever of its two states yields the lower total energy given the current states of the other units. Because the connections are symmetrical, the difference between the energy of the whole system with the $k$th hypothesis false and its energy with the $k$th hypothesis true can be determined locally by the $k$th unit, and is just

$$\Delta E_k = \sum_i w_{ki} s_i - \theta_k .$$
1 Hopfield used the states 1 and −1 because his model was derived from physical systems called spin glasses in which spins are either "up" or "down." Provided the units have thresholds, models that use 1 and −1 can be translated into models that use 1 and 0 and have different thresholds.
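To make the deterministic updating rule concrete, here is a minimal Python sketch (ours, not the chapter's; the matrix representation, zero self-connections, and random visiting order are assumptions) of a network settling to a local energy minimum:

```python
import numpy as np

def energy(w, theta, s):
    """Global energy: E = -sum_{i<j} w_ij s_i s_j + sum_i theta_i s_i."""
    return -0.5 * s @ w @ s + theta @ s     # w symmetric, zero diagonal

def settle(w, theta, s, max_sweeps=100, rng=None):
    """Switch each unit into whichever state yields the lower total energy,
    until no unit wants to change: a stable state (local minimum)."""
    rng = rng or np.random.default_rng()
    for _ in range(max_sweeps):
        changed = False
        for k in rng.permutation(len(s)):    # visit units in random order
            gap = w[k] @ s - theta[k]        # energy gap Delta E_k
            new_state = 1 if gap > 0 else 0  # on iff being on lowers E
            if new_state != s[k]:
                s[k], changed = new_state, True
        if not changed:
            break
    return s
```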
Using Probabilistic Decisions to Escape From Local Minima
At about the same time that Hopfield showed how parallel networks
of this kind could be used to access memories that were stored as local
minima, Kirkpatrick, working at IBM, introduced an interesting new
search technique for solving hard optimization problems on conventional computers.
One standard technique is to use gradient descent: The values of the
variables in the problem are modified in whatever direction reduces the
cost function (energy). For hard problems, gradient descent gets stuck
at local minima that are not globally optimal. This is an inevitable
consequence of only allowing downhill moves. If jumps to higher
energy states occasionally occur, it is possible to break out of local
minima, but it is not obvious how the system will then behave and it is
far from clear when uphill steps should be allowed.
Kirkpatrick, Gelatt, and Vecchi (1983) used another physical analogy to guide the use of occasional uphill steps. To find a very low energy state of a metal, the best strategy is to melt it and then to slowly reduce its temperature. This process is called annealing, and so they named their search method "simulated annealing." Chapter 6 contains a discussion of why annealing works. We give a simple intuitive account here.
One way of seeing why thermal noise is helpful is to consider the energy landscape shown in Figure 1. Let us suppose that a ball-bearing starts at a randomly chosen point on the landscape. If it always goes downhill (and has no inertia), it will have an even chance of ending up at A or B because both minima have the same width and so the initial random point is equally likely to lie within the catchment area of either minimum.
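To illustrate (a toy sketch of simulated annealing, not the chapter's procedure; the landscape function, step size, and geometric cooling schedule are all invented for the example), a search that always accepts downhill moves but accepts uphill moves with probability exp(−ΔE/T) can escape the shallower minimum while T is still high, and is then trapped near the deeper one as T falls:

```python
import math
import random

def anneal(energy, x, step=0.2, t_start=2.0, t_end=0.01, n_steps=5000):
    """Simulated annealing on a 1-D energy landscape: uphill moves are
    accepted with probability exp(-dE / T), which shrinks as T is lowered."""
    cooling = (t_end / t_start) ** (1.0 / n_steps)  # geometric schedule
    t = t_start
    for _ in range(n_steps):
        x_new = x + random.uniform(-step, step)
        d_e = energy(x_new) - energy(x)
        if d_e <= 0 or random.random() < math.exp(-d_e / t):
            x = x_new                # downhill always; uphill sometimes
        t *= cooling
    return x

# Toy landscape with two minima; the right-hand one (x ~ 2.3) is deeper.
landscape = lambda x: 0.05 * x**4 - 0.5 * x**2 - 0.2 * x
print(anneal(landscape, x=-2.1))     # usually ends near the deeper minimum
```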
Pattern Completion
where $s_i^\alpha$ is the binary state of the $i$th unit in the $\alpha$th global state and $P_\alpha^-$ is the probability, at thermal equilibrium, of global state $\alpha$ of the network when none of the visible units are clamped (the lack of clamping is denoted by the superscript $-$). Equation 5 shows that the effect
3 In this example there are six different ways of using the extra unit to solve the task.
network and their interconnectivity define a space of possible models of the environment, and any particular set of weights defines a particular model within this space. The learning problem is to find a combination of weights that gives a good model given the limitations imposed by the architecture of the network and the way it runs.
More formally, we would like a way of finding the combination of weights that is most likely to have produced the observed ensemble of environmental vectors. This is called a maximum likelihood model and
there is a large literature within statistics on maximum likelihood estimation. The learning procedure we describe actually has a close relationship to a method called Expectation and Maximization (EM) (Dempster, Laird, & Rubin, 1977). EM is used by statisticians for estimating missing parameters. It represents probability distributions by using parameters like our weights that are exponentially related to probabilities, rather than using probabilities themselves. The EM algorithm is closely related to an earlier algorithm invented by Baum that manipulates probabilities directly. Baum's algorithm has been used successfully for speech recognition (Bahl, Jelinek, & Mercer, 1983). It estimates the parameters of a hidden Markov chain: a transition network which has a fixed structure but variable probabilities on the arcs and variable probabilities of emitting a particular output symbol as it arrives at each internal node. Given an ensemble of strings of symbols and a fixed-topology transition network, the algorithm finds the combination of transition probabilities and output probabilities that is most likely to have produced these strings (actually it only finds a local maximum).
Maximum likelihood methods work by adjusting the parameters to increase the probability that the generative model will produce the observed data. Baum's algorithm and EM are able to estimate new values for the probabilities (or weights) that are guaranteed to be better than the previous values. Our algorithm simply estimates the gradient of the log likelihood with respect to a weight, and so the magnitude of the weight change must be decided using additional criteria. Our algorithm, however, has the advantage that it is easy to implement in a parallel network of neuron-like units.
The idea of a stochastic generative model is attractive because it provides a clean quantitative way of comparing alternative representational schemes. The problem of saying which of two representational schemes
is best appears to be intractable. Many sensible rules of thumb are
available, but these are generally pulled out of thin air and justified by
commonsense and practical experience. They lack a firm mathematical
foundation. If we confine ourselves to a space of allowable stochastic
models, we can then get a simple Bayesian measure of the quality of a
representational scheme: How likely is the observed ensemble of
environmental vectors given the representational scheme? In our net-
works, representations are patterns of activity in the units, and the
representational scheme therefore corresponds to the set of weights that
determines when those patterns are active.
The gradient of $G$ with respect to a weight takes a remarkably simple form:

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T}\left(p_{ij}^+ - p_{ij}^-\right) \qquad (7)$$

where $p_{ij}^+$ is the probability, averaged over all environmental inputs and measured at equilibrium, that the $i$th and $j$th units are both on when the network is being driven by the environment, $p_{ij}^-$ is the corresponding probability when the network is free running, and $T$ is the temperature. One surprising feature of Equation 7 is that it does not matter whether the weight is between two visible units, two hidden units, or one of each. The same rule applies for the gradient of $G$.
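In code, Equation 7 says that following the gradient downhill in $G$ amounts to raising each weight in proportion to the difference between the two co-occurrence probabilities. A minimal sketch (the learning rate and the sampled-state representation are illustrative assumptions, not the chapter's procedure):

```python
import numpy as np

def cooccurrence(samples):
    """Estimate p_ij, the probability that units i and j are both on,
    from an array of sampled binary global states (one row per sample)."""
    s = np.asarray(samples, dtype=float)
    return s.T @ s / len(s)

def boltzmann_step(w, clamped_samples, free_samples, epsilon=0.1):
    """One step down the gradient of G: dw_ij proportional to p+_ij - p-_ij."""
    p_plus = cooccurrence(clamped_samples)   # phase+: environment clamped
    p_minus = cooccurrence(free_samples)     # phase-: free running
    dw = epsilon * (p_plus - p_minus)
    np.fill_diagonal(dw, 0.0)                # no self-connections
    return w + dw
```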
Unlearning
During this state of random excitation and free running they postulate
that changes occur at synapses to decrease the probability of the
spurious states.
A simulation of reverse learning was performed by Hopfield, Fein-
stein, and Palmer (1983) who independently had been studying ways to
improve the associative storage capacity of simple networks of binary
processors (Hopfield, 1982). In their algorithm an input is presented to
the network as an initial condition, and the system evolves by falling
into a nearby local energy minimum. However, not all local energy
minima represent stored information. In creating the desired minima,
they accidentally create other spurious minima, and to eliminate these they use "unlearning": The learning procedure is applied with reverse
sign to the states found after starting from random initial conditions.
Following this procedure, the performance of the system in accessing
stored states was found to be improved.
There is an interesting relationship between the reverse learning proposed by Crick and Mitchison and Hopfield et al. and the form of the learning algorithm which we derived by considering how to minimize an information theory measure of the discrepancy between the environmental structure and the network's internal model (Hinton & Sejnowski, 1983b). The two phases of our learning algorithm resemble the learning and unlearning procedures: Positive Hebbian learning occurs in phase+, during which information in the environment is captured by the weights; during phase− the system randomly samples states according to their Boltzmann distribution and Hebbian learning occurs with a negative coefficient.
However, these two phases need not be implemented in the manner suggested by Crick and Mitchison. For example, during phase− the average co-occurrences could be computed without making any changes to the weights. These averages could then be used as a baseline for making changes during phase+; that is, the co-occurrences during phase+ could be computed and the baseline subtracted before each permanent weight change. Thus, an alternative but equivalent proposal for the function of dream sleep is to recalibrate the baseline for plasticity: the break-even point which determines whether a synaptic weight is incremented or decremented. This would be safer than making permanent weight decrements to synaptic weights during sleep and solves the problem of deciding how much "unlearning" to do.
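A sketch of this baseline variant (the class and method names are hypothetical): during phase− the co-occurrences are only recorded; every permanent weight change happens during phase+, with the stored baseline subtracted, which is equivalent to the two-signed rule.

```python
import numpy as np

class BaselineLearner:
    """phase- records a baseline; all weight changes occur in phase+."""

    def __init__(self, n_units, epsilon=0.1):
        self.w = np.zeros((n_units, n_units))
        self.baseline = np.zeros((n_units, n_units))
        self.epsilon = epsilon

    def observe_free_running(self, free_samples):
        """phase-: store co-occurrences; make no weight changes."""
        s = np.asarray(free_samples, dtype=float)
        self.baseline = s.T @ s / len(s)

    def learn_clamped(self, clamped_samples):
        """phase+: Hebbian increment with the baseline subtracted."""
        s = np.asarray(clamped_samples, dtype=float)
        dw = self.epsilon * (s.T @ s / len(s) - self.baseline)
        np.fill_diagonal(dw, 0.0)
        self.w += dw
```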
Our learning algorithm refines Crick and Mitchison's interpretation
of why two phases are needed. Consider a hidden unit deep within the
network: How should its connections with other units be changed to
best capture regularity present in the environment? If it does not receive direct input from the environment, the hidden unit has no way to determine whether the information it receives from neighboring units is ultimately caused by structure in the environment or is entirely a result of the other weights. This can lead to a "folie à deux" where two parts of the network each construct a model of the other and ignore the external environment. The contribution of internal and external sources can be separated by comparing the co-occurrences in phase+ with similar information that is collected in the absence of environmental input. Phase− thus acts as a control condition. Because of the special properties of equilibrium it is possible to subtract off this purely internal contribution and use the difference to update the weights. Thus, the role of the two phases is to make the system maximally responsive to regularities present in the environment and to prevent the system from using its capacity to model internally-generated regularities.
AN EXAMPLE OF HARD LEARNING
The entire set of 40 annealings that were used to estimate $p_{ij}^+$ and $p_{ij}^-$ was called a sweep. After each sweep, every weight was incremented by $5(p_{ij}^+ - p_{ij}^-)$. In addition, every weight had its absolute magnitude decreased by 0.0005 times its absolute magnitude. This weight decay prevented the weights from becoming too large and it also helped to resuscitate hidden units which had predominantly negative or predominantly positive weights. Such units spend all their time in the same state and therefore convey no information. The phase+ and phase− statistics are identical for these units, and so the weight decay gradually erodes their weights until they come back to life (units with all zero weights come on half the time).
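The sweep update just described can be sketched as follows (expected-value form; as noted later, the simulation actually applied the decay as stochastic discrete decrements):

```python
import numpy as np

def sweep_update(w, p_plus, p_minus):
    """One sweep: increment every weight by 5(p+ - p-), then reduce its
    absolute magnitude by 0.0005 times that magnitude (weight decay)."""
    w = w + 5.0 * (p_plus - p_minus)
    return w * (1.0 - 0.0005)        # shrink |w_ij| by 0.0005 * |w_ij|
```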
The Network
The Training Procedure
8 See Hinton, Sejnowski, and Ackley (1984) for a discussion of the advantages of discrete weight increments over the more obvious steepest descent technique in which the weight increment is proportional to $p_{ij}^+ - p_{ij}^-$.
magnitude decreased by 1. For each weight, the probability of this hap-
pening was 0.0005 times the absolute magnitude of the weight.
We found that the network performed better if there was a preliminary learning stage which just involved the sememe units. In this stage, the intermediate units were not yet connected. During phase+ the required patterns were clamped on the sememe units and $p_{ij}^+$ was measured (annealing was not required because all the units involved were clamped). During phase− no units were clamped and the network was allowed to reach equilibrium 20 times using the annealing schedule given above. After annealing, $p_{ij}^-$ was estimated from the co-occurrences as before, except that only 20 phase− annealings were used instead of 40. There were 300 sweeps of this learning stage and they resulted in weights between pairs of sememe units that were sufficient to give the sememe group an energy landscape with 20 strong minima corresponding to the 20 possible "word meanings." This helped subsequent learning considerably, because it reduced the tendency for the intermediate units to be recruited for the job of modeling the structure among the sememe units. They were therefore free to model the structure between the grapheme units and the sememe units.9 The results described here were obtained using the preliminary learning stage and so they correspond to learning to associate grapheme strings with "meanings" that are already familiar.
Using the same annealing schedule as was used during learning, the
network can be tested by clamping a grapheme string and looking at the
resulting activities of the sememe units. After 5000 learning sweeps, it
gets the semantic features exactly correct 99.3% of the time. A
performance level of 99.9% can be achieved by using a "careful"
annealing schedule that spends twice as long at each temperature and
goes down to half the final temperature.
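The "careful" schedule can be expressed as a transformation of whatever baseline schedule is in use; a sketch, assuming a schedule is a list of (temperature, duration) pairs and that the added final stage lasts as long as the doubled last stage (the chapter does not specify this detail):

```python
# A schedule is a list of (temperature, number_of_update_sweeps) pairs.
def careful_schedule(schedule):
    """Spend twice as long at each temperature, then continue down to
    half the final temperature of the original schedule."""
    doubled = [(t, 2 * n) for t, n in schedule]
    final_t, final_n = schedule[-1]
    doubled.append((final_t / 2.0, 2 * final_n))
    return doubled
```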
9 There was no need to have a similar stage for learning the structure among the gra-
pheme units because in the main stage of learning the grapheme units are always clamped
and so there is no tendency for the network to try to model the structure among them.
This kind of distributed representation should be more tolerant of local
damage than the more obvious method of using one intermediate unit
per word. We were particularly interested in the pattern of errors pro-
duced by local damage. If the connections between sememe units are
left intact, they should be able to "clean up" patterns of activity that are
close to familiar ones. So the network should still produce perfect out-
put even if the input to the sememe units is slightly disrupted. If the
disruption is more severe, the clean-up effect may actually produce a different familiar meaning that happens to share the few semantic
features that were correctly activated by the intermediate layer.
To test these predictions we removed each of the intermediate units
in turn, leaving the other 19 intact. We tested the network 25 times on
each of the 20 words with each of the 20 units removed. In all 10,000
tests, using the careful annealing schedule, it made 140 errors (98.6%
correct). Many errors consisted of the correct set of semantic features
with one or two extra or missing features, but 83 of the errors consisted
of the precise meaning of some other grapheme string. An analysis of
these 83 errors showed that the Hamming distance between the correct meanings and the erroneous ones had a mean of 9.34 and a standard deviation of 1.27, which is significantly lower (p < .01) than the complete set of Hamming distances, which had a mean of 10.30 and a standard deviation of 2.41. We also looked at the Hamming distances between the grapheme strings that the network was given as input and the grapheme strings that corresponded to the erroneous familiar meanings. The mean was 3.95 and the standard deviation was 0.62, which is significantly lower (p < .01) than the complete set, which had mean 5.53 and standard deviation 0.87. (A Hamming distance of 4 means that the strings have one letter in common.)
In summary, when a single unit is removed from the intermediate
layer, the network still performs well. The majority of its errors consist
of producing exactly the meaning of some other grapheme string, and
the erroneous meanings tend to be similar to the correct one and to be
associated with a grapheme string that has one letter in common with
the string used as input.
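For reference, the Hamming distance used in this analysis is simply the number of positions at which two equal-length sequences differ; a minimal sketch that applies equally to binary semantic feature vectors and to grapheme strings:

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))
```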
The Speed of Relearning
10 The surface is never very steep. Its gradient parallel to any weight axis must always lie between 1 and −1 because it is the difference of two probabilities.
FIGURE 6. The recovery of performance after various types of damage. Each data point represents 500 tests (25 with each word). The heavy line is a section of the original learning curve after a considerable number of learning sweeps. It shows that in the original learning, performance increases by less than 10% in 30 learning sweeps. All the other lines show recovery after damaging a net that had very good performance (99.3% correct). The lines with open circles show the rapid recovery after 20% or 50% of the weights to the hidden units have been set to zero (but allowed to relearn). The dashed line shows recovery after 5 of the 20 hidden units have been permanently ablated. The remaining line is the case when uniform random noise between −22 and +22 is added to all the connections to the hidden units. In all cases, a successful trial was defined as one in which the network produced exactly the correct semantic features when given the graphemic input.
CONCLUSION
ACKNOWLEDGMENTS
APPENDIX:
DERIVATION OF THE LEARNING ALGORITHM
clamps both the input and output units, and the $p_{ij}^+$s are estimated. During phase− the input units are clamped and the output units and hidden units free-run, and the $p_{ij}^-$s are estimated. The appropriate $G$ measure in this case is
$$G = \sum_{\alpha} P^+(I_\alpha \wedge O_\alpha)\,\ln\frac{P^+(O_\alpha \mid I_\alpha)}{P^-(O_\alpha \mid I_\alpha)}$$
Similar mathematics apply in this formulation and $\partial G / \partial w_{ij}$ is the same as before.