
© 1992 SCIENTIFIC AMERICAN, INC

How Neural Networks Learn from Experience

Networks of artificial neurons can learn to represent
complicated information. Such neural networks may provide
insights into the learning abilities of the human brain

by Geoffrey E. Hinton

The brain is a remarkable computer. It interprets imprecise information from the senses at an incredibly rapid rate. It discerns a whisper in a noisy room, a face in a dimly lit alley and a hidden agenda in a political statement. Most impressive of all, the brain learns - without any explicit instructions - to create the internal representations that make these skills possible.

Much is still unknown about how the brain trains itself to process information, so theories abound. To test these hypotheses, my colleagues and I have attempted to mimic the brain's learning processes by creating networks of artificial neurons. We construct these neural networks by first trying to deduce the essential features of neurons and their interconnections. We then typically program a computer to simulate these features. Because our knowledge of neurons is incomplete and our computing power is limited, our models are necessarily gross idealizations of real networks of neurons. Naturally, we enthusiastically debate what features are most essential in simulating neurons. By testing these features in artificial neural networks, we have been successful at ruling out all kinds of theories about how the brain processes information. The models are also beginning to reveal how the brain may accomplish its remarkable feats of learning.

In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites. The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches. At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons. When a neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical activity down its axon. Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron on another changes.

Artificial neural networks are typically composed of interconnected "units," which serve as model neurons. The function of the synapse is modeled by a modifiable weight, which is associated with each connection. Most artificial networks do not reflect the detailed geometry of the dendrites and axons, and they express the electrical output of a neuron as a single number that represents the rate of firing - its activity.

Each unit converts the pattern of incoming activities that it receives into a single outgoing activity that it broadcasts to other units. It performs this conversion in two stages. First, it multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs to get a quantity called the total input. Second, a unit uses an input-output function that transforms the total input into the outgoing activity [see "The Amateur Scientist," page 170].

The behavior of an artificial neural network depends on both the weights and the input-output function that is specified for the units. This function typically falls into one of three categories: linear, threshold or sigmoid. For linear units, the output activity is proportional to the total weighted input. For threshold units, the output is set at one of two levels, depending on whether the total input is greater than or less than some threshold value. For sigmoid units, the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater resemblance to real neurons than do linear or threshold units, but all three must be considered rough approximations.

To make a neural network that performs some specific task, we must choose how the units are connected to one another, and we must set the weights on the connections appropriately. The connections determine whether it is possible for one unit to influence another. The weights specify the strength of the influence.

The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of input units is connected to a layer of "hidden" units, which is connected to a layer of output units. The activity of the input units represents the raw information that is fed into the network. The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between

GEOFFREY E. HINTON has worked on representation and learning in artificial neural networks for the past 20 years. In 1978 he received his Ph.D. in artificial intelligence from the University of Edinburgh. He is currently the Noranda Fellow of the Canadian Institute for Advanced Research and professor of computer science and psychology at the University of Toronto.

NETWORK OF NEURONS in the brain provides people with the ability to assimilate information. Will simulations of such networks reveal the underlying mechanisms of learning?
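The two-stage conversion performed by a unit can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the article: the activities and weights are invented numbers, and the sigmoid is used as the input-output function.

```python
import math

def unit_output(incoming_activities, weights):
    # Stage 1: multiply each incoming activity by its connection weight
    # and add the products together to get the total input.
    total_input = sum(a * w for a, w in zip(incoming_activities, weights))
    # Stage 2: pass the total input through a sigmoid input-output function.
    return 1.0 / (1.0 + math.exp(-total_input))

# A unit with three incoming connections (illustrative numbers).
activities = [0.5, 1.0, 0.0]
weights = [0.4, -0.2, 0.9]
output = unit_output(activities, weights)  # total input is 0.0, so output is 0.5
```

A linear unit would instead return the total input directly, and a threshold unit would return one of two levels depending on whether the total input exceeds its threshold.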

SCIENTIFIC AMERICAN September 1992 145


the input and hidden units. Similarly, the behavior of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

This simple type of network is interesting because the hidden units are free to construct their own representations of the input. The weights between the input and hidden units determine when each hidden unit is active, and so by modifying these weights, a hidden unit can choose what it represents.

We can teach a three-layer network to perform a particular task by using the following procedure. First, we present the network with training examples, which consist of a pattern of activities for the input units together with the desired pattern of activities for the output units. We then determine how closely the actual output of the network matches the desired output. Next we change the weight of each connection so that the network produces a better approximation of the desired output.

For example, suppose we want a network to recognize handwritten digits. We might use an array of, say, 256 sensors, each recording the presence or absence of ink in a small area of a single digit. The network would therefore need 256 input units (one for each sensor), 10 output units (one for each kind of digit) and a number of hidden units. For each kind of digit recorded by the sensors, the network should produce high activity in the appropriate output unit and low activity in the other output units.

To train the network, we present an image of a digit and compare the actual activity of the 10 output units with the desired activity. We then calculate the error, which is defined as the square of the difference between the actual and the desired activities. Next we change the weight of each connection so as to reduce the error. We repeat this training process for many different images of each kind of digit until the network classifies every image correctly.

To implement this procedure, we need to change each weight by an amount that is proportional to the rate at which the error changes as the weight is changed. This quantity - called the error derivative for the weight, or simply the EW - is tricky to compute efficiently. One way to calculate the EW is to perturb a weight slightly and observe how the error changes. But that method is inefficient because it requires a separate perturbation for each of the many weights.

IDEALIZATION OF A NEURON processes activities, or signals. Each input activity is multiplied by a number called the weight. The "unit" adds together the weighted inputs. It then computes the output activity using an input-output function.

Around 1974 Paul J. Werbos invented a much more efficient procedure for calculating the EW while he was working toward a doctorate at Harvard University. The procedure, now known as the back-propagation algorithm, has become one of the more important tools for training neural networks.

The back-propagation algorithm is easiest to understand if all the units in the network are linear. The algorithm computes each EW by first computing the EA, the rate at which the error changes as the activity level of a unit is changed. For output units, the EA is simply the difference between the actual and the desired output. To compute the EA for a hidden unit in the layer just before the output layer, we first identify all the weights between that hidden unit and the output units to which it is connected. We then multiply those weights by the EAs of those output units and add the products. This sum equals the EA for the chosen hidden unit. After calculating all the EAs in the hidden layer just before the output layer, we can compute in like fashion the EAs for other layers, moving from layer to layer in a direction opposite to the way activities propagate through the network. This is what gives back propagation its name. Once the EA has been computed for a unit, it is straightforward to compute the EW for each incoming connection of the unit. The EW is the product of the EA and the activity through the incoming connection.

For nonlinear units, the back-propagation algorithm includes an extra step. Before back-propagating, the EA must be converted into the EI, the rate at which the error changes as the total input received by a unit is changed. (The details of this calculation are given in the box on page 148.)

COMMON NEURAL NETWORK consists of three layers of units that are fully connected. Activity passes from the input units (green) to the hidden units (gray) and finally to the output units (yellow). The reds and blues of the connections represent different weights.

The back-propagation algorithm was largely ignored for years after its invention, probably because its usefulness was not fully appreciated. In the early 1980s David E. Rumelhart, then at the University of California at San Diego, and David B. Parker, then at Stanford University, independently rediscovered the algorithm. In 1986 Rumelhart, Ronald J. Williams, also at the University of California at San Diego, and I popularized the algorithm by demonstrating that it could teach the hidden units to produce interesting representations of complex input patterns.

The back-propagation algorithm has proved surprisingly good at training networks with multiple layers to perform a wide variety of tasks. It is most useful in situations in which the relation between input and output is nonlinear and training data are abundant. By applying the algorithm, researchers have produced neural networks that recognize handwritten digits, predict currency exchange rates and maximize the yields of chemical processes. They have even used the algorithm to train networks that identify precancerous cells in Pap smears and that adjust the mirror of a telescope so as to cancel out atmospheric distortions.

Within the field of neuroscience, Richard Andersen of the Massachusetts Institute of Technology and David Zipser of the University of California at San Diego showed that the back-propagation algorithm is a useful tool for explaining the function of some neurons in the brain's cortex. They trained a neural network to respond to visual stimuli using back propagation. They then found that the responses of the hidden units were remarkably similar to those of real neurons responsible for converting visual information from the retina into a form suitable for deeper visual areas of the brain.

Yet back propagation has had a rather mixed reception as a theory of how biological neurons learn. On the one hand, the back-propagation algorithm has made a valuable contribution at an abstract level. The algorithm is quite good at creating sensible representations in the hidden units. As a result, researchers gained confidence in learning procedures in which weights are gradually adjusted to reduce errors. Previously, many workers had assumed that such methods would be hopeless because they would inevitably lead to locally optimal but globally terrible solutions. For example, a digit-recognition network might consistently home in on a set of weights that makes the network confuse ones and sevens even though an ideal set of weights exists that would allow the network to discriminate between the digits. This fear supported a widespread belief that a learning procedure was interesting only if it were guaranteed to converge eventually on the globally optimal solution. Back propagation showed that for many tasks global convergence was not necessary to achieve good performance.

On the other hand, back propagation seems biologically implausible. The most obvious difficulty is that information must travel through the same connections in the reverse direction, from one layer to the previous layer. Clearly,

How a Neural Network Represents Handwritten Digits

A neural network consisting of 256 input units, nine hidden units and 10 output units has been trained to recognize handwritten digits. The illustration below shows the activities of the units when the network is presented with a handwritten 3. The third output unit is most active. The nine panels at the right represent the 256 incoming weights and the 10 outgoing weights for each of the nine hidden units. The red regions indicate weights that are excitatory, whereas yellow regions represent weights that are inhibitory.
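The weight-perturbation method mentioned in the text (nudge one weight slightly and watch how the error changes) can be sketched for a single linear unit. The numbers here are invented for illustration; the sketch makes the inefficiency concrete, since estimating every EW this way costs one extra error evaluation per weight.

```python
def network_error(weights, inputs, desired):
    # A single linear unit: its output is the total weighted input;
    # the error is the squared difference from the desired output.
    actual = sum(x * w for x, w in zip(inputs, weights))
    return (actual - desired) ** 2

def perturbation_ews(weights, inputs, desired, eps=1e-6):
    # Estimate the error derivative EW for each weight by perturbing
    # that weight slightly and observing how the error changes.
    base_error = network_error(weights, inputs, desired)
    ews = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps  # one separate perturbation per weight
        ews.append((network_error(nudged, inputs, desired) - base_error) / eps)
    return ews

weights = [0.1, -0.3, 0.5]
inputs = [1.0, 2.0, 0.5]
ews = perturbation_ews(weights, inputs, desired=1.0)
```

For this linear unit the exact derivative is 2 x (actual output - desired output) x input activity, so the estimates can be checked directly.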



this does not happen in real neurons. But this objection is actually rather superficial. The brain has many pathways from later layers back to earlier ones, and it could use these pathways in many ways to convey the information required for learning.

A more important problem is the speed of the back-propagation algorithm. Here the central issue is how the time required to learn increases as the network gets larger. The time taken to calculate the error derivatives for the weights on a given training example is proportional to the size of the network because the amount of computation is proportional to the number of weights. But bigger networks typically require more training examples, and they must update the weights more times. Hence, the learning time grows much faster than does the size of the network.

The most serious objection to back propagation as a model of real learning is that it requires a teacher to supply the desired output for each training example. In contrast, people learn most things without the help of a teacher. Nobody presents us with a detailed description of the internal representations of the world that we must learn to extract from our sensory input. We learn to understand sentences or visual scenes without any direct instructions. How can a network learn appropriate internal representations if it starts with no knowledge and no teacher? If a network is presented with a large set of patterns but is given no information about what to do with them, it apparently does not have a well-defined problem to solve. Nevertheless, researchers have developed several general-purpose, unsupervised procedures that can adjust the weights in the network appropriately.

All these procedures share two characteristics: they appeal, implicitly or explicitly, to some notion of the quality of a representation, and they work by changing the weights to improve the quality of the representation extracted by the hidden units.

In general, a good representation is one that can be described very economically but nonetheless contains enough information to allow a close approximation of the raw input to be reconstructed. For example, consider an image consisting of several ellipses. Suppose a device translates the image into an array of a million tiny squares, each of which is either light or dark. The image could be represented simply by the positions of the dark squares. But other, more efficient representations are
The Back-Propagation Algorithm

To train a neural network to perform some task, we must adjust the weights of each unit in such a way that the error between the desired output and the actual output is reduced. This process requires that the neural network compute the error derivative of the weights (EW). In other words, it must calculate how the error changes as each weight is increased or decreased slightly. The back-propagation algorithm is the most widely used method for determining the EW.

To implement the back-propagation algorithm, we must first describe a neural network in mathematical terms. Assume that unit j is a typical unit in the output layer and unit i is a typical unit in the previous layer. A unit in the output layer determines its activity by following a two-step procedure. First, it computes the total weighted input x_j, using the formula

x_j = \sum_i y_i w_{ij},

where y_i is the activity level of the ith unit in the previous layer and w_{ij} is the weight of the connection between the ith and jth unit.

Next, the unit calculates the activity y_j using some function of the total weighted input. Typically, we use the sigmoid function:

y_j = \frac{1}{1 + e^{-x_j}}.

Once the activities of all the output units have been determined, the network computes the error E, which is defined by the expression

E = \frac{1}{2} \sum_j (y_j - d_j)^2,

where y_j is the activity level of the jth unit in the top layer and d_j is the desired output of the jth unit.

The back-propagation algorithm consists of four steps:

1. Compute how fast the error changes as the activity of an output unit is changed. This error derivative (EA) is the difference between the actual and the desired activity.

EA_j = \partial E / \partial y_j = y_j - d_j

2. Compute how fast the error changes as the total input received by an output unit is changed. This quantity (EI) is the answer from step 1 multiplied by the rate at which the output of a unit changes as its total input is changed.

EI_j = \partial E / \partial x_j = (\partial E / \partial y_j)(dy_j / dx_j) = EA_j \, y_j (1 - y_j)

3. Compute how fast the error changes as a weight on the connection into an output unit is changed. This quantity (EW) is the answer from step 2 multiplied by the activity level of the unit from which the connection emanates.

EW_{ij} = \partial E / \partial w_{ij} = (\partial E / \partial x_j)(\partial x_j / \partial w_{ij}) = EI_j \, y_i

4. Compute how fast the error changes as the activity of a unit in the previous layer is changed. This crucial step allows back propagation to be applied to multilayer networks. When the activity of a unit in the previous layer changes, it affects the activities of all the output units to which it is connected. So to compute the overall effect on the error, we add together all these separate effects on output units. But each effect is simple to calculate. It is the answer in step 2 multiplied by the weight on the connection to that output unit.

EA_i = \partial E / \partial y_i = \sum_j (\partial E / \partial x_j)(\partial x_j / \partial y_i) = \sum_j EI_j \, w_{ij}

By using steps 2 and 4, we can convert the EAs of one layer of units into EAs for the previous layer. This procedure can be repeated to get the EAs for as many previous layers as desired. Once we know the EA of a unit, we can use steps 2 and 3 to compute the EWs on its incoming connections.



also possible. Ellipses differ in only five ways: orientation, vertical position, horizontal position, length and width. The image can therefore be described using only five parameters per ellipse.

Although describing an ellipse by five parameters requires more bits than describing a single dark square by two coordinates, we get an overall savings because far fewer parameters than coordinates are needed. Furthermore, we do not lose any information by describing the ellipses in terms of their parameters: given the parameters of the ellipse, we could reconstruct the original image if we so desired.

TWO FACES composed of eight ellipses can be represented as many points in two dimensions. Alternatively, because the ellipses differ in only five ways - orientation, vertical position, horizontal position, length and width - the two faces can be represented as eight points in a five-dimensional space.

Almost all the unsupervised learning procedures can be viewed as methods of minimizing the sum of two terms, a code cost and a reconstruction cost. The code cost is the number of bits required to describe the activities of the hidden units. The reconstruction cost is the number of bits required to describe the misfit between the raw input and the best approximation to it that could be reconstructed from the activities of the hidden units. The reconstruction cost is proportional to the squared difference between the raw input and its reconstruction.

Two simple methods for discovering economical codes allow fairly accurate reconstruction of the input: principal-components learning and competitive learning. In both approaches, we first decide how economical the code should be and then modify the weights in the network to minimize the reconstruction error.

A principal-components learning strategy is based on the idea that if the activities of pairs of input units are correlated in some way, it is a waste of bits to describe each input activity separately. A more efficient approach is to extract and describe the principal components - that is, the components of variation shared by many input units. If we wish to discover, say, 10 of the principal components, then we need only a single layer of 10 hidden units. Because such networks represent the input using only a small number of components, the code cost is low. And because the input can be reconstructed quite well from the principal components, the reconstruction cost is small.

One way to train this type of network is to force it to reconstruct an approximation to the input on a set of output units. Then back propagation can be used to minimize the difference between the actual output and the desired output. This process resembles supervised learning, but because the desired output is exactly the same as the input, no teacher is required.

Many researchers, including Ralph Linsker of the IBM Thomas J. Watson Research Center and Erkki Oja of Lappeenranta University of Technology in Finland, have discovered alternative algorithms for learning principal components. These algorithms are more biologically plausible because they do not require output units or back propagation. Instead they use the correlation between the activity of a hidden unit and the activity of an input unit to determine the change in the weight.

When a neural network uses principal-components learning, a small number of hidden units cooperate in representing the input pattern. In contrast, in competitive learning, a large number of hidden units compete so that a single hidden unit is used to represent any particular input pattern. The selected hidden unit is the one whose incoming weights are most similar to the input pattern.

COMPETITIVE LEARNING can be envisioned as a process in which each input pattern attracts the weight pattern of the closest hidden unit. Each input pattern represents a set of distinguishing features. The weight patterns of hidden units are adjusted so that they migrate slowly toward the closest set of input patterns. In this way, each hidden unit learns to represent a cluster of similar input patterns.

Now suppose we had to reconstruct the input pattern solely from our knowledge of which hidden unit was chosen. Our best bet would be to copy the pattern of incoming weights of the chosen hidden unit. To minimize the reconstruction error, we should move the pattern of weights of the winning hidden unit even closer to the input pattern. This is what competitive learning does. If the network is presented with training data that can be grouped into clusters of similar input patterns, each hidden unit learns to represent a different cluster, and its incoming weights converge on the center of the cluster.

Like the principal-components algorithm, competitive learning minimizes the reconstruction cost while keeping the code cost low. We can afford to use many hidden units because even with a million units it takes only 20 bits to say which one won.

In the early 1980s Teuvo Kohonen of Helsinki University introduced an important modification of the competitive learning algorithm. Kohonen showed how to make physically adjacent hidden units learn to represent similar input patterns. Kohonen's algorithm adapts not only the weights of the winning hidden unit but also the weights of the winner's neighbors. The algorithm's ability to map similar input patterns to nearby hidden units suggests that a procedure of this type may be what the brain uses to create the topographic maps found in the visual cortex [see "The Visual Image in Mind and Brain," by Semir Zeki, page 68].

Unsupervised learning algorithms can be classified according to the type of representation they create. In principal-components methods, the hidden units cooperate, and the representation of each input pattern is distributed across all of them. In competitive methods, the hidden units compete, and the representation of the input pattern is localized in the single hidden unit that is selected. Until recently, most work on unsupervised learning focused on one or another of these two techniques, probably because they lead to simple rules for changing the weights. But the most interesting and powerful algorithms probably lie somewhere between the extremes of purely distributed and purely localized representations.

Horace B. Barlow of the University of Cambridge has proposed a model in which each hidden unit is rarely active and the representation of each input pattern is distributed across a small number of selected hidden units. He and his co-workers have shown that this type of code can be learned by forcing hidden units to be uncorrelated while also ensuring that the hidden code allows good reconstruction of the input.

Unfortunately, most current methods of minimizing the code cost tend to eliminate all the redundancy among the activities of the hidden units. As a result, the network is very sensitive to the malfunction of a single hidden unit. This feature is uncharacteristic of the brain, which is generally not affected greatly by the loss of a few neurons.

The brain seems to use what are known as population codes, in which information is represented by a whole population of active neurons. That point was beautifully demonstrated in the experiments of David L. Sparks and his co-workers at the University of Alabama. While investigating how the brain of a monkey instructs its eyes where to move, they found that the required movement is encoded by the activities of a whole population of cells, each of which represents a somewhat different movement. The eye movement that is actually made corresponds to the average of all the movements encoded by the active cells. If some brain cells are anesthetized, the eye moves to the point associated with the average of the remaining active cells. Population codes may be used to encode not only eye movements but also faces, as shown by Malcolm P. Young and Shigeru Yamane at the RIKEN Institute in Japan in recent experiments on the inferior temporal cortex of monkeys.

POPULATION CODING represents a multiparameter object as a bump of activity spread over many hidden units. Each disk represents an inactive hidden unit. Each cylinder indicates an active unit, and its height depicts the level of activity.

For both eye movements and faces, the brain must represent entities that vary along many different dimensions. In the case of an eye movement, there are just two dimensions, but for something like a face, there are dimensions such as happiness, hairiness or familiarity, as well as spatial parameters such as position, size and orientation. If we associate with each face-sensitive cell the parameters of the face that make it most active, we can average these parameters over a population of active cells to discover the parameters of the face being represented by that population code. In abstract terms, each face cell represents a particular point in a multidimensional space of possible faces, and any face can then be represented by activating all the cells that encode very similar faces, so that a bump of activity appears in the multidimensional space of possible faces.

Population coding is attractive because it works even if some of the neurons are damaged. It can do so because the loss of a random subset of neurons has little effect on the population average. The same reasoning applies if some neurons are overlooked when the system is in a hurry. Neurons communicate by sending discrete spikes called action potentials, and in a very short time interval many of the "active" neurons may not have time to send a spike. Nevertheless, even in such a short interval, a population code in one part of the brain can still give rise to an approximately correct population code in another part of the brain.

At first sight, the redundancy in population codes seems incompatible with the idea of constructing internal representations that minimize the code cost. Fortunately, we can overcome this difficulty by using a less direct measure of code cost. If the activity that encodes a particular entity is a smooth bump in which activity falls off in a standard way as we move away from the center, we can describe the bump of activity completely merely by specifying its center. So a fairer measure of code cost is the cost of describing the center of the bump of activity plus the cost of describing how the actual activities of the units depart from the desired smooth bump of activity.

Using this measure of the code cost, we find that population codes are a convenient way of extracting a hierarchy of progressively more efficient encodings of the sensory input. This point is best illustrated by a simple example. Consider a neural network that is presented with an image of a face. Suppose the network already contains one set of units dedicated to representing noses, another set for mouths and another set for eyes. When it is shown a particular face, there will be one bump of activity in the nose units, one in the mouth units and two in the eye units. The location of each of these activity bumps represents the spatial parameters of the feature encoded by the bump. Describing the four activity bumps is cheaper than describing the raw image, but it
150 SCIENTIFIC AMERICAN September 1992


© 1992 SCIENTIFIC AMERICAN, INC
would obviously be cheaper still to describe a single bump of activity in a set of face units, assuming of course that the nose, mouth and eyes are in the correct spatial relations to form a face. This raises an interesting issue: How can the network check that the parts are correctly related to one another to make a face? Some time ago Dana H. Ballard of the University of Rochester introduced a clever technique for solving this type of problem that works nicely with population codes.
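The population-vector reading described earlier — an eye movement, say, is read out as the average of the movements preferred by the active cells — can be sketched in a few lines. The Gaussian tuning curves, the number of units and the amplitude scale below are illustrative assumptions, not details from the article:

```python
import math

def bump(centers, center, width=1.0):
    """Smooth bump of activity over a pool of units with preferred values `centers`."""
    return [math.exp(-0.5 * ((c - center) / width) ** 2) for c in centers]

def decode(centers, activity):
    """Population-vector decoding: activity-weighted average of preferred values."""
    return sum(c * a for c, a in zip(centers, activity)) / sum(activity)

# 101 units whose preferred movement amplitudes tile [0, 10] (a made-up scale).
centers = [i / 10.0 for i in range(101)]
activity = bump(centers, 6.0)

print(round(decode(centers, activity), 2))   # 6.0: the population average

# "Anesthetize" the units preferring amplitudes above 6.5: the decoded movement
# shifts toward the average of the remaining active cells.
silenced = [0.0 if c > 6.5 else a for c, a in zip(centers, activity)]
print(decode(centers, silenced) < 6.0)       # True: the average moves down
```

Silencing part of the population biases the weighted average toward the surviving cells, mirroring the anesthesia observation described earlier.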
If we know the position, size and orientation of a nose, we can predict the position, size and orientation of the face to which it belongs because the spatial relation between noses and faces is roughly fixed. We therefore set the weights in the neural network so that a bump of activity in the nose units tries to cause an appropriately related bump of activity in the face units. But we also set the thresholds of the face units so that the nose units alone are insufficient to activate the face units. If, however, the bump of activity in the mouth units also tries to cause a bump in the same place in the face units, then the thresholds can be overcome. In effect, we have checked that the nose and mouth are correctly related to each other by checking that they both predict the same spatial parameters for the whole face.
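This agreement check can be sketched numerically. Everything specific below — the one-dimensional "position" parameter, the Gaussian bumps, the part-to-whole offsets and the threshold value — is an illustrative assumption, not a detail from the article:

```python
import math

def bump(centers, center, width=1.0):
    """Smooth bump of activity over a pool of units with preferred values `centers`."""
    return [math.exp(-0.5 * ((c - center) / width) ** 2) for c in centers]

def decode(centers, activity):
    """Population-vector decoding: activity-weighted average of preferred values."""
    return sum(c * a for c, a in zip(centers, activity)) / sum(activity)

# 101 units whose preferred positions tile the interval [0, 10].
centers = [i / 10.0 for i in range(101)]

# A nose at position 4.0 and a mouth at position 5.0, each encoded as a bump
# of activity in its own pool of units.
nose_act = bump(centers, 4.0)
mouth_act = bump(centers, 5.0)

# Each part predicts the face position from its own decoded position, using the
# roughly fixed spatial relation between part and whole (the offsets are made up).
face_from_nose = decode(centers, nose_act) + 1.0   # face center lies 1 unit below the nose
face_from_mouth = decode(centers, mouth_act)       # mouth sits at the face center

# The face units sum the two predicted bumps; the threshold is set so that one
# part alone is insufficient.  Agreeing predictions pile up and exceed it.
threshold = 1.5
face_input = [a + b for a, b in zip(bump(centers, face_from_nose),
                                    bump(centers, face_from_mouth))]
print(max(face_input) > threshold)   # True: a face bump forms

# A mouth in the wrong place predicts a different face position; the two bumps
# no longer coincide, the threshold is never reached, and no face bump forms.
face_from_bad_mouth = decode(centers, bump(centers, 7.0))
bad_input = [a + b for a, b in zip(bump(centers, face_from_nose),
                                   bump(centers, face_from_bad_mouth))]
print(max(bad_input) > threshold)    # False
```

When the parts are correctly arranged, the two predicted bumps coincide and their sum crosses the threshold; when the mouth is displaced, the predictions disagree and no face bump forms.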
BUMPS OF ACTIVITY in sets of hidden units represent the image of a nose and a mouth. These population codes will cause a bump in the face units if the nose and mouth have the correct spatial relation (left). If not, the active nose units will try to create a bump in the face units at one location while the active mouth units will do the same at a different location. As a result, the input activity to the face units does not exceed a threshold value, and no bump is formed in the face units (right).

This method of checking spatial relations is intriguing because it makes use of the kind of redundancy between different parts of an image that unsupervised learning should be good at finding. It therefore seems natural to try to use unsupervised learning to discover hierarchical population codes for extracting complex shapes. In 1986 Eric Saund of M.I.T. demonstrated one
method of learning simple population codes for shapes. It seems likely that with a clear definition of the code cost, an unsupervised network will be able to discover more complex hierarchies by trying to minimize the cost of coding the image. Richard Zemel and I at the University of Toronto are now investigating this possibility.

By using unsupervised learning to extract a hierarchy of successively more economical representations, it should be possible to improve greatly the speed of learning in large multilayer networks. Each layer of the network adapts its incoming weights to make its representation better than the representation in the previous layer, so weights in one layer can be learned without reference to weights in subsequent layers. This strategy eliminates many of the interactions between weights that make back-propagation learning very slow in deep multilayer networks.

All the learning procedures discussed thus far are implemented in neural networks in which activity flows only in the forward direction from input to output even though error derivatives may flow in the backward direction. Another important possibility to consider is networks in which activity flows around closed loops. Such recurrent networks may settle down to stable states, or they may exhibit complex temporal dynamics that can be used to produce sequential behavior. If they settle to stable states, error derivatives can be computed using methods much simpler than back propagation.

Although investigators have devised some powerful learning algorithms that are of great practical value, we still do not know which representations and learning procedures are actually used by the brain. But sooner or later computational studies of learning in artificial neural networks will converge on the methods discovered by evolution. When that happens, a lot of diverse empirical data about the brain will suddenly make sense, and many new applications of artificial neural networks will become feasible.

FURTHER READING

LEARNING REPRESENTATIONS BY BACK-PROPAGATING ERRORS. David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams in Nature, Vol. 323, No. 6088, pages 533-536; October 9, 1986.
CONNECTIONIST LEARNING PROCEDURES. Geoffrey E. Hinton in Artificial Intelligence, Vol. 40, Nos. 1-3, pages 185-234; September 1989.
INTRODUCTION TO THE THEORY OF NEURAL COMPUTATION. J. Hertz, A. Krogh and R. G. Palmer. Addison-Wesley, 1990.
THE COMPUTATIONAL BRAIN. Patricia S. Churchland and Terrence J. Sejnowski. The MIT Press/Bradford Books, 1992.