by Geoffrey E. Hinton
The brain is a remarkable computer. It interprets imprecise information from the senses at an incredibly rapid rate. It discerns a whisper in a noisy room, a face in a dimly lit alley and a hidden agenda in a political statement. Most impressive of all, the brain learns, without any explicit instructions, to create the internal representations that make these skills possible.

Much is still unknown about how the brain trains itself to process information, so theories abound. To test these hypotheses, my colleagues and I have attempted to mimic the brain's learning processes by creating networks of artificial neurons. We construct these neural networks by first trying to deduce the essential features of neurons and their interconnections. We then typically program a computer to simulate these features. Because our knowledge of neurons is incomplete and our computing power is limited, our models are necessarily gross idealizations of real networks of neurons. Naturally, we enthusiastically debate what features are most essential in simulating neurons. By testing these features in artificial neural networks, we have been successful at ruling out all kinds of theories about how the brain processes information. The models are also beginning to reveal how the brain may accomplish its remarkable feats of learning.
In the human brain, a typical neuron collects signals from others through a host of fine structures called dendrites. The neuron sends out spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches. At the end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or excite activity in the connected neurons. When a neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical activity down its axon. Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron on another changes.

Artificial neural networks are typically composed of interconnected "units," which serve as model neurons. The function of the synapse is modeled by a modifiable weight, which is associated with each connection. Most artificial networks do not reflect the detailed geometry of the dendrites and axons, and they express the electrical output of a neuron as a single number that represents the rate of firing: its activity.

Each unit converts the pattern of incoming activities that it receives into a single outgoing activity that it broadcasts to other units. It performs this conversion in two stages. First, it multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs to get a quantity called the total input. Second, a unit uses an input-output function that transforms the total input into the outgoing activity [see "The Amateur Scientist," page 170].

The behavior of an artificial neural network depends on both the weights and the input-output function that is specified for the units. This function typically falls into one of three categories: linear, threshold or sigmoid. For linear units, the output activity is proportional to the total weighted input. For threshold units, the output is set at one of two levels, depending on whether the total input is greater than or less than some threshold value. For sigmoid units, the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater resemblance to real neurons than do linear or threshold units, but all three must be considered rough approximations.
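A minimal sketch of such a unit in Python may make the two stages concrete; the function and its argument names below are illustrative, not taken from the article:

```python
import math

def unit_output(incoming_activities, weights, kind="sigmoid", threshold=0.0):
    """One model unit: form the total weighted input, then apply an
    input-output function of the chosen kind."""
    total_input = sum(a * w for a, w in zip(incoming_activities, weights))
    if kind == "linear":
        return total_input                        # proportional to the total input
    if kind == "threshold":
        return 1.0 if total_input > threshold else 0.0
    return 1.0 / (1.0 + math.exp(-total_input))   # sigmoid: smooth but nonlinear
```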
To make a neural network that performs some specific task, we must choose how the units are connected to one another, and we must set the weights on the connections appropriately. The connections determine whether it is possible for one unit to influence another. The weights specify the strength of the influence.

The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of input units is connected to a layer of "hidden" units, which is connected to a layer of output units. The activity of the input units represents the raw information that is fed into the network. The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and the hidden units.
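Chaining two such layers gives the three-layer arrangement just described. The sketch below is my own illustration, with sigmoid units and made-up weights, and simply passes activity from input to hidden to output units:

```python
import math

def layer_activities(prev_activities, weights):
    """Activities of one layer of sigmoid units.
    weights[i][j] is the weight on the connection from unit i to unit j."""
    n_units = len(weights[0])
    totals = [sum(a * row[j] for a, row in zip(prev_activities, weights))
              for j in range(n_units)]
    return [1.0 / (1.0 + math.exp(-t)) for t in totals]

# A tiny three-layer network: 4 input units, 3 hidden units, 2 output units.
input_activities = [0.2, 0.9, 0.0, 0.5]
hidden_weights = [[0.1, -0.3, 0.5], [0.7, 0.2, -0.4],
                  [-0.6, 0.8, 0.1], [0.3, -0.2, 0.9]]    # 4 x 3
output_weights = [[0.5, -0.1], [-0.7, 0.4], [0.2, 0.6]]  # 3 x 2

hidden_activities = layer_activities(input_activities, hidden_weights)
output_activities = layer_activities(hidden_activities, output_weights)
```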
GEOFFREY E. HINTON has worked on representation and learning in artificial neural networks for the past 20 years. In 1978 he received his Ph.D. in artificial intelligence from the University of Edinburgh. He is currently the Noranda Fellow of the Canadian Institute for Advanced Research and professor of computer science and psychology at the University of Toronto.

NETWORK OF NEURONS in the brain provides people with the ability to assimilate information. Will simulations of such networks reveal the underlying mechanisms of learning?
[Diagram: a model unit. An input unit's activity is multiplied by a weight, the weighted inputs are summed into the total weighted input, and the input-output function converts that total into the unit's output activity.]

COMMON NEURAL NETWORK consists of three layers of units that are fully connected. Activity passes from the input units (green) to the hidden units (gray) and finally to the output units (yellow). The reds and blues of the connections represent different weights.

… the error, which is defined as the square of the difference between the actual and the desired activities. Next we change the weight of each connection so as to reduce the error. We repeat this training process for many different images of each kind of digit until the network classifies every image correctly.

To implement this procedure, we need to change each weight by an amount that is proportional to the rate at which the error changes as the weight is changed. This quantity is called the error derivative for the weight, or simply the EW. One way to calculate the EW is to perturb a weight slightly and observe how the error changes. But that method is inefficient because it requires a separate perturbation for each of the many weights.

Around 1974 Paul J. Werbos invented a much more efficient procedure for calculating the EW while he was working toward a doctorate at Harvard University. The procedure, now known as the back-propagation algorithm, has become one of the more important tools for training neural networks. The back-propagation algorithm is easiest to understand if all the units in the network…

The back-propagation algorithm was largely ignored for years after its invention, probably because its usefulness was not fully appreciated. In the early 1980s David E. Rumelhart, then at the University of California at San Diego, and David B. Parker, then at Stanford University, independently rediscovered the algorithm. In 1986 Rumelhart, Ronald J. Williams, also at the University of California at San Diego, and I popularized the algorithm by demonstrating that it could teach the hidden units to produce interesting representations…
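The perturbation method mentioned above amounts to a finite-difference estimate of each EW. A rough sketch follows, assuming some routine error_fn that measures the squared difference between actual and desired outputs over the training examples; the flat list of weights is a simplification of mine:

```python
def perturbation_EW(error_fn, weights, eps=1e-4):
    """Estimate the error derivative for every weight by nudging each weight
    in turn and observing how the error changes. One extra evaluation of the
    error is needed per weight, which is what makes the method slow."""
    base_error = error_fn(weights)
    derivatives = []
    for k in range(len(weights)):
        nudged = list(weights)
        nudged[k] += eps
        derivatives.append((error_fn(nudged) - base_error) / eps)
    return derivatives
```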
How a Neural Network Represents Handwritten Digits

A neural network consisting of 256 input units, nine hidden units and 10 output units has been trained to recognize handwritten digits. The illustration below shows the activities of the units when the network is presented with a handwritten 3. The third output unit is most active. The nine panels at the right represent the 256 incoming weights and the 10 outgoing weights for each of the nine hidden units. The red regions indicate weights that are excitatory, whereas yellow regions represent weights that are inhibitory.

[Illustration: activities of the INPUT, HIDDEN and OUTPUT layers for the handwritten 3, with the weight panels for the nine hidden units.]
… algorithm. Here the central issue is how the time required to learn increases as the network gets larger. The time taken to calculate the error derivatives for the weights on a given training example is proportional to the size of the network because the amount of computation is proportional to the number of weights. But bigger networks typically require more training examples, and they must update the weights more times. Hence, the learning time grows much faster than does the size of the network.

… to extract from our sensory input. We learn to understand sentences or visual scenes without any direct instructions. How can a network learn appropriate internal representations if it starts with no knowledge and no teacher? If a network is presented with a large set of patterns but is given no information about what to do with them, it apparently does not have a well-defined problem to solve. Nevertheless, researchers have developed several general-purpose, unsupervised procedures that can adjust the weights…

In general, a good representation is one that can be described very economically but nonetheless contains enough information to allow a close approximation of the raw input to be reconstructed. For example, consider an image consisting of several ellipses. Suppose a device translates the image into an array of a million tiny squares, each of which is either light or dark. The image could be represented simply by the positions of the dark squares. But other, more efficient representations are possible…
The Back-Propagation Algorithm

To train a neural network to perform some task, we must adjust the weights of each unit in such a way that the error between the desired output and the actual output is reduced. This process requires that the neural network compute the error derivative of the weights (EW). In other words, it must calculate how the error changes as each weight is increased or decreased slightly. The back-propagation algorithm is the most widely used method for determining the EW.

To implement the back-propagation algorithm, we must first describe a neural network in mathematical terms. Assume that unit j is a typical unit in the output layer and unit i is a typical unit in the previous layer. A unit in the output layer determines its activity by following a two-step procedure. First, it computes the total weighted input x_j, using the formula

x_j = \sum_i y_i w_{ij}

where y_i is the activity level of the ith unit in the previous layer and w_{ij} is the weight of the connection between the ith and the jth unit. Second, the unit calculates its activity y_j with a sigmoid function of the total weighted input:

y_j = 1 / (1 + e^{-x_j})

Once the activities of all the output units have been determined, the network computes the error E, defined as

E = \tfrac{1}{2} \sum_j (y_j - d_j)^2

where y_j is the activity level of the jth unit in the top layer and d_j is the desired output of the jth unit.

The back-propagation algorithm consists of four steps:

1. Compute how fast the error changes as the activity of an output unit is changed. This error derivative (EA) is the difference between the actual and the desired activity.

EA_j = \partial E / \partial y_j = y_j - d_j

2. Compute how fast the error changes as the total input received by an output unit is changed. This quantity (EI) is the answer from step 1 multiplied by the rate at which the output of a unit changes as its total input is changed.

EI_j = \partial E / \partial x_j = (\partial E / \partial y_j)(dy_j / dx_j) = EA_j \, y_j (1 - y_j)

3. Compute how fast the error changes as a weight on the connection into an output unit is changed. This quantity (EW) is the answer from step 2 multiplied by the activity level of the unit from which the connection emanates.

EW_{ij} = \partial E / \partial w_{ij} = (\partial E / \partial x_j)(\partial x_j / \partial w_{ij}) = EI_j \, y_i

4. Compute how fast the error changes as the activity of a unit in the previous layer is changed. This quantity, the EA for the previous layer, is the sum, over all the connections leaving that unit, of the answer from step 2 multiplied by the weight on the connection.

EA_i = \partial E / \partial y_i = \sum_j (\partial E / \partial x_j)(\partial x_j / \partial y_i) = \sum_j EI_j \, w_{ij}

By using steps 2 and 4, we can convert the EAs of one layer of units into EAs for the previous layer. This procedure can be repeated to get the EAs for as many previous layers as desired. Once we know the EA of a unit, we can use steps 2 and 3 to compute the EWs on its incoming connections.
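A minimal Python rendering of these four steps for a sigmoid output layer follows; the list-of-lists weight layout and the function names are mine, not the article's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(y_prev, weights):
    """weights[i][j] is the weight from unit i in the previous layer to output unit j."""
    n_out = len(weights[0])
    totals = [sum(y_prev[i] * weights[i][j] for i in range(len(y_prev)))
              for j in range(n_out)]
    return [sigmoid(x) for x in totals]

def backprop_step(y_prev, y, desired, weights):
    """The four steps of the box, for one training example."""
    n_out, n_prev = len(y), len(y_prev)
    EA = [y[j] - desired[j] for j in range(n_out)]                  # step 1
    EI = [EA[j] * y[j] * (1.0 - y[j]) for j in range(n_out)]        # step 2
    EW = [[EI[j] * y_prev[i] for j in range(n_out)]                 # step 3
          for i in range(n_prev)]
    EA_prev = [sum(EI[j] * weights[i][j] for j in range(n_out))     # step 4
               for i in range(n_prev)]
    return EW, EA_prev
```

Subtracting a small multiple of each EW from the corresponding weight reduces the error, and the EA values returned for the previous layer let steps 2 and 3 be reapplied there, layer by layer, just as the closing paragraph of the box describes.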
… we get an overall savings because far fewer parameters than coordinates are needed. Furthermore, we do not lose any information by describing the ellipses in terms of their parameters: given the parameters of the ellipse, we could reconstruct the original image if we so desired.

TWO FACES composed of eight ellipses can be represented as many points in two dimensions. Alternatively, because the ellipses differ in only five ways (orientation, vertical position, horizontal position, length and width), the two faces can be represented as eight points in a five-dimensional space.
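To put illustrative numbers on that savings, the arithmetic might run as follows; the number of bits per parameter is an assumption of mine, not a figure from the article:

```python
# Description length of the raw image versus a parametric description.
n_squares = 1_000_000          # one bit per light-or-dark square
n_ellipses = 8                 # as in the two-faces illustration
params_per_ellipse = 5         # orientation, vertical position, horizontal position, length, width
bits_per_parameter = 16        # assumed numeric precision

raw_bits = n_squares
parameter_bits = n_ellipses * params_per_ellipse * bits_per_parameter
print(raw_bits, parameter_bits)   # 1000000 versus 640
```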
Almost all the unsupervised learning procedures can be viewed as methods of minimizing the sum of two terms, a code cost and a reconstruction cost. The code cost is the number of bits required to describe the activities of the hidden units. The reconstruction cost is the number of bits required to describe the misfit between the raw input and the best approximation to it that could be reconstructed from the activities of the hidden units. The reconstruction cost is proportional to the squared difference between the raw input and its reconstruction.

Two simple methods for discovering economical codes allow fairly accurate reconstruction of the input: principal-components learning and competitive learning. In both approaches, we first decide how economical the code should be and then modify the weights in the network to minimize the reconstruction error.

A principal-components learning strategy is based on the idea that if the activities of pairs of input units are correlated in some way, it is a waste of bits to describe each input activity separately. A more efficient approach is to extract and describe the principal components, that is, the components of variation shared by many input units. If we wish to discover, say, 10 of the principal components, then we need only a single layer of 10 hidden units.

Because such networks represent the input using only a small number of components, the code cost is low. And because the input can be reconstructed quite well from the principal components, the reconstruction cost is small.

One way to train this type of network is to force it to reconstruct an approximation to the input on a set of output units. Then back propagation can be used to minimize the difference between the actual output and the desired output. This process resembles supervised learning, but because the desired output is exactly the same as the input, no teacher is required.
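A rough sketch of this reconstruction scheme, not the article's own code: a layer of linear hidden units feeds output units that try to reproduce the input, and gradient descent on the squared reconstruction error adjusts the weights. Tying the outgoing weights to the incoming ones is a simplification of mine.

```python
import numpy as np

def train_reconstruction_network(data, n_hidden=10, rate=0.01, epochs=100, seed=0):
    """data: array of shape (n_patterns, n_inputs). Linear hidden units; the
    output units try to reproduce the input, so the desired output is the
    input itself and no teacher is needed."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((n_hidden, data.shape[1]))
    for _ in range(epochs):
        for x in data:
            code = W @ x                  # activities of the hidden units
            recon = W.T @ code            # reconstruction on the output units
            err = recon - x               # misfit that training reduces
            # gradient of 0.5 * ||recon - x||^2 with respect to the tied weights
            W -= rate * (np.outer(code, err) + np.outer(W @ err, x))
    return W
```

With enough passes over the data, the rows of W tend to span the same subspace as the leading principal components of the data, which is what keeps both the code cost and the reconstruction cost small.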
Many researchers, including Ralph Linsker of the IBM Thomas J. Watson Research Center and Erkki Oja of Lappeenranta University of Technology in Finland, have discovered alternative algorithms for learning principal components. These algorithms are more biologically plausible because they do not require output units or back propagation. Instead they use the correlation between the activity of a hidden unit and the activity of an input unit to determine the change in the weight.

When a neural network uses principal-components learning, a small number of hidden units cooperate in representing the input pattern. In contrast, in competitive learning, a large number of hidden units compete so that a single hidden unit is used to represent any particular input pattern. The selected hidden unit is the one whose incoming weights are most similar to the input pattern.

Now suppose we had to reconstruct the input pattern solely from our knowledge of which hidden unit was chosen. Our best bet would be to copy the pattern of incoming weights of the chosen hidden unit. To minimize the reconstruction…

[Illustration: input patterns labeled SIAMESE, PERSIAN and TABBY; DACHSHUND, RETRIEVER, TERRIER and HUSKY; KODIAK, BROWN and POLAR; GUERNSEY; with the weight patterns of the hidden units among them.]

COMPETITIVE LEARNING can be envisioned as a process in which each input pattern attracts the weight pattern of the closest hidden unit. Each input pattern represents a set of distinguishing features. The weight patterns of hidden units are adjusted so that they migrate slowly toward the closest set of input patterns. In this way, each hidden unit learns to represent a cluster of similar input patterns.
… algorithms probably lie somewhere between the extremes of purely distributed and purely localized representations. Horace B. Barlow of the University of Cambridge has proposed a model in which each hidden unit is rarely active and the representation of each input pattern is distributed across a small number of selected hidden units. He and his co-workers have shown that this type of code can be learned by forcing hidden units to be uncorrelated while also ensuring that the hidden code allows good reconstruction of the input.

For both eye movements and faces, the brain must represent entities that vary along many different dimensions. In the case of an eye movement, there are just two dimensions, but for something like a face, there are dimensions such as happiness, hairiness or familiarity, as well as spatial parameters such as position, size and orientation. If we associate with each face-sensitive cell the parameters of the face that make it most active, we can average these parameters over a population of active cells to discover the parameters…

… consider a neural network that is presented with an image of a face. Suppose the network already contains one set of units dedicated to representing noses, another set for mouths and another set for eyes. When it is shown a particular face, there will be one bump of activity in the nose units, one in the mouth units and two in the eye units. The location of each of these activity bumps represents the spatial parameters of the feature encoded by the bump. Describing the four activity bumps is cheaper than describing the raw image, but it…
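The averaging over a population of active, face-sensitive cells described above might be sketched as follows; the argument names are stand-ins of mine, not terms from the article:

```python
def population_average(activities, preferred_params):
    """Estimate the parameters of the viewed face by averaging, weighted by
    activity, the parameters that each face-sensitive cell prefers."""
    total = sum(activities)
    n_params = len(preferred_params[0])
    return [sum(a * p[k] for a, p in zip(activities, preferred_params)) / total
            for k in range(n_params)]
```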