ORIGINAL CONTRIBUTION

An Introduction to Neural Computing

T. Kohonen
Keywords--Neural computing, Neural networks, Adaptive systems, Learning machines, Pattern recognition.
1. HISTORICAL OVERVIEW
Theoretical explanations of the brain and thinking
processes were first suggested by some ancient Greek
philosophers, such as Plato (427-347 B.C.) and Aristotle (384-322 B.C.). Rather physical views of mental
processes were held by Descartes (1596-1650) and the
18th century empiricist philosophers.
The class of so-called cybernetic machines to which the "neural computers" belong has a much longer history than generally believed: Heron the Alexandrian
built hydraulic automata around 100 B.C. Among the
numerous animal models, which have been built to
demonstrate need-driven behavior in variable living
conditions, one may mention the "Protozoon" of Lux
from 1920, the "dogs" of Philips from 1920 to 1930,
the "'Homeostat" of Ashby from 1948, the "Machina
Speculatrix" and "Machina Docilis" of Walter from
1950, the "ladybird" of Szeged from 1950, the "squirrel" of Squee from 195 I, the "tortoise" of Eichler from
1956, and many versions of a "mouse in the labyrinth"
(Nemes, 1969).
Abstract, conceptual information processing operations were performed by mechanical devices a couple of centuries ago; for example, the slide rule for the demonstration of syllogisms by Ch. Stanhope (1753-1816), and many mechanisms for set-theoretic and logic operations devised in the 19th century.
Analytical neural modeling has usually been pursued in connection with psychological theories and neurophysiological research. The first theorists to conceive the fundamentals of neural computing were W. S. McCulloch and W. A. Pitts (1943) from Chicago, who showed that networks of simple threshold elements are able to carry out logical operations.
The biological brain performs many different functions, from the processing of sensory information and the control of movements to the higher operations loosely called thinking. All of these functions are mutually dependent in one way or another, but it may be possible to conceptualize some of them in idealized forms.
In the development of information technology there
now seems to exist a new phase whereby the aim is to
replicate many of these "neural" functions artificially.
Nonetheless, it is not always clear which of the above aspects are meant. A central motive seems to be to develop new types of computers. For instance, one may aim at the implementation of artificial sensory functions, in order to make the machines "see" or "hear"; this is an extension of the more traditional instrumentation techniques. Analysis of, say, satellite data
may in itself comprise a practical task which has to be
automated in one way or another. Then there exist needs
to develop "intelligent robots" whose motor control and
other actions must be integrated with the sensory systems: some kind of internal "neural computer" is
needed to coordinate these actions. It further seems
that certain expectations are held of new types of
"thinking machines," which have no sensors or actuators, but which are capable of answering complex queries and solving abstract problems. The science-fiction
robots, of course, are doing all of this. Although the
potential of the Artificial Intelligence methods by which
these problems were earlier approached has been
known for about 25 years, it is still hoped that new
avenues to Artificial Intelligence could be opened when
massive parallelism of the computing circuits, and new
technologies (eventually, optical computing) are developed whereby the computing capacity is increased by
orders of magnitude. Before that, however, it will be
necessary to find out what to compute. There is at least
one new dimension of computation visible which has
been very difficult to reach by the digital computers,
namely, to take into account all the high-order statistical
relationships in stochastic data. Apparently the biological brain always has to deal with stochastic signals,
whereas even in the best "intelligent" machines developed so far, all data occur in discrete form. In other
words, the knowledge they handle is usually of the linguistic type, and the pieces of information inputted to
the machines have always been prepared carefully by
human beings. One intriguing objective would therefore be to leave such preprocessing to the machines.
Since the concepts of (artificial) "neural networks"
and "neural computers" appear widely in publicity
nowadays, it may be necessary to specify their exact
meaning. Briefly, in relation to the above motives, we
may give the following definition:
"ArtO~cial neural networks" are massively parallel interconnected networks of simple (usually adaptive) elements and
their hierarchical organizations which are intended to interact
with the objects of the real world in the same way as biological
nervous systems do.
Some of the problems proposed for neural networks are abstract, like the well-known "traveling-salesman problem" (Hopfield & Tank, 1986). In setting up such problems, the neural network is directly regarded to constitute a static or dynamic analogy of the problem structure; for instance, if the purpose is to minimize a loss function where the total cost is built up of interdependent cost elements, then it may be possible to make these elements correspond to the connection strengths in a network; the network structure itself then represents the different combinatorial alternatives of the problem "graphically." The activities in different parts of the network shall isomorphically stand for the values of the corresponding variables of the problem. When the network, by relaxation of its activity due to its feedback loops, converges to a certain optimal state, then this state, by isomorphism, is also assumed to define the solution of the original problem. The main problem seems to be who shall program such a network.
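To make the relaxation idea concrete, the following minimal sketch (not taken from the original constructions; the weight matrix, biases, and update rule are illustrative choices) encodes mutual costs as symmetric connection strengths and lets a binary network relax by asynchronous updates until no unit wants to change its state, i.e., until a local minimum of the associated energy function is reached.

```python
import numpy as np

def relax(W, b, s, max_sweeps=100, rng=None):
    """Asynchronously update binary states s in {0,1} until the energy
    E = -0.5 * s^T W s - b^T s stops decreasing (a local minimum)."""
    rng = rng or np.random.default_rng(0)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(n):
            h = W[i] @ s + b[i]          # local field: net input to unit i
            new = 1 if h > 0 else 0
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:                   # no unit wants to flip: local optimum
            break
    return s

# Toy "cost" structure: positive weights reward co-activation of two units,
# negative weights penalize it; the biases encode unary costs.
W = np.array([[0.,  2., -3.],
              [2.,  0., -1.],
              [-3., -1., 0.]])
b = np.array([0.5, -0.2, 0.1])
print(relax(W, b, s=np.array([1, 0, 1])))
```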
One could also interpret the interest in "neural computers" in a slightly different way. The new technologies, such as optical processing of information, high-density semiconductor networks, and eventually new materials like the "spin glasses," offer an unforeseen capacity for computation. On the other hand, their elementary operations may not be perfect. The question is then how it would be possible to utilize this vast, although slightly unreliable, capacity. Obviously the brain has an answer to this, and so the new computer technology has been motivated by biological brain research. Obviously, too, such networks must be highly adaptable. This consideration then leads to the next problem: What component and system properties of the biological neural networks ought then to be imitated?
"E Kohonen
All physiological facts speak in favor of neurons acting as (nonlinear) analog integrators, and the efficacies
of the synapses change gradually. At least they do not
flip back-and-forth.
No machine instructions or control codes occur in neural computing. Due to the stability problems discussed above, the format of such codes cannot be maintained for any significant periods of time, in particular during the growth processes. The brain circuits do not implement recursive computation, and are thus not algorithmic.
In a typical searching task, a number of relations are specified as simultaneous "equations" in which some of the members are unknown, and the system has to find from memory all the relations that match with the "equations" in their specified parts, whereby a number of solutions for the unknown variables become known.
The above discussion concerns the so-called relational databases which are widely used for business data. Searching for information in them calls for the following elementary operations: (a) parallel, or otherwise very fast, matching of a number of search arguments (such as the known members in the above relations) with all the elementary items stored in memory, and their provisional marking; and (b) analysis of the markings and sequential readout of the results that satisfy all conditions.
In order to find a solution to a searching task which is defined in terms of several simultaneous incomplete queries, as is usually the case in formal problem-solving tasks, it is thus not sufficient to implement a content-addressable (or autoassociative) memory; the partial searching results must somehow be buffered and studied sequentially. This, of course, is completely expedient for a digital computer, which can store the candidates as lists and study them by a program code; in a neural network, however, tasks like holding a number of candidates in temporary storage, and the investigation of their "markings," would be very cumbersome.
In artificial neural networks, the searching arguments are usually imposed as initial conditions on the network, and the solution, or "answer," results when the activity state of the network relaxes to some kind of "energetic minimum." One has to note the following facts that are characteristic of these devices: (a) Their network elements are analog devices, whereby the representation of numerical variables, and their matching, can only be defined with relatively low accuracy. This, however, may be sufficient for prescreening, which is the most time-consuming phase; (b) A vast number of relations in memory which only approximately match the search argument can be activated. On the other hand, since the "conflicts" then cannot be totally resolved but only minimized, the state to which the neural network converges in the process represents some kind of optimal answer (usually, however, only in the sense of the Euclidean metric); and (c) The "answer," or the asymptotic state which represents the searching result, has no alternatives. Accordingly, it is not possible, except in some rather weird constructs, to find the complete set of solutions, or even a number of the best candidates for them. It is not certain that the system will converge to the global optimum; it is more usual that the answer corresponds to one of the local optima which, however, may be an acceptable solution in practice.
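Point (c) can be illustrated with a deliberately simple one-dimensional sketch (the "energy" function, step size, and starting points are arbitrary choices): relaxation by gradient descent always returns a single answer, and whether that answer is the global optimum or only a local one depends on the initial state.

```python
# A made-up energy with two minima, near x = -0.69 (local) and x = +1.44 (global).
def energy(x):
    return x**4 - x**3 - 2*x**2

def d_energy(x):
    return 4*x**3 - 3*x**2 - 4*x

def relax(x, step=0.01, iters=2000):
    """Relax by plain gradient descent from the initial state x."""
    for _ in range(iters):
        x -= step * d_energy(x)
    return x

print(relax(-0.1))   # converges to the local minimum only (x ~ -0.69)
print(relax(+0.1))   # converges to the global minimum (x ~ +1.44)
```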
Many practical problems can be formulated as optimization tasks such that some objective or cost function is minimized (or maximized). A great number of variables usually enter the problem, and to evaluate and minimize the objective function, a combinatorial problem has to be solved. In large-scale problems such as the optimization of economic or business operations, the systems of equations are usually static, although nonlinear, and if conventional computers are used, the solutions must be found in a great many iterative steps.

Another category of complex optimization tasks is met in systems and control problems which deal with physical variables and space- and time-continuous processes. Their interrelations (the restricting conditions) are usually expressed as systems of partial differential equations, whereas the objective function is usually an integral-type functional. Mathematically these problems often call for methods of variational calculus. Although the number of variables may then be orders of magnitude smaller than in the first category of problems, exact mathematical treatment of the functionals again creates the need for rather large computing power.
It may come as a surprise that "massively parallel"
computers for both of the above categories of problems
existed in the 1950s. The differential analyzers, based
on either analog or digital computing principles and
components, were set up as direct analogies for the systems to be studied whereby plenty of interconnections
(feedbacks) were involved. For details of these systems
and the many problems already solved by them, see
Korn and Korn (1964), Aoki (1967), and Tsypkin
(1968).
It may then also be obvious that if the "massively
parallel computers" such as the "neural networks" are
intended to solve optimization problems, they must, in
principle at least, operate as analog devices; the dynamics of their processing elements must be definable
with sufficient accuracy and individually for each element, and the interconnectivities must be specifically
configurable.
5. THE FORMAL NEURON

FIGURE 1. The formal neuron: input signals ξ_1, ξ_2, ..., ξ_n, with corresponding input weights μ_1, ..., μ_n, and output η.
The logic circuits are building blocks of digital computers. For historical reasons, because the first computers were conceptualized simultaneously with the first
steps made in neural computing, it is still often believed
that the operation of neural networks could also be
described in terms of threshold-logic units. Assume a
processing element depicted in Figure 1 which is then
imagined to describe a biological neuron.
Each of the continuous-valued input signals ξ_j, j = 1, 2, ..., n, shall represent the electrical activity on the corresponding input line, or alternatively, the momentary frequency of neural impulses delivered by another neuron to this input. In the simplest formal model, the output frequency η is often approximated by a function
dη/dt = I - γ(η),   (2)

I = Σ_{j=1}^{n} μ_j ξ_j,   (3)

where γ(η) is a nonlinear loss term.
If now the input signals are held steady or are changing slowly, η approaches the asymptotic equilibrium whereby dη/dt = 0, and then η can be solved from equations (2) and (3), yielding

η = γ^{-1}(I),   (4)

where γ^{-1} is the inverse function of γ. In reality, saturation effects are always encountered at high activity, which means that the loss term γ(η) must be a progressively increasing function of the activity η. On the other hand, remembering that η cannot become negative, it is possible to deduce that η = η(I) then indeed coarsely resembles the Heaviside function.
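The behaviour described by equations (2)-(4) is easy to check numerically. In the sketch below the form of the loss term γ(η) is an arbitrary, progressively increasing choice (the text does not specify one); with it, the equilibrium output grows much more slowly than the input I, i.e., it saturates.

```python
def gamma(eta):
    # Hypothetical, progressively increasing loss term (illustrative only).
    return eta + 0.5 * eta**3

def simulate(I, dt=0.01, steps=5000):
    """Euler integration of d(eta)/dt = I - gamma(eta), keeping eta >= 0."""
    eta = 0.0
    for _ in range(steps):
        eta = max(0.0, eta + dt * (I - gamma(eta)))
    return eta

for I in (0.5, 2.0, 8.0, 32.0):
    print(f"I = {I:5.1f}  ->  equilibrium eta = {simulate(I):.3f}")
```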
In the simplest case, the synaptic efficacies μ_j change in proportion to the product of input and output signals; this is the famous hypothesis of Hebb (1949) dressed in analytical form. However, since all activities must be nonnegative, the changes would always be unidirectional, which is not possible in the long run. Therefore, some kind of forgetting or other similar effect must be introduced which makes the changes reversible. This author is of the opinion that the "forgetting" effect must be proportional to the activity η, that is, that the forgetting is "active," which is a biologically motivated assumption. Then the law of adaptation would read

dμ_j/dt = α η ξ_j - β η μ_j,   (5)

where α and β are positive scalar parameters.
FIGURE 2. Basic network structures: (a) the crossbar network; (b) a network with lateral feedback.
The stability of such an adaptation process requires a separate proof, which for several case examples has been presented in Kohonen (1988).
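A numerical sketch of an adaptation law of the type described above (a Hebbian term plus forgetting proportional to the output activity) is given below; the constants, the input statistics, and the simple non-negative output model are assumptions made only for the illustration. The point is that the "active forgetting" term keeps the synaptic efficacies bounded while they follow the input.

```python
import numpy as np

# Euler simulation of d(mu_j)/dt = alpha*eta*xi_j - beta*eta*mu_j (cf. (5)).
alpha, beta, dt = 0.1, 0.1, 0.01
rng = np.random.default_rng(0)
mu = np.full(3, 0.05)                  # small initial synaptic efficacies
xi_mean = np.array([1.0, 0.5, 0.1])    # mean input pattern (made up)

for _ in range(20000):
    xi = np.maximum(0.0, xi_mean + 0.2 * rng.standard_normal(3))
    eta = max(0.0, float(mu @ xi))     # simple non-negative output activity
    mu += dt * eta * (alpha * xi - beta * mu)

# The efficacies remain bounded; with alpha == beta they settle near an
# activity-weighted average of the input patterns.
print(np.round(mu, 3))
```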
7. N E T W O R K S T R U C T U R E S
The basic structure in most neural networks formed of elementary processing elements is the crossbar switch (Figure 2a). The next step towards more complex functions results from the inclusion of lateral feedback (Figure 2b). This kind of connectivity has for a long time been known in neuroanatomy; we applied it to modeling in the early 1970s (cf. a rather late publication on neural associative memory, Kohonen, Lehtiö, & Rovamo, 1974, and another series of publications on the "novelty filter" based on it, e.g., Kohonen, 1976).
In an attempt to devise complete operational modules for neural systems, it is useful to define the basic network structure containing adaptive crossbars, for input as well as for feedback (Figure 3).
The most natural topology of the network is two-dimensional (Figures 2 and 3 only show a one-dimensional section of it for simplicity), and the distribution of lateral feedbacks within this subsystem, in the first approximation, could be the same around every neuron; however, the distribution of the interconnections could strongly depend on the distance between two points in the network. Within an artificial module, all units ("neurons") could receive the same set of input
signals in parallel. None of these assumptions is unnatural, even when relating to biological structures.
If more complex systems have to be defined, an arbitrary new organization can then be implemented by
interconnecting such operational modules, each one
perhaps with somewhat different internal structure and/
or parameters. This would then correspond to the
"compartmental" approximation frequently made in
the description of other physiological functions.
We may now write the system equations for one
module in the following general way:
dy/dt = f(x, y, M, N),   (6a)

dM/dt = g(x, y, M),   (6b)

dN/dt = h(y, N).   (6c)
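The following sketch makes equations (6a)-(6c) concrete under assumptions that are not stated in the text: x is taken to be the input vector, y the vector of output activities, M the input crossbar and N the lateral feedback crossbar of Figure 3, and simple leaky-integrator and Hebbian-with-forgetting forms are substituted for f, g, and h.

```python
import numpy as np

n_in, n_out, dt = 4, 6, 0.01
rng = np.random.default_rng(1)

def step(x, y, M, N):
    # (6a): leaky integration of input and lateral feedback, clipped to [0, 1]
    y = np.clip(y + dt * (M @ x + N @ y - y), 0.0, 1.0)
    # (6b): assumed Hebbian change of the input crossbar with active forgetting
    M = M + dt * y[:, None] * (0.1 * x[None, :] - 0.1 * M)
    # (6c): the same kind of law assumed for the feedback crossbar
    N = N + dt * y[:, None] * (0.01 * y[None, :] - 0.1 * N)
    return y, M, N

x = np.array([1.0, 0.0, 0.5, 0.0])
y = np.zeros(n_out)
M = 0.1 * rng.random((n_out, n_in))    # input crossbar
N = np.zeros((n_out, n_out))           # lateral feedback crossbar
for _ in range(2000):
    y, M, N = step(x, y, M, N)
print(np.round(y, 2))
```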
8. CLASSIFICATION BY LEARNED RESPONSES

Assume that for an input pattern x the network forms the output responses

η_i = f_i(x, m),   (7)

where m is the vector of internal parameters of the network. If the system is stimulated by input x, it is then thought that x belongs to class i if for all j ≠ i, η_i > η_j. The theoretical problem is to devise an adaptive process such that, given a set of pairs of input and desired output (x^(k), η_i^(k)), one finally hopes to have

∀(i, k):   η_i^(k) = f_i(x^(k), m).   (8)

Since condition (8) can usually be satisfied only approximately, the internal parameters m are chosen to minimize a squared-error criterion such as

J = E{ Σ_i (η_i - f_i(x, m))² },   (9)

or, over a finite set of training samples,

J = Σ_k Σ_i (η_i^(k) - f_i(x^(k), m))².   (10)
For a single layer of linear elements, the minimization of (10), where the terms in the sum over i can be minimized independently, can be dressed into the form of the familiar Widrow-Hoff equation (Widrow & Hoff, 1960). If, for analytical simplicity, θ_i is regarded as the (n + 1)th component of the m_i vector, whereby the (n + 1)th component of x is -1, respectively, we can write (using these so-called augmented vectors) η_i = m_i^T x, whereby
m_i^(k+1) = m_i^(k) + α^(k) [ η_i^(k) - (m_i^(k))^T x^(k) ] x^(k),   (11)
with {α^(k)} a decreasing sequence of "gain parameters"; a possible choice is α^(k) = const./k. After a great many steps, the m_i^(k) then converge to the optimal values. This process is identical with the mathematical method called stochastic approximation, which was earlier developed by Robbins and Monro (1951), as well as Kiefer and Wolfowitz (1952), for regression problems.
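The recursion (11) is easy to verify numerically. In the sketch below the input distribution, the noise level, and the "true" weight vector generating the desired responses are synthetic, made up only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m_true = np.array([2.0, -1.0, 0.5])    # made-up weights generating the targets

m = np.zeros(3)                        # (augmented) weight vector m_i
for k in range(1, 5001):
    x = rng.standard_normal(3)                          # input x^(k)
    eta = m_true @ x + 0.05 * rng.standard_normal()     # desired response
    alpha = 0.5 / k                                     # decreasing gain, const./k
    m += alpha * (eta - m @ x) * x                      # Widrow-Hoff step, eq. (11)

print(np.round(m, 3))                  # close to m_true after many steps
```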
For a random network configuration, it is difficult
to find a similar theoretically justified optimal recursive
expression; however, it is always possible to optimize
the internal parameters directly from equation (10).
For instance, randomly selected internal parameters
(components of the m vector) can be adjusted iteratively
in such a direction that J always decreases. A similar
idea is applied in the so-called "Boltzmann machines"
(Hinton & Sejnowski, 1984). This process, however, is
computationally formidably slow, as can be imagined.
Therefore there exists considerable interest in methods
by which faster training of, say, multilevel networks
becomes possible. The solution, however, usually remains suboptimal. In the principle of backpropagation of errors, originally invented by Werbos (1975, 1982), the differences η_i^(k) - f_i(x^(k), m) are converted into hypothetical signals which are then propagated through the network in the opposite direction, thereby adjusting the internal parameters in proportion to them, as in the Widrow-Hoff scheme. The correction signals thus
pass several layers, not only the last one like in the
classical constructs, and all the corrections can be made
in parallel by special hardware.
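A minimal runnable sketch of the principle is given below: a two-layer network with sigmoid units is trained on the XOR problem by propagating the output errors back through the hidden layer. The network size, the learning rate, and the choice of nonlinearity are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)     # hidden layer
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)     # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # backward pass: output errors propagated back to the hidden layer
    d_out = (y - T) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # parameter corrections, layer by layer, in the Widrow-Hoff manner
    lr = 0.5
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_hid;  b1 -= lr * d_hid.sum(0)

print(np.round(y.ravel(), 2))   # typically close to the targets 0, 1, 1, 0
```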
While it is possible to demonstrate many interesting
learned stimulus-response relationships in networks of
the above kinds, like conversion of pieces of text into
phonetic transcriptions (which can then be synthesized
into speech by standard methods) (Sejnowski & Rosenberg, 1986), one has to emphasize that in difficult
pattern recognition tasks such as recognition of natural
speech, none of these principles works particularly well;
for instance, the recognition of phonemes from continuous speech by the optimal "learned response" method
defined above has only succeeded with an accuracy of
65%, whereas the theoretically optimal Bayes classifier
has an accuracy exceeding 80%. (Using auxiliary syntactic methods which take into account the coarticulation effects (Kohonen, 1986a), we have in fact transcribed continuous speech into text with an accuracy
of 90 to 95%.) As stated earlier, and explained in the
m_i^(k+1) = m_i^(k)   for i ≠ c.   (12)
10. L E A R N E D
FIGURE 4. (a) The probability density function of x = [ξ_1, ξ_2]^T is represented here by its samples, the small dots. The superposition of two symmetric Gaussian density functions corresponding to two different classes C_1 and C_2, with their centroids shown by the white and the dark cross, respectively, is shown. Solid curve: the theoretically optimal Bayes decision surface. (b) Large black dots: reference vectors of class C_1. Open circles: reference vectors of class C_2. Solid curve: decision surface in the learning vector quantization. Broken curve: Bayes decision surface.
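A sketch of a learning vector quantization procedure consistent with Figure 4 and with the visible part of eq. (12) is given below: only the reference vector m_c nearest to the current sample is updated, toward the sample if their class labels agree and away from it otherwise, while all other reference vectors are left unchanged. The data, the number of reference vectors, and the gain schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two overlapping Gaussian classes, cf. Figure 4a (synthetic data).
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (200, 2)),
               rng.normal([2.5, 0.0], 1.0, (200, 2))])
labels = np.array([0] * 200 + [1] * 200)

# Three reference vectors per class, initialized from samples of that class.
m = np.vstack([X[labels == 0][:3], X[labels == 1][:3]]).copy()
m_class = np.array([0, 0, 0, 1, 1, 1])

for k in range(1, 10001):
    i = rng.integers(len(X))
    x, cls = X[i], labels[i]
    c = np.argmin(((m - x) ** 2).sum(axis=1))    # nearest reference vector
    alpha = 0.02 * (1.0 - k / 10001.0)           # slowly decreasing gain
    sign = 1.0 if m_class[c] == cls else -1.0    # attract if classes agree, else repel
    m[c] += sign * alpha * (x - m[c])            # eq. (12): all other m_i unchanged

# Classification: a new sample takes the class of its nearest reference vector.
test = np.array([[1.25, 0.0], [-1.0, 0.5], [3.0, -0.5]])
print([int(m_class[np.argmin(((m - t) ** 2).sum(axis=1))]) for t in test])
```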
Classification is thus a decision process which has to be optimized according to misclassifications. A problem of a completely different nature is to define proper output actions in response to a particular input, such as the motor reactions usually are; then each response is equally important, because no classes for the inputs are defined at all, and the optimization of the internal parameters by any of the methods discussed in the previous section is completely expedient; the J functional represents the only applicable criterion.
One aspect, as pointed out by Pellionisz (1986), is
that for the input-output relation in systems which
If the law by which ΔM is formed has certain properties (like in holography, or if the adaptive changes depend on signals as discussed in Section 6), and the transfer function f has a simple analytical form, then Δx can be solved (accurately or approximately, i.e., in the sense of least squares). Computer simulations of this principle in neural networks with partly random structure have been reported in Kohonen (1971) and Kohonen (1973).
Explanations for many intriguing phenomena discussed in experimental psychology now become readily available: for example, hallucinations, the so-called phantom effects in amputated limbs, and geometric illusions. Without the principle of virtual images, on the other hand, the neural networks or computers are only "behavioristic" machines, and at least it can be said that they then do not reflect the most important phenomenological features of thinking.
13. VISTAS TO NEUROCOMPUTER SYSTEMS
More complicated architectures for neurocomputers
can be implemented by interconnecting several modules of the above types. For instance, Grossberg (1982)
and Carpenter and Grossberg (in press) have suggested
circuits for the interaction of different subsystems in
the brain.
In engineering systems, the neurocomputers ought
to be understood as special-purpose models or coprocessors which are operating under the control of a more
conventional host computer. The latter defines a program code on the application level and schedules the
application operations, while the "neurocomputer"
implements various bulk tasks, such as preprocessing
natural input information like images or speech, or acts
as a "neural expert system," answering queries in a
statistically optimal fashion.
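Purely as a schematic sketch of this division of labour (every class and method name below is hypothetical, not an existing interface), a conventional host program would keep the application-level code and scheduling to itself and delegate the bulk operations to the neural coprocessor:

```python
class NeuralCoprocessor:
    """Stands in for the special-purpose hardware; the methods are placeholders."""

    def preprocess(self, raw_signal):
        # Bulk task: e.g., feature extraction from images or speech.
        peak = max(raw_signal)
        return [x / peak for x in raw_signal]

    def answer(self, query_features):
        # "Neural expert system": return the statistically most likely response.
        return "class-A" if sum(query_features) > 1.0 else "class-B"

class HostComputer:
    """Conventional computer: application-level program code and scheduling."""

    def __init__(self):
        self.coprocessor = NeuralCoprocessor()

    def run_application(self, raw_signal):
        features = self.coprocessor.preprocess(raw_signal)   # delegated bulk work
        return self.coprocessor.answer(features)             # delegated query

print(HostComputer().run_application([0.2, 0.9, 0.4]))
```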
A more complete view of neurocomputer architectures can be obtained from the books edited by McClelland and Rumelhart (1986), from the SPIE (The Society of Photo-Optical Instrumentation Engineers) Advanced Institutes proceedings (1987), as well as from the special issues published by Applied Optics (1986, 1987).
REFERENCES
Werbos, P. J. (1982). Applications of advances in nonlinear sensitivity analysis. In System modeling and optimization: Proceedings of the International Federation for Information Processing (pp. 762-770). New York: Springer-Verlag.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 WESCON Convention Record, Part IV, 96-104.
Young, T. Y., & Calvert, T. W. (1974). Classification, estimation, and pattern recognition. New York: Elsevier.