Steps Toward Artificial Intelligence
Received by the IRE, October 24, 1960. The author's work summarized
here—which was done at the MIT Lincoln Laboratory, a center for
research operated by MIT at Lexington, Mass., with the joint support of
the U. S. Army, Navy, and Air Force under Air Force Contract AF
19(604)-5200; and at the Res. Lab. of Electronics, MIT, Cambridge, Mass.,
which is supported in part by the U. S. Army Signal Corps, the Air Force
Office of Scientific Research, and the ONR—is based on earlier work done
by the author as a Junior Fellow of the Society of Fellows, Harvard
University.
The adjective "heuristic," as used here and widely in the literature, means
related to improving problem-solving performance; as a noun it is also
used in regard to any method or trick used to improve the efficiency of a
problem-solving system. A "heuristic program," to be considered
successful, must work well on a variety of problems, and may often be
excused if it fails on some. We often find it worthwhile to introduce a
heuristic method, which happens to cause occasional failures, if there is an
over-all improvement in performance. But imperfect methods are not
necessarily heuristic, nor vice versa. Hence "heuristic" should not be
regarded as opposite to "foolproof"; this has caused some confusion in the
literature.
INTRODUCTION
Our visitor might remain puzzled if he set out to find, and judge for
himself, these monsters. For he would find only a few machines (mostly
general-purpose computers, programmed for the moment to behave
according to some specification) doing things that might claim any real
intellectual status. Some would be proving mathematical theorems of
rather undistinguished character. A few machines might be playing certain games.
In this article, an attempt will be made to separate out, analyze, and find the
relations between some of these problems. Analysis will be supported with
enough examples from the literature to serve the introductory function of a
review article, but there remains much relevant work not described here.
This paper is highly compressed, and therefore cannot begin to discuss all
these matters in the available space.
A computer can do, in a sense, only what it is told to do. But even when we
do not know exactly how to solve a certain problem, we may program a
machine to Search through some large space of solution attempts.
Unfortunately, when we write a straightforward program for such a search,
we usually find the resulting process to be enormously inefficient. With
Pattern-Recognition techniques, efficiency can be greatly improved by
restricting the machine to use its methods only on the kind of attempts for
which they are appropriate. And with Learning, efficiency is further
improved by directing Search in accord with earlier experiences. By
actually analyzing the situation, using what we call Planning methods, the
machine may obtain a fundamental improvement by replacing the
originally given Search by a much smaller, more appropriate exploration.
Finally, in the section on Induction, we consider some rather more global
concepts of how one might obtain intelligent machine behavior.
I. THE PROBLEM OF SEARCH
Summary—If, for a given problem, we have a means for checking a
proposed solution, then we can solve the problem by testing all possible
answers. But this always takes much too long to be of practical interest.
Any device that can reduce this search may be of value. If we can detect
relative improvement, then “hill-climbing” (Section I-B) may be feasible,
but its use requires some structural knowledge of the search space. And
unless this structure meets certain conditions, hill-climbing may do more
harm than good.
In one sense, all such problems are trivial. For if there exists a solution to
such a problem, that solution can be found eventually by any blind
exhaustive process which searches through all possibilities. And it is
usually not difficult to mechanize or program such a search.
But for any problem worthy of the name, the search through all
possibilities will be too inefficient for practical use. And on the other hand,
systems like chess, or nontrivial parts of mathematics, are too complicated
for complete analysis. Without complete analysis, there must always
remain some core of search, or “trial and error.” So we need to find
techniques through which the results of incomplete analysis can be used to
make the search more efficient. The necessity for this is simply
overwhelming. A search of all the paths through the game of checkers
involves some 10**40 move choices [2]—in chess, some 10**120 [3]. If
we organized all the particles in our galaxy into some kind of parallel
computer operating at the frequency of hard cosmic rays, the latter
computation would still take impossibly long; we cannot expect
improvements in “hardware” alone to solve all our problems. Certainly, we
must use whatever we know in advance to guide the trial generator. And
we must also be able to make use of results obtained along the way.
But we need it. Many publications have been marred by the misuse, for this
purpose, of precise mathematical terms, e.g., metric and topological. The
term “connection,” with its variety of dictionary meanings, seems just the
word to designate a relation without commitment as to the exact nature of
the relation. An important and simple kind of heuristic connection is that
defined when a space has coordinates (or parameters) and there is also
defined a numerical “success function” E which is a reasonably smooth
function of the coordinates. Here we can use local optimization or hill-
climbing methods.
B. Hill-Climbing
Suppose that we are given a black-box machine with inputs x1, . . . xn and
an output E(x1, . . . xn). We wish to maximize E by adjusting the input
values. But we are not given any mathematical description of the function
E; hence, we cannot use differentiation or related methods. The obvious
approach is to explore locally about a point, finding the direction of
steepest ascent. One moves a certain distance in that direction and repeats
the process until improvement ceases. If the hill is smooth, this may be
done, approximately, by estimating the gradient component dE/dxi
separately for each coordinate. There are more sophisticated approaches—
one may use noise added to each variable, and correlate the output with
each input (see below)—but this is the general idea. It is a fundamental
technique, and we see it always in the background of far more complex
systems. Heuristically, its great virtue is this: the sampling effort (for
determining the direction of the gradient) grows, in a sense, only linearly
with the number of parameters. So if we can solve, by such a method, a
certain kind of problem involving many parameters, then the addition of
more parameters of the same kind ought not to cause an inordinate increase
in difficulty. We are particularly interested in problem-solving methods that
can be so extended to more difficult problems. Alas, most
interesting systems, which usually involve combinatorial operations, grow
exponentially more difficult as we add variables.
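The following minimal sketch (in present-day Python) of the procedure just outlined estimates each gradient component of the black-box function by a small probe and then steps in the direction of steepest ascent; the smooth test function E, the step sizes, and the stopping rule are arbitrary choices made only for illustration.

```python
import random

def hill_climb(E, x, step=0.1, probe=0.01, max_iters=1000):
    """Maximize a black-box function E(x) by coordinate-wise gradient estimation."""
    best = E(x)
    for _ in range(max_iters):
        # Estimate each gradient component dE/dxi with a small probe:
        # the sampling effort grows only linearly with the number of parameters.
        grad = []
        for i in range(len(x)):
            probed = list(x)
            probed[i] += probe
            grad.append((E(probed) - best) / probe)
        norm = sum(g * g for g in grad) ** 0.5
        if norm == 0:
            break  # locally flat; no direction of ascent detected
        # Move a fixed distance in the direction of steepest ascent.
        candidate = [xi + step * g / norm for xi, g in zip(x, grad)]
        value = E(candidate)
        if value <= best:
            break  # improvement has ceased
        x, best = candidate, value
    return x, best

# A smooth "hill" chosen only for demonstration.
E = lambda x: -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2
print(hill_climb(E, [random.uniform(-5, 5), random.uniform(-5, 5)]))
```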
out. The problem-solver must find other methods; hill-climbing might still
be feasible with a different heuristic connection.
The useful classifications are those which match the goals and methods of
the machine. The objects grouped together in the classifications should
have something of heuristic value in common; they should be “similar” in
a useful sense; they should depend on relevant or essential features. We
should not be surprised, then, to find ourselves using inverse or teleological
expressions to define the classes. We really do want to have a grip on “the
class of objects which can be transformed into a result of form Y,” that is,
the class of objects which will satisfy some goal. One should be wary of
the familiar injunction against using teleological language in science.
While it is true that talking of goals in some contexts may dispose us
towards certain kinds of animistic explanations, this need not be a bad
thing in the field of problem-solving; it is hard to see how one can solve
problems without thoughts of purposes. The real difficulty with
teleological definitions is technical, not philosophical, and arises when they
have to be used and not just mentioned. One obviously cannot afford to use
for classification a method that actually requires waiting for some remote
outcome, if one needs the classification precisely for deciding whether to
try out that method. So, in practice, the ideal teleological definitions often
have to be replaced by practical approximations, usually with some risk of
error; that is, the definitions have to be made heuristically effective, or
economically usable. This is of great importance. (We can think of
What is a “pattern”? We often use this term to mean a set of objects which
can in some (useful) way be treated alike. For each problem area we must
ask, “What patterns would be useful for a machine working on such
problems?”
C. Prototype-Derived Patterns
If the noise is not too severe, we may be able to manage the identification
by what we call a normalization and template-matching process. We first
remove the differences related to size and position—that is, we normalize
the input figure. One may do this, for example, by constructing a similar
figure inscribed in a certain fixed triangle (see below) or one may
transform the figure to obtain a certain fixed center of gravity and a unit
second central moment.
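A rough sketch of the second normalization mentioned (translating to a fixed center of gravity and scaling to a unit second central moment), followed by a crude template match; the sample figures, the nearest-point matching rule, and the tolerance are merely illustrative assumptions.

```python
import numpy as np

def normalize(points):
    """Translate a figure (an array of x, y points) to zero center of gravity
    and scale it to unit second central moment."""
    pts = np.asarray(points, dtype=float)
    pts -= pts.mean(axis=0)                        # fixed center of gravity
    second_moment = (pts ** 2).sum(axis=1).mean()  # mean squared radius
    return pts / np.sqrt(second_moment)

def template_match(figure, template, tol=0.2):
    """Compare a normalized figure with a normalized template by the mean
    distance from each figure point to its nearest template point."""
    f, t = normalize(figure), normalize(template)
    dists = np.sqrt(((f[:, None, :] - t[None, :, :]) ** 2).sum(-1)).min(axis=1)
    return dists.mean() < tol

# A unit square template, and a shifted, enlarged square to be identified.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
big_shifted_square = [(5, 5), (5, 9), (9, 9), (9, 5)]
print(template_match(big_shifted_square, square))  # True: same shape after normalization
```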
The figures A, A' and B, B' are topologically equivalent pairs. Lengths
have been distorted in an arbitrary manner, but the connectivity relations
between corresponding points have been preserved. In Sherman [8] and
Haller [39] we find computer programs which can deal with such
equivalences.
If the given properties are placed in a fixed order then we can represent any
of these elementary regions by a vector, or string of digits. The vector so
assigned to each figure will be called the Character of that figure (with
respect to the sequence of properties in question). (In [9] we use the term
characteristic for a property without restriction to 2 values.) Thus a square
has the Character (1, 1, 1) and a circle the Character (0, 1, 1) for the given
sequence of properties.
For many problems, one can use such Characters as names for categories
and as primitive elements with which to define an adequate set of patterns.
Characters are more than conventional names. They are instead very
rudimentary forms of description (having the form of the simplest
symbolic expression—the list) whose structure provides some information
about the designated classes. This is a step, albeit a small one, beyond the
template method; the Characters are not simple instances of the patterns,
and the properties may themselves be very abstract. Finding a good set of
properties is the major concern of many heuristic programs.
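A small sketch of the Character idea: with the properties placed in a fixed order, the vector of their values names a category. The three stand-in properties used here are invented for illustration; the paper does not specify which properties produced the vectors quoted above.

```python
# Invented stand-in properties; each maps a figure to a yes/no value.
def has_straight_sides(figure):  return figure.get("straight_sides", False)
def is_closed(figure):           return figure.get("closed", False)
def is_convex(figure):           return figure.get("convex", False)

PROPERTIES = (has_straight_sides, is_closed, is_convex)   # the fixed order

def character(figure):
    """The Character of a figure with respect to the given property sequence."""
    return tuple(int(p(figure)) for p in PROPERTIES)

square = {"straight_sides": True, "closed": True, "convex": True}
circle = {"straight_sides": False, "closed": True, "convex": True}
print(character(square), character(circle))   # (1, 1, 1) and (0, 1, 1)
```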
E. Invariant Properties
The idea behind their mathematical argument is this: suppose that we have
a function P of figures, and suppose that for a given figure F we define [F]
= {F1, F2 . . .} to be the set of all figures equivalent to F under the given
set of transformations; further, define P[F] to be the set {P(F1), P(F2), . . .} of values of P on those figures.
F. Generating Properties
"We now feed the machine A's and 0's telling the machine each time
which letter it is. Beside each sequence under the two letters, the
machine builds up distribution functions from the results of applying the
sequences to the image. Now, since the sequences were chosen
completely randomly, it may well be that most of the sequences have very
flat distribution functions; that is, they [provide] no information, and the
sequences are therefore [by definition] not significant. Let it discard these
and pick some others. Sooner or later, however, some sequences will
prove significant; that is, their distribution functions will peak up
somewhere. What the machine does now is to build up new sequences
like the significant ones. This is the important point. If it merely chose
sequences at random, it might take a very long while indeed to find the
best sequences. But with some successful sequences, or partly successful
ones, to guide it, we hope that the process will be much quicker. The
crucial question remains: How do we build up sequences “like” other
sequences, but not identical? As of now we think we shall merely build
sequences from the transition frequencies of the significant sequences.
We shall build up a matrix of transition frequencies from the significant
ones, and use them as transition probabilities with which to choose new
sequences.
"We do not claim that this method is necessarily a very good way of
choosing sequences—only that it should do better than not using at all
the knowledge of what kinds of sequences have worked. It has seemed to
us that this is the crucial point of learning." See p. 93 of [12].
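The proposal quoted above can be sketched concretely as follows; the operation names are placeholders, and the generation scheme is only the simple transition-frequency idea described in the quotation.

```python
import random
from collections import defaultdict

def transition_table(sequences):
    """Tabulate how often each operation follows each other operation
    in the 'significant' sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def generate_like(sequences, length):
    """Build a new sequence 'like' the significant ones by using their
    transition frequencies as transition probabilities."""
    counts = transition_table(sequences)
    seq = [random.choice(sequences)[0]]              # start like one of the models
    while len(seq) < length:
        followers = counts[seq[-1]]
        if not followers:
            seq.append(random.choice(list(counts)))  # no data: fall back to chance
            continue
        ops, freqs = zip(*followers.items())
        seq.append(random.choices(ops, weights=freqs)[0])
    return seq

significant = [["shift", "average", "threshold"],
               ["shift", "shift", "average", "threshold"]]
print(generate_like(significant, 5))
```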
G. Combining Properties
One cannot expect easily to find a small set of properties that will be just
right for a problem area. It is usually much easier to find a large set of
properties each of which provides a little useful information. Then one is
faced with the problem of finding a way to combine them to make the
desired distinctions. The simplest method is to define, for each class, a
prototypical "characteristic vector" (a particular sequence of property
values) and then to use some matching procedure, e.g., counting the
numbers of agreements and disagreements, to compare an unknown with
these chosen prototypes.
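A minimal sketch of this simplest combination method; the prototype Characters are arbitrary examples.

```python
def agreements(a, b):
    """Count the properties on which two Characters agree."""
    return sum(ai == bi for ai, bi in zip(a, b))

def classify(unknown, prototypes):
    """Return the class whose prototype Character agrees with the unknown
    on the largest number of properties."""
    return max(prototypes, key=lambda cls: agreements(unknown, prototypes[cls]))

prototypes = {"square": (1, 1, 1), "circle": (0, 1, 1), "arc": (0, 0, 0)}
print(classify((1, 1, 0), prototypes))   # "square": two of the three properties agree
```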
Assume that the situation is probabilistic, and that we know the probability
pij that, if the object is in class Fj then the i-th property Ei will have value
1. Assume further that these properties are independent; that is, even given
Fj, knowledge of the value of Ei tells us nothing more about the value of a
different Ek in the same experiment. (This is a strong condition—see
below.) Let fj be the absolute probability that an object is in class Fj.
Finally, for this experiment define V to be the particular set of i's for which
the Ei's are 1. Then this V represents the Character of the object! From the
definition of conditional probability, we have Pr(Fj|V) = Pr(Fj)Pr(V|Fj)/Pr(V).
Given the Character V, we want to guess which Fj has occurred (with the
least chance of being wrong—the so-called maximum likelihood estimate);
that is, for which j is Pr(Fj|V) the largest. Since in the above Pr(V) does not
depend on j, we have only to calculate for which j is Pr(V)Pr(Fj|V) =
Pr(Fj)Pr(V|Fj) the largest. Hence, writing qij = 1 - pij, our independence
hypothesis says that we have to maximize

    fj ∏{i in V} pij ∏{i not in V} qij  =  fj ∏{i in V} (pij/qij) ∏{all i} qij,    (1)

where in the left-hand form the first product is over V and the second over its
complement, while in the right-hand form the last product runs over all the properties.
These “maximum likelihood” decisions can be made (Fig. 6) by a simple
network device. [7]
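The decision rule just derived can be sketched as follows; working with logarithms turns the products into the weighted sums that such a network would compute, and the priors and probabilities shown here are invented.

```python
import math

def most_likely_class(V, f, p):
    """V: set of indices i whose property Ei is 1.
    f[j]: prior probability of class Fj.  p[j][i]: Pr(Ei = 1 | Fj)."""
    n = len(next(iter(p.values())))
    best, best_score = None, -math.inf
    for j in f:
        q = [1.0 - p[j][i] for i in range(n)]
        # log of the right-hand form of Eq. (1): a constant term plus
        # weights log(pij/qij) on the properties that are present.
        score = math.log(f[j]) + sum(math.log(q[i]) for i in range(n))
        score += sum(math.log(p[j][i] / q[i]) for i in V)
        if score > best_score:
            best, best_score = j, score
    return best

f = {"F1": 0.5, "F2": 0.5}
p = {"F1": [0.9, 0.8, 0.1], "F2": [0.2, 0.7, 0.6]}
print(most_likely_class({0, 1}, f, p))   # properties E0 and E1 present -> "F1"
```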
Note: At the cost of an additional network layer, we may also account for
the possible cost gjk that would be incurred if we were to assign to Fk a
figure really in class Fj. In this case, the minimum cost decision is given by
the k for which ∑{j} gjk fj ∏{i in V} pij ∏{i not in V} qij is the least.
The nets of Fig. 6 are very orderly in structure. Is all this structure
necessary? Certainly if there were a great many properties, each of which
provided very little marginal information, some of them would not be
missed. Then one might expect good results with a mere sampling of all
the possible connection paths. And one might thus, in this special
situation, use a random connection net. The two-layer nets here resemble
those of the “perceptron” proposal of Rosenblatt [22]. In the latter, there is
an additional level of connections coming directly from randomly selected
points of a “retina.” Here the properties, the devices which abstract the
visual input data, are simple functions which add some inputs, subtract
others, and detect whether the result exceeds a threshold. Equation (1), we
think, illustrates what is of value in this scheme. It does seem clear that
such nets can handle a maximum-likelihood type of analysis of the output
of the property functions. But these nets, with their simple, randomly
generated, connections can probably never achieve recognition of such
patterns as “the class of figures having two separated parts,” and they
cannot even achieve the effect of template recognition without size and
position normalization (unless sample figures have been presented
previously in essentially all sizes and positions). For the chances are
extremely small of finding, by random methods, enough properties usefully
correlated with patterns appreciably more abstract than are those of the
prototype-derived kind. And these networks can really only separate out
(by weighting) information in the individual input properties; they cannot
extract further information present in nonadditive form. The “perceptron”
class of machines has facilities neither for obtaining better-than-chance
We should not leave the discussion of decision net models without noting
their important limitations. The hypothesis that the pij's represent
independent events is a very strong condition indeed. Without this
hypothesis we could still construct maximum- likelihood nets, but we
would need an additional layer of cells to represent all of the joint events
V; that is, we would need to know all the Pr (Fj|V). This gives a general
(but trivial) solution, but requires 2**n cells for n properties, which is
completely impractical for large systems. What is required is a system
which computes some sampling of all the joint conditional probabilities,
and uses these to estimate others when needed. The work of Uttley [28],
[29], bears on this problem, but his proposed and experimental devices do
not yet clearly show how to avoid exponential growth. See also Roberts
[25], Papert [21], and Hawkins [22]. We can find nothing resembling this
type of analysis in Rosenblatt [22].
"A rectangle (1) contains two subfigures disposed horizontally. The part
on the left is a rectangle (2) that contains two subfigures disposed
vertically, the upper part of which is a circle (3) and the lower a triangle
(4). The part on the right . . . etc."
where "IN (x, y)" means 'y is inside x,'-->(x y)" means 'X is to the left of
Y,' and "ABOVE (x, y)" means 'x is above y.' This description may be
regarded as an expression in a simple "list-structure" language. Newell,
Shaw and Simon have developed powerful computer techniques for
manipulating symbolic expressions in such languages for purposes of
heuristic programming. (See the remarks at the end of Section IV. If some
of the members of a list are lists, they must be surrounded by exterior
parentheses, and this accounts for the accumulation of parentheses.
This description language may be regarded as a simple kind of "list-
structure" language. Newell, Shaw and Simon have developed powerful
computer techniques for manipulating symbolic expressions in such
languages for purposes of heuristic programming. See the remarks at the
end of Section IV. By introducing notation for the relations 'inside of', 'to
the left of', and 'above', we construct a symbolic description. Such
descriptions can be formed and manipulated by machines.
By abstracting out the complex relation between the parts of the figure, we
can re-use the same formula to describe all three of the figures above, by
using the same "more abstract" expression for all of them:
F(A, B, C, D, E, F, G, H) = IN(A, (-->(IN(B, (ABOVE (C, D))), IN(E,
(ABOVE (-->(F, G, H))))))),
and the other two scenes can be represented by the same F with different
substitutions for its variables. It is up to the programmer to decide at just
what level of complexity a part of a picture should be considered
"primitive". This will depend on what the description is to be used for. We
could further divide the drawings into vertices, lines, and arcs. Obviously,
for some applications the relations would need more metrical information,
e.g., specification of lengths or angles.
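One way such a description and its "more abstract" form might be held as list structures is sketched below; the relation symbols follow the text, while the names given to the subfigures of the right-hand part are invented placeholders.

```python
# Nested tuples standing in for list structures.  The relation names follow
# the text; the subfigure names on the right-hand side are placeholders.
scene = ("IN", "rect-1",
         ("-->",
          ("IN", "rect-2", ("ABOVE", "circle-3", "triangle-4")),
          ("IN", "rect-5", ("ABOVE", ("-->", "part-6", "part-7", "part-8")))))

def F(A, B, C, D, E, F_, G, H):
    """The abstract form IN(A, -->(IN(B, ABOVE(C, D)), IN(E, ABOVE(-->(F, G, H)))))."""
    return ("IN", A,
            ("-->",
             ("IN", B, ("ABOVE", C, D)),
             ("IN", E, ("ABOVE", ("-->", F_, G, H)))))

# One substitution of the variables reproduces the scene above; the other two
# scenes would be obtained with different substitutions.
print(F("rect-1", "rect-2", "circle-3", "triangle-4",
        "rect-5", "part-6", "part-7", "part-8") == scene)   # True
```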
The important thing about such "articular" descriptions is that they can be
obtained by repeated application of a fixed set of pattern-recognition
techniques. Thus we can obtain arbitrarily complex descriptions from a
fixed complexity classification-mechanism. The new element required in
the mechanism (beside the capacity to manipulate the list-structures) is the
ability to articulate—to "attend fully" to a selected part of the picture and
bring all one's resources to bear on that part. In efficient problem-solving
programs, we will not usually complete such a description in a single
operation. Instead, the depth or detail of description will be under the
control of other processes. These will reach deeper, or look more carefully,
only when they have to, e.g., when the presently available description is
inadequate for a current goal. The author, together with L. Hodes, is
working on pattern-recognition schemes using articular descriptions. By
manipulating the formal descriptions, we can deal with overlapping and
incomplete figures, and several other problems of the “Gestalt” type.
It seems likely that as machines are turned toward more difficult problem
areas, passive classification systems will become less adequate, and we
may have to turn toward schemes which are based more on internally-
generated hypotheses, perhaps “error-controlled” along the lines proposed
by MacKay [89].
Summary—In order to solve a new problem, one should first try using
methods similar to those that have worked on similar problems. To
implement this “basic learning heuristic” one must generalize on past
experience, and one way to do this is to use success-reinforced decision
models. These learning systems are shown to be averaging devices. Using
devices that also learn which events are associated with reinforcement, i.e.,
reward, we can build more autonomous “secondary reinforcement”
systems. In applying such methods to complex problems, one encounters a
serious difficulty—in distributing credit for success of a complex strategy
among the many decisions that were involved. This problem can be
managed by arranging for local reinforcement of partial goals within a
hierarchy, and by grading the training sequence of problems to parallel a
process of maturation of the machine's resources.
In order to solve a new problem one uses what might be called the basic
learning heuristic: first try using methods similar to those which have
worked, in the past, on similar problems. We want our machines, too, to
benefit from their past experience. Since we cannot expect new situations
to be precisely the same as old ones, any useful learning will have to
involve generalization techniques. There are too many notions associated
with "learning" to justify defining the term precisely. But we may be sure
that any useful learning system will have to use records of the past as
evidence for more general propositions; it must thus entail some
commitment or other about “inductive inference.” (See Section V-B.)
A. Reinforcement
which moves p a fraction (1-q) of the way towards unity. (Properly, the
reinforcement functions should depend both on the p's and on the previous
reaction. Reward should decrease p if our animal has just turned to the left.
The notation in the literature is somewhat confusing in this regard.) If we
dislike what it does we apply negative reinforcement,
moving p the same fraction of the way toward 0. Some theory of such
"linear" learning operators, generalized to several stimuli and responses,
will be found in Bush and Mosteller [45]. We can show that the learning
result is an average weighted by an exponentially-decaying time factor:
Let Zn be ±1 according to whether the n-th event is rewarded or
extinguished, and replace pn by cn = 2pn - 1, so that -1 < cn < 1, as for a
correlation coefficient. Then (with c0 = 0) we obtain by induction

    cn = (1 - q)(Zn + q Zn-1 + q**2 Zn-2 + . . . + q**(n-1) Z1),

and since (1 - q)(1 + q + . . . + q**(n-1)) = 1 - q**n, the weights sum to
nearly unity when n is large.
If the term Zi is regarded as a product of (i) how the creature responded
and (ii) which kind of reinforcement was given, then cn is a kind of
correlation function (with the decay weighting) of the joint behavior of
these quantities. The ordinary, uniformly-weighted average has the same
general form but with time-dependent q:

    cn = (1/n)(Z1 + Z2 + . . . + Zn),

which is obtained by setting q = (n - 1)/n at the n-th step.
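A brief numerical check of the relation between the incremental update and the decay-weighted average, with an arbitrary q and reward sequence:

```python
import random

# Check that the incremental update c_n = q*c_{n-1} + (1-q)*Z_n equals the
# decay-weighted average (1-q) * sum_i q**(n-i) * Z_i.  The value of q and
# the Z sequence are arbitrary.
q = 0.8
Z = [random.choice([+1, -1]) for _ in range(50)]

c = 0.0
for z in Z:                        # the reinforcement operators applied in turn
    c = q * c + (1 - q) * z

weighted = (1 - q) * sum(q ** (len(Z) - i) * z for i, z in enumerate(Z, start=1))
print(abs(c - weighted) < 1e-9)    # True: the two forms agree
```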
Incidentally, in spite of the space given here for their exposition, I am not
convinced that such “incremental” or “statistical” learning schemes should
play a central role in our models. They will certainly continue to appear as
components of our programs but, I think, mainly by default. The more
intelligent one is, the more often he should be able to learn from an
experience something rather definite; e.g., to reject or accept a hypothesis,
or to change a goal. (The obvious exception is that of a truly statistical
environment in which averaging is inescapable. But the heart of problem
solving is always, we think, the combinatorial part that gives rise to
searches, and we should usually be able to regard the complexities caused
by “noise” as mere annoyances, however irritating they may be.) In this
connection, we can refer to the discussion of memory in Miller, Galanter
and Pribram [46]. This seems to be the first major work in Psychology to
show the influence of work in the artificial intelligence area, and its
programme is generally quite sophisticated.
The new unit U is a device that learns which external stimuli are strongly
correlated with the various reinforcement signals, and responds to such
stimuli by reproducing the corresponding reinforcement signals. (The
device U is not itself a reinforcement learning device; it is more like a
“Pavlovian” conditioning device, treating the Z signals as “unconditioned”
stimuli and the S signals as conditioned stimuli. We might also limit the
number of conditioned stimuli.) The heuristic idea is that any signal from
the environment that in the past has been well-correlated with (say)
positive reinforcement is likely to be an indication that something good has
just happened. If the training on early problems was such that this is
realistic, then the system eventually should be able to detach itself from the
Trainer, and become autonomous. If we further permit “chaining” of the
“secondary reinforcers,” e.g., by admitting the connection shown as a
dotted line, the scheme becomes quite powerful, in principle. There are
obvious pitfalls in admitting such a degree of autonomy; the values of the
system may drift to a non-adaptive condition.
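A rough sketch of a unit of the kind just described: it keeps, for each external stimulus, a decaying correlation with the trainer's reinforcement signal and, once some stimulus becomes strongly correlated, reproduces the corresponding signal itself. The decay rate, the threshold, and the voting rule are assumptions.

```python
class SecondaryReinforcer:
    """A crude stand-in for the unit U: a Pavlovian-style correlator."""

    def __init__(self, decay=0.9, threshold=0.5):
        self.decay, self.threshold = decay, threshold
        self.correlation = {}          # stimulus -> decayed correlation with Z

    def observe(self, stimuli, z):
        """Training step: stimuli present on this trial, z = +1 or -1 from the trainer."""
        for s in stimuli:
            old = self.correlation.get(s, 0.0)
            self.correlation[s] = self.decay * old + (1 - self.decay) * z

    def reinforcement(self, stimuli):
        """Signal emitted by the unit when the trainer is absent."""
        scores = [self.correlation.get(s, 0.0) for s in stimuli]
        strong = [v for v in scores if abs(v) > self.threshold]
        if not strong:
            return 0                   # no well-correlated stimulus: stay silent
        return +1 if sum(strong) > 0 else -1

u = SecondaryReinforcer()
for _ in range(20):
    u.observe({"bell"}, +1)            # "bell" repeatedly accompanies reward
    u.observe({"buzzer"}, -1)
print(u.reinforcement({"bell"}), u.reinforcement({"light"}))   # +1 and 0
```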
Perhaps the simplest scheme is to use a weighted sum of some selected set
of “property” functions of the positions: mobility, advancement, center
control, and the like. This is done in Samuel's program, and in most of its
predecessors. Associated with this is a multiple-simultaneous-optimizer
method for discovering a good coefficient assignment (using the
correlation technique noted in Section III-A). But the source of
reinforcement signals in [2] is novel. One cannot afford to play out one or
more entire games for each single learning step. Samuel measures instead
for each move the difference between what the evaluation function yields
directly for a position and what it predicts on the basis of an extensive
continuation exploration, i.e., backing-up. The sign of this error, "Delta,"
is used for reinforcement; thus the system may learn something at each
move.
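A highly simplified sketch of this kind of adjustment: a weighted-sum evaluation and a small coefficient change in the direction given by the sign of Delta. The property values and learning rate are illustrative, and the details are not Samuel's.

```python
def evaluate(weights, properties):
    """Weighted-sum evaluation of a position's property values."""
    return sum(w * p for w, p in zip(weights, properties))

def reinforce(weights, properties, backed_up_value, rate=0.01):
    """Nudge each coefficient so the static evaluation moves toward the
    prediction obtained from the deeper continuation search."""
    delta = backed_up_value - evaluate(weights, properties)
    sign = (delta > 0) - (delta < 0)
    return [w + rate * sign * p for w, p in zip(weights, properties)]

weights = [0.5, 0.2, 0.1]            # e.g., mobility, advancement, center control
position = [3, -1, 2]                # property values measured on some position
weights = reinforce(weights, position, backed_up_value=2.0)
print(weights)
```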
will appear in later trials. (We lack space for details of how this is done.)
Thus the program tries to find "good" instructions, more or less
independently, for each location in program memory. The machine did
learn to solve some extremely simple problems. But it took of the order of
1000 times longer than pure chance would expect. In part I of [54], this
failure is discussed and attributed in part to what we called (Section I-C)
the "Mesa phenomenon." In changing just one instruction at a time, the
machine had not taken large enough steps in its search through program
space.
And even there, if we accept the general view of Darlington [56], who
emphasizes the heuristic aspects of genetic systems, we must suppose that
there developed early, e.g., in the phenomena of meiosis and crossing-over,
quite highly specialized mechanisms providing for the segregation of
groupings related to solutions of subproblems. Recently, much effort has been
devoted to the construction of training sequences in connection with programming
"teaching machines." Naturally, the psychological literature abounds with
theories of how complex behavior is built up from simpler. In our own
area, perhaps the work of Solomonoff [55], while overly cryptic, shows the
most thorough consideration of this dependency on training sequences.
It is not surprising that the testing grounds for early work on mechanical
problem solving have usually been areas of mathematics, or games, in
which the rules are learned with absolute clarity. The "Logic Theory"
machine of [57] and [58], called "LT," was a first attempt to prove
theorems in logic, by frankly heuristic methods. Although the program was
not by human standards a brilliant success (and did not surpass its
designers), it stands as a landmark both in heuristic programming and in
the development of modern automatic programming.
Among the heuristics that were studied were 1) a similarity test to reduce
the work in step 4 (which includes another search through the theorem list),
2) a simplicity test to select apparently easier problems from the problem
list, and 3) a strong nonprovability test to remove from the problem list
expressions which are probably false and hence not provable. In a series of
experiments "learning" was used to find which earlier theorems had been
most useful and should be given priority in step 3. We cannot review the
effects of these changes in detail. Of interest was the balance between the
extra cost for administration of certain heuristics and the resultant search
reductions; this balance was quite delicate in some cases when computer
memory became saturated. The system seemed to be quite sensitive to the
training sequence--the order in which problems were given. And some
heuristics that gave no significant overall improvement did nevertheless
affect the class of solvable problems. Curiously enough, the general
efficiency of LT was not greatly improved by any or all of these devices.
But all this practical experience is reflected in the design of the much more
sophisticated "GPS" system described briefly in Section IV-D, 2).
Hao Wang [59] has criticized the LT project on the grounds that there exist,
as he and others have shown, mechanized proof methods which, for the
particular run of problems considered, use far less machine effort than does
LT and which have the advantage that they will ultimately find a proof for
any provable proposition. (LT does not have this exhaustive "decision
procedure" character and can fail ever to find proofs for some theorems.)
The authors of [58], perhaps unaware of the existence of even moderately
efficient exhaustive methods, supported their arguments by comparison
with a particularly inefficient exhaustive procedure. Nevertheless, I feel
that some of Wang's criticisms are misdirected. He does not seem to
recognize that the authors of LT are not so much interested in proving these
theorems as they are in the general problem of solving difficult problems.
The combinatorial system of Russell and Whitehead (with which LT deals)
is far less simple and elegant than the system used by Wang. (Note, e.g.,
the emphasis in [49] and [60].) Wang's procedure [59], too, works
backwards, and can be regarded as a generalization of the method of
"falsification" for deciding truth-functional tautology. In [93] and its
unpublished sequel, Wang introduces more powerful methods for solving
harder problems.
All these efforts are directed toward the reduction of search effort. In that
sense, they are all heuristic programs. Since practically no one still uses
"heuristic" in a sense opposed to "algorithmic, serious workers might do
well to avoid pointless argument on this score. The real problem is to find
Imagine that the problems and their relations are arranged to form some
kind of directed-graph structure [14], [57], and [62]. The main problem is
to establish a "valid" path between two initially distinguished nodes.
Generation of new problems is represented by the addition of new, not-yet-
valid paths, or by the insertion of new nodes in old paths. Associate with
each connection, quantities describing its current validity state (solved,
plausible, doubtful, etc.) and its current estimated difficulty.
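One way such a structure might be stored is sketched below; the particular field names, states, and the crude "global survey" are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Connection:
    target: str
    state: str = "plausible"       # "solved", "plausible", "doubtful", ...
    difficulty: float = 1.0        # current estimated difficulty

@dataclass
class ProblemGraph:
    edges: dict = field(default_factory=dict)   # problem -> list of Connections

    def add(self, source, target, state="plausible", difficulty=1.0):
        self.edges.setdefault(source, []).append(Connection(target, state, difficulty))

    def easiest_open_connection(self):
        """A crude global survey: the unsolved connection of least estimated difficulty."""
        open_conns = [(s, c) for s, conns in self.edges.items()
                      for c in conns if c.state != "solved"]
        return min(open_conns, key=lambda sc: sc[1].difficulty, default=None)

g = ProblemGraph()
g.add("main problem", "lemma 1", difficulty=2.0)
g.add("main problem", "lemma 2", difficulty=0.5)
g.add("lemma 1", "main problem solved", state="doubtful", difficulty=3.0)
print(g.easiest_open_connection())
```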
tries to open the circuit between them. Each move consists of shorting (or
opening), irreversibly, one of the remaining resistors. Shannon's machine
applies a potential between the boundaries and selects that resistor which
carries the largest current. Very roughly speaking, this resistor is likely to
be most critical because changing it will have the largest effect on the
resistance of the net and, hence, in the goal direction of shorting (or
opening) the circuit. And although this argument is not perfect, nor is this a
perfect model of the real combinatorial situation, the machine does play
extremely well. For example, if the machine begins by opening resistor 1,
the opponent might counter by shorting resistor 4 (which now has the
largest current). The remaining move-pairs (if both players use that
strategy)—would be (5,8) (9,13) (6,3) (12, 10)—or (12, 2)—and the
machine wins. This strategy can make unsound moves in certain situations,
but no one seems to have been able to force this during a game. [Note:
after writing this, I did find such a strategy that defeats large-board
versions of this machine.]
The use of such a global method for problem-selection requires that the
available "difficulty estimates" for related subproblems be arranged to
combine in roughly the manner of resistance values. Also, we could regard
this machine as using an "analog model" for "planning." (See Section IV-
D.) [A variety of combinatorial methods will be matched against the
network-analogy opponent in a program being completed by R. Silver at
the MIT Lincoln Laboratory.]
The prospect of having to study at each step the whole problem structure is
discouraging, especially since the structure usually changes only slightly
after each attempt. One naturally looks for methods which merely update
or modify a small fragment of the stored record. A variety of compromises
lie between the extremes of the "first-come-first-served" problem-list
method and the full global-survey methods. Perhaps the most
C. "Character-Method" Machines
Once a problem is selected, we must decide which method to try first. This
depends on our ability to classify or characterize problems. We first
compute the Character of our problem (by using some pattern recognition
D. Planning
intelligent machine will fail, rather sharply, for some modest level of
problem difficulty. Only schemes which actively pursue an analysis toward
obtaining a set of sequential goals can be expected to extend smoothly into
increasingly complex problem domains.
The critical problem in using the Character-Algebra model for planning is,
of course, the prediction-reliability of the matrix entries. One cannot expect
the Character of a result to be strictly determined by the Character of the
original and the method used. And the reliability of the predictions will, in
any case, deteriorate rapidly as the matrix power is raised. But, as we have
noted, any plan at all is so much better than none that the system should do
very much better than exhaustive search, even with quite poor prediction
quality.
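A sketch of planning with such a Character-Algebra: a table predicts the Character of the result of applying each method to a problem of a given Character, and a simple search looks for a chain of methods leading from the initial Character to the goal. The Characters and table entries are invented, and real entries would of course be only unreliable predictions.

```python
from collections import deque

ALGEBRA = {                                  # (character, method) -> predicted character
    (("hard", "unfactored"), "factor"):   ("hard", "factored"),
    (("hard", "factored"),   "simplify"): ("easy", "factored"),
    (("hard", "unfactored"), "simplify"): ("hard", "unfactored"),
    (("easy", "factored"),   "solve"):    ("solved",),
}

def plan(start, goal):
    """Breadth-first search over Characters for a sequence of methods."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        character, methods = queue.popleft()
        if character == goal:
            return methods
        for (c, m), result in ALGEBRA.items():
            if c == character and result not in seen:
                seen.add(result)
                queue.append((result, methods + [m]))
    return None

print(plan(("hard", "unfactored"), ("solved",)))   # ['factor', 'simplify', 'solve']
```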
Compare this with the "matching" process described in [57]. The notions
of "Character," "Character-Algebra," etc., originate in [14] but seem useful
in describing parts of the "GPS" system of [57] and [15]. Reference [15]
contains much additional material we cannot survey here. Essentially, GPS
is to be self-applied to the problem of discovering sets of Differences
appropriate for given problem areas. This notion of "bootstrapping"—that
is, applying a problem-solving system to the task of improving some of its
own methods—is old and familiar, but in [15] we find perhaps the first
specific proposal about how such an advance might be realized.
A. Intelligence
In all of this discussion we have not come to grips with anything we can
isolate as "intelligence." We have discussed only heuristics, shortcuts, and
classification techniques. Is there something missing? I am confident that
sooner or later we will be able to assemble programs of great problem-
solving ability from complex combinations of heuristic devices: multiple
optimizers, pattern-recognition tricks, planning algebras, recursive
administration procedures, and the like. In no one of these will we find the
seat of intelligence. Should we ask what intelligence "really is"? My own
view is that this is more of an esthetic question, or one of sense of dignity,
than a technical matter! To me "intelligence" seems to denote little more
than the complex of performances which we happen to respect, but do not
understand. So it is, usually, with the question of "depth" in mathematics.
Once the proof of a theorem is really understood, its content seems to
become trivial. (Still, there may remain a sense of wonder about how the
proof was discovered.)
Programmers, too, know that there is never any "heart" in a program. There
are high-level routines in each program, but all they do is dictate that "if
such-and-such, then transfer to such-and-such a subroutine." And when we
look at the low-level subroutines, which "actually do the work," we find
senseless loops and sequences of trivial operations, merely carrying out the
dictates of their superiors. The intelligence in such a system seems to be as
intangible as becomes the meaning of a single common word when it is
thoughtfully pronounced over and over again.
But we should not let our inability to discern a locus of intelligence lead us
to conclude that programmed computers therefore cannot think. For it may
be so with man, as with machine, that, when we understand finally the
structure and program, the feeling of mystery (and self-approbation) will
weaken. See [14] and [9]. We find similar views concerning "creativity" in
[60]. The view expressed by Rosenbloom [73] that minds (or brains) can
transcend machines is based, apparently, on an erroneous interpretation of
the meaning of the "unsolvability theorems" of Godel.
B. Inductive Inference
Let us pose now for our machines, a variety of problems more challenging
than any ordinary game or mathematical puzzle. Suppose that we want a
machine which, when embedded for a time in a complex environment or
"universe," will essay to produce a description of that world—to discover its
regularities or laws of nature. We might ask it to predict what will happen
next. We might ask it to predict what would be the likely consequences of a
We will take language to mean the set of expressions formed from some
given set of primitive symbols or expressions, by the repeated application
of some given set of rules; the primitive expressions plus the rules constitute the
grammar of the language. Most induction problems can be framed as
problems in the discovery of grammars. Suppose, for instance, that a
machine's prior experience is summarized by a large collection of
statements, some labeled "good" and some "bad" by some critical device.
How could we generate selectively more good statements? The trick is to
find some relatively simple (formal) language in which the good statements
are grammatical, and in which the bad ones are not. Given such a language,
we can use it to generate more statements, and presumably these will tend
to be more like the good ones. The heuristic argument is that if we can find
a relatively simple way to separate the two sets, the discovered rule is
likely to be useful beyond the immediate experience. If the extension fails
to be consistent with new data, one might be able to make small changes in
the rules and, generally, one may be able to use many ordinary problem-
solving methods for this task.
Solomonoff's work [18] may point the way toward systematic
mathematical ways to explore this discovery problem. He considers the
class of all programs (for a given general-purpose computer) which will
produce a certain given output (the body of data in question). Most such
programs, if allowed to continue, will add to that body of data. By properly
weighting these programs, perhaps by length, we can obtain corresponding
weights for the different possible continuations, and thus a basis for
prediction. If this prediction is to be of any interest, it will be necessary to
show some independence of the given computer; it is not yet clear
precisely what form such a result will take.
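A toy sketch of the weighting idea only: the "programs" here are merely candidate repeating patterns over {0, 1}, a drastic simplification of programs for a general-purpose computer, and each surviving pattern is weighted by 2**(-length). All of the specifics are assumptions.

```python
from collections import defaultdict
from itertools import product

def predictions(data, max_len=4):
    """Weight every short repeating pattern that reproduces the data by
    2**(-length), and distribute that weight over the next symbol it implies."""
    weights = defaultdict(float)
    for length in range(1, max_len + 1):
        for pattern in product("01", repeat=length):
            program = "".join(pattern)
            generated = program * (len(data) // length + 2)
            if generated.startswith(data):             # the program reproduces the data
                weights[generated[len(data)]] += 2.0 ** (-length)
    total = sum(weights.values())
    if total == 0:
        return {}                                      # no short program fits
    return {symbol: w / total for symbol, w in weights.items()}

print(predictions("010101"))   # {'0': 1.0}: the surviving short programs all continue with '0'
```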
C. Models of Oneself
To the extent that the creature's actions affect the environment, this internal
model of the world will need to include some representation of the creature
itself. If one asks the creature "why did you decide to do such and such"
(or if it asks this of itself), any answer must come from the internal model.
Thus the evidence of introspection itself is liable to be based ultimately on
the processes used in constructing one's image of one's self. Speculation on
the form of such a model leads to the amusing prediction that intelligent
machines may be reluctant to believe that they are just machines. The
argument is this: our own self-models have a substantially "dual" character;
there is a part concerned with the physical or mechanical environment (that
is, with the behavior of inanimate objects)—and there is a part concerned
Now, when we ask such a creature what sort of being it is, it cannot simply
answer "directly." It must inspect its model(s). And it must answer by
saying that it seems to be a dual thing—which appears to have two parts—a
"mind" and a "body." Thus, even the robot, unless equipped with a
satisfactory theory of artificial intelligence, would have to maintain a
dualistic opinion on this matter.
CONCLUSION
bold pioneering of Nicholas Rashevsky (c. 1929) and his later co-workers
[95]; Theories of Learning, e.g., Saul Gorn [84]; Theory of Games, e.g.,
Martin Shubik [85]; and Psychology, e.g., Jerome Bruner, et al. [86]. And
everyone should know the work of George Polya [87] on how to solve
problems. We can hope only to have transmitted the flavor of some of the
more ambitious projects directly concerned with getting machines to take
over a larger portion of problem-solving tasks.
One last remark: we have discussed here only work concerned with more
or less self-contained problem solving programs. But as this is written, we
are at last beginning to see vigorous activity in the direction of constructing
usable time-sharing or multiprogramming computing systems. With these
systems, it will at last become economical to match human beings in real
time with really large machines. This means that we can work toward
programming what will be, in effect, "thinking aids." In the years to come,
we expect that these man-machine systems will share, and perhaps for a
time be dominant, in our advance toward the development of "artificial
intelligence."
BIBLIOGRAPHY
[05] W. R. Ashby, "Design for a Brain," John Wiley and Sons, Inc. New
York, N. Y., 1952.
[06] W. R. Ashby, "Design for an intelligence amplifier," in [B].
[07] M. L. Minsky and O. G. Selfridge, "Learning in random nets," in [H].
[08] H. Sherman, "A quasi-topological method for machine recognition of
line patterns," in [E].
[09] M. L. Minsky, "Some aspects of heuristic programming and artificial
intelligence," in [C].
[10] W. Pitts and W. S. McCulloch, "How we know universals," Bull.
Math. Biophys., vol. 9, pp. 127-147, 1947.
[11] N. Wiener, "Cybernetics," John Wiley and Sons, Inc., New York, N.
Y., 1948.
[12] O. G. Selfridge, "Pattern recognition and modern computers,"
[13] G. P. Dinneen, "Programming pattern recognition," in [A].
[14] M. L. Minsky, "Heuristic Aspects of the Artificial Intelligence
Problem," Lincoln Lab., M.I.T., Lexington, Mass., Group Rept. 34-55,
ASTIA Doc. No. 236885, December 1956. (M.I.T. Hayden Library No. H-
58.)
[15] A. Newell, T. C. Shaw, and H. A. Simon, "A variety of intelligent
learning in a general problem solver," in [D].
[16] R. J. Solomonoff, "The Mechanization of Linguistic Learning," Zator
Co., Cambridge, Mass., Zator Tech. Bull. No. 125; Second Intl. Congress
on Cybernetics, Namur, Belgium, September 1958.
[17] R. J. Solomonoff "A new method for discovering the grammars of
phrase structure languages," in [E].
[18] R. J. Solomonoff, "A Preliminary Report on a General Theory of
Inductive Inference," Zator Co., Cambridge, Mass., Zator Tech. Bull. V-
131, February 1960.
[19] O. G. Selfridge, "Pandemonium: a paradigm for learning," in [C].
[20] O. G. Selfridge and U. Neisser, "Pattern recognition by machine," Sci.
Am., vol. 203, pp. 60-68, August 1960.
[21] A. L. Samuel, "Some studies in machine learning using the game of
checkers," IBM J. Res. Dev., vol. 3,pp. 211-219, July 1959.
[22] F. Rosenblatt, "The Perceptron," Cornell Aeronautical Lab., Inc.,
Ithaca, N. Y., Rept. VG-1196, January 1958. See also the article of Hawkins
in this issue.
[23] W. H. Highleyman and L. A. Kamentsky, "Comments on a character
recognition method of Bledsoe and Browning" (Correspondence), IRE
Trans. on Electronic Computers, vol. EC-9, p. 263, June 1960.
[24] W. W. Bledsoe and I. Browning, "Pattern recognition and reading by
machine," in [F].
[25] L. G. Roberts, "Pattern recognition with an adaptive network," IRE
International Convention Record, 1960, pt. 2, p. 66.
[26] W. Doyle, '"Recognition of Sloppy, Hand-Printed Characters," Lincoln
Lab., M.I.T., Lexington, Mass., Group Rept. 54-12, December 1959.
[27] R. A. Kirsch, C. Ray, L. Cahn, and G. H. Urban, "Experiments in
Processing Pictorial Information with a Digital Computer," Proc. EJCC,
pp. 221-229, December 1957.
[28] A. M. Uttley, "Conditional probability machines " and "Temporal and
spatial patterns in a conditional probability machine,"
[29] A. M. Uttley, "Conditional probability computing in a nervous
system," in [C].
[30] C. N. Mooers, "Information retrieval on structured content,"
[31] C. E. Shannon, "Programming a digital computer for playing chess,"
in [K].
[32] J. McCarthy, "Recursive Functions of Symbolic Expressions," in [G].
[33] J. S. Bomba, "Alpha-numeric character recognition using local
operations," in [F].
[34] R. L. Grimsdale, et al., "A system for the automatic recognition of
patterns," Proc. IEE, vol. 106, pt. B, March 1959.
[35] S. H. Unger, "Pattern detection and recognition," Proc. IRE, vol. 47,
pp. 1737-1752, October 1959.
[36] J. H. Holland, "On iterative circuit computers constructed of
microelectronic components and systems," Proc. WJCC, pp.
[37] D. O. Hebb, "The Organization of Behavior," John Wiley and Sons,
Inc., New York, N. Y., 1949.
[38] W. Kohler, "Gestalt Psychology," 1947.
[39] N. Haller, "Line Tracing for Character Recognition," M.S.E.E. thesis,
M.I.T., Cambridge, Mass., 1959.
[40] N. Minot, "Automatic Devices for Recognition of Visible Two-