4.3.3 Some Studies in Machine Learning Using the Game of Checkers 535

Some Studies in Machine Learning


Using the Game of Checkers
Arthur L. Samuel

Abstract: Two machine-learning procedures have been investigated in some detail using the game of
checkers. Enough work has been done to verify the fact that a computer can be programmed so that it will
learn to play a better game of checkers than can be played by the person who wrote the program.
Furthermore, it can learn to do this in a remarkably short period of time (8 or 10 hours of machine-playing time)
when given only the rules of the game, a sense of direction, and a redundant and incomplete list of
parameters which are thought to have something to do with the game, but whose correct signs and relative
weights are unknown and unspecified. The principles of machine learning verified by these experiments
are, of course, applicable to many other situations.

Introduction
The studies reported here have been concerned with the programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning. While this is not the place to dwell on the importance of machine-learning procedures, or to discourse on the philosophical aspects,¹ there is obviously a very large amount of work, now done by people, which is quite trivial in its demands on the intellect but does, nevertheless, involve some learning. We have at our command computers with adequate data-handling ability and with sufficient computational speed to make use of machine-learning techniques, but our knowledge of the basic principles of these techniques is still rudimentary. Lacking such knowledge, it is necessary to specify methods of problem solution in minute and exact detail, a time-consuming and costly procedure. Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.

• General methods of approach

At the outset it might be well to distinguish sharply between two general approaches to the problem of machine learning. One method, which might be called the Neural-Net Approach, deals with the possibility of inducing learned behavior into a randomly connected switching net (or its simulation on a digital computer) as a result of a reward-and-punishment routine. A second, and much more efficient approach, is to produce the equivalent of a highly organized network which has been designed to learn only certain specific things. The first method should lead to the development of general-purpose learning machines. A comparison between the size of the switching nets that can be reasonably constructed or simulated at the present time and the size of the neural nets used by animals, suggests that we have a long way to go before we obtain practical devices.² The second procedure requires reprogramming for each new application, but it is capable of realization at the present time. The experiments to be described here were based on this second approach.

• Choice of problem

For some years the writer has devoted his spare time to the subject of machine learning and has concentrated on the development of learning procedures as applied to games.³ A game provides a convenient vehicle for such study as contrasted with a problem taken from life, since many of the complications of detail are removed. Checkers, rather than chess,⁴⁻⁷ was chosen because the simplicity of its rules permits greater emphasis to be placed on learning techniques. Regardless of the relative merits of the two games as intellectual pastimes, it is fair to state that checkers contains all of the basic characteristics of an intellectual activity in which heuristic procedures and learning processes can play a major role and in which these processes can be evaluated.

Some of these characteristics might well be enumerated. They are:
(1) The activity must not be deterministic in the practical sense. There exists no known algorithm which will guarantee a win or a draw in checkers, and the complete

Originally published in IBM Journal, Vol. 3, No. 3, July 1959.


536 Improving the Efficiency of a Problem Solver

explorations of every possible path through a checker game would involve perhaps 10^40 choices of moves which, at 3 choices per millimicrosecond, would still take 10^21 centuries to consider.
(2) A definite goal must exist (the winning of the game) and at least one criterion or intermediate goal must exist which has a bearing on the achievement of the final goal and for which the sign should be known. In checkers the goal is to deprive the opponent of the possibility of moving, and the dominant criterion is the number of pieces of each color on the board. The importance of having a known criterion will be discussed later.
(3) The rules of the activity must be definite and they should be known. Games satisfy this requirement. Unfortunately, many problems of economic importance do not. While in principle the determination of the rules can be a part of the learning process, this is a complication which might well be left until later.
(4) There should be a background of knowledge concerning the activity against which the learning progress can be tested.
(5) The activity should be one that is familiar to a substantial body of people so that the behavior of the program can be made understandable to them. The ability to have the program play against human opponents (or antagonists) adds spice to the study and, incidentally, provides a convincing demonstration for those who do not believe that machines can learn.

Having settled on the game of checkers for our learning studies, we must, of course, first program the computer to play legal checkers; that is, we must express the rules of the game in machine language and we must arrange for the mechanics of accepting an opponent's moves and of reporting the computer's moves, together with all pertinent data desired by the experimenter. The general methods for doing this were described by Shannon⁸ in 1950 as applied to chess rather than checkers. The basic program used in these experiments is quite similar to the program described by Strachey⁹ in 1952. The availability of a larger and faster machine (the IBM 704), coupled with many detailed changes in the programming procedure, leads to a fairly interesting game being played, even without any learning. The basic forms of the program will now be described.

The basic checker-playing program

The computer plays by looking ahead a few moves and by evaluating the resulting board positions much as a human player might do. Board positions are stored by sets of machine words, four words normally being used to represent any particular board position. Thirty-two bit positions (of the 36 available in an IBM 704 word) are, by convention, assigned to the 32 playing squares on the checkerboard, and pieces appearing on these squares are represented by 1's appearing in the assigned bit positions of the corresponding word. "Looking-ahead" is prepared for by computing all possible next moves, starting with a given board position. The indicated moves are explored in turn by producing new board-position records corresponding to the conditions after the move in question (the old board positions being saved to facilitate a return to the starting point) and the process can be repeated. This look-ahead procedure is carried several moves in advance, as illustrated in Fig. 1. The resulting board positions are then scored in terms of their relative value to the machine.

The standard method of scoring the resulting board positions has been in terms of a linear polynomial. A number of schemes of an abstract sort were tried for evaluating board positions without regard to the usual checker concepts, but none of these was successful.¹⁰ One way of looking at the various terms in the scoring polynomial is that those terms with numerically small coefficients should measure criteria related as intermediate goals to the criteria measured by the larger terms. The achievement of these intermediate goals indicates that the machine is going in the right direction, such that the larger terms will eventually increase. If the program could look far enough ahead we need only ask, "Is the machine still in the game?"¹¹ Since it cannot look this far ahead in the usual situation, we must substitute something else, say the piece ratio, and let the machine continue the look-ahead until one side has gained a piece advantage. But even this is not always possible, so we have the program test to see if the machine has gained a positional advantage, et cetera. Numerical measures of these various properties of the board positions are then added together (each with an appropriate coefficient which defines its relative importance) to form the evaluation polynomial.

More specifically, as defined by the rules for checkers, the dominant scoring parameter is the inability for one side or the other to move.¹² Since this can occur but once in any game, it is tested for separately and is not included in the scoring polynomial as tabulated by the computer during play. The next parameter to be considered is the relative piece advantage. It is always assumed that it is to the machine's advantage to reduce the number of the opponent's pieces as compared to its own. A reversal of the sign of this term will, in fact, cause the program to play "give-away" checkers, and with learning it can only learn to play a better and better give-away game. Were the sign of this term not known by the programmer it could, of course, be determined by tests, but it must be fixed by the experimenter and, in effect, it is one of the instructions to the machine defining its task. The numerical computation of the piece advantage has been arranged in such a way as to account for the well-known property that it is usually to one's advantage to trade pieces when one is ahead and to avoid trades when behind. Furthermore, it is assumed that kings are more valuable than pieces, the relative weights assigned to them being three to two.¹³ This ratio means that the program will trade three men for two kings, or two kings for three men, if by so doing it can obtain some positional advantage.
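As a minimal sketch of the scoring scheme described above, the following Python fragment forms a linear evaluation polynomial whose piece-advantage term weights kings and men three to two. Only that ratio comes from the text. The function names, the `mobility` term and both example coefficients are assumptions made for illustration, and the sketch omits the adjustment that favors trades when ahead.

```python
# Hedged sketch: a linear scoring polynomial with a piece-advantage term.
# Only the 3:2 king-to-man weighting comes from the text; every name and the
# example coefficients below are assumptions made for illustration.

KING_WEIGHT = 3
MAN_WEIGHT = 2


def piece_advantage(own_men, own_kings, opp_men, opp_kings):
    """Weighted material difference, positive when the machine is ahead."""
    own = MAN_WEIGHT * own_men + KING_WEIGHT * own_kings
    opp = MAN_WEIGHT * opp_men + KING_WEIGHT * opp_kings
    return own - opp


def score(terms, coefficients):
    """Evaluation polynomial: sum of coefficient * term over all named terms."""
    return sum(coefficients[name] * value for name, value in terms.items())


# Hypothetical coefficients: the large one dominates, the small one measures
# an intermediate goal.
COEFFS = {"piece_advantage": 100, "mobility": 1}

terms = {
    "piece_advantage": piece_advantage(own_men=5, own_kings=1, opp_men=4, opp_kings=1),
    "mobility": 3,
}
print(score(terms, COEFFS))          # 100 * 2 + 1 * 3 = 203

# Three men and two kings carry the same weight, so such a trade is
# material-neutral and is decided by the smaller terms.
print(piece_advantage(3, 0, 0, 2))   # 0
```

Negating the `piece_advantage` coefficient in this sketch reproduces the "give-away" behavior mentioned in the text, since the same search would then prefer positions in which the machine has fewer pieces.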


Figure 1 A "tree" of moves which might be investigated during the look-ahead procedure. The actual branchings are much more numerous than those shown, and the "tree" is apt to extend to as many as 20 levels.
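Such a tree can never be explored completely. As a rough check of the estimate quoted earlier (reading the two exponents as 40 and 21: about 10^40 move choices, examined at 3 choices per millimicrosecond, taking about 10^21 centuries), the arithmetic can be reproduced as follows. The century length used is an assumption of this sketch.

```python
# Back-of-envelope check of the exhaustive-search estimate.
# Assumption: one century = 100 years of 365.25 days each.

CHOICES = 10 ** 40                    # estimated move choices in a complete exploration
RATE_PER_SECOND = 3 * 10 ** 9         # 3 choices per millimicrosecond (10^-9 s)
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 3600

seconds = CHOICES / RATE_PER_SECOND
centuries = seconds / SECONDS_PER_CENTURY
print(f"{centuries:.2e} centuries")   # on the order of 10^21
```

The result agrees in order of magnitude with the figure given in the text, which is why the program must stop the look-ahead after a limited number of levels and score the positions reached.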
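How far each branch of the look-ahead tree is carried is governed by the rules given under "Ply limitations". A minimal sketch of those rules is shown here, assuming that the caller already knows whether the next move is a jump, whether the last move was a jump, and whether an exchange offer is possible. The ply thresholds follow the text; the function signature and the measurement of a lead in units of kings are assumptions, and the score adjustment for a pending jump at ply 20 is not modeled.

```python
# Hedged sketch of the look-ahead stopping rule. The ply thresholds (3, 4, 5,
# 11, 20) follow the text; the function signature and the way a lead is
# measured in "kings" are assumptions made for illustration.

MIN_PLY = 3    # usual minimum look-ahead for the opening game, without learning
MAX_PLY = 20   # memory space for look-ahead moves is exhausted here


def stop_lookahead(ply, next_is_jump, last_was_jump, exchange_possible,
                   lead_in_kings=0.0):
    """Return True when the board reached at this ply should be evaluated."""
    if ply >= MAX_PLY:
        return True                      # stop regardless of all conditions
    if ply < MIN_PLY:
        return False                     # always look ahead the minimum distance
    if ply == MIN_PLY:
        return not (next_is_jump or last_was_jump or exchange_possible)
    if ply == 4:
        return not (next_is_jump or exchange_possible)
    if ply >= 11 and abs(lead_in_kings) > 2:
        return True                      # one side is already decisively ahead
    return not next_is_jump              # ply of 5 or greater


print(stop_lookahead(3, False, False, False))                    # True
print(stop_lookahead(3, False, True, False))                     # False
print(stop_lookahead(4, False, True, False))                     # True
print(stop_lookahead(7, True, False, False))                     # False
print(stop_lookahead(12, True, False, False, lead_in_kings=3))   # True
```

The constants are kept at module level because the text notes that all break points are single data words which can be changed at any time.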

The choice for the parameters to follow this first term of the scoring polynomial and their coefficients then becomes a matter of concern. Two courses are open: either the experimenter can decide what these subsequent terms are to be, or he can arrange for the program to make the selection. We will discuss the first case in some detail in connection with the rote-learning studies and leave for a later section the discussion of various program methods of selecting parameters and adjusting their coefficients.

It is not satisfactory to select the initial move which leads to the board position with the highest score, since to reach this position would require the cooperation of the opponent. Instead, an analysis must be made proceeding backward from the evaluated board positions through the "tree" of possible moves, each time with consideration of the intent of the side whose move is being examined, assuming that the opponent would always attempt to minimize the machine's score while the machine acts to maximize its score. At each branch point, then, the corresponding board position is given the score of the board position which would result from the most favorable move. Carrying this "minimax" procedure back to the starting point results in the selection of a "best move." The score of the board position at the end of the most likely chain is also brought back, and for learning purposes this score is now assigned to the present board position. This process is shown in Fig. 2. The best move is executed, reported on the console lights, and tabulated by the printer.

The opponent is then permitted to make his move, which can be communicated to the machine either by means of console switches or by means of punched cards. The computer verifies the legality of the opponent's move, rejecting or accepting it, and the process is repeated. When the program can look ahead and predict a win, this fact is reported on the printer. Similarly, the program concedes when it sees that it is going to lose.

• Ply limitations

Playing-time considerations make it necessary to limit the look-ahead distance to some fairly small value. This distance is defined as the ply (a ply of 2 consisting of one proposed move by the machine and the anticipated reply by the opponent). The ply is not fixed but depends upon the dynamics of the situation, and it varies from move to move and from branch to branch during the move analysis. A great many schemes of adjusting the look-ahead distance have been tried at various times, some of them quite complicated. The most effective one, although quite detailed, is simple in concept and is as follows. The program always looks ahead a minimum distance, which for the opening game and without learning is usually set at three moves. At this minimum ply the program will evaluate the board position if none of the following conditions occurs: (1) the next move is a jump, (2) the last move was a jump, or (3) an exchange offer is possible. If any one of these conditions exists, the program continues looking ahead. At a ply of 4 the program will stop and evaluate the resulting board position if conditions (1) and (3) above are not met. At a ply of 5 or greater, the program stops the look-ahead whenever the next ply level does not offer a jump. At a ply of 11 or greater, the program will terminate the look-ahead, even if the next move is to be a jump, should one side at this time be ahead by more than two kings (to prevent the needless exploration of obviously losing or winning sequences). The program stops at a ply of 20 regardless of all conditions (since the memory space for the look-ahead moves is then exhausted) and an adjustment in score is made to allow for the pending jump. Finally, an adjustment is made in the levels of the break points between the different conditions when time is saved through rote learning (see below) and when the total number of pieces on the board falls below an arbitrary number. All break points are determined by single data words which can be changed at any time by manual intervention.

This tying of the ply with board conditions achieves three desired results. In the first place, it permits board evaluations to be made under conditions of relative stability for so-called dead positions, as defined by Turing.¹⁴ Secondly, it causes greater surveillance of those paths which offer better opportunities for gaining or losing an advantage. Finally, since branching is usually seriously restricted by a jump situation, the total number of board positions and moves to be considered is still held down to a reasonable number and is more equitably distributed between the various possible initial moves.

As a practical matter, machine-playing time usually has been limited to approximately 30 seconds per move. Elaborate table-lookup procedures, fast sorting and searching procedures, and a variety of new programming tricks were developed, and full use was made of all of the resources of the IBM 704 to increase the operating speed as much as possible. One can, of course, set the playing time at any desired value by adjustments of the permitted ply; too small a ply results in a bad game and too large a ply makes the game unduly costly in terms of machine time.

• Other modes of play

For study purposes the program was written to accommodate several variations of this basic plan. One of these permits the program to play against itself, that is, to play both sides of the game. This mode of play has been found to be especially good during the early stages of learning.

The program can also follow book games presented to it either on cards or on magnetic tape. When operating in this mode, the program decides at each point in the game on its next move in the usual way and reports this proposed move. Instead of actually making this move, the program refers to the stored record of a book game and makes the book move. The program records its evaluation of the two moves, and it also counts and reports the number of possible moves which the program


[Figure 2 diagram. Labels, top to bottom: "Machine chooses branch with largest score"; "Opponent expected to choose branch with smallest score"; "Machine chooses branch with most positive score"; "Evaluations made at this level".]

Figure 2 Simplified diagram showing how the evaluations are backed up through the "tree" of possible moves to arrive at the best next move. The evaluation process starts at (3).
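The backing-up procedure of Fig. 2 can be sketched in a few lines of Python. The tree and its leaf scores below are invented for illustration and are not the values of the figure; the nested-list representation and the function names are likewise assumptions. The sketch uses a fixed depth and leaves out the ply rules.

```python
# Hedged sketch of minimax back-up on a small tree. The tree and its leaf
# scores are invented for illustration and are not the values of Fig. 2.

def back_up(node, machine_to_move=True):
    """Return the backed-up score of `node`.

    A node is either a number (a board already scored by the evaluation
    polynomial) or a list of child nodes reached by the legal moves.
    """
    if not isinstance(node, list):
        return node
    child_scores = [back_up(child, not machine_to_move) for child in node]
    # The machine maximizes its score; the opponent is assumed to minimize it.
    return max(child_scores) if machine_to_move else min(child_scores)


def best_move(children):
    """Index and backed-up score of the machine's best initial move."""
    scores = [back_up(child, machine_to_move=False) for child in children]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Three initial moves, each answered by two replies, each reply followed by
# two machine moves whose resulting boards are scored (a ply of 3).
tree = [
    [[+3, -5], [+15, +7]],
    [[+100, -20], [-70, -10]],
    [[+4, +20], [+7, 0]],
]
print(best_move(tree))   # (2, 7)
```

The second initial move leads to the board with the highest score (+100), yet it backs up to only -10, because reaching +100 would require the opponent's cooperation. The third move is chosen, and its backed-up score of +7 is the value that would be assigned to the present board position for learning purposes.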

rates as being better than the book move and the number it rates as being poorer. The sides are then reversed and the process is repeated. At the end of a book game a correlation coefficient is computed, relating the machine's indicated moves to those moves adjudged best by the checker masters.¹⁵

It should be noted that the emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine's game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted. Even when book games are played, no weight is given to the fact that the moves as listed are presumably the best possible moves under the circumstances.

For demonstration purposes, and also as a means of avoiding lost machine time while an opponent is thinking, it is sometimes convenient to play several simultaneous games against different opponents. With the program in its present form the most convenient number for this purpose has been found to be six, although eight have been played on a number of occasions.

Games may be started with any initial configuration for the board position so that the program may be tested on end games, checker puzzles, et cetera. For nonstandard starting conditions, the program lists the initial piece arrangement. From time to time, and at the end of each game, the program also tabulates various bits of statistical information which assist in the evaluation of playing performance.

Numerous other features have also been added to make the program convenient to operate (for details see Appendix A), but these have no direct bearing on the problem of learning, to which we will now turn our attention.

Rote learning and its variants

Perhaps the most elementary type of learning worth discussing would be a form of rote learning in which the program simply saved all of the board positions encountered during play, together with their computed scores. Reference could then be made to this memory record and a certain amount of computing time might be saved. This can hardly be called a very advanced form of learning; nevertheless, if the program then utilizes the saved time to compute further in depth it will improve with time.

Fortunately, the ability to store board information at a ply of 0 and to look up boards at a larger ply provides the possibility of looking much farther in advance than might otherwise be possible. To understand this, consider a very simple case where the look-ahead is always terminated at a fixed ply, say 3. Assume further that the program saves only the board positions encountered during the actual play with their associated backed-up

scores. Now it is this list of previous board positions that is used to look up board positions while at a ply level of 3 in the subsequent games. If a board position is found, its score has, in effect, already been backed up by three levels, and if it becomes effective in determining the move to be made, it is a 6-ply score rather than a simple 3-ply score. This new initial board position with its 6-ply score is, in turn, saved and it may be encountered in a future game and the score backed up by an additional set of three levels, et cetera. This procedure is illustrated in Fig. 3. The incorporation of this variation, together with the simpler rote-learning feature, results in a fairly powerful learning technique which has been studied in some detail.

Several additional features had to be incorporated into the program before it was practical to embark on learning studies using this storage scheme. In the first place, it was necessary to impart a sense of direction to the program in order to force it to press on toward a win. To illustrate this, consider the situation of two kings against one king, which is a winning combination for practically all variations in board positions. In time, the program can be assumed to have stored all of these variations, each associated with a winning score. Now, if such a situation is encountered, the program will look ahead along all possible paths and each path will lead to a winning combination, in spite of the fact that only one of the possible initial moves may be along the direct path toward the win while all of the rest may be wasting time. How is the program to differentiate between these?

A good solution is to keep a record of the ply value of the different board positions at all times and to make a further choice between board positions on this basis. If ahead, the program can be arranged to push directly toward the win while, if behind, it can be arranged to adopt delaying tactics. The most recent method used is to carry the effective ply along with the score by simply decreasing the magnitude of the score a small amount each time it is backed up a ply level during the analyses. If the program is now faced with a choice of board positions whose scores differ only by the ply number, it will automatically make the most advantageous choice, choosing a low-ply alternative if winning and a high-ply alternative if losing. The significance of this concept of a direction sense should not be overlooked. Even without "learning," it is very important. Several of the early attempts at learning failed because the direction sense was not properly taken into account.

• Cataloging and culling stored information

Since practical considerations limit the number of board positions which can be saved, and since the time to search through those that are saved can easily become unduly long, one must devise systems (1) to catalog boards that are saved, (2) to delete redundancies, and (3) to discard board positions which are not believed to be of much value. The most effective cataloging system found to date starts by standardizing all board positions, first by reversing the pieces and piece positions if it is a board position in which White is to move, so that all boards are reported as if it were Black's turn to move. This reduces by nearly a factor of two the number of boards which must be saved. Board positions in which all of the pieces are kings can be reflected about the diagonals with a possible fourfold reduction in the number which must be saved. A more compact board representation than the one employed during play is also used so as to minimize the storage requirements.

After the board positions are standardized, they are grouped into records on the basis of (1) the number of pieces on the board, (2) the presence or absence of a piece advantage, (3) the side possessing this advantage, (4) the presence or absence of kings on the board, (5) the side having the so-called "move," or opposition advantage, and finally (6) the first moments of the pieces about normal and diagonal axes through the board. During play, newly acquired board positions are saved in the memory until a reasonable number have been accumulated, and they are then merged with those on the "memory tape" and a new memory tape is produced. Board positions within a record are listed in a serial fashion, being sorted with respect to the words which define them. The records are arranged on the tape in the order that they are most likely to be needed during the course of a game; board positions with 12 pieces to a side coming first, et cetera. This method of cataloging is very important because it cuts tape-searching time to a minimum.

Reference must be made, of course, to the board positions already saved, and this is done by reading the correct record into the memory and searching through it by a dichotomous search procedure. Usually five or more records are held in memory at one time, the exact number at any time depending upon the lengths of the particular records in question. Normally, the program calls three or four new records into memory during each new move, making room for them as needed, by discarding the records which have been held the longest.

Two different procedures have been found to be of value in limiting the number of board positions that are saved; one based on the frequency of use, and the second on the ply. To keep track of the frequency of use, an age term is carried along with the score. Each new board position to be saved is arbitrarily assigned an age. When reference is made to a stored board position, either to update its score or to utilize it in the look-ahead procedure, the age recorded for this board position is divided by two. This is called refreshing. Offsetting this, each board position is automatically aged by one unit at the memory merge times (normally occurring about once every 20 moves). When the age of any one board position reaches an arbitrary maximum value this board position is expunged from the record. This is a form of forgetting. New board positions which remain unused are soon forgotten, while board positions which are used several times in succession will be refreshed to such an extent that they will be remembered even if not used thereafter for a fairly long period of time. This form of refreshing and forgetting was adopted on the basis of



[Figure 3 diagram. Labels: "Ply number"; "Evaluations would normally be made at this level"; "Previous evaluation level".]

Figure 3 Simplified representation of the rote-learning process, in which information saved from a previous game is used to increase the effective ply of the backed-up score.
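The refreshing and forgetting scheme applied to the stored positions can be sketched as follows. Halving the age on each reference, aging by one unit at each merge, and expunging at a maximum age follow the text. The initial age, the maximum age, the class and its method names are assumptions, since the text says only that these values are arbitrary.

```python
# Hedged sketch of "refreshing" and "forgetting". Halving on reference, aging
# by one unit per merge and expunging at a maximum follow the text; the
# initial age, the maximum age and all names are assumptions.

INITIAL_AGE = 4    # assumed: the text says only that it is arbitrary
MAX_AGE = 8        # assumed arbitrary maximum


class Memory:
    def __init__(self):
        self.records = {}            # board key -> [score, age]

    def save(self, board, score):
        self.records[board] = [score, INITIAL_AGE]

    def look_up(self, board):
        """Return the stored score, refreshing the record, or None if absent."""
        record = self.records.get(board)
        if record is None:
            return None
        record[1] /= 2               # refreshing: the age is divided by two
        return record[0]

    def merge(self):
        """Run at memory-merge time: age every record, expunge the old ones."""
        for record in self.records.values():
            record[1] += 1
        self.records = {b: r for b, r in self.records.items() if r[1] < MAX_AGE}


memory = Memory()
memory.save("used", 12)
memory.save("unused", -3)
for _ in range(4):
    memory.look_up("used")           # referenced once between merges
    memory.merge()
print(sorted(memory.records))        # ['used']
```

With these assumed constants the unreferenced position reaches the maximum age after four merges and is expunged, while the position referenced between merges stays young and is kept.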

reflections as to the frailty of human memories. It has proven to be very effective.

In addition to the limitations imposed by forgetting, it seemed desirable to place a restriction on the maximum size of any one record. Whenever an arbitrary limit is reached, enough of the lowest-ply board positions are automatically culled from the record to bring the size well below the maximum.

Before embarking on a study of the learning capabilities of the system as just described, it was, of course, first necessary to fix the terms and coefficients in the evaluation polynomial. To do this, a number of different sets of values were tested by playing through a series of book games and computing the move correlation coefficients. These values varied from 0.2 for the poorest polynomial tested, to approximately 0.6 for the one finally adopted. The selected polynomial contained four terms (as contrasted with the use of 16 terms in later experiments). In decreasing order of importance these were: (1) piece advantage, (2) denial of occupancy, (3) mobility, and (4) a hybrid term which combined control of the center and piece advancement.

• Rote-learning tests

After a scoring polynomial was arbitrarily picked, a series of games was played, both self-play and play against many different individuals (several of these being checker masters). Many book games were also followed, some

of these being end games. The program learned to play define the task is computed separately and, of course, is
a very good opening game and to recognize most win- not altered by the program.
ning and losing end positions many moves in advance, After a number of relatively unsuccessful attempts to
although its midgame play was not greatly improved. have the program generalize while playing both sides of
This program now qualifies as a rather better-than- the game, the program was arranged to act as two dif-
average novice, but definitely not as an expert. ferent players, for convenience called Alpha and Beta.
At the present time the memory tape contains some- Alpha generalizes on its experience after each move by
thing over 53,000 board positions (averaging 3.8 words adjusting the coefficients in its evaluation polynomial and
each) which have been selected from a much larger by replacing terms which appear to be unimportant by
number of positions by means of the culling techniques new parameters drawn from a reserve list. Beta, on the
described. While this is still far from the number which contrary, uses the same evaluation polynomial for the
would tax the listing and searching procedures used in duration of any one game. Program Alpha is used to
the program, rough estimates, based on the frequency play against human opponents, and during self-play
with which the saved boards are utilized during normal Alpha and Beta play each other.
play (these figures being tabulated automatically), indi- At the end of each self-play game a determination is
cate that a library tape containing at least 20 times the made of the relative playing ability of Alpha, as com-
present number of board positions would be needed to pared with Beta, by a neutral portion of the program. If
improve the midgame play significantly. At the present Alpha wins (or is adjudged to be ahead when a game is
rate of acquisition of new positions this would require otherwise terminated) the then current scoring system
an inordinate amount of play and, consequently, of used by Alpha is given to Beta. If, on the other hand,
machine time.¹⁶ Beta wins or is ahead, this fact is recorded as a black
The general conclusions which can be drawn from mark for Alpha. Whenever Alpha receives an arbitrary
these tests are that: number of black marks (usually set at three) it is as-
(1) An effective rote-learning technique must include sumed to be on the wrong track, and a fairly drastic and
a procedure to give the program a sense of direction, arbitrary change is made in its scoring polynomial (by
and it must contain a refined system for cataloging and reducing the coefficient of the leading term to zero).
storing information. This action is necessary on occasion, since the entire
(2) Rote-learning procedures can be used effectively learning process is an attempt to find the highest point
on machines with the data-handling capacity of the in multidimensional scoring space in the presence of
IBM 704 if the information which must be saved and many secondary maxima on which the program can
searched does not occupy more than, roughly, one mil- become trapped. By manual intervention it is possible to
lion words, and if not more than one hundred or so ref- return to some previous condition or make some other
erences need to be made to this information per minute. change if it becomes apparent that the learning process
These figures are, of course, highly dependent upon the is not functioning properly. In general, however, the
exact efficiency of cataloging which can be achieved. program seeks to extricate itself from traps and to im-
(3) The game of checkers, when played with a simple prove more or less continuously.
scoring scheme and with rote learning only, requires The capability of the program can be tested at any
more than this number of words for master caliber of time by having Alpha play one or more book games
play and, as a consequence, is not completely amenable (with the learning procedure temporarily immobilized)
to this treatment on the IBM 704. and by correlating its play with the recommendations of
(4) A game, such as checkers, is a suitable vehicle for the masters or, more interestingly, by pitting it against
use during the development of learning techniques, and a human player.
it is a very satisfactory device for demonstrating ma-
chine-learning procedures to the unbelieving. • Polynomial modification procedure
If Alpha is to make changes in its scoring polynomial,
Learning procedure involving generalizations it must be given some trustworthy criteria for measuring
performance. A logical difficulty presents itself, since
An obvious way to decrease the amount of storage the only measuring parameter available is this same
needed to utilize past experience is to gen~ralize on the scoring polynomial that the process is desi!!ned to im-
hasis of cxperience and to save nnly the generalizatil1ns. prove. Recourse is had to the peculiar pro;erty of the
This should. of course. be a continulllis process if it is to look-ahead procedure. which makes it less important for
be truly e1fective. and it slllluld in\'l1lve se\'cral levels nf the scoring polynomial to be particularlv good the
abstractillll. A start has been made in this direction by further ahead the process is continued. Thi~ means that
having the program select a subsL,t llf possible terms for one can evaluate the relative change in the positions of
use in the evaluation polyllllJllial and by having the pro- two players, when this evaluation is made over a fairly
gram determine the sign and ma~nitllde llf the cnl'lli- large number of moves, by using a scoring system which
cients which multiply these paran1L'ters. i\t the present is much too gross to be significant on a move-Iw-move
time this subset consists of /(, tL'nn, c1lllsen frllm a list basis. -
of JX parameters. The piece-advantage term needed to Perhaps an even better way of looking at the matter

is that we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board position of the chain of moves which most probably will occur during actual play. Of course, if one could develop a perfect system of this sort it would be the equivalent of always looking ahead to the end of the game. The nearer this ideal is approached, the better would be the play.¹⁸

In order to obtain a sufficiently large span to make use of this characteristic, Alpha keeps a record of the apparent goodness of its board positions as the game progresses. This record is kept by computing the scoring polynomial for each board position encountered in actual play and by saving this polynomial in its entirety. At the same time, Alpha also computes the backed-up score for all board positions, using the look-ahead procedure described earlier. At each play by Alpha the initial board score, as saved from the previous Alpha move, is compared with the backed-up score for the current position. The difference between these scores, defined as delta, is used to check the scoring polynomial. If delta is positive it is reasonable to assume that the initial board evaluation was in error and terms which contributed positively should have been given more weight, while those that contributed negatively should have been given less weight. A converse statement can be made for the case where delta is negative. Presumably, in this case, either the initial board evaluation was incorrect, or a wrong choice of moves was made, and greater weight should have been given to terms making negative contributions, with less weight to positive terms. These changes are not made directly but are brought about in an involved way which will now be described.

A record is kept of the correlation existing between the signs of the individual term contributions in the initial scoring polynomial and the sign of delta. After each play an adjustment is made in the values of the correlation coefficients, due account being taken of the number of times that each particular term has been used and has had a nonzero value. The coefficient for the polynomial term (other than the piece-advantage term) with the then largest correlation coefficient is set at a prescribed maximum value with proportionate values determined for all of the remaining coefficients. Actually, the term coefficients are fixed at integral powers of 2, this power being defined by the ratio of the correlation coefficients. More precisely, if the ratio of two correlation coefficients is equal to or larger than n but less than n + 1, where n is an integer, then the ratio of the two term coefficients is set equal to 2^n. This procedure was adopted in order to increase the range in values of the term coefficients. Whenever a correlation-coefficient calculation leads to a negative sign, a corresponding reversal is made in the sign associated with the term itself.

• Instabilities

It should be noted that the span of moves over which delta is computed consists of a remembered part and an anticipated portion. During the remembered play, use had been made of Alpha's current scoring polynomial to determine Alpha's moves but not to determine the opponent's moves, while during the anticipation play the moves for both sides are made using Alpha's scoring polynomial. One is tempted to increase the sensitivity of delta as an indicator of change by increasing the span of the remembered portion. This has been found to be dangerous since the coefficients in the evaluation polynomial and, indeed, the terms themselves, may change between the time of the remembered evaluation and the time at which the anticipation evaluation is made. As a matter of fact, this difficulty is present even for a span of one move-pair. It is necessary to recompute the scoring polynomial for a given initial board position after a move has been determined and after the indicated corrections in the scoring polynomial have been made, and to save this score for future comparisons, rather than to save the score used to determine the move. This may seem a trivial point, but its neglect in the initial stages of these experiments led to oscillations quite analogous to the instability induced in electrical circuits by long delays in a feedback loop.

As a means of stabilizing against minor variations in the delta values, an arbitrary minimum value was set, and when delta fell below this minimum for any particular move no change was made in the polynomial. This same minimum value is used to set limits for the initial board evaluation score to decide whether or not it will be assumed to be zero. This minimum is recomputed each time and, normally, has been fixed at the average value of the coefficients for the terms in the currently existing evaluation polynomial.

Still another type of instability can occur whenever a new term is introduced into the scoring polynomial. Obviously, after only a single move the correlation coefficient of this new term will have a magnitude of 1, even though it might go to 0 after the very next move. To prevent violent fluctuations due to this cause, the correlation coefficients for newly introduced terms are computed as if these terms had already been used several times and had been found to have a zero correlation coefficient. This is done by replacing the times-used number in the calculation by an arbitrary number (usually set at 16) until the usage does, in fact, equal this number.

After a term has been in use for some time, quite the opposite action is desired so that the more recent experience can outweigh earlier results. This is achieved, together with a substantial reduction in calculation time, by using powers of 2 in place of the actual times-used and by limiting the maximum power that is used. To be specific, at any stage of play defined as the Nth move, corrections to the values of the correlation coefficients C_N are made using 16 for N until N equals 32, whereupon 32 is used until N equals 64, et cetera, using the formula:

    C_N = C_{N-1} - (C_{N-1} ± 1)/N

and a value for N larger than 256 is never used.

After a minimum was set for delta it seemed reasonable to attach greater weight to situations leading to large values of delta. Accordingly, two additional categories are defined. If a contribution to delta is made by the first term, meaning that a change has occurred in the piece ratio, the indicated changes in the correlation coefficients are doubled, while if the value of delta is so large as to indicate that an almost sure win or lose will result, the effect on the correlation coefficients is quadrupled.

• Term replacement

Mention has been made several times of the procedure for replacing terms in the scoring polynomial. The program, as it is currently running, contains 38 different terms (in addition to the piece-advantage term), 16 of these being included in the scoring polynomial at any one time and the remaining 22 being kept in reserve. After each move a low-term tally is recorded against that active term which has the lowest correlation coefficient and, at the same time, a test is made to see if this brings its tally count up to some arbitrary limit, usually set at 8. When this limit is reached for any specific term, this term is transferred to the bottom of the reserve list, and it is replaced by a term from the head of the reserve list. This new term enters the polynomial with zero values for its correlation coefficient, times used, and low-tally count. On the average, then, an active term is replaced once each eight moves and the replaced terms are given another chance after 176 moves. As a check on the effectiveness of this procedure, the program reports on the usage which has accrued against each discarded term. Terms which are repeatedly rejected after a minimum amount of usage can be removed and replaced with completely new terms.

It might be argued that this procedure of having the program select terms for the evaluation polynomial from a supplied list is much too simple and that the program should generate the terms for itself. Unfortunately, no satisfactory scheme for doing this has yet been devised. With a man-generated list one might at least ask that the terms be members of an orthogonal set, assuming that this has some meaning as applied to the evaluation of a checker position. Apparently, no one knows enough about checkers to define such a set. The only practical solution seems to be that of including a relatively large number of possible terms in the hope that all of the contributing parameters get covered somehow, even though in an involved and redundant way. This is not an undesirable state of affairs, however, since it simulates the situation which is likely to exist when an attempt is made to apply similar learning techniques to real-life situations.

Many of the terms in the existing list are related in some vague way to the parameters used by checker experts. Some of the concepts which checker experts appear to use have eluded the writer's attempts at definition, and he has been unable to program them. Some of the terms are quite unrelated to the usual checker lore and have been discovered more or less by accident. The second moment about the diagonal axis through the double corners is an example. Twenty-seven different simple terms are now in use, the rest being combinational terms, as will be described later.

A word might be said about these terms with respect to the exact way in which they are defined and the general procedures used for their evaluation. Each term relates to the relative standings of the two sides, with respect to the parameter in question, and it is numerically equal to the difference between the ratings for the individual sides. A reversal of the sign obviously corresponds to a change of sides. As a further means of insuring symmetry the individual ratings of the respective sides are determined at corresponding times in the play as viewed by the side in question. For example, consider a parameter which relates to the board conditions as left after one side has moved. The rating of Black for such a parameter would be made after Black had moved, and the rating of White would not be made until after White had moved. During anticipation play, these individual ratings are made after each move and saved for future reference. When an evaluation is desired the program takes the differences between the most recent ratings and those made a move earlier. In general, an attempt has been made to define all parameters so that the individual-side ratings are expressible as small positive integers.

• Binary connective terms

In addition to the simple terms of the type just described, a number of combinational terms have been introduced. Without these terms the scoring polynomial would, of course, be linear. A number of different ways of introducing nonlinear terms have been devised but only one of these has been tested in any detail. This scheme provides terms which have some of the properties of binary logical connectives. Four such terms are formed for each pair of simple terms which are to be related. This is done by making an arbitrary division of the range in values for each of the simple terms and assigning the binary values of 0 and 1 to these ranges. Since most of the simple terms are symmetrical about 0, this is easily done on a sign basis. The new terms are then of the form A·B, A·B̄, Ā·B, and Ā·B̄ (the overbar denoting the complement), yielding values either of 0 or 1. These terms are introduced into the scoring polynomial with adjustable coefficients and signs, and are thereafter indistinguishable from the other terms.

As it would require some 1404 such combinational terms to interrelate the 27 simple terms originally used, it was found desirable to limit the actual number of combinational terms used at any one time to a small fraction of these and to introduce new terms only as it became possible to retire older ineffectual terms. The terms actually used are given in Appendix C.

• Preliminary learning-by-generalization tests

An idea of the learning ability of this procedure can be gained by analyzing an initial test series of 28 games¹⁹ played with the program just described. At the start an arbitrary selection of 16 terms was chosen and all terms


were assigned equal weights. During the first 14 games Alpha was assigned the White side, with Beta constrained as to its first move (two cycles of the seven different initial moves). Thereafter, Alpha was assigned Black and White alternately. During this time a total of 29 different terms was discarded and replaced, the majority of these on two different occasions.

Certain other figures obtained during these 28 games are of interest. At frequent intervals the program lists the 12 leading terms in Alpha's scoring polynomial with their correlation coefficients and a running count of the number of times these coefficients have been altered. Based on these samplings, one observes that at least 20 different terms were assigned the largest coefficient at some time or other, some of these alternating with other terms a number of times, and two even reappearing at the top of the list with their signs reversed. While these variations were more violent at the start of the series of games and decreased as time went on, their presence indicated that the learning procedure was still not completely stable. During the first seven games there were at least 14 changes in occupancy at the top of the list involving 10 different terms. Alpha won three of these games and lost four. The quality of the play was extremely poor. During the next seven games there were at least eight changes made in the top listing involving five different terms. Alpha lost the first of these games and won the next six. Quality of play improved steadily but the machine still played rather badly. During Games 15 through 21 there were eight changes in the top listing involving five terms; Alpha winning five games and losing two. Some fairly good amateur players who played the machine during this period agreed that it was "tricky but beatable". During Games 22 through 28 there were at least four changes involving three terms. Alpha won two games and lost five. The program appeared to be approaching a quality of play which caused it to be described as "a better-than-average player". A detailed analysis of these results indicated that the learning procedure did work and that the rate of learning was surprisingly high, but that the learning was quite erratic and none too stable.

• Second series of tests

Some of the more obvious reasons for this erratic behavior in the first series of tests have been identified. The program was modified in several respects to improve the situation, and additional tests were made. Four of these modifications are important enough to justify a detailed explanation.

In the first place, the program was frequently fooled by bad play on the part of its opponent. A simple solution was to change the correlation coefficients less drastically when delta was positive than when delta was negative. The procedure finally adopted for the positive delta case was to make corrections to selected terms in the polynomial only. When the scoring polynomial was positive, changes were made to coefficients associated with the negatively contributing terms, and when the polynomial was negative, changes were made to the coefficients associated with positively contributing terms. No changes were made to coefficients associated with terms which happened to be zero. For the negative delta case, changes were made to the coefficients of all contributing terms, just as before.

A second defect seemed to be connected with the too frequent introduction of new terms into the scoring polynomial and the tendency for these new terms to assume dominant positions on the basis of insufficient evidence. This was remedied by the simple expedient of decreasing the rate of introduction of new terms from one every eight moves to one every 32 moves.

The third defect had to do with the complete exclusion from consideration of many of the board positions encountered during play by reason of the minimum limit on delta. This resulted in the misassignment of credit to those board positions which permitted spectacular moves when the credit rightfully belonged to earlier board positions which had permitted the necessary groundlaying moves. Although no precise way has yet been devised to insure the correct assignment of credit, a very simple expedient was found to be most effective in minimizing the adverse effects of earlier assignments. This expedient was to allow the span of remembered moves, over which delta is computed, to increase until delta exceeded the arbitrary minimum value, and then to apply the corrections to the coefficients as dictated by the terms in the retained polynomial for this earlier board position. In this case, the difficulty which was mentioned in the section on Instabilities in connection with an arbitrary increase in span does not occur after each correction, since no changes are made in the coefficients of the scoring polynomial as long as delta is below the minimum value. Of course, whenever delta does exceed the minimum value the program must then recompute the initial scoring polynomial for the then current board position and so restart the procedure with a span of a single remembered move-pair. This over-all procedure rectifies the defect of assigning credit to a board position that lies too far along the move chain, but it introduces the possibility of assigning credit to a board position that is not far enough along.

As a partial expedient to compensate for this newly introduced danger, a change was made in the initial board evaluation. Instead of evaluating the initial board positions directly, as was done before, a standard but rudimentary tree-search (terminated after the first non-jump move) was used. Errors due to impending jump situations were eliminated by this procedure, and because of the greater accuracy of the evaluation it was possible to reduce the minimum delta limit by a small amount.

Finally, to avoid the danger of having Beta adopt Alpha's polynomial as a result of a chance win on Alpha's part (or perhaps a situation in which Alpha had allowed its polynomial to degenerate after an early or midgame advantage had been gained), it was decided

to require a majority of wins on Alpha's part before Beta would adopt Alpha's scoring polynomial.

With these modifications, a new series of tests was made. In order to reduce the learning time, the initial selection of terms was made on the basis of the results obtained during the earlier tests, but no attention was paid to their previously assigned weights. In contrast with the earlier erratic behavior, the revised program appeared to be extremely stable, perhaps at the expense of a somewhat lower initial learning rate. The way in which the character of the evaluation polynomial altered as learning progressed is shown in Fig. 4.

The most obvious change in behavior was in regard to the relative number of games won by Alpha and the prevalence of draws. During the first 28 games of the earlier series Alpha won 16 and lost 12. The corresponding figures for the first 28 games of the new series were 18 won by Alpha, and four lost, with six draws. In all cases the games were terminated, if not finished, in 70 moves and a judgment made in terms of the final positions. Unfortunately, these figures are not strictly comparable because of the decreased frequency with which Beta adopted Alpha's polynomial during the second series, both by design and because a programming error immobilized the adoption procedure during part of the tests. Nevertheless, the great decrease in the number of losses and the prevalence of draws seemed to indicate that the learning process was much more stable. Some typical games from this second series are given in Appendix B.

As learning proceeds, it should become harder and harder for Alpha to improve its game, and one would expect the number of wins by Alpha to decrease with time. If secondary maxima in scoring space are encountered, one might even find situations in which Alpha wins less than half of the games. With Beta at such a maximum any minor change in Alpha's polynomial would result in a degradation of its play, and several oscillations about the maximum might occur before Alpha landed at a point which would enable it to beat Beta. Some evidence of this trend is discernible in the play, although many more games will have to be played before it can be observed with certainty.

The tentative conclusions which can be drawn from these tests are:
(1) A simple generalization scheme of the type here used can be an effective learning device for problems amenable to tree-searching procedures.
(2) The memory requirements of such schemes are quite modest and remain fixed with time.
(3) The operating times are also reasonable and remain fixed, independent of the amount of accumulated learning.
(4) Incipient forms of instability in the solution can be expected but, at least for the checker program, these can be dealt with by quite straightforward procedures.
(5) Even with the incomplete and redundant set of parameters which have been used to date, it is possible for the computer to learn to play a better-than-average game of checkers in a relatively short period of time.

As a final precautionary note, it should be stated that these experiments have not encompassed a sufficiently large series of games to demonstrate unambiguously that the learning procedure is completely stable or that it will necessarily lead to the best possible choice of parameters and coefficients.

Rote learning vs. generalization

Some interesting comparisons can be made between the playing style developed by the learning-by-generalization program and that developed by the earlier rote-learning procedure. The program with rote learning soon learned to imitate master play during the opening moves. It was always quite poor during the middle game, but it easily learned how to avoid most of the obvious traps during end-game play and could usually drive on toward a win when left with a piece advantage. The program with the generalization procedure has never learned to play in a conventional manner and its openings are apt to be weak. On the other hand, it soon learned to play a good middle game, and with a piece advantage it usually polishes off its opponent in short order. Interestingly enough, after 28 games it had still not learned how to win an end game with two kings against one in a double corner.

Apparently, rote learning is of the greatest help, either under conditions when the results of any specific action are long delayed, or in those situations where highly specialized techniques are required. Contrasting with this, the generalization procedure is most helpful in situations in which the available permutations of conditions are large in number and when the consequences of any specific action are not long delayed.

• Procedures involving both forms of learning

The next obvious step is to combine the better features of the rote-learning procedure with a generalization scheme. This must be done with some care, since it is not practical to update the previously saved information after every change in the evaluation polynomial. A compromise solution might be to save only a very limited amount of information during the early stages of learning and to increase the amount as warranted by the increasing stability of the evaluation coefficient with learning. For example, the program could be arranged to save only the piece-advantage term at the start. At some stage in the learning process the next term could be added, perhaps when no change had been made in the parameter used for this term during some fairly long period, say for three complete games. If and when the program is able to play an additional period without changes in the next parameter, this could also be added, et cetera. Whenever a change does occur in a parameter previously assumed to be stable the entire memory tape could be reviewed, all terms involving the changed parameter and those lower on the list could be expunged, and the program could drop back to the earlier condition with respect to its term-saving schedule.
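
The staged term-saving schedule proposed in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the program used in these studies (no such program is described as having been written): the class name, the method name, the term names, and the decision to treat the piece-advantage term as always saved are all assumptions made for the example. The three-game stability period is the figure suggested in the text.

```python
from dataclasses import dataclass


@dataclass
class TermSavingSchedule:
    """Decides how many leading polynomial terms are stored with each saved board.

    All names are hypothetical; the source describes the idea in prose only.
    """

    term_names: list               # ordered by priority, piece-advantage term first
    stable_games_needed: int = 3   # "say for three complete games"
    saved_count: int = 1           # only the piece-advantage term is saved at the start
    candidate_streak: int = 0      # games the next candidate term has gone unchanged

    def end_of_game(self, changed_terms):
        """Update the schedule after one game.

        changed_terms: set of term names whose parameter changed during the game.
        Returns the term names whose saved values should be expunged from memory.
        """
        # Assumption: index 0 (piece advantage) is always kept, so start at 1.
        for index in range(1, self.saved_count):
            if self.term_names[index] in changed_terms:
                # A supposedly stable term changed: expunge it and every saved
                # term lower on the list, then drop back to the earlier stage.
                expunged = self.term_names[index:self.saved_count]
                self.saved_count = index
                self.candidate_streak = 0
                return expunged

        # Otherwise see whether the next term on the list has earned promotion.
        if self.saved_count < len(self.term_names):
            candidate = self.term_names[self.saved_count]
            if candidate in changed_terms:
                self.candidate_streak = 0
            else:
                self.candidate_streak += 1
            if self.candidate_streak >= self.stable_games_needed:
                self.saved_count += 1
                self.candidate_streak = 0
        return []
```

Calling `end_of_game` once per game with the set of terms that changed advances the schedule by at most one term at a time, which mirrors the "additional period" required before each further term is added.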

[Fig. 4. Character of the evaluation polynomial as learning progressed (figure not reproduced).]

Another solution would be to utilize the generalization scheme alone until it had become fairly stable and to introduce rote learning at this time. It is, of course, perfectly feasible to salvage much of the learning which has been accumulated by both of the programs studied to date. This could be done by appending an abridged form of the present memory tape to the generalization scheme in its present stage of learning and by proceeding from there in accordance with the first solution proposed above.

• Future development

While it is believed that these tests have reached the stage of diminishing returns, some effort might well be expended in an attempt to get the program to generate its own parameters for the evaluation polynomial. Lacking a perfectly general procedure, it might still be possible to generate terms based on theories as proposed by students of the game. This procedure would be at variance with the writer's previous philosophy, but it is highly likely that similar compromises will have to be made when one attempts to apply learning procedures to problems of economic importance.

Conclusions

As a result of these experiments one can say with some certainty that it is now possible to devise learning schemes which will greatly outperform an average person and that such learning schemes may eventually be economically feasible as applied to real-life problems.

Acknowledgments

Many different people have contributed to these studies through stimulating discussions of the basic problems. From time to time the writer was assisted by several different programmers, although most of the detailed work was his own. The forbearance of the machine room operators and their willingness to play the machine at all hours of the day and night are also greatly appreciated.

Footnotes and References


1. Some of these are quite profound and have a bearing on the questions raised by Nelson Goodman in Fact, Fiction and Forecast, Harvard University Press, 1954.
2. Warren S. McCulloch ("The Brain as a Computing Machine," Elec. Eng. 69, 492, 1949) has compared the digital computer to the nervous system of a flatworm. To extend this comparison to the situation under discussion would be unfair to the worm since its nervous system is actually quite highly organized as compared with the random-net studies by B. G. Farley and W. A. Clark ("Simulation of Self-Organizing Systems by Digital Computers," IRE PGIT 4, 76, Sept. 1954), N. Rochester, J. H. Holland, L. H. Haibt and W. L. Duda ("Tests on a Cell Assembly Theory of the Action of the Brain Using a Large Digital Computer," IRE Transactions on Information Theory IT-2, No. 3, 80, Sept. 1956), and by F. Rosenblatt ("The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psych. Rev. 65, No. 6, November 1958).
3. The first operating checker program for the IBM 701 was written in 1952. This was recoded for the IBM 704 in 1954. The first program with learning was completed in 1955 and demonstrated on television on February 24, 1956.
4. C. E. Shannon, "Programming a Computer for Playing Chess," Phil. Mag. 41, 256 (March 1950).
5. A. Bernstein and M. deV. Roberts, "Computer vs. Chess-Player," Scient. Amer. 198, 6 (June 1958).
6. J. Kister, P. Stein, S. Ulam, W. Walden, M. Wells, "Experiments in Chess," Journal of the ACM 4, 174 (April 1957).
7. A. Newell, J. C. Shaw and H. A. Simon, "Chess-Playing Programs and the Problem of Complexity," IBM J. of Res. & Devel. 2, 320 (October 1958).
8. Shannon, loc. cit.
9. C. S. Strachey, "Logical or Non-Mathematical Programmes," Proc. of ACM Meeting at Toronto, Ontario, pp. 46-49, Sept. 8-10, 1952.
10. One of the more interesting of these was to express a board position in terms of the first and higher moments of the white and black pieces separately about two orthogonal axes on the board. Two such sets of axes were tried, one set being parallel to the sides of the board and the second set being those through the diagonals.
11. This apt phraseology was suggested by John McCarthy.
12. Not the capture of all of the opponent's pieces, as popularly assumed, although nearly all games end in this fashion.
13. The use of a weight ratio rather than this, conforming more closely to the values assumed by many players, can lead into certain logical complications, as found by Strachey, loc. cit.
14. The only departure from complete generality of the game as programmed is that the program requires the opponent to make a permissible move, including the taking of a capture if one is offered. "Huffing" is not permitted.
15. B. V. Bowden, Faster Than Thought, Chapter 25, Pitman, 1953.
16. This coefficient is defined as C=(L-H)/(L+H), where L is the total number of different legal moves which the machine judged to be poorer than the indicated book moves, and H is the total number which it judged to be better than the book moves.
17. This playing-time requirement, while large in terms of cost, would be less than the time which the checker master probably spends to acquire his proficiency.
18. There is a logical fallacy in this argument. The program might save only invariant terms which have nothing to do with goodness of play; for example, it might count the squares on the checkerboard. The forced inclusion of the piece-advantage term prevents this.
19. Each game averaged 68 moves (34 to a side), of which approximately 20 caused changes to be made in the scoring polynomial.
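
The coefficient defined in footnote 16, C=(L-H)/(L+H), can be illustrated with a short Python sketch. The function name and the handling of the case L+H=0 are assumptions made for illustration; the paper gives only the formula.

```python
def book_move_coefficient(poorer: int, better: int) -> float:
    """Return C = (L - H) / (L + H).

    poorer (L): number of legal moves the machine judged poorer than the book move.
    better (H): number of legal moves the machine judged better than the book move.
    """
    total = poorer + better
    if total == 0:
        # Hypothetical choice: the paper does not say what happens with no alternatives.
        raise ValueError("L + H must be positive")
    return (poorer - better) / total


# Example: 30 alternatives judged poorer and 10 judged better than the book move.
example = book_move_coefficient(30, 10)
```

C is +1 when the machine rates every alternative below the book move, -1 when it rates every alternative above it, and 0 when the two counts are equal.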

Appendix A: Programming details

• Approximate size of program

Basic checker-playing routine                        1100 instructions
Input, move verification and output                  1400 instructions
Game starting and terminating routines                600 instructions
Loaders, table generators, dumping, et cetera         850 instructions
Statistical and analytical routines                   700 instructions
Rote-learning routines                               1500 instructions
Generalization-learning routines                      650 instructions
Tables and constants for basic play                   700 words
Working space for basic play                         2000 words
Working space for generalization learning             500 words
Working space for rote learning                      balance of memory

• Approximate computation times

To find all available moves from given board position               2.6 milliseconds
To make a single move and find resulting board position             1.5 milliseconds
To evaluate a board position (4 terms)                              2.4 milliseconds
To find score for a saved board position (rote learning)            2.3 milliseconds
To evaluate position (with 16 terms for generalization learning)    7.5 milliseconds

• Board representations
The standard checkerboard numbering system (see Appendix B) is used in communicating with the machine. A modified
numbering system is used for internal computations, the numbers shown on the squares in Fig. A-1 corresponding
to the bit positions in an IBM 704 word. Any given board position is represented by four such words; one word (FA)
containing 1's in those bit positions corresponding to squares containing pieces of the color whose turn it is to move
and which normally move in a forward direction. To be specific, if it is Black's turn to move (i.e., if Black is "active"),
FA designates the location of all of Black's pieces, both men and kings. Conversely, if White is active, FA designates
the location of White's kings only, since White's men can only move in the direction arbitrarily called backward.
The other words designate, respectively: BA, backward active pieces; FP, forward passive pieces; and BP, backward
passive pieces.
To conserve space when writing on tape, three words are used to record board positions with kings, and only two
words are used for board positions without kings. These are saved in a standardized form, as explained in the text.
Possible moves are designated by five words; one word to indicate by its sign (with the word itself containing other
information) whether the moves are jumps or not. (If a jump is available, only jump moves are saved.) The other
four words designate the location of those pieces which can move in the four different diagonal directions: RF, for
right forward; LF, for left forward; LB, for left backward; and RB, for right backward, respectively.
By reference to Fig. A-1, it will be observed that a right-forward move results in an increase of 4 in the square
designation, while a left-forward move results in an increase of 5. Bit positions 9, 18 and 27 do not appear on the
board. This notation makes it possible to compute available moves for all pieces simultaneously. Having previously
computed a word called EMPTY, which contains 1's in locations corresponding to all unoccupied squares, one can
compute RF, for the normal move case, in four instructions, as listed below (in IBM 704 symbolic language):

CLA EMPTY (puts word EMPTY into the accumulator);
ALS 4 (shifts word to left by 4 positions);
ANA FA (forms logical AND between EMPTY and FA);
STO RF (stores word as newly computed RF).

Jump moves are computed by a simple extension of this procedure. Multiple jumps are handled as a sequence of single
jumps separated by null-reply moves.
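
The same shift-and-mask idea can be sketched in Python. This is an illustration under stated assumptions, not a transcription of the 704 routine: square n is stored at bit n of a Python integer counted from the low-order end, so the 704's left shift becomes a right shift here, and all function names are hypothetical. Because bits 9, 18 and 27 are never set in EMPTY, pieces on the board edge are automatically prevented from moving off the board.

```python
# Internal square numbers run 1..35; 9, 18 and 27 are not on the board.
VALID_SQUARES = [n for n in range(1, 36) if n not in (9, 18, 27)]
BOARD_MASK = sum(1 << n for n in VALID_SQUARES)


def to_word(squares):
    """Pack a collection of internal square numbers into one integer word."""
    word = 0
    for n in squares:
        word |= 1 << n
    return word


def squares_of(word):
    """List the internal square numbers whose bits are set in word."""
    return [n for n in VALID_SQUARES if (word >> n) & 1]


def forward_movers(fa, empty):
    """Return (RF, LF): pieces in FA that can move right-forward (+4) or left-forward (+5)."""
    empty &= BOARD_MASK       # phantom bits 9, 18, 27 never count as empty
    rf = (empty >> 4) & fa    # square n+4 is empty and a forward piece sits on n
    lf = (empty >> 5) & fa    # square n+5 is empty and a forward piece sits on n
    return rf, lf


# Example: forward-moving active pieces on 1, 4 and 5, every other square empty.
fa = to_word([1, 4, 5])
empty = BOARD_MASK & ~fa
rf, lf = forward_movers(fa, empty)
```

In this example `squares_of(rf)` gives `[4]` (square 1 is blocked by the piece on 5, and square 5 would land on the missing position 9), while `squares_of(lf)` gives `[1, 5]`.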

[Figure A-1 Checkerboard notation for internal computations. Board diagram with WHITE at the top and BLACK at the bottom.]

• Additional time-saving expedients

Bit counting is done by a table-lookup procedure in a closed subroutine of 16 executed instructions (408 microseconds).
This requires a 156-word table which is generated at the start by a 13-word program. Similar table-lookup procedures
are used to turn a word end-for-end and to locate the 1's in a word for move reporting.
Multiplications are usually avoided. In several places where multiplication by small integers must be done, it is
programmed in terms of shifts and logical operations.
During the look-ahead procedure a complete record is kept of the sequence of board positions currently under
investigation. As a result, no computing is needed to retract moves.
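
A minimal Python sketch of table-lookup bit counting and end-for-end reversal follows. The 6-bit chunk width, the resulting 64-entry tables and all names are assumptions chosen so that a 36-bit word splits evenly; they do not reproduce the table size quoted above.

```python
WORD_BITS = 36   # IBM 704 word length
CHUNK = 6        # assumed chunk width: 36 = 6 * 6
CHUNK_MASK = (1 << CHUNK) - 1

# Both tables are generated once at start-up by a short recurrence.
COUNT_TABLE = [0] * (1 << CHUNK)
REVERSE_TABLE = [0] * (1 << CHUNK)
for i in range(1, 1 << CHUNK):
    COUNT_TABLE[i] = COUNT_TABLE[i >> 1] + (i & 1)
    REVERSE_TABLE[i] = (REVERSE_TABLE[i >> 1] >> 1) | ((i & 1) << (CHUNK - 1))


def count_bits(word):
    """Count the 1's in a 36-bit word using one table lookup per 6-bit chunk."""
    total = 0
    for shift in range(0, WORD_BITS, CHUNK):
        total += COUNT_TABLE[(word >> shift) & CHUNK_MASK]
    return total


def reverse_word(word):
    """Turn a 36-bit word end-for-end using the chunk reversal table."""
    result = 0
    chunks = WORD_BITS // CHUNK
    for k in range(chunks):
        chunk = (word >> (CHUNK * k)) & CHUNK_MASK
        result |= REVERSE_TABLE[chunk] << (CHUNK * (chunks - 1 - k))
    return result


def times_five(x):
    """Multiply by a small integer with a shift and an add: 5x = 4x + x."""
    return (x << 2) + x
```

For instance, `count_bits(0b1011)` is 3 and `reverse_word(1)` moves the lowest bit to position 35.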


Appendix B: Sample games from the second series with generalization learning

• Typical openings
The first eight moves of selected games in which Alpha played Black against Beta, showing the way in which different
types of play were tried.

 G-4    G-6    G-11   G-17   G-19   G-21   G-31   G-37   G-39   G-41   G-43
10 14  11 16  11 16  11 16  11 16  11 16  11 16  12 16  11 16  10 14  11 16
24 19  22 18  21 17  24 20  24 20  24 20  23 18  24 20  24 20  24 20  23 19
14 18  16 20  16 20  10 14   7 11   8 11   7 11   8 12  10 15  11 15  16 23
23 14  18 14  17 13  20 11  21 17  28 24  27 23  28 24  20 11  27 24  26 19
 9 18   9 18   9 14   8 15  10 14  10 14  16 20  10 14   7 16   7 10   8 11
22 15  23 14  23 18  21 17  17 10  23 18  23 19  23 18  21 17  23 18  22 17
11 18  10 17  14 23   7 11   6 15  14 23  20 27  14 23   6 10  14 23  10 14
21 17  21 14  27 18  17 10  28 24  27 18  31 24  27 18  23 19  26 19  17 10

• Typical games
Sample games in which Alpha played White against forced Beta openings.
G-1 G-18 G-30 G-40 G-1 G-18 G-30 G-40
12 16 12 16 12 16 10 14 I' 9 13 11 16 9 1.+ .+ 8
24 19 24 20 24 20 24 20 1 6 24 20 1l) 9 1 6
8 12 8 12 8 12 I I IS I: 13 17 16 19 8 [I 10 14
22 18 28 24 28 24 27 24 I'
:: 31 27 19 25 15 8 6 10
10 14 10 15 10 14 7 10 16 20 13 17 4 I I 14 17
26 22 22 18 22 18 23 18 18 14 LO 7 19 IS LO L5
16 20 15 22 6 10 14 23 11 IS 1 1I IL 18 17 21
30 26 25 18 24 19 26 19
I: 6 10 14 10 23 14 32 28
11 16 7 10 1 6 10 14 15 18 19 13 13 17 5 9
28 24 18 14 32 18 19 10 14 9 2L 14 9 5 27 14
7 11 10 17 3 8 6 15
Terminated 13 26 11 16 20 17
22 17 21 14 26 21 11 17 Manually 10 7 28 14 19 16
3 8 9 18 9 13 2 7 26 30 17 22 12 1~
17 10 23 14 18 9 17 10 25 11 6 10 IS 12 31
6 15 22 6 9 5 L4 7 14 30 16 30 15 9 14
26 17 30 25 22 18 24 19 7 3 L 6 31 26
"It
9 13 9 18 6 9 15 24 H I 1 '15 25 21 14 18
17 14 26 23 25 22 28 19 ,I' 14 to 5 I 18 24
2 7 3 8 2 6 14 17 ":l 5 9 21 17 8 lL
23 18 23 14 30 25 21 14 i' 10 6 24 20 24 19
16 23 1 6 14 17 9 18 15 19 16 19 21 25
14 10 27 23 21 14 5 25 22 II 6 I 20 16 30 21
7 14 6 9 6 9 18 25 Ii 26 12 17 13 Beta Concedes
18 9 14 10 L8 15 29 22 Ii" 1 6 6 2
5 14 9 13 IL 18 5 9 9 13 13 17
27 18 9 25 21 20 IL 2 31 17 20 16 10 6
20 27 11 15 10 14 1 5 19 23 Beta Concedes
31 24 20 11 22 IS 20 16 6 9
12 16 15 18 14 17 3 7 23 27
21 17 23 14 5 1 22 17 16 1I
13 22 8 IS 17 21 8 II 22 15
25 18 24 19 25 22 17 13 .,i II 7
1 5 IS 24 21 15 lL 20 25 30
9 6 32 28 21 18 13 6 7 2
5 9 24 27 25 30 7 10 27 32
6 1 3 I 24 2 6 6 1 70 Move Termination

[Figure B-1 Square designations used in reporting games. Board diagram with WHITE at the top and BLACK at the bottom.]

Appendix C: Evaluation polynomial details for second series

• Method of computing terms

The 16 terms called for in the evaluation polynomial are computed, individually, by taking the value of the appropriate
parameter, as defined below, for the board position under consideration and subtracting the value of this same
parameter computed for the board position just prior to the last move (with the necessary reversal in the definitions
of active and passive sides). This difference is then multiplied by the corresponding program-computed coefficient,
which can vary between -2^18 and +2^18, and credited to the side which was passive on the board position under
consideration.
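
The computation of a single term can be sketched as follows. The function and argument names are hypothetical, and the caller is assumed to have already evaluated the parameter on both positions with the active and passive roles reversed as required; the sketch only shows the difference-times-coefficient step and the stated coefficient range.

```python
MAX_COEFFICIENT = 2 ** 18   # coefficients lie between -2^18 and +2^18


def term_contribution(param_now, param_before, coefficient):
    """Return (parameter now - parameter before the last move) * coefficient.

    The result is credited to the side that is passive on the current position.
    """
    if not -MAX_COEFFICIENT <= coefficient <= MAX_COEFFICIENT:
        raise ValueError("coefficient outside the range -2^18 .. +2^18")
    return (param_now - param_before) * coefficient


# Example: a parameter rose from 1 to 3 and carries a coefficient of +2^4.
gain = term_contribution(3, 1, 2 ** 4)
```

Here `gain` is 32, which would be added to the passive side's score for this term.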


• Definitions of parameters

ADV (Advancement)
The parameter is credited with 1 for each passive man in the 5th and 6th rows (counting in passive's direction) and debited with 1 for each passive man in the 3rd and 4th rows.

APEX (Apex)
The parameter is debited with 1 if there are no kings on the board, if either square 7 or 26 is occupied by an active man, and if neither of these squares is occupied by a passive man.

BACK (Back Row Bridge)
The parameter is credited with 1 if there are no active kings on the board and if the two bridge squares (1 and 3, or 30 and 32) in the back row are occupied by passive pieces.

CENT (Center Control I)
The parameter is credited with 1 for each of the following squares: 11, 12, 15, 16, 20, 21, 24 and 25 which is occupied by a passive man.

CNTR (Center Control II)
The parameter is credited with 1 for each of the following squares: 11, 12, 15, 16, 20, 21, 24 and 25 that is either currently occupied by an active piece or to which an active piece can move.

CORN (Double-Corner Credit)
The parameter is credited with 1 if the material credit value for the active side is 6 or less, if the passive side is ahead in material credit, and if the active side can move into one of the double-corner squares.

CRAMP (Cramp)
The parameter is credited with 2 if the passive side occupies the cramping square (13 for Black, and 20 for White) and at least one other nearby square (9 or 14 for Black, and 19 or 24 for White), while certain squares (17, 21, 22 and 25 for Black, and 8, 11, 12 and 16 for White) are all occupied by the active side.

DENY (Denial of Occupancy)
The parameter is credited with 1 for each square defined in MOB if on the next move a piece occupying this square could be captured without an exchange.

DIA (Double Diagonal File)
The parameter is credited with 1 for each passive piece located in the diagonal files terminating in the double-corner squares.

DIAV (Diagonal Moment Value)
The parameter is credited with 1/2 for each passive piece located on squares 2 removed from the double-corner diagonal files, with 1 for each passive piece located on squares 1 removed from the double-corner files and with 3/2 for each passive piece in the double-corner files.

DYKE (Dyke)
The parameter is credited with 1 for each string of passive pieces that occupy three adjacent diagonal squares.

EXCH (Exchange)
The parameter is credited with 1 for each square to which the active side may advance a piece and, in so doing, force an exchange.

EXPOS (Exposure)
The parameter is credited with 1 for each passive piece that is flanked along one or the other diagonal by two empty squares.

FORK (Threat of Fork)
The parameter is credited with 1 for each situation in which passive pieces occupy two adjacent squares in one row and in which there are three empty squares so disposed that the active side could, by occupying one of them, threaten a sure capture of one or the other of the two pieces.

GAP (Gap)
The parameter is credited with 1 for each single empty square that separates two passive pieces along a diagonal, or that separates a passive piece from the edge of the board.

GUARD (Back Row Control)
The parameter is credited with 1 if there are no active kings and if either the Bridge or the Triangle of Oreo is occupied by passive pieces.

HOLE (Hole)
The parameter is credited with 1 for each empty square that is surrounded by three or more passive pieces.

KCENT (King Center Control)
The parameter is credited with 1 for each of the following squares: 11, 12, 15, 16, 20, 21, 24 and 25 which is occupied by a passive king.

MOB (Total Mobility)
The parameter is credited with 1 for each square to which the active side could move one or more pieces in the normal fashion, disregarding the fact that jump moves may or may not be available.

MOBIL (Undenied Mobility)
The parameter is credited with the difference between MOB and DENY.

MOVE (Move)
The parameter is credited with 1 if pieces are even with a total piece count (2 for men, and 3 for kings) of less than 24, and if an odd number of pieces are in the move system, defined as those vertical files starting with squares 1, 2, 3 and 4.

NODE (Node)
The parameter is credited with 1 for each passive piece that is surrounded by at least three empty squares.

OREO (Triangle of Oreo)
The parameter is credited with 1 if there are no passive kings and if the Triangle of Oreo (squares 2, 3 and 7 for Black, and squares 26, 30 and 31 for White) is occupied by passive pieces.

POLE (Pole)
The parameter is credited with 1 for each passive man that is completely surrounded by empty squares.

RECAP (Recapture)
This parameter is identical with Exchange, as defined above. (It was introduced to test the effects produced by the random times at which parameters are introduced and deleted from the evaluation polynomial.)

THRET (Threat)
The parameter is credited with 1 for each square to which an active piece may be moved and in so doing threaten the capture of a passive piece on a subsequent move.
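
Three of the simpler parameters (ADV, CENT and MOBIL) can be sketched in Python to show how such definitions translate into counts. The sketch assumes the standard square numbering of Appendix B with Black starting on squares 1 to 12 and advancing toward higher numbers; the function names and the list-of-squares representation are illustrative only and are not the bit-word representation used by the program.

```python
CENTER_SQUARES = {11, 12, 15, 16, 20, 21, 24, 25}


def row_from_own_side(square, side):
    """Row number (1..8) of a square, counted in the given side's direction of advance."""
    row_from_black = (square - 1) // 4 + 1   # assumes Black starts on squares 1..12
    return row_from_black if side == "black" else 9 - row_from_black


def adv(passive_men, passive_side):
    """ADV: +1 per passive man in rows 5 and 6, -1 per passive man in rows 3 and 4."""
    score = 0
    for square in passive_men:
        row = row_from_own_side(square, passive_side)
        if row in (5, 6):
            score += 1
        elif row in (3, 4):
            score -= 1
    return score


def cent(passive_men):
    """CENT: number of listed center squares occupied by a passive man."""
    return len(CENTER_SQUARES & set(passive_men))


def mobil(mob, deny):
    """MOBIL: the difference between MOB and DENY."""
    return mob - deny
```

With Black passive and men on squares 17, 18 and 22, `adv` gives 3; with men on 9, 13, 17 and 21 the credits and debits cancel and it gives 0.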
• Binary connective terms
The abbreviations used for the terms of this type which have been employed are listed below, in the order of
A·B, Ā·B, A·B̄, and Ā·B̄, where A and B are the two respective parameters heading the sublists of abbreviations.

Denial of Occupancy-      Undenied Mobility-        Undenied Mobility-
Total Mobility            Denial of Occupancy       Center Control I
DEMO                      MODE 1                    MOC 1
DEMMO                     MODE 2                    MOC 3
DDEMO                     MODE 3                    MOC 2
DDMM                      MODE 4                    MOC 4

• Evaluation polynomial (first 12 terms only) after 42 games, during which a total of 1039 different sets of adjustments
were made to the terms and their coefficients.*

Term      Correlation    Sign of        Power of 2 Used    Times
          Coefficient    Coefficient    as Coefficient     Adjusted
MOC 2        0.45            -               18               84
KCENT        0.40            +               16              127
MOC 4        0.35            -               14               95
MODE 3       0.33            -               13              210
DEMMO        0.27            -               11              [illegible]
MOVE         0.19            +                8               91
ADV          0.19            -                8              739
MODE 2       0.19            -                8               55
BACK         0.14            -                6                6
CNTR         0.13            +                5               12
THRET        0.13            +                5              442
MOC 3        0.10            +                4               89
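
A minimal Python sketch shows how a polynomial with power-of-2 coefficients could be evaluated. Only the five rows whose sign is printed as "+" in the table are used; the dictionary layout, the function name and the sample term values are assumptions for illustration.

```python
# (sign, power of 2) for the table rows printed with a "+" sign.
COEFFICIENTS = {
    "KCENT": (+1, 16),
    "MOVE": (+1, 8),
    "CNTR": (+1, 5),
    "THRET": (+1, 5),
    "MOC 3": (+1, 4),
}


def evaluate(term_values):
    """Sum sign * 2**power * value over the supplied terms."""
    total = 0
    for name, value in term_values.items():
        sign, power = COEFFICIENTS[name]
        total += sign * value * (2 ** power)
    return total


# Hypothetical term values for one board position.
score = evaluate({"KCENT": 1, "MOVE": -1, "THRET": 2})
```

With these values the score is 2^16 - 2^8 + 2·2^5 = 65344, which shows how a single high-power term dominates the lower ones.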

Term      Times Adjusted      Term      Times Adjusted
          Before Discard                Before Discard
CORN            0             MODE 1
CRAMP           0             CENT          386
GUARD           0             MODE 4          0
EXPOS         162             FORK          400
DDMM           19             MOBIL         707
DYKE          115             POLE           11
MOC 1           1             HOLE          598
EXCH          445             GAP           792
DDEMO          53             MOB           608
* Note added in proof: An additional 211 games have recently been played. Although
some significant changes were noted, the general stabilization of the learning process
suggested by Figure 4 has been confirmed. During this play, 412 more adjustments
were made to the terms and their coefficients and 12 additions were made to the
list of discarded terms.

Received March 3, 1959
