Massively Parallel Chess
C. F. Joerg and B. C. Kuszmaul, MIT Laboratory for Computer Science
of chess programs: the total work W and the critical path length C.

The total work and critical path length give us a way to understand how the performance of a parallel program will scale as the number of processors increases, and also to understand the effectiveness of our scheduler. For example, the effectiveness with which the available work is scheduled into the machine can be measured by comparing it to the bound from Brent's theorem [Bre74, Lemma 2], which states that the runtime on P processors with a perfect scheduler can be brought down to no more than C + W/P.
The values of W and C depend on the parallel algorithm, rather than on the scheduler. In our game-tree search algorithm, the values of W and C are partially dependent on scheduling decisions made by the scheduler, but we believe that W and C are mostly independent of those decisions. A good algorithm reduces W and C. We can compare W to the runtime of a corresponding serial chess program, and we can compare C to W. The ratio of W to the work done by the serial program is the efficiency of the program, and indicates how much overhead is inherent in the parallel algorithm. The ratio W/C is the average available parallelism of the program. We can hope, because of Brent's theorem, to use as many as W/C processors with an efficiency of at least 50%.
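To spell out the 50% claim (this short derivation is ours, not the original text's): with P = W/C processors, Brent's bound gives T <= C + W/P = C + C = 2C. Since a perfectly efficient serial run takes time about W, the speedup is at least W/(2C) = P/2, i.e., each processor is at least 50% utilized.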
This paper explains how we obtain predictable high performance on *Socrates. Section 2 describes the Jamboree game-tree search algorithm and presents some analytical results describing the performance of Jamboree search. Then, in Section 3, we describe the Cilk 1.0 language and run-time system. The modifications made to Cilk in order to run the chess program are described in Section 4. In Section 5 we outline several other mechanisms used in the chess program. Section 6 presents a description of how the Jamboree algorithm relates to the algorithms used by other chess programs. We make some concluding remarks in Section 7.

2 Parallel Game Tree Search

The *Socrates chess program uses an efficient parallel game-tree search algorithm called "Jamboree" search. In this section we explain Jamboree search, starting with the basics of negamax search and serial alpha-beta search, and present some analytical performance results for the algorithm. The basic idea behind Jamboree search is to do the following operations on a position in the game tree that has k children:

- The value of the first child of the position is determined (by a recursive call to the search algorithm.)

- Then, in parallel, all of the remaining k-1 children are tested to verify that they are not better alternatives than the first child.

- Each child that turns out to be better than the first child is searched in turn to determine which is the best.

If the move ordering is best-first, i.e., the first move considered is always better than the other moves, then all of the tests succeed, and the position is evaluated quickly and efficiently. We expect that the tests will usually succeed, because the move ordering is often best-first due to the application of several chess-specific move-ordering heuristics.

2.1 Negamax Search Without Pruning

Before delving into the details of the Jamboree algorithm, let us review the basic search algorithms that are applicable to computer chess. (Readers who are familiar with the serial game-tree search algorithms may wish to skip directly ahead to the description of the Jamboree algorithm in Section 2.4.) Most chess programs use some variant of negamax tree search to evaluate a chess position. The goal of the negamax tree search is to compute the value of position p in a tree T_p rooted at position p. The value of p is defined according to the negamax formula:

    v_p = static_eval(p)                           if p is a leaf in T_p, and
    v_p = max{ -v_c : c a child of p in T_p }      if p is not a leaf.

The negamax formula states that the best move for player A is the move that gives player B, who plays the best move from B's point of view, the worst option. If there are no moves, then we use a static evaluation function. Of course, no chess program searches the entire game tree. Instead some limited game tree is searched, using an imperfect static evaluation function. Thus, we have formalized the chess knowledge as T_p, which tells us what tree to search, and static_eval, which tells us how to evaluate a leaf position.

The naive Algorithm negamax shown in Figure 1 computes the negamax value v_p of position p by searching the entire tree rooted at p. It is easy to make Algorithm negamax into a parallel algorithm, because there are no dependencies between iterations of the for loop of Line (N5). One simply changes the for loop into a parallel loop. But negamax is not an efficient serial search algorithm, and thus it makes little sense to parallelize it.

2.2 Alpha-Beta Pruning

The most efficient serial algorithms for game-tree search all avoid searching the entire tree by proving that certain subtrees need not be examined. In this section we review the alpha-beta pruning algorithm.
2
(N1)  Define negamax(n) as
(N2)    If n is a leaf then return static_eval(n).
(N3)    Let c <- the children of n, and
(N4)        b <- -infinity.
(N5)    For i from 0 below |c| do:
(N6)        Let s <- -negamax(c_i).       ;; Recursive Search
(N7)        If s > b then set b <- s.     ;; New best score
(N8)    enddo
(N9)    return b.

Figure 1: Algorithm negamax.
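The pseudocode translates directly into C; this sketch is ours (position_t, num_children, child, and static_eval are assumed helper names, not part of the paper's code):

    #include <limits.h>

    typedef struct position position_t;           /* hypothetical board type  */
    extern int num_children(const position_t *p);
    extern const position_t *child(const position_t *p, int i);
    extern int static_eval(const position_t *p);

    /* Negamax value of position n, searching the whole tree under it. */
    int negamax(const position_t *n)
    {
        int nc = num_children(n);
        if (nc == 0)                               /* n is a leaf             */
            return static_eval(n);
        int b = INT_MIN + 1;                       /* b <- "minus infinity";
                                                      +1 so -b never overflows */
        for (int i = 0; i < nc; i++) {
            int s = -negamax(child(n, i));         /* recursive search        */
            if (s > b)                             /* new best score          */
                b = s;
        }
        return b;
    }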
(A1)   Define absearch(n, alpha, beta) as
(A2)     If n is a leaf then return static_eval(n).
(A3)     Let c <- the children of n, and
(A4)         b <- -infinity.
(A5)     For i from 0 below |c| do:
(A6)         Let s <- -absearch(c_i, -beta, -alpha).
(A7)         If s >= beta then return s.        ;; Fail High
(A8)         If s > alpha then set alpha <- s.  ;; Raise alpha
(A9)         If s > b then set b <- s.
(A10)    enddo
(A11)    return b.
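The same C sketch extended with alpha-beta pruning (again ours, using the helpers assumed above); note how a fail-high return cuts off the remaining children:

    /* Alpha-beta value of position n within the window (alpha, beta). */
    int absearch(const position_t *n, int alpha, int beta)
    {
        int nc = num_children(n);
        if (nc == 0)
            return static_eval(n);
        int b = INT_MIN + 1;
        for (int i = 0; i < nc; i++) {
            int s = -absearch(child(n, i), -beta, -alpha);
            if (s >= beta)                 /* fail high: prune the rest */
                return s;
            if (s > alpha)                 /* raise alpha               */
                alpha = s;
            if (s > b)
                b = s;
        }
        return b;
    }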
Figure 4 shows the serial Scout search algorithm, which is due to J. Pearl [Pea80]. Procedure scout is similar to Procedure absearch, except that when considering any child that is not the first child, a test is first performed to determine if the child is no better a move than the best move seen so far. If the child is no better, the test is said to succeed. If the child is determined to be better than the best move so far, the test is said to fail, and the child is searched again (valued) to determine its true value.

The Scout algorithm performs tests on positions to see if they are greater than or less than a given value. A test is performed by using an empty-window search on a position. For integer scores one uses the values (-alpha-1) and (-alpha) as the parameters of the recursive search, as shown on Line (S9). A child is tested to see if it is worse than the best move so far, and if the test fails on Line (S12) (i.e., the move looks like it might be better than the best move seen so far), then the child is valued, on Line (S13), using a non-empty window to determine its true value.

If it happens to be the case that alpha + 1 = beta, then Line (S13) never executes, because s > alpha implies s >= beta, which causes the return on Line (S11) to execute. Consequently, the same code for Algorithm scout can be used for the testing and for the valuing of a position.

Line (S10), which raises the best score seen so far according to the value returned by a test, is necessary to ensure that if the test fails low (i.e., if the test succeeds), then the value returned is an upper bound on the score. If a test were to return a score that is not a proper bound to its parent, then the parent might return immediately with the wrong answer when the parent performs the check of the returned score against beta on Line (S11).

A test is typically cheaper to execute than a valuation because the alpha-beta window is smaller, which means that more of the tree is likely to be pruned. If the test succeeds, then Algorithm scout has saved some work, because testing a node is cheaper than finding its exact value. If the test fails, then scout searches the node twice and has squandered some work. Algorithm scout bets that the tests will succeed often enough to outweigh the extra cost of any nodes that must be searched twice, and empirical evidence [Pea80] justifies its dominance as the search algorithm of choice in modern serial chess-playing programs.
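Figure 4 itself does not survive in this copy, so the following C sketch (ours) reconstructs the test-then-value structure the text describes; the (S9)-(S13) labels above refer to the original figure, not to this sketch:

    /* Scout search: value the first child, then test the others with an
     * empty window, re-searching (valuing) only the tests that fail. */
    int scout(const position_t *n, int alpha, int beta)
    {
        int nc = num_children(n);
        if (nc == 0)
            return static_eval(n);
        int b = -scout(child(n, 0), -beta, -alpha);    /* value first child */
        if (b >= beta) return b;
        if (b > alpha) alpha = b;
        for (int i = 1; i < nc; i++) {
            /* Empty-window test: can child i beat the best so far? */
            int s = -scout(child(n, i), -alpha - 1, -alpha);
            if (s >= beta) return s;
            if (s > alpha) {
                /* Test failed (child looks better): value it for real. */
                s = -scout(child(n, i), -beta, -alpha);
                if (s >= beta) return s;
                if (s > alpha) alpha = s;
            }
            if (s > b) b = s;     /* keep the returned bound, as (S10) does */
        }
        return b;
    }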
(J1)   Define jamboree(n, alpha, beta) as
(J2)     If n is a leaf then return static_eval(n).
(J3)     Let c <- the children of n, and
(J4)         b <- -jamboree(c_0, -beta, -alpha).
(J5)     If b >= beta then return b.
(J6)     If b > alpha then set alpha <- b.
(J7)     In Parallel: For i from 1 below |c| do:
(J8)         Let s <- -jamboree(c_i, -alpha-1, -alpha).
(J9)         If s > b then set b <- s.
(J10)        If s >= beta then abort-and-return s.
(J11)        If s > alpha then
(J12)            Wait for the completion of all previous iterations
(J13)            of the parallel loop.
(J14)            Set s <- -jamboree(c_i, -beta, -alpha).  ;; Research for value
(J15)            If s >= beta then abort-and-return s.
(J16)            If s > alpha then set alpha <- s.
(J17)            If s > b then set b <- s.
(J18)        Note the completion of the ith iteration of the parallel loop.
(J19)    enddo
(J20)    return b.

Figure 5: Algorithm jamboree.
2.4 Jamboree Search

The Jamboree algorithm, shown in Figure 5, is a parallelized version of the Scout search algorithm. The idea is that all of the testing of the children is done in parallel, and any tests that fail are sequentially valued. A parallel loop construct, in which all of the iterations of a loop run concurrently, appears on Line (J7). Some synchronization between various iterations of the loop appears on Lines (J12) and (J18). We sequentialize the full-window searches for values because, while we are willing to take a chance that an empty-window search will be squandered work, we are not willing to take the chance that a full-window search (which does not prune very much) will be squandered work. Such a squandered full-window search could lead us to search the entire tree, which is much larger than the pruned tree we want to search.

The abort-and-return statements that appear on Lines (J10) and (J15) return a value from Procedure jamboree and abort any of the children that are still running. Such an abort is needed when the procedure has found a value that can be returned, in which case there is no advantage to allowing the procedure and its children to continue to run, using up processor and memory resources. The abort causes any children that are running in parallel to abort their children recursively, which has the effect of deallocating the entire subtree.

The actual search algorithm used in *Socrates also includes some forward pruning heuristics that prune a deep search based on a shallow preliminary search. The idea is that if the shallow search looks really bad, then most of the time a deep search will not change the outcome. Forward pruning techniques have lately been shown to be extremely powerful, allowing programs running on single processors to beat some of the best humans at chess. The serial Socrates program uses such a scheme, and so does *Socrates. In the *Socrates version of Jamboree search, we first perform the preliminary search, then we search the first child, then we test the remaining children in parallel, and research the failed tests serially.

Parallel search of game trees is difficult because the most efficient algorithms for game-tree search are inherently serial. We obtain parallelism by performing the tests in parallel, but those tests may not all be necessary in a serial execution order. In order to get any parallelism, we must take the risk of performing extra work that a good serial program would avoid.

2.5 Analysis of Jamboree Search

The Jamboree search algorithm can be analyzed for a few special cases of trees of uniform height and degree. Here we summarize our results. The complete statement of the theorems and proofs can be found in [Kus94].
It turns out that we have two analytical results, one for best-ordered trees and one for worst-ordered trees.

Theorem 1 states how Jamboree search behaves on best-ordered trees. A best-ordered tree is one in which it turns out that the first move considered is always the best move, and thus the tests in the Jamboree search algorithm always succeed.

Theorem 1  For uniform best-ordered trees of degree d and height h the following hold:

- The total work performed is Theta(d^(h/2)), which is the same as serial alpha-beta search would perform. That is, the work efficiency is 1.

- The critical path length is Theta(2^(h/2)), and thus the average available parallelism is Theta((d/2)^(h/2)).

Chess trees typically have a degree of between 30 and 40 in the middle-game, and since we hope to search at least to depth 10, a best-ordered chess tree would have several-hundred-thousand-fold parallelism.

If the tree is not best-ordered, then the performance of the parallel algorithm can be much worse, however. Theorem 2 addresses worst-ordered trees. A worst-ordered tree is one in which the worst move is considered first, the second-worst move is considered second, and so on, with the best move considered last.

Theorem 2  For uniform worst-ordered trees of degree d and height h the following hold:

- The total work performed is Theta(d^h).

- The critical path is Theta(d^h).

For large d and h, the constants work out so that the total work performed is approximately three times as much as the serial alpha-beta search would perform (thus the efficiency is 1/3), and the critical path length is equal to the work performed by serial alpha-beta (with the speedup approaching 1 from below.)

Surprisingly, for worst-ordered uniform game trees, the speedup of Jamboree search over serial alpha-beta search turns out to be under 1. That is, Jamboree search is worse than serial alpha-beta search, even on a machine with no overhead for communications or scheduling. For comparison, parallelized negamax search achieves linear speedup on worst-ordered trees, and Fishburn's MWF algorithm achieves not-quite-linear speedup on worst-ordered trees [Fis84].

2.6 Real Chess Trees

The move-ordering heuristics of the chess program, which are important for serial programs because they reduce the work, are doubly important for our parallel algorithm because they also decrease the critical path length.

It is difficult to analyze Jamboree search for arbitrary game trees, because it is difficult to characterize the tree itself, and the tree that is actually searched can depend on how the work is scheduled. Unlike many other applications, the shape of the tree traversed by Jamboree search can be affected by the order of the execution of the work, sometimes increasing the work and sometimes decreasing it. Thus, measurements of "critical path length" and "work" on a particular run may be different from the measurements taken on another run, because the trees themselves are different. It is not clear what "critical path" and "work" mean for Jamboree search on arbitrary trees. Nonetheless, we have found that we can use the measured critical path length and total work to tune the program.

Our strategy is to measure the critical path and the work on a particular run, and to try to predict the performance from those measurements. (The details of how we measure critical path length are discussed in Section 3.) We measured the program on a set of eight problems [5], shown in Figure 6. For each problem the program was run to various depths, up to those that allowed the program to solve the problem by getting the "correct" answer, as identified by Kaufman. We also measured the program running on a variety of different-sized machines. Then we performed a curve-fit of the data to a performance model of the form

    T_predicted = c1*C + c2*(W/P) + c3.

We found that the performance can be accurately modeled as

    T = (0.95 +/- 0.04)*C + (1.091 +/- 0.001)*(W/P) + 0        (1)

with a sample correlation coefficient [6] of 0.999947, a mean error of 14.2%, and a mean relative error of 4.85%. To us, this is quite amazing, because chess is a very demanding application. For *Tech, we found that according to measurements of the ideal parallelism profile (which shows the amount of parallelism as a function of time, running the program on an ideal infinite-processor machine), for half the run-time there is often less than 10-fold parallelism. The low coefficients in Equation 1 indicate that the program quickly finishes the available work during the times of low parallelism, and when there is much parallelism the program efficiently load-balances the work.

[5] Our eight problems were provided by *Socrates team member L. Kaufman, who is an International Master. Kaufman has published several larger sets of benchmarks [Kau92, Kau93] that were used to understand *Tech [Kus94].
[6] For a definition of sample correlation coefficients and other statistical ...
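As a concrete illustration of Equation (1), with made-up numbers rather than measurements from the paper: a run with critical path C = 10 seconds and work W = 10,000 processor-seconds on P = 128 processors would be predicted to take

    T ~ 0.95 * 10 + 1.091 * (10000/128) ~ 9.5 + 85.2 ~ 94.7 seconds.

The work term dominates, so adding processors would still pay off; the critical-path term begins to matter only as P approaches W/C.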
[Figure 6 consists of eight chess-board diagrams, which do not survive in this copy. The moves printed below the positions are: (a) Ng7, (b) Re6, (c) Rg7, (d) Ra2, (e) a3, (f) ...Nd3, (g) Ne4, (h) f4.]

Figure 6: The 8 chess positions used in this paper. Below each position is shown Kaufman's "correct" move for that position. All positions are "White to move", except for Position (f).
We also found that the work increases by about a factor of two to three as the number of processors increases from 1 to 128 processors, and that the critical path length is fairly stable as the number of processors increases. Most of the difficulty of predicting the performance of the chess program comes from the fact that the amount of work is increasing. The processors end up expanding subtrees that are pruned in the serial code.

We found that the critical path does not limit the speedup for our test problems, or for the program running under tournament conditions. By using critical path to understand the parallelism of our algorithm, we are able to make good tradeoffs in our algorithm design. Without such a methodology it can be very difficult to do algorithm design. For example, Feldmann, Monien, and Mysliwietz find themselves changing their Zugzwang chess program to increase the parallelism without really having a good way to measure their changes [FMM93]. They express concern that by serially searching the first child before starting the other children they have reduced the available parallelism. Our technique allows us to state that there is sufficient parallelism to keep thousands of processors busy without changing the algorithm. We can conclude that we should try to reduce the total amount of work done by the program, even if it reduces the available parallelism slightly.

We experimented with some techniques to improve the work efficiency, and found several techniques that improve the work efficiency at the expense of increasing the critical path length. For example, on *Tech we considered an algorithm change that would value the first two children before starting the parallel tests of all the remaining children. The idea is that by valuing more children, it becomes more likely that the best of the children that have been valued will be able to prune some of the remaining children. When we measured the runtime on a small machine, the program ran faster, but on a big machine the runtime actually got worse. To understand why, we looked at the work and critical path length. We found that this variant of Jamboree search actually does decrease the total work, but it increases the critical path length, so that there is not enough available parallelism to keep a big machine busy. By looking at both the critical path length and the total work, we were able to extrapolate the performance on the big machine from the performance on the little machine, however, and so we avoided introducing modifications that would hurt us in tournament conditions.

3 The Cilk Work-Stealing Scheduler

Now that we have explained the search algorithm used in *Socrates, we need to explain how the computation is distributed across the machine. We use a run-time system called Cilk 1.0 [BJK*94] [8] to distribute work among the CM-5 processors. This section explains how a program is expressed in Cilk and how the computation is distributed across the machine.

To distribute work among CM-5 processors, Cilk uses a randomized work-stealing approach, in which idle processors request work. Processors run code that is nearly serial.

[8] Cilk is a threaded language of the C ilk.
    int fib (int n)
    {
        if (n<2) return n;
        else return fib(n-1)+fib(n-2);
    }

[Dataflow graph: two fib nodes, fib(n-1) and fib(n-2), feeding a sum node.]

    thread sum (cont k, int x, int y)
    {
        SendWordArgument (k, x+y);
    }

    thread fib (cont k, int n)
    {
        if (n<2) SendWordArgument (k, n);
        else
        {
            cont x, y;
            spawn_next sum (k, ?x, ?y);
            spawn fib (x, n-1);
            spawn fib (y, n-2);
        }
    }

Figure 7: Expressing the doubly recursive Fibonacci program in Cilk 1.0. Shown first is the program written in serial C, then the dataflow graph for the program, and then the corresponding Cilk code.
When a processor discovers some work that could be done in parallel, it posts the work into a local data structure. When a processor runs out of work locally, it sends a message to another processor, selected at random, and removes work from that processor's collection of posted work.

The Cilk system was originally based on the Parallel Continuation Machine run-time system of Halbherr, Zhou, and Joerg [HZJ94]. In PCM, the scheduler uses a double-ended queue (a deque) on every processor. When a processor posts work, it pushes it on the bottom of the deque. When a processor needs more work to do locally, it pops it off the bottom of the deque. When a processor steals work, the work is stolen from the top of the deque on the remote processor. It turns out that we modified this basic scheduler, as we shall describe in Section 4.4.
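The deque discipline just described can be sketched as follows (our sketch, single-processor view; the synchronization a real thief/victim protocol needs is omitted):

    #define DEQUE_SIZE 4096

    typedef struct closure closure_t;              /* one posted piece of work */

    typedef struct {
        closure_t *slot[DEQUE_SIZE];
        int top;                                   /* oldest work: steals here */
        int bottom;                                /* newest work: local end   */
    } deque_t;

    /* The local processor posts newly enabled work at the bottom... */
    static void push_bottom(deque_t *q, closure_t *c)
    {
        q->slot[q->bottom++] = c;
    }

    /* ...and takes its next task from the bottom (LIFO), which keeps the
     * local execution order depth-first, like the serial program. */
    static closure_t *pop_bottom(deque_t *q)
    {
        return (q->bottom == q->top) ? 0 : q->slot[--q->bottom];
    }

    /* A thief takes from the top (FIFO), so it tends to get a node near
     * the root of the tree (a large piece of work), which keeps the
     * number of steals small. */
    static closure_t *steal_top(deque_t *q)
    {
        return (q->bottom == q->top) ? 0 : q->slot[q->top++];
    }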
Cilk requires that the programmer explicitly break the algorithm into threads. To give an idea of how programs are expressed, consider the doubly recursive Fibonacci program shown in Figure 7. First we convert the program to a dataflow graph, and then for each node of the graph we write a thread, which looks like a C function. Thus, in the final Cilk code, there are two threads, the sum thread and the fib thread. The sum thread accepts two values, adds them, and sends the result to an explicitly provided continuation. The fib thread creates a thread to sum two results, and passes continuations (denoted x and y) for that thread to two subsidiary fib threads. For a more complete description of the Cilk syntax, including a tutorial, see [BJK*94].

Similarly for the Jamboree algorithm, we transform the search code shown in Figure 5 into a dataflow graph, as shown in Figure 8. Then we express the program in Cilk analogously to the Fibonacci example.

Cilk automatically computes the critical path length and total work of a computation. The computation of the critical path is done by a system of time-stamping, as shown in Figure 9.

The Cilk system runs on both the CM-5 and networks of workstations. Soon we expect to provide Cilk versions that run on shared-memory multiprocessors and a variety of other parallel computing platforms. We are currently working on improving the Cilk run-time system to provide better support for global data structures, for input/output, and to help automatically break up a program into threads.

4 Using Cilk for Chess Search

In the following two sections we describe the implementation of *Socrates using Cilk. These sections are an interesting case study in implementing a large, multithreaded, speculative application. As mentioned in the introduction, *Socrates is a parallelization of a serial chess program. Much of the code, including the static evaluator, is identical in the parallel and the serial versions and is not discussed here. Instead, we focus on the portions of the code which were written specifically for the parallel version.

This section focuses on those parts of the Cilk scheduler that we had to change in order to make Cilk behave more like the scheduler used in *Tech. The changes we made include implementing migration handlers, aborting computations that are in progress, changing the order in which threads are stolen, and adding level waiting.
[Figure 8 is a dataflow-graph diagram, which does not survive in this copy. Its nodes: a fork, test nodes T1 ... T(k-1) ("Test child i" = Ti), a merge, and value nodes V0 and V1 ... V(k-1) ("Value child i" = Vi), ending in a join.]

Figure 8: The dataflow graph for Jamboree search. First Child 0 is searched to determine its value, then the rest of the children are tested in parallel to try to prove that they are worse choices than Child 0, and then each of the children that fail their respective tests are serially researched. This dataflow graph can be used to measure the critical path length of the computation by using time-stamping. Compare this description of the Jamboree algorithm to the textual description in Figure 5.
[Figure 9 is a small dataflow diagram: two tokens (d1, t1) and (d2, t2) enter an instruction node, and one token carrying the instruction's result and the time-stamp delta + max(t1, t2) leaves it.]

Figure 9: The time at which an instruction in a dataflow graph is executed in a perfect infinite-processor schedule can be computed by time-stamping the tokens. In addition to the normal data value of a token (d1, d2, and the instruction's result on d1 and d2, respectively, in the figure), the token includes a time-stamp (t1, t2, and delta + max(t1, t2), respectively, where delta is the time to execute the instruction). The time-stamp on the outgoing token is computed as a function of the time-stamps of the incoming tokens and the time to execute the instruction.
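The time-stamping rule of Figure 9 is simple enough to state in code (our sketch; the names are illustrative):

    typedef struct {
        int  value;        /* the token's normal data value d          */
        long stamp;        /* time t at which that value becomes ready */
    } token_t;

    /* Fire a two-input instruction in the idealized infinite-processor
     * schedule: its result is ready delta time units after the later of
     * its two inputs.  The largest stamp at the end of the computation
     * is the critical path length C. */
    static token_t fire(token_t a, token_t b, int (*op)(int, int), long delta)
    {
        token_t out;
        out.value = op(a.value, b.value);
        out.stamp = delta + (a.stamp > b.stamp ? a.stamp : b.stamp);
        return out;
    }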
4.1 Migration Threads

We use a large, variable-sized data structure (nearly 200 bytes) to describe the state of a chess board. In the serial code we pass around pointers to this structure and copy it only when necessary. In the parallel code we cannot just blindly pass pointers between threads, because if the thread is migrated the pointer will no longer be valid. A naive solution is to copy the state structure into every thread, but this adds a significant overhead to the parallel code. This overhead is especially distasteful when you realize that well under 1% of threads are actually migrated, so most of the copying would be wasted effort.

To solve this problem we use migration threads. Any thread can have a migration thread associated with it. When the scheduler tries to migrate a thread that has an associated migration thread, the scheduler will first call the migration thread. This migration thread will return a new closure, which is migrated instead.

Using this mechanism we are able to pass threads a pointer to a state structure. Any thread that is passed a state pointer is also given a migration thread which will copy the state into the closure if the thread is stolen. Once the closure arrives at the stealing processor, the stolen thread can then be called with a pointer to the copied state structure. This allows the overhead of copying the state to be paid only when it is actually necessary.
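In outline, the hook looks like this (our sketch, not the actual Cilk 1.0 interface):

    typedef struct closure closure_t;
    struct closure {
        void       (*thread)(closure_t *);   /* the work itself            */
        closure_t *(*migrate)(closure_t *);  /* NULL: closure moves as-is  */
    };

    /* Called by the scheduler on a closure it is about to send to a
     * thief.  A thread that was passed a raw state pointer registers a
     * migration thread that builds a replacement closure with the
     * roughly 200-byte board state copied into it, so the copy is paid
     * for only by the well-under-1% of threads that actually migrate. */
    static closure_t *prepare_to_migrate(closure_t *c)
    {
        return c->migrate ? c->migrate(c) : c;
    }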
4.2 Abort

In order to implement the Jamboree search algorithm we must be able to abort a computation. This is needed when we discover that at least one child has a score greater than beta, so there is no need to search the rest of the children. (This is called failing high.) The Cilk system has no built-in mechanism for aborting a computation, so this had to be added as user code. Our goal in designing the abort mechanism was to keep it as self-contained as possible and to minimize changes to the rest of the code. Eventually we would like to add support for such a mechanism to Cilk itself.

In order to abort a computation we must first be able to find all of the threads that are working on this computation. To implement this we use abort tables to link together all the threads working on a computation. When a computation, say A1, needs to create several children, it first creates an abort table containing an entry for each child of the computation. If a child of A1, say B1, itself spawns off children, then the entry for B1 is updated to contain a pointer to the abort table that B1 creates. Once B1 and all its children have completed, B1's table is deallocated and the entry for B1 is updated. With this mechanism in place the abort code is able to find all the descendants of any computation. When performing an abort, the abort code does not actually destroy any threads; instead it merely makes a mark in the affected abort tables. When a user's thread runs, its first action should be to check to see if it has been aborted, and if so skip the rest of its computation. This check allows the user's code to do any cleaning up that may be necessary. (For example, the code may need to free some data structures.)

The abort mechanism provides functions to create, update, and deallocate the abort structures; to check if a thread is aborted; and to start an abort. By using these functions and passing around a few pointers to abort tables, the search code was modified to include aborting without too many changes.

One difficulty encountered in implementing the abort tables was in keeping the tables correct when a computation migrates. When a computation is stolen, an abort table is allocated on the stealer's side and the existing abort table is modified to point to it. The difficulty arises because at the time a computation is stolen there is not yet an abort table on the stealer's side to point to. This abort table is not allocated until after the thread begins to run (unless we change the run-time system, which we wanted to avoid). So instead we create a unique identifier (UID) for each stolen computation, and store that into the abort table. Then on the stealer's side we have a hash table to map the UID into a pointer to the abort table. The protocol for accessing the hash table is quite tricky, since there are many cases which require special handling. For example, the network of the CM-5 can reorder messages; therefore we have to handle the case where a message to abort a computation arrives before the thread that will allocate the hash-table entry and abort table for that computation. Unfortunately, we did not consider all such possibilities before beginning the design, so getting this mechanism working correctly took longer than anticipated.
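Our sketch of the abort-table bookkeeping described above (the UID and hash-table machinery for stolen computations is omitted):

    typedef struct abort_table abort_table_t;

    typedef struct {
        int            aborted;   /* set when this subtree is aborted       */
        abort_table_t *subtable;  /* table this child created for its own
                                     children; NULL if none or completed    */
    } abort_entry_t;

    struct abort_table {
        int           nchildren;
        abort_entry_t entry[1];   /* allocated with nchildren entries       */
    };

    /* Mark every descendant of a computation as aborted.  No thread is
     * destroyed: each user thread checks its own entry when it starts
     * running and, if marked, does its cleanup and skips the search. */
    static void mark_aborted(abort_table_t *t)
    {
        for (int i = 0; i < t->nchildren; i++) {
            t->entry[i].aborted = 1;
            if (t->entry[i].subtable)
                mark_aborted(t->entry[i].subtable);
        }
    }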
4.3 Steal Ordering

In the original Cilk runtime system the thread queue consisted of a single double-ended queue. Newly enabled threads were placed at the front of the queue, and the local processor took work out of this side as well (i.e., LIFO). When stealing occurs, threads are stolen from the other side of the queue (i.e., FIFO). For a tree-shaped computation, the LIFO scheduling allows the computation to proceed locally in a depth-first ordering, thus giving us the same execution order a sequential program would have. However, when stealing occurs, the FIFO steal ordering causes a thread near the top of the tree to be stolen, so a large piece of work will be migrated, thus minimizing stealing. Since Jamboree search is a tree-shaped computation, this mechanism works reasonably well.

With this scheduling mechanism, the order in which children are executed depends on whether or not a child is stolen. For most computations this execution order does not matter, but for Jamboree search it does. Execution order has an effect because if one child fails high, the rest of the children do not need to be searched. Our program orders the children such that when no children are stolen (the common case) the children most likely to fail high are executed first; this order minimizes the total work W. The problem is that when stealing occurs, we steal the child least likely to fail high.

Ideally we would like to steal from the top of the tree, but still steal the child that is most likely to fail high. To do this we had to modify the scheduler by adding the concept of levels. Each thread in the queue is assigned a level, and threads at the same level will be executed in a fixed order, regardless of whether they are stolen or executed locally. Between levels, however, scheduling is done as before: we execute locally at the shallowest (newest) level and steal from the deepest (oldest) level. The search code then marks all the children of a computation as being at a level one shallower than the level at which the computation is currently executing. This gives us exactly the ordering of threads that we want. Adding this to *Socrates reduced the amount of work performed for searching a position, and seemed to give a speedup of 20-25%. This idea seemed important enough that we included a cleaner version of this mechanism in Cilk 1.0.

4.4 Level Waiting

The final change we made to the scheduler was a further attempt to reduce the extra work being performed by the parallel version. When a processor is searching a board position, A, it spawns off a bunch of children to test. If a processor ran out of children to work on while some children were still being worked on elsewhere, that processor would steal another closure and begin working on that.

Consider the case where one (or more) of the children is stolen, and the processor finishes the rest of the tests before the test of the stolen child completes. The processor may then be out of work to do [9]. This processor will then steal some closure from another processor and begin searching its board position, call it B. Eventually the test of the stolen child will complete. When this result comes back it will restart the computation on position A and preempt B. Since position A may still have additional value searches to perform, this is potentially a long computation. We are now in a position where B, no matter how little work it has, will not complete until the potentially long computation for A completes. The computation which spawned B will continue without it. It may eventually block (and thereby artificially lengthen the critical path C), or it may be able to continue, but will use looser bounds than if B had completed (and will thereby increase the total work W).

To avoid this stalled work we further modified the scheduler. We added "level waiting", a feature which makes use of the same levels that were used in the previous section for optimizing the steal ordering. When a computation spawns children, all the sub-computations are placed at the same level. The level-waiting mechanism simply requires that all of these sub-computations have completed before we may begin any work at a shallower level. This prevents us from starting, and then preempting, an unrelated search. Implementing this change seemed to give us a 15-20% speedup.

[9] It will often be out of work because none of the children at this level would have been stolen if there were any work earlier in the queue.

5 Other Chess Mechanisms

The previous section described issues that arose in getting the search routines to run in our parallel environment. This section describes other aspects of the serial code that had to be modified to run in a parallel system. These aspects include the transposition table, detecting repeated moves, and debugging support.

5.1 Transposition Table

Most serial chess programs include a transposition table. This is basically a hash table of previously evaluated nodes. After a node is searched, we create (or update) the hash entry for this node. The information stored in this entry includes a score, a move, a depth, and a check key. The score tells us the value of the node; the move tells us what move achieves this score; and the depth tells us how deep a search was done. The check key is used to distinguish between the many positions which may hash to this entry.

Before searching a node we first check to see if it is in the table. If it is present with a deep enough depth, then we need not search this node again. This can occur because the same position can be reached by many different sequences of moves (i.e., a transposition). Much of the time when we get a hit, the depth is not sufficient for the current search. But even in this case the table is still useful, because it gives us the best move found by an earlier search, and often the best move at a shallower depth is the best move at a deeper depth. By using the returned move as our predicted best move, we increase our chances of accurately predicting the best move, which, as we saw in Section 2, reduces the work and critical path of the computation.
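The probe-then-search pattern just described looks roughly like this (our sketch; the helper names and entry layout are illustrative, and the bound-type handling a real table needs is omitted):

    typedef struct position position_t;

    typedef struct {
        short score;       /* value of the node                        */
        short move;        /* move that achieves that score            */
        short depth;       /* how deep the stored search was           */
        unsigned check;    /* check key: disambiguates hash collisions */
    } tt_entry_t;

    extern tt_entry_t *hash_lookup(const position_t *p);
    extern void hash_store(const position_t *p, int score, int move, int depth);
    extern void try_move_first(position_t *p, int move);
    extern int  search(position_t *p, int depth, int alpha, int beta,
                       int *best_move);

    int search_with_tt(position_t *p, int depth, int alpha, int beta)
    {
        tt_entry_t *e = hash_lookup(p);
        if (e) {
            if (e->depth >= depth)
                return e->score;          /* deep enough: reuse the score  */
            try_move_first(p, e->move);   /* too shallow: still use the
                                             stored move for move ordering */
        }
        int best_move;
        int score = search(p, depth, alpha, beta, &best_move);
        hash_store(p, score, best_move, depth);   /* create or update entry */
        return score;
    }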
For *Socrates we implemented a distributed transposition table. We had a choice between implementing a blocking or a non-blocking interface to the table. When a thread begins a search of a node, the first thing it typically does is a transposition-table lookup on that node. In a blocking implementation, this thread would send off a lookup request to the appropriate node and busy-wait until the response arrives, and then continue. The obvious disadvantage of blocking is that we waste time busy-waiting. In a non-blocking implementation we would break this thread into several threads. When the time came to do a lookup, a thread would be posted on the node that would hold the entry. This thread would do the lookup and send the result back to the original node, enabling the continuation of the search. This implementation has the advantage that we do not spend any time busy-waiting while we do a table lookup. But it has one big disadvantage in that it may lead to many searches taking place on the same node concurrently. Intermixing two or more searches on the same node can cause both the work and the critical path to increase. To avoid these increases the scheduler would have to be modified to keep the two computations separate. To avoid the complexity involved in such a modification, we chose to implement a blocking transposition table.

Since there is no way to implement this blocking mechanism using Cilk primitives, we dropped to a lower level and used the Strata active-message library [BB94]. We designed the transposition table such that all accesses are atomic. For example, when a value is to be put into the table, the information about the position is sent to the node where the entry resides, and that node updates the entry as required. Alternatively, we could have implemented a non-atomic update by performing a remote read of the entry, modifying the entry, and then doing a remote write. Non-atomic updates would have required more messages, and would have had to either lock the entry while the update was in progress, or risk losing some information if two update operations overlapped.

To determine how much the busy-waiting hurts us, we instrumented our code to measure the time spent busy-waiting [10]. Our experiments showed us that the mean time between sending the request and receiving the reply was around 1600 cycles. This worked out to about 7% of the execution time.

Another decision we faced was how large to make the hash entries. Clearly, we would like to make them as small as possible [11]. The score and the move each require 16 bits. The bits describing the depth and type of search required another 9. The only other piece of an entry is the check bits. In our implementation each position had a 64-bit key. Of these bits, 9 were used to select a processor and 21 were used to select a hash line on a given processor, so there is no need to store these bits in the entry itself. Of the remaining 34 bits we stored only 23 of them as the check bits, since this allowed us to fit an entry in one 64-bit double word. When executing on the 512-processor system we had a 1-billion-entry hash table!

The last aspect of the transposition table we will examine is subsumptions. The issue is what, if anything, we do if two independent searches are concurrently searching the same position (i.e., one search "subsumes" the other). For example, Processor P1 may begin a search of Position B, and before it completes and writes its result into the hash table, Processor P2 begins another search of Position B. This leads to part of the search being duplicated. In the serial code these searches would be performed sequentially, so this problem would not occur.

We considered trying to avoid this overhead in the following manner. When a search begins, if the transposition-table lookup fails, an entry is created for that position and it is marked as "search in progress." Then if another lookup occurs on this position we know that a search is already being done. We would then have the option of waiting for the earlier search to complete.

We chose not to implement this mechanism. Implementing it would have been somewhat complicated, and there were a number of issues that it would raise that we did not have a clear understanding of. For example, when we were about to abort a search, would it be necessary to first check to see if anyone else is waiting for the results of this search? Another example is deciding when to wait: if a position is already being searched to depth d, and we want to search it to depth d-1, do we wait for the deeper search? If we don't wait we are doing extra work; if we do wait we may wait much longer than if we had just done it ourselves. We instrumented our program to estimate how much duplicate work was being done. Each time we completed a search and were about to write the hash-table entry, we first did a hash-table lookup to see if we would get a hit if we began the search now. (If so, then someone else must have completed a search of this node during the time since we began the search.) We found that this occurred less than 1% of the time. Furthermore, we had implemented a similar mechanism for *Tech, and it sometimes speeds the program up, and sometimes slows it down.

[10] Not all this time is wasted, since while busy-waiting we poll the network, so we may spend part of this time responding to arriving messages. But the analysis above gives us an upper bound on the cost of busy-waiting.
[11] Hsu claims that increasing the size of the hash table by a factor of 256 can easily give a factor of 2 to 5 speedup [Hsu90].
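The packing described in Section 5.1 (16 + 16 + 9 + 23 = 64 bits) could be written as C bit-fields; the field names here are our guesses:

    #include <stdint.h>

    /* One 64-bit transposition-table entry.  The 9 processor-select bits
     * and 21 line-select bits of the 64-bit position key are implied by
     * where the entry lives, so only 23 of the remaining 34 key bits
     * need to be stored as the check bits. */
    typedef struct {
        uint64_t score : 16;   /* value of the node                 */
        uint64_t move  : 16;   /* best move found                   */
        uint64_t depth : 9;    /* depth and type of search          */
        uint64_t check : 23;   /* check bits from the position key  */
    } tt_packed_t;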
5.2 Repeated Moves

To fully describe a position in a chess game we need more than just a description of where each piece is on the board; some history is needed as well. A simple example is that we need to know if the king has moved. If it has, then we cannot castle, even if the king has moved back to its original position. This sort of information can easily be stored in a few bits in the state, so this causes no difficulty.

Other required history cannot be stored so easily. In chess, if the same position is repeated 3 times then the game is a draw. Similarly, if 50 moves are made by each player without an irreversible move being made, the game is a draw [12]. To handle these cases we need to keep track of all moves since the last irreversible move. (Once an irreversible move is made, earlier positions cannot be repeated.) We do this by adding an array of positions to our state structure. This array contains all the positions (represented by their 64-bit hash key) since the last irreversible move.

This array greatly increases the size of the state structure (from about 160 bytes to nearly 1000 bytes). For a serial program the size of the state may not be significant, since the code could just modify and unmodify the same state structure. For parallel code, however, it is often necessary to make copies of the state, so a large state can slow down the program. To prevent this from occurring, when we copy a state we only copy the part of the repeated-position array that is meaningful. Since the average length of this list is quite small (under 2), copying this list adds very little overhead.

[12] An irreversible move is one which cannot be undone; that is, one which captures a piece or moves a pawn.

5.3 Debugging

In order to make it easier to debug our code, we make liberal use of 'assert' statements. Not only did this cause bugs to be detected sooner, it was also helpful in pinpointing the cause of the bug. One of our biggest problems initially was making sure that the parallel version was working correctly. This was difficult because if the parallel version was close to the serial, but not exactly the same, it would usually produce the exact same answers. We were often modifying both the parallel and the serial search algorithms, and keeping them consistent was quite error-prone. One method we occasionally used to test whether both versions were identical was to run the parallel code on one processor, run the serial code, and make sure they both searched exactly the same number of nodes. Unfortunately, we did not do this check often enough, and at one point so many minor variations had crept in that we wound up spending almost a week trying to make both versions consistent again.

One of the most useful assertions we added was to check at every node of the tree that the results of the parallel code were the same as the serial code. In the debugging version of the code, after the search of a position was complete we would call the serial code on the same position and assert that the results were the same. (We do this with the hash table turned off; otherwise the serial code simply finds the result in the hash table.) This was extremely slow, but it is an easy way to detect any differences between the serial and parallel searches, and to pinpoint exactly where the differences lie. After we started using this check, keeping both versions identical became much easier. We think this is an approach that is applicable to many parallel programs, not just chess.

Even with this grandiose verification, not all our bugs were detected. At one point the debugging mode worked fine when run on any number of processors, as did the non-debugging program when run on one processor. But when we ran on more than one processor the speedup was quite small. It turned out that debugging mode was not being completely turned off, as the flag which says whether or not to use the hash table was being set correctly only on processor 0. Therefore all other processors would never use the hash table. As is often the case, bugs which affect only performance can be harder to detect than bugs that affect correctness.
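The cross-check described in Section 5.3 amounts to something like the following (our sketch; serial_search and the hash-table switch are assumed names):

    #include <assert.h>

    typedef struct position position_t;
    extern int serial_search(position_t *p, int depth, int alpha, int beta);
    extern int hash_table_enabled;

    /* Debug build only: after the parallel search of p returns score,
     * re-run the serial searcher on the same position and insist the
     * answers match.  The hash table is disabled for the check, or the
     * serial code would simply find the parallel result in the table. */
    static void check_against_serial(position_t *p, int depth,
                                     int alpha, int beta, int score)
    {
        int saved = hash_table_enabled;
        hash_table_enabled = 0;
        assert(serial_search(p, depth, alpha, beta) == score);
        hash_table_enabled = saved;
    }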
6 Related Search Algorithms

Our chess program uses Jamboree search [Kus94], a parallelization of Scout search [Pea80], in which at every node of the search tree, the program searches the first child to determine its value, and then tries to prove, in parallel, that all of the other children of the node are worse alternatives than the first child. This approach to parallelizing game-tree search is quite natural, and it has been used by several other parallel chess programs, such as Cray Blitz [HSN89] and Zugzwang [FMM91]. Still others have proposed or analyzed variations of this style of game-tree search [ABD82, MC82, Fis84, Hsu90]. We do not claim that the search algorithm is a new contribution. Instead, we view the algorithm as a testbed for evaluating mechanisms needed for the design of scalable, predictable, asynchronous parallel programs.

Jamboree search was used in our previous program, *Tech [Kus94]. *Socrates is a step forward compared to *Tech because we introduced a linguistic layer and run-time system called Cilk 1.0 [BJK*94] to make it easier to program the application without worrying about the scheduling issues. Many of the techniques originally used in *Tech were borrowed for *Socrates. Inspired by some problems we had with early versions of our *Tech program, Leiserson and Blumofe designed a provably good scheduler that has good space and time bounds, as well as low communications requirements [BL94].

Other parallel algorithms based on Scout search include minimal tree search, mandatory work first, and principal variation splitting. S. Akl, D. Barnard, and R. Doran [ABD82] proposed the minimal tree search, which performs the weak alpha-beta search by searching the minimal tree (i.e., the Knuth-Moore critical tree [KM75]). Each position is kept in an expanded form, potentially for a long time, resulting in unrealistic storage requirements. The Deep Thought parallel algorithm as described in Hsu's thesis [Hsu90] is a variant of the high-storage-requirement minimal tree search.

J. Fishburn [Fis84] proposed the mandatory work first (MWF) algorithm. Algorithm MWF is based on the weak version of alpha-beta search. It explicitly computes the number of critical children of the position being searched. A child of a position is critical if the child is in the Knuth-Moore critical tree, which means that the child would definitely be searched by the alpha-beta algorithm. If the position being searched has more than one critical child, then MWF searches the first child and then searches the other children in parallel. If the first child turns out to be worse than some other child, MWF then researches the children that might be the best, all in parallel. In contrast, Jamboree researches sequentially. For nodes with exactly one critical child, MWF searches just the first child. Fishburn analyzed MWF for best-ordered and worst-ordered trees, but not for realistic game trees. One can construct game trees that are mostly best-ordered, in which the MWF algorithm does almost as badly as the naive parallel alpha-beta search's O(sqrt(P)) speedup.

Fishburn's MWF algorithm can be viewed as being separate from the scheduler, but his analysis depends on the scheduler. For example, Fishburn proves that worst-ordered game trees achieve speedup using mandatory-work-first on a tree-of-processors scheduler, in which the depth of the game tree is much greater than the depth of the processor tree. Our Theorem 2, in contrast, states that for an infinite-processor perfect scheduler the average available parallelism is less than 3 and the speedup is less than one. Even though the MWF algorithm is tangled up with the tree-of-processors scheduler, one can interpret Fishburn's results somewhat independently of the scheduler. Fishburn's results indicate, for example, that if one has a tree of processors that is half as deep as the game tree, and the degree of the processor tree is greater than the degree of the game tree, then the critical path is short and the work efficiency is good. Such a tree is as good as "infinite processors" for an algorithm in which the shallowest h/2 plies of the game tree are searched in parallel and the deepest h/2 plies of the game tree are searched serially. It turns out that the half-the-depth-serially strategy, when applied to Jamboree search, reduces the average available parallelism even further, down to about 2 for worst-ordered trees. Fishburn did not analyze what happens if the tree of processors is as deep as the game tree. The reason that MWF achieves speedup on worst-ordered trees is that MWF researches the children that failed their tests in parallel, while the Jamboree algorithm serially researches all the failed children. Hence, for worst-ordered trees, Jamboree search finds little parallelism, while MWF finds much parallelism. Any chess program that is searching worst-ordered trees is not competitive, however.

Several programs use principal variation splitting (PV-splitting) [MC82], which is another variation on MWF, but the ideas behind PV-splitting are, like MWF, somewhat obscured by the fact that a tree-of-processors scheduler is entangled into the search algorithm. Later work has separated the scheduler from the algorithm. For example, Cray Blitz [HSN89] apparently uses PV-splitting with something like a work-stealing scheduler. No critical-path analysis or measurement has been performed for Cray Blitz, however.

The Zugzwang program, developed by R. Feldmann, P. Mysliwietz, and B. Monien [FMM91], uses a parallel search algorithm that is very similar to Jamboree search. Zugzwang achieves high work-efficiency, searching to within a few percent the same number of nodes in a parallel search as in a sequential search. The efficiency of our programs appears to be somewhat lower, probably because the Zugzwang team has gone to substantial effort to try to ensure that they search the tree in a mostly best-first order.

The parallel aspiration search algorithm [Bau78] divides the alpha-beta window into segments, and gives each processor a different segment of the window to search. Aspiration search achieves only small parallel speedups. Surprisingly, the serial version of aspiration search often runs faster than an infinite-window search. Today most state-of-the-art chess programs, including *Tech, use a serial aspiration search in which the game tree is searched with a small alpha-beta window, and if the score is outside of the window, the tree is researched.

R. Karp and Y. Zhang [KZ89] show how to search an AND/OR tree in parallel by carefully allocating the right number of processors to each subtree. C. Stein [Ste92] employs Karp and Zhang's algorithm as a subroutine to do a parallel alpha-beta search. Stein performs a binary search for the value of the game tree, at each stage converting the game tree to an AND/OR tree with the question "Is the value of the root greater than s?".

There are several other approaches to game-tree search that are not based on alpha-beta search. H. Berliner's B* search algorithm [Ber79] tries to prove that one of the moves is better with respect to a pessimistic evaluation than any of the other moves with respect to an optimistic evaluation. D. McAllester's conspiracy search [McA88] expands the tree in such a way that changing the value of the root will require changing the values of many of the leaves of the tree. The SSS* algorithm [Sto79] applies branch-and-bound techniques to game-tree search. These algorithms all require space which is nearly proportional to the run time of the algorithm, but the constant of proportionality may be small enough to be feasible. While these algorithms all appear to be parallelizable, they have not yet been successfully demonstrated as practical serial algorithms. We wanted to be able to compare our work to the best serial algorithms.

7 Conclusions

The history of *Socrates sheds some light on the problems of developing a high-performance parallel program. The *Socrates chess team, which includes R. Blumofe, M. Halbherr, C. Joerg, B. Kuszmaul, C. Leiserson, and Y. Zhou of MIT, as well as D. Dailey and L. Kaufman of Heuristic Software, decided to start with a new chess program rather than to try to parallelize the original Socrates program.
program is that it uses many global variables which are plementation of *Socrates, and allowed us to implement
modified throughout the search. We felt that it would be several other parallel applications including a protein fold-
easier to start with a program that was designed to modify ing program [PJG*94] which was the first program to find
its state in a non-destructive fashion by always making a the number of Hamiltonian paths in a 4 4 3 grid, and
new copy of the variables that represent the state of a chess some smaller programs such as the doubly recursive Fi-
board in the tree search. It turned out that the decision bonacci routine, a backtracking search to solve the problem
to start with a new program resulted in the program being of determining how many ways there are to place n queens
substantially weaker than we had hoped, because we did on an n by n chess board, a ray-tracing image rendering
not have sufficient time to get all of the chess knowledge program, and a radiosity image rendering program.
transfered from Socrates to *Socrates. We are now developing additional mechanisms for Cilk
The program was developed on a very tight sched- to provide high performance on a wider variety of appli-
ule. Dailey implemented a bare-bones chess program that cations. We are trying to improve the linguistic layer, to
copies chess boards and provided it to the MIT contingent develop abstractions for manipulating shared data struc-
in May 1994. During June, Dailey visited MIT to help tune tures, and to simplify the interface to input/output and the
the program, but we spent most of June simply getting the operating system.
We are now developing additional mechanisms for Cilk to provide high performance on a wider variety of applications. We are trying to improve the linguistic layer, to develop abstractions for manipulating shared data structures, and to simplify the interface to input/output and the operating system.

Acknowledgments

Charles E. Leiserson, Robert D. Blumofe, Yuli Zhou, and Michael Halbherr all contributed to making the chess program work and to developing the underlying parallel technology used in *Socrates. Don Dailey and Larry Kaufman of Heuristic Software provided the serial program, Socrates, on which our parallel program is based, and Don worked many hours to help us get our parallel program working. Hans Berliner and Chris McConnell of CMU provided the serial version of Hitech that we first used as a testbed to develop our ideas for parallel game tree search. The National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign provided a 512-processor CM-5 for both the 1993 and the 1994 ACM Computer Chess Championships under NCSA Grant TRA930289N.
References
[ABD82] Selim G. Akl, David T. Barnard, and Ralph J. Doran. Design and implementation of a parallel tree search algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-4 (2), pages 192–203, March 1982.

[Bau78] G. M. Baudet. The Design and Analysis of Algorithms for Asynchronous Multiprocessors. Technical Report CMU-CS-78-116, Carnegie-Mellon University, April 1978, 182 pp. (Ph.D. thesis.)

[Ber79] Hans Berliner. The B* tree search algorithm: a best-first proof procedure. Artificial Intelligence, 12, pages 23–40, 1979.

[BE89] Hans Berliner and Carl Ebeling. Pattern knowledge and search: the SUPREM architecture. Artificial Intelligence, 38 (2), pages 161–198, March 1989.

[BJK*94] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Phil Lisiecki, Keith H. Randall, Andy Shaw, and Yuli Zhou. Cilk 1.1 Reference Manual. Massachusetts Institute of Technology, Laboratory for Computer Science, September 1994. (Available via anonymous FTP from theory.lcs.mit.edu in /pub/cilk/manual1.0.ps.Z.)

[BL94] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS ’94), Santa Fe, New Mexico, November 1994. (To appear.)

[Bre74] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21 (2), pages 201–206, April 1974.

[BB94] Eric A. Brewer and Robert D. Blumofe. Strata: A Multi-Layer Communications Library. Technical Report, MIT Laboratory for Computer Science, January 1994. (To appear. Available via anonymous FTP from ftp.lcs.mit.edu in /pub/supertech/strata.)

[FMM91] R. Feldmann, P. Mysliwietz, and B. Monien. A fully distributed chess program. In D. F. Beal, ed., Advances in Computer Chess 6, pages 1–27, Ellis Horwood, Chichester, West Sussex, England, 1991. (Conference held in August 1990.)

[FMM93] R. Feldmann, P. Mysliwietz, and B. Monien. Game tree search on a massively parallel system. In Advances in Computer Chess 7, 1993. (The conference was held in June 1993, but the proceedings had not been published as of August 1993.)

[FF82] Raphael A. Finkel and John P. Fishburn. Parallelism in alpha-beta search. Artificial Intelligence, 19 (1), pages 89–106, September 1982.

[Fis83] John P. Fishburn. Another optimization of alpha-beta search. SIGART Newsletter, Number 84, pages 37–38, April 1983.

[Fis84] J. P. Fishburn. Analysis of Speedup in Distributed Algorithms. UMI Research Press, Ann Arbor, MI, 1984.

[HZJ94] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-Style Parallel Programming Based on Continuation-Passing Threads. Computation Structures Group Memo 355, Massachusetts Institute of Technology, Laboratory for Computer Science, April 1994, 22 pp. (A shorter version will appear in Proc. of the 2nd Int. Workshop on Massive Parallelism: Hardware, Software and Applications, Capri, Italy, Oct. 1994.)

[HL93] Robert V. Hogg and Johannes Ledolter. Applied Statistics for Engineers and Physical Scientists. Macmillan Publishing Company, New York, 1993.

[Hsu90] Feng-hsiung Hsu. Large Scale Parallelization of Alpha-Beta Search: An Algorithmic and Architectural Study with Computer Chess. Technical Report CMU-CS-90-108, Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213, February 1990.

[HSN89] Robert M. Hyatt, Bruce W. Suter, and Harry L. Nelson. A parallel alpha/beta tree searching algorithm. Parallel Computing, 10 (3), pages 299–308, May 1989.

[KZ89] Richard M. Karp and Yanjun Zhang. On parallel evaluation of game trees. In Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 409–420, Santa Fe, New Mexico, June 1989.

[Kau92] Larry Kaufman. Rate your own computer. Computer Chess Reports, 3 (1), pages 17–19, 1992. (Published by ICD, 21 Walt Whitman Rd., Huntington Station, NY 11746, 1-800-645-4710.)

[Kau93] Larry Kaufman. Rate your own computer — part II. Computer Chess Reports, 3 (2), pages 13–15, 1992–93.

[KM75] Donald E. Knuth and Ronald W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6 (4), pages 293–326, Winter 1975.

[Kus94] Bradley C. Kuszmaul. Synchronized MIMD Computing. Ph.D. thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May 1994. (Available via anonymous FTP from csg-ftp.lcs.mit.edu in /pub/users/bradley/phd.ps.Z.)

[MC82] T. A. Marsland and M. S. Campbell. Parallel search of strongly ordered game trees. ACM Computing Surveys, 14 (4), pages 533–552, December 1982.

[McA88] David Allen McAllester. Conspiracy numbers for min-max search. Artificial Intelligence, 35, pages 287–310, 1988.

[PJG*94] Vijay S. Pande, Chris Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumeration of the Hamiltonian walks on a cubic sublattice. Journal of Physics A, 1994. (To appear.)

[Pea80] Judea Pearl. Asymptotic properties of minimax trees and game-searching procedures. Artificial Intelligence, 14 (2), pages 113–138, September 1980.

[Ste92] Clifford Stein. Evaluating game trees in parallel. In Charles E. Leiserson, ed., Proceedings of the 1992 MIT Student Workshop on VLSI and Parallel Systems, pages (47-1)–(47-2), MIT Endicott House, July 1992.

[Sto79] G. C. Stockman. A minimax algorithm better than alpha-beta? Artificial Intelligence, 12 (2), pages 179–196, August 1979.