
Learning to Play Othello Without Human Knowledge

Shantanu Thakoor, Surag Nair, Megha Jhunjhunwala
Stanford University

Abstract

Game playing is a popular area within the field of artificial intelligence. Most agents in the literature have hand-crafted features and are often trained on datasets obtained from expert human play. We implement a self-play based algorithm, using neural networks for policy estimation and Monte Carlo Tree Search for policy improvement, that learns to play Othello with no human knowledge as input. We evaluate our learning algorithm on 6x6 and 8x8 versions of the game of Othello. Our work is compared with random and greedy baselines, as well as a minimax agent that uses a hand-crafted scoring function, and achieves impressive results. Further, our agent for the 6x6 version of Othello easily outperforms humans when tested against them.

1 Introduction

Game playing is a popular area within the field of artificial intelligence. One of the earliest works in this field was a checkers engine developed in (Samuel 2000), which learned through self-play and machine learning rather than through rule-based methods. An early triumph was Deep Blue (Campbell, Hoane, and hsiung Hsu 2002), a computer program capable of superhuman performance in Chess that beat the top human players. These are relatively simple games, where the branching factor for each state is small and it is easy to evaluate how good a non-terminal position is. It was estimated that games like Go, which have a large branching factor and where it is very difficult to determine the likely winner from a non-terminal board position, would not be solved for several decades. However, AlphaGo (Silver et al. 2016), which uses recent deep reinforcement learning and Monte Carlo Tree Search methods, managed to defeat the top human player, through extensive use of domain knowledge and training on games played by top human players.

Many of the existing approaches for designing game-playing systems relied on the availability of expert domain knowledge to train the model and to evaluate non-terminal states. Recently, however, AlphaGo Zero (Silver et al. 2017b) described an approach that used absolutely no expert knowledge and was trained entirely through self-play. This new system even outperforms the earlier AlphaGo model. This is a very exciting result: computers may be capable of superhuman performance entirely through self-learning, without any guidance from humans.

In our work, we draw on ideas from the AlphaGo Zero paper and apply them to the game of Othello. We use board sizes of 6x6 and 8x8, for which learning through self-play is more tractable on the computing resources available to us. For evaluation, we compare our trained agents to random and greedy baselines, as well as a minimax agent with hand-crafted features. We also compared against humans, and found that our 6x6 version achieves superhuman performance very quickly.

2 Related Work

Self-play for learning optimal playing strategies in games has been a widely studied area. For example, 9x9 Go has been studied in (Gelly and Silver 2008). Chess, though widely played using alpha-beta search strategies, has also seen some work on self-play methods in (Heinz 2001). (Wiering 2010) studies the problem of learning to play Backgammon through a combination of self-play and expert knowledge methods.

In particular, (Van Der Ree and Wiering 2013) learn to play Othello through self-play methods, and (Nijssen 2007) applies Monte Carlo methods to Othello. For the 6x6 version, a perfect strategy for player 2 is known to exist.¹

(Silver et al. 2016) and (Silver et al. 2017b) have trained a novel neural network agent to achieve state-of-the-art results in the game of Go. Very recently (just 4 days before submission of this report!), this approach has also been extended to a general game-playing strategy in (Silver et al. 2017a), achieving state of the art in the games of Chess and Shogi.

¹ Solved by Joel F Feinstein.

3 Methods

We provide a high-level overview of the algorithm we employ, which is based on the AlphaGo Zero (Silver et al. 2017b) paper. The algorithm is based on pure self-play and does not use any human knowledge except the rules of the game. At the core, we use a neural network that evaluates the value of a given board state and estimates the optimal policy. The self-play is guided by a Monte Carlo Tree Search (MCTS) that acts as a policy improvement operator. The outcomes of each game of self-play are then used as rewards, which are used to train the neural network along with the improved policy. Hence, the training is performed in an iterative fashion: the current neural network is used to execute self-play games, whose outcomes are then used to retrain the neural network. The following sections describe the different components of our system in more detail.
3.1 Neural Policy and Value Network

We use a neural network f_θ, parametrised by θ, that takes as input the board state s and outputs a continuous value of the board state v_θ(s) ∈ [−1, 1] from the perspective of the current player, together with a probability vector p_θ(s) over all possible actions. p_θ represents a stochastic policy that is used to guide the self-play.

The neural network is initialized randomly. At the end of each iteration of self-play, the neural network is provided training examples of the form (s_t, π_t, z_t). Here π_t is an improved estimate of the policy obtained by performing MCTS starting from s_t (described in Section 3.2), and z_t ∈ {−1, 1} is the final outcome of the game from the perspective of the current player. The neural network is then trained to minimize the following loss function:

l = \sum_t \left[ (v_\theta(s_t) - z_t)^2 - \pi_t \cdot \log p_\theta(s_t) \right]

The network takes the raw board state as input. This is followed by 4 convolutional layers and 2 fully connected feedforward layers, and finally by two output layers: one that outputs v_θ and another that outputs the vector p_θ. Training is performed using the Adam (Kingma and Ba 2014) optimizer with a batch size of 64, with dropout (Srivastava et al. 2014) of 0.3, and batch normalisation (Ioffe and Szegedy 2015). The code is implemented in PyTorch.²

² www.pytorch.org
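As a concrete illustration, the following is a minimal PyTorch sketch of such a policy/value network and its loss; the layer widths, the single-channel board encoding, and the extra "pass" entry in the policy head are illustrative assumptions, not the exact configuration used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OthelloNet(nn.Module):
    """Sketch: 4 conv layers, 2 fully connected layers, then policy and value heads."""
    def __init__(self, board_size=6, channels=64, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(channels * board_size * board_size, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
        )
        self.pi_head = nn.Linear(hidden, board_size * board_size + 1)  # +1 for a "pass" move
        self.v_head = nn.Linear(hidden, 1)

    def forward(self, boards):
        # boards: (batch, n, n) tensors with entries in {-1, 0, +1}
        x = self.conv(boards.unsqueeze(1).float())
        x = self.fc(x.flatten(start_dim=1))
        log_pi = F.log_softmax(self.pi_head(x), dim=1)   # log p_theta(s)
        v = torch.tanh(self.v_head(x)).squeeze(-1)       # v_theta(s) in [-1, 1]
        return log_pi, v

def loss_fn(log_pi, v, target_pi, z):
    """Mean of (v_theta - z)^2 minus the dot product of the MCTS policy with log p_theta."""
    return ((v - z) ** 2).mean() - (target_pi * log_pi).sum(dim=1).mean()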

3.2 Monte Carlo Tree Search for Policy Improvement

We use Monte Carlo Tree Search (Browne et al. 2012) to improve upon the policy learned by the neural network. MCTS is a policy search algorithm that balances exploration with exploitation to output an improved policy after a number of simulations of the game. MCTS explores a tree in which nodes represent different board configurations, and a directed edge (i → j) exists between two nodes if a valid action can cause the state to transition from state i to state j. For each edge, we maintain a Q value, denoted Q(s, a), which is the expected reward for taking action a from state s, and N(s, a), which is the number of times we took action a from state s across different simulations. We also keep track of P(s, ·) = p_θ(s), the prior probability of taking each action from state s according to the policy returned by our neural network. From these, we calculate U(s, a), an upper confidence bound on the Q value of the edge:

U(s, a) = Q(s, a) + c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}

Here, c_puct is a hyperparameter controlling the degree of exploration (set to 1.0 in our experiments).
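As an illustration, the sketch below computes U(s, a) from these per-edge statistics and picks the action that maximizes it; the dictionary-based bookkeeping for Q, N, and P is an assumed implementation detail, not code from our system.

import math

def select_action(Q, N, P, s, valid_actions, c_puct=1.0):
    """Return argmax_a U(s, a) over the valid actions of state s.

    Q[(s, a)]: current mean reward of edge (s, a); N[(s, a)]: its visit count;
    P[s][a]: prior probability of a at s from the neural network policy.
    """
    total_visits = sum(N.get((s, b), 0) for b in valid_actions)
    best_action, best_u = None, -float("inf")
    for a in valid_actions:
        q = Q.get((s, a), 0.0)
        n = N.get((s, a), 0)
        u = q + c_puct * P[s][a] * math.sqrt(total_visits) / (1 + n)
        if u > best_u:
            best_action, best_u = a, u
    return best_action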
When using MCTS to find a policy from a given state s, we start building the MCTS tree with s as the root. At each step of a simulation, we choose the action a that maximizes the upper confidence bound U(s, a). If the resulting next state already exists in the MCTS tree, we continue the simulation from it. If it does not, we create a new node in the tree, initialize its P(s, ·) = p_θ(s) and its expected reward v = v_θ(s) using our neural network, and initialize Q(s, a) and N(s, a) to 0 for all a. We then propagate the value v back up the MCTS tree, updating all the Q(s, a) values seen during the simulation, and start again from the root. On the other hand, if we encounter a terminal state, we propagate the actual reward read off the board and restart the search from the root.

After a number of simulations, the N(s, a) values provide a good approximation of the optimal stochastic policy from each state. Hence, the action we take is sampled from a distribution π_s in which each action has probability proportional to N(s, a)^(1/τ), where τ is a temperature parameter. Setting τ to a high value gives an almost uniform distribution, while setting it to 0 always selects the most-visited action. τ is therefore another hyperparameter controlling the degree of exploration during learning. The training example generated from the MCTS starting at s is (s, π_s, r), where r ∈ {+1, −1} is determined at the end of the game according to whether the current player won or lost. Pseudocode of the MCTS search is provided in Algorithm 1.
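A minimal sketch of turning visit counts into this temperature-controlled policy is shown below; treating τ = 0 as a pure argmax is our reading of the description above.

import numpy as np

def counts_to_policy(counts, tau):
    """Convert visit counts N(s, .) into pi_s with pi_s(a) proportional to N(s, a)^(1/tau).

    counts: 1-D array of visit counts over all actions (0 for unvisited or invalid moves).
    tau: temperature; tau -> 0 collapses the distribution onto the most-visited action.
    """
    counts = np.asarray(counts, dtype=np.float64)
    if tau == 0:
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

# Example: sampling the next self-play move.
# pi = counts_to_policy(visit_counts, tau=1.0)
# action = np.random.choice(len(pi), p=pi)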
3.3 Policy Iteration through Self-Play

We now describe the complete training algorithm. We initialize our neural network with random weights, thus starting with a random policy. In each iteration of our algorithm, we play a number of episodes (100 in our experiments) of self-play using MCTS. This results in a set of training examples of the form (s_t, π_t, z_t). We also exploit the symmetry of the state space to augment our dataset: since Othello is invariant to rotations and flips of the board, we obtain 7 extra training examples per example in our dataset.
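The eight symmetries (four rotations, each optionally mirrored) can be generated with a few array operations. The sketch below assumes the board is a square NumPy array and that the policy vector can be reshaped onto the same grid (ignoring any extra "pass" entry); it is an illustration, not our exact augmentation code.

import numpy as np

def symmetric_examples(board, pi_grid, z):
    """Return the 8 symmetric copies of one training example.

    board:   (n, n) array encoding the position.
    pi_grid: (n, n) array of the MCTS policy reshaped onto the board.
    z:       game outcome, unchanged by the symmetry.
    """
    examples = []
    for k in range(4):                                        # four rotations
        b, p = np.rot90(board, k), np.rot90(pi_grid, k)
        examples.append((b, p, z))
        examples.append((np.fliplr(b), np.fliplr(p), z))      # mirrored copy
    return examples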
Then, we update our neural network using the new training examples, obtaining a new network. We play the old and new networks against each other for a number of games (40 in our experiments). If the new network wins more than a set threshold fraction of these games (60% in our experiments), we adopt the new network and continue with the next iteration, resetting the MCTS tree. Otherwise, we continue with the old network and the old MCTS tree, and conduct another iteration to augment our training examples further. Experimentally, we find that when the new network is not better than the old one, the network obtained after a further iteration of training is far better; hence, in one or two iterations we almost always improve our network. In our experiments, the temperature parameter τ is set to 1 for the first 25 turns of an episode, to encourage early exploration, and then set to 0. It is always set to 0 during evaluation. Pseudocode of the policy iteration algorithm is provided in Algorithms 2 and 3.
Algorithm 1 Monte Carlo Tree Search
1: procedure MCTS(s, θ)
2:   if s is terminal then
3:     return game_result
4:   if s ∉ Tree then
5:     Tree ← Tree ∪ {s}
6:     Q(s, ·) ← 0
7:     N(s, ·) ← 0
8:     P(s, ·) ← p_θ(s)
9:     return v_θ(s)
10:  else
11:    a ← argmax_{a′ ∈ A} U(s, a′)
12:    s′ ← getNextState(s, a)
13:    v ← MCTS(s′, θ)
14:    Q(s, a) ← (N(s, a) · Q(s, a) + v) / (N(s, a) + 1)
15:    N(s, a) ← N(s, a) + 1
16:    return v

Algorithm 2 Policy Iteration through Self-Play
1: procedure PolicyIterationSP
2:   θ ← initNN()
3:   trainExamples ← []
4:   for i in [1, ..., numIters] do
5:     for e in [1, ..., numEpisodes] do
6:       ex ← executeEpisode(θ)
7:       trainExamples.append(ex)
8:     θ_new ← trainNN(trainExamples)
9:     if θ_new beats θ ≥ thresh then
10:      θ ← θ_new
11:  return θ

Algorithm 3 Execute Episode
1: procedure ExecuteEpisode(θ)
2:   examples ← []
3:   s ← gameStartState()
4:   while True do
5:     for i in [1, ..., numSims] do
6:       MCTS(s, θ)
7:     examples.add((s, π_s, _))
8:     a* ∼ π_s
9:     s ← gameNextState(s, a*)
10:    if gameEnded(s) then
11:      // fill _ in examples with the reward
12:      examples ← assignRewards(examples)
13:      return examples
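For readers who prefer code to pseudocode, the following is a minimal Python rendering of Algorithm 3. The game and MCTS interfaces used here (getInitBoard, getCanonicalForm, getNextState, getGameEnded, getActionProb) are hypothetical names chosen for illustration, not the exact API of our implementation.

import numpy as np

def execute_episode(game, mcts, temp_threshold=25):
    """Play one self-play game and return (state, pi, z) training examples."""
    examples = []                 # (canonical board, player, pi); z is filled in at the end
    board = game.getInitBoard()
    player, turn = 1, 0

    while True:
        turn += 1
        tau = 1 if turn <= temp_threshold else 0          # exploration schedule from Section 3.3
        canonical = game.getCanonicalForm(board, player)
        pi = mcts.getActionProb(canonical, temp=tau)      # runs numSims MCTS simulations internally
        examples.append((canonical, player, pi))
        action = np.random.choice(len(pi), p=pi)
        board, player = game.getNextState(board, player, action)

        result = game.getGameEnded(board, player)         # 0 while the game is still running
        if result != 0:
            # assign +1/-1 to each stored position from that position's own perspective
            return [(s, p, result * (+1 if pl == player else -1))
                    for (s, pl, p) in examples]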
4 Experiments

The sections above describe a general approach to game playing. In our experiments, we specifically tackled the problem of learning to play the game of Othello. Othello is traditionally played on an 8x8 board, and the size of the state space is exponential in the size of the board. Experimentally, we found that converging to an optimal policy on the 8x8 board with limited computing resources would take a very long time. To show the effectiveness of our approach, we therefore also ran experiments on a 6x6 version of Othello.³ The 8x8 version was trained with 50 simulations of the MCTS per step, while the 6x6 version was trained with 25. Both were trained on training examples from 100 episodes per training iteration. The 6x6 version completed 78 iterations of training, while the 8x8 version completed 30. Both were trained for over 72 hours on a Google Compute Engine instance with a GPU.

³ Environment adapted from https://github.com/JaimieMurdock/othello
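For convenience, the training settings reported above can be gathered into a single configuration object; the snippet below merely restates those numbers, and the key names are our own rather than taken from our codebase.

# Settings reported in Sections 3.2-4 (key names are illustrative).
TRAINING_CONFIG = {
    "6x6": {"mcts_sims_per_move": 25, "episodes_per_iteration": 100,
            "iterations_completed": 78},
    "8x8": {"mcts_sims_per_move": 50, "episodes_per_iteration": 100,
            "iterations_completed": 30},
    "shared": {"arena_games": 40, "accept_threshold": 0.60,
               "c_puct": 1.0, "temperature_turns": 25, "batch_size": 64},
}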
4.1 Baselines

We implemented two baselines for comparison with our trained AI player. The first is a greedy player, which always chooses the move that flips the maximum number of pieces in the next step of the game. The second is a random player, which chooses uniformly at random from the valid moves at each step of the game.

We also used a minimax agent⁴ as a third baseline. It tries to maximize its worst-case gain, assuming that the opponent plays perfectly at each move, by exploring the game tree up to a certain depth. The results against the different baselines are listed in Table 1.

⁴ From https://github.com/Zolomon/reversi-ai
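The two simple baselines can be written in a few lines; the game interface assumed here (getValidMoves, countFlips) is for illustration only and is not the exact API of the environment we adapted.

import random

def random_move(game, board, player):
    """Random baseline: pick any valid move uniformly at random."""
    return random.choice(game.getValidMoves(board, player))

def greedy_move(game, board, player):
    """Greedy baseline: pick the move that flips the most opponent pieces right now."""
    return max(game.getValidMoves(board, player),
               key=lambda move: game.countFlips(board, player, move))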
4.2 Human Evaluation

We also implemented an interface through which a human player can play against any of our baselines or our learned strategies. For the 6x6 version, we evaluated our bot against a local player who has been playing Othello since childhood. These results are also reported in Table 1. Since the 8x8 version took much longer to train, we did not get a chance to evaluate it against humans.

4.3 Analysis of Experiments

We analyze our performance as a function of training time. In Figure 1 and Figure 2, we plot our performance against the greedy and random baselines as a function of the number of training iterations. As we see, these simple baselines are quickly beaten by the 6x6 version within a few iterations. However, learning a good agent for the 8x8 board is much more difficult. Evaluating performance against a more sophisticated baseline would help us decide when our model has converged and we can stop training.
Figure 1: Performance against random and greedy baselines over 30 iterations (6x6)

Figure 2: Performance against random and greedy baselines over 30 iterations (8x8)

Baseline   6x6 board   8x8 board
Greedy     20/20       20/20
Random     20/20       18/20
Minimax    30/30       29/30
Human      6/6         -

Table 1: Number of games won against various baselines by our final models
Aside from the comparisons against baselines, we follow the approach of (Silver et al. 2016) in analyzing the games played by our agent, to try to understand its strategies. In Figure 3 we examine our agent's early game strategy against the minimax bot. Our agent is black, while the opponent is white, and boards are shown after each move made by our agent. We see that the strategy adopted is to quickly grow towards the walls and corners and capture the pieces there. This is indeed a strong high-level strategy that human players use, since pieces at corners and walls are very difficult for the opponent to flip. It is quite remarkable that our agent is able to display such subtle strategies through self-play, even against a strong minimax opponent.

In Figure 4, we examine some late-game moves of our agent against the minimax strategy. We observe that 4 moves before the end, in terms of the number of pieces, we do not appear to be performing significantly differently from our opponent. However, by the endgame our agent has learned to position its pieces very strategically. Instead of placing a piece where it would maximize the number of flips in one move (as the greedy baseline would do), it places its pieces in such a way that the opponent has no moves left and is forced to pass. Hence, it can quickly cover a larger portion of the board while the opponent cannot move, and thus completely dominate the board by the end of the game.

5 Conclusions

We implement an agent that learns to play Othello through pure self-play, without using any human knowledge. Our agent convincingly beats all baselines, including the greedy, random, and standard alpha-beta minimax AI baselines. Further, the time taken to make a move is much lower for our agent, since it requires only a feedforward pass through a neural network, whereas the minimax algorithm must explore an exponential state space to a large depth to get good results. As seen in Section 3, our framework is very generic in its implementation, and can easily be extended to many other games such as Chess or Go.

The original implementation by DeepMind (Silver et al. 2017b) uses orders of magnitude more raw computational power on industry hardware (4 TPUs, 64 GPUs, and 19 CPUs, for several days). In our work, we show that it is possible to train similar networks on commodity hardware for smaller problems. We plan to release our implementation for the open source community.
Figure 3: Early game play of our agent (B) vs minimax (W), capturing walls and corners

Figure 4: Late game play of our agent (B) vs minimax (W), forcing passes

References

[Browne et al. 2012] Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

[Campbell, Hoane, and hsiung Hsu 2002] Campbell, M.; Hoane, A.; and hsiung Hsu, F. 2002. Deep Blue. Artificial Intelligence 134(1):57–83.

[Gelly and Silver 2008] Gelly, S., and Silver, D. 2008. Achieving master level play in 9 x 9 computer Go. In AAAI, volume 8, 1537–1540.

[Heinz 2001] Heinz, E. A. 2001. New Self-Play Results in Computer Chess. Berlin, Heidelberg: Springer Berlin Heidelberg. 262–276.

[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 448–456.

[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Nijssen 2007] Nijssen, J. 2007. Playing Othello using Monte Carlo. Strategies 1–9.

[Samuel 2000] Samuel, A. L. 2000. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 44(1.2):206–226.

[Silver et al. 2016] Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

[Silver et al. 2017a] Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T.; Simonyan, K.; and Hassabis, D. 2017a. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv e-prints.

[Silver et al. 2017b] Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354–359.

[Srivastava et al. 2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

[Van Der Ree and Wiering 2013] Van Der Ree, M., and Wiering, M. 2013. Reinforcement learning in the game of Othello: Learning against a fixed opponent and learning from self-play. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013 IEEE Symposium on, 108–115. IEEE.

[Wiering 2010] Wiering, M. A. 2010. Self-play and using an expert to learn to play backgammon with temporal difference learning. Journal of Intelligent Learning Systems and Applications 2(02):57.
