Reinforcement Learning - Playing Tic-Tac-Toe (Pre-Print)
Jocelyn Ho
Georgia Institute of Technology
[email protected]
Jeffrey Huang*
Pacific American School
[email protected]
Benjamin Chang
Johns Hopkins University
[email protected]
Allison Liu
Emory University
[email protected]
Zoe Liu
Emory University
[email protected]
I. Abstract
Machine learning constructs computer systems that develop through experience.
Its applications span disciplines in daily life, ranging from malware filtering to image
recognition. Recent research has shifted towards maximizing efficiency in decision-making,
creating algorithms that quickly and accurately process patterns to generate insight. This research
focuses on reinforcement learning, a paradigm of machine learning that makes decisions through
maximizing reward. Specifically, we use Q-learning – a model-free reinforcement learning
algorithm – to assign scores for different decisions given the unique states of the problem.
Widyantoro et al. (2009) studied the effect of Q-learning on learning to play Tic-Tac-Toe.
However, that study yielded a win/tie rate of less than 50 percent, which we believe does not
fully exploit the benefits of Q-learning. In the same environment, this research aims to close
the gaps in the effectiveness of Q-learning while minimizing human input. Training was carried
out by setting the epsilon value to 0.9 to ensure randomness, then decreasing it at a constant
rate as the number of explored states increased. The
program played 300,000 games against its previous version, eventually securing a win/tie rate of
approximately 90 percent. Future directions include improving the efficiency of Q-learning
algorithms and applying the research in practical fields.
II. Introduction
( I ) Background
AlphaGo is the first computer to defeat a human professional player in Go, a
board game that contains 10^172 possible position combinations (Silver et al., 2016).
AlphaGo continued to improve by training against artificial and human intelligence,
eventually defeating first-tier players around the world (DeepMind, 2021). AlphaGo
opened new avenues for artificial intelligence to advance in disciplines dominated by
humans. Following its example, artificial intelligence can be applied to fields that once
existed only in fiction, such as autonomous robots, chatbots, and trading systems (Ritthaler,
2018).
( II ) Motivation
Given the trend of artificial intelligence like AlphaGo, we intend to solve
real-world problems through a similar automated approach. Board games are among the simplest
means to train a computer because they provide a well-defined environment with fixed rules and
a bounded set of possible outcomes (Ritthaler, 2018). Under a perfect-information setting,
artificial intelligence can calculate exact outcomes and pursue a goal of interest by
determining how to maximize utility. Among the multitude of existing board games, Tic-Tac-Toe
proved to be the simplest and most comprehensible. Though its number of possible positions
(roughly 360,000) is far smaller than that of Go, Tic-Tac-Toe still provides enough complexity
and nuance to be tackled with learning approaches such as reinforcement learning (Ritthaler,
2018).
state space and has no evaluation functions that are sufficient for an immense amount of
outcomes. Given the aforementioned flaws of these algorithms, we intend to use
reinforcement learning, an approach that automatically balances exploration of new
pathways with exploitation of acquired knowledge, to train the computer to play
Tic-Tac-Toe (Friedrich, 2018).
Through reinforcement learning, the computer will optimize rewards through
interaction with the environment and update itself with a better prediction based on its
experience. In Q-learning, each action receives a reward based on its outcome, and a
value is computed from the current state together with the best action available in the
resulting state (Watkins, 1992). These quantities allow the machine to calculate an updated
value and repeat the process until the game terminates. After many consecutive simulations,
the machine encounters a wide range of patterns, allowing it to accurately estimate the
probability of winning the game (Watkins, 1992). We aim to demonstrate a high success rate
through Q-learning and yield a win/tie rate above that reported in the existing literature.
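For reference, the value-update process described above corresponds to the standard tabular Q-learning rule of Watkins (1992), written here in LaTeX notation; the learning rate \alpha and discount factor \gamma are the usual hyperparameters of the rule, not values reported by this study:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

Here s_t and a_t denote the board state and the chosen move at step t, r_{t+1} is the reward observed after the move, and s_{t+1} is the resulting state; a higher Q value marks a more promising move.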
In our research, the Bellman equation is used in the Q-learning model. In the model, the
machine assigns the Q value, which is the expected reward with a higher value being more
desirable, to every state-action pair. The Q values are updated iteratively based on the current
state, future states, and potential actions. In Tic-Tac-Toe, the state is the board position while the
action is the game move. At the end of each match, the result is associated with the move that
caused the result. The machine then works back through the game history recursively and
updates the Q values. Next, the epsilon-greedy strategy is employed to ensure the agent's
familiarity with all game moves instead of only reinforcing Q values that are already high.
The epsilon-greedy strategy balances the agent's exploration and exploitation by choosing
between two options at each turn: with probability 1-ε, the move with the highest Q value in
the Q-table, and with probability ε, a random move.
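A minimal sketch of this move-selection and backward-update scheme is shown below. It assumes the Q-table is a Python dictionary keyed by (state, move) pairs; the function names, default learning rate, and discount factor are illustrative assumptions rather than the exact details of our implementation.

import random

def choose_move(q_table, state, legal_moves, epsilon):
    # Epsilon-greedy selection: explore with probability epsilon,
    # otherwise exploit the highest-valued known move.
    if random.random() < epsilon:
        return random.choice(legal_moves)  # exploration: random legal move
    # Exploitation: legal move with the largest stored Q value
    # (unseen state-move pairs default to 0.0).
    return max(legal_moves, key=lambda move: q_table.get((state, move), 0.0))

def update_from_game(q_table, history, final_reward, alpha=0.1, gamma=0.9):
    # Walk the recorded (state, move) history backwards, propagating
    # the terminal reward toward the earlier moves of the game.
    target = final_reward
    for state, move in reversed(history):
        old_value = q_table.get((state, move), 0.0)
        q_table[(state, move)] = old_value + alpha * (target - old_value)
        target = gamma * q_table[(state, move)]

In practice, choose_move is called on every turn during training, while update_from_game is called once per finished game with the reward for a win, draw, or loss.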
Similarly, the opponent in the game was coded with the same Q-learning algorithm to
yield better results. Unlike the player, however, the opponent does not update its Q-table
during play. Instead, the opponent copies the agent's Q-table once every 1,000 episodes,
which improves the efficiency of the agent's policy.
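The sketch below outlines this self-play arrangement. The agent and opponent objects, their makeMove and updateQ_Table methods (named after the appendix excerpt), the q_table attribute, and the board constructor are assumptions made for illustration rather than a verbatim extract of our code.

import copy

def self_play_training(agent, opponent, new_board, num_episodes=300000, sync_interval=1000):
    # Train the agent against a frozen copy of itself. The opponent never
    # learns during play; it only receives a copy of the agent's Q-table
    # once every sync_interval episodes.
    for episode in range(1, num_episodes + 1):
        board = new_board()
        # Agent and opponent alternate moves until the game ends.
        while not board.isEndGame():
            board = agent.makeMove(board)
            if board.isEndGame():
                break
            board = opponent.makeMove(board)
        # Back up the final result through the agent's recorded history.
        agent.updateQ_Table()
        if episode % sync_interval == 0:
            opponent.q_table = copy.deepcopy(agent.q_table)

Freezing the opponent between synchronizations gives the agent a stable target to learn against, while the periodic copy keeps the opponent's strength growing alongside the agent's.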
Figure 1 illustrates that, after playing 300,000 episodes, the agent wins an average of about
75% of games, draws about 15%, and loses roughly 10% of the games played.
( II ) Suggestions
Future research will focus on improving the machine's learning strategies, as well
as the agent's winning rate. We look forward to discovering a strategy
VII. Acknowledgements
The authors thank Benjamin Chang and Jeffrey Huang from Pacific American School for
their unwavering support in co-authoring and providing constructive feedback and review.
VIII. References
1. DeepMind. (2021). AlphaGo: The story so far. DeepMind. Retrieved September 23,
2021, from https://deepmind.com/research/case-studies/alphago-the-story-so-far.
2. Even-Dar, E. (2001, January 1). Convergence of optimistic and incremental Q-learning.
Proceedings of the 14th International Conference on Neural Information Processing
Systems: Natural and Synthetic. Retrieved September 23, 2021, from
https://dl.acm.org/doi/abs/10.5555/2980539.2980734.
3. Friedrich, C. (2018, July 20). Part 3 - tabular Q learning, a tic tac toe player that gets
better and better. Medium. Retrieved September 23, 2021, from
https://medium.com/@carsten.friedrich/part-3-tabular-q-learning-a-tic-tac-toe-player-that
-gets-better-and-better-fa4da4b0892a.
4. Hu, J., & Wellman, M. P. (2003). Nash Q-Learning for General-Sum Stochastic
Games. Journal of Machine Learning Research. Retrieved September 23, 2021, from
https://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf.
5. Morris, F. L., & Jones, C. B. (1984). An Early Program Proof by Alan Turing. Early
Proof. Retrieved September 23, 2021, from
http://www.cs.tau.ac.il/~nachumd/term/EarlyProof.pdf.
6. Shannon, C. E. (1949). XXII. Programming a computer for playing chess. Taylor &
Francis. Retrieved September 23, 2021, from
https://www.tandfonline.com/doi/abs/10.1080/14786445008521796?journalCode=tphm18.
7. Silver, D., et al. (2016, January 27). Mastering the game of Go with deep neural networks
and tree search. Nature. Retrieved September 23, 2021, from
https://www.nature.com/articles/nature16961.
8. Sutton, R. S., & Barto, A. G. (2020). Reinforcement Learning: An Introduction. The MIT
Press. Retrieved September 23, 2021, from
http://incompleteideas.net/book/RLbook2020.pdf.
9. Torres, J. (2021, May 10). The bellman equation. Medium. Retrieved September 23,
2021, from https://towardsdatascience.com/the-bellman-equation-59258a0d3fa7.
10. Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Kluwer Academic
Publishers. Retrieved September 23, 2021, from
https://link.springer.com/content/pdf/10.1007/BF00992698.pdf.
11. Watkins, C. (1989). Learning from delayed rewards. ResearchGate. Retrieved September
23, 2021, from
https://www.researchgate.net/publication/33784417_Learning_From_Delayed_Rewards.
12. Wunder, M., Littman, M., & Babes, M. (2010). Classes of Multiagent Q-learning
Dynamics with ε-greedy Exploration. International Conference on Machine Learning.
Retrieved September 23, 2021, from
https://icml.cc/Conferences/2010/papers/191.pdf.
13. YouTube. (2018). Using Q-Learning and Deep Learning to Solve Tic-Tac-Toe. YouTube.
Retrieved September 23, 2021, from https://www.youtube.com/watch?v=4C133ilFm3Q.
IX. Appendix: Code Excerpt

# Record the evaluated win/draw/lose rates for this checkpoint.
lose_rate.append(lose)
x_axis.append(episode)
print("testing rates... odds = " + str(odds) + "% draw = " + str(draw)
      + "% lose = " + str(lose) + "%")

# Restore the exploration rate and play one training game.
AI.setEpsilon(epsilon)
while not tictactoe.isEndGame():
    tictactoe = AI.makeMove(tictactoe)
    AI.updateQ_Table()
    if tictactoe.hasALine():
        break

print("training completed")
print(max(winning_rate))