
Reinforcement Learning: Playing Tic-Tac-Toe

Jocelyn Ho
Georgia Institute of Technology
[email protected]

Jeffrey Huang*
Pacific American School
[email protected]

Benjamin Chang
Johns Hopkins University
[email protected]

Allison Liu
Emory University
[email protected]

Zoe Liu
Emory University
[email protected]

I. Abstract
Machine learning constructs computer systems that improve through experience.
Its applications span disciplines in daily life ranging from malware filtering to image
recognition. Recent research has shifted towards maximizing efficiency in decision-making,
creating algorithms that quickly and accurately process patterns to generate insight. This research
focuses on reinforcement learning, a paradigm of machine learning that makes decisions through
maximizing reward. Specifically, we use Q-learning – a model-free reinforcement learning
algorithm – to assign scores for different decisions given the unique states of the problem.
Widyantoro et al. (2009) studied the effect of Q-learning on learning to play Tic-Tac-Toe.
However, that study yielded a win/tie rate of less than 50 percent, which we believe does not
fully exploit the benefits of Q-learning. In the same environment, this research aims to close
the gap in the effectiveness of Q-learning while minimizing human input. Training began with
the epsilon value set to 0.9 to ensure randomness; epsilon was then adjusted at a steady rate to
reduce exploration as more states were visited. The
program played 300,000 games against its previous version, eventually securing a win/tie rate of
approximately 90 percent. Future directions include improving the efficiency of Q-learning
algorithms and applying the research in practical fields.

II. Introduction

( I ) Background
AlphaGo was the first computer program to defeat a human professional player in Go, a
board game with roughly 10^172 possible position combinations (Silver et al., 2016).
AlphaGo continued to improve by training against artificial and human intelligence,
eventually defeating first-tier players around the world (Deepmind, 2021). AlphaGo
opens new avenues for artificial intelligence to advance in disciplines dominated by
humans. With the example of AlphaGo, artificial intelligence can be applied to fields
existing in fiction, such as autonomous robots, chatbots, and trading systems (Ritthaler,
2018).

( II ) Motivation
Given the trend of artificial intelligence like AlphaGo, we intend to solve
real-world problems through a similar automated approach. Board games are among the
simplest means of training a computer because they provide a well-defined environment with
fixed rules and a finite set of possible outcomes (Ritthaler, 2018). Under a perfect information
system, artificial intelligence can calculate exact outcomes and pursue a goal of interest by
figuring out how to maximize utility. Among the multitude of existing board games,
Tic-Tac-Toe proved to be the simplest and most comprehensible. Though its possible positions
(roughly 360,000) are far fewer than those of Go, Tic-Tac-Toe still provides enough complexity
and nuance to be tackled with learning approaches such as reinforcement learning (Ritthaler,
2018).

( III ) Research Context and Goals


There are at least two methods for a computer to reach an outcome from previous
actions: decision trees and reinforcement learning (Sutton et al., 2018). Decision trees are
graphs whose branches represent specific conditions, with outcomes located at their ends
(Sutton et al., 2018). Each node in the tree corresponds to the action taken when its condition
is met. However, a decision tree requires enumerating all possible combinations, which takes
an enormous amount of memory. Hence, many practitioners use an extended version, the
minimax algorithm, to estimate the quality of an action by backtracking from results: the
opponent is assumed to play the optimal move, and a value is assigned to each outcome
(Sutton et al., 2018). The depth of the tree depends on the number of moves; to keep the
search manageable, an evaluation function applied at a limited depth is used to estimate the
final value of the match. Nevertheless, the algorithm still requires a massive state space, and
no evaluation function handles such an immense number of outcomes well. Given the
aforementioned flaws of these algorithms, we intend to use reinforcement learning, an
algorithm that automatically finds a balance between exploration of pathways and exploitation
of knowledge, to train the computer to play Tic-Tac-Toe (Friedrich, 2018).
Through reinforcement learning, the computer will optimize rewards through
interaction with the environment and update itself with a better prediction based on its
experience. In Q-learning, each action receives a reward based on the outcome, and a value is
calculated from the current state and the best action available in the resulting state
(Watkins, 1992). These quantities allow the machine to compute a new value and repeat the
process until the game terminates. After many consecutive simulations, the machine
encounters many different patterns, allowing it to estimate the probability of winning the
game accurately (Watkins, 1992). We aim to demonstrate a high success rate through
Q-learning and to achieve a win/tie rate above that reported in the existing literature.

III. Literature Review


Many studies of machine learning in simple board games have appeared since the advent of
computer-operated game programs, most notably in chess. In 1949, Claude Shannon began
developing a computer chess program (Shannon, 1950). Building on Shannon's work, Alan
Turing developed a computer-simulated checkers player (Morris and Jones, 1984). In 1966,
Greenblatt's MacHack 6 (Greenblatt, 1967) became the first computer chess program to defeat
a human player in a tournament. The program uses a search-based approach in which each
state is a board configuration and the operators are all potential moves. It searches a game tree
to a depth of four levels and chooses the moves that maximize a specific utility function.
In 1989, Watkins first introduced Q-learning, a model-free reinforcement learning
algorithm (Watkins, 1989). Since the introduction of the algorithm, many studies have built
upon it, such as those of Even-Dar and Mansour (2001) and Hu and Wellman (2003).
Several previous studies have covered the use of reinforcement learning in simple board
games. Widyantoro et al. (2009) applied a Q-learning algorithm to play Tic-Tac-Toe. In that
study, a new update rule was established by updating the Q-value only when transitioning from
the final move back to the first move. Though its partial-board representation yields results
comparable to those of a human player, its full-board representation achieves a win/tie rate of
less than 50%. Thus, our study implements a new design for the reward setup and uses
optimistic initialization to encourage the agent's learning and improve training efficiency.
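Optimistic initialization here simply means giving every untried action a starting value above the neutral default of zero, so that unexplored moves remain attractive early in training. A minimal sketch (the constant 1 mirrors the initial action value used in our listing in Appendix A):

    def new_q_row(num_actions, optimistic_value=1):
        # Each untried action starts with a positive value, so a greedy policy is
        # still drawn toward unexplored moves until their estimates are updated.
        return [optimistic_value] * num_actions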


IV. Research Methods


In reinforcement learning, no prior information of the potential value of actions is given.
The essential goal is to maximize the cumulative rewards of tasks through exploring the
environment and exploiting learned knowledge. In Tic-Tac-Toe, the learning agent repeatedly
plays a standard game on a 3 x 3 board, where a player wins by placing three X's or O's in a row
diagonally, vertically, or horizontally. The learning agent must weigh both the immediate reward
and subsequent future rewards, since much of the reward it seeks is delayed.

In a reinforcement learning environment, the learning agent's policy is a learned strategy
dictating the agent's actions as a function of the environment and its current state. Reward
signals are the numerical rewards received after each action. Agents alter their policy to
maximize these signals: a strategy is formed by prioritizing actions that lead to high future
rewards. With an eventual reward of +1 for a win and a punishment of -1 for a loss, the agent's
policy shifts to avoid the actions that lead to the low reward.
A finite Markov Decision Process consists of the interactions between the agent and the
environment, described by actions, states, and rewards. The process can be represented as a
stochastic sequence of possible actions that motivate the agent to seek various rewards, which
leads to outcomes that are partly random and partly under the machine's control. It is evaluated
through a value function at each time step t. In the current state s, the agent chooses an action a.
The system then moves to the next state s', which depends only on the current state and the
action, and a corresponding reward r is given after the action. The probability of reaching state
s' is described by the transition function p_a(s, s'). The Bellman equation expresses the
relationship between the value of the current state and the values of its successor states: in state
s, action a is selected by the policy π, and the transition function p then determines the
corresponding reward r and the next state s'. By weighting the possible successor states by their
probability of occurring and averaging over them, the Bellman equation combines the immediate
reward for taking action a in state s with the maximum expected reward obtainable from the
next state.
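For reference, this relationship can be written compactly for state-action values. The following is a standard textbook form rather than the exact expression used in our program, with γ denoting the discount factor and r(s, a, s') the reward received for the transition:

    Q^{*}(s, a) = \sum_{s'} p_a(s, s') \left[ r(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]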


In our research, the Bellman equation is used in the Q-learning model. In the model, the
machine assigns the Q value, which is the expected reward with a higher value being more
desirable, to every state-action pair. The Q values are updated iteratively based on the current
state, future states, and potential actions. In Tic-Tac-Toe, the state is the board position while the
action is the game move. At the end of each match, the result is associated with the move that
caused the result. The machine will then work back the game history recursively and update the
Q values. Next, the epsilon-greedy strategy is employed to ensure the agent's familiarity with all
game moves instead of only reinforcing Q values that are already high. The epsilon-greedy
strategy balances exploration and exploitation: with probability 1-ε the agent exploits, playing
the move with the highest Q value in the Q-table, and with probability ε it explores by playing a
random move.
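As a minimal sketch of these two ingredients (the names q_update, epsilon_greedy, alpha, and gamma are illustrative rather than taken from our program; the default alpha and gamma mirror the agent's learning rate and discount factor, and the full implementation appears in Appendix A, where epsilon is tracked as the probability of exploiting rather than exploring):

    import random

    def q_update(q_table, state, action, reward, new_state, alpha=0.4, gamma=0.9):
        # One tabular Q-learning step: move Q(state, action) toward the Bellman target.
        best_next = max(q_table[new_state]) if new_state in q_table else 0.0
        target = reward + gamma * best_next
        q_table[state][action] += alpha * (target - q_table[state][action])

    def epsilon_greedy(q_table, state, legal_moves, epsilon):
        # Explore with probability epsilon; otherwise play a best-valued legal move.
        if state not in q_table or random.random() < epsilon:
            return random.choice(legal_moves)
        return max(legal_moves, key=lambda a: q_table[state][a])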

V. Data Analysis and Results


Through reinforcement learning, the Q-learning algorithm, implemented in Python, allows an
agent to find the actions that yield the greatest reward. When the agent explores, it plays a
random move. After each move, a reward is collected and a Q value is computed for the given
state and action with the Q-learning equation. These values are stored in the Q-table, where each
entry corresponds to a board state. The discount factor and the learning rate affect how quickly
the winning and tie rates converge. By following the policy that yields the most reward, the
machine learns from the Q-table.
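To illustrate the storage scheme, the following simplified sketch keeps the Q-table as a dictionary keyed by the board state, whereas the program in Appendix A keeps visited states in a parallel list called tree:

    import numpy as np

    q_table = {}  # board state -> one estimated value per board cell

    def get_row(state):
        # Encode the state as a hashable key; unseen states receive a fresh row of
        # optimistic values (1.0), matching the initialization described earlier.
        key = tuple(state)
        if key not in q_table:
            q_table[key] = np.ones(9)
        return q_table[key]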
As the computer explores the possibilities, its exploration rate decreases to encourage the agent
to play the optimal action while still ensuring a certain degree of exploration. Using a randomly
generated decimal between 0 and 1, the computer determines whether it plays under the trained
policy: it plays a random move if the generated number falls within the epsilon value, and
otherwise it follows the learned Q-values.
For the first one thousand episodes, we set epsilon to 0.9 to let the computer generate random
moves, producing enough data to populate the Q-table; the table is updated with the Bellman
equation. Epsilon is then adjusted by ((1 - epsilon) * 10) divided by the total number of episodes
after each game to reduce the rate of exploration of the board, giving the agent a greater
probability of playing the optimal policy. By evaluating the values computed by the Q-learning
equation, the algorithm enables the computer to play strategically, following the policy that
yields the greatest rewards.
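A sketch of this schedule as it appears in the listing (in Appendix A, epsilon is implemented as the probability of exploiting the learned policy, so reducing exploration corresponds to nudging it toward 1 by the increment quoted above):

    def update_epsilon(epsilon, total_episodes):
        # epsilon here is the probability of playing greedily; pushing it toward 1
        # shrinks the exploration rate (1 - epsilon) by the paper's increment.
        return epsilon + ((1 - epsilon) * 10) / total_episodes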


Similarly, the opponent in the game was coded with the same Q-learning algorithm to provide a
stronger training partner. Unlike the player, however, the opponent does not update its Q-table
during play. Instead, the opponent's Q-table is synchronized with the agent's Q-table once every
1,000 episodes to improve the efficiency of the agent's policy.
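Schematically, the training procedure described here looks like the following (a simplified sketch assuming the TicTacToe and Player objects from Appendix A have already been created; the evaluation games played every 1,000 episodes are omitted):

    for episode in range(1, num + 1):
        if episode % 1000 == 0:
            NPC.q_table = AI.q_table  # the opponent adopts the agent's current Q-table
        while not tictactoe.isEndGame():
            tictactoe = AI.makeMove(tictactoe)
            AI.updateQ_Table()        # only the agent learns during play
            if tictactoe.hasALine():
                break
            tictactoe = NPC.makeMove(tictactoe)
        tictactoe.resetBoard()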

Figure 1 illustrates that after playing 300,000 episodes, the agent wins about 75%, ties about
15%, and loses about 10% of the games played.

VI. Conclusion and Suggestions


( I ) Conclusion
Our research has shown that a learning agent, via reinforcement learning, can
master playing simple games such as Tic-Tac-Toe with a high winning rate after
receiving a sufficient amount of training. Throughout our experiment, our agent has
developed its playing strategy by utilizing the Bellman equation and the Q-learning
algorithm to recall its previous moves and discover the optimal Tic-Tac-Toe action. Based on
the results of our experiment, the agent achieves around a 90% win/tie rate after 300,000
episodes of training. We believe the win rate could climb closer to 100% with a better-tuned
Q-function and more training episodes.

( II ) Suggestions
Future research will focus on improving the machine's learning strategies, as well
as the resulting winning rate of the agent. We look forward to discovering a strategy
that maximizes learning without relying on long-term training, which would be more
efficient because machines would reach the same performance in far less time.
We wish to apply our research on adjusting the learning rate and decaying the
exploration rate, and on their effects on the agent's abilities, not only to Tic-Tac-Toe
but also to similar games that can be advanced with reinforcement learning. Though the
states, actions, and rewards vary by program, Q-learning remains fundamentally
model-free, so the research can be applied to a wide variety of programs, ranging from
games to other fields where reliable learning without explicit supervision is essential to
quick operation and strong results. Finance, business, medicine, and industrial robotics
are some examples of such fields.

VII. Acknowledgements
The authors thank Benjamin Chang and Jeffrey Huang from Pacific American School for
their unwavering support in co-authoring and providing constructive feedback and review.

VIII. References

1. DeepMind. (2021). AlphaGo: The story so far. Retrieved September 23, 2021, from
https://deepmind.com/research/case-studies/alphago-the-story-so-far
2. Even-Dar, E., & Mansour, Y. (2001). Convergence of optimistic and incremental Q-learning.
In Proceedings of the 14th International Conference on Neural Information Processing
Systems: Natural and Synthetic. Retrieved September 23, 2021, from
https://dl.acm.org/doi/abs/10.5555/2980539.2980734
3. Friedrich, C. (2018, July 20). Part 3 - Tabular Q-learning, a Tic-Tac-Toe player that gets
better and better. Medium. Retrieved September 23, 2021, from
https://medium.com/@carsten.friedrich/part-3-tabular-q-learning-a-tic-tac-toe-player-that-gets-better-and-better-fa4da4b0892a
4. Hu, J., & Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games.
Journal of Machine Learning Research. Retrieved September 23, 2021, from
https://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf
5. Morris, F. L., & Jones, C. B. (1984). An early program proof by Alan Turing. Retrieved
September 23, 2021, from https://www.cs.tau.ac.il/~nachumd/term/EarlyProof.pdf
6. Shannon, C. E. (1950). XXII. Programming a computer for playing chess. Philosophical
Magazine. Retrieved September 23, 2021, from
https://www.tandfonline.com/doi/abs/10.1080/14786445008521796?journalCode=tphm18
7. Silver, D., et al. (2016, January 27). Mastering the game of Go with deep neural networks
and tree search. Nature. Retrieved September 23, 2021, from
https://www.nature.com/articles/nature16961
8. Sutton, R. S., & Barto, A. G. (2020). Reinforcement learning: An introduction. The MIT
Press. Retrieved September 23, 2021, from http://incompleteideas.net/book/RLbook2020.pdf
9. Torres, J. (2021, May 10). The Bellman equation. Towards Data Science. Retrieved
September 23, 2021, from https://towardsdatascience.com/the-bellman-equation-59258a0d3fa7
10. Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning.
Kluwer Academic Publishers. Retrieved September 23, 2021, from
https://link.springer.com/content/pdf/10.1007/BF00992698.pdf
11. Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation).
Retrieved September 23, 2021, from
https://www.researchgate.net/publication/33784417_Learning_From_Delayed_Rewards
12. Wunder, M., Littman, M., & Babes, M. (2010). Classes of multiagent Q-learning dynamics
with ε-greedy exploration. International Conference on Machine Learning. Retrieved
September 23, 2021, from https://icml.cc/Conferences/2010/papers/191.pdf
13. YouTube. (2018). Using Q-learning and deep learning to solve Tic-Tac-Toe [Video].
Retrieved September 23, 2021, from https://www.youtube.com/watch?v=4C133ilFm3Q

Appendix A: Python Code for Playing Tic-Tac-Toe via Reinforcement Learning

import numpy as np
from numpy.random import randint
import matplotlib.pyplot as plt
import random
import json


def genBoard(dimension):
    # Unused helper that builds a printable character board.
    return [["-" for c in range(dimension)] for r in range(dimension)]


class TicTacToe:
    def __init__(self, dimension):
        self.dimension = dimension
        self.board_size = dimension ** 2
        self.board = np.zeros(self.board_size)
        self.actions = list(np.arange(dimension ** 2))
        self.possible_moves = np.arange(dimension ** 2)
        self.taken_moves = []
        self.current_state = []
        self.is_end_game = False

    def resetBoard(self):
        self.board = np.zeros(self.board_size)
        self.possible_moves = np.arange(self.dimension ** 2)
        self.taken_moves = []
        self.current_state = []
        self.is_end_game = False

    def printBoard(self):
        print(self.board.reshape(self.dimension, self.dimension))

    def placeAction(self, ID, action):
        # Reflect the move on the board and remove it from the legal moves.
        self.board[action] = ID
        self.taken_moves.append(action)
        idx = int(np.where(self.possible_moves == action)[0][0])
        self.possible_moves = np.delete(self.possible_moves, idx)
        self.current_state.append(ID * action)
        # Returns the new state, the reward for the mover, and the action taken.
        return [self.getState(), self.getReward(ID), action]

    def hasALine(self):
        size = self.dimension
        board = self.board.reshape(size, size)
        horizontal_sums = board.sum(axis=1)
        vertical_sums = board.sum(axis=0)
        diag_sum = np.trace(board)
        backDiag_sum = np.trace(np.fliplr(board))
        for i in range(size):
            h = abs(horizontal_sums[i])
            v = abs(vertical_sums[i])
            d = abs(diag_sum)
            b = abs(backDiag_sum)
            if h == size or v == size or d == size or b == size:
                return True
        return False

    def isDraw(self):
        # The game is treated as a draw once all but one cell are filled.
        return np.count_nonzero(self.board) == self.board_size - 1

    def isEndGame(self):
        self.is_end_game = self.isDraw() or self.hasALine()
        return self.is_end_game

    def getReward(self, ID):
        if self.hasALine():
            return 100 * ID
        elif self.isDraw():
            return 5
        else:
            return 0

    def getState(self):
        return np.sort(np.array(self.current_state))

    def getPossibleMoves(self):
        return self.possible_moves

    def getTakenMoves(self):
        return self.taken_moves

    def getBoardSize(self):
        return self.board_size


epsilon = 0.9


class Player:
    def __init__(self, ID, char, epsilon, learningRate, discountFactor):
        self.ID = ID
        self.char = char
        self.q_table = []   # one row of action values per visited state
        self.tree = []      # visited states, indexed in parallel with q_table
        self.epsilon = epsilon
        self.learning_rate = learningRate
        self.discount_factor = discountFactor
        self.current_state = []

    def makeMove(self, tictactoe):
        if tictactoe.isEndGame():
            return tictactoe
        self.current_state = list(tictactoe.getState())[:]  # store the old state
        if self.current_state not in self.tree:
            # Optimistic initialization: every action of a new state starts at 1.
            action_q_values = [1 for _ in range(tictactoe.getBoardSize())]
            self.tree.append(self.current_state[:])
            self.q_table.append(action_q_values)
        if np.random.random() < self.epsilon:
            # Exploit: epsilon here is the probability of playing greedily.
            qT = np.array(self.q_table[self.tree.index(self.current_state)])
            for i in range(len(qT)):
                if i not in tictactoe.getPossibleMoves():
                    qT[i] = -100  # mask illegal moves
            act = np.argmax(qT)
            possible_acts = [i for i in range(tictactoe.getBoardSize()) if qT[i] == qT[act]]
            self.new_state, self.reward, self.action = tictactoe.placeAction(
                self.ID, random.choice(possible_acts))
        else:
            # Explore: play a random legal move.
            self.new_state, self.reward, self.action = tictactoe.placeAction(
                self.ID, random.choice(tictactoe.getPossibleMoves()))
        return tictactoe

    def makeCompMove(self, tictactoe):
        # Always plays a random legal move (used for the evaluation games).
        if tictactoe.isEndGame():
            return tictactoe
        self.new_state, self.reward, self.action = tictactoe.placeAction(
            self.ID, random.choice(tictactoe.getPossibleMoves()))
        return tictactoe

    def setEpsilon(self, e):
        self.epsilon = e

    def setLearningRate(self, lr):
        self.learning_rate = lr

    def setDiscountFactor(self, df):
        self.discount_factor = df

    def updateQ_Table(self):
        # Q-learning update for the last (state, action) pair the agent played.
        old_q_value = self.q_table[self.tree.index(self.current_state)][self.action]
        try:
            idx = self.tree.index(self.new_state)
            value = np.max(self.q_table[idx])  # best value reachable from the new state
        except ValueError:
            value = 0                          # the new state has not been visited yet
        temporal_difference = self.reward + (self.discount_factor * value) - old_q_value
        new_q_value = old_q_value + (self.learning_rate * temporal_difference)
        self.q_table[self.tree.index(self.current_state)][self.action] = new_q_value

    def getID(self):
        return self.ID

    def getCurrentState(self):
        return self.current_state


AI = Player(1, "X", epsilon, 0.4, 0.9)
NPC = Player(-1, "O", epsilon, 0.9, 0.9)

num = 100000
dimension = 3
tictactoe = TicTacToe(dimension)

odds = 0
draw = 0
winning_rate = []
draw_rate = []
lose_rate = []
x_axis = []

for episode in range(1, num + 1):
    if episode % 1000 == 0:
        # The opponent adopts the agent's current Q-table, then the agent is
        # evaluated greedily for 1000 games against a random opponent.
        NPC.q_table = AI.q_table
        odds = 0
        draw = 0
        lose = 0
        with open('qlearning.txt', 'w') as filehandle:
            json.dump(AI.q_table, filehandle)
        AI.setEpsilon(1.0)
        for j in range(1000):
            while not tictactoe.isEndGame():
                tictactoe = AI.makeMove(tictactoe)
                if tictactoe.hasALine():
                    odds += 1
                    break
                tictactoe = NPC.makeCompMove(tictactoe)  # random move
                if tictactoe.hasALine():
                    lose += 1
                    break
            if tictactoe.isDraw():
                draw += 1
            tictactoe.resetBoard()
        odds = (odds / 1000.0) * 100
        draw = (draw / 1000.0) * 100
        lose = (lose / 1000.0) * 100
        winning_rate.append(odds)
        draw_rate.append(draw)
        lose_rate.append(lose)
        x_axis.append(episode)
        print("testing rates... odds = " + str(odds) + "% draw = " + str(draw)
              + "% lose = " + str(lose) + "%")
        AI.setEpsilon(epsilon)

    # One training game: only the agent updates its Q-table.
    while not tictactoe.isEndGame():
        tictactoe = AI.makeMove(tictactoe)
        AI.updateQ_Table()
        if tictactoe.hasALine():
            break
        tictactoe = NPC.makeMove(tictactoe)  # not a random move
    epsilon += ((1 - epsilon) * 10) / num
    AI.setEpsilon(epsilon)
    tictactoe.resetBoard()
    print("game: " + str(episode) + " done")

print("training completed")
print(max(winning_rate))

csfont = {'fontname': 'Comic Sans MS'}

plt.plot(x_axis, winning_rate, label="Win", linewidth=3, color="#A2F350")
plt.plot(x_axis, draw_rate, label="Tie", linewidth=3, color="#1c4285")
plt.plot(x_axis, lose_rate, label="Lose", linewidth=3, color="#F35050")
plt.xlabel("Episodes", **csfont)
plt.ylabel("Rates (%)", **csfont)
plt.title("Rates vs. Episodes", **csfont)
plt.legend(prop={'family': "Comic Sans MS"})
plt.show()
