
Reinforcement Learning: Playing Tic-Tac-Toe

Jocelyn Ho
Georgia Institute of Technology
[email protected]

Jeffrey Huang*
Pacific American School
[email protected]

Benjamin Chang
Johns Hopkins University
[email protected]

Allison Liu
Emory University
[email protected]

Zoe Liu
Emory University
[email protected]

I. Abstract
Machine learning constructs computer systems that improve through experience.
Its applications span disciplines in daily life ranging from malware filtering to image
recognition. Recent research has shifted towards maximizing efficiency in decision-making,
creating algorithms that quickly and accurately process patterns to generate insight. This research
focuses on reinforcement learning, a paradigm of machine learning that makes decisions through
maximizing reward. Specifically, we use Q-learning – a model-free reinforcement learning
algorithm – to assign scores for different decisions given the unique states of the problem.
Widyantoro et al. (2009) studied the effect of Q-learning on learning to play Tic-Tac-Toe.
However, that study yielded a win/tie rate of less than 50 percent, which we believe does not
fully exploit the benefits of Q-learning. In the same environment, this research aims to close
the gap in the effectiveness of Q-learning while minimizing human input. Training began with
the epsilon value set to 0.9 to ensure randomness; epsilon was then adjusted at a steady rate to
reduce exploration as more states were visited. The
program played 300,000 games against its previous version, eventually securing a win/tie rate of
approximately 90 percent. Future directions include improving the efficiency of Q-learning
algorithms and applying the research in practical fields.

II. Introduction

( I ) Background
AlphaGo was the first computer program to defeat a human professional player in Go, a
board game with roughly 10^172 possible position combinations (Silver et al., 2016).
AlphaGo continued to improve by training against artificial and human intelligence,
eventually defeating first-tier players around the world (Deepmind, 2021). AlphaGo
opens new avenues for artificial intelligence to advance in disciplines dominated by
humans. With the example of AlphaGo, artificial intelligence can be applied to fields
existing in fiction, such as autonomous robots, chatbots, and trading systems (Ritthaler,
2018).

( II ) Motivation
Given the trend of artificial intelligence like AlphaGo, we intend to solve
real-world problems through a similar automated approach. Board games are among the
simplest means of training a computer because they provide a well-defined environment with
fixed rules and a finite set of possible outcomes (Ritthaler, 2018). Under a perfect information
system, artificial intelligence can calculate exact outcomes and pursue a goal of interest by
figuring out how to maximize utility. Among the multitude of existing board games,
Tic-Tac-Toe proved to be the simplest and most comprehensible. Though its possible positions
(roughly 360,000) are far fewer than those of Go, Tic-Tac-Toe still provides enough complexity
and nuance to be tackled with learning approaches such as reinforcement learning (Ritthaler,
2018).

( III ) Research Context and Goals


There are at least two methods for a computer to reach an outcome from previous
actions: decision trees and reinforcement learning (Sutton et al., 2018). Decision trees are
graphs whose branches represent specific conditions, with outcomes located at their ends
(Sutton et al., 2018). Each node in the tree corresponds to the action taken when its condition
is met. However, a decision tree requires enumerating all possible combinations, which takes
an enormous amount of memory. Hence, many practitioners use an extended version, the
minimax algorithm, to estimate the quality of an action by backtracking from results: the
opponent is assumed to play the optimal move, and a value is assigned to each outcome
(Sutton et al., 2018). The depth of the tree depends on the number of moves; to keep the
search manageable, an evaluation function applied at a limited depth is used to estimate the
final value of the match. Nevertheless, the algorithm still requires a massive state space, and
no evaluation function handles such an immense number of outcomes well. Given the
aforementioned flaws of these algorithms, we intend to use reinforcement learning, an
algorithm that automatically finds a balance between exploration of pathways and exploitation
of knowledge, to train the computer to play Tic-Tac-Toe (Friedrich, 2018).
Through reinforcement learning, the computer will optimize rewards through
interaction with the environment and update itself with a better prediction based on its
experience. In Q-learning, each action receives a reward based on the outcome, and a value is
calculated from the current state and the best action available in the resulting state
(Watkins, 1992). These quantities allow the machine to compute a new value and repeat the
process until the game terminates. After many consecutive simulations, the machine
encounters many different patterns, allowing it to estimate the probability of winning the
game accurately (Watkins, 1992). We aim to demonstrate a high success rate through
Q-learning and to achieve a win/tie rate above that reported in the existing literature.

III. Literature Review


Many studies of machine learning in simple board games have appeared since the advent of
computer-operated game programs, most notably in chess. In 1949, Claude Shannon began
developing a computer chess program (Shannon, 1950). Building on Shannon's work, Alan
Turing developed a computer-simulated checkers player (Morris and Jones, 1984). In 1966,
Greenblatt's MacHack 6 (Greenblatt, 1967) became the first computer chess program to defeat
a human player in a tournament. The program uses a search-based approach in which each
state is a board configuration and the operators are all potential moves. It searches a game tree
to a depth of four levels and chooses the moves that maximize a specific utility function.
In 1989, Watkins first introduced Q-learning, a model-free reinforcement learning
algorithm (Watkins, 1989). Since the introduction of the algorithm, many studies have built
upon it, such as those of Even-Dar and Mansour (2001) and Hu and Wellman (2003).
Several previous studies have covered the use of reinforcement learning in simple board
games. Widyantoro et al. (2009) applied a Q-learning algorithm to play Tic-Tac-Toe. In that
study, a new update rule was established by updating the Q-value only when transitioning from
the final move back to the first move. Though its partial-board representation yields results
comparable to those of a human player, its full-board representation achieves a win/tie rate of
less than 50%. Thus, our study implements a new design for the reward setup and uses
optimistic initialization to encourage the agent's learning and improve training efficiency.
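Optimistic initialization here simply means giving every untried action a starting value above the neutral default of zero, so that unexplored moves remain attractive early in training. A minimal sketch (the constant 1 mirrors the initial action value used in our listing in Appendix A):

    def new_q_row(num_actions, optimistic_value=1):
        # Each untried action starts with a positive value, so a greedy policy is
        # still drawn toward unexplored moves until their estimates are updated.
        return [optimistic_value] * num_actions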


IV. Research Methods


In reinforcement learning, no prior information of the potential value of actions is given.
The essential goal is to maximize the cumulative rewards of tasks through exploring the
environment and exploiting learned knowledge. In Tic-Tac-Toe, the learning agent repeatedly
plays a standard game on a 3 x 3 board, where a player wins by placing three X's or O's in a row
diagonally, vertically, or horizontally. The learning agent must weigh both the immediate reward
and subsequent future rewards, since much of the reward it seeks is delayed.

In a reinforcement learning environment, the learning agent's policy is a learned strategy
dictating the agent's actions as a function of the environment and its current state. Reward
signals are the numerical rewards received after each action. Agents alter their policy to
maximize these signals: a strategy is formed by prioritizing actions that lead to high future
rewards. With an eventual reward of +1 for a win and a punishment of -1 for a loss, the agent's
policy shifts to avoid the actions that lead to the low reward.
A finite Markov Decision Process consists of the interactions between the agent and the
environment, described by actions, states, and rewards. The process can be represented as a
stochastic sequence of possible actions that motivate the agent to seek various rewards, which
leads to outcomes that are partly random and partly under the machine's control. It is evaluated
through a value function at each time step t. In the current state s, the agent chooses an action a.
The system then moves to the next state s', which depends only on the current state and the
action, and a corresponding reward r is given after the action. The probability of reaching state
s' is described by the transition function p_a(s, s'). The Bellman equation expresses the
relationship between the value of the current state and the values of its successor states: in state
s, action a is selected by the policy π, and the transition function p then determines the
corresponding reward r and the next state s'. By weighting the possible successor states by their
probability of occurring and averaging over them, the Bellman equation combines the immediate
reward for taking action a in state s with the maximum expected reward obtainable from the
next state.
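For reference, this relationship can be written compactly for state-action values. The following is a standard textbook form rather than the exact expression used in our program, with γ denoting the discount factor and r(s, a, s') the reward received for the transition:

    Q^{*}(s, a) = \sum_{s'} p_a(s, s') \left[ r(s, a, s') + \gamma \max_{a'} Q^{*}(s', a') \right]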


In our research, the Bellman equation is used in the Q-learning model. In the model, the
machine assigns the Q value, which is the expected reward with a higher value being more
desirable, to every state-action pair. The Q values are updated iteratively based on the current
state, future states, and potential actions. In Tic-Tac-Toe, the state is the board position while the
action is the game move. At the end of each match, the result is associated with the move that
caused the result. The machine will then work back the game history recursively and update the
Q values. Next, the epsilon-greedy strategy is employed to ensure the agent's familiarity with all
game moves instead of only reinforcing Q values that are already high. The epsilon-greedy
strategy balances exploration and exploitation: with probability 1-ε the agent exploits, playing
the move with the highest Q value in the Q-table, and with probability ε it explores by playing a
random move.
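As a minimal sketch of these two ingredients (the names q_update, epsilon_greedy, alpha, and gamma are illustrative rather than taken from our program; the default alpha and gamma mirror the agent's learning rate and discount factor, and the full implementation appears in Appendix A, where epsilon is tracked as the probability of exploiting rather than exploring):

    import random

    def q_update(q_table, state, action, reward, new_state, alpha=0.4, gamma=0.9):
        # One tabular Q-learning step: move Q(state, action) toward the Bellman target.
        best_next = max(q_table[new_state]) if new_state in q_table else 0.0
        target = reward + gamma * best_next
        q_table[state][action] += alpha * (target - q_table[state][action])

    def epsilon_greedy(q_table, state, legal_moves, epsilon):
        # Explore with probability epsilon; otherwise play a best-valued legal move.
        if state not in q_table or random.random() < epsilon:
            return random.choice(legal_moves)
        return max(legal_moves, key=lambda a: q_table[state][a])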

V. Data Analysis and Results


Through reinforcement learning, the Q-learning algorithm, implemented in Python, allows an
agent to find the actions that yield the greatest reward. When the agent explores, it plays a
random move. After each move, a reward is collected and a Q value is computed for the given
state and action with the Q-learning equation. These values are stored in the Q-table, where each
entry corresponds to a board state. The discount factor and the learning rate affect how quickly
the winning and tie rates converge. By following the policy that yields the most reward, the
machine learns from the Q-table.
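To illustrate the storage scheme, the following simplified sketch keeps the Q-table as a dictionary keyed by the board state, whereas the program in Appendix A keeps visited states in a parallel list called tree:

    import numpy as np

    q_table = {}  # board state -> one estimated value per board cell

    def get_row(state):
        # Encode the state as a hashable key; unseen states receive a fresh row of
        # optimistic values (1.0), matching the initialization described earlier.
        key = tuple(state)
        if key not in q_table:
            q_table[key] = np.ones(9)
        return q_table[key]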
As the computer explores the possibilities, its exploration rate decreases to encourage the agent
to play the optimal action while still ensuring a certain degree of exploration. Using a randomly
generated decimal between 0 and 1, the computer determines whether it plays under the trained
policy: it plays a random move if the generated number falls within the epsilon value, and
otherwise it follows the learned Q-values.
For the first one thousand episodes, we set epsilon to 0.9 to let the computer generate random
moves, producing enough data to populate the Q-table; the table is updated with the Bellman
equation. Epsilon is then adjusted by ((1 - epsilon) * 10) divided by the total number of episodes
after each game to reduce the rate of exploration of the board, giving the agent a greater
probability of playing the optimal policy. By evaluating the values computed by the Q-learning
equation, the algorithm enables the computer to play strategically, following the policy that
yields the greatest rewards.
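A sketch of this schedule as it appears in the listing (in Appendix A, epsilon is implemented as the probability of exploiting the learned policy, so reducing exploration corresponds to nudging it toward 1 by the increment quoted above):

    def update_epsilon(epsilon, total_episodes):
        # epsilon here is the probability of playing greedily; pushing it toward 1
        # shrinks the exploration rate (1 - epsilon) by the paper's increment.
        return epsilon + ((1 - epsilon) * 10) / total_episodes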


Similarly, the opponent in the game was coded with the same Q-learning algorithm to provide a
stronger training partner. Unlike the player, however, the opponent does not update its Q-table
during play. Instead, the opponent's Q-table is synchronized with the agent's Q-table once every
1,000 episodes to improve the efficiency of the agent's policy.
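Schematically, the training procedure described here looks like the following (a simplified sketch assuming the TicTacToe and Player objects from Appendix A have already been created; the evaluation games played every 1,000 episodes are omitted):

    for episode in range(1, num + 1):
        if episode % 1000 == 0:
            NPC.q_table = AI.q_table  # the opponent adopts the agent's current Q-table
        while not tictactoe.isEndGame():
            tictactoe = AI.makeMove(tictactoe)
            AI.updateQ_Table()        # only the agent learns during play
            if tictactoe.hasALine():
                break
            tictactoe = NPC.makeMove(tictactoe)
        tictactoe.resetBoard()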

Figure 1 illustrates that after playing 300,000 episodes, the agent wins about 75%, ties about
15%, and loses about 10% of the games played.

VI. Conclusion and Suggestions


( I ) Conclusion
Our research has shown that a learning agent, via reinforcement learning, can
master playing simple games such as Tic-Tac-Toe with a high winning rate after
receiving a sufficient amount of training. Throughout our experiment, our agent has
developed its playing strategy by utilizing the Bellman equation and the Q-learning
algorithm to recall its previous moves and discover the optimal Tic-Tac-Toe action. Based on
the results of our experiment, the agent achieves around a 90% win/tie rate after 300,000
episodes of training. We believe the win rate could climb closer to 100% with a better-tuned
Q-function and more training episodes.

( II ) Suggestions
Future research will focus on improving the machine's learning strategies, as well
as the resulting winning rate of the agent. We look forward to discovering a strategy
that maximizes learning without relying on long-term training, which would be more
efficient because machines would reach the same performance in far less time.
We wish to apply our research on adjusting the learning rate and decaying the
exploration rate, and on their effects on the agent's abilities, not only to Tic-Tac-Toe
but also to similar games that can be advanced with reinforcement learning. Though the
states, actions, and rewards vary by program, Q-learning remains fundamentally
model-free, so the research can be applied to a wide variety of programs, ranging from
games to other fields where reliable learning without explicit supervision is essential to
quick operation and strong results. Finance, business, medicine, and industrial robotics
are some examples of such fields.

VII. Acknowledgements
The authors thank Benjamin Chang and Jeffrey Huang from Pacific American School for
their unwavering support in co-authoring and providing constructive feedback and review.

VIII. References

1. DeepMind. (2021). AlphaGo: The story so far. Retrieved September 23, 2021, from
https://deepmind.com/research/case-studies/alphago-the-story-so-far
2. Even-Dar, E., & Mansour, Y. (2001). Convergence of optimistic and incremental Q-learning.
In Proceedings of the 14th International Conference on Neural Information Processing
Systems: Natural and Synthetic. Retrieved September 23, 2021, from
https://dl.acm.org/doi/abs/10.5555/2980539.2980734
3. Friedrich, C. (2018, July 20). Part 3 - Tabular Q-learning, a Tic-Tac-Toe player that gets
better and better. Medium. Retrieved September 23, 2021, from
https://medium.com/@carsten.friedrich/part-3-tabular-q-learning-a-tic-tac-toe-player-that-gets-better-and-better-fa4da4b0892a
4. Hu, J., & Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games.
Journal of Machine Learning Research. Retrieved September 23, 2021, from
https://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf
5. Morris, F. L., & Jones, C. B. (1984). An early program proof by Alan Turing. Retrieved
September 23, 2021, from https://www.cs.tau.ac.il/~nachumd/term/EarlyProof.pdf
6. Shannon, C. E. (1950). XXII. Programming a computer for playing chess. Philosophical
Magazine. Retrieved September 23, 2021, from
https://www.tandfonline.com/doi/abs/10.1080/14786445008521796?journalCode=tphm18
7. Silver, D., et al. (2016, January 27). Mastering the game of Go with deep neural networks
and tree search. Nature. Retrieved September 23, 2021, from
https://www.nature.com/articles/nature16961
8. Sutton, R. S., & Barto, A. G. (2020). Reinforcement learning: An introduction. The MIT
Press. Retrieved September 23, 2021, from http://incompleteideas.net/book/RLbook2020.pdf
9. Torres, J. (2021, May 10). The Bellman equation. Towards Data Science. Retrieved
September 23, 2021, from https://towardsdatascience.com/the-bellman-equation-59258a0d3fa7
10. Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning.
Kluwer Academic Publishers. Retrieved September 23, 2021, from
https://link.springer.com/content/pdf/10.1007/BF00992698.pdf
11. Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation).
Retrieved September 23, 2021, from
https://www.researchgate.net/publication/33784417_Learning_From_Delayed_Rewards
12. Wunder, M., Littman, M., & Babes, M. (2010). Classes of multiagent Q-learning dynamics
with ε-greedy exploration. International Conference on Machine Learning. Retrieved
September 23, 2021, from https://icml.cc/Conferences/2010/papers/191.pdf
13. YouTube. (2018). Using Q-learning and deep learning to solve Tic-Tac-Toe [Video].
Retrieved September 23, 2021, from https://www.youtube.com/watch?v=4C133ilFm3Q

Appendix A: Python Code for Playing Tic-Tac-Toe via Reinforcement Learning

import numpy as np
from numpy.random import randint
import matplotlib.pyplot as plt
import random
import json


def genBoard(dimension):
    # Unused helper that builds a printable character board.
    return [["-" for c in range(dimension)] for r in range(dimension)]


class TicTacToe:
    def __init__(self, dimension):
        self.dimension = dimension
        self.board_size = dimension ** 2
        self.board = np.zeros(self.board_size)
        self.actions = list(np.arange(dimension ** 2))
        self.possible_moves = np.arange(dimension ** 2)
        self.taken_moves = []
        self.current_state = []
        self.is_end_game = False

    def resetBoard(self):
        self.board = np.zeros(self.board_size)
        self.possible_moves = np.arange(self.dimension ** 2)
        self.taken_moves = []
        self.current_state = []
        self.is_end_game = False

    def printBoard(self):
        print(self.board.reshape(self.dimension, self.dimension))

    def placeAction(self, ID, action):
        # Reflect the move on the board and remove it from the legal moves.
        self.board[action] = ID
        self.taken_moves.append(action)
        idx = int(np.where(self.possible_moves == action)[0][0])
        self.possible_moves = np.delete(self.possible_moves, idx)
        self.current_state.append(ID * action)
        # Returns the new state, the reward for the mover, and the action taken.
        return [self.getState(), self.getReward(ID), action]

    def hasALine(self):
        size = self.dimension
        board = self.board.reshape(size, size)
        horizontal_sums = board.sum(axis=1)
        vertical_sums = board.sum(axis=0)
        diag_sum = np.trace(board)
        backDiag_sum = np.trace(np.fliplr(board))
        for i in range(size):
            h = abs(horizontal_sums[i])
            v = abs(vertical_sums[i])
            d = abs(diag_sum)
            b = abs(backDiag_sum)
            if h == size or v == size or d == size or b == size:
                return True
        return False

    def isDraw(self):
        # The game is treated as a draw once all but one cell are filled.
        return np.count_nonzero(self.board) == self.board_size - 1

    def isEndGame(self):
        self.is_end_game = self.isDraw() or self.hasALine()
        return self.is_end_game

    def getReward(self, ID):
        if self.hasALine():
            return 100 * ID
        elif self.isDraw():
            return 5
        else:
            return 0

    def getState(self):
        return np.sort(np.array(self.current_state))

    def getPossibleMoves(self):
        return self.possible_moves

    def getTakenMoves(self):
        return self.taken_moves

    def getBoardSize(self):
        return self.board_size


epsilon = 0.9


class Player:
    def __init__(self, ID, char, epsilon, learningRate, discountFactor):
        self.ID = ID
        self.char = char
        self.q_table = []   # one row of action values per visited state
        self.tree = []      # visited states, indexed in parallel with q_table
        self.epsilon = epsilon
        self.learning_rate = learningRate
        self.discount_factor = discountFactor
        self.current_state = []

    def makeMove(self, tictactoe):
        if tictactoe.isEndGame():
            return tictactoe
        self.current_state = list(tictactoe.getState())[:]  # store the old state
        if self.current_state not in self.tree:
            # Optimistic initialization: every action of a new state starts at 1.
            action_q_values = [1 for _ in range(tictactoe.getBoardSize())]
            self.tree.append(self.current_state[:])
            self.q_table.append(action_q_values)
        if np.random.random() < self.epsilon:
            # Exploit: epsilon here is the probability of playing greedily.
            qT = np.array(self.q_table[self.tree.index(self.current_state)])
            for i in range(len(qT)):
                if i not in tictactoe.getPossibleMoves():
                    qT[i] = -100  # mask illegal moves
            act = np.argmax(qT)
            possible_acts = [i for i in range(tictactoe.getBoardSize()) if qT[i] == qT[act]]
            self.new_state, self.reward, self.action = tictactoe.placeAction(
                self.ID, random.choice(possible_acts))
        else:
            # Explore: play a random legal move.
            self.new_state, self.reward, self.action = tictactoe.placeAction(
                self.ID, random.choice(tictactoe.getPossibleMoves()))
        return tictactoe

    def makeCompMove(self, tictactoe):
        # Always plays a random legal move (used for the evaluation games).
        if tictactoe.isEndGame():
            return tictactoe
        self.new_state, self.reward, self.action = tictactoe.placeAction(
            self.ID, random.choice(tictactoe.getPossibleMoves()))
        return tictactoe

    def setEpsilon(self, e):
        self.epsilon = e

    def setLearningRate(self, lr):
        self.learning_rate = lr

    def setDiscountFactor(self, df):
        self.discount_factor = df

    def updateQ_Table(self):
        # Q-learning update for the last (state, action) pair the agent played.
        old_q_value = self.q_table[self.tree.index(self.current_state)][self.action]
        try:
            idx = self.tree.index(self.new_state)
            value = np.max(self.q_table[idx])  # best value reachable from the new state
        except ValueError:
            value = 0                          # the new state has not been visited yet
        temporal_difference = self.reward + (self.discount_factor * value) - old_q_value
        new_q_value = old_q_value + (self.learning_rate * temporal_difference)
        self.q_table[self.tree.index(self.current_state)][self.action] = new_q_value

    def getID(self):
        return self.ID

    def getCurrentState(self):
        return self.current_state


AI = Player(1, "X", epsilon, 0.4, 0.9)
NPC = Player(-1, "O", epsilon, 0.9, 0.9)

num = 100000
dimension = 3
tictactoe = TicTacToe(dimension)

odds = 0
draw = 0
winning_rate = []
draw_rate = []
lose_rate = []
x_axis = []

for episode in range(1, num + 1):
    if episode % 1000 == 0:
        # The opponent adopts the agent's current Q-table, then the agent is
        # evaluated greedily for 1000 games against a random opponent.
        NPC.q_table = AI.q_table
        odds = 0
        draw = 0
        lose = 0
        with open('qlearning.txt', 'w') as filehandle:
            json.dump(AI.q_table, filehandle)
        AI.setEpsilon(1.0)
        for j in range(1000):
            while not tictactoe.isEndGame():
                tictactoe = AI.makeMove(tictactoe)
                if tictactoe.hasALine():
                    odds += 1
                    break
                tictactoe = NPC.makeCompMove(tictactoe)  # random move
                if tictactoe.hasALine():
                    lose += 1
                    break
            if tictactoe.isDraw():
                draw += 1
            tictactoe.resetBoard()
        odds = (odds / 1000.0) * 100
        draw = (draw / 1000.0) * 100
        lose = (lose / 1000.0) * 100
        winning_rate.append(odds)
        draw_rate.append(draw)
        lose_rate.append(lose)
        x_axis.append(episode)
        print("testing rates... odds = " + str(odds) + "% draw = " + str(draw)
              + "% lose = " + str(lose) + "%")
        AI.setEpsilon(epsilon)

    # One training game: only the agent updates its Q-table.
    while not tictactoe.isEndGame():
        tictactoe = AI.makeMove(tictactoe)
        AI.updateQ_Table()
        if tictactoe.hasALine():
            break
        tictactoe = NPC.makeMove(tictactoe)  # not a random move
    epsilon += ((1 - epsilon) * 10) / num
    AI.setEpsilon(epsilon)
    tictactoe.resetBoard()
    print("game: " + str(episode) + " done")

print("training completed")
print(max(winning_rate))

csfont = {'fontname': 'Comic Sans MS'}

plt.plot(x_axis, winning_rate, label="Win", linewidth=3, color="#A2F350")
plt.plot(x_axis, draw_rate, label="Tie", linewidth=3, color="#1c4285")
plt.plot(x_axis, lose_rate, label="Lose", linewidth=3, color="#F35050")
plt.xlabel("Episodes", **csfont)
plt.ylabel("Rates (%)", **csfont)
plt.title("Rates vs. Episodes", **csfont)
plt.legend(prop={'family': "Comic Sans MS"})
plt.show()
