Reinforcement Learning
Authors: Jimut Bahan Pal, Debadri Chatterjee, Sounak Modak
Supervisor: Tamal Maharaj
Dynammic Duo
Contents
1 Introduction
2 Value Iteration
3 Bridge Crossing Analysis
4 Policies
5 Asynchronous Value Iteration
6 Q-learning
7 ε-Greedy
8 Bridge Crossing Revisited
9 Q-Learning and Pacman
10 Approximate Q-Learning
11 Conclusion
List of Figures
3.1 Values obtained after 100 iterations for the bridge crossing problem
6.1 Q values of each state obtained from the grid world after 100 iterations
6.2 Q values of each state obtained for the bridge crossing problem after 100 iterations
8.1 A random Q-learner with the default learning rate after training on the noiseless BridgeGrid for 50 episodes with an epsilon of 1
8.2 A random Q-learner with the default learning rate after training on the noiseless BridgeGrid for 50 episodes with an epsilon of 0
Chapter 1
Introduction
Reinforcement Learning (RL) is a family of model-free machine learning methods in which an agent learns its behaviour by actually interacting with its environment. This is preferable to an offline planner because a real-world environment is almost impossible to simulate completely in a computer; by interacting with the environment itself, the agent also learns those extra features that can only be picked up in a real environment, giving it a learning capability similar to that of living organisms. Since the reinforcement learning agent gets its feedback from the environment, it can automatically determine the behaviours that are considered ideal within a specified context. Reinforcement learning is deemed important in the field of artificial intelligence as it continues to set benchmarks and make breakthroughs in various industrial applications. Previously we analysed the pacman game with a reflex agent; here we try to make the pacman agent smarter by applying an RL technique, namely Q-learning. The components of a reinforcement learning system are shown in Figure 1.1.
We have used the skeleton code from UC Berkeley CS188 Intro to AI [1], which
was specially designed for this course.
Chapter 2
Value Iteration
V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_{k}(s') \right]    (2.1)
The values of the states from the previous iteration are used to update the values of the states in the next iteration. The utility of a state is the expected utility of the state sequence encountered when an optimal policy is executed starting from that state. The value iteration algorithm for solving MDPs works by iteratively solving the equations that relate the utility of each state to those of its neighbours; this update is used to calculate the state values in our project (a sketch of it is given below). After applying value iteration on our grid world, we obtain the values for each state shown in Figures 2.1, 2.2, 2.3 and 2.4. We can also see that the policy converges faster than the values.
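As an illustration of equation (2.1), the following is a minimal batch value iteration sketch over a hypothetical transition table; the dictionary format and the tiny two-state MDP are assumptions made for this example and are not taken from the project's GridWorld code.

```python
# Minimal batch value iteration sketch for equation (2.1).
# T maps (state, action) to a list of (next_state, probability, reward) triples.
def value_iteration(states, actions, T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V_new = {}
        for s in states:
            q_values = [
                sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
                for a in actions if (s, a) in T
            ]
            # Terminal states (no available actions) keep a value of 0.
            V_new[s] = max(q_values) if q_values else 0.0
        V = V_new  # batch update: every state uses the previous iteration's values
    return V

# Hypothetical two-state chain: from 'A', moving right usually reaches terminal 'B' (+1).
T = {('A', 'right'): [('B', 0.8, 1.0), ('A', 0.2, 0.0)],
     ('A', 'left'):  [('A', 1.0, 0.0)]}
print(value_iteration(['A', 'B'], ['left', 'right'], T))
```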
Chapter 3
Bridge Crossing Analysis
We have implemented the bridge crossing analysis with the help of BridgeGrid, a grid world map with a low-reward terminal state and a high-reward terminal state separated by a narrow bridge. The agent starts near the low-reward state and has to cross the bridge to reach the high-reward state. During this process the agent falls into the low-reward terminal states several times, learns that those paths are bad for it, tries new options and finally chooses the best path, i.e., the path through which it gets the highest reward. With the default discount of 0.9 and the default noise of 0.2, the optimal policy does not cross the bridge. We therefore change only one of the discount and noise parameters so that the optimal policy causes the agent to cross the bridge; the noise refers to the probability that the agent ends up in an unintended successor state. The bridge world that we have used, the values for each state and the policies that the agent has learnt are shown in Figure 3.1.
FIGURE 3.1: Values obtained after 100 iterations for the bridge crossing problem
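As a rough back-of-envelope check of why lowering the noise makes crossing worthwhile, the sketch below treats the bridge as a fixed number of steps and the noise as the per-step chance of slipping into a chasm. The bridge length and reward values are illustrative assumptions, not values read from the actual BridgeGrid layout.

```python
# Hypothetical back-of-envelope: expected discounted return of walking straight
# across a narrow bridge, treating `noise` as the per-step chance of slipping
# into a -100 chasm.  Numbers are illustrative, not the exact BridgeGrid values.
def crossing_value(steps=5, noise=0.2, discount=0.9,
                   far_reward=10.0, chasm_reward=-100.0):
    survive = 1.0   # probability of still being on the bridge
    value = 0.0
    for t in range(steps):
        value += (discount ** (t + 1)) * survive * noise * chasm_reward
        survive *= 1.0 - noise
    return value + (discount ** steps) * survive * far_reward

for n in (0.2, 0.0):
    print(f"noise={n}: expected value of crossing is about {crossing_value(noise=n):.2f}")
# With noise 0.2 the crossing is worth far less than the nearby +1 exit,
# so the optimal policy refuses to cross; with noise 0 it is worth about 5.9.
```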
Chapter 4
Policies
There are two terminal states with positive payoffs, one of +1 and another of +10, as shown in Figure 4.1. The starting state is the yellow square, and we distinguish between two types of paths: those that risk the cliff and travel near the bottom edge of the grid, and those that avoid the cliff and travel along the top edge of the grid. The latter paths are longer but are less likely to incur large negative payoffs; they are shown by green arrows in the figure. Here we choose the settings of the discount, noise and living reward parameters for this MDP so as to produce the desired optimal policies. Our agent followed its optimal policy without being subject to any noise and exhibited the intended behaviour of choosing the best path.
The Q-value of a state-action pair under an optimal policy can be defined in the following way:
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]    (4.1)
Here we see that the agent learns the policy of taking the path that "avoids the cliff", as shown by the policies obtained in Figure 4.2 and by the Q values of each state learned by the agent, shown in Figure 4.3. So the agent finds the optimal policy and acts according to it when placed in any cell of the grid world, finding its path towards the goal.
FIGURE 4.2: Values of each state and policies obtained from the Discount grid layout after 100 iterations
FIGURE 4.3: Q values of each state obtained from the Discount grid layout after 100 iterations
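For reference, the dictionary below summarises how these three parameters typically interact on a layout of this kind: a low discount pulls the agent towards the nearby exit, noise decides whether skirting the cliff is worth the risk, and a large living reward can make the agent avoid terminating at all. The numeric settings are illustrative assumptions, not the exact values used in our experiments.

```python
# Illustrative (discount, noise, living reward) combinations and the policy each
# tends to produce on a DiscountGrid-style layout.  The numbers are assumed
# examples for discussion, not the exact settings from our runs.
policy_settings = {
    "close exit, risk the cliff":  {"discount": 0.2, "noise": 0.0, "living_reward": 0.0},
    "close exit, avoid the cliff": {"discount": 0.2, "noise": 0.2, "living_reward": 0.0},
    "far exit, risk the cliff":    {"discount": 0.9, "noise": 0.0, "living_reward": 0.0},
    "far exit, avoid the cliff":   {"discount": 0.9, "noise": 0.2, "living_reward": 0.0},
    "never terminate":             {"discount": 0.9, "noise": 0.0, "living_reward": 1.0},
}
for behaviour, params in policy_settings.items():
    print(f"{behaviour}: {params}")
```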
Chapter 5
Asynchronous Value Iteration
Asynchronous value iteration, when implemented storing just the V(s) array, carries out the following update:
V(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]    (5.1)
Although this variant stores less information, it is more difficult to extract the policy from it: one extra backup is required to determine which action a results in the maximum value, as given by the following equation:
\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]    (5.2)
Our value iteration agent is an offline planner, not a reinforcement learning agent, so the relevant training option is the number of iterations of value iteration it should run in its initial planning phase. The agent takes an MDP on construction and runs cyclic value iteration for the specified number of iterations. Our agent is an asynchronous value iteration agent because it updates only one state in each iteration, as opposed to a batch-style update. Cyclic value iteration works as follows (a sketch of the update loop is given after the list):
• The first iteration updates only the value of the first state in the states list.
• The second iteration updates only the value of the second state.
• This procedure continues until the agent has updated the value of each state once, and then it starts again from the first state for the subsequent iterations.
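A minimal sketch of this cyclic update, assuming the same hypothetical transition-table format as the earlier value iteration sketch, is shown below; note that the single V dictionary is updated in place, so later updates within a sweep already see the newer values.

```python
# Cyclic (asynchronous) value iteration: each iteration updates exactly one
# state, in round-robin order, writing directly into the single V dictionary.
def cyclic_value_iteration(states, actions, T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}
    for i in range(iterations):
        s = states[i % len(states)]           # one state per iteration
        q_values = [
            sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
            for a in actions if (s, a) in T
        ]
        if q_values:                          # terminal states keep a value of 0
            V[s] = max(q_values)              # in-place update
    return V

# Hypothetical two-state example reusing the same table format as before.
T = {('A', 'right'): [('B', 1.0, 1.0)], ('A', 'left'): [('A', 1.0, 0.0)]}
print(cyclic_value_iteration(['A', 'B'], ['left', 'right'], T))
```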
Chapter 6
Q-learning
The value iteration agent does not actually learn from experience: it ponders its MDP model to arrive at a complete policy before interacting with the real environment, and it then acts as a reflex agent when it follows the pre-computed policy while interacting with that environment [4]. This distinction is subtle in a simulated environment like GridWorld, but it is very important in the real world, where the true MDP is not available. The work of the two components of the adaptive heuristic critic can be accomplished by Watkins' Q-learning algorithm. Q-learning is typically easier to implement, and its update is given by the following equation:
Q_{t}(s, a) = Q_{t-1}(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \right]    (6.2)
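As a minimal sketch of this update rule, assuming a hypothetical dictionary-backed Q-table rather than the project's QLearningAgent class, the following shows how a single experience tuple nudges a Q-value towards its sampled target.

```python
from collections import defaultdict

# Tabular Q-learning update for equation (6.2): move Q(s, a) towards the
# sampled target r + gamma * max_a' Q(s', a').  Unvisited pairs default to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, legal_next_actions, alpha=0.5, gamma=0.9):
    target = r + gamma * max(
        (Q[(s_next, a2)] for a2 in legal_next_actions), default=0.0
    )
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One hypothetical transition: moving 'east' from (0, 0) to (1, 0) with reward -1.
q_update((0, 0), 'east', -1.0, (1, 0), ['east', 'west'])
print(Q[(0, 0), 'east'])
```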
The Q values obtained for each state after 100 iterations in the grid world are shown in Figure 6.1. Similarly, the Q values obtained for the "bridge crossing" problem after 100 iterations are shown in Figure 6.2.
FIGURE 6.1: Q values of each state obtained from the grid world after 100 iterations
FIGURE 6.2: Q values of each state obtained for the bridge crossing problem after 100 iterations
Chapter 7
ε-Greedy
To balance the exploration and exploitation trade-off we use the ε-greedy strategy. The algorithm exploits the best option found so far greedily a (1 − ε) fraction of the time and explores random alternatives an ε fraction of the time. For example, if we set ε to 0.04, the algorithm exploits the best option 96% of the time and explores random alternatives 4% of the time. This is quite effective, but there is a drawback: the agent can under-explore the space of options before settling on the strongest one, which makes ε-greedy prone to getting stuck exploiting a sub-optimal option.
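A minimal sketch of ε-greedy action selection, again assuming a hypothetical dictionary-backed Q-table rather than the project's own getAction method, is given below.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # unseen (state, action) pairs default to a Q-value of 0

def epsilon_greedy_action(state, legal_actions, epsilon=0.05):
    """Pick a random legal action with probability epsilon, else a greedy one."""
    if not legal_actions:
        return None
    if random.random() < epsilon:          # explore: any legal action may be picked
        return random.choice(legal_actions)
    best = max(Q[(state, a)] for a in legal_actions)
    # break ties randomly among the greedy actions
    return random.choice([a for a in legal_actions if Q[(state, a)] == best])

print(epsilon_greedy_action((0, 0), ['north', 'south', 'east', 'west'], epsilon=0.04))
```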
We have built our Q-learning agent by implementing ε-greedy action selection in the getAction method: it chooses a random action an ε fraction of the time and follows its current best Q-values otherwise [5]. Note that choosing a random action may still end up picking the best action, because the random choice is made over all legal actions, not only over the sub-optimal ones. We simulated a binary variable with probability p of success, which returns true or false like a biased coin flip. Our final Q-values resembled those of the value iteration agent, especially along optimal paths; however, our average returns were lower than the Q-values predicted, due to the random actions taken in the initial learning phase. We have also implemented a crawler robot using our Q-learning class and tweaked certain learning parameters to study the agent's policies and actions, as shown in Figure 7.1. We noticed that the step delay is a parameter of the simulation, the learning rate and ε are parameters of our learning algorithm, and the discount factor is a property of the environment.
Chapter 8
Bridge Crossing Revisited
We have trained a completely random Q-learner with the default learning rate on the noiseless BridgeGrid for 50 episodes and observed whether it finds the optimal policy. We noticed that the learning agent finds the optimal path for part of the crossing but does not reach the far side of the bridge, because it is busy exploring rather than exploiting what it has learned, as shown in Figure 8.1. We then tried the same experiment with an ε of 0 and noticed that, after finding a path, the agent only follows that path reflexively and does not explore any other path, as shown in Figure 8.2.
Chapter 9
Q-Learning and Pacman
We have trained a pacman agent which learns the Q-values of positions and actions and of their consequences. It takes a long time to learn accurate Q-values even for small grids. Pacman's training mode runs without the GUI so that the agent can learn as fast as possible [6]. Once training is complete, it enters testing mode. When testing, pacman's exploration rate and learning rate are set to 0, which effectively stops Q-learning and disables exploration so that pacman can exploit its learned policy. The test games are shown in the GUI by default, and we found that pacman avoids the ghost and successfully eats the food, which results in it winning every game and celebrating. The default learning parameters that were effective for this problem are ε = 0.05, α = 0.2 and γ = 0.8. Our Q-learning agent works for GridWorld, the crawler and pacman, since it learns a good policy for every world by handling unseen actions and every other case properly: unseen actions have a Q-value of 0, so if all of the actions that have been seen have negative Q-values, an unseen action may be optimal (illustrated in the short sketch after this paragraph). We have played a total of 2010 games, of which the first 2000 were not displayed since they were used to train pacman, and the last 10 were won by pacman, which shows that it has learned the optimal policy for each state in the pacman world, as shown in Figure 9.1.
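The point about unseen actions can be made concrete with a tiny illustration: because unseen (state, action) pairs read as 0, an action that has never been tried can still win the max when every tried action looks bad. The state name and values below are made up for this example.

```python
from collections import defaultdict

Q = defaultdict(float)                 # unseen (state, action) pairs read as 0
state = 'corner'                       # hypothetical state next to a ghost
Q[(state, 'north')] = -3.0             # tried before and punished
Q[(state, 'east')] = -1.5              # tried before and punished
legal = ['north', 'east', 'south']     # 'south' has never been tried

best = max(legal, key=lambda a: Q[(state, a)])
print(best)                            # -> 'south', the unseen action with Q = 0
```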
During training, we see an output of pacman's performance after every 100 games. ε is kept positive during training so that pacman learns a good policy while still trying to explore new ones; this results in poorer training performance because it occasionally makes a random exploratory move into a ghost. As a benchmark, it should take between 1000 and 1400 games before pacman's reward for a 100-episode segment becomes positive, reflecting that it has started winning more than losing. By the end of training, pacman's rewards should remain positive and fairly high.
The MDP state is the exact board configuration facing pacman, with the now complex transition describing an entire ply of change to that state. The intermediate game configurations in which pacman has moved but the ghosts have not yet replied are not MDP states; they are bundled into the transition. We have seen that, once pacman has completed its training, it should win the test games about 90% of the time, since it is exploiting its learned policy, as shown in Figure 9.1.
However, when we trained the agent on the seemingly simple mediumGrid shown in Figure 9.2, it did not perform well. In our implementation pacman's average training reward remains negative throughout training, and at test time it loses its games because it could not explore all the states. Training also took a long time despite its ineffectiveness. Pacman fails to win on a larger layout because each board configuration is a separate state with separate state values, and the agent has no way to generalise that running into a ghost is bad for all positions. This approach does not scale, so learning a separate Q-value for each state is a bad way to solve pacman.
Chapter 10
Approximate Q-Learning
Approximate Q-learning represents the Q-function as a linear combination of features:

Q(s, a) = \sum_{i=1}^{n} f_i(s, a) \, w_i    (10.1)
where each weight w_i is associated with a particular feature f_i(s, a). We implemented the weight vector as a dictionary mapping the features returned by the feature extractors to weight values. We updated the weights similarly to how we updated Q-values, by applying the following equations:
\text{difference} = \left( r + \gamma \max_{a'} Q(s', a') \right) - Q(s, a)    (10.3)
The difference term is the same as in normal Q-learning, with r being the experienced reward; each weight is then moved in the direction of this difference, scaled by the learning rate and the corresponding feature value, i.e. w_i ← w_i + α · difference · f_i(s, a). By default the approximate Q-agent uses the IdentityExtractor, which assigns a single feature to every ⟨state, action⟩ pair. With this feature extractor, our approximate Q-learning agent behaves identically to PacmanQAgent, as shown in Figure 10.1.
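A minimal sketch of this feature-based update, assuming hypothetical feature names and a plain dictionary of weights rather than the project's exact feature extractor classes, is given below.

```python
from collections import defaultdict

weights = defaultdict(float)           # one weight per feature name

def q_value(features):
    """Q(s, a) = sum_i f_i(s, a) * w_i, with features given as {name: value}."""
    return sum(value * weights[name] for name, value in features.items())

def update(features, reward, next_features_per_action, alpha=0.2, gamma=0.8):
    """Move every weight in the direction of the TD difference (equation 10.3)."""
    next_best = max((q_value(f) for f in next_features_per_action), default=0.0)
    difference = (reward + gamma * next_best) - q_value(features)
    for name, value in features.items():
        weights[name] += alpha * difference * value

# Hypothetical feature vectors for one transition of a pacman-like agent.
f_now = {'bias': 1.0, 'dist-to-food': 0.5, 'ghost-one-step-away': 0.0}
f_next = [{'bias': 1.0, 'dist-to-food': 0.25, 'ghost-one-step-away': 0.0}]
update(f_now, reward=-1.0, next_features_per_action=f_next)
print(dict(weights))
```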
Chapter 11
Conclusion
In this project we used value iteration so that the agent chooses the action that maximises its expected utility. We noticed that when we use value iteration, the policy converges faster than the values, so we need to maintain a threshold after which we stop, since the values keep changing only minutely without significantly affecting the policies of a given GridWorld. The values obtained determine the optimal policy, which decides the agent's action when it is spawned in any given state of the GridWorld. We then implemented the bridge crossing analysis, in which an agent has to cross a bridge by exploring and then exploiting the learned values of each state. We noticed that when the agent explores more, it finds that certain paths are not good or optimal; so, by exploiting the learned policies, the agent arrives at the optimal or best path over time.
Bibliography
[1] DeNero, J., Klein, D., Abbeel, P. (2013). The Pac-Man Projects. UC Berkeley CS188 Intro to AI – Course Materials. Available online at http://ai.berkeley.edu/project_overview.html, last accessed on 26th October, 2019.
[2] Russell, S., Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
[6] Pal, J. B. (2019, August 29). Designing of Search Agents Using Pacman. Available online at https://doi.org/10.31219/osf.io/rnsy6, last accessed on 26th October, 2019.