Reinforcement Learning
Authors: Jimut Bahan Pal, Debadri Chatterjee, Sounak Modak
Supervisor: Tamal Maharaj
Dynammic Duo
Contents
1 Introduction
2 Value Iteration
3 Bridge Crossing Analysis
4 Policies
5 Asynchronous Value Iteration
6 Q-learning
7 ε-Greedy
8 Bridge Crossing Revisited
9 Q-Learning and Pacman
10 Approximate Q-Learning
11 Conclusion
List of Figures
3.1 Values obtained after 100 iterations for the bridge crossing problem
6.1 Q values of each state obtained from the grid world after 100 iterations
6.2 Q values of each state obtained for the bridge crossing problem after 100 iterations
8.1 A random Q-learner with the default learning rate after training on the noiseless BridgeGrid for 50 episodes with an epsilon of 1
8.2 A random Q-learner with the default learning rate after training on the noiseless BridgeGrid for 50 episodes with an epsilon of 0
Chapter 1
Introduction
Reinforcement Learning (RL) is a family of model-free machine learning methods in which an agent learns its behaviour by actually interacting with its environment. This is preferable to an offline planner because a real-world environment is almost impossible to simulate completely in a computer; by interacting with the environment itself, the agent also learns those extra features that can only be picked up in a real environment, giving it a learning capability similar to that of living organisms. Since the reinforcement learning agent gets its feedback from the environment, it can automatically determine the behaviours that are considered ideal within a specified context. Reinforcement learning is deemed important in the field of artificial intelligence as it continues to set benchmarks and make breakthroughs in various industrial applications. Previously we analysed the pacman game with a reflex agent; here we try to make the pacman agent smarter by applying an RL technique, namely Q-learning. The components of a reinforcement learning system are shown in Figure 1.1.
We have used the skeleton code from UC Berkeley CS188 Intro to AI [1], which
was specially designed for this course.
Chapter 2
Value Iteration
V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_{k}(s') \right]    (2.1)
The values of the states from the previous iteration are used to update the values of the states in the next iteration. The utility of a state is the expected utility of the state sequence encountered when an optimal policy is executed starting from that state. The value iteration algorithm for solving MDPs works by iteratively solving the equations that relate the utility of each state to those of its neighbours; this update is used to calculate the state values in our project (a sketch of it is given below). After applying value iteration on our grid world, we obtain the values for each state shown in Figures 2.1, 2.2, 2.3 and 2.4. We can also see that the policy converges faster than the values.
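As an illustration of equation (2.1), the following is a minimal batch value iteration sketch over a hypothetical transition table; the dictionary format and the tiny two-state MDP are assumptions made for this example and are not taken from the project's GridWorld code.

```python
# Minimal batch value iteration sketch for equation (2.1).
# T maps (state, action) to a list of (next_state, probability, reward) triples.
def value_iteration(states, actions, T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V_new = {}
        for s in states:
            q_values = [
                sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
                for a in actions if (s, a) in T
            ]
            # Terminal states (no available actions) keep a value of 0.
            V_new[s] = max(q_values) if q_values else 0.0
        V = V_new  # batch update: every state uses the previous iteration's values
    return V

# Hypothetical two-state chain: from 'A', moving right usually reaches terminal 'B' (+1).
T = {('A', 'right'): [('B', 0.8, 1.0), ('A', 0.2, 0.0)],
     ('A', 'left'):  [('A', 1.0, 0.0)]}
print(value_iteration(['A', 'B'], ['left', 'right'], T))
```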
Chapter 3
Bridge Crossing Analysis
We have implemented the bridge crossing analysis with the help of BridgeGrid, a grid world map with a low-reward terminal state and a high-reward terminal state separated by a narrow bridge. The agent starts near the low-reward state and has to cross the bridge to reach the high-reward state. During this process the agent falls into the low-reward terminal states several times, learns that those paths are bad for it, tries new options and finally chooses the best path, i.e., the path through which it gets the highest reward. With the default discount of 0.9 and the default noise of 0.2, the optimal policy does not cross the bridge. We therefore change only one of the discount and noise parameters so that the optimal policy causes the agent to cross the bridge; the noise refers to the probability that the agent ends up in an unintended successor state. The bridge world that we have used, the values for each state and the policies that the agent has learnt are shown in Figure 3.1.
FIGURE 3.1: Values obtained after 100 iterations for the bridge crossing problem
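As a rough back-of-envelope check of why lowering the noise makes crossing worthwhile, the sketch below treats the bridge as a fixed number of steps and the noise as the per-step chance of slipping into a chasm. The bridge length and reward values are illustrative assumptions, not values read from the actual BridgeGrid layout.

```python
# Hypothetical back-of-envelope: expected discounted return of walking straight
# across a narrow bridge, treating `noise` as the per-step chance of slipping
# into a -100 chasm.  Numbers are illustrative, not the exact BridgeGrid values.
def crossing_value(steps=5, noise=0.2, discount=0.9,
                   far_reward=10.0, chasm_reward=-100.0):
    survive = 1.0   # probability of still being on the bridge
    value = 0.0
    for t in range(steps):
        value += (discount ** (t + 1)) * survive * noise * chasm_reward
        survive *= 1.0 - noise
    return value + (discount ** steps) * survive * far_reward

for n in (0.2, 0.0):
    print(f"noise={n}: expected value of crossing is about {crossing_value(noise=n):.2f}")
# With noise 0.2 the crossing is worth far less than the nearby +1 exit,
# so the optimal policy refuses to cross; with noise 0 it is worth about 5.9.
```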
Chapter 4
Policies
There are two terminal states with positive payoffs, one of +1 and another of +10, as shown in Figure 4.1. The starting state is the yellow square, and we distinguish between two types of paths: those that risk the cliff and travel near the bottom edge of the grid, and those that avoid the cliff and travel along the top edge of the grid. The latter paths are longer but are less likely to incur large negative payoffs; they are shown by green arrows in the figure. Here we choose the settings of the discount, noise and living reward parameters for this MDP so as to produce the desired optimal policies. Our agent followed its optimal policy without being subject to any noise and exhibited the intended behaviour of choosing the best path.
The Q-value of a state-action pair under an optimal policy can be defined in the following way:
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]    (4.1)
Here we see that the agent learns the policy of taking the path that "avoids the cliff", as shown by the policies obtained in Figure 4.2 and by the Q values of each state learned by the agent, shown in Figure 4.3. So the agent finds the optimal policy and acts according to it when placed in any cell of the grid world, finding its path towards the goal.
FIGURE 4.2: Values of each state and policies obtained from the Discount grid layout after 100 iterations
FIGURE 4.3: Q values of each state obtained from the Discount grid layout after 100 iterations
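For reference, the dictionary below summarises how these three parameters typically interact on a layout of this kind: a low discount pulls the agent towards the nearby exit, noise decides whether skirting the cliff is worth the risk, and a large living reward can make the agent avoid terminating at all. The numeric settings are illustrative assumptions, not the exact values used in our experiments.

```python
# Illustrative (discount, noise, living reward) combinations and the policy each
# tends to produce on a DiscountGrid-style layout.  The numbers are assumed
# examples for discussion, not the exact settings from our runs.
policy_settings = {
    "close exit, risk the cliff":  {"discount": 0.2, "noise": 0.0, "living_reward": 0.0},
    "close exit, avoid the cliff": {"discount": 0.2, "noise": 0.2, "living_reward": 0.0},
    "far exit, risk the cliff":    {"discount": 0.9, "noise": 0.0, "living_reward": 0.0},
    "far exit, avoid the cliff":   {"discount": 0.9, "noise": 0.2, "living_reward": 0.0},
    "never terminate":             {"discount": 0.9, "noise": 0.0, "living_reward": 1.0},
}
for behaviour, params in policy_settings.items():
    print(f"{behaviour}: {params}")
```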
Chapter 5
Asynchronous Value Iteration
Asynchronous value iteration, when implemented storing just the V(s) array, carries out the following update:
V(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]    (5.1)
Although this variant stores less information, it is more difficult to extract the policy from it: one extra backup is required to determine which action a results in the maximum value, as given by the following equation:
\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]    (5.2)
Our value iteration agent is an offline planner, not a reinforcement learning agent, so the relevant training option is the number of iterations of value iteration it should run in its initial planning phase. The agent takes an MDP on construction and runs cyclic value iteration for the specified number of iterations. Our agent is an asynchronous value iteration agent because it updates only one state in each iteration, as opposed to a batch-style update. Cyclic value iteration works as follows (a sketch of the update loop is given after the list):
• The first iteration updates only the value of the first state in the states list.
• The second iteration updates only the value of the second state.
• This procedure continues until the agent has updated the value of each state once, and then it starts again from the first state for the subsequent iterations.
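A minimal sketch of this cyclic update, assuming the same hypothetical transition-table format as the earlier value iteration sketch, is shown below; note that the single V dictionary is updated in place, so later updates within a sweep already see the newer values.

```python
# Cyclic (asynchronous) value iteration: each iteration updates exactly one
# state, in round-robin order, writing directly into the single V dictionary.
def cyclic_value_iteration(states, actions, T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}
    for i in range(iterations):
        s = states[i % len(states)]           # one state per iteration
        q_values = [
            sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
            for a in actions if (s, a) in T
        ]
        if q_values:                          # terminal states keep a value of 0
            V[s] = max(q_values)              # in-place update
    return V

# Hypothetical two-state example reusing the same table format as before.
T = {('A', 'right'): [('B', 1.0, 1.0)], ('A', 'left'): [('A', 1.0, 0.0)]}
print(cyclic_value_iteration(['A', 'B'], ['left', 'right'], T))
```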
Chapter 6
Q-learning
The value iteration agent does not actually learn from experience: it ponders its MDP model to arrive at a complete policy before interacting with the real environment, and it then acts as a reflex agent when it follows the pre-computed policy while interacting with that environment [4]. This distinction is subtle in a simulated environment like GridWorld, but it is very important in the real world, where the true MDP is not available. The work of the two components of the adaptive heuristic critic can be accomplished by Watkins' Q-learning algorithm. Q-learning is typically easier to implement, and its update is given by the following equation:
Q_{t}(s, a) = Q_{t-1}(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \right]    (6.2)
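As a minimal sketch of this update rule, assuming a hypothetical dictionary-backed Q-table rather than the project's QLearningAgent class, the following shows how a single experience tuple nudges a Q-value towards its sampled target.

```python
from collections import defaultdict

# Tabular Q-learning update for equation (6.2): move Q(s, a) towards the
# sampled target r + gamma * max_a' Q(s', a').  Unvisited pairs default to 0.
Q = defaultdict(float)

def q_update(s, a, r, s_next, legal_next_actions, alpha=0.5, gamma=0.9):
    target = r + gamma * max(
        (Q[(s_next, a2)] for a2 in legal_next_actions), default=0.0
    )
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One hypothetical transition: moving 'east' from (0, 0) to (1, 0) with reward -1.
q_update((0, 0), 'east', -1.0, (1, 0), ['east', 'west'])
print(Q[(0, 0), 'east'])
```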
The Q values obtained for each state after 100 iterations in the grid world are shown in Figure 6.1. Similarly, the Q values obtained for the "bridge crossing" problem after 100 iterations are shown in Figure 6.2.
FIGURE 6.1: Q values of each state obtained from the grid world after 100 iterations
FIGURE 6.2: Q values of each state obtained for the bridge crossing problem after 100 iterations
Chapter 7
ε-Greedy
To balance the exploration and exploitation trade-off we use the ε-greedy strategy. The algorithm exploits the best option found so far greedily a (1 − ε) fraction of the time and explores random alternatives an ε fraction of the time. For example, if we set ε to 0.04, the algorithm exploits the best option 96% of the time and explores random alternatives 4% of the time. This is quite effective, but there is a drawback: the agent can under-explore the space of options before settling on the strongest one, which makes ε-greedy prone to getting stuck exploiting a sub-optimal option.
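A minimal sketch of ε-greedy action selection, again assuming a hypothetical dictionary-backed Q-table rather than the project's own getAction method, is given below.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # unseen (state, action) pairs default to a Q-value of 0

def epsilon_greedy_action(state, legal_actions, epsilon=0.05):
    """Pick a random legal action with probability epsilon, else a greedy one."""
    if not legal_actions:
        return None
    if random.random() < epsilon:          # explore: any legal action may be picked
        return random.choice(legal_actions)
    best = max(Q[(state, a)] for a in legal_actions)
    # break ties randomly among the greedy actions
    return random.choice([a for a in legal_actions if Q[(state, a)] == best])

print(epsilon_greedy_action((0, 0), ['north', 'south', 'east', 'west'], epsilon=0.04))
```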
We have built our Q-learning agent by implementing ε-greedy action selection in the getAction method: it chooses a random action an ε fraction of the time and follows its current best Q-values otherwise [5]. Note that choosing a random action may still end up picking the best action, because the random choice is made over all legal actions, not only over the sub-optimal ones. We simulated a binary variable with probability p of success, which returns true or false like a biased coin flip. Our final Q-values resembled those of the value iteration agent, especially along optimal paths; however, our average returns were lower than the Q-values predicted, due to the random actions taken in the initial learning phase. We have also implemented a crawler robot using our Q-learning class and tweaked certain learning parameters to study the agent's policies and actions, as shown in Figure 7.1. We noticed that the step delay is a parameter of the simulation, the learning rate and ε are parameters of our learning algorithm, and the discount factor is a property of the environment.
Chapter 8
Bridge Crossing Revisited
We have trained a completely random Q-learner with the default learning rate on the noiseless BridgeGrid for 50 episodes and observed whether it finds the optimal policy. We noticed that the learning agent finds the optimal path for part of the crossing but does not reach the far side of the bridge, because it is busy exploring rather than exploiting what it has learned, as shown in Figure 8.1. We then tried the same experiment with an ε of 0 and noticed that, after finding a path, the agent only follows that path reflexively and does not explore any other path, as shown in Figure 8.2.
Chapter 9
Q-Learning and Pacman
We have trained a pacman agent which learns the Q-values of positions and actions and of their consequences. It takes a long time to learn accurate Q-values even for small grids. Pacman's training mode runs without the GUI so that the agent can learn as fast as possible [6]. Once training is complete, it enters testing mode. When testing, pacman's exploration rate and learning rate are set to 0, which effectively stops Q-learning and disables exploration so that pacman can exploit its learned policy. The test games are shown in the GUI by default, and we found that pacman avoids the ghost and successfully eats the food, which results in it winning every game and celebrating. The default learning parameters that were effective for this problem are ε = 0.05, α = 0.2 and γ = 0.8. Our Q-learning agent works for GridWorld, the crawler and pacman, since it learns a good policy for every world by handling unseen actions and every other case properly: unseen actions have a Q-value of 0, so if all of the actions that have been seen have negative Q-values, an unseen action may be optimal (illustrated in the short sketch after this paragraph). We have played a total of 2010 games, of which the first 2000 were not displayed since they were used to train pacman, and the last 10 were won by pacman, which shows that it has learned the optimal policy for each state in the pacman world, as shown in Figure 9.1.
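The point about unseen actions can be made concrete with a tiny illustration: because unseen (state, action) pairs read as 0, an action that has never been tried can still win the max when every tried action looks bad. The state name and values below are made up for this example.

```python
from collections import defaultdict

Q = defaultdict(float)                 # unseen (state, action) pairs read as 0
state = 'corner'                       # hypothetical state next to a ghost
Q[(state, 'north')] = -3.0             # tried before and punished
Q[(state, 'east')] = -1.5              # tried before and punished
legal = ['north', 'east', 'south']     # 'south' has never been tried

best = max(legal, key=lambda a: Q[(state, a)])
print(best)                            # -> 'south', the unseen action with Q = 0
```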
During training, we see an output of pacman's performance after every 100 games. ε is kept positive during training so that pacman learns a good policy while still trying to explore new ones; this results in poorer training performance because it occasionally makes a random exploratory move into a ghost. As a benchmark, it should take between 1000 and 1400 games before pacman's reward for a 100-episode segment becomes positive, reflecting that it has started winning more than losing. By the end of training, pacman's rewards should remain positive and fairly high.
The MDP state is the exact board configuration facing pacman, with the now complex transition describing an entire ply of change to that state. The intermediate game configurations in which pacman has moved but the ghosts have not yet replied are not MDP states; they are bundled into the transition. We have seen that, once pacman has completed its training, it should win the test games about 90% of the time, since it is exploiting its learned policy, as shown in Figure 9.1.
However, when we trained the agent on the seemingly simple mediumGrid shown in Figure 9.2, it did not perform well. In our implementation pacman's average training reward remains negative throughout training, and at test time it loses its games because it could not explore all the states. Training also took a long time despite its ineffectiveness. Pacman fails to win on a larger layout because each board configuration is a separate state with separate state values, and the agent has no way to generalise that running into a ghost is bad for all positions. This approach does not scale, so learning a separate Q-value for each state is a bad way to solve pacman.
Chapter 10
Approximate Q-Learning
Approximate Q-learning represents the Q-function as a linear combination of features:

Q(s, a) = \sum_{i=1}^{n} f_i(s, a) \, w_i    (10.1)
where each weight w_i is associated with a particular feature f_i(s, a). We implemented the weight vector as a dictionary mapping the features returned by the feature extractors to weight values. We updated the weights similarly to how we updated Q-values, by applying the following equations:
\text{difference} = \left( r + \gamma \max_{a'} Q(s', a') \right) - Q(s, a)    (10.3)
The difference term is the same as in normal Q-learning, with r being the experienced reward; each weight is then moved in the direction of this difference, scaled by the learning rate and the corresponding feature value, i.e. w_i ← w_i + α · difference · f_i(s, a). By default the approximate Q-agent uses the IdentityExtractor, which assigns a single feature to every ⟨state, action⟩ pair. With this feature extractor, our approximate Q-learning agent behaves identically to PacmanQAgent, as shown in Figure 10.1.
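A minimal sketch of this feature-based update, assuming hypothetical feature names and a plain dictionary of weights rather than the project's exact feature extractor classes, is given below.

```python
from collections import defaultdict

weights = defaultdict(float)           # one weight per feature name

def q_value(features):
    """Q(s, a) = sum_i f_i(s, a) * w_i, with features given as {name: value}."""
    return sum(value * weights[name] for name, value in features.items())

def update(features, reward, next_features_per_action, alpha=0.2, gamma=0.8):
    """Move every weight in the direction of the TD difference (equation 10.3)."""
    next_best = max((q_value(f) for f in next_features_per_action), default=0.0)
    difference = (reward + gamma * next_best) - q_value(features)
    for name, value in features.items():
        weights[name] += alpha * difference * value

# Hypothetical feature vectors for one transition of a pacman-like agent.
f_now = {'bias': 1.0, 'dist-to-food': 0.5, 'ghost-one-step-away': 0.0}
f_next = [{'bias': 1.0, 'dist-to-food': 0.25, 'ghost-one-step-away': 0.0}]
update(f_now, reward=-1.0, next_features_per_action=f_next)
print(dict(weights))
```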
Chapter 11
Conclusion
In this project we used value iteration so that the agent chooses the action that maximises its expected utility. We noticed that when we use value iteration, the policy converges faster than the values, so we need to maintain a threshold after which we stop, since the values keep changing only minutely without significantly affecting the policies of a given GridWorld. The values obtained determine the optimal policy, which decides the agent's action when it is spawned in any given state of the GridWorld. We then implemented the bridge crossing analysis, in which an agent has to cross a bridge by exploring and then exploiting the learned values of each state. We noticed that when the agent explores more, it finds that certain paths are not good or optimal; so, by exploiting the learned policies, the agent arrives at the optimal or best path over time.
Bibliography
[1] DeNero, J., Klein, D., Abbeel, P. (2013). The Pac-Man Projects. UC Berkeley CS188 Intro to AI – Course Materials. Available online at http://ai.berkeley.edu/project_overview.html, last accessed on 26th October, 2019.
[2] Russell, S., Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall.
[6] Pal, J. B. (2019, August 29). Designing of Search Agents Using Pacman. Available online at https://doi.org/10.31219/osf.io/rnsy6, last accessed on 26th October, 2019.