
Reinforcement Learning to control an agent in LunarLander-v2 - an OpenAI gym environment

Akhras Emad ([email protected]), Jain Uday ([email protected]), Lohia Suyash ([email protected]), Ma Chun Lai Jonathan ([email protected])

COMP3340 Group 7

1. Introduction

Inspired by the learning behaviours of biological species, reinforcement learning is considered a fundamental paradigm in machine learning, alongside supervised learning and unsupervised learning, that aims to maximize a notion of cumulative reward in a task. At a high level, a reinforcement learning problem consists of an agent, an environment and a reward function, and the goal is for the agent to learn, by repeatedly interacting with the environment, the actions that maximize the expected reward in that environment [2]. In this project, the task is to learn the optimal actions for controlling the lander so as to maximize the reward. The reward system of LunarLander-v2 is defined in the problem definition section below.

2. Problem Definition

In this project we aim to apply reinforcement learning to train an agent in an OpenAI Gym environment. We adopt “LunarLander-v2” as the simulator environment in which to implement the algorithms, train the agent, and test its performance. The agent is trained to make a safe landing on the moon's surface between two yellow flags under some given constraints. In the context of this paper, the term environment may capture the agent, the yellow flags, the landing surface, etc. [7]. The state space contains 8 state variables, where each variable describes a specific attribute of the agent:

• (x, y): coordinates of the agent in the space
• (vx, vy): velocity vector of the agent in the space
• θ: orientation of the agent in the space
• vθ: angular velocity of the agent
• leftLeg: whether the agent's left leg is on the ground
• rightLeg: whether the agent's right leg is on the ground
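For concreteness, the snippet below shows how this 8-dimensional observation and the discrete action space are exposed by the environment. It is a minimal sketch assuming the classic gym API (observation returned directly by reset(), four-tuple returned by step()); it is not taken from our project code.

import gym  # LunarLander-v2 requires gym's Box2D extra

env = gym.make("LunarLander-v2")
state = env.reset()  # 8-dim vector: x, y, vx, vy, theta, v_theta, leftLeg, rightLeg

# Discrete actions: 0 = do nothing, 1 = fire left engine,
# 2 = fire main engine, 3 = fire right engine
action = env.action_space.sample()
next_state, reward, done, info = env.step(action)
env.close()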

3. Related Work

We have identified a number of unique strategies in the existing literature aimed at solving the lunar lander environment. Some works attempt to solve the environment using a particular algorithm. Lu, Squillante, and Wu implemented an unconventional model-based technique in which the algorithm avoids learning the system's parameters in favour of learning the optimal control parameters [4]. Banerjee, Ghosh, and Das improved upon a simple policy gradient by evolving neural network topologies through incorporating Evolutionary Algorithms (JADE) and Metaheuristic Algorithms (PSO) [1]. Peters, Stewart, West, and Esfandiari's algorithm relied on spiking neural networks to dynamically select actions [8]. Other works instead evaluate some of the reinforcement learning algorithms used to solve the lunar lander environment in the presence of environment and agent uncertainties [9].

4. DQN

In this section, we present the DQN model. The full algorithm for training the agent with a deep Q-network is shown in Algorithm 1 [5]. In addition to the basic DQN algorithm, our model implements additional features, which are detailed in the following subsections. To conclude this section, we also report some findings from running the model on the given problem.

4.1. Experience Replay

Experience Replay is used to reduce the correlation between successive data points. In this method, the experiences of the agent are saved at every time step. From this pool of experiences, a sample is drawn and the model is trained on that sample. By reducing the variance of the updates, the model is able to achieve better convergence behaviour.
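As an illustration of how such a memory might look in Python, the sketch below stores fixed-capacity transitions and returns uniformly sampled mini-batches. The class name, capacity and batch size are our own assumptions, not the project's exact implementation.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the correlation between successive time steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)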
Algorithm 1 Deep Q-Learning with experience replay
initialise Q-network parameters
initialise replay memory D
for episode = 1 → M do
    initialise state ϕ(s1)
    finished ← false                      ▷ check for terminal states
    for t = 1 → T do
        with probability ϵ take a random action a,
        otherwise forward propagate ϕ(st) through the Q-network and execute a = argmax_a Q(ϕ(st), a)
        observe reward r, next state st+1 and whether the episode has finished
        derive ϕ(st+1) from st+1
        store transition (ϕ(st), a, r, ϕ(st+1)) in replay memory D
        sample a random mini-batch of transitions from D
        if a sampled transition is terminal then set its target y = r,
        otherwise set y by forward propagating ϕ(st+1) through the Q-network
        compute the loss and update the parameters with gradient descent
    end for
end for
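As a concrete illustration of the target computation in Algorithm 1, the sketch below forms the targets y for a sampled mini-batch, handling the terminal case separately. It is written in plain Python/NumPy with an assumed q_network(state) helper that returns the vector of Q-values for a state; it is not our project's exact code.

import numpy as np

def td_targets(q_network, batch, gamma=0.99):
    """Compute DQN targets y for a mini-batch of (state, action, reward, next_state, done) tuples."""
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            y = reward  # terminal state: no bootstrapping
        else:
            y = reward + gamma * np.max(q_network(next_state))  # bootstrap from best next action
        targets.append(y)
    return np.array(targets)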
4.2. Exploration v/s Exploitation

When training a reinforcement learning model, the agent can come to favour exploiting the promising areas it has already found, where rewards have been high. Doing so may cause it to ignore unexplored regions of the state-action space that could yield better and more consistent rewards. To balance this trade-off, we use a hyper-parameter that defines a small probability of choosing a random action instead of following the model's current prediction. This process is called exploration.
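A minimal sketch of this ϵ-greedy action selection (the function and parameter names are illustrative, not our exact implementation):

import random
import numpy as np

def select_action(q_network, state, epsilon, n_actions=4):
    """Explore with probability epsilon, otherwise act greedily on the Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # exploration: random action
    return int(np.argmax(q_network(state)))  # exploitation: greedy action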
4.3. Results

Using Experience Replay (Section 4.1) and Exploration v/s Exploitation (Section 4.2), our model has been able to achieve excellent results. These results are presented in Figure 1.

Figure 1. Total Rewards for DQN after 399 episodes.

The model was allowed to run for 390+ episodes. For the first 100 episodes, the average reward is negative because the agent is exploring the environment and collecting information to store in D. After the first 100 episodes, our agent rapidly becomes more consistent at achieving a perfect score. Of all the models tested so far, DQN's performance has been the best, although it took ∼11 hours to train the DQN agent. In the next steps, we aim to explore additional ways of increasing consistency and to analyse the effect of removing the Experience Replay and Exploration v/s Exploitation features from our current model.

5. Policy Gradient

Policy gradient methods aim to maximize the expected reward of an episode. The actor is represented as a neural network, which is trained and improved incrementally based on the gradient of the expected reward given the network. This process is commonly known as gradient ascent. This section focuses on REINFORCE (Monte Carlo Policy Gradient) and Proximal Policy Optimization (PPO).

REINFORCE (Monte Carlo Policy Gradient) is the algorithm that was implemented in the tutorial notebook. Currently, REINFORCE is able to produce positive results with two different network architectures. Figure 2 shows the rewards for REINFORCE using 2 hidden layers with 16 features, while Figure 3 uses 2 hidden layers with 128 features in the hope of boosting the network's performance. Experiments were also conducted on changing the model's activation function from tanh to ReLU, but the results were worse. As seen below, while tanh with 128 features appears to perform better on average, reaching a score slightly over 100 after 1000 batches, its performance diverged considerably more and the curve tends to drop dramatically on occasion. Further fine-tuning of the network structure and parameters is expected.

Figure 2. Total Rewards for REINFORCE with 16 features
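The core of the REINFORCE update is a log-probability-weighted return. The PyTorch-style sketch below is our own illustration of that update (the policy network, optimizer and collected log-probabilities are assumed to exist), not the exact code from the tutorial notebook.

import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE (Monte Carlo policy gradient) update from a finished episode.

    log_probs: list of log pi(a_t | s_t) tensors collected during the episode.
    rewards:   list of scalar rewards r_t for the same episode.
    """
    # Discounted return G_t for every time step, computed backwards
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalising the returns is a common variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Gradient ascent on expected return = gradient descent on -sum(log_prob * G_t)
    loss = -torch.sum(torch.stack(log_probs) * returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()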

Figure 3. Total Rewards for REINFORCE with 128 features

Alternatively, Proximal Policy Optimization (PPO) is a popular and promising method for solving reinforcement learning problems [6]. Standard policy gradient methods suffer from high gradient variance and slow convergence; such variance can impose conflicting descent directions, hindering the training of the agent. The objective function of PPO implements a trust-region update compatible with stochastic gradient descent to address this problem [3]. The PPO algorithm is currently being studied and its implementation is expected to follow.
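For reference, PPO's clipped surrogate objective, which provides the trust-region-like behaviour described above, can be sketched in a few lines. The probability ratios and advantage estimates are assumed to be supplied as tensors; this is an illustrative sketch, not our implementation.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss of PPO (to be minimised by gradient descent)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (element-wise minimum) objective, negated for minimisation
    return -torch.mean(torch.min(unclipped, clipped))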
6. SARSA

We implemented a naive SARSA model for this problem, in which the SARSA agent updates the state-action pair values according to the rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

Comparing it with a random agent that chooses actions uniformly at random, we found that the SARSA agent achieves better results. However, as seen in Figure 4, even after a high number of iterations the average reward of the SARSA agent remains negative, indicating that it needs to be optimised further.

Figure 4. Rewards vs Iteration - SARSA
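A tabular sketch of this update rule (with the Q-table kept as a dictionary over discretised state-action pairs; the names and hyper-parameter values are illustrative, not the project's tuned settings):

from collections import defaultdict

Q = defaultdict(float)  # Q-values over discretised (state, action) pairs

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy SARSA update: bootstrap from the action actually taken next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])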

6.1. State Discretization

Due to the large number of states and actions in the given environment, the state space becomes exponentially large, and as a result the SARSA model is not able to converge to the optimal solution within the given time frame and resources. To reduce the state space, some rules observed from the nature of the task can be hard-coded into the agent [9].

The agent, i.e. the lunar lander, always wants to reach the horizontal centre of the space as soon as possible. For example, if the agent starts at the left of the space, the reward system and action pairs dictate that it move right, and vice-versa. After reaching the centre, the x-coordinate essentially becomes fixed and the y-coordinate only needs to decrease, with the vertical velocity also always being negative.
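One possible discretisation of the continuous state variables is sketched below so that the tabular agent can index its Q-values. The number of bins and the bin range are our own illustrative assumptions, not the values used in our experiments.

import numpy as np

def discretize(state, n_bins=10, low=-1.5, high=1.5):
    """Map the continuous 8-dim state to a tuple of small integers.

    The two leg-contact flags are already binary, so only the first six
    variables are binned.
    """
    edges = np.linspace(low, high, n_bins - 1)
    continuous, legs = state[:6], state[6:]
    binned = tuple(int(np.digitize(v, edges)) for v in continuous)
    return binned + tuple(int(v) for v in legs)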

6.2. Exploration Policy

Even after state discretization, the number of state-action pairs is extremely large. To further optimize the agent, an exploration policy is used. We implemented an ϵ-greedy policy, where ϵ is the probability of the agent acting sub-optimally (experimenting with new state-action pairs) and 1 − ϵ is the probability of the agent acting optimally to maximize the reward. As the number of iterations increases and the agent learns more about the environment, the ϵ value is decayed so that at the later stages there is less sub-optimal behaviour, which pushes the agent to converge on a solution. By hard-coding the values of ϵ based on the iteration number and fine-tuning them, we obtained better results, as shown in Figure 5.

Figure 5. Rewards vs Iteration - SARSA with optimizations
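A hard-coded schedule of this kind might look like the sketch below. The breakpoints and ϵ values are invented for illustration only; they are not the tuned values behind Figure 5.

def epsilon_for_iteration(i):
    """Piecewise epsilon schedule: explore heavily early, act almost greedily later."""
    if i < 1_000:
        return 0.5   # heavy exploration while the Q-table is mostly empty
    if i < 10_000:
        return 0.1   # moderate exploration
    return 0.01      # almost fully greedy in the late stages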
References
[1] Abhijit Banerjee, Dipendranath Ghosh, and Suvrojit Das. Evolving network topology in policy gradient reinforcement learning algorithms. In 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), pages 1–5, 2019.
[2] HKUMMLab. Applied deep learning lecture notes (reinforcement learning).
[3] Jonathan Hui. Proximal policy optimization explained. https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12.
[4] Yingdong Lu, Mark Squillante, and Chai Wu. A control-model-based approach for reinforcement learning, May 2019.
[5] Google DeepMind. Human-level control through deep reinforcement learning. https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf.
[6] OpenAI. Introduction to proximal policy optimization. https://openai.com/blog/openai-baselines-ppo/.
[7] OpenAI. OpenAI Gym documentation. https://gym.openai.com/docs/.
[8] Chad Peters, Terrance Stewart, Robert West, and Babak Esfandiari. Dynamic action selection in OpenAI using spiking neural networks. https://www.aaai.org/ocs/index.php/20FLAIRS/FLAIRS19/paper//view/18312/17429, 2019.
[9] Yunfeng Xin, Soham Gadgil, and Chengzhe Xu. Solving the lunar lander problem under uncertainty using reinforcement learning. https://web.stanford.edu/class/aa228/reports/2019/final8.pdf.
