
Reinforcement Learning to control an agent in LunarLander-v2 - an OpenAI gym environment

Akhras Emad ([email protected]), Jain Uday ([email protected]), Lohia Suyash ([email protected]), Ma Chun Lai Jonathan ([email protected])

COMP3340 Group 7

1. Introduction

Inspired by the learning behaviours of biological species, reinforcement learning is considered a fundamental paradigm in machine learning, alongside supervised learning and unsupervised learning, that aims to maximize a notion of cumulative reward in a task. At a high level, a reinforcement learning problem consists of an agent, an environment and a reward function, and the goal is for the agent to learn, by repeatedly interacting with the environment, the actions that maximize the expected reward in that environment [2]. In this project, the task is to learn the optimal actions for controlling the lander so as to maximize the reward. The reward system of LunarLander-v2 is defined in the problem definition section below.

2. Problem Definition

In this project we aim to apply reinforcement learning to train an agent in an OpenAI Gym environment. We adopt “LunarLander-v2” as the simulator environment in which to implement the algorithms, train the agent, and test its performance. The agent is trained to make a safe landing on the moon's surface between two yellow flags under some given constraints. In the context of this paper, the term environment may capture the agent, the yellow flags, the landing surface, etc. [7]. The state space contains 8 state variables, where each variable describes a specific attribute of the agent:

• (x, y): coordinates of the agent in the space
• (vx, vy): velocity vector of the agent in the space
• θ: orientation of the agent in the space
• vθ: angular velocity of the agent
• leftLeg: whether the agent's left leg is on the ground
• rightLeg: whether the agent's right leg is on the ground
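For concreteness, the snippet below shows how this 8-dimensional observation and the discrete action space are exposed by the environment. It is a minimal sketch assuming the classic gym API (observation returned directly by reset(), four-tuple returned by step()); it is not taken from our project code.

import gym  # LunarLander-v2 requires gym's Box2D extra

env = gym.make("LunarLander-v2")
state = env.reset()  # 8-dim vector: x, y, vx, vy, theta, v_theta, leftLeg, rightLeg

# Discrete actions: 0 = do nothing, 1 = fire left engine,
# 2 = fire main engine, 3 = fire right engine
action = env.action_space.sample()
next_state, reward, done, info = env.step(action)
env.close()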

3. Related Work

We have identified a number of unique strategies in the existing literature aimed at solving the lunar lander environment. Some works attempt to solve the environment using a particular algorithm. Lu, Squillante, and Wu implemented an unconventional model-based technique in which the algorithm avoids learning the system's parameters in favour of learning the optimal control parameters [4]. Banerjee, Ghosh, and Das improved upon a simple policy gradient by evolving neural network topologies through incorporating Evolutionary Algorithms (JADE) and Metaheuristic Algorithms (PSO) [1]. Peters, Stewart, West, and Esfandiari's algorithm relied on spiking neural networks to dynamically select actions [8]. Other works instead evaluate some of the reinforcement learning algorithms used to solve the lunar lander environment in the presence of environment and agent uncertainties [9].

4. DQN

In this section, we present the DQN model. The full algorithm for training the agent with a deep Q-network is shown in Algorithm 1 [5]. In addition to the basic DQN algorithm, our model implements additional features, which are detailed in the following subsections. To conclude this section, we also report some findings from running the model on the given problem.

4.1. Experience Replay

Experience Replay is used to reduce the correlation between successive data points. In this method, the experiences of the agent are saved at every time step. From this pool of experiences, a sample is drawn and the model is trained on that sample. By reducing the variance of the updates, the model is able to achieve better convergence behaviour.
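As an illustration of how such a memory might look in Python, the sketch below stores fixed-capacity transitions and returns uniformly sampled mini-batches. The class name, capacity and batch size are our own assumptions, not the project's exact implementation.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the correlation between successive time steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)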
Algorithm 1 Deep Q-Learning with experience replay
initialise Q-network parameters
initialise replay memory D
for episode = 1 → M do
    initialise state ϕ(s1)
    finished ← false                      ▷ check for terminal states
    for t = 1 → T do
        with probability ϵ take a random action a,
        otherwise forward propagate ϕ(st) through the Q-network and execute a = argmax_a Q(ϕ(st), a)
        observe reward r, next state st+1 and whether the episode has finished
        derive ϕ(st+1) from st+1
        store transition (ϕ(st), a, r, ϕ(st+1)) in replay memory D
        sample a random mini-batch of transitions from D
        if a sampled transition is terminal then set its target y = r,
        otherwise set y by forward propagating ϕ(st+1) through the Q-network
        compute the loss and update the parameters with gradient descent
    end for
end for
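As a concrete illustration of the target computation in Algorithm 1, the sketch below forms the targets y for a sampled mini-batch, handling the terminal case separately. It is written in plain Python/NumPy with an assumed q_network(state) helper that returns the vector of Q-values for a state; it is not our project's exact code.

import numpy as np

def td_targets(q_network, batch, gamma=0.99):
    """Compute DQN targets y for a mini-batch of (state, action, reward, next_state, done) tuples."""
    targets = []
    for state, action, reward, next_state, done in batch:
        if done:
            y = reward  # terminal state: no bootstrapping
        else:
            y = reward + gamma * np.max(q_network(next_state))  # bootstrap from best next action
        targets.append(y)
    return np.array(targets)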
4.2. Exploration v/s Exploitation

When training a reinforcement learning model, the agent can come to favour exploiting the promising areas it has already found, where rewards have been high. Doing so may cause it to ignore unexplored regions of the state-action space that could yield better and more consistent rewards. To balance this trade-off, we use a hyper-parameter that defines a small probability of choosing a random action instead of following the model's current prediction. This process is called exploration.
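A minimal sketch of this ϵ-greedy action selection (the function and parameter names are illustrative, not our exact implementation):

import random
import numpy as np

def select_action(q_network, state, epsilon, n_actions=4):
    """Explore with probability epsilon, otherwise act greedily on the Q-values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)   # exploration: random action
    return int(np.argmax(q_network(state)))  # exploitation: greedy action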
4.3. Results

Using Experience Replay (Section 4.1) and Exploration v/s Exploitation (Section 4.2), our model has been able to achieve excellent results. These results are presented in Figure 1.

Figure 1. Total Rewards for DQN after 399 episodes.

The model was allowed to run for 390+ episodes. For the first 100 episodes, the average reward is negative because the agent is exploring the environment and collecting information to store in D. After the first 100 episodes, our agent rapidly becomes more consistent at achieving a perfect score. Of all the models tested so far, DQN's performance has been the best, although it took ∼11 hours to train the DQN agent. In the next steps, we aim to explore additional ways of increasing consistency and to analyse the effect of removing the Experience Replay and Exploration v/s Exploitation features from our current model.

5. Policy Gradient

Policy gradient methods aim to maximize the expected reward of an episode. The actor is represented as a neural network, which is trained and improved incrementally based on the gradient of the expected reward given the network. This process is commonly known as gradient ascent. This section focuses on REINFORCE (Monte Carlo Policy Gradient) and Proximal Policy Optimization (PPO).

REINFORCE (Monte Carlo Policy Gradient) is the algorithm that was implemented in the tutorial notebook. Currently, REINFORCE is able to produce positive results with two different network architectures. Figure 2 shows the rewards for REINFORCE using 2 hidden layers with 16 features, while Figure 3 uses 2 hidden layers with 128 features in the hope of boosting the network's performance. Experiments were also conducted on changing the model's activation function from tanh to ReLU, but the results were worse. As seen below, while tanh with 128 features appears to perform better on average, reaching a score slightly over 100 after 1000 batches, its performance diverged considerably more and the curve tends to drop dramatically on occasion. Further fine-tuning of the network structure and parameters is expected.

Figure 2. Total Rewards for REINFORCE with 16 features
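The core of the REINFORCE update is a log-probability-weighted return. The PyTorch-style sketch below is our own illustration of that update (the policy network, optimizer and collected log-probabilities are assumed to exist), not the exact code from the tutorial notebook.

import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE (Monte Carlo policy gradient) update from a finished episode.

    log_probs: list of log pi(a_t | s_t) tensors collected during the episode.
    rewards:   list of scalar rewards r_t for the same episode.
    """
    # Discounted return G_t for every time step, computed backwards
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalising the returns is a common variance-reduction trick
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Gradient ascent on expected return = gradient descent on -sum(log_prob * G_t)
    loss = -torch.sum(torch.stack(log_probs) * returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()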

Figure 3. Total Rewards for REINFORCE with 128 features

Alternatively, Proximal Policy Optimization (PPO) is a popular and promising method for solving reinforcement learning problems [6]. Standard policy gradient methods suffer from high gradient variance and slow convergence; such variance can impose conflicting descent directions, hindering the training of the agent. The objective function of PPO implements a trust-region update compatible with stochastic gradient descent to address this problem [3]. The PPO algorithm is currently being studied and its implementation is expected to follow.
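For reference, PPO's clipped surrogate objective, which provides the trust-region-like behaviour described above, can be sketched in a few lines. The probability ratios and advantage estimates are assumed to be supplied as tensors; this is an illustrative sketch, not our implementation.

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss of PPO (to be minimised by gradient descent)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (element-wise minimum) objective, negated for minimisation
    return -torch.mean(torch.min(unclipped, clipped))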
6. SARSA

We implemented a naive SARSA model for this problem, in which the SARSA agent updates the state-action pair values according to the rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

Comparing it with a random agent that chooses actions uniformly at random, we found that the SARSA agent achieves better results. However, as seen in Figure 4, even after a high number of iterations the average reward of the SARSA agent remains negative, indicating that it needs to be optimised further.

Figure 4. Rewards vs Iteration - SARSA
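A tabular sketch of this update rule (with the Q-table kept as a dictionary over discretised state-action pairs; the names and hyper-parameter values are illustrative, not the project's tuned settings):

from collections import defaultdict

Q = defaultdict(float)  # Q-values over discretised (state, action) pairs

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy SARSA update: bootstrap from the action actually taken next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])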

6.1. State Discretization

Due to the large number of states and actions in the given environment, the state space becomes exponentially large, and as a result the SARSA model is not able to converge to the optimal solution within the given time frame and resources. To reduce the state space, some rules observed from the nature of the task can be hard-coded into the agent [9].

The agent, i.e. the lunar lander, always wants to reach the horizontal centre of the space as soon as possible. For example, if the agent starts at the left of the space, the reward system and action pairs dictate that it move right, and vice-versa. After reaching the centre, the x-coordinate essentially becomes fixed and the y-coordinate only needs to decrease, with the vertical velocity also always being negative.
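One possible discretisation of the continuous state variables is sketched below so that the tabular agent can index its Q-values. The number of bins and the bin range are our own illustrative assumptions, not the values used in our experiments.

import numpy as np

def discretize(state, n_bins=10, low=-1.5, high=1.5):
    """Map the continuous 8-dim state to a tuple of small integers.

    The two leg-contact flags are already binary, so only the first six
    variables are binned.
    """
    edges = np.linspace(low, high, n_bins - 1)
    continuous, legs = state[:6], state[6:]
    binned = tuple(int(np.digitize(v, edges)) for v in continuous)
    return binned + tuple(int(v) for v in legs)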

6.2. Exploration Policy

Even after state discretization, the number of state-action pairs is extremely large. To further optimize the agent, an exploration policy is used. We implemented an ϵ-greedy policy, where ϵ is the probability of the agent acting sub-optimally (experimenting with new state-action pairs) and 1 − ϵ is the probability of the agent acting optimally to maximize the reward. As the number of iterations increases and the agent learns more about the environment, the ϵ value is decayed so that at the later stages there is less sub-optimal behaviour, which pushes the agent to converge on a solution. By hard-coding the values of ϵ based on the iteration number and fine-tuning them, we obtained better results, as shown in Figure 5.

Figure 5. Rewards vs Iteration - SARSA with optimizations
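A hard-coded schedule of this kind might look like the sketch below. The breakpoints and ϵ values are invented for illustration only; they are not the tuned values behind Figure 5.

def epsilon_for_iteration(i):
    """Piecewise epsilon schedule: explore heavily early, act almost greedily later."""
    if i < 1_000:
        return 0.5   # heavy exploration while the Q-table is mostly empty
    if i < 10_000:
        return 0.1   # moderate exploration
    return 0.01      # almost fully greedy in the late stages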
References
[1] Abhijit Banerjee, Dipendranath Ghosh, and Suvrojit Das. Evolving network topology in policy gradient reinforcement learning algorithms. In 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), pages 1–5, 2019.
[2] HKUMMLab. Applied deep learning lecture notes (reinforcement learning).
[3] Jonathan Hui. Proximal policy optimization explained. https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12.
[4] Yingdong Lu, Mark Squillante, and Chai Wu. A control-model-based approach for reinforcement learning, May 2019.
[5] Google DeepMind. Human-level control through deep reinforcement learning. https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf.
[6] OpenAI. Introduction to proximal policy optimization. https://openai.com/blog/openai-baselines-ppo/.
[7] OpenAI. OpenAI Gym documentation. https://gym.openai.com/docs/.
[8] Chad Peters, Terrance Stewart, Robert West, and Babak Esfandiari. Dynamic action selection in OpenAI using spiking neural networks. https://www.aaai.org/ocs/index.php/20FLAIRS/FLAIRS19/paper//view/18312/17429, 2019.
[9] Yunfeng Xin, Soham Gadgil, and Chengzhe Xu. Solving the lunar lander problem under uncertainty using reinforcement learning. https://web.stanford.edu/class/aa228/reports/2019/final8.pdf.
