Algorithm 1 Deep Q-Learning with experience replay

    initialise Q-network parameters
    initialise replay memory D
    for episode = 1 → M do
        initialise state ϕ(s1)
        finished ← false                                ▷ check for terminal states
        for t = 1 → T do
            with probability ϵ, take a random action a
            otherwise:
                forward propagate ϕ(st) in the Q-network
                execute a = argmaxa Q(ϕ(st), a)
            observe reward r, the next state st+1 and the terminal flag finished
            derive ϕ(st+1) from st+1
            (ϕ(st), a, r, ϕ(st+1)) → replay memory D    ▷ store the transition
            sample a random mini-batch of transitions from D
            if finished then
                set targets y ← r
            else
                set targets y by forward propagating ϕ(st+1)
            compute the loss
            update parameters with gradient descent
        end for
    end for
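To make Algorithm 1 concrete, the sketch below shows one way the loop could be written. It is a minimal illustration under stated assumptions rather than our actual implementation: the gym LunarLander-v2 environment, a 64-unit Q-network, γ = 0.99, a fixed ϵ = 0.1, a 50 000-transition replay memory and the classic gym reset/step API are choices made only for the example.

    # Minimal sketch of Algorithm 1 (illustrative, not the project's exact code).
    # Assumes the classic gym API (reset returns obs; step returns 4 values).
    import random
    from collections import deque

    import gym
    import torch
    import torch.nn as nn

    env = gym.make("LunarLander-v2")
    n_obs, n_act = env.observation_space.shape[0], env.action_space.n

    q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    memory = deque(maxlen=50_000)                       # replay memory D
    gamma, epsilon, batch_size = 0.99, 0.1, 64

    for episode in range(500):                          # episode = 1 → M
        state, finished = env.reset(), False
        while not finished:                             # t = 1 → T
            # exploration v/s exploitation: random action with probability epsilon
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    q = q_net(torch.as_tensor(state, dtype=torch.float32))
                action = int(q.argmax())

            next_state, reward, finished, _ = env.step(action)
            memory.append((state, action, reward, next_state, finished))
            state = next_state

            if len(memory) >= batch_size:
                # sample a random mini-batch of transitions from D
                s, a, r, s2, d = zip(*random.sample(memory, batch_size))
                s = torch.as_tensor(s, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                a = torch.as_tensor(a)
                r = torch.as_tensor(r, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)

                # targets: y = r for terminal states, else r + gamma * max_a Q(s', a)
                with torch.no_grad():
                    y = r + gamma * (1.0 - d) * q_net(s2).max(dim=1).values
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

                loss = nn.functional.mse_loss(q_sa, y)  # compute loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                        # gradient descent update

In this sketch, Experience Replay corresponds to the memory deque and the mini-batch update, while the ϵ branch implements the exploration behaviour described in Section 4.2.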
4.2. Exploration v/s Exploitation

When training a reinforcement learning model, it is possible for the agent to favor exploiting the promising areas it has already found, where rewards have been high. Doing this can trap the agent in a local optimum and ignore unexplored behaviour that could yield better and more consistent rewards. In order to balance this trade-off, we use a hyper-parameter ϵ to define a small probability of choosing a random action as opposed to following the model's previous prediction. This process is called Exploration.
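The ϵ-greedy choice described above (and used in the sketch following Algorithm 1) can be factored into a single helper. This is only an illustration; the value ϵ = 0.05 and the helper name select_action are assumptions, not the exact settings of our agent.

    import random
    import torch

    def select_action(q_net, state, n_actions, epsilon=0.05):
        """epsilon-greedy: explore with probability epsilon, otherwise exploit."""
        if random.random() < epsilon:
            return random.randrange(n_actions)              # explore: random action
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax())                       # exploit: greedy action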
4.3. Results

Using Experience Replay (Section 4.1) and Exploration v/s Exploitation (Section 4.2), our model has been able to achieve excellent results. These results are presented in Figure 1. In the first 100 episodes, the average reward is negative because the agent is exploring the environment and collecting information to store in D. After the first 100 episodes, our agent rapidly becomes consistent at achieving a perfect score. Of all the models tested so far, DQN's performance has been the best, although it took ∼ 11 hours to train the DQN agent. In the next steps, we aim to explore additional ways of increasing consistency and to analyse the effect of removing the Experience Replay and Exploration v/s Exploitation features from our current model.

Figure 1. Total Rewards for DQN after 399 episodes.

5. Policy Gradient

Policy gradient aims to maximize the expected reward of an episode. The actor is represented as a neural network, which is trained and improved incrementally based on the gradient of the expected reward with respect to the network parameters. This process is commonly known as gradient ascent. This section focuses on REINFORCE (Monte Carlo Policy Gradient) and Proximal Policy Optimization (PPO).
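For reference, the quantity being ascended is the standard Monte Carlo policy gradient estimator; this is the textbook form of the objective rather than a detail specific to our implementation:

    ∇θ J(θ) = E[ Σt ∇θ log πθ(at | st) Gt ]

where πθ is the policy given by the actor network and Gt is the discounted return collected from time step t to the end of the episode. REINFORCE estimates this expectation from complete sampled episodes, which is why it is also called Monte Carlo Policy Gradient.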
REINFORCE (Monte Carlo Policy Gradient) is the algorithm that was implemented in the tutorial notebook. Currently, REINFORCE is able to produce positive results with two different network architectures. Figure 2 shows the rewards for REINFORCE using 2 hidden layers with 16 features, while Figure 3 uses 2 hidden layers with 128 features, in the hope of boosting the performance of the network. Experiments were also done on changing the model's activation function from tanh to ReLU, but the results were worse. As seen below, while tanh with 128 features appears to perform better on average, reaching a score slightly over 100 after 1000 batches, its performance also diverged a lot more, and the curve can be observed to drop dramatically on occasion. More fine-tuning of the network structure and parameters is expected.
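The sketch below shows the shape of the REINFORCE update described in this section: a policy network with 2 hidden layers and tanh activations (hidden width 128, matching one of the two architectures compared above), updated by gradient ascent on the return-weighted log-probabilities. The environment, learning rate, discount factor and normalisation of the returns are illustrative assumptions rather than our exact configuration.

    # Illustrative REINFORCE sketch (assumptions: gym LunarLander-v2, PyTorch,
    # gamma = 0.99, lr = 1e-3, classic gym API) -- not our exact configuration.
    import gym
    import torch
    import torch.nn as nn

    env = gym.make("LunarLander-v2")
    n_obs, n_act = env.observation_space.shape[0], env.action_space.n

    hidden = 128                                   # compare 16 v/s 128 features
    policy = nn.Sequential(
        nn.Linear(n_obs, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),      # 2 hidden layers, tanh activations
        nn.Linear(hidden, n_act),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    gamma = 0.99

    for episode in range(1000):
        log_probs, rewards = [], []
        state, finished = env.reset(), False
        while not finished:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, finished, _ = env.step(int(action))
            rewards.append(reward)

        # discounted return Gt for every time step of the sampled episode
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalise (optional)

        # gradient ascent on sum_t log pi(a_t | s_t) * G_t, done as descent on the negative
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()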
Figure 3. Total Rewards for REINFORCE with 128 features.

Figure 4. Rewards vs Iteration - SARSA.