
CS5500 : Reinforcement Learning

Assignment № 3
Due Date : 12/11/2021
Teaching Assistants : Sai Srinivas Kancheti and Dupati Srikar Chandra

Easwar Subramanian, IIT Hyderabad 25/10/2021

This assignment predominantly involves coding the Deep Q-Network (DQN) and policy gradient (PG) algorithms and testing the implementations on several OpenAI Gym environments. Read all the instructions and questions carefully before you start working on the assignment. The assignment policies remain the same as in Assignments 1 and 2.

Additional Guidelines

• For questions that involve programming, do provide a short write-up for all (sub) questions asked. These can include performance reports, learning curves, hyper-parameter choices, and other insights or justifications that a (sub) question may warrant.

• We expect that training for the Pong game using DQN may take a good amount of compute time; for example, on a modest laptop, it could take about two to three nights. Hence, please start the assignment early.

• For additional insights, some questions will require you to perform multiple runs of Q-learning with different hyper-parameter values or alternate learning-rate or exploration schedules, each of which can take a long time if you choose to work on complex environments.

• Due to the high training time involved in some experiments, consider using tools like screen to run jobs in the background.

• Due to the computational constraints that some of you may face while away from campus, all questions apart from the one involving the Pong game involve training on simple OpenAI Gym environments. Although we have included a version number for each environment in the questions, you might find a slightly different version of the environment in the OpenAI Gym repository.

• Although it is not required, those who have access to decent computational resources are encouraged to try their code on other, more complex environments available in Gym. For example, you could test your DQN code on Atari games like Breakout or Space Invaders, and the PG code on continuous control tasks like inverted pendulum or half-cheetah. You may also want to try your PG implementation on Atari games like Pong. Interesting additional experiments or extensions can be considered for bonus points.

Problem 1 : Deep Q Learning : Programming

This problem involves coding the DQN algorithm and one of its variants, namely DDQN. We will consider two environments from OpenAI Gym for this question, namely MountainCar-v0 and Pong-v0. Although the crux of the DQN algorithm was covered in lectures (Lectures 14, 15 and 16 on Piazza), we encourage you to read the original papers [1, 2]. You may also want to look at the numerous other materials available on the internet.

About Environments

• MountainCar-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. Here, the reward is greater if you spend less energy to reach the goal. This is depicted in Figure 1.

Figure 1: Mountain Car Environment

The position and velocity of the car are the state (or input) variables, while the actions are push left, push right, and no push. For each time step that the car does not reach the goal, located at position 0.5, the environment returns a reward of -1. Depending on the version of the environment, an episode may end after 200 time steps.

• Pong-v0

Pong is a two-dimensional sports game that simulates table tennis. The agent that we intend to develop controls the paddle on the right; it competes with a hard-coded AI agent that controls the paddle on the opposite side. Players use the paddles to hit a ball back and forth. The goal is to develop an agent that can beat this hard-coded player. A screenshot of the game is provided in Figure 2.

Figure 2: Pong Environment

As the raw pixels of a frame are fed as input, there is no unique way to formulate a state space. Usually, the state-space formulation involves some preprocessing of at least two past frames and combining them in a suitable way. Suggested actions for the agent are moving the paddle up (action 2), moving it down (action 3), or doing nothing (action 0). In Pong, one player wins a volley if the ball passes by the other player. Winning a volley gives a reward of 1, while losing gives a reward of -1. An episode (or a game) is over when one of the two players reaches 21 wins.

(a) Develop a small code snippet to load the corresponding Gym environment(s) and print out the respective state and action spaces. Develop a random agent to understand the reward function of each environment. Record your observations. (6 Points)
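Purely as a starting point, here is a minimal sketch for this sub-question. It assumes the classic Gym API from 2021 (reset() returning only the observation and step() returning a 4-tuple); if you use a newer Gym or Gymnasium release, the environment IDs and these signatures may differ, and Pong-v0 additionally requires the Atari dependencies.

# Minimal sketch: load an environment, inspect its spaces, and run a random
# agent for a few episodes to observe the reward signal.
# Assumes the classic (pre-0.26) Gym API.
import gym

def inspect_env(env_id, episodes=3):
    env = gym.make(env_id)
    print(env_id, "| observation space:", env.observation_space,
          "| action space:", env.action_space)
    for ep in range(episodes):
        obs = env.reset()
        done, total_reward, steps = False, 0.0, 0
        while not done:
            action = env.action_space.sample()          # random agent
            obs, reward, done, info = env.step(action)
            total_reward += reward
            steps += 1
        print(f"episode {ep}: steps={steps}, return={total_reward}")
    env.close()

if __name__ == "__main__":
    for env_id in ["MountainCar-v0", "Pong-v0"]:
        inspect_env(env_id)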

(b) Implement the DQN algorithm to solve the tasks envisioned by the two environments. For the Pong-v0 environment, consider pre-processing the frames (to arrive at a state formulation) using the following step(s):

• Convert the RGB image into grayscale and downsample the resultant grayscale image.
• Further, one could also perform image subtraction between two successive frames.

Any other pre-processing framework could also be used. Although, due to computational constraints, one may not be able to develop the 'best' Pong player, a reasonable player can be developed on a modest laptop (say, in 2 to 4 million steps). Include a learning-curve plot showing the performance of your implementation on Pong-v0 and MountainCar-v0. The x-axis should correspond to the number of time steps (suitably scaled) and the y-axis should show the mean n-episode reward (for a suitable choice of n) as well as the best mean reward. For the Pong game, you could use ordinary feed-forward networks or convnets; accordingly, you may have to do suitable preprocessing (a sketch of the steps listed above follows this question). In the case of the MountainCar-v0 environment, plot a single graph that explains the action choices of the trained agent for various values of position and velocity. (15 Points)
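As a concrete illustration of the pre-processing steps listed above, the following sketch converts a raw 210x160 RGB Atari frame into a cropped, downsampled grayscale array and optionally differences two successive frames. The crop rows (34:194) and the 2x downsampling factor are common choices, not requirements of the assignment.

# Possible preprocessing pipeline for Pong frames (grayscale, crop,
# downsample, and optional frame differencing).
import numpy as np

def preprocess(frame):
    """RGB uint8 frame (210, 160, 3) -> cropped, downsampled grayscale float array."""
    gray = frame.mean(axis=2)          # crude RGB -> grayscale
    gray = gray[34:194]                # drop the score bar and bottom border
    gray = gray[::2, ::2]              # downsample by a factor of 2 -> (80, 80)
    return gray.astype(np.float32) / 255.0

def frame_difference(frame, prev_frame):
    """Optional second step: subtract successive preprocessed frames so the
    resulting state captures the motion of the ball and paddles."""
    cur = preprocess(frame)
    prev = preprocess(prev_frame) if prev_frame is not None else np.zeros_like(cur)
    return cur - prev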

(c) The DQN implementation in the above question involves choosing suitable values for a number of hyper-parameters. Choose one hyper-parameter of your choice and run at least three other settings of this hyper-parameter, in addition to the one used in sub-question (b), and plot all four learning curves on the same graph. Examples include: learning rate, neural network architecture, replay buffer size, mini-batch size, target update interval, exploration schedule or exploration rule (e.g. you may implement an alternative to ε-greedy), etc. You are advised to choose a hyper-parameter that makes a nontrivial difference to the performance of the agent. If computational resources are a constraint, consider answering this question only on the MountainCar-v0 environment. (8 Points)
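If it helps, the sketch below shows one way to overlay several learning curves on a single graph. It assumes each run has logged (time step, mean 100-episode reward) pairs to a CSV file; the file names and logging format are purely illustrative.

# Overlay learning curves from several runs on one graph.
# Assumes two-column CSV logs: time step, mean 100-episode reward.
import matplotlib.pyplot as plt
import numpy as np

def plot_runs(csv_files, labels, out_file="hyperparam_comparison.png"):
    for path, label in zip(csv_files, labels):
        data = np.loadtxt(path, delimiter=",")   # columns: time step, mean reward
        plt.plot(data[:, 0], data[:, 1], label=label)
    plt.xlabel("Time steps")
    plt.ylabel("Mean 100-episode reward")
    plt.legend()
    plt.savefig(out_file, dpi=150)

# Example usage (hypothetical file names):
# plot_runs(
#     ["dqn_lr_1e-3.csv", "dqn_lr_5e-4.csv", "dqn_lr_1e-4.csv", "dqn_lr_5e-5.csv"],
#     ["lr=1e-3", "lr=5e-4", "lr=1e-4", "lr=5e-5"],
# )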

(d) Implement the double DQN (DDQN) estimator on the MountainCar-v0 environment and compare the performance of DDQN and vanilla DQN. Since there would be considerable variance between runs, you must run at least three random seeds for both DQN and DDQN and plot the learning curves as described in sub-question (b). If computational resources are a constraint, consider answering this question only on the MountainCar-v0 environment. (6 Points)
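For reference, the only place where DDQN differs from vanilla DQN is in how the bootstrap target is computed. The sketch below illustrates this in PyTorch under the assumption that q_net and target_net stand for your own online and target networks and that the batch tensors (rewards, next_states, and 0/1 done flags) come from your replay buffer.

# Target computation: the single line that changes between DQN and DDQN.
import torch

def td_targets(rewards, next_states, dones, q_net, target_net, gamma=0.99, double=False):
    with torch.no_grad():
        if double:
            # DDQN: the online network selects the action, the target network evaluates it
            next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        else:
            # Vanilla DQN: the target network both selects and evaluates the action
            next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q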

Problem 3 : Policy Gradient

This problem involves coding the policy gradient algorithm with variance reduction techniques.
Consider the following tricks that will be used in the PG implementation.

• Temporal Structure or Reward to go

Consider the following formulations of the policy gradient:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]
\]

1. $\Psi_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{k+1} = G_{0:\infty}$, the total reward of the trajectory

2. $\Psi_t = \sum_{t'=t}^{\infty} \gamma^{t'}\, r_{t'+1} = G_{t:\infty}$, the total reward following action $a_t$ (reward-to-go)

where, for a path $\tau \sim \pi_\theta$,

\[
G_{a:b}(\tau) = \sum_{t=a}^{b} \gamma^{t}\, r_{t+1}.
\]

The first formulation computes the total reward of the trajectory, while the second computes the reward-to-go that results from taking action $a_t$ from state $s_t$. The second formulation is expected to lower the variance of the gradient estimate and would thus help stabilize the policy gradient algorithm.

• Advantage Normalization

Advantage normalization involves rescaling the advantage estimates (computed as reward-to-go minus a suitable baseline) so that they have a mean of 0 and a standard deviation of 1. This technique is known to boost the empirical performance of policy gradient methods by lowering the variance of the advantage estimator, thereby stabilizing the policy gradient algorithm. A short sketch of both tricks follows this list.
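Below is a minimal NumPy sketch of the two tricks; the discount factor and the baseline values are assumed to come from your own PG code. Note that reward_to_go computes the return discounted from step $t$ onward, i.e. $\sum_{t' \ge t} \gamma^{t'-t} r_{t'+1}$; multiplying by $\gamma^{t}$ recovers the $G_{t:\infty}$ defined above.

# Variance-reduction helpers: reward-to-go returns and advantage normalization.
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Discounted return from step t onward for a single trajectory."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def normalize_advantages(advantages, eps=1e-8):
    """Rescale advantages (reward-to-go minus a baseline) to mean 0, std 1."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)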

We will now code a generic policy gradient algorithm and test it on two different environments from OpenAI Gym, namely CartPole-v0 and LunarLander-v2.

About Environments

• CartPole-v0

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track, as shown in Figure 3. The input (or state) to the system consists of the cart position, cart velocity, pole angle, and pole velocity at the tip. The system is controlled by applying a force that pushes the cart either to the left or to the right. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

Figure 3: Cartpole Environment

• LunarLander-v2

The goal of a lunar lander, as you can imagine, is to land on the moon. The landing pad is always at coordinates (0, 0), and the coordinates are the first two numbers in the state vector. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Figure 4 depicts a schematic lunar lander.

Figure 4: Lunar Lander Environment

There are four discrete actions available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. The state consists of the lander's position and velocity, together with its angle, angular velocity, and leg-contact indicators. The reward function is determined by the Lunar Lander environment: the reward is mainly a function of how close the lander is to the landing pad and how close it is to zero speed; roughly, the closer the lander is to landing, the higher the reward. Other factors also affect the reward: firing the main engine deducts points on every frame, moving away from the landing pad deducts points, crashing deducts points, and so on. The game (or episode) ends when the lander lands, crashes, or flies off the screen.

(a) Develop a small code snippet to load the corresponding Gym environments and print out the respective state and action spaces. Develop a random agent to understand the reward function of each environment. Record your observations. (2 Points)

(b) Implement the policy gradient algorithm with and without the reward-to-go and advantage normalization functionality. For the advantage normalization functionality, use a suitable baseline function (Slide 41, Lecture 17 : Policy Gradient Methods). Test them on the CartPole-v0 and LunarLander-v2 environments. The code developed must be runnable from the command line with options such as the environment name, reward-to-go functionality (true or false), advantage normalization functionality (true or false), number of iterations, batch size for the policy gradient, and other command-line variables as you deem fit. Attach to your report a comparison of learning-curve graphs (average return at each iteration) with and without the advantage normalization and reward-to-go functionality. (12 Points)
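One possible command-line interface, sketched with argparse, is shown below. The flag names and defaults are only illustrative; any interface exposing the options listed above is acceptable.

# Illustrative command-line interface for the policy gradient script.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Policy gradient experiments")
    parser.add_argument("--env", default="CartPole-v0",
                        help="Gym environment id, e.g. CartPole-v0 or LunarLander-v2")
    parser.add_argument("--reward-to-go", action="store_true",
                        help="use reward-to-go returns instead of full-trajectory returns")
    parser.add_argument("--normalize-advantages", action="store_true",
                        help="normalize advantages to mean 0, std 1")
    parser.add_argument("--iterations", type=int, default=100)
    parser.add_argument("--batch-size", type=int, default=1000,
                        help="number of environment steps collected per PG update")
    parser.add_argument("--learning-rate", type=float, default=5e-3)
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args()

# Example invocations (script name is hypothetical):
#   python run_pg.py --env CartPole-v0 --reward-to-go --normalize-advantages
#   python run_pg.py --env LunarLander-v2 --iterations 200 --batch-size 4000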

(c) Study and report the impact of batch size on the policy gradient estimates on these envi-
ronments. (6 Points)

(d) In the policy gradient lecture, we saw that adding a baseline to the policy gradient estimate does not create bias. The policy gradient formulation with temporal structure and baseline is given by

\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right]
\]

where $\Psi_t = \sum_{t'=t}^{\infty} \gamma^{t'}\, r_{t'+1} - b(s_t) = G_{t:\infty} - b(s_t)$. Prove that the policy gradient estimate $\nabla_\theta J(\theta)$ has minimum variance for the following choice of baseline $b$, given by

\[
b = \frac{\mathbb{E}_{\pi_\theta}\!\left[\left(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)^{2} G_{t:\infty}(\tau)\right]}{\mathbb{E}_{\pi_\theta}\!\left[\left(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)^{2}\right]}
\]

(5 Points)

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR,
abs/1312.5602, 2013.

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Pe-
tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran,
Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep rein-
forcement learning. Nature, 518(7540):529–533, February 2015.

