CS6700 Programming Assignment 2
Reinforcement Learning
Assignment 2
Contents
1 Problem Statement
  1.1 Environments
    1.1.1 Acrobot
    1.1.2 CartPole
3 Experiments
  3.1 Dueling Deep Q-Network (Dueling DQN)
    3.1.1 CartPole
    3.1.2 Observation
    3.1.3 Acrobot
    3.1.4 Observation
  3.2 REINFORCE MC
    3.2.1 CartPole
    3.2.2 Acrobot
  3.3 Results & Analysis
    3.3.1 Duel DQN Regret Analysis
    3.3.2 Duel DQN Inference
    3.3.3 REINFORCE Regret Analysis
    3.3.4 REINFORCE Inference
List of Tables
1 Actions and Explanations for Acrobot Environment
2 Actions and Explanations for CartPole Environment
3 Summary of the best hyperparameters for each experiment and selection policy
Preface
We would like to thank Prof. Balaraman Ravindran for guiding us in the CS6700 course on Reinforcement Learning. The code for our assignment can be found on GitHub.
1 Problem Statement
This assignment entails implementing Dueling-DQN and MC REINFORCE algorithms with and
without baseline across two distinct Gym environments, namely Acrobot and CartPole.
1.1 Environments
1.1.1 Acrobot
Observation Space: The Acrobot environment comprises a 2-link pendulum system. The observation space includes the angular velocities and the sine and cosine of the angular positions of the two joints.
Action Space: The action is discrete, deterministic, and represents the torque applied on the
actuated joint between the two links.
Table 1: Actions and Explanations for Acrobot Environment

Action   Explanation
0        Apply -1 torque to the actuated joint
1        Apply 0 torque to the actuated joint
2        Apply 1 torque to the actuated joint
Reward Structure: The goal is to have the free end reach a designated target height in as few
steps as possible, and as such, all steps that do not reach the goal incur a reward of -1. Achieving
the target height results in termination with a reward of 0. The reward threshold is -100.
Termination: The episode terminates when:
1. The free end reaches the target height, which is defined as $-\cos(\theta_1) - \cos(\theta_2 + \theta_1) > 1.0$.
Figure 1: Acrobot
1.1.2 CartPole
Observation Space: The CartPole environment involves a pole attached to a cart, which can move
along a frictionless track. The observation space consists of the cart position, cart velocity, pole
angle, and pole angular velocity.
Action Space: The agent can apply a force to push the cart to the left (0) or to the right (1).
Table 2: Actions and Explanations for CartPole Environment

Action   Explanation
0        Push cart to the left
1        Push cart to the right
Reward Structure: Since the goal is to keep the pole upright for as long as possible, a reward of
+1 for every step taken, including the termination step, is allotted. The threshold for rewards is 500
for v1 and 200 for v0.
Termination: The episode terminates when:
1. Pole Angle is greater than ±12°.
2. Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display).
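For reference, both environments can be instantiated through the Gym API. The snippet below is a minimal sketch assuming the classic (pre-0.26) Gym interface and the CartPole-v1 / Acrobot-v1 version strings; these are assumptions based on the thresholds quoted above, not taken from the assignment code.

import gym

for env_id in ["CartPole-v1", "Acrobot-v1"]:
    env = gym.make(env_id)
    print(env_id, env.observation_space.shape, env.action_space.n)

    state = env.reset()                      # classic Gym: reset() returns the observation
    done, total_return = False, 0.0
    while not done:
        action = env.action_space.sample()   # random policy, for illustration only
        state, reward, done, info = env.step(action)
        total_return += reward
    print(env_id, "random-policy return:", total_return)
    env.close()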
2 Diving into the Algorithms
2.1 Double Deep Q-Network (Double DQN)
Double Deep Q-Network (Double DQN) is an extension of the Deep Q-Network (DQN) algorithm
that addresses the issue of overestimation of Q-values, which can lead to suboptimal policies. In
standard DQN, the Q-values used to select actions and compute the target values are derived from
the same network, resulting in potential overestimation.
This algorithm uses two separate networks when estimating the target Q-values for the next state: the online network selects the greedy action, and the target network evaluates it. We can thus reduce the maximization bias by disentangling action selection from the biased value estimates.
The key update equation (target) for Double DQN is as follows:
\[
y_t = r_{t+1} + \gamma \, Q\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta);\ \theta^{-}\right)
\]
where θ are the parameters of the online network (used to select the action) and θ⁻ are the parameters of the target network (used to evaluate it).
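As an illustration (not the assignment's code), a minimal PyTorch sketch of this target computation could look as follows, assuming q_online and q_target are modules mapping a batch of states to Q-values for all actions:

import torch

def double_dqn_targets(q_online, q_target, rewards, next_states, dones, gamma):
    # rewards and dones are tensors of shape (batch, 1)
    with torch.no_grad():
        # 1) select the greedy next action with the online network
        next_actions = q_online(next_states).argmax(dim=1, keepdim=True)
        # 2) evaluate that action with the target network
        next_q = q_target(next_states).gather(1, next_actions)
        # 3) bootstrap, zeroing out terminal transitions
        return rewards + gamma * next_q * (1 - dones)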
2.3 REINFORCE Algorithm
The REINFORCE algorithm, short for REward Increment = Nonnegative Factor × Offset Reinforce-
ment × Characteristic Eligibility, is a policy gradient method used for training neural network-based
policies in reinforcement learning tasks.
We define:
1. θ as the parameters of the policy function π(a|s; θ), which defines the probability of taking action a given state s.
2. R(τ) as the return of a trajectory τ, which is the sum of rewards obtained along that trajectory.
The algorithm proceeds as follows:
1. We generate trajectories by interacting with the environment using the current policy π(a|s; θ). Each trajectory consists of a sequence of states, actions, and rewards.
2. For each sampled trajectory, we compute the return R(τ ), which is the sum of rewards obtained
along that trajectory.
3. We compute the gradient of the log-probability of the trajectory with respect to the policy
parameters θ. This is done by taking the gradient of the log-probability of each action with
respect to θ, and then summing over all time steps.
4. We multiply the log-probability gradients by the corresponding returns and take their average
across all sampled trajectories. This gives us an estimate of the gradient of the expected return.
The resulting estimator is
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \sum_{t=1}^{T} \nabla_\theta \log \pi\!\left(a_t^i \mid s_t^i; \theta\right)
\]
where N is the number of sampled trajectories, T is the length of each trajectory, τ_i represents the i-th sampled trajectory, and a_t^i and s_t^i denote the action and state at time step t in trajectory i, respectively. This gradient estimate is then used to update the policy parameters via gradient ascent.
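As an illustration of steps 1 to 4, a minimal PyTorch sketch of the resulting surrogate loss is given below; the function name and the trajectory data structures are illustrative assumptions, not the assignment's actual code.

import torch

def reinforce_loss(log_probs_per_traj, returns_per_traj):
    # log_probs_per_traj: list of 1-D tensors with log pi(a_t | s_t; theta) for each step
    # returns_per_traj:   list of scalars R(tau_i), one per sampled trajectory
    losses = []
    for log_probs, R in zip(log_probs_per_traj, returns_per_traj):
        # weight the summed log-probabilities by the trajectory return (step 4)
        losses.append(-R * log_probs.sum())
    # average over the N trajectories; minimising this loss performs gradient ascent on J(theta)
    return torch.stack(losses).mean()

Calling backward() on this loss and taking an optimizer step then performs the gradient-ascent update on θ.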
2.5 Baseline and Value Function
In the basic REINFORCE algorithm, the gradient of the expected return is estimated without using a baseline. However, subtracting a baseline from the returns can reduce the variance of the gradient estimates, leading to faster and more stable learning.
A good choice for a baseline is the value function V (s), which estimates the expected return from
state s under the current policy. Subtracting the value function from the return normalizes the
advantages of different actions, making the gradient estimates less noisy. The baseline-adjusted
gradient estimate becomes:
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( R(\tau_i) - V(s_0^i) \right) \sum_{t=1}^{T} \nabla_\theta \log \pi\!\left(a_t^i \mid s_t^i; \theta\right) \tag{5}
\]
where s_0^i is the initial state of trajectory i, and V(s_0^i) is the estimated value of the initial state.
By using the value function as a baseline, we ensure that actions that result in higher returns than
expected receive a positive gradient, while actions with lower returns than expected receive a neg-
ative gradient. This encourages exploration by favoring actions that lead to better-than-expected
outcomes.
The value function V (s) can be estimated using methods such as Monte Carlo estimation or Temporal
Difference learning, depending on the available data and computational resources.
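A minimal sketch of the baseline-adjusted update in Eq. (5) is given below, assuming a separate value network (value_net) fitted to the Monte Carlo returns with an MSE loss; the names and the critic loss choice are illustrative assumptions rather than the assignment's code.

import torch
import torch.nn.functional as F

def reinforce_with_baseline_losses(log_probs_per_traj, returns, initial_states, value_net):
    # log_probs_per_traj: list of 1-D tensors of log pi(a_t | s_t; theta)
    # returns:            tensor of shape (N,) with R(tau_i)
    # initial_states:     tensor of shape (N, state_dim) with s_0^i of each trajectory
    baselines = value_net(initial_states).squeeze(-1)     # V(s_0^i)
    advantages = returns - baselines.detach()             # R(tau_i) - V(s_0^i)
    policy_loss = torch.stack(
        [-adv * lp.sum() for adv, lp in zip(advantages, log_probs_per_traj)]
    ).mean()
    # fit the baseline to the Monte Carlo returns (one possible choice)
    value_loss = F.mse_loss(baselines, returns)
    return policy_loss, value_loss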
3 Experiments
3.1 Dueling Deep Q-Network (Dueling DQN)
The experiment involves implementing DQN and Double DQN with a Dueling architecture for the Q-value function, tuning hyperparameters, and comparing plots for the two Q-value estimates: one using the mean of the advantages and the other using the maximum of the advantages.
\[
Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right) \qquad \text{(mean of advantages)}
\]
\[
Q(s, a) = V(s) + \left( A(s, a) - \max_{a'} A(s, a') \right) \qquad \text{(maximum advantage)}
\]
Figure 3: Architecture
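A minimal sketch of a dueling Q-network head implementing both aggregation rules is shown below; the layer sizes and names are illustrative and not the exact architecture of Figure 3. In the assignment code the network returns V and A separately and the aggregation is performed in learn(); here it is folded into forward() for brevity.

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=16, use_mean=True):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, action_dim)    # A(s, a)
        self.use_mean = use_mean

    def forward(self, state):
        x = self.feature(state)
        V, A = self.value(x), self.advantage(x)
        if self.use_mean:
            return V + (A - A.mean(dim=1, keepdim=True))      # mean aggregation
        return V + (A - A.max(dim=1, keepdim=True).values)    # max aggregation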
The following function (learn) is part of the class TutorialAgent and implements the backpropagation step using an MSE loss for a given batch of Q-value estimates. In this case we have used the mean of the advantage values. For the max-advantage variant, the maximum of the advantages over the actions can be used instead, keeping the rest of the code the same.
def learn(self, experiences, gamma):
    """Update the network from a sampled batch (experience replay present)."""
    states, actions, rewards, next_states, dones = experiences
    # Get max predicted Q-values (for the next states) from the target model,
    # combining V and the mean-centred advantages (dueling, mean aggregation)
    V_targets_next, A_targets_next = self.qnetwork_target(next_states)
    Q_targets_next = torch.add(
        V_targets_next, A_targets_next - A_targets_next.mean(dim=1, keepdim=True)
    ).detach().max(1)[0].unsqueeze(1)
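The listing above stops before the actual loss and update. Under the assumption that the online network self.qnetwork_local mirrors the target network's (V, A) output and that self.optimizer and a soft_update helper exist (these names are assumptions, not the assignment's code), the remaining steps of learn would look roughly as follows, with F referring to torch.nn.functional:

    # TD targets for the sampled batch
    Q_targets = rewards + gamma * Q_targets_next * (1 - dones)

    # Expected Q-values from the online network for the actions actually taken
    V_expected, A_expected = self.qnetwork_local(states)
    Q_expected = torch.add(
        V_expected, A_expected - A_expected.mean(dim=1, keepdim=True)
    ).gather(1, actions)

    # MSE loss and backpropagation
    loss = F.mse_loss(Q_expected, Q_targets)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # Move the target network towards the online network
    self.soft_update(self.qnetwork_local, self.qnetwork_target)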
Since RL problems, especially DQN networks, require a significant amount of data and time to train, the models are trained until the average return exceeds a certain threshold (250 in the case of CartPole and -102 in the case of Acrobot).
3.1.1 CartPole
Return Analysis over Episodes:
Since training can end either because the agent reaches the average-return limit or because the maximum episode limit is hit, the return arrays may have different lengths. Hence, the minimum number of episodes across runs is used for the regret and plot analysis. The regret is averaged over this minimum number of episodes to obtain a normalised regret.
3.1.2 Observation
1. Clearly, Type-2 Dueling DQN converged within 3000 episodes, which is considerably faster than Type-1 Dueling DQN, which converged in 6000 episodes.
2. Over the interval of Type-2 episodes, the average regret of Type-1 DQN and Type-2 DQN turns out to be 181.66 and 197.39 respectively.
3.1.3 Acrobot
Return Analysis over Episodes:
Just like the previous case, since training can end either because the agent reaches the average-return limit or because the maximum episode limit is hit, the return arrays may have different lengths. Hence, the minimum number of episodes across runs is used for the regret and plot analysis. The regret is averaged over this minimum number of episodes to obtain a normalised regret.
3.1.4 Observation
1. Clearly, Type-1 Dueling DQN converged within 500 episodes, which is faster than Type-2 Dueling DQN, which converged in 3000 episodes.
2. Over the interval of Type-1 episodes, the average regret of Type-1 DQN and Type-2 DQN turns out to be 85.93 and 125.36 respectively.
3.2 REINFORCE MC
Kindly refer to the GitHub page mentioned in the Preface for the complete code.
Figure 9: REINFORCE Value Function Neural Network
3.2.1 CartPole
3.2.2 Acrobot
3.3 Results & Analysis
3.3.1 Duel DQN Regret Analysis:
The analysis involves calculating the return at every episode until a certain threshold is reached,
which is done for different hyperparameters like batch size, learning rate, and hidden layer size.
Considering 3 different batch sizes, 3 different learning rates, and 3 different hidden layer sizes for each of the 2 types of Dueling architecture, in total we compare 27 combinations of hyperparameters for each type of Dueling DQN.
Here, the regret is calculated by summing, over all training episodes, the difference between the threshold and the return of each episode.
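A minimal sketch of this regret computation (the function names are illustrative):

import numpy as np

def regret(episode_returns, threshold):
    # sum of (threshold - return) over all training episodes
    r = np.asarray(episode_returns, dtype=float)
    return float(np.sum(threshold - r))

def normalised_regret(episode_returns, threshold, n_episodes):
    # mean per-episode regret over the first n_episodes (the minimum number of
    # episodes across runs), so runs of different lengths can be compared
    r = np.asarray(episode_returns[:n_episodes], dtype=float)
    return float(np.mean(threshold - r))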
The following table shows the best parameters corresponding to each batch size:
Environment  Type  Batch Size  Learning Rate  Hidden Layer Nodes  Regret
CartPole     1     16          0.001          16                  1097523.0
CartPole     1     32          0.001          16                  406653
CartPole     1     64          0.01           8                   490807
CartPole     2     16          0.0001         8                   1148520
CartPole     2     32          0.001          16                  928779
CartPole     2     64          0.01           16                  221300
Acrobot      1     32          0.0001         8                   28054
Acrobot      1     64          0.0001         16                  29268
Acrobot      1     128         0.0001         16                  27663
Acrobot      2     32          0.0001         32                  40077
Acrobot      2     64          0.0001         32                  36542
Acrobot      2     128         0.0001         16                  35405

Table 3: Summary of the best hyperparameters for each experiment and selection policy
2. Generally, the agent with a higher learning rate turns out to be the one with the least regret, which is intuitive as well, provided the learning rate is not so high that the weights start diverging from the optimal values.
3. Generally, the agent with a higher batch size turns out to be the one with the least regret, since more data is provided in a given episode with a larger batch size.
4. In CartPole, the regret for the neural network (NN) with hidden layer size 8 is significantly higher than that of the NN with size 16, indicating that 8 nodes are not sufficient to map states to Q-values accurately.
3.3.3 REINFORCE Regret Analysis:
Considering 1 batch size (namely 1), 3 different learning rates, and 3 different hidden layer sizes for each of the 2 types of REINFORCE algorithm, in total we compare 9 combinations of hyperparameters for each type of REINFORCE algorithm.
Here, the regret is calculated by summing, over all training episodes, the difference between the threshold and the return of each episode.
The following table shows the best parameters corresponding to each batch size:
2. Generally, the agent with a higher learning rate turns out to be the one with the least regret, which is intuitive as well, provided the learning rate is not so high that the weights start diverging from the optimal values.
3. Generally, the agent with a higher batch size turns out to be the one with the least regret, since more data is provided in a given episode with a larger batch size.
4. In CartPole, the regret for the neural network (NN) with hidden layer size 32 is significantly higher than that of the NN with size 64, indicating that 32 nodes are not sufficient to map states to the policy outputs accurately.
Thank You!