CS6700 Programming Assignment 2
Reinforcement Learning
Assignment 2
Contents
1 Problem Statement
  1.1 Environments
    1.1.1 Acrobot
    1.1.2 CartPole
3 Experiments
  3.1 Dueling Deep Q-Network (Dueling DQN)
    3.1.1 CartPole
    3.1.2 Observation
    3.1.3 Acrobot
    3.1.4 Observation
  3.2 REINFORCE MC
    3.2.1 CartPole
    3.2.2 Acrobot
  3.3 Results & Analysis
    3.3.1 Duel DQN Regret Analysis
    3.3.2 Duel DQN Inference
    3.3.3 REINFORCE Regret Analysis
    3.3.4 REINFORCE Inference
List of Tables
1 Actions and Explanations for Acrobot Environment
2 Actions and Explanations for CartPole Environment
3 Summary of the best hyperparameters for each experiment and selection policy
Preface
We would like to thank Prof. Balaraman Ravindran for guiding us in the CS6700 course on Reinforcement Learning. The code for our assignment can be found on GitHub.
1 Problem Statement
This assignment entails implementing Dueling-DQN and MC REINFORCE algorithms with and
without baseline across two distinct Gym environments, namely Acrobot and CartPole.
1.1 Environments
1.1.1 Acrobot
Observation Space: The Acrobot environment comprises a 2-link pendulum system. The observation space includes the angular velocities and the sine and cosine of the angular positions of the two joints.
Action Space: The action is discrete, deterministic, and represents the torque applied on the
actuated joint between the two links.
Table 1: Actions and Explanations for Acrobot Environment

Action   Explanation
0        Apply -1 torque to the actuated joint
1        Apply 0 torque to the actuated joint
2        Apply 1 torque to the actuated joint
Reward Structure: The goal is to have the free end reach a designated target height in as few
steps as possible, and as such, all steps that do not reach the goal incur a reward of -1. Achieving
the target height results in termination with a reward of 0. The reward threshold is -100.
Termination: The episode terminates when:
1. The free end reaches the target height, which is defined as $-\cos(\theta_1) - \cos(\theta_2 + \theta_1) > 1.0$.
Figure 1: Acrobot
1.1.2 CartPole
Observation Space: The CartPole environment involves a pole attached to a cart, which can move
along a frictionless track. The observation space consists of the cart position, cart velocity, pole
angle, and pole angular velocity.
Action Space: The agent can apply a force to push the cart to the left (0) or to the right (1).
Table 2: Actions and Explanations for CartPole Environment

Action   Explanation
0        Push cart to the left
1        Push cart to the right
Reward Structure: Since the goal is to keep the pole upright for as long as possible, a reward of
+1 for every step taken, including the termination step, is allotted. The threshold for rewards is 500
for v1 and 200 for v0.
Termination: The episode terminates when:
1. Pole Angle is greater than ±12°.
2. Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display).
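For reference, both environments can be instantiated through the Gym API. The snippet below is a minimal sketch assuming the classic (pre-0.26) Gym interface and the CartPole-v1 / Acrobot-v1 version strings; these are assumptions based on the thresholds quoted above, not taken from the assignment code.

import gym

for env_id in ["CartPole-v1", "Acrobot-v1"]:
    env = gym.make(env_id)
    print(env_id, env.observation_space.shape, env.action_space.n)

    state = env.reset()                      # classic Gym: reset() returns the observation
    done, total_return = False, 0.0
    while not done:
        action = env.action_space.sample()   # random policy, for illustration only
        state, reward, done, info = env.step(action)
        total_return += reward
    print(env_id, "random-policy return:", total_return)
    env.close()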
2 Diving into the Algorithms
2.1 Double Deep Q-Network (Double DQN)
Double Deep Q-Network (Double DQN) is an extension of the Deep Q-Network (DQN) algorithm
that addresses the issue of overestimation of Q-values, which can lead to suboptimal policies. In
standard DQN, the Q-values used to select actions and compute the target values are derived from
the same network, resulting in potential overestimation.
This algorithm uses two separate networks when estimating the target Q-values for the next state: the online network selects the greedy action, and the target network evaluates it. We can thus reduce the maximization bias by disentangling action selection from the biased value estimates.
The key update equation (target) for Double DQN is as follows:
\[
y_t = r_{t+1} + \gamma \, Q\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta);\ \theta^{-}\right)
\]
where θ are the parameters of the online network (used to select the action) and θ⁻ are the parameters of the target network (used to evaluate it).
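As an illustration (not the assignment's code), a minimal PyTorch sketch of this target computation could look as follows, assuming q_online and q_target are modules mapping a batch of states to Q-values for all actions:

import torch

def double_dqn_targets(q_online, q_target, rewards, next_states, dones, gamma):
    # rewards and dones are tensors of shape (batch, 1)
    with torch.no_grad():
        # 1) select the greedy next action with the online network
        next_actions = q_online(next_states).argmax(dim=1, keepdim=True)
        # 2) evaluate that action with the target network
        next_q = q_target(next_states).gather(1, next_actions)
        # 3) bootstrap, zeroing out terminal transitions
        return rewards + gamma * next_q * (1 - dones)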
2.3 REINFORCE Algorithm
The REINFORCE algorithm, short for REward Increment = Nonnegative Factor × Offset Reinforce-
ment × Characteristic Eligibility, is a policy gradient method used for training neural network-based
policies in reinforcement learning tasks.
We define:
1. θ as the parameters of the policy function π(a|s; θ), which defines the probability of taking action a given state s.
2. R(τ) as the return of a trajectory τ, which is the sum of rewards obtained along that trajectory.
The algorithm proceeds as follows:
1. We generate trajectories by interacting with the environment using the current policy π(a|s; θ). Each trajectory consists of a sequence of states, actions, and rewards.
2. For each sampled trajectory, we compute the return R(τ ), which is the sum of rewards obtained
along that trajectory.
3. We compute the gradient of the log-probability of the trajectory with respect to the policy
parameters θ. This is done by taking the gradient of the log-probability of each action with
respect to θ, and then summing over all time steps.
4. We multiply the log-probability gradients by the corresponding returns and take their average
across all sampled trajectories. This gives us an estimate of the gradient of the expected return.
The resulting estimator is
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} R(\tau_i) \sum_{t=1}^{T} \nabla_\theta \log \pi\!\left(a_t^i \mid s_t^i; \theta\right)
\]
where N is the number of sampled trajectories, T is the length of each trajectory, τ_i represents the i-th sampled trajectory, and a_t^i and s_t^i denote the action and state at time step t in trajectory i, respectively. This gradient estimate is then used to update the policy parameters via gradient ascent.
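As an illustration of steps 1 to 4, a minimal PyTorch sketch of the resulting surrogate loss is given below; the function name and the trajectory data structures are illustrative assumptions, not the assignment's actual code.

import torch

def reinforce_loss(log_probs_per_traj, returns_per_traj):
    # log_probs_per_traj: list of 1-D tensors with log pi(a_t | s_t; theta) for each step
    # returns_per_traj:   list of scalars R(tau_i), one per sampled trajectory
    losses = []
    for log_probs, R in zip(log_probs_per_traj, returns_per_traj):
        # weight the summed log-probabilities by the trajectory return (step 4)
        losses.append(-R * log_probs.sum())
    # average over the N trajectories; minimising this loss performs gradient ascent on J(theta)
    return torch.stack(losses).mean()

Calling backward() on this loss and taking an optimizer step then performs the gradient-ascent update on θ.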
2.5 Baseline and Value Function
In the basic REINFORCE algorithm, the gradient of the expected return is estimated without using a baseline. However, subtracting a baseline from the returns can reduce the variance of the gradient estimates, leading to faster and more stable learning.
A good choice for a baseline is the value function V (s), which estimates the expected return from
state s under the current policy. Subtracting the value function from the return normalizes the
advantages of different actions, making the gradient estimates less noisy. The baseline-adjusted
gradient estimate becomes:
\[
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( R(\tau_i) - V(s_0^i) \right) \sum_{t=1}^{T} \nabla_\theta \log \pi\!\left(a_t^i \mid s_t^i; \theta\right) \tag{5}
\]
where s_0^i is the initial state of trajectory i, and V(s_0^i) is the estimated value of the initial state.
By using the value function as a baseline, we ensure that actions that result in higher returns than
expected receive a positive gradient, while actions with lower returns than expected receive a neg-
ative gradient. This encourages exploration by favoring actions that lead to better-than-expected
outcomes.
The value function V (s) can be estimated using methods such as Monte Carlo estimation or Temporal
Difference learning, depending on the available data and computational resources.
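A minimal sketch of the baseline-adjusted update in Eq. (5) is given below, assuming a separate value network (value_net) fitted to the Monte Carlo returns with an MSE loss; the names and the critic loss choice are illustrative assumptions rather than the assignment's code.

import torch
import torch.nn.functional as F

def reinforce_with_baseline_losses(log_probs_per_traj, returns, initial_states, value_net):
    # log_probs_per_traj: list of 1-D tensors of log pi(a_t | s_t; theta)
    # returns:            tensor of shape (N,) with R(tau_i)
    # initial_states:     tensor of shape (N, state_dim) with s_0^i of each trajectory
    baselines = value_net(initial_states).squeeze(-1)     # V(s_0^i)
    advantages = returns - baselines.detach()             # R(tau_i) - V(s_0^i)
    policy_loss = torch.stack(
        [-adv * lp.sum() for adv, lp in zip(advantages, log_probs_per_traj)]
    ).mean()
    # fit the baseline to the Monte Carlo returns (one possible choice)
    value_loss = F.mse_loss(baselines, returns)
    return policy_loss, value_loss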
3 Experiments
3.1 Dueling Deep Q-Network (Dueling DQN)
The experiment involves implementing DQN and Double DQN with a Dueling architecture for the Q-value function, tuning hyperparameters, and comparing plots for the two Q-value estimates: one using the mean of the advantages and the other using the maximum of the advantages.
\[
Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right) \qquad \text{(mean of advantages)}
\]
\[
Q(s, a) = V(s) + \left( A(s, a) - \max_{a'} A(s, a') \right) \qquad \text{(maximum advantage)}
\]
Figure 3: Architecture
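A minimal sketch of a dueling Q-network head implementing both aggregation rules is shown below; the layer sizes and names are illustrative and not the exact architecture of Figure 3. In the assignment code the network returns V and A separately and the aggregation is performed in learn(); here it is folded into forward() for brevity.

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=16, use_mean=True):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                 # V(s)
        self.advantage = nn.Linear(hidden, action_dim)    # A(s, a)
        self.use_mean = use_mean

    def forward(self, state):
        x = self.feature(state)
        V, A = self.value(x), self.advantage(x)
        if self.use_mean:
            return V + (A - A.mean(dim=1, keepdim=True))      # mean aggregation
        return V + (A - A.max(dim=1, keepdim=True).values)    # max aggregation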
The following function (learn) is part of the class TutorialAgent and implements the backpropagation step using an MSE loss for a given batch of Q-value estimates. In this case we have used the mean of the advantage values. For the max-advantage variant, the maximum of the advantages over the actions can be used instead, keeping the rest of the code the same.
def learn(self, experiences, gamma):
    """Update the network from a sampled batch (experience replay present)."""
    states, actions, rewards, next_states, dones = experiences
    # Get max predicted Q-values (for the next states) from the target model,
    # combining V and the mean-centred advantages (dueling, mean aggregation)
    V_targets_next, A_targets_next = self.qnetwork_target(next_states)
    Q_targets_next = torch.add(
        V_targets_next, A_targets_next - A_targets_next.mean(dim=1, keepdim=True)
    ).detach().max(1)[0].unsqueeze(1)
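The listing above stops before the actual loss and update. Under the assumption that the online network self.qnetwork_local mirrors the target network's (V, A) output and that self.optimizer and a soft_update helper exist (these names are assumptions, not the assignment's code), the remaining steps of learn would look roughly as follows, with F referring to torch.nn.functional:

    # TD targets for the sampled batch
    Q_targets = rewards + gamma * Q_targets_next * (1 - dones)

    # Expected Q-values from the online network for the actions actually taken
    V_expected, A_expected = self.qnetwork_local(states)
    Q_expected = torch.add(
        V_expected, A_expected - A_expected.mean(dim=1, keepdim=True)
    ).gather(1, actions)

    # MSE loss and backpropagation
    loss = F.mse_loss(Q_expected, Q_targets)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # Move the target network towards the online network
    self.soft_update(self.qnetwork_local, self.qnetwork_target)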
Since RL problems, especially DQN networks, require a significant amount of data and time to train, the models are trained until the average return exceeds a certain threshold (250 in the case of CartPole and -102 in the case of Acrobot).
3.1.1 CartPole
Return Analysis over Episodes:
Since training can end either because the agent reaches the average-return limit or because the maximum episode limit is hit, the return arrays may have different lengths. Hence, the minimum number of episodes across runs is used for the regret and plot analysis. The regret is averaged over this minimum number of episodes to obtain a normalised regret.
3.1.2 Observation
1. Clearly, Type-2 Dueling DQN converged within 3000 episodes, which is considerably faster than Type-1 Dueling DQN, which converged in 6000 episodes.
2. Over the interval of Type-2 episodes, the average regret of Type-1 DQN and Type-2 DQN turns out to be 181.66 and 197.39 respectively.
3.1.3 Acrobot
Return Analysis over Episodes:
Just like the previous case, since training can end either because the agent reaches the average-return limit or because the maximum episode limit is hit, the return arrays may have different lengths. Hence, the minimum number of episodes across runs is used for the regret and plot analysis. The regret is averaged over this minimum number of episodes to obtain a normalised regret.
3.1.4 Observation
1. Clearly, Type-1 Dueling DQN converged within 500 episodes, which is faster than Type-2 Dueling DQN, which converged in 3000 episodes.
2. Over the interval of Type-1 episodes, the average regret of Type-1 DQN and Type-2 DQN turns out to be 85.93 and 125.36 respectively.
3.2 REINFORCE MC
Kindly refer to the GitHub page mentioned in the Preface for the complete code.
Figure 9: REINFORCE Value Function Neural Network
3.2.1 CartPole
3.2.2 Acrobot
3.3 Results & Analysis
3.3.1 Duel DQN Regret Analysis:
The analysis involves calculating the return at every episode until a certain threshold is reached,
which is done for different hyperparameters like batch size, learning rate, and hidden layer size.
Considering 3 different batch sizes, 3 different learning rates, and 3 different hidden layer sizes for each of the 2 types of Dueling architecture, in total we compare 27 combinations of hyperparameters for each type of Dueling DQN.
Here, the regret is calculated by summing, over all training episodes, the difference between the threshold and the return of each episode.
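A minimal sketch of this regret computation (the function names are illustrative):

import numpy as np

def regret(episode_returns, threshold):
    # sum of (threshold - return) over all training episodes
    r = np.asarray(episode_returns, dtype=float)
    return float(np.sum(threshold - r))

def normalised_regret(episode_returns, threshold, n_episodes):
    # mean per-episode regret over the first n_episodes (the minimum number of
    # episodes across runs), so runs of different lengths can be compared
    r = np.asarray(episode_returns[:n_episodes], dtype=float)
    return float(np.mean(threshold - r))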
The following table shows the best parameters corresponding to each batch size:
Environment  Type  Batch Size  Learning Rate  Hidden Layer Nodes  Regret
CartPole     1     16          0.001          16                  1097523.0
CartPole     1     32          0.001          16                  406653
CartPole     1     64          0.01           8                   490807
CartPole     2     16          0.0001         8                   1148520
CartPole     2     32          0.001          16                  928779
CartPole     2     64          0.01           16                  221300
Acrobot      1     32          0.0001         8                   28054
Acrobot      1     64          0.0001         16                  29268
Acrobot      1     128         0.0001         16                  27663
Acrobot      2     32          0.0001         32                  40077
Acrobot      2     64          0.0001         32                  36542
Acrobot      2     128         0.0001         16                  35405

Table 3: Summary of the best hyperparameters for each experiment and selection policy
2. Generally, the agent with a higher learning rate turns out to be the one with the least regret, which is intuitive as well, provided the learning rate is not so high that the weights start diverging from the optimal values.
3. Generally, the agent with a higher batch size turns out to be the one with the least regret, since more data is provided in a given episode with a larger batch size.
4. In CartPole, the regret for the neural network (NN) with hidden layer size 8 is significantly higher than that of the NN with size 16, indicating that 8 nodes are not sufficient to map states to Q-values accurately.
3.3.3 REINFORCE Regret Analysis:
Considering 1 batch size (namely 1), 3 different learning rates, and 3 different hidden layer sizes for each of the 2 types of REINFORCE algorithm, in total we compare 9 combinations of hyperparameters for each type of REINFORCE algorithm.
Here, the regret is calculated by summing, over all training episodes, the difference between the threshold and the return of each episode.
The following table shows the best parameters corresponding to each batch size:
2. Generally, the agent with a higher learning rate turns out to be the one with the least regret, which is intuitive as well, provided the learning rate is not so high that the weights start diverging from the optimal values.
3. Generally, the agent with a higher batch size turns out to be the one with the least regret, since more data is provided in a given episode with a larger batch size.
4. In CartPole, the regret for the neural network (NN) with hidden layer size 32 is significantly higher than that of the NN with size 64, indicating that 32 nodes are not sufficient to map states to the policy outputs accurately.
Thank You!