HW3 Questions
Homework 3:
Policy-Based Methods
Designed By:
Nima Shirzady
[email protected]
SeyyedAli MirGhasemi
[email protected]
Spring 2025
Preface
Welcome to the homework!
As you may know, Reinforcement Learning (RL) is a branch of Artificial Intelligence that focuses
on finding an optimal policy for an agent. The goal is to enable the agent to take actions in a way that
maximizes its cumulative reward, allowing it to perform a given task as efficiently as possible.
In previous exercises, you were introduced to the fundamental concepts of RL, explored various environments, and implemented value-based methods. In this assignment, we shift our focus to policy-based
methods, providing an introductory exploration of their principles and applications.
One of the foundational approaches in this category is policy gradient methods, with REINFORCE
being one of the simplest and earliest algorithms. We begin by comparing policy search using evolutionary
optimization techniques, such as the genetic algorithm (GA), with the REINFORCE algorithm. Then,
we implement different variants of REINFORCE that enhance its performance and compare them to the
standard version. Additionally, we examine how to adapt REINFORCE for continuous action spaces.
Finally, we compare policy gradient methods (the REINFORCE algorithm) with Deep Q-Network (DQN),
analyzing their strengths and weaknesses to better understand when and why each method is preferred.
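For reference, the policy-gradient estimator at the core of REINFORCE is commonly written as follows (standard notation; the notebooks may present it slightly differently):

\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big], \qquad G_t = \sum_{k=t}^{T} \gamma^{\,k-t}\, r_k,

where the baseline variant replaces G_t with G_t - b(s_t) for some state-dependent baseline b, which can reduce the variance of the estimate without biasing it.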
The goal of this assignment is to explore policy-based reinforcement learning methods, with a focus
on policy gradient algorithms. You will:
• Compare policy search methods – Implement and compare REINFORCE with genetic algorithm-
based policy search to understand different optimization approaches.
• Improve REINFORCE – Implement and analyze variants of REINFORCE, such as REINFORCE
with baseline, to see how they enhance learning stability and efficiency.
• Apply REINFORCE to continuous action spaces – Modify the algorithm to work in environ-
ments with continuous actions, highlighting key differences from discrete action spaces.
• Compare Policy Gradient (REINFORCE) vs. Deep Q-Network (DQN) – Evaluate the
strengths and weaknesses of policy gradient methods in contrast to Deep Q-Network.
By completing this assignment, you will gain hands-on experience with policy gradient techniques,
understand their advantages and limitations, and develop insights into how they are applied in different
reinforcement learning problems.
Grading
The grading will be based on the following criteria, with a total of 100 points:
Task Points
Task 1: Policy Search: REINFORCE vs. GA 20
Task 2: REINFORCE: Baseline vs. No Baseline 25
Task 3: REINFORCE in a continuous action space 20
Task 4: Policy Gradient Drawbacks 25
Clarity and Quality of Code 5
Clarity and Quality of Report 5
Bonus 1: Writing your report in LaTeX 10
Submission
The deadline for this homework is 1403/12/12 (March 2nd 2025) at 11:59 PM.
Please submit your work by following the instructions below:
• Place your solution alongside the Jupyter notebook(s).
– Your written solution must be a single PDF file named HW3_Solution.pdf .
– If there is more than one Jupyter notebook, put them in a folder named Notebooks .
• Zip all the files together with the following naming format:
DRL_HW3_[StudentNumber]_[FullName].zip
– Replace [FullName] and [StudentNumber] with your full name and student number,
respectively. Your [FullName] must be in CamelCase with no spaces.
• Submit the zip file through Quera in the appropriate section.
• We have provided this LaTeX template for writing your homework solution. There is a 5-point bonus for
writing your solution in LaTeX using this template and including your LaTeX source code in your
submission, named HW3_Solution.zip .
• If you have any questions about this homework, please ask them in the Homework section of our
Telegram Group.
• If you use any references to write your answers, consult anyone, or use AI, please mention
them in the appropriate section. In general, by registering for this course you have agreed to adhere to
all the rules mentioned here and here.
Keep up the great work and best of luck with your submission!
Contents
1 Part 1 (Setup Instructions)
1.1 Environment Setup
1.2 Submission Requirements
2 Part 2 (Problem Descriptions)
3 References
2 Part 2 (Problem Descriptions)
2.1 Task 1: Policy Search: REINFORCE vs. GA
When comparing policy search using the Genetic Algorithm (GA) with the REINFORCE algorithm, pay attention to the following criteria:
• Stability of learning – Whether the algorithm converges consistently or shows high variance in
performance.
• Sample efficiency – How many episodes or iterations are needed to learn a good policy.
• Exploration vs. exploitation balance – How each method approaches searching for new paths
versus refining known strategies.
By analyzing the results, you will gain insights into the strengths and weaknesses of these two different
policy optimization methods.
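To make the comparison concrete, here is a minimal, self-contained sketch of the two optimization loops. Everything here is illustrative: the toy objective stands in for "run one episode and return its total reward," and the names and hyperparameters are assumptions rather than the notebook's actual code.

import numpy as np

rng = np.random.default_rng(0)

def evaluate(params):
    # Stand-in for "run one episode with these policy parameters and return the
    # total reward"; a toy quadratic objective keeps the sketch self-contained.
    return -float(np.sum((params - 1.0) ** 2))

# Genetic-algorithm-style search: keep a population of parameter vectors,
# evaluate their fitness, select the best, and mutate them to form the next
# generation (crossover omitted for brevity). No gradients are used.
population = [rng.normal(size=4) for _ in range(20)]
for generation in range(50):
    fitness = [evaluate(p) for p in population]
    elites = [population[i] for i in np.argsort(fitness)[-5:]]   # top 5 survive
    population = [e + 0.1 * rng.normal(size=4)                   # mutated copies
                  for e in elites for _ in range(4)]
print("GA best fitness:", max(evaluate(p) for p in population))

# REINFORCE-style search: sample an action from a parameterized Gaussian policy,
# observe its return, and ascend the gradient of log-probability weighted by
# that return. This single-sample estimate is noisy, so the score improves
# but fluctuates from run to run.
theta, sigma, lr = np.zeros(4), 0.5, 1e-3
for iteration in range(2000):
    action = theta + sigma * rng.normal(size=4)      # a ~ N(theta, sigma^2)
    ret = evaluate(action)                           # "episode return"
    grad_log_prob = (action - theta) / sigma ** 2    # grad_theta log pi(a)
    theta = theta + lr * ret * grad_log_prob         # vanilla REINFORCE step
print("REINFORCE final score:", evaluate(theta))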
2.1.2 Instructions
All the explanations and guidance needed to complete this task are provided in the corresponding
notebook. The notebook also contains a set of conceptual questions that you are required to answer.
Due to space limitations, you do not need to write your answers directly in the notebook; instead,
please submit them separately.
2.1.3 Questions
Based on your implementation and the results of comparing policy search using the Genetic Algorithm (GA)
with the REINFORCE algorithm, answer the following questions:
Question 1:
How do these two methods differ in terms of their effectiveness for solving reinforcement learning tasks?
Question 2:
Discuss the key differences in their performance, convergence rates, and stability.
Question 3:
Explore how each method handles exploration and exploitation, and suggest situations where
one might be preferred over the other.
2.2 Task 2: REINFORCE: Baseline vs. No Baseline
2.2.2 Instructions
All the explanations and guidance needed to complete this task are provided in the corresponding
notebook. The notebook also contains a set of conceptual questions that you are required to answer.
Due to space limitations, you do not need to write your answers directly in the notebook; instead,
please submit them separately.
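As background for this task, here is a minimal sketch of the per-step weight that REINFORCE with a baseline uses in its loss. It is illustrative only: the function name is made up, and the notebook may use a learned value function as the baseline rather than the crude mean-return baseline shown here.

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards over one episode.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0] * 10                 # e.g. CartPole gives +1 per surviving step
G = discounted_returns(rewards)

# Without a baseline, the weight on the log-probability of step t is simply G[t].
# With a baseline b(s_t) (here crudely the episode's mean return), the weight
# becomes G[t] - b, the centered quantity that the baseline variant uses to
# tame the variance of the gradient estimate.
baseline = G.mean()
advantages = G - baseline
print(G.round(2))
print(advantages.round(2))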
2.2.3 Questions
Question 1:
How are the observation and action spaces defined in the CartPole environment?
Question 2:
What is the role of the discount factor (γ) in reinforcement learning, and what happens when γ=0 or
γ=1?
Question 3:
Why is a baseline introduced in the REINFORCE algorithm, and how does it contribute to training stability?
Question 4:
What are the primary challenges associated with policy gradient methods like REINFORCE?
Question 5:
Based on the results, how does REINFORCE with a baseline compare to REINFORCE without a baseline
in terms of performance?
Question 6:
Explain how variance affects policy gradient methods, particularly in the context of estimating gradients
from sampled trajectories.
2.3 Task 3: REINFORCE in a Continuous Action Space
2.3.2 Instructions
All the explanations and guidance needed to complete this task are provided in the corresponding
notebook. The notebook also contains a set of conceptual questions that you are required to answer.
Due to space limitations, you do not need to write your answers directly in the notebook; instead,
please submit them separately.
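As background, a common way to adapt REINFORCE to a continuous action space is to have the policy network output the parameters of a Gaussian distribution and sample actions from it. The sketch below assumes PyTorch and uses illustrative names; the notebook's architecture may differ.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # State-independent log standard deviation, learned alongside the mean.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp()
        return torch.distributions.Normal(mean, std)

# Usage: sample an action and keep its log-probability for the REINFORCE loss.
policy = GaussianPolicy(obs_dim=2, act_dim=1)   # MountainCarContinuous sizes
obs = torch.zeros(2)
dist = policy(obs)
action = dist.sample()
log_prob = dist.log_prob(action).sum()          # sum over action dimensions
# One term of the REINFORCE loss would then be -(return_t * log_prob).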
2.3.3 Questions
Question 1:
How are the observation and action spaces defined in the MountainCarContinuous environment?
Question 2:
How could an agent reach the goal in the MountainCarContinuous environment while using the least
amount of energy? Describe a scenario of the agent's behavior during an episode under an optimal
policy.
Question 3:
What strategies can be employed to reduce catastrophic forgetting in continuous action space environments
like MountainCarContinuous?
(Hint: experience replay or target networks)
2.4 Task 4: Policy Gradient Drawbacks
2.4.2 Instructions
This notebook is designed to let you explore the differences between Deep Q-Network (DQN) and Policy
Gradient (REINFORCE) by tuning hyperparameters and observing their impact on learning performance.
The core implementations of both algorithms are already provided, so your focus will be on adjusting
parameters and interpreting the results. To complete this task, first navigate to the Hyperparameter
Settings section in the notebook. Modify the provided hyperparameters for both DQN and Policy Gradient
according to your preference. You can experiment with different values for parameters such as the learning
rate, discount factor (gamma), exploration settings (for DQN), and the number of training episodes. Once
you have set the hyperparameters, proceed by running the training cells. This will initiate the learning
process for both algorithms, allowing the agents to interact with the Frozen Lake environment and improve
their policies over time.
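For example, the kind of values you might try in the notebook's Hyperparameter Settings section could look like the following (the variable names here are illustrative; use whatever names the notebook actually defines):

dqn_hyperparams = {
    "learning_rate": 1e-3,
    "gamma": 0.99,           # discount factor
    "epsilon_start": 1.0,    # exploration settings (epsilon-greedy)
    "epsilon_end": 0.05,
    "epsilon_decay": 0.995,
    "num_episodes": 2000,
}

pg_hyperparams = {
    "learning_rate": 1e-2,
    "gamma": 0.99,
    "num_episodes": 2000,
}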
After the training is completed, examine the results using the provided visualizations and performance
metrics. The notebook includes plots showing episode rewards, training stability, and convergence speed
for both methods. Compare their learning efficiency and final success rates to understand how policy-based
and value-based approaches differ in handling this environment.
You are not required to modify the core algorithmic code. Instead, focus on tuning the hyperparameters,
observing their effects, and drawing conclusions about the strengths and weaknesses of each learning
method in the Frozen Lake environment.
Note that Policy Gradient may not perform well in this environment.
2.4.3 Questions
1. Which algorithm performs better in the Frozen Lake environment? Why?
Compare the performance of Deep Q-Network (DQN) and Policy Gradient (REINFORCE) in terms
of training stability, convergence speed, and overall success rate. Based on your observations, which
algorithm achieves better results in this environment?
2. What challenges does the Frozen Lake environment introduce for reinforcement learning?
Explain the specific difficulties that arise in this environment. How do these challenges affect the
learning behavior of DQN and Policy Gradient (REINFORCE)?
3 References
[1] Cover image designed by freepik
[2] Policy Search
[3] CartPole environment from OpenAI Gym
[4] Mountain Car Continuous environment from OpenAI Gym
[5] FrozenLake environment from OpenAI Gym