
Homework 4:

Reinforcement Learning
Report
Please keep the title of each section and delete the examples. Note: please keep the questions listed in Part III.

Part I. Implementation (-5 if not explained in detail):

● Please screenshot your code snippets of Part 1 ~ Part 3, and explain your implementation.
part1:
part2:
part3:
Part II. Experiment Results:

Please paste taxi.png, cartpole.png, DQN.png and compare.png here.

1. taxi.png:
2. cartpole.png

3. DQN.png
4. compare.png

Part III. Question Answering (50%):

1. Calculate the optimal Q-value of a given state in Taxi-v3, and compare with the Q-value you learned (please screenshot the result of the “check_max_Q” function to show the Q-value you learned). (10%)

taxi: (2, 2), passenger: Y, destination: R


[west, west, south, south, pick up, north, north, north, north, drop-off]
10 steps and the passenger is delivered, gamma = 0.9. Each of the first 9 steps yields a reward of -1; the final successful drop-off yields +20.
optimal Q = (–1 – 0.9*1 – 0.9^2*1 – … – 0.9^8*1) + 0.9^9 * 20 ≈ 1.62

This matches the max-Q value reported by check_max_Q in taxi.py.
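As a quick check, the discounted return above can be evaluated numerically (a minimal sketch; it assumes the standard Taxi-v3 rewards of -1 per step and +20 for a successful drop-off):

gamma = 0.9
rewards = [-1] * 9 + [20]   # 9 navigation/pick-up steps, then the successful drop-off
optimal_q = sum(gamma ** t * r for t, r in enumerate(rewards))
print(optimal_q)            # about 1.62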

2. Calculate the max Q-value of the initial state in CartPole-v0, and compare with the Q-value you learned. (Please screenshot the result of the “check_max_Q” function to show the Q-value you learned.) (10%)

The number of steps before the episode terminates is large, gamma = 0.97.


optimal Q = 1 + 0.97*1 + 0.97^2*1 + … + 0.97^n*1
= (1 – 0.97^(n+1)) / (1 – 0.97) ≈ 1 / 0.03 ≈ 33.33 for large n
cartpole.py: 29.55; DQN.py: 34.656. The result of DQN is closer to the optimal Q.
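The same geometric series can be evaluated numerically (a minimal sketch; the 200-step cap below is CartPole-v0's default episode limit):

gamma = 0.97
n = 200                                         # CartPole-v0 episodes end after 200 steps
finite_sum = sum(gamma ** t for t in range(n))  # (1 - gamma**n) / (1 - gamma)
limit = 1 / (1 - gamma)                         # value as the horizon grows without bound
print(finite_sum, limit)                        # about 33.26 and 33.33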

3.
a. Why do we need to discretize the observation in Part 2? (3%)
The original observation is continuous, so it cannot be used directly to index a finite Q-table. Discretizing it maps each observation to one of a finite set of states, which makes it easy to build the state list (a short discretization sketch follows this question).

b. How do you expect the performance will be if we increase “num_bins”? (3%)


If num_bins increases, the discretization is finer, so states are represented more precisely and the performance is expected to improve.

c. Is there any concern if we increase “num_bins”? (3%)


Yes. The Q-table grows quickly with num_bins (its size scales with num_bins raised to the number of observation dimensions), so it needs more memory and more training time for the Q-values to converge.
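A minimal sketch of the kind of observation discretization meant in 3(a), assuming numpy; the bounds and num_bins value here are illustrative, not necessarily those used in Part 2:

import numpy as np

num_bins = 7
# Illustrative bounds for CartPole's 4-dimensional observation:
# cart position, cart velocity, pole angle, pole angular velocity.
lower = np.array([-2.4, -3.0, -0.21, -3.0])
upper = np.array([ 2.4,  3.0,  0.21,  3.0])
# num_bins - 1 interior edges per dimension, so indices fall in 0 .. num_bins - 1.
edges = [np.linspace(lower[i], upper[i], num_bins + 1)[1:-1] for i in range(4)]

def discretize(observation):
    # Map a continuous observation to a tuple of bin indices (a discrete state).
    return tuple(int(np.digitize(observation[i], edges[i])) for i in range(4))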

4. Which model (DQN, discretized Q-learning) performs better in CartPole-v0, and what are the reasons? (5%)
DQN performs better.
Because DQN uses a deep neural network as a function approximator, it can handle large and continuous state spaces directly, so it performs better in high-dimensional state and action spaces.
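A minimal sketch of the kind of Q-network DQN uses for CartPole (the layer sizes and names are illustrative, not the assignment's exact architecture):

import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a continuous observation directly to one Q-value per action.
    def __init__(self, state_dim=4, num_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)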

5.
a. What is the purpose of using the epsilon greedy algorithm while choosing an action? (3%)
To balance exploration and exploitation when choosing actions (a sketch of epsilon-greedy selection follows this question).

b. What will happen if we don’t use the epsilon greedy algorithm in the CartPole-v0 environment? (3%)
Without exploration, the agent always selects the action with the highest current Q-value estimate, so it may get stuck in a suboptimal policy and fail to balance the pole for long periods of time.

c. Is it possible to achieve the same performance without the epsilon greedy algorithm in the CartPole-v0 environment? Why or why not? (3%)
It is possible if we replace it with another exploration strategy that also balances exploration and exploitation, for example softmax (Boltzmann) exploration or optimistic initial Q-values.
d. Why don’t we need the epsilon greedy algorithm during the testing section? (3%)
Because training is finished, the Q-table is already learned and the agent's policy is fixed; during testing we only want to exploit the learned policy, so no exploration is needed.
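A minimal sketch of epsilon-greedy action selection for the tabular agent in 5(a); the names q_table and epsilon are illustrative:

import random
import numpy as np

def choose_action(q_table, state, epsilon, num_actions):
    # Explore with probability epsilon, otherwise exploit the current Q estimates.
    if random.random() < epsilon:
        return random.randrange(num_actions)   # explore: random action
    return int(np.argmax(q_table[state]))      # exploit: greedy action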

6. What does “with torch.no_grad():” do inside the “choose_action” function in DQN? (4%)
It disables gradient tracking for the forward pass used to pick an action. Action selection is pure inference, so no computation graph needs to be built, which saves memory and time and keeps this forward pass out of backpropagation.
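A minimal sketch of how choose_action typically wraps the forward pass in torch.no_grad(); the network and variable names are illustrative:

import random
import torch

def choose_action(q_network, state, epsilon, num_actions):
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # Inference only: no computation graph is built, saving memory and time.
    with torch.no_grad():
        state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        q_values = q_network(state_tensor)
    return int(q_values.argmax(dim=1).item())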
