Reinforcement Learning
Report
Please keep the title of each section and delete the examples. Note: please keep the questions listed in Part III.
● Please screenshot your code snippets of Part 1 ~ Part 3, and explain your implementation.
Part 1:
Part 2:
Part 3:
Part II. Experiment Results:
1. taxi.png
2. cartpole.png
3. DQN.png
4. compare.png
Part III. Questions:
1. Calculate the optimal Q-value of a given state in Taxi-v3, and compare with the Q-value you learned (Please screenshot the result of the “check_max_Q” function to show the Q-value you learned). (10%)
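As a supplement to the screenshot, a minimal sketch of how the optimal Q-value could be computed by hand for comparison with check_max_Q. The discount factor and the number of steps the optimal policy needs from the given state are assumed placeholders, not values from the assignment; Taxi-v3 gives -1 per move and +20 for the final successful dropoff.

# Hand calculation of the optimal Q-value of a Taxi-v3 state (sketch).
gamma = 0.9       # assumed discount factor; use the one from your training script
n_steps = 10      # assumed number of steps the optimal policy needs from this state

# n_steps - 1 moves at reward -1, then the successful dropoff at reward +20 on the last step.
optimal_q = sum(gamma ** t * (-1) for t in range(n_steps - 1)) + gamma ** (n_steps - 1) * 20
print(f"optimal Q-value: {optimal_q:.4f}")   # compare with the value from check_max_Q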
2. Calculate the max Q-value of the initial state in CartPole-v0, and compare with the Q-value you learned. (Please screenshot the result of the “check_max_Q” function to show the Q-value you learned) (10%)
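Similarly, a minimal sketch for the initial CartPole-v0 state: the environment gives +1 per step and terminates after at most 200 steps, so the optimal return is a finite geometric series. The discount factor below is an assumed placeholder.

# Max Q-value of the initial CartPole-v0 state under an optimal policy (sketch).
gamma = 0.97        # assumed discount factor; use the one from your training script
max_steps = 200     # CartPole-v0 episode length limit

max_q = sum(gamma ** t for t in range(max_steps))   # equals (1 - gamma**200) / (1 - gamma)
print(f"max Q-value of the initial state: {max_q:.4f}")   # compare with check_max_Q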
3.
a. Why do we need to discretize the observation in Part 2? (3%)
The CartPole observation is continuous, so it cannot be used directly as an index into a Q-table. Discretizing each observation dimension into a finite number of bins maps every observation to a discrete state, which makes it straightforward to build and index the Q-table (see the sketch below).
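A minimal sketch of this kind of discretization; the bin counts, value ranges, and helper name are assumptions, not the actual assignment code:

import numpy as np

# Assumed bin counts and value ranges for the 4-dimensional CartPole observation:
# [cart position, cart velocity, pole angle, pole angular velocity]
NUM_BINS = [6, 6, 12, 12]
LOWER = [-2.4, -3.0, -0.21, -3.0]
UPPER = [2.4, 3.0, 0.21, 3.0]

def discretize(observation):
    """Map a continuous observation to a tuple of bin indices usable as a Q-table key."""
    state = []
    for value, low, high, bins in zip(observation, LOWER, UPPER, NUM_BINS):
        edges = np.linspace(low, high, bins - 1)      # bins - 1 edges -> bins intervals
        state.append(int(np.digitize(value, edges)))  # index of the interval the value falls in
    return tuple(state)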
4. Which model (DQN, discretized Q-learning) performs better in CartPole-v0, and what are the reasons? (5%)
DQN performs better. DQN uses a deep neural network as a function approximator, so it can take the raw continuous observation as input instead of a coarse discretization, and it generalizes across similar states. This makes it better suited to the high-dimensional, continuous state space of CartPole-v0 (see the network sketch below).
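To make the contrast concrete, a minimal sketch (layer sizes and class name are assumptions) of how a DQN maps the raw 4-dimensional continuous observation directly to Q-values, so no discretization or table is needed:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a) for every action from the raw continuous observation."""
    def __init__(self, state_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),   # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

# Nearby states share network parameters, which is why DQN generalizes
# where a discretized Q-table treats them as unrelated entries.
q_net = QNetwork()
q_values = q_net(torch.zeros(1, 4))   # Q-values for both actions of a dummy state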
5.
a. What is the purpose of using the epsilon greedy algorithm while choosing an action? (3%)
To balance exploration and exploitation: with probability epsilon the agent takes a random action to explore, and otherwise it exploits the action with the highest estimated Q-value (a sketch follows below).
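A minimal sketch of epsilon-greedy action selection; the Q-table layout, epsilon value, and function name are assumptions:

import numpy as np

def choose_action(q_table, state, epsilon=0.1, n_actions=2):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)    # explore: pick a random action
    return int(np.argmax(q_table[state]))      # exploit: pick the best known action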
b. What will happen, if we don’t use the epsilon greedy algorithm in the CartPole-v0 environment? (3%)
The agent will always select the action with the highest estimated Q-value. Without any exploration it may get stuck in a suboptimal policy and fail to learn to balance the pole for long periods of time.
c. Is it possible to achieve the same performance without the epsilon greedy algorithm in the CartPole-v0 environment? Why or why not? (3%)
Yes, it is possible if epsilon greedy is replaced by another exploration strategy that also balances exploration and exploitation, for example softmax (Boltzmann) action selection (sketched below).
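A minimal sketch of softmax (Boltzmann) exploration as one such replacement; the temperature value is an assumption:

import numpy as np

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                            # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))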
d. Why don’t we need the epsilon greedy algorithm during the testing section? (3%)
Because training is already finished: the Q-table (or network) is no longer being updated and the agent’s policy is fixed, so during testing the agent only needs to exploit the learned policy by acting greedily; no further exploration is required.