Report PDF
Figures: cartpole.png, DQN.png, compare.png
2. Calculate the max Q-value of the initial state in CartPole-v0, and compare with the Q-value you learned.
(Please screenshot the result of the “check_max_Q” function to show the Q-value you learned) (4%)
The learned Q-value reported by check_max_Q is close to the calculated maximum Q-value of the initial state.
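A quick sanity check of the calculation (a sketch, assuming the usual CartPole-v0 limit of 200 steps with reward +1 per step; the discount factor gamma = 0.99 is an assumption and should be replaced by the one used in training):

    # Upper bound on Q(s0, a): the pole stays balanced for all 200 steps of
    # CartPole-v0, each step giving reward +1, discounted by gamma.
    gamma = 0.99                                         # assumed discount factor
    max_steps = 200                                      # CartPole-v0 episode limit

    max_q = sum(gamma ** t for t in range(max_steps))    # finite geometric series
    print(f"max Q of the initial state = {max_q:.2f}")   # about 86.6 for gamma = 0.99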
3. a. Why do we need to discretize the observation in Part 2? (2%)
Because tabular Q-learning needs a finite, discrete state space, we discretize the continuous observation; this lets us use a much lighter structure (a Q-table) to learn, at the cost of a less precise policy.
Better.
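As a rough illustration of what the discretization does (a sketch only; the bin counts and bounds below are assumptions, not the assignment's values), the continuous 4-dimensional observation is mapped to a tuple of bin indices that can serve as a Q-table key:

    import numpy as np

    # Assumed bin counts and bounds for
    # (cart position, cart velocity, pole angle, pole angular velocity).
    BINS = (6, 6, 12, 12)
    BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]

    def discretize(observation):
        """Map a continuous observation to a tuple of bin indices (a Q-table key)."""
        state = []
        for value, bins, (low, high) in zip(observation, BINS, BOUNDS):
            clipped = np.clip(value, low, high)
            # Scale into [0, bins - 1] and truncate to an integer bin index.
            state.append(int((clipped - low) / (high - low) * (bins - 1)))
        return tuple(state)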
4. Which model (DQN, discretized Q learning) performs better in Cartpole-v0, and what are the reasons? (3%)
DQN performs better. The neural network works directly with the continuous observations, while discretized Q-learning loses information through binning, so DQN can learn a more precise policy.
b. What will happen, if we don’t use the epsilon greedy algorithm in the CartPole-v0 environment? (3%)
The result may look like the one shown above: since the agent cannot gather information about the environment without exploring, it keeps repeating a few well-known actions and their results and gets stuck in a suboptimal policy.
c. Is it possible to achieve the same performance without the epsilon greedy algorithm in the CartPole-v0
environment? Why or Why not? (3%)
Yes. We only need some way to explore the environment, and the epsilon greedy algorithm is simply one simple and efficient choice. Other exploration strategies can achieve the same thing; for example, softmax (Boltzmann) action selection may reach the same performance.
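A minimal sketch of the two exploration rules mentioned above (the epsilon and temperature values are illustrative assumptions):

    import numpy as np

    def epsilon_greedy(q_values, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def softmax_action(q_values, temperature=1.0):
        """Boltzmann exploration: sample actions with probability proportional to exp(Q / T)."""
        prefs = np.asarray(q_values, dtype=np.float64) / temperature
        prefs -= prefs.max()                           # for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return int(np.random.choice(len(q_values), p=probs))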
d. Why don’t we need the epsilon greedy algorithm during the testing section? (2%)
During testing we only use the trained network to choose actions and no longer update it, so there is no need to explore; the agent just acts greedily on the learned Q-values.
No.
If we have two networks, the target Q-values can be taken from the target network, which is meant to reflect the state of the main DQN. Its weights are not identical, however, because it is only updated after a certain number of episodes. This should give a more robust performance over time.
The addition of the target network might slow down training, since the target network is not continuously updated.
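A sketch of how the target network is typically used, assuming a PyTorch-style implementation (the network sizes and the sync interval here are assumptions):

    import copy
    import torch
    import torch.nn as nn

    # Small Q-network for CartPole's 4-dimensional observation and 2 actions (illustrative sizes).
    policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    target_net = copy.deepcopy(policy_net)   # the target network starts as a copy
    target_net.eval()

    SYNC_EVERY = 100                         # assumed number of updates between syncs

    def td_target(reward, next_state, done, gamma=0.99):
        """Compute the TD target from the frozen target network."""
        with torch.no_grad():
            next_q = target_net(next_state).max(dim=1).values
        return reward + gamma * next_q * (1.0 - done)

    def sync_target(step):
        """Copy the policy network's weights into the target network every SYNC_EVERY updates."""
        if step % SYNC_EVERY == 0:
            target_net.load_state_dict(policy_net.state_dict())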
8. a. What is a replay buffer(memory)? Is it necessary to implement a replay buffer? What are the advantages
of implementing a replay buffer? (5%)
A replay buffer (memory) stores past transitions (state, action, reward, next state) so they can be sampled later for training. It is not strictly necessary. Without it, a relatively good result in a single step directly biases the subsequent actions; remembering transitions in the replay buffer and sampling from it later avoids this problem.
The buffer size tells us how many samples we can remember, and the batch size tells us how many samples we use for each training update.
c. Is there any effect if we adjust the size of the replay buffer(memory) or batch size? Please list some
advantages and disadvantages. (2%)
A small buffer might force the network to only care about what it saw recently, while a large buffer might take a long time to "become refreshed" with good trajectories once they finally start to be discovered. Conversely, these are also their respective advantages: a large buffer provides more diverse, less correlated samples, and a small buffer adapts quickly to the most recent behavior.
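A minimal replay-buffer sketch with uniform random sampling (the capacity and batch size below are illustrative, not the assignment's values):

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size memory of (state, action, reward, next_state, done) transitions."""

        def __init__(self, capacity=10000):
            self.memory = deque(maxlen=capacity)   # oldest transitions are dropped when full

        def push(self, state, action, reward, next_state, done):
            self.memory.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Uniform random sampling breaks the correlation between consecutive transitions.
            batch = random.sample(self.memory, batch_size)
            return list(zip(*batch))               # grouped as states, actions, rewards, ...

        def __len__(self):
            return len(self.memory)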
9. a. What is the condition that you save your neural network? (1%)
I save the neural network when done is true and the current episode is one of the best of the last 5 training episodes. In addition, I tried using test() as the saving condition to get a better result, but it makes training slower, so I don't think it is the best approach.
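A sketch of such a saving rule, assuming PyTorch checkpoints and a hypothetical recent_rewards list holding the last few episode returns:

    import torch

    def maybe_save(model, episode_reward, recent_rewards, path="dqn_best.pt"):
        """At the end of an episode, save the network if its return beats the recent episodes."""
        if recent_rewards and episode_reward >= max(recent_rewards):
            torch.save(model.state_dict(), path)
        recent_rewards.append(episode_reward)
        del recent_rewards[:-5]                # keep only the last 5 episode returns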
I have learned a lot about deep Q-learning, one approach to RL, together with the implementation details of DQN. In fact, after finishing Part 2 I felt I knew nothing and had no idea how to start Part 3. I then had to look up a lot of information, and through that I felt I actually learned something during Part 3. These questions also made me aware of many details about DQN, some of which I had thought about and some of which I had not.