CZ3005 Module 5 - Reinforcement Learning
Artificial Intelligence
Reinforcement Learning
https://personal.ntu.edu.sg/hanwangzhang/
Email: [email protected]
Office: N4-02c-87
Lesson Outline
• Some RL algorithms:
– Monte-Carlo
– Temporal difference
– Q-learning
– Deep Q-Network
Reinforcement Learning
• Motivation
– In the last lecture, we computed the value function and found the optimal policy using the known transition function
– But what if the transition function is not available?
– We can still learn the value function and find the optimal policy without the transition function
• From experience
(Diagram: Experience → Learning → Policy/Value)
RL algorithms
• Types of model-free learning:
– Monte Carlo (learns by sampling)
– Q-Learning (learns by bootstrapping)
– DQN
– …
• Basic idea: the agent runs around in the world (e.g., acting randomly) and gains experience to learn from
• What experience? Many trajectories!
• How do we learn?
– Use experience to learn an empirical state value function
An Example
(Figure: a one-dimensional grid world with cell 1, cell 2, cell 3, cell 4; a start point and a destination are marked)
One-dimensional Grid World
• Trajectory or episode:
– The sequence of states from the starting state to the terminal state
– The robot starts at the start point and ends at the destination
• The representation of the three episodes (also written out as code below):
– Episode 1: cell 2 → cell 3 → cell 4, with rewards −1, 10
– Episode 2: cell 2 → cell 3 → cell 2 → cell 3 → cell 4, with rewards −1, −1, −1, 10
– Episode 3: cell 2 → cell 1 → cell 2 → cell 3 → cell 4, with rewards −1, −1, −1, 10
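The episodes can also be written down directly as data. Below is a small Python sketch; the action names "left" and "right" are labels I am assuming for moves between adjacent cells, they are not given on the slide.

```python
# The three sampled episodes from the 1-D grid world, each a list of
# (state, action, reward) steps; every move costs -1 and reaching cell 4 gives +10.
episodes = [
    # episode 1: cell 2 -> cell 3 -> cell 4
    [("cell2", "right", -1), ("cell3", "right", 10)],
    # episode 2: cell 2 -> cell 3 -> cell 2 -> cell 3 -> cell 4
    [("cell2", "right", -1), ("cell3", "left", -1),
     ("cell2", "right", -1), ("cell3", "right", 10)],
    # episode 3: cell 2 -> cell 1 -> cell 2 -> cell 3 -> cell 4
    [("cell2", "left", -1), ("cell1", "right", -1),
     ("cell2", "right", -1), ("cell3", "right", 10)],
]
```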
Compute Value Function
• Idea: average the return observed after visits to a state s
• First-visit MC: average returns only for the first time s is visited in each episode (sketched in code after this list)
• Return in one episode (trajectory): G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯
• Representation: A table
– Filled with the Q-value given a state and an action
(Table: one column per state (cell 1, cell 2, cell 3) and one row per action)
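A minimal sketch (not the lecture's code) of first-visit Monte Carlo estimation of V(s), assuming episodes are stored as (state, action, reward) steps like the list above; the function and variable names are mine.

```python
from collections import defaultdict

def first_visit_mc_v(episodes, gamma=0.9):
    """First-visit MC: V(s) is the average of the returns observed after
    the first visit to s in each episode."""
    returns = defaultdict(list)                    # state -> list of first-visit returns
    for episode in episodes:
        # Compute the return G_t at every time step by scanning backwards.
        g, g_at = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][2] + gamma * g
            g_at[t] = g
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in seen:                      # only the first visit of s counts
                seen.add(s)
                returns[s].append(g_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```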
Computing Q-value
• MC for estimating Q:
– A slight difference from estimating the value function
– Average the returns following the first visit to each state-action pair (s, a) in an episode
• We calculate the return for (cell 2, right) in the first episode with γ = 0.9
– Episode 1: cell 2 → cell 3 → cell 4, with rewards −1, 10
– G = −1 + 0.9 × 10 = 8
Compute Q-Value (cont’d)
• Similarly, the return for (cell 2, right) in the second episode with γ = 0.9
– Episode 2: cell 2 → cell 3 → cell 2 → cell 3 → cell 4, with rewards −1, −1, −1, 10
– G = −1 + 0.9 × (−1) + 0.9² × (−1) + 0.9³ × 10 ≈ 4.6
• Averaging the first-visit returns over the three episodes gives the Q-table:
              cell 1   cell 2   cell 3
    right      6.2      6.9     10
    left       0        4.6      6.2
• Selecting an action: at cell 2, we choose right, since Q(cell 2, right) = 6.9 > Q(cell 2, left) = 4.6 (reproduced in the code sketch below)
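The same recipe applied to Q(s, a), sketched below; with the episodes list from earlier and γ = 0.9 it reproduces the table above, e.g. Q(cell 2, right) ≈ (8 + 4.6 + 8)/3 ≈ 6.9 and Q(cell 2, left) ≈ 4.6.

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=0.9):
    """First-visit MC for Q: average the return following the first
    occurrence of each (state, action) pair in every episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g, g_at = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):   # backward pass for returns
            g = episode[t][2] + gamma * g
            g_at[t] = g
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:                  # only the first visit of (s, a) counts
                seen.add((s, a))
                returns[(s, a)].append(g_at[t])
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}

q = first_visit_mc_q(episodes, gamma=0.9)
# q[("cell2", "right")] ≈ 6.9, q[("cell2", "left")] ≈ 4.6, q[("cell3", "right")] == 10.0
```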
MC control algorithm
• Policy evaluation: estimate Q from Monte-Carlo returns under the current policy
• Policy improvement: make the policy greedy (or ε-greedy) with respect to the estimated Q (see the sketch below)
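A sketch of this loop under assumptions not stated on the slide: the environment is exposed through hypothetical env.reset() / env.step(action) calls, exploration is ε-greedy, and Q is updated with a constant step size instead of storing all returns.

```python
import random

def mc_control(env, actions, n_episodes=1000, gamma=0.9, epsilon=0.1, alpha=0.1):
    """Alternate MC policy evaluation (sample an episode, compute returns)
    with policy improvement (act epsilon-greedily w.r.t. the current Q)."""
    q = {}
    for _ in range(n_episodes):
        # Policy evaluation: generate one episode with the current policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                              # explore
            else:
                a = max(actions, key=lambda a_: q.get((s, a_), 0.0))    # exploit
            s_next, r, done = env.step(a)     # hypothetical environment interface
            episode.append((s, a, r))
            s = s_next
        # Policy improvement: move Q towards the observed returns.
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            q[(s, a)] = q.get((s, a), 0.0) + alpha * (g - q.get((s, a), 0.0))
    return q
```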
Q-Learning is Bootstrapping
Q-Learning: the blessings of Temporal Difference
Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
new estimate = old estimate + learning rate × (new sample − old estimate), where the new sample is r + γ max_a′ Q(s′, a′)
Q-Learning
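A minimal tabular sketch of this update rule, assuming Q is kept as a plain Python dict from (state, action) to value; the function name and the default alpha/gamma values are illustrative, not from the lecture.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference step: move Q(s, a) towards the new sample
    r + gamma * max_a' Q(s', a')."""
    old = q.get((s, a), 0.0)                                            # old estimate
    new_sample = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = old + alpha * (new_sample - old)                        # new estimate
    return q[(s, a)]
```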
A Step-by-step Example
• 5-room environment as MDP
– We'll number each room 0 through 4
– The outside of the building can be thought of as one big room, numbered 5
– Episodes end at room 5
– Notice that doors at rooms 1 and 4 lead into the building from room 5 (outside)
A Step-by-step Example (cont’d)
• Goal
– Put an agent in any room, and from that room, go outside (or room 5)
• Reward
– The doors that lead immediately to the goal have an instant reward of 100
– Other doors not directly connected to the target room have zero reward
• Reward matrix R (rows: states 0–5, columns: actions 0–5):
          action:   0    1    2    3    4    5
    state 0:        0    0    0    0    0    0
    state 1:        0    0    0    0    0  100
    state 2:        0    0    0    0    0    0
    state 3:        0    0    0    0    0    0
    state 4:        0    0    0    0    0  100
    state 5:        0    0    0    0    0  100
Q-Learning Step by Step
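As a sketch of the step-by-step computation, the loop below runs tabular Q-learning on the 5-room example. The room adjacency (which doors connect which rooms) is my reading of the lecture's figure, and the discount factor and step size are illustrative, so treat them as assumptions; the reward follows the slide's R matrix (100 for any action that enters room 5, 0 otherwise).

```python
import random

# Assumed door connectivity of the 5-room building (room 5 is "outside").
doors = {0: [4], 1: [3, 5], 2: [3], 3: [1, 2, 4], 4: [0, 3, 5], 5: [1, 4, 5]}
GOAL, GAMMA, ALPHA = 5, 0.8, 1.0                  # illustrative hyperparameters

# Reward from the slide's R matrix: 100 for entering room 5, otherwise 0.
R = {(s, a): (100 if a == GOAL else 0) for s in doors for a in doors[s]}
Q = {(s, a): 0.0 for s in doors for a in doors[s]}

for _ in range(500):                              # loop over many episodes
    s = random.randint(0, 4)                      # put the agent in any room
    while s != GOAL:                              # episode ends at room 5
        a = random.choice(doors[s])               # explore: pick a random door
        s_next = a                                # taking "action a" means moving to room a
        best_next = max(Q[(s_next, a2)] for a2 in doors[s_next])
        Q[(s, a)] += ALPHA * (R[(s, a)] + GAMMA * best_next - Q[(s, a)])
        s = s_next

# Acting greedily w.r.t. the learned Q then leads from any room to room 5.
```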
Q-Learning Step by Step (cont’d)
• When we loop over many episodes, the Q-values converge to the final Q matrix
Deep Q-Network
• Approximate the Q-function with a neural network q̂(s, a, w_q): the state s and action a are the inputs, and w_q are the network weights
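A minimal PyTorch-style sketch of the idea, under assumptions not in the slides (a small state vector, two discrete actions, a network that outputs one Q-value per action, and an already-sampled batch of transitions); it only illustrates the TD update on q̂(s, a, w_q), not the full DQN recipe (replay buffer, exploration schedule, Atari preprocessing).

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """q_hat(s, ., w_q): maps a state to one Q-value per discrete action."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )
    def forward(self, s):
        return self.net(s)

q_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(q_net.state_dict())        # frozen copy for stable targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on the TD error for a sampled batch of transitions."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        best_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * best_next * (1 - dones)
    loss = nn.functional.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```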
DQN in Atari
• Pong’s video
• https://www.youtube.com/watch?v=PSQt5KGv7Vk
• Beats humans on many games