Value-Based Reinforcement Learning
Shusen Wang
Action-Value Functions
Discounted Return
Definition: Discounted return (aka cumulative discounted future reward).
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯
• The return depends on the actions A_t, A_{t+1}, A_{t+2}, ⋯ and the states S_t, S_{t+1}, S_{t+2}, ⋯
• Actions are random: ℙ(A = a | S = s) = π(a | s). (Policy function.)
• States are random: ℙ(S′ = s′ | S = s, A = a) = p(s′ | s, a). (State-transition function.)
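As a concrete (made-up) illustration of the formula above, here is a minimal Python sketch that computes the discounted return for a finite sequence of rewards; the reward values and γ = 0.9 are arbitrary.

def discounted_return(rewards, gamma):
    # U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...  for a finite episode
    u = 0.0
    for k, r in enumerate(rewards):    # rewards[0] = R_t, rewards[1] = R_{t+1}, ...
        u += (gamma ** k) * r
    return u

# Example with arbitrary rewards:
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))   # 1 + 0 + 0.81*2 + 0.729*1 = 3.349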
Action-Value Functions Q(s, a)
Definition: Action-value function Q_π(s_t, a_t) = 𝔼[ U_t | S_t = s_t, A_t = a_t ], where U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯ is the discounted return defined above.
• The expectation is taken w.r.t. the future actions A_{t+1}, A_{t+2}, A_{t+3}, ⋯ and states S_{t+1}, S_{t+2}, S_{t+3}, ⋯
• It integrates out everything except the observations: A_t = a_t and S_t = s_t.
Deep Q Network (DQN)
• DQN approximates the action-value function with a neural network Q(s, a; w).
• Input shape: size of the screenshot (the state s).
• Output shape: dimension of the action space.
[Figure: state s → convolutional layers → feature vector → dense layers → one Q-value per action, e.g., Q(s, "left"; w) = 2000, Q(s, "right"; w) = 1000, Q(s, "up"; w) = 3000.]
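The slides do not give the exact architecture, so the following PyTorch sketch is only illustrative: the 84×84 grayscale screenshot, the layer sizes, and the three actions ("left", "right", "up") are assumptions.

import torch
import torch.nn as nn

class DQN(nn.Module):
    # Convolutional layers extract a feature vector from the screenshot;
    # dense layers map the feature vector to one Q-value per action.
    def __init__(self, num_actions=3):               # e.g., "left", "right", "up"
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.dense = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),    # 9x9 is the conv output size for an 84x84 input
            nn.Linear(256, num_actions),
        )

    def forward(self, s):                             # s: batch of screenshots, shape (B, 1, 84, 84)
        return self.dense(self.conv(s))               # shape (B, num_actions): Q(s, a; w) for every a

q_net = DQN()
q_values = q_net(torch.zeros(1, 1, 84, 84))           # one dummy screenshot
best_action = q_values.argmax(dim=1)                  # act greedily w.r.t. the predicted Q-values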
Reference
1. Sutton et al. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In NIPS, 2008.
2. Sutton et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, 2009.
Example
• I want to drive from NYC to Atlanta.
• The model Q(w) estimates the time cost, e.g., 1000 minutes.
• Question: How do I update the model?
• Make a prediction: q = Q(w), e.g., q = 1000 minutes (estimate).
• Finish the trip and get the target y, e.g., y = 860 minutes (actual).
• Loss: L = ½·(q − y)².
• Gradient: ∂L/∂w = (∂L/∂q)·(∂q/∂w) = (q − y)·∂Q(w)/∂w.
• Gradient descent: w_{t+1} = w_t − α·(∂L/∂w)|_{w=w_t}.
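A tiny numerical sketch of this update; the linear model Q(w) = w·x, the feature value x, and the learning rate below are invented for illustration and are not from the slides.

import numpy as np

w = np.array([100.0])            # model parameter (hypothetical 1-D example)
x = np.array([10.0])             # feature of the trip, so Q(w) = w @ x = 1000 minutes
alpha = 1e-4                     # learning rate

q = w @ x                        # prediction: 1000 minutes
y = 860.0                        # target observed after finishing the trip
loss = 0.5 * (q - y) ** 2        # L = 1/2 * (q - y)^2
grad = (q - y) * x               # dL/dw = (q - y) * dQ/dw, and dQ/dw = x for this linear model
w = w - alpha * grad             # gradient-descent step
print(q, w @ x)                  # the estimate moves from 1000.0 toward 860 (here to 998.6)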
Example
• I want to drive from NYC to Atlanta (via DC).
• The model Q(w) estimates the time cost, e.g., 1000 minutes.
• Question: How do I update the model?
• Can I update the model before finishing the trip?
• Can I get a better w as soon as I arrive at DC?
Temporal Difference (TD) Learning
• Model's estimate for NYC to Atlanta: Q(w) = 1000 minutes.
• I arrive at DC; the actual time cost from NYC to DC is 300 minutes.
• The model now estimates the remaining leg, DC to Atlanta, at 600 minutes.
• Updated estimate for NYC to Atlanta: 300 + 600 = 900 minutes. This is the TD target.
• The TD target y = 900 is a more reliable estimate than 1000, because it is partly based on the observed 300 minutes.
• Loss: L = ½·(Q(w) − y)².
• Gradient: ∂L/∂w = (Q(w) − y)·∂Q(w)/∂w = (1000 − 900)·∂Q(w)/∂w. The difference Q(w) − y is called the TD error.
• Gradient descent: w_{t+1} = w_t − α·(∂L/∂w)|_{w=w_t}.
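The same kind of numerical sketch for the TD update; only the numbers 1000, 300, and 600 come from the slides, while the linear model and its features are invented for illustration.

import numpy as np

w = np.array([100.0])              # parameter of the hypothetical linear model
x_nyc = np.array([10.0])           # feature of "NYC to Atlanta", so the estimate is w @ x_nyc = 1000
x_dc = np.array([6.0])             # feature of "DC to Atlanta", so the estimate is w @ x_dc = 600
alpha = 1e-4                       # learning rate

q = w @ x_nyc                      # model's estimate for the whole trip: 1000 minutes
y = 300.0 + w @ x_dc               # TD target: observed NYC-to-DC time + estimated DC-to-Atlanta time = 900
td_error = q - y                   # 1000 - 900 = 100
grad = td_error * x_nyc            # dL/dw for L = 1/2 * (Q(w) - y)^2, treating the TD target y as a constant
w = w - alpha * grad               # update w without waiting to arrive in Atlanta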
Why does TD learning work?
• Model's estimates:
  • NYC to Atlanta: 1000 minutes.
  • DC to Atlanta: 600 minutes.
  • ⟹ Implied estimate for NYC to DC: 1000 − 600 = 400 minutes.
• Ground truth:
  • NYC to DC: 300 minutes (actual).
• TD error: δ = 400 − 300 = 100. Gradient descent on the TD error pushes the model's estimates toward this partially observed ground truth.
TD Learning for DQN
How do we apply TD learning to DQN?
• Discounted return: U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯ = R_t + γ·(R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ⋯), and the term in parentheses equals U_{t+1}.
• Identity: U_t = R_t + γ·U_{t+1}.
• In TD learning, the left-hand side U_t is the prediction, and the right-hand side R_t + γ·U_{t+1} (one observed reward plus a discounted estimate of the rest) plays the role of the TD target.
• Train DQN using TD learning.
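A sketch of one TD training step for a DQN, reusing the hypothetical DQN class from the earlier sketch. The max-over-actions form of the TD target and the use of torch.no_grad() to hold the target fixed are standard choices that I am assuming here; they are not spelled out on these slides.

import torch

def td_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    # One TD update on a single transition (s_t, a_t, r_t, s_{t+1}).
    q_pred = q_net(s)[0, a]                              # prediction: Q(s_t, a_t; w)
    with torch.no_grad():                                # the TD target is treated as a constant
        q_next = q_net(s_next).max(dim=1).values[0]      # max_a Q(s_{t+1}, a; w)   (assumed target form)
        y = r + gamma * q_next * (1.0 - done)            # TD target: r_t + gamma * max_a Q(s_{t+1}, a; w)
    loss = 0.5 * (q_pred - y) ** 2                       # L = 1/2 * (prediction - TD target)^2
    optimizer.zero_grad()
    loss.backward()                                      # gradient of L w.r.t. w
    optimizer.step()                                     # w <- w - alpha * dL/dw
    return loss.item()

# Usage with the earlier DQN sketch:
# q_net = DQN(); optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-4)
# td_step(q_net, optimizer, s=torch.zeros(1, 1, 84, 84), a=0, r=1.0,
#         s_next=torch.zeros(1, 1, 84, 84), done=0.0)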