
Value-Based Reinforcement Learning

Shusen Wang
Action-Value Functions

Discounted Return

Definition: Discounted return (aka cumulative discounted future reward).
• U_t = R_t + γ⋅R_{t+1} + γ²⋅R_{t+2} + γ³⋅R_{t+3} + ⋯

• The return depends on actions A_t, A_{t+1}, A_{t+2}, ⋯ and states S_t, S_{t+1}, S_{t+2}, ⋯
• Actions are random: ℙ(A = a | S = s) = π(a | s). (Policy function.)
• States are random: ℙ(S′ = s′ | S = s, A = a) = p(s′ | s, a). (State transition.)
Action-Value Functions Q(s, a)

Definition: Discounted return (aka cumulative discounted future reward).
• U_t = R_t + γ⋅R_{t+1} + γ²⋅R_{t+2} + γ³⋅R_{t+3} + ⋯

Definition: Action-value function for policy π.

• Q_π(s_t, a_t) = 𝔼[U_t | S_t = s_t, A_t = a_t].

• The expectation is taken w.r.t. actions A_{t+1}, A_{t+2}, A_{t+3}, ⋯ and states S_{t+1}, S_{t+2}, S_{t+3}, ⋯
• Integrate out everything except for the observations: A_t = a_t and S_t = s_t.

Definition: Optimal action-value function.

• Q⋆(s_t, a_t) = max_π Q_π(s_t, a_t).

• Whatever policy function π is used, the result of taking a_t at state s_t cannot be better than Q⋆(s_t, a_t).
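To make the discounted return concrete, here is a minimal Python sketch (not from the slides) that computes U_t for a finite episode; the reward values and discount factor are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

# Example with made-up rewards observed from time t onward.
print(discounted_return([1.0, 0.0, 2.0, 3.0], gamma=0.9))  # 1 + 0 + 0.9^2 * 2 + 0.9^3 * 3
```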
Deep Q-Network (DQN)
Approximate the Q Function

Goal: Win the game (≈ maximize the total reward).

Question: If we know Q⋆(s, a), what is the best action?

• Obviously, the best action is a⋆ = argmax_a Q⋆(s, a).
• Q⋆ is an indicator of how good it is for an agent to pick action a while being in state s.

Challenge: We do not know Q⋆(s, a).

• Solution: Deep Q Network (DQN).
• Use a neural network Q(s, a; w) to approximate Q⋆(s, a).
Deep Q Network (DQN)

• Input shape: size of the screenshot.
• Output shape: dimension of the action space.

state s → Conv layers → feature → Dense layers →
  Q(s, "left";  w) = 2000
  Q(s, "right"; w) = 1000
  Q(s, "up";    w) = 3000

Question: Based on the predictions, what should be the action?
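The slides only sketch the architecture (convolutional layers followed by dense layers, one output score per action). Below is a minimal PyTorch sketch of such a network; the frame size (84×84, 4 stacked frames), channel counts, and layer sizes are assumptions for illustration, not the exact network from the slides.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of screen frames to one Q-value per action: Q(s, a; w)."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.conv = nn.Sequential(   # "Conv" part: extract features from the screen
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(     # "Dense" part: features -> one score per action
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes an 84x84 input frame
            nn.Linear(512, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(s))  # shape: (batch, n_actions)

# Example: 3 actions ("left", "right", "up"); a made-up batch of one 84x84 state.
q_net = DQN(n_actions=3)
q_values = q_net(torch.zeros(1, 4, 84, 84))  # Q(s, "left"; w), Q(s, "right"; w), Q(s, "up"; w)
```

Given such a network, the answer to the question above is the index of the largest output, e.g., q_values.argmax(dim=1).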


Apply DQN to Play Game

s_t → a_t → s_{t+1} → a_{t+1} → s_{t+2} → a_{t+2} → s_{t+3} → ⋯
(with rewards r_t, r_{t+1}, r_{t+2}, ⋯)

• Actions (greedy w.r.t. DQN):
  a_t = argmax_a Q(s_t, a; w),  a_{t+1} = argmax_a Q(s_{t+1}, a; w),  a_{t+2} = argmax_a Q(s_{t+2}, a; w), ⋯
• State transitions (by the environment):
  s_{t+1} ~ p(⋅ | s_t, a_t),  s_{t+2} ~ p(⋅ | s_{t+1}, a_{t+1}),  s_{t+3} ~ p(⋅ | s_{t+2}, a_{t+2}), ⋯
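As a rough illustration of this loop, the sketch below plays one episode greedily with a network like the one above. Here `env` is a hypothetical environment with a Gym-style reset()/step() interface, which the slides do not specify.

```python
import torch

def play_episode(env, q_net):
    """Roll out one episode, always taking a_t = argmax_a Q(s_t, a; w)."""
    s = env.reset()                                   # hypothetical: returns the initial state s_t
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
        a = int(q_values.argmax(dim=1).item())        # greedy action a_t = argmax_a Q(s_t, a; w)
        s, r, done = env.step(a)                      # hypothetical: env samples s_{t+1} ~ p(.|s_t, a_t)
        total_reward += r
    return total_reward
```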


Temporal Difference (TD) Learning

References

1. Sutton et al. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In NIPS, 2008.
2. Sutton et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, 2009.
Example

• I want to drive from NYC to Atlanta.
• Model Q(w) estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

• Make a prediction: q = Q(w), e.g., q = 1000 minutes (estimate).
• Finish the trip and get the target y, e.g., y = 860 minutes (actual).
• Loss: L = ½ (q − y)².
• Gradient: ∂L/∂w = (∂q/∂w) ⋅ (∂L/∂q) = (q − y) ⋅ ∂Q(w)/∂w.
• Gradient descent: w_{t+1} = w_t − α ⋅ ∂L/∂w |_{w = w_t}.
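A tiny Python sketch of this update. The "model" here is a made-up toy: a single learnable scalar w that directly outputs the predicted travel time, so ∂Q(w)/∂w = 1; the learning rate is also assumed.

```python
# Toy version of the NYC -> Atlanta example.
w = 1000.0           # current parameter; the prediction is q = Q(w) = w
y = 860.0            # observed travel time (target)
alpha = 0.1          # learning rate (made up)

q = w                          # prediction
grad = (q - y) * 1.0           # dL/dw = (q - y) * dQ/dw, with dQ/dw = 1 for this toy model
w = w - alpha * grad           # gradient descent step
print(w)                       # 986.0 -- moved toward the observed 860 minutes
```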
Example

• I want to drive from NYC to Atlanta (via DC).
• Model Q(w) estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

• Can I update the model before finishing the trip?
• Can I get a better w as soon as I arrive at DC?
Temporal Difference (TD) Learning

• Model's estimate:
  NYC to Atlanta: 1000 minutes (estimate).
• I arrived at DC; actual time cost:
  NYC to DC: 300 minutes (actual).
• Model now updates its estimate:
  DC to Atlanta: 600 minutes (estimate).
Temporal Difference (TD) Learning

• Model's estimate: Q(w) = 1000 minutes (NYC to Atlanta).
• Updated estimate: 300 (actual, NYC to DC) + 600 (estimate, DC to Atlanta) = 900 minutes.
  This is the TD target.
• The TD target y = 900 is a more reliable estimate than 1000.
• Loss: L = ½ (Q(w) − y)².
• Gradient: ∂L/∂w = (1000 − 900) ⋅ ∂Q(w)/∂w.
  The quantity Q(w) − y = 1000 − 900 is the TD error.
• Gradient descent: w_{t+1} = w_t − α ⋅ ∂L/∂w |_{w = w_t}.
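Here is the same toy scalar model as before, now updated with the TD target from the slide (300 actual + 600 estimated) instead of waiting for the full trip; the learning rate is again a made-up value.

```python
# TD update for the toy scalar model: w is the predicted NYC -> Atlanta time.
w = 1000.0                      # current estimate Q(w)
r = 300.0                       # actual time NYC -> DC
remaining_estimate = 600.0      # model's estimate for DC -> Atlanta
alpha = 0.1

y = r + remaining_estimate      # TD target: 900
td_error = w - y                # TD error: 100
w = w - alpha * td_error        # gradient step (dQ/dw = 1 for this toy model)
print(y, td_error, w)           # 900.0 100.0 990.0
```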
Why does TD learning work?

• Model's estimates:
  • NYC to Atlanta: 1000 minutes.
  • DC to Atlanta: 600 minutes.
  • ⇒ NYC to DC: 1000 − 600 = 400 minutes (implied).
• Ground truth:
  • NYC to DC: 300 minutes (actual).
• TD error: δ = 400 − 300 = 100.
TD Learning for DQN
How to apply TD learning to DQN?

• In the "driving time" example, we have the equation:

  T(NYC→ATL) ≈ T(NYC→DC) + T(DC→ATL).
  (model's estimate ≈ actual time + model's estimate)

• In deep reinforcement learning:

  Q(s_t, a_t; w) ≈ r_t + γ ⋅ Q(s_{t+1}, a_{t+1}; w).
How to apply TD learning to DQN?

Identity: U_t = R_t + γ ⋅ U_{t+1}.

• U_t = R_t + γ⋅R_{t+1} + γ²⋅R_{t+2} + γ³⋅R_{t+3} + γ⁴⋅R_{t+4} + ⋯
      = R_t + γ ⋅ (R_{t+1} + γ⋅R_{t+2} + γ²⋅R_{t+3} + γ³⋅R_{t+4} + ⋯)
      = R_t + γ ⋅ U_{t+1}.
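A quick numeric sanity check of this identity in Python, using made-up rewards for a short episode:

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]   # made-up rewards R_t, R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.9

u_t = discounted_return(rewards, gamma)           # U_t computed directly
u_next = discounted_return(rewards[1:], gamma)    # U_{t+1} over the rest of the episode
print(abs(u_t - (rewards[0] + gamma * u_next)) < 1e-12)   # True: U_t = R_t + gamma * U_{t+1}
```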
How to apply TD learning to DQN?

Identity: U_t = R_t + γ ⋅ U_{t+1}.

TD learning for DQN:

• DQN's output, Q(s_t, a_t; w), is an estimate of U_t.
• DQN's output, Q(s_{t+1}, a_{t+1}; w), is an estimate of U_{t+1}.
• Thus, Q(s_t, a_t; w) ≈ 𝔼[R_t + γ ⋅ Q(S_{t+1}, A_{t+1}; w)].
• With the observed reward r_t and next state s_{t+1}:
  Q(s_t, a_t; w) ≈ r_t + γ ⋅ Q(s_{t+1}, a_{t+1}; w).
  (left: prediction; right: TD target)
Train DQN using TD learning

• Prediction: Q(s_t, a_t; w_t).

• TD target:
  y_t = r_t + γ ⋅ Q(s_{t+1}, a_{t+1}; w_t)
      = r_t + γ ⋅ max_a Q(s_{t+1}, a; w_t).

• Loss: L_t = ½ [Q(s_t, a_t; w) − y_t]².

• Gradient descent: w_{t+1} = w_t − α ⋅ ∂L_t/∂w |_{w = w_t}.
Summary
Value-Based Reinforcement Learning

Definition: Optimal action-value function.

• Q⋆(s_t, a_t) = max_π 𝔼[U_t | S_t = s_t, A_t = a_t].

DQN: Approximate Q⋆(s, a) using a neural network (DQN).

• Q(s, a; w) is a neural network parameterized by w.
• Input: observed state s.
• Output: scores for all the actions a ∈ 𝒜.
Temporal Difference (TD) Learning

Algorithm: One iteration of TD learning.

1. Observe state S_t = s_t and perform action A_t = a_t.
2. Predict the value: q_t = Q(s_t, a_t; w_t).
3. Differentiate the value network: d_t = ∂Q(s_t, a_t; w)/∂w |_{w = w_t}.
4. Environment provides new state s_{t+1} and reward r_t.
5. Compute TD target: y_t = r_t + γ ⋅ max_a Q(s_{t+1}, a; w_t).
6. Gradient descent: w_{t+1} = w_t − α ⋅ (q_t − y_t) ⋅ d_t.
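The following PyTorch sketch puts steps 1-6 together for one training iteration, reusing a Q-network like the one sketched earlier. The environment interface (env.step), discount factor, and episode-end handling are assumptions for illustration, not details from the slides.

```python
import torch

def td_step(q_net, optimizer, env, s_t, gamma=0.99):
    """One iteration of TD learning for DQN (steps 1-6 from the slide)."""
    s_tensor = torch.as_tensor(s_t, dtype=torch.float32).unsqueeze(0)

    # Step 1: observe s_t and pick the greedy action a_t.
    with torch.no_grad():
        a_t = int(q_net(s_tensor).argmax(dim=1).item())

    # Step 2: prediction q_t = Q(s_t, a_t; w_t), kept in the autograd graph.
    q_t = q_net(s_tensor)[0, a_t]

    # Step 4: the environment returns s_{t+1} and r_t (hypothetical Gym-style API).
    s_next, r_t, done = env.step(a_t)

    # Step 5: TD target y_t = r_t + gamma * max_a Q(s_{t+1}, a; w_t); no gradient through the target.
    with torch.no_grad():
        s_next_tensor = torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0)
        y_t = r_t + (0.0 if done else gamma * q_net(s_next_tensor).max().item())
        # (setting the remaining return to 0 at episode end is a standard detail not covered in the slides)

    # Steps 3 + 6: backprop computes d_t = dQ/dw; with plain SGD the update is
    # w_{t+1} = w_t - alpha * (q_t - y_t) * d_t, the gradient of the squared TD error.
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return s_next, done
```

With, e.g., optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3), step 6 matches the slide's plain gradient-descent update.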
Play Breakout using DQN

(The video was posted on YouTube by DeepMind)


Thank you!
