
Value-Based Reinforcement Learning

Shusen Wang
Action-Value Functions

Discounted Return

Definition: Discounted return (aka cumulative discounted future reward).
• U_t = R_t + γ⋅R_{t+1} + γ²⋅R_{t+2} + γ³⋅R_{t+3} + ⋯

• The return depends on actions A_t, A_{t+1}, A_{t+2}, ⋯ and states S_t, S_{t+1}, S_{t+2}, ⋯
• Actions are random: ℙ(A = a | S = s) = π(a | s). (Policy function.)
• States are random: ℙ(S′ = s′ | S = s, A = a) = p(s′ | s, a). (State transition.)
Action-Value Functions Q(s, a)

Definition: Discounted return (aka cumulative discounted future reward).
• U_t = R_t + γ⋅R_{t+1} + γ²⋅R_{t+2} + γ³⋅R_{t+3} + ⋯

Definition: Action-value function for policy π.

• Q_π(s_t, a_t) = 𝔼[U_t | S_t = s_t, A_t = a_t].

• The expectation is taken w.r.t. actions A_{t+1}, A_{t+2}, A_{t+3}, ⋯ and states S_{t+1}, S_{t+2}, S_{t+3}, ⋯
• Integrate out everything except for the observations: A_t = a_t and S_t = s_t.

Definition: Optimal action-value function.

• Q⋆(s_t, a_t) = max_π Q_π(s_t, a_t).

• Whatever policy function π is used, the result of taking a_t at state s_t cannot be better than Q⋆(s_t, a_t).
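To make the discounted return concrete, here is a minimal Python sketch (not from the slides) that computes U_t for a finite episode; the reward values and discount factor are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

# Example with made-up rewards observed from time t onward.
print(discounted_return([1.0, 0.0, 2.0, 3.0], gamma=0.9))  # 1 + 0 + 0.9^2 * 2 + 0.9^3 * 3
```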
Deep Q-Network (DQN)
Approximate the Q Function

Goal: Win the game (≈ maximize the total reward).

Question: If we know Q⋆(s, a), what is the best action?

• Obviously, the best action is a⋆ = argmax_a Q⋆(s, a).
• Q⋆ is an indicator of how good it is for an agent to pick action a while being in state s.

Challenge: We do not know Q⋆(s, a).

• Solution: Deep Q Network (DQN).
• Use a neural network Q(s, a; w) to approximate Q⋆(s, a).
Deep Q Network (DQN)

• Input shape: size of the screenshot.
• Output shape: dimension of the action space.

state s → Conv layers → feature → Dense layers →
  Q(s, "left";  w) = 2000
  Q(s, "right"; w) = 1000
  Q(s, "up";    w) = 3000

Question: Based on the predictions, what should be the action?
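The slides only sketch the architecture (convolutional layers followed by dense layers, one output score per action). Below is a minimal PyTorch sketch of such a network; the frame size (84×84, 4 stacked frames), channel counts, and layer sizes are assumptions for illustration, not the exact network from the slides.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of screen frames to one Q-value per action: Q(s, a; w)."""
    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.conv = nn.Sequential(   # "Conv" part: extract features from the screen
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(     # "Dense" part: features -> one score per action
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes an 84x84 input frame
            nn.Linear(512, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(s))  # shape: (batch, n_actions)

# Example: 3 actions ("left", "right", "up"); a made-up batch of one 84x84 state.
q_net = DQN(n_actions=3)
q_values = q_net(torch.zeros(1, 4, 84, 84))  # Q(s, "left"; w), Q(s, "right"; w), Q(s, "up"; w)
```

Given such a network, the answer to the question above is the index of the largest output, e.g., q_values.argmax(dim=1).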


Apply DQN to Play Game

s_t → a_t → s_{t+1} → a_{t+1} → s_{t+2} → a_{t+2} → s_{t+3} → ⋯
(with rewards r_t, r_{t+1}, r_{t+2}, ⋯)

• Actions (greedy w.r.t. DQN):
  a_t = argmax_a Q(s_t, a; w),  a_{t+1} = argmax_a Q(s_{t+1}, a; w),  a_{t+2} = argmax_a Q(s_{t+2}, a; w), ⋯
• State transitions (by the environment):
  s_{t+1} ~ p(⋅ | s_t, a_t),  s_{t+2} ~ p(⋅ | s_{t+1}, a_{t+1}),  s_{t+3} ~ p(⋅ | s_{t+2}, a_{t+2}), ⋯
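As a rough illustration of this loop, the sketch below plays one episode greedily with a network like the one above. Here `env` is a hypothetical environment with a Gym-style reset()/step() interface, which the slides do not specify.

```python
import torch

def play_episode(env, q_net):
    """Roll out one episode, always taking a_t = argmax_a Q(s_t, a; w)."""
    s = env.reset()                                   # hypothetical: returns the initial state s_t
    total_reward, done = 0.0, False
    while not done:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
        a = int(q_values.argmax(dim=1).item())        # greedy action a_t = argmax_a Q(s_t, a; w)
        s, r, done = env.step(a)                      # hypothetical: env samples s_{t+1} ~ p(.|s_t, a_t)
        total_reward += r
    return total_reward
```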


Temporal Difference (TD) Learning

References

1. Sutton et al. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In NIPS, 2008.
2. Sutton et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In ICML, 2009.
Example

• I want to drive from NYC to Atlanta.
• Model Q(w) estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

• Make a prediction: q = Q(w), e.g., q = 1000 minutes (estimate).
• Finish the trip and get the target y, e.g., y = 860 minutes (actual).
• Loss: L = ½ (q − y)².
• Gradient: ∂L/∂w = (∂q/∂w) ⋅ (∂L/∂q) = (q − y) ⋅ ∂Q(w)/∂w.
• Gradient descent: w_{t+1} = w_t − α ⋅ ∂L/∂w |_{w = w_t}.
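A tiny Python sketch of this update. The "model" here is a made-up toy: a single learnable scalar w that directly outputs the predicted travel time, so ∂Q(w)/∂w = 1; the learning rate is also assumed.

```python
# Toy version of the NYC -> Atlanta example.
w = 1000.0           # current parameter; the prediction is q = Q(w) = w
y = 860.0            # observed travel time (target)
alpha = 0.1          # learning rate (made up)

q = w                          # prediction
grad = (q - y) * 1.0           # dL/dw = (q - y) * dQ/dw, with dQ/dw = 1 for this toy model
w = w - alpha * grad           # gradient descent step
print(w)                       # 986.0 -- moved toward the observed 860 minutes
```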
Example

• I want to drive from NYC to Atlanta (via DC).
• Model Q(w) estimates the time cost, e.g., 1000 minutes.

Question: How do I update the model?

• Can I update the model before finishing the trip?
• Can I get a better w as soon as I arrive at DC?
Temporal Difference (TD) Learning

• Model's estimate:
  NYC to Atlanta: 1000 minutes (estimate).
• I arrived at DC; actual time cost:
  NYC to DC: 300 minutes (actual).
• Model now updates its estimate:
  DC to Atlanta: 600 minutes (estimate).
Temporal Difference (TD) Learning

• Model's estimate: Q(w) = 1000 minutes (NYC to Atlanta).
• Updated estimate: 300 (actual, NYC to DC) + 600 (estimate, DC to Atlanta) = 900 minutes.
  This is the TD target.
• The TD target y = 900 is a more reliable estimate than 1000.
• Loss: L = ½ (Q(w) − y)².
• Gradient: ∂L/∂w = (1000 − 900) ⋅ ∂Q(w)/∂w.
  The quantity Q(w) − y = 1000 − 900 is the TD error.
• Gradient descent: w_{t+1} = w_t − α ⋅ ∂L/∂w |_{w = w_t}.
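Here is the same toy scalar model as before, now updated with the TD target from the slide (300 actual + 600 estimated) instead of waiting for the full trip; the learning rate is again a made-up value.

```python
# TD update for the toy scalar model: w is the predicted NYC -> Atlanta time.
w = 1000.0                      # current estimate Q(w)
r = 300.0                       # actual time NYC -> DC
remaining_estimate = 600.0      # model's estimate for DC -> Atlanta
alpha = 0.1

y = r + remaining_estimate      # TD target: 900
td_error = w - y                # TD error: 100
w = w - alpha * td_error        # gradient step (dQ/dw = 1 for this toy model)
print(y, td_error, w)           # 900.0 100.0 990.0
```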
Why does TD learning work?

• Model's estimates:
  • NYC to Atlanta: 1000 minutes.
  • DC to Atlanta: 600 minutes.
  • ⇒ NYC to DC: 1000 − 600 = 400 minutes (implied).
• Ground truth:
  • NYC to DC: 300 minutes (actual).
• TD error: δ = 400 − 300 = 100.
TD Learning for DQN
How to apply TD learning to DQN?

• In the "driving time" example, we have the equation:

  T(NYC→ATL) ≈ T(NYC→DC) + T(DC→ATL).
  (model's estimate ≈ actual time + model's estimate)

• In deep reinforcement learning:

  Q(s_t, a_t; w) ≈ r_t + γ ⋅ Q(s_{t+1}, a_{t+1}; w).
How to apply TD learning to DQN?

Identity: U_t = R_t + γ ⋅ U_{t+1}.

• U_t = R_t + γ⋅R_{t+1} + γ²⋅R_{t+2} + γ³⋅R_{t+3} + γ⁴⋅R_{t+4} + ⋯
      = R_t + γ ⋅ (R_{t+1} + γ⋅R_{t+2} + γ²⋅R_{t+3} + γ³⋅R_{t+4} + ⋯)
      = R_t + γ ⋅ U_{t+1}.
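A quick numeric sanity check of this identity in Python, using made-up rewards for a short episode:

```python
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]   # made-up rewards R_t, R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.9

u_t = discounted_return(rewards, gamma)           # U_t computed directly
u_next = discounted_return(rewards[1:], gamma)    # U_{t+1} over the rest of the episode
print(abs(u_t - (rewards[0] + gamma * u_next)) < 1e-12)   # True: U_t = R_t + gamma * U_{t+1}
```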
How to apply TD learning to DQN?

Identity: U_t = R_t + γ ⋅ U_{t+1}.

TD learning for DQN:

• DQN's output, Q(s_t, a_t; w), is an estimate of U_t.
• DQN's output, Q(s_{t+1}, a_{t+1}; w), is an estimate of U_{t+1}.
• Thus, Q(s_t, a_t; w) ≈ 𝔼[R_t + γ ⋅ Q(S_{t+1}, A_{t+1}; w)].
• With the observed reward r_t and next state s_{t+1}:
  Q(s_t, a_t; w) ≈ r_t + γ ⋅ Q(s_{t+1}, a_{t+1}; w).
  (left: prediction; right: TD target)
Train DQN using TD learning

• Prediction: Q(s_t, a_t; w_t).

• TD target:
  y_t = r_t + γ ⋅ Q(s_{t+1}, a_{t+1}; w_t)
      = r_t + γ ⋅ max_a Q(s_{t+1}, a; w_t).

• Loss: L_t = ½ [Q(s_t, a_t; w) − y_t]².

• Gradient descent: w_{t+1} = w_t − α ⋅ ∂L_t/∂w |_{w = w_t}.
Summary
Value-Based Reinforcement Learning

Definition: Optimal action-value function.

• Q⋆(s_t, a_t) = max_π 𝔼[U_t | S_t = s_t, A_t = a_t].

DQN: Approximate Q⋆(s, a) using a neural network (DQN).

• Q(s, a; w) is a neural network parameterized by w.
• Input: observed state s.
• Output: scores for all the actions a ∈ 𝒜.
Temporal Difference (TD) Learning

Algorithm: One iteration of TD learning.

1. Observe state S_t = s_t and perform action A_t = a_t.
2. Predict the value: q_t = Q(s_t, a_t; w_t).
3. Differentiate the value network: d_t = ∂Q(s_t, a_t; w)/∂w |_{w = w_t}.
4. Environment provides new state s_{t+1} and reward r_t.
5. Compute TD target: y_t = r_t + γ ⋅ max_a Q(s_{t+1}, a; w_t).
6. Gradient descent: w_{t+1} = w_t − α ⋅ (q_t − y_t) ⋅ d_t.
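The following PyTorch sketch puts steps 1-6 together for one training iteration, reusing a Q-network like the one sketched earlier. The environment interface (env.step), discount factor, and episode-end handling are assumptions for illustration, not details from the slides.

```python
import torch

def td_step(q_net, optimizer, env, s_t, gamma=0.99):
    """One iteration of TD learning for DQN (steps 1-6 from the slide)."""
    s_tensor = torch.as_tensor(s_t, dtype=torch.float32).unsqueeze(0)

    # Step 1: observe s_t and pick the greedy action a_t.
    with torch.no_grad():
        a_t = int(q_net(s_tensor).argmax(dim=1).item())

    # Step 2: prediction q_t = Q(s_t, a_t; w_t), kept in the autograd graph.
    q_t = q_net(s_tensor)[0, a_t]

    # Step 4: the environment returns s_{t+1} and r_t (hypothetical Gym-style API).
    s_next, r_t, done = env.step(a_t)

    # Step 5: TD target y_t = r_t + gamma * max_a Q(s_{t+1}, a; w_t); no gradient through the target.
    with torch.no_grad():
        s_next_tensor = torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0)
        y_t = r_t + (0.0 if done else gamma * q_net(s_next_tensor).max().item())
        # (setting the remaining return to 0 at episode end is a standard detail not covered in the slides)

    # Steps 3 + 6: backprop computes d_t = dQ/dw; with plain SGD the update is
    # w_{t+1} = w_t - alpha * (q_t - y_t) * d_t, the gradient of the squared TD error.
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return s_next, done
```

With, e.g., optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3), step 6 matches the slide's plain gradient-descent update.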
Play Breakout using DQN

(The video was posted on YouTube by DeepMind)


Thank you!
