Lecture 2: Deep Reinforcement Learning (A)
• Associate Professor
• Electrical and Computer Engineering
• Newark College of Engineering
• New Jersey Institute of Technology
• https://tao-han-njit.netlify.app
Slides are based on Prof. Hung-yi Lee’s Machine Learning courses at National Taiwan University.
Supervised Learning → RL
In supervised learning, a human labels each training example (e.g., this image is a “Cat”).
For some tasks, even humans cannot easily label the best output (is the next move “3-3”?); this is where RL comes in.
Policy Gradient
Actor-Critic
Machine Learning ≈ Looking for a Function
In RL, the function we are looking for is the Actor:
Action = f(Observation)
The Environment provides the Observation (the function’s input), and the Actor returns the Action (the function’s output) to the Environment.
Example: Playing Video Game
• Space Invaders: the observation is the game screen; actions include moving the spaceship and firing; shields protect the spaceship.
• Score (reward): obtained by killing the aliens.
• Termination: all the aliens are killed, or your spaceship is destroyed.
Example: Playing Video Game
The Actor receives an Observation (the game screen) from the Environment and outputs an Action, e.g., “right”.
The Environment then returns a Reward: here, reward = 0.
Example: Playing Video Game
Find an actor maximizing expected reward.
Here the Actor outputs the Action “fire”, and the Environment returns reward = 5 if an alien is killed.
Machine Learning is so simple ……
The actor is a network that takes the game pixels as input and outputs a score for each action (e.g., “fire” 0.1); the action is then sampled from this distribution.
This is a classification task!!!
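A minimal PyTorch sketch of this idea (the frame size, hidden width, and the Actor name are illustrative assumptions, not the lecture's exact network): the actor scores each action like a classifier, and the action is sampled from the resulting distribution.

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim=84 * 84, n_actions=3):  # e.g., left / right / fire
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return torch.softmax(self.net(obs), dim=-1)  # probability for each action

actor = Actor()
obs = torch.rand(1, 84 * 84)                              # flattened game pixels
probs = actor(obs)                                        # e.g., [0.7, 0.2, 0.1]
action = torch.distributions.Categorical(probs).sample()  # sample rather than argmax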
This is an episode: at each step the actor takes action a_t and obtains reward r_t; after many turns the game is over (spaceship destroyed).
Total reward (return): R = Σ_{t=1}^{T} r_t
What we want is to maximize this return.
Step 3: Optimization
Trajectory τ = {s_1, a_1, s_2, a_2, ⋯}: the environment produces s_1, the actor network outputs a_1, the environment then produces s_2, and so on.
R(τ) = Σ_{t=1}^{T} r_t
How to do the optimization here is the main challenge in RL (c.f. GAN).
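A sketch of collecting one trajectory and its return, assuming a Gymnasium-style environment API (reset/step) and the Actor sketched above; run_episode is a hypothetical helper name.

import torch

def run_episode(env, actor):
    # Roll out one episode: tau = s1, a1, s2, a2, ...  and  R(tau) = sum_t r_t
    trajectory, total_reward = [], 0.0
    obs, _ = env.reset()
    done = False
    while not done:
        x = torch.as_tensor(obs, dtype=torch.float32).flatten().unsqueeze(0)
        action = torch.distributions.Categorical(actor(x)).sample().item()
        next_obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((obs, action, reward))
        total_reward += reward
        obs, done = next_obs, terminated or truncated
    return trajectory, total_reward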
Outline
Policy Gradient
Actor-Critic
How to control your actor
• Make it take (or don’t take) a specific action â given a specific observation s.
The actor θ takes s as input and outputs an action distribution a; the target â is a one-hot label (e.g., left 1, right 0, fire 0).
e = cross-entropy(a, â)
To make the actor take action â: L = e
To make it not take action â: L = −e
θ* = arg min_θ L
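A sketch of these two cases, reusing the Actor sketch above (so actor.net gives the pre-softmax scores); a_hat is the index of the target action, and the take flag switches between L = e and L = −e.

import torch
import torch.nn.functional as F

def control_loss(actor, s, a_hat, take=True):
    logits = actor.net(s)                  # unnormalized action scores for observation s
    e = F.cross_entropy(logits, a_hat)     # e = cross-entropy(actor output, a_hat)
    return e if take else -e               # L = e (take a_hat)  or  L = -e (don't take a_hat)

s = torch.rand(1, 84 * 84)
a_hat = torch.tensor([2])                  # e.g., "fire"
loss = control_loss(actor, s, a_hat, take=True)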
How to control your actor
Take action â given s (cross-entropy e_1); don’t take action â′ given s′ (cross-entropy e_2).
L = e_1 − e_2
θ* = arg min_θ L
How to control your actor
Version 0
Training data: pairs {(s_i, a_i)}, each with an evaluation A_i, collected from many episodes.
Version 0 sets A_i = r_i, the reward obtained right after taking a_i at s_i.
This is a short-sighted version! An action affects the subsequent observations and rewards: e.g., “right” earns r_1 = 0 but sets up the later “fire”, which earns r_2 = +5.
Version 1 (cumulated reward)
G_1 = r_1 + r_2 + r_3 + ⋯ + r_N
G_2 = r_2 + r_3 + ⋯ + r_N
G_3 = r_3 + ⋯ + r_N
In general, G_t = Σ_{t'=t}^{N} r_{t'}; use A_i = G_i.
Version 2 (discounted cumulated reward)
Is a reward obtained many steps later really also the credit of a_1? Discount it:
G'_1 = r_1 + γr_2 + γ²r_3 + ⋯ (discount factor γ < 1)
In general, G'_t = Σ_{t'=t}^{N} γ^{t'−t} r_{t'}; use A_i = G'_i.
Version 3 (baseline)
Good or bad reward is “relative”: if all r_t ≥ 10, then r_t = 10 is actually a bad outcome.
Subtract a baseline b: A_i = G'_i − b, where G'_t = Σ_{t'=t}^{N} γ^{t'−t} r_{t'}.
This makes G'_t take both positive and negative values. But how to choose b?
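A sketch of computing A_t for Versions 1–3 from one episode's rewards; here the baseline b is simply the mean return, an illustrative choice, while the following slides replace it with a learned critic.

import numpy as np

def advantages(rewards, gamma=0.99):
    # G'_t = r_t + gamma * G'_{t+1}, computed backwards through the episode
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G - G.mean()                    # A_t = G'_t - b, with b = mean(G') here

print(advantages([0.0, 5.0, 0.0, 1.0]))    # positive and negative values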
Policy Gradient
• Initialize actor network parameters θ^0
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact with the environment
  • Obtain data {s_1, a_1, s_2, a_2, …, s_N, a_N}
  • Compute A_1, A_2, …, A_N
  • Compute loss L
  • θ^i ← θ^{i−1} − η∇L
Data collection is inside the “for loop” of training iterations.
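A sketch of this loop under the earlier assumptions: run_episode and advantages are the hypothetical helpers sketched above, policy_gradient_loss is sketched after the next slide, and T, eta, and gamma are illustrative values.

import torch

def train(env, actor, T=1000, eta=1e-3, gamma=0.99):
    opt = torch.optim.SGD(actor.parameters(), lr=eta)    # theta_i <- theta_{i-1} - eta * grad L
    for i in range(T):
        # data collection sits inside the training loop: interact with the current actor
        trajectory, _ = run_episode(env, actor)
        rewards = [r for (_, _, r) in trajectory]
        A = advantages(rewards, gamma)                   # A_1, ..., A_N
        loss = policy_gradient_loss(actor, trajectory, A)
        opt.zero_grad()
        loss.backward()
        opt.step()                                       # one update, then collect new data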
Policy Gradient
Training data: {(s_1, a_1, A_1), (s_2, a_2, A_2), (s_3, a_3, A_3), ……}
The actor θ maps s to a; the loss is L = Σ_i A_i e_i, where e_i is the cross-entropy between the actor’s output at s_i and the recorded action a_i.
θ^i ← θ^{i−1} − η∇L
The data {(s_i, a_i, A_i)} collected with θ^{i−1} is used for only one update; then new data must be collected.
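A sketch of this loss under the same assumptions (an Actor whose .net produces pre-softmax scores): reduction="none" keeps the per-pair cross-entropies e_i so they can be weighted by A_i.

import torch
import torch.nn.functional as F

def policy_gradient_loss(actor, trajectory, A):
    obs = torch.stack([torch.as_tensor(s, dtype=torch.float32).flatten()
                       for (s, _, _) in trajectory])
    acts = torch.tensor([a for (_, a, _) in trajectory])
    e = F.cross_entropy(actor.net(obs), acts, reduction="none")   # e_1, ..., e_N
    return (torch.as_tensor(A, dtype=torch.float32) * e).sum()    # L = sum_i A_i * e_i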
Policy Gradient
Actor-Critic
Critic
Value function V^θ(s): when using actor θ, the discounted cumulated reward (G'_1 = r_1 + γr_2 + γ²r_3 + ⋯) expected to be obtained after seeing s.
The critic takes s as input and outputs a scalar V^θ(s).
Monte-Carlo (MC) approach: the critic watches the actor play many episodes.
• After seeing s_a, the cumulated reward until the end of the episode is G'_a, so V^θ(s_a) should be close to G'_a (here V^θ(s_a) is large).
• After seeing s_b, the cumulated reward until the end of the episode is G'_b, so V^θ(s_b) should be close to G'_b (here V^θ(s_b) is smaller).
How to estimate V^θ(s)
• Temporal-difference (TD) approach: learn from a single transition ⋯ s_t, a_t, r_t, s_{t+1} ⋯ (ignore the expectation here)
V^θ(s_t) = r_t + γr_{t+1} + γ²r_{t+2} + ⋯
V^θ(s_{t+1}) = r_{t+1} + γr_{t+2} + ⋯
⇒ V^θ(s_t) = γV^θ(s_{t+1}) + r_t
Train the critic so that V^θ(s_t) − γV^θ(s_{t+1}) is close to r_t.
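A sketch of the TD update for a single transition (s_t, a_t, r_t, s_{t+1}), using the same illustrative critic: the bootstrapped target r_t + gamma * V(s_{t+1}) is held fixed so that V(s_t) − gamma * V(s_{t+1}) is pushed toward r_t.

import torch
import torch.nn.functional as F

def td_update(critic, opt, s_t, r_t, s_next, gamma=0.99):
    with torch.no_grad():
        target = r_t + gamma * critic(s_next)   # bootstrap from the next state's value
    loss = F.mse_loss(critic(s_t), target)
    opt.zero_grad()
    loss.backward()
    opt.step()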
MC v.s. TD
• The critic has observed the following 8 episodes (taking γ = 1 for simplicity):
• s_a, r = 0, s_b, r = 0, END
• s_b, r = 1, END
• s_b, r = 1, END
• s_b, r = 1, END
• s_b, r = 1, END
• s_b, r = 1, END
• s_b, r = 1, END
• s_b, r = 0, END
V^θ(s_b) = 3/4 (six of the eight episodes that visit s_b end with reward 1)
V^θ(s_a) = ? 0? 3/4?
Monte-Carlo: V^θ(s_a) = 0 (the only episode containing s_a has total reward 0).
Temporal-difference: V^θ(s_a) = r + V^θ(s_b) = 0 + 3/4 = 3/4.
Version 3.5
How should the baseline b in A_i = G'_i − b be chosen? Use the critic: set b = V^θ(s_i).
A_i = G'_i − V^θ(s_i)
Version 3.5: A_t = G'_t − V^θ(s_t)
V^θ(s_t) is the expected cumulated reward after seeing s_t, averaged over the actions the actor may sample from its distribution (not necessarily a_t), e.g., G = 100, 3, 1, 2, −10.
G'_t is the cumulated reward obtained after actually taking a_t at s_t; it is just one sample.
A_t > 0: a_t is better than average.
A_t < 0: a_t is worse than average.
Version 4 (Advantage Actor-Critic)
Version 3.5 compares a single sample G'_t against the average V^θ(s_t). Replace the sample with its expectation r_t + V^θ(s_{t+1}):
A_t = r_t + V^θ(s_{t+1}) − V^θ(s_t)
• V^θ(s_t): the expected cumulated reward after s_t, averaged over the actions that may be sampled (not necessarily a_t), e.g., G = 100, 3, 1, 2, −10.
• r_t + V^θ(s_{t+1}): take a_t at s_t, obtain r_t, reach s_{t+1}, then average over what follows, e.g., G = 101, 4, 3, 1, −5.
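A sketch of the Version 4 advantage for a whole episode, assuming a stacked (N, obs_dim) tensor of states and an (N,) reward tensor; gamma defaults to 1.0 to match the formula on the slide, and V(s_{N+1}) is taken as 0 after the final step.

import torch

def v4_advantages(critic, states, rewards, gamma=1.0):
    with torch.no_grad():
        v = critic(states).squeeze(-1)                 # V(s_1), ..., V(s_N)
        v_next = torch.cat([v[1:], torch.zeros(1)])    # V(s_{t+1}), 0 after the last step
    return rewards + gamma * v_next - v                # A_t = r_t + V(s_{t+1}) - V(s_t)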
Tip of Actor-Critic
• The parameters of the actor and critic can be shared: the same front-end layers process the observation, with one head outputting the action probabilities (e.g., left / right / fire) and another head outputting the scalar V^θ(s).
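A sketch of such parameter sharing (layer sizes and names are illustrative): one shared front end processes the observation, an actor head produces action probabilities, and a critic head produces the scalar V(s).

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=84 * 84, n_actions=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())  # shared layers
        self.actor_head = nn.Linear(128, n_actions)   # e.g., left / right / fire
        self.critic_head = nn.Linear(128, 1)          # scalar V(s)

    def forward(self, obs):
        h = self.shared(obs)
        probs = torch.softmax(self.actor_head(h), dim=-1)
        return probs, self.critic_head(h).squeeze(-1)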
Policy Gradient
Actor-Critic