Lecture 2: Deep Reinforcement Learning (A)

The document outlines a lecture from the Applied Machine Learning course taught by Dr. Tao Han, focusing on Reinforcement Learning (RL) concepts such as policy gradients and actor-critic methods. It discusses the challenge of labeling data in some tasks, the three-step structure of RL, and examples such as playing video games to illustrate how an actor maximizes reward through observations and actions. It also covers the optimization process in RL and the importance of exploration during training-data collection.


ECE 381: Applied Machine Learning

• Tao Han, Ph.D.
• Associate Professor
• Electrical and Computer Engineering
• Newark College of Engineering
• New Jersey Institute of Technology
• https://tao-han-njit.netlify.app

Slides are designed based on Prof. Hung-yi Lee's Machine Learning courses at National Taiwan University.
Supervised Learning → RL

• In supervised learning, a human labels each training example (e.g., labeling an image as "cat", or labeling the best next Go move as "3-3").
• It is challenging to label data in some tasks.
• …… but the machine can still know whether its results are good or not. This is the setting of reinforcement learning (RL).
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic
Machine Learning ≈ Looking for a Function

• The actor is a function: Action = f(Observation). The observation is the function input; the action is the function output.
• The environment provides the observation to the actor and returns a reward for each action.
• Goal: find a policy (actor) maximizing the total reward.
Example: Playing Video Game

• Space Invaders
• Termination: all the aliens are killed, or your spaceship is destroyed.
• Score (reward): obtained by killing the aliens.
• [Game screenshot: the player's spaceship, the shields, and the "fire" action.]
Example: Playing Video Game

• Environment → Observation → Actor → Action: "right"
• Reward: 0
Example: Playing Video Game

Find an actor maximizing expected reward.

• Environment → Observation → Actor → Action: "fire"
• Reward: 5 if killing an alien.
Machine Learning is so simple ……

• Step 1: function with unknown parameters
• Step 2: define loss from training data
• Step 3: optimization
Step 1: Function with Unknown

Policy Network (Actor)

• Input: the pixels of the game screen.
• Output: a score for each action, e.g., left 0.7, right 0.2, fire 0.1.
• The action is sampled based on the scores.

This is a classification task!!!

• Input of the neural network: the observation of the machine, represented as a vector or a matrix.
• Output of the neural network: each action corresponds to a neuron in the output layer.
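To make this concrete, here is a minimal sketch of such a policy network in PyTorch. It is not from the slides; the flattened 84×84 pixel input, the layer sizes, and the three-action output are assumptions for illustration. The network outputs a probability for each action, and the action is sampled from those probabilities.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: maps an observation (flattened pixels) to scores over actions."""
    def __init__(self, obs_dim: int = 84 * 84, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),   # one output neuron per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into probabilities, e.g., left 0.7, right 0.2, fire 0.1
        return torch.softmax(self.net(obs), dim=-1)

# Sample an action based on the scores (classification-style output).
actor = PolicyNetwork()
obs = torch.rand(1, 84 * 84)             # a fake observation, for illustration only
probs = actor(obs)
action = torch.distributions.Categorical(probs).sample()
```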
Step 2: Define "Loss"

• Start with observation s_1. The actor takes action a_1: "right" and obtains reward r_1 = 0.
• The environment returns observation s_2. The actor takes action a_2: "fire" (killing an alien) and obtains reward r_2 = 5.
• The environment returns observation s_3, and so on.
Step 2: Define "Loss"

• The interaction continues. After many turns the actor takes action a_T, obtains reward r_T, and the game is over (spaceship destroyed).
• Everything from the start to Game Over is one episode.
• Total reward (return) of the episode: R = Σ_{t=1}^T r_t. This is what we want to maximize.
• Since R is what we want to maximize, the loss to minimize is its negative, −R.
Step 3: Optimization

• Trajectory: τ = {s_1, a_1, s_2, a_2, ⋯}
• Env → s_1 → Actor → a_1 → Env → s_2 → Actor → a_2 → Env → s_3 → ……
• Each step's reward is computed from the state and the action: r_1 from (s_1, a_1), r_2 from (s_2, a_2), ……
• Total reward of the trajectory: R(τ) = Σ_{t=1}^T r_t
• The actor samples its actions; the environment and the reward are black boxes with randomness.
• How to do the optimization here is the main challenge in RL. (c.f. GAN)
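As a rough illustration of this interaction loop, the sketch below rolls out one episode and records the trajectory and its return R(τ). It assumes a Gymnasium-style environment API and the hypothetical PolicyNetwork above; none of these names come from the slides.

```python
import torch

def rollout(env, actor, max_steps: int = 1000):
    """Interact for one episode; return the trajectory and total reward R(τ)."""
    states, actions, rewards = [], [], []
    obs, _ = env.reset()
    for _ in range(max_steps):
        x = torch.as_tensor(obs, dtype=torch.float32).flatten().unsqueeze(0)
        probs = actor(x)
        action = torch.distributions.Categorical(probs).sample().item()  # sample, don't argmax
        next_obs, reward, terminated, truncated, _ = env.step(action)    # env is a black box
        states.append(obs); actions.append(action); rewards.append(reward)
        obs = next_obs
        if terminated or truncated:
            break
    return states, actions, rewards, sum(rewards)   # R(τ) = Σ_t r_t
```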
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic
How to control your actor

• Make it take (or don't take) a specific action â given a specific observation s.
• Feed s to the actor θ; its output a is a distribution over actions. The target â is a one-hot vector, e.g., left 1, right 0, fire 0.
• e = cross-entropy between a and â.
• To make the actor take action â: define L = e and solve θ* = arg min_θ L.
• To make the actor NOT take action â: define L = −e.
How to control your actor

• Take action â given s: e_1 = cross-entropy between the actor's output for s and â (e.g., â = left: 1, 0, 0).
• Don't take action â′ given s′: e_2 = cross-entropy between the actor's output for s′ and â′ (e.g., â′ = right: 0, 1, 0).
• L = e_1 − e_2, and θ* = arg min_θ L.
How to control your actor

Training data: (observation, action) pairs, each labeled as desired or not.

• (s_1, â_1): +1 (Yes)
• (s_2, â_2): −1 (No)
• (s_3, â_3): +1 (Yes)
• ……
• (s_N, â_N): −1 (No)

L = + e_1 − e_2 + e_3 ⋯ − e_N,   θ* = arg min_θ L
How to control your actor

Instead of only "Yes/No", each (observation, action) pair can be weighted by how desirable the action is:

• (s_1, â_1): A_1 = +1.5
• (s_2, â_2): A_2 = −0.5
• (s_3, â_3): A_3 = +0.5
• ……
• (s_N, â_N): A_N = −10

L = Σ_n A_n e_n,   θ* = arg min_θ L

How should A_1, A_2, ⋯, A_N be defined? (See the versions below; a code sketch of this weighted loss follows.)
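A minimal sketch of this weighted loss in PyTorch, assuming the trajectory has already been collected and the weights A_n computed. The function name and tensor layout are mine, not the slides'.

```python
import torch

def policy_loss(actor, states, actions, A):
    """L = Σ_n A_n e_n, where e_n is the cross-entropy between the actor's
    output distribution at s_n and the action a_n that was taken."""
    probs = actor(states)                                        # (N, n_actions)
    log_probs = torch.log(probs + 1e-8)
    # Cross-entropy with a one-hot target is simply -log p(a_n | s_n).
    e = -log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)    # (N,)
    return (A * e).sum()

# states: (N, obs_dim) float tensor, actions: (N,) long tensor, A: (N,) float tensor
# loss = policy_loss(actor, states, actions, A); loss.backward(); optimizer.step()
```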
Version 0

• Collect many episodes by letting the actor interact with the environment: Env → s_1 → Actor → a_1 → Env → s_2 → Actor → a_2 → ……, with rewards r_1, r_2, ……
• Training data: (s_1, a_1) with A_1 = r_1, (s_2, a_2) with A_2 = r_2, (s_3, a_3) with A_3 = r_3, ……, (s_N, a_N) with A_N = r_N.
Version 0 is a short-sighted version!

• In the interaction s_1 → a_1 ("right", r_1 = 0) → s_2 → a_2 ("fire", r_2 = +5) → ……, each action only gets credit for its immediate reward.
• An action affects the subsequent observations and thus subsequent rewards.
• Reward delay: the actor has to sacrifice immediate reward to gain more long-term reward.
• In Space Invaders, only "fire" yields positive reward, so Version 0 will learn an actor that always "fires".
Version 1

• From an episode s_1, a_1, r_1, s_2, a_2, r_2, ……, s_N, a_N, r_N, credit each action with the cumulated reward from that step onward:
  G_1 = r_1 + r_2 + r_3 + …… + r_N
  G_2 = r_2 + r_3 + …… + r_N
  G_3 = r_3 + …… + r_N
  In general, G_t = Σ_{n=t}^N r_n (cumulated reward).
• Training data: (s_1, a_1) with A_1 = G_1, (s_2, a_2) with A_2 = G_2, ……, (s_N, a_N) with A_N = G_N.
Version 2

• In G_1 = r_1 + r_2 + r_3 + …… + r_N, should r_N also be counted as the credit of a_1? Rewards that arrive long after an action should count less, so use a discounted cumulated reward:
  G'_1 = r_1 + γ r_2 + γ² r_3 + ……
  In general, G'_t = Σ_{n=t}^N γ^{n−t} r_n, with discount factor γ < 1.
• Training data: (s_1, a_1) with A_1 = G'_1, (s_2, a_2) with A_2 = G'_2, ……, (s_N, a_N) with A_N = G'_N.
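A small sketch of computing these discounted cumulated rewards G'_t for one episode, using a single backward pass over the reward list. The helper name is mine.

```python
def discounted_returns(rewards, gamma: float = 0.99):
    """G'_t = r_t + γ r_{t+1} + γ² r_{t+2} + ……, computed backwards in O(N)."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G          # G'_t = r_t + γ * G'_{t+1}
        returns.append(G)
    return list(reversed(returns))

# Example: rewards [0, 5, 0] with gamma = 0.9 -> [4.5, 5.0, 0.0]
```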
Version 3

• Good or bad reward is "relative". If all rewards satisfy r_n ≥ 10, then r_n = 10 is actually a poor outcome, yet it still looks positive.
• So subtract a baseline b: A_t = G'_t − b, which makes the weights take both positive and negative values.
• Training data: (s_1, a_1) with A_1 = G'_1 − b, (s_2, a_2) with A_2 = G'_2 − b, ……, (s_N, a_N) with A_N = G'_N − b.
• How should b be chosen??? (See Version 3.5 below; a quick normalization sketch follows.)
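One simple choice (not the slides' final answer, which is the critic introduced later) is to use the batch mean as the baseline, optionally also scaling by the standard deviation. A sketch:

```python
import torch

def subtract_baseline(G: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    """A_t = G'_t - b with b = mean(G'), so the weights take both signs."""
    A = G - G.mean()
    if normalize:
        A = A / (G.std() + 1e-8)   # optional extra: scale to unit variance (common trick, not from the slides)
    return A

# A = subtract_baseline(torch.tensor(discounted_returns(rewards)))
```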
Policy Gradient

• Initialize actor network parameters θ⁰
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact with the environment
  • Obtain data {s_1, a_1}, {s_2, a_2}, ……, {s_N, a_N}
  • Compute A_1, A_2, ……, A_N
  • Compute loss L
  • θ^i ← θ^{i−1} − η∇L
• Data collection is inside the "for loop" of training iterations. (A training-loop sketch follows.)
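Putting the pieces together, a rough sketch of this loop, reusing the hypothetical rollout, discounted_returns, subtract_baseline, and policy_loss helpers sketched above; the hyperparameters are arbitrary assumptions.

```python
import torch

def train_policy_gradient(env, actor, iterations: int = 100,
                          lr: float = 1e-3, gamma: float = 0.99):
    optimizer = torch.optim.Adam(actor.parameters(), lr=lr)
    for i in range(iterations):
        # Data collection happens inside the training loop (with the current actor).
        states, actions, rewards, total = rollout(env, actor)
        G = torch.tensor(discounted_returns(rewards, gamma), dtype=torch.float32)
        A = subtract_baseline(G)                                   # Version 3 weights
        obs = torch.stack([torch.as_tensor(s, dtype=torch.float32).flatten()
                           for s in states])
        acts = torch.tensor(actions, dtype=torch.long)

        loss = policy_loss(actor, obs, acts, A)                    # L = Σ_n A_n e_n
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                           # θ_i ← θ_{i-1} - η∇L
        # After this single update, data must be collected again with the new actor.
```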
Policy Gradient

• With the training data {(s_n, a_n, A_n)} collected by the current actor θ^{i−1}, compute L = Σ_n A_n e_n and update θ^i ← θ^{i−1} − η∇L only once.
• Each time you update the model parameters, you need to collect the whole training set again.
Policy Gradient

• Initialize actor network parameters θ⁰
• For training iteration i = 1 to T:
  • Use actor θ^{i−1} to interact: the data obtained is the experience of θ^{i−1}
  • Obtain data {s_1, a_1}, ……, {s_N, a_N}; compute A_1, ……, A_N; compute loss L
  • θ^i ← θ^{i−1} − η∇L
• The experience of θ^{i−1} may not be good for θ^i.
Policy Gradient

• The trajectory s_1, a_1, r_1, ……, s_N, a_N, r_N collected by θ^{i−1} may not even be observed by θ^i, since a different actor visits different states.
• This is why, after the update θ^i ← θ^{i−1} − η∇L, fresh data must be collected with the new actor θ^i before the next update.
Collecting Training Data: Exploration

• The actor needs to have randomness during data collection. Suppose your actor always takes "left": we would never know what would happen if it took "fire".
• This is a major reason why we sample actions. ☺
• Common tricks: enlarge the output entropy, or add noise onto the parameters. (A small entropy-bonus sketch follows.)

DeepMind - PPO: https://youtu.be/gn4nRCC9TwQ
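As one illustration of "enlarging the output entropy", an entropy bonus can be added to the loss so the action distribution does not collapse too early. This variant of the earlier hypothetical policy_loss is a sketch; the coefficient 0.01 is an arbitrary assumption.

```python
import torch

def policy_loss_with_entropy(actor, states, actions, A, entropy_coef: float = 0.01):
    """Weighted cross-entropy loss minus an entropy bonus to encourage exploration."""
    probs = actor(states)
    dist = torch.distributions.Categorical(probs)
    e = -dist.log_prob(actions)              # cross-entropy of the taken actions
    entropy = dist.entropy().mean()          # higher entropy = more random actions
    return (A * e).sum() - entropy_coef * entropy
```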
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic
Critic

• Critic: given an actor θ, how good is it when observing s (and possibly taking action a)?
• Value function V^θ(s): when using actor θ, the discounted cumulated reward G'_t = r_t + γ r_{t+1} + γ² r_{t+2} + …… expected to be obtained after seeing s.
• V^θ(s) is the scalar output of a network that takes s as input. For example, V^θ(s) is large for a favorable game screen and smaller for an unfavorable one.
• The output values of a critic depend on the actor being evaluated.
How to estimate V^θ(s)

• Monte-Carlo (MC) based approach: the critic watches actor θ interact with the environment.
• After seeing s_a, the cumulated reward until the end of the episode is G'_a, so train the critic so that V^θ(s_a) ≈ G'_a.
• After seeing s_b, the cumulated reward until the end of the episode is G'_b, so train V^θ(s_b) ≈ G'_b.
How to estimate V^θ(s)

• Temporal-difference (TD) approach: only a single transition ⋯ s_t, a_t, r_t, s_{t+1} ⋯ is needed (ignoring the expectation here).
  V^θ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ……
  V^θ(s_{t+1}) = r_{t+1} + γ r_{t+2} + ⋯
  ⟹ V^θ(s_t) = γ V^θ(s_{t+1}) + r_t
• Train the critic so that V^θ(s_t) − γ V^θ(s_{t+1}) is close to r_t.
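A sketch of a value network with both training targets, assuming the hypothetical network sizes and episode data from the earlier sketches. The MC branch regresses V^θ(s_t) toward the observed G'_t, while the TD branch pushes V^θ(s_t) − γ V^θ(s_{t+1}) toward r_t.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic: maps an observation s to a scalar V(s)."""
    def __init__(self, obs_dim: int = 84 * 84):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def mc_loss(critic, states, G):
    """Monte-Carlo target: V(s_t) should match the observed cumulated reward G'_t."""
    return ((critic(states) - G) ** 2).mean()

def td_loss(critic, s_t, r_t, s_next, gamma: float = 0.99):
    """TD target: V(s_t) - γ V(s_{t+1}) should match r_t."""
    target = r_t + gamma * critic(s_next).detach()   # stop gradient through the bootstrap
    return ((critic(s_t) - target) ** 2).mean()
```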
MC v.s. TD

• The critic has observed the following 8 episodes (assume γ = 1, and the actions are ignored here):
  • s_a, r = 0, s_b, r = 0, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 0, END
• V^θ(s_b) = 6/8 = 3/4 (s_b appears 8 times, with total reward 6).
• V^θ(s_a) = ? 0? 3/4?
  • Monte-Carlo: V^θ(s_a) = 0 (in the only episode containing s_a, the cumulated reward after s_a is 0).
  • Temporal-difference: V^θ(s_a) = V^θ(s_b) + r = 3/4 + 0 = 3/4.
Version 3.5

• Recall Version 3: from the episode s_1, a_1, r_1, ……, s_N, a_N, r_N, the training data is (s_t, a_t) with A_t = G'_t − b.
• Use the critic as the baseline: feed s to the value network V^θ to obtain V^θ(s).
Version 3.5

• Replace the baseline b with the value of each state: A_t = G'_t − V^θ(s_t).
• Training data: (s_1, a_1) with A_1 = G'_1 − V^θ(s_1), (s_2, a_2) with A_2 = G'_2 − V^θ(s_2), ……, (s_N, a_N) with A_N = G'_N − V^θ(s_N).
Version 3.5: A_t = G'_t − V^θ(s_t)

• V^θ(s_t) is the expected (discounted) cumulated reward after seeing s_t, without committing to a particular action: since actions are sampled from a distribution, it averages over many possible outcomes (e.g., G = 100, 3, 1, 2, −10).
• G'_t is just one sample: the cumulated reward actually obtained after taking a_t at s_t.
• A_t > 0: a_t is better than average. A_t < 0: a_t is worse than average.
Version 4: Advantage Actor-Critic

• Version 3.5 compares a single sample G'_t against the average V^θ(s_t). Replace the single sample too: after taking a_t at s_t, we obtain r_t and reach s_{t+1}, whose expected cumulated reward is V^θ(s_{t+1}) (again an average over many possible futures, e.g., G = 101, 4, 3, 1, −5).
• So the weight becomes A_t = r_t + V^θ(s_{t+1}) − V^θ(s_t), instead of A_t = G'_t − V^θ(s_t).
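A minimal sketch of computing these Version 4 advantages for one collected episode, using the hypothetical ValueNetwork above. The slide writes the weight without a discount factor, so gamma = 1.0 matches it exactly; including γ is an assumption consistent with the TD formula earlier.

```python
import torch

def advantages_v4(critic, states, rewards, gamma: float = 1.0):
    """A_t = r_t + γ V(s_{t+1}) - V(s_t); the value after the terminal state is 0."""
    with torch.no_grad():
        V = critic(states)                               # (N,) state values
    V_next = torch.cat([V[1:], torch.zeros(1)])          # V(s_{N+1}) = 0 at episode end
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return r + gamma * V_next - V
```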
Tip of Actor-Critic

• The parameters of the actor and the critic can be shared: the observation s first passes through a shared network, then one head outputs the action scores (left / right / fire) for the actor, and another head outputs a scalar for the critic. (A sketch of such a shared network follows.)
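A sketch of such a shared-parameter actor-critic network in PyTorch; the layer sizes and input dimension are assumptions carried over from the earlier sketches.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared backbone with two heads: action scores (actor) and a scalar value (critic)."""
    def __init__(self, obs_dim: int = 84 * 84, n_actions: int = 3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())  # shared parameters
        self.actor_head = nn.Linear(128, n_actions)   # left / right / fire scores
        self.critic_head = nn.Linear(128, 1)          # scalar V(s)

    def forward(self, obs: torch.Tensor):
        h = self.shared(obs)
        probs = torch.softmax(self.actor_head(h), dim=-1)
        value = self.critic_head(h).squeeze(-1)
        return probs, value
```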
Outline

What is RL? (Three steps in ML)

Policy Gradient

Actor-Critic
