Policy Gradient Methods
Problem: maximize $\mathbb{E}_\pi[\text{expression}]$, where the expression is one of several notions of total reward (a short numerical sketch follows this list):
▶ Fixed-horizon episodic: $\sum_{t=0}^{T-1} r_t$
▶ Average-cost: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} r_t$
▶ Infinite-horizon discounted: $\sum_{t=0}^{\infty} \gamma^t r_t$
▶ Variable-length undiscounted: $\sum_{t=0}^{T_{\text{terminal}} - 1} r_t$
▶ Infinite-horizon undiscounted: $\sum_{t=0}^{\infty} r_t$
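To make these formulas concrete, here is a minimal numpy sketch (my own illustration, not part of the slides) that evaluates the episodic, discounted, and average-reward forms for a hypothetical reward sequence:

```python
import numpy as np

# Hypothetical reward sequence r_0, ..., r_{T-1} from one episode.
rewards = np.array([1.0, 0.0, 2.0, -1.0, 3.0])
gamma = 0.99
T = len(rewards)

episodic_return = rewards.sum()                               # sum_{t=0}^{T-1} r_t
discounted_return = np.sum(gamma ** np.arange(T) * rewards)   # sum_t gamma^t r_t
average_reward = rewards.mean()                               # (1/T) sum_t r_t (T -> infinity in the limit)
```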
Episodic Setting
$s_0 \sim \mu(s_0)$
$a_0 \sim \pi(a_0 \mid s_0)$
$s_1, r_0 \sim P(s_1, r_0 \mid s_0, a_0)$
$a_1 \sim \pi(a_1 \mid s_1)$
$s_2, r_1 \sim P(s_2, r_1 \mid s_1, a_1)$
$\dots$
$a_{T-1} \sim \pi(a_{T-1} \mid s_{T-1})$
$s_T, r_{T-1} \sim P(s_T, r_{T-1} \mid s_{T-1}, a_{T-1})$
Objective: maximize $\eta(\pi)$, where
$\eta(\pi) = \mathbb{E}[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$
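As an illustration of the episodic objective (my own example, not from the slides), the sketch below rolls out a policy in a hypothetical two-state MDP and estimates $\eta(\pi)$ by averaging episode returns; the policy parameterization, reward, and transition rule are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10  # horizon

def sample_episode(theta):
    """Roll out one episode in a toy 2-state, 2-action MDP and return its total reward."""
    s = rng.integers(2)                         # s_0 ~ mu (uniform over the two states)
    total_reward = 0.0
    for t in range(T):
        # pi(a | s, theta): Bernoulli policy with one logit per state (hypothetical parameterization)
        p_a1 = 1.0 / (1.0 + np.exp(-theta[s]))
        a = int(rng.random() < p_a1)
        r = 1.0 if a == s else 0.0              # hypothetical reward: action matches state
        s = rng.integers(2)                     # hypothetical dynamics: uniform next state
        total_reward += r
    return total_reward

theta = np.zeros(2)
eta_estimate = np.mean([sample_episode(theta) for _ in range(1000)])  # Monte Carlo estimate of eta(pi)
```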
Episodic Setting
[Figure: agent-environment interaction loop. The agent samples actions $a_0, a_1, \dots, a_{T-1}$ from $\pi$; the environment, with initial-state distribution $\mu_0$ and dynamics $P$, returns states $s_0, s_1, \dots, s_T$ and rewards $r_0, r_1, \dots, r_{T-1}$.]
Objective: maximize $\eta(\pi) = \mathbb{E}[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$.
Problem: maximize $\mathbb{E}[R \mid \pi_\theta]$, where $\pi_\theta$ is a policy parameterized by $\theta$.
[1] R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000.
[2] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. "Deterministic Policy Gradient Algorithms". ICML. 2014.
[3] N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015).
Score Function Gradient Estimator
▶ Consider an expectation $\mathbb{E}_{x \sim p(x \mid \theta)}[f(x)]$. We want to compute its gradient with respect to $\theta$:
\begin{align*}
\nabla_\theta \mathbb{E}_x[f(x)] &= \nabla_\theta \int dx\, p(x \mid \theta)\, f(x) \\
&= \int dx\, \nabla_\theta p(x \mid \theta)\, f(x) \\
&= \int dx\, p(x \mid \theta)\, \frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)}\, f(x) \\
&= \int dx\, p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)\, f(x) \\
&= \mathbb{E}_x\big[f(x)\, \nabla_\theta \log p(x \mid \theta)\big]
\end{align*}
▶ Sampling $x_i \sim p(x \mid \theta)$ therefore gives the unbiased estimator $\hat{g}_i = f(x_i)\, \nabla_\theta \log p(x_i \mid \theta)$ (a numerical example appears below).
[4] T. Jie and P. Abbeel. "On a connection between importance sampling and the likelihood ratio policy gradient". Advances in Neural Information Processing Systems. 2010, pp. 1000–1008.
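As a quick numerical sanity check (my own example): for $x \sim \mathcal{N}(\theta, 1)$ we have $\nabla_\theta \log p(x \mid \theta) = x - \theta$, so the estimator reduces to averaging $f(x_i)(x_i - \theta)$. The test function $f(x) = x^2$, whose true gradient is $2\theta$, is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_function_gradient(theta, f, n_samples=100_000):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)]
    using g_i = f(x_i) * grad_theta log p(x_i | theta) = f(x_i) * (x_i - theta)."""
    x = rng.normal(theta, 1.0, size=n_samples)
    return np.mean(f(x) * (x - theta))

theta = 1.5
estimate = score_function_gradient(theta, lambda x: x ** 2)
# True gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2*theta = 3.0; the estimate should be close.
```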
Score Function Gradient Estimator: Intuition
▶ Say that $f(x)$ measures how good the sample $x$ is.
▶ Moving $\theta$ in the direction $\hat{g}_i$ pushes up the log-probability of the sample, in proportion to how good it is.
▶ Valid even if $f(x)$ is discontinuous or unknown, or the sample space (containing $x$) is a discrete set.
Score Function Gradient Estimator for Policies
▶ Now the random variable $x$ is a whole trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$ with total reward $R = \sum_{t=0}^{T-1} r_t$:
\begin{align*}
p(\tau \mid \theta) &= \mu(s_0) \prod_{t=0}^{T-1} \big[\pi(a_t \mid s_t, \theta)\, P(s_{t+1}, r_t \mid s_t, a_t)\big] \\
\log p(\tau \mid \theta) &= \log \mu(s_0) + \sum_{t=0}^{T-1} \big[\log \pi(a_t \mid s_t, \theta) + \log P(s_{t+1}, r_t \mid s_t, a_t)\big] \\
\nabla_\theta \log p(\tau \mid \theta) &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \\
\nabla_\theta \mathbb{E}_\tau[R] &= \mathbb{E}_\tau\Big[R \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\Big]
\end{align*}
(The initial-state and dynamics terms do not depend on $\theta$, so they drop out of the gradient.)
▶ We can repeat the same argument to derive the gradient estimator for a single reward term $r_{t'}$:
$$\nabla_\theta \mathbb{E}[r_{t'}] = \mathbb{E}\Big[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\Big]$$
The sum stops at $t = t'$ because $0 = \nabla_\theta \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}[1] = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]$, so the score terms for $t > t'$ vanish in expectation.
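A minimal sketch of the resulting trajectory estimator, assuming a hypothetical tabular softmax policy with parameters $\theta[s, a]$ (the states, actions, and rewards below are made-up example data):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def trajectory_gradient(theta, states, actions, rewards):
    """Policy gradient estimate R * sum_t grad_theta log pi(a_t | s_t, theta)
    for a tabular softmax policy theta[s, a] (hypothetical parameterization)."""
    R = np.sum(rewards)
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        probs = softmax(theta[s])
        # grad_theta[s, :] of log softmax is one_hot(a) - probs
        grad[s, a] += 1.0
        grad[s] -= probs
    return R * grad

# Usage on a hypothetical trajectory in an MDP with 3 states and 2 actions:
theta = np.zeros((3, 2))
g_hat = trajectory_gradient(theta, states=[0, 2, 1], actions=[1, 0, 1], rewards=[1.0, 0.0, 2.0])
```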
Discounts for Variance Reduction
▶ Here $\hat{A}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)$ denotes the $k$-step advantage estimator.
▶ $\hat{A}_t^{(1)}$ has low variance but high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias.
▶ Using an intermediate $k$ (say, 20) gives an intermediate amount of bias and variance, as in the sketch below.
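A small sketch of the $k$-step advantage estimator above, assuming hypothetical rollout arrays `rewards` and `values` (with `values[i]` standing in for $V(s_i)$):

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k) = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k * V(s_{t+k}) - V(s_t).
    rewards[i] is r_i and values[i] is V(s_i); assumes t + k < len(values)."""
    ret = sum(gamma ** i * rewards[t + i] for i in range(k))
    ret += gamma ** k * values[t + k]
    return ret - values[t]

# Hypothetical rollout data: small k leans on V (more bias), large k sums more rewards (more variance).
rewards = np.random.randn(100)
values = np.random.randn(101)
a1 = k_step_advantage(rewards, values, t=0, k=1, gamma=0.99)
a20 = k_step_advantage(rewards, values, t=0, k=20, gamma=0.99)
```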
Discounts: Connection to MPC
▶ MPC: $\mathbb{E}_{a \sim \pi}\big[Q^{\pi,\gamma}(s, a)\, \nabla_\theta \log \pi(a \mid s; \theta)\big] = 0$ when $a \in \arg\max_a Q^{\pi,\gamma}(s, a)$
Application: Robot Locomotion
Finite-Horizon Methods: Advantage Actor-Critic
▶ A2C / A3C uses this fixed-horizon advantage estimator. (Note: the asynchrony in A3C is only for speed; it does not improve performance.)
▶ Pseudocode (a numpy sketch of the return/advantage computation follows the pseudocode):
for iteration = 1, 2, . . . do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        $\hat{R}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T)$
        $\hat{A}_t = \hat{R}_t - V(s_t)$
    Update the policy using the gradient estimate $\sum_t \hat{A}_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)$, and refit $V$ to the returns $\hat{R}_t$
end for
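A minimal numpy sketch of the return and advantage computation in this pseudocode; the rollout data are hypothetical, and the policy/value updates themselves are omitted.

```python
import numpy as np

def a2c_targets(rewards, values, bootstrap_value, gamma):
    """Compute R_hat_t = r_t + gamma*r_{t+1} + ... + gamma^{T-1-t}*r_{T-1} + gamma^{T-t}*V(s_T)
    and A_hat_t = R_hat_t - V(s_t) for a T-step rollout segment, via a backward recursion."""
    T = len(rewards)
    returns = np.empty(T)
    running = bootstrap_value            # V(s_T)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values        # values[t] = V(s_t), t = 0, ..., T-1
    return returns, advantages

# Hypothetical T = 20 rollout segment:
rewards = np.random.randn(20)
values = np.random.randn(20)
returns, advantages = a2c_targets(rewards, values, bootstrap_value=0.5, gamma=0.99)
```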
[Figure: results on five Atari games (Beamrider, Breakout, Pong, Q*bert, Space Invaders). Each panel plots score against training time (hours) for DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]
Further Reading