Policy Gradient Methods

1) Policy gradient methods aim to maximize the expected reward of a policy by estimating the gradient of that expected reward with respect to the policy parameters. 2) The policy gradient can be estimated with the score function gradient estimator, which samples trajectories and weights the gradient of each trajectory's log-probability by its reward, pushing the parameters toward actions that led to high reward. 3) For episodic tasks, the policy gradient decomposes into terms corresponding to each timestep, allowing the estimator to exploit temporal structure by crediting each action only with the rewards that follow it.


February 13, 2017


Policy Optimization Problems

maximize_π  E_π[expression]

where the expression being maximized is one of:

- Fixed-horizon episodic: ∑_{t=0}^{T−1} r_t
- Average-cost: lim_{T→∞} (1/T) ∑_{t=0}^{T−1} r_t
- Infinite-horizon discounted: ∑_{t=0}^{∞} γ^t r_t
- Variable-length undiscounted: ∑_{t=0}^{T_terminal − 1} r_t
- Infinite-horizon undiscounted: ∑_{t=0}^{∞} r_t
Episodic Setting

s_0 ∼ μ(s_0)
a_0 ∼ π(a_0 | s_0)
s_1, r_0 ∼ P(s_1, r_0 | s_0, a_0)
a_1 ∼ π(a_1 | s_1)
s_2, r_1 ∼ P(s_2, r_1 | s_1, a_1)
. . .
a_{T−1} ∼ π(a_{T−1} | s_{T−1})
s_T, r_{T−1} ∼ P(s_T, r_{T−1} | s_{T−1}, a_{T−1})

Objective:
maximize η(π), where
η(π) = E[r_0 + r_1 + · · · + r_{T−1} | π]
Episodic Setting

[Figure: agent-environment loop. The agent samples actions a_0, a_1, . . . , a_{T−1} from π; the environment, with initial state distribution μ_0 and transition distribution P, produces states s_0, s_1, . . . , s_T and rewards r_0, r_1, . . . , r_{T−1}. The objective maximize η(π) = E[r_0 + r_1 + · · · + r_{T−1} | π] is as on the previous slide.]
Parameterized Policies

- A family of policies indexed by parameter vector θ ∈ R^d
- Deterministic: a = π(s, θ)
- Stochastic: π(a | s, θ)
- Analogous to classification or regression with input s, output a:
  - Discrete action space: network outputs a vector of probabilities
  - Continuous action space: network outputs the mean and diagonal covariance of a Gaussian
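
As a concrete illustration (a sketch of my own, assuming PyTorch; the dimensions, hidden sizes, and class names are arbitrary), the two stochastic parameterizations might look like this:

# Hypothetical sketch of parameterized stochastic policies (assumes PyTorch).
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Discrete action space: the network outputs a vector of action probabilities."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        logits = self.net(s)  # unnormalized log-probabilities over actions
        return torch.distributions.Categorical(logits=logits)

class GaussianPolicy(nn.Module):
    """Continuous action space: the network outputs the mean; a free log-std
    parameter gives the diagonal covariance."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, s):
        return torch.distributions.Normal(self.mean_net(s), self.log_std.exp())

# Sampling an action and its log-probability (needed for the estimators below):
pi = DiscretePolicy(obs_dim=4, n_actions=2)
dist = pi(torch.randn(4))
a = dist.sample()
logp = dist.log_prob(a)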
Policy Gradient Methods: Overview

Problem:

maximize E[R | π_θ]

Intuitions: collect a bunch of trajectories, and ...

1. Make the good trajectories more probable [1]
2. Make the good actions more probable
3. Push the actions towards good actions (DPG [2], SVG [3])

[1] R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000.
[2] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. "Deterministic Policy Gradient Algorithms". ICML. 2014.
[3] N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015).
Score Function Gradient Estimator
- Consider an expectation E_{x∼p(x|θ)}[f(x)]. Want to compute the gradient w.r.t. θ:

    ∇_θ E_x[f(x)] = ∇_θ ∫ dx p(x|θ) f(x)
                  = ∫ dx ∇_θ p(x|θ) f(x)
                  = ∫ dx p(x|θ) (∇_θ p(x|θ) / p(x|θ)) f(x)
                  = ∫ dx p(x|θ) ∇_θ log p(x|θ) f(x)
                  = E_x[f(x) ∇_θ log p(x|θ)]

- The last expression gives us an unbiased gradient estimator. Just sample x_i ∼ p(x|θ), and compute ĝ_i = f(x_i) ∇_θ log p(x_i|θ).
- Need to be able to compute and differentiate the density p(x|θ) w.r.t. θ
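
As a quick sanity check (a sketch of my own, not from the slides), here the estimator is applied to x ∼ N(θ, 1) with f(x) = x², where the true gradient ∇_θ E[f(x)] = 2θ is known in closed form:

# Score function (likelihood-ratio) gradient estimator, checked on a case with a
# known answer: x ~ N(theta, 1), f(x) = x^2, so grad_theta E[f(x)] = 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
n = 200_000

x = rng.normal(loc=theta, scale=1.0, size=n)   # x_i ~ p(x | theta)
f = x ** 2                                     # f(x_i)
score = x - theta                              # grad_theta log N(x | theta, 1)

g_hat = np.mean(f * score)                     # Monte Carlo estimate of the gradient
print(g_hat, 2 * theta)                        # estimate vs. the analytic value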
Derivation via Importance Sampling
An alternative derivation uses importance sampling [4]:

    E_{x∼θ}[f(x)] = E_{x∼θ_old}[ (p(x|θ) / p(x|θ_old)) f(x) ]

    ∇_θ E_{x∼θ}[f(x)] = E_{x∼θ_old}[ (∇_θ p(x|θ) / p(x|θ_old)) f(x) ]

    ∇_θ E_{x∼θ}[f(x)] |_{θ=θ_old} = E_{x∼θ_old}[ (∇_θ p(x|θ)|_{θ=θ_old} / p(x|θ_old)) f(x) ]
                                  = E_{x∼θ_old}[ ∇_θ log p(x|θ)|_{θ=θ_old} f(x) ]

[4] T. Jie and P. Abbeel. "On a connection between importance sampling and the likelihood ratio policy gradient". Advances in Neural Information Processing Systems. 2010, pp. 1000-1008.
Score Function Gradient Estimator: Intuition

ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

- Let's say that f(x) measures how good the sample x is.
- Moving in the direction ĝ_i pushes up the log-probability of the sample, in proportion to how good it is.
- Valid even if f(x) is discontinuous or unknown, or the sample space (containing x) is a discrete set.
Score Function Gradient Estimator for Policies
- Now the random variable x is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_{T−1}, a_{T−1}, r_{T−1}, s_T)

    ∇_θ E_τ[R(τ)] = E_τ[∇_θ log p(τ | θ) R(τ)]

- Just need to write out p(τ | θ):

    p(τ | θ) = μ(s_0) ∏_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]

    log p(τ | θ) = log μ(s_0) + ∑_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]

    ∇_θ log p(τ | θ) = ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)

    ∇_θ E_τ[R] = E_τ[ R ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ]

- Interpretation: use good trajectories (high R) as supervised examples in classification / regression.
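
To make the trajectory estimator concrete, here is a hedged NumPy sketch; the two-state toy MDP, its reward rule, and the tabular softmax policy are illustrative assumptions, not something defined in the slides:

# Whole-trajectory estimator g_hat = R(tau) * sum_t grad_theta log pi(a_t | s_t, theta)
# for one sampled trajectory, using a tabular softmax policy on a toy two-state MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 10
theta = np.zeros((n_states, n_actions))        # policy parameters, one row per state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # grad_theta log pi(a | s, theta): (indicator - pi) on row s, zero elsewhere
    g = np.zeros_like(theta)
    g[s] = -softmax(theta[s])
    g[s, a] += 1.0
    return g

s, R, sum_scores = 0, 0.0, np.zeros_like(theta)
for t in range(T):
    a = rng.choice(n_actions, p=softmax(theta[s]))
    r = 1.0 if (s == 0 and a == 0) else 0.0    # assumed reward rule
    sum_scores += grad_log_pi(theta, s, a)
    R += r
    s = rng.integers(n_states)                 # assumed random transitions

g_hat = R * sum_scores                         # one sample of the gradient estimator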


Policy Gradient: Use Temporal Structure
- Previous slide:

    ∇_θ E_τ[R] = E_τ[ (∑_{t=0}^{T−1} r_t) (∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)) ]

- We can repeat the same argument to derive the gradient estimator for a single reward term r_{t′}:

    ∇_θ E[r_{t′}] = E[ r_{t′} ∑_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ) ]

- Summing this formula over t′, we obtain

    ∇_θ E[R] = E[ ∑_{t′=0}^{T−1} r_{t′} ∑_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ) ]
             = E[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ∑_{t′=t}^{T−1} r_{t′} ]
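
A small helper (a sketch of my own) for the reward-to-go sums ∑_{t′=t}^{T−1} r_{t′} that weight each ∇_θ log π(a_t | s_t, θ) term above:

# Reward-to-go: each action's score is weighted only by rewards that come after it.
import numpy as np

def rewards_to_go(rewards):
    # t-th entry is sum_{t' >= t} r_{t'}
    return np.cumsum(rewards[::-1])[::-1]

print(rewards_to_go(np.array([1.0, 0.0, 2.0])))   # -> [3. 2. 2.]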
Policy Gradient: Introduce Baseline

- Further reduce variance by introducing a baseline b(s):

    ∇_θ E_τ[R] = E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (∑_{t′=t}^{T−1} r_{t′} − b(s_t)) ]

- For any choice of b, the gradient estimator is unbiased.
- A near-optimal choice is the expected return,
    b(s_t) ≈ E[r_t + r_{t+1} + r_{t+2} + · · · + r_{T−1}]
- Interpretation: increase the log-probability of action a_t proportionally to how much the return ∑_{t′=t}^{T−1} r_{t′} is better than expected.
Baseline—Derivation

E_τ[∇_θ log π(a_t | s_t, θ) b(s_t)]
  = E_{s_{0:t}, a_{0:(t−1)}}[ E_{s_{(t+1):T}, a_{t:(T−1)}}[∇_θ log π(a_t | s_t, θ) b(s_t)] ]   (break up expectation)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{s_{(t+1):T}, a_{t:(T−1)}}[∇_θ log π(a_t | s_t, θ)] ]   (pull baseline term out)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{a_t}[∇_θ log π(a_t | s_t, θ)] ]                        (remove irrelevant variables)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) · 0 ] = 0

The last equality holds because 0 = ∇_θ E_{a_t∼π(·|s_t)}[1] = E_{a_t∼π(·|s_t)}[∇_θ log π_θ(a_t | s_t)].
Discounts for Variance Reduction

- Introduce a discount factor γ, which ignores delayed effects between actions and rewards:

    ∇_θ E_τ[R] ≈ E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (∑_{t′=t}^{T−1} γ^{t′−t} r_{t′} − b(s_t)) ]

- Now we want b(s_t) ≈ E[r_t + γ r_{t+1} + γ² r_{t+2} + · · · + γ^{T−1−t} r_{T−1}]
“Vanilla” Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, . . . do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R_t = ∑_{t′=t}^{T−1} γ^{t′−t} r_{t′}, and
        the advantage estimate Â_t = R_t − b(s_t).
    Re-fit the baseline by minimizing ‖b(s_t) − R_t‖², summed over all trajectories and timesteps.
    Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t.
    (Plug ĝ into SGD or Adam)
end for
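
A hedged NumPy sketch of the per-timestep quantities inside this loop; the reward sequence and baseline values below are made up for illustration:

# Discounted return R_t and advantage estimate A_hat_t = R_t - b(s_t) for one trajectory.
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # R_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed backwards in one pass
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

rewards = np.array([0.0, 0.0, 1.0])            # assumed rewards from one trajectory
R = discounted_returns(rewards)
baseline = np.array([0.3, 0.4, 0.6])           # b(s_t) from the previous fit (assumed)
advantages = R - baseline                      # A_hat_t
# Re-fitting the baseline then means regressing b(s_t) onto R_t, e.g. by least squares,
# over all trajectories and timesteps.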
Practical Implementation with Autodiff
- The usual formula ∑_t ∇_θ log π(a_t | s_t; θ) Â_t is inefficient: we want to batch the data
- Define a "surrogate" function using the data from the current batch:

    L(θ) = ∑_t log π(a_t | s_t; θ) Â_t

- Then the policy gradient estimator is ĝ = ∇_θ L(θ)
- Can also include the value function fit error:

    L(θ) = ∑_t ( log π(a_t | s_t; θ) Â_t − ‖V(s_t) − R̂_t‖² )
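
A hedged PyTorch-style sketch of this surrogate; the batch tensors (logits and values from assumed policy/value networks, plus actions, advantages, returns) are taken as given, and Â_t, R̂_t are treated as constants:

# Surrogate L(theta): differentiating it with autodiff gives the policy gradient estimator.
import torch

def surrogate_loss(logits, values, actions, advantages, returns, c=0.5):
    # log pi(a_t | s_t; theta) for every timestep in the batch (discrete actions assumed)
    logp = torch.distributions.Categorical(logits=logits).log_prob(actions)
    policy_term = (logp * advantages).sum()              # sum_t log pi(a_t|s_t) * A_hat_t
    value_error = ((values.squeeze(-1) - returns) ** 2).sum()
    # We want to maximize L(theta), so hand the optimizer its negative.
    return -(policy_term - c * value_error)

# Usage (shapes assumed): loss = surrogate_loss(pi_net(obs), v_net(obs), actions, adv, ret)
# loss.backward()   # the gradient of the policy term is g_hat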
Value Functions

    Q^{π,γ}(s, a) = E_π[r_0 + γ r_1 + γ² r_2 + · · · | s_0 = s, a_0 = a]

Called the Q-function or state-action value function.

    V^{π,γ}(s) = E_π[r_0 + γ r_1 + γ² r_2 + · · · | s_0 = s]
               = E_{a∼π}[Q^{π,γ}(s, a)]

Called the state-value function.

    A^{π,γ}(s, a) = Q^{π,γ}(s, a) − V^{π,γ}(s)

Called the advantage function.
Policy Gradient Formulas with Value Functions
- Recall:

    ∇_θ E_τ[R] = E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (∑_{t′=t}^{T−1} r_{t′} − b(s_t)) ]
               ≈ E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (∑_{t′=t}^{T−1} γ^{t′−t} r_{t′} − b(s_t)) ]

- Using value functions:

    ∇_θ E_τ[R] = E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Q^π(s_t, a_t) ]
               = E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) A^π(s_t, a_t) ]
               ≈ E_τ[ ∑_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) A^{π,γ}(s_t, a_t) ]

- Can plug in an "advantage estimator" Â for A^{π,γ}
- Advantage estimators have the form Return − V(s)
Value Functions in the Future

- The baseline accounts for and removes the effect of past actions
- Can also use the value function to estimate future rewards:

    R̂_t^{(1)} = r_t + γ V(s_{t+1})                      (cut off at one timestep)
    R̂_t^{(2)} = r_t + γ r_{t+1} + γ² V(s_{t+2})         (cut off at two timesteps)
    . . .
    R̂_t^{(∞)} = r_t + γ r_{t+1} + γ² r_{t+2} + · · ·    (∞ timesteps, no V)
Value Functions in the Future

- Subtracting out the baseline, we get advantage estimators:

    Â_t^{(1)} = r_t + γ V(s_{t+1}) − V(s_t)
    Â_t^{(2)} = r_t + γ r_{t+1} + γ² V(s_{t+2}) − V(s_t)
    . . .
    Â_t^{(∞)} = r_t + γ r_{t+1} + γ² r_{t+2} + · · · − V(s_t)

- Â_t^{(1)} has low variance but high bias; Â_t^{(∞)} has high variance but low bias.
- Using an intermediate k (say, 20) gives an intermediate amount of bias and variance
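
A hedged sketch of the k-step estimator; the array layout is an assumption (values has one extra entry, so values[t + k] = V(s_{t+k}), with the last entry 0 for a terminal state):

# k-step advantage estimate A_hat_t^(k), truncated at the end of the trajectory.
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    # A_hat_t^(k) = r_t + gamma r_{t+1} + ... + gamma^(k-1) r_{t+k-1}
    #               + gamma^k V(s_{t+k}) - V(s_t)
    k = min(k, len(rewards) - t)
    ret = sum(gamma ** i * rewards[t + i] for i in range(k))
    ret += gamma ** k * values[t + k]
    return ret - values[t]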
Discounts: Connection to MPC

- MPC:

    maximize_a Q^{*,T}(s, a) ≈ maximize_a Q^{*,γ}(s, a)

- Discounted policy gradient:

    E_{a∼π}[Q^{π,γ}(s, a) ∇_θ log π(a | s; θ)] = 0 when a ∈ arg max_a Q^{π,γ}(s, a)
Application: Robot Locomotion
Finite-Horizon Methods: Advantage Actor-Critic
- A2C / A3C uses this fixed-horizon advantage estimator. (NOTE: "async" is only for speed; it doesn't improve performance.)
- Pseudocode:

for iteration = 1, 2, . . . do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        R̂_t = r_t + γ r_{t+1} + · · · + γ^{T−1−t} r_{T−1} + γ^{T−t} V(s_T)
        Â_t = R̂_t − V(s_t)
    R̂_t is the target value in a regression problem; Â_t is the estimated advantage
    Compute the loss gradient g = ∇_θ ∑_t [ −log π_θ(a_t | s_t) Â_t + c (V(s_t) − R̂_t)² ]
    g is plugged into a stochastic gradient descent variant, e.g., Adam.
end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. “Asynchronous methods for deep reinforcement learning”. (2016)
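
A hedged sketch of the T-step bootstrapped targets used above (the reward array and the scalar V(s_T) from the critic are assumed inputs):

# R_hat_t = r_t + gamma r_{t+1} + ... + gamma^(T-1-t) r_{T-1} + gamma^(T-t) V(s_T),
# computed for every t in a T-step segment by one backward pass.
import numpy as np

def a2c_targets(rewards, value_last, gamma=0.99):
    T = len(rewards)
    R_hat = np.zeros(T)
    running = value_last                       # bootstrap from V(s_T)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        R_hat[t] = running
    return R_hat

# A_hat_t = a2c_targets(rewards, V_sT) - values, where values[t] = V(s_t) from the critic.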
A3C Video
A3C Results

16000 Beamrider 600 Breakout 30 Pong 12000 Q*bert 1600 Space Invaders
DQN DQN DQN DQN
14000 1-step Q 500 1-step Q 20 10000 1-step Q 1400 1-step Q
12000 1-step SARSA 1-step SARSA 1-step SARSA 1200 1-step SARSA
n-step Q n-step Q n-step Q n-step Q
10000 A3C 400 A3C 10 8000 A3C 1000 A3C
Score

Score

Score

Score

Score
8000 300 0 6000 800
6000 200 10 DQN 4000 600
4000 1-step Q 400
100 20 1-step SARSA 2000
2000 n-step Q 200
A3C
0 0 30 0 0
0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14
Training time (hours) Training time (hours) Training time (hours) Training time (hours) Training time (hours)
Further Reading

- A nice intuitive explanation of policy gradients: http://karpathy.github.io/2016/05/31/rl/
- R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000
- My thesis has a decent self-contained introduction to policy gradient methods: http://joschu.net/docs/thesis.pdf
- A3C paper: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. "Asynchronous methods for deep reinforcement learning". (2016)
