Policy Gradient Methods
Problem: maximize $\mathbb{E}_\pi[\text{expression}]$, where the expression is one of several notions of total reward (a short numerical sketch follows this list):
▶ Fixed-horizon episodic: $\sum_{t=0}^{T-1} r_t$
▶ Average-cost: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} r_t$
▶ Infinite-horizon discounted: $\sum_{t=0}^{\infty} \gamma^t r_t$
▶ Variable-length undiscounted: $\sum_{t=0}^{T_{\text{terminal}} - 1} r_t$
▶ Infinite-horizon undiscounted: $\sum_{t=0}^{\infty} r_t$
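To make these formulas concrete, here is a minimal numpy sketch (my own illustration, not part of the slides) that evaluates the episodic, discounted, and average-reward forms for a hypothetical reward sequence:

```python
import numpy as np

# Hypothetical reward sequence r_0, ..., r_{T-1} from one episode.
rewards = np.array([1.0, 0.0, 2.0, -1.0, 3.0])
gamma = 0.99
T = len(rewards)

episodic_return = rewards.sum()                               # sum_{t=0}^{T-1} r_t
discounted_return = np.sum(gamma ** np.arange(T) * rewards)   # sum_t gamma^t r_t
average_reward = rewards.mean()                               # (1/T) sum_t r_t (T -> infinity in the limit)
```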
Episodic Setting
$s_0 \sim \mu(s_0)$
$a_0 \sim \pi(a_0 \mid s_0)$
$s_1, r_0 \sim P(s_1, r_0 \mid s_0, a_0)$
$a_1 \sim \pi(a_1 \mid s_1)$
$s_2, r_1 \sim P(s_2, r_1 \mid s_1, a_1)$
$\dots$
$a_{T-1} \sim \pi(a_{T-1} \mid s_{T-1})$
$s_T, r_{T-1} \sim P(s_T, r_{T-1} \mid s_{T-1}, a_{T-1})$
Objective: maximize $\eta(\pi)$, where
$\eta(\pi) = \mathbb{E}[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$
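As an illustration of the episodic objective (my own example, not from the slides), the sketch below rolls out a policy in a hypothetical two-state MDP and estimates $\eta(\pi)$ by averaging episode returns; the policy parameterization, reward, and transition rule are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10  # horizon

def sample_episode(theta):
    """Roll out one episode in a toy 2-state, 2-action MDP and return its total reward."""
    s = rng.integers(2)                         # s_0 ~ mu (uniform over the two states)
    total_reward = 0.0
    for t in range(T):
        # pi(a | s, theta): Bernoulli policy with one logit per state (hypothetical parameterization)
        p_a1 = 1.0 / (1.0 + np.exp(-theta[s]))
        a = int(rng.random() < p_a1)
        r = 1.0 if a == s else 0.0              # hypothetical reward: action matches state
        s = rng.integers(2)                     # hypothetical dynamics: uniform next state
        total_reward += r
    return total_reward

theta = np.zeros(2)
eta_estimate = np.mean([sample_episode(theta) for _ in range(1000)])  # Monte Carlo estimate of eta(pi)
```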
Episodic Setting
[Figure: agent-environment interaction loop. The agent samples actions $a_0, a_1, \dots, a_{T-1}$ from $\pi$; the environment, with initial-state distribution $\mu_0$ and dynamics $P$, returns states $s_0, s_1, \dots, s_T$ and rewards $r_0, r_1, \dots, r_{T-1}$.]
Objective: maximize $\eta(\pi) = \mathbb{E}[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$.
Problem: maximize $\mathbb{E}[R \mid \pi_\theta]$, where $\pi_\theta$ is a policy parameterized by $\theta$.
[1] R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000.
[2] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. "Deterministic Policy Gradient Algorithms". ICML. 2014.
[3] N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015).
Score Function Gradient Estimator
▶ Consider an expectation $\mathbb{E}_{x \sim p(x \mid \theta)}[f(x)]$. We want to compute its gradient with respect to $\theta$:
\begin{align*}
\nabla_\theta \mathbb{E}_x[f(x)] &= \nabla_\theta \int dx\, p(x \mid \theta)\, f(x) \\
&= \int dx\, \nabla_\theta p(x \mid \theta)\, f(x) \\
&= \int dx\, p(x \mid \theta)\, \frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)}\, f(x) \\
&= \int dx\, p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)\, f(x) \\
&= \mathbb{E}_x\big[f(x)\, \nabla_\theta \log p(x \mid \theta)\big]
\end{align*}
▶ Sampling $x_i \sim p(x \mid \theta)$ therefore gives the unbiased estimator $\hat{g}_i = f(x_i)\, \nabla_\theta \log p(x_i \mid \theta)$ (a numerical example appears below).
[4] T. Jie and P. Abbeel. "On a connection between importance sampling and the likelihood ratio policy gradient". Advances in Neural Information Processing Systems. 2010, pp. 1000–1008.
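As a quick numerical sanity check (my own example): for $x \sim \mathcal{N}(\theta, 1)$ we have $\nabla_\theta \log p(x \mid \theta) = x - \theta$, so the estimator reduces to averaging $f(x_i)(x_i - \theta)$. The test function $f(x) = x^2$, whose true gradient is $2\theta$, is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_function_gradient(theta, f, n_samples=100_000):
    """Monte Carlo estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)]
    using g_i = f(x_i) * grad_theta log p(x_i | theta) = f(x_i) * (x_i - theta)."""
    x = rng.normal(theta, 1.0, size=n_samples)
    return np.mean(f(x) * (x - theta))

theta = 1.5
estimate = score_function_gradient(theta, lambda x: x ** 2)
# True gradient: d/dtheta E[x^2] = d/dtheta (theta^2 + 1) = 2*theta = 3.0; the estimate should be close.
```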
Score Function Gradient Estimator: Intuition
▶ Say that $f(x)$ measures how good the sample $x$ is.
▶ Moving $\theta$ in the direction $\hat{g}_i$ pushes up the log-probability of the sample, in proportion to how good it is.
▶ Valid even if $f(x)$ is discontinuous or unknown, or the sample space (containing $x$) is a discrete set.
Score Function Gradient Estimator for Policies
▶ Now the random variable $x$ is a whole trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T)$ with total reward $R = \sum_{t=0}^{T-1} r_t$:
\begin{align*}
p(\tau \mid \theta) &= \mu(s_0) \prod_{t=0}^{T-1} \big[\pi(a_t \mid s_t, \theta)\, P(s_{t+1}, r_t \mid s_t, a_t)\big] \\
\log p(\tau \mid \theta) &= \log \mu(s_0) + \sum_{t=0}^{T-1} \big[\log \pi(a_t \mid s_t, \theta) + \log P(s_{t+1}, r_t \mid s_t, a_t)\big] \\
\nabla_\theta \log p(\tau \mid \theta) &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \\
\nabla_\theta \mathbb{E}_\tau[R] &= \mathbb{E}_\tau\Big[R \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\Big]
\end{align*}
(The initial-state and dynamics terms do not depend on $\theta$, so they drop out of the gradient.)
▶ We can repeat the same argument to derive the gradient estimator for a single reward term $r_{t'}$:
$$\nabla_\theta \mathbb{E}[r_{t'}] = \mathbb{E}\Big[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\Big]$$
The sum stops at $t = t'$ because $0 = \nabla_\theta \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}[1] = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]$, so the score terms for $t > t'$ vanish in expectation.
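A minimal sketch of the resulting trajectory estimator, assuming a hypothetical tabular softmax policy with parameters $\theta[s, a]$ (the states, actions, and rewards below are made-up example data):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def trajectory_gradient(theta, states, actions, rewards):
    """Policy gradient estimate R * sum_t grad_theta log pi(a_t | s_t, theta)
    for a tabular softmax policy theta[s, a] (hypothetical parameterization)."""
    R = np.sum(rewards)
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        probs = softmax(theta[s])
        # grad_theta[s, :] of log softmax is one_hot(a) - probs
        grad[s, a] += 1.0
        grad[s] -= probs
    return R * grad

# Usage on a hypothetical trajectory in an MDP with 3 states and 2 actions:
theta = np.zeros((3, 2))
g_hat = trajectory_gradient(theta, states=[0, 2, 1], actions=[1, 0, 1], rewards=[1.0, 0.0, 2.0])
```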
Discounts for Variance Reduction
▶ Here $\hat{A}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)$ denotes the $k$-step advantage estimator.
▶ $\hat{A}_t^{(1)}$ has low variance but high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias.
▶ Using an intermediate $k$ (say, 20) gives an intermediate amount of bias and variance, as in the sketch below.
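A small sketch of the $k$-step advantage estimator above, assuming hypothetical rollout arrays `rewards` and `values` (with `values[i]` standing in for $V(s_i)$):

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma):
    """A_t^(k) = r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k * V(s_{t+k}) - V(s_t).
    rewards[i] is r_i and values[i] is V(s_i); assumes t + k < len(values)."""
    ret = sum(gamma ** i * rewards[t + i] for i in range(k))
    ret += gamma ** k * values[t + k]
    return ret - values[t]

# Hypothetical rollout data: small k leans on V (more bias), large k sums more rewards (more variance).
rewards = np.random.randn(100)
values = np.random.randn(101)
a1 = k_step_advantage(rewards, values, t=0, k=1, gamma=0.99)
a20 = k_step_advantage(rewards, values, t=0, k=20, gamma=0.99)
```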
Discounts: Connection to MPC
▶ MPC: $\mathbb{E}_{a \sim \pi}\big[Q^{\pi,\gamma}(s, a)\, \nabla_\theta \log \pi(a \mid s; \theta)\big] = 0$ when $a \in \arg\max_a Q^{\pi,\gamma}(s, a)$
Application: Robot Locomotion
Finite-Horizon Methods: Advantage Actor-Critic
▶ A2C / A3C uses this fixed-horizon advantage estimator. (Note: the asynchrony in A3C is only for speed; it does not improve performance.)
▶ Pseudocode (a numpy sketch of the return/advantage computation follows the pseudocode):
for iteration = 1, 2, . . . do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        $\hat{R}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T)$
        $\hat{A}_t = \hat{R}_t - V(s_t)$
    Update the policy using the gradient estimate $\sum_t \hat{A}_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)$, and refit $V$ to the returns $\hat{R}_t$
end for
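A minimal numpy sketch of the return and advantage computation in this pseudocode; the rollout data are hypothetical, and the policy/value updates themselves are omitted.

```python
import numpy as np

def a2c_targets(rewards, values, bootstrap_value, gamma):
    """Compute R_hat_t = r_t + gamma*r_{t+1} + ... + gamma^{T-1-t}*r_{T-1} + gamma^{T-t}*V(s_T)
    and A_hat_t = R_hat_t - V(s_t) for a T-step rollout segment, via a backward recursion."""
    T = len(rewards)
    returns = np.empty(T)
    running = bootstrap_value            # V(s_T)
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values        # values[t] = V(s_t), t = 0, ..., T-1
    return returns, advantages

# Hypothetical T = 20 rollout segment:
rewards = np.random.randn(20)
values = np.random.randn(20)
returns, advantages = a2c_targets(rewards, values, bootstrap_value=0.5, gamma=0.99)
```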
[Figure: results on five Atari games (Beamrider, Breakout, Pong, Q*bert, Space Invaders). Each panel plots score against training time (hours) for DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]
Further Reading