Policy Gradient Algorithms
Ashwin Rao
We want to find the Policy that fetches the “Best Expected Returns”
Gradient Ascent on “Expected Returns” w.r.t params of Policy func
So we need a func approx for (stochastic) Policy Func: π(s, a; θ)
In addition to the usual func approx for the Action Value Func: Q(s, a; w)
The π(s, a; θ) func approx is called the Actor, the Q(s, a; w) func approx is called the Critic
Critic parameters w are optimized by minimizing a suitable loss function for Q(s, a; w)
Actor parameters θ are optimized by maximizing the Expected Returns
We need to formally define “Expected Returns”
But we already see that this idea is appealing for continuous actions
GPI with Policy Improvement done as Policy Gradient (Ascent)
Value Function-based
Learn Value Function (with a function approximation)
Policy is implicit - readily derived from Value Function (e.g., ε-greedy)
Policy-based
Learn Policy (with a function approximation)
No need to learn a Value Function
Actor-Critic
Learn Policy (Actor)
Learn Value Function (Critic)
Advantages:
Finds the best Stochastic Policy (Optimal Deterministic Policy,
produced by other RL algorithms, can be unsuitable for POMDPs)
Naturally explores due to Stochastic Policy representation
Effective in high-dimensional or continuous action spaces
Small changes in θ ⇒ small changes in π, and in state distribution
This avoids the convergence issues seen in argmax-based algorithms
Disadvantages:
Typically converge to a local optimum rather than a global optimum
Policy Evaluation is typically inefficient and has high variance
Policy Improvement happens in small steps ⇒ slow convergence
Discount Factor γ
Assume episodic with 0 ≤ γ ≤ 1 or non-episodic with 0 ≤ γ < 1
States $s_t \in \mathcal{S}$, Actions $a_t \in \mathcal{A}$, Rewards $r_t \in \mathbb{R}$, $\forall t \in \{0, 1, 2, \ldots\}$
State Transition Probabilities $\mathcal{P}^a_{s,s'} = Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
Expected Rewards $\mathcal{R}^a_s = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
Initial State Probability Distribution $p_0 : \mathcal{S} \to [0, 1]$
Policy Func Approx $\pi(s, a; \theta) = Pr(a_t = a \mid s_t = s, \theta)$, $\theta \in \mathbb{R}^k$ (a small parameterization sketch follows this list)
PG coverage is quite similar for the non-discounted, non-episodic setting (using the average-reward objective), so we won't cover it
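As a concrete illustration of the policy func approx π(s, a; θ), here is a minimal Python sketch of a softmax policy over a finite action set, together with its score ∇_θ log π(s, a; θ). The feature function `features(s, a)` and the action set `actions` are hypothetical placeholders, not part of the slides:

import numpy as np

def softmax_policy(theta, features, s, actions):
    # pi(s, a; theta): softmax over linear preferences theta . phi(s, a)
    prefs = np.array([theta @ features(s, a) for a in actions])
    prefs -= prefs.max()                       # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()         # probabilities over the action set

def score(theta, features, s, a, actions):
    # grad_theta log pi(s, a; theta) = phi(s, a) - sum_b pi(s, b; theta) * phi(s, b)
    probs = softmax_policy(theta, features, s, actions)
    expected_phi = sum(p * features(s, b) for p, b in zip(probs, actions))
    return features(s, a) - expected_phi

The same two helpers are reused in the later sketches of REINFORCE, Actor-Critic, and the compatible critic.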
$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \cdot r_t\Big]$$
Value Function $V^\pi(s)$ and Action Value Function $Q^\pi(s, a)$ are defined as:
$$V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot r_k \mid s_t = s\Big], \quad \forall t \in \{0, 1, 2, \ldots\}$$
$$Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=t}^{\infty} \gamma^{k-t} \cdot r_k \mid s_t = s, a_t = a\Big], \quad \forall t \in \{0, 1, 2, \ldots\}$$
$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \cdot r_t\Big] = \sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{E}_\pi[r_t]$$
$$= \sum_{t=0}^{\infty} \gamma^t \int_{\mathcal{S}} \Big(\int_{\mathcal{S}} p_0(s_0) \cdot p(s_0 \to s, t, \pi) \cdot ds_0\Big) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}^a_s \cdot da \cdot ds$$
$$= \int_{\mathcal{S}} \Big(\int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(s_0) \cdot p(s_0 \to s, t, \pi) \cdot ds_0\Big) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}^a_s \cdot da \cdot ds$$
Definition
$$J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}^a_s \cdot da \cdot ds$$
where $\rho^\pi(s) = \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(s_0) \cdot p(s_0 \to s, t, \pi) \cdot ds_0$ is the key function (for PG) that we'll refer to as the Discounted-Aggregate State-Visitation Measure.
Policy Gradient Theorem (PGT)
Theorem
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^\pi(s, a) \cdot da \cdot ds$$
We begin the proof by noting that
$$J(\theta) = \int_{\mathcal{S}} p_0(s_0) \cdot V^\pi(s_0) \cdot ds_0 = \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \pi(s_0, a_0; \theta) \cdot Q^\pi(s_0, a_0) \cdot da_0 \cdot ds_0$$
Applying $\nabla_\theta$ with the product rule, and writing $Q^\pi(s_0, a_0) = \mathcal{R}^{a_0}_{s_0} + \int_{\mathcal{S}} \gamma \cdot \mathcal{P}^{a_0}_{s_0, s_1} \cdot V^\pi(s_1) \cdot ds_1$ in the second term:
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \nabla_\theta \pi(s_0, a_0; \theta) \cdot Q^\pi(s_0, a_0) \cdot da_0 \cdot ds_0$$
$$\quad + \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \pi(s_0, a_0; \theta) \cdot \nabla_\theta \Big(\mathcal{R}^{a_0}_{s_0} + \int_{\mathcal{S}} \gamma \cdot \mathcal{P}^{a_0}_{s_0, s_1} \cdot V^\pi(s_1) \cdot ds_1\Big) \cdot da_0 \cdot ds_0$$
$$= \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \nabla_\theta \pi(s_0, a_0; \theta) \cdot Q^\pi(s_0, a_0) \cdot da_0 \cdot ds_0$$
$$\quad + \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \pi(s_0, a_0; \theta) \int_{\mathcal{S}} \gamma \cdot \mathcal{P}^{a_0}_{s_0, s_1} \cdot \nabla_\theta V^\pi(s_1) \cdot ds_1 \cdot da_0 \cdot ds_0$$
Now bring the outside $\int_{\mathcal{S}}$ and $\int_{\mathcal{A}}$ inside the inner $\int_{\mathcal{S}}$:
$$= \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \nabla_\theta \pi(s_0, a_0; \theta) \cdot Q^\pi(s_0, a_0) \cdot da_0 \cdot ds_0$$
$$\quad + \int_{\mathcal{S}} \Big(\int_{\mathcal{S}} \gamma \cdot p_0(s_0) \int_{\mathcal{A}} \pi(s_0, a_0; \theta) \cdot \mathcal{P}^{a_0}_{s_0, s_1} \cdot da_0 \cdot ds_0\Big) \cdot \nabla_\theta V^\pi(s_1) \cdot ds_1$$
$$= \int_{\mathcal{S}} p_0(s_0) \int_{\mathcal{A}} \nabla_\theta \pi(s_0, a_0; \theta) \cdot Q^\pi(s_0, a_0) \cdot da_0 \cdot ds_0$$
$$\quad + \int_{\mathcal{S}} \Big(\int_{\mathcal{S}} \gamma \cdot p_0(s_0) \cdot p(s_0 \to s_1, 1, \pi) \cdot ds_0\Big) \cdot \nabla_\theta \Big(\int_{\mathcal{A}} \pi(s_1, a_1; \theta) \cdot Q^\pi(s_1, a_1) \cdot da_1\Big) \cdot ds_1$$
Repeating this unrolling for $\nabla_\theta V^\pi(s_1), \nabla_\theta V^\pi(s_2), \ldots$ yields one such term for every time step $t$. Bring $\sum_{t=0}^{\infty}$ inside the two $\int_{\mathcal{S}}$, and note that $\int_{\mathcal{A}} \nabla_\theta \pi(s_t, a_t; \theta) \cdot Q^\pi(s_t, a_t) \cdot da_t$ is independent of $t$:
$$= \int_{\mathcal{S}} \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(s_0) \cdot p(s_0 \to s, t, \pi) \cdot ds_0 \int_{\mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^\pi(s, a) \cdot da \cdot ds$$
Reminder that $\int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t \cdot p_0(s_0) \cdot p(s_0 \to s, t, \pi) \cdot ds_0 \overset{def}{=} \rho^\pi(s)$. So,
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^\pi(s, a) \cdot da \cdot ds$$
Q.E.D.
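Since $\nabla_\theta \pi(s, a; \theta) = \pi(s, a; \theta) \cdot \nabla_\theta \log \pi(s, a; \theta)$, the theorem can be restated in the expectation (score-function) form that the sampling-based algorithms below draw on:
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \nabla_\theta \log \pi(s, a; \theta) \cdot Q^\pi(s, a) \cdot da \cdot ds = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi}\big[\nabla_\theta \log \pi(s, a; \theta) \cdot Q^\pi(s, a)\big]$$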
Monte-Carlo Policy Gradient (REINFORCE) algorithm:
Initialize θ arbitrarily
for each episode {s_0, a_0, r_0, ..., s_T, a_T, r_T} ~ π(·, ·; θ):
    for t ← 0 to T:
        G ← Σ_{k=t}^{T} γ^{k-t} · r_k
        θ ← θ + α · γ^t · ∇_θ log π(s_t, a_t; θ) · G
return θ
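A minimal Python sketch of this Monte-Carlo policy-gradient loop, assuming a hypothetical environment with a `reset()` / `step(action)` interface returning `(next_state, reward, done)`, and reusing the `softmax_policy` / `score` helpers sketched earlier:

import numpy as np

def reinforce(env, theta, features, actions, alpha=0.01, gamma=0.99, num_episodes=1000):
    # Monte-Carlo Policy Gradient: theta <- theta + alpha * gamma^t * G_t * grad log pi(s_t, a_t; theta)
    for _ in range(num_episodes):
        # Generate one episode following pi(., .; theta)
        s, trajectory, done = env.reset(), [], False
        while not done:
            probs = softmax_policy(theta, features, s, actions)
            i = np.random.choice(len(actions), p=probs)
            s_next, r, done = env.step(actions[i])
            trajectory.append((s, actions[i], r))
            s = s_next
        # Compute returns G_t = sum_{k=t}^T gamma^{k-t} r_k with a backwards pass
        G, returns = 0.0, []
        for (_, _, r) in reversed(trajectory):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # Gradient-ascent update for every time step of the episode
        for t, ((s, a, _), G_t) in enumerate(zip(trajectory, returns)):
            theta = theta + alpha * (gamma ** t) * G_t * score(theta, features, s, a, actions)
    return theta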
TD Error, based on the true Value Function $V^\pi$ and on its function approximation $V(s; v)$ respectively:
$$\delta^\pi = r + \gamma \cdot V^\pi(s') - V^\pi(s)$$
$$\delta(s, r, s'; v) = r + \gamma \cdot V(s'; v) - V(s; v)$$
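For context, a minimal sketch of how a one-step Actor-Critic update might use this TD error as the learning signal in place of $Q^\pi(s, a)$, assuming a linear critic $V(s; v) = v \cdot x(s)$ with a hypothetical state-feature function `state_features`, and reusing the `score` helper from earlier:

def actor_critic_step(theta, v, s, a, r, s_next, done, gamma_t,
                      features, state_features, actions,
                      alpha_theta=0.01, alpha_v=0.1, gamma=0.99):
    # TD error delta(s, r, s'; v) = r + gamma * V(s'; v) - V(s; v)
    v_next = 0.0 if done else v @ state_features(s_next)
    delta = r + gamma * v_next - v @ state_features(s)
    # Critic: semi-gradient TD(0) update toward the TD target
    v = v + alpha_v * delta * state_features(s)
    # Actor: gradient ascent, using delta as an estimate of the Advantage;
    # gamma_t = gamma^t carries the discounting of the objective
    theta = theta + alpha_theta * gamma_t * delta * score(theta, features, s, a, actions)
    return theta, v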
Theorem
If the following two conditions are satisfied:
1. Critic gradient is compatible with the Actor score function:
$$\nabla_w Q(s, a; w) = \nabla_\theta \log \pi(s, a; \theta)$$
2. Critic parameters $w$ minimize the mean-squared error:
$$\epsilon = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \big(Q^\pi(s, a) - Q(s, a; w)\big)^2 \cdot da \cdot ds$$
Then the Policy Gradient using the critic $Q(s, a; w)$ is exact:
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q(s, a; w) \cdot da \cdot ds$$
Since $w$ minimizes $\epsilon$, we have $\nabla_w \epsilon = 0$; substituting condition (1) for $\nabla_w Q(s, a; w)$ gives
$$\int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \big(Q^\pi(s, a) - Q(s, a; w)\big) \cdot \nabla_\theta \log \pi(s, a; \theta) \cdot da \cdot ds = 0$$
Therefore,
$$\int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot Q^\pi(s, a) \cdot \nabla_\theta \log \pi(s, a; \theta) \cdot da \cdot ds$$
$$= \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot Q(s, a; w) \cdot \nabla_\theta \log \pi(s, a; \theta) \cdot da \cdot ds$$
But by the Policy Gradient Theorem (since $\nabla_\theta \pi = \pi \cdot \nabla_\theta \log \pi$),
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot Q^\pi(s, a) \cdot \nabla_\theta \log \pi(s, a; \theta) \cdot da \cdot ds$$
So,
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot Q(s, a; w) \cdot \nabla_\theta \log \pi(s, a; \theta) \cdot da \cdot ds$$
$$= \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q(s, a; w) \cdot da \cdot ds$$
Q.E.D.
This means that with conditions (1) and (2) of the Compatible Function Approximation Theorem, we can use the critic func approx Q(s, a; w) and still have the exact Policy Gradient.
How to enable Compatible Function Approximation
A simple way to enable Compatible Function Approximation
$$\frac{\partial Q(s, a; w)}{\partial w_i} = \frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i}, \quad \forall i$$
is to set $Q(s, a; w)$ to be linear in its features:
$$Q(s, a; w) = \sum_{i=1}^{n} \phi_i(s, a) \cdot w_i = \sum_{i=1}^{n} \frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i} \cdot w_i$$
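A brief sketch of what fitting such a compatible linear critic could look like in practice: a least-squares fit of $w$ over the compatible features (the score components $\partial \log \pi(s, a; \theta) / \partial \theta_i$) against sampled returns, as a sampled stand-in for condition (2). The sample format is hypothetical and the `score` helper from earlier is reused:

import numpy as np

def fit_compatible_critic(samples, theta, features, actions):
    # samples: list of (s, a, G) with G a sampled return used as a target for Q^pi(s, a)
    # Compatible features are exactly the score vectors grad_theta log pi(s, a; theta)
    X = np.array([score(theta, features, s, a, actions) for (s, a, _) in samples])
    y = np.array([G for (_, _, G) in samples])
    # Least-squares minimization of sum_i (G_i - X_i . w)^2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w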
Denoting $\Big[\frac{\partial \log \pi(s, a; \theta)}{\partial \theta_i}\Big]$, $i = 1, \ldots, n$ as the score column vector $SC(s, a; \theta)$, and assuming a compatible linear-approximation critic:
$$\nabla_\theta J(\theta) = \int_{\mathcal{S}} \rho^\pi(s) \int_{\mathcal{A}} \pi(s, a; \theta) \cdot \big(SC(s, a; \theta) \cdot SC(s, a; \theta)^T \cdot w\big) \cdot da \cdot ds$$
$$= \mathbb{E}_{s \sim \rho^\pi, a \sim \pi}\big[SC(s, a; \theta) \cdot SC(s, a; \theta)^T\big] \cdot w$$
$$= FIM_{\rho^\pi, \pi}(\theta) \cdot w$$
Therefore, the Natural Policy Gradient $\nabla^{nat}_\theta J(\theta) = w$
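To make the last identity concrete, here is a small numerical sketch (reusing the `score` helper; the set of (s, a) samples is hypothetical): estimate $FIM = \mathbb{E}[SC \cdot SC^T]$ from samples, form the vanilla gradient $FIM \cdot w$ implied by the compatible linear critic, and recover the natural gradient by solving the linear system, which returns $w$ up to numerical error:

import numpy as np

def natural_gradient_check(samples, theta, features, actions, w):
    # samples: list of (s, a) pairs, nominally drawn from rho^pi and pi
    scores = np.array([score(theta, features, s, a, actions) for (s, a) in samples])
    fim = scores.T @ scores / len(samples)             # estimate of E[SC SC^T] = FIM
    vanilla_grad = fim @ w                             # policy gradient with the compatible critic
    natural_grad = np.linalg.solve(fim, vanilla_grad)  # FIM^{-1} * vanilla gradient
    return natural_grad                                # equals w up to numerical error

This is why, with a compatible linear critic, natural-gradient ascent reduces to the simple update θ ← θ + α · w.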