
Chapter 9

Policy Gradient Methods

[Figure 9.1: Where we are in this book. The chapter map shows Chapter 1 (Basic Concepts) and Chapters 2-3 (Bellman Equation, Bellman Optimality Equation) as fundamental tools; Chapters 4-7 (Value Iteration & Policy Iteration, Monte Carlo Methods, Stochastic Approximation, Temporal-Difference Methods) as algorithms moving from model-based to model-free; Chapter 8 (Value Function Approximation) moving from tabular to function representations; and Chapters 9-10 (Policy Gradient Methods, Actor-Critic Methods) moving from value-based to policy-based methods.]

The idea of function approximation can be applied not only to represent state/action
values, as introduced in Chapter 8, but also to represent policies, as introduced in this
chapter. So far in this book, policies have been represented by tables: the action prob-
abilities of all states are stored in a table (e.g., Table 9.1). In this chapter, we show
that policies can be represented by parameterized functions denoted as π(a|s, θ), where
θ ∈ Rm is a parameter vector. It can also be written in other forms such as πθ (a|s),
πθ (a, s), or π(a, s, θ).
When policies are represented as functions, optimal policies can be obtained by optimizing certain scalar metrics. Such a method is called policy gradient.

        a1           a2           a3           a4           a5
s1      π(a1|s1)     π(a2|s1)     π(a3|s1)     π(a4|s1)     π(a5|s1)
⋮       ⋮            ⋮            ⋮            ⋮            ⋮
s9      π(a1|s9)     π(a2|s9)     π(a3|s9)     π(a4|s9)     π(a5|s9)

Table 9.1: A tabular representation of a policy. There are nine states and five actions for each state.

[Figure 9.2: Function representations of policies. The functions may have different structures. In (a), the function takes the state s and an action a as inputs (with parameter θ) and outputs the single probability π(a|s, θ); in (b), it takes the state s as input and outputs π(a1|s, θ), . . . , π(a5|s, θ) for all actions.]

The policy gradient method is a big step forward in this book because it is policy-based. By contrast, all the previous chapters in this book discuss value-based methods. The advantages of the policy gradient method are numerous. For example, it is more efficient for handling large state/action spaces. It also has stronger generalization abilities and hence is more efficient in terms of sample usage.

9.1 Policy representation: From table to function


When the representation of a policy is switched from a table to a function, it is necessary
to clarify the difference between the two representation methods.

- First, how to define optimal policies? When represented as a table, a policy is defined as optimal if it can maximize every state value. When represented by a function, a policy is defined as optimal if it can maximize certain scalar metrics.
- Second, how to update a policy? When represented by a table, a policy can be updated by directly changing the entries in the table. When represented by a parameterized function, a policy can no longer be updated in this way. Instead, it can only be updated by changing the parameter θ.
- Third, how to retrieve the probability of an action? In the tabular case, the probability of an action can be directly obtained by looking up the corresponding entry in the table. In the case of function representation, we need to input (s, a) into the function to calculate its probability (see Figure 9.2(a)). Depending on the structure of the function, we can also input a state and then output the probabilities of all actions (see Figure 9.2(b)). A small sketch contrasting these two query styles is given below.
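As an illustration of the two query styles, consider the minimal sketch below. NumPy, the softmax parameterization, and the sizes are assumptions made only for illustration; they are not prescribed by the text above.

import numpy as np

n_states, n_actions = 9, 5                       # matches Table 9.1: nine states, five actions
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))   # parameters of the policy function

def pi_all(s, theta):
    """Style of Figure 9.2(b): input the state s, output pi(a|s, theta) for all actions."""
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def pi_single(s, a, theta):
    """Style of Figure 9.2(a): input (s, a), output the single probability pi(a|s, theta)."""
    return pi_all(s, theta)[a]

print(pi_all(0, theta))        # probabilities of all five actions at the first state
print(pi_single(0, 2, theta))  # probability of the third action at the first state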


The basic idea of the policy gradient method is summarized below. Suppose that J(θ)
is a scalar metric. Optimal policies can be obtained by optimizing this metric via the
gradient-based algorithm:

θt+1 = θt + α∇θ J(θt ),

where ∇θ J is the gradient of J with respect to θ, t is the time step, and α is the learning rate.
With this basic idea, we will answer the following three questions in the remainder of
this chapter.

- What metrics should be used? (Section 9.2)
- How to calculate the gradients of the metrics? (Section 9.3)
- How to use experience samples to calculate the gradients? (Section 9.4)

9.2 Metrics for defining optimal policies


If a policy is represented by a function, there are two types of metrics for defining optimal
policies. One is based on state values and the other is based on immediate rewards.

Metric 1: Average state value

The first metric is the average state value, or simply the average value. It is defined as

v̄π = Σ_{s∈S} d(s) vπ(s),

where d(s) is the weight of state s. It satisfies d(s) ≥ 0 for any s ∈ S and Σ_{s∈S} d(s) = 1. Therefore, we can interpret d(s) as a probability distribution over s. Then, the metric can be written as

v̄π = E_{S∼d}[vπ(S)].

How to select the distribution d? This is an important question. There are two cases.

- The first and simplest case is that d is independent of the policy π. In this case, we specifically denote d as d0 and v̄π as v̄π0 to indicate that the distribution is independent of the policy. One option is to treat all the states as equally important and select d0(s) = 1/|S|. Another case is when we are only interested in a specific state s0 (e.g., the agent always starts from s0). In this case, we can design

d0(s0) = 1,    d0(s ≠ s0) = 0.


- The second case is that d depends on the policy π. In this case, it is common to select d as dπ, which is the stationary distribution under π. One basic property of dπ is that it satisfies

dπᵀ Pπ = dπᵀ,

where Pπ is the state transition probability matrix under π. More information about the stationary distribution can be found in Box 8.1. A small numerical sketch of computing dπ is given after this list.
The interpretation of selecting dπ is as follows. The stationary distribution reflects the long-term behavior of a Markov decision process under a given policy. If one state is frequently visited in the long term, it is more important and deserves a higher weight; if a state is rarely visited, its importance is low and it deserves a lower weight.
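As a small numerical sketch (assuming NumPy and a hypothetical 3-state transition matrix; the numbers are made up for illustration), the stationary distribution dπ can be computed by iterating dᵀ ← dᵀPπ until convergence:

import numpy as np

# Hypothetical state transition matrix P_pi under some policy pi (rows sum to 1).
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.3, 0.5]])

# Power iteration: d^T <- d^T P_pi until the distribution stops changing.
d = np.ones(3) / 3                     # start from the uniform distribution
for _ in range(1000):
    d_next = d @ P_pi
    if np.max(np.abs(d_next - d)) < 1e-12:
        break
    d = d_next

print(d, d @ P_pi)                     # d should satisfy d^T P_pi = d^T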

As its name suggests, v̄π is a weighted average of the state values. Different values
of θ lead to different values of v̄π . Our ultimate goal is to find an optimal policy (or
equivalently an optimal θ) to maximize v̄π .
We next introduce another two important equivalent expressions of v̄π .

- Suppose that an agent collects rewards {Rt+1}_{t=0}^{∞} by following a given policy π(θ). Readers may often see the following metric in the literature:

J(θ) = lim_{n→∞} E[ Σ_{t=0}^{n} γ^t Rt+1 ] = E[ Σ_{t=0}^{∞} γ^t Rt+1 ].    (9.1)

This metric may be nontrivial to interpret at first glance. In fact, it is equal to v̄π. To see that, we have

E[ Σ_{t=0}^{∞} γ^t Rt+1 ] = Σ_{s∈S} d(s) E[ Σ_{t=0}^{∞} γ^t Rt+1 | S0 = s ]
                          = Σ_{s∈S} d(s) vπ(s)
                          = v̄π.

The first equality in the above equation is due to the law of total expectation. The second equality is by the definition of state values.
- The metric v̄π can also be rewritten as the inner product of two vectors. In particular, let

vπ = [. . . , vπ(s), . . . ]ᵀ ∈ R^{|S|},
d = [. . . , d(s), . . . ]ᵀ ∈ R^{|S|}.

Then, we have

v̄π = dᵀ vπ.

This expression will be useful when we analyze its gradient; a numerical sketch is given below.
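As a minimal sketch of this inner-product expression (assuming NumPy and a hypothetical 3-state, 2-action MDP with made-up numbers), the code below solves the Bellman equation vπ = rπ + γPπvπ for vπ and evaluates v̄π = dᵀvπ under a uniform d:

import numpy as np

gamma = 0.9
# Hypothetical MDP: P[s, a, s'] and r[s, a] are made-up numbers for illustration.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.0, 0.7, 0.3], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 1.0]])
pi = np.array([[0.5, 0.5],                       # pi[s, a] = pi(a|s)
               [0.2, 0.8],
               [0.9, 0.1]])

P_pi = np.einsum('sa,sax->sx', pi, P)            # state transition matrix under pi
r_pi = (pi * r).sum(axis=1)                      # expected one-step reward under pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # Bellman equation solution

d = np.ones(3) / 3                               # uniform state distribution
print(d @ v_pi)                                  # v_bar_pi = d^T v_pi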

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward [2, 64, 65]. In particular, it is defined as

r̄π ≐ Σ_{s∈S} dπ(s) rπ(s) = E_{S∼dπ}[rπ(S)],    (9.2)

where dπ is the stationary distribution and

rπ(s) ≐ Σ_{a∈A} π(a|s, θ) r(s, a) = E_{A∼π(s,θ)}[r(s, A) | s]    (9.3)

is the expectation of the immediate rewards. Here, r(s, a) ≐ E[R | s, a] = Σ_r r p(r|s, a).
We next present two other important equivalent expressions of r̄π.

- Suppose that the agent collects rewards {Rt+1}_{t=0}^{∞} by following a given policy π(θ). A common metric that readers may often see in the literature is

J(θ) = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ].    (9.4)

It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to r̄π:

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ] = Σ_{s∈S} dπ(s) rπ(s) = r̄π.    (9.5)

The proof of (9.5) is given in Box 9.1.


- The average reward r̄π in (9.2) can also be written as the inner product of two vectors. In particular, let

rπ = [. . . , rπ(s), . . . ]ᵀ ∈ R^{|S|},
dπ = [. . . , dπ(s), . . . ]ᵀ ∈ R^{|S|},

where rπ(s) is defined in (9.3). Then, it is clear that

r̄π = Σ_{s∈S} dπ(s) rπ(s) = dπᵀ rπ.

This expression will be useful when we derive its gradient. A simulation-based sketch of (9.5) is given below.
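To illustrate (9.5), the sketch below (assuming NumPy and a hypothetical 2-state, 2-action MDP with made-up numbers) simulates a long trajectory under a fixed policy and compares the empirical average reward with dπᵀrπ:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical MDP and policy (made-up numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.1, 0.9]]])      # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # r[s, a]
pi = np.array([[0.3, 0.7], [0.6, 0.4]])       # pi[s, a] = pi(a|s)

# Exact value: r_bar_pi = d_pi^T r_pi, with d_pi obtained from d^T P_pi = d^T.
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = (pi * r).sum(axis=1)
d_pi = np.ones(2) / 2
for _ in range(5000):
    d_pi = d_pi @ P_pi
print("d_pi^T r_pi =", d_pi @ r_pi)

# Empirical value: (1/n) times the sum of rewards along one long trajectory.
s, total, n = 0, 0.0, 100_000
for _ in range(n):
    a = rng.choice(2, p=pi[s])
    total += r[s, a]                          # use the expected reward r(s, a)
    s = rng.choice(2, p=P[s, a])
print("empirical average =", total / n)       # should be close to d_pi^T r_pi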

Box 9.1: Proof of (9.5)

Step 1: We first prove that the following equation is valid for any starting state s0 ∈ S:

r̄π = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s0 ].    (9.6)

To do that, we notice

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s0 ] = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} E[Rt+1 | S0 = s0]
                                                  = lim_{t→∞} E[Rt+1 | S0 = s0],    (9.7)

where the last equality is due to the property of the Cesàro mean (also called the Cesàro summation). In particular, if {ak}_{k=1}^{∞} is a convergent sequence such that lim_{k→∞} ak exists, then {(1/n) Σ_{k=1}^{n} ak}_{n=1}^{∞} is also a convergent sequence and lim_{n→∞} (1/n) Σ_{k=1}^{n} ak = lim_{k→∞} ak.
We next examine E[Rt+1 | S0 = s0] in (9.7) more closely. By the law of total expectation, we have

E[Rt+1 | S0 = s0] = Σ_{s∈S} E[Rt+1 | St = s, S0 = s0] p^{(t)}(s|s0)
                  = Σ_{s∈S} E[Rt+1 | St = s] p^{(t)}(s|s0)
                  = Σ_{s∈S} rπ(s) p^{(t)}(s|s0),

where p^{(t)}(s|s0) denotes the probability of transitioning from s0 to s in exactly t steps. The second equality in the above equation is due to the Markov memoryless property: the reward obtained at the next time step depends only on the current state rather than the previous ones.
Note that

lim_{t→∞} p^{(t)}(s|s0) = dπ(s)

by the definition of the stationary distribution. As a result, the starting state s0 does not matter. Then, we have

lim_{t→∞} E[Rt+1 | S0 = s0] = lim_{t→∞} Σ_{s∈S} rπ(s) p^{(t)}(s|s0) = Σ_{s∈S} rπ(s) dπ(s) = r̄π.

Substituting the above equation into (9.7) gives (9.6).

Step 2: Consider an arbitrary state distribution d. By the law of total expectation, we have

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ] = lim_{n→∞} (1/n) Σ_{s∈S} d(s) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s ]
                                        = Σ_{s∈S} d(s) lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s ].

Since (9.6) is valid for any starting state, substituting (9.6) into the above equation yields

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ] = Σ_{s∈S} d(s) r̄π = r̄π.

The proof is complete.

Some remarks

Metric    Expression 1                  Expression 2          Expression 3
v̄π        Σ_{s∈S} d(s) vπ(s)            E_{S∼d}[vπ(S)]        lim_{n→∞} E[ Σ_{t=0}^{n} γ^t Rt+1 ]
r̄π        Σ_{s∈S} dπ(s) rπ(s)           E_{S∼dπ}[rπ(S)]       lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ]

Table 9.2: Summary of the different but equivalent expressions of v̄π and r̄π.

Up to now, we have introduced two types of metrics: v̄π and r̄π . Each metric has
several different but equivalent expressions. They are summarized in Table 9.2. We
sometimes use v̄π to specifically refer to the case where the state distribution is the
stationary distribution dπ and use v̄π0 to refer to the case where d0 is independent of π.
Some remarks about the metrics are given below.

- All these metrics are functions of π. Since π is parameterized by θ, these metrics are functions of θ. In other words, different values of θ generate different metric values. Therefore, we can search for the optimal values of θ to maximize these metrics. This is the basic idea of policy gradient methods.


- The two metrics v̄π and r̄π are equivalent in the discounted case where γ < 1. In particular, it can be shown that

r̄π = (1 − γ) v̄π.

The above equation indicates that these two metrics can be simultaneously maximized. The proof of this equation is given later in Lemma 9.1.

9.3 Gradients of the metrics


Given the metrics introduced in the last section, we can use gradient-based methods to
maximize them. To do that, we need to first calculate the gradients of these metrics.
The most important theoretical result in this chapter is the following theorem.

Theorem 9.1 (Policy gradient theorem). The gradient of J(θ) is

∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a),    (9.8)

where η is a state distribution and ∇θ π is the gradient of π with respect to θ. Moreover, (9.8) has a compact form expressed in terms of an expectation:

∇θ J(θ) = E_{S∼η, A∼π(S,θ)}[ ∇θ ln π(A|S, θ) qπ(S, A) ],    (9.9)

where ln is the natural logarithm.

Some important remarks about Theorem 9.1 are given below.

- It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases. The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of J(θ) and η are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, J(θ) could be v̄π0, v̄π, or r̄π. The equality in (9.8) may be either exact or an approximation. The distribution η also varies in different scenarios.
The derivation of the gradients is the most complicated part of the policy gradient method. For many readers, it is sufficient to be familiar with the result in Theorem 9.1 without knowing the proof. The derivation details presented in the rest of this section are mathematically intensive. Readers are suggested to study selectively based on their interests.


- The expression in (9.9) is more favorable than (9.8) because it is expressed as an expectation. We will show in Section 9.4 that this true gradient can be approximated by a stochastic gradient.
Why can (9.8) be expressed as (9.9)? The proof is given below. By the definition of expectation, (9.8) can be rewritten as

∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
        = E_{S∼η}[ Σ_{a∈A} ∇θ π(a|S, θ) qπ(S, a) ].    (9.10)

Furthermore, the gradient of ln π(a|s, θ) is

∇θ ln π(a|s, θ) = ∇θ π(a|s, θ) / π(a|s, θ).

It follows that

∇θ π(a|s, θ) = π(a|s, θ) ∇θ ln π(a|s, θ).    (9.11)

Substituting (9.11) into (9.10) gives

∇θ J(θ) = E_{S∼η}[ Σ_{a∈A} π(a|S, θ) ∇θ ln π(a|S, θ) qπ(S, a) ]
        = E_{S∼η, A∼π(S,θ)}[ ∇θ ln π(A|S, θ) qπ(S, A) ].

- It is notable that π(a|s, θ) must be positive for all (s, a) to ensure that ln π(a|s, θ) is valid. This can be achieved by using softmax functions:

π(a|s, θ) = e^{h(s,a,θ)} / Σ_{a′∈A} e^{h(s,a′,θ)},    a ∈ A,    (9.12)

where h(s, a, θ) is a function indicating the preference for selecting a at s. The policy in (9.12) satisfies π(a|s, θ) ∈ [0, 1] and Σ_{a∈A} π(a|s, θ) = 1 for any s ∈ S. This policy can be realized by a neural network. The input of the network is s. The output layer is a softmax layer so that the network outputs π(a|s, θ) for all a and the sum of the outputs is equal to 1. See Figure 9.2(b) for an illustration. A small sketch of such a softmax policy is given after this list.
Since π(a|s, θ) > 0 for all a, the policy is stochastic and hence exploratory. The policy does not directly tell which action to take. Instead, the action should be generated according to the probability distribution of the policy.
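As a minimal sketch of such a softmax policy (assuming NumPy, a linear preference function h(s, a, θ) = θᵀx(s, a) with a hypothetical one-hot feature x(s, a), and made-up sizes), the code below evaluates π(a|s, θ) and the log-gradient ∇θ ln π(a|s, θ) that appears in (9.9):

import numpy as np

n_states, n_actions = 3, 2
dim = n_states * n_actions                      # theta in R^m with m = |S||A|

def feature(s, a):
    """Hypothetical one-hot feature x(s, a); h(s, a, theta) = theta^T x(s, a)."""
    x = np.zeros(dim)
    x[s * n_actions + a] = 1.0
    return x

def policy(s, theta):
    """Softmax policy (9.12): pi(a|s, theta) = exp(h(s,a,theta)) / sum_a' exp(h(s,a',theta))."""
    h = np.array([theta @ feature(s, a) for a in range(n_actions)])
    e = np.exp(h - h.max())                     # subtract the max for numerical stability
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """grad_theta ln pi(a|s, theta) = x(s, a) - sum_a' pi(a'|s, theta) x(s, a')."""
    p = policy(s, theta)
    return feature(s, a) - sum(p[b] * feature(s, b) for b in range(n_actions))

theta = np.zeros(dim)
print(policy(0, theta))                         # uniform probabilities when theta = 0
print(grad_log_pi(0, 1, theta))                 # gradient used in the policy gradient update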


9.3.1 Derivation of the gradients in the discounted case


We next derive the gradients of the metrics in the discounted case where γ ∈ (0, 1). The state value and action value in the discounted case are defined as

vπ(s) = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s],
qπ(s, a) = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s, At = a].

It holds that vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a), and the state value satisfies the Bellman equation.
First, we show that v̄π (θ) and r̄π (θ) are equivalent metrics.

Lemma 9.1 (Equivalence between v̄π(θ) and r̄π(θ)). In the discounted case where γ ∈ (0, 1), it holds that

r̄π = (1 − γ) v̄π.    (9.13)

Proof. Note that v̄π(θ) = dπᵀ vπ and r̄π(θ) = dπᵀ rπ, where vπ and rπ satisfy the Bellman equation vπ = rπ + γPπ vπ. Multiplying dπᵀ on both sides of the Bellman equation yields

v̄π = r̄π + γ dπᵀ Pπ vπ = r̄π + γ dπᵀ vπ = r̄π + γ v̄π,

which implies (9.13).

Second, the following lemma gives the gradient of vπ (s) for any s.

Lemma 9.2 (Gradient of vπ(s)). In the discounted case, it holds for any s ∈ S that

∇θ vπ(s) = Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a),    (9.14)

where

Prπ(s′|s) ≐ Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss′} = [(In − γPπ)^{−1}]_{ss′}

is the discounted total probability of transitioning from s to s′ under policy π. Here, [·]_{ss′} denotes the entry in the sth row and s′th column, and [Pπ^k]_{ss′} is the probability of transitioning from s to s′ in exactly k steps under π.

Box 9.2: Proof of Lemma 9.2


First, for any s ∈ S, it holds that

∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ) qπ(s, a) ]
         = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ∇θ qπ(s, a) ],    (9.15)

where qπ(s, a) is the action value given by

qπ(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) vπ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

∇θ qπ(s, a) = 0 + γ Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).

Substituting this result into (9.15) yields

∇θ vπ(s) = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) γ Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) ]
         = Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a) + γ Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).    (9.16)

It is notable that ∇θ vπ appears on both sides of the above equation. One way to calculate it is to use the unrolling technique [64]. Here, we use another way based on the matrix-vector form, which we believe is more straightforward to understand. In particular, let

u(s) ≐ Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Since

Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) = Σ_{s′∈S} p(s′|s) ∇θ vπ(s′) = Σ_{s′∈S} [Pπ]_{ss′} ∇θ vπ(s′),

equation (9.16) can be written in matrix-vector form. Stacking ∇θ vπ(s) over all s ∈ S into ∇θ vπ ∈ R^{mn} and u(s) into u ∈ R^{mn} gives

∇θ vπ = u + γ (Pπ ⊗ Im) ∇θ vπ.

Here, n = |S|, and m is the dimension of the parameter vector θ. The reason that the Kronecker product ⊗ emerges in the equation is that ∇θ vπ(s) is a vector. The above equation is a linear equation in ∇θ vπ, which can be solved as

∇θ vπ = (Inm − γ Pπ ⊗ Im)^{−1} u
      = (In ⊗ Im − γ Pπ ⊗ Im)^{−1} u
      = [(In − γPπ)^{−1} ⊗ Im] u.    (9.17)

For any state s, it follows from (9.17) that

∇θ vπ(s) = Σ_{s′∈S} [(In − γPπ)^{−1}]_{ss′} u(s′)
         = Σ_{s′∈S} [(In − γPπ)^{−1}]_{ss′} Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a).    (9.18)

The quantity [(In − γPπ)^{−1}]_{ss′} has a clear probabilistic interpretation. In particular, since (In − γPπ)^{−1} = I + γPπ + γ²Pπ² + · · · , we have

[(In − γPπ)^{−1}]_{ss′} = [I]_{ss′} + γ[Pπ]_{ss′} + γ²[Pπ²]_{ss′} + · · · = Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss′}.

Note that [Pπ^k]_{ss′} is the probability of transitioning from s to s′ in exactly k steps (see Box 8.1). Therefore, [(In − γPπ)^{−1}]_{ss′} is the discounted total probability of transitioning from s to s′ in any number of steps. By denoting [(In − γPπ)^{−1}]_{ss′} ≐ Prπ(s′|s), equation (9.18) becomes (9.14).
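The interpretation of [(In − γPπ)^{−1}]_{ss′} as a discounted total transition probability is easy to check numerically. The sketch below (assuming NumPy and a hypothetical small Pπ with made-up numbers) compares the matrix inverse with a truncated series Σ_k γ^k Pπ^k:

import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],        # hypothetical state transition matrix under pi
                 [0.1, 0.8, 0.1],
                 [0.2, 0.3, 0.5]])

inv = np.linalg.inv(np.eye(3) - gamma * P_pi)

# Truncated series I + gamma P + gamma^2 P^2 + ... (converges since gamma < 1).
series, term = np.zeros((3, 3)), np.eye(3)
for _ in range(500):
    series += term
    term = gamma * term @ P_pi

print(np.allclose(inv, series))          # expected: True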

With the results in Lemma 9.2, we are ready to derive the gradient of v̄π0 .

Theorem 9.2 (Gradient of v̄π0 in the discounted case). In the discounted case where γ ∈ (0, 1), the gradient of v̄π0 = d0ᵀ vπ is

∇θ v̄π0 = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ ρπ and A ∼ π(S, θ). Here, the state distribution ρπ is

ρπ(s) = Σ_{s′∈S} d0(s′) Prπ(s|s′),    s ∈ S,    (9.19)

where Prπ(s|s′) = Σ_{k=0}^{∞} γ^k [Pπ^k]_{s′s} = [(I − γPπ)^{−1}]_{s′s} is the discounted total probability of transitioning from s′ to s under policy π.

Box 9.3: Proof of Theorem 9.2

Since d0(s) is independent of π, we have

∇θ v̄π0 = ∇θ Σ_{s∈S} d0(s) vπ(s) = Σ_{s∈S} d0(s) ∇θ vπ(s).

Substituting the expression of ∇θ vπ(s) given in Lemma 9.2 into the above equation yields

∇θ v̄π0 = Σ_{s∈S} d0(s) ∇θ vπ(s) = Σ_{s∈S} d0(s) Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
        = Σ_{s′∈S} ( Σ_{s∈S} d0(s) Prπ(s′|s) ) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
        ≐ Σ_{s′∈S} ρπ(s′) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
        = Σ_{s∈S} ρπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)    (change s′ to s)
        = Σ_{s∈S} ρπ(s) Σ_{a∈A} π(a|s, θ) ∇θ ln π(a|s, θ) qπ(s, a)
        = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ ρπ and A ∼ π(S, θ). The proof is complete.
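Theorem 9.2 can be checked numerically. The sketch below (assuming NumPy, a tabular softmax policy with one preference parameter per state-action pair, and a hypothetical 2-state, 2-action MDP with made-up numbers) evaluates the right-hand side of the theorem and compares it with a finite-difference approximation of ∇θ v̄π0:

import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
# Hypothetical MDP (made-up numbers, for illustration only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],                    # r[s, a]
              [0.5, 2.0]])
d0 = np.array([1.0, 0.0])                    # the agent always starts from state 0

def pi(theta):
    """Softmax policy; theta[s, a] plays the role of the preference h(s, a, theta)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def metric(theta):
    """Return v_pi (from the Bellman equation), P_pi, and v_bar_pi^0 = d0^T v_pi."""
    p = pi(theta)
    P_pi = np.einsum('sa,sax->sx', p, P)
    r_pi = (p * r).sum(axis=1)
    v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    return v_pi, P_pi, d0 @ v_pi

def grad_theorem_92(theta):
    """Analytic gradient: sum_s rho_pi(s) sum_a grad_theta pi(a|s, theta) q_pi(s, a)."""
    p = pi(theta)
    v_pi, P_pi, _ = metric(theta)
    q = r + gamma * P @ v_pi                                   # q_pi[s, a]
    rho = d0 @ np.linalg.inv(np.eye(n_s) - gamma * P_pi)       # rho_pi in (9.19)
    grad = np.zeros_like(theta)
    for s in range(n_s):
        for a in range(n_a):
            # For the softmax: d pi(a'|s) / d theta[s, a] = pi(a'|s) (1{a'=a} - pi(a|s)).
            dpi = p[s] * ((np.arange(n_a) == a) - p[s, a])
            grad[s, a] = rho[s] * (dpi @ q[s])
    return grad

theta = np.array([[0.3, -0.2], [0.1, 0.4]])
g = grad_theorem_92(theta)

# Finite-difference check of the gradient of v_bar_pi^0 with respect to theta.
eps, g_fd = 1e-6, np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        t1, t2 = theta.copy(), theta.copy()
        t1[s, a] += eps
        t2[s, a] -= eps
        g_fd[s, a] = (metric(t1)[2] - metric(t2)[2]) / (2 * eps)

print(np.allclose(g, g_fd, atol=1e-5))       # expected: True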

With Lemma 9.1 and Lemma 9.2, we can derive the gradients of r̄π and v̄π .

Theorem 9.3 (Gradients of r̄π and v̄π in the discounted case). In the discounted case where γ ∈ (0, 1), the gradients of r̄π and v̄π are

∇θ r̄π = (1 − γ) ∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
       = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ dπ and A ∼ π(S, θ). Here, the approximation is more accurate when γ is closer to 1.

Box 9.4: Proof of Theorem 9.3


It follows from the definition of v̄π that

∇θ v̄π = ∇θ Σ_{s∈S} dπ(s) vπ(s)
       = Σ_{s∈S} (∇θ dπ(s)) vπ(s) + Σ_{s∈S} dπ(s) ∇θ vπ(s).    (9.20)

This equation contains two terms. On the one hand, substituting the expression of ∇θ vπ given in (9.17) into the second term gives

Σ_{s∈S} dπ(s) ∇θ vπ(s) = (dπᵀ ⊗ Im) ∇θ vπ
                       = (dπᵀ ⊗ Im) [(In − γPπ)^{−1} ⊗ Im] u
                       = [dπᵀ (In − γPπ)^{−1}] ⊗ Im u.    (9.21)

It is noted that

dπᵀ (In − γPπ)^{−1} = (1/(1 − γ)) dπᵀ,

which can be easily verified by multiplying (In − γPπ) on both sides of the equation. Therefore, (9.21) becomes

Σ_{s∈S} dπ(s) ∇θ vπ(s) = (1/(1 − γ)) (dπᵀ ⊗ Im) u
                       = (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

On the other hand, the first term of (9.20) involves ∇θ dπ. However, since the second term contains the factor 1/(1 − γ), the second term becomes dominant and the first term becomes negligible when γ → 1. Therefore,

∇θ v̄π ≈ (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Furthermore, it follows from r̄π = (1 − γ) v̄π that

∇θ r̄π = (1 − γ) ∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
       = Σ_{s∈S} dπ(s) Σ_{a∈A} π(a|s, θ) ∇θ ln π(a|s, θ) qπ(s, a)
       = E[ ∇θ ln π(A|S, θ) qπ(S, A) ].


The approximation in the above equation requires that the first term does not go to
infinity when γ → 1. More information can be found in [66, Section 4].

9.3.2 Derivation of the gradients in the undiscounted case


We next show how to calculate the gradients of the metrics in the undiscounted case
where γ = 1. Readers may wonder why we suddenly start considering the undiscounted
case while we have only considered the discounted case so far in this book. The reasons
are as follows. First, for continuing tasks, it may be inappropriate to introduce the
discount rate and we need to consider the undiscounted case. Second, the definition of
the average reward r̄π is valid for both discounted and undiscounted cases. While the
gradient of r̄π in the discounted case is an approximation, we will see that its gradient in
the undiscounted case is more elegant.

State values and the Poisson equation

In the undiscounted case, it is necessary to redefine state and action values. Since the undiscounted sum of the rewards, E[Rt+1 + Rt+2 + Rt+3 + . . . | St = s], may diverge, the state and action values are defined in a special way [64]:

vπ(s) ≐ E[(Rt+1 − r̄π) + (Rt+2 − r̄π) + (Rt+3 − r̄π) + . . . | St = s],
qπ(s, a) ≐ E[(Rt+1 − r̄π) + (Rt+2 − r̄π) + (Rt+3 − r̄π) + . . . | St = s, At = a],

where r̄π is the average reward, which is determined when π is given. There are different names for vπ(s) in the literature, such as the differential reward [65] or bias [2, Section 8.2.1]. It can be verified that the state value defined above satisfies the following Bellman-like equation:

vπ(s) = Σ_a π(a|s, θ) [ Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′) ].    (9.22)

Since vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a), it holds that qπ(s, a) = Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′). The matrix-vector form of (9.22) is

vπ = rπ − r̄π 1n + Pπ vπ,    (9.23)

where 1n = [1, . . . , 1]ᵀ ∈ Rⁿ. Equation (9.23) is similar to the Bellman equation, and it has a specific name: the Poisson equation [65, 67].
How to solve vπ from the Poisson equation? The answer is given in the following
theorem.


Theorem 9.4 (Solution of the Poisson equation). Let

vπ* = (In − Pπ + 1n dπᵀ)^{−1} rπ.    (9.24)

Then, vπ* is a solution of the Poisson equation in (9.23). Moreover, any solution of the Poisson equation has the following form:

vπ = vπ* + c 1n,

where c ∈ R.

This theorem indicates that the solution of the Poisson equation may not be unique.
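As a quick numerical check of Theorem 9.4 (assuming NumPy and hypothetical Pπ and rπ with made-up numbers), the sketch below computes vπ* from (9.24) and verifies that it, and any shifted copy vπ* + c1n, satisfies the Poisson equation (9.23):

import numpy as np

# Hypothetical quantities under some policy pi (made-up numbers, for illustration only).
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.3, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])
n, ones = 3, np.ones(3)

# Stationary distribution d_pi from d^T P_pi = d^T (power iteration).
d_pi = ones / n
for _ in range(5000):
    d_pi = d_pi @ P_pi
r_bar = d_pi @ r_pi                                        # average reward

# Theorem 9.4: v* = (I - P_pi + 1 d_pi^T)^(-1) r_pi.
v_star = np.linalg.solve(np.eye(n) - P_pi + np.outer(ones, d_pi), r_pi)

# Check the Poisson equation (9.23): v = r_pi - r_bar 1 + P_pi v.
print(np.allclose(v_star, r_pi - r_bar * ones + P_pi @ v_star))             # expected: True
# Any shifted vector v* + c 1 is also a solution.
print(np.allclose(v_star + 5.0, r_pi - r_bar * ones + P_pi @ (v_star + 5.0)))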

Box 9.5: Proof of Theorem 9.4


We prove the theorem in three steps.
- Step 1: Show that vπ* in (9.24) is a solution of (9.25).
For the sake of simplicity, let

A ≐ In − Pπ + 1n dπᵀ.

Then, vπ* = A^{−1} rπ. The fact that A is invertible will be proven in Step 3. Substituting vπ* = A^{−1} rπ into (9.25) gives

A^{−1} rπ = rπ − 1n dπᵀ rπ + Pπ A^{−1} rπ.

This equation is valid, as proven below. Rearranging this equation gives (−A^{−1} + In − 1n dπᵀ + Pπ A^{−1}) rπ = 0, and consequently,

(−In + A − 1n dπᵀ A + Pπ) A^{−1} rπ = 0.

The term in the brackets in the above equation is zero because −In + A − 1n dπᵀ A + Pπ = −In + (In − Pπ + 1n dπᵀ) − 1n dπᵀ (In − Pπ + 1n dπᵀ) + Pπ = 0. Therefore, vπ* in (9.24) is a solution.
- Step 2: Derive the general expression of the solutions.
Substituting r̄π = dπᵀ rπ into (9.23) gives

vπ = rπ − 1n dπᵀ rπ + Pπ vπ    (9.25)

and consequently

(In − Pπ) vπ = (In − 1n dπᵀ) rπ.    (9.26)

It is noted that In − Pπ is singular because (In − Pπ) 1n = 0 for any π. Therefore, the solution of (9.26) is not unique: if vπ* is a solution, then vπ* + x is also a solution for any x ∈ Null(In − Pπ). When Pπ is irreducible, Null(In − Pπ) = span{1n}. Then, any solution of the Poisson equation has the expression vπ* + c 1n, where c ∈ R.
- Step 3: Show that A = In − Pπ + 1n dπᵀ is invertible.
Since vπ* involves A^{−1}, it is necessary to show that A is invertible. The analysis is summarized in the following lemma.

Lemma 9.3. The matrix In − Pπ + 1n dπᵀ is invertible, and its inverse is

[In − (Pπ − 1n dπᵀ)]^{−1} = Σ_{k=1}^{∞} (Pπ^k − 1n dπᵀ) + In.

Proof. First of all, we state some preliminary facts without proof. Let ρ(M) be the spectral radius of a matrix M. Then, I − M is invertible if ρ(M) < 1. Moreover, ρ(M) < 1 if and only if lim_{k→∞} M^k = 0.
Based on the above facts, we next show that lim_{k→∞} (Pπ − 1n dπᵀ)^k = 0, and then the invertibility of In − (Pπ − 1n dπᵀ) immediately follows. To do that, we notice that

(Pπ − 1n dπᵀ)^k = Pπ^k − 1n dπᵀ,    k ≥ 1,    (9.27)

which can be proven by induction. For instance, when k = 1, the equation is valid. When k = 2, we have

(Pπ − 1n dπᵀ)² = (Pπ − 1n dπᵀ)(Pπ − 1n dπᵀ)
              = Pπ² − Pπ 1n dπᵀ − 1n dπᵀ Pπ + 1n dπᵀ 1n dπᵀ
              = Pπ² − 1n dπᵀ,

where the last equality is due to Pπ 1n = 1n, dπᵀ Pπ = dπᵀ, and dπᵀ 1n = 1. The case of k ≥ 3 can be proven similarly.
Since dπ is the stationary distribution of the state, it holds that lim_{k→∞} Pπ^k = 1n dπᵀ (see Box 8.1). Therefore, (9.27) implies that

lim_{k→∞} (Pπ − 1n dπᵀ)^k = lim_{k→∞} Pπ^k − 1n dπᵀ = 0.

As a result, ρ(Pπ − 1n dπᵀ) < 1 and hence In − (Pπ − 1n dπᵀ) is invertible. Furthermore, the inverse of this matrix is given by

[In − (Pπ − 1n dπᵀ)]^{−1} = Σ_{k=0}^{∞} (Pπ − 1n dπᵀ)^k
                          = In + Σ_{k=1}^{∞} (Pπ − 1n dπᵀ)^k
                          = In + Σ_{k=1}^{∞} (Pπ^k − 1n dπᵀ)
                          = Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) + 1n dπᵀ.

The proof is complete.

The proof of Lemma 9.3 is inspired by [66]. However, the result (In − Pπ + 1n dπᵀ)^{−1} = Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) given in [66] (the statement above equation (16) in [66]) is inaccurate because Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) is singular: Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) 1n = 0. Lemma 9.3 corrects this inaccuracy.

Derivation of gradients

Although the value of vπ is not unique in the undiscounted case, as shown in Theorem 9.4,
the value of r̄π is unique. In particular, it follows from the Poisson equation that

r̄π 1n = rπ + (Pπ − In) vπ
      = rπ + (Pπ − In)(vπ* + c 1n)
      = rπ + (Pπ − In) vπ*.

Notably, the undetermined value c is canceled and hence r̄π is unique. Therefore, we
can calculate the gradient of r̄π in the undiscounted case. In addition, since vπ is not
unique, v̄π is not unique either. We do not study the gradient of v̄π in the undiscounted
case. For interested readers, it is worth mentioning that we can add more constraints to
uniquely solve vπ from the Poisson equation. For example, by assuming that a recurrent
state exists, the state value of this recurrent state is zero [65, Section II], and hence c can
be determined. There are also other ways to uniquely determine vπ . See, for example,
equations (8.6.5)-(8.6.7) in [2].
The gradient of r̄π in the undiscounted case is given below.

Theorem 9.5 (Gradient of r̄π in the undiscounted case). In the undiscounted case, the gradient of the average reward r̄π is

∇θ r̄π = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
       = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],    (9.28)

where S ∼ dπ and A ∼ π(S, θ).

Compared to the discounted case shown in Theorem 9.3, the gradient of r̄π in the
undiscounted case is more elegant in the sense that (9.28) is strictly valid and S obeys
the stationary distribution.
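Theorem 9.5 can also be verified numerically. The sketch below (assuming NumPy, a tabular softmax policy, and a hypothetical 2-state, 2-action MDP with made-up numbers) computes dπ, r̄π, and the differential values via Theorem 9.4, evaluates the right-hand side of (9.28), and compares it with a finite-difference approximation of ∇θ r̄π; the two agree up to numerical error:

import numpy as np

n_s, n_a = 2, 2
# Hypothetical MDP (made-up numbers, for illustration only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])       # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])         # r[s, a]

def pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def quantities(theta):
    """Return d_pi, r_bar_pi, and q_pi for the softmax policy given by theta."""
    p = pi(theta)
    P_pi = np.einsum('sa,sax->sx', p, P)
    r_pi = (p * r).sum(axis=1)
    d_pi = np.ones(n_s) / n_s
    for _ in range(5000):                      # d^T <- d^T P_pi
        d_pi = d_pi @ P_pi
    r_bar = d_pi @ r_pi
    # v* from Theorem 9.4; q(s, a) = r(s, a) - r_bar + sum_s' p(s'|s, a) v(s').
    v = np.linalg.solve(np.eye(n_s) - P_pi + np.outer(np.ones(n_s), d_pi), r_pi)
    q = r - r_bar + P @ v
    return d_pi, r_bar, q

theta = np.array([[0.2, -0.1], [0.3, 0.0]])
d_pi, r_bar, q = quantities(theta)
p = pi(theta)

# Analytic gradient (9.28): sum_s d_pi(s) sum_a grad pi(a|s, theta) q(s, a).
grad = np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        dpi = p[s] * ((np.arange(n_a) == a) - p[s, a])   # softmax derivative
        grad[s, a] = d_pi[s] * (dpi @ q[s])

# Finite-difference check: (9.28) is exact in the undiscounted case.
eps, g_fd = 1e-6, np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        t1, t2 = theta.copy(), theta.copy()
        t1[s, a] += eps
        t2[s, a] -= eps
        g_fd[s, a] = (quantities(t1)[1] - quantities(t2)[1]) / (2 * eps)

print(np.allclose(grad, g_fd, atol=1e-5))      # expected: True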

Box 9.6: Proof of Theorem 9.5


First of all, it follows from vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a) that

∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ) qπ(s, a) ]
         = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ∇θ qπ(s, a) ],    (9.29)

where qπ(s, a) is the action value satisfying

qπ(s, a) = Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′)
         = r(s, a) − r̄π + Σ_{s′} p(s′|s, a) vπ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

∇θ qπ(s, a) = 0 − ∇θ r̄π + Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).

Substituting this result into (9.29) yields

∇θ vπ(s) = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ( −∇θ r̄π + Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) ) ]
         = Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a) − ∇θ r̄π + Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).    (9.30)

Let

u(s) ≐ Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Since Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) = Σ_{s′∈S} p(s′|s) ∇θ vπ(s′), equation (9.30) can be written in matrix-vector form. Stacking ∇θ vπ(s) over all s ∈ S into ∇θ vπ ∈ R^{mn} and u(s) into u ∈ R^{mn}, where n = |S|, m is the dimension of θ, and ⊗ is the Kronecker product, gives

∇θ vπ = u − 1n ⊗ ∇θ r̄π + (Pπ ⊗ Im) ∇θ vπ,

and hence

1n ⊗ ∇θ r̄π = u + (Pπ ⊗ Im) ∇θ vπ − ∇θ vπ.

Multiplying dπᵀ ⊗ Im on both sides of the above equation gives

(dπᵀ 1n) ⊗ ∇θ r̄π = (dπᵀ ⊗ Im) u + [(dπᵀ Pπ) ⊗ Im] ∇θ vπ − (dπᵀ ⊗ Im) ∇θ vπ
                  = (dπᵀ ⊗ Im) u,

which implies

∇θ r̄π = (dπᵀ ⊗ Im) u
       = Σ_{s∈S} dπ(s) u(s)
       = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

9.4 Monte Carlo policy gradient (REINFORCE)


With the gradient presented in Theorem 9.1, we next show how to use the gradient-based
method to optimize the metrics to obtain optimal policies.
The gradient-ascent algorithm for maximizing J(θ) is

θt+1 = θt + α ∇θ J(θt)
     = θt + α E[ ∇θ ln π(A|S, θt) qπ(S, A) ],    (9.31)

where α > 0 is a constant learning rate. Since the true gradient in (9.31) is unknown, we can replace the true gradient with a stochastic gradient to obtain the following algorithm:

θt+1 = θt + α ∇θ ln π(at|st, θt) qt(st, at),    (9.32)

where qt(st, at) is an approximation of qπ(st, at). If qt(st, at) is obtained by Monte Carlo estimation, the algorithm is called REINFORCE [68] or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.
The algorithm in (9.32) is important since many other policy gradient algorithms can be obtained by extending it. We next examine the interpretation of (9.32) more closely. Since ∇θ ln π(at|st, θt) = ∇θ π(at|st, θt) / π(at|st, θt), we can rewrite (9.32) as

θt+1 = θt + α ( qt(st, at) / π(at|st, θt) ) ∇θ π(at|st, θt).

Denoting the coefficient βt ≐ qt(st, at) / π(at|st, θt), the update can be written concisely as

θt+1 = θt + α βt ∇θ π(at|st, θt).    (9.33)

Two important interpretations can be seen from this equation.

- First, since (9.33) is a simple gradient-ascent algorithm, the following observations can be obtained.
  - If βt ≥ 0, the probability of choosing (st, at) is enhanced. That is,

    π(at|st, θt+1) ≥ π(at|st, θt).

    The greater βt is, the stronger the enhancement is.
  - If βt < 0, the probability of choosing (st, at) decreases. That is,

    π(at|st, θt+1) < π(at|st, θt).

The above observations can be proven as follows. When θt+1 − θt is sufficiently small, it follows from the Taylor expansion that

π(at|st, θt+1) ≈ π(at|st, θt) + (∇θ π(at|st, θt))ᵀ (θt+1 − θt)
              = π(at|st, θt) + α βt (∇θ π(at|st, θt))ᵀ (∇θ π(at|st, θt))    (substituting (9.33))
              = π(at|st, θt) + α βt ‖∇θ π(at|st, θt)‖².

It is clear that π(at|st, θt+1) ≥ π(at|st, θt) when βt ≥ 0 and π(at|st, θt+1) < π(at|st, θt) when βt < 0.
when βt < 0.
- Second, the algorithm can strike a balance between exploration and exploitation to a certain extent due to the expression of

βt = qt(st, at) / π(at|st, θt).

On the one hand, βt is proportional to qt(st, at). As a result, if the action value of (st, at) is large, then π(at|st, θt) is enhanced so that the probability of selecting at increases. Therefore, the algorithm attempts to exploit actions with greater values. On the other hand, βt is inversely proportional to π(at|st, θt) when qt(st, at) > 0. As a result, if the probability of selecting at is small, then π(at|st, θt) is enhanced so that the probability of selecting at increases. Therefore, the algorithm attempts to explore actions with low probabilities.

Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is
important to understand how the samples should be obtained.

- How to sample S? S in the true gradient E[∇θ ln π(A|S, θt) qπ(S, A)] should obey the distribution η, which is either the stationary distribution dπ or the discounted total probability distribution ρπ in (9.19). Either dπ or ρπ represents the long-term behavior exhibited under π.
- How to sample A? A in E[∇θ ln π(A|S, θt) qπ(S, A)] should obey the distribution π(A|S, θ). The ideal way to sample A is to select at following π(a|st, θt). Therefore, the policy gradient algorithm is on-policy.

Unfortunately, the ideal ways of sampling S and A are not strictly followed in practice due to their low sample efficiency. A more sample-efficient implementation of (9.32) is given in Algorithm 9.1 below. In this implementation, an episode is first generated by following π(θ). Then, θ is updated multiple times using every experience sample in the episode.

Algorithm 9.1: Policy Gradient by Monte Carlo (REINFORCE)

Initialization: Initial parameter θ; γ ∈ (0, 1); α > 0.
Goal: Learn an optimal policy for maximizing J(θ).
For each episode, do
    Generate an episode {s0, a0, r1, . . . , s_{T−1}, a_{T−1}, r_T} following π(θ).
    For t = 0, 1, . . . , T − 1:
        Value update: qt(st, at) = Σ_{k=t+1}^{T} γ^{k−t−1} r_k
        Policy update: θ ← θ + α ∇θ ln π(at|st, θ) qt(st, at)
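A minimal Python sketch of Algorithm 9.1 is given below. It assumes NumPy, a tabular softmax policy with one parameter per state-action pair, and a hypothetical episodic environment exposing reset() and step(a) with the interface indicated in the comments; it illustrates the update (9.32) rather than providing a tuned implementation:

import numpy as np

def reinforce(env, n_states, n_actions, gamma=0.99, alpha=0.01, episodes=1000):
    """Monte Carlo policy gradient with a tabular softmax policy theta[s, a]."""
    rng = np.random.default_rng(0)
    theta = np.zeros((n_states, n_actions))

    def policy(s):
        e = np.exp(theta[s] - theta[s].max())
        return e / e.sum()

    for _ in range(episodes):
        # Generate an episode {s0, a0, r1, ..., s_{T-1}, a_{T-1}, r_T} following pi(theta).
        s, done, traj = env.reset(), False, []
        while not done:
            a = rng.choice(n_actions, p=policy(s))
            s_next, r, done = env.step(a)      # assumed environment interface
            traj.append((s, a, r))
            s = s_next

        # Backward pass: q_t = sum_{k=t+1}^{T} gamma^(k-t-1) r_k is the return from t.
        g = 0.0
        for s, a, r in reversed(traj):
            g = r + gamma * g
            p = policy(s)
            # grad_theta ln pi(a|s, theta) for the tabular softmax: one-hot(a) - pi(.|s).
            grad_log = -p
            grad_log[a] += 1.0
            theta[s] += alpha * grad_log * g   # policy update (9.32)
    return theta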


9.5 Summary
This chapter introduced the policy gradient method, which is the foundation of many modern reinforcement learning algorithms. Policy gradient methods are policy-based, which is a big step forward in this book because all the methods in the previous chapters are value-based. The basic idea of the policy gradient method is simple: select an appropriate scalar metric and then optimize it via a gradient-ascent algorithm.
The most complicated part of the policy gradient method is the derivation of the
gradients of the metrics. That is because we have to distinguish various scenarios with
different metrics and discounted/undiscounted cases. Fortunately, the expressions of the
gradients in different scenarios are similar. Hence, we summarized the expressions in
Theorem 9.1, which is the most important theoretical result in this chapter. For many
readers, it is sufficient to be aware of this theorem; its proof is nontrivial and not all readers are required to study it.
The policy gradient algorithm in (9.32) must be properly understood since it is the
foundation of many advanced policy gradient algorithms. In the next chapter, this algo-
rithm will be extended to another important policy gradient method called actor-critic.

9.6 Q&A
- Q: What is the basic idea of the policy gradient method?
A: The basic idea is simple: define an appropriate scalar metric, derive its gradient, and then use gradient-ascent methods to optimize the metric. The most important theoretical result regarding this method is the policy gradient given in Theorem 9.1.
- Q: What is the most complicated part of the policy gradient method?
A: The basic idea of the policy gradient method is simple. However, the derivation procedure of the gradients is quite complicated. That is because we have to distinguish numerous different scenarios, and the mathematical derivation in each scenario is nontrivial. It is sufficient for many readers to be familiar with the result in Theorem 9.1 without knowing the proof.
- Q: What metrics should be used in the policy gradient method?
A: We introduced three common metrics in this chapter: v̄π, v̄π0, and r̄π. Since they all lead to similar policy gradients, they can all be adopted in the policy gradient method. More importantly, the expressions in (9.1) and (9.4) are often encountered in the literature.
- Q: Why is a natural logarithm function contained in the policy gradient?


A: The natural logarithm is introduced to express the gradient as an expected value. In this way, we can approximate the true gradient with a stochastic one.
- Q: Why do we need to study the undiscounted case when deriving the policy gradient?
A: First, for continuing tasks, it may be inappropriate to introduce the discount rate, and we need to consider the undiscounted case. Second, the definition of the average reward r̄π is valid for both discounted and undiscounted cases. While the gradient of r̄π in the discounted case is an approximation, its gradient in the undiscounted case is exact and hence more elegant.
- Q: What does the policy gradient algorithm in (9.32) do mathematically?
A: To better understand this algorithm, readers are recommended to examine its concise expression in (9.33), which clearly shows that it is a gradient-ascent algorithm for updating the value of π(at|st, θt). That is, when a sample (st, at) is available, the policy is updated so that π(at|st, θt+1) ≥ π(at|st, θt) or π(at|st, θt+1) < π(at|st, θt), depending on the sign of the coefficient βt.
