
Chapter 9

Policy Gradient Methods

[Figure 9.1: Where we are in this book. The chapter map shows Chapter 1 (Basic Concepts) and Chapters 2-3 (Bellman Equation, Bellman Optimality Equation) as fundamental tools; Chapters 4-7 (Value Iteration & Policy Iteration, Monte Carlo Methods, Stochastic Approximation, Temporal-Difference Methods) as algorithms moving from model-based to model-free; Chapter 8 (Value Function Approximation) moving from tabular to function representations; and Chapters 9-10 (Policy Gradient Methods, Actor-Critic Methods) moving from value-based to policy-based methods.]

The idea of function approximation can be applied not only to represent state/action
values, as introduced in Chapter 8, but also to represent policies, as introduced in this
chapter. So far in this book, policies have been represented by tables: the action prob-
abilities of all states are stored in a table (e.g., Table 9.1). In this chapter, we show
that policies can be represented by parameterized functions denoted as π(a|s, θ), where
θ ∈ Rm is a parameter vector. It can also be written in other forms such as πθ (a|s),
πθ (a, s), or π(a, s, θ).
When policies are represented as functions, optimal policies can be obtained by optimizing certain scalar metrics. Such a method is called policy gradient.

        a1           a2           a3           a4           a5
s1      π(a1|s1)     π(a2|s1)     π(a3|s1)     π(a4|s1)     π(a5|s1)
⋮       ⋮            ⋮            ⋮            ⋮            ⋮
s9      π(a1|s9)     π(a2|s9)     π(a3|s9)     π(a4|s9)     π(a5|s9)

Table 9.1: A tabular representation of a policy. There are nine states and five actions for each state.

[Figure 9.2: Function representations of policies. The functions may have different structures. In (a), the function takes the state s and an action a as inputs (with parameter θ) and outputs the single probability π(a|s, θ); in (b), it takes the state s as input and outputs π(a1|s, θ), . . . , π(a5|s, θ) for all actions.]

The policy gradient method is a big step forward in this book because it is policy-based. By contrast, all the previous chapters in this book discuss value-based methods. The advantages of the policy gradient method are numerous. For example, it is more efficient for handling large state/action spaces. It also has stronger generalization abilities and hence is more efficient in terms of sample usage.

9.1 Policy representation: From table to function


When the representation of a policy is switched from a table to a function, it is necessary
to clarify the difference between the two representation methods.

- First, how to define optimal policies? When represented as a table, a policy is defined as optimal if it can maximize every state value. When represented by a function, a policy is defined as optimal if it can maximize certain scalar metrics.
- Second, how to update a policy? When represented by a table, a policy can be updated by directly changing the entries in the table. When represented by a parameterized function, a policy can no longer be updated in this way. Instead, it can only be updated by changing the parameter θ.
- Third, how to retrieve the probability of an action? In the tabular case, the probability of an action can be directly obtained by looking up the corresponding entry in the table. In the case of function representation, we need to input (s, a) into the function to calculate its probability (see Figure 9.2(a)). Depending on the structure of the function, we can also input a state and then output the probabilities of all actions (see Figure 9.2(b)). A small sketch contrasting these two query styles is given below.
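As an illustration of the two query styles, consider the minimal sketch below. NumPy, the softmax parameterization, and the sizes are assumptions made only for illustration; they are not prescribed by the text above.

import numpy as np

n_states, n_actions = 9, 5                       # matches Table 9.1: nine states, five actions
rng = np.random.default_rng(0)
theta = rng.normal(size=(n_states, n_actions))   # parameters of the policy function

def pi_all(s, theta):
    """Style of Figure 9.2(b): input the state s, output pi(a|s, theta) for all actions."""
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def pi_single(s, a, theta):
    """Style of Figure 9.2(a): input (s, a), output the single probability pi(a|s, theta)."""
    return pi_all(s, theta)[a]

print(pi_all(0, theta))        # probabilities of all five actions at the first state
print(pi_single(0, 2, theta))  # probability of the third action at the first state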


The basic idea of the policy gradient method is summarized below. Suppose that J(θ)
is a scalar metric. Optimal policies can be obtained by optimizing this metric via the
gradient-based algorithm:

θt+1 = θt + α∇θ J(θt ),

where ∇θ J is the gradient of J with respect to θ, t is the time step, and α is the learning rate.
With this basic idea, we will answer the following three questions in the remainder of
this chapter.

- What metrics should be used? (Section 9.2)
- How to calculate the gradients of the metrics? (Section 9.3)
- How to use experience samples to calculate the gradients? (Section 9.4)

9.2 Metrics for defining optimal policies


If a policy is represented by a function, there are two types of metrics for defining optimal
policies. One is based on state values and the other is based on immediate rewards.

Metric 1: Average state value

The first metric is the average state value, or simply the average value. It is defined as

v̄π = Σ_{s∈S} d(s) vπ(s),

where d(s) is the weight of state s. It satisfies d(s) ≥ 0 for any s ∈ S and Σ_{s∈S} d(s) = 1. Therefore, we can interpret d(s) as a probability distribution over s. Then, the metric can be written as

v̄π = E_{S∼d}[vπ(S)].

How to select the distribution d? This is an important question. There are two cases.

- The first and simplest case is that d is independent of the policy π. In this case, we specifically denote d as d0 and v̄π as v̄π0 to indicate that the distribution is independent of the policy. One option is to treat all the states as equally important and select d0(s) = 1/|S|. Another case is when we are only interested in a specific state s0 (e.g., the agent always starts from s0). In this case, we can design

d0(s0) = 1,    d0(s ≠ s0) = 0.


- The second case is that d depends on the policy π. In this case, it is common to select d as dπ, which is the stationary distribution under π. One basic property of dπ is that it satisfies

dπᵀ Pπ = dπᵀ,

where Pπ is the state transition probability matrix under π. More information about the stationary distribution can be found in Box 8.1. A small numerical sketch of computing dπ is given after this list.
The interpretation of selecting dπ is as follows. The stationary distribution reflects the long-term behavior of a Markov decision process under a given policy. If one state is frequently visited in the long term, it is more important and deserves a higher weight; if a state is rarely visited, its importance is low and it deserves a lower weight.
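As a small numerical sketch (assuming NumPy and a hypothetical 3-state transition matrix; the numbers are made up for illustration), the stationary distribution dπ can be computed by iterating dᵀ ← dᵀPπ until convergence:

import numpy as np

# Hypothetical state transition matrix P_pi under some policy pi (rows sum to 1).
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.3, 0.5]])

# Power iteration: d^T <- d^T P_pi until the distribution stops changing.
d = np.ones(3) / 3                     # start from the uniform distribution
for _ in range(1000):
    d_next = d @ P_pi
    if np.max(np.abs(d_next - d)) < 1e-12:
        break
    d = d_next

print(d, d @ P_pi)                     # d should satisfy d^T P_pi = d^T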

As its name suggests, v̄π is a weighted average of the state values. Different values
of θ lead to different values of v̄π . Our ultimate goal is to find an optimal policy (or
equivalently an optimal θ) to maximize v̄π .
We next introduce another two important equivalent expressions of v̄π .

- Suppose that an agent collects rewards {Rt+1}_{t=0}^{∞} by following a given policy π(θ). Readers may often see the following metric in the literature:

J(θ) = lim_{n→∞} E[ Σ_{t=0}^{n} γ^t Rt+1 ] = E[ Σ_{t=0}^{∞} γ^t Rt+1 ].    (9.1)

This metric may be nontrivial to interpret at first glance. In fact, it is equal to v̄π. To see that, we have

E[ Σ_{t=0}^{∞} γ^t Rt+1 ] = Σ_{s∈S} d(s) E[ Σ_{t=0}^{∞} γ^t Rt+1 | S0 = s ]
                          = Σ_{s∈S} d(s) vπ(s)
                          = v̄π.

The first equality in the above equation is due to the law of total expectation. The second equality is by the definition of state values.
- The metric v̄π can also be rewritten as the inner product of two vectors. In particular, let

vπ = [. . . , vπ(s), . . . ]ᵀ ∈ R^{|S|},
d = [. . . , d(s), . . . ]ᵀ ∈ R^{|S|}.

Then, we have

v̄π = dᵀ vπ.

This expression will be useful when we analyze its gradient; a numerical sketch is given below.
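As a minimal sketch of this inner-product expression (assuming NumPy and a hypothetical 3-state, 2-action MDP with made-up numbers), the code below solves the Bellman equation vπ = rπ + γPπvπ for vπ and evaluates v̄π = dᵀvπ under a uniform d:

import numpy as np

gamma = 0.9
# Hypothetical MDP: P[s, a, s'] and r[s, a] are made-up numbers for illustration.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.0, 0.7, 0.3], [0.2, 0.2, 0.6]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 1.0]])
pi = np.array([[0.5, 0.5],                       # pi[s, a] = pi(a|s)
               [0.2, 0.8],
               [0.9, 0.1]])

P_pi = np.einsum('sa,sax->sx', pi, P)            # state transition matrix under pi
r_pi = (pi * r).sum(axis=1)                      # expected one-step reward under pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)   # Bellman equation solution

d = np.ones(3) / 3                               # uniform state distribution
print(d @ v_pi)                                  # v_bar_pi = d^T v_pi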

Metric 2: Average reward

The second metric is the average one-step reward, or simply the average reward [2, 64, 65]. In particular, it is defined as

r̄π ≐ Σ_{s∈S} dπ(s) rπ(s) = E_{S∼dπ}[rπ(S)],    (9.2)

where dπ is the stationary distribution and

rπ(s) ≐ Σ_{a∈A} π(a|s, θ) r(s, a) = E_{A∼π(s,θ)}[r(s, A) | s]    (9.3)

is the expectation of the immediate rewards. Here, r(s, a) ≐ E[R | s, a] = Σ_r r p(r|s, a).
We next present two other important equivalent expressions of r̄π.

- Suppose that the agent collects rewards {Rt+1}_{t=0}^{∞} by following a given policy π(θ). A common metric that readers may often see in the literature is

J(θ) = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ].    (9.4)

It may seem nontrivial to interpret this metric at first glance. In fact, it is equal to r̄π:

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ] = Σ_{s∈S} dπ(s) rπ(s) = r̄π.    (9.5)

The proof of (9.5) is given in Box 9.1.


- The average reward r̄π in (9.2) can also be written as the inner product of two vectors. In particular, let

rπ = [. . . , rπ(s), . . . ]ᵀ ∈ R^{|S|},
dπ = [. . . , dπ(s), . . . ]ᵀ ∈ R^{|S|},

where rπ(s) is defined in (9.3). Then, it is clear that

r̄π = Σ_{s∈S} dπ(s) rπ(s) = dπᵀ rπ.

This expression will be useful when we derive its gradient. A simulation-based sketch of (9.5) is given below.
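To illustrate (9.5), the sketch below (assuming NumPy and a hypothetical 2-state, 2-action MDP with made-up numbers) simulates a long trajectory under a fixed policy and compares the empirical average reward with dπᵀrπ:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical MDP and policy (made-up numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.1, 0.9]]])      # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])        # r[s, a]
pi = np.array([[0.3, 0.7], [0.6, 0.4]])       # pi[s, a] = pi(a|s)

# Exact value: r_bar_pi = d_pi^T r_pi, with d_pi obtained from d^T P_pi = d^T.
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = (pi * r).sum(axis=1)
d_pi = np.ones(2) / 2
for _ in range(5000):
    d_pi = d_pi @ P_pi
print("d_pi^T r_pi =", d_pi @ r_pi)

# Empirical value: (1/n) times the sum of rewards along one long trajectory.
s, total, n = 0, 0.0, 100_000
for _ in range(n):
    a = rng.choice(2, p=pi[s])
    total += r[s, a]                          # use the expected reward r(s, a)
    s = rng.choice(2, p=P[s, a])
print("empirical average =", total / n)       # should be close to d_pi^T r_pi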

Box 9.1: Proof of (9.5)

Step 1: We first prove that the following equation is valid for any starting state s0 ∈ S:

r̄π = lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s0 ].    (9.6)

To do that, we notice

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s0 ] = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} E[Rt+1 | S0 = s0]
                                                  = lim_{t→∞} E[Rt+1 | S0 = s0],    (9.7)

where the last equality is due to the property of the Cesàro mean (also called the Cesàro summation). In particular, if {ak}_{k=1}^{∞} is a convergent sequence such that lim_{k→∞} ak exists, then {(1/n) Σ_{k=1}^{n} ak}_{n=1}^{∞} is also a convergent sequence and lim_{n→∞} (1/n) Σ_{k=1}^{n} ak = lim_{k→∞} ak.
We next examine E[Rt+1 | S0 = s0] in (9.7) more closely. By the law of total expectation, we have

E[Rt+1 | S0 = s0] = Σ_{s∈S} E[Rt+1 | St = s, S0 = s0] p^{(t)}(s|s0)
                  = Σ_{s∈S} E[Rt+1 | St = s] p^{(t)}(s|s0)
                  = Σ_{s∈S} rπ(s) p^{(t)}(s|s0),

where p^{(t)}(s|s0) denotes the probability of transitioning from s0 to s in exactly t steps. The second equality in the above equation is due to the Markov memoryless property: the reward obtained at the next time step depends only on the current state rather than the previous ones.
Note that

lim_{t→∞} p^{(t)}(s|s0) = dπ(s)

by the definition of the stationary distribution. As a result, the starting state s0 does not matter. Then, we have

lim_{t→∞} E[Rt+1 | S0 = s0] = lim_{t→∞} Σ_{s∈S} rπ(s) p^{(t)}(s|s0) = Σ_{s∈S} rπ(s) dπ(s) = r̄π.

Substituting the above equation into (9.7) gives (9.6).

Step 2: Consider an arbitrary state distribution d. By the law of total expectation, we have

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ] = lim_{n→∞} (1/n) Σ_{s∈S} d(s) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s ]
                                        = Σ_{s∈S} d(s) lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 | S0 = s ].

Since (9.6) is valid for any starting state, substituting (9.6) into the above equation yields

lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ] = Σ_{s∈S} d(s) r̄π = r̄π.

The proof is complete.

Some remarks

Metric    Expression 1                  Expression 2          Expression 3
v̄π        Σ_{s∈S} d(s) vπ(s)            E_{S∼d}[vπ(S)]        lim_{n→∞} E[ Σ_{t=0}^{n} γ^t Rt+1 ]
r̄π        Σ_{s∈S} dπ(s) rπ(s)           E_{S∼dπ}[rπ(S)]       lim_{n→∞} (1/n) E[ Σ_{t=0}^{n−1} Rt+1 ]

Table 9.2: Summary of the different but equivalent expressions of v̄π and r̄π.

Up to now, we have introduced two types of metrics: v̄π and r̄π . Each metric has
several different but equivalent expressions. They are summarized in Table 9.2. We
sometimes use v̄π to specifically refer to the case where the state distribution is the
stationary distribution dπ and use v̄π0 to refer to the case where d0 is independent of π.
Some remarks about the metrics are given below.

- All these metrics are functions of π. Since π is parameterized by θ, these metrics are functions of θ. In other words, different values of θ generate different metric values. Therefore, we can search for the optimal values of θ to maximize these metrics. This is the basic idea of policy gradient methods.


- The two metrics v̄π and r̄π are equivalent in the discounted case where γ < 1. In particular, it can be shown that

r̄π = (1 − γ) v̄π.

The above equation indicates that these two metrics can be simultaneously maximized. The proof of this equation is given later in Lemma 9.1.

9.3 Gradients of the metrics


Given the metrics introduced in the last section, we can use gradient-based methods to
maximize them. To do that, we need to first calculate the gradients of these metrics.
The most important theoretical result in this chapter is the following theorem.

Theorem 9.1 (Policy gradient theorem). The gradient of J(θ) is

∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a),    (9.8)

where η is a state distribution and ∇θ π is the gradient of π with respect to θ. Moreover, (9.8) has a compact form expressed in terms of an expectation:

∇θ J(θ) = E_{S∼η, A∼π(S,θ)}[ ∇θ ln π(A|S, θ) qπ(S, A) ],    (9.9)

where ln is the natural logarithm.

Some important remarks about Theorem 9.1 are given below.

- It should be noted that Theorem 9.1 is a summary of the results in Theorem 9.2, Theorem 9.3, and Theorem 9.5. These three theorems address different scenarios involving different metrics and discounted/undiscounted cases. The gradients in these scenarios all have similar expressions and hence are summarized in Theorem 9.1. The specific expressions of J(θ) and η are not given in Theorem 9.1 and can be found in Theorem 9.2, Theorem 9.3, and Theorem 9.5. In particular, J(θ) could be v̄π0, v̄π, or r̄π. The equality in (9.8) may be either exact or an approximation. The distribution η also varies in different scenarios.
The derivation of the gradients is the most complicated part of the policy gradient method. For many readers, it is sufficient to be familiar with the result in Theorem 9.1 without knowing the proof. The derivation details presented in the rest of this section are mathematically intensive. Readers are suggested to study selectively based on their interests.


- The expression in (9.9) is more favorable than (9.8) because it is expressed as an expectation. We will show in Section 9.4 that this true gradient can be approximated by a stochastic gradient.
Why can (9.8) be expressed as (9.9)? The proof is given below. By the definition of expectation, (9.8) can be rewritten as

∇θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
        = E_{S∼η}[ Σ_{a∈A} ∇θ π(a|S, θ) qπ(S, a) ].    (9.10)

Furthermore, the gradient of ln π(a|s, θ) is

∇θ ln π(a|s, θ) = ∇θ π(a|s, θ) / π(a|s, θ).

It follows that

∇θ π(a|s, θ) = π(a|s, θ) ∇θ ln π(a|s, θ).    (9.11)

Substituting (9.11) into (9.10) gives

∇θ J(θ) = E_{S∼η}[ Σ_{a∈A} π(a|S, θ) ∇θ ln π(a|S, θ) qπ(S, a) ]
        = E_{S∼η, A∼π(S,θ)}[ ∇θ ln π(A|S, θ) qπ(S, A) ].

- It is notable that π(a|s, θ) must be positive for all (s, a) to ensure that ln π(a|s, θ) is valid. This can be achieved by using softmax functions:

π(a|s, θ) = e^{h(s,a,θ)} / Σ_{a′∈A} e^{h(s,a′,θ)},    a ∈ A,    (9.12)

where h(s, a, θ) is a function indicating the preference for selecting a at s. The policy in (9.12) satisfies π(a|s, θ) ∈ [0, 1] and Σ_{a∈A} π(a|s, θ) = 1 for any s ∈ S. This policy can be realized by a neural network. The input of the network is s. The output layer is a softmax layer so that the network outputs π(a|s, θ) for all a and the sum of the outputs is equal to 1. See Figure 9.2(b) for an illustration. A small sketch of such a softmax policy is given after this list.
Since π(a|s, θ) > 0 for all a, the policy is stochastic and hence exploratory. The policy does not directly tell which action to take. Instead, the action should be generated according to the probability distribution of the policy.
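As a minimal sketch of such a softmax policy (assuming NumPy, a linear preference function h(s, a, θ) = θᵀx(s, a) with a hypothetical one-hot feature x(s, a), and made-up sizes), the code below evaluates π(a|s, θ) and the log-gradient ∇θ ln π(a|s, θ) that appears in (9.9):

import numpy as np

n_states, n_actions = 3, 2
dim = n_states * n_actions                      # theta in R^m with m = |S||A|

def feature(s, a):
    """Hypothetical one-hot feature x(s, a); h(s, a, theta) = theta^T x(s, a)."""
    x = np.zeros(dim)
    x[s * n_actions + a] = 1.0
    return x

def policy(s, theta):
    """Softmax policy (9.12): pi(a|s, theta) = exp(h(s,a,theta)) / sum_a' exp(h(s,a',theta))."""
    h = np.array([theta @ feature(s, a) for a in range(n_actions)])
    e = np.exp(h - h.max())                     # subtract the max for numerical stability
    return e / e.sum()

def grad_log_pi(s, a, theta):
    """grad_theta ln pi(a|s, theta) = x(s, a) - sum_a' pi(a'|s, theta) x(s, a')."""
    p = policy(s, theta)
    return feature(s, a) - sum(p[b] * feature(s, b) for b in range(n_actions))

theta = np.zeros(dim)
print(policy(0, theta))                         # uniform probabilities when theta = 0
print(grad_log_pi(0, 1, theta))                 # gradient used in the policy gradient update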


9.3.1 Derivation of the gradients in the discounted case


We next derive the gradients of the metrics in the discounted case where γ ∈ (0, 1). The state value and action value in the discounted case are defined as

vπ(s) = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s],
qπ(s, a) = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s, At = a].

It holds that vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a), and the state value satisfies the Bellman equation.
First, we show that v̄π (θ) and r̄π (θ) are equivalent metrics.

Lemma 9.1 (Equivalence between v̄π(θ) and r̄π(θ)). In the discounted case where γ ∈ (0, 1), it holds that

r̄π = (1 − γ) v̄π.    (9.13)

Proof. Note that v̄π(θ) = dπᵀ vπ and r̄π(θ) = dπᵀ rπ, where vπ and rπ satisfy the Bellman equation vπ = rπ + γPπ vπ. Multiplying dπᵀ on both sides of the Bellman equation yields

v̄π = r̄π + γ dπᵀ Pπ vπ = r̄π + γ dπᵀ vπ = r̄π + γ v̄π,

which implies (9.13).

Second, the following lemma gives the gradient of vπ (s) for any s.

Lemma 9.2 (Gradient of vπ(s)). In the discounted case, it holds for any s ∈ S that

∇θ vπ(s) = Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a),    (9.14)

where

Prπ(s′|s) ≐ Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss′} = [(In − γPπ)^{−1}]_{ss′}

is the discounted total probability of transitioning from s to s′ under policy π. Here, [·]_{ss′} denotes the entry in the sth row and s′th column, and [Pπ^k]_{ss′} is the probability of transitioning from s to s′ in exactly k steps under π.

Box 9.2: Proof of Lemma 9.2


First, for any s ∈ S, it holds that

∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ) qπ(s, a) ]
         = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ∇θ qπ(s, a) ],    (9.15)

where qπ(s, a) is the action value given by

qπ(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) vπ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

∇θ qπ(s, a) = 0 + γ Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).

Substituting this result into (9.15) yields

∇θ vπ(s) = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) γ Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) ]
         = Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a) + γ Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).    (9.16)

It is notable that ∇θ vπ appears on both sides of the above equation. One way to calculate it is to use the unrolling technique [64]. Here, we use another way based on the matrix-vector form, which we believe is more straightforward to understand. In particular, let

u(s) ≐ Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Since

Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) = Σ_{s′∈S} p(s′|s) ∇θ vπ(s′) = Σ_{s′∈S} [Pπ]_{ss′} ∇θ vπ(s′),

equation (9.16) can be written in matrix-vector form. Stacking ∇θ vπ(s) over all s ∈ S into ∇θ vπ ∈ R^{mn} and u(s) into u ∈ R^{mn} gives

∇θ vπ = u + γ (Pπ ⊗ Im) ∇θ vπ.

Here, n = |S|, and m is the dimension of the parameter vector θ. The reason that the Kronecker product ⊗ emerges in the equation is that ∇θ vπ(s) is a vector. The above equation is a linear equation in ∇θ vπ, which can be solved as

∇θ vπ = (Inm − γ Pπ ⊗ Im)^{−1} u
      = (In ⊗ Im − γ Pπ ⊗ Im)^{−1} u
      = [(In − γPπ)^{−1} ⊗ Im] u.    (9.17)

For any state s, it follows from (9.17) that

∇θ vπ(s) = Σ_{s′∈S} [(In − γPπ)^{−1}]_{ss′} u(s′)
         = Σ_{s′∈S} [(In − γPπ)^{−1}]_{ss′} Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a).    (9.18)

The quantity [(In − γPπ)^{−1}]_{ss′} has a clear probabilistic interpretation. In particular, since (In − γPπ)^{−1} = I + γPπ + γ²Pπ² + · · · , we have

[(In − γPπ)^{−1}]_{ss′} = [I]_{ss′} + γ[Pπ]_{ss′} + γ²[Pπ²]_{ss′} + · · · = Σ_{k=0}^{∞} γ^k [Pπ^k]_{ss′}.

Note that [Pπ^k]_{ss′} is the probability of transitioning from s to s′ in exactly k steps (see Box 8.1). Therefore, [(In − γPπ)^{−1}]_{ss′} is the discounted total probability of transitioning from s to s′ in any number of steps. By denoting [(In − γPπ)^{−1}]_{ss′} ≐ Prπ(s′|s), equation (9.18) becomes (9.14).
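The interpretation of [(In − γPπ)^{−1}]_{ss′} as a discounted total transition probability is easy to check numerically. The sketch below (assuming NumPy and a hypothetical small Pπ with made-up numbers) compares the matrix inverse with a truncated series Σ_k γ^k Pπ^k:

import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.2, 0.1],        # hypothetical state transition matrix under pi
                 [0.1, 0.8, 0.1],
                 [0.2, 0.3, 0.5]])

inv = np.linalg.inv(np.eye(3) - gamma * P_pi)

# Truncated series I + gamma P + gamma^2 P^2 + ... (converges since gamma < 1).
series, term = np.zeros((3, 3)), np.eye(3)
for _ in range(500):
    series += term
    term = gamma * term @ P_pi

print(np.allclose(inv, series))          # expected: True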

With the results in Lemma 9.2, we are ready to derive the gradient of v̄π0 .

Theorem 9.2 (Gradient of v̄π0 in the discounted case). In the discounted case where γ ∈ (0, 1), the gradient of v̄π0 = d0ᵀ vπ is

∇θ v̄π0 = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ ρπ and A ∼ π(S, θ). Here, the state distribution ρπ is

ρπ(s) = Σ_{s′∈S} d0(s′) Prπ(s|s′),    s ∈ S,    (9.19)

where Prπ(s|s′) = Σ_{k=0}^{∞} γ^k [Pπ^k]_{s′s} = [(I − γPπ)^{−1}]_{s′s} is the discounted total probability of transitioning from s′ to s under policy π.

Box 9.3: Proof of Theorem 9.2

Since d0(s) is independent of π, we have

∇θ v̄π0 = ∇θ Σ_{s∈S} d0(s) vπ(s) = Σ_{s∈S} d0(s) ∇θ vπ(s).

Substituting the expression of ∇θ vπ(s) given in Lemma 9.2 into the above equation yields

∇θ v̄π0 = Σ_{s∈S} d0(s) ∇θ vπ(s) = Σ_{s∈S} d0(s) Σ_{s′∈S} Prπ(s′|s) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
        = Σ_{s′∈S} ( Σ_{s∈S} d0(s) Prπ(s′|s) ) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
        ≐ Σ_{s′∈S} ρπ(s′) Σ_{a∈A} ∇θ π(a|s′, θ) qπ(s′, a)
        = Σ_{s∈S} ρπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)    (change s′ to s)
        = Σ_{s∈S} ρπ(s) Σ_{a∈A} π(a|s, θ) ∇θ ln π(a|s, θ) qπ(s, a)
        = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ ρπ and A ∼ π(S, θ). The proof is complete.
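Theorem 9.2 can be checked numerically. The sketch below (assuming NumPy, a tabular softmax policy with one preference parameter per state-action pair, and a hypothetical 2-state, 2-action MDP with made-up numbers) evaluates the right-hand side of the theorem and compares it with a finite-difference approximation of ∇θ v̄π0:

import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
# Hypothetical MDP (made-up numbers, for illustration only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],      # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],                    # r[s, a]
              [0.5, 2.0]])
d0 = np.array([1.0, 0.0])                    # the agent always starts from state 0

def pi(theta):
    """Softmax policy; theta[s, a] plays the role of the preference h(s, a, theta)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def metric(theta):
    """Return v_pi (from the Bellman equation), P_pi, and v_bar_pi^0 = d0^T v_pi."""
    p = pi(theta)
    P_pi = np.einsum('sa,sax->sx', p, P)
    r_pi = (p * r).sum(axis=1)
    v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    return v_pi, P_pi, d0 @ v_pi

def grad_theorem_92(theta):
    """Analytic gradient: sum_s rho_pi(s) sum_a grad_theta pi(a|s, theta) q_pi(s, a)."""
    p = pi(theta)
    v_pi, P_pi, _ = metric(theta)
    q = r + gamma * P @ v_pi                                   # q_pi[s, a]
    rho = d0 @ np.linalg.inv(np.eye(n_s) - gamma * P_pi)       # rho_pi in (9.19)
    grad = np.zeros_like(theta)
    for s in range(n_s):
        for a in range(n_a):
            # For the softmax: d pi(a'|s) / d theta[s, a] = pi(a'|s) (1{a'=a} - pi(a|s)).
            dpi = p[s] * ((np.arange(n_a) == a) - p[s, a])
            grad[s, a] = rho[s] * (dpi @ q[s])
    return grad

theta = np.array([[0.3, -0.2], [0.1, 0.4]])
g = grad_theorem_92(theta)

# Finite-difference check of the gradient of v_bar_pi^0 with respect to theta.
eps, g_fd = 1e-6, np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        t1, t2 = theta.copy(), theta.copy()
        t1[s, a] += eps
        t2[s, a] -= eps
        g_fd[s, a] = (metric(t1)[2] - metric(t2)[2]) / (2 * eps)

print(np.allclose(g, g_fd, atol=1e-5))       # expected: True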

With Lemma 9.1 and Lemma 9.2, we can derive the gradients of r̄π and v̄π .

Theorem 9.3 (Gradients of r̄π and v̄π in the discounted case). In the discounted case where γ ∈ (0, 1), the gradients of r̄π and v̄π are

∇θ r̄π = (1 − γ) ∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
       = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],

where S ∼ dπ and A ∼ π(S, θ). Here, the approximation is more accurate when γ is closer to 1.

Box 9.4: Proof of Theorem 9.3


It follows from the definition of v̄π that

∇θ v̄π = ∇θ Σ_{s∈S} dπ(s) vπ(s)
       = Σ_{s∈S} (∇θ dπ(s)) vπ(s) + Σ_{s∈S} dπ(s) ∇θ vπ(s).    (9.20)

This equation contains two terms. On the one hand, substituting the expression of ∇θ vπ given in (9.17) into the second term gives

Σ_{s∈S} dπ(s) ∇θ vπ(s) = (dπᵀ ⊗ Im) ∇θ vπ
                       = (dπᵀ ⊗ Im) [(In − γPπ)^{−1} ⊗ Im] u
                       = [dπᵀ (In − γPπ)^{−1}] ⊗ Im u.    (9.21)

It is noted that

dπᵀ (In − γPπ)^{−1} = (1/(1 − γ)) dπᵀ,

which can be easily verified by multiplying (In − γPπ) on both sides of the equation. Therefore, (9.21) becomes

Σ_{s∈S} dπ(s) ∇θ vπ(s) = (1/(1 − γ)) (dπᵀ ⊗ Im) u
                       = (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

On the other hand, the first term of (9.20) involves ∇θ dπ. However, since the second term contains the factor 1/(1 − γ), the second term becomes dominant and the first term becomes negligible when γ → 1. Therefore,

∇θ v̄π ≈ (1/(1 − γ)) Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Furthermore, it follows from r̄π = (1 − γ) v̄π that

∇θ r̄π = (1 − γ) ∇θ v̄π ≈ Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
       = Σ_{s∈S} dπ(s) Σ_{a∈A} π(a|s, θ) ∇θ ln π(a|s, θ) qπ(s, a)
       = E[ ∇θ ln π(A|S, θ) qπ(S, A) ].


The approximation in the above equation requires that the first term does not go to
infinity when γ → 1. More information can be found in [66, Section 4].

9.3.2 Derivation of the gradients in the undiscounted case


We next show how to calculate the gradients of the metrics in the undiscounted case
where γ = 1. Readers may wonder why we suddenly start considering the undiscounted
case while we have only considered the discounted case so far in this book. The reasons
are as follows. First, for continuing tasks, it may be inappropriate to introduce the
discount rate and we need to consider the undiscounted case. Second, the definition of
the average reward r̄π is valid for both discounted and undiscounted cases. While the
gradient of r̄π in the discounted case is an approximation, we will see that its gradient in
the undiscounted case is more elegant.

State values and the Poisson equation

In the undiscounted case, it is necessary to redefine state and action values. Since the undiscounted sum of the rewards, E[Rt+1 + Rt+2 + Rt+3 + . . . | St = s], may diverge, the state and action values are defined in a special way [64]:

vπ(s) ≐ E[(Rt+1 − r̄π) + (Rt+2 − r̄π) + (Rt+3 − r̄π) + . . . | St = s],
qπ(s, a) ≐ E[(Rt+1 − r̄π) + (Rt+2 − r̄π) + (Rt+3 − r̄π) + . . . | St = s, At = a],

where r̄π is the average reward, which is determined when π is given. There are different names for vπ(s) in the literature, such as the differential reward [65] or bias [2, Section 8.2.1]. It can be verified that the state value defined above satisfies the following Bellman-like equation:

vπ(s) = Σ_a π(a|s, θ) [ Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′) ].    (9.22)

Since vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a), it holds that qπ(s, a) = Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′). The matrix-vector form of (9.22) is

vπ = rπ − r̄π 1n + Pπ vπ,    (9.23)

where 1n = [1, . . . , 1]ᵀ ∈ Rⁿ. Equation (9.23) is similar to the Bellman equation, and it has a specific name: the Poisson equation [65, 67].
How to solve vπ from the Poisson equation? The answer is given in the following
theorem.


Theorem 9.4 (Solution of the Poisson equation). Let

vπ* = (In − Pπ + 1n dπᵀ)^{−1} rπ.    (9.24)

Then, vπ* is a solution of the Poisson equation in (9.23). Moreover, any solution of the Poisson equation has the following form:

vπ = vπ* + c 1n,

where c ∈ R.

This theorem indicates that the solution of the Poisson equation may not be unique.
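As a quick numerical check of Theorem 9.4 (assuming NumPy and hypothetical Pπ and rπ with made-up numbers), the sketch below computes vπ* from (9.24) and verifies that it, and any shifted copy vπ* + c1n, satisfies the Poisson equation (9.23):

import numpy as np

# Hypothetical quantities under some policy pi (made-up numbers, for illustration only).
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.3, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])
n, ones = 3, np.ones(3)

# Stationary distribution d_pi from d^T P_pi = d^T (power iteration).
d_pi = ones / n
for _ in range(5000):
    d_pi = d_pi @ P_pi
r_bar = d_pi @ r_pi                                        # average reward

# Theorem 9.4: v* = (I - P_pi + 1 d_pi^T)^(-1) r_pi.
v_star = np.linalg.solve(np.eye(n) - P_pi + np.outer(ones, d_pi), r_pi)

# Check the Poisson equation (9.23): v = r_pi - r_bar 1 + P_pi v.
print(np.allclose(v_star, r_pi - r_bar * ones + P_pi @ v_star))             # expected: True
# Any shifted vector v* + c 1 is also a solution.
print(np.allclose(v_star + 5.0, r_pi - r_bar * ones + P_pi @ (v_star + 5.0)))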

Box 9.5: Proof of Theorem 9.4


We prove the theorem in three steps.
- Step 1: Show that vπ* in (9.24) is a solution of (9.25).
For the sake of simplicity, let

A ≐ In − Pπ + 1n dπᵀ.

Then, vπ* = A^{−1} rπ. The fact that A is invertible will be proven in Step 3. Substituting vπ* = A^{−1} rπ into (9.25) gives

A^{−1} rπ = rπ − 1n dπᵀ rπ + Pπ A^{−1} rπ.

This equation is valid, as proven below. Rearranging this equation gives (−A^{−1} + In − 1n dπᵀ + Pπ A^{−1}) rπ = 0, and consequently,

(−In + A − 1n dπᵀ A + Pπ) A^{−1} rπ = 0.

The term in the brackets in the above equation is zero because −In + A − 1n dπᵀ A + Pπ = −In + (In − Pπ + 1n dπᵀ) − 1n dπᵀ (In − Pπ + 1n dπᵀ) + Pπ = 0. Therefore, vπ* in (9.24) is a solution.
- Step 2: Derive the general expression of the solutions.
Substituting r̄π = dπᵀ rπ into (9.23) gives

vπ = rπ − 1n dπᵀ rπ + Pπ vπ    (9.25)

and consequently

(In − Pπ) vπ = (In − 1n dπᵀ) rπ.    (9.26)

It is noted that In − Pπ is singular because (In − Pπ) 1n = 0 for any π. Therefore, the solution of (9.26) is not unique: if vπ* is a solution, then vπ* + x is also a solution for any x ∈ Null(In − Pπ). When Pπ is irreducible, Null(In − Pπ) = span{1n}. Then, any solution of the Poisson equation has the expression vπ* + c 1n, where c ∈ R.
- Step 3: Show that A = In − Pπ + 1n dπᵀ is invertible.
Since vπ* involves A^{−1}, it is necessary to show that A is invertible. The analysis is summarized in the following lemma.

Lemma 9.3. The matrix In − Pπ + 1n dπᵀ is invertible, and its inverse is

[In − (Pπ − 1n dπᵀ)]^{−1} = Σ_{k=1}^{∞} (Pπ^k − 1n dπᵀ) + In.

Proof. First of all, we state some preliminary facts without proof. Let ρ(M) be the spectral radius of a matrix M. Then, I − M is invertible if ρ(M) < 1. Moreover, ρ(M) < 1 if and only if lim_{k→∞} M^k = 0.
Based on the above facts, we next show that lim_{k→∞} (Pπ − 1n dπᵀ)^k = 0, and then the invertibility of In − (Pπ − 1n dπᵀ) immediately follows. To do that, we notice that

(Pπ − 1n dπᵀ)^k = Pπ^k − 1n dπᵀ,    k ≥ 1,    (9.27)

which can be proven by induction. For instance, when k = 1, the equation is valid. When k = 2, we have

(Pπ − 1n dπᵀ)² = (Pπ − 1n dπᵀ)(Pπ − 1n dπᵀ)
              = Pπ² − Pπ 1n dπᵀ − 1n dπᵀ Pπ + 1n dπᵀ 1n dπᵀ
              = Pπ² − 1n dπᵀ,

where the last equality is due to Pπ 1n = 1n, dπᵀ Pπ = dπᵀ, and dπᵀ 1n = 1. The case of k ≥ 3 can be proven similarly.
Since dπ is the stationary distribution of the state, it holds that lim_{k→∞} Pπ^k = 1n dπᵀ (see Box 8.1). Therefore, (9.27) implies that

lim_{k→∞} (Pπ − 1n dπᵀ)^k = lim_{k→∞} Pπ^k − 1n dπᵀ = 0.

As a result, ρ(Pπ − 1n dπᵀ) < 1 and hence In − (Pπ − 1n dπᵀ) is invertible. Furthermore, the inverse of this matrix is given by

[In − (Pπ − 1n dπᵀ)]^{−1} = Σ_{k=0}^{∞} (Pπ − 1n dπᵀ)^k
                          = In + Σ_{k=1}^{∞} (Pπ − 1n dπᵀ)^k
                          = In + Σ_{k=1}^{∞} (Pπ^k − 1n dπᵀ)
                          = Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) + 1n dπᵀ.

The proof is complete.

The proof of Lemma 9.3 is inspired by [66]. However, the result (In − Pπ + 1n dπᵀ)^{−1} = Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) given in [66] (the statement above equation (16) in [66]) is inaccurate because Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) is singular: Σ_{k=0}^{∞} (Pπ^k − 1n dπᵀ) 1n = 0. Lemma 9.3 corrects this inaccuracy.

Derivation of gradients

Although the value of vπ is not unique in the undiscounted case, as shown in Theorem 9.4,
the value of r̄π is unique. In particular, it follows from the Poisson equation that

r̄π 1n = rπ + (Pπ − In) vπ
      = rπ + (Pπ − In)(vπ* + c 1n)
      = rπ + (Pπ − In) vπ*.

Notably, the undetermined value c is canceled and hence r̄π is unique. Therefore, we
can calculate the gradient of r̄π in the undiscounted case. In addition, since vπ is not
unique, v̄π is not unique either. We do not study the gradient of v̄π in the undiscounted
case. For interested readers, it is worth mentioning that we can add more constraints to
uniquely solve vπ from the Poisson equation. For example, by assuming that a recurrent
state exists, the state value of this recurrent state is zero [65, Section II], and hence c can
be determined. There are also other ways to uniquely determine vπ . See, for example,
equations (8.6.5)-(8.6.7) in [2].
The gradient of r̄π in the undiscounted case is given below.

Theorem 9.5 (Gradient of r̄π in the undiscounted case). In the undiscounted case, the gradient of the average reward r̄π is

∇θ r̄π = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a)
       = E[ ∇θ ln π(A|S, θ) qπ(S, A) ],    (9.28)

where S ∼ dπ and A ∼ π(S, θ).

Compared to the discounted case shown in Theorem 9.3, the gradient of r̄π in the
undiscounted case is more elegant in the sense that (9.28) is strictly valid and S obeys
the stationary distribution.
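Theorem 9.5 can also be verified numerically. The sketch below (assuming NumPy, a tabular softmax policy, and a hypothetical 2-state, 2-action MDP with made-up numbers) computes dπ, r̄π, and the differential values via Theorem 9.4, evaluates the right-hand side of (9.28), and compares it with a finite-difference approximation of ∇θ r̄π; the two agree up to numerical error:

import numpy as np

n_s, n_a = 2, 2
# Hypothetical MDP (made-up numbers, for illustration only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])       # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])         # r[s, a]

def pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def quantities(theta):
    """Return d_pi, r_bar_pi, and q_pi for the softmax policy given by theta."""
    p = pi(theta)
    P_pi = np.einsum('sa,sax->sx', p, P)
    r_pi = (p * r).sum(axis=1)
    d_pi = np.ones(n_s) / n_s
    for _ in range(5000):                      # d^T <- d^T P_pi
        d_pi = d_pi @ P_pi
    r_bar = d_pi @ r_pi
    # v* from Theorem 9.4; q(s, a) = r(s, a) - r_bar + sum_s' p(s'|s, a) v(s').
    v = np.linalg.solve(np.eye(n_s) - P_pi + np.outer(np.ones(n_s), d_pi), r_pi)
    q = r - r_bar + P @ v
    return d_pi, r_bar, q

theta = np.array([[0.2, -0.1], [0.3, 0.0]])
d_pi, r_bar, q = quantities(theta)
p = pi(theta)

# Analytic gradient (9.28): sum_s d_pi(s) sum_a grad pi(a|s, theta) q(s, a).
grad = np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        dpi = p[s] * ((np.arange(n_a) == a) - p[s, a])   # softmax derivative
        grad[s, a] = d_pi[s] * (dpi @ q[s])

# Finite-difference check: (9.28) is exact in the undiscounted case.
eps, g_fd = 1e-6, np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        t1, t2 = theta.copy(), theta.copy()
        t1[s, a] += eps
        t2[s, a] -= eps
        g_fd[s, a] = (quantities(t1)[1] - quantities(t2)[1]) / (2 * eps)

print(np.allclose(grad, g_fd, atol=1e-5))      # expected: True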

Box 9.6: Proof of Theorem 9.5


First of all, it follows from vπ(s) = Σ_{a∈A} π(a|s, θ) qπ(s, a) that

∇θ vπ(s) = ∇θ [ Σ_{a∈A} π(a|s, θ) qπ(s, a) ]
         = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ∇θ qπ(s, a) ],    (9.29)

where qπ(s, a) is the action value satisfying

qπ(s, a) = Σ_r p(r|s, a)(r − r̄π) + Σ_{s′} p(s′|s, a) vπ(s′)
         = r(s, a) − r̄π + Σ_{s′} p(s′|s, a) vπ(s′).

Since r(s, a) = Σ_r r p(r|s, a) is independent of θ, we have

∇θ qπ(s, a) = 0 − ∇θ r̄π + Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).

Substituting this result into (9.29) yields

∇θ vπ(s) = Σ_{a∈A} [ ∇θ π(a|s, θ) qπ(s, a) + π(a|s, θ) ( −∇θ r̄π + Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) ) ]
         = Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a) − ∇θ r̄π + Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′).    (9.30)

Let

u(s) ≐ Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

Since Σ_{a∈A} π(a|s, θ) Σ_{s′∈S} p(s′|s, a) ∇θ vπ(s′) = Σ_{s′∈S} p(s′|s) ∇θ vπ(s′), equation (9.30) can be written in matrix-vector form. Stacking ∇θ vπ(s) over all s ∈ S into ∇θ vπ ∈ R^{mn} and u(s) into u ∈ R^{mn}, where n = |S|, m is the dimension of θ, and ⊗ is the Kronecker product, gives

∇θ vπ = u − 1n ⊗ ∇θ r̄π + (Pπ ⊗ Im) ∇θ vπ,

and hence

1n ⊗ ∇θ r̄π = u + (Pπ ⊗ Im) ∇θ vπ − ∇θ vπ.

Multiplying dπᵀ ⊗ Im on both sides of the above equation gives

(dπᵀ 1n) ⊗ ∇θ r̄π = (dπᵀ ⊗ Im) u + [(dπᵀ Pπ) ⊗ Im] ∇θ vπ − (dπᵀ ⊗ Im) ∇θ vπ
                  = (dπᵀ ⊗ Im) u,

which implies

∇θ r̄π = (dπᵀ ⊗ Im) u
       = Σ_{s∈S} dπ(s) u(s)
       = Σ_{s∈S} dπ(s) Σ_{a∈A} ∇θ π(a|s, θ) qπ(s, a).

9.4 Monte Carlo policy gradient (REINFORCE)


With the gradient presented in Theorem 9.1, we next show how to use the gradient-based
method to optimize the metrics to obtain optimal policies.
The gradient-ascent algorithm for maximizing J(θ) is

θt+1 = θt + α ∇θ J(θt)
     = θt + α E[ ∇θ ln π(A|S, θt) qπ(S, A) ],    (9.31)

where α > 0 is a constant learning rate. Since the true gradient in (9.31) is unknown, we can replace the true gradient with a stochastic gradient to obtain the following algorithm:

θt+1 = θt + α ∇θ ln π(at|st, θt) qt(st, at),    (9.32)

where qt(st, at) is an approximation of qπ(st, at). If qt(st, at) is obtained by Monte Carlo estimation, the algorithm is called REINFORCE [68] or Monte Carlo policy gradient, which is one of the earliest and simplest policy gradient algorithms.
The algorithm in (9.32) is important since many other policy gradient algorithms can be obtained by extending it. We next examine the interpretation of (9.32) more closely. Since ∇θ ln π(at|st, θt) = ∇θ π(at|st, θt) / π(at|st, θt), we can rewrite (9.32) as

θt+1 = θt + α ( qt(st, at) / π(at|st, θt) ) ∇θ π(at|st, θt).

Denoting the coefficient βt ≐ qt(st, at) / π(at|st, θt), the update can be written concisely as

θt+1 = θt + α βt ∇θ π(at|st, θt).    (9.33)

Two important interpretations can be seen from this equation.

- First, since (9.33) is a simple gradient-ascent algorithm, the following observations can be obtained.
  - If βt ≥ 0, the probability of choosing (st, at) is enhanced. That is,

    π(at|st, θt+1) ≥ π(at|st, θt).

    The greater βt is, the stronger the enhancement is.
  - If βt < 0, the probability of choosing (st, at) decreases. That is,

    π(at|st, θt+1) < π(at|st, θt).

The above observations can be proven as follows. When θt+1 − θt is sufficiently small, it follows from the Taylor expansion that

π(at|st, θt+1) ≈ π(at|st, θt) + (∇θ π(at|st, θt))ᵀ (θt+1 − θt)
              = π(at|st, θt) + α βt (∇θ π(at|st, θt))ᵀ (∇θ π(at|st, θt))    (substituting (9.33))
              = π(at|st, θt) + α βt ‖∇θ π(at|st, θt)‖².

It is clear that π(at|st, θt+1) ≥ π(at|st, θt) when βt ≥ 0 and π(at|st, θt+1) < π(at|st, θt) when βt < 0.
when βt < 0.
- Second, the algorithm can strike a balance between exploration and exploitation to a certain extent due to the expression of

βt = qt(st, at) / π(at|st, θt).

On the one hand, βt is proportional to qt(st, at). As a result, if the action value of (st, at) is large, then π(at|st, θt) is enhanced so that the probability of selecting at increases. Therefore, the algorithm attempts to exploit actions with greater values. On the other hand, βt is inversely proportional to π(at|st, θt) when qt(st, at) > 0. As a result, if the probability of selecting at is small, then π(at|st, θt) is enhanced so that the probability of selecting at increases. Therefore, the algorithm attempts to explore actions with low probabilities.

Moreover, since (9.32) uses samples to approximate the true gradient in (9.31), it is
important to understand how the samples should be obtained.

- How to sample S? S in the true gradient E[∇θ ln π(A|S, θt) qπ(S, A)] should obey the distribution η, which is either the stationary distribution dπ or the discounted total probability distribution ρπ in (9.19). Either dπ or ρπ represents the long-term behavior exhibited under π.
- How to sample A? A in E[∇θ ln π(A|S, θt) qπ(S, A)] should obey the distribution π(A|S, θ). The ideal way to sample A is to select at following π(a|st, θt). Therefore, the policy gradient algorithm is on-policy.

Unfortunately, the ideal ways of sampling S and A are not strictly followed in practice due to their low sample efficiency. A more sample-efficient implementation of (9.32) is given in Algorithm 9.1 below. In this implementation, an episode is first generated by following π(θ). Then, θ is updated multiple times using every experience sample in the episode.

Algorithm 9.1: Policy Gradient by Monte Carlo (REINFORCE)

Initialization: Initial parameter θ; γ ∈ (0, 1); α > 0.
Goal: Learn an optimal policy for maximizing J(θ).
For each episode, do
    Generate an episode {s0, a0, r1, . . . , s_{T−1}, a_{T−1}, r_T} following π(θ).
    For t = 0, 1, . . . , T − 1:
        Value update: qt(st, at) = Σ_{k=t+1}^{T} γ^{k−t−1} r_k
        Policy update: θ ← θ + α ∇θ ln π(at|st, θ) qt(st, at)
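A minimal Python sketch of Algorithm 9.1 is given below. It assumes NumPy, a tabular softmax policy with one parameter per state-action pair, and a hypothetical episodic environment exposing reset() and step(a) with the interface indicated in the comments; it illustrates the update (9.32) rather than providing a tuned implementation:

import numpy as np

def reinforce(env, n_states, n_actions, gamma=0.99, alpha=0.01, episodes=1000):
    """Monte Carlo policy gradient with a tabular softmax policy theta[s, a]."""
    rng = np.random.default_rng(0)
    theta = np.zeros((n_states, n_actions))

    def policy(s):
        e = np.exp(theta[s] - theta[s].max())
        return e / e.sum()

    for _ in range(episodes):
        # Generate an episode {s0, a0, r1, ..., s_{T-1}, a_{T-1}, r_T} following pi(theta).
        s, done, traj = env.reset(), False, []
        while not done:
            a = rng.choice(n_actions, p=policy(s))
            s_next, r, done = env.step(a)      # assumed environment interface
            traj.append((s, a, r))
            s = s_next

        # Backward pass: q_t = sum_{k=t+1}^{T} gamma^(k-t-1) r_k is the return from t.
        g = 0.0
        for s, a, r in reversed(traj):
            g = r + gamma * g
            p = policy(s)
            # grad_theta ln pi(a|s, theta) for the tabular softmax: one-hot(a) - pi(.|s).
            grad_log = -p
            grad_log[a] += 1.0
            theta[s] += alpha * grad_log * g   # policy update (9.32)
    return theta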


9.5 Summary
This chapter introduced the policy gradient method, which is the foundation of many modern reinforcement learning algorithms. Policy gradient methods are policy-based, which is a big step forward in this book because all the methods in the previous chapters are value-based. The basic idea of the policy gradient method is simple: select an appropriate scalar metric and then optimize it via a gradient-ascent algorithm.
The most complicated part of the policy gradient method is the derivation of the
gradients of the metrics. That is because we have to distinguish various scenarios with
different metrics and discounted/undiscounted cases. Fortunately, the expressions of the
gradients in different scenarios are similar. Hence, we summarized the expressions in
Theorem 9.1, which is the most important theoretical result in this chapter. For many
readers, it is sufficient to be aware of this theorem; its proof is nontrivial and not all readers are required to study it.
The policy gradient algorithm in (9.32) must be properly understood since it is the
foundation of many advanced policy gradient algorithms. In the next chapter, this algo-
rithm will be extended to another important policy gradient method called actor-critic.

9.6 Q&A
- Q: What is the basic idea of the policy gradient method?
A: The basic idea is simple: define an appropriate scalar metric, derive its gradient, and then use gradient-ascent methods to optimize the metric. The most important theoretical result regarding this method is the policy gradient given in Theorem 9.1.
- Q: What is the most complicated part of the policy gradient method?
A: The basic idea of the policy gradient method is simple. However, the derivation procedure of the gradients is quite complicated. That is because we have to distinguish numerous different scenarios, and the mathematical derivation in each scenario is nontrivial. It is sufficient for many readers to be familiar with the result in Theorem 9.1 without knowing the proof.
- Q: What metrics should be used in the policy gradient method?
A: We introduced three common metrics in this chapter: v̄π, v̄π0, and r̄π. Since they all lead to similar policy gradients, they can all be adopted in the policy gradient method. More importantly, the expressions in (9.1) and (9.4) are often encountered in the literature.
- Q: Why is a natural logarithm function contained in the policy gradient?


A: The natural logarithm is introduced to express the gradient as an expected value. In this way, we can approximate the true gradient with a stochastic one.
- Q: Why do we need to study the undiscounted case when deriving the policy gradient?
A: First, for continuing tasks, it may be inappropriate to introduce the discount rate, and we need to consider the undiscounted case. Second, the definition of the average reward r̄π is valid for both discounted and undiscounted cases. While the gradient of r̄π in the discounted case is an approximation, its gradient in the undiscounted case is exact and hence more elegant.
- Q: What does the policy gradient algorithm in (9.32) do mathematically?
A: To better understand this algorithm, readers are recommended to examine its concise expression in (9.33), which clearly shows that it is a gradient-ascent algorithm for updating the value of π(at|st, θt). That is, when a sample (st, at) is available, the policy is updated so that π(at|st, θt+1) ≥ π(at|st, θt) or π(at|st, θt+1) < π(at|st, θt), depending on the sign of the coefficient βt.
