
Chapter 3

Optimal State Values and Bellman Optimality Equation

Figure 3.1: Where we are in this book.

The ultimate goal of reinforcement learning is to seek optimal policies. It is, therefore,
necessary to define what optimal policies are. In this chapter, we introduce a core concept
and an important tool. The core concept is the optimal state value, based on which we
can define optimal policies. The important tool is the Bellman optimality equation, from
which we can solve for the optimal state values and optimal policies.
The relationship between the previous, present, and subsequent chapters is as follows.
The previous chapter (Chapter 2) introduced the Bellman equation of any given policy.


The present chapter introduces the Bellman optimality equation, which is a special Bellman equation whose corresponding policy is optimal. The next chapter (Chapter 4) will
introduce an important algorithm called value iteration, which is exactly the algorithm
for solving the Bellman optimality equation as introduced in the present chapter.
Be prepared that this chapter is slightly mathematically intensive. However, it is
worth it because many fundamental questions can be clearly answered.

3.1 Motivating example: How to improve policies?

Figure 3.2: An example for demonstrating policy improvement.

Consider the policy shown in Figure 3.2. Here, the orange and blue cells represent the
forbidden and target areas, respectively. The policy here is not good because it selects a2
(rightward) at state s1 . How can we improve the given policy to obtain a better policy?
The answer lies in state values and action values.

• Intuition: It is intuitively clear that the policy can be improved if it selects a3 (downward) instead of a2 (rightward) at s1. This is because moving downward enables the agent to avoid entering the forbidden area.
• Mathematics: The above intuition can be realized based on the calculation of state values and action values.
First, we calculate the state values of the given policy. In particular, the Bellman
equation of this policy is

vπ(s1) = −1 + γvπ(s2),
vπ(s2) = +1 + γvπ(s4),
vπ(s3) = +1 + γvπ(s4),
vπ(s4) = +1 + γvπ(s4).


Let γ = 0.9. It can be easily solved that

vπ(s4) = vπ(s3) = vπ(s2) = 10,
vπ(s1) = 8.

Second, we calculate the action values for state s1 :

qπ(s1, a1) = −1 + γvπ(s1) = 6.2,
qπ(s1, a2) = −1 + γvπ(s2) = 8,
qπ(s1, a3) = 0 + γvπ(s3) = 9,
qπ(s1, a4) = −1 + γvπ(s1) = 6.2,
qπ(s1, a5) = 0 + γvπ(s1) = 7.2.

It is notable that action a3 has the greatest action value:

qπ(s1, a3) ≥ qπ(s1, ai),   for all i ≠ 3.

Therefore, we can update the policy to select a3 at s1 .
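The calculation above is small enough to check in a few lines of code. The following sketch is a minimal illustration (not the book's code; the state ordering and action labels are assumptions): it solves the linear Bellman equation of the given policy and then evaluates the five action values at s1.

```python
import numpy as np

# Bellman equation of the given policy: v = r_pi + gamma * P_pi v, solved as a linear system.
gamma = 0.9
r_pi = np.array([-1.0, 1.0, 1.0, 1.0])        # immediate rewards at s1..s4 under the policy
P_pi = np.array([[0, 1, 0, 0],                # s1 -> s2 (rightward)
                 [0, 0, 0, 1],                # s2 -> s4
                 [0, 0, 0, 1],                # s3 -> s4
                 [0, 0, 0, 1]])               # s4 -> s4 (stay)
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)                                   # [ 8. 10. 10. 10.]

# Action values at s1; each action gives an immediate reward and a deterministic next state.
q_s1 = {
    "a1 (up)":    -1 + gamma * v_pi[0],       # bounces off the boundary, stays at s1
    "a2 (right)": -1 + gamma * v_pi[1],       # enters the forbidden area s2
    "a3 (down)":   0 + gamma * v_pi[2],
    "a4 (left)":  -1 + gamma * v_pi[0],
    "a5 (stay)":   0 + gamma * v_pi[0],
}
print(max(q_s1, key=q_s1.get))                # a3 (down) has the greatest action value
```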

This example illustrates that we can obtain a better policy if we update the policy to select the action with the greatest action value. This is the basic idea of many reinforcement learning algorithms.
This example is very simple in the sense that the given policy is not good only for state s1. If the policy is also not good for other states, will selecting the action with the greatest action value still generate a better policy? Moreover, do optimal policies always exist? What does an optimal policy look like? We will answer all of these questions in this chapter.

3.2 Optimal state values and optimal policies


While the ultimate goal of reinforcement learning is to obtain optimal policies, it is
necessary to first define what an optimal policy is. The definition is based on state
values. In particular, consider two given policies π1 and π2 . If the state value of π1 is
greater than or equal to that of π2 for any state:

vπ1 (s) ≥ vπ2 (s), for all s ∈ S,

then π1 is said to be better than π2 . Furthermore, if a policy is better than all the other
possible policies, then this policy is optimal. This is formally stated below.


Definition 3.1 (Optimal policy and optimal state value). A policy π ∗ is optimal if
vπ∗ (s) ≥ vπ (s) for all s ∈ S and for any other policy π. The state values of π ∗ are the
optimal state values.
The above definition indicates that an optimal policy has the greatest state value
for every state compared to all the other policies. This definition also leads to many
questions:
• Existence: Does the optimal policy exist?
• Uniqueness: Is the optimal policy unique?
• Stochasticity: Is the optimal policy stochastic or deterministic?
• Algorithm: How to obtain the optimal policy and the optimal state values?
These fundamental questions must be clearly answered to thoroughly understand
optimal policies. For example, regarding the existence of optimal policies, if optimal
policies do not exist, then we do not need to bother to design algorithms to find them.
We will answer all these questions in the remainder of this chapter.

3.3 Bellman optimality equation


The tool for analyzing optimal policies and optimal state values is the Bellman optimality
equation (BOE). By solving this equation, we can obtain optimal policies and optimal
state values. We next present the expression of the BOE and then analyze it in detail.
For every s ∈ S, the elementwise expression of the BOE is
v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a)r + γ Σ_{s′∈S} p(s′|s, a)v(s′) )
     = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a),                                        (3.1)

where v(s), v(s′) are unknown variables to be solved and

q(s, a) ≐ Σ_{r∈R} p(r|s, a)r + γ Σ_{s′∈S} p(s′|s, a)v(s′).

Here, π(s) denotes a policy for state s, and Π(s) is the set of all possible policies for s.
The BOE is an elegant and powerful tool for analyzing optimal policies. However,
it may be nontrivial to understand this equation. For example, this equation has two
unknown variables v(s) and π(a|s). It may be confusing to beginners how to solve two
unknown variables from one equation. Moreover, the BOE is actually a special Bellman equation. However, this is nontrivial to see since its expression is quite different from that of the Bellman equation. We also need to answer the following fundamental questions about the BOE.


• Existence: Does this equation have a solution?
• Uniqueness: Is the solution unique?
• Algorithm: How to solve this equation?
• Optimality: How is the solution related to optimal policies?
Once we can answer these questions, we will clearly understand optimal state values and
optimal policies.

3.3.1 Maximization of the right-hand side of the BOE


We next clarify how to solve the maximization problem on the right-hand side of the
BOE in (3.1). At first glance, it may be confusing to beginners how to solve two unknown
variables v(s) and π(a|s) from one equation. In fact, these two unknown variables can
be solved one by one. This idea is illustrated by the following example.
Example 3.1. Consider two unknown variables x, y ∈ R that satisfy

x = max_{y∈R} (2x − 1 − y²).

The first step is to solve for y on the right-hand side of the equation. Regardless of the value of x, we always have max_y(2x − 1 − y²) = 2x − 1, where the maximum is achieved when y = 0. The second step is to solve for x. When y = 0, the equation becomes x = 2x − 1, which leads to x = 1. Therefore, y = 0 and x = 1 are the solutions of the equation.
We now turn to the maximization problem on the right-hand side of the BOE. The
BOE in (3.1) can be written concisely as
v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a),   s ∈ S.

Inspired by Example 3.1, we can first solve for the optimal π on the right-hand side. How can we do that? The following example demonstrates the basic idea.
Example 3.2. Given q1 , q2 , q3 ∈ R, we would like to find the optimal values of c1 , c2 , c3
to maximize
Σ_{i=1}^{3} ci qi = c1 q1 + c2 q2 + c3 q3,

where c1 + c2 + c3 = 1 and c1, c2, c3 ≥ 0.
Without loss of generality, suppose that q3 ≥ q1, q2. Then, the optimal solution is c3∗ = 1 and c1∗ = c2∗ = 0. This is because

q3 = (c1 + c2 + c3)q3 = c1 q3 + c2 q3 + c3 q3 ≥ c1 q1 + c2 q2 + c3 q3

for any c1, c2, c3.


Inspired by the above example, since Σ_a π(a|s) = 1, we have

Σ_{a∈A} π(a|s) q(s, a) ≤ Σ_{a∈A} π(a|s) max_{a∈A} q(s, a) = max_{a∈A} q(s, a),

where equality is achieved when

π(a|s) = 1 if a = a∗,  and  π(a|s) = 0 if a ≠ a∗.

Here, a∗ = arg max_a q(s, a). In summary, the optimal policy π(s) is the one that selects the action that has the greatest value of q(s, a).
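A quick numeric check of this argument (an illustrative sketch, not from the book): no probability distribution over actions can produce an expected action value above the greedy one.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([6.2, 8.0, 9.0, 6.2, 7.2])       # q(s1, a) from the motivating example
for _ in range(1000):
    pi = rng.dirichlet(np.ones(len(q)))       # a random policy pi(.|s1): nonnegative, sums to 1
    assert pi @ q <= q.max() + 1e-12          # sum_a pi(a|s) q(s,a) <= max_a q(s,a)
print(q.max())                                # 9.0, achieved by pi(a3|s1) = 1
```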

3.3.2 Matrix-vector form of the BOE


The BOE refers to a set of equations defined for all states. If we combine these equations,
we can obtain a concise matrix-vector form, which will be extensively used in this chapter.
The matrix-vector form of the BOE is

v = max_{π∈Π} (rπ + γPπ v),                                                           (3.2)

where v ∈ R^{|S|} and max_π is performed in an elementwise manner. The structures of rπ and Pπ are the same as those in the matrix-vector form of the normal Bellman equation:

[rπ]_s ≐ Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a)r,    [Pπ]_{s,s′} ≐ p(s′|s) = Σ_{a∈A} π(a|s) p(s′|s, a).

Since the optimal value of π is determined by v, the right-hand side of (3.2) is a function
of v, denoted as
f(v) ≐ max_{π∈Π} (rπ + γPπ v).

Then, the BOE can be expressed in a concise form as

v = f (v). (3.3)

In the remainder of this section, we show how to solve this nonlinear equation.
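As a concrete illustration of the matrix-vector form, the sketch below (an assumed array layout, not the book's code) builds rπ and Pπ from a policy table π(a|s), an expected-reward table, and a transition model, and evaluates f(v) by maximizing over actions elementwise.

```python
import numpy as np

def policy_matrices(pi, r, P):
    """pi[s, a] = pi(a|s); r[s, a] = sum_r p(r|s,a) r; P[s, a, s'] = p(s'|s,a)."""
    r_pi = (pi * r).sum(axis=1)               # [r_pi]_s = sum_a pi(a|s) sum_r p(r|s,a) r
    P_pi = np.einsum("sa,sat->st", pi, P)     # [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)
    return r_pi, P_pi

def f(v, r, P, gamma):
    # f(v) = max_pi (r_pi + gamma * P_pi v); with finitely many actions, the elementwise
    # maximum is attained by a deterministic policy, so it suffices to maximize q(s, a) over a.
    q = r + gamma * P @ v                     # q[s, a]
    return q.max(axis=1)
```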

3.3.3 Contraction mapping theorem


Since the BOE can be expressed as a nonlinear equation v = f (v), we next introduce
the contraction mapping theorem [6] to analyze it. The contraction mapping theorem is
a powerful tool for analyzing general nonlinear equations. It is also known as the fixed-
point theorem. Readers who already know this theorem can skip this part. Otherwise,
the reader is advised to become familiar with this theorem since it is the key to analyzing the BOE.
Consider a function f(x), where x ∈ R^d and f : R^d → R^d. A point x∗ is called a fixed point if

f(x∗) = x∗.

The interpretation of the above equation is that the image of x∗ under f is x∗ itself. This is the reason why x∗ is called “fixed”. The function f is a contraction mapping (or contractive function) if there exists γ ∈ (0, 1) such that

‖f(x1) − f(x2)‖ ≤ γ‖x1 − x2‖

for any x1, x2 ∈ R^d. In this book, ‖·‖ denotes a vector or matrix norm.

Example 3.3. We present three examples to demonstrate fixed points and contraction
mappings.

• x = f(x) = 0.5x, x ∈ R.
It is easy to verify that x = 0 is a fixed point since 0 = 0.5 · 0. Moreover, f(x) = 0.5x is a contraction mapping because ‖0.5x1 − 0.5x2‖ = 0.5‖x1 − x2‖ ≤ γ‖x1 − x2‖ for any γ ∈ [0.5, 1).

• x = f(x) = Ax, where x ∈ R^n, A ∈ R^{n×n}, and ‖A‖ ≤ γ < 1.
It is easy to verify that x = 0 is a fixed point since 0 = A0. To see the contraction property, ‖Ax1 − Ax2‖ = ‖A(x1 − x2)‖ ≤ ‖A‖‖x1 − x2‖ ≤ γ‖x1 − x2‖. Therefore, f(x) = Ax is a contraction mapping.

• x = f(x) = 0.5 sin x, x ∈ R.
It is easy to see that x = 0 is a fixed point since 0 = 0.5 sin 0. Moreover, it follows from the mean value theorem [7, 8] that

|(0.5 sin x1 − 0.5 sin x2)/(x1 − x2)| = |0.5 cos x3| ≤ 0.5,   x3 ∈ [x1, x2].

As a result, |0.5 sin x1 − 0.5 sin x2| ≤ 0.5|x1 − x2| and hence f(x) = 0.5 sin x is a contraction mapping.

The relationship between a fixed point and the contraction property is characterized
by the following classic theorem.

Theorem 3.1 (Contraction mapping theorem). For any equation that has the form x =
f (x) where x and f (x) are real vectors, if f is a contraction mapping, then the following
properties hold.

• Existence: There exists a fixed point x∗ satisfying f(x∗) = x∗.


• Uniqueness: The fixed point x∗ is unique.

• Algorithm: Consider the iterative process

xk+1 = f(xk),

where k = 0, 1, 2, . . . . Then, xk → x∗ as k → ∞ for any initial guess x0. Moreover, the convergence rate is exponentially fast.

The contraction mapping theorem not only tells us whether the solution of a nonlinear equation exists but also suggests a numerical algorithm for solving the equation. The proof of the theorem is given in Box 3.1.
The following example demonstrates how to calculate the fixed points of some equa-
tions using the iterative algorithm suggested by the contraction mapping theorem.

Example 3.4. Let us revisit the abovementioned examples: x = 0.5x, x = Ax, and
x = 0.5 sin x. Since the right-hand sides of these three equations have been shown to be contraction mappings, it follows from the contraction mapping theorem that each equation has a unique fixed point, which can be easily verified to be x∗ = 0. Moreover, the fixed points of the three equations can be solved iteratively by the following algorithms:

xk+1 = 0.5xk,
xk+1 = Axk,
xk+1 = 0.5 sin xk,

given any initial guess x0.
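As a quick numerical illustration (not from the book), iterating the third equation from an arbitrary initial guess converges rapidly toward its unique fixed point x∗ = 0.

```python
import math

x = 3.0                                       # arbitrary initial guess x_0
for k in range(20):
    x = 0.5 * math.sin(x)                     # x_{k+1} = f(x_k)
    print(k, x)                               # the error roughly halves at every iteration
```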

Box 3.1: Proof of the contraction mapping theorem

Part 1: We prove that the sequence {xk}_{k=1}^∞ with xk = f(xk−1) is convergent.
The proof relies on Cauchy sequences. A sequence x1, x2, · · · ∈ R is called Cauchy if for any small ε > 0, there exists N such that ‖xm − xn‖ < ε for all m, n > N. The intuitive interpretation is that there exists a finite integer N such that all the elements after N are sufficiently close to each other. Cauchy sequences are important because a Cauchy sequence is guaranteed to converge to a limit. This convergence property will be used to prove the contraction mapping theorem. Note that we must have ‖xm − xn‖ < ε for all m, n > N. If we merely have xn+1 − xn → 0, that is insufficient to claim that the sequence is Cauchy. For example, xn+1 − xn → 0 holds for xn = √n, but apparently xn = √n diverges.
We next show that {xk = f(xk−1)}_{k=1}^∞ is a Cauchy sequence and hence converges.


First, since f is a contraction mapping, we have

‖xk+1 − xk‖ = ‖f(xk) − f(xk−1)‖ ≤ γ‖xk − xk−1‖.

Similarly, we have ‖xk − xk−1‖ ≤ γ‖xk−1 − xk−2‖, . . . , ‖x2 − x1‖ ≤ γ‖x1 − x0‖. Thus, we have

‖xk+1 − xk‖ ≤ γ‖xk − xk−1‖
            ≤ γ²‖xk−1 − xk−2‖
            ...
            ≤ γ^k ‖x1 − x0‖.

Since γ < 1, we know that ‖xk+1 − xk‖ converges to zero exponentially fast as k → ∞ given any x1, x0. Notably, the convergence of {‖xk+1 − xk‖} is not sufficient for implying the convergence of {xk}. Therefore, we need to further consider ‖xm − xn‖ for any m > n. In particular,

‖xm − xn‖ = ‖xm − xm−1 + xm−1 − · · · − xn+1 + xn+1 − xn‖
          ≤ ‖xm − xm−1‖ + · · · + ‖xn+1 − xn‖
          ≤ γ^{m−1}‖x1 − x0‖ + · · · + γ^n‖x1 − x0‖
          = γ^n (γ^{m−1−n} + · · · + 1)‖x1 − x0‖
          ≤ γ^n (1 + · · · + γ^{m−1−n} + γ^{m−n} + γ^{m−n+1} + · · · )‖x1 − x0‖
          = γ^n/(1 − γ) ‖x1 − x0‖.                                                    (3.4)

As a result, for any ε, we can always find N such that ‖xm − xn‖ < ε for all m, n > N. Therefore, this sequence is Cauchy and hence converges to a limit point denoted as x∗ = lim_{k→∞} xk.
Part 2: We show that the limit x∗ = lim_{k→∞} xk is a fixed point. To do that, since

‖f(xk) − xk‖ = ‖xk+1 − xk‖ ≤ γ^k ‖x1 − x0‖,

we know that ‖f(xk) − xk‖ converges to zero exponentially fast. Hence, we have f(x∗) = x∗ at the limit.
Part 3: We show that the fixed point is unique. Suppose that there is another fixed point x′ satisfying f(x′) = x′. Then,

‖x′ − x∗‖ = ‖f(x′) − f(x∗)‖ ≤ γ‖x′ − x∗‖.


Since γ < 1, this inequality holds if and only if ‖x′ − x∗‖ = 0. Therefore, x′ = x∗.
Part 4: We show that xk converges to x∗ exponentially fast. Recall that ‖xm − xn‖ ≤ γ^n/(1 − γ) ‖x1 − x0‖, as proven in (3.4). Since m can be arbitrarily large, we have

‖x∗ − xn‖ = lim_{m→∞} ‖xm − xn‖ ≤ γ^n/(1 − γ) ‖x1 − x0‖.

Since γ < 1, the error converges to zero exponentially fast as n → ∞.

3.3.4 Contraction property of the right-hand side of the BOE


We next show that f(v) in the BOE in (3.3) is a contraction mapping. Thus, the contraction mapping theorem introduced in the previous subsection can be applied.

Theorem 3.2 (Contraction property of f(v)). The function f(v) on the right-hand side of the BOE in (3.3) is a contraction mapping. In particular, for any v1, v2 ∈ R^{|S|}, it holds that

‖f(v1) − f(v2)‖∞ ≤ γ‖v1 − v2‖∞,

where γ ∈ (0, 1) is the discount rate, and ‖·‖∞ is the maximum norm, which is the maximum absolute value of the elements of a vector.

The proof of the theorem is given in Box 3.2. This theorem is important because we
can use the powerful contraction mapping theorem to analyze the BOE.
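Before reading the proof, one can sanity-check the theorem empirically. The sketch below (an assumed random-MDP setup, not from the book) verifies that ‖f(v1) − f(v2)‖∞ never exceeds γ‖v1 − v2‖∞ over many random models and value vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9

def f(v, r, P):
    return (r + gamma * P @ v).max(axis=1)      # elementwise max over actions

for _ in range(1000):
    r = rng.normal(size=(S, A))                 # expected immediate rewards
    P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] is a probability vector
    v1, v2 = rng.normal(size=S), rng.normal(size=S)
    lhs = np.abs(f(v1, r, P) - f(v2, r, P)).max()
    assert lhs <= gamma * np.abs(v1 - v2).max() + 1e-10
```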

Box 3.2: Proof of Theorem 3.2


Consider any two vectors v1, v2 ∈ R^{|S|}, and suppose that π1∗ ≐ arg max_π (rπ + γPπ v1) and π2∗ ≐ arg max_π (rπ + γPπ v2). Then,

f(v1) = max_π (rπ + γPπ v1) = r_{π1∗} + γP_{π1∗} v1 ≥ r_{π2∗} + γP_{π2∗} v1,
f(v2) = max_π (rπ + γPπ v2) = r_{π2∗} + γP_{π2∗} v2 ≥ r_{π1∗} + γP_{π1∗} v2,

where ≥ is an elementwise comparison. As a result,

f(v1) − f(v2) = r_{π1∗} + γP_{π1∗} v1 − (r_{π2∗} + γP_{π2∗} v2)
             ≤ r_{π1∗} + γP_{π1∗} v1 − (r_{π1∗} + γP_{π1∗} v2)
             = γP_{π1∗} (v1 − v2).


Similarly, it can be shown that f(v2) − f(v1) ≤ γP_{π2∗}(v2 − v1). Therefore,

γP_{π2∗}(v1 − v2) ≤ f(v1) − f(v2) ≤ γP_{π1∗}(v1 − v2).

Define

z ≐ max( |γP_{π2∗}(v1 − v2)|, |γP_{π1∗}(v1 − v2)| ) ∈ R^{|S|},

where max(·), |·|, and ≥ are all elementwise operators. By definition, z ≥ 0. On the one hand, it is easy to see that

−z ≤ γP_{π2∗}(v1 − v2) ≤ f(v1) − f(v2) ≤ γP_{π1∗}(v1 − v2) ≤ z,

which implies

|f(v1) − f(v2)| ≤ z.

It then follows that

‖f(v1) − f(v2)‖∞ ≤ ‖z‖∞,                                                              (3.5)

where ‖·‖∞ is the maximum norm.
On the other hand, suppose that zi is the ith entry of z, and that pi^T and qi^T are the ith rows of P_{π1∗} and P_{π2∗}, respectively. Then,

zi = max{ γ|pi^T(v1 − v2)|, γ|qi^T(v1 − v2)| }.

Since pi is a vector with all nonnegative elements and the sum of the elements is equal to one, it follows that

|pi^T(v1 − v2)| ≤ pi^T|v1 − v2| ≤ ‖v1 − v2‖∞.

Similarly, we have |qi^T(v1 − v2)| ≤ ‖v1 − v2‖∞. Therefore, zi ≤ γ‖v1 − v2‖∞ and hence

‖z‖∞ = max_i |zi| ≤ γ‖v1 − v2‖∞.

Substituting this inequality into (3.5) gives

‖f(v1) − f(v2)‖∞ ≤ γ‖v1 − v2‖∞,

which concludes the proof of the contraction property of f(v).


3.4 Solving an optimal policy from the BOE


With the preparation in the last section, we are ready to solve the BOE to obtain the
optimal state value v ∗ and an optimal policy π ∗ .

• Solving v∗: If v∗ is a solution of the BOE, then it satisfies

v∗ = max_{π∈Π} (rπ + γPπ v∗).

Clearly, v∗ is a fixed point because v∗ = f(v∗). Then, the contraction mapping theorem suggests the following results.

Theorem 3.3 (Existence, uniqueness, and algorithm). For the BOE v = f(v) = max_{π∈Π} (rπ + γPπ v), there always exists a unique solution v∗, which can be solved iteratively by

vk+1 = f(vk) = max_{π∈Π} (rπ + γPπ vk),   k = 0, 1, 2, . . . .

The value of vk converges to v∗ exponentially fast as k → ∞ given any initial guess v0.

The proof of this theorem directly follows from the contraction mapping theorem since
f (v) is a contraction mapping. This theorem is important because it answers some
fundamental questions.

- Existence of v ∗ : The solution of the BOE always exists.


- Uniqueness of v ∗ : The solution v ∗ is always unique.
- Algorithm for solving v∗: The value of v∗ can be computed by the iterative algorithm suggested by Theorem 3.3. This iterative algorithm has a specific name: value iteration. Its implementation will be introduced in detail in Chapter 4. We mainly focus on the fundamental properties of the BOE in the present chapter.

• Solving π∗: Once the value of v∗ has been obtained, we can easily obtain π∗ by solving

π∗ = arg max_{π∈Π} (rπ + γPπ v∗).                                                     (3.6)

The value of π∗ will be given in Theorem 3.5. Substituting (3.6) into the BOE yields

v∗ = rπ∗ + γPπ∗ v∗.

Therefore, v∗ = vπ∗ is the state value of π∗, and the BOE is a special Bellman equation whose corresponding policy is π∗.
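The two steps above translate directly into code. The following sketch (an assumed two-state toy MDP, not the grid world from the figures) iterates vk+1 = f(vk) until convergence and then reads off a greedy deterministic policy from q∗(s, a); this is essentially the value iteration algorithm introduced in Chapter 4.

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 1.0],                     # r[s, a]: expected immediate reward
              [1.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],       # P[s, a, s']: transition probabilities
              [[0.0, 1.0], [1.0, 0.0]]])

v = np.zeros(2)                               # any initial guess v_0
for _ in range(1000):
    q = r + gamma * P @ v                     # q(s, a) computed from the current v
    v_new = q.max(axis=1)                     # v_{k+1}(s) = max_a q(s, a)
    if np.abs(v_new - v).max() < 1e-10:       # stop when the update is negligible
        break
    v = v_new

pi_star = q.argmax(axis=1)                    # greedy policy: a*(s) = argmax_a q*(s, a)
print(v, pi_star)                             # approx. [10. 10.] and [1 0]
```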


At this point, although we can solve v ∗ and π ∗ , it is still unclear whether the solution
is optimal. The following theorem reveals the optimality of the solution.

Theorem 3.4 (Optimality of v ∗ and π ∗ ). The solution v ∗ is the optimal state value, and
π ∗ is an optimal policy. That is, for any policy π, it holds that

v ∗ = vπ∗ ≥ vπ ,

where vπ is the state value of π, and ≥ is an elementwise comparison.

Now, it is clear why we must study the BOE: its solution corresponds to optimal state
values and optimal policies. The proof of the above theorem is given in the following box.

Box 3.3: Proof of Theorem 3.4


For any policy π, it holds that

vπ = rπ + γPπ vπ.

Since

v∗ = max_π (rπ + γPπ v∗) = rπ∗ + γPπ∗ v∗ ≥ rπ + γPπ v∗,

we have

v∗ − vπ ≥ (rπ + γPπ v∗) − (rπ + γPπ vπ) = γPπ (v∗ − vπ).

Repeatedly applying the above inequality gives v∗ − vπ ≥ γPπ(v∗ − vπ) ≥ γ²Pπ²(v∗ − vπ) ≥ · · · ≥ γ^n Pπ^n (v∗ − vπ). It follows that

v∗ − vπ ≥ lim_{n→∞} γ^n Pπ^n (v∗ − vπ) = 0,

where the last equality is true because γ < 1 and Pπ^n is a nonnegative matrix with all its elements less than or equal to 1 (because Pπ^n 1 = 1). Therefore, v∗ ≥ vπ for any π.

We next examine π ∗ in (3.6) more closely. In particular, the following theorem shows
that there always exists a deterministic greedy policy that is optimal.

Theorem 3.5 (Greedy optimal policy). For any s ∈ S, the deterministic greedy policy
π∗(a|s) = 1 if a = a∗(s),  0 if a ≠ a∗(s),                                            (3.7)


is an optimal policy for solving the BOE. Here,

a∗(s) = arg max_a q∗(s, a),

where

q∗(s, a) ≐ Σ_{r∈R} p(r|s, a)r + γ Σ_{s′∈S} p(s′|s, a)v∗(s′).

Box 3.4: Proof of Theorem 3.5

While the matrix-vector form of the optimal policy is π∗ = arg max_π (rπ + γPπ v∗), its elementwise form is

π∗(s) = arg max_{π∈Π} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a)r + γ Σ_{s′∈S} p(s′|s, a)v∗(s′) ),   s ∈ S,

where the term in parentheses is q∗(s, a). It is clear that Σ_{a∈A} π(a|s)q∗(s, a) is maximized if π(s) selects the action with the greatest q∗(s, a).

The policy in (3.7) is called greedy because it seeks the actions with the greatest q∗(s, a). Finally, we discuss two important properties of π∗.

• Uniqueness of optimal policies: Although the value of v∗ is unique, the optimal policy that corresponds to v∗ may not be unique. This can be easily verified by counterexamples. For example, the two policies shown in Figure 3.3 are both optimal.
• Stochasticity of optimal policies: An optimal policy can be either stochastic or deterministic, as demonstrated in Figure 3.3. However, it is certain that there always exists a deterministic optimal policy according to Theorem 3.5.

Figure 3.3: Examples for demonstrating that optimal policies may not be unique. The two policies are different but are both optimal.


3.5 Factors that influence optimal policies


The BOE is a powerful tool for analyzing optimal policies. We next apply the BOE to
study what factors can influence optimal policies. This question can be easily answered
by observing the elementwise expression of the BOE:
v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a)r + γ Σ_{s′∈S} p(s′|s, a)v(s′) ),   s ∈ S.

The optimal state value and optimal policy are determined by the following parameters: 1) the immediate reward r, 2) the discount rate γ, and 3) the system model p(s′|s, a), p(r|s, a). While the system model is fixed, we next discuss how the optimal
policy varies when we change the values of r and γ. All the optimal policies presented
in this section can be obtained via the algorithm in Theorem 3.3. The implementation
details of the algorithm will be given in Chapter 4. The present chapter mainly focuses
on the fundamental properties of optimal policies.

A baseline example

Consider the example in Figure 3.4. The reward settings are rboundary = rforbidden = −1
and rtarget = 1. In addition, the agent receives a reward of rother = 0 for every movement
step. The discount rate is selected as γ = 0.9.
With the above parameters, the optimal policy and optimal state values are given in
Figure 3.4(a). It is interesting that the agent is not afraid of passing through forbidden
areas to reach the target area. More specifically, starting from the state at (row=4,
column=1), the agent has two options for reaching the target area. The first option is to
avoid all the forbidden areas and travel a long distance to the target area. The second
option is to pass through forbidden areas. Although the agent obtains negative rewards
when entering forbidden areas, the cumulative reward of the second trajectory is greater
than that of the first trajectory. Therefore, the optimal policy is far-sighted due to the
relatively large value of γ.

Impact of the discount rate

If we change the discount rate from γ = 0.9 to γ = 0.5 and keep other parameters
unchanged, the optimal policy becomes the one shown in Figure 3.4(b). It is interesting
that the agent does not dare to take risks anymore. Instead, it would travel a long
distance to reach the target while avoiding all the forbidden areas. This is because the
optimal policy becomes short-sighted due to the relatively small value of γ.
In the extreme case where γ = 0, the corresponding optimal policy is shown in Figure 3.4(c). In this case, the agent is not able to reach the target area. This is because the optimal policy for each state is extremely short-sighted and merely selects the action with the greatest immediate reward instead of the greatest total reward.

Figure 3.4: The optimal policies and optimal state values given different parameter values. (a) Baseline example: rboundary = rforbidden = −1, rtarget = 1, γ = 0.9. (b) The discount rate is changed to γ = 0.5; the other parameters are the same as those in (a). (c) The discount rate is changed to γ = 0; the other parameters are the same as those in (a). (d) rforbidden is changed from −1 to −10; the other parameters are the same as those in (a).
In addition, the spatial distribution of the state values exhibits an interesting pattern:
the states close to the target have greater state values, whereas those far away have lower
values. This pattern can be observed from all the examples shown in Figure 3.4. It can
be explained by using the discount rate: if a state must travel along a longer trajectory
to reach the target, its state value is smaller due to the discount rate.

Impact of the reward values

If we want to strictly prohibit the agent from entering any forbidden area, we can increase
the punishment received for doing so. For instance, if rforbidden is changed from −1 to
−10, the resulting optimal policy can avoid all the forbidden areas (see Figure 3.4(d)).
However, changing the rewards does not always lead to different optimal policies.
One important fact is that optimal policies are invariant to affine transformations of the
rewards. In other words, if we scale all the rewards by a positive factor or add the same value to all the rewards, the optimal policy remains the same.

Theorem 3.6 (Optimal policy invariance). Consider a Markov decision process with v∗ ∈ R^{|S|} as the optimal state value satisfying v∗ = max_{π∈Π} (rπ + γPπ v∗). If every reward r ∈ R is changed by an affine transformation to αr + β, where α, β ∈ R and α > 0, then the corresponding optimal state value v′ is also an affine transformation of v∗:

v′ = αv∗ + (β/(1 − γ)) 1,                                                             (3.8)

where γ ∈ (0, 1) is the discount rate and 1 = [1, . . . , 1]^T. Consequently, the optimal policy derived from v′ is invariant to the affine transformation of the reward values.

Box 3.5: Proof of Theorem 3.6

For any policy π, define rπ = [. . . , rπ(s), . . . ]^T, where

rπ(s) = Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a)r,   s ∈ S.

If r → αr + β, then rπ(s) → αrπ(s) + β and hence rπ → αrπ + β1, where 1 = [1, . . . , 1]^T. In this case, the BOE becomes

v′ = max_{π∈Π} (αrπ + β1 + γPπ v′).                                                   (3.9)


We next solve the new BOE in (3.9) by showing that v′ = αv∗ + c1 with c = β/(1 − γ) is a solution of (3.9). In particular, substituting v′ = αv∗ + c1 into (3.9) gives

αv∗ + c1 = max_{π∈Π} (αrπ + β1 + γPπ(αv∗ + c1)) = max_{π∈Π} (αrπ + β1 + αγPπ v∗ + cγ1),

where the last equality is due to the fact that Pπ1 = 1. The above equation can be reorganized as

αv∗ = max_{π∈Π} (αrπ + αγPπ v∗) + β1 + cγ1 − c1,

which is equivalent to

β1 + cγ1 − c1 = 0.

Since c = β/(1 − γ), the above equation is valid and hence v′ = αv∗ + c1 is the solution of (3.9). Since (3.9) is itself a BOE, v′ is also the unique solution. Finally, since v′ is an affine transformation of v∗, the relative relationships between the action values remain the same. Hence, the greedy optimal policy derived from v′ is the same as that derived from v∗: arg max_{π∈Π} (rπ + γPπ v′) is the same as arg max_π (rπ + γPπ v∗).

Readers may refer to [9] for a further discussion on the conditions under which modifications to the reward values preserve the optimal policy.
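Theorem 3.6 is also easy to verify numerically. The sketch below (an assumed random MDP; solve_boe is a hypothetical helper written for this illustration, not a library function) solves the BOE before and after the transformation r → αr + β and checks both claims of the theorem.

```python
import numpy as np

gamma, alpha, beta = 0.9, 2.0, 5.0
rng = np.random.default_rng(2)
S, A = 5, 3
r = rng.normal(size=(S, A))                   # expected immediate rewards
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, :] sums to 1

def solve_boe(rew, iters=2000):
    v = np.zeros(S)
    for _ in range(iters):                    # value iteration: v <- max_a (rew + gamma P v)
        v = (rew + gamma * P @ v).max(axis=1)
    return v, (rew + gamma * P @ v).argmax(axis=1)

v_star, pi_star = solve_boe(r)
v_prime, pi_prime = solve_boe(alpha * r + beta)
assert np.allclose(v_prime, alpha * v_star + beta / (1 - gamma))   # equation (3.8)
assert np.array_equal(pi_star, pi_prime)      # the greedy optimal policy is unchanged
```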

Avoiding meaningless detours

In the reward setting, the agent receives a reward of rother = 0 for every movement
step (unless it enters a forbidden area or the target area or attempts to go beyond the
boundary). Since a zero reward is not a punishment, would the optimal policy take
meaningless detours before reaching the target? Should we set rother to be negative to
encourage the agent to reach the target as quickly as possible?

Figure 3.5: Examples illustrating that optimal policies do not take meaningless detours due to the discount rate. (a) Optimal policy. (b) Non-optimal policy.

Consider the examples in Figure 3.5, where the bottom-right cell is the target area to reach. The two policies here are the same except for state s2. By the policy in
Figure 3.5(a), the agent moves downward at s2 and the resulting trajectory is s2 → s4 .
By the policy in Figure 3.5(b), the agent moves leftward and the resulting trajectory is
s2 → s1 → s3 → s4 .
It is notable that the second policy takes a detour before reaching the target area. If
we merely consider the immediate rewards, taking this detour does not matter because
no negative immediate rewards will be obtained. However, if we consider the discounted
return, then this detour matters. In particular, for the first policy, the discounted return
is
return = 1 + γ·1 + γ²·1 + · · · = 1/(1 − γ) = 10.

As a comparison, the discounted return for the second policy is

return = 0 + γ·0 + γ²·1 + γ³·1 + · · · = γ²/(1 − γ) = 8.1.

It is clear that the shorter the trajectory is, the greater the return is. Therefore, although
the immediate reward of every step does not encourage the agent to approach the target
as quickly as possible, the discount rate does encourage it to do so.
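The two returns can be checked directly (a small arithmetic sketch, assuming γ = 0.9 and a long truncation of the infinite sums):

```python
gamma = 0.9
return_direct = sum(gamma**t for t in range(1000))       # 1 + gamma + gamma^2 + ... ≈ 10
return_detour = sum(gamma**t for t in range(2, 1000))    # gamma^2 + gamma^3 + ... ≈ 8.1
print(round(return_direct, 1), round(return_detour, 1))  # 10.0 8.1
```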
A misunderstanding that beginners may have is that adding a negative reward (e.g.,
−1) on top of the rewards obtained for every movement is necessary to encourage the
agent to reach the target as quickly as possible. This is a misunderstanding because
adding the same reward on top of all rewards is an affine transformation, which preserves
the optimal policy. Moreover, optimal policies do not take meaningless detours due to
the discount rate, even though a detour may not receive any immediate negative rewards.

3.6 Summary
The core concepts in this chapter include optimal policies and optimal state values. In
particular, a policy is optimal if its state values are greater than or equal to those of any
other policy. The state values of an optimal policy are the optimal state values. The BOE
is the core tool for analyzing optimal policies and optimal state values. This equation
is a nonlinear equation with a nice contraction property. We can apply the contraction
mapping theorem to analyze this equation. It was shown that the solutions of the BOE
correspond to the optimal state value and optimal policy. This is the reason why we need
to study the BOE.
The contents of this chapter are important for thoroughly understanding many funda-
mental ideas of reinforcement learning. For example, Theorem 3.3 suggests an iterative
algorithm for solving the BOE. This algorithm is exactly the value iteration algorithm
that will be introduced in Chapter 4. A further discussion about the BOE can be found
in [2].


3.7 Q&A
• Q: What is the definition of optimal policies?
A: A policy is optimal if its corresponding state values are greater than or equal to those of any other policy.
It should be noted that this specific definition of optimality is valid only for tabular
reinforcement learning algorithms. When the values or policies are approximated by
functions, different metrics must be used to define optimal policies. This will become
clearer in Chapters 8 and 9.
• Q: Why is the Bellman optimality equation important?
A: It is important because it characterizes both optimal policies and optimal state
values. Solving this equation yields an optimal policy and the corresponding optimal
state value.
• Q: Is the Bellman optimality equation a Bellman equation?
A: Yes. The Bellman optimality equation is a special Bellman equation whose corresponding policy is optimal.
• Q: Is the solution of the Bellman optimality equation unique?
A: The Bellman optimality equation has two unknown variables. The first unknown
variable is a value, and the second is a policy. The value solution, which is the optimal
state value, is unique. The policy solution, which is an optimal policy, may not be
unique.
• Q: What is the key property of the Bellman optimality equation for analyzing its
solution?
A: The key property is that the right-hand side of the Bellman optimality equation is
a contraction mapping. As a result, we can apply the contraction mapping theorem
to analyze its solution.
• Q: Do optimal policies exist?
A: Yes. Optimal policies always exist according to the analysis of the BOE.
• Q: Are optimal policies unique?
A: No. There may exist multiple or even infinitely many optimal policies that have the same optimal state values.
• Q: Are optimal policies stochastic or deterministic?
A: An optimal policy can be either deterministic or stochastic. A nice fact is that
there always exist deterministic greedy optimal policies.


• Q: How to obtain an optimal policy?


A: Solving the BOE using the iterative algorithm suggested by Theorem 3.3 yields an
optimal policy. The detailed implementation of this iterative algorithm will be given
in Chapter 4. Notably, all the reinforcement learning algorithms introduced in this
book aim to obtain optimal policies under different settings.
• Q: What is the general impact on the optimal policies if we reduce the value of the discount rate?
A: The optimal policy becomes more short-sighted when we reduce the discount rate. That is, the agent does not dare to take risks, even though doing so may yield greater cumulative rewards afterward.
• Q: What happens if we set the discount rate to zero?
A: The resulting optimal policy would become extremely short-sighted. The agent
would take the action with the greatest immediate reward, even though that action
is not good in the long run.
• Q: If we increase all the rewards by the same amount, will the optimal state value
change? Will the optimal policy change?
A: Increasing all the rewards by the same amount is an affine transformation of the
rewards, which would not affect the optimal policies. However, the optimal state value
would increase, as shown in (3.8).
• Q: If we hope that the optimal policy can avoid meaningless detours before reaching
the target, should we add a negative reward to every step so that the agent reaches
the target as quickly as possible?
A: First, introducing an additional negative reward to every step is an affine transformation of the rewards, which does not change the optimal policy. Second, the discount
rate can automatically encourage the agent to reach the target as quickly as possible.
This is because meaningless detours would increase the trajectory length and reduce
the discounted return.
