Chapter 3: Optimal State Values and Bellman Optimality Equation
(Chapter map: the location of this chapter within the book. Fundamental tools: Chapter 1 Basic Concepts, Chapter 2 Bellman Equation, Chapter 3 Bellman Optimality Equation. Algorithms/methods: Chapter 7 Temporal-Difference Methods, Chapter 8 Value Function Approximation. The book proceeds from tabular representation to function representation.)
The ultimate goal of reinforcement learning is to seek optimal policies. It is, therefore,
necessary to define what optimal policies are. In this chapter, we introduce a core concept
and an important tool. The core concept is the optimal state value, based on which we
can define optimal policies. The important tool is the Bellman optimality equation, from
which we can solve the optimal state values and policies.
The relationship between the previous, present, and subsequent chapters is as follows.
The previous chapter (Chapter 2) introduced the Bellman equation of any given policy.
The present chapter introduces the Bellman optimality equation, which is a special Bellman equation whose corresponding policy is optimal. The next chapter (Chapter 4) will
introduce an important algorithm called value iteration, which is exactly the algorithm
for solving the Bellman optimality equation as introduced in the present chapter.
Be prepared that this chapter is slightly mathematically intensive. However, it is
worth it because many fundamental questions can be clearly answered.
3.1 Motivating example: How to improve policies?

(Figure 3.2: a 2×2 grid-world example with states s1, s2, s3, s4, where s2 is the forbidden area and s4 is the target area. Under the given policy, the move at s1 enters the forbidden area with reward r = −1, while the moves at s2, s3, and s4 enter or stay in the target area with reward r = 1.)
Consider the policy shown in Figure 3.2. Here, the orange and blue cells represent the
forbidden and target areas, respectively. The policy here is not good because it selects a2
(rightward) at state s1 . How can we improve the given policy to obtain a better policy?
The answer lies in state values and action values.
Intuition: It is intuitively clear that the policy can improve if it selects a3 (downward)
instead of a2 (rightward) at s1 . This is because moving downward enables the agent
to avoid entering the forbidden area.
Mathematics: The above intuition can be realized based on the calculation of state
values and action values.
First, we calculate the state values of the given policy. In particular, the Bellman equation of this policy is

vπ(s1) = −1 + γ vπ(s2),
vπ(s2) = +1 + γ vπ(s4),
vπ(s3) = +1 + γ vπ(s4),
vπ(s4) = +1 + γ vπ(s4),

which can be solved to obtain the state values. Second, we calculate the action values of s1. Since vπ(s2) = vπ(s3), we have

qπ(s1, a3) = 0 + γ vπ(s3) > −1 + γ vπ(s2) = qπ(s1, a2).

Hence, a3 has a greater action value than a2 at s1, and updating the policy to select the action with the greatest action value at s1 gives a better policy.
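For readers who would like to verify these quantities numerically, the following minimal Python sketch (an illustration, not from the book) solves the Bellman equation of the given policy and compares the two action values at s1. The state/action encoding, the discount rate γ = 0.9, and the reward of 0 for ordinary moves are assumptions made for this illustration.

import numpy as np

gamma = 0.9                          # assumed discount rate

# The given policy: s1 -> s2 (reward -1), s2 -> s4 (+1), s3 -> s4 (+1), s4 -> s4 (+1).
r_pi = np.array([-1.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)

# Solve the Bellman equation v = r_pi + gamma * P_pi v, i.e., (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)                             # -> approximately [8, 10, 10, 10]

# Action values at s1 for the two actions discussed in the text.
q_s1_a2 = -1 + gamma * v[1]          # rightward into the forbidden area s2
q_s1_a3 = 0 + gamma * v[2]           # downward into s3
print(q_s1_a2, q_s1_a3)              # 8.0 < 9.0, so a3 is the better action at s1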
This example illustrates that we can obtain a better policy if we update the policy to select the action with the greatest action value. This is the basic idea of many
reinforcement learning algorithms.
This example is very simple in the sense that the given policy is not good only for state s1. If the policy is also not good for the other states, will selecting the action with the greatest action value still generate a better policy? Moreover, do optimal policies always exist? What does an optimal policy look like? We will answer all of
these questions in this chapter.
3.2 Optimal state values and optimal policies

Consider two policies π1 and π2. If

vπ1(s) ≥ vπ2(s)   for all s ∈ S,

then π1 is said to be better than π2. Furthermore, if a policy is better than all the other possible policies, then this policy is optimal. This is formally stated below.
Definition 3.1 (Optimal policy and optimal state value). A policy π ∗ is optimal if
vπ∗ (s) ≥ vπ (s) for all s ∈ S and for any other policy π. The state values of π ∗ are the
optimal state values.
The above definition indicates that an optimal policy has the greatest state value
for every state compared to all the other policies. This definition also leads to many
questions:
Existence: Does the optimal policy exist?
Uniqueness: Is the optimal policy unique?
Stochasticity: Is the optimal policy stochastic or deterministic?
Algorithm: How to obtain the optimal policy and the optimal state values?
These fundamental questions must be clearly answered to thoroughly understand
optimal policies. For example, regarding the existence of optimal policies, if optimal
policies do not exist, then we do not need to bother to design algorithms to find them.
We will answer all these questions in the remainder of this chapter.
3.3 Bellman optimality equation

The Bellman optimality equation (BOE) is, for every state s ∈ S,

v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) v(s′) ),   (3.1)

where v(s), v(s′) are the state values to be solved and

q(s, a) ≜ Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) v(s′).
Here, π(s) denotes a policy for state s, and Π(s) is the set of all possible policies for s.
The BOE is an elegant and powerful tool for analyzing optimal policies. However,
it may be nontrivial to understand this equation. For example, this equation has two
unknown variables v(s) and π(a|s). It may be confusing to beginners how to solve two
unknown variables from one equation. Moreover, the BOE is actually a special Bellman
equation. However, it is nontrivial to see this because its expression is quite different from
that of the Bellman equation. We also need to answer the following fundamental questions
about the BOE.
Example 3.1. Consider the equation x = max_{y∈R} (2x − 1 − y²), where x, y ∈ R.
The first step is to solve y on the right-hand side of the equation. Regardless of the value
of x, we always have maxy (2x − 1 − y 2 ) = 2x − 1, where the maximum is achieved when
y = 0. The second step is to solve x. When y = 0, the equation becomes x = 2x − 1,
which leads to x = 1. Therefore, y = 0 and x = 1 are the solutions of the equation.
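As a quick numerical check of the two-step procedure above, here is a minimal Python sketch (purely illustrative; the grid over y and the test value of x are arbitrary choices):

import numpy as np

ys = np.linspace(-5.0, 5.0, 10001)      # a grid over y that contains y = 0

# Step 1: for any fixed x, max_y (2x - 1 - y^2) = 2x - 1, attained at y = 0.
x = 3.7                                 # an arbitrary value of x
rhs = 2 * x - 1 - ys**2
print(ys[np.argmax(rhs)], rhs.max())    # -> 0.0 and 2*x - 1 = 6.4

# Step 2: with the maximum substituted, the equation becomes x = 2x - 1, whose solution is x = 1.
x_star = 1.0
print(np.isclose(x_star, np.max(2 * x_star - 1 - ys**2)))   # True: both sides equal 1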
We now turn to the maximization problem on the right-hand side of the BOE. The
BOE in (3.1) can be written concisely as
v(s) = max_{π(s)∈Π(s)} Σ_{a∈A} π(a|s) q(s, a),   s ∈ S.
Inspired by Example 3.1, we can first solve the optimal π on the right-hand side. How to
do that? The following example demonstrates its basic idea.
Example 3.2. Given q1 , q2 , q3 ∈ R, we would like to find the optimal values of c1 , c2 , c3
to maximize
Σ_{i=1}^{3} c_i q_i = c1 q1 + c2 q2 + c3 q3,

where c1 + c2 + c3 = 1 and c1, c2, c3 ≥ 0. Without loss of generality, suppose that q3 ≥ q1, q2. Then, the optimal solution is c3∗ = 1 and c1∗ = c2∗ = 0. This is because

q3 = (c1 + c2 + c3) q3 = c1 q3 + c2 q3 + c3 q3 ≥ c1 q1 + c2 q2 + c3 q3

for any c1, c2, c3.
Inspired by the above example, since Σ_{a∈A} π(a|s) = 1, we have

Σ_{a∈A} π(a|s) q(s, a) ≤ Σ_{a∈A} π(a|s) max_{a∈A} q(s, a) = max_{a∈A} q(s, a),

where the equality holds when π(a|s) = 1 for a = a∗ and π(a|s) = 0 for a ≠ a∗. Here, a∗ = arg max_a q(s, a). In summary, the optimal policy π(s) is the one that selects
the action that has the greatest value of q(s, a).
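The argument above is easy to verify numerically. The following minimal Python sketch (the action values are made-up numbers, purely for illustration) compares an arbitrary policy with the greedy one at a single state:

import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=5)              # made-up action values q(s, a) for five actions

pi = rng.random(5)                  # an arbitrary policy: nonnegative weights summing to one
pi /= pi.sum()

greedy = np.zeros(5)                # the greedy policy: all probability on argmax_a q(s, a)
greedy[np.argmax(q)] = 1.0

print(pi @ q)                       # expected action value of the arbitrary policy
print(greedy @ q, q.max())          # equal to max_a q(s, a); no policy can exceed this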
The matrix-vector form of the BOE is

v = max_{π∈Π} (rπ + γPπ v),   (3.2)

where the elements of rπ ∈ R^{|S|} and Pπ ∈ R^{|S|×|S|} are

[rπ]_s ≜ Σ_{a∈A} π(a|s) Σ_{r∈R} p(r|s, a) r,   [Pπ]_{s,s′} ≜ p(s′|s) = Σ_{a∈A} π(a|s) p(s′|s, a).
Since the optimal value of π is determined by v, the right-hand side of (3.2) is a function
of v, denoted as
f(v) ≜ max_{π∈Π} (rπ + γPπ v).

Then, the BOE (3.2) can be written in the concise form

v = f(v).   (3.3)
In the remainder of this section, we show how to solve this nonlinear equation.
We next introduce the concepts of fixed point and contraction mapping, which are the key tools for analyzing the BOE.
Consider a function f (x), where x ∈ Rd and f : Rd → Rd . A point x∗ is called a fixed
point if
f (x∗ ) = x∗ .
The interpretation of the above equation is that the map of x∗ is itself. This is the
reason why x∗ is called “fixed”. The function f is a contraction mapping (or contractive function) if there exists γ ∈ (0, 1) such that

‖f(x1) − f(x2)‖ ≤ γ‖x1 − x2‖

for any x1, x2 ∈ R^d, where ‖ · ‖ can be any vector norm.
Example 3.3. We present three examples to demonstrate fixed points and contraction
mappings.
The first example is x = f(x) = 0.5x, x ∈ R. It is easy to verify that x = 0 is a fixed point since 0 = 0.5 · 0. Moreover, f(x) = 0.5x is a contraction mapping because ‖0.5x1 − 0.5x2‖ = 0.5‖x1 − x2‖ ≤ γ‖x1 − x2‖ for any γ ∈ [0.5, 1).

The second example is x = f(x) = Ax, where x ∈ R^n and A ∈ R^{n×n} is a matrix satisfying ‖A‖ ≤ γ < 1. Here, x = 0 is a fixed point since 0 = A · 0, and f(x) = Ax is a contraction mapping because ‖Ax1 − Ax2‖ ≤ ‖A‖‖x1 − x2‖ ≤ γ‖x1 − x2‖.

The third example is x = f(x) = 0.5 sin x, x ∈ R. It is easy to verify that x = 0 is a fixed point since 0 = 0.5 sin 0. Moreover, by the mean value theorem, |0.5 sin x1 − 0.5 sin x2| = 0.5|cos x̃| |x1 − x2| for some x̃ between x1 and x2. As a result, |0.5 sin x1 − 0.5 sin x2| ≤ 0.5|x1 − x2| and hence f(x) = 0.5 sin x is a contraction mapping.
The relationship between a fixed point and the contraction property is characterized
by the following classic theorem.
Theorem 3.1 (Contraction mapping theorem). For any equation that has the form x =
f (x) where x and f (x) are real vectors, if f is a contraction mapping, then the following
properties hold.
Existence: there exists a fixed point x∗ satisfying f(x∗) = x∗.

Uniqueness: the fixed point x∗ is unique.

Algorithm: consider the iterative process

x_{k+1} = f(x_k),   k = 0, 1, 2, …,

where x0 is any given initial value. Then, x_k converges to x∗ as k → ∞. Moreover, the convergence rate is exponentially fast.
The contraction mapping theorem not only can tell whether the solution of a nonlinear
equation exists but also suggests a numerical algorithm for solving the equation. The
proof of the theorem is given in Box 3.1.
The following example demonstrates how to calculate the fixed points of some equations using the iterative algorithm suggested by the contraction mapping theorem.
Example 3.4. Let us revisit the abovementioned examples: x = 0.5x, x = Ax, and
x = 0.5 sin x. While it has been shown that the right-hand sides of these three equations
are all contraction mappings, it follows from the contraction mapping theorem that they
each have a unique fixed point, which can be easily verified to be x∗ = 0. Moreover, the
fixed points of the three equations can be iteratively solved by the following algorithms:
x_{k+1} = 0.5 x_k,    x_{k+1} = A x_k,    x_{k+1} = 0.5 sin x_k.
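The following minimal Python sketch (an illustration; the specific matrix A and the initial values are arbitrary choices) runs these iterative algorithms and confirms that they converge to the fixed point x∗ = 0:

import numpy as np

def iterate(f, x0, k=100):
    """Run the fixed-point iteration x_{k+1} = f(x_k) for k steps."""
    x = x0
    for _ in range(k):
        x = f(x)
    return x

print(iterate(lambda x: 0.5 * x, 10.0))           # -> approximately 0
print(iterate(lambda x: 0.5 * np.sin(x), 10.0))   # -> approximately 0

A = np.array([[0.4, 0.1],
              [0.0, 0.3]])                         # an arbitrary matrix with norm < 1 (assumption)
print(iterate(lambda x: A @ x, np.ones(2)))        # -> approximately the zero vector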
Box 3.1: Proof of the contraction mapping theorem

Part 1: We prove that the sequence {x_k}_{k=1}^∞, where x_k = f(x_{k−1}), is convergent.
The proof relies on Cauchy sequences. A sequence x1, x2, · · · ∈ R^d is called Cauchy if for any small ε > 0, there exists N such that ‖x_m − x_n‖ < ε for all m, n > N. The intuitive interpretation is that there exists a finite integer N such that all the elements after N are sufficiently close to each other. Cauchy sequences are important because it is guaranteed that a Cauchy sequence converges to a limit. This convergence property will be used to prove the contraction mapping theorem. Note that we must have ‖x_m − x_n‖ < ε for all m, n > N. If we simply have x_{n+1} − x_n → 0, it is insufficient to claim that the sequence is a Cauchy sequence. For example, it holds that x_{n+1} − x_n → 0 for x_n = √n, but apparently, x_n = √n diverges.

We next show that {x_k = f(x_{k−1})}_{k=1}^∞ is a Cauchy sequence and hence converges.
Since f is a contraction mapping, we have

‖x_{k+1} − x_k‖ = ‖f(x_k) − f(x_{k−1})‖ ≤ γ‖x_k − x_{k−1}‖ ≤ γ²‖x_{k−1} − x_{k−2}‖ ≤ ··· ≤ γ^k ‖x_1 − x_0‖.

Then, for any m > n, it follows from the triangle inequality that

‖x_m − x_n‖ ≤ ‖x_m − x_{m−1}‖ + ··· + ‖x_{n+1} − x_n‖ ≤ (γ^{m−1} + ··· + γ^n) ‖x_1 − x_0‖ ≤ γ^n/(1 − γ) ‖x_1 − x_0‖,

which converges to zero as n → ∞. As a result, for any ε > 0, we can always find N such that ‖x_m − x_n‖ < ε for all m, n > N.
Therefore, this sequence is Cauchy and hence converges to a limit point denoted as x∗ = lim_{k→∞} x_k.
Part 2: We show that the limit x∗ = lim_{k→∞} x_k is a fixed point. To do that, note that

‖f(x_k) − x_k‖ = ‖x_{k+1} − x_k‖ ≤ γ^k ‖x_1 − x_0‖,

so f(x_k) − x_k → 0. Since x_k → x∗ and ‖f(x_k) − f(x∗)‖ ≤ γ‖x_k − x∗‖ → 0, we also have f(x_k) → f(x∗). Taking the limit of f(x_k) − x_k → 0 then gives f(x∗) = x∗, so x∗ is a fixed point.

Part 3: We show that the fixed point is unique. Suppose that x′ is another fixed point satisfying f(x′) = x′. Then, ‖x′ − x∗‖ = ‖f(x′) − f(x∗)‖ ≤ γ‖x′ − x∗‖. Since γ < 1, this inequality holds only if ‖x′ − x∗‖ = 0, and hence x′ = x∗.

Finally, we show that x_n converges to x∗ exponentially fast. Since ‖x_m − x_n‖ ≤ γ^n/(1 − γ) ‖x_1 − x_0‖ for any m > n, letting m → ∞ gives

‖x∗ − x_n‖ = lim_{m→∞} ‖x_m − x_n‖ ≤ γ^n/(1 − γ) ‖x_1 − x_0‖.

Since γ^n converges to zero exponentially fast as n increases, so does the error ‖x∗ − x_n‖.
Theorem 3.2 (Contraction property of f (v)). The function f (v) on the right-hand side
of the BOE in (3.3) is a contraction mapping. In particular, for any v1, v2 ∈ R^{|S|}, it holds
that
kf (v1 ) − f (v2 )k∞ ≤ γkv1 − v2 k∞ ,
where γ ∈ (0, 1) is the discount rate, and k · k∞ is the maximum norm, which is the
maximum absolute value of the elements of a vector.
The proof of the theorem is given in Box 3.2. This theorem is important because we
can use the powerful contraction mapping theorem to analyze the BOE.
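Before going through the proof, a quick numerical sanity check may build intuition. The following Python sketch generates a random finite MDP (the MDP itself is made up purely for illustration) and verifies the inequality of Theorem 3.2 for a random pair v1, v2:

import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 4, 3, 0.9             # a made-up finite MDP: 4 states, 3 actions

P = rng.random((n_s, n_a, n_s))         # p(s'|s,a), normalized over s'
P /= P.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_s, n_a))         # expected immediate rewards sum_r p(r|s,a) r

def f(v):
    """Right-hand side of the BOE: [f(v)]_s = max_a ( r(s,a) + gamma * sum_s' p(s'|s,a) v(s') )."""
    return (r + gamma * P @ v).max(axis=1)

v1, v2 = rng.normal(size=n_s), rng.normal(size=n_s)
lhs = np.max(np.abs(f(v1) - f(v2)))     # ||f(v1) - f(v2)||_inf
rhs = gamma * np.max(np.abs(v1 - v2))   # gamma * ||v1 - v2||_inf
print(lhs <= rhs + 1e-12)               # True: f is a gamma-contraction in the max norm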
Box 3.2: Proof of Theorem 3.2

For any v1, v2 ∈ R^{|S|}, let π1∗ ≜ arg max_{π∈Π} (rπ + γPπ v1) and π2∗ ≜ arg max_{π∈Π} (rπ + γPπ v2), so that f(v1) = rπ1∗ + γPπ1∗ v1 and f(v2) = rπ2∗ + γPπ2∗ v2. Then,

f(v1) − f(v2) = rπ1∗ + γPπ1∗ v1 − (rπ2∗ + γPπ2∗ v2) ≤ rπ1∗ + γPπ1∗ v1 − (rπ1∗ + γPπ1∗ v2) = γPπ1∗ (v1 − v2),

where the inequality holds because rπ2∗ + γPπ2∗ v2 ≥ rπ1∗ + γPπ1∗ v2. Similarly, f(v2) − f(v1) ≤ γPπ2∗ (v2 − v1). Define

z ≜ max( |γPπ2∗ (v1 − v2)|, |γPπ1∗ (v1 − v2)| ) ∈ R^{|S|},

where max(·) and | · | are elementwise operations. Then, f(v1) − f(v2) ≤ z and f(v2) − f(v1) ≤ z,
which implies
|f (v1 ) − f (v2 )| ≤ z.
Consider the i-th entry of z, which is equal to γ max(|p_i^T (v1 − v2)|, |q_i^T (v1 − v2)|), where p_i^T and q_i^T are the i-th rows of Pπ2∗ and Pπ1∗, respectively. Since p_i is a vector with all nonnegative elements and the sum of the elements is equal to one, it follows that |p_i^T (v1 − v2)| ≤ p_i^T |v1 − v2| ≤ ‖v1 − v2‖∞, and the same bound holds for q_i. Therefore, ‖z‖∞ ≤ γ‖v1 − v2‖∞, and hence

‖f(v1) − f(v2)‖∞ ≤ ‖z‖∞ ≤ γ‖v1 − v2‖∞,

which completes the proof.
3.4 Solving an optimal policy from the BOE

Suppose that v∗ is the solution of the BOE. It satisfies

v∗ = max_{π∈Π} (rπ + γPπ v∗).
Theorem 3.3 (Existence, uniqueness, and algorithm). For the BOE v = f(v) = max_{π∈Π} (rπ + γPπ v), there always exists a unique solution v∗, which can be solved iteratively by

v_{k+1} = f(v_k) = max_{π∈Π} (rπ + γPπ v_k),   k = 0, 1, 2, …

The sequence {v_k} converges to v∗ exponentially fast for any given initial value v_0.
The proof of this theorem directly follows from the contraction mapping theorem since
f(v) is a contraction mapping. This theorem is important because it answers some of the fundamental questions raised previously: the solution v∗ always exists, it is unique, and it can be computed by a simple iterative algorithm.
Solving π∗: Once the value of v∗ has been obtained, we can easily obtain π∗ by solving

π∗ = arg max_{π∈Π} (rπ + γPπ v∗).   (3.6)

The value of π∗ will be given in Theorem 3.5. Substituting (3.6) into the BOE yields
v ∗ = rπ∗ + γPπ∗ v ∗ .
Therefore, v ∗ = vπ∗ is the state value of π ∗ , and the BOE is a special Bellman equation
whose corresponding policy is π ∗ .
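Putting the iteration of Theorem 3.3 and the policy-extraction step (3.6) together gives the value iteration algorithm developed in Chapter 4. The following minimal Python sketch (the MDP is randomly generated purely for illustration; the iteration count is an arbitrary choice) solves the BOE and extracts a greedy optimal policy:

import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, gamma = 4, 3, 0.9             # a made-up finite MDP, for illustration only

P = rng.random((n_s, n_a, n_s))         # p(s'|s,a)
P /= P.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_s, n_a))         # expected immediate rewards

v = np.zeros(n_s)                       # any initial value v_0 works
for _ in range(1000):                   # v_{k+1} = max_pi (r_pi + gamma * P_pi v_k)
    v = (r + gamma * P @ v).max(axis=1)

q_star = r + gamma * P @ v              # q*(s,a) evaluated at the solution v*
pi_star = q_star.argmax(axis=1)         # a deterministic greedy optimal policy
print(v)                                # optimal state values v*
print(pi_star)                          # optimal action index for each state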
At this point, although we can solve v ∗ and π ∗ , it is still unclear whether the solution
is optimal. The following theorem reveals the optimality of the solution.
Theorem 3.4 (Optimality of v ∗ and π ∗ ). The solution v ∗ is the optimal state value, and
π ∗ is an optimal policy. That is, for any policy π, it holds that
v∗ = vπ∗ ≥ vπ, where vπ is the state value of π and the inequality holds elementwise (i.e., for every state).
Now, it is clear why we must study the BOE: its solution corresponds to optimal state
values and optimal policies. The proof of the above theorem is given in the following box.
For any policy π, its state value satisfies the Bellman equation

vπ = rπ + γPπ vπ.

Since

v∗ = max_{π′∈Π} (rπ′ + γPπ′ v∗) ≥ rπ + γPπ v∗,

we have

v∗ − vπ ≥ (rπ + γPπ v∗) − (rπ + γPπ vπ) = γPπ (v∗ − vπ).

Repeatedly applying this inequality gives v∗ − vπ ≥ γPπ (v∗ − vπ) ≥ γ²Pπ² (v∗ − vπ) ≥ ··· ≥ γ^n Pπ^n (v∗ − vπ). It follows that

v∗ − vπ ≥ lim_{n→∞} γ^n Pπ^n (v∗ − vπ) = 0,

where the last equality is true because γ < 1 and Pπ^n is a nonnegative matrix with all its elements less than or equal to 1 (because Pπ^n 1 = 1). Therefore, v∗ ≥ vπ for any π.
We next examine π ∗ in (3.6) more closely. In particular, the following theorem shows
that there always exists a deterministic greedy policy that is optimal.
Theorem 3.5 (Greedy optimal policy). For any s ∈ S, the deterministic greedy policy
π∗(a|s) = 1 if a = a∗(s), and π∗(a|s) = 0 if a ≠ a∗(s),   (3.7)

is an optimal policy for solving the BOE. Here, a∗(s) ≜ arg max_{a∈A} q∗(s, a),
where
q∗(s, a) ≜ Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) v∗(s′).
While the matrix-vector form of the optimal policy is π∗ = arg max_{π∈Π} (rπ + γPπ v∗), its elementwise form is

π∗(s) = arg max_{π∈Π} Σ_{a∈A} π(a|s) ( Σ_{r∈R} p(r|s, a) r + γ Σ_{s′∈S} p(s′|s, a) v∗(s′) ),   s ∈ S,

where the term in the parentheses is exactly q∗(s, a).
It is clear that Σ_{a∈A} π(a|s) q∗(s, a) is maximized if π(s) selects the action with the greatest value of q∗(s, a). The policy in (3.7) is called greedy because it seeks the actions with the greatest q∗(s, a). Finally, we discuss two important properties of π∗.
Uniqueness of optimal policies: Although the value of v ∗ is unique, the optimal policy
that corresponds to v∗ may not be unique. This can be easily verified by counterexamples. For example, the two policies shown in Figure 3.3 are both optimal.
Stochasticity of optimal policies: An optimal policy can be either stochastic or deterministic, as demonstrated in Figure 3.3. However, it is certain that there always
exists a deterministic optimal policy according to Theorem 3.5.
Figure 3.3: Examples demonstrating that optimal policies may not be unique. The two policies are different (one of them selects two actions with probability p = 0.5 each at some states) but are both optimal.
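To see why both deterministic and stochastic optimal policies can coexist, consider a state s at which two actions a1 and a2 both attain the maximum action value, i.e., q∗(s, a1) = q∗(s, a2) = max_a q∗(s, a). Then, any policy of the form

π(a1|s) = p,   π(a2|s) = 1 − p,   p ∈ [0, 1],

satisfies Σ_{a∈A} π(a|s) q∗(s, a) = max_a q∗(s, a) and is therefore optimal: p ∈ {0, 1} gives deterministic optimal policies, and p ∈ (0, 1) gives stochastic ones.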
3.5 Factors that influence optimal policies
The optimal state value and optimal policy are determined by the following parameters: 1) the immediate reward r, 2) the discount rate γ, and 3) the system model
p(s0 |s, a), p(r|s, a). While the system model is fixed, we next discuss how the optimal
policy varies when we change the values of r and γ. All the optimal policies presented
in this section can be obtained via the algorithm in Theorem 3.3. The implementation
details of the algorithm will be given in Chapter 4. The present chapter mainly focuses
on the fundamental properties of optimal policies.
A baseline example
Consider the example in Figure 3.4. The reward settings are rboundary = rforbidden = −1
and rtarget = 1. In addition, the agent receives a reward of rother = 0 for every movement
step. The discount rate is selected as γ = 0.9.
With the above parameters, the optimal policy and optimal state values are given in
Figure 3.4(a). It is interesting that the agent is not afraid of passing through forbidden
areas to reach the target area. More specifically, starting from the state at (row=4,
column=1), the agent has two options for reaching the target area. The first option is to
avoid all the forbidden areas and travel a long distance to the target area. The second
option is to pass through forbidden areas. Although the agent obtains negative rewards
when entering forbidden areas, the cumulative reward of the second trajectory is greater
than that of the first trajectory. Therefore, the optimal policy is far-sighted due to the
relatively large value of γ.
If we change the discount rate from γ = 0.9 to γ = 0.5 and keep other parameters
unchanged, the optimal policy becomes the one shown in Figure 3.4(b). It is interesting
that the agent does not dare to take risks anymore. Instead, it would travel a long
distance to reach the target while avoiding all the forbidden areas. This is because the
optimal policy becomes short-sighted due to the relatively small value of γ.
In the extreme case where γ = 0, the corresponding optimal policy is shown in
Figure 3.4(c). In this case, the agent is not able to reach the target area.
Figure 3.4: The optimal policies and optimal state values given different parameter values. (a) The baseline setting described above. (b) The discount rate is changed to γ = 0.5; the other parameters are the same as those in (a). (c) The discount rate is changed to γ = 0; the other parameters are the same as those in (a). (d) rforbidden is changed from −1 to −10; the other parameters are the same as those in (a).
This is because the optimal policy for each state is extremely short-sighted and merely selects
the action with the greatest immediate reward instead of the greatest total reward.
In addition, the spatial distribution of the state values exhibits an interesting pattern:
the states close to the target have greater state values, whereas those far away have lower
values. This pattern can be observed from all the examples shown in Figure 3.4. It can
be explained by using the discount rate: if a state must travel along a longer trajectory
to reach the target, its state value is smaller due to the discount rate.
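As a rough illustration of this pattern (a back-of-the-envelope calculation under the baseline setting, assuming the agent collects rother = 0 on the way and then receives rtarget = 1 at every step after reaching the target), if a state is d steps away from the target along its optimal trajectory, then its optimal value is

v∗(s) = γ^{d−1} · 1 + γ^d · 1 + γ^{d+1} · 1 + ··· = γ^{d−1}/(1 − γ).

With γ = 0.9, this gives 10 for d = 1 and 8.1 for d = 3, so the value decays geometrically with the distance d.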
If we want to strictly prohibit the agent from entering any forbidden area, we can increase
the punishment received for doing so. For instance, if rforbidden is changed from −1 to
−10, the resulting optimal policy can avoid all the forbidden areas (see Figure 3.4(d)).
However, changing the rewards does not always lead to different optimal policies.
One important fact is that optimal policies are invariant to affine transformations of the
rewards. In other words, if we scale all the rewards or add the same value to all the
rewards, the optimal policy remains the same.
Theorem 3.6 (Optimal policy invariance). Consider a Markov decision process with v∗ ∈ R^{|S|} as the optimal state value satisfying v∗ = max_{π∈Π} (rπ + γPπ v∗). If every reward r ∈ R is changed by an affine transformation to αr + β, where α, β ∈ R and α > 0, then the corresponding optimal state value v′ is also an affine transformation of v∗:

v′ = αv∗ + β/(1 − γ) · 1,   (3.8)

where γ ∈ (0, 1) is the discount rate and 1 = [1, …, 1]^T. Consequently, the optimal policy derived from v′ is invariant to the affine transformation of the reward values.
If every reward r is changed to αr + β, then the reward vector rπ of any policy π becomes αrπ + β1, and the BOE for the new rewards is

v′ = max_{π∈Π} (αrπ + β1 + γPπ v′).   (3.9)

We next solve the new BOE in (3.9) by showing that v′ = αv∗ + c1 with c = β/(1 − γ) is a solution of (3.9). In particular, substituting v′ = αv∗ + c1 into (3.9) gives

αv∗ + c1 = max_{π∈Π} ( αrπ + β1 + γPπ (αv∗ + c1) ) = max_{π∈Π} ( αrπ + αγPπ v∗ ) + β1 + cγ1,

where the last equality is due to the fact that Pπ 1 = 1. The above equation can be reorganized as

c1 = α max_{π∈Π} (rπ + γPπ v∗) − αv∗ + β1 + cγ1 = β1 + cγ1,

where the last step uses v∗ = max_{π∈Π} (rπ + γPπ v∗). This is equivalent to

β1 + cγ1 − c1 = 0.
Since c = β/(1 − γ), the above equation is valid and hence v′ = αv∗ + c1 is a solution of (3.9). Since (3.9) is a BOE, v′ is also the unique solution. Finally, since v′ is an affine transformation of v∗, the relative relationships between the action values remain the same. Hence, the greedy optimal policy derived from v′ under the transformed rewards is the same as that derived from v∗ under the original rewards: arg max_{π∈Π} (αrπ + β1 + γPπ v′) is the same as arg max_{π∈Π} (rπ + γPπ v∗), because α > 0 and the terms β1 and γPπ (c1) = cγ1 do not depend on π.
Readers may refer to [9] for a further discussion on the conditions under which modifications to the reward values preserve the optimal policy.
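As a quick numerical illustration of Theorem 3.6, the following Python sketch solves the BOE for a randomly generated MDP before and after an affine reward transformation (the MDP, the transformation parameters, and the iteration count are arbitrary choices made only for this illustration):

import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, gamma = 5, 3, 0.9
alpha, beta = 2.0, -1.5                 # an arbitrary affine transformation of the rewards

P = rng.random((n_s, n_a, n_s))         # a made-up MDP, for illustration only
P /= P.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_s, n_a))

def solve_boe(rewards):
    """Solve v = max_pi (r_pi + gamma * P_pi v) by the iteration of Theorem 3.3."""
    v = np.zeros(n_s)
    for _ in range(2000):
        v = (rewards + gamma * P @ v).max(axis=1)
    policy = (rewards + gamma * P @ v).argmax(axis=1)
    return v, policy

v_star, pi_star = solve_boe(r)
v_new, pi_new = solve_boe(alpha * r + beta)

print(np.allclose(v_new, alpha * v_star + beta / (1 - gamma)))  # True: equation (3.8)
print(np.array_equal(pi_star, pi_new))                          # True: optimal policy unchanged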
In the reward setting, the agent receives a reward of rother = 0 for every movement
step (unless it enters a forbidden area or the target area or attempts to go beyond the
boundary). Since a zero reward is not a punishment, would the optimal policy take
meaningless detours before reaching the target? Should we set rother to be negative to
encourage the agent to reach the target as quickly as possible?
Figure 3.5: Examples illustrating that optimal policies do not take meaningless detours due to the discount rate.
Consider the examples in Figure 3.5, where the bottom-right cell is the target area
to reach. The two policies here are the same except for state s2 . By the policy in
Figure 3.5(a), the agent moves downward at s2 and the resulting trajectory is s2 → s4 .
By the policy in Figure 3.5(b), the agent moves leftward and the resulting trajectory is
s2 → s1 → s3 → s4 .
It is notable that the second policy takes a detour before reaching the target area. If
we merely consider the immediate rewards, taking this detour does not matter because
no negative immediate rewards will be obtained. However, if we consider the discounted
return, then this detour matters. In particular, for the first policy, the discounted return
is

return = 1 + γ · 1 + γ² · 1 + ··· = 1/(1 − γ) = 10,

where γ = 0.9. For the second policy, the discounted return is

return = 0 + γ · 0 + γ² · 1 + γ³ · 1 + ··· = γ²/(1 − γ) = 8.1,

which is smaller. It is clear that the shorter the trajectory is, the greater the return is. Therefore, although
the immediate reward of every step does not encourage the agent to approach the target
as quickly as possible, the discount rate does encourage it to do so.
A misunderstanding that beginners may have is that adding a negative reward (e.g.,
−1) on top of the rewards obtained for every movement is necessary to encourage the
agent to reach the target as quickly as possible. This is a misunderstanding because
adding the same reward on top of all rewards is an affine transformation, which preserves
the optimal policy. Moreover, optimal policies do not take meaningless detours due to
the discount rate, even though a detour may not receive any immediate negative rewards.
3.6 Summary
The core concepts in this chapter include optimal policies and optimal state values. In
particular, a policy is optimal if its state values are greater than or equal to those of any
other policy. The state values of an optimal policy are the optimal state values. The BOE
is the core tool for analyzing optimal policies and optimal state values. This equation
is a nonlinear equation with a nice contraction property. We can apply the contraction
mapping theorem to analyze this equation. It was shown that the solutions of the BOE
correspond to the optimal state value and optimal policy. This is the reason why we need
to study the BOE.
The contents of this chapter are important for thoroughly understanding many fundamental ideas of reinforcement learning. For example, Theorem 3.3 suggests an iterative
algorithm for solving the BOE. This algorithm is exactly the value iteration algorithm
that will be introduced in Chapter 4. A further discussion about the BOE can be found
in [2].
3.7 Q&A
Q: What is the definition of optimal policies?
A: A policy is optimal if its corresponding state values are greater than or equal to those of any other policy.
It should be noted that this specific definition of optimality is valid only for tabular
reinforcement learning algorithms. When the values or policies are approximated by
functions, different metrics must be used to define optimal policies. This will become
clearer in Chapters 8 and 9.
Q: Why is the Bellman optimality equation important?
A: It is important because it characterizes both optimal policies and optimal state
values. Solving this equation yields an optimal policy and the corresponding optimal
state value.
Q: Is the Bellman optimality equation a Bellman equation?
A: Yes. The Bellman optimality equation is a special Bellman equation whose corre-
sponding policy is optimal.
Q: Is the solution of the Bellman optimality equation unique?
A: The Bellman optimality equation has two unknown variables. The first unknown
variable is a value, and the second is a policy. The value solution, which is the optimal
state value, is unique. The policy solution, which is an optimal policy, may not be
unique.
Q: What is the key property of the Bellman optimality equation for analyzing its
solution?
A: The key property is that the right-hand side of the Bellman optimality equation is
a contraction mapping. As a result, we can apply the contraction mapping theorem
to analyze its solution.
Q: Do optimal policies exist?
A: Yes. Optimal policies always exist according to the analysis of the BOE.
Q: Are optimal policies unique?
A: No. There may exist multiple or even infinitely many optimal policies, all of which have the same optimal state values.
Q: Are optimal policies stochastic or deterministic?
A: An optimal policy can be either deterministic or stochastic. A nice fact is that
there always exist deterministic greedy optimal policies.