Abstract Dynamic Programming
THIRD EDITION
Dimitri P. Bertsekas
Arizona State University
WWW: http://www.athenasc.com
Email: [email protected]
1. Introduction . . . . . . . . . . . . . . . . . . . . p. 1
1.1. Structure of Dynamic Programming Problems . . . . . . . p. 2
1.2. Abstract Dynamic Programming Models . . . . . . . . . . p. 5
1.2.1. Problem Formulation . . . . . . . . . . . . . . . . p. 5
1.2.2. Monotonicity and Contraction Properties . . . . . . . p. 7
1.2.3. Some Examples . . . . . . . . . . . . . . . . . . p. 10
1.2.4. Reinforcement Learning - Projected and Aggregation Bellman Equations . . . . p. 24
1.2.5. Reinforcement Learning - Temporal Difference and Proximal Algorithms . . . . p. 26
1.3. Reinforcement Learning - Approximation in Value Space . . . p. 29
1.3.1. Approximation in Value Space for Markovian Decision Problems . . . . p. 29
1.3.2. Approximation in Value Space and Newton’s Method . . . . p. 35
1.3.3. Policy Iteration and Newton’s Method . . . . . . . . p. 39
1.3.4. Approximation in Value Space for General Abstract Dynamic Programming . . . . p. 41
1.4. Organization of the Book . . . . . . . . . . . . . . . . p. 41
1.5. Notes, Sources, and Exercises . . . . . . . . . . . . . . . p. 45
2. Contractive Models . . . . . . . . . . . . . . . . . p. 53
2.1. Bellman’s Equation and Optimality Conditions . . . . . . . p. 54
2.2. Limited Lookahead Policies . . . . . . . . . . . . . . . p. 61
2.3. Value Iteration . . . . . . . . . . . . . . . . . . . . . p. 66
2.3.1. Approximate Value Iteration . . . . . . . . . . . . . p. 67
2.4. Policy Iteration . . . . . . . . . . . . . . . . . . . . . p. 70
2.4.1. Approximate Policy Iteration . . . . . . . . . . . . p. 73
2.4.2. Approximate Policy Iteration Where Policies Converge . p. 75
2.5. Optimistic Policy Iteration and λ-Policy Iteration . . . . . . p. 77
2.5.1. Convergence of Optimistic Policy Iteration . . . . . . p. 79
2.5.2. Approximate Optimistic Policy Iteration . . . . . . . p. 84
2.5.3. Randomized Optimistic Policy Iteration . . . . . . . . p. 87
References . . . . . . . . . . . . . . . . . . . . . p. 391
Index . . . . . . . . . . . . . . . . . . . . . . . p. 401
Preface of the First Edition
This book aims at a unified and economical development of the core the-
ory and algorithms of total cost sequential decision problems, based on
the strong connections of the subject with fixed point theory. The analy-
sis focuses on the abstract mapping that underlies dynamic programming
(DP for short) and defines the mathematical character of the associated
problem. Our discussion centers on two fundamental properties that this
mapping may have: monotonicity and (weighted sup-norm) contraction. It
turns out that the nature of the analytical and algorithmic DP theory is
determined primarily by the presence or absence of these two properties,
and the rest of the problem’s structure is largely inconsequential.
In this book, with some minor exceptions, we will assume that mono-
tonicity holds. Consequently, we organize our treatment around the con-
traction property, and we focus on four main classes of models:
(a) Contractive models, discussed in Chapter 2, which have the richest
and strongest theory, and are the benchmark against which the the-
ory of other models is compared. Prominent among these models are
discounted stochastic optimal control problems. The development of
these models is quite thorough and includes the analysis of recent ap-
proximation algorithms for large-scale problems (neuro-dynamic pro-
gramming, reinforcement learning).
(b) Semicontractive models, discussed in Chapter 3 and parts of Chap-
ter 4. The term “semicontractive” is used qualitatively here, to refer
to a variety of models where some policies have a regularity/contrac-
tion-like property but others do not. A prominent example is stochas-
tic shortest path problems, where one aims to drive the state of
a Markov chain to a termination state at minimum expected cost.
These models also have a strong theory under certain conditions, of-
ten nearly as strong as those of the contractive models.
(c) Noncontractive models, discussed in Chapter 4, which rely on just
monotonicity. These models are more complex than the preceding
ones and much of the theory of the contractive models generalizes in
weaker form, if at all. For example, in general the associated Bell-
man equation need not have a unique solution, the value iteration
method may work starting with some functions but not with others,
and the policy iteration method may not work at all. Infinite hori-
zon examples of these models are the classical positive and negative
DP problems, first analyzed by Dubins and Savage, Blackwell, and
Dimitri P. Bertsekas
Spring 2013
Preface of the Second Edition
The second edition aims primarily to amplify the presentation of the semi-
contractive models of Chapter 3 and Chapter 4, and to supplement it with
a broad spectrum of research results that I obtained and published in jour-
nals and reports since the first edition was written. As a result, the size
of this material more than doubled, and the size of the book increased by
about 40%.
In particular, I have thoroughly rewritten Chapter 3, which deals with
semicontractive models where stationary regular policies are sufficient. I
expanded and streamlined the theoretical framework, and I provided new
analyses of a number of shortest path-type applications (deterministic,
stochastic, affine monotonic, exponential cost, and robust/minimax), as
well as several types of optimal control problems with continuous state
space (including linear-quadratic, regulation, and planning problems).
In Chapter 4, I have extended the notion of regularity to nonstation-
ary policies (Section 4.4), aiming to explore the structure of the solution set
of Bellman’s equation, and the connection of optimality with other struc-
tural properties of optimal control problems. As an application, I have
discussed in Section 4.5 the relation of optimality with classical notions
of stability and controllability in continuous-spaces deterministic optimal
control. In Section 4.6, I have similarly extended the notion of a proper
policy to continuous-spaces stochastic shortest path problems.
I have also revised Chapter 1 a little (mainly with the addition of
Section 1.2.5 on the relation between proximal algorithms and temporal
difference methods), added to Chapter 2 some analysis relating to λ-policy
iteration and randomized policy iteration algorithms (Section 2.5.3), and I
have also added several new exercises (with complete solutions) to Chapters
1-4. Additional material relating to various applications can be found in
some of my journal papers, reports, and video lectures on semicontractive
models, which are posted at my web site.
In addition to the changes in Chapters 1-4, I have also eliminated from
the second edition the analysis that deals with restricted policies (Chap-
ter 5 and Appendix C of the first edition). This analysis is motivated in
part by the complex measurability questions that arise in mathematically
rigorous theories of stochastic optimal control with Borel state and control
spaces. This material is covered in Chapter 6 of the monograph by Bert-
sekas and Shreve [BeS78], and followup research on the subject has been
limited. Thus, I decided to just post Chapter 5 and Appendix C of the first
edition at the book’s web site (40 pages), and omit them from the second
edition. As a result of this choice, the entire book now requires only a
modest mathematical background, essentially a first course in analysis and
in elementary probability.
The range of applications of dynamic programming has grown enor-
mously in the last 25 years, thanks to the use of approximate simulation-
based methods for large and challenging problems. Because approximations
are often tied to special characteristics of specific models, their coverage in
this book is limited to general discussions in Chapter 1 and to error bounds
given in Chapter 2. However, much of the work on approximation methods
so far has focused on finite-state discounted, and relatively simple deter-
ministic and stochastic shortest path problems, for which there is solid and
robust analytical and algorithmic theory (part of Chapters 2 and 3 in this
monograph). As the range of applications becomes broader, I expect that
the level of mathematical understanding projected in this book will become
essential for the development of effective and reliable solution methods. In
particular, much of the new material in this edition deals with infinite-state
and/or complex shortest path-type problems, whose approximate solution
will require new methodologies that transcend the current state of the art.
Dimitri P. Bertsekas
January 2018
Dimitri P. Bertsekas
February 2022
1
Introduction
Dynamic programming (DP for short) is the principal method for analysis
of a large and diverse class of sequential decision problems. Examples are
deterministic and stochastic optimal control problems with a continuous
state space, Markov and semi-Markov decision problems with a discrete
state space, minimax problems, and sequential zero-sum games. While the
nature of these problems may vary widely, their underlying structures turn
out to be very similar. In all cases there is an underlying mapping that
depends on an associated controlled dynamic system and corresponding
cost per stage. This mapping, the DP (or Bellman) operator, provides a
compact “mathematical signature” of the problem. It defines the cost func-
tion of policies and the optimal cost function, and it provides a convenient
shorthand notation for algorithmic description and analysis.
More importantly, the structure of the DP operator defines the math-
ematical character of the associated problem. The purpose of this book is to
provide an analysis of this structure, centering on two fundamental prop-
erties: monotonicity and (weighted sup-norm) contraction. It turns out
that the nature of the analytical and algorithmic DP theory is determined
primarily by the presence or absence of one or both of these two properties,
and the rest of the problem’s structure is largely inconsequential.
Here xk is the state of the system taking values in a set X (the state space),
and uk is the control taking values in a set U (the control space). † At stage
k, there is a cost
α^k g(x_k, u_k)
incurred when uk is applied at state xk , where α is a scalar in (0, 1] that has
the interpretation of a discount factor when α < 1. The controls are chosen
as a function of the current state, subject to a constraint that depends on
that state. In particular, at state x the control is constrained to take values
in a given set U (x) ⊂ U . Thus we are interested in optimization over the
set of (nonstationary) policies
! "
Π = {µ0 , µ1 , . . .} | µk ∈ M, k = 0, 1, . . . ,
(We use limit superior rather than limit to cover the case where the limit
does not exist.) The optimal cost function is
We now note that both Eqs. (1.3) and (1.4) can be stated in terms of
the expression
H(x, u, J) = g(x, u) + αJ(f(x, u)),    x ∈ X, u ∈ U(x).
Defining
(Tµ J)(x) = H(x, µ(x), J),    x ∈ X,
and
(T J)(x) = inf_{u∈U(x)} H(x, u, J),    x ∈ X,
Then it can be verified by induction that for all initial states x0 , we have
Jπ,N(x0) = (Tµ0 Tµ1 · · · TµN−1 J̄)(x0).    (1.6)
Here Tµ0 Tµ1 · · · TµN−1 is the composition of the mappings Tµ0 , Tµ1 , . . . TµN−1 ,
i.e., for all J,
(Tµ0 Tµ1 J)(x) = (Tµ0 (Tµ1 J))(x),    x ∈ X,
and more generally
(Tµ0 Tµ1 · · · TµN−1 J)(x) = (Tµ0 (Tµ1 (· · · (TµN−1 J))))(x),    x ∈ X,
(our notational conventions are summarized in Appendix A). Thus the
finite horizon cost functions Jπ,N of π can be defined in terms of the map-
pings Tµ [cf. Eq. (1.6)], and so can the infinite horizon cost function Jπ :
Jπ(x) = lim sup_{N→∞} (Tµ0 Tµ1 · · · TµN−1 J̄)(x),    x ∈ X,    (1.7)
The Bellman equation (1.3) and the optimality condition (1.4), stated in
terms of the mappings Tµ and T , highlight a central theme of this book,
which is that DP theory is intimately connected with the theory of abstract
mappings and their fixed points. Analogs of the Bellman equation, J * =
T J * , optimality conditions, and other results and computational methods
hold for a great variety of DP models, and can be stated compactly as
described above in terms of the corresponding mappings Tµ and T . The
gain from this abstraction is greater generality and mathematical insight,
as well as a more unified, economical, and streamlined analysis.
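As a concrete illustration of this abstract viewpoint, the following minimal Python sketch builds the mapping H of a small deterministic problem with made-up data (the arrays f_next and g_cost and the discount factor are hypothetical, not from the book), forms the operators Tµ and T, and evaluates a finite-horizon cost by composing the mappings Tµ as in Eq. (1.6).

```python
import numpy as np

# A minimal sketch: a 3-state, 2-control deterministic problem with
# H(x, u, J) = g(x, u) + alpha * J(f(x, u)), cf. the definitions above.
# The data (f_next, g_cost, alpha) are illustrative only.
alpha = 0.9
f_next = np.array([[1, 2], [2, 0], [2, 1]])               # f(x, u): next state
g_cost = np.array([[1.0, 4.0], [0.5, 2.0], [0.0, 3.0]])   # g(x, u)

def H(x, u, J):
    return g_cost[x, u] + alpha * J[f_next[x, u]]

def T_mu(mu, J):
    """(T_mu J)(x) = H(x, mu(x), J) for a stationary policy mu (array of controls)."""
    return np.array([H(x, mu[x], J) for x in range(len(J))])

def T(J):
    """(T J)(x) = min_u H(x, u, J)."""
    return np.array([min(H(x, u, J) for u in range(f_next.shape[1])) for x in range(len(J))])

# Finite-horizon cost of a nonstationary policy pi = (mu_0, ..., mu_{N-1}),
# computed as the composition (T_mu0 T_mu1 ... T_mu{N-1} Jbar), cf. Eq. (1.6).
Jbar = np.zeros(3)
pi = [np.array([0, 0, 1]), np.array([1, 0, 0]), np.array([0, 1, 0])]
J = Jbar
for mu in reversed(pi):        # apply T_mu{N-1} first, T_mu0 last
    J = T_mu(mu, J)
print("J_{pi,3} =", J)
```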
which we view as the “infinite horizon cost function” of π [cf. Eq. (1.7); we
use lim sup for generality, since we are not assured that the limit exists].
We want to minimize Jπ over π, i.e., to find
J*(x) = inf_π Jπ(x),    x ∈ X,
or equivalently, †
J ≤ J& ⇒ T J ≤ T J &.
J ≤ J& ⇒ Tµ J ≤ Tµ J & , ∀ µ ∈ M,
on B(X). The properties of B(X) and some of the associated fixed point
theory are discussed in Appendix B. In particular, as shown there, B(X)
is a complete normed space, so any mapping from B(X) to B(X) that is a
contraction or an m-stage contraction for some integer m > 1, with respect
to ‖ · ‖, has a unique fixed point (cf. Props. B.1 and B.2).
In what follows in this section, we describe a few special cases, which indi-
cate the connections of appropriate forms of the mapping H with the most
popular total cost DP models. In all these models the monotonicity As-
sumption 1.2.1 (or some closely related version) holds, but the contraction
Assumption 1.2.2 may not hold, as we will indicate later. Our descriptions
are by necessity brief, and the reader is referred to the relevant textbook
literature for more detailed discussion.
In particular, it can be shown that the limit exists if α < 1 and the expected
value of |g| is uniformly bounded, i.e., for some B > 0,
E{ |g(x, u, w)| } ≤ B,    ∀ x ∈ X, u ∈ U(x).    (1.12)
so that
(Tµ J)(x) = E{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },
and
(T J)(x) = inf_{u∈U(x)} E{ g(x, u, w) + αJ(f(x, u, w)) }.
In this way, taking also into account the rule ∞ − ∞ = ∞ (see Appendix A), the expected value over w is well-defined as an extended real number if Ω is finite or countably infinite.
and parallels the one given for deterministic optimal control problems [cf. Eq.
(1.3)].
These properties can be expressed and analyzed in an abstract setting
by using just the mappings Tµ and T , both when Tµ and T are contractive
(see Chapter 2), and when they are only monotone and not contractive while
either g ≥ 0 or g ≤ 0 (see Chapter 4). Moreover, under some conditions, it is
possible to analyze these properties in cases where Tµ is contractive for some
but not all µ (see Chapter 3, and Section 4.4).
In the special case of the preceding example where the number of states is
finite, the system equation (1.10) may be defined in terms of the transition
probabilities
p_{xy}(u) = Prob( y = f(x, u, w) | x ),    x, y ∈ X, u ∈ U(x),
[cf. Eq. (1.12)] holds (or more simply, when U is a finite set), the mappings Tµ
and T are contraction mappings with respect to the standard (unweighted)
sup-norm. This is a classical model, referred to as discounted finite-state
MDP , which has a favorable theory and has found extensive applications (cf.
[Ber12a], Chapters 1 and 2). The model is additionally important, because it
is often used for computational solution of continuous state space problems
via discretization.
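The sketch below runs value iteration for a discounted finite-state MDP of this type, with randomly generated (purely illustrative) transition probabilities and costs; it is only meant to show how the operator T of this example can be coded, not a definitive implementation.

```python
import numpy as np

# A hedged sketch of value iteration for a discounted finite-state MDP with
# transition probabilities p[u][x, y] and costs g[x, u]; the data are made up.
alpha = 0.9
n, m = 3, 2
rng = np.random.default_rng(0)
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)   # p[u][x, y]
g = rng.random((n, m))

def T(J):
    # (T J)(x) = min_u [ g(x, u) + alpha * sum_y p_xy(u) J(y) ]
    Q = g + alpha * np.einsum('uxy,y->xu', p, J)
    return Q.min(axis=1)

J = np.zeros(n)
for k in range(500):                      # T is a sup-norm contraction with modulus alpha
    J_new = T(J)
    if np.max(np.abs(J_new - J)) < 1e-10:
        break
    J = J_new
mu = (g + alpha * np.einsum('uxy,y->xu', p, J_new)).argmin(axis=1)
print("J* ≈", J_new, "greedy policy:", mu)
```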
where G is some function representing expected cost per stage, and mxy (u)
are nonnegative scalars with
Σ_{y∈X} m_{xy}(u) < 1,    ∀ x ∈ X, u ∈ U(x).
Let us consider the situation where a separate game of the type just
described is played at each stage. The game played at a given stage is repre-
sented by a “state” x that takes values in a finite set X. The state evolves
according to transition probabilities qxy (i, j) where i and j are the moves
selected by the minimizer and the maximizer, respectively (here y represents
the next game to be played after moves i and j are chosen at the game rep-
resented by x). When the state is x, under u ∈ U and v ∈ V , the one-stage
expected payoff is u′A(x)v, where A(x) is the n × m payoff matrix, and the
state transition probabilities are
p_{xy}(u, v) = Σ_{i=1}^n Σ_{j=1}^m u_i v_j q_{xy}(i, j) = u′ Q_{xy} v,
where Qxy is the n × m matrix that has components qxy (i, j). Payoffs are
discounted by α ∈ (0, 1), and the objectives of the minimizer and maximizer,
roughly speaking, are to minimize and to maximize the total discounted ex-
pected payoff. This requires selections of u and v to strike a balance between
obtaining favorable current stage payoffs and playing favorable games in fu-
ture stages.
We now introduce an abstract DP framework related to the sequential
move selection process just described. We consider the mapping G given by
G(x, u, v, J) = u′A(x)v + α Σ_{y∈X} p_{xy}(u, v) J(y)
             = u′( A(x) + α Σ_{y∈X} Q_{xy} J(y) ) v,    (1.14)
and
(T J)(x) = min_{u∈U} max_{v∈V} G(x, u, v, J).
[cf. Eq. (1.14)] is a matrix that is independent of u and v, we may view J ∗ (x)
as the value of a static game (which depends on the state x). In particular,
from the fundamental minimax equality (1.13), we have
This implies that J ∗ is also the unique fixed point of the mapping
where
H(x, v, J) = min_{u∈U} G(x, u, v, J),
i.e., J* is the fixed point regardless of the order in which minimizer and
maximizer select mixed strategies at each stage.
In the preceding development, we have introduced J ∗ as the unique
fixed point of the mappings T and T̄. However, J* also has an interpretation
in game theoretic terms. In particular, it can be shown that J ∗ (x) is the value
of a dynamic game, whereby at state x the two opponents choose multistage
(possibly nonstationary) policies that consist of functions of the current state,
and continue to select moves using these policies over an infinite horizon. For
further discussion of this interpretation, we refer to [Ber12a] and to books on
dynamic games such as [FiV96]; see also [PaB99] and [Yu14] for an analysis
of the undiscounted case (α = 1) where there is a termination state, as in
the stochastic shortest path problems of the subsequent Example 1.2.6. An
alternative and more general formulation of sequential zero-sum games, which
allows for an infinite state space, will be given in Chapter 5.
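A hedged computational sketch of this discounted game model follows: value iteration where, at each state, the value of the static matrix game A(x) + α Σ_y Q_{xy} J(y) [cf. Eq. (1.14)] is computed by linear programming over the minimizer's mixed strategies. The problem data are randomly generated and purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Value iteration for the sequential game: J(x) <- value of the matrix game
# A(x) + alpha * sum_y Q_xy J(y).  Data below are made up.
alpha, n_states, n, m = 0.9, 2, 2, 2
rng = np.random.default_rng(1)
A = rng.random((n_states, n, m))                 # payoff matrices A(x)
Q = rng.random((n_states, n_states, n, m))       # Q_xy with entries q_xy(i, j)
Q /= Q.sum(axis=1, keepdims=True)                # sum_y q_xy(i, j) = 1

def game_value(M):
    """min_u max_v u'Mv over mixed strategies, via LP: min t s.t. M'u <= t*1, sum(u) = 1."""
    nrow, ncol = M.shape
    c = np.r_[np.zeros(nrow), 1.0]                       # variables (u, t), minimize t
    A_ub = np.c_[M.T, -np.ones(ncol)]                    # (M'u)_j - t <= 0 for every column j
    b_ub = np.zeros(ncol)
    A_eq = np.r_[np.ones(nrow), 0.0].reshape(1, -1)      # sum_i u_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * nrow + [(None, None)])
    return res.x[-1]

def T(J):
    return np.array([game_value(A[x] + alpha * np.tensordot(Q[x], J, axes=([0], [0])))
                     for x in range(n_states)])

J = np.zeros(n_states)
for _ in range(300):
    J = T(J)
print("game optimal cost J* ≈", J)
```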
The stochastic shortest path (SSP for short) problem is the special case of
the stochastic optimal control Example 1.2.1 where:
(a) There is no discounting (α = 1).
(b) The state space is X = {t, 1, . . . , n} and we are given transition proba-
bilities, denoted by
To simplify the notation, we have assumed that the cost per stage does not
depend on the successor state, which amounts to using expected cost per
stage in all calculations.
Since the termination state t is cost-free, the cost starting from t is zero
for every policy. Accordingly, for all cost functions, we ignore the component
that corresponds to t, and define
H(x, u, J) = g(x, u) + Σ_{y=1}^n p_{xy}(u) J(y),    x = 1, . . . , n,  u ∈ U(x),  J ∈ ℜ^n.
(Tµ J)(x) = g(x, µ(x)) + Σ_{y=1}^n p_{xy}(µ(x)) J(y),    x = 1, . . . , n,
(T J)(x) = min_{u∈U(x)} [ g(x, u) + Σ_{y=1}^n p_{xy}(u) J(y) ],    x = 1, . . . , n.
Note that the matrix that has components pxy (u), x, y = 1, . . . , n, is sub-
stochastic (some of its row sums may be less than 1) because there may be
a positive transition probability from a state x to the termination state t.
Consequently Tµ may be a contraction for some µ, but not necessarily for all
µ ∈ M.
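The following is a minimal sketch of value iteration for an SSP instance, assuming made-up substochastic transition probabilities; the missing probability mass at each state-control pair is the probability of moving to the cost-free termination state t.

```python
import numpy as np

# Value iteration for an SSP problem (alpha = 1) with states 1,...,n and a
# termination state t. Row sums of p may be less than 1; the remainder is the
# probability of termination. Data are illustrative.
n = 3
p = np.array([[[0.2, 0.3, 0.0],      # p[x][u][y] for control u = 0
               [0.0, 0.5, 0.4]],     # and u = 1
              [[0.6, 0.0, 0.2],
               [0.1, 0.1, 0.1]],
              [[0.0, 0.0, 0.5],
               [0.3, 0.3, 0.0]]])
g = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.3]])

def T(J):
    # (T J)(x) = min_u [ g(x, u) + sum_{y=1}^n p_xy(u) J(y) ]
    return np.min(g + np.einsum('xuy,y->xu', p, J), axis=1)

J = np.zeros(n)
for _ in range(2000):
    J = T(J)
print("SSP optimal cost J* ≈", J)
```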
The SSP problem has been discussed in many sources, including the
books [Pal67], [Der70], [Whi82], [Ber87], [BeT89], [HeL99], [Ber12a], and
[Ber17a], where it is sometimes referred to by earlier names such as “first
passage problem” and “transient programming problem.” In the framework
that is most relevant to our purposes, given in the paper by Bertsekas and
Tsitsiklis [BeT91], there is a classification of stationary policies for SSP into
proper and improper . We say that µ ∈ M is proper if, when using µ, there is
positive probability that termination will be reached after at most n stages,
regardless of the initial state; i.e., if
Tµ∗ J ∗ = T J ∗ .
In addition, J* and Jµ can be computed by value iteration, starting with any J ∈ ℜ^n (see [Ber12a], Chapter 3, for a textbook account).
These properties are in analogy with the desirable properties (a)-(c), given at
the end of the preceding subsection in connection with contractive models.
Regarding policy iteration, it works in its strongest form when there are
no improper policies, in which case the mappings Tµ and T are weighted sup-
norm contractions. When there are improper policies, modifications to the
policy iteration method are needed; see [Ber12a], [YuB13a], and also Section
3.6.2, where these modifications will be discussed in an abstract setting.
In Section 3.5.1 we will also consider SSP problems where the strong
SSP conditions (a) and (b) above are not satisfied. Then we will see that
unusual phenomena can occur, including that J ∗ may not be a solution of
Bellman’s equation. Still our line of analysis of Chapter 3 will apply to such
problems.
The special case of the SSP problem where the state transitions are determin-
istic is the classical shortest path problem. Here, we have a graph of n nodes
x = 1, . . . , n, plus a destination t, and an arc length axy for each directed arc
(x, y). At state/node x, a policy µ chooses an outgoing arc from x. Thus the
controls available at x can be identified with the outgoing neighbors of x [the
nodes u such that (x, u) is an arc]. The corresponding mapping H is
H(x, u, J) = a_{xu} + J(u)  if u ≠ t,    and    H(x, u, J) = a_{xt}  if u = t,    x = 1, . . . , n.
A stationary policy µ defines a graph whose arcs are (x, µ(x)), x =
1, . . . , n. The policy µ is proper if and only if this graph is acyclic (it consists of
a tree of directed paths leading from each node to the destination). Thus there
exists a proper policy if and only if each node is connected to the destination
with a directed path. Furthermore, an improper policy has finite cost starting
from every initial state if and only if all the cycles of the corresponding graph
have nonnegative cycle cost. It follows that the favorable analytical and
algorithmic results described for SSP in the preceding example hold if the
given graph is connected and the costs of all its cycles are positive. We will
see later that significant complications result if the cycle costs are allowed to
be zero, even though the shortest path problem is still well posed in the sense
that shortest paths exist if the given graph is connected (see Section 3.1).
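A small sketch of the deterministic shortest path mapping H above follows, solved by value iteration on an illustrative acyclic graph; the arc lengths and node labels are hypothetical.

```python
# A sketch (with an illustrative graph, not from the book) of the shortest path
# mapping H above, solved by value iteration; arcs[x] lists pairs (u, a_xu),
# where u is an outgoing neighbor of x or the destination 't'.
arcs = {1: [(2, 1.0), ('t', 5.0)],
        2: [(3, 1.0), ('t', 3.0)],
        3: [('t', 1.0)]}

def H(x, u, J):
    # H(x, u, J) = a_xu + J(u) if u != t, and a_xt if u = t
    a = dict(arcs[x])[u]
    return a if u == 't' else a + J[u]

J = {x: 0.0 for x in arcs}                       # start value iteration from J = 0
while True:
    TJ = {x: min(H(x, u, J) for u, _ in arcs[x]) for x in arcs}
    if TJ == J:
        break
    J = TJ
mu = {x: min(arcs[x], key=lambda au: H(x, au[0], J))[0] for x in arcs}
print("shortest distances:", J, "policy (next node):", mu)
```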
J̄(x) ≡ 1,    x ∈ X,
so that
Jπ(x) = lim sup_{N→∞} (Tµ0 Tµ1 · · · TµN−1 J̄)(x),    x ∈ X.
which by using the iterated expectations formula (see e.g., [BeT08]) proves
the expression (1.17).
An important special case of a multiplicative model is when g has the
form
g(x, u, y) = e^{h(x,u,y)}
for some one-stage cost function h. We then obtain a finite-state MDP with
an exponential cost function,
& $ %'
h(x0 ,µ0 (x0 ),x1 )+···+h(xN−1 ,µN−1 (xN−1 ),xN )
Jπ (x0 ) = lim sup E e ,
N→∞
which is often used to introduce risk aversion in the choice of policy through
the convexity of the exponential.
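A sketch of the finite-state multiplicative model with exponential one-stage costs g = e^h and terminal function J̄ ≡ 1 is given below; the data are made up and are chosen so that the mappings involved are contractions and value iteration converges.

```python
import numpy as np

# Multiplicative (exponential cost) model: H(x, u, J) = sum_y p_xy(u) exp(h(x,u,y)) J(y),
# with value iteration started from Jbar = 1. Data are illustrative.
n, m = 3, 2
rng = np.random.default_rng(2)
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)   # p[u][x, y]
h = rng.random((m, n, n)) - 1.5                                # h(x, u, y) < 0 here

def T(J):
    # (T J)(x) = min_u sum_y p_xy(u) exp(h(x, u, y)) J(y)
    return np.einsum('uxy,uxy,y->xu', p, np.exp(h), J).min(axis=1)

J = np.ones(n)                                                 # Jbar(x) = 1
for _ in range(300):
    J = T(J)
print("risk-sensitive optimal values J ≈", J)
```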
There is also a multiplicative version of the infinite state space stochas-
tic optimal control problem of Example 1.2.1. The mapping H takes the
form
H(x, u, J) = E{ g(x, u, w) J(f(x, u, w)) },
where xk+1 = f (xk , uk , wk ) is the underlying discrete-time dynamic system;
cf. Eq. (1.10).
Multiplicative models and related risk-sensitive models are discussed
extensively in the literature, mostly for the exponential cost case and under
different assumptions than ours; see e.g., [HoM72], [Jac73], [Rot84], [ChS87],
[Whi90], [JBE94], [FlM95], [HeM96], [FeM97], [BoM99], [CoM99], [BoM02],
[BBB08], [Ber16a]. The works of references [DeR79], [Pat01], and [Pat07]
relate to the stochastic shortest path problems of Example 1.2.6, and are the
closest to the semicontractive models discussed in Chapters 3 and 4, based
on the author’s paper [Ber16a]; see the next example and Section 3.5.2.
where pxy (u) is the probability of transition from x to y under u, and g(x, u, y)
is the cost of the transition; see Section 3.5.2 for a detailed derivation. Clearly
Tµ has the affine monotonic form (1.18).
Σ_{i=1}^n d_{xi} = 1,    ∀ x ∈ A.
Figure 1.2.2 Illustration of the relation between aggregate and original system states.
d_{xi} = 1 if i = i_x,  and  d_{xi} = 0 if i ≠ i_x,    (1.19)
φ_{jy} = 1 if j = j_y,  and  φ_{jy} = 0 if j ≠ j_y.
the original state space with S1 ∪ · · · ∪ Sm = {1, . . . , n}. We envision a network of processors ℓ = 1, . . . , m, each assigned to the computation of a local cost function Vℓ, defined on the corresponding aggregate state/subset Sℓ. Processor ℓ also maintains a scalar aggregate cost Rℓ for its aggregate state, which is a weighted average of the detailed cost values Vℓ(x) within Sℓ:
Rℓ = Σ_{x∈Sℓ} dℓx Vℓ(x),
where dℓx are given probabilities with dℓx ≥ 0 and Σ_{x∈Sℓ} dℓx = 1. The aggregate costs Rℓ are communicated between processors and are used to perform the computation of the local cost functions Vℓ (we will discuss computation models of this type in Section 2.6).
We denote J = (V1 , . . . , Vm , R1 , . . . , Rm ). We introduce the mapping
H(x, u, J) defined for each of the n states x by
and for each original system state y, we denote by s(y) the index of the subset
to which y belongs [i.e., y ∈ Ss(y) ].
We may view H as an abstract mapping on the space of J, and aim to find its fixed point J* = (V1*, . . . , Vm*, R1*, . . . , Rm*). Then, for ℓ = 1, . . . , m, we may view Vℓ* as an approximation to the optimal cost vector of the original MDP starting at states x ∈ Sℓ, and we may view Rℓ* as a form of aggregate cost for Sℓ. The advantage of this formulation is that it involves significant decomposition and parallelization of the computations among the processors, when performing various DP algorithms. In particular, the computation of Wℓ(x, u, Vℓ, R1, . . . , Rm) depends on just the local vector Vℓ, whose dimension may be potentially much smaller than n.
for λ ∈ (0, 1), where T^ℓ is the ℓ-fold composition of T with itself.
Here there should be conditions that guarantee the convergence of the
infinite series in the preceding definition. The multistep analog of the
projected Eq. (1.20) is
Φr = Π_ξ T^(λ)(Φr).
The popular temporal difference methods, such as TD(λ), LSTD(λ), and
LSPE(λ), aim to solve this equation (see the book references on approx-
imate DP, neuro-dynamic programming, and reinforcement learning cited
earlier). The mapping T (λ) also forms the basis for the λ-policy iteration
method to be discussed in Sections 2.5, 3.2.4, and 4.3.3.
The multistep analog of the aggregation Eq. (1.22) is
Φr = ΦD T^(λ)(Φr),
and methods that are similar to the temporal difference methods can be
used for its solution. In particular, a multistep method based on the map-
ping T^(λ) is the so-called λ-aggregation method (see [Ber12a], Chapter
6), as well as other forms of aggregation (see [Ber12a], [YuB12]).
In the case where T is a linear mapping of the form
T J = AJ + b,
Equivalently,
P^(c) J = ( ((c + 1)/c) I − A )^{-1} ( b + (1/c) J ),    (1.23)
where I is the identity matrix. Then it can be shown (see Exercise 1.2 or
the papers [Ber16b], [Ber18c]) that if
c = λ/(1 − λ),
we have
T^(λ) = T · P^(c) = P^(c) · T.
Moreover, the vectors J, P^(c) J, and T^(λ) J are collinear and satisfy
T^(λ) J = J + ((c + 1)/c) ( P^(c) J − J ).
The preceding formulas show that T (λ) and P (c) are closely related, and
that iterating with T (λ) is “faster” than iterating with P (c) , since the eigen-
values of A are within the unit circle, so that T is a contraction. In addition,
methods such as TD(λ), LSTD(λ), LSPE(λ), and their projected versions,
which are based on T (λ) , can be adapted to be used with P (c) .
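The sketch below checks the preceding relations numerically under made-up data: for a linear mapping T J = AJ + b with the spectral radius of A below 1, the truncated series for T^(λ) is compared with T·P^(c) and with the collinearity formula, for c = λ/(1 − λ).

```python
import numpy as np

# Numerical check (illustrative data) of T^(lambda) = T * P^(c) and the
# collinearity formula, with c = lambda/(1 - lambda).
n, lam = 4, 0.6
c = lam / (1 - lam)
rng = np.random.default_rng(3)
A = rng.random((n, n)); A *= 0.9 / np.abs(np.linalg.eigvals(A)).max()   # spectral radius < 1
b = rng.random(n)
J = rng.random(n)

T = lambda J: A @ J + b

# Proximal iterate, Eq. (1.23): P^(c) J = ((c+1)/c I - A)^{-1} (b + J/c)
P_c = np.linalg.solve((c + 1) / c * np.eye(n) - A, b + J / c)

# Multistep iterate T^(lambda) J = (1 - lambda) * sum_{l>=0} lambda^l T^{l+1} J (truncated)
TlamJ = np.zeros(n)
TkJ = T(J)
for l in range(2000):
    TlamJ += (1 - lam) * lam**l * TkJ
    TkJ = T(TkJ)

print(np.allclose(TlamJ, T(P_c)))                       # T^(lambda) J == T(P^(c) J)
print(np.allclose(TlamJ, J + (c + 1) / c * (P_c - J)))  # collinearity formula
```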
A more general form of multistep approach, introduced and studied
in the paper [YuB12], replaces T^(λ) with a mapping T^(w) : ℜ^n → ℜ^n that has components
(T^(w) J)(i) = Σ_{ℓ=1}^∞ w_{iℓ} (T^ℓ J)(i),    i = 1, . . . , n,  J ∈ ℜ^n,
where w is a vector sequence whose ith component, (wi1 , wi2 , . . .), is a prob-
ability distribution over the positive integers. Then the multistep analog
of the projected equation (1.20) is
Φr = Πξ T (w) (Φr), (1.24)
while the multistep analog of the aggregation equation (1.22) is
Φr = ΦDT (w) (Φr). (1.25)
The mapping T^(λ) is obtained for w_{iℓ} = (1 − λ)λ^{ℓ−1}, independently of the state i. A more general version, where λ depends on the state i, is obtained for w_{iℓ} = (1 − λ_i)λ_i^{ℓ−1}. The solution of Eqs. (1.24) and (1.25)
by simulation-based methods is discussed in the paper [YuB12]; see also
Exercise 1.3.
Let us also note that there is a connection between projected equa-
tions of the form (1.24) and aggregation equations of the form (1.25). This
connection is based on the use of a seminorm [this is given by the same
expression as the norm ‖ · ‖_ξ of Eq. (1.21), with some of the components
of ξ allowed to be 0]. In particular, the most prominent cases of aggrega-
tion equations can be viewed as seminorm projected equations because, for
these cases, ΦD is a seminorm projection (see [Ber12a], p. 639, [YuB12],
Section 4). Moreover, they can also be viewed as projected equations where
the projection is oblique (see [Ber12a], Section 7.3.6).
and
(T J)(x) = inf_{u∈U(x)} E{ g(x, u, w) + αJ(f(x, u, w)) },    for all x,    (1.28)
Tµ J = Gµ + Aµ J,
Aµ (γ1 J1 + γ2 J2 ) = γ1 Aµ J1 + γ2 Aµ J2 .
This is true because of the linearity of the expected value operation in Eq.
(1.27). The linearity of Tµ implies another important property: (T J)(x) is
a concave function of J for every x. By this we mean that the set
! "
Cx = (J, ξ) | (T J)(x) ≥ ξ, J ∈ R(X), ξ ∈ & (1.29)
is convex for all x ∈ X, where R(X) is the set of real-valued functions over
the state space X, and & is the set of real numbers. This follows from the
linearity of Tµ , the alternative definition of T given by Eq. (1.26), and the
fact that for a fixed x, the minimum of the linear functions (Tµ J)(x) over
µ ∈ M is concave as a function of J.
We illustrate these properties graphically with an example.
Assume that there are two states 1 and 2, and two controls u and v. Consider
the policy µ that applies control u at state 1 and control v at state 2. Then
the operator Tµ takes the form
(Tµ J)(1) = Σ_{y=1}^2 p_{1y}(u) ( g(1, u, y) + αJ(y) ),    (1.30)
(Tµ J)(2) = Σ_{y=1}^2 p_{2y}(v) ( g(2, v, y) + αJ(y) ),    (1.31)
where pxy (u) and pxy (v) are the probabilities that the next state will be y,
when the current state is x, and the control is u or v, respectively. Clearly,
(Tµ J)(1) and (Tµ J)(2) are linear functions of J. Also the operator T of the
Bellman equation J = T J takes the form
(T J)(1) = min[ Σ_{y=1}^2 p_{1y}(u) ( g(1, u, y) + αJ(y) ),  Σ_{y=1}^2 p_{1y}(v) ( g(1, v, y) + αJ(y) ) ],    (1.32)
(T J)(2) = min[ Σ_{y=1}^2 p_{2y}(u) ( g(2, u, y) + αJ(y) ),  Σ_{y=1}^2 p_{2y}(v) ( g(2, v, y) + αJ(y) ) ].    (1.33)
Thus, (T J)(1) and (T J)(2) are concave and piecewise linear as functions of the two-dimensional vector J (with two pieces; more generally, with as many linear pieces as the number of controls). This concavity property holds in general.
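A sketch of the operators (1.30)-(1.33) with illustrative probabilities and costs (not the book's) follows; it verifies numerically that (T J)(x) is the pointwise minimum (lower envelope) of the linear functions (Tµ J)(x).

```python
import numpy as np

# Two states, two controls 'u' and 'v'; made-up transition probabilities and costs.
alpha = 0.9
p = {'u': np.array([[0.7, 0.3], [0.4, 0.6]]),   # p_xy(u)
     'v': np.array([[0.2, 0.8], [0.9, 0.1]])}   # p_xy(v)
g = {'u': np.array([[1.0, 2.0], [0.5, 3.0]]),   # g(x, u, y)
     'v': np.array([[2.5, 0.5], [1.0, 1.0]])}

def T_mu(mu, J):
    # mu maps each state to 'u' or 'v'; (T_mu J)(x) = sum_y p_xy(mu(x)) (g(x, mu(x), y) + alpha J(y))
    return np.array([p[mu[x]][x] @ (g[mu[x]][x] + alpha * J) for x in range(2)])

def T(J):
    return np.array([min(p[c][x] @ (g[c][x] + alpha * J) for c in ('u', 'v')) for x in range(2)])

J = np.array([15.0, 30.0])
all_mu = [[a, b] for a in 'uv' for b in 'uv']
print("T J:", T(J))
print("T J equals the minimum of T_mu J over the four policies:",
      np.allclose(T(J), np.min([T_mu(mu, J) for mu in all_mu], axis=0)))
```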
Geometrical Interpretations
Let us consider the mapping T for a problem that involves two states, 1 and
2, but an infinite number of controls. In particular, the control space at both
Figure 1.3.2 Geometric interpretation of the linear Bellman operator Tµ and the
corresponding Bellman equation. The graph of Tµ is a plane in the space ℜ × ℜ,
and when projected on a one-dimensional plane that corresponds to a single state
and passes through Jµ , it becomes a line. Then there are three cases:
(a) The line has slope less than 45 degrees, so it intersects the 45-degree line at
a unique point, which is equal to Jµ , the solution of the Bellman equation
J = Tµ J. This is true if Tµ is a contraction mapping, as is the case for
discounted problems with bounded cost per stage.
(b) The line has slope greater than 45 degrees. Then it intersects the 45-degree line at a unique point, which is a solution of the Bellman equation J = Tµ J, but is not equal to Jµ. Then Jµ is not real-valued; we consider such µ to be unstable.
(c) The line has slope exactly equal to 45 degrees. This is an exceptional case
where the Bellman equation J = Tµ J has an infinite number of real-valued
solutions or no real-valued solution at all; we will provide examples where
this occurs later.
states is the unit interval, U (1) = U (2) = [0, 1]. Here (T J)(1) and (T J)(2)
are given by
Figure 1.3.3 Geometric interpretation of the Bellman operator T , and the cor-
responding Bellman equation. For a fixed x, the function (T J)(x) can be written
as minµ (Tµ J)(x), so it is concave as a function of J. The optimal cost function
J ∗ satisfies J ∗ = T J ∗ , so it is obtained from the intersection of the graph of T J
and the 45 degree line shown, assuming J ∗ is real-valued.
Note that the graph of T lies below the graph of every operator Tµ, and is in fact obtained as the lower envelope of the graphs of Tµ as µ ranges over the set of policies M. In particular, for any given function J̃, for every x, the value (T J̃)(x) is obtained by finding a support hyperplane/subgradient of the graph of the concave function (T J)(x) at J̃, as shown in the figure. This support hyperplane is defined by the control µ̃(x) of a policy µ̃ that attains the minimum of (Tµ J̃)(x) over µ:
µ̃(x) ∈ arg min_{µ∈M} (Tµ J̃)(x)
(there may be multiple policies attaining this minimum, defining multiple support hyperplanes). This construction also shows how the minimization
(T J̃)(x) = min_{µ∈M} (Tµ J̃)(x)
corresponds to a linearization of the mapping T at the point J̃.
Figure 1.3.4 Illustration of the Bellman operator T for states 1 and 2 in Example
1.3.2. The parameter values are g1 = 5, g2 = 3, r11 = 3, r12 = 15, r21 = 9,
r22 = 1, and the discount factor is α = 0.9. The optimal costs are J ∗ (1) = 49.7
and J ∗ (2) = 40.0, and the optimal policy is µ∗ (1) = 0.59 and µ∗ (2) = 0. The
figure also shows the one-dimensional slices of the operators at J(1) = 15 and
J(2) = 30, together with the corresponding 45-degree lines.
Jk+1 = T Jk , k = 0, 1, . . . ,
Jk+1 = Tµ Jk , k = 0, 1, . . . ,
and it can be similarly interpreted, except that the graph of the function
Tµ J is linear. Also we will see shortly that there is a similarly compact
description for the policy iteration algorithm.
Tµ̃ J̃ = T J̃,
as in Fig. 1.3.6. This equation implies that the graph of Tµ̃ J just touches the graph of T J at J̃, as shown in the figure. Moreover, for each state x ∈ X the hyperplane
Hµ̃(x) = { (J(x), ξ) | (Tµ̃ J)(x) ≥ ξ }
touches the graph of (T J)(x) at the point (J̃(x), (T J̃)(x)) and defines a subgradient of (T J)(x) at J̃.
Note that the one-step lookahead policy µ̃ need not be unique, since T
need not be differentiable.
In conclusion, the equation
J = Tµ̃ J
is the linearization of the Bellman equation
J = T J
at J̃, and its solution, Jµ̃, can be viewed as the result of a Newton iteration at the point J̃. In summary, the Newton iterate at J̃ is Jµ̃, the solution of the linearized equation J = Tµ̃ J.†
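The following sketch (with made-up finite MDP data, not an example from the book) illustrates this Newton-step interpretation: starting from a rough J̃, the one-step lookahead policy µ̃ is computed, the linearized equation J = Tµ̃ J is solved exactly, and the resulting error from J* is compared with that of J̃.

```python
import numpy as np

# One-step lookahead as a Newton step, on an illustrative finite MDP.
alpha, n, m = 0.9, 4, 3
rng = np.random.default_rng(4)
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((n, m))

def Q(J):                      # Q(x, u) = g(x, u) + alpha * sum_y p_xy(u) J(y)
    return g + alpha * np.einsum('uxy,y->xu', p, J)

def policy_cost(mu):           # solve the linear equation J = T_mu J
    P_mu = p[mu, np.arange(n), :]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g[np.arange(n), mu])

# "Exact" J* via many value iterations, for comparison.
J_star = np.zeros(n)
for _ in range(2000):
    J_star = Q(J_star).min(axis=1)

J_tilde = J_star + rng.normal(scale=5.0, size=n)     # a rough approximation of J*
mu_tilde = Q(J_tilde).argmin(axis=1)                 # one-step lookahead policy at J_tilde
J_newton = policy_cost(mu_tilde)                     # solution of the linearized equation

print("||J_tilde - J*||    =", np.max(np.abs(J_tilde - J_star)))
print("||J_mu_tilde - J*|| =", np.max(np.abs(J_newton - J_star)))
```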
We may also consider approximation in value space with ℓ-step looka-
† The classical Newton’s method for solving a fixed point problem of the form
y = T (y), where y is an n-dimensional vector, operates as follows: At the current
iterate yk , we linearize T and find the solution yk+1 of the corresponding linear
fixed point problem. Assuming T is differentiable, the linearization is obtained
head using J̃. This is the same as approximation in value space with one-step lookahead using the (ℓ − 1)-fold operation of T on J̃, T^{ℓ−1} J̃. Thus it can be interpreted as a Newton step starting from T^{ℓ−1} J̃, the result of ℓ − 1 value iterations applied to J̃. This is illustrated in Fig. 1.3.7.†
y_{k+1} = T(y_k) + (∂T(y_k)/∂y) (y_{k+1} − y_k),
and the local rate of convergence to the fixed point y* is quadratic,
‖y_{k+1} − y*‖ = O( ‖y_k − y*‖² ),
where ‖ · ‖ is the Euclidean norm, and holds assuming the Jacobian matrix ex-
ists and is Lipschitz continuous (see [Ber16], Section 1.4). There are extensions
of Newton’s method that are based on solving a linearized system at the cur-
rent iterate, but relax the differentiability requirement to piecewise differentiabil-
ity, and/or component concavity, while maintaining the superlinear convergence
property of the method.
The structure of the Bellman operators (1.28) and (1.27), with their mono-
tonicity and concavity properties, tends to enhance the convergence and rate of
convergence properties of Newton’s method, even in the absence of differentiabil-
ity, as evidenced by the convergence analysis of PI, and the extensive favorable
experience with rollout, PI, and MPC. In this connection, it is worth noting that
in the case of Markov games, where the concavity property does not hold, the
PI method may oscillate, as shown by Pollatschek and Avi-Itzhak [PoA69], and
needs to be modified to restore its global convergence; see the author’s paper
[Ber21c]. We will discuss abstract versions of game and minimax contexts in
Chapter 5.
† Variants of Newton’s method that involve combinations of first order iterative methods, such as the Gauss-Seidel and Jacobi algorithms, and Newton’s method, belong to the general family of Newton-SOR methods
(SOR stands for “successive over-relaxation”); see the classic book by Ortega
and Rheinboldt [OrR70] (Section 13.4).
(a) Policy evaluation, which computes the cost function Jµ . One possi-
bility is to solve the corresponding Bellman equation
& $ % $ %'
Jµ (x) = E g x, µ(x), w + αJµ f (x, µ(x), w) , for all x.
(1.34)
However, the value Jµ (x) for any x can also be computed by Monte
Carlo simulation, by averaging over many randomly generated tra-
jectories the cost of the policy starting from x. Other possibilities
include the use of specialized simulation-based methods, based on
the projected and aggregation Bellman equations discussed in Sec-
tion 1.2.4, for which there is extensive literature (see e.g., the books
[BeT96], [SuB98], [Ber12a], [Ber19b]).
(b) Policy improvement , which computes the rollout policy µ̃ using the
one-step lookahead minimization
& $ %'
µ̃(x) ∈ arg min E g(x, u, w) + αJµ f (x, u, w) , for all x.
u∈U(x)
(1.35)
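The following is a minimal sketch of the two steps (a) and (b) for a made-up finite MDP: policy evaluation by solving the linear system corresponding to Eq. (1.34), and policy improvement by the minimization (1.35); repeating the two steps is policy iteration.

```python
import numpy as np

# Rollout / policy iteration steps for an illustrative finite MDP.
alpha, n, m = 0.95, 5, 3
rng = np.random.default_rng(5)
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((n, m))

def evaluate(mu):
    # Policy evaluation: J_mu solves J = g_mu + alpha * P_mu J   [cf. Eq. (1.34)]
    P_mu = p[mu, np.arange(n), :]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g[np.arange(n), mu])

def improve(J):
    # Policy improvement: mu_tilde(x) in argmin_u [ g(x,u) + alpha sum_y p_xy(u) J(y) ]   [cf. (1.35)]
    return (g + alpha * np.einsum('uxy,y->xu', p, J)).argmin(axis=1)

mu = np.zeros(n, dtype=int)                 # an arbitrary base policy
for k in range(20):                         # repeated rollout = policy iteration
    J_mu = evaluate(mu)
    mu_new = improve(J_mu)
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
print("converged after", k, "iterations; J of the final policy:", evaluate(mu))
```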
Let us now consider the general case where the mapping Tµ is not assumed
linear for all stationary policies µ ∈ M. In this case we still have the
alternative description of T
(T J)(x) = min_{µ∈M} (Tµ J)(x),    for all x,
but T need not be concave, i.e., for some x ∈ X, the function (T J)(x) may
not be concave as a function of J. We illustrate this fact in Fig. 1.3.9.
The nonlinearity of the mapping Tµ can have profound consequences
on the validity of the PI algorithm and its interpretation in terms of New-
ton’s method. A prominent case where this is so arises in minimax problems
and related two-person zero sum game settings (cf. Example 1.2.5). We will
discuss this case in Chapter 5, where we will introduce modifications to the
PI algorithm that restore its convergence property.
We note, however, that it is possible that the mappings Tµ are non-
linear and convex, but that T has concave and differentiable components
(T J)(x), in which case the Newton step interpretation applies. This occurs
in particular in the important case of zero-sum dynamic games involving a
linear system and a quadratic cost function.
The examples in the preceding sections demonstrate that while the mono-
tonicity assumption is satisfied for most DP models, the contraction as-
sumption may or may not hold. In particular, the contraction assumption
† This was part of Kleinman’s Ph.D. thesis [Kle67] at M.I.T., supervised by
M. Athans. Kleinman gives credit for the one-dimensional version of his results to
Bellman and Kalaba [BeK65]. Note also that the first proposal of the PI method
was given by Bellman in his classic book [Bel57], under the name “approximation
in policy space.”
not with others, and policy iteration may fail altogether. Some of
these issues may be mitigated when additional structure is present,
which we discuss in Sections 4.4-4.6, focusing on noncontractive mod-
els that also have some semicontractive structure, and corresponding
favorable properties.
Examples of DP problems from each of the model categories above,
primarily special cases of the specific DP models discussed in Section 1.2,
are scattered throughout the book. They serve both to illustrate the theory
and its exceptions, and to highlight the beneficial role of additional special
structure.
We finally note some other types of models where there are restric-
tions to the set of policies, i.e., M may be a strict subset of the set of
functions µ : X #→ U with µ(x) ∈ U (x) for all x ∈ X. Such restrictions
may include measurability (needed to establish a mathematically rigorous
probabilistic framework) or special structure that enhances the characteri-
zation of optimal policies and facilitates their computation. These models
were treated in Chapter 5 of the first edition of this book, and also in
Chapter 6 of [BeS78].†
Algorithms
† Chapter 5 of the first edition is accessible from the author’s web site and
the book’s web page, and uses terminology and notation that are consistent with
the present edition.
EXERCISES
This exercise shows how starting with an abstract mapping, we can obtain mul-
tistep mappings with the same fixed points and a stronger contraction modulus.
Consider a set of mappings Tµ : B(X) → B(X), µ ∈ M, satisfying the contraction Assumption 1.2.2, let m be a positive integer, and let Mm be the set of m-tuples ν = (µ0, . . . , µm−1), where µk ∈ M, k = 0, 1, . . . , m − 1. For each ν = (µ0, . . . , µm−1) ∈ Mm, define the mapping Tν by
‖Tν J − Tν J′‖ ≤ α^m ‖J − J′‖,    ∀ J, J′ ∈ B(X),    (1.39)
and
‖T J − T J′‖ ≤ α^m ‖J − J′‖,    ∀ J, J′ ∈ B(X),    (1.40)
where T is defined by
and by taking infimum of both sides over (Tµ0 · · · Tµm−1 ) ∈ Mm and dividing by
v(x), we obtain
( (T J)(x) − (T J′)(x) ) / v(x) ≤ α^m ‖J − J′‖,    ∀ x ∈ X.
Similarly
( (T J′)(x) − (T J)(x) ) / v(x) ≤ α^m ‖J − J′‖,    ∀ x ∈ X,
and by combining the last two relations and taking supremum over x ∈ X, Eq.
(1.40) follows.
The purpose of this exercise is to establish a close connection between the map-
pings underlying temporal difference and proximal methods (cf. Section 1.2.5).
Consider a linear mapping of the form
T J = AJ + b,
cf. Eq. (1.23) [equivalently, for a given J, P (c) J is the unique vector Y ∈ !n that
solves the equation
Y − T Y = (1/c)(J − Y ),
(cf. Fig. 1.5.1)].
(a) Show that P (c) is given by
P^(c) = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ T^ℓ,
Figure 1.5.1. Illustration of the iterates T (λ) J and P (c) J for finding the fixed
point J ∗ of a linear mapping T . Given J, we find the proximal iterate Ĵ = P (c) J and then add the amount (1/c)(Ĵ − J) to obtain T (λ) J = T P (c) J. If T is a
contraction mapping, T (λ) J is closer to J ∗ than P (c) J.
respectively.
(d) Show that the fixed point property of part (c) yields the following formula
for the multistep mapping T (λ) :
θ_i = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ ζ_i^{ℓ+1} = ζ_i(1 − λ)/(1 − ζ_i λ),    i = 1, . . . , n,    (1.44)
( ((c + 1)/c) I − A )^{-1} = ( (1/λ) I − A )^{-1} = λ(I − λA)^{-1} = λ Σ_{ℓ=0}^∞ (λA)^ℓ.
Thus, using the equation 1/c = (1 − λ)/λ,
P^(c) J = ( ((c + 1)/c) I − A )^{-1} ( b + (1/c) J )
       = λ Σ_{ℓ=0}^∞ (λA)^ℓ ( b + ((1 − λ)/λ) J )
       = (1 − λ) Σ_{ℓ=0}^∞ (λA)^ℓ J + λ Σ_{ℓ=0}^∞ (λA)^ℓ b,
which is equal to A^(λ) J + b^(λ). The formula P^(c) = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ T^ℓ follows
from this expression.
(c) To show that T (λ) J is the fixed point of WJ , we must verify that
T^(λ) J = W_J( T^(λ) J ),
or equivalently that
T^(λ) J = (1 − λ)T J + λT( T^(λ) J ) = (1 − λ)T J + λT^(λ)(T J).
where w_ℓ(x) are nonnegative scalars such that for all x ∈ X,
Σ_{ℓ=1}^∞ w_ℓ(x) = 1.
Show that
| (Tµ^(w) J)(x) − (Tµ^(w) J′)(x) | / v(x) ≤ Σ_{ℓ=1}^∞ w_ℓ(x) α^ℓ ‖J − J′‖,    ∀ x ∈ X,
so that Tµ^(w) is a contraction with modulus
ᾱ = sup_{x∈X} Σ_{ℓ=1}^∞ w_ℓ(x) α^ℓ ≤ α < 1.
Moreover, for all µ ∈ M, the mappings Tµ and Tµ^(w) have the same fixed point.
showing the contraction property of Tµ^(w).
Let Jµ be the fixed point of Tµ. By using the relation (Tµ^ℓ Jµ)(x) = Jµ(x), we have for all x ∈ X,
(Tµ^(w) Jµ)(x) = Σ_{ℓ=1}^∞ w_ℓ(x) (Tµ^ℓ Jµ)(x) = Σ_{ℓ=1}^∞ w_ℓ(x) Jµ(x) = Jµ(x),
so Jµ is the fixed point of Tµ^(w) [which is unique since Tµ^(w) is a contraction].
2
Contractive Models
In this chapter we consider the abstract DP model of Section 1.2 under the
most favorable assumptions: monotonicity and weighted sup-norm contrac-
tion. Important special cases of this model are the discounted problems
with bounded cost per stage (Examples 1.2.1-1.2.5), the stochastic shortest
path problem of Example 1.2.6 in the case where all policies are proper, as
well as other problems involving special structures.
We first provide some basic analytical results and then focus on two
types of algorithms: value iteration and policy iteration. In addition to
exact forms of these algorithms, we discuss combinations and approximate
versions, as well as asynchronous distributed versions.
In this section we recall the abstract DP model of Section 1.2, and derive
some of its basic properties under the monotonicity and contraction as-
sumptions of Section 1.3. We consider a set X of states and a set U of
controls, and for each x ∈ X, a nonempty control constraint set U (x) ⊂ U .
We denote by M the set of all functions µ : X → U with µ(x) ∈ U (x)
for all x ∈ X, which we refer to as policies (or “stationary policies,” when
we want to emphasize the distinction from nonstationary policies, to be
discussed later).
We denote by R(X) the set of real-valued functions J : X → ℜ. We have a mapping H : X × U × R(X) → ℜ and for each policy µ ∈ M, we
consider the mapping Tµ : R(X) #→ R(X) defined by
! "
(Tµ J)(x) = H x, µ(x), J , ∀ x ∈ X.
We also consider the mapping T defined by
(T J)(x) = inf_{u∈U(x)} H(x, u, J) = inf_{µ∈M} (Tµ J)(x),    ∀ x ∈ X.
[We will use frequently the second equality above, which holds because M
can be viewed as the Cartesian product Πx∈X U (x).] We want to find a
function J * ∈ R(X) such that
J*(x) = inf_{u∈U(x)} H(x, u, J*),    ∀ x ∈ X,
i.e., to find a fixed point of T within R(X). We also want to obtain a policy
µ∗ ∈ M such that Tµ∗ J * = T J * .
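As a small illustration (with hypothetical data), the sketch below codes the weighted sup-norm ‖J‖ = sup_x |J(x)|/v(x) and estimates the contraction modulus of the operator T of a discounted problem by random sampling; the estimate should not exceed α.

```python
import numpy as np

# Weighted sup-norm and a sampled check of the contraction modulus of T,
# for an illustrative discounted finite-state problem (weight v = 1 here).
alpha, n, m = 0.8, 4, 2
rng = np.random.default_rng(6)
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)
g = rng.random((n, m))
v = np.ones(n)                                   # weight function v > 0

def T(J):
    return (g + alpha * np.einsum('uxy,y->xu', p, J)).min(axis=1)

def wnorm(J):
    return np.max(np.abs(J) / v)

# Estimate sup over J != J' of ||T J - T J'|| / ||J - J'|| by random sampling.
ratios = []
for _ in range(1000):
    J1, J2 = rng.normal(size=n), rng.normal(size=n)
    ratios.append(wnorm(T(J1) - T(J2)) / wnorm(J1 - J2))
print("estimated contraction modulus ≈", max(ratios), " (should be <= alpha =", alpha, ")")
```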
Let us restate for convenience the contraction and monotonicity as-
sumptions of Section 1.2.2.
J ≤ J′    ⇒    T^k J ≤ T^k J′,    Tµ^k J ≤ Tµ^k J′,    ∀ µ ∈ M,
‖J* − J‖ ≤ (1/(1 − α)) ‖T J − J‖,    ‖J* − T J‖ ≤ (α/(1 − α)) ‖T J − J‖.
‖Jµ − J‖ ≤ (1/(1 − α)) ‖Tµ J − J‖,    ‖Jµ − Tµ J‖ ≤ (α/(1 − α)) ‖Tµ J − J‖.
Furthermore, for every ε > 0, there exists µ_ε ∈ M such that
Proof: We note that the right-hand side of Eq. (2.1) holds by Prop.
2.1.1(e) (see the remark following its proof). Thus inf µ∈M Jµ (x) ≤ J * (x)
for all x ∈ X. To show the reverse inequality as well as the left-hand side
of Eq. (2.1), we note that for all µ ∈ M, we have T J * ≤ Tµ J * , and since
J * = T J * , it follows that J * ≤ Tµ J * . By applying repeatedly Tµ to both
sides of this inequality and by using the monotonicity Assumption 2.1.1,
we obtain J * ≤ Tµk J * for all k > 0. Taking the limit as k → ∞, we see
that J * ≤ Jµ for all µ ∈ M, so that J * (x) ≤ inf µ∈M Jµ (x) for all x ∈ X.
Q.E.D.
Note that without monotonicity, we may have inf µ∈M Jµ (x) < J * (x)
for some x. This is illustrated by the following example.
Example 2.1.1 (Counterexample Without Monotonicity)
with J¯ being some function in B(X), where Tµ0 · · · Tµk J denotes the com-
position of the mappings Tµ0 , . . . , Tµk applied to J, i.e.,
! "
Tµ0 · · · Tµk J = Tµ0 Tµ1 · · · (Tµk−1 (Tµk J)) · · · .
and the last equality holds by Prop. 2.1.1(b)]. Combining the preceding
relations, we obtain J * (x) = inf π∈Π Jπ (x).
Thus, in DP terms, we may view J * as an optimal cost function over
all policies, including nonstationary ones. At the same time, Prop. 2.1.2
states that stationary policies are sufficient in the sense that the optimal
cost can be attained to within arbitrary accuracy with a stationary policy
[uniformly for all x ∈ X, as Eq. (2.1) shows].
J ≤ J′ + c v    ⇒    Tµ J ≤ Tµ J′ + αc v,    (2.2)
( H(x, u, J + c v) − H(x, u, J) ) / v(x) ≤ α ‖J + c v − J‖ = αc.
The condition (2.3) implies the desired condition (2.2). Conversely, con-
dition (2.2) for c = 0 yields the monotonicity assumption, while for c =
‖J′ − J‖ it yields the contraction assumption. Q.E.D.
T J ≤ J + c v    ⇒    J* ≤ T^k J + (α^k c/(1 − α)) v,
J ≤ T J + c v    ⇒    T^k J ≤ J* + (α^k c/(1 − α)) v.
Proof: (a) We show the first relation. Applying Eq. (2.2) with J′ and J
replaced by J and T J, respectively, and taking infimum over µ ∈ M, we
see that if T J ≤ J + c v, then T 2 J ≤ T J + αc v. Proceeding similarly, it
follows that
T^ℓ J ≤ T^{ℓ−1} J + α^{ℓ−1} c v.
We now write for every k,
T^k J − J = Σ_{ℓ=1}^k (T^ℓ J − T^{ℓ−1} J) ≤ Σ_{ℓ=1}^k α^{ℓ−1} c v,
TJ ≤ J + cv
T^{k+1} J ≤ T^k J + α^k c v.
‖Jµ − T J̃‖ ≤ (α/(1 − α)) ‖T J̃ − J̃‖,    (2.5)
‖Jµ − J*‖ ≤ (2α/(1 − α)) ‖J̃ − J*‖,    (2.6)
and
‖Jµ − J*‖ ≤ (2/(1 − α)) ‖T J̃ − J̃‖.    (2.7)
Proof: Equation (2.5) follows from the second relation of Prop. 2.1.1(e) with J = J̃. Also from the first relation of Prop. 2.1.1(e) with J = J*, we have
‖Jµ − J*‖ ≤ (1/(1 − α)) ‖Tµ J* − J*‖.
where the second inequality follows from Eqs. (2.5) and (2.8). This proves
Eq. (2.7). Q.E.D.
2α"
1−α
Example 2.2.1
Consider a discounted optimal control problem with two states, 1 and 2, and
deterministic transitions. State 2 is absorbing, but at state 1 there are two
possible decisions: move to state 2 (policy µ∗ ) or stay at state 1 (policy µ).
Sec. 2.2 Limited Lookahead Policies 63
The cost of each transition is 0 except for the transition from 1 to itself under
policy µ, which has cost 2αε, where ε is a positive scalar and α ∈ [0, 1) is the
discount factor. The optimal policy µ∗ is to move from state 1 to state 2, and
the optimal cost-to-go function is J*(1) = J*(2) = 0. Consider the vector J̃ with J̃(1) = −ε and J̃(2) = ε, so that
‖J̃ − J*‖ = ε,
as assumed in Eq. (2.6) (cf. Prop. 2.2.1). The policy µ that decides to stay
at state 1 is a one-step lookahead policy based on J˜, because
We have
Jµ(1) = 2αε/(1 − α) = (2α/(1 − α)) ‖J̃ − J*‖,
so the bound of Eq. (2.6) holds with equality.
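The following few lines check Example 2.2.1 numerically under illustrative values of α and ε: the cost of the stay-at-state-1 policy equals the right-hand side of the bound (2.6).

```python
# Numerical check (sketch) of Example 2.2.1 with illustrative alpha and eps.
alpha, eps = 0.9, 0.1

# Cost of staying at state 1 forever: sum_{k>=0} alpha^k * (2*alpha*eps)
J_mu_1 = sum((alpha**k) * 2 * alpha * eps for k in range(10000))
bound = 2 * alpha / (1 - alpha) * eps     # right-hand side of Eq. (2.6), with ||J_tilde - J*|| = eps

print(J_mu_1, bound)                       # the two numbers agree (up to series truncation)
```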
where δ and " are nonnegative scalars. These scalars are usually unknown,
so the resulting analysis will have a mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is
finite but very large. Then instead of calculating (T J)(x) for all states x,
one may do so only for some states and estimate (T J)(x) for the remain-
ing states x by some form of interpolation. Alternatively, one may use
simulation data [e.g., noisy values of (T J)(x) for some or all x] and some
kind of least-squares error fit of (T J)(x) with a function from a suitable
parametric class. The function J̃ thus obtained will satisfy ‖J̃ − T J‖ ≤ δ
with δ > 0. Note that δ may not be small in this context, and the resulting
performance degradation may be a primary concern.
Cases where " > 0 may arise when the control space is infinite or
finite but large, and the minimization involved in the calculation of (T J)(x)
cannot be done exactly. Note, however, that it is possible that
δ > 0, ε = 0,
‖Tµ J − T J‖ ≤ ε
for some ε > 0, and then we may set J̃ = Tµ J. In this case we have ε = δ in Eq. (2.9).
In a multistep method with approximations, we are given a posi-
tive integer m and a lookahead function Jm , and we successively compute
(backwards in time) Jm−1 , . . . , J0 and policies µm−1 , . . . , µ0 satisfying
Proof: Using the triangle inequality, Eq. (2.10), and the contraction prop-
erty of T , we have for all k
‖J_{m−k} − T^k J_m‖ ≤ δ(1 − α^k)/(1 − α),    k = 1, . . . , m.    (2.13)
From Eq. (2.10), we have ‖Jk − Tµk Jk+1‖ ≤ δ + ε, so for all k
showing that
(δ + ")(1 − αk )
*Jm−k − Tµm−k · · · Tµm−1 Jm * ≤ , k = 1, . . . , m.
1−α
(2.14)
Using the fact ‖Tµ0 J1 − T J1‖ ≤ ε [cf. Eq. (2.10)], we obtain
where the last inequality follows from Eqs. (2.13) and (2.14) for k = m − 1.
From this relation and the fact that Tµ0 · · · Tµm−1 and T^m are contractions with modulus α^m, we obtain
We also have using Prop. 2.1.1(e), applied in the context of the multistep
mapping of Example 1.3.1,
‖Jπ − J*‖ ≤ (1/(1 − α^m)) ‖Tµ0 · · · Tµm−1 J* − J*‖.
Combining the last two relations, we obtain the desired result. Q.E.D.
Note that for m = 1 and δ = ε = 0, i.e., the case of one-step lookahead
policy µ with lookahead function J1 and no approximation error in the
minimization involved in T J1 , Eq. (2.11) yields the bound
‖Jµ − J*‖ ≤ (2α/(1 − α)) ‖J1 − J*‖,
In this section, we discuss value iteration (VI for short), the algorithm
that starts with some J ∈ B(X), and generates T J, T 2 J, . . .. Since T is
a weighted sup-norm contraction under Assumption 2.1.2, the algorithm
converges to J * , and the rate of convergence is governed by
‖T^k J − J*‖ ≤ α^k ‖J − J*‖,    k = 0, 1, . . . .
‖Tµ^k J − Jµ‖ ≤ α^k ‖J − Jµ‖,    k = 0, 1, . . . .
‖J̃ − J*‖ ≤ γ,    Tµ J̃ = T J̃,
‖Jµ − J*‖ ≤ 2αγ/(1 − α).    (2.15)
‖J_{k+1} − T J_k‖ ≤ δ,    k = 0, 1, . . . .    (2.17)
lim sup_{k→∞} ‖J_k − J*‖ ≤ δ/(1 − α),    (2.19)
lim sup_{k→∞} ‖J_{µ^k} − J*‖ ≤ ε/(1 − α) + 2αδ/(1 − α)².    (2.20)
Proof: Using the triangle inequality, Eq. (2.17), and the contraction prop-
erty of T , we have
and finally
‖J_k − T^k J_0‖ ≤ (1 − α^k)δ/(1 − α),    k = 0, 1, . . . .    (2.21)
By taking limit as k → ∞ and by using the fact limk→∞ T k J0 = J * , we
obtain Eq. (2.19).
We also have using the triangle inequality and the contraction prop-
erty of Tµk and T ,
Since for a zero cost per stage and the given deterministic transitions, we
have T Jk = (2αrk , 2αrk ), the preceding minimization is written as
r_{k+1} ∈ arg min_r [ ξ_1 (r − 2αr_k)² + ξ_2 (2r − 2αr_k)² ],
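The sketch below iterates the displayed least-squares recursion with illustrative weights ξ1, ξ2 and discount factor α (these particular values are not from the book); the closed-form minimizer is obtained by setting the derivative to zero, and for these values the iterates grow geometrically, so the fitted value iterates diverge even though α < 1.

```python
# Least-squares fitted value iteration recursion displayed above:
# r_{k+1} = argmin_r [ xi1*(r - 2*alpha*r_k)^2 + xi2*(2r - 2*alpha*r_k)^2 ].
# Setting the derivative to zero gives r_{k+1} = 2*alpha*r_k*(xi1 + 2*xi2)/(xi1 + 4*xi2).
alpha, xi1, xi2 = 0.9, 0.5, 0.5          # illustrative values
factor = 2 * alpha * (xi1 + 2 * xi2) / (xi1 + 4 * xi2)
print("growth factor per iteration:", factor)     # 1.08 > 1 here

r = 1.0
for k in range(50):
    r = factor * r
print("r_50 =", r)                                 # grows without bound
```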
lim_{k→∞} ‖J_{µ^k} − J*‖ = 0,
Proof: We have
In the case where the set of policies is infinite, we may assert the
convergence of the sequence of generated policies under some compactness
and continuity conditions. In particular, we will assume that the state
space is finite, X = {1, . . . , n}, and that each control constraint set U (x)
is a compact subset of ℜ^m. We will view a cost function J as an element of ℜ^n, and a policy µ as an element of the set U(1) × · · · × U(n) ⊂ ℜ^{mn},
which is compact. Then {µk } has at least one limit point µ, which must
be an admissible policy. The following proposition guarantees, under an
additional continuity assumption for H(x, ·, ·), that every limit point µ is
optimal.
we have
H(x, µk(x), Jµk−1) ≤ H(x, u, Jµk−1),  x = 1, . . . , n, u ∈ U(x).
We now consider the PI method where the policy evaluation step and/or
the policy improvement step of the method are implemented through ap-
proximations. This method generates a sequence of policies {µk } and a
corresponding sequence of approximate cost functions {Jk } satisfying
where δ and ε are some scalars, and ‖·‖ denotes the weighted sup-norm (the
one used in the contraction Assumption 2.1.2). The following proposition
provides an error bound for this algorithm.
‖J − Jµ‖ ≤ δ,  ‖Tµ̄ J − T J‖ ≤ ε,
Tµ̄ Jµ ≤ Jµ + (ε + 2αδ) v,
Jµ̄ = Tµ̄ Jµ̄ = Tµ̄ Jµ + (Tµ̄ Jµ̄ − Tµ̄ Jµ) ≤ Tµ̄ Jµ + α(ε + 2αδ)/(1 − α) v,
where the inequality follows by using Prop. 2.1.3 and Eq. (2.27). Subtract-
ing J* from both sides, we have
Jµ̄ − J* ≤ Tµ̄ Jµ − J* + α(ε + 2αδ)/(1 − α) v.   (2.28)
Also by subtracting J* from both sides of Eq. (2.26), and using the
contraction property
T Jµ − J* = T Jµ − T J* ≤ α‖Jµ − J*‖ v,
we obtain
‖Jµk+1 − J*‖ ≤ α‖Jµk − J*‖ + (ε + 2αδ)/(1 − α),
which by taking the lim sup of both sides as k → ∞ yields the desired result.
Q.E.D.
We note that the error bound of Prop. 2.4.3 is tight, as can be shown
with an example from [BeT96], Section 6.2.3. The error bound is com-
parable to the one for approximate VI, derived earlier in Prop. 2.3.2. In
particular, the error ‖Jµk − J*‖ is asymptotically proportional to 1/(1 − α)^2
and to the approximation error in policy evaluation or value iteration, re-
spectively. This is noteworthy, as it indicates that contrary to the case of
exact implementation, approximate PI need not hold a convergence rate
advantage over approximate VI, despite its greater overhead per iteration.
Note that when δ = ε = 0, Eq. (2.25) yields
‖Jµk+1 − J*‖ ≤ α‖Jµk − J*‖.
Thus in the case of an infinite state space and/or control space, exact
PI converges at a geometric rate under the contraction and monotonicity
assumptions of this section. This rate is the same as the rate of convergence
of exact VI. It follows that judging solely from the point of view of rate
of convergence estimates, exact PI holds an advantage over exact VI only
when the number of states is finite. This raises the question of what happens
when the number of states is finite but very large. However, this question
is not very interesting from a practical point of view, since for a very large
number of states, neither VI nor PI can be implemented in practice without
approximations (see the discussion of Section 1.2.4).
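For comparison with the VI sketch given earlier, here is a minimal sketch of exact PI for the same kind of finite-state discounted MDP (the model data are again invented); with δ = ε = 0 it terminates with an optimal policy after finitely many iterations:

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[u, x, y]: hypothetical transition data
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
n = 2

def evaluate(mu):
    """Exact policy evaluation: solve J = T_mu J, i.e., (I - alpha*P_mu) J = g_mu."""
    P_mu = np.array([P[mu[x], x] for x in range(n)])
    g_mu = np.array([g[mu[x], x] for x in range(n)])
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def improve(J):
    """Policy improvement: mu(x) attains the minimum in (T J)(x)."""
    return np.argmin(g + alpha * P @ J, axis=0)

mu = np.zeros(n, dtype=int)
for k in range(20):
    J_mu = evaluate(mu)
    mu_new = improve(J_mu)
    if np.array_equal(mu_new, mu):        # T_mu J_mu = T J_mu: mu is optimal, terminate
        break
    mu = mu_new
print("optimal policy:", mu, "cost:", evaluate(mu))
```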
cf. Eq. (2.23). Using Eq. (2.31) and the fact Jµ = Tµ Jµ , we have
‖T Jµ − Jµ‖ ≤ ‖T Jµ − T J̃‖ + ‖T J̃ − Tµ J̃‖ + ‖Tµ J̃ − Jµ‖
= ‖T Jµ − T J̃‖ + ‖T J̃ − Tµ J̃‖ + ‖Tµ J̃ − Tµ Jµ‖
≤ α‖Jµ − J̃‖ + ε + α‖J̃ − Jµ‖   (2.32)
≤ ε + 2αδ.
Using Prop. 2.1.1(d) with J = Jµ , we obtain the error bound (2.30).
Q.E.D.
The preceding error bound can be extended to the case where two
successive policies generated by the approximate PI algorithm are “not too
different” rather than being identical. In particular, suppose that µ and µ̄
are successive policies, which in addition to
‖J̃ − Jµ‖ ≤ δ,  ‖Tµ J̃ − T J̃‖ ≤ ε,
also satisfy ‖Tµ J̃ − Tµ̄ J̃‖ ≤ ζ, where ζ is some scalar. Then
‖T J̃ − Tµ̄ J̃‖ ≤ ‖T J̃ − Tµ J̃‖ + ‖Tµ J̃ − Tµ̄ J̃‖ ≤ ε + ζ,
and by replacing ε with ε + ζ and µ with µ̄ in Eq. (2.32), we obtain
‖Jµ̄ − J*‖ ≤ (ε + ζ + 2αδ)/(1 − α).
When ζ is small enough to be of the order of max{δ, ε}, this error bound
is comparable to the one for the case where policies converge.
where {mk } is a sequence of positive integers (see Fig. 2.5.1, which shows
one iteration of the method where mk = 3). There is no systematic guide-
line for selecting the integers mk . Usually their best values are chosen
empirically, and tend to be considerably larger than 1 (in the case where
mk ≡ 1 the optimistic PI method coincides with the VI method). The
convergence of this method is discussed in Section 2.5.1.
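A minimal sketch of the optimistic PI iteration just described, for a finite-state discounted MDP with invented data: at each iteration a policy µk is obtained from the policy improvement Tµk Jk = T Jk, and Jk+1 is obtained with mk applications of Tµk (mk = 5 here, chosen arbitrarily):

```python
import numpy as np

alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[u, x, y]: hypothetical transition data
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])

def T(J):
    return np.min(g + alpha * P @ J, axis=0)

def T_mu(mu, J):
    return np.array([g[mu[x], x] + alpha * P[mu[x], x] @ J for x in range(len(J))])

J = np.zeros(2)
m = 5                                       # number of evaluation sweeps m_k (arbitrary choice)
for k in range(30):
    mu = np.argmin(g + alpha * P @ J, axis=0)   # policy improvement: T_mu J = T J
    for _ in range(m):                          # optimistic (partial) policy evaluation
        J = T_mu(mu, J)
print("J after optimistic PI:", J)
print("fixed point check ||J - T J|| =", np.max(np.abs(J - T(J))))
```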
Variants of optimistic PI include methods with approximations in the
policy evaluation and policy improvement phases (Section 2.5.2), and meth-
ods where the number mk is randomized (Section 2.5.3). An interesting
advantage of the latter methods is that they do not require the monotonic-
ity Assumption 2.1.1 for convergence in problems with a finite number of
policies.
A method that is conceptually similar to the optimistic PI method is
the λ-PI method defined by
Tµk Jk = T Jk ,  Jk+1 = Tµk^(λ) Jk ,  k = 0, 1, . . . ,   (2.34)
[Figure 2.5.1: one iteration of optimistic PI with mk = 3, showing Tµ0 J, T J = minµ Tµ J, the policy improvement step, and exact or approximate policy evaluation.]
where J0 is an initial function in B(X), and for any policy µ and scalar
λ ∈ (0, 1), Tµ^(λ) is the multistep mapping defined by
Tµ^(λ) J = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ Tµ^{ℓ+1} J,  J ∈ B(X),
(cf. Section 1.2.5). To compare optimistic PI and λ-PI, note that they both
involve multiple applications of the VI mapping Tµk : a fixed number mk
in the former case, and a geometrically weighted number in the latter case.
In fact, we may view the λ-PI iterate Tµk^(λ) Jk as the expected value of the
optimistic PI iterate Tµk^{mk} Jk when mk is chosen by a geometric probability
distribution with parameter λ.
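The multistep mapping Tµ^(λ) can be approximated numerically by truncating the geometric series at a large power. The following sketch (invented MDP data; the truncation level is a numerical convenience, not part of the method) evaluates one λ-PI evaluation step for several values of λ, and compares it with the exact evaluation Jµ, which it approaches as λ → 1:

```python
import numpy as np

alpha = 0.9
P_mu = np.array([[0.8, 0.2], [0.3, 0.7]])   # transition matrix of a fixed policy (hypothetical)
g_mu = np.array([1.0, 2.0])                 # its expected cost per stage (hypothetical)

def T_mu(J):
    return g_mu + alpha * P_mu @ J

def T_mu_lambda(J, lam, L=500):
    """T_mu^(lambda) J = (1 - lam) * sum_{l>=0} lam^l * T_mu^{l+1} J, truncated after L terms."""
    out = np.zeros_like(J)
    TJ = J.copy()
    for l in range(L):
        TJ = T_mu(TJ)                        # TJ now equals T_mu^{l+1} J
        out += (1 - lam) * lam**l * TJ
    return out

J = np.array([5.0, -3.0])
J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)   # exact evaluation: J_mu = T_mu J_mu
for lam in (0.1, 0.5, 0.9, 0.99):
    print(lam, T_mu_lambda(J, lam))          # approaches J_mu as lam -> 1
print("J_mu =", J_mu)
```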
One of the reasons that make λ-PI interesting is its relation with
TD(λ) and other temporal difference methods on one hand, and the prox-
imal algorithm on the other. In particular, in λ-PI a policy evaluation is
performed with a single iteration of an extrapolated proximal algorithm;
cf. the discussion of Section 1.2.5 and Exercise 1.2. Thus implementation
of λ-PI can benefit from the rich methodology that has developed around
temporal difference and proximal methods.
Generally the optimistic and λ-PI methods have similar convergence
properties. In this section, we focus primarily on optimistic PI, and we
discuss briefly λ-PI in Section 2.5.3, where we will prove convergence for a
randomized version. For a convergence proof of λ-PI without randomiza-
tion in discounted stochastic optimal control and stochastic shortest path
problems, see the paper [BeI96] and the book [BeT96] (Section 2.3.1).
We will now focus on the optimistic PI algorithm (2.33). The following two
propositions provide its convergence properties.
J ≤ J′ ⇒ W J ≤ W J′,  ∀ J, J′ ∈ B(X),
‖W J − W J′‖ ≤ α‖J − J′‖,  ∀ J, J′ ∈ B(X),
for some α ∈ (0, 1).
(a) For all J, J′ ∈ B(X) and scalar c ≥ 0, we have
J ≥ J′ − c v ⇒ W J ≥ W J′ − αc v.   (2.35)
J ≥ W J − c v ⇒ W^k J ≥ J* − (α^k/(1 − α)) c v,   (2.36)
W J ≥ J − c v ⇒ J* ≥ W^k J − (α^k/(1 − α)) c v,   (2.37)
where J * is the fixed point of W .
Proof: The proof of part (a) follows the one of Prop. 2.1.4(b), while the
proof of part (b) follows the one of Prop. 2.1.4(c). Q.E.D.
J ≥ T J − c v,
and
Tµk J ≥ T (Tµk J) − αk c v. (2.39)
J1 ≥ T J1 − αm0 c v = T J1 − β1 c v.
Jk ≥ T Jk − βk c v.
and
Jk+1 ≥ T Jk+1 − αmk βk c v = T Jk+1 − βk+1 c v,
respectively. This completes the induction. Q.E.D.
‖J0 − T J0‖ ≤ c.   (2.43)
Jk + (α^k/(1 − α)) c v ≥ Jk + (βk/(1 − α)) c v ≥ J* ≥ Jk − ((k + 1)α^k/(1 − α)) c v,   (2.44)
Proof: Using the relation J0 ≥ T J0 − c v [cf. Eq. (2.43)] and Lemma 2.5.3,
we have
Jk ≥ T Jk − βk c v, k = 0, 1, . . . .
Using this relation in Lemma 2.5.1(b) with W = T and k = 0, we obtain
Jk ≥ J* − (βk/(1 − α)) c v,
which together with the fact αk ≥ βk , shows the left-hand side of Eq.
(2.44).
Using the relation T J0 ≥ J0 − c v [cf. Eq. (2.43)] and Lemma 2.5.1(b)
with W = T , we have
J* ≥ T^k J0 − (α^k/(1 − α)) c v,  k = 0, 1, . . . .   (2.45)
Using again the relation J0 ≥ T J0 − c v in conjunction with Lemma 2.5.3,
we also have
T Jj ≥ Jj+1 − (α/(1 − α)) βj c v,  j = 0, . . . , k − 1.
Applying T k−j−1 to both sides of this inequality and using the monotonicity
and contraction properties of T k−j−1 , we obtain
T^{k−j} Jj ≥ T^{k−j−1} Jj+1 − (α^{k−j}/(1 − α)) βj c v,  j = 0, . . . , k − 1,
Proof of Props. 2.5.1 and 2.5.2: Let c be a scalar satisfying Eq. (2.43).
Then the error bounds (2.44) show that lim_{k→∞} ‖Jk − J*‖ = 0, i.e., the first
part of Prop. 2.5.1. To show the second part (finite termination when the
number of policies is finite), let M̂ be the finite set of nonoptimal policies.
Then there exists ε > 0 such that ‖Tµ̂ J* − T J*‖ > ε for all µ̂ ∈ M̂, which
implies that ‖Tµ̂ Jk − T Jk‖ > ε for all µ̂ ∈ M̂ and k sufficiently large. This
implies that µk ∉ M̂ for all k sufficiently large. The proof of Prop. 2.5.2
follows using the compactness and continuity Assumption 2.4.1, and the
convergence argument of Prop. 2.4.2. Q.E.D.
Let us consider the convergence rate bounds of Lemma 2.5.4 for optimistic
PI, and write them in the form
‖J0 − T J0‖ ≤ c ⇒ Jk − ((k + 1)α^k/(1 − α)) c v ≤ J* ≤ Jk + (α^{m0+···+mk−1}/(1 − α)) c v.
(2.47)
We may contrast these bounds with the ones for VI, where
‖J0 − T J0‖ ≤ c ⇒ T^k J0 − (α^k/(1 − α)) c v ≤ J* ≤ T^k J0 + (α^k/(1 − α)) c v   (2.48)
[cf. Prop. 2.1.4(c)].
In comparing the bounds (2.47) and (2.48), we should also take into
account the associated overhead for a single iteration of each method: op-
timistic PI requires at iteration k a single application of T and mk − 1
applications of Tµk (each being less time-consuming than an application of
T ), while VI requires a single application of T . It can then be seen that the
upper bound for optimistic PI is better than the one for VI (same bound
for less overhead), while the lower bound for optimistic PI is worse than the
one for VI (worse bound for more overhead). This suggests that the choice
of the initial condition J0 is important in optimistic PI, and in particular
it is preferable to have J0 ≥ T J0 (implying convergence to J * from above)
rather than J0 ≤ T J0 (implying convergence to J * from below). This is
consistent with the results of other works, which indicate that the conver-
gence properties of the method are fragile when the condition J0 ≥ T J0
does not hold (see [WiB93], [BeT96], [BeY10], [BeY12], [YuB13a]).
We will now derive error bounds for the case where the policy evaluation
and policy improvement operations are approximate, similar to the nonop-
timistic PI case of Section 2.4.1. In particular, we consider a method that
generates a sequence of policies {µk } and a corresponding sequence of ap-
proximate cost functions {Jk } satisfying
‖Jk − Tµk^{mk} Jk−1‖ ≤ δ,  ‖Tµk+1 Jk − T Jk‖ ≤ ε,  k = 0, 1, . . . ,   (2.49)
M(y) = sup_{x∈X} y(x)/v(x).
Then the condition (2.50) can be written for all J, J′ ∈ B(X), and µ ∈ M
as
M(Tµ J − Tµ J′) ≤ αM(J − J′),   (2.51)
which can be proved by induction using Eq. (2.51). We have the following
proposition.
lim sup_{k→∞} ‖Jµk − J*‖ ≤ (ε + 2αδ)/(1 − α)^2.
Proof: Let us fix k ≥ 1, and for simplicity let us assume that mk ≡ m for
some m, and denote
J = Jk−1 ,  J̄ = Jk ,  µ = µk ,  µ̄ = µk+1 ,
r = Tµ J − J,  r̄ = Tµ̄ J̄ − J̄.
r̄ = Tµ̄ J̄ − J̄
= (Tµ̄ J̄ − Tµ J̄) + (Tµ J̄ − J̄)
≤ (Tµ̄ J̄ − T J̄) + (Tµ J̄ − Tµ(Tµ^m J)) + (Tµ^m J − J̄) + (Tµ^m(Tµ J) − Tµ^m J)
≤ εv + αM(J̄ − Tµ^m J)v + δv + α^m M(Tµ J − J)v
≤ (ε + δ)v + αδv + α^m M(r)v,
where the first inequality follows from Tµ J̄ ≥ T J̄, and the second and third
inequalities follow from Eqs. (2.49) and (2.52). From this relation we have
M(r̄) ≤ (ε + (1 + α)δ) + βM(r),
lim sup_{k→∞} M(r) ≤ (ε + (1 + α)δ)/(1 − β).   (2.54)
s = Jµ − Tµ^m J
= Tµ^m Jµ − Tµ^m J
≤ α^m M(Jµ − J)v
≤ (α^m/(1 − α)) M(Tµ J − J)v
= (α^m/(1 − α)) M(r)v,
where the first inequality follows from Eq. (2.52) and the second inequality
follows by using Prop. 2.1.4(b). Thus we have M(s) ≤ (α^m/(1 − α)) M(r), from
which by taking lim sup of both sides and using Eq. (2.54), we obtain
lim sup_{k→∞} M(s) ≤ β(ε + (1 + α)δ)/((1 − α)(1 − β)).   (2.55)
T J̄ − T J* ≤ αM(J̄ − J*)v
= αM(J̄ − Tµ^m J + Tµ^m J − J*)v
≤ αM(J̄ − Tµ^m J)v + αM(Tµ^m J − J*)v
≤ αδv + αM(t)v.
t̄ = Tµ̄^m J̄ − J*
= (Tµ̄^m J̄ − Tµ̄^{m−1} J̄) + · · · + (Tµ̄^2 J̄ − Tµ̄ J̄) + (Tµ̄ J̄ − T J̄) + (T J̄ − T J*)
≤ (α^{m−1} + · · · + α)M(Tµ̄ J̄ − J̄)v + εv + αδv + αM(t)v,
so finally
M(t̄) ≤ ((α − α^m)/(1 − α)) M(r̄) + (ε + αδ) + αM(t).
By taking lim sup of both sides and using Eq. (2.54), it follows that
lim sup_{k→∞} M(t) ≤ (α − β)(ε + (1 + α)δ)/((1 − α)^2 (1 − β)) + (ε + αδ)/(1 − α).   (2.56)
p(1) > 0,  Σ_{j=1}^∞ p(j) = 1.
† Note that without monotonicity, J ∗ need not have any formal optimality
properties (cf. the discussion of Section 2.1 and Example 2.1.1).
The preceding proof illustrates the key idea of the randomized opti-
mistic PI algorithm, which is that for µ ∈ M∗, the mappings Tµ^{mk} have a
common fixed point that is equal to J*, the fixed point of T . Thus within
a distance ε from J*, the iterates (2.57) aim consistently at J*. Moreover,
because the probability of a VI (an iteration with mk = 1) is positive, the
algorithm is guaranteed to eventually come within ε from J* through a
sufficiently long sequence of contiguous VI iterations. For this we need the
sequence {Jk } to be bounded, which will be shown as part of the proof of
the following proposition.
Proposition 2.5.5: Let Assumption 2.5.2 hold. Then for any start-
ing point J0 ∈ F (X), a sequence {Jk } generated by the randomized
optimistic PI algorithm (2.57)-(2.58) belongs to F (X) and converges
to J * with probability one.
Proof: We will show that {Jk } is bounded by showing that for all k, we
have
max_{µ∈M} ‖Jk − Jµ‖ ≤ ρ^k max_{µ∈M} ‖J0 − Jµ‖ + (1/(1 − ρ)) max_{µ,µ′∈M} ‖Jµ − Jµ′‖,   (2.59)
where ρ is a common contraction modulus of Tµ , µ ∈ M, and T . Indeed,
we have for all µ ∈ M
‖Jk − Jµ‖ ≤ ‖Jk − Jµk−1‖ + ‖Jµk−1 − Jµ‖
= ‖Tµk−1^{mk−1} Jk−1 − Jµk−1‖ + ‖Jµk−1 − Jµ‖
≤ ρ^{mk−1} ‖Jk−1 − Jµk−1‖ + ‖Jµk−1 − Jµ‖
≤ ρ^{mk−1} max_{µ∈M} ‖Jk−1 − Jµ‖ + max_{µ,µ′∈M} ‖Jµ − Jµ′‖
We use this fact to argue that with enough contiguous value iterations, i.e.,
iterations where mk = 1, Jk can be brought arbitrarily close to J * , and
once this happens, the algorithm operates like the ordinary VI algorithm.
Indeed, each time the iteration Jk+1 = T Jk is performed (i.e., when
mk = 1), the distance of the iterate Jk from J* is reduced by a factor ρ, i.e.,
‖Jk+1 − J*‖ ≤ ρ‖Jk − J*‖. Since {Jk} belongs to the bounded set D, and
our randomization scheme includes the condition p(1) > 0, the algorithm is
guaranteed (with probability one) to eventually execute a sufficient number
of contiguous iterations Jk+1 = T Jk to enter a sphere
Sε = {J ∈ F(X) | ‖J − J*‖ < ε}
of small enough radius ε to guarantee that the generated policy µk belongs
to M∗, as per Prop. 2.5.4. Once this happens, all subsequent iterations
reduce the distance ‖Jk − J*‖ by a factor ρ at every iteration, since
‖Tµ^m J − J*‖ ≤ ρ‖Tµ^{m−1} J − J*‖ ≤ ρ‖J − J*‖,  ∀ µ ∈ M∗, m ≥ 1, J ∈ Sε.
Thus once {Jk} enters Sε, it stays within Sε and converges to J*. Q.E.D.
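A minimal sketch consistent with the randomized optimistic PI iteration just analyzed, for a finite-state discounted MDP: the MDP data and the geometric distribution used for mk are invented, but the structure (policy improvement Tµk Jk = T Jk followed by a random number mk of evaluation sweeps, with p(1) > 0) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],     # P[u, x, y]: hypothetical transition data
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])

def T(J):
    return np.min(g + alpha * P @ J, axis=0)

def T_mu(mu, J):
    return np.array([g[mu[x], x] + alpha * P[mu[x], x] @ J for x in range(len(J))])

J = np.array([10.0, -10.0])
for k in range(200):
    mu = np.argmin(g + alpha * P @ J, axis=0)   # policy improvement: T_mu J = T J
    m_k = rng.geometric(0.3)                    # random m_k in {1, 2, ...}; p(1) = 0.3 > 0
    for _ in range(m_k):                        # J_{k+1} = T_mu^{m_k} J_k
        J = T_mu(mu, J)
print("J =", J, " ||J - T J|| =", np.max(np.abs(J - T(J))))
```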
cf. Eq. (2.34), we consider a randomized version that involves a fixed prob-
ability p ∈ (0, 1). It has the form
Tµk Jk = T Jk ,
Jk+1 = T Jk with probability p,
Jk+1 = Tµk^(λ) Jk with probability 1 − p.   (2.60)
2.5.5, the sequence {Jk } generated by the randomized λ-PI algorithm (2.60)
belongs to F (X) and converges to J * with probability one. The reason is
that the contraction property of Tµ over F(X) with respect to the norm ‖·‖
implies that Tµ^(λ) is well-defined, and also implies that Tµ^(λ) is a contraction
over F (X). The latter assertion follows from the calculation
‖Tµ^(λ) J − Jµ‖ = ‖(1 − λ) Σ_{ℓ=0}^∞ λ^ℓ Tµ^{ℓ+1} J − (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ Jµ‖
≤ (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ ‖Tµ^{ℓ+1} J − Jµ‖
≤ (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ ρ^{ℓ+1} ‖J − Jµ‖
≤ ρ ‖J − Jµ‖,
where the first inequality follows from the triangle inequality, and the sec-
ond inequality follows from the contraction property of Tµ . Given that
Tµ^(λ) is a contraction, the proof of Prop. 2.5.5 goes through with mini-
mal changes. The idea again is that {Jk } remains bounded, and through
a sufficiently long sequence of contiguous iterations where the iteration
Jk+1 = T Jk is performed, it enters the sphere Sε, and subsequently stays
within Sε and converges to J*.
The convergence argument just given suggests that the choice of the
randomization probability p is important. If p is too small, convergence
may be slow because oscillatory behavior may go unchecked for a long
time. On the other hand if p is large, a correspondingly large number of
fixed point iterations Jk+1 = T Jk may be performed, and the hoped for
benefits of the use of the proximal iterations Jk+1 = Tµk^(λ) Jk may be lost.
Adaptive schemes that adjust p based on algorithmic progress may address
this issue. Similarly, the choice of the probability p(1) is significant in the
randomized optimistic PI algorithm (2.57)-(2.58).
Each VI of the form given in Section 2.3 applies the mapping T defined by
some (but not all) processors update their corresponding components, re-
serving the index k for computation stages involving all processors, and
also reserving subscript ℓ to denote component/processor index.
In an asynchronous VI algorithm, processor ℓ updates Jℓ only for t in
a selected subset Rℓ of iterations, and with components Jj , j ≠ ℓ, supplied
by other processors with communication “delays” t − τℓj(t),
Jℓ^{t+1}(x) = T(J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)})(x)  if t ∈ Rℓ, x ∈ Xℓ,
Jℓ^{t+1}(x) = Jℓ^t(x)  if t ∉ Rℓ, x ∈ Xℓ.   (2.61)
where T(J1^t, . . . , Jn^t)(ℓ) denotes the ℓ-th component of the vector
T(J1^t, . . . , Jn^t) = T J^t,
and for simplicity we write Jℓ^t instead of Jℓ^t(ℓ). This algorithm is a special
case of iteration (2.61) where the set of times at which Jℓ is updated is
Rℓ = {t | xt = ℓ},
and there are no communication delays (as in the case where the entire algo-
rithm is centralized at a single physical processor).
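A minimal sketch of this special case (single processor, no delays): at each time t one state xt is picked and only that component of J is updated with the Bellman operator, so Rℓ = {t | xt = ℓ}. The MDP data are invented; under the contraction assumption the iterates still converge to J* as long as every state is updated infinitely often.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],     # P[u, x, y]: hypothetical transition data
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],
              [0.5, 3.0]])
n = 2

def T_component(J, x):
    """The x-th component of T J, computed with the current values of the other components."""
    return np.min(g[:, x] + alpha * P[:, x, :] @ J)

# Synchronous VI to obtain a reference value of J*.
J_star = np.zeros(n)
for _ in range(2000):
    J_star = np.min(g + alpha * P @ J_star, axis=0)

# Asynchronous (Gauss-Seidel style) VI: update one randomly chosen component per step.
J = np.array([10.0, -10.0])
for t in range(500):
    x = rng.integers(n)                     # state selected at time t (so t belongs to R_x)
    J[x] = T_component(J, x)
print("asynchronous VI result:", J, " J* =", J_star)
```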
S(k + 1) ⊂ S(k), k = 0, 1, . . . ,
and is such that if {V k } is any sequence with V k ∈ S(k), for all k ≥ 0,
then {V k } converges pointwise to J * . Assume further the following:
(1) Synchronous Convergence Condition: We have
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
Proof: To explain the idea of the proof, let us note that the given condi-
tions imply that updating any component Jℓ, by applying T to a function
J ∈ S(k), while leaving all other components unchanged, yields a function
in S(k). Thus, once enough time passes so that the delays become “irrel-
evant,” then after J enters S(k), it stays within S(k). Moreover, once a
component Jℓ enters the subset Sℓ(k) and the delays become “irrelevant,”
Jℓ gets permanently within the smaller subset Sℓ(k+1) at the first time that
Jℓ is iterated on with J ∈ S(k). Once each component Jℓ, ℓ = 1, . . . , m,
gets within Sℓ(k + 1), the entire function J is within S(k + 1) by the Box
Condition. Thus the iterates from S(k) eventually get into S(k + 1) and
so on, and converge pointwise to J * in view of the assumed properties of
{S(k)}.
With this idea in mind, we show by induction that for each k ≥ 0,
there is a time tk such that:
(1) J t ∈ S(k) for all t ≥ tk .
(2) For all ℓ and t ∈ Rℓ with t ≥ tk , we have
(J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)}) ∈ S(k).
[In words, after some time, all fixed point estimates will be in S(k) and all
estimates used in iteration (2.61) will come from S(k).]
The induction hypothesis is true for k = 0 since J 0 ∈ S(0). Assuming
it is true for a given k, we will show that there exists a time tk+1 with the
required properties. For each ℓ = 1, . . . , m, let t(ℓ) be the first element of
Rℓ such that t(ℓ) ≥ tk . Then by the Synchronous Convergence Condition,
[Figure: illustration of the asynchronous convergence theorem, with nested sets S(0) ⊃ S(k) ⊃ S(k + 1) containing J* = (J1, J2), and component sets S1(0), S2(0) reflecting the box condition.]
we have T J^{t(ℓ)} ∈ S(k + 1), implying (in view of the Box Condition) that
Jℓ^{t(ℓ)+1} ∈ Sℓ(k + 1).
Assumption 2.6.2:
(a) The set of policies M is finite.
(b) There exists an integer B ≥ 0 such that
(Rℓ ∪ R̄ℓ) ∩ {τ | t < τ ≤ t + B} ≠ Ø,  ∀ t, ℓ.
0 ≤ t − τℓj(t) ≤ B′,  ∀ t, ℓ, j.
[Figure: monotone and nonmonotone behavior of optimistic PI, showing iterates Jµ0, Jµ1, Jµ2 relative to J* and the mapping T J.]
J 0 ≥ T J 0 = Tµ0 J 0 ,
M∗ = {µ ∈ M | Jµ = J * } = {µ ∈ M | Tµ J * = T J * }.
We will show that the algorithm eventually (with probability one) enters
a small neighborhood of J * within which it remains, generates policies in
M∗ , becomes equivalent to asynchronous VI, and therefore converges to
J * by Prop. 2.6.2. The idea of the proof is twofold; cf. Props. 2.5.4 and
2.5.5.
(1) There exists a small enough weighted sup-norm sphere centered at
J * , call it S ∗ , within which policy improvement generates only poli-
cies in M∗ , so policy evaluation with such policies as well as policy
improvement keep the algorithm within S ∗ if started there, and re-
duce the weighted sup-norm distance to J * , in view of the contraction
and common fixed point property of T and Tµ , µ ∈ M∗ . This is a
consequence of Prop. 2.3.1 [cf. Eq. (2.16)].
(2) With probability one, thanks to the randomization device, the algo-
rithm will eventually enter permanently S ∗ with a policy in M∗ .
We now establish (1) and (2) in suitably refined form to account for
the presence of delays and asynchronism. As in the proof of Prop. 2.5.5,
we can prove that given J 0 , we have that {J t } ⊂ D, where D is a bounded
set that depends on J 0 . We define
S(k) = {J | ‖J − J*‖ ≤ α^k c},
Such a k ∗ exists in view of the finiteness of M and Prop. 2.3.1 [cf. Eq.
(2.16)].
We now claim that with probability one, for any given k ≥ 1, J^t
will eventually enter S(k) and stay within S(k) for at least B′ additional
consecutive iterations. This is because our randomization scheme is such
that for any t and k, with probability at least p^{k(B+B′)} the next k(B + B′)
iterations are policy improvements, so that
J^{t+k(B+B′)−ξ} ∈ S(k)
for all ξ with 0 ≤ ξ < B′ [if t ≥ B′ − 1, we have J^{t−ξ} ∈ S(0) for all ξ
with 0 ≤ ξ < B′, so J^{t+B+B′−ξ} ∈ S(1) for 0 ≤ ξ < B′, which implies that
J^{t+2(B+B′)−ξ} ∈ S(2) for 0 ≤ ξ < B′, etc.].
It follows that with probability one, for some t̄ we will have J^τ ∈ S(k*)
for all τ with t̄ − B′ ≤ τ ≤ t̄, as well as µ^t̄ ∈ M∗ [cf. Eq. (2.65)]. Based
on property (2.65) and the definition (2.63)-(2.64) of the algorithm, we see
that at the next iteration, we have µ^{t̄+1} ∈ M∗ and
‖J^{t̄+1} − J*‖ ≤ ‖J^t̄ − J*‖ ≤ α^{k*} c,
|Jℓ^{t̄+1}(x) − J*(x)| / v(x) ≤ α‖J^t̄ − J*‖ ≤ α^{k*+1} c,   (2.66)
for all other x. Proceeding similarly, it follows that for all t > t̄ we will
have
J^τ ∈ S(k*),  ∀ τ with t − B′ ≤ τ ≤ t,
for every ℓ, x ∈ Xℓ, and some t with t̄ ≤ t < t̄ + B, cf. Eq. (2.66)], J^t will
enter S(k* + 1) permanently, with µ^t ∈ M∗ (since µ^t ∈ M∗ for all t ≥ t̄
as shown earlier). Then, with the same reasoning, after at most another
B′ + B iterations, J^t will enter S(k* + 2) permanently, with µ^t ∈ M∗, etc.
Thus J t will converge to J * with probability one. Q.E.D.
The proof of Prop. 2.6.3 shows that eventually (with probability one
after some iteration) the algorithm will become equivalent to asynchronous
VI (each policy evaluation will produce the same results as a policy im-
provement), while generating optimal policies exclusively. However, the
expected number of iterations for this to happen can be very large. More-
over the proof depends on the set of policies being finite. These observa-
tions raise questions regarding the practical effectiveness of the algorithm.
However, it appears that for many problems the algorithm works well, par-
ticularly when oscillatory behavior is a rare occurrence.
A potentially important issue is the choice of the randomization prob-
ability p. If p is too small, convergence may be slow because oscillatory
behavior may go unchecked for a long time. On the other hand if p is
large, a correspondingly large number of policy improvement iterations
may be performed, and the hoped for benefits of optimistic PI may be lost.
Adaptive schemes which adjust p based on algorithmic progress may be an
interesting possibility for addressing this issue.
where
• Fµ (V, Q) is a function with a component Fµ (V, Q)(x, u) for each (x, u),
defined by
Fµ(V, Q)(x, u) = H(x, u, min{V, Qµ}),   (2.67)
and
Fµ(V, Q)(x, u) = H(x, u, min{V, Qµ})
= Σ_{y=1}^n pxy(u) [g(x, u, y) + α min{V(y), Q(y, µ(y))}],
(M Fµ(V, Q))(x) = min_{u∈U(x)} Σ_{y=1}^n pxy(u) [g(x, u, y) + α min{V(y), Q(y, µ(y))}],
[cf. Eqs. (2.67)-(2.69)]. Note that Fµ (V, Q) is the mapping that defines Bell-
man’s equation for the Q-factors of a policy µ in an optimal stopping problem
where the stopping cost at state y is equal to V (y).
(b) The following uniform contraction property holds for all (V, Q)
and (Ṽ , Q̃):
‖Gµ(V, Q) − Gµ(Ṽ, Q̃)‖ ≤ α‖(V, Q) − (Ṽ, Q̃)‖.
so that
min{J*(x), Q*(x, µ(x))} = J*(x),  ∀ x ∈ X, µ ∈ M.
Indeed, the first inequality follows from the definition (2.67) of Fµ and
the contraction Assumption 2.1.2. The second inequality follows from a
nonexpansiveness property of the minimization map: for any J1 , J2 , J̃ 1 ,
J̃ 2 , we have
‖min{J1, J2} − min{J̃1, J̃2}‖ ≤ max{‖J1 − J̃1‖, ‖J2 − J̃2‖};   (2.73)
take the minimum of both sides over m, exchange the roles of Jm and J̃m,
and take supremum over x]. Here we use the relation (2.73) for J1 = V,
J̃1 = Ṽ, and J2(x) = Q(x, µ(x)), J̃2(x) = Q̃(x, µ(x)), for all x ∈ X.
We next note that for all Q, Q̃, †
‖M Q − M Q̃‖ ≤ ‖Q − Q̃‖,
Q(x, u)/v(x) ≤ ‖Q − Q̃‖ + Q̃(x, u)/v(x),  ∀ u ∈ U(x), x ∈ X,
take infimum of both sides over u ∈ U (x), exchange the roles of Q and Q̃, and
take supremum over x ∈ X.
Asynchronous PI Algorithm
equal to Q^t_{µt}(x); cf. Eq. (2.68)]. In particular, at each time t, each processor
ℓ does one of the following:
(a) Local policy improvement: If t ∈ Rℓ, processor ℓ sets for all x ∈ Xℓ, †
V^{t+1}(x) = min_{u∈U(x)} H(x, u, min{V^t, Q^t_{µt}}) = (M Fµt(V^t, Q^t))(x),
sets µ^{t+1}(x) to a u that attains the minimum, and leaves Q un-
changed, i.e., Q^{t+1}(x, u) = Q^t(x, u) for all x ∈ Xℓ and u ∈ U(x).
(b) Local policy evaluation: If t ∈ R̄ℓ, processor ℓ sets for all x ∈ Xℓ and
u ∈ U(x),
Q^{t+1}(x, u) = H(x, u, min{V^t, Q^t_{µt}}) = Fµt(V^t, Q^t)(x, u),
and leaves V and µ unchanged, i.e., V^{t+1}(x) = V^t(x) and µ^{t+1}(x) =
µ^t(x) for all x ∈ Xℓ.
† As earlier we assume that the infimum over u ∈ U (x) in the policy im-
provement operation is attained, and we write min in place of inf.
for all x ∈ Xℓ. The policy improvement iteration (2.74) is a VI for the
stopping problem:
J^{t+1}(x) = V^{t+1}(x) = min_{u∈U(x)} Σ_{y=1}^n pxy(u) [g(x, u, y) + α min{V^t(y), J^t(y)}],
for all x ∈ Xℓ. The “stopping cost” V^t(y) is the most recent cost value,
obtained by local policy improvement at y.
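A minimal single-processor sketch of the two local operations above for a finite-state discounted MDP (the transition and cost data are invented, and g is used in expected-cost form g(x, u)): at each time step we either perform a local policy improvement, which updates V(x) and µ(x), or a local policy evaluation, which updates Q(x, u); both use the "stopping" cost min{V, Qµ}.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],     # P[u, x, y]: hypothetical transition data
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[1.0, 2.0],                    # g[u, x]: expected cost of pair (x, u)
              [0.5, 3.0]])
n, m = 2, 2                                  # number of states and controls

V = np.zeros(n)
Q = np.zeros((n, m))                         # Q[x, u]
mu = np.zeros(n, dtype=int)

def F(x, u):
    """F_mu(V, Q)(x, u) = g(x, u) + alpha * sum_y p_xy(u) * min{V(y), Q(y, mu(y))}."""
    stop = np.minimum(V, Q[np.arange(n), mu])          # min{V, Q_mu}, componentwise
    return g[u, x] + alpha * P[u, x] @ stop

for t in range(3000):
    x = rng.integers(n)
    if rng.random() < 0.5:                   # local policy improvement at x
        vals = np.array([F(x, u) for u in range(m)])
        V[x] = vals.min()
        mu[x] = vals.argmin()
    else:                                    # local policy evaluation at x
        for u in range(m):
            Q[x, u] = F(x, u)

print("V =", V, " mu =", mu)
```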
Consider the optimistic PI algorithm (2.74)-(2.75) for the case of the minimax
problem of Example 1.2.5 of Chapter 1, where
H(x, u, J) = sup_{w∈W(x,u)} {g(x, u, w) + αJ(f(x, u, w))}.
Then the local policy evaluation step [cf. Eq. (2.75)] is written as
J^{t+1}(x) = sup_{w∈W(x,µt(x))} {g(x, µ^t(x), w) + α min{V^t(f(x, µ^t(x), w)), J^t(f(x, µ^t(x), w))}}.
The local policy improvement step [cf. Eq. (2.74)] takes the form
J^{t+1}(x) = V^{t+1}(x) = min_{u∈U(x)} sup_{w∈W(x,u)} {g(x, u, w) + α min{V^t(f(x, u, w)), J^t(f(x, u, w))}},
(1 − γt)J^{t+1}(x) + γt Ĵ^{t+1}(x).   (2.78)
The idea of the algorithm is to aim for a larger value of J^{t+1}(x) when
the condition (2.77) holds. Asymptotically, as γt → 0, the iteration (2.77)-
(2.78) becomes identical to the convergent update (2.75). For a detailed
analysis we refer to the paper by Bertsekas and Yu [BeY10].
Section 2.1: The contractive DP model of this section was first studied
systematically by Denardo [Den67], who assumed an unweighted sup-norm,
proved the basic results of Section 2.1, and described some of their appli-
cations. In this section, we have extended the analysis of [Den67] to the
case of weighted sup-norm contractions.
Section 2.2: The abstraction of the computational methodology for finite-
state discounted MDP within the broader framework of weighted sup-norm
contractions and an infinite state space (Sections 2.2-2.6) follows the au-
thor’s survey [Ber12b], and relies on several earlier analyses that use more
specialized assumptions.
Section 2.3: The multistep error bound of Prop. 2.2.2 is based on Scher-
rer [Sch12], which explores periodic policies in approximate VI and PI in
finite-state discounted MDP (see also Scherrer and Lesner [ShL12], who
give an example showing that the bound for approximate VI of Prop. 2.3.2
is essentially sharp for discounted finite-state MDP). For a related discus-
sion of approximate VI, including the error amplification phenomenon of
Example 2.3.1, and associated error bounds, see de Farias and Van Roy
[DFV00].
Section 2.4: The error bound of Prop. 2.4.3 extends a standard bound for
finite-state discounted MDP, derived by Bertsekas and Tsitsiklis [BeT96]
(Section 6.2.2), and shown to be tight by an example.
Section 2.5: Optimistic PI has received a lot of attention in the literature,
particularly for finite-state discounted MDP, and it is generally thought to
be computationally more efficient in practice than ordinary PI (see e.g.,
Puterman [Put94], who refers to the method as “modified PI”). The con-
vergence analysis of the synchronous optimistic PI (Section 2.5.1) follows
Rothblum [Rot79], who considered the case of an unweighted sup-norm
(v = e); see also Canbolat and Rothblum [CaR13]. The error bound for
optimistic PI (Section 2.5.2) is due to Thierry and Scherrer [ThS10b], which
was given for the case of a finite-state discounted MDP. We follow closely
their line of proof. Related error bounds and analysis are given by Scherrer
[Sch11].
The λ-PI method [cf. Eq. (2.34)] was introduced by Bertsekas and
Ioffe [BeI96], and was also presented in the book [BeT96], Section 2.3.1. It
is the basis of the LSPE(λ) policy evaluation method, described by Nedić
and Bertsekas [NeB03], and by Bertsekas, Borkar, and Nedić [BBN04]. It
was studied further in approximate DP contexts by Thierry and Scherrer
[ThS10a], Bertsekas [Ber11b], and Scherrer [Sch11]. An extension of λ-PI,
called Λ-PI, uses a different parameter λi for each state i, and is discussed
in Section 5 of the paper by Yu and Bertsekas [YuB12]. Based on the
discussion of Section 1.2.5 and Exercise 1.2, Λ-PI may be viewed as a
diagonally scaled version of the proximal algorithm, i.e., one that uses a
different penalty parameter for each proximal term.
When the state and control spaces are finite, and cost approximation
over a subspace {Φr | r ∈ ℝ^s} is used (cf. Section 1.2.4), a prominent
approximate PI approach is to replace the exact policy evaluation equation
Jµk = Tµk Jµk
Φrk = W Tµk(Φrk),   (2.79)
Φr = W T(Φr)
and
Φr = W Tµ(Φr),  µ ∈ M,
have a unique solution. This is true if the composite mappings W T
and W Tµ are contractions over ℝ^n. In particular, in the case of an
aggregation equation, where W = ΦD, the rows of Φ and D are probability
distributions, and Tµ , µ ∈ M, are monotone sup-norm contractions, the
mappings W T and W Tµ are also monotone sup-norm contractions.
However, in other cases, including when policy evaluation is done using
the projected equation, W T need not be monotone or be a contraction
of any kind, and the approximate PI algorithm (2.79)-(2.80) may lead to
systematic oscillations, involving cycles of policies (see related discussions
in [BeT96], [Ber11c], and [Ber12a]). This phenomenon has been known
since the early days of approximate DP ([Ber96] and the book [BeT96]),
but its practical implications have not been fully assessed. Generally, the
line of analysis of Section 2.5.3, which does not require monotonicity or sup-
norm contraction properties of the composite mappings W T and W Tµ ,
can be applied to the approximate PI algorithm (2.79)-(2.80), but only in
the case where these mappings are contractions over ℝ^n with respect to a
common norm ‖·‖; see Exercise 2.6 for further discussion.
Section 2.6: Asynchronous VI (Section 2.6.1) for finite-state discounted
MDP and games, shortest path problems, and abstract DP models, was
proposed in the author’s paper on distributed DP [Ber82]. The asyn-
chronous convergence theorem (Prop. 2.6.1) was first given in the author’s
paper [Ber83], where it was applied to a variety of algorithms, including
VI for discounted and undiscounted DP, and gradient methods for uncon-
strained optimization (see also Bertsekas and Tsitsiklis [BeT89], where a
textbook account is presented). The key convergence mechanism, which
underlies the proof of Prop. 2.6.1, is that while the algorithm iterates asyn-
chronously on the components Jℓ of J, an iteration with any one com-
ponent does not impede the progress made by iterations with the other
components, thanks to the box condition. At the same time, progress to-
wards the solution is continuing thanks to the synchronous convergence
condition.
Earlier references on distributed asynchronous iterative algorithms in-
clude the work of Chazan and Miranker [ChM69] on Gauss-Seidel methods
for solving linear systems of equations (who attributed the original algo-
rithmic idea to Rosenfeld [Ros67]), and also Baudet [Bau78] on sup-norm
contractive iterations. We refer to [BeT89] for detailed references.
Asynchronous algorithms have also been studied and applied to simu-
lation-based DP, particularly in the context of Q-learning, first proposed
by Watkins [Wat89], which may be viewed as a stochastic version of VI,
and is a central algorithmic concept in approximate DP and reinforcement
learning. Two principal approaches for the convergence analysis of asyn-
chronous stochastic algorithms have been suggested.
The first approach, initiated in the paper by Tsitsiklis [Tsi94], con-
siders the totally asynchronous computation of fixed points of abstract
sup-norm contractive mappings and monotone mappings, which are de-
fined in terms of an expected value. The algorithm of [Tsi94] contains as
special cases Q-learning algorithms for finite-spaces discounted MDP and
SSP problems. The analysis of [Tsi94] shares some ideas with the theory
of Section 2.6.1, and also relies on the theory of stochastic approximation
methods. For a subsequent analysis of the convergence of Q-learning for
SSP, which addresses the issue of boundedness of the iterates, we refer to
Yu and Bertsekas [YuB13b].
The second approach treats asynchronous algorithms of the stochas-
tic approximation type under some restrictions on the size of the communi-
was shown by Williams and Baird [WiB93], who also gave examples show-
ing that without this condition, cycling of the algorithm may occur. The
asynchronous PI algorithm with a uniform fixed point (Section 2.6.3) was
introduced in the papers by Bertsekas and Yu [BeY10], [BeY12], [YuB13a],
in order to address this difficulty. Our analysis follows the analysis of these
papers.
In addition to resolving the asynchronous convergence issue, the asyn-
chronous PI algorithm of Section 2.6.3 obviates the need for minimization
over all controls at every iteration (this is the generic computational ef-
ficiency advantage that optimistic PI typically holds over VI). Moreover,
the algorithm admits a number of variations thanks to the fact that Prop.
2.6.4 asserts the contraction property of the mapping Gµ for all µ. This
can be used to prove convergence in variants of the algorithm where the
policy µt is updated more or less arbitrarily, with the aim to promote some
objective. We refer to the paper [BeY12], which also derives related asyn-
chronous simulation-based Q-learning algorithms with and without cost
function approximation, where µt is replaced by a randomized policy to
enhance exploration.
The randomized asynchronous optimistic PI algorithm of Section
2.6.2, introduced in the first edition of this book, also resolves the asyn-
chronous convergence issue. The fact that this algorithm does not require
the monotonicity assumption may be useful in nonDP algorithmic contexts
(see [Ber16b] and Exercise 2.6).
In addition to discounted stochastic optimal control, the results of this
chapter find application in the context of the stochastic shortest path prob-
lem of Example 1.2.6, when all policies are proper. Then, under some addi-
tional assumptions, it can be shown that T and Tµ are weighted sup-norm
contractions with respect to a special norm. It follows that the analysis and
algorithms of this chapter apply in this case. For a detailed discussion, we
refer to the monograph [BeT96] and the survey [Ber12b]. For extensions
to the case of countable state space, see the textbook [Ber12a], Section 3.6,
and Hinderer and Waldmann [HiW05].
EXERCISES
to show that Tµ0 J1 is the fixed point of Tµ0 · · · Tµm−1, i.e., is equal to J0.
Similarly, apply Tµ1 to the fixed point relation
to show that Tµ1 J2 is the fixed point of Tµ1 · · · Tµm−1 Tµ0, etc.
J0 = lim_{k→∞} Tν^k J̄,  J1 = lim_{k→∞} Tν^k (Tµ0 J̄),  . . . ,  Jm−1 = lim_{k→∞} Tν^k (Tµ0 · · · Tµm−2 J̄),
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
where u ∈ ℝ^n, J′u denotes the inner product of J and u, F(x, ·) is the conju-
gate convex function of the convex function (T J)(x), and U(x) = {u ∈ ℝ^n |
F(x, u) < ∞} is the effective domain of F(x, ·) (for the definition of these terms,
we refer to books on convex analysis, such as [Roc70] and [Ber09]). Assuming
that the infimum in Eq. (2.81) is attained for all x, show how the VI algorithm of
Section 2.6.1 and the PI algorithm of Section 2.6.3 can be used to find the fixed
point of T in the case where T is a sup-norm contraction, but not necessarily
monotone. Note: For algorithms that relate to the context of this exercise and
are inspired by approximate PI, see [Ber16b], [Ber18c].
Solution: The analysis of Sections 2.6.1 and 2.6.3 does not require monotonicity
of the mapping Tµ given by
Gx = sup_{u∈U(x)} |g(x, u)|,  x ∈ X,
belongs to B(X).
(2) The sequence V = {V1, V2, . . .}, where
Vx = sup_{u∈U(x)} Σ_{y∈X} pxy(u) v(y),  x ∈ X,
belongs to B(X).
(3) We have
Σ_{y∈X} pxy(u) v(y) / v(x) ≤ 1,  ∀ x ∈ X.
(T J)(x) = inf_{u∈U(x)} [g(x, u) + α Σ_{y∈X} pxy(u)J(y)],  x ∈ X.
Show that Tµ and T map B(X) into B(X), and are contraction mappings with
modulus α.
Solution: We have
(Tµ J)(x)/v(x) ≤ ‖G‖ + ‖V‖ ‖J‖,  ∀ x ∈ X, µ ∈ M.
(T J)(x)/v(x) ≤ ‖G‖ + ‖V‖ ‖J‖,  ∀ x ∈ X.
‖Tµ J − Tµ J′‖ = sup_{x∈X} α |Σ_{y∈X} pxy(µ(x)) (J(y) − J′(y))| / v(x)
≤ sup_{x∈X} α Σ_{y∈X} pxy(µ(x)) v(y) (|J(y) − J′(y)|/v(y)) / v(x)
≤ sup_{x∈X} α (Σ_{y∈X} pxy(µ(x)) v(y) / v(x)) ‖J − J′‖
≤ α‖J − J′‖,
where the last inequality follows from assumption (3). Hence Tµ is a contraction
of modulus α.
To show that T is a contraction, we note that
(T J)(x)/v(x) ≤ (T J′)(x)/v(x) + α‖J − J′‖,  x ∈ X.
Similarly,
(T J′)(x)/v(x) ≤ (T J)(x)/v(x) + α‖J − J′‖,  x ∈ X,
and by combining the last two relations the contraction property of T follows.
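The contraction property shown in this solution can be checked numerically. The following sketch builds a small randomly generated finite MDP (all data invented); for a finite MDP with stochastic transition matrices, the constant weight v = e satisfies assumption (3) with equality, so the weighted sup-norm reduces to the ordinary sup-norm.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 4, 3, 0.8
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # P[u, x, y]: stochastic rows
g = rng.normal(size=(m, n))                                    # g[u, x]: random costs
v = np.ones(n)                          # with v = e, sum_y p_xy(u) v(y) = v(x) (assumption (3))

def norm_v(J):
    return np.max(np.abs(J) / v)        # weighted sup-norm (here the ordinary sup-norm)

def T(J):
    return np.min(g + alpha * P @ J, axis=0)

def T_mu(mu, J):
    return np.array([g[mu[x], x] + alpha * P[mu[x], x] @ J for x in range(n)])

mu = rng.integers(m, size=n)
for _ in range(100):
    J1, J2 = rng.normal(size=n), rng.normal(size=n)
    assert norm_v(T_mu(mu, J1) - T_mu(mu, J2)) <= alpha * norm_v(J1 - J2) + 1e-10
    assert norm_v(T(J1) - T(J2)) <= alpha * norm_v(J1 - J2) + 1e-10
print("contraction with modulus alpha =", alpha, "verified on random samples")
```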
Let the monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold. Show
that if J ≤ T J and J ∈ B(X), then J ≤ J*. Use this fact to show that if
X = {1, . . . , n} and U(i) is finite for each i = 1, . . . , n, then (J*(1), . . . , J*(n))
solves the following problem (in z1, . . . , zn):
maximize Σ_{i=1}^n zi
Let the monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold, and
assume that there are n states, and that U (x) is finite for every x. Consider a PI
method that aims to approximate a fixed point of T on a subspace S = {Φr | r ∈
ℝ^s}, where Φ is an n × s matrix, and evaluates a policy µ ∈ M with a solution
J̃µ of the following fixed point equation in the vector J ∈ ℝ^n:
J = W Tµ J   (2.82)
where W : ℝ^n ↦ ℝ^n is some mapping (possibly nonlinear, but independent of
µ), whose range is contained in S. Examples where W is linear include policy
evaluation using the projected and aggregation equations; see Section 1.2.4. The
algorithm is given by
J̃µ = lim_{k→∞} (W Tµ)^k J.
(3) For each µ ∈ M, the mappings W and W Tµ are monotone in the sense
that
Note that conditions (1) and (2) guarantee that the iterations (2.83) are well-
defined. Assume that the method is initiated with some policy in M, and it is
operated so that it terminates when a policy µ is obtained such that Tµ J˜µ = T J˜µ .
Show that the method terminates in a finite number of iterations, and the vector
J˜µ obtained upon termination is a fixed point of W T . Note: Condition (2) is
satisfied if W Tµ is a contraction, while condition (3) is satisfied if W is a matrix
with nonnegative components and Tµ is monotone for all µ. For counterexamples
to convergence when the conditions (2) and/or (3) are not satisfied, see [BeT96],
Section 6.4.2, and [Ber12a], Section 2.4.3.
There are finitely many policies, so we must have J̃µ̄ = J̃µ after a finite number
of iterations, which using the policy improvement equation Tµ̄ J̃µ = T J̃µ, implies
that Tµ̄ J̃µ̄ = T J̃µ̄. Thus the algorithm terminates with µ̄, and since J̃µ̄ =
W Tµ̄ J̃µ̄, it follows that J̃µ̄ is a fixed point of W T .
3
Semicontractive Models
and J̄(x) ≡ 0. The SSP problem arises when there is an additional ter-
mination state that is cost-free, and corresponding transition probabilities
pxt (u), x ∈ X.
A Summary of Pathologies
Figure 3.1.1. A deterministic shortest path problem with a single node 1 and a
termination node t. At 1 there are two choices: a self-transition, which costs a,
and a transition to t, which costs b.
proper, it is called improper . Thus there exists a proper policy if and only
if each node is connected to t with a path. Furthermore, an improper policy
has cost greater than −∞ starting from every initial state if and only if all
the cycles of the corresponding subgraph have nonnegative cycle cost.
Let us now get a sense of what may happen by considering the simple
one-node example shown in Fig. 3.1.1. Here there is a single state 1 in
addition to the termination state t. At state 1 there are two choices: a
self-transition, which costs a, and a transition to t, which costs b. The
mapping H, abbreviating J(1) with just the scalar J, is
H(1, u, J) = a + J if u: self transition,
H(1, u, J) = b if u: transition to t,
for all J ∈ ℝ.
There are two policies here: the policy µ that transitions from 1 to t,
which is proper, and the policy µ′ that self-transitions at state 1, which is
improper. We have
Tµ J = b,  Tµ′ J = a + J,  J ∈ ℝ,
and
T J = min{b, a + J},  J ∈ ℝ.
Note that for the proper policy µ, the mapping Tµ : ℝ ↦ ℝ is a contraction.
For the improper policy µ′, the mapping Tµ′ : ℝ ↦ ℝ is not a contraction,
and it has a fixed point within ℝ only if a = 0, in which case every J ∈ ℝ
is a fixed point.
We now consider the optimal cost J*, the fixed points of T within ℝ,
and the behavior of the VI and PI methods for different combinations of
values of a and b.
(a) If a > 0, the optimal cost, J * = b, is the unique fixed point of T , and
the proper policy is optimal.
(b) If a = 0, the set of fixed points of T (within ℝ) is the interval (−∞, b].
Here the improper policy is optimal if b ≥ 0, and the proper policy is
optimal if b ≤ 0 (both policies are optimal if b = 0).
(c) If a = 0 and b > 0, the proper policy is strictly suboptimal, yet its cost
at state 1 (which is b) is a fixed point of T . The optimal cost, J * = 0,
lies in the interior of the set of fixed points of T , which is (−∞, b].
Thus the VI method that generates {T^k J} starting with J ≠ J*
cannot find J*. In particular if J is a fixed point of T , VI stops at J,
while if J is not a fixed point of T (i.e., J > b), VI terminates in two
iterations at b ≠ J*. Moreover, the standard PI method is unreliable
in the sense that starting with the suboptimal proper policy µ, it
may stop with that policy because Tµ Jµ = b = min{b, Jµ } = T Jµ
(the improper/optimal policy µ′ also satisfies Tµ′ Jµ = T Jµ , so a rule
for breaking the tie in favor of µ′ is needed but such a rule may not
be obvious in general).
(d) If a = 0 and b < 0, the improper policy is strictly suboptimal, and
we have J * = b. Here it can be seen that the VI sequence {T k J}
converges to J * for all J ≥ b, but stops at J for all J < b, since the
set of fixed points of T is (−∞, b]. Moreover, starting with either
the proper policy or the improper policy, the standard form of PI
may oscillate, since Tµ Jµ′ = T Jµ′ and Tµ′ Jµ = T Jµ , as can be easily
verified [the optimal policy µ also satisfies Tµ Jµ = T Jµ but it is not
clear how to break the tie; compare also with case (c) above].
(e) If a < 0, the improper policy is optimal and we have J * = −∞.
There are no fixed points of T within ℝ, but J* is the unique fixed
point of T within the set [−∞, ∞]. The VI method will converge to
J * starting from any J ∈ [−∞, ∞]. The PI method will also converge
to the optimal policy starting from either policy.
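The cases above are easy to reproduce numerically. A minimal sketch (the particular values of a and b are chosen for illustration) computes the fixed points of T J = min{b, a + J} on a grid and runs VI from a few starting points:

```python
import numpy as np

def T(J, a, b):
    return min(b, a + J)

for a, b in [(1.0, 2.0), (0.0, 1.0), (0.0, -1.0), (-1.0, 1.0)]:   # samples from cases (a)-(e)
    # Fixed points of T on a grid of real J.
    grid = np.linspace(-10, 10, 2001)
    fixed = [J for J in grid if abs(T(J, a, b) - J) < 1e-9]
    # VI from two starting points.
    traj = []
    for J0 in (5.0, -5.0):
        J = J0
        for _ in range(100):
            J = T(J, a, b)
        traj.append(J)
    print(f"a={a:+.0f}, b={b:+.0f}:  fixed points found in [-10,10]: {len(fixed)},"
          f"  VI limits from J0=+5/-5: {traj}")
```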
We consider the SSP problem, which was described in Example 1.2.6 and
will be revisited in Section 3.5.1. Here a policy is associated with a station-
ary Markov chain whose states are 1, . . . , n, plus the cost-free termination
state t. The cost of a policy starting at a state x is the sum of the expected
cost of its transitions up to reaching t. A policy is said to be proper if, in
its Markov chain, every state is connected with t with a path of positive
probability transitions, and otherwise it is called improper. Equivalently, a
policy is proper if its Markov chain has t as its unique ergodic state, with
all other states being transient.
In deterministic shortest path problems, it turns out that Jµ is always
a fixed point of Tµ , and J * is always a fixed point of T . This is a generic
feature of deterministic problems, which was illustrated in Section 1.1 (see
Exercise 3.1 for a rigorous proof). However, in SSP problems where the
[Figure 3.1.2: a stochastic shortest path problem with an improper policy µ whose transitions (shown with solid lines) form two zero length cycles, so that Jµ(1) = 0. At state 1 the successor state is 2 or 5 with equal probability 1/2.]
cost per stage can take both positive and negative values this need not be
so, as we will now show with an example due to [BeY16].
Let us consider the problem of Fig. 3.1.2. It involves an improper
policy µ, whose transitions are shown with solid lines in the figure, and
form the two zero length cycles shown. All the transitions under µ are
deterministic, except at state 1 where the successor state is 2 or 5 with
equal probability 1/2. The problem has been deliberately constructed so
that corresponding costs at the nodes of the two cycles are negatives of
each other. As a result, the expected cost at each time period starting
from state 1 is 0, implying that the total cost over any finite or even
infinite number of periods is 0.
Indeed, to verify that Jµ (1) = 0, let ck denote the cost incurred at
time k, starting at state 1, and let sN(1) = Σ_{k=0}^{N−1} ck denote the N-step
accumulation of ck starting from state 1. We have
sN (1) = 0 if N = 1 or N = 4 + 3t, t = 0, 1, . . .,
Figure 3.1.3. Transition diagram for the first variant of the blackmailer problem.
At state 1, the blackmailer may demand any amount u ∈ (0, 1]. The victim will
comply with probability 1 − u2 and will not comply with probability u2 , in which
case the process will terminate.
Tµ J = −µ + (1 − µ2 )J. (3.3)
Jµ = Tµ Jµ = −µ + (1 − µ2 )Jµ ,
which yields
Jµ = −1/µ.
Here all policies are proper in the sense that they lead asymptotically to
t with probability 1, and the infimum of Jµ over µ is −∞, implying also
J = J − sup_{0<u≤1} {u + u^2 J}.
It can be seen that for every policy µ, Tµ is again a contraction and we have
Jµ = µ − 1. Thus J * = −1, but again there is no optimal policy, stationary
or not. Moreover, T has multiple fixed points: its set of fixed points within
, is {J | J ≤ −1}. Here the VI method will converge to J * starting from
any J ∈ [−1, ∞). The PI method will produce an ever improving sequence
of policies {µk } with Jµk ↓ J * , starting from any policy µ0 , while µk will
converge to 0, which is not a feasible policy.
† An unusual fact about this problem is that there exists a nonstationary pol-
icy π ∗ that is optimal in the sense that Jπ∗ = J ∗ = −∞ (for a proof see [Ber12a],
Section 3.2). The underlying intuition is that when the amount demanded u is
decreased toward 0, the probability of noncompliance, u2 , decreases much faster.
This fact, however, will not be significant in the context of our analysis.
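A quick numerical sketch of the first variant: for each stationary policy µ ∈ (0, 1], Jµ = −1/µ, so the infimum over µ is −∞, and VI applied to T J = min_{0<u≤1} {−u + (1 − u²)J} (minimized here over a fine grid of u, a numerical shortcut rather than part of the model) drives J steadily downward, toward −∞.

```python
import numpy as np

u_grid = np.linspace(1e-3, 1.0, 1000)       # discretization of the demands u in (0, 1]

def T(J):
    return np.min(-u_grid + (1 - u_grid**2) * J)

# Cost of stationary policies: J_mu = -1/mu, unbounded below as mu -> 0.
for mu in (1.0, 0.1, 0.01):
    print("mu =", mu, " J_mu =", -1.0 / mu)

# VI: the iterates decrease without bound.
J = 0.0
for k in range(1, 201):
    J = T(J)
    if k % 50 == 0:
        print("VI iterate", k, ":", J)
```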
but also allow, in addition to u ∈ (0, 1], the choice u = 0 that self-transitions
to state 1 at a cost c (this is the choice where the blackmailer can forego
blackmail for a single period in exchange for a fixed payment −c). Here
there is the extra (improper) policy µ′ that chooses µ′(1) = 0. We have
Tµ′ J = c + J,
Let us consider the optimal policies and the fixed points of T in the two
cases where c ≥ 0 and c < 0.
When c ≥ 0, we have J* = −1, while Jµ′ = ∞ (if c > 0) or Jµ′ = 0
(if c = 0). It can be seen that there is no optimal policy, and that all
J ∈ (−∞, −1] are fixed points of T , including J * . Here the VI method will
converge to J * starting from any J ∈ [−1, ∞). The PI method will produce
an ever improving sequence of policies {µk }, with Jµk ↓ J * . However, µk
will converge to 0, which is a feasible but strictly suboptimal policy.
When c < 0, we have Jµ′ = −∞, and the improper policy µ′ is
optimal. Here the optimal cost over just the proper policies is Ĵ = −1,
while J* = −∞. Moreover Ĵ is not a fixed point of T , and in fact T
has no real-valued fixed points, although J * is a fixed point. It can be
verified that the VI algorithm will converge to J * starting from any scalar
J. Furthermore, starting with a proper policy, the PI method will produce
the optimal (improper) policy within a finite number of iterations.
One of the most important optimal control problems involves a linear sys-
tem and a cost per stage that is positive semidefinite quadratic in the state
and the control. The objective here is roughly to bring the system at or
close to the origin, which can be viewed as a cost-free and absorbing state.
Thus the problem has a shortest path character, even though the state
space is continuous.
Under reasonable assumptions (involving the notions of system con-
trollability and observability; see e.g., [Ber17a], Section 3.1), the problem
admits a favorable analysis and an elegant solution: the optimal cost func-
tion is positive semidefinite quadratic and the optimal policy is a linear
xk+1 = γxk + uk ,  xk ∈ ℝ, uk ∈ ℝ,
and it is seen that J * is a solution. We will now show that there is another
solution, which has an interesting interpretation.
Let us assume that γ > 1 so the system is unstable (the instability of
the system is important for the purpose of this example). It is well-known
that for linear-quadratic problems the class of quadratic cost functions,
S = {J | J(x) = px^2, p ≥ 0},
µ(x) = rx,
xk+1 = (γ + r)xk
Note that there is no policy in L that is optimal, since the optimal policy
µ∗ (x) ≡ 0 is unstable and does not belong to L.
Let us consider fixed points of the mapping T ,
(T J)(x) = inf_{u∈ℝ} {u^2 + J(γx + u)},
within the class of nonnegative quadratic functions S. For J(x) = px^2 with
p ≥ 0, we have
(T J)(x) = inf_{u∈ℝ} {u^2 + p(γx + u)^2},
and by setting to 0 the derivative with respect to u, we see that the infimum
is attained at
u* = − (pγ/(1 + p)) x.
By substitution into the formula for T J, we obtain
(T J)(x) = (pγ^2/(1 + p)) x^2.   (3.6)
Thus the function J(x) = px^2 is a fixed point of T if and only if p solves
the equation
p = pγ^2/(1 + p).
This equation has two solutions:
p=0 and p = γ 2 − 1,
as shown in Fig. 3.1.4. Thus there are exactly two fixed points of T within
S: the functions
J*(x) ≡ 0 and Ĵ(x) = (γ^2 − 1)x^2.
The fixed point Ĵ has some significance. It turns out to be the optimal
cost function within the subclass L of linear policies that are stable. This
can be verified by minimizing the expression (3.5) over the parameter r. In
particular, by setting to 0 the derivative with respect to r of
r^2 / (1 − (γ + r)^2),
µ̂(x) = ((1 − γ^2)/γ) x,
Figure 3.1.4. Illustrating the fixed points of T , and the convergence of the VI
algorithm for the one-dimensional linear-quadratic problem.
Thus, we have
Jµ̂(x) = inf_{µ∈L} Jµ(x) = Ĵ(x),  x ∈ ℝ.
pk+1 = pk γ^2/(1 + pk),  k = 0, 1, . . . .
From Fig. 3.1.4 it can be seen that starting with p0 > 0, the sequence {pk }
converges to
p̂ = γ 2 − 1,
we can compute Jµ as the limit of the VI sequence {Tµk J}, where J is any
function in S, i.e.,
and noting that the iteration that maps p to r2 + p(γ + r)2 converges to
pµ = r^2 / (1 − (γ + r)^2),
Tµk J → Jµ , ∀ µ ∈ L, J ∈ S.
Moreover, we have Jµ = Tµ Jµ .
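A minimal numerical sketch of the iteration pk+1 = pk γ²/(1 + pk) discussed above (γ chosen arbitrarily with γ > 1): starting from any p0 > 0 the sequence converges to p̂ = γ² − 1, i.e., to the fixed point Ĵ(x) = (γ² − 1)x², while p0 = 0 stays at the other fixed point p = 0, which corresponds to J*.

```python
gamma = 1.5                                  # unstable system: gamma > 1 (illustrative value)

def vi_step(p):
    return p * gamma**2 / (1 + p)            # p_{k+1} = p_k * gamma^2 / (1 + p_k)

for p0 in (0.0, 0.01, 1.0, 100.0):
    p = p0
    for _ in range(200):
        p = vi_step(p)
    print(f"p0 = {p0:7.2f}  ->  limit of p_k = {p:.6f}   (gamma^2 - 1 = {gamma**2 - 1:.6f})")
```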
We now use a standard proof argument to show that PI generates
a sequence of linear stable policies starting from a linear stable policy.
Indeed, we have for all k,
where the second inequality follows by the monotonicity of Tµ1 and the
third inequality follows from the fact Jµ0 ≥ Ĵ. By taking the limit as
k → ∞, we obtain
Jµ0 ≥ T Jµ0 ≥ Jµ1 ≥ Ĵ.
It can be verified that µ1 is a nonzero linear policy, so the preceding relation
implies that µ1 is linear stable. Continuing similarly, it follows that the
policies µk generated by PI are linear stable and satisfy for all k,
Jµk ≥ T Jµk ≥ Jµk+1 ≥ Ĵ.
Moreover Jµ is real-valued.
(b) M̂ is well-behaved with respect to PI: For each µ ∈ M̂, any policy µ′
such that
Tµ′ Jµ = T Jµ
belongs to M̂, and there exists at least one such µ′.
We can show that Ĵ is a fixed point of T and obtain our main results
with the following line of argument. The first step in this argument is
to show that the cost functions of a PI-generated sequence {µk} ⊂ M̂
(starting from a µ0 ∈ M̂) are monotonically nonincreasing. Indeed, we
have using Eq. (3.7),
Jµk ≥ T Jµk ≥ lim_{m→∞} Tµk+1^m Jµk = Jµk+1 ≥ Ĵ,   (3.8)
where the equality holds by Eq. (3.7), and the rightmost inequality holds
since µk+1 ∈ M̂ [by assumption (b) above]. Thus we obtain Jµk ↓ J∞ ≥ Ĵ,
for some function J∞.
Now by taking the limit as k → ∞ in the relation Jµk ≥ T Jµk ≥ Jµk+1
[cf. Eq. (3.8)], it follows (under a mild continuity assumption) that J∞ is
a fixed point of T with J∞ ≥ Ĵ. † We claim that J∞ = Ĵ. Indeed we have
By taking the limit as k → ∞, and using the fact µ ∈ M̂ [cf. Eq. (3.7)], we
obtain Ĵ ≤ J∞ ≤ Jµ for all µ ∈ M̂. By taking the infimum over µ ∈ M̂, it
follows that J∞ = Ĵ.
Finally, let J be real-valued and satisfy J ≥ Ĵ. We claim that T^k J → Ĵ.
Indeed, since we have shown that Ĵ is a fixed point of T , we have
Tµ^k J ≥ T^k J ≥ T^k Ĵ = Ĵ,  ∀ µ ∈ M̂, k ≥ 0,
† We elaborate on this argument; see also the proof of Prop. 3.2.4 in the
next section. From Eq. (3.8), we have Jµk ≥ T Jµk ≥ T J∞ , so by letting k → ∞,
we obtain J∞ ≥ T J∞ . For the reverse inequality, we assume that H has the
property that H(x, u, J) = limm→∞ H(x, u, Jm ) for all x ∈ X, u ∈ U (x), and
sequence {Jm } of real-valued functions with Jm ↓ J. Thus we have
and from the preceding two relations we have H(x, u, J∞ ) ≥ J∞ (x). By taking
the infimum over u ∈ U (x), it follows that T J∞ ≥ J∞ . Combined with the
relation J∞ ≥ T J∞ shown earlier, this implies that J∞ is a fixed point of T .
Jµ ≥ lim_{k→∞} T^k J ≥ Ĵ,  ∀ µ ∈ M̂.
Problem Formulation
Let us first introduce formally the model that we will use in this chap-
ter. Compared to the contractive model of Chapter 2, it maintains the
monotonicity assumption, but not the contraction assumption.
We introduce the set X of states and the set U of controls, and for
each x ∈ X, the nonempty control constraint set U (x) ⊂ U . Since in
the absence of the contraction assumption, the cost function Jµ of some
policies µ may take infinite values for some states, we will use the set of
extended real numbers ℝ* = ℝ ∪ {∞, −∞} = [−∞, ∞]. The mathematical
operations with ∞ and −∞ are standard and are summarized in Appendix
A. We consider the set of all extended real-valued functions J : X ↦ ℝ*,
which we denote by E(X). We also denote by R(X) the set of real-valued
functions J : X ↦ ℝ.
As earlier, when we write lim, lim sup, or lim inf of a sequence of
functions we mean it to be pointwise. We also write Jk → J to mean that
Jk (x) → J(x) for each x ∈ X; see Appendix A.
We denote by M the set of all functions µ : X ↦ U with µ(x) ∈ U(x)
for all x ∈ X, and by Π the set of policies π = {µ0, µ1, . . .}, where µk ∈ M
for all k. We refer to a stationary policy {µ, µ, . . .} simply as µ. We
introduce a mapping H : X × U × E(X) ↦ ℜ* that satisfies the following.

J ≤ J′  ⇒  T^k J ≤ T^k J′,  T_µ^k J ≤ T_µ^k J′,   ∀ µ ∈ M,

Jπ(x) = lim sup_{k→∞} (Tµ0 · · · Tµk J̄)(x),   ∀ x ∈ X.
Figure 3.2.2. Illustration of S-regular and S-irregular policies for the case where
there is only one state and S = ℜ. There are three mappings Tµ corresponding
to S-irregular policies: one crosses the 45-degree line at multiple points, another
crosses at a single point but at an angle greater than 45 degrees, and the third is
discontinuous and does not cross at all. The mapping Tµ of the ℜ-regular policy
has Jµ as its unique fixed point and satisfies Tµ^k J → Jµ for all J ∈ ℜ.
Figure 3.2.3. Illustration of a mapping Tµ where there is only one state and S
is a subset of the real line. Here Tµ has two fixed points, Jµ and J̃. If S is as
shown, µ is S-regular. If S is enlarged to include J̃, µ becomes S-irregular.
Figure 3.2.4. Interpretation of Prop. 3.2.1, where for illustration purposes, E(X)
is represented by the extended real line. A set S ⊂ E(X) such that JS∗ is a fixed
point of T , demarcates the well-behaved region WS [cf. Eq. (3.10)], within which
T has a unique fixed point, and starting from which the VI algorithm converges
to JS∗ .
which we refer to as the well-behaved region (see Fig. 3.2.4). Note that by
the definition of S-regularity, the cost functions Jµ , µ ∈ MS , belong to
S and hence also to WS . The proposition also provides a necessary and
sufficient condition for an S-regular policy µ∗ to be MS -optimal.
JS* = T JS* ≤ Tµ JS* ≤ · · · ≤ T_µ^k JS* ≤ T_µ^k J̃,   k = 1, 2, . . . ,

where the equality follows from the fixed point property of JS*, while the
inequalities follow from the monotonicity and the definition of T. The
right-hand side tends to Jµ as k → ∞, since µ is S-regular and J̃ ∈ S.
Hence the infimum over µ ∈ MS of the limit of the right-hand side tends
to the left-hand side JS*. It follows that T^k J → JS*.
(c) From the assumptions Tµ JS* = T JS* and T JS* = JS* , we have Tµ JS* = JS* ,
and since JS* ∈ S and µ is S-regular, we have JS* = Jµ . Thus µ is MS -
optimal. Conversely, if µ is MS -optimal, we have Jµ = JS* , so that the
fixed point property of JS* and the S-regularity of µ imply that

Tµ JS* = Tµ Jµ = Jµ = JS* = T JS*.

Q.E.D.
Some useful extensions and modified versions of the preceding propo-
sition are given in Exercises 3.2-3.5. Let us illustrate the proposition in the
context of the deterministic shortest path example of Section 3.1.1.
Example 3.2.1
Consider the deterministic shortest path example of Section 3.1.1 for the case
where there is a zero length cycle (a = 0), and let S be the real line ℜ. There
are two policies: µ which moves from state 1 to the destination at cost b, and
µ′ which stays at state 1 at cost 0. We use X = {1} (i.e., we do not include
t in X, since all function values of interest are 0 at t). Then by abbreviating
function values J(1) with J, we have

Tµ J = b,   Tµ′ J = J,   T J = min{b, J},

and the initial function J̄ is taken to be 0. It can be seen from the definition of
S-regularity that µ is S-regular, while the policy µ′ is not. The cost functions
Jµ, Jµ′, and J* are fixed points of the corresponding mappings, but the sets of
fixed points of Tµ′ and T within S are ℜ and (−∞, b], respectively. Moreover,
JS* = Jµ = b, so JS* is a fixed point of T and Prop. 3.2.1 applies.
The figure also shows the well-behaved regions for the two cases b > 0
and b < 0. It can be seen that the results of Prop. 3.2.1 are consistent with
the discussion of Section 3.1.1. In particular, the VI algorithm fails when
Figure 3.2.5. The well-behaved region of Eq. (3.10) for the deterministic shortest
path example of Section 3.1.1 when there is a zero length cycle (a = 0). The
stationary policy costs are Jµ(1) = b and Jµ′(1) = 0, and the optimal cost is
J*(1) = min{b, 0}. For S = ℜ, the policy µ is S-regular, while the policy µ′ is
not. The figure illustrates the two cases where b > 0 and b < 0.
started outside the well-behaved region, while when started from within the
region, it is attracted to JS∗ rather than to J ∗ .
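As a minimal numerical sketch of this behavior, assuming the one-state Bellman operator T J = min{b, a + J} with a = 0 (the value of b and the starting points below are illustrative):

# Sketch: VI for the one-state shortest path example with a zero length cycle.

def T(J, a=0.0, b=1.0):
    # Either move to the destination (cost b) or traverse the cycle (cost a).
    return min(b, a + J)

b = 1.0
for J0 in (5.0, 1.0, 0.5, -2.0):        # starting points inside/outside W_S
    J = J0
    for _ in range(50):
        J = T(J, a=0.0, b=b)
    # For J0 >= b (inside the well-behaved region) VI is attracted to J_S^* = b;
    # for J0 < b it remains stuck at J0, and never reaches J^* = min{b, 0}.
    print(J0, J)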
Let us now discuss some of the fine points of Prop. 3.2.1. The salient
assumption of the proposition is that JS∗ is a fixed point of T . Depending on
the choice of S, this may or may not be true, and much of the subsequent
analysis in this chapter is geared towards the development of approaches
to choose S so that JS∗ is a fixed point of T and has some other interesting
properties. As an illustration of the range of possibilities, consider the three
variants of the blackmailer problem of Section 3.1.3 for the choice S = ℜ:
(a) In the first variant, we have J* = JS* = −∞, and JS* is a fixed point
of T that lies outside S. Here parts (a) and (b) of Prop. 3.2.1 apply.
However, part (c) does not apply (even though we have Tµ JS* = T JS*
for all policies µ) because JS* ∉ S, and in fact there is no MS-optimal
policy. In the subsequent analysis, we will see that the condition
JS* ∈ S plays an important role in being able to assert existence of an
MS -optimal policy (see the subsequent Props. 3.2.5 and 3.2.6).
(b) In the second variant, we have J * = JS* = −1, and JS* is a fixed point
of T that lies within S. Here parts (a) and (b) of Prop. 3.2.1 apply,
but part (c) still does not apply because there is no S-regular µ such that Tµ JS* = T JS*.
Figure 3.2.6. Illustration of why the assumption that JS* is a fixed point of T is
essential for Prop. 3.2.1. In this example there is only one state and S = ℜ. There
are two stationary policies: µ, for which Tµ is a contraction, so µ is ℜ-regular,
and µ̄, for which Tµ̄ has multiple fixed points, so µ̄ is ℜ-irregular. Moreover,
Tµ̄ is discontinuous from above at Jµ as shown. Here, it can be verified that
Tµ0 · · · Tµk J̄ ≥ Jµ for all µ0, . . . , µk and k, so that Jπ ≥ Jµ for all π and the S-
regular policy µ is optimal, so JS* = J*. However, as can be seen from the figure,
we have JS* = J* ≠ T J* = T JS*. Moreover, starting at JS*, the VI sequence T^k JS*
converges to J′, the fixed point of T shown in the figure, and all parts of Prop.
3.2.1 fail.
‖T J − JS*‖v ≤ β ‖J − JS*‖v,   ∀ J ∈ WS.    (3.11)

Moreover, we have

‖J − JS*‖v ≤ (1 / (1 − β)) sup_{x∈X} [ J(x) − (T J)(x) ] / v(x),   ∀ J ∈ WS.    (3.12)
By using again the relation Tµ JS* = T JS* , we have for all x ∈ X and
all J ∈ WS ,
[ J(x) − JS*(x) ] / v(x) = [ J(x) − (T J)(x) ] / v(x) + [ (T J)(x) − JS*(x) ] / v(x)
                        ≤ [ J(x) − (T J)(x) ] / v(x) + [ (Tµ J)(x) − (Tµ JS*)(x) ] / v(x)
                        ≤ [ J(x) − (T J)(x) ] / v(x) + β ‖J − JS*‖v.
By taking the supremum of both sides over x, we obtain Eq. (3.12). Q.E.D.
The critical assumption of Prop. 3.2.1 is that JS* is a fixed point of T . For
a specific application, this must be proved with a separate analysis after a
suitable set S is chosen. To this end, we will provide several approaches
that guide the choice of S and facilitate the analysis.
One approach applies to problems where J * is generically a fixed
point of T , in which case for every set S such that JS* = J * , Prop. 3.2.1
applies and shows that J * can be obtained by the VI algorithm starting
from any J ∈ WS . Exercise 3.1 provides some conditions that guarantee
that J * is a fixed point of T . These conditions can be verified in wide
classes of problems such as deterministic models. Sections 3.5.4 and 3.5.5
illustrate this approach. Other important models where J * is guaranteed to
be a fixed point of T are the monotone increasing and monotone decreasing
models of Section 4.3. We will discuss the application of Prop. 3.2.1 and
other related results to these models in Chapter 4.
In the present chapter the approach for showing that JS* is a fixed
point of T will be mostly based on the PI algorithm; cf. the discussion of
Section 3.1.5. An alternative and complementary approach is the perturba-
tion-based analysis to be given in Section 3.4. This approach will be applied
to a variety of problems in Section 3.5, and will also be prominent in
Sections 4.5 and 4.6 of the next chapter.
We will develop a PI-based approach for showing that JS* is a fixed point
of T . The approach is applicable under assumptions that guarantee that
there is a sequence {µk } of S-regular policies that can be generated by PI.
The significance of S-regularity of all µk lies in that the corresponding cost
function sequence {Jµk } belongs to the well-behaved region of Eq. (3.10),
and is monotonically nonincreasing (see the subsequent Prop. 3.2.3). Un-
der an additional mild technical condition, the limit of this sequence is a
fixed point of T and is in fact equal to JS* (see the subsequent Prop. 3.2.4).
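As a rough illustration of the PI iteration Tµk+1 Jµk = T Jµk, here is a minimal sketch for a finite abstract model specified by a mapping H, with the policy evaluation step approximated by a long run of VI with Tµ (this presumes that Tµ^k J converges for the policies encountered); the mapping H and the data in the usage example are illustrative assumptions, not the text's:

# Sketch: PI for a finite abstract model given H(x, u, J).

def policy_iteration(H, states, controls, J0, n_eval=1000, n_iters=20):
    J = dict(J0)                                  # current cost estimate
    mu = {x: controls(x)[0] for x in states}      # arbitrary initial policy
    for _ in range(n_iters):
        # Policy evaluation: approximate J_mu as the limit of T_mu^k J.
        for _ in range(n_eval):
            J = {x: H(x, mu[x], J) for x in states}
        # Policy improvement: mu^{k+1}(x) attains the minimum of H(x, u, J_mu).
        mu = {x: min(controls(x), key=lambda u: H(x, u, J)) for x in states}
    return J, mu

# Illustrative use on the one-state shortest path example (cycle length a, exit cost b):
a, b = 1.0, 2.0
H = lambda x, u, J: b if u == "stop" else a + J[1]     # u in {"stop", "cycle"}
J, mu = policy_iteration(H, states=[1], controls=lambda x: ["stop", "cycle"], J0={1: 0.0})
print(J, mu)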
Jµk ≥ Jµk+1 , k = 0, 1, . . . .
This follows from the relation

Jµk = Tµk Jµk ≥ T Jµk = Tµk+1 Jµk ≥ lim_{m→∞} T_{µk+1}^m Jµk = Jµk+1,

where the equation on the right holds since µk+1 is S-regular and Jµk ∈ S
(in view of the S-regularity of µk). Q.E.D.
Note a fine point here. For a given starting policy µ0 , there may be
many different sequences {µk } that can be generated by PI [i.e., satisfy Eq.
(3.13)]. While the weak PI property guarantees that some of these consist
of S-regular policies exclusively, there may be some that do not. The policy
improvement property shown in Prop. 3.2.3 holds for the former sequences,
but not necessarily for the latter. The following proposition provides the
basis for showing that JS* is a fixed point of T based on the weak PI
property.
Then:
(a) JS* is a fixed point of T and the conclusions of Prop. 3.2.1 hold.
(b) (PI Convergence) Every sequence of S-regular policies {µk } that
can be generated by PI satisfies Jµk ↓ JS* . If in addition the set
of S-regular policies is finite, there exists k̄ ≥ 0 such that µk̄ is
MS -optimal.
we have
Jµk ≥ T Jµk ≥ T J∞ ,
so by letting k → ∞, we obtain J∞ ≥ T J∞ . From Eq. (3.15) we also have
T Jµk ≥ Jµk+1. Taking the limit in this relation as k → ∞, we obtain
and by taking the infimum of the left-hand side over u ∈ U (x), it follows
that T J∞ ≥ J∞ . Thus J∞ is a fixed point of T .
Finally, we show that J∞ = JS* . Indeed, since JS* ≤ Jµk , we have
Consider the deterministic shortest path example of Section 3.1.1 for the case
where there is a zero length cycle (a = 0), and let S be the real line ℜ, as
in Example 3.2.1. There are two policies: µ which moves from state 1 to the
destination at cost b, and µ" which stays at state 1 at cost 0. Starting with
the S-regular policy µ, the PI algorithm generates the policy that corresponds
Let us also revisit the blackmailer example of Section 3.1.3. In the first
variant of that example, when S = ℜ, all policies are S-regular, the weak
PI property holds, and Prop. 3.2.4 applies. In this case, PI will generate a
sequence of S-regular policies whose cost functions converge to JS* = −∞,
which is a fixed point of T, consistent with Prop. 3.2.4 (even though JS* ∉ S
and there is no MS-optimal policy).
Proposition 3.2.4(a) does not guarantee that every sequence {µk } generated
by the PI algorithm satisfies Jµk ↓ JS* . This is true only for the sequences
that consist of S-regular policies. We know that when the weak PI property
holds, there exists at least one such sequence, but PI can also generate
sequences that contain S-irregular policies, even when started with an S-
regular policy, as we have seen in Example 3.2.2. We thus introduce a
stronger type of PI property, which will guarantee stronger conclusions.
The strong PI property implies that every sequence that can be gen-
erated by PI starting from an S-regular policy consists exclusively of S-
regular policies. Moreover, there exists at least one such sequence. Hence
the strong PI property implies the weak PI property. Thus if the strong
PI property holds together with the mild continuity condition (2) of Prop.
3.2.4, it follows that JS* is a fixed point of T and Prop. 3.2.1 applies. We
will see that the strong PI property implies additional results, relating to
the uniqueness of the fixed point of T .
Then:
(a) A policy µ satisfying Tµ J ≤ J for some function J ∈ S is S-
regular.
(b) S has the strong PI property.
where the second equality holds since µ was proved to be S-regular, and
JS* ∈ S by assumption. Hence equality holds throughout in the above
relation, which proves that JS* is a fixed point of T (implying the conclusions
of Prop. 3.2.1) and that µ is MS -optimal.
(c) Since the strong PI property [which holds by Prop. 3.2.5(b)] implies the
weak PI property, the result follows from Prop. 3.2.4(b). Q.E.D.
Consider the deterministic shortest path example of Section 3.1.1 for the case
where the cycle has positive length (a > 0), and let S be the real line ℜ, as
in Example 3.2.1. The two policies are: µ, which moves from state 1 to the
destination at cost b and is S-regular, and µ′, which stays at state 1 at cost
a and is S-irregular. However, µ′ has infinite cost and satisfies Eq. (3.17).
As a result, Prop. 3.2.5 applies and the strong PI property holds. Consistent
with Prop. 3.2.6, JS* is the unique fixed point of T within S.
Turning now to the PI algorithm, we see that starting from the S-regular
µ, which is optimal, it stops at µ, consistent with Prop. 3.2.6(c). However,
starting from the S-irregular policy µ′ the policy evaluation portion of the
PI algorithm must be able to deal with the infinite cost values associated
with µ′. This is a generic difficulty in applying PI to problems where there
are irregular policies: we either need to know an initial S-regular policy, or
We have already shown the validity of the VI and PI algorithms for com-
puting JS* (subject to various assumptions, and restrictions involving the
starting points). In this section and the next one we will consider some ad-
ditional algorithmic approaches that can be justified based on the preceding
analysis.
An Optimistic Form of PI
Then the optimistic PI algorithm (3.18) is well defined and the follow-
ing hold:
(a) The sequence {Jk } generated by the algorithm satisfies Jk ↓ J∞ ,
where J∞ is a fixed point of T .
(b) If for a set S ⊂ E(X), the sequence {µk } generated by the algo-
rithm consists of S-regular policies, and we have Jk ∈ S for all
k, then Jk ↓ JS* and JS* is a fixed point of T .
Proof: (a) Condition (1) guarantees that the sequence {Jk , µk } is well
defined in the following argument. We have
J0 ≥ T J0 = Tµ0 J0 ≥ T_{µ0}^{m0} J0 = J1 ≥ T_{µ0}^{m0+1} J0 = Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ · · · ≥ J2,    (3.19)
and continuing similarly, we obtain
Jk ≥ T Jk ≥ Jk+1 , k = 0, 1, . . . . (3.20)
J∞ ≥ T J ∞ .
T J ∞ ≥ J∞ ,
We will also show that the reverse inequality holds, so that J∞ = JS* .
Indeed, for every S-regular policy µ and all k ≥ 0, we have
J∞ = T k J∞ ≤ Tµk J∞ ≤ Tµk J0 ,
Note that, in general, the fixed point J∞ in Prop. 3.2.7(a) need not be
equal to JS* or J*. As an illustration, consider the shortest path Example
3.2.1 with S = ℜ, and a = 0, b > 0. Then if 0 < J0 < b, it can be seen
that Jk = J0 for all k, so J* = 0 < J∞ and J∞ < JS* = b.
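As a minimal sketch of this phenomenon, assuming the optimistic PI iteration (3.18) applied to the same one-state example with a = 0 and b > 0 (parameter values below are illustrative):

# Sketch: optimistic PI on the one-state shortest path example (a = 0, b > 0).

a, b, m = 0.0, 1.0, 5

def T_mu(J, u):
    return b if u == "stop" else a + J          # the two policies at state 1

def optimistic_pi(J0, iters=30):
    J = J0
    for _ in range(iters):
        # Policy improvement: pick a u attaining (T J)(1) = min{b, a + J}.
        u = "stop" if b <= a + J else "cycle"
        # Partial policy evaluation: m applications of T_mu.
        for _ in range(m):
            J = T_mu(J, u)
    return J

print(optimistic_pi(0.5))   # 0 < J0 < b: J_k stays at 0.5, strictly between J* = 0 and J_S^* = b
print(optimistic_pi(3.0))   # J0 >= b: J_k is driven to J_S^* = b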
λ-Policy Iteration
We next consider λ-policy iteration (λ-PI for short), which was described
in Section 2.5. It involves a scalar λ ∈ (0, 1) and it is defined by
Tµk Jk = T Jk,    J_{k+1} = T_{µk}^{(λ)} Jk,    (3.21)

where for any policy µ and scalar λ ∈ (0, 1), T_µ^{(λ)} is the multistep mapping
discussed in Section 1.2.5:

(T_µ^{(λ)} J)(x) = (1 − λ) Σ_{t=0}^∞ λ^t (T_µ^{t+1} J)(x),   x ∈ X.    (3.22)

Here we assume that the limit of the series above is well-defined as a func-
tion in E(X) for all x ∈ X, µ ∈ M, and J ∈ E(X).
We will also assume that T_µ^{(λ)} and Tµ commute, i.e.,

T_µ^{(λ)}(Tµ J) = Tµ(T_µ^{(λ)} J),   ∀ µ ∈ M, J ∈ E(X).    (3.23)
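As a minimal sketch of the multistep mapping (3.22), approximated by truncating the series at a finite horizon; the linear policy mapping Tµ J = bµ + Aµ J used for illustration is an assumption, not part of the text:

# Sketch: truncated evaluation of T_mu^(lambda) J.

import numpy as np

def T_lambda(T_mu, J, lam=0.5, horizon=200):
    # (T_mu^(lambda) J) = (1 - lam) * sum_{t>=0} lam^t * (T_mu^{t+1} J), truncated.
    out = np.zeros_like(J, dtype=float)
    TJ = J.astype(float)
    for t in range(horizon):
        TJ = T_mu(TJ)                       # TJ now equals T_mu^{t+1} J
        out += (1 - lam) * lam**t * TJ
    return out

# Illustrative two-state policy mapping T_mu J = b_mu + A_mu J:
A_mu = np.array([[0.5, 0.2], [0.1, 0.6]])
b_mu = np.array([1.0, 2.0])
T_mu = lambda J: b_mu + A_mu @ J

J = np.zeros(2)
print(T_lambda(T_mu, J))   # one lambda-PI evaluation step, J_{k+1} = T_mu^(lambda) J_k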
Then the λ-PI algorithm (3.21) is well defined and the following hold:
(a) A sequence {Jk } generated by the algorithm satisfies Jk ↓ J∞ ,
where J∞ is a fixed point of T .
(b) If for a set S ⊂ E(X), the sequence {µk } generated by the algo-
rithm consists of S-regular policies, and we have Jk ∈ S for all
k, then Jk ↓ JS* and JS* is a fixed point of T .
Proof: (a) We first note that for all µ ∈ M and J ∈ E(X) such that
J ≥ Tµ J, we have

Tµ J ≥ T_µ^{(λ)} J.

This follows from the power series expansion (3.22) and the fact that J ≥
Tµ J implies that

Tµ J ≥ Tµ^2 J ≥ · · · ≥ Tµ^{m+1} J,   ∀ m ≥ 1.

Using also the monotonicity of Tµ and T_µ^{(λ)}, and Eq. (3.23), we have that

J ≥ Tµ J  ⇒  Tµ J ≥ T_µ^{(λ)} J ≥ T_µ^{(λ)}(Tµ J) = Tµ(T_µ^{(λ)} J).

The preceding relation and our assumptions imply that

J0 ≥ T J0 = Tµ0 J0 ≥ T_{µ0}^{(λ)} J0 = J1 ≥ T_{µ0}^{(λ)}(Tµ0 J0) = Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ · · · ≥ J2.
Continuing similarly, we obtain Jk ≥ T Jk ≥ Jk+1 for all k. Thus Jk ↓ J∞
for some J∞ . From this point, the proof that J∞ is a fixed point of T is
similar to the one of Prop. 3.2.7(a).
(b) Similar to the proof of Prop. 3.2.7(b). Q.E.D.
Jµ = Tµ Jµ ≥ T Jµ ≥ T JS∗ , ∀ µ ∈ MS ,
and taking the infimum over all µ ∈ MS ), so the condition JS∗ ≤ T JS∗ is equivalent
to JS∗ being a fixed point of T .
(f). They will be used to assert that J * = JS* , that J * is the unique fixed
point of T within S, and that the VI and PI algorithms have improved
convergence properties compared with the ones of Section 3.2.
Note that in the case where S is the set of real-valued functions R(X)
and J¯ ∈ R(X), condition (a) is automatically satisfied, while condition
(e) is typically verified easily. The verification of condition (f) may be
nontrivial in some cases. We postpone the discussion of this issue for later
(see the subsequent Prop. 3.3.2).
The main result of this section is the following proposition, which
provides results that are almost as strong as the ones for contractive models.
Proof: For any x ∈ X with (T J)(x) < ∞, let {λm(x)} be a decreasing
scalar sequence with λm(x) ↓ (T J)(x). The set

Um(x) = { u ∈ U(x) | H(x, u, J) ≤ λm(x) },
The next two lemmas follow from the analysis of the preceding section.
Let us also prove the following technical lemma, which makes use of
the additional part (e) of Assumption 3.3.1.
Therefore from the definition (3.25), we have {ui}_{i=k}^∞ ⊂ Uk(x). Since Uk(x)
is compact, all the limit points of {ui}_{i=k}^∞ belong to Uk(x) and at least one
limit point exists. Hence the same is true for the limit points of the whole
sequence {ui}. Thus if ũ is a limit point of {ui}, we have

ũ ∈ ∩_{k=0}^∞ Uk(x).
We are now ready to prove Prop. 3.3.1 by making use of the additional
parts (a) and (f) of Assumption 3.3.1.
Proof of Prop. 3.3.1: (a), (b) We will first prove that T k J → JS* for all
J ∈ S, and we will use this to prove that JS* = J * and that there exists
an optimal S-regular policy. Thus parts (a) and (b), together with the
existence of an optimal S-regular policy, will be shown simultaneously.
We fix J ∈ S, and choose J′ ∈ S such that J′ ≤ J and J′ ≤ T J′
[cf. Assumption 3.3.1(f)]. By the monotonicity of T, we have T^k J′ ↑ J∞
for some J∞ ∈ E(X). Let µ be an S-regular policy such that Jµ = JS* [cf.
Lemma 3.3.3(b)]. Then we have, using again the monotonicity of T,
Tµ0 · · · Tµk−1 J̄ ≥ T^k J̄,

and hence

Jπ ≥ lim_{k→∞} T^k J̄ = JS*,

where the equality follows since T^k J → JS* for all J ∈ S (as shown earlier),
and J̄ ∈ S [cf. Assumption 3.3.1(a)]. Thus for all π ∈ Π, Jπ ≥ JS* = Jµ,
implying that the policy µ that is optimal within the class of S-regular
policies is optimal over all policies, and that JS* = J * .
(c) If µ is optimal, then Jµ = J * ∈ S, so by Assumption 3.3.1(c), µ is
S-regular and therefore Tµ Jµ = Jµ . Hence,
T µ J * = T µ Jµ = Jµ = J * = T J * .
Conversely, if
J * = T J * = Tµ J * ,
µ is S-regular (cf. Lemma 3.3.2), so J * = limk→∞ Tµk J * = Jµ . Therefore,
µ is optimal.
(d) If J ∈ S and J ≤ T J, by repeatedly applying T to both sides and using
the monotonicity of T , we obtain J ≤ T k J for all k. Taking the limit as
k → ∞ and using the fact T k J → J * [cf. part (b)], we obtain J ≤ J * . The
proof that J ≥ T J implies J ≥ J * is similar.
(e) As in the proof of Prop. 3.2.4(b), the sequence {Jµk } converges mono-
tonically to a fixed point of T , call it J∞ . Since J∞ lies between Jµ0 ∈ S
and JS* ∈ S, it must belong to S, by Assumption 3.3.1(a). Since the only
Consider the third variant of the blackmailer problem (Section 3.1.3) for the
case where c > 0 and S = ℜ. Then the (nonoptimal) S-irregular policy µ̄,
whereby at each period the blackmailer may demand no payment (u = 0)
and pay cost c > 0, has infinite cost (Jµ̄ = ∞). However, T has multiple fixed
points within the real line, namely the set (−∞, −1]. By choosing S = ℜ, we
see that the uniqueness of fixed point part (a) of Prop. 3.3.1 fails because the
compactness part (d) of Assumption 3.3.1 is violated (all other parts of the
assumption are satisfied). In this example, the results of Prop. 3.2.1 apply
with S = ℜ, because JS* is a fixed point of T.
In various applications, the verification of part (f) of Assumption 3.3.1
may not be simple. The following proposition is useful in several contexts,
including some that we will encounter in Section 3.5.
Proof: Let J ∈ Rb(X), and let r > 0 be a scalar such that JS* − re ≤ J
[such a scalar exists since JS* ∈ Rb(X) by Assumption 3.3.1(b)]. Define
J′ = JS* − re, and note that by Lemma 3.3.3, JS* is a fixed point of T. By
using Eq. (3.27), we have

J′ = JS* − re = T JS* − re ≤ T(JS* − re) = T J′,

while J′ ∈ Rb(X), thus proving part (f) of Assumption 3.3.1. Q.E.D.
finite for all states. A special case where this occurs is when g(x, w) ≡ 0 for
all x. Then the cost function of µ is identically 0.
Note that case (b) of the deterministic shortest path problem of Sec-
tion 3.1.1, which involves a zero length cycle, is a special case of the search
problem just described. Therefore, the anomalous behavior we saw there
(nonconvergence of VI to J ∗ and oscillation of PI; cf. Examples 3.2.1 and
3.2.2) may also arise in the context of the present example. We will see that
by adding a small positive constant to the length of the cycle we can rectify
the difficulties of VI and PI, at least partially; this is the idea behind the
perturbation approach that we will use in this section.
We will address the finite cost issue for irregular policies by intro-
ducing a perturbation that makes their cost infinite for some states. We
can then use Prop. 3.3.1 of the preceding section. The idea is that with a
perturbation, the cost functions of S-irregular policies may increase dispro-
portionately relative to the cost functions of the S-regular policies, thereby
making the problem more amenable to analysis.
We introduce a nonnegative “forcing function” p : X &→ [0, ∞), and
for each δ > 0 and policy µ, we consider the mappings
! "
(Tµ,δ J)(x) = H x, µ(x), J + δp(x), x ∈ X, Tδ J = inf Tµ,δ J.
µ∈M
Jµ,δ ≤ Jµ + wµ,δ ,
Then JS* is a fixed point of T and the conclusions of Prop. 3.2.1 hold.
Moreover, we have
JS* = Ĵ = lim_{δ↓0} Ĵδ.
Proof: For every x ∈ X, using conditions (2) and (3), we have for all
δ > 0, ε > 0, and µ ∈ M̂,

Ĵ(x) − ε ≤ J_{µx,ε}(x) − ε ≤ J_{µx,ε,δ}(x) − ε ≤ Ĵδ(x) ≤ Jµ,δ(x) ≤ Jµ(x) + wµ,δ(x).

By taking the limit as ε ↓ 0, we obtain for all δ > 0 and µ ∈ M̂,

Ĵ ≤ Ĵδ ≤ Jµ,δ ≤ Jµ + wµ,δ.

By taking the limit as δ ↓ 0 and then the infimum over all µ ∈ M̂, it follows
[using also condition (3)] that

Ĵ ≤ lim_{δ↓0} Ĵδ ≤ inf_{µ∈M̂} lim_{δ↓0} Jµ,δ ≤ inf_{µ∈M̂} Jµ = Ĵ,
Taking the limit as m → ∞, and using condition (4) and the fact Ĵδm ↓ Ĵ
shown earlier, we have

H(x, u, Ĵ) ≥ Ĵ(x),   ∀ x ∈ X, u ∈ U(x),

so that T Ĵ ≥ Ĵ. Thus Ĵ is a fixed point of T.
Finally, to show that Ĵ = JS*, we first note that JS* ≤ Ĵ since every
policy in M̂ is S-regular. For the reverse inequality, let µ be S-regular.
We have Ĵ = T Ĵ ≤ Tµ Ĵ ≤ Tµ^k Ĵ for all k ≥ 1, so that for all µ′ ∈ M̂,

Ĵ ≤ lim_{k→∞} Tµ^k Jµ′ = Jµ,

where the equality follows since µ and µ′ are S-regular (so Jµ′ ∈ S). Taking
the infimum over all S-regular µ, we obtain Ĵ ≤ JS*, so that JS* = Ĵ.
Q.E.D.
Aside from S-regularity of the set M̂, a key assumption of the pre-
ceding proposition is that inf_{µ∈M̂} Jµ,δ = Ĵδ, i.e., that with a perturbation
added, the subset of policies M̂ is sufficient (the optimal cost of the δ-
perturbed problem can be achieved using the policies in M̂). This is the
key insight to apply when selecting M̂.
Note that the preceding proposition applies even if

lim_{δ↓0} Ĵδ(x) > J*(x)  for some x ∈ X.
Assumption 3.4.1: The subset of S-regular policies M̂ is such that:
(a) The conditions of Prop. 3.4.1 are satisfied.
(b) Every policy µ ∈ M̂ is S-regular for all the δ-perturbed problems,
δ > 0.
(c) Given a policy µ ∈ M̂ and a scalar δ > 0, every policy µ′ such
that

Tµ′ Jµ,δ = T Jµ,δ

belongs to M̂, and at least one such policy exists.
Proof: We have that JS* is a fixed point of T by Prop. 3.4.1. The algorithm
definition (3.28) implies that for all m ≥ 1 we have

T_{µk+1,δk}^m J_{µk,δk} ≤ T_{µk+1,δk} J_{µk,δk} = T J_{µk,δk} + δk p ≤ J_{µk,δk}.

It follows that

J_{µk+1,δk+1} ≤ J_{µk+1,δk} = lim_{m→∞} T_{µk+1,δk}^m J_{µk,δk} ≤ J_{µk,δk},
where the equality holds because µk+1 and µk are S-regular for all the δ-
perturbed problems. It follows that {Jµk ,δk } is monotonically nonincreas-
ing, so that Jµk ,δk ↓ J∞ for some J∞ . Moreover, we must have J∞ ≥ JS*
since Jµk ,δk ≥ Jµk ≥ JS* . Thus
We also have

inf_{u∈U(x)} H(x, u, J∞) ≤ lim_{k→∞} inf_{u∈U(x)} H(x, u, J_{µk,δk})
                        ≤ inf_{u∈U(x)} lim_{k→∞} H(x, u, J_{µk,δk})
                        = inf_{u∈U(x)} H(x, u, lim_{k→∞} J_{µk,δk})
                        = inf_{u∈U(x)} H(x, u, J∞),
where the first inequality follows from the fact J∞ ≤ J_{µk,δk}, which implies
that H(x, u, J∞) ≤ H(x, u, J_{µk,δk}), and the first equality follows from the
continuity property that is assumed in Prop. 3.4.1. Thus equality holds
throughout above, so that

lim_{k→∞} T J_{µk,δk} = T J∞.    (3.30)
When the control space U is finite, Prop. 3.4.2 also implies that the
generated policies µk will be optimal for all k sufficiently large. The reason
is that the set of policies is finite and there exists a sufficiently small ε > 0,
such that for all nonoptimal µ there is some state x such that Jµ(x) ≥
Ĵ(x) + ε. This convergence behavior should be contrasted with the behavior
of PI without perturbations, which may lead to oscillations, as noted earlier.
However, when the control space U is infinite, the generated sequence
{µk } may exhibit some serious pathologies in the limit. If {µk }K is a
subsequence of policies that converges to some µ̄, in the sense that
lim µk (x) = µ̄(x), ∀ x = 1, . . . , n,
k→∞, k∈K
it does not follow that µ̄ is S-regular. In fact it is possible that the generated
sequence of S-regular policies {µk} satisfies Jµk → JS* = J*, yet
{µk} may converge to an S-irregular policy whose cost function is strictly
larger than JS*, as illustrated by the following example.
Example 3.4.2
Consider the third variant of the blackmailer problem (Section 3.1.3) for the
case where c = 0 (the blackmailer may forgo demanding a payment at cost
c = 0); see Fig. 3.4.1. Here the mapping T is given by

T J = min{ J, inf_{0<u≤1} [ −u + u^2 + (1 − u)J ] },
Figure 3.4.1. Transition diagram for a blackmailer problem (the third variant
of Section 3.1.3 in the case where c = 0). At state 1, the blackmailer may
demand any amount u ∈ [0, 1]. The victim will comply with probability
1 − u and will not comply with probability u, in which case the process will
terminate.
Jµ(x0) = lim sup_{k→∞} (Tµ^k J̄)(x0) = lim sup_{k→∞} Σ_{m=0}^{k−1} E{ g(xm, µ(xm)) },

where {xm} is the (random) state trajectory generated under policy µ,
starting from initial state x0. The expected value E{g(xm, µ(xm))} above
is defined in the natural way: it is the weighted sum of the numerical values
g(x, µ(x)), x = 1, . . . , n, weighted by the probabilities p(xm = x | x0, µ)
that xm = x given that the initial state is x0 and policy µ is used. Thus
Jµ (x0 ) is the upper limit as k → ∞ of the cost for the first k steps or up
to reaching the destination, whichever comes first.
A stationary policy µ is said to be proper if for every initial state
there is positive probability that the destination will be reached under that
policy after at most n stages. A stationary policy that is not proper is said
to be improper . The relation between proper policies and S-regularity is
given in the following proposition.
3.1.2). Moreover, there may not exist an optimal stationary policy even if
all policies are proper (cf. the three variants of the blackmailer example of
Section 3.1.3).
In this section we will use various assumptions, which we will in turn
translate into the conditions and corresponding results of Sections 3.2-3.4.
Throughout this section we will assume the following.
† The strong SSP conditions and the weak SSP conditions, which will be
introduced shortly, relate to the strong and weak PI properties of Section 3.2.
Ĵ(x) = inf_{µ: proper} Jµ(x),   x ∈ X,

is real-valued.
Rb (X) (since X is finite) and Eq. (3.27) clearly holds. Finally, to verify part
(c) we must show that given an improper policy µ, for every J ∈ R(X) there
exists an x ∈ X such that lim sup_{k→∞} (Tµ^k J)(x) = ∞. This follows since
by Assumption 3.5.3, Jµ(x) = lim sup_{k→∞} (Tµ^k J̄)(x) = ∞ for some x ∈ X,
and (Tµ^k J)(x) and (Tµ^k J̄)(x) differ by E{J(xk)}, an amount that is finite
since J is real-valued and has a finite number of components J(x). Thus
Assumption 3.3.1 holds and the result follows from Prop. 3.3.1. Q.E.D.
Under the strong SSP conditions, we showed in Prop. 3.5.2 that J * is the
unique fixed point of T within R(X). Moreover, we showed that a policy µ∗
is optimal if and only if Tµ∗ J * = T J * , and an optimal proper policy exists
(so in particular J * , being the cost function of a proper policy, is real-
valued). In addition, J* can be computed by the VI algorithm starting
with any J ∈ ℜ^n.
We will now replace Assumption 3.5.3 (improper policies have cost
∞ for some initial states) with the following weaker assumption:
We will refer to the Assumptions 3.5.1, 3.5.2, and 3.5.4 as the weak
SSP conditions. The examples of Sections 3.1.1 and 3.1.2 show that under
these assumptions, it is possible that

J* ≠ Ĵ = inf_{µ: proper} Jµ,
while J * need not be a fixed point of T (Section 3.1.2). The key fact is that
under Assumption 3.5.4, we can use the perturbation approach of Section
3.4, whereby adding δ > 0 to the mapping Tµ makes all improper policies
have infinite cost for some initial states, so the results of Prop. 3.5.2 can be
used for the δ-perturbed problem. In particular, Prop. 3.5.1 implies that
JS* = Ĵ, so from Prop. 3.4.1 it follows that Ĵ is a fixed point of T and the
conclusions of Prop. 3.2.1 hold. We thus obtain the following proposition,
which provides additional results, not implied by Prop. 3.2.1; see Fig. 3.5.1.
which provides additional results, not implied by Prop. 3.2.1; see Fig. 3.5.1.
Figure 3.5.1. Schematic illustration of Prop. 3.5.4 for a problem with two states,
so R(X) = ℜ² = S. We have that Ĵ is the largest solution of Bellman's equation,
while VI converges to Ĵ starting from J ≥ Ĵ. As shown in Section 3.1.2, J* need
not be a solution of Bellman's equation.
Proof: (a), (b) Let S = R(X), so the proper policies are identified with
the S-regular policies by Prop. 3.5.1. We use the perturbation framework
of Section 3.4 with forcing function p(x) ≡ 1. From Prop. 3.5.2 it follows
that Prop. 3.4.1 applies so that Jˆ is a fixed point of T , and the conclusions
of Prop. 3.2.1 hold, so T^k J → Ĵ starting from any J ∈ R(X) with J ≥ Ĵ.
The convergence rate of VI is linear in view of Prop. 3.2.2 and the existence
of an optimal proper policy to be shown in part (c). Finally, let J′ ∈ R(X)
be another solution of Bellman's equation, and let J ∈ R(X) be such that
J ≥ Ĵ and J ≥ J′. Then T^k J → Ĵ, while T^k J ≥ T^k J′ = J′. It follows
that Ĵ ≥ J′.
(c) If the proper policy µ satisfies Jµ = Ĵ, we have Ĵ = Jµ = Tµ Jµ = Tµ Ĵ,
so, using also the relation Ĵ = T Ĵ [cf. part (a)], we obtain Tµ Ĵ = T Ĵ.
Conversely, if µ satisfies Tµ Ĵ = T Ĵ, then using part (a), we have Tµ Ĵ = Ĵ
and hence lim_{k→∞} Tµ^k Ĵ = Ĵ. Since µ is proper, we have Jµ = lim_{k→∞} Tµ^k Ĵ,
so Jµ = Ĵ.
(d) Let J ≤ T J and δ > 0. We have J ≤ T J + δe = Tδ J, and hence
J ≤ Tδk J for all k. Since the strong SSP conditions hold for the δ-perturbed
The first variant of the blackmailer Example 3.4.2 shows that un-
der the weak SSP conditions there may not exist an optimal policy or an
optimal policy within the class of proper policies if the control space is
infinite. This is consistent with Prop. 3.5.4(c). Another interesting fact is
provided by the third variant of this example in the case where c < 0. Then
J * (1) = −∞ (violating Assumption 3.5.4), but Jˆ is real-valued and does
not solve Bellman’s equation, contrary to the conclusion of Prop. 3.5.4(a).
Part (d) of Prop. 3.5.4 shows that Ĵ is the unique solution of the
problem of maximizing Σ_{i=1}^n βi J(i) over all J = (J(1), . . . , J(n)) such
that J ≤ T J, where β1, . . . , βn are any positive scalars (cf. Prop. 3.2.9).
This problem can be written as

maximize   Σ_{i=1}^n J(i)
subject to J(i) ≤ g(i, u) + Σ_{j=1}^n p_{ij}(u) J(j),   i = 1, . . . , n,  u ∈ U(i).
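As a minimal sketch of this maximization for a two-state example with finite control sets (in which case it is a linear program), using scipy's generic LP solver; the transition probabilities and costs below are illustrative assumptions:

# Sketch: maximize sum_i J(i) subject to J(i) <= g(i, u) + sum_j p_ij(u) J(j).

import numpy as np
from scipy.optimize import linprog

n = 2
# For each state i, a list of controls given as (cost g(i, u), transition row p_i.(u)).
# Rows need not sum to 1; the missing mass is the probability of moving to t.
controls = {
    0: [(1.0, np.array([0.0, 0.0])),        # go to t at cost 1
        (0.5, np.array([0.0, 1.0]))],       # go to state 1 at cost 0.5
    1: [(2.0, np.array([0.0, 0.0]))],       # go to t at cost 2
}

A_ub, b_ub = [], []
for i in range(n):
    for g, p in controls[i]:
        row = -p.copy()
        row[i] += 1.0                       # J(i) - sum_j p_ij(u) J(j) <= g(i, u)
        A_ub.append(row)
        b_ub.append(g)

# linprog minimizes, so maximize sum_i J(i) by minimizing -sum_i J(i); J is free in sign.
res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(None, None)] * n)
print(res.x)                                # the maximizer is J_hat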
To deal with the oscillatory behavior of PI, which was illustrated in the de-
terministic shortest path Example 3.2.2, we may use the perturbed version
of the PI algorithm of Section 3.4, with forcing function p(x) ≡ 1. Thus,
we have
! "
(Tµ,δ J)(x) = H x, µ(x), J + δ, x ∈ X, Tδ J = inf Tµ,δ J.
µ∈M
where Jµk ,δk is computed as the unique fixed point of the mapping Tµk ,δk
given by
Tµk ,δk J = Tµk J + δk e.
The policy µk+1 of Eq. (3.31) exists by the compactness Assumption
3.5.2. We claim that µk+1 is proper. To see this, note that

T_{µk+1,δk} J_{µk,δk} = T J_{µk,δk} + δk e ≤ T_{µk} J_{µk,δk} + δk e = J_{µk,δk},

so that, by the monotonicity of T_{µk+1,δk},

T_{µk+1,δk}^m J_{µk,δk} ≤ T_{µk+1,δk} J_{µk,δk} = T J_{µk,δk} + δk e ≤ J_{µk,δk},   ∀ m ≥ 1.

Since J_{µk,δk} forms an upper bound to T_{µk+1,δk}^m J_{µk,δk}, it follows that µk+1
is proper [if it were improper, we would have (T_{µk+1,δk}^m J_{µk,δk})(x) → ∞ for
some x, because of the perturbation δk]. Thus the sequence {µk} generated
by the perturbed PI algorithm (3.31) is well-defined and consists of proper
policies. We have the following proposition.
Proposition 3.5.5: Let the weak SSP conditions hold. Then the se-
quence {Jµk } generated by the perturbed PI algorithm (3.31) satisfies
Jµk → Ĵ.
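As a minimal sketch of the perturbed PI iteration (3.31), with forcing function p(x) ≡ 1 and δk ↓ 0, on an illustrative two-state SSP problem (the substochastic matrices and costs below are assumptions, not the text's):

# Sketch: perturbed PI for a finite SSP problem.

import numpy as np

n = 2
# Two controls; rows may sum to less than 1, the remainder going to the destination t.
P = {0: np.array([[0.0, 0.0], [0.0, 0.0]]),          # control 0: go straight to t
     1: np.array([[0.0, 1.0], [0.0, 0.0]])}          # control 1: state 0 -> state 1, state 1 -> t
g = {0: np.array([3.0, 1.0]), 1: np.array([1.0, 1.0])}

def evaluate(mu, delta):
    # J_{mu,delta} is the unique fixed point of T_{mu,delta} J = g_mu + delta*e + P_mu J.
    P_mu = np.array([P[mu[i]][i] for i in range(n)])
    g_mu = np.array([g[mu[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - P_mu, g_mu + delta)

mu = [0, 0]                                           # initial proper policy
delta = 1.0
for k in range(20):
    J = evaluate(mu, delta)
    # Policy improvement on the unperturbed mapping: mu^{k+1} attains min_u [g_u + P_u J].
    Q = np.array([g[u] + P[u] @ J for u in (0, 1)])
    mu = list(np.argmin(Q, axis=0))
    delta *= 0.5                                      # delta_k -> 0
print(evaluate(mu, 0.0), mu)                          # J_{mu^k} approaches J_hat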
Thus Tµ maps E+(X) into E+(X), where E+(X) denotes the set of non-
negative extended real-valued functions J : X ↦ [0, ∞]. Moreover Tµ
also maps R+(X) to R+(X), where R+(X) denotes the set of nonnegative
real-valued functions J : X ↦ [0, ∞).
The mapping T : E+(X) ↦ E+(X) is given by

(T J)(x) = inf_{u∈U(x)} H(x, u, J),   x ∈ X,

or equivalently,

(T J)(x) = inf_{u∈U(x)} [ b(x, u) + Σ_{y=1}^n A_{xy}(u) J(y) ],   x ∈ X.
and we consider the affine monotonic problem where the scalars Axy(u)
and b(x, u) are defined by

A_{xy}(u) = p_{xy}(u) h(x, u, y),   x, y = 1, . . . , n,  u ∈ U(x),

and

b(x, u) = p_{xt}(u) h(x, u, t),   x = 1, . . . , n,  u ∈ U(x),

and the vector J̄ is the unit vector,

J̄(x) = 1,   x = 1, . . . , n.
Jπ(x0) = lim sup_{N→∞} (Tµ0 · · · T_{µN−1} J̄)(x0),   x0 = 1, . . . , n,
(so that the multiplicative cost accumulation stops once the state
reaches t).
Thus, we claim that Jπ (x0 ) can be viewed as the expected value of cost ac-
cumulated multiplicatively, starting from x0 up to reaching the termination
state t (or indefinitely accumulated multiplicatively, if t is never reached).
To verify the formula (3.32) for Jπ , we use the definition Tµ J =
bµ + Aµ J, to show by induction that for every π = {µ0 , µ1 , . . .}, we have
Tµ0 · · · T_{µN−1} J̄ = Aµ0 · · · A_{µN−1} J̄ + bµ0 + Σ_{k=1}^{N−1} Aµ0 · · · A_{µk−1} b_{µk}.    (3.33)
Based on Eq. (3.32), we have that Jπ(x0) is the limit superior of the ex-
pected value of the exponential of the N-step additive finite horizon cost
up to termination, i.e., Σ_{k=0}^{k̄} g(xk, µk(xk), xk+1), where k̄ is equal to the
first index prior to N − 1 such that x_{k̄+1} = t, or is equal to N − 1 if there
is no such index. The use of the exponential introduces risk aversion, by
assigning a strictly convex increasing penalty for large rather than small
cost of a trajectory up to termination (and hence a preference for small
variance of the additive cost up to termination).
The deterministic version of the exponential cost problem where for
each u ∈ U (x), one of the transition probabilities pxt (u), px1 (u), . . . , pxn (u)
is equal to 1 and all others are equal to 0, is mathematically equivalent
to the classical deterministic shortest path problem (since minimizing the
exponential of a deterministic expression is equivalent to minimizing that
expression). For this problem a standard assumption is that there are
no cycles that have negative total length to ensure that the shortest path
length is finite. However, it is interesting that this assumption is not re-
quired for the analysis of the present section: when there are paths that
travel perpetually around a negative length cycle we simply have J * (x) = 0
for all states x on the cycle, which is permissible within our context.
and hence

Jµ = lim sup_{N→∞} T_µ^N J̄ = lim sup_{N→∞} A_µ^N J̄ + Σ_{k=0}^∞ A_µ^k bµ.    (3.36)

In particular, if µ is contractive, so that A_µ^N → 0 as N → ∞, then

Jµ = Σ_{k=0}^∞ A_µ^k bµ = (I − Aµ)^{-1} bµ,
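As a minimal numerical sketch of this formula for a contractive policy; the matrix Aµ and vector bµ below are illustrative (spectral radius of Aµ less than 1):

# Sketch: J_mu = (I - A_mu)^{-1} b_mu for a contractive policy in the affine monotonic model.

import numpy as np

A_mu = np.array([[0.3, 0.4],
                 [0.2, 0.5]])
b_mu = np.array([1.0, 2.0])

# Direct formula.
J_mu = np.linalg.solve(np.eye(2) - A_mu, b_mu)

# The same limit obtained by VI: J <- b_mu + A_mu J, starting from J_bar = e.
J = np.ones(2)
for _ in range(500):
    J = b_mu + A_mu @ J

print(J_mu, J)      # the two agree, and J_mu = T_mu J_mu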
Let us introduce some assumptions that are similar to the ones of the
preceding section.
Consider the deterministic shortest path example of Section 3.1.1, but with
the exponential cost function of the present subsection; cf. Eq. (3.35). There
are two policies denoted µ and µ" ; see Fig. 3.5.2. The corresponding mappings
and costs are shown in the figure, and Bellman’s equation is given by
J(1) = (T J)(1) = min{ exp(b), exp(a) J(1) }.
Figure 3.5.2. The deterministic shortest path example of Section 3.1.1 with
exponential cost. Policy µ moves from state 1 to the destination: (Tµ J)(1) =
exp(b), Jµ(1) = exp(b). Policy µ′ stays at state 1: (Tµ′ J)(1) = exp(a) J(1),
Jµ′(1) = lim_{N→∞} exp(aN).
Ĵ(x) = inf_{µ: contractive} Jµ(x),   x = 1, . . . , n.    (3.37)
We use the perturbation approach of Section 3.4 and Prop. 3.4.1 to show
that Jˆ is a solution of Bellman’s equation. In particular, we add a constant
δ > 0 to all components of bµ . By using arguments that are entirely
analogous to the ones for the SSP case of Section 3.5.1, we obtain the
following proposition, which is illustrated in Fig. 3.5.3. A detailed analysis
and proof is given in the exercises.
Figure 3.5.3. Schematic illustration of Prop. 3.5.8 for a problem with two states.
The optimal cost function over contractive policies, Ĵ, is the largest solution of
Bellman's equation, while VI converges to Ĵ starting from J ≥ Ĵ.
The other results of Section 3.5.1 for SSP problems also have straight-
forward analogs. Moreover, there is an adaptation of the example of Section
3.1.2, which provides an affine monotonic model for which J * is not a fixed
point of T (see the author’s paper [Ber16a], to which we refer for further
discussion).
Consider the problem of Fig. 3.5.2, for the case a = 0. This is the case where
the noncontractive policy µ" has finite cost, so Assumption 3.5.7 is violated
and Prop. 3.5.7 does not apply. However, it can be seen that the assumptions
of Prop. 3.5.8 hold. Consistent with part (a) of the proposition, the optimal
so that

H(1, u, J) = u + (1 − u) exp(−u) J.

We thus obtain

Jµ(1) = µ(1) / ( 1 − (1 − µ(1)) exp(−µ(1)) ).

By minimizing this expression over µ(1) ∈ (0, 1], it can be seen that Ĵ(1) =
J*(1) = 1/2, but there exists no optimal policy, and no optimal policy within
the class of contractive policies [Jµ(1) decreases monotonically to 1/2 as µ(1) →
0].
We will now discuss how the analysis of Sections 3.3 and 3.4 applies to min-
imax shortest path-type problems, following the author’s paper [Ber19c],
to which we refer for further discussion. To formally describe the problem,
we consider a graph with a finite set of nodes X ∪ {t} and a finite set of
directed arcs A ⊂ {(x, y) | x, y ∈ X ∪ {t}}, where t is a special node called
the destination. At each node x ∈ X we may choose a control u from a
and for each policy µ, the mapping Tµ : E(X) ↦ E(X), defined by

(Tµ J)(x) = H(x, µ(x), J),   x ∈ X.    (3.41)

Using Eqs. (3.38)-(3.41), we see that for any µ ∈ M and x ∈ X, (Tµ^k J̄)(x)
is the result of the k-stage DP algorithm that computes the length of the
longest path under µ that starts at x and consists of k arcs.
For completeness, we also define the length of a portion

{ (xi, xi+1), (xi+1, xi+2), . . . , (xm, xm+1) }

of a path as the sum of the lengths of its arcs.
When confusion cannot arise we will also refer to such a finite-arc por-
tion as a path. Of special interest are cycles, i.e., paths of the form
{ (xi, xi+1), (xi+1, xi+2), . . . , (xi+m, xi) }. Paths that do not contain any
cycle other than the self-cycle (t, t) are called simple.
For a given policy µ ∈ M and x0 ≠ t, a path p ∈ P(x0, µ) is said to
be terminating if it has the form

p = { (x0, x1), (x1, x2), . . . , (xm, t), (t, t), . . . },    (3.42)

in which case its length is

Lµ(p) = g(xm, µ(xm), t) + Σ_{k=0}^{m−1} g(xk, µ(xk), xk+1),

and is equal to the finite length of its initial portion that consists of the
first m + 1 arcs.
An important characterization of a policy µ ∈ M is provided by the
subset of arcs

Aµ = ∪_{x∈X} { (x, y) | y ∈ Y(x, µ(x)) }.

Thus Aµ ∪ {(t, t)} can be viewed as the set of all possible paths under µ,
∪_{x∈X} P(x, µ), in the sense that it contains this set of paths and no other
paths. We refer to Aµ as the characteristic graph of µ. We say that Aµ is
destination-connected if for each x ∈ X there exists a terminating path in
P(x, µ).
We say that µ is proper if the characteristic graph Aµ is acyclic
(i.e., contains no cycles). Thus µ is proper if and only if all the paths
in ∪x∈X P (x, µ) are simple and hence terminating (equivalently µ is proper
if and only if Aµ is destination-connected and has no cycles). The term
“proper” is consistent with the one used in Section 3.5.1 for SSP prob-
lems, where it indicates a policy under which the destination is reached
Figure 3.5.4. A robust shortest path problem with X = {1, 2}, two controls at
node 1, and one control at node 2. The two policies, µ and µ′, correspond to the
two controls at node 1. The figure shows the characteristic graphs Aµ and Aµ′.
Proof: Any path with a finite number of arcs can be decomposed into a
simple path, and a finite number of cycles (see e.g., the path decomposition
theorem of [Ber98], Prop. 1.1, and Exercise 1.4). Since there is only a
finite number of simple paths under µ, their length is bounded above and
below. Thus in part (a) the length of all paths with a finite number of
arcs is bounded above, and in part (b) it is bounded below, implying that
Jµ (x) < ∞ for all x ∈ X or Jµ (x) > −∞ for all x ∈ X, respectively. Part
(c) follows by combining parts (a) and (b).
To show part (d), consider a path p, which consists of an infinite
repetition of the positive length cycle that is assumed to exist. Let Cµk (p)
be the length of the path that consists of the first k cycles in p. Then
Cµk (p) → ∞ and Cµk (p) ≤ Jµ (x) for all k, where x is the first node in the
cycle, thus implying that Jµ (x) = ∞. Moreover for every J ∈ R(X) and
all k, (Tµk J)(x) is the maximum over the lengths of the k-arc paths that
start at x, plus a terminal cost that is equal to either J(y) (if the terminal
node of the k-arc path is y ∈ X), or 0 (if the terminal node of the k-arc
path is the destination). Thus we have

(Tµ^k J̄)(x) + min{ 0, min_{x∈X} J(x) } ≤ (Tµ^k J)(x).
Since lim sup_{k→∞} (Tµ^k J̄)(x) = Jµ(x) = ∞ as shown earlier, it follows that
lim sup_{k→∞} (Tµ^k J)(x) = ∞ for all J ∈ R(X). Q.E.D.
Proof: To show that (i) implies (ii), let µ be R(X)-regular and to arrive
at a contradiction, assume that Aµ contains a nonnegative length cycle.
Let x be a node on the cycle, consider the path p that starts at x and
consists of an infinite repetition of this cycle, and let Lkµ (p) be the length
of the first k arcs of that path. Let also J be a constant function, J(x) ≡ r,
where r is a scalar. Then we have

L_µ^k(p) + r ≤ (Tµ^k J)(x),

since from the definition of Tµ, we have that (Tµ^k J)(x) is the maximum
over the lengths of all k-arc paths under µ starting at x, plus r, if the last
node in the path is not the destination. Since µ is R(X)-regular, we have
Jµ ∈ R(X) and lim sup_{k→∞} (Tµ^k J)(x) = Jµ(x) < ∞, so that for all scalars r,

lim sup_{k→∞} ( L_µ^k(p) + r ) ≤ Jµ(x) < ∞.
Taking the supremum over r ∈ ℜ, it follows that lim sup_{k→∞} L_µ^k(p) = −∞,
which contradicts the nonnegativity of the cycle of p. Thus all cycles of Aµ
have negative length. To show that Aµ is destination-connected, assume
the contrary. Then there exists some node x ∈ X such that all paths in
P (x, µ) contain an infinite number of cycles. Since the length of all cycles
is negative, as just shown, it follows that Jµ (x) = −∞, which contradicts
the R(X)-regularity of µ.
To show that (ii) implies (iii), we assume that µ is improper and show
that Jµ ∈ R(X). By (ii) Aµ is destination-connected, so the set P (x, µ)
contains a simple path for all x ∈ X. Moreover, since by (ii) the cycles
of Aµ have negative length, each path in P (x, µ) that is not simple has
smaller length than some simple path in P (x, µ). This implies that Jµ (x)
is equal to the largest path length among simple paths in P (x, µ), so Jµ (x)
is a real number for all x ∈ X.
To show that (iii) implies (i), we note that if µ is proper, it is R(X)-
regular, so we focus on the case where µ is improper. Then by (iii), Jµ ∈
R(X), so to show R(X)-regularity of µ, we must show that (Tµk J)(x) →
Jµ (x) for all x ∈ X and J ∈ R(X), and that Jµ = Tµ Jµ . Indeed, from the
definition of Tµ, we have

(Tµ^k J)(x) = sup_{p∈P(x,µ)} { L_µ^k(p) + J(x_p^k) },    (3.43)
where Lkµ (p) is the length of the first k arcs of path p, xkp is the node reached
after k arcs along the path p, and J(t) is defined to be equal to 0. Thus as
k → ∞, for every path p that contains an infinite number of cycles (each
necessarily having negative length), the sequence L_µ^k(p) + J(x_p^k) approaches
−∞. It follows that for sufficiently large k, the supremum in Eq. (3.43) is
attained by one of the simple paths in P (x, µ), so xkp = t and J(xkp ) = 0.
Thus the limit of (Tµk J)(x) does not depend on J, and is equal to the limit
of (Tµ^k J̄)(x), i.e., Jµ(x). To show that Jµ = Tµ Jµ, we note that by the
preceding argument, Jµ(x) is the length of the longest path among paths
that start at x and terminate at t. Moreover, we have

(Tµ Jµ)(x) = max_{y∈Y(x,µ(x))} { g(x, µ(x), y) + Jµ(y) },
where we denote Jµ (t) = 0. Thus (Tµ Jµ )(x) is also the length of the longest
path among paths that start at x and terminate at t, and hence it is equal
to Jµ (x). Q.E.D.
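As a minimal sketch of this computation for a proper policy µ, applying VI with the maximization mapping just described; the successor sets Y(x, µ(x)) and arc lengths g below are an illustrative graph, not one from the text:

# Sketch: J_mu as the longest path length to t under a proper policy mu.

Y = {1: [2, 't'], 2: ['t']}                       # successor sets Y(x, mu(x))
g = {(1, 2): 3.0, (1, 't'): 1.0, (2, 't'): 2.0}   # arc lengths under mu

J = {1: 0.0, 2: 0.0, 't': 0.0}                    # start from J_bar = 0, with J(t) = 0
for _ in range(10):                               # A_mu is acyclic, so VI settles in <= |X| steps
    J = {x: max(g[(x, y)] + J[y] for y in Y[x]) if x != 't' else 0.0 for x in J}

print(J)   # J_mu(1) = 5 (the longest path 1 -> 2 -> t), J_mu(2) = 2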
Example 3.5.4:
Let X = {1}, and consider the policy µ where at state 1, the antagonistic
opponent may force either staying at 1 or terminating, i.e., Y(1, µ(1)) =
{1, t}; cf. Fig. 3.5.5. Then µ is improper since its characteristic graph Aµ
contains the self-cycle (1, 1). Let

g(1, µ(1), 1) = a,   g(1, µ(1), t) = 0.

Then,

(Tµ Jµ)(1) = max{ 0, a + Jµ(1) },

and

Jµ(1) = ∞ if a > 0,   Jµ(1) = 0 if a ≤ 0.
Consistently with Prop. 3.5.10, the following hold:
(a) For a > 0, the cycle (1, 1) has positive length, and µ is R(X)-irregular.
Here we have Jµ (1) = ∞, and the infinite cost condition of Assumption
3.3.1 is satisfied.
(b) For a = 0, the cycle (1, 1) has zero length, and µ is R(X)-irregular.
Here we have Jµ (1) = 0, and the infinite cost condition of Assumption
3.3.1 is violated because for a function J ∈ R(X) with J(1) > 0, we have
lim sup_{k→∞} (Tµ^k J)(1) = J(1) < ∞.
Assumption 3.5.8:
(a) There exists at least one R(X)-regular policy.
(b) For every R(X)-irregular policy µ, some cycle in the character-
istic graph Aµ has positive length.
for some x ∈ X. The proof under condition (2) is similar, using Prop.
3.5.10. Q.E.D.
We now show our main result for the problem of this section.
Example 3.5.5:
! "
Let X = {1}, and consider the proper policy µ with Y 1, µ(1) = {t} and
! "
the improper policy µ" with Y 1, µ" (1) = {1, t} (cf. Fig. 3.5.6). Let
! " ! " ! "
g 1, µ(1), t = 1, g 1, µ" (1), 1 = a ≤ 0, g 1, µ" (1), t = 0.
The improper policy is the same as the one of Example 3.5.4. It can be seen
that under both policies, the longest path from 1 to t consists of the arc (1, t).
Thus,
Jµ (1) = 1, Jµ" (1) = 0,
so the improper policy µ" is optimal, and strictly dominates the proper policy
µ. To explain what is happening here, we consider two different cases:
(1) a = 0: In this case, the optimal policy µ′ is both improper and R(X)-
irregular, but with finite cost Jµ′(1) < ∞. Thus the conditions of Props.
3.3.1 and 3.5.12 do not hold because Assumptions 3.3.1(c) and 3.5.9(b)
are violated.
(2) a < 0: In this case, µ′ is improper but R(X)-regular, so there are no
R(X)-irregular policies. Then all the conditions of Assumption 3.5.8
are satisfied, and Prop. 3.5.12 applies. Consistent with this proposition,
there exists an optimal R(X)-regular policy (i.e., optimal over both
proper and improper policies), which however is improper.
For further analysis and algorithms for the robust shortest path plan-
ning problem, we refer to the paper [Ber19c]. In particular, this paper
applies the perturbation approach of Section 3.4 to the case where it may
be easier to guarantee nonnegativity rather than positivity of the lengths
xk+1 = (A + BL)xk
is stable. We assume that there exists at least one linear stable policy.
Among others, this guarantees that the optimal cost function J ∗ is real-
valued (it is bounded above by the real-valued cost function of every linear
stable policy).
The solution also revolves around the algebraic matrix Riccati equa-
tion, which is given by

P = A′( P − P B (B′ P B + R)^{-1} B′ P ) A + Q.

Assuming that Q is positive definite, it can be shown that this equation
has a unique solution P* within the class of positive semidefinite symmetric
matrices, and that the optimal cost function has the form

J*(x) = x′ P* x.

Moreover, there is a unique optimal policy, and this policy is linear stable
of the form

µ*(x) = Lx,   where L = −(B′ P* B + R)^{-1} B′ P* A.
The existence of an optimal linear stable policy can be extended to the case
where Q is instead positive semidefinite, but satisfies a certain “detectabil-
ity” condition; see the textbooks cited earlier.
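As a minimal sketch of solving the Riccati equation by value iteration on the quadratic cost coefficient; the system matrices below are illustrative (Q and R positive definite), and the gain expression in the last line is the standard formula noted above:

# Sketch: Riccati iteration P <- A'(P - P B (B'P B + R)^{-1} B'P) A + Q.

import numpy as np

A = np.array([[1.2, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = np.zeros((2, 2))                                   # start from J = 0
for _ in range(500):
    M = np.linalg.solve(B.T @ P @ B + R, B.T @ P)      # (B'PB + R)^{-1} B'P
    P = A.T @ (P - P @ B @ M) @ A + Q

L = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)     # linear policy mu*(x) = L x
print(P)        # approximates P*, so J*(x) = x' P* x
print(L)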
However, in the general case where Q is positive semidefinite without
further assumptions (e.g., Q = 0), the example of Section 3.1.4 shows that
the optimal policy need not be stable, and in fact the optimal cost function
over just the linear stable policies may be different than J * .† We will
discuss this case by using the perturbation-based approach of Section 3.4,
and provide results that are consistent with the behavior observed in the
example of Section 3.1.4.
To convert the problem to our abstract format, we let
X = ℜ^n,   U(x) = ℜ^m,   J̄(x) = 0,   ∀ x ∈ X,
Let M̂ be the set of linear stable policies, and note that every linear stable
policy is S-regular. This is due to the fact that for every quadratic function
J(x) = x′P x and linear stable policy µ(x) = Lx, the k-stage costs (Tµ^k J)(x)
and (Tµ^k J̄)(x) differ by the term

x′ ((A + BL)^k)′ P (A + BL)^k x,
† This is also true in the discounted version of the example of Section 3.1.4,
where there is a discount factor α ∈ (0, 1). The Riccati equation then takes
the form P = A′( αP − α²P B(αB′P B + R)^{-1}B′P ) A + Q, and for the given
system and cost per stage, it has two solutions, P* = 0 and P̂ = (αγ² − 1)/α. The VI
algorithm converges to P̂ starting from any P > 0.
where xk and uk are the state and control at stage k, belonging to sets
X and U , respectively, and f is a function mapping X × U to X. The
control uk must be chosen from a constraint set U (xk ). No restrictions are
placed on the nature of X and U : for example, they may be finite sets as
in deterministic shortest path problems, or they may be continuous spaces
as in classical problems of control to the origin or some other terminal set,
including the linear-quadratic problem of Section 3.5.4. The cost per stage
is denoted by g(x, u), and is assumed to be a real number. †
Because the system is deterministic, given an initial state x0, a policy
π = {µ0, µ1, . . .}, when applied to the system (3.44), generates a unique se-
quence of state-control pairs (xk, µk(xk)), k = 0, 1, . . . . The corresponding
† In Section 4.5, we will consider a similar problem where the cost per stage
will be assumed to be nonnegative, but some other assumptions from the present
section (e.g., the subsequent Assumption 3.5.9) will be relaxed.
cost function is

Jπ(x0) = lim sup_{N→∞} Σ_{k=0}^{N−1} g(xk, µk(xk)),   x0 ∈ X.
Ĵ(x) = inf_{µ∈M̂} Jµ(x),   x ∈ X,
Since X0 consists of cost-free and absorbing states [cf. Eq. (3.45)], and
J*(x) > −∞ for all x ∈ X (by Assumption 3.5.9), the set S contains the
cost functions Jµ of all terminating policies µ, as well as J*. Moreover
it can be seen that every terminating policy is S-regular, i.e., M̂ ⊂ MS,
implying that JS* = J*. The reason is that the terminal cost is zero after
termination for any terminal cost function J ∈ S, i.e.,

(Tµ^k J)(x) = (Tµ^k J̄)(x) = Jµ(x),
3.6 ALGORITHMS
V ≤ T V ≤ T V̄ ≤ V̄,    (3.47)

T^k V ↑ J*,   T^k V̄ ↓ J*.
The sets S(k) satisfy S(k + 1) ⊂ S(k) in view of Eq. (3.47) and the mono-
tonicity of T . Using Prop. 3.3.1, we also see that S(k) satisfy the syn-
chronous convergence and box conditions of Prop. 2.6.1. Thus, together
with Assumption 2.6.1, all the conditions of Prop. 2.6.1 are satisfied, and
the convergence of the algorithm follows starting from any J 0 → S(0).
Jµk = Tµk Jµk ≥ T Jµk = Tµk+1 Jµk ≥ lim_{m→∞} T_{µk+1}^m Jµk = Jµk+1.
where
• Fµ(V, Q) is a function with a component Fµ(V, Q)(x, u) for each (x, u),
defined by

Fµ(V, Q)(x, u) = H(x, u, min{V, Qµ}),    (3.49)

• M Fµ(V, Q) is a function with a component (M Fµ(V, Q))(x) for each
x, where M is the operator of pointwise minimization over u, so that

(M Fµ(V, Q))(x) = min_{u∈U(x)} Fµ(V, Q)(x, u).
R" , R" ⊂ {0, 1, . . .}, corresponding to policy improvement and policy eval-
uation iterations, respectively. At time t, each processor $ operates on
V t (x), Qt (x, u), and µt (x), only for x in its “local” state space X" . In
particular, at each time t, each processor $ does one of the following:
(a) Local policy improvement: If t → R" , processor $ sets for all x → X" ,
( ) ( )
V t+1 (x) = min H x, u, min{V t , Qtµt } = M Fµt (V t , Qt ) (x),
u→U(x)
(3.50)
sets µt+1 (x) to a u that attains the minimum, and leaves Q un-
changed, i.e., Qt+1 (x, u) = Qt (x, u) for all x → X" and u → U (x).
(b) Local policy evaluation: If t → R" , processor $ sets for all x → X" and
u → U (x),
( )
Qt+1 (x, u) = H x, u, min{V t , Qtµt } = Fµt (V t , Qt )(x, u), (3.51)
and leaves V and µ unchanged, i.e., V t+1 (x) = V t (x) and µt+1 (x) =
µt (x) for all x → X" .
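As a minimal single-process sketch of the updates (3.50)-(3.51), with a randomized schedule standing in for the processors and update sets; the mapping H below encodes an illustrative two-state shortest-path-like model, not one from the text:

# Sketch: simulated asynchronous PI on (V, Q) using the function min{V, Q_mu}.

import random

states = [0, 1]
controls = {0: [0, 1], 1: [0]}
def H(x, u, J):
    if x == 0:
        return 3.0 if u == 0 else 1.0 + J[1]   # u = 0: to t at cost 3; u = 1: to state 1 at cost 1
    return 2.0                                 # state 1: to t at cost 2

V = {x: 10.0 for x in states}
Q = {(x, u): 10.0 for x in states for u in controls[x]}
mu = {x: controls[x][0] for x in states}

random.seed(0)
for t in range(500):
    J = {x: min(V[x], Q[(x, mu[x])]) for x in states}     # the function min{V, Q_mu}
    x = random.choice(states)
    if random.random() < 0.5:
        # Local policy improvement at x, cf. Eq. (3.50): update V and mu, leave Q unchanged.
        u_best = min(controls[x], key=lambda u: H(x, u, J))
        V[x] = H(x, u_best, J)
        mu[x] = u_best
    else:
        # Local policy evaluation at x, cf. Eq. (3.51): update Q(x, .), leave V and mu unchanged.
        for u in controls[x]:
            Q[(x, u)] = H(x, u, J)

print(V)   # approaches J*: here J*(state 1) = 2 and J*(state 0) = min{3, 1 + 2} = 3
print(Q)   # approaches Q*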
where Fµ (V, Q) is given by Eq. (3.49). For this mapping and other re-
lated mappings to be defined shortly, we implicitly assume that it operates
on real-valued functions, so by Assumption 3.6.1(a),(b), it produces real-
valued functions. Note that the policy evaluation part of the algorithm [cf.
Eq. (3.51)] amounts to applying the second component of Lµ , while the
policy improvement part of the algorithm [cf. Eq. (3.50)] amounts to ap-
plying the second component of Lµ , and then applying the first component
of Lµ . The following proposition shows that (J * , Q* ) is the common fixed
point of the mappings Lµ , for all µ.
J* = M Q*,    Q* = F Q* = Fµ(J*, Q*),

Q′ = Fµ(V′, Q′) = F Q′,
The uniform fixed point property of Lµ just shown is, however, insufficient for the convergence proof of the asynchronous algorithm, in the absence of a contraction property. For this reason, we introduce two mappings L and L̄ that are associated with the mappings Lµ and satisfy
[cf. Eq. (3.49)]. Similarly, there exists µ that attains the minimum in Eq. (3.56), uniformly for all V and (x, u). Thus for any given (V, Q), we have
where µ and µ̄ are some policies. The following proposition shows that (J*, Q*), the common fixed point of the mappings Lµ, for all µ, is also the unique fixed point of L and L̄.
J_c^− = J* − c e,    Q_c^− = Q* − c e_Q,

J_c^+ = J* + c e,    Q_c^+ = Q* + c e_Q,

where e and e_Q are the unit functions in the spaces of J and Q, respectively.
Proposition 3.6.4: Let Assumption 3.6.1 hold. Then for all c > 0,

L^k(J_c^−, Q_c^−) ↑ (J*, Q*),    L̄^k(J_c^+, Q_c^+) ↓ (J*, Q*),        (3.58)

where L^k (or L̄^k) denotes the k-fold composition of L (or L̄, respectively).
Proof: For any µ ∈ M, using the assumption (3.48), we have for all (x, u),

Fµ(J_c^+, Q_c^+)(x, u) = H(x, u, min{J_c^+, Q_c^+})
                       = H(x, u, min{J*, Q*} + c e)
                       ≤ H(x, u, min{J*, Q*}) + c
                       = Q*(x, u) + c
                       = Q_c^+(x, u),

and similarly

Q_c^−(x, u) ≤ Fµ(J_c^−, Q_c^−)(x, u).

We also have M Q_c^+ = J_c^+ and M Q_c^− = J_c^−. From these relations, the definition of Lµ, and the fact Lµ(J*, Q*) = (J*, Q*) (cf. Prop. 3.6.2), we have

(J_c^−, Q_c^−) ≤ Lµ(J_c^−, Q_c^−) ≤ (J*, Q*) ≤ Lµ(J_c^+, Q_c^+) ≤ (J_c^+, Q_c^+).

Denote for k = 0, 1, . . . ,

(V̄^k, Q̄^k) = L̄^k(J_c^+, Q_c^+),    (V^k, Q^k) = L^k(J_c^−, Q_c^−).
(V̄, Q̄) ≥ (J*, Q*),    (V, Q) ≤ (J*, Q*),

and using the continuity from above and below property of L̄, implied by Assumption 3.6.1(c), it follows that (V̄, Q̄) = L̄(V̄, Q̄), so (V̄, Q̄) must be equal to (J*, Q*), the unique fixed point of L̄. Thus, L̄^k(J_c^+, Q_c^+) ↓ (J*, Q*). Similarly, L^k(J_c^−, Q_c^−) ↑ (J*, Q*). Q.E.D.
whose intersection is (J * , Q* ) [cf. Eq. (3.58)]. By Prop. 3.6.4 and Eq. (3.55),
this set sequence together with the mappings Lµ satisfy the synchronous
convergence and box conditions of the asynchronous convergence theorem
of Prop. 2.6.1 (more precisely, its time-varying version of Exercise 2.2). This
proves the convergence of the algorithm (3.50)-(3.51) for starting points
(V, Q) → S(0). Since c can be chosen arbitrarily large, it follows that the
algorithm is convergent from an arbitrary starting point.
Finally, let us note some variations of the asynchronous PI algorithm.
One such variation is to allow “communication delays” t − τ"j (t). Another
variation, for the case where we want to calculate just J * , is to use a
reduced space implementation similar to the one discussed in Section 2.6.3.
There is also a variant with interpolation, cf. Section 2.6.3.
tains more details. The exponentiated cost version of the SSP problem was
analyzed in the papers by Denardo and Rothblum [DeR79], and by Patek
[Pat01]. The paper [DeR79] assumes that the state and control spaces are
finite, that there exists at least one contractive policy (a transient policy
in the terminology of [DeR79]), and that every improper policy is noncon-
tractive and has infinite cost from some initial state. These assumptions
bypass the pathologies around infinite control spaces and multiple solu-
tions or no solution of Bellman’s equation. Also the approach of [DeR79]
is based on linear programming (relying on the finite control space), and
is thus quite different from ours. The paper [Pat01] assumes that the state
space is finite, that the control constraint set is compact, and that the ex-
pected one-stage cost is strictly positive for all state-control pairs, which is
much stronger than what we have assumed. Our results of Section 3.5.2,
when specialized to the exponential cost problem, are consistent with and
subsume the results of Denardo and Rothblum [DeR79], and Patek [Pat01].
The discussion on robust shortest path planning in Section 3.5.3 fol-
lows the author’s paper [Ber19c]. This paper contains further analysis
and computational methods, including a finitely terminating Dijkstra-like
algorithm for problems with nonnegative arc lengths.
The deterministic optimal control model of Section 3.5.5 is discussed
in more detail in the author’s paper [Ber17b] under Assumption 3.5.9 for the
case where g ≥ 0; see also Section 4.5 and the paper [Ber17c]. The analysis
under the more general assumptions given here is new. Deterministic and
minimax infinite-spaces optimal control problems have also been discussed
by Reissig [Rei16] under assumptions different than ours.
Section 3.6: The asynchronous VI algorithm of Section 3.6.1 was first
given in the author’s paper on distributed DP [Ber82]. It was further
formalized in the paper [Ber83], where a DP problem was viewed as a
special case of a fixed point problem, involving monotonicity and possibly
contraction assumptions.
The analysis of Section 3.6.2 parallels the one of Section 2.6.3, and is
due to joint work of the author with H. Yu, presented in the papers [BeY12]
and [YuB13a]. In particular, the algorithm of Section 3.6.2 is one of the op-
timistic PI algorithms in [YuB13a], which was applied to the SSP problem
of Section 3.5.1 under the strong SSP conditions. We have followed the line
of analysis of that paper and the related paper [BeY12], which focuses on
discounted problems. These papers also analyzed asynchronous stochastic
iterative versions of PI, and proved convergence results that parallel those
for classical Q-learning for SSP, given in Tsitsiklis [Tsi94], and Yu and
Bertsekas [YuB13b]. An earlier paper, which deals with a slightly differ-
ent asynchronous abstract PI algorithm without a contraction structure, is
Bertsekas and Yu [BeY10].
By allowing an infinite state space, the analysis of the present chapter
applies among others to SSP problems with a countable state space. Such
EXERCISES
The purpose of this exercise is to show that the optimal cost function J* is a fixed point of T under some assumptions, which, among others, are satisfied generically in deterministic optimal control problems. Let Π̂ be a subset of policies such that:
(1) We have

    (µ, π) ∈ Π̂ if and only if µ ∈ M, π ∈ Π̂,

    where for µ ∈ M and π = {µ0, µ1, . . .}, we denote by (µ, π) the policy {µ, µ0, µ1, . . .}. Note: This condition precludes the possibility that Π̂ is the set of all stationary policies (unless there is only one stationary policy).
(2) For every π = {µ0, µ1, . . .} ∈ Π̂, we have
Jπ = Tµ0 Jπ1 ,
(3) We have

    inf_{µ∈M, π∈Π̂} Tµ Jπ = inf_{µ∈M} Tµ Ĵ,

    where Ĵ(x) = inf_{π∈Π̂} Jπ(x), x ∈ X.
Show that:
(a) Ĵ is a fixed point of T. In particular, if Π̂ = Π, then J* is a fixed point of T.
(b) The assumptions (1)-(3) hold with Π̂ = Π in the case of the deterministic mapping

    H(x, u, J) = g(x, u) + J(f(x, u)),    x ∈ X, u ∈ U(x), J ∈ E(X).        (3.60)
(c) Consider the SSP example of Section 3.1.2, where J* is not a fixed point
of T . Which of the conditions (1)-(3) is violated?
Solution: (a) We have

Ĵ(x) = inf_{π∈Π̂} Jπ(x) = inf_{µ∈M, π∈Π̂} (Tµ Jπ)(x) = inf_{µ∈M} (Tµ Ĵ)(x) = (T Ĵ)(x),

where the second equality holds by conditions (1) and (2), and the third equality holds by condition (3).
(b) This is evident in the case of the deterministic mapping (3.60). Notes: (i) If Π̂ = Π, parts (a) and (b) show that J*, which is equal to Ĵ, is a fixed point of T. Moreover, if we choose a set S such that J*_S can be shown to be equal to J*, then Prop. 3.2.1 applies and shows that J* is the unique fixed point of T within the set {J ∈ E(X) | J*_S ≤ J ≤ J̃ for some J̃ ∈ S}. In addition the VI sequence {T^k J} converges to J* starting from every J within that set. (ii) The assumptions (1)-(3) of this exercise also hold for other choices of Π̂. For example, when Π̂ is the set of all eventually stationary policies, i.e., policies of the form {µ0, . . . , µk, µ, µ, . . .}, where µ0, . . . , µk, µ ∈ M and k is some positive integer.
(c) For the SSP problem of Section 3.1.1, condition (2) of the preceding proposi-
tion need not be satisfied (because the expected value operation need not com-
mute with lim sup).
This exercise provides a different starting point for the semicontractive analysis of Section 3.2. In particular, the results of Prop. 3.2.1 are shown without assuming that J*_S is a fixed point of T, but by making different assumptions, which include the existence of an S-regular policy that is optimal. Let S be a given subset of E(X). Assume that:
Solution: (a) We first show that any fixed point J of T that lies in S satisfies J ≤ J*. Indeed, if J = T J, then for the optimal S-regular policy µ*, we have J ≤ Tµ* J, so in view of the monotonicity of Tµ* and the S-regularity of µ*,

T^k_{µ*} J ≥ T^k J ≥ T^k J* = J*,    k = 0, 1, . . . .

Taking the limit as k → ∞, and using the fact lim_{k→∞} T^k_{µ*} J = Jµ* = J*, which holds since µ* is S-regular and optimal, we see that T^k J → J*.

(c) If µ satisfies Tµ J* = T J*, then using part (a), we have Tµ J* = J* and hence lim_{k→∞} T^k_µ J* = J*. If µ is in addition S-regular, then Jµ = lim_{k→∞} T^k_µ J* = J* and µ is optimal. Conversely, if µ is optimal and S-regular, then Jµ = J* and Jµ = Tµ Jµ, which combined with J* = T J* [cf. part (a)], yields Tµ J* = T J*.
Let S be a given subset of E(X). Show that the assumptions of Exercise 3.2 hold if and only if J* ∈ S, T J* ≤ J*, and there exists an S-regular policy µ such that Tµ J* = T J*.

Solution: Let the conditions (1) and (2) of Exercise 3.2 hold, and let µ* be the S-regular policy that is optimal. Then condition (1) implies that J* = Jµ* ∈ S and J* = Tµ* J* ≥ T J*, while condition (2) implies that there exists an S-regular policy µ such that Tµ J* = T J*.
Solution: It will be sufficient to show that conditions (1) and (2) imply that J* = T J*. Assume, to obtain a contradiction, that J* ≠ T J*. Then J* ≥ T J*, as can be seen from the relations
with strict inequality for some x [note here that we can choose µ(x) = µ*(x) for all x such that J*(x) = (T J*)(x), and we can choose µ(x) to satisfy J*(x) > (Tµ J*)(x) for all other x]. If µ were S-regular, we would have
This exercise provides a useful extension of Prop. 3.2.1. Given a set S, it may be more convenient to work with a subset M̂ ⊂ MS. Let Ĵ denote the corresponding restricted optimal value:

Ĵ(x) = inf_{µ∈M̂} Jµ(x),

and assume that Ĵ is a fixed point of T. Show that the following analogs of the conclusions of Prop. 3.2.1 hold:

(a) (Uniqueness of Fixed Point) If J′ is a fixed point of T and there exists J̃ ∈ S such that J′ ≤ J̃, then J′ ≤ Ĵ. In particular, if the set Ŵ given by

    Ŵ = { J ∈ E(X) | Ĵ ≤ J ≤ J̃ for some J̃ ∈ S },
"
is nonempty, then Jˆ is the unique fixed point of T within W.
"
(b) (VI Convergence) We have T k J → Jˆ for every J → W.
" so
Solution: The proof is nearly identical to the one of Prop. 3.2.1. Let J → W,
that
Jˆ ≤ J ≤ J˜
!
for some J˜ → S. We have for all k ↓ 1 and µ → M,
Jˆ = T k Jˆ ≤ T k J ≤ T k J˜ ≤ Tµk J˜,
where the equality follows from the fixed point property of Jˆ, while the inequal-
ities follow by using the monotonicity and the definition of T . The right-hand
side tends to Jµ as k → ∞, since µ is S-regular and J˜ → S. Hence the infimum
over µ → M ! of the limit of the right-hand side tends to the left-hand side J.ˆ It
k ˆ %
follows that T J → J , proving part (b). To prove part (a), let J be a fixed
point of T that belongs to W." Then J % is equal to limk↓∞ T k J % , which has been
proved to be equal to J.ˆ
3.6 (The Case J*_S ≤ J̄)

Within the framework of Section 3.2, assume that J*_S ≤ J̄. (This occurs in particular in the monotone decreasing model where J̄ ≥ Tµ J̄ for all µ ∈ M; see Section 4.3.) Show that if J*_S is a fixed point of T, then we have J*_S = J*. Note:
This result manifests itself in the shortest path Example 3.2.1 for the case where
b < 0.
Consider the deterministic optimal control problem of Section 3.5.5. The purpose of this exercise is to show that Assumption 3.5.9 is equivalent to a seemingly weaker assumption where nonstationary policies can be used for termination. Given a state x ∈ X*, we say that a (possibly nonstationary) policy π ∈ Π terminates from x if the sequence {xk}, which is generated starting from x and using π, reaches X0 in the sense that x_k̄ ∈ X0 for some index k̄. Assume that for every x ∈ X*, there exists a policy π ∈ Π that terminates from x. Show that:

(a) The set M̂ of terminating stationary policies is nonempty, i.e., there exists a stationary policy that terminates from every x ∈ X*.
(b) Assumption 3.5.9 is satisfied if for every pair (x, ε) with x ∈ X* and ε > 0, there exists a policy π ∈ Π that terminates from x and satisfies Jπ(x) ≤ J*(x) + ε.
where X0 is the stopping set. Let k̄ be the first k ≥ 1 such that x̄ ∈ Xk. Construct the stationary policy µ as follows: for x ∈ ∪_{k=1}^{k̄} Xk, let
and for x ∉ ∪_{k=1}^{k̄} Xk, let µ(x) = µ̄(x), where µ̄ is a stationary policy that terminates from every x ∈ X* [and was shown to exist in part (a)]. Then it is seen that µ terminates from every x ∈ X*, and generates the same sequence as π′ starting from the state x̄, so it satisfies Jµ(x̄) = Jπ′(x̄) ≤ Jπ(x̄).
lim_{k→∞} dist(xk, X0) = 0,

and that

J*(x) > 0,    ∀ x ∉ X0.

Assume further the following:

(1) For every x ∈ X* = {x ∈ X | J*(x) < ∞} and ε > 0, there exists a policy π that asymptotically terminates from x and satisfies Jπ(x) ≤ J*(x) + ε.

(2) For every ε > 0, there exists a δ_ε > 0 such that for each x ∈ X* with dist(x, X0) ≤ δ_ε,
Note: For further discussion, analysis, and application to the case of a linear
system, see the author’s paper [Ber17b].
Solution: (a) Fix x ∈ X* and ε > 0. Let π be a policy that asymptotically terminates from x, and satisfies Jπ(x) ≤ J*(x) + ε, as per condition (1). Starting from x, this policy will generate a sequence {xk} such that for some index k̄ we have dist(x_k̄, X0) ≤ δ_ε, so by condition (2), there exists a policy π̄ that terminates from x_k̄ and is such that Jπ̄(x_k̄) ≤ ε. Consider the policy π′ that follows π up to index k̄ and follows π̄ afterwards. This policy terminates from x and satisfies

Jπ′(x) = Jπ,k̄(x) + Jπ̄(x_k̄) ≤ Jπ(x) + Jπ̄(x_k̄) ≤ J*(x) + 2ε,

where Jπ,k̄(x) is the cost incurred by π starting from x up to reaching x_k̄. From Exercise 3.7 it follows that Assumption 3.5.9 holds.

(b) For any x and policy π that does not asymptotically terminate from x, we will have Jπ(x) = ∞, so that if x ∈ X*, all policies π with Jπ(x) < ∞ must be asymptotically terminating from x.
The purpose of this exercise is to illustrate that the set of S-regular policies may be different in the perturbed and unperturbed problems of Section 3.4. Consider a single-state problem with J̄ = 0 and two policies µ and µ′, where
Let S = ℜ.

(a) Verify that µ is S-irregular and Jµ = J* = 0.

(b) Verify that µ′ is S-regular and Jµ′ = J*_S = β.

(c) For δ > 0 consider the δ-perturbed problem with p(x) = 1, where x is the only state. Show that both µ and µ′ are S-regular for this problem. Moreover, we have Ĵδ = min{1, β} + δ.

(d) Verify that Prop. 3.4.1 applies for M̂ = {µ′} and β ≤ 1, but does not apply if M̂ = {µ, µ′} or β > 1. Which assumptions of the proposition are violated in the latter case?
Consider the affine monotonic model of Section 3.5.2, and let Assumptions 3.5.5
and 3.5.6 hold. In a perturbed version of this model we add a constant δ > 0 to all
components of bµ , thus obtaining what we call the δ-perturbed affine monotonic
problem. We denote by Jˆδ and Jµ,δ the corresponding optimal cost function and
policy cost functions, respectively.
(a) Show that for all δ > 0, Ĵδ is the unique solution within ℜ^n_+ of the equation

    J(i) = (T J)(i) + δ,    i = 1, . . . , n.
(b) Show that for all δ > 0, a policy µ is optimal for the δ-perturbed problem
(i.e., Jµ,δ = Jˆδ ) if and only if Tµ Jˆδ = T Jˆδ . Moreover, for the δ-perturbed
problem, all optimal policies are contractive and there exists at least one
contractive policy that is optimal.
(c) The optimal cost function over contractive policies Jˆ [cf. Eq. (3.37)] satisfies
(d) If the control constraint set U (i) is finite for all states i = 1, . . . , n, there
exists a contractive policy µ̂ that attains the minimum over all contractive
policies, i.e., Jµ̂ = Jˆ.
(e) Show Prop. 3.5.8.
Solution: (a), (b) By Prop. 3.5.6, we have that Assumption 3.3.1 holds for the
δ-perturbed problem. The results follow by applying Prop. 3.5.7 [the equation of
part (a) is Bellman’s equation for the δ-perturbed problem].
(c) For an optimal contractive policy µ*_δ of the δ-perturbed problem [cf. part (b)], we have

Ĵ = inf_{µ: contractive} Jµ ≤ J_{µ*_δ} ≤ J_{µ*_δ,δ} = Ĵδ ≤ J_{µ′,δ},    ∀ µ′: contractive.

Since for every contractive policy µ′, we have lim_{δ↓0} J_{µ′,δ} = J_{µ′}, it follows that
By taking the infimum over all µ′ that are contractive, the result follows.
(d) Let {δk} be a positive sequence with δk ↓ 0, and consider a corresponding sequence {µk} of optimal contractive policies for the δk-perturbed problems. Since the set of contractive policies is finite [in view of the finiteness of U(i)], some policy µ̂ will be repeated infinitely often within the sequence {µk}, and since {Ĵδk} is monotonically nonincreasing, we will have

Ĵ ≤ Jµ̂ ≤ Ĵδk,

for all k sufficiently large. Since, by part (c), Ĵδk ↓ Ĵ, it follows that Jµ̂ = Ĵ.
Noncontractive Models
Throughout this chapter we will continue to use the model of Section 3.2, which involves the set of extended real numbers ℜ* = ℜ ∪ {∞, −∞}. To repeat some of the basic definitions, we denote by E(X) the set of all extended real-valued functions J : X ↦ ℜ*, by R(X) the set of real-valued functions J : X ↦ ℜ, and by B(X) the set of real-valued functions J : X ↦ ℜ that are bounded with respect to a given weighted sup-norm. We have a set X of states and a set U of controls, and for each x ∈ X, the nonempty control constraint set U(x) ⊂ U. We denote by M the set of all functions µ : X ↦ U with µ(x) ∈ U(x), for all x ∈ X, and by Π the set of “nonstationary policies” π = {µ0, µ1, . . .}, with µk ∈ M for all k. We refer to a stationary policy {µ, µ, . . .} simply as µ.

We introduce a mapping H : X × U × E(X) ↦ ℜ*, and we define the mapping T : E(X) ↦ E(X) by

(T J)(x) = inf_{u∈U(x)} H(x, u, J),    ∀ x ∈ X,

and for each µ ∈ M the mapping Tµ : E(X) ↦ E(X) by

(Tµ J)(x) = H(x, µ(x), J),    ∀ x ∈ X.
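As a purely illustrative aside, the following minimal Python sketch renders the abstract mappings T and Tµ for a hypothetical model with finitely many states and controls; the names X, U, g, f, H and the two-state example are placeholders, not part of the formal development.

```python
# Minimal sketch of the abstract DP mappings T and T_mu for a finite model.

def T(J, X, U, H):
    """(T J)(x) = inf over u in U(x) of H(x, u, J)."""
    return {x: min(H(x, u, J) for u in U(x)) for x in X}

def T_mu(J, mu, X, H):
    """(T_mu J)(x) = H(x, mu(x), J)."""
    return {x: H(x, mu(x), J) for x in X}

# Hypothetical two-state deterministic model with H(x, u, J) = g(x, u) + J(f(x, u)).
X = [0, 1]
U = lambda x: [0, 1]
g = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}
f = {(0, 0): 1, (0, 1): 0, (1, 0): 1, (1, 1): 0}
H = lambda x, u, J: g[(x, u)] + J[f[(x, u)]]

J_bar = {x: 0.0 for x in X}              # initial function J_bar = 0
print(T(J_bar, X, U, H))                 # one value iteration step
mu = {0: 0, 1: 0}                        # a stationary policy
print(T_mu(J_bar, mu.__getitem__, X, H))
```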
A fact that we will be using frequently is that for each J ∈ E(X) and scalar ε > 0, there exists a µ_ε ∈ M such that for all x ∈ X,

(Tµ_ε J)(x) ≤ (T J)(x) + ε    if (T J)(x) > −∞,
(Tµ_ε J)(x) ≤ −(1/ε)          if (T J)(x) = −∞.

We will often use in our analysis the unit function e, defined by e(x) ≡ 1, so for example, we write the preceding relation in shorthand as

Tµ_ε J ≤ T J + ε e.
We define cost functions for policies consistently with Chapters 2 and
3. In particular, we are given a function J¯ ∈ E(X), and we consider for
every policy π = {µ0 , µ1 , . . .} ∈ Π and positive integer N the function
JN,π ∈ E(X) defined by

JN,π(x) = (Tµ0 · · · TµN−1 J̄)(x),    ∀ x ∈ X,

and the function Jπ ∈ E(X) defined by

Jπ(x) = lim sup_{k→∞} (Tµ0 · · · Tµk J̄)(x),    ∀ x ∈ X.
Consider the N -stage problem (4.1), where the cost function JN,π is defined
by

JN,π(x) = (Tµ0 · · · TµN−1 J̄)(x),    ∀ x ∈ X.

Based on the theory of finite horizon DP, we expect that (at least under some conditions) the optimal cost function J*_N is obtained by N successive applications of the DP mapping T on the initial function J̄, i.e.,

J*_N = inf_{π∈Π} JN,π = T^N J̄.
This is the analog of Bellman’s equation for the finite horizon problem in
a DP context.
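As a concrete rendering of the relation J*_N = T^N J̄, the following Python sketch carries out the backward DP recursion for a hypothetical finite deterministic model with H(x, u, J) = g(x, u) + J(f(x, u)), and records minimizing controls, which (when the infima are attained) form an optimal policy as discussed below. All problem data are illustrative placeholders.

```python
# Illustrative backward DP recursion computing T^N J_bar for a finite deterministic model.
X = [0, 1, 2]
U = {0: [0, 1], 1: [0, 1], 2: [0]}
g = {(0, 0): 1, (0, 1): 4, (1, 0): 2, (1, 1): 0, (2, 0): 0}
f = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0, (2, 0): 2}
J_bar = {x: 0.0 for x in X}
N = 4

J = dict(J_bar)                    # J = T^0 J_bar
policy = []                        # collected in backward order: mu*_{N-1}, ..., mu*_0
for _ in range(N):
    newJ, mu = {}, {}
    for x in X:
        u_best = min(U[x], key=lambda u: g[(x, u)] + J[f[(x, u)]])
        mu[x] = u_best
        newJ[x] = g[(x, u_best)] + J[f[(x, u_best)]]
    J = newJ                       # now J = T^{k+1} J_bar
    policy.append(mu)

print(J)             # J_N^* = T^N J_bar (when the infima are attained)
print(policy[::-1])  # mu*_0, ..., mu*_{N-1}
```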
A favorable case where the analysis is simplified and we can easily show that J*_N = T^N J̄ is when the finite horizon DP algorithm yields an optimal policy during its execution. By this we mean that the algorithm that starts with J̄, and sequentially computes T J̄, T^2 J̄, . . . , T^N J̄, also yields corresponding µ*_{N−1}, µ*_{N−2}, . . . , µ*_0 ∈ M such that

Tµ*_k T^{N−k−1} J̄ = T^{N−k} J̄,    k = 0, . . . , N − 1.        (4.3)

While µ*_{N−1}, . . . , µ*_0 ∈ M satisfying this relation need not exist (because the corresponding infimum in the definition of T is not attained), if they do exist, they both form an optimal policy and also guarantee that

J*_N = T^N J̄.
where the inequality follows from the monotonicity assumption and the definition of T, and the last equality follows from Eq. (4.3). Thus {µ*_0, µ*_1, . . .} has no worse N-stage cost function than every other policy, so it is N-stage optimal and J*_N = Tµ*_0 · · · Tµ*_{N−1} J̄. By taking the infimum of the left-hand side over π ∈ Π in Eq. (4.4), we obtain J*_N = T^N J̄.
The preceding argument can also be used to show that {µ∗k , µ∗k+1 , . . .}
is (N − k)-stage optimal for all k = 0, . . . , N − 1. Such a policy is called
uniformly N -stage optimal . The fact that the finite horizon DP algorithm
provides an optimal solution of all the k-stage problems for k = 1, . . . , N ,
rather than just the last one, is a manifestation of the classical principle
! "
is attained for all x ∈ X and k. Indeed if H x, u, T k J¯ = ∞ for all
u ∈ U (x), then every u ∈ U (x) attains the infimum. If for a given x ∈ X,
! "
inf H x, u, T k J¯ < ∞,
u∈U(x)
the corresponding part of the proof of Lemma 3.3.1 applies and shows that
the above infimum is attained. The result now follows from Prop. 4.2.1.
Q.E.D.
We now consider the case where there may not exist a uniformly N-stage optimal policy. By using the definitions of J*_N and T^N J̄, the equation
Since J*_N ≤ Tµ0 · · · TµN−1 J̄ for all µ0, . . . , µN−1 ∈ M, we have, using also Eq. (4.6) and the assumption Jk,π(x) < ∞ for all k, π, and x,

J*_N ≤ inf_{m0} · · · inf_{mN−1} T_{µ0^{m0}} · · · T_{µ_{N−1}^{m_{N−1}}} J̄
     = inf_{m0} · · · inf_{mN−2} T_{µ0^{m0}} · · · T_{µ_{N−2}^{m_{N−2}}} ( inf_{mN−1} T_{µ_{N−1}^{m_{N−1}}} J̄ )
       ⋮
     = inf_{m0} T_{µ0^{m0}} (T^{N−1} J̄)
     = T^N J̄.

On the other hand, it is clear from the definitions that T^N J̄ ≤ JN,π for all N and π ∈ Π, so that T^N J̄ ≤ J*_N. Thus, J*_N = T^N J̄. Q.E.D.
Moreover, there exists a scalar α ∈ (0, ∞) such that for all scalars
r ∈ (0, ∞) and functions J ∈ E(X), we have
J*_N ≤ J_{N,π_ε} ≤ J*_N + ε e.
with Jk,π_ε ≤ J*_k + ε e. Using Eq. (4.7), we have for all µ ∈ M,

J*_{k+1} ≤ Tµ Jk,π_ε ≤ Tµ J*_k + αε e.
Taking the infimum over µ and then the limit as ε → 0, we obtain J*_{k+1} ≤ T J*_k. By using the induction hypothesis J*_k = T^k J̄, it follows that J*_{k+1} ≤ T^{k+1} J̄. On the other hand, we have clearly T^{k+1} J̄ ≤ Jk+1,π for all π ∈ Π, so that T^{k+1} J̄ ≤ J*_{k+1}, and hence T^{k+1} J̄ = J*_{k+1}.
We now turn to the existence of an ε-optimal policy part of the induction argument. Using the assumption J*_k(x) > −∞ for all x ∈ X, for any ε > 0, we can choose π = {µ0, µ1, . . .} such that

Jk,π ≤ J*_k + (ε/2α) e,        (4.8)

Jk+1,π_ε = Tµ Jk,π ≤ Tµ J*_k + (ε/2) e ≤ T J*_k + ε e = J*_{k+1} + ε e,

where the first inequality is obtained by applying Tµ to Eq. (4.8) and using Eq. (4.7). The induction is complete. Q.E.D.
Let

X = {0},    U(0) = (−1, 0],    J̄(0) = 0,

H(0, u, J) = u           if −1 < J(0),
H(0, u, J) = J(0) + u    if J(0) ≤ −1.

Then

(Tµ0 · · · TµN−1 J̄)(0) = µ0(0),

and J*_N(0) = −1, while (T^N J̄)(0) = −N for every N. Here Assumption 4.2.1, and the condition (4.7) (cf. Assumption 4.2.2) are violated, even though the condition J*_k(x) > −∞ for all x ∈ X (cf. Assumption 4.2.2) is satisfied.
Let

X = {0, 1},    U(0) = U(1) = (−∞, 0],    J̄(0) = J̄(1) = 0,

H(0, u, J) = u    if J(1) = −∞,
H(0, u, J) = 0    if J(1) > −∞,
H(1, u, J) = u.

Then

(Tµ0 · · · TµN−1 J̄)(0) = 0,    (Tµ0 · · · TµN−1 J̄)(1) = µ0(1),    ∀ N ≥ 1.

It can be seen that for N ≥ 2, we have J*_N(0) = 0 and J*_N(1) = −∞, but (T^N J̄)(0) = (T^N J̄)(1) = −∞. Here Assumption 4.2.1, and the condition J*_k(x) > −∞ for all x ∈ X (cf. Assumption 4.2.2) are violated, even though the condition (4.7) of Assumption 4.2.2 is satisfied.
Let

X = {0, 1},    U(0) = U(1) = ℜ,    J̄(0) = J̄(1) = 0,

let w be a real-valued random variable with E{w} = ∞, and let

H(x, u, J) = E{w + J(1)}    if x = 0,
H(x, u, J) = u + J(1)       if x = 1,

for all x ∈ X, u ∈ U(x). Then if Jm is real-valued for all m, and Jm(1) ↓ J(1) = −∞, we have

lim_{m→∞} H(0, u, Jm) = lim_{m→∞} E{w + Jm(1)} = ∞,

so Assumption 4.2.1 is violated. Indeed, the reader may verify with a straightforward calculation that J*_2(0) = ∞, J*_2(1) = −∞, while (T^2 J̄)(0) = −∞, (T^2 J̄)(1) = −∞, so J*_2 ≠ T^2 J̄. Note that Assumption 4.2.2 is also violated because J*_2(1) = −∞.
Let α = 1 and

N = 2,    X = {0, 1, . . .},    U(x) = (0, ∞),    J̄(x) = 0,    ∀ x ∈ X,

f(x, u) = 0,    ∀ x ∈ X, u ∈ U(x),

g(x, u) = −u    if x = 0,
g(x, u) = x     if x ≠ 0,

for all u ∈ U(x), so that

H(x, u, J) = g(x, u) + J(0).

Then for π ∈ Π and x ≠ 0, we have J2,π(x) = x − µ1(0), so that J*_2(x) = −∞ for all x ≠ 0. Clearly, we also have J*_2(0) = −∞. Here Assumption 4.2.1, as well as Eq. (4.7) (cf. Assumption 4.2.2) are satisfied, and indeed we have J*_2(x) = (T^2 J̄)(x) = −∞ for all x ∈ X. However, the condition J*_k(x) > −∞ for all x and k (cf. Assumption 4.2.2) is violated, and it is seen that there does not exist a two-stage ε-optimal policy for any ε > 0. The reason is that an ε-optimal policy π = {µ0, µ1} must satisfy

J2,π(x) = x − µ1(0) ≤ −1/ε,    ∀ x ∈ X,

[in view of J*_2(x) = −∞ for all x ∈ X], which is impossible since the left-hand side above can become positive for x sufficiently large.
We now turn to the infinite horizon problem (4.2), where the cost function
of a policy π = {µ0 , µ1 , . . .} is
Jπ(x) = lim sup_{k→∞} (Tµ0 · · · Tµk J̄)(x),    ∀ x ∈ X.

−∞ < J̄(x) ≤ H(x, u, J̄),    ∀ x ∈ X, u ∈ U(x).
(c) There exists a scalar α ∈ (0, ∞) such that for all scalars r ∈
(0, ∞) and functions J ∈ E(X) with J¯ ≤ J, we have
J̄(x) ≥ H(x, u, J̄),    ∀ x ∈ X, u ∈ U(x).
Jπ(x) = lim_{k→∞} (Tµ0 · · · Tµk J̄)(x),    ∀ x ∈ X,
Jµ = T µ Jµ .
(T J)(x) ≥ (T J̄)(x) > −∞,    ∀ x ∈ X,
Tµ! J ≤ T J + " e.
This property is critical for the existence of an "-optimal policy under As-
sumption I (see the next proposition) and is not available under Assumption
D. It accounts in part for the different character of the results that can be
obtained under the two assumptions.
Proposition 4.3.2: Let Assumption I hold. Then given any ε > 0, there exists a policy π_ε ∈ Π such that

J* ≤ Jπ_ε ≤ J* + ε e.
Proof: Let {"k } be a sequence such that "k > 0 for all k and
∞
.
αk "k = ". (4.9)
k=0
& '
For each x ∈ X, consider a sequence of policies πk [x] ⊂ Π of the form
& '
πk [x] = µk0 [x], µk1 [x], . . . , (4.10)
Such a sequence exists, since we have assumed that J(x) ¯ > −∞, and
*
therefore J (x) > −∞, for all x ∈ X.
The preceding notation should be interpreted as follows. The policy
πk [x] of Eq. (4.10) is associated with x. Thus µki [x] denotes for each x and
k, a function in M, while µki [x](z) denotes the value of µki [x] at an element
z ∈ X. In particular, µki [x](x) denotes the value of µki [x] at x ∈ X.
Consider the functions µk defined by
From Eqs. (4.13), (4.14), and part (c) of Assumption I, we have for all x ∈ X and k = 1, 2, . . .,

(Tµk−1 J̄k)(x) = H(x, µk−1(x), J̄k)
             ≤ H(x, µk−1(x), J* + εk e)
             ≤ H(x, µk−1(x), J*) + αεk
             ≤ H(x, µk−1(x), lim_{m→∞} T_{µ^{k−1}_1[x]} · · · T_{µ^{k−1}_m[x]} J̄) + αεk

and finally

Tµk−1 J̄k ≤ J̄k−1 + αεk e,    k = 1, 2, . . . .

Denote π_ε = {µ0, µ1, . . .}. Then by taking the limit in the preceding inequality and using Eq. (4.9), we obtain

Jπ_ε ≤ J* + ε e.
& '
If α < 1, we take "k = "(1−α) for all k, and πk [x] = µ0 [x], µ1 [x], . . .
in Eq. (4.11). The stationary policy π! = {µ, µ, . . .}, where µ(x) = µ0 [x](x)
for all x ∈ X, satisfies Jπ! ≤ J * + " e. Q.E.D.
J * = T J *.
Jπ(x) = lim_{k→∞} (Tµ0 Tµ1 · · · Tµk J̄)(x)
      = ( Tµ0 lim_{k→∞} Tµ1 · · · Tµk J̄ )(x)
      ≥ (Tµ0 J*)(x)
      ≥ (T J*)(x).

J* ≥ T J*.
To prove the reverse inequality, let ε1 and ε2 be any positive scalars, and let π = {µ0, µ1, . . .} be such that

= Tµ0 Jπ1
≤ Tµ0 J* + αε2 e
≤ T J* + (ε1 + αε2) e.

Since ε1 and ε2 can be taken arbitrarily small, it follows that
J * ≤ T J *.
Hence J * = T J * .
Assume that J′ ∈ E(X) satisfies J′ ≥ J̄ and J′ ≥ T J′. Let {εk} be any sequence with εk > 0 for all k, and consider a policy π = {µ0, µ1, . . .} ∈ Π such that

Tµk J′ ≤ T J′ + εk e,    k = 0, 1, . . . .
The following counterexamples show that parts (b) and (c) of As-
sumption I are essential for the preceding proposition to hold.
Let
Thus
Let
By taking the limit of both sides as N → ∞, we obtain Jπ ≥ lim_{N→∞} J*_N, and by taking the infimum over π, J* ≥ lim_{N→∞} J*_N. Thus J* = lim_{N→∞} J*_N. Q.E.D.
J * = T J *.
where the last inequality follows from the fact T^k J̄ ↓ J* (cf. Prop. 4.3.5). Taking the infimum of both sides over π ∈ Π, we obtain J* ≥ T J*.

To prove the reverse inequality, we select any µ ∈ M, and we apply Tµ to both sides of the equation J* = lim_{N→∞} T^N J̄ (cf. Prop. 4.3.5). By using part (b) of assumption D, we obtain

Tµ J* = Tµ ( lim_{N→∞} T^N J̄ ) = lim_{N→∞} Tµ T^N J̄ ≥ lim_{N→∞} T^{N+1} J̄ = J*.
Proof: Consider the variation of our problem where the control constraint set is Uµ(x) = {µ(x)} rather than U(x) for all x ∈ X. Application of Prop. 4.3.6 yields the result. Q.E.D.
An examination of the proof of Prop. 4.3.6 shows that the only point where we need part (b) of Assumption D was in establishing the relations

lim_{N→∞} T J*_N = T ( lim_{N→∞} J*_N )

and

J*_N = T^N J̄.
Then
J * = T J *.
Furthermore, if J′ ∈ E(X) is such that J′ ≤ J̄ and J′ ≤ T J′, then J′ ≤ J*.
Proof: A nearly verbatim repetition of the proof of Prop. 4.2.4 shows that under our assumptions we have J*_N = T^N J̄ for all N. We will show that

lim_{N→∞} H(x, u, J*_N) ≤ H( x, u, lim_{N→∞} J*_N ),    ∀ x ∈ X, u ∈ U(x).
Tµ J * = T J * .
T µ Jµ = T J µ .
T J * = J * = Jµ = T µ Jµ = T µ J * .
Thus under D, a stationary optimal policy attains the infimum in the fixed
point Eq. (4.16) for all x. However, there may exist nonoptimal stationary
policies also attaining the infimum for all x; an example is the shortest path
problem of Section 3.1.1 for the case where a = 0 and b = 1. Moreover,
it is possible that this infimum is attained but no optimal policy exists, as
shown by Fig. 4.3.2.
Proposition 4.3.9 shows that under Assumption I, there exists a sta-
tionary optimal policy if and only if the infimum in the optimality equation
is attained for every x ∈ X. When the infimum is not attained for some x ∈
X, this optimality equation can still be used to yield an "-optimal policy,
which can be taken to be stationary whenever the scalar α in Assumption
I(c) is strictly less than 1. This is shown in the following proposition.
Figure 4.3.2. An example where nonstationary policies are dominant under Assumption D. Here there is only one state and S = ℜ. There are two stationary policies µ and µ̄ with cost functions Jµ and Jµ̄ as shown. However, by considering a nonstationary policy of the form πk = {µ, . . . , µ, µ̄, µ̄, . . .}, with a number k of policies µ, we can obtain a sequence {Jπk} that converges to the value J* shown. Note that here there is no optimal policy, stationary or not.
Tµ∗ J * ≤ T J * + "k e, ∀ k = 0, 1, . . . ,
k
then
J * ≤ Jπ∗ ≤ J * + " e.
(b) If " > 0, the scalar α in part (c) of Assumption I is strictly less
than 1, and µ∗ ∈ M is such that
Tµ∗ J * ≤ T J * + "(1 − α) e,
then
J * ≤ Jµ∗ ≤ J * + " e.
Tµ∗ J * ≤ J * + "k e,
k
Applying Tµ∗ throughout and repeating the process, we obtain for every
k−2
k = 1, 2, . . .,
/ k 0
.
Tµ∗0 · · · Tµ∗ J * ≤ J* + αi " i e, k = 1, 2, . . . .
k
i=0
and let
πN = {µ^N_0, µ^N_1, . . . , µ^N_{N−1}, µ, µ, . . .}

JπN ≤ T^N J̄ + (ε/2) e.

T^N̄ J̄ ≤ J* + (ε/2) e
lim_{k→∞} T^k J0 = J*.
are compact for every x ∈ X, λ ∈ ℜ, and for all k greater than some integer k̄. Assume that J0 ∈ E(X) is such that J̄ ≤ J0 ≤ J*. Then

lim_{k→∞} T^k J0 = J*.
Proof: Similar to the proof of Prop. 4.3.13, it will suffice to show that
T k J¯ → J * . Since J¯ ≤ J * , we have T k J¯ ≤ T k J * = J * , so that
J¯ ≤ T J¯ ≤ · · · ≤ T k J¯ ≤ · · · ≤ J * .
J∞ ≤ T J ∞ .
Clearly, by Eq. (4.18), we must have J∞(x̃) < ∞. For every k, consider the set

Uk( x̃, J∞(x̃) ) = { u ∈ U(x̃) | H(x̃, u, T^k J̄) ≤ J∞(x̃) },
for all x̃ ∈ X with J * (x̃) < ∞. For x̃ ∈ X such that J * (x̃) = ∞, every
u ∈ U (x̃) attains the preceding minimum. Hence by Prop. 4.3.9 an optimal
stationary policy exists. Q.E.D.
and µ∗ (x̃)
is a limit point of {µk (x̃)}, for every x̃ ∈ X, then the stationary
policy µ∗ is optimal. Furthermore, {µk (x̃)} has at least one limit point
for every x̃ ∈ X for which J * (x̃) < ∞. Thus the VI algorithm under the
assumption of Prop. 4.3.14 yields in the limit not only the optimal cost
function J * but also an optimal stationary policy.
On the other hand, under Assumption I but in the absence of the
compactness condition (4.17), T k J¯ need not converge to J * . What is hap-
pening here is that while the mappings Tµ are continuous from below as
required by Assumption I(b), T may not be, and a phenomenon like the
one illustrated in the left-hand side of Fig. 4.3.1 may occur, whereby
lim_{k→∞} T^k J̄ ≤ T ( lim_{k→∞} T^k J̄ ),
with strict inequality for some x ∈ X. This can happen even in simple
deterministic optimal control problems, as shown by the following example.
Let

X = [0, ∞),    U(x) = (0, ∞),    J̄(x) = 0,    ∀ x ∈ X,

and

H(x, u, J) = min{1, x + J(2x + u)},    ∀ x ∈ X, u ∈ U(x).

Then it can be verified that for all x ∈ X and policies µ, we have Jµ(x) = 1, as well as J*(x) = 1, while it can be seen by induction that starting with J̄, the VI algorithm yields

(T^k J̄)(x) = min{1, (1 + 2 + · · · + 2^{k−1})x},    ∀ x ∈ X, k = 1, 2, . . . .

Thus we have 0 = lim_{k→∞} (T^k J̄)(0) ≠ J*(0) = 1.
S(k) = {J | T k J¯ ≤ J ≤ J * }, k = 0, 1, . . . .
Proof: We have

J0 ≥ Tµ0 J0 ≥ T^{m0}_{µ0} J0 = J1 ≥ T^{m0+1}_{µ0} J0 = Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ · · · ≥ J2,
† As with all PI algorithms in this book, we assume that the policy im-
provement operation is well-defined, in the sense that there exists µk such that
Tµk Jk = T Jk for all k.
where the first, second, and third inequalities hold because the assumption
J0 ≥ T J0 = Tµ0 J0 implies that
T^m_{µ0} J0 ≥ T^{m+1}_{µ0} J0,    ∀ m ≥ 0.
Jk ≥ T Jk ≥ Jk+1 , ∀ k ≥ 0.
where the last equality follows from the fact T J * = J * (cf. Prop. 4.3.6),
thus completing the induction. By combining the preceding two relations,
we have
Jk ≥ T Jk ≥ Jk+1 ≥ J * , ∀ k ≥ 0. (4.22)
We will now show by induction that
T k J0 ≥ J k ≥ J * , ∀ k ≥ 0. (4.23)
T k+1 J0 ≥ T Jk ≥ Jk+1 ≥ J * ,
Tµk Jk = T Jk,    Jk+1 = T^{(λ)}_{µk} Jk,        (4.24)

where for any policy µ and scalar λ ∈ (0, 1), T^{(λ)}_µ is the mapping defined by

(T^{(λ)}_µ J)(x) = (1 − λ) Σ_{t=0}^∞ λ^t (T^{t+1}_µ J)(x),    x ∈ X.
Here we assume that Tµ maps R(X) to R(X), and that for all µ ∈ M
and J ∈ R(X), the limit of the series above is well-defined as a function in
R(X).
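As a small illustration of this definition, the following sketch approximates (T^{(λ)}_µ J)(x) by truncating the series, for a hypothetical finite-state model in which Tµ J = gµ + Pµ J; the data gµ, Pµ, the truncation length, and the numerical values are all assumptions made for the example.

```python
import numpy as np

# Truncated evaluation of T_mu^{(lambda)} J = (1 - lam) * sum_t lam^t * T_mu^{t+1} J,
# for a hypothetical model with T_mu J = g_mu + P_mu J.
def T_mu(J, g_mu, P_mu):
    return g_mu + P_mu @ J

def T_mu_lambda(J, g_mu, P_mu, lam, num_terms=200):
    out = np.zeros_like(J)
    TJ = T_mu(J, g_mu, P_mu)                 # T_mu^{1} J
    for t in range(num_terms):
        out += (1 - lam) * (lam ** t) * TJ
        TJ = T_mu(TJ, g_mu, P_mu)            # T_mu^{t+2} J
    return out

g_mu = np.array([1.0, 0.5, 0.0])             # placeholder one-stage costs
P_mu = np.array([[0.3, 0.4, 0.3],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.0, 1.0]])           # placeholder transition matrix
J = np.zeros(3)
print(T_mu_lambda(J, g_mu, P_mu, lam=0.5))
```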
We discussed the λ-PI algorithm in connection with semicontractive
problems in Section 3.2.4, where we assumed that
Tµ (T^{(λ)}_µ J) = T^{(λ)}_µ (Tµ J),    ∀ µ ∈ M, J ∈ E(X).        (4.25)
We will show that for undiscounted finite-state MDP, the algorithm can
be implemented by using matrix inversion, just like nonoptimistic PI for
discounted finite-state MDP. It turns out that this can be an advantage in
some settings, including approximate simulation-based implementations.
As noted earlier, λ-PI and optimistic PI are similar: they just use the
mapping Tµk to apply VI in different ways. In view of this similarity, it is
not surprising that it has the same type of convergence properties as the
earlier optimistic PI method (4.20). Similar to Prop. 4.3.15, we have the
following.
where for the third inequality, we use the relation J0 ≥ Tµ0 J0 , the definition
of J1 , and the assumption (4.25). Continuing in the same manner,
Jk ≥ T Jk ≥ Jk+1 , ∀ k ≥ 0.
Similar to the proof of Prop. 4.3.15, we show by induction that Jk ≥ J*, using the fact that if Jk ≥ J*, then

Jk+1 = T^{(λ)}_{µk} Jk ≥ T^{(λ)}_{µk} J* ≥ (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1} J* = J*,
[cf. the induction step of Eq. (4.21)]. By combining the preceding two
relations, we obtain Eq. (4.22), and the proof is completed by using the
argument following that equation. Q.E.D.
The λ-PI algorithm has a useful property, which involves the mapping Wk : R(X) ↦ R(X) given by

Wk J = (1 − λ) Tµk Jk + λ Tµk J.        (4.26)
Consider the SSP problem of Example 1.2.6 with states 1, . . . , n, plus the
termination state 0. For all u ∈ U (x), the state following x is y with prob-
ability pxy (u) and the expected cost incurred is nonpositive. This problem
arises when we wish to maximize nonnegative rewards up to termination. It
includes a classical search problem where the aim, roughly speaking, is to
move through the state space looking for states with favorable termination
rewards.
We view the problem within our abstract framework with J¯(x) ≡ 0 and
Tµ J = gµ + Pµ J, (4.27)
with gµ ∈ ℜ^n being the corresponding nonpositive one-stage cost vector, and Pµ being an n × n substochastic matrix. The components of Pµ are the probabilities pxy(µ(x)), x, y = 1, . . . , n. Clearly Assumption D holds.
Consider the λ-PI method (4.24), with Jk+1 computed by solving the
fixed point equation J = Wk J, cf. Eq. (4.26). This is a nonsingular n-
dimensional system of linear equations, and can be solved by matrix inversion,
just like in exact PI for discounted n-state MDP. In particular, using Eqs.
(4.26) and (4.27), we have
! "
Jk+1 = (I − λPµk )−1 gµk + (1 − λ)Pµk Jk . (4.28)
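The following NumPy sketch carries out λ-PI iterations using the matrix-inversion form (4.28) on a hypothetical two-state SSP instance with nonpositive costs; the transition data, costs, and number of iterations are made-up assumptions, and the sketch is only meant to show the mechanics of one policy improvement followed by the linear solve.

```python
import numpy as np

n, lam = 2, 0.5
U = [0, 1]
# g[x, u]: nonpositive expected one-stage cost; P[u][x][y]: transition probabilities
# among the non-termination states (row sums below 1 correspond to termination).
g = np.array([[-1.0, -0.2],
              [-0.5, -2.0]])
P = np.array([[[0.5, 0.3], [0.2, 0.1]],      # P[u=0]
              [[0.1, 0.1], [0.6, 0.2]]])     # P[u=1]

J = np.zeros(n)                              # J_0 = J_bar = 0
for k in range(20):
    # Policy improvement: choose mu_k with T_{mu_k} J_k = T J_k.
    Q = np.stack([g[:, u] + P[u] @ J for u in U], axis=1)
    mu = np.argmin(Q, axis=1)
    g_mu = g[np.arange(n), mu]
    P_mu = np.stack([P[mu[x], x] for x in range(n)])
    # Policy evaluation via Eq. (4.28): solve (I - lam*P_mu) J_{k+1} = g_mu + (1-lam)*P_mu J_k.
    J = np.linalg.solve(np.eye(n) - lam * P_mu, g_mu + (1 - lam) * P_mu @ J)

print(J)     # under these assumptions, J approximates J*
```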
Jµ = Tµ Jµ ≥ T Jµ = Tµ̄ Jµ ,
Jµ ≥ T Jµ ≥ Jµ̄ . (4.29)
Moreover, by Prop. 3.2.4, if S has the weak PI property and for each
sequence {Jm } ⊂ E(X) with Jm ↓ J for some J ∈ E(X), we have
stationary policies may not be the same as the one over nonstationary
policies (cf. Prop. 4.3.2, and the subsequent example). Generally, when
referring to an S-regular collection C, we implicitly assume that S and C
are nonempty, although on occasion we may state explicitly this fact for
emphasis.
For a given set C of policy-state pairs (π, x), let us consider the func-
tion JC* ∈ E(X), given by

JC*(x) = inf_{ {π | (π,x)∈C} } Jπ(x),    x ∈ X.
Note that JC* (x) ≥ J * (x) for all x ∈ X [for those x ∈ X for which the set
of policies {π | (π, x) ∈ C} is empty, we have by convention JC* (x) = ∞].
For an important example, note that in the analysis of Chapter 3, the
set of S-regular policies MS of Section 3.2 defines the S-regular collection
& '
C = (µ, x) | µ ∈ MS , x ∈ X ,
and the corresponding restricted optimal cost function JS* is equal to JC* . In
Sections 3.2-3.4 we saw that when JS* is a fixed point of T , then favorable
results are obtained. Similarly, in this section we will see that for an S-
regular collection C, when JC* is a fixed point of T , interesting results are
obtained.
The following two propositions play a central role in our analysis on
this section and the next two, and may be compared with Prop. 3.2.1,
which played a pivotal role in the analysis of Chapter 3.
(b) For all J′ ∈ E(X) with J′ ≤ T J′, and all J ∈ E(X) such that J′ ≤ J ≤ J̃ for some J̃ ∈ S, we have
lim sup_{k→∞} (T^k J)(x) ≤ lim sup_{k→∞} (T^k J̃)(x) ≤ lim sup_{k→∞} (Tµ0 · · · Tµk−1 J̃)(x) = Jπ(x),

and by taking the infimum of the right side over {π | (π, x) ∈ C}, we obtain the result.
(b) Using the hypotheses J′ ≤ T J′, and J′ ≤ J ≤ J̃ for some J̃ ∈ S, and the monotonicity of T, we have
(cf. the corresponding definition in Section 3.2), and the limit region
&
J ∈ E(X) | J # ≤ J ≤ JC* for all fixed points J # of T
'
with J # ≤ J˜ for some J˜ ∈ S .
Figure 4.4.1. Schematic illustration of Prop. 4.4.1. Neither JC* nor J* need to be fixed points of T, but if C is S-regular, and there exists J̃ ∈ S with JC* ≤ J̃, then JC* demarcates from above the range of fixed points of T that lie below J̃.
Proof: (a) The first statement follows from Prop. 4.4.1(b). For the second statement, let J′ be a fixed point of T with J′ ∈ WS,C. Then from the definition of WS,C, we have JC* ≤ J′ as well as J′ ≤ J̃ for some J̃ ∈ S, so from Prop. 4.4.1(b) it follows that J′ ≤ JC*. Hence J′ = JC*.
Figure 4.4.2. Schematic illustration of Prop. 4.4.2, for the case where S = E(X) so that WS,C is unbounded above, i.e., WS,C = { J ∈ E(X) | JC* ≤ J }. In this figure JC* is not a fixed point of T. The VI algorithm, starting from the well-behaved region WS,C, ends up asymptotically within the limit region.
(b) The result follows from Prop. 4.4.1(a), and in the case where JC* is a
fixed point of T , from Prop. 4.4.1(b), with J # = JC* .
(c) See observation (3) in the discussion preceding the proposition. Q.E.D.
Figure 4.4.3. Schematic illustration of Prop. 4.4.2, and the set WS,C of Eq. (4.30), for a case where JC* is a fixed point of T and S is a strict subset of E(X). Every fixed point of T that lies below some J̃ ∈ S should lie below JC*. Also, the VI algorithm converges to JC* starting from within WS,C. If S were unbounded from above, as in Fig. 4.4.2, JC* would be the largest fixed point of T.
functions JC* and J * is a fixed point of T , as can be seen from the examples
of Section 3.1.
We have seen in Section 4.3 that the results for monotone increasing and
monotone decreasing models are markedly different. In the context of S-
regularity of a collection C, it turns out that there are analogous significant
differences between the cases JC* ≥ J¯ and JC* ≤ J. ¯ The following propo-
sition establishes some favorable aspects of the condition JC* ≤ J¯ in the
context of VI. These can be attributed to the fact that J¯ can always be
added to S without affecting the S-regularity of C, so J¯ can serve as the
element J˜ of S in Props. 4.4.1 and 4.4.2 (see the subsequent proof). The
following proposition may also be compared with the result on convergence
of VI under Assumption D (cf. Prop. 4.3.13).
J* ≤ Jπ_ε ≤ J* + ε e.
where g is the one-stage cost function and f is the system function. The
expected value is taken with respect to the distribution of the random
variable w (which takes values in a countable set W ). We assume that
& '
0 ≤ E g(x, u, w) < ∞, ∀ x ∈ X, u ∈ U (x), w ∈ W.
where Exπ0 {·} denotes expected value with respect to the probability dis-
tribution induced by π under initial state x0 .
We will apply the analysis of this section with
& '
C = (π, x) | Jπ (x) < ∞ ,
Note that S is the largest set with respect to which C is regular, in the sense that C is S-regular and if C is S′-regular for some other set S′, then S′ ⊂ S. To see this we write for all J ∈ E+(X), (π, x0) ∈ C, and k,

(Tµ0 · · · Tµk−1 J)(x0) = E^π_{x0}{ J(xk) } + E^π_{x0}{ Σ_{m=0}^{k−1} g(xm, µm(xm), wm) },
In view of the definition (4.33) of S, this implies that for all J ∈ S, we have
J * ↓ J ↓ cJ * , (4.38)
where α ∈ (0, 1) is the discount factor, and as earlier E^π_{x0}{·} denotes expected value with respect to the probability measure induced by π ∈ Π under initial state x0. We assume that the one-stage expected cost is nonnegative,

0 ≤ E{ g(x, u, w) } < ∞,    ∀ x ∈ X, u ∈ U(x), w ∈ W.
Assumption 4.4.1:
(a) For all π = {µ0 , µ1 , . . .} ∈ Π, Jπ can be defined as a limit:
Jπ(x) = lim_{k→∞} (Tµ0 · · · Tµk J̄)(x),    ∀ x ∈ X.        (4.39)
J¯ ≤ J * .
(c) There exists α > 0 such that for all J ∈ Eb(X) and r ∈ ℜ,
Proposition 4.4.8: Let Assumption 4.4.1 hold. Given any ε > 0, there exists a policy π_ε ∈ Π such that

J* ≤ Jπ_ε ≤ J* + ε e.
Proof: Let {"k } be a sequence such that "k > 0 for all k and
∞
.
αk "k = ", (4.40)
k=0
From Eqs. (4.43), (4.44), and Assumption 4.4.1(c), we have for all x ∈ X and k = 1, 2, . . .,

(Tµk−1 J̄k)(x) = H(x, µk−1(x), J̄k)
             ≤ H(x, µk−1(x), J* + εk e)
             ≤ H(x, µk−1(x), J*) + αεk
             ≤ H(x, µk−1(x), lim_{m→∞} T_{µ^{k−1}_1[x]} · · · T_{µ^{k−1}_m[x]} J̄) + αεk

and finally

Tµk−1 J̄k ≤ J̄k−1 + αεk e,    k = 1, 2, . . . .

Using this inequality and Assumption 4.4.1(c), we obtain

Denote π_ε = {µ0, µ1, . . .}. Then by taking the limit in the preceding inequality and using Eq. (4.40), we obtain

Jπ_ε ≤ J* + ε e.
Q.E.D.
By using Prop. 4.4.8 we can prove the following.
Jπ(x) = lim_{k→∞} (Tµ0 Tµ1 · · · Tµk J̄)(x)
      = ( Tµ0 lim_{k→∞} Tµ1 · · · Tµk J̄ )(x)
      ≥ (Tµ0 J*)(x)
      ≥ (T J*)(x).
J * ≥ T J *.
To prove the reverse inequality, let ε1 and ε2 be any positive scalars, and let π = {µ0, µ1, . . .} be such that

= Tµ0 Jπ1
≤ Tµ0 (J* + ε2 e)
≤ Tµ0 J* + αε2 e
≤ T J* + (ε1 + αε2) e.

Since ε1 and ε2 can be taken arbitrarily small, it follows that
J * ≤ T J *.
Hence J * = T J * . Q.E.D.
Figure 4.5.1 A deterministic optimal control problem with nonnegative cost per
stage, and a cost-free and absorbing destination t.
† A related line of analysis for deterministic problems with both positive and
negative costs per stage is developed in Exercise 4.9.
where xk and uk are the state and control at stage k, which belong to
sets X and U , referred to as the state and control spaces, respectively,
and f : X × U %→ X is a given function. The control uk must be chosen
from a constraint set U (xk ) ⊂ U that may depend on the current state xk .
The cost per stage g(x, u) is assumed nonnegative and possibly extended
real-valued:
Except for the cost nonnegativity assumption (4.46), this problem is similar
to the one of Section 3.5.5. It arises in many classical control applications
involving regulation around a set point, and in finite-state and infinite-state
versions of shortest path applications; see Fig. 4.5.1.
As earlier, we denote policies by π and stationary policies by µ. Given an initial state x0, a policy π = {µ0, µ1, . . .}, when applied to the system (4.45), generates a unique sequence of state-control pairs (xk, µk(xk)), k = 0, 1, . . .. The cost of π starting from x0 is

Jπ(x0) = Σ_{k=0}^∞ g(xk, µk(xk)),    x0 ∈ X,
Since t is cost-free and absorbing, this set contains the cost function Jπ of
every π ∈ Π, as well as J * .
Under the cost nonnegativity assumption (4.46), the problem can be
cast as a special case of the monotone increasing model with
! "
H(x, u, J) = g(x, u) + J f (x, u) ,
and the initial function J¯ being identically zero. Thus Prop. 4.4.4 applies
and in particular J * satisfies Bellman’s equation:
& ! "'
J * (x) = inf g(x, u) + J * f (x, u) , x ∈ X.
u∈U(x)
J * ≤ J0 ≤ cJ * ,
for some scalar c > 0. We also have that VI converges to J * starting from
any J0 with
0 ≤ J0 ≤ J *
under the compactness condition of Prop. 4.4.4(d). However, {Jk } may not
always converge to J * because, among other reasons, Bellman’s equation
may have multiple solutions within J .
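As a purely illustrative sketch of VI for this class of problems, the following Python snippet runs VI on a hypothetical finite deterministic instance with nonnegative costs and a cost-free, absorbing destination t, started from above (J0 ≥ J*, with J0(t) = 0); the graph, costs, and starting function are assumptions made for the example.

```python
# Illustrative VI, started from above, for a hypothetical deterministic problem
# with nonnegative costs and a cost-free absorbing destination t.
INF = float("inf")
t = "t"
X = ["a", "b", "c", t]
U = {"a": ["to_b", "to_c"], "b": ["to_t"], "c": ["to_b"], t: ["stay"]}
f = {("a", "to_b"): "b", ("a", "to_c"): "c", ("b", "to_t"): t,
     ("c", "to_b"): "b", (t, "stay"): t}
g = {("a", "to_b"): 2.0, ("a", "to_c"): 1.0, ("b", "to_t"): 1.0,
     ("c", "to_b"): 3.0, (t, "stay"): 0.0}

J = {x: (0.0 if x == t else INF) for x in X}   # J_0 >= J*, with J_0(t) = 0
for k in range(10):
    J = {x: min(g[(x, u)] + J[f[(x, u)]] for u in U[x]) for x in X}
    print(k + 1, J)
# Under this choice of J_0, after k iterations J(x) can be read as the best cost
# achievable from x while reaching t within k steps.
```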
The PI method starts from a stationary policy µ0 , and generates a
sequence of stationary policies {µk } via a sequence of policy evaluations to
obtain Jµk from the equation
! " ! ! ""
Jµk (x) = g x, µk (x) + Jµk f x, µk (x) , x ∈ X, (4.48)
[the near-optimal termination Assumption 3.5.10, but not the cost non-
negativity assumption (4.46)]. Our approach here will make use of the
cost nonnegativity but will address the problem under otherwise weaker
conditions.
Our analytical approach will also be different than the approach of
Section 3.5.5. Here, we will implicitly rely on the regularity ideas for non-
stationary policies that we introduced in Section 4.4, and we will make
a connection with traditional notions of feedback control system stability.
Using nonstationary policies may be important in undiscounted optimal
control problems with nonnegative cost per stage because it is not gener-
ally true that there exists a stationary ε-optimal policy [cf. the ε-optimality result of Prop. 4.4.4(e)].
(with the convention that the infimum over the empty set is ∞). We
say that π is p-stable (without qualification) if π ∈ Πp,x for all x ∈ X
such that Πp,x /= Ø. The set of all p-stable policies is denoted by Πp .
Note that since Eq. (4.51) does not depend on δ, we see that an equiv-
alent definition of a policy π that is p-stable from x0 is that Jπ,p,δ (x0 ) < ∞
for some δ > 0 (rather than all δ > 0). Thus the set Πp,x of p-stable policies
from x depends on p and x but not on δ. Let us make some observations:
(a) Rate of convergence to t using p-stable policies: The relation (4.51)
shows that the forcing function p quantifies the rate at which the
destination is approached using the p-stable policies. As an example,
let X = ℜ^n and

p(x) = ‖x‖^ρ,
where ρ > 0 is a scalar. Then the policies π ∈ Πp,x0 are the ones that
force xk towards 0 at a rate faster than O(1/k ρ ), so slower policies
are excluded from Πp,x0 .
(b) Approximation property of Jπ,p,δ (x): Consider a pair (π, x0 ) with
π ∈ Πp,x0 . By taking the limit as δ ↓ 0 in the expression
Jπ,p,δ(x0) = Jπ(x0) + δ Σ_{k=0}^∞ p(xk),
From this equation, we have that if π ∈ Πp,x , then Jπ,p,δ (x) is fi-
nite and differs from Jπ (x) by O(δ). By contrast, if π ∈ / Πp,x , then
Jπ,p,δ (x) = ∞ by the definition of p-stability, even though we may
have Jπ (x) < ∞.
(c) Limiting property of Jˆp (xk ): Consider a pair (π, x0 ) with π ∈ Πp,x0 .
By breaking down Jπ,p,δ (x0 ) into the sum of the costs of the first k
stages and the remaining stages, we have for all δ > 0 and k > 0,
Jπ,p,δ(x0) = Σ_{m=0}^{k−1} g(xm, µm(xm)) + δ Σ_{m=0}^{k−1} p(xm) + Jπk,p,δ(xk),
Also, since Jˆp (xk ) ≤ Jˆp,δ (xk ) ≤ Jπk ,p,δ (xk ), it follows that
satisfies J * (x) ≤ Jˆp (x) ≤ Jˆ+ (x) for all x ∈ X. A policy π is said to be
terminating if it is simultaneously terminating from all x ∈ X such that
Π+_x ≠ Ø. The set of all terminating policies is denoted by Π+.
Note that if the state space X is finite, we have for every forcing
function p
β p+(x) ≤ p(x) ≤ β̄ p+(x),    ∀ x ∈ X,

for some scalars β, β̄ > 0. As a result it can be seen that Πp,x = Π+_x and Ĵp = Ĵ+, so in effect the case where p = p+ is the only case of interest for
finite-state problems.
The notion of a terminating policy is related to the notion of control-
lability. In classical control theory terms, the system xk+1 = f (xk , uk ) is
said to be completely controllable if for every x0 ∈ X, there exists a pol-
icy that drives the state xk to the destination in a finite number of steps.
This notion of controllability is equivalent to the existence of a terminating
policy from each x ∈ X.
One of our main results, to be shown shortly, is that J * , Jˆp , and Jˆ+
are solutions of Bellman’s equation, with J * being the “smallest” solution
and Jˆ+ being the “largest” solution within J . The most favorable situation
arises when J * = Jˆ+ , in which case J * is the unique solution of Bellman’s
equation within J . Moreover, in this case it will be shown that the VI
algorithm converges to J * starting with any J0 ∈ J with J0 ≥ J * , and
the PI algorithm converges to J * as well. Once we prove the fixed point
property of Jˆp , we will be able to bring to bear the regularity ideas of the
preceding section (cf. Prop. 4.4.2).
Since Ĵp(x) < ∞ if and only if Πp,x ≠ Ø [cf. Eqs. (4.51), (4.52)], or equivalently Jπ,p,δ(x) < ∞ for some π and all δ > 0, it follows that X̂p is also the effective domain of Ĵp,δ,

X̂p = { x ∈ X | Πp,x ≠ Ø } = { x ∈ X | Ĵp,δ(x) < ∞ },    ∀ δ > 0.

Note that X̂p may depend on p and may be a strict subset of the effective domain of J*, which is denoted by

X* = { x ∈ X | J*(x) < ∞ };
(cf. Section 3.5.5). The reason is that there may exist a policy π such that
Jπ (x) < ∞, even when there is no p-stable policy from x (for example, no
terminating policy from x).
Our first objective is to show that as δ ↓ 0, the p-δ-perturbed optimal
cost function Jˆp,δ converges to the restricted optimal cost function Jˆp .
where wπ,p,δ is a function such that limδ↓0 wπ,p,δ (x) = 0 for all
x ∈ X.
(b) We have

lim_{δ↓0} Ĵp,δ(x) = Ĵp(x),    ∀ x ∈ X.
Ĵp(x) − ε ≤ Jπ_ε(x) − ε ≤ Jπ_ε,p,δ(x) − ε ≤ Ĵp,δ(x) ≤ Jπ,p,δ(x) = Jπ(x) + wπ,p,δ(x),

where limδ↓0 wπ,p,δ(x) = 0 for all x ∈ X. By taking the limit as ε ↓ 0, we obtain for all δ > 0 and π ∈ Πp,x,

By taking the limit as δ ↓ 0 and then the infimum over all π ∈ Πp,x, we have
We now consider ε-optimal policies, setting the stage for our main proof argument. We know that given any ε > 0, by Prop. 4.4.4(e), there exists an ε-optimal policy for the p-δ-perturbed problem, i.e., a policy π such that Jπ(x) ≤ Jπ,p,δ(x) ≤ Ĵp,δ(x) + ε for all x ∈ X. We address the question whether there exists a p-stable policy π that is ε-optimal for the restricted optimization over p-stable policies, i.e., a policy π that is p-stable simultaneously from all x ∈ X̂p (i.e., π ∈ Πp) and satisfies
Proof: For any ε-optimal policy π_ε for the p-δ-perturbed problem, we have

This implies that π_ε ∈ Πp. Moreover, for all sequences {xk} generated from initial state-policy pairs (π, x0) with x0 ∈ X̂p and π ∈ Πp,x0, we have

Jπ_ε(x0) ≤ Jπ_ε,p,δ(x0) ≤ Ĵp,δ(x0) + ε ≤ Jπ(x0) + δ Σ_{k=0}^∞ p(xk) + ε.

Taking the limit as δ ↓ 0 and using the fact Σ_{k=0}^∞ p(xk) < ∞ (since π ∈ Πp,x0), we obtain
which is a stronger statement than the definition Jˆp (x) = inf π∈Πp,x Jπ (x)
for all x ∈ X. However, it can be shown through examples that there
may not exist a restricted-optimal p-stable policy, i.e., a π ∈ Πp such that
Jπ = Jˆp , even if there exists an optimal policy for the original problem. One
such example is the one-dimensional linear-quadratic problem of Section
3.1.4 for the case where p = p+ . Then, there exists a unique linear stable
policy that attains the restricted optimal cost Jˆ+ (x) for all x, but this
policy is not terminating. Note also that there may not exist a stationary p-ε-optimal policy, since generally in undiscounted nonnegative cost optimal control problems there may not exist a stationary ε-optimal policy (an example is given following Prop. 4.4.8).
We now take the first steps for bringing regularity ideas into the
analysis. We introduce the set of functions Sp given by
Sp = { J ∈ J | J(xk) → 0 for all sequences {xk} generated from initial
      state-policy pairs (π, x0) with x0 ∈ X and π ∈ Πp,x0 }.        (4.59)
In words, Sp consists of the functions in J whose value is asymptotically
driven to 0 by all the policies that are p-stable starting from some x0 ∈ X.
Similar to the analysis of Section 4.4.2, we can prove that the collection
& '
Cp = (π, x0 ) | π ∈ Πp,x0 is Sp -regular. Moreover, Sp is the largest set S
for which Cp is S-regular.
Note that Sp contains Jˆp and Jˆp,δ for all δ > 0 [cf. Eq. (4.54), (4.55)].
Moreover, Sp contains all functions J such that
0 ≤ J ≤ cJˆp,δ
0 ≤ J ≤ cJˆp,δ
Ĵp,δ(x0) ≤ J̃(x0) ≤ J̃(xk) + δ Σ_{m=0}^{k−1} p(xm) + Σ_{m=0}^{k−1} g(xm, µm(xm)),    ∀ k ≥ 1,        (4.62)
where {xk } is the state sequence generated starting from x0 and using π.
We have J˜(xk ) → 0 (since J˜ ∈ Sp and π ∈ Πp by Prop. 4.5.2), so that
lim_{k→∞} [ J̃(xk) + δ Σ_{m=0}^{k−1} p(xm) + Σ_{m=0}^{k−1} g(xm, µm(xm)) ] = Jπ,δ(x0)
We next show our main result in this section, namely that Jˆp is the
unique solution of Bellman’s equation within the set of functions
Wp = {J ∈ Sp | Jˆp ≤ J}. (4.64)
Moreover, we show that the VI algorithm yields Jˆp in the limit for any initial
J0 ∈ Wp. This result is intimately connected with the regularity ideas of Section 4.4. The idea is that the collection Cp = { (π, x0) | π ∈ Πp,x0 } is Sp-regular, as noted earlier. In view of this and the fact that J*_{Cp} = Ĵp,
the result will follow from Prop. 4.4.2 once Jˆp is shown to be a solution of
Bellman’s equation. This latter property is shown essentially by taking the
limit as δ ↓ 0 in Eq. (4.60).
Proof: (a), (b) We first show that Jˆp is a solution of Bellman’s equation.
Since Jˆp,δ is a solution of Bellman’s equation for the p-δ-perturbed problem
(cf. Prop. 4.5.3) and Jˆp,δ ≥ Jˆp [cf. Prop. 4.5.1(b)], we have for all δ > 0,
Ĵp,δ(x) = inf_{u∈U(x)} { g(x, u) + δ p(x) + Ĵp,δ(f(x, u)) }
        ≥ inf_{u∈U(x)} { g(x, u) + Ĵp,δ(f(x, u)) }
        ≥ inf_{u∈U(x)} { g(x, u) + Ĵp(f(x, u)) }.
By taking the limit as δ ↓ 0 and using the fact limδ↓0 Jˆp,δ = Jˆp [cf. Prop.
4.5.1(b)], we obtain
Ĵp(x) ≥ inf_{u∈U(x)} { g(x, u) + Ĵp(f(x, u)) },   ∀ x ∈ X.     (4.67)
Taking the limit as m → ∞, and using the fact limδm ↓0 Jˆp,δm = Jˆp [cf.
Prop. 4.5.1(b)], we have
g(x, u) + Ĵp(f(x, u)) ≥ Ĵp(x),   ∀ x ∈ X, u ∈ U(x),

so that

inf_{u∈U(x)} { g(x, u) + Ĵp(f(x, u)) } ≥ Ĵp(x),   ∀ x ∈ X.     (4.68)
By combining Eqs. (4.67) and (4.68), we see that Jˆp is a solution of Bell-
man’s equation. We also have Jˆp ∈ Sp by Prop. 4.5.3, implying that
Jˆp ∈ Wp and proving part (a) except for the uniqueness assertion. Part
(b) and the uniqueness part of part (a) follow from Prop. 4.4.2; see the
discussion preceding the proposition.
(c) If µ is p-stable and Eq. (4.66) holds, then
Ĵp(x) = g(x, µ(x)) + Ĵp(f(x, µ(x))),   x ∈ X.
By Prop. 4.4.4(b), this implies that Jµ ≤ Jˆp , so µ is optimal over the set of
p-stable policies. Conversely, assume that µ is p-stable and Jµ = Jˆp . Then
by Prop. 4.4.4(b), we have
Ĵp(x) = g(x, µ(x)) + Ĵp(f(x, µ(x))),   x ∈ X,
Proof: Follows from Prop. 4.4.5 in the deterministic special case where
wk takes a single value. Q.E.D.
[the last equality follows from Eq. (4.58)]. In this case, the set Sp+ of Eq.
(4.59) is the entire set J ,
Sp+ = J ,
since for all J ∈ J and all sequences {xk } generated from initial state-policy
pairs (π, x0 ) with x0 ∈ X and π terminating from x0 , we have J(xk ) = 0
for k sufficiently large. Thus, the corresponding set of Eq. (4.64) is
W + = {J ∈ J | Jˆ+ ≤ J}.
Proposition 4.5.6:
(a) Ĵ+ is the largest solution of the Bellman equation (4.65) within
J, i.e., Ĵ+ is a solution and if J′ ∈ J is another solution, then
J′ ≤ Ĵ+.
(b) (VI Convergence) If {Jk } is the sequence generated by the VI
algorithm (4.47) starting with some J0 ∈ J with J0 ≥ Jˆ+ , then
Jk → Jˆ+ .
(c) (Optimality Condition) If µ+ is a terminating stationary policy and

µ+(x) ∈ arg min_{u∈U(x)} { g(x, u) + Ĵ+(f(x, u)) },   ∀ x ∈ X,     (4.69)
Proof: In view of Prop. 4.5.4, we only need to show that Ĵ+ is the largest
solution of the Bellman equation. From Prop. 4.5.4(a), Ĵ+ is a solution
that belongs to J. If J′ ∈ J is another solution, we have J′ ≤ J̃ for some
J̃ ∈ W+, so J′ = T^k J′ ≤ T^k J̃ for all k. Since T^k J̃ → Ĵ+, it follows that
J′ ≤ Ĵ+. Q.E.D.
[Figure: the path of VI and the set of solutions of Bellman's equation, with Ĵp and J* indicated.]
For this choice of J0 , the value Jk (x) generated by VI is the optimal cost
that can be achieved starting from x subject to the constraint that t is
reached in k steps or less. As we have noted earlier, in shortest-path type
problems VI tends to converge faster when started from above.
Consider now the favorable case where terminating policies are suf-
ficient, in the sense that Jˆ+ = J * ; cf. Fig. 4.5.3. Then, from Prop. 4.5.6,
it follows that J * is the unique solution of Bellman’s equation within J ,
and the VI algorithm converges to J * from above, i.e., starting from any
J0 ∈ J with J0 ≥ J * . Under additional conditions, such as finiteness of
U (x) for all x ∈ X [cf. Prop. 4.4.4(d)], VI converges to J * starting from any
J0 ∈ E + (X) with J0 (t) = 0. These results are consistent with our analysis
of Section 3.5.5.
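As an illustration of this convergence behavior, here is a minimal Python sketch (not part of the text) of VI for a small finite-node deterministic shortest path problem in which all cycles have positive length, so terminating policies are sufficient; the graph, arc costs, and starting functions are hypothetical.

```python
# Value iteration for a hypothetical deterministic shortest path problem with
# destination t (cost-free, absorbing) and all cycle lengths positive.
INF = float("inf")

# arcs: state -> list of (next state, arc cost)
arcs = {
    1: [("t", 1.0), (2, 0.5)],
    2: [(3, 0.5), ("t", 2.0)],
    3: [(1, 0.5)],
}

def T(J):
    """One VI sweep: (TJ)(x) = min over arcs of [g(x,u) + J(f(x,u))], with J(t) = 0."""
    return {x: min(g + (0.0 if y == "t" else J[y]) for (y, g) in arcs[x])
            for x in arcs}

def run_vi(J, iters=20):
    for _ in range(iters):
        J = T(J)
    return J

J_from_above = run_vi({x: 10.0 for x in arcs})   # J0 >= J*
J_from_below = run_vi({x: 0.0 for x in arcs})    # J0 = 0
print(J_from_above, J_from_below)                # both converge to J* here
```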
Examples of problems where terminating policies are sufficient in-
clude linear-quadratic problems under the classical conditions of controlla-
bility and observability, and finite-node deterministic shortest path prob-
[Figure: the case Ĵ+ = J*, with the paths of VI and the unique solution of Bellman's equation.]
lems with all cycles having positive length. Note that in the former case,
despite the fact Jˆ+ = J * , there is no optimal terminating policy, since the
only optimal policy is a linear policy that drives the system to the origin
asymptotically, but not in finite time.
Let us illustrate the results of this section with two examples.
p(x) = ‖x‖²,
† In this example, the salient feature of the policy that never stops at an
x ≠ 0 is that it drives the system asymptotically to the destination according to
an equation of the form xk+1 = f (xk ), where f is a contraction mapping. The
example admits generalization to the broader class of optimal stopping problems
that have this property. For simplicity in illustrating our main point, we consider
here the special case where f (x) = γx with γ ∈ (0, 1).
[Figure: the stop cone C, the costs c and (1 − γ)c, and the functions J*(x), Ĵ(x), and Ĵ+(x) for the optimal stopping example, with forcing function p(x) = ‖x‖.]
Then the p+ -stable policies are the terminating policies. Since stopping at
some time and incurring the cost c is a requirement for a p+ -stable policy, it
follows that the optimal p+ -stable policy is to stop as soon as possible, i.e.,
stop at every state. The corresponding restricted optimal cost function is
Ĵ+(x) = c if x ≠ 0, and Ĵ+(x) = 0 if x = 0.
The optimal cost function of the corresponding p+ -δ-perturbed problem is
Ĵp+,δ(x) = c + δ if x ≠ 0, and Ĵp+,δ(x) = 0 if x = 0,
since in the p+ -δ-perturbed problem it is again optimal to stop as soon as
possible, at cost c + δ. Here the set Sp+ is equal to J, and the corresponding
set W+ is equal to { J ∈ J | Ĵ+ ≤ J }.
However, there are infinitely many additional solutions of Bellman's
equation between the smallest and largest solutions J* and Ĵ+. For example,
when n > 1, functions J ∈ J such that J(x) = J ∗ (x) for x in some cone
and J(x) = Jˆ+ (x) for x in the complementary cone are solutions; see Fig.
4.5.4. There is also a corresponding infinite number of regions of convergence
Wp of VI [cf. Eq. (4.64)]. Also VI converges to J ∗ starting from any J0 with
0 ≤ J0 ≤ J ∗ [cf. Prop. 4.4.4(d)]. Figure 4.5.5 illustrates additional solutions
of Bellman’s equation of a different character.
Generally, the standard PI algorithm [cf. Eqs. (4.48), (4.49)] produces un-
clear results under our assumptions. The following example provides an
instance where the PI algorithm may converge to either an optimal or a
strictly suboptimal policy.
Consider the case X = {0, 1}, U (0) = U (1) = {0, 1}, and the destination is
t = 0. Let also
f(x, u) = 0 if u = 0 and f(x, u) = x if u = 1, while g(x, u) = 1 if u = 0 and x = 1, and g(x, u) = 0 if u = 1 or x = 0.
This is a one-state-plus-destination shortest path problem where the control
u = 0 moves the state from x = 1 to x = 0 (the destination) at cost 1,
while the control u = 1 keeps the state unchanged at cost 0 (cf. the problem
of Section 3.1.1). The policy µ∗ that keeps the state unchanged is the only
optimal policy, with Jµ∗ (x) = J ∗ (x) = 0 for both states x. However, under
any forcing function p with p(1) > 0, the policy µ̂, which moves from state 1
to 0, is the only p-stable policy, and we have Jµ̂ (1) = Jˆp (1) = 1. The standard
PI algorithm (4.48), (4.49), when started with µ∗, will repeat µ∗. If this
algorithm is started with µ̂, it may generate µ∗ or it may repeat µ̂, depending
on how the policy improvement iteration is implemented. The reason is that
for both x we have
µ̂(x) ∈ arg min_{u∈{0,1}} { g(x, u) + Ĵp(f(x, u)) },
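The tie in this policy improvement step can be checked directly; the following small Python sketch (not part of the text) computes the two Q-factors at x = 1 for this example and shows that the improved policy depends on how the tie is broken.

```python
# Two-state example: f(x,0) = 0, f(x,1) = x, g(1,0) = 1, g(1,1) = 0,
# with destination t = 0 cost-free and absorbing, and J-hat_p(1) = 1.
def q_values(J1):
    """Q-factors at x = 1 given a value J1 for state 1 (J(0) = 0)."""
    return {0: 1.0 + 0.0,      # g(1,0) + J(0): move to t at cost 1
            1: 0.0 + J1}       # g(1,1) + J(1): stay at cost 0

q = q_values(1.0)              # policy evaluation of mu-hat gives J(1) = 1
print(q)                       # {0: 1.0, 1: 1.0}: a tie
# Depending on tie-breaking, the improved policy keeps u = 0 (mu-hat) or
# switches to u = 1 (mu*), as discussed in the text.
print("keep u=0" if min(q, key=lambda u: (q[u], u)) == 0 else "switch to u=1")
```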
We implicitly assume here that Eq. (4.72) can be solved for Jµk , and that
the minimum in Eq. (4.73) is attained for each x ∈ X, which is true under
some compactness condition on either U(x) or the level sets of the function
g(x, ·) + Jk(f(x, ·)), or both.
cf. Eq. (4.29). Using µk and µk+1 in place of µ and µ̄, we see that the
sequence {Jµk } generated by PI converges monotonically to some function
J∞ ∈ E+(X), i.e., Jµk ↓ J∞. Moreover, from Eq. (4.74) we have

J∞(x) ≥ inf_{u∈U(x)} { g(x, u) + J∞(f(x, u)) },   x ∈ X,
as well as
g(x, u) + Jµk(f(x, u)) ≥ J∞(x),   x ∈ X, u ∈ U(x).
We now take the limit in the second relation as k → ∞, then take the
infimum over u ∈ U(x), and then combine with the first relation, to obtain

J∞(x) ≥ inf_{u∈U(x)} { g(x, u) + J∞(f(x, u)) } ≥ J∞(x),   x ∈ X.
Jµk → Jˆp . Related algorithms were given in Sections 3.4 and 3.5.1. The
following assumption requires that the algorithm generates p-stable policies
exclusively, which can be quite restrictive. For instance it is not satisfied
for the problem of Example 4.5.3.
Assumption 4.5.1: For each δ > 0 there exists at least one p-stable
stationary policy µ such that Jµ,p,δ ∈ Sp. Moreover, given a p-stable
stationary policy µ and a scalar δ > 0, every stationary policy µ̄ such
that

µ̄(x) ∈ arg min_{u∈U(x)} { g(x, u) + Jµ,p,δ(f(x, u)) },   ∀ x ∈ X,
≥ (T^{mk}_{µk+1,δk} J̄)(x0),

as well as

Jµk,δk(x0) ≥ (T Jµk,δk)(x0) + δk p(x0) ≥ Jµk+1,δk+1(x0).
Note that despite the fact Jµk → Jˆp , the generated sequence {µk }
may exhibit some serious pathologies in the limit. In particular, if U is a
metric space and {µk }K is a subsequence of policies that converges to some
µ̄, in the sense that
J0 ∈ Wp,   J0 ≥ T J0.     (4.78)
Proof: Since J0 ≥ Jˆp and Jˆp = T Jˆp [cf. Prop. 4.5.6(a)], all operations
on any of the functions Jk with Tµk or T maintain the inequality Jk ≥ Jˆp
for all k, so that Jk ∈ Wp for all k. Also the conditions J0 ≥ T J0 and
Tµk Jk = T Jk imply that
Jk ≥ T Jk ≥ Jk+1 , k = 0, 1, . . . . (4.79)
Thus Jk ↓ J∞ for some J∞ , which must satisfy J∞ ≥ Jˆp , and hence belong
to Wp . By taking limit as k → ∞ in Eq. (4.79) and using an argument
similar to the one in the proof of Prop. 4.5.8, it follows that J∞ = T J∞ .
By Prop. 4.5.6(a), this implies that J∞ ≤ Jˆp . Together with the inequality
J∞ ≥ Jˆp shown earlier, this proves that J∞ = Jˆp . Q.E.D.
where xk and uk are the state and control at stage k, which belong to sets X
and U , wk is a random disturbance that takes values in a countable set W
with given probability distribution P(wk | xk, uk), and f : X × U × W ↦ X
is a given function (cf. Example 1.2.1 in Chapter 1). The state and control
spaces X and U are arbitrary, but we assume that W is countable to bypass
complex measurability issues in the choice of control (see [BeS78]).
The control u must be chosen from a constraint set U(x) ⊂ U that
may depend on x. The expected cost per stage, E{g(x, u, w)}, is assumed
nonnegative:

0 ≤ E{g(x, u, w)} < ∞,   ∀ x ∈ X, u ∈ U(x), w ∈ W.
This is a special case of an SSP problem, where the cost per stage
is nonnegative, but the state and control spaces are arbitrary. It is also a
special case of the nonnegative cost stochastic optimal control problem of
Section 4.4.2. We adopt the notation and terminology of that section, but
we review it here briefly for convenience.
Given an initial state x0, a policy π = {µ0, µ1, . . .}, when applied
to the system (4.80), generates a random sequence of state-control pairs
(xk, µk(xk)), k = 0, 1, . . . , with cost

Jπ(x0) = E^π_{x0}{ Σ_{k=0}^∞ g(xk, µk(xk), wk) },

where E^π_{x0}{·} denotes expectation with respect to the probability measure
corresponding to initial state x0 and policy π. For a stationary policy µ, the
corresponding cost function is denoted by Jµ . The optimal cost function is
Since t is cost-free and absorbing, this set contains the cost functions Jπ of
all π ∈ Π, as well as J * .
Here the results of Section 4.3 under Assumption I apply, and the
optimal cost function J * is a solution of the Bellman equation
J(x) = inf_{u∈U(x)} E{ g(x, u, w) + J(f(x, u, w)) },   x ∈ X,
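The following minimal Python sketch (not part of the text) illustrates one application of this Bellman operator when the countable disturbance distribution is truncated to finitely many atoms, so the expectation becomes a finite sum; the model functions in the usage example are hypothetical.

```python
def bellman_map(x, U, w_atoms, p, g, f, J):
    """(TJ)(x) = inf_u sum_w p(w|x,u) * [ g(x,u,w) + J(f(x,u,w)) ]."""
    best = float("inf")
    for u in U(x):
        q = sum(p(w, x, u) * (g(x, u, w) + J(f(x, u, w))) for w in w_atoms)
        best = min(best, q)
    return best

# Toy usage (hypothetical model): states {0 (termination), 1, 2}, two controls, w in {0, 1}.
J0 = {0: 0.0, 1: 0.0, 2: 0.0}
U = lambda x: [0, 1]
f = lambda x, u, w: 0 if u == 0 else (x if w == 1 else 0)
g = lambda x, u, w: (1.0 if x != 0 else 0.0) if u == 0 else 0.5 * (x != 0)
p = lambda w, x, u: 0.5
print({x: bellman_map(x, U, [0, 1], p, g, f, J0) for x in [1, 2]})
```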
Proposition 4.6.1:
(a) A policy π is proper at a state x ∈ X if and only if Jπ,δ(x) < ∞
for all δ > 0.

(b) We have Ĵδ(x) < ∞ for all δ > 0 if and only if x ∈ X̂.

(c) For every ε > 0 and δ > 0, a policy πε that is ε-optimal for the
δ-perturbed problem is proper at all x ∈ X̂, and such a policy
exists.
Proof: (a) Follows from Eq. (4.50) and the definition (4.81) of a proper
policy.
(b) If x ∈ X̂ there exists a policy π that is proper at x, and by part (a),
Ĵδ(x) ≤ Jπ,δ(x) < ∞ for all δ > 0. Conversely, if Ĵδ(x) < ∞, there exists π
such that Jπ,δ(x) < ∞, implying [by part (a)] that π ∈ Π̂x, so that x ∈ X̂.

(c) An ε-optimal πε exists by Prop. 4.4.4(e). We have Jπε,δ(x) ≤ Ĵδ(x) + ε
for all x ∈ X. Hence Jπε,δ(x) < ∞ for all x ∈ X̂, implying by part (a) that
πε is proper at all x ∈ X̂. Q.E.D.
The next proposition shows that the cost function Ĵδ of the δ-perturbed
problem can be used to approximate Ĵ.

Jπε(x) ≤ Ĵ(x) + ε,   ∀ x ∈ X.
Proof: Let us fix δ > 0, and for a given ε > 0, let πε be a policy that is
proper at all x ∈ X̂ and is ε-optimal for the δ-perturbed problem [cf. Prop.
4.6.1(c)]. By using Eq. (4.50), we have for all ε > 0, x ∈ X̂, and π ∈ Π̂x,

Ĵ(x) − ε ≤ Jπε(x) − ε ≤ Jπε,δ(x) − ε ≤ Ĵδ(x) ≤ Jπ,δ(x) = Jπ(x) + wπ,δ(x),

where

wπ,δ(x) = δ Σ_{k=0}^∞ rk(π, x) < ∞,   ∀ x ∈ X̂, π ∈ Π̂x.

By taking the limit as ε ↓ 0, we obtain for all δ > 0 and π ∈ Π̂x,

Ĵ(x) ≤ Ĵδ(x) ≤ Jπ(x) + wπ,δ(x),   ∀ x ∈ X̂, π ∈ Π̂x.

By taking the limit as δ ↓ 0 and then the infimum over π ∈ Π̂x, it follows that

Ĵ(x) ≤ lim_{δ↓0} Ĵδ(x) ≤ inf_{π∈Π̂x} Jπ(x) = Ĵ(x),   ∀ x ∈ X̂,

from which Ĵ(x) = lim_{δ↓0} Ĵδ(x) for all x ∈ X̂. Moreover, by Prop. 4.6.1(b),
Ĵδ(x) = Ĵ(x) = ∞ for all x ∉ X̂, so that Ĵ(x) = lim_{δ↓0} Ĵδ(x) for all x ∈ X.
We also have

Jπε(x) ≤ Jπε,δ(x) ≤ Ĵδ(x) + ε ≤ Jπ(x) + δ Σ_{k=0}^∞ rk(π, x) + ε,   ∀ x ∈ X̂, π ∈ Π̂x.
Main Results
By Prop. 4.4.4(a), Ĵδ solves Bellman's equation for the δ-perturbed prob-
lem, while by Prop. 4.6.2, lim_{δ↓0} Ĵδ(x) = Ĵ(x). This suggests that Ĵ solves
the unperturbed Bellman equation, which is the “limit” as δ ↓ 0 of the
δ-perturbed version. Indeed we will show a stronger result, namely that Ĵ
is the unique solution of Bellman's equation within the set of functions

Ŵ = {J ∈ S | Ĵ ≤ J},     (4.83)

where

S = { J ∈ J | E^π_{x0}{J(xk)} → 0, ∀ (π, x0) with π ∈ Π̂x0 }.     (4.84)

Here E^π_{x0}{J(xk)} denotes the expected value of the function J along the
sequence {xk} generated starting from x0 and using π. Similar to earlier
proofs in Sections 4.4 and 4.5, we have that the collection

C = { (π, x) | π ∈ Π̂x }     (4.85)

is S-regular.
We first show a preliminary result. Given a policy π = {µ0 , µ1 , . . .},
we denote by πk the policy
πk = {µk , µk+1 , . . .}. (4.86)
Proposition 4.6.3:
(a) For all pairs (π, x0) ∈ C and k = 0, 1, . . . , we have

0 ≤ E^π_{x0}{Ĵ(xk)} ≤ E^π_{x0}{Jπk(xk)} < ∞,

Since Jπ,δ(x0) < ∞ [cf. Prop. 4.6.1(a)], it follows that E^π_{x0}{Jπk,δ(xk)} <
∞. Hence for all xk that can be reached with positive probability using π
and starting from x0, we have Jπk,δ(xk) < ∞, implying [by Prop. 4.6.1(a)]
that (πk, xk) ∈ C. Hence Ĵ(xk) ≤ Jπk(xk), and by applying E^π_{x0}{·}, the
result follows.
(b) We have for all (π, x0 ) ∈ C,
Jπ(x0) = E^π_{x0}{ g(x0, µ0(x0), w0) } + E^π_{x0}{ Jπ1(x1) },     (4.87)

and for m = 1, 2, . . . ,

E^π_{x0}{ Jπm(xm) } = E^π_{x0}{ g(xm, µm(xm), wm) } + E^π_{x0}{ Jπm+1(xm+1) },     (4.88)
where {xm } is the sequence generated starting from x0 and using π. By
using repeatedly the expression (4.88) for m = 1, . . . , k − 1, and combining
it with Eq. (4.87), we obtain for all k = 1, 2, . . . ,
(a) Ĵ is the unique solution of the Bellman Eq. (4.65) within the set
Ŵ of Eq. (4.83).

(b) (VI Convergence) If {Jk} is the sequence generated by the VI
algorithm (4.47) starting with some J0 ∈ Ŵ, then Jk → Ĵ.

(c) (Optimality Condition) If µ is a stationary policy that is proper
at all x ∈ X̂ and

µ(x) ∈ arg min_{u∈U(x)} E{ g(x, u, w) + Ĵ(f(x, u, w)) },   ∀ x ∈ X,     (4.89)

then µ is optimal over the set of proper policies, i.e., Jµ = Ĵ.
Conversely, if µ is proper at all x ∈ X̂ and Jµ = Ĵ, then µ
satisfies the preceding condition (4.89).
so that

inf_{u∈U(x)} E{ g(x, u, w) + Ĵ(f(x, u, w)) } ≥ Ĵ(x),   ∀ x ∈ X.     (4.91)
We illustrate Prop. 4.6.4 in Fig. 4.6.1. Let us consider now the favor-
able case where the set of proper policies is sufficient in the sense that it can
achieve the same optimal cost as the set of all policies, i.e., Jˆ = J * . This
is true for example if all policies are proper at all x such that J * (x) < ∞.
Moreover it is true in some of the finite-state formulations of SSP that we
discussed in Chapter 3; see also the subsequent Prop. 4.6.5. When Jˆ = J * ,
it follows from Prop. 4.6.4 that J* is the unique solution of Bellman's equation
within Ŵ, and that the VI algorithm converges to J* starting from
any J0 ∈ Ŵ. Under an additional compactness condition, such as finiteness
[Figure: J*, Ĵ, and the region Ŵ from which VI converges to Ĵ.]
Example 4.6.1
Let X = ℜ, t = 0, and assume that there is only one control at each state,
and hence a single policy π. The disturbance wk takes two values: 1 and 0
with probabilities α ∈ (0, 1) and 1 − α, respectively. The system equation is
xk+1 = wk xk / α,
and there is no cost at each state and stage:
g(x, u, w) ≡ 0.
Thus from state xk we move to state xk /α with probability α and to the
termination state t = 0 with probability 1 − α.
Here, the unique policy is stationary and proper at all x ∈ X, and we have

J*(x) = Ĵ(x) = 0,   ∀ x ∈ X.
Bellman's equation has the form

J(x) = (1 − α) J(0) + α J(x/α),

which within J reduces to

J(x) = α J(x/α),   ∀ J ∈ J, x ∈ X.     (4.92)
It can be seen that Bellman’s equation has an infinite number of solu-
tions within J in addition to J ∗ and Jˆ: any positively homogeneous function,
such as, for example,
J(x) = γ|x|, γ > 0,
is a solution. Consistently with Prop. 4.6.4(a), none of these solutions belongs
to Ŵ, since xk is either equal to x0/α^k (with probability α^k) or equal to 0
(with probability 1 − α^k). For example, in the case of J(x) = γ|x|, we have

E^π_{x0}{ J(xk) } = α^k γ |x0/α^k| = γ |x0|,   ∀ k ≥ 0,
so J(xk ) does not converge to 0, unless x0 = 0. Moreover, none of these
additional solutions seems to be significant in some discernible way.
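A quick Monte Carlo check of this calculation (a sketch, not part of the text; the numerical values of α, γ, and x0 are arbitrary):

```python
import random

# For J(x) = gamma*|x|, the expectation E{J(x_k)} stays at gamma*|x_0| for all k
# and does not converge to 0, so this solution of Bellman's equation is not in W-hat.
alpha, gamma, x0, k_max, n_runs = 0.6, 2.0, 1.0, 8, 200_000

def J(x):
    return gamma * abs(x)

estimates = [0.0] * (k_max + 1)
for _ in range(n_runs):
    x = x0
    for k in range(k_max + 1):
        estimates[k] += J(x) / n_runs
        # move to x/alpha with probability alpha, to the absorbing state 0 otherwise
        x = x / alpha if random.random() < alpha else 0.0

print([round(e, 3) for e in estimates])   # all approximately gamma*|x0| = 2.0
```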
Let us consider the special case where the cost per stage g is bounded over
X × U × W , i.e.,
sup_{(x,u,w)∈X×U×W} g(x, u, w) < ∞.     (4.93)

Since we have

Jπ(x0) ≤ ( sup_{(x,u,w)∈X×U×W} g(x, u, w) ) · Σ_{k=0}^∞ rk(π, x0) < ∞,   ∀ π ∈ Π̂x0,
Ŵb = {J ∈ B | Ĵ ≤ J}.
Figure 4.6.2. Schematic illustration of Prop. 4.6.5 for a nonnegative cost SSP
problem. The functions J* and Ĵ are the smallest and largest solutions, respectively,
of Bellman's equation within the set B. Moreover, the VI algorithm
converges to Ĵ starting from J0 ∈ Ŵb = {J ∈ B | Ĵ ≤ J}.
Proof: (a) Since the cost function of a uniformly proper policy belongs to
B, we have Ĵ ∈ B. On the other hand, for all J ∈ B, we have

E^π_{x0}{ J(xk) } ≤ ( sup_{x∈X̂} J(x) ) · rk(π, x0) → 0,   ∀ π ∈ Π̂x0.

It follows that the set Ŵb is contained in Ŵ, while the function Ĵ belongs
to Ŵb. Since Ŵb is unbounded above within the set B, for every solution
J′ ∈ B of Bellman's equation we have J′ ≤ J for some J ∈ Ŵb, and hence
also J′ ≤ J̃ for some J̃ in the set S of Eq. (4.84). It follows from Prop.
4.4.2(a) and the S-regularity of the collection (4.85) that J′ ≤ Ĵ.
while Ĵ(x) < ∞ for all x ∈ X* since X* = X̂. In view of the finiteness of
X, we can find a sufficiently large c such that Ĵ ≤ cJ*, so by Prop. 4.4.6,
it follows that Ĵ = J*. Q.E.D.
state, which is the termination state for the equivalent SSP problem. This
technique is strictly limited to finite-state problems, since in general the
conditions J*(x) > 0 for all x ≠ t and X* = X̂ do not imply that Ĵ = J*,
even under the bounded cost and uniform properness assumptions of this
section (see the deterministic stopping Example 4.5.2).
implies Eq. (4.95) (see [Ber77], Prop. 12, or [BeS78], Prop. 5.10).
The analysis of convergence of VI to J * under Assumption I and
starting with an initial condition J0 ≥ J * is far more complicated than for
the initial condition J0 = J̄. A principal reason for this is the multiplicity
of solutions of Bellman's equation within the set { J ∈ E+(X) | J ≥ J̄ }.
We know that J * is the smallest solution (cf. Prop. 4.4.9), and an interest-
ing issue is the characterization of the largest solution and other solutions
within some restricted class of functions of interest. We substantially re-
solved this question in Sections 4.5 and 4.6 for infinite-spaces deterministic
and stochastic shortest path problems, respectively (as well as in Sections
3.5.1 and 3.5.2 for finite-state stochastic shortest path and affine monotonic
problems). Generally, optimal control problems with nonnegative cost per
stage can typically be reduced to problems with a cost-free and absorb-
ing termination state (see [BeY16] for an analysis of the finite-state case).
However, the fuller characterization of the set of solutions of Bellman’s
equation for general abstract DP models under Assumption I requires fur-
ther investigation.
Optimistic PI and λ-PI under Assumption D have not been considered
prior to the 2013 edition of this book, and the corresponding analysis of
Section 4.3.3 is new. See [BeI96], [ThS10a], [ThS10b], [Ber11b], [Sch11],
[Ber16b] for analyses of λ-PI for discounted and SSP problems.
Section 4.4: The definition and analysis of regularity for nonstationary
policies was introduced in the author’s paper [Ber15]. We have primarily
used regularity in this book to analyze the structure of the solution set of
Bellman’s equation, and to identify the region of attraction of value and
policy iteration algorithms. This analysis is multifaceted, so it is worth
summarizing here:
(a) We have characterized the fixed point properties of the optimal cost
function J * and the restricted optimal cost function JC* over S-regular
collections C, for various sets S. While J * and JC* need not be fixed
points of T , they are fixed points in a large variety of interesting
contexts (Sections 3.3-3.5 and 4.4-4.6).
(b) We have shown that when J * = JC* , then J * is the unique solution of
Bellman’s equation in several interesting noncontractive contexts. In
particular, Section 3.3 deals with an important case that covers among
others, the most common type of stochastic shortest path problems.
However, even when J * /= JC* , the functions J * and JC* often bound
the set of solutions from below and/or from above (see Sections 3.5.1,
3.5.2, 4.5, 4.6).
(c) Simultaneously with the analysis of the fixed point properties of J *
and JC* , we have used regularity to identify the region of convergence
of value iteration. Often convergence to JC* can be shown from start-
ing functions J ≥ JC* , assuming that JC* is a fixed point of T . In
the favorable case where J * = JC* , convergence to J * can often be
shown from every starting function of interest. In addition regularity
has been used to guarantee the validity of policy iteration algorithms
that generate exclusively regular policies, and are guaranteed to con-
verge to J * or JC* .
(d) We have been able to characterize some of the solutions of Bellman’s
equation, but not the entire set. Generally, there may exist an infinite
number of solutions, and some of them may not be associated with
an S-regular collection for any set S, unless we change the starting
function J¯ that is part of the definition of the cost function Jπ of the
policies. There is a fundamental difficulty here: the solutions of the
Bellman equation J = T J do not depend on J̄, but S-regularity of
a collection of policy-state pairs depends strongly on J̄. A sharper
characterization of the solution set of Bellman’s equation remains an
open interesting question, in both specific problem contexts as well
as in generality.
The use of regularity in the analysis of undiscounted and discounted
stochastic optimal control in Sections 4.4.2 and 4.4.3 is new, and was pre-
sented in the author’s paper [Ber15]. The analysis of convergent models in
Section 4.4.4, under the condition
J*(x) ≥ J̄(x) > −∞,   ∀ x ∈ X,
Section 4.5: This section follows the author’s paper [Ber17a]. The issue
of the connection of optimality with stability (and also with controllability
and observability) was raised in the classic paper by Kalman [Kal60] in the
context of linear-quadratic problems.
The set of solutions of the Riccati equation has been extensively inves-
tigated starting with the papers by Willems [Wil71] and Kucera [Kuc72],
[Kuc73], which were followed up by several other works; see the book
by Lancaster and Rodman [LaR95] for a comprehensive treatment. In
these works, the “largest” solution of the Riccati equation is referred to
as the “stabilizing” solution, and the stability of the corresponding policy
is shown, although the author could not find an explicit statement in the
literature regarding the optimality of this policy within the class of all lin-
ear stable policies. Also the lines of analysis of these works are tied to the
structure of the linear-quadratic problem and are unrelated to our analysis
of Section 4.5, which is based on semicontractive ideas.
Section 4.6: Proper policies for infinite-state SSP problems have been
considered earlier in the works of Pliska [Pli78], and James and Collins
[JaC06], where they are called “transient.” There are a few differences
between the frameworks of [Pli78], [JaC06] and Section 4.6, which impact
on the results obtained. In particular, the papers [Pli78] and [JaC06] use
a related (but not identical) definition of properness to the one of Section
4.6, while the notion of a transient policy used in [JaC06] coincides with
the notion of a uniformly proper policy of Section 4.6.2 when X̂ = X.
Furthermore, [Pli78] and [JaC06] do not consider the notion of policy that is
“proper at a state.” The paper [Pli78] assumes that all policies are transient,
that g is bounded, and that J * is real-valued. The paper [JaC06] allows
for nontransient policies that have infinite cost from some initial states, and
extends the analysis of Bertsekas and Tsitsiklis [BeT91] from finite state
space to infinite state space (addressing also measurability issues). Also,
[JaC06] allows the cost per stage g to take both positive and negative values,
and uses assumptions that guarantee that J* = Ĵ, that J* is real-valued,
and that improper policies cannot be optimal. Instead, in Section 4.6 we
allow that J* ≠ Ĵ and that J* can take the value ∞, while requiring that
g is nonnegative and that the disturbance space W is countable.
The analysis of Section 4.6 comes from the author’s paper [Ber17b],
and is most closely related to the SSP analysis under the weak conditions of
Section 3.5.1, where we assumed that the state space is finite, but allowed
g to take both positive and negative values. The extension of some of
our results of Section 4.6 to SSP problems where g takes both positive and
negative values may be possible; Exercises 4.8 and 4.9 suggest some research
directions. However, our analysis of infinite-spaces SSP problems in this
chapter relies strongly on the nonnegativity of g and cannot be extended
without major modifications. In this connection, it is worth mentioning
the example of Section 3.1.2, which shows that J * may not be a solution
EXERCISES
Show that J ∗ (x) = −1 for all x, but there is no policy (stationary or not) that
attains the optimal cost starting from x.
Solution: Since a cost is incurred only upon stopping, and the stopping cost is
greater than −1, we have Jµ(x) > −1 for all x and µ. On the other hand, starting
from any state x and stopping at x + n yields a cost −1 + 1/(x + n), so by taking n
sufficiently large, we can attain a cost arbitrarily close to −1. Thus J*(x) = −1
for all x, but no policy can attain this optimal cost.
For the problem of Exercise 4.1, show that the policy µ that never stops is not
optimal but satisfies (Tµ J*)(x) = (T J*)(x) for all x.
Let

X = ℜ,   U(x) ≡ (0, 1],   J̄(x) ≡ 0,

H(x, u, J) = |x| + J(ux),   ∀ x ∈ X, u ∈ U(x).
Let µ(x) = 1 for all x ∈ X. Then Jµ(x) = ∞ if x ≠ 0 and Jµ(0) = 0. Verify that
Tµ Jµ = T Jµ. Verify also that J*(x) = |x|, and hence µ is not optimal.
This exercise shows that under Assumptions I and D, it is possible to use a com-
putational method based on mathematical programming when X = {1, . . . , n}.
(a) Under Assumption I, show that J ∗ is the unique solution of the following
optimization problem in z = (z1 , . . . , zn ):
minimize Σ_{i=1}^n zi

subject to zi ≥ J̄(i),  zi ≥ inf_{u∈U(i)} H(i, u, z),  i = 1, . . . , n.
(b) Under Assumption D, show that J ∗ is the unique solution of the following
optimization problem in z = (z1 , . . . , zn ):
maximize Σ_{i=1}^n zi

subject to zi ≤ J̄(i),  zi ≤ H(i, u, z),  i = 1, . . . , n,  u ∈ U(i).
Solution: (a) Any feasible solution z of the given optimization problem satisfies
z ≥ J¯ as well as zi ≥ inf u∈U (i) H(i, u, z) for all i = 1, . . . , n, so that z ≥ T z. It
follows from Prop. 4.4.9 that z ≥ J ∗ , which implies that J ∗ is an optimal solution
of the given optimization problem. Also J* is the unique optimal solution since
if z is feasible and z ≠ J*, the inequality z ≥ J* implies that Σ_i zi > Σ_i J*(i),
so z cannot be optimal.
(b) Any feasible solution z of the given optimization problem satisfies z ≤ J¯ as
well as zi ≤ H(i, u, z) for all i = 1, . . . , n and u ∈ U (i), so that z ≤ T z. It follows
from Prop. 4.3.6 that z ≤ J ∗ , which implies that J ∗ is an optimal solution of
the given optimization problem. Similar to part (a), J ∗ is the unique optimal
solution.
(a) Show that Assumption I holds, and that the optimal cost function has the
form

J*(x) = 0 if x ∈ X*, and J*(x) = ∞ otherwise,
where X ∗ is some subset of X.
(b) Consider the sequence of sets {Xk}, where

Xk = { x ∈ X | (T^k J̄)(x) = 0 }.

Show that Xk+1 ⊂ Xk for all k, and that X* ⊂ ∩_{k=0}^∞ Xk. Show also that
convergence of VI (i.e., T^k J̄ → J*) is equivalent to X* = ∩_{k=0}^∞ Xk.
are compact for all k greater than some index k̄. Hint: Use Prop. 4.3.14.
Solution: Let Ê(X) be the subset of E (X) that consists of functions that take
only the two values 0 and ∞, and for all J ∈ Ê(X) denote

D(J) = { x ∈ X | J(x) = 0 }.

Note that for all J ∈ Ê(X) we have Tµ J ∈ Ê(X), T J ∈ Ê(X), and that

D(Tµ J) = { x ∈ X | x ∈ D(J), f(x, µ(x), w) ∈ D(J), ∀ w ∈ W(x, µ(x)) },
(a) For all J ∈ Ê(X), we have D(Tµ J) ⊂ D(J) and Tµ J ≥ J, so condition (1) of
Assumption I holds, and it is easily verified that the remaining two conditions of
Assumption I also hold. We have J̄ ∈ Ê(X), so for any policy π = {µ0, µ1, . . .},
we have Tµ0 · · · Tµk J̄ ∈ Ê(X). It follows that Jπ, given by

Jπ = lim_{k→∞} Tµ0 · · · Tµk J̄,
also belongs to Ê(X), and the same is true for J ∗ = inf π∈Π Jπ . Thus J ∗ has the
given form with D(J ∗ ) = X ∗ .
(b) Since {T^k J̄} is monotonically nondecreasing we have D(T^{k+1} J̄) ⊂ D(T^k J̄),
or equivalently Xk+1 ⊂ Xk for all k. Generally, for a sequence {Jk} ⊂ Ê(X), if
Jk ↑ J, we have J ∈ Ê(X) and D(J) = ∩_{k=0}^∞ D(Jk). Thus convergence of VI (i.e.,
T^k J̄ ↑ J*) is equivalent to D(J*) = ∩_{k=0}^∞ D(Jk), or X* = ∩_{k=0}^∞ Xk.

Uk(x, λ) = { u ∈ U(x) | H(x, u, T^k J̄) ≤ λ }

are compact for every x ∈ X, λ ∈ ℜ, and for all k greater than some integer k̄.
It can be seen that Uk(x, λ) is equal to the set

Ûk(x) = { u ∈ U(x) | f(x, u, w) ∈ Xk, ∀ w ∈ W(x, u) }
Consider the four cases of pairs of values (b, q) where b ∈ {0, 1} and q ∈ {0, 1}.
For each case, use the theory of Section 4.5 to find the optimal cost function
J ∗ and the optimal cost function over stable policies Jˆ+ , and to describe the
convergence behavior of VI.

Jk+1(x1, x2) = min_u { (x1)² + (x2)² + (u)² + Jk(γx1, x1 + x2 + u) },   if x1 ≠ 0.
It can be seen that the VI iterates Jk (0, x2 ) evolve as in the case of a single state
variable problem, where x1 is fixed at 0. For x1 ≠ 0, the VI iterates Jk(x1, x2)
diverge to ∞.
When b = 1 and q = 0, we have J*(x) ≡ 0, while 0 < Ĵ+(x) < ∞ for
all x ≠ 0. Similar to Example 4.5.1, the VI algorithm converges to Ĵ+ starting
from any initial condition J0 ≥ Jˆ+ . The functions J ∗ and Jˆ+ are real-valued
and satisfy Bellman's equation, which has the form

J(x1, x2) = min_u { (u)² + J(γx1 + u, x1 + x2 + u) }.
However, Bellman’s equation has additional solutions, other than J ∗ and Jˆ+ .
One of these is

Ĵ(x1, x2) = P (x1)²,

where P = γ² − 1 (cf. the example of Section 3.5.4).
4.5.2 [Ĵ+(x) = c for all x ≠ 0, which corresponds to the policy that stops
at all states].
Solution: (a) It can be seen that the Bellman equation for the (δ, β)-perturbed
version of the problem is

J(x) = min{ c, δβ + (1 − δ)(‖x‖ + J(γx)) }   if x ≠ 0,   and   J(x) = 0   if x = 0,
The latter equation involves a bounded cost per stage, and hence according to
the theory of Section 4.6, has a unique solution within B, when all policies are
proper.
(b) Evident since the effect of δ on the cost of the optimal policy of the problem
of Example 4.5.2 diminishes as δ → 0.
(c) Since termination at cost c is inevitable (with probability 1) under every
policy, the optimal policy for the (δ, β)-perturbed version of the problem is to
stop as soon as possible.
The purpose of this exercise is to adapt the perturbation approach of Section 3.4
so that it can be used in conjunction with the regularity notion for nonstationary
policies of Definition 4.4.1. Given a set of functions S ⊂ E (X) and a collection C
of policy-state pairs (π, x) that is S-regular, let JC∗ be the restricted optimal cost
function defined by
We refer to the problem associated with the mappings Tµ,δ as the δ-perturbed
problem. The cost function of a policy π = {µ0 , µ1 , . . .} ∈ Π for this problem is
Jπ,δ = lim sup_{k→∞} Tµ0,δ · · · Tµk,δ J̄,
and the optimal cost function is Jˆδ = inf π∈Π Jπ,δ . Assume that for every δ > 0:
(1) Jˆδ satisfies the Bellman equation of the δ-perturbed problem, Jˆδ = Tδ Jˆδ .
(2) For every x ∈ X, we have inf (π,x)∈C Jπ,δ (x) = Jˆδ (x).
(3) For all x ∈ X and (π, x) ∈ C, we have
Then JC∗ is a fixed point of T and the conclusions of Prop. 4.4.2 hold. Moreover,
we have
JC* = lim_{δ↓0} Ĵδ.
Solution: The proof is very similar to the one of Prop. 3.4.1. Condition (2)
implies that for every x ∈ X and ε > 0, there exists a policy πx,ε such that
(πx,ε, x) ∈ C and Jπx,ε,δ(x) ≤ Ĵδ(x) + ε. Thus, using conditions (2) and (3), we
have for all x ∈ X, δ > 0, ε > 0, and π with (π, x) ∈ C,

JC*(x) − ε ≤ Jπx,ε(x) − ε ≤ Jπx,ε,δ(x) − ε ≤ Ĵδ(x) ≤ Jπ,δ(x) ≤ Jπ(x) + wπ,δ(x).

By taking the limit as ε ↓ 0, we obtain for all x ∈ X, δ > 0, and π with (π, x) ∈ C,

JC*(x) ≤ Ĵδ(x) ≤ Jπ,δ(x) ≤ Jπ(x) + wπ,δ(x).

By taking the limit as δ ↓ 0 and then the infimum over all π with (π, x) ∈ C, it
follows [using also condition (3)] that for all x ∈ X,

JC*(x) ≤ lim_{δ↓0} Ĵδ(x) ≤ inf_{π | (π,x)∈C} lim_{δ↓0} Jπ,δ(x) ≤ inf_{π | (π,x)∈C} Jπ(x) = JC*(x),
Taking the limit as m → ∞, and using condition (4) and the fact Jˆδm ↓ JC∗ shown
earlier, we have
and that J*(x) > −∞ for all x ∈ X. The latter assumption was also made in
Section 3.5.5, but in the present exercise, we will not assume the additional near-
optimal termination Assumption 3.5.9 of that section, and we will use instead
the perturbation framework of Exercise 4.8. Note that J* is a fixed point of T
because the problem is deterministic (cf. Exercise 3.1).
We say that a policy π is terminating from state x0 ∈ X if the sequence
{xk } generated by π starting from x0 terminates finitely (i.e., satisfies xk̄ = t for
some index k̄). We denote by Πx the set of all policies that are terminating from
x, and we consider the collection
C = { (π, x) | π ∈ Πx }.
Solution: Part (a) follows from Exercise 4.8, and parts (b), (c) follow from
Exercise 4.8 and Prop. 4.4.2.
Consider the infinite-spaces SSP problem of Section 4.6 under the assumptions
of Prop. 4.6.4, and assume that g is bounded over X × U × W .
(a) Show that if µ is a uniformly proper policy, then Jµ is the unique solution
of the equation J = Tµ J within B and that Tµk J → Jµ for all J ∈ B.
(b) Let J′ be a fixed point of T such that J′ ∈ B and J′ ≠ Ĵ. Show that a
policy µ satisfying Tµ J′ = T J′ cannot be uniformly proper.
∞ = Ĵ(0) ≠ E{ g(0, u, w) + Ĵ(f(0, u, w)) } = Σ_{x=1}^∞ px Ĵ(x) = 0.
Consider the mapping H of Section 2.1 under the monotonicity Assumption 2.1.1.
Assume that instead of the contraction Assumption 2.1.2, the following hold:
(1) For every J ∈ B(X), the function T J belongs to B(X), the space of func-
tions on X that are bounded with respect to the weighted sup-norm corre-
sponding to a positive weighting function v.
(2) T is nonexpansive, i.e., ‖T J − T J′‖ ≤ ‖J − J′‖ for all J, J′ ∈ B(X).
(3) T has a unique fixed point within B(X), denoted J ∗ .
(4) If X is infinite the following continuity property holds: For each J ∈ B(X)
and {Jm } ⊂ B(X) with either Jm ↓ J or Jm ↑ J,
Solution: (a) Assume first that X is finite. For any c > 0, let V0 = J* + c v
and consider the sequence {Vk} defined by Vk+1 = T Vk for k ≥ 0. Note that
{Vk} ⊂ B(X), since ‖V0‖ ≤ ‖J*‖ + c so that V0 ∈ B(X), and we have Vk+1 = T Vk,
so that property (1) applies. From the nonexpansiveness property (2), we have
where the first equality follows from the continuity property (4), and the inequal-
ity follows from the generic relation inf lim H ≥ lim inf H. Thus we have V = T V ,
which by the uniqueness property (3), implies that V = J ∗ and Vk ↓ J ∗ . With a
similar argument we obtain Wk ↑ J ∗ , implying that T k J → J ∗ .
(b) The proof of part (a) applies with simple modifications.
Consider the mapping H of Section 2.1 under the monotonicity Assumption 2.1.1.
Assume that instead of the contraction Assumption 2.1.2, the following hold:
(1) For every J ∈ B(X), the function T J belongs to B(X), the space of func-
tions on X that are bounded with respect to the weighted sup-norm corre-
sponding to a positive weighting function v.
(2) T is nonexpansive, i.e., ‖T J − T J′‖ ≤ ‖J − J′‖ for all J, J′ ∈ B(X).

(3) T has a largest fixed point within B(X), denoted Ĵ, i.e., Ĵ ∈ B(X), Ĵ is a
fixed point of T, and for every other fixed point J′ ∈ B(X) we have J′ ≤ Ĵ.
(4) If X is infinite the following continuity property holds: For each J ∈ B(X)
and {Jm } ⊂ B(X) with either Jm ↓ J or Jm ↑ J,
Solution: (a) The proof follows the line of proof of the preceding exercise. Assume
first that X is finite. For any c > 0, let V0 = Ĵ + c v and consider the
sequence {Vk} defined by Vk+1 = T Vk for k ≥ 0. Note that {Vk} ⊂ B(X), since
‖V0‖ ≤ ‖Ĵ‖ + c so that V0 ∈ B(X), and we have Vk+1 = T Vk, so that property
(1) applies. From the nonexpansiveness property (2), we have
H(x, u, Ĵ + c v) ≤ H(x, u, Ĵ) + c v(x),   x ∈ X, u ∈ U(x),

and by taking the infimum over u ∈ U(x), we obtain Ĵ ≤ T(Ĵ + c v) ≤ Ĵ + c v, i.e.,
Ĵ ≤ V1 ≤ V0. From this and the monotonicity of T it follows that Ĵ ≤ Vk+1 ≤ Vk
for all k, so that for each x ∈ X, Vk(x) ↓ V(x) where V(x) ≥ Ĵ(x). Moreover,
V lies in B(X) (since Ĵ ≤ V ≤ Vk), and also satisfies ‖Vk − V‖ → 0 (since
X is finite). From property (2), we have ‖T Vk − T V‖ ≤ ‖Vk − V‖, so that
‖T Vk − T V‖ → 0, which together with the fact T Vk = Vk+1 → V, implies that
V = T V. Thus V = Ĵ by property (3), and it follows that Vk ↓ Ĵ.
If X is infinite and property (4) holds, the preceding proof goes through,
except for the part that shows that ‖Vk − V‖ → 0. Instead we use a different
argument to prove that V = T V. Indeed, since Vk ≥ Vk+1 = T Vk ≥ T V, it
follows that V ≥ T V. For the reverse inequality we write

where the first equality follows from the continuity property (4). Thus we have
V = T V, which by property (3), implies that V = Ĵ and Vk ↓ Ĵ.
This exercise (due to unpublished joint work with H. Yu) considers a nonexpansive
mapping G : ℜn ↦ ℜn, and derives conditions under which the interpolated
mapping Gγ defined by

Gγ(x) = (1 − γ)x + γ G(x),   x ∈ ℜn,

is a contraction for all γ ∈ (0, 1). Consider ℜn equipped with a strictly convex
norm ‖ · ‖, and the set

C = { ( (x − y)/‖x − y‖ , (G(x) − G(y))/‖x − y‖ ) | x, y ∈ ℜn, x ≠ y },
which can be viewed as a set of “slopes” of G along all directions. Show that the
mapping Gγ defined above is a contraction for all γ ∈ (0, 1) if and only if there is
no closure point (z, w) of C such that z = w. Note: To illustrate with some
one-dimensional examples what can happen if this closure condition is violated,
let G : ℜ ↦ ℜ be continuously differentiable, monotonically nondecreasing, and
satisfying 0 ≤ dG(x)/dx ≤ 1. Note that G is nonexpansive. We consider two cases.
(1) G(0) = 0, dG(0)/dx = 1, 0 ≤ dG(x)/dx < 1 for x ≠ 0, lim_{x→∞} dG(x)/dx < 1 and
lim_{x→−∞} dG(x)/dx < 1. Here (z, w) = (1, 1) is a closure point of C and
satisfies z = w. Note that Gγ is not a contraction for any γ ∈ (0, 1),
although it has 0 as its unique fixed point.
(2) lim_{x→∞} dG(x)/dx = 1. Here we have lim_{x→∞} (G(x) − G(y)) = x − y for
x = y + 1, so (1, 1) is a closure point of C. It can also be seen that because
lim_{x→∞} dGγ(x)/dx = 1, Gγ is not a contraction for any γ ∈ (0, 1), and may
have one, more than one, or no fixed points.
Solution: Assume there is no closure point (z, w) of C such that z = w, and for
γ ∈ (0, 1), let

ρ = sup_{(z,w)∈C} ‖(1 − γ)z + γw‖.

where for the strict inequality we use the strict convexity of the norm, and for
the last inequality we use the fact ‖z‖ = 1 and ‖w‖ ≤ 1. Thus ρ < 1, and since

‖ (1 − γ)(x − y)/‖x − y‖ + γ(G(x) − G(y))/‖x − y‖ ‖ = ‖Gγ(x) − Gγ(y)‖ / ‖x − y‖
                                                    ≤ sup_{(z,w)∈C} ‖(1 − γ)z + γw‖
                                                    = ρ,   ∀ x ≠ y,
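As a quick numerical illustration of case (2) above (a sketch, not part of the exercise), one may take G to be the softplus function, whose derivative increases to 1 as x → ∞:

```python
import math

# G(x) = log(1 + e^x) is nondecreasing with 0 < dG/dx < 1 and dG/dx -> 1 as x -> infinity,
# so (1,1) is a closure point of C.  The interpolated map G_gamma is then not a
# contraction for any gamma in (0,1); here it also has no fixed point.
def G(x):
    return math.log1p(math.exp(x)) if x < 30 else x  # numerically stable softplus

def G_gamma(x, gamma):
    return (1 - gamma) * x + gamma * G(x)

gamma = 0.5
for x in [0.0, 5.0, 10.0, 20.0, 40.0]:
    ratio = abs(G_gamma(x + 1, gamma) - G_gamma(x, gamma))  # |x - y| = 1
    print(x, round(ratio, 6))   # increases toward 1, so no modulus rho < 1 works
```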
5.1 INTRODUCTION
or equivalently,
Taking the supremum over ν ∈ N of both sides above, and then the infimum
over µ ∈ M, and using Eq. (5.5), we obtain
Given a mapping H of the form (5.1) that satisfies Assumption 1.1, we are
interested in computing the fixed point J * of T , i.e., a function J * such
that
J * (x) = inf sup H(x, u, v, J * ), for all x ∈ X. (5.6)
u∈U(x) v∈V (x)
Markov Games
Consider two players that play repeated matrix games at each of an infinite
number of stages, using mixed strategies. The game played at a given stage is
defined by a state x that takes values in a finite set X, and changes from one
stage to the next according to a Markov chain whose transition probabilities
are influenced by the players’ choices. At each stage and state x ∈ X, the
minimizer selects a probability distribution u = (u1 , . . . , un ) over n possible
choices i = 1, . . . , n, and the maximizer selects a probability distribution v =
(v1 , . . . , vm ) over m possible choices j = 1, . . . , m. If the minimizer chooses
i and the maximizer chooses j, the payoff of the stage is aij(x) and depends
on the state x. Thus the expected payoff of the stage is Σ_{i,j} aij(x) ui vj or
u′A(x)v, where A(x) is the n × m matrix with components aij(x) (u and v
are viewed as column vectors, and a prime denotes transposition).
The state evolves according to transition probabilities qxy (i, j), where i
and j are the moves selected by the minimizer and the maximizer, respectively
(here y represents the next state and game to be played after moves i and j
are chosen at the game represented by x). When the state is x, under u and
v, the state transition probabilities are
pxy(u, v) = Σ_{i=1}^n Σ_{j=1}^m ui vj qxy(i, j) = u′Qxy v,
i=1 j=1
where Qxy is the n × m matrix that has components qxy (i, j). Payoffs are
discounted by α ∈ (0, 1), and the objectives of the minimizer and maximizer,
are to minimize and to maximize the total discounted expected payoff, re-
spectively.
As shown by Shapley [Sha53], the problem can be formulated as a fixed
point problem involving the mapping H given by
H(x, u, v, J) = u′A(x)v + α Σ_{y∈X} pxy(u, v) J(y)
             = u′( A(x) + α Σ_{y∈X} Qxy J(y) ) v.     (5.7)
It was shown by Shapley [Sha53] that the strategies obtained by solving the
static saddle point problem (5.9) correspond to a saddle point of the sequential
game in the space of strategies. Thus once we find J ∗ as the fixed point of the
mapping T [cf. Eq. (5.8)], we can obtain equilibrium policies for the minimizer
and maximizer by solving the matrix game (5.9).
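To make this computation concrete, here is a minimal Python sketch (not part of the text) that solves a matrix game by linear programming, via scipy.optimize.linprog, and uses it inside a value iteration step for computing J* as in Eqs. (5.8)-(5.9); the payoff matrices A(x) and transition matrices Qxy below are hypothetical. The maximizer's equilibrium mixed strategy can be obtained analogously from the dual.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value and minimizer's mixed strategy u for min_u max_v u'Mv.
    Variables (u_1,...,u_n, t): minimize t s.t. (M'u)_j <= t for all j, sum(u) = 1, u >= 0."""
    n, m = M.shape
    c = np.r_[np.zeros(n), 1.0]                    # objective: minimize t
    A_ub = np.c_[M.T, -np.ones(m)]                 # M'u - t <= 0, one row per column j
    b_ub = np.zeros(m)
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)   # probabilities sum to 1
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[-1], res.x[:n]

def shapley_vi_step(J, A, Q, alpha):
    """(TJ)(x) = value of the matrix game with matrix A(x) + alpha * sum_y Q_xy * J(y)."""
    return {x: matrix_game_value(A[x] + alpha * sum(Q[(x, y)] * J[y] for y in J))[0]
            for x in J}

# Hypothetical 2-state example with 2 x 2 payoff and transition matrices.
A = {0: np.array([[1.0, 3.0], [2.0, 0.0]]), 1: np.array([[0.0, 1.0], [1.0, 0.0]])}
Q = {(x, y): np.full((2, 2), 0.5) for x in (0, 1) for y in (0, 1)}
J = {0: 0.0, 1: 0.0}
for _ in range(50):
    J = shapley_vi_step(J, A, Q, alpha=0.9)
print(J)   # approximate fixed point J* of T
```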
Here the problem is the same as in the preceding example, except that there
is no discount factor (α = 1), and in addition to the states in X, there is a
termination state t that is cost-free and absorbing. In this case the mapping
H is given by

H(x, u, v, J) = u′( A(x) + Σ_{y∈X} Qxy J(y) ) v,     (5.10)
cf. Eq. (5.7), where the matrix of transition probabilities Qxy may be sub-
stochastic, while T has the form
Assuming that the termination state t is reachable with probability one under
all policy pairs, it can be shown that the mapping H satisfies the contraction
Assumption 1.1, so results and algorithms that are similar to the ones for the
preceding example apply. This reachability assumption, however, is restric-
tive and is not satisfied when the problem has a semicontractive character,
whereby Tµ,ν is a contraction under some policy pairs but not for others. In
this case the analysis is more complicated and requires the notion of proper
and improper policies from single-player stochastic shortest path problems;
see the papers [BeT91], [PaB99], [YuB13a], [Yu14].
In the next section, we will view our abstract minimax problem, in-
volving the Bellman equation (5.6), as an optimization by a single player
who minimizes against a worst-case response by an antagonistic oppo-
nent/maximizer, and we will describe the corresponding PI algorithm. This
algorithm has been known for the case of Markov games since the 1960s. We
will highlight the main weakness of this algorithm: the computational cost
of the policy evaluation operation, which involves the solution of the maxi-
mizer’s problem for a fixed policy of the minimizer. We will then discuss an
attractive proposal by Pollatschek and Avi-Itzhak [PoA69] that overcomes
this difficulty, albeit with an algorithm that requires restrictive assump-
tions for its validity. Then, in Section 5.3, we will introduce and analyze a
new algorithm, which maintains the attractive structure of the Pollatschek
and Avi-Itzhak algorithm without requiring restrictive assumptions. We
will also show the validity of our algorithm in the context of a distributed
asynchronous implementation, as well as in an on-line context, which in-
volves one-state-at-a-time policy improvement, with the states generated
by an underlying dynamic system or Markov chain.
we write T as
(T J)(x) = inf (T µ J)(x), x ∈ X. (5.15)
µ∈M
PI Algorithms
or equivalently
" #
Jµt (x) = max H x, µt (x), v, Jµt , x ∈ X. (5.17)
v∈V (x)
where {mt } is a sequence of positive integers; see Section 2.5. Here the
policy evaluation operation (5.16) that finds the fixed point of the mapping
T µt is approximated by mt value iterations using T µt , and starting from
J t , as in the second equation of (5.20). The convergence of the abstract
forms of these PI algorithms has been established under the additional
monotonicity assumption
where H is the Markov game mapping (5.7) (this is the policy evaluation
step), followed by solving the static minimax problem
and letting µt+1 be a policy that attains the minimum above (this is the
policy improvement step). The policy improvement subproblem (5.23) is a
matrix saddle point problem, involving the matrix
A(x) + Σ_{y∈X} Qxy Jµt(y),
[cf. Eq. (5.10)], which is easily solvable by linear programming for each x
(this is well-known in the theory of matrix games).
However, the policy evaluation step (5.22) involves the solution of
the maximizer’s Markov decision problem, for the fixed policy µt of the
minimizer. This can be a quite difficult problem that requires an expensive
† Newton’s method for solving a general fixed point problem of the form
z = F (z), where z is an n-dimensional vector, operates as follows: At the current
iterate zk , we linearize F and find the solution zk+1 of the corresponding linear
fixed point problem, obtained using a first order Taylor expansion:
zk+1 = F(zk) + (∂F(zk)/∂z)(zk+1 − zk),
where ∂F (zk )/∂z is the n×n Jacobian matrix of F evaluated at the n-dimensional
vector zk . The most commonly given convergence rate property of Newton’s
method is quadratic convergence. It states that near the solution z ∗ , we have
‖zk+1 − z*‖ = O( ‖zk − z*‖² ),

where ‖ · ‖ is the Euclidean norm, and holds assuming the Jacobian matrix exists
and is Lipschitz continuous (see [Ber16c], Section 1.4). Qualitatively similar
results hold under other assumptions. In particular a superlinear convergence
statement (suitably modified to account for lack of differentiability of F ) can be
proved for the case where F (z) has components that are either monotonically
increasing or monotonically decreasing, and either concave or convex. In the
case of the Pollatschek and Avi-Itzhak algorithm, the main difficulty is that the
concavity/convexity condition is violated; see Fig. 5.2.1.
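A generic numerical sketch of the Newton iteration described in this footnote (not from the text), with a smooth hypothetical map F and a finite-difference Jacobian:

```python
import numpy as np

# Newton's method for z = F(z): at z_k, solve (I - dF(z_k)) z_{k+1} = F(z_k) - dF(z_k) z_k.
def F(z):
    return np.array([0.5 * np.cos(z[0]) + 0.1 * z[1], 0.2 * np.sin(z[0])])

def jacobian(F, z, eps=1e-7):
    n = len(z)
    J, Fz = np.zeros((n, n)), F(z)
    for i in range(n):
        dz = np.zeros(n); dz[i] = eps
        J[:, i] = (F(z + dz) - Fz) / eps
    return J

z = np.zeros(2)
for k in range(8):
    Jk = jacobian(F, z)
    z = np.linalg.solve(np.eye(2) - Jk, F(z) - Jk @ z)
    print(k, z, np.linalg.norm(F(z) - z))   # residual drops at a fast rate for this smooth F
```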
where ℓij(J) are linear functions of J, corresponding to the choices i = 1, 2 of the
minimizer and j = 1, 2 of the maximizer. Thus T J is the minimum of the convex
functions

max{ ℓ11(J), ℓ12(J) }   and   max{ ℓ21(J), ℓ22(J) },

as shown in the figure. Newton's method linearizes T J at the current iterate [i.e.,
replaces T J with one of the four linear functions ℓij(J), i = 1, 2, j = 1, 2 (the
one attaining the min-max at the current iterate)] and solves the corresponding
linear fixed point problem to obtain the next iterate.
operator , to be given in Section 5.4 (cf. Prop. 5.4.2). In fact, our algorithm
does not require the monotonicity assumption (5.21) for its convergence,
and thus it can be used in minimax problems that are beyond the scope of
DP.†
As an aid to understanding intuitively the abstract framework of this
section, we note that it is patterned after a multistage process, whereby at
each stage, the following sequence of events is envisioned (cf. Fig. 5.3.1):
(1) We start at some state x1 from a space X1 .
(2) The minimizer, knowing x1 , chooses a control u ∈ U (x1 ). Then a
new state x2 from a space X2 is generated as a function of (x1 , u).
(It is possible that X1 = X2 , but for greater generality, we do not
assume so. Also the transition from x1 to x2 may involve a random
disturbance; see the subsequent Example 3.3.)
(3) The maximizer, knowing x2 , chooses a control v ∈ V (x2 ). Then a
new state x1 ∈ X1 is generated.
(4) The next stage is started at x1 and the process is repeated.
If we start with x1 ∈ X1 , this sequence of events corresponds to finding the
optimal minimizer policy against a worst case choice of the maximizer, and
the corresponding min-max value is denoted by J1* (x1 ). Symmetrically, if
we start with x2 ∈ X2 , this sequence of events corresponds to finding the
optimal maximizer policy against a worst case choice of the minimizer, and
the corresponding max-min value is denoted by J2* (x2 ).
This type of framework can be viewed within the context of the theory
of zero-sum games in extensive form, a methodology with a long history
[Kuh53]. Games in extensive form involve sequential/alternating choices
by the players with knowledge of prior choices. By contrast, for games in
simultaneous form, such as the Markov games of the preceding section, the
players make their choices without being sure of the other player’s choices.
Fixed Point Formulation
† For example, our algorithm can be used for the asynchronous distributed
computation of fixed points of concave operators, arising in fields like economics
and population dynamics. The key fact here is that a concave function can be
described as the minimum of a collection of linear functions through the classical
conjugacy operation.
J2*(x2) = sup_{v∈V(x2)} { g2(x2, v) + α J1*(f2(x2, v)) }.     (5.33)
We will show that the discounted Markov game of Example 5.1.1 can be
reformulated within our fixed point framework of Eq. (5.28) by letting X1 =
X, X2 = X × U , and by redefining the minimizer’s control to be a probability
distribution (u1 , . . . , un ), and the maximizer’s control to be one of the m
possible choices j = 1, . . . , m.
To introduce into our problem formulation an appropriate contraction
structure that we will need in the next section, we use a scaling parameter β
such that
β > 1, αβ < 1. (5.34)
The idea behind the use of the scaling parameter β is to introduce discount-
ing into the stages of both the minimizer and the maximizer. We consider
functions J1∗ (x) and J2∗ (x, u) that solve the equations
J1*(x) = (1/β) min_{u∈U} J2*(x, u),     (5.35)

J2*(x, u) = max{ [ u′( A(x) + αβ Σ_{y∈X} Qxy J1*(y) ) ](1), . . . , [ u′( A(x) + αβ Σ_{y∈X} Qxy J1*(y) ) ](m) },     (5.36)
where xk is the state, uk is the control to be selected from some given set
U (xk ) (with perfect knowledge of xk ), and vk is a disturbance that is selected
by an antagonistic nature from a set V (xk , uk ) [with perfect knowledge of
(xk , uk )]. A cost g(xk , uk , vk ) is incurred at time k, it is accumulated over an
infinite horizon, and it is discounted by α ∈ (0, 1). The Bellman equation for
this problem is
J*(x) = inf_{u∈U(x)} sup_{v∈V(x,u)} { g(x, u, v) + α J*(f(x, u, v)) },     (5.41)
and the optimal cost function J ∗ is the unique fixed point of this equation,
assuming that the cost per stage g is a bounded function.
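As an illustration, here is a minimal Python sketch (not part of the text) of value iteration for the minimax Bellman equation (5.41) on a hypothetical finite model:

```python
# Value iteration for J(x) = min_u max_v [ g(x,u,v) + alpha * J(f(x,u,v)) ];
# the state/control sets, dynamics f, and stage cost g below are illustrative.
alpha = 0.9
X, U_SET, V_SET = [0, 1, 2], [0, 1], [0, 1]
f = lambda x, u, v: (x + u - v) % 3
g = lambda x, u, v: 1.0 + 0.3 * x - 0.2 * u + 0.4 * v

J = {x: 0.0 for x in X}
for _ in range(300):
    J = {x: min(max(g(x, u, v) + alpha * J[f(x, u, v)] for v in V_SET) for u in U_SET)
         for x in X}
print(J)   # approximate fixed point J*; the operator is a contraction since alpha < 1
```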
To reformulate this problem into the fixed point format (5.28), we iden-
tify the minimizer’s state x1 with the state x of the system (5.40), and the
maximizer’s state x2 with the state-control pair (x, u). We also introduce a
scaling parameter β that satisfies β > 1 and αβ < 1; cf. Eq. (5.34). We define
H1 and H2 as follows:
H1(x, u, J2) maps (x, u, J2) to the real value (1/β) J2(x, u),
Then the resulting fixed point problem (5.28) takes the form

J2*(x, u) = sup_{v∈V(x,u)} { g(x, u, v) + α (βJ1*)(f(x, u, v)) },
Consider a dynamic system such as the one of Eq. (5.40) in the preceding
example, except that there is an additional stochastic disturbance w with
known conditional probability distribution given (x, u, v). Thus the state
evolves at each time k according to
xk+1 = f (xk , uk , vk , wk ), k = 0, 1, . . . , (5.42)
and the cost per stage is g(xk, uk, vk, wk). The Bellman equation now is

J*(x) = inf_{u∈U(x)} sup_{v∈V(x,u)} E_w{ g(x, u, v, w) + α J*(f(x, u, v, w)) | x, u, v },     (5.43)

and J* is the unique fixed point of this equation, assuming that g is a bounded
function.
Similar to Example 5.3.3, we let the minimizer’s state be x, and the
maximizer’s state be (x, u), we introduce a scaling parameter β that satisfies
β > 1 and αβ < 1; cf. Eq. (5.34), and we define H1 and H2 as follows:
H1(x, u, J2) maps (x, u, J2) to the real value (1/β) J2(x, u),

H2(x, u, v, J1) maps (x, u, v, J1) to the real value E_w{ g(x, u, v, w) + αβ J1(f(x, u, v, w)) | x, u, v }.
“Naive” PI Algorithms
A PI algorithm for the fixed point problem (5.28), which is patterned after
the Pollatschek and Avi-Itzhak algorithm, generates a sequence of policy
pairs {µt , ν t } ⊂ M × N and corresponding sequence of cost function pairs
{J1,µt ,ν t , J2,µt ,ν t } ⊂ B(X1 ) × B(X2 ). We use the term “naive” to indicate
that the algorithm does not address adequately the convergence issue of
the underlying Newton’s method.† Given {µt , ν t } it generates {µt+1 , ν t+1 }
with a two-step process as follows:
(a) Policy evaluation, which computes the functions {J1,µt,νt, J2,µt,νt}
by solving the fixed point equations

J1,µt,νt(x1) = H1(x1, µt(x1), J2,µt,νt),   x1 ∈ X1,     (5.44)

J2,µt,νt(x2) = H2(x2, νt(x2), J1,µt,νt),   x2 ∈ X2.     (5.45)
(b) Policy improvement, which computes (µt+1 , ν t+1 ) with the mini-
mizations

µt+1(x1) ∈ arg min_{u∈U(x1)} H1(x1, u, J2,µt,νt),   x1 ∈ X1,     (5.46)

νt+1(x2) ∈ arg max_{v∈V(x2)} H2(x2, v, J1,µt,νt),   x2 ∈ X2.     (5.47)
† We do not mean the term in a pejorative sense. In fact the Pollatschek and
Avi-Itzhak paper [PoA69] embodies original ideas, includes sophisticated and
insightful analysis, and has stimulated considerable followup work.
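The sketch referred to above is the following minimal tabular implementation in Python. It is illustrative only; the simplifying assumptions (finite X1, X2; state-independent finite control lists U, V; H1 and H2 supplied as functions that accept a cost function given as a Python callable; policy evaluation solved approximately by repeated substitution, justified when H1 and H2 have the contraction property introduced in Section 5.4) are not from the text.

```python
def naive_pi(X1, X2, U, V, H1, H2, num_pi_iters=50, num_eval_sweeps=100):
    # "Naive" PI for the fixed point problem (5.28); illustrative names only.
    J1 = {x1: 0.0 for x1 in X1}
    J2 = {x2: 0.0 for x2 in X2}
    mu = {x1: U[0] for x1 in X1}          # minimizer's policy
    nu = {x2: V[0] for x2 in X2}          # maximizer's policy
    for _ in range(num_pi_iters):
        # Policy evaluation (5.44)-(5.45), by repeated substitution
        for _ in range(num_eval_sweeps):
            J1 = {x1: H1(x1, mu[x1], lambda y: J2[y]) for x1 in X1}
            J2 = {x2: H2(x2, nu[x2], lambda y: J1[y]) for x2 in X2}
        # Policy improvement (5.46)-(5.47)
        mu = {x1: min(U, key=lambda u: H1(x1, u, lambda y: J2[y])) for x1 in X1}
        nu = {x2: max(V, key=lambda v: H2(x2, v, lambda y: J1[y])) for x2 in X2}
    return J1, J2, mu, nu
```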
Our PI algorithm for finding the solution (J1* , J2* ) of the Bellman equation
(5.28) has structural similarity with the “naive” PI algorithm that uses
optimistic policy evaluations of the form (5.48)-(5.49) and policy improve-
ments of the form (5.50)-(5.51). It differs from the PI algorithms of the
preceding section, such as the Hoffman-Karp and van der Wal algorithms,
in two ways:
(a) It treats symmetrically the minimizer and the maximizer, in that it
aims to find both the min-max and the max-min cost functions, which
are J1* and J2* , respectively, and it ignores the possibility that we may
have J1* = J2* .
(b) It separates the policy evaluations and policy improvements of the
minimizer and the maximizer, in asynchronous fashion. In particular,
in the algorithm that we will present shortly, each iteration will consist
of only one of four operations: (1) an approximate policy evaluation
(consisting of a single value iteration) by the minimizer, (2) a policy
improvement by the minimizer, (3) an approximate policy evaluation
(consisting of a single value iteration) by the maximizer, (4) a policy
improvement by the maximizer.
The order and frequency with which these four operations are performed do not affect the convergence of the algorithm, as long as all of these operations are performed infinitely often. Thus the algorithm is well suited for
distributed implementation. Moreover, by executing the policy evaluation
steps (1) and (3) much more frequently than the policy improvement op-
erations (2) and (4), we obtain an algorithm involving nearly exact policy
evaluation.
Our algorithm generates two sequences of function pairs,
set µt+1(x1) to a control u ∈ U(x1) that attains the above minimum, and leave J2t, V2t, νt unchanged.

(c) Single value iteration for policy evaluation of the maximizer: For all x2 ∈ X2, set

set νt+1(x2) to a control v ∈ V(x2) that attains the above maximum, and leave J1t, V1t, µt unchanged.
Consider the minimax control problem with explicit separation of the two players of Example 5.3.1, which involves a dynamic system whose states x1,k ∈ X1 and x2,k ∈ X2 evolve according to
[cf. Eq. (5.29)]. The Bellman equation for this problem can be broken down
into the two equations (5.32), (5.33):
J1∗(x1) = inf_{u∈U(x1)} { g1(x1, u) + αJ2∗(f1(x1, u)) },

J2∗(x2) = sup_{v∈V(x2)} { g2(x2, v) + αJ1∗(f2(x2, v)) }.
The PI algorithm based on the four operations (5.52)-(5.55), specialized to this problem, takes the following form (a schematic implementation sketch is given after the list):

(a) Single value iteration for policy evaluation for the minimizer: For all x1 ∈ X1, set

J1t+1(x1) = g1(x1, µt(x1)) + α max[ V2t(f1(x1, µt(x1))), J2t(f1(x1, µt(x1))) ],   (5.56)

and leave J2t, V1t, V2t, µt, νt unchanged.

(b) Policy improvement for the minimizer: For all x1 ∈ X1, set

J1t+1(x1) = V1t+1(x1) = min_{u∈U(x1)} { g1(x1, u) + α max[ V2t(f1(x1, u)), J2t(f1(x1, u)) ] },   (5.57)

set µt+1(x1) to a control u ∈ U(x1) that attains the above minimum, and leave J2t, V2t, νt unchanged.

(c) Single value iteration for policy evaluation of the maximizer: For all x2 ∈ X2, set

J2t+1(x2) = g2(x2, νt(x2)) + α min[ V1t(f2(x2, νt(x2))), J1t(f2(x2, νt(x2))) ],   (5.58)

and leave J1t, V1t, V2t, µt, νt unchanged.

(d) Policy improvement for the maximizer: For all x2 ∈ X2, set

J2t+1(x2) = V2t+1(x2) = max_{v∈V(x2)} { g2(x2, v) + α min[ V1t(f2(x2, v)), J1t(f2(x2, v)) ] },   (5.59)

set νt+1(x2) to a control v ∈ V(x2) that attains the above maximum, and leave J1t, V1t, µt unchanged.
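Here is the schematic sketch referred to above: a minimal tabular implementation of operations (a)-(d) in Python, under illustrative assumptions that are not from the text (finite X1, X2; state-independent finite control lists U and V; f1, g1, f2, g2, α given as problem data). The randomized order of operations is one admissible choice, since any order in which all four operations recur infinitely often is allowed.

```python
import numpy as np   # used only to randomize the order of the four operations

def pi_minimax_separated(X1, X2, U, V, f1, g1, f2, g2, alpha, num_iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    J1 = {x1: 0.0 for x1 in X1}; V1 = dict(J1)
    J2 = {x2: 0.0 for x2 in X2}; V2 = dict(J2)
    mu = {x1: U[0] for x1 in X1}          # minimizer's policy
    nu = {x2: V[0] for x2 in X2}          # maximizer's policy
    for _ in range(num_iters):
        op = rng.integers(4)              # any order works if all four ops recur
        if op == 0:                       # (a) value iteration for the minimizer, cf. (5.56)
            J1 = {x1: g1(x1, mu[x1])
                      + alpha * max(V2[f1(x1, mu[x1])], J2[f1(x1, mu[x1])])
                  for x1 in X1}
        elif op == 1:                     # (b) policy improvement for the minimizer, cf. (5.57)
            for x1 in X1:
                vals = {u: g1(x1, u) + alpha * max(V2[f1(x1, u)], J2[f1(x1, u)])
                        for u in U}
                mu[x1] = min(vals, key=vals.get)
                J1[x1] = V1[x1] = vals[mu[x1]]
        elif op == 2:                     # (c) value iteration for the maximizer, cf. (5.58)
            J2 = {x2: g2(x2, nu[x2])
                      + alpha * min(V1[f2(x2, nu[x2])], J1[f2(x2, nu[x2])])
                  for x2 in X2}
        else:                             # (d) policy improvement for the maximizer, cf. (5.59)
            for x2 in X2:
                vals = {v: g2(x2, v) + alpha * min(V1[f2(x2, v)], J1[f2(x2, v)])
                        for v in V}
                nu[x2] = max(vals, key=vals.get)
                J2[x2] = V2[x2] = vals[nu[x2]]
    return J1, J2, V1, V2, mu, nu
```

Under the contraction assumption of Section 5.4, the iterates V1t, V2t (and hence J1t, J2t) produced by such a scheme converge to J1∗ and J2∗, respectively.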
cf. Eq. (5.36). None of the operations (5.52)-(5.55) requires the solution of a Markovian decision problem, as is required in the Hoffman-Karp algorithm. In this respect our algorithm is similar to the Pollatschek and Avi-Itzhak algorithm.
More specifically, the policy evaluation (5.52) for the minimizer takes the form

J1t+1(x) = (1/β) max[ V2t+1(x, µt(x)), J2t+1(x, µt(x)) ],   for all x ∈ X,   (5.61)

while the policy improvement (5.53) for the minimizer takes the form

J1t+1(x) = V1t+1(x) = (1/β) min_{u∈U} max[ V2t+1(x, u), J2t+1(x, u) ],   for all x ∈ X.   (5.62)

The policy evaluation (5.54) for the maximizer takes the form

J2t+1(x, u) = u′( A(x) + αβ Σ_{y∈X} Qxy min[ V1t(y), J1t(y) ] ) νt(x),   (5.63)

for all x ∈ X and u ∈ U, while the policy improvement (5.55) for the maximizer takes the form
the iterations of the minimizer in our algorithm, (5.52) and (5.53), are more “pessimistic” about the choices of the maximizer than the iterates of the minimizer in the “naive” PI iterates (5.48) and (5.49). Similarly, the iterations of the maximizer in our algorithm, (5.54) and (5.55), are more “pessimistic” than the iterates of the maximizer in the “naive” PI iterates (5.48) and (5.49). As a result the use of V1t and V2t in our PI algorithm makes it more conservative, and mitigates the oscillatory swings that are illustrated in Fig. 5.2.1.
Let us also note that the use of the functions V1 and V2 in our al-
gorithm (5.52)-(5.55) may slow down the algorithmic progress relative to
the (nonconvergent) “naive” algorithm (5.44)-(5.47). To remedy this sit-
uation an interpolation device has been suggested in the paper [BeY10]
(Section V), which roughly speaking interpolates between the two algo-
rithms, while still guaranteeing the algorithm’s convergence; see also Sec-
tion 2.6.3. Basically, such a device makes the algorithm less “pessimistic,”
as it guards against nonconvergence, and it can similarly be used in our
algorithm (5.52)-(5.55).
In the next section, we will show convergence of our PI algorithm
(5.52)-(5.55) with a line of proof that can be summarized as follows. Using
a contraction argument, based on an assumption to be introduced shortly,
we show that the sequences {V1t } and {V2t } converge to some functions
V1∗ ∈ B(X1 ) and V2∗ ∈ B(X2 ), respectively. From the policy improvement
operations (5.53) and (5.55) it will then follow that the sequences {J1t }
and {J2t } converge to the same functions V1∗ and V2∗ , respectively, so that
min[V1t , J1t ] and max[V2t , J2t ] converge to V1∗ and V2∗ , respectively, as well.
Also for each ν ∈ N, we consider the operator T2,ν that maps a function J1 ∈ B(X1) into the function of x2 given by

(T2,ν J1)(x2) = H2(x2, ν(x2), J1),   x2 ∈ X2.   (5.66)
We will also consider the operator Tµ,ν that maps a function (J1, J2) ∈ B(X1) × B(X2) into the function of (x1, x2) ∈ X1 × X2, given by

(Tµ,ν(J1, J2))(x1, x2) = ( (T1,µ J2)(x1), (T2,ν J1)(x2) ).   (5.67)
[Recall here that the norms on B(X1 ), B(X2 ), and B(X1 ) × B(X2 ) are
given by Eqs. (5.26) and (5.27).]
We will show convergence of our algorithm assuming the following.
for all J1, J1′ ∈ B(X1) and J2, J2′ ∈ B(X2) [cf. the norm definition (5.27)], we have

‖T1,µ J2 − T1,µ J2′‖1 ≤ α‖J2 − J2′‖2,   for all J2, J2′ ∈ B(X2),   (5.69)
and

‖T2,ν J1 − T2,ν J1′‖2 ≤ α‖J1 − J1′‖1,   for all J1, J1′ ∈ B(X1);   (5.70)
and
where
defined by
T (J1 , J2 ) = (T1 J2 , T2 J1 ), (5.73)
is a contraction mapping from B(X1 ) × B(X2 ) to B(X1 ) × B(X2 ) with
modulus α. It follows that T has a unique fixed point (J1*, J2*) ∈ B(X1) × B(X2). We will show that our algorithm yields this fixed point in the limit.
follows. The proof of the other relation, ‖T2 J1 − T2 J1′‖2 ≤ α‖J1 − J1′‖1, is similar.
The proof is long but follows closely the steps of the proof for the
single-player abstract DP case in Section 2.6.3.
Lemma 5.4.1: For all (V1, V2), (J1, J2), (V1′, V2′), (J1′, J2′) ∈ B(X1) × B(X2), we have

‖ min[V1, J1] − min[V1′, J1′] ‖1 ≤ max{ ‖V1 − V1′‖1, ‖J1 − J1′‖1 },   (5.74)

‖ max[V2, J2] − max[V2′, J2′] ‖2 ≤ max{ ‖V2 − V2′‖2, ‖J2 − J2′‖2 }.   (5.75)
so that
By exchanging the roles of (V1, J1) and (V1′, J1′), and combining the two inequalities, we have

| min[V1(x1), J1(x1)] − min[V1′(x1), J1′(x1)] | / ξ1(x1) ≤ max{ ‖V1 − V1′‖1, ‖J1 − J1′‖1 },

and by taking the supremum over x1 ∈ X1, we obtain Eq. (5.74). We similarly prove Eq. (5.75). Q.E.D.
with the functions M1,ν(V2, Q2), M2,µ(V1, Q1), F1,ν(V2, Q2), F2,µ(V1, Q1), defined as follows:

• M1,ν(V2, Q2): This is the function of x1 given by

(M1,ν(V2, Q2))(x1) = (T1 max[V2, Q̂2,ν])(x1) = inf_{u∈U(x1)} H1(x1, u, max[V2, Q̂2,ν]),   (5.78)

where Q̂2,ν is the function of x2 given by

Q̂2,ν(x2) = Q2(x2, ν(x2)).   (5.79)

• M2,µ(V1, Q1): This is the function of x2 given by

(M2,µ(V1, Q1))(x2) = (T2 min[V1, Q̂1,µ])(x2) = sup_{v∈V(x2)} H2(x2, v, min[V1, Q̂1,µ]),   (5.80)

where Q̂1,µ is the function of x1 given by

Q̂1,µ(x1) = Q1(x1, µ(x1)).   (5.81)

• F1,ν(V2, Q2): This is the function of (x1, u) given by

F1,ν(V2, Q2)(x1, u) = H1(x1, u, max[V2, Q̂2,ν]).   (5.82)

• F2,µ(V1, Q1): This is the function of (x2, v) given by

F2,µ(V1, Q1)(x2, v) = H2(x2, v, min[V1, Q̂1,µ]).   (5.83)
Note that the four components of Gµ,ν correspond to the four oper-
ations of our algorithm (5.52)-(5.55). In particular,
• M1,ν (V2 , Q2 ) corresponds to policy improvement of the minimizer.
• M2,µ (V1 , Q1 ) corresponds to policy improvement of the maximizer.
• F1,ν (V2 , Q2 ) corresponds to policy evaluation of the minimizer.
• F2,µ (V1 , Q1 ) corresponds to policy evaluation of the maximizer.
The key step in our convergence proof is to show that Gµ,ν has a
contraction property with respect to the norm on B(X1 )×B(X2 )×B(X1 ×
U ) × B(X2 × V ) given by
‖(V1, V2, Q1, Q2)‖ = max{ ‖V1‖1, ‖V2‖2, ‖Q1‖1, ‖Q2‖2 },   (5.84)
Thus we have

‖Gµ,ν(V1, V2, Q1, Q2) − Gµ,ν(V1′, V2′, Q1′, Q2′)‖ ≤ α ‖(V1, V2, Q1, Q2) − (V1′, V2′, Q1′, Q2′)‖,

V2(x2) = sup_{v′∈V(x2)} H2(x2, v′, min[V1, Q̂1,µ]),   (5.92)

Q1(x1, u) = H1(x1, u, max[V2, Q̂2,ν]),   Q2(x2, v) = H2(x2, v, min[V1, Q̂1,µ]).   (5.93)
By comparing the preceding two relations, it follows that

V1(x1) ≤ Q1(x1, u),   for all x1 ∈ X1, u ∈ U(x1),

V2(x2) ≥ Q2(x2, v),   for all x2 ∈ X2, v ∈ V(x2),

which implies that
Thus, independently of (µ, ν), (V1 , V2 ) is the unique fixed point of the
contraction mapping T of Eq. (5.73), which is (J1* , J2* ). Moreover from Eq.
(5.93), we have that (Q1 , Q2 ) is precisely (Q∗1 , Q∗2 ) as given by Eqs. (5.85)
and (5.86). This shows that, independently of (µ, ν), the fixed point of
Gµ,ν is (J1* , J2* , Q∗1 , Q∗2 ), and proves the desired result. Q.E.D.
Gµt,νt evaluated at the current iterate (V1t, V2t, Qt1, Qt2, µt, νt), and updates this iterate accordingly. This algorithm is well-suited for the calculation of both (J1*, J2*) and (Q1*, Q2*). However, since we are just interested in calculating (J1*, J2*), a simpler and more efficient algorithm is possible, which is in fact our PI algorithm based on the four operations (5.52)-(5.55). To this end, we observe that the algorithm that updates (V1t, V2t, Qt1, Qt2, µt, νt) can be operated so that it does not require the maintenance of the full Q-factor functions (Qt1, Qt2). The reason is that the values Qt1(x1, u) and Qt2(x2, v) with u ≠ µt(x1) and v ≠ νt(x2) do not appear in the calculations, and hence we need only the values Q̂t1,µt(x1) and Q̂t2,νt(x2), which we store in functions J1t and J2t, i.e., we set

J1t(x1) = Q̂t1,µt(x1) = Qt1(x1, µt(x1)),

J2t(x2) = Q̂t2,νt(x2) = Qt2(x2, νt(x2)).
[Figure: aggregation scheme with representative states x̃1 ∈ X̃1, x2 ∈ X2, x̃2 ∈ X̃2, x1 ∈ X1, x̄1 ∈ X̃1, and controls u, v.]
Suboptimal decision choices by the minimizer and the maximizer are then obtained from the one-step lookahead optimizations

See the book [Ber19b] (Section 6.1) and the paper [Ber18a] for a detailed accounting of the aggregation approach with representative states for single-player infinite horizon DP.
of the PI methods of Section 5.3 [we refer to the book [Ber19b] (Chapter
6) for more details]. The cost function approximations thus obtained, call
them J˜1 , J˜2 , are used in the one-step lookahead minimization
the max-min values, and they are suitable for Markov zero-sum game prob-
lems, as well as for minimax control problems involving set-membership
uncertainty.
While we have not addressed in detail the issue of asynchronous dis-
tributed implementation in a multiprocessor system, our algorithm admits
such an implementation, as has been discussed for its single-player counter-
parts in the papers by Bertsekas and Yu [BeY10], [BeY12], [YuB13a], and
also in a more abstract form in the author’s books [Ber12a] and [Ber20]. In
particular, there is a highly parallelizable and convergent distributed im-
plementation, which is based on state space partitioning, and asynchronous
policy evaluation and policy improvement operations within each set of the
partition. The key idea, which forms the core of asynchronous DP algo-
rithms [Ber82], [Ber83] (see also the books [BeT89], [Ber12a], [Ber20]) is
that the mapping Gµ,ν of Eq. (5.77) has two components for every state
(policy evaluation and policy improvement) for the minimizer and two cor-
responding components for every state for the maximizer. Because of the
uniform sup-norm contraction property of Gµ,ν , iterating with any one of
these components, and at any single state, does not impede the progress
made by iterations with the other components, while making eventual
progress towards the solution.
In view of its asynchronous convergence capability, our framework is
also suitable for on-line implementations where policy improvement and
evaluations are done at only one state at a time. In such implementations,
the algorithm performs a policy improvement at a single state, followed by
a number of policy evaluations at other states, with the current policy pair
(µt , ν t ) evaluated at only one state x at a time, and the cycle is repeated.
One may select states cyclically for policy improvement, but there are al-
ternative possibilities, including the case where states are selected on-line
as the system operates. An on-line PI algorithm of this type, which may
also be operated as a rollout algorithm (a control selected by a policy im-
provement at each encountered state), was given recently in the author’s
paper [Ber21a], and can be straightforwardly adapted to the minimax and
Markov game cases of this chapter.
Other algorithmic possibilities, also discussed in the works just noted,
involve the presence of “communication delays” between processors, which
roughly means that the iterates generated at some processors may involve
iterates of other processors that are out-of-date. This is possible because
the asynchronous convergence analysis framework of [Ber83], in combination with the uniform weighted sup-norm contraction property of Prop. 5.4.2, can tolerate the presence of such delays. Implementations that involve
forms of stochastic sampling have also been given in the papers [BeY12],
[YuB13a].
An important issue for efficient implementation of our algorithm is the
relative frequency of policy improvement and policy evaluation operations.
If a very large number of contiguous policy evaluation operations, using the
J2,µt,νt(x2) = H2(x2, νt(x2), min[V1t, J1,µt,νt]),   x2 ∈ X2,
cf. Eqs. (5.44)-(5.45) (in the context of Markovian decision problems, this
type of policy evaluation involves the solution of an optimal stopping
problem; cf. the paper [BeY12]). Otherwise the policy evaluation is in-
exact/optimistic, and in the extreme case where only one policy evaluation
is done between policy improvements, the algorithm resembles a value iter-
ation method. Based on experience with optimistic PI, it appears that the
optimal number of policy evaluations between policy improvements should
be substantially larger than one, and should also be problem-dependent.
We mention the possibility of extensions to other related minimax
and Markov game problems. In particular, the treatment of undiscounted
problems that involve a termination state can be patterned after the dis-
tributed asynchronous PI algorithm for stochastic shortest path problems
by Yu and Bertsekas [YuB13a], and will be the subject of a separate re-
port. A related area of investigation is on-line algorithms applied to robust
shortest path planning problems, where the aim is to reach a termination
state at minimum cost and against the actions of an antagonistic oppo-
nent. The author’s paper [Ber19c] (see also Section 3.5.3) has provided
analysis and algorithms, some of the PI type, for these minimax versions
of shortest path problems, and has given many references of related works.
Still our PI algorithm of Section 5.3, appropriately extended, offers some
substantial advantages within the shortest path context, in both a serial
and a distributed computing environment.
Note that a sequential minimax problem with a finite horizon may
be viewed as a simple special case of an infinite horizon problem with a
termination state. The PI algorithms of the present chapter are directly
applicable and can be simply modified for such a problem. In conjunction
with function approximation methods, such as the aggregation method
described earlier, they may provide an attractive alternative to exact, but
hopelessly time-consuming solution approaches.
For an interesting class of finite horizon problems, consider a two-
stage “robust” version of stochastic programming, patterned after Ex-
ample 5.3.3 and Eq. (5.42). Here, at an initial state x0 , the decision
maker/minimizer applies a decision u0 ∈ U (x0 ), an antagonistic nature
chooses v0 ∈ V (x0 , u0 ), and a random disturbance w0 is generated ac-
cording to a probability distribution that depends on (x0, u0, v0). A cost
x1 = f (x0 , u0 , v0 , w0 )
APPENDIX A:
Notation and Mathematical Conventions

We denote by ℜ the set of real numbers and by ℜ* the set of extended real numbers:

ℜ* = ℜ ∪ {∞, −∞}.
We write −∞ < x < ∞ for all real numbers x, and −∞ ≤ x ≤ ∞ for all
extended real numbers x. We denote by [a, b] the set of (possibly extended)
real numbers x satisfying a ≤ x ≤ b. A rounded, instead of square, bracket
denotes strict inequality in the definition. Thus (a, b], [a, b), and (a, b)
denote the set of all x satisfying a < x ≤ b, a ≤ x < b, and a < x < b,
respectively.
Generally, we adopt standard conventions regarding addition and multiplication in ℜ*, except that we take

∞ − ∞ = −∞ + ∞ = ∞,
α + ∞ = ∞ + α = ∞,   ∀ α ∈ ℜ*,
Under these rules, the following laws of arithmetic are still valid within ℜ*:

α1 + α2 = α2 + α1,   (α1 + α2) + α3 = α1 + (α2 + α3),

We also have

α(α1 + α2) = αα1 + αα2
A.2 FUNCTIONS
Sequences of Functions
Expected Values
In this way, taking also into account the rule ∞ − ∞ = ∞, the expected
value E{w} is well-defined if Ω is finite or countably infinite. In more gen-
eral cases, E{w} is similarly defined by the appropriate form of integration,
and more detail will be given at specific points as needed.
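As a small illustration of this convention, the following Python sketch (with assumed names, and for a finite Ω only) computes E{w} as E{w+} − E{w−} under the rule ∞ − ∞ = ∞.

```python
import math

def expected_value(values, probs):
    # E{w} = E{w+} - E{w-}, using the convention inf - inf = inf, so the result
    # is always well-defined in R* = R U {inf, -inf}.  Terms with zero probability
    # are skipped so that 0 * inf never arises.
    pos = sum(p * max(v, 0.0) for v, p in zip(values, probs) if p > 0)   # E{w+}
    neg = sum(p * max(-v, 0.0) for v, p in zip(values, probs) if p > 0)  # E{w-}
    if math.isinf(pos):
        return math.inf          # covers the case E{w+} = E{w-} = inf as well
    if math.isinf(neg):
        return -math.inf
    return pos - neg
```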
APPENDIX B:
Contraction Mappings
‖Fy − Fz‖ ≤ ρ‖y − z‖,   ∀ y, z ∈ Y.
Fy = b + Ay,   ∀ y ∈ ℜⁿ.

‖Ay‖s ≤ (σ(A) + ε)‖y‖s,   (B.1)
Thus, if σ(A) < 1 we may select ε > 0 such that ρ = σ(A) + ε < 1, and obtain the contraction relation

‖Fy − Fz‖s = ‖A(y − z)‖s ≤ ρ‖y − z‖s,   ∀ y, z ∈ ℜⁿ.   (B.2)
† We may show Eq. (B.1) by using the Jordan canonical form of A, which is denoted by J. In particular, if P is a nonsingular matrix such that P^{-1}AP = J and D is the diagonal matrix with 1, δ, . . . , δ^{n-1} along the diagonal, where δ > 0, it is straightforward to verify that D^{-1}P^{-1}APD = Ĵ, where Ĵ is the matrix that is identical to J except that each nonzero off-diagonal term is replaced by δ. Defining P̂ = PD, we have A = P̂ Ĵ P̂^{-1}. Now if ‖·‖ is the standard Euclidean norm, we note that for some β > 0, we have ‖Ĵz‖ ≤ (σ(A) + βδ)‖z‖ for all z ∈ ℜⁿ and δ ∈ (0, 1]. For a given δ ∈ (0, 1], consider the weighted Euclidean norm ‖·‖s defined by ‖y‖s = ‖P̂^{-1}y‖. Then we have for all y ∈ ℜⁿ,

‖Ay‖s = ‖P̂^{-1}Ay‖ = ‖P̂^{-1}P̂ Ĵ P̂^{-1}y‖ = ‖Ĵ P̂^{-1}y‖ ≤ (σ(A) + βδ)‖P̂^{-1}y‖,

so that ‖Ay‖s ≤ (σ(A) + βδ)‖y‖s, for all y ∈ ℜⁿ. For a given ε > 0, we choose δ = ε/β, so the preceding relation yields Eq. (B.1).
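A quick numerical illustration of the contraction relation (B.2) is the following sketch (names and data are assumed; numpy is used only for the linear algebra): when σ(A) < 1, the iteration y ← Fy = b + Ay converges to the unique fixed point y* = (I − A)^{-1}b.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))    # rescale so that sigma(A) = 0.9 < 1
b = rng.standard_normal(n)

y = np.zeros(n)
for _ in range(200):
    y = b + A @ y                            # fixed point iteration y <- F y = b + A y

y_star = np.linalg.solve(np.eye(n) - A, b)   # the unique fixed point of F
print(np.linalg.norm(y - y_star))            # prints a number near zero
```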
y∗ = F y∗.
‖y_{k+1} − y_k‖ ≤ ρ^k ‖y_1 − y_0‖,   k = 1, 2, . . . .
To show the convergence rate bound of the last part, note that

‖F^k y − y*‖ = ‖F^k y − F y*‖ ≤ ρ ‖F^{k-1} y − y*‖.

Repeating this process for a total of k times, we obtain the desired result. Q.E.D.
y∗ = F y∗.
v(x) > 0, ∀ x ∈ X.
Let B(X) denote the set of all functions J : X ↦ ℜ such that J(x)/v(x) is bounded as x ranges over X. We define a norm on B(X), called the weighted sup-norm, by

‖J‖ = sup_{x∈X} |J(x)| / v(x).   (B.3)
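For a finite set X, the weighted sup-norm (B.3) can be computed as in the following tiny sketch (the names are illustrative, not from the text).

```python
def weighted_sup_norm(J, v, X):
    # ||J|| = sup over x in X of |J(x)| / v(x), for a finite set X (cf. Eq. (B.3)).
    return max(abs(J(x)) / v(x) for x in X)

# Example: X = {0, ..., 9}, weight v(x) = 1 + x, and J(x) = x**2
print(weighted_sup_norm(lambda x: x**2, lambda x: 1.0 + x, range(10)))
```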
It is easily verified that ‖·‖ thus defined has the required properties for being a norm. Furthermore, B(X) is complete under this norm. To see this, consider a Cauchy sequence {Jk} ⊂ B(X), and note that ‖Jm − Jn‖ → 0 as m, n → ∞ implies that for all x ∈ X, {Jk(x)} is a Cauchy sequence of real numbers, so it converges to some J*(x). We will show that J* ∈ B(X) and that ‖Jk − J*‖ → 0. To this end, it will be sufficient to show that given any ε > 0, there exists an integer K such that

|Jk(x) − J*(x)| / v(x) ≤ ε,   ∀ x ∈ X, k ≥ K.

This will imply that

sup_{x∈X} |J*(x)| / v(x) ≤ ε + ‖Jk‖,   ∀ k ≥ K,

so that J* ∈ B(X), and will also imply that ‖Jk − J*‖ ≤ ε, so that ‖Jk − J*‖ → 0. Assume the contrary, i.e., that there exists an ε > 0 and a subsequence {xm1, xm2, . . .} ⊂ X such that mi < mi+1 and

ε < |Jmi(xmi) − J*(xmi)| / v(xmi),   ∀ i ≥ 1.

The right-hand side above is less than or equal to

|Jmi(xmi) − Jn(xmi)| / v(xmi) + |Jn(xmi) − J*(xmi)| / v(xmi),   ∀ n ≥ 1, i ≥ 1.

The first term in the above sum is less than ε/2 for i and n larger than some threshold; fixing i and letting n be sufficiently large, the second term can also be made less than ε/2, so the sum is made less than ε, a contradiction. In conclusion, the space B(X) is complete, so the fixed point results of Props. B.1 and B.2 apply.
In our discussions, unless we specify otherwise, we will assume that B(X) is equipped with the weighted sup-norm above, where the weight function v will be clear from the context. There will be frequent occasions where the norm will be unweighted, i.e., v(x) ≡ 1 and ‖J‖ = sup_{x∈X} |J(x)|, in which case we will explicitly state so.
Finite-Dimensional Cases
‖y‖ = max_{i=1,...,n} |yi| / v(i)

if and only if

Σ_{j=1}^{n} |aij| v(j) / v(i) < 1,   i = 1, . . . , n.
Proof: (a) This is the Perron-Frobenius Theorem; see e.g., [BeT89], Chap-
ter 2, Prop. 6.6.
(b) This follows from the Perron-Frobenius Theorem; see [BeT89], Chapter
2, Cor. 6.2.
(c) This is proved in more general form in the following Prop. B.4. Q.E.D.
for some matrix P with nonnegative components and σ(P) < 1. Here, we generically denote by |w| the vector whose components are the absolute values of the components of w, and the inequality is componentwise. Then we claim that F is a contraction with respect to some weighted sup-norm. To see this note that by the preceding discussion, P is a contraction with respect to some weighted sup-norm ‖y‖ = max_{i=1,...,n} |yi|/v(i), and we have

(|Fy − Fz|)(i) / v(i) ≤ (P|y − z|)(i) / v(i) ≤ α ‖y − z‖,   ∀ i = 1, . . . , n,

for some α ∈ (0, 1), where (|Fy − Fz|)(i) and (P|y − z|)(i) are the ith components of the vectors |Fy − Fz| and P|y − z|, respectively. Thus, F is a contraction with respect to ‖·‖. For additional discussion of linear and nonlinear contraction mapping properties and characterizations such as the one above, see the book [OrR70].
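A numerical check of the contraction condition in the linear case is sketched below (the names and the particular weight choice are assumptions, not from the text): for a nonnegative P with σ(P) < 1, the weight vector v = (I − P)^{-1}·1 is one convenient choice, since it is componentwise positive and satisfies Pv = v − 1, so that Σ_j Pij v(j) / v(i) < 1 for every i.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
P = rng.random((n, n))                          # nonnegative entries
P *= 0.8 / max(abs(np.linalg.eigvals(P)))       # rescale so that sigma(P) = 0.8 < 1

v = np.linalg.solve(np.eye(n) - P, np.ones(n))  # v = (I - P)^{-1} 1, componentwise > 0
moduli = (P @ v) / v                            # row i: sum_j P_ij v(j) / v(i)
print(v.min() > 0, moduli.max() < 1.0)          # both True: contraction w.r.t. ||y|| = max_i |y_i|/v(i)
```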
Proof: (a) Assume that Eq. (B.4) holds. For any J, J′ ∈ B(X), we have

‖FJ − FJ′‖ = sup_{i∈X} | Σ_{j∈X} aij (J(j) − J′(j)) | / v(i)
≤ sup_{i∈X} Σ_{j∈X} |aij| v(j) ( |J(j) − J′(j)| / v(j) ) / v(i)
≤ sup_{i∈X} ( Σ_{j∈X} |aij| v(j) / v(i) ) ‖J − J′‖
≤ ρ ‖J − J′‖,

with

Vi(µ) = Σ_{j∈X} |aij(µ)| v(j),   i ∈ X.
By dividing this inequality by v(i) and by taking the supremum over
i ∈ X, we obtain
(b) By doing the same as in (a), but after first taking the infimum of
(Fµ J)(i) over µ, we obtain
Q.E.D.
References
[ABB02] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2002. “Stochastic
Approximation for Non-Expansive Maps: Q-Learning Algorithms,” SIAM J. on
Control and Opt., Vol. 41, pp. 1-22.
[AnM79] Anderson, B. D. O., and Moore, J. B., 1979. Optimal Filtering, Prentice
Hall, Englewood Cliffs, N. J.
[BBB08] Basu, A., Bhattacharyya, and Borkar, V., 2008. “A Learning Algorithm
for Risk-Sensitive Cost,” Math. of OR, Vol. 33, pp. 880-898.
[BBD10] Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D., 2010. Rein-
forcement Learning and Dynamic Programming Using Function Approximators,
CRC Press, N. Y.
[BFH86] Breton, M., Filar, J. A., Haurie, A., and Schultz, T. A., 1986. “On
the Computation of Equilibria in Discounted Stochastic Dynamic Games,” in
Dynamic Games and Applications in Economics, Springer, pp. 64-87.
[Bau78] Baudet, G. M., 1978. “Asynchronous Iterative Methods for Multiproces-
sors,” Journal of the ACM, Vol. 25, pp. 226-244.
[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. “Temporal Differences-Based Policy
Iteration and Applications in Neuro-Dynamic Programming,” Lab. for Info. and
Decision Systems Report LIDS-P-2349, MIT.
[BeK65] Bellman, R., and Kalaba, R. E., 1965. Quasilinearization and Nonlinear
Boundary-Value Problems, Elsevier, N.Y.
[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control:
The Discrete Time Case, Academic Press, N. Y.; may be downloaded from
http://web.mit.edu/dimitrib/www/home.html
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Distributed
Computation: Numerical Methods, Prentice-Hall, Engl. Cliffs, N. J.; may be
downloaded from http://web.mit.edu/dimitrib/www/home.html
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic
Shortest Path Problems,” Math. of OR, Vol. 16, pp. 580-595.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Program-
ming, Athena Scientific, Belmont, MA.
[BeT08] Bertsekas, D. P., and Tsitsiklis, J. N., 2008. Introduction to Probability,
2nd Ed., Athena Scientific, Belmont, MA.
[BeY07] Bertsekas, D. P., and Yu, H., 2007. “Solution of Large Systems of Equa-
tions Using Approximate Dynamic Programming Methods,” Lab. for Info. and
Decision Systems Report LIDS-P-2754, MIT.
[BeY09] Bertsekas, D. P., and Yu, H., 2009. “Projected Equation Methods for Ap-
proximate Solution of Large Linear Systems,” J. of Computational and Applied
Mathematics, Vol. 227, pp. 27-50.
[BeY10] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy
Iteration in Dynamic Programming,” Proc. of Allerton Conf. on Communication,
Control and Computing, Allerton Park, Ill, pp. 1368-1374.
[BeY12] Bertsekas, D. P., and Yu, H., 2012. “Q-Learning and Enhanced Policy
Iteration in Discounted Dynamic Programming,” Math. of OR, Vol. 37, pp. 66-94.
[BeY16] Bertsekas, D. P., and Yu, H., 2016. “Stochastic Shortest Path Problems
Under Weak Conditions,” Lab. for Information and Decision Systems Report
LIDS-2909, January 2016.
[Ber71] Bertsekas, D. P., 1971. “Control of Uncertain Systems With a Set-Member-
ship Description of the Uncertainty,” Ph.D. Dissertation, Massachusetts Institute
of Technology, Cambridge, MA (available from the author’s website).
[Ber72] Bertsekas, D. P., 1972. “Infinite Time Reachability of State Space Regions
by Using Feedback Control,” IEEE Trans. Aut. Control, Vol. AC-17, pp. 604-613.
[Ber75] Bertsekas, D. P., 1975. “Monotone Mappings in Dynamic Programming,”
1975 IEEE Conference on Decision and Control, pp. 20-25.
[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dy-
namic Programming,” SIAM J. on Control and Opt., Vol. 15, pp. 438-464.
[Ber82] Bertsekas, D. P., 1982. “Distributed Dynamic Programming,” IEEE Trans.
Aut. Control, Vol. AC-27, pp. 610-616.
[Ber83] Bertsekas, D. P., 1983. “Asynchronous Distributed Computation of Fixed
Points,” Math. Programming, Vol. 27, pp. 107-120.
[Ber87] Bertsekas, D. P., 1987. Dynamic Programming: Deterministic and Stochas-
tic Models, Prentice-Hall, Englewood Cliffs, N. J.
[Ber96] Bertsekas, D. P., 1996. Lecture at NSF Workshop on Reinforcement
Learning, Hilltop House, Harper’s Ferry, N. Y.
[Ber98] Bertsekas, D. P., 1998. Network Optimization: Continuous and Discrete
Models, Athena Scientific, Belmont, MA.
[Ber09] Bertsekas, D. P., 2009. Convex Optimization Theory, Athena Scientific,
Belmont, MA.
[Ber10] Bertsekas, D. P., 2010. “Williams-Baird Counterexample for Q-Factor
Asynchronous Policy Iteration,”
http://web.mit.edu/dimitrib/www/Williams-Baird Counterexample.pdf
[Ber11a] Bertsekas, D. P., 2011. “Temporal Difference Methods for General Pro-
jected Equations,” IEEE Trans. Aut. Control, Vol. 56, pp. 2128-2139.
[Ber11b] Bertsekas, D. P., 2011. “λ-Policy Iteration: A Review and a New Im-
plementation,” Lab. for Info. and Decision Systems Report LIDS-P-2874, MIT;
appears in Reinforcement Learning and Approximate Dynamic Programming for
Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press, 2012.
[Ber11c] Bertsekas, D. P., 2011. “Approximate Policy Iteration: A Survey and
Some New Methods,” J. of Control Theory and Applications, Vol. 9, pp. 310-335;
a somewhat expanded version appears as Lab. for Info. and Decision Systems
Report LIDS-2833, MIT, 2011.
[Ber12a] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Control,
Vol. II, 4th Edition: Approximate Dynamic Programming, Athena Scientific,
Belmont, MA.
[Ber12b] Bertsekas, D. P., 2012. “Weighted Sup-Norm Contractions in Dynamic
Programming: A Review and Some New Applications,” Lab. for Info. and Deci-
sion Systems Report LIDS-P-2884, MIT.
[Ber15] Bertsekas, D. P., 2015. “Regular Policies in Abstract Dynamic Program-
ming,” Lab. for Information and Decision Systems Report LIDS-P-3173, MIT,
May 2015; arXiv preprint arXiv:1609.03115; SIAM J. on Optimization, Vol. 27,
2017, pp. 1694-1727.
[Ber16a] Bertsekas, D. P., 2016. “Affine Monotonic and Risk-Sensitive Models
in Dynamic Programming,” Lab. for Information and Decision Systems Report
LIDS-3204, MIT, June 2016; arXiv preprint arXiv:1608.01393; IEEE Trans. on
Aut. Control, Vol. 64, 2019, pp. 3117-3128.
[Ber16b] Bertsekas, D. P., 2016. “Proximal Algorithms and Temporal Differences
for Large Linear Systems: Extrapolation, Approximation, and Simulation,” Re-
port LIDS-P-3205, MIT, Oct. 2016; arXiv preprint arXiv:1610.05427.
[Ber16c] Bertsekas, D. P., 2016. Nonlinear Programming, 3rd Edition, Athena
Scientific, Belmont, MA.
[Ber17a] Bertsekas, D. P., 2017. Dynamic Programming and Optimal Control,
Vol. I, 4th Edition, Athena Scientific, Belmont, MA.
[Ber17b] Bertsekas, D. P., 2017. “Value and Policy Iteration in Deterministic
Optimal Control and Adaptive Dynamic Programming,” IEEE Transactions on
Neural Networks and Learning Systems, Vol. 28, pp. 500-509.
[Ber17c] Bertsekas, D. P., 2017. “Stable Optimal Control and Semicontractive
Dynamic Programming,” Report LIDS-P-3506, MIT, May 2017; SIAM J. on
Control and Optimization, Vol. 56, 2018, pp. 231-252.
[Ber17d] Bertsekas, D. P., 2017. “Proper Policies in Infinite-State Stochastic
Shortest Path Problems,” Report LIDS-P-3507, MIT, May 2017; arXiv preprint
arXiv:1711.10129.
[Ber18a] Bertsekas, D. P., 2018. “Feature-Based Aggregation and Deep Rein-
forcement Learning: A Survey and Some New Implementations,” Lab. for In-
formation and Decision Systems Report, MIT; arXiv preprint arXiv:1804.04577;
IEEE/CAA Journal of Automatica Sinica, Vol. 6, 2019, pp. 1-31.
[Ber18b] Bertsekas, D. P., 2018. “Biased Aggregation, Rollout, and Enhanced
Policy Improvement for Reinforcement Learning,” Lab. for Information and De-
cision Systems Report, MIT; arXiv preprint arXiv:1910.02426.
[Ber18c] Bertsekas, D. P., 2018. “Proximal Algorithms and Temporal Differences
for Solving Fixed Point Problems,” Computational Optimization and Applica-
tions J., Vol. 70, pp. 709-736.
[Ber19a] Bertsekas, D. P., 2019. “Affine Monotonic and Risk-Sensitive Models
in Dynamic Programming,” IEEE Transactions on Aut. Control, Vol. 64, pp.
3117-3128.
[CoM99] Coraluppi, S. P., and Marcus, S. I., 1999. “Risk-Sensitive and Minimax
Control of Discrete-Time, Finite-State Markov Decision Processes,” Automatica,
Vol. 35, pp. 301-309.
[DFV00] de Farias, D. P., and Van Roy, B., 2000. “On the Existence of Fixed
Points for Approximate Value Iteration and Temporal-Difference Learning,” J.
of Optimization Theory and Applications, Vol. 105, pp. 589-608.
[DeM67] Denardo, E. V., and Mitten, L. G., 1967. “Elements of Sequential De-
cision Processes,” J. Indust. Engrg., Vol. 18, pp. 106-112.
[DeR79] Denardo, E. V., and Rothblum, U. G., 1979. “Optimal Stopping, Ex-
ponential Utility, and Linear Programming,” Math. Programming, Vol. 16, pp.
228-244.
[Den67] Denardo, E. V., 1967. “Contraction Mappings in the Theory Underlying
Dynamic Programming,” SIAM Review, Vol. 9, pp. 165-177.
[Der70] Derman, C., 1970. Finite State Markovian Decision Processes, Academic
Press, N. Y.
[DuS65] Dubins, L., and Savage, L. M., 1965. How to Gamble If You Must,
McGraw-Hill, N. Y.
[FeM97] Fernandez-Gaucherand, E., and Marcus, S. I., 1997. “Risk-Sensitive Op-
timal Control of Hidden Markov Models: Structural Results,” IEEE Trans. Aut.
Control, Vol. AC-42, pp. 1418-1422.
[Fei02] Feinberg, E. A., 2002. “Total Reward Criteria,” in E. A. Feinberg and A.
Shwartz, (Eds.), Handbook of Markov Decision Processes, Springer, N. Y.
[FiT91] Filar, J. A., and Tolwinski, B., 1991. “On the Algorithm of Pollatschek
and Avi-Itzhak,” in Stochastic Games and Related Topics, Theory and Decision
Library, Springer, Vol. 7, pp. 59-70.
[FiV96] Filar, J., and Vrieze, K., 1996. Competitive Markov Decision Processes,
Springer, N. Y.
[FlM95] Fleming, W. H., and McEneaney, W. M., 1995. “Risk-Sensitive Control
on an Infinite Time Horizon,” SIAM J. Control and Opt., Vol. 33, pp. 1881-1915.
[Gos03] Gosavi, A., 2003. Simulation-Based Optimization: Parametric Optimiza-
tion Techniques and Reinforcement Learning, Springer, N. Y.
[GuS17] Guillot, M., and Stauffer, G., 2017. “The Stochastic Shortest Path Prob-
lem: A Polyhedral Combinatorics Perspective,” Univ. of Grenoble Report.
[HCP99] Hernandez-Lerma, O., Carrasco, O., and Perez-Hernandez, 1999. “Mar-
kov Control Processes with the Expected Total Cost Criterion: Optimality, Sta-
bility, and Transient Models,” Acta Appl. Math., Vol. 59, pp. 229-269.
[Hay08] Haykin, S., 2008. Neural Networks and Learning Machines, (3rd Edition),
Prentice-Hall, Englewood-Cliffs, N. J.
[HeL99] Hernandez-Lerma, O., and Lasserre, J. B., 1999. Further Topics on
Discrete-Time Markov Control Processes, Springer, N. Y.
[HeM96] Hernandez-Hernandez, D., and Marcus, S. I., 1996. “Risk Sensitive Con-
trol of Markov Processes in Countable State Space,” Systems and Control Letters,
Vol. 29, pp. 147-155.
[HiW05] Hinderer, K., and Waldmann, K.-H., 2005. “Algorithms for Countable
State Markov Decision Models with an Absorbing Set,” SIAM J. of Control and
[PSP15] Perolat, J., Scherrer, B., Piot, B., and Pietquin, O., 2015. “Approximate
Dynamic Programming for Two-Player Zero-Sum Markov Games,” in Proc. In-
ternational Conference on Machine Learning, pp. 1321-1329.
[PaB99] Patek, S. D., and Bertsekas, D. P., 1999. “Stochastic Shortest Path
Games,” SIAM J. on Control and Opt., Vol. 36, pp. 804-824.
[Pal67] Pallu de la Barriere, R., 1967. Optimal Control Theory, Saunders, Phila;
republished by Dover, N. Y., 1980.
[Pat01] Patek, S. D., 2001. “On Terminating Markov Decision Processes with a
Risk Averse Objective Function,” Automatica, Vol. 37, pp. 1379-1386.
[Pat07] Patek, S. D., 2007. “Partially Observed Stochastic Shortest Path Prob-
lems with Approximate Solution by Neuro-Dynamic Programming,” IEEE Trans.
on Systems, Man, and Cybernetics Part A, Vol. 37, pp. 710-720.
[Pli78] Pliska, S. R., 1978. “On the Transient Case for Markov Decision Chains
with General State Spaces,” in Dynamic Programming and its Applications, by
M. L. Puterman (ed.), Academic Press, N. Y.
[PoA69] Pollatschek, M., and Avi-Itzhak, B., 1969. “Algorithms for Stochastic
Games with Geometrical Interpretation,” Management Science, Vol. 15, pp. 399-
413.
[Pow07] Powell, W. B., 2007. Approximate Dynamic Programming: Solving the
Curses of Dimensionality, J. Wiley and Sons, Hoboken, N. J; 2nd ed., 2011.
[PuB78] Puterman, M. L., and Brumelle, S. L., 1978. “The Analytic Theory of
Policy Iteration,” in Dynamic Programming and Its Applications, M. L. Puter-
man (ed.), Academic Press, N. Y.
[PuB79] Puterman, M. L., and Brumelle, S. L., 1979. “On the Convergence of
Policy Iteration in Stationary Dynamic Programming,” Math. of Operations Re-
search, Vol. 4, pp. 60-69.
[Put94] Puterman, M. L., 1994. Markovian Decision Problems, J. Wiley, N. Y.
[Rei16] Reissig, G., 2016. “Approximate Value Iteration for a Class of Determin-
istic Optimal Control Problems with Infinite State and Input Alphabets,” Proc.
2016 IEEE Conf. on Decision and Control, pp. 1063-1068.
[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Univ. Press, Prince-
ton, N. J.
[Ros67] Rosenfeld, J., 1967. “A Case Study on Programming for Parallel Proces-
sors,” Research Report RC-1864, IBM Res. Center, Yorktown Heights, N. Y.
[Rot79] Rothblum, U. G., 1979. “Iterated Successive Approximation for Sequen-
tial Decision Processes,” in Stochastic Control and Optimization, by J. W. B.
van Overhagen and H. C. Tijms (eds), Vrije University, Amsterdam.
[Rot84] Rothblum, U. G., 1984. “Multiplicative Markov Decision Chains,” Math.
of OR, Vol. 9, pp. 6-24.
[ScL12] Scherrer, B., and Lesner, B., 2012. “On the Use of Non-Stationary Policies
for Stationary Infinite-Horizon Markov Decision Processes,” NIPS 2012 - Neural
Information Processing Systems, South Lake Tahoe, Ne.
[Sch75] Schal, M., 1975. “Conditions for Optimality in Dynamic Programming
and for the Limit of n-Stage Optimal Policies to be Optimal,” Z. Wahrschein-
lichkeitstheorie und Verw. Gebiete, Vol. 32, pp. 179-196.
[Sch11] Scherrer, B., 2011. “Performance Bounds for Lambda Policy Iteration
and Application to the Game of Tetris,” Report RR-6348, INRIA, France; J. of
Machine Learning Research, Vol. 14, 2013, pp. 1181-1227.
[Sch12] Scherrer, B., 2012. “On the Use of Non-Stationary Policies for Infinite-
Horizon Discounted Markov Decision Processes,” INRIA Lorraine Report, France.
[Sha53] Shapley, L. S., 1953. “Stochastic Games,” Proc. Nat. Acad. Sci. U.S.A.,
Vol. 39.
[Sob75] Sobel, M. J., 1975. “Ordinal Dynamic Programming,” Management Sci-
ence, Vol. 21, pp. 967-975.
[Str66] Strauch, R., 1966. “Negative Dynamic Programming,” Ann. Math. Statist.,
Vol. 37, pp. 871-890.
[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning, MIT
Press, Cambridge, MA.
[Sze98a] Szepesvari, C., 1998. Static and Dynamic Aspects of Optimal Sequential
Decision Making, Ph.D. Thesis, Bolyai Institute of Mathematics, Hungary.
[Sze98b] Szepesvari, C., 1998. “Non-Markovian Policies in Sequential Decision
Problems,” Acta Cybernetica, Vol. 13, pp. 305-318.
[Sze10] Szepesvari, C., 2010. Algorithms for Reinforcement Learning, Morgan
and Claypool Publishers, San Franscisco, CA.
[TBA86] Tsitsiklis, J. N., Bertsekas, D. P., and Athans, M., 1986. “Distributed
Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms,”
IEEE Trans. Aut. Control, Vol. AC-31, pp. 803-812.
[ThS10a] Thiery, C., and Scherrer, B., 2010. “Least-Squares λ-Policy Iteration:
Bias-Variance Trade-off in Control Problems,” in ICML’10: Proc. of the 27th
Annual International Conf. on Machine Learning.
[ThS10b] Thiery, C., and Scherrer, B., 2010. “Performance Bound for Approxi-
mate Optimistic Policy Iteration,” Technical Report, INRIA, France.
[Tol89] Tolwinski, B., 1989. “Newton-Type Methods for Stochastic Games,” in
Basar T. S., and Bernhard P. (eds), Differential Games and Applications, Lecture
Notes in Control and Information Sciences, vol. 119, Springer, pp. 128-144.
[Tsi94] Tsitsiklis, J. N., 1994. “Asynchronous Stochastic Approximation and Q-
Learning,” Machine Learning, Vol. 16, pp. 185-202.
[VVL13] Vrabie, V., Vamvoudakis, K. G., and Lewis, F. L., 2013. Optimal Adap-
tive Control and Differential Games by Reinforcement Learning Principles, The
Institution of Engineering and Technology, London.
[Van78] van der Wal, J., 1978. “Discounted Markov Games: Generalized Policy
Iteration Method,” J. of Optimization Theory and Applications, Vol. 25, pp.
125-138.
[VeP87] Verdu, S., and Poor, H. V., 1987. “Abstract Dynamic Programming
Models under Commutativity Conditions,” SIAM J. on Control and Opt., Vol.
25, pp. 990-1006.
[Wat89] Watkins, C. J. C. H., 1989. Learning from Delayed Rewards, Ph.D. Thesis,
Cambridge Univ., England.
[Whi80] Whittle, P., 1980. “Stability and Characterization Conditions in Negative
Programming,” Journal of Applied Probability, Vol. 17, pp. 635-645.
Index
Regular, see S-regular
Reinforcement learning, 25, 29, 45, 371, 373
Risk-sensitive model, 18
Robust SSP, 195, 221, 375
Rollout, 25

S
SSP problems, 15, 129, 178, 220, 221, 263, 307, 323
S-irregular policy, 122, 144, 165, 171
S-regular collection, 265
S-regular policy, 122, 144
Search problems, 171
Self-learning, 25
Semi-Markov problem, 13
Seminorm projection, 28
Semicontinuity conditions, 181
Semicontractive model, 42, 122, 141, 219
Shortest path problem, 15, 17, 127, 177, 307, 328
Simulation, 28, 39, 43, 92, 372
Spectral radius, 381
Stable policies, 135, 277, 282, 286, 289, 298, 323
Stationary policy, 54, 58
Stochastic shortest path problems, see SSP problems
Stopping problems, 104, 108, 299
Strong PI property, 156
Strong SSP conditions, 181
Synchronous convergence condition, 95

T
TD(λ), 27
Temporal differences, 26, 27, 261
Terminating policy, 209, 226, 227, 288
Totally asynchronous algorithms, 94
Transient programming problem, 16

U
Uniform fixed point, 103, 338, 363, 369
Uniformly N-stage optimal policy, 22
Uniformly proper policy, 317, 323, 332
Unit function, 379

V
Value iteration, 9, 29, 36, 66, 67, 91, 112, 150, 182, 184, 192, 194, 203, 207, 210, 211, 221, 256, 259, 271, 274, 277, 282, 293, 295, 296, 313, 318, 320, 333, 334, 359
Value iteration, asynchronous, 91, 112, 211, 221, 259, 359
Value space approximation, 29, 35, 371

W
Weak PI property, 154
Weak SSP conditions, 183
Weighted Bellman equation, 51
Weighted Euclidean norm, 25, 382
Weighted multistep mapping, 51
Weighted sup norm, 55, 352, 385
Weighted sup-norm contraction, 104, 110, 352, 385
Well-behaved region, 147, 266

X, Y

Z
Zero-sum games, 13, 109, 338, 351, 373
Neuro-Dynamic Programming
Dimitri P. Bertsekas and John N. Tsitsiklis
Athena Scientific, 1996
512 pp., hardcover, ISBN 1-886529-10-8
This is the first textbook that fully explains the neuro-dynamic pro-
gramming/reinforcement learning methodology, a breakthrough in the prac-
tical application of neural networks and dynamic programming to complex
problems of planning, optimal decision making, and intelligent control.
From the review by George Cybenko for IEEE Computational Sci-
ence and Engineering, May 1998:
“Neuro-Dynamic Programming is a remarkable monograph that in-
tegrates a sweeping mathematical and computational landscape into a co-
herent body of rigorous knowledge. The topics are current, the writing is
clear and to the point, the examples are comprehensive and the historical
notes and comments are scholarly.”
“In this monograph, Bertsekas and Tsitsiklis have performed a Her-
culean task that will be studied and appreciated by generations to come.
I strongly recommend it to scientists and engineers eager to seriously un-
derstand the mathematics and computations behind modern behavioral
machine learning.”
Among its special features, the book:
• Describes and unifies a large number of NDP methods, including sev-
eral that are new
• Describes new approaches to formulation and solution of important
problems in stochastic optimal control, sequential decision making,
and discrete optimization
• Rigorously explains the mathematical principles behind NDP
• Illustrates through examples and case studies the practical applica-
tion of NDP to complex problems from optimal resource allocation,
optimal feedback control, data communications, game playing, and
combinatorial optimization
• Presents extensive background and new research material on dynamic
programming and neural network training
This book develops in greater depth some of the methods from the
author’s Reinforcement Learning and Optimal Control textbook (Athena
Scientific, 2019). It presents new research, relating to rollout algorithms,
policy iteration, multiagent systems, partitioned architectures, and dis-
tributed asynchronous computation.
The application of the methodology to challenging discrete optimiza-
tion problems, such as routing, scheduling, assignment, and mixed integer
programming, including the use of neural network approximations within
these contexts, is also discussed.
Much of the new research is inspired by the remarkable AlphaZero
chess program, where policy iteration, value and policy networks, approxi-
mate lookahead minimization, and parallel computation all play an impor-
tant role.
Among its special features, the book:
• Presents new research relating to distributed asynchronous computa-
tion, partitioned architectures, and multiagent systems, with applica-
tion to challenging large scale optimization problems, such as combi-
natorial/discrete optimization, as well as partially observed Markov
decision problems.
• Describes variants of rollout and policy iteration for problems with
a multiagent structure, which allow the dramatic reduction of the
computational requirements for lookahead minimization.
• Establishes connections between rollout algorithms and model predictive control, one of the most prominent control system design methodologies.
• Expands the coverage of some research areas discussed in the author’s
2019 textbook Reinforcement Learning and Optimal Control.
• Provides the mathematical analysis that supports the Newton step
interpretations and the conclusions of the present book.
The book is supported by on-line video lectures and slides, as well
as new research material, some of which has been covered in the present
monograph.