Abstract Dynamic Programming
Dimitri P. Bertsekas
Massachusetts Institute of Technology
Email: [email protected]
WWW: http://www.athenasc.com
2. Contractive Models . . . . . . . . . . . . . . . . . p. 29
2.1. Fixed Point Equation and Optimality Conditions . . . . . . p. 30
2.2. Limited Lookahead Policies . . . . . . . . . . . . . . . p. 37
2.3. Value Iteration . . . . . . . . . . . . . . . . . . . . . p. 42
2.3.1. Approximate Value Iteration . . . . . . . . . . . . . p. 43
2.4. Policy Iteration . . . . . . . . . . . . . . . . . . . . . p. 46
2.4.1. Approximate Policy Iteration . . . . . . . . . . . . p. 48
2.5. Optimistic Policy Iteration . . . . . . . . . . . . . . . . p. 52
2.5.1. Convergence of Optimistic Policy Iteration . . . . . . p. 52
2.5.2. Approximate Optimistic Policy Iteration . . . . . . . p. 57
2.6. Asynchronous Algorithms . . . . . . . . . . . . . . . . p. 61
2.6.1. Asynchronous Value Iteration . . . . . . . . . . . . p. 61
2.6.2. Asynchronous Policy Iteration . . . . . . . . . . . . p. 67
2.6.3. Policy Iteration with a Uniform Fixed Point . . . . . . p. 72
2.7. Notes, Sources, and Exercises . . . . . . . . . . . . . . . p. 79
3. Semicontractive Models . . . . . . . . . . . . . . . p. 85
3.1. Semicontractive Models and Regular Policies . . . . . . . . p. 86
3.1.1. Fixed Points, Optimality Conditions, and Algorithmic Results . . . . . p. 90
3.1.2. Illustrative Example: Deterministic Shortest Path Problems . . . . . p. 97
3.2. Irregular Policies and a Perturbation Approach . . . . . p. 100
3.2.1. The Case Where Irregular Policies Have Infinite Cost . . . . . p. 100
3.2.2. The Case Where Irregular Policies Have Finite Cost
This book aims at a unified and economical development of the core the-
ory and algorithms of total cost sequential decision problems, based on
the strong connections of the subject with fixed point theory. The analy-
sis focuses on the abstract mapping that underlies dynamic programming
(DP for short) and defines the mathematical character of the associated
problem. Our discussion centers on two fundamental properties that this
mapping may have: monotonicity and (weighted sup-norm) contraction. It
turns out that the nature of the analytical and algorithmic DP theory is
determined primarily by the presence or absence of these two properties,
and the rest of the problem’s structure is largely inconsequential.
In this book, with some minor exceptions, we will assume that mono-
tonicity holds. Consequently, we organize our treatment around the con-
traction property, and we focus on four main classes of models:
(a) Contractive models, discussed in Chapter 2, which have the richest
and strongest theory, and are the benchmark against which the the-
ory of other models is compared. Prominent among these models are
discounted stochastic optimal control problems. The development of
these models is quite thorough and includes the analysis of recent ap-
proximation algorithms for large-scale problems (neuro-dynamic pro-
gramming, reinforcement learning).
(b) Semicontractive models, discussed in Chapter 3 and parts of Chap-
ter 4. The term “semicontractive” is used qualitatively here, to refer
to a variety of models where some policies have a regularity/contrac-
tion-like property but others do not. A prominent example is stochas-
tic shortest path problems, where one aims to drive the state of
a Markov chain to a termination state at minimum expected cost.
These models also have a strong theory under certain conditions, of-
ten nearly as strong as those of the contractive models.
(c) Noncontractive models, discussed in Chapter 4, which rely on just
monotonicity. These models are more complex than the preceding
ones and much of the theory of the contractive models generalizes in
weaker form, if at all. For example, in general the associated Bell-
man equation need not have a unique solution, the value iteration
method may work starting with some functions but not with others,
and the policy iteration method may not work at all. Infinite hori-
zon examples of these models are the classical positive and negative
DP problems, first analyzed by Dubins and Savage, Blackwell, and Strauch.
The errata of the original edition, as per March 1, 2014, have been incor-
porated in the present edition of the book. The following two papers have
a strong connection to the book, and amplify on the range of applications
of the semicontractive models of Chapters 3 and 4:
(1) D. P. Bertsekas, “Robust Shortest Path Planning and Semicontractive
Dynamic Programming,” Lab. for Information and Decision Systems
Report LIDS-P-2915, MIT, Feb. 2014.
(2) D. P. Bertsekas, “Infinite-Space Shortest Path Problems and Semicon-
tractive Dynamic Programming,” Lab. for Information and Decision
Systems Report LIDS-P-2916, MIT, Feb. 2014.
These papers may be viewed as “on-line appendixes” of the book. They
can be downloaded from the book’s internet site and the author’s web page.
1
Introduction
Dynamic programming (DP for short) is the principal method for analysis
of a large and diverse class of sequential decision problems. Examples are
deterministic and stochastic optimal control problems with a continuous
state space, Markov and semi-Markov decision problems with a discrete
state space, minimax problems, and sequential zero sum games. While the
nature of these problems may vary widely, their underlying structures turn
out to be very similar. In all cases there is an underlying mapping that de-
pends on an associated controlled dynamic system and corresponding cost
per stage. This mapping, the DP operator, provides a “compact signature”
of the problem. It defines the cost function of policies and the optimal cost
function, and it provides a convenient shorthand notation for algorithmic
description and analysis.
More importantly, the structure of the DP operator defines the math-
ematical character of the associated problem. The purpose of this book is to
provide an analysis of this structure, centering on two fundamental prop-
erties: monotonicity and (weighted sup-norm) contraction. It turns out
that the nature of the analytical and algorithmic DP theory is determined
primarily by the presence or absence of these two properties, and the rest
of the problem’s structure is largely inconsequential.
Consider a deterministic optimal control problem involving the discrete-time dynamic system xk+1 = f (xk , uk ), k = 0, 1, . . . . Here xk is the state of the system taking values in a set X (the state space),
and uk is the control taking values in a set U (the control space). At stage
k, there is a cost
αk g(xk , uk )
incurred when uk is applied at state xk , where α is a scalar in (0, 1] that has
the interpretation of a discount factor when α < 1. The controls are chosen
as a function of the current state, subject to a constraint that depends on
that state. In particular, at state x the control is constrained to take values
in a given set U (x) ⊂ U . Thus we are interested in optimization over the
set of (nonstationary) policies
! "
Π = {µ0 , µ1 , . . .} | µk ∈ M, k = 0, 1, . . . ,
† For the informal discussion of this section, we will disregard a few mathe-
matical issues. In particular, we assume that the series defining Jπ in Eq. (1.2)
is convergent for all allowable π, and that the optimal cost function J ∗ is real-
valued. We will address such issues later.
We now note that both Eqs. (1.3) and (1.4) can be stated in terms of
the expression
! "
H(x, u, J ) = g(x, u) + αJ f (x, u) , x ∈ X, u ∈ U (x).
Defining
(Tµ J )(x) = H(x, µ(x), J ), x ∈ X,
and
(T J )(x) = inf_{u∈U(x)} H(x, u, J ) = inf_{µ∈M} (Tµ J )(x), x ∈ X,
Then it can be verified by induction that for all initial states x0 , we have
Jπ,N (x0 ) = (Tµ0 Tµ1 · · · TµN −1 J¯)(x0 ). (1.6)
Here Tµ0 Tµ1 · · · TµN −1 is the composition of the mappings Tµ0 , Tµ1 , . . . TµN −1 ,
i.e., for all J ,
! "
(Tµ0 Tµ1 J )(x) = Tµ0 (Tµ1 J ) (x), x ∈ X,
and more generally
! "
(Tµ0 Tµ1 · · · TµN −1 J )(x) = Tµ0 (Tµ1 (· · · (TµN −1 J ))) (x), x ∈ X,
(our notational conventions are summarized in Appendix A). Thus the
finite horizon cost functions Jπ,N of π can be defined in terms of the map-
pings Tµ [cf. Eq. (1.6)], and so can their infinite horizon limit Jπ :
Jπ (x) = lim_{N→∞} (Tµ0 Tµ1 · · · TµN −1 J¯)(x), x ∈ X, (1.7)
where J¯ is the zero function, J¯(x) = 0 for all x ∈ X (assuming the limit
exists).
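As a concrete illustration of these definitions, the following minimal Python sketch (not part of the book; the two-state problem, its cost function g, and the system function f are hypothetical choices) implements the mappings Tµ and T and evaluates the N-stage cost Jπ,N = (Tµ0 Tµ1 · · · TµN−1 J̄)(x0) of Eq. (1.6) by applying the mappings in the indicated order.

```python
# A minimal sketch (not from the book): a hypothetical deterministic problem
# with two states and two controls, illustrating the mappings T_mu and T and
# the N-stage cost J_{pi,N} = (T_{mu_0} ... T_{mu_{N-1}} Jbar)(x).

ALPHA = 0.9
STATES = [0, 1]
CONTROLS = {0: [0, 1], 1: [0, 1]}          # U(x) for each state x

def f(x, u):                                # system equation x_{k+1} = f(x_k, u_k)
    return u                                # control u moves the system to state u

def g(x, u):                                # cost per stage
    return 1.0 if x == u else 0.0           # staying put costs 1, moving costs 0

def H(x, u, J):                             # H(x, u, J) = g(x, u) + alpha * J(f(x, u))
    return g(x, u) + ALPHA * J[f(x, u)]

def T_mu(mu, J):                            # (T_mu J)(x) = H(x, mu(x), J)
    return [H(x, mu[x], J) for x in STATES]

def T(J):                                   # (T J)(x) = min over u in U(x) of H(x, u, J)
    return [min(H(x, u, J) for u in CONTROLS[x]) for x in STATES]

# N-stage cost of a nonstationary policy pi = (mu_0, mu_1, ...):
# J_{pi,N} = T_{mu_0} T_{mu_1} ... T_{mu_{N-1}} Jbar, with Jbar = 0.
def J_pi_N(policy_prefix, Jbar):
    J = list(Jbar)
    for mu in reversed(policy_prefix):      # apply T_{mu_{N-1}} first
        J = T_mu(mu, J)
    return J

if __name__ == "__main__":
    Jbar = [0.0, 0.0]
    pi = [{0: 1, 1: 0}, {0: 0, 1: 1}, {0: 1, 1: 1}]   # (mu_0, mu_1, mu_2)
    print("J_pi,3 =", J_pi_N(pi, Jbar))
    print("T Jbar =", T(Jbar))
```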
The Bellman equation (1.3) and the optimality condition (1.4), stated in
terms of the mappings Tµ and T , highlight the central theme of this book,
which is that DP theory is intimately connected with the theory of abstract
mappings and their fixed points. Analogs of the Bellman equation, J * =
T J * , optimality conditions, and other results and computational methods
hold for a great variety of DP models, and can be stated compactly as
described above in terms of the corresponding mappings Tµ and T . The
gain from this abstraction is greater generality and mathematical insight,
as well as a more unified, economical, and streamlined analysis.
where Tµ0 Tµ1 · · · TµN −1 denotes the composition of the mappings Tµ0 , Tµ1 ,
. . . , TµN −1 , i.e.,
! "
Tµ0 Tµ1 · · · TµN −1 J = Tµ0 (Tµ1 (· · · (TµN −2 (TµN −1 J ))) · · ·) , J ∈ R(X).
We view Jπ,N as the “N -stage cost function” of π [cf. Eq. (1.5)]. Consider
also the function
Jπ (x) = lim sup_{N→∞} Jπ,N (x) = lim sup_{N→∞} (Tµ0 Tµ1 · · · TµN −1 J¯)(x), x ∈ X,
which we view as the “infinite horizon cost function” of π [cf. Eq. (1.7); we
use lim sup for generality, since we are not assured that the limit exists].
We want to minimize Jπ over π, i.e., to find
J * (x) = inf_{π∈Π} Jπ (x), x ∈ X.
Tµ∗ J * = T J *
or equivalently,
J ≤ J ′ ⇒ T J ≤ T J ′ .
Another way to arrive at this relation is to note that the monotonicity
assumption is equivalent to
J ≤ J ′ ⇒ Tµ J ≤ Tµ J ′ , ∀ µ ∈ M,
on B(X). The properties of B(X) and some of the associated fixed point
theory are discussed in Appendix B. In particular, as shown there, B(X) is a complete normed space.
and combining the preceding two relations, and taking the supremum of
the left side over x ∈ X, we obtain Eq. (1.9).
Nearly all mappings related to DP satisfy the monotonicity assump-
tion, and many important ones satisfy the weighted sup-norm contraction
assumption as well. When both assumptions hold, the most powerful an-
alytical and computational results can be obtained, as we will show in
Chapter 2. These are:
(a) Bellman’s equation has a unique solution, i.e., T and Tµ have unique
fixed points, which are the optimal cost function J * and the cost
functions Jµ of the stationary policies {µ, µ, . . .}, respectively [cf. Eq.
(1.3)].
(b) A stationary policy {µ∗ , µ∗ , . . .} is optimal if and only if
Tµ∗ J * = T J * .
(c) J * and Jµ can be computed by the value iteration method,
J * = lim_{k→∞} T k J, Jµ = lim_{k→∞} Tµk J,
starting with any J ∈ B(X).
(d) The optimal cost function J * can be computed by the policy iteration method, which generates a sequence of stationary policies via
Tµk+1 Jµk = T Jµk ,
starting from some initial policy µ0 [here Jµk is obtained as the fixed
point of Tµk by several possible methods, including value iteration as
in (c) above].
These are the most favorable types of results one can hope for in the
DP context, and they are supplemented by a host of other results, involving
approximate and/or asynchronous implementations of the value and policy
iteration methods, and other related methods that combine features of
both. As the contraction property is relaxed and is replaced by various
weaker assumptions, some of the preceding results may hold in weaker
form. For example J * turns out to be a solution of Bellman’s equation in
all the models to be discussed, but it may not be the unique solution. The
interplay between the monotonicity and contraction-like properties, and
the associated results of the form (a)-(d) described above is the recurring
analytical theme in this book.
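To make the results (a)-(d) above concrete, here is a schematic Python sketch for a hypothetical finite-state, finite-control discounted MDP (the randomly generated data P, g and the discount factor α are illustrative assumptions, not data from the book): value iteration applies T repeatedly, while policy iteration alternates exact policy evaluation (a linear solve for Jµ) with policy improvement Tµk+1 Jµk = T Jµk.

```python
# A schematic sketch (not from the book) of value iteration and policy iteration
# for a hypothetical finite-state discounted MDP.
import numpy as np

np.random.seed(0)
n, m, alpha = 4, 3, 0.9                       # states, controls, discount
P = np.random.rand(m, n, n); P /= P.sum(axis=2, keepdims=True)   # P[u, x, y]
g = np.random.rand(m, n)                      # g[u, x]: cost of control u at state x

def T(J):                                     # (T J)(x) = min_u [ g(x,u) + alpha * sum_y p_xy(u) J(y) ]
    return (g + alpha * P @ J).min(axis=0)

def greedy(J):                                # a mu with T_mu J = T J
    return (g + alpha * P @ J).argmin(axis=0)

def J_of(mu):                                 # J_mu = unique fixed point of T_mu (linear system)
    P_mu = P[mu, np.arange(n), :]
    g_mu = g[mu, np.arange(n)]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

# Value iteration: T^k J -> J*
J = np.zeros(n)
for _ in range(1000):
    J = T(J)

# Policy iteration: evaluate J_{mu^k}, then improve via T_{mu^{k+1}} J_{mu^k} = T J_{mu^k}
mu = np.zeros(n, dtype=int)
for _ in range(20):
    mu_new = greedy(J_of(mu))
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new

print("VI estimate of J*:", np.round(J, 4))
print("PI estimate of J*:", np.round(J_of(mu), 4))
```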
In what follows in this section, we describe a few special cases, which indi-
cate the connections of appropriate forms of the mapping H with the most
popular total cost DP models. In all these models the monotonicity As-
sumption 1.2.1 (or some closely related version) holds, but the contraction
Assumption 1.2.2 may not hold, as we will indicate later. Our descriptions
are by necessity brief, and the reader is referred to the relevant textbook
literature for more detailed discussion.
In particular, it can be shown that the limit exists if α < 1 and g is uniformly
bounded, i.e., for some B > 0,
& &
&g(x, u, w)& ≤ B, ∀ x ∈ X, u ∈ U (x), w ∈ W. (1.12)
so that
(Tµ J)(x) = E{ g(x, µ(x), w) + αJ(f (x, µ(x), w)) },
and
(T J)(x) = inf_{u∈U(x)} E{ g(x, u, w) + αJ(f (x, u, w)) }.
In this way, taking also into account the rule ∞ − ∞ = ∞ (see Appendix A), the expected value above
is well-defined as an extended real number if Ω is finite or countably infinite.
and parallels the one given for deterministic optimal control problems [cf. Eq.
(1.3)].
These properties can be expressed and analyzed in an abstract setting
by using just the mappings Tµ and T , both when Tµ and T are contractive
(see Chapter 2), and when they are only monotone and not contractive (see
Chapter 4). Moreover, under some conditions, it is possible to analyze these
properties in cases where Tµ is contractive for some but not all µ (see Chapter
3, and Sections 4.4-4.5).
In the special case of the preceding example where the number of states is
finite, the system equation (1.10) may be defined in terms of the transition
probabilities
" #
pxy (u) = Prob y = f (x, u, w) | x , x, y ∈ X, u ∈ U (x),
When the boundedness condition [cf. Eq. (1.12)] holds (or more simply when U is a finite set), the mappings Tµ
and T are contraction mappings with respect to the standard (unweighted)
sup-norm. This is a classical problem, referred to as discounted finite-state
MDP , which has a favorable theory and has found extensive applications (cf.
[Ber12a], Chapters 1 and 2). The model is additionally important, because it
is often used for computational solution of continuous state space problems
via discretization.
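As a small numerical illustration (a sketch, not from the book), one can check the sup-norm contraction property of T for a randomly generated finite-state discounted MDP; the instance below is hypothetical, and the check is over random pairs of cost vectors.

```python
# A numerical check (a sketch) that T is a sup-norm contraction with modulus
# alpha for a hypothetical finite-state discounted MDP:
# ||T J - T J'||_inf <= alpha * ||J - J'||_inf.
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 5, 3, 0.8
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # transition probs
g = rng.random((m, n))                                         # bounded stage costs

def T(J):
    return (g + alpha * P @ J).min(axis=0)

for _ in range(1000):
    J1, J2 = rng.normal(size=n), rng.normal(size=n)
    lhs = np.abs(T(J1) - T(J2)).max()
    rhs = alpha * np.abs(J1 - J2).max()
    assert lhs <= rhs + 1e-12
print("contraction inequality held in all random trials")
```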
where G is some function representing expected cost per stage, and mxy (u)
are nonnegative scalars with
Σ_{y∈X} mxy (u) < 1, ∀ x ∈ X, u ∈ U (x).
Let us consider a zero sum game analog of the finite-state MDP Example 1.2.2.
Here there are two players that choose actions at each stage: the first (called
the minimizer ) may choose a move i out of n moves and the second (called
the maximizer ) may choose a move j out of m moves. Then the minimizer
gives a specified amount aij to the maximizer, called a payoff . The minimizer
wishes to minimize aij , and the maximizer wishes to maximize aij .
The players use mixed strategies, whereby the minimizer selects a prob-
ability distribution u = (u1 , . . . , un ) over his n possible moves and the max-
imizer selects a probability distribution v = (v1 , . . . , vm ) over his m possible
moves. Since the probability of selecting i and j is ui vj , the expected payoff for this stage is Σ_{i,j} aij ui vj or u′Av, where A is the n × m matrix with components aij .
In a single-stage version of the game, the minimizer must minimize
maxv∈V u′Av and the maximizer must maximize minu∈U u′Av, where U and
V are the sets of probability distributions over {1, . . . , n} and {1, . . . , m},
respectively. A fundamental result (which will not be proved here) is that
these two values are equal:
min_{u∈U} max_{v∈V} u′Av = max_{v∈V} min_{u∈U} u′Av. (1.13)
Let us consider the situation where a separate game of the type just
described is played at each stage. The game played at a given stage is repre-
sented by a “state” x that takes values in a finite set X. The state evolves
according to transition probabilities qxy (i, j) where i and j are the moves
selected by the minimizer and the maximizer, respectively (here y represents
the next game to be played after moves i and j are chosen at the game rep-
resented by x). When the state is x, under u ∈ U and v ∈ V , the one-stage
expected payoff is u′A(x)v, where A(x) is the n × m payoff matrix, and the
state transition probabilities are
pxy (u, v) = Σ_{i=1}^{n} Σ_{j=1}^{m} ui vj qxy (i, j) = u′Qxy v,
where Qxy is the n × m matrix that has components qxy (i, j). Payoffs are
discounted by α ∈ (0, 1), and the objectives of the minimizer and maximizer,
roughly speaking, are to minimize and to maximize the total discounted ex-
pected payoff. This requires selections of u and v to strike a balance between
obtaining favorable current stage payoffs and playing favorable games in fu-
ture stages.
We now introduce an abstract DP framework related to the sequential
move selection process just described. We consider the mapping G given by
G(x, u, v, J) = u′A(x)v + α Σ_{y∈X} pxy (u, v)J(y) = u′( A(x) + α Σ_{y∈X} Qxy J(y) )v, (1.14)
and
(T J)(x) = min_{u∈U} max_{v∈V} G(x, u, v, J).
Since A(x) + α Σ_{y∈X} Qxy J ∗ (y) [cf. Eq. (1.14)] is a matrix that is independent of u and v, we may view J ∗ (x)
as the value of a static game (which depends on the state x). In particular,
from the fundamental minimax equality (1.13), we have
min_{u∈U} max_{v∈V} G(x, u, v, J ∗ ) = max_{v∈V} min_{u∈U} G(x, u, v, J ∗ ).
This implies that J ∗ is also the unique fixed point of the mapping
(T̄ J)(x) = max_{v∈V} H̄(x, v, J),
where
H̄(x, v, J) = min_{u∈U} G(x, u, v, J),
i.e., J ∗ is the fixed point regardless of the order in which minimizer and
maximizer select mixed strategies at each stage.
In the preceding development, we have introduced J ∗ as the unique
fixed point of the mappings T and T̄ . However, J ∗ also has an interpretation
in game theoretic terms. In particular, it can be shown that J ∗ (x) is the value
of a dynamic game, whereby at state x the two opponents choose multistage
(possibly nonstationary) policies that consist of functions of the current state,
and continue to select moves using these policies over an infinite horizon. For
further discussion of this interpretation, we refer to [Ber12a] and to books on
dynamic games such as [FiV96]; see also [PaB99] and [Yu11] for an analysis
of the undiscounted case (α = 1) where there is a termination state, as in the
stochastic shortest path problems of the subsequent Example 1.2.6.
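The following Python sketch indicates one way a single application of T for this game model could be computed (an illustration, not the book's algorithm): at each state x one forms the matrix A(x) + α Σy Qxy J(y) of Eq. (1.14) and computes its minimax value by the standard linear program for the minimizer's mixed strategy, here via scipy.optimize.linprog; the two-state instance at the end is hypothetical.

```python
# A sketch (not from the book) of one application of T for the sequential game
# model: (T J)(x) is the value of the matrix game with payoff matrix
# M(x) = A(x) + alpha * sum_y Q_xy J(y).
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value min_u max_v u' M v over mixed strategies (minimizer's LP)."""
    n, m = M.shape
    c = np.zeros(n + 1); c[-1] = 1.0                      # minimize t
    A_ub = np.hstack([M.T, -np.ones((m, 1))])             # u' M e_j <= t for every column j
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))]) # sum_i u_i = 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def T(J, A, Q, alpha):
    """(T J)(x) = value of the game with payoff A[x] + alpha * sum_y Q[x][y] * J[y]."""
    return np.array([matrix_game_value(A[x] + alpha * sum(Q[x][y] * J[y] for y in range(len(J))))
                     for x in range(len(J))])

if __name__ == "__main__":
    # Hypothetical 2-state instance with 2x2 payoff matrices; Q[x][y][i, j] = q_xy(i, j).
    A = [np.array([[1.0, -1.0], [-1.0, 1.0]]), np.array([[0.0, 2.0], [2.0, 0.0]])]
    Q = [[np.full((2, 2), 0.5), np.full((2, 2), 0.5)],
         [np.full((2, 2), 0.5), np.full((2, 2), 0.5)]]
    print(T(np.zeros(2), A, Q, alpha=0.9))
```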
The stochastic shortest path (SSP for short) problem is the special case of
the stochastic optimal control Example 1.2.1 where:
(a) There is no discounting (α = 1).
(b) The state space is X = {0, 1, . . . , n} and we are given transition prob-
abilities, denoted by
pxy (u) = Prob{xk+1 = y | xk = x, uk = u}, x, y ∈ X, u ∈ U (x),
where state 0 is a cost-free and absorbing termination state. The corresponding mapping T is given by
(T J)(x) = min_{u∈U(x)} [ g(x, u) + Σ_{y=1}^{n} pxy (u)J(y) ], x = 1, . . . , n.
Note that the matrix that has components pxy (u), x, y = 1, . . . , n, is sub-
stochastic (some of its row sums may be less than 1) because there may be
positive transition probability from a state x to the termination state 0. Con-
sequently Tµ may be a contraction for some µ, but not necessarily for all
µ ∈ M.
The SSP problem has been discussed in many sources, including the
books [Pal67], [Der70], [Whi82], [Ber87], [BeT89], [HeL99], and [Ber12a],
where it is sometimes referred to by earlier names such as “first passage
problem” and “transient programming problem.” In the framework that is
most relevant to our purposes, there is a classification of stationary policies
for SSP into proper and improper . We say that µ ∈ M is proper if, when
using µ, there is positive probability that termination will be reached after at
most n stages, regardless of the initial state; i.e., if
max_{x=1,...,n} Prob{xn ≠ 0 | x0 = x, µ} < 1.
starting with any J ∈ ℜn (see [Ber12a], Chapter 3, for a textbook account).
These properties are in analogy with the desirable properties (a)-(c), given at
the end of the preceding subsection in connection with contractive models.
Regarding policy iteration, it works in its strongest form when there are
no improper policies, in which case the mappings Tµ and T are weighted sup-
norm contractions. When there are improper policies, modifications to the
policy iteration method are needed; see [YuB11a], [Ber12a], and also Sections
3.3.2, 3.3.3, where these modifications will be discussed in an abstract setting.
Let us also note that there is an alternative line of analysis of SSP
problems, whereby favorable results are obtained assuming that there exists
an optimal proper policy, and the one-stage cost is nonnegative, g(x, u) ≥ 0
for all (x, u) (see [Pal67], [Der70], [Whi82], and [Ber87]). This analysis will
also be generalized in Chapter 3 and in Section 4.4, and the nonnegativity
assumption on g will be relaxed.
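As an illustration of the proper/improper classification (a sketch under the assumptions of Example 1.2.6, with a hypothetical 3-state instance), the following Python fragment tests properness of a stationary policy via the n-step substochastic matrix and, for a proper policy, computes Jµ by solving the linear system Jµ = gµ + Pµ Jµ.

```python
# A sketch (not from the book): check whether a stationary policy mu of a
# hypothetical SSP instance is proper and, if so, compute J_mu by solving
# (I - P_mu) J_mu = g_mu over the non-termination states 1, ..., n.
import numpy as np

def is_proper(P_mu, tol=1e-12):
    """P_mu: (n x n) substochastic matrix over states 1..n (state 0 absorbed).
    mu is proper iff from every state the probability of not having terminated
    within n steps is less than 1, i.e., max row sum of P_mu^n is < 1."""
    n = P_mu.shape[0]
    Pn = np.linalg.matrix_power(P_mu, n)
    return Pn.sum(axis=1).max() < 1 - tol

def policy_cost(P_mu, g_mu):
    """J_mu solves (I - P_mu) J_mu = g_mu when mu is proper."""
    n = P_mu.shape[0]
    return np.linalg.solve(np.eye(n) - P_mu, g_mu)

# Hypothetical 3-state example where termination is reached only through state 1.
P_mu = np.array([[0.7, 0.0, 0.0],
                 [0.3, 0.4, 0.3],
                 [0.0, 0.3, 0.7]])
g_mu = np.array([1.0, 2.0, 1.5])
print("proper:", is_proper(P_mu))
print("J_mu:", policy_cost(P_mu, g_mu))
```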
The special case of the SSP problem where the state transitions are determin-
istic is the classical shortest path problem. Here, we have a graph of n nodes
x = 1, . . . , n, plus a destination 0, and an arc length axy for each directed arc
(x, y). At state/node x, a policy µ chooses an outgoing arc from x. Thus the
controls available at x can be identified with the outgoing neighbors of x [the
nodes u such that (x, u) is an arc]. The corresponding mapping H is
H(x, u, J) = { axu + J(u) if u ≠ 0, ax0 if u = 0 }, x = 1, . . . , n.
" #
A stationary policy µ defines a graph whose arcs are x, µ(x) , x =
1, . . . , n. The policy µ is proper if and only if this graph is acyclic (it consists of
a tree of directed paths leading from each node to the destination). Thus there
exists a proper policy if and only if each node is connected to the destination
with a directed path. Furthermore, an improper policy has finite cost starting
from every initial state if and only if all the cycles of the corresponding graph
have nonnegative cycle cost. It follows that the favorable analytical and
algorithmic results described for SSP in the preceding example hold if the
given graph is connected and the costs of all its cycles are positive. We will
see later that significant complications result if the cycle costs are allowed to
be nonpositive, even though the shortest path problem is still well posed in
the sense that shortest paths exist if the given graph is connected (see Section
3.1.2).
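For the deterministic shortest path case, the following sketch (a hypothetical 3-node graph; not from the book) spells out the mapping H above and the properness test: a policy is proper exactly when following the successor µ(x) from every node reaches the destination 0 within n steps.

```python
# A sketch (not from the book) of the deterministic shortest path mapping H and
# of the properness test for a stationary policy mu.
def H(x, u, J, a):
    """a[(x, u)] is the length of arc (x, u); node 0 is the destination."""
    return a[(x, 0)] if u == 0 else a[(x, u)] + J[u]

def is_proper(mu, n):
    """Follow each node's successor chain; mu is proper iff every chain hits 0
    within n steps (otherwise it must have entered a cycle)."""
    for x in range(1, n + 1):
        y = x
        for _ in range(n):
            y = mu[y]
            if y == 0:
                break
        else:
            return False
    return True

# Hypothetical 3-node graph with destination 0 and arc lengths a.
a = {(1, 0): 5.0, (1, 2): 1.0, (2, 3): 1.0, (2, 0): 2.0, (3, 1): 1.0, (3, 0): 4.0}
mu_proper = {1: 2, 2: 0, 3: 0}      # 1 -> 2 -> 0 and 3 -> 0: acyclic, hence proper
mu_improper = {1: 2, 2: 3, 3: 1}    # cycle 1 -> 2 -> 3 -> 1: improper
print(is_proper(mu_proper, 3), is_proper(mu_improper, 3))
```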
J¯(x) ≡ 1, x ∈ X,
so that
Jπ (x) = lim sup_{N→∞} (Tµ0 Tµ1 · · · TµN −1 J¯)(x), x ∈ X.
(TµN −1 J¯)(xN −1 ) = E{ g(xN −1 , µN −1 (xN −1 ), xN ) J¯(xN ) | xN −1 }
= E{ g(xN −1 , µN −1 (xN −1 ), xN ) | xN −1 },
which by using the iterated expectations formula (see e.g., [BeT08]) proves
the expression (1.17).
An important special case of a multiplicative model is when g has the
form
g(x, u, y) = eh(x,u,y)
for some one-stage cost function h. We then obtain a finite-state MDP with
an exponential cost function,
Jπ (x0 ) = lim sup_{N→∞} E{ e^{h(x0 ,µ0 (x0 ),x1 )+···+h(xN −1 ,µN −1 (xN −1 ),xN )} },
which is often used to introduce risk aversion in the choice of policy through
the convexity of the exponential.
There is also a multiplicative version of the infinite state space stochas-
tic optimal control problem of Example 1.2.1. The mapping H takes the
form
H(x, u, J) = E{ g(x, u, w) J(f (x, u, w)) },
where xk+1 = f (xk , uk , wk ) is the underlying discrete-time dynamic system;
cf. Eq. (1.10).
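A minimal sketch of the multiplicative mapping for a finite-state instance with exponential one-stage cost g(x, u, y) = e^{h(x,u,y)} (the transition probabilities and h below are randomly generated assumptions, not data from the book) is as follows; note that J̄(x) ≡ 1 is the natural starting function here.

```python
# A sketch (not from the book) of the multiplicative-model mappings for a
# hypothetical finite-state problem with exponential one-stage cost:
# (T_mu J)(x) = sum_y p_xy(mu(x)) * exp(h(x, mu(x), y)) * J(y).
import numpy as np

n, m = 3, 2
rng = np.random.default_rng(2)
p = rng.random((m, n, n)); p /= p.sum(axis=2, keepdims=True)   # p[u, x, y]
h = rng.random((m, n, n))                                      # h[u, x, y]

def T_mu(mu, J):
    return np.array([np.sum(p[mu[x], x, :] * np.exp(h[mu[x], x, :]) * J) for x in range(n)])

def T(J):
    return np.min([np.array([np.sum(p[u, x, :] * np.exp(h[u, x, :]) * J) for x in range(n)])
                   for u in range(m)], axis=0)

Jbar = np.ones(n)                      # for multiplicative models Jbar(x) = 1
print(T(Jbar))
```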
Multiplicative models and related risk-sensitive models are discussed
extensively in the literature, mostly for the exponential cost case and under
different assumptions than ours; see e.g., [HoM72], [Jac73], [Rot84], [ChS87],
[Whi90], [JBE94], [FlM95], [HeM96], [FeM97], [BoM99], [CoM99],
[BoM02], [BBB08]. The works of references [DeR79], [Pat01], and [Pat07]
relate to the stochastic shortest path problems of Example 1.2.6, and are the
closest to the semicontractive models discussed in Chapters 3 and 4.
Issues of risk-sensitivity have also been dealt with in frameworks that
do not quite conform to the multiplicative model of the preceding example,
and are based on the theory of multi-stage risk measures; see e.g., [Rus10],
[CaR12], and the references quoted there. Still these formulations involve
abstract monotone DP mappings and are covered by our theory.
Vℓ = {Vℓy | y ∈ Sℓ }.
Processor ℓ also maintains a scalar aggregate cost Rℓ for its aggregate state,
which is a weighted average of the detailed cost values Vℓx within Sℓ :
Rℓ = Σ_{x∈Sℓ} dℓx Vℓx ,
where dℓx are given probabilities with dℓx ≥ 0 and Σ_{x∈Sℓ} dℓx = 1. The aggregate
costs Rℓ are communicated between processors and are used to perform
the computation of the local cost functions Vℓ (we will discuss computation
models of this type in Section 2.6).
We denote J = (V1 , . . . , Vm , R1 , . . . , Rm ), so that J is a vector of dimension
n + m. We introduce the mapping H(x, u, J) defined for each of the
n states x by
H(x, u, J) = Wℓ (x, u, Vℓ , R1 , . . . , Rm ), if x ∈ Sℓ ,
where for x ∈ Sℓ
Wℓ (x, u, Vℓ , R1 , . . . , Rm ) = Σ_{y=1}^{n} pxy (u)g(x, u, y) + α Σ_{y∈Sℓ} pxy (u)Vℓy + α Σ_{y∉Sℓ} pxy (u)Rs(y) ;
and for each original system state y, we denote by s(y) the index of the subset
to which y belongs [i.e., y ∈ Ss(y) ].
We may view H as an abstract mapping on the space of J, and aim to
find its fixed point J ∗ = (V1∗ , . . . , Vm∗ , R1∗ , . . . , Rm∗ ). Then, for ℓ = 1, . . . , m, we
† See [Ber12a], Section 6.5.2, for a more detailed discussion. Other examples
of algorithmic mappings that come under our framework arise in asynchronous
policy iteration (see Sections 2.6.3, 3.3.2, and [BeY10a], [BeY10b], [YuB11a]),
and in constrained forms of policy iteration (see [Ber11c], or [Ber12a], Exercise
2.7).
may view Vℓ∗ as an approximation to the optimal cost vector of the original
MDP starting at states x ∈ Sℓ , and we may view Rℓ∗ as a form of aggregate
cost for Sℓ . The advantage of this formulation is that it involves significant
decomposition and parallelization of the computations among the processors,
when performing various DP algorithms. In particular, the computation of
Wℓ (x, u, Vℓ , R1 , . . . , Rm ) depends on just the local vector Vℓ , whose dimension
may be potentially much smaller than n.
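The following sketch (hypothetical partition, transition probabilities, and costs; not the book's implementation) spells out the mapping H and the local mappings Wℓ just described, making explicit that Wℓ uses only the local detailed costs Vℓ together with the aggregate costs R1, . . . , Rm.

```python
# A sketch (not from the book) of the aggregation-based mapping W_l above, for a
# hypothetical partition of the states into subsets S_1, ..., S_m.
import numpy as np

n, m_ctrl, alpha = 6, 2, 0.9
rng = np.random.default_rng(5)
P = rng.random((m_ctrl, n, n)); P /= P.sum(axis=2, keepdims=True)   # p_xy(u)
G = rng.random((m_ctrl, n, n))                                      # g(x, u, y)
subsets = [[0, 1, 2], [3, 4, 5]]                 # S_1, S_2: a partition of the states
s = {y: l for l, S in enumerate(subsets) for y in S}   # s(y): index of the subset of y

def W(l, x, u, V_l, R):
    """V_l[y] for y in S_l is the local detailed cost; R[l'] the aggregate cost of S_l'."""
    expected_cost = P[u, x, :] @ G[u, x, :]                            # sum_y p_xy(u) g(x,u,y)
    local = alpha * sum(P[u, x, y] * V_l[y] for y in subsets[l])       # local detailed part
    aggregate = alpha * sum(P[u, x, y] * R[s[y]]                       # aggregate part
                            for y in range(n) if y not in subsets[l])
    return expected_cost + local + aggregate

def H(x, u, V, R):
    l = s[x]
    return W(l, x, u, V[l], R)

V = [{y: 0.0 for y in S} for S in subsets]       # local detailed costs V_l
R = [0.0, 0.0]                                   # aggregate costs R_l
print(H(2, 1, V, R))
```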
T (λ) J = (1 − λ) Σ_{ℓ=1}^{∞} λ^{ℓ−1} T ℓ J,
for λ ∈ (0, 1], where T ℓ denotes the ℓ-fold composition of T with itself.
Here the mapping T (λ) is used in place of T in the projected equation
(1.18). In the context of the aggregation equation approach, a multistep
method based on the mapping T (λ) is the λ-aggregation method, given for
the case of hard aggregation in [Ber12a], Section 6.5, as well as other forms
of aggregation (see [Ber12a], [YuB12]).
A more general form of multistep approach, introduced and studied
in [YuB12], uses instead the mapping T (w) : ℜn → ℜn , with components
(T (w) J)(i) = Σ_{ℓ=1}^{∞} wiℓ (T ℓ J)(i), i = 1, . . . , n, J ∈ ℜn ,
where for each i, (wi1 , wi2 , . . .) is a probability distribution over the positive
integers. Then the multistep analog of the projected equation (1.18) is obtained by using T (w) in place of T .
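As an illustration of the multistep mapping T(w) (a sketch, not from the book), the fragment below evaluates a truncated version of T(w)J for a hypothetical single-policy linear mapping T J = g + αPJ; choosing wℓ = (1 − λ)λ^{ℓ−1} corresponds, up to truncation and renormalization, to the mapping T(λ) discussed above.

```python
# A sketch (not from the book) of the multistep mapping T^(w) for a hypothetical
# linear (single-policy) mapping T J = g + alpha * P @ J, with weights truncated
# at l = L (they are assumed to sum to 1 for each i).
import numpy as np

n, alpha, L, lam = 4, 0.9, 50, 0.7
rng = np.random.default_rng(4)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random(n)

def T(J):
    return g + alpha * P @ J

# lambda-weights w_l = (1 - lam) * lam**(l-1), l = 1, ..., L (same for every i),
# renormalized after truncation; this mimics the mapping T^(lambda).
w = np.array([(1 - lam) * lam ** (l - 1) for l in range(1, L + 1)])
w /= w.sum()

def T_w(J):
    out, powers = np.zeros(n), J.copy()
    for l in range(L):
        powers = T(powers)          # powers = T^{l+1} J
        out += w[l] * powers
    return out

J0 = np.zeros(n)
print("T^(w) J0 =", np.round(T_w(J0), 4))
```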
The examples of the preceding section have illustrated how the monotonic-
ity assumption is satisfied for many DP models, while the contraction as-
sumption may or may not be satisfied. In particular, the contraction as-
sumption is satisfied for the mapping H in Examples 1.2.1-1.2.5, assuming
that there is discounting and that the cost per stage is bounded, but it
need not hold in the SSP Example 1.2.6 and the multiplicative Example
1.2.8.
The main theme of this book is that the presence or absence of mono-
tonicity and contraction is the primary determinant of the analytical and
algorithmic theory of a typical total cost DP model. In our development,
with some minor exceptions, we will assume that monotonicity holds. Con-
sequently, the rest of the book is organized around the presence or absence
of the contraction property. In the next four chapters we will discuss the
following four types of models.
(a) Contractive models: These models, discussed in Chapter 2, have
the richest and strongest algorithmic theory, and are the benchmark
against which the theory of other models is compared. Prominent
among them are discounted stochastic optimal control problems (cf.
Example 1.2.1), finite-state discounted MDP (cf. Example 1.2.2), and
some special types of SSP problems (cf. Example 1.2.6).
(b) Semicontractive models: In these models Tµ is monotone but it
need not be a contraction for all µ ∈ M. Instead policies are sepa-
rated into those that “behave well” with respect to our optimization
framework and those that do not. It turns out that the notion of
contraction is not sufficiently general for our purposes. We will thus
introduce a related notion of “regularity,” which is based on the idea
that a policy µ should be considered “well-behaved” if the dynamic
system defined by Tµ has Jµ as an asymptotically stable equilibrium
within some domain. Our models and analysis are patterned to a
large extent after the SSP problems of Example 1.2.6 (the regular µ
correspond to the proper policies). One of the complications here is
that policies that are not regular may have cost functions that take
the value +∞ or −∞. Still under certain conditions, which directly
or indirectly guarantee that there exists an optimal regular policy,
the complications can be dealt with, and we can prove strong prop-
erties for these models, sometimes almost as strong as those of the
contractive models.
(c) Noncontractive models: These models rely on just the monotonic-
ity property of Tµ , and are more complex than the preceding ones.
As in semicontractive models, the various cost functions of the prob-
lem may take the values +∞ or −∞, and the mappings Tµ and T
must accordingly be allowed to deal with such functions. However,
the optimal cost function may take the values ∞ and −∞ as a matter
of course (rather than on an exceptional basis, as in semicontractive
models). The complications are considerable, and much of the the-
ory of the contractive models generalizes in weaker form, if at all.
For example, in general the fixed point equation J = T J need not
have a unique solution, the value iteration method may work start-
ing with some functions but not with others, and the policy iteration
method may not work at all. Of course some of these weaknesses may
not appear in the presence of additional structure, and we will discuss
noncontractive models that also have some semicontractive structure,
and corresponding favorable properties.
(d) Restricted Policies Models: These models are variants of some of
the preceding ones, where there are restrictions of the set of policies,
so that M may be a strict subset of the set of functions µ : X → U
with µ(x) ∈ U (x) for all x ∈ X. Such restrictions may include mea-
surability (needed to establish a mathematically rigorous probabilistic
framework) or special structure that enhances the characterization of
optimal policies and facilitates their computation.
Examples of DP problems from each of the above model categories,
mostly special cases of the specific DP models discussed in Section 1.2, are
scattered throughout the book, both to illustrate the theory and its excep-
tions, and to illustrate the beneficial role of additional special structure.
The discussion of algorithms centers on abstract forms of value and policy
iteration, and is organized along three characteristics: exact, approximate,
and asynchronous.
The exact algorithms represent idealized versions, the approximate
represent implementations that use approximations of various kinds, and
the asynchronous involve irregular computation orders, where the costs and
controls at different states are updated at different iterations (for example
the cost of a single state being iterated at a time, as in Gauss-Seidel and
other methods; see [Ber12a]). Approximate and asynchronous implemen-
tations have been the subject of intensive investigations in the last twenty
five years, in the context of the solution of large-scale problems. Some of
this methodology relies on the use of simulation, which is asynchronous by
nature and is prominent in approximate DP and reinforcement learning.
The connection between DP and fixed point theory may be traced to Shap-
ley [Sha53], who exploited contraction mappings in analysis of the two-
player dynamic game model of Example 1.2.4. Since that time the under-
lying contraction properties of discounted DP problems with bounded cost
per stage have been explicitly or implicitly used by most authors that have
dealt with the subject. Moreover, the value of the abstract viewpoint as
the basis for economical and insightful analysis has been widely recognized.
An abstract DP model, based on unweighted sup-norm contraction
assumptions, was introduced in the paper by Denardo [Den67]. This model
pointed to the fundamental connections between DP and fixed point the-
ory, and provided generality and insight into the principal analytical and
algorithmic ideas underlying the discounted DP research up to that time.
Abstract DP ideas were also researched earlier, notably in the paper by
Mitten (Denardo’s Ph.D. thesis advisor) [Mit64]; see also Denardo and
Mitten [DeM67]. The properties of monotone contractions were also used
in the analysis of sequential games by Zachrisson [Zac64].
Denardo’s model motivated a related abstract DP model by the au-
thor [Ber77], which relies only on monotonicity properties, and was pat-
terned after the positive DP problem of Blackwell [Bla65] and the negative
DP problem of Strauch [Str66]. These two abstract DP models were used
extensively in the book by Bertsekas and Shreve [BeS78] for the analysis
of both discounted and undiscounted DP problems, ranging over MDP,
minimax, multiplicative, Borel space models, and models based on outer
integration. Extensions of the analysis of [Ber77] were given by Verdu
and Poor [VeP87], which considered additional structure that allows the
development of backward and forward value iterations, and in the thesis
by Szepesvari [Sze98a], [Sze98b], which introduced non-Markovian poli-
cies into the abstract DP framework. The model of [Ber77] was also used
by Bertsekas [Ber82], and Bertsekas and Yu [BeY10b], to develop asyn-
chronous value and policy iteration methods for abstract contractive and
noncontractive DP models. Another line of related research involving ab-
stract DP mappings that are not necessarily scalar-valued was initiated by
Mitten [Mit74], and was followed up by a number of authors, including
Sobel [Sob75], Morin [Mor82], and Carraway and Morin [CaM88].
Restricted policies models that aim to address measurability issues
in the context of abstract DP were first considered in [BeS78]. Followup
research on this highly technical subject has been limited, and some issues
have not been fully worked out beyond the classical discounted, positive,
and negative stochastic optimal control problems; see Chapter 5.
Generally, noncontractive total cost DP models with some special
structure beyond monotonicity, fall in three major categories: monotone in-
creasing models principally represented by negative DP, monotone decreas-
ing models principally represented by positive DP, and transient models,
exemplified by the SSP model of Example 1.2.6, where the decision process
terminates after a period that is random and subject to control. Abstract
DP models patterned after the first two categories have been known since
[Ber77] and are further discussed in Section 4.3. The semicontractive mod-
els of Chapters 3 and 4 are patterned after the third category, and their
analysis is based on the idea of separating policies into those that are
well-behaved (have contraction-like properties) and those that are not (but
their detrimental effects can be effectively limited thanks to the problem’s
structure). As far as the author knows, this idea is new in the context of
abstract DP. One of the aims of the present monograph is to develop this
idea and to show that it leads to an important and insightful paradigm for
conceptualization and solution of major classes of practical DP problems.
EXERCISES
This exercise shows how starting with an abstract mapping, we can obtain mul-
tistep mappings with the same fixed points and a stronger contraction modulus.
Consider a set of mappings Tµ : B(X) → B(X), µ ∈ M, satisfying the con-
traction Assumption 1.2.2, let m be a positive integer, and let Mm be the set
of m-tuples ν = (µ0 , . . . , µm−1 ), where µk ∈ M, k = 0, 1, . . . , m − 1. For each
ν = (µ0 , . . . , µm−1 ) ∈ Mm , define the mapping T ν by
T ν J = Tµ0 · · · Tµm−1 J, ∀ J ∈ B(X).
Show that we have the contraction properties
&T ν J − T ν J " & ≤ αm &J − J " &, ∀ J, J " ∈ B(X), (1.23)
and
&T J − T J " & ≤ αm &J − J " &, ∀ J, J " ∈ B(X), (1.24)
where T is defined by
(T J)(x) = inf_{(µ0 ,...,µm−1 )∈Mm} (Tµ0 · · · Tµm−1 J)(x), ∀ J ∈ B(X), x ∈ X.
Consider a mapping Tµ(w) : B(X) → B(X) of the form
(Tµ(w) J)(x) = Σ_{ℓ=1}^{∞} wℓ (x)(Tµℓ J)(x), x ∈ X,
where wℓ (x) are nonnegative scalars such that for all x ∈ X,
Σ_{ℓ=1}^{∞} wℓ (x) = 1.
Show that
|(Tµ(w) J)(x) − (Tµ(w) J ′ )(x)| / v(x) ≤ Σ_{ℓ=1}^{∞} wℓ (x) αℓ ∥J − J ′ ∥, ∀ x ∈ X,
where α is the contraction modulus of Tµ , so that Tµ(w) is a contraction with
modulus
ᾱ = sup_{x∈X} Σ_{ℓ=1}^{∞} wℓ (x) αℓ ≤ α.
Show also that Tµ(w) and Tµ have a common fixed point for all µ ∈ M.
2
Contractive Models
In this chapter we consider the abstract DP model of Section 1.2 under the
most favorable assumptions: monotonicity and weighted sup-norm contrac-
tion. Important special cases of this model are the discounted problems
with bounded cost per stage (Examples 1.2.1-1.2.5), the stochastic shortest
path problem of Example 1.2.6 in the case where all policies are proper,
as well as other problems involving special structures. We first provide
some basic analytical results and then focus on two types of algorithms:
value iteration and policy iteration. In addition to exact forms of these
algorithms, we discuss combinations and approximate versions, as well as
asynchronous distributed versions.
i.e., to find a fixed point of T within R(X). We also want to obtain a policy
µ∗ ∈ M such that Tµ∗ J * = T J * .
Let us restate for convenience the contraction and monotonicity as-
sumptions of Section 1.2.2.
J ≤ J ′ ⇒ T k J ≤ T k J ′ , Tµk J ≤ Tµk J ′ , ∀ µ ∈ M,
∥J * − J∥ ≤ ( 1/(1 − α) ) ∥T J − J∥, ∥J * − T J∥ ≤ ( α/(1 − α) ) ∥T J − J∥.
∥Jµ − J∥ ≤ ( 1/(1 − α) ) ∥Tµ J − J∥, ∥Jµ − Tµ J∥ ≤ ( α/(1 − α) ) ∥Tµ J − J∥.
Proof: We note that the right-hand side of Eq. (2.1) holds by Prop.
2.1.1(e) (see the remark following its proof). Thus inf µ∈M Jµ (x) ≤ J * (x)
for all x ∈ X. To show the reverse inequality as well as the left-hand side
of Eq. (2.1), we note that for all µ ∈ M, we have T J * ≤ Tµ J * , and since
J * = T J * , it follows that J * ≤ Tµ J * . By applying repeatedly Tµ to both
sides of this inequality and by using the monotonicity Assumption 2.1.1,
we obtain J * ≤ Tµk J * for all k > 0. Taking the limit as k → ∞, we see
that J * ≤ Jµ for all µ ∈ M. Q.E.D.
Note that without monotonicity, we may have inf µ∈M Jµ (x) < J * (x)
for some x. This is illustrated by the following example.
with J¯ being some function in B(X), where Tµ0 · · · Tµk J denotes the com-
position of the mappings Tµ0 , . . . , Tµk applied to J , i.e.,
Tµ0 · · · Tµk J = Tµ0 (Tµ1 (· · · (Tµk−1 (Tµk J )) · · ·)).
Jπ (x) = lim sup_{k→∞} (Tµ0 Tµ1 · · · Tµk J¯)(x) ≥ lim_{k→∞} (T k+1 J¯)(x) = J * (x)
and the last equality holds by Prop. 2.1.1(b)]. Combining the preceding
relations, we obtain J * (x) = inf π∈Π Jπ (x).
Thus, in DP terms, we may view J * as an optimal cost function over
all policies. At the same time, Prop. 2.1.2 states that stationary policies
are sufficient in the sense that the optimal cost can be attained to within
arbitrary accuracy with a stationary policy [uniformly for all x ∈ X, as Eq.
(2.1) shows].
J ≤ J ′ + c v ⇒ Tµ J ≤ Tµ J ′ + αc v, (2.2)
( H(x, u, J + c v) − H(x, u, J ) ) / v(x) ≤ α ∥J + c v − J∥ = αc.
v(x)
The condition (2.3) implies the desired condition (2.2). Conversely, con-
dition (2.2) for c = 0 yields the monotonicity assumption, while for c =
∥J ′ − J∥ it yields the contraction assumption. Q.E.D.
T J ≤ J + c v ⇒ J * ≤ T k J + ( αk c/(1 − α) ) v,
J ≤ T J + c v ⇒ T k J ≤ J * + ( αk c/(1 − α) ) v.
Proof: (a) We show the first relation. Applying Eq. (2.2) with J ′ and J
replaced by J and T J , respectively, and taking infimum over µ ∈ M, we
see that if T J ≤ J + c v, then T 2 J ≤ T J + αc v. Proceeding similarly, it
follows that
T ℓ J ≤ T ℓ−1 J + αℓ−1 c v.
T k J − J = Σ_{ℓ=1}^{k} (T ℓ J − T ℓ−1 J ) ≤ Σ_{ℓ=1}^{k} αℓ−1 c v,
" #
from which, by taking the limit as k → ∞, we obtain J * ≤ J + c/(1 − α) v.
The second relation follows similarly.
(b) This part is the special case of part (a) where T is equal to Tµ .
(c) We show the first relation. From part (a), the inequality T J ≤ J + c v
implies that
J * ≤ J + ( c/(1 − α) ) v.
∥Jµ − T J˜∥ ≤ ( α/(1 − α) ) ∥T J˜ − J˜∥, (2.5)
∥Jµ − J * ∥ ≤ ( 2α/(1 − α) ) ∥J˜ − J * ∥, (2.6)
and
∥Jµ − J * ∥ ≤ ( 2/(1 − α) ) ∥T J˜ − J˜∥. (2.7)
Proof: Equation (2.5) follows from the second relation of Prop. 2.1.1(e)
with J = J˜. Also from the first relation of Prop. 2.1.1(e) with J = J * , we
have
∥Jµ − J * ∥ ≤ ( 1/(1 − α) ) ∥Tµ J * − J * ∥.
By using the triangle inequality, and the relations Tµ J˜ = T J˜ and J * = T J * ,
we obtain
where the second inequality follows from Eqs. (2.5) and (2.8). This proves
Eq. (2.7). Q.E.D.
Example 2.2.1
Consider a discounted optimal control problem with two states, 1 and 2, and
deterministic transitions. State 2 is absorbing, but at state 1 there are two
possible decisions: move to state 2 (policy µ∗ ) or stay at state 1 (policy µ).
The cost of each transition is 0 except for the transition from 1 to itself under
policy µ, which has cost 2αε, where ε is a positive scalar and α ∈ [0, 1) is the
discount factor. The optimal policy µ∗ is to move from state 1 to state 2, and
the optimal cost-to-go function is J ∗ (1) = J ∗ (2) = 0. Consider the vector J˜
with J˜(1) = −ε and J˜(2) = ε, so that
∥J˜ − J ∗ ∥ = ε,
as assumed in Eq. (2.6) (cf. Prop. 2.2.1). The policy µ that decides to stay
at state 1 is a one-step lookahead policy based on J˜, because
We have
Jµ (1) = 2αε/(1 − α) = ( 2α/(1 − α) ) ∥J˜ − J ∗ ∥,
so the bound of Eq. (2.6) holds with equality.
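A quick numerical check of this example (a sketch assuming the unweighted sup-norm, i.e., v ≡ 1, with illustrative values of α and ε):

```python
# Example 2.2.1 numerically (a sketch): the one-step lookahead policy based on
# J~ = (-eps, eps) may stay at state 1, and its cost attains the bound (2.6).
alpha, eps = 0.9, 0.1
J_tilde = {1: -eps, 2: eps}
# one-step lookahead at state 1: stay (2*alpha*eps + alpha*J~(1)) vs move (0 + alpha*J~(2))
stay = 2 * alpha * eps + alpha * J_tilde[1]
move = 0.0 + alpha * J_tilde[2]
assert stay <= move                                  # tie, so "stay" attains the minimum
J_mu_1 = 2 * alpha * eps / (1 - alpha)               # cost of staying at state 1 forever
bound = 2 * alpha / (1 - alpha) * eps                # right side of Eq. (2.6), ||J~ - J*|| = eps
print(J_mu_1, bound)                                 # equal: the bound is tight
```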
where δ and ε are nonnegative scalars. These scalars may be unknown, so
the resulting analysis will have a mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is
finite but very large. Then instead of calculating (T J )(x) for all states x,
one may do so only for some states and estimate (T J )(x) for the remain-
ing states x by some form of interpolation. Alternatively, one may use
simulation data [e.g., noisy values of (T J )(x) for some or all x] and some
kind of least-squares error fit of (T J )(x) with a function from a suitable
parametric class. The function J˜ thus obtained will satisfy ∥J˜ − T J∥ ≤ δ
with δ > 0. Note that δ may not be small in this context, and the resulting
performance degradation may be a primary concern.
Cases where ε > 0 may arise when the control space is infinite or
finite but large, and the minimization involved in the calculation of (T J )(x)
cannot be done exactly. Note, however, that it is possible that
δ > 0, ε = 0,
∥Tµ J − T J∥ ≤ ε
for some ε > 0, and then we may set J˜ = Tµ J . In this case we have ε = δ
in Eq. (2.9).
In a multistep method with approximations, we are given a posi-
tive integer m and a lookahead function Jm , and we successively compute
(backwards in time) Jm−1 , . . . , J0 and policies µm−1 , . . . , µ0 satisfying
Proof: Using the triangle inequality, Eq. (2.10), and the contraction prop-
erty of T , we have for all k
∥Jm−k − T k Jm ∥ ≤ δ(1 − αk )/(1 − α), k = 1, . . . , m. (2.13)
From Eq. (2.10), we have ∥Jk − Tµk Jk+1 ∥ ≤ δ + ε, so for all k
showing that
∥Jm−k − Tµm−k · · · Tµm−1 Jm ∥ ≤ (δ + ε)(1 − αk )/(1 − α), k = 1, . . . , m. (2.14)
Using the fact ∥Tµ0 J1 − T J1 ∥ ≤ ε [cf. Eq. (2.10)], we obtain
where the last inequality follows from Eqs. (2.13) and (2.14) for k = m − 1.
From this relation and the fact that Tµ0 · · · Tµm−1 and T m are con-
tractions with modulus αm , we obtain
We also have using Prop. 2.1.1(e), applied in the context of the multistep
mapping of Example 1.3.1,
∥Jπ − J * ∥ ≤ ( 1/(1 − αm ) ) ∥Tµ0 · · · Tµm−1 J * − J * ∥.
Combining the last two relations, we obtain the desired result. Q.E.D.
Note that for m = 1 and δ = ε = 0, i.e., the case of one-step lookahead
policy µ with lookahead function J1 and no approximation error in the
minimization involved in T J1 , Eq. (2.11) yields the bound
∥Jµ − J * ∥ ≤ ( 2α/(1 − α) ) ∥J1 − J * ∥,
In this section, we discuss value iteration (VI for short), the algorithm
that starts with some J ∈ B(X), and generates T J, T 2 J, . . .. Since T is
a weighted sup-norm contraction under Assumption 2.1.2, the algorithm
converges to J * , and the rate of convergence is governed by
∥T k J − J * ∥ ≤ αk ∥J − J * ∥, k = 0, 1, . . . .
∥Tµk J − Jµ ∥ ≤ αk ∥J − Jµ ∥, k = 0, 1, . . . .
"J˜ − J * " ≤ γ, Tµ J˜ = T J,
˜
2α γ
"Jµ − J * " ≤ . (2.15)
1−α
Proof: Let M̃ be the set of policies such that Jµ ≠ J * . Since M̃ is finite,
we have
inf_{µ∈M̃} ∥Jµ − J * ∥ > 0,
lim sup_{k→∞} ∥Jk − J * ∥ ≤ δ/(1 − α), (2.19)
lim sup_{k→∞} ∥Jµk − J * ∥ ≤ ε/(1 − α) + 2αδ/(1 − α)2 . (2.20)
Proof: Using the triangle inequality, Eq. (2.17), and the contraction prop-
erty of T , we have
and finally
∥Jk − T k J0 ∥ ≤ (1 − αk )δ/(1 − α), k = 0, 1, . . . . (2.21)
By taking limit as k → ∞ and by using the fact limk→∞ T k J0 = J * , we
obtain Eq. (2.19).
We also have using the triangle inequality and the contraction prop-
erty of Tµk and T ,
∥Jµk − J * ∥ ≤ ( 1/(1 − α) ) ∥Tµk J * − J * ∥ ≤ ε/(1 − α) + ( 2α/(1 − α) ) ∥Jk − J * ∥.
By combining this relation with Eq. (2.19), we obtain Eq. (2.20). Q.E.D.
Consider a two-state discounted MDP with states 1 and 2, and a single policy.
The transitions are deterministic: from state 1 to state 2, and from state 2 to
state 2. These transitions are also cost-free. Thus we have J ∗ (1) = J ∗ (2) = 0.
We consider a VI scheme that approximates cost functions within the
one-dimensional subspace of linear functions S = { (r, 2r) | r ∈ ℜ } by using
a weighted least squares minimization; i.e., we approximate a vector J by its
weighted Euclidean projection onto S. In particular, given Jk = (rk , 2rk ), we
find Jk+1 = (rk+1 , 2rk+1 ), where for weights w1 , w2 > 0, rk+1 is obtained as
rk+1 = arg min_{r} [ w1 ( r − (T Jk )(1) )2 + w2 ( 2r − (T Jk )(2) )2 ].
Since for a zero cost per stage and the given deterministic transitions, we
have T Jk = (2αrk , 2αrk ), the preceding minimization is written as
rk+1 = arg min_{r} [ w1 (r − 2αrk )2 + w2 (2r − 2αrk )2 ],
which yields rk+1 = ( 2α(w1 + 2w2 )/(w1 + 4w2 ) ) rk . Thus when 2α(w1 + 2w2 ) > w1 + 4w2 (for example, if w1 = w2 and α > 5/6), the sequence {rk } diverges, even though J ∗ = 0.
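A short numerical sketch of this phenomenon (under the example's assumptions: zero cost per stage, deterministic transitions 1 → 2 → 2, subspace S = {(r, 2r)}; the particular values of α, w1, w2 are illustrative choices):

```python
# Divergence of VI combined with weighted least squares projection (a sketch).
alpha, w1, w2 = 0.95, 1.0, 1.0
r = 1.0
for k in range(20):
    t1 = t2 = 2 * alpha * r                      # (T J_k)(1) = (T J_k)(2) = 2*alpha*r_k
    # weighted least squares fit of (t1, t2) by (r, 2r): closed-form minimizer
    r = (w1 * t1 + 2 * w2 * t2) / (w1 + 4 * w2)
print("r_20 =", r)                               # grows without bound, although J* = 0
```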
lim_{k→∞} ∥Jµk − J * ∥ = 0,
Proof: We have
Tµk+1 Jµk = T Jµk ≤ Tµk Jµk = Jµk .
Applying Tµk+1 to this inequality while using the monotonicity Assumption
2.1.1, we obtain
Tµ2k+1 Jµk ≤ Tµk+1 Jµk = T Jµk ≤ Tµk Jµk = Jµk .
Similarly, we have for all m > 0,
Tµmk+1 Jµk ≤ T Jµk ≤ Jµk ,
If Jµk+1 = Jµk , it follows that T Jµk = Jµk , so Jµk is a fixed point of T and
must be equal to J * . Moreover by using induction, Eq. (2.22) implies that
Jµk ≤ T k Jµ0 , k = 0, 1, . . . ,
Since
J * ≤ Jµk , lim_{k→∞} ∥T k Jµ0 − J * ∥ = 0,
In the case where the set of policies is infinite, we may assert the
convergence of the sequence of generated policies under some compactness
and continuity conditions. In particular, we will assume that the state
space is finite, X = {1, . . . , n}, and that each control constraint set U (x)
is a compact subset of ℜm . We will view a cost function J as an element
of ℜn , and a policy µ as an element of the compact set U (1) × · · · ×
U (n) ⊂ ℜmn . Then {µk } has at least one limit point µ, which must
be an admissible policy. The following proposition guarantees, under an
additional continuity assumption for H(x, ·, ·), that every limit point µ is
optimal.
We now consider the PI method where the policy evaluation step and/or
the policy improvement step of the method are implemented through ap-
proximations. This method generates a sequence of policies {µk } and a
corresponding sequence of approximate cost functions {Jk } satisfying
where δ and ε are some scalars, ∥·∥ denotes the sup-norm and v is the weight
function of the weighted sup-norm (it is important to use v rather than the
unit function in the above equation, in order for the bounds obtained to
have a clean form). The following proposition provides an error bound for
this algorithm.
∥J − Jµ ∥ ≤ δ, ∥Tµ J − T J∥ ≤ ε,
Proof: Using Eq. (2.25) and the contraction property of T and Tµ , which
implies that ∥Tµ Jµ − Tµ J∥ ≤ αδ and ∥T J − T Jµ ∥ ≤ αδ, and hence Tµ Jµ ≤
Tµ J + αδ v and T J ≤ T Jµ + αδ v, we have
Tµ Jµ ≤ Jµ + (ε + 2αδ) v,
Jµ = Tµ Jµ = Tµ Jµ + (Tµ Jµ − Tµ Jµ ) ≤ Tµ Jµ + ( α(ε + 2αδ)/(1 − α) ) v,
where the inequality follows by using Prop. 2.1.3 and Eq. (2.27). Subtract-
ing J * from both sides, we have
Jµ − J * ≤ Tµ Jµ − J * + ( α(ε + 2αδ)/(1 − α) ) v, (2.28)
Also by subtracting J * from both sides of Eq. (2.26), and using the
contraction property
T Jµ − J * = T Jµ − T J * ≤ α ∥Jµ − J * ∥ v,
we obtain
" + 2αδ
#Jµk+1 − J * # ≤ α#Jµk − J * # + ,
1−α
which by taking the lim sup of both sides as k → ∞ yields the desired result.
Q.E.D.
We note that the error bound of Prop. 2.4.3 is tight, as can be shown
with an example from [BeT96], Section 6.2.3. The error bound is com-
parable to the one for approximate VI, derived earlier in Prop. 2.3.2. In
particular, the error ∥Jµk − J * ∥ is asymptotically proportional to 1/(1 − α)2
and to the approximation error in policy evaluation or value iteration, re-
spectively. This is noteworthy, as it indicates that contrary to the case of
exact implementation, approximate PI need not hold a convergence rate
advantage over approximate VI, despite its greater overhead per iteration.
On the other hand, approximate PI does not exhibit the same kind
of error amplification difficulty that was illustrated by Example 2.3.1 for
approximate VI. In particular, if the set of policies is finite, so that the
sequence {Jµk } is guaranteed to be bounded, the assumption of Eq. (2.23) is
not hard to satisfy in practice with commonly used function approximation
methods.
Note that when δ = ε = 0, Eq. (2.25) yields
∥Jµk+1 − J * ∥ ≤ α ∥Jµk − J * ∥.
Thus in the case of an infinite state space and/or control space, exact
PI converges at a geometric rate under the contraction and monotonicity
assumptions of this section. This rate is the same as the rate of convergence
of exact VI.
∥Tµk+1 J˜ − T J˜∥ = ∥Tµk+1 Jk − T Jk ∥ ≤ ε,
cf. Eq. (2.23). Using Eq. (2.31) and the fact Jµ = Tµ Jµ , we have
∥T Jµ − Jµ ∥ ≤ ∥T Jµ − T J˜∥ + ∥T J˜ − Tµ J˜∥ + ∥Tµ J˜ − Jµ ∥
= ∥T Jµ − T J˜∥ + ∥T J˜ − Tµ J˜∥ + ∥Tµ J˜ − Tµ Jµ ∥
≤ α ∥Jµ − J˜∥ + ε + α ∥J˜ − Jµ ∥ (2.32)
≤ ε + 2αδ.
The preceding error bound can be extended to the case where two
successive policies generated by the approximate PI algorithm are “not too
different” rather than being identical. In particular, suppose that µ and µ̄
are successive policies, which in addition to
∥J˜ − Jµ ∥ ≤ δ, ∥Tµ̄ J˜ − T J˜∥ ≤ ε,
also satisfy
∥Tµ̄ J˜ − Tµ J˜∥ ≤ ζ,
and by replacing ε with ε + ζ and µ with µ̄ in Eq. (2.32), we obtain
∥Jµ̄ − J ∗ ∥ ≤ (ε + ζ + 2αδ)/(1 − α).
When ζ is small enough to be of the order of max{δ, ε}, this error bound
is comparable to the one for the case where policies converge.
J ≤ J ′ ⇒ W J ≤ W J ′ , ∀ J, J ′ ∈ B(X),
J ≥ J ′ − c v ⇒ W J ≥ W J ′ − αc v. (2.34)
J ≥ W J − c v ⇒ W k J ≥ J * − ( αk /(1 − α) ) c v, (2.35)
W J ≥ J − c v ⇒ J * ≥ W k J − ( αk /(1 − α) ) c v, (2.36)
where J * is the fixed point of W .
Proof: The proof of part (a) follows the one of Prop. 2.1.4(b), while the
proof of part (b) follows the one of Prop. 2.1.4(c). Q.E.D.
J ≥ T J − c v,
and let µ ∈ M be such that Tµ J = T J . Then for all k > 0, we have
T J ≥ Tµk J − ( α/(1 − α) ) c v, (2.37)
and
Tµk J ≥ T (Tµk J ) − αk c v. (2.38)
J0 ≥ T J0 − c v.
J1 ≥ T J1 − αm0 c v = T J1 − β1 c v.
Jk ≥ T Jk − βk c v.
∥J0 − T J0 ∥ ≤ c. (2.42)
Jk + ( αk /(1 − α) ) c v ≥ Jk + ( βk /(1 − α) ) c v ≥ J * ≥ Jk − ( (k + 1)αk /(1 − α) ) c v, (2.43)
Proof: Using the relation J0 ≥ T J0 − c v [cf. Eq. (2.42)] and Lemma 2.5.3,
we have
Jk ≥ T Jk − βk c v, k = 0, 1, . . . .
Jk ≥ J * − ( βk /(1 − α) ) c v,
which together with the fact αk ≥ βk , shows the left-hand side of Eq.
(2.43).
Using the relation T J0 ≥ J0 − c v [cf. Eq. (2.42)] and Lemma 2.5.1(b)
with W = T , we have
J * ≥ T k J0 − ( αk /(1 − α) ) c v, k = 0, 1, . . . . (2.44)
Applying T k−j−1 to both sides of this inequality and using the monotonicity
and contraction properties of T k−j−1 , we obtain
T k−j Jj ≥ T k−j−1 Jj+1 − ( αk−j /(1 − α) ) βj c v, j = 0, . . . , k − 1,
it follows that
T k J0 ≥ Jk − Σ_{j=0}^{k−1} ( αk−j /(1 − α) ) αj c v = Jk − ( kαk /(1 − α) ) c v. (2.45)
Finally, by combining Eqs. (2.44) and (2.45), we obtain the right-hand side
of Eq. (2.43). Q.E.D.
Proof of Props. 2.5.1 and 2.5.2: Let c be a scalar satisfying Eq. (2.42).
Then the error bounds (2.43) show that limk→∞ ∥Jk − J * ∥ = 0, i.e., the first
part of Prop. 2.5.1. The second part (finite termination when the number
of policies is finite) follows similar to Prop. 2.4.1. The proof of Prop. 2.5.2
follows using the compactness and continuity Assumption 2.4.1, and the
convergence argument of Prop. 2.4.2. Q.E.D.
Let us consider the convergence rate bounds of Lemma 2.5.4 for optimistic
PI, and write them in the form
∥J0 − T J0 ∥ ≤ c ⇒ T k J0 − ( αk /(1 − α) ) c v ≤ J * ≤ T k J0 + ( αk /(1 − α) ) c v (2.47)
We will now derive error bounds for the case where the policy evaluation
and policy improvement operations are approximate, similar to the nonop-
timistic PI case of Section 2.4.1. In particular, we consider a method that
generates a sequence of policies {µk } and a corresponding sequence of ap-
proximate cost functions {Jk } satisfying
∥Jk − (Tµk )mk Jk−1 ∥ ≤ δ, ∥Tµk+1 Jk − T Jk ∥ ≤ ε, k = 0, 1, . . . , (2.48)
M (y) = sup_{x∈X} ( y(x)/v(x) ).
Then the condition (2.49) can be written for all J, J ′ ∈ B(X), and µ ∈ M
as
M (Tµ J − Tµ J ′ ) ≤ αM (J − J ′ ), (2.50)
and also implies the following multistep versions, for ℓ ≥ 1,
M (Tµℓ J − Tµℓ J ′ ) ≤ αℓ M (J − J ′ ), (2.51)
which can be proved by induction using Eq. (2.50). We have the following
proposition.
lim sup_{k→∞} ∥Jµk − J * ∥ ≤ (ε + 2αδ)/(1 − α)2 .
J = Jk−1 , J̄ = Jk ,
µ = µk , µ̄ = µk+1 , m = mk , m̄ = mk+1 ,
r = Tµ J − J , r̄ = Tµ̄ J − J.
r̄ = Tµ̄ J − J
= (Tµ̄ J − Tµ J ) + (Tµ J − J )
≤ (Tµ̄ J − T J ) + ( Tµ J − Tµ (Tµm J ) ) + (Tµm J − J ) + ( Tµm (Tµ J ) − Tµm J )
≤ εv + αM (J − Tµm J )v + δv + αm M (Tµ J − J )v
≤ (ε + δ)v + αδv + αm M (r)v,
where the first inequality follows from Tµ̄ J ≥ T J , and the second and third
inequalities follow from Eqs. (2.48) and (2.51). From this relation we have
! "
M (r̄) ≤ ! + (1 + α)δ + βM (r),
lim sup_{k→∞} M (r) ≤ ( ε + (1 + α)δ )/(1 − β̂), (2.53)
s = Jµ − Tµm J
= Tµm Jµ − Tµm J
≤ αm M (Jµ − J )v
≤ ( αm /(1 − α) ) M (Tµ J − J )v
= ( αm /(1 − α) ) M (r)v,
where the first inequality follows from Eq. (2.51) and the second inequality
follows by using Prop. 2.1.4(b). Thus we have M (s) ≤ ( αm /(1 − α) ) M (r), from
which by taking lim sup of both sides and using Eq. (2.53), we obtain
! "
β̂ " + (1 + α)δ
lim sup M (s) ≤ . (2.54)
k→∞ (1 − α)(1 − β̂)
T J − T J * ≤ αM (J − J * )v
= αM (J − Tµm J + Tµm J − J * )v
≤ αM (J − Tµm J )v + αM (Tµm J − J * )v
≤ αδv + αM (t)v.
t̄ = Tµ̄m̄ J − J *
= (Tµ̄m̄ J − Tµ̄m̄−1 J ) + · · · + (Tµ̄2 J − Tµ̄ J ) + (Tµ̄ J − T J ) + (T J − T J * )
≤ (αm̄−1 + · · · + α)M (Tµ̄ J − J )v + εv + αδv + αM (t)v,
so finally
M (t̄) ≤ ( (α − αm̄ )/(1 − α) ) M (r̄) + (ε + αδ) + αM (t).
By taking lim sup of both sides and using Eq. (2.53), it follows that
! "
(α − β̂) " + (1 + α)δ " + αδ
lim sup M (t) ≤ + . (2.55)
k→∞ 2
(1 − α) (1 − β̂) 1−α
Each VI of the form described in Section 2.3 applies the mapping T defined
by
(T J )(x) = inf_{u∈U(x)} H(x, u, J ), ∀ x ∈ X,
In this case we may simplify the notation of iteration (2.57) by writing Jℓt
in place of the scalar component Jℓt (ℓ), as we do in the following example.
where T (J1t , . . . , Jnt )(ℓ) denotes the ℓ-th component of the vector
T (J1t , . . . , Jnt ) = T J t ,
and for simplicity we write Jℓt instead of Jℓt (ℓ). This algorithm is a special
case of iteration (2.57) where the set of times at which Jℓ is updated is
Rℓ = {t | xt = ℓ}, and there are no communication delays (as in the case
where the entire algorithm is centralized at a single physical processor).
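The following sketch (a hypothetical finite discounted MDP, not from the book) indicates the Gauss-Seidel flavor of this iteration: a single component J(x) is updated at each step using the latest values of the remaining components, with a cyclic order standing in for the sets Rℓ.

```python
# A sketch (not from the book) of Gauss-Seidel / asynchronous value iteration for
# a hypothetical finite discounted MDP: at each step only one component J(x) is
# updated, using the most recent values of the other components.
import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 4, 2, 0.9
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((m, n))

def T_component(J, x):
    return (g[:, x] + alpha * P[:, x, :] @ J).min()

J = np.zeros(n)
for t in range(5000):
    x = t % n       # cyclic order; any order in which every state is updated infinitely often works
    J[x] = T_component(J, x)
print("asynchronous VI estimate of J*:", np.round(J, 4))
```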
S(k + 1) ⊂ S(k), k = 0, 1, . . . ,
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S1 (k) × · · · × Sm (k),
Proof: To explain the idea of the proof, let us note that the given conditions
imply that updating any component Jℓ , by applying T to a function
J ∈ S(k), while leaving all other components unchanged, yields a function
in S(k). Thus, once enough time passes so that the delays become “irrelevant,”
then after J enters S(k), it stays within S(k). Moreover, once a
component Jℓ enters the subset Sℓ (k) and the delays become “irrelevant,”
Jℓ gets permanently within the smaller subset Sℓ (k+1) at the first time that
Jℓ is iterated on with J ∈ S(k). Once each component Jℓ , ℓ = 1, . . . , m,
gets within Sℓ (k + 1), the entire function J is within S(k + 1) by the Box
Condition. Thus the iterates from S(k) eventually get into S(k + 1) and
so on, and converge pointwise to J * in view of the assumed properties of
{S(k)}.
With this idea in mind, we show by induction that for each k ≥ 0,
there is a time tk such that:
(1) J t ∈ S(k) for all t ≥ tk .
(2) For all ℓ and t ∈ R_ℓ with t ≥ t_k, we have
(J_1^{τ_{ℓ1}(t)}, . . . , J_m^{τ_{ℓm}(t)}) ∈ S(k).
[In words, after some time, all fixed point estimates will be in S(k) and all
estimates used in iteration (2.57) will come from S(k).]
The induction hypothesis is true for k = 0 since J^0 ∈ S(0). Assuming
it is true for a given k, we will show that there exists a time t_{k+1} with the
required properties. For each ℓ = 1, . . . , m, let t(ℓ) be the first element of
R_ℓ such that t(ℓ) ≥ t_k. Then by the Synchronous Convergence Condition,
we have T J^{t(ℓ)} ∈ S(k + 1), implying (in view of the Box Condition) that
J_ℓ^{t(ℓ)+1} ∈ S_ℓ(k + 1).
Let t′_k = max_ℓ {t(ℓ)} + 1. Then, using the Box Condition we have
J^t ∈ S(k + 1) for all t ≥ t′_k.
(Figure: geometric illustration of the asynchronous convergence theorem, with the nested sets S(0) ⊃ S(k) ⊃ S(k + 1) ⊃ · · ·, each a Cartesian product S_1(k) × S_2(k), and the component iterates J_1, J_2 converging to J* = (J_1*, J_2*); a companion figure illustrates the cost functions J_{µ^0}, J_{µ^1}, J_{µ^2}, . . . generated by PI together with J* and T J.)
Assumption 2.6.2:
(a) The set of policies M is finite.
(b) There exists an integer B ≥ 0 such that
M∗ = {µ ∈ M | Jµ = J * } = {µ ∈ M | Tµ J * = T J * }.
We will show that the algorithm eventually (with probability one) enters
a small neighborhood of J * within which it remains, generates policies in
M∗ , becomes equivalent to asynchronous VI, and therefore converges to
J * by Prop. 2.6.2. The idea of the proof is twofold.
(1) There exists a small enough weighted sup-norm sphere centered at
J * , call it S ∗ , within which policy improvement generates only poli-
cies in M∗ , so policy evaluation with such policies as well as policy
improvement keep the algorithm within S ∗ if started there, and re-
duce the weighted sup-norm distance to J * , in view of the contraction
and common fixed point property of T and Tµ , µ ∈ M∗ . This is a
consequence of Prop. 2.3.1 [cf. Eq. (2.16)].
(2) With probability one, thanks to the randomization device, the algo-
rithm will eventually enter permanently S ∗ with a policy in M∗ .
We now establish (1) and (2) in suitably refined form to account for
the presence of delays and asynchronism. We first define a bounded set
within which the algorithm remains at all times. Consider the set
A_b = {J | ‖J − J_µ‖ ≤ b, ∀ µ ∈ M},
Such a k ∗ exists in view of the finiteness of M and Prop. 2.3.1 [cf. Eq.
(2.16)].
We now claim that with probability one, for any given k ≥ 1, J t
will eventually enter S(k) and stay within S(k) for at least B " additional
consecutive iterations. This is because our randomization scheme is such
that for any t and k, with probability at least p^{k(B+B′)}, the next k(B + B′)
iterations are policy improvements, so that J^{t+k(B+B′)−ξ} ∈ S(k) for all
ξ with 0 ≤ ξ < B′ [if t ≥ B′ − 1, we have J^{t−ξ} ∈ S(0) for all ξ with
0 ≤ ξ < B′, so J^{t+B+B′−ξ} ∈ S(1) for 0 ≤ ξ < B′, which implies that
J^{t+2(B+B′)−ξ} ∈ S(2) for 0 ≤ ξ < B′, etc].
It follows that with probability one, for some t̄ we will have J^τ ∈ S(k*)
for all τ with t̄ − B′ ≤ τ ≤ t̄, as well as µ^{t̄} ∈ M* [cf. Eq. (2.61)]. Based
on property (2.61) and the definition (2.59)-(2.60) of the algorithm, we see
that at the next iteration, we have µ^{t̄+1} ∈ M* and
‖J^{t̄+1} − J*‖ ≤ ‖J^{t̄} − J*‖ ≤ α^{k*} c,
with J^{t̄+1}(x) updated for all x ∈ X_ℓ and ℓ such that t̄ ∈ R_ℓ ∪ R̄_ℓ, while
J^{t̄+1}(x) = J^{t̄}(x) for all other x. Proceeding similarly, it follows that for
all t > t̄ we will have J^τ ∈ S(k*) for all τ with t − B′ ≤ τ ≤ t, as well as
µ^t ∈ M*. Thus, after at most B iterations following t̄ [after all components
J_ℓ are updated through policy evaluation or policy improvement at least
once, so that
|J_ℓ^{t+1}(x) − J_ℓ*(x)| / v(x) ≤ α ‖J^t − J*‖ ≤ α^{k*+1} c,
for every ℓ, x ∈ X_ℓ, and some t with t̄ ≤ t < t̄ + B, cf. Eq. (2.62)], J^t will
enter S(k* + 1) permanently, with µ^t ∈ M* (since µ^t ∈ M* for all t ≥ t̄
as shown earlier). Then, with the same reasoning, after at most another
The proof of Prop. 2.6.3 shows that eventually (with probability one
after some iteration) the algorithm will become equivalent to asynchronous
VI (each policy evaluation will produce the same results as a policy im-
provement), while generating optimal policies exclusively. However, the
expected number of iterations for this to happen can be very large. More-
over the proof depends on the set of policies being finite. These observa-
tions raise questions regarding the practical effectiveness of the algorithm.
However, it appears that for many problems the algorithm works well, par-
ticularly when oscillatory behavior is a rare occurrence.
A potentially important issue is the choice of the randomization prob-
ability p. If p is too small, convergence may be slow because oscillatory
behavior may go unchecked for a long time. On the other hand if p is
large, a correspondingly large number of policy improvement iterations
may be performed, and the hoped for benefits of optimistic PI may be lost.
Adaptive schemes which adjust p based on algorithmic progress may be an
interesting possibility for addressing this issue.
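A minimal sketch of the randomization device discussed above is given below: at each iteration, a policy improvement is forced with probability p, and a policy evaluation sweep is performed otherwise. The routines improve and evaluate, and any adaptation of p, are placeholders to be supplied by the user; nothing here is prescribed by the text.

```python
import random

# Sketch of randomized optimistic PI: with probability p do a policy improvement,
# otherwise do a policy evaluation sweep.  'improve' and 'evaluate' are assumed,
# user-supplied routines mapping (J, mu) to an updated pair (J, mu).
def randomized_optimistic_pi(J, mu, improve, evaluate, p=0.1, num_iters=1000):
    for _ in range(num_iters):
        if random.random() < p:
            J, mu = improve(J, mu)
        else:
            J, mu = evaluate(J, mu)
    return J, mu
```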
where
• F_µ(V, Q) is a function with a component F_µ(V, Q)(x, u) for each (x, u),
defined by
F_µ(V, Q)(x, u) = H(x, u, min{V, Q_µ}),    (2.63)
• M F_µ(V, Q) is a function with a component (M F_µ(V, Q))(x) for each
x, where M denotes minimization over u, so that
(M F_µ(V, Q))(x) = min_{u∈U(x)} F_µ(V, Q)(x, u).    (2.64)
Consider the special case of the finite-state discounted MDP of Example 1.2.2.
We have
H(x, u, J) = Σ_{y=1}^n p_{xy}(u)(g(x, u, y) + αJ(y)),
and
F_µ(V, Q)(x, u) = H(x, u, min{V, Q_µ})
               = Σ_{y=1}^n p_{xy}(u)(g(x, u, y) + α min{V(y), Q(y, µ(y))}),
(M F_µ(V, Q))(x) = min_{u∈U(x)} Σ_{y=1}^n p_{xy}(u)(g(x, u, y) + α min{V(y), Q(y, µ(y))}),
[cf. Eqs. (2.63)-(2.64)]. Note that Fµ (V, Q) is the mapping that defines Bell-
man’s equation for the Q-factors of a policy µ in an optimal stopping problem
where the stopping cost at state y is equal to V (y).
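For concreteness, here is a small numerical sketch of the mappings (2.63)-(2.64) for the finite-state discounted MDP above; the array layout (p[x, u] a probability vector over next states, g[x, u] the corresponding one-stage costs) is an assumption made for illustration.

```python
import numpy as np

# Sketch of F_mu(V, Q)(x, u) = sum_y p_xy(u) (g(x, u, y) + alpha * min{V(y), Q(y, mu(y))})
# and of its pointwise minimization over u, cf. (2.63)-(2.64).
def F_mu(V, Q, mu, p, g, alpha):
    n, m = Q.shape
    stop_or_continue = np.minimum(V, Q[np.arange(n), mu])   # min{V(y), Q(y, mu(y))}
    F = np.empty((n, m))
    for x in range(n):
        for u in range(m):
            F[x, u] = p[x, u] @ (g[x, u] + alpha * stop_or_continue)
    return F

def M_F_mu(V, Q, mu, p, g, alpha):
    return F_mu(V, Q, mu, p, g, alpha).min(axis=1)           # minimization over u
```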
in the space of (V, Q), where ‖V‖ is the weighted sup-norm of V, and ‖Q‖
is defined by
‖Q‖ = sup_{x∈X, u∈U(x)} |Q(x, u)|/v(x).
We have the following proposition.
(b) The following uniform contraction property holds for all (V, Q),
(Ṽ, Q̃):
‖G_µ(V, Q) − G_µ(Ṽ, Q̃)‖ ≤ α ‖(V, Q) − (Ṽ, Q̃)‖.
so that
min{J*(x), Q*(x, µ(x))} = J*(x),    ∀ x ∈ X, µ ∈ M.
Indeed, the first inequality follows from the contraction Assumption 2.1.2
and the second inequality follows from a nonexpansiveness property of the
minimization map: for any J_1, J_2, J̃_1, J̃_2, we have
‖min{J_1, J_2} − min{J̃_1, J̃_2}‖ ≤ max{‖J_1 − J̃_1‖, ‖J_2 − J̃_2‖};    (2.68)
[to see this, write for m = 1, 2,
J_m(x)/v(x) ≤ max{‖J_1 − J̃_1‖, ‖J_2 − J̃_2‖} + J̃_m(x)/v(x),    ∀ x ∈ X,
take the minimum of both sides over m, exchange the roles of J_m and J̃_m,
and take supremum over x]. Here we use the relation (2.68) for J_1 = V,
J̃_1 = Ṽ, and J_2(x) = Q(x, µ(x)), J̃_2(x) = Q̃(x, µ(x)), for all x ∈ X.
A similar argument shows that M is nonexpansive,
‖M Q − M Q̃‖ ≤ ‖Q − Q̃‖:
write
Q(x, u)/v(x) ≤ ‖Q − Q̃‖ + Q̃(x, u)/v(x),    ∀ u ∈ U(x), x ∈ X,
take infimum of both sides over u ∈ U(x), exchange the roles of Q and Q̃, and
take supremum over x ∈ X and u ∈ U(x).
‡ Because Fµ and Gµ depend on µ, which changes as the algorithm pro-
gresses, it is necessary to use a minor extension of the asynchronous convergence
theorem, given in Exercise 2.2, for the convergence proof.
Asynchronous PI Algorithm
sets µ^{t+1}(x) to a u that attains the minimum, and leaves Q un-
changed, i.e., Q^{t+1}(x, u) = Q^t(x, u) for all x ∈ X_ℓ and u ∈ U(x).
(b) Local policy evaluation: If t ∈ R̄_ℓ, processor ℓ sets for all x ∈ X_ℓ and
u ∈ U(x),
Q^{t+1}(x, u) = H(x, u, min{V^t, Q^t_{µ^t}}) = F_{µ^t}(V^t, Q^t)(x, u),
and leaves V and µ unchanged, i.e., V^{t+1}(x) = V^t(x) and µ^{t+1}(x) =
µ^t(x) for all x ∈ X_ℓ.
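The following sketch simulates the above algorithm with all components held at a single processor and no delays: at each iteration a state is picked at random and either a local policy improvement or a local policy evaluation is performed. The MDP data p, g, alpha and the random selection rule are illustrative assumptions, not part of the text.

```python
import numpy as np

# Sketch of asynchronous PI with Q-factors for a finite-state discounted MDP.
# Improvement: mu(x), V(x) <- argmin/min over u of F_mu(V, Q)(x, u), Q unchanged.
# Evaluation:  Q(x, .) <- F_mu(V, Q)(x, .), V and mu unchanged.
def async_pi_q(p, g, alpha, num_iters=20000, improve_prob=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, m = g.shape[0], g.shape[1]
    V, Q, mu = np.zeros(n), np.zeros((n, m)), np.zeros(n, dtype=int)
    for _ in range(num_iters):
        x = rng.integers(n)
        stop = np.minimum(V, Q[np.arange(n), mu])
        q_row = np.array([p[x, u] @ (g[x, u] + alpha * stop) for u in range(m)])
        if rng.random() < improve_prob:       # local policy improvement
            mu[x] = int(q_row.argmin())
            V[x] = q_row.min()
        else:                                 # local policy evaluation
            Q[x, :] = q_row
    return V, Q, mu
```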
† As earlier we assume that the infimum over u ∈ U (x) in the policy im-
provement operation is attained, and we write min in place of inf.
for all x ∈ X_ℓ. The “stopping cost” V^t(y) is the most recent cost value,
obtained by local policy improvement at y.
Consider the optimistic PI algorithm (2.69)-(2.70) for the case of the dis-
counted dynamic game problem of Example 1.2.4 of Chapter 1. In the con-
text of this problem, a local policy evaluation step [cf. Eq. (2.70)] consists of
a local VI for the maximizer’s DP problem assuming a fixed policy for the
minimizer, and a stopping cost V t as per Eq. (2.70). A local policy improve-
ment step [cf. Eq. (2.69)] at state x consists of the solution of a static game
with a payoff matrix that also involves min{V t , J t } in place of J t , as per Eq.
(2.69).
The idea of the algorithm is to aim for a larger value of J t+1 (x)
when the condition (2.72) holds. Asymptotically, as γt → 0, the iteration
(2.72)-(2.73) becomes identical to the convergent update (2.70). For a
more detailed analysis of an algorithm of this type, we refer to the paper
[BeY10b].
is due to Thierry and Scherrer [ThS10b], given for the case of a finite-state
discounted MDP. We follow closely their proof. Related error bounds and
analysis are given by Scherrer [Sch11].
An alternative form of optimistic PI is the λ-PI method, introduced
by Bertsekas and Ioffe [BeI96] (see also [BeT96]), and further studied in
approximate DP contexts by Thierry and Scherrer [ThS10a], Bertsekas
[Ber11b], and Scherrer [Sch11], and in a modified form by Yu and Bertsekas
[YuB12]. The λ-PI method is defined by
T_{µ^k} J_k = T J_k,    J_{k+1} = T_{µ^k}^{(λ)} J_k,
where J_0 is an initial function in B(X), and for any policy µ and λ ∈ (0, 1),
the mapping T_µ^{(λ)} is defined by
T_µ^{(λ)} J = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ T_µ^{ℓ+1} J.
To compare λ-PI with optimistic PI, note that they both involve multiple applications of the VI mapping T_{µ^k}: a
fixed number mk in the latter case, and a geometrically weighted number in
the former case. We will revisit λ-PI in Section 4.3.3, where we will show
that it offers some implementation advantages in the contexts discussed
there.
Asynchronous VI (Section 2.6.1) for finite-state discounted MDP and
games, shortest path problems, and abstract DP models, was proposed in
Bertsekas [Ber82]. The asynchronous convergence theorem (Prop. 2.6.1)
was first given in the author’s paper [Ber83], where it was applied to a
variety of algorithms, including VI for discounted and undiscounted DP,
and gradient methods for unconstrained optimization (see also Bertsekas
and Tsitsiklis [BeT89], where a textbook account is presented). There are
earlier references on distributed asynchronous iterative algorithms, includ-
ing the early work of Chazan and Miranker [ChM69] on linear relaxation
methods (who attributed the original idea to Rosenfeld [Ros67]), and also
Baudet [Bau78] on sup-norm contractive iterations. We refer to [BeT89]
for detailed references.
Asynchronous algorithms have also been studied and applied to simu-
lation-based DP, particularly in the context of Q-learning, which may be
viewed as a stochastic version of VI. Two principal approaches for the
convergence analysis of such algorithms have been suggested. The first
approach, initiated in the paper by Tsitsiklis [Tsi94], considers the totally
asynchronous computation of fixed points of abstract sup-norm contractive
EXERCISES
(Multistep Mappings) Consider the multistep mappings T_ν = T_{µ_0} · · · T_{µ_{m−1}}, ν ∈ M_m, defined in Exer-
cise 1.1 of Chapter 1, where M_m is the set of m-tuples ν = (µ_0, . . . , µ_{m−1}), with
µ_k ∈ M, k = 0, 1, . . . , m − 1, and m is a positive integer. Assume that the mappings
Tµ satisfy the monotonicity and contraction Assumptions 2.1.1 and 2.1.2, so that
the same is true for the mappings T ν (with the contraction modulus of T ν being
αm , cf. Exercise 1.1).
(a) Show that the unique fixed point of T ν is Jπ , where π is the nonstationary
but periodic policy
π = {µ0 , . . . , µm−1 , µ0 , . . . , µm−1 , . . .}.
(b) Show that the multistep mappings T_{µ_0} · · · T_{µ_{m−1}}, T_{µ_1} · · · T_{µ_{m−1}} T_{µ_0}, . . . ,
T_{µ_{m−1}} T_{µ_0} · · · T_{µ_{m−2}}, have unique corresponding fixed points J_0, J_1, . . . ,
J_{m−1}, which satisfy
J_0 = T_{µ_0} J_1,   J_1 = T_{µ_1} J_2,   . . . ,   J_{m−2} = T_{µ_{m−2}} J_{m−1},   J_{m−1} = T_{µ_{m−1}} J_0.
Hint: Apply Tµ0 to the fixed point relation
J1 = Tµ1 · · · Tµm−1 Tµ0 J1
to show that Tµ0 J1 is the fixed point of Tµ0 · · · Tµm−1 , i.e., is equal to J0 .
Similarly, apply Tµ1 to the fixed point relation
J2 = Tµ2 · · · Tµm−1 Tµ0 Tµ1 J2 ,
to show that Tµ1 J2 is the fixed point of Tµ1 · · · Tµm−1 Tµ0 , i.e., is equal to
J1 , etc.
(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S_1(k) × · · · × S_m(k),
where S_ℓ(k) is a set of real-valued functions on X_ℓ, ℓ = 1, . . . , m.
Then for every J^0 ∈ S(0), the sequence {J^t} generated by the asynchronous
algorithm
J_ℓ^{t+1}(x) = T_t(J_1^{τ_{ℓ1}(t)}, . . . , J_m^{τ_{ℓm}(t)})(x)   if t ∈ R_ℓ, x ∈ X_ℓ,
J_ℓ^{t+1}(x) = J_ℓ^t(x)   if t ∉ R_ℓ, x ∈ X_ℓ,
[cf. Eq. (2.57)] converges pointwise to J*. Hint: Use the proof of Prop. 2.6.1.
where u ∈ ℜ^n, J′u denotes the inner product of J and u, F(x, ·) is the conju-
gate convex function of the convex function −(T J)(x), and
U(x) = {u ∈ ℜ^n | F(x, u) < ∞}
is the effective domain of F(x, ·) (for the definition of these terms, we refer to
books on convex analysis, such as [Roc70] and [Ber09]). Assuming
that the infimum in Eq. (2.74) is attained for all x, show how the VI algorithm of
Section 2.6.1 and the PI algorithm of Section 2.6.3 can be used to find the fixed
point of T in the case where T is a sup-norm contraction, but not necessarily
monotone.
G_x = sup_{u∈U(x)} |g(x, u)|,    x ∈ X,
belongs to B(X).
(2) The sequence V = {V_1, V_2, . . .}, where
V_x = sup_{u∈U(x)} Σ_{y∈X} p_{xy}(u) v(y),    x ∈ X,
belongs to B(X).
(3) We have
Σ_{y∈X} p_{xy}(u) v(y) / v(x) ≤ 1,    ∀ x ∈ X.
(T J)(x) = inf_{u∈U(x)} [ g(x, u) + α Σ_{y∈X} p_{xy}(u) J(y) ],    x ∈ X.
Show that Tµ and T map B(X) into B(X), and are contraction mappings with
modulus α.
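The contraction property asserted in the exercise can be checked numerically, as in the sketch below, which draws random MDP data, uses the unit weight function v = e (for which condition (3) holds trivially), and verifies the modulus-α inequality for a pair of random functions. All data here are synthetic assumptions used only for illustration.

```python
import numpy as np

# Numerical check of ||T J1 - T J2|| <= alpha ||J1 - J2|| in the weighted sup-norm.
def weighted_norm(J, v):
    return np.max(np.abs(J) / v)

def apply_T(J, p, g, alpha):
    n, m = g.shape
    return np.array([min(g[x, u] + alpha * p[x, u] @ J for u in range(m)) for x in range(n)])

rng = np.random.default_rng(0)
n, m, alpha = 5, 3, 0.9
p = rng.dirichlet(np.ones(n), size=(n, m))     # p[x, u] is a probability vector over y
g = rng.normal(size=(n, m))                    # one-stage costs g(x, u)
v = np.ones(n)                                 # weight function v = e
J1, J2 = rng.normal(size=n), rng.normal(size=n)
lhs = weighted_norm(apply_T(J1, p, g, alpha) - apply_T(J2, p, g, alpha), v)
print(lhs <= alpha * weighted_norm(J1 - J2, v) + 1e-12)   # expected: True
```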
Let the monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold. Show
that if J ≤ T J and J ∈ B(X), then J ≤ J ∗ , and use this fact to show that
if X = {1, . . . , n} and U (i) is finite for each i = 1, . . . , n, then J ∗ (1), . . . , J ∗ (n)
solves the following problem (in z_1, . . . , z_n):
maximize Σ_{i=1}^n z_i
subject to z_i ≤ H(i, u, z),    ∀ u ∈ U(i), i = 1, . . . , n.
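A minimal sketch of this linear program is given below for the finite-state, finite-control discounted MDP of the preceding exercise, where the constraint z_i ≤ H(i, u, z) specializes to z_i ≤ g(i, u) + α Σ_j p_ij(u) z_j. The MDP data layout and the use of scipy's linprog are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

# Solve: maximize sum_i z_i  subject to  z_i <= g(i, u) + alpha * sum_j p_ij(u) z_j
# for all i and u; the optimal z recovers J*(1), ..., J*(n).
def solve_dp_by_lp(p, g, alpha):
    n, m = g.shape
    A, b = [], []
    for i in range(n):
        for u in range(m):
            row = -alpha * p[i, u].copy()
            row[i] += 1.0                      # encodes z_i - alpha * p_i(u)' z <= g(i, u)
            A.append(row)
            b.append(g[i, u])
    res = linprog(c=-np.ones(n), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * n)   # maximize sum z_i
    return res.x
```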
Consider the mapping H of Section 2.1 under the monotonicity Assumption 2.1.1.
Assume that X is a finite set and that instead of the contraction Assumption
2.1.2, H satisfies the following:
(1) For every J ∈ B(X), the function T J belongs to B(X).
(2) For all J, J′ ∈ B(X), H satisfies
|H(x, u, J) − H(x, u, J′)| / v(x) ≤ ‖J − J′‖,    ∀ x ∈ X, u ∈ U(x).
Semicontractive Models
Our basic model for this chapter and the next one is similar to the one
of Chapter 2, but the assumptions are different. We will maintain the
monotonicity assumption, but we will weaken the contraction assumption,
and we will introduce some other conditions in its place.
In our analysis of this chapter, the optimal cost function J * will typi-
cally be real-valued. However, the cost function Jµ of some policies µ may
take infinite values for some states. To accommodate this, we will use the
set of extended real numbers ℜ* = ℜ ∪ {∞, −∞}, and the set of all ex-
tended real-valued functions J : X → ℜ*, which we denote by E(X). We
denote by R(X) the set of real-valued functions J : X → ℜ, and by B(X)
the set of real-valued functions J : X → ℜ that are bounded with respect
to a given weighted sup-norm. Throughout this chapter and the next two,
when we write lim, lim sup, or lim inf of a sequence of functions we mean
it to be pointwise. We also write Jk → J to mean that Jk (x) → J (x) for
each x ∈ X; see our notational conventions in Appendix A.
As in Chapters 1 and 2, we introduce the set X of states and the
set U of controls, and for each x ∈ X, the nonempty control constraint
set U(x) ⊂ U. We denote by M the set of all functions µ : X → U with
µ(x) ∈ U (x), for all x ∈ X, and by Π the set of nonstationary policies
π = {µ0 , µ1 , . . .}, with µk ∈ M for all k. We refer to a stationary policy
{µ, µ, . . .} simply as µ. We introduce a mapping H : X × U × E(X) → ℜ*,
satisfying the following condition.
The monotonicity assumption implies the following properties for all J, J " ∈
E(X) and k = 0, 1, . . .,
Regular Policies
Figure 3.1.2. Illustration of S-regular and S-irregular policies for the case where
there is only one state and S = ℜ. There are three mappings Tµ corresponding
to S-irregular policies: one crosses the 45-degree line at multiple points, another
crosses at a single point but at an angle greater than 45 degrees, and the third is
discontinuous and does not cross at all. The mapping Tµ of the ℜ-regular policy
has Jµ as its unique fixed point and satisfies T_µ^k J → Jµ for all J ∈ ℜ.
postpone making the choice of S more specific. Figure 3.1.2 illustrates the
mappings Tµ of some S-regular and S-irregular policies for the case where
there is a single state and S = ℜ. Figure 3.1.3 illustrates the mapping
Tµ of an S-regular policy µ, where Tµ has multiple fixed points, and upon
changing S, the policy may become S-irregular.
3.1.1 Fixed Points, Optimality Conditions, and Algorithmic
Results
We will now introduce an analytical framework where S-regular policies
are central. Our focus is reflected by our first assumption in the following
proposition, which is that optimal policies can be found among the S-
regular policies, i.e., that for some S-regular µ∗ we have J * = Jµ∗ . This
assumption implies that
where the second equality follows from the S-regularity of µ∗ . Thus the
Bellman equation J * = T J * follows if µ∗ attains the infimum in the rela-
tion T J * = inf µ∈M Tµ J * , which is our second assumption. In addition to
existence of solution of the Bellman equation, the regularity of µ∗ implies
a uniqueness assertion and a convergence result for the VI algorithm, as
shown in the following proposition.
Figure 3.1.3. Illustration of a mapping Tµ where there is only one state and S
is a subset of the real line. Here Tµ has two fixed points, Jµ and J˜. If S is as
shown, µ is S-regular. If S is enlarged to include J˜, µ becomes S-irregular.
Proof: (a) The proof uses a more refined version of the argument preced-
ing the statement of the proposition. We first show that any fixed point J
of T that lies in S satisfies J ≤ J * . Indeed, if J = T J , then for the optimal
S-regular policy µ∗ , we have J ≤ Tµ∗ J , so in view of the monotonicity of
Tµ∗ and the S-regularity of µ∗ ,
J ≤ lim Tµk∗ J = Jµ∗ = J * .
k→∞
where the first two equalities follow from the S-regularity and optimality of
µ∗ , the second inequality follows from the monotonicity of Tµ , and the last
equality follows from the S-regularity of µ. Since µ∗ is optimal, it follows
that µ is also optimal, so equality holds in the above relation, and we have
Tµ∗ J * = T J * , implying condition (2) as stated in the proposition.
Let us also show an equivalent variation of the preceding proposition,
for problems where the validity of Bellman’s equation J * = T J * can be
independently verified. We will later encounter models where this can be
done (e.g., the perturbation model of Section 3.2.2, and the monotone
increasing and monotone decreasing models of Section 4.3).
(2) We have J * = T J * .
Then the assumptions and the conclusions of Prop. 3.1.1 hold.
Proof: We have
J * = Jµ∗ = Tµ∗ Jµ∗ = Tµ∗ J * ,
so, using also the assumption J * = T J * , we obtain Tµ∗ J * = T J * . Hence
condition (2) of Prop. 3.1.1 holds. Q.E.D.
Proof: In view of the remark following the proof of Prop. 3.1.1, it will
suffice to show that the policy µ of condition (2) is S-regular. Let µ satisfy
Tµ J * = T J * , and let µ∗ be an optimal S-regular policy. Then for all k ≥ 1,
where the first equality follows from the definition of an S-regular policy,
and the second inequality follows from the monotonicity of Tµ . If µ is
S-irregular, by taking the limit as k → ∞ in the preceding relation, the
right-hand side tends to ∞ for some x ∈ X, while the left-hand side is finite
since Jµ∗ ∈ S ⊂ R(X) - a contradiction. Thus µ is S-regular. Q.E.D.
The examples of Fig. 3.1.2 show how Bellman’s equation may fail in
the absence of existence of an optimal S-regular policy [cf. condition (1)
of Props. 3.1.1-3.1.3]. Consider for instance a problem where there is only
one policy µ that is S-irregular and Tµ has no fixed point.
Figure 3.1.4. Illustration of why condition (2) is essential in Prop. 3.1.1. Here
there is only one state and S = ℜ. There are two stationary policies: µ, for which
Tµ is a contraction, so µ is ℜ-regular, and µ̄, for which T_µ̄ has multiple fixed
points, so µ̄ is ℜ-irregular. Moreover, T_µ̄ is discontinuous from above at J_µ̄ as
shown. Here, it can be verified that T_{µ_0} · · · T_{µ_k} J̄ ≥ Jµ for all µ_0, . . . , µ_k and k,
so that Jπ ≥ Jµ for all π and the S-regular policy µ is optimal. However, µ does
not satisfy Tµ J* = T J* [cf. condition (2) of Prop. 3.1.1] and we have J* ≠ T J*.
Here the conclusions (a) and (c) of Prop. 3.1.1 are violated.
Tµk J̄ ≤ Tµk J ∗ = T k J ∗ = J ∗ ,
Policy Iteration
Jµ = Tµ Jµ ≥ T Jµ = Tµ̄ Jµ ≥ Tµ̄k Jµ , k ≥ 1,
where the last inequality holds by the monotonicity of Tµ̄ . By taking the
limit as k → ∞ in the preceding relation and using the S-regularity of µ,
we obtain Jµ ≥ Jµ̄ .
The policy improvement relation Jµ ≥ Jµ̄ shows that the PI algo-
rithm, when restricted to S-regular policies, generates a nonincreasing
sequence {Jµk }, but this does not guarantee that Jµk ↓ J * . Moreover,
guaranteeing that the policies µ^k are S-regular may not be easy, since the
equation T_µ̄ J_µ = T J_µ may be satisfied by an S-irregular policy µ̄, in which case
there is no guarantee that J_µ̄ ≤ J_µ, and an oscillation between policies
may occur. This can be seen from the example of Fig. 3.1.7, where there
are two policies µ and µ̄ satisfying
T_µ̄ J_µ = T J_µ,    T_µ J_µ̄ = T J_µ̄.
(Figure 3.1.7: Here there is only one state and S = ℜ. In addition to µ and µ̄, there
is a third policy µ*, which is S-regular and optimal. In this example all the
assumptions and conclusions of Props. 3.1.1-3.1.2 are satisfied.)
and J̄ = 0.
We will consider S-regularity with S = ℜ^n. A policy µ defines a
graph whose arcs are (x, µ(x)), x = 1, . . . , n. If this graph contains a cycle
with m arcs, x_1 → x_2 → · · · → x_m → x_1, with length L = a_{x_1 x_2} + · · · +
a_{x_{m−1} x_m} + a_{x_m x_1}, then it can be seen that for all k ≥ 1, we have
(T_µ^{km} J)(x_1) = kL + J(x_1).
Thus such a policy cannot be ℜ^n-regular (and if L ≠ 0, its cost function
Jµ has some infinite entries, so it is outside ℜ^n). By contrast, if a policy
defines a graph that is acyclic, it can be verified to be ℜ^n-regular.
Let us assume now that all cycles have positive cost (L > 0 above),
and that every node is connected to the destination with some path (this
is a common assumption in deterministic shortest path problems, which
will be revisited and generalized considerably in Section 3.2.1). Then every
ℜ^n-irregular policy has infinite cost starting from some node/state, and it
can be shown that there exists an optimal ℜ^n-regular policy. Thus Prop.
3.1.3 applies, and guarantees that J* is the unique fixed point of T within
the set {J | J ≥ J*}, and that the VI algorithm converges to J* starting
only from within that set. Actually the uniqueness of the fixed point and
the convergence of VI can be shown within the entire space ℜ^n. This is
well-known in shortest path theory, and will be covered by results to be
given in Section 3.2.1.
In the other extreme case where there is a cycle of negative cost, there
are ℜ^n-irregular policies that are optimal and no ℜ^n-regular policy can be
optimal. Thus Props. 3.1.1-3.1.3 do not apply in this case.
The case where there is a cycle with zero cost exhibits the most com-
plex behavior and will be illustrated for the example of Fig. 3.1.8. Here
X = {1, 2}, U (1) = {0, 2}, U (2) = {1}, J¯(1) = J¯(2) = 0.
There are two policies:
µ:  where µ(1) = 0, corresponding to the path 2 → 1 → 0,
µ̄:  where µ̄(1) = 2, corresponding to the cycle 1 → 2 → 1,
and the corresponding mapping H is
H(x, u, J) = { b            if x = 1, u = 0,
               a + J(2)     if x = 1, u = 2,        (3.1)
               a + J(1)     if x = 2, u = 1.
Under these conditions, the Bellman equation J = T J takes the form
J(1) = min{b, a + J(2)},    J(2) = a + J(1).    (3.2)
Figure 3.1.8. A deterministic shortest path problem with nodes 1, 2, and desti-
nation 0. Arc lengths are shown next to the arcs.
Here the policy µ is ℜ²-regular [the VI for µ is J_{k+1}(1) = b, J_{k+1}(2) =
a + J_k(1)], while the policy µ̄ is ℜ²-irregular [the VI for µ̄ is J_{k+1}(1) =
a + J_k(2), J_{k+1}(2) = a + J_k(1)].
In the case where the cycle has zero length, so a = 0, there are two
possibilities, b ≤ 0 and b > 0, which we will consider separately:
(a) a = 0, b ≤ 0: Here the ℜ²-regular policy is optimal, and Prop. 3.1.1
applies. Bellman's equation (3.2) has the unique solution
J*(1) = b,    J*(2) = b,
and VI starting from one of these solutions will keep generating that
solution. Moreover we can verify that PI may oscillate between the
optimal ℜ²-regular policy and the ℜ²-irregular policy (which is nonop-
timal if b < 0). Indeed, the ℜ²-irregular policy µ̄ is evaluated as
J_µ̄(1) = J_µ̄(2) = 0, while the ℜ²-regular policy µ is evaluated as
J_µ(1) = J_µ(2) = b, so in the policy improvement phase of the algo-
rithm, we have
µ̄(1) ∈ arg min{b, J_µ(2)},    µ(1) ∈ arg min{b, J_µ̄(2)}.
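The oscillation in case (a) can be reproduced with the following sketch for a = 0 and b < 0, where ties in the policy improvement minimization are broken in favor of the cycle; the tie-breaking rule is an assumption used only to exhibit the oscillatory behavior.

```python
# Two-node shortest path example with a = 0, b = -1: PI oscillates between the
# path policy (control 0 at node 1) and the cycle policy (control 2 at node 1).
a, b = 0.0, -1.0

def evaluate(policy):
    if policy == 0:                 # path 2 -> 1 -> 0
        J1 = b
        return J1, a + J1
    return 0.0, 0.0                 # zero-length cycle 1 -> 2 -> 1

def improve(J2):
    # argmin{ b, a + J(2) } at node 1, ties broken toward the cycle
    return 0 if b < a + J2 else 2

policy = 0
for k in range(6):
    J1, J2 = evaluate(policy)
    policy = improve(J2)
    print(k, (J1, J2), "next policy:", policy)
# the printed policies alternate between 0 and 2
```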
In this section we will use the model and the results of the preceding section
as a starting point and motivation for the analysis of various special cases.
In particular, we will introduce various analytical techniques and alterna-
tive conditions, in order to strengthen the results of Props. 3.1.1-3.1.3, and
to extend the existing theory of the SSP problem of Example 1.2.6.
belongs to S.
(c) For each S-irregular policy µ and each J ∈ S, there is at least
one state x ∈ X such that
{u ∈ U (x) | H(x, u, J ) ≤ λ}
S contains only strictly positive functions, and the function g may take
values less than 1; see also the affine monotonic and exponential cost SSP
models of Section 4.5.
The compactness part (d) of the semicontraction Assumption 3.2.1
plays a key role for asserting the existence of an optimal S-regular policy, as
well as for various proof arguments (see Exercise 3.1 for counterexamples).
It implies that for every J ∈ S, the infimum in the equation
is attained for all x ∈ X, and it also implies that for every J ∈ S, there
exists a policy µ such that Tµ J = T J . This will be shown as part of the
proof of the following proposition.
The compactness condition of Assumption 3.2.1(d) can be verified in
a few interesting cases involving both finite and infinite state and control
spaces:
(1) The case where U is a finite set.
(2) Cases where for each x, H(x, u, J ) depends on J only through its
values J (y) for y in a finite set Yx . For an illustration, consider a
mapping like the one of the SSP Example 1.2.6:
H(x, u, J) = g(x, u) + Σ_{y∈Y_x} p_{xy}(u) J(y).
Consider the policy µ formed by the point ux , for x with (T J )(x) < ∞,
and by any point ux ∈ U (x) for x with (T J )(x) = ∞. Taking the limit
in Eq. (3.6) as m → ∞ shows that µ satisfies (Tµ J )(x) = (T J )(x) for
x with (T J )(x) < ∞. For x with (T J )(x) = ∞, we also have trivially
(Tµ J )(x) = (T J )(x), so Tµ J = T J . Q.E.D.
Therefore from the definition (3.7), we have {ui }∞ i=k ⊂ Uk (x). Since Uk (x)
is compact, all the limit points of {ui }∞
i=k belong to Uk (x) and at least one
limit point exists. Hence the same is true for the limit points of the whole
sequence {ui }. Thus if ũ is a limit point of {ui }, we have
ũ ∈ ∩∞
k=0 Uk (x).
Jµ = Tµ Jµ ≥ Tµ Ĵ ≥ T Ĵ,
where the first equality follows from the S-regularity of µ. Taking the
infimum in this relation over all S-regular policies µ and using the definition
of Ĵ, we obtain Ĵ ≥ T Ĵ.
To prove the reverse relation, let µ be any policy such that Tµ Ĵ = T Ĵ
(there exists one by Lemma 3.2.1). In view of the inequality Ĵ ≥ T Ĵ just
shown, we have Ĵ ≥ Tµ Ĵ, so µ is S-regular by Lemma 3.2.2. Thus we have,
using also the monotonicity of Tµ,
Ĵ ≥ T Ĵ = Tµ Ĵ ≥ lim_{k→∞} T_µ^k Ĵ = Jµ.
From the definition of Jˆ, it follows that equality holds throughout in the
preceding relation, so µ is optimal within the class of S-regular policies,
and Jˆ is a fixed point of T .
Next we show that Jˆ is the unique fixed point of T within S. Indeed
if J # ∈ S is another fixed point, we choose an S-regular µ such that Jµ = Jˆ
(there exists one by the preceding argument), and we have
Proof of Prop. 3.2.1: (a), (b) We will first prove that T k J → Jˆ for all
J ∈ S, and we will use this to prove that Jˆ = J * and that there exists
an optimal S-regular policy. Thus both parts (a) and (b) will be shown
simultaneously.
We fix J ∈ S, and choose J # ∈ S such that J # ≤ J and J # ≤ T J #
[cf. Assumption 3.2.1(f)]. By the monotonicity of T , we have T k J # ↑ J˜ for
some J˜ ∈ E(X). Let µ be an S-regular policy such that Jµ = Jˆ [cf. Lemma
3.2.4(b)]. Then we have, using again the monotonicity of T ,
where the equality follows since T k J → Jˆ for all J ∈ S (as shown earlier),
and J¯ ∈ S [cf. Assumption 3.2.1(a)]. Thus for all π ∈ Π, Jπ ≥ Jˆ = Jµ ,
implying that the policy µ that is optimal within the class of S-regular
policies is optimal over all policies, and that Jˆ = J * .
(c) If µ is optimal, then Jµ = J * ∈ S, so by Assumption 3.2.1(c), µ is S-
regular and therefore Tµ Jµ = Jµ . Hence, Tµ J * = Tµ Jµ = Jµ = J * = T J * .
Conversely, if J * = T J * = Tµ J * , µ is S-regular (cf. Lemma 3.2.2), so
J * = limk→∞ Tµk J * = Jµ . Therefore, µ is optimal.
(d) If J ∈ S and J ≤ T J , by repeatedly applying T to both sides and using
the monotonicity of T , we obtain J ≤ T k J for all k. Taking the limit as
k → ∞ and using the fact T k J → J * [cf. part (b)], we obtain J ≤ J * . The
proof that J ≥ T J implies J ≥ J * is similar. Q.E.D.
inf_{µ: S-regular} Jµ ≤ J_{µ*_δ} ≤ J_{µ*_δ,δ} = J*_δ ≤ J_{µ′,δ} ≤ J_{µ′} + w_{µ′}(δ),    ∀ µ′: S-regular,
Example 3.2.2
Consider the case of a single state where J¯ = 0, and there are two policies,
µ∗ and µ, with
T_{µ*} J = J,    Tµ J = 1,    ∀ J ∈ ℜ.
Here we have J_{µ*} = 0 and Jµ = 1. Moreover, it can be verified that for any set
S ⊂ ℜ that contains the point 1, the optimal policy µ* is not S-regular while
the suboptimal policy µ is S-regular. For δ > 0, the δ-perturbed problem has
optimal cost J*_δ = 1 + δ, the unique solution of the Bellman equation
J = T_δ J = min{1, J} + δ,
and its optimal policy is the S-regular policy µ (see Fig. 3.2.1). We also have
Figure 3.2.1: The mapping T and its perturbed version Tδ in Example 3.2.2.
From this relation, the fact limδ↓0 Jδ* = J * just shown, and Assumption
3.2.2(c), we have
J* = lim_{δ↓0} inf_{µ∈M} Tµ J*_δ ≤ inf_{µ∈M} lim_{δ↓0} Tµ J*_δ = inf_{µ∈M} Tµ J* = T J*.    (3.10)
We also have
T J* ≤ T J*_δ = inf_{µ∈M} Tµ J*_δ = J*_δ − δ e,    ∀ δ > 0,
where the last equality follows from Eq. (3.9). By taking the limit as
δ ↓ 0, we obtain T J * ≤ J * , which combined with Eq. (3.10), shows that
J * = T J * . Thus the assumptions of Prop. 3.1.2 are satisfied and the
conclusions follow from this proposition. Q.E.D.
T^k_{µ,δ} J → J_{µ,δ},    ∀ J ∈ S.
lim sup_{k→∞} (T^k_{µ,δ} J)(x) = ∞,    (3.11)
where the first equality holds because µ is δ′-S-regular and J_{µ*,δ} ∈ S, and
the inequality uses part (a). This contradicts the δ′-S-regularity of µ, which
implies that J_{µ,δ′} belongs to S and is therefore real-valued.
(c) Follows from part (b).
(d) Since µ is an S-regular policy, it is δ-S-regular for all δ > 0 by Assump-
tion 3.2.3(b). The sequence {Jµ,δk } is monotonically nonincreasing, belongs
to S, and is bounded below by Jµ , so Jµ,δk ↓ J + for some J + ≥ Jµ ≥ J * .
Hence, by Assumption 3.2.3(d), we have J + ∈ S and for all x ∈ X,
+ +
! " ! " ! "
H x, µ(x), J = lim H x, µ(x), Jµ,δk = lim Jµ,δk (x) − δk = J (x),
k→∞ k→∞
where the second equality follows from the definition of Jµ,δk as the fixed
point of Tµ,δk . Thus J + satisfies Tµ J + = J + and is therefore equal to Jµ ,
since µ is S-regular. Hence Jµ,δk ↓ Jµ .
(e) In the desired relation, repeated below for convenience,
Proof: We will show that J * = T J * . The proof will then follow from Prop.
3.1.2. Let µ∗ be the optimal S-regular policy of Assumption 3.2.3(a), and
let {δk } be a positive sequence such that δk ↓ 0. Using Assumption 3.2.3(c),
we may choose a policy µk such that
T^m_{µ^k,δ_k} J_{µ*,δ_k} ≤ T_{µ^k,δ_k} J_{µ*,δ_k} ≤ T J_{µ*,δ_k} + δ_k e ≤ J_{µ*,δ_k},
where the first inequality follows from the monotonicity of Tµk ,δk . Taking
the limit as m → ∞, and using Assumption 3.2.3(b) [cf. Eq. (3.11)], it
follows that µk is δ k -S-regular, and we have
where the second inequality follows from Lemma 3.2.5(a). Since Jµ∗ ,δk ↓
Jµ∗ [cf. Lemma 3.2.5(d)], by taking the limit as k → ∞ in Eq. (3.12), we
obtain
Jµ∗ = lim T Jµ∗ ,δk . (3.13)
k→∞
We thus obtain
= (T Jµ∗ )(x)
≤ (Tµ∗ Jµ∗ )(x),
where the first equality is the Bellman equation for the S-regular pol-
icy µ∗ , the second equality is Eq. (3.13), and the third equality follows
from Assumption 3.2.3(d) and the fact Jµk ,δk ↓ Jµ∗ . Thus equality holds
throughout above and we obtain Jµ∗ = T Jµ∗ . Since µ∗ is optimal, we
obtain J * = T J * , and the conclusions follow from Prop. 3.1.2. Q.E.D.
Tµ J = T J = min{1, J },
Consider the finite-spaces SSP problem of Example 1.2.6, and let S = R(X).
We assume that there exists an optimal proper policy, and we will show that
Assumption 3.2.3 is satisfied, so that Prop. 3.2.4 applies.
Indeed, according to known results for SSP problems discussed in Ex-
ample 1.2.6 (e.g., [BeT96], Prop. 2.2, [Ber12a], Prop. 3.3.1), a policy µ is
proper (stops with probability 1 starting from any x ∈ X) if and only if Tµ
is a contraction with respect to a weighted sup-norm. It follows that for a
116 Semicontractive Models Chap. 3
this is because an additional cost of δ is incurred each time that the policy
does not stop. Since for the optimal proper policy µ∗ , we have Jµ∗ ,δ ∈
R(X), Assumption 3.2.3(b) holds. In addition the conditions (c) and (d) of
Assumption 3.2.3 are clearly satisfied.
Thus if there exists an optimal proper policy, Assumption 3.2.3 holds,
and the results of Prop. 3.2.4 apply. In particular, J* is the unique fixed point
of T within the set {J ∈ R(X) | J ≥ J*}, and the VI algorithm converges to
J ∗ starting from any function J within this set. These results also apply to
the search problem of Example 3.2.1, assuming that there exists an optimal
policy that stops with probability 1.
We finally note that similar to the search problem, if we just assume that
there exists at least one proper policy, while J ∗ (x) > −∞ for all x ∈ X, Prop.
3.2.2 applies and shows that limδ↓0 Jδ∗ yields the best that can be achieved
when restricted to proper policies only.
3.3 ALGORITHMS
V ≤ T V ≤ T V̄ ≤ V̄,    (3.15)
T^k V ↑ J*,    T^k V̄ ↓ J*.
The sets S(k) satisfy S(k + 1) ⊂ S(k) in view of Eq. (3.15) and the mono-
tonicity of T . Using Prop. 3.2.1, we also see that S(k) satisfy the syn-
chronous convergence and box conditions of Prop. 2.6.1. Thus, together
with Assumption 2.6.1, all the conditions of Prop. 2.6.1 are satisfied, and
the convergence of the algorithm follows starting from any J 0 ∈ S(0).
Consider next the case where Assumption 3.2.3 holds with S = {J ∈
B(X) | J ≥ J*}. In this case we use the sets S(k) given by
S(k) = {J ∈ S | J* ≤ J ≤ T^k V},    k = 0, 1, . . . ,
where V is a function in S with J * ≤ T V ≤ V . These sets satisfy the
synchronous convergence and box conditions of Prop. 2.6.1, and we can
similarly show asynchronous convergence to J * of the generated sequence
{J t } starting from any J 0 ∈ S(0).
where
• F_µ(V, Q) is a function with a component F_µ(V, Q)(x, u) for each (x, u),
defined by
F_µ(V, Q)(x, u) = H(x, u, min{V, Q_µ}),    (3.18)
• M F_µ(V, Q) is a function with a component (M F_µ(V, Q))(x) for each
x, where M is the operator of pointwise minimization over u, so that
(M F_µ(V, Q))(x) = min_{u∈U(x)} F_µ(V, Q)(x, u).
and leaves V and µ unchanged, i.e., V^{t+1}(x) = V^t(x) and µ^{t+1}(x) =
µ^t(x) for all x ∈ X_ℓ.
To this end, we first show that Q* is the unique fixed point of the mapping
F defined by
(F Q)(x, u) = H(x, u, M Q),    x ∈ X, u ∈ U(x).
where Fµ (V, Q) is given by Eq. (3.18). Note that the policy evaluation part
of the algorithm [cf. Eq. (3.20)] amounts to applying the second component
of Lµ , while the policy improvement part of the algorithm [cf. Eq. (3.19)]
amounts to applying the second component of Lµ , and then applying the
first component of Lµ . The following proposition shows that (J * , Q* ) is
the common fixed point of the mappings Lµ , for all µ.
J * = M Q* , Q* = F Q* = Fµ (J * , Q* ),
Q = Fµ (V , Q) = F Q,
The uniform fixed point property of Lµ just shown is, however, in-
sufficient for the convergence proof of the asynchronous algorithm, in the
absence of a contraction property. For this reason, we introduce two map-
pings L and L that are associated with the mappings Lµ and satisfy
[cf. Eq. (3.18)]. Similarly, there exists µ that attains the minimum in Eq.
(3.24), uniformly for all V and (x, u). Thus for any given (V, Q), we have
where µ and µ̄ are some policies. The following proposition shows that
(J * , Q* ), the common fixed point of the mappings Lµ , for all µ, is also the
unique fixed point of L and L.
J_c^− = J* − c e,    Q_c^− = Q* − c e_Q,
J_c^+ = J* + c e,    Q_c^+ = Q* + c e_Q,
where e and e_Q are the unit functions in the spaces of J and Q, respectively.
Proposition 3.3.3: Let Assumption 3.3.1 hold. Then for all c > 0,
L^k(J_c^−, Q_c^−) ↑ (J*, Q*),    L̄^k(J_c^+, Q_c^+) ↓ (J*, Q*),    (3.26)
where L^k (or L̄^k) denotes the k-fold composition of L (or L̄, respec-
tively).
Proof: For any µ, using the assumption (3.17), we have for all (x, u),
F_µ(J_c^+, Q_c^+)(x, u) = H(x, u, min{J_c^+, Q_c^+})
                     = H(x, u, min{J*, Q*} + c e)
                     ≤ H(x, u, min{J*, Q*}) + c
                     = Q*(x, u) + c
                     = Q_c^+(x, u),
and similarly
Q_c^−(x, u) ≤ F_µ(J_c^−, Q_c^−)(x, u).
We also have M Q_c^+ = J_c^+ and M Q_c^− = J_c^−. From these relations, the
definition of L_µ, and the fact L_µ(J*, Q*) = (J*, Q*) (cf. Prop. 3.3.1), we obtain
(J_c^−, Q_c^−) ≤ L_µ(J_c^−, Q_c^−) ≤ (J*, Q*) ≤ L_µ(J_c^+, Q_c^+) ≤ (J_c^+, Q_c^+).
Using this relation and Eqs. (3.23) and (3.25), we obtain
(J_c^−, Q_c^−) ≤ L(J_c^−, Q_c^−) ≤ (J*, Q*) ≤ L̄(J_c^+, Q_c^+) ≤ (J_c^+, Q_c^+).    (3.27)
Denote for k = 0, 1, . . . ,
(V̄^k, Q̄^k) = L̄^k(J_c^+, Q_c^+),    (V^k, Q^k) = L^k(J_c^−, Q_c^−).
lim sup_{k→∞} (T^k_{µ,δ} J)(x) = ∞.
Such µk+1 exists by Assumption 3.3.2(a), and we claim that µk+1 is δk+1 -
S-regular. To see this, note that from Lemma 3.2.5(e), we have
T_{µ^{k+1},δ_k} J_{µ^k,δ_k} = T J_{µ^k,δ_k} + δ_k e ≤ T_{µ^k} J_{µ^k,δ_k} + δ_k e = J_{µ^k,δ_k},
so that
T^m_{µ^{k+1},δ_k} J_{µ^k,δ_k} ≤ T_{µ^{k+1},δ_k} J_{µ^k,δ_k} = T J_{µ^k,δ_k} + δ_k e ≤ J_{µ^k,δ_k},    ∀ m ≥ 1.    (3.29)
Since Jµk ,δk ∈ R(X), from Assumption 3.3.2(b) it follows that µk+1 is δk -
S-regular, and hence also δk+1 -S-regular, by Lemma 3.2.5(c). Thus the se-
quence {µk } generated by the perturbed PI algorithm (3.28) is well-defined
and consists of δk -S-regular policies. We have the following proposition.
J_{µ^{k+1},δ_{k+1}} ≤ J_{µ^{k+1},δ_k} = lim_{m→∞} T^m_{µ^{k+1},δ_k} J_{µ^k,δ_k} ≤ T J_{µ^k,δ_k} + δ_k e ≤ J_{µ^k,δ_k},
We also have
= inf_{u∈U(x)} H(x, u, J^+),
where the equality follows from Assumption 3.2.3(d). It follows that equal-
ity holds throughout above, so that
Note that when X and U are finite sets, as in the SSP problem of
Example 3.2.4, Prop. 3.3.4 implies that the generated policies µk will be
optimal for all k sufficiently large. The reason is that in this case, the set
of policies is finite and there exists a sufficiently small ε > 0, such that for
all nonoptimal µ there is some state x such that Jµ(x) ≥ J*(x) + ε.
In the absence of finiteness of X and U , Prop. 3.3.4 guarantees the
monotonic pointwise convergence of {Jµk ,δk } to J * (see the preceding proof)
and the (possibly nonmonotonic) pointwise convergence of {Jµk } to J * .
This convergence behavior should be contrasted with the behavior of PI
without perturbations, which may lead to oscillation between two nonop-
timal policies, as noted earlier.
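For a finite-state SSP-type problem, the perturbed PI iteration just discussed can be sketched as follows: the policy evaluation solves the linear system of the δ_k-perturbed problem, and δ_k is reduced geometrically. The data layout and the assumption that every generated policy is proper (so that I − P_µ is invertible) are illustrative simplifications, not requirements stated in the text.

```python
import numpy as np

# Sketch of perturbed PI: evaluate mu^k in the delta_k-perturbed problem
# (one-stage cost g + delta_k), then improve using the unperturbed mapping T.
def perturbed_pi(p, g, num_iters=20, delta0=1.0, decay=0.5):
    n, m = g.shape
    mu, delta = np.zeros(n, dtype=int), delta0
    for k in range(num_iters):
        P_mu = np.array([p[x, mu[x]] for x in range(n)])
        g_mu = np.array([g[x, mu[x]] + delta for x in range(n)])
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)        # J_{mu^k, delta_k}
        q = np.array([[g[x, u] + p[x, u] @ J for u in range(m)] for x in range(n)])
        mu = q.argmin(axis=1)                              # T_{mu^{k+1}} J = T J
        delta *= decay
    return J, mu
```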
problem of Example 1.2.6, which involves finite state and control spaces,
as well as a termination state. In the absence of a termination state, a
key idea has been to generalize the notion of a proper policy from one that
leads to termination with probability 1, to one that is S-regular for an
appropriate set of functions S.
The line of proof of Prop. 3.1.1 dates back to an analysis of SSP
problems with finite state and control spaces, given in the author’s [Ber87],
Section 6.4, which assumes existence of an optimal proper policy and non-
negativity of the one-stage cost. Proposition 3.2.1 is patterned after a
similar result in Bertsekas and Tsitsiklis [BeT91] for SSP problems with fi-
nite state space and compact control constraint sets. The proof given there
contains an intricate part (Lemma 3 of [BeT91]) to show a lower bound on
the cost functions of proper policies, which is assumed here in part (b) of
the semicontraction Assumption 3.2.1.
The perturbation analysis of Section 3.2.2, including the PI algorithm
of Section 3.3.3, are new and are based on unpublished collaboration of the
author with H. Yu. The results for SSP problems using this analysis (cf.
Prop. 3.2.4) strengthen the results of [Ber87] (Section 6.4) and [BeT91]
(Prop. 3), in that the one-stage cost need not be assumed nonnegative.
We have given two different perturbation approaches in Section 3.2.2. The
first approach places assumptions on the optimal cost function Jδ* of the
δ-perturbed problem (cf. Prop. 3.2.2 and Assumption 3.2.2), while the sec-
ond places assumptions on policies (cf. Assumption 3.2.3) and separates
them into δ-S-regular and δ-S-irregular. The first approach is simpler an-
alytically, and at least in part, does not require existence of an S-regular
policy (cf. Prop. 3.2.2). The second approach allows the development of
a perturbed PI algorithm and the corresponding analysis of Section 3.3.3
(under the extra conditions of Assumption 3.3.2).
The asynchronous PI algorithm of Section 3.3.2 is essentially the same
as one of the optimistic PI algorithms of Yu and Bertsekas [YuB11a] for
the SSP problem of Example 1.2.6. This paper also analyzed asynchronous
stochastic iterative versions of the algorithms, and proved convergence re-
sults that parallel those for classical Q-learning for SSP, given in Tsitsiklis
[Tsi94] and Yu and Bertsekas [YuB11b]. We follow the line of analysis of
that paper. A related paper, which deals with a slightly different asyn-
chronous PI algorithm in an abstract setting and without a contraction
structure, is Bertsekas and Yu [BeY10b].
By allowing an infinite state space, the analysis of the present chapter
applies among others to SSP problems with a countable state space. Such
problems often arise in queueing control problems where the termination
state corresponds to an empty queue. The problem then is to empty the
system with minimum expected cost. Generalized forms of SSP problems,
which involve an infinite (uncountable) number of states, in addition to the
termination state, are analyzed by Pliska [Pli78], Hernandez-Lerma et al.
[HCP99], and James and Collins [JaC06]. The latter paper allows improper
EXERCISES
Consider an SSP problem where there is only one state x = 1, in addition to the
termination state 0. At state 1, we can choose a control u with 0 < u ≤ 1, while
incurring a cost −u; we then move to state 0 with probability u2 , and stay in state
1 with probability 1 − u2 . We may regard u as a demand made by a blackmailer,
and state 1 as the situation where the victim complies. State 0 is the situation
where the victim (permanently) refuses to yield to the blackmailer’s demand.
The problem then can be seen as one whereby the blackmailer tries to maximize
his total gain by balancing his desire for increased demands with keeping his
victim compliant. In terms of abstract DP we have
X = {1},    U(1) = (0, 1],    J̄(1) = 0,    H(1, u, J) = −u + (1 − u²)J(1).
(a) Verify that Tµ is a sup-norm contraction for each µ. In addition, show that
Jµ(1) = −1/µ(1), so that J*(1) = −∞, that there is no optimal policy, and
that T has no fixed points within ℜ. Which parts of Assumption 3.2.1 with
S = ℜ are violated?
(b) Consider a variant of the problem where at state 1, we terminate at no cost
with probability u, and stay in state 1 at a cost −u with probability 1 − u.
Here we have
H(1, u, J) = (1 − u)(−u) + (1 − u)J(1).
Verify that J*(1) = −1, that there is no optimal policy, and that T has
multiple fixed points within ℜ. Which parts of Assumption 3.2.1 with
S = ℜ are violated?
(c) Repeat part (b) for the case where at state 1, we may also choose u = 0
at a cost c. Show that the policy µ that chooses µ(1) = 0 is ℜ-irregular.
What are the optimal policies and the fixed points of T in the three cases
where c > 0, c = 0, and c < 0? Which parts of Assumption 3.2.1 with
S = ℜ are violated in each of these three cases?
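A quick numerical sketch for part (a) is given below: VI is applied to (T J)(1) = inf_{0<u≤1} [−u + (1 − u²)J(1)], with the infimum approximated over a grid of controls. The grid and iteration count are arbitrary assumptions; the point is that the iterates decrease without bound, consistent with J*(1) = −∞ and the absence of real fixed points of T.

```python
import numpy as np

# Value iteration for the blackmailer's problem of part (a).
u_grid = np.linspace(1e-3, 1.0, 1000)
J = 0.0                                   # J_bar(1) = 0
for k in range(1, 201):
    J = np.min(-u_grid + (1.0 - u_grid ** 2) * J)
    if k % 50 == 0:
        print(k, J)
# the printed values keep decreasing (roughly like -sqrt(k)); T has no real fixed point
```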
Let S be a given subset of E(X). Show that the assumptions of Prop. 3.1.1 hold
if and only if J ∗ ∈ S, T J ∗ ≤ J ∗ , and there exists an S-regular policy µ such that
Tµ J ∗ = T J ∗ .
3.3
Consider the three-node shortest path example of Section 3.1.2. Try to apply
Prop. 3.1.1 with S = [−∞, ∞) × [−∞, ∞). What conclusions can you obtain for
various values of a and b, and how do they compare with those for S = !2 ?
Let the assumptions of Prop. 3.1.1 hold, and let J ∗ be the optimal cost function.
Suppose that J¯ is changed to some function J ∈ S.
(a) Show that following the change, J ∗ continues to be the optimal cost func-
tion over just the S-regular policies.
(b) Consider the three-node shortest path problem of Section 3.1.2 for the case
where a = 0, b < 0. Change J̄ from J̄ = 0 to J̄ = r e where r ∈ ℜ.
Verify the result of part (a) for this example. For which values of r is the
ℜ²-irregular policy optimal?
The purpose of this exercise and the next one is to provide conditions that imply
the results of Prop. 3.1.1. Let S be a given subset of E(X). Assume that:
(1) There exists an optimal S-regular policy.
(2) For every S-irregular policy µ, we have Tµ J ∗ ≥ J ∗ .
Then the assumptions and the conclusions of Prop. 3.1.1 hold.
Let Assumption 3.2.1 hold, and let {µk } be the sequence generated by the PI
algorithm described at the start of Section 3.3.2 [cf. Eq. (3.16)]. Let also J∞ =
limk→∞ Jµk , and assume that H(x, u, Jµk ) → H(x, u, J∞ ) for all x ∈ X and
u ∈ U (x). Show that J∞ = J ∗ .
4
Noncontractive Models
We will use the set of extended real numbers
ℜ* = ℜ ∪ {∞, −∞},
and we denote by E(X) the set of all extended real-valued functions J : X → ℜ*.
We denote by R(X) the set of real-valued functions J : X → ℜ, and by
B(X) the set of real-valued functions J : X → ℜ that are bounded with
respect to a given weighted sup-norm. The operations with ∞ and −∞ are
summarized in Appendix A. In particular, we adopt standard conventions
regarding ordering, addition, and multiplication in ℜ*, except that we take
∞ − ∞ = −∞ + ∞ = ∞, and we take the product of 0 and ∞ or −∞
to be 0, so that the sum and product of two extended real numbers are
well-defined (see Appendix A for details).
A fact that we will be using frequently is that for each J ∈ E(X) and
scalar ε > 0, there exists a µ_ε ∈ M such that for all x ∈ X,
(T_{µ_ε} J)(x) ≤ { (T J)(x) + ε    if (T J)(x) > −∞,
                  −(1/ε)           if (T J)(x) = −∞.
In particular, if J is such that (T J)(x) > −∞ for all x ∈ X, then for each
ε > 0, there exists a µ_ε ∈ M such that (T_{µ_ε} J)(x) ≤ (T J)(x) + ε for all x ∈ X.
We will often use in our analysis the unit function e, defined by e(x) ≡ 1,
so for example, we will write the above relation in shorthand as
T_{µ_ε} J ≤ T J + ε e.
minimize Jπ(x) subject to π ∈ Π.    (4.2)
J_N*(x) = inf_{π∈Π} J_{N,π}(x),    J*(x) = inf_{π∈Π} Jπ(x),    ∀ x ∈ X.
J_{N,π*}(x) = J_N*(x),    ∀ x ∈ X,
Consider the N -stage problem (4.1), where the cost function JN,π is defined
by
JN,π (x) = (Tµ0 · · · TµN −1 J¯)(x), ∀ x ∈ X.
Based on the theory of finite horizon DP, we expect that (at least under
some conditions) the optimal cost function J_N* is obtained by N successive
applications of the DP mapping T on the initial function J̄, i.e.,
J_N* = inf_{π∈Π} J_{N,π} = T^N J̄.
This is the analog of Bellman’s equation for the finite horizon problem in
a DP context.
A favorable case where the analysis is simplified and we can easily show that
J_N* = T^N J̄ is when the finite horizon DP algorithm yields an optimal policy
during its execution. By this we mean that the algorithm that starts with
J̄, and sequentially computes T J̄, T² J̄, . . . , T^N J̄, also yields corresponding
µ*_{N−1}, µ*_{N−2}, . . . , µ*_0 ∈ M such that
T_{µ*_k}(T^{N−k−1} J̄) = T^{N−k} J̄,    k = 0, . . . , N − 1.    (4.3)
While µ*_{N−1}, . . . , µ*_0 ∈ M satisfying this relation need not exist (because
the corresponding infimum in the definition of T is not attained), if they
do exist, they both form an optimal policy and also guarantee that
J_N* = T^N J̄.
where the inequality follows from the monotonicity assumption and the def-
inition of T, and the last equality follows from Eq. (4.3). Thus {µ*_0, µ*_1, . . .}
has no worse N-stage cost function than every other policy, so it is N-stage
optimal and J_N* = T_{µ*_0} · · · T_{µ*_{N−1}} J̄. By taking the infimum of the left-hand
side over π ∈ Π, we obtain J_N* = T^N J̄.
The preceding argument can also be used to show that {µ∗k , µ∗k+1 , . . .}
is (N − k)-stage optimal for all k = 0, . . . , N − 1. Such a policy is called
uniformly N -stage optimal . The fact that the finite horizon DP algorithm
provides an optimal solution of all the k-stage problems for k = 1, . . . , N ,
rather than just the last one, is a manifestation of the classical principle
the corresponding part of the proof of Prop. 3.2.1 (cf. Lemma 3.2.1) applies
and shows that the above infimum is attained. Q.E.D.
! " ! "
inf (Tµ Jm ) = lim (Tµ Jm ) = Tµ lim Jm = Tµ inf Jm .
m m→∞ m→∞ m
Since J_N* ≤ T_{µ_0} · · · T_{µ_{N−1}} J̄ for all µ_0, . . . , µ_{N−1} ∈ M, we have, using also
Eq. (4.5) and the assumption J_{k,π}(x) < ∞, for all k, π, and x,
J_N* ≤ inf_{m_0} · · · inf_{m_{N−1}} T_{µ_0^{m_0}} · · · T_{µ_{N−1}^{m_{N−1}}} J̄
    = inf_{m_0} · · · inf_{m_{N−2}} T_{µ_0^{m_0}} · · · T_{µ_{N−2}^{m_{N−2}}} ( inf_{m_{N−1}} T_{µ_{N−1}^{m_{N−1}}} J̄ )
    = inf_{m_0} · · · inf_{m_{N−2}} T_{µ_0^{m_0}} · · · T_{µ_{N−2}^{m_{N−2}}} T J̄
    ⋮
    = inf_{m_0} T_{µ_0^{m_0}} (T^{N−1} J̄)
    = T^N J̄.
On the other hand, it is clear from the definitions that T^N J̄ ≤ J_{N,π} for all
N and π ∈ Π, so that T^N J̄ ≤ J_N*. Thus, J_N* = T^N J̄. Q.E.D.
Moreover, there exists a scalar α ∈ (0, ∞) such that for all scalars
r ∈ (0, ∞) and functions J ∈ E(X), we have
H(x, u, J + r e) ≤ H(x, u, J) + αr,    ∀ x ∈ X, u ∈ U(x).    (4.6)
J_N* ≤ J_{N,π_ε} ≤ J_N* + ε e.
J*_{k+1} ≤ Tµ J_{k,π_ε} ≤ Tµ J_k* + αε e.
Taking the infimum over µ and then the limit as ε → 0, we obtain J*_{k+1} ≤
T J_k*. By using the induction hypothesis J_k* = T^k J̄, it follows that J*_{k+1} ≤
T^{k+1} J̄. On the other hand, we have clearly T^{k+1} J̄ ≤ J_{k+1,π} for all π ∈ Π,
so that T^{k+1} J̄ ≤ J*_{k+1}, and hence T^{k+1} J̄ = J*_{k+1}.
We now turn to the existence of an ε-optimal policy part of the in-
duction argument. Using the assumption J_k*(x) > −∞ for all x ∈ X,
for any ε > 0, we can choose π̄ = {µ̄_0, µ̄_1, . . .} such that
J_{k,π̄} ≤ J_k* + (ε/2α) e,    (4.7)
and µ̄ ∈ M such that Tµ̄ J_k* ≤ T J_k* + (ε/2) e. Then, for π_ε = {µ̄, µ̄_0, µ̄_1, . . .},
J_{k+1,π_ε} = Tµ̄ J_{k,π̄} ≤ Tµ̄ J_k* + (ε/2) e ≤ T J_k* + ε e = J*_{k+1} + ε e,
where the first inequality is obtained by applying Tµ̄ to Eq. (4.7) and using
Eq. (4.6). The induction is complete. Q.E.D.
Let
X = {0},    U(0) = (−1, 0],    J̄(0) = 0,
H(0, u, J) = { u            if −1 < J(0),
               J(0) + u     if J(0) ≤ −1.
Then
(T_{µ_0} · · · T_{µ_{N−1}} J̄)(0) = µ_0(0),
and J_N*(0) = −1, while (T^N J̄)(0) = −N for every N. Here Assumption 4.2.1
and the condition (4.6) (cf. Assumption 4.2.2) are violated, even though the
condition J_k*(x) > −∞ for all x ∈ X (cf. Assumption 4.2.2) is satisfied.
Let
X = {0, 1},    U(0) = U(1) = (−∞, 0],    J̄(0) = J̄(1) = 0,
H(0, u, J) = { u    if J(1) = −∞,
               0    if J(1) > −∞,
H(1, u, J) = u.
Then
(T_{µ_0} · · · T_{µ_{N−1}} J̄)(0) = 0,    (T_{µ_0} · · · T_{µ_{N−1}} J̄)(1) = µ_0(1),    ∀ N ≥ 1.
It can be seen that for N ≥ 2, we have J_N*(0) = 0 and J_N*(1) = −∞, but
(T^N J̄)(0) = (T^N J̄)(1) = −∞. Here Assumption 4.2.1 and the condition
J_k*(x) > −∞ for all x ∈ X (cf. Assumption 4.2.2) are violated, even though
the condition (4.6) of Assumption 4.2.2 is satisfied.
Let
X = {0, 1},    U(0) = U(1) = ℜ,    J̄(0) = J̄(1) = 0,
let w be a real-valued random variable with E{w} = ∞, and let
H(x, u, J) = { E{w + J(1)}    if x = 0,
               u + J(1)       if x = 1,
for all x ∈ X, u ∈ U(x). Then if J_m is real-valued for all m, and J_m(1) ↓ J(1) = −∞, we have
lim_{m→∞} H(0, u, J_m) = lim_{m→∞} E{w + J_m(1)} = ∞,
while
H(0, u, lim_{m→∞} J_m) = E{w + J(1)} = −∞,
so Assumption 4.2.1 is violated. Indeed, the reader may verify with a straight-
forward calculation that J_2*(0) = ∞, J_2*(1) = −∞, while (T² J̄)(0) = −∞,
(T² J̄)(1) = −∞, so J_2* ≠ T² J̄. Note that Assumption 4.2.2 is also violated
because J_2*(1) = −∞.
Let α = 1 and
f(x, u) = 0,    ∀ x ∈ X, u ∈ U(x),
g(x, u) = { −u    if x = 0,
            x     if x ≠ 0,
for all u ∈ U(x), so that
H(x, u, J) = g(x, u) + J(0).
Then for π ∈ Π and x ≠ 0, we have J_{2,π}(x) = x − µ_1(0), so that J_2*(x) = −∞
for all x ∈ X. Here Assumption 4.2.1, as well as Eq. (4.6) (cf. Assumption
4.2.2), are satisfied, and indeed we have J_2*(x) = (T² J̄)(x) = −∞ for all
x ∈ X. However, the condition J_k*(x) > −∞ for all x and k (cf. Assumption
4.2.2) is violated, and it is seen that there does not exist a two-stage ε-optimal
policy for any ε > 0, since an ε-optimal policy π = {µ_0, µ_1} must satisfy
J_{2,π}(x) = x − µ_1(0) ≤ −1/ε,    ∀ x ∈ X,
[in view of J_2*(x) = −∞ for all x ∈ X], which is impossible.
Consider the infinite horizon problem (4.2), where the cost function of a
policy π = {µ0 , µ1 , . . .} is
Jπ (x) = lim sup (Tµ0 · · · Tµk J¯)(x), ∀ x ∈ X.
k→∞
(c) There exists a scalar α ∈ (0, ∞) such that for all scalars r ∈
(0, ∞) and functions J ∈ E(X) with J̄ ≤ J, we have
H(x, u, J + r e) ≤ H(x, u, J) + αr,    ∀ x ∈ X, u ∈ U(x).
Jµ = T µ Jµ .
Figure 4.3.2. An example where nonstationary policies are dominant under As-
sumption D. Here there is only one state and S = ℜ. There are two stationary
policies µ and µ̄ with cost functions Jµ and J_µ̄ as shown. However, by considering
a nonstationary policy of the form π_k = {µ̄, . . . , µ̄, µ, µ, . . .}, with a number k of
policies µ̄, we can obtain a sequence {J_{π_k}} that converges to the value J* shown.
Note that here there is no optimal policy, stationary or not.
T_{µ_ε} J ≤ T J + ε e.
J* ≤ J_{π_ε} ≤ J* + ε e.
Proof: Let {ε_k} be a sequence such that ε_k > 0 for all k and
Σ_{k=0}^∞ α^k ε_k = ε.    (4.8)
For each x ∈ X, consider a sequence of policies {π_k[x]} ⊂ Π of the form
π_k[x] = {µ_0^k[x], µ_1^k[x], . . .},    (4.9)
such that for k = 0, 1, . . . ,
J_{π_k[x]}(x) ≤ J*(x) + ε_k.    (4.10)
Such a sequence exists, since we have assumed that J̄(x) > −∞, and
therefore J*(x) > −∞, for all x ∈ X.
144 Noncontractive Models Chap. 4
From Eqs. (4.12), (4.13), and part (c) of Assumption I, we have for all
x ∈ X and k = 1, 2, . . .,
and finally
T_{µ_{k−1}} J̄_k ≤ J̄_{k−1} + αε_k e,    k = 1, 2, . . . .
Using this inequality and part (c) of Assumption I, we obtain
T_{µ_0} · · · T_{µ_{k−1}} J̄_k ≤ J̄_0 + (αε_1 + · · · + α^k ε_k) e ≤ J* + (Σ_{i=0}^k α^i ε_i) e.
Denote π_ε = {µ_0, µ_1, . . .}. Then by taking the limit in the preceding in-
equality and using Eq. (4.8), we obtain
J_{π_ε} ≤ J* + ε e.
If α < 1, we take ε_k = ε(1 − α) for all k, and π_k[x] = {µ_0[x], µ_1[x], . . .}
in Eq. (4.10). The stationary policy π_ε = {µ, µ, . . .}, where µ(x) = µ_0[x](x)
for all x ∈ X, satisfies J_{π_ε} ≤ J* + ε e. Q.E.D.
J * = T J *.
≥ (Tµ0 J * )(x)
≥ (T J * )(x).
J * ≥ T J *.
To prove the reverse inequality, let ε_1 and ε_2 be any positive scalars,
and let π = {µ_0, µ_1, . . .} be such that
= T_{µ_0} J_{π_1}
≤ T_{µ_0} J* + αε_2 e
≤ T J* + (ε_1 + αε_2) e.
Taking the limit as k → ∞, we obtain
J* ≤ Jπ = lim_{k→∞} T_{µ_0} T_{µ_1} · · · T_{µ_k} J̄ ≤ T J* + (ε_1 + αε_2) e.
Since ε_1 and ε_2 can be taken arbitrarily small, it follows that
J* ≤ T J*.
Hence J * = T J * .
Assume that J′ ∈ E(X) satisfies J′ ≥ J̄ and J′ ≥ T J′. Let {εk} be
any sequence with εk > 0 for all k, and consider a policy π = {µ0, µ1, . . .} ∈
Π such that

    Tµk J′ ≤ T J′ + εk e,    k = 0, 1, . . . .
We have from part (c) of Assumption I
    J* = inf_{π∈Π} lim_{k→∞} Tµ0 · · · Tµk J̄
The following counterexamples show that parts (b) and (c) of As-
sumption I are essential for the preceding proposition to hold.
Let

    (Tµ0 · · · TµN−1 J̄)(0) = 0,    (Tµ0 · · · TµN−1 J̄)(1) = µ0(1).

Thus

Let

    (TµN J̄)(0) = 0,    (TµN J̄)(1) = N,

    Tµ0 · · · TµN−1 J̄ ≥ JN*,

so by taking the limit of both sides as N → ∞, we obtain Jπ ≥ lim_{N→∞} JN*,
and by taking infimum over π, J* ≥ lim_{N→∞} JN*. Thus J* = lim_{N→∞} JN*.
Q.E.D.
J * = T J *.
where the last inequality follows from the fact T k J¯ ↓ J * (cf. Prop. 4.3.5).
Taking the infimum of both sides over π ∈ Π, we obtain J * ≥ T J * .
To prove the reverse inequality, we select any µ ∈ M, and we apply
Tµ to both sides of the equation J * = limN →∞ T N J¯ (cf. Prop. 4.3.5). By
    ≥ lim_{N→∞} T^N J′
    ≥ J′,
Proof: Consider the variation of our problem where the control constraint
set is Uµ(x) = {µ(x)} rather than U(x) for all x ∈ X. Application of Prop.
4.3.6 yields the result. Q.E.D.
An examination of the proof of Prop. 4.3.6 shows that the only point
where we need part (b) of Assumption D was in establishing the relations
! "
* = T
lim T JN *
lim JN
N →∞ N →∞
the result of Prop. 4.3.6 follows. In this manner we obtain the following
proposition.
Then
J * = T J *.
Furthermore, if J′ ∈ E(X) is such that J′ ≤ J̄ and J′ ≤ T J′, then
J′ ≤ J*.
Proof: A nearly verbatim repetition of Prop. 4.2.4 shows that under our
assumptions we have JN* = T^N J̄ for all N. We will show that
! "
* ) ≤ H x, u, lim J *
lim H(x, u, JN
N →∞ N ,
N →∞
∀ x ∈ X, u ∈ U (x).
! "
H(x̃, ũ, Jk* ) − " ≤ H x̃, ũ, Jk* − ("/α) e ≤ H x̃, ũ, lim JN
*
# $
,
N →∞
Tµ J * = T J * .
T µ Jµ = T J µ .
Thus under D, a stationary optimal policy attains the infimum in the fixed
point Eq. (4.15) for all x. However, there may exist nonoptimal stationary
policies also attaining the infimum for all x, as shown by Fig. 4.3.2, and by
case (a) of the three-node shortest path example of Section 3.1.2. Moreover,
it is possible that this infimum is attained but no optimal policy exists, as
shown by Fig. 4.3.2 (see also Exercise 4.7).
Proposition 4.3.9 shows that under Assumption I, there exists a sta-
tionary optimal policy if and only if the infimum in the optimality equation
is attained for every x ∈ X. When the infimum is not attained for some x ∈
X, this optimality equation can still be used to yield an !-optimal policy,
which can be taken to be stationary whenever the scalar α in Assumption
I(c) is strictly less than 1. This is shown in the following proposition.
    Tµ*k J* ≤ T J* + εk e,    ∀ k = 0, 1, . . . ,

then

    J* ≤ Jπ* ≤ J* + ε e.

    Tµ* J* ≤ T J* + ε(1 − α) e,

then

    J* ≤ Jµ* ≤ J* + ε e.

    Tµ*k J* ≤ J* + εk e,
Applying Tµ*k−2 throughout and repeating the process, we obtain for every
k = 1, 2, . . .,

    Tµ*0 · · · Tµ*k J* ≤ J* + (Σ_{i=0}^k α^i εi) e,    k = 1, 2, . . . .

Since J̄ ≤ J*, it follows that

    Tµ*0 · · · Tµ*k J̄ ≤ J* + (Σ_{i=0}^k α^i εi) e,    k = 1, 2, . . . .

By taking the limit as k → ∞, we obtain Jπ* ≤ J* + ε e.
(b) This part is proved by taking εk = ε(1 − α) and µ*k = µ* for all k in
the preceding argument. Q.E.D.
    lim_{k→∞} T^k J0 = J*.

are compact for every x ∈ X, λ ∈ ℜ, and for all k greater than some
integer k̄. Assume that J0 ∈ E(X) is such that J̄ ≤ J0 ≤ J*. Then

    lim_{k→∞} T^k J0 = J*.
Proof: Similar to the proof of Prop. 4.3.13, it will suffice to show that
T k J¯ → J * . Since J¯ ≤ J * , we have T k J¯ ≤ T k J * = J * , so that
J¯ ≤ T J¯ ≤ · · · ≤ T k J¯ ≤ · · · ≤ J * .
J ∞ ≤ T J∞ .
Clearly, by Eq. (4.17), we must have J∞ (x̃) < ∞. For every k, consider
the set
" # %
Uk x̃, J∞ (x̃) = u ∈ U (x̃) $ H(x̃, uk , T k J¯) ≤ J∞ (x̃) ,
! $
for all x̃ ∈ X with J * (x̃) < ∞. For x̃ ∈ X such that J * (x̃) = ∞, every
u ∈ U (x̃) attains the preceding minimum. Hence by Prop. 4.3.9 an optimal
stationary policy exists. Q.E.D.
and µ∗ (x̃) is a limit point of {µk (x̃)}, for every x̃ ∈ X, then the stationary
policy µ∗ is optimal. Furthermore, {µk (x̃)} has at least one limit point
for every x̃ ∈ X for which J * (x̃) < ∞. Thus the VI algorithm under the
assumption of Prop. 4.3.14 yields in the limit not only the optimal cost
function J * but also an optimal stationary policy.
On the other hand, under Assumption I but in the absence of the
compactness condition (4.16), T k J¯ need not converge to J * . What is hap-
pening here is that while the mappings Tµ are continuous from below as
required by Assumption I(b), T may not be, and a phenomenon like the
one illustrated in the left-hand side of Fig. 4.3.1 may occur, whereby
! "
lim T k J¯ ≤ T lim T k J¯ ,
k→∞ k→∞
with strict inequality for some x ∈ X. This can happen even in simple
deterministic optimal control problems, as shown by the following example.
Let

    X = [0, ∞),    U(x) = (0, ∞),    J̄(x) = 0,    ∀ x ∈ X,

and

    H(x, u, J) = min{1, x + J(2x + u)},    ∀ x ∈ X, u ∈ U(x).
Then it can be verified that for all x ∈ X and policies µ, we have Jµ (x) = 1,
as well as J ∗ (x) = 1, while it can be seen by induction that starting with J¯,
the VI algorithm yields
    (T^k J̄)(x) = min{1, (1 + 2 + · · · + 2^{k−1}) x},    ∀ x ∈ X, k = 1, 2, . . . .
Thus we have 0 = lim_{k→∞} (T^k J̄)(0) ≠ J*(0) = 1.
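A minimal numerical check of the preceding closed form, assuming that the infimum over u ∈ (0, ∞) is approached as u ↓ 0, so that the coefficient a_k of x in (T^k J̄)(x) = min{1, a_k x} satisfies a_1 = 1 and a_{k+1} = 1 + 2a_k:

# (T^k Jbar)(x) = min(1, a_k * x), with a_1 = 1 and a_{k+1} = 1 + 2*a_k.
a = 1.0
for k in range(1, 8):
    for x in (0.0, 1e-3, 1e-1):
        print(k, x, min(1.0, a * x))
    a = 1.0 + 2.0 * a
# At x = 0 the iterates stay at 0 for every k, although J*(0) = 1:
# value iteration fails to converge to J* at x = 0, while for any x > 0
# the iterates reach the value 1 once a_k * x exceeds 1.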
S(k) = {J | T k J¯ ≤ J ≤ J * }, k = 0, 1, . . . .
Proof: We have
where the first, second, and third inequalities hold because the assumption
J0 ≥ T J0 = Tµ0 J0 implies that Tµ0^ℓ J0 ≥ Tµ0^{ℓ+1} J0 for all ℓ ≥ 0. Continuing
similarly we obtain
Jk ≥ T Jk ≥ Jk+1 , ∀ k ≥ 0. (4.20)
† As with all PI algorithms in this book, we assume that the policy im-
provement operation is well-defined, in the sense that there exists µk such that
Tµk Jk = T Jk for all k.
where the last equality follows from the fact T J * = J * (cf. Prop. 4.3.6),
thus completing the induction. Thus, by combining the preceding two
relations, we have
Jk ≥ T Jk ≥ Jk+1 ≥ J * , ∀ k ≥ 0. (4.22)
T k J0 ≥ Jk ≥ J * , ∀ k ≥ 0. (4.23)
T k+1 J0 ≥ T Jk ≥ Jk+1 ≥ J * ,
where for any policy µ and scalar λ ∈ (0, 1), Tµ^{(λ)} is the mapping defined
by

    Tµ^{(λ)} J = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ Tµ^{ℓ+1} J.
Here we assume that Tµ maps R(X) to R(X), and that for all µ ∈ M
and J ∈ R(X), the limit of the series above is well-defined as a function in
R(X).
We also assume a linearity property for Tµ, whereby we have

    Tµ^{(λ)}(Tµ J) = Tµ(Tµ^{(λ)} J),    ∀ µ ∈ M, J ∈ R(X).    (4.25)

    J0 ≥ T J0 = Tµ0 J0 ≥ Tµ0^{(λ)} J0 = J1 ≥ Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ Tµ1^{(λ)} J1 = J2,
where for the third inequality, we use the relation J0 ≥ Tµ0 J0 , the definition
of J1 , and the assumption (4.25). Continuing in the same manner,
Jk ≥ T Jk ≥ Jk+1 , ∀ k ≥ 0.
[cf. the induction step of Eq. (4.21)]. By combining the preceding two
relations, we obtain Eq. (4.22), and the proof is completed by using the
argument following that equation. Q.E.D.
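The following Python sketch specializes the λ-PI iteration (4.24) to a finite-state discounted MDP, for which Tµ J = gµ + αPµ J. Under this assumption, Jk+1 = Tµk^{(λ)} Jk is the unique solution of the linear system (I − λαPµk)J = gµk + (1 − λ)αPµk Jk; this closed form is a consequence of the series defining Tµ^{(λ)} and is an assumption of the sketch, as is the hypothetical model data.

import numpy as np

alpha, lam = 0.95, 0.7
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[x, u, y], hypothetical
g = np.array([[1.0, 4.0], [3.0, 0.5]])     # g[x, u], hypothetical
n = 2

J = np.zeros(n)
for k in range(50):
    # Policy improvement: T_{mu_k} J_k = T J_k.
    Q = g + alpha * P @ J
    mu = Q.argmin(axis=1)
    P_mu, g_mu = P[np.arange(n), mu], g[np.arange(n), mu]
    # Policy evaluation: J_{k+1} = T_{mu_k}^{(lambda)} J_k, via one linear solve.
    J = np.linalg.solve(np.eye(n) - lam * alpha * P_mu,
                        g_mu + (1 - lam) * alpha * P_mu @ J)

# Compare with the fixed point of T computed by plain value iteration.
J_star = np.zeros(n)
for _ in range(5000):
    J_star = (g + alpha * P @ J_star).min(axis=1)
print(J, J_star)   # the two agree to within a small tolerance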
The λ-PI algorithm has a useful property, which involves the mapping
Wk : R(X) !→ R(X) given by
Consider the SSP problem of Example 1.2.6 with states 1, . . . , n, plus the
termination state 0. For all u ∈ U (x), the state following x is y with prob-
ability pxy (u) and the expected cost incurred is nonpositive. This problem
arises when we wish to maximize nonnegative rewards up to termination. It
includes the classical search problem of Section 3.2.2, where the aim, roughly
speaking, is to move through the state space looking for states with favor-
able termination rewards. We have noted earlier the difficulties regarding PI
for this type of problem. In particular, case (a) of the three-node shortest
path problem of Section 3.1.2 provided an example where nonoptimistic PI
oscillates between an optimal and a nonoptimal policy (see also Exercise 4.8).
The PI method with perturbations of Section 3.3.3 applies only under the
conditions of that section. Here we will consider instead the λ-PI method,
which applies under the different conditions of Assumption D.
We view the problem within our abstract framework with J¯(x) ≡ 0 and
Tµ J = gµ + Pµ J, (4.27)
Consider the λ-PI method (4.24), with Jk+1 computed by solving the
fixed point equation J = Wk J, cf. Eq. (4.26). This is a nonsingular n-
dimensional system of linear equations, and can be solved by matrix inversion,
just like in exact PI for discounted n-state MDP. In particular, using Eqs.
(4.26) and (4.27), we have
Jµ = Tµ Jµ ≥ T Jµ = Tµ̄ Jµ ,
Jµ ≥ T Jµ ≥ Jµ̄ . (4.29)
    S ⊂ {J ∈ E(X) | J ≥ J̄},    J̄ ∈ S.    (4.30)
The further specification of S will be left open for the moment, but we will
see that its choice is important for the application of the following analysis
in specific contexts. We will use the notion of S-regularity as defined in
Section 3.1.1 (cf. Def. 3.1.1), i.e., µ is called S-regular if Jµ is the unique
fixed point of Tµ within S, and
Tµk J → Jµ , ∀ J ∈ S. (4.31)
Proof: (a), (b) In view of the existence of an optimal S-regular policy and
the fixed point property J * = T J * (cf. Prop. 4.3.3), Prop. 3.1.2 applies,
and part (b) follows. Also by Prop. 3.1.2, J * is the unique fixed point of
T in the set {J | J ≥ J * }, while by Prop. 4.3.3, there are no fixed points
of T that belong to S and are outside this set. This proves part (a).
Tµk∗ J ≥ T k J ≥ T k J¯,
where the second inequality holds since J ≥ J¯ for all J ∈ S. Taking the
limit in the preceding relation, and using the fact limk→∞ Tµk∗ J = Jµ∗ = J *
(which holds by the regularity and optimality of µ∗ ), and limk→∞ T k J¯ = J *
(which holds by assumption), we obtain limk→∞ T k J = J * . Q.E.D.
Policy Iteration
Thanks to the cost improvement property [cf. Eq. (4.29)], the sequence
{µk } (regardless of whether it consists of S-regular policies) satisfies
Proof: From Eq. (4.33), we have Jµk ≥ T Jµk for all k, and by taking the
limit as k → ∞,
J∞ ≥ lim T Jµk ≥ T J∞ , (4.35)
k→∞
where the second inequality follows from the fact Jµk ≥ J∞ . We also have
for all x ∈ X and u ∈ U (x),
where the first equality follows from Eq. (4.34) and the second equality
follows from Eq. (4.33). By taking the infimum of the left-hand side over
u ∈ U (x), we obtain T J∞ ≥ J∞ , which combined with Eq. (4.35), yields
J∞ = T J∞ . Since there exists an optimal S-regular policy and J∞ ∈ S, we
have J∞ = J * in view of the uniqueness property shown in Prop. 4.4.1(a).
Q.E.D.
    S = {J ∈ R(X) | J ≥ J̄}    or    S = {J ∈ B(X) | J ≥ J̄},
without affecting other assumptions, where R(X) [or B(X)] are the sets of
functions on X that are real-valued (or bounded with respect to a weighted
sup-norm, respectively). Note that this condition requires that the PI algo-
rithm works as described for some initial policy µ0 (not all initial policies).
This provides flexibility in selecting a suitable µ0 .
    Tµk Jk = T Jk,    Jk+1 = Tµk^{mk} Jk,    k = 0, 1, . . . ,    (4.36)

    Jk ≥ T Jk ≥ Jk+1,    k = 0, 1, . . . .    (4.37)
Proof: We have shown that Jk ≥ T Jk for all k ≥ 0 [cf. Eq. (4.37)]. The
result follows by taking the limit in this equation to obtain
J∞ ≥ lim T Jk ≥ T J∞ ,
k→∞
[cf. Eq. (4.35)], and then by repeating the corresponding part of the proof
of Prop. 4.4.2. Q.E.D.
Note that a similar result may be shown when the λ-PI method is
used in place of the optimistic PI method (4.36); cf. Prop. 4.3.16.
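For concreteness, the following Python sketch runs the optimistic PI iteration (4.36), with Tµk Jk = T Jk followed by Jk+1 = Tµk^{mk} Jk, on a hypothetical finite-state discounted MDP; the initialization is chosen so that J0 ≥ T J0, and mk is held fixed for simplicity.

import numpy as np

alpha, m = 0.9, 5                            # m = m_k, held fixed here
P = np.array([[[1.0, 0.0], [0.3, 0.7]],
              [[0.6, 0.4], [0.0, 1.0]]])     # P[x, u, y], hypothetical
g = np.array([[2.0, 1.0], [0.5, 3.0]])       # g[x, u], hypothetical
n = 2

def T(J):
    return (g + alpha * P @ J).min(axis=1)

J = np.full(n, 100.0)                        # large J0, so that J0 >= T J0
for k in range(100):
    mu = (g + alpha * P @ J).argmin(axis=1)  # T_{mu_k} J_k = T J_k
    P_mu, g_mu = P[np.arange(n), mu], g[np.arange(n), mu]
    for _ in range(m):                       # J_{k+1} = T_{mu_k}^m J_k
        J = g_mu + alpha * P_mu @ J

print(J)    # converges to J*, consistently with J_k >= T J_k >= J_{k+1} >= J*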
The key to applying Props. 4.4.1 and 4.4.2 is to choose the set S in a
way that there exists an optimal S-regular policy, and the other required
assumptions are satisfied. We describe a few applications.
A context where the preceding analysis applies is SSP problems with count-
able state space, and nonnegative cost per stage, where with the choice
J̄(x) ≡ 0, the monotone increase condition

    J̄(x) ≤ H(x, u, J̄),    ∀ x ∈ X, u ∈ U(x),

    S = {J ∈ R(X) | J ≥ J̄}    or    S = {J ∈ B(X) | J ≥ J̄}.
α ∈ (0, 1), and J̄(x) ≡ 0. Let

    S = {J ∈ B(X) | J ≥ J̄}.
Note that X need not be a finite set, and that the functions in S need not have
finite unweighted sup-norm. Assume that there is a set of policies M̂ ⊂ M
such that:
(a) The function E{g(x, µ(x), w)} belongs to S for all µ ∈ M̂.
(b) The function E{J(f(x, µ(x), w))} belongs to S for all µ ∈ M̂ and
    J ∈ S.
(c) For each J ∈ S there exists µ ∈ M̂ such that Tµ J = T J.
(d) There exists an optimal policy within M̂.
It can be seen that S is a closed subset of B(X), and that Tµ maps S
into S and is a contraction with modulus α within S for all µ ∈ M̂. † This

† To see this, note that by Eq. (4.38), we have for all J, J′ ∈ B(X) and x ∈ X,

    |(Tµ J)(x) − (Tµ J′)(x)| / v(x) ≤ α sup_{y∈X} |J(y) − J′(y)| / v(y) = α ‖J − J′‖,
implies that all policies in M̂ are S-regular, since Jµ is the unique fixed point
of Tµ within S, and Tµ^k J converges to the fixed point Jµ for all J ∈ S.
In view of property (c) above, there exists µ ∈ M̂ such that Tµ J* =
T J*, which by Prop. 4.3.9, is optimal. Thus there exists an optimal S-regular
policy, and it may be verified that Prop. 4.4.1 applies. It follows from Prop.
4.4.1(a) that J ∗ is the unique fixed point of T within S. By Prop. 4.4.1(b),
J ∗ can be obtained in the limit by the VI algorithm starting with J ∈ S such
that J ≥ J ∗ [also starting with J ∈ S such that J ≥ J̄, if we can show that
T k J¯ → J ∗ (e.g., if the control space is finite); cf. Prop. 4.4.1(c)]. By Prop.
4.4.2, J ∗ can be obtained in the limit by the PI algorithm, which starts from
a policy µ0 ∈ M̂ and generates exclusively policies in M̂ [this is possible in
view of property (c) above], provided condition (2) of the proposition can be
verified.
where xk ∈ "n , uk ∈ "m for all k, and A and B are given matrices.
We assume that the random disturbances wk are independent identically
distributed with zero mean and finite second moment. The cost function
of a policy π = {µ0 , µ1 , . . .} has the form
    Jπ(x0) = lim_{N→∞} E{ Σ_{k=0}^{N−1} α^k ( xk′ Q xk + µk(xk)′ R µk(xk) ) },
The results just stated require that the pairs (A, B) and (A, C), where
Q = C ! C, are controllable and observable, respectively (see [Ber12a] for a
definition of these terms and corresponding proofs).
We now discuss how the problem can be solved by PI based on the the-
ory of this section. We define S to be the set of positive definite quadratic
functions plus a nonnegative constant,
! "
S = J (x) = x! W x + γ | W : positive definite symmetric, γ ≥ 0
It is easily seen that the continuity condition (4.34) of Prop. 4.4.2 is satis-
fied.
Consider the set M̂ ⊂ M of all linear controllers µ that stabilize the
system, i.e.,

    µ(x) = Lx,    ∀ x ∈ ℜ^n,
where L is an m × n matrix such that the matrix
A + BL
has eigenvalues strictly within the unit circle. We call such a controller
linear-stable. It is well-known that under the controllability and observ-
ability assumption noted earlier, there exists a linear-stable controller that
is optimal (see e.g., [Ber12a]). Moreover, from the definition of regularity
(4.31), it can be seen that a linear-stable controller µ is S-regular.
Consider now the PI algorithm of this section, starting with a linear-
stable controller µ0 (x) = L0 x. Then the cost corresponding to µ0 has the
form
    Jµ0(x) = x′K0 x + nonnegative constant,

    min_u { u′Ru + α(Ax + Bu)′K0(Ax + Bu) }.
as can be verified from Eq. (4.39). This equation can be solved with efficient
specialized methods, for which we refer to the literature.
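A simple (not specialized or efficient) Python sketch of this PI iteration is given below. Policy evaluation computes K by iterating the equation K = Q + L′RL + α(A + BL)′K(A + BL) to convergence, and policy improvement performs the quadratic minimization displayed above, giving L = −α(R + αB′KB)⁻¹B′KA. These specific update forms, the hypothetical system matrices, and the linear-stable starting controller µ0(x) = L0x are assumptions of the sketch, since Eq. (4.39) and its derivation are not reproduced here.

import numpy as np

alpha = 0.9
A, B = np.array([[1.2]]), np.array([[1.0]])     # hypothetical scalar system
Q, R = np.array([[1.0]]), np.array([[1.0]])
L = np.array([[-1.0]])                          # L0: A + B L0 = 0.2 is stable

for k in range(20):
    # Policy evaluation: iterate the Lyapunov-type equation to convergence.
    M = A + B @ L
    K = np.zeros_like(Q)
    for _ in range(5000):
        K = Q + L.T @ R @ L + alpha * M.T @ K @ M
    # Policy improvement: minimize u'Ru + alpha (Ax+Bu)'K(Ax+Bu) over u.
    L = -alpha * np.linalg.solve(R + alpha * B.T @ K @ B, B.T @ K @ A)

print(K, A + B @ L)   # K approximates the optimal quadratic cost coefficient;
                      # the closed-loop matrix stays strictly inside the unit circle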
Tµ J = Aµ J + bµ , (4.40)
where for each µ, bµ is a given function in R+ (X), the set of all nonnegative
real-valued functions on X, and Aµ : E + (X) !→ E + (X) is a given mapping,
where E + (X) is the set of all nonnegative extended real-valued functions
on X. We assume that Aµ has the “linearity” property
J, J ! ∈ E + (X), J ≤ J ! ⇒ Aµ J ≤ Aµ J ! , Tµ J ≤ Tµ J ! .
We now consider the affine monotonic model without assuming the mono-
tone increase condition T J¯ ≥ J¯. We will use the approach of Section 3.2.1,
Letting J = 2J̄ and using the fact Aµ^k(2J̄) = 2Aµ^k J̄ [cf. Eq. (4.41)], we see
that Aµ^k J̄ → 0. It follows that µ is S-regular if and only if

    lim_{k→∞} Aµ^k J = 0,  ∀ J ∈ S,    and    Σ_{m=0}^∞ Aµ^m bµ ∈ S.    (4.45)
and part (f) will be assumed in the proposition that follows. The com-
pactness condition of Assumption 3.2.1(d) and the technical condition of
Assumption 3.2.1(e) are needed, and they will be assumed.
The critical part of Assumption 3.2.1 is (c), which requires that for
each S-irregular policy µ and each J ∈ S, there is at least one state x ∈ X
such that
    lim sup_{k→∞} (Tµ^k J)(x) = lim sup_{k→∞} (Aµ^k J)(x) + Σ_{m=0}^∞ (Aµ^m bµ)(x) = ∞.
This part is satisfied if and only if for each S-irregular µ and J ∈ S, there
is at least one x ∈ X such that
    lim sup_{k→∞} (Aµ^k J)(x) = ∞    or    Σ_{m=0}^∞ (Aµ^m bµ)(x) = ∞.    (4.46)
belongs to S.
(4) The control set U is a metric space, and the set
" #
u ∈ U (x) | H(x, u, J ) ≤ λ
(6) In the case where S = Rp+(X), for each function J ∈ S, there
    exists a function J′ ∈ S such that J′ ≤ J and J′ ≤ T J′.
Then:
(a) The optimal cost function J * is the unique fixed point of T within
S.
Note the difference between Props. 4.5.1 and 4.5.2: in the former,
the uniqueness of fixed point of T is guaranteed within a smaller set of
functions when J¯ ∈ Rp+ (X). Similarly, the convergence of VI is guaranteed
from within a smaller range of starting functions when J¯ ∈ Rp+ (X).
We will now apply the analysis of the affine monotonic model to SSP prob-
lems with an exponential cost function, which is introduced to incorporate
risk sensitivity in the control selection process.
Consider an SSP problem with finite state and control spaces, transi-
tion probabilities pxy (u), and real-valued transition costs h(x, u, y). State 0
is a termination state, which is cost-free and absorbing. Instead of the stan-
dard additive cost function (cf. Example 1.2.6), we consider an exponential
cost function of the form
Tµ given by
! " # " #
(Tµ J )(x) = pxy µ(x) exp h(x, µ(x), y) J (y)
y∈X
" # " #
+ px0 µ(x) exp h(x, µ(x), 0) , x ∈ X,
(4.47)
[cf. Eq. (4.43)]. Here Aµ and bµ have components
" # " #
Aµ (x, y) = pxy µ(x) exp h(x, µ(x), y) , (4.48)
" # " #
bµ (x) = px0 µ(x) exp h(x, µ(x), 0) . (4.49)
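To illustrate Eqs. (4.47)-(4.49), the following Python sketch builds Aµ and bµ for a fixed policy µ from hypothetical transition probabilities and costs h, and, when Aµ has spectral radius less than one, evaluates Jµ = Σ_{m=0}^∞ Aµ^m bµ by solving (I − Aµ)Jµ = bµ.

import numpy as np

# Two non-termination states {1, 2}; state 0 is the termination state.
# p[x, y] = p_{xy}(mu(x)) for y in {1, 2}; p0[x] = p_{x0}(mu(x)).
p  = np.array([[0.0, 0.6], [0.5, 0.0]])     # hypothetical
p0 = np.array([0.4, 0.5])
h  = np.array([[0.0, 1.0], [-0.5, 0.0]])    # h(x, mu(x), y), hypothetical
h0 = np.array([0.2, -1.0])                  # h(x, mu(x), 0)

A_mu = p * np.exp(h)                        # A_mu(x, y), cf. Eq. (4.48)
b_mu = p0 * np.exp(h0)                      # b_mu(x),    cf. Eq. (4.49)

rho = max(abs(np.linalg.eigvals(A_mu)))
if rho < 1:
    # sum_{m >= 0} A_mu^m b_mu converges; mu behaves as an S-regular policy here.
    J_mu = np.linalg.solve(np.eye(2) - A_mu, b_mu)
    print(rho, J_mu)
else:
    print("spectral radius >= 1: the series sum_m A_mu^m b_mu need not converge")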
Note that there is a distinction between S-irregular policies and im-
proper policies (the ones that never terminate). In particular, there may
exist improper policies, which are S-regular because they can generate
some negative transition costs h(x, u, y), which make Aµ contractive [cf.
Eq. (4.47)]. Similarly, there may exist proper policies (i.e., terminate with
probability one), which are S-irregular because for the corresponding Aµ
and bµ we have Σ_{m=0}^∞ (Aµ^m bµ)(x) = ∞ for some x.
We may consider the two cases where the condition T J¯ ≥ J¯ holds (cf.
Section 4.5.1) and where it does not (cf. Section 4.5.2), as well as a third
case where none of these conditions applies, but the perturbation-based
theory of Section 3.2.2 or the contractive theory of Chapter 2 can be used.
Consider first the case where T J¯ ≥ J¯. An example is when
h(x, u, y) ≥ 0, ∀ x, y ∈ X, u ∈ U (x),
so that from Eq. (4.47), we have exp( h(x, u, y) ) ≥ 1, and since J̄ = e, it
Consider an SSP problem where there are two controls at each x: stop, in
which case we move to the termination state 0 with a cost s(x), and continue,
in which case we move to a state y, with given transition probabilities pxy [at
no cost if y != 0 and a cost s̄(x) if y = 0]. The mapping H has the form
' " #
exp s(x) if u = stop,
H(x, u, J) = $ " #
y∈X
pxy J(y) + px0 exp s̄(x) if u = continue,
and J̄ is the unit function e. Here the stopping cost s(x) is often naturally
negative for some x (this is true for example in search problems of the type
discussed in Example 3.2.1), so the condition T J̄ ≥ J̄ can be written as
    min{ exp( s(x) ), Σ_{y∈X} pxy + px0 exp( s̄(x) ) } ≥ 1,    ∀ x ∈ X,

and is violated when s(x) < 0 for some x.
When the condition T J¯ ≥ J¯ does not hold, we may use the analysis
of Section 4.5.2, under the conditions of Prop. 4.5.2, chief among which
is that an S-regular policy exists, and for every S-irregular policy µ and
J ∈ S, there exists x ∈ X such that
    lim sup_{k→∞} (Aµ^k J)(x) = ∞    or    Σ_{m=0}^∞ (Aµ^m bµ)(x) = ∞,
where Aµ and bµ are given by Eqs. (4.48), (4.49) [cf. Eq. (4.46)], and
S = R+ (X) or S = Rp+ (X) or S = Rb+ (X).
If these conditions do not hold, we may also use the approach of
Section 3.2.2, which is based on adding a perturbation δ to bµ . We assume
that the optimal cost function Jδ* of the δ-perturbed problem is a fixed
point of the mapping Tδ given by
    (Tδ J)(x) = min_{u∈U(x)} [ Σ_{y∈X} pxy(u) exp( h(x, u, y) ) J(y)
                 + px0(u) exp( h(x, u, 0) ) ] + δ,    x ∈ X,
Under these conditions, the Bellman equation is

    J(1) = min{ exp(b), exp(a) J(2) },
    J(2) = exp(a) J(1).
Figure 4.5.1. Shortest path problem with exponential cost function. The
cost that is exponentiated is shown next to each arc.
Consider the context of the three-node shortest path problem of Section 3.1.2,
but with the exponential cost function of the present subsection (see Fig.
4.5.1). Here the DP model has two states: x = 1, 2. There are two policies
denoted µ and µ̄: the 1st policy is 2 → 1 → 0, while the 2nd policy is
2 → 1 → 2. The corresponding mappings Tµ and Tµ̄ are given by
and !" #k
exp(a) J(1) if k is even,
(Tµk J)(1) = " #k
exp(a) J(2) if k is odd,
!" #k
exp(a) J(1) if k is odd,
(Tµk J)(2) = " #k
exp(a) J(2) if k is even.
and contains vectors J from the range J > J ∗ as well as from the range
J < J ∗ (however, J ∗ = e is the “smallest” fixed point with the property
J ≥ J¯ = e, consistently with Prop. 4.3.3).
(c) Case a = 0 and b = 0: Here µ and µ̄ are both optimal, and the results
    of Prop. 4.5.1 apply with S = {J | J ≥ J* = J̄ = e}. However, the
    assumptions of Prop. 4.5.2 are violated, and indeed T has multiple fixed
    points within both Rp+(X) and (a fortiori) R+(X); the set of its fixed
    points is

        {J | J ≤ e, J(1) = J(2)}.
(d) Case a = 0 and b < 0: Here the regular policy µ is optimal. However,
    the assumptions of Props. 4.5.1 and 4.5.2 are violated. On the other
    hand, Prop. 3.2.3 applies with S = {J | J ≥ J*}, so T has a unique
    fixed point within S, while value iteration converges to J* starting from
    within S. Here again T has multiple fixed points within Rp+(X) and (a
    fortiori) R+(X); the set of its fixed points is

        {J | J ≤ exp(b) e, J(1) = J(2)}.
(e) Case a < 0: Here µ̄ is optimal and also R+(X)-regular [but not Rp+(X)-
    regular, since Jµ̄ = 0 ∉ Rp+(X)]. However, the assumptions of Prop.
4.5.1, and Prop. 4.5.2 with both S = R+(X) and S = Rp+(X) = Rb+(X)
are violated. Still, however, our analysis applies and in a stronger form,
because both Tµ and Tµ̄ are contractions. Thus we are dealing with a
contractive model for which the results of Chapter 2 apply (J* = 0 is
the unique fixed point of T over the entire space ℜ², and value iteration
converges to J* starting from any J ∈ ℜ²).
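The case analysis above can be explored numerically. The following Python sketch applies value iteration with the mapping implied by the Bellman equation of Fig. 4.5.1, namely T J = (min{exp(b), exp(a)J(2)}, exp(a)J(1)), starting from J̄ = e; the particular values of a and b are illustrative choices.

import math

def T(J, a, b):
    return (min(math.exp(b), math.exp(a) * J[1]), math.exp(a) * J[0])

for a, b in [(0.5, 0.5), (0.0, -1.0), (-0.5, 1.0)]:   # sample cases
    J = (1.0, 1.0)                                     # Jbar = e
    for _ in range(200):
        J = T(J, a, b)
    print(a, b, J)
# a > 0:        iterates settle at (exp(b), exp(a + b)) = J_mu = J*
# a = 0, b < 0: iterates settle at exp(b) * e = J*        (cf. case (d))
# a < 0:        iterates shrink to (0, 0) = J_mubar = J*  (cf. case (e))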
variety of special cases that we have analyzed. Thus the analysis of the
special structure of the given model becomes important. This analysis may
be guided by the general ideas and results of the semicontractive theory
that we have developed.
implies Eq. (4.50) (see [Ber77], Prop. 12, or [BeS78], Prop. 5.10).
Optimistic PI and λ-PI have not been considered earlier under As-
sumption D, and the corresponding analysis of Section 4.3.3 is new. See
[BeI96], [ThS10a], [ThS10b], [Ber11b], [Sch11] for analyses of λ-PI for dis-
counted and SSP problems. A related extension, called Λ-PI, is discussed
in Yu and Bertsekas [YuB12], Section 5.
The connection of the monotone increasing model with the semicon-
tractive model, discussed in Section 4.4, and the corresponding PI algo-
rithms are also new. Some finite-state SSP problems with nonpositive
costs may be related to average cost MDP, with possibly multichain poli-
cies, for which there are special PI methods (see [Put94], Section 10.4).
However, these PI methods are more complicated than the ones discussed
here. Linear-quadratic problems have been exhaustively analyzed in the
literature, and the convergence of PI for these problems was first shown by
Kleinman [Kle68].
The monotonic affine models of Sections 4.5.1 and 4.5.2 have not
been considered earlier, to the author’s knowledge. The analysis of the
SSP model with exponential cost of Section 4.5.3 is based on unpublished
collaboration with H. Yu. Two prominent papers for the latter model are
by Denardo and Rothblum [DeR79], and by Patek [Pat01]. The paper
[DeR79] assumes that the state and control spaces are finite, that there
exists at least one S-regular policy (a transient policy in the terminology
of [DeR79]), and that every improper policy is S-irregular. The paper
[Pat01] assumes that the state space is finite, that the control constraint
set is compact, and that the expected one-stage cost is strictly positive for
all state-control pairs.
E X E R C I S E S
available at each state (stop and continue). The state space is X = {1, 2, . . .}.
Continuation from state x leads to state x + 1 with certainty and no cost, while
the stopping cost is −1 + (1/x), so that there is an incentive to delay stopping
at every state. Here for all x, J̄(x) = 0, and
    H(x, u, J) = J(x + 1)      if u = continue,
                 −1 + (1/x)    if u = stop.
Show that J ∗ (x) = −1 for all x, but there is no policy (stationary or not) that
attains the optimal cost starting from x.
For the problem of Exercise 4.1, show that the policy µ that never stops is
nonoptimal and satisfies Tµ J ∗ = T J ∗ .
Let

    X = ℜ,    U(x) ≡ (0, 1],    J̄(x) ≡ 0,

    H(x, u, J) = |x| + J(ux),    ∀ x ∈ X, u ∈ U(x).

Let µ(x) = 1 for all x ∈ X. Then Jµ(x) = ∞ if x ≠ 0 and Jµ(0) = 0. Verify that
Tµ Jµ = T Jµ. Verify also that J*(x) = |x|, and hence µ is not optimal.
(b) Under Assumption D, show that J ∗ is the unique solution of the following
optimization problem in z = (z1 , . . . , zn ):
    maximize Σ_{i=1}^n zi
(a) Assume that there exists a subset M̂ of policies such that for all µ ∈ M̂:

    (1) The sequence G = {G1, G2, . . .}, where

            Gx = g(x, µ(x)),    x ∈ X,

        belongs to S.

    (2) The sequence V = {V1, V2, . . .}, where

            Vx = Σ_{y∈X} pxy(µ(x)) v(y),    x ∈ X,

        belongs to S.

    (3) We have

            Σ_{y∈X} pxy(µ(x)) v(y) / v(x) ≤ 1,    x ∈ X.
and
H(1, u, J) = −u + (1 − u2 )J(1).
From Exercise 3.1 we have J ∗ (1) = −∞ and that there is no optimal stationary
policy. Show that:
(a) The PI algorithm generates a sequence {µk } of policies according to
    Jµk(1) = −1/µk(1),    µk+1(1) = −1/(2 Jµk(1)),
Consider the deterministic stopping problem of Exercise 4.1. Let µ be the policy
that stops at every state.
(a) Show that the next policy µ̄ generated by nonoptimistic PI is to continue
    at every state, and that we have

        Jµ̄(x) = 0 > −1 + 1/x = Jµ(x),    x ∈ X.
Moreover, the method oscillates between the policies µ and µ̄, none of which
is optimal.
(b) The optimistic PI and λ-PI algorithms starting from J0 = J¯ = 0 generate
sequences {Jk } with Jk ↓ J ∗ .
Consider a finite-state version of the problem of Exercises 4.1 and 4.7, where
the state space is X = {1, . . . , n} and there are two controls: stop and continue.
Continuation from states x = 1, . . . , n − 1 leads to state x + 1 at no cost, while
continuation at state x = n keeps the state at n with no cost. The stopping cost
is −1 + (1/x) for all x > 0. Here J̄(x) ≡ 0, and

    H(x, u, J) = J(x + 1)      if x = 1, . . . , n − 1, and u = continue,
                 J(x)          if x = n, and u = continue,
                 −1 + (1/x)    if x = 1, . . . , n, and u = stop.
(a) Show that the pathology of Exercise 4.7(a) still can occur: nonoptimistic
PI can oscillate between the policy µ that stops at every state and the
policy µ̄ that continues at every state, and that none of these two policies
is optimal (at any state except x = n). Note: This can be viewed as an
example of a finite-state SSP problem where there exists an optimal proper
policy, and all improper policies are not optimal (but have finite cost for
all initial states). The latter fact invalidates Assumption 3.2.1 of Chapter
3, and as a result the proof of Prop. 3.2.1 breaks down.
(b) Show that the pathology of part (a) still occurs if µ is replaced by the
optimal policy, which is to stop at x = n and to continue at x ≠ n. Note:
Here we have Tµ̄ J* = T J* even though µ̄ is not optimal.
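The oscillation of part (a) is easy to reproduce numerically. The following Python sketch evaluates each policy exactly and performs policy improvement with ties broken in favor of continuing (the tie-breaking rule is an assumption of the sketch); nonoptimistic PI then alternates between the all-stop and the all-continue policies.

n = 5
STOP, CONT = 0, 1

def evaluate(policy):
    # J_mu(x): follow 'continue' moves until a stop (cost -1 + 1/x_stop) or forever (cost 0).
    J = [0.0] * (n + 1)
    for x in range(n, 0, -1):
        if policy[x] == STOP:
            J[x] = -1.0 + 1.0 / x
        else:
            J[x] = 0.0 if x == n else J[x + 1]
    return J

def improve(J):
    # Break ties in favor of 'continue'.
    new = [None] * (n + 1)
    for x in range(1, n + 1):
        stop_val = -1.0 + 1.0 / x
        cont_val = J[x] if x == n else J[x + 1]
        new[x] = STOP if stop_val < cont_val else CONT
    return new

policy = [None] + [STOP] * n        # mu: stop at every state
for k in range(4):
    J = evaluate(policy)
    print(k, policy[1:], [round(v, 3) for v in J[1:]])
    policy = improve(J)
# The printout alternates between the all-stop policy (cost -1 + 1/x) and the
# all-continue policy (cost 0); neither is optimal, since J*(x) = -1 + 1/n.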
Show that Xk+1 ⊂ Xk for all k, and that X* ⊂ ∩_{k=0}^∞ Xk. Show also that
convergence of VI (i.e., T^k J̄ → J*) is equivalent to X* = ∩_{k=0}^∞ Xk.
or equivalently

    (Tµ J)(x) = E{ g(x, µ(x), w) + α J( f(x, µ(x), w) ) },

    Tµ J ∈ E(X),    ∀ J ∈ E(X), µ ∈ M,

where Tµ is given by

    (Tµ J)(x) = H(x, µ(x), J),    ∀ x ∈ X,

or as

    (T J)(x) = inf_{u∈U(x)} H(x, u, J),    ∀ J ∈ E(X), x ∈ X.    (5.2)
Let us assume that E(X) and M have been defined as described above.
Consider the case where there exists µ ∈ M such that the infimum in Eq.
(5.2) is attained at µ(x) for all x ∈ X, i.e., for all J ∈ E(X) there exists a
µ ∈ M such that
Tµ J = T J. (5.3)
Otherwise stated, for all J ∈ E(X), the minimization in Eq. (5.2) admits
an exact selector µ from within M. Then assuming also that T and Tµ ,
µ ∈ M, preserve membership in E(X), i.e.,
Tµ J ∈ E(X), T J ∈ E(X), ∀ J ∈ E(X), µ ∈ M, (5.4)
a large portion of the analysis of the preceding chapters carries through
verbatim, and much of the remainder can be extended with minimal mod-
ifications.
In particular, in the finite horizon problems of Chapter 4, under this
condition, the condition J¯ ∈ E(X), and the monotonicity assumption
    H(x, u, J) ≤ H(x, u, J′)  for all J, J′ ∈ E(X) with J ≤ J′, x ∈ X, u ∈ U(x),    (5.5)
we have JN* = T^N J̄ and that there exists an N-stage optimal policy. Such
a policy can be obtained via the DP algorithm that starts with the ter-
minal cost function J̄, and sequentially computes T J̄, T² J̄, . . . , T^N J̄, and
corresponding µ*N−1, µ*N−2, . . . , µ*0 ∈ M such that

    Tµ*k T^{N−k−1} J̄ = T^{N−k} J̄,    k = 0, . . . , N − 1,    (5.6)
When the exact selection property (5.3) may not hold, to conduct any kind
of meaningful analysis, it is necessary to adopt a restriction framework for
policies and functions, which guarantees that T J can be approximated
by Tµ J , with appropriate choice of µ. To this end, a seemingly natural
assumption would be that given J ∈ E(X) and ε > 0, there exists an
ε-optimal selector, that is, a µε ∈ M such that

    (Tµε J)(x) ≤ (T J)(x) + ε   if (T J)(x) > −∞,
                 −(1/ε)         if (T J)(x) = −∞,        ∀ x ∈ X.    (5.7)
However, in the Borel space model noted earlier and described in Ap-
pendix C, there is a serious difficulty: if E(X) and M are the spaces of
Borel measurable functions from X to ℜ*, and from X to U, respectively,
there need not exist an ε-optimal selector. For this reason, Borel measura-
bility of cost functions and policies is not the most appropriate probabilistic
framework for stochastic optimal control problems. † Instead, in the most
general framework for bypassing this difficulty, it is necessary to consider
a different kind of measurability, which is described in Appendix C. In this
framework:
(a) E(X) is taken to be the class of universally measurable functions from
X to &* .
(b) M is taken to be the class of universally measurable functions from
X to U .
(c) g, f , and the probability space of w must satisfy certain Borel mea-
surability conditions.
A key fact is that an ε-selection property holds, whereby there exists a
µε ∈ M such that

    (Tµε J)(x) ≤ (T J)(x) + ε   if (T J)(x) > −∞,
                 −(1/ε)         if (T J)(x) = −∞,        ∀ x ∈ X,    (5.8)
† There have been efforts to address the lack of an ε-optimal selector within
the Borel measurability framework using the concept of a "p-ε-optimal selector,"
whereby the concept of ε-optimal selection is modified to hold over a set of
states that has p-measure 1, with p being any chosen probability measure over X
(see [Str66], [Str75], [DyY79]). This leads to a theory based on p-ε-optimal and
p-optimal policies, i.e., policies that depend on the choice of p and are optimal
only for states in a subset of X that has p-measure 1 (rather than over all states
as in our case). It seems difficult to extend the abstract framework of this book
based on this inherently probabilistic viewpoint. For a related discussion and a
comparison of the p-ε-optimal approach with ours, we refer to [BeS78].
! "
U (x) = µ(x) | µ ∈ M , ∀ x ∈ X, (5.9)
Ê(X) ⊂ E(X),
Assumption 5.1.1:
(a) For each sequence {Jm } ⊂ E(X) with Jm → J , we have J ∈
E(X), and for each sequence {Jm } ⊂ Ê(X) with Jm → J , we
have J ∈ Ê(X).
(b) For all r ∈ $, we have
J ∈ E(X) ⇒ J + r e ∈ E(X),
and
J ∈ Ê(X) ⇒ J + r e ∈ Ê(X),
where e is the unit function, e(x) ≡ 1.
or in shorthand
    T J = inf_{µ∈M} Tµ J.
The sets M, E(X), and Ê(X), and the mappings Tµ and T are
assumed to satisfy the following.
Assumption 5.1.3:
(a) For all µ ∈ M, we have
J ∈ E(X) ⇒ Tµ J ∈ E(X).
(b) We have
J ∈ Ê(X) ⇒ T J ∈ Ê(X).
Problem Formulation
minimize Jπ (x)
(5.13)
subject to π ∈ Π.
    JN*(x) = inf_{π∈Π} JN,π(x),    J*(x) = inf_{π∈Π} Jπ(x),    ∀ x ∈ X.

    JN,π*(x) = JN*(x),    ∀ x ∈ X,

    Jπε(x) ≤ JN*(x) + ε   if JN*(x) > −∞,
             −(1/ε)        if JN*(x) = −∞,

    Jπε(x) ≤ J*(x) + ε   if J*(x) > −∞,
             −(1/ε)       if J*(x) = −∞.
Note that since J¯ ∈ Ê(X), the function T k J¯ belongs to Ê(X) for all k
[cf. Assumption 5.1.3(b)]. Similar to Chapter 4, we will aim to show under
various assumptions that JN * = T N J¯, and that J * ∈ Ê(X) and J * = T J * .
We have the following proposition, which extends Prop. 4.2.3 to the re-
stricted policies framework.
    JN* = T^N J̄.
= ···
= T N J¯,
where the last equality is obtained by repeating the process used to obtain
the previous equalities. On the other hand, it is clear from the definitions
that T^N J̄ ≤ JN,π for all N and π ∈ Π, so that T^N J̄ ≤ JN*. Thus, JN* =
T^N J̄. Q.E.D.
Moreover, there exists a scalar α ∈ (0, ∞) such that for all scalars
r ∈ (0, ∞) and functions J ∈ E(X), we have
πε ∈ Π with Jk,πε ≤ Jk* + ε e. Using Eq. (5.16), we have for all µ ∈ M,

    J*k+1 ≤ Tµ Jk,πε ≤ Tµ Jk* + αε e.

Taking the infimum over µ and then the limit as ε → 0, we obtain J*k+1 ≤
T Jk*. By using the induction hypothesis, it follows that J*k+1 ≤ T^{k+1} J̄. On
the other hand, we have clearly T^{k+1} J̄ ≤ J*k+1, and hence T^{k+1} J̄ = J*k+1.
Using the assumption Jk*(x) > −∞ for all x ∈ X, for any ε > 0, we
can choose π = {µ0, µ1, . . .} ∈ Π such that

    Jk,π ≤ Jk* + (ε/2α) e,

    Jk+1,π′ = Tµ Jk,π ≤ Tµ Jk* + (ε/2) e ≤ T Jk* + ε e = J*k+1 + ε e,

where the first inequality is obtained by using Eq. (5.16). The induction is
complete. Q.E.D.
minimize Jπ (x)
(5.17)
subject to π ∈ Π,
v(x) > 0, ∀ x ∈ X,
    lim_{k→∞} Tµ^k J = Jµ.

    lim_{k→∞} T^k J = J*.
T µ J * = T µ Jµ = Jµ = J * = T J * .
Part (f) follows similarly, using the proof of Prop. 2.1.2. Q.E.D.
We will now apply the preceding analysis to models where the set of policies
is restricted in order to address measurability issues. The Borel space
model is the most general such model, and we will focus on it. Appendix
C provides a motivation and an outline of the model for finite horizon
problems, including the associated mathematical definitions, some basic
results, and a two-stage example. In this section we will provide a brief
discussion of an infinite horizon contractive model.
We consider the mapping H defined by
    H(x, u, J) = g(x, u) + α ∫ J(y) p(dy | x, u).    (5.19)
Here X and U are Borel spaces, α is a scalar in (0, 1], J is an extended real-
valued functions on X, and p(dy | x, u) is a transition probability measure
for each x and u ∈ U (x). To make mathematical sense of the expression
in the right-hand side of Eq. (5.19), J must satisfy certain measurability
restrictions, so we assume that g is Borel measurable and that p(dy | x, u)
is a Borel measurable stochastic kernel. We let
Then assuming that J ∈ Ê(X), that g(x, u) and p(dy | x, u) are Borel
measurable, and that the set
! "
(x, u) | u ∈ U (x)
The restricted model framework of this chapter was treated briefly in the
book [BeS78] (Chapter 6). This book focused far more extensively on the
classical type of stochastic optimal control problems (cf. Example 1.2.1),
rather than the more general abstract restricted model case.
The restricted policies framework may also be applied to the so-called
semicontinuous models similar to how it was applied to Borel models in Sec-
tion 5.4. The semicontinuous models provide more powerful results regard-
ing the character of the cost functions, but require additional assumptions,
which may be restrictive, namely that the cost function g and the stochastic
kernel p in Eq. (5.19) have certain upper or lower semicontinuity properties.
The relevant mathematical background is given in Section 7.5 of [BeS78],
and the critical selection theorems (with Borel measurable selection) are
given in Props. 7.33 and 7.34 of that reference. Detailed related references
to the literature may also be found in [BeS78].
APPENDIX A:
Notation and Mathematical
Conventions
In this appendix we collect our notation, and some related mathematical
facts and conventions.
    ℜ* = ℜ ∪ {∞, −∞}.
We write −∞ < x < ∞ for all real numbers x, and −∞ ≤ x ≤ ∞ for all
extended real numbers x. We denote by [a, b] the set of (possibly extended)
real numbers x satisfying a ≤ x ≤ b. A rounded, instead of square, bracket
denotes strict inequality in the definition. Thus (a, b], [a, b), and (a, b)
denote the set of all x satisfying a < x ≤ b, a ≤ x < b, and a < x < b,
respectively.
Generally, we adopt standard conventions regarding addition and
multiplication in %* , except that we take
∞ − ∞ = −∞ + ∞ = ∞,
    α + ∞ = ∞ + α = ∞,    ∀ α ∈ ℜ*,

Under these rules, the following laws of arithmetic are still valid within ℜ*:
α1 + α2 = α2 + α1 , (α1 + α2 ) + α3 = α1 + (α2 + α3 ),
We also have
α(α1 + α2 ) = αα1 + αα2
A.2 FUNCTIONS
Sequences of Functions
Expected Values
In this way, taking also into account the rule ∞ − ∞ = ∞, the expected
value E{w} is well-defined if Ω is finite or countably infinite. In more gen-
eral cases, E{w} is similarly defined by the appropriate form of integration,
as will be discussed in more detail at specific points as needed.
APPENDIX B:
Contraction Mappings
    ‖F y − F z‖ ≤ ρ ‖y − z‖,    ∀ y, z ∈ Ȳ.

    F y = b + Ay,    ∀ y ∈ ℜ^n.

    ‖Ay‖s ≤ (σ(A) + ε) ‖y‖s,    (B.1)
Thus, if σ(A) < 1 we may select ε > 0 such that ρ = σ(A) + ε < 1, and obtain
the contraction relation

    ‖F y − F z‖s = ‖A(y − z)‖s ≤ ρ ‖y − z‖s,    ∀ y, z ∈ ℜ^n.    (B.2)
† We may show Eq. (B.1) by using the Jordan canonical form of A, which is
denoted by J. In particular, if P is a nonsingular matrix such that P⁻¹AP = J
and D is the diagonal matrix with 1, δ, . . . , δ^{n−1} along the diagonal, where δ > 0,
it is straightforward to verify that D⁻¹P⁻¹APD = Ĵ, where Ĵ is the matrix
that is identical to J except that each nonzero off-diagonal term is replaced by δ.
Defining P̂ = PD, we have A = P̂ Ĵ P̂⁻¹. Now if ‖·‖ is the standard Euclidean
norm, we note that for some β > 0, we have ‖Ĵz‖ ≤ (σ(A) + βδ)‖z‖ for all
z ∈ ℜ^n and δ ∈ (0, 1]. For a given δ ∈ (0, 1], consider the weighted Euclidean
norm ‖·‖s defined by ‖y‖s = ‖P̂⁻¹y‖. Then we have for all y ∈ ℜ^n,

    ‖Ay‖s = ‖P̂⁻¹Ay‖ = ‖P̂⁻¹P̂ĴP̂⁻¹y‖ = ‖ĴP̂⁻¹y‖ ≤ (σ(A) + βδ)‖P̂⁻¹y‖,

so that ‖Ay‖s ≤ (σ(A) + βδ)‖y‖s, for all y ∈ ℜ^n. For a given ε > 0, we choose
δ = ε/β, so the preceding relation yields Eq. (B.1).
y∗ = F y∗.
    ‖yk+1 − yk‖ ≤ ρ^k ‖y1 − y0‖,    k = 1, 2, . . . .
To show the convergence rate bound of the last part, note that
    ‖F^k y − y*‖ = ‖F^k y − F y*‖ ≤ ρ ‖F^{k−1} y − y*‖.
Repeating this process for a total of k times, we obtain the desired result.
Q.E.D.
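A small numerical illustration: for an affine mapping F y = b + Ay that is a sup-norm contraction (here ρ is taken to be the maximum absolute row sum of A, and the data are hypothetical), the iterates F^k y converge to the unique fixed point y* at the geometric rate ρ^k.

import numpy as np

A = np.array([[0.3, -0.2], [0.1, 0.4]])        # max absolute row sum = 0.5 < 1
b = np.array([1.0, -2.0])
rho = np.abs(A).sum(axis=1).max()

F = lambda y: b + A @ y
y_star = np.linalg.solve(np.eye(2) - A, b)      # the unique fixed point y* = F y*

y = np.array([10.0, -10.0])
err0 = np.abs(y - y_star).max()
for k in range(1, 11):
    y = F(y)
    # geometric convergence: ||F^k y - y*|| <= rho^k ||y - y*||
    print(k, np.abs(y - y_star).max(), rho**k * err0)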
y∗ = F y∗.
    lim_{k→∞} ‖F^{mk+ℓ} y − y*‖ = 0,    ℓ = 0, 1, . . . , m − 1,
v(x) > 0, ∀ x ∈ X.
Let B(X) denote the set of all functions J : X → ℜ such that J(x)/v(x)
is bounded as x ranges over X. We define a norm on B(X), called the
weighted sup-norm, by

    ‖J‖ = sup_{x∈X} |J(x)| / v(x).    (B.3)
It is easily verified that ‖·‖ thus defined has the required properties for
being a norm. Furthermore, B(X) is complete under this norm. To see this,
consider a Cauchy sequence {Jk} ⊂ B(X), and note that ‖Jm − Jn‖ → 0
as m, n → ∞ implies that for all x ∈ X, {Jk(x)} is a Cauchy sequence of
real numbers, so it converges to some J*(x). We will show that J* ∈ B(X)
and that ‖Jk − J*‖ → 0. To this end, it will be sufficient to show that
given any ε > 0, there exists a K such that

    |Jk(x) − J*(x)| / v(x) ≤ ε,    ∀ x ∈ X, k ≥ K.

This will imply that

    sup_{x∈X} |J*(x)| / v(x) ≤ ε + ‖Jk‖,    ∀ k ≥ K,

so that J* ∈ B(X), and will also imply that ‖Jk − J*‖ ≤ ε, so that
‖Jk − J*‖ → 0. Assume the contrary, i.e., that there exists an ε > 0 and a
subsequence {xm1, xm2, . . .} ⊂ X such that mi < mi+1 and

    ε < |Jmi(xmi) − J*(xmi)| / v(xmi),    ∀ i ≥ 1.

The right-hand side above is less or equal to

    |Jmi(xmi) − Jn(xmi)| / v(xmi) + |Jn(xmi) − J*(xmi)| / v(xmi),    ∀ n ≥ 1, i ≥ 1.
The first term in the above sum is less than ε/2 for i and n larger than some
threshold; fixing i and letting n be sufficiently large, the second term can
also be made less than ε/2, so the sum is made less than ε, a contradiction.
In conclusion, the space B(X) is complete, so the fixed point results of
Props. B.1 and B.2 apply.
In our discussions, we will always assume that B(X) is equipped
with the weighted sup-norm above, where the weight function v will be
clear from the context. There will be frequent occasions where the norm
will be unweighted, i.e., v(x) ≡ 1 and ‖J‖ = sup_{x∈X} |J(x)|, in which case
we will explicitly state so.
Finite-Dimensional Cases
Proof: (a) Assume that Eq. (B.4) holds. For any J, J′ ∈ B(X), we have

    ‖F J − F J′‖ = sup_{i∈X} | Σ_{j∈X} aij (J(j) − J′(j)) | / v(i)

                 ≤ sup_{i∈X} Σ_{j∈X} |aij| v(j) ( |J(j) − J′(j)| / v(j) ) / v(i)

                 ≤ sup_{i∈X} ( Σ_{j∈X} |aij| v(j) / v(i) ) ‖J − J′‖
                 ≤ ρ ‖J − J′‖,

where the last inequality follows from the hypothesis.
Conversely, arguing by contradiction, let's assume that Eq. (B.4) is
violated for some i ∈ X. Define J(j) = v(j) sgn(aij) and J′(j) = 0 for all
j ∈ X. Then we have ‖J − J′‖ = ‖J‖ = 1, and

    |(F J)(i) − (F J′)(i)| / v(i) = Σ_{j∈X} |aij| v(j) / v(i) > ρ = ρ ‖J − J′‖,

showing that F is not a contraction of modulus ρ.
(b) Since Fµ is a contraction of modulus ρ, we have for any J, J′ ∈ B(X),

    (Fµ J)(i) / v(i) ≤ (Fµ J′)(i) / v(i) + ρ ‖J − J′‖,    i ∈ X,

so by taking the infimum over µ ∈ M,

    (F J)(i) / v(i) ≤ (F J′)(i) / v(i) + ρ ‖J − J′‖,    i ∈ X.

Reversing the roles of J and J′, we obtain

    |(F J)(i) − (F J′)(i)| / v(i) ≤ ρ ‖J − J′‖,    i ∈ X,

and by taking the supremum over i, the contraction property of F is proved.
Q.E.D.
with

    Vi(µ) = Σ_{j∈X} |aij(µ)| v(j),    i ∈ X.
By dividing this inequality with v(i) and by taking the supremum over
i ∈ X, we obtain
(b) By doing the same as in (a), but after first taking the infimum of
(Fµ J )(i) over µ, we obtain
    ‖F J‖ ≤ ‖b‖ + ‖J‖ ‖V‖ < ∞.
Q.E.D.
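The following Python check verifies, for a hypothetical matrix and weight function, the condition used in part (a) of the preceding proof, Σ_j |aij| v(j) ≤ ρ v(i) for all i, and confirms numerically that the affine mapping (F J)(i) = b(i) + Σ_j aij J(j) is then a contraction of modulus ρ with respect to the weighted sup-norm of Eq. (B.3).

import numpy as np

a = np.array([[0.2, 0.3], [0.5, 0.1]])
b = np.array([1.0, 2.0])
v = np.array([1.0, 2.0])                       # weight function, v(i) > 0

rho = (np.abs(a) @ v / v).max()                # smallest rho with sum_j |a_ij| v(j) <= rho v(i)
print("modulus rho =", rho)

w_norm = lambda J: np.max(np.abs(J) / v)       # weighted sup-norm, cf. Eq. (B.3)
F = lambda J: b + a @ J

rng = np.random.default_rng(0)
for _ in range(1000):
    J1, J2 = rng.normal(size=2), rng.normal(size=2)
    assert w_norm(F(J1) - F(J2)) <= rho * w_norm(J1 - J2) + 1e-12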
APPENDIX C:
Measure Theoretic Issues
A general theory of stochastic dynamic programming must deal with the
formidable mathematical questions that arise from the presence of uncount-
able probability spaces. The purpose of this appendix is to motivate the
theory and to provide some mathematical background to the extent needed
for the development of Chapter 5. The research monograph by Bertsekas
and Shreve [BeS78] (freely available from the internet), contains a detailed
development of mathematical background and terminology on Borel spaces
and related subjects. We will explore here the main questions by means
of a simple two-stage example described in Section C.1. In Section C.2,
we develop a framework, based on universally measurable policies, for the
rigorous mathematical development of the standard DP results for this
example and for more general finite horizon models.
Suppose that the initial state x0 is a point on the real line !. Knowing
x0 , we must choose a control u0 ∈ !. Then the new state x1 is generated
according to a transition probability measure p(dx1 | x0 , u0 ) on the Borel
σ-algebra of ! (the one generated by the open sets of !). Then, knowing
x1 , we must choose a control u1 ∈ ! and incur a cost g(x1 , u1 ), where g is
a real-valued function that is bounded either above or below. Thus a cost
is incurred only at the second stage.
A policy π = {µ0 , µ1 } is a pair of functions from state to control, i.e.,
if π is employed and x0 is the initial state, then u0 = µ0 (x0 ), and if x1 is
the subsequent state, then u1 = µ1 (x1 ). The expected value of the cost
corresponding to π when x0 is the initial state is given by
    Jπ(x0) = ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0)).    (C.1)
where the infimum is over all policies π = {µ0, µ1} such that µ0 and µ1 are
measurable functions from ℜ to ℜ with respect to σ-algebras to be specified
later. Given ε > 0, a policy π is ε-optimal if

    Jπ(x0) ≤ J*(x0) + ε,    ∀ x0 ∈ ℜ.

A policy π is optimal if

    Jπ(x0) = J*(x0),    ∀ x0 ∈ ℜ.
The DP Algorithm
The DP algorithm for the preceding two-stage problem takes the form
= J0 (x0 ).
We observe that if for each (x0 , u0 ), the measure p(dx1 | x0 , u0 ) has count-
able support , i.e., is concentrated on a countable number of points, then for
a fixed policy π and initial state x0 , the integral defining the cost Jπ (x0 )
of Eq. (C.1) is defined in terms of (possibly infinite) summation. Simi-
larly, the DP algorithm (C.2), (C.3) is defined in terms of summation, and
the same is true for the integrals in Eqs. (C.4a)-(C.4d). Thus, there is no
need to impose measurability restrictions of any kind for the integrals to
make sense, and for the summations/integrations to be well-defined, it is
sufficient that g is bounded either above or below.
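To make the countable support case concrete, the following Python sketch uses finite (hypothetical) data for the two-stage problem: it runs the DP algorithm (C.2)-(C.3) and verifies, by enumerating all nonrandomized policies, that J0 coincides with the optimal cost of Eq. (C.1), in line with the derivation (C.4).

from itertools import product

X0, X1, U = [0, 1], [0, 1, 2], [0, 1]
p = {(x0, u0): ({0: 0.2, 1: 0.5, 2: 0.3} if u0 == 0 else {0: 0.6, 1: 0.1, 2: 0.3})
     for x0 in X0 for u0 in U}                      # p(x1 | x0, u0), hypothetical
g = lambda x1, u1: (x1 - u1) ** 2 + 0.1 * u1        # second-stage cost, hypothetical

# DP algorithm, cf. Eqs. (C.2)-(C.3).
J1 = {x1: min(g(x1, u1) for u1 in U) for x1 in X1}
J0 = {x0: min(sum(p[(x0, u0)][x1] * J1[x1] for x1 in X1) for u0 in U) for x0 in X0}

# Brute-force optimum over all nonrandomized policies pi = (mu0, mu1).
J_star = {x0: float("inf") for x0 in X0}
for mu0 in product(U, repeat=len(X0)):
    for mu1 in product(U, repeat=len(X1)):
        for x0 in X0:
            cost = sum(p[(x0, mu0[x0])][x1] * g(x1, mu1[x1]) for x1 in X1)
            J_star[x0] = min(J_star[x0], cost)

print(J0, J_star)       # the two dictionaries coincide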
It can also be shown that the interchange of infimization and sum-
mation in Eq. (C.4b) is justified in view of the assumption
To see this, for any " > 0, select µ̄1 : % &→ % such that
! "
g x1 , µ̄1 (x1 ) ≤ inf g(x1 , u1 ) + ", ∀ x1 ∈ %. (C.5)
u1
Then

    inf_{µ1} ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0))
        ≤ ∫ g(x1, µ̄1(x1)) p(dx1 | x0, µ0(x0))
        ≤ ∫ inf_{u1} g(x1, u1) p(dx1 | x0, µ0(x0)) + ε.
The reverse inequality also holds, since for all µ1, we can write

    ∫ inf_{u1} g(x1, u1) p(dx1 | x0, µ0(x0)) ≤ ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0)),

and then we can take the infimum over µ1. It follows that the interchange
of infimization and summation in Eq. (C.4b) is justified, with the ε-optimal
selection property of Eq. (C.5) being the key step in the proof.
We have thus shown that when the measure p(dx1 | x0 , u0 ) has count-
able support, g is bounded either above or below, and J0 (x0 ) > −∞ for
all x0 and J1 (x1 ) > −∞ for all x1 , the derivation of Eq. (C.4) is valid and
proves that the DP algorithm produces the optimal cost function J * (cf.
Also R.3 follows easily using the fact that there are no measurability re-
strictions on µ0 and µ1 .
To address the case where p(dx1 | x0 , u0 ) does not have countable support,
two approaches have been used. The first is to expand the notion of inte-
gration, and the second is to place appropriate measurability restrictions
on g, p, and {µ0 , µ1 }. Expanding the notion of integration is possible by
interpreting the integrals appearing in the preceding equations as outer
integrals. Since the outer integral can be defined for any function, mea-
surable or not, there is no need to impose any measurability assumptions,
and the arguments given above go through just as in the countable distur-
bance case. We do not discuss this approach further except to mention that
the book [BeS78] shows that the basic results for finite and infinite hori-
zon problems of perfect state information carry through within an outer
integration framework. However, there are inherent limitations in this ap-
proach centering around the pathologies of outer integration, as discussed
in [BeS78].
The second approach is to impose a suitable measurability structure
that allows the key proof steps of the validity of the DP algorithm. These
are:
(a) Properly interpreting the integrals in the definition (C.2)-(C.3) of the
DP algorithm and the derivation (C.4).
(b) The !-optimal selection property (C.5), which in turn justifies the
interchange of infimization and integration in Eq. (C.4b).
To enable (a), the required properties of the problem structure must include
the preservation of measurability under partial minimization. In particu-
lar, it is necessary that when g is measurable in some sense, the partial
minimum function
    J1(x1) = inf_{u1} g(x1, u1)
is also measurable in the same sense, so that the integration in Eq. (C.3) is
well-defined. It turns out that this is a major difficulty with Borel measur-
ability, which may appear to be a natural framework for formulating the
problem: J1 need not be Borel measurable even when g is Borel measurable.
For this reason it is necessary to pass to a larger class of measurable func-
tions, which is closed under the key operation of partial minimization (and
also under some other common operations, such as addition and functional
composition). †
One such class is lower semianalytic functions and the related class
of universally measurable functions, which will be the focus of the next
section. They are the basis for a problem formulation that enables a DP
theory as powerful as the one for problems where measurability is of no
concern (e.g., those where the state and control spaces are countable).
† It is also possible to use a smaller class of functions that is closed under the
same operations. This has led to the so-called semicontinuous models, where the
state and control spaces are Borel spaces, and g and p have certain semicontinu-
ity and other properties. These models are also analyzed in detail in the book
[BeS78] (Section 8.3). However, they are not as useful and widely applicable as
the universally measurable models we will focus on, because they involve assump-
tions that may be restrictive and/or hard to verify. By contrast, the universally
measurable models are simple and very general. They allow a problem formula-
tion that brings to bear the power of DP analysis under minimal assumptions.
This analysis can in turn be used to prove more specific results based on special
characteristics of the model.
is analytic for every c ∈ ". The following proposition states that lower
analyticity is preserved under partial minimization, a key result for our
purposes. The proof follows from the preservation of analyticity of a subset
of a product space under projection onto one of the component spaces, as
in (i) above (see [BeS78], Prop. 7.47).
is lower semianalytic.
Universal Measurability
see [BeS78], Corollary 7.42.1), and hence the σ-algebra generated by the
analytic sets, called the analytic σ-algebra, and denoted AY , is contained
in UY :
BY ⊂ AY ⊂ UY .
where we adopt the rule ∞ − ∞ = ∞ for the case where ∫ f⁺ dp = ∞ and
∫ f⁻ dp = ∞. With this expanded definition, the integral of an extended real-
valued function is always defined as an extended real number (consistently also
with Appendix A).
is lower semianalytic.
(b) If q is universally measurable and h is universally measurable,
then the function l : Y → [−∞, ∞] given by

    l(y) = ∫_Z h(y, z) q(dz | y)
is universally measurable.
and let

    I = { y ∈ Y | there exists a zy ∈ Z for which h(y, zy) = h*(y) },

i.e., I is the set of points y for which the infimum above is attained. For
any ε > 0, there exists a universally measurable function φ : Y → Z
such that

    h(y, φ(y)) = h*(y),    ∀ y ∈ I,

    h(y, φ(y)) ≤ h*(y) + ε,   ∀ y ∉ I with h*(y) > −∞,
                 −1/ε,        ∀ y ∉ I with h*(y) = −∞.
functions J1 and J0 , and yields the optimal cost function (as in R.1), and
furthermore there exist !-optimal and possibly exactly optimal policies (as
in R.2 and R.3), provided that:
(a) The stage cost function g is lower semianalytic; this is needed to show
that the function J1 of the DP Eq. (C.2) is lower semianalytic and
hence also universally measurable (cf. Prop. C.1). The more “nat-
ural” Borel measurability assumption on g implies lower analyticity
of g, but will not keep the functions J1 and J0 produced by the DP
algorithm within the domain of Borel measurability. This is because
the partial minimum operation on Borel measurable functions takes
us outside that domain (cf. Prop. C.1).
(b) The stochastic kernel p is Borel measurable. This is needed in order
for the integral in the DP Eq. (C.3) to be defined as a lower semi-
analytic function of (x0 , u0 ) (cf. Prop. C.4). In turn, this is used to
show that the function J0 of the DP Eq. (C.3) is lower semianalytic
(cf. Prop. C.1).
(c) The control functions µ0 and µ1 are allowed to be universally mea-
surable, and we have J0 (x0 ) > −∞ for all x0 and J1 (x1 ) > −∞ for
all x1 . This is needed in order for the calculation of Eq. (C.4) to go
through (using the measurable selection property of Prop. C.5), and
show that the DP algorithm produces the optimal cost function (cf.
R.1). It is also needed (using again Prop. C.5) in order to show the
associated existence of solutions results (cf. R.2 and R.3).
Let us now extend our analysis to an N -stage model with state xk and
control uk that take values in Borel spaces X and U , respectively. We
assume stochastic/transition kernels pk (dxk+1 | xk , uk ), which are Borel
measurable, and stage cost functions gk : X × U $→ (−∞, ∞], which are
lower semianalytic and bounded either above or below. † Furthermore, we
allow policies π = {µ0 , . . . , µN −1 } that are randomized: each component
µk is a universally measurable stochastic kernel µk (duk | xk ) from X to U .
If for every xk and k, µk (duk | xk ) assigns probability 1 to a single control
uk , π is said to be nonrandomized .
Each policy π and initial state x0 define a unique probability measure
with respect to which gk (xk , uk ) can be integrated to produce the expected
value of gk . The sum of these expected values for k = 0, . . . , N − 1, is the
cost Jπ (x0 ). It is convenient to write this cost in terms of the following
† Note that since gk may take the value ∞, constraints of the form uk ∈
Uk(xk) may be implicitly introduced by letting gk(xk, uk) = ∞ when uk ∉ Uk(xk).
! " ! #
Jπ,k (xk ) = gk (xk , uk ) + Jπ,k+1 (xk+1 ) pk (dxk+1 | xk , uk )
µk (duk | xk ), k = 0, . . . , N − 2.
The function obtained at the last step is the cost of π starting at x0 :
    Jk(xk) = inf_{uk∈U} { gk(xk, uk) + ∫ Jk+1(xk+1) pk(dxk+1 | xk, uk) },    ∀ xk, k.
To show equality within ε ≥ 0 in the above relation, we may use the
measurable selection theorem (Prop. C.5), assuming that

    gk(xk, µk(xk)) + ∫ Jk+1(xk+1) pk(dxk+1 | xk, µk(xk)) ≤ Jk(xk) + ε/N.    (C.8)
Then, we can show by induction that

    Jk(xk) ≤ Jπ,k(xk) ≤ Jk(xk) + (N − k)ε/N,    ∀ xk, k = 0, . . . , N − 1,
Thus, the DP algorithm produces the optimal cost function, and via the
approximate minimization of Eq. (C.8), an ε-optimal policy. Similarly,
if the infimum is attained for all xk and k in the DP algorithm, then
there exists an optimal policy. Note that both the ε-optimal and the exact
optimal policies can be taken to be nonrandomized.
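The following Python sketch implements the backward DP recursion displayed above for finite state and control spaces and hypothetical data, and extracts a nonrandomized optimal policy by attaining the minimum at every (xk, k); with finite spaces the measurability issues of this appendix do not arise.

import numpy as np

N, nx, nu = 3, 2, 2
rng = np.random.default_rng(1)
p = rng.random((N, nx, nu, nx)); p /= p.sum(axis=3, keepdims=True)   # p_k(x' | x, u)
g = rng.random((N, nx, nu))                                          # g_k(x, u)

J = np.zeros(nx)                       # terminal cost J_N = 0
mu = np.zeros((N, nx), dtype=int)
for k in range(N - 1, -1, -1):
    Q = g[k] + p[k] @ J                # Q[x, u] = g_k(x, u) + sum_x' p_k(x'|x, u) J_{k+1}(x')
    mu[k] = Q.argmin(axis=1)           # the infimum is attained: an optimal selector exists
    J = Q.min(axis=1)                  # J_k(x)

print(J)      # J_0: optimal N-stage cost for each initial state
print(mu)     # mu[k][x]: optimal nonrandomized policy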
The assumptions of Borel measurability of the stochastic kernels,
lower semianalyticity of the costs per stage, and universally measurable
policies, are the basis for the framework adopted by Bertsekas and Shreve
[BeS78], which provides a comprehensive analysis of finite and infinite hori-
zon total cost problems. There is also additional analysis in [BeS78] on
problems of imperfect state information, as well as various refinements
of the measurability framework just described. Among others, these re-
finements involve analytically measurable policies, and limit measurable
policies (measurable with respect to the so-called limit σ-algebra, the
smallest σ-algebra that has the properties necessary for a DP theory that
is comparably powerful to the one for the universal σ-algebra).
APPENDIX D:
Solutions of Exercises
CHAPTER 1
By the contraction property of Tµ0 , . . . , Tµm−1 , we have for all J, J′ ∈ B(X),
\[
\frac{\bigl|(T_\mu^{(w)} J)(x) - (T_\mu^{(w)} J')(x)\bigr|}{v(x)}
= \frac{\bigl|\sum_{\ell=1}^{\infty} w_\ell(x)\,(T_\mu^\ell J)(x) - \sum_{\ell=1}^{\infty} w_\ell(x)\,(T_\mu^\ell J')(x)\bigr|}{v(x)}
\le \sum_{\ell=1}^{\infty} w_\ell(x)\,\|T_\mu^\ell J - T_\mu^\ell J'\|
\le \Bigl(\sum_{\ell=1}^{\infty} w_\ell(x)\,\alpha^\ell\Bigr)\|J - J'\|,
\]
showing the contraction property of T_µ^{(w)}.
Let Jµ be the fixed point of Tµ. We have for all x ∈ X, by using the relation
(T_µ^ℓ Jµ)(x) = Jµ(x),
\[
(T_\mu^{(w)} J_\mu)(x) = \sum_{\ell=1}^{\infty} w_\ell(x)\,(T_\mu^\ell J_\mu)(x) = \Bigl(\sum_{\ell=1}^{\infty} w_\ell(x)\Bigr) J_\mu(x) = J_\mu(x),
\]
so Jµ is the fixed point of T_µ^{(w)} [which is unique since T_µ^{(w)} is a contraction].
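As a numerical sanity check of these two conclusions, one can take a hypothetical two-state discounted model T_µJ = g + αPJ (all data below are made up), geometric weights w_ℓ = (1 − λ)λ^{ℓ−1} independent of x, the unweighted sup-norm (v ≡ 1), and truncate the infinite sum:

    import numpy as np

    # Hypothetical two-state discounted model: T_mu J = g + alpha * P @ J.
    alpha, lam = 0.9, 0.5
    P = np.array([[0.7, 0.3], [0.4, 0.6]])
    g = np.array([1.0, -2.0])
    T = lambda J: g + alpha * P @ J

    def T_pow(J, l):
        """Apply T_mu l times, i.e., T_mu^l J."""
        for _ in range(l):
            J = T(J)
        return J

    def T_w(J, L=200):
        """Truncated weighted mapping: sum_l w_l * T_mu^l J with w_l = (1-lam)*lam**(l-1)."""
        return sum((1 - lam) * lam ** (l - 1) * T_pow(J, l) for l in range(1, L + 1))

    # Fixed point of T_mu (solve J = g + alpha P J); it is also (approximately) fixed for T_mu^(w).
    J_mu = np.linalg.solve(np.eye(2) - alpha * P, g)
    print(np.max(np.abs(T_w(J_mu) - J_mu)))

    # Empirical contraction factor of T_mu^(w), compared with sum_l w_l alpha^l < alpha.
    J1, J2 = np.array([5.0, -3.0]), np.array([0.0, 8.0])
    rho = np.max(np.abs(T_w(J1) - T_w(J2))) / np.max(np.abs(J1 - J2))
    bound = sum((1 - lam) * lam ** (l - 1) * alpha ** l for l in range(1, 201))
    print(rho, bound)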
CHAPTER 2
\[
J_0 = \lim_{k\to\infty} T_\nu^k \bar J, \qquad J_1 = \lim_{k\to\infty} T_\nu^k (T_{\mu_0} \bar J), \qquad \ldots, \qquad J_{m-1} = \lim_{k\to\infty} T_\nu^k (T_{\mu_0} \cdots T_{\mu_{m-2}} \bar J).
\]
Since Tν is a contraction mapping, J0, . . . , Jm−1 are all equal to the unique fixed
point of Tν. Since J0, . . . , Jm−1 are all equal, they are also equal to Jπ (by the
definition of Jπ). Thus Jπ is the unique fixed point of Tν.
(b) Follow the hint.
We have
\[
\frac{\bigl|(T_\mu J)(x)\bigr|}{v(x)} \le \frac{G_x}{v(x)} + \alpha \sum_{y \in X} \frac{p_{xy}\bigl(\mu(x)\bigr)\, v(y)}{v(x)}\,\frac{|J(y)|}{v(y)}, \qquad \forall\, x \in X,\ \mu \in M,
\]
CHAPTER 3
\[
(T J)(1) = \inf_{0 < u \le 1} \bigl[ -u + (1 - u^2) J(1) \bigr].
\]
However, it can be seen that this equation has no solution. Here parts (b) and
(d) of Assumption 3.2.1 are violated.
(b) Here Tµ is again a sup-norm contraction, with modulus 1 − µ(1). For Jµ, the
unique fixed point of Tµ, we have
\[
J_\mu(1) = (T_\mu J_\mu)(1) = -\bigl(1 - \mu(1)\bigr)\mu(1) + \bigl(1 - \mu(1)\bigr) J_\mu(1),
\]
which yields Jµ(1) = −1 + µ(1). Hence J* = −1, but there is no optimal µ. The
mapping T is given by
\[
(T J)(1) = \inf_{0 < u \le 1} \bigl[ -u + u^2 + (1 - u) J(1) \bigr].
\]
It can be verified that the set of fixed points of T within ℜ is {J | J ≤ −1}. Here
part (d) of Assumption 3.2.1 is violated.
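Where the fixed-point set {J | J ≤ −1} comes from can also be checked numerically: minimizing −u + u² + (1 − u)J over u ∈ (0, 1] by a fine grid search (an illustrative computation only, with an arbitrary grid size), one finds (TJ)(1) = J(1) exactly when J(1) ≤ −1.

    import numpy as np

    def T(J, n_grid=200000):
        """(TJ)(1) = inf over u in (0,1] of -u + u**2 + (1-u)*J, by fine grid search."""
        u = np.linspace(1e-6, 1.0, n_grid)          # (0,1] approximated by a fine grid
        return np.min(-u + u**2 + (1 - u) * J)

    for J in [-3.0, -1.5, -1.0, -0.5, 0.0, 2.0]:
        TJ = T(J)
        print(f"J = {J:5.2f}   TJ = {TJ:8.5f}   fixed point: {abs(TJ - J) < 1e-3}")
    # For J <= -1 the infimum is approached as u -> 0 and equals J (a fixed point);
    # for J > -1 the infimum is strictly below J, so J is not a fixed point.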
(c) For the policy µ that chooses µ(1) = 0, we have
\[
(T_\mu J)(1) = c + J(1),
\]
and µ is ℜ-irregular since lim_{k→∞} T_µ^k J either does not belong to ℜ or depends
on J. Moreover, the mapping T is given by
\[
(T J)(1) = \min\Bigl\{ c + J(1),\ \inf_{0 < u \le 1} \bigl[ -u + u^2 + (1 - u) J(1) \bigr] \Bigr\}.
\]
Let the assumptions of Prop. 3.1.1 hold, and let µ∗ be the S-regular policy that is
optimal. Then condition (1) implies that J ∗ = Jµ∗ ∈ S and J ∗ = Tµ∗ J ∗ ≥ T J ∗ ,
while condition (2) implies that there exists an S-regular policy µ such that
Tµ J ∗ = T J ∗ .
Conversely, assume that J ∗ ∈ S, T J ∗ ≤ J ∗ , and there exists an S-regular
policy µ such that Tµ J ∗ = T J ∗ . Then we have Tµ J ∗ = T J ∗ ≤ J ∗ . Hence
Tµk J ∗ ≤ J ∗ for all k, and by taking the limit as k → ∞, we obtain Jµ ≤ J ∗ .
Hence the S-regular policy µ is optimal, and both conditions of Prop. 3.1.1 hold.
3.3
\[
H(x, u, J) = \begin{cases} b & \text{if } x = 1,\ u = 0, \\ a + J(2) & \text{if } x = 1,\ u = 2, \\ a + J(1) & \text{if } x = 2,\ u = 1. \end{cases}
\]
while Jµ is the unique fixed point of T within S. In cases (1) and (2), Prop.
3.1.1 applies, because the S-regular policy µ is optimal. In case (3), Prop. 3.1.1
does not apply because the S-regular policy µ is not optimal. Finally, in case (4),
contrary to the case S = ℜ², Prop. 3.1.1 applies, because the policy µ is optimal
and also S-regular. Case (3) cannot be analyzed with the aid of Props. 3.1.1,
3.1.2, or 3.2.1.
Jµ (1) = Jµ (2) = r.
We will show that conditions (1) and (2) imply that J ∗ = T J ∗ , and the result
will follow from Prop. 3.1.2. Assume, to obtain a contradiction, that J* ≠ T J*.
Then J* ≥ T J*, as can be seen from the relations
with strict inequality for some x [note here that we can choose µ(x) = µ∗ (x) for
all x such that J ∗ (x) = (T J ∗ )(x), and we can choose µ(x) to satisfy J ∗ (x) >
(Tµ J ∗ )(x) for all other x]. If µ were S-regular, we would have
\[
J^* \ge T_\mu J^* \ge \lim_{k\to\infty} T_\mu^k J^* = J_\mu,
\]
We have
\[
J_{\mu^k} \ge T J_{\mu^k} \ge J_{\mu^{k+1}}, \qquad k = 0, 1, \ldots. \tag{3.1}
\]
Denote
\[
J_\infty = \lim_{k\to\infty} T J_{\mu^k} = \lim_{k\to\infty} J_{\mu^k}.
\]
Since for all k we have J_{µ^k} ≥ Ĵ ∈ S, where Ĵ is the optimal cost function
over S-regular policies [cf. Assumption 3.2.1(b)], it follows that J∞ ≥ Ĵ, and by
Assumption 3.2.1(a), we obtain J∞ ∈ S. By taking the limit in Eq. (3.1), we
have
\[
J_\infty = \lim_{k\to\infty} T J_{\mu^k} \ge T J_\infty, \tag{3.2}
\]
where the inequality follows from the fact that J_{µ^k} ↓ J∞. Using also the given
assumption, we have for all x ∈ X and u ∈ U(x),
CHAPTER 4
Since a cost is incurred only upon stopping, and the stopping cost is greater than
−1, we have Jµ(x) > −1 for all x and µ. On the other hand, starting from any
state x and stopping at x + n yields a cost −1 + 1/(x + n), so by taking n sufficiently
large, we can attain a cost arbitrarily close to −1. Thus J*(x) = −1 for all x, but
no policy can attain this optimal cost.
Note also that J* satisfies Bellman's equation,
\[
J^*(x) = \min\Bigl\{ J^*(x+1),\ -1 + \frac{1}{x} \Bigr\}
\]
for all x.
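A small script can confirm both statements for this stopping problem (assuming, as in the exercise, that the states are the positive integers): the costs of the "continue n more times, then stop" policies decrease toward −1 without reaching it, and J*(x) ≡ −1 satisfies the Bellman equation above.

    from fractions import Fraction

    def cost_stop_after(x, n):
        # Cost of the policy that, starting at state x, continues n times and then stops.
        return Fraction(-1) + Fraction(1, x + n)

    x = 3
    costs = [cost_stop_after(x, n) for n in range(0, 6)] + [cost_stop_after(x, 10**6)]
    print([float(c) for c in costs])          # decreasing toward -1, always > -1
    assert all(c > -1 for c in costs)

    # Bellman equation check for J*(x) = -1:  J*(x) = min{ J*(x+1), -1 + 1/x }.
    J_star = lambda x: -1.0
    for x in range(1, 50):
        assert J_star(x) == min(J_star(x + 1), -1 + 1.0 / x)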
\[
J_\mu(1) = -\frac{1}{\mu(1)},
\]
so the policy evaluation equation given in part (a) is correct. Moreover, we have
Jµ(1) ≤ −1 since µ(1) ∈ (0, 1]. The policy improvement equation is
\[
0 = -1 - 2u J_{\mu^k}(1),
\]
with solution u^k = −1/(2J_{µ^k}(1)), which is less than or equal to 1/2 since Jµ(1) ≤ −1
for all µ. Hence u^k is equal to the constrained minimum in Eq. (4.1), and we have
\[
\mu^{k+1}(1) = -\frac{1}{2 J_{\mu^k}(1)}.
\]
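Combining the policy evaluation formula J_µ(1) = −1/µ(1) of part (a) with the policy improvement formula above gives µ^{k+1}(1) = µ^k(1)/2, so the generated costs diverge to −∞; the following few lines (purely illustrative, with an arbitrary starting policy) tabulate the iteration.

    mu = 1.0                       # mu^0(1); any starting value in (0, 1]
    for k in range(10):
        J = -1.0 / mu              # policy evaluation: J_{mu^k}(1) = -1/mu^k(1)
        mu_next = -1.0 / (2 * J)   # policy improvement: mu^{k+1}(1) = -1/(2 J_{mu^k}(1))
        print(f"k={k}  mu={mu:.6f}  J={J:.1f}")
        assert abs(mu_next - mu / 2) < 1e-12   # i.e., mu^{k+1}(1) = mu^k(1)/2
        mu = mu_next
    # mu^k(1) = mu^0(1)/2^k -> 0, while J_{mu^k}(1) = -2^k/mu^0(1) -> -infinity.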
(a) The policy µ that stops at every state has cost function
\[
J_\mu(x) = -1 + \frac{1}{x}, \qquad x \in X.
\]
Policy improvement starting with µ yields a policy µ̄ with
\[
\bar\mu(x) \in \arg\min\Bigl\{ J_\mu(x+1),\ -1 + \frac{1}{x} \Bigr\},
\]
Let Ê(X) be the subset of E(X) that consists of functions that take only the two
values 0 and ∞, and for all J ∈ Ê(X) denote
\[
D(J) = \bigl\{ x \in X \mid J(x) = 0 \bigr\}.
\]
Note that for all J ∈ Ê(X) we have Tµ J ∈ Ê(X), T J ∈ Ê(X), and that
\[
D(T_\mu J) = \bigl\{ x \in X \mid x \in D(J),\ f\bigl(x, \mu(x), w\bigr) \in D(J),\ \forall\, w \in W\bigl(x, \mu(x)\bigr) \bigr\},
\]
(a) For all J ∈ Ê(X), we have D(Tµ J) ⊂ D(J) and Tµ J ≥ J, so condition (1) of
Assumption I holds, and it is easily verified that the remaining two conditions of
Assumption I also hold. We have J¯ ∈ Ê(X), so for any policy π = {µ0 , µ1 , . . .},
we have Tµ0 · · · Tµk J̄ ∈ Ê(X). It follows that Jπ, given by
\[
J_\pi = \lim_{k\to\infty} T_{\mu_0} \cdots T_{\mu_k} \bar J,
\]
also belongs to Ê(X), and the same is true for J* = inf_{π∈Π} Jπ. Thus J* has the
given form with D(J*) = X*.
(b) Since {T^k J̄} is monotonically nondecreasing, we have D(T^{k+1} J̄) ⊂ D(T^k J̄),
or equivalently X_{k+1} ⊂ X_k for all k. Generally, for a sequence {J_k} ⊂ Ê(X), if
J_k ↑ J, we have J ∈ Ê(X) and D(J) = ∩_{k=0}^∞ D(J_k). Thus convergence of VI (i.e.,
T^k J̄ ↑ J*) is equivalent to D(J*) = ∩_{k=0}^∞ D(J_k), or X* = ∩_{k=0}^∞ X_k.
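For a small finite problem the sets X_k = D(T^k J̄) can be generated directly from the recursion for D(T_µJ) displayed earlier; the sketch below uses a made-up one-dimensional system x_{k+1} = x_k + u_k + w_k, a made-up constraint tube, and a state-dependent disturbance set, purely to illustrate how the X_k shrink to X* = ∩_k X_k.

    # Hypothetical finite reachability example (all data made up for illustration):
    # states 0..11, controls u in {-1,0,1}, dynamics f(x,u,w) = x + u + w, and a
    # disturbance set W(x,u) that is larger at the "fast" states x > 5.
    X = set(range(12))
    U = {-1, 0, 1}
    Xbar = set(range(2, 10))                    # X_0 = D(Jbar) = Xbar
    f = lambda x, u, w: x + u + w
    W = lambda x, u: {0, 1} if x <= 5 else {0, 1, 2}

    def next_set(Xk):
        """X_{k+1} = {x in X_k : some u keeps f(x,u,w) in X_k for all w in W(x,u)}."""
        return {x for x in Xk
                if any(all(f(x, u, w) in Xk for w in W(x, u)) for u in U)}

    history = [Xbar]
    while True:
        Xk1 = next_set(history[-1])
        if Xk1 == history[-1]:                  # X_{k+1} = X_k, so X* has been reached
            break
        history.append(Xk1)

    X_star = history[-1]
    print([sorted(s) for s in history], sorted(X_star))
    assert X_star == set.intersection(*history)   # here VI converges: X* = intersection of the X_k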
are compact for every x ∈ X, λ ∈ ℜ, and for all k greater than some integer k̄.
It can be seen that Uk(x, λ) is equal to the set
\[
\hat U_k(x) = \bigl\{ u \in U(x) \mid f(x, u, w) \in X_k,\ \forall\, w \in W(x, u) \bigr\}
\]
References
[ABB02] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2002. “Stochastic
Approximation for Non-Expansive Maps: Q-Learning Algorithms,” SIAM J. on
Control and Opt., Vol. 41, pp. 1-22.
[BBB08] Basu, A., Bhattacharyya, T., and Borkar, V. S., 2008. “A Learning
Algorithm for Risk-Sensitive Cost,” Math. of OR, Vol. 33, pp. 880-898.
[BBD10] Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D., 2010. Rein-
forcement Learning and Dynamic Programming Using Function Approximators,
CRC Press, N. Y.
[Bau78] Baudet, G. M., 1978. “Asynchronous Iterative Methods for Multiproces-
sors,” Journal of the ACM, Vol. 25, pp. 226-244.
[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. “Temporal Differences-Based Policy
Iteration and Applications in Neuro-Dynamic Programming,” Lab. for Info. and
Decision Systems Report LIDS-P-2349, MIT.
[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control:
The Discrete Time Case, Academic Press, N. Y.; may be downloaded from
http://web.mit.edu/dimitrib/www/home.html
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Distributed
Computation: Numerical Methods, Prentice-Hall, Engl. Cliffs, N. J.; may be
downloaded from http://web.mit.edu/dimitrib/www/home.html
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic
Shortest Path Problems,” Math. of OR, Vol. 16, pp. 580-595.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Program-
ming, Athena Scientific, Belmont, MA.
[BeT08] Bertsekas, D. P., and Tsitsiklis, J. N., 2008. Introduction to Probability,
2nd Ed., Athena Scientific, Belmont, MA.
[BeY07] Bertsekas, D. P., and Yu, H., 2007. “Solution of Large Systems of Equa-
tions Using Approximate Dynamic Programming Methods,” Lab. for Info. and
Decision Systems Report LIDS-P-2754, MIT.
[BeY09] Bertsekas, D. P., and Yu, H., 2009. “Projected Equation Methods for Ap-
proximate Solution of Large Linear Systems,” J. of Computational and Applied
Mathematics, Vol. 227, pp. 27-50.
[BeY10a] Bertsekas, D. P., and Yu, H., 2010. “Q-Learning and Enhanced Policy
Iteration in Discounted Dynamic Programming,” Lab. for Info. and Decision
Systems Report LIDS-P-2831, MIT; Math. of OR, Vol. 37, 2012, pp. 66-94.
[BeY10b] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy
Iteration in Dynamic Programming,” Proc. of Allerton Conf. on Communication,
Control and Computing, Allerton Park, Ill, pp. 1368-1374.
[Ber72] Bertsekas, D. P., 1972. “Infinite Time Reachability of State Space Regions
by Using Feedback Control,” IEEE Trans. Aut. Control, Vol. AC-17, pp. 604-613.
[Ber75] Bertsekas, D. P., 1975. “Monotone Mappings in Dynamic Programming,”
1975 IEEE Conference on Decision and Control, pp. 20-25.
[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dy-
namic Programming,” SIAM J. on Control and Opt., Vol. 15, pp. 438-464.
[Ber82] Bertsekas, D. P., 1982. “Distributed Dynamic Programming,” IEEE Trans.
Aut. Control, Vol. AC-27, pp. 610-616.
[Ber83] Bertsekas, D. P., 1983. “Asynchronous Distributed Computation of Fixed
Points,” Math. Programming, Vol. 27, pp. 107-120.
[Ber87] Bertsekas, D. P., 1987. Dynamic Programming: Deterministic and Stochas-
tic Models, Prentice-Hall, Englewood Cliffs, N. J.
[Ber05a] Bertsekas, D. P., 2005. Dynamic Programming and Optimal Control,
Vol. I, 3rd Edition, Athena Scientific, Belmont, MA.
[Ber05b] Bertsekas, D. P., 2005. “Dynamic Programming and Suboptimal Con-
trol: A Survey from ADP to MPC,” Fundamental Issues in Control, Special Issue
for the CDC-ECC 05, European J. of Control, Vol. 11, Nos. 4-5.
[Ber09] Bertsekas, D. P., 2009. Convex Optimization Theory, Athena Scientific,
Belmont, MA.
[Ber10] Bertsekas, D. P., 2010. “Williams-Baird Counterexample for Q-Factor
Asynchronous Policy Iteration,”
http://web.mit.edu/dimitrib/www/Williams-Baird Counterexample.pdf
[Ber11a] Bertsekas, D. P., 2011. “Temporal Difference Methods for General Pro-
jected Equations,” IEEE Trans. Aut. Control, Vol. 56, pp. 2128-2139.
[Ber11b] Bertsekas, D. P., 2011. “λ-Policy Iteration: A Review and a New Im-
plementation,” Lab. for Info. and Decision Systems Report LIDS-P-2874, MIT;
appears in Reinforcement Learning and Approximate Dynamic Programming for
Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press, 2012.
[Ber11c] Bertsekas, D. P., 2011. “Approximate Policy Iteration: A Survey and
Some New Methods,” J. of Control Theory and Applications, Vol. 9, pp. 310-335.
[Ber12a] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Control,
Vol. II, 4th Edition: Approximate Dynamic Programming, Athena Scientific,
Belmont, MA.
[Ber12b] Bertsekas, D. P., 2012. “Weighted Sup-Norm Contractions in Dynamic
Programming: A Review and Some New Applications,” Lab. for Info. and Deci-
sion Systems Report LIDS-P-2884, MIT.
[Bla65] Blackwell, D., 1965. “Positive Dynamic Programming,” Proc. Fifth Berke-
ley Symposium Math. Statistics and Probability, pp. 415-418.
[BoM99] Borkar, V. S., Meyn, S. P., 1999. “Risk Sensitive Optimal Control:
Existence and Synthesis for Models with Unbounded Cost,” SIAM J. Control
and Opt., Vol. 27, pp. 192-209.
[BoM00] Borkar, V. S., and Meyn, S. P., 2000. “The O.D.E. Method for Convergence
of Stochastic Approximation and Reinforcement Learning,” SIAM J. on Control
and Opt., Vol. 38, pp. 447-469.
[Pat01] Patek, S. D., 2001. “On Terminating Markov Decision Processes with a
Risk Averse Objective Function,” Automatica, Vol. 37, pp. 1379-1386.
[Pat07] Patek, S. D., 2007. “Partially Observed Stochastic Shortest Path Prob-
lems with Approximate Solution by Neuro-Dynamic Programming,” IEEE Trans.
on Systems, Man, and Cybernetics Part A, Vol. 37, pp. 710-720.
[Pli78] Pliska, S. R., 1978. “On the Transient Case for Markov Decision Chains
with General State Spaces,” in Dynamic Programming and its Applications, by
M. L. Puterman (ed.), Academic Press, N. Y.
[Pow07] Powell, W. B., 2007. Approximate Dynamic Programming: Solving the
Curses of Dimensionality, J. Wiley and Sons, Hoboken, N. J; 2nd ed., 2011.
[Put94] Puterman, M. L., 1994. Markov Decision Processes: Discrete Stochastic
Dynamic Programming, J. Wiley, N. Y.
[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Univ. Press, Prince-
ton, N. J.
[Ros67] Rosenfeld, J., 1967. “A Case Study on Programming for Parallel Proces-
sors,” Research Report RC-1864, IBM Res. Center, Yorktown Heights, N. Y.
[Rot79] Rothblum, U. G., 1979. “Iterated Successive Approximation for Sequen-
tial Decision Processes,” in Stochastic Control and Optimization, by J. W. B.
van Overhagen and H. C. Tijms (eds), Vrije University, Amsterdam.
[Rot84] Rothblum, U. G., 1984. “Multiplicative Markov Decision Chains,” Math.
of OR, Vol. 9, pp. 6-24.
[Rus10] Ruszczynski, A., 2010. “Risk-Averse Dynamic Programming for Markov
Decision Processes,” Math. Programming, Ser. B, Vol. 125, pp. 235-261.
[ScL12] Scherrer, B., and Lesner, B., 2012. “On the Use of Non-Stationary Policies
for Stationary Infinite-Horizon Markov Decision Processes,” NIPS 2012 - Neural
Information Processing Systems, South Lake Tahoe, Ne.
[Sch75] Schal, M., 1975. “Conditions for Optimality in Dynamic Programming
and for the Limit of n-Stage Optimal Policies to be Optimal,” Z. Wahrschein-
lichkeitstheorie und Verw. Gebiete, Vol. 32, pp. 179-196.
[Sch11] Scherrer, B., 2011. “Performance Bounds for Lambda Policy Iteration
and Application to the Game of Tetris,” Report RR-6348, INRIA, France; to
appear in J. of Machine Learning Research.
[Sch12] Scherrer, B., 2012. “On the Use of Non-Stationary Policies for Infinite-
Horizon Discounted Markov Decision Processes,” INRIA Lorraine Report, France.
[Sha53] Shapley, L. S., 1953. “Stochastic Games,” Proc. Nat. Acad. Sci. U.S.A.,
Vol. 39.
[Str66] Strauch, R., 1966. “Negative Dynamic Programming,” Ann. Math. Statist.,
Vol. 37, pp. 871-890.
[Str75] Striebel, C., 1975. Optimal Control of Discrete Time Stochastic Systems,
Springer-Verlag, Berlin and New York.
[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning, MIT
Press, Cambridge, MA.
[Sze98a] Szepesvari, C., 1998. Static and Dynamic Aspects of Optimal Sequential
Decision Making, Ph.D. Thesis, Bolyai Institute of Mathematics, Hungary.
[Sze98b] Szepesvari, C., 1998. “Non-Markovian Policies in Sequential Decision
Problems,” Acta Cybernetica, Vol. 13, pp. 305-318.